# Binary search on PRE-SORTED text waves

These 2 functions do a binary search on an ALREADY ALPHABETICALLY SORTED text wave, looking for the point in the wave that matches the input text. Either the first matching point in the wave, or the last matching point is searched for.

1. I'd appreciate feedback on correctness/efficiency.

2. If there is an Igor function that will do this already, I don't know about it. Maybe it could be a wish list item?

3. FindLastTextOccurrence and FindFirstTextOccurrence could be combined, with another input parameter to select for first or last occurrence.

4. As it stands, the match is case-insensitive. Another input parameter for case sensitivity of the search could be added to the function and passed to cmpStr; the user would have to ensure that the text wave was sorted the same way, case-insensitive or not.

//******************************************************************************************************
// Binary Searches for sorted text waves. Not exactly math utilities, but sorting is mathematical
// THESE ASSUME THAT THE TEXT WAVE IS ALREADY ALPHABETICALLY SORTED

// A function to calculate base 2 logarithms
STATIC CONSTANT log2 =  0.301029995664
Function logBase2 (theValue)
variable theValue

return log (theValue)/log2
end

//******************************************************************************************************
// finds the first occurrence of a text within a range of points in a SORTED text wave
// returns the point number of the last occurrence, or -1 times the position where thetext should be inserted  if the text was not found
function FindFirstTextOccurrence (theWave, thetext, startPos, endPos)
wave/T theWave // alpabetically sorted text wave.
string thetext // the text to find in this wave, matching the entire contents of the wave at that point
variable startPos, endPos // range over which to search

// limit start and end positions to possible values
startPos = max (0, startPos)
endPos = min (numPnts (theWave)-1, endPos)
variable iPos // the point to be compared
variable  theCmp // the result of the comparison
variable firstPt =startPos , lastPt = endPos // variables the define the range oer which comparisons will be made
variable iComp, maxComp = ceil (logBase2 (endPos - StartPos + 1)) // sanity check to limit number of comparisons to theoretical max. In case wave is NOT sorted, e.g.
for (iComp =0; (firstPt <= lastPt) && (iComp <= maxComp); iComp +=1)
iPos = trunc ((firstPt + lastPt)/2)
theCmp = cmpStr (thetext, theWave [iPos])
if (theCmp ==0) //thetext is the same as theWave [iPos]
if ((iPos ==startPos) || (cmpStr (theText, theWave [iPos -1]) == 1)) // then iPos is the first occurence of thetext in theWave from startPos to endPos
return iPos
else //  there are more copies of theText in theWave before iPos
lastPt = iPos-1
endif
elseif (theCmp == 1) //thetext is alphabetically after theWave [iPos]
firstPt = iPos +1
else // thetext is alphabetically before theWave [iPos]
lastPt =iPos -1
endif
endfor
if (iComp > maxComp)// max number of comparisons performed. Don't think this will ever happen
return NaN
else
if (cmpStr (thetext, theWave [iPos]) == -1)
return -(iPos-1) // After this position is where theText would fit alphabetically
else
return -iPos
endif
endif
end

//******************************************************************************************************
// finds the last occurence of a text within a range of points in a SORTED text wave
// returns the point number of the last occurrence, or  -1 times the position where thetext should be inserted  if the text was not found
function FindLastTextOccurrence(theWave, thetext, startPos, endPos)
wave/T theWave // alpabetically sorted text wave.
string thetext // the text to find in this wave, matching the entire contents of the wave at that point
variable startPos, endPos // range over which to search.

// limit start and end positions to possible values
startPos = max (0, startPos)
endPos = min (numPnts (theWave)-1, endPos)
variable iPos // the point to be compared
variable  theCmp // the result of the comparison
variable firstPt =startPos , lastPt = endPos // variables the define the range oer which comparisons will be made
variable iComp, maxComp = ceil (logBase2 (endPos - StartPos + 1)) // sanity check to limit number of comparisons to theoretical max. In case wave is NOT sorted, e.g.
for (iComp =0; (firstPt <= lastPt) && (iComp <= maxComp); iComp +=1)
iPos = trunc ((firstPt + lastPt)/2)
theCmp = cmpStr (thetext, theWave [iPos])
if (theCmp ==0) //thetext is the same as theWave [iPos]
if ((iPos ==endPos) || (cmpStr (theText, theWave [iPos +1]) == -1)) // then iPos is the last occurence of thetext in theWave from startPos to endPos
return iPos
else //  there are more copies of theText in theWave after iPos
firstPt = iPos +1
endif
elseif (theCmp == 1) //thetext is alphabetically after theWave [iPos]
firstPt = iPos +1
else // thetext is alphabetically before theWave [iPos]
lastPt =iPos -1
endif
endfor
if (iComp > maxComp)// max number of comparisons performed. Don't think this will ever happen
return NaN
else
if (cmpStr (thetext, theWave [iPos]) == -1)
return -(iPos-1) // After this position is where theText would fit alphabetically
else
return -iPos
endif
endif
end

//******************************************************************************************************
// A test of the FindFirstTextOccurrence and FindLastTextOccurrence functions, using FindValue as a baseline
// both accuracy and speed are tested

function testFindFirstLastText (maxN)
variable maxN

variable iPt, nPts = floor (logBase2 (maxN))
// output waves
make/o/n=(nPts) FindValueFirstTime_out, FindValueLastTime_out, FFTOtime_out, FLTOtime_out
setScale d 0,0 ,"s" FindValueFirstTime_out, FindValueLastTime_out, FFTOtime_out, FLTOtime_out
make/o/n=(nPts) FindValueFirstPos_out, FindValueLastPos_out, FFTOpos_out, FLTOpos_out
// make data wave
variable iN, iVal, valToN
make/o/T/n =(maxN) testTextWave
WAVE/T testTextWave
for (iN =0, iVal =0;iN < maxN; iVal +=1)
for (valToN = iN + ceil (logBase2 (iVal +2)); iN < valToN;  iN +=1)
testTextWave [iN] = num2str (iVal)
endfor
endfor
// sort it, because CmpStr does not do alphanumeric comparison
Sort testTextWave, testTextWave
// varables for finding and timing
string toFind
variable fPos, myTimer, elapsedTime
for (iPt =0; iPt < nPts; ipt += 1)
// find increasingly further values in the wave
toFind = testTextWave [2^ (iPt + 2)]
// find first occurrence
// use find value
myTimer = startmstimer
FindValue/S=0 /TEXT=toFind/TXOP=4  testTextWave
elapsedTime = stopMSTimer(myTimer )/1e06
FindValueFirstTime_out [iPt] = elapsedTime
FindValueFirstPos_out [iPt] = V_Value
// use FindFirstTextOccurrence
myTimer = startmstimer
fPos = FindFirstTextOccurrence (testTextWave, toFind, 0, maxN)
elapsedTime = stopMSTimer(myTimer )
FFTOtime_out [iPt] = elapsedTime/1e06
FFTOpos_out  [iPt]  = fPos
// find last occurrence
// with findValue followed by a loop
myTimer = startmstimer
FindValue/S=0 /TEXT=toFind/TXOP=4  testTextWave
for (iN= V_Value +1;cmpStr (testTextWave [iN], toFind) ==0 && iN < maxN; iN +=1)
endfor
elapsedTime = stopMSTimer(myTimer )/1e06
FindValueLastTime_out [iPt] = elapsedTime
FindValueLastPos_out [iPt] = iN-1
// with FIndLastTextOccurrence
myTimer = startmstimer
fPos = FindLastTextOccurrence (testTextWave, toFind, 0, maxN)
elapsedTime = stopMSTimer(myTimer )
FLTOtime_out [iPt] = elapsedTime/1e06
FLTOpos_out  [iPt]  = fPos
endfor
display FFTOtime_out, FLTOtime_out vs FindValueFirstPos_out
modifygraph rgb = (0,0,0)
appendtoGraph FindValueFirstTime_out, FindValueLastTime_out vs FindValueLastPos_out
modifygraph mode =4
label bottom "Points in Text Wave"
label left "Time to Find Element (\\U)"
Legend/C/N=text0/F=0/B=1/A=MT
ModifyGraph log(bottom)=1
edit FindValueFirstPos_out, FFTOpos_out, FindValueLastPos_out, FLTOpos_out
end
FIndValue is not optimal for a sorted wave
The FindValue operation does not assume that the wave is sorted. As such, it must iterate through each point in the wave, i.e., no binary search goodness. With large enough N (probably far larger than I will ever use, but for the sake of argument, remember that Igor can make ginormous waves), even an operation built in compiled code (like FindValue) will take longer than a more efficient algorithm in slower-running Igor script.

Also, the FindValue operation does not give an easy way to find the last occurrence of a value in a wave. One could find the first matching value with FindValue and then iterate from there point by point with cmpStr, but that will get slower than a binary search pretty quickly.

Try this with maxN = 65536

function testfval (maxN)
variable maxN

make/o/n=(ln (maxN)) FindValueTime_out, FFTOtime_out, FindValuePos_out, FFTOpos_out, N_out
setScale d 0,0 ,"s", FFTOtime_out, FindValueTime_out
variable iN, N, iVal, nVal, stopVal, fPos
string toFind
variable myTimer, elapsedTime
// examine a geometric progression of N
for (N=2; N < maxN +1; N *= 2)
// make a text wave with increasingly longer series of the same number string
make/o/T/n =(N) testTextWave
for (iN =0, iVal =1;iN < N; iVal +=1)
for (stopVal = round (iN + 1 + ln (iVal)); iN < stopVal;  iN +=1)
testTextWave [iN] = num2str (iVal)
endfor
endfor
// sort it, because CmpStr does not do alphanumeric comparison
Sort testTextWave, testTextWave
// find last value in wae
toFind = testTextWave [numPnts (testTextWave) -1]
// use find value
myTimer = startmstimer
FindValue/S=0 /TEXT=toFind/TXOP=4  testTextWave
elapsedTime = stopMSTimer(myTimer )/1e06
fPos = V_Value
FindValueTime_out [ln (N)] = elapsedTime
FindValuePos_out [ln (N)] = fPos
// use FindFirstTextOccurrence
myTimer = startmstimer
fPos = FindFirstTextOccurrence (testTextWave, toFind, 0)
elapsedTime = stopMSTimer(myTimer )
FFTOtime_out [ln (N)] = elapsedTime/1e06
FFTOpos_out  [ln (N)]  = fPos
N_out [ln (N)]  = N
endfor
display FFTOtime_out  vs N_out
modifygraph rgb = (0,0,0)
appendtoGraph FindValueTime_out vs N_out
modifygraph mode =4
label bottom "Points in Text Wave"
label left "Time to Find Element (\\U)"
ModifyGraph log(bottom)=1
Legend/C/N=text0/A=MT/J "\\s(FFTOtime_out) Find First Text Occurrence\r\\s(FindValueTime_out) Find Value"
end

On my computer (old and busted G4 powerbook) the 2 methods approach equality at N = 4096. At bigger N, FindFirstTextOccurrence is faster.

Dr. Jamie Boyd, Ph.D.
UBC In Vivo Imaging Core

This is a visualization of testfval with 2^18. This function tests best case performance for FindFirstTextOccurence.

The line

fPos = FindFirstTextOccurrence (testTextWave, toFind, 0, DimSize(testtextwave, ROWS))

needs to be changed so that it compiles *and* it requires rtGlobals=1. Forum Support Gallery