Binary search on PRE-SORTED text waves

These 2 functions do a binary search on an ALREADY ALPHABETICALLY SORTED text wave, looking for the point in the wave that matches the input text. Either the first matching point in the wave, or the last matching point is searched for.


  1. I'd appreciate feedback on correctness/efficiency.

  2. If there is an Igor function that will do this already, I don't know about it. Maybe it could be a wish list item?

  3. FindLastTextOccurrence and FindFirstTextOccurrence could be combined, with another input parameter to select for first or last occurrence.

  4. As it stands, the match is case-insensitive. Another input parameter for case sensitivity of the search could be added to the function and passed to cmpStr; the user would have to ensure that the text wave was sorted the same way, case-insensitive or not.


//******************************************************************************************************
// Binary Searches for sorted text waves. Not exactly math utilities, but sorting is mathematical
// THESE ASSUME THAT THE TEXT WAVE IS ALREADY ALPHABETICALLY SORTED

// A function to calculate base 2 logarithms
STATIC CONSTANT log2 =  0.301029995664
Function logBase2 (theValue)
    variable theValue
   
    return log (theValue)/log2
end

//******************************************************************************************************
// finds the first occurrence of a text within a range of points in a SORTED text wave
// returns the point number of the last occurrence, or -1 times the position where thetext should be inserted  if the text was not found
// Last Modified Feb 03 2012 by Jamie Boyd
function FindFirstTextOccurrence (theWave, thetext, startPos, endPos)
    wave/T theWave // alpabetically sorted text wave.
    string thetext // the text to find in this wave, matching the entire contents of the wave at that point
    variable startPos, endPos // range over which to search
   
    // limit start and end positions to possible values
    startPos = max (0, startPos)
    endPos = min (numPnts (theWave)-1, endPos)
    variable iPos // the point to be compared
    variable  theCmp // the result of the comparison
    variable firstPt =startPos , lastPt = endPos // variables the define the range oer which comparisons will be made
    variable iComp, maxComp = ceil (logBase2 (endPos - StartPos + 1)) // sanity check to limit number of comparisons to theoretical max. In case wave is NOT sorted, e.g.
    for (iComp =0; (firstPt <= lastPt) && (iComp <= maxComp); iComp +=1)
        iPos = trunc ((firstPt + lastPt)/2)
        theCmp = cmpStr (thetext, theWave [iPos])
        if (theCmp ==0) //thetext is the same as theWave [iPos]
             if ((iPos ==startPos) || (cmpStr (theText, theWave [iPos -1]) == 1)) // then iPos is the first occurence of thetext in theWave from startPos to endPos
                return iPos
            else //  there are more copies of theText in theWave before iPos
                lastPt = iPos-1
            endif
        elseif (theCmp == 1) //thetext is alphabetically after theWave [iPos]
            firstPt = iPos +1
        else // thetext is alphabetically before theWave [iPos]
            lastPt =iPos -1
        endif
    endfor
    if (iComp > maxComp)// max number of comparisons performed. Don't think this will ever happen
        return NaN  
    else
        if (cmpStr (thetext, theWave [iPos]) == -1)
            return -(iPos-1) // After this position is where theText would fit alphabetically
        else
            return -iPos
        endif
    endif
end

//******************************************************************************************************
// finds the last occurence of a text within a range of points in a SORTED text wave
// returns the point number of the last occurrence, or  -1 times the position where thetext should be inserted  if the text was not found
// Last Modified Feb 03 2012 by Jamie Boyd
function FindLastTextOccurrence(theWave, thetext, startPos, endPos)
    wave/T theWave // alpabetically sorted text wave.
    string thetext // the text to find in this wave, matching the entire contents of the wave at that point
    variable startPos, endPos // range over which to search.
   
    // limit start and end positions to possible values
    startPos = max (0, startPos)
    endPos = min (numPnts (theWave)-1, endPos)
    variable iPos // the point to be compared
    variable  theCmp // the result of the comparison
    variable firstPt =startPos , lastPt = endPos // variables the define the range oer which comparisons will be made
    variable iComp, maxComp = ceil (logBase2 (endPos - StartPos + 1)) // sanity check to limit number of comparisons to theoretical max. In case wave is NOT sorted, e.g.
    for (iComp =0; (firstPt <= lastPt) && (iComp <= maxComp); iComp +=1)
        iPos = trunc ((firstPt + lastPt)/2)
        theCmp = cmpStr (thetext, theWave [iPos])
        if (theCmp ==0) //thetext is the same as theWave [iPos]
             if ((iPos ==endPos) || (cmpStr (theText, theWave [iPos +1]) == -1)) // then iPos is the last occurence of thetext in theWave from startPos to endPos
                return iPos
            else //  there are more copies of theText in theWave after iPos
                firstPt = iPos +1
            endif
        elseif (theCmp == 1) //thetext is alphabetically after theWave [iPos]
            firstPt = iPos +1
        else // thetext is alphabetically before theWave [iPos]
            lastPt =iPos -1
        endif
    endfor
    if (iComp > maxComp)// max number of comparisons performed. Don't think this will ever happen
        return NaN  
    else
        if (cmpStr (thetext, theWave [iPos]) == -1)
            return -(iPos-1) // After this position is where theText would fit alphabetically
        else
            return -iPos
        endif
    endif
end

//******************************************************************************************************
// A test of the FindFirstTextOccurrence and FindLastTextOccurrence functions, using FindValue as a baseline
// both accuracy and speed are tested
// Last Modified Feb 03 2012 by Jamie Boyd

function testFindFirstLastText (maxN)
    variable maxN
   
    variable iPt, nPts = floor (logBase2 (maxN))
    // output waves
    make/o/n=(nPts) FindValueFirstTime_out, FindValueLastTime_out, FFTOtime_out, FLTOtime_out
    setScale d 0,0 ,"s" FindValueFirstTime_out, FindValueLastTime_out, FFTOtime_out, FLTOtime_out
    make/o/n=(nPts) FindValueFirstPos_out, FindValueLastPos_out, FFTOpos_out, FLTOpos_out
    // make data wave
    variable iN, iVal, valToN
    make/o/T/n =(maxN) testTextWave
    WAVE/T testTextWave
    for (iN =0, iVal =0;iN < maxN; iVal +=1)
        for (valToN = iN + ceil (logBase2 (iVal +2)); iN < valToN;  iN +=1)
            testTextWave [iN] = num2str (iVal)
        endfor
    endfor
    // sort it, because CmpStr does not do alphanumeric comparison
    Sort testTextWave, testTextWave
    // varables for finding and timing
    string toFind
    variable fPos, myTimer, elapsedTime
    for (iPt =0; iPt < nPts; ipt += 1)
        // find increasingly further values in the wave
         toFind = testTextWave [2^ (iPt + 2)]
        // find first occurrence
        // use find value
        myTimer = startmstimer
        FindValue/S=0 /TEXT=toFind/TXOP=4  testTextWave
        elapsedTime = stopMSTimer(myTimer )/1e06
        FindValueFirstTime_out [iPt] = elapsedTime
        FindValueFirstPos_out [iPt] = V_Value
        // use FindFirstTextOccurrence
        myTimer = startmstimer
        fPos = FindFirstTextOccurrence (testTextWave, toFind, 0, maxN)
        elapsedTime = stopMSTimer(myTimer )
        FFTOtime_out [iPt] = elapsedTime/1e06
        FFTOpos_out  [iPt]  = fPos
        // find last occurrence
        // with findValue followed by a loop
        myTimer = startmstimer
        FindValue/S=0 /TEXT=toFind/TXOP=4  testTextWave
        for (iN= V_Value +1;cmpStr (testTextWave [iN], toFind) ==0 && iN < maxN; iN +=1)
        endfor
        elapsedTime = stopMSTimer(myTimer )/1e06
        FindValueLastTime_out [iPt] = elapsedTime
        FindValueLastPos_out [iPt] = iN-1
        // with FIndLastTextOccurrence
        myTimer = startmstimer
        fPos = FindLastTextOccurrence (testTextWave, toFind, 0, maxN)
        elapsedTime = stopMSTimer(myTimer )
        FLTOtime_out [iPt] = elapsedTime/1e06
        FLTOpos_out  [iPt]  = fPos
    endfor
    display FFTOtime_out, FLTOtime_out vs FindValueFirstPos_out
    modifygraph rgb = (0,0,0)
    appendtoGraph FindValueFirstTime_out, FindValueLastTime_out vs FindValueLastPos_out
    modifygraph mode =4
    label bottom "Points in Text Wave"
    label left "Time to Find Element (\\U)"
    Legend/C/N=text0/F=0/B=1/A=MT
    ModifyGraph log(bottom)=1
    edit FindValueFirstPos_out, FFTOpos_out, FindValueLastPos_out, FLTOpos_out
end
FIndValue is not optimal for a sorted wave
The FindValue operation does not assume that the wave is sorted. As such, it must iterate through each point in the wave, i.e., no binary search goodness. With large enough N (probably far larger than I will ever use, but for the sake of argument, remember that Igor can make ginormous waves), even an operation built in compiled code (like FindValue) will take longer than a more efficient algorithm in slower-running Igor script.

Also, the FindValue operation does not give an easy way to find the last occurrence of a value in a wave. One could find the first matching value with FindValue and then iterate from there point by point with cmpStr, but that will get slower than a binary search pretty quickly.

Try this with maxN = 65536

function testfval (maxN)
    variable maxN
   
    make/o/n=(ln (maxN)) FindValueTime_out, FFTOtime_out, FindValuePos_out, FFTOpos_out, N_out
    setScale d 0,0 ,"s", FFTOtime_out, FindValueTime_out
    variable iN, N, iVal, nVal, stopVal, fPos
    string toFind
    variable myTimer, elapsedTime
    // examine a geometric progression of N
    for (N=2; N < maxN +1; N *= 2)
        // make a text wave with increasingly longer series of the same number string
        make/o/T/n =(N) testTextWave
        for (iN =0, iVal =1;iN < N; iVal +=1)
            for (stopVal = round (iN + 1 + ln (iVal)); iN < stopVal;  iN +=1)
                testTextWave [iN] = num2str (iVal)
            endfor
        endfor
        // sort it, because CmpStr does not do alphanumeric comparison
        Sort testTextWave, testTextWave
        // find last value in wae
         toFind = testTextWave [numPnts (testTextWave) -1]
         // use find value
        myTimer = startmstimer
        FindValue/S=0 /TEXT=toFind/TXOP=4  testTextWave
        elapsedTime = stopMSTimer(myTimer )/1e06
        fPos = V_Value
        FindValueTime_out [ln (N)] = elapsedTime
        FindValuePos_out [ln (N)] = fPos
        // use FindFirstTextOccurrence
        myTimer = startmstimer
        fPos = FindFirstTextOccurrence (testTextWave, toFind, 0)
        elapsedTime = stopMSTimer(myTimer )
        FFTOtime_out [ln (N)] = elapsedTime/1e06
        FFTOpos_out  [ln (N)]  = fPos
        N_out [ln (N)]  = N
    endfor
    display FFTOtime_out  vs N_out
    modifygraph rgb = (0,0,0)
    appendtoGraph FindValueTime_out vs N_out
    modifygraph mode =4
    label bottom "Points in Text Wave"
    label left "Time to Find Element (\\U)"
    ModifyGraph log(bottom)=1
    Legend/C/N=text0/A=MT/J "\\s(FFTOtime_out) Find First Text Occurrence\r\\s(FindValueTime_out) Find Value"
end


On my computer (old and busted G4 powerbook) the 2 methods approach equality at N = 4096. At bigger N, FindFirstTextOccurrence is faster.

Dr. Jamie Boyd, Ph.D.
UBC In Vivo Imaging Core

This is a visualization of testfval with 2^18. This function tests best case performance for FindFirstTextOccurence.

 

The line

 

        fPos = FindFirstTextOccurrence (testTextWave, toFind, 0, DimSize(testtextwave, ROWS))

needs to be changed so that it compiles *and* it requires rtGlobals=1.

Graph2.png

Forum

Support

Gallery

Igor Pro 9

Learn More

Igor XOP Toolkit

Learn More

Igor NIDAQ Tools MX

Learn More