function speed-up: extracting point numbers

I have two integer waves AA and BB, where AA has maybe about 80,000 points and BB maybe 6,000. All numbers in BB are unique and all of them occur in AA at least once.
Now I want to find the point/index numbers of AA where the values of BB occur. Finding means I duplicate AA and set a point to 1 if the number is within BB or 0 if not.

The function looks like:

function FindIt(aa, bb)
wave aa, bb

variable timer = startMStimer

Make/O/N=(numPnts(aa)) cc=0

variable i
for(i=0; i<numPnts(bb); i+=1)
Multithread cc += aa == bb[i] ? 1 : 0
endfor

Printf "Time required: %f seconds\r", (stopMStimer(timer)/1e6)

end

However it is annoyingly slow. For the wave lengths given above it takes on my work computer about 25s.
I'm convinced there is a better way but I just don't see it at the moment!

Any ideas are highly appreciated!

Well I don't understand completely what you want to do.
At least your code does not extract all bb indizes of the contents of aa.

So I can only guess what you want to do, so here is my try:
Function main()
SetRandomSeed 123456
Make/U/I/O/N=60000 aa=NaN
Make/U/I/O/N=8000  bb=NaN

aa = abs(enoise(DimSize(bb,0)-1))
bb = DimSize(bb,0)-p

variable timer
timer = startMStimer
FindIt(aa,bb)
Printf "Time required: %f seconds\r", (stopMStimer(timer)/1e6)

timer = startMStimer
FindIt_tb(aa,bb)
Printf "Time required: %f seconds\r", (stopMStimer(timer)/1e6)
End

function FindIt(aa, bb)
wave aa, bb

Make/O/N=(numPnts(aa)) cc=0

variable i
for(i=0; i<numPnts(bb); i+=1)
Multithread cc += aa == bb[i] ? 1 : 0
endfor
end

// Assumes -1 is not a valid entry in bb
function FindIt_tb(aa, bb)
wave aa, bb

Duplicate/O aa,cc

variable i
variable size = DimSize(aa,0)
for(i=0; i<size; i+=1)
FindValue/I=(aa[i]) bb
cc[i] = V_value
endfor
end

Numbers from my machine are 8.7s for your solution and 2.4s for my try.

In case this is still too slow you would need to give more info about the situation. Is bb constant? If yes, you could try to sort bb and use a binary search[1].
Do you need the contents of aa afterwards? If no, you can just overwrite the contents of aa and skip creating cc.

[1]: http://www.igorexchange.com/node/2775
Hi Thomas,

I have attached an experiment with real world data to make it more clear.
Wave AA and BB are sorted and are positive integers. Wave AA can contain multiple entries of e.g. BB[3] - I think this is why I can't use FindValue, because it will only find the first entry of BB[3] (?). If you execute FindIt(aa, bb) then e.g. cc[0] and cc[211] will be 1 because bb[0] and bb[1] are present in aa at these point numbers.
This is what I want to get out, or a wave which contains the actual point numbers, e.g. {0, 211, ...}
FindIt.pxp (393.78 KB)
Note that I iterate over aa not bb. Therefore it does not matter if multiple bb entries are in aa, as long as the entries in bb are unique.

And as I see you are not interested in the index but just if zero I tried with
function FindIt_tb(aa, bb)
wave aa, bb

Duplicate/O aa,cc_tb

variable  timer = startMStimer

variable i
variable size = DimSize(aa,0)
for(i=0; i<size; i+=1)
FindValue/I=(aa[i]) bb
cc_tb[i] = V_Value
endfor
// convert to 1 and 0
cc_tb = (cc_tb[p] == -1 ? 0 : 1)
Printf "Time required: %f seconds\r", (stopMStimer(timer)/1e6)
end
well, you got me thinking about FindValue! In a sorted wave I can use the start value to decrease the size of the part of the array that is being searched in each iteration.
I can also use a wave with the point numbers as output and use something like (I still iterate over bb):

function FindIt2(aa, bb)
wave aa, bb

variable  timer = startMStimer
Make/I/U/O/N=1 cc=0             //cc[0]=0 is always true for my aa and bb
variable size = numpnts(bb)
Variable start = 1

variable i=0, j=1, tmp

for(i=0; i<size; i+=1)

FindValue/S=(start) /I=(bb[i]) aa
if (V_value != -1)
InsertPoints j, 1, cc
cc[j] = V_Value
j+=1
i-=1
Start = V_Value+1
endif
endfor

Printf "Time required: %f seconds\r", (stopMStimer(timer)/1e6)
end

I need to check if this gives me correct data at the boundaries but I get a speedup compared to my initial version by a factor 10. This is very nice! Thanks!
One more idea:

Suppose you wrote the following expression
Duplicate/O aa,cc
cc=0
...
Variable b1,b2,b3
b1=bb[i]
b2=bb[i+1]
b3=bb[i+2]
...
MatrixOp/O cc=cc+equal(aa,b1)+equal(aa,b2)+equal(aa,b3)...

and executed this in a loop.

Note also that your loop should ideally use a fixed variable rather than calling a function. For N equal() calls I'd use something like
Variable NP=numpnts(bb)
for(i=0;i<NP;i+=N)

Hope this helps,

AG
InsertPoints has to move memory and so will be slow.

Perhaps you can create the output wave at full size, initialize it to NaN, and then zap any NaNs at the end using WaveTransform zapNaNs.
thomas_braun wrote:
Well I don't understand completely what you want to do.
At least your code does not extract all bb indizes of the contents of aa.

Thomas, I should have written this at the beginning: Note that WaveMin and WaveMax of aa and bb are identical. This is why iterating over bb is sufficient. What I do is finding the values of bb in an image M_aa. Only a small proportion of M_aa falls within WaveMin(bb) and WaveMax(bb). This is why I redimension M_aa into aa, duplicate and create a point number wave pn=p and then Sort aa aa,pn. Then I can clip aa to the values of interest. Using the output from above and the clipped pn, I can reconstruct the pixel coordinates. This should be more efficient than ImageThreshold, since only a small proportion of the image needs attention.

My last version version still had some issues, but now it's running fine. I also got rid of InsertPoints, but it only made it marginally faster.
Thanks for all suggestions and pointing me to FindValue!
ChrLie wrote:

Note that WaveMin and WaveMax of aa and bb are identical.

Ah that's the missing information which clarifies the differences between our approaches.