parallel processing issue

I want to extract and sum up "beams" from a 3D wave depending on a ROI or mask wave, which I do using the code below:

function GetSum(cube, mask)
    wave cube, mask

    Make/O/N=(DimSize(cube, 2)) total=0
       
    variable cx, cy
    for(cx = 0; cx < DimSize(Cube, 0); cx += 1)        
            for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
                   
                if (mask[cx][cy] ==1)           // mask[cx][cy] can be 1 or 0              
                    ImageTransform /Beam ={(cx), (cy)} getBeam cube
                    wave W_Beam
                    total+=W_Beam  
                endif      
           
            endfor
    endfor     
   
    KillWaves/Z W_Beam
end


For the size of my 3D wave this process takes about 2-3 seconds. To improve performance I copied and modified the parallel processing example from the help files to this:

Threadsafe function workerFunc(cube, total, mask, col)
    wave cube, total, mask
    variable col
   
    variable i
    for (i=0; i<DimSize(cube, 0);i+=1)
       
        if (mask[i][col] ==1)  
            ImageTransform /Beam ={(i), (col)} getBeam cube
            wave W_Beam
            total+=W_Beam  
        endif
    endfor

end


Function MT_GetSum(cube, mask)
    WAVE cube, mask
   
    Make/O/N=(DimSize(cube, 2)) total=0
   
    Variable ncol= DimSize(cube,1)
    Variable i,col,nthreads= ThreadProcessorCount
    Variable threadGroupID= ThreadGroupCreate(nthreads)
   
    for(col=0; col<ncol;)
        for(i=0; i<nthreads; i+=1)
       
            ThreadStart threadGroupID,i,workerFunc(cube,total, mask,col)
           
            col+=1
            if( col>=ncol )
                break
            endif
        endfor
       
        do
            Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
        while( threadGroupStatus != 0 )
    endfor
    Variable dummy= ThreadGroupRelease(threadGroupID)
End



This gives a 60-70% speedup but it appears that wave "total" from the two methods does not contain identical values. The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.
Any ideas what I'm doing wrong?

I don't have a lot of time but from a quick look at your post:

ChrLie wrote:

Threadsafe function workerFunc(cube, total, mask, col)
    wave cube, total, mask
    variable col
   
    variable i
    for (i=0; i<DimSize(cube, 0);i+=1)
       
        if (mask[i][col] ==1)  
            ImageTransform /Beam ={(i), (col)} getBeam cube
            wave W_Beam
            total+=W_Beam  
        endif
    endfor

end



Think about what happens in your parallel calculation. Igor spawns a bunch of threads and these start crunching away, independently and regardless of what other threads may be doing. This also means that if different threads write to the same memory, each will modify this memory as it pleases and the result is unpredictable. Read the section called "Example atomic operation" here to understand what that means with respect to your problem.

Igor tries to prevent these problems from occurring by giving each thread its own, private data folder hierarchy, such that all waves created by the thread are accessible only by that thread. The exception, of course, are function arguments. In other words, every thread refers to the same "total" wave.

So when you do this:
total+=W_Beam  

The result is undefined because multiple threads can modify it simultaneously.

Your only option is to provide each thread with its own copy of total (so if there are 8 threads, create 8 copies of total, all initialized to zero), and to add all of these waves together into the final result after the threads have completed.


Also, the loop in which you call ThreadStart has issues, but I don't have time now so maybe someone else will chip in on that.
Looking at this again, your ThreadStart loop does look OK. I misread it due to reading it in a hurry, sorry. However, my comment above still stands.

I played with this a bit:
function GetSum(cube, mask) // this is your original function
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
 
    variable cx, cy
    for(cx = 0; cx < DimSize(Cube, 0); cx += 1)        
            for(cy = 0; cy < DimSize(Cube, 1); cy += 1)
 
                if (mask[cx][cy] ==1)           // mask[cx][cy] can be 1 or 0              
                    ImageTransform /Beam ={(cx), (cy)} getBeam cube
                    wave W_Beam
                    total+=W_Beam[p]
                endif      
 
            endfor
    endfor     
 
    KillWaves/Z W_Beam
end

Function GetSum2(cube, mask)    // MatrixOP-based
    wave cube, mask
   
    Make/O/N=(DimSize(cube, 2))/D total=0
   
    MatrixOP /FREE Multiplied = cube * mask
   
    variable i
    for (i = 0; i < DimSize(Multiplied, 2); i+=1)
        ImageTransform /P=(i) sumPlane, Multiplied
        total[i] += V_value
    endfor
End

function GetSum3(cube, mask)    // MultiThreaded
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
   
    MultiThread total = WorkerFunc3(cube, mask, p)
end

ThreadSafe Function WorkerFunc3(cube, mask, plane)
    wave cube, mask
    variable plane
   
    variable i, j, total = 0
    for (i = 0; i < DimSize(cube, 0); i+=1)
        for (j = 0; j < DimSize(cube, 1); j+=1)
            if (mask[i][j] == 1)
                total += cube[i][j][plane]
            endif
        endfor
    endfor
   
    return total
End

function GetSum4(cube, mask)    // MultiThreaded + MatrixOP
    wave cube, mask
 
    Make/O/N=(DimSize(cube, 2))/D total=0
   
    MultiThread total = WorkerFunc4(cube, mask, p)
end

ThreadSafe Function WorkerFunc4(cube, mask, plane)
    wave cube, mask
    variable plane
   
    MatrixOP /FREE W_Sum = sum(mask * cube[][][plane])
    return W_Sum[0]
End

Function DoTiming()
    Make /o/n=(128,128) /D M_Mask = round(enoise(0.5) + 0.5)
    Make /o/n=(128,128, 512) /D/O M_cube = enoise(1)
   
    variable timerRef, time1, time2, time3, time4
   
    timerRef = StartMSTimer
    GetSum(m_cube, m_mask)
    time1 = StopMSTimer(timerRef) / 1e6
   
    timerRef = StartMSTimer
    GetSum2(m_cube, m_mask)
    time2 = StopMSTimer(timerRef) / 1e6
   
    timerRef = StartMSTimer
    GetSum3(m_cube, m_mask)
    time3 = StopMSTimer(timerRef) / 1e6
   
    timerRef = StartMSTimer
    GetSum4(m_cube, m_mask)
    time4 = StopMSTimer(timerRef) / 1e6
   
    Printf "Original: %f, MatrixOP: %f s, MultiThread: %f s, MatrixOP + MultiThread: %f s\r", time1, time2, time3, time4
End


Result on my computer:
•DoTiming()
  Original: 0.343414 s, MatrixOP: 0.056957 s, MultiThread: 0.389801 s, MatrixOP + MultiThread: 0.008289 s


In my implementation, multithreading of the 'naive' Igor code actually leads to a slowdown. Using MatrixOP, however, is much faster than the naive code and multithreaded MatrixOP appears to be the performance winner.

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.

ChrLie wrote:
The parallel version has slightly lower totals between 0-3% (0.25 avg) compared to the single thread version.

I think that this is probably due to a mistake in your code. But be wary of using single-precision waves when not necessary. Use doubles (Make /D) when practical.

I'd be interested to hear what the performance difference is on your dataset.
741 wrote:

Note the dirty trick that I'm using here – I make use of the fact that your mask contains only zero or one, and therefore the undesired values can be removed simply by multiplying the data with the mask (but of course you need to make a copy of the data to do that). Doing this multiplication is much faster than the explicit Igor code because it runs entirely in machine code, and computers are very good at multiplying lots of numbers.


GetSum4() is awesome!

Thinking of "planes" instead of "beams" clearly makes a difference. The result for my cube:

  Original: 2.831555, MatrixOP: 0.350114 s, MultiThread: 6.146490 s, MatrixOP + MultiThread: 0.097558 s


As for differences in totals in the initial post; I somehow had the feeling that all threads working on one wave (or memory) might be the problem but your solution is much more elegant anyway!

Thanks a lot!