Building a faster way to integrate an image stack

Suppose that I have an image stack imgsrc. I want to create an image stack where each plane N is the sum of the N planes from imgsrc. The result is to be an integral.

I have played around to create the code below. It works, but it is tediously slow over a large image stack (2448,2048,154). The time sink seems to be the need to run an explicit for loop construction.

// snippet of interest
    nlayers = DimSize(imgsrc,2)
    wave sumimg = integrate_stack(imgsrc,0)
    duplicate/FREE sumimg tmpimg
    Redimension/N=(-1,-1,nlayers) tmpimg
    for (ic=1;ic<nlayers;ic+=1)
        wave sumimg = integrate_stack(imgsrc,ic)
        tmpimg[][][ic] = sumimg[p][q]/(ic+1)
    endfor
 
ThreadSafe Static Function/WAVE integrate_stack(wave instack, variable nlayer)
 
    MatrixOP/FREE subset = instack[][][0,nlayer]
    MatrixOP/FREE/NTHR=0 reswave = sumBeams(subset)
    return reswave
end

I have to wonder whether I am missing a faster method to build the integral summation, primarily to remove the for loop.

I'd be glad for insights *in IP9 please*.

I am not sure of the meaning of your second sentence. Are you trying to get a single image from a 3-D wave consisting of a stack of 2D images?

If so,  MatrixOP can return a 2D image from the sum of all beams. Perhaps this may work for you. (IP 9.05)

See DisplayHelpTopic "MatrixOP" :

sumBeams(w)    Returns an (n x m) matrix containing the sum over all layers of all the beams of the 3D wave w :
    
    A beam is a 1D array in the Z-direction.
    sumBeams is a non-layered function which requires that w  be a proper 3D wave and not the result of another expression. 

I start with a stack of N layers (imgsrc - 3D wave). I want to create a stack of N layers (3D wave) where each layer j is the result of running sumBeams(imgsrc) for the layers 0 to j.

The function integrate_stack(...) returns a 2D wave that is the result of running sumBeams(imgsrc) on layers 0 to j. The for-loop builds the 3D wave into tmping. I wonder if the explicit for-loop can be collapsed to a (faster) implicit loop.

My suggestion might be worth considering for varying numbers of layers if you use a loop sequentially removing images from the stack. The simple delete layer operation would be to DeletePoints with M=2

DeletePoints [/M=dim ] startElement, numElements, waveName [, waveName ]...

Another faster deletion alternative (from the MatrixOP help file) might be:

" You can also extract layers from a 3D wave starting with layer a and increasing the layer number up to layer b using increments c:
MatrixOp destWave = wave3d[][][a,b,c] "

After each layer deletion, the beam summation should be faster.  Here is the sumBeams( ) formula omitted from my first reply

[ pasted formula image did not transfer ]

I have found the "Profiling Igor Procedures" method to be very useful in probing slow functions. 

 

 

I first tried to reformulate the problem to make it simpler. The following code is slightly faster. Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from. Maybe someone else has a better idea.

Function image_stack(wave imgsrc)
 
//  variable timer = StopMSTimer(-2)
 
    int ic   = 0
    int numX = DimSize(imgsrc,0)
    int numY = DimSize(imgsrc,1)
    int numZ = DimSize(imgsrc,2)
    
    Make/FREE/N=(numX, numY ,numZ) tmpimg = 0
    for (; ic < numZ; ic++)
        MultiThread tmpimg[][][ic, numZ-1] += imgsrc[p][q][ic] / (r + 1)
    endfor
 
//  print ((StopMSTimer(-2) - timer) / 1000), "ms"
End

 

Thanks. I will consider this approach.

> The following code is slightly faster. 

Perhaps an additional savings is realized using duplicate/FREE imgsrc, tmpimg and running the for-loop from ic=1 rather than ic=0?

> Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from.

The difference between the two cases may be due to differences in where various floating point conversions are used.

Here are the test results.

            nlayers = DimSize(imgsrc,2) - 1
            variable timerRefNum = StartMSTimer
            for (ic=1;ic<nlayers+1;ic++)
                wave sumimg = integrate_stack(imgsrc,ic)
                tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            endfor
            variable ms = StopMSTimer(timerRefNum)
            print "loop one ", ms/(nlayers+1)
            timerRefNum = StartMSTimer
            for (ic=1;ic<nlayers+1;ic++)
                 MultiThread tmpimg[][][ic,nlayers] += imgsrc[p][q][ic] / (r + 1)
            endfor
            ms = StopMSTimer(timerRefNum)
            print "loop two ", ms/(nlayers+1)
 
 
ThreadSafe Static Function/WAVE integrate_stack(wave instack, variable nlayer)
 
    MatrixOP/FREE subset = instack[][][0,nlayer]
    MatrixOP/FREE/NTHR=0 reswave = fp32(sumBeams(subset))
    return reswave
end
  loop one   440767
  loop two   3.92332e+06

The first method is about 10x faster overall. It clocks in at about 441 s per layer on average. To test further, I did ic = 1 and ic = 153, both as hard-coded and as inputs.

            variable timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,1)
            tmpimg[][][1] = sumimg[p][q]/2
            variable ms = StopMSTimer(timerRefNum)
            print "loop ic=1 ", ms
            timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,153)
            tmpimg[][][153] = sumimg[p][q]/154
            ms = StopMSTimer(timerRefNum)
            print "loop ic=153 ", ms
 
  loop ic=1     164770
  loop ic=153   870973
 
            ic = 1
            variable timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,ic)
            tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            variable ms = StopMSTimer(timerRefNum)
            print "loop ic=1 ", ms
            ic = 153
            timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,ic)
            tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            ms = StopMSTimer(timerRefNum)
            print "loop ic=153 ", ms
 
  loop ic=1     179918
  loop ic=153   852644

Two lessons. First, the significant overhead seems to be in operating the for-loop. The internals on the for-loop are an order of magnitude faster it seems. Second, integrating the stack with just 2 layers is about 5x faster than integrating the stack with 153 layers.

A few quick comments:

First, the variations in the results are likely due to floating point issues when summing many numbers.  IP10 has more accurate summing implemented (especially in MatrixOP).

The difficulty with the task is that MatrixOP only implements full beam sums (i.e., all layers).  This suggest a few alternative approaches which I have not tested:

1.  Run a loop over the layers what executes something like:

// First create an output wave:
Duplicate/O imgsrc,integrated
// copy just the relevant part in a loop
Duplicate/FREE/RMD[][][0,i] imgsrc,junk
// sum the beams
MatrixOP/O/FREE intLayer=sumBeams(junk)
// store the output
ImageTransform/P=(i)/D=intLayer setPlane integrated

2.  If the duplication is too slow, you can compute the integral in reverse, i.e., start with the full N layers and then on each iteration remove a layer using Redimension and compute sumBeams() for the remaining layers. 

 

3.  Create a 1D wave of N points and set it to 1.  Then run N reverse iterations:

Make/free/N=(N) sWave=1
for(iter=N-1;iter>0;iter-=1)// layer 0 should remain unchanged
    sWave[iter]=0
    MatrixOP/O/FREE junk=scaleLayers(imgsrc,sWave)  
    MatrixOP/O/FREE intLayer=sumBeams(junk)                 // IP10
    ImageTransform/P=(iter)/D=intLayer setPlane integrated
endfor

4.  Suppose your imgsrc is RxCxN wave.  You can use a combination of MatrixOP transposeVol and Redimension to convert imgsrc into a 2D wave that contains Nx(R*C).  Then follow this with Matrixop Integrate(,1) and then revert the 2D wave back to 3D.  This would look like (note, it's untested):

MatrixOP/O/FREE j2d=TransposeVol(imgsrc,2)
variable rows=DimSize(imgsrc,0),cols=DimSize(imgsrc,1)
Redimension/N=(N,rows*cols) j2d
MatrixOP/O/FREE jInt=integrate(j2d,1)
Redimension/N=(N,rows,cols) jInt
MatrixOP/O integrated=TransposeVol(jInt,4)

 

Placing example data might help with developing faster solutions, as people can try out code on the example. 

I wonder if experimenting with `sumdimension` might help here? `sumdimension` sums values in an N dimensional srcwave along a specified dimension, producing an outwave with N-1 dimensions. The operation is also threadsafe. From the sounds of it it does the same operation as matrixop+sumbeams.

On a single calculation you'd expect matrixop to be faster, but if you have a parallelised for loop with sumdimension it may turn out faster.

Remember possibilities of contention in parallelisation. If an inner function uses a parallelised `matrixop` with `nthr=0` you also try to parallelise an outer function (e.g. a multithreaded for loop) that contains the inner function, then you'll likely end up with the processor thrashing around. 

 

 

 

AG ...

> 1.  Run a loop over the layers what executes something like:

I already pull out a sub stack with only the planes of interest.

> 2.  If the duplication is too slow, ... and 3. ...

I'll try these.

I've played with variations on Concatenate versus insertZPlane, though not yet with ImageTransform, as the step after generating the integrated 2D image. I don't know that this change is tweaking around the edges versus removing the outer for-loop. In this regard, I think the third option might be the most interesting to explore since (if I understand correctly) it would remove the for-loop over layers.

Andy ...

I will consider extracting a MWE (minimum working example) with requisite data in an experiment. I might do this especially since I did end with one example using MatrixOP that crashed my system.

I appreciate the note to avoid an approach that causes the processor to thrash around. I will explore removing the NTHR=0 flag on the MatrixOP step in the ThreadSafe function.

 

 

Out of curiosity, I tried sumdimension and it is about 4.5x faster with my limited data. I guess this scales badly as well for JJ's data as seen above with my last try, though.

Function image_stack(wave imgsrc)
 
//  variable timer = StopMSTimer(-2)
 
    int ic   = 0
    int numX = DimSize(imgsrc,0)
    int numY = DimSize(imgsrc,1)
    int numZ = DimSize(imgsrc,2)
    
    Duplicate/FREE imgsrc, imgcpy
    Make/D/FREE/N=(numX, numY ,numZ) tmpimg = 0
    Make/D/FREE/N=(numX, numY) tmpsum
    for (; ic < numZ; ic++)
        int layer = numZ-ic
        SumDimension/D=2/DEST=tmpsum imgcpy
        MultiThread tmpimg[][][layer-1] = tmpsum[p][q] / layer
        Redimension/N=(numX, numY ,layer-1) imgcpy
    endfor
 
//  print ((StopMSTimer(-2) - timer) / 1000), "ms"
End

 

Here is a MWE. The experiment contains an image and a procedure to integrate it. I've kept the code that I have above and the code from AG.

    nlayers = DimSize(imgsrc,2) 
    switch(how)
        case 1:
            tRN = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,0)
            duplicate/FREE sumimg tmpimg
            Redimension/N=(-1,-1,nlayers) tmpimg            
            for (ic=1;ic<nlayers;ic++)
                wave sumimg = integrate_stack(imgsrc,ic)
                tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            endfor          
            tpLs = StopMSTimer(tRN)/(1000*nlayers)
            break
        case 2:
            tRN = StartMSTimer
            MatrixOP/O/FREE j2d=TransposeVol(imgsrc,2)
            rows=DimSize(imgsrc,0)
            cols=DimSize(imgsrc,1)
            Redimension/N=(nlayers,rows*cols) j2d
            MatrixOP/O/FREE jInt = integrate(j2d,1)
            Redimension/N=(nlayers,rows,cols) jInt
            MatrixOP/FREE tmpimg = TransposeVol(jInt,4)
            tpLs = StopMSTimer(tRN)/(1000*nlayers)          
            break
    endswitch
    print how, " time per layer (s): ", tpLs
•test_integrate(1)
  1   time per layer (s):   613.266
•test_integrate(2)
  2   time per layer (s):   81.841

The resultant integrated image is not saved in this experiment. A test remains to be done for any differences in the results from the two methods.

minimum working example experiment (38.05 MB)

@jjweimer, Make sure in method 1 that you do that assignment using multithread:

Multithread tmpimg[][][ic] = sumimg[p][q]/(ic+1)

It should improve performance quite a bit.

I did some other profiling, and the vast majority of time is spent in that assignment (79%). For method 2, it's actually MatrixOP transposeVol (83%) that dominates the execution time, as opposed to MatrixOP integrate (7%)

Adding MultiThread as suggested by @Ben gives this improvement.

•test_integrate(1)
  1   time per layer (s):   438.3

Using the code below as per @Tony gives this.

        case 3:
            tRN = StartMSTimer
            duplicate/FREE imgsrc tmpimg
            Integrate/DIM=2/P imgsrc /D=tmpimg
            tpLs = StopMSTimer(tRN)/(1000*nlayers)  
            break       
 
•test_integrate(3)
  3   time per layer (s):   152.724

The fastest approach appears to be with AG's suggestion.

I will compare the output images as a next step.

After further testing, I have settled on the code below.

wave tmpimg = integrate_stack(imgsrcr,0)
Redimension/N=(-1,-1,nlayers) tmpimg            
for (ic=1;ic<nlayers;ic++)
    wave sumimg = integrate_stack(imgsrcr,ic)
    MultiThread tmpimg[][][ic] = sumimg[p][q]/(ic+1)
endfor          
 
ThreadSafe Function/WAVE integrate_stack(wave instack, variable nlayer)
 
    MatrixOP/FREE subset = instack[][][0,nlayer]
    MatrixOP/FREE reswave = sumBeams(subset)
    return reswave
end

Starting from an unsigned 8 bit image, the code above returns an unsigned integer 32 bit image. The conversion 8 bit to 32 bit happens in the sumBums call (even with the flag /NPRM on MatrixOP). The code provided by AG that uses TransposeVol requires further steps that extend the time. As is, it returns an 8 bit image, so round-off errors will be more likely. The code provided by Tony using Integrate/DIM=2 is faster at larger stacks but slower on smaller stacks. It returns a floating point 32 bit image.

The code will be used for the stack math->integrate option in the Image Tools package at its next update.

Thanks to all for the suggestions.