Building a faster way to integrate an image stack

Suppose that I have an image stack imgsrc. I want to create an image stack where each plane N is the sum of the N planes from imgsrc. The result is to be an integral.

I have played around to create the code below. It works, but it is tediously slow over a large image stack (2448,2048,154). The time sink seems to be the need to run an explicit for loop construction.

// snippet of interest
    nlayers = DimSize(imgsrc,2)
    wave sumimg = integrate_stack(imgsrc,0)
    duplicate/FREE sumimg tmpimg
    Redimension/N=(-1,-1,nlayers) tmpimg
    for (ic=1;ic<nlayers;ic+=1)
        wave sumimg = integrate_stack(imgsrc,ic)
        tmpimg[][][ic] = sumimg[p][q]/(ic+1)
    endfor
 
ThreadSafe Static Function/WAVE integrate_stack(wave instack, variable nlayer)
 
    MatrixOP/FREE subset = instack[][][0,nlayer]
    MatrixOP/FREE/NTHR=0 reswave = sumBeams(subset)
    return reswave
end

I have to wonder whether I am missing a faster method to build the integral summation, primarily to remove the for loop.

I'd be glad for insights *in IP9 please*.

I am not sure of the meaning of your second sentence. Are you trying to get a single image from a 3-D wave consisting of a stack of 2D images?

If so,  MatrixOP can return a 2D image from the sum of all beams. Perhaps this may work for you. (IP 9.05)

See DisplayHelpTopic "MatrixOP" :

sumBeams(w)    Returns an (n x m) matrix containing the sum over all layers of all the beams of the 3D wave w :
    
    A beam is a 1D array in the Z-direction.
    sumBeams is a non-layered function which requires that w  be a proper 3D wave and not the result of another expression. 

I start with a stack of N layers (imgsrc - 3D wave). I want to create a stack of N layers (3D wave) where each layer j is the result of running sumBeams(imgsrc) for the layers 0 to j.

The function integrate_stack(...) returns a 2D wave that is the result of running sumBeams(imgsrc) on layers 0 to j. The for-loop builds the 3D wave into tmping. I wonder if the explicit for-loop can be collapsed to a (faster) implicit loop.

My suggestion might be worth considering for varying numbers of layers if you use a loop sequentially removing images from the stack. The simple delete layer operation would be to DeletePoints with M=2

DeletePoints [/M=dim ] startElement, numElements, waveName [, waveName ]...

Another faster deletion alternative (from the MatrixOP help file) might be:

" You can also extract layers from a 3D wave starting with layer a and increasing the layer number up to layer b using increments c:
MatrixOp destWave = wave3d[][][a,b,c] "

After each layer deletion, the beam summation should be faster.  Here is the sumBeams( ) formula omitted from my first reply

[ pasted formula image did not transfer ]

I have found the "Profiling Igor Procedures" method to be very useful in probing slow functions. 

 

 

I first tried to reformulate the problem to make it simpler. The following code is slightly faster. Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from. Maybe someone else has a better idea.

Function image_stack(wave imgsrc)
 
//  variable timer = StopMSTimer(-2)
 
    int ic   = 0
    int numX = DimSize(imgsrc,0)
    int numY = DimSize(imgsrc,1)
    int numZ = DimSize(imgsrc,2)
    
    Make/FREE/N=(numX, numY ,numZ) tmpimg = 0
    for (; ic < numZ; ic++)
        MultiThread tmpimg[][][ic, numZ-1] += imgsrc[p][q][ic] / (r + 1)
    endfor
 
//  print ((StopMSTimer(-2) - timer) / 1000), "ms"
End

 

Thanks. I will consider this approach.

> The following code is slightly faster. 

Perhaps an additional savings is realized using duplicate/FREE imgsrc, tmpimg and running the for-loop from ic=1 rather than ic=0?

> Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from.

The difference between the two cases may be due to differences in where various floating point conversions are used.

Here are the test results.

            nlayers = DimSize(imgsrc,2) - 1
            variable timerRefNum = StartMSTimer
            for (ic=1;ic<nlayers+1;ic++)
                wave sumimg = integrate_stack(imgsrc,ic)
                tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            endfor
            variable ms = StopMSTimer(timerRefNum)
            print "loop one ", ms/(nlayers+1)
            timerRefNum = StartMSTimer
            for (ic=1;ic<nlayers+1;ic++)
                 MultiThread tmpimg[][][ic,nlayers] += imgsrc[p][q][ic] / (r + 1)
            endfor
            ms = StopMSTimer(timerRefNum)
            print "loop two ", ms/(nlayers+1)
 
 
ThreadSafe Static Function/WAVE integrate_stack(wave instack, variable nlayer)
 
    MatrixOP/FREE subset = instack[][][0,nlayer]
    MatrixOP/FREE/NTHR=0 reswave = fp32(sumBeams(subset))
    return reswave
end
  loop one   440767
  loop two   3.92332e+06

The first method is about 10x faster overall. It clocks in at about 441 s per layer on average. To test further, I did ic = 1 and ic = 153, both as hard-coded and as inputs.

            variable timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,1)
            tmpimg[][][1] = sumimg[p][q]/2
            variable ms = StopMSTimer(timerRefNum)
            print "loop ic=1 ", ms
            timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,153)
            tmpimg[][][153] = sumimg[p][q]/154
            ms = StopMSTimer(timerRefNum)
            print "loop ic=153 ", ms
 
  loop ic=1     164770
  loop ic=153   870973
 
            ic = 1
            variable timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,ic)
            tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            variable ms = StopMSTimer(timerRefNum)
            print "loop ic=1 ", ms
            ic = 153
            timerRefNum = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,ic)
            tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            ms = StopMSTimer(timerRefNum)
            print "loop ic=153 ", ms
 
  loop ic=1     179918
  loop ic=153   852644

Two lessons. First, the significant overhead seems to be in operating the for-loop. The internals on the for-loop are an order of magnitude faster it seems. Second, integrating the stack with just 2 layers is about 5x faster than integrating the stack with 153 layers.

A few quick comments:

First, the variations in the results are likely due to floating point issues when summing many numbers.  IP10 has more accurate summing implemented (especially in MatrixOP).

The difficulty with the task is that MatrixOP only implements full beam sums (i.e., all layers).  This suggest a few alternative approaches which I have not tested:

1.  Run a loop over the layers what executes something like:

// First create an output wave:
Duplicate/O imgsrc,integrated
// copy just the relevant part in a loop
Duplicate/FREE/RMD[][][0,i] imgsrc,junk
// sum the beams
MatrixOP/O/FREE intLayer=sumBeams(junk)
// store the output
ImageTransform/P=(i)/D=intLayer setPlane integrated

2.  If the duplication is too slow, you can compute the integral in reverse, i.e., start with the full N layers and then on each iteration remove a layer using Redimension and compute sumBeams() for the remaining layers. 

 

3.  Create a 1D wave of N points and set it to 1.  Then run N reverse iterations:

Make/free/N=(N) sWave=1
for(iter=N-1;iter>0;iter-=1)// layer 0 should remain unchanged
    sWave[iter]=0
    MatrixOP/O/FREE junk=scaleLayers(imgsrc,sWave)  
    MatrixOP/O/FREE intLayer=sumBeams(junk)                 // IP10
    ImageTransform/P=(iter)/D=intLayer setPlane integrated
endfor

4.  Suppose your imgsrc is RxCxN wave.  You can use a combination of MatrixOP transposeVol and Redimension to convert imgsrc into a 2D wave that contains Nx(R*C).  Then follow this with Matrixop Integrate(,1) and then revert the 2D wave back to 3D.  This would look like (note, it's untested):

MatrixOP/O/FREE j2d=TransposeVol(imgsrc,2)
variable rows=DimSize(imgsrc,0),cols=DimSize(imgsrc,1)
Redimension/N=(N,rows*cols) j2d
MatrixOP/O/FREE jInt=integrate(j2d,1)
Redimension/N=(N,rows,cols) jInt
MatrixOP/O integrated=TransposeVol(jInt,4)

 

Placing example data might help with developing faster solutions, as people can try out code on the example. 

I wonder if experimenting with `sumdimension` might help here? `sumdimension` sums values in an N dimensional srcwave along a specified dimension, producing an outwave with N-1 dimensions. The operation is also threadsafe. From the sounds of it it does the same operation as matrixop+sumbeams.

On a single calculation you'd expect matrixop to be faster, but if you have a parallelised for loop with sumdimension it may turn out faster.

Remember possibilities of contention in parallelisation. If an inner function uses a parallelised `matrixop` with `nthr=0` you also try to parallelise an outer function (e.g. a multithreaded for loop) that contains the inner function, then you'll likely end up with the processor thrashing around. 

 

 

 

AG ...

> 1.  Run a loop over the layers what executes something like:

I already pull out a sub stack with only the planes of interest.

> 2.  If the duplication is too slow, ... and 3. ...

I'll try these.

I've played with variations on Concatenate versus insertZPlane, though not yet with ImageTransform, as the step after generating the integrated 2D image. I don't know that this change is tweaking around the edges versus removing the outer for-loop. In this regard, I think the third option might be the most interesting to explore since (if I understand correctly) it would remove the for-loop over layers.

Andy ...

I will consider extracting a MWE (minimum working example) with requisite data in an experiment. I might do this especially since I did end with one example using MatrixOP that crashed my system.

I appreciate the note to avoid an approach that causes the processor to thrash around. I will explore removing the NTHR=0 flag on the MatrixOP step in the ThreadSafe function.

 

 

Out of curiosity, I tried sumdimension and it is about 4.5x faster with my limited data. I guess this scales badly as well for JJ's data as seen above with my last try, though.

Function image_stack(wave imgsrc)
 
//  variable timer = StopMSTimer(-2)
 
    int ic   = 0
    int numX = DimSize(imgsrc,0)
    int numY = DimSize(imgsrc,1)
    int numZ = DimSize(imgsrc,2)
    
    Duplicate/FREE imgsrc, imgcpy
    Make/D/FREE/N=(numX, numY ,numZ) tmpimg = 0
    Make/D/FREE/N=(numX, numY) tmpsum
    for (; ic < numZ; ic++)
        int layer = numZ-ic
        SumDimension/D=2/DEST=tmpsum imgcpy
        MultiThread tmpimg[][][layer-1] = tmpsum[p][q] / layer
        Redimension/N=(numX, numY ,layer-1) imgcpy
    endfor
 
//  print ((StopMSTimer(-2) - timer) / 1000), "ms"
End

 

Here is a MWE. The experiment contains an image and a procedure to integrate it. I've kept the code that I have above and the code from AG.

    nlayers = DimSize(imgsrc,2) 
    switch(how)
        case 1:
            tRN = StartMSTimer
            wave sumimg = integrate_stack(imgsrc,0)
            duplicate/FREE sumimg tmpimg
            Redimension/N=(-1,-1,nlayers) tmpimg            
            for (ic=1;ic<nlayers;ic++)
                wave sumimg = integrate_stack(imgsrc,ic)
                tmpimg[][][ic] = sumimg[p][q]/(ic+1)
            endfor          
            tpLs = StopMSTimer(tRN)/(1000*nlayers)
            break
        case 2:
            tRN = StartMSTimer
            MatrixOP/O/FREE j2d=TransposeVol(imgsrc,2)
            rows=DimSize(imgsrc,0)
            cols=DimSize(imgsrc,1)
            Redimension/N=(nlayers,rows*cols) j2d
            MatrixOP/O/FREE jInt = integrate(j2d,1)
            Redimension/N=(nlayers,rows,cols) jInt
            MatrixOP/FREE tmpimg = TransposeVol(jInt,4)
            tpLs = StopMSTimer(tRN)/(1000*nlayers)          
            break
    endswitch
    print how, " time per layer (s): ", tpLs
•test_integrate(1)
  1   time per layer (s):   613.266
•test_integrate(2)
  2   time per layer (s):   81.841

The resultant integrated image is not saved in this experiment. A test remains to be done for any differences in the results from the two methods.

minimum working example experiment (38.05 MB)

@jjweimer, Make sure in method 1 that you do that assignment using multithread:

Multithread tmpimg[][][ic] = sumimg[p][q]/(ic+1)

It should improve performance quite a bit.

I did some other profiling, and the vast majority of time is spent in that assignment (79%). For method 2, it's actually MatrixOP transposeVol (83%) that dominates the execution time, as opposed to MatrixOP integrate (7%)

Adding MultiThread as suggested by @Ben gives this improvement.

•test_integrate(1)
  1   time per layer (s):   438.3

Using the code below as per @Tony gives this.

        case 3:
            tRN = StartMSTimer
            duplicate/FREE imgsrc tmpimg
            Integrate/DIM=2/P imgsrc /D=tmpimg
            tpLs = StopMSTimer(tRN)/(1000*nlayers)  
            break       
 
•test_integrate(3)
  3   time per layer (s):   152.724

The fastest approach appears to be with AG's suggestion.

I will compare the output images as a next step.

After further testing, I have settled on the code below.

wave tmpimg = integrate_stack(imgsrcr,0)
Redimension/N=(-1,-1,nlayers) tmpimg            
for (ic=1;ic<nlayers;ic++)
    wave sumimg = integrate_stack(imgsrcr,ic)
    MultiThread tmpimg[][][ic] = sumimg[p][q]/(ic+1)
endfor          
 
ThreadSafe Function/WAVE integrate_stack(wave instack, variable nlayer)
 
    MatrixOP/FREE subset = instack[][][0,nlayer]
    MatrixOP/FREE reswave = sumBeams(subset)
    return reswave
end

Starting from an unsigned 8 bit image, the code above returns an unsigned integer 32 bit image. The conversion 8 bit to 32 bit happens in the sumBums call (even with the flag /NPRM on MatrixOP). The code provided by AG that uses TransposeVol requires further steps that extend the time. As is, it returns an 8 bit image, so round-off errors will be more likely. The code provided by Tony using Integrate/DIM=2 is faster at larger stacks but slower on smaller stacks. It returns a floating point 32 bit image.

The code will be used for the stack math->integrate option in the Image Tools package at its next update.

Thanks to all for the suggestions.

I hope I am not too late to the party.

My solution would be to only calculate the sum of, for instance, Layer 1 and 2 once, instead of recalculating it again and again for the sums of each additional layer.

I also used ImageTransform setPlane to write the results back into the wave, which in my experience is much faster than wave indexing using [p][q][r].

The fastest method I tested was using FastOP and splitting the y axis into sections for multithreading, which turned out to be a little tricky.

I haven't carefully tested the code, but all three methods I tested give the same integrated wave, which is promising.

I ran into memory issues so I used a slightly-smaller wave size than stated in the original post.

IntegrateStacks() will run the self-contained example.

Function IntegrateStacks()
//  Igor Pro Forum Topic
 
    //  Creates a test stack
    //  This is about as big as I could make the stacks without running into memory issues
    Make/O/D/N=(1000,256,150) ImageStack
    MultiThread ImageStack[][][]=gnoise(0.1)+r
    
    //  Creates a temporary wave to hold the result
    //  To save memory you could overwrite the original, but I preferred seperate waves for testing
    Duplicate/O/FREE ImageStack, IntegratedImageStack
    
    //  Calculates the number of layers in the stack
    Variable NumberOfLayers=DimSize(ImageStack, 2), CurrentLayer=1, PrevLayer=0
    
 
    //  ------------------------------------------- 
    //  METHOD 1: Combined 530 ms
    //  ------------------------------------------- 
 
    //  Starts a timer
    Variable t=StartMSTimer
    
    //  Creates a temporary wave to hold the increasing sum
    Duplicate/O/FREE/R=[][][0] ImageStack, ImageLayerSum
    
    //  Sums up the layers, one layer at a time
    for (CurrentLayer=1; CurrentLayer<NumberOfLayers; CurrentLayer+=1)
        
        //  Add the next layer to the sum
        PrevLayer=CurrentLayer-1
        MatrixOP/O/S/NTHR=0 ImageLayerSum=IntegratedImageStack[][][PrevLayer]+ImageStack[][][CurrentLayer]      //  460 ms
        
        //  Writes the sum into the results wave
        ImageTransform /P=(CurrentLayer) /D=ImageLayerSum setPlane IntegratedImageStack //  49 ms       Much faster than MultiThread IntegratedImageStack[][][CurrentLayer]=ImageLayerSum[p][q][0]
    endfor
    
    //  Prints how long it took to complete the calculation
    Print("METHOD 1 = "+Num2Str(StopMSTimer(t)/1000)+" ms")
    
    //  Copies the result into a permanent wave
    Duplicate/O IntegratedImageStack, IntegratedImageMethod1
    
    //  Resets the temporary result wave
    FAstOP IntegratedImageStack=ImageStack
 
 
    //  -------------------------------------------
    //  METHOD 2: Combined 300 ms
    //  Difficult to multithread for a single wave, but would probably multithread very well for multiple waves
    //  ------------------------------------------- 
 
    //  Starts a timer
    t=StartMSTimer
    
    //  Creates a temporary wave to hold the increasing sum
    Duplicate/O/FREE/R=[][][0] ImageStack, ImageLayerSum
 
    //  Sums up the layers, one layer at a time
    for (CurrentLayer=1; CurrentLayer<NumberOfLayers; CurrentLayer+=1)
    
        //  Extracts the next layer from the stack
        Duplicate/O/FREE/R=[][][CurrentLayer] ImageStack, ImageLayer1   //  165 ms,      Faster than MatrixOP/O/S ImageLayer1=ImageStack[][][a]
        
        //  Add the layer to the sum
        FastOP ImageLayerSum=ImageLayerSum+ImageLayer1  //  31 ms
        
        //  Writes the sum into the results wave
        ImageTransform /P=(CurrentLayer) /D=ImageLayerSum setPlane IntegratedImageStack //  49 ms       Much faster than MultiThread IntegratedImageStack[][][CurrentLayer]=ImageLayerSum[p][q][0]
    endfor
    
    //  Prints how long it took to complete the calculation
    Print("METHOD 2 = "+Num2Str(StopMSTimer(t)/1000)+" ms")
    
    //  Copies the result into a permanent wave
    Duplicate/O IntegratedImageStack, IntegratedImageMethod2
    
    //  Resets the temporary result wave
    FAstOP IntegratedImageStack=ImageStack
 
 
 
    //  ------------------------------------------- 
    //  METHOD 3: Combined 110 ms
    //  A complicated attempt at multithreading METHOD 2 for a single wave, by splitting the image stack into multiple substacks along the y axis
    //  ------------------------------------------- 
        
    //  Starts a timer
    t=StartMSTimer
 
    //  Finds the number of threads to use
    Variable NumberOfThreads=ThreadProcessorCount
 
    //  Calculates the x and y size of the image stacks
    Variable XSize=DimSize(ImageStack, 0)
    Variable YSize=DimSize(ImageStack, 1)
    
    //  Checks that the image stacks can be redimensioned for multithreading along the y axis without resizing the image stacks, which takes forever for a large matrix
    //  The y size in the forum example was 2048, which I hope is no fluke, because it is divisible by the processor count
    if (mod(YSize, NumberOfThreads)==0)
    
        //  Splits the image stacks along the y axis. The last dimension now represents both the layer and the thread number
        //  I was not able to make ImageTransform setPlane work for a 4 dimensional wave or I would have split the threads and layers into separate dimensions
        Redimension /N=(XSize*YSize*NumberOfLayers) ImageStack, IntegratedImageStack
        Redimension /N=(XSize, YSize/NumberOfThreads, NumberOfThreads*NumberOfLayers) ImageStack, IntegratedImageStack
    
        //  Integrates each substack using METHOD 2
        //  DummyWave is only used for easy multithreading
        Make/O/FREE/N=(NumberOfThreads) DummyWave
        MultiThread DummyWave=ThreadSafeIntegrateLayers(ImageStack, IntegratedImageStack, p, NumberOfThreads, NumberOfLayers)       //  110 ms  (0.03 ms combined for the four redimensions)
    
        //  Restores the original dimensions of the image stacks
        Redimension /N=(XSize*YSize*NumberOfLayers) ImageStack, IntegratedImageStack
        Redimension /N=(XSize, YSize, NumberOfLayers) ImageStack, IntegratedImageStack
    endif
 
    //  Prints how long it took to complete the calculation
    Print("METHOD 3 = "+Num2Str(StopMSTimer(t)/1000)+" ms")
    
    //  Copies the result into a permanent wave
    Duplicate/O IntegratedImageStack, IntegratedImageMethod3
    
    //  Resets the temporary result wave
    FastOP IntegratedImageStack=ImageStack
 
    
    
 
    //  Compares the results of the three methods
    Edit IntegratedImageMethod1
    Edit IntegratedImageMethod2
    Edit IntegratedImageMethod3
end
 
 
ThreadSafe Function ThreadSafeIntegrateLayers(ImageStack, IntegratedImageStack, ThreadNumber, NumberOfThreads, NumberOfLayers)
//  Using METHOD 2
Wave ImageStack, IntegratedImageStack
Variable ThreadNumber, NumberOfThreads, NumberOfLayers
 
    //  Creates a temporary sublayer wave wave to hold the increasing sum
    Duplicate/O/R=[][][ThreadNumber] ImageStack, ImageLayerSum
    
    //  Sums up the sublayers, one layer at a time
    Variable CurrentLayer=1
    for (CurrentLayer=1; CurrentLayer<NumberOfLayers; CurrentLayer+=1)
    
        //  Extracts the next sublayer from the stack
        //  The 3rd dimension is a little complicated becomes it contains both thread and layer information
        Duplicate/O/R=[][][ThreadNumber+CurrentLayer*NumberOfThreads] ImageStack, ImageLayer1
        
        //  Add the sublayer to the sum
        FastOP ImageLayerSum=ImageLayerSum+ImageLayer1
        
        //  Writes the sum into the results wave
        //  I "THINK" multiple threads of ImageTransform writing to the same single result wave is threadsafe as long as they are writing to different sections of the wave
        //  I was not able to make ImageTransform setPlane work for a 4 dimensional wave or I would have split the threads and layers into separate dimensions
        ImageTransform /P=(ThreadNumber+CurrentLayer*NumberOfThreads) /D=ImageLayerSum setPlane IntegratedImageStack
    endfor
end

 

> I hope I am not too late to the party.

Certainly not. Especially when you bring such interesting additions to the discussions.

Two points of reference should be considered. One is that the final result has to be divided by the number of layers to bring the intensity back to the original scale. The other is that the images should probably be converted/promoted in bit depth just before the division to avoid truncation round off errors that will cause gaps in the final histogram.

I believe the approach in your Method 2 (e.g. with duplicate rather than MatrixOP and with adding to the sum at each loop step) offers a significant improvement worth using. I'm working on a few other tweaks and will post a revised experiment at some time later.

Thanks!

I am now starting with an 8-bit unsigned integer wave

Interestingly, I found MatrixOP to convert the wave into 16-bit floating point much faster than Redimension could do floating point or 32-bit integer. This is the most time-consuming step of the whole operation, and, unfortunately. multithreading doesn't seem to do anything here.

I added the normalization by the number of layers

I made the redimension of the wave for multithreading a little easier to understand, by combining x and y in dimension one, the thread number / substack index in dimension two and the layers in dimension three.

I also made the function use the single-threaded version, when the multithreaded version is not possible because the number of data points are nor divisible by the processor count.

Use RunMe() to run the example. I hope I didn't break anything. I didn't have time to test it much.

Function RunMe()
//  Igor Pro Forum Topic
 
    //  Creates a test stack
    //  This is about as big as I could make the stacks without running into memory issues
    Make/O/B/U/N=(1000,1024,150) ImageStack
    MultiThread ImageStack[][][]=r
 
    //  Starts a timer
    Variable t=StartMSTimer
 
    //  Converts the image stack to 32-bit floating point
    MatrixOP/O/NTHR=0 IntegratedImageStack=ImageStack/1 //  105 ms
//  MatrixOP/O/NTHR=1 IntegratedImageStack=ImageStack/1 //  110 ms  
 
//  Duplicate/O ImageStack, IntegratedImageStack
//  Redimension/D IntegratedImageStack  //  548 ms
//  Redimension/S IntegratedImageStack  //  700 ms
//  Redimension/W/U IntegratedImageStack    //  700 ms
 
    //  Integrates IntegratedImageStack and overwites the wave with the result
    IntegrateStacks(IntegratedImageStack)       //  50 ms
     
    //  Prints how long it took to complete the calculation
    Print(Num2Str(StopMSTimer(t)/1000)+" ms")
     
     // Views the result
     //Edit IntegratedImageStack
end
 
 
 
Function IntegrateStacks(ImageStack)
//  Igor Pro Forum Topic
//  Integrates ImageStack and stores the result in IntegratedImageStack
Wave ImageStack
 
    //  Calculates the x and y size of the images and the total number of layers
    Variable XSize=DimSize(ImageStack, 0)
    Variable YSize=DimSize(ImageStack, 1)
    Variable NumberOfLayers=DimSize(ImageStack, 2)
 
    //  Finds the number of threads to potentially use
    Variable NumberOfThreads=ThreadProcessorCount
 
    //  ------------------------------------------- 
    //  If the number of points per layer is divisible by the processor count,
    //  the data points in each layer are divided up into a number of sublayers
    //  matching the processor count for multithreading
    //  (Combined 50 ms)
    //  ------------------------------------------- 
    if (mod(XSize*YSize, NumberOfThreads)==0)
    
        //  Splits the data points of each layer into sublayers that can be handled sepüarately by the multiple threads
        //  and written to with ImageTransform setCol
        //  Redimension is very fast as long as the total number of data points remains unchanged
        Redimension /N=(XSize*YSize*NumberOfLayers) ImageStack      //  0.01 ms
        Redimension /N=((XSize*YSize)/NumberOfThreads, NumberOfThreads, NumberOfLayers) ImageStack      //  0.01 ms
    
        //  Integrates each substack by multithreading
        //  DummyWave is only used for easier multithreading
        Make/O/FREE/N=(NumberOfThreads) DummyWave
        MultiThread DummyWave=ThreadSafeIntegrateLayers(ImageStack, p)      //  50 ms
    
        //  Restores the original dimensions of the image stacks
        Redimension /N=(XSize*YSize*NumberOfLayers) ImageStack      //  0.01 ms
        Redimension /N=(XSize, YSize, NumberOfLayers) ImageStack        //  0.01 ms
 
    //  ------------------------------------------- 
    //  If the number of points per layer is NOT divisible by the processor count,
    //  a single-threaded calculation is used instead
    //  (Combined 140 ms)
    //  ------------------------------------------- 
    else
    
        //  Creates temporary waves to hold the the increasing total sum and the normalized sum
        Duplicate/O/FREE/R=[][][0] ImageStack, ImageLayerTotalSum, ImageLayerNormSum
 
        //  Sums up the layers, one layer at a time
        Variable CurrentLayer=1
        for (CurrentLayer=1; CurrentLayer<NumberOfLayers; CurrentLayer+=1)
    
            //  Extracts the next layer from the stack
            Duplicate/O/FREE/R=[][][CurrentLayer] ImageStack, NextLayer
        
            //  Adds the next layer to the total sum
            FastOP ImageLayerTotalSum=ImageLayerTotalSum+NextLayer
            
            //  Normalizes the total sum
            FastOP ImageLayerNormSum=(1/(CurrentLayer+1))*ImageLayerTotalSum
        
            //  Writes the normalized sum back into the integrated image stack wave
            ImageTransform /P=(CurrentLayer) /D=ImageLayerNormSum setPlane ImageStack
        endfor
    endif
end
 
 
 
ThreadSafe Function ThreadSafeIntegrateLayers(ImageStack, ThreadNumber)
//  Integrates one sublayer
Wave ImageStack
Variable ThreadNumber
 
    //  Extracts the number of layers in the image stack
    Variable NumberOfLayers=DimSize(ImageStack, 2)
 
    //  Creates temporary waves to hold the the increasing total sum and the normalized sum
    Duplicate/O/FREE/R=[][ThreadNumber][0] ImageStack, NextLayer, ImageLayerTotalSum, ImageLayerNormSum
 
    //  Sums up the sublayers, one layer at a time
    Variable CurrentLayer=1
    for (CurrentLayer=1; CurrentLayer<NumberOfLayers; CurrentLayer+=1)
 
        //  Extracts the next sublayer from the stack
        Duplicate/O/FREE/R=[][ThreadNumber][CurrentLayer] ImageStack, NextLayer
        
        //  Adds the next sublayer to the total sum
        FastOP ImageLayerTotalSum=ImageLayerTotalSum+NextLayer
            
        //  Normalizes the total sum
        FastOP ImageLayerNormSum=(1/(CurrentLayer+1))*ImageLayerTotalSum
        
        //  Writes the normalized sum back into the correct position of integrated image stack wave
        //  I "THINK" multiple threads of ImageTransform writing to the same single result wave is threadsafe as long as they are writing to different sections of the wave
        ImageTransform /G=(ThreadNumber) /P=(CurrentLayer) /D=ImageLayerNormSum putCol ImageStack
    endfor
end

 

I can also make MatrixOP convert the stacks to 32-bit unsigned integer (/I/U) faster than redimension by using:

Make/O/I/U/FREE/N=(1) FormatWave=0
MatrixOP/O/NTHR=0 IntegratedImageStack=ImageStack+FormatWave[0]

but I cannot make it do a 16-bit integer (/W/U). The following still creates a 32-bit integer:

Make/O/W/U/FREE/N=(1) FormatWave=0
MatrixOP/O/NTHR=0 IntegratedImageStack=ImageStack+FormatWave[0]

 

Interestingly ImageTransform getPlane (14 ms) can extract a plane from the image stack much faster than Duplicate (75 ms).

//  Extracts the next layer from the stack
//  75 ms
//Duplicate/O/FREE/R=[][][CurrentLayer] ImageStack, NextLayer
    
//  Extracts the next layer from the stack
//  14 ms
ImageTransform /P=(CurrentLayer) /PTYP=0 getPlane ImageStack
Wave NextLayer=M_ImagePlane

But the same is not true for the multithreaded calculation where a column is extracted. There ImageTransform getCol (11 ms) and Duplicate (12 ms) perform the same.

//  Extracts the next sublayer from the stack
//  12 ms
//Duplicate/O/FREE/R=[][ThreadNumber][CurrentLayer] ImageStack, NextLayer
        
//  Extracts the next sublayer from the stack
//  11 ms
ImageTransform /G=(ThreadNumber) /P=(CurrentLayer) getCol ImageStack
Wave NextLayer=W_ExtractedCol

 

In general, without further convolutions, the two cases below are the fastest.

case 4:     // duplicate + multithread
    tRN = StartMSTimer
    MatrixOP/FREE tmpimg = fp32(src)
    duplicate/FREE/R=[][][0] tmpimg sumimg
    for (ic=1;ic<nlayers;ic++)
        duplicate/O/FREE/R=[][][ic] tmpimg currlayer
        MultiThread sumimg += currlayer/(ic+1)
        ImageTransform/P=(ic)/D=sumimg setPlane tmpimg
    endfor
    tpLs = StopMSTimer(tRN)/(1000*nlayers)
    break   
case 5:     // imagetransform + fastop
    tRN = StartMSTimer
    MatrixOP/FREE tmpimg = fp32(src)
    duplicate/FREE/R=[][][0] tmpimg sumimg
    for (ic=1;ic<nlayers;ic++)
        ImageTransform/P=(ic)/PTYP=0 getPlane tmpimg
        wave currlayer = M_ImagePlane
        FastOP sumimg = sumimg + (1/(ic+1))*currlayer
        ImageTransform/P=(ic)/D=sumimg setPlane tmpimg
    endfor
    killwaves/Z M_ImagePlane
    tpLs = StopMSTimer(tRN)/(1000*nlayers)
    break

I convert to fp32 at the outset. Case 5 is about 2.5 - 3 times faster per layer than case 4. Both approaches give a result that is fp32. Case 5 appears to be at least 20 times faster than my original method.

I attach an experiment with an image stack for anyone interested in exploring further. I'm rather satisfied with using the approach in case 5 to carry forward.

Thanks everyone for the suggestions. The insights about MatrixOP versus ImageTransform and FastOP versus MultiThread will certainly improve my coding practices.

minimum working example experiment (50.32 MB)

I don't think this will give the correct total sum

FastOP sumimg = sumimg + (1/(ic+1))*currlayer

You need two waves: one to hold the total sum and another where you normalize the total sum.

Your version gives: (Layer0) / 1 + (Layer1) / 2 + (Layer2) / 3, instead of (Layer 0 + Layer1 + Layer2) / 3

But maybe I am misunderstanding the normalization you are trying to do.

Ooops! Thanks @olelytken. Good to have the extra eyes on it. Here is the correct code for case 4 and case 5. Case 5 is still 3x faster than case 4 but now only about 8x faster than my original.

case 4:     // duplicate + multithread
    tRN = StartMSTimer
    MatrixOP/FREE tmpimg = fp32(src)
    duplicate/FREE/R=[][][0] tmpimg sumimg, normimg
    for (ic=1;ic<nlayers;ic++)
        duplicate/O/FREE/R=[][][ic] tmpimg currlayer
        MultiThread sumimg += currlayer
        MultiThread normimg = sumimg/(ic+1)
        ImageTransform/P=(ic)/D=normimg setPlane tmpimg
    endfor
    tpLs = StopMSTimer(tRN)/(1000*nlayers)
    break   
case 5:     // imagetransform + fastop
    tRN = StartMSTimer
    MatrixOP/FREE tmpimg = fp32(src)
    duplicate/FREE/R=[][][0] tmpimg sumimg, normimg
    for (ic=1;ic<nlayers;ic++)
        ImageTransform/P=(ic)/PTYP=0 getPlane tmpimg
        wave currlayer = M_ImagePlane
        FastOP sumimg = sumimg + currlayer
        FastOP normimg = (1/(ic + 1))*sumimg
        ImageTransform/P=(ic)/D=normimg setPlane tmpimg
    endfor
    killwaves/Z M_ImagePlane
    tpLs = StopMSTimer(tRN)/(1000*nlayers)
    break