
Building a faster way to integrate an image stack
Suppose that I have an image stack imgsrc. I want to create an image stack where each plane N is the sum of planes 0 through N of imgsrc, i.e. a running (cumulative) integral of the stack.
I have played around to create the code below. It works, but it is tediously slow on a large image stack (2448 x 2048 x 154). The time sink seems to be the explicit for-loop construction.
// snippet of interest
nlayers = DimSize(imgsrc,2)
wave sumimg = integrate_stack(imgsrc,0)
duplicate/FREE sumimg tmpimg
Redimension/N=(-1,-1,nlayers) tmpimg
for (ic=1;ic<nlayers;ic+=1)
	wave sumimg = integrate_stack(imgsrc,ic)
	tmpimg[][][ic] = sumimg[p][q]/(ic+1)
endfor

ThreadSafe Static Function/WAVE integrate_stack(wave instack, variable nlayer)
	MatrixOP/FREE subset = instack[][][0,nlayer]
	MatrixOP/FREE/NTHR=0 reswave = sumBeams(subset)
	return reswave
end
I have to wonder whether I am missing a faster method to build the integral summation, primarily one that removes the for-loop.
I'd be glad for insights *in IP9 please*.
I am not sure of the meaning of your second sentence. Are you trying to get a single image from a 3-D wave consisting of a stack of 2D images?
If so, MatrixOP can return a 2D image from the sum of all beams. Perhaps this may work for you. (IP 9.05)
See DisplayHelpTopic "MatrixOP":
sumBeams(w): Returns an (n x m) matrix containing the sum over all layers of all the beams of the 3D wave w.
A beam is a 1D array in the Z-direction.
sumBeams is a non-layered function which requires that w be a proper 3D wave and not the result of another expression.
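A minimal sketch of such a call (the function and wave names here are illustrative, not from the original post):

```igor
// collapse a 3D stack into a single 2D image by summing all layers
Function/WAVE SumAllLayers(wave stack3d)
	// pass the 3D wave directly, since sumBeams requires a proper
	// 3D wave rather than the result of another expression
	MatrixOP/FREE result = sumBeams(stack3d)
	return result
End
```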
September 28, 2025 at 06:43 am - Permalink
I start with a stack of N layers (imgsrc, a 3D wave). I want to create a stack of N layers (a 3D wave) where each layer j is the result of running sumBeams(imgsrc) over layers 0 to j.
The function integrate_stack(...) returns a 2D wave that is the result of running sumBeams(imgsrc) on layers 0 to j. The for-loop assembles the 3D wave in tmpimg. I wonder if the explicit for-loop can be collapsed into a (faster) implicit loop.
September 28, 2025 at 02:06 pm - Permalink
My suggestion might be worth considering for varying numbers of layers if you use a loop that sequentially removes images from the stack. The simple delete-layer operation would be DeletePoints with /M=2:
DeletePoints [/M=dim] startElement, numElements, waveName [, waveName ]...
Another faster deletion alternative (from the MatrixOP help file) might be:
" You can also extract layers from a 3D wave starting with layer a and increasing the layer number up to layer b using increments c:
MatrixOp destWave = wave3d[][][a,b,c] "
After each layer deletion, the beam summation should be faster. Here is the sumBeams() formula omitted from my first reply.
I have found the "Profiling Igor Procedures" method to be very useful in probing slow functions.
September 29, 2025 at 04:27 am - Permalink
I first tried to reformulate the problem to make it simpler. The following code is slightly faster. Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from. Maybe someone else has a better idea.
September 29, 2025 at 04:53 am - Permalink
Thanks. I will consider this approach.
> The following code is slightly faster.
Perhaps an additional savings is realized using duplicate/FREE imgsrc, tmpimg and running the for-loop from ic=1 rather than ic=0?
> Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from.
The difference between the two cases may be due to differences in where various floating point conversions are used.
September 29, 2025 at 07:17 am - Permalink
Here are the test results.
loop one: 440767
loop two: 3.92332e+06
The first method is about 10x faster overall, clocking in at about 441 ms per layer on average. To test further, I tried ic = 1 and ic = 153, both hard-coded and as inputs.
Two lessons. First, the significant overhead seems to be in running the for-loop itself; the work inside the loop body appears to be an order of magnitude faster. Second, integrating a stack of just 2 layers is about 5x faster than integrating a stack of 153 layers.
September 29, 2025 at 01:41 pm - Permalink
A few quick comments:
First, the variations in the results are likely due to floating point issues when summing many numbers. IP10 has more accurate summing implemented (especially in MatrixOP).
The difficulty with the task is that MatrixOP only implements full beam sums (i.e., over all layers). This suggests a few alternative approaches, which I have not tested:
1. Run a loop over the layers that executes something like:
2. If the duplication is too slow, you can compute the integral in reverse, i.e., start with the full N layers and then on each iteration remove a layer using Redimension and compute sumBeams() for the remaining layers.
3. Create a 1D wave of N points and set it to 1. Then run N reverse iterations:
4. Suppose your imgsrc is an R x C x N wave. You can use a combination of MatrixOP transposeVol and Redimension to convert imgsrc into a 2D wave of dimensions N x (R*C). Then follow this with MatrixOP Integrate(,1) and then revert the 2D wave back to 3D. This would look like (note, it's untested):
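A rough sketch of option 4, untested as noted; the transposeVol mode and the Integrate dimension argument are placeholders that would need checking against the MatrixOP help:

```igor
// cumulative layer sum via a 2D running integral (untested sketch)
Function/WAVE IntegrateStackSketch(wave imgsrc)
	variable rr = DimSize(imgsrc,0), cc = DimSize(imgsrc,1), nn = DimSize(imgsrc,2)
	// reorder so the layer index varies along one 2D dimension (mode is a placeholder)
	MatrixOP/FREE vol = transposeVol(imgsrc,4)
	Redimension/N=(nn, rr*cc) vol			// now 2D: N x (R*C)
	MatrixOP/FREE vol = Integrate(vol,1)	// running sum over the layer index (dim flag is a placeholder)
	Redimension/N=(nn, rr, cc) vol			// back to 3D
	MatrixOP/FREE outstack = transposeVol(vol,4)	// restore the original dimension order
	return outstack
End
```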
September 29, 2025 at 03:26 pm - Permalink
Placing example data might help with developing faster solutions, as people can try out code on the example.
I wonder if experimenting with `sumdimension` might help here? `sumdimension` sums values in an N-dimensional srcwave along a specified dimension, producing an outwave with N-1 dimensions. The operation is also threadsafe. From the sounds of it, it does the same operation as `matrixop` + `sumbeams`.
For a single calculation you'd expect `matrixop` to be faster, but with a parallelised for loop `sumdimension` may turn out faster.
Remember the possibility of contention in parallelisation. If an inner function uses a parallelised `matrixop` with `nthr=0` and you also try to parallelise an outer function (e.g. a multithreaded for loop) that calls it, then you'll likely end up with the processor thrashing around.
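For reference, a sketch of what the `SumDimension` variant of the inner step might look like (the function name and wave names here are illustrative):

```igor
// hypothetical inner step: sum layers 0..ic of a stack into one 2D image
Function/WAVE SumLayersUpTo(wave imgsrc, variable ic)
	Duplicate/FREE/R=[][][0,ic] imgsrc, subset	// keep layers 0..ic only
	SumDimension/D=2/DEST=sumimg subset			// sum along the layer (Z) dimension
	return sumimg
End
```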
September 29, 2025 at 03:30 pm - Permalink
AG ...
> 1. Run a loop over the layers that executes something like:
I already pull out a sub stack with only the planes of interest.
> 2. If the duplication is too slow, ... and 3. ...
I'll try these.
I've played with variations on Concatenate versus insertZPlane, though not yet with ImageTransform, as the step after generating the integrated 2D image. I don't know whether that change just tweaks around the edges compared with removing the outer for-loop. In this regard, I think the third option might be the most interesting to explore, since (if I understand correctly) it would remove the for-loop over layers.
Andy ...
I will consider extracting an MWE (minimum working example) with the requisite data into an experiment. I may well do this, especially since one example using MatrixOP ended up crashing my system.
I appreciate the note to avoid an approach that causes the processor to thrash around. I will explore removing the NTHR=0 flag on the MatrixOP step in the ThreadSafe function.
September 29, 2025 at 04:52 pm - Permalink
Out of curiosity, I tried sumdimension, and it is about 4.5x faster with my limited data. I suspect it scales just as badly for JJ's data, though, as seen above with my last try.
September 29, 2025 at 08:01 pm - Permalink
Here is an MWE. The experiment contains an image and a procedure to integrate it. I've kept both the code that I posted above and the code from AG.
The resultant integrated image is not saved in this experiment. Testing for any differences in the results from the two methods remains to be done.
September 30, 2025 at 05:18 pm - Permalink
Have you tried Integrate/DIM=2?
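If I understand the goal, that would be roughly this sketch. Note that Integrate respects dimension scaling, and its /METH flag selects the integration method (rectangular versus trapezoidal), so both are worth checking against an exact running sum:

```igor
// running integral along the layer dimension, computed in place
Duplicate/FREE imgsrc, tmpimg
Integrate/DIM=2 tmpimg	// check /METH and the Z scaling if an exact running sum is wanted
```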
October 1, 2025 at 03:40 am - Permalink
@jjweimer, Make sure in method 1 that you do that assignment using multithread:
Multithread tmpimg[][][ic] = sumimg[p][q]/(ic+1)
It should improve performance quite a bit.
I did some other profiling, and the vast majority of the time is spent in that assignment (79%). For method 2, it's actually MatrixOP transposeVol (83%) that dominates the execution time, as opposed to MatrixOP Integrate (7%).
October 1, 2025 at 11:37 am - Permalink
Adding MultiThread as suggested by @Ben gives this improvement.
Using the code below as per @Tony gives this.
The fastest approach appears to be with AG's suggestion.
I will compare the output images as a next step.
October 1, 2025 at 12:24 pm - Permalink
After further testing, I have settled on the code below.
Starting from an unsigned 8-bit image, the code above returns an unsigned 32-bit integer image. The conversion from 8 bit to 32 bit happens in the sumBeams call (even with the /NPRM flag on MatrixOP). The code provided by AG that uses transposeVol requires further steps that extend the time; as is, it returns an 8-bit image, so round-off errors will be more likely. The code provided by Tony using Integrate/DIM=2 is faster on larger stacks but slower on smaller stacks; it returns a 32-bit floating-point image.
The code will be used for the stack math->integrate option in the Image Tools package at its next update.
Thanks to all for the suggestions.
October 5, 2025 at 06:43 pm - Permalink