Building a faster way to integrate an image stack
Suppose that I have an image stack imgsrc. I want to create an image stack where each plane N is the sum of planes 0 through N from imgsrc, i.e. a running (integral) sum of the stack.
I have played around and created the code below. It works, but it is tediously slow on a large image stack (2448 x 2048 x 154). The time sink seems to be the explicit for-loop construction.
// snippet of interest
variable nlayers = DimSize(imgsrc,2)
variable ic

wave sumimg = integrate_stack(imgsrc,0)
duplicate/FREE sumimg, tmpimg
Redimension/N=(-1,-1,nlayers) tmpimg
for (ic=1;ic<nlayers;ic+=1)
	wave sumimg = integrate_stack(imgsrc,ic)
	tmpimg[][][ic] = sumimg[p][q]/(ic+1)
endfor

ThreadSafe Static Function/WAVE integrate_stack(wave instack, variable nlayer)

	MatrixOP/FREE subset = instack[][][0,nlayer]
	MatrixOP/FREE/NTHR=0 reswave = sumBeams(subset)

	return reswave
end
I have to wonder whether I am missing a faster method to build the integral summation, primarily to remove the for loop.
I'd be glad for insights *in IP9 please*.
I am not sure of the meaning of your second sentence. Are you trying to get a single image from a 3-D wave consisting of a stack of 2D images?
If so, MatrixOP can return a 2D image from the sum of all beams. Perhaps this may work for you. (IP 9.05)
See DisplayHelpTopic "MatrixOP":
sumBeams(w) — Returns an (n x m) matrix containing the sum over all layers of all the beams of the 3D wave w.
A beam is a 1D array in the Z-direction.
sumBeams is a non-layered function which requires that w be a proper 3D wave and not the result of another expression.
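Used directly on the stack, that is a one-liner (imgsrc here stands in for your 3D wave):

MatrixOP total = sumBeams(imgsrc)	// single 2D image holding the sum of all layers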
September 28, 2025 at 06:43 am - Permalink
I start with a stack of N layers (imgsrc - 3D wave). I want to create a stack of N layers (3D wave) where each layer j is the result of running sumBeams(imgsrc) for the layers 0 to j.
The function integrate_stack(...) returns a 2D wave that is the result of running sumBeams(imgsrc) on layers 0 to j. The for-loop builds the 3D wave into tmpimg. I wonder if the explicit for-loop can be collapsed to a (faster) implicit loop.
September 28, 2025 at 02:06 pm - Permalink
My suggestion might be worth considering for varying numbers of layers if you use a loop that sequentially removes images from the stack. The simple delete-layer operation would be DeletePoints with /M=2:
DeletePoints [/M=dim] startElement, numElements, waveName [, waveName]...
Another faster deletion alternative (from the MatrixOP help file) might be:
" You can also extract layers from a 3D wave starting with layer a and increasing the layer number up to layer b using increments c:
MatrixOp destWave = wave3d[][][a,b,c] "
After each layer deletion, the beam summation should be faster. Here is the sumBeams() formula omitted from my first reply:
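That formula did not come through in this capture; presumably it is the MatrixOP sumBeams() call quoted above. As a hedged sketch (not the poster's code) of the delete-and-re-sum loop being described, with illustrative wave names:

// work backwards: sum all remaining layers, store the result, then delete the top layer
Function IntegrateByDeletion(wave imgsrc)
	variable nlayers = DimSize(imgsrc, 2)
	variable ic

	Duplicate/FREE imgsrc, intimg	// output stack (same dimensions as imgsrc)
	Redimension/S intimg	// promote so the per-layer averages are not truncated
	Duplicate/FREE imgsrc, work	// working copy that shrinks by one layer per pass
	for (ic = nlayers - 1; ic >= 0; ic -= 1)
		MatrixOP/FREE sumimg = sumBeams(work)	// sum of layers 0..ic
		MultiThread intimg[][][ic] = sumimg[p][q] / (ic + 1)	// divide to match the normalization in the original post
		DeletePoints/M=2 ic, 1, work	// drop the last remaining layer
	endfor
End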
I have found the "Profiling Igor Procedures" method to be very useful in probing slow functions.
September 29, 2025 at 04:27 am - Permalink
I first tried to reformulate the problem to make it simpler. The following code is slightly faster. Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from. Maybe someone else has a better idea.
September 29, 2025 at 04:53 am - Permalink
Thanks. I will consider this approach.
> The following code is slightly faster.
Perhaps an additional savings is realized using duplicate/FREE imgsrc, tmpimg and running the for-loop from ic=1 rather than ic=0?
> Unfortunately, the output is also not exactly the same (there is a slight discrepancy on the order of 10^-8), but I couldn't find where this comes from.
The difference between the two cases may be due to differences in where various floating point conversions are used.
September 29, 2025 at 07:17 am - Permalink
Here are the test results.
loop one: 440767
loop two: 3.92332e+06
The first method is about 10x faster overall. It clocks in at about 441 s per layer on average. To test further, I did ic = 1 and ic = 153, both as hard-coded values and as inputs.
Two lessons. First, the significant overhead seems to be in operating the for-loop itself; the internals of the for-loop appear to be an order of magnitude faster. Second, integrating the stack with just 2 layers is about 5x faster than integrating the stack with all 153 layers.
September 29, 2025 at 01:41 pm - Permalink
A few quick comments:
First, the variations in the results are likely due to floating point issues when summing many numbers. IP10 has more accurate summing implemented (especially in MatrixOP).
The difficulty with the task is that MatrixOP only implements full beam sums (i.e., all layers). This suggests a few alternative approaches, which I have not tested:
1. Run a loop over the layers that executes something like:
2. If the duplication is too slow, you can compute the integral in reverse, i.e., start with the full N layers and then on each iteration remove a layer using Redimension and compute sumBeams() for the remaining layers (see the sketch after this list).
3. Create a 1D wave of N points and set it to 1. Then run N reverse iterations:
4. Suppose your imgsrc is an RxCxN wave. You can use a combination of MatrixOP transposeVol and Redimension to convert imgsrc into a 2D wave that contains Nx(R*C). Then follow this with MatrixOp Integrate(,1) and then revert the 2D wave back to 3D. This would look like (note, it's untested):
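The code blocks for these approaches were not captured in this thread. As a hedged, untested sketch of approach 2 only (wave names are illustrative, and intimg is assumed to be a pre-made RxCxN output wave), shrinking a working copy with Redimension and re-summing might look like:

Duplicate/FREE imgsrc, work
for (ic = nlayers - 1; ic >= 0; ic -= 1)
	Redimension/N=(-1, -1, ic + 1) work	// keep only layers 0..ic
	MatrixOP/FREE sumimg = sumBeams(work)	// sum of the remaining layers
	MultiThread intimg[][][ic] = sumimg[p][q]	// store the partial sum as layer ic
endfor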
September 29, 2025 at 03:26 pm - Permalink
Placing example data might help with developing faster solutions, as people can try out code on the example.
I wonder if experimenting with `sumdimension` might help here? `sumdimension` sums values in an N-dimensional srcwave along a specified dimension, producing an outwave with N-1 dimensions. The operation is also threadsafe. From the sounds of it, it does the same operation as matrixop+sumbeams.
On a single calculation you'd expect matrixop to be faster, but if you have a parallelised for loop with sumdimension it may turn out faster.
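A minimal single-threaded sketch of that substitution, assuming imgsrc is the 3D stack (a parallelised outer loop would additionally need a ThreadSafe worker):

Function IntegrateViaSumDimension(wave imgsrc)
	variable nlayers = DimSize(imgsrc, 2)
	variable ic

	Duplicate/FREE imgsrc, intimg	// output stack
	Redimension/S intimg	// promote so the per-layer averages are not truncated
	for (ic = 0; ic < nlayers; ic += 1)
		Duplicate/FREE/R=[][][0, ic] imgsrc, subset	// layers 0..ic
		SumDimension/D=2/DEST=sumimg subset	// collapse the layer dimension
		MultiThread intimg[][][ic] = sumimg[p][q] / (ic + 1)
	endfor
	KillWaves/Z sumimg	// /DEST creates a real wave in the current data folder
End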
Remember the possibility of contention in parallelisation. If an inner function uses a parallelised `matrixop` with `nthr=0` and you also try to parallelise an outer function (e.g. a multithreaded for loop) that contains the inner function, then you'll likely end up with the processor thrashing around.
September 29, 2025 at 03:30 pm - Permalink
AG ...
> 1. Run a loop over the layers that executes something like:
I already pull out a sub stack with only the planes of interest.
> 2. If the duplication is too slow, ... and 3. ...
I'll try these.
I've played with variations on Concatenate versus insertZPlane, though not yet with ImageTransform, as the step after generating the integrated 2D image. I don't know whether this change is just tweaking around the edges rather than removing the outer for-loop. In this regard, I think the third option might be the most interesting to explore, since (if I understand correctly) it would remove the for-loop over layers.
Andy ...
I will consider extracting a MWE (minimum working example) with requisite data in an experiment. I might do this especially since I did end up with one example using MatrixOP that crashed my system.
I appreciate the note to avoid an approach that causes the processor to thrash around. I will explore removing the NTHR=0 flag on the MatrixOP step in the ThreadSafe function.
September 29, 2025 at 04:52 pm - Permalink
Out of curiosity, I tried sumdimension, and it is about 4.5x faster with my limited data. I guess this scales badly for JJ's data as well, though, as seen above with my last try.
September 29, 2025 at 08:01 pm - Permalink
Here is a MWE. The experiment contains an image and a procedure to integrate it. I've kept the code that I have above and the code from AG.
The resultant integrated image is not saved in this experiment. A test remains to be done for any differences in the results from the two methods.
September 30, 2025 at 05:18 pm - Permalink
Have you tried Integrate/DIM=2?
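A hedged sketch of that idea (untested; imgsrc is an illustrative name, and the /METH convention — in particular whether the rectangular sum includes the current layer — should be checked against the Integrate documentation):

Duplicate/FREE imgsrc, intimg
Redimension/S intimg	// promote to 32-bit float before summing
Integrate/METH=1/DIM=2 intimg	// rectangular integration = running sum when the layer delta is 1
// divide each layer by its layer count afterwards if a running average is wanted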
October 1, 2025 at 03:40 am - Permalink
@jjweimer, make sure in method 1 that you do that assignment using MultiThread:
MultiThread tmpimg[][][ic] = sumimg[p][q]/(ic+1)
It should improve performance quite a bit.
I did some other profiling, and the vast majority of time is spent in that assignment (79%). For method 2, it's actually MatrixOP transposeVol (83%) that dominates the execution time, as opposed to MatrixOP integrate (7%)
October 1, 2025 at 11:37 am - Permalink
Adding MultiThread as suggested by @Ben gives this improvement.
Using the code below as per @Tony gives this.
The fastest approach appears to be with AG's suggestion.
I will compare the output images as a next step.
October 1, 2025 at 12:24 pm - Permalink
After further testing, I have settled on the code below.
Starting from an unsigned 8-bit image, the code above returns an unsigned 32-bit integer image. The conversion from 8-bit to 32-bit happens in the sumBeams call (even with the /NPRM flag on MatrixOP). The code provided by AG that uses TransposeVol requires further steps that extend the time. As is, it returns an 8-bit image, so round-off errors will be more likely. The code provided by Tony using Integrate/DIM=2 is faster on larger stacks but slower on smaller stacks. It returns a 32-bit floating point image.
The code will be used for the stack math->integrate option in the Image Tools package at its next update.
Thanks to all for the suggestions.
October 5, 2025 at 06:43 pm - Permalink
I hope I am not too late to the party.
My solution would be to calculate the sum of, for instance, layers 1 and 2 only once, instead of recalculating it again and again for the sum of each additional layer.
I also used ImageTransform setPlane to write the results back into the wave, which in my experience is much faster than wave indexing using [p][q][r].
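The attached experiment is not reproduced in this thread; as a hedged sketch of the running-sum plus setPlane idea (wave names are illustrative, and the /P and /D flag usage for setPlane should be checked against DisplayHelpTopic "ImageTransform"):

Function IntegrateRunningSum(wave imgsrc)
	variable nlayers = DimSize(imgsrc, 2)
	variable ic

	Duplicate/FREE imgsrc, intimg	// output stack; layer 0 already equals layer 0 of imgsrc
	Redimension/S intimg	// promote so stored sums are not truncated
	ImageTransform/P=0 getPlane imgsrc	// layer 0 -> M_ImagePlane
	wave plane = M_ImagePlane
	Duplicate/FREE plane, runsum
	Redimension/D runsum	// accumulate in double precision

	for (ic = 1; ic < nlayers; ic += 1)
		ImageTransform/P=(ic) getPlane imgsrc	// current layer -> M_ImagePlane
		wave plane = M_ImagePlane
		runsum += plane	// add the new layer once; earlier layers are not re-summed
		ImageTransform/P=(ic)/D=runsum setPlane intimg	// store the running sum as layer ic
	endfor
	KillWaves/Z plane	// clean up the M_ImagePlane output wave
End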
The fastest method I tested was using FastOP and splitting the y axis into sections for multithreading, which turned out to be a little tricky.
I haven't carefully tested the code, but all three methods I tested give the same integrated wave, which is promising.
I ran into memory issues, so I used a slightly smaller wave size than stated in the original post.
IntegrateStacks() will run the self-contained example.
October 7, 2025 at 05:37 am - Permalink
> I hope I am not too late to the party.
Certainly not. Especially when you bring such interesting additions to the discussions.
Two points of reference should be considered. One is that the final result has to be divided by the number of layers to bring the intensity back to the original scale. The other is that the images should probably be converted/promoted in bit depth just before the division, to avoid truncation/round-off errors that would cause gaps in the final histogram.
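For instance (a minimal sketch, assuming intimg holds the un-normalized running sums):

Redimension/S intimg	// promote to 32-bit float before normalizing
MultiThread intimg[][][] = intimg[p][q][r] / (r + 1)	// layer r is the sum of r+1 layers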
I believe the approach in your Method 2 (e.g. with Duplicate rather than MatrixOP and with adding to the sum at each loop step) offers a significant improvement worth using. I'm working on a few other tweaks and will post a revised experiment later.
Thanks!
October 7, 2025 at 01:43 pm - Permalink
I am now starting with an 8-bit unsigned integer wave.
Interestingly, I found MatrixOP to convert the wave into 16-bit floating point much faster than Redimension could do floating point or 32-bit integer. This is the most time-consuming step of the whole operation, and, unfortunately, multithreading doesn't seem to do anything here.
I added the normalization by the number of layers.
I made the redimension of the wave for multithreading a little easier to understand, by combining x and y in dimension one, the thread number / substack index in dimension two and the layers in dimension three.
I also made the function fall back to the single-threaded version when the multithreaded version is not possible because the number of data points is not divisible by the processor count.
Use RunMe() to run the example. I hope I didn't break anything. I didn't have time to test it much.
October 8, 2025 at 05:37 am - Permalink
I can also make MatrixOP convert the stacks to 32-bit unsigned integer (/I/U) faster than redimension by using:
but I cannot make it do a 16-bit integer (/W/U). The following still creates a 32-bit integer:
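The posted command lines did not come through here. For reference only (a hedged illustration, not the original code, and assuming MatrixOP's integer conversion functions), the comparison is presumably between something like:

Redimension/I/U imgstack	// unsigned 32-bit integer via Redimension
MatrixOP/FREE img32 = uint32(imgstack)	// unsigned 32-bit integer via MatrixOP
MatrixOP/FREE img16 = uint16(imgstack)	// intended as unsigned 16-bit, but reportedly still comes back as 32-bit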
October 9, 2025 at 02:29 am - Permalink
Interestingly, ImageTransform getPlane (14 ms) can extract a plane from the image stack much faster than Duplicate (75 ms).
But the same is not true for the multithreaded calculation where a column is extracted. There, ImageTransform getCol (11 ms) and Duplicate (12 ms) perform the same.
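For reference, the two plane-extraction routes being compared look roughly like this (illustrative names; ic is the layer index):

ImageTransform/P=(ic) getPlane imgsrc	// fast path; the 2D result lands in M_ImagePlane
wave currlayer = M_ImagePlane

Duplicate/FREE/R=[][][ic, ic] imgsrc, curr2	// Duplicate route
Redimension/N=(-1, -1, 0) curr2	// drop the singleton layer dimension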
October 9, 2025 at 03:04 am - Permalink
In general, without further convolutions, the two cases below are the fastest.
I convert to fp32 at the outset. Case 5 is about 2.5 - 3 times faster per layer than case 4. Both approaches give a result that is fp32. Case 5 appears to be at least 20 times faster than my original method.
I attach an experiment with an image stack for anyone interested in exploring further. I'm rather satisfied with using the approach in case 5 to carry forward.
Thanks everyone for the suggestions. The insights about MatrixOP versus ImageTransform and FastOP versus MultiThread will certainly improve my coding practices.
October 9, 2025 at 09:18 am - Permalink
I don't think this will give the correct total sum:
FastOP sumimg = sumimg + (1/(ic+1))*currlayer
You need two waves: one to hold the total sum and another where you normalize the total sum.
Your version gives: (Layer0) / 1 + (Layer1) / 2 + (Layer2) / 3, instead of (Layer 0 + Layer1 + Layer2) / 3
But maybe I am misunderstanding the normalization you are trying to do.
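A minimal sketch of the two-wave version (illustrative names; the raw accumulation is kept separate from the normalized copy that gets stored as layer ic):

FastOP sumimg = sumimg + currlayer	// raw running total
FastOP normimg = (1/(ic+1)) * sumimg	// normalized copy for layer ic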
October 10, 2025 at 03:48 am - Permalink
Oops! Thanks @olelytken. Good to have the extra eyes on it. Here is the corrected code for case 4 and case 5. Case 5 is still 3x faster than case 4, but now only about 8x faster than my original.
October 10, 2025 at 06:57 am - Permalink