ImageTransform vs Concatenate execution time

Hi all, 

I was testing what is the fastest way to stack 2d waves. I did a comparison between "ImageTrasnform stackImages" and "Concatenate".

I used the following code.

Function MakeTheWaves(variable nn)
    variable i
    for(i = 0;i < nn; i++)
        Make/O/N=(4096,4096) $("wname_" + num2str(i))
    return 0

Function MakeTheWaveRefs(variable nn)
    variable i
    Make/O/N=(nn)/WAVE mxpwaveRef
    for(i = 0;i < nn; i++)
         mxpwaveRef[i] = $("wname_" + num2str(i))
    return 0

Function [variable tN0, variable tP0, variable tN1, variable tP1] timerFunction(variable nn)

    WAVE/T mxpwaveRef
    variable tm00=stopmstimer(-2)
    ImageTransform/NP=(nn) stackImages $"wname_0"
    variable tm01=stopmstimer(-2)
    variable tm10=stopmstimer(-2)
    Concatenate/O/NP=2 {mxpwaveRef}, M_Concatenate
    variable tm11=stopmstimer(-2)
    tN0 = (tm01-tm00)/1e6
    tN1 = (tm11-tm10)/1e6
    // Now all stacks are pre-allocated, redo the operations
    ImageTransform/NP=(nn) stackImages $"wname_0"
    Concatenate/O/NP=2 {mxpwaveRef}, M_Concatenate
    tP0 = (tm01-tm00)/1e6
    tP1 = (tm11-tm10)/1e6
    return [tN0, tP0, tN1, tP1] // 0 = ImageTransform, 1 = Concatenate

Function RunTheTest(int startN, int stopN, int stepN) // Entry point
    variable nn, i, tN0, tP0, tN1, tP1
    string dataFolderName = "TestResults_" + num2str(startN) + "_" + num2str(stopN) + "_" + num2str(stepN)
    NewDataFolder/O $dataFolderName
    DFREF dfr = $dataFolderName
    int sizeN = (stopN - startN)/stepN + 1
    Make/N=(sizeN) dfr:results_CN /WAVE=wcn
    Make/N=(sizeN) dfr:results_IT /WAVE=wit
    Make/N=(sizeN) dfr:results_CN_Pre /WAVE=wcnP
    Make/N=(sizeN) dfr:results_IT_Pre /WAVE=witP
    SetScale/P x, startN, stepN, "Nr. Images", wcn, wcnP, wit, witP
    for(nn = startN; nn < stopN + 1; nn += stepN)
        [tN0, tP0, tN1, tP1] = timerFunction(nn)
         wcn[i] = tN1
         wit[i] = tN0
         wcnP[i] = tP1
         witP[i] = tP0

You can see the comparison between ImageTransform and Concatenation in the graphs (green and black line, respectively). I found most interesting the comparison between times depending whether the destination wave is already present. 

Concatenation seems to perform the same (or worse!) when the destination wave is already there (haven't test the effect of the absence of flag /O).

On the other hand, ImageTransform stackImages has considerable gain for images of size (512 x 512) and is faster most of the times for (4096 x 4096) images. Strangely enough I see a reversal of this trend when I try to stack 50-80 images.

Are these results expected or I miss something here?


My system

MacBookPro 2020 - 2,3 GHz Quad-Core Intel Core i7, 16 GB 3733 MHz LPDDR4X.















img1 img2 img3 img4 img5 img6

The fact that your Concatenate "pre-allocated" case is slower than your "Created" case is due to you assuming that Concatenate/O works differently than it does.

I think you are assuming that if the output wave already exists and is the correct size, memory doesn't need to be reallocated and therefore the operation should complete faster. That's not a crazy assumption, but it's wrong. Surprisingly, the actual behavior is even documented in the Details section of the Concatenate topic:

If destWave  does not already exist or, if the /O flag is used, destWave  is created by duplication of the first source wave. Waves are concatenated in order through the list of source waves. If destWave  exists and the /O flag is not used, then the concatenation starts with destWave.

Perhaps "destWave is created..." should be changed to "destWave is [re]created..." to make it slightly more clear what is happening.

ImageTransform stackImages works differently in the case when the output wave already exists. It calls an internal command to make the output wave of the correct size and type, but if the output wave already exists and is already the correct size and type this is almost a no-op.

Your findings with the ~100 layer large images are also explainable. A 4096x4096x100 image of single precision floating point type is about 6.7 GB. That's just for the output, and you have the same number of bytes in your input waves. So that's over 13 GB of memory needed. According to your system info you have 16 GB RAM, so you're likely getting into the need for virtual memory somewhere around 75 layers.

On my machine (Windows 11, Intel 12900KS processor, 64GB RAM), I'm getting times that are close to 10x faster than yours. But I'm also seeing that the "Clean" case of ImageTransform is slower than the "clean" Concatenate case (tested only with 20 and 100 layers). I think your last graph shows the "clean" case of both operations, and if so you don't see quite the same thing. Looking at the code that executes it makes sense that ImageTransform would be slower in the "clean" case, but the explanation is getting into the weeds and some of the difference might be due to how memory allocation works on macOS vs. Windows.