Massive performance gains of MatrixOp for common tasks.
Wed, 09/02/2020  09:26 am
Hello,
I recently looked into speeding up Igor Pro code for some common tasks:
 Filling a numerical wave with a value.
 Copying a numerical wave to a second one.
 Summing up all values in a numerical wave
I created a simple benchmark function in which the standard approach for a task and a candidate replacement are both timed.
Benchmark Code
variable i, s, e, size, r, runs
variable base, diff
variable maxSize = 100000 // benchmark wave sizes up to this value
variable step = 256 // increase the wave size in steps of this many points
Make/O/D/N=(maxSize) t1, t2
t1 = NaN
t2 = NaN
Display t1
ModifyGraph mode=3,msize=0.2
ModifyGraph grid=1,zero(left)=1
Label left "time in s"
Label bottom "wave size"
TextBox/C/N=text0/F=0/A=MC "System ID"
Display t2
SetAxis left 0,*
ModifyGraph mode=3,msize=0.2
ModifyGraph grid=1
Label left "percent faster"
Label bottom "wave size"
TextBox/C/N=text0/F=0/A=MC "System ID"
runs = 10000 // how often we run the same task repeatedly for statistics
Make/FREE/D/N=(runs) times
for(size = 1; size < maxSize; size += step)
	Make/O/D/N=(size) w1 // set data type here
	// time the reference
	for(i = 0; i < runs; i += 1)
		s = stopmstimer(-2) // microseconds since system start
		r = sum(w1) // Reference function
		e = stopmstimer(-2) - s
		times[i] = e
	endfor
	t1[size] = median(times) / 1E6 // median run time in s
	// time the replacement
	for(i = 0; i < runs; i += 1)
		s = stopmstimer(-2)
		MatrixOp/FREE w2 = sum(w1) // Replacement function
		e = stopmstimer(-2) - s
		times[i] = e
	endfor
	e = median(times) / 1E6
	base = t1[size]
	diff = t1[size] - e // positive: replacement is faster
	t1[size] = diff
	t2[size] = 100 / base * diff // relative gain in percent
	DoUpdate
endfor
End
So first I create two waves t1 and t2 to store some results and create a graph for them.
Then there is an outer loop that increases the size of the wave to benchmark and creates it with the desired data type.
Inside there are two loops that time the reference and the replacement. Both are set up in exactly the same way and run 10000 times to get good statistics.
t1 stores the time difference Reference - Replacement, so positive values mean that the replacement is faster.
While that is nice to know, it doesn't tell us much. It could be that the reference takes 1 s to run and the Replacement 0.999 s. So what we also have to look at is the relative speed up.
This is stored in t2. The time of the Reference is used as 100 %. A result of 50% means the replacement is 2x faster. A result of 90% means it is 10x faster.
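The relation between the percentage in t2 and the speedup factor can be sketched as a small helper. SpeedupFactor is a hypothetical name and not part of the benchmark above; it only rearranges the t2 formula from the code.

```igor
// Hypothetical helper, not part of the original benchmark.
// Converts a "percent faster" value as stored in t2 into a speedup factor.
Function SpeedupFactor(percentFaster)
	variable percentFaster

	// t2 = 100 * (tRef - tRepl) / tRef  =>  tRef / tRepl = 100 / (100 - t2)
	return 100 / (100 - percentFaster)
End
```

SpeedupFactor(50) returns 2 and SpeedupFactor(90) returns 10, which matches the two examples above.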
Test Setup
I tested on two older systems:
 Ryzen Threadripper 1950X from 2017 with 16C/32T
 i9 7900X, also from 2017 with 10C/20T
I used the latest Igor Pro 8, build 36002. The benchmark function was run with the debugger off.
References and Replacements
As seen in the bench function, one pair is:
r = sum(w1) vs. MatrixOp/FREE w2 = sum(w1)
Both sum up all values in the wave into a double precision value. The only functional difference is that MatrixOp returns a wave, so I have to read the result from w2[0] instead of r.
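If the scalar result is needed, reading it back from the output wave can be wrapped in a small function. This is only a sketch; SumWithMatrixOp is a hypothetical name that does not appear in the benchmark.

```igor
// Hypothetical wrapper, not part of the original benchmark.
// Sums all points of w with MatrixOp and returns the result as a number.
Function SumWithMatrixOp(w)
	WAVE w

	MatrixOp/FREE res = sum(w) // res is a free wave holding the sum
	return res[0]              // read the scalar from the first point
End
```

With this, r = SumWithMatrixOp(w1) can stand in for r = sum(w1). The extra function call adds some overhead of its own, which was not measured here.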
I chose MatrixOp because it makes use of multiple cores, which promises a speed gain.
More:
Copying a wave:
Duplicate/O w1, w2 vs. MatrixOp/O w2 = w1
Filling a wave with a value:
FastOp w1 = 0 vs. MatrixOp/O w1 = 0
All benchmarks were done with one-dimensional and two-dimensional waves and for the byte and double precision data types. For two-dimensional waves the size was set with /N=(size, size), so the memory footprint increases quadratically.
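For illustration, the four wave variants could be created as follows. The function and wave names are hypothetical, and /B (signed 8-bit) is an assumption, since the post does not say whether the byte waves were signed or unsigned.

```igor
// Hypothetical helper, not part of the original benchmark.
// Creates the four kinds of test waves described above.
Function MakeBenchWaves(size)
	variable size

	Make/O/B/N=(size) w1Byte         // 1D, signed 8-bit integer (assumed)
	Make/O/D/N=(size) w1Double       // 1D, double precision
	Make/O/B/N=(size, size) w2Byte   // 2D, size x size points
	Make/O/D/N=(size, size) w2Double // 2D, size x size points, 8 bytes per point
End
```

In the benchmark itself only the Make line marked "set data type here" needs to be changed to one of these variants.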
Results
The results show the graph of t2, i.e. the relative speedup. 20% faster means that the replacement needs 80% of the time of the reference.

r = sum(w1) vs. MatrixOp/FREE w2 = sum(w1), one-dimensional wave, data type byte
MatrixOp is faster on the Threadripper for wave sizes bigger than ~70,000 points. On the i9, MatrixOp is faster for wave sizes bigger than ~40,000 points; at a wave size of 90,000 points it is already 2x faster.

r = sum(w1) vs. MatrixOp/FREE w2 = sum(w1), one-dimensional wave, data type double
Interestingly, there is not much difference when changing the wave type to double precision.
MatrixOp is faster on the Threadripper for wave sizes bigger than ~70,000 points. On the i9, MatrixOp is faster for wave sizes bigger than ~40,000 points; at a wave size of 90,000 points it is already 2x faster.

r = sum(w1) vs. MatrixOp/FREE w2 = sum(w1), two-dimensional wave, data type byte
MatrixOp is faster on the Threadripper for wave sizes bigger than ~(270, 270) points. On the i9, MatrixOp is faster for wave sizes bigger than (200, 200) points; at a wave size of (300, 300) points it is already 2x faster.

r = sum(w1) vs. MatrixOp/FREE w2 = sum(w1), two-dimensional wave, data type double
Interestingly, for 2D waves there is also not much difference between byte and double type waves.
MatrixOp is faster on the Threadripper for wave sizes bigger than ~(270, 270) points. On the i9, MatrixOp is faster for wave sizes bigger than (200, 200) points; at a wave size of (300, 300) points it is already 2x faster.
For wave copy, MatrixOp turned out to be faster only for 2D waves.

Duplicate/O w1, w2 vs. MatrixOp/O w2 = w1, two-dimensional wave, data type byte
Starting from a wave size of ~(2000, 2000), the Threadripper shows a consistent gain of 40%, so MatrixOp needs about 60% of the time of Duplicate and is roughly 1.7x faster.
The i9 shows a speed increase of only ~10% for waves bigger than ~(2500, 2500).

Duplicate/O w1, w2 vs. MatrixOp/O w2 = w1, two-dimensional wave, data type double
For double precision waves the Threadripper shows a consistent speedup of 40% for waves bigger than ~(700, 700). The i9 shows a very slight gain for bigger waves.
Summary
The results show that on a recent CPU with multiple cores the speed gain can be dramatic when using the right approach. MatrixOp is one of the operations that uses multithreading automatically and can replace some common functions.
Especially when working with larger data sets, a ~10x speedup is really good.
For copying 1D waves it may be worth using Redimension/E=1 to change the source wave to a 2D wave, then copying it with MatrixOp and using Redimension/E=1 to change the target wave back to 1D (where possible).
Michael Huth
byte physics