# Replacing MultiThread by MatrixOP for faster calculations

I would like to move some calculations into their own preemptive threads.

The documentation states (and my testing confirmed this):

> Although it is legal to use the MultiThread mechanism in a threadsafe function, it will likely result in a substantial loss of speed.

So I ventured into using MatrixOP, which does not have this limitation.

I've come up with the following code:

Constant NUM_RUNS = 10

Function async(numRows, numCols)
	variable numRows, numCols

	printf "%d x %d\r", numRows, numCols

	print " ### Main thread ### "
	variable speedup = Dostuff(numRows, numCols)
	printf "speedup: %g\r", speedup

	print " ### In preemptive thread ### "
	variable tgID = ThreadGroupCreate(1)
	ThreadStart tgID, 0, Dostuff(numRows, numCols)
	do
	while(ThreadGroupWait(tgID, 100) != 0)
	ThreadGroupRelease(tgID)
End

ThreadSafe Function Dostuff(numRows, numCols)
	variable numRows, numCols

	variable refTime, i

	variable startRow = 1
	variable endRow = trunc(numRows/2)
	variable startCol = 1
	variable endCol = 3

	Make/O/N=(numRows, numCols)/R result1, result2
	Make/O/W/N=(numRows, numCols) data = q
	Make/O/N=(numCols) fac = p + 0.1

	result1 = NaN
	result2 = NaN

	refTime = StopMSTimer(-2)
	for(i = 0; i < NUM_RUNS; i += 1)
		MultiThread result1[startRow, endRow][startCol, endCol] = data[p][q] / fac[q]
	endfor

	variable elapsedMT = (StopMSTimer(-2) - refTime)/1e6/NUM_RUNS

	printf "Time [s]: %f WaveCRC: %X\r", elapsedMT, WaveCRC(0, result1)

	refTime = StopMSTimer(-2)
	for(i = 0; i < NUM_RUNS; i += 1)
		MatrixOP/NPRM/O/S result2 = insertMat(fp32(scaleCols(subRange(data, startRow, endRow, startCol, endCol), rec(subRange(fac, startCol, endCol, 0, 0)))), result2, startRow, startCol)
	endfor

	variable elapsedMO = (StopMSTimer(-2) - refTime)/1e6/NUM_RUNS

	printf "Time [s]: %f WaveCRC: %X\r", elapsedMO, WaveCRC(0, result1)

	return elapsedMT / elapsedMO
End

Is the MatrixOP expression really how it's done? It looks a bit complex.

My testing shows that MatrixOP always wins in the preemptive thread.

In the main thread it depends:

``````
•async(1e4, 8)
10000 x 8
Time [s]: 0.000467 WaveCRC: FDE72EA3
Time [s]: 0.000163 WaveCRC: FDE72EA3
speedup: 2.87014
Time [s]: 0.001634 WaveCRC: FDE72EA3
Time [s]: 0.000187 WaveCRC: FDE72EA3
•async(1e5, 16)
100000 x 16
Time [s]: 0.003650 WaveCRC: 87F8C6FD
Time [s]: 0.005148 WaveCRC: 87F8C6FD
speedup: 0.708938
Time [s]: 0.015059 WaveCRC: 87F8C6FD
Time [s]: 0.005198 WaveCRC: 87F8C6FD
``````

Does someone have experience with why this is the case?

I think you got the MatrixOP syntax right.  You could make it simpler (by subranging only once and removing some flags) but it could be less efficient.

I also suspect you may want the last CRC test to be WaveCRC(0, result2)?

Your results show that MatrixOP performance is the same in both cases.  This is also what I expect.  On the other hand, if you wonder why the MultiThread approach seems to perform worse, I would guess it has to do with the fact that it is simply less efficient to multithread when you are not in the main thread.  Idle threads are a possible cause.  Others here may suggest additional explanations.

In this connection it is worth mentioning that certain functionality that involves automatic multi-threading may not be available if it is called from anywhere outside the main thread.  This also holds for MatrixOP.

I assume this all started because of curiosity about the best way to optimize some code.  Here are some general thoughts:

Suppose you are trying to execute a command that performs a calculation for an output wave of N points (in at most 2D), and that this calculation, when performed by Igor code (non-MatrixOP), requires M flops per output point.  Denote by Oi the overhead cost of running in threads; this will be a function of N.  So this gives you a total cost of Oi + N*M/nThreads.  Now MatrixOP will require W flops per data point (typically with W << M) but has additional overhead Om in memory allocations.  So when MatrixOP is executing in the main thread, the cost of the calculation is roughly Om + N*W.  It gets a bit more complicated here because Om depends on the number of MatrixOP tokens and parentheses in the MatrixOP expression and the size of the tokens (compared to processor cache), but I think you can see that when N is sufficiently large, the key here involves comparison of M/nThreads vs. W.
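This cost model can be sketched numerically.  Every constant below is an invented placeholder chosen only to show the structure of the comparison; none are measured Igor values:

```python
# Toy rendering of the cost model above. All constants are made-up
# placeholders for illustration, not measurements.

def multithread_cost(N, M, nThreads, Oi):
    # Oi + N*M/nThreads: threading overhead plus M flops/point split over threads
    return Oi + N * M / nThreads

def matrixop_cost(N, W, Om):
    # Om + N*W: allocation overhead plus W flops/point (W << M)
    return Om + N * W

M, W, nThreads = 200.0, 2.0, 8   # hypothetical flops/point and thread count
Oi, Om = 1e5, 5e6                # hypothetical fixed overheads

for N in (1e4, 1e5, 1e7):
    mt = multithread_cost(N, M, nThreads, Oi)
    mo = matrixop_cost(N, W, Om)
    print("N = %g -> %s" % (N, "MultiThread" if mt < mo else "MatrixOP"))
```

With these placeholder numbers the smaller waves favor MultiThread purely because of the assumed allocation overhead Om, while for large N the M/nThreads vs. W comparison dominates and MatrixOP wins.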

Unfortunately, we can only have a rough idea of the ratio of M to W, and it varies with the complexity of the expression.  Very few expressions are internally optimized in Igor code such that M/W can be O(1), and I have seen ratios of well over 100.  If you are using nThreads ~ 10, deciding which approach to use may require some experimentation.

When you try to evaluate the complexity of the syntax, it is worth remembering that an expression such as

wave1 = exp(wave2)

is going to execute much faster in MatrixOP (or WaveTransform) because it uses MKL vectorized routines.

Another point worth remembering is that (at least according to Intel's docs for TBB) you need approximately 1e6 flops to make threading worthwhile.
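That rule of thumb is easy to apply: multiply the number of output points by an estimate of flops per point.  The 5 flops/point used below is an invented guess, not a measured cost:

```python
# Apply the ~1e6-flop threading rule of thumb (per Intel's TBB docs).
THREADING_THRESHOLD_FLOPS = 1e6

def worth_threading(num_points, flops_per_point):
    # total work must exceed the threshold before threads pay for themselves
    return num_points * flops_per_point >= THREADING_THRESHOLD_FLOPS

# Wave sizes from the benchmark above; 5 flops/point is a placeholder.
print(worth_threading(1e4 * 8, 5))    # 10000 x 8 wave
print(worth_threading(1e5 * 16, 5))   # 100000 x 16 wave
```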

My conclusions are:

2. Experiment with waves of the exact dimensions to determine the best approach.

A.G.

I will add that when you use the MultiThread keyword for a wave assignment, Igor internally uses ThreadGroupCreate to divide the assignment into multiple threads. So, while it will likely take testing different approaches to each analysis to determine which is faster on your hardware, you shouldn't need to test MultiThread vs. ThreadGroupCreate for multithreading, since they're both essentially doing the same thing under the hood.

I realized I never replied...

@aclight: Thanks, that is indeed worth remembering.

@Igor: Thanks for your explanations. I only have one comment:

> When you try to evaluate the complexity of the syntax, it is worth remembering that an expression e.g.,
>
> wave1=exp(wave2)
>
> is going to execute much faster in MatrixOP (or WaveTransform) because it is using MKL vectorized routines.

Are you sure vectorized routines are always used? AFAIR, IP waves are 8-byte aligned, while the Intel optimization manual states in Table 11.1 that SSE requires 16-byte alignment, that optimized AVX move instructions benefit from 32-byte alignment, and that a Xeon Phi (I don't have one lying around) benefits from 64-byte alignment.

Thomas,

I'm not sure that getting into any more details here is a productive use of time.  Given the technology in IP8 I think you may have better performance with MatrixOP for the particular example above (assuming the calculations exceed the allocation time in MatrixOP).

A.G.