Replacing MultiThread with MatrixOP for faster calculations

I would like to move some calculations into their own preemptive threads.

The documentation states (my testing proved that as well):

Although it is legal to use the MultiThread mechanism in a threadsafe function
that is already running in a preemptive thread via ThreadStart, it is not recommended
and will likely result in a substantial loss of speed.

So I ventured into using MatrixOP, which does not have this limitation.

I've come up with the following code:

Constant NUM_RUNS = 10

Function async(numRows, numCols)

    variable numRows, numCols
   
    printf "%d x %d\r", numRows, numCols

    print " ### Main thread ### "

    variable speedup = Dostuff(numRows, numCols)

    printf "speedup: %g\r", speedup

    print " ### In preemptive thread ### "

    Variable threadGroupID= ThreadGroupCreate(1)
   
    ThreadStart threadGroupID,0,Dostuff(numRows, numCols)

    do
        Variable threadGroupStatus = ThreadGroupWait(threadGroupID, 100)
    while(threadGroupStatus != 0)
    Variable dummy = ThreadGroupRelease(threadGroupID)
End

threadsafe Function Dostuff(numRows, numCols)
    variable numRows, numCols

    variable refTime, i
   
    variable startRow = 1
    variable endRow = trunc(numRows/2)
    variable startCol = 1
    variable endCol = 3

    Make/O/N=(numRows, numCols)/R result1, result2
    Make/O/W/N=(numRows, numCols) data = q
    Make/O/N=(numCols) fac = p + 0.1
   
    result1 = NaN
    result2 = NaN

    refTime = stopmstimer(-2)
    for(i = 0; i < NUM_RUNS; i += 1)
        Multithread result1[startRow, endRow][startCol, endCol] = data[p][q] / fac[q]
    endfor
   
    variable elapsedMT = (stopmstimer(-2) - refTime)/1e6/NUM_RUNS
   
    printf "Time [s]: %f WaveCRC: %X\r", elapsedMT, WaveCRC(0, result1)

    refTime = stopmstimer(-2)
    for(i = 0; i < NUM_RUNS; i += 1)
        Matrixop/NPRM/O/S result2 = insertMat(fp32(scaleCols(subRange(data, startRow, endRow, startCol, endCol), rec(subRange(fac, startCol, endCol, 0, 0)))), result2, startRow, startCol)
    endfor

    variable elapsedMO = (stopmstimer(-2) - refTime)/1e6/NUM_RUNS

    printf "Time [s]: %f WaveCRC: %X\r", elapsedMO, WaveCRC(0, result1)

    return elapsedMT / elapsedMO
End

Is the MatrixOP expression really how it's done? It looks a bit complex.

My testing shows that MatrixOP always wins in the preemptive thread.

In the main thread it depends:

•async(1e4, 8)
  10000 x 8
   ### Main thread ### 
  Time [s]: 0.000467 WaveCRC: FDE72EA3
  Time [s]: 0.000163 WaveCRC: FDE72EA3
  speedup: 2.87014
   ### In preemptive thread ### 
  Time [s]: 0.001634 WaveCRC: FDE72EA3
  Time [s]: 0.000187 WaveCRC: FDE72EA3
•async(1e5, 16)
  100000 x 16
   ### Main thread ### 
  Time [s]: 0.003650 WaveCRC: 87F8C6FD
  Time [s]: 0.005148 WaveCRC: 87F8C6FD
  speedup: 0.708938
   ### In preemptive thread ### 
  Time [s]: 0.015059 WaveCRC: 87F8C6FD
  Time [s]: 0.005198 WaveCRC: 87F8C6FD

Does anyone have an explanation for why this is the case?
 

I think you got the MatrixOP syntax right.  You could make it simpler (by subranging only once and removing some flags) but it could be less efficient. 
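For illustration (this is just a sketch of one possible single-subrange form, not necessarily the best one), it might look like:

    MatrixOP/O result2 = insertMat(fp32(subRange(scaleCols(data, rec(fac)), startRow, endRow, startCol, endCol)), result2, startRow, startCol)

This drops /NPRM and /S and subranges only once, but it scales the full matrix before extracting the subrange, which is why it can end up doing more work (and without /NPRM the intermediates may be promoted to double precision).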

I also suspect you may want the last CRC test to be WaveCRC(0, result2)?

Your results show that MatrixOP performance is the same in both cases.  This is also what I expect.  On the other hand, if you wonder why the MultiThread approach seems to have worse performance, I would guess it has to do with the fact that it is simply less efficient to multithread when not in the main thread.  Idle threads are a possible cause.  Others here may suggest additional explanations.

In this connection it is worth mentioning that certain functionality that involves automatic multi-threading may not be available if it is called from anywhere outside the main thread.  This also holds for MatrixOP.

I assume this all started because of curiosity about the best way to optimize some code.  Here are some general thoughts:

Suppose you are trying to execute a command that performs a calculation for an output wave of N points (in at most 2D), and that this calculation, when performed by Igor code (non-MatrixOP), requires M flops per output wave point.  Denote by Oi the overhead cost of running in threads; this will be a function of N.  That gives a total cost of Oi + N*M/nThreads.

MatrixOP, on the other hand, requires W flops per data point (typically with W << M) but has additional overhead Om for memory allocations.  So when MatrixOP executes in the main thread, the cost of the calculation is roughly Om + N*W.  It gets a bit more complicated because Om depends on the number of MatrixOP tokens and parentheses in the expression and on the size of the tokens (compared to processor cache), but you can see that when N is sufficiently large the key is the comparison of M/nThreads vs. W.
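To put some purely made-up numbers on that: with N = 1e6 points, M = 100 flops/point, W = 5 flops/point and nThreads = 8, the threaded Igor assignment costs roughly Oi + 1e6*100/8 = Oi + 1.25e7 flop-equivalents, while MatrixOP costs roughly Om + 1e6*5 = Om + 5e6.  MatrixOP then wins unless its allocation overhead Om exceeds the threading overhead Oi by more than about 7.5e6 flop-equivalents; with M/W closer to 1, or with many more threads, the comparison flips.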

Unfortunately, we can only have a rough idea of the ratio of M to W, and it varies with the complexity of the expression.  Very few expressions are internally optimized in Igor code such that M/W is O(1); I have seen ratios of well over 100.  If you are using nThreads ~ 10, deciding which approach to use may require some experimentation.

When you try to evaluate the complexity of the syntax, it is worth remembering that an expression e.g.,

wave1=exp(wave2)

is going to execute much faster in MatrixOP (or WaveTransform) because it is using MKL vectorized routines.
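
For instance (wave names are arbitrary; this is only meant to show the two forms side by side):

    Make/O/D/N=(1000000) wave2 = p * 1e-6
    Make/O/D/N=(1000000) wave1

    wave1 = exp(wave2)              // plain wave assignment, evaluated point by point
    MatrixOP/O wave1 = exp(wave2)   // same result via MatrixOP's vectorized (MKL) routines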

Another point worth remembering is that (at least according to INTEL docs for TBB) you need approximately 1e6 flops to make threading worthwhile.

My conclusions are:

1. Don't rush to thread/multithread.

2. Experiment with waves of the exact dimensions to determine the best approach.

 

A.G.

 

I will add that when you use the MultiThread keyword for a wave assignment, Igor internally uses ThreadGroupCreate to divide the assignment into multiple threads. So, while it will likely take testing different approaches to each analysis to determine which is faster on your hardware, you shouldn't need to test MultiThread vs. ThreadGroupCreate for multithreading, since they're both essentially doing the same thing under the hood.
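
Purely as an illustration of that point (made-up function and constant names, not what Igor actually does internally), splitting the row range of the MultiThread assignment above over a thread group by hand could look roughly like this:

Constant NUM_SPLIT_THREADS = 4

threadsafe Function DoRows(result, data, fac, firstRow, lastRow)
    WAVE result, data, fac
    variable firstRow, lastRow

    // each thread fills a disjoint block of rows of the shared output wave
    result[firstRow, lastRow][] = data[p][q] / fac[q]
End

Function DoStuffManual()
    // assumes the waves created by Dostuff() exist in the current data folder
    // and that numRows >= NUM_SPLIT_THREADS
    WAVE result1, data, fac

    variable numRows = DimSize(result1, 0)
    variable rowsPerThread = ceil(numRows / NUM_SPLIT_THREADS)
    variable tgID = ThreadGroupCreate(NUM_SPLIT_THREADS)
    variable i, firstRow, lastRow

    for(i = 0; i < NUM_SPLIT_THREADS; i += 1)
        firstRow = i * rowsPerThread
        lastRow = min((i + 1) * rowsPerThread - 1, numRows - 1)
        ThreadStart tgID, i, DoRows(result1, data, fac, firstRow, lastRow)
    endfor

    do
    while(ThreadGroupWait(tgID, 100) != 0)

    Variable dummy = ThreadGroupRelease(tgID)
End

MultiThread does this splitting (and picks the number of threads) for you, so hand-rolling it like this normally buys you nothing.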

I realized I never replied...

@aclight: Thanks, that is indeed worth remembering.

@Igor: Thanks for your explanations. I only have one comment:

When you try to evaluate the complexity of the syntax, it is worth remembering that an expression e.g.,

wave1=exp(wave2)

is going to execute much faster in MatrixOP (or WaveTransform) because it is using MKL vectorized routines.
 

Are you sure you are always using vectorized routines? AFAIR IP waves are 8-byte aligned, and the Intel optimization manual [1] states in Table 11.1 that at least SSE requires 16-byte alignment, that optimized move instructions for AVX benefit from 32-byte alignment, and that, in case one has a Xeon Phi lying around (I don't), it benefits from 64-byte alignment.

[1]: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64…
[2]: https://software.intel.com/en-us/articles/data-alignment-to-assist-vect…
 

Thomas,

I'm not sure that getting into any more details here is a productive use of time.  Given the technology in IP8, I think you may have better performance with MatrixOP for the particular example above (assuming the calculations exceed the allocation time in MatrixOP).

A.G.