Replacing MultiThread with MatrixOP for faster calculations

I would like to move some calculations into their own preemptive threads.

The documentation states (my testing proved that as well):

Although it is legal to use the MultiThread mechanism in a threadsafe function
that is already running in a preemptive thread via ThreadStart, it is not recommended
and will likely result in a substantial loss of speed.

So I ventured into using MatrixOP, which does not have this limitation.

I've come up with the following code:

Constant NUM_RUNS = 10

Function async(numRows, numCols)

    variable numRows, numCols
   
    printf "%d x %d\r", numRows, numCols

    print " ### Main thread ### "

    variable speedup = Dostuff(numRows, numCols)

    printf "speedup: %g\r", speedup

    print " ### In preemptive thread ### "

    Variable threadGroupID= ThreadGroupCreate(1)
   
    ThreadStart threadGroupID,0,Dostuff(numRows, numCols)

    do
        Variable threadGroupStatus = ThreadGroupWait(threadGroupID, 100)
    while(threadGroupStatus != 0)
    Variable dummy = ThreadGroupRelease(threadGroupID)
End

threadsafe Function Dostuff(numRows, numCols)
    variable numRows, numCols

    variable refTime, i
   
    variable startRow = 1
    variable endRow = trunc(numRows/2)
    variable startCol = 1
    variable endCol = 3

    Make/O/N=(numRows, numCols)/R result1, result2
    Make/O/W/N=(numRows, numCols) data = q
    Make/O/N=(numCols) fac = p + 0.1
   
    result1 = NaN
    result2 = NaN

    refTime = stopmstimer(-2)
    for(i = 0; i < NUM_RUNS; i += 1)
        Multithread result1[startRow, endRow][startCol, endCol] = data[p][q] / fac[q]
    endfor
   
    variable elapsedMT = (stopmstimer(-2) - refTime)/1e6/NUM_RUNS
   
    printf "Time [s]: %f WaveCRC: %X\r", elapsedMT, WaveCRC(0, result1)

    refTime = stopmstimer(-2)
    for(i = 0; i < NUM_RUNS; i += 1)
        Matrixop/NPRM/O/S result2 = insertMat(fp32(scaleCols(subRange(data, startRow, endRow, startCol, endCol), rec(subRange(fac, startCol, endCol, 0, 0)))), result2, startRow, startCol)
    endfor

    variable elapsedMO = (stopmstimer(-2) - refTime)/1e6/NUM_RUNS

    printf "Time [s]: %f WaveCRC: %X\r", elapsedMO, WaveCRC(0, result1)

    return elapsedMT / elapsedMO
End

Is the MatrixOP expression really how it's done? It looks a bit complex.

My testing shows that MatrixOP always wins in the preemptive thread.

In the main thread it depends:

•async(1e4, 8)
  10000 x 8
   ### Main thread ### 
  Time [s]: 0.000467 WaveCRC: FDE72EA3
  Time [s]: 0.000163 WaveCRC: FDE72EA3
  speedup: 2.87014
   ### In preemptive thread ### 
  Time [s]: 0.001634 WaveCRC: FDE72EA3
  Time [s]: 0.000187 WaveCRC: FDE72EA3
•async(1e5, 16)
  100000 x 16
   ### Main thread ### 
  Time [s]: 0.003650 WaveCRC: 87F8C6FD
  Time [s]: 0.005148 WaveCRC: 87F8C6FD
  speedup: 0.708938
   ### In preemptive thread ### 
  Time [s]: 0.015059 WaveCRC: 87F8C6FD
  Time [s]: 0.005198 WaveCRC: 87F8C6FD

Does anyone have an explanation for why this is the case?
 

I think you got the MatrixOP syntax right.  You could make it simpler (by subranging only once and removing some flags) but it could be less efficient. 
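For illustration (this is just a sketch of one possible single-subrange form, not necessarily the best one), it might look like:

    MatrixOP/O result2 = insertMat(fp32(subRange(scaleCols(data, rec(fac)), startRow, endRow, startCol, endCol)), result2, startRow, startCol)

This drops /NPRM and /S and subranges only once, but it scales the full matrix before extracting the subrange, which is why it can end up doing more work (and without /NPRM the intermediates may be promoted to double precision).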

I also suspect you may want the last CRC test to be WaveCRC(0, result2)?

Your results show that MatrixOP performance is the same in both cases.  This is also what I expect.  On the other hand, if you wonder why the MultiThread approach seems to have worse performance, I would guess it has to do with the fact that it is simply less efficient to multithread when not in the main thread.  Idle threads are a possible cause.  Others here may suggest additional explanations.

In this connection it is worth mentioning that certain functionality that involves automatic multi-threading may not be available if it is called from anywhere outside the main thread.  This also holds for MatrixOP.

I assume this all started because of curiosity about the best way to optimize some code.  Here are some general thoughts:

Suppose you are trying to execute a command that performs a calculation for an output wave of N points (in at most 2D), and that this calculation, when performed by Igor code (non-MatrixOP), requires M flops per output wave point.  Denote by Oi the overhead cost of running in threads; this will be a function of N.  That gives a total cost of Oi + N*M/nThreads.

MatrixOP, on the other hand, requires W flops per data point (typically with W << M) but has additional overhead Om for memory allocations.  So when MatrixOP executes in the main thread, the cost of the calculation is roughly Om + N*W.  It gets a bit more complicated because Om depends on the number of MatrixOP tokens and parentheses in the expression and on the size of the tokens (compared to processor cache), but you can see that when N is sufficiently large the key is the comparison of M/nThreads vs. W.
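To put some purely made-up numbers on that: with N = 1e6 points, M = 100 flops/point, W = 5 flops/point and nThreads = 8, the threaded Igor assignment costs roughly Oi + 1e6*100/8 = Oi + 1.25e7 flop-equivalents, while MatrixOP costs roughly Om + 1e6*5 = Om + 5e6.  MatrixOP then wins unless its allocation overhead Om exceeds the threading overhead Oi by more than about 7.5e6 flop-equivalents; with M/W closer to 1, or with many more threads, the comparison flips.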

Unfortunately, we can only have a rough idea of the ratio of M to W, and it varies with the complexity of the expression.  Very few expressions are internally optimized in Igor code such that M/W is O(1); I have seen ratios of well over 100.  If you are using nThreads ~ 10, deciding which approach to use may require some experimentation.

When you try to evaluate the complexity of the syntax, it is worth remembering that an expression e.g.,

wave1=exp(wave2)

is going to execute much faster in MatrixOP (or WaveTransform) because it is using MKL vectorized routines.
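
For instance (wave names are arbitrary; this is only meant to show the two forms side by side):

    Make/O/D/N=(1000000) wave2 = p * 1e-6
    Make/O/D/N=(1000000) wave1

    wave1 = exp(wave2)              // plain wave assignment, evaluated point by point
    MatrixOP/O wave1 = exp(wave2)   // same result via MatrixOP's vectorized (MKL) routines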

Another point worth remembering is that (at least according to INTEL docs for TBB) you need approximately 1e6 flops to make threading worthwhile.

My conclusions are:

1. Don't rush to thread/multithread.

2. Experiment with waves of the exact dimensions to determine the best approach.

 

A.G.

 

I will add that when you use the MultiThread keyword for a wave assignment, Igor internally uses ThreadGroupCreate to divide the assignment into multiple threads. So, while it will likely take testing different approaches to each analysis to determine which is faster on your hardware, you shouldn't need to test MultiThread vs. ThreadGroupCreate for multithreading, since they're both essentially doing the same thing under the hood.
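
Purely as an illustration of that point (made-up function and constant names, not what Igor actually does internally), splitting the row range of the MultiThread assignment above over a thread group by hand could look roughly like this:

Constant NUM_SPLIT_THREADS = 4

threadsafe Function DoRows(result, data, fac, firstRow, lastRow)
    WAVE result, data, fac
    variable firstRow, lastRow

    // each thread fills a disjoint block of rows of the shared output wave
    result[firstRow, lastRow][] = data[p][q] / fac[q]
End

Function DoStuffManual()
    // assumes the waves created by Dostuff() exist in the current data folder
    // and that numRows >= NUM_SPLIT_THREADS
    WAVE result1, data, fac

    variable numRows = DimSize(result1, 0)
    variable rowsPerThread = ceil(numRows / NUM_SPLIT_THREADS)
    variable tgID = ThreadGroupCreate(NUM_SPLIT_THREADS)
    variable i, firstRow, lastRow

    for(i = 0; i < NUM_SPLIT_THREADS; i += 1)
        firstRow = i * rowsPerThread
        lastRow = min((i + 1) * rowsPerThread - 1, numRows - 1)
        ThreadStart tgID, i, DoRows(result1, data, fac, firstRow, lastRow)
    endfor

    do
    while(ThreadGroupWait(tgID, 100) != 0)

    Variable dummy = ThreadGroupRelease(tgID)
End

MultiThread does this splitting (and picks the number of threads) for you, so hand-rolling it like this normally buys you nothing.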

I realized I never replied...

@aclight: Thanks, that is indeed worth remembering.

@Igor: Thanks for your explanations. I only have one comment:

When you try to evaluate the complexity of the syntax, it is worth remembering that an expression e.g.,

wave1=exp(wave2)

is going to execute much faster in MatrixOP (or WaveTransform) because it is using MKL vectorized routines.
 

Are you sure you are always using vectorized routines? AFAIR IP waves are 8-byte aligned, and the Intel optimization manual [1] states in Table 11.1 that at least SSE requires 16-byte alignment, that optimized move instructions for AVX benefit from 32-byte alignment, and that, in case one has a Xeon Phi lying around (I don't), it benefits from 64-byte alignment.

[1]: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64…
[2]: https://software.intel.com/en-us/articles/data-alignment-to-assist-vect…
 

Thomas,

I'm not sure that getting into any more details here is a productive use of time.  Given the technology in IP8, I think you may have better performance with MatrixOP for the particular example above (assuming the calculations exceed the allocation time in MatrixOP).

A.G.