Multithreading of FFT operation in Igor

Hi,

Recently I looked into using Igor's FFT operation, but it does not seem to be multithreaded by default: only one core is used when it runs on a wave with more than 1e6 points whose length is a power of 2. Is there anything I can do to make it multithreaded?

I am using Igor Pro 8.0.3.3 and my CPU is an AMD Threadripper 1950X.

There is not much that you can do from the user's side. I have already tweaked the FFT to the extent I can using a single thread. The overhead of splitting the work across threads is such that in most situations it is not worth the trouble. The most common situation where the FFT takes more time than it should is when you try to compute the transform of an array whose size is a product of large primes. Ideally you would pad your array to a power of 2 or, if that is not realistic, change its length enough so that the number of points breaks down into factors of 2 and 3. Any prime factor larger than 3 makes the computation more costly.

 

A.G.
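
For readers who want to apply the padding advice above, here is a minimal sketch (in Python rather than Igor code) of how one might pick the next length that contains only factors of 2 and 3. The function name `next_2_3_length` is hypothetical and is not part of Igor; it only illustrates the arithmetic and makes no claim about how Igor's FFT works internally.

```python
def next_2_3_length(n: int) -> int:
    """Return the smallest m >= n of the form 2**a * 3**b (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")

    # The next power of two is always a valid candidate, so start from it.
    best = 1
    while best < n:
        best *= 2

    # For each power of three below the current best, find the smallest
    # multiple by a power of two that reaches n, and keep the minimum.
    p3 = 1
    while p3 < best:
        m = p3
        while m < n:
            m *= 2
        best = min(best, m)
        p3 *= 3
    return best
```

For example, a wave of 1025 points would be padded to 1152 (2^7 * 3^2) instead of 2048. Keep in mind that zero-padding changes the frequency spacing of the output bins.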

Thanks for the reply. We are aware of the power-of-2 recommendation and have already taken care of that part.

We hoped to use FFT to perform a fast signal calibration (e.g. offsetting the magnitude response over a certain bandwidth by FFT-->Division-->IFFT), but it is not a priority at the moment. I am looking into writing a small OpenCL FFT XOP for that and I will share it here if I get to finish it. I haven't tested it thoroughly, but GPU-accelerated FFT seems significantly faster, albeit with lots of constraints.
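
The FFT-->Division-->IFFT idea can be sketched as follows. This is only an illustration in plain Python, not the planned XOP: the functions `fft`, `ifft` and `calibrate` are hypothetical names, the FFT is a simple recursive radix-2 version that only accepts power-of-2 lengths, and `response` is assumed to hold one nonzero complex gain per FFT bin.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of 2 (illustrative only)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT with the usual 1/n normalisation."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def calibrate(signal, response):
    """Divide the spectrum of `signal` by `response`, bin by bin, then transform back.

    Assumption: `response` has the same length as `signal` and no zero entries.
    """
    if len(signal) != len(response):
        raise ValueError("signal and response must have the same length")
    spectrum = fft(signal)
    corrected = [s / h for s, h in zip(spectrum, response)]
    return ifft(corrected)
```

In practice the division would probably be restricted to the bandwidth of interest, since dividing by a response that is close to zero amplifies noise.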

@Sandbo: Can you post an example FFT benchmark for your machine? Here the stock FFT operation is really fast and we use it during data acquisition as well.

You also don't need to write an XOP for it. Just use https://github.com/pdedecker/IgorCL and add some plain OpenCL code in Igor Pro.

Unfortunately it isn't finished yet; so far I have only compared against the sample provided in the clFFT repository:

https://github.com/clMathLibraries/clFFT

using this example: https://github.com/clMathLibraries/clFFT/blob/master/src/examples/fft1d.c

For a complex FFT of length 2^22 (the longest it supports), I measured at least 10 times better performance running the FFT on a Vega 56 than in Igor on a system with a Threadripper 1950X. I don't have a rigorous comparison yet, but the GPU always finishes in 0.04 sec or less, while the CPU took 1.1 sec for the same length. I am not certain about this, as the GPU might just be doing something different; I can confirm it once I have actual output data to compare (I need to send the same data to the GPU, compute, then move the result back to Igor).

Also, I am aware of the IgorCL XOP, but I haven't tried it since I have been writing my own XOP; maybe I will give it a shot for this application. In particular, I now need to use the clFFT library, and I am not sure whether IgorCL allows me to call that library from within Igor.

 

On a separate note:

In case this interests you, since you are also doing data acquisition: I was able to write an XOP that uses OpenCL to do digital downconversion (DDC). The digitized data is downloaded from the digitizer to the PC and then to the GPU, where three things are done: 1. scaling from integer to floating point, 2. digital mixing with numerically generated sine and cosine waves, 3. low-pass filtering the two mixed signals and decimating them to a lower sampling frequency.

A quick example with a raw input wave of size n=1e8, downconverted to 1e6 points, with ~489 filter taps:
Including data transfer, the GPU took 0.197 sec.
In Igor, step 1 takes 0.089 sec, step 2 takes 0.42 sec and step 3 takes 1.94 sec, ~2.45 sec in total.
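
The three DDC steps described above can be sketched on the CPU like this. It is a plain Python reference, not the OpenCL XOP itself; the function name `ddc` and its parameters (`scale`, `f_norm`, `taps`, `decim`) are hypothetical, and the design of the low-pass filter taps is left out.

```python
import math


def ddc(raw, scale, f_norm, taps, decim):
    """Digital downconversion sketch (hypothetical helper, CPU reference only).

    raw    : integer samples from the digitizer
    scale  : factor converting integer counts to floating point units
    f_norm : mixing frequency in cycles per sample
    taps   : FIR low-pass coefficients (assumed symmetric)
    decim  : decimation factor
    Returns a list of complex baseband samples (real = I, imag = Q).
    """
    # Step 1: scale integer counts to floating point.
    x = [scale * v for v in raw]

    # Step 2: mix with a cosine and a (negated) sine at the carrier frequency.
    mixed = []
    for n, xv in enumerate(x):
        phase = 2.0 * math.pi * f_norm * n
        mixed.append(xv * complex(math.cos(phase), -math.sin(phase)))

    # Step 3: FIR low-pass, evaluated only at the samples that survive
    # decimation, so no work is spent on outputs that would be discarded.
    ntaps = len(taps)
    out = []
    for start in range(0, len(mixed) - ntaps + 1, decim):
        acc = 0j
        for k in range(ntaps):
            acc += taps[k] * mixed[start + k]
        out.append(acc)
    return out
```

Each output sample in step 3 is independent of the others, which is one reason this kind of workload tends to map well onto a GPU.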

The bottleneck in the GPU case here is likely the PCI-E 3.0 bus speed; this may be less of an issue later when we have the chance to use PCI-E 4.0 or faster. Then again, one can now also buy a 64-core CPU and call it a day, I guess.