Multithreading of FFT operation in Igor

There is not much that you can do from the user's side. I have already tweaked the FFT to the extent I can using a single thread. The overhead of splitting the work across threads is such that in most situations it is not worth the trouble. The most common situation where the FFT takes more time than it should is when you try to compute the transform of an array whose size is a product of large primes. Ideally you would pad your array to a power of 2 or, if that is not realistic, change its length enough that the number of points breaks down into factors of 2 and 3. Any factor larger than 3 makes the computation more costly.
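
As an illustration of the padding advice (a minimal sketch, not part of Igor and not from the original post), the following Python helper finds the smallest length at or above n whose only prime factors are 2 and 3. The function names `next_fast_length` and `pad_to_fast_length` are hypothetical.

```python
def next_fast_length(n: int) -> int:
    """Return the smallest m >= n of the form 2**a * 3**b."""
    if n < 1:
        raise ValueError("n must be positive")
    best = None
    p3 = 1  # current power of 3
    while True:
        # Smallest power of 2 times p3 that reaches n.
        m = p3
        while m < n:
            m *= 2
        if best is None or m < best:
            best = m
        if p3 >= n:
            break  # larger powers of 3 can only give larger candidates
        p3 *= 3
    return best


def pad_to_fast_length(data):
    """Zero-pad a sequence of samples to the next 2- and 3-smooth length."""
    m = next_fast_length(len(data))
    return list(data) + [0.0] * (m - len(data))


if __name__ == "__main__":
    for n in (97, 1000, 1025):
        print(n, "->", next_fast_length(n))
```

Keep in mind that zero-padding changes the frequency spacing of the result, so whether padding or trimming is appropriate depends on the data.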

A.G.

October 29, 2019 at 09:52 am - Permalink

Sandbo

Thanks for the reply. We are aware of the power-of-2 requirement and have already taken care of that part.

We hoped to use the FFT for fast signal calibration (e.g. correcting the magnitude response over a certain bandwidth via FFT, division, then IFFT), but it is not a priority at the moment. I am looking into writing a small OpenCL FFT XOP for that and will share it here if I finish it. I haven't tested it thoroughly, but GPU-accelerated FFT seems to be significantly faster, albeit with many constraints.
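
For readers unfamiliar with the FFT, division, IFFT idea, here is a minimal pure-Python sketch of it. It is not the poster's implementation: the radix-2 `fft`, the `ifft` helper, and the `calibrate` function with its per-bin complex `response` list are all assumptions made for illustration, and a real application would use an optimized FFT.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; the length must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT using conj(fft(conj(X))) / n."""
    n = len(spectrum)
    tmp = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in tmp]


def calibrate(signal, response):
    """Divide the signal's spectrum by a per-bin response, then transform back.

    `response` must have one nonzero complex value per FFT bin.
    """
    spectrum = fft(signal)
    corrected = [s / h for s, h in zip(spectrum, response)]
    return ifft(corrected)
```

In this sketch the response has to be known at every FFT bin and must be nonzero there; bins where the response is close to zero would amplify noise and need separate handling.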

October 30, 2019 at 06:37 pm - Permalink

Igor

I'd be interested in performance numbers for an OpenCL FFT XOP.

October 31, 2019 at 11:01 am - Permalink

thomas_braun

@Sandbo: Can you post an example FFT benchmark for your machine? Here the stock FFT operation is really fast, and we use it during data acquisition as well.

You also don't need to write an XOP for it. Just use https://github.com/pdedecker/IgorCL and add some plain OpenCL code in Igor Pro.

November 7, 2019 at 01:56 pm - Permalink

Sandbo

Unfortunately it isn't finished yet; so far I have only compared against the sample provided in the clFFT repository:

https://github.com/clMathLibraries/clFFT

using this example: https://github.com/clMathLibraries/clFFT/blob/master/src/examples/fft1d.c

For an FFT of length 2^22 (the longest it supports for a complex FFT), the FFT ran at least 10 times faster on a Vega 56 than in Igor on a system with a Threadripper 1950X. I don't yet have a rigorous comparison, but the GPU always finishes in 0.04 sec or less, while the CPU took 1.1 sec for the same length. I am not certain about this, as the GPU might be doing something different; I can confirm it later once I have actual output data to compare (I need to inject the same data into the GPU, compute, then move the data back to Igor).

Also, I am aware of the IgorCL XOP, but I haven't tried it because I have been writing my own XOP; maybe I will give it a shot for this application. In particular, I now need to use the clFFT library, and I am not sure whether IgorCL allows me to call that library from within Igor.

A separate matter:

In case this interests you, since you are also doing data acquisition: I have written an XOP that uses OpenCL to do digital downconversion (DDC). The digitized data is downloaded from the digitizer to the PC and then to the GPU, where three things are done: 1. scaling from integer to floating point, 2. digital mixing with two numerically clocked sine and cosine waves, 3. low-pass filtering of the two mixed signals and decimation to a lower sampling frequency.

A quick example with a raw input wave of n=1e8 points, downconverted to 1e6 points with a filter of ~489 taps:
Including transport, the GPU took 0.197 sec.
In Igor, step 1 takes 0.089 sec, step 2 takes 0.42 sec, and step 3 takes 1.94 sec, for a total of ~2.45 sec.
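
The three DDC steps described above can be sketched in plain Python as follows. This is only an illustration of the technique, not the OpenCL XOP: the function names, the `full_scale` parameter, and the Hamming-windowed sinc filter design are assumptions, since the post does not say how the taps were designed. Frequencies are given in cycles per sample.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass FIR (assumed design); needs num_taps >= 2."""
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        t = i - mid
        if t == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]  # normalize to unity gain at DC


def ddc(raw, full_scale, f_lo, taps, decimation):
    """Downconvert integer samples to decimated I and Q lists."""
    # Step 1: scale integers to floating point.
    x = [v / full_scale for v in raw]
    # Step 2: mix with the oscillator's cosine (I) and negative sine (Q).
    i_mix = [v * math.cos(2 * math.pi * f_lo * n) for n, v in enumerate(x)]
    q_mix = [-v * math.sin(2 * math.pi * f_lo * n) for n, v in enumerate(x)]
    # Step 3: low-pass filter, computing only the outputs that are kept.
    # The taps are symmetric, so no reversal is needed.
    i_out, q_out = [], []
    for start in range(0, len(x) - len(taps) + 1, decimation):
        i_out.append(sum(t * i_mix[start + k] for k, t in enumerate(taps)))
        q_out.append(sum(t * q_mix[start + k] for k, t in enumerate(taps)))
    return i_out, q_out
```

Computing the filter only at the decimated output positions avoids filtering samples that would be thrown away, which is the usual reason to combine the low-pass and decimation steps.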

The bottleneck in the GPU case is likely the PCI-E 3.0 bus speed; this may be less of an issue once we can use PCI-E 4.0 or faster. Then again, one can now also buy a 64-core CPU and call it a day, I guess.

November 12, 2019 at 11:07 am - Permalink