multithreaded curvefitting

I have hyperspectral data stored as a 3D wave, w3D, with 64 columns, 64 rows, and 1789 layers.

Rows and columns define position, and layer dimension is the spectroscopic independent variable.

To fit each of the 4096 spectra, I create a threadsafe worker function to perform the fit. I pass the 3D wave to the worker function, which creates a free data folder, extracts one spectrum ( = w3D[xpixel][ypixel][]) and returns a free wave containing the result of fitting.

To perform the fitting, I create a 2D wave reference wave, wRefs, to collect the fitting results:

multithread wRefs = workerfunction(w3D, p, q)

If I compare with single-threaded fitting

wRefs = workerfunction(w3D, p, q)

I see a factor of about 4.8 speed improvement (macOS 14.3.1, Igor 9.06B01), except for order 5 or higher polynomial fitting, where I see an 80% reduction in speed.

My first thought was that the assignment of the fit results using the poly() function could be to blame, but just doing the fit without making any wave assignment gives the same speed results.

Any idea why curve fitting with poly is slower when multithreaded? Is there some internal multithreading in curvefit that isn't compatible with the multithread wave assignment?

All other built-in fit functions seem to benefit equally from multithreading, so it would be simple to check for high-order polynomial fit function and run the fitting without multithreading in that case.
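
For readers following along, the setup described above might look roughly like the sketch below. This is not the original procedure file: FitOneSpectrum and FitAllSpectra are hypothetical names, the fit type is hard-coded, and the useThreads flag is just one assumed way to fall back to single-threaded fitting.

```igor
// Hypothetical sketch of the worker described in the post (names are illustrative).
ThreadSafe Function/WAVE FitOneSpectrum(w3D, xpixel, ypixel)
	WAVE w3D
	Variable xpixel, ypixel

	DFREF dfSave = GetDataFolderDFR()
	SetDataFolder NewFreeDataFolder()	// CurveFit output goes into a free data folder

	// Copy one spectrum (the layer dimension) into a 1D wave.
	Make/D/N=(DimSize(w3D, 2)) spectrum
	spectrum = w3D[xpixel][ypixel][p]

	CurveFit/Q poly 5, spectrum		// "poly 5" means five coefficients
	WAVE W_coef				// created by CurveFit in the current data folder

	SetDataFolder dfSave
	return W_coef				// survives as a free wave
End

// Collect the results in a wave reference wave, with or without MultiThread.
Function FitAllSpectra(w3D, useThreads)
	WAVE w3D
	Variable useThreads

	Make/O/WAVE/N=(DimSize(w3D, 0), DimSize(w3D, 1)) wRefs
	if (useThreads)
		MultiThread wRefs = FitOneSpectrum(w3D, p, q)
	else
		wRefs = FitOneSpectrum(w3D, p, q)
	endif
End
```

Calling FitAllSpectra(w3D, 0) for the problematic poly fits and FitAllSpectra(w3D, 1) otherwise would implement the workaround mentioned above.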

 

That would be because a polynomial fit is linear in the coefficients and uses totally separate code from the nonlinear, iterative fits. The linear fits (line fits, polynomial and poly2d) should still be faster than the equivalent nonlinear fit because they don't iterate.
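
To spell out what "linear in the coefficients" means here (this is standard least-squares algebra, not a description of Igor's internal code): for a polynomial with n coefficients and unweighted data, the model and the quantity being minimised are

```latex
\[
  y_i \approx \sum_{k=0}^{n-1} c_k \, x_i^{k} = (A c)_i ,
  \qquad A_{ik} = x_i^{k},
  \qquad \chi^2 = \lVert A c - y \rVert^2 .
\]
\[
  \frac{\partial \chi^2}{\partial c} = 0
  \;\Longrightarrow\;
  A^{\mathsf T} A \, c = A^{\mathsf T} y ,
  \qquad\text{or, with } A = U \Sigma V^{\mathsf T}:\quad
  c = V \Sigma^{-1} U^{\mathsf T} y .
\]
```

The coefficients come out of a single linear solve, so no initial guesses and no iterations are needed, unlike a nonlinear fit such as gauss.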

@tony Can you provide a MWE for playing around? I also don't see a reason why poly order 5 doesn't benefit from multithreading.

OK, I misunderstood a bit. I was thinking of the automatic multithreading internal to the iterated nonlinear fits. But this is a question about doing multiple fits simultaneously via the Multithread keyword.

My guess is that the polynomial fits are sufficiently fast that for whatever number of fits you're doing, the threading overhead is swamping the benefit of multithreading. Just a guess. As Thomas suggests, some play would be in order.

It turns out that the slowdown happens only when a mask wave is involved, and only for the poly fit function, not for others.

•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 295399, ST time: 911160, Speedup: 3.0845
  Masked poly fit. poly order: 5, MT time: 8.34295e+06, ST time: 938055, Speedup: 0.112437
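
The MTspeedtest procedure itself is not reproduced in this thread. As a rough sketch only, a comparison of this kind could be timed as shown below. FitColumn and TimeMaskedPoly are hypothetical names; the random test data, the mask (first 10 points excluded) and the use of StopMSTimer(-2), which counts microseconds, are all assumptions, and the real test may differ in each of these. The reported speedup is simply the ST time divided by the MT time (for example 911160 / 295399 = 3.0845 in the first line above).

```igor
// Hypothetical worker, not the original MTspeedtest code: fits one column
// of a 2D wave with CurveFit poly, with or without a mask wave.
ThreadSafe Function FitColumn(data, col, useMask)
	WAVE data
	Variable col, useMask

	DFREF dfSave = GetDataFolderDFR()
	SetDataFolder NewFreeDataFolder()

	Make/D/N=(DimSize(data, 0)) ycol
	ycol = data[p][col]
	if (useMask)
		Make/N=(DimSize(data, 0)) mask = 1
		mask[0, 9] = 0			// arbitrary: exclude the first 10 points
		CurveFit/Q poly 5, ycol /M=mask
	else
		CurveFit/Q poly 5, ycol
	endif

	SetDataFolder dfSave
	return V_chisq
End

// Time numFits masked fits of numPnts points each (numPnts must be at least 10).
Function TimeMaskedPoly(numPnts, numFits)
	Variable numPnts, numFits

	Make/FREE/D/N=(numPnts, numFits) data = enoise(1)
	Make/FREE/D/N=(numFits) result

	Variable t0 = StopMSTimer(-2)
	MultiThread result = FitColumn(data, p, 1)
	Variable tMT = StopMSTimer(-2) - t0

	t0 = StopMSTimer(-2)
	result = FitColumn(data, p, 1)
	Variable tST = StopMSTimer(-2) - t0

	Printf "MT time: %g, ST time: %g, Speedup: %g\r", tMT, tST, tST/tMT
End
```

Something like TimeMaskedPoly(1000, 4096) would then print one line per run. Be aware that on affected systems the multithreaded pass may take a very long time.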

 

In reply to tony

There is also an unexpected dependence on the length of waves to be fitted:

•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 209471, ST time: 486002, Speedup: 2.32014
  Masked poly fit. poly order: 5, MT time: 211773, ST time: 509983, Speedup: 2.40816
•MTspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 7.55799e+06, ST time: 547901, Speedup: 0.0724929
  Masked poly fit. poly order: 5, MT time: 257492, ST time: 683144, Speedup: 2.65307
•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 293396, ST time: 917342, Speedup: 3.12663
  Masked poly fit. poly order: 5, MT time: 8.86679e+06, ST time: 945262, Speedup: 0.106607

Other fit functions don't show this kind of behaviour.

•MTspeedtest("gauss", 5, 200)
  Unmasked gauss fit. poly order: 5, MT time: 214601, ST time: 977831, Speedup: 4.55651
  Masked gauss fit. poly order: 5, MT time: 139170, ST time: 460551, Speedup: 3.30926
•MTspeedtest("gauss", 5, 500)
  Unmasked gauss fit. poly order: 5, MT time: 566795, ST time: 2.91364e+06, Speedup: 5.14056
  Masked gauss fit. poly order: 5, MT time: 306634, ST time: 1.56379e+06, Speedup: 5.09985
•MTspeedtest("gauss", 5, 1000)
  Unmasked gauss fit. poly order: 5, MT time: 1.14809e+06, ST time: 5.9246e+06, Speedup: 5.16039
  Masked gauss fit. poly order: 5, MT time: 658464, ST time: 3.2876e+06, Speedup: 4.99284

Edit: add system info

Print IgorInfo(3)
  OS:macOS;OSVERSION:14.3.1;LOCALE:US;IGORFILEVERSION:9.06B01;
Print ThreadProcessorCount
  8

  Model Name:    MacBook Pro
  Model Identifier:    MacBookPro15,2
  Processor Name:    Quad-Core Intel Core i5
  Processor Speed:    2.3 GHz
  Number of Processors:    1
  Total Number of Cores:    4
  L2 Cache (per Core):    256 KB
  L3 Cache:    6 MB
  Hyper-Threading Technology:    Enabled
  Memory:    16 GB

Did you take a look at the ipf linked above? Is it (a) reproducible for you and (b) enough to do some profiling?

There must be more to this; my system (Igor Pro 9.05 on macOS ARM, M1 Pro) shows very different results:

•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 1.14053e+06, ST time: 913149, Speedup: 0.800638
  Masked poly fit. poly order: 5, MT time: 1.15828e+06, ST time: 757275, Speedup: 0.653794
•MTspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 1.17211e+06, ST time: 987406, Speedup: 0.842417
  Masked poly fit. poly order: 5, MT time: 1.21521e+06, ST time: 902780, Speedup: 0.742902
•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 1.11386e+06, ST time: 1.17587e+06, Speedup: 1.05567
  Masked poly fit. poly order: 5, MT time: 1.236e+06, ST time: 1.30618e+06, Speedup: 1.05678

 

Now that is strange.

From my home computer, an older MacBook Pro:

•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 234119, ST time: 421007, Speedup: 1.79826
  Masked poly fit. poly order: 5, MT time: 194209, ST time: 396807, Speedup: 2.0432
•MTspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 585567, ST time: 540801, Speedup: 0.923551
  Masked poly fit. poly order: 5, MT time: 283139, ST time: 613617, Speedup: 2.16719
•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 369582, ST time: 925710, Speedup: 2.50474
  Masked poly fit. poly order: 5, MT time: 1.03486e+06, ST time: 1.08674e+06, Speedup: 1.05013

compared with gaussian fits (poly order doesn't have any meaning here):

•MTspeedtest("gauss", 5, 200)
  Unmasked gauss fit. poly order: 5, MT time: 352307, ST time: 977642, Speedup: 2.77497
  Masked gauss fit. poly order: 5, MT time: 174721, ST time: 478166, Speedup: 2.73674
•MTspeedtest("gauss", 5, 500)
  Unmasked gauss fit. poly order: 5, MT time: 1.01302e+06, ST time: 2.87903e+06, Speedup: 2.84203
  Masked gauss fit. poly order: 5, MT time: 578342, ST time: 1.55051e+06, Speedup: 2.68096
•MTspeedtest("gauss", 5, 1000)
  Unmasked gauss fit. poly order: 5, MT time: 2.20035e+06, ST time: 7.81715e+06, Speedup: 3.55269
  Masked gauss fit. poly order: 5, MT time: 1.24213e+06, ST time: 3.38625e+06, Speedup: 2.72616

 

Hardware Overview:

  Model Name:    MacBook Pro
  Model Identifier:    MacBookPro14,2
  Processor Name:    Dual-Core Intel Core i5
  Processor Speed:    3.3 GHz
  Number of Processors:    1
  Total Number of Cores:    2
 

 

There's even more excitement in the Wintel world. Gauss fits with any data length are fine, as are 4th order poly fits of any length. However, with 5th order poly fits, if the data length is greater than 300 or so, the multithreaded fitting causes Igor to essentially lock up and makes the computer super laggy. The memory usage stays the same, but the CPU quickly goes to 100% and the process never completes. I can abort the process, but when I try to close the experiment I often get a window saying that Igor has crashed. Single-threaded 5th order polynomial fits seem fine. I've shipped the crash report off to WM support.

System info: Intel 12th Gen, i5-12600H
Windows 10 Enterprise (22H2), 10.0.19045
Igor 9.0.5.1

•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 1.09726e+06, ST time: 1.32983e+06, Speedup: 1.21196
  Masked poly fit. poly order: 5, MT time: 785966, ST time: 1.43032e+06, Speedup: 1.81983
•MTspeedtest("poly", 5, 300)
  Unmasked poly fit. poly order: 5, MT time: 1.17299e+06, ST time: 1.43175e+06, Speedup: 1.2206
  Masked poly fit. poly order: 5, MT time: 861461, ST time: 1.45183e+06, Speedup: 1.68531
•MTspeedtest("gauss", 5, 200)
  Unmasked gauss fit. poly order: 5, MT time: 344054, ST time: 663625, Speedup: 1.92884
  Masked gauss fit. poly order: 5, MT time: 173156, ST time: 299527, Speedup: 1.72981
•MTspeedtest("gauss", 5, 500)
  Unmasked gauss fit. poly order: 5, MT time: 490591, ST time: 1.78317e+06, Speedup: 3.63475
  Masked gauss fit. poly order: 5, MT time: 233319, ST time: 1.02946e+06, Speedup: 4.41223
•MTspeedtest("gauss", 5, 1000)
  Unmasked gauss fit. poly order: 5, MT time: 877258, ST time: 3.63709e+06, Speedup: 4.14598
  Masked gauss fit. poly order: 5, MT time: 324798, ST time: 2.05203e+06, Speedup: 6.31787
•MTspeedtest("poly", 4, 1000)
  Unmasked poly fit. poly order: 4, MT time: 819395, ST time: 1.67335e+06, Speedup: 2.04218
  Masked poly fit. poly order: 4, MT time: 648839, ST time: 1.79097e+06, Speedup: 2.76027
Print IgorInfo(3)
  OS:Microsoft Windows 10 Enterprise (22H2);OSVERSION:10.0.19045.4291;LOCALE:US;IGORFILEVERSION:9.0.5.1;
Print IgorInfo(4)
  Intel
Print ThreadProcessorCount
  16

 

It's not reproducible:

•mtspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 279780, ST time: 491701, Speedup: 1.75746
  Masked poly fit. poly order: 5, MT time: 709739, ST time: 516056, Speedup: 0.727106
•mtspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 354910, ST time: 430078, Speedup: 1.21179
  Masked poly fit. poly order: 5, MT time: 705499, ST time: 526373, Speedup: 0.7461
•mtspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 3.62641e+08, ST time: 477736, Speedup: 0.00131738
  Masked poly fit. poly order: 5, MT time: 709477, ST time: 525866, Speedup: 0.741202

 

MacBook Pro 16in, 2.4GHz i9, 32GB RAM

•MTspeedtest("poly",5,200)
  Unmasked poly fit. poly order: 5, MT time: 754293, ST time: 406586, Speedup: 0.539029
  Masked poly fit. poly order: 5, MT time: 753732, ST time: 408524, Speedup: 0.542001
•MTspeedtest("poly",5,500)
  Unmasked poly fit. poly order: 5, MT time: 2.29552e+06, ST time: 464351, Speedup: 0.202285
  Masked poly fit. poly order: 5, MT time: 793827, ST time: 576895, Speedup: 0.726727
•MTspeedtest("poly",5,1000)
  Unmasked poly fit. poly order: 5, MT time: 869120, ST time: 838253, Speedup: 0.964485
  Masked poly fit. poly order: 5, MT time: 2.15979e+07, ST time: 849671, Speedup: 0.0393404

•MTspeedtest("gauss",5,1000)
  Unmasked gauss fit. poly order: 5, MT time: 580758, ST time: 4.89651e+06, Speedup: 8.43124
  Masked gauss fit. poly order: 5, MT time: 334341, ST time: 2.64889e+06, Speedup: 7.92272


Print IgorInfo(3)
  OS:macOS;OSVERSION:14.4.1;LOCALE:US;IGORFILEVERSION:9.06B01;
Print ThreadProcessorCount
  16

I am getting consistent times on the poly 5 1000.

Dare I propose that this is related to an i5 versus i7 versus i9 chip?

Interesting that the M1 chip does worse in multi-threading versus single threading in the poly 5 1000 test compared to the i9. Indeed, if I read this correctly, I should certainly not switch my i9 for an M1 as long as I am planning to need heavy processing with Igor Pro.

If instead of using CurveFit poly I fit a user-defined polynomial fitting function (FuncFit UserPoly), fitting is of course way slower, but I see the expected speed increase with multithreading. No surprise there.
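
A minimal sketch of that kind of user-defined fit, for anyone who wants to repeat the comparison. The name UserPoly comes from the post, but the body shown here, the demo function, the synthetic data and the initial guesses are my own assumptions, not the code that was actually timed.

```igor
// Assumed implementation of a user-defined polynomial fit function.
// It must be ThreadSafe if FuncFit is called from a threadsafe worker.
ThreadSafe Function UserPoly(w, x) : FitFunc
	WAVE w
	Variable x

	return poly(w, x)		// w[0] + w[1]*x + w[2]*x^2 + ...
End

Function DemoUserPolyFit()
	// Synthetic test data: a quadratic plus noise (values are arbitrary).
	Make/O/D/N=200 demoData
	SetScale/I x, 0, 1, demoData
	demoData = 1 + 0.5*x - 2*x^2 + gnoise(0.05)

	// FuncFit is iterative, so it needs initial guesses. Five coefficients
	// correspond to "poly 5" in CurveFit.
	Make/O/D/N=5 demoCoefs = 1
	FuncFit/Q UserPoly, demoCoefs, demoData
	Print demoCoefs
End
```

Because this route goes through the iterative nonlinear fitting code rather than the dedicated linear poly code, it is slower per fit.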

It's clear that something is amiss with CurveFit poly. I am curious about how the M1pro fares with other fit functions. I would expect to see significant speed improvement with multithreading.

Another Windows data point. As KZarzana observed, poly 5 starts to get slow between 300 and 400 points and CPU usage remains at 100% until completion. However, there was no problem with the computer responding to, say, mouse movement.

The completion time for the unmasked multithreaded poly 5 fit jumped to about 2e8 at 400 points and stayed roughly constant for larger numbers of points. Single-threaded times were on the order of 1e6. The time for the masked MT fit took a big jump somewhere between 500 and 1000 points.

Dell Precision 7760 - 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz; 32 GB RAM; 8 cores/16 logical processors

Poly 5:
•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 208505, ST time: 363385, Speedup: 1.74281
  Masked poly fit. poly order: 5, MT time: 140017, ST time: 411845, Speedup: 2.94138

•MTspeedtest("poly", 5, 300)
  Unmasked poly fit. poly order: 5, MT time: 127913, ST time: 418590, Speedup: 3.27246
  Masked poly fit. poly order: 5, MT time: 132262, ST time: 392124, Speedup: 2.96474

•MTspeedtest("poly", 5, 400)
  Unmasked poly fit. poly order: 5, MT time: 2.09506e+08, ST time: 424655, Speedup: 0.00202693
  Masked poly fit. poly order: 5, MT time: 141391, ST time: 537311, Speedup: 3.80017

•MTspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 2.16166e+08, ST time: 424359, Speedup: 0.00196312
  Masked poly fit. poly order: 5, MT time: 172648, ST time: 647535, Speedup: 3.7506

•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 2.02592e+08, ST time: 841479, Speedup: 0.00415357
  Masked poly fit. poly order: 5, MT time: 2.06259e+08, ST time: 884786, Speedup: 0.00428968

Gauss:
•MTspeedtest("gauss", 5, 200)
  Unmasked gauss fit. poly order: 5, MT time: 117796, ST time: 566506, Speedup: 4.80923
  Masked gauss fit. poly order: 5, MT time: 94848.7, ST time: 248933, Speedup: 2.62453

•MTspeedtest("gauss", 5, 500)
  Unmasked gauss fit. poly order: 5, MT time: 235547, ST time: 1.86926e+06, Speedup: 7.9358
  Masked gauss fit. poly order: 5, MT time: 134776, ST time: 876038, Speedup: 6.49997

•MTspeedtest("gauss", 5, 1000)
  Unmasked gauss fit. poly order: 5, MT time: 462936, ST time: 3.61267e+06, Speedup: 7.80383
  Masked gauss fit. poly order: 5, MT time: 273130, ST time: 1.99793e+06, Speedup: 7.31495


 

MacBook Pro, M1 Pro (ARM), macOS Sonoma 14.4.1, Igor Pro 9.0.5.1. I think it is working as expected with Gauss.

Igor manages to load all cores. Since some cores are performance cores and some are efficiency cores, I wonder whether that complicates multithreading, since results from different cores must arrive at very different speeds.

•MTspeedtest("gauss", 5, 200)
  Unmasked gauss fit. poly order: 5, MT time: 126119, ST time: 859307, Speedup: 6.81346
  Masked gauss fit. poly order: 5, MT time: 101142, ST time: 434436, Speedup: 4.29531
•MTspeedtest("gauss", 5, 500)
  Unmasked gauss fit. poly order: 5, MT time: 290593, ST time: 2.53254e+06, Speedup: 8.71508
  Masked gauss fit. poly order: 5, MT time: 149553, ST time: 1.34725e+06, Speedup: 9.00853
•MTspeedtest("gauss", 5, 1000)
  Unmasked gauss fit. poly order: 5, MT time: 632526, ST time: 5.23632e+06, Speedup: 8.27842
  Masked gauss fit. poly order: 5, MT time: 319750, ST time: 2.88794e+06, Speedup: 9.03187
•MTspeedtest("gauss", 5, 10000)
  Unmasked gauss fit. poly order: 5, MT time: 5.79273e+06, ST time: 1.33819e+07, Speedup: 2.31012
  Masked gauss fit. poly order: 5, MT time: 3.36825e+06, ST time: 1.1933e+07, Speedup: 3.54279

 

KZarzana has sent us a report of an actual crash while doing the multithreaded poly fit test.

Given that the test uses enoise() to set the poly coefficients, it's highly likely that sometimes there is a pathological data set that causes unusual problems with the fitting.

Given the erratic results (see my three runs of N=500 above, where everything looks mostly OK but one run gives an MT time about three orders of magnitude higher) plus KZarzana's crash, I'm betting on some sort of thread safety violation in the poly fit code.

That wouldn't be surprising given that the basic code was first written 30 years ago. At present, it is using an SVD solution based on Numerical Recipes edition 1.something, from before they tightened up their licensing restrictions. That would be suspect...

I'll let y'all know what I find.

In reply to johnweeks

Yeah, using enoise is a poor choice for making speed comparisons. But I doubt that a pathological data set could be created using this method. The fact that problems arise only under restricted combinations of inputs rather suggests otherwise. For this test, creating an input wave of zeroes would probably show the same trends.

I remember coding plenty of Numerical Recipes stuff in Igor in the days before we had so many built-in methods available. I used to translate from the Fortran version because that was the only code I understood.

Two more data points.  On an AMD Ryzen 3700X system (Win10, Igor 9.0.5.1):

•MTspeedtest("Gaus", 5, 200)
  Unmasked Gaus fit. poly order: 5, MT time: 10070.2, ST time: 35370.4, Speedup: 3.51238
  Masked Gaus fit. poly order: 5, MT time: 9805.9, ST time: 35446.2, Speedup: 3.61478
•MTspeedtest("Gaus", 5, 500)
  Unmasked Gaus fit. poly order: 5, MT time: 10190.5, ST time: 61877.8, Speedup: 6.07211
  Masked Gaus fit. poly order: 5, MT time: 9685.1, ST time: 90618.8, Speedup: 9.35652
•MTspeedtest("Gaus", 5, 1000)
  Unmasked Gaus fit. poly order: 5, MT time: 12094.6, ST time: 111623, Speedup: 9.22915
  Masked Gaus fit. poly order: 5, MT time: 11537.4, ST time: 136792, Speedup: 11.8564

•MTspeedtest("Poly", 4, 200)
  Unmasked Poly fit. poly order: 4, MT time: 115901, ST time: 377673, Speedup: 3.25859
  Masked Poly fit. poly order: 4, MT time: 122661, ST time: 370966, Speedup: 3.02433
•MTspeedtest("Poly", 4, 500)
  Unmasked Poly fit. poly order: 4, MT time: 336137, ST time: 611126, Speedup: 1.81809
  Masked Poly fit. poly order: 4, MT time: 159058, ST time: 522750, Speedup: 3.28653
•MTspeedtest("Poly", 4, 1000)
  Unmasked Poly fit. poly order: 4, MT time: 274522, ST time: 837962, Speedup: 3.05244
  Masked Poly fit. poly order: 4, MT time: 351762, ST time: 884927, Speedup: 2.5157

•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 372060, ST time: 510901, Speedup: 1.37317
  Masked poly fit. poly order: 5, MT time: 413478, ST time: 460336, Speedup: 1.11333
•MTspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 289436, ST time: 641480, Speedup: 2.21631
  Masked poly fit. poly order: 5, MT time: 409983, ST time: 685766, Speedup: 1.67267
•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 426360, ST time: 974424, Speedup: 2.28545
  Masked poly fit. poly order: 5, MT time: 399927, ST time: 924882, Speedup: 2.31263

There weren't any slowdown issues on the Ryzen Windows system with the poly 5 fits.

 

And on an M2 Max (macOS Ventura 13.5.2):

•MTspeedtest("Gauss", 5, 200)
  Unmasked Gauss fit. poly order: 5, MT time: 121638, ST time: 591668, Speedup: 4.86416
  Masked Gauss fit. poly order: 5, MT time: 84415.9, ST time: 295797, Speedup: 3.50404
•MTspeedtest("Gauss", 5, 500)
  Unmasked Gauss fit. poly order: 5, MT time: 218543, ST time: 1.79662e+06, Speedup: 8.22089
  Masked Gauss fit. poly order: 5, MT time: 145612, ST time: 966459, Speedup: 6.63721
•MTspeedtest("Gauss", 5, 1000)
  Unmasked Gauss fit. poly order: 5, MT time: 431504, ST time: 3.63349e+06, Speedup: 8.42052
  Masked Gauss fit. poly order: 5, MT time: 255647, ST time: 2.09708e+06, Speedup: 8.20304

•MTspeedtest("poly", 4, 200)
  Unmasked poly fit. poly order: 4, MT time: 1.84797e+06, ST time: 630529, Speedup: 0.3412
  Masked poly fit. poly order: 4, MT time: 2.62976e+06, ST time: 698544, Speedup: 0.265631
•MTspeedtest("poly", 4, 500)
  Unmasked poly fit. poly order: 4, MT time: 1.86755e+06, ST time: 726264, Speedup: 0.388885
  Masked poly fit. poly order: 4, MT time: 2.57265e+06, ST time: 850029, Speedup: 0.330409
•MTspeedtest("poly", 4, 1000)
  Unmasked poly fit. poly order: 4, MT time: 1.87716e+06, ST time: 969844, Speedup: 0.516654
  Masked poly fit. poly order: 4, MT time: 2.66254e+06, ST time: 1.07144e+06, Speedup: 0.402413

•MTspeedtest("poly", 5, 200)
  Unmasked poly fit. poly order: 5, MT time: 2.20938e+06, ST time: 671477, Speedup: 0.30392
  Masked poly fit. poly order: 5, MT time: 3.08876e+06, ST time: 773982, Speedup: 0.25058
•MTspeedtest("poly", 5, 500)
  Unmasked poly fit. poly order: 5, MT time: 2.22977e+06, ST time: 887805, Speedup: 0.39816
  Masked poly fit. poly order: 5, MT time: 3.16569e+06, ST time: 910858, Speedup: 0.287728
•MTspeedtest("poly", 5, 1000)
  Unmasked poly fit. poly order: 5, MT time: 2.25792e+06, ST time: 1.13934e+06, Speedup: 0.504595
  Masked poly fit. poly order: 5, MT time: 3.2675e+06, ST time: 1.16903e+06, Speedup: 0.357776