Multithreading on M1 (Igor 8/9 using Rosetta)

Does anyone have experience with multithreading on an M1? The CPU cores have different clock rates, if I understood correctly (some are slow, others are faster).

Do programs like Igor access all CPUs, or are the slow ones reserved for macOS background processes?

(If Igor can use all CPUs:) Does using fast and slow CPUs together lead to significant time differences? If so, can Igor read the clock speed for each thread so that one can correct for this?

I have a baseline M1 mini available, with the IP9 release version installed. How do I test this?

I would say do some calculations with Multithread/NT=(x), measure the time and then plot the time as a function of x. There should be a kink in the curve when going from the fast cores to the slow cores.

The number of available CPUs should be displayed with

print ThreadProcessorCount

For the other things, I'm not sure how to easily figure that out. I don't know of any command that gives you the clock speed per processor in Igor Pro.

And unfortunately I don't have an M1 available to try other things out.

Edit: Dividing the work between several CPUs does not give a linear gain in time. But I would also assume that there will be differences depending on the clock rate of each core.

The macOS implementation for Igor's ThreadProcessorCount function is essentially this:

return sysconf( _SC_NPROCESSORS_ONLN );

We don't have any M1 machines so I can't tell you what that returns.

There is also no way to read the clock speed of a thread or investigate any other properties of a thread.
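For anyone who wants to check what the operating system itself reports, here is a minimal standalone C sketch (it does not run inside Igor). It wraps the sysconf call quoted above and, on macOS only, also asks for the number of logical CPUs per performance level. The function names are made up for illustration, and the hw.perflevelN.logicalcpu sysctl keys are an assumption: as far as I know they exist on Apple Silicon with recent macOS versions, and I do not know what a process running under Rosetta would see.

```c
#include <stdio.h>
#include <stddef.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Logical CPUs currently online; the same call as quoted above. */
long online_cpu_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/*
 * Logical CPUs in one performance level (0 is the fastest level),
 * or -1 if the key is missing or the platform is not macOS.
 * Assumption: the hw.perflevelN.logicalcpu sysctl keys are available.
 */
int perf_level_cpu_count(int level)
{
#ifdef __APPLE__
    char name[64];
    int count = 0;
    size_t size = sizeof(count);

    snprintf(name, sizeof(name), "hw.perflevel%d.logicalcpu", level);
    if (sysctlbyname(name, &count, &size, NULL, 0) != 0)
        return -1;
    return count;
#else
    (void)level; /* not available on this platform */
    return -1;
#endif
}
```

Calling online_cpu_count() and then perf_level_cpu_count(0) and perf_level_cpu_count(1) from a small main() would show how many cores of each type the system reports. It still says nothing about which core a given Igor thread ends up on.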

I found the "MultithreadMandelbrot.pxp" demo from WM, which I could easily modify to run the same calculation with different numbers of cores. Looks like this will need some more study by someone smarter...

I am NOT using the MandelbrotPoint function (it was too fast) and am only varying the number of cores, changing the zoom between 6 and 6.5 in the demo, so one time is for zooming up and the other for zooming down. Same image area. It looks like an odd number of cores results in longer calculation times? And yes, it seems reproducible: there is some noise from test to test, but the general differences are real.

1 core  - 0.9 / 0.7 sec
2 cores - 0.57 / 0.44 sec
3 cores - 0.64 / 0.45 sec
4 cores - 0.38 / 0.30 sec
5 cores - 0.50 / 0.35 sec
6 cores - 0.33 / 0.25 sec
7 cores - 0.40 / 0.28 sec
8 cores - 0.31 / 0.23 sec
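To put numbers on the scaling, one can divide the 1-core time by the n-core time (speedup) and then by n (parallel efficiency). Below is a small sketch in plain C, so it can be checked outside Igor. It uses only the first timing of each pair from the list above, and the function names are made up for illustration. With these values, 8 cores come out about 2.9 times faster than 1 core, which is an efficiency of roughly 0.36.

```c
#include <stdio.h>
#include <stddef.h>

/* First timing of each pair above, in seconds, for 1..8 threads. */
static const double t_first_s[] = {0.9, 0.57, 0.64, 0.38, 0.50, 0.33, 0.40, 0.31};
#define N_RUNS (sizeof(t_first_s) / sizeof(t_first_s[0]))

/* Speedup of an n-thread run relative to the single-thread run. */
double speedup(double t1_s, double tn_s)
{
    return t1_s / tn_s;
}

/* Parallel efficiency: speedup per thread; 1.0 would be ideal scaling. */
double efficiency(double t1_s, double tn_s, int n_threads)
{
    return speedup(t1_s, tn_s) / n_threads;
}

/* Print one line per thread count. */
void print_scaling(void)
{
    for (size_t i = 0; i < N_RUNS; i++) {
        int n = (int)i + 1;
        printf("%d threads: speedup %.2f, efficiency %.2f\n", n,
               speedup(t_first_s[0], t_first_s[i]),
               efficiency(t_first_s[0], t_first_s[i], n));
    }
}
```

The odd/even pattern shows up here as well: the speedup for 3 cores is lower than for 2 cores.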

Something like the following maybe?

threadsafe Function DoWork()
    variable ref

    Make/N=(1e6)/D/FREE data

    ref = stopmstimer(-2)

    data[] = p * sin(p) * cos(p) * exp(p)

    return (stopmstimer(-2) - ref)/1e6
End

Function RunMe()

    variable i
    variable numRuns = ThreadProcessorCount

    Make/O/N=(numRuns) totals = NaN

    for(i = 1; i < numRuns; i += 1)
        Make/FREE/N=(i) result
        Multithread/NT=(i) result = DoWork()
//      print result
        totals[i] = Mean(result)
        printf "Avearge: %#.010g [s]\r", totals[i]
    endfor

    Display totals
    ModifyGraph mode=4
End

 

Sure, here are the results:

•RunMe()
  Avearge: 1.408926606 [s]
  Avearge: 1.241815805 [s]
  Avearge: 1.284322858 [s]
  Avearge: 1.288824320 [s]
  Avearge: 1.446949244 [s]
  Avearge: 1.583647490 [s]
  Avearge: 1.691548944 [s]

 

Here are the results for up to 8 cores:

  Avearge: 1.395027041 [s]
  Avearge: 1.232660770 [s]
  Avearge: 1.276729107 [s]
  Avearge: 1.283119798 [s]
  Avearge: 1.442630410 [s]
  Avearge: 1.572660923 [s]
  Avearge: 1.688646317 [s]
  Avearge: 1.808453679 [s]

Note: not all cores seem to report 100% utilization in this procedure, even though I increased the number of points in DoWork to 1e7 to make it run longer. 4 cores show 100% utilization and 4 cores do not. I'm not sure which ones, as iStat Menus does not show me the type of each core.

With 1e7 points I get (IP8 on Windows):

  Average: 2.572647572 [s]
  Average: 2.364289761 [s]
  Average: 2.256549358 [s]
  Average: 2.277353048 [s]
  Average: 2.321866274 [s]
  Average: 2.450463057 [s]
  Average: 2.531247616 [s]
  Average: 2.648287058 [s]
  Average: 2.708741426 [s]
  Average: 2.847256899 [s]
  Average: 2.920223236 [s]

I do see all 12 cores getting occupied. And it looks like yours is faster ;)

It could be that taking the average over the i threads already skews the results.
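One way to read the numbers: in RunMe() every point of result calls DoWork() once, so the total amount of work grows with the number of threads, and perfect scaling would show up as a flat line rather than a falling one. The mean can also hide a single slow thread, whereas the slowest thread is what determines how long the whole batch takes (assuming all threads start together). Here is a rough sketch in plain C of these three quantities; the function names are made up for illustration, and the throughput figure is only approximate because it is computed from the mean.

```c
#include <stddef.h>

/* Mean of n per-thread times (what RunMe() stores in totals[i]). */
double mean_time(const double *t_s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t_s[i];
    return n > 0 ? sum / (double)n : 0.0;
}

/* Slowest per-thread time; 0.0 for an empty array. */
double max_time(const double *t_s, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++)
        if (t_s[i] > worst)
            worst = t_s[i];
    return worst;
}

/*
 * Approximate combined throughput of n threads relative to one thread,
 * assuming every thread processes the same amount of work.
 */
double relative_throughput(int n_threads, double t1_s, double tn_mean_s)
{
    return n_threads * t1_s / tn_mean_s;
}
```

With the 8-core M1 run above, relative_throughput(8, 1.395, 1.808) gives about 6.2, so eight threads get through roughly six times as much work per second as one thread. Inside Igor, storing WaveMax(result) next to Mean(result) should show whether some threads are much slower than others.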

Thanks everyone, this looks very interesting... Effectively, the performance doesn't seem to differ much between the two types of CPU cores.
Given the different clock speeds (the internet says 3.2 GHz and 2 GHz), that is amazing. Maybe Igor can't access the full performance through Rosetta? Who knows.

@ thomas_braun - Is the test routine only ever testing up to n-1 cores?

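// Times one fixed-size calculation and returns the elapsed time in seconds.
threadsafe Function DoWork()
    variable ref

    Make/N=(1e7)/D/FREE data

    ref = stopmstimer(-2)   // microsecond counter

    data[] = p * sin(p) * cos(p) * exp(p)

    return (stopmstimer(-2) - ref)/1e6
End

Function RunMe()

    variable i
    variable numRuns = ThreadProcessorCount

    // One extra point so that totals[i] holds the result for i threads; totals[0] stays NaN.
    Make/O/N=(numRuns+1) totals = NaN

    // Run with 1, 2, ..., numRuns threads (inclusive).
    for(i = 1; i < (numRuns+1); i += 1)
        Make/FREE/N=(i) result
        // Each of the i points calls DoWork() once, so the total work grows with i.
        Multithread/NT=(i) result = DoWork()
//      print result
        totals[i] = Mean(result)
        printf "Average: %#.010g [s]\r", totals[i]
    endfor

    Display totals
    ModifyGraph mode=4
End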

It just needed a couple of changes (the loop bound and the size of totals); this is how it should work.
However, it seems that you cannot read out the speed of the M1's cores at all. So unfortunately one cannot account for these differences by distributing the tasks asymmetrically.

Edit: I found someone who ran a longer multi-core computation (many user-defined funcfits) for me on an M1 MacBook Air as a test. All eight cores finished the calculation (around 8 minutes) simultaneously (± 5 seconds). Maybe the M1 redistributes the assigned tasks among the cores on its own? Unfortunately, Apple seems to withhold a lot of information about this.

Dear all,

I have a multiprocessing calculation that runs between 45 min and 3 h depending on the amount of data. I ran the experiment in Igor Pro 8 on the same data on a trash-can Mac Pro and a MacBook Air. The MacBook Air was about 1.5-2 times faster.

Because the calculations run so long, the old trash-can Mac Pro has to do the job. Nevertheless, even during the long run there was no heat issue; nearly all CPUs ran at maximum output (no throttling).

Best regards,

Stefan