Questions about multithreaded loops and FuncFits

I use a loop to read 1dim waves from a matrix (3dim wave) using beam, fit the wave and write the fit parameters into different matrices.

...
for (row_index=0; row_index<row_max; row_index +=1)
     for (column_index=0; column_index<column_max; column_index +=1)
          Make/O/D/N=(layer_max)/O tempwave
          MatrixOP/O tempwave=beam(importmatrix,row_index,column_index)
          FuncFit/Q ...
          export[row_index][column_index]=W_coef[0]
     endfor
endfor
...

The Fit itself is already multithreaded. Can I call the fit directly and it will automatically be used multi-threaded or do I have to call a function that is itself multi-threaded and it will call the fit?

Would it be enough if I put the loop into a separate function, that is threadsafe and the loop would be processed by multiple threads or do I have to use a more complicated way: something like splitting the loop into several sections? If that would be necessary, my question would be, if there are problems when several W_coef and W_sigma Waves exist at the same time.

The FuncFit operation is threadsafe- is that what you're asking? The W_coef and W_sigma waves (and any others that might be created) will be created in the context of the thread that runs FuncFit. Those waves will be lost when the thread terminates unless you do something to pack them up and send them back to the main thread. That something might involve ThreadGroupPutDF and ThreadGroupGetDF(R).

If your fitting function is already threadsafe and your data set is big enough, Igor will use threads for a portion of the fitting process. If FuncFit runs in a preemptive thread, the automatic within-FuncFit threading is disabled to avoid contention for threads, which can seriously degrade performance.

Having said that, if you have many fits to perform, running each fit in a thread will get  you a better speed-up than the automatic threading because more of the fitting process will run in parallel.

I tried to split the loop to the individual cores.

The f_test function passes certain areas of the matrix to each thread.

Function f_test(w_matrix)
    wave w_matrix
    variable v_threads=ThreadProcessorCount
    variable v_rows_max=dimsize(w_matrix,0)
    variable v_columns_max=dimsize(w_matrix,1)
    variable v_layers_max=dimsize(w_matrix,2)
    variable v_row_per_thread=ceil(v_rows_max/v_threads)
    Variable threadGroupID= ThreadGroupCreate(v_threads)
    variable r_index,c_index,t_index
    make/O/N=(v_rows_max,v_columns_max) w_export
    for(t_index=0; t_index<v_threads; t_index+=1)
        ThreadStart threadGroupID,t_index,f_test2(w_matrix,w_export,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)
    endfor
    do
        Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
        while( threadGroupStatus != 0 )
    Variable dummy= ThreadGroupRelease(threadGroupID)
end

The data for the fit is then created in the temporary directory for each thread. Afterwards the fit is performed and the parameters are entered into the globally available wave.

ThreadSafe Function f_test2(w_matrix,w_export,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)
    wave w_matrix, w_export
    variable v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index
    variable v_rowstart=t_index*v_row_per_thread
    variable v_rowend=(t_index+1)*v_row_per_thread
    if (v_rowend>v_rows_max)
        v_rowend=v_rows_max
    endif
    variable row_index,column_index
    Make/D/O/N=2/O W_coef                      
    W_coef[0] =  {1.5,1e-6}             Make/D/O/N=(1,2) M_FitConstraint=0
    M_FitConstraint[0][0]=-1;
    Make/D/O/N=1 W_FitConstraint=0
    Make/D/N=2/O W_sigma
    make/N=(v_layers_max)/O tempwave
    for (row_index=v_rowstart;row_index<v_rowend;row_index+=1)
        for (column_index=0;column_index<v_columns_max;column_index+=1)
            MatrixOP/O tempwave=beam(w_matrix,row_index,column_index)
            FuncFit/Q f_example_fit W_coef tempwave /D /C={M_FitConstraint,W_FitConstraint}
            w_export[row_index][column_index]=w_coef[1]
        endfor
    endfor
end

Here is an example of the fit function:

Threadsafe function f_example_fit(w,x) : FitFunc
    Wave w
    Variable x
    wave w_wave = root:wave0
    return w[0]*w_wave(x) + w[1]
end

But when I look at the results and compared it with the single thread fit, I get different results. I can see the six threads in the result - it looks like that they have different offsets and I get this error message:

While executing a wave read, the following error occurred: Attempt to operate on a null (missing) wave

Did I do something fundamentally wrong or is the problem that all threads want to access the same waves while fitting?

One thing I see is the use of constraints. Passing a text constraint wave to FuncFit in a thread is not supported (I probably should have mentioned that previously). The constraint code uses Igor's interpreter to parse the constraint expressions, and the interpreter is not threadsafe. For the work-around, DisplayHelpTopic "Constraint Matrix and Vector" and DisplayHelpTopic "Constraints and ThreadSafe Functions".

Perhaps being naive but ... Would it be feasible and beneficial instead to split your matrix into as many "beam chunks" as you have cores and passed those "beam chunks through a threadsafe fitting routine? IOW, try to optimize the spread of the data fitting in the two loops across all cores rather than putting all the thread safe load in the fitting function.

For example, given a hypothetical 10 core system using a 10 x 10 x C matrix, create 10 sub-matrices (e.g. split to ten 10 x 1 x C matrices) that are then each sent off to their own individual core for fitting.

Just a few things that I noticed in your f_test2 function :

Make/D/O/N=2/O W_coef                      
Make/D/O/N=(1,2) M_FitConstraint=0
Make/D/O/N=1 W_FitConstraint=0
Make/D/N=2/O W_sigma
make/N=(v_layers_max)/O tempwave
MatrixOP/O tempwave=beam(w_matrix,row_index,column_index)

None of these waves are threadsafe because they are all located in the current datafolder. Each thread will therefore overwite the waves of the other threads. A simple fix would be to create free waves instead.

Make/FREE/D/O/N=2/O W_coef                      
Make/FREE/D/O/N=(1,2) M_FitConstraint=0
Make/FREE/D/O/N=1 W_FitConstraint=0
Make/FREE/D/N=2/O W_sigma
make/FREE/N=(v_layers_max)/O tempwave
MatrixOP/FREE/O tempwave=beam(w_matrix,row_index,column_index)

That way the waves becomes unique to each thread and they will no longer overwrite each other.

A free waves is like a local variable. It only exists within the function that created it. A normal wave is like a global variable. It has a location in Igor datafolder structure and is shared between all functions.

There may be other issues with your code, but the shared waves was what caught my eye.

In reply to by olelytken

olelytken wrote:

Just a few things that I noticed in your f_test2 function :

Make/D/O/N=2/O W_coef                      
Make/D/O/N=(1,2) M_FitConstraint=0
Make/D/O/N=1 W_FitConstraint=0
Make/D/N=2/O W_sigma
make/N=(v_layers_max)/O tempwave
MatrixOP/O tempwave=beam(w_matrix,row_index,column_index)

None of these waves are threadsafe because they are all located in the current datafolder. Each thread will therefore overwite the waves of the other threads. A simple fix would be to create free waves instead.

That's actually not true. From this help topic:

DisplayHelpTopic "ThreadSafe Functions"

"When threadsafe functions execute in the main thread, they have normal access to data folders, waves, and variables. But when running in a preemptive thread, threadsafe functions use their own private data folders, waves, and variables."

 

So in a preemptive thread, those waves are created in a thread local data folder. This makes them more or less the same as a free wave. But I agree with your recommendation to explicitly create them as free waves, because I think it makes the code a little more explanatory and it ensures that the behavior when run in a preemptive thread is the same as when run from the main thread.

Many thanks for the suggestions.

@johnweeks: I had already adjusted the constraints using this help topics. I created the necessary waves for it. Without the threadsafe Waves you can't compile the code at all, because an error message will be displayed.

@jjweimer: My attempt is to assign matrix sections to a thread. (e.g. row 0 to 9 -> core 1, row 10 to 19 -> core 2 ... results -> one wave for all cores).

@olelytken & aclight: I have implemented the /FREE flag.

Additionally I tried to duplicate the reference wave into each thread and then fit it.

wave w_orig_wave0=root:wave0
make/FREE/N=(v_layers_max)/O w_wave0=w_orig_wave0
setscale/P x,dimoffset(w_matrix,2),dimdelta(w_matrix,2),w_wave0 //same scale like root:wave0

Then I rewrote the fit function to take the wave0 from the current datafolder. Do I correctly assume that this datafolder is the temporary folder from the thread that calls the FuncFit? (With the normal fit via the menu the fit works).

Threadsafe function f_example_fit(w,x) : FitFunc
    Wave w
    Variable x
    string s_wave=GetDataFolder(1)+"wave0"
    wave w_wave = $s_wave
    return w[0]*w_wave(x) + w[1]
end

I could solve the offset / stripe pattern by recreating the w_coef wave before each fit and not using the results of the last fit as start value:

    for (row_index=v_rowstart;row_index<v_rowend;row_index+=1)
        for (column_index=0;column_index<v_columns_max;column_index+=1)
            Make/D/O/N=2/O W_coef                      
            W_coef[0] =  {1.5,1e-6}  
            MatrixOP/O tempwave=beam(w_matrix,row_index,column_index)
            FuncFit/Q f_example_fit W_coef tempwave /D /C={M_FitConstraint,W_FitConstraint}
            w_export[row_index][column_index]=w_coef[1]
        endfor
    endfor

Unfortunately I have the problem that the wave W_sigma is not read out at all and the Wave W_coef is apparently not accessible properly. (Creation of the wave w_sigma and the readout was inserted of course.)

> My attempt is to assign matrix sections to a thread. (e.g. row 0 to 9 -> core 1, row 10 to 19 -> core 2 ... results -> one wave for all cores).

I would propose to use Duplicate/FREE to split the matrix and then assign the newly created FREE sub matrices each to their own core. The trade off is to increase the RAM needed for the code with the hope that the speed increase in running the code compensates.

In reply to by jjweimer

I have tested the function which is to be processed by several threads in parallel with single inputs (certain data and one processor core). If this function is not threadsafe, it works as aspected (uses core #1 for beam and multiple threads for the fitting). As soon as this function is threadsafe by the command threadsafe, this function behaves strange:
Although I specified the flag /Q, the result (of each fit) is shown in a separate window. The following text is written to the history:

 FitProgressDialog allocating a dialogFitFunction instance

Are the results possibly saved differently when threadsafe is active compared to threadsafe is not active?

Is it possible that the thread's temporary folder is not output as string with the

GetDataFolder(1)

command?

(W_coef and W_sigma seems to be used not correctly with make/FREE. But since the waves are temporary anyway, that's not too bad.)

Should splitting into multiple threads work for my problem? I have inserted the creation of a difference of two points on the "tempwave" (its generated using beam) and this is done without errors by splitting different regions of rows to several threads. Only the fit seems not to work, although it works without ThreadSafe.

jjweimer wrote:

I would propose to use Duplicate/FREE to split the matrix and then assign the newly created FREE sub matrices each to their own core. The trade off is to increase the RAM needed for the code with the hope that the speed increase in running the code compensates.

This is a good idea. The Waves are not so large that there could be problems with the RAM.

 

Edit: Code for generation of an exampe:

f_generation_matrix creates the necessary waves. f_test2 is the function that is not threadsafe (in this case v_row_per_thread =v_row_max). f_test is the function to start the evaluation over all threads. This does not work because the threadsafe function f_test3 (f_test3 is a copy of f_test2) does not work.

The result of the fit is correct if w_export1=tempwave2 and w_export2=tempwave3.

function f_generation_matrix()
    Make/O/N=(40,40,800) test_matrix
    Make/O/N=(800) reference1=2+2*exp(-((x-100)/10)^2)
    Make/O/N=(800) reference2=1+3*exp(-((x-150)/25)^2)
    variable row_max, col_max, layer_max
    variable r_index, c_index, l_index
    row_max=dimsize(test_matrix,0)
    col_max=dimsize(test_matrix,1)
    layer_max=dimsize(test_matrix,2)
    make/D/O/N=(layer_max) tempwave
    make/D/O/N=(row_max,col_max) tempwave2=abs(enoise(5))
    make/D/O/N=(row_max,col_max) tempwave3=enoise(5)
    for (r_index=0;r_index<row_max;r_index+=1)
        for (c_index=0;c_index<col_max;c_index+=1)
            tempwave=tempwave2[r_index][c_index]*reference1+tempwave3[r_index][c_index]*reference2
            for (l_index=0;l_index<layer_max;l_index+=1)
                test_matrix[r_index][c_index][l_index]=tempwave[l_index]
            endfor
        endfor
    endfor
end

Threadsafe function f_example_fit(w,x) : FitFunc
    Wave w
    Variable x
    wave w_refwave1 = root:reference1
    wave w_refwave2 = root:reference2
    return w[0]*w_refwave1(x) + w[1]*w_refwave2(x) + w[2]
end

Function f_test2(w_matrix,w_export1,w_export2,w_export3,w_export4, w_export5,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)
    wave w_matrix, w_export1, w_export2, w_export3, w_export4, w_export5
    variable v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index
    variable v_rowstart=t_index*v_row_per_thread
    variable v_rowend=(t_index+1)*v_row_per_thread
    if (v_rowend>v_rows_max)
        v_rowend=v_rows_max
    endif
    variable row_index,column_index
    make/O w_matrix_duplicate
    killwaves w_matrix_duplicate
     duplicate w_matrix w_matrix_duplicate
    Make/D/O/N=3/O W_coef                      
    W_coef[0] =  {1.5,1.5,1e-6}            
    Make/D/O/N=(1,3) M_FitConstraint=0
    M_FitConstraint[0][0]=-1;
    Make/D/O/N=1 W_FitConstraint=0
    Make/D/N=3/O W_sigma
    make/N=(v_layers_max)/O tempwave
    for (row_index=v_rowstart;row_index<v_rowend;row_index+=1)
        for (column_index=0;column_index<v_columns_max;column_index+=1)
            Make/D/O/N=3/O W_coef
            Make/D/N=3/O W_sigma
            make/N=(v_layers_max)/O tempwave                      
                W_coef[0] =  {1.5,1.5,1e-6}  
            MatrixOP/O tempwave=beam(w_matrix_duplicate,row_index,column_index)
            setscale/P x,dimoffset(w_matrix,2),dimdelta(w_matrix,2),tempwave
            FuncFit/Q f_example_fit W_coef tempwave[81,180] /D /C={M_FitConstraint,W_FitConstraint}
            w_export1[row_index][column_index]=w_coef[0]
            w_export2[row_index][column_index]=w_coef[1]
            w_export3[row_index][column_index]=w_coef[2]
            w_export4[row_index][column_index]=w_sigma[1]
            w_export5[row_index][column_index]=tempwave[10]-tempwave[30]
        endfor
        make/O fit_tempwave
        killwaves fit_tempwave
    endfor
end

ThreadSafe Function f_test3(w_matrix,w_export1,w_export2,w_export3,w_export4, w_export5,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)
    wave w_matrix, w_export1, w_export2, w_export3, w_export4, w_export5
    variable v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index
    variable v_rowstart=t_index*v_row_per_thread
    variable v_rowend=(t_index+1)*v_row_per_thread
    if (v_rowend>v_rows_max)
        v_rowend=v_rows_max
    endif
    variable row_index,column_index
    make/O w_matrix_duplicate
    killwaves w_matrix_duplicate
     duplicate w_matrix w_matrix_duplicate
    Make/D/O/N=3/O W_coef                      
    W_coef[0] =  {1.5,1.5,1e-6}            
    Make/D/O/N=(1,3) M_FitConstraint=0
    M_FitConstraint[0][0]=-1;
    Make/D/O/N=1 W_FitConstraint=0
    Make/D/N=3/O W_sigma
    make/N=(v_layers_max)/O tempwave
    for (row_index=v_rowstart;row_index<v_rowend;row_index+=1)
        for (column_index=0;column_index<v_columns_max;column_index+=1)
            Make/D/O/N=3/O W_coef
            Make/D/N=3/O W_sigma
            make/N=(v_layers_max)/O tempwave                      
                W_coef[0] =  {1.5,1.5,1e-6}  
            MatrixOP/O tempwave=beam(w_matrix_duplicate,row_index,column_index)
            setscale/P x,dimoffset(w_matrix,2),dimdelta(w_matrix,2),tempwave
            FuncFit/Q f_example_fit W_coef tempwave[81,180] /D /C={M_FitConstraint,W_FitConstraint}
            w_export1[row_index][column_index]=w_coef[0]
            w_export2[row_index][column_index]=w_coef[1]
            w_export3[row_index][column_index]=w_coef[2]
            w_export4[row_index][column_index]=w_sigma[1]
            w_export5[row_index][column_index]=tempwave[10]-tempwave[30]
        endfor
        make/O fit_tempwave
        killwaves fit_tempwave
    endfor
end

Function f_test(w_matrix)
    wave w_matrix
    variable v_threads=ThreadProcessorCount
    variable v_rows_max=dimsize(w_matrix,0)
    variable v_columns_max=dimsize(w_matrix,1)
    variable v_layers_max=dimsize(w_matrix,2)
    variable v_row_per_thread=ceil(v_rows_max/v_threads)
    Variable threadGroupID= ThreadGroupCreate(v_threads)
    variable r_index,c_index,t_index
    make/O/N=(v_rows_max,v_columns_max) w_export1
    make/O/N=(v_rows_max,v_columns_max) w_export2
    make/O/N=(v_rows_max,v_columns_max) w_export3
    make/O/N=(v_rows_max,v_columns_max) w_export4
    make/O/N=(v_rows_max,v_columns_max) w_export5
    for(t_index=0; t_index<v_threads; t_index+=1)
        ThreadStart threadGroupID,t_index,f_test3(w_matrix,w_export1,w_export2,w_export3,w_export4,w_export5,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)
    endfor
    do
        Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
        while( threadGroupStatus != 0 )
    Variable dummy= ThreadGroupRelease(threadGroupID)
end

 

Sorry about my earlier error with respect to your use of the constraint waves. Old eyes, I guess:)

Maybe you could provide short walk-through of the code? I don't see how the waves reference1 and reference2 are transmitted to a FuncFit running in a thread.

In case John's last comment isn't clear, your problem is here:

Threadsafe function f_example_fit(w,x) : FitFunc
    Wave w
    Variable x
    wave w_refwave1 = root:reference1
    wave w_refwave2 = root:reference2
    return w[0]*w_refwave1(x) + w[1]*w_refwave2(x) + w[2]
end

This is not going to work when it is called from a preemptive thread. As I quoted above:

"When threadsafe functions execute in the main thread, they have normal access to data folders, waves, and variables. But when running in a preemptive thread, threadsafe functions use their own private data folders, waves, and variables."

I think the wording of the statement is a bit vague, but that statement means that functions called in a preemptive thread do not have access to global waves or variables. So you cannot look up w_refwave1 and w_refwave2 since those are global waves.

If you modify your function a bit, like this, and then call your test function that uses threads for the fit, you'll see plenty of messages printed out:

Threadsafe function f_example_fit(w,x) : FitFunc
    Wave w
    Variable x
    wave/Z w_refwave1 = root:reference1
    if (!WaveExists(w_refwave1))
        print "w_refwave1 does not exist"
    endif
    wave/Z w_refwave2 = root:reference2
    if (!WaveExists(w_refwave2))
        print "w_refwave2 does not exist"
    endif
    return w[0]*w_refwave1(x) + w[1]*w_refwave2(x) + w[2]
end

 

Edit2: Ok, thank you very much. It works. :)

I have updated the source code below so that others who will have the same problem in the future will find the solution better.

//the function for using all threads
Function f_test(w_matrix)

//the import matrix contains the needed data
    wave w_matrix
   
   
//the waves, that are needed for the fit    
    wave w1_reference1=root:reference1
    wave w1_reference2=root:reference2
 
//dimensions of the w_matrix, will be sent to each thread
    variable v_rows_max=dimsize(w_matrix,0)
    variable v_columns_max=dimsize(w_matrix,1)
    variable v_layers_max=dimsize(w_matrix,2)
   
//more or less from an example procedure:    
    variable v_threads=ThreadProcessorCount         //how many threads are there?
    variable v_row_per_thread=ceil(v_rows_max/v_threads)        //how many rows will be done by each Thread? (round for non integers, ceil to get all rows)
    Variable threadGroupID= ThreadGroupCreate(v_threads) //create ThreadGroup

//make the waves in which the results of the fit parameters are saved
    make/O/N=(v_rows_max,v_columns_max) w_export1
    make/O/N=(v_rows_max,v_columns_max) w_export2
    make/O/N=(v_rows_max,v_columns_max) w_export3
    make/O/N=(v_rows_max,v_columns_max) w_export4
    make/O/N=(v_rows_max,v_columns_max) w_export5
   
//loop that sends the function and the regions for evaluation to the threads
    variable t_index      
    for(t_index=0; t_index<v_threads; t_index+=1)
        ThreadStart threadGroupID,t_index,f_test3(w_matrix,w1_reference1,w1_reference2,w_export1,w_export2,w_export3,w_export4,w_export5,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)
    endfor
   
//waiting for the finish of evaluation    
    do
        Variable threadGroupStatus = ThreadGroupWait(threadGroupID,100)
        while( threadGroupStatus != 0 )
    Variable dummy= ThreadGroupRelease(threadGroupID)   //release ThreadGroup
end


ThreadSafe Function f_test3(w_matrix,w1_reference1,w1_reference2,w_export1,w_export2,w_export3,w_export4, w_export5,v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index)

//the matrix, the waves for fit, the exportwaves and some variables have to be transferred to the single Thread
    wave w_matrix,w1_reference1,w1_reference2, w_export1, w_export2, w_export3, w_export4, w_export5
 
    variable v_rows_max,v_columns_max,v_layers_max,v_row_per_thread,t_index

   
//each thread have to do a specific range of the row. v_row_start is the value of the first row for this thread,
//v_row_end is the value of the first row that is  done by the next thread  
    variable v_rowstart=t_index*v_row_per_thread
    variable v_rowend=(t_index+1)*v_row_per_thread
   
//sometimes the rows devided by the number of thread does not get a integer.
//The row_per_thread are rounded using ceil and the last row for the last thread is set to the maximum of rows
    if (v_rowend>v_rows_max)
        v_rowend=v_rows_max
    endif
    variable row_index,column_index
   
   
 //the wave, that are needed for the fit, will be duplicated, that the threads does not have to use the same matrix    
    duplicate w1_reference1 reference1
    duplicate w1_reference2 reference2    
   
   
//the matrix will be duplicated, that the threads does not have to use the same matrix    
    make/O w_matrix_duplicate
    killwaves w_matrix_duplicate
     duplicate w_matrix w_matrix_duplicate
   
//the needed data for the fit, not necessary needed at this position, W_sigma is important, because Igor has to know that a wave with this name will exist
    Make/D/O/N=3/O W_coef                      
    W_coef[0] =  {1.5,1.5,1e-6}            
    Make/D/O/N=(1,3) M_FitConstraint=0
    M_FitConstraint[0][0]=-1;
    Make/D/O/N=1 W_FitConstraint=0
    Make/D/N=3/O W_sigma
   
//make the wave that will contain the data using beam
    make/N=(v_layers_max)/O tempwave
   
   
//the loop to get each row/column combination    
    for (row_index=v_rowstart;row_index<v_rowend;row_index+=1)
        for (column_index=0;column_index<v_columns_max;column_index+=1)
       
//overwriting the waves to provide errors based on old waves / results
            Make/D/O/N=3/O W_coef
            Make/D/N=3/O W_sigma
            make/N=(v_layers_max)/O tempwave                      
                W_coef[0] =  {1.5,1.5,1e-6}
               
//Get the data for fit and scale them.                           
            MatrixOP/O tempwave=beam(w_matrix_duplicate,row_index,column_index)
            setscale/P x,dimoffset(w_matrix_duplicate,2),dimdelta(w_matrix_duplicate,2),tempwave

//Do the fit.                      
            FuncFit/Q f_example_fit W_coef tempwave[81,180] /D /C={M_FitConstraint,W_FitConstraint}
           
//Writing some of the coefficients into the w_export waves. Real situation: all data are saved in export_waves      
            w_export1[row_index][column_index]=w_coef[0]
            w_export2[row_index][column_index]=w_coef[1]
            w_export3[row_index][column_index]=w_coef[2]
            w_export4[row_index][column_index]=w_sigma[1]
           
//Difference of two points is saved for testing if beam works well      
            w_export5[row_index][column_index]=tempwave[10]-tempwave[30]
        endfor
       
//deleating the fit_wave. not necessary        
        make/O fit_tempwave
        killwaves fit_tempwave
    endfor
end

//the fit itself
Threadsafe function f_example_fit(w,x) : FitFunc
    Wave w
    Variable x
    string s_datafolder = GetDataFolder(1) //while using different Threads it is normally "root:"
    //print s_datafolder
    string s_refwave1 = s_datafolder+"reference1"
    wave/Z w_refwave1 = $s_refwave1     //other version: wave/Z w_refwave1 = root:reference1
    if (!WaveExists(w_refwave1))
        print "w_refwave1 does not exist"
    endif
    string s_refwave2 = s_datafolder+"reference2"
    wave/Z w_refwave2 = $s_refwave2
    if (!WaveExists(w_refwave2))
        print "w_refwave2 does not exist"
    endif
    return w[0]*w_refwave1(x) + w[1]*w_refwave2(x) + w[2]
end

//generation of needed waves for testing
function f_generation_matrix()

    //make the needed waves
    Make/O/N=(40,40,800) test_matrix                                        //the wave that will be fittet
    Make/O/N=(800) reference1=2+2*exp(-((x-100)/10)^2)          //first reference
    Make/O/N=(800) reference2=1+3*exp(-((x-150)/25)^2)          //second reference
   
    //get the dimensions of the wave
    variable row_max, col_max, layer_max                               
    row_max=dimsize(test_matrix,0)
    col_max=dimsize(test_matrix,1)
    layer_max=dimsize(test_matrix,2)
   
    //make the waves to generate the test_matrix wave.
    //tempwave will contain the "spectrum" for each point, tempwave2 the prefactors for reference1 and tempwave3 for reference2
    //tempwave and tempwave 2 contains random numbers, tempwave2 only positive numbers
    make/D/O/N=(layer_max) tempwave
    make/D/O/N=(row_max,col_max) tempwave2=abs(enoise(5))
    make/D/O/N=(row_max,col_max) tempwave3=enoise(5)
   
   
    //loop to generate the tempwave
    variable r_index, c_index, l_index 
    for (r_index=0;r_index<row_max;r_index+=1)
        for (c_index=0;c_index<col_max;c_index+=1)
            tempwave=tempwave2[r_index][c_index]*reference1+tempwave3[r_index][c_index]*reference2
   
    //loop to write the tempwave into the test_matrix
            for (l_index=0;l_index<layer_max;l_index+=1)
                test_matrix[r_index][c_index][l_index]=tempwave[l_index]
            endfor
        endfor
    endfor
end

@johnweeks: In this case the compilation is very helpful, because it complains about a wrong contraints wave.

I have added notes for the tasks of each section.

But in itself the procedure would be as follows:
1) f_generation_matrix()


2) ThreadSafe: f_test(test_matrix)
or
3a) for the variant that is not threadsafe, create the w_export waves: 

    make/O/N=(40,40) w_export1
    make/O/N=(40,40) w_export2
    make/O/N=(40,40) w_export3
    make/O/N=(40,40) w_export4
    make/O/N=(40,40) w_export5

3b) f_test2(test_matrix2,w_export1,w_export2,w_export3,w_export4,w_export5,40,40,800,40,0)