Calculating Theil-Sen Estimator

I would like to calculate the Theil-Sen Estimator in an effort to analyze trends in electromyographic (EMG) activity recorded from rat muscles. The Theil-Sen Estimator is a robust linear regression approach that is insensitive outliers (see https://en.wikipedia.org/wiki/Theil–Sen_estimator). I would appreciate any advice about how to implement this in Igor Pro 8. The data are recorded in a one-dimensional Wave (single float 32 bit). I realize there are python and R packages that perform this calculation. However, I would like to incorporate this into a data analysis program I have developed in Igor Pro. Thank you in advance for your help.

Here is some code that I did not test.  Note that it does not handle the case where you have multiple y values for the same x value.  I hope this helps as a starting point.

Function theilSenEstimator(wave inX,wave inY)

    Variable numPoints=numpnts(inX)
    if(numPnts(inY)!=numPoints)
        doAlert 0,"Wave length mismatch."
        return 0
    endif
   
    Make/Free/D/N=(numPoints*(numPoints-1)) slopesWave=nan
    Variable i,j,count=0,dx
    for(i=0;i<numPoints;i+=1)
        for(j=i+1;j<numPoints;j+=1)
            dx=inX[j]-inX[i]
            if(dx!=0)
                slopesWave[count]=(inY[j]-inY[i])/dx
                count+=1
            endif
        endfor
    endfor
    return median(slopesWave)
End

 

Gosh AG ... Where's the MatrixOP version. :-)

I *think* (perhaps maybe possibly hopefully) that this might do the trick. My knowledge of MatrixOP in practice is ... sorely limited. And I did not test this (it is "off the top of my head" so to speak).

I imagine that the calculations could be sped up with ThreadSafe and by operating only on the diagonal elements. Finally, I have to wonder if the approach to creating a matrix would open the door to being able to calculate advanced statistics (uncertainties) more easily.

Function do_TheilSenEstimator(wave xarray, wave xarray)

    // no consistency check!!!
    variable npts = numpnts(xarray)
    variable meanslope

    make/O/FREE/D/N=(npts,npts) dxmatrix, dymatrix

    dxmatrix = calc_matrix(xarray, p, q)
    dymatrix = calc_matrix(yarray, p, q)

    MatrixOP/FREE/O meanslope = mean(replace((replace(dymatrix,0,NaN)/replace(dxmatrix,0,NaN)),NaN,0))

    return (meanslope)
end

Function calc_matrix(wave ww, variable pp, variable qq)

    variable vtrn

    vrtn = ww[pp] - ww[qq]
    return vrtn
end

 

In reply to by jjweimer

Now that you mention it, I do need information in addition to the slope including the the y-intercept (the 'b' in y= mx + b), confidence intervals, and residuals. I am particularly interested in identifying outliers in the data that likely reflect different sources of excitation to the muscle. Unfortunately, my linear algebra is quite rusty. Therefore, any help with this would be much appreciated.

I would hold on my proposed code. My excitement to imagine streamlining the two for-endfor loops into a (p,q) matrix is fraught with presumptions. After off-line discussion with AG, I see that it needs major revisions to correct mistakes in principles and implementation.

As for getting other statistical values, here are some thoughts:

* To obtain the mean intercept <b>, you will have to determine yi - <m>*xi for each point-pair and take the mean of the resulting array.

* Suppose that you have <m> and <b>. How would you look for "outliers"? My crude first-guess approach would be as follows:

- Subtract (<m>*xi + <b>) from all yi values to obtain yi*.
- Plot the data yi* in a histogram. This will reveal visually the distribution and skew in the data points from the proposed regression fit line.
- Obtain the mean <mu> and standard error <sigma> of the histogram values.
- Decide a cutoff in <sigma> to denote points as "outliers".

In reply to by jjweimer

Thank you for your follow up. This is very helpful. I agree that the distribution of the yi-<m>*xi values will yield <b>. However, again I would like to minimize the influence of outliers so I am thinking of using the median of the resulting array rather than the mean. And yes, calculating all of the yi* values is a requisite step in identifying outliers. I have an advantage in that the input data wave is a time series with a fixed delta-x (original sample rate 2Khz, data resampled to 100 hz). Thus, one criterion for a physiologically relevant outlier could be successive yi* values greater than the cutoff for a minimum period of time. In any case, this has given me lots of ideas for how to proceed. I very much appreciate your input.