Advice on use of Line fit statistic Pr and ChiSqr

I have experimental data - measured frequncy (in Hz) v time.  The time data have variations due to logging jitter.  The data would be expected to be linear over a portion of the measurement.  There are between 5 and 50 pairs to which I apply the line option in programmed CurveFit.  I use the slope to derive a further parameter.  Ideally I would like a number that told me how much I can depend on the paramater based on the line fit.  At some stage I will need a look-up table for probabilities for V_chisq or V_Pr.  Which is the right statistic to use?

If the time values are the independent variable, and they have errors, you should be using the /ODR=2 "errors in variables" method:

DisplayHelpTopic "Errors in Variables: Orthogonal Distance Regression"

In that case, the fit is inherently nonlinear even when fitting a line, and V_Pr is not generated. Instead, you get the usual coefficient sigmas and if you ask for it, a covariance matrix.

I have been through the help topic and examples given and not having used this aspect of curve fitting before, found them straight forward and easy to follow (so I have probably made mistakes:-)

I applied ODR to two sets of XY pairs of the same event from two different instruments with the SD from WaveStat (V_sdev)

The first set is 19 pairs and the second set 14 pairs.

function odr1()
// Two XY pairs sets of same event from diferent detectors
wave myD2_F = D2_F                          //Frequency Hz
wave myD2_Tcorr = D2_Tcorr              //Corrected Time (Igor DateTime)
wave mySP_F = SP_F
wave mySP_Tcorr = SP_Tcorr

//Duplicate Wave for weighting
duplicate /O myD2_F, D2_FW
duplicate /O myD2_Tcorr, D2_TcorrW
duplicate /O mySp_F, Sp_FW
duplicate /O mySp_Tcorr, Sp_TcorrW

// Fill weightind waves with SDs
waveStats /Q myD2_F
D2_FW = V_sdev
waveStats /Q myD2_Tcorr
D2_TcorrW = V_sdev

waveStats /Q mySP_F
SP_FW = V_sdev
waveStats /Q mySP_Tcorr
SP_TcorrW = V_sdev

// ODR Curvefit
CurveFit /ODR=2 line, myD2_F /X=myD2_Tcorr /I=1 /D /W=D2_Fw /XW=D2_TcorrW   /R

CurveFit /ODR=2 line, mySP_F /X=mySP_Tcorr /I=1 /D /W=SP_Fw /XW=SP_TcorrW /R

end

The first set fit converged properly whereas the second threw an error -"singular matrix of other numeric error"  When I arbitrarily removed the last points of the second set the fit converged.  The removed data looks quite acceptable.  Any clues?

In practice there are some 60 or more XY pairs with different lengths and slopes for analysis and curve fitting and an 'automated' approach is essential.

Here is the ouput with the reduced second set.

•odr1()
Fit converged properly
fit_D2_F= W_coef[0]+W_coef[1]*x
Res_D2_F= D2_F[p] - (W_coef[0]+W_coef[1]*D2_Tcorr[p])
W_coef={2.4156e+13,-6442.3}
V_chisq= 0.0131485;V_npnts= 19;V_numNaNs= 0;V_numINFs= 0;
V_startRow= 0;V_endRow= 18;
W_sigma={8.05e+12,2.15e+03}
Coefficient values ± one standard deviation
a   =2.4156e+13 ± 8.05e+12
b   =-6442.3 ± 2.15e+03
Fit converged properly
fit_Sp_F= W_coef[0]+W_coef[1]*x
Res_Sp_F= Sp_F[p] - (W_coef[0]+W_coef[1]*SP_Tcorr[p])
W_coef={2.5089e+13,-6691.2}
V_chisq= 0.0102483;V_npnts= 13;V_numNaNs= 0;V_numINFs= 0;
V_startRow= 0;V_endRow= 12;
W_sigma={1.02e+13,2.73e+03}
Coefficient values ± one standard deviation
a   =2.5089e+13 ± 1.02e+13
b   =-6691.2 ± 2.73e+03

I have also read the commentry from John Weeks in a thread from 01/10/2013  - "Orthogonal Distance Regression - weight with two SDs".  How would I go ablout determining if the slopes of the lines within each others error bands ?

I read the John's comment  "One complication of all this stuff is that the ODR fitting technique is inherently nonlinear, so normal statistics aren't exact"  and I am not sure how to move forward with analysis of the line fit....

In reply to by Mike German

I may be misunderstanding something here, but it looks as though you are filling the weighting waves with the standard deviation of the whole dataset, whereas I would have thought the weighting should be the (estimate of) the standard deviation for each data point, on a point by point basis?

If you had multiple independent measurements of each data point then you could use the standard deviation of these replicates to estimate the uncertainty of a given data point, and fill the weighting wave with these.

Have you tried adding these weighting (st dev) as error bars on a x-y plot so that you can see whether they look reasonable?

Hallo Kurt,

Thnaks for the relpy. You may well be right but I do not understand (and cant find it in the help files - but it may be there) how to generate the SD for weighting needed when starting with raw data of unknown errors.

I can look at the data on a one off basis but I really need to operate in a batch mode and the convergence/non-convergence when I remove a data pair doesn't seem right

In reply to by Mike German

Hi Mike,

The problem here is your errors are unknown, but you need to somehow be able to estimate these 'unknown' errors because these are the weighting values that you need.

One approach is if you can a priori assume that a straight line is a good fit, then fit this line and calculate the residuals (this is an option in the built-in CurveFit. Then, further assuming the errors are, on average, uniform across your dataset, calculate the sd of the residuals and use this value.

Hope this helps!

1) I presume "DP" and "SP" are double precision and single precision. Basically, you should never use a single-precision coefficient wave. Igor copies the other waves into double-precision arrays, but the coefficient wave actually participates in the fit.

2) in ODR fitting, the X and Y weights do two things: as in non-ODR fitting, the sigmas are computed using those weights, which makes the chi-square into something that is useful. But in ODR fitting it also establishes a geometric scale factor. If the slope is extreme, like 10^3 or 10^-3, the perpendiculars from the solution to your data points are going to be either nearly vertical or nearly horizontal, and the fit loses any meaningful information in the horizontal or vertical direction. But the perpendicular distance is computed using the weighted data, not the raw data. That means that if your slopes aren't near 1, you really need some sort of weighting to establish the 2D geometrical scaling. The span of the X and Y values probably is a reasonable ad-hoc starting point for situations where you don't have a good estimate of actual errors. It's been a while since I looked at this in detail... With normal least-squares, you can assume that the (chi-square / N) should be 1, and then use (chi-square / N) as a factor to adjust sigma's. That's common when you need weighting (you know that there is a measurement error that is proportional to log(Y) for instance) but you only know it to within a constant multiplier. I would have to do some research to figure out the proper measure for ODR fitting, but something similar should be possible.

1) I presume "DP" and "SP" are double precision and single precision. Basically, you should never use a single-precision coefficient wave. Igor copies the other waves into double-precision arrays, but the coefficient wave actually participates in the fit.

The prefix D1 and SP refer to the instruments that logged the data.  The X data are Igor Date/Time and of course in double precision. However, the Y data are in single precision. Thanks for pointing this out; I'll correct this

If the slope is extreme, like 10^3 or 10^-3, the perpendiculars from the solution to your data points are going to be either nearly vertical or nearly horizontal, and the fit loses any meaningful information in the horizontal or vertical direction.

For this event with D1, the Y data spans, from ~ 1300 to 40 Hz and the X data, from 3749504289.114 to 3749504289.308 (25/10/2022 00:58:09.114 to 25/10/2022 00:58:09.308). The slope is ~ 6400 Hz/s.

How does one know that the correct weighting has been used?  Or perhaps more importantly the other way round, how does one know with a given weighting that the best straight line fit has been achieved.

In reply to by Mike German

Mike German wrote:

1) I presume "DP" and "SP" are double precision and single precision. Basically, you should never use a single-precision coefficient wave. Igor copies the other waves into double-precision arrays, but the coefficient wave actually participates in the fit.

The prefix D1 and SP refer to the instruments that logged the data.  The X data are Igor Date/Time and of course in double precision. However, the Y data are in single precision. Thanks for pointing this out; I'll correct this

Oops. I read "SP" and then my eyes automatically registered "DP" instead of "D1" or "D2". I must be a programmer :)

Quote:

If the slope is extreme, like 10^3 or 10^-3, the perpendiculars from the solution to your data points are going to be either nearly vertical or nearly horizontal, and the fit loses any meaningful information in the horizontal or vertical direction.

For this event with D1, the Y data spans, from ~ 1300 to 40 Hz and the X data, from 3749504289.114 to 3749504289.308 (25/10/2022 00:58:09.114 to 25/10/2022 00:58:09.308). The slope is ~ 6400 Hz/s.

How does one know that the correct weighting has been used?  Or perhaps more importantly the other way round, how does one know with a given weighting that the best straight line fit has been achieved.

Without an independent way to estimate actual errors, you can't really tell if you have "correct" weighting. In some cases, it is reasonable to assume that the errors are a proportion of the value span. You will get the same fit with the X and Y weights multiplied by the same factor, but the estimated coefficient errors will scale with that multiplier. For an ordinary least squares fit, if you have the correct weights, the chi-square as Igor reports it should be about N (number of data points). I don't know how that compares to ODR fitting, or how to combine the two sets of weights, but I note that if you multiply the weights by 10, the solution is the same, and the sigmas go up by a factor of 10, and chi-square by a factor of 100.

I would say that it would be reasonable to set the X weights to 1300 - 40 = 1260 and the Y weights to 3749504289.308 - 3749504289.114 = 0.194. Be sure to use the /I=1 flag- that's Igor's historical oddity that says that the weights are SD, not 1/SD. You can get outputs of both X and Y residuals from an ODR fit, that might give you some indication of the actual distribution of errors.

OK I am now using double precision for both X and Y data and the spans of X and Y for the weighting waves. D2_Fdp and SP_Fdp are the two Y waves.  From history area:-

Fit converged properly
fit_D2_Fdp= W_coef[0]+W_coef[1]*x
Res_D2_Fdp= D2_Fdp[p] - (W_coef[0]+W_coef[1]*D2_Tcorr[p])
W_coef={2.4156e+13,-6442.4}
V_chisq= 0.00123225;V_npnts= 19;V_numNaNs= 0;V_numINFs= 0;
V_startRow= 0;V_endRow= 18;
W_sigma={2.63e+13,7.02e+03}
Coefficient values ± one standard deviation
a   =2.4156e+13 ± 2.63e+13
b   =-6442.4 ± 7.02e+03

Fit converged properly
fit_SP_Fdp= W_coef[0]+W_coef[1]*x
Res_SP_Fdp= SP_Fdp[p] - (W_coef[0]+W_coef[1]*SP_Tcorr[p])
W_coef={2.478e+13,-6608.9}
V_chisq= 0.000635881;V_npnts= 13;V_numNaNs= 0;V_numINFs= 0;
V_startRow= 0;V_endRow= 12;
W_sigma={3.01e+13,8.03e+03}
Coefficient values ± one standard deviation
a   =2.478e+13 ± 3.01e+13
b   =-6608.9 ± 8.03e+03

As mentioned elsewhere the sigmas are clearly not right. Using the above weighting it was necessary to remove an outlier on the SP_Fdp data (point 6) on the previous "XY data with non-ODR fit" atatchment to get convergence!  I am not sure how I will overcome this editing without eye-balling which isn't practical for batch processing.  The fits and residuals are at the attachment

For comparison the same data using non-ODR curvefit and no weighting. The error estiamtes (sigmas) are sensible

D1_Fdp
y= W_coef[0]+W_coef[1]*x
W_coef={2.4156e+13,-6442.5}
V_chisq= 3867.81;V_npnts= 19;V_numNaNs= 0;V_numINFs= 0;
V_startRow= 0;V_endRow= 18;V_q= 1;V_Rab= -1;V_Pr= -0.99927;
V_r2= 0.998583;
W_sigma={2.24e+11,59.7}
Coefficient values ± one standard deviation
a   =2.4156e+13 ± 2.24e+11
b   =-6442.5 ± 59.7

SP_Fdp
y= W_coef[0]+W_coef[1]*x
W_coef={2.478e+13,-6608.8}
V_chisq= 985.968;V_npnts= 13;V_numNaNs= 0;V_numINFs= 0;
V_startRow= 0;V_endRow= 12;V_q= 1;V_Rab= -1;V_Pr= -0.99953;
V_r2= 0.999084;
W_sigma={2.29e+11,61.1}
Coefficient values ± one standard deviation
a   =2.478e+13 ± 2.29e+11
b   =-6608.8 ± 61.1

I find it difficult to see the benefit of using ODR other than it is teh right way to do it.:-)

Can you post your Igor experiment file? Or send it to support@wavemetrics.com?

The large sigmas are probably because using the data spans as weight is a considerable over-estimation of measurement errors. It *probably* gives you X and Y weights that are nearly proportional.

In reply to by Mike German

Mike German wrote:

Ideally I would like a number that told me how much I can depend on the paramater based on the line fit.

Hi Mike,

Here is a thought - not sure about the robustness of the stats though - can you fit you data to both a least squares on y and a least squares on x, and then use the difference in the fitted slopes to determine some measure of the uncertainty in the (fitted) slope of the data (which could be estimated as the average of these two slopes)?

In case it helps to visualize this, below is some code I wrote many years ago to generate some example plots when writing a curve fitting tutorial for my work colleagues. In the generated graph, green is least-squares y, blue is least-squares x, and red is ODR.

Function LineFitDemo()
NewDataFolder/S/O root:LineFitDemo

SetRandomSeed 0.93

Make/O/D/N=(10) wDataX, wDataY
wDataY[] = 95.5 + 1 * p + (gnoise(1.5))
wDataX[] = 95.5 + 1 * p + (gnoise(1.5))

DoWindow/F PlotwDataXY
if (V_flag == 0)
Display/N=PlotwDataXY/W=(35,35,435,435) wDataY vs wDataX
endif
ModifyGraph mode=3
ModifyGraph marker(wDataY)=41, mrkThick(wDataY)=1.5
ModifyGraph grid=2,mirror=1,nticks=5, sep=5
ModifyGraph minor=1
SetAxis left 90,110
SetAxis bottom 90,110
SetAxis left 94,106
SetAxis bottom 94,106
ModifyGraph height={Aspect,1}
Label left "y"
Label bottom "x"

// X=Y
Make/O/D/N=2 wXY
wXY = {90,110}
AppendToGraph wXY vs wXY
ModifyGraph mode(wXY)=0, lstyle(wXY)=3,rgb(wXY)=(24576,24576,65280)

Make/O/D/N=(1,2) wMeanXY
SetDimlabel 1, 0, x, wMeanXY
SetDimlabel 1, 1, y, wMeanXY
wMeanXY[0][%x] = mean(wDataX)
wMeanXY[0][%y] = mean(wDataY)
AppendToGraph wMeanXY[][%y] vs wMeanXY[][%x]
ModifyGraph mode(wMeanXY)=3
ModifyGraph marker(wMeanXY)=60, mrkThick(wMeanXY)=1, msize(wMeanXY)=5

// Resid
Variable i, vNum, va, vb
// Least sq y on x
StatsLinearRegression/T=0/PAIR wDataX, wDataY
Wave/D W_StatsLinearRegression
Duplicate/D/O W_StatsLinearRegression, wSLRxy
va = wSLRxy[0][%a]
vb = wSLRxy[0][%b]
Make/O/D/N=(2,2) wYonX
SetDimlabel 1, 0, x, wYonX
SetDimlabel 1, 1, y, wYonX
wYonX[0][%y] = 90
wYonX[1][%y] = 110
wYonX[][%x] = (wYonX[p][%y] - va) / vb

AppendToGraph wYonX[][%y] vs wYonX[][%x]
ModifyGraph lsize(wYonX)=2
ModifyGraph rgb(wYonX)=(0,52224,0)

vNum = DimSize(wDataX, 0)
Make/O/D/N=(3*vNum,2) wResYonX
SetDimlabel 1, 0, x, wResYonX
SetDimlabel 1, 1, y, wResYonX
wResYonX[][]=NaN
for(i=0;i<vNum;i+=1)
wResYonX[3*i,3*i+1][%x] = wDataX[i]
wResYonX[3*i][%y] = wDataY[i]
wResYonX[3*i+1][%y] = va + vb * wDataX[i]
endfor
AppendToGraph wResYonX[][%y] vs wResYonX[][%x]
ModifyGraph rgb(wResYonX)=(0,52224,0), lsize(wResYonX)=2.5

// Least sq x on y
StatsLinearRegression/T=0/PAIR wDataY, wDataX
Wave/D W_StatsLinearRegression
Duplicate/D/O W_StatsLinearRegression, wSLRyx
va = -wSLRyx[0][%a] / wSLRyx[0][%b]
vb = 1 / wSLRyx[0][%b]
Make/O/D/N=(2,2) wXonY
SetDimlabel 1, 0, x, wXonY
SetDimlabel 1, 1, y, wXonY
wXonY[0][%x] = 90
wXonY[1][%x] = 110
wXonY[][%y] = va + vb * wXonY[p][%x]

AppendToGraph wXonY[][%y] vs wXonY[][%x]
ModifyGraph lsize(wXonY)=2
ModifyGraph rgb(wXonY)=(0,0,52224)

vNum = DimSize(wDataX, 0)
Make/O/D/N=(3*vNum,2) wResXonY
SetDimlabel 1, 0, x, wResXonY
SetDimlabel 1, 1, y, wResXonY
wResXonY[][]=NaN
for(i=0;i<vNum;i+=1)
wResXonY[3*i,3*i+1][%y] = wDataY[i]
wResXonY[3*i][%x] = wDataX[i]
wResXonY[3*i+1][%x] = (wDataY[i] - va) / vb
endfor
AppendToGraph wResXonY[][%y] vs wResXonY[][%x]
ModifyGraph rgb(wResXonY)=(0,0,52224), lsize(wResXonY)=2.5

// Orthogonal Distance Regression (Total Least Squares)
CurveFit/ODR=2 line, wDataY/X=wDataX
Wave/D W_coef
va = W_coef[0]
vb = W_coef[1]
Make/O/D/N=(2,2) wODRXY
SetDimlabel 1, 0, x, wODRXY
SetDimlabel 1, 1, y, wODRXY
wODRXY[0][%y] = 90
wODRXY[1][%y] = 110
wODRXY[][%x] = (wODRXY[p][%y] - va) / vb
AppendToGraph wODRXY[][%y] vs wODRXY[][%x]
ModifyGraph lsize(wODRXY)=2
ModifyGraph rgb(wODRXY)=(52224,0,0)

vNum = DimSize(wDataX, 0)
Make/O/D/N=(3*vNum,2) wResODRXY
SetDimlabel 1, 0, x, wResODRXY
SetDimlabel 1, 1, y, wResODRXY
wResODRXY[][]=NaN
for(i=0;i<vNum;i+=1)
wResODRXY[3*i][%x] = wDataX[i]
wResODRXY[3*i][%y] = wDataY[i]
wResODRXY[3*i+1][%x] = (wDataY[i]+wDataX[i]/vb-va) / (vb + 1/vb)
wResODRXY[3*i+1][%y] = wDataY[i] + wDataX[i]/vb - wResODRXY[3*i+1][%x] / vb
endfor
AppendToGraph wResODRXY[][%y] vs wResODRXY[][%x]
ModifyGraph rgb(wResODRXY)=(52224,0,0), lsize(wResODRXY)=2.5

Variable vAngLsY, vAngLsX
vAngLsY = atan(  wSLRxy[0][%b]) //*180/pi
vAngLsX = atan(1/wSLRyx[0][%b]) //*180/pi
print vAngLsY, vAngLsX
Make/O/D/N=(5,3) wArc
SetDimlabel 1, 0, x, wArc
SetDimlabel 1, 1, y, wArc
SetDimlabel 1, 2, a, wArc
wArc[][%a] = vAngLsY + p / (DimSize(wArc, 0) - 1) * (vAngLsX - vAngLsY)
wArc[][%x] = wMeanXY[0][%x] + 6 * cos(wArc[p][%a])
wArc[][%y] = wMeanXY[0][%y] + 6 * sin(wArc[p][%a])
AppendToGraph wArc[][%y] vs wArc[][%x]
ModifyGraph rgb(wArc)=(0,0,0), lsize(wArc)=4
Tag/C/N=text0/A=RT/L=0/F=0/B=1 wArc, 1,"\Z14\[0\F'Symbol'a\]0\f01"
End

Nice demo, KurtB!

Please note that the demo has data with a slope that is nearly 1, which avoids all this bother with the weighting.

There is a demonstration of the importance of the weighting in our documentation: DisplayHelpTopic "ODR Fitting Examples" I used error bars to represent the XY residuals in those examples. I like KurtB's use of lines to the fit line much better; perhaps I can revise my documentation.

I draw your attention to the examples of exponential fits with and without weighting. I think that makes a pretty good visualization of the importance of weighting.

As I mentioned previously, at this point it may be more useful for me to have your actual data that I can work with myself.

Yes John, agreed it is a nice demo from Kurt.  Thanks for sharing Kurt it does nicely show the construction.

I have sent a sample of my real data to John as requested.

Not withstanding the recommendations on how to do robust regression fitting, is your system amenable to an analytical approach to determine an uncertainty budget? The method uses (linear) propagation of uncertainty.

https://en.wikipedia.org/wiki/Propagation_of_uncertainty

By example, when your parameter P depends on slope m as a function P(m), I would imagine doing a derivation to obtain the uncertainty DP as a function of the uncertainty Dm in a form (DP/P)^2 = f((Dm/m)^2).