# Monte Carlo analysis to calculate number of points for a good fit

tdgordon

Hi,

I am fitting data from a physical process that decays exponentially, and I have a lot of these decays every second (10,000 decays/s). Ultimately we need to know the time constant of the exponential. We want to minimize our computational time, so we would like to determine the minimum number of points required for a "good" fit (I know the term is ambiguous). If someone knows an explicit analytical approach, please let me know; I haven't found one.

Thus, I would like to model it by generating a series of "experiments" in which exponentially decreasing data with random noise are compared against the true values (provided by a simple exponential equation). Each experiment would use a different number of samples (i.e., data points). Then I'd calculate the goodness of the fits to the underlying model (the exponential equation) and see how far off the fits are as a function of the number of samples in an experiment. Obviously, the fidelity of the fit improves with the number of samples, but the idea is to provide some guidance about how many points is good enough. Good enough in this context might mean a fitted time constant within 5% of the true time constant.

Will 10 points get us a fit that is good enough? Or do we need 10,000 points? It seems that what would actually be required is to run multiple experiments for a given number of samples (say, 1000 Monte Carlo trials to see what the range of fit quality is for n = 10 samples), and then repeat this for many different values of n.

Seems like I can automate this in Igor by using this as a building block:

```
Make/N=1e4 noiseWave = expNoise(10)
Display/K=0 root:noiseWave
Make/N=100/O W_Hist
Histogram/P/B={0,1,100} noiseWave, W_Hist
Display/K=0 root:W_Hist
CurveFit/M=2/W=0 exp, W_Hist/D
```

This generates 10,000 samples drawn from an exponential distribution with mean 10, so the histogram decays with a rate of 0.1 (i.e., a time constant of 10). It then plots a histogram of the points and does a curve fit. The fit statistics give you the fitted time constant with its standard deviation. If you repeat this 1000 times, I would think you could get a range of time constants that would bound what you might expect if you were to sample 10,000 points from the decay.

Does this approach make sense? Are there any Igor subtleties that I'm overlooking? Anything that would make this easier to do?

Thanks.


Theoretically, you only need as many data points as there are free parameters (here: 3). Note that such a 'fit' will always be exact!

Do I understand your approach correctly? You want to create a series of dummy data sets, test the quality of the fit against the number of data points in the fit (the histogram wave), and determine a critical n below which things go bad. Put your code in a loop, store the time constant in a separate wave, and display it versus n. I'd cross-check this result with about 10 real data sets (or start directly from there).
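A minimal sketch of such a loop (the function name, true time constant, noise level, and x range here are assumptions; adjust them to match your data):

```
// Hypothetical helper: fit nTrials noisy decays of nPoints each, store the fitted tau
Function MonteCarloTau(nPoints, nTrials)
	Variable nPoints, nTrials
	Variable trueTau = 10                         // assumed true time constant
	Make/O/N=(nTrials) tauResults
	Variable i
	for(i = 0; i < nTrials; i += 1)
		Make/O/N=(nPoints) simData
		SetScale/I x 0, 5*trueTau, simData        // sample out to five time constants
		simData = exp(-x/trueTau) + gnoise(0.05)  // noise level is an assumption
		CurveFit/Q exp, simData                   // built-in fit: y0 + A*exp(-invTau*x)
		Wave W_coef
		tauResults[i] = 1/W_coef[2]               // convert fitted invTau to tau
	endfor
End
```

A histogram of tauResults then gives the spread of fitted time constants for that n.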

A quite fast way to get the time constant could be to solve the exact three-point equation system for the time constant (too lazy at the moment):

y1=A+B*exp(-C*x1), y2=... --> C=f(x1,x2,x3,y1,y2,y3) ,

calculate C for a couple (100?) fixed or random data points, and use the median (better) or average (faster) of C as result.
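For three equally spaced pairs (x_2 - x_1 = x_3 - x_2 = \Delta), the system does solve in closed form, because the offset A and amplitude B cancel out of successive differences:

```latex
\frac{y_1 - y_2}{y_2 - y_3}
= \frac{B\,e^{-C x_1}\left(1 - e^{-C\Delta}\right)}{B\,e^{-C x_2}\left(1 - e^{-C\Delta}\right)}
= e^{C\Delta}
\quad\Longrightarrow\quad
C = \frac{1}{\Delta}\,\ln\frac{y_1 - y_2}{y_2 - y_3}
```

A and B then follow by back-substitution into any two of the three equations.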

Maybe data reduction and fitting is fast enough.

Testing several waves against each other might be slow, just my feeling.

Avoid displaying things (graphs, info panels, etc.) during the computation.

HJ

PS: 10,000 events per second at 1000 data points each leaves you with a sampling rate of ~10 MS/s. That rather calls for a dedicated hardware solution if you want it done online...

August 2, 2017 at 12:03 am - Permalink

One area I would investigate for an analytic model is the Cramér-Rao lower bound (CRLB) for the variance of the unknown decay coefficient. The classic reference I am partial to is "Detection, Estimation, and Modulation Theory, Part I" by H. L. Van Trees.
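As a sketch of what the bound gives here (assuming a known amplitude A, decay rate \lambda, sample times t_i, and i.i.d. Gaussian noise of variance \sigma^2 on y_i = A e^{-\lambda t_i} + n_i), the Fisher information and the resulting bound are:

```latex
I(\lambda) = \frac{A^{2}}{\sigma^{2}} \sum_{i=1}^{N} t_i^{2}\, e^{-2\lambda t_i},
\qquad
\operatorname{var}\!\left(\hat{\lambda}\right) \;\ge\; \frac{1}{I(\lambda)}
```

Evaluating the bound as a function of N gives an analytical floor against which Monte Carlo results can be compared.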

August 2, 2017 at 02:54 am - Permalink

http://www.igorexchange.com/node/7844

I call it empirical analysis of the uncertainty budget on a fitting term. A few thoughts come to mind from my experience.

* Have you considered deriving the analytical side of the uncertainty budget? Write the expression for your measured value as a sum of theory and noise, M = T + N. Expand T and solve back for the time constant, \tau = f(M - N, ...). Finally, apply linear propagation (using partial derivatives) to obtain a relative uncertainty expression (\Delta\tau / \tau)^2 = f(...). The result might give you insight, if nothing else an empirical metric for defining "goodness of fit".

* On the Igor Pro side, do you have experience in creating control panels? I can see a panel that allows a user to set a theoretical decay constant, the relative level of noise, a fitting range, and (optionally) the number density of points in the fitting range (the uncertainty budget should depend approximately on N^2, so this is a reasonable test). The panel would have a button to "Generate Result", putting the values of \Delta\tau / \tau and (\tau_{fit} - \tau_{theory})/\tau_{theory} in the history. I've done this for the Igor Pro analysis that I am carrying out on my equation.
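For reference, the linear-propagation step in the first point has the generic form (assuming uncorrelated input quantities x_i):

```latex
\left(\frac{\Delta\tau}{\tau}\right)^{2}
= \frac{1}{\tau^{2}} \sum_i \left(\frac{\partial \tau}{\partial x_i}\right)^{2} \left(\Delta x_i\right)^{2}
```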

--

J. J. Weimer

Chemistry / Chemical & Materials Engineering, UAH

August 2, 2017 at 06:34 am - Permalink

I think your reiteration of my approach is almost correct. Let me try again:

Use a pure exponential function as the reference or true data, e.g., y = exp(-0.1*x); for shorthand, let's call this function Herbert. In the little code snippet I included in the original post, I was using expNoise to generate the dummy data sets. That probably needs to change so I can match the noisiness of my physical data, so I'd instead create dummy data based on Herbert with added noise. The procedure would be:

1. Generate a dummy data set with a user-specified number of points (n): Herbert plus noise.
2. Fit an exponential function to the dummy data and extract its time constant.
3. Calculate the squared difference between the fitted time constant and Herbert's time constant, and save it in a wave.
4. Repeat steps 1-3 a large number of times for the given n to build up a large sample of squared differences.
5. Repeat all of the above for different values of n.

The final output would be a plot of the median (± stdv) squared difference between fitted and true time constant as a function of sample size.
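A self-contained Igor sketch of this whole procedure (the function name, sample sizes, noise level, and x range are assumptions to be adjusted):

```
// Hypothetical driver: scan sample sizes, record the median squared tau error
Function ScanSampleSizes()
	Variable trueTau = 10                           // Herbert's time constant (assumed)
	Variable nTrials = 1000
	Make/O nList = {10, 30, 100, 300, 1000}         // sample sizes to test (assumed)
	Make/O/N=(numpnts(nList)) medianSqDiff
	Variable i, j
	for(i = 0; i < numpnts(nList); i += 1)
		Make/O/N=(nTrials) sqDiff
		for(j = 0; j < nTrials; j += 1)
			Make/O/N=(nList[i]) dummy
			SetScale/I x 0, 5*trueTau, dummy
			dummy = exp(-x/trueTau) + gnoise(0.05)  // noise level is an assumption
			CurveFit/Q exp, dummy                   // y0 + A*exp(-invTau*x)
			Wave W_coef
			sqDiff[j] = (1/W_coef[2] - trueTau)^2
		endfor
		medianSqDiff[i] = StatsMedian(sqDiff)
	endfor
End
```

Plotting medianSqDiff versus nList (log-log) should show how quickly the fit quality improves with n.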

We have dedicated hardware that samples up to 25 MS/sec.

August 2, 2017 at 09:45 am - Permalink

```
Make/O/N=1000 simulatedData                  // create the wave first; pick N as needed
SetScale/I x 0, 1, simulatedData             // change 0,1 to the appropriate range
simulatedData = exp(-x/tau) + gnoise(0.1)    // change the parameters as required (tau must be defined)
```

Now simulatedData contains an exponential with decay constant tau and normally distributed noise with SDEV = 0.1. You can now fit it. Do it many times to get an idea of how the fits vary.

John Weeks

WaveMetrics, Inc.

support@wavemetrics.com

August 2, 2017 at 09:54 am - Permalink


Are there any other tools in Igor that would help with the overall project? I suppose it's easy enough to just loop through things to get the Monte Carlo results.

August 2, 2017 at 10:04 am - Permalink


The demo has a control panel to input Ao, tau, noise, start, end, and reduction factor. A button will do the fit. The relative accuracy and precision for tau are returned in the history window.

--

J. J. Weimer

Chemistry / Chemical & Materials Engineering, UAH

August 3, 2017 at 07:38 am - Permalink


Start is the starting point of the fit. It can be set from 0 to 400. End is the ending point of the fit. It can be set from 500 to 1000 (the end of the data set).

Reduction factor is used to decrease the number of points in the "data" wave. It enters the calculation

`ftotal = mod(p,redf) == 0 ? (theory + noise) : NaN`

For example, when redf = 1, all points are used. When redf = 2, every other point is set to NaN. When redf = 10, only every 10th point is used.

I guess you meant end being set to 500. It cannot be decreased; it can only be increased. Alternatively, when you know how to edit a control panel, you can change the settings to whatever you want (hence my question under point 2 of my original post).

--

J. J. Weimer

Chemistry / Chemical & Materials Engineering, UAH

August 3, 2017 at 08:32 am - Permalink

The initial estimator for lambda can be found by solving an equation obtained from the first derivative of the log-likelihood function. In the attached pxp I have treated it as a user-specified value. It should be interesting to compare this model with your Monte Carlo simulation results. The CRLB is most useful in ranges where the S/N is not too low.

EDIT: I have checked the CRLB with my own Monte Carlo analysis. Using sigma=0.02, the Monte Carlo points lie on my CRLB traces for low numbers of fitting points, but then roll over. From top trace to bottom the rollovers are >20, 15, and 10 fitting points, respectively. The MC roll-over locations will change with choice of sigma; the CRLB traces are normalized to sigma and will not change.

EDIT 2: It is probably worth pointing out that my Monte Carlo procedure uses the noisy data to calculate the MAP estimator for lambda, followed by a variance() calculation about that estimator. It does not use the a priori value of lambda (used in constructing the noisy data) in finding the variance. Although not universally true, in this instance the MAP and Minimum Least Squares estimators and variances are equivalent.

August 15, 2017 at 10:51 am - Permalink

I am not sure your approach will directly address your data. You are assuming that your fits are limited by noise in the Y axis.

This assumption and corresponding MC exploration may not uncover all pitfalls. What about errors in your X? What about deviations from ideal exponential behaviour, e.g. when two processes contribute to the decay?

My suggestion: Use your own data. Take a few decays as a test data set. Fit them as well as you can (automated). Then delete every 10th point, and repeat. Then delete every 8th point, 5th, 3rd, 2nd point and repeat. Plot error-in-time-constant vs deleted-points to see how far you can go. "Error" is defined by using the "as well as you can" case as the reference.

This way, in your final report, you can just say "only every 2nd point was fitted to accelerate the computation, which had <0.1% influence on the fitted results."
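A minimal Igor sketch of this decimation test (the function name is an assumption; Igor's curve fitting ignores NaN points, so setting skipped points to NaN removes them from the fit):

```
// Hypothetical helper: fit only every step-th point of an existing decay wave
Function FitDecimated(dataWave, step)
	Wave dataWave
	Variable step
	Duplicate/O dataWave, wDec
	wDec = mod(p, step) == 0 ? dataWave[p] : NaN   // skipped points become NaN
	CurveFit/Q exp, wDec                           // NaNs are ignored by the fit
	Wave W_coef
	return 1/W_coef[2]                             // fitted time constant
End
```

Calling FitDecimated(myDecay, 2) fits every other point; compare the returned time constant against the full fit to get the error-in-time-constant described above.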

j

August 25, 2017 at 08:31 am - Permalink


The approach will be valid for obtaining the contribution of the number of fitting points to the overall uncertainty budget with all else held constant. That in itself is an important measurement.

I recommend plotting RELATIVE uncertainty squared as a function of number of points squared. When the fit parameter is K, you obtain K ± \Delta K from Igor Pro. Plot (\Delta K / K)^2 versus N^2.

I also suggest that using simulated data is still OK. To obtain a true MC-type analysis, regenerate the data set multiple times (say M times) with the same (relative) gaussian noise level and repeat the fit using the same number of points. You can then calculate a mean \bar{K} and a weighted uncertainty S_K over the M fits. Your final plot will be U_K^2 = (S_K/\bar{K})^2 versus N^2.
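One conventional choice for these weighted quantities (an assumption; other weightings are possible) is inverse-variance weighting by the per-fit uncertainties \Delta K_m:

```latex
w_m = \frac{1}{\left(\Delta K_m\right)^{2}},
\qquad
\bar{K} = \frac{\sum_{m=1}^{M} w_m K_m}{\sum_{m=1}^{M} w_m},
\qquad
S_K^{2} = \frac{\sum_{m=1}^{M} w_m \left(K_m - \bar{K}\right)^{2}}{\sum_{m=1}^{M} w_m}
```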

Again, I recommend working in relative uncertainty space. The language becomes ... "Only half the data set was fit. This was estimated by MC analysis to increase the overall relative uncertainty on the fitting constant K by X%". The value X% is the difference between using N points and using N/2 points.

I believe the result should show that U_K has an approximate \sqrt{N} dependence.

I am undertaking a comparable approach to do MC analysis of fitting curves, so I have a bit of depth in this area already. The concepts to appreciate are based on the term uncertainty budget and the approach to construct an equation for it theoretically using linear propagation of uncertainties.

Contact me off-line for information. I can point you to a paper that is in review and analysis that a student of mine is doing exactly along the lines you are attempting.

--

J. J. Weimer

Chemistry / Chemical & Materials Engineering, UAH

November 3, 2017 at 07:24 am - Permalink