Monte Carlo analysis to calculate number of points for a good fit

Hi,

I am fitting data from a physical process that decays exponentially, and I have a lot of these decays every second (10,000 decays/s). Ultimately we need to know the time constant of the exponential. We want to minimize our computational time, so we would like to determine the minimum number of points required for a "good" fit (I know the term is ambiguous). If someone knows an explicit analytical approach, please let me know. I haven't found one. Thus, I would like to model it by generating a series of "experiments" in which exponentially decreasing data with random noise are compared against the true values (provided by a simple exponential equation). Each experiment would utilize a different number of samples (i.e., data points). Then I'd like to calculate the goodness of the fits to the underlying model (the exponential equation) and see how far off the fits are as a function of the number of samples in an experiment. Obviously, the fidelity of the fit improves with the number of samples, but the idea is to provide some guidance about how many points is good enough. Good enough in this context might mean a time constant that is within 5% of the true time constant. Will 10 points get us a fit that is good enough? Or do we need 10,000 points? It seems that what would actually be required is to run multiple experiments for a given number of samples (so, 1000 Monte Carlo trials to see what the range of fit quality is for n=10 samples, for example). Then repeat this for a lot of different values for n.

Seems like I can automate this in Igor by using this as a building block:
Make/n=1e4 noiseWave = expNoise(10)          // 10,000 exponentially distributed values with mean 10
Display/K=0 root:noiseWave
Make/N=100/O W_Hist
Histogram/P/B={0,1,100} noiseWave, W_Hist    // probability-normalized histogram: 100 bins of width 1
Display/K=0 root:W_Hist
CurveFit/M=2/W=0 exp, W_Hist/D               // fit y0 + A*exp(-invTau*x) to the histogram

This picks 10,000 exponentially distributed noisy points whose histogram decays with a rate constant of 1/10 (i.e., a time constant of 10). Then I plot a histogram of the points and do an exponential curve fit. The fit stats give you the fitted time constant with its standard deviation. If you repeat this 1000x, I would think you could get a range of time constants that would bound what you might expect if you were to sample 10,000 points from the decay.

Does this approach make sense? Are there any Igor subtleties that I'm overlooking? Anything that would make this easier to do?

Thanks.
Could you share a typical data set? I guess that the number of required data points for the fit will depend on the noise in your real signal.
Theoretically, you only need as many data points as there are free parameters (here: 3). Note that such a 'fit' will always be exact!

Do I understand your approach right: you want to create a series of dummy data sets, test the quality of the fit against the number of data points in your fit (the histogram wave), and determine a critical n where things go bad. Put your code in a loop, store the time constant in a separate wave, and display that one versus n. I'd cross-check this result with, say, 10 real data sets (or start directly from there).

A quite fast way to get the time constants could be to solve the exact system of equations for three xy pairs for the time constant (too lazy at the moment):
y1 = A + B*exp(-C*x1), y2 = ..., y3 = ...  -->  C = f(x1, x2, x3, y1, y2, y3),
then calculate C for a couple (100?) of fixed or random point triplets and use the median (better) or average (faster) of C as the result.
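
For equally spaced samples (x2 - x1 = x3 - x2 = dx) the algebra collapses nicely: the pair differences cancel A and their ratio cancels B, giving C = ln((y1 - y2)/(y2 - y3)) / dx. A minimal Igor sketch of that idea (the function name and the equal-spacing assumption are mine):

Function ThreePointDecayRate(y1, y2, y3, dx)    // sketch only; y = A + B*exp(-C*x) sampled at uniform spacing dx
    Variable y1, y2, y3, dx

    Variable ratio = (y1 - y2) / (y2 - y3)      // equals exp(C*dx) for uniform spacing
    if (ratio <= 0)
        return NaN                              // noise pushed the differences out of range
    endif
    return ln(ratio) / dx
End

Evaluating this for many fixed or random triplets and taking the median, as suggested, should damp the noise.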

Maybe data reduction and fitting is fast enough.
Testing several waves against each other might be slow, just my feeling.

Avoid displaying things (graphs, info panels, etc.).

HJ

PS: 10,000 events per second with 1000 data points each leaves you with a sampling rate of ~10 MS/s. That rather calls for a dedicated hardware solution if you want to have it done online...
tdgordon wrote:
we would like to determine the minimum number of points required for a "good" fit (I know the term is ambiguous). If someone knows an explicit analytical approach, please let me know. I haven't found one.

One area I would investigate for an analytic model is the use of the Cramer-Rao lower bound for the variance of the unknown decay coefficient. The classic reference I am partial to is "Detection, Estimation, and Modulation Theory - Part I" by H. L. Van Trees.
This sounds similar to what I am doing with a different set of equations.

http://www.igorexchange.com/node/7844

I call it an empirical analysis of the uncertainty budget on a fitting term. A few thoughts cross my mind from my experience.

* Have you considered deriving the analytical side of the uncertainty budget? Write the expression for your measured value as a sum of theory and noise, M = T + N. Expand T and solve back for the time constant, \tau = f(M - N, ...). Finally, apply linear propagation (using partial derivatives) to obtain a relative uncertainty expression (\Delta\tau / \tau)^2 = f(...). The result might give you insight, or at the very least an empirical metric to define "goodness of fit" (a worked single-point version of this propagation is sketched after this list).

* On the Igor Pro side, do you have experience in creating control panels? I can see a panel that allows a user to set a theoretical decay constant, the relative level of noise, a fitting range, and (when interested) the number density of points in the fitting range (the uncertainty budget should be dependent approximately on N^2 so this is a reasonable test). The panel would have a button to "Generate Result", putting the value of \Delta\tau / \tau and (\tau_{fit} - \tau_{theory})/\tau_{theory} in the history. I've done this for the Igor Pro analysis that I am carrying out on my equation.
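
As a worked single-point version of the propagation in the first bullet (my simplification; a real budget would sum over all samples), take a measurement M = A e^{-t/\tau} + N at a single sample time t, with noise standard deviation \sigma. Solving back for the time constant and differentiating,

\tau = t / \ln( A / (M - N) ), \qquad \partial\tau / \partial M = \tau^2 / ( t (M - N) ),

so linear propagation with M - N = A e^{-t/\tau} gives

(\Delta\tau / \tau)^2 \approx (\sigma / A)^2 (\tau / t)^2 e^{2 t / \tau},

which already shows how quickly the relative uncertainty grows for samples taken far out on the tail.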

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH
You're right about the amount of noise being important. I was just thinking about this a bit more this morning. The problem with the built-in expNoise function is that you can't (correct me if I'm wrong) control how noisy the randomly generated data are. This is a problem because I need to somehow match the noisiness of my synthetic data to the noisiness of my actual physically derived data in order to get a model that will predict the required sample size accurately. Of course, I'm not 100% sure how to do that noisiness matching either.

I think your reiteration of my approach is almost correct. Let me try again:
Use a pure exponential fxn as the reference or true data, e.g., y=exp(-0.1*x). For shorthand, let's call this fxn Herbert. In the little code snippet I included in the original post, I was using expNoise to generate the dummy data sets. This is what probably needs to change so I can match the noisiness of my physical data, so I'd somehow create a dummy data set that is based on Herbert with added noise. This dummy data set would contain a user-specified number of points (n). Then fit an exponential fxn to the dummy data. Calculate the time constant of the dummy data set. Calculate the squared difference between the calculated time constant for dummy data and the time constant of Herbert and save this squared difference in a wave. Then repeat this a large number of times for the specified n in order to build up a large sample of squared diffs between time constants from the dummy data sets and Herbert. Then repeat all of the preceding tasks with different n values. The final output would be a plot of the median (+- stdv) of squared difference between calculated and true time constant as a function of sample size.
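
A minimal sketch of what I have in mind for one value of n (wave and function names are just placeholders; I'm assuming Igor's built-in exp_XOffset fit, y0 + A*exp(-(x-x0)/tau), so that W_coef[2] is the fitted time constant):

Function MCTrialsForN(n, nTrials, trueTau, noiseSD, xMax)
    Variable n, nTrials, trueTau, noiseSD, xMax

    Make/O/D/N=(nTrials) sqDiff = NaN               // squared differences vs. the true time constant
    Make/O/D/N=(n) dummy
    SetScale/I x 0, xMax, dummy                     // same x range for every n; only the point density changes

    Variable i
    for (i = 0; i < nTrials; i += 1)
        dummy = exp(-x/trueTau) + gnoise(noiseSD)   // "Herbert" plus noise
        CurveFit/Q/N exp_XOffset, dummy             // quiet fit, nothing displayed
        WAVE W_coef
        sqDiff[i] = (W_coef[2] - trueTau)^2
    endfor
    return StatsMedian(sqDiff)                      // median squared error for this n; sqDiff holds the spread
End

Calling this for a series of n values and plotting the returned value (and the spread of sqDiff) versus n would give the final plot described above.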

We have dedicated hardware that samples up to 25 MS/sec.


HJDrescher wrote:
Could you share a typical data set? I guess that the number of required data points for the fit will depend on the noise in your real signal.
Theoretically, you only need as many data points as there are free parameters (here: 3). Note that such a 'fit' will always be exact!

Do I understand your approach right: you want to create a series of dummy data sets, test the quality of the fit against the number of data points in your fit (the histogram wave), and determine a critical n where things go bad. Put your code in a loop, store the time constant in a separate wave, and display that one versus n. I'd cross-check this result with, say, 10 real data sets (or start directly from there).

A quite fast way to get the time constants could be to solve the exact system of equations for three xy pairs for the time constant (too lazy at the moment):
y1 = A + B*exp(-C*x1), y2 = ..., y3 = ...  -->  C = f(x1, x2, x3, y1, y2, y3),
then calculate C for a couple (100?) of fixed or random point triplets and use the median (better) or average (faster) of C as the result.

Maybe data reduction and fitting is fast enough.
Testing several waves against each other might be slow, just my feeling.

Avoid displaying things (graphs, info panels, etc.).

HJ

PS: 10,000 events per second with 1000 data points each leaves you with a sampling rate of ~10 MS/s. That rather calls for a dedicated hardware solution if you want to have it done online...

Your process for creating the noisy data seems overly complex, unless I've missed some part of the problem. How about something like
Make/D/N=(testPoints) simulatedData
SetScale/I x 0,1,simulatedData              // Change 0,1 to the appropriate range
simulatedData = exp(-x/tau) + gnoise(.1)        // Change the parameters as required

Now simulatedData contains an exponential with decay constant tau and normally distributed noise with SDEV=.1. You can now fit it. Do it many times to get an idea of how the fits vary.
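
To pull the fitted time constant out of each repeat, something like this should do (a sketch; I'm using the built-in exp_XOffset fit here, y0 + A*exp(-(x-x0)/tau), so the third coefficient is tau):

CurveFit/Q/N exp_XOffset, simulatedData             // quiet fit, no history output or graph updates
WAVE W_coef, W_sigma
Print "fitted tau:", W_coef[2], "+/-", W_sigma[2]   // collect these over the repeated fits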

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
Thanks. I will look it up.

s.r.chinn wrote:
tdgordon wrote:
we would like to determine the minimum number of points required for a "good" fit (I know the term is ambiguous). If someone knows an explicit analytical approach, please let me know. I haven't found one.

One area I would investigate for an analytic model is the use of the Cramer-Rao lower bound for the variance of the unknown decay coefficient. The classic reference I am partial to is "Detection, Estimation, and Modulation Theory - Part I" by H. L. Van Trees.

Thanks, John. I think this makes sense. I don't know for sure if the noise is gaussian, but it's probably a reasonable start.

Are there any other tools in Igor that would help with the overall project? I suppose it's easy enough to just loop through things to get the Monte Carlo results.


johnweeks wrote:
Your process for creating the noisy data seems overly complex, unless I've missed some part of the problem. How about something like
Make/D/N=(testPoints) simulatedData
SetScale/I x 0,1,simulatedData              // Change 0,1 to the appropriate range
simulatedData = exp(-x/tau) + gnoise(.1)        // Change the parameters as required

Now simulatedData contains an exponential with decay constant tau and normally distributed noise with SDEV=.1. You can now fit it. Do it many times to get an idea of how the fits vary.

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com

Apologies here for the duplicate. I hit the POST button while the previous was uploading.

The demo has a control panel to input Ao, tau, noise, start, end, and reduction factor. A button will do the fit. The relative accuracy and precision for tau are returned in the history window.
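
In case it helps to roll your own, a bare-bones skeleton of that kind of panel could look like the following (control and function names are hypothetical; this is not the attached demo):

Function MakeDecayFitPanel()
    NewPanel/K=1/W=(100, 100, 340, 300) as "Decay Fit Demo"
    SetVariable svA0, pos={20, 15}, size={220, 18}, title="Ao", value=_NUM:1
    SetVariable svTau, pos={20, 40}, size={220, 18}, title="tau", value=_NUM:0.1
    SetVariable svNoise, pos={20, 65}, size={220, 18}, title="noise", value=_NUM:0.05
    SetVariable svStart, pos={20, 90}, size={220, 18}, title="start", value=_NUM:0
    SetVariable svEnd, pos={20, 115}, size={220, 18}, title="end", value=_NUM:1000
    SetVariable svRedF, pos={20, 140}, size={220, 18}, title="reduction factor", value=_NUM:1
    Button btnFit, pos={20, 170}, size={140, 22}, title="Generate Result", proc=DecayFitButtonProc
End

Function DecayFitButtonProc(ctrlName) : ButtonControl
    String ctrlName
    // Read the control values with ControlInfo, build the noisy decay,
    // run CurveFit over [start, end], and Print the relative accuracy
    // and precision of tau to the history.
End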

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH
Very cool! Thanks! Not quite sure what some of these buttons are. What does "start," "end" and "reduction factor" refer to exactly? Reduction factor is set to 500, but it doesn't seem to want to let me set it any lower than that.

jjweimer wrote:
Apologies here for the duplicate. I hit the POST button while the previous was uploading.

The demo has a control panel to input Ao, tau, noise, start, end, and reduction factor. A button will do the fit. The relative accuracy and precision for tau are returned in the history window.

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH

tdgordon wrote:
Very cool! Thanks! Not quite sure what some of these buttons are. What does "start," "end" and "reduction factor" refer to exactly? Reduction factor is set to 500, but it doesn't seem to want to let me set it any lower than that.


Start is the starting point of the fit. It can be set from 0 to 400. End is the ending point of the fit. It can be set from 500 to 1000 (the end of the data set).

Reduction factor is used to decrease the number of points in the "data" wave. It is used in a calculation ftotal = mod(p,redf) == 0 ? (theory + noise) : NaN. For example, when redf = 1, all points are used. When redf = 2, every other point is set to NaN. When redf = 10, only every 10th point is used.

I guess you meant that 'end' is set to 500. It cannot be decreased below that; it can only be increased. Alternatively, once you know how to edit a control panel, you can change the settings to whatever you want (hence my question under the second point of my original post).

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH
I am following up on my earlier reply from August 2 with results from a Cramer-Rao Lower Bound analysis in the attached pxp. I have simplified the situation a bit by assuming that the noise-free data is normalized to unity initial value. I use 'lambda' instead of John Weeks' 1/tau. The CRLB calculates the variance bound on the single variable 'lambda'. My function returns the denominator by which the initial sigma^2 is reduced as a function of the number of data points being used. Note that this function depends heavily on the estimator for 'lambda'. This makes physical sense because when the noise-free data becomes small enough, it is swamped by noise and using more points makes little sense. It is up to the user to determine what makes a good enough fit.
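
For readers who want the textbook form: with samples y_i = e^{-\lambda x_i} + n_i and Gaussian noise of variance \sigma^2, the single-parameter bound is

Var(\hat{\lambda}) \ge \sigma^2 / \sum_{i=1}^{N} x_i^2 e^{-2\lambda x_i},

so the sum \sum_i x_i^2 e^{-2\lambda x_i} is the denominator by which \sigma^2 is reduced (my statement of the standard result; not necessarily the exact form implemented in the pxp). That sum saturates once additional points sit far out on the tail, which is why using more points there gains little.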

The initial estimator for lambda can be found by solving an equation found from the first derivative of the log likelihood function. In the attached pxp I have treated it as a user-specified value. It should be interesting to compare this model with your Monte Carlo simulation results. The CRLB is most useful in ranges where the S/N is not too low.
EDIT: I have checked the CRLB with my own Monte Carlo analysis. Using sigma=0.02, the Monte Carlo points lie on my CRLB traces for low numbers of fitting points, but then roll over. From top trace to bottom the rollovers are >20, 15, and 10 fitting points, respectively. The MC roll-over locations will change with choice of sigma; the CRLB traces are normalized to sigma and will not change.
EDIT 2: It is probably worth pointing out that my Monte Carlo procedure uses the noisy data to calculate the MAP estimator for lambda, followed by a variance() calculation about that estimator. It does not use the a priori value of lambda (used in constructing the noisy data) in finding the variance. Although not universally true, in this instance the MAP and Minimum Least Squares estimators and variances are equivalent.
tdgordon,

I am not sure your approach will directly address your data. You are assuming that your fits are limited by noise in the Y axis.

This assumption and corresponding MC exploration may not uncover all pitfalls. What about errors in your X? What about deviations from ideal exponential behaviour, e.g. when two processes contribute to the decay?

My suggestion: Use your own data. Take a few decays as a test data set. Fit them as well as you can (automated). Then delete every 10th point, and repeat. Then delete every 8th point, 5th, 3rd, 2nd point and repeat. Plot error-in-time-constant vs deleted-points to see how far you can go. "Error" is defined by using the "as well as you can" case as the reference.
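
A minimal Igor sketch of that test (names are mine; it assumes the built-in exp_XOffset fit, whose third coefficient is the time constant, and relies on the fit skipping NaN-valued points):

Function DecimatedTau(realDecay, k)
    WAVE realDecay      // one measured decay with proper x scaling
    Variable k          // drop every kth point (k = 10, 8, 5, 3, 2, ...)

    Duplicate/O realDecay, decimated
    decimated = (mod(p, k) == 0) ? NaN : realDecay[p]   // NaN points are excluded from the fit
    CurveFit/Q/N exp_XOffset, decimated
    WAVE W_coef
    return W_coef[2]    // time constant fitted with the reduced point set
End

Comparing the returned value against the all-points fit gives the error-vs-deleted-points plot.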

This way, in your final report, you can just say "only every 2nd point was fitted to accelerate the computation, which had <0.1% influence on the fitted results."

j
This makes sense, and I like the idea. I could certainly do this with some chunk of data from one particular time. However, I'd like to be able to characterize things when there is more or less noise than in that specific chunk of data. How can I do that? I could pull data from a few different time periods, but it would be better to have some systematic way of assessing the number of points needed to fit the data as a function of the data's noisiness.


jcor wrote:
tdgordon,

I am not sure your approach will directly address your data. You are assuming that your fits are limited by noise in the Y axis.

This assumption and corresponding MC exploration may not uncover all pitfalls. What about errors in your X? What about deviations from ideal exponential behaviour, e.g. when two processes contribute to the decay?

My suggestion: Use your own data. Take a few decays as a test data set. Fit them as well as you can (automated). Then delete every 10th point, and repeat. Then delete every 8th point, 5th, 3rd, 2nd point and repeat. Plot error-in-time-constant vs deleted-points to see how far you can go. "Error" is defined by using the "as well as you can" case as the reference.

This way, in your final report, you can just say "only every 2nd point was fitted to accelerate the computation, which had <0.1% influence on the fitted results."

j

jcor wrote:

This assumption and corresponding MC exploration may not uncover all pitfalls. What about errors in your X? What about deviations from ideal exponential behaviour, e.g. when two processes contribute to the decay?


The approach will be valid to obtain the contribution of the number of fitting points to the overall uncertainty budget when all else is constant. That in itself is an important measurement.

jcor wrote:

My suggestion: Use your own data. Take a few decays as a test data set. Fit them as well as you can (automated). Then delete every 10th point, and repeat. Then delete every 8th point, 5th, 3rd, 2nd point and repeat. Plot error-in-time-constant vs deleted-points to see how far you can go. "Error" is defined by using the "as well as you can" case as the reference.


I recommend plotting RELATIVE uncertainty squared as a function of number of points squared. When the fit parameter is K, you obtain K ± \Delta K from Igor Pro. Plot (\Delta K / K)^2 versus N^2.

I also suggest that using simulated data is still OK. To obtain a true MC-type analysis, regenerate the data set multiple times (say M times) with the same (relative) Gaussian noise level and repeat the fit using the same number of points. You can then calculate the mean \langle K \rangle and a weighted uncertainty S_K over the M fits. Your final plot will be U_K^2 = (S_K/\langle K \rangle)^2 versus N^2.
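
A small sketch of that combination step (names are mine, and I am using one standard inverse-variance convention; whether that is exactly the S_K intended here is an assumption on my part):

Function RelUncertaintySquared(Kfit, Ksig)
    WAVE Kfit, Ksig                     // the M fitted K values and their fit sigmas (Delta K)

    Duplicate/FREE Ksig, w
    w = 1 / Ksig^2                      // inverse-variance weights
    Duplicate/FREE Kfit, wK
    wK = w * Kfit
    Variable Kmean = sum(wK) / sum(w)   // weighted mean <K>
    Variable S_K = sqrt(1 / sum(w))     // weighted uncertainty on <K>
    return (S_K / Kmean)^2              // U_K^2 for this number of fitting points N
End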

jcor wrote:

This way, in your final report, you can just say "only every 2nd point was fitted to accelerate the computation, which had <0.1% influence on the fitted results."


Again, I recommend working in relative uncertainty space. The language becomes ... "Only half the data set was fit. This was estimated by MC analysis to increase the overall relative uncertainty on the fitting constant K by X%". The value X% is the difference between using N points and using N/2 points.

I believe the result should show that U_K has an approximate \sqrt{N} dependence.

I am undertaking a comparable approach to do MC analysis of fitting curves, so I have a bit of depth in this area already. The concepts to appreciate are the idea of an uncertainty budget and the approach of constructing an equation for it theoretically using linear propagation of uncertainties.

Contact me off-line for information. I can point you to a paper that is in review and to an analysis that a student of mine is doing exactly along the lines you are attempting.

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH