# Maximum Likelihood Estimation (MLE)

If you want to do MLE, you write down the likelihood of your data under your model and take the log of it. You can then use the Optimize operation to find the maximum. Conveniently, log-likelihood functions are often concave, in which case there is exactly one extremum, and it is the global maximum. (For many common models, e.g. exponential-family distributions with canonical link functions, the log-likelihood is guaranteed to be concave.)
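As a language-agnostic sketch of this recipe (plain Python with numpy, not Igor; all names here are illustrative): write down the log-likelihood, then search for the value that maximizes it. Here we scan candidate values of mu for a Gaussian with known sigma and confirm that the maximizer is the sample mean, the analytic answer.

```python
import numpy as np

# Sketch of the MLE recipe for a Gaussian mean (sigma fixed):
# write down the log-likelihood, then maximize it numerically.
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=200)   # synthetic data

def log_likelihood(mu, y, sigma=1.0):
    # Gaussian log-likelihood, up to an additive constant
    return np.sum(-(y - mu) ** 2 / (2 * sigma ** 2))

mus = np.linspace(0.0, 6.0, 6001)              # candidate means, spacing 0.001
ll = np.array([log_likelihood(m, y) for m in mus])
mu_hat = mus[np.argmax(ll)]

# The numerical maximizer agrees with the analytic answer (the sample mean)
# to within the grid spacing.
print(abs(mu_hat - y.mean()) < 1e-3)   # → True
```

In Igor the grid scan is replaced by the Optimize operation, which searches for the maximum directly.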

Suppose your model asserts that the data {yi} have a Gaussian distribution given the values of covariates {xi}, and that the mean of that distribution is a linear combination of the values in {xi}. Then the likelihood of the data {yi} given the covariates {xi} is the product of the probabilities of each observed datum yi. Since we have defined that probability to be Gaussian, the probability density of datum yi is proportional to exp(-(yi-mu)^2/(2*sigma^2)), where mu is the mean and sigma the standard deviation of the distribution. We can ignore scaling constants since they will not change the values of mu and sigma that maximize this function. The likelihood of all of the data {yi} is the product of many such terms:

L({yi}) = prod{ exp(-(yi-mu)^2/(2*sigma^2)) }

The log likelihood is easier to maximize, so we compute it:

ln L({yi}) = sum{ -(yi-mu)^2/(2*sigma^2) }

Now we plug in a linear function of the covariates for mu, given by the coefficients {b}, such that:
mu = b0 + b1*x1i + b2*x2i + ...

ln L({yi}) = sum{ -(yi-b0-b1*x1i-b2*x2i-...)^2/(2*sigma^2) }

We can now drop the 1/(2*sigma^2) factor when maximizing over the coefficients, since it can be moved outside the sum and only rescales the objective. So we then want to maximize:

sum{ -(yi-b0-b1*x1i-b2*x2i-...)^2 }

It can be seen that maximizing this sum is identical to minimizing the sum of squared errors of a linear regression. In fact, linear regression is precisely the maximum likelihood solution under a model in which the expectation value of y is linearly related to x and each value yi is drawn from a normal distribution (Gaussian error) whose mean is given by that linear function of xi.
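This equivalence is easy to check numerically. The following Python sketch (illustrative, not Igor code) maximizes sum{ -(yi - Xb)^2 } by gradient ascent and compares the result with the closed-form least-squares solution:

```python
import numpy as np

# Check numerically that maximizing sum{ -(yi - Xb)^2 } reproduces
# ordinary least squares.
rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # ones column + covariates
b_true = np.array([1.0, -2.0, 0.5])
y = X @ b_true + rng.normal(scale=0.3, size=n)

# Closed-form least-squares solution.
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximize the log-likelihood by gradient ascent on b.
# The gradient of sum{ -(y - Xb)^2 } is 2 * X^T (y - Xb).
b = np.zeros(k)
step = 1.0 / (2 * np.linalg.norm(X.T @ X, 2))   # conservative step size
for _ in range(20000):
    b = b + step * 2 * X.T @ (y - X @ b)

print(np.allclose(b, b_ols, atol=1e-6))   # → True
```

The two sets of coefficients agree to numerical precision, which is exactly the claim: the Gaussian MLE and the least-squares fit are the same solution.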

For other models, e.g. non-linear relationships or non-Gaussian errors, the log-likelihood function may be more complex, but it can always be written down.
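For example, for count data one might assume a Poisson distribution whose log mean is linear in the covariates (a log link). Dropping the ln(yi!) constant, the log-likelihood is sum{ yi*ln(mui) - mui }. A Python sketch of writing that function down (illustrative names, not Igor code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ b_true))   # counts drawn from the assumed model

def poisson_loglik(b, X, y):
    # log mu is linear in the covariates (log link); the ln(y!) constant
    # is dropped since it does not depend on b
    mu = np.exp(X @ b)
    return np.sum(y * np.log(mu) - mu)

# Sanity check: the data are more likely under the generating coefficients
# than under clearly wrong ones.
print(poisson_loglik(b_true, X, y) > poisson_loglik(np.array([2.0, 2.0]), X, y))   # → True
```

Exactly this kind of function is what you would hand to an optimizer in place of the Gaussian sum of squares.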

In order to solve this problem using Optimize, we create a function that returns the log-likelihood:

Function myFitFuncToMaximize(xx, betas)
	Wave xx		// The covariates.  The first column [0] is ones; the rest [1,] are x values.
	Wave betas	// The coefficients that will be adjusted to find the maximum.
	Wave yy		// A global wave holding the observations.  It could have been a column of xx, but then the indexing becomes non-intuitive.

	MatrixOP /FREE xxBeta = xx x betas
	xxBeta = -(yy[p] - xxBeta[p])^2		// Just the (negative) sum of squares in this simple case.
	return sum(xxBeta)
End

This function takes covariates xx (the {xi} above), corresponding to the known variables at each observation point. For example, if you measured 5 sensors at 10 different time points, and your goal was to predict the response of a 6th sensor at those time points, your xx would be a wave with dimensions (10,6). The 0th column should be filled with ones if you want a constant term in your model, which is typical. In the linear model, the 'betas' wave would then have 6 points: the zeroth point would be the weight of the constant term, and the other 5 would be the weights of the 5 covariates (e.g. sensors). You do not have to specify the values of 'betas' ahead of time, except perhaps to make a starting guess that is closer to the maximum. The content of 'betas' after Optimize runs is what you are looking for to fill in your model coefficients. Lastly, the observed data yy is a wave with dimensions (10,1). At every time point in this example, a row from xx (plus noise) determines a value in yy: yy would be the value of the 6th sensor, which you are trying to predict from the values of the first 5.
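In Python rather than Igor (and with made-up numbers), the layout just described looks like this:

```python
import numpy as np

rng = np.random.default_rng(3)
n_times, n_sensors = 10, 5

# xx: one row per time point; column 0 is ones (the constant term),
# columns 1..5 hold the five measured sensors.
xx = np.column_stack([np.ones(n_times), rng.normal(size=(n_times, n_sensors))])

# betas: one weight per column of xx (constant term + 5 sensors).
betas = np.array([0.2, 1.0, -0.5, 0.3, 0.0, 0.7])

# yy: the 6th sensor, modeled as the linear combination plus noise.
yy = xx @ betas + rng.normal(scale=0.1, size=n_times)

print(xx.shape, betas.shape, yy.shape)   # → (10, 6) (6,) (10,)
```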

Since the log-likelihood for this model was shown above to simply be the sum of the squared differences between the values in yy and the linear prediction from the values in xx and the corresponding coefficients, the last three lines of the function simply compute that sum and return it. Now, we call Optimize to use that function to find the values for betas (i.e. to find the model coefficients) that maximize the value returned by the function:

make /o/n=6 betas=0
wave yy // Wave of data to be explained (10 x 1)
wave xx // Wave of measurements to explain the data (10 x 6).  The first column should be filled with ones.
optimize /q/a=1/m={2,1} /s=(10) /x=betas myFitFuncToMaximize,xx

Most of the flags I provided to Optimize are optional, although in my experience these are the ones that work best. After running it, inspect the values in the betas wave, which will be the model coefficients obtained from maximum likelihood estimation. To confirm that you have implemented this correctly, try a multivariate linear regression with the same data, which should give the same answer. Once this works, you are ready to write down more sophisticated models, and write analogous fit functions that return the log-likelihood. Remember that you need to consider both a model that maps measurements onto predictions, and an assumption about the distribution of the data around those predictions. In order for this to work most smoothly, there are certain combinations of so-called "link" functions that go with certain distributions. See http://en.wikipedia.org/wiki/Generalized_linear_model for details.
Instead of using Optimize, you may be able to use GenCurvefit for this procedure, with the /MINF flag to specify your own cost function. Since GenCurvefit wants to minimise the cost function, you would have to do something like:

Function MLE_forMINF(coefs, y_obs, y_calc, s_obs)
	// A cost function for the /MINF flag of GenCurvefit
	Wave coefs, y_obs, y_calc, s_obs
	variable MLE
	// calculate the log-likelihood MLE here

	// Return -MLE, so that minimising the cost maximises the log-likelihood.
	// (Returning 1/MLE would misbehave whenever the log-likelihood changes sign.)
	return -MLE
End

so that minimising the cost maximises the likelihood. Alternatively, if I get time I could patch GenCurvefit to search for maxima, not minima.
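Whatever cost you hand to a minimiser should be a monotone decreasing function of the log-likelihood, so that its minimum coincides with the likelihood's maximum; the negative log-likelihood is the usual choice. A quick numeric sanity check in Python (illustrative, not Igor code):

```python
import numpy as np

# Check that minimizing the negative log-likelihood finds the same
# optimum as maximizing the log-likelihood itself.
rng = np.random.default_rng(4)
y = rng.normal(loc=1.5, scale=1.0, size=100)

mus = np.linspace(-2.0, 5.0, 7001)                        # spacing 0.001
ll = np.array([np.sum(-(y - m) ** 2 / 2) for m in mus])   # Gaussian log-likelihood, sigma = 1

# The argmax of the log-likelihood is the argmin of its negative,
# and it sits at the sample mean as expected.
print(np.argmax(ll) == np.argmin(-ll))                    # → True
print(abs(mus[np.argmax(ll)] - y.mean()) < 1e-3)          # → True
```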
