# 3.3 – So what is maximum likelihood estimation, anyway?

Over the past few posts, we’ve talked quite a bit about likelihood, log-likelihood, and maximum likelihood estimation, but I remember how difficult this was for me when I first started. My fundamental questions were these: what is MLE, and why are we using it? I’ll try to answer both clearly in this post.

Let’s start by recalling what we’re really doing. Like I briefly mentioned, machine learning really is more about math than anything else. You start with a “model”, say linear regression, and that model has parameters (the $\theta$s) that we repeatedly tweak until we find their optimal values. Let’s take a step back now and try looking at the whole picture. We have some data. We have a model we think will suit the data. We need to find the parameters. That’s our problem.

Let’s do this with a sample dataset. Here’s a bunch of points I generated, where linear regression seems like a good idea. This is one-dimensional data, so our hypothesis is of the form $h(x) = \theta_0 + \theta_1 x$
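If you’d like to follow along, here’s a minimal sketch of how points like these could be generated. The “true” parameter values (2.0 and 0.5), the noise level, and the helper name `make_line_data` are all made up for illustration; they are not the values behind the dataset in this post.

```python
import random

def make_line_data(theta0, theta1, m, noise_std, seed=0):
    """Draw m points from y = theta0 + theta1 * x, plus Gaussian noise.

    All arguments are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)  # fixed seed so the points are reproducible
    xs = [rng.uniform(0.0, 10.0) for _ in range(m)]
    ys = [theta0 + theta1 * x + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

xs, ys = make_line_data(theta0=2.0, theta1=0.5, m=50, noise_std=1.0)
```

Each `y` is the line’s value at `x` plus a random nudge, which is why no single line passes through every point.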

So here’s an intuition to understand MLE. If we change $\theta_0$ and $\theta_1$, we get different lines. We have already assumed that this data was generated by a linear model as above. In fact, that’s why we’re using the linear regression model–we believe this data came from that equation. So here’s the highlight:

We need to find the parameters such that the line obtained is the one that most likely generated this data.

I’ll rephrase it so it’s clearer: we think there’s one magical, ultimate line that generated this data (with some noise, but we’re not really concerned about that). Every set of values for $\Theta$ (the vector of parameters, here $\theta_0$ and $\theta_1$) gives a different line, which may or may not be this ultimate line we’re in search of. No line fits perfectly (because there’s noise), so we settle for the line that’s the most likely one to have generated this data.

Let’s now write this as probability. This uses the concept of joint probability, but don’t worry, it’s an easy concept to understand. We want to look at the probability that the model generates exactly this data, assuming some values of the parameters $\Theta$ (we don’t know the values yet). So we’d like a probability density function over all the data points at once, say $f(x^{(1)}, x^{(2)}, \ldots, x^{(m)} | \Theta)$

This probability density function can really be anything. Read the expression above as the probability (strictly, the density) of observing all of our data points if $\Theta$ takes some value. This is sort of abstract, so it might help to read it again and make sure you understand.

Now, we make a statistical assumption: all these data points are IID (independent and identically distributed). This means that the presence of one data point does not affect the presence of another, and that they are all drawn from the same probability distribution. This is a neat assumption, because for independent random variables the joint density factorizes into the product of the individual densities: $f(x^{(1)}, x^{(2)}, \ldots, x^{(m)} | \Theta) = f(x^{(1)} | \Theta) \cdot f(x^{(2)} | \Theta) \cdots f(x^{(m)} | \Theta)$

This is just the product of the individual probabilities (strictly, densities). Now, recall from the previous paragraphs what we’re trying to do. We want this quantity to be as large as possible. Why? Because intuitively, it measures how probable our observed data is when $\Theta$ takes that particular value. This function is the likelihood function, and you’ll usually see it written in product notation: $L(\Theta) = \prod_{i = 1}^m f(x^{(i)} | \Theta)$
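To make the likelihood concrete, here’s a small sketch. It assumes something this post hasn’t stated: that each observed $y$ is the line’s prediction plus Gaussian noise with standard deviation 1, so the density of a point is a normal density centred on the line. The five data points and the function names are invented for the example.

```python
import math

def gaussian_pdf(value, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def likelihood(theta0, theta1, xs, ys, std=1.0):
    """Product of per-point densities, assuming Gaussian noise around the line."""
    result = 1.0
    for x, y in zip(xs, ys):
        result *= gaussian_pdf(y, theta0 + theta1 * x, std)
    return result

# Made-up points lying roughly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

close_line = likelihood(1.0, 2.0, xs, ys)  # a line near the points
far_line = likelihood(0.0, 1.0, xs, ys)    # a line far from the points
print(close_line > far_line)
```

The line that passes near the points gets a much larger likelihood than the one that misses them, which is exactly the ranking we want to exploit.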

Now since we want this to be maximum, we use calculus techniques: in particular, we use the fact that at the point where this function reaches either a maximum or minimum, the derivative is 0. Recall that $\Theta$ is a vector, so in our linear regression example, it’s two values, $\theta_0$ and $\theta_1$. We need to find maximum likelihood estimates of both of these. To do this, we set the partial derivatives to 0.

$$\begin{aligned}\frac{\partial L(\Theta)}{\partial \theta_0} &= 0 \\ \frac{\partial L(\Theta)}{\partial \theta_1} &= 0 \end{aligned}$$

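Unfortunately, differentiating a product of many terms is tedious. A simple solution is to take the logarithm of the likelihood, because logarithms turn products into sums. Since the logarithm is strictly increasing, the $\Theta$ that maximizes the log of the likelihood also maximizes the likelihood itself, so nothing is lost. This is so common that the resulting function has its own name, the log-likelihood function: $\ell(\Theta) = \sum_{i=1}^m \log f(x^{(i)} | \Theta)$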
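The sketch below checks numerically that the likelihood and the log-likelihood peak at the same parameter value. It again assumes Gaussian noise with standard deviation 1 (my assumption, not something fixed by this post), uses made-up data, and for simplicity holds $\theta_0$ at 1.0 while scanning a grid of slopes.

```python
import math

def gaussian_log_pdf(value, mean, std):
    """Log of the normal density, written out directly."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

def log_likelihood(theta0, theta1, xs, ys, std=1.0):
    """Sum of per-point log densities."""
    return sum(gaussian_log_pdf(y, theta0 + theta1 * x, std)
               for x, y in zip(xs, ys))

def likelihood(theta0, theta1, xs, ys, std=1.0):
    """Product of per-point densities."""
    result = 1.0
    for x, y in zip(xs, ys):
        result *= math.exp(gaussian_log_pdf(y, theta0 + theta1 * x, std))
    return result

# Made-up points lying roughly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

slopes = [i / 10 for i in range(41)]  # candidate theta_1 values 0.0 .. 4.0
best_by_likelihood = max(slopes, key=lambda s: likelihood(1.0, s, xs, ys))
best_by_log_likelihood = max(slopes, key=lambda s: log_likelihood(1.0, s, xs, ys))
print(best_by_likelihood, best_by_log_likelihood)
```

Both searches pick the same slope from the grid, even though the two functions take very different values.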

Finding the partial derivatives of $\ell(\Theta)$ and setting them to 0, as above, gives the maximum likelihood estimates of each of the parameters.
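For the linear regression example, this step can be carried out by hand if we assume Gaussian noise around the line (an assumption of this sketch). Under that assumption, setting the two partial derivatives of the log-likelihood to 0 yields the familiar least-squares formulas. The code below computes them on made-up data and then confirms that the gradient really is zero there. The function names are hypothetical.

```python
def fit_line_mle(xs, ys):
    """Closed-form estimates obtained by setting both partial derivatives to 0.

    Valid under the assumed Gaussian-noise model, where they coincide
    with the least-squares solution.
    """
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

def log_likelihood_gradient(theta0, theta1, xs, ys, std=1.0):
    """Partial derivatives of the Gaussian log-likelihood w.r.t. theta0, theta1."""
    residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    d_theta0 = sum(residuals) / std ** 2
    d_theta1 = sum(r * x for r, x in zip(residuals, xs)) / std ** 2
    return d_theta0, d_theta1

# Made-up points lying roughly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

theta0_hat, theta1_hat = fit_line_mle(xs, ys)
grad = log_likelihood_gradient(theta0_hat, theta1_hat, xs, ys)
print(round(theta0_hat, 4), round(theta1_hat, 4))
```

For other models the derivatives may not have a tidy closed form, in which case the same equations are usually solved numerically.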

And that’s it! What we discussed above is the process of maximum likelihood estimation. If it seems like we haven’t found one final equation, that’s because there isn’t one per se. Maximum likelihood estimation is a method of finding the parameter values, so the exact calculation details will differ for each model. If you look at the previous posts now, you’ll see that this procedure is exactly what we followed there.

You may have noticed that we’ve switched from talking about probability to likelihood along the way. There’s a subtle but important difference between the two. Here’s the equation relating them; see if you can interpret it: $L(\Theta) = L(\Theta ; X) = P(X ; \Theta)$

Mathematically, you read this as, “The likelihood of theta is the likelihood of theta parameterized by X, which is equal to the probability of X parameterized by theta”. What does this mean? The left term is the shorthand we’ve been using to keep our equations concise. The middle term is the full form, for those being pedantic: it is the likelihood of the parameters in our $\Theta$ vector taking some particular values when the data is X. The right-most term is the probability that the data X is obtained if the parameter values are given as $\Theta$. All three are the same number; the difference is in what we treat as fixed and what we vary. Likelihood talks about the parameters (the data is fixed), while probability talks about the data (the parameters are fixed).
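A coin-flip example makes the distinction easier to see. This is a sketch with made-up numbers (10 flips, 7 heads) and a binomial model, none of which comes from the earlier posts. The same formula is used twice: once with the parameter fixed and the data varying, and once with the data fixed and the parameter varying.

```python
import math

def binomial_pmf(heads, flips, p):
    """Probability of seeing `heads` heads in `flips` flips of a coin with bias p."""
    return math.comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)

flips = 10

# Probability: fix the parameter (p = 0.5), vary the data (number of heads).
probabilities = [binomial_pmf(k, flips, 0.5) for k in range(flips + 1)]
total_probability = sum(probabilities)  # sums to 1 over all possible data

# Likelihood: fix the data (7 heads), vary the parameter p.
grid = [i / 100 for i in range(101)]
likelihoods = [binomial_pmf(7, flips, p) for p in grid]
best_p = grid[likelihoods.index(max(likelihoods))]

# Approximate area under the likelihood curve (simple Riemann sum).
area_under_likelihood = sum(likelihoods) / 100

print(total_probability, best_p, area_under_likelihood)
```

The probabilities over all possible data add up to 1, but the area under the likelihood curve does not, so the likelihood is not a probability distribution over the parameter. Its peak, however, lands at the maximum likelihood estimate: 7 heads out of 10.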