6 – Locally Weighted Linear Regression

Locally weighted regression is also called LOESS or LOWESS. It’s inspired by cases where linear regression, which simply fits a line, isn’t sufficient, but we don’t want to overfit either. I found the notes from the University of Manitoba, Canada to be an excellent resource for this topic.

Motivation

One motivation for this algorithm is that we don’t always know what our data will look like. Sometimes we’re lucky, and it’s linear, so linear regression does a good job. Sometimes it’s quadratic. Sometimes it may be sinusoidal, and sometimes it may just be all over the place. However, given a point we want to predict at, we want our prediction to be as good as possible. How do we achieve this?

One rather intuitive line of reasoning is that although all the points together may not help us build a nice model, the points near the test point are a good way to estimate the value we should predict. We’re still doing regression, but we’re giving more focus, or emphasis, or (you see where I’m going here) weightage to points in the training data that are close to the point we want to make a prediction on. Here’s an example from the lecture slides I linked to above.

The blue dots are the training data. We have a test point, $x^{(new)}$, and we want to predict the value of $y$. Obviously, fitting one line to this whole dataset will lead to a value that’s way off the real one. Let’s use this weighting concept and only look at a few nearby points, and perform regressions using those nearby points only:

Well, that’s significantly better. It looks like the predicted value of $y$ is something we’d expect given how our curve looks. Let’s now go over the math for this, and see how we modify standard linear regression to get there.

The model: similarities

The good news is that a lot of the model remains the same. In particular, our hypothesis function remains exactly the same, that is, given a test case $x$, we predict:

$h(x) = \Theta^T x$

So we still have to find the parameters $\Theta$ and output the same hypothesis function. Let’s now see what’s changed to make this weighting effect happen.

Tweaking standard linear regression

In standard linear regression, we took the training data, used gradient descent to fit the parameters, and that was it. We didn’t need the training data to make a prediction. Notice, however, that in locally weighted linear regression, we need the training data as well as the parameters to make a prediction, because we also need to know which points are close to the test point. This also means that we don’t have one ready model that we can use for any new test point: the amount of information we need to keep around grows with the size of the training set. For this reason, locally weighted linear regression is called a non-parametric model.

Recall now, that in standard linear regression, we fit $\Theta$ to minimize the cost function:

$J(\Theta) = \frac{1}{2m} \sum_{i=1}^m \left( h(x^{(i)}) - y^{(i)} \right)^2$

The only difference now is that we need a weight term, like so:

$J(\Theta) = \frac{1}{2m} \sum_{i=1}^m w^{(i)} \left( h(x^{(i)}) - y^{(i)} \right)^2$
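To make the weighted cost concrete, here’s a small sketch in NumPy (my own illustration, not from the original notes). The function names and the toy data are made up for this example; note that the first column of `X` is the intercept term.

```python
import numpy as np

def weighted_cost(theta, X, y, w):
    """Weighted least-squares cost: (1/2m) * sum_i w_i * (h(x_i) - y_i)^2."""
    m = len(y)
    residuals = X @ theta - y          # h(x^(i)) - y^(i) for every training point
    return (w * residuals ** 2).sum() / (2 * m)

# Toy data lying exactly on y = 2x, so theta = [0, 2] gives zero cost
# no matter what the weights are.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column: intercept
y = np.array([2.0, 4.0, 6.0])
w = np.array([1.0, 0.5, 0.1])
print(weighted_cost(np.array([0.0, 2.0]), X, y, w))  # → 0.0
```

Setting all $w^{(i)} = 1$ recovers the standard (unweighted) cost, which is a handy sanity check.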

This weight term is obviously a function of the test point and the training data points. Why? Because it needs to tell us how close the test point is to each of the training data points. Such a similarity measure is called a kernel function. Kernel functions will be useful in other learning algorithms as well, particularly in Support Vector Machines. For locally weighted linear regression, an extremely popular choice is the Gaussian kernel, because it reflects exactly what we want: for points that are close to the test point, the value is higher, and for points far away, it tends to (but never equals) zero. Let’s have a look at this kernel function:

$w^{(i)} = \exp \left(- \frac{(x^{(i)} - x)^2}{2 \tau^2} \right)$

Most of the terms should be familiar. $x^{(i)}$ is each training example point. $x$ is the test point. We have a new parameter, $\tau$. This in effect determines the width of the kernel. Let’s visualize this:

It’s now easy to see that if $\tau$ is bigger, then more points are considered while fitting our local linear regression model. This is a hyperparameter, which means that it is a parameter of the model, but it’s one that you choose, not one that the model learns.

One way to look at this kernel function is that for training points near the test point, $x^{(i)} - x$ is small (close to 0), so the weight is close to 1. For training points far away, $x^{(i)} - x$ is big, so the weight term is close to 0.
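The two limiting behaviors above are easy to verify numerically. Here’s a quick sketch (mine, not from the notes) of the Gaussian kernel for scalar inputs, also showing how a larger $\tau$ widens the kernel:

```python
import numpy as np

def gaussian_weight(x_i, x, tau):
    """Gaussian kernel weight: exp(-(x_i - x)^2 / (2 * tau^2))."""
    return np.exp(-((x_i - x) ** 2) / (2 * tau ** 2))

# A nearby training point gets a weight close to 1...
print(gaussian_weight(0.1, 0.0, tau=1.0))  # ≈ 0.995
# ...while a faraway one gets a weight close to (but never exactly) 0.
print(gaussian_weight(5.0, 0.0, tau=1.0))  # ≈ 3.7e-06
# A larger tau widens the kernel, so the same faraway point now matters more.
print(gaussian_weight(5.0, 0.0, tau=5.0))  # ≈ 0.61
```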

Without proof, I’ll state here that the optimal value of $\Theta$ is given by:

$\Theta = (X^T WX)^{-1}X^T WY$

If you look closely, you’ll see that this is exactly what we derived for the standard linear regression model (the normal equation), except that $X^T X$ is replaced by $X^T WX$ and $X^T Y$ by $X^T WY$, where $W$ is the diagonal matrix of weights $w^{(i)}$.
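Putting the kernel and the weighted normal equation together, a full prediction can be sketched as follows (my own illustration with made-up toy data; `np.linalg.solve` is used instead of forming the inverse explicitly, which is the numerically preferred way to evaluate $(X^T WX)^{-1} X^T WY$):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted prediction: theta = (X^T W X)^{-1} X^T W y,
    where W is diagonal with Gaussian kernel weights centered at x_query."""
    # Squared distance from each training point to the query
    # (skip column 0, which is the intercept).
    d2 = ((X[:, 1:] - x_query[1:]) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))
    XtW = X.T * w                       # same as X^T @ diag(w), via broadcasting
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

# Toy data: y = x^2. One global line would fit this badly,
# but a locally weighted line near the query does well.
x = np.linspace(-3, 3, 31)
X = np.column_stack([np.ones_like(x), x])   # add the intercept column
y = x ** 2
print(lwr_predict(X, y, np.array([1.0, 2.0]), tau=0.3))  # close to 4
```

Note that the whole system is re-solved for every query point, which is exactly the non-parametric behavior discussed earlier: the training data must be kept around at prediction time.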