6.1 – Implementing Locally Weighted Regression

We’re now in a position to implement the locally weighted regression algorithm. While we will still follow the same equation that we set up earlier, we will slightly deviate, because the dimensions in the equation as-is do not match to perform the matrix multiplications.

For this implementation, we will use the equations from Dr. Patrick Breheny’s class at the University of Kentucky.

Let’s begin by importing the packages that we will require. We will use the seaborn plotting library because it gives us nicer visuals, but also because it comes with the dataset that we will be using. The implementation as it is only works for when there is one independent variable (X has only one column). Luckily, this dataset has this property. However, if you wish to implement this for a different data where this does not hold, you could always do a PCA to reduce the dimensionality to one.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np




Let’s now load the data, and make sure that we don’t have a rank-one array by reshaping it to a proper shape. We will use the bills and tips dataset, which among other things, shows the tips paid for various bill amounts. We will only focus on these two columns right now.

data = sns.load_dataset('tips')
X = np.array(data['total_bill'])
Y = np.array(data['tip'])

X = X.reshape(244)
Y = Y.reshape(244)


Let’s have a look at the data.


The x-axis represents the total bill amount, and the y-axis represents the tip given. We will now split the data into train and test parts.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7)

X_train = X_train.reshape(X_train.shape[0])
Y_train = Y_train.reshape(Y_train.shape[0])
X_test = X_test.reshape(X_test.shape[0])
Y_test = Y_test.reshape(Y_test.shape[0])


We do the reshaping to make sure we have sensible shapes for the arrays. The next part is the implementation, and we’ll spend more time to explain what’s going on, because we’re slightly deviating from what we’ve seen.

plot_x = []
plot_y = []
plot_z = []

for i in range(len(X_test)):
    X0 = X_test[i]
    X = np.ones((len(X_train), 2))
    W = np.zeros((len(X_train), len(X_train)))

    for j in range(len(X_train)):
        X[j][1] = X_train[j] - X0
        W[j][j] = np.exp(-(X_train[j] - X0) ** 2 / (2 * 4))
e = np.array([1, 0]).reshape(1, 2)
    XTW = np.dot(X.T, W)

    pred = np.dot(np.dot(np.linalg.inv(np.dot(XTW, X)), XTW), Y_train)[0]




Remember that locally weighted regression is a non-parametric algorithm, which means that you don’t keep any parameters in memory. Instead, when a test sample arrives, we use the entire training dataset to compute the predicted value of the test sample. This explains why we’re iterating over the test samples one by one. I do believe there may be a way to do this more efficiently, but I’m not sure how.

The next part is where we deviate from our equations and instead use the equations from the lecture I linked above. According to the lecture, we create a matrix X where the second column is all 1s, and the first column is the x - x_0 values (x_0 is the current test sample). This is what I’ve done with the X variable. The matrix W is a diagonal matrix, which means every entry except the ones on the diagonal are zero. This is something I’m mentioning here because it’s more of an implementation detail. The values of these diagonal entries are the Gaussian values that we discussed. The Wikipedia entry for LOESS has another weighting function that looks interesting, one that I encourage you to try.

The declaration of the e variable is in accordance with the lecture slides. Note that I haven’t actually used this as shown in the slides. Here’s why. The choice of the matrix e is so that we only get the value we need. If you see the line that defines the pred variable, you’ll notice it’s basically the equation that we laid out in the previous post, EXCEPT, that we have a [0] index at the end. Why? If you simply run the code without the [0] index, you’ll get an array of two values. I strongly recommend that you run the code and see this for yourself to get a clear understanding. This array has shape (dimensions) (2, 1). So in the lecture, the equation pre-multiplies a column matrix of dimension (1, 2). This will yield a real value (what we require as an output), because (1,2) * (2,1) = (1,1). You should try pre-multiplying instead of using the [0] index to see that you get similar results. The choice of e is pretty clever, because not only does it give us one value, it also throws away the second element of the array that we would otherwise get. All this, in effect, is to take that array of two elements and simply throw away the second value, which is exactly what the [0] index does. I understand that this paragraph has been a lot to take in, but it will be much clearer to you if you practically run the code, run into dimension errors, and see all this work out for yourself.

Now that we have the prediction values, we need to see if we’re doing it right! One measure we could use is the R^2 value. This is basically just the square of the Pearson correlation coefficient. A higher value is better. The np.corrcoef function gives us the Pearson correlation coefficient, and we simply find the square. We could also visually see the results by plotting our predictions against the true output values, and see that it is, in fact, linear. In my system, this plot is below.


Seems like a pretty linear fit, right? How good is this numerically? Let’s find out:

plot_x = np.squeeze(plot_x)
plot_x = plot_x.reshape(74)

plot_z = np.squeeze(plot_z)
plot_z = plot_z.reshape(74)

pearsonr = np.corrcoef(plot_z, plot_x)[0][1]
print(pearsonr ** 2)


We first make sure we have sane dimensions, then compute the Pearson correlation coefficient. This gives us a matrix, and we want the cross-diagonal elements, which we obtain by indexing [0][1]. On my system, it prints about 0.711.

This sadly does not seem like a very impressive value, and it isn’t as good as we might hope for. To convince ourselves that all this trouble was really worth it, let’s use the built-in library to run a standard linear regression and see its performance.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train.reshape(-1, 1), Y_train)

predictions = model.predict(X_test.reshape(-1, 1))
print(np.corrcoef(predictions, Y_test)[0][1] ** 2)

And the moment of truth! Is our effort worth it? It turns out, YES! The standard library’s implementation gives an R^2 value of 0.647, so our implementation of LOESS does indeed outperform the standard linear regression model. We can pat ourselves on the back, knowing that we didn’t waste our time!

Update: I tried out the tri-cube weight function as well. From NIST’s handbook, we find the distance of the test point from all the training points, but make sure to normalize it so that it lies between 0 and 1. Then, we use this distance in the tri-cube weight formula. The code becomes only slightly bigger, but the result is pretty similar–although it doesn’t perform to our expectations, it certainly outperforms the standard linear regression.

d = np.zeros((len(X_train)))

for j in range(len(X_train)):
    X[j][1] = X_train[j] - X0
    d[j] = np.abs(X_train[j] - X0)

d /= np.linalg.norm(d, 1)

for j in range(len(X_train)):
    if d[j] < 1:
        W[j][j] = (1 - np.abs(d[j]) ** 3) ** 3

The second parameter to np.linalg.norm decides how you want to normalize your data. When you pass 1, it returns the L1 norm, which basically scales the data so that the maximum value is 1. You could also do an L2 norm, and on trying, it appears that you still get better results than linear regression, but between L1 and L2, your mileage may vary.


So what have we achieved? We successfully implemented locally weighted regression. We found the R^2 value, and compared this with the standard linear regression model to prove that this was, indeed, an improvement.

You shouldn’t use this as an excuse to always use LOESS. Remember, this is non-parametric, so all the training data is kept in memory. This can get prohibitive if you have a large training dataset. For small or medium-sized data, however, this can be an option to consider, especially if your data isn’t exactly linear and has some curvature.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s