Evaluating regression models is less straightforward than evaluating classification models. Because the outputs are now real numbers rather than discrete classes, metrics like accuracy no longer make sense. Using something like a loss function doesn’t work either: a low loss is obviously better, but how do we know *how* good it is? Since there’s theoretically no upper bound, and since a loss of 0 is almost never achievable, the raw loss isn’t very intuitive to use. Before we get to how we can judge a model, let’s discuss a few definitions.

## Some terminology

Let’s use our standard notation as follows.

- $y^{(i)}$ is the $i$th training example output.
- $\hat{y}^{(i)}$ is our prediction for the $i$th training example.
- $\bar{y}$ is the mean of all the $y^{(i)}$s (the true outputs).
- Additionally, the differences between the actual outputs and the predicted outputs are called the **residuals**. Let $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ denote the $i$th residual.

In statistics, you’ll see the notations $y_i$ and $\hat{y}_i$ used instead, but to maintain consistency with what we’ve used so far, we’ll use the superscript notation.

Using this notation, let’s define a few terms (from Wikipedia):

- The **total sum of squares**, denoted $SS_{tot}$ or $SS_T$, is the sum of the squared errors we would get if our model always predicted the mean, $\bar{y}$. That is,

$$SS_{tot} = \sum_i \left(y^{(i)} - \bar{y}\right)^2$$

- The **regression sum of squares**, denoted $SS_{reg}$, is similar, but takes the squared errors between the mean and the predicted values, $\hat{y}^{(i)}$:

$$SS_{reg} = \sum_i \left(\hat{y}^{(i)} - \bar{y}\right)^2$$

- Finally, let’s define the **residual sum of squares**, $SS_{res}$, whose meaning is evident from its name:

$$SS_{res} = \sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2 = \sum_i \left(e^{(i)}\right)^2$$
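These three quantities are easy to compute with numpy. Here’s a minimal sketch on a hypothetical 4-point dataset (the numbers are made up, and `y_hat` plays the role of some model’s predictions):

```python
import numpy as np

# Hypothetical toy data: actual outputs and some model's predictions
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
y_bar = y.mean()  # the mean of the actual outputs

SS_tot = np.sum((y - y_bar) ** 2)      # total sum of squares
SS_reg = np.sum((y_hat - y_bar) ** 2)  # regression sum of squares
SS_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
print(SS_tot, SS_reg, SS_res)
```

For an ordinary least squares fit with an intercept, these satisfy $SS_{tot} = SS_{reg} + SS_{res}$; for arbitrary predictions like the made-up ones here, that identity doesn’t hold exactly.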

As always, Wikipedia gives a fantastic diagram to get some intuition about these terms:

This toy dataset contains 4 data points (the circles). In the first figure, we’ve chosen a model that always predicts the mean of the data. To get the errors, from each point, draw a vertical line to this mean line, and then construct a square with this length as the side (these are the red squares). Note that in this case, because the mean line is horizontal, the vertical lines drawn from the points to this “predicted” line are perpendicular to it. The areas of these squares are the squared errors, and the sum of these red areas is the total sum of squares.

In the second figure, we predicted the black line. To get the sum of squared residuals, we draw vertical lines to our prediction line (note that this time, these lines are *not* perpendicular to the predicted line), and construct squares as before. Then, we sum up the areas of these squares.

Given this background, we are now in a position to understand $R^2$, the metric for judging regression models.

## $R^2$, the coefficient of determination

$R^2$ is defined as:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

$R^2$ is one way of judging the **goodness of fit**. For any model that fits at least as well as always predicting the mean, it’s a value between 0 and 1, and the higher it is, the better the model. It is usually interpreted as “the proportion of variance explained by the model”, which basically means how well the model captures the variability of the data.

There is, however, a problem with $R^2$. Because of its definition, the value of $R^2$ can only increase as you go on adding more predictor variables, until it reaches a value arbitrarily close to 1. This is a problem because it encourages **kitchen sink regression**. The name “kitchen sink” is by no means a compliment: it refers to the fact that you’re throwing everything you have at the problem, completely disregarding whether or not each variable is a useful predictor of the output. For this reason, a metric called the **adjusted R-squared**, denoted $\bar{R}^2$, is sometimes also shown. Using our usual notation of $m$ being the number of training examples and $n$ being the number of features, we have:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{m - 1}{m - n - 1}$$

The adjusted R-squared can also be negative, and is always less than or equal to R-squared. Let’s now implement these two.
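To see the kitchen-sink effect concretely, here’s a small sketch on hypothetical synthetic data (separate from the dataset used below; the `adjusted_r2` helper is something we define ourselves): adding a pure-noise column to an ordinary least squares fit never lowers the training $R^2$, while the adjusted version penalizes the useless column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
m = 100
X = rng.normal(size=(m, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=5, size=m)

def adjusted_r2(r2, m, n):
    # Adjusted R-squared for m examples and n features
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)

# Fit using only the 2 informative features
r2_small = LinearRegression().fit(X, y).score(X, y)

# "Kitchen sink": tack on a pure-noise column and refit
X_big = np.hstack([X, rng.normal(size=(m, 1))])
r2_big = LinearRegression().fit(X_big, y).score(X_big, y)

# Training R^2 never decreases when a predictor is added,
# but the adjusted version penalizes the extra useless column
print(r2_small, adjusted_r2(r2_small, m, 2))
print(r2_big, adjusted_r2(r2_big, m, 3))
```

Because the third column is pure noise, `r2_big` is at least `r2_small`, but the gap between $R^2$ and $\bar{R}^2$ widens for the larger model.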

## Implementing $R^2$ and $\bar{R}^2$

As always, we’ll start by importing the libraries:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
```

Next, we create a dataset of 100 samples and 2 features.

```python
data = make_regression(n_samples=100, n_features=2, n_informative=2, noise=18)
X, Y = data

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X.T[0], X.T[1], Y)
ax.set_xlabel('$X_1$')
ax.set_ylabel('$X_2$')
ax.set_zlabel('Y');
```

Let’s now train a linear regression model on this dataset. First, we’ll use *train_test_split* as usual to get a train and a test set.

```python
from sklearn.model_selection import train_test_split

model = LinearRegression()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7)
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
```

Now, we can implement the calculations of $R^2$ and $\bar{R}^2$ using the formulas above. We’ll first have to find $SS_{res}$ and $SS_{tot}$. This is made easy using the numpy library:

```python
SS_res = np.sum((Y_test - predictions) ** 2)
mean = np.mean(Y_test)
SS_tot = np.sum((Y_test - mean) ** 2)
R_sq = 1 - SS_res / SS_tot
print(R_sq)
```

This for me prints 0.8998379099758742. To see if we’ve done it right, let’s use the built-in *score* method to get $R^2$.

```python
model.score(X_test, Y_test)
```

It should print the same value. Let’s now find the adjusted R-squared value:

```python
R_sq_adj = 1 - (1 - R_sq) * (100 - 1) / (100 - 2 - 1)
print(R_sq_adj)
```

This for me prints 0.897772712243418, which is just slightly less than $R^2$. This tells us that both predictor variables are important for predicting the output variable, and this is a good thing.

However, $R^2$ isn’t the only thing we can do to check our model. Let’s look at one more thing we could check.

## A brief introduction to hypothesis testing

This is a statistics concept, but I’ll try to explain it lucidly. Remember that, as a concept, machine learning borrows regression from statistics. Let’s first introduce a few terms from statistics.

A **null hypothesis** can be thought of as an established statement or the current way of thinking. It is usually denoted by $H_0$. For example, a null hypothesis could be that 90% of the manufactured products in a factory are of high quality. In contrast, an **alternate hypothesis**, denoted $H_1$ (or $H_a$), is something we believe in, and opposes the null hypothesis. For example, our alternate hypothesis could be that the percentage of high quality products in the factory is less than 90%.

In statistics, an important concept is that of **hypothesis testing**. The idea is to figure out which of the two hypotheses we should accept and which we should reject. How do we do this? First, we take a **sample**. In the factory example, this means we take a subset of the manufactured goods. This sample should be random so that it is *representative of the population* (here, the **population** refers to the set of all the products; obviously, we can’t test every single one!). Once we have the sample, we check the percentage of high quality goods (in our example), and then obtain a **confidence interval** at the specified **significance level**. The significance level is something that you choose. That’s a lot of terms, so let’s break them down.

Remember that you’re only testing a sample, not the population, so you don’t know the quality percentage of the whole population. However, given the sample, what you can do is say with some level of confidence that the percentage of high quality goods lies in some range. That level of confidence is called the **confidence level**, and the range you get is called the **confidence interval**. The significance level is 100% minus the confidence level. For example, if the chosen significance level is 5%, the confidence level is 95%. Suppose that at this confidence level, you obtain the confidence interval for the percentage of high quality goods as [88.4, 91.2]. We then say, “with 95% confidence, the percentage of high quality goods lies between 88.4 and 91.2”. Since this range contains 90, we *accept* the null hypothesis: the company’s claim cannot be rejected. Remember though that we accepted the null hypothesis **only for the given sample**. It may happen that for a different sample, the null hypothesis is rejected. This is why it’s important to pick a sample that’s representative of the whole population.
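In practice, you can get such a confidence interval from a library rather than from statistical tables. Here’s a sketch using statsmodels’ `proportion_confint` with made-up numbers for the factory example (86 high-quality items observed in a sample of 100):

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical sample: 86 high-quality items out of 100 inspected.
# alpha=0.05 gives a 95% confidence interval for the true proportion.
low, high = proportion_confint(count=86, nobs=100, alpha=0.05)
print(low, high)
```

With these numbers the interval comes out to roughly [0.79, 0.93], which contains 0.90, so at the 5% significance level we cannot reject the null hypothesis that 90% of the products are high quality.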

## p-values

p-values are a concept related to confidence intervals. These are the preferred values to report, because you don’t have to choose a level of significance. Suppose the null hypothesis is true (the person performing the hypothesis test doesn’t know this, of course). That is, the factory really does have 90% high quality products. Suppose that in your sample, you find that the percentage of high quality products is 86%. The **p-value** is the probability of getting results that are *at least as extreme as the observed results*. Remember, this is a conditional probability (conditioned on the premise that the null hypothesis really is true). Therefore, the p-value would tell us the probability of picking another sample and that sample having a percentage of high quality products of 86% or less, given that 90% of the products are actually high quality.

Let’s interpret what this tells us. Suppose the p-value is 0.3 (30%), a rather high value. This means that if the null hypothesis is true, there’s a 30% chance that another sample would have at most 86% high quality products. In other words, a result like yours is quite likely under the null hypothesis, so your results are not **statistically significant**. Loosely, we can also interpret the p-value as *the probability that you would wrongly reject the null hypothesis*.

The lower the p-value, the more statistically significant your results are.

Suppose now that you obtain a p-value of 0.01 (1%). This is a very low value, which means that you would very rarely be wrong when you reject the null hypothesis. And so, this leads to the next highlight:

> If the p-value is very low, you reject the null hypothesis, because there is a very low probability that you would wrongly reject it; i.e., the results you have obtained experimentally are unlikely to be due to chance, so they are statistically significant.

Obtaining the confidence intervals and p-values requires the use of statistical tables or calculators with these functions, so I won’t detail that here. Fortunately, built-in libraries give us p-values and confidence intervals with barely any hassle.
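For instance, the factory example can be checked with a one-sided binomial test from scipy (again with the made-up counts of 86 high-quality items out of 100, against a null rate of 90%):

```python
from scipy.stats import binomtest

# Hypothetical sample: 86 high-quality items out of 100 inspected,
# tested against H0: the true high-quality rate is 90%,
# with H1: the true rate is less than 90% (one-sided test)
result = binomtest(86, 100, p=0.9, alternative='less')
print(result.pvalue)
```

The p-value here is the probability of seeing 86 or fewer high-quality items in a sample of 100 if the true rate really is 90%. It comes out to roughly 0.1, which is not low enough to reject the null hypothesis at the usual 5% level.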

## Using confidence intervals and p-values in regression

Alright, so with that statistical background, how do we use that in regression? In regression, we make the null hypothesis that the parameters of the model are 0.

> In regression, the null hypothesis is that the chosen parameters are statistically insignificant, and that their *population coefficients* are zero.

What does this mean? It means that when we look at regression from a statistical standpoint, rather than saying that we have all the data, we assume that this data is just a *sample* of a bigger *population*, and that we’re only *estimating the population parameter coefficients*. Now that we know what the sample and population are, we look at our hypotheses. The null hypothesis is that the population parameter coefficients of the model are 0. This would mean that the parameters are useless (because if they’re all 0, what’s the point?). Our aim in regression is to estimate the population coefficients from the sample data that we have, find the p-values, and see if we can reject this null hypothesis.

If the p-values for the regression model are small, the model is statistically significant, and so we can reject the null hypothesis. This means that the parameters chosen for the model do, in fact, contribute meaningfully in predicting the value of the output variable.

Alternatively, we could also find the confidence intervals for the parameters. If the confidence interval for any parameter contains 0, that parameter is not statistically significant, because with a high level of confidence, 0 may be the value of the parameter (which is the null hypothesis). Statistical packages will usually provide both the p-values and the confidence intervals at the 95% confidence level (which is a 5% significance level). The p-values and the confidence intervals will always agree: if the confidence interval contains 0, the p-value will be more than 0.05 (5%).

## Python3 implementation

Let’s continue and check whether our results are statistically significant. Because we didn’t add the column of 1s to the data $X$, the model that the library fitted (as shown above) had only 2 coefficients, one for each feature. Therefore, it didn’t have any constant term. Let’s see if both features are statistically significant. We’ll use a different library for this.

```python
import statsmodels.api as sm
```

We’ll tell *statsmodels* to fit an OLS (ordinary least squares) model to our data (exactly what *sklearn* was doing).

```python
results = sm.OLS(Y_train, X_train).fit()
print(results.summary())
```

This prints:

Look at the middle part. We see that for both coefficients, the p-value (indicated as “P>|t|”) is very small (so small that the first 3 decimal places are 0). We also notice that the 95% confidence interval (which is denoted by the last 2 columns) does not contain 0 for either of the parameters. Therefore, this model is statistically significant, and both parameters contribute to the output variable.

We notice that the $R^2$ value is only 0.853 here. Why is that? The reason is that here, the coefficients are slightly different from the ones *sklearn* obtained. To see this, print out the value of *model.coef_*, which holds the model coefficients. For me, it prints

These are only slightly different from the coefficients obtained by *statsmodels*, and that causes the discrepancy. Let’s now see what happens if we try forcing a constant in the model:

```python
X2 = sm.add_constant(X_train)
results = sm.OLS(Y_train, X2).fit()
print(results.summary())
```

This for me gives the output:

For the constant term, we notice that the p-value is very high, and also that the 95% confidence interval contains 0. Thus, the constant doesn’t help the model at all, and statistically, you should not include the constant in your model.

$R^2$ and confidence intervals/p-values are two ways to judge how good your regression model is. There’s one last thing you can check: a scatterplot between the residuals and the fitted values.

## The residuals vs fits plot

We’ve talked about what residuals are. The **fitted values** are the values on the regression line (or plane). By design, the residuals of the linear regression model that we’ve studied are uncorrelated with the input data, $X$. This is a rather intuitive assumption to make. Because the fitted values are simply a linear function of the input data, we expect that the fitted values are also uncorrelated with the residuals. This is what we’ll look for when we plot residuals vs fitted values. Here’s a sample:

There are a few things you want to see in a plot like this:

- The points are all over the place with no discernible pattern.
- They’re evenly spread out on both sides of the Residuals = 0 line.
- None of the points seem very different from the others: this implies there aren’t any outliers.

Our plot meets all of the above criteria. Indeed, using *np.corrcoef* to find the **Pearson’s correlation coefficient** gives the value -0.042772, which means that there’s essentially no correlation between them.
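Here’s a sketch of how such a plot can be produced, on a freshly generated dataset like the one above (the seed and figure styling are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Recreate a dataset like the one above and fit a model on all of it
X, Y = make_regression(n_samples=100, n_features=2, noise=18, random_state=0)
model = LinearRegression().fit(X, Y)

fitted = model.predict(X)
residuals = Y - fitted

# Residuals vs fits plot
plt.scatter(fitted, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')

# For OLS with an intercept, training-set residuals are
# uncorrelated with the fitted values
corr = np.corrcoef(fitted, residuals)[0, 1]
print(corr)
```

Note that on the training set, the correlation is essentially zero by construction of least squares; on a held-out test set (as in the number quoted above), it will be small but not exactly zero.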

## Summary

To conclude, while the $R^2$ value is the most frequently quoted metric for regression models, there are other statistical checks you can do to make sure your model really does make sense and that all the parameters are statistically significant.

We discussed the $R^2$ and $\bar{R}^2$ values, and then looked briefly at hypothesis testing. We used the concept of hypothesis testing to define p-values and confidence intervals, and defined a null hypothesis for regression. We then also looked at the residuals vs fits plot, along with code samples for all of the above metrics.