# 3.5 – Regularization

You haven’t truly mastered an algorithm until you’ve implemented it yourself; that’s one of my beliefs, anyway. Which is why, so far, I’ve shown implementations of the crux of each algorithm we’ve discussed. I haven’t done that for softmax yet, though. Although the previous post covers everything we need to fully implement softmax regression, it works significantly better when we also add in a concept called regularization. Let’s discuss that here.

So far, the focus has been on fitting the model well to our training set, and the math we developed does this job well. In fact, sometimes it works too well. Note that for a lot of problems, we’re using a method like gradient descent, constantly tuning the parameters of the model. As a result, the model will sometimes overfit by trying to pass through points that it doesn’t really need to. More precisely, if there’s some noise in the dataset, as there almost always is, the model will try its best to fit those noisy points as well.

So why is this a problem? Here’s an example taken from Wikipedia:

The red dots are the data points. The green curve is what we might expect the model to learn (this is a regression model). The blue curve is what it might actually end up learning. How did it end up learning that, you ask? Well, the blue curve passes exactly through every data point, so its least-squares error is 0; as far as the cost function is concerned, it’s a perfect fit, at least as good as the green curve. Overfitting can have multiple causes, including choosing a model that’s too complex (say, fitting a quadratic curve when a linear one would suffice), choosing initial parameter values badly, and so on. Still, even with everything done right, we don’t want to take the risk of our algorithm predicting the blue curve.
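To see this effect numerically, here’s a small sketch (the data is made up; I’m using numpy’s stock polynomial fitting, not anything from this series). With 8 points, a degree-7 polynomial has enough parameters to pass through every point exactly, noise and all:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple linear trend (hypothetical data).
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(scale=0.1, size=x.size)

# A degree-1 fit captures the trend; a degree-7 polynomial can
# interpolate all 8 noisy points exactly.
simple = np.polyfit(x, y, deg=1)
wiggly = np.polyfit(x, y, deg=7)

def sse(coeffs):
    # Sum of squared errors on the training points.
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

print(sse(simple))  # small, but nonzero
print(sse(wiggly))  # essentially zero: the curve "memorizes" the noise
```

The degree-7 fit “wins” on training error, but it’s the blue curve of the figure: its wiggles are fitting noise, not signal.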

What’s common when models overfit, though, is that the coefficients (the parameters) tend to become very large in magnitude (whether positive or negative). So this is something to go by. Note that our model is trying to minimize the cost function. Therefore, an obvious way to discourage overfitting is to include the magnitudes of the coefficients themselves in the cost function. When we do this, we’re using regularization.

There are several ways of adding the coefficients, with no single right way (okay, so there are some wrong ways, but we’ll not go there). Before we look at how to do this, a word of caution. Remember how, in the least-squares cost function, we simply threw in a $\frac{1}{m}$ factor and said it wouldn’t affect the values of the coefficients we’d obtain? That held true there because when we set the derivative of the cost function to 0, multiplying the whole thing by a constant makes no difference to the values you obtain (this is a calculus concept, so if it went over your head, try understanding the gist of it). However, once we add regularization terms, this no longer holds. We also can’t simply throw a $\frac{1}{m}$ on the regularization term, either (we’ll see why shortly). The fix for this is simple, but important: perform feature scaling of some sort. We discussed this right after regression, and I showed examples in Python. Now that we’ve covered that, we can proceed.
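As a quick refresher, one common form of feature scaling is standardization: shift each feature to zero mean and scale it to unit variance. A minimal sketch with numpy (the data and the helper name are my own, for illustration):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical design matrix with features on very different scales.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 4000.0]])
X_scaled, mu, sigma = standardize(X)
```

Remember to keep `mu` and `sigma` around: new inputs at prediction time must be scaled with the training set’s statistics, not their own.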

Let’s start with the basics. Take our linear regression model, and to the cost function, add the term $\lambda \sum_{j=0}^n |\Theta_j|$. Thus, we have $J(\Theta) = \frac{1}{2m}\sum_{i=1}^m \left( h(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=0}^n |\Theta_j|$

The $\lambda$ term controls how strongly we want to regularize the parameters. You can experiment with different values to find what works best. The above is called lasso regression (for those of you who are curious, lasso stands for “least absolute shrinkage and selection operator”). The regularization term we added, that is, the sum of the absolute values (ignoring the $\lambda$ term), is called the L1 norm. (In practice, the bias term $\Theta_0$ is often left out of the penalty sum; I’ve included it here to keep the notation simple.) Although lasso regression was introduced for linear regression, you can also use it with generalized linear models.
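The lasso cost is straightforward to compute. Here’s a rough sketch in numpy (the function name and the toy data are my own; this isn’t the series’ implementation):

```python
import numpy as np

def lasso_cost(theta, X, y, lam):
    """Least-squares cost plus an L1 penalty, per the formula above.

    Assumes X already includes a leading column of ones for the bias
    and, following the formula as written, penalizes Theta_0 as well.
    """
    m = X.shape[0]
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m) + lam * np.sum(np.abs(theta))

# Tiny hypothetical example: a perfect fit leaves only the penalty term.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])    # bias column + one feature
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])  # h(x) = x, which fits y exactly
print(lasso_cost(theta, X, y, lam=0.5))  # only the L1 penalty remains: 0.5
```

Note that the squared-error term vanishes here, yet the cost is still positive; that residual penalty is exactly the pressure that keeps the parameters small.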

We could also take the square of each $\Theta_j$ instead of the absolute value. When we do that, the resulting model is called ridge regression. In machine learning, the term weight decay has caught on better, because these regularization methods effectively cause the parameter values (or “weights”) to go down (or “decay”), resulting in a model that does not overfit. Concretely, the cost function is $J(\Theta) = \frac{1}{2m}\sum_{i=1}^m \left( h(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=0}^n \Theta_j^2$

We added the 2 in the denominator for the same mathematical convenience as in the ordinary least-squares cost function. This regularization term (again, without the $\lambda$ term) is the square of the L2 norm. Instead of the summation, you may see $\left \| \Theta \right \|^2$, where $\left \| \Theta \right \|$ is the mathematical notation for the L2 norm of $\Theta$.
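A quick sketch of the ridge cost and its gradient (again, hypothetical helper names and toy data, not the series’ implementation). The gradient makes the “weight decay” nickname concrete: the penalty contributes a $\lambda \Theta$ term, so every gradient step pulls the weights toward zero:

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """Least-squares cost plus the (lambda/2) * squared-L2 penalty above."""
    m = X.shape[0]
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m) + (lam / 2) * (theta @ theta)

def ridge_gradient(theta, X, y, lam):
    """Gradient of ridge_cost: the penalty contributes lambda * theta,
    which is what shrinks ("decays") the weights each gradient step."""
    m = X.shape[0]
    return X.T @ (X @ theta - y) / m + lam * theta

# Same tiny example as before: a perfect fit, so only the penalty acts.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])
print(ridge_cost(theta, X, y, lam=0.5))      # (0.5/2) * 1 = 0.25
print(ridge_gradient(theta, X, y, lam=0.5))  # [0.0, 0.5]: pure decay term
```

With zero residuals, the gradient is just $\lambda \Theta$, so a gradient-descent step would multiply each weight by $(1 - \alpha \lambda)$, which is the decay in action.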

Implementing these is easy, because the terms are simple, and taking their derivatives is easy too, so I won’t give them a separate post. Instead, I’ll next discuss how to implement softmax regression with regularization.