# 3.6 – Completing Softmax Regression: Implementation and Regularization

We’ve now laid all the groundwork to successfully implement softmax regression with regularization! This post will be easy on the math, because we’ve covered it in the previous two posts. Briefly, we’ll first recall the the formulas so we can get to implementing the algorithm.

Let’s first recall the gradient of the log-likelihood function: $\frac{\partial}{\partial \Theta_p} \ell(\Theta) = \sum_{i=1}^m x^{(i)} ([y^{(i)} = p] - s_p)$

Pretty simple. Also recall that we wish to maximize the log-likelihood, but gradient descent minimizes a function. The fix is simple: we simply minimize the negative log-likelihood. We’ll also add in the regularization term. For softmax regression, we’ll use the L2 regularization method. I won’t prove it here because it’s rather simple, but the partial derivative of the L2 regularization term is simply $\lambda \Theta_p$.

Thus, the final gradient of our cost function, which we’ll minimize with gradient descent is: $\frac{\partial}{\partial \Theta_p} J(\Theta) = -\sum_{i=1}^m x^{(i)} ([y^{(i)} = p] - s_p) + \lambda \Theta_p$

Now that we have everything we need, we can start implementing! We’ll also go ahead and compare our implementation with the built-in one. The library one will do better (kind of a bummer) because they use more advanced methods of optimization. We’ll make ourselves content as long as we’re not too far behind. Let’s begin by importing the libraries.


import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import math
import scipy

sns.set(style='white')



We have a lot more libraries than usual here, but that’s mostly because we’re also using the built-in methods to classify and see how good the library does and how good we do. Let’s create a dataset to classify and plot it:


data = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, n_classes=3,
n_clusters_per_class=1)
# Plot our data
fig = plt.figure(figsize=(8, 8))

ax.scatter(X.T, X.T, Y, c='r');
ax.set_xlabel('$X_1$')
ax.set_ylabel('$X_2$')
ax.set_zlabel('Y');



I got the following dataset: Seems like something we could work with. We’ll now do something that’s common for testing your machine learning algorithms. What we’ll do is instead of training our model on the whole dataset, we’ll use only 70% for training. This means the model will not see 30% of the data. We’ll then ask our model to predict on the 30% it hasn’t seen. If it does well there, then we know we’re headed the right way. We won’t implement this splitting, and instead use the built-in function to do this. The built-in function also randomly shuffles the dataset to make it a very fair method of evaluating our model. Oh, and let’s not forget to add the $X_0 = 1$ terms to our dataset first:


X_with_ones = np.hstack((np.ones((X.shape, 1)) ,X))
X_train, X_test, Y_train, Y_test = train_test_split(X_with_ones, Y, train_size=0.7, shuffle=True)



Let’s see how well the built-in model performs using the multinomial model:


from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, Y_train)
print(accuracy_score(model.predict(X_test), Y_test))



Pretty straightforward. Train on the training set, which is 70% of the data, and predict on the remaining 30%. On my system, it gives me 0.9, which means it predicted the right class 90% of the time. Quite well done for so few lines of code, really! Let’s now work on building our model. Remember: the library version is more sophisticated than ours, so it will do better than ours. We’ll see how close we can get.

Let’s start by implementing the softmax function:


def softmax(theta):
"""
Returns a vector, the softmax of the vector eta.
"""
exps = np.exp(theta - np.max(theta))
return exps / np.sum(exps, axis=0)



This is exactly the softmax function we’ve seen, but with one difference: in the exponent of each term, we subtracted the maximum value of $\theta$. The reason we do this is for numerical stability. Basically, because we’re using the exponential function, we need to be careful that it doesn’t shoot to infinity (as the exponential function does very quickly). To do this, we make 0 the greatest possible exponent (you need to convince yourself that by subtracting the maximum value of $\theta$ from each of the values, the maximum exponent value is 0). This is just a mathematical detail, really.

We now need an initial value for the $\Theta$ matrix (remember: in softmax regression, it’s a matrix and not just a vector). The best way to do this is to initialize it randomly. We’ll initialize it according to a Normal distribution. $\Theta$ has dimensions $(k, n+1)$, but to keep things simple, we’ll simply hard-code these values:


# Get initial theta values randomly
theta = np.random.randn(3, 3) # k x (n + 1)



We’ve laid the foundations to start implementing the model. Let’s now start with the gradient of the log-likelihood function. Although we’ve named the variable as $n$, its value is $n+1$. At the cost of annoying the reader, this simplifies the variable name.


def grad_nll(x, y, p, theta, lambd):
"""
Returns the gradient of the negative log-likelihood function w.r.t theta_p.

Args:
x: (m, n + 1) matrix
y: (m, 1) vector
theta: (k, n+1) matrix
"""
m = x.shape
n = x.shape

cur_sum = np.zeros(n, dtype=np.float64)
for i in range(m):
cur_sum += x[i] * ((y[i] == p) - softmax(np.dot(theta, x[i].T))[p])

return -cur_sum / m + lambd * theta[p]



We added the regularization parameter at the end, and divided by $m$ to maintain consistency with our earlier models. This doesn’t affect the working of the model.

The actual implementation here is really in the one-liner in the for loop. I’ll admit this took me a while to figure out. While this doesn’t exactly match what our formula says, the way to think about it is to remember what dimensions we need. Our softmax function returns a vector the same dimensions as we pass it. x[i] has dimensions $n+1$ (it’s a vector). Our Iverson notation is almost the same when written in Python. We need the value in the parentheses to be a number. Now this is the part I had to fiddle around a lot. Notice that $\Theta \cdot x$ will give us a matrix of dimensions $(k, m)$. Oops. But our formula says we need to use $x^{(i)}$ (which is a vector), not $x$. That’s what we did. Now, $\Theta \cdot x^{(i)}$ has dimensions $(k, 1)$. Well this is better than a matrix, but we still need a real number. Since we’re finding the partial derivative with respect to $\Theta_p$, it only makes sense to use the pth value, which is what we’ve done. And this is why it’s important to remember the dimensions of the terms in our equations.

After that, the easy part is left: gradient descent. It’s a little less trivial then what we did with linear regression. Note that because we need to find the partial derivatives with respect to all the $\Theta_p$s, there’s going to be an additional loop for this. I can’t help but thinking there’s a better, more efficient way to implement this, but this crude implementation, although slow, does work.


def gradient_descent(theta, x, y, iterations=1000, alpha=0.01, lambd=1):
"""

Args:
theta: (n + 1, k) vector
x: (m, n+1) vector
y: (m, 1) vector
iterations: number of iterations to perform
alpha: learning rate
lambda: regularization parameter

"""
m = x.shape
n = x.shape
k = theta.shape

for i in range(iterations):
for p in range(k):

return theta



Like I said, I’m pretty sure there’s some sort of better way of doing this, but this code is at least easy to understand. We find the partial derivatives for each $\Theta_p$ vector, make sure it’s the same dimensions as $\Theta$, and then perform the gradient descent update.

Let’s run gradient descent for 500 iterations. From what I’ve seen, this is more than enough: you might get the same result with say 100 or 200 iterations as well. We’ll use a rather standard 0.1 for our learning rate.


theta = gradient_descent(theta, X_train, Y_train, iterations=500, alpha=0.1)



Okay, so we have our updated values of $\Theta$. Let’s use it to make predictions.


probabilities = softmax(np.dot(X_test, theta))
predictions = np.argmax(probabilities, axis=1)
predictions = predictions.reshape((60, 1))
right_predictions = np.sum(Y_test == predictions)
print('Accuracy = {0}'.format(right_predictions / len(predictions)))



First, we find the probabilities for each class. We then find the class with the highest probability using the numpy argmax function. These are our predictions, which we need to compare with the actual values. We make sure it’s the right shape (dimensions) as the actual values (which are 30% of our dataset). The shape here is (60, 1) because our dataset had 200 points, and 30% of the dataset is used for testing, which is 60 points. Then, we find the accuracy by finding the fraction of right class predictions we made.

On my system, this gave the output:

Accuracy = 0.8833333333333333

Which means our crude implementation predicted the right class 88.3% of the time! Not bad at all compared to the library function, which got 90%. Your mileage may vary, of course, but usually our implementation should perform in the same ballpark as the library version.

While it’s always recommended to use library functions, the purpose of this exercise was to make sure you clearly understand how softmax regression works. You now know the math, a brief intuition, and the implementation for logistic regression and softmax regression. Congratulations!