3.4 – Softmax Regression

Note: This is based on Andrew Ng’s course at Stanford University, but I culled information from other sources that I’ve attributed appropriately as well.

Alright, we’ve talked about a one-vs-rest implementation for multi-class classification using logistic regression, now we’ll look at the other method, softmax regression. If you recall, we specified an additional parameter, multi_class=’ovr’ while training a multi-class logistic regression model in sklearn. That’s because the default algorithm used is softmax regression, which is based on the multinomial distribution. In this post, I’ll try to explain how exactly this algorithm works. This post is pretty much all math, but hopefully I’m lucid enough to be understood.

This method is heavily based on statistical concepts, but I’ll try to go slow and explain everything one by one (this might turn out to be a long post).

Recap: probability distributions

I can’t think of a better way to explain what a probability distribution is than the way Wikipedia has phrased it:

…a probability density function (PDF), or density of a continuous random variable, is a function, whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

Easy, right? It’s a function of a “random variable”, which gives the likelihood of that variable taking on specific values. There are several distributions–binomial, exponential, Gaussian, it’s a huge list. You may have encountered these in a probability class.

Let’s now look at one particular example, the Gaussian (this is just an example, really). The Gaussian is a function parameterized by its mean, \mu, and its variance, \sigma^2. If we vary these parameters, we get different Gaussian distributions, as shown below:

By Inductiveload – self-made, Mathematica, Inkscape, Public Domain, Link

Clearly, the Gaussian is a whole family of probability distributions. This is the case with other distributions as well. Let’s now generalize this.

Exponential family distributions

There’s a class of distributions called exponential family distributions. A distribution is belongs to this class of distributions if it can be written in the form:

P(y ; \eta) = b(y) \exp(\eta^T T(y) - a(\eta))

Let’s talk about all these variables:

  • \eta is called the natural parameter. When proving that a certain distribution (such as the Gaussian) belongs to the  exponential family, it becomes convenient to change the parameters of the original distribution to new parameters. The new natural parameters are different for different distributions. For example, in the Gaussian example, the original parameters were \mu and \sigma^2, but the natural parameters are \frac{\mu}{\sigma^2} and \frac{-1}{2 \sigma^2}. For those of you interested in the derivation, these notes from Princeton University provide it for a few common distributions. The natural parameter is also called the canonical parameter.
  • T(y) is the sufficient statistic. This is one of those tricky terms that’s easy conceptually, but you need to read it over and over to get it. A sufficient statistic of a sample of data about a parameter is a statistic (like the mean and the variance) that provides as much as information as we can possibly get about that parameter from that data. As an example, say we use the sample mean to get an estimate the population mean. For example, given a sample of data distributed according to the Gaussian, it can be proved using maximum likelihood estimation that the sample mean is the best estimate of the population mean. Thus, for a Gaussian distribution, the sample mean is the sufficient statistic of the population mean.
  • a(\eta) is called the log partition function or the cumulant generating function. The quantity e^{-a(\eta)} normalizes the distribution so that the area under the curve is 1; that is, it sums (in case of a discrete distribution) or integrates (in case of a continuous distribution) to 1. This has the additional property that its first derivative yields the mean, and its second derivative yields the variance.

Many common distributions such as the Binomial, Gaussian, Weibull, and Poisson are examples of distributions that belong to the exponential family. More importantly to our discussion, the multinomial distribution, an extension of the Bernoulli distribution, also belongs to this class.

Generalized Linear Models (GLMs)

Consider some models we’ve seen so far. In logistic regression, the output was either 0 or 1. Thus, in logistic regression, y|x ; \Theta followed a Bernoulli distribution. Similarly, although not mentioned, the linear regression, which used a least-squares cost function, assumed that y|x ; \Theta followed a Gaussian distribution.

More generally, we can have a model that assumes that the output variable follows a distribution that belongs to the exponential family of distributions. Such a model is called a generalized linear model. Apart from the distribution of the output variable belonging to the exponential family, GLMs have some other assumptions:

  • Given input x, we would like to predict \mathbb{E}[T(y) | x ; \Theta] (the expected value of T(y)|x;\Theta).
  • The natural parameter \eta = \Theta^T x. If, as in the Gaussian example, \eta is a vector, then we have \eta_i = \Theta^T_i x.

Deriving the softmax regression model

Let’s now work towards a (generalized linear) model for multi-class classification. Since we have, in general, k classes, our predictor variable follows a multinomial distribution, that is, y \in \{ 1, 2, \ldots, k \}.

To decide on parameters, we could choose k parameters \phi_1, \phi_2, \ldots, \phi_k. However, this leads to a problem. To understand why, let’s first look at the Bernoulli distribution (the one we used for logistic regression), and then come back to the multinomial. Recall that

P(y = 1 | x ; \Theta) = h(x)

for logistic regression. We define this value as a parameter. Thus, we have:

\begin{aligned} P(y = 1 | x ; \Theta) &= \phi \\ P(y = 0 | x ; \Theta) &= 1 - \phi \end{aligned}

The second equation follows from that there are only two possible outcomes, 0 and 1. Here, we have one parameter, \phi, and

P(y = 1 | x; \Theta) + P(y = 0 | x; \Theta) = \phi + 1 - \phi = 1

That is, the sum of the probability values for each possible output value should sum to 1. We could certainly do this for our multinomial model:

\begin{aligned} P(y = 1|x;\Theta) &= \phi_1 \\ P(y = 2|x;\Theta) &= \phi_2 \\ &\vdots \\ P(y = k|x;\Theta) &= \phi_k \end{aligned}

The issue is easier to see now. With all these parameters, we have

\sum_{i = 1}^k \phi_k = 1

But now, the parameters aren’t independent. That is, one parameter is redundant, because we can get the last parameter by subtracting the sum of the others from 1. The fix is easy: we simply remove one parameter, and like the Bernoulli model, the last “parameter” is

\phi_k = 1 - (\phi_1 + \phi_2 + \ldots + \phi_{k-1})

Thus, we have k - 1 parameters. Remember: these aren’t the natural parameters. We’ll work towards that. So now we need to get an exponential family distribution. Let’s work on T(y), and define the vectors T(y) as follows:

T(1) = \begin{bmatrix}1\\ 0\\ 0\\ \vdots \\ 0 \end{bmatrix}, T(2) = \begin{bmatrix}0\\ 1\\ 0\\ \vdots \\ 0 \end{bmatrix} \ldots T(k-1) = \begin{bmatrix}0\\ 0\\ 0\\ \vdots \\ 1 \end{bmatrix}, T(k) = \begin{bmatrix}0\\ 0\\ 0\\ \vdots \\ 0 \end{bmatrix}

Since T(y) is a vector, we use our usual notation and denote the ith element by T(y)_i. With this framework in place, we can prove that the multinomial distribution is a member of the exponential family of distributions.

\begin{aligned} P(y;\phi) &= \phi_1^{T(y)_1}\phi_2^{T(y)_2} \ldots \phi_k^{T(y)_k} \\ &= \phi_1^{T(y)_1}\phi_2^{T(y)_2} \ldots \phi_k^{1 - \sum_{i=1}^{k-1} T(y)_i} \\ &= \exp \left( T(y)_1 \log(\phi_1) + T(y)_2 \log(\phi_2) + \ldots + \left( 1 - \sum_{i=1}^{k-1} T(y)_i \right) \log(\phi_k \right) \\ &= \exp \left( T(y)_1 \log \left( \frac{\phi_1}{\phi_k} \right) + T(y)_2 \log \left( \frac{\phi_2}{\phi_k} \right) + \ldots + T(y)_{k-1} \log \left( \frac{\phi_{k-1}}{\phi_k} \right) + \log(\phi_k) \right) \end{aligned}

This is in the form of an exponential family distribution, with

\begin{aligned} b(y) &= 1 \\ a(\eta) &= -\log(\phi_k) \\ \eta &= \begin{bmatrix} \log \left( \frac{\phi_1}{\phi_k} \right) \\ \log \left( \frac{\phi_2}{\phi_k} \right) \\ \vdots \\ \log \left( \frac{\phi_{k-1}}{\phi_k} \right) \end{bmatrix} \end{aligned}

Let’s go over the steps above:

  • We formed the first step in the same way that we formed a similar equation for logistic regression. This is merely an extension, but it works very well. To check for yourself, notice that T(y)_i is 1 only for the ith position, and 0 elsewhere. Hence, when you substitute any value, all the multiplicands except the ith become 1 (because they’re raised to the 0th power).
  • In the second step we simply used the equation for our parameters.
  • In the third step, we simply used the property of logarithms:

a^x = \exp(x \log a)

  • Next, we expanded the last term, and used the property of logarithms:

\log x - \log y = \log \left( \frac{x}{y} \right)

Now, note that \eta_i = \log \left( \frac{\phi_i}{\phi_k} \right). For convenience, let’s also define \eta_k = \log \left( \frac{\phi_k}{\phi_k} \right) = \log 1 = 0.

Before we proceed further, take a moment to notice how \eta is a vector. As we discussed way back in GLMs, when \eta is a vector, we have \eta_i = \Theta_i^T x. But wait, this means that there’s not just one \Theta vector, there are multiple of them. In fact, we have k-1 of them, from \Theta_1 through \Theta_{k-1}. For convenience, let’s throw in that last one as well, and define it as \Theta_k = 0 so that \eta_k = \log 1 = 0 just as before. You should make sure you understand well that we have k parameter vectors denoted by \Theta_i. In logistic regression (with 2 classes), we had only one (2 – 1 = 1), here (with k classes) we have k - 1 (but we defined an extra \Theta_k so we have k of them).

Continuing from where we left off,

\begin{aligned} e^{\eta_i} &= \frac{\phi_i}{\phi_k} \\ \phi_k e^{\eta_i} &= \phi_i \\ \phi_k \sum_{i=1}^k e^{\eta_i} &= \sum_{i=1}^k \phi_i = 1 \end{aligned}

And so,

\phi_k = \frac{1}{\sum_{i=1}^k e^{\eta_i}}

Substituting this in the second equation above gives us

\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}}

Let’s look at what we have here. Given a vector \eta with k elements, we have a function that takes this vector and gives us another vector, \phi, also with k values, defined by the above equation, and subject to the conditions

\begin{aligned} \sum_{i=1}^k \phi_i &= 1 \\ \phi_i > 0 \end{aligned}.

This function is a generalization of the logistic function, and is called the softmax function. The case is similar to that of logistic regression, where we were actually performing regression, but cleverly choosing a bounded function. It’s harder to see in this case because it’s multidimensional.

Finally, let’s use the last assumption stated waay back in the GLMs section. You don’t need to scroll all the way up, I’ll restate it here: we assumed that

\eta_i = \Theta_i^T x \ \ \ \ \ \forall i = 1, 2, \ldots, k - 1

Each of these \Theta_is is a vector of (n+1) dimensions. And thus, our final softmax regression model is

P(y=i |x;\Theta) = \frac{e^{\Theta_i^T x}}{\sum_{j=1}^k e^{\Theta_j^T x}}

The model will output a vector:

\begin{aligned} h(x) &= \mathbb{E}[T(y) |x; \Theta ] \\ &= \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} \\ &= \begin{bmatrix} P(y =1|x;\Theta) \\ P(y=2|x;\Theta) \\ \vdots \\ P(y = k - 1 | x;\Theta) \end{bmatrix} \end{aligned}

Notice that the above only outputs a (k-1) dimension vector. We can find

P(y = k|x;\Theta) = \phi_k = 1 - \sum_{i=1}^{k-1} \phi_i.

With our model finally defined, all that’s left to do is figure out how to find the optimal values of the parameters \Theta.

Finding the optimal parameter values

Credits: I got stuck myself in the last part of this derivation, and this answer at CrossValidated helped me proceed.

To find the optimal values of \Theta to fit the data, we use the now-familiar technique of maximum likelihood estimation. As always, we’ll start with the log-likelihood function. Here, we’ll just directly write it down without writing the likelihood first.

\begin{aligned} \ell(\Theta) &= \sum_{i=1}^m \log P\left( y^{(i)}|x^{(i)};\Theta \right) \\ &= \sum_{i=1}^m \log \prod_{j=1}^k \left( \frac{e^{\Theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{\Theta_l^T x^{(i)}}} \right)^{T(y^{(i)})_j} \\ &= \sum_{i=1}^m \sum_{j=1}^k \log \left( \frac{e^{\Theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{\Theta_l^T x^{(i)}}} \right)^{T(y^{(i)})_j}  \end{aligned}

To find the derivative of this, let’s first find the derivative of the softmax function.

\begin{aligned} \frac{\partial}{\partial \Theta_p} \left( \frac{e^{\Theta^T_j x^{(i)}}}{\sum_{l=1}^k e^{\Theta^T_l x^{(i)}}} \right) &= \frac{\partial}{\partial \Theta^T_p x^{(i)}} \left( \frac{e^{\Theta^T_j x^{(i)}}}{\sum_{l=1}^k e^{\Theta^T_l x^{(i)}}} \right) \cdot \frac{\partial}{\partial \Theta_p} \left( \Theta^T_p x^{(i)} \right) \\ &= \frac{\frac{\partial}{\partial \Theta^T_p x^{(i)}} \left( e^{\Theta^T_j x^{(i)}} \right) \sum_{l=1}^k e^{\Theta^T_l x^{(i)}} - e^{\Theta^T_j x^{(i)}} \cdot \frac{\partial}{\partial \Theta^T_p x^{(i)}} \left( \sum_{l=1}^k e^{\Theta^T_j x^{(i)}}\right )}{\left( \sum_{l=1}^k e^{\Theta^T_l x^{(i)}} \right)^2} \cdot \frac{\partial}{\partial \Theta^T_p} \left( \Theta_p^T x^{(i)} \right) \\ &= \frac{[p=j] e^{\Theta^T_j x^{(i)}}\sum_{l=1}^k e^{\Theta^T_l x^{(i)}} - e^{\Theta^T_j x^{(i)}} \cdot e^{\Theta^T_p x^{(i)}} }{\left( \sum_{l=1}^k e^{\Theta^T_l x^{(i)}} \right)^2} \cdot x^{(i)} \\ &= \left( \frac{[p=j] e^{\Theta^T_j x^{(i)}}}{\sum_{l=1}^k e^{\Theta^T_l x^{(i)}}} - \frac{e^{\Theta^T_j x^{(i)}}}{\sum_{l=1}^k e^{\Theta^T_l x^{(i)}}} \cdot \frac{e^{\Theta^T_p x^{(i)}}}{\sum_{l=1}^k e^{\Theta^T_l x^{(i)}}} \right) \cdot x^{(i)} \\ &= \left([p=j] s_j - s_j s_p \right) \cdot x^{(i)} \\ &= s_j([p=j]-s_p)x^{(i)} \end{aligned}

Let’s review the steps.

  • We first employ the chain rule to make the computation significantly less confusing. To do this, we first find the derivative with respect to the power, and then with respect to \Theta_p.
  • I then used the quotient rule for derivatives. This gets long because each term is long, but bear with me.
  • In the next step, I introduce the Iverson notation, denoted by [i=j], which is equal to 1 if i = j, and 0 otherwise. To understand how I got the result there, note that we’re finding the (partial) derivative with respect to \Theta_p^T x^{(i)}. If the exponent were anything else (that is, if the exponent were say \Theta_q^T x^{(i)} and p \neq q), then the derivative of that term would be 0, because the term would be treated as a constant. You might want to re-read and make sure you understand why. The Iverson term there says that that term exists only when p = j.
  • The next step is simply obtained by simplifying and splitting the big fraction into two smaller fractions, and canceling out common terms in the numerator and the denominator.
  • Finally, we note that except the delta term, everything else is the softmax function, and use the concise notation s_j to denote the softmax for the vector \Theta_j.

Now we have the derivative of the softmax function. If you look at the log-likelihood function again, you’ll see that we have a double summation and a log as well. Let’s take care of one summation and the logarithm now:

\begin{aligned} \frac{\partial}{\partial \Theta_p} \left( \sum_{j=1}^k \log s_j \right) &= \sum_{j=1}^k \frac{\partial}{\partial \Theta_p} (\log s_j) \\ &= \sum_{j=1}^k \frac{1}{s_j} \cdot \frac{\partial s_j}{\partial \Theta_p} \\ &= \sum_{j=1}^k \frac{1}{s_j} \cdot s_j ([p=j]-s_p) x^{(i)} \\ &= \sum_{j=1}^k ([p=j]-s_p)x^{(i)} \end{aligned}

The steps here should be fairly easy to follow, it’s just a simple application of the chain rule. However, the original log-likelihood function also has a T(y^{(i)})_j exponent. This really isn’t a problem:

\begin{aligned} \frac{\partial}{\partial \Theta_p} \left( \sum_{j=1}^k \log (s_j)^{T(y^{(i)})_j} \right) &= \frac{\partial}{\partial \Theta_p} \sum_{j=1}^k T(y^{(i)})_j \log s_j \\ &= T(y^{(i)})_j \sum_{j=1}^k ([p=j]-s_p)x^{(i)} \end{aligned}

At this point, let’s note that T(y^{(i)})_j = 1 only when y^{(i)} = j, and so we can use the Iverson notation again. This will be useful for us in a minute:

\frac{\partial}{\partial \Theta_p} \left( \sum_{j=1}^k \log (s_j)^{T(y^{(i)})_j} \right) = x^{(i)}\left( \sum_{j=1}^k [y^{(i)}=j] \cdot [p=j] - s_p \sum_{j=1}^k [y^{(i)}=j] \right)

I’ve done quite a bit of shuffling of terms here, so let’s take things step by step.

  • I first converted the T(y^{(i)})_j to the Iverson notation.
  • To make terms cleaner, I took the x^{(i)} term outside the summations, since that term isn’t dependent on the summation variable j.
  • Similarly, the s_p term is also not dependent on the summation variable, so it can be taken outside.
  • Then, I multiplied the Iverson function throughout (using the distribution rule of multiplication), and split the summation into two summations.

You should work these steps out if this seems like a big jump. Next, let’s simplify some terms.

  1. Take the first summation and let’s intuitively understand what the two Iverson terms mean. The first term is 1 only if y^{(i)} = j. The second one is 1 only if j = p. Clearly, we can see that this term is 1 only when y^{(i)} = p. Thus, this summation reduces to \sum_{j=1}^k [y^{(i)}=p]. Intuitively, what does this summation represent? If we sum up all these terms, effectively, we’re counting the number of terms such that y^{(i)}=p. Since these are two constant values, the summation doesn’t really matter, because either they’re equal, or they’re not. If they are, then the value of the summation is 1, otherwise, it’s 0. Why? Because y^{(i)} is the output variable, which in multi-class classification is the class (remember, this is what we’re trying to do), and p is what we’re finding the derivative with respect to. Obviously, both these values lie between 1 and k. With this understanding, you should now see that this summation term is simply equal to [y^{(i)}=p].
  2. The second summation has a very similar reasoning. y^{(i)} can equal only one value of j (because y^{(i)} is constant and in the range of j. This time, though, we’re sure that y^{(i)} will definitely equal some class, so we don’t require an Iverson term here: the summation will always equal 1.

\frac{\partial}{\partial \Theta_p} \left( \sum_{j=1}^k \log (s_j)^{T(y^{(i)})_j} \right) =  x^{(i)} \left( [y^{(i)}=p] - s_p \right)

And finally, computing the gradient of the log-likelihood function,

\begin{aligned} \frac{\partial}{\partial \Theta_p} \ell(\Theta) &= \frac{\partial}{\partial \Theta_p} \left( \sum_{i=1}^m \sum_{j=1}^k \log (s_j)^{T(y^{(i)})_j} \right) \\ &= \sum_{i=1}^m x^{(i)} ( [y^{(i)} = p] - s_p) \end{aligned}

Unfortunately, we can’t really do much analytically here. It doesn’t lead us very far towards finding the maximum likelihood estimates. However, this expression is nice and simple, and we can use a computational method to find the minimum values of the parameters \Theta.

Instead of maximizing the log-likelihood, we can minimize the negative log-likelihood, and a simple way to do that is to use the familiar gradient descent. We found the gradient of the log-likelihood, so we just have to pick an initial guess and repeatedly move in the opposite direction of the gradient.

Let’s now recap what we’ve seen in this blog post.


We’ve covered a lot of ground in this post. Let’s review.

  • We started by saying that softmax regression was an alternate way of using logistic regression for multi-class classification. In fact, it’s a generalization of logistic regression.
  • Then, we laid the foundations for generalized linear models (GLM) by briefly discussing exponential family distributions, of which the multinomial distribution is a member.
  • We then saw that softmax regression is an example of a generalized linear model (GLM) using the multinomial distribution.
  • Then, we started talking about the math and showed how softmax regression is, in fact, a GLM. We did this by first finding P(y|x;\Theta), and showing that each of our k-1 parameters \phi_i are softmax functions, hence the name of the algorithm.
  • We went on to define our hypothesis function for the model, which simply gave as output, a vector with the probabilities of the input being in each of the k classes.
  • Last, we tried finding the maximum likelihood estimates for the model. It was significantly longer than previous derivations, involving a new Iverson notation and the derivative of a rather complicated function. We formulated the log-likelihood function and found its derivative, and saw that it wasn’t steering us in the right direction, concluding that computational methods like gradient descent are better suited to this.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s