3 – Logistic Regression

Let’s talk about classification. “But the title says regression!” might be a first thought. We’ll get there. Classification is what the name suggests–you need a model that classifies or segregates incoming data into 2 or more groups. For the majority of this, we’ll talk only about binary classification–when there are only two classes. As you’ll see, it’s easy to adapt a binary classification algorithm to multiple classes.

Using our previous notation, $y$ being the output variable, in binary classification, $y \in \{0, 1\}$. For classification, we can modify our hypothesis function to

$h(x) = g(\Theta^T x)$

where $g$ is some function. A very popular choice for this function is the sigmoid function, which is also called the logistic function. Using this function is what gives our algorithm its name. The logistic function is defined as:

$g(z) = \frac{1}{1 + e^{-z}}$

Here’s how this function looks:

As you can see, the logistic function is bounded between 0 and 1. This makes it very convenient to use it for classification, because if the output of our hypothesis function is above 0.5, we predict 1; otherwise, we predict 0.

Besides changing the hypothesis function, what we do is pretty much the same as what we do in linear regression (the math changes, obviously). We start with an initial guess for $\Theta$, and try to tweak it so that the cost function $J(\Theta)$ is minimum. So basically, we’re performing regression, but cleverly using a function that’s bounded between 0 and 1, which lets us use this for classification. And this is why logistic regression is put under classification algorithms.

How do we interpret this hypothesis function? We can interpret the output as the percentage or degree to which we’re confident that the output is 1. Mathematically,

$h(x) = P(y = 1 | x)$

This equation says the same thing we put in words.

Let’s note now, that $g(z) > 0.5$ when $z > 0$, and thus, $h(x) = g(\Theta^T x) > 0.5$ when $\Theta^T x > 0$.

Decision boundaries

Using the above equation, we can see that there’s a point at which the logistic regression model changes its decision. We can work on the features and tweak the parameters to get decision boundaries of many shapes. Let’s see some examples.

Linear decision boundary

Let’s use the following imports in Python:


import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

sns.set(style='white')



Then we create linearly separable data:


# Make 2 clusters of linearly separable data
data = make_blobs(n_samples=40, centers=2)



Let’s extract the $x_1$ and $x_2$ values and visualize them.


x1 = np.array([d[0] for d in data[0]])
x2 = np.array([d[1] for d in data[0]])

plt.scatter(x1, x2)



This gives the following on my system. This will be different on every run.

Let’s define the sigmoid function:


def sigmoid(x):
return 1 / (1 + np.exp(-x))



Let’s say we found out the value for $\Theta = [22, 11, 12]$.


theta = np.array([22, 11, 12])



Let’s define the hypothesis function. Remember to append the 1s!


def hypothesis(theta, x1, x2):
X = [[1] * len(x1), x1, x2]
return sigmoid(np.dot(theta.T, X))



Let’s also make an array that assigns classes to each point.


y = [0 if x1[i] &lt; 4 else 1 for i in range(len(x1))]



And finally, we plot a decision boundary, which in this case is linear.

# Make a continuous grid xx, yy = np.mgrid[-2:11:.01, -11:2:.01] grid = np.c_[xx.ravel(), yy.ravel()]

# Evaluate the probability at each point on the grid probs = hypothesis(theta, grid[:,0], grid[:,1]).reshape(xx.shape) f, ax = plt.subplots(figsize=(8, 6))

# Plot the probability grid as a contour map ax.contour(xx, yy, probs, levels=[.5], cmap="Greys", vmin=0, vmax=.6)

# Show the original data on the map as well ax.scatter(x1, x2, c=y, s=50, cmap="RdBu", vmin=-.2, vmax=1.2, edgecolor="white", linewidth=1) ax.set(aspect="equal", xlim=(-2, 11), ylim=(-11, 2), xlabel="$X_1$", ylabel="$X_2$")

Here’s the output on my machine. (btw, sorry for the lack of syntax highlighting; it was incredibly annoying to get it to work, and I gave up in the end).

Here we have our linear decision boundary for the theta that we chose. Let’s now try and get a nonlinear boundary.

Nonlinear decision boundary

After a little fidgeting on Stack Overflow, I was able to generate the perfect dataset to demonstrate this.


import math
from random import random
pi = 3.14159

# From https://stackoverflow.com/a/44356472
def rand_cluster(n,c,r):
"""returns n random points in disk of radius r centered at c"""
x,y = c
points = []
for i in range(n):
theta = 2*math.pi*random()
s = r*random()
points.append((x+s*math.cos(theta), y+s*math.sin(theta)))
return points

points = np.array([(math.cos(x) * 4, math.sin(x) * 4) for x in np.linspace(0, 2 * pi, 1000)])
cluster2 = np.array(rand_cluster(200, (0, 0), 2))



And then, visualize it as before:

f, ax = plt.subplots(figsize=(8, 6))
ax.scatter(points.T[0], points.T[1])
ax.scatter(np.array(cluster2).T[0], np.array(cluster2).T[1])
ax.set(aspect="equal", xlim=(-5, 5), ylim=(-5, 5), xlabel="$X_1$", ylabel="$X_2$")


Fantastic. A nice test dataset that’s separable by a circle. Let’s look at how we might do this. We’ll have to modify things a bit first. Clearly, passing a linear function like what we’re passing to the sigmoid function above won’t work. But we still do need to fit to a logistic function (because that’s the point of this post!), so that part can’t change. And we don’t exactly want to change the fact that we’re passing in $\Theta^T X$ either. We really like the way things are. So what do we do?

Let’s change some features. The decision boundary, we can see, is a circle. So instead of using $x_1$ and $x_2$, let’s instead use $x_1^2$ and $x_2^2$ as features. Playing around with features like this is certainly allowed; we even have a name for it! This is called feature engineering. Since I made the data, I can tell you that a valid value for $\Theta$ is $[-6, 1, 1]$. Just to be clear again, we’re still passing in $\Theta^T x$ to the sigmoid function, but now,

$x = \begin{bmatrix} 1 \\ x_1^2 \\ x_2^2 \end{bmatrix}$

Equipped with our new game plan, let’s do this in Python:

theta = np.array([-6, 1, 1])
X = np.concatenate((points.T[0], cluster2.T[0]))
Y = np.concatenate((points.T[1], cluster2.T[1]))

# Notice how we're passing x1^2 and x2^2 (sorry for naming them X and Y)
hypothesis(theta, X ** 2, Y ** 2)

# And visualize the data
# Make a continuous grid
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
print(xx.shape, yy.shape)

grid = np.c_[xx.ravel(), yy.ravel()]
print(grid.shape)

# Evaluate the probability at each point on the grid
probs = hypothesis(theta, grid[:,0] ** 2, grid[:,1] ** 2).reshape(xx.shape)
print(probs.shape)

# Get the Y vector (with the classes) again. This is named Z here.
Z = []
for z in range(len(X)):
if X[z] **2 + Y[z] **2 &lt; 6:
Z.append(0)
else:
Z.append(1)

# Plot the figure
f, ax = plt.subplots(figsize=(8, 6))

# Plot the probability grid as a contour map
ax.contour(xx, yy, probs, levels=[.5], cmap="Greys", vmin=0, vmax=.6)

# Show the original data on the map as well
ax.scatter(X, Y, c=Z, s=50,
cmap="RdBu", vmin=-.2, vmax=1.2,
edgecolor="white", linewidth=1)

ax.set(aspect="equal",
xlim=(-5, 5), ylim=(-5, 5),
xlabel="$X_1$", ylabel="$X_2$")



Absolutely fantastic! We’ve gotten our circular decision boundary as intended.

Note on math: This section has some probability and statistics concepts. You can skip that part if you’re not comfortable with these concepts.

Let’s now look at the cost function for logistic regression. Unfortunately, we can’t use the same cost function that we used for linear regression. Why not? Because as it turns out, since our hypothesis function $h(x)$ is non-linear, the least-squares cost function becomes non-convex. Alright, so what’s the big deal? The trouble with non-convex cost functions is that they have local optima. What does this mean? Remember how we’re trying to find the lowest point in the cost function? And that we’re taking small steps? With convex functions, this works because there’s only one minimum. Non-convex functions can have points that are lower than their surrounding points, but they’re still greater than the global minimum that we’re really after.

So we need a new cost function. How do we find one? Statistics provides a promising direction: the maximum likelihood estimates. This is based on the likelihood function. Let’s derive this (math ahead).

Note: If you’re not familiar with maximum likelihood estimation, I now made a post explaining it, you can read that post to get a good intuition about the method, and then come back here. Link to post.

Recall that

$P(y = 1 | x) = h(x)$

and thus,

$P(y = 0 | x) = 1 - h(x)$

Given this, we can combine these to get:

$P(y | x) = h(x)^y (1 - h(x))^{1-y}$

You can verify this by plugging in $y = 1$ and $y = 0$ in this equation to get the previous ones. The likelihood function is then given by,

$L(\Theta) = P(y | x; \Theta) = \prod_i P \left( y^{(i)} | x^{(i)} ; \Theta \right)$

The notation $P(y | x ; \Theta)$ is read as “probability of $y$ given $x$, parameterized by $\Theta$. This likelihood as it is, is hard to optimize. We instead look at the log-likelihood function, which is just the logarithm of the likelihood function.

$l(\Theta) = \sum_{i = 1}^m y^{(i)} \log \left( h(x^{(i)}) \right) + (1 - y^{(i)}) \log \left( 1 - h(x^{(i)}) \right)$

The principle of maximum likelihood then states that we should choose $\Theta$ so that the likelihood (or in our case, the log-likelihood) is the maximum. However, maximizing the likelihood is different from what we’re used to: minimizing a cost function. There’s an easy fix: if we’re interested in minimizing, we simply add a negative sign to the log-likelihood, and minimize that; this has the same effect as maximizing the log-likelihood. To maintain consistency with what we did in linear regression, we’ll also divide by $m$ to get our final cost function.

$J(\Theta) = \frac{-1}{m} \sum_{i = 1}^m y^{(i)} \log \left( h(x^{(i)}) \right) + (1 - y^{(i)}) \log \left( 1 - h(x^{(i)}) \right)$

This function is now convex, and can be optimized using gradient descent.

$\text{repeat\{} \\ \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\Theta) \\ \text{\}}$

Let’s now find that derivative as we did with linear regression. To do this, let’s first find the derivative of the sigmoid function.

\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}z}\left( \frac{1}{1 + e^{-z}} \right) &= \frac{-\left( -e^{-z} \right)}{\left( 1 + e^{-z} \right)^2} \\ &= \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} \\ &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ &= g(z) \left( 1 - \frac{1}{1 + e^{-z}} \right) \\ &= g(z) \left( 1 - g(z) \right) \end{aligned}

We’ll need this to find the derivative, and as we’ll see, this plays out very nicely:

\begin{aligned} \frac{\partial}{\partial \theta_j} J(\Theta) &= \frac{\partial}{\partial \theta_j} \left( \frac{-1}{m} \sum_{i = 1}^m y^{(i)} \log \left( h(x^{(i)}) \right) + (1 - y^{(i)}) \log \left( 1 - h(x^{(i)}) \right) \right) \\ &= \frac{-1}{m} \sum_{i = 1}^m \left( \frac{y^{(i)}}{h(x^{(i)})}h^{\prime}(x^{(i)}) + \frac{1 - y^{(i)}}{1 - h(x^{(i)})}(-h^{\prime}(x^{(i)})) \right) \\ &= \frac{-1}{m} \sum_{i = 1}^m \left( \frac{y^{(i)}}{h(x^{(i)})} h(x^{(i)})(1 - h(x^{(i)}))(x^{(i)}_j) - \frac{1 - y^{(i)}}{1 - h(x^{(i)})}h(x^{(i)})(1 - h(x^{(i)}))(x^{(i)}_j) \right) \\ &= \frac{-1}{m} \sum_{i = 1}^m \left( y^{(i)}(1 - h(x^{(i)}))(x^{(i)}_j) - (1 - y^{(i)})h(x^{(i)})(x^{(i)}_j) \right) \\ &= \frac{-1}{m} \sum_{i = 1}^m \left( y^{(i)}(1 - h(x^{(i)})) - (1 - y^{(i)})h(x^{(i)}) \right)x^{(i)}_j \\ &= \frac{1}{m} \sum_{i = 1}^m \left( y^{(i)} - y^{(i)}h(x^{(i)}) - h(x^{(i)}) + y^{(i)}h(x^{(i)}) \right)x^{(i)}_j \\ &= \frac{1}{m} \sum_{i = 1}^m \left( h(x^{(i)}) - y^{(i)} \right)x^{(i)}_j \end{aligned}

Let’s go over these steps.

• The first step uses the chain rule.
• The second step uses the derivative we derived above. The extra $x^{(i)}_j$ term comes from using the chain rule again, because

\begin{aligned} h(x^{(i)}) &= g(\Theta^T x^{(i)}) \\ &= g(\theta_0 x^{(i)}_0 + \theta_1 x^{(i)}_1 + \ldots + \theta_j x^{(i)}_j + \ldots + \theta_n x^{(i)}_n) \end{aligned}.

and so we get an extra $x^{(i)}_j$ term because we’re finding the derivative with respect to $\theta_j$.

• The rest of the steps are simply simplification of the expression.

Nothing fancy except for the first two steps, really. We can then perform gradient descent exactly as we did in linear regression, only this time, the hypothesis function is different.

Conclusion

With this, we wind up our discussion of logistic regression. To recap, discussed

• what logistic regression is
• where the name comes from
• how to classify using a logistic regression model
• how the decision boundary looks
• what the cost function is
• the gradient descent update rule, which we found to be remarkably similar to the one we got for linear regression.