8.1 – Implementing Gaussian Discriminant Analysis: Iris data classification

Now that we have a solid understanding of the math behind our first generative learning algorithm, implementing it is now a piece of cake. As an added bonus, we can now understand why we’re doing what we’re doing, which is the whole point of learning the math.

As usual, we’ll start by importing some packages.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np



The first package is part of sklearn, and it loads a predefined, toy dataset that we can use to test the waters for our algorithms. The iris dataset is an extremely popular beginner dataset for classifying flowers based on features such as sepal width.

We next import a package to perform what’s called Principal Components Analysis, or PCA. I will dedicate a separate blog post for this because it deserves one, but for now, I’ll give a brief explanation. A lot of the time, when we have high-dimensional data (data with many features), a lot of the features aren’t really needed to do well in classification. Why would we want to throw away features? The resulting model that uses fewer features is much easier to interpret, and by throwing away features, we’re usually also throwing away noise in the data. PCA identifies what it calls principal components that capture most of the information in the data. For the purposes of this demonstration, we will only use 2 features. This is a common choice. Another advantage is that it makes it easier for me to show you the data.

The other package imports are ones we’ve already encountered. The last line just makes graphs look prettier. Let’s now load the data using the function we imported:

data = load_iris()
X, Y = data['data'], data['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7)


This is hopefully easy to understand, and easier to understand if you do it yourself. I strongly recommend you run this too, and explore the data returned by the load_iris function.

I’ll now make a small simplification. The Iris data as it is contains data about 3 kinds of flowers, creating a multi-class classification problem. However, we have not (and will not) discuss the multi-class case, so we will simply drop the third class.

indices = np.where(Y_train == 2)
for index in sorted(indices, reverse=True):
    X_train = np.delete(X_train, index, 0)
    Y_train = np.delete(Y_train, index, 0)


The third argument to np.delete is the axis parameter. Setting it to 0 makes sure the dimensions of the data are preserved. Let’s now perform dimensionality reduction:

pca = PCA(n_components=2)
X_train = pca.transform(X_train)


Let’s now plot our data:

plt.scatter(X_train.T[0], X_train.T[1], c=Y_train, cmap='jet');



We’re now done with all the prerequisite work. We can now start implementing the algorithm. Remember what we’re doing. We find the likelihoods of the data belonging to each class, and predict the class with the higher probability. Our likelihoods are assumed to be multivariate Gaussian distributions, and we spent a very long blog post finding the maximum likelihood estimates. We can now find the MLEs in code using the formulas that we obtained. Let’s start with \phi.

phi = np.mean(Y_train == 1)


That’s it! Remember that the MLE for \phi is simply the average number of items of class 1 in the dataset. In my system, it prints


Next, we’ll find the MLEs for \mu_0 and \mu_1.

indices = (Y_train == 0)
denominator = np.sum(indices)
numerator = np.sum(X_train[indices], axis=0)
mu_0 = numerator / denominator


Let’s see what we’re doing here. The denominator is simply the number of times the output variable is 0 for \mu_0. We first obtain a list of indices where the output Y is 0. Summing these values (which are boolean) gives us the required count. For the numerator, we need to sum up the \textbf{x}^{(i)} values where the output variable is 0. We use the indices from earlier to sum up these values, and use axis=0 to tell numpy to sum across columns so that we get the expected result. In my system, it prints

[-1.41929759  0.04673795]

The exact code above with very minor changes can be used to compute the MLE for \mu_1. On my system, this value is

[ 1.89239678 -0.06231726]

Remember that \mu is the mean vector, so you should expect a vector of dimensionality (length) the same as the data. Here, we reduced our data’s dimensionality to 2, so our mean vector should have the same length.

We’re left with our last MLE. This one will take a little more code to find, but it’s still easy.

mu = [mu_0, mu_1]

# Initialize the sum
x_minus_mu = X_train[0] - mu[Y_train[0]]
# We don't want rank-one arrays
x_minus_mu = x_minus_mu.reshape(*(x_minus_mu.shape), 1)
s = np.matmul(x_minus_mu, x_minus_mu.T)

m = len(Y_train)

for i in range(1, m):
    x_minus_mu = X_train[i] - mu[Y_train[i]]
    x_minus_mu = x_minus_mu.reshape(*(x_minus_mu.shape), 1)
    s += np.matmul(x_minus_mu, x_minus_mu.T)

s /= m


Recall that we need \mu_0 when y=0 and \mu_1 when y=1. For this reason, I use a list of \mu values, so I can index them by the y value. I then simply find the sum of the expression over the entire array. However, there’s a problem: we need a way to initialize the sum. Thus, I do the first one outside the loop, and continue the rest inside the for loop. If you see the formula for the MLE, this code should be easy to understand.

There’s one part of the code above that I haven’t discussed–the constant (and diverting) calls to the reshape function. Here’s the problem. When you subtract two numpy arrays the way we did, you get what’s called a rank-one array. This is a rather funny (but more annoying) array, whose shape is, for example, (25, )–that’s right, there’s no second dimension shown, but it is there! This causes problems for us, and it’s safer to reshape this to (25, 1) so that the code does exactly what we want it to. If your code acts strange and you’re sure about your code, check your numpy array shapes and make sure you don’t have these rank-one arrays.

In my system, the code above prints:

 [[0.12281694 0.08844674]
 [0.08844674 0.20662475]]

Now that we’ve obtained all our MLEs, we simply have to predict on the testing part of our data. Here’s the code:

pi = 3.1415926535
n = len(mu_0) # Or mu_1, or any of the X
denominator = (2 * pi) ** (n / 2) * np.sqrt(np.linalg.det(s))

predictions = []

for x in X_test:
x_minus_mu0 = x - mu_0
    x_minus_mu0 = x_minus_mu0.reshape(*(x_minus_mu0.shape), 1)
    p_x0 = 1 / denominator * np.exp(-0.5 * np.matmul(x_minus_mu0.T, np.matmul(np.linalg.inv(s), x_minus_mu0)))
    p_x0 = np.squeeze(p_x0)

    x_minus_mu1 = x - mu_1
    x_minus_mu1 = x_minus_mu1.reshape(*(x_minus_mu1.shape), 1)
    p_x1 = 1 / denominator * np.exp(-0.5 * np.matmul(x_minus_mu1.T, np.matmul(np.linalg.inv(s), x_minus_mu1)))
    p_x1 = np.squeeze(p_x1)

    if p_x1 >= p_x0:


We first set up the denominator, since that is common to both P(x|y=0) and P(x|y=1). We then simply use the formula for the two likelihoods. If you don’t recall these formulas, look at the beginning of the previous post–these are simply the formulas for the multivariate Gaussian distribution. The np.squeeze function extracts single values from arrays–for example, if we have an array with a single value [25], it returns 25. After that, if P(x|y=1) > P(x|y=0), we predict class 1, otherwise, we predict class 0. In my system, these predictions are

[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0]


We can now proceed to check our metrics:

print(accuracy_score(Y_test, predictions))

On my system, this prints 1.0. Seems outstanding! We classified everything in our test set perfectly! While this is an accomplishment and does mean that our implementation is correct, we should look at the data and note that there was no overlap. Any decent ML algorithm should classify this example perfectly. Still, it’s nice to look back on one example with a perfect score.

With this, we wrap up our discussion of the Gaussian Discriminant Analysis algorithm. This however, is not used directly. GDA is a general algorithm that can be used with multiple kinds of data, so often, specialized versions of it with stronger assumptions on the data are used. We will next look at one special case of GDA, an extremely popular algorithm called the Naive Bayes algorithm.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s