Having seen a classification model, we also need to know how to evaluate its performance! After all, if you don’t know how well (if at all) your model is doing, you’re heading for disaster. This post has practically no math at all.

Let’s begin by stating some definitions. The **ground truths** of a dataset are the class labels in the dataset. These are the true values of the classes (this also applies to regression: the ground truth is the actual value for each training example).

Suppose we use a **train-test split**, which means that we don’t train on the full dataset. Instead, we split it into a training set and a test set. A common choice is 70% for training and 30% for testing. We train the model on the training set and have it predict on the remaining 30%, so that we can see how well it performs on data it’s never seen before.
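Here is a minimal sketch of such a split using only the standard library. The helper name `train_test_split`, the `seed` parameter, and the toy data are assumptions made for illustration, and libraries such as scikit-learn provide their own version of this.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test)."""
    shuffled = list(examples)
    # Shuffle first so the test set is not just the tail of the dataset.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy dataset of (feature, label) pairs, purely illustrative.
data = [(x, x % 2) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The model would be fit on `train` only, and every metric discussed in this post would be computed on `test`.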

So given that our model predicts on the test set (and in this test set we do know the actual output values), the **true positives** are the test examples where the model predicts the positive class, and the ground truth for that example is also positive. Alternatively, a true positive is a test example where the model predicted that the example belongs to a certain class, and this agrees with the ground truth. This alternate definition makes it easier to think about multi-class classification, where there are more than two classes. **False positives** are where the model predicts positive, but the ground truth is negative. In other words, the model says that the example belongs to a certain class, but in reality it’s in some other class. False positives in statistics are called **Type I errors** (that’s a Roman numeral one). **True negatives** and **false negatives** (called **Type II errors**) can be thought about in a similar way.
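To make these four definitions concrete, here is a small sketch that counts them for a binary task. The function name `count_outcomes` and the toy labels are assumptions, with 1 standing for the positive class and 0 for the negative class.

```python
def count_outcomes(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) counts for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1   # predicted positive, actually positive
        elif pred == positive:
            fp += 1   # predicted positive, actually negative (Type I error)
        elif truth == positive:
            fn += 1   # predicted negative, actually positive (Type II error)
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

# Toy ground truths and predictions on a test set (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, tn, fn = count_outcomes(y_true, y_pred)
print(tp, fp, tn, fn)  # 3 1 3 1
```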

With all this in mind, we can now talk about the **confusion matrix**.

## The confusion matrix

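The confusion matrix is simply a matrix, or table, of the counts of everything we defined above. For a binary classification task, here’s a very crude representation:

| | Predicted positive | Predicted negative |
|---|---|---|
| **Actually positive** | True positives (TP) | False negatives (FN) |
| **Actually negative** | False positives (FP) | True negatives (TN) |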

Of course, the confusion matrix will be different with multi-class classification, and we’ll see it when we use models in practice.
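As a rough sketch of the multi-class case, the matrix simply gets one row and one column per class. The function below, the class names, and the choice of rows for actual classes and columns for predicted classes are all assumptions made for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix where entry [i][j] counts examples whose actual
    class is labels[i] and whose predicted class is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

# Toy three-class example (illustrative only).
labels = ["cat", "dog", "bird"]
y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]
matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal entries are the correct predictions for each class, and everything off the diagonal is some kind of mistake.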

## Some metrics

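Given the confusion matrix, we can now begin to define the metrics to judge our classifiers:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$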

Hopefully it’s easy to understand that TP, FP, etc. are just the short versions of true positives, false positives, etc. Accuracy is pretty intuitive, but the others aren’t, so here’s one way to think about them:

- Precision answers the question, “What fraction of positives predicted by the model were right?”
- Recall answers the question, “What fraction of the actual positive examples did the model get right?”
- The F1 score is the **harmonic mean** of the recall and precision. It’s a neat way to sum things up.
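Here is a short sketch that computes these metrics from the four confusion-matrix counts. The function name and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than a universal rule.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts for an imbalanced test set of 100 examples.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.93, while recall is only about 0.615, which is one reason accuracy alone can be misleading when positives are rare.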

## The ROC curve

Another commonly quoted metric is the **receiver operating characteristic (ROC) curve**. This is a curve drawn with the **true positive rate** (recall) on the y-axis, and the **false positive rate** ($FP / (FP + TN)$) on the x-axis. This is typically done only with binary classification tasks.

To draw this, recall that the logistic regression model, one of our binary classification models, gave us a probability as an output rather than a class directly. We used a threshold value (say 0.5) to decide the class. By varying this threshold systematically between 0 and 1, we obtain many pairs of false positive rate and true positive rate, and we can join all these points to form a curve. This curve is the ROC curve.
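The threshold sweep can be sketched in a few lines. The function name `roc_points` and the toy scores are assumptions, and the sketch assumes the labels are 0/1 with at least one example of each class. It uses each distinct predicted probability as a threshold, plus one threshold above all scores so the curve starts at the origin.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs obtained by
    sweeping the decision threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # An infinite threshold predicts nothing as positive: the point (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        # Predict positive whenever the score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
```

Plotting these pairs with the false positive rate on the x-axis and joining them gives the ROC curve for this toy model.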

The diagonal line is what you would get if the true positive rate was always equal to the false positive rate. You want your ROC curve to be as far from it as possible, towards the top left corner. This is because on the diagonal line, you’re basically just guessing the class, so the model adds nothing. Below the diagonal line, your model is performing worse than one that’s just guessing randomly.

Clearly, the ideal ROC is one that just goes up the y-axis till the top-left corner, and then goes along the top of the graph. This, however, is almost never achievable, since it implies that the model never makes a mistake.
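To see the two extremes side by side, the sketch below compares a toy model whose scores separate the classes perfectly with one that gives every example the same score. The helper `roc_curve_points` and both sets of scores are hypothetical, and the labels are assumed to be 0/1 with both classes present.

```python
def roc_curve_points(y_true, scores):
    """(FPR, TPR) pairs from sweeping the threshold over the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in [float("inf")] + sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 1]

# Every positive scores higher than every negative: the ideal case.
perfect = roc_curve_points(y_true, [0.1, 0.2, 0.8, 0.9])

# Every example gets the same score: the model carries no information.
uninformative = roc_curve_points(y_true, [0.5, 0.5, 0.5, 0.5])

print(perfect)
print(uninformative)
```

Every point of the perfect model lies on the left edge or the top edge of the plot, while the uninformative model only produces points on the diagonal.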