Looking at Precision, Recall, and the F1 Score.

Precision and recall are two important notions in classification, especially if you've got an imbalanced dataset (many 0's, not many 1's; think of something like people defaulting on a loan or some other rare event). We'll explore precision and recall and show off a tool called a confusion matrix. At the end, we'll look at a measure which is essentially the "average" of precision and recall and which is useful when you have imbalanced data and want to predict the underrepresented class well.

In [25]:
import numpy as np
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, recall_score, precision_score
In [28]:
# Make up pretend target and pretend predicted values.  
y = np.r_[np.ones(100), np.zeros(900)]
y_predicted = np.r_[np.ones(20), np.zeros(600), np.ones(380)]

The confusion matrix will display four values:

  • True Positives (TP): values which we guessed were 1, and they were actually 1.
  • False Positives (FP): values which we guessed were 1, and they were actually 0.
  • True Negatives (TN): values which we guessed were 0, and they were actually 0.
  • False Negatives (FN): values which we guessed were 0, and they were actually 1.

Let's look at this with respect to the data above.

In [23]:
confusion_matrix(y, y_predicted)
Out[23]:
array([[520, 380],
       [ 80,  20]])

The top row deals with the actual negatives: true negatives and false positives. From our data above we see that we correctly predicted 520 negative values. But we also predicted 380 values as positive (at the end of our y_predicted) that should have been negative; hence, these are false positives.

The lower row deals with the actual positives: false negatives and true positives. This says we predicted that 80 values were 0 when they were actually 1; hence, we have 80 false negatives. Similarly, we have 20 true positives.
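
If you ever mix up which entry is which, one quick way to double-check is to unpack the matrix and compare against counts from boolean masks. Here's a small sketch reusing the y and y_predicted arrays above:

In [ ]:
# sklearn lays the confusion matrix out as
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y, y_predicted).ravel()
print(tn, fp, fn, tp)  # expect 520 380 80 20

# The same counts from boolean masks on the raw arrays.
print(np.sum((y == 0) & (y_predicted == 0)),   # true negatives
      np.sum((y == 0) & (y_predicted == 1)),   # false positives
      np.sum((y == 1) & (y_predicted == 0)),   # false negatives
      np.sum((y == 1) & (y_predicted == 1)))   # true positives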

Okay. What's Precision? What's Recall?

When we care more about the positive values, we want to see specifically how well we did at predicting them.

For precision, we want to look at True Positives divided by all of our positive predictions (that is, TP / (TP + FP)). On the confusion matrix, this means looking at the right-hand column. For our example, we get 20/400 = 0.05.

For recall, we want to look at True Positives divided by all things which should have been positive, that is, TP / (TP + FN). This is the bottom row of our confusion matrix. This gives us 20/100 = 0.2.


The gist here is that precision tells us how trustworthy our positive predictions are. If we predicted a lot of positives but got many of them wrong, this will be a low number. Conversely, if we made few false positive predictions, this number will be larger.

Recall, similarly, tells us how well we captured the elements that should have been classified as positive. If we missed many elements that should have been positive, this will be fairly low.
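
Before turning to scikit-learn, here's the same arithmetic done by hand from the confusion matrix entries (a small sketch using the counts unpacked above):

In [ ]:
# Re-unpack the counts for clarity.
tn, fp, fn, tp = confusion_matrix(y, y_predicted).ravel()

# Precision: of everything we predicted to be positive, how much really was?
print("Precision by hand:", tp / (tp + fp))  # 20 / 400, expect 0.05

# Recall: of everything that really was positive, how much did we catch?
print("Recall by hand:", tp / (tp + fn))     # 20 / 100, expect 0.2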


Let's calculate the precision and recall with scikit-learn to make sure we got it right above.

In [27]:
print("Precision: {}".format(precision_score(y, y_predicted)))
print("Recall: {}".format(recall_score(y, y_predicted)))
Precision: 0.05
Recall: 0.2

Exactly as we calculated. Awesome.


F1 Score?

Let's look at a much more imbalanced dataset for a moment and check out the confusion matrix.

In [30]:
# Much more imbalanced data.
y_skew = np.r_[np.ones(10), np.zeros(990)]
y_skew_predicted = np.r_[np.ones(2), np.zeros(980), np.ones(18)]

print(confusion_matrix(y_skew, y_skew_predicted))
[[972  18]
 [  8   2]]

Let's look at precision, recall, and standard accuracy. Recall that the standard accuracy score (which is the default scoring metric for classifiers in scikit-learn) is simply the number of correct predictions we made (when we predict 0 and it is 0, when we predict 1 and it is 1) divided by the total number of elements.
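
Since accuracy is just the fraction of entries where the prediction matches the truth, it's a one-liner to compute directly (a quick sketch using the arrays above):

In [ ]:
# Accuracy by hand: the fraction of predictions that equal the true labels.
print(np.mean(y_skew == y_skew_predicted))  # expect (972 + 2)/1000 = 0.974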

In this case, the precision should be 2/20 = 0.1, the recall should be 2/10 = 0.2, and the accuracy should be 974/1000 = 0.974.

In [31]:
print("Precision: {}".format(precision_score(y_skew, y_skew_predicted)))
print("Recall: {}".format(recall_score(y_skew, y_skew_predicted)))
print("Accuracy: {}".format(accuracy_score(y_skew, y_skew_predicted)))
Precision: 0.1
Recall: 0.2
Accuracy: 0.974

Great. Okay. Suppose now that we're judging how well we did at predicting things correctly, and suppose that it's really important to get the positive things right. These positives might represent something like an engine failure or a potentially good client to solicit, so we really care about them.

If the data person makes a model and gets our predictions, they might be like, "Awesome! 97.4% is a great accuracy! This model is great." But, of course, they'd be wrong. This model only correctly classified 2 of the 10 positive elements. That's not great. When we look at the precision and recall, we see this: they're both relatively low, meaning that our positive predictions were often wrong and that we missed a large number of positive values. C'est la vie.

It would be nice if there were a measure that combined precision and recall into one number so that we could optimize our model with respect to it. One might guess that it's just the arithmetic average of these two values, but that's not quite correct.

The F1 Score.

When working with rates, it is much easier to work with the harmonic mean for reasons we won't talk about here (if you're interested, go through some tutorials on using the harmonic mean or program a harmonic mean in Python or your language of choice and check out some of the averages you get). The harmonic mean of $a, b$ is defined as:

$$HM(a, b) = \frac{2}{\frac{1}{a} + \frac{1}{b}}$$
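
Since the F1 score is just this harmonic mean with precision $P$ and recall $R$ plugged in for $a$ and $b$, it can equivalently be written as:

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$$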

Sort of strange looking but works well. Let's plug in our numbers here and see what the average is.

In [36]:
def hm(a, b):
    return 2/((1/a) + (1/b))

print("We expect the f1 score to be: {:.3f}".format(hm(0.1, 0.2)))
print("Sklearn tells us the f1 score is: {:.3f}".format(f1_score(y_skew, y_skew_predicted)))
We expect the f1 score to be: 0.133
Sklearn tells us the f1 score is: 0.133

Awesome. This matches well. The F1 score goes from 0 to 1, where greater values are better. Let's look at one last dataset and check out how the F1 score looks with a more accurate prediction.

In [42]:
# Same imbalance as before, but a more accurate prediction.
y_last = np.r_[np.ones(10), np.zeros(990)]
y_last_predicted = np.r_[np.ones(8), np.zeros(980), np.ones(12)]

print(confusion_matrix(y_last, y_last_predicted))
print()
print("Precision: {}".format(precision_score(y_last, y_last_predicted)))
print("Recall: {}".format(recall_score(y_last, y_last_predicted)))
print("Accuracy: {}".format(accuracy_score(y_last, y_last_predicted)))
print("F1: {:.3f}".format(f1_score(y_last, y_last_predicted)))
[[978  12]
 [  2   8]]

Precision: 0.4
Recall: 0.8
Accuracy: 0.986
F1: 0.533

Much better! Notice that we've predicted almost all of the positive values correctly and didn't label too many negatives as positive. This gives us an F1 score of 0.533. Notice that the recall here was much better than the precision, and note how the harmonic mean combined the two into the F1 score. Moreover, note the ridiculously high accuracy, which tells us virtually nothing about the positive values we care about.
