The Recall Score (also known as Sensitivity or True Positive Rate) measures the ability of a classification model to correctly identify all relevant (positive) instances. In other words, it answers the question: Out of all the actual positive cases, how many did the model correctly predict?
The formula for recall is:
$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$
Key Terms:
- True Positives (TP): The number of correct positive predictions.
- False Negatives (FN): The number of actual positives that the model failed to predict.
Example:
If there are 100 actual positive cases, and the model correctly predicts 80 of them but misses 20, the recall score would be:
$\text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = 0.8 \text{ or } 80\%$
Usefulness:
- Recall is important when missing positive cases has a high cost, such as in medical diagnosis, where you want to minimize the chance of missing a disease (false negatives).
- It’s particularly valuable when the focus is on identifying as many positive instances as possible, even if that means allowing some false positives.
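To make the arithmetic concrete, recall can be computed directly from the two counts. A minimal sketch in Python (the counts 80 and 20 come from the example above):

# Recall from raw counts: TP = 80 positives caught, FN = 20 positives missed
tp, fn = 80, 20
recall = tp / (tp + fn)
print(recall)  # 0.8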
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset
data = pd.read_csv('Social_Network_Ads.csv')
data
| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19.0 | 19000.0 | 0 |
| 1 | 15810944 | Male | 35.0 | 20000.0 | 0 |
| 2 | 15668575 | Female | 26.0 | 43000.0 | 0 |
| 3 | 15603246 | Female | 27.0 | 57000.0 | 0 |
| 4 | 15804002 | Male | 19.0 | 76000.0 | 0 |
| … | … | … | … | … | … |
| 395 | 15691863 | Female | 46.0 | 41000.0 | 1 |
| 396 | 15706071 | Male | 51.0 | 23000.0 | 1 |
| 397 | 15654296 | Female | 50.0 | 20000.0 | 1 |
| 398 | 15755018 | Male | 36.0 | 33000.0 | 0 |
| 399 | 15594041 | Female | 49.0 | 36000.0 | 1 |

400 rows × 5 columns
X = data.iloc[:, [2, 3]].values  # features: Age and EstimatedSalary
y = data.iloc[:, 4].values       # target: Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
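With 400 rows and test_size = 0.25, the split leaves 300 training and 100 test samples; an optional shape check confirms this:

# Expect (300, 2) for X_train and (100, 2) for X_test
print(X_train.shape, X_test.shape)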
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
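As an optional sanity check, standardisation should leave each training column with roughly zero mean and unit variance (a small sketch, assuming the cells above have been run):

# Each scaled training column should have ~0 mean and ~1 standard deviation
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))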
from sklearn.neighbors import KNeighborsClassifier

# Minkowski distance with p = 2 is the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
KNeighborsClassifier()
y_pred = classifier.predict(X_test)
y_pred
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int64)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
array([[64,  4],
       [ 3, 29]], dtype=int64)
# Visualising the Test set results
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Predict over the grid to shade the decision regions
Z = classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape)
plt.contourf(X1, X2, Z)
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1], label = 0)
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1], label = 1)
plt.title('K-Nearest Neighbors (K-NN) (Test set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.legend()
plt.show()
from sklearn.metrics import recall_score
The confusion matrix returned by sklearn has actual classes as rows and predicted classes as columns:

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN | FP |
| Actual 1 | FN | TP |
cm
array([[64,  4],
       [ 3, 29]], dtype=int64)
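This layout can be confirmed by flattening the matrix row by row, since ravel returns the entries in the order TN, FP, FN, TP:

# Unpack the confusion matrix: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 64 4 3 29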
# Calculate recall
recall = recall_score(y_test, y_pred)
recall
0.90625
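The same value follows from the recall formula applied to the confusion matrix, using the counts unpacked above:

# Recall = TP / (TP + FN) = 29 / (29 + 3)
manual_recall = tp / (tp + fn)
print(manual_recall)  # 0.90625, matching recall_score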