Precision Score is a metric used in classification tasks to measure how many of the positive predictions made by a model are actually correct. In simpler terms, it answers the question: Out of all the instances the model predicted as positive, how many were truly positive?
The formula for precision is:
$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$
Key Terms:
- True Positives (TP): The number of correct positive predictions.
- False Positives (FP): The number of incorrect positive predictions (predicted positive, but actually negative).
Example:
If a model predicts 10 positive cases, and 8 of those are correct while 2 are incorrect, the precision score would be:
$\text{Precision} = \frac{8}{8 + 2} = \frac{8}{10} = 0.8 \text{ or } 80\%$
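The same arithmetic takes only a couple of lines of plain Python; the snippet below is a minimal sanity check using the counts from the example above.

# Worked example: 8 correct positive predictions, 2 incorrect ones
tp = 8
fp = 2
precision = tp / (tp + fp)
print(precision)  # 0.8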
Usefulness:
Precision is particularly useful when the cost of false positives is high. For example, in spam detection, you want high precision to ensure that legitimate emails are not incorrectly marked as spam. It’s a good metric when you care more about the quality of positive predictions than the quantity.
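scikit-learn exposes this metric directly as `precision_score`. The snippet below is a small illustration with made-up spam labels (1 = spam, 0 = not spam), chosen only to show the function; the rest of this section applies it to a real dataset.

from sklearn.metrics import precision_score
# Hypothetical ground truth and predictions, purely for illustration (1 = spam, 0 = not spam)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
# 4 emails are predicted as spam and 3 of them really are spam, so precision = 0.75
print(precision_score(y_true, y_pred))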
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Load the dataset
data = pd.read_csv('Social_Network_Ads.csv')
data
|     | User ID  | Gender | Age  | EstimatedSalary | Purchased |
|-----|----------|--------|------|-----------------|-----------|
| 0   | 15624510 | Male   | 19.0 | 19000.0         | 0         |
| 1   | 15810944 | Male   | 35.0 | 20000.0         | 0         |
| 2   | 15668575 | Female | 26.0 | 43000.0         | 0         |
| 3   | 15603246 | Female | 27.0 | 57000.0         | 0         |
| 4   | 15804002 | Male   | 19.0 | 76000.0         | 0         |
| …   | …        | …      | …    | …               | …         |
| 395 | 15691863 | Female | 46.0 | 41000.0         | 1         |
| 396 | 15706071 | Male   | 51.0 | 23000.0         | 1         |
| 397 | 15654296 | Female | 50.0 | 20000.0         | 1         |
| 398 | 15755018 | Male   | 36.0 | 33000.0         | 0         |
| 399 | 15594041 | Female | 49.0 | 36000.0         | 1         |

400 rows × 5 columns
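Since precision focuses on the positive class (`Purchased = 1`), it can be worth checking how many purchasers the data actually contains before modelling; a quick look, assuming the `data` frame loaded above:

# How many non-purchasers (0) vs purchasers (1) are in the data?
data['Purchased'].value_counts()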
# Features: Age and EstimatedSalary (columns 2 and 3); target: Purchased (column 4)
X = data.iloc[:, [2, 3]].values
y = data.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
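With 400 rows and `test_size = 0.25`, this leaves 300 observations for training and 100 for testing, which can be confirmed with a quick shape check:

# Expect (300, 2) / (100, 2) for the features and (300,) / (100,) for the labels
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)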
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training set only
X_test = sc.transform(X_test)  # apply the same training-set statistics to the test set
# K-Nearest Neighbors classifier: 5 neighbours, Euclidean distance (Minkowski with p = 2)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
KNeighborsClassifier()
y_pred = classifier.predict(X_test)
y_pred
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int64)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
array([[64,  4],
       [ 3, 29]], dtype=int64)
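If your scikit-learn version is 0.22 or newer, the matrix can also be rendered as an annotated heatmap with `ConfusionMatrixDisplay`; a minimal sketch using the `cm` computed above:

from sklearn.metrics import ConfusionMatrixDisplay
# Plot the 2x2 confusion matrix with the counts annotated in each cell
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()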
# Visualising the Test set results
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
Z = classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape)
plt.contourf(X1, X2, Z)
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1], label = 0)
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1], label = 1)
plt.title('K-Nearest Neighbors (K-NN) (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
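Note that the plot above is drawn in the standardised feature space, so the axis values are scaled units rather than raw ages and salaries. If you want the scatter in the original units, one option is to inverse-transform the scaled features first; the sketch below reuses the `sc`, `X_set` and `y_set` objects from the cells above.

# Recover the original (unscaled) ages and salaries for plotting
X_orig = sc.inverse_transform(X_set)
plt.scatter(X_orig[y_set == 0, 0], X_orig[y_set == 0, 1], label = 0)
plt.scatter(X_orig[y_set == 1, 0], X_orig[y_set == 1, 1], label = 1)
plt.title('Test set in original units')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()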
from sklearn.metrics import precision_score
The confusion matrix returned by scikit-learn has the true classes on the rows and the predicted classes on the columns:

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | TN          | FP          |
| Actual 1 | FN          | TP          |
cm
array([[64,  4],
       [ 3, 29]], dtype=int64)
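To tie this back to the precision formula, the four counts can be unpacked and precision computed by hand; the result should agree with `precision_score` below.

# Unpack the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
manual_precision = tp / (tp + fp)  # 29 / (29 + 4)
manual_precision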
# Calculate precision
precision = precision_score(y_test, y_pred)
precision
0.8787878787878788
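This agrees with the manual calculation from the confusion matrix, TP / (TP + FP) = 29 / (29 + 4) ≈ 0.879: roughly 88% of the users the model flagged as purchasers actually purchased.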