K-Nearest Neighbors (K-NN) is a simple machine learning algorithm used for both classification and regression tasks. It’s based on the idea that things that are similar to each other are likely to be in the same category.
Key Concepts of K-NN:
What is K-NN?
- K-NN is an algorithm that looks at the closest data points (neighbors) to make predictions.
- It does this by finding the K nearest neighbors to a new data point and looking at their labels (in classification) or values (in regression).
How does K-NN work?
- Step 1: Choose a number, K. This is how many neighbors you want to consider (e.g., K=3 means looking at the 3 closest neighbors).
- Step 2: When given a new data point to classify or predict, K-NN looks for the K nearest neighbors in the training data.
- Step 3: In classification, it assigns the new data point to the most common category among its neighbors. In regression, it averages the values of the neighbors.
- Step 4: The prediction is based on what the nearest neighbors “vote” for. (A minimal from-scratch sketch of these steps follows this list.)
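To make the steps concrete, here is a minimal from-scratch sketch of K-NN classification in plain NumPy. The function name knn_predict and the toy arrays are invented for illustration; they are not part of the worked example later in this post:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 3-4: majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [8, 9], [9, 8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.0]), k=3))  # -> 0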
How to Measure Neighbors?
- Neighbors are usually measured by distance. The most common is Euclidean distance, which measures the straight-line distance between two points; Manhattan distance (the sum of absolute coordinate differences) is another common choice, as compared in the snippet below.
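For instance, comparing the two metrics on a pair of made-up points:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(np.sqrt(((a - b) ** 2).sum()))  # Euclidean distance: 5.0
print(np.abs(a - b).sum())            # Manhattan distance: 7.0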
Example (Classification):
- Imagine you’re trying to predict whether someone prefers apples or bananas. You have past data on other people’s preferences, along with their age and income.
- When a new person comes along, K-NN finds the K people closest to this person (based on age and income) and sees what fruit they prefer.
- If most of the neighbors prefer apples, K-NN predicts that this new person will also prefer apples.
Example (Regression):
- In regression, you might be trying to predict someone’s height based on their age and weight.
- K-NN would find the nearest neighbors in terms of age and weight and take the average of their heights to make a prediction, as in the sketch below.
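Here is a short sketch of that idea using scikit-learn’s KNeighborsRegressor; the age, weight, and height values are invented purely to show the averaging:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: [age, weight] -> height (all values are made up)
X = np.array([[25, 70], [30, 80], [22, 60], [40, 90], [35, 85]])
y = np.array([170, 178, 165, 182, 180])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)
print(reg.predict([[28, 75]]))  # Average height of the 3 nearest neighbors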
Important Points:
- K value: Choosing the right K is important. If K is too small, the model might be sensitive to noise; if K is too large, it might miss important local patterns. (A small K-selection sketch follows this list.)
- Distance metric: The algorithm works by calculating distances. Common ones are Euclidean distance, Manhattan distance, etc.
- No training phase: K-NN is a “lazy learner.” It doesn’t build a model up front; it simply stores the training data and compares each new point against it at prediction time.
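To illustrate the K-value point above, a common approach is to try several values of K and keep the one with the best cross-validated accuracy. A minimal sketch on synthetic data (the sample size and the candidate K values are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data purely for illustration
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean CV accuracy = {scores.mean():.3f}")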
Pros and Cons:
- Pros:
  - Simple and easy to understand.
  - No need for a complicated training phase.
- Cons:
  - Slow when there’s a lot of data, because it has to compare every new point to all the old ones.
  - Can struggle if the data is very high-dimensional (many features).
Summary:
K-NN is like making decisions based on what’s happening with your closest neighbors. For classification, it looks at what the majority of neighbors are doing; for regression, it averages their values. The number of neighbors (K) and the distance measure are important factors in how well it works.
Let’s review an example step by step.
1. Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Loading the Dataset
data = pd.read_csv('Social_Network_Ads.csv')
data
|     | User ID  | Gender | Age | EstimatedSalary | Purchased |
| --- | -------- | ------ | --- | --------------- | --------- |
| 0   | 15624510 | Male   | 19  | 19000           | 0         |
| 1   | 15810944 | Male   | 35  | 20000           | 0         |
| 2   | 15668575 | Female | 26  | 43000           | 0         |
| 3   | 15603246 | Female | 27  | 57000           | 0         |
| 4   | 15804002 | Male   | 19  | 76000           | 0         |
| …   | …        | …      | …   | …               | …         |
| 395 | 15691863 | Female | 46  | 41000           | 1         |
| 396 | 15706071 | Male   | 51  | 23000           | 1         |
| 397 | 15654296 | Female | 50  | 20000           | 1         |
| 398 | 15755018 | Male   | 36  | 33000           | 0         |
| 399 | 15594041 | Female | 49  | 36000           | 1         |
400 rows × 5 columns
3. Preparing Features and Target Variables
X = data.iloc[:, 2:4].values # Features (Age and Salary)
y = data.iloc[:, 4].values # Target variable (Purchased or Not)
4. Visualizing the Data
plt.title("Social Network Ads by Age and Salary")
plt.xlabel("Age")
plt.ylabel("Salary")
plt.scatter(X[y==0, 0], X[y==0, 1], label='No') # Users who did not purchase
plt.scatter(X[y==1, 0], X[y==1, 1], label='Yes') # Users who purchased
plt.legend()
plt.show()
5. Splitting the Dataset into Training and Test Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
6. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train) # Fit and transform the training data
X_test = sc_x.transform(X_test) # Transform the test data
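Scaling matters for K-NN because distances would otherwise be dominated by the feature with the larger range; salary (in the tens of thousands) would swamp age. A quick sanity check that standardization worked:

print(X_train.mean(axis=0))  # Approximately [0, 0] after standardization
print(X_train.std(axis=0))   # Approximately [1, 1] after standardization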
7. Creating and Training the K-Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")  # Minkowski with p=2 is Euclidean distance
classifier.fit(X_train, y_train) # Training the classifier
KNeighborsClassifier()
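After fitting, scikit-learn exposes the resolved distance metric, which confirms the Minkowski/Euclidean equivalence:

print(classifier.effective_metric_)  # 'euclidean'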
8. Making Predictions
y_pred = classifier.predict(X_test) # Predicting the test set results
9. Evaluating the Model
classifier.score(X_test, y_test) * 100 # Model accuracy in percentage
93.0
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred) * 100 # Accuracy score calculation
93.0
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Confusion matrix
cm
array([[64,  4],
       [ 3, 29]], dtype=int64)
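Reading the matrix: 64 true negatives, 29 true positives, 4 false positives, and 3 false negatives. For per-class precision and recall, classification_report gives a fuller picture:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=["No", "Yes"]))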
10. Visualizing the Decision Boundary
X_set, y_set = X_test, y_test # Using the test set for visualization
plt.title("Social Network Ads by Age and Salary")
plt.xlabel("Age (standardized)")
plt.ylabel("Salary (standardized)")
X1 = np.arange(X_set[:, 0].min()-1, X_set[:, 0].max()+1, 0.01)  # Grid range for scaled Age
X2 = np.arange(X_set[:, 1].min()-1, X_set[:, 1].max()+1, 0.01)  # Grid range for scaled Salary
xx, yy = np.meshgrid(X1, X2) # Create a grid of values
X3 = np.array([xx.ravel(), yy.ravel()]).T # Combine grid values
zz = classifier.predict(X3).reshape(xx.shape) # Predictions for the grid
plt.contourf(xx, yy, zz) # Plotting the decision boundary
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1], label='No') # Users who did not purchase
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1], label='Yes') # Users who purchased
plt.legend()
plt.show()
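Finally, to classify a brand-new user, the raw age and salary must pass through the same scaler before prediction. The input values below are made up for illustration:

new_user = [[30, 87000]]                    # Age 30, salary 87,000 (illustrative)
new_user_scaled = sc_x.transform(new_user)  # Reuse the scaler fitted on the training data
print(classifier.predict(new_user_scaled))  # 0 = will not purchase, 1 = will purchase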