K-Nearest Neighbors (K-NN) is a simple machine learning algorithm used for both classification and regression tasks. It’s based on the idea that things that are similar to each other are likely to be in the same category.
Key Concepts of K-NN:
What is K-NN?
- K-NN is an algorithm that looks at the closest data points (neighbors) to make predictions.
- It does this by finding the K nearest neighbors to a new data point and looking at their labels (in classification) or values (in regression).
How does K-NN work?
- Step 1: Choose a number, K. This is how many neighbors you want to consider (e.g., K=3 means looking at the 3 closest neighbors).
- Step 2: When given a new data point to classify or predict, K-NN looks for the K nearest neighbors in the training data.
- Step 3: In classification, it assigns the new data point to the most common category among its neighbors. In regression, it averages the values of the neighbors.
- Step 4: The prediction is based on what the nearest neighbors “vote” for. (A minimal from-scratch sketch of these steps follows this list.)
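To make the steps concrete, here is a minimal from-scratch sketch of K-NN classification in plain NumPy. The function name knn_predict and the toy arrays are invented for illustration; they are not part of the worked example later in this post:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 3-4: majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [8, 9], [9, 8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.0]), k=3))  # -> 0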
How to Measure Neighbors?
- Neighbors are usually measured by distance. The most common is Euclidean distance, which measures the straight-line distance between two points; Manhattan distance (the sum of absolute coordinate differences) is another common choice, as compared in the snippet below.
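For instance, comparing the two metrics on a pair of made-up points:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(np.sqrt(((a - b) ** 2).sum()))  # Euclidean distance: 5.0
print(np.abs(a - b).sum())            # Manhattan distance: 7.0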
Example (Classification):
- Imagine you’re trying to predict whether someone prefers apples or bananas. You have past data on other people’s preferences, along with their age and income.
- When a new person comes along, K-NN finds the K people closest to this person (based on age and income) and sees what fruit they prefer.
- If most of the neighbors prefer apples, K-NN predicts that this new person will also prefer apples.
Example (Regression):
- In regression, you might be trying to predict someone’s height based on their age and weight.
- K-NN would find the nearest neighbors in terms of age and weight and take the average of their heights to make a prediction, as in the sketch below.
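Here is a short sketch of that idea using scikit-learn’s KNeighborsRegressor; the age, weight, and height values are invented purely to show the averaging:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: [age, weight] -> height (all values are made up)
X = np.array([[25, 70], [30, 80], [22, 60], [40, 90], [35, 85]])
y = np.array([170, 178, 165, 182, 180])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)
print(reg.predict([[28, 75]]))  # Average height of the 3 nearest neighbors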
Important Points:
- K value: Choosing the right K is important. If K is too small, the model might be sensitive to noise; if K is too large, it might miss important local patterns. (A small K-selection sketch follows this list.)
- Distance metric: The algorithm works by calculating distances. Common ones are Euclidean distance, Manhattan distance, etc.
- No training phase: K-NN is a “lazy learner.” It doesn’t build a model up front; it simply stores the training data and compares each new point against it at prediction time.
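To illustrate the K-value point above, a common approach is to try several values of K and keep the one with the best cross-validated accuracy. A minimal sketch on synthetic data (the sample size and the candidate K values are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data purely for illustration
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"K={k}: mean CV accuracy = {scores.mean():.3f}")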
Pros and Cons:
- Pros:
  - Simple and easy to understand.
  - No need for a complicated training phase.
- Cons:
  - Slow when there’s a lot of data, because it has to compare every new point to all the old ones.
  - Can struggle if the data is very high-dimensional (many features).
Summary:
K-NN is like making decisions based on what’s happening with your closest neighbors. For classification, it looks at what the majority of neighbors are doing; for regression, it averages their values. The number of neighbors (K) and the distance measure are important factors in how well it works.
Let’s review an example step by step.
1. Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Loading the Dataset
data = pd.read_csv('Social_Network_Ads.csv')
data
|     | User ID  | Gender | Age | EstimatedSalary | Purchased |
| --- | -------- | ------ | --- | --------------- | --------- |
| 0   | 15624510 | Male   | 19  | 19000           | 0         |
| 1   | 15810944 | Male   | 35  | 20000           | 0         |
| 2   | 15668575 | Female | 26  | 43000           | 0         |
| 3   | 15603246 | Female | 27  | 57000           | 0         |
| 4   | 15804002 | Male   | 19  | 76000           | 0         |
| …   | …        | …      | …   | …               | …         |
| 395 | 15691863 | Female | 46  | 41000           | 1         |
| 396 | 15706071 | Male   | 51  | 23000           | 1         |
| 397 | 15654296 | Female | 50  | 20000           | 1         |
| 398 | 15755018 | Male   | 36  | 33000           | 0         |
| 399 | 15594041 | Female | 49  | 36000           | 1         |
400 rows × 5 columns
3. Preparing Features and Target Variables
X = data.iloc[:, 2:4].values # Features (Age and Salary)
y = data.iloc[:, 4].values # Target variable (Purchased or Not)
4. Visualizing the Data
plt.title("Social Network Ads by Age and Salary")
plt.xlabel("Age")
plt.ylabel("Salary")
plt.scatter(X[y==0, 0], X[y==0, 1], label='No') # Users who did not purchase
plt.scatter(X[y==1, 0], X[y==1, 1], label='Yes') # Users who purchased
plt.legend()
plt.show()
5. Splitting the Dataset into Training and Test Sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
6. Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train) # Fit and transform the training data
X_test = sc_x.transform(X_test) # Transform the test data
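Scaling matters for K-NN because distances would otherwise be dominated by the feature with the larger range; salary (in the tens of thousands) would swamp age. A quick sanity check that standardization worked:

print(X_train.mean(axis=0))  # Approximately [0, 0] after standardization
print(X_train.std(axis=0))   # Approximately [1, 1] after standardization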
7. Creating and Training the K-Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")  # Minkowski with p=2 is Euclidean distance
classifier.fit(X_train, y_train) # Training the classifier
KNeighborsClassifier()
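After fitting, scikit-learn exposes the resolved distance metric, which confirms the Minkowski/Euclidean equivalence:

print(classifier.effective_metric_)  # 'euclidean'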
8. Making Predictions
y_pred = classifier.predict(X_test) # Predicting the test set results
9. Evaluating the Model
classifier.score(X_test, y_test) * 100 # Model accuracy in percentage
93.0
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred) * 100 # Accuracy score calculation
93.0
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Confusion matrix
cm
array([[64,  4],
       [ 3, 29]], dtype=int64)
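Reading the matrix: 64 true negatives, 29 true positives, 4 false positives, and 3 false negatives. For per-class precision and recall, classification_report gives a fuller picture:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=["No", "Yes"]))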
10. Visualizing the Decision Boundary
X_set, y_set = X_test, y_test # Using the test set for visualization
plt.title("Social Network Ads by Age and Salary")
plt.xlabel("Age (standardized)")
plt.ylabel("Salary (standardized)")
X1 = np.arange(X_set[:, 0].min()-1, X_set[:, 0].max()+1, 0.01)  # Grid range for scaled Age
X2 = np.arange(X_set[:, 1].min()-1, X_set[:, 1].max()+1, 0.01)  # Grid range for scaled Salary
xx, yy = np.meshgrid(X1, X2) # Create a grid of values
X3 = np.array([xx.ravel(), yy.ravel()]).T # Combine grid values
zz = classifier.predict(X3).reshape(xx.shape) # Predictions for the grid
plt.contourf(xx, yy, zz) # Plotting the decision boundary
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1], label='No') # Users who did not purchase
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1], label='Yes') # Users who purchased
plt.legend()
plt.show()
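Finally, to classify a brand-new user, the raw age and salary must pass through the same scaler before prediction. The input values below are made up for illustration:

new_user = [[30, 87000]]                    # Age 30, salary 87,000 (illustrative)
new_user_scaled = sc_x.transform(new_user)  # Reuse the scaler fitted on the training data
print(classifier.predict(new_user_scaled))  # 0 = will not purchase, 1 = will purchase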