Naive Bayes is a simple but powerful machine learning algorithm used for classification tasks. It’s based on applying Bayes’ Theorem with the simplifying assumption that the features are independent of one another given the class, which is why it’s called “naive.”
Key Concepts of Naive Bayes:¶
What is Naive Bayes?
- Naive Bayes is a classification algorithm that uses probability to predict the class of a given data point.
- It works well for tasks like spam detection, sentiment analysis, and document classification.
Bayes’ Theorem:
- Naive Bayes is based on Bayes’ Theorem, which calculates the probability of an event happening based on prior knowledge of related events. The formula is:
$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$
Where:
- $P(A|B)$ is the probability of class $A$ given data $B$ (this is what we want to predict).
- $P(B|A)$ is the probability of observing the data $B$ given that class $A$ is true.
- $P(A)$ is the prior probability of class $A$.
- $P(B)$ is the overall probability of the data.
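To see the formula in action, here is a small worked example in plain Python; all of the probabilities are invented for illustration:

# Hypothetical numbers: classify an email as spam based on whether
# it contains the word "free".
p_spam = 0.20             # P(A): prior probability that any email is spam
p_free_given_spam = 0.60  # P(B|A): "free" appears in 60% of spam emails
p_free_given_ham = 0.05   # "free" appears in 5% of non-spam emails

# P(B): total probability of seeing "free" in an email
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' Theorem: P(spam | "free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)  # 0.75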
“Naive” Assumption:
- The “naive” part of Naive Bayes is the assumption that all features are independent of one another given the class. For example, in a spam detection task, the algorithm assumes that, once we know whether an email is spam, the presence of one word tells us nothing about the presence of another.
- In reality, this is often not true, but Naive Bayes works surprisingly well in many cases despite this assumption.
How Naive Bayes Works:
- The algorithm calculates the probability of each class (e.g., spam or not spam) for a given set of features.
- It picks the class with the highest probability as the prediction.
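In practice, multiplying many small probabilities underflows floating point, so implementations sum log-probabilities instead. A minimal sketch of the scoring rule, with made-up numbers:

import math

# Illustrative log-priors and per-feature log-likelihoods for two classes
log_prior = {'spam': math.log(0.2), 'not_spam': math.log(0.8)}
log_likelihood = {
    'spam':     {'free': math.log(0.60), 'meeting': math.log(0.01)},
    'not_spam': {'free': math.log(0.05), 'meeting': math.log(0.30)},
}

def predict(words):
    # Score each class: log P(class) + sum of log P(word | class)
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in words)
        for c in log_prior
    }
    return max(scores, key=scores.get)  # class with the highest score

print(predict(['free']))     # 'spam'
print(predict(['meeting']))  # 'not_spam'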
Types of Naive Bayes:¶
Gaussian Naive Bayes:
- Used when the features follow a normal (Gaussian) distribution.
- Suitable for continuous data like height, weight, etc.
Multinomial Naive Bayes:
- Works well for discrete data like word counts in text data.
- Commonly used in text classification problems (e.g., spam detection, document classification).
Bernoulli Naive Bayes:
- Used for binary/boolean features (e.g., presence or absence of a feature).
- Also widely used for text data where features are binary (e.g., a word either appears or doesn’t in an email).
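All three variants live in sklearn.naive_bayes and share the same fit/predict interface, so switching between them is a one-line change:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

GaussianNB()     # continuous features, e.g. age or salary
MultinomialNB()  # non-negative counts, e.g. word counts per document
BernoulliNB()    # binary features, e.g. word present / absent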
Example:¶
Let’s say you’re trying to classify whether an email is spam or not spam based on certain words. For example, words like “discount” or “free” might indicate spam, while words like “meeting” or “schedule” might indicate not spam.
- Step 1: The algorithm looks at how often each word appears in spam and not spam emails based on the training data.
- Step 2: It calculates the probability of an email being spam if certain words are present.
- Step 3: For each category, the algorithm multiplies the category’s prior probability by the probabilities of the observed words and picks the category (spam or not spam) with the higher result.
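Putting the three steps together, here is a tiny hand-rolled classifier that estimates the word probabilities from an invented four-email training set; add-one (Laplace) smoothing keeps unseen words from zeroing out the product:

# Toy training data, invented for illustration: (words in email, label)
train = [
    ({'discount', 'free'}, 'spam'),
    ({'free', 'offer'}, 'spam'),
    ({'meeting', 'schedule'}, 'not spam'),
    ({'schedule', 'project'}, 'not spam'),
]
vocab = {'discount', 'free', 'offer', 'meeting', 'schedule', 'project'}

# Step 1: count how often each word appears in each class
counts = {'spam': {}, 'not spam': {}}
totals = {'spam': 0, 'not spam': 0}
for words, label in train:
    totals[label] += 1
    for w in words:
        counts[label][w] = counts[label].get(w, 0) + 1

# Step 2: per-class word probabilities with add-one smoothing
def p_word(w, label):
    return (counts[label].get(w, 0) + 1) / (totals[label] + 2)

# Step 3: multiply the prior by the word probabilities, pick the larger
def classify(words):
    best, best_p = None, 0.0
    for label in ('spam', 'not spam'):
        p = totals[label] / len(train)  # prior P(class)
        for w in words & vocab:
            p *= p_word(w, label)       # P(word | class)
        if p > best_p:
            best, best_p = label, p
    return best

print(classify({'free', 'discount'}))    # 'spam'
print(classify({'meeting', 'project'}))  # 'not spam'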
Pros and Cons:¶
Pros:
- Simple and fast: Easy to implement and quick to train.
- Works well with small data: Performs well even with a small amount of training data.
- Good for text classification: Especially effective for tasks like spam filtering and document classification.
Cons:
- Naive independence assumption: The assumption that features are independent may not hold in real-world data, which can affect accuracy.
- Can struggle with complex data: If features are highly dependent or interactions between features are important, Naive Bayes may not perform well.
Summary:¶
Naive Bayes is a probabilistic algorithm that predicts the class of data based on probabilities and the assumption that all features are independent. It’s simple, fast, and effective for many classification tasks, especially in text-related problems. However, its naive assumption can limit performance on more complex data sets where feature interactions matter.
Let’s review an example step by step.¶
1. Importing Libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
- NumPy: For numerical operations.
- Pandas: For data manipulation and analysis.
- Matplotlib: For data visualization.
2. Loading the Dataset¶
data = pd.read_csv('Social_Network_Ads.csv')
data
|     | User ID  | Gender | Age | EstimatedSalary | Purchased |
|-----|----------|--------|-----|-----------------|-----------|
| 0   | 15624510 | Male   | 19  | 19000           | 0         |
| 1   | 15810944 | Male   | 35  | 20000           | 0         |
| 2   | 15668575 | Female | 26  | 43000           | 0         |
| 3   | 15603246 | Female | 27  | 57000           | 0         |
| 4   | 15804002 | Male   | 19  | 76000           | 0         |
| …   | …        | …      | …   | …               | …         |
| 395 | 15691863 | Female | 46  | 41000           | 1         |
| 396 | 15706071 | Male   | 51  | 23000           | 1         |
| 397 | 15654296 | Female | 50  | 20000           | 1         |
| 398 | 15755018 | Male   | 36  | 33000           | 0         |
| 399 | 15594041 | Female | 49  | 36000           | 1         |
400 rows × 5 columns
- This line reads the dataset from a CSV file into a Pandas DataFrame named `data`.
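Before modeling, a quick sanity check of the frame is cheap and often catches problems early; for example:

data.info()                       # column types and non-null counts
data['Purchased'].value_counts()  # class balance of the target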
3. Preparing Features and Target Variables¶
X = data.iloc[:, 2:4].values # Features: Age and Salary
y = data.iloc[:, 4].values # Target variable: Purchased (Yes/No)
- X: Contains the feature columns (Age and Salary).
- y: Contains the target variable, indicating whether the user purchased the product.
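Positional iloc indexing works, but it silently breaks if the column order ever changes; selecting by column name is equivalent for this dataset and more robust:

# Equivalent, but robust to column reordering: select by column name
X = data[['Age', 'EstimatedSalary']].values
y = data['Purchased'].values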
4. Visualizing the Data¶
plt.title("Social Network Ads by Age and Salary")
plt.xlabel("Age")
plt.ylabel("Salary")
plt.scatter(X[y==0, 0], X[y==0, 1], label='No') # Users who did not purchase
plt.scatter(X[y==1, 0], X[y==1, 1], label='Yes') # Users who purchased
plt.legend()
plt.show()
- This section visualizes the data points by plotting Age against Salary.
- Different colors represent users who did and did not make a purchase.
5. Splitting the Dataset into Training and Test Sets¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
- The dataset is split into training (75%) and testing (25%) sets to evaluate the model’s performance.
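One optional refinement: passing stratify=y keeps the proportion of purchasers roughly equal in the training and test sets, which matters when the classes are imbalanced:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)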
6. Feature Scaling¶
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train) # Fit and transform the training data
X_test = sc_x.transform(X_test) # Transform the test data
- StandardScaler standardizes the features so that each has a mean of 0 and a standard deviation of 1; it is fit on the training data only, to avoid leaking test-set statistics. Gaussian Naive Bayes estimates a per-class mean and variance for each feature, so scaling does not change its predictions, but it puts Age and Salary on comparable ranges, which keeps the fixed-step grid used for the decision-boundary plot in step 10 to a manageable size.
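A quick check that the scaler did what we expect (the test set won’t be exactly 0 and 1, since it was only transformed, not fit):

print(X_train.mean(axis=0), X_train.std(axis=0))  # approx. [0, 0] and [1, 1]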
7. Creating and Training the Naive Bayes Classifier¶
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB() # Initialize the Gaussian Naive Bayes classifier
classifier.fit(X_train, y_train) # Train the classifier
GaussianNB()
- The Gaussian Naive Bayes Classifier is initialized and trained using the training data.
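After fitting, the classifier exposes the parameters it estimated, which is all the “training” there is in Gaussian Naive Bayes; inspecting them can be instructive (attribute names as in recent scikit-learn releases, where var_ replaced the older sigma_):

classifier.class_prior_  # P(class): fraction of 0s and 1s in y_train
classifier.theta_        # per-class mean of each feature
classifier.var_          # per-class variance of each feature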
8. Making Predictions¶
y_pred = classifier.predict(X_test) # Predicting the test set results
y_pred
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int64)
- Predictions are made for the test set using the trained model.
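If you need probabilities rather than hard labels, predict_proba returns the posterior for each class; the hard predictions above are just the column with the higher probability:

proba = classifier.predict_proba(X_test)  # columns: P(class 0), P(class 1)
proba[:5]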
9. Evaluating the Model¶
classifier.score(X_test, y_test) * 100 # Model accuracy in percentage
90.0
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred) * 100 # Accuracy score calculation
90.0
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Confusion matrix
cm
array([[65,  3],
       [ 7, 25]], dtype=int64)
- Model Accuracy: The overall accuracy of the model on the test set is calculated.
- Confusion Matrix: Rows are actual classes and columns are predicted classes: 65 true negatives, 3 false positives, 7 false negatives, and 25 true positives, which matches the 90% accuracy ((65 + 25) / 100).
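Accuracy alone can hide class-specific weaknesses (here the classifier misses 7 of the 32 actual purchasers), so a per-class report is worth printing as well:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))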
10. Visualizing the Decision Boundary¶
X_set, y_set = X_test, y_test # Using the test set for visualization
plt.title("Social Network Ads by Age and Salary")
plt.xlabel("Age")
plt.ylabel("Salary")
X1 = np.arange(X_set[:, 0].min()-1, X_set[:, 0].max()+1, 0.01) # Range for Age
X2 = np.arange(X_set[:, 1].min()-1, X_set[:, 1].max()+1, 0.01) # Range for Salary
xx, yy = np.meshgrid(X1, X2) # Create a grid of values
X3 = np.array([xx.ravel(), yy.ravel()]).T # Combine grid values
zz = classifier.predict(X3).reshape(xx.shape) # Predictions for the grid
plt.contourf(xx, yy, zz) # Plotting the decision boundary
plt.scatter(X_set[y_set == 0, 0], X_set[y_set == 0, 1], label='No') # Users who did not purchase
plt.scatter(X_set[y_set == 1, 0], X_set[y_set == 1, 1], label='Yes') # Users who purchased
plt.legend()
plt.show()
- This section visualizes the decision boundary of the classifier.
- A grid is created to show areas where the model predicts different outcomes (purchased vs. not purchased).
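Note that the plot is in standardized units because the features were scaled; if you want values back in the original Age and Salary units, the fitted scaler can map points back:

sc_x.inverse_transform(X_test)[:3]  # first rows back in (Age, Salary) units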
Conclusion¶
This code effectively demonstrates the process of using a Naive Bayes Classifier to predict user behavior based on age and salary. It includes data loading, preprocessing, model training, prediction, evaluation, and visualization of results.