Grid Search with K-Fold Cross-Validation is a technique used in machine learning to help you find the best settings (hyperparameters) for your model. Here’s a simple breakdown:
1. Grid Search¶
- What it is: Imagine you want to bake a cake, but you’re unsure about the best recipe. There are several settings you can vary (like the amount of flour, the amount of sugar, and the baking time). Grid search tests every possible combination of these settings to find the best cake.
- In Machine Learning: Similarly, in grid search, you define a set of hyperparameters (like the learning rate, number of trees, etc.) and the algorithm tries out every combination of these parameters to see which one works best for your model.
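To make “every combination” concrete, here is a minimal sketch that enumerates a small grid with itertools.product. The parameter names and values are purely illustrative, not taken from the example later in this notebook:

from itertools import product

# an illustrative grid: 3 learning rates x 2 tree counts = 6 combinations
param_grid = {'learning_rate': [0.01, 0.1, 1.0], 'n_estimators': [100, 200]}

for lr, n in product(param_grid['learning_rate'], param_grid['n_estimators']):
    # in a real search, each combination would be trained and scored
    print(f"learning_rate={lr}, n_estimators={n}")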
2. K-Fold Cross-Validation¶
- What it is: Think of it as sharing a pizza among friends. Instead of giving the entire pizza to one person and seeing if they like it, you cut it into equal slices (or “folds”) and share it with everyone. This way, each person gets a chance to try a slice and provide feedback.
- In Machine Learning: K-fold cross-validation divides your dataset into “k” equal parts (or folds). For each fold:
- The model is trained on the remaining “k-1” parts.
- Then, it’s tested on the fold that was set aside.
- This process is repeated “k” times, so every part of the data gets used for both training and testing. This helps ensure that the model performs well across different subsets of data.
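As a rough sketch of the mechanics, scikit-learn’s KFold yields the train/test indices for each fold. The tiny toy array here is purely illustrative:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(5, 2)  # 5 samples, 2 features

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    # each fold trains on 4 samples and tests on the 1 held out
    print(f"fold {fold}: train={train_idx}, test={test_idx}")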
3. Combining Both¶
When you combine grid search with k-fold cross-validation:
- First, for each combination of hyperparameters in the grid search, the model is trained and validated using k-fold cross-validation.
- Then, the best combination of hyperparameters is chosen based on how well the model performed across all the folds.
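Conceptually, this is what scikit-learn’s GridSearchCV (used later in this notebook) does under the hood. A hand-rolled sketch, using toy data and illustrative parameter values:

from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

# toy data purely for illustration
X_toy, y_toy = make_classification(n_samples=100, n_features=2,
                                   n_informative=2, n_redundant=0, random_state=0)

param_grid = {'C': [1, 10], 'kernel': ['linear', 'rbf']}  # illustrative values

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    # run k-fold cross-validation for this combination
    scores = cross_val_score(SVC(**params), X_toy, y_toy, cv=10)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params

print(best_params, best_score)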
Why Use Them?¶
- Better Accuracy: By trying many combinations and validating each one across different parts of the dataset, you increase the chances of finding a model that performs well on unseen data.
- Robustness: It ensures that the model’s performance is consistent, not just a fluke of one particular train/test split.
Summary¶
In short, grid search with k-fold cross-validation is like experimenting with different cake recipes while making sure every taste tester samples more than one slice. This helps you find the best recipe (or model) that works well in general, not just in one specific case.
Let’s Review This Practically¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load the dataset
data = pd.read_csv('Social_Network_Ads.csv')
data
|     | User ID  | Gender | Age  | EstimatedSalary | Purchased |
|-----|----------|--------|------|-----------------|-----------|
| 0   | 15624510 | Male   | 19.0 | 19000.0         | 0         |
| 1   | 15810944 | Male   | 35.0 | 20000.0         | 0         |
| 2   | 15668575 | Female | 26.0 | 43000.0         | 0         |
| 3   | 15603246 | Female | 27.0 | 57000.0         | 0         |
| 4   | 15804002 | Male   | 19.0 | 76000.0         | 0         |
| …   | …        | …      | …    | …               | …         |
| 395 | 15691863 | Female | 46.0 | 41000.0         | 1         |
| 396 | 15706071 | Male   | 51.0 | 23000.0         | 1         |
| 397 | 15654296 | Female | 50.0 | 20000.0         | 1         |
| 398 | 15755018 | Male   | 36.0 | 33000.0         | 0         |
| 399 | 15594041 | Female | 49.0 | 36000.0         | 1         |
400 rows × 5 columns
X = data.iloc[:,2:4].values  # features: Age and EstimatedSalary
y = data.iloc[:,4].values    # target: Purchased
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)  # 75/25 split
from sklearn.preprocessing import StandardScaler
ss_x = StandardScaler()
X_train = ss_x.fit_transform(X_train)  # fit the scaler on the training set only
X_test = ss_x.transform(X_test)        # reuse the training-set statistics on the test set
# all scaled training points
plt.scatter(X_train[:,0],X_train[:,1])
plt.show()
# training points, coloured by class
plt.scatter(X_train[y_train==0,0],X_train[y_train==0,1])
plt.scatter(X_train[y_train==1,0],X_train[y_train==1,1])
plt.show()
# test points, coloured by class
plt.scatter(X_test[y_test==0,0],X_test[y_test==0,1])
plt.scatter(X_test[y_test==1,0],X_test[y_test==1,1])
plt.show()
from sklearn.svm import SVC
classifier = SVC(kernel='linear',random_state=0)  # linear-kernel SVM as a baseline
classifier.fit(X_train,y_train)
SVC(kernel='linear', random_state=0)
y_pred = classifier.predict(X_test)
y_pred
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)
classifier.score(X_test,y_test)*100
90.0
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm
array([[66,  2],
       [ 8, 24]], dtype=int64)

That is 66 true negatives and 24 true positives (90 correct out of 100, matching the 90% score above), against 2 false positives and 8 false negatives.
Cross Validation¶
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier, X = X_train, y = y_train, cv=10)  # 10-fold CV on the training set
accuracies
array([0.76666667, 0.8 , 0.73333333, 0.83333333, 0.73333333, 0.66666667, 0.83333333, 0.93333333, 0.96666667, 0.86666667])
accuracies.mean()
0.8133333333333335
accuracies.std()
0.08844332774281068
The mean accuracy plus or minus one standard deviation gives a rough range for the model’s expected accuracy:

0.8133333333333335 + 0.08844332774281068
0.9017766610761442
0.8133333333333335 - 0.08844332774281068
0.7248900055905227
Grid Search¶
from sklearn.model_selection import GridSearchCV
# two sub-grids: a linear kernel tuned over C, and an RBF kernel tuned over C and gamma
parameters = [{'C':[1,10,100,1000],'kernel':['linear']},
              {'C':[1,10,100,1000],'kernel':['rbf'],'gamma':[0.6,0.7,0.8,0.9,1.0,1.1,1.2,1.3]}
             ]
gs = GridSearchCV(estimator=classifier,param_grid=parameters,scoring='accuracy',cv=10)  # 10-fold CV per combination
gs = gs.fit(X_train,y_train)
gs.best_score_
0.9133333333333333
gs.best_params_
{'C': 1, 'gamma': 1.2, 'kernel': 'rbf'}
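Incidentally, since GridSearchCV refits the best model on the full training set by default (refit=True), the manual retraining below is equivalent to using the fitted search object directly:

best_clf = gs.best_estimator_  # the model already refit with the best parameters
best_clf.score(X_test, y_test) * 100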
from sklearn.svm import SVC
classifier = SVC(kernel='rbf',random_state=0,C=1,gamma=1.2)  # retrain with the best hyperparameters
classifier.fit(X_train,y_train)
SVC(C=1, gamma=1.2, random_state=0)
classifier.score(X_test,y_test)*100
93.0

The tuned RBF model improves test accuracy from 90% to 93%.