Grid Search using K-Fold Cross-Validation is a technique that combines hyperparameter tuning with robust model evaluation. This method helps find the optimal hyperparameters for a machine learning model while ensuring that the evaluation of those hyperparameters is reliable and not dependent on a specific train-test split. Here’s a breakdown of how it works and how to implement it.
What is K-Fold Cross-Validation?¶
K-Fold Cross-Validation is a method used to assess the performance of a model. Here’s how it works:
- Dataset Splitting: The dataset is divided into k equal-sized folds (subsets).
- Training and Validation: The model is trained k times, each time using k-1 folds for training and the remaining fold for validation.
- Performance Averaging: The performance metrics (like accuracy, precision, etc.) are averaged over all k iterations to provide a more reliable estimate of the model’s performance.
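To make the fold mechanics concrete, here is a minimal sketch using scikit-learn’s KFold on a ten-element toy array (the data here is illustrative only):
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # ten toy samples standing in for a real feature matrix

# 5 folds: each iteration holds out 2 samples for validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {i}: train={train_idx}, validation={val_idx}")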
How Grid Search with K-Fold Works¶
- Define the Model: Choose the machine learning model you want to optimize.
- Set Hyperparameter Grid: Specify the hyperparameters and their possible values to test.
- Combine with K-Fold: For each combination of hyperparameters, perform K-Fold Cross-Validation to evaluate model performance.
- Select Best Hyperparameters: Identify the hyperparameter combination that yields the best average performance across all folds.
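Conceptually, the four steps above reduce to a pair of nested loops. The sketch below is a simplified version of what GridSearchCV automates; the helper name grid_search_kfold and the score_fn argument are illustrative, and X and y are assumed to be NumPy arrays:
from itertools import product
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def grid_search_kfold(model, param_grid, X, y, score_fn, n_splits=5):
    # Simplified sketch of grid search with k-fold cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    best_score, best_params = -np.inf, None
    keys, values = zip(*param_grid.items())
    for combo in product(*values):              # every hyperparameter combination
        params = dict(zip(keys, combo))
        fold_scores = []
        for train_idx, val_idx in kf.split(X):  # k-fold evaluation of this combination
            m = clone(model).set_params(**params)
            m.fit(X[train_idx], y[train_idx])
            fold_scores.append(score_fn(y[val_idx], m.predict(X[val_idx])))
        avg = float(np.mean(fold_scores))       # average performance across the folds
        if avg > best_score:
            best_score, best_params = avg, params
    return best_params, best_score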
Steps to Implement Grid Search with K-Fold¶
Here’s a step-by-step guide to implementing Grid Search with K-Fold Cross-Validation using Python’s scikit-learn library.
1. Import Libraries¶
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
2. Load Dataset¶
# Example: Load a dataset
data = pd.read_csv('your_dataset.csv') # Replace with your dataset
X = data.drop('target', axis=1) # Features
y = data['target'] # Target variable
3. Define the Model and Hyperparameter Grid¶
# Define the model
model = RandomForestClassifier()
# Define the hyperparameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
4. Set Up K-Fold Cross-Validation¶
# Define K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42) # 5-fold cross-validation
5. Implement Grid Search with K-Fold¶
# Set up Grid Search with cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=kf, n_jobs=-1)
# Fit Grid Search to the data
grid_search.fit(X, y)
# Retrieve the best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)
Benefits of Using Grid Search with K-Fold¶
- Robust Performance Estimation: K-Fold provides a more reliable estimate of model performance, reducing the risk of overfitting to a specific train-test split.
- Comprehensive Search: Grid Search ensures that all combinations of hyperparameters are evaluated.
- Increased Model Generalization: By validating on multiple folds, you can ensure that the selected hyperparameters help the model generalize well to unseen data.
Drawbacks¶
- Computational Cost: This method can be computationally expensive, especially with large datasets and extensive hyperparameter grids (one common mitigation is sketched after this list).
- Time-Consuming: As the number of folds increases or the grid expands, the training time increases, which may be impractical for large-scale problems.
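One common mitigation is to sample the grid rather than exhaust it. A minimal sketch using scikit-learn’s RandomizedSearchCV, assuming model, param_grid, and kf from the earlier steps are still in scope:
from sklearn.model_selection import RandomizedSearchCV

# Evaluate only 10 randomly sampled combinations instead of the full grid
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                                   n_iter=10, scoring='accuracy', cv=kf,
                                   n_jobs=-1, random_state=42)
random_search.fit(X, y)
print(random_search.best_params_, random_search.best_score_)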
Summary¶
Grid Search with K-Fold Cross-Validation is an effective approach for hyperparameter tuning in machine learning. By systematically exploring hyperparameter combinations and validating them across multiple folds, this method enhances the robustness and reliability of model evaluations, leading to better-performing models.
Let’s Review This Practically¶
Finding the optimal tuning parameters for a machine learning problem can often be very difficult. We may encounter overfitting, which means our machine learning model trains too specifically on our training dataset and causes higher levels of error when applied to our test/holdout datasets. Or, we may run into underfitting, which means our model doesn’t train specifically enough to our training dataset. This also leads to higher levels of error when applied to test/holdout datasets.
When conducting a normal train/validation/test split for model training and testing, the model trains on a specific randomly selected portion of the data, validates on a separate set of data, then finally tests on a holdout dataset. In practice this can lead to issues, especially when the dataset is relatively small, because you could be eliminating observations that are key to training an optimal model. Keeping a percentage of the data out of the training phase, even if it’s only 15–25%, withholds information that would otherwise help our model train more effectively.
In comes a solution to our problem: cross-validation. Cross-validation works by splitting our dataset into random groups, holding one group out as the test set, and training the model on the remaining groups. This process is repeated with each group taking a turn as the held-out test group, and the scores from the iterations are then averaged to produce the final performance estimate.
One of the most common types of cross-validation is k-fold cross-validation, where ‘k’ is the number of folds within the dataset. Using k = 5 is a common first step and makes the principle easy to demonstrate:
Here we see five iterations of the model, each of which treats a different fold as the test set and trains on the other four folds. Once all five iterations are complete, their scores are averaged together to give the final cross-validation estimate.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('Social_Network_Ads.csv')
data.head()
|   | User ID  | Gender | Age  | EstimatedSalary | Purchased |
|---|----------|--------|------|-----------------|-----------|
| 0 | 15624510 | Male   | 19.0 | 19000.0         | 0         |
| 1 | 15810944 | Male   | 35.0 | 20000.0         | 0         |
| 2 | 15668575 | Female | 26.0 | 43000.0         | 0         |
| 3 | 15603246 | Female | 27.0 | 57000.0         | 0         |
| 4 | 15804002 | Male   | 19.0 | 76000.0         | 0         |
X = data.iloc[:, 2:4].values  # features: Age and EstimatedSalary
y = data.iloc[:, 4].values    # target: Purchased
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
plt.scatter(X_train[:, 0], X_train[:, 1])  # Age vs. EstimatedSalary before scaling
plt.show()
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # apply the same scaling to the test data
plt.scatter(X_train[:, 0], X_train[:, 1])  # the same features after standardization
plt.show()
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)  # linear-kernel support vector classifier
classifier.fit(X_train,y_train)
SVC(kernel='linear', random_state=0)
Yp = classifier.predict(X_test)
Yp
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1], dtype=int64)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,Yp)
cm
array([[66,  2],
       [ 8, 24]], dtype=int64)
from sklearn.metrics import accuracy_score
Score = accuracy_score(y_test,Yp)
Score * 100
90.0
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)  # 10-fold CV on the training set
accuracies
array([0.76666667, 0.8 , 0.73333333, 0.83333333, 0.73333333, 0.66666667, 0.83333333, 0.93333333, 0.96666667, 0.86666667])
accuracies.mean()
0.8133333333333335
accuracies.std()
0.08844332774281068
from sklearn.model_selection import GridSearchCV
# Two sub-grids: the linear kernel tunes only C; the RBF kernel tunes C and gamma
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]}]
grid_search = GridSearchCV(estimator=classifier, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train,y_train)
grid_search.best_score_
0.9133333333333333
grid_search.best_params_
{'C': 1, 'gamma': 1.2, 'kernel': 'rbf'}
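As a final check (not part of the original run, so the resulting score is not shown here), the tuned model can be evaluated on the held-out test set; GridSearchCV refit the best estimator on X_train by default:
y_tuned = grid_search.best_estimator_.predict(X_test)  # predictions from the tuned RBF model
accuracy_score(y_test, y_tuned) * 100                  # compare with the 90.0 from the linear kernel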