K-Fold Cross Validation is a method used in machine learning to assess the performance of a model by partitioning the data into K equal subsets (or folds). Here’s an outline of the process:
Steps:
- Data Splitting: The dataset is divided into K roughly equal subsets, or “folds” (see the splitting sketch after this list).
- Training and Validation: The model is trained on K-1 folds and validated on the remaining fold. This process is repeated K times, with a different fold being used as the validation set each time.
- Performance Averaging: After all K iterations, the model’s performance metrics (such as accuracy, precision, or F1 score) from each fold are averaged to provide an overall evaluation.
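To make the splitting step concrete, here is a minimal sketch (on a hypothetical 10-sample toy array, not the iris data used later) of the train/validation indices that scikit-learn’s KFold produces:

from sklearn.model_selection import KFold
import numpy as np

# Toy data: 10 samples, purely to illustrate how folds are formed
X_toy = np.arange(10).reshape(10, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, val_idx) in enumerate(kfold.split(X_toy), start=1):
    # Each iteration holds out a different fold for validation
    print(f"Fold {i}: train={train_idx}, validation={val_idx}")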
Key Benefits:
- More Reliable Estimation: By testing on multiple validation sets, K-Fold Cross Validation provides a better estimate of a model’s true performance on unseen data.
- Efficient Use of Data: It makes full use of the dataset since every observation is used for both training and validation.
- Reduces Estimate Variability: It reduces the variance of the performance estimate that can occur when only a single train-test split is used.
Variants:
- Stratified K-Fold: Ensures that each fold maintains the same class distribution as the entire dataset, which is particularly useful for imbalanced datasets.
- Leave-One-Out Cross Validation (LOO): A special case where each fold consists of a single data point, so K equals the number of data points (both variants are sketched just below).
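A minimal sketch of both variants in scikit-learn, using the same iris data that the practical section below loads (not part of the original notebook):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, LeaveOneOut

X, y = load_iris(return_X_y=True)

# Stratified K-Fold: pass y to split() so each fold keeps the class ratios
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(sum(1 for _ in skf.split(X, y)))   # 5 splits

# Leave-One-Out: every sample is its own validation fold, so K = len(X)
loo = LeaveOneOut()
print(loo.get_n_splits(X))               # 150 splits for iris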
This technique is widely used to evaluate models across various machine learning tasks because it helps ensure that the model generalizes well to unseen data.
Let’s review this in practice.
In [1]:
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
In [2]:
iris = load_iris()
X = iris.data
y = iris.target
In [3]:
model = LogisticRegression()
In [4]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
In [5]:
scores = []
for train, test in kfold.split(X):
    # Indices for the current training and validation folds
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    # Fit on K-1 folds, evaluate on the held-out fold
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(accuracy)
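The manual loop above can usually be replaced by cross_val_score, which performs the same fit-and-score cycle internally. A minimal sketch, assuming the same model and kfold objects defined in the cells above:

from sklearn.model_selection import cross_val_score

# Equivalent to the loop: fit on K-1 folds, score accuracy on the held-out fold
cv_scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(cv_scores, cv_scores.mean())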
In [6]:
scores
Out[6]:
[1.0, 1.0, 0.9333333333333333, 0.9666666666666667, 0.9666666666666667]
In [7]:
for fold, accuracy in enumerate(scores):
    print(fold + 1, accuracy * 100)
1 100.0
2 100.0
3 93.33333333333333
4 96.66666666666667
5 96.66666666666667
In [8]:
avg_accuracy = sum(scores)/len(scores) * 100
avg_accuracy
Out[8]:
97.33333333333334
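Reporting the spread of the fold scores alongside the mean is also common. A quick sketch using the standard-library statistics module (not part of the original notebook):

import statistics

mean_acc = statistics.mean(scores) * 100
std_acc = statistics.stdev(scores) * 100
print(f"{mean_acc:.2f} +/- {std_acc:.2f}")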
Grid Search using K-Fold
In [9]:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.datasets import load_iris
from sklearn.svm import SVC
In [10]:
iris = load_iris()
X = iris.data
y = iris.target
In [11]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
In [12]:
svm = SVC()
In [13]:
grid_param = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1], 'kernel': ['linear', 'poly', 'sigmoid', 'rbf']}
In [14]:
gs = GridSearchCV(estimator=svm, param_grid=grid_param, cv=kfold)
In [15]:
gs.fit(X,y)
Out[15]:
GridSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True), estimator=SVC(), param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1], 'kernel': ['linear', 'poly', 'sigmoid', 'rbf']})
In [16]:
gs.best_params_
Out[16]:
{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
In [17]:
gs.best_score_
Out[17]:
0.9800000000000001
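Beyond best_params_ and best_score_, GridSearchCV also exposes the refit best model and the per-combination results. A brief sketch (the attributes are standard scikit-learn; the column selection is illustrative):

best_model = gs.best_estimator_        # SVC refit on the full data with the best params
results = gs.cv_results_               # dict of per-combination CV statistics
print(results['mean_test_score'][:3])  # first few mean CV accuracies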
In [18]:
svm = SVC(kernel='rbf', C=10, gamma=0.01)
svm.fit(X, y)
Out[18]:
SVC(C=10, gamma=0.01)
In [19]:
svm.score(X, y) * 100
Out[19]:
98.0
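Because this last score is computed on the same data the model was trained on, it is an optimistic (resubstitution) estimate. A cross-validated check of the tuned model can be sketched as follows, reusing the kfold splitter defined earlier:

from sklearn.model_selection import cross_val_score

# Cross-validated accuracy of the tuned SVC, rather than the training-set score
cv_scores = cross_val_score(SVC(kernel='rbf', C=10, gamma=0.01), X, y, cv=kfold)
print(cv_scores.mean() * 100)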