Andrés Baamonde Lozano

Talking about Machine Learning (II): Cross Validation

Intro

Before talking about cross validation, we should first talk about overfitting. Overfitting occurs when a model fits its training data too closely, learning redundant details and noise instead of the underlying pattern. Such a model cannot generalize: it works well on the training examples but fails on real data. To prevent this situation we can split our dataset into two subsets, train and test, making sure each class has a representative number of rows in both.

In the following example we split our dataset into train and test samples and score the trained model.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)  # (150, 4) (150,)

# Hold out 40% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print("Cross validation score = {0}".format(clf.score(X_test, y_test)))
# Cross validation score = 0.9666666666666667
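Since each class should be represented in both subsets, it is worth knowing that train_test_split accepts a stratify argument for exactly that. A minimal sketch, reusing the same data as above:

# Stratified split: preserves the class proportions of iris.target
# in both the train and the test subsets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0,
    stratify=iris.target)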

Doing this split only once would be a mistake: a single train-test partition can be lucky or unlucky. We should split the full dataset several times, into different train-test pairs, to make sure our model configuration works properly across all of them.

Cross Validation

Splitting our dataset into portions like this, repeatedly, is called cross validation. Cross validation has many variations; I will use the simplest one, K-fold.
You can take a look at these variations here.

Cross validation figure

from sklearn.model_selection import KFold

clf = svm.SVC(kernel='linear', C=1)
kf = KFold(n_splits=10)
# Fit and score on each fold's own train/test indices
for train_index, test_index in kf.split(iris.data):
    clf.fit(iris.data[train_index], iris.target[train_index])
    print(clf.score(iris.data[test_index], iris.target[test_index]))
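One caveat: plain KFold takes the samples in order, and the iris targets are sorted by class, so some folds can end up dominated by a single class. A hedged sketch using StratifiedKFold, one of the variations mentioned above, which keeps the class proportions in every fold:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)
# split() also needs the targets, in order to stratify on them
for train_index, test_index in skf.split(iris.data, iris.target):
    clf.fit(iris.data[train_index], iris.target[train_index])
    print(clf.score(iris.data[test_index], iris.target[test_index]))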

Scikit-learn also has a function that abstracts the cross-validation loop and returns the scores of our model:

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
for idx, score in enumerate(scores):
    print("Cross validation {0} score = {1}".format(idx + 1, score))

# mean accuracy and its spread (2 standard deviations)
print(scores.mean(), scores.std() * 2)
# Cross validation 1 score = 1.0
# Cross validation 2 score = 0.9333333333333333
# Cross validation 3 score = 1.0
# Cross validation 4 score = 1.0
# Cross validation 5 score = 0.8666666666666667
# Cross validation 6 score = 1.0
# Cross validation 7 score = 0.9333333333333333
# Cross validation 8 score = 1.0
# Cross validation 9 score = 1.0
# Cross validation 10 score = 1.0
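The cv argument does not have to be an integer; you can also pass a splitter object. A minimal sketch using ShuffleSplit for 10 random 60/40 splits (my addition, not from the original post):

from sklearn.model_selection import ShuffleSplit

# 10 random splits, each holding out 40% of the data as a test set
cv = ShuffleSplit(n_splits=10, test_size=0.4, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
print(scores.mean(), scores.std() * 2)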

Playing with penalty

You can see an explanation of the penalty parameter C in the previous post. The graphic below shows how different penalties affect our model.


penalties = list(np.arange(0.5, 10.0, 0.1))

means = []
stds  = []

for C in penalties:
    clf = svm.SVC(kernel='linear', C=C)
    scores = cross_val_score(clf, iris.data, iris.target, cv=10)
    means.append(scores.mean())
    stds.append(scores.std() * 2)

import matplotlib.pyplot as plt

# plot mean score (top) and score spread (bottom) against C
fig = plt.figure(1)
plt.subplot(211)
plt.plot(penalties, means, 'r')
plt.subplot(212)
plt.plot(penalties, stds, 'r')
plt.show()
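If you want to pick a concrete value of C from this experiment, one simple heuristic (my addition, not from the original post) is to take the penalty with the highest mean score; scikit-learn's GridSearchCV automates this kind of search more thoroughly:

# index of the penalty with the best mean cross-validation score
best_idx = int(np.argmax(means))
print("Best C = {0:.1f}, mean score = {1:.3f}".format(
    penalties[best_idx], means[best_idx]))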

Figure: mean score (top) and score spread (bottom) for each penalty value
