How to Avoid Data Leakage When Performing Data Preparation

Data preparation is the process of transforming raw data into a form that is appropriate for modeling.

A naive approach to preparing data applies the transform on the entire dataset before evaluating the performance of the model. This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.

A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.

In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.

After completing this tutorial, you will know:

  • Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
  • Data preparation must be fit on the training set only in order to avoid data leakage.
  • How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

How to Avoid Data Leakage When Performing Data Preparation. Photo by kuhnmi, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Problem With Naive Data Preparation
  2. Data Preparation With Train and Test Sets
    1. Train-Test Evaluation With Naive Data Preparation
    2. Train-Test Evaluation With Correct Data Preparation
  3. Data Preparation With k-fold Cross-Validation
    1. Cross-Validation Evaluation With Naive Data Preparation
    2. Cross-Validation Evaluation With Correct Data Preparation

Problem With Naive Data Preparation

The manner in which data preparation techniques are applied to data matters.

A common approach is to first apply one or more transforms to the entire dataset. Then the dataset is split into train and test sets or k-fold cross-validation is used to fit and evaluate a machine learning model.

  • 1. Prepare Dataset
  • 2. Split Data
  • 3. Evaluate Models

Although this is a common approach, it is dangerously incorrect in most cases.

The problem with applying data preparation techniques before splitting data for model evaluation is that it can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s performance on the problem.

Data leakage refers to a problem where information about the holdout dataset, such as a test or validation dataset, is made available to the model in the training dataset. This leakage is often small and subtle but can have a marked effect on performance.

… leakage means that information is revealed to the model that gives it an unrealistic advantage to make better predictions. This could happen when test data is leaked into the training set, or when data from the future is leaked to the past. Any time that a model is given information that it shouldn’t have access to when it is making predictions in real time in production, there is leakage.

— Page 93, Feature Engineering for Machine Learning, 2018.

We get data leakage by applying data preparation techniques to the entire dataset.

This is not a direct type of data leakage, where we would train the model on the test dataset. Instead, it is an indirect type of data leakage, where some knowledge about the test dataset, captured in summary statistics, is available to the model during training. This can make it a harder type of data leakage to spot, especially for beginners.

One other aspect of resampling is related to the concept of information leakage which is where the test set data are used (directly or indirectly) during the training process. This can lead to overly optimistic results that do not replicate on future data points and can occur in subtle ways.

— Page 55, Feature Engineering and Selection, 2019.

For example, consider the case where we want to normalize the data, that is, scale the input variables to the range 0-1.

Normalizing the input variables requires that we first calculate the minimum and maximum values for each variable before using these values to scale the variables. The dataset is then split into train and test datasets, but the examples in the training dataset know something about the data in the test dataset: they have been scaled by the global minimum and maximum values, so they know more about the global distribution of each variable than they should.

We get the same type of leakage with almost all data preparation techniques; for example, standardization estimates the mean and standard deviation values from the domain in order to scale the variables; even models that impute missing values using a model or summary statistics will draw on the full dataset to fill in values in the training dataset.

The solution is straightforward.

Data preparation must be fit on the training dataset only. That is, any coefficients or models prepared for the data preparation process must only use rows of data in the training dataset.

Once fit, the data preparation algorithms or models can then be applied to the training dataset, and to the test dataset.

  • 1. Split Data.
  • 2. Fit Data Preparation on Training Dataset.
  • 3. Apply Data Preparation to Train and Test Datasets.
  • 4. Evaluate Models.

More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid data leakage. This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering, and more. This means so-called “model evaluation” should really be called “modeling pipeline evaluation”.

In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all of the steps in the modeling process that could significantly affect the model’s effectiveness.

— Pages 54-55, Feature Engineering and Selection, 2019.

Now that we are familiar with how to apply data preparation to avoid data leakage, let’s look at some worked examples.


Data Preparation With Train and Test Sets

In this section, we will evaluate a logistic regression model using train and test sets on a synthetic binary classification dataset where the input variables have been normalized.

First, let’s define our synthetic dataset.

We will use the make_classification() function to create the dataset with 1,000 rows of data and 20 numerical input features. The example below creates the dataset and summarizes the shape of the input and output variable arrays.
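
A minimal sketch of this step is given below; the mix of informative and redundant features and the random seed are illustrative choices, not requirements from the text.

```python
# create a synthetic binary classification dataset
from sklearn.datasets import make_classification

# define dataset: 1,000 rows, 20 numerical input features
# (the informative/redundant split and the seed are illustrative choices)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# summarize the shape of the input and output arrays
print(X.shape, y.shape)
```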

Running the example creates the dataset and confirms that the input part of the dataset has 1,000 rows and 20 columns for the 20 input variables and that the output variable has 1,000 examples to match the 1,000 rows of input data, one value per row.

Next, we can evaluate our model on the scaled dataset, starting with the naive or incorrect approach.

Train-Test Evaluation With Naive Data Preparation

The naive approach involves first applying the data preparation method, then splitting the data before finally evaluating the model.

We can normalize the input variables using the MinMaxScaler class. The scaler is first defined with the default configuration, which scales the data to the range 0-1, and the fit_transform() function is then called to fit the transform on the dataset and apply it to the dataset in a single step. The result is a normalized version of the input variables, where each column in the array is normalized independently (e.g. has its own minimum and maximum calculated).
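
For example, a sketch of this (incorrect) step, continuing from the X array defined above:

```python
from sklearn.preprocessing import MinMaxScaler

# naive approach: fit the scaler on the entire dataset and transform it in one step
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
```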

Next, we can split our dataset into train and test sets using the train_test_split() function. We will use 67 percent for the training set and 33 percent for the test set.
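
For example (the random seed is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# split into train (67 percent) and test (33 percent) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
```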

We can then define our logistic regression algorithm via the LogisticRegression class, with default configuration, and fit it on the training dataset.

The fit model can then make a prediction for the input data for the test set, and we can compare the predictions to the expected values and calculate a classification accuracy score.
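
A sketch of these two steps:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# fit the model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model on the test set
yhat = model.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy * 100))
```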

Tying this together, the complete example is listed below.
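
A self-contained sketch of the complete example; the seeds are illustrative, so the exact accuracy may differ from the figure reported below.

```python
# naive approach: normalize the whole dataset before splitting and evaluating the model
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset (feature mix and seed are illustrative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# normalize the dataset (incorrectly, before the split)
X = MinMaxScaler().fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model on the test set
yhat = model.predict(X_test)
print('Accuracy: %.3f' % (accuracy_score(y_test, yhat) * 100))
```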

Running the example normalizes the data, splits the data into train and test sets, then fits and evaluates the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the estimate for the model is about 84.848 percent.

Given we know that there was data leakage, we know that this estimate of model accuracy is wrong.

Next, let’s explore how we might correctly prepare the data to avoid data leakage.

Train-Test Evaluation With Correct Data Preparation

The correct approach to performing data preparation with a train-test split evaluation is to fit the data preparation on the training set, then apply the transform to the train and test sets.

This requires that we first split the data into train and test sets.

We can then define the MinMaxScaler and call the fit() function on the training set, then apply the transform() function on the train and test sets to create a normalized version of each dataset.
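
A sketch of the split and the correctly fitted scaler, continuing from the X and y arrays defined earlier:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# split into train and test sets first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the scaler on the training set only
scaler = MinMaxScaler()
scaler.fit(X_train)
# apply the same transform to the train and test sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```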

This avoids data leakage, as the minimum and maximum values for each input variable are calculated using only the training dataset (X_train) instead of the entire dataset (X).

The model can then be evaluated as before.

Tying this together, the complete example is listed below.
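
A self-contained sketch of the complete example (seeds are illustrative):

```python
# correct approach: split first, then fit the data preparation on the training set only
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset (feature mix and seed are illustrative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the scaler on the training set and apply it to both sets
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# fit the model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model on the test set
yhat = model.predict(X_test)
print('Accuracy: %.3f' % (accuracy_score(y_test, yhat) * 100))
```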

Running the example splits the data into train and test sets, normalizes the data correctly, then fits and evaluates the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the estimate for the model is about 85.455 percent, compared to 84.848 percent for the estimate with data leakage in the previous section.

We expect data leakage to result in an incorrect estimate of model performance, typically an optimistic one (i.e. better reported performance). In this case, however, data leakage resulted in slightly worse reported performance. This might be because of the difficulty of the prediction task.

Data Preparation With k-fold Cross-Validation

In this section, we will evaluate a logistic regression model using k-fold cross-validation on a synthetic binary classification dataset where the input variables have been normalized.

You may recall that k-fold cross-validation involves splitting a dataset into k non-overlapping groups of rows. The model is then trained on all but one group to form a training dataset and then evaluated on the held-out fold. This process is repeated so that each fold is given a chance to be used as the holdout test set. Finally, the average performance across all evaluations is reported.

The k-fold cross-validation procedure generally gives a more reliable estimate of model performance than a train-test split, although it is more computationally expensive given the repeated fitting and evaluation of models.

Let’s first look at naive data preparation with k-fold cross-validation.

Cross-Validation Evaluation With Naive Data Preparation

Naive data preparation with cross-validation involves applying the data transforms first, then using the cross-validation procedure.

We will use the synthetic dataset prepared in the previous section and normalize the data directly.

The k-fold cross-validation procedure must first be defined. We will use repeated stratified 10-fold cross-validation, which is a best practice for classification. Repeated means that the whole cross-validation procedure is repeated multiple times, three in this case. Stratified means that each group of rows will have the same relative composition of examples from each class as the whole dataset. We will use k=10, or 10-fold cross-validation.

This can be achieved using the RepeatedStratifiedKFold class, which can be configured for three repeats and 10 folds, and the cross_val_score() function to perform the procedure, passing in the defined model, cross-validation object, and metric to calculate (in this case, accuracy).
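
A sketch of these steps, continuing from the X and y arrays defined earlier (n_jobs=-1 simply uses all CPU cores and is an optional addition):

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# naive approach: normalize the entire dataset before cross-validation
X = MinMaxScaler().fit_transform(X)
# define the model
model = LogisticRegression()
# define the repeated stratified 10-fold cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using the cross-validation procedure
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
```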

We can then report the average accuracy across all of the repeats and folds.

Tying this all together, the complete example of evaluating a model with cross-validation using data preparation with data leakage is listed below.
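
A self-contained sketch of the complete example; the seeds are illustrative, so the exact accuracy may differ from the figure reported below.

```python
# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define dataset (feature mix and seed are illustrative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# normalize the dataset (incorrectly, before cross-validation)
X = MinMaxScaler().fit_transform(X)
# define the model
model = LogisticRegression()
# define and run the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of accuracy across all repeats and folds
print('Accuracy: %.3f (%.3f)' % (mean(scores) * 100, std(scores) * 100))
```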

Running the example normalizes the data first, then evaluates the model using repeated stratified cross-validation.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an estimated accuracy of about 85.300 percent, which we know is incorrect given the data leakage allowed via the data preparation procedure.

Next, let’s look at how we can evaluate the model with cross-validation and avoid data leakage.

Cross-Validation Evaluation With Correct Data Preparation

Data preparation without data leakage when using cross-validation is slightly more challenging.

It requires that the data preparation method is fit on the training folds and applied to the train and test folds within each iteration of the cross-validation procedure.

We can achieve this by defining a modeling pipeline: a sequence of data preparation steps to perform, ending with the model to fit and evaluate.

To provide a solid methodology, we should constrain ourselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set).

— Page 55, Feature Engineering and Selection, 2019.

The evaluation procedure changes from simply and incorrectly evaluating just the model to correctly evaluating the entire pipeline of data preparation and model together as a single atomic unit.

This can be achieved using the Pipeline class.

This class takes a list of steps that define the pipeline. Each step in the list is a tuple with two elements. The first element is the name of the step (a string) and the second is the configured object of the step, such as a transform or a model. The model is only supported as the final step, although we can have as many transforms as we like in the sequence.
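
For example, a minimal pipeline with the normalization step followed by the model (the step names are arbitrary labels):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# define the pipeline: normalization followed by the model
steps = [('scaler', MinMaxScaler()), ('model', LogisticRegression())]
pipeline = Pipeline(steps=steps)
```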

We can then pass the configured object to the cross_val_score() function for evaluation.

Tying this together, the complete example of correctly performing data preparation without data leakage when using cross-validation is listed below.
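
A self-contained sketch of the complete example (seeds are illustrative):

```python
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# define dataset (feature mix and seed are illustrative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
# define the pipeline so the scaler is fit on the training folds only within each split
pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LogisticRegression())])
# define and run the evaluation procedure on the whole pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the mean and standard deviation of accuracy across all repeats and folds
print('Accuracy: %.3f (%.3f)' % (mean(scores) * 100, std(scores) * 100))
```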

Running the example normalizes the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model has an estimated accuracy of about 85.433 percent, compared to the approach with data leakage that achieved an accuracy of about 85.300 percent.

As with the train-test example in the previous section, removing data leakage has resulted in a slight improvement in reported performance, when our intuition might have suggested a drop, given that data leakage often results in an optimistic estimate of model performance. Nevertheless, the examples clearly demonstrate that data leakage does impact the estimate of model performance, and how to avoid it by performing data preparation correctly after the data is split.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Feature Engineering for Machine Learning, 2018.
  • Feature Engineering and Selection, 2019.

Summary

In this tutorial, you discovered how to avoid data leakage during data preparation when evaluating machine learning models.

Specifically, you learned:

  • Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
  • Data preparation must be fit on the training set only in order to avoid data leakage.
  • How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

