
XGBoost for Regression

Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.

Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions.

Regression predictive modeling problems involve predicting a numerical value such as a dollar amount or a height. XGBoost can be used directly for regression predictive modeling.

In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.

After completing this tutorial, you will know:

  • XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
  • How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
  • How to fit a final model and use it to make a prediction on new data.

Let’s get started.

XGBoost for Regression. Photo by chas B, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Extreme Gradient Boosting
  2. XGBoost Regression API
  3. XGBoost Regression Example

Extreme Gradient Boosting

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss gradient is minimized as the model is fit, much like a neural network.

For more on gradient boosting, see the tutorial:

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

XGBoost: A Scalable Tree Boosting System, 2016.

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at how we can use it in our regression predictive modeling projects.

XGBoost Regression API

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:
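A typical command looks like the following (depending on your setup you may need sudo, pip3, or a virtual environment):

pip install xgboost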

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.
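A minimal check script, which simply imports the library and prints its version, might look like this:

# check xgboost version
import xgboost
print(xgboost.__version__)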

Running the script will print the version of the XGBoost library you have installed.

Your version should be the same as or higher than the version used when this tutorial was written. If not, you must upgrade your version of the XGBoost library.

It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install in the pip command, as follows:
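For example, pinning the version with pip:

pip install xgboost==1.0.1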

If you require specific instructions for your development environment, see the tutorial:

The XGBoost library has its own custom API, although we will use it via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

An XGBoost regression model can be defined by creating an instance of the XGBRegressor class; for example:
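In its simplest form, with all hyperparameters left at their defaults:

# define an xgboost regression model with default hyperparameters
from xgboost import XGBRegressor
model = XGBRegressor()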

You can specify hyperparameter values to the class constructor to configure the model.

Perhaps the most commonly configured hyperparameters are the following:

  • n_estimators: The number of trees in the ensemble, often increased until no further improvements are seen.
  • max_depth: The maximum depth of each tree, often values are between 1 and 10.
  • eta: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller.
  • subsample: The fraction of samples (rows) used to fit each tree, set to a value between 0 and 1, often 1.0 to use all samples.
  • colsample_bytree: The fraction of features (columns) used to fit each tree, set to a value between 0 and 1, often 1.0 to use all features.

For example:
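The values below are purely illustrative; good settings depend on your dataset:

# define an xgboost regression model with common hyperparameters set explicitly
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)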

Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop an XGBoost ensemble for regression.

XGBoost Regression Example

In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.

First, let’s introduce a standard regression dataset.

We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.
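A sketch of that listing, assuming the copy of the housing CSV hosted in the jbrownlee/Datasets repository on GitHub:

# load and summarize the housing dataset
from pandas import read_csv
# load dataset from URL
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the first five rows
print(dataframe.head())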

Running the example confirms the 506 rows of data, 13 input variables, and a single numeric target variable (14 columns in total). We can also see that all input variables are numeric.

Next, let’s evaluate a regression XGBoost model with default hyperparameters on the problem.

First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.
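For example, taking the values from the loaded DataFrame and treating the last column as the target:

# split data into input and output columns
data = dataframe.values
X, y = data[:, :-1], data[:, -1]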

Next, we can create an instance of the model with a default configuration.
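For instance:

# define the model with default hyperparameters
model = XGBRegressor()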

We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.

This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling the cross_val_score() to evaluate the model using the procedure and collect the scores.
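A sketch of that step (the random_state value is an arbitrary seed chosen here for repeatability):

# define the evaluation procedure
from sklearn.model_selection import RepeatedKFold, cross_val_score
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)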

Model performance will be evaluated using mean absolute error (MAE). Note, MAE is made negative in the scikit-learn library so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.

Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.

In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.

We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.
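For example (absolute here is NumPy's function):

# force the scores to be positive
scores = absolute(scores)
# summarize performance with mean and standard deviation
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))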

Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.
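A self-contained version might look like the following; the dataset URL and the fixed random_state are assumptions made here for reproducibility:

# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define the model
model = XGBRegressor()
# define the model evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force the scores to be positive
scores = absolute(scores)
# summarize performance
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))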

Running the example evaluates the XGBoost Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a MAE of about 2.1.

This is a good score: it is better than the naive baseline of about 6.6, meaning the model has skill, and it is close to the best score of about 1.9.

We may decide to use the XGBoost Regression model as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

For example:
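Roughly, assuming X and y have already been prepared as above and that asarray is imported from NumPy:

# fit the model on all available data
model.fit(X, y)
# define a single new row of data (here, the first row of the dataset)
row = [0.00632, 18.00, 2.310, 0, 0.538, 6.575, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98]
# make a prediction
yhat = model.predict(asarray([row]))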

We can demonstrate this with a complete example, listed below.
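A sketch of that complete listing, under the same assumptions as the evaluation example:

# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define the model
model = XGBRegressor()
# fit the model on all available data
model.fit(X, y)
# define a single new row of data
row = [0.00632, 18.00, 2.310, 0, 0.538, 6.575, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98]
# make a prediction
yhat = model.predict(asarray([row]))
# summarize the prediction
print('Predicted: %.3f' % yhat[0])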

Running the example fits the model and makes a prediction for the new row of data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a value of about 24.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

APIs

Summary

In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.

Specifically, you learned:

  • XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
  • How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
  • How to fit a final model and use it to make a prediction on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


35 Responses to XGBoost for Regression

  1. Anthony The Koala March 12, 2021 at 5:45 am #

    Dear Dr Jason,
    xgboost’s current version

    Upgrading or installing

    Thank you,
    Anthony of Sydney

    • Jason Brownlee March 12, 2021 at 6:07 am #

      Nice work!

    • Nicholas Roth January 28, 2023 at 8:26 am #

      This:
      # split data into input and output columns
      X, y = data[:, :-1], data[:, -1]

      Should be this:
      # split data into input and output columns
      X, y = data.iloc[:, :-1], data.iloc[:, -1]

  2. Anthony The Koala March 12, 2021 at 5:20 pm #

    Dear Dr Jason,
    Can XGBoost be used in conjunction with SVM and random forest classification?
    Thank you,
    Anthony of Sydney

    • Jason Brownlee March 13, 2021 at 5:25 am #

      I don’t see why not.

      • Anthony The Koala March 13, 2021 at 5:03 pm #

        Dear Dr Jason,
        There are two ways of implementing random forest ensembles: using XGBoost’s XGBRFClassifier and using sklearn.ensemble’s RandomForestClassifier, based on the following tutorials at:

        The program:

        Results:

        Comments:
        * sklearn’s RandomForestClassifier produced the highest accuracy, at 0.917, compared to XGBoost’s XGBRFClassifier, where at most the accuracy was 0.896.
        – To maximise the accuracy of XGBRFClassifier, I had to adjust the parameters colsample and subsample.
        – subsample was optimal at 0.9; adjusting subsample below 0.9 reduced accuracy.
        – adjusting colsample between 0.25 and 0.29 increased accuracy from 0.894 to 0.896

        Conclusion: when implementing a random forest classifier, sklearn’s version was more accurate than XGBoost’s version.

        Other remark – which I cannot explain:
        * When implementing XGboost’s random forest classifier model when fitting the model.fit(X,y), in order to predict the yhat, program ‘spewed’. Please see my comment at https://machinelearningmastery.com/random-forest-ensemble-in-python/ as at 13-03-2021 at 1600 (approx).
        The error when I implement model.fit(X,y) for XGBoost’s XGBRFClassifier is:

        Thank you,
        Anthony of Sydney

        • Jason Brownlee March 14, 2021 at 5:24 am #

          Nice experiments!

          Note, RandomForestClassifier does not use xgboost.

          • Anthony The Koala March 14, 2021 at 1:01 pm #

            Dear Dr Jason,

            While my experiments don’t prove XGBoost’s random forest classifier (‘rfc’) is worse than sklearn’s random forest classifier, it happens for a particular set of data and features that sklearn’s random forest classifier (‘rfc’) performed marginally better than XGBoost’s random forest classifier.

            In other words there may well be other conditions that may produce the opposite results of XBoost’s rfc being better than sklearn’s rfc.

            Conclusion: if modelling with rfc, use both XGBoost and sklearn and pick the best performing one.

            Thank you,
            Anthony of Sydney

          • Jason Brownlee March 15, 2021 at 5:52 am #

            Good advice.

          • Anthony The Koala March 18, 2021 at 4:13 pm #

            Dear Dr Jason,
            In your reply “Note, RandomForestClassifier does not use xgboost.”, are there any packages outside xgboost which utilize xgboost’s “…implementation of gradient boosted decision trees designed for speed and performance…” for “…structured or tabular data…”?

            Ref: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/

            For example can I:
            * use sklearn.svm.SVR with xgboost to use xgboost’s gradient boosted decision trees?
            * use
            sklearn.neighbors.KNeighborsRegressor with xgboost to use xgboost’s gradient boosted decision trees?
            *use
            sklearn.tree.DecisionTreeRegressor with xgboost to use xgboost’s gradient boosted decision trees?

            Thank you
            Anthony of Sydney

          • Jason Brownlee March 19, 2021 at 6:16 am #

            No, as far as I know xgboost is specific to decision trees.

          • Anthony The Koala March 19, 2021 at 7:24 am #

            Dear Dr Jason,
            Thank you for your reply.
            Where you said “…xgboost is specific to decision trees…” did you mean the specific decision trees found in the xgboost module?
            Thank you,
            Anthony of Sydney

          • Jason Brownlee March 19, 2021 at 7:51 am #

            No, but sure that fits too.

          • Anthony The Koala March 19, 2021 at 9:50 pm #

            Dear Jason,
            Let me put it more clearly: is there a way to use sklearn’s sklearn.tree.DecisionTreeClassifier with xgboost’s gradient boosting algorithm?
            Thank you,
            Anthony of Sydney

          • Jason Brownlee March 20, 2021 at 5:21 am #

            No, not as far as I know.

          • Anthony The Koala March 20, 2021 at 5:42 am #

            Dear Dr Jason,
            Thank you for your reply and patience,
            Anthony of Sydney

          • Jason Brownlee March 21, 2021 at 6:00 am #

            You’re welcome.

  3. Matthias March 22, 2021 at 8:21 pm #

    Hello Dr. Brownlee,

    For a long time I have been trying to find a suitable model for a regression problem with many inputs. I have now also tested with XGBoost. The results for the training data are very good. The results on the held-out test data are worse. For validation data (real data) that does not differ very much from the training data, the results are pretty bad. I think I see overfitting here. The results for the RandomForestRegressor were similar. If it’s overfitting, do you have a tip to avoid it?
    Many greetings

    Matthias

    • Jason Brownlee March 23, 2021 at 4:56 am #

      Perhaps the test set is too small or not representative? Perhaps you can try repeated k-fold cross-validation to estimate model performance?

  4. Matthias March 24, 2021 at 7:16 pm #

    You are probably right, even if I believe that the validation data differs very little from the training data and there is actually a lot of test data. But there must be some reason. I will repeat cv again.
    Many Thanks!

  5. Tom April 19, 2021 at 7:44 pm #

    Hi Jason and thank you for this and other tutorials.

    In the final code of…
    # evaluate an xgboost regression model on the housing dataset
    I do understand that sklearn is used to EVALUATE => model = XGBRegressor() where XGBRegressor() has default parameter values.

    However in the 2nd final code of…
    # fit a final xgboost model on the housing dataset and make a prediction
    I do not understand how a FINAL XGBOOST MODEL has been arrived at.

    OK so I’m assuming the word ‘final’ maybe should be replaced by ‘default’?

    If I am correct then how is a FINAL model arrived at in the real world?
    Is it about parameter tuning?

    Thanks

    • Jason Brownlee April 20, 2021 at 5:56 am #

      Final here means the model fit on all data and used to make predictions on new data.

      Indeed, you will want to tune the hyperparameters in most cases.

  6. ttbek November 1, 2021 at 2:50 am #

    I don’t think it makes sense to do cross validation on the entire data here with no held out test set. I guess if we’re operating under the assumption of building a final production model per se, but that isn’t the assumption we use when comparing models. The housing data set is particularly sensitive to this because it has outliers and having them in only train or test makes a pretty big difference vs. being able to have them in both your train and “test” as you do CV. Maybe I missed the part of the code where the test is held out or I don’t understand everything done within RepeatedKFold?

    I’m curious about the following: “Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.”

    Are these numbers derived from your own experiments without a held out test set? What sort of model is a “naive” model here? I have not seen a 1.9 achieved on a held out test set elsewhere, so if you have a reference that would be great (I haven’t followed the housing data set competitions much, etc… but am trying to see how a method I am using now stacks up, I guess a pretty average run of the method I’m using has an MAE around 3, while an exceptional run can be as low as 2.3408, there is sampling involved that gives the randomness). So it is possible for it to sometimes do better than less tuned xgboost results with a held out test set, e.g. https://www.kaggle.com/shreayan98c/boston-house-price-prediction/notebook that had an MAE of 2.45 on the test set, but that didn’t use any CV in the training set (i.e. no validation set).

  7. Sofia V. December 9, 2021 at 2:35 am #

    Hello to everyone!! 🙂

    I have a question!

    Can we implement also the XGBoost Ranker with your code?

    Thanks in advance!

    Sofia

    • Adrian Tam December 10, 2021 at 4:16 am #

      Should be possible. Can you try?

  8. Medlien December 30, 2021 at 10:16 pm #

    Hi Jason,

    I have two questions on your statement from above:

    “Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.”

    1. I understood from your post on the Zero Rule Algorithm how to find the MAE of a naive model with a train-test split. How do you do that with cross-validation?

    2. How did you arrive at the MAE of a top-performing model which gives us the upper bound for the expected performance on a dataset?

    • Medlien January 18, 2022 at 9:29 pm #

      Is this a stupid question? I am sorry, just in case.

  9. Alex Fontes May 16, 2022 at 9:50 am #

    Hi Jason, I am trying to use XGBRegressor on a project, but it keeps returning the same value for a given input, even after re-fitting.
    So, as a test, I came to this post and used your code above (Boston Housing dataset), and it is ALSO returning the same value (which is also identical to the value you got).

    X shape: (506, 13)
    y shape: (506,)
    input row: [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296.0, 15.3, 396.9, 4.98]
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078
    Predicted: 24.0193386078

    (ps – on each of the runs above, the model is refitted to (X, y))

    Do you get different predictions on each run with this code?
    I’m using Python 3.10.3 and my libraries are all recent … I was hoping you or anyone else in the community could help point me in a direction to solve this issue?

    Thank You !!!

    • James Carmichael May 17, 2022 at 9:55 am #

      Hi Alex…Have you tried to implement your model in Google Colaboratory?

  10. Alex Fontes May 18, 2022 at 10:16 am #

    Hi James, I appreciate your reply and thank you for pointing me to that resource.
    As an experiment I wrote a simple code on my computer, and then ran it on Google Colab too.

    This is the code (same on my computer and Google Colab):

    from pandas import read_csv
    import xgboost as xgb

    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    ds = read_csv(path, header=None).values

    ds_train = xgb.DMatrix(ds[:500, :-1], label=ds[:500, -1:])
    ds_test = xgb.DMatrix(ds[500:, :-1], label=ds[500:, -1:])

    params = {
        'colsample_bynode': 0.8,
        'learning_rate': 1,
        'max_depth': 5,
        'num_parallel_tree': 100,
        'objective': 'reg:squarederror',
        'subsample': 0.8,
    }
    num_round = 100

    for _ in range(5):
        bst = xgb.train(params, ds_train, num_round)
        preds = bst.predict(ds_test)
        print(preds)

    ***********************************************************
    These are the predictions on my computer:
    [20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
    [20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
    [20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
    [20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
    [20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

    And these are the predictions on Google Colab:
    [20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
    [20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
    [20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
    [20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
    [20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

    So, the results differ when I run the same code on different environments … but in either case it is still generating the same predictions every time I fit the model to the dataset …. I have already tried different combinations of parameters, different wrappers (Sklearn, and XGB as above), different datasets, and the outcome is always the same … equal predictions every time the model is fit and run … is this how XGBoost is supposed to be?

    Again, I truly appreciate your help.

  11. Emerson de Lemmus July 13, 2022 at 1:16 am #

    This particular line:

    # split data into input and output columns
    X, y = data[:, :-1], data[:, -1]

    Causes the following error: pandas.errors.InvalidIndexError: (slice(None, None, None), slice(None, -1, None)). In the example shown, ‘data’ is not defined, however ‘dataframe’ is.

    The following fixed this error so the example worked:

    # split data into input and output columns
    X, y = dataframe.iloc[:, :-1], dataframe.iloc[:, -1]

    • James Carmichael July 13, 2022 at 7:46 am #

      Thank you for the feedback Emerson!

  12. Lee September 22, 2022 at 7:14 pm #

    Is there any reason why you didnt split the dataset into train and test, like you do with other regression projects?

    • James Carmichael September 23, 2022 at 5:55 am #

      Hi Lee…There is no reason and we agree that you should do so as best practice. The tutorial is showing an example of another concept, however your understanding is correct. Keep up the great work!

  13. Atena December 8, 2022 at 8:21 am #

    Dear Dr Jason,
    Can XGBoost be used on a small dataset with 5 features and 40 samples?
