Dealing With Class-Imbalanced Datasets for Classification

Aditya Lahiri
Towards Data Science
4 min read · Dec 15, 2018


Skewed datasets are not uncommon, and they are tough to handle. Standard classification models and techniques often fail miserably when faced with them. Your model might reach 99% accuracy on such data, but if you measure yourself against a more sensible metric such as the ROC AUC score, you will struggle to climb the leaderboard. The reason is that when a dataset is skewed, say with a 10:1 ratio of positives to negatives, a model that simply predicts positive for every sample, without learning anything, already scores about 90% accuracy. So how do we get around this problem? This post highlights a few effective techniques for doing well on such tasks, ranging from sampling your data differently, to setting certain hyperparameters cleverly, to using libraries that provide variants of the usual algorithms which handle the imbalance internally.

1. Sampling

You can do this in two different ways.

a. Undersampling

Say you have 40,000 positive samples and 2,000 negative samples in your dataset; we will use this as our running example from here on. You can randomly pick 2,000 positive samples out of the 40,000, keep all 2,000 negative samples, and train and validate your model only on these 4,000 samples. This lets you use all the usual classification algorithms in the usual way. The method is easy to implement and runs very fast. The downside is that you throw away the remaining 38,000 positive samples, and all that data goes down the drain.
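Here is a minimal sketch of random undersampling on synthetic data standing in for the running example; the make_classification dataset, the "label" column name, and the logistic regression model are all placeholders for your own setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the running example:
# roughly 40,000 positives (label 1) and 2,000 negatives (label 0).
X, y = make_classification(n_samples=42_000, weights=[0.048], random_state=42)
df = pd.DataFrame(X)
df["label"] = y

pos = df[df["label"] == 1]
neg = df[df["label"] == 0]

# Keep only as many positives as there are negatives, then shuffle.
pos_down = pos.sample(n=len(neg), random_state=42)
balanced = pd.concat([pos_down, neg]).sample(frac=1, random_state=42)

X_bal = balanced.drop(columns="label")
y_bal = balanced["label"]
X_train, X_val, y_train, y_val = train_test_split(
    X_bal, y_bal, stratify=y_bal, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```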

To overcome this, you can build an ensemble of models in which each model is trained and validated on a different set of 2,000 positive samples together with all 2,000 negative samples. On the test set you then take a majority vote across these models. This lets you take all of your data into account without introducing an imbalance. You can even use different algorithms for the different subsets to make the ensemble more robust, though this is a bit more computationally expensive.
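A rough sketch of such an ensemble, reusing pos, neg, and df from the undersampling snippet above; X_test here is just a stand-in for your held-out test features, and logistic regression is an arbitrary choice of base model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# pos, neg, and df come from the undersampling sketch above.
# X_test stands in for your held-out test features.
X_test = df.drop(columns="label").sample(n=5_000, random_state=0)

models = []
for seed in range(5):
    # Each model sees a fresh random slice of positives plus all negatives.
    pos_down = pos.sample(n=len(neg), random_state=seed)
    fold = pd.concat([pos_down, neg])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(fold.drop(columns="label"), fold["label"])
    models.append(clf)

# Majority vote across the five models.
votes = np.stack([m.predict(X_test) for m in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```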

b. Oversampling

In this method, you generate more samples of the minority class. You can do this either by fitting a generative model and sampling new points from it, or simply by resampling existing minority samples with replacement. A number of oversampling techniques exist, such as SMOTE and ADASYN; you will have to see which one works best for your use case. Oversampling is itself a computationally expensive procedure. The major advantage is that a single model gets to consider all of your data at once, and you generate new data in the process.
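A minimal sketch of SMOTE with imbalanced-learn, again reusing the synthetic X and y from the first snippet; the class counts printed at the end just show the minority class being grown to match the majority.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Reusing X, y from the first snippet: class 1 is the majority,
# class 0 the minority. SMOTE synthesises new minority samples
# until both classes are the same size.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```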

2. Using the scale_pos_weight parameter.

If you are using an algorithm like XGBoost, there is an easy way out. You can set its scale_pos_weight hyperparameter to tell the algorithm how skewed your classes are, and XGBoost will reweight them for you. The XGBoost documentation recommends setting it to the ratio of negative to positive samples. So, in our running example of 40,000 positive and 2,000 negative samples, we would set scale_pos_weight to 2,000/40,000 = 0.05, which down-weights the abundant positive class relative to the rare negative class. This works very well in practice. One downside is that it restricts you to XGBoost and similar algorithms, since not every algorithm exposes such a hyperparameter.
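A minimal sketch with the XGBoost scikit-learn wrapper, once more on the synthetic X and y; the cross-validated ROC AUC is only there to illustrate the effect, and n_estimators is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Ratio recommended by the XGBoost docs: negatives / positives.
# With ~2,000 negatives and ~40,000 positives this is roughly 0.05.
ratio = np.sum(y == 0) / np.sum(y == 1)

clf = XGBClassifier(scale_pos_weight=ratio, n_estimators=200)
print(cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())
```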

3. Using the imbalanced-learn library.

Yes, you guessed it right. There is already a full-fledged Python library designed specifically for these kinds of problems: imbalanced-learn, part of scikit-learn-contrib. It is easy to get lost in the details of the library, but the one thing that has worked best for me is the BalancedRandomForestClassifier. The usual Random Forest algorithm performs extremely poorly on imbalanced datasets, whereas this Balanced Random Forest Classifier from the imblearn package works wonderfully well: it handles the sampling issues internally, and you get all the power of Random Forest along with the familiar sklearn API. Other features of the library include inbuilt oversamplers, undersamplers, combinations of both(!), and other algorithms tailor-made for handling skewed datasets. There is a lot to explore here.
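A minimal sketch of the BalancedRandomForestClassifier on the same synthetic X and y; the number of trees is an arbitrary choice.

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import cross_val_score

# Each tree is fit on a bootstrap sample that the classifier
# re-balances internally, so no manual resampling is needed.
clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
print(cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())
```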

When I started working on a dataset with a 20:1 imbalance a month ago, I couldn't find a great resource on how to tackle it. So I decided to write up what I learnt while looking for effective ways of dealing with skewed datasets. I hope this was of benefit to you. You can find all of these techniques used in the code base here at my GitHub repo. Please feel free to suggest in the comments any other methods you know of that could help folks find the great equaliser in this battle of unequals! Thank you.


Aditya Lahiri is a Research Intern at the Machine Learning and Optimization Group, American Express AI Labs. Website: arrayslayer.github.io