Dealing with Imbalanced Data

In an ideal world, every classification problem would have balanced classes, that is, each class would have roughly the same number of instances. Unfortunately, many real classification problems are highly imbalanced.

We commonly see imbalanced data in medical diagnosis, fraud detection, claim prediction, and other settings where the event of interest is rare. In medical data, for example, the cases in a population that have a particular condition are far fewer than the cases without it. Training a model on this kind of data can produce very misleading predictions. Suppose we need to predict cases of a rare disease in a given population. Since the data contains very few instances of the disease, the models we train will tend to predict almost everything as negative (no disease). The overall accuracy will still look very high, leading us to believe the models are performing well.
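A quick illustration of this accuracy trap, using made-up counts (990 healthy cases, 10 disease cases are assumptions for the sketch, not real figures):

```python
# Illustration: a model that predicts "no disease" for everyone still
# scores high accuracy on imbalanced data (counts below are made up).
n_negative = 990   # healthy cases
n_positive = 10    # disease cases

# The trivial "always negative" classifier gets every negative right
# and every positive wrong.
correct = n_negative
total = n_negative + n_positive
accuracy = correct / total
print(f"Accuracy: {accuracy:.1%}")   # 99.0%, yet it never detects the disease

# Recall on the minority class exposes the failure.
recall_positive = 0 / n_positive
print(f"Minority-class recall: {recall_positive:.1%}")
```

This is why metrics like per-class recall, precision, or the F1 score are more informative than overall accuracy on imbalanced data.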

Here are three graphs from pregnancy data depicting severely imbalanced classes:

  1. Graph depicting the count of live births
  2. Graph depicting the count of maternal deaths
  3. Graph depicting the count of cases with a complication during delivery

As evident from the graphs, these distributions are extremely skewed, and training on such data will result in almost all cases being predicted as the majority class.

There are a few things we can try to address this problem:

  1. Oversampling
    We increase the number of examples of the minority class. This can be done in a couple of ways:
    1. Random oversampling – Picking the minority class samples at random and duplicating them.
    2. Clustered oversampling – The data may contain several attributes. We can choose a categorical attribute and duplicate minority samples from each of its categories. For example, if the pregnancy data is collected from twenty districts, we identify about a hundred complication cases in each district and duplicate them ten times. This gives us almost a 50:50 class balance.
    3. SMOTE (Synthetic Minority Oversampling Technique)
      This method creates new synthetic examples by interpolating between randomly chosen minority-class samples and their k nearest neighbours. We can create as many synthetic examples as needed.

Each of the methods above can help us balance the classes, but each has drawbacks. The first two simply copy existing data and give the models nothing new to learn from. The third creates new data, which may introduce additional noise. All three techniques can lead to overfitting.
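Random oversampling and SMOTE-style synthesis can be sketched in a few lines. This is a minimal stdlib-only sketch, not a production implementation: the dataset shapes, class counts, and `k=2` neighbour count are illustrative assumptions.

```python
import math
import random

random.seed(0)

# Toy imbalanced dataset: 2-feature points, 20 majority / 4 minority
# samples (counts are illustrative assumptions).
X_maj = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)]
X_min = [(random.gauss(3, 1), random.gauss(3, 1)) for _ in range(4)]

# 1) Random oversampling: duplicate minority rows chosen at random
# until the class sizes match.
X_min_oversampled = [random.choice(X_min) for _ in range(len(X_maj))]

# 2) SMOTE-style synthesis: place each new point on the line segment
# between a minority sample and one of its k nearest minority neighbours.
def smote(X, n_new, k=2):
    synthetic = []
    for _ in range(n_new):
        base = random.choice(X)
        # sort the other minority points by distance to the base point
        others = sorted(
            (p for p in X if p is not base),
            key=lambda p: math.dist(p, base),
        )
        neighbour = random.choice(others[:k])
        gap = random.random()  # random position along the connecting segment
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

X_min_balanced = X_min + smote(X_min, n_new=len(X_maj) - len(X_min))
print(len(X_min_balanced))  # 20 — now matches the majority class
```

In practice a library such as imbalanced-learn provides tested implementations of these samplers; the sketch only shows the mechanics.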

  2. Undersampling
    We decrease or delete examples of the majority class. This too can be done in the following ways:
    1. Random undersampling – Deleting examples from the majority class at random.
    2. Custom undersampling – We select parameters such as categories or clusters in the majority class and delete a fixed count or fraction from each. For example, in the dataset of all women whose pregnancy data is available, delete 20% of the no-complication cases from every age group: below 18, 18-23, 24-29, and so on.

The problem with this technique is that it leaves us with less data, which may be insufficient to train our models.

  3. Hybrid sampling
    A combination of oversampling and undersampling, also known as hybrid sampling, can also be used to overcome the imbalance.
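Hybrid sampling meets in the middle: the majority class is undersampled and the minority class is oversampled toward a common target size. The class counts and the target of 400 below are illustrative assumptions, not a recommendation:

```python
import random

random.seed(2)

# Stand-ins for majority- and minority-class records (made-up counts).
majority = list(range(900))
minority = list(range(100))

# Meet at a common size instead of forcing either extreme: undersampling
# 900 -> 400 discards less data than 900 -> 100, and oversampling
# 100 -> 400 duplicates less than 100 -> 900.
target = 400

undersampled_majority = random.sample(majority, k=target)
oversampled_minority = [random.choice(minority) for _ in range(target)]

balanced = undersampled_majority + oversampled_minority
print(len(balanced))  # 800 records, split 50:50 between the classes
```

Compared with either technique alone, this limits both the data loss from undersampling and the duplication (and overfitting risk) from oversampling.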

Depending on the size, complexity, and quality of the available dataset, the right technique or combination of techniques can be used to create a more balanced dataset, which will help us build better prediction models.
