1. Class Imbalance

14 May 2026

Introduction

Consider a standard classification problem in a banking context: predicting whether a customer will default on a loan. We are given historical data where past customers are labelled by whether or not they defaulted, and this becomes the basis for model development.

In practice, one of the most overlooked aspects of this setup is not the features or the model choice, but the distribution of the target variable itself. If the ratio of loan-default:loan-repayment per month is approximately 50:50, developing a loan-default model is relatively straightforward. But what happens when our target distribution of loan-default:loan-repayment is closer to 5:95, or even 1:99? These kinds of target distributions are notoriously difficult to model, and are known as class imbalance problems -- situations where the events we care most about occur least often. Class imbalance is often presented as a machine learning problem. In practice, it is usually a business risk problem disguised as a modelling problem.

Why are imbalanced datasets difficult to model?

It’s quite surprising to realise that class imbalance poses a significant problem for modelling frameworks. Most modelling frameworks (for example, decision trees, logistic regression, gradient boosting) are trained to minimise some overall form of error or log-loss across all samples within the data, and each data point (most commonly a row in a dataset) contributes equally to that total. Because these models learn from data, the majority class dominates the error signal, and therefore disproportionately influences parameter estimation. The minority class has limited influence on the fitted decision boundary, particularly when it is both very rare (that is, the target distribution is very imbalanced) and highly variable (that is, the features vary widely across the rows that belong to the minority class). Together, these two factors reduce the ability of the model to establish an accurate decision boundary.
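One way to get a feel for how little pressure an overall error objective puts on the minority class is to score a "model" that ignores the features entirely. The sketch below is illustrative only: the function name `baseline_scores` is made up for this example, and the three minority shares mirror the 50:50, 5:95, and 1:99 ratios mentioned in the introduction.

```python
import math


def baseline_scores(minority_rate):
    """Scores for a model that ignores every feature.

    minority_rate: fraction of rows in the minority (positive) class.
    Returns a tuple of
      - accuracy when always predicting the majority class, and
      - average log-loss when always predicting the base rate.
    """
    p = minority_rate
    majority_accuracy = max(p, 1 - p)
    # Average log-loss if every row gets probability p of being positive.
    log_loss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return majority_accuracy, log_loss


for rate in (0.50, 0.05, 0.01):
    acc, ll = baseline_scores(rate)
    print(f"minority share {rate:.0%}: accuracy {acc:.0%}, log-loss {ll:.3f}")
```

As the minority share shrinks, this do-nothing baseline already scores well on both accuracy and log-loss, so there is little overall error left to push the model towards learning the rare class.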

The decision boundary (vertical black line) is generally easy to determine within the total search space (combined red and blue rectangles in which all points sit) when classes are evenly balanced (Class A: red triangles, Class B: blue circles), and the classification rule is quite simple.

Even for more complex classification rules, most algorithms can determine a sufficient decision boundary, because there are enough data points of either class to provide a strong learning signal.

When one class occurs less frequently (red triangles) than another class (blue circles), its contribution to learning the decision boundary is proportionally reduced. In some cases, this isn't a problem, if the classification rule is very simple.

The decision boundary significantly deteriorates given class imbalance and a difficult classification rule. Simply put, there isn't enough repetition from the minority class (red triangles) to contribute meaningfully to the learning of the decision boundary. High variance in the minority class (as observed here along Feature 1) exacerbates this problem.

What are the consequences of class imbalance?

As Data Scientists, ML Engineers, and Product Managers, we are used to relying on the usual suspects for classification models: Accuracy, Precision, Recall, and F1. We know enough to not rely on any single one of these metrics alone (especially not Accuracy – particularly when paired with class imbalance). However, model metrics alone may not be sufficient to address class imbalance. Consider the following example.

Let’s return to the example outlined in the introduction. You're a Data Scientist at HTB (Hostile Takeover Bank), and Chick Hicks has asked you to develop a model to predict loan-defaults. The target split of loan-default:loan-repay is 10:90. We’ve developed a model that we think is “pretty good”: of the 900 loan-repays in the Test Dataset, it correctly classifies all 900 of them. Of the 100 loan-defaults our model can only predict 20 of them correctly. That doesn’t sound too bad though, and Class Imbalance is a difficult thing to model, you say – and, after all, the accuracy of the model is 92%!

Ah, here is where we see that some common model metrics can be misleading, which can be a huge problem for non-technical stakeholders. This model may sound good on paper, and project managers would speed this into production without hesitation. But let’s investigate more closely what this model is doing and what it means to the business; let’s look more closely at the metrics that matter.
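The counts from the Test Dataset can be turned into the four usual metrics directly. This is a minimal sketch that treats loan-default as the positive class; the function name `classification_metrics` is illustrative, and the counts come from the example above (900 loan-repays all classified correctly, 20 of 100 loan-defaults caught).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Positive class = loan-default.
# 20 defaults caught (TP), 80 missed (FN), 900 repayments correct (TN),
# and no repayer flagged as a defaulter (FP = 0).
model_a = classification_metrics(tp=20, fp=0, fn=80, tn=900)

for name, value in model_a.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.92
# precision: 1.00
# recall: 0.20
# f1: 0.33
```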

Our model identifies only 20% of loan-defaulting customers (Recall = 20%), meaning that 8 out of 10 defaulting customers are missed every month. If the average cost to the bank of a loan-default is R500, then our Test Dataset sample (1000 rows, 10:90 loan-default to loan-repay class ratio) inherently contains R50 000 bad debt (100 default cases × R500 per default = R50 000). Since our model misses 80% of loan-defaulters, it misses 80 × R500 = R40 000 in loan values -- each month. Whilst recovering R10 000 in loan amounts each month is better than losing the full R50 000 (were we to do nothing), it’s still not a very useful model. Despite reporting 92% accuracy and perfect precision, the model still fails to identify 8 out of 10 loan defaulters -- the very outcome the business cares most about. Let’s further say we developed a second model, and its metrics were as follows:

We initially discarded this model, because its F1 was (marginally) lower than that of the first model (Model A). But Chick Hicks has asked us to reconsider, so we now carefully evaluate what the metrics mean in practice for this second model (Model B). Its recall is 80%, meaning that the model correctly identifies 80% of all loan-defaulters. In practical terms, of the R50 000 bad debt inherent in our dataset, we capture R40 000. That is much better than the first model (4 times better!). But this model also has terrible precision: of all the customers it flags as loan-defaulters, only 21% actually default, meaning that roughly 4 out of 5 flagged customers are in fact loan-repayers who have been misclassified.

This is where we start to think of models in terms of consequences and trade-offs, rather than in terms of metrics. It turns out that at HTB, the cost of contacting a predicted loan-defaulter is a simple (cheap) phone call, and almost all loan-repayers who are predicted to be loan-defaulters don’t mind the phone call and can set matters right in a few seconds. So in this case, the trade-off of a higher recall but lower precision works for our specific business objective.
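The trade-off can be made concrete by pricing each kind of error. The sketch below rests on several assumptions that are not in the example itself: the R10 cost per phone call is hypothetical (the scenario only says the call is cheap), a call is assumed to fully prevent the loss on a correctly flagged defaulter, and the Model B counts are derived from its stated recall (80%) and precision (21%) rather than taken from a reported confusion matrix.

```python
COST_PER_MISSED_DEFAULT = 500  # rand, from the example above
COST_PER_CALL = 10             # rand, HYPOTHETICAL: calls are only described as cheap


def total_cost(tp, fp, fn,
               cost_missed=COST_PER_MISSED_DEFAULT,
               cost_call=COST_PER_CALL):
    """Cost of missed defaults plus one phone call per flagged customer."""
    return fn * cost_missed + (tp + fp) * cost_call


model_a = dict(tp=20, fp=0, fn=80)
# Derived counts: recall 80% gives 80 caught defaults out of 100;
# precision 21% implies about 80 / 0.21 = 381 flagged customers, so 301 false alarms.
model_b = dict(tp=80, fp=301, fn=20)

cost_a = total_cost(**model_a)
cost_b = total_cost(**model_b)
print(f"Model A: R{cost_a}")  # Model A: R40200
print(f"Model B: R{cost_b}")  # Model B: R13810

# Call cost at which both models cost the same (extra defaults caught
# by Model B, priced at R500 each, spread over its extra calls).
extra_caught = model_b["tp"] - model_a["tp"]
extra_calls = (model_b["tp"] + model_b["fp"]) - (model_a["tp"] + model_a["fp"])
break_even_call_cost = extra_caught * COST_PER_MISSED_DEFAULT / extra_calls
print(f"Break-even call cost: R{break_even_call_cost:.2f}")
```

Under these assumptions, Model B stays the cheaper option until a call costs roughly R83; if follow-up were far more expensive than a phone call, the conclusion could flip.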

Class Imbalance as a business risk problem

Class imbalance problems are rarely modelling problems in isolation. More often, they are business risk problems disguised as modelling problems. That is, the problem is almost never that one cannot predict the minority class with a high enough accuracy; it’s what the minority class represents to the business and the risk it poses. In the case above, the cost (risk) of incorrectly predicting loan-repayers as loan-defaulters is minor in comparison to the cost of incorrectly predicting loan-defaulters as loan-repayers. Models trained on class-imbalanced datasets almost always fail asymmetrically: one class accumulates a disproportionate number of False Negatives or False Positives.

Our job as Data Practitioners is not simply to minimise these errors, but to understand which of them the business can afford. As is often the case, simpler is better, and if we can get away with a simpler model that adequately solves the business objective, then that’s what we should aim for. Explicitly accounting for class imbalance is typically warranted in scenarios where a misclassification of the target class has severe consequences; this could be in fields like self-driving cars (misclassifying a pedestrian in video feeds), healthcare (misclassifying objects in a medical image), or engineering (misclassifying a machine failure).

How do we account for class imbalance?

If we have assessed our current situation and have indeed decided that we need to address class imbalance, we typically have two options:

1) we either deal with the data itself and use sophisticated sampling methods which aim to either increase the frequency of the minority class, or decrease that of the majority class, or,

2) we utilise a series of complex modelling algorithms that are developed specifically for class imbalance problems.
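To give a flavour of the two routes, here is a minimal standard-library sketch. The first function is plain random oversampling, one of the simplest data-level methods. The second computes per-class weights with the common "balanced" heuristic, n_samples / (n_classes × class_count); class weighting is only a simple stand-in for the algorithm-level route, not one of the specialised algorithms referred to above. Both function names and the toy data are illustrative.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Toy data with a 10:90 split (1 = loan-default, 0 = loan-repay).
labels = [1] * 10 + [0] * 90
rows = [[i] for i in range(100)]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))  # 90 rows of each class

weights = balanced_class_weights(labels)
print(weights[1], round(weights[0], 3))  # 5.0 0.556
```

Resampling of this kind would normally be applied to the training split only, so that the test set keeps the class ratio the model will face in production.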

As with everything in life, each approach comes with its own pros, cons, and trade-offs. Class imbalance is a notoriously difficult problem to model, and the thinking must shift from “How do I predict the minority class as accurately as possible, whilst keeping my majority class predictions accurate?” to “What level of missed minority events can the business afford?”. Our job as Data Scientists, ML Engineers, and Product Managers is to determine which set of trade-offs is acceptable for a given problem. This is an incredibly interesting space, but as you can imagine, there is a great deal of material to work through. As such, I’ll be focussing on solutions to class imbalance over the next two posts. Thank you for reading!

If you've found this useful or interesting, please consider subscribing, or share to interested colleagues. I publish long-form, in-depth articles on ML, AI, and Data Science (just like this one) every two weeks or so, as well as practical business-oriented coding problems, like this one. If you'd like to get in touch, drop me a line.
