Finding the 1%

When one class is rare — fraud, disease, defects, churn — a classifier can score 99% accuracy by ignoring it entirely. Here’s why trees and forests fall into that trap on imbalanced data, and how undersampling, weighting, and bagging climb back out.

Every figure is computed live by a real weighted CART (Classification and Regression Tree) implementation running in your browser on a fixed seed. Drag the controls.

Part 2 of a series. This builds directly on How Decision Trees Carve Up Space, which covers Gini-impurity splits, leaf probabilities, and forest averaging. If those words are unfamiliar, start there.

In the last piece we watched a decision tree slice a balanced grid of red and green dots into rectangles, and saw a random forest average many wobbly trees into a smooth boundary. We were quietly lucky about one thing: the two colors appeared in roughly equal numbers.

Real problems are rarely so polite. Fraudulent transactions, malignant scans, manufacturing defects, users about to churn — the interesting class is often the rare one. A 1% positive rate is common; 0.1% is not unusual. Let’s see what that does to everything we just learned.

1. The accuracy paradox

Below is a forest trained on data where green points (the “positive” class — say, fraud) live in a compact region, and red points (the overwhelming majority) are everywhere else. Drag the imbalance slider from a balanced 50/50 split down toward 1% green and watch two things at once: the boundary, and the scoreboard.

[Interactive figure: Where the minority lives. A plain random forest at threshold 0.5, with shading showing predicted P(green). Slider: percent green (positive), shown here at 2%. Legend: positive (minority), negative (majority), predicted green, predicted red.]

As the green class gets rarer, accuracy goes up even as the model gets worse at the only job that matters: finding green. By 1–2% the green region has all but vanished — the model has learned the winning strategy of “always guess red.” This is the accuracy paradox, and it’s why accuracy is the wrong yardstick here.

Watch the Recall chip, which shows the fraction of actual green points the model catches. It collapses toward zero while Accuracy sails past 98%. On a 1% dataset, a fraud detector that flags nothing is 99% accurate and 100% useless.

The honest scorecard for imbalanced problems looks past accuracy to:

Recall (a.k.a. sensitivity, TPR): of the real positives, how many did we catch?
Precision: of the points we called positive, how many really were?
G-mean = √(recall × specificity), where specificity (a.k.a. TNR) is the fraction of real negatives correctly left unflagged: a single number that punishes you for sacrificing either class. Unlike accuracy, you can’t game it by ignoring the minority.
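To make the scorecard concrete, here is a minimal Python sketch that computes these metrics from raw confusion counts. The function name `scorecard` and both example confusion matrices are illustrative assumptions, not taken from the demo's source. Precision is reported as 0 when nothing is flagged, which is a convention rather than a definition.

```python
import math

def scorecard(tp, fp, tn, fn):
    """Accuracy, recall, precision, specificity and G-mean from confusion counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision is undefined when nothing is flagged; report 0.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "g_mean": math.sqrt(recall * specificity),
    }

# Hypothetical test set: 1,000 points, 10 of them green (1%).
always_red = scorecard(tp=0, fp=0, tn=990, fn=10)   # flags nothing
modest = scorecard(tp=8, fp=40, tn=950, fn=2)       # catches 8 of 10 greens

for name, scores in (("always red", always_red), ("modest model", modest)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

The "always red" model wins on accuracy yet scores a G-mean of zero, while the second model gives up a few points of accuracy and earns a G-mean near 0.88.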

2. Why trees give up on the rare class

Two forces conspire here, and it’s worth separating them.

The splitting criterion barely notices. A tree splits to reduce Gini impurity. When 99% of points are red, a region is already almost pure red — its impurity is tiny before any split. Carving out a handful of green points buys a microscopic impurity drop, easily beaten by splits that tidy up the majority. The rare class simply isn’t worth the tree’s attention.
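A small worked example shows how little impurity there is to win. The sketch below uses the two-class Gini form 2p(1 − p) and compares the best possible split on a balanced node with the best possible split on a 1% node. The helper names and the node sizes are assumptions chosen for illustration.

```python
def gini(n_green, n_red):
    """Two-class Gini impurity, 2p(1 - p), where p is the green fraction."""
    n = n_green + n_red
    if n == 0:
        return 0.0
    p = n_green / n
    return 2.0 * p * (1.0 - p)

def impurity_drop(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children.

    Each argument is a (n_green, n_red) pair of counts.
    """
    n = sum(parent)
    children = (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)
    return gini(*parent) - children

# A perfect split on a balanced node: 500 green vs 500 red.
balanced = impurity_drop((500, 500), (500, 0), (0, 500))

# A perfect split on a 1% node: all 10 greens isolated from 990 reds.
rare = impurity_drop((10, 990), (10, 0), (0, 990))

print(f"balanced node, perfect split: {balanced:.4f}")
print(f"1% node, perfect split:       {rare:.4f}")
```

Even a flawless split that isolates every green point can reduce impurity by at most 0.0198 on the 1% node, compared with 0.5 on the balanced one.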

The 0.5 threshold is brutal. Even when a leaf does contain green points, they’re usually outnumbered, so the leaf’s P(green) sits well below 0.5 — and a default classifier calls anything under 0.5 “red.” The model’s probabilities are squashed toward zero, so the standard cutoff never trips.

That second force suggests a cheap (partial) fix: just move the threshold. Below, on a fixed 2% dataset, drag the decision threshold down from 0.5 and watch green territory reappear.

[Interactive figure. Controls: decision-threshold slider, shown here at 0.50; “show raw probability” toggle.]

Lowering the threshold trades precision for recall along a single dial. It can rescue a model whose probabilities rank points sensibly but sit too low to clear the default cutoff. It does nothing, however, about the deeper problem that the tree never learned much structure around the rare class in the first place. We want to fix the training, not just the cutoff.
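The precision-for-recall trade can be sketched in a few lines. The labels and probabilities below are made up for illustration (they are not outputs of the demo's forest), and `predict` and `recall_precision` are hypothetical helper names.

```python
def predict(probs, threshold=0.5):
    """Call a point green (1) when its predicted P(green) reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def recall_precision(y_true, y_pred):
    """Recall and precision for the green class; 0.0 when a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Made-up scores: three greens whose probabilities are all squashed below 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.40, 0.30, 0.12, 0.35, 0.10, 0.05, 0.05, 0.02, 0.02, 0.01]

for threshold in (0.5, 0.25, 0.1):
    recall, precision = recall_precision(y_true, predict(probs, threshold))
    print(f"threshold {threshold:.2f}: recall {recall:.2f}, precision {precision:.2f}")
```

In this toy data the default cutoff catches nothing, 0.25 catches two of three greens, and 0.1 catches all three at the price of more false alarms. The ranking of the points never changes; only the cutoff does.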

Threshold-tuning is a useful last-mile adjustment, but it’s working with whatever signal the model already extracted. The more powerful moves change what the model sees during training so the rare class actually earns the tree’s attention. That’s the rest of this post.

3. Rebalancing the training signal

From the first post we already have one idea that helps a little: bagging (averaging many bootstrapped trees) reduces variance, and that variance reduction does smooth the jittery minority region somewhat. But bagging alone doesn’t fix the bias toward the majority: at a 1% positive rate, every bootstrap sample is still about 99% red. We need to change the class balance the trees train on. Four standard ways to do it:

Method	What it does	Cost / catch
Class weighting	Multiply the rare class’s contribution to the impurity calculation by `n_neg / n_pos`, so isolating a green point finally moves the needle.	Free; no data added or dropped. The default first thing to try.
ROS bagging (random over-sampling)	Duplicate minority points until the bag is balanced, then train. Duplicating a point k times is mathematically the same as giving it weight k.	No new information — the same few greens, repeated. Trees can overfit their exact locations.
RUS bagging (random under-sampling)	For each tree, keep all the greens and a random equal-size sample of reds. Each tree sees a different slice of the majority; averaging them is the magic.	Throws away majority data per tree — but recovers it across the ensemble. Cheap and very effective.
SMOTE	Invent synthetic minority points by interpolating between real ones and their nearest neighbors, filling out the rare region.	Synthetic points can land in the wrong place. Evidence is mixed (see §5).
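The first two rows of the table can be checked numerically. This is a minimal sketch, assuming a two-class Gini computed on weighted counts; `weighted_gini` is a hypothetical helper rather than the demo's actual function, and the 10-vs-990 node is an invented example.

```python
def weighted_gini(labels, weights):
    """Two-class Gini impurity computed on weighted counts instead of raw counts."""
    total = sum(weights)
    if total == 0:
        return 0.0
    green_weight = sum(w for y, w in zip(labels, weights) if y == 1)
    p = green_weight / total
    return 2.0 * p * (1.0 - p)

# Hypothetical node: 10 greens (1) and 990 reds (0).
labels = [1] * 10 + [0] * 990
k = 990 // 10  # n_neg / n_pos = 99

# Unweighted: the node already looks almost pure.
plain = weighted_gini(labels, [1.0] * len(labels))

# Class weighting: each green counts k times in the impurity.
weighted = weighted_gini(labels, [float(k) if y == 1 else 1.0 for y in labels])

# Over-sampling: physically duplicate each green k times, with unit weights.
duplicated_labels = [1] * (10 * k) + [0] * 990
duplicated = weighted_gini(duplicated_labels, [1.0] * len(duplicated_labels))

print(f"plain {plain:.4f}, weighted {weighted:.4f}, duplicated {duplicated:.4f}")
```

With weight `n_neg / n_pos` the node's impurity jumps from 0.0198 to 0.5, so isolating the greens becomes as rewarding as any split on balanced data, and duplicating the greens gives the same number.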

Here are all five side by side — the plain baseline plus the four fixes — on the same imbalanced dataset. Compare the G-mean under each, and notice how the shape of the recovered green region differs.

[Interactive figure. Controls: dataset selector; percent green, shown here at 2%.]

Same data, five training strategies. The plain forest barely registers green; weighting and ROS recover a tight, high-precision region; RUS bagging paints the most generous green region (high recall, lower precision); SMOTE fills in smoothly when the minority is one blob, and more questionably when it isn’t.

A pattern worth internalizing: RUS bagging tends to maximize recall (it errs toward calling things green), while weighting and ROS stay more precise (tighter regions, fewer false alarms). Which you want depends on the asymmetry of your costs — missing fraud usually hurts more than a false alarm, which is exactly why the recall-friendly methods are popular in practice.

4. Why RUS bagging works: diversity from the majority

RUS bagging deserves a closer look because it’s the same trick as the random forest from post 1, aimed at a new target. A single under-sampled tree is trained on a tiny, balanced dataset — all the greens, plus a random equal-size handful of reds. On its own it’s noisy and over-eager. But each tree draws a different handful of reds, so the trees disagree in different places — and averaging many of them cancels the noise, just like before.
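The bag construction and the averaging step described above can be sketched as follows. The tree learner itself is left out: `fit` is a placeholder for any routine that takes a bag and returns a function mapping a point to P(green), and all names here are assumptions rather than the demo's API.

```python
import random

def rus_bag(y, rng):
    """Indices for one under-sampled bag: every green plus an equal-size random draw of reds."""
    greens = [i for i, label in enumerate(y) if label == 1]
    reds = [i for i, label in enumerate(y) if label == 0]
    return greens + rng.sample(reds, min(len(greens), len(reds)))

def rus_bagging(X, y, fit, n_trees=100, seed=0):
    """Fit one model per balanced bag.

    `fit(X_bag, y_bag)` is a placeholder for a tree learner that returns
    a function x -> P(green); it is not defined in this sketch.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_trees):
        idx = rus_bag(y, rng)
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_proba(models, x):
    """Average the per-tree probabilities, exactly as in an ordinary forest."""
    return sum(model(x) for model in models) / len(models)

# Hypothetical labels: 10 greens among 1,000 points.
y = [1] * 10 + [0] * 990
rng = random.Random(0)
bag_a = rus_bag(y, rng)
bag_b = rus_bag(y, rng)
print(len(bag_a), "points per bag;", len(set(bag_a[10:]) & set(bag_b[10:])), "reds shared")
```

Each bag holds only 20 of the 1,000 points, and two bags rarely share many reds, which is where the diversity between trees comes from.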

The bonus: because each tree trains on a tiny balanced set, the trees are both cheap and decorrelated. Drag the slider to add under-sampled trees and watch a jagged, trigger-happy single tree settle into a smooth, sensible minority region.

[Interactive figure. Controls: dataset selector; number of under-sampled trees, shown here at 1; “show one bag’s sample” checkbox.]

One under-sampled tree is high-variance and over-predicts green. A hundred of them, each seeing a different sample of the majority, average into a smooth boundary with strong recall — at a fraction of the training cost of a forest on the full data. Tick the box to see how sparse a single balanced bag really is.

5. A word on SMOTE

SMOTE (Synthetic Minority Over-sampling TEchnique) is the most cited rebalancing method, and the most contentious. Instead of duplicating minority points, it synthesizes new ones: pick a real green point, pick one of its nearest green neighbors, and drop a brand-new green point somewhere on the line between them. Repeat until balanced. The intuition is appealing — you’re filling in the rare region rather than just reweighting a few examples.
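The procedure just described can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: the function name `smote`, the default `k=5`, and both toy point sets are invented here, and points are assumed to be tuples of floats with at least two minority points available.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on a segment between two real minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment, from base to neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy case 1: a single compact blob. Synthetic points stay inside it.
blob = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8), (0.4, 0.5), (0.6, 0.4)]
filled = smote(blob, n_new=50, k=2)

# Toy case 2: two tiny clusters far apart, with k larger than either cluster,
# so some neighbours necessarily come from the other cluster.
two_clusters = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
bridged = smote(two_clusters, n_new=200, k=5)
in_gap = [p for p in bridged if 1.0 < p[0] < 4.0]

print(len(in_gap), "of", len(bridged), "synthetic points fall between the clusters")
```

Because every synthetic point is a blend of two real greens, the result is only as sensible as the assumption that the space between those two greens is also green.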

It works beautifully when the minority class is a single, convex, well-separated blob. It gets dangerous otherwise. Toggle the synthetic points below on the two-cluster minority and watch SMOTE string fake green points across the empty red gap between the clusters — inventing “evidence” for green exactly where there is none.

[Interactive figure. Controls: minority shape selector; “show synthetic points” toggle. Legend: real positive, synthetic positive, negative.]

Hollow purple points are SMOTE’s inventions. On a single island they sensibly fill the region. On two clusters they bridge a gap that should stay red; with class overlap they amplify the noise. This is why the empirical record on SMOTE is mixed.

The literature reflects this. SMOTE often helps on tabular problems with a coherent minority manifold, but multiple careful studies have found it offers little over plain random over-sampling or class weighting once you tune the threshold properly — and it can actively hurt with high-dimensional data, heavy class overlap, or label noise, because interpolation assumes the space between minority points is also minority. It’s a tool worth knowing, not a default worth reaching for. When in doubt, the boring options — class weights and RUS bagging — are strong, cheap, and predictable baselines.

Takeaways

Accuracy lies on imbalanced data. A model that ignores a 1% class scores 99%. Judge with recall, precision, and G-mean instead.
Trees ignore the rare class for two reasons: isolating it barely reduces impurity, and its leaf probabilities sit below the 0.5 threshold.
Threshold-tuning is a band-aid — useful, but it only re-reads signal the model already had.
Rebalancing the training signal is the real fix. Class weighting ≡ over-sampling (a duplicated point is just a higher weight). RUS bagging is the standout: cheap, diverse, recall-friendly. SMOTE can help on clean, blob-like minorities but the evidence is mixed — don’t reach for it first.
It’s the same machinery as before. Bagging’s “average many decorrelated trees” idea, pointed at a balanced bootstrap, is what makes the rare class learnable.

Further reading. For a thorough, rigorous treatment of everything sketched here — resampling, cost-sensitive learning, ensemble methods like RUS/ROS bagging, and the nuanced evidence on SMOTE — see Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, and Francisco Herrera, Learning from Imbalanced Data Sets (Springer, 2018). It’s the standard reference on the topic.

Everything here is a compact weighted-CART implementation: Gini-impurity splits on weighted counts, bootstrap / under-sample / over-sample / SMOTE resampling, and probability averaging — all on a fixed seed so the figures are reproducible. Metrics are computed on a held-out sample drawn from the same distribution. View source for the details.

Next in this series: Part 3: Trees That Fix Their Own Mistakes — gradient boosting, and why tree ensembles can’t extrapolate.

Written by Max Buckley.