Trees That Fix Their Own Mistakes
A random forest grows many trees in parallel and averages them. Gradient boosting does the opposite: it grows trees one at a time, each one trained to correct the errors of all the trees before it. Here’s how that works — and where it spectacularly fails.
In the first two posts, every ensemble we built was a bag: train many trees independently on resampled data, then average their votes. Averaging cuts variance, and that’s the whole game.
Boosting is a different philosophy entirely. Instead of many independent experts voting at once, you build a relay team: each new tree looks at what the current ensemble still gets wrong and focuses entirely on fixing that. The trees are deliberately weak and deliberately dependent. Stack enough of them and the combined model can capture remarkably intricate structure.
1. First, a regression tree predicts staircases
We’re switching from classifying colors to predicting a number, so our tree changes job slightly. A regression tree still splits the input into rectangular regions, but each leaf now stores the average target value of the training points that land in it. Its prediction is therefore a piecewise-constant staircase: flat within each region, jumping at the splits.
Below, a single regression tree fits a wiggly 1-D function. Raise the depth and the staircase gains more steps, hugging the curve ever more tightly. Hold onto one fact — every leaf outputs a flat constant — because it will explain the dramatic failure at the end of this post.
2. Boosting: add trees that fix the residuals
One deep tree overfits; one shallow tree underfits. Gradient boosting takes a different route: start with a trivial model — just the mean of the target — and then repeatedly add shallow trees, each fit not to the data but to the residuals: the gap between the truth and what the ensemble currently predicts.
For squared-error loss those residuals are the negative gradient of the loss, which is where “gradient” boosting gets its name: each tree takes a step downhill on the error surface. We don’t take the full step, though — we scale each tree’s contribution by a small learning rate (shrinkage), so the model creeps toward the data instead of lurching. Many small, careful steps generalize better than a few greedy ones.
Drag rounds to add trees one at a time. The top panel shows the ensemble fit; the bottom shows the residuals it’s still trying to kill. Watch the residuals collapse toward zero as the fit tightens.
| Bagging (random forest) | Boosting (GBT) | |
|---|---|---|
| How trees relate | Independent, built in parallel | Sequential, each fixes the last ensemble’s errors |
| Each tree is | Deep & overfit (low bias, high variance) | Shallow & weak (high bias, low variance) |
| Combining reduces | Variance (averaging) | Bias (additive correction) |
| More trees → | Safer; saturates | Lower training error; can overfit |
3. Stochastic gradient boosting: randomness as a feature
So far each tree has seen all the data and all the features. Jerome Friedman’s key refinement — stochastic gradient boosting — is to feed each tree only a random slice. There are two knobs, and modern libraries (XGBoost, LightGBM, scikit-learn) expose both:
- Row subsampling (
subsample): each round fits its tree on a random fraction of the rows, drawn without replacement. Different rounds see different data, so the trees disagree more. - Column subsampling (
colsample): each tree (or each split) may only choose from a random subset of the features. This is the same decorrelation trick that turns bagged trees into a random forest.
Both inject variety. That sounds bad for an additive model, but it’s the opposite: the randomness decorrelates the trees and acts as a regularizer, usually improving generalization and — as a bonus — making each round cheaper to fit. The cost saving is direct: fitting a tree means scanning rows and, at every candidate split, sorting and sweeping each feature’s values to find the best threshold. Subsample the rows and there are fewer points to scan; subsample the columns and there are fewer features to examine at each split — so each tree does strictly less work.
Row subsampling
Below, the bold blue line is the boosted fit; the faint lines are the same model re-run with different random seeds at your chosen subsample fraction. At 100% they’re identical — boosting is deterministic. Pull the fraction down and the faint lines fan out: that spread is the injected variety. Notice the fit also gets less jagged (it stops chasing every noisy point), and the gap between train and test error shrinks.
Column subsampling
Column sampling only does something when there’s more than one feature, so here we switch to a 2-D target z = f(x, y) and show the first six trees the booster builds. With every feature available, the trees look broadly similar. Restrict each split to a single random feature and the trees specialize — some carve along x, some along y — so they’re far more varied, and the ensemble that sums them is more robust.
It’s worth being explicit about what this sampling does not cost you. Any single tree may never see a particular row, or may be forbidden from splitting on a particular column — but that gap is temporary. The next tree, fitting the residual the previous one left behind, draws a fresh random set of rows and columns, and so does the one after that. Across a reasonable number of rounds the samples overlap enough that every data point and every feature gets used many times over — just not all at once, and not by any one tree. The ensemble as a whole still learns from all of your data; the per-tree blinkers only stop the trees from learning the same thing in lockstep.
4. The thing boosted trees cannot do: extrapolate
Time for the demonstration that ties it all together. Here is a nonlinear “time series” — a trend plus a seasonal wiggle plus noise. We train a gradient-boosted model only on the shaded middle window and then ask it to predict the entire timeline, including the unseen stretches on each side.
Drag the rounds up. Inside the window, the fit gets better and better until, with enough trees, it threads through essentially every training point — perfect interpolation (which, past a point, just means memorizing the noise). Now look at the edges.
Why? Recall the fact from §1: every leaf outputs a constant. Any input beyond the largest training time falls into the same right-most leaf of every tree, so the ensemble returns the same number for all future times. The model has no notion that the trend was rising or the season was about to peak — concepts a straight line or a sine term would capture for free. This isn’t a tuning problem; it’s structural. Tree ensembles are superb interpolators and hopeless extrapolators.
The practical upshots:
- Don’t forecast raw trends with tree models. Detrend / difference the series first (predict the change, or the residual from a linear fit), or use a model class that can extrapolate, then let the trees mop up the nonlinear remainder.
- Watch for silent extrapolation in tabular data too. Any test point outside the training range of a feature gets a clamped, flat prediction — easy to miss when you can’t plot it.
- “Perfect” interpolation is a warning, not a trophy. Push the rounds high and the in-window fit memorizes noise. Early stopping on a validation set is how you find the round that generalizes instead of memorizes.
Takeaways
- Boosting is sequential error-correction. Start at the mean, then add shallow trees each fit to the current residuals — the negative gradient of the loss — scaled by a small learning rate.
- Bagging cuts variance; boosting cuts bias. Forests average strong independent trees; boosting accumulates weak dependent ones.
- Stochastic sampling helps. Row subsampling and column subsampling inject variety that decorrelates the trees, regularizes the model, and speeds up training.
- Tree ensembles interpolate but never extrapolate. Constant-valued leaves mean predictions flatline outside the training range — a structural limit no amount of trees or tuning can fix.
Everything here is a compact implementation: squared-error regression trees (variance-reduction splits), stage-wise gradient boosting with shrinkage, and stochastic row/column subsampling — all on a fixed seed so the figures are reproducible. View source for the details.
The series: Part 1 · Decision trees · Part 2 · Imbalanced learning · Part 3 · Gradient boosting · All labs