← Max Buckley Labs Blog About Contact

Trees That Fix Their Own Mistakes

A random forest grows many trees in parallel and averages them. Gradient boosting does the opposite: it grows trees one at a time, each one trained to correct the errors of all the trees before it. Here’s how that works — and where it spectacularly fails.

Every figure is computed live by a real regression-tree + gradient-boosting implementation running in your browser on a fixed seed. Drag the controls.

Part 3 of a series. It assumes the tree and ensemble vocabulary from Part 1: How Decision Trees Carve Up Space and Part 2: Finding the 1%. Those were primarily about classification (predicting a color), apart from a brief regression aside at the end of Part 1; this one is about regression — predicting a number — throughout.

In the first two posts, every ensemble we built was a bag: train many trees independently on resampled data, then average their votes. Averaging cuts variance, and that’s the whole game.

Boosting is a different philosophy entirely. Instead of many independent experts voting at once, you build a relay team: each new tree looks at what the current ensemble still gets wrong and focuses entirely on fixing that. The trees are deliberately weak and deliberately dependent. Stack enough of them and the combined model can capture remarkably intricate structure.

1. First, a regression tree predicts staircases

We’re switching from classifying colors to predicting a number, so our tree changes job slightly. A regression tree still splits the input into rectangular regions, but each leaf now stores the average target value of the training points that land in it. Its prediction is therefore a piecewise-constant staircase: flat within each region, jumping at the splits.

Below, a single regression tree fits a wiggly 1-D function. Raise the depth and the staircase gains more steps, hugging the curve ever more tightly. Hold onto one fact — every leaf outputs a flat constant — because it will explain the dramatic failure at the end of this post.

true function tree prediction ● training points
A regression tree carves the input into intervals and predicts the mean inside each. Deeper trees mean more, narrower steps. It can approximate any curve — but only as a staircase, and only within the range of the training data.

2. Boosting: add trees that fix the residuals

One deep tree overfits; one shallow tree underfits. Gradient boosting takes a different route: start with a trivial model — just the mean of the target — and then repeatedly add shallow trees, each fit not to the data but to the residuals: the gap between the truth and what the ensemble currently predicts.

For squared-error loss those residuals are the negative gradient of the loss, which is where “gradient” boosting gets its name: each tree takes a step downhill on the error surface. We don’t take the full step, though — we scale each tree’s contribution by a small learning rate (shrinkage), so the model creeps toward the data instead of lurching. Many small, careful steps generalize better than a few greedy ones.

Drag rounds to add trees one at a time. The top panel shows the ensemble fit; the bottom shows the residuals it’s still trying to kill. Watch the residuals collapse toward zero as the fit tightens.

Round 0 is a flat line at the mean. Each round adds a shrunken shallow tree fit to the current residuals (bottom panel), bending the fit toward the data. A small learning rate needs more rounds but lands softer; a large one converges fast and risks overshooting. This stage-wise “fix what’s left” loop is the entire idea of boosting.
Bagging (random forest)Boosting (GBT)
How trees relateIndependent, built in parallelSequential, each fixes the last ensemble’s errors
Each tree isDeep & overfit (low bias, high variance)Shallow & weak (high bias, low variance)
Combining reducesVariance (averaging)Bias (additive correction)
More trees →Safer; saturatesLower training error; can overfit

3. Stochastic gradient boosting: randomness as a feature

So far each tree has seen all the data and all the features. Jerome Friedman’s key refinement — stochastic gradient boosting — is to feed each tree only a random slice. There are two knobs, and modern libraries (XGBoost, LightGBM, scikit-learn) expose both:

Both inject variety. That sounds bad for an additive model, but it’s the opposite: the randomness decorrelates the trees and acts as a regularizer, usually improving generalization and — as a bonus — making each round cheaper to fit. The cost saving is direct: fitting a tree means scanning rows and, at every candidate split, sorting and sweeping each feature’s values to find the best threshold. Subsample the rows and there are fewer points to scan; subsample the columns and there are fewer features to examine at each split — so each tree does strictly less work.

Row subsampling

Below, the bold blue line is the boosted fit; the faint lines are the same model re-run with different random seeds at your chosen subsample fraction. At 100% they’re identical — boosting is deterministic. Pull the fraction down and the faint lines fan out: that spread is the injected variety. Notice the fit also gets less jagged (it stops chasing every noisy point), and the gap between train and test error shrinks.

Lower subsampling = more variety between runs (the fan of faint lines) and a smoother, better-regularized average fit. The test error (on held-out points) typically bottoms out below 100% subsampling — proof that throwing data away per tree can help.

Column subsampling

Column sampling only does something when there’s more than one feature, so here we switch to a 2-D target z = f(x, y) and show the first six trees the booster builds. With every feature available, the trees look broadly similar. Restrict each split to a single random feature and the trees specialize — some carve along x, some along y — so they’re far more varied, and the ensemble that sums them is more robust.

The first six trees of a boosted ensemble on a 2-D surface. Toggle column subsampling: with it on, individual trees become strongly horizontal or vertical and visibly different from one another — that diversity is what decorrelation buys you.

It’s worth being explicit about what this sampling does not cost you. Any single tree may never see a particular row, or may be forbidden from splitting on a particular column — but that gap is temporary. The next tree, fitting the residual the previous one left behind, draws a fresh random set of rows and columns, and so does the one after that. Across a reasonable number of rounds the samples overlap enough that every data point and every feature gets used many times over — just not all at once, and not by any one tree. The ensemble as a whole still learns from all of your data; the per-tree blinkers only stop the trees from learning the same thing in lockstep.

4. The thing boosted trees cannot do: extrapolate

Time for the demonstration that ties it all together. Here is a nonlinear “time series” — a trend plus a seasonal wiggle plus noise. We train a gradient-boosted model only on the shaded middle window and then ask it to predict the entire timeline, including the unseen stretches on each side.

Drag the rounds up. Inside the window, the fit gets better and better until, with enough trees, it threads through essentially every training point — perfect interpolation (which, past a point, just means memorizing the noise). Now look at the edges.

true series GBT prediction ● training points (shaded window)
Inside the training window: with enough rounds the model interpolates the points almost perfectly. Outside it: the prediction is a flat horizontal line, frozen at the value of the boundary leaf, no matter what the true series does. Boosting cannot extrapolate a trend or continue a season — it can only repeat the last constant it learned.

Why? Recall the fact from §1: every leaf outputs a constant. Any input beyond the largest training time falls into the same right-most leaf of every tree, so the ensemble returns the same number for all future times. The model has no notion that the trend was rising or the season was about to peak — concepts a straight line or a sine term would capture for free. This isn’t a tuning problem; it’s structural. Tree ensembles are superb interpolators and hopeless extrapolators.

The practical upshots:

Takeaways

Everything here is a compact implementation: squared-error regression trees (variance-reduction splits), stage-wise gradient boosting with shrinkage, and stochastic row/column subsampling — all on a fixed seed so the figures are reproducible. View source for the details.