How Decision Trees Carve Up Space
A visual, build-it-in-your-browser tour of how a decision tree learns to classify a grid of red and green dots — and why averaging many wobbly trees produces a smooth, confident boundary.
Suppose someone hands you a scatter of points on a square. Some are green, some are red, and your job is to predict the color of any new point based only on its position. A decision tree solves this by repeatedly asking one dead-simple question — “is this coordinate above or below some threshold?” — slicing the plane into rectangles until each rectangle is mostly one color.
That’s the whole idea. The rest of this post makes it concrete, one visualization at a time.
1. One split at a time
A tree is grown greedily. At each step it looks at every possible horizontal and vertical cut and picks the single cut that best separates the colors. “Best” is measured by Gini impurity: a region of all-green or all-red dots has impurity 0 (perfectly pure), while a 50/50 mix has impurity 0.5 (maximally confused). The tree chooses the cut that drops the average impurity the most, then recurses into each half.
Use the depth slider to grow the tree one level at a time. Watch the plane get partitioned into rectangles, and watch the tree diagram on the right grow in lockstep. Each leaf is colored by its majority class.
A few things to notice as you slide:
- Each cut is axis-aligned — purely horizontal or vertical. A tree can only build boundaries out of these steps, which is why the regions are always rectangles. This is the tree’s great strength (cheap, interpretable) and its great weakness (blocky boundaries).
- The outermost regions extend to infinity, so a tree predicts the same fixed value for everything beyond the range of its training data. It can interpolate between the points it has seen, but it cannot extrapolate — push an input past the edge of the data and the prediction simply freezes at the boundary region’s value. This is a structural limit that every tree-derived method inherits; we make it concrete in §5 below.
- The checkerboard needs several levels before it “gets it” — no single cut helps, so the first split looks almost arbitrary, but a couple of levels down the structure snaps into place.
- Grow deep enough and the tree drives training impurity to zero, wrapping a tiny rectangle around every stray point. That’s overfitting, which sets up the rest of the post.
2. Leaves hold probabilities, not just colors
So far each region picked a single color. But a leaf rarely contains only green or only red points — especially in a shallow tree. What a leaf really stores is a probability: the fraction of its training points that are green.
Below, regions are shaded by that probability. Deep green means “almost certainly green,” pale colors near the boundary mean “the leaf is mixed — the tree isn’t sure.” Lower the depth to keep leaves big and mixed (lots of pale, hedged regions); raise it to make leaves pure and confident (saturated colors, but carved into ever-smaller boxes).
The color bar maps a leaf’s predicted probability to a shade. The pale band through the middle is the tree’s decision boundary — where it flips from guessing red to guessing green.
3. Wiggle the data, get a totally different tree
Here is the uncomfortable truth about a single deep tree: it is high-variance. Because every split is chosen greedily, nudging just a few training points can change the very first cut — and a different first cut sends the whole tree down a different path, producing a wildly different boundary.
To see this, we train four trees, each on a bootstrap sample of the data: a random draw of the points with replacement, so each tree sees a slightly different version of the world (some points duplicated, some omitted). Same dataset, same algorithm — four very different boundaries. Hit resample a few times and watch how jumpy they are.
4. Average the wobble away: the random forest
If one tree’s errors are jumpy and partly random, then many independent trees will make their mistakes in different places — and those mistakes can cancel out. A random forest does exactly this: grow many trees, each on its own bootstrap sample, then average their probabilities. (Real forests also pick a random subset of features at each split to make the trees more independent; with only two features here we lean on bootstrap sampling, the “bagging” half of the idea.)
Drag the slider from 1 tree up to 150. Watch the blocky, high-contrast boundary of a single tree melt into a smooth, gently-graded surface. The staircase edges average out; the confident-but-wrong tiny rectangles get outvoted; the boundary settles down right where the two colors actually meet.
As trees accumulate, the prediction surface stops looking like a tiled floor and starts looking like a topographic map: smooth contours of probability.
The deeper lesson: each individual tree in that smooth forest is still a blocky, overfit mess. We didn’t make the trees better — we made them diverse, and then averaged. Variance falls roughly like 1/(number of trees) as long as the trees aren’t too correlated, which is why the curve smooths fast at first (1 → 10 trees is dramatic) and then only polishes (100 → 150 is barely visible).
5. The one thing a tree cannot do: extrapolate
Everything above classified colors, but a tree is just as happy predicting a number — that’s a regression tree. The machinery is identical; only the leaf changes. Instead of storing the local mix of labels, each leaf stores the average target value of the points that fall in it, and the prediction becomes a piecewise-constant staircase: flat within each region, jumping at the splits.
That single fact — every leaf emits one constant — has a consequence that bites hard in practice. Consider a variable that trends over time, like a noisy seasonal time series. We train a regression tree on the shaded middle window only, then ask it to predict the whole timeline. Drag the depth up: inside the window the staircase hugs the data better and better, but look at the edges.
This is structural, not a tuning problem: no depth setting fixes it, because the model class has no concept of a trend continuing past the data. And since a forest is just an average of trees and a boosted model is just a sum of trees, every tree-derived method inherits this blind spot — a deeper dive, with gradient boosting, is the finale of Part 3. The practical takeaway: for trending data, don’t feed raw values to a tree model — difference or detrend first, then let the trees model the leftover wiggles they’re actually good at.
Takeaways
- A tree partitions space with axis-aligned cuts, chosen greedily to make each region as pure as possible. Boundaries are always rectangular.
- Leaves store probabilities — the local mix of training labels — so a tree outputs a confidence, not just a label. Shallow trees are coarse but calibrated; deep trees are sharp but overfit.
- A single deep tree is high-variance: small data changes produce large boundary changes.
- A forest trades that variance for smoothness by averaging many decorrelated trees. The blocky pieces cancel, leaving a smooth boundary that the underlying model class — a single tree — could never represent on its own.
- Trees interpolate but never extrapolate. Outside the range of the training data a tree just repeats a constant — and forests and boosted ensembles, being built from trees, inherit the same blind spot.
Everything here is a faithful (if compact) CART implementation: Gini-impurity splits, bootstrap sampling, and probability averaging, all running on a fixed random seed so the pictures are reproducible. View source to read the ~300 lines that power it.
Next in this series: Finding the 1%: Trees & Forests on Imbalanced Data — what happens to all of this when one class is 99% of the data.
Written by Max Buckley. ← All labs