
How Decision Trees Carve Up Space

A visual, build-it-in-your-browser tour of how a decision tree learns to classify a grid of red and green dots — and why averaging many wobbly trees produces a smooth, confident boundary.

Every picture below is computed live in your browser by a real CART (Classification and Regression Tree) decision-tree implementation. Drag the sliders; nothing is pre-rendered.

Suppose someone hands you a scatter of points on a square. Some are green, some are red, and your job is to predict the color of any new point based only on its position. A decision tree solves this by repeatedly asking one dead-simple question — “is this coordinate above or below some threshold?” — slicing the plane into rectangles until each rectangle is mostly one color.

That’s the whole idea. The rest of this post makes it concrete, one visualization at a time.

1. One split at a time

A tree is grown greedily. At each step it looks at every possible horizontal and vertical cut and picks the single cut that best separates the colors. “Best” is measured by Gini impurity: with two classes, a region of all-green or all-red dots has impurity 0 (perfectly pure), while a 50/50 mix has impurity 0.5 (maximally confused). The tree chooses the cut that most reduces the size-weighted average impurity of the two resulting halves (each half weighted by the number of points it contains), then recurses into each half.
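To make the split search concrete, here is a minimal Python sketch of Gini impurity and the exhaustive search over axis-aligned cuts. It is not the code that powers this page: the function names `gini` and `best_split`, the toy data, and the choice of midpoints between neighboring coordinates as candidate thresholds are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(points, labels):
    """Try every axis-aligned cut and return (weighted_impurity, feature, threshold).

    points is a list of (x, y) tuples. Returns None if no cut separates any points.
    """
    n = len(points)
    best = None
    for feature in (0, 1):  # 0 cuts on x (vertical line), 1 cuts on y (horizontal line)
        values = sorted({p[feature] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct coordinates.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for p, lab in zip(points, labels) if p[feature] <= t]
            right = [lab for p, lab in zip(points, labels) if p[feature] > t]
            # Each side's impurity is weighted by how many points it holds.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, t)
    return best

points = [(0.1, 0.2), (0.2, 0.8), (0.8, 0.3), (0.9, 0.9)]
labels = ["green", "green", "red", "red"]
print(gini(labels))                # 0.5 for a 50/50 mix
print(best_split(points, labels))  # (0.0, 0, 0.5): a vertical cut at x = 0.5
```

On this toy data the greens sit on the left and the reds on the right, so one vertical cut leaves both halves perfectly pure.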

Use the depth slider to grow the tree one level at a time. Watch the plane get partitioned into rectangles, and watch the tree diagram on the right grow in lockstep. Each leaf is colored by its majority class.

Legend: green point; red point; region predicted green; region predicted red.
Left: the data and the rectangular regions carved out by the splits (black lines are the cuts). Right: the tree itself — each internal node is one threshold test; each leaf is a colored rectangle. At depth 0 the “tree” is a single guess for the whole plane; every extra level can double the number of regions.

A few things to notice as you slide:

2. Leaves hold probabilities, not just colors

So far each region picked a single color. But a leaf rarely contains only green or only red points — especially in a shallow tree. What a leaf really stores is a probability: the fraction of its training points that are green.

Below, regions are shaded by that probability. Deep green means “almost certainly green,” pale colors near the boundary mean “the leaf is mixed — the tree isn’t sure.” Lower the depth to keep leaves big and mixed (lots of pale, hedged regions); raise it to make leaves pure and confident (saturated colors, but carved into ever-smaller boxes).

The color bar maps a leaf’s predicted probability to a shade. The pale band through the middle is the tree’s decision boundary — where it flips from guessing red to guessing green.

Shallow trees give few, large, hedged regions (calibrated but coarse). Deep trees give many tiny, confident regions that hug individual points (sharp but brittle). Neither is “the answer” — which is the problem a forest is about to fix.
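The idea that a leaf stores a fraction rather than a color can be sketched in a few lines of Python. This is an illustrative toy, not the page's implementation: the names `build` and `predict_proba`, the dictionary-based tree, and the encoding of green as 1 and red as 0 are assumptions.

```python
def gini(group):
    """Two-class Gini impurity for a list of 0/1 labels."""
    if not group:
        return 0.0
    p = sum(group) / len(group)
    return 2 * p * (1 - p)

def build(points, labels, depth):
    """Grow a tree with at most `depth` levels of splits (1 = green, 0 = red)."""
    n, greens = len(labels), sum(labels)
    leaf = {"p_green": greens / n, "n": n}  # the leaf stores a fraction
    if depth == 0 or greens in (0, n):      # depth budget spent, or already pure
        return leaf
    best = None
    for f in (0, 1):
        values = sorted({p[f] for p in points})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for p, lab in zip(points, labels) if p[f] <= t]
            right = [lab for p, lab in zip(points, labels) if p[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:                        # identical points: nothing to cut
        return leaf
    _, f, t = best
    li = [i for i, p in enumerate(points) if p[f] <= t]
    ri = [i for i, p in enumerate(points) if p[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([points[i] for i in li], [labels[i] for i in li], depth - 1),
        "right": build([points[i] for i in ri], [labels[i] for i in ri], depth - 1),
    }

def predict_proba(tree, point):
    """Walk the threshold tests down to a leaf and return its green fraction."""
    while "p_green" not in tree:
        go_left = point[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["p_green"]

points = [(0.1, 0.1), (0.2, 0.9), (0.3, 0.5), (0.7, 0.2), (0.8, 0.8), (0.9, 0.5)]
labels = [1, 0, 1, 0, 0, 0]
shallow = build(points, labels, depth=1)
deep = build(points, labels, depth=4)
print(predict_proba(shallow, (0.2, 0.5)))  # a mixed leaf: 2 of its 3 points are green
print(predict_proba(deep, (0.2, 0.9)))     # a pure leaf around a single red point
```

With one level of splits the left leaf holds two greens and one red, so it hedges at 2/3. Given more depth the same region is carved into pure leaves, and the hedge disappears.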

3. Wiggle the data, get a totally different tree

Here is the uncomfortable truth about a single deep tree: it is high-variance. Because every split is chosen greedily, nudging just a few training points can change the very first cut — and a different first cut sends the whole tree down a different path, producing a wildly different boundary.

To see this, we train four trees, each on a bootstrap sample of the data: a random draw of the points with replacement, so each tree sees a slightly different version of the world (some points duplicated, some omitted). Same dataset, same algorithm — four very different boundaries. Hit resample a few times and watch how jumpy they are.
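A bootstrap sample is short to write down. The sketch below assumes Python's standard `random` module and a hypothetical helper name `bootstrap_sample`. It draws as many indices as there are points, with replacement, and reports which points each of four samples never saw.

```python
import random

def bootstrap_sample(points, labels, rng):
    """Draw len(points) indices with replacement; return the resampled data and indices."""
    n = len(points)
    idx = [rng.randrange(n) for _ in range(n)]
    return [points[i] for i in idx], [labels[i] for i in idx], idx

rng = random.Random(0)  # fixed seed so the draws are reproducible
points = [(i / 10, (i * 7 % 10) / 10) for i in range(10)]
labels = [i % 2 for i in range(10)]

for k in range(4):
    _, _, idx = bootstrap_sample(points, labels, rng)
    omitted = sorted(set(range(len(points))) - set(idx))
    print(f"tree {k}: drew {sorted(idx)}, never saw {omitted}")
```

On average a bootstrap sample contains about 63% of the distinct original points (the fraction approaches 1 - 1/e as the dataset grows), with duplicates filling the remaining slots. That is enough of a difference to change which cut wins.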

Four trees, four bootstrap samples of the same dataset. Each is a perfectly reasonable fit, yet they disagree — sometimes sharply — about the territory between the clusters. That disagreement is variance, and on its own it’s bad news. The trick is to turn it into an advantage.

4. Average the wobble away: the random forest

If one tree’s errors are jumpy and partly random, then many independent trees will make their mistakes in different places — and those mistakes can cancel out. A random forest does exactly this: grow many trees, each on its own bootstrap sample, then average their probabilities. (Real forests also pick a random subset of features at each split to make the trees more independent; with only two features here we lean on bootstrap sampling, the “bagging” half of the idea.)
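Bagging itself is just a loop and a mean. To keep the sketch short it uses a deliberately simplified base learner, a depth-1 tree (a stump) on a single feature, rather than the full two-dimensional trees in the figures. The names `fit_stump` and `fit_forest` and the toy data are assumptions for illustration.

```python
import random

def fit_stump(xs, labels):
    """Depth-1 tree on one feature; each leaf stores its fraction of green (1) labels."""
    n = len(xs)
    overall = sum(labels) / n

    def gini(group):
        p = sum(group) / len(group)
        return 2 * p * (1 - p)

    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for x, lab in zip(xs, labels) if x <= t]
        right = [lab for x, lab in zip(xs, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # every x identical: fall back to a constant prediction
        return lambda x: overall
    _, t, p_left, p_right = best
    return lambda x: p_left if x <= t else p_right

def fit_forest(xs, labels, n_trees, rng):
    """Fit each stump on its own bootstrap sample, then average their probabilities."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [labels[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

# Toy data: red (0) on the left, green (1) on the right, two points flipped.
xs = [i / 20 for i in range(20)]
labels = [1 if x >= 0.5 else 0 for x in xs]
labels[8], labels[12] = 1, 0

single = fit_forest(xs, labels, 1, random.Random(0))
forest = fit_forest(xs, labels, 100, random.Random(0))
grid = [i / 100 for i in range(100)]
print(len({single(g) for g in grid}), "distinct probabilities from one stump")
print(len({forest(g) for g in grid}), "distinct probabilities from the forest")
```

One stump can only output two values, one per leaf. The averaged forest outputs many more, because each bootstrap sample tends to place its threshold somewhere slightly different: that is the blocky step turning into a gradient.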

Drag the slider from 1 tree up to 150. Watch the blocky, high-contrast boundary of a single tree melt into a smooth, gently-graded surface. The staircase edges average out; the confident-but-wrong tiny rectangles get outvoted; the boundary settles down right where the two colors actually meet.

As trees accumulate, the prediction surface stops looking like a tiled floor and starts looking like a topographic map: smooth contours of probability.

One tree: jagged, overconfident, every region either fully red or fully green. A hundred trees: a smooth probability gradient that traces the true shape of the data — even for the spiral, which no axis-aligned tree could ever draw cleanly on its own.

The deeper lesson: each individual tree in that smooth forest is still a blocky, overfit mess. We didn’t make the trees better — we made them diverse, and then averaged. Variance falls roughly like 1/(number of trees) as long as the trees aren’t too correlated, which is why the curve smooths fast at first (1 → 10 trees is dramatic) and then only polishes (100 → 150 is barely visible).
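The “roughly 1/(number of trees)” claim can be made precise under a simplifying assumption. Suppose each tree's prediction at a point x has the same variance and every pair of trees has the same correlation. The symbols B (number of trees), σ² (per-tree variance), and ρ (pairwise correlation) are notation introduced here, not taken from the page's code. The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

With ρ = 0, going from 1 tree to 10 removes 90% of the variance, while going from 100 to 150 trees moves it only from σ²/100 to σ²/150, a change of about 0.3% of σ². The first term, ρσ², does not shrink with B at all. That floor is why real forests work to decorrelate their trees.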

5. The one thing a tree cannot do: extrapolate

Everything above classified colors, but a tree is just as happy predicting a number: that’s a regression tree. The machinery is nearly identical. Splits are typically scored by how much they reduce the variance (squared error) of the targets rather than by Gini impurity, and the leaf changes: instead of storing the local mix of labels, each leaf stores the average target value of the points that fall in it. The prediction becomes a piecewise-constant staircase: flat within each region, jumping at the splits.

That single fact — every leaf emits one constant — has a consequence that bites hard in practice. Consider a variable that trends over time, like a noisy seasonal time series. We train a regression tree on the shaded middle window only, then ask it to predict the whole timeline. Drag the depth up: inside the window the staircase hugs the data better and better, but look at the edges.

Legend: true series; tree prediction; ● training points (shaded window).
Inside the training window the regression tree interpolates the data — more steps with more depth. Outside it, the prediction is a flat horizontal line, frozen at the value of the outermost leaf, no matter what the true series goes on to do. The tree literally cannot represent “keep rising” or “keep oscillating” — its outer regions stretch to infinity at a single constant.
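The flat-line behavior is easy to reproduce. The sketch below is a minimal one-feature regression tree that scores splits by summed squared error. The function names, the series 0.5·x + sin(x), and the training window [10, 20] are assumptions chosen for illustration, not the series plotted above.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_regression_tree(xs, ys, depth):
    """One-feature regression tree; each leaf stores the mean target of its points."""
    if depth == 0 or len(set(xs)) < 2:
        return {"value": sum(ys) / len(ys)}
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t)
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": fit_regression_tree([x for x, _ in left], [y for _, y in left], depth - 1),
        "right": fit_regression_tree([x for x, _ in right], [y for _, y in right], depth - 1),
    }

def predict(tree, x):
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

def true_series(x):
    return 0.5 * x + math.sin(x)  # upward trend plus an oscillation

train_x = [10 + 0.25 * i for i in range(41)]  # training window: x in [10, 20]
train_y = [true_series(x) for x in train_x]
tree = fit_regression_tree(train_x, train_y, depth=5)

for x in (20.0, 25.0, 30.0, 1000.0):
    print(x, round(predict(tree, x), 3), round(true_series(x), 3))
```

Every query to the right of the window lands in the same rightmost leaf, so the prediction at x = 25, 30, or 1000 equals the prediction at x = 20, while the true series keeps climbing. A leaf's value is a mean of training targets, so no prediction can ever exceed the largest target seen in training.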

This is structural, not a tuning problem: no depth setting fixes it, because the model class has no concept of a trend continuing past the data. And since a forest is just an average of trees and a boosted model is just a sum of trees, every tree-derived method inherits this blind spot — a deeper dive, with gradient boosting, is the finale of Part 3. The practical takeaway: for trending data, don’t feed raw values to a tree model — difference or detrend first, then let the trees model the leftover wiggles they’re actually good at.
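Differencing can be sketched without any tree at all, since the point is what it does to the target. The helpers `difference` and `undifference` and the toy series 0.5·x + sin(x) are assumptions for illustration. A tree model would be trained on the differenced values, and its predictions summed back up to recover levels.

```python
import math

def difference(ys):
    """Step-to-step changes: d[i] = ys[i + 1] - ys[i]."""
    return [b - a for a, b in zip(ys, ys[1:])]

def undifference(first, diffs):
    """Invert differencing by cumulatively summing changes from a starting level."""
    out = [first]
    for d in diffs:
        out.append(out[-1] + d)
    return out

def series(x):
    return 0.5 * x + math.sin(x)  # upward trend plus an oscillation

step = 0.25
train = [series(10 + step * i) for i in range(41)]   # x in [10, 20]
future = [series(20 + step * i) for i in range(41)]  # x in [20, 30]

d_train, d_future = difference(train), difference(future)
print("raw levels:  train max", round(max(train), 3), "| future max", round(max(future), 3))
print("differences: train range", round(min(d_train), 3), "to", round(max(d_train), 3))
print("differences: future range", round(min(d_future), 3), "to", round(max(d_future), 3))
```

The raw future levels climb past anything in the training window, which is exactly where a tree flatlines. The differences stay within essentially the same band in both windows, so a piecewise-constant model is never asked to output a value it has not seen.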

Takeaways

Everything here is a faithful (if compact) CART implementation: Gini-impurity splits, bootstrap sampling, and probability averaging, all running on a fixed random seed so the pictures are reproducible. View source to read the ~300 lines that power it.

Written by Max Buckley.