What's been cooking — May 2026

17 merged PRs across 4 repos

May was the month of beating CUDA into submission and teaching CoreML some new tricks. A fork of LightGBM absorbed a small mountain of GPU correctness fixes, the ONNX Runtime CoreML EP grew several new op builders, and a popular face-swap project finally got a lint gate. Plenty of one-line fixes hiding behind multi-paragraph postmortems.

BelixRogner/ExaBoost

The bulk of the month went into this LightGBM fork, where the CUDA backend was apparently a minefield. The headline is ExaBoost#1, which closes a ~10x perf gap between CUDA and CPU quantized training and stops divergence at num_leaves > 4 by fixing three interacting bugs. ExaBoost#3 is the upstreamable one-liner: CUDAConstructDiscretizedHistogramDenseKernel_GlobalMemory was computing its scratch base from partition_column_start instead of partition_hist_start, which only coincidentally worked for partition 0. ExaBoost#2 turns a SIGSEGV at CUDAObjectiveInterface::Init into an actual error message when you swap device_type after constructing a Dataset, and ExaBoost#4 finally makes the CUDA categorical split-finder honor min_data_per_group, which it had been accepting as a parameter and silently ignoring.
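
To see why the wrong base only bit partitions past the first, here is a toy Python model of the two offsets. The layout, the function name, and the bin counts are made up for illustration; only the two variable names come from the fix, and the real kernel's indexing is more involved.

```python
def partition_offsets(bins_per_column, columns_per_partition):
    """Return (column_starts, hist_starts) for a hypothetical partition layout.

    column_starts counts columns; hist_starts counts histogram bins.
    """
    column_starts, hist_starts = [], []
    col = 0
    bin_offset = 0
    for ncols in columns_per_partition:
        column_starts.append(col)
        hist_starts.append(bin_offset)
        # Advance the bin offset by every bin owned by this partition's columns.
        bin_offset += sum(bins_per_column[col:col + ncols])
        col += ncols
    return column_starts, hist_starts


# Assumed toy data: four columns with different bin counts, two partitions.
column_starts, hist_starts = partition_offsets([16, 32, 8, 64], [2, 2])

for p, (c, h) in enumerate(zip(column_starts, hist_starts)):
    print(f"partition {p}: partition_column_start={c}, partition_hist_start={h}")
```

In this model both offsets are 0 for partition 0, so either one gives the same scratch base. From partition 1 onward they diverge, and a base computed from the column offset lands in the wrong bins.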

The percentile math got two passes: ExaBoost#6 and ExaBoost#9 both fix the same off-by-one (dividing by len instead of len - 1, with 0-based vs 1-based indexing) in PercentileDevice and the global-memory PercentileGlobalKernel respectively, affecting regression_l1 and quantile objectives. ExaBoost#8 stops weighted L1 and quantile training from crashing CUDA with an illegal memory access on anything over ~100 samples, and ExaBoost#7 makes max_depth actually do something on the CUDA tree learner instead of being silently dropped on the floor. Rounding it out, ExaBoost#5 gates the dropped Maxwell/Pascal/Volta compute capabilities so the project actually builds against CUDA Toolkit 13.x.
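
As a rough illustration of how the len versus len - 1 choice moves a percentile, here is a generic linear-interpolation sketch in Python. It is not the ExaBoost kernel code: the function name and the `denominator` parameter are hypothetical, and the CUDA implementations differ in detail.

```python
def percentile(sorted_vals, alpha, denominator):
    """Linearly interpolate the alpha-quantile of ascending sorted_vals.

    Element i (0-based) is assigned the quantile level i / denominator.
    With denominator = len - 1 the levels span [0, 1]; with denominator = len
    the largest element never reaches level 1 and every result shifts upward.
    """
    n = len(sorted_vals)
    if n == 1:
        return float(sorted_vals[0])
    levels = [i / denominator for i in range(n)]
    if alpha <= levels[0]:
        return float(sorted_vals[0])
    for i in range(1, n):
        if alpha <= levels[i]:
            span = levels[i] - levels[i - 1]
            frac = (alpha - levels[i - 1]) / span
            return sorted_vals[i - 1] + frac * (sorted_vals[i] - sorted_vals[i - 1])
    # alpha lies beyond the last level: clamp to the largest value.
    return float(sorted_vals[-1])


values = [10.0, 20.0, 30.0, 40.0, 50.0]
correct = percentile(values, 0.5, len(values) - 1)  # levels 0, 0.25, ..., 1
skewed = percentile(values, 0.5, len(values))       # levels 0, 0.2, ..., 0.8
print(correct, skewed)
```

With five evenly spaced values, the len - 1 version returns the middle element as the median, while the len version lands roughly halfway between the third and fourth elements.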

microsoft/onnxruntime

Most of the activity here was widening what the CoreML EP can actually take ownership of. onnxruntime#28270 lowers the Split minimum opset from 13 to 1 by teaching both the MLProgram and NeuralNetwork emitters to read the legacy split INTS attribute. onnxruntime#28289 adds com.microsoft:FusedConv support, which matters because ConvActivationFusion happily produces those nodes whenever a model is optimized via the CPU EP and saved back out. onnxruntime#28293 ships Identity, Ceil, and Tile builders alongside a heuristic that drops CoreML partitions made entirely of trivial shape and cheap-elementwise ops, discovered while staring at YOLOv10 partitioning on Apple Silicon. onnxruntime#28278 works around CoreML's requirement that gather indices be at least rank 1, so scalar indices stop disqualifying nodes, and onnxruntime#28596 routes Sin and Cos through the existing UnaryOpBuilder (MLProgram only, since NeuralNetwork's UnaryFunctionLayerParams doesn't have them).
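
The partition-dropping heuristic can be pictured with a few lines of Python. This is a sketch of the idea only: the op set and the function name are assumptions, not the list or the logic the PR actually uses.

```python
# Hypothetical set of ops considered too cheap to justify a CoreML partition.
TRIVIAL_OPS = frozenset(
    {"Identity", "Reshape", "Squeeze", "Unsqueeze", "Flatten", "Ceil", "Neg"}
)


def worth_offloading(partition_ops):
    """Keep a partition only if it contains at least one non-trivial op."""
    return any(op not in TRIVIAL_OPS for op in partition_ops)


partitions = [
    ["Reshape", "Identity"],       # shape shuffling only: dropped
    ["Conv", "Relu", "Reshape"],   # real compute: kept
    ["Ceil"],                      # lone cheap elementwise op: dropped
]
kept = [p for p in partitions if worth_offloading(p)]
print(kept)
```

The point of a filter like this is that handing a partition to another EP has a fixed cost, so a partition with no real compute in it can end up slower than leaving those nodes where they were.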

Outside the CoreML pile, onnxruntime#28288 fixes a subtle papercut in ReplaceWithNew::CreateReplacementNode, which was defaulting replacement nodes to kCpuExecutionProvider when the target's EP was empty — pinning fused nodes to CPU in cases where they had no business being pinned anywhere.
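
In miniature, the before and after look something like the Python below. The function names are hypothetical, and the "after" behavior assumes the fix simply carries the target's empty assignment through rather than choosing a default.

```python
# String value of kCpuExecutionProvider in ONNX Runtime.
CPU_EP = "CPUExecutionProvider"


def replacement_ep_before(target_ep: str) -> str:
    """Old behavior as described: an unassigned target falls back to CPU."""
    return target_ep if target_ep else CPU_EP


def replacement_ep_after(target_ep: str) -> str:
    """Assumed fixed behavior: an unassigned target stays unassigned."""
    return target_ep


for ep in ("", "CoreMLExecutionProvider"):
    print(repr(ep), "->", repr(replacement_ep_before(ep)), "vs", repr(replacement_ep_after(ep)))
```

The two only differ when the target has no EP yet, which is exactly the case where pinning the replacement to CPU takes the decision away from later partitioning.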

hacksider/Deep-Live-Cam

One PR, but a useful one: Deep-Live-Cam#1845 adds a pyproject.toml and a GitHub Actions workflow pinning ruff==0.15.7 as a CI gate on E701, E711, E712, F401, and F541, plus mechanical cleanup of the 29 existing violations across 14 files. No behavior change, just a tripwire so the next batch of unused imports and placeholder-free f-strings can't sneak in.
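
For a feel of what those five rules ask for, here is a small Python function written the way the gate wants it. The function is invented for illustration and is not taken from the Deep-Live-Cam codebase.

```python
# F401: nothing is imported here because nothing is needed; an unused
# import at the top of the file would trip the gate.


def describe(value, enabled):
    # E711: compare against None with `is`, not `==`.
    if value is None:
        # E701: the body sits on its own line instead of after the colon.
        return "missing"
    # E712: rely on truthiness instead of writing `enabled == True`.
    if enabled:
        # F541: no f-prefix on a string that has no placeholders.
        return "on"
    # An f-string is fine here because it actually interpolates something.
    return f"off ({value})"


print(describe(None, True), describe(1, True), describe(3, False))
```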

lightgbm-org/LightGBM

The upstream version of the ExaBoost histogram fix landed as LightGBM#7261: a one-line change in CUDAConstructDiscretizedHistogramDenseKernel_GlobalMemory, cross-referenced against the working sparse kernel that had the correct pattern all along. Satisfying in inverse proportion to its diff size.

That's the month: a lot of GPU bugs that should never have shipped, and a lot of CoreML ops that probably should have. See you in June.