What's been cooking — February 2026
7 merged PRs across 3 repos
February was, apparently, the month everyone decided their CPU was embarrassing them. Two of the three repos shipped non-trivial perf work (vectorizing hot loops, killing kernel launch overhead, and trimming preprocessing fat), while cuVS quietly took a broom to its codebase.
urchade/GLiNER
GLiNER spent the month chasing the ~20% of wall time that wasn't the GPU forward pass, and the results are pretty satisfying. GLiNER#333 swaps a pile of Python-level loops in preprocessing and span decoding for batched tensor ops, going after entity pair generation, span index construction, and label handling. GLiNER#334 targets the per-batch-item decoding loop, collapsing B * 8 CUDA kernel launches down to roughly 8 regardless of batch size — which translates to a 63–95% decoder speedup on GPU at bs>=8, all statistically significant. And GLiNER#335 finishes the job on the relation side: the triple-nested batch × relations × classes loop with its .item() calls is gone, replaced by a single torch.where on the full (B, R, C) tensor, with the decoding logic factored into a shared module-level helper. Three PRs, one consistent theme: stop synchronizing, stop iterating in Python, let the tensors do the work.
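For a feel of what that last change looks like, here is a minimal sketch of the pattern, not the code from GLiNER#335. The function names, the threshold parameter, and the output tuple layout are all assumptions made up for illustration; only the idea (one torch.where over a (B, R, C) score tensor instead of a nested loop with .item()) comes from the PR description.

```python
import torch


def decode_relations_loop(scores, threshold):
    """Reference version: triple-nested Python loop with .item() per element.

    Hypothetical stand-in for the old pattern, not the actual GLiNER code.
    """
    B, R, C = scores.shape
    out = []
    for b in range(B):
        for r in range(R):
            for c in range(C):
                # On a GPU tensor, each .item() forces a device-to-host sync.
                s = scores[b, r, c].item()
                if s > threshold:
                    out.append((b, r, c, s))
    return out


def decode_relations_vectorized(scores, threshold):
    """Vectorized version: a single torch.where on the full (B, R, C) tensor."""
    # With only a condition, torch.where returns one index tensor per dimension,
    # in row-major order, which matches the loop order above.
    b_idx, r_idx, c_idx = torch.where(scores > threshold)
    kept = scores[b_idx, r_idx, c_idx]
    # Convert to Python lists once at the end instead of once per element.
    return list(zip(b_idx.tolist(), r_idx.tolist(), c_idx.tolist(), kept.tolist()))


torch.manual_seed(0)
scores = torch.rand(4, 6, 3)  # (batch, relations, classes), random toy scores
assert decode_relations_loop(scores, 0.5) == decode_relations_vectorized(scores, 0.5)
```

The loop version does B × R × C Python iterations and as many host syncs; the vectorized one does a handful of tensor ops whose count does not depend on batch size, which is the same shape of win GLiNER#334 reports for its launch counts.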
rapidsai/cuvs
cuVS had a tidy month rather than a flashy one. cuvs#1703 appeases stricter compilers by adding [[fallthrough]] annotations in team_sum and initializing a previously uninitialized seed_index. cuvs#1705 deletes an unused use_norms constant from the pairwise distances SM60 kernel, and cuvs#1706 drops a redundant read_idx from the query loop. None of it moves a benchmark, but the diffs are the kind of thing you're glad someone bothered to do before the warnings turn into errors.
fastino-ai/GLiNER2
GLiNER2 joined the latency-trimming party with GLiNER2#75, knocking 16–26% off end-to-end inference time by reworking preprocessing, embedding extraction, and batch setup. The encoder forward pass wasn't touched, and outputs are bitwise identical to before — existing tests passed unmodified, which is the right kind of boring for a perf PR.
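"Bitwise identical" is a stricter bar than "close enough", and the difference is easy to show. The sketch below is not GLiNER2's test code; assert_bitwise_identical is a hypothetical helper, and torch.equal is used as a practical stand-in for a true bit-level comparison (it compares values exactly, so NaNs and signed zeros are edge cases it treats differently from a raw byte compare).

```python
import torch


def assert_bitwise_identical(before, after):
    """Hypothetical helper: fail unless two outputs match exactly."""
    assert before.dtype == after.dtype, "dtype changed"
    assert before.shape == after.shape, "shape changed"
    # torch.equal requires exact element equality, with no tolerance.
    assert torch.equal(before, after), "values differ"


baseline = torch.ones(2, 3)
assert_bitwise_identical(baseline, baseline.clone())  # passes

# A tiny numerical drift, the kind a reordered reduction might introduce.
drifted = baseline + 1e-6
print(torch.allclose(baseline, drifted))  # tolerance-based check
print(torch.equal(baseline, drifted))     # exact check
```

A tolerance-based check accepts the drifted tensor while the exact check rejects it, which is why an exact match after a perf refactor is a stronger guarantee than the usual allclose assertion.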
That's the month: less Python in the hot path, fewer kernel launches, and a slightly cleaner cuVS. See you in March.