Table of Contents

1. Foundations: Core Data Structures and Runtime Configuration
2. System Architecture: Server Launch and Inter-Process Communication
3. Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle
4. The Inference Engine: Forward Pass, Sampling, and CUDA Graphs
5. Attention Backends: Pluggable Implementations and the FlashAttention Reference
6. Model Layers and Tensor Parallelism
7. Model Architectures, MoE, and GPU Kernels
Library > sgl-project/mini-sglang > Chapter 3

Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle

This chapter dives into the heart of mini-sglang's serving engine: the scheduler. You will understand how the scheduler orchestrates the continuous-batching loop, how it separates prefill and decode phases, and how it manages the paged KV-cache. The chapter covers the request state table, the cache manager interface, and the two concrete cache strategies—naive sequential and radix-tree prefix sharing—that are the key algorithmic contribution of this system.

The Main Scheduler Loop: scheduler.py

`python/minisgl/scheduler/scheduler.py` is the control center of the serving system. Its main loop runs continuously: each iteration checks for new incoming requests, decides which to prefill and which to continue decoding, assembles a `Batch`, ships it to the engine workers, waits for output tokens, updates request state, and sends completed responses back to the tokenizer.

The loop embodies **continuous batching**: it does not wait for a batch to finish before accepting new requests. Decode-phase requests (those past their first token) are kept in a running pool, and new prefill requests are inserted whenever there is available KV-cache space and compute budget. The separation into `prefill.py` and `decode.py` sub-modules makes the two scheduling modes independently readable and testable.
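As a toy illustration of one such step (hypothetical classes and names, not the actual mini-sglang API): decode every request in the running pool, then admit waiting requests while a per-step token budget allows.

```python
from dataclasses import dataclass

@dataclass
class Req:
    rid: int            # request ID
    prompt_len: int     # tokens to prefill before decoding starts
    max_new: int        # decode budget
    generated: int = 0  # tokens decoded so far

def step(waiting: list, running: list, token_budget: int) -> list:
    """One continuous-batching step: decode the running pool, then
    admit new prefills while the per-step token budget allows."""
    batched = []
    # Decode phase: each running request contributes exactly one token.
    for req in running:
        req.generated += 1
        batched.append(req.rid)
    running[:] = [r for r in running if r.generated < r.max_new]
    # Prefill phase: admit waiting requests that fit the budget.
    while waiting and waiting[0].prompt_len <= token_budget:
        req = waiting.pop(0)
        token_budget -= req.prompt_len
        running.append(req)
        batched.append(req.rid)
    return batched
```

Note how a new request joins the batch while older ones are still decoding, rather than waiting for the current batch to drain.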

The scheduler also manages **flow control** via `scheduler/io.py`, which wraps ZMQ queues with non-blocking poll loops. This lets the scheduler drain the input queue at the start of each step (collecting new requests) and drain the output queue (sending replies) without blocking the GPU pipeline. The request state machine lives in `scheduler/table.py`, which tracks which requests are waiting for KV-cache, which are prefilling, which are decoding, and which have finished.
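The drain pattern has the same shape as the following sketch, with a plain `queue.Queue` standing in for the ZMQ socket (the real `scheduler/io.py` wraps ZMQ; this is illustrative only):

```python
import queue

def drain(q: queue.Queue) -> list:
    """Collect everything currently queued without ever blocking the
    caller -- the same shape as a non-blocking ZMQ poll loop."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items
```

Because the loop never blocks, an empty queue costs one failed poll and the scheduler proceeds straight to batch assembly.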

python/minisgl/scheduler/scheduler.py — main scheduler loop and step function (lines 1-100)

Prefill and Decode Phase Scheduling

`python/minisgl/scheduler/prefill.py` implements **chunked prefill**: rather than processing an entire long prompt in one forward pass (which could overflow GPU memory or block decode requests for too long), the scheduler breaks prompts into chunks of at most `max_prefill_tokens` tokens per step. This is why you see `chunk_size` logic gating how many tokens from a waiting request are admitted to the current batch.

`python/minisgl/scheduler/decode.py` is simpler: in the decode phase, each request contributes exactly one token position per step (the next token to generate). The decode scheduler selects all requests in the decode pool that have allocated KV-cache and constructs the decode batch. The key insight is that decode batches can be much larger than prefill batches (many requests generating one token each), which is why CUDA graph capture (in `engine/graph.py`) is applied to decode but not prefill—the batch shape is fixed and predictable for decode.

`python/minisgl/scheduler/utils.py` provides shared helpers like `count_tokens` and `count_blocks` that both the prefill and cache modules use to reason about resource budgets before committing to a scheduling decision.
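Assuming `count_blocks` rounds tokens up to whole pages (which its name suggests but the excerpt does not show), the budget arithmetic reduces to a ceiling division:

```python
def count_blocks(num_tokens: int, page_size: int) -> int:
    # Ceiling division: a partially filled page still occupies a block.
    return (num_tokens + page_size - 1) // page_size
```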

python/minisgl/scheduler/prefill.py — chunked prefill scheduling (lines 1-80)
python/minisgl/scheduler/decode.py — decode batch construction (lines 1-60)

KV-Cache Abstractions: base.py

`python/minisgl/kvcache/base.py` defines the interface that every KV-cache manager must implement. The core abstraction is the **block**: a fixed-size unit of GPU memory that can hold the key and value tensors for a fixed number of tokens. Blocks are allocated per request and freed when the request finishes. The base module defines `CacheBlock`, `CacheMeta`, and the abstract `KVCacheManager` class with `allocate`, `free`, `prefix_match`, and `evict` methods.

The **paged** design (blocks rather than contiguous per-request buffers) is what makes continuous batching feasible: requests of different lengths share a pool of fixed-size pages, so there is no memory waste from worst-case pre-allocation. The `CacheMeta` structure attached to each `Request` carries the list of block IDs assigned to that request, allowing the attention kernel to reconstruct the scattered key-value tensors during computation. The `mha_pool.py` module manages the raw GPU tensor buffer that backs all blocks, providing `alloc_block` and `free_block` at the physical level.

python/minisgl/kvcache/base.py — CacheBlock, CacheMeta, KVCacheManager abstract class (lines 1-80)

Prefix Sharing with the Radix-Tree Cache Manager

`python/minisgl/kvcache/radix_manager.py` is the most algorithmically sophisticated file in the codebase. Its insight: many LLM requests share common prefixes (system prompts, few-shot examples, conversation history). If the KV tensors for those shared tokens are already computed and cached, new requests can skip recomputing them entirely—a major throughput win.

The radix tree stores token sequences as paths from root to leaf, where each edge is labeled with a token ID. When a new request arrives, the manager walks the tree to find the longest matching prefix. The matching blocks are then **locked** (pinned, preventing eviction) and shared with the new request; only the remaining suffix tokens need to be prefilled. When GPU memory is tight, the tree's **eviction policy** walks leaf nodes (least recently used) and frees their blocks, never evicting locked or recently-used paths.

The C++ extension in `kernel/csrc/src/radix.cpp` provides a fast key-comparison function used during tree traversal, exposed via `kernel/radix.py`. The `naive_manager.py` provides a simpler baseline that allocates blocks sequentially without any prefix sharing—useful for comparison and for workloads with no repeated prefixes. Both managers implement the same `KVCacheManager` interface from `base.py`, making them interchangeable at startup.
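For contrast, the naive strategy amounts to a free list (a sketch with assumed names, not the file's actual code):

```python
class NaiveCacheManager:
    """Sequential block allocation with no prefix sharing."""
    def __init__(self, num_blocks: int):
        self.free_list = list(range(num_blocks))

    def allocate(self, n: int) -> list:
        if n > len(self.free_list):
            raise RuntimeError("out of KV-cache blocks")
        taken, self.free_list = self.free_list[:n], self.free_list[n:]
        return taken

    def free(self, blocks: list) -> None:
        self.free_list.extend(blocks)
```

Every request pays the full prefill cost, which makes this baseline the right comparison point for measuring the radix tree's savings.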

python/minisgl/kvcache/radix_manager.py — RadixTreeCacheManager, prefix matching, and eviction (lines 1-120)
python/minisgl/kvcache/naive_manager.py — NaiveCacheManager sequential allocation (lines 1-40)