This chapter examines how mini-sglang runs the actual forward pass. You will see how the Engine class initializes the model and KV-cache, how it dispatches prefill and decode batches, how token sampling is applied to logits, and how CUDA graph capture dramatically speeds up the decode phase by eliminating Python overhead on the GPU critical path.
`python/minisgl/engine/engine.py` is the component that actually executes the neural network. The `Engine.__init__` method is a useful map of the system's dependencies: it loads the model from disk (via `models/__init__.py`'s `create_model` factory), allocates the KV-cache pool, instantiates the chosen attention backend and MoE backend, and builds the CUDA graph for decode batches.
The main entry point is the **`step`** method (or equivalent forward function), which receives a `Batch` from the scheduler, routes it through the model's `forward` method, applies the sampler to get the next token for each request, and returns the output token IDs. The engine is designed to be driven by the scheduler—it is stateless between calls except for the KV-cache contents and model weights. This clean separation means the scheduler can be tested independently of GPU code.
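The wiring described in the last two paragraphs can be sketched as follows. This is a hedged, minimal skeleton, not the actual mini-sglang code: the factory callables (`create_model`, `alloc_kv_cache`, `create_attn_backend`) and the `Batch.requests` / `SamplingParams` shapes are illustrative stand-ins for the real interfaces.

```python
class Engine:
    """Illustrative sketch of the engine's dependency wiring and step loop."""

    def __init__(self, config, create_model, alloc_kv_cache, create_attn_backend):
        self.config = config
        self.model = create_model(config)            # load weights from disk
        self.kv_cache = alloc_kv_cache(config)       # pre-allocated KV pool
        self.attn_backend = create_attn_backend(config)
        # In the real engine, CUDA graph capture for decode would happen here.

    def step(self, batch):
        """Run one forward pass and return next-token ids for each request.

        Stateless between calls apart from KV-cache contents and weights.
        """
        logits = self.model.forward(batch, self.kv_cache, self.attn_backend)
        return [
            self.sample(row, req.sampling_params)
            for row, req in zip(logits, batch.requests)
        ]

    def sample(self, logits_row, params):
        # Greedy placeholder; the real sampler applies temperature/top-k/top-p.
        return max(range(len(logits_row)), key=logits_row.__getitem__)
```

The point of the shape is the one made above: all mutable state lives in the KV-cache, so `step` can be exercised with stub factories and a fake model, with no GPU involved.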
`python/minisgl/engine/config.py` defines `EngineConfig` and provides a `from_hf` factory that reads a HuggingFace model config and derives engine parameters (number of layers, heads, head dimension, data type, etc.). This is the bridge between the HuggingFace model zoo and mini-sglang's internal representations.
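A `from_hf`-style factory might look like the sketch below. The HuggingFace key names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`) follow common conventions in HF model configs, but the exact fields mini-sglang derives, and their names, are assumptions here.

```python
from dataclasses import dataclass


@dataclass
class EngineConfig:
    num_layers: int
    num_heads: int
    num_kv_heads: int
    head_dim: int
    hidden_size: int
    dtype: str

    @classmethod
    def from_hf(cls, hf: dict) -> "EngineConfig":
        hidden = hf["hidden_size"]
        heads = hf["num_attention_heads"]
        return cls(
            num_layers=hf["num_hidden_layers"],
            num_heads=heads,
            # GQA models declare fewer KV heads; default to MHA otherwise.
            num_kv_heads=hf.get("num_key_value_heads", heads),
            # head_dim is usually hidden_size / num_heads unless given explicitly.
            head_dim=hf.get("head_dim", hidden // heads),
            hidden_size=hidden,
            dtype=hf.get("torch_dtype", "float16"),
        )
```

The derived fields (KV head count, head dimension) are exactly the ones the KV-cache allocator and attention backend need, which is why this factory sits at the boundary between the HF model zoo and the engine.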
`python/minisgl/engine/sample.py` implements the `Sampler` class. After the model produces a `[batch_size, vocab_size]` logits tensor, the sampler applies temperature scaling, top-k filtering, top-p (nucleus) filtering, and then draws a sample (or takes the argmax for greedy decoding). Each `Request` carries its own `SamplingParams`, so the sampler must handle a batch where different requests have different sampling configurations.
The design batches the sampling operations where possible (e.g., applying temperature to all logits at once) and only branches per-request for the final sample draw. This keeps the sampling code GPU-friendly while still supporting heterogeneous request configurations. Understanding this file also clarifies why the `SamplingParams` dataclass in `core.py` carries the fields it does—each field maps to a specific operation in the sampler.
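To make the pipeline concrete, here is a pure-Python sketch of the per-request sampling chain for a single logits row. The real `Sampler` operates on batched GPU tensors and its function signature differs; this only illustrates the order of operations (temperature, then top-k, then top-p, then the draw).

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    """Illustrative temperature / top-k / top-p sampling for one logits row."""
    if temperature == 0.0:  # greedy decoding: just take the argmax
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    # Rank candidates by scaled logit, highest first.
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]  # top-k: keep only the k best candidates
    # Softmax over the surviving candidates (subtract max for stability).
    m = max(scaled[i] for i in order)
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    keep, cum = [], 0.0
    for idx, p in zip(order, probs):
        keep.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize and draw from the truncated distribution.
    z = sum(p for _, p in keep)
    r = random.Random(seed).random() * z
    for idx, p in keep:
        r -= p
        if r <= 0:
            return idx
    return keep[-1][0]
```

Note how every parameter of a typical `SamplingParams` (`temperature`, `top_k`, `top_p`) maps to exactly one stage here, which is the correspondence the paragraph above describes.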
`python/minisgl/engine/graph.py` implements CUDA graph capture for the decode phase. Because every decode step processes the same batch shape (N requests × 1 new token), the phase is a perfect candidate for graph capture: the sequence of CUDA kernels is identical on every step, so the runtime can record the entire sequence once and replay it with minimal CPU overhead.
The capture process works by running a "warm-up" forward pass with a fixed batch size, then re-running it inside the `torch.cuda.graph()` context manager, which records the CUDA operations into a graph object. On subsequent steps, `graph.replay()` re-executes the recorded kernel sequence without any Python interpreter involvement on the GPU path. The module also provides **memory estimation utilities** that compute how large a batch can fit in GPU memory given the KV-cache size and model weights, which the engine uses to set the maximum decode batch size before capture.
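The memory estimation is essentially back-of-the-envelope arithmetic over the KV-cache layout. The sketch below is an assumed simplification (function name, the fixed activation reserve, and the 2-byte fp16 element size are all illustrative), not the actual utility in `graph.py`.

```python
def max_kv_tokens(total_mem_bytes, weight_bytes, activation_reserve_bytes,
                  num_layers, num_kv_heads, head_dim, elem_size=2):
    """Estimate how many tokens of KV-cache fit after weights and headroom.

    Each token stores a K and a V vector per layer:
        2 * num_layers * num_kv_heads * head_dim * elem_size bytes.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * elem_size
    free = total_mem_bytes - weight_bytes - activation_reserve_bytes
    return max(free // bytes_per_token, 0)
```

For example, a 32-layer model with 8 KV heads of dimension 128 in fp16 costs 128 KiB of KV-cache per token, so an 80 GiB GPU holding 16 GiB of weights and reserving 4 GiB for activations can cache roughly half a million tokens; dividing by the maximum sequence length then bounds the decode batch size the engine captures graphs for.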
The tradeoff is that captured graphs are inflexible: if the batch size changes (e.g., a request finishes mid-step), the graph cannot be used and the engine falls back to eager execution. Mini-sglang handles this by maintaining multiple captured graphs for different power-of-two batch sizes (a technique also used in vLLM and SGLang), so the right graph is selected based on the current decode batch size.
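The bucket selection itself is simple: pad the live batch up to the smallest captured size that can hold it. A minimal sketch, assuming a power-of-two capture schedule and an eager fallback signalled by `None` (both illustrative choices, not necessarily mini-sglang's exact behavior):

```python
def select_graph_size(batch_size, capture_sizes=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Pick the smallest captured batch size that fits the live decode batch.

    Returns None when the batch exceeds every captured size, signalling
    a fallback to eager execution.
    """
    for size in capture_sizes:  # assumed sorted ascending
        if size >= batch_size:
            return size
    return None
```

Padding a 5-request batch up to the size-8 graph wastes a little compute on dummy slots, but replaying a captured graph is so much cheaper than eager dispatch that the trade is almost always worth it.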