Table of Contents

1. Foundations: Core Data Structures and Runtime Configuration
2. System Architecture: Server Launch and Inter-Process Communication
3. Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle
4. The Inference Engine: Forward Pass, Sampling, and CUDA Graphs
5. Attention Backends: Pluggable Implementations and the FlashAttention Reference
6. Model Layers and Tensor Parallelism
7. Model Architectures, MoE, and GPU Kernels

Foundations: Core Data Structures and Runtime Configuration

This chapter introduces the vocabulary of the entire codebase. You will learn the central data structures—Request, Batch, SamplingParams, and the global context—defined in core.py, and the environment-variable-driven configuration flags in env.py. Every other component in mini-sglang speaks in terms of these types, so understanding them first makes every subsequent file far easier to read.

The Central Data Model: core.py

`python/minisgl/core.py` is the single most important file in the codebase. It defines the shared vocabulary that the scheduler, engine, attention backends, and server all use to talk to each other. By dependency-graph centrality (a PageRank of 1.00, the highest in the repository) it is a true hub: every other major component imports from it.

The file defines **`Request`**, which represents one user inference job. It carries the raw prompt tokens, sampling parameters, an allocated KV-cache slot, and mutable state (current token position, generated tokens so far, finish reason). Understanding `Request` is prerequisite to understanding any scheduling or batching logic.
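To make the shape of this structure concrete, here is a minimal sketch of a `Request`-like dataclass. Field and property names (`prompt_tokens`, `kv_slot`, `finish_reason`, and so on) are illustrative assumptions, not the actual definitions in `core.py`:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Request:
    """One user inference job, carrying both static inputs and mutable state."""
    prompt_tokens: List[int]                 # raw prompt token ids
    max_new_tokens: int                      # generation budget
    kv_slot: Optional[int] = None            # allocated KV-cache slot, if any
    output_tokens: List[int] = field(default_factory=list)  # generated so far
    finish_reason: Optional[str] = None      # None while the request is running

    @property
    def position(self) -> int:
        # Current token position: prompt length plus tokens generated so far.
        return len(self.prompt_tokens) + len(self.output_tokens)

    @property
    def is_finished(self) -> bool:
        return self.finish_reason is not None
```

The key design point is the split between immutable inputs (prompt, budget) and mutable per-step state (outputs, finish reason), which is what lets the scheduler move a request in and out of batches across many forward passes.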

**`Batch`** is a collection of `Request` objects that are processed together in a single forward pass. The scheduler assembles batches; the engine consumes them. `Batch` tracks which requests are in the prefill phase (processing prompt tokens) versus the decode phase (generating new tokens one at a time), because these two phases require different attention computation patterns.
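The prefill/decode distinction can be sketched as follows. This is a hypothetical illustration, not the real `Batch` API; the stand-in `Request` here keeps only the two fields needed to show the phase split:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    prompt_tokens: List[int]
    output_tokens: List[int]

    @property
    def in_prefill(self) -> bool:
        # No tokens generated yet: still processing the prompt.
        return len(self.output_tokens) == 0


@dataclass
class Batch:
    """Requests processed together in one forward pass."""
    requests: List[Request]

    def split_by_phase(self) -> Tuple[List[Request], List[Request]]:
        # The engine needs this split because prefill attends over many
        # prompt tokens at once, while decode attends for one new token.
        prefill = [r for r in self.requests if r.in_prefill]
        decode = [r for r in self.requests if not r.in_prefill]
        return prefill, decode
```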

**`SamplingParams`** encapsulates the user-facing knobs: temperature, top-p, top-k, max new tokens, and stop conditions. The design keeps sampling logic separate from the model itself—the model produces logits, and a dedicated `Sampler` (in `engine/sample.py`) applies these parameters. The file also provides **global context** accessors (get/set functions for tensor-parallel rank, world size, and the active KV-cache manager), making these values available as process-wide singletons rather than threading them through every function call.

python/minisgl/core.py — Request, SamplingParams, and Batch definitions (lines 1-60)

Runtime Configuration via Environment Variables: env.py

`python/minisgl/env.py` reads environment variables at import time and exposes them as module-level constants. This pattern lets operators and test harnesses configure the system without changing code—a common pattern in systems software where configuration must be decided before any objects are instantiated.

The most important flags are the **attention backend selector** (choosing between FlashAttention, FlashInfer, and TensorRT-LLM), the **communication mode** (whether to use PyNCCL or fall back to a simulated path for single-GPU testing), and toggles for CUDA graph capture and chunked prefill. You will see these constants used as `if env.USE_FLASHINFER:` conditionals sprinkled throughout `attention/`, `engine/`, and `scheduler/`.

Reading `env.py` early also tells you which features are optional versus required. For example, FlashInfer and TensorRT-LLM are only imported if their respective environment flags are set, protecting users who have not installed those libraries. This lazy-import pattern is important for keeping the package importable in diverse environments.
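One common way to implement this guard, shown here as a generic sketch rather than mini-sglang's actual code, is to probe for the module before importing it:

```python
import importlib
import importlib.util


def optional_import(module_name: str):
    """Import module_name only if it is installed; return None otherwise.

    This keeps the package importable for users who have not installed
    heavy optional backends such as flashinfer or tensorrt_llm.
    """
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


# The optional backend is only resolved when its flag selects it, e.g.:
#     flashinfer = optional_import("flashinfer") if USE_FLASHINFER else None
```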

python/minisgl/env.py — environment variable definitions and defaults (lines 1-50)
System Architecture: Server Launch and Inter-Process Communication