sgl-project/mini-sglang

mini-sglang is a minimal, educational re-implementation of an LLM inference serving engine inspired by SGLang. It provides a full stack from HTTP API server down to GPU kernels: a FastAPI-based HTTP/ZMQ server, a continuous-batching scheduler with paged KV-cache (naive and radix-tree prefix-sharing variants), a pluggable attention backend system (FlashAttention, FlashInfer, TensorRT-LLM), tensor-parallel distributed inference via NCCL, and support for multiple model architectures (LLaMA, Qwen2, Qwen3, Qwen3-MoE). The codebase is structured as a Python package with C++/CUDA/Triton extensions compiled via JIT and AOT loaders.

Key design decisions include: a message-passing architecture where the API server, tokenizer worker, scheduler, and engine workers communicate via ZMQ queues; a registry pattern for swappable attention and MoE backends; tensor-parallel linear layers with column/row partitioning; CUDA graph capture for fast decode-phase replay; and a radix-tree KV-cache manager for efficient prompt prefix reuse across requests. The scheduler separates prefill and decode phases explicitly, enabling chunked prefill and continuous batching.

This repository is aimed at ML systems engineers and researchers who want to understand how a production LLM serving system like SGLang or vLLM works internally, or who want a clean, hackable baseline for experimenting with new scheduling, attention, or caching strategies. It is not intended for production deployment but as a well-structured reference implementation with working benchmarks and tests.


Reading Guide

1. Foundations: Core Data Structures and Runtime Configuration
This chapter introduces the vocabulary of the entire codebase. You will learn the central data structures—Request, Batch, SamplingParams, and the global context—defined in core.py, and the environment-variable-driven configuration flags in env.py. Every other component in mini-sglang speaks in terms of these types, so understanding them first makes every subsequent file far easier to read.
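As a rough sketch of the vocabulary this chapter introduces, the central types might look something like the following. (Field and class shapes here are illustrative assumptions, not the actual definitions in core.py.)

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplingParams:
    """Per-request decoding knobs (field names are illustrative)."""
    temperature: float = 1.0
    top_p: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Request:
    """One in-flight generation request, tracked across prefill and decode."""
    rid: int
    input_ids: List[int]
    sampling: SamplingParams = field(default_factory=SamplingParams)
    output_ids: List[int] = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        return len(self.input_ids) + len(self.output_ids)

@dataclass
class Batch:
    """A set of requests scheduled to run together in one forward pass."""
    reqs: List[Request]

    @property
    def total_tokens(self) -> int:
        return sum(r.num_tokens for r in self.reqs)

req = Request(rid=0, input_ids=[1, 2, 3])
batch = Batch(reqs=[req])
print(batch.total_tokens)  # → 3
```

Because every component speaks in these types, the scheduler can reason about a Batch's total token count without knowing anything about the model behind it.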
2. System Architecture: Server Launch and Inter-Process Communication
This chapter explains how mini-sglang starts up and how its major processes communicate. You will trace the launch sequence in server/launch.py, understand the message types that flow between the API server, tokenizer worker, and scheduler via ZMQ, and see how the tokenizer worker bridges the gap between raw text and token IDs. Together these files reveal the message-passing architecture that decouples the HTTP-facing frontend from the GPU-bound backend.
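The shape of this message-passing design can be sketched with stdlib queues standing in for ZMQ sockets (the message type names below are illustrative, not the ones defined in the codebase):

```python
import queue
import threading
from dataclasses import dataclass
from typing import List

@dataclass
class TokenizeMsg:   # frontend -> tokenizer worker
    rid: int
    text: str

@dataclass
class ScheduleMsg:   # tokenizer worker -> scheduler
    rid: int
    input_ids: List[int]

def tokenizer_worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Bridges raw text and token IDs, decoupled from the HTTP frontend."""
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown sentinel
            break
        # toy "tokenizer": one token per whitespace-separated word
        ids = [hash(w) % 1000 for w in msg.text.split()]
        outbox.put(ScheduleMsg(msg.rid, ids))

to_tok, to_sched = queue.Queue(), queue.Queue()
worker = threading.Thread(target=tokenizer_worker, args=(to_tok, to_sched))
worker.start()
to_tok.put(TokenizeMsg(rid=0, text="hello mini sglang"))
to_tok.put(None)
worker.join()
result = to_sched.get()
print(len(result.input_ids))  # → 3
```

The real system runs these stages as separate processes connected by ZMQ, so the GPU-bound backend never blocks on HTTP parsing or tokenization.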
3. Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle
This chapter dives into the heart of mini-sglang's serving engine: the scheduler. You will understand how the scheduler orchestrates the continuous-batching loop, how it separates prefill and decode phases, and how it manages the paged KV-cache. The chapter covers the request state table, the cache manager interface, and the two concrete cache strategies—naive sequential and radix-tree prefix sharing—that are the key algorithmic contribution of this system.
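The core idea of radix-tree prefix sharing can be illustrated with a toy trie over token IDs: a new prompt's matched prefix length tells the scheduler how much prefill it can skip. (This is a minimal sketch; the actual radix tree compresses token runs into edges and tracks KV-page references and eviction.)

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_handle = None  # would reference cached KV pages

class RadixCache:
    """Toy trie over token IDs illustrating prefix reuse across requests."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, token_ids):
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, token_ids):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = RadixCache()
cache.insert([1, 2, 3, 4])               # first request's prompt
print(cache.match_prefix([1, 2, 3, 9]))  # → 3: only token 9 needs prefill
```

The naive strategy, by contrast, allocates KV pages sequentially per request, so identical prompt prefixes are recomputed every time.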
4. The Inference Engine: Forward Pass, Sampling, and CUDA Graphs
This chapter examines how mini-sglang runs the actual forward pass. You will see how the Engine class initializes the model and KV-cache, how it dispatches prefill and decode batches, how token sampling is applied to logits, and how CUDA graph capture dramatically speeds up the decode phase by eliminating Python overhead on the GPU critical path.
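The sampling step can be sketched in plain Python: temperature rescales the logits, and nucleus (top-p) filtering restricts sampling to the smallest set of tokens whose probability mass reaches top_p. (A hedged sketch of the general technique, not the engine's actual sampler.)

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0, rng=random.random):
    """Temperature + nucleus (top-p) sampling over one logits vector."""
    if temperature == 0:  # greedy decode
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep the smallest high-probability set whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # renormalize over the kept set and draw
    mass = sum(probs[i] for i in kept)
    r, acc = rng() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

print(sample([0.1, 3.0, 0.5], temperature=0))  # → 1 (greedy picks the max logit)
```

CUDA graph capture complements this on the model side: the decode forward pass is recorded once and replayed with new inputs, removing per-step Python dispatch overhead from the GPU critical path.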
5. Attention Backends: Pluggable Implementations and the FlashAttention Reference
This chapter covers mini-sglang's pluggable attention backend system. You will learn the abstract interface that all backends must satisfy, how the registry pattern enables swappable implementations, and how the FlashAttention backend is implemented as the canonical reference—including how it handles the fundamentally different computation patterns of prefill (full attention over the prompt) versus decode (single-token attention over the KV cache).
6. Model Layers and Tensor Parallelism
This chapter explains how mini-sglang implements transformer model layers with tensor parallelism. You will learn the layer base classes that unify weight loading, the column-parallel and row-parallel linear layers that partition weight matrices across GPUs, the multi-head attention layer that assembles these primitives with RoPE and optional QK-norm, and the distributed communication operations (all-reduce, all-gather) that stitch partial results back together.
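The column/row partitioning scheme can be checked numerically with a toy matmul: column-parallel layers split output columns across ranks and reassemble with an all-gather (here, list concatenation), while row-parallel layers split input rows and combine partial products with an all-reduce (here, elementwise sum). This is a single-process sketch of the arithmetic, not the NCCL implementation.

```python
def matmul(x, w):
    """x: (n,) vector, w: (n, m) matrix -> (m,) vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Column parallel: each "rank" owns half the output columns;
# an all-gather (concat) rebuilds the full output vector.
w_rank0 = [row[:2] for row in w]
w_rank1 = [row[2:] for row in w]
col_out = matmul(x, w_rank0) + matmul(x, w_rank1)

# Row parallel: each rank owns half the input rows and produces a
# full-width partial result; an all-reduce (sum) combines them.
y0 = matmul(x[:1], w[:1])
y1 = matmul(x[1:], w[1:])
row_out = [a + b for a, b in zip(y0, y1)]

print(col_out == matmul(x, w) == row_out)  # → True
```

In a transformer block the two are paired so communication is minimized: a column-parallel projection feeds a row-parallel one, needing only a single all-reduce at the end.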
7. Model Architectures, MoE, and GPU Kernels
This final chapter covers the top-level model implementations (LLaMA, Qwen2/3, Qwen3-MoE), the Mixture-of-Experts backend with its Triton fused kernels, and the low-level kernel infrastructure (JIT/AOT loading, KV-cache store kernels, and NCCL bindings). Together these show how the high-level model abstractions connect all the way down to GPU microcode, completing the full stack picture of mini-sglang.
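Before the fused Triton kernels, it helps to see the MoE routing math on its own: a gating network scores each expert, the top-k are selected, and their gate weights are renormalized. (A minimal per-token sketch; the fused implementation batches this across tokens and dispatches expert FFNs on the GPU.)

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(gate_logits)
    experts = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    mass = sum(probs[i] for i in experts)
    return [(i, probs[i] / mass) for i in experts]

# 4 experts; this token's gate strongly prefers experts 2 and 0
picks = route([1.0, -2.0, 3.0, 0.0], top_k=2)
print([i for i, _ in picks])  # → [2, 0]
```

The token's output is then the gate-weighted sum of the selected experts' FFN outputs, which is why only k of the experts ever run per token.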

Architecture

Kernel
A collection of GPU kernel modules providing low-level compute primitives (tensor ops, KV-cache store, radix operations, MoE computation, and NCCL-based distributed communication) along with JIT/AOT compilation utilities for a mini serving/inference framework.
Misc/Utils
A collection of general-purpose utility modules for the minisgl library, covering multiprocessing, logging, HuggingFace integration, hardware architecture detection, tensor operations, and component registration.
Minisgl/Layers
Defines the neural network layer building blocks for the minisgl inference engine, including attention, MoE, linear, embedding, normalization, activation, and rotary positional encoding layers with tensor-parallel support.
Minisgl/Attention
Implements a pluggable attention backend system for minisgl, supporting FlashAttention, FlashInfer, and TensorRT-LLM backends with a common base interface and registry-based factory.
Minisgl/Message
Defines and serializes structured message types passed between the frontend, tokenizer, and backend components of the MiniSGL inference pipeline.
Minisgl/Kvcache
Implements KV-cache management for LLM inference, providing base abstractions and two concrete strategies—naive sequential allocation and radix-tree-based prefix sharing—along with a multi-head attention memory pool.
Minisgl/Engine
Implements the core inference engine for MiniSGL, orchestrating model loading, KV-cache management, attention/MoE backends, sampling, and CUDA graph execution for LLM serving.
Python/Minisgl
Core data models, global context management, and entry-point wiring for the MiniSGL LLM inference server.
Tests/Kernel
Unit and performance tests for low-level kernel operations including distributed communication (NCCL all-reduce/all-gather), tensor indexing, KV store, and tensor utilities in the minisgl inference engine.
Minisgl/Server
Implements the minisgl inference server, including argument parsing, API endpoint definitions, and server launch logic for serving LLM completions over HTTP and ZMQ.
Minisgl/Tokenizer
Implements a tokenizer server worker that handles bidirectional tokenization and detokenization of messages between frontend and backend components using ZMQ queues.
Minisgl/Distributed
Provides tensor-parallel distributed communication primitives and process group metadata for the minisgl inference engine, enabling multi-GPU operations like all-reduce and all-gather via NCCL.
Minisgl/Moe
Implements Mixture-of-Experts (MoE) backends for the minisgl engine, providing both a base interface and a fused high-performance implementation using Triton kernels.
Include/Minisgl
C++ header files providing core abstractions for tensor operations, NCCL collective communication wrappers, and shared utility macros/functions for the MiniSGL GPU kernel layer.
Csrc/Src
C++ extension source files implementing low-level tensor utilities and radix tree key comparison operations used by the minisgl KV cache and kernel layers.
Minisgl/Benchmark
Provides benchmarking utilities for measuring LLM serving performance, including client-side request generation/tracing and CUDA kernel performance comparison tools.
Minisgl/Llm
Provides a high-level LLM inference interface that wraps the scheduler and core engine to support offline batch text generation.
Benchmark/Online
Online benchmark scripts for measuring inference throughput and latency using Qwen trace-based and simple synthetic workloads against a running model server.
Tests/Core
Integration tests for the core scheduling and cache allocation subsystems of the minisgl inference engine.
Benchmark/Offline
Offline benchmarking scripts that measure LLM inference throughput using synthetic and WildChat dataset prompts via the minisgl LLM engine.

Entry Points

python/minisgl/server/launch.py
python/minisgl/core.py
python/minisgl/__main__.py