Table of Contents

1. Foundations: Core Data Structures and Runtime Configuration
2. System Architecture: Server Launch and Inter-Process Communication
3. Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle
4. The Inference Engine: Forward Pass, Sampling, and CUDA Graphs
5. Attention Backends: Pluggable Implementations and the FlashAttention Reference
6. Model Layers and Tensor Parallelism
7. Model Architectures, MoE, and GPU Kernels

Model Architectures, MoE, and GPU Kernels

This final chapter covers the top-level model implementations (LLaMA, Qwen2/3, Qwen3-MoE), the Mixture-of-Experts backend with its Triton fused kernels, and the low-level kernel infrastructure (JIT/AOT loading, KV-cache store kernels, and NCCL bindings). Together these show how the high-level model abstractions connect all the way down to GPU microcode, completing the full stack picture of mini-sglang.

Model Implementations: LLaMA as the Canonical Example

`python/minisgl/models/llama.py` is the best model file to read first because LLaMA is the simplest architecture supported and all subsequent models (Qwen2, Qwen3, Qwen3-MoE) follow the same pattern. The file defines `LlamaDecoderLayer` (one transformer block: attention + MLP + two RMSNorms) and `LlamaModel` (the stack of blocks plus token embedding and LM head).

The model's `forward` method receives a `Batch` object (from `core.py`), not raw tensors, because the batch carries the KV-cache metadata needed to route the attention computation. Each `LlamaDecoderLayer.forward` calls `self.attention(batch, ...)` and `self.mlp(x)`, where `self.attention` is a `layers/attention.py` `Attention` instance and `self.mlp` uses `layers/linear.py` parallel linears. The entire residual stream flows through the layer stack, with all tensor-parallel communication handled inside the layer calls.
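The layer structure described above can be sketched in a few lines. This is a toy stand-in, not the actual `LlamaDecoderLayer`: `ToyDecoderLayer` is a hypothetical name, the attention and MLP are replaced by plain linears, and the `batch` argument is only threaded through to show where the KV-cache metadata would travel; the pre-norm residual wiring (two RMSNorms around attention and MLP) mirrors the description.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm applied before attention and before the MLP."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        var = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight

class ToyDecoderLayer(nn.Module):
    """Pre-norm block: x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, dim: int):
        super().__init__()
        self.input_norm = RMSNorm(dim)
        self.post_attn_norm = RMSNorm(dim)
        self.attn = nn.Linear(dim, dim, bias=False)  # stand-in for Attention(batch, ...)
        self.mlp = nn.Linear(dim, dim, bias=False)   # stand-in for the parallel MLP

    def forward(self, x, batch=None):
        # In mini-sglang, `batch` carries the KV-cache metadata into every
        # attention call; this sketch ignores it.
        x = x + self.attn(self.input_norm(x))
        x = x + self.mlp(self.post_attn_norm(x))
        return x

layer = ToyDecoderLayer(16)
out = layer(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 16])
```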

`models/register.py` maintains a `MODEL_REGISTRY` dictionary mapping architecture strings (e.g., `"LlamaForCausalLM"`) to their implementing classes. `models/__init__.py`'s `create_model` function reads the `architectures` field from the HuggingFace config (loaded via `models/config.py`'s `from_hf`), looks the name up in the registry, and instantiates the matching class. This registry pattern means adding a new model architecture requires only writing the model class and adding one line to the registry.
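A minimal sketch of this registry/factory pattern follows. The `MODEL_REGISTRY` name and the `architectures` lookup mirror the description above; the decorator, the toy `LlamaModel` body, and the exact `create_model` signature are illustrative assumptions, not the actual mini-sglang code.

```python
# Illustrative registry pattern: architecture string -> model class.
MODEL_REGISTRY = {}

def register_model(arch: str):
    """Decorator that records a class under its HF architecture string."""
    def wrap(cls):
        MODEL_REGISTRY[arch] = cls
        return cls
    return wrap

@register_model("LlamaForCausalLM")
class LlamaModel:
    def __init__(self, config):
        self.config = config

def create_model(hf_config: dict):
    # hf_config mimics a HuggingFace config dict with an "architectures" field.
    arch = hf_config["architectures"][0]
    try:
        cls = MODEL_REGISTRY[arch]
    except KeyError:
        raise ValueError(f"unsupported architecture: {arch}") from None
    return cls(hf_config)

model = create_model({"architectures": ["LlamaForCausalLM"]})
print(type(model).__name__)  # LlamaModel
```

The payoff is exactly the one-line extensibility the text describes: a new architecture needs only a class plus one registry entry, and the factory never grows a new `if` branch.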

python/minisgl/models/llama.py — LlamaDecoderLayer and LlamaModel (lines 1-100)
python/minisgl/models/register.py — MODEL_REGISTRY and get_model_class (lines 1-40)

Mixture-of-Experts: Routing, Dispatch, and Fused Kernels

`python/minisgl/layers/moe.py` implements the MoE routing layer used in Qwen3-MoE. The router is a linear layer that produces per-expert logits; the top-k experts are selected via `torch.topk`, and the expert outputs are weighted by softmax scores and summed. This sparse computation—activating only k of N experts per token—is what makes MoE models computationally efficient relative to their parameter count.
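The routing described above can be written as a naive reference in plain PyTorch. This is a hedged sketch of the algorithm, not the `layers/moe.py` API: `naive_moe` and its parameter names are invented, and it deliberately uses the slow loop-over-experts form that the fused kernel replaces.

```python
import torch

def naive_moe(x, router_w, experts, k=2):
    """Top-k MoE routing: router logits -> torch.topk -> softmax weights
    over the k selected experts -> weighted sum of expert outputs."""
    logits = x @ router_w                       # [tokens, num_experts]
    scores, idx = torch.topk(logits, k, dim=-1)
    weights = torch.softmax(scores, dim=-1)     # normalize over the k picked experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = idx == e                         # [tokens, k] slots routed to expert e
        if not mask.any():
            continue
        tok = mask.any(-1)                      # tokens that picked this expert
        w = (weights * mask).sum(-1, keepdim=True)[tok]
        out[tok] += w * expert(x[tok])          # only k of N experts touch each token
    return out

torch.manual_seed(0)
experts = [torch.nn.Linear(8, 8, bias=False) for _ in range(4)]
x = torch.randn(5, 8)
out = naive_moe(x, torch.randn(8, 4), experts)
print(out.shape)  # torch.Size([5, 8])
```

Each token runs through only `k` of the `N` experts, which is the source of the MoE compute savings relative to a dense model of the same parameter count.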

`python/minisgl/moe/fused.py` implements the fused MoE backend, which replaces the naive loop-over-experts with a single highly optimized Triton kernel call. The key optimization is **token batching by expert**: rather than iterating over experts and filtering tokens, the kernel groups tokens by their assigned expert, enabling coalesced memory access and high GPU utilization. The `kernel/triton/fused_moe.py` file contains the actual Triton kernel code.
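The "token batching by expert" idea can be illustrated in pure PyTorch: sort the flattened (token, expert) assignments by expert id so each expert owns one contiguous slice. This is a sketch of the grouping step only; the function name and return layout are assumptions, and the real work in `fused_moe.py` happens inside the Triton kernel.

```python
import torch

def group_tokens_by_expert(topk_ids: torch.Tensor):
    """topk_ids: [tokens, k] expert assignments. Returns the sorted expert
    ids, the source token for each sorted slot, and per-expert end offsets
    (expert e owns sorted slots offsets[e-1]:offsets[e])."""
    tokens, k = topk_ids.shape
    flat = topk_ids.reshape(-1)
    order = torch.argsort(flat, stable=True)   # group equal expert ids together
    token_of_slot = order // k                 # which token each sorted slot came from
    num_experts = int(flat.max().item()) + 1
    counts = torch.bincount(flat, minlength=num_experts)
    offsets = torch.cumsum(counts, 0)
    return flat[order], token_of_slot, offsets

ids = torch.tensor([[0, 2], [1, 2], [0, 1]])   # 3 tokens, top-2 routing
sorted_ids, tok, off = group_tokens_by_expert(ids)
print(sorted_ids.tolist())  # [0, 0, 1, 1, 2, 2]
```

Once tokens are contiguous per expert, each expert's GEMM reads a dense slice instead of a scattered mask, which is what enables the coalesced access the text mentions.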

The MoE backend follows the same registry/factory pattern as the attention backend: `moe/__init__.py` registers `FusedMoEBackend` and provides `create_moe_backend(name, config)`. The engine instantiates one MoE backend at startup and passes it to all `Qwen3MoE` layers, so the kernel selection is a one-time decision rather than a per-call branch.

python/minisgl/layers/moe.py — MoE routing layer and expert dispatch (lines 1-80)
python/minisgl/moe/fused.py — FusedMoEBackend with Triton kernel dispatch (lines 1-80)
python/minisgl/kernel/triton/fused_moe.py — Triton MoE forward and sum-reduction kernels (lines 1-100)

GPU Kernel Infrastructure: JIT/AOT Loading and the KV-Store Kernel

`python/minisgl/kernel/utils.py` is the infrastructure layer for all GPU kernels. It defines `KernelConfig` (which records the CUDA architecture, include paths, and compiler flags needed to build a kernel), and two loader functions: **`load_jit`** compiles a CUDA/C++ extension on first import using `torch.utils.cpp_extension.load`, caching the result so subsequent imports are fast; **`load_aot`** loads a pre-compiled shared library using `ctypes`, used for kernels that must be built offline (like NCCL wrappers that depend on system NCCL headers).
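The AOT path can be demonstrated with a minimal `ctypes` sketch. This is not the actual `load_aot` from `kernel/utils.py`; it loads the system math library (`libm`) purely because that library is guaranteed to exist, and shows the two steps the pattern requires: locating the prebuilt shared object and declaring the C signature of each symbol before calling it.

```python
import ctypes
import ctypes.util

def load_aot(libname: str) -> ctypes.CDLL:
    """Locate and load a pre-compiled shared library by short name."""
    path = ctypes.util.find_library(libname)
    if path is None:
        raise OSError(f"shared library {libname!r} not found")
    return ctypes.CDLL(path)

libm = load_aot("m")
# ctypes knows nothing about the C prototypes, so argument and return
# types must be declared explicitly before the first call.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double
print(libm.cos(0.0))  # 1.0
```

The JIT path (`load_jit`) differs in that `torch.utils.cpp_extension.load` compiles the sources on first use and caches the resulting extension, so no explicit signature declarations are needed on the Python side.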

`python/minisgl/kernel/store.py` provides the **KV-cache store kernel**, which writes newly computed key and value tensors into the correct positions in the GPU memory pool after each attention computation. This kernel is performance-critical because it runs on every forward pass for every layer. It uses the block table (constructed from `CacheMeta`) to compute the target memory addresses and performs a coalesced write. `kernel/index.py` similarly provides a JIT-compiled embedding index kernel as an alternative to PyTorch's built-in `embedding` that may be more efficient in specific tensor-parallel configurations.
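A pure-PyTorch reference for the store operation looks like the following. The names (`store_kv`, `slot_ids`) are illustrative, not the `store.py` API: `slot_ids` stands in for the destination addresses that the real kernel derives from the block table, and the indexed assignment stands in for the kernel's coalesced write.

```python
import torch

def store_kv(k_pool, v_pool, k_new, v_new, slot_ids):
    """k_pool/v_pool: [num_slots, num_heads, head_dim] preallocated pools.
    k_new/v_new:  [num_tokens, num_heads, head_dim] this step's K/V.
    slot_ids:     [num_tokens] destination slot per token."""
    k_pool[slot_ids] = k_new   # scatter each token's K rows into its slot
    v_pool[slot_ids] = v_new

pool_k = torch.zeros(8, 2, 4)
pool_v = torch.zeros(8, 2, 4)
k = torch.ones(3, 2, 4)
v = torch.full((3, 2, 4), 2.0)
store_kv(pool_k, pool_v, k, v, torch.tensor([5, 0, 3]))
print(pool_k[5, 0, 0].item())  # 1.0
```

The dedicated kernel exists because this scatter runs once per layer per forward pass, so avoiding the overhead of generic indexing ops pays off at decode time.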

The C++ headers in `kernel/csrc/include/minisgl/tensor.h` define the `Tensor` struct passed between Python and C++, providing shape, stride, dtype, and a raw data pointer. `nccl227.h` declares the NCCL API subset that `pynccl.py` uses. These headers form a stable ABI boundary between the Python runtime and the compiled extensions, and understanding them completes the picture from high-level Python request handling all the way to GPU memory operations.
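To make the ABI boundary concrete, here is a hypothetical Python-side mirror of such a struct using `ctypes.Structure`. The field set (data pointer, dtype tag, rank, fixed-size shape and stride arrays) follows the description above, but the actual layout of `tensor.h`'s `Tensor` struct may differ; `CTensor` and `MAX_DIMS` are invented names.

```python
import ctypes

MAX_DIMS = 8  # assumed fixed upper bound on tensor rank

class CTensor(ctypes.Structure):
    """Illustrative C-compatible tensor descriptor passed across the ABI."""
    _fields_ = [
        ("data", ctypes.c_void_p),               # raw device/host pointer
        ("dtype", ctypes.c_int32),               # dtype enum tag
        ("ndim", ctypes.c_int32),                # number of used dims
        ("shape", ctypes.c_int64 * MAX_DIMS),    # sizes per dim
        ("strides", ctypes.c_int64 * MAX_DIMS),  # element strides per dim
    ]

t = CTensor(data=0, dtype=0, ndim=2)
t.shape[0], t.shape[1] = 4, 16
t.strides[0], t.strides[1] = 16, 1   # row-major [4, 16]
print(t.ndim, list(t.shape[: t.ndim]))  # 2 [4, 16]
```

Because `ctypes.Structure` follows C layout rules, an instance like this can be passed by pointer into an extension expecting the corresponding C struct, which is what makes the header a stable boundary between the Python runtime and compiled kernels.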

python/minisgl/kernel/utils.py — KernelConfig, load_jit, and load_aot (lines 1-80)
python/minisgl/kernel/store.py — KV-cache store kernel JIT compilation and call (lines 1-50)
python/minisgl/kernel/csrc/include/minisgl/tensor.h — Tensor struct definition (lines 1-60)