This chapter explains how mini-sglang implements transformer model layers with tensor parallelism. You will see how the layer base classes unify weight loading, how the column-parallel and row-parallel linear layers partition weight matrices across GPUs, how the multi-head attention layer assembles these primitives with RoPE and optional QK-norm, and how the distributed communication operations (all-reduce, all-gather) stitch partial results back together.
`python/minisgl/layers/base.py` provides the foundation all layer implementations inherit from. The key abstraction is separating **weight loading** from **forward computation**. Each layer class declares its weights as metadata (`TensorMeta` objects describing shape, dtype, and parallelism strategy) rather than allocating them directly. A separate `load_weights` pass then fills in the actual tensor data from a HuggingFace state dict.
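A minimal sketch of this declare-then-load pattern is below. The class and field names (`TensorMeta`, `BaseLayer`, `ToyLinear`) are illustrative, not the actual mini-sglang API; the point is that construction records only metadata and a separate pass binds the data.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class TensorMeta:
    shape: Tuple[int, ...]          # full (unsharded) shape of the weight
    dtype: str = "float16"
    parallel: str = "replicated"    # "column", "row", or "replicated"
    data: Optional[list] = None     # filled in later by load_weights

class BaseLayer:
    """Layers declare weights as metadata; no tensors are allocated here."""

    def __init__(self) -> None:
        self.weights: Dict[str, TensorMeta] = {}

    def load_weights(self, state_dict: Dict[str, list]) -> None:
        # A separate pass binds real tensor data to the declared metadata.
        for name, meta in self.weights.items():
            meta.data = state_dict[name]

class ToyLinear(BaseLayer):
    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.weights["weight"] = TensorMeta(
            shape=(out_features, in_features), parallel="column"
        )

layer = ToyLinear(4, 8)
assert layer.weights["weight"].data is None        # declared, not allocated
layer.load_weights({"weight": [[0.0] * 4] * 8})    # weight-loading pass
assert len(layer.weights["weight"].data) == 8
```

Because the forward path only ever reads `meta.data`, the same layer definition works whether the data came from a full checkpoint, a shard, or a dequantized tensor.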
This design decouples model architecture from weight format, which matters when loading quantized weights, sharded checkpoints, or weights with different naming conventions. The `models/weight.py` module orchestrates the mapping from HuggingFace parameter names to mini-sglang layer names, handling the transpositions and reshaping that HuggingFace models often require.
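The name translation can be pictured as a small rule table. This is a hypothetical sketch, assuming regex-based renaming; the concrete rules and target names in `models/weight.py` will differ.

```python
import re

# Illustrative HF-to-internal rename rules; patterns and target names
# are hypothetical, not the actual table in models/weight.py.
_RULES = [
    (re.compile(r"^model\.layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$"),
     r"layers.\1.attn.qkv_proj.\2.weight"),
    (re.compile(r"^model\.layers\.(\d+)\.mlp\.(gate|up)_proj\.weight$"),
     r"layers.\1.mlp.gate_up_proj.\2.weight"),
    (re.compile(r"^model\.embed_tokens\.weight$"), "embed.weight"),
]

def map_hf_name(hf_name: str) -> str:
    for pattern, repl in _RULES:
        if pattern.match(hf_name):
            return pattern.sub(repl, hf_name)
    return hf_name  # pass through names that need no translation

assert (map_hf_name("model.layers.3.self_attn.q_proj.weight")
        == "layers.3.attn.qkv_proj.q.weight")
```

Routing the separate HF `q_proj`/`k_proj`/`v_proj` names to one fused destination is what lets the loader stack them into a single fused weight.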
The `TensorMeta` abstraction also carries **parallelism annotations**: a weight can be marked as column-parallel (partitioned along the output dimension), row-parallel (partitioned along the input dimension), or replicated. These annotations tell the weight loader how to shard the tensor across the `tp_size` GPUs without requiring the layer implementation to know anything about the distributed setup.
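The sharding a loader performs from these annotations amounts to slicing one axis by rank. A sketch with a hypothetical `shard` helper (weights stored `(out_features, in_features)` as in `torch.nn.Linear`):

```python
import numpy as np

# Hypothetical helper showing how a parallelism annotation drives sharding.
def shard(weight: np.ndarray, parallel: str, rank: int, tp_size: int) -> np.ndarray:
    if parallel == "column":      # split the output dimension (axis 0)
        return np.array_split(weight, tp_size, axis=0)[rank]
    if parallel == "row":         # split the input dimension (axis 1)
        return np.array_split(weight, tp_size, axis=1)[rank]
    return weight                 # replicated: every rank keeps a full copy

w = np.arange(32.0).reshape(8, 4)             # (out_features=8, in_features=4)
assert shard(w, "column", 0, 2).shape == (4, 4)
assert shard(w, "row", 1, 2).shape == (8, 2)
assert shard(w, "replicated", 1, 2).shape == (8, 4)
```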
`python/minisgl/layers/linear.py` implements the workhorse of tensor parallelism. **`ColumnParallelLinear`** partitions the weight matrix along the output (column) dimension: each GPU holds `output_size / tp_size` output features. After the matmul, each GPU holds a complete result for its own slice of the output features, so no communication is needed as long as the next operation consumes that sharded layout. **`RowParallelLinear`** partitions along the input (row) dimension: each GPU computes a partial matmul over its slice of the input, then an all-reduce sums the partial results to reconstruct the full output. Chaining a column-parallel layer into a row-parallel layer therefore costs only a single all-reduce at the end of the pair.
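The arithmetic behind both schemes can be checked numerically on one process by simulating `tp_size == 2` ranks. Weights are stored `(out_features, in_features)` as in `torch.nn.Linear`, so `y = x @ w.T`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))            # batch of 3, hidden size 8
w = rng.standard_normal((6, 8))            # 6 output features
y_ref = x @ w.T                            # unsharded reference

# Column-parallel: each rank owns a slice of the *output* features;
# concatenating the shards (the all-gather role) rebuilds the output.
w0, w1 = np.split(w, 2, axis=0)
y_col = np.concatenate([x @ w0.T, x @ w1.T], axis=1)

# Row-parallel: each rank owns a slice of the *input* features and sees
# only the matching slice of x; summing the partial outputs
# (the all-reduce role) rebuilds the output.
x0, x1 = np.split(x, 2, axis=1)
wr0, wr1 = np.split(w, 2, axis=1)
y_row = x0 @ wr0.T + x1 @ wr1.T

assert np.allclose(y_col, y_ref) and np.allclose(y_row, y_ref)
```

Note how the column-parallel output shards are exactly the input shards a row-parallel layer expects, which is why the pairing avoids intermediate communication.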
**`FusedQKVParallelLinear`** combines the Q, K, and V projections into a single matmul for efficiency, then splits the output into the Q, K, and V slices for this rank's local heads. The fusion matters for decode performance: at batch sizes of a few tokens, one larger matmul keeps the GPU better utilized than three small back-to-back launches. The file also contains `GateUpParallelLinear` for the MLP gate-and-up projections used in SiLU-gated FFNs.
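A sketch of the fused matmul and the subsequent split, with illustrative per-rank head counts (fewer KV heads than Q heads, as in grouped-query attention):

```python
import numpy as np

head_dim, num_q_heads, num_kv_heads = 4, 4, 2   # per-rank head counts (example)
hidden = 16
q_size = num_q_heads * head_dim                 # 16
kv_size = num_kv_heads * head_dim               # 8

rng = np.random.default_rng(1)
x = rng.standard_normal((3, hidden))            # 3 tokens
w_qkv = rng.standard_normal((q_size + 2 * kv_size, hidden))  # fused weight

qkv = x @ w_qkv.T                               # one matmul instead of three
q, k, v = np.split(qkv, [q_size, q_size + kv_size], axis=-1)

assert q.shape == (3, 16) and k.shape == (3, 8) and v.shape == (3, 8)
```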
The layers call `distributed/impl.py`'s `all_reduce` after `RowParallelLinear` and `all_gather` after certain column-parallel operations. These calls are no-ops when `tp_size == 1` (single GPU), so the same code path works for both single- and multi-GPU inference.
`python/minisgl/layers/attention.py` assembles the full multi-head attention layer from the primitives above. It holds a `FusedQKVParallelLinear` for input projection, an `OutputParallelLinear` for output projection, and optionally per-head QK layer norms (used in Qwen3). The forward method applies **Rotary Positional Embedding (RoPE)** via `layers/rotary.py` to the query and key tensors before dispatching to the attention backend.
RoPE is applied *after* the QKV projection and *before* calling the backend's `prefill_forward` or `decode_forward`. The `rotary.py` module pre-computes sine/cosine tables at initialization and applies them via an in-place kernel, avoiding repeated recomputation. The attention layer is the only place in the codebase that references the global attention backend (retrieved from the engine context), making it the integration point between the abstract backend system and the model computation.
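A sketch of RoPE with precomputed tables, written in the common "rotate half" formulation; this is an out-of-place illustration, whereas the actual `rotary.py` kernel applies the rotation in place.

```python
import numpy as np

def build_tables(max_pos: int, head_dim: int, base: float = 10000.0):
    # One frequency per dimension pair, precomputed once at init.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(max_pos), inv_freq)     # (max_pos, head_dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x: np.ndarray, pos: np.ndarray, cos: np.ndarray, sin: np.ndarray):
    # Rotate each (x1_i, x2_i) pair by the angle for its position and frequency.
    x1, x2 = np.split(x, 2, axis=-1)
    c, s = cos[pos], sin[pos]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)

cos, sin = build_tables(max_pos=128, head_dim=8)
q = np.random.default_rng(2).standard_normal((5, 8))    # 5 tokens, one head
q_rot = apply_rope(q, np.arange(5), cos, sin)

# Rotations preserve per-token vector norms.
assert np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1))
```

The same `apply_rope` is used for both query and key tensors, so relative position information survives in their dot products.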
Distributed communication (all-reduce on the output projection) is handled inside the layer rather than in the model, keeping the model implementation clean. This is a deliberate architectural choice: each layer is responsible for its own communication, so the model code reads like standard non-distributed code.
`python/minisgl/distributed/impl.py` provides the `all_reduce` and `all_gather` functions that tensor-parallel layers call. These delegate to `kernel/pynccl.py`, which wraps an AOT-compiled NCCL extension. NCCL (NVIDIA Collective Communications Library) is the standard library for GPU-to-GPU communication and provides highly optimized ring-allreduce and tree-based algorithms.
The `distributed/info.py` module stores the process group metadata—rank, world size, and whether this process is the primary rank—as module-level state. The primary rank is rank 0, which handles I/O: it receives input batches from the scheduler and sends output tokens back. Worker ranks (rank > 0) participate in forward passes but perform no I/O.
For single-GPU operation (`tp_size == 1`), all distributed calls are short-circuited to no-ops, verified by the `is_tp()` guard in `impl.py`. This means the same model code runs correctly in both single-GPU (zero communication overhead) and multi-GPU (full NCCL) modes, a key design goal for a hackable reference implementation.
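The short-circuit pattern can be sketched as follows. The module-level state and function names here only mirror the idea of `distributed/info.py` and `impl.py`; the real NCCL path is elided.

```python
# Module-level process-group state (illustrative); a real launcher would
# set these per process before the model runs.
_TP_RANK = 0
_TP_SIZE = 1   # single-GPU default

def is_tp() -> bool:
    return _TP_SIZE > 1

def all_reduce(tensor):
    if not is_tp():
        return tensor   # tp_size == 1: zero communication overhead
    # Multi-GPU path would delegate to the NCCL wrapper (kernel/pynccl.py).
    raise NotImplementedError("NCCL path elided in this sketch")

x = [1.0, 2.0, 3.0]
assert all_reduce(x) is x   # single GPU: the call is a no-op
```

Because the guard lives inside the communication primitive rather than in the layers, `RowParallelLinear` and the attention output projection contain no `if tp_size == 1` branches of their own.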