Table of Contents

1. Foundations: Core Data Structures and Runtime Configuration
2. System Architecture: Server Launch and Inter-Process Communication
3. Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle
4. The Inference Engine: Forward Pass, Sampling, and CUDA Graphs
5. Attention Backends: Pluggable Implementations and the FlashAttention Reference
6. Model Layers and Tensor Parallelism
7. Model Architectures, MoE, and GPU Kernels
Library > sgl-project/mini-sglang > Chapter 5

Attention Backends: Pluggable Implementations and the FlashAttention Reference

This chapter covers mini-sglang's pluggable attention backend system. You will learn the abstract interface that all backends must satisfy, how the registry pattern enables swappable implementations, and how the FlashAttention backend is implemented as the canonical reference—including how it handles the fundamentally different computation patterns of prefill (full attention over the prompt) versus decode (single-token attention over the KV cache).

The Backend Interface and Registry: attention/base.py and __init__.py

`python/minisgl/attention/base.py` defines the **abstract base class** for all attention backends. The key methods are `prefill_forward(q, k, v, batch, cache_meta)` and `decode_forward(q, k, v, batch, cache_meta)`. These signatures encode a fundamental design decision: prefill and decode are handled by separate code paths because they have very different computational characteristics. Prefill is compute-bound (full attention over potentially thousands of tokens); decode is memory-bandwidth-bound (attention over a growing KV cache for a single new token).
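As a minimal sketch of what such an abstract base class might look like (illustrative only; the actual class in `base.py` may name things differently or carry extra methods and state):

```python
from abc import ABC, abstractmethod

class AttentionBackend(ABC):
    """Sketch of a pluggable attention backend interface.

    Illustrative only -- the real AttentionBackend in base.py may differ
    in method names, arguments, and additional hooks.
    """

    @abstractmethod
    def prefill_forward(self, q, k, v, batch, cache_meta):
        """Full attention over the packed prompt tokens (compute-bound)."""

    @abstractmethod
    def decode_forward(self, q, k, v, batch, cache_meta):
        """One new token attending to the KV cache (memory-bandwidth-bound)."""
```

Because both methods are marked `@abstractmethod`, a backend that implements only one path fails at instantiation time rather than at inference time, which makes the prefill/decode split an enforced part of the contract.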

The base module also defines **`HybridBackend`**, a wrapper that combines a prefill backend and a decode backend that may come from different libraries. For example, you might use FlashAttention for prefill (where its memory-efficient chunked computation shines) and FlashInfer for decode (where its specialized paged-attention kernel is faster).
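The wrapper itself can be very small, since all it does is route each call to the appropriate backend. A sketch under the interface assumed above (the real `HybridBackend` may differ):

```python
class HybridBackend:
    """Sketch: delegate prefill and decode to two different backends.

    Illustrative only -- the real HybridBackend in base.py may differ.
    """

    def __init__(self, prefill_backend, decode_backend):
        self.prefill_backend = prefill_backend
        self.decode_backend = decode_backend

    def prefill_forward(self, q, k, v, batch, cache_meta):
        # e.g. FlashAttention's varlen prefill kernel
        return self.prefill_backend.prefill_forward(q, k, v, batch, cache_meta)

    def decode_forward(self, q, k, v, batch, cache_meta):
        # e.g. FlashInfer's paged decode kernel
        return self.decode_backend.decode_forward(q, k, v, batch, cache_meta)
```

Because the wrapper exposes the same two-method interface, the rest of the engine never needs to know whether it is talking to one backend or two.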

`python/minisgl/attention/__init__.py` implements the registry pattern using `utils/registry.py`. It calls `register` for each available backend (FA = FlashAttention, FI = FlashInfer, TRT-LLM = TensorRT-LLM) at import time, guarded by try/except imports so missing optional dependencies do not break the package. The `create_attention_backend(name, config)` factory function looks up the registry and instantiates the requested backend, raising a clear error if the name is unknown.
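The pattern can be sketched in a few lines (illustrative; the real registry lives in `utils/registry.py` and the names and error types may differ):

```python
# Sketch of a backend registry with guarded registration.
# Illustrative only -- the real code in attention/__init__.py and
# utils/registry.py may differ in names and structure.

_REGISTRY = {}

def register(name, factory):
    """Map a backend name to a factory callable."""
    _REGISTRY[name] = factory

def create_attention_backend(name, config):
    """Instantiate a registered backend, failing loudly on unknown names."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown attention backend {name!r}; "
            f"available: {sorted(_REGISTRY)}"
        ) from None
    return factory(config)

# Guarded registration: a missing optional dependency skips registration
# instead of breaking `import minisgl.attention`.
try:
    import flash_attn  # noqa: F401
    register("fa", lambda config: ("FlashAttentionBackend", config))
except ImportError:
    pass
```

The try/except around the import is the load-bearing detail: the package imports cleanly on a machine without `flash_attn` installed, and the user only sees an error if they explicitly request the missing backend by name.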

python/minisgl/attention/base.py — AttentionBackend ABC and HybridBackend (lines 1-80)
python/minisgl/attention/__init__.py — backend registry and create_attention_backend factory (lines 1-50)

The FlashAttention Backend: fa.py

`python/minisgl/attention/fa.py` is the reference implementation you should read first among the backends. It uses the `flash_attn` library's `flash_attn_varlen_func` for prefill, which supports variable-length sequences within a batch (necessary because different requests have different prompt lengths). The function takes cumulative sequence-length arrays (`cu_seqlens_q`, `cu_seqlens_k`) to describe the batch packing.
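To make the packing concrete, here is a sketch of how a cumulative sequence-length array describes a packed batch. (In the real code these are int32 `torch` tensors on the GPU; the helper name here is hypothetical.)

```python
from itertools import accumulate

def make_cu_seqlens(seq_lens):
    """Build a cumulative sequence-length array for varlen attention:
    entry i is the start offset of sequence i in the packed token tensor,
    and the final entry is the total number of tokens.

    Hypothetical helper for illustration; mini-sglang builds these as
    torch int32 tensors.
    """
    return [0] + list(accumulate(seq_lens))

# Three prompts of lengths 5, 3, and 7 packed into one 15-token tensor:
cu = make_cu_seqlens([5, 3, 7])
# cu == [0, 5, 8, 15]; the tokens of request i occupy slice cu[i]:cu[i+1]
```

Passing `cu_seqlens_q` and `cu_seqlens_k` in this form lets the kernel process all requests in one launch while still respecting each request's boundaries, instead of padding every prompt to the longest length in the batch.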

For decode, `fa.py` uses FlashAttention's paged attention API, passing the block table that maps each request's logical positions to physical KV-cache blocks. The block table is the key data structure linking the scheduler's KV-cache allocation (managed by `kvcache/`) to the GPU kernel that reads it. `attention/utils.py` provides the helper that constructs this block table tensor from the `CacheMeta` attached to each request in the batch.
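A block table is conceptually just each request's list of physical block ids, padded to a rectangle so the kernel can index it uniformly. A pure-Python sketch (the real helper in `attention/utils.py` produces a `torch` tensor, and the padding value is an assumption here):

```python
def build_block_table(request_block_ids, max_blocks):
    """Pad each request's physical block-id list to a rectangular table.

    Illustrative sketch of the idea behind the helper in
    attention/utils.py; the real code builds a torch tensor and its
    padding convention may differ.
    """
    PAD = -1  # assumed padding value for unused slots
    return [
        ids + [PAD] * (max_blocks - len(ids))
        for ids in request_block_ids
    ]

# Request 0 owns physical blocks 3 and 7; request 1 owns block 12 only.
table = build_block_table([[3, 7], [12]], max_blocks=2)
# table == [[3, 7], [12, -1]]
```

Row `i` of the table is all the kernel needs to translate request `i`'s logical token positions into physical KV-cache addresses, which is exactly the bridge between the scheduler's allocator and the GPU kernel described above.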

The CUDA graph integration is also here: the decode `forward` method checks whether a captured graph exists for the current batch size and either replays it or runs eagerly. This is the practical reason decode and prefill have separate code paths—CUDA graph capture is only viable for the fixed-shape decode case.
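The replay-or-eager dispatch amounts to a lookup keyed by batch size. A stripped-down sketch (illustrative; the real code stores and replays `torch.cuda.CUDAGraph` objects with pre-allocated static buffers, which this plain-Python version does not model):

```python
class GraphedDecode:
    """Sketch of replay-or-eager dispatch for the decode path.

    Illustrative only -- real CUDA graph integration captures kernels
    into torch.cuda.CUDAGraph objects and replays them against static
    input buffers; here a captured graph is just a callable.
    """

    def __init__(self, eager_fn):
        self.eager_fn = eager_fn
        self.graphs = {}  # batch_size -> captured graph (callable stand-in)

    def forward(self, batch):
        graph = self.graphs.get(len(batch))
        if graph is not None:
            return graph(batch)      # replay the captured graph
        return self.eager_fn(batch)  # no graph for this size: run eagerly
```

Keying on batch size works for decode precisely because each decode step processes exactly one new token per request, so the tensor shapes are fully determined by the batch size; prefill shapes vary with prompt length, which is why it cannot be captured this way.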

python/minisgl/attention/fa.py — prefill_forward, decode_forward, and CUDA graph integration (lines 1-100)
python/minisgl/attention/utils.py — block table construction helper (lines 1-40)