This chapter explains how mini-sglang starts up and how its major processes communicate. You will trace the launch sequence in server/launch.py, understand the message types that flow between the API server, tokenizer worker, and scheduler via ZMQ, and see how the tokenizer worker bridges the gap between raw text and token IDs. Together these files reveal the message-passing architecture that decouples the HTTP-facing frontend from the GPU-bound backend.
`python/minisgl/server/launch.py` is the orchestrator of the entire system. When you run `python -m minisgl`, execution flows through `__main__.py` → `server/__init__.py` → `launch.py`. Reading `launch.py` gives you the system's process map in one place.
The launch sequence spawns three separate OS processes: (1) a **tokenizer worker** that handles text-to-token and token-to-text conversion, (2) a **scheduler process** that manages batching, KV-cache allocation, and drives the engine workers, and (3) the **API server** (FastAPI/uvicorn) that accepts HTTP requests. Each process is connected to the others via **ZMQ push/pull and pub/sub sockets**. This design choice—multiple processes rather than multiple threads—sidesteps Python's GIL and lets the GPU-bound scheduler run independently of the I/O-bound HTTP server.
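The shape of this launch sequence can be sketched with the standard library's `multiprocessing` module. This is a hedged sketch, not the actual `launch.py`: `run_scheduler`, `run_api_server`, `launch`, and the `ipc://` addresses are illustrative stand-ins (only `tokenize_worker` is a name the chapter attributes to the codebase), and the real worker entry points take ZMQ addresses via their component constructors.

```python
# Illustrative sketch of the three-process launch flow. Function names and
# ipc:// addresses are hypothetical stand-ins for the real minisgl symbols.
import multiprocessing as mp


def tokenize_worker(recv_addr: str, send_addr: str) -> None:
    """Text <-> token-ID conversion (see tokenizer/server.py)."""
    ...


def run_scheduler(recv_addr: str, send_addr: str) -> None:
    """Batching, KV-cache allocation, engine stepping (hypothetical name)."""
    ...


def run_api_server(push_addr: str, sub_addr: str) -> None:
    """FastAPI/uvicorn HTTP frontend; blocks in the parent (hypothetical name)."""
    ...


def launch() -> None:
    # "spawn" avoids inheriting CUDA state through fork() in the children.
    ctx = mp.get_context("spawn")
    workers = [
        ctx.Process(target=tokenize_worker,
                    args=("ipc:///tmp/api2tok", "ipc:///tmp/tok2sched")),
        ctx.Process(target=run_scheduler,
                    args=("ipc:///tmp/tok2sched", "ipc:///tmp/sched2tok")),
    ]
    for w in workers:
        w.daemon = True  # reap workers if the API server exits
        w.start()
    # The API server runs in the launching process itself.
    run_api_server("ipc:///tmp/api2tok", "ipc:///tmp/tok2api")
```

The key design point survives the simplification: each child is a full OS process with its own interpreter and GIL, so the scheduler's GPU loop never contends with the HTTP server's event loop.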
The ZMQ socket addresses are passed as constructor arguments to each component, establishing a fixed topology: the API server pushes raw user requests to the tokenizer; the tokenizer pushes tokenized requests to the scheduler; the scheduler pushes finished outputs back to the tokenizer for detokenization; and the tokenizer publishes decoded text back to the API server. Understanding this flow is essential before reading any individual component.
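The four links can be wired up in miniature with pyzmq, assuming illustrative `inproc://` addresses. This single-process sketch only demonstrates the socket patterns; the real system runs each endpoint in a separate process with its own addresses, and the final tokenizer-to-API-server link uses pub/sub rather than the push/pull used here for simplicity.

```python
# Sketch of the fixed four-link topology using PUSH/PULL over inproc.
# Addresses are illustrative; minisgl passes its own to each constructor.
import zmq

ctx = zmq.Context.instance()


def link(addr: str):
    """Create one PUSH -> PULL link: the sender binds, the receiver connects."""
    push = ctx.socket(zmq.PUSH)
    push.bind(addr)
    pull = ctx.socket(zmq.PULL)
    pull.connect(addr)
    return push, pull


api_out, tok_in_text = link("inproc://api2tok")     # raw text requests
tok_out, sched_in = link("inproc://tok2sched")      # tokenized requests
sched_out, tok_in_ids = link("inproc://sched2tok")  # finished token IDs
reply_out, api_in = link("inproc://tok2api")        # decoded text (pub/sub in minisgl)

# Trace one message through the first hop:
api_out.send(b'{"text": "hello"}')
assert tok_in_text.recv() == b'{"text": "hello"}'
```

Push/pull gives each link a one-way queue with backpressure, which is exactly what a fixed pipeline topology needs; pub/sub on the return path lets the API server filter replies by request.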
`python/minisgl/message/backend.py`, `frontend.py`, and `tokenizer.py` define the data classes that flow over the ZMQ queues. Rather than passing arbitrary Python objects (which would mean pickling opaque state and blurring the boundaries between processes), mini-sglang defines an explicit, serializable message type for each link in the pipeline.
**`backend.py`** defines `UserRequest` (the tokenized request going into the scheduler), `BatchedMessage` (a batch of requests the scheduler sends to the engine), and `AbortSignal`. **`tokenizer.py`** defines `TokenizeRequest` and `DetokenizeRequest`—the messages the API server sends to the tokenizer worker. **`frontend.py`** defines `UserReply`, which carries generated text back to the HTTP client.
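As a rough sketch of what these definitions might look like, the four pipeline messages can be modeled as plain dataclasses. The field names below are illustrative guesses, not the actual minisgl definitions:

```python
# Hypothetical shapes for the pipeline messages; field names are guesses.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenizeRequest:  # API server -> tokenizer
    rid: str            # request id, used to route the eventual reply
    text: str
    sampling_params: dict = field(default_factory=dict)


@dataclass
class UserRequest:      # tokenizer -> scheduler
    rid: str
    input_ids: List[int]
    sampling_params: dict = field(default_factory=dict)


@dataclass
class DetokenizeRequest:  # scheduler -> tokenizer
    rid: str
    output_ids: List[int]
    finished: bool = False


@dataclass
class UserReply:        # tokenizer -> API server
    rid: str
    text: str
    finished: bool = False
```

Note how each hop narrows or widens the payload: text goes in, token IDs flow through the GPU side, and text comes back out, with the request id threading the whole round trip together.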
`python/minisgl/message/utils.py` provides the serialization layer. Because ZMQ transports raw bytes, every message must be encoded before sending and decoded after receiving; the module's `encode` and `decode` helpers handle this with a compact binary format. Keeping serialization separate from the message definitions makes it easy to add new message types without touching the transport layer. When reading the scheduler or API server code, you will see calls like `msg = decode(raw_bytes)` and `send(encode(reply))` that delegate entirely to this module.
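The actual wire format in `utils.py` is not shown here, but the interface shape can be sketched with a hand-rolled `struct`-based encoding for one message type. Everything below (the `UserReply` fields, the frame layout) is an illustrative assumption:

```python
# Stand-in compact binary codec for a hypothetical UserReply message:
# two little-endian u32 lengths, a bool flag, then the two string payloads.
# The real utils.py defines its own format; only the encode/decode
# interface shape is the point here.
import struct
from dataclasses import dataclass


@dataclass
class UserReply:
    rid: str
    text: str
    finished: bool


def encode(msg: UserReply) -> bytes:
    rid = msg.rid.encode("utf-8")
    text = msg.text.encode("utf-8")
    return struct.pack(f"<II?{len(rid)}s{len(text)}s",
                       len(rid), len(text), msg.finished, rid, text)


def decode(raw: bytes) -> UserReply:
    rid_len, text_len, finished = struct.unpack_from("<II?", raw)
    off = struct.calcsize("<II?")
    rid = raw[off:off + rid_len].decode("utf-8")
    text = raw[off + rid_len:off + rid_len + text_len].decode("utf-8")
    return UserReply(rid, text, finished)
```

Either side of a ZMQ link only ever sees `bytes`, so as long as both ends share these helpers, the transport layer stays oblivious to message semantics.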
`python/minisgl/tokenizer/server.py` implements the `tokenize_worker` process function. Its job is simple but critical: it sits between the HTTP-facing API server and the GPU-facing scheduler, converting between the two worlds. The API server receives strings; the scheduler operates on integer token IDs. The tokenizer worker handles both directions asynchronously.
`tokenize.py` wraps a HuggingFace tokenizer, handling special tokens, chat templates, and batching. `detokenize.py` manages incremental detokenization: as the scheduler emits new token IDs one step at a time, the detokenizer must reconstruct coherent text, handling multi-byte UTF-8 sequences and special tokens gracefully. The worker pulls from two ZMQ queues (one from the API server for new requests, one from the scheduler for completed token IDs) and pushes to two others. Because it runs as its own dedicated CPU-bound process, tokenization work never blocks the GPU-bound scheduler.
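One iteration of the worker's event loop might look like the sketch below, assuming plain-dict messages, pluggable `encode`/`decode` helpers, and a HuggingFace-style tokenizer with `encode` and `decode` methods. The function name and message fields are illustrative, and the real worker additionally keeps per-request incremental detokenization state rather than decoding whole sequences at once:

```python
# Hypothetical single step of the tokenizer worker's loop: poll both
# inbound PULL sockets and forward each pending message to its outbound
# PUSH socket. Field names ("rid", "text", ...) are illustrative.
import zmq


def tokenize_worker_step(from_api, from_sched, to_sched, to_api,
                         encode_fn, decode_fn, tokenizer) -> None:
    """Handle whatever is pending on either inbound socket, once."""
    poller = zmq.Poller()
    poller.register(from_api, zmq.POLLIN)
    poller.register(from_sched, zmq.POLLIN)
    for sock, _event in poller.poll(timeout=100):  # ms
        if sock is from_api:
            # New request from the HTTP frontend: text -> token IDs.
            req = decode_fn(sock.recv())
            ids = tokenizer.encode(req["text"])
            to_sched.send(encode_fn({"rid": req["rid"], "input_ids": ids}))
        else:
            # Finished step from the scheduler: token IDs -> text.
            out = decode_fn(sock.recv())
            text = tokenizer.decode(out["output_ids"])
            to_api.send(encode_fn({"rid": out["rid"], "text": text}))
```

Polling both queues in one loop is what lets a single worker serve both directions of the pipeline without either direction starving the other.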