sgl-project/mini-sglang

mini-sglang is a minimal, educational re-implementation of an LLM inference serving engine inspired by SGLang. It provides a full stack from HTTP API server down to GPU kernels: a FastAPI-based HTTP/ZMQ server, a continuous-batching scheduler with paged KV-cache (naive and radix-tree prefix-sharing variants), a pluggable attention backend system (FlashAttention, FlashInfer, TensorRT-LLM), tensor-parallel distributed inference via NCCL, and support for multiple model architectures (LLaMA, Qwen2, Qwen3, Qwen3-MoE). The codebase is structured as a Python package with C++/CUDA/Triton extensions compiled via JIT and AOT loaders.

Key design decisions include: a message-passing architecture where the API server, tokenizer worker, scheduler, and engine workers communicate via ZMQ queues; a registry pattern for swappable attention and MoE backends; tensor-parallel linear layers with column/row partitioning; CUDA graph capture for fast decode-phase replay; and a radix-tree KV-cache manager for efficient prompt prefix reuse across requests. The scheduler separates prefill and decode phases explicitly, enabling chunked prefill and continuous batching.

This repository is aimed at ML systems engineers and researchers who want to understand how a production LLM serving system like SGLang or vLLM works internally, or who want a clean, hackable baseline for experimenting with new scheduling, attention, or caching strategies. It is not intended for production deployment but as a well-structured reference implementation with working benchmarks and tests.


Reading Guide

1. Foundations: Core Data Structures and Runtime Configuration
This chapter introduces the vocabulary of the entire codebase. You will learn the central data structures—Request, Batch, SamplingParams, and the global context—defined in core.py, and the environment-variable-driven configuration flags in env.py. Every other component in mini-sglang speaks in terms of these types, so understanding them first makes every subsequent file far easier to read.
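As a rough sketch of the vocabulary this chapter introduces, the central types might look something like the following. (Field and class shapes here are illustrative assumptions, not the actual definitions in core.py.)

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplingParams:
    """Per-request decoding knobs (field names are illustrative)."""
    temperature: float = 1.0
    top_p: float = 1.0
    max_new_tokens: int = 128

@dataclass
class Request:
    """One in-flight generation request, tracked across prefill and decode."""
    rid: int
    input_ids: List[int]
    sampling: SamplingParams = field(default_factory=SamplingParams)
    output_ids: List[int] = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        return len(self.input_ids) + len(self.output_ids)

@dataclass
class Batch:
    """A set of requests scheduled to run together in one forward pass."""
    reqs: List[Request]

    @property
    def total_tokens(self) -> int:
        return sum(r.num_tokens for r in self.reqs)

req = Request(rid=0, input_ids=[1, 2, 3])
batch = Batch(reqs=[req])
print(batch.total_tokens)  # → 3
```

Because every component speaks in these types, the scheduler can reason about a Batch's total token count without knowing anything about the model behind it.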
2. System Architecture: Server Launch and Inter-Process Communication
This chapter explains how mini-sglang starts up and how its major processes communicate. You will trace the launch sequence in server/launch.py, understand the message types that flow between the API server, tokenizer worker, and scheduler via ZMQ, and see how the tokenizer worker bridges the gap between raw text and token IDs. Together these files reveal the message-passing architecture that decouples the HTTP-facing frontend from the GPU-bound backend.
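The shape of this message-passing design can be sketched with stdlib queues standing in for ZMQ sockets (the message type names below are illustrative, not the ones defined in the codebase):

```python
import queue
import threading
from dataclasses import dataclass
from typing import List

@dataclass
class TokenizeMsg:   # frontend -> tokenizer worker
    rid: int
    text: str

@dataclass
class ScheduleMsg:   # tokenizer worker -> scheduler
    rid: int
    input_ids: List[int]

def tokenizer_worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Bridges raw text and token IDs, decoupled from the HTTP frontend."""
    while True:
        msg = inbox.get()
        if msg is None:          # shutdown sentinel
            break
        # toy "tokenizer": one token per whitespace-separated word
        ids = [hash(w) % 1000 for w in msg.text.split()]
        outbox.put(ScheduleMsg(msg.rid, ids))

to_tok, to_sched = queue.Queue(), queue.Queue()
worker = threading.Thread(target=tokenizer_worker, args=(to_tok, to_sched))
worker.start()
to_tok.put(TokenizeMsg(rid=0, text="hello mini sglang"))
to_tok.put(None)
worker.join()
result = to_sched.get()
print(len(result.input_ids))  # → 3
```

The real system runs these stages as separate processes connected by ZMQ, so the GPU-bound backend never blocks on HTTP parsing or tokenization.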
3. Scheduling: Continuous Batching, KV-Cache, and the Request Lifecycle
This chapter dives into the heart of mini-sglang's serving engine: the scheduler. You will understand how the scheduler orchestrates the continuous-batching loop, how it separates prefill and decode phases, and how it manages the paged KV-cache. The chapter covers the request state table, the cache manager interface, and the two concrete cache strategies—naive sequential and radix-tree prefix sharing—that are the key algorithmic contribution of this system.
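The core idea of radix-tree prefix sharing can be illustrated with a toy trie over token IDs: a new prompt's matched prefix length tells the scheduler how much prefill it can skip. (This is a minimal sketch; the actual radix tree compresses token runs into edges and tracks KV-page references and eviction.)

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token id -> RadixNode
        self.kv_handle = None  # would reference cached KV pages

class RadixCache:
    """Toy trie over token IDs illustrating prefix reuse across requests."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, token_ids):
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, token_ids):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = RadixCache()
cache.insert([1, 2, 3, 4])               # first request's prompt
print(cache.match_prefix([1, 2, 3, 9]))  # → 3: only token 9 needs prefill
```

The naive strategy, by contrast, allocates KV pages sequentially per request, so identical prompt prefixes are recomputed every time.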
4. The Inference Engine: Forward Pass, Sampling, and CUDA Graphs
This chapter examines how mini-sglang runs the actual forward pass. You will see how the Engine class initializes the model and KV-cache, how it dispatches prefill and decode batches, how token sampling is applied to logits, and how CUDA graph capture dramatically speeds up the decode phase by eliminating Python overhead on the GPU critical path.
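The sampling step can be sketched in plain Python: temperature rescales the logits, and nucleus (top-p) filtering restricts sampling to the smallest set of tokens whose probability mass reaches top_p. (A hedged sketch of the general technique, not the engine's actual sampler.)

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0, rng=random.random):
    """Temperature + nucleus (top-p) sampling over one logits vector."""
    if temperature == 0:  # greedy decode
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep the smallest high-probability set whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # renormalize over the kept set and draw
    mass = sum(probs[i] for i in kept)
    r, acc = rng() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

print(sample([0.1, 3.0, 0.5], temperature=0))  # → 1 (greedy picks the max logit)
```

CUDA graph capture complements this on the model side: the decode forward pass is recorded once and replayed with new inputs, removing per-step Python dispatch overhead from the GPU critical path.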
5. Attention Backends: Pluggable Implementations and the FlashAttention Reference
This chapter covers mini-sglang's pluggable attention backend system. You will learn the abstract interface that all backends must satisfy, how the registry pattern enables swappable implementations, and how the FlashAttention backend is implemented as the canonical reference—including how it handles the fundamentally different computation patterns of prefill (full attention over the prompt) versus decode (single-token attention over the KV cache).
6. Model Layers and Tensor Parallelism
This chapter explains how mini-sglang implements transformer model layers with tensor parallelism. You will learn the layer base classes that unify weight loading, the column-parallel and row-parallel linear layers that partition weight matrices across GPUs, the multi-head attention layer that assembles these primitives with RoPE and optional QK-norm, and the distributed communication operations (all-reduce, all-gather) that stitch partial results back together.
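The column/row partitioning scheme can be checked numerically with a toy matmul: column-parallel layers split output columns across ranks and reassemble with an all-gather (here, list concatenation), while row-parallel layers split input rows and combine partial products with an all-reduce (here, elementwise sum). This is a single-process sketch of the arithmetic, not the NCCL implementation.

```python
def matmul(x, w):
    """x: (n,) vector, w: (n, m) matrix -> (m,) vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Column parallel: each "rank" owns half the output columns;
# an all-gather (concat) rebuilds the full output vector.
w_rank0 = [row[:2] for row in w]
w_rank1 = [row[2:] for row in w]
col_out = matmul(x, w_rank0) + matmul(x, w_rank1)

# Row parallel: each rank owns half the input rows and produces a
# full-width partial result; an all-reduce (sum) combines them.
y0 = matmul(x[:1], w[:1])
y1 = matmul(x[1:], w[1:])
row_out = [a + b for a, b in zip(y0, y1)]

print(col_out == matmul(x, w) == row_out)  # → True
```

In a transformer block the two are paired so communication is minimized: a column-parallel projection feeds a row-parallel one, needing only a single all-reduce at the end.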
7. Model Architectures, MoE, and GPU Kernels
This final chapter covers the top-level model implementations (LLaMA, Qwen2/3, Qwen3-MoE), the Mixture-of-Experts backend with its Triton fused kernels, and the low-level kernel infrastructure (JIT/AOT loading, KV-cache store kernels, and NCCL bindings). Together these show how the high-level model abstractions connect all the way down to GPU microcode, completing the full stack picture of mini-sglang.
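Before the fused Triton kernels, it helps to see the MoE routing math on its own: a gating network scores each expert, the top-k are selected, and their gate weights are renormalized. (A minimal per-token sketch; the fused implementation batches this across tokens and dispatches expert FFNs on the GPU.)

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(gate_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(gate_logits)
    experts = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    mass = sum(probs[i] for i in experts)
    return [(i, probs[i] / mass) for i in experts]

# 4 experts; this token's gate strongly prefers experts 2 and 0
picks = route([1.0, -2.0, 3.0, 0.0], top_k=2)
print([i for i, _ in picks])  # → [2, 0]
```

The token's output is then the gate-weighted sum of the selected experts' FFN outputs, which is why only k of the experts ever run per token.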

Architecture

Kernel
A collection of GPU kernel modules providing low-level compute primitives (tensor ops, KV-cache store, radix operations, MoE computation, and NCCL-based distributed communication) along with JIT/AOT compilation utilities for a mini serving/inference framework.
Misc/Utils
A collection of general-purpose utility modules for the minisgl library, covering multiprocessing, logging, HuggingFace integration, hardware architecture detection, tensor operations, and component registration.
Minisgl/Layers
Defines the neural network layer building blocks for the minisgl inference engine, including attention, MoE, linear, embedding, normalization, activation, and rotary positional encoding layers with tensor-parallel support.
Minisgl/Attention
Implements a pluggable attention backend system for minisgl, supporting FlashAttention, FlashInfer, and TensorRT-LLM backends with a common base interface and registry-based factory.
Minisgl/Message
Defines and serializes structured message types passed between the frontend, tokenizer, and backend components of the MiniSGL inference pipeline.
Minisgl/Kvcache
Implements KV-cache management for LLM inference, providing base abstractions and two concrete strategies—naive sequential allocation and radix-tree-based prefix sharing—along with a multi-head attention memory pool.
Minisgl/Engine
Implements the core inference engine for MiniSGL, orchestrating model loading, KV-cache management, attention/MoE backends, sampling, and CUDA graph execution for LLM serving.
Python/Minisgl
Core data models, global context management, and entry-point wiring for the MiniSGL LLM inference server.
Tests/Kernel
Unit and performance tests for low-level kernel operations including distributed communication (NCCL all-reduce/all-gather), tensor indexing, KV store, and tensor utilities in the minisgl inference engine.
Minisgl/Server
Implements the minisgl inference server, including argument parsing, API endpoint definitions, and server launch logic for serving LLM completions over HTTP and ZMQ.
Minisgl/Tokenizer
Implements a tokenizer server worker that handles bidirectional tokenization and detokenization of messages between frontend and backend components using ZMQ queues.
Minisgl/Distributed
Provides tensor-parallel distributed communication primitives and process group metadata for the minisgl inference engine, enabling multi-GPU operations like all-reduce and all-gather via NCCL.
Minisgl/Moe
Implements Mixture-of-Experts (MoE) backends for the minisgl engine, providing both a base interface and a fused high-performance implementation using Triton kernels.
Include/Minisgl
C++ header files providing core abstractions for tensor operations, NCCL collective communication wrappers, and shared utility macros/functions for the MiniSGL GPU kernel layer.
Csrc/Src
C++ extension source files implementing low-level tensor utilities and radix tree key comparison operations used by the minisgl KV cache and kernel layers.
Minisgl/Benchmark
Provides benchmarking utilities for measuring LLM serving performance, including client-side request generation/tracing and CUDA kernel performance comparison tools.
Minisgl/Llm
Provides a high-level LLM inference interface that wraps the scheduler and core engine to support offline batch text generation.
Benchmark/Online
Online benchmark scripts for measuring inference throughput and latency using Qwen trace-based and simple synthetic workloads against a running model server.
Tests/Core
Integration tests for the core scheduling and cache allocation subsystems of the minisgl inference engine.
Benchmark/Offline
Offline benchmarking scripts that measure LLM inference throughput using synthetic and WildChat dataset prompts via the minisgl LLM engine.

Entry Points

python/minisgl/server/launch.py
python/minisgl/core.py
python/minisgl/__main__.py