Kernel
A collection of GPU kernel modules providing low-level compute primitives (tensor ops, KV-cache store, radix operations, MoE computation, and NCCL-based distributed communication) along with JIT/AOT compilation utilities for the minisgl serving/inference framework.
Minisgl/Models
Defines, registers, and loads multiple LLM architectures (LLaMA, Qwen2, Qwen3, Qwen3-MoE) with shared configuration, base classes, and weight-loading utilities for the minisgl inference engine.
Misc/Utils
A collection of general-purpose utility modules for the minisgl library, covering multiprocessing, logging, HuggingFace integration, hardware architecture detection, tensor operations, and component registration.
Minisgl/Layers
Defines the neural network layer building blocks for the minisgl inference engine, including attention, MoE, linear, embedding, normalization, activation, and rotary positional encoding layers with tensor-parallel support.
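A minimal rotary positional embedding (RoPE) sketch in plain Python, assuming the standard pairwise-rotation formulation; minisgl's real layer operates on batched tensors with fused kernels:

```python
# Rotate consecutive (even, odd) feature pairs by position-dependent
# angles; pair i uses frequency base**(-i/d), as in the standard RoPE
# formulation. Pure-Python sketch, not minisgl's tensor implementation.
import math
from typing import List

def rope(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because each pair undergoes an orthogonal rotation, the vector norm is preserved, and position 0 leaves the input unchanged.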
Minisgl/Scheduler
Implements the request scheduler for the MiniSGL LLM serving system, managing prefill and decode batching, KV-cache allocation, and I/O queuing across the inference pipeline.
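A hypothetical sketch of the prefill/decode batching idea (heavily simplified; the real scheduler also manages KV-cache allocation and I/O queues): waiting requests are admitted for prefill up to a token budget, then join the running set, which decodes one token per request per step.

```python
# Toy continuous-batching scheduler: admit prompts under a prefill
# token budget, decode all running requests each step, and retire
# requests that reach their generation limit. Names are illustrative.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class Scheduler:
    def __init__(self, prefill_budget: int) -> None:
        self.waiting: deque = deque()
        self.running: list = []
        self.prefill_budget = prefill_budget

    def add(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self):
        """Return (prefilled rids, decoded rids) for this iteration."""
        # Prefill: admit waiting requests while their prompts fit the budget.
        budget, prefilled = self.prefill_budget, []
        while self.waiting and self.waiting[0].prompt_len <= budget:
            req = self.waiting.popleft()
            budget -= req.prompt_len
            self.running.append(req)
            prefilled.append(req.rid)
        # Decode: every running request produces one token.
        decoded = []
        for req in self.running:
            req.generated += 1
            decoded.append(req.rid)
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return prefilled, decoded
```

The budget check is what keeps a long prompt from starving decode steps of in-flight requests.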
Minisgl/Attention
Implements a pluggable attention backend system for minisgl, supporting FlashAttention, FlashInfer, and TensorRT-LLM backends with a common base interface and registry-based factory.
Minisgl/Message
Defines and serializes structured message types passed between the frontend, tokenizer, and backend components of the MiniSGL inference pipeline.
Minisgl/Kvcache
Implements KV-cache management for LLM inference, providing base abstractions and two concrete strategies—naive sequential allocation and radix-tree-based prefix sharing—along with a multi-head attention memory pool.
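A toy token-level trie illustrating the prefix-sharing idea (the real cache uses a compressed radix tree over KV-cache pages and also handles eviction): cached token sequences share a tree, and a lookup returns how many leading tokens of a new request are already cached.

```python
# Prefix-sharing sketch: insert token sequences into a trie, then
# match_prefix reports the longest cached prefix of a new sequence,
# i.e. the tokens whose KV entries could be reused instead of recomputed.
class RadixNode:
    def __init__(self) -> None:
        self.children: dict = {}

class RadixCache:
    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, tokens: list) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens: list) -> int:
        """Return the number of leading tokens already cached."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```

The naive sequential strategy mentioned above skips this lookup entirely and always recomputes the full prompt.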
Minisgl/Engine
Implements the core inference engine for MiniSGL, orchestrating model loading, KV-cache management, attention/MoE backends, sampling, and CUDA graph execution for LLM serving.
Python/Minisgl
Defines the core data models, global context management, and entry-point wiring for the MiniSGL LLM inference server.
Tests/Kernel
Unit and performance tests for low-level kernel operations including distributed communication (NCCL all-reduce/all-gather), tensor indexing, KV store, and tensor utilities in the minisgl inference engine.
Minisgl/Server
Implements the minisgl inference server, including argument parsing, API endpoint definitions, and server launch logic for serving LLM completions over HTTP and ZMQ.
Minisgl/Tokenizer
Implements a tokenizer server worker that tokenizes incoming requests and detokenizes generated output, relaying messages between the frontend and backend components over ZMQ queues.
Minisgl/Distributed
Provides tensor-parallel distributed communication primitives and process group metadata for the minisgl inference engine, enabling multi-GPU operations like all-reduce and all-gather via NCCL.
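A pure-Python simulation of what these collectives compute (NCCL performs them across GPUs; here each "rank" is just a list): after all-reduce every rank holds the element-wise sum of all shards, and after all-gather every rank holds the concatenation.

```python
# Semantics-only sketch of the two collectives named above. This models
# the math, not the communication: real NCCL overlaps ring/tree
# transfers across devices.
def all_reduce_sum(rank_tensors: list) -> list:
    """Every rank ends up with the element-wise sum over all ranks."""
    reduced = [sum(vals) for vals in zip(*rank_tensors)]
    return [list(reduced) for _ in rank_tensors]

def all_gather(rank_tensors: list) -> list:
    """Every rank ends up with the concatenation of all shards."""
    gathered = [x for shard in rank_tensors for x in shard]
    return [list(gathered) for _ in rank_tensors]
```

In tensor parallelism, all-reduce combines partial matmul results from row-parallel layers, while all-gather reassembles column-parallel outputs.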
Minisgl/Moe
Implements Mixture-of-Experts (MoE) backends for the minisgl engine, providing both a base interface and a fused high-performance implementation using Triton kernels.
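A naive top-k expert-routing sketch (pure Python with softmax over the selected experts; the fused backend does this with Triton kernels over batched tensors): each token picks its k highest-scoring experts and mixes their outputs by normalized router weights.

```python
# Top-k MoE routing sketch: select the k largest router logits and
# softmax-normalize them into mixing weights. Illustrative only.
import math

def route_topk(scores: list, k: int) -> list:
    """Return (expert_id, weight) pairs for the top-k experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]
```

The fused implementation's win comes from grouping tokens by expert so each expert runs one batched matmul instead of per-token dispatch.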
Include/Minisgl
C++ header files providing core abstractions for tensor operations, NCCL collective communication wrappers, and shared utility macros/functions for the MiniSGL GPU kernel layer.
Csrc/Src
C++ extension source files implementing low-level tensor utilities and radix tree key comparison operations used by the minisgl KV cache and kernel layers.
Minisgl/Benchmark
Provides benchmarking utilities for measuring LLM serving performance, including client-side request generation/tracing and CUDA kernel performance comparison tools.
Minisgl/Llm
Provides a high-level LLM inference interface that wraps the scheduler and core engine to support offline batch text generation.
Benchmark/Online
Online benchmark scripts for measuring inference throughput and latency using Qwen trace-based and simple synthetic workloads against a running model server.
Tests/Core
Integration tests for the core scheduling and cache allocation subsystems of the minisgl inference engine.
Benchmark/Offline
Offline benchmarking scripts that measure LLM inference throughput using synthetic and WildChat dataset prompts via the minisgl LLM engine.