Mini-SGLang is designed as a distributed system with multiple independent processes working together to handle LLM inference efficiently.
Architecture Overview
The system consists of several key components that communicate via ZeroMQ (ZMQ) for control messages and NCCL for GPU tensor data:
Core Components
API Server
Implementation: python/minisgl/server/api_server.py
The API Server is the entry point for user requests:
- Provides OpenAI-compatible HTTP endpoints:
  - /v1/chat/completions - Chat completion API
  - /v1/completions - Text completion API
- Built with FastAPI for async request handling
- Receives text prompts from users
- Returns streaming or non-streaming responses
- Handles request routing to the tokenizer
- Manages response streaming back to clients
Communication:
- Receives HTTP requests from users
- Sends tokenization requests to Tokenizer via ZMQ
- Receives generated text from Detokenizer via ZMQ
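The streaming path can be sketched with a plain asyncio generator. This is a hypothetical illustration, not the actual server code: an asyncio.Queue stands in for the ZMQ socket that delivers decoded chunks from the Detokenizer, and `stream_chunks`/`demo` are invented names.

```python
import asyncio
from typing import AsyncIterator

# Hypothetical sketch: the real server (python/minisgl/server/api_server.py)
# receives decoded chunks from the Detokenizer over ZMQ; here an asyncio.Queue
# stands in for that ZMQ pull socket.
async def stream_chunks(detok_queue: asyncio.Queue) -> AsyncIterator[str]:
    """Yield decoded text chunks until the end-of-stream sentinel (None)."""
    while True:
        chunk = await detok_queue.get()
        if chunk is None:  # end-of-stream marker
            break
        yield chunk

async def demo() -> str:
    q: asyncio.Queue = asyncio.Queue()
    for part in ["Hello", ", ", "world", None]:
        q.put_nowait(part)
    return "".join([c async for c in stream_chunks(q)])

print(asyncio.run(demo()))  # Hello, world
```

In the real server, FastAPI wraps such an async generator in a streaming HTTP response, so each chunk reaches the client as soon as the Detokenizer emits it.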
Tokenizer Worker
Implementation: python/minisgl/tokenizer/server.py
The Tokenizer Worker converts text to tokens:
- Runs as an independent process
- Loads the tokenizer model (from HuggingFace)
- Converts input text strings into token IDs
- Handles special tokens and chat templates
- Forwards tokenized requests to Scheduler
Communication:
- Receives tokenization requests from API Server via ZMQ
- Sends tokenized data to Scheduler (Rank 0) via ZMQ
- Uses the TokenizeMsg message type from python/minisgl/message/tokenizer.py
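The worker loop can be sketched as follows. Everything here is a stand-in: `queue.Queue` replaces the ZMQ sockets, the toy whitespace tokenizer replaces the HuggingFace tokenizer, and the `TokenizeMsg` fields are invented (the real type is in python/minisgl/message/tokenizer.py).

```python
import queue
from dataclasses import dataclass, field

# Hypothetical stand-ins for the real ZMQ queues and HuggingFace tokenizer.
@dataclass
class TokenizeMsg:
    request_id: int
    text: str
    token_ids: list = field(default_factory=list)

TOY_VOCAB = {"hello": 1, "world": 2}

def toy_tokenize(text: str) -> list:
    """Whitespace tokenizer as a stand-in for a real HF tokenizer."""
    return [TOY_VOCAB.get(w, 0) for w in text.lower().split()]

def tokenizer_worker_step(from_api: queue.Queue, to_scheduler: queue.Queue) -> None:
    """One iteration of the worker loop: pull, tokenize, forward to Rank 0."""
    msg = from_api.get()
    msg.token_ids = toy_tokenize(msg.text)
    to_scheduler.put(msg)

# Usage
inbox, outbox = queue.Queue(), queue.Queue()
inbox.put(TokenizeMsg(request_id=0, text="hello world"))
tokenizer_worker_step(inbox, outbox)
print(outbox.get().token_ids)  # [1, 2]
```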
Detokenizer Worker
Implementation: python/minisgl/tokenizer/detokenize.py and python/minisgl/tokenizer/server.py
The Detokenizer Worker converts tokens back to text:
- Runs within the tokenizer worker process
- Receives token IDs from Scheduler
- Converts tokens into human-readable text
- Handles incremental decoding for streaming
- Sends decoded text back to API Server
Communication:
- Receives token IDs from Scheduler (Rank 0) via ZMQ
- Sends decoded text to API Server via ZMQ
- Uses the DetokenizeMsg message type from python/minisgl/message/tokenizer.py
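Incremental decoding exists because a token boundary rarely aligns with a character boundary: decoding each token in isolation can garble multi-token pieces. A common approach, sketched below with an invented toy vocabulary (not the project's actual detokenizer), is to decode the full sequence each step and emit only the new suffix.

```python
# Sketch of incremental (streaming) detokenization: decode the full token
# sequence each step and emit only the new suffix, so multi-token pieces
# render correctly. The toy vocab stands in for a real HF tokenizer.
TOY_ID_TO_PIECE = {1: "Hel", 2: "lo", 3: " wor", 4: "ld"}

class IncrementalDetokenizer:
    def __init__(self) -> None:
        self.token_ids: list = []
        self.emitted = ""

    def step(self, token_id: int) -> str:
        """Add one generated token and return the newly decoded text delta."""
        self.token_ids.append(token_id)
        full = "".join(TOY_ID_TO_PIECE[t] for t in self.token_ids)
        delta = full[len(self.emitted):]
        self.emitted = full
        return delta

# Usage: the deltas concatenate back to the full decoded text
detok = IncrementalDetokenizer()
chunks = [detok.step(t) for t in [1, 2, 3, 4]]
print(chunks)           # ['Hel', 'lo', ' wor', 'ld']
print("".join(chunks))  # Hello world
```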
Scheduler Worker
Implementation: python/minisgl/scheduler/scheduler.py
The Scheduler is the core orchestrator of inference:
- One scheduler per GPU in multi-GPU (Tensor Parallel) setups
- Each scheduler is called a TP Rank
- Manages request queuing and batching
- Allocates KV cache resources
- Controls the inference engine on its GPU
- Implements continuous batching for efficiency
- Receives tokenized requests from Tokenizer
- Broadcasts requests to all other scheduler ranks
- Collects generated tokens from Engine
- Sends tokens to Detokenizer
- Handles abort and control messages
Non-zero ranks (Rank 1+):
- Receive broadcast requests from Rank 0
- Run inference in parallel with Rank 0
- Participate in tensor-parallel computation
- Synchronize via NCCL for tensor operations
Key modules:
- Request table management (python/minisgl/scheduler/table.py)
- Batch preparation for prefill and decode (python/minisgl/scheduler/prefill.py, python/minisgl/scheduler/decode.py)
- KV cache allocation (python/minisgl/scheduler/cache.py)
- Message I/O handling (python/minisgl/scheduler/io.py)
Communication:
- Rank 0 receives messages via ZMQ from Tokenizer
- Rank 0 sends messages via ZMQ to Detokenizer
- All ranks synchronize via NCCL (torch.distributed) for tensor data
- Inter-scheduler communication for distributed inference
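The continuous-batching behavior can be illustrated with a toy loop. This is an invented sketch, not the scheduler's real code: the `Req` fields, the request table as a plain dict, and the fake token values are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Toy sketch of continuous batching: new requests join the running batch as
# soon as they arrive, and finished requests leave without stalling others.
@dataclass
class Req:
    request_id: int
    prompt_ids: list
    max_new_tokens: int
    output_ids: list = field(default_factory=list)
    prefilled: bool = False

def scheduler_step(table: dict) -> None:
    """One iteration: prefill new requests, decode running ones, retire done ones."""
    prefill = [r for r in table.values() if not r.prefilled]
    decode = [r for r in table.values() if r.prefilled]
    for r in prefill:                 # first pass over the prompt
        r.prefilled = True
        r.output_ids.append(100)      # toy "first generated token"
    for r in decode:                  # one new token per running request
        r.output_ids.append(100 + len(r.output_ids))
    for rid in [r.request_id for r in table.values()
                if len(r.output_ids) >= r.max_new_tokens]:
        del table[rid]                # finished: free its slot immediately

table = {0: Req(0, [1, 2], max_new_tokens=1)}
scheduler_step(table)                 # req 0 prefills, finishes, is retired
table[1] = Req(1, [3], max_new_tokens=2)
scheduler_step(table)
scheduler_step(table)
print(table)  # {} (both requests completed without blocking each other)
```

The point of the pattern: the batch composition changes every step, so a short request never waits for a long one to finish.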
Engine
Implementation: python/minisgl/engine/engine.py
The Engine is the TP worker on a single GPU:
- One engine per GPU process
- Loads and manages the LLM model
- Manages the inference context (Context from python/minisgl/core.py)
- Controls the KV cache pool
- Selects and uses attention backend (FlashAttention, FlashInfer, TensorRT-LLM)
- Implements CUDA graph capture for decode optimization
- Performs actual model forward passes
- Executes sampling to generate next tokens
- Model loading and weight sharding
- Attention backend management (python/minisgl/attention/)
- KV cache management (python/minisgl/kvcache/)
- CUDA graph optimization (python/minisgl/engine/graph.py)
- Token sampling (python/minisgl/engine/sample.py)
Communication:
- Controlled by local Scheduler via function calls (same process)
- Participates in NCCL collectives for tensor-parallel operations
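The sampling step can be sketched in plain Python. The real implementation (python/minisgl/engine/sample.py) operates on GPU tensors; the function below is an assumed illustration of the common temperature-plus-top-k recipe, not the project's actual signature.

```python
import math
import random

# Sketch of the sampling step: temperature scaling followed by top-k
# sampling over a plain list of logits.
def sample_token(logits, temperature=1.0, top_k=1, rng=None):
    """Pick the next token id; top_k=1 (or temperature <= 0) is greedy decoding."""
    if temperature <= 0.0 or top_k == 1:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # keep the k highest logits, weight by softmax, then sample
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = [math.exp(logits[i] / temperature) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.5, 0.3, 1.9]
print(sample_token(logits))  # 1 (greedy: index of the highest logit)
print(sample_token(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```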
Communication Protocols
ZeroMQ (ZMQ)
Purpose: Control message passing between processes
Implementation: python/minisgl/utils/mp.py
Mini-SGLang uses ZMQ for lightweight inter-process communication:
- Push/Pull Queues: Point-to-point request/response
  - ZmqPushQueue - Send messages
  - ZmqPullQueue - Receive messages
  - ZmqAsyncPushQueue, ZmqAsyncPullQueue - Async variants
- Pub/Sub Queues: Broadcast to multiple subscribers
  - ZmqPubQueue - Publish messages
  - ZmqSubQueue - Subscribe to messages
- Message types defined in python/minisgl/message/
- Support automatic serialization/deserialization
- Type-safe message passing
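Typed message passing over a byte transport can be sketched as a dataclass plus a serialization round-trip. The field names below are invented for illustration (the real types live in python/minisgl/message/), and pickle stands in for whatever wire format the queues actually use.

```python
import pickle
from dataclasses import dataclass

# Sketch of typed message passing: each message is a dataclass, serialized to
# bytes before going over a ZMQ socket and deserialized on the other side.
@dataclass
class DetokenizeMsg:
    request_id: int
    text_delta: str
    finished: bool

def send_bytes(msg) -> bytes:
    """What a push queue would put on the wire."""
    return pickle.dumps(msg)

def recv_msg(raw: bytes):
    """What a pull queue would hand back to the worker."""
    return pickle.loads(raw)

msg = DetokenizeMsg(request_id=7, text_delta="Hello", finished=False)
assert recv_msg(send_bytes(msg)) == msg  # round-trips intact
```

Because each message is a concrete class, the receiving worker can dispatch on type rather than parsing loose dictionaries, which is what "type-safe message passing" buys.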
NCCL (NVIDIA Collective Communications Library)
Purpose: High-performance GPU-to-GPU tensor communication
Implementation: python/minisgl/distributed/impl.py, python/minisgl/kernel/pynccl.py
NCCL is used for tensor parallelism:
- All-Reduce: Aggregate results across GPUs (e.g., after row-parallel linear layers)
- All-Gather: Collect tensors from all GPUs
- Broadcast: Send tensor from one GPU to all others
- Synchronizes model weights across TP ranks
- Combines partial results from parallel computations
- Used by tensor-parallel layers in python/minisgl/layers/
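The semantics of the three collectives can be shown with plain Python lists. This is only an emulation of what NCCL computes; the real calls go through the pynccl bindings and operate on GPU tensors.

```python
# Pure-Python emulation of the three NCCL collectives, to show their
# semantics; each inner list is one rank's local tensor.
def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' tensors."""
    summed = [sum(vals) for vals in zip(*per_rank)]
    return [list(summed) for _ in per_rank]

def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' tensors."""
    gathered = [x for shard in per_rank for x in shard]
    return [list(gathered) for _ in per_rank]

def broadcast(per_rank, src=0):
    """Every rank ends up with rank src's tensor."""
    return [list(per_rank[src]) for _ in per_rank]

ranks = [[1, 2], [10, 20]]  # two TP ranks, one partial tensor each
print(all_reduce(ranks))    # [[11, 22], [11, 22]]
print(all_gather(ranks))    # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(broadcast(ranks))     # [[1, 2], [1, 2]]
```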
Request Lifecycle
Here’s how a request flows through the system:
1. User Request
User → API Server
- User sends HTTP POST to /v1/chat/completions
- API Server validates request and extracts prompt
2. Tokenization
API Server → Tokenizer
- API Server sends TokenizeMsg via ZMQ
- Tokenizer converts text to token IDs
3. Scheduling
Tokenizer → Scheduler (Rank 0)
- Tokenizer sends tokenized request via ZMQ
- Scheduler creates Req object (python/minisgl/core.py)
- Adds request to request table
4. Broadcasting (Multi-GPU)
Scheduler (Rank 0) → All Schedulers
- Rank 0 broadcasts request to other ranks via NCCL
- All schedulers synchronize request state
5. Batch Preparation
All Schedulers
- Each scheduler prepares Batch object
- Allocates KV cache pages
- Creates attention metadata
- Groups prefill and decode requests
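The page allocation in step 5 can be sketched as a free-list over fixed-size pages. This is an assumed illustration of paged KV allocation in general, not the code in python/minisgl/scheduler/cache.py; the class name and page size are invented.

```python
# Sketch of paged KV cache allocation: a fixed pool of fixed-size pages with
# a free list, handed out per request and returned when the request finishes.
class PagedKVAllocator:
    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages = list(range(num_pages))

    def alloc(self, num_tokens: int) -> list:
        """Reserve enough pages for num_tokens; raise if the pool is exhausted."""
        needed = -(-num_tokens // self.page_size)  # ceiling division
        if needed > len(self.free_pages):
            raise RuntimeError("KV cache exhausted; request must wait")
        pages, self.free_pages = self.free_pages[:needed], self.free_pages[needed:]
        return pages

    def free(self, pages: list) -> None:
        """Return a finished request's pages to the pool."""
        self.free_pages.extend(pages)

pool = PagedKVAllocator(num_pages=4, page_size=16)
req_pages = pool.alloc(40)       # 40 tokens need 3 pages of 16
print(req_pages)                 # [0, 1, 2]
pool.free(req_pages)
print(len(pool.free_pages))      # 4
```

When `alloc` fails, the scheduler can simply leave the request queued until another request finishes and frees pages, which is what makes admission control cheap.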
6. Inference
Scheduler → Engine
- Scheduler calls Engine.forward()
- Engine performs model forward pass
- Tensor-parallel layers use NCCL for synchronization
- Attention backend computes attention
- KV cache is updated
- Token sampling generates next token
7. Detokenization
Scheduler (Rank 0) → Detokenizer
- Rank 0 sends generated token ID via ZMQ
- Detokenizer converts to text
- Handles streaming decoding
8. Response
Detokenizer → API Server → User
- Detokenizer sends text chunk via ZMQ
- API Server streams to user via HTTP
- Process repeats until EOS or max tokens
Multi-GPU Tensor Parallelism
In tensor-parallel setups:
- Model Sharding:
  - Model weights are split across GPUs
  - Each GPU holds a portion of each layer
  - Sharding handled by python/minisgl/models/weight.py
- Parallel Computation:
  - All GPUs process the same batch simultaneously
  - Column-parallel layers split output features
  - Row-parallel layers split input features
  - Implemented in python/minisgl/layers/linear.py
- Synchronization:
  - NCCL all-reduce after row-parallel layers
  - Fast GPU-to-GPU communication
  - Minimal overhead for tensor operations
- Coordinator Pattern:
  - Rank 0 scheduler coordinates I/O
  - All ranks execute inference in lockstep
  - Results collected at Rank 0
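Why row-parallel layers need an all-reduce can be shown with a toy two-rank matrix-vector product. Pure Python stands in for the sharded layers in python/minisgl/layers/linear.py; the weights and shapes are invented for the example.

```python
# Toy 2-rank tensor-parallel matvec: each rank holds half the input features
# (row-parallel sharding), produces a partial output, and the elementwise sum
# (the NCCL all-reduce) recovers the full result.
def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]

# Split input features (columns of W) across two ranks.
W0, W1 = [row[:2] for row in W], [row[2:] for row in W]
partial0, partial1 = matvec(W0, x[:2]), matvec(W1, x[2:])
all_reduced = [a + b for a, b in zip(partial0, partial1)]  # the all-reduce

print(all_reduced)   # [10, 26]
print(matvec(W, x))  # [10, 26], matches the unsharded computation
```

A column-parallel layer is the dual case: each rank computes a disjoint slice of the output features, so the results are combined by all-gather (or simply fed into a following row-parallel layer) instead of summed.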
Process Launching
Implementation: python/minisgl/server/launch.py
The launch_server function orchestrates all processes:
- Spawns API Server process
- Spawns Tokenizer/Detokenizer process
- Spawns Scheduler processes (one per GPU)
- Each Scheduler creates its Engine
- Sets up ZMQ and NCCL communication
- Monitors process health
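The orchestration pattern of launch_server can be sketched as a pipeline of workers wired together with queues. This is a deliberately simplified stand-in: threads and queue.Queue replace the real processes and ZMQ sockets, and the worker behavior is invented (the actual spawning lives in python/minisgl/server/launch.py).

```python
import threading
import queue

# Sketch of the launch orchestration: spawn one worker per role, wire them
# together with queues, feed a request into the first stage, wait for all.
def worker(name: str, inbox: queue.Queue, outbox: queue.Queue) -> None:
    msg = inbox.get()
    outbox.put(f"{msg}->{name}")

def launch_pipeline(prompt: str) -> str:
    stages = ["tokenizer", "scheduler", "detokenizer"]
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=worker, args=(n, queues[i], queues[i + 1]))
               for i, n in enumerate(stages)]
    for t in threads:
        t.start()
    queues[0].put(prompt)          # the API server feeds the first stage
    for t in threads:
        t.join()
    return queues[-1].get()

print(launch_pipeline("hi"))  # hi->tokenizer->scheduler->detokenizer
```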
The package entry point is python/minisgl/__main__.py, so users can start the entire system by running minisgl as a module (python -m minisgl).