The Engine class is the core component that manages model execution, the KV cache, attention backends, and CUDA graph optimization. It handles the low-level details of GPU inference.
Constructor
Engine configuration object containing:
- Model configuration (architecture, vocab size, etc.)
- Tensor parallelism settings
- Memory and performance tuning
- Backend selection (attention, MoE)
- CUDA graph settings
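For example, constructing an engine might look like the following sketch. The import path and the model_path and tp_size field names are illustrative assumptions; cuda_graph_bs and num_page_override are the only EngineConfig fields referenced elsewhere on this page.

```python
# Minimal construction sketch; field names other than cuda_graph_bs are assumed.
from mini_sglang import Engine, EngineConfig  # hypothetical import path

config = EngineConfig(
    model_path="/path/to/model",   # weights to load (or dummy weights)
    tp_size=1,                     # tensor parallelism degree (assumed name)
    cuda_graph_bs=[1, 2, 4, 8],    # batch sizes to pre-capture as CUDA graphs
)
engine = Engine(config)
```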
Initialization Process
The engine initialization performs several critical steps (a structural skeleton follows the list):
- Communication Setup: Initializes distributed communication (NCCL/Gloo) for tensor parallelism
- Model Loading: Loads model weights from disk or initializes dummy weights
- KV Cache Allocation: Allocates page-based KV cache in GPU memory
- Page Table Creation: Sets up page table for efficient memory management
- Backend Initialization: Configures attention backend (FlashAttention, FlashInfer, TRT-LLM) and MoE backend if needed
- Sampler Setup: Initializes token sampling logic
- CUDA Graph Capture: Pre-records frequently used batch sizes for faster execution
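The same sequence, expressed as a runnable skeleton. Every method name here is an illustrative stand-in, not the actual mini-sglang internals; only the ordering mirrors the list above.

```python
class EngineSkeleton:
    """Illustrates the initialization order only; not the real implementation."""

    def __init__(self, config):
        self._init_distributed(config)     # 1. communication setup (NCCL/Gloo)
        self._load_model(config)           # 2. model loading (real or dummy weights)
        self._alloc_kv_cache(config)       # 3. page-based KV cache allocation
        self._create_page_table(config)    # 4. page table creation
        self._init_backends(config)        # 5. attention / MoE backend selection
        self._init_sampler(config)         # 6. sampler setup
        self._capture_cuda_graphs(config)  # 7. CUDA graph capture

    # Stubs so the skeleton runs as-is.
    def _init_distributed(self, config): ...
    def _load_model(self, config): ...
    def _alloc_kv_cache(self, config): ...
    def _create_page_table(self, config): ...
    def _init_backends(self, config): ...
    def _init_sampler(self, config): ...
    def _capture_cuda_graphs(self, config): ...
```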
Key Methods
forward_batch()
Executes a forward pass for a batch of requests.
Batch object containing:
- reqs: List of request objects
- phase: Either "prefill" or "decode"
- input_ids, positions, out_loc: Prepared tensors
Sampling arguments for the batch, including per-request temperature, top-k, and top-p.
ForwardOutput
A named tuple containing:
- next_tokens_gpu: Sampled tokens on GPU
- next_tokens_cpu: Sampled tokens copied to CPU
- copy_done_event: CUDA event for synchronization
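A hedged usage sketch based on the fields above. The construction of batch and sampling_args is elided, and engine refers to an already-initialized Engine.

```python
# batch and sampling_args are assumed to exist; see Constructor above.
output = engine.forward_batch(batch, sampling_args)

# The GPU-to-CPU token copy is asynchronous: wait on copy_done_event
# before reading next_tokens_cpu.
output.copy_done_event.synchronize()
next_tokens = output.next_tokens_cpu.tolist()
```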
shutdown()
Cleanly shuts down the engine and releases resources:
- Destroys CUDA graphs
- Cleans up distributed process groups
- Releases GPU memory
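Pairing construction with shutdown in a try/finally block is a safe pattern, since the cleanup above must run even when a forward pass raises:

```python
engine = Engine(config)
try:
    output = engine.forward_batch(batch, sampling_args)
finally:
    engine.shutdown()  # graphs, process groups, and GPU memory are released
```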
Key Attributes
- The loaded language model
- KV cache pool managing paged memory
- Page table mapping logical to physical KV cache locations; shape: (max_running_req + 1, aligned_max_seq_len)
- Attention backend implementation (FlashAttention, FlashInfer, or TRT-LLM)
- MoE backend for mixture-of-experts models (if applicable)
- Token sampling module
- CUDA graph manager for optimized execution
- Global context object containing shared state
Architecture
The engine coordinates several subsystems:
Memory Management
The engine automatically determines the number of KV cache pages based on available GPU memory (sketched below); this default can be overridden with num_page_override in EngineConfig.
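A back-of-the-envelope sketch of such an estimate, assuming each page stores K and V for page_size tokens across every layer. The formula and parameter names are assumptions; the engine's real computation may differ.

```python
import torch

def estimate_num_pages(page_size, num_layers, num_kv_heads, head_dim,
                       dtype_bytes=2, mem_fraction=0.9):
    """Rough page-count estimate; illustrative only."""
    free_bytes, _total = torch.cuda.mem_get_info()
    # One page holds K and V (factor 2) for page_size tokens across all layers.
    bytes_per_page = 2 * num_layers * page_size * num_kv_heads * head_dim * dtype_bytes
    return int(free_bytes * mem_fraction) // bytes_per_page
```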
CUDA Graph Optimization
The engine captures CUDA graphs for frequently used batch sizes (specified in cuda_graph_bs). This reduces kernel launch overhead:
- If batch size matches a captured graph → replay graph (faster)
- Otherwise → execute model normally
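The dispatch rule above, as an illustrative sketch; the graph table and its methods are assumptions, not the engine's actual API.

```python
def run_model(batch_size, graphs, eager_forward, inputs):
    """Replay a captured CUDA graph when one matches; otherwise run eagerly."""
    graph = graphs.get(batch_size)
    if graph is not None:
        graph.copy_inputs(inputs)  # fill the graph's static input buffers
        graph.replay()             # replay pre-recorded kernels (low launch overhead)
        return graph.outputs()
    return eager_forward(inputs)   # no captured graph: normal execution
```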
Distributed Execution
The engine supports tensor parallelism for large models.
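A hedged launch sketch: one process per GPU, each building the same engine with its own rank. The tp_size and tp_rank field names are assumptions; Engine and EngineConfig are as in the Constructor sketch above.

```python
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each rank constructs its own engine; distributed groups are set up
    # inside Engine initialization (see Initialization Process above).
    config = EngineConfig(model_path="/path/to/model",
                          tp_size=world_size, tp_rank=rank)
    engine = Engine(config)
    try:
        pass  # serve requests on this rank
    finally:
        engine.shutdown()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # 2-way tensor parallelism
```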
Notes
- The engine must be initialized before CUDA is initialized elsewhere
- Only one engine instance should exist per process
- The engine is not thread-safe; use separate processes for parallelism
- Call shutdown() to properly clean up resources before process exit
- The dummy request (table index = max_running_req) is used for CUDA graph capture