Mini-SGLang integrates multiple high-performance attention backends to optimize inference across different GPU architectures and workload phases. You can choose different backends for prefill and decode phases to maximize efficiency.
Supported Backends
Mini-SGLang supports three attention backends:
- `fa` - FlashAttention (github.com/Dao-AILab/flash-attention)
- `fi` - FlashInfer (github.com/flashinfer-ai/flashinfer)
- `trtllm` - TensorRT-LLM FMHA (github.com/NVIDIA/TensorRT-LLM)
Configuration
Use the `--attn` or `--attention-backend` flag to specify which backend(s) to use:
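A minimal launch sketch showing the flag. The `python -m minisgl.server` entrypoint and the model path are placeholders assumed for illustration, not the documented command; substitute your actual Mini-SGLang launch invocation. `--attention-backend fa` is equivalent to `--attn fa`.

```bash
# Select FlashAttention for both prefill and decode.
# NOTE: entrypoint and model path below are placeholders.
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn fa
```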
Hybrid Backend Configuration
When you specify two backends separated by a comma, the first backend is used for prefill and the second for decode:
- FlashAttention (`fa`) for the prefill phase
- FlashInfer (`fi`) for the decode phase
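For example, a sketch of a hybrid launch (same placeholder entrypoint and model path as in the earlier sketch):

```bash
# fa (first) handles prefill, fi (second) handles decode.
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn fa,fi
```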
Backend Details
FlashAttention (fa)
FlashAttention provides highly optimized attention computation through IO-aware algorithms.
Key Features:
- Supports FlashAttention 3 on NVIDIA Hopper GPUs (SM90+)
- Falls back to FlashAttention 2 on older architectures
- Efficient memory usage through tiling
- Requires the `sgl-kernel` package
`minisgl.attention.fa.FlashAttentionBackend` (source: `python/minisgl/attention/fa.py`)
Installation:
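A sketch of the install step, assuming the prebuilt `sgl-kernel` wheel from PyPI works on your platform:

```bash
# FlashAttention kernels are provided via the sgl-kernel package.
pip install sgl-kernel
```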
FlashInfer (fi)
FlashInfer specializes in efficient decode-phase attention with optional tensor core usage.
Key Features:
- Optimized for decode phase with batched requests
- Configurable tensor core usage based on GQA ratio
- Currently only supports page size = 1
- Uses FlashAttention 2 backend internally
`minisgl.attention.fi.FlashInferBackend` (source: `python/minisgl/attention/fi.py`)
Tensor Core Usage:
By default, tensor cores are enabled when GQA (num_qo_heads / num_kv_heads) >= 4. You can override this with the FLASHINFER_USE_TENSOR_CORES environment variable.
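A sketch of overriding the default. The boolean-like value accepted by the variable is an assumption, not documented here; the entrypoint and model path are placeholders as in the earlier sketches.

```bash
# Force tensor-core decode kernels even when the GQA ratio is below 4.
export FLASHINFER_USE_TENSOR_CORES=true
# Launch with FlashInfer handling decode (placeholder entrypoint/model).
python -m minisgl.server --model-path meta-llama/Llama-3.1-8B-Instruct --attn fa,fi
```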
TensorRT-LLM (trtllm)
TensorRT-LLM FMHA backend provides optimized attention through NVIDIA’s TensorRT-LLM library.
Key Features:
- Supports both prefill and decode phases
- Integrates with TensorRT-LLM optimizations
- Page size constraint: Only supports page sizes of 16, 32, or 64
`minisgl.attention.trtllm.TensorRTLLMBackend` (source: `python/minisgl/attention/trtllm.py`)
Example:
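A hedged launch sketch for the TensorRT-LLM backend (placeholder entrypoint and model path; the page size must be 16, 32, or 64):

```bash
# Use TensorRT-LLM FMHA for both phases with a supported page size.
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn trtllm \
  --page-size 32
```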
Default Backend Selection
When you use `--attn auto` (the default), Mini-SGLang automatically selects optimal backends based on your GPU architecture:
- NVIDIA Hopper GPUs (SM90+): FlashAttention 3 for prefill, FlashInfer for decode
- Other GPUs: FlashAttention for both prefill and decode
The automatic selection takes into account:
- GPU compute capability
- Available installed kernels
- Model configuration (GQA ratio, head dimensions)
Page Size Constraints
Different attention backends have different page size requirements:

| Backend | Page Size Support | Notes |
|---|---|---|
| `fa` (FlashAttention) | Any size | Recommended: 1 for flexibility |
| `fi` (FlashInfer) | 1 only | Hardcoded constraint |
| `trtllm` | 16, 32, or 64 only | Will override user setting |
Set the page size with the `--page-size` flag:
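For example (placeholder entrypoint and model path as in the earlier sketches):

```bash
# Page size 1 satisfies FlashInfer's constraint and works with FlashAttention.
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn fa,fi \
  --page-size 1
```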
Performance Recommendations
For High Throughput
Use the hybrid FlashAttention + FlashInfer configuration:
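A hedged sketch of a throughput-oriented launch (placeholder entrypoint and model path):

```bash
# FlashAttention prefill + FlashInfer decode for batched serving.
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn fa,fi \
  --page-size 1
```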
For Long Context
FlashAttention is optimized for long sequences:
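A sketch pinning FlashAttention for both phases (placeholders as above):

```bash
# Use FlashAttention for prefill and decode on long-context workloads.
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn fa
```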
For Low Latency
Use FlashInfer with tensor cores for decode:
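A sketch combining the tensor-core override with FlashInfer decode (the environment-variable value is an assumption, as noted earlier; entrypoint and model path are placeholders):

```bash
# Enable tensor-core decode kernels and route decode to FlashInfer.
FLASHINFER_USE_TENSOR_CORES=true \
python -m minisgl.server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attn fa,fi \
  --page-size 1
```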
Source Code Reference
All attention backends are implemented in `python/minisgl/attention/`:
- FlashAttention: `fa.py:36` (`FlashAttentionBackend` class)
- FlashInfer: `fi.py:80` (`FlashInferBackend` class)
- TensorRT-LLM: `trtllm.py:35` (`TensorRTLLMBackend` class)
- Backend registry: `__init__.py:19` (`SUPPORTED_ATTENTION_BACKENDS`)
Troubleshooting
Import Errors with FlashAttention
If you see errors importing `sgl_kernel.flash_attn`, install the required dependencies:
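A sketch of a fix, assuming the error comes from a missing or broken `sgl-kernel` installation:

```bash
# Reinstall the kernel package, then verify the import that was failing.
pip install --upgrade --force-reinstall sgl-kernel
python -c "import sgl_kernel.flash_attn; print('sgl_kernel.flash_attn OK')"
```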
Page Size Conflicts
If using FlashInfer (fi), ensure page size is set to 1:
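For example (placeholder entrypoint and model path):

```bash
# FlashInfer only supports page size 1.
python -m minisgl.server --model-path meta-llama/Llama-3.1-8B-Instruct --attn fa,fi --page-size 1
```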
TensorRT-LLM Page Size
When using `trtllm`, use one of the supported page sizes:
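For example (placeholder entrypoint and model path):

```bash
# Valid page sizes for trtllm are 16, 32, and 64.
python -m minisgl.server --model-path meta-llama/Llama-3.1-8B-Instruct --attn trtllm --page-size 64
```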