Mini-SGLang offers several configuration options to optimize performance for your specific workload and hardware. This guide covers best practices and tuning recommendations.
Memory Management
GPU Memory Ratio
Control the fraction of GPU memory allocated to KV cache:
python -m minisgl --model "Qwen/Qwen3-0.6B" --memory-ratio 0.85
Recommendations:
- Default: 0.9 (90% of available memory)
- Shared GPU: Reduce to 0.7-0.8 if other processes need GPU memory
- Long contexts: Increase to 0.95 for maximum cache capacity
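As a rough illustration of the shared-GPU recommendation, the sketch below treats the ratio as the share of total GPU memory the server may claim; the card size and candidate ratios are assumed values for illustration, not numbers reported by Mini-SGLang:
```python
# Headroom arithmetic for a shared GPU (assumed values, illustration only).
TOTAL_GB = 24.0                          # e.g. a 24 GB consumer card
for ratio in (0.9, 0.8, 0.7):            # candidate --memory-ratio settings
    engine_gb = TOTAL_GB * ratio         # memory Mini-SGLang may claim
    headroom_gb = TOTAL_GB - engine_gb   # left for other processes on the GPU
    print(f"--memory-ratio {ratio}: ~{engine_gb:.1f} GB engine, ~{headroom_gb:.1f} GB headroom")
```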
Page Size Configuration
Page size determines the granularity of KV cache allocation:
python -m minisgl --model "Qwen/Qwen3-0.6B" --page-size 256
Recommendations:
| Use Case | Page Size | Rationale |
|---|---|---|
| Short sequences (<512 tokens) | 16-32 | Reduces internal fragmentation |
| Medium sequences (512-2048 tokens) | 64-128 | Balanced trade-off |
| Long sequences (>2048 tokens) | 256+ | Fewer page allocations |
Note: Some attention backends constrain the allowed page sizes:
- TensorRT-LLM: Only supports 16, 32, or 64
- FlashInfer: Works with any power-of-2 size
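To see why small pages suit short sequences, here is a toy calculation of last-page waste versus page count; it is illustrative only and does not reflect the engine's actual allocator bookkeeping:
```python
import math

# Internal fragmentation in the last page of a single sequence (illustration only).
def page_usage(seq_len: int, page_size: int) -> tuple[int, int]:
    """Return (pages_needed, unused_slots_in_last_page) for one sequence."""
    pages = math.ceil(seq_len / page_size)
    return pages, pages * page_size - seq_len

for page_size in (16, 64, 256):
    pages, wasted = page_usage(seq_len=300, page_size=page_size)
    print(f"page_size={page_size:>3}: {pages} pages, {wasted} unused KV slots")
```
Smaller pages waste fewer slots per sequence but require many more allocations; larger pages invert that trade-off, which is what the table above summarizes.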
Number of Pages
Explicitly control the maximum number of KV cache pages:
python -m minisgl --model "Qwen/Qwen3-0.6B" --num-pages 10000
Useful when:
- Debugging OOM issues
- Profiling memory usage
- Running on constrained hardware
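A back-of-the-envelope conversion from pages to bytes helps when sizing this flag. The model-shape numbers below are assumptions for illustration (not values read from the engine); substitute your model's layer count, KV heads, and head dimension:
```python
# Estimate how much GPU memory a given --num-pages setting pins (assumed model shape).
NUM_PAGES    = 10_000
PAGE_SIZE    = 16      # tokens per page (--page-size)
NUM_LAYERS   = 28      # assumed transformer depth
NUM_KV_HEADS = 8       # assumed KV heads (GQA)
HEAD_DIM     = 128     # assumed head dimension
DTYPE_BYTES  = 2       # BF16/FP16 KV cache

bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
total_gb = NUM_PAGES * PAGE_SIZE * bytes_per_token / 1024**3
print(f"{NUM_PAGES} pages x {PAGE_SIZE} tokens ~ {total_gb:.1f} GB of KV cache")
```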
Chunked Prefill
Chunked prefill splits long prompts into smaller chunks to reduce peak memory usage and prevent OOM errors.
Configuration
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-prefill-length 8192
Recommended values:
| Context Length | Chunk Size | Notes |
|---|---|---|
| <4K tokens | 4096-8192 | Minimal chunking needed |
| 4K-32K tokens | 8192-16384 | Balance memory and speed |
| 32K-128K tokens | 16384-32768 | Prevent OOM on most GPUs |
| >128K tokens | 32768+ | Very long context scenarios |
Performance tips:
- Too small (<512): Significant overhead from multiple kernel launches
- Too large (>32K): Risk of OOM, especially with large batch sizes
- Optimal: Set to 2-4x your typical prompt length
Chunked prefill is enabled by default and has been shown to improve throughput in long-context serving scenarios (see Sarathi-Serve).
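Conceptually, the scheduler walks the prompt in fixed-size windows. The toy sketch below shows only the splitting step, not the engine's actual scheduling logic:
```python
# Conceptual chunked prefill: split a long prompt into --max-prefill-length windows.
def chunked_prefill(prompt_tokens: list[int], max_prefill_length: int):
    """Yield successive chunks of the prompt; each chunk is one prefill step."""
    for start in range(0, len(prompt_tokens), max_prefill_length):
        yield prompt_tokens[start:start + max_prefill_length]

prompt = list(range(20_000))              # stand-in for a 20K-token prompt
chunks = list(chunked_prefill(prompt, max_prefill_length=8192))
print([len(c) for c in chunks])           # [8192, 8192, 3616]
```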
CUDA Graph Optimization
CUDA graphs reduce CPU kernel launch overhead during the decode phase by capturing and replaying GPU operations.
Configuration
# Enable CUDA graphs with max batch size 256
python -m minisgl --model "Qwen/Qwen3-0.6B" --cuda-graph-max-bs 256
# Disable CUDA graphs
python -m minisgl --model "Qwen/Qwen3-0.6B" --cuda-graph-max-bs 0
Recommendations:
| Workload | Max Batch Size | Rationale |
|---|---|---|
| Interactive (1-2 users) | 1-4 | Low concurrency |
| Small deployment | 16-64 | Moderate traffic |
| Production serving | 128-256 | High throughput |
| Memory constrained | 0 (disabled) | Save GPU memory |
Trade-offs:
- Higher values: Better performance at high concurrency, but more GPU memory usage
- Lower values: Less memory overhead, but may miss optimization opportunities
- Auto-tuning: Leave unset to automatically tune based on GPU memory
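Mini-SGLang manages graph capture internally; the generic PyTorch snippet below only illustrates the capture-and-replay idea behind the flag and is not Mini-SGLang code:
```python
import torch

# A stand-in for one decode step; the real engine captures its decode forward pass.
model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.zeros(8, 1024, device="cuda")   # fixed batch shape, like a captured batch size

with torch.no_grad():
    # Warm up on a side stream before capture (required by PyTorch).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the kernel sequence once...
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)

# ...then replay it each decode step: only the input buffer changes, with no per-kernel launch cost.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.shape)
```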
When to Disable
- Debugging decode kernels
- Running on very limited GPU memory (<8GB)
- Shell/interactive mode (automatically disabled)
Attention Backend Selection
Mini-SGLang supports multiple attention backends optimized for different phases:
Available Backends
- fa: FlashAttention (including FlashAttention-3 on Hopper GPUs)
- fi: FlashInfer
- trtllm: TensorRT-LLM FMHA
Configuration
# Use FlashAttention for both prefill and decode
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa
# Use FlashAttention for prefill, FlashInfer for decode (recommended)
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn fa,fi
# Use TensorRT-LLM for both phases
python -m minisgl --model "Qwen/Qwen3-0.6B" --attn trtllm
Recommendations by GPU Architecture
| GPU Architecture | Prefill Backend | Decode Backend | Notes |
|---|---|---|---|
| Hopper (H100, H200) | fa (FA3) | fi | Default; optimal performance |
| Ampere (A100, A10) | fa (FA2) | fi | Good balance |
| Ada (RTX 4090) | fa | fi | Consumer GPUs |
| Older (V100, T4) | fa | fa | Limited FlashInfer support |
Backend-Specific Considerations
FlashAttention:
- Excellent prefill performance
- FlashAttention-3 on Hopper provides significant speedup
- Works with any page size
FlashInfer:
- Optimized for decode with paged attention
- Better performance with batched decode requests
- Requires page size to be power of 2
TensorRT-LLM:
- Highly optimized NVIDIA kernels
- Restricts page size to 16, 32, or 64
- May require additional setup
Cache Management Strategy
Choose between Radix Cache and naive cache management:
# Use Radix Cache (default, recommended)
python -m minisgl --model "Qwen/Qwen3-0.6B" --cache-type radix
# Use naive cache
python -m minisgl --model "Qwen/Qwen3-0.6B" --cache-type naive
When to use Radix Cache (default):
- Requests with shared prefixes (e.g., system prompts)
- Multi-turn conversations
- Batched requests with common context
- Production serving scenarios
When to use naive cache:
- Benchmarking (for fair comparison)
- Debugging cache-related issues
- Workloads with no shared prefixes
Performance impact: Radix Cache can improve throughput by 2-5x for workloads with high prefix sharing.
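The toy prefix cache below illustrates the reuse idea: once a prefix's KV state is cached, later requests that share it only compute the new suffix. The real Radix Cache uses a radix tree over token IDs; this dictionary version is a simplification for intuition only:
```python
# Toy prefix cache: maps a token prefix to "already computed" KV state.
cache: dict[tuple[int, ...], str] = {}

def prefill(tokens: list[int]) -> None:
    """Find the longest cached prefix, then 'compute' (here: record) only the rest."""
    hit = 0
    for i in range(len(tokens), 0, -1):
        if tuple(tokens[:i]) in cache:
            hit = i
            break
    print(f"cache hit on {hit}/{len(tokens)} tokens, computing {len(tokens) - hit}")
    for i in range(hit + 1, len(tokens) + 1):
        cache[tuple(tokens[:i])] = "kv"

system_prompt = list(range(100))          # shared system prompt
prefill(system_prompt + [1001, 1002])     # first request: nothing to reuse
prefill(system_prompt + [2001, 2002])     # second request: reuses the 100-token prefix
```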
Overlap Scheduling
Overlap scheduling hides CPU scheduling overhead by overlapping it with GPU computation.
Configuration
Overlap scheduling is enabled by default. To disable for ablation studies:
MINISGL_DISABLE_OVERLAP_SCHEDULING=1 python -m minisgl --model "Qwen/Qwen3-0.6B"
Performance impact: Typically improves throughput by 5-15% by reducing scheduler overhead.
When to Disable
- Debugging scheduler behavior
- Profiling CPU overhead
- Running ablation studies
Distributed Serving (Tensor Parallelism)
Scale large models across multiple GPUs:
# 4-way tensor parallelism
python -m minisgl --model "meta-llama/Llama-3.1-70B-Instruct" --tp 4
Recommendations:
| Model Size | GPUs | TP Size | Notes |
|---|---|---|---|
| <7B params | 1 | 1 | Single GPU sufficient |
| 7-13B params | 1-2 | 1-2 | Optional TP for speed |
| 14-34B params | 2-4 | 2-4 | TP recommended |
| 70B+ params | 4-8 | 4-8 | TP required |
Best practices:
- Use NVLink-connected GPUs for best performance
- TP size should evenly divide the model's attention head (and KV head) count
- PyNCCL is enabled by default (disable with --disable-pynccl if needed)
Advanced Tuning
Maximum Running Requests
Control scheduler concurrency:
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-running-requests 128
Trade-offs:
- Higher: Better throughput under load, but more memory usage
- Lower: Reduced memory pressure, but may bottleneck under high QPS
Maximum Sequence Length
Override the model's default maximum sequence length:
python -m minisgl --model "Qwen/Qwen3-0.6B" --max-seq-len-override 8192
Useful when:
- Model supports longer context than config specifies
- Testing with shorter sequences to save memory
Data Type
Choose precision for model weights:
python -m minisgl --model "Qwen/Qwen3-0.6B" --dtype bfloat16
Options:
- auto (default): FP16 for FP32/FP16 models, BF16 for BF16 models
- bfloat16: Better numerical stability on Ampere+ GPUs
- float16: Slightly faster on some GPUs
- float32: Highest precision, but 2x memory usage
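A quick way to gauge the weight footprint per dtype is plain parameter-count arithmetic; the parameter count below is an assumed example, and KV cache plus activations come on top of it:
```python
# Approximate weight memory by dtype (parameters only).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}
params = 0.6e9   # assumed parameter count, e.g. a ~0.6B model

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype:>9}: ~{params * nbytes / 1024**3:.1f} GB of weights")
```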
Monitoring and Debugging
Enable Detailed Logging
Set log level via environment variable:
LOG_LEVEL=DEBUG python -m minisgl --model "Qwen/Qwen3-0.6B"
Memory Profiling
Monitor GPU memory usage:
# In a separate terminal
watch -n 1 nvidia-smi
Use NVIDIA Nsight Systems for detailed profiling:
nsys profile -o profile.qdrep python -m minisgl --model "Qwen/Qwen3-0.6B"
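For scripted monitoring from a separate process, the NVML bindings report the same numbers as nvidia-smi. This is a generic sketch, not a Mini-SGLang utility, and assumes the nvidia-ml-py (pynvml) package is installed:
```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)       # GPU 0; adjust for your setup
try:
    for _ in range(10):                             # poll once per second for ~10 seconds
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"used {info.used / 1024**3:.1f} / {info.total / 1024**3:.1f} GB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```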
Quick Reference
Common Configuration Profiles
Maximum throughput (production):
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 \
--cuda-graph-max-bs 256 \
--max-prefill-length 16384 \
--attn fa,fi \
--memory-ratio 0.9 \
--cache-type radix
Memory-constrained:
python -m minisgl --model "Qwen/Qwen3-0.6B" \
--memory-ratio 0.7 \
--cuda-graph-max-bs 64 \
--max-prefill-length 4096 \
--page-size 16
Long-context serving:
python -m minisgl --model "Qwen/Qwen3-14B" \
--max-seq-len-override 32768 \
--max-prefill-length 16384 \
--page-size 256 \
--memory-ratio 0.95 \
--cache-type radix
Debug/development:
MINISGL_DISABLE_OVERLAP_SCHEDULING=1 LOG_LEVEL=DEBUG \
python -m minisgl --model "Qwen/Qwen3-0.6B" \
--cuda-graph-max-bs 0 \
--shell