
The Engine class is the core component that manages model execution, KV cache, attention backends, and CUDA graph optimization. It handles the low-level details of GPU inference.

Constructor

from minisgl.engine import Engine, EngineConfig

engine = Engine(config)
config (EngineConfig, required)
Engine configuration object containing:
  • Model configuration (architecture, vocab size, etc.)
  • Tensor parallelism settings
  • Memory and performance tuning
  • Backend selection (attention, MoE)
  • CUDA graph settings

Initialization Process

The engine initialization performs several critical steps:
  1. Communication Setup: Initializes distributed communication (NCCL/Gloo) for tensor parallelism
  2. Model Loading: Loads model weights from disk or initializes dummy weights
  3. KV Cache Allocation: Allocates page-based KV cache in GPU memory
  4. Page Table Creation: Sets up page table for efficient memory management
  5. Backend Initialization: Configures attention backend (FlashAttention, FlashInfer, TRT-LLM) and MoE backend if needed
  6. Sampler Setup: Initializes token sampling logic
  7. CUDA Graph Capture: Pre-records frequently used batch sizes for faster execution
For example:

from minisgl.engine import Engine, EngineConfig
from minisgl.distributed import DistributedInfo
import torch

config = EngineConfig(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    tp_info=DistributedInfo(rank=0, size=1),
    page_size=16,
    max_running_req=128,
)

engine = Engine(config)

Key Methods

forward_batch()

Executes a forward pass for a batch of requests.
batch (Batch, required)
Batch object containing:
  • reqs: List of request objects
  • phase: Either "prefill" or "decode"
  • input_ids, positions, out_loc: Prepared tensors

args (BatchSamplingArgs, required)
Sampling arguments for the batch, including per-request temperature, top-k, and top-p.
Returns: ForwardOutput, a named tuple containing:
  • next_tokens_gpu: Sampled tokens on GPU
  • next_tokens_cpu: Sampled tokens copied to CPU
  • copy_done_event: CUDA event for synchronization
from minisgl.core import Batch
from minisgl.engine.sample import BatchSamplingArgs

# Prepared by scheduler
batch = Batch(reqs=requests, phase="prefill")
sampling_args = BatchSamplingArgs(...)

output = engine.forward_batch(batch, sampling_args)

# Wait for CPU copy to complete
output.copy_done_event.synchronize()
next_tokens = output.next_tokens_cpu
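
Building on this, a simplified generation loop alternates one prefill pass with repeated decode passes. The sketch below assumes the scheduler re-prepares the Batch tensors each step; is_finished() is a hypothetical helper used only for illustration:

from minisgl.core import Batch
from minisgl.engine.sample import BatchSamplingArgs

sampling_args = BatchSamplingArgs(...)  # per-request temperature/top-k/top-p

# Prefill: run the full prompts once.
output = engine.forward_batch(Batch(reqs=requests, phase="prefill"), sampling_args)
output.copy_done_event.synchronize()

# Decode: one token per request per step.
while not all(is_finished(req) for req in requests):  # is_finished: hypothetical
    output = engine.forward_batch(Batch(reqs=requests, phase="decode"), sampling_args)
    output.copy_done_event.synchronize()              # wait for the GPU-to-CPU copy
    next_tokens = output.next_tokens_cpu              # feed back into the requests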

shutdown()

Cleanly shuts down the engine and releases resources.
engine.shutdown()
This method:
  • Destroys CUDA graphs
  • Cleans up distributed process groups
  • Releases GPU memory
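
To guarantee cleanup even when an error occurs, a common pattern is to wrap engine usage in try/finally:

engine = Engine(config)
try:
    output = engine.forward_batch(batch, sampling_args)
finally:
    engine.shutdown()  # destroys graphs, process groups, and GPU memory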

Key Attributes

model (nn.Module)
The loaded language model

kv_cache (BaseKVCachePool)
KV cache pool managing paged memory

page_table (torch.Tensor)
Page table mapping logical to physical KV cache locations. Shape: (max_running_req + 1, aligned_max_seq_len)

attn_backend (BaseAttnBackend)
Attention backend implementation (FlashAttention, FlashInfer, or TRT-LLM)

moe_backend (BaseMoeBackend)
MoE backend for mixture-of-experts models (if applicable)

sampler (Sampler)
Token sampling module

graph_runner (GraphRunner)
CUDA graph manager for optimized execution

ctx (Context)
Global context object containing shared state
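
For example, these attributes can be inspected after construction (attribute names as listed above):

# Inspect engine internals after initialization.
print(engine.page_table.shape)             # (max_running_req + 1, aligned_max_seq_len)
print(type(engine.attn_backend).__name__)  # which attention backend was selected
print(type(engine.kv_cache).__name__)      # KV cache pool implementation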

Architecture

The engine coordinates several subsystems:
┌─────────────────────────────────────┐
│           Engine                    │
├─────────────────────────────────────┤
│  Model (Transformer)                │
│  ├─ Loaded from disk                │
│  └─ Distributed across TP ranks     │
├─────────────────────────────────────┤
│  KV Cache Pool                      │
│  ├─ Paged memory management         │
│  └─ Page table for address mapping  │
├─────────────────────────────────────┤
│  Attention Backend                  │
│  ├─ FlashAttention (sm90+)          │
│  ├─ FlashInfer (fallback)           │
│  └─ TRT-LLM (sm100+)                │
├─────────────────────────────────────┤
│  Sampler                            │
│  └─ Top-k, top-p, temperature       │
├─────────────────────────────────────┤
│  CUDA Graph Runner                  │
│  └─ Captures common batch sizes     │
└─────────────────────────────────────┘

Memory Management

The engine automatically determines the number of KV cache pages based on available GPU memory:
num_pages = (available_memory * memory_ratio - model_size) / bytes_per_page
You can override this with num_page_override in EngineConfig.
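
A minimal sketch of this calculation, assuming "available memory" means free device memory (the engine's actual accounting may differ); torch.cuda.mem_get_info() returns (free, total) bytes for the current device:

import torch

def estimate_num_pages(memory_ratio: float, model_size_bytes: int,
                       bytes_per_page: int) -> int:
    # (free, total) bytes on the current CUDA device
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # Budget a fraction of available memory, subtract the model weights,
    # then divide by the per-page KV cache footprint.
    budget = free_bytes * memory_ratio - model_size_bytes
    return max(int(budget // bytes_per_page), 0)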

CUDA Graph Optimization

The engine captures CUDA graphs for frequently used batch sizes (specified in cuda_graph_bs). This reduces kernel launch overhead:
config = EngineConfig(
    ...,
    cuda_graph_bs=[1, 2, 4, 8, 16, 32],  # Capture these batch sizes
    cuda_graph_max_bs=64,  # Maximum batch size for graph
)
During execution:
  • If batch size matches a captured graph → replay graph (faster)
  • Otherwise → execute model normally
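
Conceptually, the per-batch dispatch reduces to the sketch below; captured_batch_sizes, replay, and run_eager are hypothetical names used for illustration, not the actual internal API:

# Hypothetical sketch of the graph-vs-eager dispatch inside the engine.
def dispatch(engine, batch, sampling_args):
    bs = len(batch.reqs)
    if bs in engine.captured_batch_sizes:                         # matches a captured graph
        return engine.graph_runner.replay(batch, sampling_args)  # replay (faster)
    return engine.run_eager(batch, sampling_args)                 # normal execution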

Distributed Execution

The engine supports tensor parallelism for large models:
# Rank 0
config = EngineConfig(
    ...,
    tp_info=DistributedInfo(rank=0, size=4),
    distributed_addr="tcp://localhost:12345"
)

# Ranks 1, 2, and 3 use the same config with their own rank values
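
Each rank must live in its own process (the engine is not thread-safe; see Notes). A minimal launch sketch using torch.multiprocessing, reusing the config fields from the earlier examples; the per-rank serving logic is elided, and whether Engine() blocks until all ranks connect is an assumption:

import torch
import torch.multiprocessing as mp
from minisgl.distributed import DistributedInfo
from minisgl.engine import Engine, EngineConfig

def worker(rank: int, world_size: int):
    config = EngineConfig(
        model_path="meta-llama/Llama-3.2-1B-Instruct",
        dtype=torch.bfloat16,
        tp_info=DistributedInfo(rank=rank, size=world_size),
        distributed_addr="tcp://localhost:12345",
    )
    engine = Engine(config)  # presumably blocks until all ranks join
    try:
        ...  # serve requests on this rank
    finally:
        engine.shutdown()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)  # one process per TP rank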

Notes

  • The engine must be created before any other code in the process initializes CUDA
  • Only one engine instance should exist per process
  • The engine is not thread-safe; use separate processes for parallelism
  • Call shutdown() to properly clean up resources before process exit
  • The dummy request (table index = max_running_req) is used for CUDA graph capture