
The Engine class is the core component that manages model execution, KV cache, attention backends, and CUDA graph optimization. It handles the low-level details of GPU inference.

Constructor

from minisgl.engine import Engine, EngineConfig

engine = Engine(config)
config (EngineConfig, required)
Engine configuration object containing:
  • Model configuration (architecture, vocab size, etc.)
  • Tensor parallelism settings
  • Memory and performance tuning
  • Backend selection (attention, MoE)
  • CUDA graph settings

Initialization Process

The engine initialization performs several critical steps:
  1. Communication Setup: Initializes distributed communication (NCCL/Gloo) for tensor parallelism
  2. Model Loading: Loads model weights from disk or initializes dummy weights
  3. KV Cache Allocation: Allocates page-based KV cache in GPU memory
  4. Page Table Creation: Sets up page table for efficient memory management
  5. Backend Initialization: Configures attention backend (FlashAttention, FlashInfer, TRT-LLM) and MoE backend if needed
  6. Sampler Setup: Initializes token sampling logic
  7. CUDA Graph Capture: Pre-records frequently used batch sizes for faster execution
For example:

from minisgl.engine import Engine, EngineConfig
from minisgl.distributed import DistributedInfo
import torch

config = EngineConfig(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    tp_info=DistributedInfo(rank=0, size=1),
    page_size=16,
    max_running_req=128,
)

engine = Engine(config)

Key Methods

forward_batch()

Executes a forward pass for a batch of requests.
batch (Batch, required)
Batch object containing:
  • reqs: List of request objects
  • phase: Either "prefill" or "decode"
  • input_ids, positions, out_loc: Prepared tensors

args (BatchSamplingArgs, required)
Sampling arguments for the batch, including per-request temperature, top-k, and top-p.
Returns: ForwardOutput, a named tuple containing:
  • next_tokens_gpu: Sampled tokens on GPU
  • next_tokens_cpu: Sampled tokens copied to CPU
  • copy_done_event: CUDA event for synchronization
from minisgl.core import Batch
from minisgl.engine.sample import BatchSamplingArgs

# Prepared by scheduler
batch = Batch(reqs=requests, phase="prefill")
sampling_args = BatchSamplingArgs(...)

output = engine.forward_batch(batch, sampling_args)

# Wait for CPU copy to complete
output.copy_done_event.synchronize()
next_tokens = output.next_tokens_cpu
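
Building on this, a simplified generation loop alternates one prefill pass with repeated decode passes. The sketch below assumes the scheduler re-prepares the Batch tensors each step; is_finished() is a hypothetical helper used only for illustration:

from minisgl.core import Batch
from minisgl.engine.sample import BatchSamplingArgs

sampling_args = BatchSamplingArgs(...)  # per-request temperature/top-k/top-p

# Prefill: run the full prompts once.
output = engine.forward_batch(Batch(reqs=requests, phase="prefill"), sampling_args)
output.copy_done_event.synchronize()

# Decode: one token per request per step.
while not all(is_finished(req) for req in requests):  # is_finished: hypothetical
    output = engine.forward_batch(Batch(reqs=requests, phase="decode"), sampling_args)
    output.copy_done_event.synchronize()              # wait for the GPU-to-CPU copy
    next_tokens = output.next_tokens_cpu              # feed back into the requests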

shutdown()

Cleanly shuts down the engine and releases resources.
engine.shutdown()
This method:
  • Destroys CUDA graphs
  • Cleans up distributed process groups
  • Releases GPU memory
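
To guarantee cleanup even when an error occurs, a common pattern is to wrap engine usage in try/finally:

engine = Engine(config)
try:
    output = engine.forward_batch(batch, sampling_args)
finally:
    engine.shutdown()  # destroys graphs, process groups, and GPU memory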

Key Attributes

model (nn.Module)
The loaded language model

kv_cache (BaseKVCachePool)
KV cache pool managing paged memory

page_table (torch.Tensor)
Page table mapping logical to physical KV cache locations. Shape: (max_running_req + 1, aligned_max_seq_len)

attn_backend (BaseAttnBackend)
Attention backend implementation (FlashAttention, FlashInfer, or TRT-LLM)

moe_backend (BaseMoeBackend)
MoE backend for mixture-of-experts models (if applicable)

sampler (Sampler)
Token sampling module

graph_runner (GraphRunner)
CUDA graph manager for optimized execution

ctx (Context)
Global context object containing shared state
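
For example, these attributes can be inspected after construction (attribute names as listed above):

# Inspect engine internals after initialization.
print(engine.page_table.shape)             # (max_running_req + 1, aligned_max_seq_len)
print(type(engine.attn_backend).__name__)  # which attention backend was selected
print(type(engine.kv_cache).__name__)      # KV cache pool implementation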

Architecture

The engine coordinates several subsystems:
┌─────────────────────────────────────┐
│           Engine                    │
├─────────────────────────────────────┤
│  Model (Transformer)                │
│  ├─ Loaded from disk                │
│  └─ Distributed across TP ranks     │
├─────────────────────────────────────┤
│  KV Cache Pool                      │
│  ├─ Paged memory management         │
│  └─ Page table for address mapping  │
├─────────────────────────────────────┤
│  Attention Backend                  │
│  ├─ FlashAttention (sm90+)          │
│  ├─ FlashInfer (fallback)           │
│  └─ TRT-LLM (sm100+)                │
├─────────────────────────────────────┤
│  Sampler                            │
│  └─ Top-k, top-p, temperature       │
├─────────────────────────────────────┤
│  CUDA Graph Runner                  │
│  └─ Captures common batch sizes     │
└─────────────────────────────────────┘

Memory Management

The engine automatically determines the number of KV cache pages based on available GPU memory:
num_pages = (available_memory * memory_ratio - model_size) / bytes_per_page
You can override this with num_page_override in EngineConfig.
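
A minimal sketch of this calculation, assuming "available memory" means free device memory (the engine's actual accounting may differ); torch.cuda.mem_get_info() returns (free, total) bytes for the current device:

import torch

def estimate_num_pages(memory_ratio: float, model_size_bytes: int,
                       bytes_per_page: int) -> int:
    # (free, total) bytes on the current CUDA device
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # Budget a fraction of available memory, subtract the model weights,
    # then divide by the per-page KV cache footprint.
    budget = free_bytes * memory_ratio - model_size_bytes
    return max(int(budget // bytes_per_page), 0)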

CUDA Graph Optimization

The engine captures CUDA graphs for frequently used batch sizes (specified in cuda_graph_bs). This reduces kernel launch overhead:
config = EngineConfig(
    ...,
    cuda_graph_bs=[1, 2, 4, 8, 16, 32],  # Capture these batch sizes
    cuda_graph_max_bs=64,  # Maximum batch size for graph
)
During execution:
  • If batch size matches a captured graph → replay graph (faster)
  • Otherwise → execute model normally
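
Conceptually, the per-batch dispatch reduces to the sketch below; captured_batch_sizes, replay, and run_eager are hypothetical names used for illustration, not the actual internal API:

# Hypothetical sketch of the graph-vs-eager dispatch inside the engine.
def dispatch(engine, batch, sampling_args):
    bs = len(batch.reqs)
    if bs in engine.captured_batch_sizes:                         # matches a captured graph
        return engine.graph_runner.replay(batch, sampling_args)  # replay (faster)
    return engine.run_eager(batch, sampling_args)                 # normal execution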

Distributed Execution

The engine supports tensor parallelism for large models:
# Rank 0
config = EngineConfig(
    ...,
    tp_info=DistributedInfo(rank=0, size=4),
    distributed_addr="tcp://localhost:12345"
)

# Ranks 1, 2, and 3 use the same config with their own rank values
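
Each rank must live in its own process (the engine is not thread-safe; see Notes). A minimal launch sketch using torch.multiprocessing, reusing the config fields from the earlier examples; the per-rank serving logic is elided, and whether Engine() blocks until all ranks connect is an assumption:

import torch
import torch.multiprocessing as mp
from minisgl.distributed import DistributedInfo
from minisgl.engine import Engine, EngineConfig

def worker(rank: int, world_size: int):
    config = EngineConfig(
        model_path="meta-llama/Llama-3.2-1B-Instruct",
        dtype=torch.bfloat16,
        tp_info=DistributedInfo(rank=rank, size=world_size),
        distributed_addr="tcp://localhost:12345",
    )
    engine = Engine(config)  # presumably blocks until all ranks join
    try:
        ...  # serve requests on this rank
    finally:
        engine.shutdown()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)  # one process per TP rank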

Notes

  • The engine must be created before any other code in the process initializes CUDA
  • Only one engine instance should exist per process
  • The engine is not thread-safe; use separate processes for parallelism
  • Call shutdown() to properly clean up resources before process exit
  • The dummy request (table index = max_running_req) is used for CUDA graph capture