Batch Inference

KVBoost supports batched generation for multiple prompts sharing a common prefix. The shared prefix KV is loaded from cache once, then broadcast (zero-copy via expand()) across the batch.

generate_batch()

Use when you know the prompts share a prefix:

# Load the model, then warm the cache with the shared prefix.
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")
engine.warm("You are a helpful assistant...")

results = engine.generate_batch([
    "You are a helpful assistant...\n\nUser: What is 2+2?",
    "You are a helpful assistant...\n\nUser: Explain gravity",
    "You are a helpful assistant...\n\nUser: Write a haiku",
], max_new_tokens=64)

for r in results:
    print(f"{r.output_text[:60]}... ({r.ttft_ms:.0f}ms)")

This runs:

One cache lookup for the shared prefix
One batched prefill over all suffix tokens
One batched decode loop
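
Since generate_batch() relies on the prompts sharing a prefix, a quick check before calling it can catch mismatched prompts early. The snippet below is a minimal sketch using only the standard library; it compares characters rather than tokens, and it is not part of the KVBoost API.

```python
import os.path

prompts = [
    "You are a helpful assistant...\n\nUser: What is 2+2?",
    "You are a helpful assistant...\n\nUser: Explain gravity",
    "You are a helpful assistant...\n\nUser: Write a haiku",
]

# os.path.commonprefix compares strings character by character,
# so it works on arbitrary text, not only on file paths.
shared = os.path.commonprefix(prompts)

print(f"shared prefix: {len(shared)} characters")
print(repr(shared))
```

If the shared prefix turns out to be empty or very short, generate_many() is likely the better fit.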

generate_many()

Use when prompts may or may not share prefixes – auto-groups by prefix hash:

results = engine.generate_many([
    "System A...\n\nQuery 1",
    "System A...\n\nQuery 2",   # batched with Query 1
    "System B...\n\nQuery 3",   # processed separately
])

Prompts are clustered by the hash of their first few chunks. Each cluster is batched together; single-prompt clusters fall back to generate().
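
The clustering step can be illustrated with a small standalone sketch. It is not KVBoost's implementation: the helper names (prefix_key, group_by_prefix), the character-based chunking, and the chunk sizes are all assumptions made for illustration, whereas the real grouping presumably operates on token chunks.

```python
import hashlib
from collections import defaultdict


def prefix_key(prompt: str, chunk_chars: int = 16, num_chunks: int = 2) -> str:
    """Hash the first few fixed-size chunks of a prompt (hypothetical scheme)."""
    head = prompt[: chunk_chars * num_chunks]
    return hashlib.sha256(head.encode("utf-8")).hexdigest()


def group_by_prefix(prompts):
    """Return lists of prompt indices that share the same prefix hash."""
    groups = defaultdict(list)
    for index, prompt in enumerate(prompts):
        groups[prefix_key(prompt)].append(index)
    return list(groups.values())


prompts = [
    "System A: you are a concise assistant.\n\nQuery 1",
    "System A: you are a concise assistant.\n\nQuery 2",
    "System B: you are a verbose assistant.\n\nQuery 3",
]

clusters = group_by_prefix(prompts)
print(clusters)  # [[0, 1], [2]]
```

Under this scheme, clusters with more than one index would be batched together and singleton clusters would be handled one at a time, mirroring the behavior described above.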

How Broadcast Works

The shared prefix KV has shape [1, heads, seq, dim]. For batch size B, KVBoost calls expand(B, -1, -1, -1) instead of repeat(B, 1, 1, 1):

expand() creates a view with stride 0 – zero memory cost, all batch elements read from the same physical tensor.
repeat() would materialize B copies of the tensor – using roughly B * chunk_MB of memory for data that is identical across the batch.
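
The difference is easy to verify directly in PyTorch. The following is a small sketch with made-up dimensions (8 heads, 128 positions, head dimension 64, batch size 4); it uses a random tensor in place of a real prefix KV and does not touch KVBoost itself.

```python
import torch

batch_size = 4
# Stand-in for a cached prefix key tensor: [1, heads, seq, dim].
prefix_k = torch.randn(1, 8, 128, 64)

# expand() returns a view: the batch dimension gets stride 0.
expanded = prefix_k.expand(batch_size, -1, -1, -1)
# repeat() allocates new storage holding batch_size copies.
repeated = prefix_k.repeat(batch_size, 1, 1, 1)

print(expanded.shape, expanded.stride())
print(repeated.shape, repeated.stride())

# The expanded view points at the same memory as the original tensor.
print(expanded.data_ptr() == prefix_k.data_ptr())
print(repeated.data_ptr() == prefix_k.data_ptr())

prefix_bytes = prefix_k.numel() * prefix_k.element_size()
repeated_bytes = repeated.numel() * repeated.element_size()
print(f"prefix: {prefix_bytes} bytes, repeated copy: {repeated_bytes} bytes")
```

One caveat: the expanded tensor should be treated as read-only, because every batch element aliases the same memory. Operations that produce a new tensor from it, such as torch.cat, allocate fresh storage for their result.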

Throughput Impact

For the common case (same system prompt, N different queries):

Sequential: N forward passes, each at ~10% GPU utilization
Batched: 1 forward pass at ~80% GPU utilization, ~3x effective throughput

A batched prefill over N queries typically takes ~2-3x as long as the prefill for a single query, rather than N times as long, because the batch keeps the GPU better utilized.
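
To make the arithmetic concrete, here is a back-of-the-envelope sketch. The batch size of 8 is an assumption chosen for illustration, the 2-3x cost range is the one quoted above, and the calculation covers prefill only. These are not measured results.

```python
def prefill_speedup(num_queries: int, batched_cost: float) -> float:
    """Speedup of one batched prefill over num_queries sequential prefills.

    batched_cost is the batched prefill time in units of a single query's
    prefill time (roughly 2-3 according to the estimate above).
    """
    if num_queries < 1 or batched_cost <= 0:
        raise ValueError("num_queries must be >= 1 and batched_cost > 0")
    return num_queries / batched_cost


n = 8  # assumed batch size, for illustration only
low = prefill_speedup(n, 3.0)   # pessimistic end of the range
high = prefill_speedup(n, 2.0)  # optimistic end of the range
print(f"N={n}: prefill speedup between {low:.1f}x and {high:.1f}x")
```

The speedup grows with N only as long as the batched cost stays within that range; very large batches will eventually saturate the GPU.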