Quick Start
Basic Usage
from kvboost import KVBoost
# 1. Load any Hugging Face causal LM (the model must use RoPE)
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")
# 2. Cache your system prompt / document / few-shot examples
engine.warm("You are a helpful coding assistant. Always provide concise answers...")
# 3. Generate -- cached prefix is reused automatically
result = engine.generate(
    "You are a helpful coding assistant. Always provide concise answers...\n\n"
    "User: How do I reverse a linked list?\n"
    "Assistant:",
    max_new_tokens=128,
)
print(result.output_text)
print(f"TTFT: {result.ttft_ms:.1f}ms | Cache reuse: {result.kv_reuse_ratio:.0%}")
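The example above repeats the warmed text at the start of the prompt. A minimal sketch of keeping the two in sync by building every prompt from one constant follows. `SYSTEM_PROMPT` and `build_prompt` are hypothetical names, not part of KVBoost, and the idea that the warmed text must be a literal prefix of the prompt is an assumption based on the example, not a documented requirement.

```python
# Hypothetical helper: not part of KVBoost.
SYSTEM_PROMPT = "You are a helpful coding assistant."  # placeholder text


def build_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Prepend the shared system prompt so every request starts identically."""
    return f"{system_prompt}\n\nUser: {user_message}\nAssistant:"


prompt = build_prompt("How do I reverse a linked list?")

# Assumption: the warmed text has to be a literal prefix of the prompt
# for the cached prefix to be reused.
assert prompt.startswith(SYSTEM_PROMPT)

# With an engine created as in the example above, the calls would be:
#   engine.warm(SYSTEM_PROMPT)
#   result = engine.generate(prompt, max_new_tokens=128)
```

Defining the prefix once avoids a stray space or newline making the warmed text and the prompt diverge.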
Batch Inference
Process multiple prompts sharing a common prefix in one pass:
results = engine.generate_batch([
    "You are a helpful assistant...\n\nUser: What is 2+2?",
    "You are a helpful assistant...\n\nUser: What is 3+3?",
    "You are a helpful assistant...\n\nUser: What is 4+4?",
])
for r in results:
    print(r.output_text)
Or let the engine group prompts with mixed prefixes automatically:
results = engine.generate_many([
    "System A...\n\nQuery 1",
    "System A...\n\nQuery 2",  # batched with above
    "System B...\n\nQuery 3",  # processed separately
])
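How `generate_many` decides which prompts belong together is not described here. As an illustration of the idea only, the sketch below groups prompts by the text before the first blank line. `group_by_prefix` and the separator convention are assumptions, not KVBoost's grouping logic.

```python
from collections import defaultdict


def group_by_prefix(prompts, separator="\n\n"):
    """Group prompts by the text before the first separator, keeping order."""
    groups = defaultdict(list)
    for prompt in prompts:
        # partition() returns the whole prompt as the prefix if no separator exists
        prefix, _, _ = prompt.partition(separator)
        groups[prefix].append(prompt)
    return dict(groups)


prompts = [
    "System A...\n\nQuery 1",
    "System A...\n\nQuery 2",
    "System B...\n\nQuery 3",
]
for prefix, members in group_by_prefix(prompts).items():
    print(f"{prefix!r}: {len(members)} prompt(s)")
```

Each group could then be passed to `generate_batch` on its own, which is roughly what the inline comments in the example above describe.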
Memory-Efficient Mode
engine = KVBoost.from_pretrained(
    "Qwen/Qwen2.5-3B",
    kv_cache_bits=8,                  # int8 quantized KV (2x RAM savings)
    disk_cache_dir="/tmp/kv",         # evicted chunks go to disk
    recompute_strategy="cacheblend",  # smarter recompute
)
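The 2x figure for `kv_cache_bits=8` can be sanity-checked with back-of-the-envelope arithmetic, assuming the baseline cache stores 16-bit values. The sketch below is a rough estimate, not how KVBoost measures memory: `kv_cache_bytes` is a hypothetical helper, the model shape is made up for illustration, and the small overhead of quantization scales is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size: keys and values for every layer and token."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

fp16 = kv_cache_bytes(**shape, bits=16)
int8 = kv_cache_bytes(**shape, bits=8)
print(f"fp16: {fp16 / 2**20:.0f} MiB | int8: {int8 / 2**20:.0f} MiB | ratio: {fp16 / int8:.1f}x")
```

Under these assumptions the cache shrinks from 512 MiB to 256 MiB for a single 4096-token sequence.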
When Does It Help?
| Condition | Expected TTFT Speedup |
|---|---|
| Multi-turn, 8+ turns, 3B+ model | 10-48x |
| Code/doc reuse, 800+ tokens | 15-21x |
| RAG, ~500 tokens | 1-2x |
| System prompt, ~250 tokens | 0.3-0.5x (overhead) |
| Any workload, 0.5B model | <1x (overhead) |
Rule of thumb: expect a speedup on 3B+ models with 500+ tokens of shared context. Below either threshold, caching overhead can make TTFT worse than running without the cache.
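The rule of thumb can be written as a quick pre-check, and a speedup factor can be turned into a latency figure. This is a sketch with hypothetical helper names (`likely_to_benefit`, `ttft_with_cache`). The thresholds come from the table above, and the 100 ms baseline is an invented number used only for the arithmetic.

```python
def likely_to_benefit(model_params_billions: float, shared_tokens: int) -> bool:
    """Encode the rule of thumb: 3B+ parameters and 500+ shared tokens."""
    return model_params_billions >= 3.0 and shared_tokens >= 500


def ttft_with_cache(baseline_ms: float, speedup: float) -> float:
    """Convert a speedup factor into latency; factors below 1 mean slower."""
    return baseline_ms / speedup


print(likely_to_benefit(3.0, 800))   # large model, long shared context
print(likely_to_benefit(0.5, 800))   # small model: overhead expected
print(ttft_with_cache(100.0, 0.5))   # hypothetical 100 ms baseline at 0.5x
```

A factor of 0.5x doubles the time to first token, which is why the last two rows of the table are marked as overhead.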