Benchmark Results

All benchmarks were run with Qwen/Qwen2.5-3B (float16) on a MacBook Air (M-series, 16GB RAM) using the MPS backend, with a chunk size of 128 and greedy decoding.

Multi-Turn Conversation

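Baseline time to first token (TTFT) grows with conversation history, and in these runs faster than linearly: from 35ms at 232 tokens to 2,970ms at 1,353 tokens. From turn 4 onward, KVBoost stays flat at ~62ms.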

Turn	Tokens	Baseline TTFT	KVBoost TTFT	Reuse	Speedup
1	232	35ms	31ms	0%	1.1x
4	621	374ms	62ms	62%	6.0x
6	946	1,228ms	63ms	68%	19.6x
8	1,353	2,970ms	62ms	76%	47.9x
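
The Speedup column is the ratio of baseline to KVBoost latency, and the Reuse column can be read against the 128-token chunk size. The sketch below recomputes both from the rounded table values. The names `ROWS`, `speedup`, and `implied_chunks` are illustrative and not part of KVBoost, and the idea that reuse happens in whole chunks is an inference from the numbers, not documented behavior. Recomputed speedups can differ from the table in the last digit, presumably because the table was built from unrounded timings.

```python
CHUNK_SIZE = 128  # chunk size stated in the benchmark setup

# (turn, tokens, baseline_ms, kvboost_ms, reuse_pct), copied from the table above
ROWS = [
    (1, 232, 35, 31, 0),
    (4, 621, 374, 62, 62),
    (6, 946, 1228, 63, 68),
    (8, 1353, 2970, 62, 76),
]


def speedup(baseline_ms, kvboost_ms):
    """Ratio of baseline latency to KVBoost latency."""
    return baseline_ms / kvboost_ms


def implied_chunks(tokens, reuse_pct, chunk_size=CHUNK_SIZE):
    """Nearest whole number of chunks matching the reported reuse (assumption)."""
    return round(tokens * reuse_pct / 100 / chunk_size)


for turn, tokens, base, boost, reuse in ROWS:
    chunks = implied_chunks(tokens, reuse)
    reused = chunks * CHUNK_SIZE
    print(
        f"turn {turn}: {speedup(base, boost):.1f}x, "
        f"{chunks} chunks = {reused} tokens ({100 * reused / tokens:.0f}% of {tokens})"
    )
```

Under that reading, turns 4, 6, and 8 reuse 3, 5, and 8 chunks (384, 640, and 1,024 tokens), which reproduces the reported 62%, 68%, and 76%.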

Code Context Reuse (~800 tokens)

Query	Baseline	KVBoost	Reuse	Speedup
Q1 (cold)	1,670ms	2,292ms	0%	0.7x
Q2 (warm)	1,577ms	75ms	92%	21.1x
Q3 (warm)	2,133ms	128ms	92%	16.6x
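
Because the cold query is slower with KVBoost than without it, the benefit depends on how many warm queries follow. The sketch below sums the rounded latencies from the table to show the cumulative effect. Variable names are illustrative only.

```python
from itertools import accumulate

# (query, baseline_ms, kvboost_ms), copied from the table above
QUERIES = [
    ("Q1 (cold)", 1670, 2292),
    ("Q2 (warm)", 1577, 75),
    ("Q3 (warm)", 2133, 128),
]

# Running totals after each query
baseline_total = list(accumulate(b for _, b, _ in QUERIES))
kvboost_total = list(accumulate(k for _, _, k in QUERIES))

for (name, _, _), b, k in zip(QUERIES, baseline_total, kvboost_total):
    print(f"after {name}: baseline {b}ms, KVBoost {k}ms, cumulative speedup {b / k:.2f}x")
```

With these numbers, the 622ms cold-start overhead on Q1 is recovered by the first warm query, and the three-query total is 5,380ms for the baseline against 2,495ms for KVBoost (about 2.2x).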

Running Benchmarks

# All examples
python examples/run.py

# Single example
python examples/run.py --example multiturn

# Full experiment suite (TinyLlama, ~55 min)
cd benchmarks_and_experiments && python run_all.py

# Distribution correctness test (run from the repository root)
python benchmarks_and_experiments/11_distribution_correctness.py