Benchmark Results¶
All benchmarks on Qwen/Qwen2.5-3B (float16) on MacBook Air M-series, 16GB RAM, MPS backend. Chunk size 128, greedy decoding.
Multi-Turn Conversation¶
Baseline TTFT scales linearly with history. KVBoost stays flat at ~62ms.
Turn |
Tokens |
Baseline |
KVBoost |
Reuse |
Speedup |
|---|---|---|---|---|---|
1 |
232 |
35ms |
31ms |
0% |
1.1x |
4 |
621 |
374ms |
62ms |
62% |
6.0x |
6 |
946 |
1,228ms |
63ms |
68% |
19.6x |
8 |
1,353 |
2,970ms |
62ms |
76% |
47.9x |
Code Context Reuse (~800 tokens)¶
Query |
Baseline |
KVBoost |
Reuse |
Speedup |
|---|---|---|---|---|
Q1 (cold) |
1,670ms |
2,292ms |
0% |
0.7x |
Q2 (warm) |
1,577ms |
75ms |
92% |
21.1x |
Q3 (warm) |
2,133ms |
128ms |
92% |
16.6x |
Running Benchmarks¶
# All examples
python examples/run.py
# Single example
python examples/run.py --example multiturn
# Full experiment suite (TinyLlama, ~55 min)
cd benchmarks_and_experiments && python run_all.py
# Distribution correctness test
python benchmarks_and_experiments/11_distribution_correctness.py