KVBoost vs vLLM Prefix Caching
A head-to-head comparison against vLLM-MLX prefix caching on Apple Silicon.
vLLM caches the system prompt's KV state and reuses it only on an exact prefix match. KVBoost reuses any matching chunk, including interior content that is not part of the prefix.
Note
KVBoost uses Qwen2.5-3B in float16 (MPS). vLLM-MLX uses Qwen2.5-3B in 4-bit (MLX Metal). This is a realistic deployment comparison: each system runs in the format it is optimized for.
Axis 1: Non-Prefix Interior Reuse
This is the core differentiator. The same document is placed at the start of the prompt, in the middle of the prompt, or replaced by unique content:
Exact prefix: Both systems can cache. KVBoost is 3-30x faster.
Interior: vLLM gets zero cache hits. KVBoost achieves 82-83% reuse.
No reuse: Even without caching, KVBoost’s HF baseline (33ms) beats vLLM-MLX (1.3s) due to MPS vs MLX Metal overhead.
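The three cases above can be illustrated with a toy model. This is a minimal sketch, not KVBoost's or vLLM's actual implementation: the function names, the chunk size of 4 tokens, and the hash-based lookup are all assumptions made for illustration, and it ignores how cached KV states would be adjusted for a new position. It only shows why an exact-prefix matcher finds nothing once the document moves to the interior, while a content-addressed chunk matcher still does.

```python
from hashlib import sha256

CHUNK = 4  # hypothetical chunk size, in tokens


def prefix_reuse(cached, prompt):
    """Fraction of prompt tokens covered by the longest shared prefix."""
    shared = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        shared += 1
    return shared / len(prompt)


def chunk_key(tokens):
    """Content hash of a chunk, independent of its position in the prompt."""
    return sha256(" ".join(tokens).encode()).hexdigest()


def chunk_reuse(cached, prompt, size=CHUNK):
    """Fraction of prompt tokens covered by cached chunks found at any offset."""
    keys = {chunk_key(cached[i:i + size])
            for i in range(0, len(cached) - size + 1, size)}
    covered, i = 0, 0
    while i + size <= len(prompt):
        if chunk_key(prompt[i:i + size]) in keys:
            covered += size  # whole chunk is reusable
            i += size
        else:
            i += 1  # slide forward one token and try again
    return covered / len(prompt)


doc = [f"d{i}" for i in range(16)]  # the cached document (toy tokens)
question = ["q0", "q1", "q2", "q3"]
prompts = {
    "exact prefix": doc + question,
    "interior": question[:3] + doc + question[3:],
    "no reuse": [f"u{i}" for i in range(20)],
}
for name, prompt in prompts.items():
    print(name, prefix_reuse(doc, prompt), chunk_reuse(doc, prompt))
```

The fractions printed here describe the toy prompts only and are not the benchmark's measured reuse rates.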
Axis 2: Cold-Start Overhead
With an empty cache, no reuse is possible. KVBoost takes 32ms versus 777ms for vLLM: even cold, KVBoost is faster because of the underlying engine difference.
Axis 3: Break-Even Prompt Length
| Length | KVBoost (warm) | vLLM (warm) |
|---|---|---|
| ~250 words | 37ms (89%) | 849ms |
| ~500 words | 48ms (88%) | 1,960ms |
| ~1000 words | 242ms (98%) | 61,131ms |
| ~2000 words | 1,452ms (98%) | 66,714ms |
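To read the table as relative speedups, the warm latencies can be divided row by row. The short sketch below does only that arithmetic on the numbers copied from the table; the names `warm_ms` and `speedup` are illustrative and not part of the benchmark script.

```python
# Warm latencies in milliseconds, copied from the table above:
# (KVBoost warm, vLLM warm) per prompt length.
warm_ms = {
    "~250 words": (37, 849),
    "~500 words": (48, 1960),
    "~1000 words": (242, 61131),
    "~2000 words": (1452, 66714),
}


def speedup(kvboost_ms, vllm_ms):
    """How many times faster KVBoost is than vLLM for one row."""
    return vllm_ms / kvboost_ms


speedups = {length: speedup(*pair) for length, pair in warm_ms.items()}
for length, ratio in speedups.items():
    print(f"{length}: {ratio:.1f}x")
```

In every row the ratio is well above 1, so within the measured range KVBoost is faster at every prompt length.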
Running
pip install vllm-mlx
python benchmarks_and_experiments/benchmark_vs_vllm.py
python benchmarks_and_experiments/benchmark_vs_vllm.py --axis non_prefix