KVBoost vs vLLM Prefix Caching

A head-to-head comparison of KVBoost against vLLM-MLX prefix caching on Apple Silicon.

vLLM caches the system prompt's KV state and reuses it only on an exact prefix match. KVBoost reuses any matching chunk, including interior content that does not sit at the start of the prompt.
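
The difference between the two lookup strategies can be illustrated with a toy model. This is a minimal sketch, not KVBoost's or vLLM's actual implementation: the fixed chunk size, the content hashing, and the names `prefix_reuse` and `chunk_reuse` are all assumptions made for illustration, and the sketch models only cache lookup, not how the KV tensors themselves are stored or corrected for position.

```python
import hashlib

CHUNK = 4  # tokens per chunk; hypothetical size chosen for illustration


def prefix_reuse(cached: list[int], prompt: list[int]) -> int:
    """Count tokens reusable under exact-prefix matching."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


def chunk_keys(tokens: list[int]) -> list[str]:
    """Hash each fixed-size chunk by its content alone, ignoring position."""
    return [
        hashlib.sha256(repr(tokens[i:i + CHUNK]).encode()).hexdigest()
        for i in range(0, len(tokens) - CHUNK + 1, CHUNK)
    ]


def chunk_reuse(cached: list[int], prompt: list[int]) -> int:
    """Count prompt tokens covered by chunks seen anywhere in the cached sequence."""
    seen = set(chunk_keys(cached))
    return sum(CHUNK for key in chunk_keys(prompt) if key in seen)


# The same 8-token "document" appears first in the cached request
# and in the interior of the new prompt (offset is a multiple of CHUNK).
doc = list(range(100, 108))
cached = doc + [1, 2, 3, 4]
prompt = [9, 9, 9, 9] + doc + [5, 6, 7, 8]

print(prefix_reuse(cached, prompt), chunk_reuse(cached, prompt))
```

In this toy case, prefix matching reuses 0 of the 16 prompt tokens because the first token already differs, while chunk matching reuses the 8 tokens of the shared document. The sketch only finds chunks that fall on the same fixed boundaries, which is why the document is shifted by a whole number of chunks.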

Note

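KVBoost runs Qwen2.5-3B in float16 on MPS. vLLM-MLX runs Qwen2.5-3B in 4-bit on MLX Metal. This is the realistic deployment comparison, in which each system uses its optimal format, so the results reflect the engine and quantization differences as well as the caching strategy.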

Axis 1: Non-Prefix Interior Reuse

This is the core differentiator. The same document is placed at the start of the prompt, in the middle, or swapped for unique content that offers nothing to reuse:

Exact prefix: Both systems can cache. KVBoost is 3-30x faster.
Interior: vLLM gets zero cache hits. KVBoost achieves 82-83% reuse.
No reuse: Even without caching, KVBoost’s HF baseline (33ms) beats vLLM-MLX (1.3s) because of the overhead difference between the MPS and MLX Metal backends.

Axis 2: Cold-Start Overhead

With an empty cache, no reuse is possible. KVBoost takes 32ms and vLLM takes 777ms: even cold, KVBoost is faster because of the underlying engine difference.
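
A cold-start measurement of this kind can be sketched as a small timing helper. This is an illustrative sketch only, not the code in `benchmark_vs_vllm.py`: `median_latency_ms`, `run`, and `reset` are hypothetical names, and the stand-in callables below take the place of a real generation call and a real cache-clearing call.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    run: Callable[[], object],
    reset: Callable[[], None],
    trials: int = 5,
) -> float:
    """Median wall-clock latency of run() in ms, calling reset() before each trial."""
    samples = []
    for _ in range(trials):
        reset()  # e.g. empty the KV cache so every trial is a cold start
        start = time.perf_counter()
        run()  # on asynchronous backends, run() must block until the result is ready
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-ins for a real engine: count calls instead of generating tokens.
counts = {"run": 0, "reset": 0}


def fake_generate() -> None:
    counts["run"] += 1


def fake_clear_cache() -> None:
    counts["reset"] += 1


cold_ms = median_latency_ms(fake_generate, fake_clear_cache)
print(f"cold median: {cold_ms:.3f} ms over {counts['run']} trials")
```

Resetting before every trial keeps each sample cold. Passing a no-op `reset` and issuing one warm-up call first would give the warm-cache figure instead.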

Axis 3: Break-Even Prompt Length

Warm-cache latency at increasing prompt lengths. KVBoost is faster at every length tested, so no break-even point appears within this range.

Length	KVBoost (warm, reuse %)	vLLM (warm)
~250 words	37ms (89%)	849ms
~500 words	48ms (88%)	1,960ms
~1000 words	242ms (98%)	61,131ms
~2000 words	1,452ms (98%)	66,714ms
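
The relative speedup at each length follows directly from the table. The snippet below is a small worked calculation using only the latencies listed above; the `speedup` helper and the `warm_ms` dictionary are names introduced here for illustration.

```python
# Warm latencies in milliseconds, copied from the table above:
# (KVBoost, vLLM) for each prompt length.
warm_ms = {
    "~250 words": (37, 849),
    "~500 words": (48, 1960),
    "~1000 words": (242, 61131),
    "~2000 words": (1452, 66714),
}


def speedup(kvboost_ms: float, vllm_ms: float) -> float:
    """How many times faster KVBoost is than vLLM at a given prompt length."""
    return vllm_ms / kvboost_ms


for length, (kv, vllm) in warm_ms.items():
    print(f"{length}: {speedup(kv, vllm):.1f}x")
```

The ratio is largest at ~1000 words, where vLLM's warm latency jumps sharply, and it narrows again at ~2000 words as KVBoost's own latency grows.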

Running

pip install vllm-mlx
python benchmarks_and_experiments/benchmark_vs_vllm.py
python benchmarks_and_experiments/benchmark_vs_vllm.py --axis non_prefix