Changelog
0.1.0 (2026-04-07)
Initial release.
Core Features:
Chunk-level KV cache reuse for any RoPE-based HuggingFace causal LM
Three generation modes: baseline, prefix cache, chunk KV reuse
Two recompute strategies: selective boundary and CacheBlend deviation-guided
Prefix-chained cache keys (vLLM-style) for positional correctness
Content-hash fallback for approximate non-prefix reuse
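A minimal sketch of how prefix-chained keys differ from content-hash keys, not the library's actual implementation. The function names `chained_chunk_keys` and `content_key`, the fixed chunk size, and the use of SHA-256 are all assumptions made for illustration. A chained key depends on every preceding chunk, so a match implies the same token positions; a content key depends only on the chunk itself, so it can match at a different position and is therefore only approximate.

```python
import hashlib


def chained_chunk_keys(token_ids, chunk_size):
    """Return one key per full chunk; each key also covers all earlier chunks.

    Hypothetical helper: names and hashing scheme are illustrative only.
    """
    keys = []
    parent = b""  # digest of the previous chunk's key (empty for the first chunk)
    for start in range(0, len(token_ids) - chunk_size + 1, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        payload = parent + b"|" + ",".join(map(str, chunk)).encode()
        parent = hashlib.sha256(payload).digest()
        keys.append(parent.hex())
    return keys


def content_key(chunk):
    """Position-independent key: hashes the chunk's tokens and nothing else."""
    return hashlib.sha256(",".join(map(str, chunk)).encode()).hexdigest()


prompt_a = [1, 2, 3, 4, 5, 6, 7, 8]
prompt_b = [1, 2, 3, 4, 9, 9, 9, 9]  # same first chunk, different second chunk
prompt_c = [0, 0, 0, 0, 5, 6, 7, 8]  # different first chunk, same second chunk

keys_a = chained_chunk_keys(prompt_a, 4)
keys_b = chained_chunk_keys(prompt_b, 4)
keys_c = chained_chunk_keys(prompt_c, 4)
```

Here `keys_a[0]` equals `keys_b[0]` because the two prompts share their first chunk, while `keys_a[1]` differs from `keys_c[1]` even though the second chunk's tokens are identical, because the prefixes differ. A content-hash lookup would still match that second chunk, which is the approximate non-prefix case.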
Storage:
Two-tier cache: hot RAM + cold disk (flat mmap block pool)
KIVI-style int8/int4 KV quantization (2-4x compression)
Frequency-based eviction (protects system prompt chunks)
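The frequency-based eviction idea can be sketched as follows. This is an illustrative least-frequently-used cache under stated assumptions, not the project's storage code: the class name `FrequencyCache`, its methods, and the single in-memory tier are hypothetical, and the capacity is assumed to be at least 1. It shows why a chunk that is hit on every request, such as a system prompt, tends to survive eviction.

```python
class FrequencyCache:
    """Tiny least-frequently-used cache (hypothetical, single tier, capacity >= 1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> cached value
        self.hits = {}     # key -> number of successful lookups

    def get(self, key):
        if key not in self.entries:
            return None
        self.hits[key] += 1
        return self.entries[key]

    def put(self, key, value):
        # Evict the entry with the fewest hits when inserting a new key
        # into a full cache; ties go to the oldest inserted entry.
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.hits, key=self.hits.get)
            del self.entries[victim]
            del self.hits[victim]
        self.entries[key] = value
        self.hits.setdefault(key, 0)


cache = FrequencyCache(capacity=2)
cache.put("system", "kv-system")
for _ in range(3):
    cache.get("system")       # the system prompt chunk is reused on every request
cache.put("doc-a", "kv-a")
cache.put("doc-b", "kv-b")    # cache is full: "doc-a" (0 hits) is evicted
```

After the last call, the cache holds `"system"` and `"doc-b"`; the rarely used `"doc-a"` was dropped instead of the frequently hit system prompt chunk.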
Batch Inference:
generate_batch() for prompts sharing a common prefix
generate_many() with auto-grouping by prefix hash
Zero-copy KV broadcast via expand()
Quality & Safety:
Model compatibility validation (11 supported + 6 blocked architectures)
verify_correctness() self-test for untested models
WarmResult diagnostics with alignment warnings
Benchmarks:
5-48x TTFT reduction on 3B+ models with 500+ token shared context
3-41x faster than vLLM-MLX prefix caching
10 TinyLlama experiments + Qwen2.5-3B validation suite