Changelog¶

0.1.0 (2026-04-07)¶

Initial release.

Core Features:

Chunk-level KV cache reuse for any RoPE-based HuggingFace causal LM
Three generation modes: baseline, prefix cache, chunk KV reuse
Two recompute strategies: selective boundary and CacheBlend deviation-guided
Prefix-chained cache keys (vLLM-style) for positional correctness
Content-hash fallback for approximate non-prefix reuse

Storage:

Two-tier cache: hot RAM + cold disk (flat mmap block pool)
KIVI-style int8/int4 KV quantization (2-4x compression)
Frequency-based eviction (protects system prompt chunks)

Batch Inference:

generate_batch() for prompts sharing a common prefix
generate_many() with auto-grouping by prefix hash
Zero-copy KV broadcast via expand()

Quality & Safety:

Model compatibility validation (11 supported + 6 blocked architectures)
verify_correctness() self-test for untested models
WarmResult diagnostics with alignment warnings

Benchmarks:

5-48x TTFT reduction on 3B+ models with 500+ token shared context
3-41x faster than vLLM-MLX prefix caching
10 TinyLlama experiments + Qwen2.5-3B validation suite