Disk Cache Tier¶
When the hot RAM tier is full, evicted chunks are demoted to a disk-backed cold tier. On cache hit, chunks are promoted back to RAM.
Usage¶
engine = KVBoost.from_pretrained(
"Qwen/Qwen2.5-3B",
disk_cache_dir="/tmp/kv_cache",
)
Combine with quantization for maximum efficiency:
engine = KVBoost.from_pretrained(
"Qwen/Qwen2.5-3B",
kv_cache_bits=8,
disk_cache_dir="/tmp/kv_cache",
)
How It Works¶
The disk tier uses a flat block pool rather than per-file storage:
/tmp/kv_cache/
kv_cache.bin <- single pre-allocated file, fixed-size slots
kv_index.json <- hash -> slot mapping (loaded into RAM)
kv_meta.json <- chunk metadata (token_ids, positions, etc.)
No pickle: Raw tensor bytes written directly to slots. No serialization overhead.
Atomic index: Index updates use write-to-temp +
os.replace()for crash safety.Async-friendly: On CUDA, reads use
non_blocking=Truefor pipelined disk-to-GPU transfer.
Flow¶
Eviction: When the hot tier is full and a new chunk arrives, the lowest-frequency chunk is evicted. If a disk tier is configured, the evicted chunk’s KV tensors are flattened and written to a slot.
Promotion: On cache miss in the hot tier, the disk tier is checked. If found, the chunk is read from its slot, reconstructed, and promoted back to the hot tier.
Stats:
engine.cache_stats()includes disk tier info when enabled:stats = engine.cache_stats() # {'hot_chunks': 42, 'disk_chunks': 18, 'disk_free_slots': 238, ...}