KV Cache Quantization¶

KVBoost can compress cached KV tensors from float16 to int8 or int4 using KIVI-style asymmetric quantization (ICML 2024).

The key insight from KIVI: key cache has channel-specific outliers and should be quantized per-channel, while value cache has token-specific outliers and should be quantized per-token. Treating them identically causes accuracy degradation.

Usage¶

from kvboost import KVBoost

# int8 -- 2x RAM savings, near-lossless (recommended)
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B", kv_cache_bits=8)

# int4 -- 4x RAM savings, aggressive
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B", kv_cache_bits=4)
assert engine.verify_correctness()  # validate before trusting

# float16 -- no quantization (default)
engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B", kv_cache_bits=16)

Memory Impact¶

For Qwen2.5-3B at chunk_size=128:

Precision	Per-chunk	128 chunks (hot tier)	Quality
float16	9.4 MB	1.2 GB	Baseline
int8	4.7 MB	0.6 GB	Near-lossless (max error ~0.016)
int4	2.4 MB	0.3 GB	Aggressive (validate first)

How It Works¶

Quantization happens transparently:

On store: cache_manager.store() quantizes KV tensors and stores the compressed version + scale factors.
On load: cache_manager.get() dequantizes back to float16 before returning to the caller.

The caller never sees quantized tensors – the compression is internal to the cache tier.

Standalone API¶

For advanced use, the quantization functions are available directly:

from kvboost import quantize_kv, dequantize_kv

# Compress
qkv = quantize_kv(past_key_values, bits=8)
print(f"Compressed: {qkv.memory_bytes() / 1e6:.1f} MB")

# Decompress
past_key_values = dequantize_kv(qkv)