Internal Modules
These modules are used internally by KVBoost but are available for advanced use.
KV Quantization
- kvboost.kv_quantize.quantize_kv(kv, bits=8)
Quantize a PastKVType using KIVI-style asymmetric quantization.
- Parameters:
kv (Tuple[Tuple[torch.Tensor, torch.Tensor], ...]) – Standard HF past_key_values tuple.
bits (int) – 8 (int8, safe) or 4 (int4, aggressive). 16 is a passthrough (no quantization).
- Returns:
QuantizedKV container with compressed tensors + scale factors.
- Return type:
QuantizedKV
- kvboost.kv_quantize.dequantize_kv(qkv)
Dequantize a QuantizedKV back to float16 PastKVType.
- Parameters:
qkv (QuantizedKV)
- Return type:
Tuple[Tuple[torch.Tensor, torch.Tensor], ...]
- class kvboost.kv_quantize.QuantizedKV(layers, bits, original_dtype)
Full model’s quantized KV cache — drop-in replacement for PastKVType in storage.
- Parameters:
layers (List[QuantizedLayer])
bits (int)
original_dtype (torch.dtype)
- original_dtype: torch.dtype
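The asymmetric scheme can be illustrated with a minimal pure-Python sketch. It is not KVBoost's implementation: the function names are hypothetical, it works on a flat list of floats rather than tensors, and it uses one scale for the whole list, whereas KIVI-style quantization applies the same idea per channel or per token group.

```python
from typing import List, Tuple


def quantize_asymmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float, float]:
    """Map floats onto integer codes in [0, 2**bits - 1] (hypothetical helper)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # A constant input would give a zero scale, so fall back to 1.0.
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Asymmetric: the range starts at the observed minimum, not at zero.
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_asymmetric(codes: List[int], scale: float, lo: float) -> List[float]:
    """Invert quantize_asymmetric, up to rounding error of at most scale / 2."""
    return [c * scale + lo for c in codes]


values = [-1.0, 0.0, 0.5, 2.0]
codes, scale, lo = quantize_asymmetric(values, bits=8)
restored = dequantize_asymmetric(codes, scale, lo)
```

Storing the scale and minimum alongside the codes is what makes the round trip possible, which is why QuantizedKV carries scale factors and the original dtype.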
Disk Tier
- class kvboost.disk_tier.DiskTier(cache_dir, max_chunks=256, slot_bytes=10485760)
Memory-mapped disk cache for KV tensors.
Instead of one file per chunk (torch.save), uses a single pre-allocated binary file with fixed-size slots. An in-memory JSON index maps chunk hashes to slot numbers.
- write(chunk)
Write a chunk’s KV tensors and metadata to a disk slot. Returns True if stored successfully, False if no space.
- Parameters:
chunk (CachedChunk)
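- Return type:
bool
- read(chunk_hash, device='cpu')
Read a chunk from disk. Returns a CachedChunk with KV tensors on the specified device, or None if not found.
- Parameters:
chunk_hash – Hash identifying the chunk to read.
device – Device on which to place the loaded KV tensors. Defaults to 'cpu'.
- Return type: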
CachedChunk | None
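The fixed-slot layout described above can be sketched with the standard library alone. This is an illustration, not the DiskTier code: the class SlotFile, its length header, and the tiny sizes are assumptions, payloads are raw bytes rather than KV tensors, and the index is a plain dict that is never persisted.

```python
import mmap
import os
import tempfile

SLOT_BYTES = 64  # illustrative; DiskTier defaults to slot_bytes=10485760
MAX_SLOTS = 4    # illustrative; DiskTier defaults to max_chunks=256
HEADER = 4       # assumed: bytes reserved per slot for the payload length


class SlotFile:
    """Hypothetical fixed-slot store backed by one pre-allocated file."""

    def __init__(self, path: str) -> None:
        # Pre-allocate the whole file so slot N always starts at N * SLOT_BYTES.
        with open(path, "wb") as f:
            f.truncate(SLOT_BYTES * MAX_SLOTS)
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 0)
        self._index = {}  # chunk hash -> slot number

    def write(self, chunk_hash: str, payload: bytes) -> bool:
        if len(payload) > SLOT_BYTES - HEADER:
            return False  # payload does not fit in one slot
        slot = self._index.get(chunk_hash)
        if slot is None:
            if len(self._index) >= MAX_SLOTS:
                return False  # no free slot left
            slot = len(self._index)
        offset = slot * SLOT_BYTES
        self._map[offset:offset + HEADER] = len(payload).to_bytes(HEADER, "little")
        self._map[offset + HEADER:offset + HEADER + len(payload)] = payload
        self._index[chunk_hash] = slot
        return True

    def read(self, chunk_hash: str):
        slot = self._index.get(chunk_hash)
        if slot is None:
            return None
        offset = slot * SLOT_BYTES
        size = int.from_bytes(self._map[offset:offset + HEADER], "little")
        return bytes(self._map[offset + HEADER:offset + HEADER + size])

    def close(self) -> None:
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "slots.bin")
store = SlotFile(path)
store.write("abc123", b"kv-bytes")
```

Because every slot has a fixed offset, a lookup is one dictionary access plus one slice of the mapping, with no per-chunk file open.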
Batch Utilities
- kvboost.batch.find_common_chunk_prefix(all_token_ids, chunk_size)
Returns the length of the longest chunk-aligned prefix shared by all prompts. Stops at the first chunk where any prompt diverges.
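A minimal sketch of the chunk-aligned comparison follows. The function name is hypothetical, and it assumes the returned length is measured in tokens (a multiple of chunk_size); the library may differ.

```python
from typing import Sequence


def common_chunk_prefix_len(all_token_ids: Sequence[Sequence[int]], chunk_size: int) -> int:
    """Length in tokens of the longest shared prefix made of whole chunks."""
    if not all_token_ids:
        return 0
    shortest = min(len(ids) for ids in all_token_ids)
    first = all_token_ids[0]
    prefix = 0
    # Only whole chunks count, so advance one chunk at a time.
    while prefix + chunk_size <= shortest:
        chunk = list(first[prefix:prefix + chunk_size])
        if any(list(ids[prefix:prefix + chunk_size]) != chunk for ids in all_token_ids[1:]):
            break  # some prompt diverges inside this chunk
        prefix += chunk_size
    return prefix


prompts = [
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 9, 9],
    [1, 2, 3, 4, 0, 0],
]
shared = common_chunk_prefix_len(prompts, chunk_size=2)
```

With chunk_size=2 the first two chunks match in every prompt and the third does not, so the shared prefix is 4 tokens even though two of the prompts also agree on token 5.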
- kvboost.batch.broadcast_kv(kv, batch_size)
Expand cached KV from [1, heads, seq, dim] to [batch, heads, seq, dim]. Uses expand() — zero-copy, shares underlying storage.
- Parameters:
kv (Tuple[Tuple[torch.Tensor, torch.Tensor], ...])
batch_size (int)
- Return type:
Tuple[Tuple[torch.Tensor, torch.Tensor], ...]
- kvboost.batch.group_by_prefix(prompts, token_ids_list, chunk_size, n_prefix_chunks=3)
Group prompt indices by the hash of their first N chunks. Prompts sharing the same prefix chunks are batched together. Returns: prefix_key → [prompt_indices].
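The grouping can be sketched as follows. The helper names, the use of SHA-256, and the handling of prompts shorter than the prefix (hashed on whatever tokens they have) are assumptions; the prompts argument of the real function is omitted because only token IDs are needed for the key.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Sequence


def prefix_key(token_ids: Sequence[int], chunk_size: int, n_prefix_chunks: int = 3) -> str:
    """Hash the first n_prefix_chunks chunks of a prompt (hypothetical helper)."""
    prefix = token_ids[:chunk_size * n_prefix_chunks]
    data = ",".join(str(t) for t in prefix).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def group_indices_by_prefix(
    token_ids_list: Sequence[Sequence[int]],
    chunk_size: int,
    n_prefix_chunks: int = 3,
) -> Dict[str, List[int]]:
    """Return prefix_key -> indices of the prompts that share that prefix."""
    groups = defaultdict(list)
    for index, token_ids in enumerate(token_ids_list):
        groups[prefix_key(token_ids, chunk_size, n_prefix_chunks)].append(index)
    return dict(groups)


token_ids_list = [
    [1, 2, 3, 4, 9],
    [1, 2, 3, 4, 8],
    [5, 6, 7, 8, 9],
]
groups = group_indices_by_prefix(token_ids_list, chunk_size=2, n_prefix_chunks=2)
```

Prompts 0 and 1 share their first four tokens and land in one group; prompt 2 forms its own.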
CacheBlend Recompute
- class kvboost.cacheblend.CacheBlendRecompute(recompute_ratio=0.15, min_deviation=0.01, device='cpu')
- apply(assembled, model)
Fix stale KV tensors by deviation-guided selective recomputation. Same interface as SelectiveRecompute.apply().
- Parameters:
assembled (AssembledPrompt)
- Return type:
AssembledPrompt
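One plausible reading of "deviation-guided" selection is sketched below. It assumes that recompute_ratio caps the fraction of tokens recomputed and that min_deviation is a floor below which a token is never recomputed. Both interpretations, the function name, and the per-token deviation scores are assumptions, not confirmed library behavior.

```python
from typing import List, Sequence


def select_tokens_to_recompute(
    deviations: Sequence[float],
    recompute_ratio: float = 0.15,
    min_deviation: float = 0.01,
) -> List[int]:
    """Pick the highest-deviation token positions within a fixed budget."""
    budget = int(len(deviations) * recompute_ratio)
    # Ignore tokens whose cached KV is already close to the fresh values.
    candidates = [i for i, d in enumerate(deviations) if d >= min_deviation]
    # Spend the budget on the most stale tokens first.
    candidates.sort(key=lambda i: deviations[i], reverse=True)
    return sorted(candidates[:budget])


deviations = [0.0, 0.5, 0.02, 0.9, 0.005, 0.3, 0.0, 0.0, 0.0, 0.0]
selected = select_tokens_to_recompute(deviations, recompute_ratio=0.2)
```

Here the budget is two tokens out of ten, so only positions 1 and 3 are chosen even though four positions exceed the threshold.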
Selective Recompute
- class kvboost.selective_recompute.SelectiveRecompute(recompute_overlap=16, skip_if_no_seams=True, device='cpu')
- apply(assembled, model)
Optionally fix the KV seams in assembled.cached_past_kv. Returns a (possibly modified) AssembledPrompt. model must be a HuggingFace CausalLM.
- Parameters:
assembled (AssembledPrompt)
- Return type:
AssembledPrompt
Chunk Registry
- class kvboost.chunk_registry.ChunkRegistry(chunk_size=128, strategy=ChunkStrategy.FIXED, min_chunk_tokens=32)
Converts (text, token_ids) into a list of (start, end, sub_token_ids) triples according to the configured strategy.
The registry itself holds no KV state — that lives in KVCacheManager.
- Parameters:
chunk_size (int)
strategy (ChunkStrategy)
min_chunk_tokens (int)
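Fixed-size splitting into (start, end, sub_token_ids) triples can be sketched as below. The function name is hypothetical, and the treatment of min_chunk_tokens (a trailing fragment shorter than the minimum is dropped) is an assumption about what the parameter does.

```python
from typing import List, Sequence, Tuple

Chunk = Tuple[int, int, List[int]]


def fixed_chunks(
    token_ids: Sequence[int],
    chunk_size: int = 128,
    min_chunk_tokens: int = 32,
) -> List[Chunk]:
    """Split token_ids into consecutive (start, end, sub_token_ids) triples."""
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        end = min(start + chunk_size, len(token_ids))
        if end - start < min_chunk_tokens:
            continue  # assumed: fragments below the minimum are not registered
        chunks.append((start, end, list(token_ids[start:end])))
    return chunks


triples = fixed_chunks(list(range(10)), chunk_size=4, min_chunk_tokens=3)
```

The end offset is exclusive, so token_ids[start:end] reproduces each chunk's tokens exactly.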
Model Compatibility
- kvboost.compat.check_model_compatibility(model, strict=True)
Validate that a model’s architecture is compatible with KV cache stitching.
- Parameters:
model – A HuggingFace CausalLM model instance.
strict (bool) – If True (default), raise ValueError for unsupported models and warn for untested ones. If False, only warn.
- Raises:
ValueError – If the model architecture is known to be incompatible.
- Return type:
None
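The strict and non-strict behavior described above can be sketched against an architecture name. This is an illustration only: check_architecture is a hypothetical helper, the two tables are small stand-ins for the module's constants, and how the real function derives the name from a model instance is not shown in this reference.

```python
import warnings

# Stand-ins for the module constants; only a few entries are reproduced.
SUPPORTED = {"LlamaForCausalLM", "MistralForCausalLM"}
UNSUPPORTED = {
    "GPT2LMHeadModel": "GPT-2 uses learned absolute positional embeddings.",
}


def check_architecture(arch_name: str, strict: bool = True) -> None:
    """Raise or warn depending on which table the architecture appears in."""
    if arch_name in UNSUPPORTED:
        message = f"{arch_name} is not supported: {UNSUPPORTED[arch_name]}"
        if strict:
            raise ValueError(message)
        warnings.warn(message)  # non-strict mode only warns
    elif arch_name not in SUPPORTED:
        # Neither known-good nor known-bad: untested, so warn in both modes.
        warnings.warn(f"{arch_name} is untested with KV cache stitching.")


check_architecture("LlamaForCausalLM")  # supported: returns silently
```

Passing strict=False is useful for experimenting with an architecture that is known to be incompatible, at the cost of possibly incorrect positions.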
- kvboost.compat.SUPPORTED_ARCHITECTURES = {'Gemma2ForCausalLM', 'GemmaForCausalLM', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'MistralForCausalLM', 'Phi3ForCausalLM', 'PhiForCausalLM', 'Qwen2ForCausalLM', 'Qwen2_5ForCausalLM', 'StableLmForCausalLM'}
- kvboost.compat.UNSUPPORTED_ARCHITECTURES
Dictionary mapping each unsupported architecture name to the reason it is incompatible:
BloomForCausalLM – BLOOM uses ALiBi positional encoding.
FalconForCausalLM – Falcon uses ALiBi positional encoding.
GPT2LMHeadModel – GPT-2 uses learned absolute positional embeddings. Position info is baked into token representations at the embedding layer, so KV cache stitching cannot correct for position mismatches.
GPTNeoForCausalLM – GPT-Neo uses learned absolute positional embeddings.
GPTNeoXForCausalLM – GPT-NeoX uses rotary embeddings but the HF implementation does not accept position_ids, so KV stitching may produce incorrect positions.
MPTForCausalLM – MPT uses ALiBi positional encoding. Positional bias is added directly to attention scores based on token distance; there is no position_ids tensor to inject, so KV cache stitching cannot produce correct positions.