Internal Modules

These modules are used internally by KVBoost but are available for advanced use.

KV Quantization

kvboost.kv_quantize.quantize_kv(kv, bits=8)[source]

Quantize a PastKVType using KIVI-style asymmetric quantization.

Parameters:
Returns:

QuantizedKV container with compressed tensors + scale factors.

Return type:

QuantizedKV

kvboost.kv_quantize.dequantize_kv(qkv)[source]

Dequantize a QuantizedKV back to float16 PastKVType.

Parameters:

qkv (QuantizedKV)

Return type:

Tuple[Tuple[torch.Tensor, torch.Tensor], …]

class kvboost.kv_quantize.QuantizedKV(layers, bits, original_dtype)[source]

Full model’s quantized KV cache — drop-in replacement for PastKVType in storage.

Parameters:
layers: List[QuantizedLayer]
bits: int
original_dtype: torch.dtype
memory_bytes()[source]
Return type:

int

Disk Tier

class kvboost.disk_tier.DiskTier(cache_dir, max_chunks=256, slot_bytes=10485760)[source]

Memory-mapped disk cache for KV tensors.

Instead of one file per chunk (torch.save), uses a single pre-allocated binary file with fixed-size slots. An in-memory JSON index maps chunk hashes to slot numbers.

Parameters:
  • cache_dir (str)

  • max_chunks (int)

  • slot_bytes (int)

write(chunk)[source]

Write a chunk’s KV tensors and metadata to a disk slot. Returns True if stored successfully, False if no space.

Parameters:

chunk (CachedChunk)

Return type:

bool

read(chunk_hash, device='cpu')[source]

Read a chunk from disk. Returns a CachedChunk with KV tensors on the specified device, or None if not found.

Parameters:
  • chunk_hash (str)

  • device (str)

Return type:

CachedChunk | None

contains(chunk_hash)[source]
Parameters:

chunk_hash (str)

Return type:

bool

remove(chunk_hash)[source]
Parameters:

chunk_hash (str)

Return type:

None

stats()[source]
Return type:

dict

Batch Utilities

kvboost.batch.find_common_chunk_prefix(all_token_ids, chunk_size)[source]

Returns length of the longest chunk-aligned prefix shared by all prompts. Stops at the first chunk where any prompt diverges.

Parameters:
Return type:

int

kvboost.batch.broadcast_kv(kv, batch_size)[source]

Expand cached KV from [1, heads, seq, dim] to [batch, heads, seq, dim]. Uses expand() — zero-copy, shares underlying storage.

Parameters:
Return type:

Tuple[Tuple[torch.Tensor, torch.Tensor], …]

kvboost.batch.group_by_prefix(prompts, token_ids_list, chunk_size, n_prefix_chunks=3)[source]

Group prompt indices by the hash of their first N chunks. Prompts sharing the same prefix chunks are batched together. Returns: prefix_key → [prompt_indices].

Parameters:
Return type:

Dict[str, List[int]]

kvboost.batch.pad_and_mask(suffix_ids_list, pad_token_id)[source]

Pad suffix token lists to equal length. Returns (padded_ids, attention_masks). Masks are 1 for real tokens, 0 for padding.

Parameters:
Return type:

Tuple[List[List[int]], List[List[int]]]

CacheBlend Recompute

class kvboost.cacheblend.CacheBlendRecompute(recompute_ratio=0.15, min_deviation=0.01, device='cpu')[source]
Parameters:
apply(assembled, model)[source]

Fix stale KV tensors by deviation-guided selective recomputation. Same interface as SelectiveRecompute.apply().

Parameters:

assembled (AssembledPrompt)

Return type:

AssembledPrompt

Selective Recompute

class kvboost.selective_recompute.SelectiveRecompute(recompute_overlap=16, skip_if_no_seams=True, device='cpu')[source]
Parameters:
  • recompute_overlap (int)

  • skip_if_no_seams (bool)

  • device (str)

apply(assembled, model)[source]

Optionally fix the KV seams in assembled.cached_past_kv. Returns a (possibly modified) AssembledPrompt. model must be a HuggingFace CausalLM.

Parameters:

assembled (AssembledPrompt)

Return type:

AssembledPrompt

Chunk Registry

class kvboost.chunk_registry.ChunkRegistry(chunk_size=128, strategy=ChunkStrategy.FIXED, min_chunk_tokens=32)[source]

Converts (text, token_ids) into a list of (start, end, sub_token_ids) triples according to the configured strategy.

The registry itself holds no KV state — that lives in KVCacheManager.

Parameters:
split(token_ids, text='')[source]

Split token_ids into cacheable slices.

Returns list of (start, end, slice_token_ids). end is exclusive.

Parameters:
Return type:

List[Tuple[int, int, List[int]]]

chunk_ids_for(token_ids)[source]

Return the chunk_ids (hashes) for each chunk of this token sequence.

Parameters:

token_ids (List[int])

Return type:

List[str]

class kvboost.chunk_registry.ChunkStrategy(value)[source]
FIXED = 'fixed'
SEMANTIC = 'semantic'
DOCUMENT = 'document'

Model Compatibility

kvboost.compat.check_model_compatibility(model, strict=True)[source]

Validate that a model’s architecture is compatible with KV cache stitching.

Parameters:
  • model – A HuggingFace CausalLM model instance.

  • strict (bool) – If True (default), raise ValueError for unsupported models and warn for untested ones. If False, only warn.

Raises:

ValueError – If the model architecture is known to be incompatible.

Return type:

None

kvboost.compat.SUPPORTED_ARCHITECTURES = {'Gemma2ForCausalLM', 'GemmaForCausalLM', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'MistralForCausalLM', 'Phi3ForCausalLM', 'PhiForCausalLM', 'Qwen2ForCausalLM', 'Qwen2_5ForCausalLM', 'StableLmForCausalLM'}

set() -> new empty set object set(iterable) -> new set object

Build an unordered collection of unique elements.

kvboost.compat.UNSUPPORTED_ARCHITECTURES = {'BloomForCausalLM': 'BLOOM uses ALiBi positional encoding.', 'FalconForCausalLM': 'Falcon uses ALiBi positional encoding.', 'GPT2LMHeadModel': 'GPT-2 uses learned absolute positional embeddings. Position info is baked into token representations at the embedding layer KV cache stitching cannot correct for position mismatches.', 'GPTNeoForCausalLM': 'GPT-Neo uses learned absolute positional embeddings.', 'GPTNeoXForCausalLM': 'GPT-NeoX uses rotary embeddings but the HF implementation does not accept position_ids KV stitching may produce incorrect positions.', 'MPTForCausalLM': 'MPT uses ALiBi positional encoding. Positional bias is added directly to attention scores based on token distance there is no position_ids tensor to inject, so KV cache stitching cannot produce correct positions.'}

dict() -> new empty dictionary dict(mapping) -> new dictionary initialized from a mapping object’s

(key, value) pairs

dict(iterable) -> new dictionary initialized as if via:

d = {} for k, v in iterable:

d[k] = v

dict(**kwargs) -> new dictionary initialized with the name=value pairs

in the keyword argument list. For example: dict(one=1, two=2)