Internal Modules¶

These modules are used internally by KVBoost but are available for advanced use.

KV Quantization¶

kvboost.kv_quantize.quantize_kv(kv, bits=8)[source]¶

Quantize a PastKVType using KIVI-style asymmetric quantization.

Parameters:

kv (Tuple[Tuple[torch.Tensor, torch.Tensor], ...]) – Standard HF past_key_values tuple.
bits (int) – 8 (int8, safe) or 4 (int4, aggressive). 16 returns passthrough.

Returns:

QuantizedKV container with compressed tensors + scale factors.

Return type:

QuantizedKV

kvboost.kv_quantize.dequantize_kv(qkv)[source]¶

Dequantize a QuantizedKV back to float16 PastKVType.

Parameters:: qkv (QuantizedKV)
Return type:: Tuple[Tuple[torch.Tensor, torch.Tensor], …]

class kvboost.kv_quantize.QuantizedKV(layers, bits, original_dtype)[source]¶

Full model’s quantized KV cache — drop-in replacement for PastKVType in storage.

Parameters:

layers (List[QuantizedLayer])
bits (int)
original_dtype (torch.dtype)

layers: List[QuantizedLayer]¶

bits: int¶

original_dtype: torch.dtype¶

memory_bytes()[source]¶

Return type:: int

Disk Tier¶

class kvboost.disk_tier.DiskTier(cache_dir, max_chunks=256, slot_bytes=10485760)[source]¶

Memory-mapped disk cache for KV tensors.

Instead of one file per chunk (torch.save), uses a single pre-allocated binary file with fixed-size slots. An in-memory JSON index maps chunk hashes to slot numbers.

Parameters:

cache_dir (str)
max_chunks (int)
slot_bytes (int)

write(chunk)[source]¶

Write a chunk’s KV tensors and metadata to a disk slot. Returns True if stored successfully, False if no space.

Parameters:: chunk (CachedChunk)
Return type:: bool

read(chunk_hash, device='cpu')[source]¶

Read a chunk from disk. Returns a CachedChunk with KV tensors on the specified device, or None if not found.

Parameters:

chunk_hash (str)
device (str)

Return type:

CachedChunk | None

contains(chunk_hash)[source]¶

Parameters:: chunk_hash (str)
Return type:: bool

remove(chunk_hash)[source]¶

Parameters:: chunk_hash (str)
Return type:: None

stats()[source]¶

Return type:: dict

Batch Utilities¶

kvboost.batch.find_common_chunk_prefix(all_token_ids, chunk_size)[source]¶

Returns length of the longest chunk-aligned prefix shared by all prompts. Stops at the first chunk where any prompt diverges.

Parameters:

all_token_ids (List[List[int]])
chunk_size (int)

Return type:

int

kvboost.batch.broadcast_kv(kv, batch_size)[source]¶

Expand cached KV from [1, heads, seq, dim] to [batch, heads, seq, dim]. Uses expand() — zero-copy, shares underlying storage.

Parameters:

kv (Tuple[Tuple[torch.Tensor, torch.Tensor], ...])
batch_size (int)

Return type:

Tuple[Tuple[torch.Tensor, torch.Tensor], …]

kvboost.batch.group_by_prefix(prompts, token_ids_list, chunk_size, n_prefix_chunks=3)[source]¶

Group prompt indices by the hash of their first N chunks. Prompts sharing the same prefix chunks are batched together. Returns: prefix_key → [prompt_indices].

Parameters:

prompts (List[str])
token_ids_list (List[List[int]])
chunk_size (int)
n_prefix_chunks (int)

Return type:

Dict[str, List[int]]

kvboost.batch.pad_and_mask(suffix_ids_list, pad_token_id)[source]¶

Pad suffix token lists to equal length. Returns (padded_ids, attention_masks). Masks are 1 for real tokens, 0 for padding.

Parameters:

suffix_ids_list (List[List[int]])
pad_token_id (int)

Return type:

Tuple[List[List[int]], List[List[int]]]

CacheBlend Recompute¶

class kvboost.cacheblend.CacheBlendRecompute(recompute_ratio=0.15, min_deviation=0.01, device='cpu')[source]¶

Parameters:

recompute_ratio (float)
min_deviation (float)
device (str)

apply(assembled, model)[source]¶

Fix stale KV tensors by deviation-guided selective recomputation. Same interface as SelectiveRecompute.apply().

Parameters:: assembled (AssembledPrompt)
Return type:: AssembledPrompt

Selective Recompute¶

class kvboost.selective_recompute.SelectiveRecompute(recompute_overlap=16, skip_if_no_seams=True, device='cpu')[source]¶

Parameters:

recompute_overlap (int)
skip_if_no_seams (bool)
device (str)

apply(assembled, model)[source]¶

Optionally fix the KV seams in assembled.cached_past_kv. Returns a (possibly modified) AssembledPrompt. model must be a HuggingFace CausalLM.

Parameters:: assembled (AssembledPrompt)
Return type:: AssembledPrompt

Chunk Registry¶

class kvboost.chunk_registry.ChunkRegistry(chunk_size=128, strategy=ChunkStrategy.FIXED, min_chunk_tokens=32)[source]¶

Converts (text, token_ids) into a list of (start, end, sub_token_ids) triples according to the configured strategy.

The registry itself holds no KV state — that lives in KVCacheManager.

Parameters:

chunk_size (int)
strategy (ChunkStrategy)
min_chunk_tokens (int)

split(token_ids, text='')[source]¶

Split token_ids into cacheable slices.

Returns list of (start, end, slice_token_ids). end is exclusive.

Parameters:

token_ids (List[int])
text (str)

Return type:

List[Tuple[int, int, List[int]]]

chunk_ids_for(token_ids)[source]¶

Return the chunk_ids (hashes) for each chunk of this token sequence.

Parameters:: token_ids (List[int])
Return type:: List[str]

class kvboost.chunk_registry.ChunkStrategy(value)[source]¶

FIXED = 'fixed'¶

SEMANTIC = 'semantic'¶

DOCUMENT = 'document'¶

Model Compatibility¶

kvboost.compat.check_model_compatibility(model, strict=True)[source]¶

Validate that a model’s architecture is compatible with KV cache stitching.

Parameters:

model – A HuggingFace CausalLM model instance.
strict (bool) – If True (default), raise ValueError for unsupported models and warn for untested ones. If False, only warn.

Raises:

ValueError – If the model architecture is known to be incompatible.

Return type:

None

kvboost.compat.SUPPORTED_ARCHITECTURES = {'Gemma2ForCausalLM', 'GemmaForCausalLM', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'MistralForCausalLM', 'Phi3ForCausalLM', 'PhiForCausalLM', 'Qwen2ForCausalLM', 'Qwen2_5ForCausalLM', 'StableLmForCausalLM'}¶

set() -> new empty set object set(iterable) -> new set object

Build an unordered collection of unique elements.

kvboost.compat.UNSUPPORTED_ARCHITECTURES = {'BloomForCausalLM': 'BLOOM uses ALiBi positional encoding.', 'FalconForCausalLM': 'Falcon uses ALiBi positional encoding.', 'GPT2LMHeadModel': 'GPT-2 uses learned absolute positional embeddings. Position info is baked into token representations at the embedding layer — KV cache stitching cannot correct for position mismatches.', 'GPTNeoForCausalLM': 'GPT-Neo uses learned absolute positional embeddings.', 'GPTNeoXForCausalLM': 'GPT-NeoX uses rotary embeddings but the HF implementation does not accept position_ids — KV stitching may produce incorrect positions.', 'MPTForCausalLM': 'MPT uses ALiBi positional encoding. Positional bias is added directly to attention scores based on token distance — there is no position_ids tensor to inject, so KV cache stitching cannot produce correct positions.'}¶

dict() -> new empty dictionary dict(mapping) -> new dictionary initialized from a mapping object’s

(key, value) pairs

dict(iterable) -> new dictionary initialized as if via:: d = {} for k, v in iterable:

d[k] = v
dict(**kwargs) -> new dictionary initialized with the name=value pairs: in the keyword argument list. For example: dict(one=1, two=2)