Architecture Overview¶

KVBoost is composed of six core modules that form a pipeline:

prompt_text
    |
    v
 tokenize
    |
    v
 ChunkRegistry.split()            -- split into fixed/semantic chunks
    |
    v
 KVCacheManager.find_matches()    -- two-tier hash lookup per chunk
    |
    v
 PromptAssembler.assemble()       -- stitch cached KV + live tokens
    |
    v
 Recompute (selective | cacheblend)  -- fix stale KV from stitching
    |
    v
 model.forward(live_tokens, past_key_values=cached_kv)
    |
    v
 GenerationResult

Module Responsibilities¶

ChunkRegistry (chunk_registry.py): Splits token sequences into cacheable chunks. Supports fixed-size, semantic (paragraph-aligned), and document-level strategies.
KVCacheManager (cache_manager.py): Two-tier storage (hot RAM + cold disk) with two-tier keying (prefix-chained + content-only). Frequency-based eviction protects frequently-used system prompt chunks.
PromptAssembler (prompt_assembler.py): Given a token sequence and the cache, assembles the cached KV prefix and identifies which tokens need fresh computation.
SelectiveRecompute (selective_recompute.py): Fixes chunk boundary seams by recomputing the last R tokens at each junction with full cross-chunk attention.
CacheBlendRecompute (cacheblend.py): Deviation-guided recompute: measures per-token KV deviation and recomputes only the ~15% that actually changed.
InferenceEngine (engine.py): Ties everything together. Exposes generate(), generate_batch(), generate_many(), warm(), and verify_correctness().

Supporting Modules¶

kv_quantize.py: KIVI-style asymmetric int8/int4 quantization for compressed KV storage.
disk_tier.py: Flat block-pool disk cache with atomic index persistence.
batch.py: Batched inference primitives: prefix detection, zero-copy KV broadcast, padded decode loops.
compat.py: Model architecture validation. Blocks ALiBi/absolute-position models, warns on untested architectures.
models.py: Core data structures: CachedChunk, AssembledPrompt, WarmResult, and the two-tier hashing functions.