Engine API

KVBoost

KVBoost is an alias for kvboost.engine.InferenceEngine.

class kvboost.engine.InferenceEngine(model, tokenizer, chunk_size=128, max_chunks=128, recompute_overlap=16, recompute_strategy=RecomputeStrategy.SELECTIVE, recompute_ratio=0.15, kv_cache_bits=16, disk_cache_dir=None, device=None)
Parameters:
  • model (AutoModelForCausalLM)

  • tokenizer (AutoTokenizer)

  • chunk_size (int)

  • max_chunks (int)

  • recompute_overlap (int)

  • recompute_strategy (RecomputeStrategy)

  • recompute_ratio (float)

  • kv_cache_bits (int)

  • disk_cache_dir (Optional[str])

  • device (Optional[str])

classmethod from_pretrained(model_name='TinyLlama/TinyLlama-1.1B-Chat-v1.0', strict=True, **kwargs)

Load a HuggingFace model and create a KVBoost engine.

Parameters:
  • model_name (str) – Any HF decoder-only causal LM (must use RoPE).

  • strict (bool) – If True (default), raise on unsupported architectures and warn on untested ones. Set to False to skip these checks.

  • **kwargs – Passed to InferenceEngine.__init__().

Return type:

InferenceEngine
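
A minimal sketch of loading an engine through from_pretrained(), assuming the kvboost package is installed and the class is importable from kvboost.engine as the signature above indicates. The load_engine wrapper and its engine_cls parameter are hypothetical helpers, added only so the call can be exercised with a stand-in class; the keyword overrides are the constructor parameters listed above.

```python
from typing import Any, Optional


def load_engine(
    model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    strict: bool = True,
    engine_cls: Optional[Any] = None,  # hypothetical hook, not part of kvboost
    **engine_kwargs: Any,
) -> Any:
    """Create an engine via from_pretrained(), forwarding constructor overrides."""
    if engine_cls is None:
        # Imported lazily so this module still loads without kvboost installed.
        from kvboost.engine import InferenceEngine

        engine_cls = InferenceEngine
    # Extra keyword arguments are passed on to InferenceEngine.__init__().
    return engine_cls.from_pretrained(model_name, strict=strict, **engine_kwargs)
```

With kvboost installed, load_engine(chunk_size=128, recompute_overlap=16) would load the default model with those two constructor values set explicitly.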

generate(prompt, max_new_tokens=64, mode=GenerationMode.CHUNK_KV_REUSE, temperature=1.0, do_sample=False)
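Parameters:
  • prompt (str)

  • max_new_tokens (int)

  • mode (GenerationMode)

  • temperature (float)

  • do_sample (bool)
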
Return type:

GenerationResult
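
One way to see what the mode argument changes is to run the same prompt under several modes and compare the timing fields of each GenerationResult. The sketch below assumes an already constructed engine; compare_modes is a hypothetical helper, and the caller is expected to pass GenerationMode members (for example GenerationMode.BASELINE and GenerationMode.CHUNK_KV_REUSE).

```python
from typing import Any, Dict, Iterable


def compare_modes(
    engine: Any,
    prompt: str,
    modes: Iterable[Any],
    max_new_tokens: int = 64,
) -> Dict[str, Dict[str, float]]:
    """Run one prompt under each mode and collect timing and reuse fields."""
    report: Dict[str, Dict[str, float]] = {}
    for mode in modes:
        # Greedy decoding keeps the runs comparable across modes.
        result = engine.generate(
            prompt, max_new_tokens=max_new_tokens, mode=mode, do_sample=False
        )
        # Key by the enum's string value when present, else by str(mode).
        key = getattr(mode, "value", str(mode))
        report[key] = {
            "ttft_ms": result.ttft_ms,
            "total_ms": result.total_ms,
            "kv_reuse_ratio": result.kv_reuse_ratio,
        }
    return report
```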

generate_batch(prompts, max_new_tokens=64, temperature=1.0, do_sample=False)

Generate responses for multiple prompts that share a common prefix. The shared prefix KV is loaded once, then prefill and decode run batched.

Parameters:
  • prompts (List[str]) – List of prompts (should share a common prefix for best results).

  • max_new_tokens (int) – Max tokens to generate per prompt.

  • temperature (float) – Sampling temperature.

  • do_sample (bool) – Greedy (False) or sampling (True).

Returns:

List of GenerationResult, one per prompt.

Return type:

List[GenerationResult]
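
Because generate_batch() works best when prompts share a prefix, it helps to put the shared text first in every prompt. The sketch below is one way to do that for several questions about the same document; build_prompts, answer_questions, and the prompt template are illustrative assumptions, not part of kvboost.

```python
from typing import Any, List


def build_prompts(shared_context: str, questions: List[str]) -> List[str]:
    """Place the shared text first so every prompt starts with the same prefix."""
    return [f"{shared_context}\n\nQuestion: {q}\nAnswer:" for q in questions]


def answer_questions(
    engine: Any,
    shared_context: str,
    questions: List[str],
    max_new_tokens: int = 64,
) -> List[str]:
    """Batch the questions and return one output string per question."""
    prompts = build_prompts(shared_context, questions)
    results = engine.generate_batch(
        prompts, max_new_tokens=max_new_tokens, do_sample=False
    )
    # One GenerationResult per prompt; keep only the generated text.
    return [r.output_text for r in results]
```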

generate_many(prompts, max_new_tokens=64, temperature=1.0, do_sample=False)

Like generate_batch(), but auto-groups prompts by shared prefix. Prompts without shared prefixes are processed individually.

Parameters:
  • prompts (List[str]) – List of prompts (may or may not share prefixes).

  • max_new_tokens (int) – Max tokens to generate per prompt.

  • temperature (float)

  • do_sample (bool)

Returns:

List of GenerationResult in the same order as input prompts.

Return type:

List[GenerationResult]
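
Since results come back in the same order as the input prompts, they can be matched to caller-side identifiers with a plain zip. A small sketch, assuming an existing engine; generate_labeled and the id-to-prompt mapping are hypothetical.

```python
from typing import Any, Dict


def generate_labeled(
    engine: Any,
    prompts_by_id: Dict[str, str],
    max_new_tokens: int = 64,
) -> Dict[str, str]:
    """Run generate_many() and map each output back to its caller-side id."""
    ids = list(prompts_by_id)
    results = engine.generate_many(
        [prompts_by_id[i] for i in ids], max_new_tokens=max_new_tokens
    )
    # Output order matches input order, so zip restores the id mapping.
    return {i: r.output_text for i, r in zip(ids, results)}
```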

warm(text, position_offset=0)

Encode text and cache all its fixed-size chunks.

Returns a WarmResult with diagnostics, including alignment warnings. The result can be used as a boolean or an int; both reflect chunks_stored.

Call this for your system prompt / few-shot examples / documents BEFORE calling generate() so the cache is already populated.

Parameters:
  • text (str)

  • position_offset (int)

Return type:

WarmResult
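
The warm-then-generate order described above could be wrapped as follows. This is a sketch assuming an existing engine; warm_then_generate is a hypothetical helper, and the guess that an empty warm result means the text is shorter than one chunk is an assumption rather than documented behavior.

```python
from typing import Any, Tuple


def warm_then_generate(
    engine: Any,
    system_prompt: str,
    user_prompt: str,
    max_new_tokens: int = 64,
) -> Tuple[int, Any]:
    """Cache the system prompt's chunks, then generate with the full prompt."""
    warm_result = engine.warm(system_prompt)
    if not warm_result:
        # Assumption: nothing stored may mean the text is shorter than one chunk.
        print("warning: warm() stored no chunks")
    # The full prompt starts with the warmed text so its chunks can be reused.
    result = engine.generate(
        system_prompt + user_prompt, max_new_tokens=max_new_tokens
    )
    return warm_result.chunks_stored, result
```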

cache_stats()
Return type:

Dict

verify_correctness(max_new_tokens=32)

Quick self-test: runs a greedy decode on a synthetic prompt in both BASELINE and CHUNK_KV_REUSE modes and verifies that the outputs are identical.

Returns True if outputs match, False otherwise. Use this to validate untested model architectures before trusting cached outputs in production.

Parameters:
  • max_new_tokens (int)

Return type:

bool
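
For an untested architecture, the self-test can act as a gate before the engine is used. A minimal sketch, assuming an engine that was loaded with strict=False; require_verified is a hypothetical helper.

```python
from typing import Any


def require_verified(engine: Any, max_new_tokens: int = 32) -> Any:
    """Return the engine only if cached and baseline outputs match."""
    if not engine.verify_correctness(max_new_tokens=max_new_tokens):
        raise RuntimeError(
            "verify_correctness() reported a mismatch between BASELINE and "
            "CHUNK_KV_REUSE outputs; do not rely on cached generation."
        )
    return engine
```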

GenerationMode

class kvboost.engine.GenerationMode(value)
BASELINE = 'baseline'
PREFIX_CACHE = 'prefix_cache'
CHUNK_KV_REUSE = 'chunk_kv_reuse'
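
The (value) signature suggests a standard Python Enum, in which case a member can be looked up from its string value, for example when the mode comes from a config file. The sketch below uses a local stand-in enum that mirrors the members above so it runs without kvboost; parse_mode is a hypothetical helper, and the assumption that the real class behaves like a plain Enum is not confirmed here.

```python
from enum import Enum


class GenerationMode(Enum):
    """Local stand-in mirroring the documented members (not the real class)."""

    BASELINE = "baseline"
    PREFIX_CACHE = "prefix_cache"
    CHUNK_KV_REUSE = "chunk_kv_reuse"


def parse_mode(name: str) -> GenerationMode:
    """Look up a mode by string value, ignoring case and surrounding spaces."""
    try:
        return GenerationMode(name.strip().lower())
    except ValueError:
        valid = ", ".join(m.value for m in GenerationMode)
        raise ValueError(f"unknown mode {name!r}; expected one of: {valid}") from None
```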

RecomputeStrategy

class kvboost.engine.RecomputeStrategy(value)
SELECTIVE = 'selective'
CACHEBLEND = 'cacheblend'
NONE = 'none'

GenerationResult

class kvboost.engine.GenerationResult(mode: str, prompt: str, output_text: str, generated_tokens: int, ttft_ms: float, total_ms: float, tokens_per_sec: float, kv_reuse_ratio: float, prompt_tokens: int, cached_tokens: int)
Parameters:
mode: str
prompt: str
output_text: str
generated_tokens: int
ttft_ms: float
total_ms: float
tokens_per_sec: float
kv_reuse_ratio: float
prompt_tokens: int
cached_tokens: int
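
The fields above lend themselves to simple aggregation over a batch of results. The sketch below defines a local stand-in dataclass with the same fields so it runs without kvboost; summarize and the cached_token_share metric (total cached tokens over total prompt tokens) are illustrative assumptions, not something the library computes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GenerationResult:
    """Local stand-in with the documented fields (not the real class)."""

    mode: str
    prompt: str
    output_text: str
    generated_tokens: int
    ttft_ms: float
    total_ms: float
    tokens_per_sec: float
    kv_reuse_ratio: float
    prompt_tokens: int
    cached_tokens: int


def summarize(results: List[GenerationResult]) -> Dict[str, float]:
    """Average the latency fields and compute an overall cached-token share."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    prompt_tokens = sum(r.prompt_tokens for r in results)
    cached_tokens = sum(r.cached_tokens for r in results)
    return {
        "mean_ttft_ms": sum(r.ttft_ms for r in results) / n,
        "mean_total_ms": sum(r.total_ms for r in results) / n,
        # Guard against division by zero when every prompt is empty.
        "cached_token_share": cached_tokens / prompt_tokens if prompt_tokens else 0.0,
    }
```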