Engine API

KVBoost

KVBoost is an alias for kvboost.engine.InferenceEngine.

class kvboost.engine.InferenceEngine(model, tokenizer, chunk_size=128, max_chunks=128, recompute_overlap=16, recompute_strategy=RecomputeStrategy.SELECTIVE, recompute_ratio=0.15, kv_cache_bits=16, disk_cache_dir=None, device=None)

Parameters:

model (AutoModelForCausalLM)
tokenizer (AutoTokenizer)
chunk_size (int)
max_chunks (int)
recompute_overlap (int)
recompute_strategy (RecomputeStrategy)
recompute_ratio (float)
kv_cache_bits (int)
disk_cache_dir (Optional[str])
device (Optional[str])

classmethod from_pretrained(model_name='TinyLlama/TinyLlama-1.1B-Chat-v1.0', strict=True, **kwargs)

Load a HuggingFace model and create a KVBoost engine.

Parameters:

model_name (str) – Any HF decoder-only causal LM (must use RoPE).
strict (bool) – If True (default), raise on unsupported architectures and warn on untested ones. Set False to skip checks.
**kwargs – Passed to InferenceEngine.__init__().

Return type:

InferenceEngine
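As an illustration, a call to from_pretrained() might be wrapped as below. This is a minimal sketch, not part of the library: load_engine and its engine_cls argument are hypothetical names, and engine_cls exists only so the wrapper can be exercised with a stand-in class when kvboost or the model weights are unavailable.

```python
def load_engine(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0", strict=True,
                engine_cls=None, **engine_kwargs):
    """Create an engine through from_pretrained().

    engine_cls is a hypothetical hook for this sketch; by default the
    documented kvboost.engine.InferenceEngine class is used.
    """
    if engine_cls is None:
        # Deferred import so this module still loads where kvboost is absent.
        from kvboost.engine import InferenceEngine
        engine_cls = InferenceEngine
    # Extra keyword arguments are forwarded to InferenceEngine.__init__().
    return engine_cls.from_pretrained(model_name, strict=strict, **engine_kwargs)
```

With kvboost installed, load_engine(chunk_size=128, recompute_overlap=16) would forward those two constructor parameters unchanged.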

generate(prompt, max_new_tokens=64, mode=GenerationMode.CHUNK_KV_REUSE, temperature=1.0, do_sample=False)

Parameters:

prompt (str)
max_new_tokens (int)
mode (GenerationMode)
temperature (float)
do_sample (bool)

Return type:

GenerationResult

generate_batch(prompts, max_new_tokens=64, temperature=1.0, do_sample=False)

Generate responses for multiple prompts that share a common prefix. Loads the shared-prefix KV cache once, then runs batched prefill and decode.

Parameters:

prompts (List[str]) – List of prompts (should share a common prefix for best results).
max_new_tokens (int) – Max tokens to generate per prompt.
temperature (float) – Sampling temperature.
do_sample (bool) – Greedy (False) or sampling (True).

Returns:

List of GenerationResult, one per prompt.

Return type:

List[GenerationResult]
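Before batching, it can help to check how much of the prompts is actually shared. The sketch below is a rough character-level check with hypothetical helper names; the engine itself works on token chunks, so treat the ratio as an estimate only.

```python
import os


def shared_prefix(prompts):
    """Longest character-level prefix common to every prompt."""
    # os.path.commonprefix compares strings character by character.
    return os.path.commonprefix(list(prompts))


def prefix_share_ratio(prompts):
    """Length of the shared prefix divided by the longest prompt length."""
    if not prompts:
        return 0.0
    longest = max(len(p) for p in prompts)
    return len(shared_prefix(prompts)) / longest if longest else 0.0
```

A ratio close to 1.0 suggests the prompts differ only in a short suffix, which is the case generate_batch() is described as targeting.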

generate_many(prompts, max_new_tokens=64, temperature=1.0, do_sample=False)

Like generate_batch(), but auto-groups prompts by shared prefix. Prompts without shared prefixes are processed individually.

Parameters:

prompts (List[str]) – List of prompts (may or may not share prefixes).
max_new_tokens (int) – Max tokens to generate per prompt.
temperature (float)
do_sample (bool)

Returns:

List of GenerationResult in the same order as input prompts.

Return type:

List[GenerationResult]
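The grouping idea can be sketched as follows. This is not the library's algorithm, which the source does not describe; it is a simplified stand-in that buckets prompts by their first min_prefix_chars characters (a hypothetical parameter) and keeps the original indices so results can be returned in input order.

```python
from collections import defaultdict


def group_by_prefix(prompts, min_prefix_chars=16):
    """Split prompt indices into shared-prefix batches and singles.

    Assumption for this sketch: two prompts "share a prefix" when their
    first min_prefix_chars characters are identical.
    """
    buckets = defaultdict(list)
    singles = []
    for index, prompt in enumerate(prompts):
        if len(prompt) < min_prefix_chars:
            singles.append(index)  # too short to have a usable prefix
        else:
            buckets[prompt[:min_prefix_chars]].append(index)

    batches = []
    for indices in buckets.values():
        if len(indices) > 1:
            batches.append(indices)  # candidates for batched generation
        else:
            singles.extend(indices)  # processed individually
    return batches, sorted(singles)
```

Because the function returns indices rather than prompt strings, a caller can write each result back to its original position after processing the groups.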

warm(text, position_offset=0)

Encode text and cache all its fixed-size chunks.

Returns a WarmResult with diagnostics, including alignment warnings. The result can be used as a bool or an int; both conversions are based on chunks_stored.

Call this for your system prompt / few-shot examples / documents BEFORE calling generate() so the cache is already populated.

Parameters:

text (str)
position_offset (int)

Return type:

WarmResult
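The warm-then-generate order described above could look like the sketch below. The helper warm_then_generate is hypothetical; it relies only on the documented warm() and generate() calls and on WarmResult being convertible to int, and it assumes each question is simply appended to the shared context.

```python
def warm_then_generate(engine, shared_context, questions, max_new_tokens=64):
    """Populate the cache with shared_context, then answer each question.

    Returns (chunks_stored, results). `engine` is any object exposing the
    documented warm() and generate() methods.
    """
    warm_result = engine.warm(shared_context)
    # The docs state the result is usable as an int via chunks_stored.
    chunks_stored = int(warm_result)

    results = [
        engine.generate(shared_context + question, max_new_tokens=max_new_tokens)
        for question in questions
    ]
    return chunks_stored, results
```

If chunks_stored comes back as 0, nothing was cached, and the diagnostics on the WarmResult are the place to look before relying on reuse.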

cache_stats()

Return type:

Dict

verify_correctness(max_new_tokens=32)

Quick self-test: runs greedy decoding on a synthetic prompt with both BASELINE and CHUNK_KV_REUSE and verifies that the two outputs are identical.

Returns True if outputs match, False otherwise. Use this to validate untested model architectures before trusting cached outputs in production.

Parameters:

max_new_tokens (int)

Return type:

bool
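One way to act on this check is to refuse to serve cached generations until it passes. The sketch below uses a hypothetical helper, require_verified, that depends only on the documented verify_correctness() method.

```python
def require_verified(engine, max_new_tokens=32):
    """Return the engine only if its self-test passes, else raise.

    `engine` is any object exposing the documented verify_correctness().
    """
    if not engine.verify_correctness(max_new_tokens=max_new_tokens):
        raise RuntimeError(
            "BASELINE and CHUNK_KV_REUSE outputs differ; "
            "do not rely on cached generation for this model."
        )
    return engine
```

Calling this once at startup keeps the check out of the request path while still guarding untested architectures.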

GenerationMode

class kvboost.engine.GenerationMode(value)

BASELINE = 'baseline'

PREFIX_CACHE = 'prefix_cache'

CHUNK_KV_REUSE = 'chunk_kv_reuse'

RecomputeStrategy

class kvboost.engine.RecomputeStrategy(value)

SELECTIVE = 'selective'

CACHEBLEND = 'cacheblend'

NONE = 'none'

GenerationResult

class kvboost.engine.GenerationResult(mode: str, prompt: str, output_text: str, generated_tokens: int, ttft_ms: float, total_ms: float, tokens_per_sec: float, kv_reuse_ratio: float, prompt_tokens: int, cached_tokens: int)

Parameters:

mode (str)
prompt (str)
output_text (str)
generated_tokens (int)
ttft_ms (float)
total_ms (float)
tokens_per_sec (float)
kv_reuse_ratio (float)
prompt_tokens (int)
cached_tokens (int)

mode: str

prompt: str

output_text: str

generated_tokens: int

ttft_ms: float

total_ms: float

tokens_per_sec: float

kv_reuse_ratio: float

prompt_tokens: int

cached_tokens: int
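These fields lend themselves to simple aggregation across a run. The sketch below uses a hypothetical summarize helper that reads only the attribute names listed above; the field semantics are inferred from those names (for example, that cached_tokens counts prompt tokens served from cache), since this reference does not define them.

```python
def summarize(results):
    """Aggregate timing and cache-reuse figures over GenerationResult-like objects."""
    if not results:
        return {"count": 0, "mean_ttft_ms": 0.0,
                "cached_fraction": 0.0, "generated_tokens": 0}

    total_prompt = sum(r.prompt_tokens for r in results)
    total_cached = sum(r.cached_tokens for r in results)
    return {
        "count": len(results),
        "mean_ttft_ms": sum(r.ttft_ms for r in results) / len(results),
        # Assumed meaning: share of all prompt tokens that came from cache.
        "cached_fraction": total_cached / total_prompt if total_prompt else 0.0,
        "generated_tokens": sum(r.generated_tokens for r in results),
    }
```

The output of generate_batch() or generate_many() can be passed in directly, since both return a list of GenerationResult.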