Engine API
KVBoost
KVBoost is an alias for kvboost.engine.InferenceEngine.
- class kvboost.engine.InferenceEngine(model, tokenizer, chunk_size=128, max_chunks=128, recompute_overlap=16, recompute_strategy=RecomputeStrategy.SELECTIVE, recompute_ratio=0.15, kv_cache_bits=16, disk_cache_dir=None, device=None)
- Parameters:
- classmethod from_pretrained(model_name='TinyLlama/TinyLlama-1.1B-Chat-v1.0', strict=True, **kwargs)
Load a HuggingFace model and create a KVBoost engine.
- Parameters:
model_name (str)
strict (bool)
kwargs
- Return type:
InferenceEngine
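A minimal loading sketch, assuming the kvboost package is installed and that the import path matches the class path shown above. The helper name load_engine is hypothetical; only from_pretrained, model_name, and strict come from the signature.

```python
def load_engine(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Build an engine from a HuggingFace model name.

    Hypothetical helper: it only wraps the documented classmethod.
    """
    # Imported lazily so this module can be imported without kvboost installed.
    from kvboost.engine import InferenceEngine

    # strict=True is the documented default; it is spelled out for visibility.
    return InferenceEngine.from_pretrained(model_name=model_name, strict=True)
```

Calling load_engine() loads the model, so it is best done once at startup and the engine reused across requests.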
- generate(prompt, max_new_tokens=64, mode=GenerationMode.CHUNK_KV_REUSE, temperature=1.0, do_sample=False)
- Parameters:
prompt (str)
max_new_tokens (int)
mode (GenerationMode)
temperature (float)
do_sample (bool)
- Return type:
GenerationResult
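This page does not say how the engine applies temperature and do_sample. In most decoders, do_sample=False means greedy (argmax) decoding and temperature rescales the logits before a softmax. The sketch below illustrates that common convention on a toy logit vector; pick_next_token is a hypothetical name and this is not the engine's own code.

```python
import math
import random


def pick_next_token(logits, temperature=1.0, do_sample=False, rng=None):
    """Pick a token index from raw logits, greedily or by temperature sampling."""
    if not do_sample:
        # Greedy decoding: the highest logit wins, so temperature has no effect.
        return max(range(len(logits)), key=lambda i: logits[i])
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    rng = rng or random.Random()
    # Scale the logits, then apply a numerically stable softmax (unnormalized).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.0]
greedy = pick_next_token(logits)  # argmax of the logits
sampled = pick_next_token(logits, temperature=0.7, do_sample=True, rng=random.Random(0))
```

Under this convention the defaults (temperature=1.0, do_sample=False) give deterministic output, which is what makes a BASELINE versus CHUNK_KV_REUSE comparison meaningful.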
- generate_batch(prompts, max_new_tokens=64, temperature=1.0, do_sample=False)
Generate responses for multiple prompts that share a common prefix. The KV cache for the shared prefix is loaded once, then prefill and decode run as a batch.
- Parameters:
prompts (list of str)
max_new_tokens (int)
temperature (float)
do_sample (bool)
- Returns:
List of GenerationResult, one per prompt.
- Return type:
list of GenerationResult
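To see why loading the shared prefix once helps, the sketch below splits toy token-ID lists into a common prefix and per-prompt suffixes and compares how many tokens need prefill in each case. The token IDs and helper names are made up for illustration. The real engine works on KV tensors and fixed-size chunks, which this does not model.

```python
def longest_common_prefix(sequences):
    """Return the longest list prefix shared by every sequence."""
    if not sequences:
        return []
    shortest = min(sequences, key=len)
    for i, token in enumerate(shortest):
        if any(seq[i] != token for seq in sequences):
            return list(shortest[:i])
    return list(shortest)


def split_batch(token_batches):
    """Split each prompt into (shared prefix, per-prompt suffix)."""
    prefix = longest_common_prefix(token_batches)
    suffixes = [seq[len(prefix):] for seq in token_batches]
    return prefix, suffixes


# Toy token IDs: the first four tokens play the role of a shared system prompt.
prompts = [
    [1, 2, 3, 4, 10, 11],
    [1, 2, 3, 4, 20],
    [1, 2, 3, 4, 30, 31, 32],
]
prefix, suffixes = split_batch(prompts)
naive_cost = sum(len(p) for p in prompts)                  # every prompt prefilled in full
shared_cost = len(prefix) + sum(len(s) for s in suffixes)  # prefix handled once
print(naive_cost, shared_cost)
```

The saving grows with the length of the shared prefix and the number of prompts in the batch.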
- generate_many(prompts, max_new_tokens=64, temperature=1.0, do_sample=False)
Like generate_batch(), but automatically groups prompts by shared prefix. Prompts that share no prefix with any other prompt are processed individually.
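The grouping rule itself is not documented here. One simple way to picture it, assuming prompts are bucketed by a fixed-length leading substring, is sketched below. Both group_by_prefix and its prefix_len parameter are hypothetical.

```python
from collections import defaultdict


def group_by_prefix(prompts, prefix_len=16):
    """Bucket prompts by their first `prefix_len` characters.

    Returns (batches, singles): batches are lists of indices into `prompts`
    that share a prefix; singles are indices with no partner.
    """
    buckets = defaultdict(list)
    for index, prompt in enumerate(prompts):
        buckets[prompt[:prefix_len]].append(index)
    batches = [idxs for idxs in buckets.values() if len(idxs) > 1]
    singles = [idxs[0] for idxs in buckets.values() if len(idxs) == 1]
    return batches, singles


prompts = [
    "You are a helpful assistant. Summarize: A",
    "Translate to French: hello",
    "You are a helpful assistant. Summarize: B",
]
batches, singles = group_by_prefix(prompts)
print(batches, singles)
```

In this picture, each entry in batches would go through a generate_batch()-style path and each index in singles through a plain generate() call, with results reassembled in the original prompt order.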
- warm(text, position_offset=0)
Encode text and cache all its fixed-size chunks.
Returns a WarmResult with diagnostics, including alignment warnings. The result converts to int through chunks_stored, so it can also be used as a truth value: it is falsy when no chunks were stored.
Call this for your system prompt / few-shot examples / documents BEFORE calling generate() so the cache is already populated.
- Parameters:
text (str)
position_offset (int)
- Return type:
WarmResult
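The chunking and the int-like result can be sketched as follows. This is a toy model under stated assumptions: the cache is a plain dict keyed by position and token tuple, a trailing partial chunk is not stored, and the warning text is invented. WarmResultSketch and warm_sketch are hypothetical names; of the real WarmResult, only chunks_stored is documented here.

```python
from dataclasses import dataclass, field


@dataclass
class WarmResultSketch:
    """Hypothetical stand-in for WarmResult; only chunks_stored is documented."""
    chunks_stored: int
    warnings: list = field(default_factory=list)

    def __int__(self):
        return self.chunks_stored

    def __bool__(self):
        return self.chunks_stored > 0


def warm_sketch(token_ids, cache, chunk_size=128, position_offset=0):
    """Store every full chunk of token_ids in cache, keyed by (position, tokens)."""
    stored = 0
    # Assumption: only complete chunks are cached; the remainder is skipped.
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = tuple(token_ids[start:start + chunk_size])
        cache[(position_offset + start, chunk)] = "kv-placeholder"
        stored += 1
    warnings = []
    if len(token_ids) != full:
        warnings.append(f"{len(token_ids) - full} trailing tokens did not fill a chunk")
    return WarmResultSketch(stored, warnings)


cache = {}
result = warm_sketch(list(range(300)), cache)
if result:
    print(int(result), "chunks cached;", result.warnings)
```

With 300 tokens and a chunk size of 128, two full chunks are stored and 44 tokens are left over, which is the kind of situation a diagnostic warning could flag.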
- verify_correctness(max_new_tokens=32)
Quick self-test: runs greedy decoding on a synthetic prompt in both BASELINE and CHUNK_KV_REUSE modes and checks that the two outputs are identical.
Returns True if outputs match, False otherwise. Use this to validate untested model architectures before trusting cached outputs in production.