Model Compatibility

KVBoost’s KV cache stitching requires rotary positional embeddings (RoPE) and a model whose forward pass accepts explicit position_ids. Models that use ALiBi, learned absolute positional embeddings, or sliding window attention are not compatible.
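
To illustrate why rotary embeddings lend themselves to reusing cached keys at new positions, the sketch below applies a textbook half-split RoPE rotation in NumPy. It is not KVBoost's implementation: the `rope` helper, the vector size, and the positions are all hypothetical. It checks two properties: the attention score depends only on the relative offset between query and key, and a key rotated for one position can be moved to another by applying the remaining rotation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x as rotary position embedding would at integer position pos."""
    half = x.shape[-1] // 2
    # One rotation frequency per (x1[i], x2[i]) pair.
    inv_freq = base ** (-np.arange(half) / half)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (3) at two different absolute positions.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
assert np.isclose(score_a, score_b)

# A key rotated for position 2 can be shifted to position 102
# by applying the remaining rotation of 100 positions.
assert np.allclose(rope(rope(k, 2), 100), rope(k, 102))
```

ALiBi and learned absolute embeddings have no equivalent of this second property on the cached keys themselves, which is consistent with the restriction stated above.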

Supported Architectures

Architecture	Status
LlamaForCausalLM	Supported
Qwen2ForCausalLM	Supported
Qwen2_5ForCausalLM	Supported
GemmaForCausalLM	Supported
Gemma2ForCausalLM	Supported
MistralForCausalLM	Supported (full attention only)
PhiForCausalLM	Supported
Phi3ForCausalLM	Supported
StableLmForCausalLM	Supported
InternLMForCausalLM	Supported
InternLM2ForCausalLM	Supported

Unsupported Architectures

Architecture	Reason
GPT2LMHeadModel	Learned absolute positional embeddings
GPTNeoForCausalLM	Learned absolute positional embeddings
MPTForCausalLM	ALiBi positional encoding
FalconForCausalLM	ALiBi positional encoding
BloomForCausalLM	ALiBi positional encoding
MistralForCausalLM (sliding window)	Sliding window breaks KV stitching
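
The two tables can be turned into a quick pre-flight lookup. The sketch below is a hypothetical helper, not part of KVBoost: it classifies a `config.json`-style dictionary using the architecture names listed above. Treating a non-null `sliding_window` key as the marker for sliding window Mistral models follows the Hugging Face Mistral config convention; whether KVBoost inspects the same field is an assumption.

```python
SUPPORTED = {
    "LlamaForCausalLM", "Qwen2ForCausalLM", "Qwen2_5ForCausalLM",
    "GemmaForCausalLM", "Gemma2ForCausalLM", "MistralForCausalLM",
    "PhiForCausalLM", "Phi3ForCausalLM", "StableLmForCausalLM",
    "InternLMForCausalLM", "InternLM2ForCausalLM",
}

UNSUPPORTED = {
    "GPT2LMHeadModel": "learned absolute positional embeddings",
    "GPTNeoForCausalLM": "learned absolute positional embeddings",
    "MPTForCausalLM": "ALiBi positional encoding",
    "FalconForCausalLM": "ALiBi positional encoding",
    "BloomForCausalLM": "ALiBi positional encoding",
}

def classify(config: dict) -> tuple[str, str]:
    """Return (status, reason) for a config.json-style dict (hypothetical helper)."""
    archs = config.get("architectures") or []
    arch = archs[0] if archs else ""
    if arch in UNSUPPORTED:
        return "unsupported", UNSUPPORTED[arch]
    # Assumption: a non-null sliding_window marks a sliding window Mistral model.
    if arch == "MistralForCausalLM" and config.get("sliding_window") is not None:
        return "unsupported", "sliding window breaks KV stitching"
    if arch in SUPPORTED:
        return "supported", ""
    return "untested", "architecture not listed"

status, reason = classify({"architectures": ["MistralForCausalLM"], "sliding_window": 4096})
```

In practice the dictionary could come from a model's `config.json`; the names `classify`, `SUPPORTED`, and `UNSUPPORTED` are illustrative only.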

Strict Mode

By default, from_pretrained runs in strict mode: it raises ValueError on unsupported architectures and emits a warning on untested ones. Passing strict=False suppresses both checks:

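from kvboost import KVBoost  # import path assumed from the package name

# Unsupported architecture: raises ValueError (here, GPT-2)
try:
    engine = KVBoost.from_pretrained("gpt2")
except ValueError as err:
    print(f"Unsupported: {err}")

# Untested architecture: loads, but emits a warning
engine = KVBoost.from_pretrained("some/new-model")

# strict=False suppresses all checks
engine = KVBoost.from_pretrained("some/model", strict=False)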
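
The raise, warn, and skip behavior can be pictured with a small stand-in. The sketch below does not call KVBoost: `check_architecture` and its two name sets are hypothetical, and the warning category KVBoost actually emits is not stated here, so `UserWarning` (the default for `warnings.warn`) is an assumption. It also shows how the standard `warnings` module can escalate a warning to an exception, which may be useful in CI.

```python
import warnings

# Small illustrative subsets, not the full lists.
UNSUPPORTED = {"GPT2LMHeadModel", "BloomForCausalLM"}
TESTED = {"LlamaForCausalLM", "Qwen2ForCausalLM"}

def check_architecture(arch: str, strict: bool = True) -> None:
    """Hypothetical stand-in for the strict-mode checks."""
    if not strict:
        return  # strict=False: skip every check
    if arch in UNSUPPORTED:
        raise ValueError(f"{arch} is not compatible with KV cache stitching")
    if arch not in TESTED:
        # Default category is UserWarning.
        warnings.warn(f"{arch} is untested; verify outputs before relying on it")

# Escalate the untested-architecture warning to an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        check_architecture("BrandNewForCausalLM")
        escalated = False
    except UserWarning:
        escalated = True
```

Escalating warnings this way keeps the default behavior for interactive use while letting automated pipelines fail fast on untested models.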

Verifying Unknown Models

For untested architectures, run the built-in correctness check:

engine = KVBoost.from_pretrained("some/new-rope-model", strict=False)

if engine.verify_correctness():
    print("Safe to use")
else:
    print("KV stitching produces wrong outputs for this model")

verify_correctness() runs greedy decoding on a synthetic prompt in both baseline and cached modes and compares the two outputs token by token. It returns True only when they match.
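
A token-by-token comparison of this kind can be sketched in a few lines. The helper below is hypothetical and is not KVBoost's implementation: it takes two sequences of token IDs, such as those produced by greedy decoding with and without cache reuse, and reports where they first diverge. The token IDs in the example are made up.

```python
from typing import Optional, Sequence

def first_divergence(baseline: Sequence[int], cached: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(baseline, cached)):
        if a != b:
            return i
    # Same prefix but different lengths: they diverge where the shorter one ends.
    if len(baseline) != len(cached):
        return min(len(baseline), len(cached))
    return None

baseline_ids = [101, 7592, 2088, 102]  # made-up IDs from a baseline run
cached_ids = [101, 7592, 2154, 102]    # made-up IDs from a cached run

idx = first_divergence(baseline_ids, cached_ids)
if idx is None:
    print("Outputs match")
else:
    print(f"Outputs diverge at token {idx}")
```

Knowing the index of the first mismatch, rather than only a pass or fail result, can help tell an immediate positional error from drift that appears later in the generation.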