Model Compatibility

KVBoost’s KV cache stitching requires RoPE positional encoding with explicit position_ids support. Models using ALiBi, learned absolute embeddings, or sliding window attention are not compatible.
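
As background on why positions matter here, the following is a minimal plain-Python sketch of rotary position embeddings (RoPE) in general. It is not KVBoost's implementation, and the names rope_rotate and score are hypothetical. It illustrates one well-known property: once query and key are rotated by their positions, the attention score depends only on the offset between those positions. A cached key is therefore tied to the position it was rotated for, which is presumably why explicit control over position_ids is needed when cached segments are reused.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair j = i // 2 uses frequency base ** (-2j / dim) = base ** (-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.9, 0.1]

# Same offset (3) at different absolute positions: the scores match.
print(score(q, k, 5, 2), score(q, k, 105, 102))
# Different offset: the score changes.
print(score(q, k, 5, 3))
```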

Supported Architectures

| Architecture | Status |
| --- | --- |
| LlamaForCausalLM | Supported |
| Qwen2ForCausalLM | Supported |
| Qwen2_5ForCausalLM | Supported |
| GemmaForCausalLM | Supported |
| Gemma2ForCausalLM | Supported |
| MistralForCausalLM | Supported (full attention only) |
| PhiForCausalLM | Supported |
| Phi3ForCausalLM | Supported |
| StableLmForCausalLM | Supported |
| InternLMForCausalLM | Supported |
| InternLM2ForCausalLM | Supported |

Unsupported Architectures

| Architecture | Reason |
| --- | --- |
| GPT2LMHeadModel | Learned absolute positional embeddings |
| GPTNeoForCausalLM | Learned absolute positional embeddings |
| MPTForCausalLM | ALiBi positional encoding |
| FalconForCausalLM | ALiBi positional encoding |
| BloomForCausalLM | ALiBi positional encoding |
| MistralForCausalLM (sliding window) | Sliding window breaks KV stitching |
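
The two lists above can be read as a simple lookup. The sketch below is illustrative only: classify_architecture is a hypothetical helper, not part of KVBoost, and the sliding_window argument is an assumption modeled on the field of the same name in Hugging Face model configs.

```python
# Architecture names copied from the tables above.
SUPPORTED = {
    "LlamaForCausalLM", "Qwen2ForCausalLM", "Qwen2_5ForCausalLM",
    "GemmaForCausalLM", "Gemma2ForCausalLM", "MistralForCausalLM",
    "PhiForCausalLM", "Phi3ForCausalLM", "StableLmForCausalLM",
    "InternLMForCausalLM", "InternLM2ForCausalLM",
}
UNSUPPORTED = {
    "GPT2LMHeadModel": "Learned absolute positional embeddings",
    "GPTNeoForCausalLM": "Learned absolute positional embeddings",
    "MPTForCausalLM": "ALiBi positional encoding",
    "FalconForCausalLM": "ALiBi positional encoding",
    "BloomForCausalLM": "ALiBi positional encoding",
}

def classify_architecture(name, sliding_window=None):
    """Return (status, reason) for an architecture class name.

    Hypothetical helper: status is "supported", "unsupported", or "untested".
    """
    # Mistral is listed as supported with full attention only.
    if name == "MistralForCausalLM" and sliding_window:
        return ("unsupported", "Sliding window breaks KV stitching")
    if name in UNSUPPORTED:
        return ("unsupported", UNSUPPORTED[name])
    if name in SUPPORTED:
        return ("supported", None)
    return ("untested", None)

print(classify_architecture("LlamaForCausalLM"))
print(classify_architecture("MistralForCausalLM", sliding_window=4096))
print(classify_architecture("SomeNewForCausalLM"))
```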

Strict Mode

By default, from_pretrained raises on unsupported architectures and warns on untested ones:

# Raises ValueError for GPT-2
engine = KVBoost.from_pretrained("gpt2")

# Warns for unknown architectures
engine = KVBoost.from_pretrained("some/new-model")

# Suppress all checks
engine = KVBoost.from_pretrained("some/model", strict=False)
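
The raise, warn, and skip behavior described above can be mimicked with a small stand-in, which may help when deciding how to wrap the call in your own code. This is a sketch of the documented semantics, not KVBoost's source: check_compatibility is a hypothetical function, and the two lists are deliberately abbreviated.

```python
import warnings

# Abbreviated stand-ins for the full architecture lists.
SUPPORTED = {"LlamaForCausalLM"}
UNSUPPORTED = {"GPT2LMHeadModel": "Learned absolute positional embeddings"}

def check_compatibility(architecture, strict=True):
    """Hypothetical check: raise on unsupported, warn on untested."""
    if not strict:
        return  # strict=False suppresses all checks
    if architecture in UNSUPPORTED:
        reason = UNSUPPORTED[architecture]
        raise ValueError(f"{architecture} is not supported: {reason}")
    if architecture not in SUPPORTED:
        warnings.warn(f"{architecture} is untested", UserWarning)

# Unsupported architectures surface as ValueError by default.
try:
    check_compatibility("GPT2LMHeadModel")
except ValueError as err:
    print(f"refused: {err}")

# Untested architectures only warn; the warning can be captured and inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_compatibility("SomeNewForCausalLM")
print([str(w.message) for w in caught])
```

Catching ValueError separately from warnings keeps a hard incompatibility distinct from an architecture that simply has not been tested yet.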

Verifying Unknown Models

For untested architectures, run the built-in correctness check:

engine = KVBoost.from_pretrained("some/new-rope-model", strict=False)

if engine.verify_correctness():
    print("Safe to use")
else:
    print("KV stitching produces wrong outputs for this model")

verify_correctness() runs greedy decoding on a synthetic prompt in both baseline and cached modes and compares the two outputs token by token.
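
The comparison step can be pictured as follows. This is a minimal sketch of a token-by-token check, assuming both runs yield lists of integer token IDs. first_divergence is a hypothetical helper and does not reflect how verify_correctness() is actually implemented.

```python
def first_divergence(baseline_ids, cached_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline_ids, cached_ids)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge where it ends.
    if len(baseline_ids) != len(cached_ids):
        return min(len(baseline_ids), len(cached_ids))
    return None

baseline = [101, 7, 42, 9, 2]
cached = [101, 7, 42, 13, 2]

idx = first_divergence(baseline, cached)
if idx is None:
    print("outputs match")
else:
    print(f"outputs diverge at token {idx}")
```

Reporting the index of the first mismatch, rather than only a boolean, makes it easier to tell whether a failure appears immediately or only after many generated tokens.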