Google's Gemma 4 open models use speculative decoding to get up to 3x faster

deepseek / deepseek-v4-flash

2026-05-06 17:44


Google released experimental Gemma 4 models with Multi-Token Prediction (MTP) drafters that use speculative decoding for faster text generation.
MTP lets a lightweight drafter model (74M parameters) predict several tokens ahead, which the main model then verifies in parallel, speeding up generation by up to 3x.
The technique helps most on consumer hardware with slower memory, because it uses compute time that would otherwise sit idle to generate speculative tokens.
Google claims zero quality degradation since the main model verifies all draft tokens.
Drafter models are available under the Apache 2.0 license and are supported by the MLX, vLLM, SGLang, and Ollama frameworks.
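
The draft-and-verify loop described above can be illustrated with a minimal Python sketch. This is not Google's implementation: `main_model` and `draft_model` are hypothetical toy functions standing in for a Gemma 4 model and its drafter, and the sketch assumes greedy decoding, where a draft token is accepted only if it matches the main model's own choice (engines that sample typically use a probabilistic acceptance rule instead). It shows why the output is identical to plain decoding with the main model, while needing far fewer verification rounds.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one main-model step per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    main: NextToken, drafter: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (sequence, number of verification rounds)."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    rounds = 0
    while len(seq) < target_len:
        # 1. The cheap drafter proposes up to k tokens, one after another.
        draft: List[int] = []
        for _ in range(min(k, target_len - len(seq))):
            draft.append(drafter(seq + draft))

        # 2. The main model checks every draft position. A real engine does
        #    this in one batched forward pass; the loop below only mimics it.
        rounds += 1
        accepted: List[int] = []
        for i, tok in enumerate(draft):
            expected = main(seq + draft[:i])
            if expected == tok:
                accepted.append(tok)
            else:
                # First mismatch: keep the main model's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the main model's prediction for
            # the next position comes for free (if there is room left).
            if len(seq) + len(accepted) < target_len:
                accepted.append(main(seq + accepted))
        seq.extend(accepted)
    return seq, rounds


# Hypothetical toy models: the drafter agrees with the main model except
# when the context length is a multiple of 5.
def main_model(seq: List[int]) -> int:
    return (3 * seq[-1] + len(seq)) % 17


def draft_model(seq: List[int]) -> int:
    guess = main_model(seq)
    return guess if len(seq) % 5 else (guess + 1) % 17


baseline = greedy_decode(main_model, [1, 2], 20)
spec, rounds = speculative_decode(main_model, draft_model, [1, 2], 20, k=4)
print("identical output:", spec == baseline)
print("main-model rounds:", rounds, "vs baseline steps:", 20)
```

Every token appended to the sequence is one the main model itself would have produced, which is the basis of the no-quality-loss claim. The speedup depends on how often the drafter guesses correctly and on how cheap it is relative to the main model.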