• Google released experimental Gemma 4 models with Multi-Token Prediction (MTP) drafters that use speculative decoding for faster text generation.
  • MTP lets a lightweight drafter model (74M parameters) predict several tokens ahead; the main model then verifies them in parallel, speeding up generation by up to 3x.
  • The technique helps most on consumer hardware with slower memory, where compute would otherwise sit idle waiting on memory; that idle time is used to generate speculative tokens.
  • Google claims zero quality degradation since the main model verifies all draft tokens.
  • The drafter models are available under the Apache 2.0 license and are supported in MLX, vLLM, SGLang, and Ollama.
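
The draft-and-verify loop described in these points can be illustrated with a minimal sketch. It is not Google's MTP implementation: `target_model` and `draft_model` are hypothetical toy functions over integer tokens, and verification uses simple greedy matching rather than the probabilistic acceptance rule that production systems typically use. It only shows why the output matches what the main model would produce alone, while the main model is invoked fewer times.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large main model."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_model(prefix: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except at every 4th position."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 7


def greedy_decode(target: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], max_new_tokens: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The drafter proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every proposed position. A real system does this
        #    in a single batched forward pass; this loop only simulates that.
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # All k drafts matched, so the same pass also yields one extra token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_model, prompt, 20)
    spec, passes = speculative_decode(target_model, draft_model, prompt, 20)
    print("identical output:", spec == baseline)
    print("verification passes:", passes, "vs 20 baseline target calls")
```

Every token kept in `accepted` is the one the target itself would have chosen, which is the basis of the no-quality-loss claim; the speedup depends on how often the drafter's guesses match.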