Google Just Made Gemma 4 Up to 3x Faster — Without Touching the Model
Google's new Multi-Token Prediction (MTP) drafters pair a lightweight model with Gemma 4 to predict multiple tokens ahead and let the big model verify them all in one pass, delivering up to 3x speedup with zero quality loss.

Google just shipped Multi-Token Prediction (MTP) drafters for Gemma 4 — delivering up to 3x faster inference without changing the model’s output quality or reasoning ability.
The trick isn’t a better model. It’s a smarter way to generate tokens: a small, fast drafter does the guessing while the big model does the verifying — in parallel.
What Was Announced
Google released MTP drafters for the entire Gemma 4 family — open source, Apache 2.0, available today on Hugging Face and Kaggle. They work with the tools developers already use: Hugging Face Transformers, MLX, vLLM, SGLang, and Ollama.
The headline number: up to 3x tokens-per-second speedup on benchmarked hardware, with no degradation in output quality.
The Problem: One Token at a Time Is Wasteful
Standard large language models generate text one token at a time — autoregressively. Every single token requires moving billions of model parameters from memory (VRAM) to the compute units. That memory transfer is the real bottleneck, not the actual computation.
The result: your GPU is significantly under-utilised, with its compute units idle for much of each decoding step while they wait for weights to arrive. This is especially painful on consumer hardware and edge devices where memory bandwidth is limited.
Worse, the model spends the same amount of effort predicting an obvious continuation (“Actions speak louder than… words”) as it does solving a hard logic puzzle. There’s no differentiation — everything gets the same expensive, sequential treatment.
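To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes decoding is purely memory-bound, so every generated token has to stream all the weights once. The 16-bit weight size and the 1000 GB/s bandwidth figure are illustrative assumptions, not numbers from Google's announcement.

```python
def max_tokens_per_second(params_billions: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token streams all weights once.

    Ignores KV-cache reads, compute time, and MoE sparsity, so real
    throughput will be lower than this ceiling.
    """
    weight_gb = params_billions * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# Hypothetical setup: 31B dense model, 2 bytes per weight, 1000 GB/s memory.
ceiling = max_tokens_per_second(31, 2.0, 1000.0)
print(f"~{ceiling:.1f} tokens/s ceiling for one-token-at-a-time decoding")
```

Under these assumptions the ceiling is about 16 tokens per second no matter how much raw compute the chip has. That spare compute is what lets several tokens be checked in a single pass over the weights.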

How MTP Drafters Fix This
The key idea is speculative decoding — originally introduced by Google researchers in 2022, now productionised here for Gemma 4.
Instead of one model doing everything, you pair two:
- A heavy target model (e.g., Gemma 4 31B): accurate, slow, expensive
- A lightweight drafter (the MTP model): fast, small, and cheap enough to run ahead of the target
Here’s what happens on each generation step:
- The drafter rapidly predicts several tokens ahead simultaneously — using the same KV cache as the target model, so it doesn’t recalculate context
- The target model verifies all those draft tokens in a single forward pass
- If the target agrees with the drafts, it accepts the entire sequence and also generates one extra token of its own
- Result: you get multiple tokens in the time it used to take to generate just one
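The steps above can be sketched as a toy greedy speculative-decoding loop. This is a minimal illustration, not Gemma 4's implementation. `target_next` and `drafter_next` are hypothetical stand-ins for the two models, tokens are plain integers, and the "single forward pass" is simulated with a loop that is counted as one pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next-token choice."""
    return (ctx[-1] * 3 + len(ctx)) % 7


def drafter_next(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return (guess + 1) % 7 if len(ctx) % 4 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(prompt: List[int], n_new: int, k: int,
                       target: NextToken, drafter: NextToken) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The drafter proposes k tokens, each conditioned on the previous ones.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2. The target scores all k + 1 positions. A real model does this in
        #    one batched forward pass; here it is a loop counted as one pass.
        target_passes += 1
        verified = [target(tokens + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix where drafter and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4. Keep the accepted drafts plus the target's own token: a correction
        #    at the first mismatch, or a bonus token if every draft matched.
        tokens.extend(draft[:n_ok])
        tokens.append(verified[n_ok])
    return tokens[:len(prompt) + n_new], target_passes


spec_tokens, passes = speculative_decode([1, 2], 20, 4, target_next, drafter_next)
baseline = greedy_decode([1, 2], 20, target_next)
print("identical output:", spec_tokens == baseline)
print("target passes:", passes, "vs", 20, "for plain decoding")
```

Because every kept token is either confirmed or produced by the target, the output matches plain greedy decoding exactly. Only the number of expensive target passes changes.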
The drafter doesn’t need to be perfect — it just needs to be right often enough. When it’s wrong, the target model corrects from that point and the process continues. The quality guarantee comes from the fact that the target model always has the final say.
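How often is "often enough"? A common simplification from the speculative decoding literature treats each draft token as accepted independently with probability `alpha`. With `k` draft tokens per step, the expected number of tokens per target pass is then (1 - alpha^(k+1)) / (1 - alpha). The sketch below evaluates that formula. The acceptance rates are illustrative assumptions, not measured Gemma 4 figures, and the result ignores the drafter's own cost.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafts per step.

    Assumes each draft token is accepted independently with probability
    alpha (a simplifying assumption, not a property of any real drafter).
    """
    if alpha == 1.0:
        return float(k + 1)  # all k drafts accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_pass(alpha, k=4)
    print(f"alpha={alpha}: {tokens:.2f} tokens per target pass")
```

With four drafts per step, acceptance rates of 0.5, 0.8, and 0.9 give roughly 1.9, 3.4, and 4.1 tokens per pass. A drafter that is wrong half the time still nearly doubles the tokens emitted per expensive pass.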
What This Means for Developers
The speedup isn’t just a benchmark number — it changes what’s actually practical to build:
Local development on consumer hardware: The 26B MoE and 31B Dense models now run at usable speeds on personal computers and consumer GPUs. Previously, these were near-impractical for local dev loops.
Real-time and agentic apps: Near real-time chat, voice applications, and multi-step agentic workflows all benefit directly — every millisecond of latency reduction compounds across long interactions.
Edge and mobile: The E2B and E4B models for on-device use get a meaningful boost on Android and iOS. Google has also published the AI Edge Gallery app for both platforms to try this directly.
Batch size matters: On Apple Silicon with the 26B MoE model, increasing batch size from 1 to 4–8 unlocks a ~2.2x speedup. Similar gains appear on the Nvidia A100. Worth tuning for your specific deployment target.
The Engineering Detail Worth Knowing
The MTP drafters share the target model’s KV cache, meaning they reuse the attention computations the big model has already done rather than starting from scratch. This is what keeps drafting cheap. Output quality is preserved separately, because the target model still verifies every draft token.
For the smaller E2B and E4B edge models, where the final logit calculation becomes a bottleneck, Google added an efficient clustering technique in the embedder to squeeze out further speed. The architecture is genuinely thoughtful rather than a bolt-on.
→ Read the full Google blog post
Over to you: Are you running Gemma 4 locally or in production? A 3x inference speedup changes the economics of self-hosted AI significantly — does this move open models closer to being viable for your real-time use cases?

About the Author
Ajay Walia
AI {IT Architect} focusing on local-first multi-agent AI engineering, zero-data-egress systems. Ideator, Creator and Executor on Curious Bit.