Google Just Made Gemma 4 Up to 3x Faster — Without Touching the Model

Google just shipped Multi-Token Prediction (MTP) drafters for Gemma 4 — delivering up to 3x faster inference without changing the model’s output quality or reasoning ability.

The trick isn’t a better model. It’s a smarter way to generate tokens: a small, fast drafter does the guessing, and the big model verifies all of those guesses at once in a single forward pass.

What Was Announced

Google released MTP drafters for the entire Gemma 4 family — open source, Apache 2.0, available today on Hugging Face and Kaggle. They work with the tools developers already use: Hugging Face Transformers, MLX, vLLM, SGLang, and Ollama.

The headline number: up to 3x tokens-per-second speedup on benchmarked hardware, with no degradation in output quality.

The Problem: One Token at a Time Is Wasteful

Standard large language models generate text one token at a time — autoregressively. Every single token requires moving billions of model parameters from memory (VRAM) to the compute units. That memory transfer is the real bottleneck, not the actual computation.

The result: your GPU is significantly under-utilised, sitting idle most of the time, waiting for data to arrive. This is especially painful on consumer hardware and edge devices where memory bandwidth is limited.
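
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes that decoding at batch size 1 has to stream every weight from memory once per token, and it ignores KV cache reads and compute time. The hardware figure (1000 GB/s) and the 16-bit weight size are illustrative assumptions, not numbers from Google's announcement.

```python
def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token streams all weights once.

    Simplifying assumption: memory traffic is dominated by the weights,
    so time per token is at least model size divided by bandwidth.
    """
    model_gb = params_billion * bytes_per_param  # billions of params * bytes each = GB
    return bandwidth_gb_s / model_gb


# Hypothetical setup: a 31B dense model in 16-bit weights (2 bytes per
# parameter) on a device with 1000 GB/s of memory bandwidth.
dense_31b = max_tokens_per_second(31, 2.0, 1000.0)
print(f"{dense_31b:.1f} tokens/s upper bound")  # prints "16.1 tokens/s upper bound"
```

Under these assumptions the ceiling is set by memory traffic alone, which is why checking several tokens per weight read is attractive: the same transfer is amortised over more output.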

Worse, the model spends the same amount of effort predicting an obvious continuation (“Actions speak louder than… words”) as it does solving a hard logic puzzle. There’s no differentiation — everything gets the same expensive, sequential treatment.

How MTP Drafters Fix This

The key idea is speculative decoding — originally introduced by Google researchers in 2022, now productionised here for Gemma 4.

Instead of one model doing everything, you pair two:

A heavy target model (e.g., Gemma 4 31B): accurate, slow, expensive
A lightweight drafter (the MTP model): fast, small, proposes tokens for the target to check

Here’s what happens on each generation step:

The drafter rapidly predicts several tokens ahead simultaneously — using the same KV cache as the target model, so it doesn’t recalculate context
The target model verifies all those draft tokens in a single forward pass
If the target agrees with the drafts, it accepts the entire sequence and also generates one extra token of its own in the same pass
Result: you get multiple tokens in the time it used to take to generate just one
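
The steps above can be sketched as a toy loop. This is a minimal illustration using greedy (exact-match) verification, not Google's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, the target's checks are written as a Python loop where a real system would score every position in one batched forward pass, and the shared KV cache is not modelled. The original speculative decoding work uses a probabilistic accept/reject rule so that sampled output also keeps the target's distribution.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of target passes used."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens ahead.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. The target checks every drafted position (counted as one pass).
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # target's correction replaces the bad draft
                break
            accepted.append(draft[i])
        else:
            # 3. Every draft matched: the target adds one extra token of its own.
            accepted.append(target(out + draft))
        out.extend(accepted)
    return out[:goal], target_passes


spec, passes = speculative_decode(target_next, draft_next, [2], n_new=20)
print(spec == greedy_decode(target_next, [2], 20))  # prints True
```

Because a token is only kept when it equals what the target would have produced at that position, the output matches plain greedy decoding with the target, while `passes` comes out lower than the 20 tokens generated.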

The drafter doesn’t need to be perfect — it just needs to be right often enough. When it’s wrong, the target model corrects from that point and the process continues. The quality guarantee comes from the fact that the target model always has the final say.
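
How often is "often enough"? A rough way to reason about it, assuming each draft token is accepted independently with the same probability `alpha` (a simplification, and the values below are hypothetical rather than measured Gemma 4 acceptance rates): with `k` drafted tokens, the target pass yields the accepted prefix plus one token of its own, so the expected tokens per pass is 1 + alpha + alpha^2 + ... + alpha^k.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass.

    alpha: assumed per-token acceptance probability (independent draws).
    k:     number of tokens the drafter proposes per step.
    The i-th draft survives only if all drafts before it did (alpha**i),
    and the target always contributes one token (the alpha**0 term).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(alpha ** i for i in range(k + 1))


# Hypothetical acceptance rates, four drafted tokens per step.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

This counts tokens per target pass, not wall-clock speedup: the drafter's own running time is left out, so real gains would be somewhat lower than this figure suggests.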

What This Means for Developers

The speedup isn’t just a benchmark number — it changes what’s actually practical to build:

Local development on consumer hardware: The 26B MoE and 31B Dense models now run at usable speeds on personal computers and consumer GPUs. Previously, these were near-impractical for local dev loops.

Real-time and agentic apps: Near real-time chat, voice applications, and multi-step agentic workflows all benefit directly — every millisecond of latency reduction compounds across long interactions.

Edge and mobile: The E2B and E4B models for on-device use get a meaningful boost on Android and iOS. Google has also published the AI Edge Gallery app for both platforms to try this directly.

Batch size matters: On Apple Silicon with the 26B MoE model, increasing batch size from 1 to 4–8 unlocks a speedup of roughly 2.2x. Similar gains appear on the Nvidia A100. Batch size is worth tuning for your specific deployment target.

The Engineering Detail Worth Knowing

The MTP drafters share the target model’s KV cache, meaning they reuse the attention computations the big model has already done rather than starting from scratch. That reuse is what keeps drafting cheap; the absence of quality loss comes from the target model verifying every drafted token.

For the smaller E2B and E4B edge models, where the final logit calculation becomes a bottleneck, Google added an efficient clustering technique in the embedder to squeeze out further speed. The architecture is genuinely thoughtful rather than a bolt-on.

→ Read the full Google blog post

Over to you: Are you running Gemma 4 locally or in production? A 3x inference speedup changes the economics of self-hosted AI significantly — does this move open models closer to being viable for your real-time use cases?