<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link><description>Digital workplace, artificial intelligence, cloud, security, automation, and enterprise technology notes by Ajay Walia.</description><language>en-au</language><managingEditor>Ajay Walia</managingEditor><webMaster>Ajay Walia</webMaster><copyright>Copyright 2026 Ajay Walia</copyright><lastBuildDate>Sun, 21 Jun 2026 05:46:10 +0000</lastBuildDate><atom:link href="https://curiousbit.netlify.app/tags/developer-tools/index.xml" rel="self" type="application/rss+xml"/><image><url>https://curiousbit.netlify.app/images/og-default.png</url><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link></image><item><title>Google Just Made Gemma 4 Up to 3x Faster — Without Touching the Model</title><link>https://curiousbit.netlify.app/field-notes/google-gemma4-multi-token-prediction/</link><guid isPermaLink="true">https://curiousbit.netlify.app/field-notes/google-gemma4-multi-token-prediction/</guid><pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;&lt;strong&gt;Google just shipped Multi-Token Prediction (MTP) drafters for Gemma 4, delivering up to 3x faster inference without changing the model&amp;rsquo;s output quality or reasoning ability.&lt;/strong&gt;&lt;/p&gt;</description><content:encoded><![CDATA[<img src="https://curiousbit.netlify.app/images/field-notes/gemma4-mtp-banner.jpg" alt="Developer-Tools" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p><strong>Google just shipped Multi-Token Prediction (MTP) drafters for Gemma 4, delivering up to 3x faster inference without changing the model&rsquo;s output quality or reasoning 
ability.</strong></p><p><strong>The trick isn&rsquo;t a better model. It&rsquo;s a smarter way to generate tokens: a small, fast drafter does the guessing, and the big model verifies those guesses in parallel.</strong></p><hr><h2 id="what-was-announced">What Was Announced</h2><p>Google released <strong>MTP drafters for the entire Gemma 4 family</strong>: open source under Apache 2.0, and available today on Hugging Face and Kaggle. They work with the tools developers already use: Hugging Face Transformers, MLX, vLLM, SGLang, and Ollama.</p><p>The headline number: <strong>up to 3x tokens-per-second speedup</strong> on benchmarked hardware, with no degradation in output quality.</p><hr><h2 id="the-problem-one-token-at-a-time-is-wasteful">The Problem: One Token at a Time Is Wasteful</h2><p>Standard large language models generate text autoregressively, one token at a time. Every single token requires moving billions of model parameters from memory (VRAM) to the compute units. That memory transfer, not the computation itself, is the real bottleneck.</p><p>The result: your GPU is significantly under-utilised, with its compute units idle much of the time while they wait for data to arrive. This is especially painful on consumer hardware and edge devices, where memory bandwidth is limited.</p><p>Worse, the model spends the same amount of effort predicting an obvious continuation (&ldquo;Actions speak louder than&hellip; <strong>words</strong>&rdquo;) as it does solving a hard logic puzzle. 
There&rsquo;s no differentiation: every token gets the same expensive, sequential treatment.</p><p><img src="/images/field-notes/gemma4-mtp-banner.jpg" alt="Gemma 4 MTP drafter: tokens at speed"/></p><hr><h2 id="how-mtp-drafters-fix-this">How MTP Drafters Fix This</h2><p>The key idea is <strong>speculative decoding</strong>, originally introduced by Google researchers in 2022 and now productionised here for Gemma 4.</p><p>Instead of one model doing everything, you pair two:</p><ul><li>A <strong>heavy target model</strong> (e.g., Gemma 4 31B): accurate, slow, expensive</li><li>A <strong>lightweight drafter</strong> (the MTP model): fast, small, runs alongside the target</li></ul><p>Here&rsquo;s what happens on each generation step:</p><ol><li>The drafter rapidly predicts several tokens ahead, reusing the target model&rsquo;s KV cache so it doesn&rsquo;t recalculate context</li><li>The target model verifies all those draft tokens in a <strong>single forward pass</strong></li><li>If the target agrees with the drafts, it accepts the entire sequence <strong>plus</strong> one extra token of its own</li><li>Result: you get multiple tokens in roughly the time it used to take to generate just one</li></ol><p>The drafter doesn&rsquo;t need to be perfect; it just needs to be right often enough. When it&rsquo;s wrong, the target model discards the drafts from the first mismatch onward, supplies its own token there, and the process continues. The quality guarantee comes from the fact that the <strong>target model always has the final say</strong>.</p><hr><h2 id="what-this-means-for-developers">What This Means for Developers</h2><p>The speedup isn&rsquo;t just a benchmark number. It changes what&rsquo;s actually practical to build:</p><p><strong>Local development on consumer hardware</strong>: The 26B MoE and 31B Dense models now run at usable speeds on personal computers and consumer GPUs. 
Previously, these were close to impractical for local dev loops.</p><p><strong>Real-time and agentic apps</strong>: Near real-time chat, voice applications, and multi-step agentic workflows all benefit directly, because every millisecond of latency saved compounds across long interactions.</p><p><strong>Edge and mobile</strong>: The E2B and E4B models for on-device use get a meaningful boost on Android and iOS. Google has also published the AI Edge Gallery app for both platforms so you can try this directly.</p><p><strong>Batch size matters</strong>: On Apple Silicon with the 26B MoE model, increasing batch size from 1 to 4–8 unlocks a ~2.2x speedup. Similar gains appear on Nvidia A100. It&rsquo;s worth tuning for your specific deployment target.</p><hr><h2 id="the-engineering-detail-worth-knowing">The Engineering Detail Worth Knowing</h2><p>The MTP drafters share the target model&rsquo;s <strong>KV cache</strong>, meaning they reuse the attention computations the big model has already done rather than starting from scratch. This is what keeps drafting cheap; the quality guarantee comes separately, from the target model&rsquo;s verification pass.</p><p>For the smaller E2B and E4B edge models, where the final logit calculation becomes a bottleneck, Google added an efficient clustering technique in the embedder to squeeze out further speed. The architecture is genuinely thoughtful rather than a bolt-on.</p><hr><p><strong>→ <a href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/">Read the full Google blog post</a></strong></p><hr><blockquote><p><strong>Over to you:</strong> Are you running Gemma 4 locally or in production? A 3x inference speedup changes the economics of self-hosted AI significantly. Does this move open models closer to being viable for your real-time use cases?</p></blockquote>
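<hr><h2 id="a-toy-sketch-of-the-draft-and-verify-loop">A Toy Sketch of the Draft-and-Verify Loop</h2><p>For readers who want to see the draft-and-verify steps from the section on how MTP drafters work as code, here is a minimal Python sketch. It is a toy under stated assumptions, not Google&rsquo;s implementation: <code>toy_target</code>, <code>toy_drafter</code>, and <code>speculative_step</code> are hypothetical names invented for illustration, the &ldquo;models&rsquo; are plain functions, decoding is greedy, and verification is written as a loop where a real system would use one batched forward pass.</p>
<pre><code class="language-python">from typing import Callable, List

# A "model" here is any function that maps a token prefix to the next token.
NextToken = Callable[[List[str]], str]

SENTENCE = "actions speak louder than words".split()


def toy_target(prefix: List[str]) -> str:
    """Hypothetical stand-in for the big model: recites SENTENCE, then EOS."""
    position = len(prefix)
    return SENTENCE[position] if position in range(len(SENTENCE)) else "EOS"


def toy_drafter(prefix: List[str]) -> str:
    """Hypothetical stand-in for the drafter: usually right, wrong on 'than'."""
    guess = toy_target(prefix)
    return "then" if guess == "than" else guess


def speculative_step(target: NextToken, drafter: NextToken,
                     prefix: List[str], k: int) -> List[str]:
    """Return the tokens accepted in one greedy draft-and-verify round."""
    # Step 1: the drafter proposes k tokens ahead.
    drafts: List[str] = []
    for _ in range(k):
        drafts.append(drafter(prefix + drafts))

    # Step 2: the target checks each draft position in order.
    accepted: List[str] = []
    for draft in drafts:
        expected = target(prefix + accepted)
        if draft != expected:
            # First mismatch: keep the target's token, drop later drafts.
            accepted.append(expected)
            return accepted
        accepted.append(draft)

    # Step 3: every draft matched, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


print(speculative_step(toy_target, toy_drafter, [], 3))
print(speculative_step(toy_target, toy_drafter, ["actions"], 3))
</code></pre>
<p>The first call accepts all three drafts plus one bonus token from the target, so a single round yields four tokens. In the second call the drafter&rsquo;s third guess is wrong, so the round yields three tokens and the last one is the target&rsquo;s own choice. Either way the output matches what the target would have produced alone, which is the property the quality guarantee rests on.</p>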
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/field-notes/gemma4-mtp-banner.jpg" medium="image"><media:title type="plain">Developer-Tools</media:title></media:content><category>artificial-intelligence</category><category>developer-tools</category><category>Field Notes</category></item></channel></rss>