<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link><description>Digital workplace, artificial intelligence, cloud, security, automation, and enterprise technology notes by Ajay Walia.</description><language>en-au</language><managingEditor>Ajay Walia</managingEditor><webMaster>Ajay Walia</webMaster><copyright>Copyright 2026 Ajay Walia</copyright><lastBuildDate>Sun, 21 Jun 2026 05:46:10 +0000</lastBuildDate><atom:link href="https://curiousbit.netlify.app/posts/index.xml" rel="self" type="application/rss+xml"/><image><url>https://curiousbit.netlify.app/images/og-default.png</url><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link></image><item><title>When AI Agents Go Wrong — and How to Engineer Ones That Don't</title><link>https://curiousbit.netlify.app/when-ai-agents-go-wrong/</link><guid isPermaLink="true">https://curiousbit.netlify.app/when-ai-agents-go-wrong/</guid><pubDate>Fri, 19 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
.agw-fig { margin: 2.2rem 0 1rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; }
.agw-fig .agw-wrap { overflow-x: auto; }
.agw-fig svg { display: block; width: 100%; height: auto; min-width: 700px; }
.agw-cap { font-size: .9rem; opacity: .72; margin: .5rem 0 2.4rem; line-height: 1.6; font-style: italic; }
@media (prefers-reduced-motion: reduce) { .agw-fig .agw-anim { display: none; } }
&lt;/style&gt;
&lt;p&gt;Most of the AI conversation right now is about capability — what the next model can do. This project made me sit with the opposite question: what happens when these systems are &lt;em&gt;trusted&lt;/em&gt;, handed real decisions, and then get it wrong? That is the uncomfortable, less glamorous half of building with AI, and it is exactly where &amp;ldquo;responsible AI&amp;rdquo; stops being a slogan and starts being engineering.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/agents-go-wrong/hero.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
.agw-fig { margin: 2.2rem 0 1rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; }
.agw-fig .agw-wrap { overflow-x: auto; }
.agw-fig svg { display: block; width: 100%; height: auto; min-width: 700px; }
.agw-cap { font-size: .9rem; opacity: .72; margin: .5rem 0 2.4rem; line-height: 1.6; font-style: italic; }
@media (prefers-reduced-motion: reduce) { .agw-fig .agw-anim { display: none; } }</style><p>Most of the AI conversation right now is about capability — what the next model can do. This project made me sit with the opposite question: what happens when these systems are <em>trusted</em>, handed real decisions, and then get it wrong? That is the uncomfortable, less glamorous half of building with AI, and it is exactly where &ldquo;responsible AI&rdquo; stops being a slogan and starts being engineering.</p><p>The exercise had two halves. First, take a real-world AI failure apart and explain <em>why</em> it failed — not just that it did. Second, flip from critic to designer: pick a domain I know, imagine an AI agent operating in it, and design the guardrails that would keep it safe. Below is the thinking behind both, plus the two case studies and two domains I worked through.</p><h2 id="what-this-exercise-is-actually-teaching">What this exercise is actually teaching</h2><p>Strip away the assignment framing and there are four skills underneath it:</p><ul><li><strong>Explain how and why AI systems fail</strong>, using evidence rather than vibes. &ldquo;It was biased&rdquo; is a conclusion, not an analysis. The interesting part is the <em>mechanism</em>.</li><li><strong>Connect failures to ethics</strong> — fairness, accountability, transparency, safety. A technical bug becomes an ethical problem the moment it touches a real person.</li><li><strong>Propose realistic safeguards</strong>, not &ldquo;be careful&rdquo; platitudes. Audits, human review gates, logging, escalation paths — things you could actually ship.</li><li><strong>Balance autonomy against control.</strong> An agent that asks permission for everything is useless; one that asks for nothing is dangerous. Good design is about putting the human in the loop at the <em>right</em> moments.</li></ul><p>That last point is the heart of it. 
Every safeguard is really a decision about where autonomy ends and oversight begins.</p><div class="agw-fig"><div class="agw-wrap"><svg viewBox="0 0 1200 420" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="A spectrum from an over-cautious agent that asks permission for everything to a reckless agent that asks for nothing, with the safe design zone in the middle."><defs><linearGradient id="agwDial" x1="0" x2="1"><stop offset="0" stop-color="#60a5fa"/><stop offset="0.5" stop-color="#34d399"/><stop offset="1" stop-color="#f87171"/></linearGradient></defs><rect width="1200" height="420" fill="#0a1424"/><text x="80" y="64" font-family="'Space Grotesk','Inter',sans-serif" font-size="15" fill="#f59e0b" letter-spacing="3" font-weight="700">THE CORE TRADE-OFF</text><text x="80" y="104" font-family="'Space Grotesk','Inter',sans-serif" font-size="34" fill="#ffffff" font-weight="700" letter-spacing="-.5">Where autonomy ends, oversight begins</text><rect x="80" y="190" width="1040" height="16" rx="8" fill="url(#agwDial)"/><rect x="470" y="178" width="260" height="40" rx="20" fill="none" stroke="#34d399" stroke-width="2"/><text x="600" y="203" text-anchor="middle" font-family="'Inter',sans-serif" font-size="15" fill="#34d399" font-weight="700">human-in-the-loop zone</text><g class="agw-anim"><polygon points="600,150 590,176 610,176" fill="#ffffff"><animateTransform attributeName="transform" type="translate" values="-40 0; 40 0; -40 0" dur="6s" repeatCount="indefinite" calcMode="spline" keySplines="0.4 0 0.6 1; 0.4 0 0.6 1"/></polygon></g><text x="80" y="300" font-family="'Space Grotesk','Inter',sans-serif" font-size="20" fill="#60a5fa" font-weight="700">Asks for everything</text><text x="80" y="328" font-family="'Inter',sans-serif" font-size="15" fill="#7e95b5">Permission for every step.</text><text x="80" y="350" font-family="'Inter',sans-serif" font-size="15" fill="#7e95b5">Safe, but useless — no one</text><text x="80" y="372" font-family="'Inter',sans-serif" 
font-size="15" fill="#7e95b5">would actually use it.</text><text x="600" y="300" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="20" fill="#34d399" font-weight="700">Gated at the right moments</text><text x="600" y="328" text-anchor="middle" font-family="'Inter',sans-serif" font-size="15" fill="#9fb4cf">Autonomous on low-stakes work;</text><text x="600" y="350" text-anchor="middle" font-family="'Inter',sans-serif" font-size="15" fill="#9fb4cf">a human approves the decisions</text><text x="600" y="372" text-anchor="middle" font-family="'Inter',sans-serif" font-size="15" fill="#9fb4cf">that can actually hurt someone.</text><text x="1120" y="300" text-anchor="end" font-family="'Space Grotesk','Inter',sans-serif" font-size="20" fill="#f87171" font-weight="700">Asks for nothing</text><text x="1120" y="328" text-anchor="end" font-family="'Inter',sans-serif" font-size="15" fill="#7e95b5">Acts without review.</text><text x="1120" y="350" text-anchor="end" font-family="'Inter',sans-serif" font-size="15" fill="#7e95b5">Fast and convenient, until</text><text x="1120" y="372" text-anchor="end" font-family="'Inter',sans-serif" font-size="15" fill="#7e95b5">a wrong answer acts on its own.</text></svg></div></div><p class="agw-cap">Every guardrail in this post is really a choice about where on this dial an agent should sit — and the right answer changes with the stakes of each decision.</p><h2 id="part-1--reading-the-autopsy-of-a-failure">Part 1 — Reading the autopsy of a failure</h2><p>I looked at two failures that fail in completely different ways. One is a <em>bias</em> problem baked into the data; the other is a <em>hallucination and accountability</em> problem baked into deployment. 
Putting them side by side is the most useful thing I took from this.</p><h3 id="case-a--compas-bias-that-hides-inside-neutral-math">Case A — COMPAS: bias that hides inside &ldquo;neutral&rdquo; math</h3><p>COMPAS is a risk-assessment tool used in US courts to score how likely a defendant is to reoffend. Judges used those scores to help inform bail and sentencing. In 2016, ProPublica analysed more than 7,000 cases in Broward County, Florida, and found something damning: among defendants who did <em>not</em> go on to reoffend, Black defendants were flagged &ldquo;high risk&rdquo; at roughly twice the rate of white defendants (about 45% versus 23%). The errors weren&rsquo;t random — they leaned in one direction.</p><p>Here&rsquo;s the part that took me a moment to appreciate. <strong>Race was never an input.</strong> The model didn&rsquo;t need it. It learned from historical criminal-justice data shaped by biased policing, and its questionnaire leaned on <em>proxies</em> — prior arrests, employment, neighbourhood, family history — that quietly correlate with race. 
The bias didn&rsquo;t enter through a checkbox; it seeped in through the data.</p><div class="agw-fig"><div class="agw-wrap"><svg viewBox="0 0 1200 620" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Diagram showing that race is never an input to COMPAS, but proxy features that correlate with race flow into the model and produce a skewed risk score."><defs><marker id="agwArrB" markerWidth="10" markerHeight="10" refX="7" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#60a5fa"/></marker><marker id="agwArrR" markerWidth="10" markerHeight="10" refX="7" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#f87171"/></marker></defs><rect width="1200" height="620" fill="#0a1424"/><text x="80" y="60" font-family="'Space Grotesk','Inter',sans-serif" font-size="15" fill="#f59e0b" letter-spacing="3" font-weight="700">CASE A · COMPAS</text><text x="80" y="100" font-family="'Space Grotesk','Inter',sans-serif" font-size="32" fill="#ffffff" font-weight="700" letter-spacing="-.5">Bias enters through proxies, not a checkbox</text><g><rect x="80" y="150" width="230" height="70" rx="12" fill="#1a1424" stroke="#f87171" stroke-width="2" stroke-dasharray="6 5"/><text x="195" y="184" text-anchor="middle" font-family="'Inter',sans-serif" font-size="17" fill="#f87171" font-weight="700">Race</text><text x="195" y="206" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#f0a0a0">never an input ✕</text></g><text x="80" y="280" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" letter-spacing="2" font-weight="700">PROXY FEATURES (the questionnaire)</text><g font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5" font-weight="600"><g><rect x="80" y="300" width="230" height="54" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="1.6"/><text x="195" y="333" text-anchor="middle">Prior arrests</text></g><g><rect x="80" y="368" width="230" height="54" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="1.6"/><text x="195" 
y="401" text-anchor="middle">Employment status</text></g><g><rect x="80" y="436" width="230" height="54" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="1.6"/><text x="195" y="469" text-anchor="middle">Neighbourhood</text></g><g><rect x="80" y="504" width="230" height="54" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="1.6"/><text x="195" y="537" text-anchor="middle">Family history</text></g></g><path d="M195 220 Q 150 250 150 296" fill="none" stroke="#f87171" stroke-width="1.6" stroke-dasharray="4 4" marker-end="url(#agwArrR)"/><text x="30" y="265" font-family="'Inter',sans-serif" font-size="12" fill="#f0a0a0" font-style="italic">correlates</text><path d="M310 327 L 470 380" fill="none" stroke="#60a5fa" stroke-width="1.8" marker-end="url(#agwArrB)"/><path d="M310 395 L 470 410" fill="none" stroke="#60a5fa" stroke-width="1.8" marker-end="url(#agwArrB)"/><path d="M310 463 L 470 430" fill="none" stroke="#60a5fa" stroke-width="1.8" marker-end="url(#agwArrB)"/><path d="M310 531 L 470 460" fill="none" stroke="#60a5fa" stroke-width="1.8" marker-end="url(#agwArrB)"/><rect x="480" y="350" width="200" height="120" rx="14" fill="#10233f" stroke="#a78bfa" stroke-width="2"/><text x="580" y="405" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="20" fill="#c4b5fd" font-weight="700">COMPAS</text><text x="580" y="430" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#9fb4cf">trained on biased</text><text x="580" y="450" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#9fb4cf">historical data</text><path d="M680 410 L 760 410" fill="none" stroke="#a78bfa" stroke-width="2" marker-end="url(#agwArrB)"/><rect x="770" y="350" width="350" height="120" rx="14" fill="#1a1424" stroke="#f87171" stroke-width="2"/><text x="945" y="392" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="18" fill="#ffffff" font-weight="700">Risk score, skewed</text><text x="945" y="424" 
text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#f0a0a0">Among those who did NOT reoffend:</text><text x="945" y="448" text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#f0a0a0">~45% of Black vs ~23% of white</text><text x="945" y="466" text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#f0a0a0">defendants flagged "high risk"</text></svg></div></div><p class="agw-cap">The model never sees race — but it sees features that stand in for it. Bias laundered through "neutral" inputs is still bias.</p><p>And the fairness argument has a genuinely hard core. The vendor (Northpointe) responded that the tool was <em>calibrated</em> — a given score meant the same probability of reoffending regardless of race — which was true. The catch is mathematical: when the base rates differ between groups, you <strong>cannot</strong> have equal calibration <em>and</em> equal false-positive and false-negative rates at the same time, short of a perfect predictor. The two sides were optimising for different definitions of &ldquo;fair,&rdquo; and both were partly right. That is the lesson: &ldquo;fair&rdquo; is not one thing, and choosing which fairness to enforce is an ethical decision you can&rsquo;t dodge with more math.</p><p>The deeper failures were organisational. COMPAS was proprietary — a black box defendants couldn&rsquo;t inspect or contest — and it was deployed into life-altering decisions without independent audits for subgroup fairness.</p><blockquote><p><strong>Source:</strong> ProPublica, <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">&ldquo;Machine Bias&rdquo;</a> (Angwin et al., 2016).</p></blockquote><h3 id="case-b--air-canadas-chatbot-a-confident-costly-wrong-answer">Case B — Air Canada&rsquo;s chatbot: a confident, costly wrong answer</h3><p>The second case is more recent and, honestly, more relatable. 
In late 2022, Jake Moffatt used Air Canada&rsquo;s website chatbot after his grandmother died, to check the airline&rsquo;s bereavement-fare policy. The bot told him, confidently, that he could book now and claim the bereavement discount retroactively within 90 days. That was simply false — Air Canada&rsquo;s real policy didn&rsquo;t allow retroactive claims, and the bot even contradicted the airline&rsquo;s own linked policy page.</p><p>When Moffatt tried to claim the refund the bot had promised, Air Canada refused — and then argued in tribunal that it shouldn&rsquo;t be liable because the chatbot was &ldquo;a separate legal entity responsible for its own actions.&rdquo; The tribunal rejected that flatly: a company is responsible for everything on its website, whether it comes from a static page or a bot. Moffatt was awarded damages.</p><div class="agw-fig"><div class="agw-wrap"><svg viewBox="0 0 1200 440" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="A left-to-right flow: a customer asks the chatbot, the bot invents an ungrounded answer, the customer relies on it, the claim is refused, the company blames the bot, and the tribunal holds the company liable."><defs><marker id="agwArrW" markerWidth="10" markerHeight="10" refX="7" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#9fb4cf"/></marker></defs><rect width="1200" height="440" fill="#0a1424"/><text x="80" y="58" font-family="'Space Grotesk','Inter',sans-serif" font-size="15" fill="#f59e0b" letter-spacing="3" font-weight="700">CASE B · AIR CANADA</text><text x="80" y="98" font-family="'Space Grotesk','Inter',sans-serif" font-size="32" fill="#ffffff" font-weight="700" letter-spacing="-.5">"Not our bot's fault" — and why that failed</text><g font-family="'Inter',sans-serif"><rect x="60" y="150" width="180" height="120" rx="12" fill="#0f1d33" stroke="#60a5fa" stroke-width="1.8"/><text x="150" y="196" text-anchor="middle" font-size="15" fill="#ffffff" font-weight="700">Customer 
asks</text><text x="150" y="222" text-anchor="middle" font-size="13" fill="#9fb4cf">bereavement-fare</text><text x="150" y="240" text-anchor="middle" font-size="13" fill="#9fb4cf">policy</text><rect x="280" y="150" width="180" height="120" rx="12" fill="#1a1424" stroke="#f87171" stroke-width="1.8"/><text x="370" y="190" text-anchor="middle" font-size="15" fill="#f87171" font-weight="700">Bot invents</text><text x="370" y="214" text-anchor="middle" font-size="13" fill="#f0a0a0">"claim it back</text><text x="370" y="232" text-anchor="middle" font-size="13" fill="#f0a0a0">within 90 days"</text><text x="370" y="254" text-anchor="middle" font-size="12" fill="#f0a0a0" font-style="italic">ungrounded ✕</text><rect x="500" y="150" width="180" height="120" rx="12" fill="#0f1d33" stroke="#60a5fa" stroke-width="1.8"/><text x="590" y="196" text-anchor="middle" font-size="15" fill="#ffffff" font-weight="700">Customer relies</text><text x="590" y="222" text-anchor="middle" font-size="13" fill="#9fb4cf">books the flight,</text><text x="590" y="240" text-anchor="middle" font-size="13" fill="#9fb4cf">expects refund</text><rect x="720" y="150" width="180" height="120" rx="12" fill="#1a1424" stroke="#f87171" stroke-width="1.8"/><text x="810" y="190" text-anchor="middle" font-size="15" fill="#f87171" font-weight="700">Claim refused</text><text x="810" y="214" text-anchor="middle" font-size="13" fill="#f0a0a0">"the bot is a</text><text x="810" y="232" text-anchor="middle" font-size="13" fill="#f0a0a0">separate entity"</text><text x="810" y="254" text-anchor="middle" font-size="12" fill="#f0a0a0" font-style="italic">blame-shift</text><rect x="940" y="150" width="200" height="120" rx="12" fill="#10231a" stroke="#34d399" stroke-width="2"/><text x="1040" y="190" text-anchor="middle" font-size="15" fill="#34d399" font-weight="700">Tribunal: liable</text><text x="1040" y="214" text-anchor="middle" font-size="13" fill="#a8e6c8">a company owns</text><text x="1040" y="232" text-anchor="middle" 
font-size="13" fill="#a8e6c8">everything on its</text><text x="1040" y="250" text-anchor="middle" font-size="13" fill="#a8e6c8">site. Damages paid.</text><line x1="240" y1="210" x2="276" y2="210" stroke="#9fb4cf" stroke-width="1.8" marker-end="url(#agwArrW)"/><line x1="460" y1="210" x2="496" y2="210" stroke="#9fb4cf" stroke-width="1.8" marker-end="url(#agwArrW)"/><line x1="680" y1="210" x2="716" y2="210" stroke="#9fb4cf" stroke-width="1.8" marker-end="url(#agwArrW)"/><line x1="900" y1="210" x2="936" y2="210" stroke="#9fb4cf" stroke-width="1.8" marker-end="url(#agwArrW)"/></g><text x="80" y="340" font-family="'Inter',sans-serif" font-size="16" fill="#dce7f5" font-weight="700">The failure isn't the CA$800.</text><text x="80" y="366" font-family="'Inter',sans-serif" font-size="15" fill="#9fb4cf">It's the instinct to treat the AI as a third party you can blame — exactly the move responsible-AI</text><text x="80" y="388" font-family="'Inter',sans-serif" font-size="15" fill="#9fb4cf">governance exists to prevent. Deploy-and-forget, on a high-stakes question, with no owner.</text></svg></div></div><p class="agw-cap">A single ungrounded output, deployed in front of customers with no monitoring and no clear owner — and an accountability dodge the tribunal refused to accept.</p><p>What makes this a great teaching case isn&rsquo;t the money (about CA$800). It&rsquo;s the <strong>accountability</strong> move. The instinct to treat the AI as a third party you can blame is exactly the failure mode responsible-AI governance exists to prevent. Technically, the system generated an unverified answer that wasn&rsquo;t grounded in the authoritative policy. Organisationally, it was put in front of customers on high-stakes questions with no guardrails, no monitoring, and no clear owner — a &ldquo;deploy and forget&rdquo; posture.</p><blockquote><p><strong>Source:</strong> <em>Moffatt v. 
Air Canada</em>, <a href="https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149/2024bccrt149.html">2024 BCCRT 149</a>.</p></blockquote><h3 id="the-pattern-across-both">The pattern across both</h3><p>COMPAS fails <em>quietly and systematically</em> through data; Air Canada fails <em>loudly and individually</em> through a single bad output. But the root causes rhyme: a system trusted beyond what it was validated for, no meaningful oversight, and unclear accountability when it broke. Bias and hallucination look different on the surface and share the same governance gap underneath.</p><div class="agw-fig"><div class="agw-wrap"><svg viewBox="0 0 1200 560" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Two failure modes contrasted — COMPAS fails quietly and systematically through data, Air Canada fails loudly and individually through one output — converging on the same three shared root causes."><rect width="1200" height="560" fill="#0a1424"/><text x="600" y="58" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="15" fill="#f59e0b" letter-spacing="3" font-weight="700">THE PATTERN ACROSS BOTH</text><text x="600" y="98" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="30" fill="#ffffff" font-weight="700" letter-spacing="-.5">Different surface, same gap underneath</text><rect x="70" y="140" width="480" height="210" rx="16" fill="#0f1d33" stroke="#a78bfa" stroke-width="2"/><text x="100" y="180" font-family="'Space Grotesk','Inter',sans-serif" font-size="20" fill="#c4b5fd" font-weight="700">COMPAS</text><text x="100" y="206" font-family="'Inter',sans-serif" font-size="14" fill="#9fb4cf">Fails QUIETLY · SYSTEMATICALLY</text><text x="100" y="246" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• Bias baked into the training data</text><text x="100" y="276" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• Harms a whole group, invisibly</text><text x="100" y="306" 
font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• No subgroup fairness audit</text><text x="100" y="336" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• Black-box, can't be contested</text><rect x="650" y="140" width="480" height="210" rx="16" fill="#0f1d33" stroke="#f87171" stroke-width="2"/><text x="680" y="180" font-family="'Space Grotesk','Inter',sans-serif" font-size="20" fill="#f87171" font-weight="700">Air Canada chatbot</text><text x="680" y="206" font-family="'Inter',sans-serif" font-size="14" fill="#9fb4cf">Fails LOUDLY · INDIVIDUALLY</text><text x="680" y="246" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• One confident, ungrounded answer</text><text x="680" y="276" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• Harms one person, visibly</text><text x="680" y="306" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• No grounding in authoritative policy</text><text x="680" y="336" font-family="'Inter',sans-serif" font-size="15" fill="#dce7f5">• Deploy-and-forget, no monitoring</text><path d="M310 350 L 480 410" fill="none" stroke="#7e95b5" stroke-width="1.6"/><path d="M890 350 L 720 410" fill="none" stroke="#7e95b5" stroke-width="1.6"/><rect x="300" y="416" width="600" height="110" rx="16" fill="#1a1606" stroke="#f59e0b" stroke-width="2"/><text x="600" y="450" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#f59e0b" letter-spacing="2" font-weight="700">SAME ROOT CAUSES</text><text x="600" y="480" text-anchor="middle" font-family="'Inter',sans-serif" font-size="16" fill="#ffe7b0" font-weight="600">Trusted beyond validation · No meaningful oversight</text><text x="600" y="506" text-anchor="middle" font-family="'Inter',sans-serif" font-size="16" fill="#ffe7b0" font-weight="600">Unclear accountability when it broke</text></svg></div></div><p class="agw-cap">Bias and hallucination are different symptoms of the same disease: a governance gap, not just a 
model bug.</p><h2 id="part-2--switching-seats-designing-the-guardrails">Part 2 — Switching seats: designing the guardrails</h2><p>Critiquing failures is the easy half. The harder, more honest half is designing an agent that <em>wouldn&rsquo;t</em> fail the same way. I framed both designs around the same three safeguard categories — <strong>Data Privacy, Content Safety, Operational Oversight</strong> — because that structure forces you to cover the three places agents usually go wrong: the data going in, the content coming out, and the humans watching over the whole thing.</p><div class="agw-fig"><div class="agw-wrap"><svg viewBox="0 0 1200 560" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="A reusable safeguard skeleton: data privacy controls the data going into the agent, content safety controls the content coming out, and operational oversight wraps the whole thing with human review, logging, monitoring, and a kill-switch."><defs><marker id="agwArrG" markerWidth="10" markerHeight="10" refX="7" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#34d399"/></marker><marker id="agwArrBl" markerWidth="10" markerHeight="10" refX="7" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#60a5fa"/></marker></defs><rect width="1200" height="560" fill="#0a1424"/><text x="80" y="58" font-family="'Space Grotesk','Inter',sans-serif" font-size="15" fill="#f59e0b" letter-spacing="3" font-weight="700">THE PORTABLE SKELETON</text><text x="80" y="98" font-family="'Space Grotesk','Inter',sans-serif" font-size="30" fill="#ffffff" font-weight="700" letter-spacing="-.5">Three places an agent goes wrong — guard all three</text><rect x="60" y="140" width="1080" height="380" rx="20" fill="none" stroke="#a78bfa" stroke-width="2" stroke-dasharray="8 6"/><text x="84" y="170" font-family="'Space Grotesk','Inter',sans-serif" font-size="16" fill="#c4b5fd" font-weight="700">3 · OPERATIONAL OVERSIGHT</text><text x="84" y="194" font-family="'Inter',sans-serif" font-size="14" 
fill="#b9a8e8">human review gate · logging · continuous monitoring · escalation path · kill-switch</text><rect x="110" y="250" width="240" height="200" rx="14" fill="#0f1d33" stroke="#60a5fa" stroke-width="2"/><text x="230" y="288" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="17" fill="#60a5fa" font-weight="700">1 · DATA PRIVACY</text><text x="230" y="312" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#9fb4cf">the data going IN</text><text x="130" y="348" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• minimise &amp; mask identifiers</text><text x="130" y="376" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• role-based access</text><text x="130" y="404" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• pull only fields a query needs</text><text x="130" y="432" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• log every access</text><rect x="480" y="280" width="240" height="140" rx="16" fill="#10233f" stroke="#ffffff" stroke-width="2"/><text x="600" y="338" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="22" fill="#ffffff" font-weight="700">THE AGENT</text><text x="600" y="366" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#9fb4cf">assists — never decides</text><text x="600" y="386" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#9fb4cf">on its own</text><rect x="850" y="250" width="240" height="200" rx="14" fill="#0f1d33" stroke="#34d399" stroke-width="2"/><text x="970" y="288" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="17" fill="#34d399" font-weight="700">2 · CONTENT SAFETY</text><text x="970" y="312" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#9fb4cf">the content coming OUT</text><text x="870" y="348" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• ground answers 
in evidence</text><text x="870" y="376" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• surface uncertainty</text><text x="870" y="404" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• refuse high-risk requests</text><text x="870" y="432" font-family="'Inter',sans-serif" font-size="14" fill="#dce7f5">• escalate instead of guessing</text><line x1="350" y1="350" x2="476" y2="350" stroke="#60a5fa" stroke-width="2.2" marker-end="url(#agwArrBl)"/><line x1="720" y1="350" x2="846" y2="350" stroke="#34d399" stroke-width="2.2" marker-end="url(#agwArrG)"/></svg></div></div><p class="agw-cap">Same skeleton every time: lock down the inputs, constrain the outputs, and wrap a human-run oversight layer around the whole thing. What changes is the detail inside each box.</p><h3 id="domain-1--healthcare-a-clinical-decision-support-chatbot">Domain 1 — Healthcare: a clinical decision-support chatbot</h3><p><strong>The use case:</strong> an agent inside a hospital&rsquo;s records system that helps <em>clinicians</em> (not patients) by summarising a patient&rsquo;s history and suggesting possible differential diagnoses and relevant guidelines. Crucially, it only <em>suggests</em>. It never diagnoses, prescribes, or talks to patients on its own. Defining what it <em>can&rsquo;t</em> do is half the safety work.</p><ul><li><strong>Data Privacy:</strong> patient data is about as sensitive as it gets, so the agent runs in a HIPAA-compliant environment, masks identifiers before processing, pulls only the fields a query needs, and uses role-based access so a clinician can only see their own patients. Every access is logged.</li><li><strong>Content Safety:</strong> the real danger is a confident, wrong clinical suggestion. 
So the agent is constrained to cite evidence-based guidelines, must surface its uncertainty, refuses high-risk questions like paediatric dosing (escalating to a pharmacist instead), and labels every output &ldquo;decision support, not a diagnosis.&rdquo;</li><li><strong>Operational Oversight:</strong> a licensed clinician reviews and approves anything before it touches care, every recommendation is logged for traceability, accuracy is monitored continuously, and there&rsquo;s a kill-switch to pull the tool if error rates spike.</li></ul><p>The thread running through it: the agent assists, the clinician decides. Autonomy is deliberately capped below the point where a wrong answer could act on its own.</p><h3 id="domain-2--education-an-ai-teaching-assistant">Domain 2 — Education: an AI teaching assistant</h3><p><strong>The use case:</strong> an agent in a college&rsquo;s learning platform that helps <em>students</em> with course material — explaining concepts, unpacking feedback, pointing to readings, generating practice problems. It supports learning; it does not grade official work or write the assignments students submit.</p><ul><li><strong>Data Privacy:</strong> student records are FERPA-protected, so the same discipline applies — minimise data, mask identifiers, role-based access so a student sees only their own data, audit logs, and no quietly training external models on student conversations.</li><li><strong>Content Safety:</strong> here &ldquo;unsafe&rdquo; has a twist — the danger isn&rsquo;t just false info, it&rsquo;s <em>doing the work for the student</em>. 
So the agent scaffolds and hints rather than handing over finished answers on graded work, refuses to write submittable assignments, cites course materials instead of inventing them, and routes sensitive disclosures (self-harm, harassment) to human support.</li><li><strong>Operational Oversight:</strong> instructors configure and review how it&rsquo;s used, interactions are logged for academic-integrity checks, accuracy and flagged conversations are monitored, and there&rsquo;s an escalation path to a human plus a disable switch.</li></ul><p>Notice how the <em>same</em> three-category skeleton produces different specifics once you take the domain&rsquo;s real risks seriously. In healthcare the nightmare is a wrong diagnosis; in education it&rsquo;s eroding academic integrity. The structure is portable; the judgement is not.</p><div class="agw-fig"><div class="agw-wrap"><svg viewBox="0 0 1200 620" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="A table mapping the three safeguard categories onto two domains: healthcare clinical decision support and an education teaching assistant, showing how the same skeleton produces different specifics."><rect width="1200" height="620" fill="#0a1424"/><text x="80" y="56" font-family="'Space Grotesk','Inter',sans-serif" font-size="15" fill="#f59e0b" letter-spacing="3" font-weight="700">SAME SKELETON · TWO DOMAINS</text><text x="80" y="96" font-family="'Space Grotesk','Inter',sans-serif" font-size="28" fill="#ffffff" font-weight="700" letter-spacing="-.5">Portable structure, domain-specific judgement</text><rect x="360" y="120" width="370" height="56" rx="10" fill="#10231a" stroke="#34d399" stroke-width="1.6"/><text x="545" y="148" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="17" fill="#34d399" font-weight="700">Healthcare</text><text x="545" y="168" text-anchor="middle" font-family="'Inter',sans-serif" font-size="12" fill="#a8e6c8">clinical decision support</text><rect x="745" y="120" 
width="370" height="56" rx="10" fill="#16112a" stroke="#a78bfa" stroke-width="1.6"/><text x="930" y="148" text-anchor="middle" font-family="'Space Grotesk','Inter',sans-serif" font-size="17" fill="#c4b5fd" font-weight="700">Education</text><text x="930" y="168" text-anchor="middle" font-family="'Inter',sans-serif" font-size="12" fill="#cfc2f0">AI teaching assistant</text><g font-family="'Inter',sans-serif"><rect x="70" y="190" width="270" height="120" rx="10" fill="#0f1d33" stroke="#60a5fa" stroke-width="1.6"/><text x="90" y="226" font-size="16" fill="#60a5fa" font-weight="700">Data Privacy</text><text x="90" y="252" font-size="13" fill="#9fb4cf">the data going in</text><rect x="360" y="190" width="370" height="120" rx="10" fill="#0c1626" stroke="#21405f" stroke-width="1.2"/><text x="378" y="222" font-size="14" fill="#dce7f5">HIPAA env · mask IDs ·</text><text x="378" y="246" font-size="14" fill="#dce7f5">role-based access (own</text><text x="378" y="270" font-size="14" fill="#dce7f5">patients) · log every</text><text x="378" y="294" font-size="14" fill="#dce7f5">access</text><rect x="745" y="190" width="370" height="120" rx="10" fill="#0c1626" stroke="#21405f" stroke-width="1.2"/><text x="763" y="222" font-size="14" fill="#dce7f5">FERPA · mask IDs · own</text><text x="763" y="246" font-size="14" fill="#dce7f5">data only · audit logs ·</text><text x="763" y="270" font-size="14" fill="#dce7f5">no training external</text><text x="763" y="294" font-size="14" fill="#dce7f5">models on chats</text><rect x="70" y="324" width="270" height="120" rx="10" fill="#0f1d33" stroke="#34d399" stroke-width="1.6"/><text x="90" y="360" font-size="16" fill="#34d399" font-weight="700">Content Safety</text><text x="90" y="386" font-size="13" fill="#9fb4cf">the content coming out</text><rect x="360" y="324" width="370" height="120" rx="10" fill="#0c1626" stroke="#21405f" stroke-width="1.2"/><text x="378" y="356" font-size="14" fill="#dce7f5">cite guidelines · show</text><text x="378" 
y="380" font-size="14" fill="#dce7f5">uncertainty · refuse</text><text x="378" y="404" font-size="14" fill="#dce7f5">paediatric dosing · "not</text><text x="378" y="428" font-size="14" fill="#dce7f5">a diagnosis"</text><rect x="745" y="324" width="370" height="120" rx="10" fill="#0c1626" stroke="#21405f" stroke-width="1.2"/><text x="763" y="356" font-size="14" fill="#dce7f5">scaffold &amp; hint · won't</text><text x="763" y="380" font-size="14" fill="#dce7f5">write submittable work ·</text><text x="763" y="404" font-size="14" fill="#dce7f5">cite course material ·</text><text x="763" y="428" font-size="14" fill="#dce7f5">route self-harm to humans</text><rect x="70" y="458" width="270" height="120" rx="10" fill="#0f1d33" stroke="#a78bfa" stroke-width="1.6"/><text x="90" y="494" font-size="16" fill="#c4b5fd" font-weight="700">Oversight</text><text x="90" y="520" font-size="13" fill="#9fb4cf">the humans watching</text><rect x="360" y="458" width="370" height="120" rx="10" fill="#0c1626" stroke="#21405f" stroke-width="1.2"/><text x="378" y="490" font-size="14" fill="#dce7f5">clinician approves before</text><text x="378" y="514" font-size="14" fill="#dce7f5">care · log all recs ·</text><text x="378" y="538" font-size="14" fill="#dce7f5">monitor accuracy ·</text><text x="378" y="562" font-size="14" fill="#dce7f5">kill-switch on error spike</text><rect x="745" y="458" width="370" height="120" rx="10" fill="#0c1626" stroke="#21405f" stroke-width="1.2"/><text x="763" y="490" font-size="14" fill="#dce7f5">instructors configure ·</text><text x="763" y="514" font-size="14" fill="#dce7f5">log for integrity checks ·</text><text x="763" y="538" font-size="14" fill="#dce7f5">monitor flags · escalate</text><text x="763" y="562" font-size="14" fill="#dce7f5">to human · disable switch</text></g></svg></div></div><p class="agw-cap">The three rows never change. 
The cells do — because in healthcare the nightmare is a wrong diagnosis, and in education it's a student who never actually learned.</p><h2 id="what-im-taking-away">What I&rsquo;m taking away</h2><p>Three things stuck with me:</p><ol><li><strong>The failure is rarely the model alone.</strong> In every case, the technical fault was amplified by an organisational gap — no audit, no human gate, no clear owner. Governance is not paperwork wrapped around the AI; it <em>is</em> the safety system.</li><li><strong>&ldquo;Fair&rdquo; and &ldquo;safe&rdquo; require you to choose.</strong> COMPAS proved you sometimes can&rsquo;t satisfy every definition of fairness at once. Pretending otherwise is how you end up shipping the bias.</li><li><strong>Good safeguards are boring on purpose.</strong> Logging, escalation, human review, kill-switches, scope limits. None of it is exciting. All of it is what stands between a useful agent and a headline.</li></ol><p>The capability race will keep accelerating. The quieter discipline — deciding where autonomy ends and accountability begins — is the part that decides whether any of it can be trusted.</p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/agents-go-wrong/hero.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>ai-safety</category><category>governance</category><category>ethics</category><category>agents</category><category>Knowledge Base</category></item><item><title>Memory Management in LLMs</title><link>https://curiousbit.netlify.app/memory-management-in-llms/</link><guid isPermaLink="true">https://curiousbit.netlify.app/memory-management-in-llms/</guid><pubDate>Sun, 14 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
.mm-kb{--mm-accent:#00c853;--mm-static:#5b8cff;--mm-runtime:#3ecf8e;--mm-training:#f5a623;--mm-overview:#9aa4b2;--mm-app:#7c5cff;color:inherit;font-size:clamp(1.2rem,1.15rem + 0.25vw,1.45rem);line-height:1.85;}
.mm-kb *{box-sizing:border-box;}
.mm-kb .mm-intro{font-size:1em;line-height:1.85;margin:0 0 1.6rem;opacity:.9;}
.mm-kb .mm-legend{display:flex;flex-wrap:wrap;gap:14px;font-size:.82rem;opacity:.85;margin:0 0 2rem;padding:.9rem 1rem;border:1px solid rgba(127,127,127,.22);border-radius:12px;background:rgba(127,127,127,.05);}
.mm-kb .mm-legend span{display:inline-flex;align-items:center;gap:7px;}
.mm-kb .mm-legend i{width:11px;height:11px;border-radius:3px;display:inline-block;}
.mm-kb .mm-domain{display:flex;align-items:center;gap:12px;margin:2.2rem 0 1rem;}
.mm-kb .mm-domain .mm-dot{width:11px;height:11px;border-radius:50%;}
.mm-kb .mm-domain h2{font-size:1.15rem;margin:0;font-weight:800;}
.mm-kb .mm-domain .mm-tag{font-size:.72rem;opacity:.7;border:1px solid rgba(127,127,127,.3);padding:3px 10px;border-radius:999px;}
.mm-kb .mm-grid{display:grid;grid-template-columns:repeat(2,1fr);gap:20px;}
@media(max-width:640px){.mm-kb .mm-grid{grid-template-columns:1fr;}}
.mm-kb .mm-card{position:relative;border:1px solid rgba(127,127,127,.22);border-radius:16px;padding:24px 24px 22px;cursor:pointer;background:rgba(127,127,127,.04);transition:.18s;overflow:hidden;}
.mm-kb .mm-card:hover{transform:translateY(-3px);border-color:var(--mm-accent);box-shadow:0 8px 26px rgba(0,0,0,.18);}
.mm-kb .mm-thumb{margin:-24px -24px 16px;aspect-ratio:3/2;overflow:hidden;background:rgba(127,127,127,.08);border-bottom:1px solid rgba(127,127,127,.18);}
.mm-kb .mm-thumb img{width:100%;height:100%;object-fit:cover;display:block;}
.mm-kb .mm-card .mm-num{position:absolute;top:12px;right:16px;font-size:2.1rem;font-weight:800;opacity:.08;}
.mm-kb .mm-card h3{margin:0 0 10px;font-size:1.15em;font-weight:800;padding-right:30px;line-height:1.25;}
.mm-kb .mm-card p{margin:0 0 14px;font-size:.92em;opacity:.8;line-height:1.7;}
.mm-kb .mm-chips{display:flex;flex-wrap:wrap;gap:6px;}
.mm-kb .mm-chip{font-size:.78rem;opacity:.8;background:rgba(127,127,127,.1);border:1px solid rgba(127,127,127,.2);padding:3px 8px;border-radius:999px;}
.mm-kb .mm-open{margin-top:15px;font-size:.92rem;color:var(--mm-accent);font-weight:700;}
.mm-kb .mm-badge{display:inline-block;font-size:.66rem;font-weight:800;letter-spacing:.4px;text-transform:uppercase;padding:3px 9px;border-radius:999px;margin-bottom:10px;border:1px solid transparent;}
.mm-kb .mm-badge.static{color:#3f6fd8;background:rgba(91,140,255,.13);border-color:rgba(91,140,255,.4);}
.mm-kb .mm-badge.runtime{color:#1f9d63;background:rgba(62,207,142,.14);border-color:rgba(62,207,142,.42);}
.mm-kb .mm-badge.training{color:#c47e10;background:rgba(245,166,35,.16);border-color:rgba(245,166,35,.45);}
.mm-kb .mm-badge.overview{color:#6b7280;background:rgba(154,164,178,.14);border-color:rgba(154,164,178,.4);}
.mm-kb .mm-badge.app{color:#6a4dd6;background:rgba(124,92,255,.14);border-color:rgba(124,92,255,.42);}
.mm-kb .mm-hero{margin:0 0 1.4rem;border-radius:16px;overflow:hidden;border:1px solid rgba(127,127,127,.2);background:rgba(127,127,127,.06);aspect-ratio:16/9;}
.mm-kb .mm-hero img,.mm-kb .mm-hero video{width:100%;height:100%;object-fit:cover;display:block;}
.mm-kb .mm-article{display:none;}
.mm-kb .mm-back{display:inline-flex;align-items:center;gap:7px;font-size:.9rem;color:var(--mm-accent);margin-bottom:18px;cursor:pointer;background:none;border:none;padding:0;font-family:inherit;font-weight:700;}
.mm-kb .mm-back:hover{text-decoration:underline;}
.mm-kb .mm-article h1{font-size:1.6em;margin:.2rem 0 .5rem;font-weight:900;line-height:1.1;}
.mm-kb .mm-article .mm-lead{opacity:.8;font-size:1em;line-height:1.85;margin:0 0 1.8rem;}
.mm-kb .mm-sec{border:1px solid rgba(127,127,127,.22);border-radius:14px;padding:18px 22px;margin-bottom:14px;background:rgba(127,127,127,.04);}
.mm-kb .mm-sec h3{margin:0 0 10px;font-size:1.12em;font-weight:800;line-height:1.3;}
.mm-kb .mm-sec ul{margin:0;padding-left:20px;}
.mm-kb .mm-sec li{margin:8px 0;opacity:.92;font-size:.95em;line-height:1.75;}
.mm-kb .mm-note{font-size:.8rem;opacity:.7;border-left:3px solid var(--mm-accent);padding:10px 14px;border-radius:8px;background:rgba(127,127,127,.05);margin-top:14px;}
/* ---- rich article body + interactive widgets (Fundamentals) ---- */
.mm-kb .mm-sec p{margin:0 0 1.1rem;font-size:1.22em;line-height:1.9;opacity:.92;}
.mm-kb .mm-sec p:last-child{margin-bottom:0;}
.mm-kb .mmf-cap{font-size:1rem;opacity:.65;margin:.6rem 0 0;line-height:1.7;}
.mm-kb .mmf-kbd{font-family:ui-monospace,SFMono-Regular,Menlo,monospace;font-size:.85em;background:rgba(127,127,127,.14);padding:1px 6px;border-radius:5px;}
/* widget shell */
.mm-kb .mmf-w{border:1px solid rgba(127,127,127,.22);border-radius:14px;padding:18px;margin:1.1rem 0 .3rem;background:rgba(127,127,127,.05);}
.mm-kb .mmf-w h4{margin:0 0 3px;font-size:1.28rem;font-weight:800;}
.mm-kb .mmf-w .mmf-sub{font-size:1.02rem;opacity:.6;margin:0 0 16px;}
/* mode switch (segmented) */
.mm-kb .mmf-switch{display:inline-flex;border:1px solid rgba(127,127,127,.3);border-radius:999px;overflow:hidden;margin-bottom:14px;}
.mm-kb .mmf-switch button{font:inherit;font-size:1.1rem;font-weight:700;border:none;background:none;color:inherit;padding:9px 20px;cursor:pointer;opacity:.6;transition:.15s;}
.mm-kb .mmf-switch button.on{background:var(--mm-accent);color:#06231a;opacity:1;}
/* component map blocks */
.mm-kb .mmf-blocks{display:grid;grid-template-columns:repeat(4,1fr);gap:10px;}
@media(max-width:560px){.mm-kb .mmf-blocks{grid-template-columns:repeat(2,1fr);}}
.mm-kb .mmf-block{border:1px solid rgba(127,127,127,.25);border-radius:11px;padding:13px 12px;cursor:pointer;background:rgba(127,127,127,.04);transition:.18s;position:relative;text-align:left;}
.mm-kb .mmf-block .mmf-bt{font-weight:800;font-size:1.15rem;display:block;margin-bottom:6px;}
.mm-kb .mmf-block .mmf-bar{height:9px;border-radius:5px;background:var(--c,#888);transition:width .4s ease,opacity .3s;width:30%;}
.mm-kb .mmf-block .mmf-tag{font-size:.9rem;opacity:.7;display:block;margin-top:8px;}
.mm-kb .mmf-block.off{opacity:.28;filter:grayscale(.6);}
.mm-kb .mmf-block.sel{border-color:var(--c,var(--mm-accent));box-shadow:0 0 0 2px rgba(0,200,83,.15);}
.mm-kb .mmf-detail{margin-top:14px;font-size:1.12rem;line-height:1.8;border-left:3px solid var(--mm-accent);padding:13px 16px;background:rgba(127,127,127,.05);border-radius:8px;min-height:1.5em;}
/* trade-off slider */
.mm-kb .mmf-slider{width:100%;accent-color:var(--mm-accent);margin:6px 0 4px;}
.mm-kb .mmf-ends{display:flex;justify-content:space-between;font-size:1rem;opacity:.7;font-weight:700;}
.mm-kb .mmf-meters{margin-top:14px;display:grid;gap:10px;}
.mm-kb .mmf-meter .mmf-ml{display:flex;justify-content:space-between;font-size:1.05rem;margin-bottom:5px;opacity:.85;}
.mm-kb .mmf-track{height:14px;border-radius:7px;background:rgba(127,127,127,.16);overflow:hidden;}
.mm-kb .mmf-fill{height:100%;border-radius:7px;transition:width .25s ease;}
.mm-kb .mmf-readout{margin-top:14px;font-size:1.12rem;line-height:1.8;opacity:.9;}
/* estimator */
.mm-kb .mmf-ctl{display:flex;flex-wrap:wrap;gap:18px;align-items:flex-end;margin-bottom:8px;}
.mm-kb .mmf-ctl label{font-size:1.02rem;font-weight:700;opacity:.8;display:block;margin-bottom:7px;}
.mm-kb .mmf-prec button{font:inherit;font-size:1.05rem;font-weight:700;border:1px solid rgba(127,127,127,.3);background:none;color:inherit;padding:8px 15px;border-radius:8px;cursor:pointer;opacity:.65;margin-right:6px;transition:.15s;}
.mm-kb .mmf-prec button.on{background:var(--mm-static);color:#fff;border-color:var(--mm-static);opacity:1;}
.mm-kb .mmf-pval{font-weight:800;font-size:1.3rem;}
.mm-kb .mmf-est-bars{margin-top:16px;display:grid;gap:16px;}
.mm-kb .mmf-eb .mmf-ebl{display:flex;justify-content:space-between;align-items:baseline;font-size:1.1rem;font-weight:700;margin-bottom:6px;}
.mm-kb .mmf-eb .mmf-ebl small{font-weight:600;opacity:.65;}
.mm-kb .mmf-stack{display:flex;height:34px;border-radius:8px;overflow:hidden;background:rgba(127,127,127,.12);}
.mm-kb .mmf-seg{height:100%;transition:width .3s ease;min-width:0;}
.mm-kb .mmf-legend{display:flex;flex-wrap:wrap;gap:16px;margin-top:14px;font-size:1rem;opacity:.82;}
.mm-kb .mmf-legend span{display:inline-flex;align-items:center;gap:6px;}
.mm-kb .mmf-legend i{width:12px;height:12px;border-radius:3px;display:inline-block;}
.mm-kb .mmf-ratio{margin-top:14px;font-size:1.2rem;font-weight:700;text-align:center;padding:13px;border-radius:9px;background:rgba(0,200,83,.1);border:1px solid rgba(0,200,83,.25);}
/* ---- extra widgets for the remaining articles ---- */
.mm-kb .mmf-stat{font-size:2.2rem;font-weight:800;margin:8px 0 4px;letter-spacing:-.5px;line-height:1.1;}
.mm-kb .mmf-fitline{font-size:1.02rem;opacity:.85;margin-bottom:14px;line-height:1.65;}
.mm-kb .mmf-matrix{display:grid;gap:2px;margin:10px auto 18px;max-width:420px;}
.mm-kb .mmf-cell{aspect-ratio:1;background:#f5a623;border-radius:2px;opacity:.82;}
.mm-kb .mmf-memgrid{display:grid;grid-template-columns:repeat(20,1fr);gap:3px;margin:8px 0 14px;}
.mm-kb .mmf-mcell{aspect-ratio:1;border-radius:3px;background:rgba(127,127,127,.16);transition:.2s;}
.mm-kb .mmf-hier{display:grid;gap:11px;margin:8px 0 4px;}
.mm-kb .mmf-hrow{cursor:pointer;border:1px solid rgba(127,127,127,.22);border-radius:11px;padding:12px 15px;background:rgba(127,127,127,.04);transition:.15s;}
.mm-kb .mmf-hrow:hover,.mm-kb .mmf-hrow.sel{border-color:var(--mm-accent);box-shadow:0 0 0 2px rgba(0,200,83,.12);}
.mm-kb .mmf-hrow .mmf-htop{display:flex;justify-content:space-between;align-items:baseline;font-weight:800;font-size:1.08rem;margin-bottom:8px;}
.mm-kb .mmf-hrow .mmf-htop small{font-weight:600;opacity:.7;font-size:.92rem;}
.mm-kb .mmf-hbar{height:12px;border-radius:6px;background:var(--c,#888);}
.mm-kb .mmf-btn{font:inherit;font-size:1.02rem;font-weight:700;border:1px solid var(--mm-accent);background:rgba(0,200,83,.1);color:inherit;padding:9px 16px;border-radius:9px;cursor:pointer;transition:.15s;}
.mm-kb .mmf-btn:hover{background:rgba(0,200,83,.2);}
.mm-kb .mmf-btn:disabled{opacity:.4;cursor:not-allowed;}
&lt;/style&gt;
&lt;div class="mm-kb not-prose" id="mm-kb"&gt;
&lt;div id="mm-home"&gt;
&lt;p class="mm-intro"&gt;Everything about how large language models use, store, and optimize memory — from the bytes that hold model weights on a GPU to how an agent remembers a conversation across sessions. Each topic is tagged by &lt;em&gt;when&lt;/em&gt; the memory is consumed. Pick a card to open the article.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/memory-llms/hero.png" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
.mm-kb{--mm-accent:#00c853;--mm-static:#5b8cff;--mm-runtime:#3ecf8e;--mm-training:#f5a623;--mm-overview:#9aa4b2;--mm-app:#7c5cff;color:inherit;font-size:clamp(1.2rem,1.15rem + 0.25vw,1.45rem);line-height:1.85;}
.mm-kb *{box-sizing:border-box;}
.mm-kb .mm-intro{font-size:1em;line-height:1.85;margin:0 0 1.6rem;opacity:.9;}
.mm-kb .mm-legend{display:flex;flex-wrap:wrap;gap:14px;font-size:.82rem;opacity:.85;margin:0 0 2rem;padding:.9rem 1rem;border:1px solid rgba(127,127,127,.22);border-radius:12px;background:rgba(127,127,127,.05);}
.mm-kb .mm-legend span{display:inline-flex;align-items:center;gap:7px;}
.mm-kb .mm-legend i{width:11px;height:11px;border-radius:3px;display:inline-block;}
.mm-kb .mm-domain{display:flex;align-items:center;gap:12px;margin:2.2rem 0 1rem;}
.mm-kb .mm-domain .mm-dot{width:11px;height:11px;border-radius:50%;}
.mm-kb .mm-domain h2{font-size:1.15rem;margin:0;font-weight:800;}
.mm-kb .mm-domain .mm-tag{font-size:.72rem;opacity:.7;border:1px solid rgba(127,127,127,.3);padding:3px 10px;border-radius:999px;}
.mm-kb .mm-grid{display:grid;grid-template-columns:repeat(2,1fr);gap:20px;}
@media(max-width:640px){.mm-kb .mm-grid{grid-template-columns:1fr;}}
.mm-kb .mm-card{position:relative;border:1px solid rgba(127,127,127,.22);border-radius:16px;padding:24px 24px 22px;cursor:pointer;background:rgba(127,127,127,.04);transition:.18s;overflow:hidden;}
.mm-kb .mm-card:hover{transform:translateY(-3px);border-color:var(--mm-accent);box-shadow:0 8px 26px rgba(0,0,0,.18);}
.mm-kb .mm-thumb{margin:-24px -24px 16px;aspect-ratio:3/2;overflow:hidden;background:rgba(127,127,127,.08);border-bottom:1px solid rgba(127,127,127,.18);}
.mm-kb .mm-thumb img{width:100%;height:100%;object-fit:cover;display:block;}
.mm-kb .mm-card .mm-num{position:absolute;top:12px;right:16px;font-size:2.1rem;font-weight:800;opacity:.08;}
.mm-kb .mm-card h3{margin:0 0 10px;font-size:1.15em;font-weight:800;padding-right:30px;line-height:1.25;}
.mm-kb .mm-card p{margin:0 0 14px;font-size:.92em;opacity:.8;line-height:1.7;}
.mm-kb .mm-chips{display:flex;flex-wrap:wrap;gap:6px;}
.mm-kb .mm-chip{font-size:.78rem;opacity:.8;background:rgba(127,127,127,.1);border:1px solid rgba(127,127,127,.2);padding:3px 8px;border-radius:999px;}
.mm-kb .mm-open{margin-top:15px;font-size:.92rem;color:var(--mm-accent);font-weight:700;}
.mm-kb .mm-badge{display:inline-block;font-size:.66rem;font-weight:800;letter-spacing:.4px;text-transform:uppercase;padding:3px 9px;border-radius:999px;margin-bottom:10px;border:1px solid transparent;}
.mm-kb .mm-badge.static{color:#3f6fd8;background:rgba(91,140,255,.13);border-color:rgba(91,140,255,.4);}
.mm-kb .mm-badge.runtime{color:#1f9d63;background:rgba(62,207,142,.14);border-color:rgba(62,207,142,.42);}
.mm-kb .mm-badge.training{color:#c47e10;background:rgba(245,166,35,.16);border-color:rgba(245,166,35,.45);}
.mm-kb .mm-badge.overview{color:#6b7280;background:rgba(154,164,178,.14);border-color:rgba(154,164,178,.4);}
.mm-kb .mm-badge.app{color:#6a4dd6;background:rgba(124,92,255,.14);border-color:rgba(124,92,255,.42);}
.mm-kb .mm-hero{margin:0 0 1.4rem;border-radius:16px;overflow:hidden;border:1px solid rgba(127,127,127,.2);background:rgba(127,127,127,.06);aspect-ratio:16/9;}
.mm-kb .mm-hero img,.mm-kb .mm-hero video{width:100%;height:100%;object-fit:cover;display:block;}
.mm-kb .mm-article{display:none;}
.mm-kb .mm-back{display:inline-flex;align-items:center;gap:7px;font-size:.9rem;color:var(--mm-accent);margin-bottom:18px;cursor:pointer;background:none;border:none;padding:0;font-family:inherit;font-weight:700;}
.mm-kb .mm-back:hover{text-decoration:underline;}
.mm-kb .mm-article h1{font-size:1.6em;margin:.2rem 0 .5rem;font-weight:900;line-height:1.1;}
.mm-kb .mm-article .mm-lead{opacity:.8;font-size:1em;line-height:1.85;margin:0 0 1.8rem;}
.mm-kb .mm-sec{border:1px solid rgba(127,127,127,.22);border-radius:14px;padding:18px 22px;margin-bottom:14px;background:rgba(127,127,127,.04);}
.mm-kb .mm-sec h3{margin:0 0 10px;font-size:1.12em;font-weight:800;line-height:1.3;}
.mm-kb .mm-sec ul{margin:0;padding-left:20px;}
.mm-kb .mm-sec li{margin:8px 0;opacity:.92;font-size:.95em;line-height:1.75;}
.mm-kb .mm-note{font-size:.8rem;opacity:.7;border-left:3px solid var(--mm-accent);padding:10px 14px;border-radius:8px;background:rgba(127,127,127,.05);margin-top:14px;}
/* ---- rich article body + interactive widgets (Fundamentals) ---- */
.mm-kb .mm-sec p{margin:0 0 1.1rem;font-size:1.22em;line-height:1.9;opacity:.92;}
.mm-kb .mm-sec p:last-child{margin-bottom:0;}
.mm-kb .mmf-cap{font-size:1rem;opacity:.65;margin:.6rem 0 0;line-height:1.7;}
.mm-kb .mmf-kbd{font-family:ui-monospace,SFMono-Regular,Menlo,monospace;font-size:.85em;background:rgba(127,127,127,.14);padding:1px 6px;border-radius:5px;}
/* widget shell */
.mm-kb .mmf-w{border:1px solid rgba(127,127,127,.22);border-radius:14px;padding:18px;margin:1.1rem 0 .3rem;background:rgba(127,127,127,.05);}
.mm-kb .mmf-w h4{margin:0 0 3px;font-size:1.28rem;font-weight:800;}
.mm-kb .mmf-w .mmf-sub{font-size:1.02rem;opacity:.6;margin:0 0 16px;}
/* mode switch (segmented) */
.mm-kb .mmf-switch{display:inline-flex;border:1px solid rgba(127,127,127,.3);border-radius:999px;overflow:hidden;margin-bottom:14px;}
.mm-kb .mmf-switch button{font:inherit;font-size:1.1rem;font-weight:700;border:none;background:none;color:inherit;padding:9px 20px;cursor:pointer;opacity:.6;transition:.15s;}
.mm-kb .mmf-switch button.on{background:var(--mm-accent);color:#06231a;opacity:1;}
/* component map blocks */
.mm-kb .mmf-blocks{display:grid;grid-template-columns:repeat(4,1fr);gap:10px;}
@media(max-width:560px){.mm-kb .mmf-blocks{grid-template-columns:repeat(2,1fr);}}
.mm-kb .mmf-block{border:1px solid rgba(127,127,127,.25);border-radius:11px;padding:13px 12px;cursor:pointer;background:rgba(127,127,127,.04);transition:.18s;position:relative;text-align:left;}
.mm-kb .mmf-block .mmf-bt{font-weight:800;font-size:1.15rem;display:block;margin-bottom:6px;}
.mm-kb .mmf-block .mmf-bar{height:9px;border-radius:5px;background:var(--c,#888);transition:width .4s ease,opacity .3s;width:30%;}
.mm-kb .mmf-block .mmf-tag{font-size:.9rem;opacity:.7;display:block;margin-top:8px;}
.mm-kb .mmf-block.off{opacity:.28;filter:grayscale(.6);}
.mm-kb .mmf-block.sel{border-color:var(--c,var(--mm-accent));box-shadow:0 0 0 2px rgba(0,200,83,.15);}
.mm-kb .mmf-detail{margin-top:14px;font-size:1.12rem;line-height:1.8;border-left:3px solid var(--mm-accent);padding:13px 16px;background:rgba(127,127,127,.05);border-radius:8px;min-height:1.5em;}
/* trade-off slider */
.mm-kb .mmf-slider{width:100%;accent-color:var(--mm-accent);margin:6px 0 4px;}
.mm-kb .mmf-ends{display:flex;justify-content:space-between;font-size:1rem;opacity:.7;font-weight:700;}
.mm-kb .mmf-meters{margin-top:14px;display:grid;gap:10px;}
.mm-kb .mmf-meter .mmf-ml{display:flex;justify-content:space-between;font-size:1.05rem;margin-bottom:5px;opacity:.85;}
.mm-kb .mmf-track{height:14px;border-radius:7px;background:rgba(127,127,127,.16);overflow:hidden;}
.mm-kb .mmf-fill{height:100%;border-radius:7px;transition:width .25s ease;}
.mm-kb .mmf-readout{margin-top:14px;font-size:1.12rem;line-height:1.8;opacity:.9;}
/* estimator */
.mm-kb .mmf-ctl{display:flex;flex-wrap:wrap;gap:18px;align-items:flex-end;margin-bottom:8px;}
.mm-kb .mmf-ctl label{font-size:1.02rem;font-weight:700;opacity:.8;display:block;margin-bottom:7px;}
.mm-kb .mmf-prec button{font:inherit;font-size:1.05rem;font-weight:700;border:1px solid rgba(127,127,127,.3);background:none;color:inherit;padding:8px 15px;border-radius:8px;cursor:pointer;opacity:.65;margin-right:6px;transition:.15s;}
.mm-kb .mmf-prec button.on{background:var(--mm-static);color:#fff;border-color:var(--mm-static);opacity:1;}
.mm-kb .mmf-pval{font-weight:800;font-size:1.3rem;}
.mm-kb .mmf-est-bars{margin-top:16px;display:grid;gap:16px;}
.mm-kb .mmf-eb .mmf-ebl{display:flex;justify-content:space-between;align-items:baseline;font-size:1.1rem;font-weight:700;margin-bottom:6px;}
.mm-kb .mmf-eb .mmf-ebl small{font-weight:600;opacity:.65;}
.mm-kb .mmf-stack{display:flex;height:34px;border-radius:8px;overflow:hidden;background:rgba(127,127,127,.12);}
.mm-kb .mmf-seg{height:100%;transition:width .3s ease;min-width:0;}
.mm-kb .mmf-legend{display:flex;flex-wrap:wrap;gap:16px;margin-top:14px;font-size:1rem;opacity:.82;}
.mm-kb .mmf-legend span{display:inline-flex;align-items:center;gap:6px;}
.mm-kb .mmf-legend i{width:12px;height:12px;border-radius:3px;display:inline-block;}
.mm-kb .mmf-ratio{margin-top:14px;font-size:1.2rem;font-weight:700;text-align:center;padding:13px;border-radius:9px;background:rgba(0,200,83,.1);border:1px solid rgba(0,200,83,.25);}
/* ---- extra widgets for the remaining articles ---- */
.mm-kb .mmf-stat{font-size:2.2rem;font-weight:800;margin:8px 0 4px;letter-spacing:-.5px;line-height:1.1;}
.mm-kb .mmf-fitline{font-size:1.02rem;opacity:.85;margin-bottom:14px;line-height:1.65;}
.mm-kb .mmf-matrix{display:grid;gap:2px;margin:10px auto 18px;max-width:420px;}
.mm-kb .mmf-cell{aspect-ratio:1;background:#f5a623;border-radius:2px;opacity:.82;}
.mm-kb .mmf-memgrid{display:grid;grid-template-columns:repeat(20,1fr);gap:3px;margin:8px 0 14px;}
.mm-kb .mmf-mcell{aspect-ratio:1;border-radius:3px;background:rgba(127,127,127,.16);transition:.2s;}
.mm-kb .mmf-hier{display:grid;gap:11px;margin:8px 0 4px;}
.mm-kb .mmf-hrow{cursor:pointer;border:1px solid rgba(127,127,127,.22);border-radius:11px;padding:12px 15px;background:rgba(127,127,127,.04);transition:.15s;}
.mm-kb .mmf-hrow:hover,.mm-kb .mmf-hrow.sel{border-color:var(--mm-accent);box-shadow:0 0 0 2px rgba(0,200,83,.12);}
.mm-kb .mmf-hrow .mmf-htop{display:flex;justify-content:space-between;align-items:baseline;font-weight:800;font-size:1.08rem;margin-bottom:8px;}
.mm-kb .mmf-hrow .mmf-htop small{font-weight:600;opacity:.7;font-size:.92rem;}
.mm-kb .mmf-hbar{height:12px;border-radius:6px;background:var(--c,#888);}
.mm-kb .mmf-btn{font:inherit;font-size:1.02rem;font-weight:700;border:1px solid var(--mm-accent);background:rgba(0,200,83,.1);color:inherit;padding:9px 16px;border-radius:9px;cursor:pointer;transition:.15s;}
.mm-kb .mmf-btn:hover{background:rgba(0,200,83,.2);}
.mm-kb .mmf-btn:disabled{opacity:.4;cursor:not-allowed;}</style><div class="mm-kb not-prose" id="mm-kb"><div id="mm-home"><p class="mm-intro">Everything about how large language models use, store, and optimize memory — from the bytes that hold model weights on a GPU to how an agent remembers a conversation across sessions. Each topic is tagged by <em>when</em> the memory is consumed. Pick a card to open the article.</p><div class="mm-legend"><span><i style="background:#5b8cff"/>Static — set before you run</span><span><i style="background:#3ecf8e"/>Runtime — scales with workload</span><span><i style="background:#f5a623"/>Training — only during training</span><span><i style="background:#9aa4b2"/>Overview</span><span><i style="background:#7c5cff"/>App-level memory</span></div><div class="mm-domain"><span class="mm-dot" style="background:#5b8cff"/><h2>System &amp; Runtime Memory</h2><span class="mm-tag">How the model physically uses hardware</span></div><div class="mm-grid" id="mm-grid-system"/><div class="mm-domain"><span class="mm-dot" style="background:#7c5cff"/><h2>Agent &amp; Long-Term Memory</h2><span class="mm-tag">How the model "remembers" across turns &amp; sessions</span></div><div class="mm-grid" id="mm-grid-agent"/></div><div id="mm-article-container"/></div><script>
(function(){
var PHASE={static:"Static",runtime:"Runtime",training:"Training",overview:"Overview",app:"App-level"};
var TOPICS=[
{id:1,domain:"system",num:"01",phase:"overview",img:"/images/memory-llms/card-01.png",title:"Fundamentals of LLM Memory",summary:"The mental model: what consumes memory in an LLM and the core trade-offs.",chips:["Weights","Activations","KV cache","Compute trade-off"],lead:"Before optimizing anything, you need a clear picture of what actually lives in memory when a model runs — and how that differs between training and inference.",sections:[
{h:"Types of memory in an LLM",li:["<b>Model weights</b> — the learned parameters; the largest fixed cost.","<b>Activations</b> — intermediate tensors produced during a forward pass.","<b>KV cache</b> — cached keys/values that grow with generated tokens (inference).","<b>Optimizer states &amp; gradients</b> — only during training."]},
{h:"Memory vs compute trade-offs",li:["Recomputation trades compute for memory (activation checkpointing).","Caching trades memory for compute (KV cache, prefix caching).","Memory <b>bandwidth</b> vs <b>capacity</b> — decode is usually bandwidth-bound."]},
{h:"Training vs inference memory",li:["Training holds weights + gradients + optimizer states + activations.","Inference holds weights + a much smaller activation set + KV cache.","Rule of thumb: training needs several× the memory of inference."]}],body:"fundamentals"},
{id:2,domain:"system",num:"02",phase:"static",img:"/images/memory-llms/card-02.png",title:"Model Weights Memory",summary:"How parameter count and numeric precision determine the model's fixed footprint.",chips:["Parameters","FP16/BF16","INT8/INT4","Quantization","LoRA/QLoRA"],lead:"Weights are the baseline memory cost — fixed by how many parameters the model has and how many bytes each one takes.",sections:[
{h:"Parameter count &amp; footprint",li:["Footprint ≈ params × bytes-per-param. A 7B model in FP16 ≈ 14 GB.","Embedding/vocab and LM-head matrices can be a non-trivial slice."]},
{h:"Data types &amp; precision",li:["<b>FP32</b> (4 bytes), <b>FP16/BF16</b> (2 bytes), <b>INT8</b> (1 byte), <b>INT4</b> (~0.5 byte).","BF16 vs FP16: same size, different exponent/mantissa trade-off.","Lower precision cuts memory and bandwidth but can cost accuracy."]},
{h:"Quantization basics &amp; impact",li:["Post-training quantization (PTQ) vs quantization-aware training (QAT).","Schemes: GPTQ, AWQ, GGUF/llama.cpp, bitsandbytes.","Accuracy vs size vs speed; per-channel / group scaling."]},
{h:"Memory-efficient fine-tuning",li:["<b>LoRA / PEFT</b> — train small adapters instead of full weights.","<b>QLoRA</b> — fine-tune on top of a 4-bit quantized base model."]}],body:"weights"},
{id:3,domain:"system",num:"03",phase:"runtime",img:"/images/memory-llms/card-03.png",title:"KV Cache &amp; Context Management",summary:"Why the KV cache exists, how it scales with context, and how to shrink it.",chips:["KV cache","Context length","GQA/MQA","MLA","Window limits"],lead:"The KV cache is what makes generation fast — and what makes long context expensive. It often dominates inference memory.",sections:[
{h:"What the KV cache is &amp; why it matters",li:["Stores attention keys/values so past tokens aren't recomputed each step.","Without it, every new token would re-attend over the whole sequence."]},
{h:"How it scales",li:["Size ≈ 2 × layers × heads × head_dim × seq_len × batch × bytes.","Grows linearly with context length and batch size — the long-context tax."]},
{h:"Reducing the KV cache",li:["<b>MQA / GQA</b> — share K/V across heads to cut cache size dramatically.","<b>MLA</b> (Multi-head Latent Attention, DeepSeek) — compress KV into a latent.","KV-cache quantization (store K/V in INT8/INT4)."]},
{h:"Context window limitations",li:["Practical limits driven by KV memory and attention cost, not just training.","Position encodings (RoPE scaling, ALiBi) and their effect on length."]}],body:"kv"},
{id:4,domain:"system",num:"04",phase:"runtime",img:"/images/memory-llms/card-04.png",video:"/images/memory-llms/card-04.mp4",title:"Attention Mechanisms &amp; Memory Efficiency",summary:"The memory cost of attention and the algorithms that tame it.",chips:["O(n²)","FlashAttention","Sparse","Sliding window"],lead:"Naive attention materializes a sequence-length-squared matrix. The fixes are some of the most important systems work in modern LLMs.",sections:[
{h:"Standard attention memory cost",li:["The score matrix is O(n²) in sequence length — the core bottleneck.","Memory, not just FLOPs, is what blows up at long context."]},
{h:"FlashAttention &amp; IO-aware methods",li:["Tiles the computation so the full n×n matrix is never materialized.","Cuts memory to O(n) and improves speed via better HBM/SRAM use.","Versions and what changed (FA-2, FA-3)."]},
{h:"Sparse &amp; windowed attention",li:["<b>Sliding-window</b> attention (Mistral) bounds the attended span.","Sparse / block-sparse patterns; local + global token schemes.","Trade-offs between coverage and memory savings."]}],body:"attention"},
{id:5,domain:"system",num:"05",phase:"runtime",img:"/images/memory-llms/card-05.png",video:"/images/memory-llms/card-05.mp4",title:"Inference Memory Optimization",summary:"Serving-time techniques that pack more requests into the same GPU.",chips:["PagedAttention","Continuous batching","Prefix cache","Speculative decoding"],lead:"Production inference is largely a memory-management problem. These techniques are why modern serving engines achieve high throughput.",sections:[
{h:"Paging &amp; batching",li:["<b>PagedAttention</b> (vLLM) — page the KV cache to kill fragmentation.","<b>Continuous batching</b> — add/remove requests mid-flight for high utilization."]},
{h:"Caching across requests",li:["<b>Prefix / prompt caching</b> — reuse KV for shared prefixes (system prompts, RAG).","Cache hit-rate economics and eviction policy."]},
{h:"KV compression &amp; eviction",li:["Token eviction (H2O, attention-sink / StreamingLLM ideas).","Quantized KV; selective retention of important tokens."]},
{h:"Other techniques",li:["<b>Speculative decoding</b> — draft model + verify; memory implications.","<b>Disaggregated serving</b> — separate prefill and decode across GPUs."]}],body:"inference"},
{id:6,domain:"system",num:"06",phase:"training",img:"/images/memory-llms/card-06.png",video:"/images/memory-llms/card-06.mp4",title:"Training-Time Memory",summary:"Where memory actually goes during training — and how to fit bigger models.",chips:["Optimizer states","Gradients","Checkpointing","ZeRO/FSDP"],lead:"Training memory is dominated by things inference never sees. This is the part most people's mental model is missing.",sections:[
{h:"What consumes training memory",li:["<b>Optimizer states</b> — Adam stores 2 extra tensors per param (~2× weights).","<b>Gradients</b> — one per parameter.","<b>Activations</b> — scale with batch size, sequence length, and depth."]},
{h:"Reducing activation memory",li:["<b>Activation checkpointing / recomputation</b> — recompute instead of store.","Mixed-precision training (FP16/BF16) with FP32 master weights."]},
{h:"Sharding &amp; distribution",li:["<b>ZeRO / FSDP</b> — shard optimizer states, gradients, and params across devices.","Offloading optimizer states to CPU/NVMe (ZeRO-Offload/Infinity).","Gradient accumulation to emulate larger batches within memory limits."]}],body:"training"},
{id:7,domain:"system",num:"07",phase:"static",img:"/images/memory-llms/card-07.png",title:"Hardware &amp; System-Level Memory",summary:"The physical memory hierarchy and how models are spread across it.",chips:["HBM/VRAM","Bandwidth","Unified memory","Parallelism","Offloading"],lead:"All of the above ultimately lands on real silicon. Understanding the hierarchy explains why certain optimizations matter.",sections:[
{h:"The memory hierarchy",li:["<b>HBM / VRAM</b> vs on-chip SRAM vs system RAM vs disk.","<b>Bandwidth vs capacity</b> — decode is typically bandwidth-bound.","Why FlashAttention's IO-awareness pays off here."]},
{h:"Unified &amp; alternative memory",li:["Apple Silicon unified memory — CPU/GPU share one pool.","Implications for running large models on consumer hardware."]},
{h:"Distribution &amp; offloading",li:["Tensor / pipeline / data parallelism and their memory profiles.","<b>CPU offloading</b> and <b>disk offloading</b> for oversized models.","PCIe/NVLink transfer costs as the hidden bottleneck."]}],body:"hardware"},
{id:8,domain:"agent",num:"08",phase:"app",img:"/images/memory-llms/card-08.png",video:"/images/memory-llms/card-08.mp4",title:"Agent &amp; Long-Term Memory Systems",summary:"How an LLM application remembers beyond a single context window.",chips:["Working memory","RAG","Vector stores","Summarization","Episodic/semantic"],lead:"A different sense of 'memory': not bytes on a GPU, but how an app retains and recalls information across turns and sessions.",sections:[
{h:"The context window as working memory",li:["Short-term memory = whatever fits in the current context window.","Limits, cost, and the 'lost in the middle' recall problem."]},
{h:"External / long-term memory",li:["<b>RAG</b> + vector databases as retrievable long-term storage.","Episodic (events) vs semantic (facts) vs procedural memory.","Memory writes: what to store, when, and how to index it."]},
{h:"Managing growing memory",li:["Conversation <b>summarization / compaction</b> to stay within context.","Eviction and relevance ranking of stored memories.","Frameworks &amp; patterns (memory stores, scratchpads, profiles)."]},
{h:"Multimodal memory",li:["Storing and recalling images/audio alongside text.","Embedding and retrieval across modalities."]}],body:"agent"},
{id:9,domain:"agent",num:"09",phase:"app",img:"/images/memory-llms/card-09.png",title:"Advanced &amp; Emerging Architectures",summary:"Architectures and techniques rethinking memory from the ground up.",chips:["Mamba","RWKV","Long context","SSMs"],lead:"Beyond transformers: approaches that change the fundamental memory–sequence relationship.",sections:[
{h:"Memory-efficient architectures",li:["<b>State-space models</b> (Mamba) — constant-size state instead of KV cache.","<b>RWKV</b> — RNN-style recurrence with transformer-level quality.","Linear / sub-quadratic attention variants."]},
{h:"Long-context techniques",li:["Context extension (RoPE scaling, position interpolation).","Retrieval-augmented and memory-augmented long context.","Hybrid architectures mixing attention and recurrence."]}],body:"advanced"}
];
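/* Illustrative sketch only; nothing below is called by the page. It spells out
   the rule quoted in topics 01 and 02 above (footprint ≈ params × bytes-per-param).
   Assumptions: the names MMF_SKETCH_BYTES and mmfSketchWeightsGB are hypothetical,
   and GB means decimal gigabytes (1e9 bytes), which is what makes 7B × 2 bytes = 14 GB. */
var MMF_SKETCH_BYTES={FP32:4,FP16:2,BF16:2,INT8:1,INT4:0.5};
function mmfSketchWeightsGB(paramsBillions,precision){
var bytes=MMF_SKETCH_BYTES[precision];
if(bytes===undefined)throw new Error("unknown precision: "+precision);
// 1e9 params per "B" and 1e9 bytes per GB cancel out, so GB = billions × bytes-per-param.
return paramsBillions*bytes;
}
/* Usage: mmfSketchWeightsGB(7,"FP16") returns 14; mmfSketchWeightsGB(7,"INT4") returns 3.5. */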
/* ===== rich, written-out bodies for selected topics ===== */
var BODIES={
fundamentals:function(){return `<div class="mm-sec"><h3>Types of memory in an LLM</h3><p>Think of GPU memory as a desk. Before the model can do any useful work, several different things have to fit on that desk <em>at the same time</em>. Almost every optimization you'll meet later is just a way to shrink one of these piles or remove it entirely — so it pays to know what they are.</p><p>There are four buckets. Two are always there, one shows up only when you're generating text, and one only when you're training. Toggle the mode below and tap each block to see what it holds.</p><div class="mmf-w" id="mmf-map"><h4>What's on the GPU right now</h4><p class="mmf-sub">Switch the mode, then tap a block. Bar length shows roughly how big each pile tends to be.</p><div class="mmf-switch" data-role="mode"><button data-mode="inference" class="on">Inference</button><button data-mode="training">Training</button></div><div class="mmf-blocks"><div class="mmf-block" data-k="weights" style="--c:#5b8cff"><span class="mmf-bt">Weights</span><div class="mmf-bar"></div><span class="mmf-tag">always present</span></div><div class="mmf-block" data-k="act" style="--c:#3ecf8e"><span class="mmf-bt">Activations</span><div class="mmf-bar"></div><span class="mmf-tag">both modes</span></div><div class="mmf-block" data-k="kv" style="--c:#00c853"><span class="mmf-bt">KV cache</span><div class="mmf-bar"></div><span class="mmf-tag">inference only</span></div><div class="mmf-block" data-k="opt" style="--c:#f5a623"><span class="mmf-bt">Optimizer + grads</span><div class="mmf-bar"></div><span class="mmf-tag">training only</span></div></div><div class="mmf-detail">Tap a block above to see what it stores.</div></div></div><div class="mm-sec"><h3>Memory vs compute trade-offs</h3><p>You rarely get memory savings for free. Most techniques are a trade against <em>compute</em> (time): you decide which resource you'd rather spend. 
Two moves show up over and over.</p><p><b>Recomputation</b> throws activations away and rebuilds them later when they're needed — saving memory but paying extra compute (this is "activation checkpointing"). <b>Caching</b> is the opposite: keep results around so you never redo them — saving compute but spending memory (the KV cache and prefix caching both do this).</p><p>There's a third axis people forget: <b>bandwidth vs capacity</b>. Capacity is how much fits on the desk; bandwidth is how fast you can move things on and off it. When a model generates text one token at a time, the GPU often sits idle waiting to <em>read the weights</em> from memory — so decoding is usually bandwidth-bound, not compute-bound.</p><div class="mmf-w" id="mmf-trade"><h4>Spend memory or spend time?</h4><p class="mmf-sub">Drag the slider to choose how much you store vs recompute.</p><input type="range" class="mmf-slider" min="0" max="100" value="50"><div class="mmf-ends"><span>◀ Store / cache everything</span><span>Recompute everything ▶</span></div><div class="mmf-meters"><div class="mmf-meter"><div class="mmf-ml"><span>Memory used</span><span class="mmf-mv" data-m="mem"></span></div><div class="mmf-track"><div class="mmf-fill" data-f="mem" style="background:#5b8cff"></div></div></div><div class="mmf-meter"><div class="mmf-ml"><span>Compute / latency</span><span class="mmf-mv" data-m="cpu"></span></div><div class="mmf-track"><div class="mmf-fill" data-f="cpu" style="background:#f5a623"></div></div></div></div><div class="mmf-readout"></div></div></div><div class="mm-sec"><h3>Training vs inference memory</h3><p>The same model has a very different appetite depending on what you're doing with it. <b>Training</b> has to hold everything at once: the weights, a gradient for every weight, the optimizer's running averages, and all the activations needed to run backpropagation. <b>Inference</b> drops the gradients and optimizer entirely, keeps only a slim set of activations, and adds the KV cache.</p><p>The rule of thumb: 
training a model takes <em>several times</em> the memory of simply running it. That gap is exactly why you can comfortably run a model on hardware that could never have trained it. Move the slider and switch the inference precision to feel the difference.</p><div class="mmf-w" id="mmf-est"><h4>Training vs inference memory estimator</h4><p class="mmf-sub">Rough rule-of-thumb breakdown — orders of magnitude, not exact figures.</p><div class="mmf-ctl"><div style="flex:1 1 240px;"><label>Model size — <span class="mmf-pval">7B</span> parameters</label><input type="range" class="mmf-slider" id="mmf-params" min="1" max="70" value="7"/></div><div class="mmf-prec"><label>Inference precision</label><button data-b="4">FP32</button><button data-b="2" class="on">FP16</button><button data-b="1">INT8</button><button data-b="0.5">INT4</button></div></div><div class="mmf-est-bars"><div class="mmf-eb" data-row="train"><div class="mmf-ebl"><span>Training <small>(FP16 mixed precision)</small></span><span class="mmf-tot"></span></div><div class="mmf-stack"></div></div><div class="mmf-eb" data-row="inf"><div class="mmf-ebl"><span>Inference</span><span class="mmf-tot"></span></div><div class="mmf-stack"></div></div></div><div class="mmf-legend"><span><i style="background:#5b8cff"></i>Weights</span><span><i style="background:#f5a623"></i>Optimizer + grads</span><span><i style="background:#3ecf8e"></i>Activations</span><span><i style="background:#00c853"></i>KV cache</span></div><div class="mmf-ratio"></div><p class="mmf-cap">Approximate, per common rules of thumb: weights = params × bytes-per-param; Adam optimizer + gradients ≈ 14 bytes/param; activations and KV cache depend on batch size and context length and are shown as representative values.</p></div></div>`;},
weights:function(){return `<div class="mm-sec"><h3>Parameter count &amp; footprint</h3><p>A model's weights are simply a giant pile of numbers — the parameters it learned during training. The memory they take is pleasantly easy to estimate: <span class="mmf-kbd">footprint ≈ parameters × bytes per parameter</span>. Nothing else about the architecture changes this number. A 7-billion-parameter model at 2 bytes each is about 14&nbsp;GB.</p><p>For smaller models the embedding and output (vocabulary) matrices can be a surprisingly large slice, because they scale with vocabulary size rather than with depth.</p></div><div class="mm-sec"><h3>Data types &amp; precision</h3><p>How many bytes each parameter takes depends on its numeric type: <b>FP32</b> uses 4 bytes, <b>FP16/BF16</b> use 2, <b>INT8</b> uses 1, and <b>INT4</b> about half a byte. BF16 and FP16 are the same size but split their bits differently — BF16 keeps FP32's wide range (good for training stability) at the cost of precision.</p><p>Halving the bytes halves both the memory and the bandwidth needed to read the weights, which is why lower precision usually runs faster too. Try it below.</p><div class="mmf-w" id="mmf-wt"><h4>Weights footprint calculator</h4><p class="mmf-sub">Pick a size and precision — see how many GB the raw weights need.</p><div class="mmf-ctl"><div style="flex:1 1 240px;"><label>Parameters — <span class="mmf-pval">7B</span></label><input type="range" class="mmf-slider" id="mmf-wparams" min="1" max="180" value="7"/></div><div class="mmf-prec"><label>Precision</label><button data-b="4">FP32</button><button data-b="2" class="on">FP16</button><button data-b="1">INT8</button><button data-b="0.5">INT4</button></div></div><div class="mmf-stat" data-s="gb"></div><div class="mmf-fitline" data-s="fit"></div><div class="mmf-est-bars" data-s="cmp"></div><p class="mmf-cap">footprint = parameters × bytes-per-param. 
FP32 = 4 bytes, FP16/BF16 = 2 bytes, INT8 = 1 byte, INT4 ≈ 0.5 bytes.</p></div></div><div class="mm-sec"><h3>Quantization basics &amp; impact</h3><p>Quantization is the art of squeezing weights into fewer bits. <b>Post-training quantization (PTQ)</b> compresses an already-trained model; <b>quantization-aware training (QAT)</b> trains with the rounding in mind for better accuracy. Popular schemes — GPTQ, AWQ, GGUF/llama.cpp, bitsandbytes — differ mainly in how they choose scaling factors (per-channel or per-group) to limit the accuracy hit.</p><p>The trade-off is always the same triangle: <b>size vs accuracy vs speed</b>. Going from FP16 to INT4 roughly quarters the footprint — letting a model that needed a data-center GPU run on a laptop — usually with only a small quality drop.</p></div><div class="mm-sec"><h3>Memory-efficient fine-tuning</h3><p>You don't have to retrain all the weights to adapt a model. <b>LoRA / PEFT</b> freezes the big base model and trains tiny add-on matrices, so you store and update a few million parameters instead of billions. <b>QLoRA</b> goes further: it keeps the base model quantized to 4-bit and trains the LoRA adapters on top — letting you fine-tune large models on a single consumer GPU.</p></div>`;},
kv:function(){return `<div class="mm-sec"><h3>What the KV cache is &amp; why it matters</h3><p>When a model generates text, each new token must "look back" at every previous token through attention. Re-deriving the keys and values for all those past tokens at every step would be hugely wasteful, so the model caches them — that's the <b>KV cache</b>. It's what makes generation fast: each new token computes only its own key and value and reuses the rest.</p></div><div class="mm-sec"><h3>How it scales</h3><p>The catch is that the cache grows with every token. Its size is roughly <span class="mmf-kbd">2 × layers × heads × head_dim × tokens × batch × bytes</span> (the 2 is for keys and values). It grows <b>linearly</b> with context length and batch size — double the conversation and you double the cache. At long context this often dwarfs the weights themselves: the long-context tax.</p><div class="mmf-w" id="mmf-kv"><h4>KV cache size — watch it grow</h4><p class="mmf-sub">Based on a 7B-class model: 32 layers, 32 query heads, head dim 128, FP16.</p><div class="mmf-ctl"><div style="flex:1 1 220px;"><label>Context length — <span class="mmf-pval" data-p="ctx">8K</span> tokens</label><input type="range" class="mmf-slider" id="mmf-ctx" min="1" max="128" value="8"/></div><div style="flex:1 1 150px;"><label>Batch — <span class="mmf-pval" data-p="bs">1</span></label><input type="range" class="mmf-slider" id="mmf-bs" min="1" max="32" value="1"/></div></div><div class="mmf-prec" data-role="attn" style="margin-bottom:6px;"><label>Attention type (K/V heads)</label><button data-h="32">MHA · 32</button><button data-h="8" class="on">GQA · 8</button><button data-h="1">MQA · 1</button></div><div class="mmf-stat" data-s="gb"></div><div class="mmf-readout" data-s="cmp"></div><p class="mmf-cap">cache = 2 × layers × kv-heads × head_dim × tokens × batch × 2 bytes. 
GQA/MQA shrink it by sharing K/V across heads.</p></div></div><div class="mm-sec"><h3>Reducing the KV cache</h3><p>Most savings come from making heads share their keys and values. <b>Multi-Query Attention (MQA)</b> uses one shared K/V for all heads; <b>Grouped-Query Attention (GQA)</b> is the popular middle ground, sharing across small groups. DeepSeek's <b>Multi-head Latent Attention (MLA)</b> compresses K/V into a small latent vector. You can also <b>quantize the cache</b> to INT8/INT4. Flip the attention type above to feel the difference.</p></div><div class="mm-sec"><h3>Context window limitations</h3><p>Context limits are usually set by this memory cost and attention's compute, not by anything fundamental from training. Position-encoding tricks like <b>RoPE scaling</b> and <b>ALiBi</b> let models stretch to longer contexts than they were trained on — with some quality cost.</p></div>`;},
attention:function(){return `<div class="mm-sec"><h3>Standard attention memory cost</h3><p>Attention compares every token with every other token. That comparison produces a score matrix of size <b>sequence × sequence</b> — so its memory grows with the <em>square</em> of the sequence length. Double the context and the matrix quadruples. For long sequences it's the memory of this matrix, not the raw compute, that blows up first.</p><div class="mmf-w" id="mmf-attn"><h4>Why long context explodes: n² vs n</h4><p class="mmf-sub">Slide the sequence length and watch the attention matrix grow.</p><div class="mmf-ctl"><div style="flex:1 1 100%;"><label>Sequence length — <span class="mmf-pval">16</span> tokens</label><input type="range" class="mmf-slider" id="mmf-n" min="4" max="40" value="16"/></div></div><div class="mmf-matrix" data-s="grid"></div><div class="mmf-est-bars"><div class="mmf-eb"><div class="mmf-ebl"><span>Standard attention <small>O(n²)</small></span><span data-s="std"></span></div><div class="mmf-stack"><div class="mmf-seg" data-f="std" style="background:#f5a623"></div></div></div><div class="mmf-eb"><div class="mmf-ebl"><span>FlashAttention <small>O(n)</small></span><span data-s="fa"></span></div><div class="mmf-stack"><div class="mmf-seg" data-f="fa" style="background:#00c853"></div></div></div></div><div class="mmf-readout" data-s="note"></div></div></div><div class="mm-sec"><h3>FlashAttention &amp; IO-aware methods</h3><p><b>FlashAttention</b> is the key fix. Instead of building the whole n×n matrix in slow GPU memory, it processes attention in small <b>tiles</b> that fit in fast on-chip SRAM, computing the result without ever materializing the full matrix. This drops memory from O(n²) to <b>O(n)</b> and runs faster by moving less data. 
Later versions (FA-2, FA-3) tuned the GPU work further.</p></div><div class="mm-sec"><h3>Sparse &amp; windowed attention</h3><p>The other approach is to simply not attend to everything. <b>Sliding-window</b> attention (used by Mistral) limits each token to a fixed nearby window. <b>Sparse</b> and block-sparse patterns mix local attention with a few global tokens. You trade some ability to connect very distant tokens for large memory savings.</p></div>`;},
inference:function(){return `<div class="mm-sec"><h3>Paging &amp; batching</h3><p>Production serving is mostly a memory-management problem. <b>PagedAttention</b> (from vLLM) treats the KV cache the way an operating system treats RAM: it splits memory into fixed-size pages so requests of different lengths pack together without leaving wasted gaps (fragmentation). <b>Continuous batching</b> adds and removes requests from the running batch on the fly instead of waiting for a whole batch to finish, keeping the GPU busy.</p><div class="mmf-w" id="mmf-pag"><h4>PagedAttention: stop wasting GPU memory</h4><p class="mmf-sub">Toggle paging and watch fragmentation disappear. Each colour is one request.</p><div class="mmf-switch" data-role="pg"><button data-pg="0" class="on">No paging</button><button data-pg="1">PagedAttention</button></div><div class="mmf-memgrid" data-s="grid"></div><div class="mmf-readout" data-s="note"></div></div></div><div class="mm-sec"><h3>Caching across requests</h3><p>Different requests often share a prefix — the same long system prompt, or the same retrieved documents in RAG. <b>Prefix / prompt caching</b> stores the KV for that shared part once and reuses it across requests, so you pay to process it only the first time. Hit rate and a sensible eviction policy decide how much you save.</p></div><div class="mm-sec"><h3>KV compression &amp; eviction</h3><p>You can also shrink the cache while serving: <b>evict</b> tokens that no longer matter (H2O and the attention-sink / StreamingLLM idea keep just the important and most-recent tokens), or store the cache in <b>INT8/INT4</b>.</p></div><div class="mm-sec"><h3>Other techniques</h3><p><b>Speculative decoding</b> uses a small draft model to guess several tokens that the big model then verifies in one pass. <b>Disaggregated serving</b> splits the prefill and decode phases onto different GPUs so each runs on hardware suited to it.</p></div>`;},
training:function(){return `<div class="mm-sec"><h3>What consumes training memory</h3><p>Training memory is dominated by things inference never sees. For every parameter you keep three things: the <b>weight</b> itself, its <b>gradient</b>, and the optimizer's bookkeeping. Adam stores two extra values per parameter (a running mean and variance), and mixed-precision training also keeps an FP32 master copy — together often <b>~12–16 bytes per parameter</b>, several times the weights alone.</p><p>On top of that sit the <b>activations</b>, which scale with batch size, sequence length, and depth. Use the widget below to see how sharding and checkpointing claw the memory back.</p><div class="mmf-w" id="mmf-zero"><h4>ZeRO / FSDP: fitting a model across GPUs</h4><p class="mmf-sub">7B model, FP16 mixed precision. Watch per-GPU memory shrink as you shard.</p><div class="mmf-prec" data-role="stage" style="margin-bottom:12px;"><label>ZeRO stage</label><button data-z="0" class="on">0 · none</button><button data-z="1">1 · optim</button><button data-z="2">2 · +grads</button><button data-z="3">3 · +params</button></div><div class="mmf-ctl"><div style="flex:1 1 240px;"><label>GPUs — <span class="mmf-pval">8</span></label><input type="range" class="mmf-slider" id="mmf-gpus" min="1" max="16" value="8"/></div><div><label>Activation checkpointing</label><div class="mmf-switch" data-role="ckpt"><button data-ck="0" class="on">Off</button><button data-ck="1">On</button></div></div></div><div class="mmf-est-bars"><div class="mmf-eb"><div class="mmf-ebl"><span>Memory per GPU</span><span class="mmf-tot"></span></div><div class="mmf-stack"></div></div></div><div class="mmf-legend"><span><i style="background:#5b8cff"></i>Weights</span><span><i style="background:#f5a623"></i>Optimizer</span><span><i style="background:#e0556b"></i>Gradients</span><span><i style="background:#3ecf8e"></i>Activations</span></div><div class="mmf-ratio"></div></div></div><div class="mm-sec"><h3>Reducing activation memory</h3><p>The biggest lever on 
activation memory is <b>recomputation</b>, a.k.a. activation checkpointing: instead of storing every layer's activations for the backward pass, you keep a few checkpoints and recompute the rest on demand — trading extra compute for large memory savings. <b>Mixed-precision</b> (FP16/BF16) with an FP32 master copy also keeps activations and gradients small.</p></div><div class="mm-sec"><h3>Sharding &amp; distribution</h3><p>To train models too big for one GPU, you <b>shard</b>. <b>ZeRO</b> (DeepSpeed) and <b>FSDP</b> (PyTorch) split the optimizer states, gradients, and even the parameters across GPUs so no single device holds the whole thing. You can <b>offload</b> optimizer state to CPU or NVMe (ZeRO-Offload / Infinity), and use <b>gradient accumulation</b> to simulate big batches within a tight memory budget.</p></div>`;},
hardware:function(){return `<div class="mm-sec"><h3>The memory hierarchy</h3><p>Every byte we've discussed ultimately lives on real silicon, arranged in a hierarchy: tiny, blazing-fast on-chip <b>SRAM</b>; the GPU's <b>HBM/VRAM</b> (tens of GB, very fast); <b>system RAM</b> (hundreds of GB, slower); and <b>disk/SSD</b> (huge, slow). The higher you go, the faster and smaller; the lower, the bigger and slower. Two numbers matter at each level — <b>capacity</b> (how much fits) and <b>bandwidth</b> (how fast you can move it).</p><div class="mmf-w" id="mmf-hier"><h4>The memory hierarchy — speed vs size</h4><p class="mmf-sub">Tap a tier. Bar length shows relative bandwidth (how fast); the label shows capacity (how much).</p><div class="mmf-hier" data-s="rows"></div><div class="mmf-detail">Tap a tier to see its trade-off.</div></div></div><div class="mm-sec"><h3>Unified &amp; alternative memory</h3><p>During token-by-token decoding the GPU often finishes its math and then waits to read weights from HBM — it's <b>bandwidth-bound</b>, not compute-bound. That's exactly why FlashAttention's trick of keeping data in fast SRAM pays off. Some systems blur the levels: <b>Apple Silicon</b> uses unified memory, where CPU and GPU share one pool — letting consumer machines run surprisingly large models.</p></div><div class="mm-sec"><h3>Distribution &amp; offloading</h3><p>When a model still doesn't fit, you can <b>offload</b> parts to CPU RAM or disk, or spread it across GPUs with <b>tensor</b>, <b>pipeline</b>, or <b>data</b> parallelism — each with a different memory profile (plain data parallelism replicates the full model on every GPU, so it is tensor and pipeline parallelism that actually split the weights). The catch: the <b>PCIe/NVLink</b> links between devices then become the hidden bottleneck, since moving data between them is far slower than reading local memory.</p></div>`;},
agent:function(){return `<div class="mm-sec"><h3>The context window as working memory</h3><p>Here "memory" means something different — not bytes on a GPU, but what an application remembers across turns. The simplest memory is the <b>context window</b> itself: whatever you can fit in the prompt is the model's short-term, working memory. It's fast and direct, but limited in size, costs money per token, and suffers the <b>"lost in the middle"</b> problem where facts buried in a long context get overlooked.</p><div class="mmf-w" id="mmf-ctxwin"><h4>Working memory fills up — manage it</h4><p class="mmf-sub">Add turns until the window fills, then try summarizing. Limit: 8,000 tokens.</p><div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:14px;"><button class="mmf-btn" data-a="add">+ Add a turn</button><button class="mmf-btn" data-a="sum">Summarize old turns</button><button class="mmf-btn" data-a="reset">Reset</button></div><div style="display:flex;justify-content:space-between;font-size:1.05rem;margin-bottom:6px;opacity:.85;"><span data-s="lbl">0 / 8000 tokens</span><span data-s="turns">0 turns</span></div><div class="mmf-track" style="height:24px;"><div class="mmf-fill" data-f="ctx" style="background:#7c5cff;width:0%"></div></div><div class="mmf-readout" data-s="note">Empty context. Each turn adds about 800 tokens.</div></div></div><div class="mm-sec"><h3>External / long-term memory</h3><p>For anything longer-lived, you store information outside the model and retrieve it when needed — usually with <b>RAG</b> and a vector database. It helps to think in human terms: <b>episodic</b> memory (what happened), <b>semantic</b> memory (facts), and <b>procedural</b> memory (how to do things). 
The hard design questions are what to write down, when, and how to index it for retrieval.</p></div><div class="mm-sec"><h3>Managing growing memory</h3><p>Because conversations grow without bound, you must manage what stays in context. <b>Summarization / compaction</b> folds old turns into a short recap; <b>relevance ranking</b> and <b>eviction</b> keep only what matters. Common patterns include memory stores, scratchpads, and user profiles.</p></div><div class="mm-sec"><h3>Multimodal memory</h3><p>Memory isn't only text. <b>Multimodal memory</b> stores and recalls images and audio alongside words, embedding everything into a shared space so a single query can retrieve across modalities.</p></div>`;},
advanced:function(){return `<div class="mm-sec"><h3>Memory-efficient architectures</h3><p>Transformers pay for their power with a KV cache that grows with every token. A new family of architectures avoids that. <b>State-space models</b> like <b>Mamba</b> carry a fixed-size "state" that summarizes everything so far — so memory stays <b>constant</b> no matter how long the sequence, instead of growing linearly. <b>RWKV</b> blends RNN-style recurrence with transformer-level quality, also keeping memory flat. Various linear / sub-quadratic attention variants chase the same goal.</p><div class="mmf-w" id="mmf-arch"><h4>Growing cache vs constant state</h4><p class="mmf-sub">Slide the sequence length. The transformer's memory climbs; an SSM stays flat.</p><div class="mmf-ctl"><div style="flex:1 1 100%;"><label>Sequence length — <span class="mmf-pval">8K</span> tokens</label><input type="range" class="mmf-slider" id="mmf-seq" min="1" max="256" value="8"/></div></div><div class="mmf-est-bars"><div class="mmf-eb"><div class="mmf-ebl"><span>Transformer <small>KV cache, grows with length</small></span><span data-s="tf"></span></div><div class="mmf-stack"><div class="mmf-seg" data-f="tf" style="background:#f5a623"></div></div></div><div class="mmf-eb"><div class="mmf-ebl"><span>State-space / Mamba <small>constant state</small></span><span data-s="ssm"></span></div><div class="mmf-stack"><div class="mmf-seg" data-f="ssm" style="background:#00c853"></div></div></div></div><div class="mmf-readout" data-s="note"></div></div></div><div class="mm-sec"><h3>Long-context techniques</h3><p>Even within transformers, several techniques stretch usable context: <b>RoPE scaling</b> and <b>position interpolation</b> extend a model past its trained length; <b>retrieval-</b> and <b>memory-augmented</b> methods pull in only the relevant slices instead of attending to everything; and <b>hybrid</b> architectures interleave attention layers with recurrent or state-space layers to get the best of both.</p></div>`;}
};
var INITS={
fundamentals:function(c){
var map=c.querySelector("#mmf-map");
if(map){
var DET={
weights:"Model weights — the learned parameters. Always present, training or inference. A 7B model in FP16 is about 14 GB before anything else loads.",
act:"Activations — the intermediate results from each layer during a forward pass. They scale with batch size and sequence length. Training must keep them around for the backward pass (large); inference can discard them almost immediately (small).",
kv:"KV cache — the keys and values for every token so far, saved so they aren't recomputed each step. Inference only. It starts tiny and grows with every generated token — the long-context tax.",
opt:"Optimizer states & gradients — only exist during training. Every parameter gets a gradient, and an optimizer like Adam keeps two extra running averages per parameter. Usually the single biggest chunk of training memory."
};
var SIZ={inference:{weights:72,act:22,kv:38,opt:0},training:{weights:42,act:78,kv:0,opt:92}};
var blocks=map.querySelectorAll(".mmf-block");
var det=map.querySelector(".mmf-detail");
function applyMode(m){var s=SIZ[m];blocks.forEach(function(b){var k=b.getAttribute("data-k");var bar=b.querySelector(".mmf-bar");var w=s[k];if(w<=0){b.classList.add("off");bar.style.width="0%";}else{b.classList.remove("off");bar.style.width=w+"%";}});}
map.querySelectorAll('[data-role="mode"] button').forEach(function(btn){btn.addEventListener("click",function(){map.querySelectorAll('[data-role="mode"] button').forEach(function(x){x.classList.remove("on");});btn.classList.add("on");applyMode(btn.getAttribute("data-mode"));});});
blocks.forEach(function(b){b.addEventListener("click",function(){blocks.forEach(function(x){x.classList.remove("sel");});b.classList.add("sel");det.textContent=DET[b.getAttribute("data-k")];});});
applyMode("inference");
}
var trade=c.querySelector("#mmf-trade");
if(trade){
var sl=trade.querySelector(".mmf-slider");
var fmem=trade.querySelector('[data-f="mem"]'),fcpu=trade.querySelector('[data-f="cpu"]');
var vmem=trade.querySelector('[data-m="mem"]'),vcpu=trade.querySelector('[data-m="cpu"]');
var ro=trade.querySelector(".mmf-readout");
function upd(){var v=+sl.value;var mem=Math.round(95-0.7*v),cpu=Math.round(25+0.7*v);fmem.style.width=mem+"%";fcpu.style.width=cpu+"%";vmem.textContent=mem+"% used";vcpu.textContent=cpu+"% of max";var t;if(v<33)t="<b>Store everything</b> — full KV cache, no recomputation. Fastest, but memory fills up quickly.";else if(v>66)t="<b>Recompute aggressively</b> — activation checkpointing, minimal caching. Fits in far less memory, but each step costs extra compute.";else t="<b>Balanced</b> — keep the hot data, rebuild the cheap-to-recompute parts. The usual real-world setting.";ro.innerHTML=t;}
sl.addEventListener("input",upd);upd();
}
var est=c.querySelector("#mmf-est");
if(est){
var ps=est.querySelector("#mmf-params");var pv=est.querySelector(".mmf-pval");var bytes=2;
function gb(x){return x>=10?Math.round(x)+" GB":(Math.round(x*10)/10)+" GB";}
function seg(color,val,scale){return '<div class="mmf-seg" style="background:'+color+';width:'+(val/scale*100)+'%"></div>';}
function calc(){var P=+ps.value;pv.textContent=P+"B";var tw=2*P,to=14*P,ta=2*P,tt=tw+to+ta;var iw=bytes*P,ia=0.5*P,ik=1*P,it=iw+ia+ik;var scale=tt;var tr=est.querySelector('[data-row="train"]'),inf=est.querySelector('[data-row="inf"]');tr.querySelector(".mmf-stack").innerHTML=seg("#5b8cff",tw,scale)+seg("#f5a623",to,scale)+seg("#3ecf8e",ta,scale);tr.querySelector(".mmf-tot").textContent=gb(tt);inf.querySelector(".mmf-stack").innerHTML=seg("#5b8cff",iw,scale)+seg("#3ecf8e",ia,scale)+seg("#00c853",ik,scale);inf.querySelector(".mmf-tot").textContent=gb(it);est.querySelector(".mmf-ratio").innerHTML="Training needs ≈<b>"+(Math.round(tt/it*10)/10)+"×</b> the memory of inference here.";}
ps.addEventListener("input",calc);
est.querySelectorAll(".mmf-prec button").forEach(function(b){b.addEventListener("click",function(){est.querySelectorAll(".mmf-prec button").forEach(function(x){x.classList.remove("on");});b.classList.add("on");bytes=parseFloat(b.getAttribute("data-b"));calc();});});
calc();
}
},
weights:function(c){
var w=c.querySelector("#mmf-wt");if(!w)return;
var ps=w.querySelector("#mmf-wparams"),pv=w.querySelector(".mmf-pval"),bytes=2;
var PREC=[{n:"FP32",b:4,col:"#9aa4b2"},{n:"FP16",b:2,col:"#5b8cff"},{n:"INT8",b:1,col:"#3ecf8e"},{n:"INT4",b:0.5,col:"#00c853"}];
function gb(x){return x>=10?Math.round(x)+" GB":(Math.round(x*10)/10)+" GB";}
function calc(){var P=+ps.value;pv.textContent=P+"B";var g=P*bytes;
w.querySelector('[data-s="gb"]').textContent=gb(g)+" · "+bytes+" bytes/param";
var fit=w.querySelector('[data-s="fit"]');
if(g<=24)fit.innerHTML="✅ Fits on a 24 GB GPU (e.g. RTX 4090).";
else if(g<=80)fit.innerHTML="✅ Fits on an 80 GB GPU (A100/H100), but not a 24 GB card.";
else fit.innerHTML="⚠️ Too big for a single 80 GB GPU — needs multiple GPUs or heavier quantization.";
var maxg=P*4;
w.querySelector('[data-s="cmp"]').innerHTML=PREC.map(function(p){var v=P*p.b;return '<div class="mmf-eb"><div class="mmf-ebl"><span>'+p.n+'</span><span>'+gb(v)+'</span></div><div class="mmf-stack"><div class="mmf-seg" style="background:'+p.col+';width:'+(v/maxg*100)+'%"></div></div></div>';}).join("");
}
ps.addEventListener("input",calc);
w.querySelectorAll(".mmf-prec button").forEach(function(b){b.addEventListener("click",function(){w.querySelectorAll(".mmf-prec button").forEach(function(x){x.classList.remove("on");});b.classList.add("on");bytes=parseFloat(b.getAttribute("data-b"));calc();});});
calc();
},
kv:function(c){
var w=c.querySelector("#mmf-kv");if(!w)return;
var ctx=w.querySelector("#mmf-ctx"),bs=w.querySelector("#mmf-bs"),kvh=8,layers=32,hd=128,bytes=2;
function gb(x){return x>=10?Math.round(x)+" GB":(Math.round(x*100)/100)+" GB";}
function calc(){var seqK=+ctx.value,batch=+bs.value,tokens=seqK*1024;
w.querySelector('[data-p="ctx"]').textContent=seqK+"K";w.querySelector('[data-p="bs"]').textContent=batch;
var g=(2*layers*kvh*hd*tokens*batch*bytes)/1e9;
w.querySelector('[data-s="gb"]').textContent=gb(g);
var pct=Math.round(g/14*100);
w.querySelector('[data-s="cmp"]').innerHTML="That's about <b>"+pct+"%</b> of the 14 GB the 7B weights take. "+(g>14?"The cache now exceeds the model itself.":"Stretch the context further and it overtakes the weights.");
}
ctx.addEventListener("input",calc);bs.addEventListener("input",calc);
w.querySelectorAll('[data-role="attn"] button').forEach(function(b){b.addEventListener("click",function(){w.querySelectorAll('[data-role="attn"] button').forEach(function(x){x.classList.remove("on");});b.classList.add("on");kvh=parseInt(b.getAttribute("data-h"),10);calc();});});
calc();
},
attention:function(c){
var w=c.querySelector("#mmf-attn");if(!w)return;
var n=w.querySelector("#mmf-n"),pv=w.querySelector(".mmf-pval"),grid=w.querySelector('[data-s="grid"]'),maxN=40;
function calc(){var v=+n.value;pv.textContent=v;
grid.style.gridTemplateColumns="repeat("+v+",1fr)";
grid.innerHTML=new Array(v*v).fill('<div class="mmf-cell"></div>').join("");
var std=v*v,fa=v,scale=maxN*maxN;
w.querySelector('[data-f="std"]').style.width=(std/scale*100)+"%";
w.querySelector('[data-f="fa"]').style.width=Math.max(fa/scale*100,1)+"%";
w.querySelector('[data-s="std"]').textContent=std.toLocaleString()+" cells";
w.querySelector('[data-s="fa"]').textContent="~"+fa+" units";
w.querySelector('[data-s="note"]').innerHTML="At <b>"+v+"</b> tokens the full score matrix is <b>"+std.toLocaleString()+"</b> cells. FlashAttention never stores it — it streams tiles, so its memory grows with <b>"+v+"</b>, not "+std.toLocaleString()+".";
}
n.addEventListener("input",calc);calc();
},
inference:function(c){
var w=c.querySelector("#mmf-pag");if(!w)return;
var grid=w.querySelector('[data-s="grid"]'),note=w.querySelector('[data-s="note"]'),mode=0;
var COLS=["#5b8cff","#3ecf8e","#f5a623","#7c5cff"],reqs=[18,27,9,14],slot=25;
function render(){var cells=new Array(100).fill(null);
if(mode===0){var pos=0;reqs.forEach(function(len,i){for(var j=0;j<slot;j++){if(pos<100)cells[pos]=(j<len)?{c:COLS[i]}:{w:true};pos++;}});var wasted=cells.filter(function(x){return x&&x.w;}).length;note.innerHTML="<b>No paging:</b> each request reserves a full-size slot up front. <b>"+wasted+"% wasted</b> on gaps, and the grid is full — no room for a 5th request.";}
else{var pos=0;reqs.forEach(function(len,i){for(var j=0;j<len;j++){cells[pos++]={c:COLS[i]};}});var free=cells.filter(function(x){return !x;}).length;note.innerHTML="<b>PagedAttention:</b> requests are packed into pages with almost no waste. <b>"+free+"% free</b> — enough to admit more requests on the same GPU.";}
grid.innerHTML=cells.map(function(x){var bg=x?(x.w?"repeating-linear-gradient(45deg,rgba(127,127,127,.3),rgba(127,127,127,.3) 3px,transparent 3px,transparent 6px)":x.c):"rgba(127,127,127,.16)";return '<div class="mmf-mcell" style="background:'+bg+'"></div>';}).join("");
}
w.querySelectorAll('[data-role="pg"] button').forEach(function(b){b.addEventListener("click",function(){w.querySelectorAll('[data-role="pg"] button').forEach(function(x){x.classList.remove("on");});b.classList.add("on");mode=parseInt(b.getAttribute("data-pg"),10);render();});});
render();
},
training:function(c){
var w=c.querySelector("#mmf-zero");if(!w)return;
var gpus=w.querySelector("#mmf-gpus"),pv=w.querySelector(".mmf-pval"),stage=0,ckpt=0;
var P=7,W=2*P,G=2*P,O=12*P;
function gb(x){return x>=10?Math.round(x)+" GB":(Math.round(x*10)/10)+" GB";}
function calc(){var N=+gpus.value;pv.textContent=N;var A=ckpt?4:16;var wp=W,gp=G,op=O;
if(stage>=1)op=O/N;if(stage>=2)gp=G/N;if(stage>=3)wp=W/N;
var tot=wp+gp+op+A,maxT=W+G+O+16;
var segs=[["#5b8cff",wp],["#f5a623",op],["#e0556b",gp],["#3ecf8e",A]];
w.querySelector(".mmf-stack").innerHTML=segs.map(function(s){return '<div class="mmf-seg" style="background:'+s[0]+';width:'+(s[1]/maxT*100)+'%"></div>';}).join("");
w.querySelector(".mmf-tot").textContent=gb(tot);
w.querySelector(".mmf-ratio").innerHTML="Stage "+stage+" on "+N+" GPUs"+(ckpt?" + checkpointing":"")+": <b>"+gb(tot)+"</b> per GPU vs "+gb(maxT)+" unsharded.";
}
gpus.addEventListener("input",calc);
w.querySelectorAll('[data-role="stage"] button').forEach(function(b){b.addEventListener("click",function(){w.querySelectorAll('[data-role="stage"] button').forEach(function(x){x.classList.remove("on");});b.classList.add("on");stage=parseInt(b.getAttribute("data-z"),10);calc();});});
w.querySelectorAll('[data-role="ckpt"] button').forEach(function(b){b.addEventListener("click",function(){w.querySelectorAll('[data-role="ckpt"] button').forEach(function(x){x.classList.remove("on");});b.classList.add("on");ckpt=parseInt(b.getAttribute("data-ck"),10);calc();});});
calc();
},
hardware:function(c){
var w=c.querySelector("#mmf-hier");if(!w)return;
var rows=w.querySelector('[data-s="rows"]'),det=w.querySelector(".mmf-detail");
var TIERS=[
{n:"On-chip SRAM",cap:"~50 MB",bw:19000,col:"#00c853",d:"On-chip SRAM — the GPU's scratchpad. Astonishingly fast (~19 TB/s) but tiny. FlashAttention works by keeping its tiles here instead of in HBM."},
{n:"HBM / VRAM",cap:"~80 GB",bw:3350,col:"#3ecf8e",d:"High-Bandwidth Memory — the GPU's main memory, holding weights and the KV cache. Fast (~3.3 TB/s on an H100) but limited capacity. This is what 'GPU memory' usually means."},
{n:"System RAM",cap:"~512 GB",bw:100,col:"#5b8cff",d:"CPU system memory. Much bigger, but ~30x slower than HBM and reached over PCIe. Used for offloading optimizer state or layers that don't fit on the GPU."},
{n:"SSD / NVMe",cap:"~8 TB",bw:7,col:"#f5a623",d:"Disk. Effectively unlimited capacity but very slow (~7 GB/s). A last-resort offload (ZeRO-Infinity) for models far larger than RAM."}
];
var maxLog=Math.log(19000);
rows.innerHTML=TIERS.map(function(t,i){var wd=Math.max(Math.log(t.bw)/maxLog*100,6);var bwl=t.bw>=1000?(t.bw/1000)+" TB/s":t.bw+" GB/s";return '<div class="mmf-hrow" data-i="'+i+'" style="--c:'+t.col+'"><div class="mmf-htop"><span>'+t.n+'</span><small>'+t.cap+' · '+bwl+'</small></div><div class="mmf-hbar" style="width:'+wd+'%"></div></div>';}).join("");
rows.querySelectorAll(".mmf-hrow").forEach(function(r){r.addEventListener("click",function(){rows.querySelectorAll(".mmf-hrow").forEach(function(x){x.classList.remove("sel");});r.classList.add("sel");det.textContent=TIERS[+r.getAttribute("data-i")].d;});});
},
agent:function(c){
var w=c.querySelector("#mmf-ctxwin");if(!w)return;
var fill=w.querySelector('[data-f="ctx"]'),lbl=w.querySelector('[data-s="lbl"]'),tn=w.querySelector('[data-s="turns"]'),note=w.querySelector('[data-s="note"]');
var LIMIT=8000,used=0,turns=0;
function render(msg){var pct=Math.min(used/LIMIT*100,100);fill.style.width=pct+"%";fill.style.background=used>=LIMIT?"#e0556b":(used>LIMIT*0.8?"#f5a623":"#7c5cff");lbl.textContent=Math.min(used,LIMIT)+" / "+LIMIT+" tokens";tn.textContent=turns+" turns";if(msg)note.innerHTML=msg;}
w.querySelectorAll(".mmf-btn").forEach(function(b){b.addEventListener("click",function(){var a=b.getAttribute("data-a");
if(a==="add"){turns++;used+=800;if(used>=LIMIT)render("<b>Context full.</b> New turns push the oldest ones out — the model forgets the start of the conversation unless you store it elsewhere.");else if(used>LIMIT*0.8)render("Getting full. Time to summarize or offload to long-term memory.");else render("Added a turn. Working memory holds the recent conversation verbatim.");}
else if(a==="sum"){if(turns<=1){render("Nothing to summarize yet — add a few turns first.");return;}used=800+Math.round((used-800)*0.25);render("<b>Summarized.</b> Old turns compacted to about 25% of their size — same gist, far fewer tokens. The standard way to stay under the limit.");}
else{used=0;turns=0;render("Reset. Each turn adds about 800 tokens.");}
});});
render();
},
advanced:function(c){
var w=c.querySelector("#mmf-arch");if(!w)return;
var seq=w.querySelector("#mmf-seq"),pv=w.querySelector(".mmf-pval"),perTok=0.125/1024;
function gb(x){return x>=10?Math.round(x)+" GB":(Math.round(x*100)/100)+" GB";}
function calc(){var k=+seq.value,tokens=k*1024;pv.textContent=k+"K";var tf=tokens*perTok,ssm=0.4,max=256*1024*perTok;
w.querySelector('[data-f="tf"]').style.width=Math.min(tf/max*100,100)+"%";
w.querySelector('[data-f="ssm"]').style.width=Math.max(ssm/max*100,1.5)+"%";
w.querySelector('[data-s="tf"]').textContent=gb(tf);
w.querySelector('[data-s="ssm"]').textContent=gb(ssm)+" (flat)";
w.querySelector('[data-s="note"]').innerHTML="At <b>"+k+"K</b> tokens the transformer's cache is <b>"+gb(tf)+"</b> and still climbing; the SSM holds a fixed <b>"+gb(ssm)+"</b> state no matter the length.";
}
seq.addEventListener("input",calc);calc();
}
};
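// Illustrative sketch, not part of the original widget code: the KV-cache size
// formula that the `kv` initializer above computes inline, pulled out as a helper.
// The name kvCacheGB and its parameter names are assumptions made for this sketch.
// Size = 2 (keys and values) * layers * KV heads * head dim * tokens * batch * bytes per value,
// reported in decimal GB (divided by 1e9), the same unit the widget displays.
function kvCacheGB(layers,kvHeads,headDim,tokens,batch,bytesPerValue){
return (2*layers*kvHeads*headDim*tokens*batch*bytesPerValue)/1e9;
}
// Usage, assuming the widget's defaults (32 layers, 8 KV heads, head dim 128, 2 bytes per value):
// kvCacheGB(32,8,128,8*1024,1,2) comes to roughly 1.07 GB for an 8K-token context at batch size 1.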
var root=document.getElementById("mm-kb");
function esc(s){return s;}
function cardHTML(t){return '<div class="mm-card" data-id="'+t.id+'"><div class="mm-thumb"><img src="'+t.img+'" alt="" loading="lazy" onerror="this.parentNode.style.display=\'none\'"/><span class="mm-num">'+t.num+'</span><span class="mm-badge '+t.phase+'">'+PHASE[t.phase]+'</span><h3>'+t.title+'</h3><p>'+t.summary+'</p><div class="mm-chips">'+t.chips.map(function(c){return '<span class="mm-chip">'+c+'</span>';}).join("")+'</div><div class="mm-open">Open article →</div></div>';}
function fill(id,dom){document.getElementById(id).innerHTML=TOPICS.filter(function(t){return t.domain===dom;}).map(cardHTML).join("");}
fill("mm-grid-system","system");fill("mm-grid-agent","agent");
function heroHTML(t){if(t.video){return '<div class="mm-hero"><video src="'+t.video+'" autoplay loop muted playsinline></video></div>';}if(t.img){return '<div class="mm-hero"><img src="'+t.img+'" alt="" onerror="this.parentNode.style.display=\'none\'"/></div>';}return '';}
function articleHTML(t){var body;if(t.body&&BODIES[t.body]){body=BODIES[t.body]();}else{var secs=t.sections.map(function(s){return '<div class="mm-sec"><h3>'+s.h+'</h3><ul>'+s.li.map(function(x){return '<li>'+x+'</li>';}).join("")+'</ul></div>';}).join("");body=secs+'<div class="mm-note">This is the article outline. Each bullet is a section to be written out in full.</div>';}var dom=t.domain==="system"?"System &amp; Runtime Memory":"Agent &amp; Long-Term Memory";return '<div class="mm-article" style="display:block"><button class="mm-back" id="mm-back">← Back to all topics</button>'+heroHTML(t)+'<span class="mm-badge '+t.phase+'">'+PHASE[t.phase]+'</span><span class="mm-chip">'+dom+'</span><h1>'+t.title+'</h1><p class="mm-lead">'+t.lead+'</p>'+body+'</div>';}
function show(id){var home=document.getElementById("mm-home");var cont=document.getElementById("mm-article-container");var t=TOPICS.filter(function(x){return x.id===id;})[0];if(t){home.style.display="none";cont.innerHTML=articleHTML(t);if(t.body&&INITS[t.body]){INITS[t.body](cont);}root.scrollIntoView({behavior:"smooth",block:"start"});}}
function goHome(){document.getElementById("mm-article-container").innerHTML="";document.getElementById("mm-home").style.display="block";root.scrollIntoView({behavior:"smooth",block:"start"});}
root.addEventListener("click",function(e){var card=e.target.closest(".mm-card");if(card){show(parseInt(card.getAttribute("data-id"),10));return;}if(e.target.closest("#mm-back")){goHome();}});
})();</script>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/memory-llms/hero.png" medium="image"><media:title type="plain">Posts</media:title></media:content><category>LLMs</category><category>Memory</category><category>Inference</category><category>GPU</category></item><item><title>Aether, Rethought — The Shape Was Wrong All Along</title><link>https://curiousbit.netlify.app/aether-rethought-the-shape-was-wrong-all-along/</link><guid isPermaLink="true">https://curiousbit.netlify.app/aether-rethought-the-shape-was-wrong-all-along/</guid><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Part III of the Aether series.&lt;/strong&gt; Missed the first two? Start with &lt;a href="https://curiousbit.netlify.app/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/"&gt;Meet Aether&lt;/a&gt; (the build), then &lt;a href="https://curiousbit.netlify.app/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/"&gt;Aether, Grown Wild&lt;/a&gt; (what happened when it ran).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="the-story-so-far--one-paragraph-each"&gt;The story so far — one paragraph each&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Part I:&lt;/strong&gt; I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/aether-rethought.png" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><blockquote><p><strong>Part III of the Aether series.</strong> Missed the first two? Start with <a href="/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/">Meet Aether</a> (the build), then <a href="/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/">Aether, Grown Wild</a> (what happened when it ran).</p></blockquote><hr><h2 id="the-story-so-far--one-paragraph-each">The story so far — one paragraph each</h2><p><strong>Part I:</strong> I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.</p><p><strong>Part II:</strong> That clean idea hit reality. The router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and self-reported confidence was replaced by a formula the system computes. Thirteen agents, a live web allowlist, a computed confidence score, and a CHANGELOG.md in place of Git.</p><p><strong>Part III (this one):</strong> The shape is wrong. The hierarchy that felt so natural — because it mirrors the org chart — turns out to optimise for the wrong things. 
I worked through five recognised ways to build an agentic AI system, scored each against the same criteria, and arrived at a recommendation that changes the architecture without throwing away anything we&rsquo;ve built.</p><hr><h2 id="what-were-actually-building">What we&rsquo;re actually building</h2><p>The output isn&rsquo;t a chatbot. It&rsquo;s a consulting deliverable — an architecture document advising how to run an IT transition. Every real client engagement spans multiple towers simultaneously: a move to Entra ID while modernising the network and shifting workloads to Azure touches Cloud, Network, Digital Workplace, and Security in the same breath. The value and the difficulty live in the cross-domain synthesis.</p><p><img src="/images/slides/slides.002.png" alt="Slide: The problem — what we are actually building. Six technology towers. The hard part is cross-domain synthesis into one coherent, defensible document."/><hr><h2 id="where-we-started--the-three-tier-hierarchy">Where we started — the three-tier hierarchy</h2><p>This is what Aether v2.x actually is. One model, thirteen agents, each differentiated only by system prompt and knowledge namespace. Work routes down the tree; low confidence escalates back up.</p><p><img src="/images/slides/slides.003.png" alt="Slide: The current Aether v2.x system — three-tier hierarchy of 13 agents across Cloud, DWP, and Network domains"/><p>It was appealing for real reasons: it mirrors how a delivery organisation thinks, it is easy to explain to a client, and the RAG namespace-per-domain isolation is clean. But we copied an org chart into the control flow — and the next slide explains why that&rsquo;s usually a trap.</p><hr><h2 id="the-core-insight--why-the-org-chart-is-the-wrong-shape">The core insight — why the org chart is the wrong shape</h2><p>This is the conceptual centre of the entire analysis. 
Everything that follows flows from these two points.</p><p><img src="/images/slides/slides.004.png" alt="Slide: Why mirroring the org chart is usually a trap — human orgs optimise for accountability, agent systems optimise for information flow. Two consequences: real infra work is cross-domain, and the deliverable is a workflow not an organisation."/><p>A strict tree only allows vertical movement — up to escalate, down to delegate. Real architecture work needs lateral collaboration. The AWS specialist can&rsquo;t directly ask the Network specialist a question; it has to climb the entire tree and back down. That&rsquo;s bureaucracy encoded in Python.</p><p>The second consequence is more fundamental: <strong>the deliverable is a workflow, not an organisation</strong>. Producing an architecture document is a consulting process with phases. The right structure for that process is a pipeline. We modelled the people first and the process second — we should have done it the other way round.</p><hr><h2 id="five-ways-to-build-it">Five ways to build it</h2><p>Rather than jump straight to a new design, I evaluated five recognised approaches against the same five axes: cross-domain capability, parallelism, auditability, simplicity, and fit for document generation.</p><p><img src="/images/slides/slides.005.png" alt="Slide: The section ahead — five ways to build it, each weighed on the same axes. RAG is a shared substrate underneath all five."/><p>Note the footer: RAG sits underneath <strong>all five</strong> approaches — they differ in control flow, not in whether they retrieve.</p><hr><h3 id="approach-1--hierarchical--org-mirror">Approach 1 — Hierarchical / org-mirror</h3><p>This is what Aether v2.x already is.</p><p><img src="/images/slides/slides.006.png" alt="Slide: Approach 1 of 5 — Hierarchical / org-mirror. Pros: intuitive, clear escalation, clean RAG namespaces. 
Cons: no lateral channel, brittle confidence escalation, Tier-1 becomes a bottleneck."/><p>The pros are real — which is why we chose it. But the killer con: <strong>there is no sideways path</strong>. The AWS agent can&rsquo;t ask the Network agent a question without escalating all the way up and back down. Also, escalation triggered by confidence scores sounds clean, but LLM confidence is unreliable — the trigger itself is shaky.</p><p><strong>Verdict:</strong> Great for stable, siloed problems with a genuine audit chain. That is not what our engagements look like.</p><hr><h3 id="approach-2--orchestrator--flat-specialists">Approach 2 — Orchestrator + flat specialists</h3><p>One orchestrator plans the task, fans it out to specialists running <strong>in parallel</strong>, then synthesises. Flat: adding a domain means adding one specialist — no re-tiering.</p><p><img src="/images/slides/slides.007.png" alt="Slide: Approach 2 of 5 — Orchestrator + flat specialists. Real parallelism, easy to extend, conflicts resolved in one place. Cost: orchestrator decomposition and synthesis are the hard part."/><p><strong>Verdict:</strong> A strong, flexible default. A piece of the recommendation.</p><hr><h3 id="approach-3--workflow--pipeline-process-native">Approach 3 — Workflow / pipeline (process-native)</h3><p>Instead of organising by <em>who</em>, organise by the <em>stages</em> of producing the document: Discover → Assess → Design → Review → Assemble. Each stage maps to a section of the output. Human checkpoints slot between stages.</p><p><img src="/images/slides/slides.008.png" alt="Slide: Approach 3 of 5 — Workflow / pipeline. Mirrors how the deliverable is actually made. Highly auditable: each stage = a section. The &lsquo;con&rsquo; — domains become knowledge sources — is arguably a feature."/><p><strong>Verdict:</strong> This is the spine. The structure that mirrors how the deliverable is actually made. 
The so-called con — that domain experts become knowledge sources rather than autonomous agents — is arguably the correct framing.</p><hr><h3 id="approach-4--blackboard--shared-artifact">Approach 4 — Blackboard / shared artifact</h3><p>All agents read and write a common workspace — the evolving document. Strong on cross-domain consistency because there&rsquo;s only one object. The hard part is concurrency control.</p><p><img src="/images/slides/slides.009.png" alt="Slide: Approach 4 of 5 — Blackboard / shared artifact. Good cross-domain consistency. Concurrency and conflict handling is fiddly. A typed, controlled state object (LangGraph) is safer than a true free-for-all blackboard."/><p><strong>Key distinction:</strong> a true blackboard (free-for-all writes) is risky. A typed, controlled shared state object — which LangGraph gives us — keeps the benefit without the chaos. We use the controlled version.</p><hr><h3 id="approach-5--single-agent-baseline">Approach 5 — Single-agent baseline</h3><p>One capable model. A lightweight router. Domain knowledge bases on demand. Simplest, cheapest, most reliable.</p><p><img src="/images/slides/slides.010.png" alt="Slide: Approach 5 of 5 — Single-agent baseline. Simplest and most reliable. Often beats multi-agent for doc generation. Breaks down when inputs blow past the context window."/><p><strong>Verdict:</strong> The benchmark every fancier design has to beat. The real reason to decompose is <em>information management</em> — not model weakness. When CMDB exports, cloud inventories, and Terraform files all arrive together, no context window handles it cleanly.</p><hr><h2 id="the-comparison">The comparison</h2><p><img src="/images/slides/slides.011.png" alt="Slide: Side-by-side comparison — all five approaches scored on cross-domain, parallelism, auditability, simplicity, and fit for doc gen. 
The two that score High on fit-for-doc-gen are Orchestrator+Specialists and Workflow — and they&rsquo;re complementary."/><p>The table does the work. Our current design (Hierarchical) is weakest exactly where we need strength — cross-domain — while strong on auditability. The two approaches that score high on fit-for-doc-gen are <strong>complementary</strong>: strong in different places. That&rsquo;s the bridge to the recommendation.</p><hr><h2 id="the-recommendation--a-hybrid">The recommendation — a hybrid</h2><p><img src="/images/slides/slides.013.png" alt="Slide: Recommended — a hybrid workflow + specialist platform. Hub-and-spoke structure: typed shared state at the centre, pipeline flowing through it, specialists invoked within stages, cross-cutting reviewers across all stages, QA/eval loops back."/><p>Not a ladder. A <strong>hub</strong>. The pipeline (Discovery → Assessment → Design) flows through a single typed shared state object. Domain specialists are invoked within stages — callable skills, not autonomous routing agents. Cross-cutting reviewers (Security, Cost, Risk, Compliance) act across all stages. 
QA can loop work back to Design or Assessment until the rubric passes.</p><hr><h3 id="how-the-model-works">How the model works</h3><p><img src="/images/slides/slides.014.png" alt="Slide: The recommendation in motion — pluggable model, orchestrated pipeline, parallel specialists, cross-cutting reviewers, deterministic arbitration, human gates, render and evaluate."/><p>Three things to highlight:</p><ul><li><strong>Parallel specialists</strong> write structured findings into one typed shared state — the single source of truth</li><li><strong>Arbitration is deterministic</strong> — fires on conflict, policy breach, or missing data, never on confidence scores</li><li><strong>The model is pluggable</strong> — local Gemma today, Claude/OpenAI/Gemini tomorrow, same knowledge, no rigid tree</li></ul><hr><h3 id="why-this-is-the-right-call">Why this is the right call</h3><p><img src="/images/slides/slides.015.png" alt="Slide: Why this is the right call — trade-offs honestly laid out. Pros: handles cross-domain work, governance is first-class, model-agnostic. Cons: more moving parts, orchestrator and eval are real engineering."/><p>The verdict, stated plainly: it&rsquo;s the only option that handles cross-domain work with the auditability, governance, and evidence-traceability a client deliverable demands — at an acceptable, well-understood increase in build complexity.</p><hr><h2 id="the-technology-stack--nothing-new-to-install">The technology stack — nothing new to install</h2><p><img src="/images/slides/slides.016.png" alt="Slide: The technology stack — LangGraph + FastAPI + Redis, Gemma via LM Studio, LanceDB + BAAI/bge-small + SQLite, ddgs + BeautifulSoup, Gradio, YAML manifests. Everything already running."/><p>Every box in the stack is something we already run. 
The v3 work extends Orchestration and adds the governance/eval layer. <strong>This is not a rebuild.</strong></p><hr><h2 id="the-process--how-a-real-engagement-runs">The process — how a real engagement runs</h2><p><img src="/images/slides/slides.017.png" alt="Slide: Process flow — Discovery → evidence normalise → current state → gap analysis → target design → ADRs → reviews → migration plan → assembly → QA/eval. Two human gates. Feedback loops back to design. Everything writes into one canonical model."/><p>Two things to notice:</p><ol><li><strong>Feedback arrows</strong> — review, QA, and cost/risk findings can send work <em>back</em> to Design or Assessment. It&rsquo;s iterative, not a one-way pipeline.</li><li><strong>Human gates</strong> — after assessment and before assembly. An architect validates the current-state picture and signs off before the document is built. Gated autonomy, not full automation — which matters when the output carries liability.</li></ol><hr><h2 id="the-real-hard-problem--evidence-quality">The real hard problem — evidence quality</h2><p><img src="/images/slides/slides.018.png" alt="Slide: The real hard problem — bad source data. Missing CMDB, conflicting diagrams, unknown dependencies, shadow IT. Evidence quality is an explicit early gate: score completeness, request more data, log gaps, escalate when material."/><p>Every downstream recommendation carries the confidence level and assumptions set at this gate. <strong>No silent guessing.</strong> The system requests more data, logs gaps in an assumption register, or proceeds while explicitly stating its confidence level.</p><hr><h2 id="what-flows-through-the-system--the-canonical-architecture-state">What flows through the system — the canonical Architecture State</h2><p><img src="/images/slides/slides.019.png" alt="Slide: The canonical Architecture Model — one typed object, ArchitectureState. Fields mature left to right: Evidence → Analysis → Decisions → Plan &amp; Governance → Output. 
Every recommendation traces back to the evidence that produced it."/><p>One typed object. Every stage, specialist, and reviewer reads from and writes to it. The document is rendered from it. Every recommendation traces back through the same object to the evidence that produced it — that traceability is what makes the deliverable auditable.</p><hr><h2 id="lenses-not-domains--where-security-and-compliance-live">Lenses, not domains — where security and compliance live</h2><p><img src="/images/slides/slides.020.png" alt="Slide: Lenses not domains — Security, Cost/FinOps, Compliance, Risk, Sustainability, Supportability cut horizontally across all stages. Each is a reviewer plus a checklist in the QA rubric firing on concrete triggers."/><p><strong>Security is a lens, not a domain.</strong> Same for Cost, Compliance, Risk, Sustainability, and Supportability. They aren&rsquo;t towers sitting next to Cloud and Network — they cut horizontally across every stage. Most rework loops originate in Security and Compliance, which is exactly why they&rsquo;re first-class cross-cutting reviewers.</p><hr><h2 id="key-design-considerations">Key design considerations</h2><p><img src="/images/slides/slides.021.png" alt="Slide: Six key design considerations — decompose for context not IQ, deterministic arbitration, human-in-the-loop gates, evaluation and provenance, template as rendering concern, operational reality."/><p>The most important reframe: <strong>we decompose for information management, not because the model is weak</strong>. That reframes the whole multi-agent debate. And arbitration fires on deterministic events — never on confidence scores. Every recommendation must trace to evidence.</p><hr><h2 id="design-decisions--where-the-build-effort-actually-goes">Design decisions — where the build effort actually goes</h2><p><img src="/images/slides/slides.022.png" alt="Slide: Design decisions and effort split. 35% evidence ingestion, 20% evaluation/governance, 10% agents. 
Most teams invert these numbers and ship a demo that can&rsquo;t produce a deliverable a client would pay for."/><p>The effort allocation is the provocative bit. <strong>Only ~10% of build effort goes to the agents themselves.</strong> ~35% is evidence ingestion and normalisation. ~20% is evaluation and governance. Most teams invert these numbers — polished agents, no evaluation — and end up with a demo that can&rsquo;t produce a deliverable a client would pay for. That&rsquo;s the trap we&rsquo;re avoiding.</p><hr><h2 id="business-value">Business value</h2><p><img src="/images/slides/slides.023.png" alt="Slide: Business value — consulting accelerator and quality platform. Time to assess: weeks to days. Evidence gathering: manual to automated. Document assembly: hand-built to generated and validated. Hypotheses to validate in a pilot."/><p>The positioning: <strong>a consulting accelerator and quality platform</strong> — not a headcount-reduction play. The specific numbers (50–80% faster assessments, 60–90% less evidence-collection effort) are hypotheses to validate in a pilot, not measured results. They&rsquo;re framed that way because that&rsquo;s what earns trust.</p><hr><h2 id="next-steps">Next steps</h2><p><img src="/images/slides/slides.024.png" alt="Slide: Next steps — align on approach, write v3 design, scope Cloud Migration Assessment as first engagement, stand up evidence and eval first."/><p>The sequencing matters: <strong>evidence ingestion and the QA rubric first</strong>, agent polish second. That&rsquo;s where the quality actually lives.</p><hr><h2 id="the-honest-accounting">The honest accounting</h2><p>I built a hierarchy because it was intuitive. It is intuitive — it maps to how the organisation thinks, the escalation chain is easy to trace and audit, and it&rsquo;s easy to explain to a client. Those are real advantages.</p><p>But intuitive for humans and optimal for agents are not the same thing. 
The hierarchy was designed to answer <em>&ldquo;who is responsible?&rdquo;</em> — an important human question that agents don&rsquo;t need answered. The new design answers <em>&ldquo;what needs to happen next?&rdquo;</em> — which is the right question for a document-generation workflow.</p><p><strong>What changes in v3:</strong></p><ul><li>Agents become implementation details inside the Knowledge layer — the durable value is in evidence, governance, evaluation, and the document workflow</li><li>Escalation fires on events (conflict, policy breach, missing data) — not on confidence scores</li><li>The canonical model is the design constraint — the document isn&rsquo;t an afterthought, it&rsquo;s what everything is structured around</li><li>Model-agnostic by default — local Gemma today, SOTA API tomorrow, same knowledge</li></ul><p><strong>What stays the same:</strong> LangGraph, LanceDB, BAAI/bge-small, FastAPI, SQLite, the YAML manifests, and the 13 domain specialists — which become callable skills within the Design stage rather than autonomous routing agents. Same knowledge, different invocation mechanism.</p><p>This is an extension of what we run, not a rebuild.</p><hr><p><em>Questions about the design or the approach? Reach out on <a href="https://www.linkedin.com/in/ajay-walia-8b066a1b/">LinkedIn</a>.</em></p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/aether-rethought.png" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>I Built My Own RSS Reader in an Afternoon — With AI Doing the Typing</title><link>https://curiousbit.netlify.app/i-built-my-own-rss-reader-in-an-afternoon-with-ai-doing-the-typing/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-my-own-rss-reader-in-an-afternoon-with-ai-doing-the-typing/</guid><pubDate>Wed, 10 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;For years my RSS setup was a compromise. I never wanted to pay for a desktop reader, so I got stuck with The Old Reader — a perfectly fine service, but using it felt like visiting a website, because that&amp;rsquo;s exactly what it was. Open a browser tab, log in, scroll, repeat tomorrow. What I actually wanted was simple: a native Mac app, my feeds downloaded and stored locally, no account, no subscription, fast enough to triage a hundred articles with my keyboard.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/Lumen/lumenai-hero.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>For years my RSS setup was a compromise. I never wanted to pay for a desktop reader, so I got stuck with The Old Reader — a perfectly fine service, but using it felt like visiting a website, because that&rsquo;s exactly what it was. Open a browser tab, log in, scroll, repeat tomorrow. 
What I actually wanted was simple: a native Mac app, my feeds downloaded and stored locally, no account, no subscription, fast enough to triage a hundred articles with my keyboard.</p><p>Then I heard about Claude&rsquo;s new Fable model and thought: fine, let&rsquo;s stop wishing and just build the thing. This is the story of <strong>LumenAI</strong> — a local-first RSS reader for macOS with AI summaries — built from an empty folder to a notarizable DMG in about an hour of wall-clock time, with me acting as product owner and build verifier while the AI wrote the code.</p><p><img src="/images/lumenai-screens/main-window.png" alt="LumenAI main window — three-pane layout in dark mode"/><p><em>That&rsquo;s the real app, not a mockup. There&rsquo;s a <a href="/lumenai-screens.html">full captioned gallery here</a> — the loaded feed list, OPML import, the memory footprint, and the DMG build.</em></p><hr><h2 id="the-idea">The Idea</h2><p>The pitch I gave the AI was one paragraph: a local RSS reader for Mac, feeds downloaded and stored on my machine, a premium feel, and treat it like a real engineering project — clear phases, and ask me questions before making decisions. That last part turned out to be the most important sentence in the whole project.</p><p>Instead of immediately generating a wall of code, it interviewed me. What stack? What does v1 include, and — just as important — what does it exclude? How should refresh work? What does &ldquo;premium&rdquo; mean to you, concretely? 
By the end of a few rounds of multiple-choice questions, we had a real spec:</p><p><strong>v1 goals:</strong> subscribe to feeds, full article extraction, fast local search, OPML import/export, saved views, deduplication, offline reading, AI summaries, keyboard-first navigation, dark/light themes, typography controls, reader mode, reading progress.</p><p><strong>v1 non-goals:</strong> semantic search, topic clustering, multi-device sync, social anything.</p><p>Writing down the non-goals felt almost ceremonial at the time. It wasn&rsquo;t. Every time scope tried to creep, that list killed the discussion in one line.</p><hr><h2 id="the-stack">The Stack</h2><p>Every choice optimized for &ldquo;native feel, local data, no servers.&rdquo;</p><p>The app is Swift and SwiftUI targeting macOS 14, because nothing fakes the feel of a real Mac app. Storage is SQLite via <a href="https://github.com/groue/GRDB.swift">GRDB.swift</a>, chosen over Apple&rsquo;s SwiftData specifically for FTS5 — SQLite&rsquo;s built-in full-text search engine, which gives instant search across every article ever downloaded, entirely offline. Feed parsing is FeedKit, wrapped in a normalizer layer so the rest of the app never touches a FeedKit type and JSON Feed support costs no schema changes. Full-text extraction is Mozilla&rsquo;s Readability.js — the same engine behind Firefox&rsquo;s reader mode — running in a hidden WKWebView with the page&rsquo;s own JavaScript disabled. The reader itself is a WKWebView used purely as a rendering layer for a themed HTML template; everything around it stays SwiftUI. The project file is generated by XcodeGen from a YAML spec, which kept the AI and Xcode from ever fighting over a <code>.xcodeproj</code>.</p><p>The AI layer is the part I&rsquo;m proudest of architecturally: a single <code>SummaryProvider</code> protocol with six implementations — Apple Intelligence (on-device, appears only on macOS 26+), Ollama for local models, Claude, OpenAI, Gemini, and Disabled. 
One protocol method. Swapping providers is a dropdown in Settings; API keys live in the macOS Keychain.</p><p><img src="/images/Lumen/lumenai-stack.jpg" alt="LumenAI architecture — SwiftUI over SQLite/FTS5, FeedKit and Readability.js, with six pluggable AI providers"/><hr><h2 id="the-seven-phases-okay-eight">The Seven Phases (Okay, Eight)</h2><p>We numbered from zero, like civilized people.</p><p><strong>Phase 0 — Scaffold.</strong> XcodeGen project, sandbox and network entitlements, and a three-pane shell (sidebar, article list, reader) running on sample data. The exit criterion was simply &ldquo;builds and runs.&rdquo; It almost did: the very first build failed with Swift&rsquo;s infamous <em>&ldquo;the compiler is unable to type-check this expression in reasonable time&rdquo;</em> — the AI had written a too-clever nested closure to generate sample data. It rewrote it as a boring <code>for</code> loop. A very human bug, honestly.</p><p><strong>Phase 1 — Data layer.</strong> The real schema: feeds, folders, articles, saved views, an FTS5 index kept in sync by SQL triggers, and a three-tier deduplication identity — an article is its <code>guid</code> if the feed provides one, else its normalized URL (tracking parameters stripped), else a content hash. Ten unit tests against an in-memory database before any networking existed.</p><p><strong>Phase 2 — Feed engine.</strong> Fetching with HTTP conditional GET, so unchanged feeds cost a 304 response instead of a re-download. Feed auto-discovery, so typing <code>daringfireball.net</code> finds the actual feed URL by scanning the page&rsquo;s <code>&lt;link&gt;</code> tags. RSS, Atom, and JSON Feed all normalize into one canonical model. 
By the end of this phase the app was genuinely usable: subscribe, read, refresh.</p><p><strong>Phase 3 — Core UI.</strong> Folders, favicons, thumbnails, unread badges, and the thing that makes an RSS reader feel like a tool instead of a website: keyboard navigation. <code>j</code>/<code>k</code> to move, space for next unread, <code>s</code> to star, <code>m</code> to toggle read. The Reeder dialect, basically.</p><p><strong>Phase 4 — Reading experience.</strong> Select an article and it silently fetches the source page, runs Readability.js over it, and stores clean full text — so a feed that only publishes two-line excerpts still gives you whole articles, offline, forever. Typography controls (serif/sans, size, line width), themes that follow the system, and per-article reading progress that restores when you come back. This phase also produced the best bug of the project: scroll position was saved to app state, which regenerated the reader HTML, which reloaded the page, which reset the scroll — an infinite loop the AI caught in code review <em>before</em> I ever built it.</p><p><strong>Phase 5 — Search, saved views, OPML.</strong> Global FTS5 search from the toolbar, saved views (persistent named filters — &ldquo;unread Swift articles from these three feeds, last 30 days&rdquo;), and OPML import/export so my subscriptions could finally walk out of The Old Reader with folder structure intact.</p><p><strong>Phase 6 — AI summaries.</strong> The provider protocol described above, plus a deliberately boring prompt: summarize in two or three sentences, be specific, no &ldquo;this article discusses.&rdquo; The summary renders as a tinted card above the article. The point isn&rsquo;t to replace reading — it&rsquo;s triage. 
Is this worth my next ten minutes?</p><p><img src="/images/Lumen/lumenai-reader.jpg" alt="Reader view with the AI summary card above the article (concept render)"/><p><strong>Phase 7 — Polish.</strong> An app icon (generated programmatically — an RSS glyph under a sparkle on an indigo gradient), a dock badge with the unread count, render-path caching so thousand-article lists scroll smoothly, and a one-command script that builds a signed, drag-to-Applications DMG.</p><p><img src="/images/lumenai-screens/dmg-build.png" alt="One-command DMG build — ./Scripts/make_dmg.sh producing LumenAI.dmg"/><p><em>The Phase 7 finale, for real: one script, one signed DMG on the Desktop. More screenshots — the loaded feed list after OPML import, and the app&rsquo;s 143 MB memory footprint in Activity Monitor — are on the <a href="/lumenai-screens.html">LumenAI screenshots page</a>.</em></p><hr><h2 id="what-the-lifecycle-actually-felt-like">What the Lifecycle Actually Felt Like</h2><p>The loop for every phase was identical: the AI proposed decisions and asked questions, wrote the code and its tests, and then stopped — because it couldn&rsquo;t compile anything. Its sandbox is Linux; you can&rsquo;t build a Mac app there. So I was the build machine. ⌘U, ⌘R, report back. &ldquo;Build succeeded, go ahead&rdquo; became the rhythm of the afternoon.</p><p>That constraint turned out to be a feature. It forced a real checkpoint between phases — a human running the actual app — instead of an unbroken firehose of unverified code. Of the three failures across the whole project, two were caught by my builds (the type-checker timeout, and a &ldquo;cannot find type&rdquo; error that turned out to mean I&rsquo;d forgotten to re-run XcodeGen after files were added) and one was caught by the AI re-reading its own code. 
Final tally: <strong>38 Swift files, 27 tests, zero runtime crashes encountered.</strong></p><p>The other thing that surprised me: being asked questions felt like the AI respecting that it was <em>my</em> app. Tech stack, refresh cadence, dedup policy, summary length, even the app&rsquo;s name — every fork in the road was a decision I made in seconds from a menu of researched options, instead of an assumption silently baked into code I&rsquo;d discover three weeks later.</p><hr><h2 id="what-id-tell-you-if-youre-tempted">What I&rsquo;d Tell You If You&rsquo;re Tempted</h2><p>Treat it like an engineering project, not a magic trick. The phases, the non-goals list, the tests, the check-in after every phase — that structure is why this worked in an hour instead of unraveling in a weekend. The AI typed every line of code, but the spec, the taste, and the &ldquo;no, simpler&rdquo; calls were the human contribution, and the project needed both.</p><p>And yes — the app starts instantly, works on a plane, and never asks me to log in. The subscription I avoided paying for has been replaced by the most expensive thing of all: now I want to build everything.</p><hr><p><em>LumenAI is Swift/SwiftUI on macOS 14+, with GRDB, FeedKit, and Readability.js. Built with Claude Fable 5.</em></p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/Lumen/lumenai-hero.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>LLMs Are Probability Engines, Not "Thinkers"</title><link>https://curiousbit.netlify.app/llms-are-probability-engines-not-ai/</link><guid isPermaLink="true">https://curiousbit.netlify.app/llms-are-probability-engines-not-ai/</guid><pubDate>Sun, 07 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&amp;family=JetBrains+Mono:wght@400;500&amp;display=swap');
.pe-article {
--bg: #070b14;
--bg2: #0d1423;
--bg3: #111827;
--cyan: #00e5ff;
--purple: #a855f7;
--gold: #fbbf24;
--text: #e2e8f0;
--muted: #94a3b8;
--border: #1e293b;
--danger: #f87171;
font-family: 'Space Grotesk', system-ui, sans-serif;
font-size: 1.08rem;
line-height: 1.85;
color: var(--text);
}
/* TOC */
.pe-toc {
background: var(--bg2);
border: 1px solid var(--border);
border-left: 3px solid var(--cyan);
border-radius: 10px;
padding: 1.25rem 1.75rem;
margin: 2rem 0;
}
.pe-toc h3 {
font-size: 0.95rem;
letter-spacing: 0.18em;
text-transform: uppercase;
color: var(--cyan);
margin: 0 0 1rem;
}
.pe-toc ol { padding-left: 1.3rem; margin: 0; }
.pe-toc li { margin-bottom: 0.6rem; }
.pe-toc a { color: var(--muted); text-decoration: none; font-size: 1.15rem; font-weight: 600; transition: color 0.2s; }
.pe-toc a:hover { color: var(--cyan); }
/* Video */
.pe-video { margin: 2rem 0; border-radius: 12px; overflow: hidden; border: 1px solid var(--border); background: #000; }
.pe-video video { width: 100%; display: block; }
.pe-video-header {
background: var(--bg2);
padding: 1rem 1.4rem;
font-size: 1.15rem;
font-weight: 600;
color: var(--cyan);
border-bottom: 1px solid var(--border);
line-height: 1.5;
}
/* Typography */
.pe-article h2 {
font-size: 1.75rem;
font-weight: 700;
color: #fff;
margin: 3rem 0 0.9rem;
padding-bottom: 0.45rem;
border-bottom: 1px solid var(--border);
}
.pe-sec-num { color: var(--cyan); font-size: 1rem; font-weight: 600; display: block; margin-bottom: 0.2rem; letter-spacing: 0.1em; }
.pe-article p { margin-bottom: 1.1rem; }
.pe-article strong { color: #fff; }
.pe-em { color: var(--gold); }
/* Callouts */
.pe-callout { background: var(--bg2); border-left: 4px solid var(--purple); border-radius: 0 8px 8px 0; padding: 1.4rem 1.8rem; margin: 1.5rem 0; font-size: 1.4rem; color: var(--muted); line-height: 1.75; }
.pe-callout.cy { border-color: var(--cyan); }
.pe-callout.gd { border-color: var(--gold); }
.pe-callout strong { color: var(--text); }
/* Compare table */
.pe-table { width: 100%; border-collapse: collapse; font-size: 1rem; margin: 1.25rem 0; }
.pe-table th { text-align: left; padding: 0.7rem 1rem; background: var(--bg2); color: var(--cyan); font-size: 0.85rem; letter-spacing: 0.08em; text-transform: uppercase; border-bottom: 1px solid var(--border); }
.pe-table td { padding: 0.85rem 1rem; border-bottom: 1px solid var(--border); color: var(--muted); vertical-align: top; line-height: 1.6; }
.pe-table td:first-child { color: var(--text); font-weight: 500; }
.pe-table tr:hover td { background: var(--bg2); }
/* Formula boxes */
.pe-box { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.75rem; margin: 1.75rem 0; }
.pe-box-title { font-size: 0.95rem; letter-spacing: 0.15em; text-transform: uppercase; color: var(--purple); margin-bottom: 1rem; }
/* Anim 1 — token prediction */
.pe-sentence { font-size: 1.3rem; font-family: 'JetBrains Mono', monospace; color: var(--text); min-height: 2rem; margin-bottom: 1.1rem; }
.pe-cursor { display: inline-block; width: 2px; height: 1em; background: var(--cyan); animation: pe-blink 0.8s infinite; vertical-align: middle; margin-left: 2px; }
@keyframes pe-blink { 0%,100%{opacity:1} 50%{opacity:0} }
.pe-prob-bars { display: flex; flex-direction: column; gap: 0.55rem; }
.pe-prob-row { display: flex; align-items: center; gap: 0.8rem; font-size: 1.1rem; }
.pe-prob-lbl { width: 80px; text-align: right; color: var(--muted); font-family: 'JetBrains Mono', monospace; flex-shrink: 0; }
.pe-prob-track { flex: 1; height: 26px; background: var(--bg3); border-radius: 5px; overflow: hidden; }
.pe-prob-fill { height: 100%; background: var(--cyan); border-radius: 5px; transition: width 0.55s cubic-bezier(0.4,0,0.2,1); width: 0; }
.pe-prob-fill.win { background: var(--gold); }
.pe-prob-pct { width: 50px; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--muted); }
/* Math display */
.pe-math { font-family: 'Georgia', serif; font-size: 1.45rem; color: var(--gold); text-align: center; padding: 1.3rem; background: var(--bg3); border-radius: 8px; margin-bottom: 0.9rem; }
.pe-term { display: inline; opacity: 0; transition: opacity 0.4s; cursor: help; position: relative; }
.pe-term.on { opacity: 1; }
.pe-term:hover::after { content: attr(data-tip); position: absolute; bottom: 115%; left: 50%; transform: translateX(-50%); background: var(--bg); border: 1px solid var(--purple); color: var(--text); padding: 0.4rem 0.85rem; border-radius: 6px; font-family: 'Space Grotesk', sans-serif; font-size: 0.88rem; white-space: nowrap; z-index: 20; }
.pe-anns { display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
.pe-ann { background: var(--bg3); border-radius: 6px; padding: 0.55rem 0.85rem; font-size: 0.93rem; opacity: 0; transition: opacity 0.5s; }
.pe-ann.on { opacity: 1; }
.pe-ann-sym { color: var(--gold); font-family: 'JetBrains Mono', monospace; font-weight: bold; }
.pe-ann-desc { color: var(--muted); }
/* Softmax */
.pe-sm-demo { display: flex; gap: 1rem; align-items: flex-start; flex-wrap: wrap; }
.pe-sm-col { flex: 1; min-width: 160px; }
.pe-col-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.75rem; }
.pe-logit-row { display: flex; align-items: center; gap: 0.6rem; margin-bottom: 0.55rem; font-size: 0.97rem; font-family: 'JetBrains Mono', monospace; }
.pe-logit-w { width: 60px; color: var(--text); }
.pe-logit-v { padding: 0.22rem 0.6rem; border-radius: 4px; font-size: 0.92rem; }
.pe-logit-v.neg { background: rgba(248,113,113,0.15); color: var(--danger); }
.pe-logit-v.pos { background: rgba(0,229,255,0.1); color: var(--cyan); }
.pe-sm-bar { height: 22px; border-radius: 4px; background: var(--purple); transition: width 0.75s cubic-bezier(0.4,0,0.2,1); width: 0; display: flex; align-items: center; padding-left: 7px; font-size: 0.86rem; color: #fff; overflow: hidden; white-space: nowrap; }
.pe-arrow { display: flex; align-items: center; justify-content: center; padding-top: 1.4rem; font-size: 1.6rem; color: var(--cyan); }
/* Loss */
.pe-loss-wrap { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: start; }
@media(max-width:500px) { .pe-loss-wrap { grid-template-columns: 1fr; } }
.pe-loss-num { font-size: 2.8rem; font-weight: 800; font-family: 'JetBrains Mono', monospace; color: var(--danger); transition: color 0.5s; line-height: 1; }
.pe-loss-num.good { color: #4ade80; }
.pe-loss-lbl { font-size: 0.88rem; color: var(--muted); margin-top: 0.3rem; }
.pe-loss-slider label { font-size: 0.92rem; color: var(--muted); display: block; margin: 0.8rem 0 0.3rem; }
input[type=range] { width: 100%; accent-color: var(--cyan); }
.pe-loss-formula { background: var(--bg3); border-radius: 8px; padding: 1.1rem; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--text); line-height: 2.1; }
.pe-lf-hl { color: var(--gold); }
.pe-lf-res { color: var(--cyan); font-weight: bold; }
/* Attention */
.pe-attn-words { display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 0.9rem; }
.pe-attn-word { padding: 0.4rem 0.8rem; border-radius: 6px; background: var(--bg3); border: 1px solid var(--border); font-size: 1rem; cursor: pointer; transition: all 0.2s; user-select: none; }
.pe-attn-word:hover { border-color: var(--cyan); }
.pe-attn-word.sel { background: rgba(0,229,255,0.12); border-color: var(--cyan); color: var(--cyan); }
/* Attention formula cards */
.pe-qkv { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
@media(max-width:480px) { .pe-qkv { grid-template-columns: 1fr; } }
.pe-qkv-card { border-radius: 6px; padding: 0.75rem 0.9rem; font-size: 0.93rem; }
/* Temperature */
.pe-temp-row-ctrl { display: flex; align-items: center; gap: 1rem; }
.pe-temp-big { font-size: 1.6rem; font-family: 'JetBrains Mono', monospace; font-weight: 700; color: var(--cyan); width: 56px; flex-shrink: 0; }
.pe-temp-lbl { font-size: 0.92rem; color: var(--muted); margin-top: 0.2rem; }
.pe-tbars { display: flex; flex-direction: column; gap: 0.45rem; margin-top: 0.9rem; }
.pe-trow { display: flex; align-items: center; gap: 0.65rem; font-size: 0.95rem; font-family: 'JetBrains Mono', monospace; }
.pe-tlbl { width: 64px; color: var(--muted); text-align: right; flex-shrink: 0; }
.pe-ttrack { flex: 1; height: 19px; background: var(--bg3); border-radius: 4px; overflow: hidden; }
.pe-tfill { height: 100%; border-radius: 4px; background: var(--purple); transition: width 0.5s cubic-bezier(0.4,0,0.2,1); }
.pe-tpct { width: 48px; text-align: right; color: var(--muted); }
/* Limits */
.pe-limits { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1rem; margin: 1.5rem 0; }
@media(max-width:640px) { .pe-limits { grid-template-columns: 1fr; } }
.pe-limit-card { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.5rem 1.6rem; transition: border-color 0.2s; display: flex; flex-direction: column; gap: 0.4rem; }
.pe-limit-card:hover { border-color: var(--purple); }
.pe-limit-icon { font-size: 2.2rem; line-height: 1; }
.pe-limit-title { font-weight: 700; color: var(--text); font-size: 1.35rem; margin: 0; }
.pe-limit-desc { font-size: 1.15rem; color: var(--muted); line-height: 1.65; margin: 0; }
/* Dual meters */
.pe-meters { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.25rem 0; }
.pe-meter { border-radius: 8px; padding: 1.1rem; text-align: center; }
.pe-meter-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; margin-bottom: 0.4rem; }
.pe-meter-val { font-size: 2.2rem; font-weight: 800; }
/* Buttons */
.pe-btn {
margin-top: 0.9rem;
background: var(--bg3);
border: 1px solid var(--cyan);
color: var(--cyan);
padding: 0.4rem 1.1rem;
border-radius: 6px;
cursor: pointer;
font-size: 0.82rem;
font-family: 'Space Grotesk', sans-serif;
transition: background 0.2s;
}
.pe-btn:hover { background: rgba(0,229,255,0.1); }
.pe-btn-pur { border-color: var(--purple); color: var(--purple); }
.pe-btn-pur:hover { background: rgba(168,85,247,0.1); }
/* Interactive badge */
.pe-interactive-header {
display: flex;
align-items: center;
justify-content: space-between;
margin-bottom: 1rem;
}
.pe-interactive-header .pe-box-title { margin-bottom: 0; }
.pe-interactive-badge {
display: inline-flex;
align-items: center;
gap: 0.4rem;
background: rgba(0,229,255,0.08);
border: 1px solid var(--cyan);
color: var(--cyan);
font-size: 0.72rem;
font-weight: 700;
letter-spacing: 0.12em;
text-transform: uppercase;
padding: 0.3rem 0.75rem;
border-radius: 999px;
flex-shrink: 0;
}
.pe-interactive-badge::before {
content: '';
width: 7px;
height: 7px;
border-radius: 50%;
background: var(--cyan);
animation: pe-pulse 1.6s ease-in-out infinite;
flex-shrink: 0;
}
@keyframes pe-pulse {
0%, 100% { opacity: 1; transform: scale(1); }
50% { opacity: 0.4; transform: scale(0.7); }
}
.pe-interact-hint {
display: flex;
align-items: center;
gap: 0.6rem;
margin-top: 0.9rem;
padding: 0.85rem 1.1rem;
background: rgba(0,229,255,0.05);
border: 1px dashed rgba(0,229,255,0.25);
border-radius: 8px;
font-size: 1.05rem;
color: var(--muted);
}
.pe-interact-hint span { font-size: 1.25rem; }
.pe-divider { border: none; border-top: 1px solid var(--border); margin: 2.5rem 0; }
@media(max-width:520px) {
.pe-anns { grid-template-columns: 1fr; }
.pe-sm-demo { flex-direction: column; }
.pe-arrow { transform: rotate(90deg); }
}
&lt;/style&gt;
&lt;div class="pe-article"&gt;
&lt;div class="pe-video"&gt;
&lt;div class="pe-video-header"&gt;▶ Full Video Explainer — covering how LLMs work, from next-token prediction to attention, training, and why hallucinations are inevitable&lt;/div&gt;
&lt;video controls poster="/images/llms-are-pe/hero.jpg"&gt;
&lt;source src="https://curiousbit.netlify.app/images/llms-are-pe/explainer.mp4" type="video/mp4" /&gt;
Your browser doesn't support HTML5 video.
&lt;/video&gt;
&lt;/div&gt;
&lt;nav class="pe-toc"&gt;
&lt;h3&gt;In this article&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="#pe-s1"&gt;What ChatGPT and Claude actually are&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s2"&gt;The one job every LLM does&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s3"&gt;The probability formula (interactive)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s4"&gt;Softmax: turning scores into probabilities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s5"&gt;How it learns: cross-entropy loss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s6"&gt;The Transformer architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s7"&gt;Self-attention: every word watches every word&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s8"&gt;How text is actually generated&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s9"&gt;Temperature: controlling randomness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s10"&gt;Why it sometimes lies (hallucinations)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s11"&gt;Key limitations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pe-s12"&gt;What's next&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/nav&gt;
&lt;p&gt;You've used ChatGPT. You've heard the word "AI" a thousand times this year. But here's something almost nobody explains clearly: the thing powering these tools is &lt;strong&gt;not intelligent in any human sense&lt;/strong&gt;. It doesn't think. It doesn't understand. It doesn't have goals.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/llms-are-pe/hero.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap');
.pe-article {
--bg: #070b14;
--bg2: #0d1423;
--bg3: #111827;
--cyan: #00e5ff;
--purple: #a855f7;
--gold: #fbbf24;
--text: #e2e8f0;
--muted: #94a3b8;
--border: #1e293b;
--danger: #f87171;
font-family: 'Space Grotesk', system-ui, sans-serif;
font-size: 1.08rem;
line-height: 1.85;
color: var(--text);
}
/* TOC */
.pe-toc {
background: var(--bg2);
border: 1px solid var(--border);
border-left: 3px solid var(--cyan);
border-radius: 10px;
padding: 1.25rem 1.75rem;
margin: 2rem 0;
}
.pe-toc h3 {
font-size: 0.95rem;
letter-spacing: 0.18em;
text-transform: uppercase;
color: var(--cyan);
margin: 0 0 1rem;
}
.pe-toc ol { padding-left: 1.3rem; margin: 0; }
.pe-toc li { margin-bottom: 0.6rem; }
.pe-toc a { color: var(--muted); text-decoration: none; font-size: 1.15rem; font-weight: 600; transition: color 0.2s; }
.pe-toc a:hover { color: var(--cyan); }
/* Video */
.pe-video { margin: 2rem 0; border-radius: 12px; overflow: hidden; border: 1px solid var(--border); background: #000; }
.pe-video video { width: 100%; display: block; }
.pe-video-header {
background: var(--bg2);
padding: 1rem 1.4rem;
font-size: 1.15rem;
font-weight: 600;
color: var(--cyan);
border-bottom: 1px solid var(--border);
line-height: 1.5;
}
/* Typography */
.pe-article h2 {
font-size: 1.75rem;
font-weight: 700;
color: #fff;
margin: 3rem 0 0.9rem;
padding-bottom: 0.45rem;
border-bottom: 1px solid var(--border);
}
.pe-sec-num { color: var(--cyan); font-size: 1rem; font-weight: 600; display: block; margin-bottom: 0.2rem; letter-spacing: 0.1em; }
.pe-article p { margin-bottom: 1.1rem; }
.pe-article strong { color: #fff; }
.pe-em { color: var(--gold); }
/* Callouts */
.pe-callout { background: var(--bg2); border-left: 4px solid var(--purple); border-radius: 0 8px 8px 0; padding: 1.4rem 1.8rem; margin: 1.5rem 0; font-size: 1.4rem; color: var(--muted); line-height: 1.75; }
.pe-callout.cy { border-color: var(--cyan); }
.pe-callout.gd { border-color: var(--gold); }
.pe-callout strong { color: var(--text); }
/* Compare table */
.pe-table { width: 100%; border-collapse: collapse; font-size: 1rem; margin: 1.25rem 0; }
.pe-table th { text-align: left; padding: 0.7rem 1rem; background: var(--bg2); color: var(--cyan); font-size: 0.85rem; letter-spacing: 0.08em; text-transform: uppercase; border-bottom: 1px solid var(--border); }
.pe-table td { padding: 0.85rem 1rem; border-bottom: 1px solid var(--border); color: var(--muted); vertical-align: top; line-height: 1.6; }
.pe-table td:first-child { color: var(--text); font-weight: 500; }
.pe-table tr:hover td { background: var(--bg2); }
/* Formula boxes */
.pe-box { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.75rem; margin: 1.75rem 0; }
.pe-box-title { font-size: 0.95rem; letter-spacing: 0.15em; text-transform: uppercase; color: var(--purple); margin-bottom: 1rem; }
/* Anim 1 — token prediction */
.pe-sentence { font-size: 1.3rem; font-family: 'JetBrains Mono', monospace; color: var(--text); min-height: 2rem; margin-bottom: 1.1rem; }
.pe-cursor { display: inline-block; width: 2px; height: 1em; background: var(--cyan); animation: pe-blink 0.8s infinite; vertical-align: middle; margin-left: 2px; }
@keyframes pe-blink { 0%,100%{opacity:1} 50%{opacity:0} }
.pe-prob-bars { display: flex; flex-direction: column; gap: 0.55rem; }
.pe-prob-row { display: flex; align-items: center; gap: 0.8rem; font-size: 1.1rem; }
.pe-prob-lbl { width: 80px; text-align: right; color: var(--muted); font-family: 'JetBrains Mono', monospace; flex-shrink: 0; }
.pe-prob-track { flex: 1; height: 26px; background: var(--bg3); border-radius: 5px; overflow: hidden; }
.pe-prob-fill { height: 100%; background: var(--cyan); border-radius: 5px; transition: width 0.55s cubic-bezier(0.4,0,0.2,1); width: 0; }
.pe-prob-fill.win { background: var(--gold); }
.pe-prob-pct { width: 50px; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--muted); }
/* Math display */
.pe-math { font-family: 'Georgia', serif; font-size: 1.45rem; color: var(--gold); text-align: center; padding: 1.3rem; background: var(--bg3); border-radius: 8px; margin-bottom: 0.9rem; }
.pe-term { display: inline; opacity: 0; transition: opacity 0.4s; cursor: help; position: relative; }
.pe-term.on { opacity: 1; }
.pe-term:hover::after { content: attr(data-tip); position: absolute; bottom: 115%; left: 50%; transform: translateX(-50%); background: var(--bg); border: 1px solid var(--purple); color: var(--text); padding: 0.4rem 0.85rem; border-radius: 6px; font-family: 'Space Grotesk', sans-serif; font-size: 0.88rem; white-space: nowrap; z-index: 20; }
.pe-anns { display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
.pe-ann { background: var(--bg3); border-radius: 6px; padding: 0.55rem 0.85rem; font-size: 0.93rem; opacity: 0; transition: opacity 0.5s; }
.pe-ann.on { opacity: 1; }
.pe-ann-sym { color: var(--gold); font-family: 'JetBrains Mono', monospace; font-weight: bold; }
.pe-ann-desc { color: var(--muted); }
/* Softmax */
.pe-sm-demo { display: flex; gap: 1rem; align-items: flex-start; flex-wrap: wrap; }
.pe-sm-col { flex: 1; min-width: 160px; }
.pe-col-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.75rem; }
.pe-logit-row { display: flex; align-items: center; gap: 0.6rem; margin-bottom: 0.55rem; font-size: 0.97rem; font-family: 'JetBrains Mono', monospace; }
.pe-logit-w { width: 60px; color: var(--text); }
.pe-logit-v { padding: 0.22rem 0.6rem; border-radius: 4px; font-size: 0.92rem; }
.pe-logit-v.neg { background: rgba(248,113,113,0.15); color: var(--danger); }
.pe-logit-v.pos { background: rgba(0,229,255,0.1); color: var(--cyan); }
.pe-sm-bar { height: 22px; border-radius: 4px; background: var(--purple); transition: width 0.75s cubic-bezier(0.4,0,0.2,1); width: 0; display: flex; align-items: center; padding-left: 7px; font-size: 0.86rem; color: #fff; overflow: hidden; white-space: nowrap; }
.pe-arrow { display: flex; align-items: center; justify-content: center; padding-top: 1.4rem; font-size: 1.6rem; color: var(--cyan); }
/* Loss */
.pe-loss-wrap { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: start; }
@media(max-width:500px) { .pe-loss-wrap { grid-template-columns: 1fr; } }
.pe-loss-num { font-size: 2.8rem; font-weight: 800; font-family: 'JetBrains Mono', monospace; color: var(--danger); transition: color 0.5s; line-height: 1; }
.pe-loss-num.good { color: #4ade80; }
.pe-loss-lbl { font-size: 0.88rem; color: var(--muted); margin-top: 0.3rem; }
.pe-loss-slider label { font-size: 0.92rem; color: var(--muted); display: block; margin: 0.8rem 0 0.3rem; }
input[type=range] { width: 100%; accent-color: var(--cyan); }
.pe-loss-formula { background: var(--bg3); border-radius: 8px; padding: 1.1rem; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--text); line-height: 2.1; }
.pe-lf-hl { color: var(--gold); }
.pe-lf-res { color: var(--cyan); font-weight: bold; }
/* Attention */
.pe-attn-words { display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 0.9rem; }
.pe-attn-word { padding: 0.4rem 0.8rem; border-radius: 6px; background: var(--bg3); border: 1px solid var(--border); font-size: 1rem; cursor: pointer; transition: all 0.2s; user-select: none; }
.pe-attn-word:hover { border-color: var(--cyan); }
.pe-attn-word.sel { background: rgba(0,229,255,0.12); border-color: var(--cyan); color: var(--cyan); }
/* Attention formula cards */
.pe-qkv { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; }
@media(max-width:480px) { .pe-qkv { grid-template-columns: 1fr; } }
.pe-qkv-card { border-radius: 6px; padding: 0.75rem 0.9rem; font-size: 0.93rem; }
/* Temperature */
.pe-temp-row-ctrl { display: flex; align-items: center; gap: 1rem; }
.pe-temp-big { font-size: 1.6rem; font-family: 'JetBrains Mono', monospace; font-weight: 700; color: var(--cyan); width: 56px; flex-shrink: 0; }
.pe-temp-lbl { font-size: 0.92rem; color: var(--muted); margin-top: 0.2rem; }
.pe-tbars { display: flex; flex-direction: column; gap: 0.45rem; margin-top: 0.9rem; }
.pe-trow { display: flex; align-items: center; gap: 0.65rem; font-size: 0.95rem; font-family: 'JetBrains Mono', monospace; }
.pe-tlbl { width: 64px; color: var(--muted); text-align: right; flex-shrink: 0; }
.pe-ttrack { flex: 1; height: 19px; background: var(--bg3); border-radius: 4px; overflow: hidden; }
.pe-tfill { height: 100%; border-radius: 4px; background: var(--purple); transition: width 0.5s cubic-bezier(0.4,0,0.2,1); }
.pe-tpct { width: 48px; text-align: right; color: var(--muted); }
/* Limits */
.pe-limits { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1rem; margin: 1.5rem 0; }
@media(max-width:640px) { .pe-limits { grid-template-columns: 1fr; } }
.pe-limit-card { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.5rem 1.6rem; transition: border-color 0.2s; display: flex; flex-direction: column; gap: 0.4rem; }
.pe-limit-card:hover { border-color: var(--purple); }
.pe-limit-icon { font-size: 2.2rem; line-height: 1; }
.pe-limit-title { font-weight: 700; color: var(--text); font-size: 1.35rem; margin: 0; }
.pe-limit-desc { font-size: 1.15rem; color: var(--muted); line-height: 1.65; margin: 0; }
/* Dual meters */
.pe-meters { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.25rem 0; }
.pe-meter { border-radius: 8px; padding: 1.1rem; text-align: center; }
.pe-meter-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; margin-bottom: 0.4rem; }
.pe-meter-val { font-size: 2.2rem; font-weight: 800; }
/* Buttons */
.pe-btn {
margin-top: 0.9rem;
background: var(--bg3);
border: 1px solid var(--cyan);
color: var(--cyan);
padding: 0.4rem 1.1rem;
border-radius: 6px;
cursor: pointer;
font-size: 0.82rem;
font-family: 'Space Grotesk', sans-serif;
transition: background 0.2s;
}
.pe-btn:hover { background: rgba(0,229,255,0.1); }
.pe-btn-pur { border-color: var(--purple); color: var(--purple); }
.pe-btn-pur:hover { background: rgba(168,85,247,0.1); }
/* Interactive badge */
.pe-interactive-header {
display: flex;
align-items: center;
justify-content: space-between;
margin-bottom: 1rem;
}
.pe-interactive-header .pe-box-title { margin-bottom: 0; }
.pe-interactive-badge {
display: inline-flex;
align-items: center;
gap: 0.4rem;
background: rgba(0,229,255,0.08);
border: 1px solid var(--cyan);
color: var(--cyan);
font-size: 0.72rem;
font-weight: 700;
letter-spacing: 0.12em;
text-transform: uppercase;
padding: 0.3rem 0.75rem;
border-radius: 999px;
flex-shrink: 0;
}
.pe-interactive-badge::before {
content: '';
width: 7px;
height: 7px;
border-radius: 50%;
background: var(--cyan);
animation: pe-pulse 1.6s ease-in-out infinite;
flex-shrink: 0;
}
@keyframes pe-pulse {
0%, 100% { opacity: 1; transform: scale(1); }
50% { opacity: 0.4; transform: scale(0.7); }
}
.pe-interact-hint {
display: flex;
align-items: center;
gap: 0.6rem;
margin-top: 0.9rem;
padding: 0.85rem 1.1rem;
background: rgba(0,229,255,0.05);
border: 1px dashed rgba(0,229,255,0.25);
border-radius: 8px;
font-size: 1.05rem;
color: var(--muted);
}
.pe-interact-hint span { font-size: 1.25rem; }
.pe-divider { border: none; border-top: 1px solid var(--border); margin: 2.5rem 0; }
@media(max-width:520px) {
.pe-anns { grid-template-columns: 1fr; }
.pe-sm-demo { flex-direction: column; }
.pe-arrow { transform: rotate(90deg); }
}</style><div class="pe-article"><div class="pe-video"><div class="pe-video-header">▶ Full Video Explainer — covering how LLMs work, from next-token prediction to attention, training, and why hallucinations are inevitable</div><video controls poster="/images/llms-are-pe/hero.jpg"><source src="/images/llms-are-pe/explainer.mp4" type="video/mp4"/>
Your browser doesn't support HTML5 video.</video></div><nav class="pe-toc"><h3>In this article</h3><ol><li><a href="#pe-s1">What ChatGPT and Claude actually are</a></li><li><a href="#pe-s2">The one job every LLM does</a></li><li><a href="#pe-s3">The probability formula (interactive)</a></li><li><a href="#pe-s4">Softmax: turning scores into probabilities</a></li><li><a href="#pe-s5">How it learns: cross-entropy loss</a></li><li><a href="#pe-s6">The Transformer architecture</a></li><li><a href="#pe-s7">Self-attention: every word watches every word</a></li><li><a href="#pe-s8">How text is actually generated</a></li><li><a href="#pe-s9">Temperature: controlling randomness</a></li><li><a href="#pe-s10">Why it sometimes lies (hallucinations)</a></li><li><a href="#pe-s11">Key limitations</a></li><li><a href="#pe-s12">What's next</a></li></ol></nav><p>You've used ChatGPT. You've heard the word "AI" a thousand times this year. But here's something almost nobody explains clearly: the thing powering these tools is <strong>not intelligent in any human sense</strong>. It doesn't think. It doesn't understand. It doesn't have goals.</p><p>It is, at its core, a <span class="pe-em">very sophisticated next-word predictor</span>: a probability engine trained on a vast amount of text, much of it drawn from the public internet. Once you understand this, everything else — its strengths, its failures, its weirdness — clicks into place.</p><div class="pe-callout cy"><strong>Interactive animations ahead:</strong> Press buttons and move sliders as you go — seeing the math move makes it stick.</div><h2 id="pe-s1"><span class="pe-sec-num">01 —</span> What ChatGPT and Claude actually are</h2><p>The term "Artificial Intelligence" conjures images of something that thinks, reasons, and understands — a mind in a machine. 
That framing is compelling, but misleading when applied to today's large language models (LLMs).</p><p>What you're actually talking to is an <strong>autoregressive probabilistic model</strong>. Every word it generates is the result of asking one question, over and over again:</p><div class="pe-callout gd"><strong>"Given everything written so far, what word is most likely to come next?"</strong></div><p>That's it. Do that billions of times on internet-scale text, and you get something that looks uncannily like reasoning. But it is, fundamentally, pattern matching at extraordinary scale — not understanding, consciousness, or genuine intelligence.</p><table class="pe-table"><thead><tr><th>What you see</th><th>What's actually happening</th><th>The catch</th></tr></thead><tbody><tr><td>It "reasons"</td><td>Pattern-matches reasoning traces from training data</td><td>Breaks on genuinely novel problems</td></tr><tr><td>It "knows facts"</td><td>Recalls high-frequency statistical associations</td><td>Hallucinates on rare edge cases</td></tr><tr><td>It's "creative"</td><td>Samples from learned creative pattern spaces</td><td>Derivative — remixes, doesn't invent</td></tr><tr><td>It has "opinions"</td><td>Outputs tokens shaped by training + alignment</td><td>No actual beliefs internally</td></tr></tbody></table><h2 id="pe-s2"><span class="pe-sec-num">02 —</span> The one job every LLM does</h2><p>Let's make this concrete. Below is a live simulation of next-token prediction. 
Press <strong>"Predict next token"</strong> and watch the model pick the next word based on probability scores.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">🎯 Next-Token Prediction</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-sentence" id="pe-sentence">The cat sat on the<span class="pe-cursor"></span></div><div class="pe-prob-bars" id="pe-prob-bars"></div><div class="pe-interact-hint"><span>👇</span> Press the button to watch the model predict — one token at a time.</div><button class="pe-btn" onclick="pePredict()">Predict next token →</button></div><p>Notice the bars: each candidate word gets a probability score. The model doesn't "decide" in any human sense — it samples from this distribution. The highest-probability word is chosen most often, but not always. That's where both creativity and errors come from.</p><h2 id="pe-s3"><span class="pe-sec-num">03 —</span> The probability formula</h2><p>Here's the mathematical heart of it. <strong>Hover each term</strong> for a plain-English tooltip, then press the button to reveal the full breakdown piece by piece.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">📐 Probability Formula</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>🖱️</span> Hover any term for a plain-English tooltip. 
Press the button to reveal the formula step by step.</div><div class="pe-math"><span class="pe-term" id="pet0" data-tip="P = Probability of">P</span><span class="pe-term" id="pet1" data-tip="wₜ = the specific token we're predicting">(w<sub>t</sub></span><span class="pe-term" id="pet2" data-tip="| = 'given all of this before it'"> |</span><span class="pe-term" id="pet3" data-tip="w<t = every token that came before in the context"> w<sub>&lt;t</sub></span><span class="pe-term" id="pet4" data-tip="; θ = the model's billions of learned parameters"> ; θ)</span><span class="pe-term" id="pet5" data-tip="= the output we calculate"> =</span><span class="pe-term" id="pet6" data-tip="softmax converts raw scores into a proper probability distribution summing to 1"> softmax(logits<sub>t</sub>)</span></div><div class="pe-anns" id="pe-anns"><div class="pe-ann" id="pea0"><span class="pe-ann-sym">wₜ</span> — <span class="pe-ann-desc">The next token to predict</span></div><div class="pe-ann" id="pea1"><span class="pe-ann-sym">w&lt;t</span> — <span class="pe-ann-desc">All previous tokens (the context)</span></div><div class="pe-ann" id="pea2"><span class="pe-ann-sym">θ</span> — <span class="pe-ann-desc">Billions of learned parameters</span></div><div class="pe-ann" id="pea3"><span class="pe-ann-sym">softmax</span> — <span class="pe-ann-desc">Converts scores → probabilities (sum = 1)</span></div></div><button class="pe-btn" onclick="peRevealFormula()">Reveal formula step by step →</button></div><p>Plain English: <span class="pe-em">"Given everything typed so far, and everything the model learned during training, what is the probability of each possible next word?"</span> The model scores every token in its vocabulary (often 50,000 or more) and softmax turns those raw scores into probabilities that add up to exactly 1.0.</p><h2 id="pe-s4"><span class="pe-sec-num">04 —</span> Softmax: raw scores → probabilities</h2><p>The model internally produces a raw score (called a <strong>logit</strong>) for 
every possible next word. Logits can be any number — positive, negative, large, small. They're not probabilities yet. The <strong>softmax</strong> function converts them into a clean distribution. Press the button to watch the transformation.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">⚡ Softmax Transform</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>👇</span> Press the button to watch raw scores transform into probabilities.</div><div class="pe-sm-demo"><div class="pe-sm-col"><div class="pe-col-lbl">Raw Logits (scores)</div><div id="pe-logits"></div></div><div class="pe-arrow" id="pe-sm-arrow" style="opacity:0.3">→</div><div class="pe-sm-col"><div class="pe-col-lbl">After Softmax (probabilities)</div><div id="pe-softmax"></div></div></div><button class="pe-btn pe-btn-pur" onclick="peSoftmax()">Run softmax →</button></div><p>Notice: even the most negative logit still gets a small non-zero probability after softmax. The model never completely rules anything out — it just makes some words astronomically unlikely. This is partly why LLMs occasionally produce bizarre outputs: a 0.001% token still gets picked sometimes.</p><h2 id="pe-s5"><span class="pe-sec-num">05 —</span> How it learns: cross-entropy loss</h2><p>During training, the model sees a stretch of text with the next word hidden and makes a prediction. The training algorithm asks: <span class="pe-em">"How wrong were you?"</span> The measure of wrongness is <strong>cross-entropy loss</strong>.</p><p>The formula: <code style="color:var(--gold);background:var(--bg3);padding:2px 8px;border-radius:4px;font-family:'JetBrains Mono',monospace;">ℒ = −log P(correct word)</code>, where log is the natural logarithm. If the model assigns 100% probability to the right word, loss = 0. 
If it assigns 1%, the loss climbs to about 4.6. <strong>Drag the slider</strong> to see this in action.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">📉 Cross-Entropy Loss</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>🎚️</span> Drag the slider to change the model's confidence and watch the loss recalculate live.</div><div class="pe-loss-wrap"><div><div class="pe-loss-num" id="pe-loss-num">1.47</div><div class="pe-loss-lbl">Loss ℒ = −log(p)</div><div class="pe-loss-slider"><label>Model's confidence in correct word: <strong id="pe-conf-lbl">23%</strong></label><input type="range" id="pe-conf-slider" min="1" max="99" value="23" oninput="peLoss(this.value)"/></div></div><div class="pe-loss-formula">
Correct word: <span class="pe-lf-hl">"lazy"</span><br>
P("lazy"): <span class="pe-lf-hl" id="pe-lf-p">0.23</span><br><br>
ℒ = −log(<span class="pe-lf-hl" id="pe-lf-p2">0.23</span>)<br>
ℒ = <span class="pe-lf-res" id="pe-lf-res">1.47</span><br><br><span style="color:var(--muted);font-size:0.76rem;" id="pe-lf-verdict">High loss → big update</span></div></div></div><h2 id="pe-s6"><span class="pe-sec-num">06 —</span> The Transformer: the machine inside</h2><p>The specific architecture that makes modern LLMs work is called the <strong>Transformer</strong>, introduced in a landmark 2017 Google paper, "Attention Is All You Need". Every major LLM today — GPT-4, Claude, Gemini, Llama — is built on this design.</p><p>A Transformer processes your text through many stacked layers. Each layer has two main components:</p><div class="pe-callout"><strong>Multi-Head Self-Attention</strong> — Every word simultaneously looks at every other word, learning which relationships matter. This is the core insight.<br><br><strong>Feed-Forward Network</strong> — A dense neural network that processes each token's information independently, after attention has been applied.</div><p>A large model like GPT-3 stacks 96 of these layers (the layer count of GPT-4 has not been published). With enough layers, parameters, and training data, emergent abilities appear — code generation, translation, basic reasoning — that nobody explicitly programmed. They fall out of the math at scale.</p><h2 id="pe-s7"><span class="pe-sec-num">07 —</span> Self-attention: every word watches every word</h2><p>Before the Transformer, most language models processed text word by word in sequence, making it hard to connect things far apart in a sentence. Self-attention solves this by letting every word simultaneously evaluate its relationship to every other word. <strong>Click a word</strong> to see its attention weights.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">🔍 Self-Attention Weights</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-attn-words" id="pe-attn-words"></div><div id="pe-attn-grid"></div><div class="pe-interact-hint"><span>👆</span> Click any word above to see how it attends to every other word. 
Brighter = stronger attention — notice "it" lights up "animal."</div></div><div class="pe-box"><div class="pe-box-title">📐 The Attention Equation</div><div class="pe-math" style="font-size:1.05rem;">
Attention(Q, K, V) = softmax(<span style="color:var(--cyan);">QKᵀ</span> / <span style="color:var(--gold);">√d<sub>k</sub></span>) · <span style="color:var(--purple);">V</span></div><div class="pe-qkv"><div class="pe-qkv-card" style="background:rgba(0,229,255,0.07);border:1px solid rgba(0,229,255,0.2);"><div style="color:var(--cyan);font-weight:700;margin-bottom:0.3rem;font-size:1rem;">Q — Query</div><div style="color:var(--muted);font-size:0.92rem;">"What am I looking for?"</div></div><div class="pe-qkv-card" style="background:rgba(251,191,36,0.07);border:1px solid rgba(251,191,36,0.2);"><div style="color:var(--gold);font-weight:700;margin-bottom:0.3rem;font-size:1rem;">K — Key</div><div style="color:var(--muted);font-size:0.92rem;">"What does each word offer?"</div></div><div class="pe-qkv-card" style="background:rgba(168,85,247,0.07);border:1px solid rgba(168,85,247,0.2);"><div style="color:var(--purple);font-weight:700;margin-bottom:0.3rem;font-size:1rem;">V — Value</div><div style="color:var(--muted);font-size:0.92rem;">"What info do I retrieve?"</div></div></div></div><h2 id="pe-s8"><span class="pe-sec-num">08 —</span> How text is actually generated</h2><p>When you press Send in any AI chat app, here is roughly what happens:</p><ol style="padding-left:1.3rem;margin-bottom:1.1rem;"><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Tokenization</strong> — Your message splits into tokens (subwords). For example, "unbelievable" might become ["un","believ","able"]; the exact split depends on the tokenizer.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Embedding</strong> — Each token becomes a high-dimensional vector capturing meaning and position.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Forward pass</strong> — Vectors flow through all Transformer layers. 
Attention and feed-forward steps run in each one.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Logits → Probabilities</strong> — The final layer scores every vocabulary word. Softmax converts those scores to probabilities.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Sampling</strong> — One word is chosen based on those probabilities.</li><li style="margin-bottom:0.55rem;color:var(--muted);"><strong style="color:var(--text);">Repeat</strong> — That word is appended and the whole process runs again until the response is done.</li></ol><div class="pe-callout"><strong>KV Caching:</strong> The model caches Key and Value matrices from previous steps so it doesn't recompute attention from scratch every token — making long responses computationally feasible.</div><h2 id="pe-s9"><span class="pe-sec-num">09 —</span> Temperature: controlling randomness</h2><p>When sampling the next word, you can control how random the selection is with a parameter called <strong>temperature</strong>. Drag the slider to see how it reshapes the probability distribution in real time.</p><div class="pe-box"><div class="pe-interactive-header"><div class="pe-box-title">🌡️ Temperature Sampling</div><span class="pe-interactive-badge">Live · Interactive</span></div><div class="pe-interact-hint" style="margin-top:0;margin-bottom:0.9rem;"><span>🎚️</span> Drag the slider left for predictable outputs, right for creative (or chaotic) ones.</div><div class="pe-temp-row-ctrl"><div><div class="pe-temp-big" id="pe-temp-val">1.0</div><div class="pe-temp-lbl" id="pe-temp-lbl">Balanced</div></div><input type="range" id="pe-temp-slider" min="1" max="30" value="10" style="flex:1;accent-color:var(--cyan);" oninput="peTemp(this.value)"/></div><div class="pe-tbars" id="pe-tbars"></div><div style="font-size:0.92rem;color:var(--muted);margin-top:0.7rem;">Formula: p'ᵢ = pᵢ<sup>1/T</sup> / Σ(pⱼ<sup>1/T</sup>)</div></div><p>Low temperature (e.g. 
0.2) makes the model nearly deterministic: it almost always picks the top word. High temperature (e.g. 2.0) flattens the distribution, giving unusual words a real chance. Many production systems default to somewhere between 0.7 and 1.0.</p><h2 id="pe-s10"><span class="pe-sec-num">10 —</span> Why it sometimes lies (hallucinations)</h2><p>One of the most misunderstood LLM behaviors is <strong>hallucination</strong> — when the model confidently states something false. This isn't a bug to be patched away. It's a direct consequence of the architecture.</p><p>The model has no internal truth checker. No access to the real world. It only knows: <span class="pe-em">what sequence of words tends to follow this sequence of words?</span> When asked something rare or obscure, the model fills the gap with statistically plausible text — which may be completely wrong.</p><div class="pe-callout gd"><strong>Analogy:</strong> Imagine someone who has read every book in a library but never left the building. Ask what the weather is like outside — they'll give a confident, well-reasoned answer based on weather descriptions they've read. It might be completely wrong.</div><div class="pe-meters"><div class="pe-meter" style="background:rgba(248,113,113,0.08);border:1px solid var(--danger);"><div class="pe-meter-lbl" style="color:var(--danger);">Ground Truth Access</div><div class="pe-meter-val" style="color:var(--danger);">NONE</div></div><div class="pe-meter" style="background:rgba(74,222,128,0.08);border:1px solid #4ade80;"><div class="pe-meter-lbl" style="color:#4ade80;">Statistical Plausibility</div><div class="pe-meter-val" style="color:#4ade80;">HIGH</div></div></div><h2 id="pe-s11"><span class="pe-sec-num">11 —</span> Key limitations to know</h2><p>Understanding these isn't pessimism — it's how you use these tools well.</p><div class="pe-limits"><div class="pe-limit-card"><div class="pe-limit-icon">📏</div><div class="pe-limit-title">Context Window</div><div class="pe-limit-desc">Fixed memory. 
Older models: ~4K tokens. Newer: up to 1M+. Anything beyond the window is completely invisible to the model.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">🌀</div><div class="pe-limit-title">No Persistent Memory</div><div class="pe-limit-desc">Every conversation starts completely fresh. The model has no memory of past sessions unless you explicitly provide them.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">🎲</div><div class="pe-limit-title">Stochasticity</div><div class="pe-limit-desc">Same prompt, potentially different outputs. The sampling process is inherently random, even at low temperatures.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">🔓</div><div class="pe-limit-title">Jailbreaks</div><div class="pe-limit-desc">Safety training is pattern-based. Clever prompting can sometimes bypass it because the model is still a pattern matcher at heart.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">💭</div><div class="pe-limit-title">Hallucinations</div><div class="pe-limit-desc">Inevitable on low-frequency knowledge. No fact-checker means confident errors are always possible. Verify important claims.</div></div><div class="pe-limit-card"><div class="pe-limit-icon">⚡</div><div class="pe-limit-title">Quadratic Cost</div><div class="pe-limit-desc">Attention cost grows quadratically with context length. 
Techniques like FlashAttention mitigate this, but it's a fundamental constraint.</div></div></div><h2 id="pe-s12"><span class="pe-sec-num">12 —</span> What's next</h2><p>The probability-engine core remains — but researchers are building powerful layers on top. <strong>RAG (Retrieval-Augmented Generation)</strong> gives the model access to real documents at query time, which can sharply reduce hallucination on factual tasks. <strong>Agentic systems</strong> let LLMs use tools, execute code, and iterate on their outputs. <strong>Reasoning models</strong> generate long internal chains of thought before answering, improving performance on math and logic. And <strong>multimodal models</strong> extend the same probabilistic core to images and audio.</p><p>None of these change the fundamental nature of what an LLM is. They all sit on top of the same next-token prediction engine. Understanding that foundation is what makes you a sharper thinker about where this technology is — and isn't — going.</p><div class="pe-callout cy"><strong>The bottom line:</strong> LLMs are extraordinary pattern-recognition engines that have scaled statistical prediction to the point of producing genuinely useful, sometimes astonishing outputs. They are not intelligent in any human sense. Knowing this — really knowing it — is what separates clear thinking about AI from hype.</div><hr class="pe-divider"/><p style="color:var(--muted);font-size:0.95rem;">Video generated with Grok Imagine. Animations built with vanilla JavaScript.</p></div><script>
(function() {
// ── Token Prediction ──
const peSeqs = [
{ prefix: "The cat sat on the", cands: [
{w:"mat",p:62,win:true},{w:"floor",p:18},{w:"rug",p:11},{w:"roof",p:5},{w:"couch",p:4}
]},
{ prefix: "The cat sat on the mat and", cands: [
{w:"looked",p:41,win:true},{w:"waited",p:28},{w:"purred",p:17},{w:"slept",p:9},{w:"yawned",p:5}
]},
{ prefix: "The cat sat on the mat and looked", cands: [
{w:"up",p:54,win:true},{w:"around",p:22},{w:"out",p:14},{w:"away",p:7},{w:"back",p:3}
]},
{ prefix: "The cat sat on the mat and looked up at", cands: [
{w:"the",p:58,win:true},{w:"me",p:22},{w:"nothing",p:11},{w:"her",p:6},{w:"him",p:3}
]},
];
var peIdx = 0;
function peRenderBars(cands) {
var c = document.getElementById('pe-prob-bars');
if (!c) return;
c.innerHTML = '';
cands.forEach(function(cd, i) {
var row = document.createElement('div');
row.className = 'pe-prob-row';
row.innerHTML = '<div class="pe-prob-lbl">'+cd.w+'</div><div class="pe-prob-track"><div class="pe-prob-fill'+(cd.win?' win':'')+'" id="pepf'+i+'"></div></div><div class="pe-prob-pct">'+cd.p+'%</div>';
c.appendChild(row);
});
setTimeout(function() {
cands.forEach(function(cd, i) {
var el = document.getElementById('pepf'+i);
if (el) el.style.width = cd.p+'%';
});
}, 60);
}
window.pePredict = function() {
var seq = peSeqs[peIdx % peSeqs.length];
peRenderBars(seq.cands);
setTimeout(function() {
var winner = seq.cands.find(function(c){return c.win;});
var el = document.getElementById('pe-sentence');
if (el) el.innerHTML = seq.prefix+' <span style="color:var(--gold);font-weight:bold;">'+winner.w+'</span><span class="pe-cursor"></span>';
}, 800);
peIdx++;
};
peRenderBars(peSeqs[0].cands);
// ── Formula Reveal ──
var peTerms = ['pet0','pet1','pet2','pet3','pet4','pet5','pet6'];
var peAnns = ['pea0','pea1','pea2','pea3'];
window.peRevealFormula = function() {
peTerms.forEach(function(id){var el=document.getElementById(id);if(el)el.classList.remove('on');});
peAnns.forEach(function(id){var el=document.getElementById(id);if(el)el.classList.remove('on');});
peTerms.forEach(function(id, i){ setTimeout(function(){ var el=document.getElementById(id);if(el)el.classList.add('on'); }, i*280); });
peAnns.forEach(function(id, i){ setTimeout(function(){ var el=document.getElementById(id);if(el)el.classList.add('on'); }, 2100+i*230); });
};
// ── Softmax ──
var peSMData = [
{w:'mat',l:4.2},{w:'floor',l:1.8},{w:'rug',l:0.9},{w:'table',l:-0.4},{w:'sky',l:-2.1}
];
(function initSM() {
var li = document.getElementById('pe-logits');
var si = document.getElementById('pe-softmax');
if (!li||!si) return;
li.innerHTML = '';
si.innerHTML = '';
peSMData.forEach(function(d,i) {
li.innerHTML += '<div class="pe-logit-row"><span class="pe-logit-w">'+d.w+'</span><span class="pe-logit-v '+(d.l<0?'neg':'pos')+'">'+(d.l>0?'+':'')+d.l+'</span></div>';
si.innerHTML += '<div class="pe-logit-row"><span class="pe-logit-w">'+d.w+'</span><div class="pe-sm-bar" id="pesm'+i+'"></div></div>';
});
})();
window.peSoftmax = function() {
var exps = peSMData.map(function(d){return Math.exp(d.l);});
var sum = exps.reduce(function(a,b){return a+b;},0);
var probs = exps.map(function(e){return e/sum;});
var arrow = document.getElementById('pe-sm-arrow');
if (arrow) arrow.style.opacity = '1';
probs.forEach(function(p,i){
setTimeout(function(){
var bar = document.getElementById('pesm'+i);
if (!bar) return;
bar.style.width = Math.max(p*150,0)+'px';
bar.textContent = (p*100).toFixed(1)+'%';
}, i*140);
});
};
// ── Loss ──
window.peLoss = function(val) {
var p = val/100;
var loss = -Math.log(p);
var cl = document.getElementById('pe-conf-lbl');
var ln = document.getElementById('pe-loss-num');
var lp = document.getElementById('pe-lf-p');
var lp2 = document.getElementById('pe-lf-p2');
var lr = document.getElementById('pe-lf-res');
var lv = document.getElementById('pe-lf-verdict');
if(cl) cl.textContent = val+'%';
if(ln){ ln.textContent = loss.toFixed(2); ln.classList.toggle('good', loss < 0.5); }
if(lp) lp.textContent = p.toFixed(2);
if(lp2) lp2.textContent = p.toFixed(2);
if(lr) lr.textContent = loss.toFixed(2);
if(lv) lv.textContent = loss <= 0.5 ? 'Low loss → small parameter update ✓' : 'High loss → big parameter update ↑';
};
// ── Attention ──
var peAttnWords = ["The","animal","didn't","cross","the","street","because","it","was","tired"];
var peAttnW = {
"The": [0.5,0.1,0.05,0.05,0.1,0.05,0.05,0.05,0.03,0.02],
"animal": [0.05,0.55,0.05,0.05,0.05,0.05,0.05,0.1,0.03,0.02],
"didn't": [0.04,0.08,0.5,0.08,0.04,0.08,0.06,0.05,0.05,0.02],
"cross": [0.03,0.05,0.08,0.5,0.03,0.12,0.08,0.05,0.04,0.02],
"the": [0.08,0.05,0.04,0.05,0.5,0.12,0.05,0.04,0.05,0.02],
"street": [0.04,0.05,0.06,0.12,0.1,0.45,0.06,0.05,0.05,0.02],
"because": [0.03,0.06,0.08,0.08,0.03,0.07,0.45,0.1,0.07,0.03],
"it": [0.03,0.38,0.07,0.06,0.03,0.06,0.1,0.15,0.09,0.03],
"was": [0.03,0.06,0.05,0.05,0.03,0.05,0.07,0.1,0.5,0.06],
"tired": [0.03,0.08,0.05,0.05,0.03,0.05,0.07,0.12,0.1,0.42],
};
var peSel = "it";
function peRenderAttn() {
var wc = document.getElementById('pe-attn-words');
if (!wc) return;
wc.innerHTML = '';
peAttnWords.forEach(function(w) {
var el = document.createElement('div');
el.className = 'pe-attn-word'+(w===peSel?' sel':'');
el.textContent = w;
el.onclick = function(){ peSel = w; peRenderAttn(); };
wc.appendChild(el);
});
var weights = peAttnW[peSel] || peAttnW["it"];
var wrap = document.getElementById('pe-attn-grid');
if (!wrap) return;
wrap.innerHTML = '';
var n = peAttnWords.length;
var grid = document.createElement('div');
grid.style.display = 'grid';
grid.style.gridTemplateColumns = 'repeat('+n+', 1fr)';
grid.style.gap = '3px';
weights.forEach(function(wt, j) {
var cell = document.createElement('div');
cell.style.height = '26px';
cell.style.borderRadius = '3px';
cell.style.background = 'rgba(0,229,255,'+wt+')';
cell.style.transition = 'background 0.4s';
cell.title = peSel+' → '+peAttnWords[j]+': '+(wt*100).toFixed(0)+'%';
grid.appendChild(cell);
});
var labelRow = document.createElement('div');
labelRow.style.display = 'grid';
labelRow.style.gridTemplateColumns = 'repeat('+n+', 1fr)';
labelRow.style.gap = '3px';
labelRow.style.marginTop = '4px';
peAttnWords.forEach(function(w) {
var lbl = document.createElement('div');
lbl.textContent = w;
lbl.style.fontSize = '0.6rem';
lbl.style.color = 'var(--muted)';
lbl.style.textAlign = 'center';
lbl.style.overflow = 'hidden';
lbl.style.textOverflow = 'ellipsis';
labelRow.appendChild(lbl);
});
wrap.appendChild(grid);
wrap.appendChild(labelRow);
}
peRenderAttn();
// ── Temperature ──
var peTempBase = [
{w:'mat',p:0.52},{w:'floor',p:0.22},{w:'rug',p:0.13},{w:'table',p:0.08},{w:'sky',p:0.03},{w:'cloud',p:0.02}
];
(function initTemp() {
var c = document.getElementById('pe-tbars');
if (!c) return;
c.innerHTML = '';
peTempBase.forEach(function(d,i) {
c.innerHTML += '<div class="pe-trow"><span class="pe-tlbl">'+d.w+'</span><div class="pe-ttrack"><div class="pe-tfill" id="petf'+i+'"></div></div><span class="pe-tpct" id="petp'+i+'">—</span></div>';
});
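// Deferred: window.peTemp is only assigned further down, so calling it here directly would throw.
setTimeout(function() { window.peTemp(10); }, 0);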
})();
window.peTemp = function(val) {
var T = val/10;
var dv = document.getElementById('pe-temp-val');
var dl = document.getElementById('pe-temp-lbl');
if (dv) dv.textContent = T.toFixed(1);
if (dl) {
if (T<0.5) dl.textContent = '🧊 Deterministic';
else if (T<0.8) dl.textContent = '🔵 Conservative';
else if (T<1.3) dl.textContent = '⚖️ Balanced — sweet spot';
else if (T<2.0) dl.textContent = '🔥 Creative';
else dl.textContent = '🌋 Chaotic';
}
var scaled = peTempBase.map(function(d){return Math.pow(d.p,1/T);});
var sum = scaled.reduce(function(a,b){return a+b;},0);
var probs = scaled.map(function(s){return s/sum;});
probs.forEach(function(p,i){
var f = document.getElementById('petf'+i);
var t = document.getElementById('petp'+i);
if(f) f.style.width = (p*100)+'%';
if(t) t.textContent = (p*100).toFixed(1)+'%';
});
};
})();
</script>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/llms-are-pe/hero.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>llm</category><category>machine-learning</category><category>deep-learning</category><category>architecture</category><category>Knowledge Base</category></item><item><title>The Minion Who Wanted a Touchscreen</title><link>https://curiousbit.netlify.app/the-minion-who-wanted-a-touchscreen-glide/</link><guid isPermaLink="true">https://curiousbit.netlify.app/the-minion-who-wanted-a-touchscreen-glide/</guid><pubDate>Sat, 06 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;h2 id="chapter-1--the-problem"&gt;Chapter 1 — The Problem&lt;/h2&gt;
&lt;p&gt;We live in a world of glass that responds to us. Tap an iPhone, it opens. Swipe an iPad, it scrolls.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/Minion/hero.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><h2 id="chapter-1--the-problem">Chapter 1 — The Problem</h2><p>We live in a world of glass that responds to us. Tap an iPhone, it opens. Swipe an iPad, it scrolls.</p><p>But when I reach out to my MacBook display? Nothing. Just fingerprints.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/ANIM-001.mp4" type="video/mp4"/></video></p><p>The interaction model is broken. We expect a simple flow: <strong><code>Finger</code> ↓ <code>Screen</code> ↓ <code>Content Moves</code></strong></p><p>But reality on a Mac is: <strong><code>Finger</code> ↓ <code>Screen</code> ↓ <code>Nothing Happens</code></strong></p><h2 id="chapter-2--the-crazy-idea">Chapter 2 — The Crazy Idea</h2><p>I didn&rsquo;t want to buy an external touchscreen, and I certainly didn&rsquo;t want a clunky piece of hardware taped to my bezel. I just wanted my Mac to understand when my finger was moving across it.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-002.mp4" type="video/mp4"/></video></p><p>What if we could create an invisible sheet of glass in front of the display?</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-003.mp4" type="video/mp4"/></video></p><h2 id="chapter-3--building-an-invisible-touchscreen">Chapter 3 — Building an Invisible Touchscreen</h2><p>To make this work without lag, it couldn&rsquo;t be a Python script running in a terminal. It had to be a native macOS menu bar app, tapping directly into the Apple Neural Engine.
Here is how the v2 architecture flows:</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/DIAG-002.mp4" type="video/mp4"/></video></p><p>To explain this, let&rsquo;s meet the engineering team:</p><p><strong>Camera Manager</strong><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-004.mp4" type="video/mp4"/></video>
Captures frames from an <code>AVCaptureSession</code> at a fixed 30 or 60 fps, with actor isolation.</p><p><strong>Vision Pipeline</strong><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-005.mp4" type="video/mp4"/></video>
Uses <code>VNDetectHumanHandPoseRequest</code> to identify the index finger in the frame.</p><p><strong>Hand Tracker</strong><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-006.mp4" type="video/mp4"/></video>
Applies a centroid-based lock (0.15 threshold) so the system doesn&rsquo;t jump between hands.</p><p><strong>Scroll Engine</strong><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-007.mp4" type="video/mp4"/></video>
A GestureStateMachine (Idle → Dwelling → TouchActive → Releasing) that posts FPS-normalised scroll events directly into the macOS HID system.</p><h2 id="chapter-4--why-gesture-control-is-wrong">Chapter 4 — Why Gesture Control Is Wrong</h2><p>The first instinct when building camera control is to use gestures.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-008.mp4" type="video/mp4"/></video></p><p>If I have to learn sign language to read an article, the tool failed. I didn&rsquo;t want gesture recognition. I wanted <strong>direct manipulation</strong>.</p><p>With Glide:</p><ol><li>Point finger.</li><li>Touch page.</li><li>Move page.</li></ol><h2 id="chapter-5--the-ux-rabbit-hole">Chapter 5 — The UX Rabbit Hole</h2><p>If you&rsquo;ve ever used a bad gesture system, you know the feeling.</p><p>Imagine hovering your hand in mid-air. In a typical gesture system, you enter a &ldquo;touch zone&rdquo; and then are forced to wait. A mandatory 200-millisecond delay kicks in just to confirm your intent before the system finally registers the action. That tiny fraction of a second feels sluggish, unnatural, and deeply frustrating.</p><p>Now imagine a true physical touchscreen. The moment your finger touches the glass, the response is instant. No waiting. No unnatural pausing. The digital content tracks perfectly and immediately with your physical movement.</p><p>The difference between a gimmick and a tool is latency.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/DIAG-003.mp4" type="video/mp4"/></video></p><h2 id="chapter-6--the-touch-plane">Chapter 6 — The Touch Plane</h2><p>To make it feel like a touchscreen, we had to invent a virtual screen floating exactly 10 inches in front of the actual screen.</p><p>Imagine a strict, invisible boundary hovering parallel to your MacBook display.
Until your finger crosses that exact depth threshold, the system entirely ignores your movements. But the moment your fingertip pierces that invisible layer, the interface wakes up—locking onto your finger&rsquo;s precise coordinates and translating them into immediate, pixel-perfect scrolling.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/ANIM-004.mp4" type="video/mp4"/></video></p><h2 id="chapter-7--the-tech-stack">Chapter 7 — The Tech Stack</h2><p>To run a continuous computer vision pipeline without setting a MacBook on fire, the tech stack had to be heavily optimized.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-010.mp4" type="video/mp4"/></video></p><p><strong>Why Swift?</strong></p><p>Using Swift gave us direct access to native macOS APIs and the Vision Framework. It meant better latency, memory safety, and lower power consumption since the Neural Engine does the heavy lifting.</p><p><strong>Tech Stack Callout:</strong></p><ul><li><strong>Language:</strong> Swift 5.9+ (Strict concurrency)</li><li><strong>UI:</strong> SwiftUI</li><li><strong>Computer Vision:</strong> Vision Framework</li><li><strong>Event Injection:</strong> Quartz Event Services</li><li><strong>Architecture:</strong> Menu Bar App</li><li><strong>Platform:</strong> Apple Silicon (M-series)</li></ul><h2 id="chapter-8--the-real-challenge">Chapter 8 — The Real Challenge</h2><p>The hardest problem wasn&rsquo;t the computer vision or the math.</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-011.mp4" type="video/mp4"/></video></p><p>The hardest problem was comfort. Humans aren&rsquo;t built to hold their arms out straight for eight hours a day.</p><p>Our Test Matrix (from the v1.0 Test Plan) reflected this.
We had to pass:</p><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-012.mp4" type="video/mp4"/></video></p><h2 id="chapter-9--what-success-looks-like">Chapter 9 — What Success Looks Like</h2><p><video autoplay loop muted playsinline style="width:100%; border-radius:8px; margin:1.5rem 0;"><source src="/images/Minion/IMG-013.mp4" type="video/mp4"/></video></p><h2 id="final-section">Final Section</h2><p>The journey doesn&rsquo;t stop at scrolling.</p><p><strong>Today</strong> ↓ Scrolling<br><strong>Tomorrow</strong> ↓ Zoom ↓ Click ↓ Window Management ↓ Presentations</p><p>At Aera, we are building a camera-native interaction layer for macOS. Glide is just the beginning.</p><h2 id="an-honest-postscript--into-cold-storage">An Honest Postscript — Into Cold Storage</h2><p>I&rsquo;ll be straight about where this actually ended: I left Glide midway. It&rsquo;s in cold storage, not shipped.</p><p>A few things stacked up. Token limits and the sheer time Antigravity took to grind through each problem made every iteration expensive. And the LLM never managed the one breakthrough that mattered — the scrolling logic. We could detect the hand, lock onto the finger, calibrate the touch plane, and light up the debug HUD, but turning that into scrolling that felt genuinely like a touchscreen stayed just out of reach. At some point I chose to stop rather than keep forcing it.</p><p>But I don&rsquo;t count it as wasted. I set out to treat this like a proper software-engineering project rather than a weekend hack — a real architecture, a menu-bar app instead of a terminal script, and detailed documentation that a serious project would keep: a maintained CHANGELOG, a test plan and test matrix, an architecture and a gesture spec. That discipline is the thing I&rsquo;m taking with me. Even shelved, the project is fully legible — anyone (including future me) can pick it up and know exactly where it stands and why.
That&rsquo;s the real learning: good documentation is what lets a project survive being put down.</p><div style="background:linear-gradient(135deg,#10151f,#161d2b);border:1px solid rgba(96,165,250,0.28);border-radius:12px;padding:36px 32px;margin:44px 0;text-align:center;"><h2 style="margin:0 0 12px;color:#fff;">See it on the bench — the screenshots</h2><p style="color:#c7d0df;max-width:620px;margin:0 auto 24px;font-size:1.08rem;line-height:1.7;">Seven captioned frames from the build: the Xcode project and its CHANGELOG, the AI agent wrestling with the scroll logic, the full five-step onboarding flow, and the live debug HUD.</p><a href="/glide-screens.html" target="_blank" rel="noopener" style="display:inline-block;background:#3b82f6;color:#fff;font-weight:700;padding:15px 32px;border-radius:6px;font-size:1.02rem;">Open the screenshot gallery →</a></div>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/Minion/hero.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>swift</category><category>computer-vision</category><category>macos</category><category>Projects</category></item><item><title>Aether, Grown Wild — The Implementation Journey (v2.6 → v2.8.2)</title><link>https://curiousbit.netlify.app/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/</link><guid isPermaLink="true">https://curiousbit.netlify.app/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/</guid><pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;link href="https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&amp;family=JetBrains+Mono:wght@400;600&amp;display=swap" rel="stylesheet"&gt;
&lt;style&gt;
/* ─────────────────────────────────────────────
JUNGLE BOOK PALETTE — scoped to .aether-journey
───────────────────────────────────────────── */
.aether-journey {
--bg: #0c1410;
--bg2: #111d16;
--bg3: #16271c;
--jungle: #1f3d2b;
--jungle2: #2c5239;
--olive: #8a9a52;
--olive-dk: #4a5429;
--gold: #d9a521;
--gold-lt: #f0c95a;
--gold-dk: #8a6610;
--clay: #a9683f;
--clay-lt: #c98a5b;
--cream: #e7e0cf;
--muted: #aab694;
--rule: rgba(217,165,33,0.20);
--rule2: rgba(125,139,74,0.24);
font-family: 'Noto Sans', sans-serif;
font-size: 21px;
line-height: 1.85;
color: var(--cream);
}
.aether-journey a { color:var(--gold); text-decoration:none; border-bottom:1px solid rgba(217,165,33,0.35); }
.aether-journey a:hover { color:var(--gold-lt); border-color:var(--gold-lt); }
/* ── Hero ── */
.aj-hero { border-bottom:1px solid var(--rule); position:relative; overflow:hidden; min-height:540px; display:flex; align-items:flex-end; border-radius:12px; margin-bottom:40px; }
.aj-hero-video { position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:1; }
.aj-hero-scrim { position:absolute; inset:0; z-index:2; pointer-events:none;
background:
linear-gradient(180deg, rgba(12,20,16,0.25) 0%, rgba(12,20,16,0.50) 50%, rgba(12,20,16,0.93) 100%),
radial-gradient(900px 480px at 16% 100%, rgba(31,61,43,0.55), transparent 70%); }
.aj-hero-in { padding:74px 26px 60px; max-width:960px; margin:0 auto; position:relative; z-index:3; width:100%; }
.aj-hero-meta { display:flex; gap:22px; flex-wrap:wrap; margin-top:30px; font-size:1rem; color:var(--muted); }
.aj-hero-meta strong { color:var(--gold); font-weight:600; }
.aj-hero-meta span:not(:last-child)::after { content:"·"; margin-left:22px; color:var(--olive-dk); }
.aj-kicker { display:inline-block; font-size:.84rem; letter-spacing:.2em; text-transform:uppercase; color:var(--bg); background:var(--gold); padding:6px 15px; border-radius:3px; font-weight:700; margin-bottom:22px; }
/* ── Typography ── */
.aether-journey .aj-label { font-size:.88rem; letter-spacing:.22em; text-transform:uppercase; color:var(--gold); font-weight:600; margin:54px 0 12px; }
.aether-journey h2 { font-family:'Noto Serif',serif; font-size:clamp(1.85rem,3.5vw,2.6rem); color:#fff; font-weight:700; margin:8px 0 24px; line-height:1.25; }
.aether-journey h3 { font-family:'Noto Serif',serif; font-size:1.45rem; color:var(--gold-lt); margin:34px 0 12px; }
.aether-journey p { color:var(--cream); margin:0 0 24px; }
.aether-journey .aj-lead { font-size:1.4rem; color:#fff; line-height:1.7; }
.aether-journey em.q { color:var(--clay-lt); font-style:italic; }
.aether-journey strong { color:#fff; font-weight:700; }
.aj-hero h1 { font-family:'Noto Serif',serif; font-size:clamp(2.6rem,5.5vw,4rem); line-height:1.14; color:#fff; font-weight:700; }
/* ── Lists — em-dash ── */
.aether-journey ul.leaf { list-style:none; padding:0; margin:0 0 24px; }
.aether-journey ul.leaf li { padding:0 0 20px 34px; position:relative; color:var(--cream); font-size:1.35rem; line-height:1.8; }
.aether-journey ul.leaf li::before { content:"—"; position:absolute; left:0; color:var(--olive); font-weight:700; }
.aether-journey ul.leaf li strong { color:#fff; }
/* ── Recap note ── */
.aj-recap { background:var(--bg2); border-left:3px solid var(--clay); border-radius:0 6px 6px 0; padding:22px 28px; margin:0 0 40px; }
.aj-recap p { margin:0; font-size:1.12rem; color:var(--muted); line-height:1.7; }
/* ── Pull quote ── */
.aj-pull { font-family:'Noto Serif',serif; font-size:1.6rem; line-height:1.55; color:var(--gold-lt); font-style:italic; border-left:3px solid var(--gold); padding:8px 0 8px 28px; margin:40px 0; }
/* ── Stat strip ── */
.aj-statbar { display:grid; grid-template-columns:repeat(4,1fr); gap:1px; background:var(--rule2); border:1px solid var(--rule2); margin:40px 0; border-radius:8px; overflow:hidden; }
.aj-stat { background:var(--bg2); padding:28px 18px; text-align:center; }
.aj-stat .num { font-family:'Noto Serif',serif; font-size:2.2rem; color:var(--gold); display:block; line-height:1.1; }
.aj-stat .lbl { font-size:.82rem; letter-spacing:.08em; color:var(--muted); text-transform:uppercase; margin-top:8px; }
/* ── Figures ── */
.aj-visual { margin:36px 0 12px; border-radius:10px; overflow:hidden; border:1px solid var(--rule2); background:var(--bg2); }
.aj-visual svg { display:block; width:100%; height:auto; }
.aj-cap { font-size:1.25rem; color:var(--muted); text-align:center; margin:0 0 36px; font-style:italic; line-height:1.6; }
/* ── Code / formula ── */
.aether-journey pre { background:#07100b; border:1px solid var(--rule2); border-radius:8px; padding:22px 24px; overflow-x:auto; margin:28px 0; font-family:'JetBrains Mono',monospace; font-size:.95rem; line-height:1.75; color:var(--cream); }
.aether-journey pre .c { color:var(--muted); }
.aether-journey pre .k { color:var(--gold-lt); }
.aether-journey pre .s { color:var(--olive); }
.aether-journey code.inline { font-family:'JetBrains Mono',monospace; font-size:1em; background:rgba(125,139,74,.16); color:var(--gold-lt); padding:2px 8px; border-radius:4px; border-bottom:none; }
.aj-formula { background:var(--bg3); border:1px solid var(--rule); border-radius:8px; padding:24px 28px; margin:30px 0; font-family:'JetBrains Mono',monospace; font-size:1.05rem; color:var(--gold-lt); text-align:center; line-height:2; }
.aj-formula span { color:var(--muted); }
/* ── Timeline ── */
.aj-timeline { position:relative; margin:34px 0 12px; padding-left:6px; }
.aj-timeline::before { content:''; position:absolute; left:19px; top:8px; bottom:8px; width:2px; background:linear-gradient(var(--olive),var(--clay)); }
.aj-tl { display:grid; grid-template-columns:46px 1fr; gap:18px; margin-bottom:26px; }
.aj-tl-dot { width:44px; height:44px; border-radius:50%; background:var(--bg2); border:2px solid var(--gold); display:flex; align-items:center; justify-content:center; font-size:.74rem; font-family:'JetBrains Mono',monospace; color:var(--gold); font-weight:600; position:relative; z-index:1; }
.aj-tl-body h3 { font-size:1.25rem; color:var(--gold-lt); margin:8px 0 6px; }
.aj-tl-body p { font-size:1.1rem; color:var(--muted); margin:0; line-height:1.7; }
/* ── CTA ── */
.aj-cta { background:linear-gradient(135deg,var(--jungle),var(--bg2)); border:1px solid var(--gold-dk); border-radius:12px; padding:40px 34px; margin:48px 0; text-align:center; }
.aj-cta h2 { margin-bottom:12px; }
.aj-cta p { color:var(--cream); max-width:640px; margin:0 auto 24px; font-size:1.15rem; line-height:1.7; }
.aj-btn { display:inline-block; background:var(--gold); color:var(--bg); font-weight:700; padding:16px 34px; border-radius:6px; letter-spacing:.02em; font-size:1.05rem; border:none; transition:transform .15s ease, background .15s ease; }
.aj-btn:hover { background:var(--gold-lt); color:var(--bg); transform:translateY(-2px); }
@media (max-width:640px) {
.aj-statbar { grid-template-columns:repeat(2,1fr); }
}
&lt;/style&gt;
&lt;div class="aether-journey"&gt;
&lt;div class="aj-hero"&gt;
&lt;video class="aj-hero-video" autoplay loop muted playsinline poster="/images/ather22.jpg"&gt;
&lt;source src="https://curiousbit.netlify.app/images/ather2.mp4" type="video/mp4"&gt;
&lt;/video&gt;
&lt;div class="aj-hero-scrim"&gt;&lt;/div&gt;
&lt;div class="aj-hero-in"&gt;
&lt;span class="aj-kicker"&gt;Implementation Journey · Part II&lt;/span&gt;
&lt;h1&gt;Aether, Grown Wild&lt;/h1&gt;
&lt;p class="aj-lead" style="margin-top:22px;"&gt;The first article was implementing the base idea. This one is the expedition — how the idea evolved and things got added as we moved forward, growing into a 13-agent, web-first, self-escalating system, and every bug in the undergrowth that shaped it.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ather22.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><link href="https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet"><style>
/* ─────────────────────────────────────────────
JUNGLE BOOK PALETTE — scoped to .aether-journey
───────────────────────────────────────────── */
.aether-journey {
--bg: #0c1410;
--bg2: #111d16;
--bg3: #16271c;
--jungle: #1f3d2b;
--jungle2: #2c5239;
--olive: #8a9a52;
--olive-dk: #4a5429;
--gold: #d9a521;
--gold-lt: #f0c95a;
--gold-dk: #8a6610;
--clay: #a9683f;
--clay-lt: #c98a5b;
--cream: #e7e0cf;
--muted: #aab694;
--rule: rgba(217,165,33,0.20);
--rule2: rgba(125,139,74,0.24);
font-family: 'Noto Sans', sans-serif;
font-size: 21px;
line-height: 1.85;
color: var(--cream);
}
.aether-journey a { color:var(--gold); text-decoration:none; border-bottom:1px solid rgba(217,165,33,0.35); }
.aether-journey a:hover { color:var(--gold-lt); border-color:var(--gold-lt); }
/* ── Hero ── */
.aj-hero { border-bottom:1px solid var(--rule); position:relative; overflow:hidden; min-height:540px; display:flex; align-items:flex-end; border-radius:12px; margin-bottom:40px; }
.aj-hero-video { position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:1; }
.aj-hero-scrim { position:absolute; inset:0; z-index:2; pointer-events:none;
background:
linear-gradient(180deg, rgba(12,20,16,0.25) 0%, rgba(12,20,16,0.50) 50%, rgba(12,20,16,0.93) 100%),
radial-gradient(900px 480px at 16% 100%, rgba(31,61,43,0.55), transparent 70%); }
.aj-hero-in { padding:74px 26px 60px; max-width:960px; margin:0 auto; position:relative; z-index:3; width:100%; }
.aj-hero-meta { display:flex; gap:22px; flex-wrap:wrap; margin-top:30px; font-size:1rem; color:var(--muted); }
.aj-hero-meta strong { color:var(--gold); font-weight:600; }
.aj-hero-meta span:not(:last-child)::after { content:"·"; margin-left:22px; color:var(--olive-dk); }
.aj-kicker { display:inline-block; font-size:.84rem; letter-spacing:.2em; text-transform:uppercase; color:var(--bg); background:var(--gold); padding:6px 15px; border-radius:3px; font-weight:700; margin-bottom:22px; }
/* ── Typography ── */
.aether-journey .aj-label { font-size:.88rem; letter-spacing:.22em; text-transform:uppercase; color:var(--gold); font-weight:600; margin:54px 0 12px; }
.aether-journey h2 { font-family:'Noto Serif',serif; font-size:clamp(1.85rem,3.5vw,2.6rem); color:#fff; font-weight:700; margin:8px 0 24px; line-height:1.25; }
.aether-journey h3 { font-family:'Noto Serif',serif; font-size:1.45rem; color:var(--gold-lt); margin:34px 0 12px; }
.aether-journey p { color:var(--cream); margin:0 0 24px; }
.aether-journey .aj-lead { font-size:1.4rem; color:#fff; line-height:1.7; }
.aether-journey em.q { color:var(--clay-lt); font-style:italic; }
.aether-journey strong { color:#fff; font-weight:700; }
.aj-hero h1 { font-family:'Noto Serif',serif; font-size:clamp(2.6rem,5.5vw,4rem); line-height:1.14; color:#fff; font-weight:700; }
/* ── Lists — em-dash ── */
.aether-journey ul.leaf { list-style:none; padding:0; margin:0 0 24px; }
.aether-journey ul.leaf li { padding:0 0 20px 34px; position:relative; color:var(--cream); font-size:1.35rem; line-height:1.8; }
.aether-journey ul.leaf li::before { content:"—"; position:absolute; left:0; color:var(--olive); font-weight:700; }
.aether-journey ul.leaf li strong { color:#fff; }
/* ── Recap note ── */
.aj-recap { background:var(--bg2); border-left:3px solid var(--clay); border-radius:0 6px 6px 0; padding:22px 28px; margin:0 0 40px; }
.aj-recap p { margin:0; font-size:1.12rem; color:var(--muted); line-height:1.7; }
/* ── Pull quote ── */
.aj-pull { font-family:'Noto Serif',serif; font-size:1.6rem; line-height:1.55; color:var(--gold-lt); font-style:italic; border-left:3px solid var(--gold); padding:8px 0 8px 28px; margin:40px 0; }
/* ── Stat strip ── */
.aj-statbar { display:grid; grid-template-columns:repeat(4,1fr); gap:1px; background:var(--rule2); border:1px solid var(--rule2); margin:40px 0; border-radius:8px; overflow:hidden; }
.aj-stat { background:var(--bg2); padding:28px 18px; text-align:center; }
.aj-stat .num { font-family:'Noto Serif',serif; font-size:2.2rem; color:var(--gold); display:block; line-height:1.1; }
.aj-stat .lbl { font-size:.82rem; letter-spacing:.08em; color:var(--muted); text-transform:uppercase; margin-top:8px; }
/* ── Figures ── */
.aj-visual { margin:36px 0 12px; border-radius:10px; overflow:hidden; border:1px solid var(--rule2); background:var(--bg2); }
.aj-visual svg { display:block; width:100%; height:auto; }
.aj-cap { font-size:1.25rem; color:var(--muted); text-align:center; margin:0 0 36px; font-style:italic; line-height:1.6; }
/* ── Code / formula ── */
.aether-journey pre { background:#07100b; border:1px solid var(--rule2); border-radius:8px; padding:22px 24px; overflow-x:auto; margin:28px 0; font-family:'JetBrains Mono',monospace; font-size:.95rem; line-height:1.75; color:var(--cream); }
.aether-journey pre .c { color:var(--muted); }
.aether-journey pre .k { color:var(--gold-lt); }
.aether-journey pre .s { color:var(--olive); }
.aether-journey code.inline { font-family:'JetBrains Mono',monospace; font-size:1em; background:rgba(125,139,74,.16); color:var(--gold-lt); padding:2px 8px; border-radius:4px; border-bottom:none; }
.aj-formula { background:var(--bg3); border:1px solid var(--rule); border-radius:8px; padding:24px 28px; margin:30px 0; font-family:'JetBrains Mono',monospace; font-size:1.05rem; color:var(--gold-lt); text-align:center; line-height:2; }
.aj-formula span { color:var(--muted); }
/* ── Timeline ── */
.aj-timeline { position:relative; margin:34px 0 12px; padding-left:6px; }
.aj-timeline::before { content:''; position:absolute; left:19px; top:8px; bottom:8px; width:2px; background:linear-gradient(var(--olive),var(--clay)); }
.aj-tl { display:grid; grid-template-columns:46px 1fr; gap:18px; margin-bottom:26px; }
.aj-tl-dot { width:44px; height:44px; border-radius:50%; background:var(--bg2); border:2px solid var(--gold); display:flex; align-items:center; justify-content:center; font-size:.74rem; font-family:'JetBrains Mono',monospace; color:var(--gold); font-weight:600; position:relative; z-index:1; }
.aj-tl-body h3 { font-size:1.25rem; color:var(--gold-lt); margin:8px 0 6px; }
.aj-tl-body p { font-size:1.1rem; color:var(--muted); margin:0; line-height:1.7; }
/* ── CTA ── */
.aj-cta { background:linear-gradient(135deg,var(--jungle),var(--bg2)); border:1px solid var(--gold-dk); border-radius:12px; padding:40px 34px; margin:48px 0; text-align:center; }
.aj-cta h2 { margin-bottom:12px; }
.aj-cta p { color:var(--cream); max-width:640px; margin:0 auto 24px; font-size:1.15rem; line-height:1.7; }
.aj-btn { display:inline-block; background:var(--gold); color:var(--bg); font-weight:700; padding:16px 34px; border-radius:6px; letter-spacing:.02em; font-size:1.05rem; border:none; transition:transform .15s ease, background .15s ease; }
.aj-btn:hover { background:var(--gold-lt); color:var(--bg); transform:translateY(-2px); }
@media (max-width:640px) {
.aj-statbar { grid-template-columns:repeat(2,1fr); }
}</style><div class="aether-journey"><div class="aj-hero"><video class="aj-hero-video" autoplay loop muted playsinline poster="/images/ather22.jpg"><source src="/images/ather2.mp4" type="video/mp4"/></video><div class="aj-hero-scrim"/><div class="aj-hero-in"><span class="aj-kicker">Implementation Journey · Part II</span><h1>Aether, Grown Wild</h1><p class="aj-lead" style="margin-top:22px;">The first article implemented the base idea. This one is the expedition: how that idea evolved as pieces were added along the way, growing into a 13-agent, web-first, self-escalating system, and every bug in the undergrowth that shaped it.</p><div class="aj-hero-meta"><span>By <strong>Ajay Walia</strong></span><span>June 2026</span><span><strong>v2.6 → v2.8.2</strong></span><span>10 min read</span></div></div></div><div class="aj-recap" style="margin-top:12px;"><p>New here? Start with the original field guide — <a href="/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/">"I Built a Team of IT Architects Using LLM That Live on MacBook — Meet Aether."</a> That post laid out the thought. This one is what happened when the thought met real queries.</p></div><p class="aj-lead">Every design survives contact with the page. Then you run it.</p><p>Aether v2.6 worked end-to-end on day one — route, retrieve, build, generate, score, escalate, audit. And almost every lesson since came from running that clean little machine against questions it had never seen before.</p><p>The original architecture made a single bet: one model can't be an expert at everything, so build a tree of narrow experts and let them escalate on doubt. The bet held. But the path from v2.6 to v2.8.2 reshaped almost everything around it.
The agent roster grew, the router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and the confidence score stopped being a thing the model <em>claimed</em> and became a thing the system <em>computed</em>.</p><div class="aj-statbar"><div class="aj-stat"><span class="num">13</span><span class="lbl">Agents · 3 tiers</span></div><div class="aj-stat"><span class="num">v2.8.2</span><span class="lbl">Current release</span></div><div class="aj-stat"><span class="num">0.7</span><span class="lbl">Escalation threshold</span></div><div class="aj-stat"><span class="num">ZERO</span><span class="lbl">Egress · API cost</span></div></div><p class="aj-label">The Delta</p><h2 id="where-the-thought-and-the-build-diverged">Where the thought and the build diverged</h2><p>The blueprint described ten agents, a Gemma 4 26B model, knowledge-base-first retrieval, and a confidence number the model appended to its own answer. Run it for a week and three of those four assumptions bend out of shape:</p><ul class="leaf"><li><strong>Ten agents became thirteen.</strong> The flat roster of technology specialists was reorganised into a four-agent network sub-branch and a consolidated digital-workplace branch.</li><li><strong>Knowledge-base-first became web-first.</strong> The local knowledge base started empty, so retrieval now scrapes a vendor allowlist first and falls back to the KB only when the web is thin.</li><li><strong>Self-reported confidence became computed.</strong> "Confidence: 0.92" was theatre; a formula over retrieval quality, domain fit and citation density replaced it.</li><li><strong>An idea became a discipline.</strong> With Git disabled, a hand-written <code class="inline">CHANGELOG.md</code> turned into the single source of truth — every fix and reversal, dated.</li></ul><p class="aj-label">The Pack</p><h2 id="from-ten-to-thirteen--the-hierarchy-regrows">From ten to thirteen — the hierarchy regrows</h2><p>The three-tier shape held: one Enterprise Architect
at the top, three Domain Architects beneath, and a layer of deep technology specialists at the base. What changed was the base. The original Intune, AVD and Citrix agents were too narrow and overlapped each other, so they were folded into broader, sharper roles — and the network domain, barely a single agent before, grew a full four-specialist sub-branch.</p><div class="aj-visual"><svg viewBox="0 0 860 430" xmlns="http://www.w3.org/2000/svg" font-family="Noto Sans, sans-serif"><defs><linearGradient id="t1" x1="0" y1="0" x2="0" y2="1"><stop offset="0" stop-color="#d9a521"/><stop offset="1" stop-color="#8a6610"/></linearGradient></defs><g stroke="#4a5429" stroke-width="1.6" fill="none" opacity="0.8"><path d="M430,72 L200,150"/><path d="M430,72 L430,150"/><path d="M430,72 L660,150"/><path d="M200,182 L120,250"/><path d="M200,182 L260,250"/><path d="M430,182 L430,250"/></g><rect x="330" y="40" width="200" height="34" rx="7" fill="url(#t1)"/><text x="430" y="62" text-anchor="middle" font-size="13" font-weight="700" fill="#0c1410">TIER 1 · Enterprise Architect</text><g font-size="12" font-weight="600" fill="#e7e0cf"><rect x="120" y="150" width="160" height="32" rx="6" fill="#2c5239" stroke="#7d8b4a"/><text x="200" y="171" text-anchor="middle">Cloud Domain</text><rect x="350" y="150" width="160" height="32" rx="6" fill="#2c5239" stroke="#7d8b4a"/><text x="430" y="171" text-anchor="middle">DWP Domain</text><rect x="580" y="150" width="160" height="32" rx="6" fill="#2c5239" stroke="#7d8b4a"/><text x="660" y="171" text-anchor="middle">Network Domain</text></g><g font-size="10.5" fill="#0c1410" font-weight="600"><rect x="70" y="250" width="64" height="28" rx="5" fill="#c98a5b"/><text x="102" y="268" text-anchor="middle">AWS</text><rect x="142" y="250" width="64" height="28" rx="5" fill="#c98a5b"/><text x="174" y="268" text-anchor="middle">Azure</text><rect x="214" y="250" width="64" height="28" rx="5" fill="#c98a5b"/><text x="246" y="268" text-anchor="middle">GCP</text><rect 
x="358" y="250" width="74" height="28" rx="5" fill="#c98a5b"/><text x="395" y="268" text-anchor="middle">MS DWP</text><rect x="436" y="250" width="80" height="28" rx="5" fill="#c98a5b"/><text x="476" y="268" text-anchor="middle">End-User Virt</text><rect x="540" y="290" width="80" height="28" rx="5" fill="#c98a5b"/><text x="580" y="308" text-anchor="middle">Core Net</text><rect x="626" y="290" width="92" height="28" rx="5" fill="#c98a5b"/><text x="672" y="308" text-anchor="middle">SD-WAN/SASE</text><rect x="540" y="324" width="80" height="28" rx="5" fill="#c98a5b"/><text x="580" y="342" text-anchor="middle">Net Security</text><rect x="626" y="324" width="92" height="28" rx="5" fill="#c98a5b"/><text x="672" y="342" text-anchor="middle">NetOps AIOps</text></g><g stroke="#4a5429" stroke-width="1.4" fill="none" opacity="0.7"><path d="M660,182 L580,290 M660,182 L672,290 M660,182 L580,324 M660,182 L672,324"/></g></svg></div><p class="aj-cap">Thirteen agents across three tiers. Each domain owns a colour; specialists inherit it. 
Low confidence climbs the parent chain → Domain → Enterprise.</p><h3 id="added-to-the-pack">Added to the pack</h3><ul class="leaf"><li><strong>Microsoft DWP Technology Architect</strong> — a broad M365, Intune, Entra, Defender and Copilot specialist.</li><li><strong>End-User Virtualization</strong> — Citrix, Horizon, AVD and FSLogix under one roof.</li><li><strong>A four-agent network sub-branch</strong> — Core Networking, SD-WAN/SASE, Network Security and NetOps AIOps.</li></ul><h3 id="cut-from-the-undergrowth">Cut from the undergrowth</h3><ul class="leaf"><li><strong>The standalone Virtualization Domain Architect</strong> — folded into DWP.</li><li><strong>The Intune-only agent</strong> — replaced by the broader Microsoft DWP role.</li><li><strong>The AVD- and Citrix-specific agents</strong> — absorbed by End-User Virtualization.</li><li><strong>The retired KB taxonomy</strong> — <code class="inline">kb_intune</code>, <code class="inline">kb_avd</code> and <code class="inline">kb_citrix</code>.</li></ul><p class="aj-label">core/router.py</p><h2 id="routing--the-hardest-won-code-in-the-repo">Routing — the hardest-won code in the repo</h2><p>Routing decides which expert hears the question. It sounds trivial — match keywords, pick an agent — and it turned out to be the single biggest source of subtle, infuriating bugs.
The router now runs four ordered passes:</p><ul class="leaf"><li><strong>Forced design-doc override.</strong> Phrases like "solution design document" bypass keyword scanning entirely, and any mention of "Copilot" routes straight to the Microsoft DWP specialist.</li><li><strong>Tier-3 keyword match.</strong> The most specific specialists are scanned first, with <code class="inline">\b</code> word-boundary regex to stop false hits.</li><li><strong>Tier-2 domain match.</strong> Broader strategy keywords catch domain-level queries that no specialist claimed.</li><li><strong>Default to Enterprise.</strong> Anything unmatched falls through to the catch-all at the top.</li></ul><style>
.routing-anim-wrapper {
display: flex;
flex-direction: column;
gap: 1rem;
margin: 3rem 0;
}
@media (min-width: 1024px) {
.routing-anim-wrapper {
flex-direction: row;
}
}
.routing-anim {
flex: 1.3;
position: relative;
background: #0f172a;
border: 1px solid #1e293b;
border-radius: 1rem;
padding: 2rem;
height: 520px;
overflow: hidden;
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
box-shadow: inset 0 2px 20px rgba(0,0,0,0.5);
}
.route-debug {
flex: 0.7;
background: #020617;
border: 1px solid #1e293b;
border-radius: 1rem;
padding: 1.5rem;
height: 520px;
overflow-y: auto;
box-shadow: inset 0 0 10px rgba(0,0,0,0.8);
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
}
.debug-header {
font-size: 0.8rem;
font-weight: 700;
color: #64748b;
margin-bottom: 1rem;
letter-spacing: 0.05em;
border-bottom: 1px solid #1e293b;
padding-bottom: 0.75rem;
display: flex;
align-items: center;
gap: 0.5rem;
}
.debug-dot {
width: 8px;
height: 8px;
background: #10b981;
border-radius: 50%;
animation: pulse-dot 2s infinite;
}
@keyframes pulse-dot { 50% { opacity: 0.5; } }
.debug-code {
color: #38bdf8;
font-size: 0.8rem;
line-height: 1.6;
white-space: pre-wrap;
}
.route-line {
position: absolute;
top: 0;
bottom: 0;
left: 80px;
width: 2px;
background: repeating-linear-gradient(to bottom, #334155 0, #334155 10px, transparent 10px, transparent 20px);
z-index: 1;
}
.route-packet {
position: absolute;
left: 80px;
top: -40px;
transform: translate(-50%, -50%);
background: #0ea5e9;
color: white;
padding: 0.5rem 1rem;
border-radius: 999px;
font-size: 0.875rem;
font-weight: 600;
z-index: 20;
box-shadow: 0 0 20px rgba(14, 165, 233, 0.5);
white-space: nowrap;
transition: top 0.5s ease-in-out, background 0.3s, box-shadow 0.3s, opacity 0.3s;
}
.route-step {
position: absolute;
left: 140px;
right: 20px;
background: #1e293b;
border: 1px solid #334155;
color: #94a3b8;
padding: 1rem 1.25rem;
border-radius: 0.5rem;
text-align: left;
z-index: 10;
transition: all 0.3s ease;
transform: translateY(-50%);
}
.route-step::before {
content: '';
position: absolute;
top: 50%;
left: -60px;
transform: translate(-50%, -50%);
width: 12px;
height: 12px;
background: #0f172a;
border: 2px solid #334155;
border-radius: 50%;
transition: border-color 0.3s, background 0.3s;
z-index: 2;
}
.route-step::after {
content: '';
position: absolute;
top: 50%;
left: -60px;
width: 60px;
height: 2px;
background: #334155;
transform: translateY(-50%);
z-index: 1;
transition: background 0.3s;
}
.route-step.active-scan {
border-color: #0ea5e9;
background: rgba(14, 165, 233, 0.05);
}
.route-step.active-scan::before {
border-color: #0ea5e9;
background: #0ea5e9;
}
.route-step.active-scan::after {
background: #0ea5e9;
}
.route-step.match {
border-color: #10b981;
background: rgba(16, 185, 129, 0.05);
box-shadow: 0 0 15px rgba(16, 185, 129, 0.1);
}
.route-step.match::before {
border-color: #10b981;
background: #10b981;
}
.route-step.match::after {
background: #10b981;
}
.r-step-1 { top: 80px; }
.r-step-2 { top: 190px; }
.r-step-3 { top: 300px; }
.r-step-4 { top: 410px; }
.step-title {
font-size: 0.9rem;
font-weight: 700;
color: #f8fafc;
margin-bottom: 0.25rem;
}
.step-desc {
font-size: 0.8rem;
color: #94a3b8;
line-height: 1.4;
}</style><div class="routing-anim-wrapper not-prose" id="routing-anim-wrapper"><div class="routing-anim" id="routing-anim"><div class="route-line"/><div class="route-step r-step-1" id="rs-1"><div class="step-title">1. Forced Override</div><div class="step-desc">Certain keywords (like "solution design document" or "Copilot") bypass deep scanning and hardcode a route to a specific agent.</div></div><div class="route-step r-step-2" id="rs-2"><div class="step-title">2. Tier-3 Specialist</div><div class="step-desc">Scans for highly specific technologies using exact word boundaries (<code class="inline">\bids\b</code>). Prevents broad terms from triggering niche specialists.</div></div><div class="route-step r-step-3" id="rs-3"><div class="step-title">3. Tier-2 Domain Architect</div><div class="step-desc">Catches broader domain-level queries (like "firewall" or "routing") if no specific L3 specialist claimed the question.</div></div><div class="route-step r-step-4" id="rs-4"><div class="step-title">4. Default to Enterprise</div><div class="step-desc">The catch-all bucket. If a query falls all the way through without matching anything, it goes to the Enterprise Architect.</div></div><div class="route-packet" id="r-packet">Starting...</div></div><div class="route-debug"><div class="debug-header"><div class="debug-dot"/> ROUTER ENGINE TRACE</div><div class="debug-code" id="debug-log">Waiting for query...</div></div></div><script>
document.addEventListener('DOMContentLoaded', function() {
const packet = document.getElementById('r-packet');
const debugLog = document.getElementById('debug-log');
const steps = [
document.getElementById('rs-1'),
document.getElementById('rs-2'),
document.getElementById('rs-3'),
document.getElementById('rs-4')
];
if(!packet || !steps[0] || !debugLog) return;
const scenarios = [
{ query: '"How to enable Copilot"', targetStep: 1, targetName: 'Microsoft DWP' },
{ query: '"BGP flapping issue"', targetStep: 2, targetName: 'Network L3 Tech' },
{ query: '"Firewall strategy"', targetStep: 3, targetName: 'Network Sec Arch' },
{ query: '"General IT strategy"', targetStep: 4, targetName: 'Enterprise Arch' }
];
let currentScenario = 0;
const sleep = ms => new Promise(r => setTimeout(r, ms));
async function runAnimation() {
while(true) {
const s = scenarios[currentScenario];
const reqId = "req_" + Math.random().toString(36).slice(2, 8);
let debugState = {
request_id: reqId,
query: s.query.replace(/"/g, ''),
status: "scanning",
cache: "redis_miss",
scan_log: []
};
const renderDebug = () => {
debugLog.textContent = JSON.stringify(debugState, null, 2);
};
packet.style.transition = 'none';
packet.style.top = '-40px';
packet.style.opacity = '1';
packet.style.background = '#0ea5e9';
packet.style.boxShadow = '0 0 20px rgba(14, 165, 233, 0.5)';
packet.textContent = s.query;
steps.forEach(step => step.classList.remove('active-scan', 'match'));
renderDebug();
await sleep(2000);
packet.style.transition = 'top 1.5s ease-in-out, background 0.3s, box-shadow 0.3s, opacity 0.3s';
for (let i = 0; i < s.targetStep; i++) {
let stepTop = 80 + (i * 110);
packet.style.top = stepTop + 'px';
steps[i].classList.add('active-scan');
debugState.scan_log.push(`evaluating_tier_${i+1}`);
renderDebug();
await sleep(2000);
if (i < s.targetStep - 1) {
packet.style.background = '#ef4444';
packet.style.boxShadow = '0 0 20px rgba(239, 68, 68, 0.5)';
packet.textContent = "✗ No match";
debugState[`tier_${i+1}_match`] = false;
renderDebug();
await sleep(2000);
packet.style.background = '#0ea5e9';
packet.style.boxShadow = '0 0 20px rgba(14, 165, 233, 0.5)';
packet.textContent = s.query;
steps[i].classList.remove('active-scan');
} else {
steps[i].classList.remove('active-scan');
steps[i].classList.add('match');
packet.style.background = '#10b981';
packet.style.boxShadow = '0 0 30px rgba(16, 185, 129, 0.8)';
packet.textContent = "✓ Routed: " + s.targetName;
debugState.status = "routed";
debugState[`tier_${i+1}_match`] = true;
debugState.target_agent = s.targetName.replace(/ /g, '_').toLowerCase();
debugState.cache_action = "set_redis_ttl_3600";
renderDebug();
await sleep(6000);
packet.style.opacity = '0';
steps[i].classList.remove('match');
await sleep(2500);
}
}
currentScenario = (currentScenario + 1) % scenarios.length;
}
}
runAnimation();
});
</script><p>The bugs here were the kind that hide in plain sight. A bare <code class="inline">"ids"</code> keyword matched the "IDs" in <em class="q">"track record IDs"</em> and wrongly summoned the Network Security agent — fixed by swapping it for the explicit phrase <code class="inline">"intrusion detection system"</code>. Worse, the Tier-2 rules stored their keywords as double-escaped regex (<code class="inline">\\bfirewall\\b</code>), which turned the backslashes literal, so those rules could <em>never</em> match and three domain architects sat silently unreachable.
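</p><p>A minimal Python sketch of both routing bugs, assuming the router matches keywords case-insensitively with the standard <code class="inline">re</code> module. The query string and the <code class="inline">hits()</code> helper are illustrative only and are not taken from the Aether repo:</p>

```python
import re


def hits(pattern: str, text: str) -> bool:
    """Return True if the regex pattern matches anywhere in text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None


# Hypothetical query that has nothing to do with intrusion detection.
query = "How do we track record IDs behind the firewall?"

# Bug 1: a bare keyword is a substring test, so "ids" fires on "IDs".
bare_keyword_hit = "ids" in query.lower()

# A word boundary does not help here: "IDs" is already a whole word.
boundary_hit = hits(r"\bids\b", query)

# The explicit phrase only fires when the query really mentions it.
phrase_hit = hits(r"\bintrusion detection system\b", query)

# Bug 2: a double-escaped pattern. The regex engine reads \\b as a literal
# backslash followed by the letter b, not as a word boundary.
working_rule = hits(r"\bfirewall\b", query)
dead_rule = hits(r"\\bfirewall\\b", query)

print(bare_keyword_hit, boundary_hit, phrase_hit)  # True True False
print(working_rule, dead_rule)                     # True False
```

<p>Under that case-insensitive assumption, a word boundary alone would not have rescued the <code class="inline">ids</code> rule, because "IDs" is already a whole word; only the explicit phrase removes the collision.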
A small <code class="inline">_kw_matches()</code> helper now treats real regex as regex and plain words as plain words, while Redis caches each verdict for an hour and the router degrades gracefully if Redis goes down.</p><p class="aj-label">The Single Beast</p><h2 id="one-local-gemma-thirteen-personas">One local Gemma, thirteen personas</h2><p>The headline trick from Part I survived intact, and got sharper. There is still exactly one model resident in memory: <code class="inline">gemma-4-26b-a4b-it-mlx</code> — Google's open Gemma family, the instruction-tuned variant in an active-parameter configuration, MLX-quantized for Apple silicon and served through LM Studio's OpenAI-compatible API at <code class="inline">localhost:1234</code>. Thirteen specialists, one set of weights. The persona switch lives entirely in the YAML manifest:</p><pre><span class="c"># agents/enterprise_architect.yaml</span>
<span class="k">id:</span> enterprise_architect
<span class="k">tier:</span> 1
<span class="k">namespace:</span> kb_enterprise
<span class="k">parent:</span> null
<span class="k">temperature:</span> 0.2
<span class="k">max_tokens:</span> 3000
<span class="k">system_prompt:</span> |
  <span class="s">DOMAIN MANDATE</span> — out-of-domain query → refuse, hard-code 0.0 confidence.
  <span class="s">ENTERPRISE MANDATE</span> — always cover Identity (AD/Entra, RBAC) and HA/DR;
    state assumptions, never claim 100%.
  <span class="s">STYLE MANDATE</span> — no invented URLs, no gratuitous TOGAF/ISO name-drops,
    plain-text arrows, never LaTeX.
  <span class="s">DIAGRAM RULE</span> — emit a concise Mermaid block (≤15 nodes) for architectures.</pre><p>That manifest <em>is</em> the agent. The mandates are the lessons of a hundred bad answers, hardened into rules: out-of-domain questions get refused and forced to escalate, every architecture must address identity and disaster recovery, and decorative fake citations are banned outright. There is also a discipline the local hardware taught the hard way — large design templates once overflowed the context window and OOM-crashed LM Studio, forcing careful tuning of how much web context gets injected per query.</p><p class="aj-label">The v2.8 Redesign</p><h2 id="retrieval-flips-web-first-kb-as-fallback">Retrieval flips: web-first, KB as fallback</h2><p>This was the biggest architectural change since the blueprint. The original design retrieved from the local LanceDB knowledge base first. The problem was mundane and fatal: the knowledge base started empty. An empty KB meant a zero retrieval score, which meant zero confidence, which meant every single query escalated all the way to the Enterprise Architect for nothing.</p><p>So the pipeline flipped. The orchestrator now scrapes trusted vendor documentation <em>before</em> it touches the KB, and only falls back to local documents when the web returns fewer than about 1,200 characters of usable text.</p><style>
.retrieval-anim {
position: relative;
background: #0f172a;
border: 1px solid #1e293b;
border-radius: 1rem;
padding: 0 40px;
margin: 3rem 0 1rem 0;
height: 260px;
display: flex;
align-items: center;
justify-content: space-between;
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
box-shadow: inset 0 2px 20px rgba(0,0,0,0.5);
overflow: hidden;
}
.ret-line {
position: absolute;
top: 50%;
left: 40px;
right: 40px;
height: 2px;
background: #334155;
transform: translateY(-50%);
z-index: 1;
}
.ret-node-wrapper {
position: relative;
z-index: 10;
display: flex;
flex-direction: column;
align-items: center;
}
.ret-node {
padding: 0.8rem 1.2rem;
border-radius: 0.4rem;
font-size: 1rem;
font-weight: 700;
text-align: center;
color: #0c1410;
transition: all 0.3s ease;
box-shadow: 0 2px 5px rgba(0,0,0,0.3);
border: 2px solid transparent;
}
.node-green { background: #7d8b4a; }
.node-gold { background: #d9a521; }
.node-brown { background: #c98a5b; }
.ret-node.active {
box-shadow: 0 0 20px rgba(255,255,255,0.4);
transform: scale(1.15);
border-color: #f8fafc;
}
.ret-node.skipped {
opacity: 0.3;
filter: grayscale(80%);
}
.ret-packet {
position: absolute;
top: 50%;
left: -100px;
transform: translate(-50%, -50%);
background: #0ea5e9;
color: white;
padding: 0.5rem 1rem;
border-radius: 999px;
font-size: 0.9rem;
font-weight: bold;
z-index: 20;
white-space: nowrap;
box-shadow: 0 0 15px rgba(14, 165, 233, 0.6);
transition: left 0.6s ease-in-out, top 0.6s ease-in-out, opacity 0.3s;
}
.ret-log {
position: absolute;
bottom: 20px;
left: 50%;
transform: translateX(-50%);
color: #38bdf8;
font-size: 0.9rem;
background: #020617;
padding: 0.5rem 1.2rem;
border-radius: 0.5rem;
border: 1px solid #1e293b;
opacity: 0;
transition: opacity 0.3s;
white-space: nowrap;
}
.kb-label {
position: absolute;
top: 100%;
margin-top: 12px;
font-size: 0.8rem;
color: #94a3b8;
white-space: nowrap;
}
@media (max-width: 768px) {
.retrieval-anim {
flex-direction: column;
padding: 40px 0;
height: 600px;
}
.ret-line {
top: 40px; bottom: 40px; left: 50%; right: auto; width: 2px; height: auto; transform: translateX(-50%);
}
.kb-label {
left: 100%; top: 50%; margin-top: 0; margin-left: 12px; transform: translateY(-50%);
}
}</style><div class="retrieval-anim not-prose" id="ret-container"><div class="ret-line"/><div class="ret-node-wrapper"><div class="ret-node node-green" id="rn-1">Route</div></div><div class="ret-node-wrapper"><div class="ret-node node-gold" id="rn-2">Web primary</div></div><div class="ret-node-wrapper"><div class="ret-node node-brown" id="rn-3">KB fallback</div><div class="kb-label">only if web &lt; 1200 chars</div></div><div class="ret-node-wrapper"><div class="ret-node node-green" id="rn-4">History</div></div><div class="ret-node-wrapper"><div class="ret-node node-green" id="rn-5">Build</div></div><div class="ret-node-wrapper"><div class="ret-node node-gold" id="rn-6">Gemma</div></div><div class="ret-packet" id="ret-packet">Query</div><div class="ret-log" id="ret-log">Logs</div></div><p class="aj-cap" style="margin-top: 0; margin-bottom: 3rem;">The v2.8 retrieval path — web-first against a vendor allowlist, with the knowledge base merging in only when the web comes back thin.</p><script>
document.addEventListener('DOMContentLoaded', function() {
const container = document.getElementById('ret-container');
const packet = document.getElementById('ret-packet');
const log = document.getElementById('ret-log');
const nodes = [
document.getElementById('rn-1'),
document.getElementById('rn-2'),
document.getElementById('rn-3'),
document.getElementById('rn-4'),
document.getElementById('rn-5'),
document.getElementById('rn-6')
];
if(!packet || !nodes[0]) return;
const scenarios = [
{
query: '"AWS VPC Limits"',
webRes: 'Web hit: 5,420 chars (AWS Docs)',
skipKb: true
},
{
query: '"Project Aether Spec"',
webRes: 'Web hit: 120 chars (Thin)',
kbRes: 'KB hit: 8,300 chars (Local LanceDB)',
skipKb: false
}
];
let cur = 0;
const sleep = ms => new Promise(r => setTimeout(r, ms));
function moveTo(index) {
let rect = nodes[index].getBoundingClientRect();
let cRect = container.getBoundingClientRect();
packet.style.left = (rect.left - cRect.left + rect.width / 2) + 'px';
packet.style.top = (rect.top - cRect.top + rect.height / 2) + 'px';
}
async function run() {
while(true) {
const s = scenarios[cur];
packet.style.transition = 'none';
if (window.innerWidth < 768) {
packet.style.left = '50%';
packet.style.top = '-50px';
} else {
packet.style.left = '-100px';
packet.style.top = '50%';
}
packet.style.opacity = '1';
packet.textContent = s.query;
log.style.opacity = '0';
nodes.forEach(n => n.classList.remove('active', 'skipped'));
await sleep(2000);
packet.style.transition = 'left 1.5s ease-in-out, top 1.5s ease-in-out, opacity 0.3s';
// Step 0: Route
moveTo(0);
nodes[0].classList.add('active');
await sleep(2000);
nodes[0].classList.remove('active');
// Step 1: Web primary
moveTo(1);
nodes[1].classList.add('active');
log.textContent = "Scraping trusted vendor docs...";
log.style.opacity = '1';
await sleep(2500);
log.textContent = s.webRes;
await sleep(2500);
nodes[1].classList.remove('active');
// Step 2: KB
moveTo(2);
if (s.skipKb) {
nodes[2].classList.add('skipped');
log.textContent = "Skipping KB (web threshold met)";
await sleep(2500);
} else {
nodes[2].classList.add('active');
log.textContent = "Web thin. Querying local LanceDB...";
await sleep(2500);
log.textContent = s.kbRes;
await sleep(2500);
nodes[2].classList.remove('active');
}
// Step 3: History
moveTo(3);
nodes[3].classList.add('active');
log.textContent = "Appending chat history...";
await sleep(2000);
nodes[3].classList.remove('active');
// Step 4: Build
moveTo(4);
nodes[4].classList.add('active');
log.textContent = "Constructing final prompt...";
await sleep(2000);
nodes[4].classList.remove('active');
// Step 5: Gemma
moveTo(5);
nodes[5].classList.add('active');
log.textContent = "Streaming to gemma-4-26b...";
packet.style.background = '#10b981';
packet.style.boxShadow = '0 0 20px rgba(16, 185, 129, 0.6)';
await sleep(4000);
packet.style.opacity = '0';
log.style.opacity = '0';
nodes[5].classList.remove('active');
packet.style.background = '#0ea5e9';
packet.style.boxShadow = '0 0 15px rgba(14, 165, 233, 0.6)';
await sleep(3000);
cur = (cur + 1) % scenarios.length;
}
}
run();
});</script><ul class="leaf"><li><strong>A curated vendor-domain allowlist.</strong> Results are restricted to Microsoft Learn, AWS, Google Cloud, Cisco, Palo Alto, NIST and others, with suffix-safe matching that blocks spoofs like <code class="inline">cisco.com.evil.com</code>.</li><li><strong>Per-agent ranking.</strong> <code class="inline">AGENT_DOMAINS</code> ranks each specialist's preferred docs first, so the AWS agent leans on AWS documentation before anything else.</li><li><strong>Source-agnostic results.</strong> Web hits are reshaped to look exactly like KB results (<code class="inline">source / url / text / score</code>), so the confidence math works identically on either.</li><li><strong>Rich metadata.</strong> A schema of <code class="inline">domain</code>, <code class="inline">vendor</code>, <code class="inline">document_type</code> and <code class="inline">version_date</code> travels with every chunk into the prompt.</li></ul><p class="aj-label">Trust, Computed</p><h2 id="confidence-is-math-not-vibes">Confidence is math, not vibes</h2><p>In v2.6 the model ended each answer with <code class="inline">Confidence: 0.92</code> and the orchestrator believed it. The trouble is that a model confidently answering an AWS question with Azure facts will happily rate itself 0.92 too. Self-report is theatre. So confidence became a number the <em>system</em> computes, before and after generation:</p><div class="aj-formula">
pre_gen <span>=</span> 0.6 · retrieval_score <span>+</span> 0.4 · namespace_overlap<br>
confidence <span>=</span> min(1.0, best_pre_gen <span>+</span> 0.2 · citation_density)</div><ul class="leaf"><li><strong>retrieval_score</strong> — the quality of the retrieved documents. With an empty KB it is derived from web hits: 0.85 for preferred vendor domains, 0.70 otherwise.</li><li><strong>namespace_overlap</strong> — does the query hit the agent's keywords? A strong match scores about 0.85; off-topic collapses to 0.1, all but guaranteeing escalation.</li><li><strong>citation_density</strong> — the share of claims backed by <code class="inline">[1] [2]</code> sources, rewarding grounded answers with up to a 0.2 boost.</li></ul><p>Below the 0.7 threshold, the query climbs to the parent agent, which re-retrieves against its own namespace and re-answers. One subtle fix mattered here: the strongest <code class="inline">_best_pre_gen</code> score is carried up the escalation chain, so a confident child's score is never erased by a weaker parent re-running the same step. The model no longer judges itself — the architecture does.</p><p class="aj-label">The Undergrowth</p><h2 id="the-bugs-that-shaped-the-design">The bugs that shaped the design</h2><p>Most of the architecture above exists because something broke first. The marquee disaster was the <strong>0% confidence saga</strong> — a cluster of unrelated failures that all produced the same symptom: every query inexplicably crashing to zero confidence and escalating to the top of the tree.</p><ul class="leaf"><li><strong>Silent search failure.</strong> <code class="inline">duckduckgo_search</code> was renamed to <code class="inline">ddgs</code> upstream, and searches began returning an empty list.
A catch-all <code class="inline">except</code> swallowed the error, so retrieval quietly went to zero.</li><li><strong>Empty-KB zero score.</strong> With no documents ingested, the retrieval score defaulted to 0.0 — now derived from web-result quality instead.</li><li><strong>Overwritten scores.</strong> Each escalation re-ran <code class="inline">step_build</code> and erased the child's strong score, until <code class="inline">_best_pre_gen</code> was preserved across the chain.</li><li><strong>Dead Tier-2 regex.</strong> Double-escaped <code class="inline">\\b</code> made domain routing rules unreachable, and malformed YAML broke parsing entirely.</li><li><strong>Context overflow.</strong> The expanded universal design template pushed prompts past the local model's context limit — a 400 error until web-context sizes were tuned back down.</li><li><strong>A frozen UI on long jobs.</strong> Design-doc generation ran for minutes while the UI silently froze, until streaming feedback and a 10-minute timeout were added.</li></ul><p class="aj-pull">A single catch-all <code class="inline" style="font-size:1rem;">except</code> turned a renamed package into an invisible, week-long confidence collapse. Fail loud, not silent.</p><p class="aj-label">The Working Contract</p><h2 id="how-the-build-stays-honest">How the build stays honest</h2><p>With Git tracking disabled, discipline had to live somewhere. It lives in two places. The first is a hand-maintained <code class="inline">CHANGELOG.md</code> that records every change, reversal and reason. The second is a behavioural contract the assistant itself follows when modifying the code — think before acting, make the smallest possible diff, verify against success criteria, keep everything auditable and reversible, and prefer less code over more.
It reads less like an engineering process and more like the Law of the Jungle: a few rules everyone keeps, because the alternative is chaos.</p><div class="aj-cta"><h2>See it running — the screenshots</h2><p>Nine captioned frames from the live system: the Gemma model resident in LM Studio, the Gradio chat generating a Microsoft 365 Copilot design document at 90% confidence with live web search, the namespace-per-domain knowledge base, and the design template behind it all.</p><a class="aj-btn" href="/aether-screens.html" target="_blank" rel="noopener">Open the screenshot gallery →</a></div><p class="aj-label">The Map So Far</p><h2 id="six-releases-one-expedition">Seven releases, one expedition</h2><div class="aj-timeline"><div class="aj-tl"><div class="aj-tl-dot">2.6.0</div><div class="aj-tl-body"><h3>Base system</h3><p>Routing, RAG, agents and escalation — the first end-to-end build.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.7.0</div><div class="aj-tl-body"><h3>Grounded confidence</h3><p>Dropped LLM self-report for a math formula; added web-search fallback and Mermaid diagrams in the UI.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.7.1</div><div class="aj-tl-body"><h3>Agent reshuffle</h3><p>Retired the Intune, AVD, Citrix and Virtualization agents; added Microsoft DWP, End-User Virt and the network specialists.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.7.4</div><div class="aj-tl-body"><h3>Routing repairs</h3><p>Fixed dead regex rules, corrupted YAMLs, and a hardcoded-path split that loaded stale manifests.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.8.0</div><div class="aj-tl-body"><h3>Web-first retrieval</h3><p>Scrape vendor docs before the KB; vendor allowlist; KB fallback merge.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.8.2</div><div class="aj-tl-body"><h3>Confidence fixes</h3><p>Solved 0% confidence on design docs; preserved the best score across the escalation 
chain.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.9.0</div><div class="aj-tl-body"><h3>Clarifying questions</h3><p>Before generating a design, agents now ask two to four targeted questions — org size, compliance, stack, timeline — and fold the answers into a far more specific result.</p></div></div></div><p class="aj-label">Where We Are · What's Next</p><h2 id="wired-working-and-honest">Wired, working, and honest</h2><p>v2.8.2 is a working, daily-use system. The full pipeline runs route → retrieve → build → generate → score → escalate → audit, all thirteen manifests parse, routing and parent maps are consistent, and live vendor-doc search returns current, citable content. What's still open is honest too: the KB folders are wired but largely empty and need real source documents ingested; a rule-ordering overlap can still misroute shared virtualization keywords between End-User Virt and DWP; Git is off; and the context budget stays tight on the local model for large templates.</p><p>The road ahead, in order: populate the knowledge base so retrieval draws on ingested documents rather than just the web; resolve the routing overlaps and settle End-User-Virt-versus-DWP ownership; re-enable Git and move off the manual changelog; then build a query test-suite to calibrate the confidence threshold against measured answer quality.</p><p class="aj-label">Lessons from the Trail</p><h2 id="what-the-journey-taught">What the journey taught</h2><ul class="leaf"><li><strong>Fail loud, not silent.</strong> A catch-all <code class="inline">except</code> turned a renamed package into invisible 0% confidence. Surface errors — never swallow them.</li><li><strong>Measure what you trust.</strong> Self-reported confidence is theatre. 
Grounding trust in retrieval and citations made escalation actually mean something.</li><li><strong>Narrow beats broad.</strong> Specialists with tight domains hallucinate far less than one generalist trying to know everything.</li><li><strong>Write the changelog.</strong> With Git off, the manual <code class="inline">CHANGELOG.md</code> became the single source of truth for every decision and fix.</li></ul><p style="font-size:1.18rem; color:var(--clay-lt); font-style:italic; line-height:1.6; margin-top:34px;">Aether began as a single sentence — "one model can't know everything." It grew into a hierarchy of grounded, self-aware experts, and the changelog is the proof of the journey. Specialise · Ground · Measure · Escalate.</p></div>
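<p>The scoring and escalation rule described in this post can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the 0.7 threshold, the 0.2 citation boost and the idea of carrying the best pre-generation score up the chain come from the text above, while the <code class="inline">Agent</code> class, its field names and <code class="inline">answer_with_escalation</code> are hypothetical names invented for the example.</p>

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ESCALATION_THRESHOLD = 0.7  # threshold quoted in the post
CITATION_BOOST = 0.2        # maximum boost from citation density, per the post


@dataclass
class Agent:
    # All fields are hypothetical stand-ins for values the real pipeline computes.
    name: str
    pre_gen_score: float      # retrieval / namespace score before generation
    citation_density: float   # share of claims backed by cited sources (0..1)
    parent: Optional["Agent"] = None


def confidence(best_pre_gen: float, citation_density: float) -> float:
    """confidence = min(1.0, best_pre_gen + 0.2 * citation_density)."""
    return min(1.0, best_pre_gen + CITATION_BOOST * citation_density)


def answer_with_escalation(agent: Agent) -> Tuple[str, float]:
    """Walk up the parent chain until the score clears the threshold.

    The strongest pre-generation score seen so far is carried upward,
    so a weaker parent never erases a stronger child's score.
    """
    best_pre_gen = 0.0
    current = agent
    while True:
        best_pre_gen = max(best_pre_gen, current.pre_gen_score)
        score = confidence(best_pre_gen, current.citation_density)
        if score >= ESCALATION_THRESHOLD or current.parent is None:
            return current.name, score
        current = current.parent
```

<p>In this sketch a child with a pre-generation score of 0.45 and a parent with 0.30 both fall below the threshold, so the query ends at the parent, but the reported confidence is still computed from the child's 0.45 rather than the parent's weaker 0.30.</p>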
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ather22.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>RAG, Graph RAG, Agentic RAG — and How to Make Any of Them 32× Memory Efficient</title><link>https://curiousbit.netlify.app/rag-graph-rag-agentic-rag-and-how-to-make-any-of-them-32-memory-efficient/</link><guid isPermaLink="true">https://curiousbit.netlify.app/rag-graph-rag-agentic-rag-and-how-to-make-any-of-them-32-memory-efficient/</guid><pubDate>Thu, 28 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&amp;family=JetBrains+Mono:wght@400;600&amp;display=swap');
.rag-art {
--bg: #0a1220;
--bg2: #0f1a2e;
--bg3: #142442;
--line: #1f3358;
--text: #e2e8f0;
--muted: #8aa0c0;
--accent: #22d3ee;
--accent2: #00e5a8;
--warn: #f59e0b;
--danger: #ef4444;
--purple: #a78bfa;
font-family: 'Inter', sans-serif;
color: var(--text);
background: var(--bg);
padding: 2rem 2.25rem;
border-radius: 16px;
box-shadow: 0 4px 30px rgba(0,0,0,0.55);
line-height: 1.75;
}
.rag-art * { box-sizing: border-box; }
.rag-art .section { padding: 40px 0; border-bottom: 1px solid var(--line); }
.rag-art .section:last-child { border-bottom: none; }
.rag-art .label { font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .22em; text-transform: uppercase; color: var(--accent); font-weight: 600; margin-bottom: 8px; }
.rag-art h2 { font-family: 'Inter', sans-serif; font-size: clamp(1.5rem, 1.2rem + 1vw, 2.2rem); font-weight: 700; letter-spacing: -.02em; color: #fff; margin: 0 0 18px; }
.rag-art h3 { font-family: 'Inter', sans-serif; font-size: 1.15rem; font-weight: 600; color: #fff; margin: 28px 0 10px; }
.rag-art p { margin: 0 0 16px; font-size: 1rem; color: var(--text); }
.rag-art ul { list-style: none; padding: 0; margin: 0 0 18px; }
.rag-art ul li { padding: 4px 0 4px 22px; position: relative; font-size: .95rem; }
.rag-art ul li::before { content: "▸"; position: absolute; left: 0; color: var(--accent); }
.rag-art strong { color: #fff; }
.rag-art code { background: var(--bg3); padding: 1px 6px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-size: .88rem; color: var(--accent2); }
.rag-visual { margin: 28px 0; border-radius: 8px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--line); background: var(--bg2); }
.rag-visual svg { display: block; width: 100%; height: auto; min-width: 700px; }
.rag-cap { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin: -10px 0 24px; }
.rag-figure { margin: 28px 0; }
.rag-figure img, .rag-figure video { display: block; width: 100%; height: auto; border-radius: 8px; border: 1px solid var(--line); }
.rag-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; }
@keyframes ragPulseSoft { 0%,100% { opacity: .35; } 50% { opacity: 1; } }
@keyframes ragPulseStrong{ 0%,100% { opacity: .55; r: 9; } 50% { opacity: 1; r: 11; } }
@keyframes ragMissed { 0%,100% { opacity: .25; } 50% { opacity: .9; } }
@keyframes ragFlow { from { stroke-dashoffset: 24; } to { stroke-dashoffset: 0; } }
@keyframes ragDraw { from { stroke-dashoffset: 800; } to { stroke-dashoffset: 0; } }
.rag-flow-line { stroke-dasharray: 6 4; animation: ragFlow 1.2s linear infinite; }
.rag-traversal { stroke-dasharray: 6 4; animation: ragFlow 1.5s linear infinite; }
.rag-glow { animation: ragPulseStrong 2.4s ease-in-out infinite; }
.rag-missed { animation: ragMissed 3s ease-in-out infinite; }
/* Sequential agent tool highlight (5 tools) */
@keyframes ragAgentT { 0%,18% { opacity: 1; } 25%,100% { opacity: .35; } }
.rag-tool-1 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 0s; }
.rag-tool-2 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 1s; }
.rag-tool-3 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 2s; }
.rag-tool-4 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 3s; }
.rag-tool-5 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 4s; }
/* Sequential float→binary cell transform (8 cells) */
@keyframes ragBitTransform { 0%,8% { opacity: 0; } 14%,100% { opacity: 1; } }
.rag-bit-1 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 0s; }
.rag-bit-2 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .35s; }
.rag-bit-3 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .7s; }
.rag-bit-4 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.05s; }
.rag-bit-5 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.4s; }
.rag-bit-6 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.75s; }
.rag-bit-7 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.1s; }
.rag-bit-8 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.45s; }
@media (prefers-reduced-motion: reduce) {
.rag-flow-line, .rag-traversal, .rag-glow, .rag-missed,
.rag-tool-1,.rag-tool-2,.rag-tool-3,.rag-tool-4,.rag-tool-5,
.rag-bit-1,.rag-bit-2,.rag-bit-3,.rag-bit-4,.rag-bit-5,.rag-bit-6,.rag-bit-7,.rag-bit-8 { animation: none !important; opacity: 1 !important; }
}
.rag-cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 14px; margin: 24px 0; }
.rag-card { background: var(--bg2); border: 1px solid var(--line); border-radius: 10px; padding: 18px 18px; }
.rag-card h4 { font-size: 1rem; font-weight: 600; color: #fff; margin: 0 0 8px; }
.rag-card p { font-size: .88rem; color: var(--muted); margin: 0; }
.rag-pull { border-left: 3px solid var(--accent); padding: 6px 0 6px 18px; margin: 28px 0; font-size: 1.08rem; color: var(--accent); font-style: italic; }
.rag-pull.warn { border-color: var(--warn); color: var(--warn); }
.rag-table { width: 100%; border-collapse: collapse; margin: 18px 0 26px; font-size: .9rem; }
.rag-table th, .rag-table td { padding: 11px 14px; text-align: left; border-bottom: 1px solid var(--line); vertical-align: top; }
.rag-table th { color: var(--accent); font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .14em; text-transform: uppercase; border-bottom: 1px solid var(--accent); font-weight: 600; }
.rag-table td.k { color: #fff; font-weight: 600; width: 25%; }
.rag-table td.v { color: var(--text); }
.rag-table td.m { color: var(--muted); font-size: .85rem; }
.rag-stat { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--line); border: 1px solid var(--line); margin: 28px 0; border-radius: 6px; overflow: hidden; }
.rag-stat &gt; div { background: var(--bg2); padding: 18px 14px; text-align: center; }
.rag-stat .num { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.7rem; color: var(--accent); display: block; }
.rag-stat .lbl { font-family: 'JetBrains Mono', monospace; font-size: .66rem; letter-spacing: .1em; color: var(--muted); text-transform: uppercase; margin-top: 4px; display: block; }
@media (max-width: 600px) {
.rag-art { padding: 1.25rem; }
.rag-stat { grid-template-columns: repeat(2,1fr); }
.rag-visual svg { min-width: 600px; }
}
&lt;/style&gt;
&lt;div class="rag-art"&gt;
&lt;!-- ── OPENING ── --&gt;
&lt;div class="section" style="padding-top:0;border-bottom:none"&gt;
&lt;p&gt;Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/rag-variants/hero.png" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;600&display=swap');
.rag-art {
--bg: #0a1220;
--bg2: #0f1a2e;
--bg3: #142442;
--line: #1f3358;
--text: #e2e8f0;
--muted: #8aa0c0;
--accent: #22d3ee;
--accent2: #00e5a8;
--warn: #f59e0b;
--danger: #ef4444;
--purple: #a78bfa;
font-family: 'Inter', sans-serif;
color: var(--text);
background: var(--bg);
padding: 2rem 2.25rem;
border-radius: 16px;
box-shadow: 0 4px 30px rgba(0,0,0,0.55);
line-height: 1.75;
}
.rag-art * { box-sizing: border-box; }
.rag-art .section { padding: 40px 0; border-bottom: 1px solid var(--line); }
.rag-art .section:last-child { border-bottom: none; }
.rag-art .label { font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .22em; text-transform: uppercase; color: var(--accent); font-weight: 600; margin-bottom: 8px; }
.rag-art h2 { font-family: 'Inter', sans-serif; font-size: clamp(1.5rem, 1.2rem + 1vw, 2.2rem); font-weight: 700; letter-spacing: -.02em; color: #fff; margin: 0 0 18px; }
.rag-art h3 { font-family: 'Inter', sans-serif; font-size: 1.15rem; font-weight: 600; color: #fff; margin: 28px 0 10px; }
.rag-art p { margin: 0 0 16px; font-size: 1rem; color: var(--text); }
.rag-art ul { list-style: none; padding: 0; margin: 0 0 18px; }
.rag-art ul li { padding: 4px 0 4px 22px; position: relative; font-size: .95rem; }
.rag-art ul li::before { content: "▸"; position: absolute; left: 0; color: var(--accent); }
.rag-art strong { color: #fff; }
.rag-art code { background: var(--bg3); padding: 1px 6px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-size: .88rem; color: var(--accent2); }
.rag-visual { margin: 28px 0; border-radius: 8px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--line); background: var(--bg2); }
.rag-visual svg { display: block; width: 100%; height: auto; min-width: 700px; }
.rag-cap { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin: -10px 0 24px; }
.rag-figure { margin: 28px 0; }
.rag-figure img, .rag-figure video { display: block; width: 100%; height: auto; border-radius: 8px; border: 1px solid var(--line); }
.rag-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; }
@keyframes ragPulseSoft { 0%,100% { opacity: .35; } 50% { opacity: 1; } }
@keyframes ragPulseStrong{ 0%,100% { opacity: .55; r: 9; } 50% { opacity: 1; r: 11; } }
@keyframes ragMissed { 0%,100% { opacity: .25; } 50% { opacity: .9; } }
@keyframes ragFlow { from { stroke-dashoffset: 24; } to { stroke-dashoffset: 0; } }
@keyframes ragDraw { from { stroke-dashoffset: 800; } to { stroke-dashoffset: 0; } }
.rag-flow-line { stroke-dasharray: 6 4; animation: ragFlow 1.2s linear infinite; }
.rag-traversal { stroke-dasharray: 6 4; animation: ragFlow 1.5s linear infinite; }
.rag-glow { animation: ragPulseStrong 2.4s ease-in-out infinite; }
.rag-missed { animation: ragMissed 3s ease-in-out infinite; }
/* Sequential agent tool highlight (5 tools) */
@keyframes ragAgentT { 0%,18% { opacity: 1; } 25%,100% { opacity: .35; } }
.rag-tool-1 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 0s; }
.rag-tool-2 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 1s; }
.rag-tool-3 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 2s; }
.rag-tool-4 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 3s; }
.rag-tool-5 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 4s; }
/* Sequential float→binary cell transform (8 cells) */
@keyframes ragBitTransform { 0%,8% { opacity: 0; } 14%,100% { opacity: 1; } }
.rag-bit-1 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 0s; }
.rag-bit-2 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .35s; }
.rag-bit-3 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .7s; }
.rag-bit-4 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.05s; }
.rag-bit-5 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.4s; }
.rag-bit-6 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.75s; }
.rag-bit-7 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.1s; }
.rag-bit-8 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.45s; }
@media (prefers-reduced-motion: reduce) {
.rag-flow-line, .rag-traversal, .rag-glow, .rag-missed,
.rag-tool-1,.rag-tool-2,.rag-tool-3,.rag-tool-4,.rag-tool-5,
.rag-bit-1,.rag-bit-2,.rag-bit-3,.rag-bit-4,.rag-bit-5,.rag-bit-6,.rag-bit-7,.rag-bit-8 { animation: none !important; opacity: 1 !important; }
}
.rag-cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 14px; margin: 24px 0; }
.rag-card { background: var(--bg2); border: 1px solid var(--line); border-radius: 10px; padding: 18px 18px; }
.rag-card h4 { font-size: 1rem; font-weight: 600; color: #fff; margin: 0 0 8px; }
.rag-card p { font-size: .88rem; color: var(--muted); margin: 0; }
.rag-pull { border-left: 3px solid var(--accent); padding: 6px 0 6px 18px; margin: 28px 0; font-size: 1.08rem; color: var(--accent); font-style: italic; }
.rag-pull.warn { border-color: var(--warn); color: var(--warn); }
.rag-table { width: 100%; border-collapse: collapse; margin: 18px 0 26px; font-size: .9rem; }
.rag-table th, .rag-table td { padding: 11px 14px; text-align: left; border-bottom: 1px solid var(--line); vertical-align: top; }
.rag-table th { color: var(--accent); font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .14em; text-transform: uppercase; border-bottom: 1px solid var(--accent); font-weight: 600; }
.rag-table td.k { color: #fff; font-weight: 600; width: 25%; }
.rag-table td.v { color: var(--text); }
.rag-table td.m { color: var(--muted); font-size: .85rem; }
.rag-stat { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--line); border: 1px solid var(--line); margin: 28px 0; border-radius: 6px; overflow: hidden; }
.rag-stat > div { background: var(--bg2); padding: 18px 14px; text-align: center; }
.rag-stat .num { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.7rem; color: var(--accent); display: block; }
.rag-stat .lbl { font-family: 'JetBrains Mono', monospace; font-size: .66rem; letter-spacing: .1em; color: var(--muted); text-transform: uppercase; margin-top: 4px; display: block; }
@media (max-width: 600px) {
.rag-art { padding: 1.25rem; }
.rag-stat { grid-template-columns: repeat(2,1fr); }
.rag-visual svg { min-width: 600px; }
}</style><div class="rag-art"><div class="section" style="padding-top:0;border-bottom:none"><p>Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.</p><p>This piece does two things. <strong>First</strong> — break down RAG, Graph RAG, and Agentic RAG visually: how each works, where each one breaks, and which query type it's the right fit for. <strong>Second</strong> — show how a single technique called <em>binary quantization</em>, which stores one bit per vector dimension in place of a 32-bit float, can shrink the vector index inside any of these architectures by a factor of 32 without breaking retrieval quality. This is the trick Perplexity, Azure, and HubSpot use in production.</p><div class="rag-stat"><div><span class="num">3</span><span class="lbl">Architectures<br>compared</span></div><div><span class="num">32×</span><span class="lbl">Memory<br>reduction</span></div><div><span class="num">1-hop / N-hop</span><span class="lbl">When each<br>variant wins</span></div><div><span class="num">0 code</span><span class="lbl">All concept,<br>all visual</span></div></div></div><div class="section"><div class="label">Part 1A · Standard RAG</div><h2>The Default Pipeline — and What It's Actually Good At</h2><p>Standard RAG is what most engineers mean when they say "RAG". Documents are split into chunks, each chunk is embedded into a high-dimensional vector, and those vectors are stored in a vector database. At query time, the user's question is embedded too, and the database returns the top-k chunks by similarity (usually cosine similarity). 
Those chunks are pasted into the LLM's prompt as context, and the model answers from them.</p><div class="rag-visual"><svg viewBox="0 0 860 240" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="240" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">STANDARD RAG · LINEAR PIPELINE</text><defs><marker id="rarr" markerWidth="8" markerHeight="8" refX="5" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#22d3ee"/></marker></defs><g font-family="Inter,sans-serif"><rect x="30" y="92" width="120" height="56" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="90" y="118" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">User Query</text><text x="90" y="135" text-anchor="middle" font-size="9" fill="#8aa0c0">natural language</text><line x1="150" y1="120" x2="195" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect x="200" y="92" width="120" height="56" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="260" y="118" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">Embed</text><text x="260" y="135" text-anchor="middle" font-size="9" fill="#8aa0c0">float32 vector</text><line x1="320" y1="120" x2="365" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect x="370" y="78" width="130" height="84" rx="6" fill="#1f3358" stroke="#00e5a8" stroke-width="1.3"/><text x="435" y="100" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">Vector DB</text><text x="435" y="116" text-anchor="middle" font-size="9" fill="#8aa0c0">cosine similarity</text><text x="435" y="140" text-anchor="middle" font-size="9" fill="#00e5a8" font-family="JetBrains Mono,monospace">top-k chunks</text><line x1="500" y1="120" x2="545" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect 
x="550" y="78" width="130" height="84" rx="6" fill="#142442" stroke="#a78bfa" stroke-width="1.3"/><text x="615" y="100" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">LLM</text><text x="615" y="116" text-anchor="middle" font-size="9" fill="#8aa0c0">prompt + context</text><text x="615" y="140" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">grounded answer</text><line x1="680" y1="120" x2="725" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect x="730" y="92" width="100" height="56" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="780" y="118" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">Answer</text><text x="780" y="135" text-anchor="middle" font-size="9" fill="#8aa0c0">to user</text><line x1="370" y1="178" x2="500" y2="178" stroke="#8aa0c0" stroke-width=".5" stroke-dasharray="3,2"/><text x="435" y="196" text-anchor="middle" font-size="9" fill="#8aa0c0" font-style="italic">retrieval happens here</text><text x="435" y="208" text-anchor="middle" font-size="9" fill="#8aa0c0" font-style="italic">— individual chunks, ranked by similarity</text></g></svg></div><h3>Where Standard RAG Wins</h3><ul><li><strong>Direct factual lookups.</strong> A single chunk contains the answer. "What is our refund policy?" → retrieves the refund policy chunk → done.</li><li><strong>Cost and latency.</strong> One embedding call, one similarity search, one LLM call. Easy to debug.</li><li><strong>Mature tooling.</strong> Pinecone, Weaviate, Qdrant, Milvus, pgvector — all production-ready for this pattern.</li></ul><h3>Where Standard RAG Breaks</h3><p>It retrieves <em>chunks</em>, never the relationships <em>between</em> chunks. 
The moment the answer requires combining facts that live in different documents — or even different sections of the same document — similarity search starts missing things.</p><div class="rag-pull warn">Similarity search will happily return two facts that sit close to the query in embedding space, while the missing third fact that connects them sits far away and never makes it into the context window.</div><p>Concretely, imagine a vector database storing three facts about your internal services:</p><div class="rag-visual"><svg viewBox="0 0 860 280" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="280" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#f59e0b" letter-spacing="2.5" font-weight="600">THE MULTI-HOP PROBLEM</text><rect x="40" y="44" width="780" height="200" rx="6" fill="#0a1220" stroke="#1f3358" stroke-width=".5" stroke-dasharray="3,3"/><text x="55" y="62" font-family="JetBrains Mono,monospace" font-size="8" fill="#8aa0c0" letter-spacing="1">EMBEDDING SPACE (conceptual)</text><g transform="translate(110,180)"><circle r="10" fill="#a78bfa"/><text x="0" y="-18" text-anchor="middle" font-family="Inter,sans-serif" font-size="9.5" fill="#a78bfa" font-weight="600">QUERY</text><text x="0" y="32" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Will checkout be affected</text><text x="0" y="42" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">by Friday's maintenance?"</text></g><g transform="translate(230,150)"><circle r="9" fill="#00e5a8" class="rag-glow"/><text x="0" y="-16" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#00e5a8" font-weight="600">Fact 1 · RETRIEVED</text><text x="0" y="30" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Checkout service uses</text><text x="0" y="40" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">the 
payments API."</text></g><g transform="translate(440,90)"><circle r="9" fill="#ef4444" opacity=".75" class="rag-missed"/><text x="0" y="-14" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#ef4444" font-weight="600">Fact 2 · MISSED</text><text x="0" y="26" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Payments API runs</text><text x="0" y="36" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">on cluster-3."</text><text x="0" y="48" text-anchor="middle" font-family="Inter,sans-serif" font-size="7.5" fill="#ef4444" font-style="italic">no "checkout" · no "maintenance"</text></g><g transform="translate(680,160)"><circle r="9" fill="#00e5a8" class="rag-glow" style="animation-delay:1.2s"/><text x="0" y="-16" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#00e5a8" font-weight="600">Fact 3 · RETRIEVED</text><text x="0" y="30" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Cluster-3 maintenance</text><text x="0" y="40" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">scheduled for Friday."</text></g><line x1="120" y1="180" x2="220" y2="156" stroke="#00e5a8" stroke-width="1" stroke-opacity=".6"/><line x1="120" y1="180" x2="670" y2="166" stroke="#00e5a8" stroke-width="1" stroke-opacity=".6"/><line x1="120" y1="180" x2="430" y2="96" stroke="#ef4444" stroke-width="1" stroke-opacity=".4" stroke-dasharray="4,3"/><text x="430" y="266" text-anchor="middle" font-family="Inter,sans-serif" font-size="10" fill="#f59e0b" font-style="italic">LLM gets facts 1 and 3 but can't link them. Answers "I don't know" or hallucinates a connection.</text></svg></div><p class="rag-cap">The bridge fact sits too far from the query in embedding space. 
Similarity search has no way to find it from where it started.</p></div><div class="section"><div class="label">Part 1B · Graph RAG</div><h2>Adding a Knowledge Graph on Top</h2><p>Graph RAG addresses the multi-hop problem by adding a structural layer over the documents. During indexing, an LLM extracts <strong>entities</strong> (services, people, places, concepts) and the <strong>relationships</strong> between them, building a knowledge graph alongside the vector index. At query time, the system traverses that graph instead of relying purely on embedding similarity.</p><div class="rag-visual"><svg viewBox="0 0 860 320" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="320" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">GRAPH RAG · GRAPH TRAVERSAL OVER LINKED ENTITIES</text><defs><marker id="garr" markerWidth="8" markerHeight="8" refX="5" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#00e5a8"/></marker></defs><g font-family="Inter,sans-serif"><text x="170" y="50" text-anchor="middle" font-size="10" fill="#8aa0c0" font-family="JetBrains Mono,monospace" letter-spacing="1.5">[ AT INDEXING TIME ]</text><rect x="60" y="70" width="220" height="50" rx="4" fill="#142442" stroke="#1f3358"/><text x="170" y="92" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">Documents</text><text x="170" y="108" text-anchor="middle" font-size="9" fill="#8aa0c0">chunks · prose · tables</text><line x1="170" y1="120" x2="170" y2="142" stroke="#00e5a8" stroke-width="1.2" marker-end="url(#garr)"/><rect x="60" y="148" width="220" height="50" rx="4" fill="#142442" stroke="#a78bfa"/><text x="170" y="170" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">LLM Entity Extractor</text><text x="170" y="186" text-anchor="middle" font-size="9" fill="#a78bfa">nodes + edges</text><line x1="170" y1="198" x2="170" y2="220" stroke="#00e5a8" 
stroke-width="1.2" marker-end="url(#garr)"/><rect x="60" y="226" width="220" height="50" rx="4" fill="#1f3358" stroke="#00e5a8" stroke-width="1.3"/><text x="170" y="248" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">Knowledge Graph</text><text x="170" y="264" text-anchor="middle" font-size="9" fill="#00e5a8">Neo4j · Memgraph · property graph</text></g><line x1="320" y1="50" x2="320" y2="290" stroke="#1f3358" stroke-width=".5" stroke-dasharray="3,3"/><text x="600" y="50" text-anchor="middle" font-size="10" fill="#8aa0c0" font-family="JetBrains Mono,monospace" letter-spacing="1.5">[ THE RESULTING GRAPH ]</text><g font-family="Inter,sans-serif"><circle cx="400" cy="120" r="28" fill="#142442" stroke="#22d3ee" stroke-width="1.4"/><text x="400" y="118" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">checkout</text><text x="400" y="130" text-anchor="middle" font-size="9" fill="#8aa0c0">service</text><circle cx="560" cy="170" r="28" fill="#142442" stroke="#22d3ee" stroke-width="1.4"/><text x="560" y="168" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">payments</text><text x="560" y="180" text-anchor="middle" font-size="9" fill="#8aa0c0">API</text><circle cx="720" cy="120" r="28" fill="#142442" stroke="#22d3ee" stroke-width="1.4"/><text x="720" y="118" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">cluster-3</text><text x="720" y="130" text-anchor="middle" font-size="9" fill="#8aa0c0">infra</text><circle cx="640" cy="260" r="28" fill="#142442" stroke="#f59e0b" stroke-width="1.4"/><text x="640" y="258" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Friday</text><text x="640" y="270" text-anchor="middle" font-size="9" fill="#f59e0b">maintenance</text><line x1="428" y1="120" x2="534" y2="166" stroke="#00e5a8" stroke-width="1.2"/><text x="475" y="135" font-size="8" fill="#00e5a8" font-style="italic">uses</text><line x1="588" y1="170" x2="692" y2="124" stroke="#00e5a8" 
stroke-width="1.2"/><text x="635" y="138" font-size="8" fill="#00e5a8" font-style="italic">runs_on</text><line x1="700" y1="148" x2="660" y2="232" stroke="#00e5a8" stroke-width="1.2"/><text x="700" y="195" font-size="8" fill="#00e5a8" font-style="italic">affects</text><path d="M 400 120 Q 480 70 560 170 Q 640 210 640 260" fill="none" stroke="#f59e0b" stroke-width="2.5" opacity=".85" class="rag-traversal" pathLength="100" stroke-dasharray="100"><animate attributeName="stroke-dashoffset" from="100" to="0" dur="2.5s" repeatCount="indefinite"/></path><text x="430" y="298" font-size="9" fill="#f59e0b" font-family="Inter,sans-serif" font-style="italic">↑ Traversal: checkout → uses → payments → runs_on → cluster-3 → affects → Friday maintenance</text></g></svg></div><h3>How a Graph RAG Query Actually Runs</h3><p>Suppose the user asks, "Will checkout be affected by Friday's maintenance?" The system identifies the entities mentioned in the query (<code>checkout</code>, <code>Friday maintenance</code>), looks them up as nodes in the graph, and walks the edges between them. The traversal returns the chain of relationships, and that chain is handed to the LLM as structured context rather than as loosely related chunks of prose.</p><div class="rag-cards"><div class="rag-card"><h4>Multi-hop reasoning</h4><p>Following <code>uses → runs_on → affects</code> recovers the bridge fact that pure similarity search missed.</p></div><div class="rag-card"><h4>Explainable context</h4><p>Every answer comes with a traversable path. Easier to audit than "the top-5 most similar chunks said so."</p></div><div class="rag-card"><h4>Heavier to build</h4><p>Entity extraction at index time is expensive. Schema design matters. Not a free lunch.</p></div><div class="rag-card"><h4>Less flexible than agents</h4><p>The graph schema is fixed at indexing time. 
Queries that need fresh tools or external sources still need help.</p></div></div></div><div class="section"><div class="label">Part 1C · Agentic RAG</div><h2>Letting the LLM Choose How to Retrieve</h2><p>Agentic RAG replaces the fixed retrieval pipeline with an LLM agent that decides — at query time — which tools to invoke, which sources to query, and in what order. The agent might call a vector search, then a SQL database, then a web fetch, then a graph traversal, all in service of one question.</p><div class="rag-visual"><svg viewBox="0 0 860 360" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="360" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">AGENTIC RAG · DYNAMIC TOOL ORCHESTRATION</text><defs><marker id="aarr" markerWidth="7" markerHeight="7" refX="4" refY="3.5" orient="auto"><path d="M0,0 L7,3.5 L0,7 Z" fill="#a78bfa"/></marker></defs><g font-family="Inter,sans-serif"><rect x="30" y="170" width="120" height="40" rx="4" fill="#142442" stroke="#22d3ee"/><text x="90" y="194" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">User Query</text></g><line x1="150" y1="190" x2="310" y2="190" stroke="#a78bfa" stroke-width="1.4" marker-end="url(#aarr)"/><g font-family="Inter,sans-serif"><circle cx="380" cy="190" r="64" fill="#1f3358" stroke="#a78bfa" stroke-width="1.8"/><text x="380" y="186" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">LLM Agent</text><text x="380" y="202" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">plan · choose · iterate</text><text x="380" y="216" text-anchor="middle" font-size="8" fill="#8aa0c0">reflects on partial results</text></g><g font-family="Inter,sans-serif"><rect x="540" y="40" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="605" y="60" text-anchor="middle" font-size="10" fill="#fff" 
font-weight="600">Vector DB</text><text x="605" y="74" text-anchor="middle" font-size="8.5" fill="#8aa0c0">unstructured docs</text><rect x="700" y="100" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="765" y="120" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">SQL Database</text><text x="765" y="134" text-anchor="middle" font-size="8.5" fill="#8aa0c0">structured rows</text><rect x="720" y="170" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="785" y="190" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Knowledge Graph</text><text x="785" y="204" text-anchor="middle" font-size="8.5" fill="#8aa0c0">linked entities</text><rect x="700" y="240" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="765" y="260" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Web Search</text><text x="765" y="274" text-anchor="middle" font-size="8.5" fill="#8aa0c0">fresh facts</text><rect x="540" y="300" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="605" y="320" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Code Interpreter</text><text x="605" y="334" text-anchor="middle" font-size="8.5" fill="#8aa0c0">compute · joins</text><rect x="380" y="300" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="445" y="320" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Internal Systems</text><text x="445" y="334" text-anchor="middle" font-size="8.5" fill="#8aa0c0">Slack · Jira · ITSM</text><line x1="430" y1="155" x2="540" y2="65" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-1"/><line x1="444" y1="175" x2="700" y2="118" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-2"/><line x1="444" y1="194" x2="720" y2="192" stroke="#a78bfa" stroke-width="1.6" 
stroke-dasharray="3,2" class="rag-tool-3"/><line x1="440" y1="220" x2="700" y2="258" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-4"/><line x1="406" y1="248" x2="586" y2="300" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-5"/><line x1="378" y1="254" x2="426" y2="300" stroke="#a78bfa" stroke-width="1.2" stroke-dasharray="3,2" opacity=".5"/><circle cx="380" cy="190" r="64" fill="none" stroke="#a78bfa" stroke-width="1.4" opacity=".6"><animate attributeName="r" from="64" to="96" dur="2.8s" repeatCount="indefinite"/><animate attributeName="opacity" from=".55" to="0" dur="2.8s" repeatCount="indefinite"/></circle></g><text x="430" y="354" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#8aa0c0" font-style="italic">No fixed pipeline. The agent picks the next tool based on what the last tool returned.</text></svg></div><h3>What "Dynamic" Actually Means</h3><p>A user asks: <em>"Has any customer raised a ticket about the checkout outage we had last Friday, and what was our response time on it?"</em> An agentic system might:</p><ul><li>Call the <strong>knowledge graph</strong> to confirm there was an outage on the checkout service last Friday.</li><li>Call <strong>SQL</strong> on the ticketing database to list tickets opened that day mentioning "checkout".</li><li>Call the <strong>vector DB</strong> over chat history to find related customer complaints in Slack.</li><li>Call the <strong>code interpreter</strong> to compute average first-response time on the matching tickets.</li><li>Compose the answer.</li></ul><p>None of that ordering was decided in advance. The agent chose it. That flexibility is the whole point, and the whole risk.</p><div class="rag-cards"><div class="rag-card"><h4>Flexible</h4><p>Handles open-ended tasks that touch multiple data sources and require fresh information.</p></div><div class="rag-card"><h4>Higher latency</h4><p>Several tool calls per question. 
A simple lookup that took 200ms in standard RAG now takes 4–8 seconds.</p></div><div class="rag-card"><h4>Harder to debug</h4><p>The agent's reasoning path is non-deterministic. Reproducing a failure mode can be slippery.</p></div><div class="rag-card"><h4>Can spiral</h4><p>Without tight tool authority and budgets, agents loop on themselves. Pair this with a state machine.</p></div></div></div><div class="section"><div class="label">Part 1D · Decision</div><h2>These Aren't Levels — They're Different Tools</h2><p>The most common mistake is treating these as a maturity ladder you have to climb. They aren't. They solve different query types. A good system often uses all three in different parts of the same product.</p><div class="rag-visual"><svg viewBox="0 0 860 280" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="280" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">PICK BY QUERY TYPE — NOT BY HYPE</text><g font-family="Inter,sans-serif"><rect x="40" y="56" width="240" height="200" rx="8" fill="#142442" stroke="#22d3ee" stroke-width="1.3"/><text x="160" y="86" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">Standard RAG</text><line x1="60" y1="98" x2="260" y2="98" stroke="#22d3ee" stroke-width=".5"/><text x="160" y="124" text-anchor="middle" font-size="10" fill="#22d3ee" font-family="JetBrains Mono,monospace">single-hop</text><text x="160" y="144" text-anchor="middle" font-size="10" fill="#22d3ee" font-family="JetBrains Mono,monospace">factual lookups</text><text x="160" y="174" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"What's our refund policy?"</text><text x="160" y="190" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"How do I reset my password?"</text><text x="160" y="206" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"Where is the SLA defined?"</text><text x="160" y="234" text-anchor="middle" 
font-size="9" fill="#00e5a8" font-style="italic">cost: low · latency: low</text><rect x="310" y="56" width="240" height="200" rx="8" fill="#142442" stroke="#00e5a8" stroke-width="1.3"/><text x="430" y="86" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">Graph RAG</text><line x1="330" y1="98" x2="530" y2="98" stroke="#00e5a8" stroke-width=".5"/><text x="430" y="124" text-anchor="middle" font-size="10" fill="#00e5a8" font-family="JetBrains Mono,monospace">multi-hop</text><text x="430" y="144" text-anchor="middle" font-size="10" fill="#00e5a8" font-family="JetBrains Mono,monospace">relationship queries</text><text x="430" y="174" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"Who depends on cluster-3?"</text><text x="430" y="190" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"What did this RCA reference?"</text><text x="430" y="206" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"What blocks this release?"</text><text x="430" y="234" text-anchor="middle" font-size="9" fill="#00e5a8" font-style="italic">cost: medium · build: heavy</text><rect x="580" y="56" width="240" height="200" rx="8" fill="#142442" stroke="#a78bfa" stroke-width="1.3"/><text x="700" y="86" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">Agentic RAG</text><line x1="600" y1="98" x2="800" y2="98" stroke="#a78bfa" stroke-width=".5"/><text x="700" y="124" text-anchor="middle" font-size="10" fill="#a78bfa" font-family="JetBrains Mono,monospace">multi-source</text><text x="700" y="144" text-anchor="middle" font-size="10" fill="#a78bfa" font-family="JetBrains Mono,monospace">tool-using tasks</text><text x="700" y="174" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"Did anyone open a ticket about</text><text x="700" y="190" text-anchor="middle" font-size="9.5" fill="#8aa0c0">last Friday's outage, and</text><text x="700" y="206" text-anchor="middle" font-size="9.5" fill="#8aa0c0">what was our response time?"</text><text x="700" y="234" 
text-anchor="middle" font-size="9" fill="#a78bfa" font-style="italic">cost: high · debug: hardest</text></g></svg></div><p>Once the right architecture is in place for the query type, the next leverage point is efficiency. Every one of these three depends on a vector index somewhere underneath, and that index is where most of the memory cost lives.</p></div><div class="section"><div class="label">Part 2 · Efficiency</div><h2>How to Make Any RAG 32× More Memory Efficient</h2><figure class="rag-figure"><video autoplay muted loop playsinline preload="metadata" poster="/images/rag-variants/binary-quantization-chip.png"><source src="/images/rag-variants/binary-quantization-chip.mp4" type="video/mp4"><img src="/images/rag-variants/binary-quantization-chip.png" alt="A glowing silicon chip with float values transforming into binary on its surface"/></video><figcaption>The 32× trick: float magnitudes compressed to a single sign bit per dimension.</figcaption></figure><p>Every RAG variant pays the same tax: it stores high-dimensional embeddings of every chunk it's ever indexed. That tax adds up fast. At ten million chunks, a standard 768-dimension float32 index needs about 30 GB just to hold the vectors, and that index has to sit in fast RAM if you want sub-second retrieval. Doubling your corpus doubles the bill.</p><p>The trick that Perplexity, Azure AI Search, and HubSpot all use in production is called <strong>binary quantization</strong>. It cuts the memory footprint by 32 times. 
The architecture above it doesn't change — Standard, Graph, or Agentic, the same trick applies.</p><h3>The Memory Bill, in Numbers</h3><div class="rag-visual"><svg viewBox="0 0 860 220" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="220" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">A SINGLE EMBEDDING — WHAT YOU'RE ACTUALLY PAYING FOR</text><g font-family="JetBrains Mono,monospace"><text x="60" y="60" font-size="9" fill="#8aa0c0">VECTOR:</text><g transform="translate(60,70)"><rect x="0" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="48" y="22" text-anchor="middle" font-size="9.5" fill="#fff">0.42</text><text x="48" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><rect x="96" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="144" y="22" text-anchor="middle" font-size="9.5" fill="#fff">-0.18</text><text x="144" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><rect x="192" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="240" y="22" text-anchor="middle" font-size="9.5" fill="#fff">0.93</text><text x="240" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><rect x="288" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="336" y="22" text-anchor="middle" font-size="9.5" fill="#fff">-0.05</text><text x="336" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><text x="410" y="22" font-size="9" fill="#8aa0c0">…</text><text x="410" y="34" font-size="9" fill="#8aa0c0">…</text><rect x="450" y="0" width="280" height="40" fill="#0a1220" stroke="#1f3358" stroke-width=".5"/><text x="590" y="22" text-anchor="middle" font-size="9.5" fill="#22d3ee">764 more dimensions</text><text x="590" y="34" text-anchor="middle" 
font-size="7" fill="#8aa0c0">×32 bits each</text></g></g><g font-family="Inter,sans-serif"><text x="60" y="138" font-size="11" fill="#fff" font-weight="600">One vector = 768 dimensions × 32 bits = <tspan fill="#f59e0b" font-weight="700">3,072 bytes</tspan></text><text x="60" y="158" font-size="11" fill="#fff" font-weight="600">10 million vectors = <tspan fill="#f59e0b" font-weight="700">~30 GB</tspan>, all in hot RAM for real-time retrieval</text><text x="60" y="178" font-size="10" fill="#8aa0c0" font-style="italic">Most of those 32 bits per dimension are encoding magnitudes you never actually use for ranking.</text></g></svg></div><h3>The Trick: Throw Away Magnitudes, Keep the Sign</h3><p>Binary quantization is structurally simple. For every dimension of every vector, ask one question: <em>is the value positive or negative?</em> Positive becomes <code>1</code>; zero or negative becomes <code>0</code>. The 32-bit float is replaced by a single bit. Same dimensionality, 1/32nd the storage.</p><div class="rag-visual"><svg viewBox="0 0 860 280" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="280" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">FLOAT32 → BINARY · ONE BIT PER DIMENSION</text><g font-family="JetBrains Mono,monospace"><text x="60" y="58" font-size="9" fill="#8aa0c0">FLOAT32:</text><g transform="translate(60,68)"><g><rect x="0" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="43" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.42</text><rect x="90" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="133" y="22" text-anchor="middle" font-size="11" fill="#f59e0b">−0.18</text><rect x="180" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="223" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.93</text><rect x="270" y="0" 
width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="313" y="22" text-anchor="middle" font-size="11" fill="#f59e0b">−0.05</text><rect x="360" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="403" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.21</text><rect x="450" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="493" y="22" text-anchor="middle" font-size="11" fill="#f59e0b">−0.67</text><rect x="540" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="583" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.11</text><rect x="630" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="673" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.88</text></g></g></g><text x="430" y="138" text-anchor="middle" font-family="Inter,sans-serif" font-size="10" fill="#00e5a8" font-style="italic">sign(x) — if x &gt; 0 then 1 else 0</text><line x1="380" y1="118" x2="380" y2="150" stroke="#00e5a8" stroke-width="1.5"/><line x1="380" y1="150" x2="384" y2="146" stroke="#00e5a8" stroke-width="1.5"/><line x1="380" y1="150" x2="376" y2="146" stroke="#00e5a8" stroke-width="1.5"/><g font-family="JetBrains Mono,monospace"><text x="60" y="178" font-size="9" fill="#8aa0c0">BINARY:</text><g transform="translate(60,188)"><g class="rag-bit-1"><rect x="0" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="43" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-2"><rect x="90" y="0" width="86" height="36" fill="#142442" stroke="#00e5a8" stroke-width=".8"/><text x="133" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">0</text></g><g class="rag-bit-3"><rect x="180" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="223" y="24" 
text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-4"><rect x="270" y="0" width="86" height="36" fill="#142442" stroke="#00e5a8" stroke-width=".8"/><text x="313" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">0</text></g><g class="rag-bit-5"><rect x="360" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="403" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-6"><rect x="450" y="0" width="86" height="36" fill="#142442" stroke="#00e5a8" stroke-width=".8"/><text x="493" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">0</text></g><g class="rag-bit-7"><rect x="540" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="583" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-8"><rect x="630" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="673" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g></g></g><text x="60" y="252" font-family="Inter,sans-serif" font-size="11" fill="#fff" font-weight="600">8 floats (256 bits) → 8 bits. Apply to 768 dims: <tspan fill="#00e5a8">3,072 bytes → 96 bytes.</tspan></text><text x="60" y="270" font-family="Inter,sans-serif" font-size="10" fill="#00e5a8" font-style="italic">Exactly 32× reduction. Per vector. Across the whole index.</text></svg></div><h3>The Distance Metric Changes Too: Cosine Becomes Hamming</h3><p>Float32 vectors compare via <strong>cosine similarity</strong>, which is computed from dot products. Binary vectors compare via <strong>Hamming distance</strong>: count the number of bits that differ between two vectors. 
On modern CPUs, this comes down to two instructions per 64-bit word, <code>XOR</code> then <code>popcount</code>, and runs at billions of comparisons per second.</p><div class="rag-visual"><svg viewBox="0 0 860 240" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="240" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">DISTANCE METRIC · BEFORE AND AFTER</text><g font-family="Inter,sans-serif"><rect x="40" y="50" width="380" height="170" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="230" y="76" text-anchor="middle" font-size="12" fill="#fff" font-weight="700">Float32 · Cosine Similarity</text><line x1="120" y1="180" x2="220" y2="110" stroke="#22d3ee" stroke-width="1.5"/><g transform-origin="120 180"><line x1="120" y1="180" x2="300" y2="160" stroke="#a78bfa" stroke-width="1.5"><animateTransform attributeName="transform" type="rotate" values="-8 120 180; 6 120 180; -8 120 180" dur="3.6s" repeatCount="indefinite"/></line></g><path d="M 165 158 A 50 50 0 0 1 195 175" fill="none" stroke="#f59e0b" stroke-width="1.5"/><text x="200" y="155" font-size="11" fill="#f59e0b" font-style="italic">θ</text><text x="230" y="206" text-anchor="middle" font-size="10" fill="#8aa0c0" font-family="JetBrains Mono,monospace">cos(θ) = (A · B) / (||A|| × ||B||)</text></g><g font-family="Inter,sans-serif"><rect x="440" y="50" width="380" height="170" rx="6" fill="#142442" stroke="#00e5a8" stroke-width="1"/><text x="630" y="76" text-anchor="middle" font-size="12" fill="#fff" font-weight="700">Binary · Hamming Distance</text><g font-family="JetBrains Mono,monospace" font-size="14"><text x="480" y="118" fill="#8aa0c0">A:</text><text x="510" y="118" fill="#00e5a8" font-weight="700">1</text><text x="530" y="118" fill="#00e5a8" font-weight="700">0</text><text x="550" y="118" fill="#00e5a8" font-weight="700">1</text><text x="570" y="118" fill="#00e5a8" font-weight="700">0</text><text x="590" 
y="118" fill="#00e5a8" font-weight="700">1</text><text x="610" y="118" fill="#00e5a8" font-weight="700">0</text><text x="630" y="118" fill="#00e5a8" font-weight="700">1</text><text x="650" y="118" fill="#00e5a8" font-weight="700">1</text><text x="480" y="142" fill="#8aa0c0">B:</text><text x="510" y="142" fill="#00e5a8" font-weight="700">1</text><text x="530" y="142" fill="#f59e0b" font-weight="700">1</text><text x="550" y="142" fill="#00e5a8" font-weight="700">1</text><text x="570" y="142" fill="#f59e0b" font-weight="700">1</text><text x="590" y="142" fill="#00e5a8" font-weight="700">1</text><text x="610" y="142" fill="#00e5a8" font-weight="700">0</text><text x="630" y="142" fill="#00e5a8" font-weight="700">1</text><text x="650" y="142" fill="#00e5a8" font-weight="700">1</text><text x="480" y="168" fill="#8aa0c0" font-size="11">XOR:</text><text x="510" y="168" fill="#22d3ee" font-size="14">0</text><text x="530" y="168" fill="#f59e0b" font-size="14">1</text><text x="550" y="168" fill="#22d3ee" font-size="14">0</text><text x="570" y="168" fill="#f59e0b" font-size="14">1</text><text x="590" y="168" fill="#22d3ee" font-size="14">0</text><text x="610" y="168" fill="#22d3ee" font-size="14">0</text><text x="630" y="168" fill="#22d3ee" font-size="14">0</text><text x="650" y="168" fill="#22d3ee" font-size="14">0</text></g><text x="630" y="200" text-anchor="middle" font-size="11" fill="#00e5a8" font-family="Inter,sans-serif">popcount(XOR) = 2 bits different = distance</text></g></svg></div><h3>The Trade-off — and the Fix</h3><p>Of course, throwing away the magnitudes throws away some information. A naive binary index loses roughly 5–10% of retrieval accuracy compared to the full float32 index. 
Production systems solve this with a <strong>two-stage search</strong>: use the cheap binary index to quickly retrieve a wide set of candidates, then re-score that small candidate set using the original full-precision vectors.</p><div class="rag-visual"><svg viewBox="0 0 860 240" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="240" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">TWO-STAGE RETRIEVAL · SPEED FIRST, PRECISION SECOND</text><defs><marker id="harr" markerWidth="8" markerHeight="8" refX="5" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#00e5a8"/></marker></defs><g font-family="Inter,sans-serif"><rect x="40" y="100" width="100" height="44" rx="5" fill="#142442" stroke="#22d3ee"/><text x="90" y="124" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">Query</text><text x="90" y="138" text-anchor="middle" font-size="8.5" fill="#8aa0c0">embedded</text><line x1="140" y1="122" x2="195" y2="122" stroke="#00e5a8" stroke-width="1.4" marker-end="url(#harr)"/><rect x="200" y="80" width="200" height="84" rx="5" fill="#1f3358" stroke="#00e5a8" stroke-width="1.3"/><text x="300" y="102" text-anchor="middle" font-size="11" fill="#fff" font-weight="700">Stage 1 · Binary Index</text><text x="300" y="118" text-anchor="middle" font-size="9" fill="#00e5a8" font-family="JetBrains Mono,monospace">10M vectors · 0.94 GB</text><text x="300" y="134" text-anchor="middle" font-size="9" fill="#8aa0c0">XOR + popcount</text><text x="300" y="150" text-anchor="middle" font-size="9" fill="#8aa0c0">→ top 500 candidates</text><line x1="400" y1="122" x2="455" y2="122" stroke="#00e5a8" stroke-width="1.4" marker-end="url(#harr)"/><rect x="460" y="80" width="200" height="84" rx="5" fill="#1f3358" stroke="#a78bfa" stroke-width="1.3"/><text x="560" y="102" text-anchor="middle" font-size="11" fill="#fff" font-weight="700">Stage 2 · Float Rescore</text><text 
x="560" y="118" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">500 vectors · cosine</text><text x="560" y="134" text-anchor="middle" font-size="9" fill="#8aa0c0">full precision</text><text x="560" y="150" text-anchor="middle" font-size="9" fill="#8aa0c0">→ top 10 results</text><line x1="660" y1="122" x2="715" y2="122" stroke="#00e5a8" stroke-width="1.4" marker-end="url(#harr)"/><rect x="720" y="100" width="100" height="44" rx="5" fill="#142442" stroke="#22d3ee"/><text x="770" y="124" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">LLM</text><text x="770" y="138" text-anchor="middle" font-size="8.5" fill="#8aa0c0">answer</text><circle r="5" fill="#00e5a8" opacity=".9"><animateMotion dur="4s" repeatCount="indefinite" rotate="auto"><mpath href="#ragFlowPath"/></animateMotion></circle><path id="ragFlowPath" d="M 140 122 L 195 122 L 300 122 L 460 122 L 560 122 L 720 122" fill="none" stroke="none"/><rect x="200" y="180" width="200" height="40" rx="4" fill="#0a1220" stroke="#00e5a8" stroke-width=".6" stroke-dasharray="3,2"/><text x="300" y="196" text-anchor="middle" font-size="9" fill="#00e5a8" font-family="JetBrains Mono,monospace">hot RAM</text><text x="300" y="210" text-anchor="middle" font-size="8.5" fill="#8aa0c0">fits in one node's RAM at 1B vectors</text><rect x="460" y="180" width="200" height="40" rx="4" fill="#0a1220" stroke="#a78bfa" stroke-width=".6" stroke-dasharray="3,2"/><text x="560" y="196" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">cold tier · SSD or compressed RAM</text><text x="560" y="210" text-anchor="middle" font-size="8.5" fill="#8aa0c0">accessed only for 500 candidates</text></g></svg></div><p>Stage 1 is where the 32× memory win lives. The binary index is small enough to stay resident in RAM on a single node (about 940 MB for ten million vectors, too large for a CPU cache but small for main memory), so a full scan is a sequential pass of XOR and popcount over memory that typically takes milliseconds rather than seconds. 
Stage 2 only ever touches a few hundred full-precision vectors, so the expensive cosine math is bounded.</p><div class="rag-pull">The recall lost in stage 1 is paid back in stage 2. End-to-end retrieval quality is typically within 1% of a full float32 search, at 1/32 the hot memory.</div><h3>Memory Bill at Scale — Before and After</h3><table class="rag-table"><thead><tr><th>Corpus size</th><th>Float32 only</th><th>Binary (stage 1)</th><th>Hybrid (stage 1 hot + stage 2 cold)</th></tr></thead><tbody><tr><td class="k">1 M vectors</td><td class="v">3 GB · hot RAM</td><td class="v">94 MB · hot RAM</td><td class="m">94 MB hot · 3 GB cold</td></tr><tr><td class="k">10 M vectors</td><td class="v">30 GB · hot RAM</td><td class="v">940 MB · hot RAM</td><td class="m">940 MB hot · 30 GB cold</td></tr><tr><td class="k">100 M vectors</td><td class="v">300 GB · multi-node</td><td class="v">9.4 GB · single node</td><td class="m">9.4 GB hot · 300 GB cold</td></tr><tr><td class="k">1 B vectors</td><td class="v">3 TB · cluster</td><td class="v">94 GB · single beefy node</td><td class="m">94 GB hot · 3 TB cold tier</td></tr></tbody></table><p>The shape of the curve is what matters: the hot index — the part that controls latency — stays manageable even as the corpus grows by orders of magnitude. The cold tier scales linearly but cheaply, because it only gets touched for the few hundred candidates surfaced by stage 1.</p><h3>When to Reach for This</h3><div class="rag-cards"><div class="rag-card"><h4>Above ~1 M vectors</h4><p>Below that scale, plain float32 is fine. The complexity of two-stage retrieval isn't worth the few hundred MB you'd save.</p></div><div class="rag-card"><h4>Hot real-time queries</h4><p>If your retrieval p95 needs to stay under 100ms, the binary first stage is what keeps you there as the index grows.</p></div><div class="rag-card"><h4>Cost-sensitive deployments</h4><p>Saving 30 GB of RAM × 3 replicas × 12 months adds up to real money. 
Especially on managed vector services.</p></div><div class="rag-card"><h4>Any of the three architectures</h4><p>Standard, Graph, Agentic: they all sit on a vector index somewhere. This optimisation applies everywhere they do.</p></div></div></div><div class="section" style="border-bottom:none"><div class="label">Putting It Together</div><h2>Architecture and Efficiency Are Orthogonal</h2><p>Two decisions, independent of each other. <strong>What kind of question does my system have to answer?</strong> That's the architecture decision. Single-hop facts go to Standard RAG. Multi-hop relationship questions go to Graph RAG. Open-ended tool-using tasks go to Agentic RAG. <strong>How big is my index going to get?</strong> That's the efficiency decision. Above a million chunks, binary quantization plus float rescoring buys you 32× memory headroom for ~1% quality cost.</p><p>The same vector index sits underneath all three architectures. The same trick applies to all three. Pick the architecture for the query type. Apply the efficiency trick because the math works.</p><div class="rag-pull">RAG isn't one thing; it's a layered decision. Get the architecture right for the query, then make the index small enough to keep up.</div><p style="text-align:center;margin-top:32px;color:var(--muted);font-size:.85rem"><strong style="color:var(--accent2)">Ajay Walia</strong> · CuriousBit Knowledge Base · May 2026</p></div></div>
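<p>As a rough illustration of Part 2, the two-stage idea (binary scan first, float rescoring second) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not how any particular vector database implements it: the names <code>binarize</code>, <code>hamming</code>, <code>cosine</code>, and <code>two_stage_search</code> are hypothetical, the corpus is a three-vector toy, and the binary codes are stored as Python integers, whereas a production index would more likely use packed byte arrays and hardware popcount.</p>

```python
import math


def binarize(vec):
    """Pack one float vector into an int: a bit is 1 where the value is positive."""
    bits = 0
    for x in vec:
        bits = bits * 2 + (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits: XOR the codes, then count the set bits."""
    return bin(a ^ b).count("1")


def cosine(a, b):
    """Cosine similarity on the original full-precision vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query, vectors, codes, n_candidates=500, top_k=10):
    """Stage 1: Hamming scan over binary codes. Stage 2: cosine rescore."""
    q_code = binarize(query)
    # Stage 1: cheap scan over every binary code, keep the closest candidates.
    by_hamming = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    candidates = by_hamming[:n_candidates]
    # Stage 2: exact cosine on the float vectors of the candidates only.
    rescored = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return rescored[:top_k]


# Toy corpus (hypothetical values). Vector 0 is the 8-value example from the
# float32-to-binary figure; vector 2 has the sign pattern of row B in the
# Hamming figure.
vectors = [
    [0.42, -0.18, 0.93, -0.05, 0.21, -0.67, 0.11, 0.88],
    [-0.40, 0.20, -0.90, 0.10, -0.20, 0.70, -0.10, -0.80],
    [0.40, 0.30, 0.90, 0.20, 0.25, -0.60, 0.10, 0.90],
]
codes = [binarize(v) for v in vectors]

if __name__ == "__main__":
    print(format(codes[0], "08b"), format(codes[2], "08b"))
    print(hamming(codes[0], codes[2]))
    print(two_stage_search(vectors[0], vectors, codes, n_candidates=2, top_k=1))
```

<p>With this toy data, <code>codes[0]</code> and <code>codes[2]</code> reproduce rows A and B of the Hamming figure and differ in two bits. The design point is the split itself: stage 1 only ever reads the small binary codes, and stage 2 only reads the float vectors for the shortlisted candidates, which is what lets the full-precision copy live in a colder, cheaper tier.</p>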
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/rag-variants/hero.png" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>rag</category><category>llm</category><category>vector-search</category><category>architecture</category><category>Knowledge Base</category></item><item><title>Cortex Swarm: Upgrading the Traditional IT Operations with Agentic AI</title><link>https://curiousbit.netlify.app/cortex-swarm-replacing-the-follow-the-sun-model/</link><guid isPermaLink="true">https://curiousbit.netlify.app/cortex-swarm-replacing-the-follow-the-sun-model/</guid><pubDate>Sun, 24 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&amp;family=Noto+Sans:wght@300;400;600;700&amp;display=swap');
.cs-article {
--bg: #080810;
--bg2: #0d0d1a;
--bg3: #111122;
--gold: #c9a227;
--gold-light: #e8cc6a;
--gold-dark: #7a5f0e;
--blue: #1a2744;
--blue2: #0f1d38;
--cream: #e8dcc8;
--muted: #9a8a6a;
--blossom: #c4909a;
--rule: rgba(201,162,39,0.22);
font-family: 'Noto Sans', sans-serif;
line-height: 1.75;
color: var(--cream);
background: var(--bg);
padding: 2.5rem;
border-radius: 16px;
box-shadow: 0 4px 30px rgba(0,0,0,0.6);
}
.cs-article * { box-sizing: border-box; }
.cs-section { padding: 56px 0; border-bottom: 1px solid var(--rule); }
.cs-section:last-child { border-bottom: none; }
.cs-label { font-size: .7rem; letter-spacing: .2em; text-transform: uppercase; color: var(--gold); font-weight: 600; margin-bottom: 10px; }
.cs-article h2 { font-family: 'Noto Serif', serif; font-size: clamp(1.5rem, 3.5vw, 2.2rem); color: var(--gold-light); margin-bottom: 20px; margin-top: 0; }
.cs-article h3 { font-family: 'Noto Serif', serif; font-size: 1.2rem; color: var(--gold-light); margin-bottom: 12px; margin-top: 0; }
.cs-article p { margin-bottom: 18px; font-size: 1rem; color: var(--cream); opacity: 0.95; }
.cs-article ul { list-style: none; padding: 0; margin-bottom: 18px; }
.cs-article ul li { padding: 4px 0 4px 20px; position: relative; font-size: 1rem; }
.cs-article ul li::before { content: "—"; position: absolute; left: 0; color: var(--gold); }
.cs-statbar { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--rule); border: 1px solid var(--rule); margin: 36px 0; border-radius: 4px; overflow: hidden; }
.cs-stat { background: var(--bg2); padding: 20px 14px; text-align: center; }
.cs-stat .num { font-family: 'Noto Serif', serif; font-size: 1.8rem; color: var(--gold); display: block; }
.cs-stat .lbl { font-size: .68rem; letter-spacing: .08em; color: var(--muted); text-transform: uppercase; }
.cs-visual { margin: 32px 0; border-radius: 6px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--rule); }
.cs-visual svg { display: block; width: 100%; height: auto; }
.cs-figure { margin: 32px 0; }
.cs-figure img, .cs-figure video { display: block; width: 100%; height: auto; border-radius: 6px; border: 1px solid var(--rule); }
.cs-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; }
.cs-hero-video { position: relative; margin: 0 0 16px 0; border-radius: 8px; overflow: hidden; border: 1px solid var(--rule); box-shadow: 0 4px 20px rgba(0,0,0,.5); }
.cs-hero-video video { display: block; width: 100%; height: auto; }
.cs-hero-video .overlay { position: absolute; left: 0; right: 0; bottom: 0; padding: 40px 30px 48px; background: linear-gradient(transparent, rgba(8,8,16,.92) 30%); text-align: center; pointer-events: none; }
.cs-hero-video .overlay .h-title { font-family: 'Noto Serif', serif; font-size: clamp(2.2rem, 5.5vw, 4.5rem); color: #fff; font-weight: 900; letter-spacing: 2px; }
.cs-hero-video .overlay .h-sub { font-size: clamp(1rem, 2vw, 1.4rem); letter-spacing: .25em; color: var(--gold); font-weight: 700; margin-top: 14px; }
.cs-hero-video .overlay .h-by { font-size: clamp(0.85rem, 1.5vw, 1.15rem); letter-spacing: .15em; color: var(--muted); margin-top: 18px; font-weight: 500; }
.cs-cards { display: grid; grid-template-columns: repeat(auto-fit,minmax(230px,1fr)); gap: 14px; margin: 24px 0; }
.cs-card { background: var(--bg2); border: 1px solid var(--rule); border-radius: 6px; padding: 20px 18px; }
.cs-agent-row { display: grid; grid-template-columns: 150px 1fr; border: 1px solid var(--rule); border-radius: 6px; overflow: hidden; margin-bottom: 10px; background: var(--bg2); }
.cs-agent-label { background: var(--blue2); padding: 16px 14px; border-right: 1px solid var(--rule); display: flex; flex-direction: column; justify-content: center; }
.cs-agent-label .tier { font-size: .67rem; letter-spacing: .14em; color: var(--gold); text-transform: uppercase; margin-bottom: 3px; }
.cs-agent-label .role { font-family: 'Noto Serif', serif; font-size: .95rem; color: #fff; }
.cs-agent-label .replace { font-size: .67rem; color: var(--muted); margin-top: 5px; }
.cs-agent-body { padding: 14px 18px; }
.cs-agent-body .desc { font-size: .8rem; color: var(--muted); font-style: italic; margin-bottom: 8px; }
.cs-tag-row { display: flex; flex-wrap: wrap; gap: 5px; margin-top: 6px; }
.cs-tag { font-size: .67rem; padding: 2px 8px; border-radius: 20px; background: rgba(201,162,39,.1); border: 1px solid var(--gold-dark); color: var(--gold-light); }
.cs-timeline { position: relative; margin: 24px 0; }
.cs-timeline::before { content: ''; position: absolute; left: 19px; top: 0; bottom: 0; width: 1px; background: var(--rule); }
.cs-tl-item { display: grid; grid-template-columns: 38px 1fr; gap: 14px; margin-bottom: 20px; }
.cs-tl-dot { width: 38px; height: 38px; border-radius: 50%; background: var(--bg2); border: 1px solid var(--gold); display: flex; align-items: center; justify-content: center; font-size: .7rem; color: var(--gold); font-weight: 700; position: relative; z-index: 1; }
.cs-tl-content { padding-top: 6px; }
.cs-tl-content h3 { font-size: .9rem; margin-bottom: 3px; }
.cs-tl-content p { font-size: .82rem; color: var(--muted); margin: 0; }
.cs-phase-grid { display: grid; grid-template-columns: repeat(3,1fr); gap: 10px; margin: 24px 0; }
.cs-phase-card { background: var(--bg2); border: 1px solid var(--rule); border-radius: 6px; padding: 16px 14px; }
.cs-phase-num { font-size: .65rem; letter-spacing: .14em; color: var(--gold); text-transform: uppercase; margin-bottom: 5px; }
.cs-phase-title { font-family: 'Noto Serif', serif; font-size: .95rem; color: #fff; margin-bottom: 7px; }
.cs-phase-trigger { font-size: .68rem; color: var(--muted); border-top: 1px solid var(--rule); padding-top: 7px; margin-top: 7px; }
.cs-table { width: 100%; border-collapse: collapse; margin: 20px 0; font-size: .85rem; }
.cs-table th, .cs-table td { padding: 10px 12px; text-align: left; border-bottom: 1px solid var(--rule); vertical-align: top; }
.cs-table th { color: var(--gold); font-weight: 600; font-size: .7rem; letter-spacing: .1em; text-transform: uppercase; border-bottom: 1px solid var(--gold-dark); }
.cs-table td.k { color: var(--gold-light); font-family: 'Noto Serif', serif; width: 30%; }
.cs-table td.v { color: var(--cream); }
.cs-table td.w { color: var(--muted); font-size: .8rem; }
.cs-pull { border-left: 2px solid var(--gold); padding: 6px 0 6px 16px; margin: 24px 0; font-family: 'Noto Serif', serif; font-size: 1.05rem; color: var(--gold-light); font-style: italic; }
.cs-ornament { text-align: center; padding: 14px 0; color: var(--gold-dark); letter-spacing: .4em; font-size: .78rem; }
@media (max-width: 600px) {
.cs-article { padding: 1.25rem; }
.cs-statbar { grid-template-columns: repeat(2,1fr); }
.cs-phase-grid { grid-template-columns: 1fr 1fr; }
.cs-agent-row { grid-template-columns: 1fr; }
.cs-agent-label { border-right: none; border-bottom: 1px solid var(--rule); }
.cs-table { font-size: .78rem; }
.cs-table td.k { width: 35%; }
}
&lt;/style&gt;
&lt;div class="cs-article"&gt;
&lt;!-- ── HERO ── --&gt;
&lt;div class="cs-hero-video"&gt;
&lt;video autoplay muted loop playsinline preload="metadata" poster="/images/cortex-swarm/cortex-swarm-hero-v2.png"&gt;
&lt;source src="https://curiousbit.netlify.app/images/cortex-swarm/cortex-swarm-hero-v2.mp4" type="video/mp4"&gt;
&lt;img src="https://curiousbit.netlify.app/images/cortex-swarm/cortex-swarm-hero-v2.png" alt="Five masked specialists seated around a low table at twilight — the swarm as a team"&gt;
&lt;/video&gt;
&lt;div class="overlay"&gt;
&lt;div class="h-title"&gt;Cortex Swarm&lt;/div&gt;
&lt;div class="h-sub"&gt;REPLACING THE FOLLOW-THE-SUN MODEL · 2026&lt;/div&gt;
&lt;div class="h-by"&gt;AJAY WALIA · DIGITAL WORKPLACE OPERATIONS · MAY 2026&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;!-- ══════════════════════════════
OPENING — set the thesis upfront
══════════════════════════════ --&gt;
&lt;div class="cs-section" style="padding-top:8px"&gt;
&lt;p&gt;Every employee depends on a Workplace Operations team they will never meet. It is the team that resets their MFA when they fly to a new country, recovers their shared mailbox when it stops syncing, pushes the Intune policy that lets them install a piece of software, and decides at 3am whether a regional O365 or Exchange outage warrants paging a human.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/cortex-swarm/cortex-swarm-hero-v2.png" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&family=Noto+Sans:wght@300;400;600;700&display=swap');
.cs-article {
--bg: #080810;
--bg2: #0d0d1a;
--bg3: #111122;
--gold: #c9a227;
--gold-light: #e8cc6a;
--gold-dark: #7a5f0e;
--blue: #1a2744;
--blue2: #0f1d38;
--cream: #e8dcc8;
--muted: #9a8a6a;
--blossom: #c4909a;
--rule: rgba(201,162,39,0.22);
font-family: 'Noto Sans', sans-serif;
line-height: 1.75;
color: var(--cream);
background: var(--bg);
padding: 2.5rem;
border-radius: 16px;
box-shadow: 0 4px 30px rgba(0,0,0,0.6);
}
.cs-article * { box-sizing: border-box; }
.cs-section { padding: 56px 0; border-bottom: 1px solid var(--rule); }
.cs-section:last-child { border-bottom: none; }
.cs-label { font-size: .7rem; letter-spacing: .2em; text-transform: uppercase; color: var(--gold); font-weight: 600; margin-bottom: 10px; }
.cs-article h2 { font-family: 'Noto Serif', serif; font-size: clamp(1.5rem, 3.5vw, 2.2rem); color: var(--gold-light); margin-bottom: 20px; margin-top: 0; }
.cs-article h3 { font-family: 'Noto Serif', serif; font-size: 1.2rem; color: var(--gold-light); margin-bottom: 12px; margin-top: 0; }
.cs-article p { margin-bottom: 18px; font-size: 1rem; color: var(--cream); opacity: 0.95; }
.cs-article ul { list-style: none; padding: 0; margin-bottom: 18px; }
.cs-article ul li { padding: 4px 0 4px 20px; position: relative; font-size: 1rem; }
.cs-article ul li::before { content: "—"; position: absolute; left: 0; color: var(--gold); }
.cs-statbar { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--rule); border: 1px solid var(--rule); margin: 36px 0; border-radius: 4px; overflow: hidden; }
.cs-stat { background: var(--bg2); padding: 20px 14px; text-align: center; }
.cs-stat .num { font-family: 'Noto Serif', serif; font-size: 1.8rem; color: var(--gold); display: block; }
.cs-stat .lbl { font-size: .68rem; letter-spacing: .08em; color: var(--muted); text-transform: uppercase; }
.cs-visual { margin: 32px 0; border-radius: 6px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--rule); }
.cs-visual svg { display: block; width: 100%; height: auto; }
.cs-figure { margin: 32px 0; }
.cs-figure img, .cs-figure video { display: block; width: 100%; height: auto; border-radius: 6px; border: 1px solid var(--rule); }
.cs-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; }
.cs-hero-video { position: relative; margin: 0 0 16px 0; border-radius: 8px; overflow: hidden; border: 1px solid var(--rule); box-shadow: 0 4px 20px rgba(0,0,0,.5); }
.cs-hero-video video { display: block; width: 100%; height: auto; }
.cs-hero-video .overlay { position: absolute; left: 0; right: 0; bottom: 0; padding: 40px 30px 48px; background: linear-gradient(transparent, rgba(8,8,16,.92) 30%); text-align: center; pointer-events: none; }
.cs-hero-video .overlay .h-title { font-family: 'Noto Serif', serif; font-size: clamp(2.2rem, 5.5vw, 4.5rem); color: #fff; font-weight: 900; letter-spacing: 2px; }
.cs-hero-video .overlay .h-sub { font-size: clamp(1rem, 2vw, 1.4rem); letter-spacing: .25em; color: var(--gold); font-weight: 700; margin-top: 14px; }
.cs-hero-video .overlay .h-by { font-size: clamp(0.85rem, 1.5vw, 1.15rem); letter-spacing: .15em; color: var(--muted); margin-top: 18px; font-weight: 500; }
.cs-cards { display: grid; grid-template-columns: repeat(auto-fit,minmax(230px,1fr)); gap: 14px; margin: 24px 0; }
.cs-card { background: var(--bg2); border: 1px solid var(--rule); border-radius: 6px; padding: 20px 18px; }
.cs-agent-row { display: grid; grid-template-columns: 150px 1fr; border: 1px solid var(--rule); border-radius: 6px; overflow: hidden; margin-bottom: 10px; background: var(--bg2); }
.cs-agent-label { background: var(--blue2); padding: 16px 14px; border-right: 1px solid var(--rule); display: flex; flex-direction: column; justify-content: center; }
.cs-agent-label .tier { font-size: .67rem; letter-spacing: .14em; color: var(--gold); text-transform: uppercase; margin-bottom: 3px; }
.cs-agent-label .role { font-family: 'Noto Serif', serif; font-size: .95rem; color: #fff; }
.cs-agent-label .replace { font-size: .67rem; color: var(--muted); margin-top: 5px; }
.cs-agent-body { padding: 14px 18px; }
.cs-agent-body .desc { font-size: .8rem; color: var(--muted); font-style: italic; margin-bottom: 8px; }
.cs-tag-row { display: flex; flex-wrap: wrap; gap: 5px; margin-top: 6px; }
.cs-tag { font-size: .67rem; padding: 2px 8px; border-radius: 20px; background: rgba(201,162,39,.1); border: 1px solid var(--gold-dark); color: var(--gold-light); }
.cs-timeline { position: relative; margin: 24px 0; }
.cs-timeline::before { content: ''; position: absolute; left: 19px; top: 0; bottom: 0; width: 1px; background: var(--rule); }
.cs-tl-item { display: grid; grid-template-columns: 38px 1fr; gap: 14px; margin-bottom: 20px; }
.cs-tl-dot { width: 38px; height: 38px; border-radius: 50%; background: var(--bg2); border: 1px solid var(--gold); display: flex; align-items: center; justify-content: center; font-size: .7rem; color: var(--gold); font-weight: 700; position: relative; z-index: 1; }
.cs-tl-content { padding-top: 6px; }
.cs-tl-content h3 { font-size: .9rem; margin-bottom: 3px; }
.cs-tl-content p { font-size: .82rem; color: var(--muted); margin: 0; }
.cs-phase-grid { display: grid; grid-template-columns: repeat(3,1fr); gap: 10px; margin: 24px 0; }
.cs-phase-card { background: var(--bg2); border: 1px solid var(--rule); border-radius: 6px; padding: 16px 14px; }
.cs-phase-num { font-size: .65rem; letter-spacing: .14em; color: var(--gold); text-transform: uppercase; margin-bottom: 5px; }
.cs-phase-title { font-family: 'Noto Serif', serif; font-size: .95rem; color: #fff; margin-bottom: 7px; }
.cs-phase-trigger { font-size: .68rem; color: var(--muted); border-top: 1px solid var(--rule); padding-top: 7px; margin-top: 7px; }
.cs-table { width: 100%; border-collapse: collapse; margin: 20px 0; font-size: .85rem; }
.cs-table th, .cs-table td { padding: 10px 12px; text-align: left; border-bottom: 1px solid var(--rule); vertical-align: top; }
.cs-table th { color: var(--gold); font-weight: 600; font-size: .7rem; letter-spacing: .1em; text-transform: uppercase; border-bottom: 1px solid var(--gold-dark); }
.cs-table td.k { color: var(--gold-light); font-family: 'Noto Serif', serif; width: 30%; }
.cs-table td.v { color: var(--cream); }
.cs-table td.w { color: var(--muted); font-size: .8rem; }
.cs-pull { border-left: 2px solid var(--gold); padding: 6px 0 6px 16px; margin: 24px 0; font-family: 'Noto Serif', serif; font-size: 1.05rem; color: var(--gold-light); font-style: italic; }
.cs-ornament { text-align: center; padding: 14px 0; color: var(--gold-dark); letter-spacing: .4em; font-size: .78rem; }
@media (max-width: 600px) {
.cs-article { padding: 1.25rem; }
.cs-statbar { grid-template-columns: repeat(2,1fr); }
.cs-phase-grid { grid-template-columns: 1fr 1fr; }
.cs-agent-row { grid-template-columns: 1fr; }
.cs-agent-label { border-right: none; border-bottom: 1px solid var(--rule); }
.cs-table { font-size: .78rem; }
.cs-table td.k { width: 35%; }
}</style><div class="cs-article"><div class="cs-hero-video"><video autoplay muted loop playsinline preload="metadata" poster="/images/cortex-swarm/cortex-swarm-hero-v2.png"><source src="/images/cortex-swarm/cortex-swarm-hero-v2.mp4" type="video/mp4"><img src="/images/cortex-swarm/cortex-swarm-hero-v2.png" alt="Five masked specialists seated around a low table at twilight — the swarm as a team"/></video><div class="overlay"><div class="h-title">Cortex Swarm</div><div class="h-sub">REPLACING THE FOLLOW-THE-SUN MODEL · 2026</div><div class="h-by">AJAY WALIA · DIGITAL WORKPLACE OPERATIONS · MAY 2026</div></div></div><div class="cs-section" style="padding-top:8px"><p>Every employee depends on a Workplace Operations team they will never meet. It is the team that resets their MFA when they fly to a new country, recovers their shared mailbox when it stops syncing, pushes the Intune policy that lets them install a piece of software, and decides at 3am whether a regional O365 or Exchange outage warrants paging a human.</p><p>This piece is about three things, in order:</p><ul><li><strong style="color:var(--gold-light)">First</strong> — how those teams are actually structured today, how they function day-to-day, and the structural problems they carry.</li><li><strong style="color:var(--gold-light)">Second</strong> — where agentic AI sits in 2026, and where the field is heading over the next two to three years.</li><li><strong style="color:var(--gold-light)">Third</strong> — how a small swarm of specialised agents can replace this team tier-for-tier, what efficiencies that produces, and the new set of challenges it creates in return.</li></ul><div class="cs-pull">The org chart is the answer. 
The five tiers that make a DWP team work for humans are the same five seams that make it work for agents.</div><div class="cs-statbar"><div class="cs-stat"><span class="num">5</span><span class="lbl">Autonomous agents<br>L1 → SDM</span></div><div class="cs-stat"><span class="num">~137</span><span class="lbl">FTE mirrored<br>across all tiers</span></div><div class="cs-stat"><span class="num">24×7</span><span class="lbl">Single team<br>no shift roster</span></div><div class="cs-stat"><span class="num">1-click</span><span class="lbl">Audit verify<br>any ticket</span></div></div></div><div class="cs-section"><div class="cs-label">Part 1 · Structure</div><h2>How Digital Workplace Operations Teams Are Structured Today</h2><p>A DWP team exists because every employee uses IT every day, and someone has to keep that working. For a Global 500 with 10,000–100,000+ employees, the work is too broad, too multilingual, and too time-zone-spanning for an in-house team. Almost without exception, it is outsourced to a Tier-1 IT services firm — TCS, Infosys, Wipro, Accenture, HCLTech, Cognizant — running a 24×7 follow-the-sun roster across multiple delivery centres.</p><h3 style="margin-top:28px">The Scope — What Actually Sits Under "Workplace"</h3><p>The label undersells the breadth. 
A typical DWP contract covers seven functional areas, each with its own runbooks, its own vendors, and its own escalation paths.</p><div class="cs-cards"><div class="cs-card"><h3>Identity</h3><p style="font-size:.82rem;color:var(--muted);margin:0">Who you are — joiner / mover / leaver, password, MFA, SSO, entitlements</p></div><div class="cs-card"><h3>Access</h3><p style="font-size:.82rem;color:var(--muted);margin:0">What you can use — catalogs, licenses, groups, approvals</p></div><div class="cs-card"><h3>Productivity</h3><p style="font-size:.82rem;color:var(--muted);margin:0">Outlook, Teams, M365, OneDrive, SharePoint</p></div><div class="cs-card"><h3>Endpoint</h3><p style="font-size:.82rem;color:var(--muted);margin:0">Laptops, peripherals, network, Intune compliance, patching</p></div><div class="cs-card"><h3>Applications</h3><p style="font-size:.82rem;color:var(--muted);margin:0">Line-of-business app support, vendor liaison</p></div><div class="cs-card"><h3>Infrastructure</h3><p style="font-size:.82rem;color:var(--muted);margin:0">Connectivity, VPN, cloud health, capacity</p></div><div class="cs-card"><h3>Change &amp; Governance</h3><p style="font-size:.82rem;color:var(--muted);margin:0">RFCs, CAB, RCAs, SLA reporting, customer comms</p></div></div><h3 style="margin-top:36px">The Five-Tier Hierarchy</h3><p>To deliver against this scope at scale, providers build a five-tier hierarchy. Each tier exists because of what the tier below it can't or shouldn't do. 
Tickets enter at the bottom and move upward only when scope, authority, or evidence demands it.</p><div class="cs-visual"><svg viewBox="0 0 820 420" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="420" fill="#0d0d1a"/><text x="410" y="28" text-anchor="middle" font-family="Georgia,serif" font-size="11" fill="#c9a227" letter-spacing="2.5">REPORTING LINES, RESPONSIBILITY, AND VOLUME</text><line x1="120" y1="38" x2="700" y2="38" stroke="#c9a227" stroke-width=".4" stroke-opacity=".35"/><defs><marker id="dwparr" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto"><path d="M0,0 L6,3 L0,6 Z" fill="#c9a227" fill-opacity=".5"/></marker></defs>
<!-- SDM -->
<rect x="340" y="58" width="140" height="44" rx="4" fill="#1a2744" stroke="#c9a227" stroke-width=".9"/>
<text x="410" y="76" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">SDM</text>
<text x="410" y="90" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">Service Delivery Manager · ~2 FTE</text>
<!-- Architect -->
<rect x="340" y="128" width="140" height="44" rx="4" fill="#1a2744" stroke="#c9a227" stroke-width=".9"/>
<text x="410" y="146" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">Architect</text>
<text x="410" y="160" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">Design Authority · ~3 FTE</text>
<!-- L3 -->
<rect x="340" y="198" width="140" height="44" rx="4" fill="#1a2744" stroke="#c9a227" stroke-width=".9"/>
<text x="410" y="216" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">L3 Engineering</text>
<text x="410" y="230" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">Senior Engineers · ~12 FTE</text>
<!-- L2 -->
<rect x="200" y="268" width="420" height="44" rx="4" fill="#0f1a30" stroke="#c9a227" stroke-width=".9"/>
<text x="410" y="286" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">L2 Specialists</text>
<text x="410" y="300" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">M365 · Intune · Exchange · ServiceNow · ~40 FTE · 2 shifts</text>
<!-- L1 -->
<rect x="80" y="338" width="660" height="48" rx="4" fill="#0a1020" stroke="#c9a227" stroke-width="1.2"/>
<text x="410" y="358" text-anchor="middle" font-size="11" fill="#c9a227" font-weight="700" font-family="sans-serif">L1 Service Desk</text>
<text x="410" y="374" text-anchor="middle" font-size="8.5" fill="#9a8a6a" font-family="sans-serif">~80 FTE · 3 follow-the-sun shifts · handles ~80% of all ticket volume</text>
<!-- Reporting lines -->
<line x1="410" y1="172" x2="410" y2="198" stroke="#c9a227" stroke-width=".8" stroke-opacity=".55" stroke-dasharray="3,2"/>
<line x1="410" y1="102" x2="410" y2="128" stroke="#c9a227" stroke-width=".8" stroke-opacity=".55" stroke-dasharray="3,2"/>
<line x1="410" y1="242" x2="410" y2="268" stroke="#c9a227" stroke-width=".8" stroke-opacity=".55" marker-end="url(#dwparr)"/>
<line x1="410" y1="312" x2="410" y2="338" stroke="#c9a227" stroke-width=".8" stroke-opacity=".55" marker-end="url(#dwparr)"/>
<!-- Side annotations -->
<text x="500" y="80" font-size="8" fill="#9a8a6a" font-family="sans-serif">↔ owns the customer relationship</text>
<text x="500" y="150" font-size="8" fill="#9a8a6a" font-family="sans-serif">↔ owns design + RCA authority</text>
<text x="500" y="220" font-size="8" fill="#9a8a6a" font-family="sans-serif">↔ root cause + change requests</text>
<text x="160" y="290" font-size="8" fill="#9a8a6a" font-family="sans-serif">↔ specialist diagnosis</text>
<text x="40" y="360" font-size="8" fill="#9a8a6a" font-family="sans-serif" font-weight="600">FRONT LINE</text>
<text x="410" y="408" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif" font-style="italic">Tickets enter at L1 and escalate upward only when scope, authority, or evidence demands it.</text>
</svg></div><p style="font-size:.88rem;color:var(--muted);font-style:italic">The exact FTE counts vary with employee population and contract scope. The shape — heavy at the base, narrowing to a point — is universal.</p></div><div class="cs-section"><div class="cs-label">Part 2 · Function</div><h2>How They Actually Function Day-to-Day</h2><p>Three forces govern day-to-day operation: time zones, ticket flow, and knowledge. Understanding all three is what makes the rest of the piece make sense.</p><h3 style="margin-top:24px">Time Zones — the Follow-the-Sun Roster</h3><p>Coverage is achieved by handing tickets between geographies as the sun moves. A ticket opened in Sydney at 4pm local rolls over to Manila, then to Mumbai or Hyderabad, then to Krakow or Sofia, then to a US east-coast hub. Three or four formal shift handoffs per day, every day, forever.</p><div class="cs-visual"><svg viewBox="0 0 820 320" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="320" fill="#0d0d1a"/><text x="410" y="24" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2.5" font-family="Georgia,serif">A 24-HOUR DAY ACROSS THREE GEOGRAPHIES</text>
<!-- 24-hour ring -->
<circle cx="410" cy="170" r="120" fill="#0a1020" stroke="#c9a227" stroke-width=".6" stroke-opacity=".45"/>
<circle cx="410" cy="170" r="98" fill="none" stroke="#c9a227" stroke-width=".3" stroke-opacity=".3"/>
<text x="410" y="62" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">00:00 UTC</text>
<text x="540" y="174" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">06:00</text>
<text x="410" y="296" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">12:00</text>
<text x="280" y="174" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">18:00</text>
<!-- APAC sector (0-8 UTC) -->
<path d="M 410 170 L 410 50 A 120 120 0 0 1 514 230 Z" fill="#1a2744" fill-opacity=".55" stroke="#c9a227" stroke-width=".4"/>
<text x="478" y="120" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">APAC</text>
<text x="478" y="134" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Manila · Sydney</text>
<text x="478" y="145" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">08:00 – 18:00 local</text>
<!-- EMEA sector (8-16 UTC) -->
<path d="M 410 170 L 514 230 A 120 120 0 0 1 306 230 Z" fill="#162038" fill-opacity=".55" stroke="#c9a227" stroke-width=".4"/>
<text x="410" y="240" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">EMEA / India</text>
<text x="410" y="254" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Mumbai · Krakow · Sofia</text>
<text x="410" y="265" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">08:00 – 18:00 local</text>
<!-- AMER sector (16-24 UTC) -->
<path d="M 410 170 L 306 230 A 120 120 0 0 1 410 50 Z" fill="#111830" fill-opacity=".55" stroke="#c9a227" stroke-width=".4"/>
<text x="342" y="120" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif">AMER</text>
<text x="342" y="134" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Atlanta · Dallas</text>
<text x="342" y="145" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">08:00 – 18:00 local</text>
<!-- Handoff markers -->
<circle cx="410" cy="50" r="5" fill="#c9a227"/>
<circle cx="514" cy="230" r="5" fill="#c9a227"/>
<circle cx="306" cy="230" r="5" fill="#c9a227"/>
<!-- Right-side notes -->
<text x="660" y="80" font-size="9" fill="#c9a227" font-family="Georgia,serif">Three handoffs / day</text>
<text x="660" y="100" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Open tickets transfer at each</text>
<text x="660" y="112" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">geographic boundary</text>
<text x="660" y="140" font-size="9" fill="#c9a227" font-family="Georgia,serif">Context lost at each handoff</text>
<text x="660" y="160" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Notes summarised, not</text>
<text x="660" y="172" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">replayed in full</text>
<text x="660" y="200" font-size="9" fill="#c9a227" font-family="Georgia,serif">Weekend &amp; holiday gaps</text>
<text x="660" y="220" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Coverage thins where</text>
<text x="660" y="232" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">no region is in business hours</text>
</svg></div><h3 style="margin-top:36px">Ticket Flow — Entry, Triage, Escalation, Closure</h3><p>Every employee interaction is a ticket. Most enter via chat or self-service portal, a smaller share through phone or email. From entry, the path is the same: triage at L1, attempt resolution, escalate if the agent at the current tier cannot solve it within authority and budget, then close.</p><div class="cs-visual"><svg viewBox="0 0 820 230" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="230" fill="#0d0d1a"/><text x="410" y="22" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2" font-family="Georgia,serif">TICKET LIFECYCLE — THE COMMON PATH</text><defs><marker id="tfarr" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto"><path d="M0,0 L6,3 L0,6 Z" fill="#c9a227"/></marker></defs><pre><code> &lt;!-- Entry sources --&gt;
&lt;g transform="translate(40,80)"&gt;
&lt;rect width="100" height="34" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="50" y="15" text-anchor="middle" font-size="7.5" fill="#e8cc6a" font-family="sans-serif"&gt;CHAT / PORTAL&lt;/text&gt;
&lt;text x="50" y="27" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;~70% of entries&lt;/text&gt;
&lt;rect y="44" width="100" height="34" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;text x="50" y="59" text-anchor="middle" font-size="7.5" fill="#e8cc6a" font-family="sans-serif"&gt;PHONE&lt;/text&gt;
&lt;text x="50" y="71" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;~20%&lt;/text&gt;
&lt;rect y="88" width="100" height="34" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;text x="50" y="103" text-anchor="middle" font-size="7.5" fill="#e8cc6a" font-family="sans-serif"&gt;EMAIL / OTHER&lt;/text&gt;
&lt;text x="50" y="115" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;~10%&lt;/text&gt;
&lt;/g&gt;
&lt;line x1="145" y1="118" x2="195" y2="118" stroke="#c9a227" stroke-width=".7" marker-end="url(#tfarr)"/&gt;
&lt;!-- Triage --&gt;
&lt;g transform="translate(200,98)"&gt;
&lt;rect width="105" height="44" rx="3" fill="#162038" stroke="#c9a227" stroke-width=".9"/&gt;
&lt;text x="52" y="20" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="Georgia,serif"&gt;L1 Triage&lt;/text&gt;
&lt;text x="52" y="34" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;Categorise · prioritise&lt;/text&gt;
&lt;/g&gt;
&lt;line x1="310" y1="120" x2="360" y2="120" stroke="#c9a227" stroke-width=".7" marker-end="url(#tfarr)"/&gt;
&lt;!-- Resolution attempt --&gt;
&lt;g transform="translate(365,98)"&gt;
&lt;rect width="120" height="44" rx="3" fill="#162038" stroke="#c9a227" stroke-width=".9"/&gt;
&lt;text x="60" y="20" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="Georgia,serif"&gt;Resolve in tier&lt;/text&gt;
&lt;text x="60" y="34" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;SOP · KB · tools&lt;/text&gt;
&lt;/g&gt;
&lt;!-- Branch up: escalate --&gt;
&lt;line x1="425" y1="98" x2="425" y2="60" stroke="#c9a227" stroke-width=".7" marker-end="url(#tfarr)"/&gt;
&lt;g transform="translate(380,30)"&gt;
&lt;rect width="90" height="34" rx="3" fill="#1a2744" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="45" y="15" text-anchor="middle" font-size="8" fill="#e8cc6a" font-family="Georgia,serif"&gt;Escalate&lt;/text&gt;
&lt;text x="45" y="27" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;L1 → L2 → L3&lt;/text&gt;
&lt;/g&gt;
&lt;text x="442" y="80" font-size="6.5" fill="#9a8a6a" font-family="sans-serif"&gt;~20%&lt;/text&gt;
&lt;!-- Branch right: close --&gt;
&lt;line x1="487" y1="120" x2="550" y2="120" stroke="#c9a227" stroke-width=".7" marker-end="url(#tfarr)"/&gt;
&lt;g transform="translate(555,98)"&gt;
&lt;rect width="95" height="44" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".9"/&gt;
&lt;text x="47" y="20" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="Georgia,serif"&gt;Close&lt;/text&gt;
&lt;text x="47" y="34" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;Notify · KB tag&lt;/text&gt;
&lt;/g&gt;
&lt;text x="510" y="115" font-size="6.5" fill="#9a8a6a" font-family="sans-serif"&gt;~80%&lt;/text&gt;
&lt;!-- Loop back from escalate to resolve at higher tier --&gt;
&lt;path d="M 470 47 Q 540 47 540 100" fill="none" stroke="#c9a227" stroke-width=".6" stroke-opacity=".5" stroke-dasharray="3,2"/&gt;
&lt;text x="615" y="55" font-size="7" fill="#9a8a6a" font-family="sans-serif" font-style="italic"&gt;Higher tier owns it now,&lt;/text&gt;
&lt;text x="615" y="67" font-size="7" fill="#9a8a6a" font-family="sans-serif" font-style="italic"&gt;same flow recurses&lt;/text&gt;
&lt;!-- Bottom strip: SLA clock --&gt;
&lt;rect x="40" y="170" width="740" height="40" rx="3" fill="#080810" stroke="#c9a227" stroke-width=".4" stroke-opacity=".4"/&gt;
&lt;text x="410" y="188" text-anchor="middle" font-size="8.5" fill="#c9a227" font-family="Georgia,serif"&gt;SLA CLOCK runs continuously · pauses only on AWAITING_USER · resumes on every state change&lt;/text&gt;
&lt;text x="410" y="202" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Time to acknowledge · time to resolve · time to communicate — measured per priority (P1–P4)&lt;/text&gt;
&lt;/svg&gt;</code></pre></div><h3 style="margin-top:36px">Knowledge — Runbooks, KBs, and Tribal Memory</h3><p>Each tier owns a knowledge base scoped to its authority. L1 has SOPs for ~40 standard scenarios. L2 holds vendor documentation, Intune policy templates, and Exchange runbooks. L3 holds architecture decision records, past postmortems, and vendor escalation contacts. The Architect carries the long-term design library; the SDM holds SLA templates, comms playbooks, and historical breach reports.</p><p>A great deal also lives in <em>tribal memory</em> — the senior engineer who happens to remember that a similar incident last August was caused by a CA policy. That memory walks out the door every time someone resigns.</p><figure class="cs-figure"><video autoplay muted loop playsinline preload="metadata" poster="/images/cortex-swarm/dwp-operations-floor.png"><source src="/images/cortex-swarm/dwp-operations-floor.mp4" type="video/mp4"><img src="/images/cortex-swarm/dwp-operations-floor.png" alt="Atmospheric illustration of a 24×7 IT operations floor across three time zones"/></video><figcaption>The operations floor — a continent away, three shifts deep, somewhere working through your ticket right now.</figcaption></figure></div><div class="cs-section"><div class="cs-label">Part 3 · Challenges</div><h2>What's Structurally Wrong With This Model</h2><p>Nothing in the model is broken; it just isn't designed for the kind of demand it now carries. The pain points below are not the fault of any one team — they are consequences of <em>how</em> the model is built.
Each tier carries some version of every one of them.</p><div class="cs-cards"><div class="cs-card"><div class="cs-label" style="font-size:.62rem">01</div><h3>Shift Gravity</h3><ul><li>Three follow-the-sun shifts every 24 hours</li><li>Context is summarised, not replayed, at every boundary</li><li>Onshore-offshore split hides inefficiency in plain sight</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">02</div><h3>Quality Variance</h3><ul><li>Varies by shift, by tenure, by individual</li><li>SLA breaches cluster on weekends and holidays</li><li>The customer never sees an even service level</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">03</div><h3>Attrition Tax</h3><ul><li>20–35% annual attrition at L1, lower but real higher up</li><li>4–8 weeks of training before a new hire is productive</li><li>Tribal knowledge leaves with every resignation</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">04</div><h3>Inelastic Capacity</h3><ul><li>A 2× ticket spike cannot be staffed in &lt; 24 hours</li><li>Patch-Tuesday outages routinely take SLA hits</li><li>Surge headcount is a fiction; surge overtime is what actually happens</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">05</div><h3>Audit Friction</h3><ul><li>Reconstructing what happened on a ticket takes weeks</li><li>Chat logs, ticket history, and admin-tool actions live in different systems</li><li>G500 internal-audit reviews drag on for months</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">06</div><h3>Language &amp; KB Silos</h3><ul><li>Multilingual coverage means hiring native speakers locally</li><li>Knowledge bases drift between tiers, regions, and locales</li><li>New runbooks are rarely peer-reviewed for quality</li></ul></div></div><h3 style="margin-top:36px">The Headcount Paradox</h3><p>Stack the team by volume and headcount and the same shape 
appears every time: a bottom-heavy pyramid in which the tier carrying the most repetition is also the tier carrying the most people, the most attrition, and the weakest unit economics. The next two parts argue that this is exactly the part the next wave of agentic AI can credibly absorb.</p><div class="cs-visual"><svg viewBox="0 0 820 320" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="320" fill="#0d0d1a"/><text x="410" y="24" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2.5" font-family="Georgia,serif">WHERE THE PEOPLE AND THE TICKETS SIT</text><line x1="120" y1="34" x2="700" y2="34" stroke="#c9a227" stroke-width=".4" stroke-opacity=".3"/><pre><code> &lt;polygon points="410,46 468,80 352,80" fill="#1a2744" stroke="#c9a227" stroke-width=".8"/&gt;
&lt;text x="410" y="68" text-anchor="middle" font-size="8" fill="#e8cc6a" font-family="sans-serif"&gt;SDM · ~2 FTE · 10% touches&lt;/text&gt;
&lt;polygon points="352,82 468,82 500,118 320,118" fill="#162038" stroke="#c9a227" stroke-width=".8"/&gt;
&lt;text x="410" y="105" text-anchor="middle" font-size="8.5" fill="#e8cc6a" font-family="sans-serif"&gt;ARCHITECT · ~3 FTE · 5% touches&lt;/text&gt;
&lt;polygon points="320,120 500,120 536,158 284,158" fill="#111830" stroke="#c9a227" stroke-width=".8"/&gt;
&lt;text x="410" y="144" text-anchor="middle" font-size="9.5" fill="#e8cc6a" font-family="sans-serif"&gt;L3 · ~12 FTE · 4% tickets&lt;/text&gt;
&lt;polygon points="284,160 536,160 576,202 244,202" fill="#0e1528" stroke="#c9a227" stroke-width=".8"/&gt;
&lt;text x="410" y="187" text-anchor="middle" font-size="10.5" fill="#e8cc6a" font-family="sans-serif"&gt;L2 · ~40 FTE · 15% tickets&lt;/text&gt;
&lt;polygon points="244,204 576,204 630,262 190,262" fill="#0a1020" stroke="#c9a227" stroke-width="1.1"/&gt;
&lt;text x="410" y="234" text-anchor="middle" font-size="12.5" fill="#c9a227" font-weight="bold" font-family="sans-serif"&gt;L1 · ~80 FTE · 3 SHIFTS · 80% tickets&lt;/text&gt;
&lt;text x="410" y="250" text-anchor="middle" font-size="8.5" fill="#9a8a6a" font-family="sans-serif"&gt;most volume · most repetition · highest attrition&lt;/text&gt;
&lt;rect x="60" y="282" width="700" height="30" rx="4" fill="#111122" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;text x="410" y="302" text-anchor="middle" font-size="9.5" fill="#c9a227" font-family="Georgia,serif" font-style="italic"&gt;Headcount and decisions are concentrated exactly where they are easiest to automate.&lt;/text&gt;
&lt;/svg&gt;</code></pre></div></div><div class="cs-section"><div class="cs-label">Part 4 · State of the Art</div><h2>Where Agentic AI Sits in 2026</h2><figure class="cs-figure"><video autoplay muted loop playsinline preload="metadata" poster="/images/cortex-swarm/capability-ramp-temple.png"><source src="/images/cortex-swarm/capability-ramp-temple.mp4" type="video/mp4"><img src="/images/cortex-swarm/capability-ramp-temple.png" alt="Six-step stone temple staircase ascending into a starlit sky"/></video><figcaption>Six years, six steps — each layer of capability built on the one beneath it.</figcaption></figure><p>Two years ago, "AI agent" meant a chatbot with a system prompt. In 2026 it means something specific: a model that can decompose a goal, call tools to gather evidence, maintain state across turns, and stop when the work is done. The shift is real, and it is what makes the rest of this piece possible.</p><h3 style="margin-top:24px">A Six-Year Capability Ramp</h3><p>Each year since 2020 has unlocked a layer of capability that wasn't there the year before. The cumulative effect is what now allows specialised agents to do specialist work, not just general chat.</p><div class="cs-visual"><svg viewBox="0 0 820 280" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="280" fill="#0d0d1a"/><text x="410" y="22" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2" font-family="Georgia,serif">LLM CAPABILITY · CUMULATIVE LAYERS · 2020 → 2026</text><line x1="50" y1="240" x2="780" y2="240" stroke="#c9a227" stroke-width=".7" stroke-opacity=".55"/><pre><code> &lt;!-- Year ticks --&gt;
&lt;g font-size="8" fill="#9a8a6a" font-family="sans-serif" text-anchor="middle"&gt;
&lt;text x="110" y="258"&gt;2020&lt;/text&gt;
&lt;text x="220" y="258"&gt;2021&lt;/text&gt;
&lt;text x="330" y="258"&gt;2022&lt;/text&gt;
&lt;text x="440" y="258"&gt;2023&lt;/text&gt;
&lt;text x="550" y="258"&gt;2024&lt;/text&gt;
&lt;text x="660" y="258"&gt;2025&lt;/text&gt;
&lt;text x="770" y="258"&gt;2026&lt;/text&gt;
&lt;/g&gt;
&lt;!-- Capability stacking bars --&gt;
&lt;g&gt;
&lt;rect x="92" y="200" width="36" height="40" fill="#1a2744" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;rect x="202" y="180" width="36" height="60" fill="#1a2744" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;rect x="312" y="158" width="36" height="82" fill="#1a2744" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;rect x="422" y="128" width="36" height="112" fill="#1a2744" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;rect x="532" y="98" width="36" height="142" fill="#1a2744" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;rect x="642" y="74" width="36" height="166" fill="#1a2744" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;rect x="752" y="58" width="22" height="182" fill="#e8cc6a" stroke="#c9a227" stroke-width=".4"/&gt;
&lt;/g&gt;
&lt;!-- Layer labels --&gt;
&lt;text x="50" y="80" font-size="8.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Multi-agent&lt;/text&gt;
&lt;text x="50" y="92" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;orchestration&lt;/text&gt;
&lt;text x="50" y="115" font-size="8.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Local inference&lt;/text&gt;
&lt;text x="50" y="127" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;30B–70B on laptop&lt;/text&gt;
&lt;text x="50" y="148" font-size="8.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Reliable tool use&lt;/text&gt;
&lt;text x="50" y="160" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;structured output&lt;/text&gt;
&lt;text x="50" y="178" font-size="8.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Reasoning &amp;amp; RAG&lt;/text&gt;
&lt;text x="50" y="190" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;retrieval-augmented&lt;/text&gt;
&lt;text x="50" y="208" font-size="8.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Instruction tuning&lt;/text&gt;
&lt;text x="50" y="220" font-size="7" fill="#9a8a6a" font-family="sans-serif"&gt;long context&lt;/text&gt;
&lt;text x="50" y="234" font-size="8.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Pre-training&lt;/text&gt;
&lt;text x="410" y="48" text-anchor="middle" font-size="9" fill="#c9a227" font-family="Georgia,serif" font-style="italic"&gt;2026: every layer needed for autonomous tier-aligned agents is in place.&lt;/text&gt;
&lt;/svg&gt;</code></pre></div><h3 style="margin-top:36px">What "Agentic" Actually Means</h3><p>Four ingredients distinguish an agent from a chatbot. Every component is now boring engineering — no novel research required.</p><div class="cs-visual"><svg viewBox="0 0 820 280" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="280" fill="#0d0d1a"/><text x="410" y="22" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2" font-family="Georgia,serif">ANATOMY OF AN AGENT</text><pre><code> &lt;!-- Central agent core --&gt;
&lt;circle cx="410" cy="150" r="58" fill="#1a2744" stroke="#e8cc6a" stroke-width="1.2"/&gt;
&lt;text x="410" y="148" text-anchor="middle" font-size="12" fill="#e8cc6a" font-family="Georgia,serif"&gt;Agent&lt;/text&gt;
&lt;text x="410" y="164" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;persona + system prompt&lt;/text&gt;
&lt;!-- Four ingredients --&gt;
&lt;g transform="translate(140,90)"&gt;
&lt;rect width="150" height="60" rx="4" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="75" y="20" text-anchor="middle" font-size="9.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Reasoning&lt;/text&gt;
&lt;text x="75" y="34" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Decompose · plan · reflect&lt;/text&gt;
&lt;text x="75" y="46" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Why a step, not just what&lt;/text&gt;
&lt;/g&gt;
&lt;g transform="translate(530,90)"&gt;
&lt;rect width="150" height="60" rx="4" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="75" y="20" text-anchor="middle" font-size="9.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Tool Use&lt;/text&gt;
&lt;text x="75" y="34" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Structured calls · typed inputs&lt;/text&gt;
&lt;text x="75" y="46" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Real-world side effects&lt;/text&gt;
&lt;/g&gt;
&lt;g transform="translate(140,170)"&gt;
&lt;rect width="150" height="60" rx="4" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="75" y="20" text-anchor="middle" font-size="9.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Memory&lt;/text&gt;
&lt;text x="75" y="34" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Conversation · KB retrieval&lt;/text&gt;
&lt;text x="75" y="46" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;State across turns&lt;/text&gt;
&lt;/g&gt;
&lt;g transform="translate(530,170)"&gt;
&lt;rect width="150" height="60" rx="4" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="75" y="20" text-anchor="middle" font-size="9.5" fill="#e8cc6a" font-family="Georgia,serif"&gt;Authority&lt;/text&gt;
&lt;text x="75" y="34" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Bounded scope · approvals&lt;/text&gt;
&lt;text x="75" y="46" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;When to act vs escalate&lt;/text&gt;
&lt;/g&gt;
&lt;!-- Connecting lines --&gt;
&lt;line x1="290" y1="120" x2="358" y2="135" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;line x1="530" y1="120" x2="462" y2="135" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;line x1="290" y1="200" x2="358" y2="170" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;line x1="530" y1="200" x2="462" y2="170" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/&gt;
&lt;text x="410" y="266" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif" font-style="italic"&gt;An agent is a system prompt, a model, a memory, a tool registry, and rules about when to call which.&lt;/text&gt;
&lt;/svg&gt;</code></pre></div><h3 style="margin-top:36px">Three Independent Shifts That Made This Credible</h3><p>Each on its own is interesting. Together, they remove the standard objections G500 buyers raise to bringing AI inside the perimeter.</p><div class="cs-visual"><svg viewBox="0 0 820 240" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="240" fill="#0d0d1a"/><circle cx="185" cy="120" r="80" fill="#0a1428" stroke="#c9a227" stroke-width=".8" stroke-opacity=".7"/><circle cx="410" cy="120" r="80" fill="#0a1428" stroke="#c9a227" stroke-width=".8" stroke-opacity=".7"/><circle cx="635" cy="120" r="80" fill="#0a1428" stroke="#c9a227" stroke-width=".8" stroke-opacity=".7"/><text x="185" y="98" text-anchor="middle" font-size="22" fill="#c9a227" font-family="Georgia,serif">30B–70B</text><text x="185" y="116" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">open-weight params</text><text x="185" y="138" text-anchor="middle" font-size="8.5" fill="#e8dcc8" font-family="sans-serif">Tool-calling · reasoning</text><text x="185" y="152" text-anchor="middle" font-size="8.5" fill="#e8dcc8" font-family="sans-serif">Multi-step planning</text><text x="185" y="178" text-anchor="middle" font-size="8" fill="#c9a227" font-family="Georgia,serif" font-style="italic">Model Capability</text><pre><code> &lt;text x="410" y="98" text-anchor="middle" font-size="22" fill="#c9a227" font-family="Georgia,serif"&gt;Local&lt;/text&gt;
&lt;text x="410" y="116" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;on-device inference&lt;/text&gt;
&lt;text x="410" y="138" text-anchor="middle" font-size="8.5" fill="#e8dcc8" font-family="sans-serif"&gt;No data leaves building&lt;/text&gt;
&lt;text x="410" y="152" text-anchor="middle" font-size="8.5" fill="#e8dcc8" font-family="sans-serif"&gt;Compliance objection gone&lt;/text&gt;
&lt;text x="410" y="178" text-anchor="middle" font-size="8" fill="#c9a227" font-family="Georgia,serif" font-style="italic"&gt;Inference Economics&lt;/text&gt;
&lt;text x="635" y="98" text-anchor="middle" font-size="22" fill="#c9a227" font-family="Georgia,serif"&gt;100%&lt;/text&gt;
&lt;text x="635" y="116" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;safety-pass target&lt;/text&gt;
&lt;text x="635" y="138" text-anchor="middle" font-size="8.5" fill="#e8dcc8" font-family="sans-serif"&gt;Semantic grading&lt;/text&gt;
&lt;text x="635" y="152" text-anchor="middle" font-size="8.5" fill="#e8dcc8" font-family="sans-serif"&gt;Adversarial eval cases&lt;/text&gt;
&lt;text x="635" y="178" text-anchor="middle" font-size="8" fill="#c9a227" font-family="Georgia,serif" font-style="italic"&gt;Eval Discipline&lt;/text&gt;
&lt;/svg&gt;</code></pre></div><h3 style="margin-top:36px">A Capability Map by Tier — Ready Now vs Emerging</h3><p>This is not a roadmap; it is an honest read of what's possible today. "Ready" means the prompt, tool set, KB and eval pattern are known. "Emerging" means the approach is understood but still being measured.</p><div class="cs-cards"><div class="cs-card"><div class="cs-label" style="font-size:.62rem;color:var(--gold-light)">L1 · Ready Now</div><h3>Front-Line Desk</h3><ul><li>Identity verification · password / MFA / unlock</li><li>Catalog software install + approval</li><li>Outlook / Teams diagnostics</li><li>Printer · peripheral pairing</li><li>KB retrieval + grounded response</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem;color:var(--gold-light)">L2 · Ready</div><h3>App Specialist</h3><ul><li>App log structured analysis</li><li>Service health diagnostic</li><li>Intune compliance + push</li><li>Mailbox + M365 admin actions</li><li>Hypothesis-test workflows</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem;color:var(--gold-light)">L3 · Emerging</div><h3>Senior Engineer</h3><ul><li>Infrastructure root-cause</li><li>AD attribute engineering</li><li>Kusto / log-analytics</li><li>Change request authoring</li><li>Emergency change application</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem;color:var(--gold-light)">Architect · Emerging</div><h3>Design Authority</h3><ul><li>Change review against ADR library</li><li>P1 RCA authoring</li><li>Pattern-vs-one-off classification</li><li>Capacity-review triggers</li><li>Design-impact assessment</li></ul></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem;color:var(--gold-light)">SDM · Emerging</div><h3>Delivery Manager</h3><ul><li>Customer comms drafting</li><li>SLA dashboard + breach alerts</li><li>War-room convene flow</li><li>Weekly briefing generation</li><li>Status update 
cadence</li></ul></div></div></div><div class="cs-section"><div class="cs-label">Part 5 · Trajectory</div><h2>Where Agentic AI Is Heading — Next 2–3 Years</h2><p>The trajectory of the last six years points in a clear direction: from a single model answering a single question, to <em>swarms</em> of specialised agents collaborating on bounded problems under an orchestrator they cannot themselves modify.</p><div class="cs-visual"><svg viewBox="0 0 820 260" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="260" fill="#0d0d1a"/><text x="410" y="22" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2.5" font-family="Georgia,serif">FROM SINGLE MODEL TO SPECIALISED SWARM</text><line x1="50" y1="180" x2="780" y2="180" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5"/><defs><marker id="evarr" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto"><path d="M0,0 L6,3 L0,6 Z" fill="#c9a227" fill-opacity=".7"/></marker></defs><pre><code> &lt;!-- Stage 1: LLM --&gt;
&lt;g transform="translate(90,80)"&gt;
&lt;circle r="32" fill="#0a1428" stroke="#c9a227" stroke-width=".8"/&gt;
&lt;text y="4" text-anchor="middle" font-size="11" fill="#e8cc6a" font-family="Georgia,serif"&gt;LLM&lt;/text&gt;
&lt;text y="56" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700"&gt;2020–2023&lt;/text&gt;
&lt;text y="72" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Single model&lt;/text&gt;
&lt;text y="84" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Chat completion&lt;/text&gt;
&lt;/g&gt;
&lt;line x1="130" y1="110" x2="195" y2="110" stroke="#c9a227" stroke-width=".7" marker-end="url(#evarr)"/&gt;
&lt;!-- Stage 2: Agent --&gt;
&lt;g transform="translate(245,80)"&gt;
&lt;circle r="34" fill="#0a1428" stroke="#c9a227" stroke-width=".9"/&gt;
&lt;text y="4" text-anchor="middle" font-size="11" fill="#e8cc6a" font-family="Georgia,serif"&gt;Agent&lt;/text&gt;
&lt;text y="56" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700"&gt;2023–2024&lt;/text&gt;
&lt;text y="72" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;+ Tools + Memory&lt;/text&gt;
&lt;text y="84" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Single task loop&lt;/text&gt;
&lt;/g&gt;
&lt;line x1="287" y1="110" x2="352" y2="110" stroke="#c9a227" stroke-width=".7" marker-end="url(#evarr)"/&gt;
&lt;!-- Stage 3: Multi-agent --&gt;
&lt;g transform="translate(404,80)"&gt;
&lt;circle r="36" fill="#0a1428" stroke="#c9a227" stroke-width="1"/&gt;
&lt;text y="0" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif"&gt;Multi-&lt;/text&gt;
&lt;text y="12" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif"&gt;agent&lt;/text&gt;
&lt;text y="56" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700"&gt;2024–2026&lt;/text&gt;
&lt;text y="72" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Planner + executors&lt;/text&gt;
&lt;text y="84" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Verifier loops&lt;/text&gt;
&lt;/g&gt;
&lt;line x1="448" y1="110" x2="513" y2="110" stroke="#c9a227" stroke-width=".7" marker-end="url(#evarr)"/&gt;
&lt;!-- Stage 4: Specialised Swarm --&gt;
&lt;g transform="translate(573,80)"&gt;
&lt;circle r="38" fill="#1a2744" stroke="#e8cc6a" stroke-width="1.4"/&gt;
&lt;text y="0" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif"&gt;Specialised&lt;/text&gt;
&lt;text y="12" text-anchor="middle" font-size="10" fill="#e8cc6a" font-family="Georgia,serif"&gt;Swarm&lt;/text&gt;
&lt;text y="56" text-anchor="middle" font-size="8" fill="#e8cc6a" font-family="sans-serif" font-weight="700"&gt;2026 →&lt;/text&gt;
&lt;text y="72" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Role-aligned agents&lt;/text&gt;
&lt;text y="84" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Bounded authority&lt;/text&gt;
&lt;/g&gt;
&lt;line x1="619" y1="110" x2="694" y2="110" stroke="#c9a227" stroke-width=".7" marker-end="url(#evarr)"/&gt;
&lt;!-- Stage 5: Enterprise --&gt;
&lt;g transform="translate(740,80)"&gt;
&lt;circle r="32" fill="#0a1428" stroke="#c9a227" stroke-width=".6" stroke-dasharray="3,2"/&gt;
&lt;text y="4" text-anchor="middle" font-size="9" fill="#9a8a6a" font-family="Georgia,serif"&gt;Enterprise&lt;/text&gt;
&lt;text y="56" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif" font-weight="700"&gt;2027 →&lt;/text&gt;
&lt;text y="72" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Cross-function&lt;/text&gt;
&lt;text y="84" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif"&gt;Swarm-of-swarms&lt;/text&gt;
&lt;/g&gt;
&lt;text x="410" y="218" text-anchor="middle" font-size="9" fill="#c9a227" font-family="Georgia,serif" font-style="italic"&gt;Cortex Swarm sits at the fourth stage. The fifth is what this pattern unlocks once it's proven inside one function.&lt;/text&gt;
&lt;/svg&gt;</code></pre></div><h3 style="margin-top:36px">Three Bets About the Next 24 Months</h3><div class="cs-cards"><div class="cs-card"><div class="cs-label" style="font-size:.62rem">Bet 01</div><h3>Specialisation beats generalisation</h3><p style="font-size:.85rem;color:var(--cream);margin:0">One large general agent doing everything is brittle. Five small role-aligned agents — each with its own persona, tools, and KB — are more reliable, more debuggable, and more auditable.</p></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">Bet 02</div><h3>The orchestrator is the operating system</h3><p style="font-size:.85rem;color:var(--cream);margin:0">Frameworks like LangChain / LangGraph / AutoGen are scaffolding. Production systems will hold their durable value in a bespoke orchestrator that owns state, audit, identity, and policy — not in any library it depends on.</p></div><div class="cs-card"><div class="cs-label" style="font-size:.62rem">Bet 03</div><h3>Compliance becomes the product</h3><p style="font-size:.85rem;color:var(--cream);margin:0">The agent that wins inside a regulated enterprise is not the one with the highest benchmark score — it's the one whose every action a G500 internal-audit team can replay in one click.</p></div></div></div><div class="cs-section"><div class="cs-label">Part 6 · The Cortex Swarm</div><h2>How a Five-Agent Swarm Replaces the Five-Tier Team</h2><figure class="cs-figure"><img src="/images/cortex-swarm/mirror-reflection.png" alt="A Japanese silver mirror reflecting five human silhouettes as five glowing agent crests — the mirror principle made literal"><figcaption>The mirror principle — the agent inherits the role the human already plays.</figcaption></figure><p>The idea is structurally simple. <em>Don't reinvent the team. Mirror it.</em> One agent per tier. Distinct persona, tools, knowledge base, and authority.
The org chart <strong style="color:var(--gold)">is</strong> the system architecture.</p><div class="cs-pull">If a human L2 specialist refuses to apply a config change without log evidence, the L2 agent does the same. If the Architect won't approve a change without checking the ADR library, neither will the Architect agent.</div><h3 style="margin-top:24px">Human Team ↔ Agent Swarm — Tier for Tier</h3><div class="cs-visual"><svg viewBox="0 0 820 360" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="360" fill="#0d0d1a"/><text x="200" y="28" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2" font-family="Georgia,serif">HUMAN TEAM</text><text x="620" y="28" text-anchor="middle" font-size="10" fill="#c9a227" letter-spacing="2" font-family="Georgia,serif">CORTEX SWARM</text><line x1="50" y1="38" x2="350" y2="38" stroke="#c9a227" stroke-width=".4" stroke-opacity=".4"/><line x1="470" y1="38" x2="770" y2="38" stroke="#c9a227" stroke-width=".4" stroke-opacity=".4"/><defs><marker id="mirarr" markerWidth="6" markerHeight="6" refX="3" refY="3" orient="auto"><path d="M0,0 L6,3 L0,6 Z" fill="#c9a227"/></marker></defs><pre><code> &lt;!-- Pairs --&gt;
&lt;g font-family="Georgia,serif"&gt;
&lt;!-- SDM --&gt;
&lt;rect x="80" y="56" width="240" height="44" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="200" y="74" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;SDM · 2 FTE&lt;/text&gt;
&lt;text x="200" y="88" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Customer voice · SLA · war room&lt;/text&gt;
&lt;line x1="320" y1="78" x2="470" y2="78" stroke="#c9a227" stroke-width=".6" stroke-opacity=".5" marker-end="url(#mirarr)"/&gt;
&lt;rect x="480" y="56" width="240" height="44" rx="3" fill="#1a2744" stroke="#e8cc6a" stroke-width=".9"/&gt;
&lt;text x="600" y="74" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;SDM Agent&lt;/text&gt;
&lt;text x="600" y="88" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Same persona · same tools · always on&lt;/text&gt;
&lt;!-- Architect --&gt;
&lt;rect x="80" y="112" width="240" height="44" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="200" y="130" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;Architect · 3 FTE&lt;/text&gt;
&lt;text x="200" y="144" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Design authority · ADRs · RCAs&lt;/text&gt;
&lt;line x1="320" y1="134" x2="470" y2="134" stroke="#c9a227" stroke-width=".6" stroke-opacity=".5" marker-end="url(#mirarr)"/&gt;
&lt;rect x="480" y="112" width="240" height="44" rx="3" fill="#1a2744" stroke="#e8cc6a" stroke-width=".9"/&gt;
&lt;text x="600" y="130" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;Architect Agent&lt;/text&gt;
&lt;text x="600" y="144" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Pattern-vs-one-off · ADR-gated&lt;/text&gt;
&lt;!-- L3 --&gt;
&lt;rect x="80" y="168" width="240" height="44" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="200" y="186" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;L3 · 12 FTE&lt;/text&gt;
&lt;text x="200" y="200" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Root cause · changes · vendor escalation&lt;/text&gt;
&lt;line x1="320" y1="190" x2="470" y2="190" stroke="#c9a227" stroke-width=".6" stroke-opacity=".5" marker-end="url(#mirarr)"/&gt;
&lt;rect x="480" y="168" width="240" height="44" rx="3" fill="#1a2744" stroke="#e8cc6a" stroke-width=".9"/&gt;
&lt;text x="600" y="186" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;L3 Agent&lt;/text&gt;
&lt;text x="600" y="200" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Same diagnostics · proposes change request&lt;/text&gt;
&lt;!-- L2 --&gt;
&lt;rect x="80" y="224" width="240" height="44" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/&gt;
&lt;text x="200" y="242" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;L2 · 40 FTE&lt;/text&gt;
&lt;text x="200" y="256" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;App / platform · log-driven diagnosis&lt;/text&gt;
&lt;line x1="320" y1="246" x2="470" y2="246" stroke="#c9a227" stroke-width=".6" stroke-opacity=".5" marker-end="url(#mirarr)"/&gt;
&lt;rect x="480" y="224" width="240" height="44" rx="3" fill="#1a2744" stroke="#e8cc6a" stroke-width=".9"/&gt;
&lt;text x="600" y="242" text-anchor="middle" font-size="10" fill="#e8cc6a"&gt;L2 Agent&lt;/text&gt;
&lt;text x="600" y="256" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Reads logs · hypothesis-test · bounces back&lt;/text&gt;
&lt;!-- L1 --&gt;
&lt;rect x="80" y="280" width="240" height="44" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width="1.1"/&gt;
&lt;text x="200" y="298" text-anchor="middle" font-size="10" fill="#c9a227" font-weight="700"&gt;L1 · 80 FTE · 3 shifts&lt;/text&gt;
&lt;text x="200" y="312" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Front-line desk · password · catalog · printer&lt;/text&gt;
&lt;line x1="320" y1="302" x2="470" y2="302" stroke="#c9a227" stroke-width="1" marker-end="url(#mirarr)"/&gt;
&lt;rect x="480" y="280" width="240" height="44" rx="3" fill="#1a2744" stroke="#e8cc6a" stroke-width="1.4"/&gt;
&lt;text x="600" y="298" text-anchor="middle" font-size="10" fill="#e8cc6a" font-weight="700"&gt;L1 Agent · always on&lt;/text&gt;
&lt;text x="600" y="312" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif"&gt;Same persona · 25 tools · escalates on rule&lt;/text&gt;
&lt;/g&gt;
&lt;text x="410" y="348" text-anchor="middle" font-size="8.5" fill="#9a8a6a" font-family="sans-serif" font-style="italic"&gt;Each pair shares the same persona, scope and authority. The agent inherits the role; the orchestrator inherits the rules.&lt;/text&gt;
&lt;/svg&gt;</code></pre></div><h3 style="margin-top:36px">One Orchestrator. Five Agents. Clean Seams.</h3><p>The architecture is deliberately conservative. The model proposes; the orchestrator and adapters decide whether the proposal executes. Bespoke ~250-line orchestrator. No LangChain. No LangGraph. No Autogen.</p><div class="cs-visual"><svg viewBox="0 0 820 340" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="340" fill="#0d0d1a"/><text x="14" y="68" font-size="7" fill="#9a8a6a" font-family="sans-serif">FRONTEND</text><text x="14" y="126" font-size="7" fill="#9a8a6a" font-family="sans-serif">ORCHESTRATOR</text><text x="14" y="182" font-size="7" fill="#9a8a6a" font-family="sans-serif">AGENTS</text><text x="14" y="238" font-size="7" fill="#9a8a6a" font-family="sans-serif">TOOLS</text><text x="14" y="288" font-size="7" fill="#9a8a6a" font-family="sans-serif">ADAPTERS</text><text x="14" y="328" font-size="7" fill="#9a8a6a" font-family="sans-serif">INFERENCE</text><rect x="74" y="46" width="730" height="36" rx="3" fill="#1a2744" stroke="#c9a227" stroke-width=".7"/><text x="439" y="61" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="600">Web Portal</text><text x="439" y="74" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Next.js 15 · Tailwind · shadcn/ui · Chat · Ticket Explorer · SDM Dashboard</text><rect x="74" y="94" width="730" height="36" rx="3" fill="#162038" stroke="#e8cc6a" stroke-width="1"/><text x="439" y="109" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="700">Orchestrator (bespoke ~250 lines)</text><text x="439" y="122" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">State machine · Identity gate · Audit emission · Output validation · Budget enforcement</text><rect x="74" y="142" width="730" height="36" rx="3" fill="#0f1a30" stroke="#c9a227" stroke-width=".7"/><text x="439" y="157" text-anchor="middle" 
font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="600">Agent Registry</text><text x="439" y="170" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">l1 · l2 · l3 · architect · sdm · planner/verifier (P3) · voice/i18n (P4-5)</text><rect x="74" y="192" width="730" height="36" rx="3" fill="#0a1020" stroke="#c9a227" stroke-width=".7"/><text x="439" y="207" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="600">Tool Registry (~60 tools)</text><text x="439" y="220" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Identity · Access · Productivity · Endpoint · Observability · Change · Communications</text><rect x="74" y="242" width="730" height="36" rx="3" fill="#080f1c" stroke="#c9a227" stroke-width=".7"/><text x="439" y="257" text-anchor="middle" font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="600">Adapter Layer</text><text x="439" y="270" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">ITSM · Identity · Endpoint · Catalog · Approval · Obs · Change · Comms · Voice (mock | real)</text><rect x="74" y="290" width="348" height="36" rx="3" fill="#080810" stroke="#c9a227" stroke-width=".5" stroke-opacity=".5" stroke-dasharray="4,3"/><text x="248" y="305" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif">Sidecars: ChromaDB+BGE-M3 · SHA-256 audit chain · State machine · Budgets</text><text x="248" y="319" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">SQLite (P1) → Postgres (P2+) · per-role namespace</text><rect x="434" y="290" width="370" height="36" rx="3" fill="#080810" stroke="#e8cc6a" stroke-width=".8"/><text x="619" y="305" text-anchor="middle" font-size="8.5" fill="#e8cc6a" font-family="sans-serif">LM Studio @ 127.0.0.1:1234/v1</text><text x="619" y="319" text-anchor="middle" font-size="7.5" fill="#9a8a6a" font-family="sans-serif">Qwen3-Coder-Next (P1) · Larger 
models for L3/Architect in P2-3</text><line x1="439" y1="82" x2="439" y2="94" stroke="#c9a227" stroke-width=".5" stroke-opacity=".4"/><line x1="439" y1="130" x2="439" y2="142" stroke="#c9a227" stroke-width=".5" stroke-opacity=".4"/><line x1="439" y1="178" x2="439" y2="192" stroke="#c9a227" stroke-width=".5" stroke-opacity=".4"/><line x1="439" y1="228" x2="439" y2="242" stroke="#c9a227" stroke-width=".5" stroke-opacity=".4"/><line x1="439" y1="278" x2="439" y2="290" stroke="#c9a227" stroke-width=".5" stroke-opacity=".4"/></svg></div><h3 style="margin-top:36px">The Five Agents — in Detail</h3><div style="margin-top:20px"><div class="cs-agent-row"><div class="cs-agent-label"><div class="tier">L1 · Phase 1</div><div class="role">Service Desk</div><div class="replace">replaces ~80 FTE · 3 shifts · 80% tickets</div></div><div class="cs-agent-body"><div class="desc">"Polite. Fast. Scripted. Resolves common categories. Never speculates on root cause."</div><ul><li>Password / MFA / unlock · Catalog install · Printer · Outlook/Teams basics</li><li>25 tools without further approval; mutating tools pass the identity gate</li><li>Escalates after 2 failed attempts; infra/outage/security always escalate</li><li>Target SLA: TTA &lt; 5 min · TTR &lt; 30 min</li></ul><div class="cs-tag-row"><span class="cs-tag">Identity</span><span class="cs-tag">Access</span><span class="cs-tag">Productivity</span><span class="cs-tag">Endpoint</span></div></div></div><div class="cs-agent-row"><div class="cs-agent-label"><div class="tier">L2 · Phase 2</div><div class="role">App Specialist</div><div class="replace">replaces ~40 FTE · 2 shifts · 15% tickets</div></div><div class="cs-agent-body"><div class="desc">"Specialist. Methodical. Demands diagnostics before acting. 
Reads logs."</div><ul><li>App log analysis · Intune compliance + push · Exchange diagnostics</li><li>Hypothesis-test workflows — escalates only after evidence is gathered</li><li>Rejects wrongly-escalated tickets back to L1 with structured rationale</li><li>Target SLA: TTA &lt; 30 min · TTR &lt; 4 hrs</li></ul><div class="cs-tag-row"><span class="cs-tag">M365</span><span class="cs-tag">Intune</span><span class="cs-tag">Exchange</span><span class="cs-tag">ServiceNow</span></div></div></div><div class="cs-agent-row"><div class="cs-agent-label"><div class="tier">L3 · Phase 2</div><div class="role">Senior Engineer</div><div class="replace">replaces ~12 FTE · day + oncall · 4% tickets</div></div><div class="cs-agent-body"><div class="desc">"Root-cause focused. Drives change requests. Consults Architect for design-impacting fixes."</div><ul><li>AD/Identity engineering · Kusto/log-analytics · Emergency change application</li><li>Authors RCA stubs — Architect finalises</li><li>Consults Architect for design-impacting changes</li><li>Target SLA: TTA &lt; 2 hrs · TTR &lt; 24 hrs</li></ul><div class="cs-tag-row"><span class="cs-tag">AD Engineering</span><span class="cs-tag">Infrastructure</span><span class="cs-tag">Change Mgmt</span><span class="cs-tag">Security</span></div></div></div><div class="cs-agent-row"><div class="cs-agent-label"><div class="tier">Architect · Phase 3</div><div class="role">Design Authority</div><div class="replace">replaces ~3 FTE · 5% touches</div></div><div class="cs-agent-body"><div class="desc">"Asks 'is this a one-off or a pattern?' 
Advises and gates — not a doer."</div><ul><li>Approves/rejects change requests against the ADR library</li><li>Authors P1 RCAs · pattern-vs-one-off classification · capacity reviews</li><li>May veto L3-proposed changes</li><li>Target SLA: Design review ≤ 2 biz days · RCA ≤ 5 days</li></ul><div class="cs-tag-row"><span class="cs-tag">ADR Library</span><span class="cs-tag">Change Review</span><span class="cs-tag">RCA</span><span class="cs-tag">Capacity</span></div></div></div><div class="cs-agent-row"><div class="cs-agent-label"><div class="tier">SDM · Phase 3</div><div class="role">Delivery Manager</div><div class="replace">replaces ~2 FTE · 10% touches</div></div><div class="cs-agent-body"><div class="desc">"Customer-facing. SLA-driven. Calm under pressure. Owns comms cadence."</div><ul><li>SLA dashboard · customer comms · war-room convening · weekly briefing</li><li>Cannot make technical decisions — can demand them by SLA-bound deadline</li><li>Can convene a war room without prior approval</li><li>Target SLA: Customer comms &lt; 1 hr (P1) · weekly briefing every Friday</li></ul><div class="cs-tag-row"><span class="cs-tag">SLA Governance</span><span class="cs-tag">Customer Comms</span><span class="cs-tag">War Room</span><span class="cs-tag">Weekly Briefing</span></div></div></div></div><h3 style="margin-top:44px">Five Collaboration Patterns</h3><p>Five patterns cover every interaction between the five agents. Escalation is just the first.</p><div class="cs-cards"><div class="cs-card"><h3>Escalation</h3><p style="font-size:.82rem;color:var(--cream);margin:0">L1 → L2 → L3 with filtered conversation history per tier scope. The higher tier sees only what's relevant to its authority.</p></div><div class="cs-card"><h3>Bounce-back</h3><p style="font-size:.82rem;color:var(--cream);margin:0">L2 or L3 → L1 with structured de-escalation rationale. 
Cycle detection prevents loops.</p></div><div class="cs-card"><h3>Design gate</h3><p style="font-size:.82rem;color:var(--cream);margin:0">L3 → Architect via <code style="font-size:.75rem;color:var(--gold-light)">propose_change_request</code>; the result returns via <code style="font-size:.75rem;color:var(--gold-light)">respond_to_l3</code>.</p></div><div class="cs-card"><h3>War-room</h3><p style="font-size:.82rem;color:var(--cream);margin:0">SDM forces L3 + Architect to sync on a single ticket thread under stricter time budgets.</p></div><div class="cs-card"><h3>Internal escalate</h3><p style="font-size:.82rem;color:var(--cream);margin:0">SDM → SP leadership (humans) on systemic breach patterns — humans re-enter the loop only when patterns demand it.</p></div></div><h3 style="margin-top:44px">A Representative Ticket — End to End</h3><p><em style="color:var(--muted)">Scenario: "Team shared mailbox stopped syncing."</em> The ticket touches all five agents under a formal 13-state machine.</p><div class="cs-visual"><svg viewBox="0 0 820 400" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="400" fill="#0d0d1a"/><text x="410" y="20" text-anchor="middle" font-size="7" fill="#9a8a6a" letter-spacing="1.5" font-family="sans-serif">STATE MACHINE TRACE</text><text x="410" y="35" text-anchor="middle" font-size="6" fill="#c9a227" font-family="monospace">NEW → TRIAGED → DIAGNOSING → AWAITING_L2 → DIAGNOSING(L2) → AWAITING_L3 → DIAGNOSING(L3) → AWAITING_APPROVAL → ACTING(L3) → VERIFYING → RESOLVED</text><line x1="40" y1="43" x2="780" y2="43" stroke="#c9a227" stroke-width=".3" stroke-opacity=".35"/><g transform="translate(0,52)"><circle cx="40" cy="20" r="16" fill="#0a1020" stroke="#9a8a6a" stroke-width=".7"/><text x="40" y="25" text-anchor="middle" font-size="11" fill="#9a8a6a">👤</text><text x="78" y="16" font-size="9" fill="#9a8a6a" font-family="sans-serif" font-weight="600">EMPLOYEE</text><text x="78" y="29" font-size="8" fill="#e8dcc8" 
font-family="sans-serif">Files chat: "Team shared mailbox stopped syncing"</text><line x1="40" y1="36" x2="40" y2="52" stroke="#c9a227" stroke-width=".5" stroke-opacity=".3"/></g><g transform="translate(0,108)"><circle cx="40" cy="20" r="16" fill="#0a1020" stroke="#c9a227" stroke-width=".8"/><text x="40" y="25" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700">L1</text><text x="78" y="16" font-size="9" fill="#c9a227" font-family="sans-serif" font-weight="600">DESK AGENT</text><text x="78" y="29" font-size="8" fill="#e8dcc8" font-family="sans-serif">Triages → Productivity/Mailbox. Beyond L1 SOPs. Calls <tspan font-family="monospace" font-size="7.5">escalate_to_l2()</tspan></text><line x1="40" y1="36" x2="40" y2="52" stroke="#c9a227" stroke-width=".5" stroke-opacity=".3"/></g><g transform="translate(0,164)"><circle cx="40" cy="20" r="16" fill="#0a1020" stroke="#c9a227" stroke-width=".8"/><text x="40" y="25" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700">L2</text><text x="78" y="16" font-size="9" fill="#c9a227" font-family="sans-serif" font-weight="600">APP SPECIALIST</text><text x="78" y="29" font-size="8" fill="#e8dcc8" font-family="sans-serif">Inspects Exchange logs. Tenant-level drift detected. Calls <tspan font-family="monospace" font-size="7.5">escalate_to_l3()</tspan></text><line x1="40" y1="36" x2="40" y2="52" stroke="#c9a227" stroke-width=".5" stroke-opacity=".3"/></g><g transform="translate(0,220)"><circle cx="40" cy="20" r="16" fill="#1a2744" stroke="#e8cc6a" stroke-width=".9"/><text x="40" y="25" text-anchor="middle" font-size="8" fill="#e8cc6a" font-family="sans-serif" font-weight="700">L3</text><text x="78" y="16" font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="600">SENIOR ENGINEER</text><text x="78" y="29" font-size="8" fill="#e8dcc8" font-family="sans-serif">Queries AD + tenant. CA policy change = root cause. 
Calls <tspan font-family="monospace" font-size="7.5">propose_change_request()</tspan></text><line x1="40" y1="36" x2="40" y2="52" stroke="#c9a227" stroke-width=".5" stroke-opacity=".3"/></g><g transform="translate(0,276)"><circle cx="40" cy="20" r="16" fill="#0a1020" stroke="#c9a227" stroke-width=".8"/><text x="40" y="25" text-anchor="middle" font-size="7" fill="#c9a227" font-family="sans-serif" font-weight="700">ARCH</text><text x="78" y="16" font-size="9" fill="#c9a227" font-family="sans-serif" font-weight="600">ARCHITECT</text><text x="78" y="29" font-size="8" fill="#e8dcc8" font-family="sans-serif">Reviews ADR library. Approves: <tspan font-family="monospace" font-size="7.5">approve_change(conditions=["monitor_7d"])</tspan></text><line x1="40" y1="36" x2="40" y2="52" stroke="#c9a227" stroke-width=".5" stroke-opacity=".3"/></g><g transform="translate(0,332)"><circle cx="40" cy="20" r="16" fill="#1a2744" stroke="#e8cc6a" stroke-width=".9"/><text x="40" y="25" text-anchor="middle" font-size="8" fill="#e8cc6a" font-family="sans-serif" font-weight="700">L3</text><text x="78" y="16" font-size="9" fill="#e8cc6a" font-family="sans-serif" font-weight="600">RESOLVED — L3 APPLIES · L1 CLOSES · SDM LOGS</text><text x="78" y="29" font-size="8" fill="#e8dcc8" font-family="sans-serif">Change applied. Sync verified. Employee notified. 
Hash-chained audit emitted for full lifecycle.</text></g><rect x="490" y="100" width="302" height="188" rx="4" fill="#111122" stroke="#c9a227" stroke-width=".6" stroke-opacity=".4"/><text x="641" y="120" text-anchor="middle" font-size="8.5" fill="#c9a227" letter-spacing="1.5" font-family="sans-serif">WHAT THIS PROVES</text><line x1="508" y1="127" x2="775" y2="127" stroke="#c9a227" stroke-width=".3" stroke-opacity=".3"/><text x="510" y="146" font-size="8" fill="#e8dcc8" font-family="sans-serif">→ Every tier did exactly its scope — no overreach</text><text x="510" y="163" font-size="8" fill="#e8dcc8" font-family="sans-serif">→ Architect gated without executing</text><text x="510" y="180" font-size="8" fill="#e8dcc8" font-family="sans-serif">→ SDM never intervened (SLA healthy)</text><text x="510" y="197" font-size="8" fill="#e8dcc8" font-family="sans-serif">→ State machine refused backward transitions</text><text x="510" y="214" font-size="8" fill="#e8dcc8" font-family="sans-serif">→ G500 audit team replays from single ticket ID</text><text x="510" y="231" font-size="8" fill="#e8dcc8" font-family="sans-serif">→ Every tool call, KB hit &amp; state change logged</text></svg></div><h3 style="margin-top:44px">A P1 Incident — 5 Minutes, No Human Paged</h3><p>Region-wide Exchange Online failure. 
The swarm runs the entire war-room cycle while humans are asleep.</p><div class="cs-visual"><svg viewBox="0 0 820 175" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="175" fill="#0d0d1a"/><line x1="36" y1="72" x2="784" y2="72" stroke="#c9a227" stroke-width=".9" stroke-opacity=".45"/><g transform="translate(62,72)"><circle cy="0" r="9" fill="#c9a227"/><text y="-17" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700">T+0</text><text y="20" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">DETECTOR</text><text y="34" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">14 failures</text><text y="46" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">in 60 secs</text><text y="58" text-anchor="middle" font-size="7" fill="#c9a227" font-family="sans-serif">P1 created</text></g><g transform="translate(198,72)"><circle cy="0" r="9" fill="#c9a227"/><text y="-17" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700">T+15s</text><text y="20" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">SDM</text><text y="34" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">War room</text><text y="46" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">L3+Arch sync</text><text y="58" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">comms/15min</text></g><g transform="translate(366,72)"><circle cy="0" r="10" fill="#1a2744" stroke="#e8cc6a" stroke-width="1.3"/><text y="-17" text-anchor="middle" font-size="8" fill="#e8cc6a" font-family="sans-serif" font-weight="700">T+45s</text><text y="20" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">L3</text><text y="34" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">CA policy</text><text y="46" text-anchor="middle" font-size="7" 
fill="#e8dcc8" font-family="sans-serif">identified</text><text y="58" text-anchor="middle" font-size="7" fill="#c9a227" font-family="sans-serif">propose rollback</text></g><g transform="translate(530,72)"><circle cy="0" r="9" fill="#c9a227"/><text y="-17" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700">T+90s</text><text y="20" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">ARCHITECT</text><text y="34" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">ADR review</text><text y="46" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">Approves</text><text y="58" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">5% sample</text></g><g transform="translate(700,72)"><circle cy="0" r="10" fill="#c9a227" stroke="#e8cc6a" stroke-width="1.3"/><text y="-17" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif" font-weight="700">T+5min</text><text y="20" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">RESOLVED</text><text y="34" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">Applied &amp;</text><text y="46" text-anchor="middle" font-size="7" fill="#e8dcc8" font-family="sans-serif">verified</text><text y="58" text-anchor="middle" font-size="7" fill="#c9a227" font-family="sans-serif">Comms sent</text></g><text x="410" y="152" text-anchor="middle" font-size="8" fill="#9a8a6a" font-family="sans-serif" font-style="italic">Humans receive the post-incident notification — they were never paged at 3am.</text></svg></div></div><div class="cs-section"><div class="cs-label">Part 7 · Efficiency</div><h2>What Changes — and Why It Compounds</h2><p>None of the gains below are individually surprising. 
The point is that all seven happen simultaneously, on the same architecture, against the same SLAs that already exist on the contract.</p><table class="cs-table"><thead><tr><th>Dimension</th><th>Today (Human Team)</th><th>With Cortex Swarm</th></tr></thead><tbody><tr><td class="k">Coverage</td><td class="v">3 follow-the-sun shifts with formal handoffs</td><td class="v">Single team, always on. No handoffs. No context lost between geographies.</td></tr><tr><td class="k">Capacity</td><td class="v">Inelastic. 2× spike can't be staffed in &lt; 24h</td><td class="v">Elastic by definition. 10× spike = 10× concurrent agent instances.</td></tr><tr><td class="k">Quality</td><td class="v">Varies by shift, tenure, individual</td><td class="v">Even. Same prompt, same KB, same eval bar everywhere.</td></tr><tr><td class="k">Speed</td><td class="v">P1 page → human awake → diagnose: 30–60 min</td><td class="v">P1 detect → diagnose → propose → approve → apply: 5 min.</td></tr><tr><td class="k">Languages</td><td class="v">Each language needs native-speaker hiring</td><td class="v">EN/HI/DE on the same agent. Locale bundle is a config file.</td></tr><tr><td class="k">Audit</td><td class="v">Reconstruct from chat + ticket + admin logs over weeks</td><td class="v">One ticket ID → hash-chained replay of every tool call and state change.</td></tr><tr><td class="k">Knowledge</td><td class="v">Walks out with every resignation. 4–8 weeks to rebuild.</td><td class="v">Persistent. KB is versioned. Prompts and tools are reviewed in PR.</td></tr></tbody></table><h3 style="margin-top:36px">The Compounding Effect</h3><p>Speed compounds with capacity (faster resolution × elastic concurrency = shorter incident windows). Audit compounds with quality (every action is replayable, so every regression has a fix that ships in a single PR rather than a memo). 
Language compounds with coverage (one swarm serves every region in every supported language at the same SLA).</p><div class="cs-pull">The org chart was never the bottleneck. The bottleneck was the shift roster underneath it.</div></div><div class="cs-section"><div class="cs-label">Part 8 · New Challenges</div><h2>The Honest List — What Could Go Wrong</h2><p>This operating model is not free. It trades a familiar set of operational problems for a less familiar set of socio-technical ones. Each one below is real; each one has a specific mitigation already wired into Cortex Swarm.</p><table class="cs-table"><thead><tr><th>Challenge</th><th>What It Looks Like</th><th>Mitigation</th></tr></thead><tbody><tr><td class="k">Trust gap</td><td class="v">End users distrust "the bot". CIOs distrust autonomy.</td><td class="w">Phase-gated rollout. Human approval on every mutating tool until evals plateau. Audit replay UI for skeptical buyers.</td></tr><tr><td class="k">Audit scrutiny</td><td class="v">Regulators want to know "what did the model decide and why?"</td><td class="w">Hash-chained SHA-256 audit. Every tool call, KB chunk, and state change is replayable in one click via <code style="font-size:.78rem;color:var(--gold-light)">/audit/verify</code>.</td></tr><tr><td class="k">Prompt injection</td><td class="v">Adversarial input tries to make the model exfiltrate or escalate.</td><td class="w">Six layered defences: directive precedence, &lt;data&gt;-tagged inputs, identity gate, Pydantic validation, rate limits, output filter.</td></tr><tr><td class="k">Model drift</td><td class="v">A new model version regresses on something nobody noticed.</td><td class="w">52 eval cases as a CI gate. Semantic grading on resolution, tool-correctness, grounding, citation. <code style="font-size:.78rem;color:var(--gold-light)">make evals</code> fails the build when a case fails.</td></tr><tr><td class="k">Long-tail edge cases</td><td class="v">Rare scenarios the agent has never seen.</td><td class="w">Escalation patterns. 
Out-of-scope intent triggers <code style="font-size:.78rem;color:var(--gold-light)">escalate_out_of_scope()</code> to the next tier or to a human.</td></tr><tr><td class="k">Change management</td><td class="v">Humans whose roles dissolve. SP commercial models built on FTE counts.</td><td class="w">The hardest one. Honest position: agents replace tier responsibilities, not the function. Senior staff move to swarm operations, KB curation, and eval authoring.</td></tr><tr><td class="k">Cost &amp; scaling</td><td class="v">Inference cost grows with ticket volume.</td><td class="w">Local inference (LM Studio + Qwen3-Coder-Next) keeps marginal cost low. Bigger models reserved for L3/Architect on rare paths.</td></tr><tr><td class="k">Failure modes</td><td class="v">"Right tool, hallucinated reasoning." Confident wrong answers.</td><td class="w">Formal 13-state machine. Mutating tools callable only in <code style="font-size:.78rem;color:var(--gold-light)">ACTING</code>. <code style="font-size:.78rem;color:var(--gold-light)">RESOLVED</code> reachable only from <code style="font-size:.78rem;color:var(--gold-light)">VERIFYING</code>.</td></tr></tbody></table><h3 style="margin-top:36px">Six Security Defences in Depth</h3><div class="cs-timeline"><div class="cs-tl-item"><div class="cs-tl-dot">01</div><div class="cs-tl-content"><h3>System-prompt directive</h3><p>SECURITY DIRECTIVE at highest precedence — cannot be overridden by user input.</p></div></div><div class="cs-tl-item"><div class="cs-tl-dot">02</div><div class="cs-tl-content"><h3>Input/output normalisation</h3><p>&lt;user_message&gt; and &lt;tool_output&gt; treated as DATA, never instructions.</p></div></div><div class="cs-tl-item"><div class="cs-tl-dot">03</div><div class="cs-tl-content"><h3>Identity-gate decorator</h3><p>Orchestrator blocks cross-user mutations even if the model forgets to check.</p></div></div><div class="cs-tl-item"><div class="cs-tl-dot">04</div><div class="cs-tl-content"><h3>Tool input 
validation</h3><p>Pydantic + allow-lists. Schema mismatches rejected before execution.</p></div></div><div class="cs-tl-item"><div class="cs-tl-dot">05</div><div class="cs-tl-content"><h3>Per-session rate limit</h3><p>Max 30 tool calls / 10 min · 50 chat turns / hour — prevents runaway loops.</p></div></div><div class="cs-tl-item"><div class="cs-tl-dot">06</div><div class="cs-tl-content"><h3>Output filter</h3><p>Leaked-secret patterns stripped before reaching the UI layer.</p></div></div></div><figure class="cs-figure"><img src="/images/cortex-swarm/swarm-with-guardrails.png" alt="Nihonga-style illustration of five agent crests bound by gold cord — guardrails as composition"><figcaption>Authority is bounded by composition: the model proposes, the orchestrator and adapters dispose.</figcaption></figure></div><div class="cs-section"><div class="cs-label">Part 9 · Delivery</div><h2>Six Phases Over ~12 Months</h2><p>Phase 1 ships in 5 weeks. Each later phase is a drop-in module gated on an explicit trigger — not an arbitrary date.</p><div class="cs-phase-grid"><div class="cs-phase-card"><div class="cs-phase-num">Phase 01 · Now · ~50h</div><div class="cs-phase-title">Foundation</div><ul><li>L1 agent · EN chat</li><li>Stub L2/L3 · mock backends</li><li>State machine + 52 evals</li><li>Hash-chained audit</li></ul><div class="cs-phase-trigger">Trigger: Building now</div></div><div class="cs-phase-card"><div class="cs-phase-num">Phase 02 · ~35h</div><div class="cs-phase-title">Specialist Tiers</div><ul><li>Full L2 + L3 agents</li><li>Real ServiceNow / AD / Intune</li><li>Real OIDC SSO</li><li>KB pruning via evals</li></ul><div class="cs-phase-trigger">Trigger: First pilot signed</div></div><div class="cs-phase-card"><div class="cs-phase-num">Phase 03 · ~45h</div><div class="cs-phase-title">Planning Architecture</div><ul><li>Architect + SDM agents</li><li>Planner / Executor / Verifier</li><li>SDM dashboard + war-room</li></ul><div class="cs-phase-trigger">Trigger: Eval data 
shows plateau</div></div><div class="cs-phase-card"><div class="cs-phase-num">Phase 04 · ~25h</div><div class="cs-phase-title">Multilingual</div><ul><li>HI + DE locale bundles</li><li>Per-locale KB ingest</li><li>KB provenance + trust tiers</li></ul><div class="cs-phase-trigger">Trigger: External KB integrated</div></div><div class="cs-phase-card"><div class="cs-phase-num">Phase 05 · ~20h</div><div class="cs-phase-title">Voice</div><ul><li>Sarvam (EN/HI) · Azure (DE)</li><li>ElevenLabs alternate</li><li>Browser mic + playback</li></ul><div class="cs-phase-trigger">Trigger: Demand-driven</div></div><div class="cs-phase-card"><div class="cs-phase-num">Phase 06 · ~12h</div><div class="cs-phase-title">Multi-Tenant</div><ul><li>Tenant ID propagation</li><li>Branded shells per tenant</li><li>Per-tenant SLA dashboards</li></ul><div class="cs-phase-trigger">Trigger: Second client signed</div></div></div><div class="cs-visual" style="margin-top:24px"><svg viewBox="0 0 820 90" xmlns="http://www.w3.org/2000/svg"><rect width="820" height="90" fill="#0d0d1a"/><text x="410" y="18" text-anchor="middle" font-size="7" fill="#9a8a6a" letter-spacing="2" font-family="sans-serif">~12 MONTHS · TRIGGER-GATED · EACH PHASE IS A DROP-IN MODULE</text><line x1="36" y1="46" x2="784" y2="46" stroke="#c9a227" stroke-width=".8" stroke-opacity=".4"/><circle cx="90" cy="46" r="13" fill="#c9a227"/><text x="90" y="50" text-anchor="middle" font-size="8" fill="#080810" font-weight="700" font-family="sans-serif">P1</text><text x="90" y="72" text-anchor="middle" font-size="7" fill="#c9a227" font-family="sans-serif">5 weeks</text><circle cx="218" cy="46" r="10" fill="#0a1428" stroke="#c9a227" stroke-width="1"/><text x="218" y="50" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif">P2</text><text x="218" y="72" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">pilot+</text><circle cx="374" cy="46" r="10" fill="#0a1428" stroke="#c9a227" 
stroke-width="1"/><text x="374" y="50" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif">P3</text><text x="374" y="72" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">plateau</text><circle cx="522" cy="46" r="10" fill="#0a1428" stroke="#c9a227" stroke-width="1"/><text x="522" y="50" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif">P4</text><text x="522" y="72" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">multilingual</text><circle cx="644" cy="46" r="10" fill="#0a1428" stroke="#c9a227" stroke-width="1"/><text x="644" y="50" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif">P5</text><text x="644" y="72" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">voice</text><circle cx="750" cy="46" r="10" fill="#0a1428" stroke="#c9a227" stroke-width="1"/><text x="750" y="50" text-anchor="middle" font-size="8" fill="#c9a227" font-family="sans-serif">P6</text><text x="750" y="72" text-anchor="middle" font-size="7" fill="#9a8a6a" font-family="sans-serif">multi-tenant</text></svg></div></div><div class="cs-section" style="border-bottom:none"><div class="cs-label">Closing</div><h2>The Org Chart Was Always the Answer</h2><p>Every senior engineer who has ever worked in IT operations recognises the five-tier shape. It's the shape that emerges every time, in every geography, in every sector, because the responsibilities map cleanly onto the kinds of decisions a team needs to make. That same shape is exactly what makes a swarm legible: each agent does what the role already does, no more, no less, and the rest of the building knows where to send its ticket.</p><p>The interesting work over the next year is not adding more agents. 
It is sharpening the seams between them — better identity gates, better KB provenance, faster audit replays, tighter eval cases — so that what runs in production is a system anyone in IT operations can trust without having to also be an AI specialist.</p><div class="cs-statbar"><div class="cs-stat"><span class="num">5</span><span class="lbl">Agents<br>L1 · L2 · L3 · Arch · SDM</span></div><div class="cs-stat"><span class="num">~137</span><span class="lbl">FTE mirrored<br>across all tiers</span></div><div class="cs-stat"><span class="num">24×7</span><span class="lbl">Single team<br>no shift roster</span></div><div class="cs-stat"><span class="num">1-click</span><span class="lbl">Audit verify<br>any ticket</span></div></div><div class="cs-ornament" style="margin-top:40px">· · ·</div><p style="text-align:center;margin-top:20px;color:var(--muted);font-size:.85rem"><strong style="color:var(--gold);font-family:'Noto Serif',serif">Ajay Walia</strong> · Digital Workplace Operations · May 2026<br><a href="https://linkedin.com/in/ajaywalia" style="color:var(--gold)">LinkedIn: /in/ajaywalia</a></p></div></div>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/cortex-swarm/cortex-swarm-hero-v2.png" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>agentic-ai</category><category>digital-workplace</category><category>itil</category><category>automation</category><category>llm</category><category>Projects</category></item><item><title>From FTEs to Agentic Workflows — Reshaping Infrastructure Outsourcing</title><link>https://curiousbit.netlify.app/from-ftes-to-agentic-workflows/</link><guid isPermaLink="true">https://curiousbit.netlify.app/from-ftes-to-agentic-workflows/</guid><pubDate>Wed, 20 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;A six-minute companion to the &lt;a href="https://curiousbit.netlify.app/spreading-fire-uber-mcp-indian-it-outsourcing/"&gt;long-form analysis&lt;/a&gt;. Read this if you have ten minutes and a calendar full of other meetings. Read the long-form if you have an hour and a strategy decision to make.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/UberMCP/executive-brief-hero.svg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>A six-minute companion to the <a href="/spreading-fire-uber-mcp-indian-it-outsourcing/">long-form analysis</a>. Read this if you have ten minutes and a calendar full of other meetings. Read the long-form if you have an hour and a strategy decision to make.</p><hr><h2 id="the-30-second-version">The 30-Second Version</h2><p>Uber&rsquo;s engineering team has published material describing a working internal platform that, by their own numbers, supports around sixty thousand weekly AI tasks flowing through a single governed gateway, with around ten thousand internal services exposed as agent-callable tools. 
The architectural pattern — a Model Context Protocol-centred fabric with a tool registry, central governance and auto-converted tool wrappers — is reproducible. The protocol is open.</p><p>Indian IT outsourcing — TCS, Infosys, Wipro, HCLTech, LTIMindtree, Cognizant, Tech Mahindra, Mphasis, Coforge — is the natural beneficiary of the same architecture applied to client estates rather than a single internal estate. The decisive question is not whether to build the platform. It is whether the firms that build it have the commercial nerve to convert the technical capability into a different deal shape. The technology is the easy part. The conversation about who gets paid is harder.</p><p>This brief covers the three things that matter most for an executive reading on a Tuesday: where the industry sits on the maturity curve, why the incentive conflict is the real constraint, and what the rollout looks like on a real account.</p><hr><h2 id="where-the-industry-sits-on-the-curve">Where The Industry Sits On The Curve</h2><p><img src="/images/UberMCP/maturity-model.svg" alt="Six-stage automation maturity model — scripts, RPA, copilots, MCP gateway, multi-agent workflows, outcome-billed delivery"/><p>Most outsourcers operate large client accounts at Stage 1 or Stage 2 today. RPA bots and per-tool copilots are useful, but they are <em>point automation</em> — each new bot or copilot improves only the one workflow it was built for. They do not compound.</p><p>Stage 3 is where compounding starts. An MCP-centred gateway with a tool registry, central governance, and auto-converted tool wrappers means every new tool exposed makes every existing agent more capable. The investment shape changes from &ldquo;build a copilot&rdquo; to &ldquo;build a platform that produces copilots&rdquo;. This is the published state inside Uber.</p><p>Stages 4 and 5 are where the commercial model shifts to outcome-billed delivery. 
Nobody is there at scale yet.</p><p>The honest gap from Stage 2 to Stage 3 is two to four quarters of platform work plus an unglamorous data-hygiene investment — CMDB cleanup, knowledge-base curation, ticket-taxonomy normalisation. The gap from Stage 3 to Stage 5 is not technical at all. It is the incentive problem.</p><hr><h2 id="the-incentive-problem-is-the-real-constraint">The Incentive Problem Is The Real Constraint</h2><p>This is the part of the thesis that gets understated in most conversations, including the original long-form draft of this analysis.</p><p><strong>Outsourcing revenue is FTE-shaped.</strong> Contracts are priced as a stack of named roles multiplied by billing rates and headcounts. Reduce the headcount required to deliver a service and the immediate effect is a smaller bill to the customer. That is revenue cannibalisation, and it cuts against the platform business case.</p><p>The same effect applies inside the firm. A tower lead&rsquo;s standing is shaped partly by tower size. A practice head&rsquo;s position is shaped partly by billable headcount. The people who would have to champion the platform are also the people whose internal position it weakens. This is not a moral failing — it is rational behaviour given how the firm rewards them.</p><p>Four commercial paths through this, each failing in a different way.</p><p><strong>Productivity dividend.</strong> Keep the FTE count, use the gains to expand scope or absorb new work without growing the team. Lowest internal resistance. Often produces no measurable customer benefit and therefore no commercial differentiation.</p><p><strong>Margin retention.</strong> Hold the customer price, reduce delivery cost through automation, retain the margin. Works in shorter contracts. 
Creates trust issues at renewal — customers eventually see the AI talking points elsewhere and ask why they are paying yesterday&rsquo;s price for tomorrow&rsquo;s delivery model.</p><p><strong>Outcome repricing.</strong> Move new contracts to per-ticket, per-incident or per-user pricing. The structural prize. Requires the commercial nerve to renegotiate large contracts, and the conversation is uncomfortable.</p><p><strong>Productivity-billed sharing.</strong> Open-book pricing where automation gains are explicitly shared between vendor and customer. Mature, durable, hard to negotiate. Only viable in the small minority of high-trust accounts.</p><p>Customers also resist — not all, but a meaningful share — particularly in regulated industries, in accounts with hard data-sovereignty constraints, and where the customer&rsquo;s own IT organisation has a vested interest in keeping the outsourcer&rsquo;s headcount visible.</p><p>The firms that win will build the platform <strong>and</strong> have the commercial conversation. The firms that build the platform without the conversation will end up with a quietly impressive internal capability that does not show up in revenue.</p><hr><h2 id="what-this-looks-like-for-a-real-account">What This Looks Like For A Real Account</h2><p>A composite client. Eighty thousand endpoints across thirty countries. Around four hundred and fifty named FTEs across all towers (a typical FTE-to-endpoint ratio of roughly 1:175). Around ninety-five thousand monthly tickets, of which roughly forty-two thousand are repetitive L1 categories. Five-year contract with a single Indian outsourcer. Moderate CMDB hygiene. Well-maintained ServiceNow; inconsistent integration with endpoint management.</p><p><strong>Phase 1 (months 0–3).</strong> MCP gateway stood up. Three tools exposed: ServiceNow, Active Directory, Intune. One agent built: L1 password-reset and account-unlock automation. No-code builder wired. Baseline measured. 
Six gateway calls per minute by week ten.</p><p><strong>Phase 2 (months 3–9).</strong> Orchestrator agent and VDI-triage workflow live. SCCM, Splunk, vSphere, NetBackup exposed. Fifteen SDMs and twenty senior engineers trained on the builder. L1 auto-resolution stabilises around 22 percent. Headcount has not changed yet, but queue depth has dropped by around 30 percent and customer satisfaction has lifted by single-digit points.</p><p><strong>Phase 3 (months 9–18).</strong> Onboarding workflow live. Emergency patch workflow live. Asset and license reconciliation agents live. L1 auto-resolution stabilises around 32 percent. The customer asks why the monthly report has improved so visibly. The renewal conversation opens.</p><p><strong>Phase 4 (months 18–24).</strong> Renewal lands on a hybrid commercial model — partial FTE billing plus a per-ticket envelope. Around four percent of the original FTE base is redeployed to other accounts. Another eight percent shifts to higher-value advisory work. The remainder continues in delivery roles. The vendor&rsquo;s blended margin on the account improves by approximately two to four percentage points despite a lower headline contract value. Customer-side, cost-per-incident reduces by around 18 to 22 percent.</p><p><img src="/images/UberMCP/economic-scoreboard.svg" alt="Economic impact scoreboard — before vs after metrics in Contra HUD style"/><p>The metrics that actually decide success are not the headline ticket numbers. They are the <strong>renewal terms</strong> and the <strong>redeployment ratio</strong>. The renewal tells you whether the platform investment converted into a defensible commercial position. The redeployment ratio tells you whether the firm managed the internal political conversation well enough to keep its experienced people inside the business.</p><p>The bank is fictional. 
The numbers should be read as illustrative.</p><hr><h2 id="what-to-do-this-quarter">What To Do This Quarter</h2><p>Three concrete actions for the executive reading this with a calendar full of other meetings.</p><p><strong>One.</strong> Pick a single client account and a single process. Stand up an MCP-centred gateway against it. Two-quarter timeline. Real money — somewhere between $500K and $2M depending on starting position and tool count. Password reset and account unlock are the canonical starting workflows because the volume is enormous and the failure modes are well-understood.</p><p><strong>Two.</strong> Open the commercial conversation in parallel with the technical build. Do not wait for delivery. Renewal cycles are long, and the conversation about deal shape needs months of customer education before it can convert. The technical platform and the commercial repositioning are two sides of the same investment.</p><p><strong>Three.</strong> Pick the right SDM and the right tower lead to champion the rollout. The wrong choice will produce technical success and commercial failure. The internal politics of automation are as important as the architecture — possibly more.</p><hr><h2 id="where-to-go-deeper">Where To Go Deeper</h2><p>The full analysis — protocol-versus-platform distinction, competitive landscape, role taxonomy, three concrete workflows in detail, the risks landscape, the five-phase rollout path — is in the long-form companion piece: <a href="/spreading-fire-uber-mcp-indian-it-outsourcing/">From Service Desk to Agent Fabric</a>.</p><p>The work that distinguishes the firms that win from the firms that wait is the execution discipline applied to both halves of the problem. The platform is the easier half. The conversation about who gets paid for the productivity is the harder half. Both must happen at the same time.</p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/UberMCP/executive-brief-hero.svg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>digital-workplace</category><category>Knowledge Base</category></item><item><title>From Service Desk to Agent Fabric — MCP Playbook and the Future of Indian IT Outsourcing</title><link>https://curiousbit.netlify.app/spreading-fire-uber-mcp-indian-it-outsourcing/</link><guid isPermaLink="true">https://curiousbit.netlify.app/spreading-fire-uber-mcp-indian-it-outsourcing/</guid><pubDate>Wed, 20 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Uber&amp;rsquo;s engineering team has published material describing a meaningful shift in how they operate AI inside the company. They have moved Model Context Protocol (MCP) from a pilot into a standardised internal platform. Their published numbers — in the engineering blog and the accompanying conference talk linked at the foot of this post — describe in the order of sixty thousand weekly AI tasks flowing through a centralised gateway, with around ten thousand internal services exposed as agent-callable tools. Engineers and non-technical staff compose those tools into workflows through either a no-code builder or SDKs, with governance, redaction and access control inside the platform rather than in each agent.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/UberMCP/uber-mcp-hero.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Uber&rsquo;s engineering team has published material describing a meaningful shift in how they operate AI inside the company. They have moved Model Context Protocol (MCP) from a pilot into a standardised internal platform. 
Their published numbers — in the engineering blog and the accompanying conference talk linked at the foot of this post — describe in the order of sixty thousand weekly AI tasks flowing through a centralised gateway, with around ten thousand internal services exposed as agent-callable tools. Engineers and non-technical staff compose those tools into workflows through either a no-code builder or SDKs, with governance, redaction and access control inside the platform rather than in each agent.</p><p>That is interesting on its own technical merits. It becomes much more interesting if you sit in Bangalore, Pune, Hyderabad, Chennai or Noida and run a large infrastructure-delivery organisation for global clients. The pattern Uber has industrialised internally is, in shape, the same pattern that Indian IT outsourcing has been circling around for years without quite landing.</p><p>This article walks through what Uber appears to have built, separates that confirmed picture from my own extrapolation into the outsourcing world, places it in a maturity model, and addresses the political and commercial reasons why the rollout is harder than the architecture diagrams suggest.</p><p>If you only have six minutes, there is a <a href="/from-ftes-to-agentic-workflows/">companion executive briefing</a> covering the maturity model, the incentive conflict and the bank case study. This is the long form.</p><p><img src="/images/UberMCP/uber-mcp-hero.svg" alt="Spreading Fire — Uber&rsquo;s MCP playbook meets Indian IT outsourcing, in Contra NES palette"/><hr><h2 id="whats-verified-whats-mine">What&rsquo;s Verified, What&rsquo;s Mine</h2><p>Before going further: what Uber has publicly described — the gateway pattern, the auto-conversion of service endpoints into MCP tools, the no-code builder, the weekly-task and service-count numbers — is treated here as the <strong>confirmed picture</strong>, citable to their public material. 
The specific figures (60,000 weekly tasks, 10,000 services) come from Uber and should be verified against the linked sources rather than taken on my authority.</p><p>Everything that follows about Indian IT outsourcing — the role taxonomy, the workflow examples, the economic envelope, the rollout phases, the incentive analysis, the bank case study — is <strong>my extrapolation</strong>, informed by fifteen years inside the industry. It is not an Uber statement and should not be read as one.</p><p>That separation matters because the Uber numbers travel well — they get repeated in pitch decks and LinkedIn posts without the surrounding caveats — and outsourcing-specific claims get attributed back to Uber in ways that misrepresent what they actually said.</p><hr><h2 id="protocol-versus-platform">Protocol Versus Platform</h2><p>A precision point worth getting right up front, because technical readers will object if it is left fuzzy.</p><p><strong>MCP itself is a protocol.</strong> It specifies how an AI agent describes tools, calls them, and consumes their results. It is open, broadly adopted, and useful. The protocol on its own does not build a platform. HTTP is a protocol — the World Wide Web is the vast set of conventions, infrastructure and tooling built on top of it. MCP sits in the same relationship to a platform like the one Uber has built.</p><p>What Uber has built is not &ldquo;MCP&rdquo;. It is a <strong>gateway, registry, governance and builder platform that uses MCP as its interface contract</strong>. The protocol is the load-bearing standard. 
The platform around it is where most of the engineering investment sits.</p><p><img src="/images/UberMCP/mcp-protocol-vs-platform.svg" alt="MCP itself versus the platform built around it — protocol at the bottom, gateway and governance in the middle, agents and workflows on top, business value above that"/><p>For the rest of this article, when I say &ldquo;the Uber MCP stack&rdquo; or &ldquo;an MCP-centred fabric&rdquo;, I mean the protocol-and-platform combination. When I mean the protocol itself, I will say &ldquo;the MCP specification&rdquo;.</p><hr><h2 id="ubers-build-in-four-pieces">Uber&rsquo;s Build, In Four Pieces</h2><p>Before standardising MCP, Uber&rsquo;s AI initiatives looked like most large engineering organisations&rsquo; — multiple teams, multiple stacks, multiple ways of calling internal services, and every new agent requiring hand-wired integrations against an estate of around ten thousand services. Governance was worse: each integration had its own access pattern and (often non-existent) data-redaction story. Uber has been candid in their public material that this was the state they were trying to change.</p><p>What they built has four moving parts.</p><p>A <strong>gateway</strong> sits between every agent and every internal service. AI traffic terminates there. Authentication, authorisation, rate-limiting, logging and PII redaction happen at a single layer, so that sensitive-data handling is done once rather than once per agent.</p><p>A <strong>registry</strong> catalogues the tools an agent can call. The decisive design choice — and the source of most of the engineering value — is that Uber did not require every team to hand-write an MCP tool definition. Their tooling <strong>auto-converts existing service endpoints into MCP tools</strong>. New service ships, new tool appears, agents pick it up.</p><p>A <strong>scanning and access-control layer</strong> runs inside the gateway. Every tool registered goes through risk scanning. 
Every call is bound to a named identity with scoped permissions and is logged.</p><p>A <strong>builder layer</strong> sits on top: a no-code workflow composer for non-engineers, and SDKs for engineers. Both produce workflows that run through the same gateway against the same registry. There is no second-class tier of tools or governance.</p><p>Uber&rsquo;s material also describes a future direction — a <strong>skills repository</strong> that curates and shares verified multi-step workflows across the organisation, on the model of an internal app store.</p><p><img src="/images/UberMCP/mcp-stack-architecture.svg" alt="The MCP-centred platform: agents at the top calling through a governance bunker, which connects via tool-wrapper turrets to the underlying internal services"/><hr><h2 id="whats-actually-novel">What&rsquo;s Actually Novel</h2><p>Most of the individual components are not new. API gateways are decades old. Tool registries exist in many forms. Governance is well-trodden territory. What is distinctive is the <strong>combination at scale, with one open protocol as the contract, and with auto-generation of tools from existing service definitions</strong>.</p><p>Specifically: the auto-conversion of endpoints into MCP tools is the multiplier. Without it, exposing ten thousand services as agent-callable would be a years-long manual programme, and most organisations stop at a few hundred. With it, the marginal cost of bringing a new service into the agent surface area drops close to zero, and the platform compounds. <strong>Every new tool added makes every existing agent more capable.</strong> That compounding is the architectural insight most observers miss.</p><p>The single governance chokepoint is the second distinguishing characteristic. Many enterprises have built point AI integrations; few have built one place through which every AI call passes, logged, audited and scoped. 
That single chokepoint is what makes the platform credible to a public-company audit function.</p><hr><h2 id="what-it-bought-them">What It Bought Them</h2><p>The published numbers — around sixty thousand weekly AI tasks and ten thousand services exposed — are the headline. The qualitative outcomes Uber describes matter as much: governance moving from per-project negotiation to platform property, every new tool exposed making every existing agent more capable, and the population of people able to compose workflows expanding well beyond the AI engineering team.</p><p>Verify the methodology against the linked engineering blog and conference talk before quoting these figures further. The definitions of &ldquo;task&rdquo; and &ldquo;service&rdquo; matter, and pitch-deck repetition tends to lose the qualifiers.</p><hr><h2 id="why-mcp-and-not-an-existing-platform">Why MCP And Not An Existing Platform</h2><p>A reasonable question: why would an outsourcer build an MCP-centred fabric rather than adopt one of the existing platforms already in this space?</p><p><strong>ServiceNow&rsquo;s AI agents and Workflow Studio</strong> are strong when ServiceNow is the system of record, weaker when the client estate is multi-vendor.</p><p><strong>Microsoft Copilot Studio</strong> is excellent if the estate is heavily M365/Azure-centric. 
The trade-off is platform lock-in and a relatively closed extensibility model.</p><p><strong>Moveworks and Aisera</strong> are specialist L1-automation vendors — strong out-of-the-box for tickets and natural-language ITSM, less flexible beyond the service desk.</p><p><strong>PagerDuty AIOps</strong> targets incident response and on-call; not designed as an estate-wide fabric.</p><p><strong>UiPath, Automation Anywhere and Blue Prism</strong> are RPA platforms with AI features layered in; they work, particularly for legacy systems without APIs, but were not designed around the agent-orchestration model.</p><p><strong>Atlassian Rovo, Cortex, and OpenAI operator-style systems</strong> are credible newer entrants, worth tracking particularly when the underlying platform is already in use.</p><p>The case for an MCP-centred fabric is specifically protocol-openness and multi-tenant flexibility. An Indian outsourcer running infrastructure for thirty different clients does not want to bet on each client&rsquo;s chosen vendor stack. They want a fabric that can wrap any client&rsquo;s tool surface — ServiceNow at client A, BMC at client B, Cherwell at client C — into the same agent-callable shape. MCP is the only protocol-level standard that makes that portable.</p><p>The honest counterpoint: the MCP ecosystem is younger than any of the platforms above, and the per-tool engineering effort to build wrappers is real. 
The right answer for most outsourcers is probably <strong>both</strong> — use Copilot Studio or ServiceNow AI inside specific engagements where platform fit is strong, and stand up an MCP-centred fabric as the cross-client integration substrate that owns multi-tenant governance.</p><hr><h2 id="the-outsourcing-mirror">The Outsourcing Mirror</h2><p>Replace &ldquo;Uber&rsquo;s ten thousand internal services&rdquo; with <strong>&ldquo;a client&rsquo;s BigFix, SCCM, Intune, ServiceNow, Active Directory, Azure, vSphere, NetBackup, Splunk, CrowdStrike, Citrix DaaS and forty other tools&rdquo;</strong> and the structural picture is the same. Indian IT outsourcing companies run the infrastructure for a large share of the Global 2000. The same constellation of vendor tools, the same heavy ITIL process layer, the same delivery pyramid of engineers operating that process at scale.</p><p>A typical infrastructure engagement touches forty to one hundred discrete tools, most with an API and none currently agent-safe. ITIL workflows are structured, repeatable and multi-system — the exact shape an orchestration platform is designed to industrialise. A large outsourcing tower processes hundreds of thousands of tickets a month, of which a meaningful proportion is highly repetitive.</p><p>The genuinely difficult difference is multi-tenancy. Uber operates one estate. An outsourcer operates many. The platform has to be multi-tenant by design — strong client isolation, separate audit boundaries, per-client data sovereignty. 
That is materially harder than the Uber single-tenant build.</p><p><img src="/images/UberMCP/outsourcing-pyramid.svg" alt="Indian IT outsourcing — today&rsquo;s service desk pyramid (human-heavy at the base) vs tomorrow&rsquo;s augmented pyramid (agents handling routine work, humans elevated to higher-value tasks)"/><hr><h2 id="the-service-desk-becomes-a-routing-layer">The Service Desk Becomes A Routing Layer</h2><p>A working taxonomy of roles inside a typical infrastructure outsourcing tower, sorted by how aggressively the MCP-centred pattern can be applied.</p><p><img src="/images/UberMCP/role-automation-quadrant.svg" alt="Role automation quadrant — fully automated, agent-led with human approval, human-led with agent assist, human-only"/><p><strong>Fully automated</strong> covers password resets, account unlocks, drive and printer mappings, mailbox quota requests, software-catalogue requests, license assignment, standard patch deployment, backup-job verification, daily dashboards, and joiner-leaver provisioning. Structured inputs, well-defined steps, known failure modes. Agents run them; humans review the exception queue.</p><p><strong>Agent-led with human approval</strong> covers production change execution, privileged access grants, firewall rule additions, DNS changes — anything that would have gone through a CAB. The agent drafts the change record, attaches the technical justification, runs pre-change validation, and produces the back-out plan. A human approves with a click.</p><p><strong>Human-led with agent assist</strong> covers L2 and L3 incident analysis, root cause analysis, architecture review, capacity planning, vendor escalation, major-incident communications and service-review preparation. The agent gathers diagnostics, summarises logs, retrieves relevant knowledge-base material, and drafts a first version. 
The human applies judgment, two to four times faster.</p><p><strong>Human-only, at least for now</strong>, covers Sev-1 crisis leadership with customer C-suite involvement, contract negotiation, performance management, and the conversations that require reading what is not being said. The agent can prepare the brief; the conversation is yours.</p><p>The role that changes most is the <strong>L1 engineer</strong>. Most of the L1 queue today is repetitive triage; after MCP, most of that volume is auto-resolved before reaching a human. The L1 engineer ends up doing what L2 does today — the genuinely novel cases. The role population shrinks but the people in it operate at a higher level.</p><p>The <strong>patch-management engineer</strong> moves from running deployments to owning strategy. The <strong>asset and license manager</strong> shifts from reconciliation to vendor negotiation. The <strong>SDM</strong> stops stitching reports together and starts having strategic conversations. <strong>L3 specialists</strong> are the least affected in volume but the most augmented in capability — the same person designing a thirty-thousand-endpoint patching strategy, with a domain agent at their side that knows every Fixlet and every advisory.</p><hr><h2 id="three-workflows-in-detail">Three Workflows In Detail</h2><p>Three concrete workflows, each composed from agents calling registered tools through the gateway. 
Every system referenced is one Indian outsourcers already run for clients.</p><p><img src="/images/UberMCP/workflow-examples.svg" alt="Three example workflows drawn as Contra-style level maps — VDI session failure, new joiner onboarding, emergency patch deployment"/><h3 id="a--vdi-session-wont-connect">A — VDI session won&rsquo;t connect</h3><p>Today, a ticket of this shape sits in the L1 queue for fifteen to forty-five minutes before an engineer picks it up, asks for basic information the user has not provided, checks four or five systems by hand, and either resolves it or escalates.</p><p>With an MCP-centred fabric, the ticket arrives at an orchestrator agent that recognises the symptom class and dispatches in parallel — an Active Directory agent checks account status and recent authentication failures; an Intune agent checks device compliance and last-seen timestamp; a Citrix agent checks session-host availability and active maintenance windows. Within seconds the orchestrator has a structured diagnostic.</p><p>If the issue is a known pattern, the orchestrator resolves directly or routes the user to the right self-service path. If the diagnostic shows something unusual, the orchestrator escalates to L2 with the full brief pre-attached. The L2 engineer reads a one-paragraph summary and makes a decision, rather than starting from zero in four different tools.</p><h3 id="b--new-joiner-onboarding">B — New joiner onboarding</h3><p>HR fires a &ldquo;new hire confirmed&rdquo; event. An onboarding orchestrator reads the role profile and location. An identity agent creates the AD account and provisions the M365 mailbox. A licensing agent assigns the right SKUs. An endpoint agent triggers the laptop build. A collaboration agent adds the new joiner to relevant Teams and SharePoint sites. A VPN agent provisions remote access. A notification agent sends the welcome packet and notifies the manager.</p><p>Every step is logged. 
Every approval that requires a human — sensitive group memberships, elevated privileges, anything outside policy — is routed as a single one-click decision. What used to be a three-to-five-day, twelve-touchpoint process across three teams becomes a four-to-eight-hour automated workflow with a clean audit trail.</p><h3 id="c--emergency-patch-deployment">C — Emergency patch deployment</h3><p>A vendor publishes a critical advisory. A vulnerability agent parses it. A CMDB agent identifies affected assets. A risk agent calculates the blast radius. A change agent drafts the emergency CAB record — justification, test plan, back-out plan, customer notification — and routes it for one-click approval. A patch agent executes the pilot deployment. A monitoring agent watches for regressions. If there are none, broader rollout proceeds in stages. A reporting agent generates the customer report at each stage.</p><p>The patch-management engineer&rsquo;s role here is to set the strategy, define the pilot group, review the monitoring envelope, and approve the rollout cadence. That is one engineer, with around an engineer-hour in the loop and four to eight hours of wall-clock execution, on a workflow that previously consumed two to three engineers across one-and-a-half days.</p><hr><h2 id="from-fte-arbitrage-to-outcome-arbitrage">From FTE Arbitrage To Outcome Arbitrage</h2><p>The strategic case is more persuasive with numbers. The figures below are <strong>directional, my own estimates</strong> based on what I have seen in infrastructure delivery — not Uber numbers and not benchmarks.
Real client engagements will move these by a factor of two or three either way depending on tower maturity, tool quality and data hygiene.</p><p><strong>These are platform-success-case figures, not industry averages.</strong> Realistic deployments will hit the lower bound of the &ldquo;after&rdquo; column unless three conditions are met: clean CMDB hygiene, well-classified ticket taxonomy, and an MCP-centred fabric covering at least 60 percent of L1 ticket categories. The &ldquo;before&rdquo; column reflects mid-range offshore-billed contracts; US- and Europe-billed contracts run two to three times higher on the L1 cost line.</p><p><img src="/images/UberMCP/economic-scoreboard.svg" alt="Economic impact scoreboard — before vs after metrics in Contra HUD style"/><table><thead><tr><th>Metric</th><th>Today (typical)</th><th>With MCP-centred fabric</th></tr></thead><tbody><tr><td>Password reset, end-to-end</td><td>8–15 min</td><td>1–2 min</td></tr><tr><td>L1 ticket auto-resolution rate</td><td>5–15%</td><td>25–40%</td></tr><tr><td>MTTR, repetitive L1 incident</td><td>30–60 min</td><td>2–5 min</td></tr><tr><td>MTTR, L2 incident</td><td>90–240 min</td><td>60–180 min</td></tr><tr><td>New-joiner onboarding cycle</td><td>3–5 business days</td><td>4–12 hours</td></tr><tr><td>Emergency patch deployment</td><td>2–3 engineer-days</td><td>1 engineer-hour (4–8 hr wall-clock)</td></tr><tr><td>L1 ticket cost (fully-burdened, offshore)</td><td>$5–$25</td><td>$0.50–$1.50</td></tr><tr><td>Change-record drafting</td><td>30–60 min</td><td>5–10 min</td></tr><tr><td>SDM reporting overhead (per week)</td><td>6–10 hours</td><td>2–3 hours</td></tr><tr><td>License optimisation cycle</td><td>quarterly manual</td><td>continuous</td></tr></tbody></table><p>The L1 auto-resolution rate is the line worth interrogating most carefully. Public Moveworks and Aisera figures advertise 40–60 percent on flagship deployments, but the deployed-average across their customer base is closer to 25–35 percent. 
The 25–40 percent range above reflects that deployed-average reality, not the marketing number.</p><p>The commercial implication, if these numbers hold even directionally, is the structural shift from billed-FTE to billed-outcome. That shift has been talked about in this industry for a decade. The MCP-centred pattern is the first thing I have seen that makes it plausibly executable. Whether the industry actually executes it is a different question — and the most interesting one in this whole article.</p><hr><h2 id="the-incentive-problem">The Incentive Problem</h2><p>The hardest problem here is not technical. It is commercial and political, and the previous version of this article underplayed it.</p><p><strong>Outsourcing revenue is FTE-shaped.</strong> A typical infrastructure deal is priced as a stack of named roles — L1 engineers, L2 engineers, L3 specialists, SDMs, project managers, change managers — multiplied by a billing rate and a headcount, with volume commitments and tiered SLAs layered on top. When an outsourcer reduces the headcount required to deliver a given service, the immediate effect is a smaller bill to the customer. That is revenue cannibalisation, and it cuts directly against the platform business case.</p><p>The same is true at the individual delivery-team level. A tower lead&rsquo;s annual review is partly shaped by the size of the tower. A practice head&rsquo;s standing inside the firm is partly shaped by the practice&rsquo;s billable headcount. When automation reduces both, the people who would have to champion the platform are also the people whose internal position it weakens. This is not a moral failing on their part — it is rational behaviour given how the firm rewards them.</p><p>There are four paths through this, and they tend to fail in different ways.</p><p><strong>Path one — productivity dividend.</strong> Keep the FTE count and use the productivity gains to expand scope or accept new work without growing the team. 
This is the path of least internal resistance but the one most likely to produce no measurable customer benefit and therefore no commercial differentiation.</p><p><strong>Path two — margin retention.</strong> Hold the customer price flat, reduce the delivery cost through automation, retain the margin. This works in shorter contracts but creates trust issues at renewal — customers eventually see the AI talking points elsewhere and ask why they are paying yesterday&rsquo;s price for tomorrow&rsquo;s delivery model.</p><p><strong>Path three — outcome repricing.</strong> Move new contracts to per-ticket, per-incident or per-user pricing where automation directly drives margin. This is the structural prize but requires commercial nerve, and the renegotiation conversation is uncomfortable. Existing customers will ask why their renewal terms are changing; new customers will ask whether the model has been proven elsewhere first.</p><p><strong>Path four — productivity-billed sharing.</strong> Open-book pricing where automation gains are explicitly shared between vendor and customer. Mature, durable, hard to negotiate. Requires the kind of trust relationship that exists in a small minority of outsourcing accounts.</p><p>Customers will also resist. Not all customers, but a meaningful share — particularly in regulated industries where automated change execution is uncomfortable, where data-sovereignty constraints make agentic processing politically difficult, or where the customer&rsquo;s own IT organisation has a vested interest in keeping the outsourcer&rsquo;s headcount visible and known.</p><p>None of this is a reason not to build the platform. It is a reason to plan the commercial conversation as carefully as the technical one. The firms that succeed will be the ones that build the platform <strong>and</strong> have the commercial discipline to convert the technical capability into a different deal shape.
The firms that build the platform but cannot have the conversation will end up with a quietly impressive internal capability that does not show up in revenue.</p><hr><h2 id="where-this-can-go-wrong">Where This Can Go Wrong</h2><p>The first version of this article underplayed operational complexity. The honest list of failure modes.</p><p><img src="/images/UberMCP/risks-landscape.svg" alt="Risks landscape — automation blast radius, stale CMDB, privilege escalation, hallucinated remediation, audit exposure, over-automation, model drift, integration brittleness"/><p><strong>Automation blast radius.</strong> An agent that can change firewall rules across an estate can break that estate at machine speed. Mitigations: layered approvals, staged rollouts, kill switches, per-action blast-radius modelling.</p><p><strong>Incorrect remediations.</strong> The agent confidently executes a fix that is wrong for the specific environment. Mitigations: bias toward propose-not-execute for production actions, confidence scoring on every action, source attribution.</p><p><strong>Data quality dependency.</strong> The platform is only as good as the data it relies on. The CMDB problem is the most visible — agents need accurate asset inventory for identification, blast-radius calculation and change correlation, and most long-running outsourcing engagements have CMDB hygiene problems that have been tolerated for years. 
Beyond CMDB, the same applies to <strong>knowledge-base quality</strong> (agents grounded in stale or contradictory KB articles produce confidently wrong answers), <strong>ticket-taxonomy quality</strong> (poorly classified incident categories destroy the orchestrator&rsquo;s routing accuracy), <strong>inconsistent naming and tagging</strong> (the same server appearing as three different identities across tools), and the <strong>tribal-knowledge problem</strong> (the senior engineer who knows the one production server that must not be patched on Wednesdays because of a quirk that was never documented). Standing up the fabric forces these issues into daylight, which is uncomfortable in the short term and necessary in the long term.</p><p><strong>Privilege escalation.</strong> When scopes are too broad, a compromised agent gives an attacker lateral movement. Mitigations: minimum-privilege per workflow, separate identities per agent, secret rotation, prompt-injection defences.</p><p><strong>Audit, legal and data sovereignty.</strong> Cross-border inference is a real audit issue; contracts often constrain where prompts can travel. Mitigations: region-pinned gateway, sovereign model selection, redaction before any cross-border hop.</p><p><strong>AI-induced outage.</strong> A tool-wrapper regression, a model behaviour change, or a workflow loop can cascade. Mitigations: canary tool deploys, per-client model pinning, workflow tests in CI, clear rollback story.</p><p><strong>Over-automation.</strong> Removing humans from too many steps removes the implicit knowledge transfer that keeps L2 and L3 engineers sharp. The L1-to-L2 promotion path was always partly about pattern recognition built up over thousands of tickets.
Mitigations: rotate humans through agent-handled work for training, design agent output as a learning surface rather than a black box.</p><p><strong>Integration brittleness.</strong> Auto-generated tool wrappers are not always semantically correct — the wrapper might technically work but get the underlying business logic subtly wrong. Mitigations: human review of high-stakes wrappers, regression test suites at the tool level, a tiered trust model (draft → verified → skills-grade).</p><p>None of these is a reason not to build the platform. All are reasons to build it with the operational discipline that distinguishes a working enterprise platform from a clever demo.</p><hr><h2 id="a-maturity-model">A Maturity Model</h2><p>The pattern fits into a broader six-stage model that locates outsourcers against where the industry is heading.</p><p><img src="/images/UberMCP/maturity-model.svg" alt="Six-stage automation maturity model — scripts, RPA, copilots, MCP gateway, multi-agent workflows, outcome-billed delivery"/><table><thead><tr><th>Stage</th><th>Capability</th><th>Where it sits</th></tr></thead><tbody><tr><td>0</td><td>Scripts (PowerShell, Bash, ad-hoc)</td><td>Legacy operations</td></tr><tr><td>1</td><td>RPA (UiPath, Blue Prism — UI-driven)</td><td>Most outsourcers today</td></tr><tr><td>2</td><td>Copilots (Moveworks, Aisera, Copilot Studio — per-tool LLM helpers)</td><td>A growing share of leading-edge engagements</td></tr><tr><td>3</td><td>MCP gateway (single chokepoint, registry, governed tools)</td><td>Uber&rsquo;s published state</td></tr><tr><td>4</td><td>Multi-agent workflows (orchestrators, specialists, skills repository)</td><td>Very few in production at scale</td></tr><tr><td>5</td><td>Outcome-billed delivery (per-ticket, per-incident, per-user pricing)</td><td>None at scale yet</td></tr></tbody></table><p>The ceiling between Stage 2 and Stage 3 is where the platform investment matters. 
Everything before Stage 3 is point automation — useful, productive, but not compounding. Stage 3 is where every new tool exposed makes every existing agent more capable, and where governance becomes a property of the platform rather than the agent. Stages 4 and 5 are where the commercial model shifts.</p><p>Most outsourcers reading this article sit at Stage 1 or 2 for their largest accounts. The honest gap to Stage 3 is two to four quarters of platform work plus the unglamorous data-hygiene investment described in the risks section. The gap from 3 to 5 is the incentive problem above.</p><hr><h2 id="case-study-an-80000-endpoint-bank">Case Study: An 80,000-Endpoint Bank</h2><p>To anchor the abstractions, a fictional but representative client journey.</p><p>A global bank with 80,000 endpoints across thirty countries, currently outsourced to a single Indian IT provider on a five-year contract. The current delivery footprint is roughly 450 named FTEs across all towers (service desk, endpoint, identity, cloud, network, security operations) — an FTE-to-endpoint ratio of around 1:175 (80,000 endpoints across 450 FTEs), which is typical for a well-run but not bleeding-edge tower. Monthly ticket volume sits around 95,000, of which roughly 42,000 are repetitive L1 categories. The customer&rsquo;s CMDB hygiene is moderate; their ITSM (ServiceNow) is well-maintained but the integration with their endpoint management estate (a mix of SCCM and Intune) is inconsistent.</p><p><strong>Phase 1 (months 0–3).</strong> Stand up the MCP gateway. Expose three tools: ServiceNow, Active Directory, Intune. Build one agent: L1 password-reset and account-unlock automation. Wire the no-code builder. Measure baseline. Six gateway calls per minute by week ten.</p><p><strong>Phase 2 (months 3–9).</strong> Expose SCCM, Splunk, vSphere, NetBackup. Build the orchestrator agent and the VDI-triage workflow. Train fifteen SDMs and twenty senior engineers on the builder.
By month nine, around 22 percent of L1 volume auto-resolves; the L1 headcount has not changed yet, but the queue depth has dropped by around 30 percent and customer satisfaction has lifted by single-digit points.</p><p><strong>Phase 3 (months 9–18).</strong> New-joiner onboarding workflow goes live. Emergency patch workflow goes live. Asset and license reconciliation agents go live. The L1 auto-resolution rate stabilises around 32 percent. The bank asks why the SDM monthly report has improved so visibly. The renewal conversation begins.</p><p><strong>Phase 4 (months 18–24).</strong> Renewal lands on a hybrid model — partial FTE billing plus a per-ticket envelope. Around four percent of the original FTE base is redeployed to other accounts; another eight percent shifts to higher-value advisory work; the remainder continues in delivery roles. The vendor&rsquo;s blended margin on the account improves by approximately 2–4 percentage points despite the lower headline contract value. Customer-side, the cost-per-incident reduces by around 18–22 percent.</p><p>The interesting metrics in this case are not the headline ticket numbers. They are the renewal terms and the redeployment ratio. The renewal is what tells you whether the platform investment converted into a defensible commercial position. The redeployment ratio is what tells you whether the firm managed the internal political conversation well enough to keep its experienced people inside the business.</p><p>The fictional bank is a composite, not a real client, and the numbers should be read as illustrative rather than predictive.</p><hr><h2 id="the-rollout-path">The Rollout Path</h2><p>A staged path for an outsourcer starting today. The phases are deliberately conservative.</p><p><strong>Phase 1 (months 0–3).</strong> Pick one client, one process. Stand up a per-client MCP gateway. Expose three to five tools.
Build governance from day one: scanning, redaction, audit logging, scoped access, secret rotation. Build one specialist agent. Measure baseline before anything is turned on.</p><p><strong>Phase 2 (months 3–6).</strong> Build the no-code builder layer. Train ten SDMs and ten senior engineers. Compose the first three workflows — password reset, account unlock, basic VDI triage. Publish the numbers internally.</p><p><strong>Phase 3 (months 6–12).</strong> Expose the rest of the client&rsquo;s estate. Build specialist agents per major tool. Compose the next ten workflows. Begin the internal skills-repository pattern, curated rather than free-for-all.</p><p><strong>Phase 4 (year 2).</strong> Replicate the platform pattern to the next two or three clients. Each new client engagement still requires substantial per-client work — tenant isolation, data sovereignty, CMDB onboarding, integration with each client&rsquo;s specific tool versions, contract renegotiation, change management with the client&rsquo;s IT organisation. Implementation time improves from quarters to a few months, not weeks.</p><p><strong>Phase 5 (year 3+).</strong> Outcome-billed contracts begin to make sense with two or three reference clients in production. The commercial nerve to renegotiate large multi-year contracts on a new basis is as important as the technology by this point.</p><p>The substantial work that remains, even after the platform pattern is established, includes identity federation across client tenants, robust RBAC for multi-tenant agent access, rollback safety for agent-executed changes, hallucination containment in production conditions, change-governance integration with each client&rsquo;s existing CAB, observability across the full stack, and workflow determinism for processes with regulatory SLAs. 
These are the issues that determine whether the platform delivers in production or merely demos well.</p><hr><h2 id="the-third-wave">The Third Wave</h2><p>Indian IT outsourcing has been through three waves. Offshoring in the 1990s and early 2000s — moving work from where it was expensive to where it was cheap. Digital transformation in the 2010s — moving customers from legacy stacks to cloud, mobile and modern collaboration. The third wave is automation-led services — moving the unit of delivery from the FTE to the agent-augmented workflow.</p><p>The companies that build a credible MCP-centred fabric first will set the terms for the decade that follows. The ones that wait will negotiate with customers who have done the math and know what the cost-per-ticket should be in a world where an agent handles a meaningful share of the volume.</p><p>Uber&rsquo;s published material is the most usable public case study to date of what such a platform looks like in production at scale. The architecture is reproducible. The protocol is open. The operational discipline required is well within what mature outsourcing organisations already have. The incentive conversation is harder than the engineering, and it is the part that decides who wins.</p><p>The technology has been ready for two years. The platform pattern is now visible. 
The work that remains is execution, and the conversation about who gets paid for it.</p><hr><h2 id="sources--notes">Sources &amp; Notes</h2><p>Uber claims in this article reference the company&rsquo;s publicly available engineering material on MCP:</p><ul><li>The conference talk: <a href="https://youtu.be/yVqMxBahjfA">https://youtu.be/yVqMxBahjfA</a></li><li>Uber&rsquo;s engineering blog at <a href="https://www.uber.com/blog/engineering/">https://www.uber.com/blog/engineering/</a> — search for their MCP and AI platform posts.</li><li>The Model Context Protocol specification at <a href="https://modelcontextprotocol.io">https://modelcontextprotocol.io</a></li></ul><p>Vendor capabilities in the competitive-landscape section reflect publicly available product documentation as of 2026 and move quickly; verify with vendors directly before procurement decisions.</p><p>Economic envelope figures, the role taxonomy, the workflow examples, the maturity model, the bank case study and the incentive analysis are my extrapolation based on fifteen years in infrastructure delivery. Not benchmarked, not audited, not validated by any of the firms named. Treat as a starting hypothesis for a client-specific business case, not as a benchmark.</p><p>Where the article uses &ldquo;could&rdquo;, &ldquo;would&rdquo; or &ldquo;should&rdquo;, that is deliberate hedging. The MCP-centred pattern is a working architecture inside Uber; whether it works equivalently inside a multi-tenant outsourcing environment is a thesis I find credible but cannot yet point to a published reference for.</p><p>Feedback, corrections and counter-evidence are welcome.</p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/UberMCP/uber-mcp-hero.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>digital-workplace</category><category>Knowledge Base</category></item><item><title>RAG Chatbot from indexed public documentation</title><link>https://curiousbit.netlify.app/rag-chatbot-from-indexed-public-documentation/</link><guid isPermaLink="true">https://curiousbit.netlify.app/rag-chatbot-from-indexed-public-documentation/</guid><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as the &lt;strong&gt;Week 15 Graded Mini Project&lt;/strong&gt; of the &lt;strong&gt;IITM Pravartak Professional Certificate Programme in Agentic AI and Applications&lt;/strong&gt;. The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/IITM/IIT1.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as the <strong>Week 15 Graded Mini Project</strong> of the <strong>IITM Pravartak Professional Certificate Programme in Agentic AI and Applications</strong>.
The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.</p><p>The implementation selects the <strong>Technology</strong> domain and indexes six official pages of the GitHub REST API documentation. The assistant accepts free-form questions in a command-line interface, performs context-aware follow-up handling, returns concise answers with source attribution, and emits a fixed refusal phrase when the retrieved context does not contain the answer.</p><h2 id="architecture">Architecture</h2><p>The pipeline is split into two phases. An offline ingestion stage prepares a reusable vector index; a runtime stage answers user turns by retrieving from that index and prompting a large language model under strict grounding instructions.</p><p><img src="/images/rag-github-architecture.png" alt="RAG architecture for the GitHub REST API assistant, showing ingestion and runtime phases" title="End-to-end flow: ingestion builds the FAISS index; runtime retrieves, rewrites follow-ups, and grounds the answer."/><h2 id="public-data-sources">Public Data Sources</h2><p>Six public pages from the official GitHub Docs were saved locally as plain-text files in <code>data/raw/</code>:</p><ul><li>Quickstart for GitHub REST API</li><li>Getting started with the REST API</li><li>Authenticating to the REST API</li><li>Rate limits for the REST API</li><li>Best practices for using the REST API</li><li>Troubleshooting the REST API</li></ul><p>These pages were chosen because they form a self-contained operational knowledge set: a developer integrating with the API typically needs to consult exactly this material when handling authentication, rate-limit pressure, and recovery from common errors.</p><h2 id="components-and-concepts">Components and Concepts</h2><p>The following components and ideas underpin the implementation.
Each is summarised briefly to make the design choices easy to evaluate.</p><p><strong>Retrieval-Augmented Generation (RAG).</strong> A pattern in which a language model is grounded on retrieved passages from an external corpus rather than relying solely on parametric memory. Retrieval reduces hallucination and lets the assistant cite source material it actually consulted.</p><p><strong>LangChain.</strong> The orchestration framework used throughout the project. It provides composable abstractions for document loaders, text splitters, embeddings, vector stores and chat models, which keeps the ingestion and runtime code small and idiomatic.</p><p><strong>RAG Chunking.</strong> Long documents are split into smaller passages before embedding. This implementation uses <code>RecursiveCharacterTextSplitter</code> with a 900-character chunk size and a 150-character overlap. Splitting prefers semantic boundaries (paragraph breaks and Markdown headings) so that each chunk is internally coherent and retrieval surfaces meaningful units rather than fragments.</p><p><strong>OpenAI Embeddings.</strong> Each chunk is converted into a high-dimensional vector using the <code>text-embedding-3-small</code> model. The vector captures the semantic content of the chunk so that questions phrased differently from the source text still retrieve the right passages.</p><p><strong>FAISS.</strong> Facebook AI Similarity Search is the local vector store used to index the embeddings. The index is persisted to disk under <code>vectorstore/github_rest_api_faiss</code> so that ingestion runs once and the runtime simply loads the existing index, satisfying the rubric&rsquo;s reusable-index criterion.</p><p><strong>Top-k Retrieval.</strong> At each user turn the retriever returns the <code>k=4</code> chunks most similar to the query.
A small <code>k</code> keeps the prompt focused and reduces the chance that off-topic passages dilute the grounded answer.</p><p><strong>Follow-up Rewriter.</strong> A short, dedicated LLM call rewrites the user&rsquo;s latest question into a standalone retrieval query using recent conversation history. This is the mechanism that allows ambiguous follow-ups such as &ldquo;what about rate-limit headers?&rdquo; to be embedded and retrieved correctly without the model having to guess the missing antecedent.</p><p><strong>Grounded Answerer.</strong> The final answer is produced by <code>gpt-4o-mini</code> at temperature zero under a system prompt that instructs the model to answer only from the retrieved context, append a <code>Sources:</code> line naming the document titles consulted, and otherwise return the exact refusal phrase.</p><p><strong>Refusal Behaviour.</strong> When retrieval returns no usable context, the runtime short-circuits and emits the required string verbatim: <code>I don't have enough information in the provided documents.</code> The refusal is also enforced by the system prompt, providing defence in depth against hallucinated answers.</p><h2 id="implementation-layout">Implementation Layout</h2><p>The submission is organised as two thin Python modules and a Jupyter notebook. <code>ingest.py</code> loads the text files, removes GitHub Docs templating syntax, extracts a human-readable title from each document, chunks the text, embeds the chunks, and writes the FAISS index. <code>chatbot.py</code> loads the persisted index, manages conversation history, invokes the rewriter and the grounded answerer in sequence, and prints the response with source attribution. The accompanying notebook imports the same functions so that the notebook and the CLI execute identical code paths, eliminating drift between the two surfaces.</p><p>Configuration is environment-driven.
The OpenAI API key is read from a <code>.env</code> file and an optional <code>OPENAI_BASE_URL</code> is honoured so that the same code runs against the Vocareum-provided proxy without modification. No secrets are committed to source.</p><h2 id="sample-conversation">Sample Conversation</h2><p>A short transcript illustrates the three behaviours the rubric tests — grounded answer, follow-up resolution, and refusal.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">You: What headers should I send when making requests to the GitHub REST API?</span></span><span class="line"><span class="cl">Assistant: GitHub REST API requests commonly use an Accept header such as</span></span><span class="line"><span class="cl">application/vnd.github+json. Authenticated requests send an Authorization</span></span><span class="line"><span class="cl">header with Bearer YOUR-TOKEN. GitHub also recommends a valid User-Agent</span></span><span class="line"><span class="cl">value, such as your GitHub username or application name.</span></span><span class="line"><span class="cl">Sources: Getting started with the REST API; Authenticating to the REST API</span></span><span class="line"><span class="cl"/></span><span class="line"><span class="cl">You: What about rate limit headers?</span></span><span class="line"><span class="cl">Search query: rate limit response headers GitHub REST API</span></span><span class="line"><span class="cl">Assistant: Response headers can include x-ratelimit-remaining and</span></span><span class="line"><span class="cl">x-ratelimit-reset, which indicate how many requests remain in the current</span></span><span class="line"><span class="cl">period and when the rate limit resets.</span></span><span class="line"><span class="cl">Sources: Getting started with the REST API; Rate limits for the REST API</span></span><span class="line"><span class="cl"/></span><span class="line"><span
class="cl">You: How should an integration handle rate limit errors?</span></span><span class="line"><span class="cl">Assistant: If a retry-after header is present, wait that many seconds before</span></span><span class="line"><span class="cl">retrying. If x-ratelimit-remaining is 0, wait until the time given by</span></span><span class="line"><span class="cl">x-ratelimit-reset. Otherwise, wait at least one minute and then use</span></span><span class="line"><span class="cl">exponentially increasing waits for repeated secondary rate-limit failures.</span></span><span class="line"><span class="cl">Continuing to make requests while rate-limited may result in a ban.</span></span><span class="line"><span class="cl">Sources: Best practices for using the REST API; Troubleshooting the REST API</span></span></code></pre></div><p>The second exchange demonstrates the follow-up rewriter at work: the original &ldquo;what about rate limit headers?&rdquo; is expanded into a self-contained search query before retrieval, which is what allows the retriever to surface the correct passages despite the missing antecedent.</p><p>Now, a deliberately off-topic question — one the index cannot possibly support — triggers the safety path:</p><div style="background:#2a0f14;border:2px solid #ef4444;border-left:6px solid #ef4444;border-radius:10px;padding:18px 22px;margin:22px 0;font-family:'JetBrains Mono','Menlo',monospace;font-size:15px;line-height:1.7;"><div style="color:#ef4444;font-size:12px;font-weight:700;letter-spacing:0.12em;text-transform:uppercase;margin-bottom:12px;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;">⛔ Refusal path — out-of-scope question</div><div style="color:#e6e8ed;"><span style="color:#8c94a5;">You:</span> What is the best laptop to buy for running the GitHub REST API?</div><div style="color:#fca5a5;margin-top:6px;"><span style="color:#8c94a5;">Assistant:</span> I don't have enough information in the provided documents.</div></div><p>No chunk 
in the index supports an opinion on hardware, so the assistant returns the mandated refusal verbatim rather than producing a plausible-sounding but ungrounded answer. The refusal is enforced twice — once procedurally when retrieval returns nothing, and once in the system prompt — so the behaviour is stable even when retrieval surfaces weakly related chunks.</p><h2 id="closing-observations">Closing Observations</h2><p>Three observations stand out from this exercise.</p><ul><li><strong>Retrieval quality is set upstream, not by the vector store.</strong> It depends less on the choice of vector store than on the upstream decisions about cleaning and chunking. Stripping GitHub Docs templating syntax — the <code>{% data ... %}</code> Liquid blocks, the reusables, and the <code>[AUTOTITLE]</code> link macros — materially improved the relevance of returned chunks because the embeddings stopped clustering around boilerplate.</li><li><strong>Follow-up handling must be engineered, not assumed.</strong> It is not a free behaviour of the language model. A separate rewriter step that converts conversational queries into standalone search queries is the smallest reliable mechanism, and it removes a category of retrieval failures that would otherwise be invisible in casual testing.</li><li><strong>Refusal is a feature, not a fallback.</strong> Enforcing the refusal both procedurally (short-circuit on empty context) and in the system prompt is what gives the assistant a useful posture in production: it will say nothing it cannot support, and it will say so in a predictable way.</li></ul>
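<p>The follow-up rewriter and the procedural half of the refusal can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual code: <code>answer_question</code>, <code>rewrite</code>, <code>retrieve</code> and <code>generate</code> are hypothetical names, passed in as plain callables so the control flow can be read without any particular LLM or vector-store client. The refusal string is the one shown in the sample transcript.</p>

```python
from typing import Callable, Sequence

# Refusal text copied from the sample transcript.
REFUSAL = "I don't have enough information in the provided documents."


def answer_question(
    question: str,
    history: Sequence[str],
    rewrite: Callable[[str, Sequence[str]], str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Rewrite a follow-up, retrieve context, and refuse early when it is empty.

    All three callables are hypothetical stand-ins for the real components.
    """
    # Step 1: turn a conversational follow-up into a standalone search query.
    # With no history there is nothing to resolve, so the question is used as-is.
    search_query = rewrite(question, history) if history else question

    # Step 2: retrieve supporting chunks for the standalone query.
    chunks = retrieve(search_query)

    # Step 3: procedural refusal. With no context the model is never called.
    if not chunks:
        return REFUSAL

    # Step 4: grounded generation. The system prompt is assumed to carry the
    # same refusal rule for weakly related context.
    return generate(question, chunks)
```

<p>Because the refusal is returned before <code>generate</code> is called, the empty-context path behaves the same however the model reads its system prompt; the prompt-level rule is what covers weakly related chunks.</p>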
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/IITM/IIT1.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>llm</category><category>automation</category><category>engineering</category><category>Knowledge Base</category></item><item><title>Training an Agent on Public Vendor Data — A BigFix Workspace+ Case Study</title><link>https://curiousbit.netlify.app/training-an-agent-on-public-vendor-data-a-bigfix-workspace-case-study/</link><guid isPermaLink="true">https://curiousbit.netlify.app/training-an-agent-on-public-vendor-data-a-bigfix-workspace-case-study/</guid><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Ask a general-purpose LLM how to deploy an emergency Microsoft patch to ten thousand endpoints through HCL BigFix Workspace+, and you will get a plausible answer. Read it carefully and you will also find quietly invented menu paths, mis-named Fixlet sites, and &amp;ldquo;best practices&amp;rdquo; that no BigFix administrator has ever followed. The model has clearly seen the product once or twice during pre-training, just not enough to be useful when the deployment window opens at 10pm on a Tuesday.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/BigFix/b6293b31-2a31-4489-a629-11e353ee315b.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Ask a general-purpose LLM how to deploy an emergency Microsoft patch to ten thousand endpoints through HCL BigFix Workspace+, and you will get a plausible answer. Read it carefully and you will also find quietly invented menu paths, mis-named Fixlet sites, and &ldquo;best practices&rdquo; that no BigFix administrator has ever followed. 
The model has clearly seen the product once or twice during pre-training, just not enough to be useful when the deployment window opens at 10pm on a Tuesday.</p><p>This is the gap that specialist agents are meant to close. Not by retraining a foundation model — that is expensive, slow, and almost never the right tool — but by <strong>grounding</strong> a capable general model in a curated body of authoritative product knowledge that you control. The interesting part is that the body of knowledge required to make a model genuinely competent in a specific product is, for most enterprise vendors, already public. Administration guides, release notes, knowledge centre articles, support bulletins, hardening guides, command references — all of it sits on the vendor&rsquo;s documentation site, free to read and free to index.</p><p>This post walks through a working proof-of-concept I built around exactly that idea: <strong>the BigFix Workspace+ Expert Agent</strong>, a Retrieval-Augmented Generation (RAG) specialist trained on public BigFix documentation. It is small enough to run on a laptop and good enough at its domain to demonstrate where this pattern earns its keep. More importantly, the architecture is product-agnostic — swap the knowledge base, change the system prompt, and the same pattern produces a specialist for VMware Horizon, ServiceNow ITSM, Citrix DaaS, Microsoft Intune, or any other vendor whose docs you can legally crawl.</p><hr><h2 id="why-general-llms-are-not-good-enough-for-product-depth">Why General LLMs Are Not Good Enough For Product Depth</h2><p>A general-purpose LLM pushed into vendor-specific territory tends to fail in the same five ways every time. 
A RAG-based specialist agent attacks all five at once — the knowledge base sits outside the model, can be refreshed when the vendor publishes a new release, retrieves verbatim passages instead of synthesising from memory, returns the source documents alongside every answer, and refuses to leave its declared domain.</p><p><img src="/images/BigFix/bigfix-llm-vs-specialist.svg" alt="General-purpose LLM vs specialist RAG agent — comparison across knowledge currency, accuracy, source attribution, scope control, and failure mode"/><hr><h2 id="the-premise--public-vendor-data-is-the-training-set">The Premise — Public Vendor Data Is The Training Set</h2><p>Enterprise software vendors publish an enormous amount of high-quality technical content for one reason: their customers cannot deploy or operate the product without it. HCL, Microsoft, VMware, Citrix, Red Hat, ServiceNow — all of them maintain extensive documentation sites and most also publish support knowledge bases, community forums, and architecture reference papers.</p><p>Three categories matter most for a support and architecture agent:</p><p>The first is <strong>product documentation</strong> — administration guides, deployment guides, console references, API references. This is your &ldquo;how does the product work&rdquo; layer. It is structured, versioned, and authoritative.</p><p>The second is <strong>knowledge base and support articles</strong> — the &ldquo;how do you fix this when it breaks&rdquo; layer. These articles are written by support engineers for symptoms that real customers have actually hit. They are gold for a troubleshooting agent.</p><p>The third is <strong>architecture and best-practice content</strong> — reference architectures, sizing guides, hardening guides, integration playbooks. 
This is your &ldquo;how should you actually deploy it&rdquo; layer.</p><p>Once you have those three corpora indexed, you have something that no general LLM has natively: a tight, current, citable body of knowledge with the depth that real practitioners need.</p><hr><h2 id="the-case-study--bigfix-workspace-expert-agent">The Case Study — BigFix Workspace+ Expert Agent</h2><p>HCL BigFix Workspace+ is a unified endpoint management platform — agent-based endpoint discovery, patch management for Windows / macOS / Linux / 100+ third-party apps, software distribution, CIS and STIG compliance scanning, and workspace analytics. It is a serious enterprise product and a representative target: the kind of platform where a general LLM is only superficially useful and where a specialist agent has obvious value for L1/L2 support, internal enablement, and architecture sanity-checks.</p><p>The POC ingests a curated set of public BigFix Workspace+ documentation across six domains:</p><ol><li>Product overview and architecture</li><li>Patch management</li><li>Troubleshooting</li><li>Security and compliance</li><li>Software distribution</li><li>Workspace analytics</li></ol><p>Each domain is a markdown document. 
In the POC the documents are hand-curated extracts; in a production deployment you would point the same pipeline at the vendor&rsquo;s Knowledge Center, scrape it on a schedule, and feed the output straight through.</p><p>The tech stack, with the production swap for each component:</p><table><thead><tr><th>Component</th><th>POC</th><th>Production swap</th></tr></thead><tbody><tr><td>LLM</td><td>OpenAI GPT-4o-mini</td><td>Azure OpenAI (enterprise compliance)</td></tr><tr><td>Embeddings</td><td>text-embedding-3-small</td><td>Azure OpenAI embeddings</td></tr><tr><td>Vector store</td><td>ChromaDB (local)</td><td>Azure AI Search / Pinecone</td></tr><tr><td>Orchestration</td><td>LangChain</td><td>LangGraph (multi-agent)</td></tr><tr><td>Web server</td><td>Flask</td><td>Azure Bot Framework + MS Teams</td></tr><tr><td>Auth</td><td>None</td><td>Azure AD / Teams SSO</td></tr></tbody></table><p>A demo session costs well under ten cents in API calls. The whole system fits in a single repository and runs on a laptop with a Python virtual environment.</p><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:10px;padding:22px 26px;margin:28px 0;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;"><div style="color:#00e5ff;font-size:13px;font-weight:700;letter-spacing:0.08em;text-transform:uppercase;margin-bottom:8px;">Proof of Concept · Slides</div><a href="/bigfix-poc-slides.html" target="_blank" rel="noopener" style="color:#e0e0e0;font-size:18px;font-weight:600;text-decoration:none;border-bottom:1px solid #00e5ff;">
See the full BigFix Workspace+ PoC deck →</a><div style="color:#8c94a5;font-size:14px;margin-top:8px;">7 slides · opens in a new tab</div></div><hr><h2 id="architecture">Architecture</h2><p>The pipeline is the standard RAG topology, split into an offline ingestion stage (runs when the knowledge base is built or refreshed) and a runtime query stage (runs for every user question). The vector store is the shared component that bridges the two.</p><p><img src="/images/BigFix/bigfix-architecture.png" alt="BigFix Workspace+ expert agent architecture — offline ingestion pipeline (docs → chunker → embeddings → ChromaDB) and runtime query flow (user → Flask + RAG agent → LLM → response with sources and confidence)"/><hr><h2 id="the-ingestion-pipeline">The Ingestion Pipeline</h2><p><img src="/images/BigFix/bigfix-ingestion-pipeline.png" alt="BigFix ingestion pipeline — four stages (load markdown files, chunk with overlap, embed with text-embedding-3-small, persist to ChromaDB) shown as a left-to-right flow with the chunker rationale called out below"/><p>Three steps: load, chunk, embed.</p><p>Loading is mechanical — read every markdown file in the knowledge base directory, attach source metadata (filename plus a <code>domain</code> tag so multiple products can share a database later), and return a list of documents.</p><p>Chunking is where most of the retrieval quality is actually won or lost. 
The POC uses LangChain&rsquo;s <code>RecursiveCharacterTextSplitter</code> with separators ordered from most to least semantic — heading-level breaks first, then paragraph breaks, then sentences, then words:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Import path assumes the langchain-text-splitters package; older LangChain</span>
</span></span><span class="line"><span class="cl"><span class="c1"># releases expose the same class in langchain.text_splitter.</span>
</span></span><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">langchain_text_splitters</span> <span class="kn">import</span> <span class="n">RecursiveCharacterTextSplitter</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="p">(</span>
</span></span><span class="line"><span class="cl">    <span class="n">chunk_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">    <span class="n">separators</span><span class="o">=</span><span class="p">[</span><span class="s2">"</span><span class="se">\n</span><span class="s2">## "</span><span class="p">,</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">### "</span><span class="p">,</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">#### "</span><span class="p">,</span> <span class="s2">"</span><span class="se">\n\n</span><span class="s2">"</span><span class="p">,</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">,</span> <span class="s2">". "</span><span class="p">,</span> <span class="s2">" "</span><span class="p">],</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>
</span></span></code></pre></div><p>This ordering matters. Technical documentation is dense with structured headings — &ldquo;Creating a Patch Action&rdquo;, &ldquo;Patch Discovery&rdquo;, &ldquo;Maintenance Windows&rdquo; — and a chunk that respects those boundaries reads as a coherent unit. 
A chunk that arbitrarily splits a numbered procedure in half does not. The 200-character overlap is there so that a sentence at the boundary still appears in both chunks; this materially improves recall when a user&rsquo;s query phrasing matches text near a chunk edge.</p><p>Embeddings are produced with <code>text-embedding-3-small</code> — fast, cheap, and accurate enough for technical English. For a knowledge base of around six hundred lines of markdown the entire ingestion runs in under five seconds and produces roughly eighty to a hundred chunks.</p><hr><h2 id="the-agent">The Agent</h2><p>Two things make the agent specialised: the <strong>system prompt</strong> and the <strong>retrieval contract</strong>.</p><p>The system prompt establishes the agent&rsquo;s persona, the rules of engagement, and — critically — the refusal policy. The relevant parts:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">You are the BigFix Workspace+ Expert Agent, a specialist AI assistant for
</span></span><span class="line"><span class="cl">HCL BigFix Workspace+ — a unified endpoint management platform.
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Rules:
</span></span><span class="line"><span class="cl">1. ONLY answer questions related to HCL BigFix Workspace+. If a question is
</span></span><span class="line"><span class="cl">   outside your domain, politely redirect.
</span></span><span class="line"><span class="cl">2. Base your answers on the provided context. If the context doesn't contain
</span></span><span class="line"><span class="cl">   enough information, say so honestly rather than guessing.
</span></span><span class="line"><span class="cl">3. When providing steps or commands, be specific and include the exact paths,
</span></span><span class="line"><span class="cl">   commands, or navigation instructions.
</span></span><span class="line"><span class="cl">4. Always cite which area of BigFix your answer relates to.
</span></span></code></pre></div><p>The agent is told, explicitly, not to extrapolate. This is the lever that turns a chatty general-purpose model into something that behaves like a knowledgeable but appropriately humble support engineer. If the retrieved context does not address the question, the model is instructed to acknowledge the limitation rather than synthesise an answer from its parametric memory.</p><p>The retrieval contract is the second half of the specialisation. For every query, the agent retrieves the top-k chunks (k=4 in the POC) above a relevance threshold of 0.3, concatenates them into a single context block, and passes the question and context into the LLM. If retrieval returns nothing above threshold, the agent short-circuits and returns a polite &ldquo;I do not have that in my knowledge base&rdquo; response without ever calling the LLM. This is a small detail with a large effect — it prevents the model from confabulating answers when it has no evidence.</p><hr><h2 id="confidence-scoring-and-escalation">Confidence Scoring And Escalation</h2><p>One of the design decisions I am most happy with is the explicit confidence score on every response. After generating an answer, the agent makes a second LLM call asking the model to rate how well the retrieved context actually addressed the question, on a scale of zero to one. 
If the score falls below a configurable threshold (default 0.7), the response carries an &ldquo;escalation recommended&rdquo; flag — surfaced in the UI as a coloured badge and, in production, used to route the question to a human L2 engineer or to a higher-tier orchestrator agent.</p><p><img src="/images/BigFix/bigfix-confidence-flow.png" alt="Confidence-aware response flow — the agent short-circuits when no chunks pass the relevance threshold, generates a response when they do, self-scores confidence, and either delivers the answer or flags for escalation"/><p>The result is a system that fails gracefully. It says &ldquo;I don&rsquo;t have that in my knowledge base&rdquo; when retrieval comes back empty, and it raises a hand when retrieval succeeded but the response did not quite land. Very few naive RAG systems have either of those properties — most will confidently invent an answer either way.</p><hr><h2 id="domain-guardrails">Domain Guardrails</h2><p>Ask the agent &ldquo;what is the weather today&rdquo; and it will not answer — not because it cannot, but because it has been instructed not to. The system prompt enforces a hard scope: BigFix Workspace+ questions only, everything else is politely redirected.</p><p>This sounds like a small thing. It is not. Domain isolation is the property that makes a specialist agent <strong>trustworthy</strong> in an enterprise setting. A general assistant that wanders out of its lane is a liability — it will eventually answer a question that the organisation needs answered by a different specialist team, or worse, by a human with accountability. A specialist agent that refuses to leave its lane is the agentic equivalent of a senior engineer who knows what they know and is honest about what they do not.</p><p>This pattern also composes. 
Run a BigFix agent, a ServiceNow agent, an Intune agent, and a VMware Horizon agent in the same workspace, and give an orchestrator agent the job of routing incoming questions to whichever specialist is best placed to answer. The result is something that starts to look like a real virtual support organisation.</p><hr><h2 id="what-this-approach-actually-buys-you-over-a-general-llm">What This Approach Actually Buys You Over A General LLM</h2><p>Five things, concretely.</p><p>The first is <strong>currency</strong>. The knowledge base can be refreshed on whatever cadence the vendor publishes new content — daily for an actively maintained product, weekly for everything else. The model itself never has to be retrained. Re-running ingestion against an updated set of source documents is the entire change-management process.</p><p>The second is <strong>accuracy on product specifics</strong>. The answers come from passages that the vendor actually wrote, retrieved verbatim and handed to the LLM as context. The model still synthesises the response, but it is synthesising from authoritative source material rather than reaching into its compressed memory.</p><p>The third is <strong>auditability</strong>. Every response in the POC includes the list of source chunks that informed it, with relevance scores. A user — or an internal audit — can click into the sources and verify the claim against the underlying documentation.</p><p>The fourth is <strong>scope control</strong>. A general LLM will answer any question; a specialist agent answers only questions inside its declared domain. This is a feature, not a limitation. It is what makes the agent safe to deploy to a wide internal audience without needing to babysit its outputs.</p><p>The fifth is <strong>graceful failure</strong>. 
The combination of a relevance threshold on retrieval, an explicit &ldquo;I don&rsquo;t have that in my knowledge base&rdquo; response when nothing scores above the threshold, and a confidence score on every answer means the agent fails by saying &ldquo;I don&rsquo;t know&rdquo; rather than by inventing an answer.</p><hr><h2 id="from-poc-to-production">From POC To Production</h2><p>The POC deliberately stops short of production. Six things would have to change to take this into an enterprise environment, and none of them are difficult — they are all standard enterprise-platform substitutions.</p><p>The first is the <strong>LLM provider</strong>. Swap OpenAI for Azure OpenAI to get enterprise compliance, data residency, and an enterprise SLA. The LangChain code already supports both — flipping between them is an environment-variable change.</p><p>The second is the <strong>vector store</strong>. ChromaDB is excellent for local development; for production you want Azure AI Search or Pinecone, both of which give you horizontal scale, snapshotting, and managed availability. The retrieval interface is identical.</p><p>The third is <strong>authentication and access control</strong>. The Flask web UI is replaced with an Azure Bot Framework bot connected to MS Teams. The bot inherits Azure AD identity, which means every query is now tied to a named user, and access can be scoped per agent.</p><p>The fourth is <strong>observability</strong>. Add LangSmith or a custom logging layer to capture query, retrieved chunks, confidence, response, and user identity for every interaction. This becomes the dataset that drives knowledge-base improvement — every time the agent says &ldquo;I don&rsquo;t know&rdquo;, that is a gap to fill.</p><p>The fifth is <strong>automated knowledge-base updates</strong>. 
Replace the manual <code>python ingest.py</code> invocation with a scheduled pipeline that crawls the vendor&rsquo;s documentation site, diffs against the last snapshot, and reindexes only what changed.</p><p>The sixth is the <strong>multi-agent layer</strong>. Once you have a second specialist agent (say, ServiceNow ITSM), you introduce an orchestrator agent — built on LangGraph — whose job is to read the incoming question, decide which specialist is best placed to answer, and route accordingly. This is where the architecture starts to earn its name as an &ldquo;agentic library&rdquo;.</p><hr><h2 id="generalising-the-pattern">Generalising The Pattern</h2><p>The thing I want to draw out here is that almost nothing in the architecture above is specific to BigFix. The system prompt names the product, the knowledge base contains BigFix documentation, and the UI says &ldquo;BigFix Workspace+ Agent&rdquo; at the top. Everything else — the ingestion pipeline, the retrieval logic, the confidence scoring, the escalation flag, the domain guardrails — is the same for any vendor whose documentation you can index.</p><p><img src="/images/BigFix/bigfix-recipe.svg" alt="The six-step recipe for building a specialist agent on public vendor data — pick a domain, identify public sources, run the RAG pipeline, write the system prompt, add confidence and escalation, iterate on gaps"/><p>That is it. The same six steps produce a Citrix DaaS architecture agent, a ServiceNow ITSM workflow assistant, a VMware Horizon troubleshooting bot, a Microsoft Intune compliance helper, a Red Hat OpenShift operator&rsquo;s companion. Pick the product, find the docs, run the pipeline.</p><hr><h2 id="closing-thought">Closing Thought</h2><p>The interesting shift here is not technical. The technical pattern — retrieval-augmented generation — has been in production at large enterprises for two years. 
What is interesting is that <strong>the data you need to make a specialist agent meaningfully better than a general LLM at a specific vendor product is already free and public</strong>. The vendor put it there because their customers need it to operate the product. We have spent twenty years building search engines, knowledge bases, and support portals on top of that documentation. A specialist agent is the natural next layer — one that reads those same documents, understands the question being asked, and answers in the form the user actually wanted.</p><p>For me, the BigFix POC is one node in a larger experiment around what an agentic IT support and architecture function could look like. A specialist agent per product, an orchestrator on top, a confidence-aware escalation path into human L2 / L3 — all of it grounded in the documentation that already exists. None of the individual pieces are exotic. The interesting thing is what happens when you assemble them.</p><p>If you operate a workplace or infrastructure team and you have a product domain where the same questions get asked over and over by the same engineers, this is a weekend&rsquo;s worth of work. The vendor has already written your training set.</p>
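<p>The retrieval contract and the confidence-aware escalation described earlier can be sketched together in plain Python. This is an illustration under stated assumptions rather than the POC's code: <code>select_context</code>, <code>respond</code>, <code>AgentResponse</code>, <code>generate</code> and <code>score_confidence</code> are hypothetical names, the two LLM calls are injected as callables, and the choice to flag the empty-retrieval case for escalation with a confidence of 0.0 is an assumption the post does not state. The values k=4, 0.3 and 0.7 are the ones quoted in the text.</p>

```python
from dataclasses import dataclass

# Wording taken from the post; the trailing period is added here.
NO_ANSWER = "I do not have that in my knowledge base."


@dataclass
class AgentResponse:
    answer: str
    confidence: float
    escalate: bool
    sources: list[str]


def select_context(scored_chunks, k=4, relevance_threshold=0.3):
    """Keep the k highest-scoring (text, score) pairs above the threshold."""
    kept = [(text, score) for text, score in scored_chunks if score > relevance_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


def respond(question, scored_chunks, generate, score_confidence, escalation_threshold=0.7):
    """Apply the retrieval contract, then self-score and flag low confidence.

    generate(question, context) and score_confidence(question, texts, answer)
    are hypothetical stand-ins for the two LLM calls.
    """
    context = select_context(scored_chunks)
    if not context:
        # Short-circuit: nothing passed the threshold, so no LLM call is made.
        # Flagging this case for escalation is an assumption of this sketch.
        return AgentResponse(NO_ANSWER, 0.0, True, [])

    texts = [text for text, _ in context]
    answer = generate(question, "\n\n".join(texts))

    # Second call: how well did the retrieved context address the question (0 to 1)?
    confidence = score_confidence(question, texts, answer)
    return AgentResponse(answer, confidence, escalation_threshold > confidence, texts)
```

<p>Keeping the thresholds as keyword arguments makes them the only knobs a new product domain needs to tune; everything else in the function is independent of BigFix.</p>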
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/BigFix/b6293b31-2a31-4489-a629-11e353ee315b.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>digital-workplace</category><category>Knowledge Base</category></item><item><title>Essential Skills for the AI Era — What former British PM Says You Actually Need</title><link>https://curiousbit.netlify.app/essential-skills-for-the-ai-era-what-intellect-says-you-actually-need/</link><guid isPermaLink="true">https://curiousbit.netlify.app/essential-skills-for-the-ai-era-what-intellect-says-you-actually-need/</guid><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Automation isn&amp;rsquo;t coming. It&amp;rsquo;s already here — and it&amp;rsquo;s accelerating faster than most organisations have planned for.&lt;/p&gt;
&lt;p&gt;In the past two years, generative AI has moved from a novelty to a core part of how knowledge work gets done. Writing, coding, analysis, summarisation, customer interaction — tasks that once required specialists can now be initiated with a prompt. The World Economic Forum estimates that &lt;strong&gt;85 million jobs will be disrupted by 2025&lt;/strong&gt;, while &lt;strong&gt;97 million new roles&lt;/strong&gt; will emerge that require humans and machines to work together.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/SoftSkills/8747db36-8948-4767-bd8c-c33562245fab.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Automation isn&rsquo;t coming. It&rsquo;s already here — and it&rsquo;s accelerating faster than most organisations have planned for.</p><p>In the past two years, generative AI has moved from a novelty to a core part of how knowledge work gets done. Writing, coding, analysis, summarisation, customer interaction — tasks that once required specialists can now be initiated with a prompt. The World Economic Forum estimates that <strong>85 million jobs will be disrupted by 2025</strong>, while <strong>97 million new roles</strong> will emerge that require humans and machines to work together.</p><p>The uncomfortable truth is that the skills that made someone valuable in 2020 are not sufficient for 2030. Not because the work is disappearing entirely — but because the baseline has shifted. The tasks AI handles well are the ones most easily measured, most easily automated. What remains irreducibly human — and what organisations are now actively hiring for — are harder to fake, harder to code, and harder to scale.</p><p>The research is consistent across sources: the <strong>World Economic Forum Future of Jobs 2025 report</strong>, McKinsey&rsquo;s workforce transition analyses, and a growing body of peer-reviewed evidence all point to the same cluster of capabilities. 
Not technical skills alone. Not soft skills alone. A specific combination of cognitive, interpersonal, leadership, and AI-fluency skills that together define effectiveness in an AI-augmented world.</p><p>This article walks through all 17 of them — drawn from a research-validated framework built on those primary sources — with enough detail to understand what each skill actually means and where to start building it.</p><hr><h2 id="the-framework-four-categories-17-skills">The Framework: Four Categories, 17 Skills</h2><img src="/images/slide-01.jpg" alt="Essential Skills for the AI Era overview — four categories and 17 skills" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>The 17 skills fall into four categories: <strong>Cognitive Skills</strong>, <strong>Interpersonal &amp; Human Skills</strong>, <strong>Leadership &amp; Character Traits</strong>, and <strong>Technical AI Skills</strong>. The split is intentional — effective performance in an AI-integrated workplace requires all four working together. Technical fluency without critical thinking produces prompt-execution without judgment. Interpersonal strength without resilience collapses under constant change. Leadership without integrity erodes trust at exactly the moment organisations need it most.</p><hr><h2 id="part-1--cognitive-skills">Part 1 — Cognitive Skills</h2><img src="/images/slide-02.jpg" alt="Cognitive Skills overview — Critical Thinking, Storytelling, Learnability, Financial Literacy, Metacognition" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Five cognitive skills define how you process information, communicate insight, and keep pace with change. 
These are the mental habits that determine whether you use AI as a tool or become dependent on it without understanding what it produces.</p><hr><h3 id="critical-thinking">Critical Thinking</h3><img src="/images/slide-03.jpg" alt="Critical Thinking deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Critical thinking is the ability to analyse information objectively, identify assumptions, evaluate evidence, and reach reasoned conclusions — independent of what an AI, a manager, or a source claims. In an environment where AI can produce confident-sounding content that is factually wrong, this skill is no longer optional. The person who can interrogate an AI output, spot the flaw in a logical chain, and decide what actually holds up is the person who adds value above the model.</p><hr><h3 id="storytelling">Storytelling</h3><img src="/images/slide-04.jpg" alt="Storytelling deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Data without narrative is noise. Storytelling is the capacity to take complex, technical, or ambiguous information and shape it into a form that moves people to understand and act. AI can draft a summary — it cannot yet understand which detail changes a room. The ability to read an audience, choose the right frame, and make information land is a persistent human advantage.</p><hr><h3 id="learnability">Learnability</h3><img src="/images/slide-05.jpg" alt="Learnability deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Learnability is the drive and ability to continuously acquire new skills as conditions change. The half-life of technical knowledge is shortening — what was cutting-edge 18 months ago is now a commodity. The differentiator is not what you currently know but how quickly you can absorb, apply, and discard what you need. 
This is a learnable habit in itself: building structured routines around learning, reflection, and experimentation.</p><hr><h3 id="financial-literacy">Financial Literacy</h3><img src="/images/slide-06.jpg" alt="Financial Literacy deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Financial literacy in this context means understanding the business logic behind decisions — how resources are allocated, how ROI is calculated, what the cost of inaction looks like. As AI tools proliferate, more people in non-finance roles are being asked to build business cases, evaluate vendor proposals, and justify investment. The person who can read a P&amp;L and understand what it signals is better positioned to influence outcomes.</p><hr><h3 id="metacognition">Metacognition</h3><img src="/images/slide-07.jpg" alt="Metacognition deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Metacognition is thinking about your own thinking. It&rsquo;s the ability to notice when your reasoning is flawed, when a cognitive bias is operating, when you&rsquo;re avoiding an uncomfortable conclusion. In fast-moving environments with incomplete information, this self-awareness separates people who course-correct early from those who double down on bad decisions. It&rsquo;s also the foundation for using AI well — recognising where your own knowledge ends and where you need to verify.</p><hr><h2 id="part-2--interpersonal--human-skills">Part 2 — Interpersonal &amp; Human Skills</h2><img src="/images/slide-08.jpg" alt="Interpersonal and Human Skills overview" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Five interpersonal skills define how you operate in relationship with others — colleagues, clients, teams, and the people impacted by your decisions. 
These are the skills most resistant to automation because they depend on presence, trust, and genuine human connection.</p><hr><h3 id="human-behaviour">Human Behaviour</h3><img src="/images/slide-09.jpg" alt="Human Behaviour deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Understanding human behaviour — what motivates people, what creates resistance, how group dynamics operate — is the basis for effective collaboration, change management, and leadership. AI can analyse sentiment at scale; it cannot read the room in a difficult meeting or sense when a team is quietly disengaged. This skill is about developing the perceptual vocabulary to see what&rsquo;s actually happening between people.</p><hr><h3 id="empathy--compassion">Empathy &amp; Compassion</h3><img src="/images/slide-10.jpg" alt="Empathy and Compassion deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Empathy is not softness — it is a strategic capability. The ability to genuinely understand another person&rsquo;s perspective, priorities, and constraints leads to better decisions, stronger relationships, and more effective communication. In a world where AI handles routine interactions, the interactions that remain human are the ones that matter most — and they require real presence.</p><hr><h3 id="feedback-fluency">Feedback Fluency</h3><img src="/images/slide-11.jpg" alt="Feedback Fluency deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Feedback fluency is the ability to give, receive, and act on feedback without ego interference. Most organisations say they value feedback but have cultures where it rarely happens well. 
The people who can deliver honest assessment with care, receive difficult feedback without defensiveness, and build feedback loops into everyday work are significantly more effective — and make the teams around them more effective too.</p><hr><h3 id="emotional-vulnerability">Emotional Vulnerability</h3><img src="/images/slide-12.jpg" alt="Emotional Vulnerability deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Emotional vulnerability is the willingness to admit uncertainty, acknowledge mistakes, and show up as a full human being in professional contexts. This is different from oversharing. It&rsquo;s the courage to say &ldquo;I don&rsquo;t know,&rdquo; &ldquo;I got that wrong,&rdquo; or &ldquo;this is hard&rdquo; — and in doing so, create the psychological safety that allows teams to do their best work. AI cannot model vulnerability. Leaders who can are rare and disproportionately effective.</p><hr><h3 id="conflict-resolution">Conflict Resolution</h3><img src="/images/slide-13.jpg" alt="Conflict Resolution deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Conflict is not the problem — unresolved conflict is. The ability to navigate disagreement, surface the underlying interests beneath stated positions, and find workable outcomes without destroying relationships is essential in any organisation dealing with change. As AI accelerates the pace of change, the frequency and intensity of conflict around priorities, roles, and resources will increase. The people who can move through it constructively are invaluable.</p><hr><h2 id="part-3--leadership--character-traits">Part 3 — Leadership &amp; Character Traits</h2><img src="/images/slide-14.jpg" alt="Leadership and Character Traits overview" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Five leadership traits define the character dimension of effectiveness. 
These are not competencies you can train in a weekend — they develop through sustained practice, reflection, and deliberate choices over time. They are also increasingly rare, and therefore increasingly valuable.</p><hr><h3 id="patience">Patience</h3><img src="/images/slide-15.jpg" alt="Patience deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Patience is the capacity to maintain focus and composure in the face of delays, ambiguity, and setbacks — without becoming disengaged or reactive. In environments of rapid change, impatience is expensive: it leads to premature decisions, burned relationships, and abandoned initiatives just before they would have worked. Patience is not passivity; it&rsquo;s the ability to sustain effort through the parts of the process that feel unproductive.</p><hr><h3 id="resilience">Resilience</h3><img src="/images/slide-16.jpg" alt="Resilience deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Resilience is the ability to recover from setbacks, adapt to change, and keep going under sustained pressure — without losing effectiveness or identity. The pace of change in AI-adjacent industries means that failure, revision, and disruption are constant features of the work. Resilience is what separates people who iterate from those who stall. It can be built: through deliberate recovery practices, social support, and a relationship with failure that treats it as information rather than verdict.</p><hr><h3 id="self-reflection">Self-Reflection</h3><img src="/images/slide-17.jpg" alt="Self-Reflection deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Self-reflection is the regular practice of examining your own behaviour, decisions, and impacts — not to judge but to learn. Leaders who reflect are more consistent, more honest about their limitations, and more capable of genuine growth. 
In an environment where AI can surface patterns in your work that you would never notice, the people with strong self-reflection practices will use those insights productively rather than defensively.</p><hr><h3 id="public-service-drive">Public Service Drive</h3><img src="/images/slide-18.jpg" alt="Public Service Drive deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Public service drive is the motivation to contribute beyond personal gain — to the organisation, the community, or the broader world. This is increasingly relevant as AI raises the stakes for decisions about automation, displacement, and the ethical use of technology. People who are driven by purpose beyond their own advancement make different choices, communicate differently, and attract different levels of trust from the people around them.</p><hr><h3 id="integrity">Integrity</h3><img src="/images/slide-19.jpg" alt="Integrity deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Integrity is the alignment between your values and your actions — and the willingness to maintain that alignment when it&rsquo;s costly. As AI systems enable new forms of data manipulation, synthetic content, and opaque decision-making, integrity becomes a competitive differentiator. Organisations that can demonstrate trustworthiness will command a premium. The people inside those organisations who model integrity shape the culture that makes that trustworthiness real.</p><hr><h2 id="part-4--technical-ai-skills">Part 4 — Technical AI Skills</h2><img src="/images/slide-20.jpg" alt="Technical AI Skills overview — Prompting, Vibe Coding, AI Workflow Integration" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Two technical skills round out the framework. These are the most rapidly evolving — and the ones with the clearest immediate return on investment. Job postings referencing AI fluency rose <strong>7x</strong> between 2022 and 2024. 
The WEF&rsquo;s Future of Jobs Report 2025 projects 170 million new roles by 2030, many of which will require humans to collaborate effectively with AI systems. The floor for technical AI competence is rising.</p><hr><h3 id="prompting--vibe-coding">Prompting &amp; Vibe Coding</h3><img src="/images/slide-21.jpg" alt="Prompting and Vibe Coding deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>Prompting is the ability to communicate effectively with AI systems to get useful, accurate, and appropriately scoped outputs. This is not a trivial skill — it requires understanding how language models interpret instructions, how to provide context, how to structure multi-step tasks, and how to verify outputs critically. Vibe coding extends this to software development: describing what you want in natural language, iterating with an AI model, and producing working code without traditional programming expertise. Both capabilities are now teachable, testable, and directly applicable to most knowledge-work roles.</p><hr><h3 id="ai-workflow-integration">AI Workflow Integration</h3><img src="/images/slide-22.jpg" alt="AI Workflow Integration deep-dive slide" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>AI workflow integration is the ability to identify where AI can improve or replace steps in an existing process — and then actually implement that change. This requires both the technical literacy to understand what tools can do and the process thinking to redesign workflows around them. 
The people who are most effective here are not necessarily the most technically sophisticated; they are the ones who understand the work deeply enough to see where the leverage is.</p><hr><h2 id="where-to-start-free-learning-paths">Where to Start: Free Learning Paths</h2><img src="/images/slide-23.jpg" alt="Free Coursera courses by skill category" style="width:100%;border-radius:8px;margin:1.5rem 0;"><p>The final slide is a course directory — every course listed is free to audit on Coursera, mapped to the skill categories above. The path from &ldquo;aware of these skills&rdquo; to &ldquo;actively developing them&rdquo; starts with picking one. Not the most impressive one. The one that feels most relevant to where you&rsquo;re working right now.</p><hr><h2 id="the-summary">The Summary</h2><p>The 17 skills above are not a checklist for becoming superhuman. They are a map of where the real work is — the capabilities that AI cannot replicate, that organisations are actively competing for, and that compound over time in ways that technical knowledge alone does not.</p><p>The division is useful: <strong>Cognitive skills</strong> shape how you think. <strong>Interpersonal skills</strong> shape how you connect. <strong>Leadership traits</strong> shape who you are under pressure. <strong>Technical AI skills</strong> shape how you extend your own capabilities. All four matter. The people who develop all four will have a meaningful advantage — not despite AI, but because of it.</p><p>The tools are available. The research is clear. The only question is where you start.</p><hr><p><em>Sources: World Economic Forum Future of Jobs Report 2025; McKinsey Global Institute Workforce Transition Research; peer-reviewed literature on human skills in AI-augmented workplaces.</em></p><p><em>Credit: Idea for this presentation inspired by this <a href="https://www.youtube.com/watch?v=spdUv7OFOu4">podcast</a>.</em></p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/SoftSkills/8747db36-8948-4767-bd8c-c33562245fab.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>digital-workplace</category><category>Knowledge Base</category></item><item><title>I Built a Team of IT Architects using LLM That Live on MacBook — Meet Aether</title><link>https://curiousbit.netlify.app/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/</guid><pubDate>Sat, 16 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ather2.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.</p><p>I built Aether to fix that. It is a local-first, multi-agent AI system that runs a team of 10 specialist IT architecture advisors on a single MacBook Pro M5. No internet after setup. No API costs. No data leaves the machine. 
Just fast, cited, domain-grounded answers — available at 10pm when the deadline hits.</p><p>This is the story of how I built it, what the stack looks like under the hood, and what I learned along the way.</p><hr><h2 id="the-problem-i-was-actually-solving">The Problem I Was Actually Solving</h2><p>I have spent years in IT architecture — cloud, digital workplace, network, end-user computing, the works. Over that time I have accumulated a large personal knowledge base: AWS Well-Architected reviews, Intune compliance policy templates, AVD host pool sizing guides, Citrix NetScaler configurations, TOGAF artefacts, cloud adoption frameworks. The knowledge exists. The problem is retrieval — getting the right answer from the right domain quickly, without context-switching across six different documentation tabs.</p><p>Commercial AI tools are good at general answers. They are not great at answering &ldquo;give me the exact OMA-URI path for configuring Windows Hello for Business through Intune on a hybrid-joined device for a tenant with MFA enforced at the Conditional Access layer.&rdquo; That requires domain depth, and it requires knowing which documents to pull from.</p><p>I also wanted to explore a practical AI use case — not a demo, not a proof of concept, but something I would actually use daily. 
Aether became that experiment.</p><hr><h2 id="what-aether-is">What Aether Is</h2><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:10px;padding:24px 28px;margin:32px 0;font-family:'JetBrains Mono',monospace;font-size:13px;line-height:1.7;color:#e0e0e0;"><span style="color:#00e5ff;font-weight:700;">AETHER v2.6 // SYSTEM DEFINITION</span><br><br><span style="color:#b39ddb;">Type:</span> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Local-First Multi-Agent AI System<br><span style="color:#b39ddb;">Agents:</span> &nbsp;&nbsp;&nbsp;10 (3-tier hierarchy)<br><span style="color:#b39ddb;">Model:</span> &nbsp;&nbsp;&nbsp;&nbsp;Gemma 4 26B A4B (Q4_K_M) — single instance<br><span style="color:#b39ddb;">Runtime:</span> &nbsp;&nbsp;LM Studio → LangGraph → FastAPI → Gradio<br><span style="color:#b39ddb;">Memory:</span> &nbsp;&nbsp;&nbsp;~43 GB of 64 GB unified (M5 Pro)<br><span style="color:#b39ddb;">Egress:</span> &nbsp;&nbsp;&nbsp;<span style="color:#00e5ff;">ZERO</span><br><span style="color:#b39ddb;">API cost:</span> &nbsp;<span style="color:#00e5ff;">ZERO</span></div><p>Aether is a <strong>local-first AI system that acts as a team of specialist architecture advisors — all running on your MacBook Pro M5</strong>. The headline technical trick: every one of those 10 advisors is the <em>same</em> Gemma 4 26B model, loaded once. What makes each advisor different is purely the system prompt it receives and the knowledge base namespace it retrieves from.</p><p>One model. Ten personas. 
Zero cloud.</p><hr><h2 id="the-three-tier-agent-hierarchy">The Three-Tier Agent Hierarchy</h2><p>The agents are organised the way a real consulting firm would structure a team — from narrow specialist up to cross-domain strategist.</p><p><img src="/images/aether-agent-hierarchy.svg" alt="Aether three-tier agent hierarchy: enterprise architect, domain architects, and technology architects connected by escalation paths"/></p><p><strong>Tier 1 — Enterprise Architect (1 agent).</strong> Cross-domain strategy, TOGAF, Zachman, governance frameworks (GDPR, ISO 27001, HIPAA), technology investment decisions. This agent can read <em>all</em> knowledge base namespaces — it is the only one with that reach. The final escalation destination.</p><p><strong>Tier 2 — Domain Architects (3 agents).</strong> Cloud Domain (multi-cloud strategy, FinOps, landing zones), Network Domain (SD-WAN, ZTNA, BGP, micro-segmentation), Digital Workplace Domain (Microsoft 365, VDI strategy, device management).</p><p><strong>Tier 3 — Technology Architects (6 agents).</strong> AWS, Azure, GCP, Intune, AVD, Citrix. Each one is scoped tightly to its domain — deep, narrow, and fast.</p><hr><h2 id="the-stack-component-by-component">The Stack, Component by Component</h2><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:10px;padding:28px 32px;margin:32px 0;"><table style="width:100%;border-collapse:collapse;font-family:'JetBrains Mono',monospace;font-size:17px;line-height:1.55;"><tr style="border-bottom:1px solid #1a1a3a;"><td style="padding:14px 16px;color:#00e5ff;font-weight:700;">Component</td><td style="padding:14px 16px;color:#00e5ff;font-weight:700;">What It Does</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Gemma 4 26B A4B (Q4_K_M)</td><td style="padding:13px 16px;color:#e0e0e0;">The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. 
~13 GB VRAM.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">LM Studio</td><td style="padding:13px 16px;color:#e0e0e0;">Local model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">LangGraph</td><td style="padding:13px 16px;color:#e0e0e0;">Orchestration graph — defines the multi-step query pipeline as a typed state machine.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">LanceDB</td><td style="padding:13px 16px;color:#e0e0e0;">Local vector database. One namespace (table) per agent. Fully file-based, no server process needed.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">BAAI/bge-small-en-v1.5</td><td style="padding:13px 16px;color:#e0e0e0;">Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Redis</td><td style="padding:13px 16px;color:#e0e0e0;">Session memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">FastAPI + Uvicorn</td><td style="padding:13px 16px;color:#e0e0e0;">REST API gateway on port 8000. Full query/response model, session management, agent listing.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Gradio</td><td style="padding:13px 16px;color:#e0e0e0;">Web chat UI on port 7860. 
Session management, source citations, escalation chain display.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Prometheus</td><td style="padding:13px 16px;color:#e0e0e0;">Metrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.</td></tr><tr><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">SQLite (audit.db)</td><td style="padding:13px 16px;color:#e0e0e0;">Immutable audit trail. Every query logged before response is returned. Cannot be skipped.</td></tr></table></div><p>The full system sits comfortably in about 43 GB of the M5 Pro&rsquo;s 64 GB unified memory — leaving 21 GB of headroom for the rest of the machine.</p><hr><h2 id="the-one-model-ten-specialists-trick">The &ldquo;One Model, Ten Specialists&rdquo; Trick</h2><p>This is the part I get asked about most, because it sounds like it should not work.</p><p>Every agent in Aether is defined by a YAML manifest. That manifest specifies a <code>system_prompt</code>, a <code>namespace</code> (which LanceDB table to retrieve from), a <code>temperature</code>, and <code>max_tokens</code>. There is no model switching. There is no weight loading. 
The Gemma 4 26B A4B is loaded once by LM Studio and stays resident.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># agent_manifests/aws_technology_architect.yaml</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">agent_id</span><span class="p">:</span><span class="w"/><span class="l">aws_technology_architect</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">display_name</span><span class="p">:</span><span class="w"/><span class="s2">"AWS Technology Architect"</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">tier</span><span class="p">:</span><span class="w"/><span class="m">3</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">parent_agent</span><span class="p">:</span><span class="w"/><span class="l">cloud_domain_architect</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">namespace</span><span class="p">:</span><span class="w"/><span class="l">aws_tech</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">temperature</span><span class="p">:</span><span class="w"/><span class="m">0.1</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">max_tokens</span><span class="p">:</span><span class="w"/><span class="m">2048</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">system_prompt</span><span class="p">:</span><span class="w"/><span class="p">|</span><span class="sd"/></span></span><span class="line"><span class="cl"><span class="sd"> You are an AWS Technology Architect with deep expertise in the</span></span></span><span class="line"><span class="cl"><span class="sd"> AWS Well-Architected Framework, EC2/EKS/Lambda sizing, IAM 
policy</span></span></span><span class="line"><span class="cl"><span class="sd"> design, CloudFormation and CDK, GuardDuty, and cost optimisation.</span></span></span><span class="line"><span class="cl"><span class="sd"> You draw only from AWS-specific documentation and architecture</span></span></span><span class="line"><span class="cl"><span class="sd"> patterns. When answering, cite the source documents retrieved.</span></span></span><span class="line"><span class="cl"><span class="sd"> End your response with a new line: 'Confidence: X.XX' (0.00–1.00)</span></span></span><span class="line"><span class="cl"><span class="sd"> reflecting how well your knowledge base supports this answer.</span><span class="w"/></span></span></code></pre></div><p>Why does this work? Partly because Gemma 4 26B A4B is a Mixture of Experts model. It routes each token through a subset of expert sub-networks, activating only about 4–6 billion parameters per token despite having 26 billion in total. The practical result: it runs at near 5B speed while retaining the reasoning breadth of a much larger dense model. The personas themselves come from the system prompt and the retrieved context rather than from the routing, so a single loaded instance can adopt both a narrow Citrix specialist persona and a broad enterprise strategy persona, and the sparse activation keeps each of them fast.</p><p>The other half of the trick is the knowledge base. Each Tier 3 agent retrieves only from its own LanceDB namespace. The AWS agent never sees a Citrix document. The Intune agent never sees a GCP architecture guide. Domain knowledge is isolated by design, which makes cross-domain hallucination far less likely: the retrieval context never crosses those boundaries, even though the model&rsquo;s own pretrained knowledge still can.</p><hr><h2 id="confidence-driven-escalation">Confidence-Driven Escalation</h2><p>The escalation mechanism is the design decision I am most proud of.</p><p>Every agent is instructed to append a confidence score to its response — a float between 0.00 and 1.00 representing how well its retrieved knowledge supports the answer. 
The orchestrator reads that score via regex. If it falls below 0.7 <em>and</em> the agent has a parent tier defined <em>and</em> that parent has not already been tried, the system automatically escalates.</p><div style="background:#0d0d1a;border:1px solid #7c4dff;border-radius:10px;padding:28px 32px;margin:32px 0;font-family:'JetBrains Mono',monospace;font-size:17px;line-height:1.9;color:#e0e0e0;"><span style="color:#7c4dff;font-weight:700;">// ESCALATION CHAIN EXAMPLE</span><br><br><span style="color:#00e5ff;">Query:</span> "Give me a multi-cloud strategy covering Azure, AWS, GCP, AVD, and Citrix with network segmentation"<br><br><span style="color:#b39ddb;">→</span> Routed to: <span style="color:#b39ddb;">aws_technology_architect</span><br><span style="color:#b39ddb;">→</span> Confidence: <span style="color:#f44336;">0.41</span> (below 0.7 threshold)<br><span style="color:#b39ddb;">→</span> Escalate to: <span style="color:#b39ddb;">cloud_domain_architect</span><br><span style="color:#b39ddb;">→</span> Confidence: <span style="color:#ff9800;">0.63</span> (still below threshold)<br><span style="color:#b39ddb;">→</span> Escalate to: <span style="color:#b39ddb;">enterprise_architect</span><br><span style="color:#b39ddb;">→</span> Confidence: <span style="color:#4caf50;">0.88</span> ✓<br><br><span style="color:#888;">Response appended with escalation chain for full transparency.</span></div><p>The response shown to the user includes the full escalation path — which level of expertise produced the final answer. This matters in enterprise contexts. It is the difference between &ldquo;the AI said so&rdquo; and &ldquo;the enterprise-level advisor produced this after the technical specialist&rsquo;s knowledge was insufficient.&rdquo;</p><p>The elegance here is that <strong>the model participates in its own routing decision</strong>. 
The orchestration does not need a separate classifier to judge answer quality — the model tells you its own confidence, and the system acts on it.</p><hr><h2 id="the-7-step-query-pipeline">The 7-Step Query Pipeline</h2><p>Every query passes through a typed LangGraph state graph. The state object —<code>AetherState</code> — carries everything between nodes: query, session ID, agent manifest, RAG results, conversation history, messages, response, confidence score, escalation flags, and escalation chain list.</p><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:12px;padding:32px;margin:32px 0;overflow-x:auto;"><svg viewBox="0 0 1080 180" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:1080px;display:block;margin:0 auto;"><rect x="10" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="75" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">01</text><text x="75" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">ROUTE</text><line x1="140" y1="85" x2="163" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="165" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="230" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">02</text><text x="230" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">RETRIEVE</text><line x1="295" y1="85" x2="318" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="320" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="385" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">03</text><text x="385" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">HISTORY</text><line x1="450" 
y1="85" x2="473" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="475" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="540" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">04</text><text x="540" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">BUILD</text><line x1="605" y1="85" x2="628" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="630" y="40" width="130" height="90" rx="10" fill="#1a0d2a" stroke="#7c4dff" stroke-width="3"/><text x="695" y="83" fill="#7c4dff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">05</text><text x="695" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">LLM</text><line x1="760" y1="85" x2="783" y2="85" stroke="#7c4dff" stroke-width="2" marker-end="url(#arrow2)"/><rect x="785" y="40" width="130" height="90" rx="10" fill="#1a1a0d" stroke="#ff9800" stroke-width="2.5" stroke-dasharray="6,3"/><text x="850" y="83" fill="#ff9800" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">06</text><text x="850" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">ESCALATE?</text><line x1="915" y1="85" x2="938" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="940" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#4caf50" stroke-width="2.5"/><text x="1005" y="83" fill="#4caf50" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">07</text><text x="1005" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">FINALISE</text><defs><marker id="arrow" markerWidth="8" markerHeight="8" refX="6" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#00e5ff"/></marker><marker id="arrow2" markerWidth="8" markerHeight="8" refX="6" refY="4" orient="auto"><path 
d="M0,0 L8,4 L0,8 Z" fill="#7c4dff"/></marker></defs></svg></div><p><strong>01 — ROUTE.</strong> A keyword scanner maps the query to the best-fit agent, checking Tier 3 rules first (most specific), then Tier 2, then Tier 1 as catch-all. Redis caches route results for one hour, so repeated queries on the same topic skip the scan entirely.</p><p><strong>02 — RETRIEVE.</strong> Semantic search against that agent&rsquo;s LanceDB namespace — top-5 chunks returned. Documents were ingested in 500-word chunks with 50-word overlap, embedded into 384-dimensional vectors by <code>BAAI/bge-small-en-v1.5</code> running on the Apple Neural Engine.</p><p><strong>03 — HISTORY.</strong> The last three message pairs (six messages) are loaded from Redis for the session. This gives conversational continuity without letting the context window balloon.</p><p><strong>04 — BUILD.</strong> The message payload is assembled: <code>system_prompt + RAG documents + history + current query</code>. The confidence instruction is appended here.</p><p><strong>05 — LLM.</strong> The assembled payload hits LM Studio on port 1234. The orchestrator extracts the confidence score from the response text via regex before passing the response forward.</p><p><strong>06 — ESCALATE (conditional).</strong> If confidence is below 0.7, a parent agent exists, and it has not already been tried this turn — swap in the parent&rsquo;s manifest and loop back to step 02 with fresh retrieval against the parent&rsquo;s namespace.</p><p><strong>07 — FINALISE.</strong> The turn is saved to Redis. The SQLite audit record is written. The escalation chain annotation (if any) is appended to the response. 
The result is returned to the user.</p><hr><h2 id="knowledge-isolation--the-anti-hallucination-architecture">Knowledge Isolation — The Anti-Hallucination Architecture</h2><p><img src="/images/aether-knowledge-isolation.png" alt="Knowledge isolation diagram showing five Tier-3 agents (AWS, Azure, Intune, AVD, Citrix), each with its own walled namespace containing only that domain&rsquo;s documents, and the Tier-1 Enterprise Architect at the top connecting to all of them as the deliberate exception"/></p><p>One of the most practical decisions in Aether&rsquo;s design is namespace isolation. Each Tier 3 agent retrieves only from its own LanceDB table. The AWS agent&rsquo;s retrieval context will never include a Citrix StoreFront configuration guide — because those documents simply do not exist in its namespace.</p><p>This matters more than it might seem. A common failure mode in RAG systems is <em>cross-domain contamination</em> — where retrieval pulls in tangentially related content from a different domain, and the model confabulates a plausible-sounding but wrong answer by blending the two. Namespace isolation eliminates this at the architectural level.</p><p>The Enterprise Architect at Tier 1 is the deliberate exception — it can query all namespaces, because cross-domain synthesis is exactly what it is built for.</p><hr><h2 id="the-audit-trail--because-enterprise">The Audit Trail — Because Enterprise</h2><p>Every single query is written to <code>audit.db</code> (SQLite) before the response is returned. The record includes: timestamp, session ID, query text, agent used, confidence score, escalation chain, and sources cited.
The audit write is wrapped in error handling so that a database failure never blocks the main query flow — but the log is never optional.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Simplified from the finalise node</span>
</span></span><span class="line"><span class="cl"><span class="n">audit_record</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"timestamp"</span><span class="p">:</span> <span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">()</span><span class="o">.</span><span class="n">isoformat</span><span class="p">(),</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"session_id"</span><span class="p">:</span> <span class="n">state</span><span class="p">[</span><span class="s2">"session_id"</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"query"</span><span class="p">:</span> <span class="n">state</span><span class="p">[</span><span class="s2">"query"</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"agent_id"</span><span class="p">:</span> <span class="n">state</span><span class="p">[</span><span class="s2">"agent_id"</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"confidence"</span><span class="p">:</span> <span class="n">state</span><span class="p">[</span><span class="s2">"confidence"</span><span class="p">],</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"escalation_chain"</span><span class="p">:</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="s2">"escalation_chain"</span><span class="p">]),</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"response_length"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="s2">"final_response"</span><span class="p">])</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"INSERT INTO audit_log VALUES (:timestamp, :session_id, ...)"</span><span class="p">,</span> <span class="n">audit_record</span><span class="p">)</span>
</span></span></code></pre></div><p>For a system giving architecture recommendations — decisions that feed into multi-million-dollar cloud commitments — having an immutable audit trail of what was asked, what agent answered, at what confidence level, via what escalation path, is not a nice-to-have. It is the thing that makes it organisationally defensible.</p><hr><h2 id="what-it-can-actually-do">What It Can Actually Do</h2><div style="background:#0d0d1a;border-left:4px solid #00e5ff;padding:26px 30px;margin:28px 0;font-family:'JetBrains Mono',monospace;font-size:17px;line-height:2.0;color:#e0e0e0;border-radius:0 8px 8px 0;"><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Design an Azure landing zone for PCI-DSS compliance"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Right-size EC2 instances for a memory-intensive Java application"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Configure FSLogix profile containers for 500 AVD users"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Zero-trust network architecture for 5,000 remote employees"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Cloud adoption roadmap for a financial services firm"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Intune compliance policy for iOS BYOD — hybrid-joined, MFA enforced"</span></div><p>The depth per domain is real. The Intune agent knows OMA-URI paths, Graph API commands, Autopilot profiles, and licensing requirements — because those are the documents I ingested into its namespace. The AVD agent knows host pool design, FSLogix sizing, MSIX app attach, and session host scaling plans. The knowledge base is only as good as what you put into it — but that is also the point. This is <em>my</em> architecture knowledge, curated, searchable, and queryable at any hour.</p><hr><h2 id="what-i-learned-building-this">What I Learned Building This</h2><p><strong>Prompt engineering IS the architecture.</strong> In a system like this, the YAML manifest <em>is</em> the agent. The difference between a brilliant AWS specialist and a generic AI assistant is entirely in what the system prompt says and what documents back it up. Getting those prompts precise, domain-bounded, and calibrated for the right temperature took longer than any of the code.</p><p><strong>MoE models are underrated for local multi-agent work.</strong> The choice of Gemma 4 26B A4B over a dense model was the right call. You get reasoning breadth comparable to a much larger model at the inference cost of a small one, on hardware that most architects already carry.</p><p><strong>Confidence as a first-class citizen.</strong> Asking the model to self-assess and surface that score is one of the highest-value things I added.
It makes the system honest — and it drives the escalation logic that makes the team metaphor actually work.</p><p><strong>Namespace isolation is a practical hallucination brake.</strong> Not a theoretical one. In the first version of Aether, all documents lived in a single namespace. Cross-domain contamination was visible and annoying. Splitting into per-agent namespaces fixed it immediately.</p><p><strong>Audit trails are not overhead — they are the point.</strong> Every enterprise AI deployment should have one. Building it into the core pipeline from day one changes how you think about what the system is producing.</p><hr><h2 id="what-is-next">What Is Next</h2><p>Aether v2.6 is a working, daily-use system. The next version I am working toward adds a web-based ingestion UI (so loading new documents does not require touching the ingest script), structured output for architecture decision records (ADRs) in a consistent format, and inter-agent communication — where a Tier 3 agent can proactively pull context from a peer rather than waiting for the escalation chain to activate.</p><p>I built Aether because I wanted a team. It turns out a team was always available — it just needed the right prompts and a local model to bring it to life.</p><hr><p><em>Have questions about the stack or want to see specific parts of the implementation? Drop a comment or reach out on <a href="https://www.linkedin.com/in/ajay-walia-8b066a1b/">LinkedIn</a>.</em></p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ather2.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>I Built My Own Video Downloader — No Ads, No Watermarks, Three Platforms</title><link>https://curiousbit.netlify.app/i-built-my-own-video-downloader-no-ads-no-watermarks-three-platforms/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-my-own-video-downloader-no-ads-no-watermarks-three-platforms/</guid><pubDate>Mon, 11 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Every video download site I tried felt like navigating a minefield — five ad clicks to reach a button that triggers another redirect. So I stopped using them and built a clean local tool that handles TikTok, Instagram, and X in one paste.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/video-downloader-banner.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Every video download site I tried felt like navigating a minefield — five ad clicks to reach a button that triggers another redirect. So I stopped using them and built a clean local tool that handles TikTok, Instagram, and X in one paste.</p><p><strong>No watermarks. No ads. No accounts. Just a URL.</strong></p><p><em>~6 min read · Node.js + React · conceptual deep-dive</em></p><style>
@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap');
.vd-article {
--vd-surface: #111827;
--vd-surface2: #1a2235;
--vd-border: #1f2d45;
--vd-text: #e2e8f0;
--vd-muted: #8b9ab3;
--vd-green: #00c853;
--vd-green-glow: rgba(0,200,83,0.15);
--vd-cyan: #22d3ee;
--vd-cyan-glow: rgba(34,211,238,0.12);
--vd-orange: #f59e0b;
color: var(--vd-text);
font-size: 1.05rem;
line-height: 1.85;
margin: 2.5rem 0;
}
.vd-article * { box-sizing: border-box; }
.vd-article h2 {
font-family: 'Space Grotesk', sans-serif;
font-size: clamp(1.5rem, 1.2rem + 1vw, 2.1rem);
font-weight: 700;
letter-spacing: -0.02em;
color: var(--vd-text);
margin: 2.6rem 0 0.9rem;
padding-bottom: 0.5rem;
border-bottom: 1px solid var(--vd-border);
}
.vd-article h3 {
font-family: 'Space Grotesk', sans-serif;
font-size: clamp(1.15rem, 1.05rem + 0.3vw, 1.45rem);
font-weight: 700;
color: var(--vd-green);
margin: 2rem 0 0.6rem;
}
.vd-article p { margin: 0 0 1.2rem; color: var(--vd-text); }
.vd-article strong { color: #fff; font-weight: 600; }
.vd-article a { color: var(--vd-green); text-decoration: underline; text-underline-offset: 3px; }
.vd-article code {
background: var(--vd-surface2);
padding: 0.1rem 0.4rem;
border-radius: 4px;
font-family: 'JetBrains Mono', monospace;
font-size: 0.88rem;
}
.vd-callout {
background: var(--vd-surface2);
border-left: 3px solid var(--vd-green);
border-radius: 0 10px 10px 0;
padding: 1.1rem 1.4rem;
margin: 1.6rem 0;
font-size: 1rem;
line-height: 1.7;
}
.vd-callout.warning { border-color: var(--vd-orange); background: rgba(245,158,11,0.06); }
.vd-callout.info { border-color: var(--vd-cyan); background: rgba(34,211,238,0.06); }
.vd-steps { display: flex; flex-direction: column; gap: 1rem; margin: 1.5rem 0 2rem; }
.vd-step-card {
background: var(--vd-surface);
border: 1px solid var(--vd-border);
border-radius: 12px;
padding: 1.2rem 1.4rem;
display: flex;
gap: 1.1rem;
align-items: flex-start;
}
.vd-step-num {
font-family: 'Space Grotesk', sans-serif;
font-size: 1.3rem; font-weight: 800;
color: var(--vd-green);
min-width: 2rem;
line-height: 1.3;
}
.vd-step-body h4 {
font-family: 'Space Grotesk', sans-serif;
font-weight: 700; font-size: 1rem;
margin: 0 0 0.3rem;
color: #fff;
}
.vd-step-body p { margin: 0; font-size: 0.92rem; color: var(--vd-muted); line-height: 1.6; }
.vd-platform-row { display: flex; gap: 0.75rem; flex-wrap: wrap; margin: 1.2rem 0 1.8rem; }
.vd-platform-badge {
display: flex; align-items: center; gap: 0.5rem;
background: var(--vd-surface);
border: 1px solid var(--vd-border);
border-radius: 10px;
padding: 0.6rem 1rem;
font-size: 0.88rem; font-weight: 600;
color: var(--vd-text);
}
.vd-platform-badge .vd-dot { width: 8px; height: 8px; border-radius: 50%; }
.vd-dot-tiktok { background: #ff0050; }
.vd-dot-insta { background: #e1306c; }
.vd-dot-x { background: #1d9bf0; }
.vd-post-img {
width: 100%; border-radius: 12px;
border: 1px solid var(--vd-border);
margin: 1.5rem 0 0.4rem;
display: block;
box-shadow: 0 4px 24px rgba(0,0,0,0.4);
}
.vd-img-caption {
text-align: center;
font-size: 0.8rem;
color: var(--vd-muted);
margin: 0 0 2rem;
font-style: italic;
}
.vd-flow {
background: var(--vd-surface);
border: 1px solid var(--vd-border);
border-radius: 14px;
padding: 1.8rem 1.4rem;
margin: 1.8rem 0;
display: flex; flex-direction: column; align-items: center;
}
.vd-flow-node {
background: var(--vd-surface2);
border: 1px solid var(--vd-border);
border-radius: 10px;
padding: 0.65rem 1.4rem;
font-size: 0.9rem;
font-weight: 600;
color: var(--vd-text);
text-align: center;
width: 100%;
max-width: 420px;
}
.vd-flow-node.green { border-color: rgba(0,200,83,0.4); background: var(--vd-green-glow); color: var(--vd-green); }
.vd-flow-node.orange { border-color: rgba(245,158,11,0.4); background: rgba(245,158,11,0.07); color: var(--vd-orange); }
.vd-flow-node.cyan { border-color: rgba(34,211,238,0.4); background: var(--vd-cyan-glow); color: var(--vd-cyan); }
.vd-flow-arrow { color: var(--vd-muted); font-size: 1.1rem; line-height: 1; padding: 0.35rem 0; }
.vd-results { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.5rem 0 2rem; }
@media (max-width: 600px) { .vd-results { grid-template-columns: 1fr; } }
.vd-result-card {
background: var(--vd-surface);
border: 1px solid var(--vd-border);
border-radius: 12px;
padding: 1.2rem;
}
.vd-result-card .vd-rc-icon { font-size: 1.6rem; margin-bottom: 0.5rem; }
.vd-result-card h4 {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.95rem; font-weight: 700;
color: #fff; margin: 0 0 0.3rem;
}
.vd-result-card p { font-size: 0.86rem; color: var(--vd-muted); margin: 0; line-height: 1.55; }
.vd-cta {
background: var(--vd-surface);
border: 1px solid rgba(0,200,83,0.25);
border-radius: 14px;
padding: 1.6rem 1.8rem;
margin: 2.5rem 0 0;
}
.vd-cta p { font-size: 0.95rem; color: var(--vd-text); margin: 0; line-height: 1.7; }
.vd-cta strong { color: var(--vd-green); }
.vd-divider { border: none; border-top: 1px solid var(--vd-border); margin: 2.2rem 0; }</style><div class="vd-article"><h2>The Problem With Every Downloader Site</h2><p>You find a TikTok video you want to keep. Maybe it's a tutorial, a recipe, a clip you want to share somewhere offline. You Google "download TikTok without watermark" and click the first result. What follows is a ritual:</p><div class="vd-callout warning">
Pop-up #1 appears. You close it. A second tab opens. You close that. You find the actual download button — but it's fake and triggers another ad. The real button is somewhere underneath a consent banner. You finally click download. It starts… and gives you a watermarked file anyway.</div><p>This happens on nearly every popular downloader site. They're monetised almost entirely through advertising, and the UX is designed to maximise your exposure to that advertising — not to help you download a video. The actual download logic underneath all that noise is usually a single API call.</p><p>So I asked myself: how hard would it actually be to build a clean version of this for personal use?</p><h2>What I Actually Wanted</h2><p>Three things, nothing more:</p><div class="vd-steps"><div class="vd-step-card"><div class="vd-step-num">1</div><div class="vd-step-body"><h4>No watermark, always</h4><p>The tool should try its hardest to get a watermark-free file. If it can't, it tells you — it doesn't quietly give you the watermarked version pretending it's clean.</p></div></div><div class="vd-step-card"><div class="vd-step-num">2</div><div class="vd-step-body"><h4>Support the platforms I actually use</h4><p>TikTok, Instagram, and X (Twitter). Paste any URL from any of these three and it should just work.</p></div></div><div class="vd-step-card"><div class="vd-step-num">3</div><div class="vd-step-body"><h4>Zero friction interface</h4><p>One input, one button. No accounts, no CAPTCHAs, no ads. It runs locally so there's nothing to sign up for.</p></div></div></div><h2>How It Works — The Conceptual Picture</h2><p>The tool is split into two halves: a lightweight React frontend and a Node.js backend. The frontend is just the URL input box. 
All the interesting logic lives in the backend.</p><p>Here are the five things that happen the moment you paste a URL and hit Download:</p><h3>Step 1 — Figure Out Which Platform You're On</h3><p>The very first thing the backend does is inspect the URL and decide: is this TikTok, Instagram, or X? Each platform has its own URL patterns — including short links, mobile URLs, and regional variants — and the tool checks all of them.</p><div class="vd-platform-row"><div class="vd-platform-badge"><span class="vd-dot vd-dot-tiktok"></span> tiktok.com · vm.tiktok.com · vt.tiktok.com</div><div class="vd-platform-badge"><span class="vd-dot vd-dot-insta"></span> instagram.com · www.instagram.com</div><div class="vd-platform-badge"><span class="vd-dot vd-dot-x"></span> x.com · twitter.com · mobile.twitter.com</div></div><p>If the URL doesn't match any of these, the backend rejects it immediately with a clear error before wasting any time trying to fetch something it can't handle. No silent failures.</p><img src="/images/video-downloader-arch.jpg" alt="Anime engineer pointing at a holographic flowchart of the download architecture" class="vd-post-img"><p class="vd-img-caption">The backend runs through a short chain: detect → resolve → try providers → cache → serve</p><h3>Step 2 — Resolve Short Links</h3><p>TikTok in particular loves to generate short share links like <code>vm.tiktok.com/AbcXyz</code>. These redirect to the full video URL, but the download providers need the real URL to work with. So the backend follows up to five redirects to resolve the final destination before doing anything else.</p><h3>Step 3 — The Provider Chain</h3><p>This is the core of the tool.
The backend doesn't rely on a single source for the download link — it tries multiple providers in order, and only moves to the next one if the previous one failed.</p><div class="vd-flow"><div class="vd-flow-node cyan">Validated &amp; resolved URL</div><div class="vd-flow-arrow">↓</div><div class="vd-flow-node"><strong>Provider 1:</strong> TikWM API — fast, HD, usually no-watermark</div><div class="vd-flow-arrow">↓ if failed</div><div class="vd-flow-node"><strong>Provider 2:</strong> yt-dlp — catches what TikWM misses, supports all 3 platforms</div><div class="vd-flow-arrow">↓ if both fail on no-watermark</div><div class="vd-flow-node orange">Last resort: watermarked fallback (with a clear warning shown)</div><div class="vd-flow-arrow">↓</div><div class="vd-flow-node green">✓ Download link returned to frontend</div></div><p>The first provider, TikWM, is a public API that's fast and usually returns an HD, watermark-free file. But it occasionally struggles with newer videos or private content. That's when <strong>yt-dlp</strong> steps in — a powerful open-source tool that knows how to extract media from hundreds of platforms, and is updated constantly as platforms change their serving behaviour.</p><div class="vd-callout info"><strong>Why yt-dlp as a fallback and not primary?</strong> TikWM is faster and returns a clean pre-parsed result. yt-dlp is more capable but adds latency since it runs as a local process and parses raw platform data. Using TikWM first keeps the happy path quick.</div><h3>Step 4 — The Proxy Download</h3><p>Here's a detail that actually matters for reliability: the tool doesn't give your browser a direct link to TikTok's CDN or Instagram's servers. Instead, it registers a short-lived <strong>secure token</strong> that points back to the Node backend. When you click download, your browser hits the backend's own <code>/api/file</code> endpoint, which streams the video directly to you.</p><p>Why does this matter?
Direct CDN links from social platforms often include authentication tokens or short expiry times. The platforms also sometimes block downloads when a link is opened directly from a browser outside the platform. Running the stream through the backend sidesteps both issues — and means the download starts with a clean filename instead of a jumble of CDN parameters.</p><h3>Step 5 — Caching</h3><p>Once a video URL has been resolved and a download link extracted, the result is cached in memory for 30 minutes. If you (or someone else on the same local instance) paste the same video URL again within that window, the backend returns the cached result instantly — no API calls, no yt-dlp process, just the stored answer.</p><h2>The Result</h2><p>What this adds up to in practice:</p><div class="vd-results"><div class="vd-result-card"><div class="vd-rc-icon">🎯</div><h4>No-watermark first, always</h4><p>The tool tries every avenue for a clean file before falling back. The fallback is clearly labelled.</p></div><div class="vd-result-card"><div class="vd-rc-icon">⚡</div><h4>Fast on repeat URLs</h4><p>Same video twice within 30 minutes? Instant response from the in-memory cache.</p></div><div class="vd-result-card"><div class="vd-rc-icon">🎵</div><h4>Audio extraction too</h4><p>The tool also surfaces the audio-only track when available — useful for saving music from TikTok.</p></div><div class="vd-result-card"><div class="vd-rc-icon">🔒</div><h4>No external accounts</h4><p>Runs entirely locally. Nothing to log into, nothing phoning home, no API keys required.</p></div></div><img src="/images/video-downloader-result.jpg" alt="Anime developer smiling at a clean minimal download interface" class="vd-post-img"><p class="vd-img-caption">One URL input, three download buttons, zero pop-ups.
That's the whole interface.</p><h2>What's Next</h2><p>The tool is currently running locally but I'm planning to deploy it on <strong>curiousbit.netlify.app</strong> so anyone who wants a clean download experience can use it without having to run Node themselves. The architecture is already production-ready — it's just a matter of pointing it at a hosting environment and wiring up the environment variables.</p><p>A few things I'd like to add before making it fully public: rate limiting per IP (to avoid abuse), a simple download history in the UI, and potentially Instagram Stories support, which currently needs a different extraction path.</p><div class="vd-callout">
The code is structured as a monorepo — backend and frontend live together, share URL validation logic, and build to a single deployable package. If you want to run it yourself locally, it's a single <code>npm install &amp;&amp; npm run dev</code> away.</div><hr class="vd-divider"><div class="vd-cta"><p><strong>Over to you:</strong> Have you ever got fed up enough with a broken web experience that you built your own alternative? I'd love to hear what you made — or whether you'd actually use a clean, ad-free downloader like this if it were publicly hosted. Drop your thoughts below or find me on LinkedIn.</p></div></div>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/video-downloader-banner.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>automation</category><category>build-log</category><category>nodejs</category><category>Knowledge Base</category></item><item><title>Attention Is All You Need — The Paper That Rewired AI</title><link>https://curiousbit.netlify.app/attention-is-all-you-need-the-paper-that-rewired-ai/</link><guid isPermaLink="true">https://curiousbit.netlify.app/attention-is-all-you-need-the-paper-that-rewired-ai/</guid><pubDate>Sun, 10 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Bangers&amp;family=Space+Grotesk:wght@400;500;600;700&amp;family=Inter:wght@400;500&amp;family=JetBrains+Mono:wght@400;500&amp;display=swap');
.attn-article {
--bg: #080b14;
--surface: #111827;
--surface2: #1a2235;
--border: #1f2d45;
--text: #e2e8f0;
--muted: #8b9ab3;
--purple: #8b5cf6;
--purple-glow: rgba(139,92,246,0.2);
--cyan: #22d3ee;
--cyan-glow: rgba(34,211,238,0.15);
--gold: #f59e0b;
--gold-glow: rgba(245,158,11,0.2);
--red: #ef4444;
--green: #10b981;
--manga-bg: #0e1520;
--manga-border: #f59e0b;
}
/* ── Particles background ── */
/* ── Site header ── */
/* ── Main wrapper ── */
/* ── Hero ── */
.hero {
text-align: center;
padding: 5rem 0 3rem;
}
.hero-tag {
display: inline-block;
background: var(--purple-glow);
border: 1px solid var(--purple);
color: var(--purple);
font-family: 'Space Grotesk', sans-serif;
font-size: 0.75rem;
font-weight: 700;
letter-spacing: 3px;
text-transform: uppercase;
padding: 0.35rem 1rem;
border-radius: 100px;
margin-bottom: 1.5rem;
}
.hero h1 {
font-family: 'Bangers', cursive;
font-size: clamp(2.8rem, 8vw, 5.5rem);
line-height: 1.05;
letter-spacing: 3px;
background: linear-gradient(135deg, #fff 0%, var(--cyan) 40%, var(--purple) 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
margin-bottom: 1.2rem;
}
.hero-sub {
font-size: clamp(1.15rem, 1rem + 0.5vw, 1.45rem);
color: var(--muted);
max-width: 980px;
margin: 0 auto 2rem;
line-height: 1.7;
}
.hero-meta {
display: flex; align-items: center; justify-content: center; gap: 1.5rem;
flex-wrap: wrap;
font-size: 0.82rem;
color: var(--muted);
font-family: 'Space Grotesk', sans-serif;
}
.hero-meta span { display: flex; align-items: center; gap: 0.4rem; }
/* ── Manga Panel ── */
.manga-panel {
margin: 2.5rem 0;
background: var(--manga-bg);
border: 2px solid var(--manga-border);
border-radius: 4px;
padding: 1.5rem;
position: relative;
box-shadow: 0 0 30px rgba(245,158,11,0.1), inset 0 0 60px rgba(0,0,0,0.4);
}
.manga-panel::before {
content: '🎨 GROK IMAGINE';
position: absolute;
top: -1px; left: -1px;
background: var(--manga-border);
color: #000;
font-family: 'Bangers', cursive;
font-size: 0.8rem;
letter-spacing: 2px;
padding: 0.2rem 0.8rem;
border-radius: 2px 0 4px 0;
}
.manga-panel .panel-title {
font-family: 'Bangers', cursive;
font-size: clamp(1.35rem, 1rem + 1vw, 2rem);
letter-spacing: 2px;
color: var(--gold);
margin: 1rem 0 0.75rem;
}
.manga-panel img {
width: 100%;
border-radius: 6px;
display: block;
margin-top: 0.5rem;
box-shadow: 0 4px 24px rgba(0,0,0,0.5);
}
section,
article {
font-size: clamp(1.15rem, 1.05rem + 0.3vw, 1.4rem);
line-height: 1.85;
}
/* ── Chapter headings ── */
.chapter-label {
font-family: 'Bangers', cursive;
font-size: 0.85rem;
letter-spacing: 4px;
color: var(--cyan);
margin-bottom: 0.5rem;
display: block;
}
h2 {
font-family: 'Space Grotesk', sans-serif;
font-size: clamp(1.6rem, 4vw, 2.2rem);
font-weight: 700;
margin-bottom: 1.2rem;
line-height: 1.2;
color: #fff;
}
h2 .highlight { color: var(--cyan); }
h3 {
font-family: 'Space Grotesk', sans-serif;
font-size: 1.2rem;
font-weight: 600;
color: var(--purple);
margin: 2rem 0 0.75rem;
}
/* ── Body text ── */
p { margin-bottom: 1.2rem; color: var(--text); }
strong { color: #fff; }
em { color: var(--cyan); font-style: italic; }
/* ── Section divider ── */
.section {
margin-top: 4.5rem;
padding-top: 1rem;
border-top: 1px solid var(--border);
}
/* ── Callout boxes ── */
.callout {
margin: 2rem 0;
padding: 1.25rem 1.5rem;
border-radius: 6px;
border-left: 4px solid;
}
.callout.purple {
background: var(--purple-glow);
border-color: var(--purple);
}
.callout.cyan {
background: var(--cyan-glow);
border-color: var(--cyan);
}
.callout.gold {
background: var(--gold-glow);
border-color: var(--gold);
}
.callout-title {
font-family: 'Space Grotesk', sans-serif;
font-weight: 700;
font-size: 0.85rem;
letter-spacing: 1px;
text-transform: uppercase;
margin-bottom: 0.5rem;
}
.callout.purple .callout-title { color: var(--purple); }
.callout.cyan .callout-title { color: var(--cyan); }
.callout.gold .callout-title { color: var(--gold); }
.callout p { margin: 0; font-size: 0.95rem; }
/* ── Inline diagram: RNN chain ── */
.diagram-box {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.5rem;
margin: 2rem 0;
overflow-x: auto;
}
.diagram-title {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.78rem;
font-weight: 600;
letter-spacing: 2px;
text-transform: uppercase;
color: var(--muted);
margin-bottom: 1rem;
text-align: center;
}
/* RNN Chain */
.rnn-chain {
display: flex;
align-items: center;
gap: 0;
min-width: 600px;
justify-content: center;
}
.rnn-node {
width: 52px; height: 52px;
border-radius: 50%;
border: 2px solid #ef4444;
background: rgba(239,68,68,0.1);
display: flex; align-items: center; justify-content: center;
font-family: 'Space Grotesk', sans-serif;
font-size: 0.7rem;
font-weight: 600;
color: #ef4444;
flex-shrink: 0;
position: relative;
}
.rnn-arrow {
width: 32px; height: 2px;
background: linear-gradient(90deg, #ef4444, rgba(239,68,68,0.3));
position: relative;
flex-shrink: 0;
}
.rnn-arrow::after {
content: '';
position: absolute;
right: 0; top: -4px;
border-left: 8px solid rgba(239,68,68,0.5);
border-top: 5px solid transparent;
border-bottom: 5px solid transparent;
}
.rnn-fade { opacity: 0.35; }
.rnn-labels {
display: flex;
justify-content: space-between;
font-size: 0.7rem;
color: var(--muted);
margin-top: 0.6rem;
font-family: 'JetBrains Mono', monospace;
min-width: 600px;
}
/* Attention Matrix */
.attn-matrix {
display: grid;
grid-template-columns: auto repeat(6, 1fr);
gap: 3px;
font-size: 0.7rem;
font-family: 'JetBrains Mono', monospace;
}
.attn-label {
display: flex; align-items: center; justify-content: flex-end;
padding-right: 8px;
color: var(--muted);
font-size: 0.68rem;
}
.attn-col-labels {
display: grid;
grid-template-columns: auto repeat(6, 1fr);
gap: 3px;
margin-bottom: 3px;
}
.attn-col-label {
text-align: center;
color: var(--muted);
font-size: 0.65rem;
font-family: 'JetBrains Mono', monospace;
}
.attn-cell {
height: 36px;
border-radius: 3px;
display: flex; align-items: center; justify-content: center;
font-size: 0.65rem;
font-weight: 600;
transition: transform 0.2s;
cursor: default;
}
.attn-cell:hover { transform: scale(1.08); }
/* Timeline */
.timeline {
position: relative;
padding-left: 2rem;
margin: 2rem 0;
}
.timeline::before {
content: '';
position: absolute;
left: 7px; top: 0; bottom: 0;
width: 2px;
background: linear-gradient(180deg, var(--purple), var(--cyan), var(--gold));
}
.timeline-item {
position: relative;
margin-bottom: 2rem;
padding-left: 1.5rem;
}
.timeline-dot {
position: absolute;
left: -1.85rem; top: 0.35rem;
width: 16px; height: 16px;
border-radius: 50%;
border: 2px solid;
background: var(--bg);
}
.timeline-dot.purple { border-color: var(--purple); box-shadow: 0 0 10px var(--purple); }
.timeline-dot.cyan { border-color: var(--cyan); box-shadow: 0 0 10px var(--cyan); }
.timeline-dot.gold { border-color: var(--gold); box-shadow: 0 0 10px var(--gold); }
.timeline-dot.green { border-color: var(--green); box-shadow: 0 0 10px var(--green); }
.timeline-year {
font-family: 'Bangers', cursive;
font-size: 1.2rem;
letter-spacing: 2px;
color: var(--gold);
margin-bottom: 0.25rem;
}
.timeline-milestone {
font-family: 'Space Grotesk', sans-serif;
font-weight: 700;
font-size: 1rem;
color: #fff;
margin-bottom: 0.25rem;
}
.timeline-desc {
font-size: 0.88rem;
color: var(--muted);
line-height: 1.5;
}
.timeline-badge {
display: inline-block;
font-size: 0.65rem;
font-weight: 700;
font-family: 'Space Grotesk', sans-serif;
padding: 0.15rem 0.5rem;
border-radius: 100px;
margin-left: 0.5rem;
vertical-align: middle;
}
.badge-origin { background: rgba(139,92,246,0.25); color: var(--purple); border: 1px solid var(--purple); }
.badge-encoder { background: rgba(34,211,238,0.2); color: var(--cyan); border: 1px solid var(--cyan); }
.badge-decoder { background: rgba(245,158,11,0.2); color: var(--gold); border: 1px solid var(--gold); }
.badge-scale { background: rgba(16,185,129,0.2); color: var(--green); border: 1px solid var(--green); }
/* Citation / quote block */
blockquote {
margin: 2rem 0;
padding: 1.5rem 1.5rem 1.5rem 2rem;
background: var(--surface2);
border-left: 4px solid var(--purple);
border-radius: 0 6px 6px 0;
font-style: italic;
color: var(--muted);
position: relative;
}
blockquote::before {
content: '"';
font-family: 'Bangers', cursive;
font-size: 5rem;
color: var(--purple);
opacity: 0.3;
position: absolute;
top: -1rem; left: 0.5rem;
line-height: 1;
}
blockquote cite {
display: block;
margin-top: 0.75rem;
font-size: 0.82rem;
font-style: normal;
color: var(--purple);
font-family: 'Space Grotesk', sans-serif;
font-weight: 600;
}
/* Code-like inline */
code {
font-family: 'JetBrains Mono', monospace;
font-size: 0.85em;
background: rgba(139,92,246,0.15);
color: var(--purple);
padding: 0.15em 0.4em;
border-radius: 3px;
}
/* Problem list */
.problem-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
gap: 1rem;
margin: 1.5rem 0;
}
.problem-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.25rem;
transition: border-color 0.2s;
}
.problem-card:hover { border-color: var(--red); }
.problem-card .icon { font-size: 1.8rem; margin-bottom: 0.6rem; }
.problem-card h4 {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.95rem;
font-weight: 700;
color: var(--red);
margin-bottom: 0.4rem;
}
.problem-card p { font-size: 0.88rem; color: var(--muted); margin: 0; }
/* Breakthrough cards */
.breakthrough-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
gap: 1rem;
margin: 1.5rem 0;
}
.breakthrough-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.25rem;
transition: border-color 0.2s, box-shadow 0.2s;
}
.breakthrough-card:hover { border-color: var(--green); box-shadow: 0 0 20px rgba(16,185,129,0.1); }
.breakthrough-card .icon { font-size: 1.8rem; margin-bottom: 0.6rem; }
.breakthrough-card h4 {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.95rem;
font-weight: 700;
color: var(--green);
margin-bottom: 0.4rem;
}
.breakthrough-card p { font-size: 0.88rem; color: var(--muted); margin: 0; }
/* Attention heads visual */
.heads-grid {
display: grid;
grid-template-columns: repeat(4, 1fr);
gap: 0.75rem;
margin: 1.5rem 0;
}
.head-card {
padding: 1rem;
border-radius: 6px;
text-align: center;
border: 1px solid;
}
.head-card .head-num {
font-family: 'Bangers', cursive;
font-size: 1.6rem;
letter-spacing: 1px;
}
.head-card .head-desc {
font-size: 0.72rem;
font-family: 'Space Grotesk', sans-serif;
margin-top: 0.25rem;
}
.head-1 { border-color: #8b5cf6; background: rgba(139,92,246,0.1); }
.head-1 .head-num { color: #8b5cf6; }
.head-1 .head-desc { color: #8b5cf6; }
.head-2 { border-color: #22d3ee; background: rgba(34,211,238,0.1); }
.head-2 .head-num { color: #22d3ee; }
.head-2 .head-desc { color: #22d3ee; }
.head-3 { border-color: #f59e0b; background: rgba(245,158,11,0.1); }
.head-3 .head-num { color: #f59e0b; }
.head-3 .head-desc { color: #f59e0b; }
.head-4 { border-color: #10b981; background: rgba(16,185,129,0.1); }
.head-4 .head-num { color: #10b981; }
.head-4 .head-desc { color: #10b981; }
/* Big stat */
.big-stat {
text-align: center;
padding: 2.5rem 1rem;
background: var(--surface);
border-radius: 12px;
margin: 2rem 0;
}
.big-stat .number {
font-family: 'Bangers', cursive;
font-size: 4rem;
letter-spacing: 3px;
background: linear-gradient(135deg, var(--gold), var(--purple));
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
.big-stat .label {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.9rem;
color: var(--muted);
margin-top: 0.5rem;
}
/* Footer */
/* Responsive */
@media (max-width: 600px) {
main { width: 100%; }
.site-nav-inner { gap: 1rem; }
.site-brand span:last-child { display: none; }
.site-nav .nav-links { gap: 0.9rem; overflow-x: auto; }
.heads-grid { grid-template-columns: repeat(2, 1fr); }
.hero h1 { font-size: 2.5rem; }
}
/* Scroll fade-in */
.fade-in {
opacity: 0;
transform: translateY(24px);
transition: opacity 0.6s ease, transform 0.6s ease;
}
.fade-in.visible {
opacity: 1;
transform: translateY(0);
}
/* Inlined into Hugo: ensure content is visible even without the IntersectionObserver script */
.attn-article .fade-in { opacity: 1; transform: none; }
&lt;/style&gt;
&lt;div class="attn-article"&gt;
&lt;!-- ── HERO ── --&gt;
&lt;section class="hero fade-in"&gt;
&lt;div class="hero-tag"&gt;Deep Dive · Artificial Intelligence · 2017&lt;/div&gt;
&lt;p class="hero-sub"&gt;
The five-word title that ended one era of AI and launched another.
A beginner-friendly, technically honest tour through the paper that birthed every LLM you've ever heard of.
&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/panel-01.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Bangers&family=Space+Grotesk:wght@400;500;600;700&family=Inter:wght@400;500&family=JetBrains+Mono:wght@400;500&display=swap');
.attn-article {
--bg: #080b14;
--surface: #111827;
--surface2: #1a2235;
--border: #1f2d45;
--text: #e2e8f0;
--muted: #8b9ab3;
--purple: #8b5cf6;
--purple-glow: rgba(139,92,246,0.2);
--cyan: #22d3ee;
--cyan-glow: rgba(34,211,238,0.15);
--gold: #f59e0b;
--gold-glow: rgba(245,158,11,0.2);
--red: #ef4444;
--green: #10b981;
--manga-bg: #0e1520;
--manga-border: #f59e0b;
}
/* ── Particles background ── */
/* ── Site header ── */
/* ── Main wrapper ── */
/* ── Hero ── */
.hero {
text-align: center;
padding: 5rem 0 3rem;
}
.hero-tag {
display: inline-block;
background: var(--purple-glow);
border: 1px solid var(--purple);
color: var(--purple);
font-family: 'Space Grotesk', sans-serif;
font-size: 0.75rem;
font-weight: 700;
letter-spacing: 3px;
text-transform: uppercase;
padding: 0.35rem 1rem;
border-radius: 100px;
margin-bottom: 1.5rem;
}
.hero h1 {
font-family: 'Bangers', cursive;
font-size: clamp(2.8rem, 8vw, 5.5rem);
line-height: 1.05;
letter-spacing: 3px;
background: linear-gradient(135deg, #fff 0%, var(--cyan) 40%, var(--purple) 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
margin-bottom: 1.2rem;
}
.hero-sub {
font-size: clamp(1.15rem, 1rem + 0.5vw, 1.45rem);
color: var(--muted);
max-width: 980px;
margin: 0 auto 2rem;
line-height: 1.7;
}
.hero-meta {
display: flex; align-items: center; justify-content: center; gap: 1.5rem;
flex-wrap: wrap;
font-size: 0.82rem;
color: var(--muted);
font-family: 'Space Grotesk', sans-serif;
}
.hero-meta span { display: flex; align-items: center; gap: 0.4rem; }
/* ── Manga Panel ── */
.manga-panel {
margin: 2.5rem 0;
background: var(--manga-bg);
border: 2px solid var(--manga-border);
border-radius: 4px;
padding: 1.5rem;
position: relative;
box-shadow: 0 0 30px rgba(245,158,11,0.1), inset 0 0 60px rgba(0,0,0,0.4);
}
.manga-panel::before {
content: '🎨 GROK IMAGINE';
position: absolute;
top: -1px; left: -1px;
background: var(--manga-border);
color: #000;
font-family: 'Bangers', cursive;
font-size: 0.8rem;
letter-spacing: 2px;
padding: 0.2rem 0.8rem;
border-radius: 2px 0 4px 0;
}
.manga-panel .panel-title {
font-family: 'Bangers', cursive;
font-size: clamp(1.35rem, 1rem + 1vw, 2rem);
letter-spacing: 2px;
color: var(--gold);
margin: 1rem 0 0.75rem;
}
.manga-panel img {
width: 100%;
border-radius: 6px;
display: block;
margin-top: 0.5rem;
box-shadow: 0 4px 24px rgba(0,0,0,0.5);
}
section,
article {
font-size: clamp(1.15rem, 1.05rem + 0.3vw, 1.4rem);
line-height: 1.85;
}
/* ── Chapter headings ── */
.chapter-label {
font-family: 'Bangers', cursive;
font-size: 0.85rem;
letter-spacing: 4px;
color: var(--cyan);
margin-bottom: 0.5rem;
display: block;
}
h2 {
font-family: 'Space Grotesk', sans-serif;
font-size: clamp(1.6rem, 4vw, 2.2rem);
font-weight: 700;
margin-bottom: 1.2rem;
line-height: 1.2;
color: #fff;
}
h2 .highlight { color: var(--cyan); }
h3 {
font-family: 'Space Grotesk', sans-serif;
font-size: 1.2rem;
font-weight: 600;
color: var(--purple);
margin: 2rem 0 0.75rem;
}
/* ── Body text ── */
p { margin-bottom: 1.2rem; color: var(--text); }
strong { color: #fff; }
em { color: var(--cyan); font-style: italic; }
/* ── Section divider ── */
.section {
margin-top: 4.5rem;
padding-top: 1rem;
border-top: 1px solid var(--border);
}
/* ── Callout boxes ── */
.callout {
margin: 2rem 0;
padding: 1.25rem 1.5rem;
border-radius: 6px;
border-left: 4px solid;
}
.callout.purple {
background: var(--purple-glow);
border-color: var(--purple);
}
.callout.cyan {
background: var(--cyan-glow);
border-color: var(--cyan);
}
.callout.gold {
background: var(--gold-glow);
border-color: var(--gold);
}
.callout-title {
font-family: 'Space Grotesk', sans-serif;
font-weight: 700;
font-size: 0.85rem;
letter-spacing: 1px;
text-transform: uppercase;
margin-bottom: 0.5rem;
}
.callout.purple .callout-title { color: var(--purple); }
.callout.cyan .callout-title { color: var(--cyan); }
.callout.gold .callout-title { color: var(--gold); }
.callout p { margin: 0; font-size: 0.95rem; }
/* ── Inline diagram: RNN chain ── */
.diagram-box {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.5rem;
margin: 2rem 0;
overflow-x: auto;
}
.diagram-title {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.78rem;
font-weight: 600;
letter-spacing: 2px;
text-transform: uppercase;
color: var(--muted);
margin-bottom: 1rem;
text-align: center;
}
/* RNN Chain */
.rnn-chain {
display: flex;
align-items: center;
gap: 0;
min-width: 600px;
justify-content: center;
}
.rnn-node {
width: 52px; height: 52px;
border-radius: 50%;
border: 2px solid #ef4444;
background: rgba(239,68,68,0.1);
display: flex; align-items: center; justify-content: center;
font-family: 'Space Grotesk', sans-serif;
font-size: 0.7rem;
font-weight: 600;
color: #ef4444;
flex-shrink: 0;
position: relative;
}
.rnn-arrow {
width: 32px; height: 2px;
background: linear-gradient(90deg, #ef4444, rgba(239,68,68,0.3));
position: relative;
flex-shrink: 0;
}
.rnn-arrow::after {
content: '';
position: absolute;
right: 0; top: -4px;
border-left: 8px solid rgba(239,68,68,0.5);
border-top: 5px solid transparent;
border-bottom: 5px solid transparent;
}
.rnn-fade { opacity: 0.35; }
.rnn-labels {
display: flex;
justify-content: space-between;
font-size: 0.7rem;
color: var(--muted);
margin-top: 0.6rem;
font-family: 'JetBrains Mono', monospace;
min-width: 600px;
}
/* Attention Matrix */
.attn-matrix {
display: grid;
grid-template-columns: auto repeat(6, 1fr);
gap: 3px;
font-size: 0.7rem;
font-family: 'JetBrains Mono', monospace;
}
.attn-label {
display: flex; align-items: center; justify-content: flex-end;
padding-right: 8px;
color: var(--muted);
font-size: 0.68rem;
}
.attn-col-labels {
display: grid;
grid-template-columns: auto repeat(6, 1fr);
gap: 3px;
margin-bottom: 3px;
}
.attn-col-label {
text-align: center;
color: var(--muted);
font-size: 0.65rem;
font-family: 'JetBrains Mono', monospace;
}
.attn-cell {
height: 36px;
border-radius: 3px;
display: flex; align-items: center; justify-content: center;
font-size: 0.65rem;
font-weight: 600;
transition: transform 0.2s;
cursor: default;
}
.attn-cell:hover { transform: scale(1.08); }
/* Timeline */
.timeline {
position: relative;
padding-left: 2rem;
margin: 2rem 0;
}
.timeline::before {
content: '';
position: absolute;
left: 7px; top: 0; bottom: 0;
width: 2px;
background: linear-gradient(180deg, var(--purple), var(--cyan), var(--gold));
}
.timeline-item {
position: relative;
margin-bottom: 2rem;
padding-left: 1.5rem;
}
.timeline-dot {
position: absolute;
left: -1.85rem; top: 0.35rem;
width: 16px; height: 16px;
border-radius: 50%;
border: 2px solid;
background: var(--bg);
}
.timeline-dot.purple { border-color: var(--purple); box-shadow: 0 0 10px var(--purple); }
.timeline-dot.cyan { border-color: var(--cyan); box-shadow: 0 0 10px var(--cyan); }
.timeline-dot.gold { border-color: var(--gold); box-shadow: 0 0 10px var(--gold); }
.timeline-dot.green { border-color: var(--green); box-shadow: 0 0 10px var(--green); }
.timeline-year {
font-family: 'Bangers', cursive;
font-size: 1.2rem;
letter-spacing: 2px;
color: var(--gold);
margin-bottom: 0.25rem;
}
.timeline-milestone {
font-family: 'Space Grotesk', sans-serif;
font-weight: 700;
font-size: 1rem;
color: #fff;
margin-bottom: 0.25rem;
}
.timeline-desc {
font-size: 0.88rem;
color: var(--muted);
line-height: 1.5;
}
.timeline-badge {
display: inline-block;
font-size: 0.65rem;
font-weight: 700;
font-family: 'Space Grotesk', sans-serif;
padding: 0.15rem 0.5rem;
border-radius: 100px;
margin-left: 0.5rem;
vertical-align: middle;
}
.badge-origin { background: rgba(139,92,246,0.25); color: var(--purple); border: 1px solid var(--purple); }
.badge-encoder { background: rgba(34,211,238,0.2); color: var(--cyan); border: 1px solid var(--cyan); }
.badge-decoder { background: rgba(245,158,11,0.2); color: var(--gold); border: 1px solid var(--gold); }
.badge-scale { background: rgba(16,185,129,0.2); color: var(--green); border: 1px solid var(--green); }
/* Citation / quote block */
blockquote {
margin: 2rem 0;
padding: 1.5rem 1.5rem 1.5rem 2rem;
background: var(--surface2);
border-left: 4px solid var(--purple);
border-radius: 0 6px 6px 0;
font-style: italic;
color: var(--muted);
position: relative;
}
blockquote::before {
content: '"';
font-family: 'Bangers', cursive;
font-size: 5rem;
color: var(--purple);
opacity: 0.3;
position: absolute;
top: -1rem; left: 0.5rem;
line-height: 1;
}
blockquote cite {
display: block;
margin-top: 0.75rem;
font-size: 0.82rem;
font-style: normal;
color: var(--purple);
font-family: 'Space Grotesk', sans-serif;
font-weight: 600;
}
/* Code-like inline */
code {
font-family: 'JetBrains Mono', monospace;
font-size: 0.85em;
background: rgba(139,92,246,0.15);
color: var(--purple);
padding: 0.15em 0.4em;
border-radius: 3px;
}
/* Problem list */
.problem-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
gap: 1rem;
margin: 1.5rem 0;
}
.problem-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.25rem;
transition: border-color 0.2s;
}
.problem-card:hover { border-color: var(--red); }
.problem-card .icon { font-size: 1.8rem; margin-bottom: 0.6rem; }
.problem-card h4 {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.95rem;
font-weight: 700;
color: var(--red);
margin-bottom: 0.4rem;
}
.problem-card p { font-size: 0.88rem; color: var(--muted); margin: 0; }
/* Breakthrough cards */
.breakthrough-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
gap: 1rem;
margin: 1.5rem 0;
}
.breakthrough-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 8px;
padding: 1.25rem;
transition: border-color 0.2s, box-shadow 0.2s;
}
.breakthrough-card:hover { border-color: var(--green); box-shadow: 0 0 20px rgba(16,185,129,0.1); }
.breakthrough-card .icon { font-size: 1.8rem; margin-bottom: 0.6rem; }
.breakthrough-card h4 {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.95rem;
font-weight: 700;
color: var(--green);
margin-bottom: 0.4rem;
}
.breakthrough-card p { font-size: 0.88rem; color: var(--muted); margin: 0; }
/* Attention heads visual */
.heads-grid {
display: grid;
grid-template-columns: repeat(4, 1fr);
gap: 0.75rem;
margin: 1.5rem 0;
}
.head-card {
padding: 1rem;
border-radius: 6px;
text-align: center;
border: 1px solid;
}
.head-card .head-num {
font-family: 'Bangers', cursive;
font-size: 1.6rem;
letter-spacing: 1px;
}
.head-card .head-desc {
font-size: 0.72rem;
font-family: 'Space Grotesk', sans-serif;
margin-top: 0.25rem;
}
.head-1 { border-color: #8b5cf6; background: rgba(139,92,246,0.1); }
.head-1 .head-num { color: #8b5cf6; }
.head-1 .head-desc { color: #8b5cf6; }
.head-2 { border-color: #22d3ee; background: rgba(34,211,238,0.1); }
.head-2 .head-num { color: #22d3ee; }
.head-2 .head-desc { color: #22d3ee; }
.head-3 { border-color: #f59e0b; background: rgba(245,158,11,0.1); }
.head-3 .head-num { color: #f59e0b; }
.head-3 .head-desc { color: #f59e0b; }
.head-4 { border-color: #10b981; background: rgba(16,185,129,0.1); }
.head-4 .head-num { color: #10b981; }
.head-4 .head-desc { color: #10b981; }
/* Big stat */
.big-stat {
text-align: center;
padding: 2.5rem 1rem;
background: var(--surface);
border-radius: 12px;
margin: 2rem 0;
}
.big-stat .number {
font-family: 'Bangers', cursive;
font-size: 4rem;
letter-spacing: 3px;
background: linear-gradient(135deg, var(--gold), var(--purple));
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
}
.big-stat .label {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.9rem;
color: var(--muted);
margin-top: 0.5rem;
}
/* Footer */
/* Responsive */
@media (max-width: 600px) {
main { width: 100%; }
.site-nav-inner { gap: 1rem; }
.site-brand span:last-child { display: none; }
.site-nav .nav-links { gap: 0.9rem; overflow-x: auto; }
.heads-grid { grid-template-columns: repeat(2, 1fr); }
.hero h1 { font-size: 2.5rem; }
}
/* Scroll fade-in */
.fade-in {
opacity: 0;
transform: translateY(24px);
transition: opacity 0.6s ease, transform 0.6s ease;
}
.fade-in.visible {
opacity: 1;
transform: translateY(0);
}
/* Inlined into Hugo: ensure content is visible even without the IntersectionObserver script */
.attn-article .fade-in { opacity: 1; transform: none; }</style><div class="attn-article"><section class="hero fade-in"><div class="hero-tag">Deep Dive · Artificial Intelligence · 2017</div><p class="hero-sub">
The five-word title that ended one era of AI and launched another.
A beginner-friendly, technically honest tour through the paper that birthed every LLM you've ever heard of.</p><div class="hero-meta"><span>✍️ Ajay Walia</span><span>📅 May 2026</span><span>⏱ ~15 min read</span><span>🧠 Beginner → Intermediate</span></div></section><div class="manga-panel fade-in"><div class="panel-title">Panel 1 — THE SCROLL APPEARS</div><img src="/images/panel-01.jpg" alt="Panel 1 — The Scroll Appears: A scientist holds up the Attention Is All You Need scroll" loading="lazy"/><section id="before" class="section fade-in"><span class="chapter-label">CHAPTER 01</span><h2>The <span class="highlight">Dark Ages</span> of Language AI</h2><p>To understand why "Attention Is All You Need" was a thunderclap, you first need to appreciate how painful life was before it. Cast your mind back to 2016. AI researchers around the world were working incredibly hard on language problems — translation, summarisation, question answering — but they were doing so with a fundamental handicap baked into their tools.</p><p>The dominant models at the time were <strong>Recurrent Neural Networks (RNNs)</strong> and their smarter cousin, the <strong>Long Short-Term Memory network (LSTM)</strong>. Both were designed to handle sequences: text goes in word by word, the model builds up a hidden memory state as it reads, and produces an output at the end.</p><p>The intuition seems sensible. After all, <em>we</em> read left to right (in English). Why shouldn't a machine? The problem, as we'll see, was catastrophic at scale.</p><h3>How an RNN Actually Works</h3><p>Imagine you're a goldfish with a tiny little notepad. Every time you read a new word, you scribble something on your notepad, then <em>erase half of it</em> to make room for the next word. By the time you reach the end of a 500-word paragraph, your notepad is a smeared mess of partial impressions. That's an RNN.</p><p>More precisely: an RNN processes tokens <strong>one at a time, left to right</strong>. 
At each step, it combines the current word's embedding with a <em>hidden state</em> vector (its "memory" from all previous words) and produces a new hidden state. That hidden state is passed forward to the next step.</p><div class="diagram-box fade-in"><div class="diagram-title">RNN: Sequential Processing Chain</div><div style="overflow-x: auto;"><div class="rnn-chain"><div class="rnn-node">The</div><div class="rnn-arrow"/><div class="rnn-node">cat</div><div class="rnn-arrow"/><div class="rnn-node">sat</div><div class="rnn-arrow"/><div class="rnn-node">on</div><div class="rnn-arrow"/><div class="rnn-node rnn-fade">the</div><div class="rnn-arrow" style="opacity:0.3"/><div class="rnn-node rnn-fade" style="border-color:#fbbf24; background:rgba(251,191,36,0.1); color:#fbbf24;">???</div></div><div class="rnn-labels"><span>Step 1</span><span>Step 2</span><span>Step 3</span><span>Step 4</span><span>Step 5 (fading)</span><span>Long range: ☠️</span></div></div><p style="text-align:center; font-size:0.75rem; color:var(--muted); margin-top:0.8rem; font-family:'Space Grotesk',sans-serif;">Information degrades as it passes along the chain. Early words become "forgotten."</p></div><h3>LSTMs: A Better Notepad, Same Problem</h3><p>LSTMs (invented by Hochreiter &amp; Schmidhuber in 1997) were the RNN's upgrade. Alongside the hidden state, they add three "gates" (input, forget, and output) and a separate "cell state" that acts as a longer-term memory. They were genuinely better at remembering things across longer sequences.</p><p>But LSTMs didn't solve the core architectural problem. They still processed one token at a time, sequentially. 
And at massive scale, that was the killer.</p></section><div class="manga-panel fade-in"><div class="panel-title">Panel 2 — THE MEMORY BALL</div><img src="/images/panel-02.jpg" alt="Panel 2 — The Memory Ball: RNN chain passing a fading memory ball" loading="lazy"/><section id="problems" class="section fade-in"><span class="chapter-label">CHAPTER 02</span><h2>Three Problems That <span class="highlight">Crippled</span> the Old Models</h2><p>Before we get to the solution, let's be precise about the pain. The pre-Transformer era had three interconnected crises, and solving <em>any one</em> of them would have been significant. The Transformer paper solved all three simultaneously.</p><div class="problem-grid"><div class="problem-card"><div class="icon">⛓️</div><h4>Problem 1: Sequential Bottleneck</h4><p>RNNs and LSTMs process tokens one at a time. Step 2 cannot begin until Step 1 finishes. This means you <strong>cannot parallelise training</strong> across the positions of a sequence, however many GPU cores you have. Training was agonisingly slow — weeks or months for large models.</p></div><div class="problem-card"><div class="icon">🌫️</div><h4>Problem 2: Vanishing Gradients</h4><p>When you train a neural network with backpropagation, you compute gradients (error signals) and push them backwards through the chain. In a long sequence, those gradients can shrink exponentially as they travel, so the model learns very little from early tokens.</p></div><div class="problem-card"><div class="icon">📏</div><h4>Problem 3: Long-Range Amnesia</h4><p>In the sentence <em>"The trophy didn't fit in the suitcase because <strong>it</strong> was too big"</em> — what does "it" refer to? The trophy. A human knows instantly. When hundreds of words separated "trophy" from "it", an RNN often lost the connection entirely.</p></div></div><h3>The Telephone Game at Scale</h3><p>The vanishing gradient problem is best understood through the "telephone game" (Chinese Whispers). You whisper a sentence to the first person in a chain. 
By the time it reaches the 20th person, the message is garbled beyond recognition. In an RNN, the gradient signal is that whisper — and long sequences were destroying it.</p><p>LSTMs reduced the garbling with their gating mechanisms, but didn't eliminate it. And crucially, every single token in the sequence still had to wait for the one before it to finish processing. At a time when researchers were starting to dream about training on billions of words, this was a scaling cliff.</p></section><div class="manga-panel fade-in"><div class="panel-title">Panel 3 — THE VANISHING WHISPER</div><img src="/images/panel-03.jpg" alt="Panel 3 — The Vanishing Whisper: Telephone chain showing vanishing gradient problem" loading="lazy"/><section class="section fade-in"><span class="chapter-label">CHAPTER 03</span><h2>A Spark Before the Fire — <span class="highlight">Bahdanau Attention (2014)</span></h2><p>To be accurate about the history: the 2017 paper didn't invent attention from scratch. In 2014, Dzmitry Bahdanau and colleagues published a paper that added an "attention mechanism" on top of existing encoder-decoder RNNs for machine translation.</p><p>The idea was elegant: when generating each output word, instead of squishing the entire input sentence into one fixed-size vector, the model learns to "look back" at different parts of the input and assign weights — attention scores — to each input word. Generate "Hund" in German? Pay more attention to "dog" in the English source.</p><div class="callout gold"><div class="callout-title">🏅 The 2014 Precursor</div><p><strong>Bahdanau et al. (2014)</strong> showed attention worked. But they bolted it <em>on top of</em> RNNs — the sequential backbone was still there, just with a better look-back mechanism. 
It was like putting a turbocharged engine in a horse-drawn carriage.</p></div><p>The 2017 breakthrough came when Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, working across Google Brain, Google Research, and the University of Toronto, asked a radical question: <em>what if we got rid of the carriage altogether?</em></p><blockquote>
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."<cite>— Vaswani et al., Attention Is All You Need (2017)</cite></blockquote></section><section id="paper" class="section fade-in"><span class="chapter-label">CHAPTER 04</span><h2>June 2017 — <span class="highlight">The Paper Drops</span></h2><p>Eight researchers, most of them at Google Brain and Google Research, quietly uploaded a pre-print to arXiv on June 12, 2017. The title was almost cheeky — "Attention Is All You Need" — a pun on The Beatles' "All You Need Is Love" and a direct challenge to the field: <em>attention mechanisms alone are sufficient.</em> No recurrence. No convolutions. Just attention.</p><p>The abstract was direct. They proposed the Transformer architecture, showed it achieved state-of-the-art results on the WMT 2014 English-to-German translation task (28.4 BLEU, more than 2 BLEU points above the previous best, ensembles included), and made a claim that would prove prophetic: this architecture was far more parallelisable and required significantly less time to train.</p><p>At the time, the machine learning community took notice but didn't immediately grasp the full magnitude. It looked like a better translation model. What it actually was, in retrospect: the foundation of almost every major language model built since.</p><div class="big-stat"><div class="number">200,000+</div><div class="label">Citations as of 2025 — one of the most cited papers in all of computer science history</div></div></section><div class="manga-panel fade-in"><div class="panel-title">Panel 4 — THE GOOGLE BRAIN MOMENT</div><img src="/images/panel-04.jpg" alt="Panel 4 — The Google Brain Moment: Eight researchers eureka breakthrough" loading="lazy"/><section id="attention" class="section fade-in"><span class="chapter-label">CHAPTER 05</span><h2>Self-Attention — <span class="highlight">Every Word Watches Every Word</span></h2><p>This is the heart of it. 
Everything else in the Transformer paper is (brilliant) supporting machinery. Self-attention is the engine.</p><p>Here's the core idea in plain English: <strong>when processing any word, the model looks at every word in the sentence simultaneously, itself included, and calculates a relevance score for each</strong>. Instead of passing information along a chain one step at a time, every token "talks to" every other token in parallel.</p><h3>The Library Analogy</h3><p>Imagine a library where every book can send a little messenger to every other book, asking: "Hey, are you relevant to me?" Each pair of books gives an answer — a number from 0 to 1. The books with higher scores get to "share" more of their information when the library compiles its final report.</p><p>In the sentence <em>"The cat sat on the mat because it was tired"</em> — when the model processes "it", the self-attention mechanism computes a score between "it" and every other word. The word "cat" gets a very high score (because "it" refers to the cat), while words like "the" and "on" get low scores. This is done in one parallel operation — no sequential chain required.</p><h3>The Math Behind It (Don't Panic)</h3><p>The paper formalises this with three vectors derived from each word's embedding through learned linear projections: a <strong>Query (Q)</strong>, a <strong>Key (K)</strong>, and a <strong>Value (V)</strong>. Think of it like a search engine:</p><div class="callout cyan"><div class="callout-title">🔍 Q / K / V Intuition</div><p><strong>Query (Q):</strong> "What am I looking for?" — what the current word wants to know.<br/><strong>Key (K):</strong> "What do I offer?" — what each other word has to advertise.<br/><strong>Value (V):</strong> "Here's my actual content" — what each word shares if chosen.<br/><br/>
Attention score =<code>softmax( Q · Kᵀ / √d_k ) · V</code></p></div><p>The division by √d_k (square root of the dimension) is a stabilising trick — without it, the dot products can get very large and the softmax function becomes extremely "peaky" (everything goes to one word), which hurts training. The softmax then converts raw scores into a probability distribution — so all the weights add up to 1.0.</p><div class="diagram-box fade-in"><div class="diagram-title">Self-Attention Weight Matrix — "it" attending to other words</div><div class="attn-col-labels" style="display:grid; grid-template-columns: 80px repeat(6,1fr); gap:3px; margin-bottom:3px;"><div/><div class="attn-col-label">The</div><div class="attn-col-label">cat</div><div class="attn-col-label">sat</div><div class="attn-col-label">on</div><div class="attn-col-label">mat</div><div class="attn-col-label">it</div></div><div style="display:grid; grid-template-columns: 80px repeat(6,1fr); gap:3px;"><div class="attn-label">"it" →</div><div class="attn-cell" style="background:rgba(139,92,246,0.15); color:#8b5cf6;">0.05</div><div class="attn-cell" style="background:rgba(139,92,246,0.85); color:#fff; font-weight:800;">0.72</div><div class="attn-cell" style="background:rgba(139,92,246,0.1); color:#8b5cf6;">0.04</div><div class="attn-cell" style="background:rgba(139,92,246,0.08); color:#8b5cf6;">0.03</div><div class="attn-cell" style="background:rgba(139,92,246,0.2); color:#8b5cf6;">0.10</div><div class="attn-cell" style="background:rgba(139,92,246,0.25); color:#c4b5fd;">0.06</div></div><p style="text-align:center; font-size:0.72rem; color:var(--muted); margin-top:0.8rem; font-family:'Space Grotesk',sans-serif;">
"it" pays most attention to "cat" (0.72 weight) — this is how the model resolves co-reference. Higher = brighter.</p></div><p>The critical breakthrough isn't just<em>that</em> these scores are computed — it's<em>how</em>:<strong>all pairs are computed in parallel using matrix multiplication</strong>. A sentence of 512 tokens doesn't require 512 sequential steps. It requires one big matrix operation that modern GPUs execute extremely fast. This is the parallelisation breakthrough that made scaling possible.</p></section><div class="manga-panel fade-in"><div class="panel-title">Panel 5 — WORDS LOOKING AT WORDS</div><img src="/images/panel-05.jpg" alt="Panel 5 — Words Looking at Words: Self-attention beam from IT to CAT" loading="lazy"/><section id="multihead" class="section fade-in"><span class="chapter-label">CHAPTER 06</span><h2>Multi-Head Attention —<span class="highlight">Many Perspectives at Once</span></h2><p>Here's where the paper goes from clever to ingenious. A single self-attention computation gives you one view of how words relate. But language is rich — words relate to each other in many<em>different</em> ways simultaneously.</p><p>Consider the sentence<em>"She gave him the book she wrote"</em>:</p><p>— "she" and "him" have a<em>grammatical subject/object relationship</em><br/>
— "she" (first occurrence) and "she" (second) have a <em>co-reference relationship</em><br/>
— "book" and "wrote" have a <em>semantic relationship</em> (you write books)<br/>
— "gave" and "book" have a <em>verb-object relationship</em></p><p>One attention head would have to pick one of these. <strong>Multi-head attention runs several attention computations in parallel</strong>, each in a different "subspace" of the representation. The results are then concatenated and projected back to the original dimension.</p><div class="heads-grid"><div class="head-card head-1"><div class="head-num">Head 1</div><div class="head-desc">Grammatical roles (subject, object, verb)</div></div><div class="head-card head-2"><div class="head-num">Head 2</div><div class="head-desc">Co-reference resolution ("she" = "she")</div></div><div class="head-card head-3"><div class="head-num">Head 3</div><div class="head-desc">Semantic relatedness (book ↔ wrote)</div></div><div class="head-card head-4"><div class="head-num">Head 4</div><div class="head-desc">Syntactic dependencies (verb-object)</div></div></div><p>The original Transformer used <strong>8 attention heads</strong> per layer. GPT-3 uses 96 per layer; head counts for closed models such as Claude have not been published. Each head develops its own specialisation during training — not by design, but emergently, because the model learns that different heads can capture different useful patterns.</p><div class="callout purple"><div class="callout-title">🎭 The Right Analogy</div><p>Multi-head attention is like having a team of editors review your essay simultaneously. One editor focuses on grammar, one on logical flow, one on vocabulary, one on argument structure. You get all their feedback at once, then synthesise it.
No editor has to wait for the previous one to finish.</p></div></section><div class="manga-panel fade-in"><div class="panel-title">Panel 6 — THE EIGHT-EYED TEAM</div><img src="/images/panel-06.jpg" alt="Panel 6 — The Eight-Eyed Team: Multi-head attention split panel" loading="lazy"/><section class="section fade-in"><span class="chapter-label">CHAPTER 07</span><h2>Positional Encoding —<span class="highlight">Teaching Order Without Recurrence</span></h2><p>Here's a subtle but critical problem. In an RNN, word order is implicit — you literally process word 1, then word 2, then word 3. The order is baked into the architecture. But in a Transformer, all words are processed in parallel. If you showed it "Dog bites man" and "Man bites dog" simultaneously, the attention mechanism alone would see the same set of words and might produce the same result.</p><p>That's obviously catastrophic for language. "The bank by the river" and "river by the bank the" mean very different things.</p><p>The solution:<strong>Positional Encoding</strong>. Before feeding word embeddings into the Transformer, you add a unique positional signal to each one. The paper uses a clever combination of sine and cosine functions at different frequencies:</p><div class="callout cyan"><div class="callout-title">📐 The Formula</div><p><code>PE(pos, 2i) = sin(pos / 10000^(2i/d))</code><br/><code>PE(pos, 2i+1) = cos(pos / 10000^(2i/d))</code><br/><br/>
Where <code>pos</code> is the word's position, <code>i</code> indexes each sine/cosine pair of dimensions, and <code>d</code> is the embedding size. The result: each position gets a unique, smooth vector that the model can learn to interpret. The sine/cosine waves at different frequencies are like a musical chord unique to each seat in the stadium.</p></div><p>Why sinusoids and not just the numbers 1, 2, 3...? Because sinusoids generalise. They allow the model to learn <em>relative positions</em> (word 2 is one step after word 1), not just absolute ones. The authors also hypothesised that sinusoids would let the model extrapolate to sequences longer than those seen in training, because the wave patterns extend naturally.</p><p>Modern variants like RoPE (Rotary Position Embedding, used in Llama and GPT-NeoX) and ALiBi have since improved on the original scheme — but they're all descendants of this 2017 insight.</p></section><div class="manga-panel fade-in"><div class="panel-title">Panel 7 — THE NUMBERED LINEUP</div><img src="/images/panel-07.jpg" alt="Panel 7 — The Numbered Lineup: Positional encoding lineup of word characters" loading="lazy"/><section id="architecture" class="section fade-in"><span class="chapter-label">CHAPTER 08</span><h2>The Full Transformer — <span class="highlight">Encoder + Decoder</span></h2><p>The original paper was designed for sequence-to-sequence tasks — specifically machine translation. The architecture has two halves that work together: an <strong>Encoder</strong> that reads the input (the English sentence) and a <strong>Decoder</strong> that generates the output (the German translation).</p><h3>The Encoder</h3><p>The encoder stack (6 identical layers in the original paper) processes the entire input sentence in parallel. Each layer has two sub-components: <em>(1) multi-head self-attention</em> (all words attend to all other words) and <em>(2) a feed-forward neural network</em> applied to each position independently.
Both sub-components use <strong>residual connections</strong> (the input is added back to the output) and <strong>layer normalisation</strong>. Both are training-stability techniques: residual connections were popularised by ResNet in computer vision, and layer normalisation was introduced shortly before the Transformer as an alternative to batch normalisation.</p><h3>The Decoder</h3><p>The decoder is similar but has three sub-components per layer. The first is <em>masked self-attention</em> — like encoder self-attention, but masked so that when generating word N, the model can only attend to words 1 through N-1 (it can't cheat by looking at future words). The second is <em>cross-attention</em> — the decoder attends to the encoder's output, connecting the input sentence to the generation process. The third is the same feed-forward network as in the encoder.</p><div class="callout purple"><div class="callout-title">🧬 The Encoder-Decoder Legacy</div><p>BERT (2018) uses <em>only the encoder</em> — great for understanding tasks (classification, named entity recognition). GPT-1/2/3/4 use <em>only the decoder</em> — great for generation tasks (writing, code, conversation). The full encoder-decoder design lives on in models like T5 and BART, used heavily for translation and summarisation.</p></div><p>One more key ingredient: <strong>Feed-Forward layers</strong>. After each attention block, every position's representation passes through a small, identical 2-layer neural network. In the original paper, the inner dimension of this network was 2048 — 4× the model's embedding dimension of 512. In GPT-3, it's 4× 12,288 = 49,152.
These layers are believed to act as "fact storage" — where knowledge learned during training gets encoded.</p></section><div class="manga-panel fade-in"><div class="panel-title">Panel 8 — THE TRANSFORMER MECHA</div><img src="/images/panel-08.jpg" alt="Panel 8 — The Transformer Mecha: Encoder-Decoder mecha robot" loading="lazy"/><section id="timeline" class="section fade-in"><span class="chapter-label">CHAPTER 09</span><h2>The Impact <span class="highlight">Timeline</span> — 2017 to Now</h2><p>The Transformer paper wasn't just a research curiosity. It was a platform. Within a year, the entire field had pivoted. Within five years, it had generated a trillion-dollar industry. Here's the direct lineage:</p><div class="timeline"><div class="timeline-item"><div class="timeline-dot purple"/><div class="timeline-year">2017</div><div class="timeline-milestone">"Attention Is All You Need"<span class="timeline-badge badge-origin">ORIGIN</span></div><div class="timeline-desc">Vaswani et al. (Google Brain). The original Transformer architecture. State-of-the-art on WMT English→German translation. The base model had 65M parameters and trained in about 12 hours on 8 P100 GPUs; the 213M-parameter "big" model took 3.5 days on the same hardware.</div></div><div class="timeline-item"><div class="timeline-dot cyan"/><div class="timeline-year">2018</div><div class="timeline-milestone">BERT — Google<span class="timeline-badge badge-encoder">ENCODER-ONLY</span></div><div class="timeline-desc">Bidirectional Encoder Representations from Transformers. 340M parameters (BERT-Large). Pre-trained on masked language modelling — predict randomly hidden words. Demolished 11 NLP benchmarks on release. The model that established "pre-train, then fine-tune" as the dominant paradigm.</div></div><div class="timeline-item"><div class="timeline-dot gold"/><div class="timeline-year">2018</div><div class="timeline-milestone">GPT-1 — OpenAI<span class="timeline-badge badge-decoder">DECODER-ONLY</span></div><div class="timeline-desc">Generative Pre-trained Transformer. 117M parameters.
The first proof that a decoder-only Transformer, trained on unsupervised language modelling, could be fine-tuned for diverse tasks. OpenAI's foundational bet on the decoder path.</div></div><div class="timeline-item"><div class="timeline-dot green"/><div class="timeline-year">2019–2020</div><div class="timeline-milestone">GPT-2 → GPT-3<span class="timeline-badge badge-scale">SCALE</span></div><div class="timeline-desc">GPT-2 (1.5B params) was so good at generating text that OpenAI opted for a staged release over safety concerns. GPT-3 (175B params) — the first model to demonstrate serious few-shot learning. You could give it 3 examples and it could do a new task without any fine-tuning. The scaling laws paper (Kaplan et al., 2020) showed that loss improves predictably as model size, data, and compute grow.</div></div><div class="timeline-item"><div class="timeline-dot purple"/><div class="timeline-year">2022</div><div class="timeline-milestone">ChatGPT — RLHF Changes Everything</div><div class="timeline-desc">GPT-3.5 fine-tuned with Reinforcement Learning from Human Feedback (RLHF). The public suddenly had a conversational interface to an LLM. 1 million users in 5 days. 100M in 2 months. The fastest consumer product growth recorded up to that point. Every major tech company scrambled.</div></div><div class="timeline-item"><div class="timeline-dot cyan"/><div class="timeline-year">2023–2026</div><div class="timeline-milestone">Claude · Grok · Gemini · Llama · Mistral · GPT-4/4o</div><div class="timeline-desc">All Transformer-based. All descendants of the 2017 paper. Claude (Anthropic) adds Constitutional AI. Llama (Meta) brings open weights to the frontier. Gemini (Google) goes multimodal. Grok (xAI) takes on real-time search.
The Cambrian explosion of LLMs — every one of them tracing its lineage back to June 12, 2017.</div></div></div></section><div class="manga-panel fade-in"><div class="panel-title">Panel 9 — THE ROCKET OF PROGRESS</div><img src="/images/panel-09.jpg" alt="Panel 9 — The Rocket of Progress: AI timeline rocket from 2017 to present" loading="lazy"/><section id="verdict" class="section fade-in"><span class="chapter-label">CHAPTER 10</span><h2>Is It The Most Important AI Paper<span class="highlight">Ever Written?</span></h2><p>Let's be honest about this. The question is fascinating precisely because it's not entirely settled — there are serious candidates.</p><h3>The Case For: Yes, Unambiguously</h3><p>No single paper has had a more direct and immediate commercial and scientific impact in the modern AI era. Every frontier LLM in existence today is a Transformer. The trillion-dollar AI industry of the mid-2020s is built on this foundation. With 200,000+ citations, it's a runaway leader in citation counts for an ML paper. 
The research it unlocked — in text, image (ViT), audio (Whisper), protein structure (AlphaFold 2), video (Sora), code (Copilot) — spans essentially every domain of AI.</p><h3>The Case For Other Contenders</h3><div class="callout gold"><div class="callout-title">🏆 Other Papers That Matter</div><p><strong>Backpropagation (Rumelhart et al., 1986)</strong> — Without the ability to train neural networks at all, there's nothing to build on.<br/><br/><strong>ImageNet + AlexNet (Krizhevsky et al., 2012)</strong> — The moment deep learning proved itself to the world, launching the modern deep learning era.<br/><br/><strong>Word2Vec (Mikolov et al., 2013)</strong> — Showed that word embeddings encode semantic meaning; a prerequisite for Transformer input representations.<br/><br/><strong>Scaling Laws (Kaplan et al., 2020)</strong> — Proved that LLM capabilities grow predictably with compute and data, enabling the investment thesis behind GPT-3 and everything after.<br/><br/><strong>RLHF (Christiano et al., 2017)</strong> — The alignment technique that turned raw LLMs into assistants humans actually want to use.</p></div><p>The honest verdict:<strong>In the specific context of modern generative AI — LLMs, multimodal models, and the AI products billions of people use daily — "Attention Is All You Need" is the clearest single point of origin.</strong> Without backprop it couldn't exist, but without this paper, it wouldn't have become what it is. It's the right answer to the question "which paper made today's AI possible?"</p><blockquote>
"We are all standing on the shoulders of eight people who asked: what if recurrence isn't actually necessary?"<cite>— A reasonable paraphrase of the entire modern AI research community</cite></blockquote></section><div class="manga-panel fade-in"><div class="panel-title">Panel 10 — THE AI FAMILY PORTRAIT</div><img src="/images/panel-10.jpg" alt="Panel 10 — The AI Family Portrait: GPT, Claude, Grok, Gemini, Llama before the founders portrait" loading="lazy"/><section class="section fade-in"><span class="chapter-label">CHAPTER 11</span><h2>Where Do We Go From <span class="highlight">Here?</span></h2><p>The Transformer is dominant but not invincible. Researchers are actively working on what comes next — and several serious challengers are emerging.</p><h3>The Current Limitations</h3><p>Self-attention has a quadratic complexity problem. If your sequence has N tokens, the attention matrix is N × N. Double the sequence length, quadruple the compute. For long documents — books, codebases, hours of audio — this becomes brutally expensive. The context window you experience in Claude or GPT-4 represents enormous engineering effort to extend what was originally a very limited range.</p><h3>What's Being Explored</h3><p><strong>Mamba / State Space Models (SSMs)</strong> process sequences in linear time, not quadratic — a genuine architectural alternative that some researchers believe could eventually rival or exceed Transformers for long-context tasks. <strong>FlashAttention</strong> (Dao et al., 2022) is an algorithmic optimisation that makes standard attention dramatically more memory-efficient without changing the math. <strong>Mixture of Experts (MoE)</strong> architectures (used in Mixtral and Gemini 1.5, and reportedly in GPT-4) activate only a subset of parameters per token, allowing models with very large total parameter counts to run at roughly the cost of a much smaller model.</p><p><strong>Multimodality</strong> is the frontier.
The Transformer's attention mechanism generalises naturally to images (patch tokens), audio (spectrogram tokens), video (frame tokens), and structured data. A single Transformer can in principle process all of these simultaneously — and models like GPT-4o and Gemini Ultra are moving rapidly in this direction.</p><p>The question researchers are now asking: is intelligence primarily a function of architecture, or of scale and data? The scaling laws suggest it's mostly the latter. If that's true, the Transformer need not be dethroned — it just needs to be fed more.</p><div class="callout cyan"><div class="callout-title">🔭 The Bigger Picture</div><p>We are, by most accounts, somewhere in the middle of the most important technological transition in human history. Every model at the frontier — the ones writing code, passing medical exams, generating video — shares a common ancestor. Eight researchers. One arXiv pre-print. Twelve hundred lines of Python. June 12, 2017.</p></div></section><div class="manga-panel fade-in"><div class="panel-title">Panel 11 — THE NEURAL SKY</div><img src="/images/panel-11.jpg" alt="Panel 11 — The Neural Sky: Lone figure beneath Transformer constellation sky" loading="lazy"/><section class="section fade-in"><span class="chapter-label">CONCLUSION</span><h2>The Five Words That <span class="highlight">Changed Everything</span></h2><p>We started this piece in 2016, watching RNNs and LSTMs struggle through their sequential chains, watching gradient signals vanish like whispers in a long corridor. We watched researchers work incredibly hard to coax these architectures to handle longer contexts, more complex language, bigger training sets — and hit wall after wall.</p><p>Then eight people asked a simple question — <em>what if you just... paid attention?</em> — and rewired the entire field.</p><p>The Transformer is not magic.
It's mathematics: query-key-value lookups, scaled dot-product attention, layer normalisation, residual connections, feed-forward networks. Every piece is graspable. The genius was in the combination — and in the willingness to abandon the assumption that sequences must be processed sequentially.</p><p>The models you interact with today — the ones that draft your emails, explain your code, answer your questions about transformer architecture with exhaustive detail — are all running on this foundation. Claude (the AI that helped outline this post) is a Transformer. GPT-4 is a Transformer. Grok is a Transformer. Gemini is a Transformer. They are, in the deepest technical sense, all direct descendants of that arXiv upload.</p><p>Understanding "Attention Is All You Need" is not just historical curiosity. It's the grammar of modern AI. Once you understand it, you have a lens through which almost everything in the field makes sense — the scaling laws, the context window debates, the encoder vs. decoder architecture choices, the multimodal experiments, the efficiency research.</p><p>The paper is free. The arXiv link still works. It's 15 pages and reads more clearly than most ML papers. If this post piqued your curiosity: go read it. You'll understand it now.</p><div class="big-stat"><div class="number">2017 → ∞</div><div class="label">The year a single paper changed the trajectory of intelligence itself</div></div></section></div><script>
(function() {
var root = document.querySelector('.attn-article');
if (!root || !('IntersectionObserver' in window)) return;
// Re-apply the hidden start state, then animate in on scroll.
var els = root.querySelectorAll('.fade-in');
els.forEach(function(el) { el.style.opacity = '0'; el.style.transform = 'translateY(24px)'; el.style.transition = 'opacity 0.6s ease, transform 0.6s ease'; });
var io = new IntersectionObserver(function(entries) {
entries.forEach(function(entry) {
if (entry.isIntersecting) {
entry.target.style.opacity = '1';
entry.target.style.transform = 'none';
io.unobserve(entry.target);
}
});
}, { threshold: 0.08 });
els.forEach(function(el) { io.observe(el); });
root.querySelectorAll('.attn-cell').forEach(function(cell) {
cell.title = 'Attention weight: ' + cell.textContent.trim();
});
})();</script>
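<p>As a footnote, the two formulas quoted in this post, scaled dot-product attention and sinusoidal positional encoding, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are made up for this post, the embedding size is assumed to be even, and the explicit loops stand in for the single matrix multiplication a real framework would run on a GPU.</p>

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1.0."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q . K^T / sqrt(d_k)) . V.

    Illustrative only: loops over queries one at a time for readability.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: dot(q, k), scaled down by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position (assumes d_model is even).

    Even dimensions get sin, odd dimensions get cos, per the paper's formula.
    """
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe
```

<p>Calling <code>attention</code> with a single query that lines up with the first key puts most of the weight on the first value vector, which is the same effect as "it" attending to "cat" in the weight matrix shown earlier.</p>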
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/panel-01.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>Knowledge Base</category></item><item><title>Building with LLMs in 2026: The Framework Atlas</title><link>https://curiousbit.netlify.app/building-with-llms-in-2026-the-framework-atlas/</link><guid isPermaLink="true">https://curiousbit.netlify.app/building-with-llms-in-2026-the-framework-atlas/</guid><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Four years after LLMs entered the mainstream, the single most common mistake I see architects make is spending most of their decision energy on the model. Which frontier model? GPT or Claude or Gemini? The model choice matters — but it is one decision out of roughly fifteen, and it is far from the most consequential one.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/llm-framework-atlas-cover.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Four years after LLMs entered the mainstream, the single most common mistake I see architects make is spending most of their decision energy on the model. Which frontier model? GPT or Claude or Gemini? The model choice matters — but it is one decision out of roughly fifteen, and it is far from the most consequential one.</p><p>Building an LLM-powered system in 2026 is an architecture decision made across a stack of competing frameworks, each solving a well-posed problem at a specific layer. I spent several months mapping that landscape as a practitioner — the result is a 73-page whitepaper I call the Framework Atlas. 
This post distils it into the five things I think every architect, engineer, and senior IT leader should know before picking a single tool.</p><hr><h2 id="the-stack-has-a-shape">The Stack Has a Shape</h2><p>There is no single AI stack, but there is a canonical shape. Every non-trivial production LLM system — whether a support chatbot, a document search engine, or a multi-agent workflow — is a composition of six layers:</p><p><strong>Application layer.</strong> The surface your user interacts with. LangChain is the default; Semantic Kernel is the Microsoft-native choice; CrewAI leads when the app itself is agentic.</p><p><strong>Agent layer.</strong> When a single LLM call is not enough — when the system needs to plan, call tools, or coordinate among multiple agents — this layer provides the loop. LangGraph is the most production-grounded option in 2026.</p><p><strong>Data / retrieval layer.</strong> The memory of your system. LlamaIndex leads on orchestration; Weaviate, Pinecone, and Chroma compete at storage, each tuned for a different operational profile.</p><p><strong>Model layer.</strong> The foundation models themselves. This layer is increasingly commoditised. The most important design decision here is not which model you start with — it is whether you can swap it without rewriting the layers above.</p><p><strong>Serving / inference layer.</strong> How you turn a model into an endpoint. vLLM dominates throughput-bound workloads; BentoML packages models into clean APIs for teams that want to think about models, not infrastructure.</p><p><strong>Infrastructure layer.</strong> Kubernetes, Docker, cloud, on-prem. Every framework choice depends on where you can actually deploy.</p><p>Wrapped around all six layers are three concerns that have become non-negotiable since 2024:<strong>observability and evaluation</strong>,<strong>fine-tuning and training</strong>, and<strong>guardrails and safety</strong>. 
If your system design has no answer for any of these three, it is under-designed. Ignoring them does not eliminate the risk — it just defers it until something breaks in production.</p><p><img src="/images/llm-framework-stack.jpg" alt="The canonical AI stack — Tower of Power comic book illustration"/><hr><h2 id="the-abstraction-trap">The Abstraction Trap</h2><p>Every framework in the atlas is catalogued against eleven attributes, but the one architects under-weight most consistently is<strong>abstraction level</strong> — how much code you write versus how much the framework decides for you.</p><p>LangChain&rsquo;s high abstraction makes the first demo fast and the tenth production fix slow, because you are debugging through someone else&rsquo;s default decisions. FAISS&rsquo;s low abstraction costs more lines but yields fewer surprises at 3am.</p><p><img src="/images/llm-abstraction-trap.jpg" alt="The Abstraction Trap — High Abstraction vs Debug Hell comic book illustration"/><p>The operational signal:<strong>match abstraction to team seniority.</strong> Junior teams over-value high abstraction; senior teams over-value low. A mixed team benefits from a medium default — and from making the choice explicitly rather than defaulting to whatever has the best GitHub star count.</p><hr><h2 id="decision-heuristics-that-actually-hold">Decision Heuristics That Actually Hold</h2><p>Rather than optimising at each layer independently, the atlas maps common requirements to preferred framework combinations. 
These are the ones I have found most durable in practice:</p><table><thead><tr><th>Requirement</th><th>Starting stack</th></tr></thead><tbody><tr><td>Fast LLM prototype</td><td>LangChain + Chroma + OpenAI API</td></tr><tr><td>Enterprise-grade RAG</td><td>LlamaIndex + Weaviate + LangSmith</td></tr><tr><td>Multi-agent workflow</td><td>LangGraph (+ AutoGen for agent conversations)</td></tr><tr><td>High-throughput inference</td><td>vLLM + Ray Serve</td></tr><tr><td>Local / offline / on-prem AI</td><td>Ollama + FAISS + LangChain-local</td></tr><tr><td>Domain-specialised model</td><td>Axolotl (QLoRA) + vLLM + MLflow</td></tr></tbody></table><p>Two things stand out from this table. First, LangChain and LlamaIndex are not competitors — they compose cleanly, with LangChain at the application layer and LlamaIndex at the retrieval layer. Second, local inference is no longer an edge case. Ollama plus a Llama-3-class model is a realistic production option for regulated industries where data sovereignty is a hard constraint.</p><p><img src="/images/llm-framework-heatmap.jpg" alt="Choose Your Weapon — framework decision map comic book illustration"/><hr><h2 id="agents-moved-to-production--with-guardrails">Agents Moved to Production — With Guardrails</h2><p>In 2023, autonomous agents were mostly demos. By 2026, they are in targeted production use: triage, routing, research synthesis. What changed is not the models — it is the frameworks.</p><p>LangGraph&rsquo;s state-machine model gives agents deterministic control flow: you declare states, transitions, and retry policies explicitly. AutoGen models multi-agent systems as conversations, which makes it remarkably expressive for critique-revise loops and planner-executor separations. The practitioner heuristic:<strong>for production agents, LangGraph. For multi-agent conversations, AutoGen. 
For lightweight document workflows, CrewAI.</strong></p><p>The critical note:<strong>never deploy autonomous agents in production unless the failure cost is bounded.</strong> The agent should draft; a human should approve. The pattern that ships is almost always a hybrid — autonomy where the stakes are low, escalation where they are not.</p><p><img src="/images/llm-agent-loop.jpg" alt="The Agent Loop — Plan, Act, Observe, Reflect comic book illustration"/><p>Guardrails have crossed from afterthought to critical infrastructure in the same period. Prompt injection is the new SQL injection. Every production system needs an input guard, an output guard, and a policy layer between them. The minimum viable defense in 2026 is: input guard → LLM → output guard. Anything less is operating without a seat belt.</p><hr><h2 id="the-2026-outlook-three-trends-worth-designing-for">The 2026 Outlook: Three Trends Worth Designing For</h2><p><strong>Agents are becoming the compile target.</strong> LangGraph, AutoGen, and CrewAI are converging on a common abstraction — a loop over an LLM with tool use and state. Expect a future that looks like the deep learning layer in 2018: multiple frontend frameworks, one common runtime. Design your agent layer to be swappable.</p><p><strong>Retrieval is eating search.</strong> Elasticsearch, Postgres, and OpenSearch all ship vector indexes now; Weaviate and Pinecone ship BM25. The primitives have converged. The differentiator is no longer features — it is operational maturity and the team&rsquo;s ability to run the infrastructure. Hybrid retrieval (vector + keyword) is the production-safe default.</p><p><strong>Guardrails are becoming infrastructure.</strong> Today they are a library you bolt on. In two years they will be a runtime — prompt injection detection, PII scrubbing, and policy enforcement applied by default to every model invocation, the way CORS and auth middleware is applied to every HTTP request today. 
Get ahead of this by treating your guardrails layer as critical infrastructure now, not as a compliance checkbox later.</p><hr><h2 id="a-practitioners-closing-note">A Practitioner&rsquo;s Closing Note</h2><p>Frameworks age faster than architectures. The stack shape you design today — application, agent, retrieval, model, serving, infrastructure — will still be valid in three years. The individual framework boxes you fill it with probably will not be. The single most important design invariant is<strong>swappability at each layer</strong>. Make the layer interfaces clean, keep the framework-specific code thin, and you will be able to move when the landscape shifts — and it will.</p><p>The full Framework Atlas (v4.0, April 2026) covers all ten framework categories in detail, including comparison tables, maturity radars, cost and latency envelopes, and four reference architectures with working code. It is available below.</p><p><a href="/docs/Building_with_LLMs_Framework_Atlas.pdf"><strong>Download the Framework Atlas — Building with LLMs v4.0 (PDF)</strong></a></p>
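<p>To make the swappability point concrete, here is a minimal sketch of a thin model interface wrapped in the input guard → LLM → output guard pattern described above. Everything in it is hypothetical: <code>ChatModel</code>, <code>EchoModel</code>, and the two toy guards are illustrative names, not part of any framework named in this post, and real guards would do far more than a string match.</p>

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Hypothetical interface: the only thing upper layers know about a model."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in model for the sketch; a real adapter would wrap a provider SDK."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


# A guard returns the (possibly cleaned) text, or raises ValueError to block it.
Guard = Callable[[str], str]


def block_injection(text: str) -> str:
    """Toy input guard: reject one obvious prompt-injection phrase."""
    if "ignore previous instructions" in text.lower():
        raise ValueError("input rejected by guard")
    return text


def redact_secret(text: str) -> str:
    """Toy output guard: mask a placeholder secret before it reaches the user."""
    return text.replace("SECRET", "[redacted]")


def guarded_call(model: ChatModel, prompt: str,
                 input_guard: Guard, output_guard: Guard) -> str:
    """Run the input guard, then the model, then the output guard."""
    return output_guard(model.complete(input_guard(prompt)))
```

<p>Because <code>guarded_call</code> depends only on the <code>complete</code> method, swapping the model means writing one new adapter class; the guards and everything above them stay untouched.</p>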
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/llm-framework-atlas-cover.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>llm</category><category>frameworks</category><category>engineering</category><category>Knowledge Base</category></item><item><title>I Built My Own Screenshot App for macOS (No More Clunky Screenshots)</title><link>https://curiousbit.netlify.app/i-built-my-own-screenshot-app-for-macos-no-more-clunky-screenshots/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-my-own-screenshot-app-for-macos-no-more-clunky-screenshots/</guid><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;It was a Tuesday evening. I had just taken my fourteenth screenshot of the day — a mix of Cmd+Shift+4, accidental desktop saves, files named &lt;code&gt;Screenshot 2026-05-06 at 11.43.22 PM.png&lt;/code&gt; scattered across my Downloads folder like confetti after a party nobody enjoyed.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ajshot-p1.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>It was a Tuesday evening. I had just taken my fourteenth screenshot of the day — a mix of Cmd+Shift+4, accidental desktop saves, files named <code>Screenshot 2026-05-06 at 11.43.22 PM.png</code> scattered across my Downloads folder like confetti after a party nobody enjoyed.</p><p><img src="/images/ajshot-p1.jpg" alt="Panel 1 – The Screenshot Graveyard"/><p>I opened Finder. Forty-seven PNGs. Forty-seven. Some blurry. Some with my other windows bleeding in at the edges.
Some cropped wrong because I&rsquo;d sneezed mid-drag.</p><p>There had to be a better way.</p><hr><h2 id="the-ghost-of-greenshot-past">The Ghost of Greenshot Past</h2><p>Back in my Windows days — and I spent a<em>lot</em> of years in Windows, running datacentres, managing Wintel estates, building VDI platforms — I had<a href="https://getgreenshot.org/">Greenshot</a>. If you&rsquo;ve never used it, Greenshot is a free, lightweight screenshot tool that lives in your system tray. Press<code>PrtSc</code>, draw a box, done. Instant annotation. Instant clipboard. Instant sanity.</p><p>When I moved to macOS full-time, I expected something better. Apple builds beautiful hardware. Their software is generally excellent. And their screenshot workflow is… fine. Technically fine. But when you&rsquo;re taking screenshots all day for documentation, Slack messages, architecture diagrams, and blog posts — &ldquo;fine&rdquo; isn&rsquo;t good enough.</p><p>The native flow is:<code>Cmd+Shift+4</code>, drag imprecisely, file saves somewhere, you hunt for it, you paste it, you realise the crop was off, you do it again.</p><p><img src="/images/ajshot-p2.jpg" alt="Panel 2 – The Ghost of Greenshot"/><p>I Googled &ldquo;Greenshot for Mac&rdquo; approximately eleven times over the past year. The answers: CleanShot X (paid, subscription), Shottr (good but someone else&rsquo;s decisions), or just &ldquo;get used to it.&rdquo;</p><p>I couldn&rsquo;t get used to it. So I did what any reasonable enterprise architect does at 10pm on a Tuesday.</p><p>I decided to build it myself.</p><hr><h2 id="fine-ill-build-it">Fine. I&rsquo;ll Build It.</h2><p>The plan was simple: a lightweight macOS menu bar app. Lives in the status bar. One click or a hotkey and you&rsquo;ve got a clean area capture, a window capture, or a full-screen grab. Auto-saves with a sensible filename. Copies to clipboard. Makes a satisfying shutter sound. 
Stays out of your way.</p><p>The twist: I&rsquo;d build it the same way I built this blog — with Claude as my pair programmer and VS Code as the editor. I&rsquo;m an architect and technologist, not a Swift developer. But that was the whole point.</p><p>I opened a new conversation, described what I wanted in plain English, and we got to work.</p><hr><h2 id="what-ajshot-is">What AJShot Is</h2><p><img src="/images/ajshot-p3.jpg" alt="Panel 3 – The &ldquo;AJ&rdquo; badge appears in the menu bar"/><p><strong>AJShot</strong> is a native macOS menu bar app — background-only, no Dock icon, no bloat. It sits quietly in your menu bar with a small<code>AJ</code> badge. Left-click opens your last screenshot. Right-click opens the action menu.</p><p>Memory footprint: around<strong>50 MB</strong> at idle. Compare that to some Electron-based tools that idle at 300 MB just saying hello.</p><p>The right-click menu looks like this:</p><pre tabindex="0"><code>📷 Capture Area ⌘⇧3
🖥 Capture Window ⌘⇧4
🖥 Capture Fullscreen ⌘⇧5
─────────────────────────────
📂 Open Last Screenshot
📁 Open Screenshots Folder
─────────────────────────────
⚙️ Preferences
ℹ️ About
✕ Quit</code></pre><p>The keyboard shortcuts are global — they work even when AJShot is in the background. You&rsquo;re deep in a Zoom call, you need a quick capture, you hit<code>Cmd+Shift+3</code>, draw the box, done. The screenshot is already in your designated folder and on your clipboard before you&rsquo;ve even let go of the mouse.</p><hr><h2 id="core-features">Core Features</h2><p><img src="/images/ajshot-p4.jpg" alt="Panel 4 – Area selection crosshair in action"/><h3 id="capture-modes">Capture Modes</h3><p><strong>Area Capture</strong> (<code>⌘⇧3</code>) — the workhorse. Full-screen overlay appears, cursor becomes a crosshair, you draw a box. Precise. Consistent. No more dragging the wrong direction and getting a sliver of your taskbar.</p><p><strong>Window Capture</strong> (<code>⌘⇧4</code>) — click a window, capture just that window. No background bleed, no accidental desktop icons photobombing your documentation.</p><p><strong>Fullscreen Capture</strong> (<code>⌘⇧5</code>) — captures everything. 
Supports multi-display setups with a &ldquo;capture all&rdquo; or &ldquo;ask each time&rdquo; behaviour you configure once.</p><h3 id="preferences-that-actually-make-sense">Preferences That Actually Make Sense</h3><p><img src="/images/ajshot-p5.jpg" alt="Panel 5 – The Preferences panel, organised and calm"/><p>Open Preferences and you get a clean panel with the options you&rsquo;d actually want to configure:</p><table><thead><tr><th>Setting</th><th>Default</th><th>What it does</th></tr></thead><tbody><tr><td>Launch at login</td><td>On</td><td>AJShot is always ready, even after restart</td></tr><tr><td>Play capture sound</td><td>On</td><td>Satisfying shutter click confirms the capture</td></tr><tr><td>Show thumbnail preview</td><td>On</td><td>Floating preview appears — click to open in editor</td></tr><tr><td>Auto-copy to clipboard</td><td>On</td><td>Screenshot lands in your clipboard automatically</td></tr><tr><td>Save folder</td><td><code>~/Pictures/AJShot</code></td><td>Your own designated screenshot home</td></tr><tr><td>File format</td><td>PNG</td><td>Or JPG with a quality slider</td></tr><tr><td>Filename template</td><td><code>AJShot_{date}_{time}</code></td><td>Consistent, sortable, sane filenames</td></tr></tbody></table><p>The filename templating engine handles the sanitisation itself — no illegal characters, no trailing dots, no 240-character filenames. Files come out like<code>AJShot_2026-05-09_22-14-37.png</code>. You can find them instantly. You can sort them. 
Your Downloads folder stays clean.</p><h3 id="post-capture-flow">Post-Capture Flow</h3><p>After every capture, AJShot runs through a quick sequence:</p><ol><li><strong>Auto-saves</strong> to your configured folder with the filename template applied</li><li><strong>Copies to clipboard</strong> (if enabled) — so you can paste immediately</li><li><strong>Plays the shutter sound</strong> — audio confirmation that the capture happened</li><li><strong>Shows a floating thumbnail</strong> in the corner — click it to open in the editor</li></ol><p>You can also configure the post-capture action to always ask, always edit, always save, or always copy. Once you&rsquo;ve decided how you work, it remembers.</p><hr><h2 id="the-real-problem-it-solves">The Real Problem It Solves</h2><p>Here&rsquo;s the honest comparison:</p><table><thead><tr><th/><th><strong>Native macOS</strong></th><th><strong>AJShot</strong></th></tr></thead><tbody><tr><td>Start capture</td><td><code>Cmd+Shift+4</code>, or <code>Cmd+Shift+5</code> and wait for the toolbar</td><td><code>Cmd+Shift+3</code>, instant overlay</td></tr><tr><td>File location</td><td>Desktop or last used folder</td><td>Always <code>~/Pictures/AJShot</code></td></tr><tr><td>Filename</td><td><code>Screenshot 2026-05-09 at 10.14.22 PM.png</code></td><td><code>AJShot_2026-05-09_22-14-22.png</code></td></tr><tr><td>Clipboard copy</td><td>Manual extra step</td><td>Automatic</td></tr><tr><td>Editor</td><td>Preview.app detour</td><td>Built-in (annotation tools in progress)</td></tr><tr><td>Memory use</td><td>System process</td><td>~50 MB dedicated</td></tr><tr><td>Works in background</td><td>Partial</td><td>Full global hotkeys</td></tr><tr><td>Shutter sound</td><td>Yes</td><td>Yes (configurable)</td></tr></tbody></table><p>The editor is where things get even more interesting. The annotation scaffolding is already built — the tools are there in the codebase: arrows, blur, text, highlights, shapes, step-number callouts.
They&rsquo;re not wired to the UI yet, but they&rsquo;re coming. That&rsquo;s the next milestone.</p><hr><h2 id="tech-stack--whats-actually-under-the-hood">Tech Stack — What&rsquo;s Actually Under the Hood</h2><p><img src="/images/ajshot-p6.jpg" alt="Panel 6 – The Swift codebase, compact and organised"/><p>This is a<strong>native macOS app</strong> built entirely in<strong>Swift 5.9+</strong>, targeting<strong>macOS 12 (Monterey) and later</strong>. The UI is a hybrid of<strong>SwiftUI</strong> and<strong>AppKit</strong> — SwiftUI for the Preferences panel and modern views, AppKit where you need fine-grained control over system behaviour (status bar items, window management, capture overlays).</p><p><strong>Dependencies</strong> (both by<a href="https://github.com/sindresorhus">Sindre Sorhus</a>, the prolific open-source developer behind half the macOS indie tool ecosystem):</p><ul><li><a href="https://github.com/sindresorhus/KeyboardShortcuts"><code>KeyboardShortcuts</code></a> — global hotkey registration that plays nicely with macOS security</li><li><a href="https://github.com/sindresorhus/Defaults"><code>Defaults</code></a> — a type-safe, SwiftUI-friendly wrapper around<code>UserDefaults</code></li></ul><p><strong>Capture pipeline</strong>: uses<code>CGPreflightScreenCaptureAccess</code> and<code>CGRequestScreenCaptureAccess</code> from CoreGraphics for the permission preflight, then<code>ScreenCaptureKit</code> for the actual pixel capture. The permission flow has a proper retry loop — if you&rsquo;ve granted permission but haven&rsquo;t restarted the app, it tells you exactly that and offers a one-click restart.</p><p><strong>Build and distribution</strong>: Swift Package Manager for dependencies, Xcode 15 for the build, and a<code>build-dmg.sh</code> script that produces a signed<code>AJShot-1.0.0.dmg</code> for distribution. The DMG is already built. 
You can drag it to Applications and run it today.</p><p><strong>Security considerations</strong> (because I work in security and these things matter):</p><ul><li>Screenshot folder is set to<code>0700</code> permissions — only you can read it</li><li>Individual screenshot files are<code>0600</code> — same</li><li>Filename template engine strips illegal characters and control characters at the source</li><li>Code signing and notarization stubs are already in the README for when distribution goes wide</li></ul><p>The architecture is clean:<code>App</code> →<code>AppDelegate</code> →<code>StatusBarController</code> →<code>CaptureManager</code> →<code>ScreenshotStorage</code>. Each module does one thing. The<code>FilenameTemplateEngine</code> is a pure static function. The<code>ThumbnailPresenter</code> is decoupled from capture. Claude helped me keep it disciplined.</p><hr><h2 id="the-honest-struggles">The Honest Struggles</h2><p>No developer story is complete without the bit where things go sideways.</p><p><strong>Screen Recording permissions</strong> nearly broke me on the first build. macOS 12+ requires explicit Screen Recording permission in System Settings, and you can&rsquo;t force-trigger the dialog more than once per session. I had to build the multi-stage fallback: first launch triggers Apple&rsquo;s native dialog, subsequent denials open System Settings directly, and if you&rsquo;ve granted permission but not restarted, it detects that state and offers a restart button. It took longer to get that right than the actual capture code.</p><p><strong>Swift Package Manager vs Xcode</strong> had a brief disagreement about the resource bundle for<code>shutter.aiff</code>. The sound file lives in<code>AJShot/Sounds/</code> and has to be declared as a<code>.process("Sounds")</code> resource in<code>Package.swift</code>. Simple when you know it. 
Less simple at 11pm when it just silently fails to play.</p><p><strong>The<code>LaunchAtLogin</code> manager</strong> required writing a<code>LaunchAgent</code> plist to<code>~/Library/LaunchAgents/</code>. Straightforward on paper. In practice, macOS is protective of that folder in ways that aren&rsquo;t well-documented, and the error messages when something goes wrong are the kind that send you to Stack Overflow threads from 2019.</p><p>That&rsquo;s the thing about building a native macOS app — the platform is powerful and the APIs are solid, but the surface area of &ldquo;things that can quietly not work&rdquo; is larger than you&rsquo;d expect. Claude helped me navigate most of it. The frustration was real and so was the progress.</p><hr><h2 id="whats-next">What&rsquo;s Next</h2><p><img src="/images/ajshot-p7.jpg" alt="Panel 7 – &ldquo;It just works.&rdquo; The menu bar icon glows."/><p>The annotation editor is the next major milestone — the scaffolding is already there. Arrow tool, blur tool, text tool, highlight, shapes, step-number callouts. When that&rsquo;s wired up, AJShot becomes a complete Greenshot replacement for macOS: capture, annotate, save, share.</p><p>After that: GitHub release with a public DMG, code signing, notarization, and probably a product page here on CuriousBit.</p><p>If you&rsquo;re a macOS user who&rsquo;s been putting up with the native screenshot workflow out of habit — you don&rsquo;t have to. And if you&rsquo;re a developer (or aspiring one) who thinks you can&rsquo;t build a native app without being a full-time Swift engineer — you can. You really can.</p><hr><h2 id="try-it--share-what-you-use">Try It / Share What You Use</h2><p>The GitHub repo is coming soon — I&rsquo;ll post the link here when it&rsquo;s public. 
In the meantime, if you&rsquo;re curious about the architecture, the Swift source, or want to follow along as the editor gets built, keep an eye on this blog.</p><p>And I&rsquo;m genuinely curious:<strong>what screenshot tool do you use on macOS?</strong> Are you a CleanShot loyalist? A Shottr person? Still using the native tools and somehow thriving? Drop a note — I&rsquo;d love to know.</p><p>Because the best tools are the ones built out of genuine frustration with the alternative. And I was<em>very</em> genuinely frustrated.</p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ajshot-p1.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>automation</category><category>architecture</category><category>Knowledge Base</category></item><item><title>Knowledge Distillation: From Massive Models to Efficient Intelligence</title><link>https://curiousbit.netlify.app/knowledge-distillation-from-massive-models-to-efficient-intelligence/</link><guid isPermaLink="true">https://curiousbit.netlify.app/knowledge-distillation-from-massive-models-to-efficient-intelligence/</guid><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;There is a scene you have probably seen in countless films: a master craftsman, decades of experience locked in his hands, patiently guiding a young apprentice. The master does not hand over a textbook. He transfers something richer — intuition, nuance, an understanding of &lt;em&gt;why&lt;/em&gt; certain choices matter. The apprentice, unburdened by the master&amp;rsquo;s size and slowness, eventually moves faster and in some cases surpasses the teacher entirely.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/kd-master-apprentice.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>There is a scene you have probably seen in countless films: a master craftsman, decades of experience locked in his hands, patiently guiding a young apprentice. The master does not hand over a textbook. He transfers something richer — intuition, nuance, an understanding of<em>why</em> certain choices matter. 
The apprentice, unburdened by the master&rsquo;s size and slowness, eventually moves faster and in some cases surpasses the teacher entirely.</p><p>Knowledge Distillation is that scene, rendered in mathematics.</p><p>Introduced formally by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean at Google in 2015, Knowledge Distillation (KD) is a model compression technique where a large, expensive model — the <strong>teacher</strong> — transfers its learned intelligence to a compact, deployable model — the <strong>student</strong>. In well-documented cases the student retains over 90% of the teacher&rsquo;s accuracy while being many times smaller and faster, and in some reported results up to 100× smaller.</p><p>This article takes you from the intuition all the way through to the advanced variants that are reshaping AI deployment in 2026.</p><hr><h2 id="the-problem-intelligence-is-expensive">The Problem: Intelligence Is Expensive</h2><p>Modern AI models are enormous. GPT-4 is estimated to contain over a trillion parameters. BERT-large has 340 million. These models achieve stunning accuracy — but they are cumbersome to deploy. Running a trillion-parameter model for every user query would require data centres the size of small cities.</p><p>The engineering instinct is to train a smaller model directly. But smaller models trained from scratch on raw data consistently underperform large ones. Why?</p><p>Because raw training data is <em>hard</em>. A cat photo labelled simply &ldquo;cat&rdquo; gives a small model very little to work with. A large model, however, does not just see &ldquo;cat&rdquo; — it sees a distribution of confidence across thousands of classes.
&ldquo;Cat: 0.92, Lynx: 0.06, Tabby: 0.02.&rdquo; That probability distribution is enormously richer than the hard label.</p><p>Hinton called this richer signal<strong>dark knowledge</strong> — the information encoded in what the model<em>almost</em> predicted.</p><hr><h2 id="the-teacher-student-paradigm">The Teacher-Student Paradigm</h2><p><img src="/images/kd-master-apprentice.jpg" alt="A Renaissance master transfers glowing knowledge orbs to his apprentice"/><p>The core idea is elegant. Instead of training the student on raw labelled data, you train it to<strong>mimic the teacher&rsquo;s output distribution</strong>.</p><p>You run every training example through the large teacher model. For each example, instead of a hard label (0 or 1), you collect the teacher&rsquo;s full<strong>soft target</strong> — the probability it assigns to every possible class. You then train the student to produce those same soft probability distributions.</p><p>The student loss function becomes:</p><pre tabindex="0"><code>Loss = α × (cross-entropy with hard labels)
+ (1-α) × (KL divergence from teacher soft targets)</code></pre><p>The blending weight<code>α</code> controls how much the student learns from the raw data versus the teacher&rsquo;s guidance. In practice, a small<code>α</code> (more weight on teacher targets) is usually optimal.</p><hr><h2 id="soft-targets-and-dark-knowledge">Soft Targets and Dark Knowledge</h2><p><img src="/images/kd-dark-knowledge.jpg" alt="A Flemish alchemist distils the essence of a massive Teacher Model flask into a tiny Student Model vial"/><p>Hard labels are binary. Soft targets are continuous. That difference is enormous.</p><p>Consider an image of a dog that slightly resembles a wolf. A hard label says &ldquo;dog: 1, wolf: 0.&rdquo; A teacher that has seen millions of examples says &ldquo;dog: 0.84, wolf: 0.13, fox: 0.03.&rdquo; That residual probability on<em>wolf</em> carries genuine information about the visual ambiguity in the image. The student trained on soft targets learns not just the answer, but the<em>shape of uncertainty</em> around the answer.</p><p>This is the dark knowledge. It lives in the tails of the distribution — the non-zero probabilities on wrong answers — and it makes the student dramatically more robust than one trained on hard labels alone.</p><hr><h2 id="temperature-the-control-knob">Temperature: The Control Knob</h2><p><img src="/images/kd-temperature.jpg" alt="A Renaissance philosopher adjusts the Temperature T dial on a celestial orrery, sharpening planets on the left and softening them to probability clouds on the right"/><p>Soft targets, by default, tend to be very peaked — the teacher is often highly confident in its top prediction, assigning 0.99 to the correct class and tiny residuals to everything else. At that extreme, the soft target is barely different from a hard label, and the dark knowledge disappears.</p><p>Hinton&rsquo;s solution was<strong>temperature scaling</strong>. 
Before computing the softmax, you divide the logits by a temperature parameter T:</p><pre tabindex="0"><code>p_i = exp(z_i / T) / Σ_j exp(z_j / T)</code></pre><p>At <strong>T = 1</strong> (standard), outputs are sharp and peaked.
At <strong>T &gt; 1</strong> (high temperature), outputs become softer and more spread out, revealing the relative confidence structure across all classes.</p><p>During distillation, both teacher and student use the same elevated temperature (typically T = 3–5). This &ldquo;warms up&rdquo; the teacher&rsquo;s output into a richer, more informative distribution for the student to learn from. After training, the student is deployed with T = 1.</p><p>The effect is striking. Higher temperatures expose more inter-class structure, giving the student a better map of the concept landscape rather than just a list of correct answers.</p><hr><h2 id="what-gets-transferred-three-flavours-of-distillation">What Gets Transferred? Three Flavours of Distillation</h2><p>Knowledge can flow from teacher to student in different ways. The research community has converged on three main categories:</p><p><strong>Response-based distillation</strong> — the original Hinton approach. The student matches the teacher&rsquo;s final output layer (soft targets). Simple, effective, widely used.</p><p><strong>Feature-based distillation</strong> — the student is trained to match not just the final output but intermediate representations — specific layers or attention maps inside the teacher. This transfers <em>how</em> the teacher thinks, not just what it concludes. The trade-off is complexity: the teacher and student must have compatible architectures or an adapter layer is needed.</p><p><strong>Relation-based distillation</strong> — the student learns to replicate the <em>relationships</em> between different training examples as the teacher sees them. If the teacher places cat images and dog images in nearby regions of its feature space, the student should too.
This approach is particularly powerful for metric learning and few-shot tasks.</p><hr><h2 id="advanced-variants">Advanced Variants</h2><h3 id="multi-task-distillation">Multi-Task Distillation</h3><p><img src="/images/kd-polymath-student.jpg" alt="A Leonardo da Vinci polymath student simultaneously masters writing, painting, anatomy, and geometry with golden threads connecting all disciplines to a glowing brain"/><p>Microsoft&rsquo;s MT-DNN research showed that distillation composes naturally with multi-task learning. A teacher trained on nine different natural language tasks simultaneously was distilled into a single student model. The distilled MT-DNN outperformed the original on 7 of 9 GLUE benchmark tasks — pushing the single-model state of the art to 83.7%.</p><p>The insight: when a teacher has learned to generalise across many domains, its soft targets encode cross-task structure that a specialised student cannot discover on its own.</p><h3 id="the-teacher-assistant-bridge">The Teacher Assistant Bridge</h3><p>What happens when the teacher and student are so different in capacity that direct distillation fails? A very large teacher produces soft targets the tiny student simply cannot model well.</p><p>The solution is an intermediate<strong>Teacher Assistant (TA)</strong> — a medium-sized model that first distils from the large teacher, then acts as teacher to the small student. The TA bridges the capacity gap, giving the small student a more tractable target. 
Research has consistently shown this staged approach outperforms direct large-to-small distillation when the size gap is more than an order of magnitude.</p><h3 id="when-the-student-surpasses-the-teacher">When the Student Surpasses the Teacher</h3><p><img src="/images/kd-student-exceeds.jpg" alt="A young apprentice stands triumphant as his glowing painting outshines the master&rsquo;s faded work, with the aged teacher bowing respectfully"/><p>One of the most counter-intuitive findings in knowledge distillation is that the student can sometimes<em>exceed</em> the teacher.</p><p>The 2022<strong>Symbolic Knowledge Distillation</strong> paper demonstrated this dramatically. The researchers distilled commonsense reasoning from GPT-3 (175B parameters) into a purpose-built commonsense model at 100× smaller size. The resulting student — COMET-DISTIL — outperformed GPT-3 on commonsense benchmarks.</p><p>How? The distillation process acted as a filter. Rather than transferring all of GPT-3&rsquo;s knowledge, the researchers used a<strong>critic model</strong> to selectively distil only high-quality, high-confidence commonsense triples. The student was not burdened by GPT-3&rsquo;s off-topic knowledge or low-confidence noise. 
It received a curated, concentrated version of the teacher&rsquo;s relevant expertise.</p><p>This is the Renaissance apprentice story made literal: the student, given the master&rsquo;s best knowledge and freed from the master&rsquo;s constraints, eventually does better work.</p><hr><h2 id="real-world-results">Real-World Results</h2><p>It is worth anchoring the claims in numbers.</p><p>In Hinton&rsquo;s original speech recognition experiments on a heavily used commercial system, a distilled single model <strong>matched the word error rate of a 10-model ensemble</strong> while requiring roughly one-tenth the compute at inference time.</p><p>The figures from that experiment:</p><table><thead><tr><th>Model</th><th>Frame Accuracy</th><th>Word Error Rate</th></tr></thead><tbody><tr><td>Baseline (single model)</td><td>58.9%</td><td>10.9%</td></tr><tr><td>10× model ensemble (teacher)</td><td>61.1%</td><td>10.7%</td></tr><tr><td>Distilled single student</td><td><strong>60.8%</strong></td><td><strong>10.7%</strong></td></tr></tbody></table><p>The student matches the ensemble&rsquo;s word error rate and comes within 0.3 points of its frame accuracy, at a fraction of the cost. This is the central promise of KD — and it has held up across vision, language, and speech for over a decade.</p><hr><h2 id="why-this-matters-in-2026">Why This Matters in 2026</h2><p>Knowledge distillation is no longer just a research technique. It is infrastructure.</p><p>Many of the on-device AI models you use every day — the language models on your phone, the vision models in your camera, the wake-word detectors in your earbuds — were very likely shaped by distillation from a much larger model. DistilBERT and Distil-Whisper are direct products of distillation, and compact architectures such as MobileNet are common choices for the student.</p><p>The technique is also central to the LLM compression wave of the past two years. Small models such as Phi-3, Mistral Small, and Gemma belong to that wave, and distillation or teacher-generated training data is reported to play a part in several of them.
The goal: deliver GPT-4-class reasoning in a model small enough to run locally, privately, and cheaply.</p><p>And symbolic distillation — transferring knowledge as structured text rather than as neural activations — is opening entirely new territory, allowing language model intelligence to flow into specialised domain models that do not even share the same architecture.</p><hr><h2 id="a-practical-starting-point">A Practical Starting Point</h2><p>If you want to experiment with knowledge distillation today:</p><p><strong>For response-based KD in PyTorch</strong>, the training loop change is minimal — replace your standard cross-entropy loss with the blended loss described above and pass the teacher&rsquo;s logits alongside the hard labels.</p><p><strong>For NLP tasks</strong>, Hugging Face&rsquo;s <code>transformers</code> library includes DistilBERT as a reference distilled model with its training recipe documented.</p><p><strong>For vision</strong>, the official PyTorch knowledge distillation tutorial, which works through an image-classification example, is the fastest on-ramp.</p><p>The key design decisions are: the temperature T (start at 4), the blending weight α (start at 0.5, then try lower values to lean harder on the teacher), and whether you need feature-based or response-based transfer (response-based first, feature-based if accuracy is still insufficient).</p><hr><p>The master-apprentice metaphor is more than decorative. Knowledge distillation encodes a genuine pedagogical insight: that the richer the guidance a learner receives, the more efficiently it reaches competence. The hard labels of raw data are the equivalent of telling a student the answer. The soft targets of a teacher model are the equivalent of showing them how to think.</p><p>That distinction — answer versus thinking — is what makes knowledge distillation one of the most elegant ideas in modern machine learning.</p>
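<p>For readers who want to see the moving parts, the temperature-scaled softmax and the blended loss from earlier in this article can be sketched in a few lines of framework-free Python. This is a minimal illustration under stated assumptions, not a training recipe: the function names <code>softmax</code> and <code>distillation_loss</code> and the example logits are hypothetical, the defaults (T = 4, α = 0.5) are simply the starting values suggested above, and a real PyTorch loop would compute the same quantities on batched tensors with the teacher frozen.</p>

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=4.0):
    """Blend hard-label cross-entropy with KL divergence from the teacher."""
    # Hard-label term: ordinary cross-entropy, computed at T = 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[hard_label])

    # Soft-target term: teacher and student are softened with the same T.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_soft, student_soft) if t > 0)

    # alpha weights the raw data; (1 - alpha) weights the teacher's guidance.
    return alpha * cross_entropy + (1 - alpha) * kl


# Hypothetical logits for the classes (dog, wolf, fox); not taken from a real model.
teacher_logits = [6.0, 4.0, 1.0]
student_logits = [2.0, 1.0, 0.5]

print(softmax(teacher_logits, temperature=1.0))  # peaked distribution
print(softmax(teacher_logits, temperature=4.0))  # softer distribution
print(distillation_loss(student_logits, teacher_logits, hard_label=0))
```

<p>With these example logits the teacher&rsquo;s top-class probability falls from roughly 0.88 at T = 1 to roughly 0.53 at T = 4, which is the softening effect the temperature is meant to produce. Hinton and colleagues also recommend multiplying the soft-target term by T² when both terms are used, so that its gradients stay comparable in size as T changes; that factor is left out here to match the formula as written in this article.</p>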
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/kd-master-apprentice.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>machine-learning</category><category>model-compression</category><category>deep-learning</category><category>Knowledge Base</category></item><item><title>I Built This Blog Without Writing a Single Line of Code (Almost)</title><link>https://curiousbit.netlify.app/i-built-this-blog-without-writing-a-single-line-of-code-almost/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-this-blog-without-writing-a-single-line-of-code-almost/</guid><pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;It started like most late-night ideas in Sydney — a MacBook, too much coffee, and a thought that refused to go away.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://curiousbit.netlify.app/images/comic-p1.jpg" alt="Panel 1 – Sydney, 11:47 PM. The idea hits."&gt;&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/comic-p1.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>It started like most late-night ideas in Sydney — a MacBook, too much coffee, and a thought that refused to go away.</p><p><img src="/images/comic-p1.jpg" alt="Panel 1 – Sydney, 11:47 PM. The idea hits."/><p>I wanted a tech blog. A proper one — clean, fast, with a Knowledge Base, Videos section, and a place to put thoughts on AI and enterprise tech. The problem? Zero design skills. Zero frontend experience. I know infrastructure, architecture, platforms — but CSS gives me a headache.</p><hr><h2 id="the-crazy-idea">The Crazy Idea</h2><p>Then it clicked.</p><p><img src="/images/comic-p2.png" alt="Panel 2 – What if I use Claude and Codex to build the whole thing?"/><p>What if I just described the site I wanted — in plain English — and let AI build it? Not generate a snippet. Build the<em>whole thing</em>. Layouts, config, content, deployment config. Everything.</p><p>The prompt I used was simple but specific:</p><blockquote><p><em>&ldquo;Make me a clean HackerNoon-style Hugo + Tailwind blog with KB, Videos, News &amp; Views, and About pages. Deploy-ready for Netlify.&rdquo;</em></p></blockquote><hr><h2 id="the-stack">The Stack</h2><p><img src="/images/comic-p3.png" alt="Panel 3 – The prompt fires. Hugo, Tailwind, and a full project structure come back."/><p>The AI returned a complete project in one shot. Here&rsquo;s what it chose and why it makes sense:</p><p><strong>Hugo</strong> — a static site generator written in Go. Blazing fast builds, no database, no server to maintain. Perfect for a content blog.</p><p><strong>Tailwind CSS</strong> — utility-first CSS. Instead of writing stylesheets, you compose classes directly in HTML. 
The AI could reason about it well and generate clean, consistent UI.</p><p><strong>Netlify</strong> — one-click deployment from a GitHub repo. Push to<code>main</code>, site rebuilds automatically. Free tier covers everything a personal blog needs.</p><p><strong>GitHub</strong> — version control and the bridge between local edits and live site.</p><p>The structure it generated:<code>content/</code>,<code>layouts/</code>,<code>assets/</code>,<code>static/</code>,<code>config.yaml</code>,<code>netlify.toml</code>. Exactly what you&rsquo;d expect from an experienced Hugo developer.</p><hr><h2 id="first-roadblock--hit-in-under-5-minutes">First Roadblock — Hit in Under 5 Minutes</h2><p><img src="/images/comic-p4.png" alt="Panel 4 – zsh: command not found: brew. What the&hellip; brew?!"/><p>Real talk — the first thing I hit was a terminal error before I&rsquo;d even installed anything.<code>zsh: command not found: brew</code>. Homebrew wasn&rsquo;t on the machine.</p><p>At first I had no idea what I was looking at.</p><p><img src="/images/comic-p5.png" alt="Panel 5 – Small wins matter. At least I can read the error now."/><p>But here&rsquo;s the thing about working with AI tools — you learn to debug faster because you can ask<em>why</em> something failed, not just<em>how</em> to fix it. Within a few minutes I understood: Homebrew is the Mac package manager. I needed it to install Hugo. Small win — I could read what was going wrong.</p><hr><h2 id="getting-hugo-running">Getting Hugo Running</h2><p><img src="/images/comic-p6.png" alt="Panel 6 – brew install hugo → Installed. 
I&rsquo;m unstoppable!"/><p>Three commands and Hugo was running:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">/bin/bash -c<span class="s2">"</span><span class="k">$(</span>curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh<span class="k">)</span><span class="s2">"</span></span></span><span class="line"><span class="cl">brew install hugo</span></span><span class="line"><span class="cl">hugo server</span></span></code></pre></div><p>The site was alive at<code>localhost:1313</code> — navigation, article cards, dark mode toggle, the works.</p><hr><h2 id="the-project-structure">The Project Structure</h2><p><img src="/images/comic-p7.png" alt="Panel 7 – The full folder structure. It actually generated everything perfectly."/><p>The generated project was clean and logical:</p><ul><li><code>content/posts/</code> — articles as Markdown files with YAML frontmatter</li><li><code>layouts/</code> — Hugo HTML templates for each page type</li><li><code>assets/css/</code> — Tailwind CSS, processed at build time</li><li><code>static/</code> — images and files served as-is</li><li><code>config.yaml</code> — site title, menus, author, base URL</li><li><code>netlify.toml</code> — build command, publish directory, Node version</li></ul><p>Hugo reads all frontmatter at build time and generates a completely static site. No database. No server-side rendering. No WordPress. Just fast HTML files.</p><hr><h2 id="real-developer-moments">Real Developer Moments</h2><p><img src="/images/comic-p8.png" alt="Panel 8 – How do I copy the path again?! Real developer moments."/><p>Not everything was smooth. 
There were genuine &ldquo;how does anyone actually do this&rdquo; moments — forgetting how to copy a file path in Finder, templates not updating because of Hugo&rsquo;s build cache, Git authentication failing because GitHub dropped password authentication for Git operations in August 2021.</p><p>These are the moments tutorials skip. The AI handled every one of them.</p><hr><h2 id="deployed-its-alive">Deployed. It&rsquo;s Alive.</h2><p><img src="/images/comic-p9.png" alt="Panel 9 – Site Deployed Successfully. IT&rsquo;S ALIVE!!!"/><p>Getting it live was surprisingly painless:</p><ol><li>Push the project to a GitHub repo</li><li>Connect the repo to Netlify</li><li>Set build command to <code>npm run build</code>, publish directory to <code>public/</code></li><li>Hit deploy</li></ol><p>Netlify pulled the code, ran Hugo, and published the static site in under 30 seconds. Every <code>git push</code> from that point triggers an automatic rebuild.</p><hr><h2 id="what-the-live-site-looks-like">What the Live Site Looks Like</h2><p><img src="/images/comic-p10.png" alt="Panel 10 – The live curiousbit site — KB, Videos, rotating content."/><p>The site has:</p><ul><li><strong>Homepage</strong> — featured article hero, rotating video cards, Knowledge Base grid</li><li><strong>KB</strong> — long-form technical articles in Markdown</li><li><strong>Videos</strong> — auto-synced from a YouTube playlist via the YouTube Data API v3. A Node.js script fetches the playlist at build time, writes <code>videos.json</code> to <code>static/</code>, and the browser loads it at runtime. Add a video to YouTube, trigger a deploy, and it appears on the site.</li><li><strong>About</strong> — a simple bio</li></ul><p>The video section was the most technically interesting piece. No manual uploads, no embeds to maintain — just a playlist that feeds itself.</p><hr><h2 id="the-takeaway">The Takeaway</h2><p><img src="/images/comic-p11.png" alt="Panel 11 – You don&rsquo;t need to be a frontend expert. 
Just good prompts and persistence."/><p>This entire site — layouts, templates, CSS, YouTube integration, Git workflow, deployment pipeline — was built by someone with no frontend background. What I brought was the ability to describe what I wanted clearly, debug errors methodically, and push through the friction points.</p><p>The tools did the heavy lifting. The judgment about <em>what</em> to build was still mine.</p><hr><h2 id="now-its-your-turn">Now It&rsquo;s Your Turn</h2><p><img src="/images/comic-p12.png" alt="Panel 12 – Want to see the full process? Now it&rsquo;s your turn!"/><p>The stack is: <strong>Hugo + Tailwind + GitHub + Netlify + Claude</strong>. All free. All production-grade.</p><p>Start with a prompt. Describe what you want clearly. Expect errors. Fix them one at a time. Deploy early and often.</p><p>The gap between &ldquo;I have an idea&rdquo; and &ldquo;it&rsquo;s live on the internet&rdquo; has never been smaller.</p><hr><p><em>The full source for this site is on GitHub at <a href="https://github.com/ibn-Battuta/AjayW_blog">ibn-Battuta/AjayW_blog</a>.</em></p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/comic-p1.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>Knowledge Base</category></item><item><title>Camera Roll to Caption — Python Pipeline, Vision Model for Photo Tags</title><link>https://curiousbit.netlify.app/camera-roll-to-caption-python-pipeline-vision-model-for-photo-tags/</link><guid isPermaLink="true">https://curiousbit.netlify.app/camera-roll-to-caption-python-pipeline-vision-model-for-photo-tags/</guid><pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Vision models, language models, and most other generative systems are confident-but-wrong some non-trivial fraction of the time. The instinct is to fix that with better prompts, bigger models, or smarter agents. The cheaper move is usually to add a small structured review seam — a thirty-second checkpoint where a human can glance, correct, and move on.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ctp/bottlebrush.jpg" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Vision models, language models, and most other generative systems are confident-but-wrong some non-trivial fraction of the time. The instinct is to fix that with better prompts, bigger models, or smarter agents. The cheaper move is usually to add a small structured review seam — a thirty-second checkpoint where a human can glance, correct, and move on.</p><p>This post is the case study for one such seam, dropped into a build I needed for myself. Of 35 garden photos handed to a vision model,<strong>74% came back with correct first-pass labels</strong>. After thirty seconds editing a CSV,<strong>97% were acceptable to publish</strong>. Total API cost:<strong>$0.18</strong>. 
Total inference time: <strong>~74 seconds</strong> at 2.1 sec/photo on <code>gpt-4o-mini</code>. The CSV was the highest-leverage code in the project — and it isn&rsquo;t really code.</p><p>Here&rsquo;s the story.</p><h2 id="the-annoyance">The annoyance</h2><p>It was a Saturday afternoon in early March. I&rsquo;d come back from a walk around the garden with thirty-five photos on my iPhone — bottlebrush in full red, honeysuckle dripping with rain, a lilly-pilly cluster doing its outrageous pink thing, and at least one inexplicable shot of an old railway station I&rsquo;d passed on the way home.</p><p>I wanted to post a handful of them with consistent little hashtag labels — <code>#bottlebrush</code>, <code>#honeysuckle</code>, <code>#flower</code> — burned into the corner like a quiet caption. Not a watermark, not a filter, just a small readable pill that says &ldquo;this is what you&rsquo;re looking at.&rdquo;</p><p>What I didn&rsquo;t want was to open each HEIC in Preview, draw a text box, fiddle with the font, export, repeat thirty-five times. So I did the only reasonable thing: I wrote a small Python tool that does it for me.</p><p><img src="/images/ctp/bottlebrush.jpg" alt="Bottlebrush hero — red Australian bottlebrush flower with a #bottlebrush hashtag pill in the bottom-right corner"/><h2 id="the-shape-of-the-pipeline">The shape of the pipeline</h2><pre tabindex="0"><code>┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Folder  │──▶│  Vision  │──▶│   CSV    │──▶│  Apply   │──▶│  Tagged  │
│  photos  │   │ provider │   │  review  │   │  + pill  │   │  output  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
                                   ↑
                         human-in-the-loop seam

--mode propose:  folder ─▶ vision provider ─▶ CSV
--mode apply:    CSV ─▶ render + pill ─▶ tagged output</code></pre><p>The minimal product was easy to describe. Point the script at a folder. For each image — HEIC, JPG, PNG, whatever the iPhone or my camera roll throws at it — open it, figure out what&rsquo;s in it, draw a small rounded hashtag pill into the bottom-right corner, save the result to a <code>tagged_output/</code> subfolder. No watermark across the centre of the image, no filter or colour grade, no destructive edit to the original, and no making me choose the label by hand when a vision model can have a decent first guess.</p><p>That last point is where the design got interesting.</p><h2 id="the-seam">The seam</h2><p>You could write this as a single command: walk the folder, ask the model, render the tag, done. I tried that first. The first-pass run produced a folder of beautifully tagged images, about a quarter of which were wrong in some quietly maddening way — a daisy called <code>#flower</code>, a fern called <code>#leaves</code>, the railway station called, charitably, <code>#station</code>.</p><p>So the script runs in two passes.</p><p><code>--mode propose</code> opens each image, hands it to the vision model, and writes a CSV with five columns:</p><pre tabindex="0"><code>image_path, label, score, suggested_tag, final_tag</code></pre><p><code>final_tag</code> is initialised to <code>suggested_tag</code>, but the whole point of the column is that you can edit it. Open the CSV, glance down the list, fix anything obvious — <code>flower</code> becomes <code>morning_glory</code>, <code>leaves</code> becomes <code>bamboo</code> — save, close. On this batch, <strong>9 of 35 rows needed editing</strong> (a daisy, the railway station, two ferns, the bamboo, and four generic-flower fallbacks). A thirty-second pass.</p><p><code>--mode apply</code> then reads the CSV row by row and renders the tag using whatever&rsquo;s in <code>final_tag</code>. The CSV is the human-in-the-loop seam. 
It is much cheaper than re-running inference, and it catches the cases where the model was right about the genus but wrong about the species, or just wrong.</p><h2 id="three-providers-one-interface">Three providers, one interface</h2><p>I didn&rsquo;t want to commit to one vision model — the price/quality trade-offs are too lively right now. The script supports three providers behind one interface, picked via <code>--provider local|openai|xai</code>.</p><p><strong>Local CLIP.</strong> Hugging Face&rsquo;s <code>openai/clip-vit-large-patch14</code> against a fixed candidate list. Free, offline, <strong>~0.4 sec/photo on an M3 Pro</strong>. The cost is breadth: anything outside the candidate list collapses to the nearest match. CLIP doesn&rsquo;t know what a bottlebrush is unless I tell it the word.</p><p><strong>OpenAI.</strong> <code>gpt-4o-mini</code> by default, with an opt-in <code>--high-accuracy</code> flag that retries low-confidence cases (under 0.72) on <code>gpt-4o</code>. <strong>~2.1 sec/photo, ~$0.18 for the 35-photo batch.</strong> The labels are open-ended, which is how <code>bottlebrush</code>, <code>honeysuckle</code>, <code>fern</code>, and <code>berries</code> ended up in the CSV rather than <code>flower</code>, <code>flower</code>, <code>leaves</code>, <code>fruit</code>. <strong>22% of the batch tripped the retry threshold</strong> and went to <code>gpt-4o</code>.</p><p><img src="/images/ctp/berries.jpg" alt="Hot-pink lilly-pilly berries tagged #berries — an example of gpt-4o-mini producing a specific label rather than the generic &ldquo;fruit&rdquo;"/><p><strong>xAI Grok.</strong> Same OpenAI-compatible client, pointed at <code>api.x.ai</code> with <code>grok-2-vision-latest</code>. 
Useful if you&rsquo;re already on the x.ai stack or want a different model family&rsquo;s vote.</p><p>The mental model: local CLIP for batch-of-a-hundred-photos-on-a-flight, OpenAI as the daily driver, and the high-accuracy retry for exactly the case where the model says &ldquo;flower&rdquo; with 0.55 confidence and I want it to look harder before I have to.</p><p>The blue morning glory below is what generic labels look like in practice — still a decent fallback, just unspecific. The model wasn&rsquo;t wrong; it just wasn&rsquo;t curious.</p><p><img src="/images/ctp/morning-glory.jpg" alt="Blue morning glory tagged #flower — an example of the model falling back to a generic label even with the specific species clearly visible"/><h2 id="two-small-touches">Two small touches</h2><p>Two design choices are the difference between &ldquo;the script works&rdquo; and &ldquo;the output looks intentional.&rdquo;</p><p><strong>Style-aware contrast.</strong> The pill needs to be readable on both a bright sky and dark foliage. The script crops the bottom-right region of the image, measures the mean luminance using the standard Rec. 
709 weights, and flips the colour scheme above or below a threshold:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">ImageStat</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">style_aware_colors</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Sample the bottom-right corner, where the pill is drawn.</span>
</span></span><span class="line"><span class="cl">    <span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">size</span>
</span></span><span class="line"><span class="cl">    <span class="n">crop</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">crop</span><span class="p">((</span><span class="nb">int</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="mf">0.68</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">h</span> <span class="o">*</span> <span class="mf">0.80</span><span class="p">),</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">ImageStat</span><span class="o">.</span><span class="n">Stat</span><span class="p">(</span><span class="n">crop</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s2">"RGB"</span><span class="p">))</span><span class="o">.</span><span class="n">mean</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="c1"># Rec. 709 luma weights applied to the mean channel values (0-255 scale).</span>
</span></span><span class="line"><span class="cl">    <span class="n">luminance</span> <span class="o">=</span> <span class="mf">0.2126</span> <span class="o">*</span> <span class="n">r</span> <span class="o">+</span> <span class="mf">0.7152</span> <span class="o">*</span> <span class="n">g</span> <span class="o">+</span> <span class="mf">0.0722</span> <span class="o">*</span> <span class="n">b</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">luminance</span> <span class="o">&lt;</span> <span class="mi">140</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">245</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span>  <span class="c1"># white text, dark pill</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">245</span><span class="p">),</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span>  <span class="c1"># black text, light pill</span>
</span></span></code></pre></div><p>Eight lines of PIL, plus the import. In this batch every photo sampled dark — gardens are mostly green and shadow in the corner — so every output got the dark pill. The light-pill branch is still there, waiting for a photo with sky or a light wall in the corner.</p><p><strong>Save with fallback.</strong> HEIC writes occasionally fail for reasons that aren&rsquo;t worth diagnosing in a personal tool. 
The save function tries the original format first; if PIL throws, it quietly drops to JPEG with the same filename stem. Eight more lines. On this batch, <strong>3 of 35 fell back to JPEG</strong>. Without the fallback those three would have been a stack trace and a half-finished folder. With it, thirty-five of thirty-five made it through.</p><h2 id="what-id-add-next">What I&rsquo;d add next</h2><p>Multi-tag support, so a photo can be <code>#lorikeet #bottlebrush</code> when the bird showed up in the bottlebrush. EXIF preservation through the round-trip — right now PIL strips most of the metadata, which I don&rsquo;t love. A tiny review UI to replace the CSV step, either a Tkinter window or a one-page localhost app. Smarter candidate lists for the local provider, scoped by season or geography — Sydney summer has a different vocabulary than European spring.</p><p>None of these are urgent enough to displace &ldquo;the script already does what I wanted.&rdquo;</p><h2 id="closing-observations">Closing observations</h2><p>Three lessons that generalise beyond this script.</p><p><strong>Human-in-the-loop is cheap and underrated.</strong> The CSV seam between propose and apply takes thirty seconds per batch and saves me from confidently wrong outputs. For any task where a model is confident-but-wrong some non-trivial fraction of the time — RAG, codegen, moderation, enterprise copilots, agentic workflows — a structured review step pays for itself almost immediately. The CSV doesn&rsquo;t have to be elegant. It has to exist.</p><p><strong>Pluggable providers are worth the small abstraction tax even on personal tools.</strong> I went from local CLIP to <code>gpt-4o-mini</code> to Grok in the space of one afternoon without rewriting the rendering code. The interface is <code>(client, model, image) → (label, score)</code> and that&rsquo;s it. 
Once you&rsquo;ve paid that cost, you can keep up with a fast-moving model market essentially for free.</p><p><strong>Small touches decide whether a script feels finished.</strong> Luminance-aware contrast and a save-format fallback don&rsquo;t change what the tool does; they change how the output reads.</p><p>The model wasn&rsquo;t the product. The seam was.</p><hr><p>A short reel of the tagged photos in the wild: <a href="https://www.instagram.com/reel/DVhMhwIE9YKZ-4xtOHeud3rY1IO2x_3OeGzr9M0/">Instagram story</a>.</p>
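<p>For readers who want to try the propose/apply seam described in this post, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the script&rsquo;s actual code: the function names <code>to_tag</code>, <code>propose</code> and <code>apply_tags</code> are hypothetical, a provider is assumed to be any callable that takes an image path and returns <code>(label, score)</code>, and <code>render</code> stands in for whatever draws the pill. Only the five column names and the 0.72 retry threshold come from the post.</p>

```python
import csv
from pathlib import Path

# Column layout quoted in the post.
COLUMNS = ["image_path", "label", "score", "suggested_tag", "final_tag"]
# Retry threshold quoted in the post for the high-accuracy flag.
RETRY_BELOW = 0.72


def to_tag(label):
    """Hypothetical helper: turn a free-text label into a hashtag-safe token."""
    return "_".join(label.lower().split())


def propose(images, provider, csv_path, fallback=None):
    """Pass 1: collect model guesses and write them out for human review.

    provider and fallback are assumed to be callables: path -> (label, score).
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        for path in images:
            label, score = provider(path)
            # Ask the stronger model only when the first guess is unsure.
            if fallback is not None and score < RETRY_BELOW:
                label, score = fallback(path)
            tag = to_tag(label)
            writer.writerow({
                "image_path": str(path),
                "label": label,
                "score": f"{score:.2f}",
                "suggested_tag": tag,
                "final_tag": tag,  # starts as the suggestion; the reviewer edits this
            })


def apply_tags(csv_path, render):
    """Pass 2: render whatever the reviewer left in final_tag; returns the row count."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            render(Path(row["image_path"]), "#" + row["final_tag"])
            count += 1
    return count
```

<p>Because the second pass reads only <code>final_tag</code>, correcting a label never touches the model again: edit the cell, save, and re-run the apply step.</p>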
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ctp/bottlebrush.jpg" medium="image"><media:title type="plain">Posts</media:title></media:content><category>automation</category><category>artificial-intelligence</category><category>build-log</category><category>engineering</category><category>Knowledge Base</category></item><item><title>A Field Guide to AI Chips</title><link>https://curiousbit.netlify.app/a-field-guide-to-ai-chips/</link><guid isPermaLink="true">https://curiousbit.netlify.app/a-field-guide-to-ai-chips/</guid><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Cinzel:wght@500;700;900&amp;family=IM+Fell+English:ital@0;1&amp;family=Inter:wght@400;500;600&amp;display=swap');
.chip-guide {
--bg: #0a1424;
--surface: #14253f;
--surface2: #1c2d4f;
--border: #3a4a6a;
--bronze: #b87333;
--bronze-bright: #d4a04a;
--gold: #e5b85e;
--parchment: #e8e0d0;
--muted: #9aa6bf;
--accent-purple: #8b5cf6;
--accent-teal: #2dd4bf;
color: var(--parchment);
font-family: 'Inter', sans-serif;
font-size: 17px;
line-height: 1.75;
}
.chip-guide h1, .chip-guide h2, .chip-guide h3 {
font-family: 'Cinzel', serif;
letter-spacing: 0.04em;
color: var(--gold);
font-weight: 700;
margin-top: 2.5rem;
margin-bottom: 1rem;
}
.chip-guide h1 {
font-size: clamp(2.2rem, 4vw, 3.2rem);
font-weight: 900;
background: linear-gradient(135deg, #e5b85e 0%, #d4a04a 50%, #b87333 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
text-align: center;
margin: 0.5rem 0 0.5rem;
}
.chip-guide h2 {
font-size: clamp(1.4rem, 2.2vw, 2rem);
border-bottom: 2px double var(--bronze);
padding-bottom: 0.4rem;
}
.chip-guide h3 {
font-size: clamp(1.2rem, 1.6vw, 1.5rem);
color: var(--bronze-bright);
}
.chip-guide p { margin: 0.6rem 0 1rem; }
.chip-guide strong { color: var(--gold); }
.chip-guide a { color: var(--bronze-bright); }
.chip-guide a:hover { color: var(--gold); }
.chip-guide .hero {
text-align: center;
padding: 2.5rem 0 1.5rem;
border-bottom: 1px solid var(--border);
margin-bottom: 2rem;
}
.chip-guide .eyebrow {
display: inline-block;
font-family: 'Cinzel', serif;
font-size: 0.78rem;
letter-spacing: 0.35em;
color: var(--bronze-bright);
text-transform: uppercase;
margin-bottom: 0.6rem;
}
.chip-guide .subtitle {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: clamp(1.05rem, 1.4vw, 1.3rem);
color: var(--muted);
margin: 0.8rem auto 0;
max-width: 720px;
}
.chip-guide .ornament {
text-align: center;
margin: 1.2rem 0;
color: var(--bronze);
font-size: 1.3rem;
letter-spacing: 0.6rem;
}
.chip-guide table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
font-size: 0.95rem;
background: var(--surface);
border: 1px solid var(--border);
}
.chip-guide table th {
background: var(--surface2);
color: var(--gold);
font-family: 'Cinzel', serif;
text-align: left;
padding: 0.7rem 1rem;
border-bottom: 2px solid var(--bronze);
font-weight: 600;
letter-spacing: 0.04em;
font-size: 0.92rem;
}
.chip-guide table td {
padding: 0.7rem 1rem;
border-bottom: 1px solid var(--border);
vertical-align: top;
color: var(--parchment);
}
.chip-guide table tr:last-child td { border-bottom: none; }
.chip-guide table tr:hover td { background: rgba(229, 184, 94, 0.04); }
.chip-guide .stat-block {
background: var(--surface);
border: 2px solid var(--bronze);
border-radius: 4px;
margin: 1.5rem 0;
font-family: 'Inter', sans-serif;
}
.chip-guide .stat-block-header {
padding: 0.55rem 1rem;
background: var(--surface2);
border-bottom: 2px solid var(--bronze);
font-family: 'Cinzel', serif;
font-weight: 700;
color: var(--gold);
letter-spacing: 0.05em;
font-size: 1.05rem;
text-transform: uppercase;
}
.chip-guide .stat-block dl {
padding: 0.7rem 1rem 0.8rem;
margin: 0;
display: grid;
grid-template-columns: 8rem 1fr;
row-gap: 0.4rem;
column-gap: 0.6rem;
}
.chip-guide .stat-block dt {
font-weight: 700;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-size: 0.82rem;
letter-spacing: 0.05em;
text-transform: uppercase;
margin: 0;
}
.chip-guide .stat-block dd {
margin: 0;
color: var(--parchment);
font-size: 0.94rem;
}
.chip-guide .read-more {
display: inline-block;
margin-top: 0.4rem;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-weight: 700;
text-decoration: none;
border-bottom: 1px solid var(--bronze);
padding-bottom: 1px;
letter-spacing: 0.05em;
font-size: 0.92rem;
}
.chip-guide .read-more:hover {
color: var(--gold);
border-bottom-color: var(--gold);
}
.chip-guide .quote {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: 1.18rem;
color: var(--parchment);
background: rgba(184, 115, 51, 0.08);
border-left: 4px solid var(--bronze);
padding: 1rem 1.5rem;
margin: 1.5rem 0;
}
.chip-guide ul { padding-left: 1.4rem; }
.chip-guide ul li { margin: 0.3rem 0; }
&lt;/style&gt;
&lt;div class="chip-guide"&gt;
&lt;div class="hero"&gt;
&lt;div class="eyebrow"&gt;A Field Guide · 2026&lt;/div&gt;
&lt;h1 id="a-field-guide-to-ai-chips"&gt;A Field Guide to AI Chips&lt;/h1&gt;
&lt;div class="subtitle"&gt;Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.&lt;/div&gt;
&lt;div class="ornament"&gt;❦ ❦ ❦&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a &amp;ldquo;where you&amp;rsquo;ll meet it&amp;rdquo; entry for each. Each section links to a deeper entry for the curious.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ai-chips/field-guide-hero.png" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Cinzel:wght@500;700;900&family=IM+Fell+English:ital@0;1&family=Inter:wght@400;500;600&display=swap');
.chip-guide {
--bg: #0a1424;
--surface: #14253f;
--surface2: #1c2d4f;
--border: #3a4a6a;
--bronze: #b87333;
--bronze-bright: #d4a04a;
--gold: #e5b85e;
--parchment: #e8e0d0;
--muted: #9aa6bf;
--accent-purple: #8b5cf6;
--accent-teal: #2dd4bf;
color: var(--parchment);
font-family: 'Inter', sans-serif;
font-size: 17px;
line-height: 1.75;
}
.chip-guide h1, .chip-guide h2, .chip-guide h3 {
font-family: 'Cinzel', serif;
letter-spacing: 0.04em;
color: var(--gold);
font-weight: 700;
margin-top: 2.5rem;
margin-bottom: 1rem;
}
.chip-guide h1 {
font-size: clamp(2.2rem, 4vw, 3.2rem);
font-weight: 900;
background: linear-gradient(135deg, #e5b85e 0%, #d4a04a 50%, #b87333 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
text-align: center;
margin: 0.5rem 0 0.5rem;
}
.chip-guide h2 {
font-size: clamp(1.4rem, 2.2vw, 2rem);
border-bottom: 2px double var(--bronze);
padding-bottom: 0.4rem;
}
.chip-guide h3 {
font-size: clamp(1.2rem, 1.6vw, 1.5rem);
color: var(--bronze-bright);
}
.chip-guide p { margin: 0.6rem 0 1rem; }
.chip-guide strong { color: var(--gold); }
.chip-guide a { color: var(--bronze-bright); }
.chip-guide a:hover { color: var(--gold); }
.chip-guide .hero {
text-align: center;
padding: 2.5rem 0 1.5rem;
border-bottom: 1px solid var(--border);
margin-bottom: 2rem;
}
.chip-guide .eyebrow {
display: inline-block;
font-family: 'Cinzel', serif;
font-size: 0.78rem;
letter-spacing: 0.35em;
color: var(--bronze-bright);
text-transform: uppercase;
margin-bottom: 0.6rem;
}
.chip-guide .subtitle {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: clamp(1.05rem, 1.4vw, 1.3rem);
color: var(--muted);
margin: 0.8rem auto 0;
max-width: 720px;
}
.chip-guide .ornament {
text-align: center;
margin: 1.2rem 0;
color: var(--bronze);
font-size: 1.3rem;
letter-spacing: 0.6rem;
}
.chip-guide table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
font-size: 0.95rem;
background: var(--surface);
border: 1px solid var(--border);
}
.chip-guide table th {
background: var(--surface2);
color: var(--gold);
font-family: 'Cinzel', serif;
text-align: left;
padding: 0.7rem 1rem;
border-bottom: 2px solid var(--bronze);
font-weight: 600;
letter-spacing: 0.04em;
font-size: 0.92rem;
}
.chip-guide table td {
padding: 0.7rem 1rem;
border-bottom: 1px solid var(--border);
vertical-align: top;
color: var(--parchment);
}
.chip-guide table tr:last-child td { border-bottom: none; }
.chip-guide table tr:hover td { background: rgba(229, 184, 94, 0.04); }
.chip-guide .stat-block {
background: var(--surface);
border: 2px solid var(--bronze);
border-radius: 4px;
margin: 1.5rem 0;
font-family: 'Inter', sans-serif;
}
.chip-guide .stat-block-header {
padding: 0.55rem 1rem;
background: var(--surface2);
border-bottom: 2px solid var(--bronze);
font-family: 'Cinzel', serif;
font-weight: 700;
color: var(--gold);
letter-spacing: 0.05em;
font-size: 1.05rem;
text-transform: uppercase;
}
.chip-guide .stat-block dl {
padding: 0.7rem 1rem 0.8rem;
margin: 0;
display: grid;
grid-template-columns: 8rem 1fr;
row-gap: 0.4rem;
column-gap: 0.6rem;
}
.chip-guide .stat-block dt {
font-weight: 700;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-size: 0.82rem;
letter-spacing: 0.05em;
text-transform: uppercase;
margin: 0;
}
.chip-guide .stat-block dd {
margin: 0;
color: var(--parchment);
font-size: 0.94rem;
}
.chip-guide .read-more {
display: inline-block;
margin-top: 0.4rem;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-weight: 700;
text-decoration: none;
border-bottom: 1px solid var(--bronze);
padding-bottom: 1px;
letter-spacing: 0.05em;
font-size: 0.92rem;
}
.chip-guide .read-more:hover {
color: var(--gold);
border-bottom-color: var(--gold);
}
.chip-guide .quote {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: 1.18rem;
color: var(--parchment);
background: rgba(184, 115, 51, 0.08);
border-left: 4px solid var(--bronze);
padding: 1rem 1.5rem;
margin: 1.5rem 0;
}
.chip-guide ul { padding-left: 1.4rem; }
.chip-guide ul li { margin: 0.3rem 0; }</style><div class="chip-guide"><div class="hero"><div class="eyebrow">A Field Guide · 2026</div><h1 id="a-field-guide-to-ai-chips">A Field Guide to AI Chips</h1><div class="subtitle">Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.</div><div class="ornament">❦ ❦ ❦</div></div><p>Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a &ldquo;where you&rsquo;ll meet it&rdquo; entry for each. Each section links to a deeper entry for the curious.</p><h2 id="the-roll-call">The Roll Call</h2><table><thead><tr><th>Chip</th><th>Best for</th><th>Memory &amp; Interconnect</th><th>Cost &amp; Access</th><th>Notable Specimens (2026)</th></tr></thead><tbody><tr><td><strong>GPU</strong></td><td>Training + inference</td><td>80–192GB HBM3/3e; NVLink 5, PCIe 5</td><td>$25–40K each; cloud-only at scale</td><td>NVIDIA H100, B200, GB200 NVL72; AMD MI325X</td></tr><tr><td><strong>TPU</strong></td><td>Hyperscale training</td><td>95–192GB HBM; OCS interconnect</td><td>Google Cloud only</td><td>TPU v5p, v6 Trillium</td></tr><tr><td><strong>NPU</strong></td><td>On-device AI</td><td>Shared LPDDR / unified memory</td><td>Bundled in device</td><td>Apple Neural Engine (M4), Intel AI Boost (Lunar Lake), Qualcomm Hexagon (8 Elite)</td></tr><tr><td><strong>CPU</strong></td><td>Orchestration &amp; control plane</td><td>DDR5; PCIe 5, CXL</td><td>$1–15K; retail</td><td>Intel Xeon 6, AMD EPYC 9005</td></tr><tr><td><strong>ASIC</strong></td><td>Inference at scale; specialised training</td><td>Custom HBM / SRAM; proprietary fabric</td><td>Cloud-only</td><td>AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L</td></tr><tr><td><strong>FPGA</strong></td><td>Custom, low-latency, 
adaptive</td><td>DDR/HBM; reprogrammable fabric</td><td>$5–50K each; cloud</td><td>AMD Versal AI Edge, Intel Agilex 7</td></tr><tr><td><strong>Edge AI</strong></td><td>Mobile, robotics, IoT</td><td>LPDDR; low-power</td><td>$50–2000, embedded in product</td><td>NVIDIA Jetson Orin, Google Coral, Hailo-8</td></tr><tr><td><strong>Emerging</strong></td><td>Frontier R&amp;D</td><td>Wafer-scale SRAM / photonic / analog</td><td>Mostly research, limited cloud</td><td>Cerebras (covered above), Lightmatter, Mythic</td></tr></tbody></table><h2 id="1--gpu--graphics-processing-unit">1 · GPU — Graphics Processing Unit</h2><div class="stat-block"><div class="stat-block-header">GPU · The Apex Predator</div><dl><dt>Class</dt><dd>Parallel beast</dd><dt>Memory</dt><dd>80–192GB HBM3 / HBM3e</dd><dt>Interconnect</dt><dd>NVLink 5, PCIe 5, InfiniBand</dd><dt>Power</dt><dd>350–1200W per die</dd><dt>Habitat</dt><dd>Hyperscale datacenters</dd><dt>Cost</dt><dd>$25–40K per card · cloud rental at scale</dd><dt>Best Prey</dt><dd>LLM training, diffusion, multimodal pretraining</dd><dt>Specimens</dt><dd>NVIDIA H100, B200, GB200 NVL72; AMD MI325X</dd></dl></div><p>GPUs are the apex predator of the AI hardware ecosystem in 2026. Originally designed for graphics, they turned out to be ideal for the dense matrix multiplications that dominate neural network training. NVIDIA&rsquo;s H100 made the LLM era possible; B200 and the rack-scale GB200 NVL72 (72 GPUs treated as one machine, lashed together by NVLink switches) define the current frontier.</p><p>The reason GPUs dominate isn&rsquo;t just parallel processing — it&rsquo;s the combination of HBM (high-bandwidth memory mounted directly on the chip package), tensor cores (specialised matrix-multiply units), and a mature software ecosystem (CUDA, PyTorch, JAX) that nothing else has matched at scale. 
AMD&rsquo;s MI325X is the only serious open-market competitor, and even it leans on ROCm and HIP, AMD&rsquo;s CUDA-like programming layer, to bring CUDA-era code across.</p><p>The catch: you cannot really buy them. H100s and B200s ship into hyperscaler datacenters first and reach the open market — when they do — through Lambda, CoreWeave, AWS, and friends, rented at roughly $2–8 per GPU-hour.</p><p><a class="read-more" href="/field-guide-gpus/">→ Full entry: Field Guide · GPUs</a></p><h2 id="2--tpu--tensor-processing-unit">2 · TPU — Tensor Processing Unit</h2><div class="stat-block"><div class="stat-block-header">TPU · Google's Matmul Colossus</div><dl><dt>Class</dt><dd>Bespoke matrix engine</dd><dt>Memory</dt><dd>95–192GB HBM (varies by generation)</dd><dt>Interconnect</dt><dd>OCS (Optical Circuit Switching) + ICI</dd><dt>Power</dt><dd>~200–300W per chip</dd><dt>Habitat</dt><dd>Google Cloud (only)</dd><dt>Cost</dt><dd>Cloud rental only</dd><dt>Best Prey</dt><dd>Hyperscale training of Gemini-class models</dd><dt>Specimens</dt><dd>TPU v5p (training), v5e (inference), v6 Trillium</dd></dl></div><p>Google designed TPUs in-house to avoid paying NVIDIA&rsquo;s margins on a workload they knew exactly — TensorFlow matrix multiplications at hyperscale. Each generation has narrowed the gap with GPUs on flexibility while extending the lead on energy efficiency per FLOP.</p><p>The architectural bet is the systolic array: a grid of multiply-add units that pumps data through in lockstep, achieving near-peak utilisation on matmul-heavy workloads. The trade-off is that anything outside that sweet spot (irregular memory access, highly dynamic shapes) runs less efficiently than on a GPU. The OCS-based interconnect lets Google rewire a TPU pod&rsquo;s topology per job, which matters enormously at the scale of a Gemini training run.</p><p>You cannot buy a TPU. They exist exclusively inside Google Cloud, rented by the hour. 
Gemini was trained on them; many third parties (Anthropic for a stretch, plus enterprise customers) rent slices for their own runs.</p><p><a class="read-more" href="/field-guide-tpus/">→ Full entry: Field Guide · TPUs</a></p><h2 id="3--npu--neural-processing-unit">3 · NPU — Neural Processing Unit</h2><div class="stat-block"><div class="stat-block-header">NPU · The Resident Familiar</div><dl><dt>Class</dt><dd>On-device specialist</dd><dt>Memory</dt><dd>Shared LPDDR / unified system memory</dd><dt>Interconnect</dt><dd>SoC fabric (on-die)</dd><dt>Power</dt><dd>5–40W</dd><dt>Habitat</dt><dd>Laptops, phones, tablets</dd><dt>Cost</dt><dd>Bundled — no separate purchase</dd><dt>Best Prey</dt><dd>Voice, camera AI, on-device LLMs, Copilot features</dd><dt>Specimens</dt><dd>Apple Neural Engine (M4), Intel AI Boost (Lunar Lake / Arrow Lake), Qualcomm Hexagon (8 Elite, X Elite)</dd></dl></div><p>NPUs are the chip type most people interact with every day without knowing it. They live inside the SoC of your phone or laptop, optimised for running already-trained models locally with extreme power efficiency. Voice transcription, Face ID, Pixel&rsquo;s call screening, the on-device chat in Copilot+ PCs — all NPU workloads.</p><p>The defining trait is integer-quantised math (INT8 / INT4) at very low wattage. Where a datacenter GPU might pull 700W to serve a model, an NPU runs a comparable inference on the same model — quantised down — at 5–15W, with the weights sitting in the device&rsquo;s main memory because there is no discrete accelerator memory to fill.</p><p>Microsoft now requires 40+ TOPS of NPU performance for a laptop to qualify as a &ldquo;Copilot+ PC&rdquo; — a forcing function that pushed Qualcomm, Intel and AMD into a 12-month arms race. 
As of 2026, top mobile SoCs ship with 50–60 TOPS of NPU performance.</p><p><a class="read-more" href="/field-guide-npus/">→ Full entry: Field Guide · NPUs</a></p><h2 id="4--cpu--central-processing-unit">4 · CPU — Central Processing Unit</h2><div class="stat-block"><div class="stat-block-header">CPU · The Foundational Workhorse</div><dl><dt>Class</dt><dd>General-purpose</dd><dt>Memory</dt><dd>DDR5 (system RAM)</dd><dt>Interconnect</dt><dd>PCIe 5, CXL</dd><dt>Power</dt><dd>100–500W</dd><dt>Habitat</dt><dd>Every server, every workstation</dd><dt>Cost</dt><dd>$1–15K, retail and widely available</dd><dt>Best Prey</dt><dd>Orchestration, preprocessing, control plane, small-batch inference</dd><dt>Specimens</dt><dd>Intel Xeon 6, AMD EPYC 9005</dd></dl></div><p>CPUs aren&rsquo;t obsolete — they&rsquo;re indispensable. Every AI training run needs CPUs to feed data to the accelerators (decompression, tokenisation, augmentation), schedule jobs and run the control plane. The ratio matters: a typical training cluster pairs eight GPUs with one or two CPU sockets.</p><p>Modern server CPUs ship with AI-targeted extensions — AVX-512, AMX (Advanced Matrix Extensions), bf16 support — that let them handle small-batch inference and embedding generation reasonably well. For models under 7B parameters served at low traffic, a CPU is often more economical than a dedicated accelerator.</p><p>What CPUs cannot do is train frontier models. The arithmetic density and memory bandwidth needed for LLM pretraining are 10–100× what a CPU delivers per watt. 
CPUs do the surrounding work; accelerators do the math.</p><p><a class="read-more" href="/field-guide-cpus/">→ Full entry: Field Guide · CPUs</a></p><h2 id="5--asic--application-specific-integrated-circuit">5 · ASIC — Application-Specific Integrated Circuit</h2><div class="stat-block"><div class="stat-block-header">ASIC · The Purpose-Bred Specialist</div><dl><dt>Class</dt><dd>Fixed-function accelerator</dd><dt>Memory</dt><dd>Custom HBM / on-die SRAM</dd><dt>Interconnect</dt><dd>Proprietary fabric</dd><dt>Power</dt><dd>75–450W per chip</dd><dt>Habitat</dt><dd>Hyperscaler clouds (AWS, Cerebras, Groq, SambaNova)</dd><dt>Cost</dt><dd>Cloud rental only</dd><dt>Best Prey</dt><dd>Inference at scale, specialised training</dd><dt>Specimens</dt><dd>AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L</dd></dl></div><p>ASICs are chips designed for one thing and one thing only — and they&rsquo;re brutally good at that thing. AWS Inferentia2 runs production inference for Anthropic, Amazon search and Alexa at a cost-per-token that beats GPUs. Trainium2 is AWS&rsquo;s training equivalent, taking aim at NVIDIA&rsquo;s H100/B200 dominance. Groq&rsquo;s LPU posts inference latencies — sub-1ms first-token for many models — that GPUs simply cannot match.</p><p>The architectural philosophy is &ldquo;build silicon for the specific math you do most often, throw away the rest.&rdquo; Cerebras takes this furthest: their Wafer-Scale Engine 3 is a single chip the size of an entire silicon wafer (900,000 cores, 44GB on-die SRAM) that eliminates the multi-GPU communication overhead which plagues distributed training.</p><p>The price of specialisation: you cannot pivot. When the dominant architecture changes — and it does (Mamba, MoE, diffusion, JEPA) — ASICs designed for the last era stop being competitive overnight. 
GPUs hedge their bets; ASICs commit.</p><p><a class="read-more" href="/field-guide-asics/">→ Full entry: Field Guide · ASICs</a></p><h2 id="6--fpga--field-programmable-gate-array">6 · FPGA — Field-Programmable Gate Array</h2><div class="stat-block"><div class="stat-block-header">FPGA · The Shapeshifter</div><dl><dt>Class</dt><dd>Reprogrammable logic</dd><dt>Memory</dt><dd>DDR / HBM (model-dependent)</dd><dt>Interconnect</dt><dd>PCIe; custom</dd><dt>Power</dt><dd>50–300W</dd><dt>Habitat</dt><dd>Trading desks, 5G basebands, telecom, occasional inference</dd><dt>Cost</dt><dd>$5–50K each; cloud</dd><dt>Best Prey</dt><dd>Ultra-low-latency inference, custom protocols, evolving workloads</dd><dt>Specimens</dt><dd>AMD/Xilinx Versal AI Edge, Intel Agilex 7</dd></dl></div><p>FPGAs occupy a strange ecological niche. Unlike ASICs, their internal wiring is reprogrammable — you can compile a new circuit into them, deploy it, and reprogram it tomorrow. This makes them ideal for workloads that change faster than chip fabrication cycles (years), or where you need ultra-low latency that even ASICs struggle to deliver.</p><p>In AI specifically, FPGAs are rarely the first choice for mainstream training or inference — they are slower to develop for and harder to program than GPUs. Where they shine: when the model is small enough to fit on the device, the latency budget is brutal (single-digit microseconds), and the workload spec might shift quarterly. 
Microsoft used FPGAs heavily in early Bing Search ranking and Azure networking; financial firms still run them for inline ML in trading.</p><p>For most readers, FPGAs will be a &ldquo;did you know?&rdquo; category rather than a chip you&rsquo;ll ever deploy.</p><p><a class="read-more" href="/field-guide-fpgas/">→ Full entry: Field Guide · FPGAs</a></p><h2 id="7--edge-ai--mobile-robotics-iot">7 · Edge AI — Mobile, Robotics, IoT</h2><div class="stat-block"><div class="stat-block-header">Edge AI · The Frontier Ranger</div><dl><dt>Class</dt><dd>Embedded inference</dd><dt>Memory</dt><dd>LPDDR, sometimes onboard SRAM</dd><dt>Interconnect</dt><dd>PCIe, MIPI, USB</dd><dt>Power</dt><dd>1–25W</dd><dt>Habitat</dt><dd>Drones, robots, cameras, autonomous systems, sensors</dd><dt>Cost</dt><dd>$50–2000; embedded in product</dd><dt>Best Prey</dt><dd>Real-time inference, computer vision, robotics</dd><dt>Specimens</dt><dd>NVIDIA Jetson Orin, Google Coral / Edge TPU, Hailo-8, Ambarella CV5</dd></dl></div><p>Edge AI chips are NPUs&rsquo; cousins — same family, different role. Where an NPU lives inside a consumer laptop alongside other compute, an edge AI chip is purpose-built for an embedded device: a security camera, a drone, a forklift, a Tesla.</p><p>The defining constraints are size, power and latency. A camera processing 4K video at 30fps cannot afford to ship frames to a cloud GPU; it has to detect motion locally, identify objects locally and signal events within tens of milliseconds — on a few watts, because the device runs on battery or is fanless.</p><p>NVIDIA&rsquo;s Jetson family is the broadest platform — same CUDA software stack as their datacenter GPUs, scaled down to 7–60W. Google&rsquo;s Edge TPU is the smallest, cheapest and lowest power (Coral USB stick: $40, 2W). 
Hailo-8 and Ambarella sit in between, targeting industrial and automotive customers.</p><p><a class="read-more" href="/field-guide-edge-ai/">→ Full entry: Field Guide · Edge AI</a></p><h2 id="8--emerging-architectures">8 · Emerging Architectures</h2><div class="stat-block"><div class="stat-block-header">Emerging · The Frontier Beasts</div><dl><dt>Class</dt><dd>Experimental</dd><dt>Memory</dt><dd>Wafer-scale SRAM / photonic / analog</dd><dt>Interconnect</dt><dd>On-wafer / optical / in-memory</dd><dt>Power</dt><dd>Varies wildly</dd><dt>Habitat</dt><dd>Research labs, narrow cloud offerings</dd><dt>Cost</dt><dd>Mostly inaccessible; limited cloud</dd><dt>Best Prey</dt><dd>The next 10× efficiency leap</dd><dt>Specimens</dt><dd>Cerebras WSE-3 (wafer-scale); Lightmatter, Lightelligence (photonic); Mythic, IBM (analog/in-memory)</dd></dl></div><p>Three architectures sit at the frontier — promising, but not yet mainstream.</p><ul><li><strong>Wafer-scale</strong> (Cerebras): a single chip the size of an entire silicon wafer. Eliminates multi-chip communication for models that fit on the wafer; presents the whole system to software as a single device. Already commercial.</li><li><strong>Photonic / optical AI</strong> (Lightmatter, Lightelligence): performs matrix math using light interference instead of electricity. Potentially orders of magnitude lower energy per operation; currently limited to inference and constrained models.</li><li><strong>Analog / in-memory compute</strong> (Mythic, IBM, several startups): compute <em>inside</em> memory arrays using analog voltage levels. Removes the von Neumann bottleneck — the constant shuttling of data between memory and compute — entirely. 
Promising for low-power inference; precision limitations make training hard today.</li></ul><p><a class="read-more" href="/field-guide-emerging-architectures/">→ Full entry: Field Guide · Emerging Architectures</a></p><h2 id="current-industry-reality-2026">Current Industry Reality (2026)</h2><ul><li><strong>GPUs dominate training.</strong> Most frontier models, including GPT-5, Llama and Grok, are still trained on NVIDIA or AMD silicon at hyperscale; Gemini, trained on Google&rsquo;s TPUs, is the notable exception.</li><li><strong>ASICs are ascendant in inference.</strong> AWS reports more than 40% of internal inference now runs on Inferentia and Trainium; Groq leads on latency-critical applications.</li><li><strong>NPUs are exploding on consumer devices.</strong> Every premium laptop and phone shipped in 2026 has a 40+ TOPS NPU.</li><li><strong>CPUs remain foundational.</strong> No accelerator runs without one.</li><li><strong>TPUs are Google Cloud-only.</strong> Gemini, Veo and Imagen were all trained on TPU v5p / v6.</li></ul><h2 id="simplified-view">Simplified View</h2><table><thead><tr><th>Use case</th><th>Typical chip</th></tr></thead><tbody><tr><td>Train a GPT-class model</td><td>GPU clusters (or TPU pods on Google Cloud)</td></tr><tr><td>Run ChatGPT-class inference at scale</td><td>GPUs + ASICs (Inferentia, Groq, Trainium)</td></tr><tr><td>AI on laptop</td><td>NPU + integrated GPU</td></tr><tr><td>AI on phone</td><td>Mobile NPU</td></tr><tr><td>Robot or drone AI</td><td>Edge AI chips (Jetson, Hailo)</td></tr><tr><td>Ultra-low-latency custom AI</td><td>FPGA or ASIC</td></tr></tbody></table><h2 id="the-industry-trend">The Industry Trend</h2><div class="quote">
The industry is moving from "general-purpose GPU everything" to "specialised chip for each layer of the stack."</div><p>Power and inference cost are now the binding constraints. A frontier model serving billions of queries spends more on inference electricity in a year than its entire training run cost. The economics force specialisation: train once on GPUs, serve forever on cheaper inference silicon. Expect the gap between training hardware (still GPU-dominant) and inference hardware (rapidly ASIC- and NPU-fragmented) to widen.</p><div class="ornament">❦ ❦ ❦</div></div>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ai-chips/field-guide-hero.png" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>hardware</category><category>engineering</category><category>Knowledge Base</category></item><item><title>LLM &amp; Embeddings — One Predicts Words. One Maps Meaning.</title><link>https://curiousbit.netlify.app/one-predicts-words-one-maps-meaning/</link><guid isPermaLink="true">https://curiousbit.netlify.app/one-predicts-words-one-maps-meaning/</guid><pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
.two-mech { margin: 2rem 0 2.5rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; }
.two-mech svg { display: block; width: 100%; height: auto; min-width: 720px; }
.two-mech-wrap { overflow-x: auto; }
@media (prefers-reduced-motion: reduce) { .two-mech .tm-particle { display: none; } }
&lt;/style&gt;
&lt;div class="two-mech"&gt;&lt;div class="two-mech-wrap"&gt;
&lt;svg viewBox="0 0 1200 720" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Animated diagram contrasting the LLM stochastic generation loop with the deterministic embedding similarity pipeline"&gt;
&lt;defs&gt;
&lt;filter id="tmGlow" x="-50%" y="-50%" width="200%" height="200%"&gt;
&lt;feGaussianBlur stdDeviation="3" result="b"/&gt;
&lt;feMerge&gt;&lt;feMergeNode in="b"/&gt;&lt;feMergeNode in="SourceGraphic"/&gt;&lt;/feMerge&gt;
&lt;/filter&gt;
&lt;marker id="tmArrowBlue" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"&gt;
&lt;path d="M0,0 L10,5 L0,10 Z" fill="#60a5fa"/&gt;
&lt;/marker&gt;
&lt;marker id="tmArrowGreen" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"&gt;
&lt;path d="M0,0 L10,5 L0,10 Z" fill="#34d399"/&gt;
&lt;/marker&gt;
&lt;marker id="tmArrowAmber" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"&gt;
&lt;path d="M0,0 L10,5 L0,10 Z" fill="#f59e0b"/&gt;
&lt;/marker&gt;
&lt;!-- Hidden paths the particles animate along --&gt;
&lt;path id="tmPathLLM"
d="M 140 240 L 320 240 L 500 240 L 680 240 L 860 240
L 860 320 Q 860 360 820 360 L 540 360 Q 500 360 500 320 L 500 280
L 500 240 L 680 240 L 860 240"
fill="none" stroke="none"/&gt;
&lt;path id="tmPathEmb"
d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540"
fill="none" stroke="none"/&gt;
&lt;path id="tmPathEmb2"
d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540"
fill="none" stroke="none"/&gt;
&lt;/defs&gt;
&lt;!-- background --&gt;
&lt;rect width="1200" height="720" fill="#0a1424"/&gt;
&lt;!-- header --&gt;
&lt;text x="120" y="60" font-family="'Space Grotesk','Inter',sans-serif" font-size="16" fill="#f59e0b" letter-spacing="3" font-weight="700"&gt;TWO MECHANISMS&lt;/text&gt;
&lt;text x="120" y="100" font-family="'Space Grotesk','Inter',sans-serif" font-size="38" fill="#ffffff" font-weight="700" letter-spacing="-.5"&gt;Generation vs Similarity&lt;/text&gt;
&lt;!-- ════════ LLM PIPELINE (top) ════════ --&gt;
&lt;text x="120" y="180" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#60a5fa" font-weight="700"&gt;LLM&lt;/text&gt;
&lt;text x="120" y="208" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic"&gt;non-deterministic · sampling&lt;/text&gt;
&lt;!-- LLM boxes --&gt;
&lt;g font-family="'Inter',sans-serif"&gt;
&lt;!-- prompt --&gt;
&lt;g&gt;
&lt;rect x="60" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="140" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;prompt&lt;/text&gt;
&lt;text x="140" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;"AI is..."&lt;/text&gt;
&lt;/g&gt;
&lt;!-- tokenize --&gt;
&lt;g&gt;
&lt;rect x="240" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="320" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;tokenize&lt;/text&gt;
&lt;text x="320" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;BPE → IDs&lt;/text&gt;
&lt;/g&gt;
&lt;!-- model --&gt;
&lt;g&gt;
&lt;rect x="420" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="500" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;model&lt;/text&gt;
&lt;text x="500" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;forward pass&lt;/text&gt;
&lt;/g&gt;
&lt;!-- sample --&gt;
&lt;g&gt;
&lt;rect x="600" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="680" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;sample&lt;/text&gt;
&lt;text x="680" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;temp / top-p&lt;/text&gt;
&lt;/g&gt;
&lt;!-- next token --&gt;
&lt;g&gt;
&lt;rect x="780" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="860" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;next token&lt;/text&gt;
&lt;text x="860" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;append, loop&lt;/text&gt;
&lt;/g&gt;
&lt;/g&gt;
&lt;!-- LLM arrows --&gt;
&lt;g stroke="#60a5fa" stroke-width="2" fill="none"&gt;
&lt;line x1="220" y1="240" x2="232" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;line x1="400" y1="240" x2="412" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;line x1="580" y1="240" x2="592" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;line x1="760" y1="240" x2="772" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;/g&gt;
&lt;!-- Stochastic loop arrow (next token → model) --&gt;
&lt;path d="M 860 280 L 860 340 Q 860 360 840 360 L 520 360 Q 500 360 500 340 L 500 286"
fill="none" stroke="#f59e0b" stroke-width="2" stroke-dasharray="5,4" marker-end="url(#tmArrowAmber)"/&gt;
&lt;text x="680" y="384" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#f59e0b" font-style="italic"&gt;stochastic loop&lt;/text&gt;
&lt;!-- Pulsing rings on each LLM box (highlight as particle passes) --&gt;
&lt;rect class="tm-box-pulse" x="60" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="0s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="240" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="0.6s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="420" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.2s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="600" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.8s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="780" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="2.4s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;!-- Animated particle on the LLM path (loops continuously, traverses stochastic loop) --&gt;
&lt;circle class="tm-particle" r="9" fill="#60a5fa" filter="url(#tmGlow)"&gt;
&lt;animateMotion dur="5s" repeatCount="indefinite" rotate="auto"&gt;
&lt;mpath href="#tmPathLLM"/&gt;
&lt;/animateMotion&gt;
&lt;/circle&gt;
&lt;!-- ════════ EMBEDDING PIPELINE (bottom) ════════ --&gt;
&lt;text x="120" y="480" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#34d399" font-weight="700"&gt;EMBEDDING&lt;/text&gt;
&lt;text x="120" y="508" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic"&gt;deterministic · geometric&lt;/text&gt;
&lt;g font-family="'Inter',sans-serif"&gt;
&lt;!-- word --&gt;
&lt;g&gt;
&lt;rect x="60" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="140" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;word&lt;/text&gt;
&lt;text x="140" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;"king"&lt;/text&gt;
&lt;/g&gt;
&lt;!-- lookup --&gt;
&lt;g&gt;
&lt;rect x="240" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="320" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;lookup&lt;/text&gt;
&lt;text x="320" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;GloVe / SBERT&lt;/text&gt;
&lt;/g&gt;
&lt;!-- vector --&gt;
&lt;g&gt;
&lt;rect x="420" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="500" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;vector&lt;/text&gt;
&lt;text x="500" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;fixed dim&lt;/text&gt;
&lt;/g&gt;
&lt;!-- cosine --&gt;
&lt;g&gt;
&lt;rect x="600" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="680" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;cosine&lt;/text&gt;
&lt;text x="680" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;vs corpus&lt;/text&gt;
&lt;/g&gt;
&lt;!-- similarity --&gt;
&lt;g&gt;
&lt;rect x="780" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="860" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;similarity&lt;/text&gt;
&lt;text x="860" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;score · rank&lt;/text&gt;
&lt;/g&gt;
&lt;/g&gt;
&lt;g stroke="#34d399" stroke-width="2" fill="none"&gt;
&lt;line x1="220" y1="540" x2="232" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;line x1="400" y1="540" x2="412" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;line x1="580" y1="540" x2="592" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;line x1="760" y1="540" x2="772" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;/g&gt;
&lt;!-- Pulsing rings on each Embedding box --&gt;
&lt;rect class="tm-box-pulse" x="60" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="0s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="240" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="0.7s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="420" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="1.4s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="600" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.1s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="780" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.8s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;!-- Embedding particle (one-shot left-to-right, restarts cleanly) --&gt;
&lt;circle class="tm-particle" r="9" fill="#34d399" filter="url(#tmGlow)"&gt;
&lt;animateMotion dur="4s" repeatCount="indefinite" rotate="auto"&gt;
&lt;mpath href="#tmPathEmb"/&gt;
&lt;/animateMotion&gt;
&lt;/circle&gt;
&lt;!-- Annotation under embedding to underscore "one-shot, no loop" --&gt;
&lt;text x="500" y="624" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#34d399" font-style="italic"&gt;one-shot · same input always produces same output&lt;/text&gt;
&lt;!-- Bottom rule + caption --&gt;
&lt;line x1="60" y1="660" x2="940" y2="660" stroke="#1f3358" stroke-width="1"/&gt;
&lt;text x="500" y="690" text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic"&gt;Both paths exist in every modern NLP system. Which one you reach for depends on whether the answer needs to be &lt;tspan fill="#60a5fa" font-weight="700"&gt;written&lt;/tspan&gt; or &lt;tspan fill="#34d399" font-weight="700"&gt;found&lt;/tspan&gt;.&lt;/text&gt;
&lt;/svg&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The model is what writes the email. The embedding is what finds the one you wrote last March.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/IITM/week6-mechanisms.png" alt="Posts" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
.two-mech { margin: 2rem 0 2.5rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; }
.two-mech svg { display: block; width: 100%; height: auto; min-width: 720px; }
.two-mech-wrap { overflow-x: auto; }
@media (prefers-reduced-motion: reduce) { .two-mech .tm-particle { display: none; } }</style><div class="two-mech"><div class="two-mech-wrap"><svg viewBox="0 0 1200 720" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Animated diagram contrasting the LLM stochastic generation loop with the deterministic embedding similarity pipeline"><defs><filter id="tmGlow" x="-50%" y="-50%" width="200%" height="200%"><feGaussianBlur stdDeviation="3" result="b"/><feMerge><feMergeNode in="b"/><feMergeNode in="SourceGraphic"/></feMerge></filter><marker id="tmArrowBlue" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#60a5fa"/></marker><marker id="tmArrowGreen" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#34d399"/></marker><marker id="tmArrowAmber" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#f59e0b"/></marker><path id="tmPathLLM" d="M 140 240 L 320 240 L 500 240 L 680 240 L 860 240              L 860 320 Q 860 360 820 360 L 540 360 Q 500 360 500 320 L 500 280              L 500 240 L 680 240 L 860 240" fill="none" stroke="none"/><path id="tmPathEmb" d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540" fill="none" stroke="none"/><path id="tmPathEmb2" d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540" fill="none" stroke="none"/></defs><rect width="1200" height="720" fill="#0a1424"/><text x="120" y="60" font-family="'Space Grotesk','Inter',sans-serif" font-size="16" fill="#f59e0b" letter-spacing="3" font-weight="700">TWO MECHANISMS</text><text x="120" y="100" font-family="'Space Grotesk','Inter',sans-serif" font-size="38" fill="#ffffff" font-weight="700" letter-spacing="-.5">Generation vs Similarity</text><text x="120" y="180" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#60a5fa" font-weight="700">LLM</text><text x="120" y="208" font-family="'Inter',sans-serif" font-size="14" 
fill="#7e95b5" font-style="italic">non-deterministic · sampling</text><g font-family="'Inter',sans-serif"><g><rect x="60" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="140" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">prompt</text><text x="140" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">"AI is..."</text></g><g><rect x="240" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="320" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">tokenize</text><text x="320" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">BPE → IDs</text></g><g><rect x="420" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="500" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">model</text><text x="500" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">forward pass</text></g><g><rect x="600" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="680" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">sample</text><text x="680" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">temp / top-p</text></g><g><rect x="780" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="860" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">next token</text><text x="860" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">append, loop</text></g></g><g stroke="#60a5fa" stroke-width="2" fill="none"><line x1="220" y1="240" x2="232" y2="240" marker-end="url(#tmArrowBlue)"/><line x1="400" y1="240" x2="412" y2="240" marker-end="url(#tmArrowBlue)"/><line x1="580" y1="240" x2="592" y2="240" marker-end="url(#tmArrowBlue)"/><line x1="760" y1="240" x2="772" y2="240" marker-end="url(#tmArrowBlue)"/></g><path 
d="M 860 280 L 860 340 Q 860 360 840 360 L 520 360 Q 500 360 500 340 L 500 286" fill="none" stroke="#f59e0b" stroke-width="2" stroke-dasharray="5,4" marker-end="url(#tmArrowAmber)"/><text x="680" y="384" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#f59e0b" font-style="italic">stochastic loop</text><rect class="tm-box-pulse" x="60" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="0s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="240" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="0.6s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="420" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.2s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="600" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.8s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="780" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="2.4s" repeatCount="indefinite"/></rect><circle class="tm-particle" r="9" fill="#60a5fa" filter="url(#tmGlow)"><animateMotion dur="5s" repeatCount="indefinite" rotate="auto"><mpath href="#tmPathLLM"/></animateMotion></circle><text x="120" y="480" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#34d399" font-weight="700">EMBEDDING</text><text x="120" y="508" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic">deterministic · geometric</text><g 
font-family="'Inter',sans-serif"><g><rect x="60" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="140" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">word</text><text x="140" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">"king"</text></g><g><rect x="240" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="320" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">lookup</text><text x="320" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">GloVe / SBERT</text></g><g><rect x="420" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="500" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">vector</text><text x="500" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">fixed dim</text></g><g><rect x="600" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="680" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">cosine</text><text x="680" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">vs corpus</text></g><g><rect x="780" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="860" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">similarity</text><text x="860" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">score · rank</text></g></g><g stroke="#34d399" stroke-width="2" fill="none"><line x1="220" y1="540" x2="232" y2="540" marker-end="url(#tmArrowGreen)"/><line x1="400" y1="540" x2="412" y2="540" marker-end="url(#tmArrowGreen)"/><line x1="580" y1="540" x2="592" y2="540" marker-end="url(#tmArrowGreen)"/><line x1="760" y1="540" x2="772" y2="540" marker-end="url(#tmArrowGreen)"/></g><rect class="tm-box-pulse" x="60" y="500" width="160" height="80" rx="10" 
fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="0s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="240" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="0.7s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="420" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="1.4s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="600" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.1s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="780" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.8s" repeatCount="indefinite"/></rect><circle class="tm-particle" r="9" fill="#34d399" filter="url(#tmGlow)"><animateMotion dur="4s" repeatCount="indefinite" rotate="auto"><mpath href="#tmPathEmb"/></animateMotion></circle><text x="500" y="624" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#34d399" font-style="italic">one-shot · same input always produces same output</text><line x1="60" y1="660" x2="940" y2="660" stroke="#1f3358" stroke-width="1"/><text x="500" y="690" text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic">Both paths exist in every modern NLP system. Which one you reach for depends on whether the answer needs to be <tspan fill="#60a5fa" font-weight="700">written</tspan> or <tspan fill="#34d399" font-weight="700">found</tspan>.</text></svg></div></div><p>The model is what writes the email. 
The embedding is what finds the one you wrote last March.</p><p>Most modern AI systems are built from two fundamentally different mechanisms, and most confusion about what AI &ldquo;is&rdquo; comes from conflating them. LLMs are <em>generative</em>: tokens in, tokens out, with the output shaped by the prompt and the sampling settings, and typically varying from one call to the next. Embeddings are <em>geometric</em>: a deterministic mapping from a word or sentence to a fixed vector, where comparisons are positional and identical input always produces identical output. Both are essential. Both are old enough to be uncontroversial. Most useful systems combine them.</p><p>What follows is the <strong>Week 6 Graded Mini Project</strong> of the <strong>IITM Pravartak Professional Certificate Programme in Agentic AI and Applications</strong>, used here as a lens for both mechanisms across five hands-on exercises.</p><h2 id="the-two-paths-side-by-side">The two paths, side by side</h2><p>The header image above shows the contrast in one frame. The LLM path is a loop with sampling — non-deterministic by design, behaviour controlled by temperature, top-p, and prompt structure. The embedding path is a one-shot lookup followed by a geometric comparison — deterministic, fast, stable.</p><p>That single distinction tells you which mechanism to reach for. If the answer needs to be written, generated, synthesized, or improvised, you want the LLM. If the answer needs to be found, ranked, deduplicated, clustered, or routed, you want embeddings. 
Most production systems use both because most real problems are some combination of &ldquo;find the right context&rdquo; and &ldquo;say something useful about it.&rdquo;</p><p>A quick decision table to anchor the rest of the article:</p><table><thead><tr><th>Problem</th><th>Reach for</th></tr></thead><tbody><tr><td>Semantic search over a corpus</td><td>Embeddings</td></tr><tr><td>Conversational reply or text drafting</td><td>LLM</td></tr><tr><td>Near-duplicate detection or content clustering</td><td>Embeddings</td></tr><tr><td>Summarization of a long document</td><td>LLM</td></tr><tr><td>Routing a support ticket to the right team</td><td>Embeddings + a small classifier head</td></tr><tr><td>Question answering grounded in your docs</td><td>Both (RAG)</td></tr><tr><td>Image or text classification</td><td>Embeddings + a categorical head</td></tr><tr><td>Translation, rewriting, code generation</td><td>LLM</td></tr></tbody></table><p>The exercises below show why each row works the way it does.</p><h2 id="exercise-1-text-generation-reveals-prompt-and-sampling-sensitivity">Exercise 1: Text generation reveals prompt and sampling sensitivity</h2><p>Section A1 loaded <code>distilgpt2</code> through the Hugging Face <code>pipeline</code> API and generated three continuations of the same prompt:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span>
</span></span><span class="line"><span class="cl"><span class="n">generator</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span><span class="s2">"text-generation"</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="s2">"distilgpt2"</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># do_sample=True turns on stochastic sampling, so the three sequences differ.</span>
</span></span><span class="line"><span class="cl"><span class="n">generator</span><span class="p">(</span><span class="s2">"AI is transforming industries by"</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span> <span class="n">num_return_sequences</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">do_sample</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span></code></pre></div><p>Three continuations came back from the same model, the same prompt, the same call:</p><blockquote><p><em>&ldquo;AI is transforming industries by using science to bring people together with a greater understanding of the importance of science. The new book takes an approach to both science and technology, allowing people to focus more more effectively on the basics and to&hellip;&rdquo;</em></p></blockquote><blockquote><p><em>&ldquo;AI is transforming industries by replacing the manufacturing sector with a manufacturing sector that can be turned into a manufacturing and IT sector by creating new jobs and creating new jobs. The new jobs and investment in the next decade will help spur growth&hellip;&rdquo;</em></p></blockquote><blockquote><p><em>&ldquo;AI is transforming industries by creating a new, faster, and more attractive way of generating capital and creating jobs for both the United States and Europe. This is an effective new way of doing this.&rdquo;</em></p></blockquote><p>Three different stories. None of which the model &ldquo;knew&rdquo; — it just produced plausible-sounding next tokens under stochastic sampling. Notice the repetitions (&ldquo;manufacturing sector with a manufacturing sector&rdquo;), the loops (&ldquo;more more effectively&rdquo;), the empty filler (&ldquo;a new, faster, and more attractive way of generating capital&rdquo;). 
DistilGPT-2 is a small model — these are the artefacts of a system that&rsquo;s good at local fluency but doesn&rsquo;t have a strong forward plan.</p><p>The headline insight: LLM outputs are statistical, prompt-sensitive, and unrepeatable unless you fix the seed. The same prompt can give you variety (a feature when brainstorming) or drift (a bug when consistency matters).</p><h2 id="exercise-2-tokenization-is-where-the-abstraction-begins">Exercise 2: Tokenization is where the abstraction begins</h2><p>This is the section to slow down on. Take the sentence:</p><blockquote><p><em>&ldquo;LLMs are powerful tools for natural language understanding.&rdquo;</em></p></blockquote><p>A human reads eight words. The model sees ten tokens.</p><p><img src="/images/IITM/week6-tokens.png" alt="BPE tokenization of the sentence, showing each token as a coloured pill"/><p>After BPE (Byte-Pair Encoding) with the DistilGPT-2 tokenizer:</p><pre tabindex="0"><code>['LL', 'Ms', 'Ġare', 'Ġpowerful', 'Ġtools', 'Ġfor', 'Ġnatural',
 'Ġlanguage', 'Ġunderstanding', '.']</code></pre><p>The string <code>LLMs</code> doesn&rsquo;t appear in the model&rsquo;s vocabulary as a single unit, so it is split into <code>LL</code> and <code>Ms</code>. The <code>Ġ</code> prefix encodes &ldquo;preceding space&rdquo; — that&rsquo;s how GPT-2&rsquo;s byte-level BPE preserves word boundaries without a separator character. The period gets its own token.</p><p>The mismatch between <em>what a human reads</em> and <em>what the model processes</em> has real consequences:</p><ul><li><strong>Cost is per token, not per word.</strong> API billing, latency, and rate limits are all token-denominated. A 1,000-word prompt to a frontier model may bill at 1,300–1,500 tokens depending on language.</li><li><strong>Context windows are token windows.</strong> A 4,096-token context holds roughly 3,000 English words. Much less for code (whitespace and symbols inflate counts), much less again for languages with poor vocabulary coverage in the tokenizer.</li><li><strong>Rare strings behave oddly.</strong> Brand names, technical acronyms, foreign words, internal jargon — anything outside the trained vocabulary gets fractured. Model behaviour around those fractures is harder to predict, and prompt sensitivity often hides at this layer.</li><li><strong>The same string can tokenize differently with leading whitespace.</strong> <code>"king"</code> and <code>" king"</code> are different token sequences. That&rsquo;s why pasted prompts sometimes produce subtly different outputs than typed ones.</li></ul><p>Tokenization is the lowest layer of the LLM stack and the one most engineering conversations skip. 
If you&rsquo;re tuning prompts and getting unstable behaviour, the first place to look is what your input looks like <em>after the tokenizer touches it</em>, not what it looks like in your editor.</p><h2 id="exercise-3-prompts-shape-what-you-get">Exercise 3: Prompts shape what you get</h2><p>Section B ran three task-shaped prompts through the same generator, with <code>temperature=0.8</code> and <code>top_p=0.95</code>:</p><ul><li><strong>Summarization</strong> — explicit instruction with a 30-word cap.</li><li><strong>Q&amp;A</strong> — structured format with <code>Q:</code> and <code>A:</code> markers.</li><li><strong>Creative</strong> — open-ended request for a 4-line poem about AI.</li></ul><p>The summarization output respected the spirit of the constraint but drifted past 30 words on most runs — DistilGPT-2 is small enough that hard length control isn&rsquo;t reliable even with explicit instructions. The Q&amp;A output, asked for the capital of Japan, returned <code>I believe...</code> — the model hedged. A larger model would say Tokyo confidently; a small model produces statistically plausible Q&amp;A-shaped text without strong factual grounding. The creative prompt produced varied and stylistic continuations, but with the lowest grounding: fluency over precision.</p><p>Structure compresses the output space the model is sampling from. Vagueness expands it. That single sentence is most of what &ldquo;prompt engineering&rdquo; actually is — the rest is technique.</p><h2 id="exercise-4-word-embeddings-encode-semantic-geometry">Exercise 4: Word embeddings encode semantic geometry</h2><p>Pivot to the other mechanism. 
Section C1 loaded <strong>GloVe</strong> vectors (<code>glove-wiki-gigaword-50</code> — 50 dimensions, trained on Wikipedia and Gigaword) via Gensim, then asked for the five nearest neighbours of three words:</p><table><thead><tr><th>Query</th><th>Top 5 neighbours (cosine similarity)</th></tr></thead><tbody><tr><td><code>king</code></td><td>prince (0.82), queen (0.78), ii (0.77), emperor (0.77), son (0.77)</td></tr><tr><td><code>queen</code></td><td>princess (0.85), lady (0.81), elizabeth (0.79), king (0.78), prince (0.78)</td></tr><tr><td><code>diamond</code></td><td>gold (0.77), diamonds (0.77), gem (0.74), silver (0.72), jewel (0.71)</td></tr></tbody></table><p>There is no generation here. Each word is mapped to a fixed 50-dimensional vector, and the &ldquo;nearest neighbours&rdquo; are the words whose vectors sit closest in that space by cosine similarity. The geometry was learned by training on co-occurrence — words that appear in similar contexts end up in similar positions. That&rsquo;s why <code>king</code> and <code>prince</code> are nearest neighbours, why <code>queen</code> pulls in <code>elizabeth</code> (the corpus has plenty of references to Queen Elizabeth), and why <code>diamond</code> cleanly resolves to a jewellery cluster.</p><p>The classic <code>king − man + woman ≈ queen</code> analogy works in this same space; the lab didn&rsquo;t run it, but the geometry is there. Embeddings don&rsquo;t <em>write</em> anything — they <em>place</em> things near other things. That single property is what makes them the backbone of semantic search, retrieval, deduplication, and recommendation.</p><h2 id="exercise-5-sentence-similarity-from-averaged-word-vectors">Exercise 5: Sentence similarity from averaged word vectors</h2><p>Section C2 extended the geometry to sentences. 
Five short sentences across two topics — AI/ML and jewellery — were averaged into sentence vectors (mean of their word vectors, with simple lowercase tokenization), then compared with cosine similarity.</p><p>Plotted in 2D via multidimensional scaling on the cosine distances, the clustering is unambiguous:</p><p><img src="/images/IITM/week6-clusters.png" alt="Two-dimensional cluster plot of the five sentence vectors, with the AI/ML sentences clearly separated from the jewellery sentences"/><p>The numerical version:</p><table><thead><tr><th/><th>AI/support</th><th>ML/fraud</th><th>Jewellery</th><th>Neural/medical</th><th>Luxury/rings</th></tr></thead><tbody><tr><td><strong>AI/support</strong></td><td>1.00</td><td>0.84</td><td>0.60</td><td>0.80</td><td>0.50</td></tr><tr><td><strong>ML/fraud</strong></td><td>0.84</td><td>1.00</td><td>0.73</td><td>0.83</td><td>0.62</td></tr><tr><td><strong>Jewellery</strong></td><td>0.60</td><td>0.73</td><td>1.00</td><td>0.58</td><td>0.88</td></tr><tr><td><strong>Neural/medical</strong></td><td>0.80</td><td>0.83</td><td>0.58</td><td>1.00</td><td>0.56</td></tr><tr><td><strong>Luxury/rings</strong></td><td>0.50</td><td>0.62</td><td>0.88</td><td>0.56</td><td>1.00</td></tr></tbody></table><p>Within-cluster pairs sit at 0.80–0.88. Cross-domain pairs sit at 0.50–0.73, so even the closest cross-domain pair (ML/fraud vs Jewellery, 0.73) scores below every within-cluster pair. The grouping is exactly what you&rsquo;d want a retrieval system to do.</p><p>Three caveats worth naming, because they explain why modern retrieval doesn&rsquo;t actually use GloVe averages:</p><ul><li><strong>Averaging discards word order.</strong> &ldquo;Dog bites man&rdquo; and &ldquo;man bites dog&rdquo; produce identical sentence vectors. For most retrieval that&rsquo;s tolerable; for anything where syntax carries the meaning, it isn&rsquo;t.</li><li><strong>Transformer encoders fixed this.</strong> Models like BERT, RoBERTa, and their descendants produce <em>contextual</em> embeddings — each token&rsquo;s vector depends on the tokens around it. 
Pool those across a sentence and you get a representation that respects word order and disambiguates polysemy.</li><li><strong>Sentence-BERT and friends made it production-grade.</strong> SBERT (and successors like OpenAI&rsquo;s <code>text-embedding-3</code>, Cohere&rsquo;s embeddings, Voyage, etc.) trained encoders specifically for sentence-level similarity. That&rsquo;s the difference between &ldquo;the demo works on five sentences&rdquo; and &ldquo;you can index a million documents and search them in milliseconds.&rdquo;</li></ul><p>GloVe averaging is a baseline. It&rsquo;s the right baseline to start with, because it lets you see the geometry without the architecture getting in the way. Production systems start from this picture and replace the lookup step.</p><h2 id="when-both-mechanisms-meet">When both mechanisms meet</h2><p>The final exercise sits at the intersection. <code>distilbert-base-uncased-finetuned-sst-2-english</code> is a transformer encoder (an embedding model under the hood) with a classification head fine-tuned for sentiment. Run it on three workplace-themed inputs:</p><table><thead><tr><th>Input</th><th>Label</th><th>Score</th></tr></thead><tbody><tr><td>&ldquo;The chatbot reduced ticket resolution time by 40% this quarter.&rdquo;</td><td>POSITIVE</td><td>0.9962</td></tr><tr><td>&ldquo;Our deployment failed repeatedly and customers were upset.&rdquo;</td><td>NEGATIVE</td><td>0.9997</td></tr><tr><td>&ldquo;The new recommendation engine is acceptable but needs tuning.&rdquo;</td><td>NEGATIVE</td><td>0.9898</td></tr></tbody></table><p>The third row is the interesting one, and it&rsquo;s worth unpacking because it points at a problem that turns up in every enterprise deployment of pretrained models.</p><p>&ldquo;Acceptable but needs tuning&rdquo; is, in workplace context, a <em>lukewarm-positive</em> — closer to &ldquo;approved with caveats&rdquo; than &ldquo;this is bad.&rdquo; The classifier scored it NEGATIVE with 0.9898 confidence. 
Three things are happening at once:</p><ul><li><strong>Domain mismatch.</strong> The model was fine-tuned on SST-2, which is movie reviews. &ldquo;Needs tuning&rdquo; reads negative there. In an engineering team&rsquo;s language, &ldquo;needs tuning&rdquo; is constructive — the same words have different sentiment loadings in different domains.</li><li><strong>No calibration on workplace text.</strong> The score is 0.9898 — extreme confidence — for what should be a borderline case. Pretrained classifiers tend to be miscalibrated on out-of-distribution inputs: they&rsquo;re not just wrong, they&rsquo;re confidently wrong. Calibration techniques (temperature scaling, Platt scaling, conformal prediction) exist for exactly this.</li><li><strong>Weak supervision is the practical fix.</strong> When you can&rsquo;t fine-tune (no labelled data, no budget, no time), the durable answer is to treat the classifier as one signal among several — combine it with rules, keyword filters, or a second model — rather than trusting any single number above the threshold.</li></ul><p>Architecturally, the lesson generalises across all three Section D variants. Generation is &ldquo;embedding + decoder loop.&rdquo; Classification is &ldquo;embedding + categorical head.&rdquo; Retrieval is &ldquo;embedding + cosine.&rdquo; Same underlying mathematical object, different output shapes. The architectural choices around the embedding determine what the system does — and where it fails when you take it out of the domain it was trained on.</p><h2 id="closing-observations">Closing observations</h2><p>Three things that generalise beyond this lab.</p><p><strong>Tokenization is where most LLM cost and quirks actually originate.</strong> It&rsquo;s the lowest layer of the stack and the one most engineering conversations skip. 
If you&rsquo;re tuning prompts and getting unstable behaviour, the first place to look is what your input looks like after the tokenizer touches it.</p><p><strong>Embedding-based similarity is older, cheaper, and more deterministic than people remember.</strong> Before reaching for an LLM call to compare two pieces of text, embed them and compute cosine. It takes milliseconds, costs next to nothing with a locally hosted embedding model, and is stable. A surprising fraction of &ldquo;AI features&rdquo; are really embedding lookups with a confidence threshold.</p><p><strong>Generation and similarity sit next to each other.</strong> They are not competitors. RAG is the obvious example — embeddings retrieve, the LLM generates the answer grounded in what was retrieved. The <a href="/rag-chatbot-for-the-github-rest-api/">Week 15 RAG chatbot post</a> is what these two mechanisms look like wired together for production.</p><p>One predicts words. One maps meaning. Knowing which one to reach for is most of the job.</p>
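<p>As a postscript to the second observation, here is a minimal sketch of &ldquo;embed them and compute cosine&rdquo;, using the averaged-word-vector approach from Exercise 5. It is not the lab&rsquo;s code: the three-dimensional vectors are invented for illustration (real GloVe vectors have 50 dimensions here), and the names <code>TOY_VECTORS</code>, <code>sentence_vector</code>, and <code>cosine</code> are hypothetical helpers.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import math

# Toy 3-dimensional word vectors, invented for illustration only.
# They are NOT real GloVe values; swap in a real lookup for actual use.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.1],
    "man": [0.7, 0.2, 0.3],
    "diamond": [0.0, 0.1, 0.9],
    "ring": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the known words in a lowercased sentence."""
    words = [w for w in sentence.lower().split() if w in TOY_VECTORS]
    if not words:
        raise ValueError("no known words in sentence")
    dims = len(TOY_VECTORS[words[0]])
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same words in a different order: averaging cannot tell these apart.
same_words = cosine(sentence_vector("dog bites man"), sentence_vector("man bites dog"))
# Different topic: the averaged vectors point in different directions.
other_topic = cosine(sentence_vector("dog bites man"), sentence_vector("diamond ring"))
</code></pre></div><p>Because averaging ignores word order, <code>same_words</code> comes out at 1.0 up to floating-point rounding, while <code>other_topic</code> is much lower. Swapping the toy dictionary for real vectors, such as the GloVe set loaded in Exercise 4, should leave the two functions unchanged.</p>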
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/IITM/week6-mechanisms.png" medium="image"><media:title type="plain">Posts</media:title></media:content><category>artificial-intelligence</category><category>llm</category><category>machine-learning</category><category>engineering</category><category>Knowledge Base</category></item></channel></rss>