Ajay Walia

Aether, Rethought — The Shape Was Wrong All Along

Ajay Walia — Fri, 12 Jun 2026 00:00:00 +0000

Part III of the Aether series. Missed the first two? Start withMeet Aether (the build), thenAether, Grown Wild (what happened when it ran).

The story so far — one paragraph each

Part I: I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.

Part II: That clean idea hit reality. The router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and self-reported confidence was replaced by a formula the system computes. Thirteen agents, a live web allowlist, a computed confidence score, and a CHANGELOG.md in place of Git.

Part III (this one): The shape is wrong. The hierarchy that felt so natural — because it mirrors the org chart — turns out to optimise for the wrong things. I worked through five recognised ways to build an agentic AI system, scored each against the same criteria, and arrived at a recommendation that changes the architecture without throwing away anything we’ve built.

What we’re actually building

The output isn’t a chatbot. It’s a consulting deliverable — an architecture document advising how to run an IT transition. Every real client engagement spans multiple towers simultaneously: a move to Entra ID while modernising the network and shifting workloads to Azure touches Cloud, Network, Digital Workplace, and Security in the same breath. The value and the difficulty live in the cross-domain synthesis.

Where we started — the three-tier hierarchy

This is what Aether v2.x actually is. One model, thirteen agents, each differentiated only by system prompt and knowledge namespace. Work routes down the tree; low confidence escalates back up.

It was appealing for real reasons: it mirrors how a delivery organisation thinks, easy to explain to a client, and the RAG namespace-per-domain isolation is clean. But we copied an org chart into the control flow — and the next slide explains why that’s usually a trap.

The core insight — why the org chart is the wrong shape

This is the conceptual centre of the entire analysis. Everything that follows flows from these two points.

A strict tree only allows vertical movement — up to escalate, down to delegate. Real architecture work needs lateral collaboration. The AWS specialist can’t directly ask the Network specialist a question; it has to climb the entire tree and back down. That’s bureaucracy encoded in Python.

The second consequence is more fundamental:the deliverable is a workflow, not an organisation. Producing an architecture document is a consulting process with phases. The right structure for that process is a pipeline. We modelled the people first and the process second — we should have done it the other way round.

Five ways to build it

Rather than jump straight to a new design, I evaluated five recognised approaches against the same five axes: cross-domain capability, parallelism, auditability, simplicity, and fit for document generation.

Note the footer: RAG sits underneathall five approaches — they differ in control flow, not in whether they retrieve.

Approach 1 — Hierarchical / org-mirror

This is what Aether v2.x already is.

The pros are real — which is why we chose it. But the killer con:there is no sideways path. The AWS agent can’t ask the Network agent a question without escalating all the way up and back down. Also, escalation triggered by confidence scores sounds clean, but LLM confidence is unreliable — the trigger itself is shaky.

Verdict: Great for stable, siloed problems with a genuine audit chain. That is not what our engagements look like.

Approach 2 — Orchestrator + flat specialists

One orchestrator plans the task, fans it out to specialists runningin parallel, then synthesises. Flat: adding a domain means adding one specialist — no re-tiering.

Verdict: A strong, flexible default. A piece of the recommendation.

Approach 3 — Workflow / pipeline (process-native)

Instead of organising bywho, organise by thestages of producing the document: Discover → Assess → Design → Review → Assemble. Each stage maps to a section of the output. Human checkpoints slot between stages.

Verdict: This is the spine. The structure that mirrors how the deliverable is actually made. The so-called con — that domain experts become knowledge sources rather than autonomous agents — is arguably the correct framing.

Approach 4 — Blackboard / shared artifact

All agents read and write a common workspace — the evolving document. Strong on cross-domain consistency because there’s only one object. The hard part is concurrency control.

Key distinction: a true blackboard (free-for-all writes) is risky. A typed, controlled shared state object — which LangGraph gives us — keeps the benefit without the chaos. We use the controlled version.

Approach 5 — Single-agent baseline

One capable model. A lightweight router. Domain knowledge bases on demand. Simplest, cheapest, most reliable.

Verdict: The benchmark every fancier design has to beat. The real reason to decompose isinformation management — not model weakness. When CMDB exports, cloud inventories, and Terraform files all arrive together, no context window handles it cleanly.

The comparison

The table does the work. Our current design (Hierarchical) is weakest exactly where we need strength — cross-domain — while strong on auditability. The two approaches that score high on fit-for-doc-gen arecomplementary: strong in different places. That’s the bridge to the recommendation.

The recommendation — a hybrid

Not a ladder. Ahub. The pipeline (Discovery → Assessment → Design) flows through a single typed shared state object. Domain specialists are invoked within stages — callable skills, not autonomous routing agents. Cross-cutting reviewers (Security, Cost, Risk, Compliance) act across all stages. QA can loop work back to Design or Assessment until the rubric passes.

How the model works

Three things to highlight:

Parallel specialists write structured findings into one typed shared state — the single source of truth
Arbitration is deterministic — fires on conflict, policy breach, or missing data, never on confidence scores
The model is pluggable — local Gemma today, Claude/OpenAI/Gemini tomorrow, same knowledge, no rigid tree

Why this is the right call

The verdict, stated plainly: it’s the only option that handles cross-domain work with the auditability, governance, and evidence-traceability a client deliverable demands — at an acceptable, well-understood increase in build complexity.

The technology stack — nothing new to install

Every box in the stack is something we already run. The v3 work extends Orchestration and adds the governance/eval layer.This is not a rebuild.

The process — how a real engagement runs

Two things to notice:

Feedback arrows — review, QA, and cost/risk findings can send workback to Design or Assessment. It’s iterative, not a one-way pipeline.
Human gates — after assessment and before assembly. An architect validates the current-state picture and signs off before the document is built. Gated autonomy, not full automation — which matters when the output carries liability.

The real hard problem — evidence quality

Every downstream recommendation carries the confidence level and assumptions set at this gate.No silent guessing. The system requests more data, logs gaps in an assumption register, or proceeds while explicitly stating its confidence level.

What flows through the system — the canonical Architecture State

One typed object. Every stage, specialist, and reviewer reads from and writes to it. The document is rendered from it. Every recommendation traces back through the same object to the evidence that produced it — that traceability is what makes the deliverable auditable.

Lenses, not domains — where security and compliance live

Security is a lens, not a domain. Same for Cost, Compliance, Risk, Sustainability, and Supportability. They aren’t towers sitting next to Cloud and Network — they cut horizontally across every stage. Most rework loops originate in Security and Compliance, which is exactly why they’re first-class cross-cutting reviewers.

Key design considerations

The most important reframe:we decompose for information management, not because the model is weak. That reframes the whole multi-agent debate. And arbitration fires on deterministic events — never on confidence scores. Every recommendation must trace to evidence.

Design decisions — where the build effort actually goes

The effort allocation is the provocative bit.Only ~10% of build effort goes to the agents themselves. ~35% is evidence ingestion and normalisation. ~20% is evaluation and governance. Most teams invert these numbers — polished agents, no evaluation — and end up with a demo that can’t produce a deliverable a client would pay for. That’s the trap we’re avoiding.

Business value

The positioning:a consulting accelerator and quality platform — not a headcount-reduction play. The specific numbers (50–80% faster assessments, 60–90% less evidence-collection effort) are hypotheses to validate in a pilot, not measured results. They’re framed that way because that’s what earns trust.

Next steps

The sequencing matters:evidence ingestion and the QA rubric first, agent polish second. That’s where the quality actually lives.

The honest accounting

I built a hierarchy because it was intuitive. It is intuitive — it maps to how the organisation thinks, the escalation chain is easy to trace and audit, and it’s easy to explain to a client. Those are real advantages.

But intuitive for humans and optimal for agents are not the same thing. The hierarchy was designed to answer“who is responsible?” — an important human question that agents don’t need answered. The new design answers“what needs to happen next?” — which is the right question for a document-generation workflow.

What changes in v3:

Agents become implementation details inside the Knowledge layer — the durable value is in evidence, governance, evaluation, and the document workflow
Escalation fires on events (conflict, policy breach, missing data) — not on confidence scores
The canonical model is the design constraint — the document isn’t an afterthought, it’s what everything is structured around
Model-agnostic by default — local Gemma today, SOTA API tomorrow, same knowledge

What stays the same: LangGraph, LanceDB, BAAI/bge-small, FastAPI, SQLite, the YAML manifests, and the 13 domain specialists — which become callable skills within the Design stage rather than autonomous routing agents. Same knowledge, different invocation mechanism.

This is an extension of what we run, not a rebuild.

Questions about the design or the approach? Reach out onLinkedIn.

]]>

LLMs Are Probability Engines, Not "Thinkers"

Ajay Walia — Sun, 07 Jun 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap'); .pe-article { --bg: #070b14; --bg2: #0d1423; --bg3: #111827; --cyan: #00e5ff; --purple: #a855f7; --gold: #fbbf24; --text: #e2e8f0; --muted: #94a3b8; --border: #1e293b; --danger: #f87171; font-family: 'Space Grotesk', system-ui, sans-serif; font-size: 1.08rem; line-height: 1.85; color: var(--text); } /* TOC */ .pe-toc { background: var(--bg2); border: 1px solid var(--border); border-left: 3px solid var(--cyan); border-radius: 10px; padding: 1.25rem 1.75rem; margin: 2rem 0; } .pe-toc h3 { font-size: 0.95rem; letter-spacing: 0.18em; text-transform: uppercase; color: var(--cyan); margin: 0 0 1rem; } .pe-toc ol { padding-left: 1.3rem; margin: 0; } .pe-toc li { margin-bottom: 0.6rem; } .pe-toc a { color: var(--muted); text-decoration: none; font-size: 1.15rem; font-weight: 600; transition: color 0.2s; } .pe-toc a:hover { color: var(--cyan); } /* Video */ .pe-video { margin: 2rem 0; border-radius: 12px; overflow: hidden; border: 1px solid var(--border); background: #000; } .pe-video video { width: 100%; display: block; } .pe-video-header { background: var(--bg2); padding: 1rem 1.4rem; font-size: 1.15rem; font-weight: 600; color: var(--cyan); border-bottom: 1px solid var(--border); line-height: 1.5; } /* Typography */ .pe-article h2 { font-size: 1.75rem; font-weight: 700; color: #fff; margin: 3rem 0 0.9rem; padding-bottom: 0.45rem; border-bottom: 1px solid var(--border); } .pe-sec-num { color: var(--cyan); font-size: 1rem; font-weight: 600; display: block; margin-bottom: 0.2rem; letter-spacing: 0.1em; } .pe-article p { margin-bottom: 1.1rem; } .pe-article strong { color: #fff; } .pe-em { color: var(--gold); } /* Callouts */ .pe-callout { background: var(--bg2); border-left: 4px solid var(--purple); border-radius: 0 8px 8px 0; padding: 1.4rem 1.8rem; margin: 1.5rem 0; font-size: 1.4rem; color: var(--muted); line-height: 1.75; } .pe-callout.cy { border-color: var(--cyan); } .pe-callout.gd { border-color: var(--gold); } .pe-callout strong { color: var(--text); } /* Compare table */ .pe-table { width: 100%; border-collapse: collapse; font-size: 1rem; margin: 1.25rem 0; } .pe-table th { text-align: left; padding: 0.7rem 1rem; background: var(--bg2); color: var(--cyan); font-size: 0.85rem; letter-spacing: 0.08em; text-transform: uppercase; border-bottom: 1px solid var(--border); } .pe-table td { padding: 0.85rem 1rem; border-bottom: 1px solid var(--border); color: var(--muted); vertical-align: top; line-height: 1.6; } .pe-table td:first-child { color: var(--text); font-weight: 500; } .pe-table tr:hover td { background: var(--bg2); } /* Formula boxes */ .pe-box { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.75rem; margin: 1.75rem 0; } .pe-box-title { font-size: 0.95rem; letter-spacing: 0.15em; text-transform: uppercase; color: var(--purple); margin-bottom: 1rem; } /* Anim 1 — token prediction */ .pe-sentence { font-size: 1.3rem; font-family: 'JetBrains Mono', monospace; color: var(--text); min-height: 2rem; margin-bottom: 1.1rem; } .pe-cursor { display: inline-block; width: 2px; height: 1em; background: var(--cyan); animation: pe-blink 0.8s infinite; vertical-align: middle; margin-left: 2px; } @keyframes pe-blink { 0%,100%{opacity:1} 50%{opacity:0} } .pe-prob-bars { display: flex; flex-direction: column; gap: 0.55rem; } .pe-prob-row { display: flex; align-items: center; gap: 0.8rem; font-size: 1.1rem; } .pe-prob-lbl { width: 80px; text-align: right; color: var(--muted); font-family: 'JetBrains Mono', monospace; flex-shrink: 0; } .pe-prob-track { flex: 1; height: 26px; background: var(--bg3); border-radius: 5px; overflow: hidden; } .pe-prob-fill { height: 100%; background: var(--cyan); border-radius: 5px; transition: width 0.55s cubic-bezier(0.4,0,0.2,1); width: 0; } .pe-prob-fill.win { background: var(--gold); } .pe-prob-pct { width: 50px; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--muted); } /* Math display */ .pe-math { font-family: 'Georgia', serif; font-size: 1.45rem; color: var(--gold); text-align: center; padding: 1.3rem; background: var(--bg3); border-radius: 8px; margin-bottom: 0.9rem; } .pe-term { display: inline; opacity: 0; transition: opacity 0.4s; cursor: help; position: relative; } .pe-term.on { opacity: 1; } .pe-term:hover::after { content: attr(data-tip); position: absolute; bottom: 115%; left: 50%; transform: translateX(-50%); background: var(--bg); border: 1px solid var(--purple); color: var(--text); padding: 0.4rem 0.85rem; border-radius: 6px; font-family: 'Space Grotesk', sans-serif; font-size: 0.88rem; white-space: nowrap; z-index: 20; } .pe-anns { display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; } .pe-ann { background: var(--bg3); border-radius: 6px; padding: 0.55rem 0.85rem; font-size: 0.93rem; opacity: 0; transition: opacity 0.5s; } .pe-ann.on { opacity: 1; } .pe-ann-sym { color: var(--gold); font-family: 'JetBrains Mono', monospace; font-weight: bold; } .pe-ann-desc { color: var(--muted); } /* Softmax */ .pe-sm-demo { display: flex; gap: 1rem; align-items: flex-start; flex-wrap: wrap; } .pe-sm-col { flex: 1; min-width: 160px; } .pe-col-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.75rem; } .pe-logit-row { display: flex; align-items: center; gap: 0.6rem; margin-bottom: 0.55rem; font-size: 0.97rem; font-family: 'JetBrains Mono', monospace; } .pe-logit-w { width: 60px; color: var(--text); } .pe-logit-v { padding: 0.22rem 0.6rem; border-radius: 4px; font-size: 0.92rem; } .pe-logit-v.neg { background: rgba(248,113,113,0.15); color: var(--danger); } .pe-logit-v.pos { background: rgba(0,229,255,0.1); color: var(--cyan); } .pe-sm-bar { height: 22px; border-radius: 4px; background: var(--purple); transition: width 0.75s cubic-bezier(0.4,0,0.2,1); width: 0; display: flex; align-items: center; padding-left: 7px; font-size: 0.86rem; color: #fff; overflow: hidden; white-space: nowrap; } .pe-arrow { display: flex; align-items: center; justify-content: center; padding-top: 1.4rem; font-size: 1.6rem; color: var(--cyan); } /* Loss */ .pe-loss-wrap { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: start; } @media(max-width:500px) { .pe-loss-wrap { grid-template-columns: 1fr; } } .pe-loss-num { font-size: 2.8rem; font-weight: 800; font-family: 'JetBrains Mono', monospace; color: var(--danger); transition: color 0.5s; line-height: 1; } .pe-loss-num.good { color: #4ade80; } .pe-loss-lbl { font-size: 0.88rem; color: var(--muted); margin-top: 0.3rem; } .pe-loss-slider label { font-size: 0.92rem; color: var(--muted); display: block; margin: 0.8rem 0 0.3rem; } input[type=range] { width: 100%; accent-color: var(--cyan); } .pe-loss-formula { background: var(--bg3); border-radius: 8px; padding: 1.1rem; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--text); line-height: 2.1; } .pe-lf-hl { color: var(--gold); } .pe-lf-res { color: var(--cyan); font-weight: bold; } /* Attention */ .pe-attn-words { display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 0.9rem; } .pe-attn-word { padding: 0.4rem 0.8rem; border-radius: 6px; background: var(--bg3); border: 1px solid var(--border); font-size: 1rem; cursor: pointer; transition: all 0.2s; user-select: none; } .pe-attn-word:hover { border-color: var(--cyan); } .pe-attn-word.sel { background: rgba(0,229,255,0.12); border-color: var(--cyan); color: var(--cyan); } /* Attention formula cards */ .pe-qkv { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; } @media(max-width:480px) { .pe-qkv { grid-template-columns: 1fr; } } .pe-qkv-card { border-radius: 6px; padding: 0.75rem 0.9rem; font-size: 0.93rem; } /* Temperature */ .pe-temp-row-ctrl { display: flex; align-items: center; gap: 1rem; } .pe-temp-big { font-size: 1.6rem; font-family: 'JetBrains Mono', monospace; font-weight: 700; color: var(--cyan); width: 56px; flex-shrink: 0; } .pe-temp-lbl { font-size: 0.92rem; color: var(--muted); margin-top: 0.2rem; } .pe-tbars { display: flex; flex-direction: column; gap: 0.45rem; margin-top: 0.9rem; } .pe-trow { display: flex; align-items: center; gap: 0.65rem; font-size: 0.95rem; font-family: 'JetBrains Mono', monospace; } .pe-tlbl { width: 64px; color: var(--muted); text-align: right; flex-shrink: 0; } .pe-ttrack { flex: 1; height: 19px; background: var(--bg3); border-radius: 4px; overflow: hidden; } .pe-tfill { height: 100%; border-radius: 4px; background: var(--purple); transition: width 0.5s cubic-bezier(0.4,0,0.2,1); } .pe-tpct { width: 48px; text-align: right; color: var(--muted); } /* Limits */ .pe-limits { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1rem; margin: 1.5rem 0; } @media(max-width:640px) { .pe-limits { grid-template-columns: 1fr; } } .pe-limit-card { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.5rem 1.6rem; transition: border-color 0.2s; display: flex; flex-direction: column; gap: 0.4rem; } .pe-limit-card:hover { border-color: var(--purple); } .pe-limit-icon { font-size: 2.2rem; line-height: 1; } .pe-limit-title { font-weight: 700; color: var(--text); font-size: 1.35rem; margin: 0; } .pe-limit-desc { font-size: 1.15rem; color: var(--muted); line-height: 1.65; margin: 0; } /* Dual meters */ .pe-meters { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.25rem 0; } .pe-meter { border-radius: 8px; padding: 1.1rem; text-align: center; } .pe-meter-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; margin-bottom: 0.4rem; } .pe-meter-val { font-size: 2.2rem; font-weight: 800; } /* Buttons */ .pe-btn { margin-top: 0.9rem; background: var(--bg3); border: 1px solid var(--cyan); color: var(--cyan); padding: 0.4rem 1.1rem; border-radius: 6px; cursor: pointer; font-size: 0.82rem; font-family: 'Space Grotesk', sans-serif; transition: background 0.2s; } .pe-btn:hover { background: rgba(0,229,255,0.1); } .pe-btn-pur { border-color: var(--purple); color: var(--purple); } .pe-btn-pur:hover { background: rgba(168,85,247,0.1); } /* Interactive badge */ .pe-interactive-header { display: flex; align-items: center; justify-content: space-between; margin-bottom: 1rem; } .pe-interactive-header .pe-box-title { margin-bottom: 0; } .pe-interactive-badge { display: inline-flex; align-items: center; gap: 0.4rem; background: rgba(0,229,255,0.08); border: 1px solid var(--cyan); color: var(--cyan); font-size: 0.72rem; font-weight: 700; letter-spacing: 0.12em; text-transform: uppercase; padding: 0.3rem 0.75rem; border-radius: 999px; flex-shrink: 0; } .pe-interactive-badge::before { content: ''; width: 7px; height: 7px; border-radius: 50%; background: var(--cyan); animation: pe-pulse 1.6s ease-in-out infinite; flex-shrink: 0; } @keyframes pe-pulse { 0%, 100% { opacity: 1; transform: scale(1); } 50% { opacity: 0.4; transform: scale(0.7); } } .pe-interact-hint { display: flex; align-items: center; gap: 0.6rem; margin-top: 0.9rem; padding: 0.85rem 1.1rem; background: rgba(0,229,255,0.05); border: 1px dashed rgba(0,229,255,0.25); border-radius: 8px; font-size: 1.05rem; color: var(--muted); } .pe-interact-hint span { font-size: 1.25rem; } .pe-divider { border: none; border-top: 1px solid var(--border); margin: 2.5rem 0; } @media(max-width:520px) { .pe-anns { grid-template-columns: 1fr; } .pe-sm-demo { flex-direction: column; } .pe-arrow { transform: rotate(90deg); } }

▶ Full Video Explainer — covering how LLMs work, from next-token prediction to attention, training, and why hallucinations are inevitable

Your browser doesn't support HTML5 video.

You've used ChatGPT. You've heard the word "AI" a thousand times this year. But here's something almost nobody explains clearly: the thing powering these tools isnot intelligent in any human sense. It doesn't think. It doesn't understand. It doesn't have goals.

It is, at its core, avery sophisticated next-word predictor — a probability engine trained on the vast majority of text the internet has ever produced. Once you understand this, everything else — its strengths, its failures, its weirdness — clicks into place.

Interactive animations ahead: Press buttons and move sliders as you go — seeing the math move makes it stick.

01 —What ChatGPT and Claude actually are

The term "Artificial Intelligence" conjures images of something that thinks, reasons, and understands — a mind in a machine. That framing is compelling, but misleading when applied to today's large language models (LLMs).

What you're actually talking to is anautoregressive probabilistic model. Every word it generates is the result of asking one question, over and over again:

"Given everything written so far, what word is most likely to come next?"

That's it. Do that billions of times on internet-scale text, and you get something that looks uncannily like reasoning. But it is, fundamentally, pattern matching at extraordinary scale — not understanding, consciousness, or genuine intelligence.

What you see	What's actually happening	The catch
It "reasons"	Pattern-matches reasoning traces from training data	Breaks on genuinely novel problems
It "knows facts"	Recalls high-frequency statistical associations	Hallucinates on rare edge cases
It's "creative"	Samples from learned creative pattern spaces	Derivative — remixes, doesn't invent
It has "opinions"	Outputs tokens shaped by training + alignment	No actual beliefs internally

02 —The one job every LLM does

Let's make this concrete. Below is a live simulation of next-token prediction. Press"Predict next token" and watch the model pick the next word based on probability scores.

🎯 Next-Token Prediction

Live · Interactive

The cat sat on the

👇 Press the button to watch the model predict — one token at a time.

Notice the bars: each candidate word gets a probability score. The model doesn't "decide" in any human sense — it samples from this distribution. The highest-probability word is chosen most often, but not always. That's where both creativity and errors come from.

03 —The probability formula

Here's the mathematical heart of it.Hover each term for a plain-English tooltip, then press the button to reveal the full breakdown piece by piece.

📐 Probability Formula

Live · Interactive

🖱️ Hover any term for a plain-English tooltip. Press the button to reveal the formula step by step.

P(w_t | w ; θ) = softmax(logits_t)

wₜ —The next token to predict

w —All previous tokens (the context)

θ —Billions of learned parameters

softmax —Converts scores → probabilities (sum = 1)

Plain English:"Given everything typed so far, and everything the model learned during training, what is the probability of each possible next word?" The model scores every word in its vocabulary — 50,000+ words — and softmax turns those raw scores into probabilities that add up to exactly 1.0.

04 —Softmax: raw scores → probabilities

The model internally produces a raw score (called alogit) for every possible next word. Logits can be any number — positive, negative, large, small. They're not probabilities yet. Thesoftmax function converts them into a clean distribution. Press the button to watch the transformation.

⚡ Softmax Transform

Live · Interactive

👇 Press the button to watch raw scores transform into probabilities.

Raw Logits (scores)

→

After Softmax (probabilities)

Notice: even the most negative logit still gets a small non-zero probability after softmax. The model never completely rules anything out — it just makes some words astronomically unlikely. This is partly why LLMs occasionally produce bizarre outputs: a 0.001% token still gets picked sometimes.

05 —How it learns: cross-entropy loss

During training, the model sees a sentence with the last word hidden and makes a prediction. The training algorithm asks:"How wrong were you?" The measure of wrongness iscross-entropy loss.

The formula:ℒ = −log P(correct word). If the model assigns 100% probability to the right word, loss = 0. If it assigns 1%, loss is very high.Drag the slider to see this in action.

📉 Cross-Entropy Loss

Live · Interactive

🎚️ Drag the slider to change the model's confidence and watch the loss recalculate live.

1.47

Loss ℒ = −log(p)

Model's confidence in correct word:23%

Correct word:"lazy"
P("lazy"):0.23

ℒ = −log(0.23)
ℒ =1.47

High loss → big update

06 —The Transformer: the machine inside

The specific architecture that makes modern LLMs work is called theTransformer, introduced in a landmark 2017 Google paper. Every major LLM today — GPT-4, Claude, Gemini, Llama — is built on this design.

A Transformer processes your text through many stacked layers. Each layer has two main components:

Multi-Head Self-Attention — Every word simultaneously looks at every other word, learning which relationships matter. This is the core insight.

Feed-Forward Network — A dense neural network that processes each token's information independently, after attention has been applied.

A large model like GPT-4 stacks around 96 of these layers. With enough layers, parameters, and training data, emergent abilities appear — code generation, translation, basic reasoning — that nobody explicitly programmed. They fall out of the math at scale.

07 —Self-attention: every word watches every word

Before the Transformer, AI models processed text word by word in sequence, making it hard to connect things far apart in a sentence. Self-attention solves this by letting every word simultaneously evaluate its relationship to every other word.Click a word to see its attention weights.

🔍 Self-Attention Weights

Live · Interactive

👆 Click any word above to see how it attends to every other word. Brighter = stronger attention — notice "it" lights up "animal."

📐 The Attention Equation

Attention(Q, K, V) = softmax(QKᵀ /√d_k ) ·V

Q — Query

"What am I looking for?"

K — Key

"What does each word offer?"

V — Value

"What info do I retrieve?"

08 —How text is actually generated

When you press Send in any AI chat app, here is exactly what happens:

Tokenization — Your message splits into tokens (subwords). "unbelievable" → ["un","believ","able"].
Embedding — Each token becomes a high-dimensional vector capturing meaning and position.
Forward pass — Vectors flow through all Transformer layers. Attention and feed-forward happen, repeatedly.
Logits → Probabilities — The final layer scores every vocabulary word. Softmax converts to probabilities.
Sampling — One word is chosen based on those probabilities.
Repeat — That word is appended and the whole process runs again until the response is done.

KV Caching: The model caches Key and Value matrices from previous steps so it doesn't recompute attention from scratch every token — making long responses computationally feasible.

09 —Temperature: controlling randomness

When sampling the next word, you can control how random the selection is with a parameter calledtemperature. Drag the slider to see how it reshapes the probability distribution in real time.

🌡️ Temperature Sampling

Live · Interactive

🎚️ Drag the slider left for predictable outputs, right for creative (or chaotic) ones.

1.0

Balanced

Formula: p'ᵢ = pᵢ^1/T / Σ(pⱼ^1/T)

Low temperature (e.g. 0.2) makes the model deterministic — it almost always picks the top word. High temperature (e.g. 2.0) flattens the distribution, giving unusual words a real chance. Most production systems run between 0.7 and 1.0.

10 —Why it sometimes lies (hallucinations)

One of the most misunderstood LLM behaviors ishallucination — when the model confidently states something false. This isn't a bug to be patched away. It's a direct consequence of the architecture.

The model has no internal truth checker. No access to the real world. It only knows:what sequence of words tends to follow this sequence of words? When asked something rare or obscure, the model fills the gap with statistically plausible text — which may be completely wrong.

Analogy: Imagine someone who has read every book in a library but never left the building. Ask what the weather is like outside — they'll give a confident, well-reasoned answer based on weather descriptions they've read. It might be completely wrong.

Ground Truth Access

NONE

Statistical Plausibility

HIGH

11 —Key limitations to know

Understanding these isn't pessimism — it's how you use these tools well.

📏

Context Window

Fixed memory. Older models: ~4K tokens. Newer: up to 1M+. Anything beyond the window is completely invisible to the model.

🌀

No Persistent Memory

Every conversation starts completely fresh. The model has no memory of past sessions unless you explicitly provide them.

🎲

Stochasticity

Same prompt, potentially different outputs. The sampling process is inherently random, even at low temperatures.

🔓

Jailbreaks

Safety training is pattern-based. Clever prompting can sometimes bypass it because the model is still a pattern matcher at heart.

💭

Hallucinations

Inevitable on low-frequency knowledge. No fact-checker means confident errors are always possible. Verify important claims.

⚡

Quadratic Cost

Attention cost grows quadratically with context length. Techniques like FlashAttention mitigate this, but it's a fundamental constraint.

12 —What's next

The probability-engine core remains — but researchers are building powerful layers on top.RAG (Retrieval-Augmented Generation) gives the model access to real documents at query time, dramatically reducing hallucination on factual tasks.Agentic systems let LLMs use tools, execute code, and iterate on their outputs.Reasoning models generate long internal chains of thought before answering, improving performance on math and logic. Andmultimodal models extend the same probabilistic core to images and audio.

None of these change the fundamental nature of what an LLM is. They all sit on top of the same next-token prediction engine. Understanding that foundation is what makes you a sharper thinker about where this technology is — and isn't — going.

The bottom line: LLMs are extraordinary pattern-recognition engines that have scaled statistical prediction to the point of producing genuinely useful, sometimes astonishing outputs. They are not intelligent in any human sense. Knowing this — really knowing it — is what separates clear thinking about AI from hype.

Video generated with Grok Imagine. Animations built with vanilla JavaScript.

A curated vendor-domain allowlist. Results are restricted to Microsoft Learn, AWS, Google Cloud, Cisco, Palo Alto, NIST and others, with suffix-safe matching that blocks spoofs likecisco.com.evil.com.
Per-agent ranking.AGENT_DOMAINS ranks each specialist's preferred docs first, so the AWS agent leans on AWS documentation before anything else.
Source-agnostic results. Web hits are reshaped to look exactly like KB results (source / url / text / score), so the confidence math works identically on either.
Rich metadata. A schema ofdomain,vendor,document_type andversion_date travels with every chunk into the prompt.

Trust, Computed

Confidence is math, not vibes

In v2.6 the model ended each answer withConfidence: 0.92 and the orchestrator believed it. The trouble is that a model confidently answering an AWS question with Azure facts will happily rate itself 0.92 too. Self-report is theatre. So confidence became a number thesystem computes, before and after generation:

pre_gen= 0.6 · retrieval+ 0.4 · namespace_overlap
confidence= min(1.0, best_pre_gen+ 0.2 · citation_density)

retrieval_score — the quality of the retrieved documents. With an empty KB it is derived from web hits: 0.85 for preferred vendor domains, 0.70 otherwise.
namespace_overlap — does the query hit the agent's keywords? A strong match scores about 0.85; off-topic collapses to 0.1, all but guaranteeing escalation.
citation_density — the share of claims backed by[1] [2] sources, rewarding grounded answers with up to a 0.2 boost.

Below the 0.7 threshold, the query climbs to the parent agent, which re-retrieves against its own namespace and re-answers. One subtle fix mattered here: the strongest_best_pre_gen score is carried up the escalation chain, so a confident child's score is never erased by a weaker parent re-running the same step. The model no longer judges itself — the architecture does.

The Undergrowth

The bugs that shaped the design

Most of the architecture above exists because something broke first. The marquee disaster was the0% confidence saga — a cluster of unrelated failures that all produced the same symptom: every query inexplicably crashing to zero confidence and escalating to the top of the tree.

Silent search failure.duckduckgo_search was renamed toddgs upstream and returned an empty list. A catch-allexcept swallowed the error, so retrieval quietly went to zero.
Empty-KB zero score. With no documents ingested, the retrieval score defaulted to 0.0 — now derived from web-result quality instead.
Overwritten scores. Each escalation re-ranstep_build and erased the child's strong score, until_best_pre_gen was preserved across the chain.
Dead Tier-2 regex. Double-escaped\\b made domain routing rules unreachable, and malformed YAML broke parsing entirely.
Context overflow. The expanded universal design template pushed prompts past the local model's context limit — a 400 error until web-context sizes were tuned back down.
A frozen UI on long jobs. Design-doc generation runs for minutes and the UI silently froze, until streaming feedback and a 10-minute timeout were added.

A single catch-allexcept turned a renamed package into an invisible, week-long confidence collapse. Fail loud, not silent.

The Working Contract

How the build stays honest

With Git tracking disabled, discipline had to live somewhere. It lives in two places. The first is a hand-maintainedCHANGELOG.md that records every change, reversal and reason. The second is a behavioural contract the assistant itself follows when modifying the code — think before acting, make the smallest possible diff, verify against success criteria, keep everything auditable and reversible, and prefer less code over more. It reads less like an engineering process and more like the Law of the Jungle: a few rules everyone keeps, because the alternative is chaos.

See it running — the screenshots

Nine captioned frames from the live system: the Gemma model resident in LM Studio, the Gradio chat generating a Microsoft 365 Copilot design document at 90% confidence with live web search, the namespace-per-domain knowledge base, and the design template behind it all.

Open the screenshot gallery →

The Map So Far

Six releases, one expedition

2.6.0

Base system

Routing, RAG, agents and escalation — the first end-to-end build.

2.7.0

Grounded confidence

Dropped LLM self-report for a math formula; added web-search fallback and Mermaid diagrams in the UI.

2.7.1

Agent reshuffle

Retired the Intune, AVD, Citrix and Virtualization agents; added Microsoft DWP, End-User Virt and the network specialists.

2.7.4

Routing repairs

Fixed dead regex rules, corrupted YAMLs, and a hardcoded-path split that loaded stale manifests.

2.8.0

Web-first retrieval

Scrape vendor docs before the KB; vendor allowlist; KB fallback merge.

2.8.2

Confidence fixes

Solved 0% confidence on design docs; preserved the best score across the escalation chain.

2.9.0

Clarifying questions

Before generating a design, agents now ask two to four targeted questions — org size, compliance, stack, timeline — and fold the answers into a far more specific result.

Where We Are · What's Next

Wired, working, and honest

v2.8.2 is a working, daily-use system. The full pipeline runs route → retrieve → build → generate → score → escalate → audit, all thirteen manifests parse, routing and parent maps are consistent, and live vendor-doc search returns current, citable content. What's still open is honest too: the KB folders are wired but largely empty and need real source documents ingested; a rule-ordering overlap can still misroute shared virtualization keywords between End-User Virt and DWP; Git is off; and the context budget stays tight on the local model for large templates.

The road ahead, in order: populate the knowledge base so RAG augments rather than just the web; resolve the routing overlaps and settle End-User-Virt-versus-DWP ownership; re-enable Git and move off the manual changelog; then build a query test-suite to calibrate the confidence threshold against measured answer quality.

Lessons from the Trail

What the journey taught

Fail loud, not silent. A catch-allexcept turned a renamed package into invisible 0% confidence. Surface errors — never swallow them.
Measure what you trust. Self-reported confidence is theatre. Grounding trust in retrieval and citations made escalation actually mean something.
Narrow beats broad. Specialists with tight domains hallucinate far less than one generalist trying to know everything.
Write the changelog. With Git off, the manualCHANGELOG.md became the single source of truth for every decision and fix.

Aether began as a single sentence — "one model can't know everything." It grew into a hierarchy of grounded, self-aware experts, and the changelog is the proof of the journey. Specialise · Ground · Measure · Escalate.

]]>

RAG, Graph RAG, Agentic RAG — and How to Make Any of Them 32× Memory Efficient

Ajay Walia — Thu, 28 May 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;600&display=swap'); .rag-art { --bg: #0a1220; --bg2: #0f1a2e; --bg3: #142442; --line: #1f3358; --text: #e2e8f0; --muted: #8aa0c0; --accent: #22d3ee; --accent2: #00e5a8; --warn: #f59e0b; --danger: #ef4444; --purple: #a78bfa; font-family: 'Inter', sans-serif; color: var(--text); background: var(--bg); padding: 2rem 2.25rem; border-radius: 16px; box-shadow: 0 4px 30px rgba(0,0,0,0.55); line-height: 1.75; } .rag-art * { box-sizing: border-box; } .rag-art .section { padding: 40px 0; border-bottom: 1px solid var(--line); } .rag-art .section:last-child { border-bottom: none; } .rag-art .label { font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .22em; text-transform: uppercase; color: var(--accent); font-weight: 600; margin-bottom: 8px; } .rag-art h2 { font-family: 'Inter', sans-serif; font-size: clamp(1.5rem, 1.2rem + 1vw, 2.2rem); font-weight: 700; letter-spacing: -.02em; color: #fff; margin: 0 0 18px; } .rag-art h3 { font-family: 'Inter', sans-serif; font-size: 1.15rem; font-weight: 600; color: #fff; margin: 28px 0 10px; } .rag-art p { margin: 0 0 16px; font-size: 1rem; color: var(--text); } .rag-art ul { list-style: none; padding: 0; margin: 0 0 18px; } .rag-art ul li { padding: 4px 0 4px 22px; position: relative; font-size: .95rem; } .rag-art ul li::before { content: "▸"; position: absolute; left: 0; color: var(--accent); } .rag-art strong { color: #fff; } .rag-art code { background: var(--bg3); padding: 1px 6px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-size: .88rem; color: var(--accent2); } .rag-visual { margin: 28px 0; border-radius: 8px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--line); background: var(--bg2); } .rag-visual svg { display: block; width: 100%; height: auto; min-width: 700px; } .rag-cap { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin: -10px 0 24px; } .rag-figure { margin: 28px 0; } .rag-figure img, .rag-figure video { display: block; width: 100%; height: auto; border-radius: 8px; border: 1px solid var(--line); } .rag-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; } @keyframes ragPulseSoft { 0%,100% { opacity: .35; } 50% { opacity: 1; } } @keyframes ragPulseStrong{ 0%,100% { opacity: .55; r: 9; } 50% { opacity: 1; r: 11; } } @keyframes ragMissed { 0%,100% { opacity: .25; } 50% { opacity: .9; } } @keyframes ragFlow { from { stroke-dashoffset: 24; } to { stroke-dashoffset: 0; } } @keyframes ragDraw { from { stroke-dashoffset: 800; } to { stroke-dashoffset: 0; } } .rag-flow-line { stroke-dasharray: 6 4; animation: ragFlow 1.2s linear infinite; } .rag-traversal { stroke-dasharray: 6 4; animation: ragFlow 1.5s linear infinite; } .rag-glow { animation: ragPulseStrong 2.4s ease-in-out infinite; } .rag-missed { animation: ragMissed 3s ease-in-out infinite; } /* Sequential agent tool highlight (5 tools) */ @keyframes ragAgentT { 0%,18% { opacity: 1; } 25%,100% { opacity: .35; } } .rag-tool-1 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 0s; } .rag-tool-2 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 1s; } .rag-tool-3 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 2s; } .rag-tool-4 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 3s; } .rag-tool-5 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 4s; } /* Sequential float→binary cell transform (8 cells) */ @keyframes ragBitTransform { 0%,8% { opacity: 0; } 14%,100% { opacity: 1; } } .rag-bit-1 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 0s; } .rag-bit-2 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .35s; } .rag-bit-3 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .7s; } .rag-bit-4 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.05s; } .rag-bit-5 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.4s; } .rag-bit-6 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.75s; } .rag-bit-7 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.1s; } .rag-bit-8 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.45s; } @media (prefers-reduced-motion: reduce) { .rag-flow-line, .rag-traversal, .rag-glow, .rag-missed, .rag-tool-1,.rag-tool-2,.rag-tool-3,.rag-tool-4,.rag-tool-5, .rag-bit-1,.rag-bit-2,.rag-bit-3,.rag-bit-4,.rag-bit-5,.rag-bit-6,.rag-bit-7,.rag-bit-8 { animation: none !important; opacity: 1 !important; } } .rag-cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 14px; margin: 24px 0; } .rag-card { background: var(--bg2); border: 1px solid var(--line); border-radius: 10px; padding: 18px 18px; } .rag-card h4 { font-size: 1rem; font-weight: 600; color: #fff; margin: 0 0 8px; } .rag-card p { font-size: .88rem; color: var(--muted); margin: 0; } .rag-pull { border-left: 3px solid var(--accent); padding: 6px 0 6px 18px; margin: 28px 0; font-size: 1.08rem; color: var(--accent); font-style: italic; } .rag-pull.warn { border-color: var(--warn); color: var(--warn); } .rag-table { width: 100%; border-collapse: collapse; margin: 18px 0 26px; font-size: .9rem; } .rag-table th, .rag-table td { padding: 11px 14px; text-align: left; border-bottom: 1px solid var(--line); vertical-align: top; } .rag-table th { color: var(--accent); font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .14em; text-transform: uppercase; border-bottom: 1px solid var(--accent); font-weight: 600; } .rag-table td.k { color: #fff; font-weight: 600; width: 25%; } .rag-table td.v { color: var(--text); } .rag-table td.m { color: var(--muted); font-size: .85rem; } .rag-stat { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--line); border: 1px solid var(--line); margin: 28px 0; border-radius: 6px; overflow: hidden; } .rag-stat > div { background: var(--bg2); padding: 18px 14px; text-align: center; } .rag-stat .num { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.7rem; color: var(--accent); display: block; } .rag-stat .lbl { font-family: 'JetBrains Mono', monospace; font-size: .66rem; letter-spacing: .1em; color: var(--muted); text-transform: uppercase; margin-top: 4px; display: block; } @media (max-width: 600px) { .rag-art { padding: 1.25rem; } .rag-stat { grid-template-columns: repeat(2,1fr); } .rag-visual svg { min-width: 600px; } }

Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.

This piece does two things.First — break down RAG, Graph RAG, and Agentic RAG visually: how each works, where each one breaks, and which query type it's the right fit for.Second — show how a single technique calledbinary quantization can shrink the vector index inside any of these architectures by a factor of 32 without breaking retrieval quality. This is the trick Perplexity, Azure, and HubSpot use in production.

3Architectures
compared

32×Memory
reduction

1-hop / N-hopWhen each
variant wins

0 codeAll concept,
all visual

Part 1A · Standard RAG

The Default Pipeline — and What It's Actually Good At

Standard RAG is what most engineers mean when they say "RAG". Documents are split into chunks, each chunk is embedded into a high-dimensional vector, and those vectors are stored in a vector database. At query time, the user's question is embedded too, and the database returns the top-k chunks by similarity (usually cosine distance). Those chunks are pasted into the LLM's prompt as context, and the model answers from them.

Where Standard RAG Wins

Direct factual lookups. Single chunk contains the answer. "What is our refund policy?" → retrieves the refund policy chunk → done.
Cost and latency. One embedding call, one similarity search, one LLM call. Easy to debug.
Mature tooling. Pinecone, Weaviate, Qdrant, Milvus, pgvector — all production-ready for this pattern.

Where Standard RAG Breaks

It retrieveschunks, never the relationshipsbetween chunks. The moment the answer requires combining facts that live in different documents — or even different sections of the same document — similarity search starts missing things.

Similarity search will happily return two facts that sit close to the query in embedding space, while the missing third fact that connects them sits far away and never makes it into the context window.

Concretely, imagine a vector database storing three facts about your internal services:

The bridge fact sits too far from the query in embedding space. Similarity search has no way to find it from where it started.

Part 1B · Graph RAG

Adding a Knowledge Graph on Top

Graph RAG addresses the multi-hop problem by adding a structural layer over the documents. During indexing, an LLM extractsentities (services, people, places, concepts) and therelationships between them, building a knowledge graph alongside the vector index. At query time, the system traverses that graph instead of relying purely on embedding similarity.

How a Graph RAG Query Actually Runs

The user asks "Will checkout be affected by Friday's maintenance?". The system identifies the entities mentioned in the query (checkout,Friday maintenance), looks them up as nodes in the graph, and walks the edges between them. The traversal returns the chain of relationships, and that chain gets handed to the LLM as structured context — not as random chunks of prose.

Multi-hop reasoning

Followinguses → runs_on → affects recovers the bridge fact that pure similarity search missed.

Explainable context

Every answer comes with a traversable path. Easier to audit than "the top-5 most similar chunks said so."

Heavier to build

Entity extraction at index time is expensive. Schema design matters. Not free lunch.

Less flexible than agents

The graph schema is fixed at indexing time. Queries that need fresh tools or external sources still need help.

Part 1C · Agentic RAG

Letting the LLM Choose How to Retrieve

Agentic RAG replaces the fixed retrieval pipeline with an LLM agent that decides — at query time — which tools to invoke, which sources to query, and in what order. The agent might call a vector search, then a SQL database, then a web fetch, then a graph traversal, all in service of one question.

What "Dynamic" Actually Means

A user asks:"Has any customer raised a ticket about the checkout outage we had last Friday, and what was our response time on it?" An agentic system might:

Call theknowledge graph to confirm there was an outage on the checkout service last Friday.
CallSQL on the ticketing database to list tickets opened that day mentioning "checkout".
Call thevector DB over chat history to find related customer complaints in Slack.
Call thecode interpreter to compute average first-response time on the matching tickets.
Compose the answer.

None of that ordering was decided in advance. The agent chose it. That flexibility is the whole point — and the whole risk.

Flexible

Handles open-ended tasks that touch multiple data sources and require fresh information.

Higher latency

Several tool calls per question. A simple lookup that took 200ms in standard RAG now takes 4–8 seconds.

Harder to debug

The agent's reasoning path is non-deterministic. Reproducing a failure mode can be slippery.

Can spiral

Without tight tool authority and budgets, agents loop on themselves. Pair this with a state machine.

Part 1D · Decision

These Aren't Levels — They're Different Tools

The most common mistake is treating these as a maturity ladder you have to climb. They aren't. They solve different query types. A good system often uses all three in different parts of the same product.

Once the right architecture is in place for the query type, the next leverage point is efficiency. Every one of these three depends on a vector index somewhere underneath — and that index is where most of the memory cost lives.

Part 2 · Efficiency

How to Make Any RAG 32× More Memory Efficient

The 32× trick — float magnitudes compressed to a single sign bit per dimension.

Every RAG variant pays the same tax: it stores high-dimensional embeddings of every chunk it's ever indexed. That tax adds up fast. At ten million chunks, a standard 768-dimension float32 index needs about 30 GB just to hold the vectors — and that index has to sit in fast RAM if you want sub-second retrieval. Doubling your corpus doubles the bill.

The trick that Perplexity, Azure AI Search, and HubSpot all use in production is calledbinary quantization. It cuts the memory footprint by 32 times. The architecture above it doesn't change — Standard, Graph, or Agentic, the same trick applies.

The Memory Bill, in Numbers

The Trick: Throw Away Magnitudes, Keep the Sign

Binary quantization is structurally simple. For every dimension of every vector, ask one question:is the value positive or negative? Positive becomes1, negative becomes0. The 32-bit float is replaced by a single bit. Same dimensionality, 1/32nd the storage.

The Distance Metric Changes Too — Cosine Becomes Hamming

Float32 vectors compare viacosine similarity, which is computed from dot products. Binary vectors compare viaHamming distance: count the number of bits that differ between two vectors. On modern CPUs, this is two instructions —XOR thenpopcount — and runs at billions of comparisons per second.

The Trade-off — and the Fix

Of course, throwing away the magnitudes throws away some information. A naive binary index loses roughly 5–10% of retrieval accuracy compared to the full float32 index. Production systems solve this with atwo-stage search: use the cheap binary index to retrieve a wide net of candidates fast, then re-score the small candidate set using the original full-precision vectors.

Stage 1 is where the 32× memory win lives. The binary index is small enough to fit comfortably in CPU cache, so you can scan tens of millions of candidates in single-digit milliseconds. Stage 2 only ever touches a few hundred full-precision vectors, so the expensive cosine math is bounded.

The recall lost in stage 1 is paid back in stage 2. End-to-end retrieval quality is typically within 1% of a full float32 search, at 1/32 the hot memory.

Memory Bill at Scale — Before and After

Corpus size	Float32 only	Binary (stage 1)	Hybrid (stage 1 hot + stage 2 cold)
1 M vectors	3 GB · hot RAM	94 MB · hot RAM	94 MB hot · 3 GB cold
10 M vectors	30 GB · hot RAM	940 MB · hot RAM	940 MB hot · 30 GB cold
100 M vectors	300 GB · multi-node	9.4 GB · single node	9.4 GB hot · 300 GB cold
1 B vectors	3 TB · cluster	94 GB · single beefy node	94 GB hot · 3 TB cold tier

The shape of the curve is what matters: the hot index — the part that controls latency — stays manageable even as the corpus grows by orders of magnitude. The cold tier scales linearly but cheaply, because it only gets touched for the few hundred candidates surfaced by stage 1.

When to Reach for This

Above ~1 M vectors

Below that scale, plain float32 is fine. The complexity of two-stage retrieval isn't worth the few hundred MB you'd save.

Hot real-time queries

If your retrieval p95 needs to stay under 100ms, the binary first stage is what keeps you there as the index grows.

Cost-sensitive deployments

Saving 30 GB of RAM × 3 replicas × 12 months adds up to real money. Especially on managed vector services.

Any of the three architectures

Standard, Graph, Agentic — they all sit on a vector index somewhere. This optimisation applies everywhere they do.

Putting It Together

Architecture and Efficiency Are Orthogonal

Two decisions, independent of each other.What kind of question does my system have to answer? — that's the architecture decision. Single-hop facts go to Standard RAG. Multi-hop relationship questions go to Graph RAG. Open-ended tool-using tasks go to Agentic RAG.How big is my index going to get? — that's the efficiency decision. Above a million chunks, binary quantization plus float rescoring buys you 32× memory headroom for ~1% quality cost.

The same vector index sits underneath all three architectures. The same trick applies to all three. Pick the architecture for the query type. Apply the efficiency trick because the math works.

RAG isn't one thing — it's a layered decision. Get the architecture right for the query, then make the index small enough to keep up.

Ajay Walia · CuriousBit Knowledge Base · May 2026

]]>

The AI Subsidy Era Is Ending

Ajay Walia — Sun, 24 May 2026 00:00:00 +0000

A pattern from 1865 is playing out in 2026 enterprise AI — and most companies never saw it coming.

Start Here: What Is Jevons Paradox?

In 1865, British economist William Stanley Jevons publishedThe Coal Question, a book with a deeply counterintuitive thesis. He argued that improvements to the efficiency of steam engines — making them burn coal far more economically — would not reduce Britain’s total coal consumption. It would increase it dramatically.

His logic was simple: when a resource becomes cheaper to use, people usemore of it. Lower cost per unit lowers the barrier to adoption. More use cases become viable. More industries lean in. The aggregate demand grows far beyond what the efficiency gains saved. You don’t consume less of the thing; you find a hundred new reasons to consume it.

He was right. This effect became known asJevons Paradox — the uncomfortable truth that technological efficiency gains often increase resource consumption rather than reduce it.

It has been observed in energy, transportation, computing, and bandwidth. And right now, in May 2026, it is playing out in real time across enterprise AI.

What Just Happened

Two stories broke in mid-May 2026 that crystallised what many in the industry had been quietly sensing for months.

Microsoft began winding down internal access to Anthropic’s Claude Code for thousands of its own engineers — specifically across the Experiences + Devices division covering Windows, Microsoft 365, and Teams. By end of June 2026, those engineers will be redirected to GitHub Copilot CLI instead. This is the same Microsoft that has invested over $13 billion into OpenAI and hosts Anthropic’s workloads on Azure. The signal is unmistakable: even with deep infrastructure ownership, third-party token costs at scale become painful.

Uber reportedly exhausted its entire 2026 AI budget by April — four months into the year. With somewhere between 84 and 95 percent of engineers using AI coding tools monthly and a large and growing share of production code being AI-generated, token consumption blew past every internal forecast. The technology had worked exactly as intended. The economics had not been modelled for it.

These are not isolated incidents. They are the visible edge of a broader reckoning.

Why Jevons Paradox Explains Everything

The enterprises that adopted Claude Code, Copilot, and similar agentic tools in 2024 and early 2025 did so during a period of heavy subsidisation. AI labs — under competitive pressure and racing for market share — kept pricing low, sometimes below cost, to build usage and lock in developer workflows.

Companies responded rationally. They adopted aggressively. They built workflows around these tools. And crucially, they budgeted for AI as if it were traditional SaaS: a predictable per-seat or flat-rate line item that finance could model cleanly.

What they got instead was usage-based token billing — where highly capable tools get used constantly, and where the most valuable agentic workflows (multi-step code generation, automated testing, long-context reasoning) consume far more tokens per session than a simple chat query ever would.

Jevons Paradox kicked in hard. As the tools became genuinely useful, engineers reached for them constantly. The efficiency gain per task was real — but the total volume of tasks AI was applied to grew faster. The unit cost came down; the aggregate spend exploded.

No one was being reckless. They were just behaving exactly as users of a useful technology always do.

The Forces That Collided

Several things happened simultaneously to create this crunch.

Pricing shifted. Through 2025 and into 2026, AI labs moved from promotional and flat-rate models toward honest usage-based billing. Prices for frontier model inference rose 20 to 37 percent in some tiers. The subsidised introductory phase of the enterprise AI market was closing.

Consumption was underestimated everywhere. Most enterprises based their token usage forecasts on early-adopter pilots and simple query patterns. They did not anticipate what agentic workflows would look like at scale — long context windows, iterative back-and-forth with code environments, parallel agent runs. The consumption profile of a developer genuinely integrating AI into their daily workflow is an order of magnitude heavier than the pilot suggested.

Infrastructure reality reasserted itself. Running frontier models is genuinely expensive. The economics that make it cheap at a promotional level do not hold at enterprise scale without subsidy. Even the largest cloud operators, with the most favourable unit economics in the industry, are finding that third-party token costs sit uncomfortably on the balance sheet.

The Fork in the Road

This creates a structural tension that neither enterprises nor AI labs can easily resolve.

If enterprises cut back usage to manage costs, AI labs face slower revenue growth at exactly the moment they need strong numbers — several are eyeing IPO windows or need to justify valuations to investors. That is a problem.

If labs cut prices significantly to retain volume, their own unit economics deteriorate at a time when they are still burning enormous sums on training runs and infrastructure. That is also a problem.

The most likely path is not a neat resolution but a market forcing function: enterprise buyers will become far more disciplined about where AI spend is justified, which will accelerate optimisation rather than retreat.

Expect to see model routing mature quickly — the practice of directing simple queries to cheaper, smaller models and reserving frontier inference for genuinely hard tasks. Expect caching, fine-tuning on domain-specific data, and distillation to become standard parts of the enterprise AI stack rather than advanced techniques used by a few. And expect a meaningful shift toward strong open-weight models — DeepSeek, Qwen, Llama derivatives — running on-premises or via lower-cost inference providers, particularly for workloads where data residency and predictable cost matter more than peak capability.

The “vibe coding” era — where engineers use AI tools liberally and broadly without much thought to token spend — is likely to give way to something more deliberate. Finance teams now have the data to ask the hard questions, and they will.

What This Means in Practice

For anyone operating at the intersection of enterprise technology and AI — which, if you work in digital workplace, infrastructure, or IT strategy, is increasingly all of us — a few things are worth watching.

The ROI conversation is now unavoidable. The magic-feeling phase of enterprise AI, where adoption was justified by enthusiasm and competitive pressure alone, is giving way to measurement. Real productivity gains against actual token burn. That is a healthy shift, but it requires instrumentation most organisations have not yet built.

Build vs. buy calculus is shifting. Microsoft’s decision to push engineers toward its own tooling rather than pay Anthropic’s rates is a preview of how large enterprises with engineering capacity will respond. Owning the stack — or at least the inference layer — becomes strategically valuable.

Budget modelling for AI needs to look different from SaaS modelling. Usage-based costs with high variance require different financial governance than per-seat licensing. Teams that have not updated their procurement and budgeting frameworks for token economics will keep running into Uber-style surprises.

The technology is not in retreat. The hype phase is ending, which is actually good for the technology’s long-term credibility. What emerges from this price discovery period will be a more durable foundation: AI spend tied to measurable outcomes, workflows optimised for real-world economics, and organisations that understand what they are actually buying.

The Bottom Line

Jevons was writing about coal. But he was really writing about human behaviour in the presence of efficiency. We do not save what we make cheaper — we find more things to do with it.

Enterprise AI is at that inflection point now. The easy, subsidised phase is ending. The economics are catching up to the technology. That is not a sign that AI has failed to deliver — in many cases, the tools have worked remarkably well. It is simply the normal lifecycle of a technology maturing from hype into infrastructure.

The companies that navigate this well will treat AI spend like any other major operational cost: measuring it, routing it intelligently, and connecting it to outcomes. That is less exciting than the story of magic tools that write code overnight. But it is how useful technologies actually get embedded into how organisations work.

]]>

RAG Chatbot from indexed public documentation

Ajay Walia — Tue, 19 May 2026 00:00:00 +0000

This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as theWeek 15 Graded Mini Project of theIITM Pravartak Professional Certificate Programme in Agentic AI and Applications. The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.

The implementation selects theTechnology domain and indexes six official pages of the GitHub REST API documentation. The assistant accepts free-form questions in a command-line interface, performs context-aware follow-up handling, returns concise answers with source attribution, and emits a fixed refusal phrase when the retrieved context does not contain the answer.

Architecture

The pipeline is split into two phases. An offline ingestion stage prepares a reusable vector index; a runtime stage answers user turns by retrieving from that index and prompting a large language model under strict grounding instructions.

Public Data Sources

Six public pages from the official GitHub Docs were saved locally as plain-text files indata/raw/:

Quickstart for GitHub REST API
Getting started with the REST API
Authenticating to the REST API
Rate limits for the REST API
Best practices for using the REST API
Troubleshooting the REST API

These pages were chosen because they form a self-contained operational knowledge set: a developer integrating with the API typically needs to consult exactly this material when handling authentication, rate-limit pressure, and recovery from common errors.

Components and Concepts

The following components and ideas underpin the implementation. Each is summarised briefly to make the design choices easy to evaluate.

Retrieval-Augmented Generation (RAG). A pattern in which a language model is grounded on retrieved passages from an external corpus rather than relying solely on parametric memory. Retrieval reduces hallucination and lets the assistant cite source material it actually consulted.

LangChain. The orchestration framework used throughout the project. It provides composable abstractions for document loaders, text splitters, embeddings, vector stores and chat models, which keeps the ingestion and runtime code small and idiomatic.

RAG Chunking. Long documents are split into smaller passages before embedding. This implementation usesRecursiveCharacterTextSplitter with a 900-character chunk size and a 150-character overlap. Splitting prefers semantic boundaries (paragraph breaks and Markdown headings) so that each chunk is internally coherent and retrieval surfaces meaningful units rather than fragments.

OpenAI Embeddings. Each chunk is converted into a high-dimensional vector using thetext-embedding-3-small model. The vector captures the semantic content of the chunk so that questions phrased differently from the source text still retrieve the right passages.

FAISS. Facebook AI Similarity Search is the local vector store used to index the embeddings. The index is persisted to disk undervectorstore/github_rest_api_faiss so that ingestion runs once and the runtime simply loads the existing index, satisfying the rubric’s reusable-index criterion.

Top-k Retrieval. At each user turn the retriever returns thek=4 chunks most similar to the query. A smallk keeps the prompt focused and reduces the chance that off-topic passages dilute the grounded answer.

Follow-up Rewriter. A short, dedicated LLM call rewrites the user’s latest question into a standalone retrieval query using recent conversation history. This is the mechanism that allows ambiguous follow-ups such as “what about rate-limit headers?” to be embedded and retrieved correctly without the model having to guess the missing antecedent.

Grounded Answerer. The final answer is produced bygpt-4o-mini at temperature zero under a system prompt that instructs the model to answer only from the retrieved context, append aSources: line naming the document titles consulted, and otherwise return the exact refusal phrase.

Refusal Behaviour. When retrieval returns no usable context, the runtime short-circuits and emits the required string verbatim:I don't have enough information in the provided documents. The refusal is also enforced by the system prompt, providing defence in depth against hallucinated answers.

Implementation Layout

The submission is organised as two thin Python modules and a Jupyter notebook.ingest.py loads the text files, removes GitHub Docs templating syntax, extracts a human-readable title from each document, chunks the text, embeds the chunks, and writes the FAISS index.chatbot.py loads the persisted index, manages conversation history, invokes the rewriter and the grounded answerer in sequence, and prints the response with source attribution. The accompanying notebook imports the same functions so that the notebook and the CLI execute identical code paths, eliminating drift between the two surfaces.

Configuration is environment-driven. The OpenAI API key is read from a.env file and an optionalOPENAI_BASE_URL is honoured so that the same code runs against the Vocareum-provided proxy without modification. No secrets are committed to source.

Sample Conversation

A short transcript illustrates the three behaviours the rubric tests — grounded answer, follow-up resolution, and refusal.

You: What headers should I send when making requests to the GitHub REST API?Assistant: GitHub REST API requests commonly use an Accept header such asapplication/vnd.github+json. Authenticated requests send an Authorizationheader with Bearer YOUR-TOKEN. GitHub also recommends a valid User-Agentvalue, such as your GitHub username or application name.Sources: Getting started with the REST API; Authenticating to the REST APIYou: What about rate limit headers?Search query: rate limit response headers GitHub REST APIAssistant: Response headers can include x-ratelimit-remaining andx-ratelimit-reset, which indicate how many requests remain in the currentperiod and when the rate limit resets.Sources: Getting started with the REST API; Rate limits for the REST APIYou: How should an integration handle rate limit errors?Assistant: If a retry-after header is present, wait that many seconds beforeretrying. If x-ratelimit-remaining is 0, wait until the time given byx-ratelimit-reset. Otherwise, wait at least one minute and then useexponentially increasing waits for repeated secondary rate-limit failures.Continuing to make requests while rate-limited may result in a ban.Sources: Best practices for using the REST API; Troubleshooting the REST API

The second exchange demonstrates the follow-up rewriter at work: the original “what about rate limit headers?” is expanded into a self-contained search query before retrieval, which is what allows the retriever to surface the correct passages despite the missing antecedent.

Now, a deliberately off-topic question — one the index cannot possibly support — triggers the safety path:

⛔ Refusal path — out-of-scope question
You: What is the best laptop to buy for running the GitHub REST API?
Assistant: I don't have enough information in the provided documents.

No chunk in the index supports an opinion on hardware, so the assistant returns the mandated refusal verbatim rather than producing a plausible-sounding but ungrounded answer. The refusal is enforced twice — once procedurally when retrieval returns empty, and once in the system prompt — so the behaviour is stable even when retrieval surfaces weakly related chunks.

Closing Observations

Three observations stand out from this exercise.

Retrieval quality is set upstream, not by the vector store. The quality of retrieval is determined less by the choice of vector store than by the upstream decisions about cleaning and chunking. Stripping GitHub Docs templating syntax — the{% data ... %} Liquid blocks, the reusables, and the[AUTOTITLE] link macros — materially improved the relevance of returned chunks because the embeddings stopped clustering around boilerplate.
Follow-up handling must be engineered, not assumed. It is not a free behaviour of the language model. A separate rewriter step that converts conversational queries into standalone search queries is the smallest reliable mechanism, and it removes a category of retrieval failures that would otherwise be invisible in casual testing.
Refusal is a feature, not a fallback. Enforcing the refusal both procedurally (short-circuit on empty context) and in the system prompt is what gives the assistant a useful posture in production: it will say nothing it cannot support, and it will say so in a predictable way.

]]>

I Built a Team of IT Architects using LLM That Live on MacBook — Meet Aether

Ajay Walia — Sat, 16 May 2026 00:00:00 +0000

Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.

I built Aether to fix that. It is a local-first, multi-agent AI system that runs a team of 10 specialist IT architecture advisors on a single MacBook Pro M5. No internet after setup. No API costs. No data leaves the machine. Just fast, cited, domain-grounded answers — available at 10pm when the deadline hits.

This is the story of how I built it, what the stack looks like under the hood, and what I learned along the way.

The Problem I Was Actually Solving

I have spent years in IT architecture — cloud, digital workplace, network, end-user computing, the works. Over that time I have accumulated a large personal knowledge base: AWS Well-Architected reviews, Intune compliance policy templates, AVD host pool sizing guides, Citrix NetScaler configurations, TOGAF artefacts, cloud adoption frameworks. The knowledge exists. The problem is retrieval — getting the right answer from the right domain quickly, without context-switching across six different documentation tabs.

Commercial AI tools are good at general answers. They are not great at answering “give me the exact OMA-URI path for configuring Windows Hello for Business through Intune on a hybrid-joined device for a tenant with MFA enforced at the Conditional Access layer.” That requires domain depth, and it requires knowing which documents to pull from.

I also wanted to explore a practical AI use case — not a demo, not a proof of concept, but something I would actually use daily. Aether became that experiment.

What Aether Is

AETHER v2.6 // SYSTEM DEFINITION

Type:      Local-First Multi-Agent AI System
Agents:    10 (3-tier hierarchy)
Model:     Gemma 4 26B A4B (Q4_K_M) — single instance
Runtime:   LM Studio → LangGraph → FastAPI → Gradio
Memory:    ~43 GB of 64 GB unified (M5 Pro)
Egress:ZERO
API cost:ZERO
Aether is alocal-first AI system that acts as a team of specialist architecture advisors — all running on your MacBook Pro M5. The headline technical trick: every one of those 10 advisors is thesame Gemma 4 26B model, loaded once. What makes each advisor different is purely the system prompt it receives and the knowledge base namespace it retrieves from.
One model. Ten personas. Zero cloud.
The Three-Tier Agent HierarchyThe agents are organised the way a real consulting firm would structure a team — from narrow specialist up to cross-domain strategist.
Tier 1 — Enterprise Architect (1 agent). Cross-domain strategy, TOGAF, Zachman, governance frameworks (GDPR, ISO 27001, HIPAA), technology investment decisions. This agent can readall knowledge base namespaces — it is the only one with that reach. The final escalation destination.
Tier 2 — Domain Architects (3 agents). Cloud Domain (multi-cloud strategy, FinOps, landing zones), Network Domain (SD-WAN, ZTNA, BGP, micro-segmentation), Digital Workplace Domain (Microsoft 365, VDI strategy, device management).
Tier 3 — Technology Architects (6 agents). AWS, Azure, GCP, Intune, AVD, Citrix. Each one is scoped tightly to its domain — deep, narrow, and fast.
The Stack, Component by ComponentComponentWhat It Does
Gemma 4 26B A4B (Q4_K_M)The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. ~13 GB VRAM.
LM StudioLocal model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.
LangGraphOrchestration graph — defines the multi-step query pipeline as a typed state machine.
LanceDBLocal vector database. One namespace (table) per agent. Fully file-based, no server process needed.
BAAI/bge-small-en-v1.5Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.
RedisSession memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).
FastAPI + UvicornREST API gateway on port 8000. Full query/response model, session management, agent listing.
GradioWeb chat UI on port 7860. Session management, source citations, escalation chain display.
PrometheusMetrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.
SQLite (audit.db)Immutable audit trail. Every query logged before response is returned. Cannot be skipped.
The full system sits comfortably in about 43 GB of the M5 Pro’s 64 GB unified memory — leaving 21 GB of headroom for the rest of the machine.
The “One Model, Ten Specialists” TrickThis is the part I get asked about most, because it sounds like it should not work.
Every agent in Aether is defined by a YAML manifest. That manifest specifies asystem_prompt, anamespace (which LanceDB table to retrieve from), atemperature, andmax_tokens. There is no model switching. There is no weight loading. The Gemma 4 26B A4B is loaded once by LM Studio and stays resident.
# agent_manifests/aws_technology_architect.yamlagent_id:aws_technology_architectdisplay_name:"AWS Technology Architect"tier:3parent_agent:cloud_domain_architectnamespace:aws_techtemperature:0.1max_tokens:2048system_prompt:| You are an AWS Technology Architect with deep expertise in the AWS Well-Architected Framework, EC2/EKS/Lambda sizing, IAM policy design, CloudFormation and CDK, GuardDuty, and cost optimisation. You draw only from AWS-specific documentation and architecture patterns. When answering, cite the source documents retrieved. End your response with a new line: 'Confidence: X.XX' (0.00–1.00) reflecting how well your knowledge base supports this answer.
Why does this work? Because Gemma 4 26B A4B is a Mixture of Experts model. It routes each token through specialist sub-networks internally — activating only about 4–6 billion parameters per inference pass, despite having 26 billion total. The practical result: it runs at near 5B speed while retaining the reasoning breadth of a much larger dense model. A single loaded instance can faithfully adopt both a narrow Citrix specialist persona and a broad enterprise strategy persona, because the MoE routing shifts for each.
The other half of the trick is the knowledge base. Each Tier 3 agent retrieves only from its own LanceDB namespace. The AWS agent never sees a Citrix document. The Intune agent never sees a GCP architecture guide. Domain knowledge is isolated by design — which means the model cannot hallucinate across domain boundaries, because the retrieval context does not cross them.
Confidence-Driven EscalationThe escalation mechanism is the design decision I am most proud of.
Every agent is instructed to append a confidence score to its response — a float between 0.00 and 1.00 representing how well its retrieved knowledge supports the answer. The orchestrator reads that score via regex. If it falls below 0.7and the agent has a parent tier definedand that parent has not already been tried, the system automatically escalates.
// ESCALATION CHAIN EXAMPLE

Query: "Give me a multi-cloud strategy covering Azure, AWS, GCP, AVD, and Citrix with network segmentation"

→ Routed to:aws_technology_architect
→ Confidence:0.41 (below 0.7 threshold)
→ Escalate to:cloud_domain_architect
→ Confidence:0.63 (still below threshold)
→ Escalate to:enterprise_architect
→ Confidence:0.88 ✓

Response appended with escalation chain for full transparency.
The response shown to the user includes the full escalation path — which level of expertise produced the final answer. This matters in enterprise contexts. It is the difference between “the AI said so” and “the enterprise-level advisor produced this after the technical specialist’s knowledge was insufficient.”
The elegance here is thatthe model participates in its own routing decision. The orchestration does not need a separate classifier to judge answer quality — the model tells you its own confidence, and the system acts on it.
The 7-Step Query PipelineEvery query passes through a typed LangGraph state graph. The state object —AetherState — carries everything between nodes: query, session ID, agent manifest, RAG results, conversation history, messages, response, confidence score, escalation flags, and escalation chain list.
01ROUTE02RETRIEVE03HISTORY04BUILD05LLM06ESCALATE?07FINALISE
01 — ROUTE. A keyword scanner maps the query to the best-fit agent, checking Tier 3 rules first (most specific), then Tier 2, then Tier 1 as catch-all. Redis caches route results for one hour, so repeated queries on the same topic skip the scan entirely.
02 — RETRIEVE. Semantic search against that agent’s LanceDB namespace — top-5 chunks returned. Documents were ingested at 500-word chunks with 50-word overlap, embedded into 384-dimensional vectors byBAAI/bge-small-en-v1.5 running on the Apple Neural Engine.
03 — HISTORY. The last three message pairs (six messages) are loaded from Redis for the session. This gives conversational continuity without letting the context window balloon.
04 — BUILD. The message payload is assembled:system_prompt + RAG documents + history + current query. The confidence instruction is appended here.
05 — LLM. The assembled payload hits LM Studio on port 1234. The orchestrator extracts the confidence score from the response text via regex before passing the response forward.
06 — ESCALATE (conditional). If confidence is below 0.7, a parent agent exists, and it has not already been tried this turn — swap in the parent’s manifest and loop back to step 02 with fresh retrieval against the parent’s namespace.
07 — FINALISE. The turn is saved to Redis. The SQLite audit record is written. The escalation chain annotation (if any) is appended to the response. Result returned to the user.
Knowledge Isolation — The Anti-Hallucination Architecture
One of the most practical decisions in Aether’s design is namespace isolation. Each Tier 3 agent retrieves only from its own LanceDB table. The AWS agent’s retrieval context will never include a Citrix StoreFront configuration guide — because those documents simply do not exist in its namespace.
This matters more than it might seem. A common failure mode in RAG systems iscross-domain contamination — where retrieval pulls in tangentially related content from a different domain, and the model confabulates a plausible-sounding but wrong answer by blending the two. Namespace isolation eliminates this at the architectural level.
The Enterprise Architect at Tier 1 is the deliberate exception — it can query all namespaces, because cross-domain synthesis is exactly what it is built for.
The Audit Trail — Because EnterpriseEvery single query is written toaudit.db (SQLite) before the response is returned. The record includes: timestamp, session ID, query text, agent used, confidence score, escalation chain, and sources cited. The audit write is wrapped in error handling so that a database failure never blocks the main query flow — but the log is never optional.
# Simplified from the finalise nodeaudit_record={"timestamp":datetime.utcnow().isoformat(),"session_id":state["session_id"],"query":state["query"],"agent_id":state["agent_id"],"confidence":state["confidence"],"escalation_chain":json.dumps(state["escalation_chain"]),"response_length":len(state["final_response"])}db.execute("INSERT INTO audit_log VALUES (:timestamp, :session_id, ...)",audit_record)
For a system giving architecture recommendations — decisions that feed into multi-million dollar cloud commitments — having an immutable audit trail of what was asked, what agent answered, at what confidence level, via what escalation path, is not a nice-to-have. It is the thing that makes it organisationally defensible.
What It Can Actually Do$aether query"Design an Azure landing zone for PCI-DSS compliance"
$aether query"Right-size EC2 instances for a memory-intensive Java application"
$aether query"Configure FSLogix profile containers for 500 AVD users"
$aether query"Zero-trust network architecture for 5,000 remote employees"
$aether query"Cloud adoption roadmap for a financial services firm"
$aether query"Intune compliance policy for iOS BYOD — hybrid-joined, MFA enforced"
The depth per domain is real. The Intune agent knows OMA-URI paths, Graph API commands, Autopilot profiles, and licensing requirements — because those are the documents I ingested into its namespace. The AVD agent knows host pool design, FSLogix sizing, MSIX app attach, and session host scaling plans. The knowledge base is only as good as what you put into it — but that is also the point. This ismy architecture knowledge, curated, searchable, and queryable at any hour.
What I Learned Building ThisPrompt engineering IS the architecture. In a system like this, the YAML manifestis the agent. The difference between a brilliant AWS specialist and a generic AI assistant is entirely in what the system prompt says and what documents back it up. Getting those prompts precise, domain-bounded, and calibrated for the right temperature took longer than any of the code.
MoE models are underrated for local multi-agent work. The choice of Gemma 4 26B A4B over a dense model was the right call. You get reasoning breadth comparable to a much larger model at the inference cost of a small one, on hardware that most architects already carry.
Confidence as a first-class citizen. Asking the model to self-assess and surface that score is one of the highest-value things I added. It makes the system honest — and it drives the escalation logic that makes the team metaphor actually work.
Namespace isolation is a practical hallucination brake. Not a theoretical one. In the first version of Aether, all documents lived in a single namespace. Cross-domain contamination was visible and annoying. Splitting into per-agent namespaces fixed it immediately.
Audit trails are not overhead — they are the point. Every enterprise AI deployment should have one. Building it into the core pipeline from day one changes how you think about what the system is producing.
What Is NextAether v2.6 is a working, daily-use system. The next version I am working toward adds a web-based ingestion UI (so loading new documents does not require touching the ingest script), structured output for architecture decision records (ADRs) in a consistent format, and inter-agent communication — where a Tier 3 agent can proactively pull context from a peer rather than waiting for the escalation chain to activate.
I built Aether because I wanted a team. It turns out a team was always available — it just needed the right prompts and a local model to bring it to life.
Have questions about the stack or want to see specific parts of the implementation? Drop a comment or reach out onLinkedIn.

]]>

Attention Is All You Need — The Paper That Rewired AI

Ajay Walia — Sun, 10 May 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Bangers&family=Space+Grotesk:wght@400;500;600;700&family=Inter:wght@400;500&family=JetBrains+Mono:wght@400;500&display=swap'); .attn-article { --bg: #080b14; --surface: #111827; --surface2: #1a2235; --border: #1f2d45; --text: #e2e8f0; --muted: #8b9ab3; --purple: #8b5cf6; --purple-glow: rgba(139,92,246,0.2); --cyan: #22d3ee; --cyan-glow: rgba(34,211,238,0.15); --gold: #f59e0b; --gold-glow: rgba(245,158,11,0.2); --red: #ef4444; --green: #10b981; --manga-bg: #0e1520; --manga-border: #f59e0b; } /* ── Particles background ── */ /* ── Site header ── */ /* ── Main wrapper ── */ /* ── Hero ── */ .hero { text-align: center; padding: 5rem 0 3rem; } .hero-tag { display: inline-block; background: var(--purple-glow); border: 1px solid var(--purple); color: var(--purple); font-family: 'Space Grotesk', sans-serif; font-size: 0.75rem; font-weight: 700; letter-spacing: 3px; text-transform: uppercase; padding: 0.35rem 1rem; border-radius: 100px; margin-bottom: 1.5rem; } .hero h1 { font-family: 'Bangers', cursive; font-size: clamp(2.8rem, 8vw, 5.5rem); line-height: 1.05; letter-spacing: 3px; background: linear-gradient(135deg, #fff 0%, var(--cyan) 40%, var(--purple) 100%); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; margin-bottom: 1.2rem; } .hero-sub { font-size: clamp(1.15rem, 1rem + 0.5vw, 1.45rem); color: var(--muted); max-width: 980px; margin: 0 auto 2rem; line-height: 1.7; } .hero-meta { display: flex; align-items: center; justify-content: center; gap: 1.5rem; flex-wrap: wrap; font-size: 0.82rem; color: var(--muted); font-family: 'Space Grotesk', sans-serif; } .hero-meta span { display: flex; align-items: center; gap: 0.4rem; } /* ── Manga Panel ── */ .manga-panel { margin: 2.5rem 0; background: var(--manga-bg); border: 2px solid var(--manga-border); border-radius: 4px; padding: 1.5rem; position: relative; box-shadow: 0 0 30px rgba(245,158,11,0.1), inset 0 0 60px rgba(0,0,0,0.4); } .manga-panel::before { content: '🎨 GROK IMAGINE'; position: absolute; top: -1px; left: -1px; background: var(--manga-border); color: #000; font-family: 'Bangers', cursive; font-size: 0.8rem; letter-spacing: 2px; padding: 0.2rem 0.8rem; border-radius: 2px 0 4px 0; } .manga-panel .panel-title { font-family: 'Bangers', cursive; font-size: clamp(1.35rem, 1rem + 1vw, 2rem); letter-spacing: 2px; color: var(--gold); margin: 1rem 0 0.75rem; } .manga-panel img { width: 100%; border-radius: 6px; display: block; margin-top: 0.5rem; box-shadow: 0 4px 24px rgba(0,0,0,0.5); } section, article { font-size: clamp(1.15rem, 1.05rem + 0.3vw, 1.4rem); line-height: 1.85; } /* ── Chapter headings ── */ .chapter-label { font-family: 'Bangers', cursive; font-size: 0.85rem; letter-spacing: 4px; color: var(--cyan); margin-bottom: 0.5rem; display: block; } h2 { font-family: 'Space Grotesk', sans-serif; font-size: clamp(1.6rem, 4vw, 2.2rem); font-weight: 700; margin-bottom: 1.2rem; line-height: 1.2; color: #fff; } h2 .highlight { color: var(--cyan); } h3 { font-family: 'Space Grotesk', sans-serif; font-size: 1.2rem; font-weight: 600; color: var(--purple); margin: 2rem 0 0.75rem; } /* ── Body text ── */ p { margin-bottom: 1.2rem; color: var(--text); } strong { color: #fff; } em { color: var(--cyan); font-style: italic; } /* ── Section divider ── */ .section { margin-top: 4.5rem; padding-top: 1rem; border-top: 1px solid var(--border); } /* ── Callout boxes ── */ .callout { margin: 2rem 0; padding: 1.25rem 1.5rem; border-radius: 6px; border-left: 4px solid; } .callout.purple { background: var(--purple-glow); border-color: var(--purple); } .callout.cyan { background: var(--cyan-glow); border-color: var(--cyan); } .callout.gold { background: var(--gold-glow); border-color: var(--gold); } .callout-title { font-family: 'Space Grotesk', sans-serif; font-weight: 700; font-size: 0.85rem; letter-spacing: 1px; text-transform: uppercase; margin-bottom: 0.5rem; } .callout.purple .callout-title { color: var(--purple); } .callout.cyan .callout-title { color: var(--cyan); } .callout.gold .callout-title { color: var(--gold); } .callout p { margin: 0; font-size: 0.95rem; } /* ── Inline diagram: RNN chain ── */ .diagram-box { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 1.5rem; margin: 2rem 0; overflow-x: auto; } .diagram-title { font-family: 'Space Grotesk', sans-serif; font-size: 0.78rem; font-weight: 600; letter-spacing: 2px; text-transform: uppercase; color: var(--muted); margin-bottom: 1rem; text-align: center; } /* RNN Chain */ .rnn-chain { display: flex; align-items: center; gap: 0; min-width: 600px; justify-content: center; } .rnn-node { width: 52px; height: 52px; border-radius: 50%; border: 2px solid #ef4444; background: rgba(239,68,68,0.1); display: flex; align-items: center; justify-content: center; font-family: 'Space Grotesk', sans-serif; font-size: 0.7rem; font-weight: 600; color: #ef4444; flex-shrink: 0; position: relative; } .rnn-arrow { width: 32px; height: 2px; background: linear-gradient(90deg, #ef4444, rgba(239,68,68,0.3)); position: relative; flex-shrink: 0; } .rnn-arrow::after { content: ''; position: absolute; right: 0; top: -4px; border-left: 8px solid rgba(239,68,68,0.5); border-top: 5px solid transparent; border-bottom: 5px solid transparent; } .rnn-fade { opacity: 0.35; } .rnn-labels { display: flex; justify-content: space-between; font-size: 0.7rem; color: var(--muted); margin-top: 0.6rem; font-family: 'JetBrains Mono', monospace; min-width: 600px; } /* Attention Matrix */ .attn-matrix { display: grid; grid-template-columns: auto repeat(6, 1fr); gap: 3px; font-size: 0.7rem; font-family: 'JetBrains Mono', monospace; } .attn-label { display: flex; align-items: center; justify-content: flex-end; padding-right: 8px; color: var(--muted); font-size: 0.68rem; } .attn-col-labels { display: grid; grid-template-columns: auto repeat(6, 1fr); gap: 3px; margin-bottom: 3px; } .attn-col-label { text-align: center; color: var(--muted); font-size: 0.65rem; font-family: 'JetBrains Mono', monospace; } .attn-cell { height: 36px; border-radius: 3px; display: flex; align-items: center; justify-content: center; font-size: 0.65rem; font-weight: 600; transition: transform 0.2s; cursor: default; } .attn-cell:hover { transform: scale(1.08); } /* Timeline */ .timeline { position: relative; padding-left: 2rem; margin: 2rem 0; } .timeline::before { content: ''; position: absolute; left: 7px; top: 0; bottom: 0; width: 2px; background: linear-gradient(180deg, var(--purple), var(--cyan), var(--gold)); } .timeline-item { position: relative; margin-bottom: 2rem; padding-left: 1.5rem; } .timeline-dot { position: absolute; left: -1.85rem; top: 0.35rem; width: 16px; height: 16px; border-radius: 50%; border: 2px solid; background: var(--bg); } .timeline-dot.purple { border-color: var(--purple); box-shadow: 0 0 10px var(--purple); } .timeline-dot.cyan { border-color: var(--cyan); box-shadow: 0 0 10px var(--cyan); } .timeline-dot.gold { border-color: var(--gold); box-shadow: 0 0 10px var(--gold); } .timeline-dot.green { border-color: var(--green); box-shadow: 0 0 10px var(--green); } .timeline-year { font-family: 'Bangers', cursive; font-size: 1.2rem; letter-spacing: 2px; color: var(--gold); margin-bottom: 0.25rem; } .timeline-milestone { font-family: 'Space Grotesk', sans-serif; font-weight: 700; font-size: 1rem; color: #fff; margin-bottom: 0.25rem; } .timeline-desc { font-size: 0.88rem; color: var(--muted); line-height: 1.5; } .timeline-badge { display: inline-block; font-size: 0.65rem; font-weight: 700; font-family: 'Space Grotesk', sans-serif; padding: 0.15rem 0.5rem; border-radius: 100px; margin-left: 0.5rem; vertical-align: middle; } .badge-origin { background: rgba(139,92,246,0.25); color: var(--purple); border: 1px solid var(--purple); } .badge-encoder { background: rgba(34,211,238,0.2); color: var(--cyan); border: 1px solid var(--cyan); } .badge-decoder { background: rgba(245,158,11,0.2); color: var(--gold); border: 1px solid var(--gold); } .badge-scale { background: rgba(16,185,129,0.2); color: var(--green); border: 1px solid var(--green); } /* Citation / quote block */ blockquote { margin: 2rem 0; padding: 1.5rem 1.5rem 1.5rem 2rem; background: var(--surface2); border-left: 4px solid var(--purple); border-radius: 0 6px 6px 0; font-style: italic; color: var(--muted); position: relative; } blockquote::before { content: '"'; font-family: 'Bangers', cursive; font-size: 5rem; color: var(--purple); opacity: 0.3; position: absolute; top: -1rem; left: 0.5rem; line-height: 1; } blockquote cite { display: block; margin-top: 0.75rem; font-size: 0.82rem; font-style: normal; color: var(--purple); font-family: 'Space Grotesk', sans-serif; font-weight: 600; } /* Code-like inline */ code { font-family: 'JetBrains Mono', monospace; font-size: 0.85em; background: rgba(139,92,246,0.15); color: var(--purple); padding: 0.15em 0.4em; border-radius: 3px; } /* Problem list */ .problem-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 1rem; margin: 1.5rem 0; } .problem-card { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 1.25rem; transition: border-color 0.2s; } .problem-card:hover { border-color: var(--red); } .problem-card .icon { font-size: 1.8rem; margin-bottom: 0.6rem; } .problem-card h4 { font-family: 'Space Grotesk', sans-serif; font-size: 0.95rem; font-weight: 700; color: var(--red); margin-bottom: 0.4rem; } .problem-card p { font-size: 0.88rem; color: var(--muted); margin: 0; } /* Breakthrough cards */ .breakthrough-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 1rem; margin: 1.5rem 0; } .breakthrough-card { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 1.25rem; transition: border-color 0.2s, box-shadow 0.2s; } .breakthrough-card:hover { border-color: var(--green); box-shadow: 0 0 20px rgba(16,185,129,0.1); } .breakthrough-card .icon { font-size: 1.8rem; margin-bottom: 0.6rem; } .breakthrough-card h4 { font-family: 'Space Grotesk', sans-serif; font-size: 0.95rem; font-weight: 700; color: var(--green); margin-bottom: 0.4rem; } .breakthrough-card p { font-size: 0.88rem; color: var(--muted); margin: 0; } /* Attention heads visual */ .heads-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 0.75rem; margin: 1.5rem 0; } .head-card { padding: 1rem; border-radius: 6px; text-align: center; border: 1px solid; } .head-card .head-num { font-family: 'Bangers', cursive; font-size: 1.6rem; letter-spacing: 1px; } .head-card .head-desc { font-size: 0.72rem; font-family: 'Space Grotesk', sans-serif; margin-top: 0.25rem; } .head-1 { border-color: #8b5cf6; background: rgba(139,92,246,0.1); } .head-1 .head-num { color: #8b5cf6; } .head-1 .head-desc { color: #8b5cf6; } .head-2 { border-color: #22d3ee; background: rgba(34,211,238,0.1); } .head-2 .head-num { color: #22d3ee; } .head-2 .head-desc { color: #22d3ee; } .head-3 { border-color: #f59e0b; background: rgba(245,158,11,0.1); } .head-3 .head-num { color: #f59e0b; } .head-3 .head-desc { color: #f59e0b; } .head-4 { border-color: #10b981; background: rgba(16,185,129,0.1); } .head-4 .head-num { color: #10b981; } .head-4 .head-desc { color: #10b981; } /* Big stat */ .big-stat { text-align: center; padding: 2.5rem 1rem; background: var(--surface); border-radius: 12px; margin: 2rem 0; } .big-stat .number { font-family: 'Bangers', cursive; font-size: 4rem; letter-spacing: 3px; background: linear-gradient(135deg, var(--gold), var(--purple)); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; } .big-stat .label { font-family: 'Space Grotesk', sans-serif; font-size: 0.9rem; color: var(--muted); margin-top: 0.5rem; } /* Footer */ /* Responsive */ @media (max-width: 600px) { main { width: 100%; } .site-nav-inner { gap: 1rem; } .site-brand span:last-child { display: none; } .site-nav .nav-links { gap: 0.9rem; overflow-x: auto; } .heads-grid { grid-template-columns: repeat(2, 1fr); } .hero h1 { font-size: 2.5rem; } } /* Scroll fade-in */ .fade-in { opacity: 0; transform: translateY(24px); transition: opacity 0.6s ease, transform 0.6s ease; } .fade-in.visible { opacity: 1; transform: translateY(0); } /* Inlined into Hugo: ensure content is visible even without the IntersectionObserver script */ .attn-article .fade-in { opacity: 1; transform: none; }

Deep Dive · Artificial Intelligence · 2017

The seven-word title that ended one era of AI and launched another. A beginner-friendly, technically honest tour through the paper that birthed every LLM you've ever heard of.

✍️ Ajay Walia📅 May 2026⏱ ~15 min read🧠 Beginner → Intermediate

Panel 1 — THE SCROLL APPEARS

CHAPTER 01

TheDark Ages of Language AI

To understand why "Attention Is All You Need" was a thunderclap, you first need to appreciate how painful life was before it. Cast your mind back to 2016. AI researchers around the world were working incredibly hard on language problems — translation, summarisation, question answering — but they were doing so with a fundamental handicap baked into their tools.

The dominant models at the time wereRecurrent Neural Networks (RNNs) and their smarter cousin, theLong Short-Term Memory network (LSTM). Both were designed to handle sequences: text goes in word by word, the model builds up a hidden memory state as it reads, and produces an output at the end.

The intuition seems sensible. After all,we read left to right (in English). Why shouldn't a machine? The problem, as we'll see, was catastrophic at scale.

How an RNN Actually Works

Imagine you're a gold fish with a tiny little notepad. Every time you read a new word, you scribble something on your notepad, thenerase half of it to make room for the next word. By the time you reach the end of a 500-word paragraph, your notepad is a smeared mess of partial impressions. That's an RNN.

More precisely: an RNN processes tokensone at a time, left to right. At each step, it combines the current word's embedding with ahidden state vector (its "memory" from all previous words) and produces a new hidden state. That hidden state is passed forward to the next step.

RNN: Sequential Processing Chain

The

cat

sat

the

???

Step 1Step 2Step 3Step 4Step 5 (fading)Long range: ☠️

Information degrades as it passes along the chain. Early words become "forgotten."

LSTMs: A Better Notepad, Same Problem

LSTMs (invented by Hochreiter & Schmidhuber in 1997) were the RNN's upgrade. Instead of one hidden state, they have three "gates" — input, forget, and output — plus a separate "cell state" that acts as a longer-term memory. They were genuinely better at remembering things across longer sequences.

But LSTMs didn't solve the core architectural problem. They still processed one token at a time, sequentially. And at massive scale, that was the killer.

Panel 2 — THE MEMORY BALL

CHAPTER 02

Three Problems ThatCrippled the Old Models

Before we get to the solution, let's be precise about the pain. The pre-Transformer era had three interconnected crises, and solvingany one of them would have been significant. The Transformer paper solved all three simultaneously.

⛓️

Problem 1: Sequential Bottleneck

RNNs and LSTMs process tokens one at a time. Step 2 cannot begin until Step 1 finishes. This means youcannot parallelize training across GPU cores. Training was agonisingly slow — weeks or months for large models.

🌫️

Problem 2: Vanishing Gradients

When you train a neural network with backpropagation, you compute gradients (error signals) and push them backwards through the chain. In a long sequence, those gradients shrink exponentially as they travel. Early tokens barely learn anything.

📏

Problem 3: Long-Range Amnesia

In the sentence"The trophy didn't fit in the suitcase becauseit was too big" — what does "it" refer to? The trophy. A human knows instantly. An RNN processing hundreds of words between "trophy" and "it" often forgot the connection entirely.

The Telephone Game at Scale

The vanishing gradient problem is best understood through the "telephone game" (Chinese Whispers). You whisper a sentence to the first person in a chain. By the time it reaches the 20th person, the message is garbled beyond recognition. In an RNN, the gradient signal is that whisper — and long sequences were destroying it.

LSTMs reduced the garbling with their gating mechanisms, but didn't eliminate it. And crucially, every single token in the sequence still had to wait for the one before it to finish processing. At a time when researchers were starting to dream about training on billions of words, this was a scaling cliff.

Panel 3 — THE VANISHING WHISPER

CHAPTER 03

A Spark Before the Fire —Bahdanau Attention (2014)

To be accurate about the history: the 2017 paper didn't invent attention from scratch. In 2014, Dzmitry Bahdanau and colleagues published a paper that added an "attention mechanism" on top of existing encoder-decoder RNNs for machine translation.

The idea was elegant: when generating each output word, instead of squishing the entire input sentence into one fixed-size vector, the model learns to "look back" at different parts of the input and assign weights — attention scores — to each input word. Generate "Hund" in German? Pay more attention to "dog" in the English source.

🏅 The 2014 Precursor

Bahdanau et al. (2014) showed attention worked. But they bolted iton top of RNNs — the sequential backbone was still there, just with a better look-back mechanism. It was like putting a turbocharged engine in a horse-drawn carriage.

The 2017 breakthrough came when Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin at Google Brain asked a radical question:what if we got rid of the carriage altogether?

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."— Vaswani et al., Attention Is All You Need (2017)

CHAPTER 04

June 2017 —The Paper Drops

Eight Google Brain researchers quietly uploaded a pre-print to arXiv on June 12, 2017. The title was almost cheeky — "Attention Is All You Need" — a pun on The Beatles' "All You Need Is Love" and a direct challenge to the field:attention mechanisms alone are sufficient. No recurrence. No convolutions. Just attention.

The abstract was direct. They proposed the Transformer architecture, showed it achieved state-of-the-art on English-to-German translation (28.4 BLEU, surpassing the previous best by more than 2 BLEU points), trained it in a fraction of the time, and made a claim that would prove prophetic: this architecture was far more parallelisable and required significantly less time to train.

At the time, the machine learning community took notice but didn't immediately grasp the full magnitude. It looked like a better translation model. What it actually was, in retrospect: the foundation of every major AI system built in the next decade.

200,000+

Citations as of 2025 — one of the most cited papers in all of computer science history

Panel 4 — THE GOOGLE BRAIN MOMENT

CHAPTER 05

Self-Attention —Every Word Watches Every Word

This is the heart of it. Everything else in the Transformer paper is (brilliant) supporting machinery. Self-attention is the engine.

Here's the core idea in plain English:when processing any word, the model looks at every other word in the sentence simultaneously and calculates a relevance score. Instead of passing information along a chain one step at a time, every token "talks to" every other token in parallel.

The Library Analogy

Imagine a library where every book can send a little messenger to every other book, asking: "Hey, are you relevant to me?" Each pair of books gives an answer — a number from 0 to 1. The books with higher scores get to "share" more of their information when the library compiles its final report.

In the sentence"The cat sat on the mat because it was tired" — when the model processes "it", the self-attention mechanism computes a score between "it" and every other word. The word "cat" gets a very high score (because "it" refers to the cat), while words like "the" and "on" get low scores. This is done in one parallel operation — no sequential chain required.

The Math Behind It (Don't Panic)

The paper formalises this with three vectors derived from each word's embedding: aQuery (Q), aKey (K), and aValue (V). Think of it like a search engine:

🔍 Q / K / V Intuition

Query (Q): "What am I looking for?" — what the current word wants to know.
Key (K): "What do I offer?" — what each other word has to advertise.
Value (V): "Here's my actual content" — what each word shares if chosen.

Attention score =softmax( Q · Kᵀ / √d_k ) · V

The division by √d_k (square root of the dimension) is a stabilising trick — without it, the dot products can get very large and the softmax function becomes extremely "peaky" (everything goes to one word), which hurts training. The softmax then converts raw scores into a probability distribution — so all the weights add up to 1.0.

Self-Attention Weight Matrix — "it" attending to other words

The

cat

sat

mat

"it" →

0.05

0.72

0.04

0.03

0.10

0.06

"it" pays most attention to "cat" (0.72 weight) — this is how the model resolves co-reference. Higher = brighter.

The critical breakthrough isn't justthat these scores are computed — it'show:all pairs are computed in parallel using matrix multiplication. A sentence of 512 tokens doesn't require 512 sequential steps. It requires one big matrix operation that modern GPUs execute extremely fast. This is the parallelisation breakthrough that made scaling possible.

Panel 5 — WORDS LOOKING AT WORDS

CHAPTER 06

Multi-Head Attention —Many Perspectives at Once

Here's where the paper goes from clever to ingenious. A single self-attention computation gives you one view of how words relate. But language is rich — words relate to each other in manydifferent ways simultaneously.

Consider the sentence"She gave him the book she wrote":

— "she" and "him" have agrammatical subject/object relationship
— "she" (first occurrence) and "she" (second) have aco-reference relationship
— "book" and "wrote" have asemantic relationship (you write books)
— "gave" and "book" have averb-object relationship

One attention head would have to pick one of these.Multi-head attention runs several attention computations in parallel, each in a different "subspace" of the representation. The results are then concatenated and projected back to the original dimension.

Head 1

Grammatical roles (subject, object, verb)

Head 2

Co-reference resolution ("she" = "she")

Head 3

Semantic relatedness (book ↔ wrote)

Head 4

Syntactic dependencies (verb-object)

The original Transformer used8 attention heads. Modern LLMs like GPT-3 use 96, and models like Claude use even more. Each head develops its own specialisation during training — not by design, but emergently, because the model learns that different heads can capture different useful patterns.

🎭 The Right Analogy

Multi-head attention is like having a team of editors review your essay simultaneously. One editor focuses on grammar, one on logical flow, one on vocabulary, one on argument structure. You get all their feedback at once, then synthesise it. No editor has to wait for the previous one to finish.

Panel 6 — THE EIGHT-EYED TEAM

CHAPTER 07

Positional Encoding —Teaching Order Without Recurrence

Here's a subtle but critical problem. In an RNN, word order is implicit — you literally process word 1, then word 2, then word 3. The order is baked into the architecture. But in a Transformer, all words are processed in parallel. If you showed it "Dog bites man" and "Man bites dog" simultaneously, the attention mechanism alone would see the same set of words and might produce the same result.

That's obviously catastrophic for language. "The bank by the river" and "river by the bank the" mean very different things.

The solution:Positional Encoding. Before feeding word embeddings into the Transformer, you add a unique positional signal to each one. The paper uses a clever combination of sine and cosine functions at different frequencies:

📐 The Formula

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Wherepos is the word's position andi is the dimension. The result: each position gets a unique, smooth vector that the model can learn to interpret. The sine/cosine waves at different frequencies are like a musical chord unique to each seat in the stadium.

Why sinusoids and not just the number 1, 2, 3...? Because sinusoids generalise. They allow the model to learnrelative positions (word 2 is one step after word 1) not just absolute ones. And they handle sequences longer than those seen in training gracefully, because the wave patterns extend naturally.

Modern variants like RoPE (Rotary Position Embedding, used in Llama and GPT-NeoX) and ALiBi have since improved on the original scheme — but they're all descendants of this 2017 insight.

Panel 7 — THE NUMBERED LINEUP

CHAPTER 08

The Full Transformer —Encoder + Decoder

The original paper was designed for sequence-to-sequence tasks — specifically machine translation. The architecture has two halves that work together: anEncoder that reads the input (the English sentence) and aDecoder that generates the output (the German translation).

The Encoder

The encoder stack (6 identical layers in the original paper) processes the entire input sentence in parallel. Each layer has two sub-components:(1) multi-head self-attention (all words attend to all other words) and(2) a feed-forward neural network applied to each position independently. Both sub-components useresidual connections (the input is added back to the output) andlayer normalisation — both stability tricks borrowed from computer vision.

The Decoder

The decoder is similar but has three sub-components per layer. The first ismasked self-attention — like encoder self-attention, but masked so that when generating word N, the model can only attend to words 1 through N-1 (it can't cheat by looking at future words). The second iscross-attention — the decoder attends to the encoder's output, connecting the input sentence to the generation process. The third is the same feed-forward network as in the encoder.

🧬 The Encoder-Decoder Legacy

BERT (2018) usesonly the encoder — great for understanding tasks (classification, named entity recognition). GPT-1/2/3/4 useonly the decoder — great for generation tasks (writing, code, conversation). The full encoder-decoder design lives on in models like T5 and BART, used heavily for translation and summarisation.

One more key ingredient:Feed-Forward layers. After each attention block, every position's representation passes through a small, identical 2-layer neural network. In the original paper, the inner dimension of this network was 2048 — 4× the model's embedding dimension of 512. In GPT-3, it's 4× 12,288 = 49,152. These layers are believed to act as "fact storage" — where knowledge learned during training gets encoded.

Panel 8 — THE TRANSFORMER MECHA

CHAPTER 09

The ImpactTimeline — 2017 to Now

The Transformer paper wasn't just a research curiosity. It was a platform. Within a year, the entire field had pivoted. Within five years, it had generated a trillion-dollar industry. Here's the direct lineage:

2017

"Attention Is All You Need"ORIGIN

Vaswani et al. (Google Brain). The original Transformer architecture. State-of-the-art on WMT English→German translation. 65M parameters. Training: 3.5 days on 8 P100 GPUs.

2018

BERT — GoogleENCODER-ONLY

Bidirectional Encoder Representations from Transformers. 340M parameters. Pre-trained on masked language modelling — predict randomly hidden words. Demolished 11 NLP benchmarks on release. The model that proved "pre-train, then fine-tune" as the dominant paradigm.

2018

GPT-1 — OpenAIDECODER-ONLY

Generative Pre-trained Transformer. 117M parameters. The first proof that a decoder-only Transformer, trained on unsupervised language modelling, could be fine-tuned for diverse tasks. OpenAI's foundational bet on the decoder path.

2019–2020

GPT-2 → GPT-3SCALE

GPT-2 (1.5B params) was so good at generating text that OpenAI staged a "staged release" over safety concerns. GPT-3 (175B params) — the first model to demonstrate serious few-shot learning. You could give it 3 examples and it could do a new task without any fine-tuning. The scaling laws paper (Kaplan et al.) proved bigger = better predictably.

2022

ChatGPT — RLHF Changes Everything

GPT-3.5 fine-tuned with Reinforcement Learning from Human Feedback (RLHF). The public suddenly had a conversational interface to an LLM. 1 million users in 5 days. 100M in 2 months. Fastest product growth in history. Every major tech company scrambled.

2023–2026

Claude · Grok · Gemini · Llama · Mistral · GPT-4/4o

All Transformer-based. All descendants of the 2017 paper. Claude (Anthropic) adds Constitutional AI. Llama (Meta) brings open-source to the frontier. Gemini (Google) goes multimodal. Grok (xAI) takes on real-time search. The Cambrian explosion of LLMs — every one of them tracing its lineage back to June 12, 2017.

Panel 9 — THE ROCKET OF PROGRESS

CHAPTER 10

Is It The Most Important AI PaperEver Written?

Let's be honest about this. The question is fascinating precisely because it's not entirely settled — there are serious candidates.

The Case For: Yes, Unambiguously

No single paper has had a more direct and immediate commercial and scientific impact in the modern AI era. Every frontier LLM in existence today is a Transformer. The trillion-dollar AI industry of the mid-2020s is built on this foundation. With 200,000+ citations, it's a runaway leader in citation counts for an ML paper. The research it unlocked — in text, image (ViT), audio (Whisper), protein structure (AlphaFold 2), video (Sora), code (Copilot) — spans essentially every domain of AI.

The Case For Other Contenders

🏆 Other Papers That Matter

Backpropagation (Rumelhart et al., 1986) — Without the ability to train neural networks at all, there's nothing to build on.

ImageNet + AlexNet (Krizhevsky et al., 2012) — The moment deep learning proved itself to the world, launching the modern deep learning era.

Word2Vec (Mikolov et al., 2013) — Showed that word embeddings encode semantic meaning; a prerequisite for Transformer input representations.

Scaling Laws (Kaplan et al., 2020) — Proved that LLM capabilities grow predictably with compute and data, enabling the investment thesis behind GPT-3 and everything after.

RLHF (Christiano et al., 2017) — The alignment technique that turned raw LLMs into assistants humans actually want to use.

The honest verdict:In the specific context of modern generative AI — LLMs, multimodal models, and the AI products billions of people use daily — "Attention Is All You Need" is the clearest single point of origin. Without backprop it couldn't exist, but without this paper, it wouldn't have become what it is. It's the right answer to the question "which paper made today's AI possible?"

"We are all standing on the shoulders of eight people who asked: what if recurrence isn't actually necessary?"— A reasonable paraphrase of the entire modern AI research community

Panel 10 — THE AI FAMILY PORTRAIT

CHAPTER 11

Where Do We Go FromHere?

The Transformer is dominant but not invincible. Researchers are actively working on what comes next — and several serious challengers are emerging.

The Current Limitations

Self-attention has a quadratic complexity problem. If your sequence has N tokens, the attention matrix is N × N. Double the sequence length, quadruple the compute. For long documents — books, codebases, hours of audio — this becomes brutally expensive. The context window you experience in Claude or GPT-4 represents enormous engineering effort to extend what was originally a very limited range.

What's Being Explored

Mamba / State Space Models (SSMs) process sequences in linear time, not quadratic — a genuine architectural alternative that some researchers believe could eventually rival or exceed Transformers for long-context tasks.Flash Attention (Dao et al., 2022) is an algorithmic optimisation that makes standard attention dramatically more memory-efficient without changing the math.Mixture of Experts (MoE) architectures — used in GPT-4 and Gemini — activate only a subset of parameters per token, allowing models with trillions of total parameters to run at the cost of a much smaller model.

Multimodality is the frontier. The Transformer's attention mechanism generalises naturally to images (patch tokens), audio (spectrogram tokens), video (frame tokens), and structured data. A single Transformer can in principle process all of these simultaneously — and models like GPT-4o and Gemini Ultra are moving rapidly in this direction.

The question researchers are now asking: is intelligence primarily a function of architecture, or of scale and data? The scaling laws suggest it's mostly the latter. If that's true, the Transformer need not be dethroned — it just needs to be fed more.

🔭 The Bigger Picture

We are, by most accounts, somewhere in the middle of the most important technological transition in human history. Every model at the frontier — the ones writing code, passing medical exams, generating video — shares a common ancestor. Eight researchers. One arXiv pre-print. Twelve hundred lines of Python. June 12, 2017.

Panel 11 — THE NEURAL SKY

CONCLUSION

The Seven Words ThatChanged Everything

We started this piece in 2016, watching RNNs and LSTMs struggle through their sequential chains, watching gradient signals vanish like whispers in a long corridor. We watched researchers work incredibly hard to coax these architectures to handle longer contexts, more complex language, bigger training sets — and hit wall after wall.

Then eight people asked a simple question —what if you just... paid attention? — and rewired the entire field.

The Transformer is not magic. It's mathematics: query-key-value lookups, scaled dot-product attention, layer normalisation, residual connections, feed-forward networks. Every piece is graspable. The genius was in the combination — and in the willingness to abandon the assumption that sequences must be processed sequentially.

The models you interact with today — the ones that draft your emails, explain your code, answer your questions about transformer architecture with exhaustive detail — are all running on this foundation. Claude (the AI that helped outline this post) is a Transformer. GPT-4 is a Transformer. Grok is a Transformer. Gemini is a Transformer. They are, in the deepest technical sense, all direct descendants of that arXiv upload.

Understanding "Attention Is All You Need" is not just historical curiosity. It's the grammar of modern AI. Once you understand it, you have a lens through which almost everything in the field makes sense — the scaling laws, the context window debates, the encoder vs. decoder architecture choices, the multimodal experiments, the efficiency research.

The paper is free. The arXiv link still works. It's 15 pages and reads more clearly than most ML papers. If this post piqued your curiosity: go read it. You'll understand it now.

2017 → ∞

The year a single paper changed the trajectory of intelligence itself

]]>

Building with LLMs in 2026: The Framework Atlas

Ajay Walia — Fri, 08 May 2026 00:00:00 +0000

Four years after LLMs entered the mainstream, the single most common mistake I see architects make is spending most of their decision energy on the model. Which frontier model? GPT or Claude or Gemini? The model choice matters — but it is one decision out of roughly fifteen, and it is far from the most consequential one.

Building an LLM-powered system in 2026 is an architecture decision made across a stack of competing frameworks, each solving a well-posed problem at a specific layer. I spent several months mapping that landscape as a practitioner — the result is a 73-page whitepaper I call the Framework Atlas. This post distils it into the five things I think every architect, engineer, and senior IT leader should know before picking a single tool.

The Stack Has a Shape

There is no single AI stack, but there is a canonical shape. Every non-trivial production LLM system — whether a support chatbot, a document search engine, or a multi-agent workflow — is a composition of six layers:

Application layer. The surface your user interacts with. LangChain is the default; Semantic Kernel is the Microsoft-native choice; CrewAI leads when the app itself is agentic.

Agent layer. When a single LLM call is not enough — when the system needs to plan, call tools, or coordinate among multiple agents — this layer provides the loop. LangGraph is the most production-grounded option in 2026.

Data / retrieval layer. The memory of your system. LlamaIndex leads on orchestration; Weaviate, Pinecone, and Chroma compete at storage, each tuned for a different operational profile.

Model layer. The foundation models themselves. This layer is increasingly commoditised. The most important design decision here is not which model you start with — it is whether you can swap it without rewriting the layers above.

Serving / inference layer. How you turn a model into an endpoint. vLLM dominates throughput-bound workloads; BentoML packages models into clean APIs for teams that want to think about models, not infrastructure.

Infrastructure layer. Kubernetes, Docker, cloud, on-prem. Every framework choice depends on where you can actually deploy.

Wrapped around all six layers are three concerns that have become non-negotiable since 2024:observability and evaluation,fine-tuning and training, andguardrails and safety. If your system design has no answer for any of these three, it is under-designed. Ignoring them does not eliminate the risk — it just defers it until something breaks in production.

The Abstraction Trap

Every framework in the atlas is catalogued against eleven attributes, but the one architects under-weight most consistently isabstraction level — how much code you write versus how much the framework decides for you.

LangChain’s high abstraction makes the first demo fast and the tenth production fix slow, because you are debugging through someone else’s default decisions. FAISS’s low abstraction costs more lines but yields fewer surprises at 3am.

The operational signal:match abstraction to team seniority. Junior teams over-value high abstraction; senior teams over-value low. A mixed team benefits from a medium default — and from making the choice explicitly rather than defaulting to whatever has the best GitHub star count.

Decision Heuristics That Actually Hold

Rather than optimising at each layer independently, the atlas maps common requirements to preferred framework combinations. These are the ones I have found most durable in practice:

Requirement	Starting stack
Fast LLM prototype	LangChain + Chroma + OpenAI API
Enterprise-grade RAG	LlamaIndex + Weaviate + LangSmith
Multi-agent workflow	LangGraph (+ AutoGen for agent conversations)
High-throughput inference	vLLM + Ray Serve
Local / offline / on-prem AI	Ollama + FAISS + LangChain-local
Domain-specialised model	Axolotl (QLoRA) + vLLM + MLflow

Two things stand out from this table. First, LangChain and LlamaIndex are not competitors — they compose cleanly, with LangChain at the application layer and LlamaIndex at the retrieval layer. Second, local inference is no longer an edge case. Ollama plus a Llama-3-class model is a realistic production option for regulated industries where data sovereignty is a hard constraint.

Agents Moved to Production — With Guardrails

In 2023, autonomous agents were mostly demos. By 2026, they are in targeted production use: triage, routing, research synthesis. What changed is not the models — it is the frameworks.

LangGraph’s state-machine model gives agents deterministic control flow: you declare states, transitions, and retry policies explicitly. AutoGen models multi-agent systems as conversations, which makes it remarkably expressive for critique-revise loops and planner-executor separations. The practitioner heuristic:for production agents, LangGraph. For multi-agent conversations, AutoGen. For lightweight document workflows, CrewAI.

The critical note:never deploy autonomous agents in production unless the failure cost is bounded. The agent should draft; a human should approve. The pattern that ships is almost always a hybrid — autonomy where the stakes are low, escalation where they are not.

Guardrails have crossed from afterthought to critical infrastructure in the same period. Prompt injection is the new SQL injection. Every production system needs an input guard, an output guard, and a policy layer between them. The minimum viable defense in 2026 is: input guard → LLM → output guard. Anything less is operating without a seat belt.

The 2026 Outlook: Three Trends Worth Designing For

Agents are becoming the compile target. LangGraph, AutoGen, and CrewAI are converging on a common abstraction — a loop over an LLM with tool use and state. Expect a future that looks like the deep learning layer in 2018: multiple frontend frameworks, one common runtime. Design your agent layer to be swappable.

Retrieval is eating search. Elasticsearch, Postgres, and OpenSearch all ship vector indexes now; Weaviate and Pinecone ship BM25. The primitives have converged. The differentiator is no longer features — it is operational maturity and the team’s ability to run the infrastructure. Hybrid retrieval (vector + keyword) is the production-safe default.

Guardrails are becoming infrastructure. Today they are a library you bolt on. In two years they will be a runtime — prompt injection detection, PII scrubbing, and policy enforcement applied by default to every model invocation, the way CORS and auth middleware is applied to every HTTP request today. Get ahead of this by treating your guardrails layer as critical infrastructure now, not as a compliance checkbox later.

A Practitioner’s Closing Note

Frameworks age faster than architectures. The stack shape you design today — application, agent, retrieval, model, serving, infrastructure — will still be valid in three years. The individual framework boxes you fill it with probably will not be. The single most important design invariant isswappability at each layer. Make the layer interfaces clean, keep the framework-specific code thin, and you will be able to move when the landscape shifts — and it will.

The full Framework Atlas (v4.0, April 2026) covers all ten framework categories in detail, including comparison tables, maturity radars, cost and latency envelopes, and four reference architectures with working code. It is available below.

Download the Framework Atlas — Building with LLMs v4.0 (PDF)

]]>

I Built, My Own Screenshot App for macOS (No More Clunky Screenshots)

Ajay Walia — Fri, 08 May 2026 00:00:00 +0000

It was a Tuesday evening. I had just taken my fourteenth screenshot of the day — a mix of Cmd+Shift+4, accidental desktop saves, files namedScreenshot 2026-05-06 at 11.43.22 PM.png scattered across my Downloads folder like confetti after a party nobody enjoyed.

I opened Finder. Forty-seven PNGs. Forty-seven. Some blurry. Some with my other windows bleeding in at the edges. Some cropped wrong because I’d sneezed mid-drag.

There had to be a better way.

The Ghost of Greenshot Past

Back in my Windows days — and I spent alot of years in Windows, running datacentres, managing Wintel estates, building VDI platforms — I hadGreenshot. If you’ve never used it, Greenshot is a free, lightweight screenshot tool that lives in your system tray. PressPrtSc, draw a box, done. Instant annotation. Instant clipboard. Instant sanity.

When I moved to macOS full-time, I expected something better. Apple builds beautiful hardware. Their software is generally excellent. And their screenshot workflow is… fine. Technically fine. But when you’re taking screenshots all day for documentation, Slack messages, architecture diagrams, and blog posts — “fine” isn’t good enough.

The native flow is:Cmd+Shift+4, drag imprecisely, file saves somewhere, you hunt for it, you paste it, you realise the crop was off, you do it again.

I Googled “Greenshot for Mac” approximately eleven times over the past year. The answers: CleanShot X (paid, subscription), Shottr (good but someone else’s decisions), or just “get used to it.”

I couldn’t get used to it. So I did what any reasonable enterprise architect does at 10pm on a Tuesday.

I decided to build it myself.

Fine. I’ll Build It.

The plan was simple: a lightweight macOS menu bar app. Lives in the status bar. One click or a hotkey and you’ve got a clean area capture, a window capture, or a full-screen grab. Auto-saves with a sensible filename. Copies to clipboard. Makes a satisfying shutter sound. Stays out of your way.

The twist: I’d build it the same way I built this blog — with Claude as my pair programmer and VS Code as the editor. I’m an architect and technologist, not a Swift developer. But that was the whole point.

I opened a new conversation, described what I wanted in plain English, and we got to work.

What AJShot Is

AJShot is a native macOS menu bar app — background-only, no Dock icon, no bloat. It sits quietly in your menu bar with a smallAJ badge. Left-click opens your last screenshot. Right-click opens the action menu.

Memory footprint: around50 MB at idle. Compare that to some Electron-based tools that idle at 300 MB just saying hello.

The right-click menu looks like this:

📷 Capture Area ⌘⇧3
🖥 Capture Window ⌘⇧4
🖥 Capture Fullscreen ⌘⇧5
─────────────────────────────
📂 Open Last Screenshot
📁 Open Screenshots Folder
─────────────────────────────
⚙️ Preferences
ℹ️ About
✕ Quit

The keyboard shortcuts are global — they work even when AJShot is in the background. You’re deep in a Zoom call, you need a quick capture, you hitCmd+Shift+3, draw the box, done. The screenshot is already in your designated folder and on your clipboard before you’ve even let go of the mouse.

Core Features

Capture Modes

Area Capture (⌘⇧3) — the workhorse. Full-screen overlay appears, cursor becomes a crosshair, you draw a box. Precise. Consistent. No more dragging the wrong direction and getting a sliver of your taskbar.

Window Capture (⌘⇧4) — click a window, capture just that window. No background bleed, no accidental desktop icons photobombing your documentation.

Fullscreen Capture (⌘⇧5) — captures everything. Supports multi-display setups with a “capture all” or “ask each time” behaviour you configure once.

Preferences That Actually Make Sense

Open Preferences and you get a clean panel with the options you’d actually want to configure:

Setting	Default	What it does
Launch at login	On	AJShot is always ready, even after restart
Play capture sound	On	Satisfying shutter click confirms the capture
Show thumbnail preview	On	Floating preview appears — click to open in editor
Auto-copy to clipboard	On	Screenshot lands in your clipboard automatically
Save folder	`~/Pictures/AJShot`	Your own designated screenshot home
File format	PNG	Or JPG with a quality slider
Filename template	`AJShot_{date}_{time}`	Consistent, sortable, sane filenames

The filename templating engine handles the sanitisation itself — no illegal characters, no trailing dots, no 240-character filenames. Files come out likeAJShot_2026-05-09_22-14-37.png. You can find them instantly. You can sort them. Your Downloads folder stays clean.

Post-Capture Flow

After every capture, AJShot runs through a quick sequence:

Auto-saves to your configured folder with the filename template applied
Copies to clipboard (if enabled) — so you can paste immediately
Plays the shutter sound — audio confirmation that the capture happened
Shows a floating thumbnail in the corner — click it to open in the editor

You can also configure the post-capture action to always ask, always edit, always save, or always copy. Once you’ve decided how you work, it remembers.

The Real Problem It Solves

Here’s the honest comparison:

	Native macOS	AJShot
Start capture	`Cmd+Shift+4`, wait for toolbar	`Cmd+Shift+3`, instant overlay
File location	Desktop or last used folder	Always`~/Pictures/AJShot`
Filename	`Screenshot 2026-05-09 at 10.14.22 PM.png`	`AJShot_2026-05-09_22-14-22.png`
Clipboard copy	Manual extra step	Automatic
Editor	Preview.app detour	Built-in (annotation tools in progress)
Memory use	System process	~50 MB dedicated
Works in background	Partial	Full global hotkeys
Shutter sound	Yes	Yes (configurable)

The editor is where things get even more interesting. The annotation scaffolding is already built — the tools are there in the codebase: arrows, blur, text, highlights, shapes, step-number callouts. They’re not wired to the UI yet, but they’re coming. That’s the next milestone.

Tech Stack — What’s Actually Under the Hood

This is anative macOS app built entirely inSwift 5.9+, targetingmacOS 12 (Monterey) and later. The UI is a hybrid ofSwiftUI andAppKit — SwiftUI for the Preferences panel and modern views, AppKit where you need fine-grained control over system behaviour (status bar items, window management, capture overlays).

Dependencies (both bySindre Sorhus, the prolific open-source developer behind half the macOS indie tool ecosystem):

KeyboardShortcuts — global hotkey registration that plays nicely with macOS security
Defaults — a type-safe, SwiftUI-friendly wrapper aroundUserDefaults

Capture pipeline: usesCGPreflightScreenCaptureAccess andCGRequestScreenCaptureAccess from CoreGraphics for the permission preflight, thenScreenCaptureKit for the actual pixel capture. The permission flow has a proper retry loop — if you’ve granted permission but haven’t restarted the app, it tells you exactly that and offers a one-click restart.

Build and distribution: Swift Package Manager for dependencies, Xcode 15 for the build, and abuild-dmg.sh script that produces a signedAJShot-1.0.0.dmg for distribution. The DMG is already built. You can drag it to Applications and run it today.

Security considerations (because I work in security and these things matter):

Screenshot folder is set to0700 permissions — only you can read it
Individual screenshot files are0600 — same
Filename template engine strips illegal characters and control characters at the source
Code signing and notarization stubs are already in the README for when distribution goes wide

The architecture is clean:App →AppDelegate →StatusBarController →CaptureManager →ScreenshotStorage. Each module does one thing. TheFilenameTemplateEngine is a pure static function. TheThumbnailPresenter is decoupled from capture. Claude helped me keep it disciplined.

The Honest Struggles

No developer story is complete without the bit where things go sideways.

Screen Recording permissions nearly broke me on the first build. macOS 12+ requires explicit Screen Recording permission in System Settings, and you can’t force-trigger the dialog more than once per session. I had to build the multi-stage fallback: first launch triggers Apple’s native dialog, subsequent denials open System Settings directly, and if you’ve granted permission but not restarted, it detects that state and offers a restart button. It took longer to get that right than the actual capture code.

Swift Package Manager vs Xcode had a brief disagreement about the resource bundle forshutter.aiff. The sound file lives inAJShot/Sounds/ and has to be declared as a.process("Sounds") resource inPackage.swift. Simple when you know it. Less simple at 11pm when it just silently fails to play.

TheLaunchAtLogin manager required writing aLaunchAgent plist to~/Library/LaunchAgents/. Straightforward on paper. In practice, macOS is protective of that folder in ways that aren’t well-documented, and the error messages when something goes wrong are the kind that send you to Stack Overflow threads from 2019.

That’s the thing about building a native macOS app — the platform is powerful and the APIs are solid, but the surface area of “things that can quietly not work” is larger than you’d expect. Claude helped me navigate most of it. The frustration was real and so was the progress.

What’s Next

The annotation editor is the next major milestone — the scaffolding is already there. Arrow tool, blur tool, text tool, highlight, shapes, step-number callouts. When that’s wired up, AJShot becomes a complete Greenshot replacement for macOS: capture, annotate, save, share.

After that: GitHub release with a public DMG, code signing, notarization, and probably a product page here on CuriousBit.

If you’re a macOS user who’s been putting up with the native screenshot workflow out of habit — you don’t have to. And if you’re a developer (or aspiring one) who thinks you can’t build a native app without being a full-time Swift engineer — you can. You really can.

The GitHub repo is coming soon — I’ll post the link here when it’s public. In the meantime, if you’re curious about the architecture, the Swift source, or want to follow along as the editor gets built, keep an eye on this blog.

And I’m genuinely curious:what screenshot tool do you use on macOS? Are you a CleanShot loyalist? A Shottr person? Still using the native tools and somehow thriving? Drop a note — I’d love to know.

Because the best tools are the ones built out of genuine frustration with the alternative. And I wasvery genuinely frustrated.

]]>

I Built This Blog Without Writing a Single Line of Code (Almost)

Ajay Walia — Thu, 07 May 2026 00:00:00 +0000

It started like most late-night ideas in Sydney — a MacBook, too much coffee, and a thought that refused to go away.

I wanted a tech blog. A proper one — clean, fast, with a Knowledge Base, Videos section, and a place to put thoughts on AI and enterprise tech. The problem? Zero design skills. Zero frontend experience. I know infrastructure, architecture, platforms — but CSS gives me a headache.

The Crazy Idea

Then it clicked.

What if I just described the site I wanted — in plain English — and let AI build it? Not generate a snippet. Build thewhole thing. Layouts, config, content, deployment config. Everything.

The prompt I used was simple but specific:

“Make me a clean HackerNoon-style Hugo + Tailwind blog with KB, Videos, News & Views, and About pages. Deploy-ready for Netlify.”

The Stack

The AI returned a complete project in one shot. Here’s what it chose and why it makes sense:

Hugo — a static site generator written in Go. Blazing fast builds, no database, no server to maintain. Perfect for a content blog.

Tailwind CSS — utility-first CSS. Instead of writing stylesheets, you compose classes directly in HTML. The AI could reason about it well and generate clean, consistent UI.

Netlify — one-click deployment from a GitHub repo. Push tomain, site rebuilds automatically. Free tier covers everything a personal blog needs.

GitHub — version control and the bridge between local edits and live site.

The structure it generated:content/,layouts/,assets/,static/,config.yaml,netlify.toml. Exactly what you’d expect from an experienced Hugo developer.

First Roadblock — Hit in Under 5 Minutes

Real talk — the first thing I hit was a terminal error before I’d even installed anything.zsh: command not found: brew. Homebrew wasn’t on the machine.

At first I had no idea what I was looking at.

But here’s the thing about working with AI tools — you learn to debug faster because you can askwhy something failed, not justhow to fix it. Within a few minutes I understood: Homebrew is the Mac package manager. I needed it to install Hugo. Small win — I could read what was going wrong.

Getting Hugo Running

Three commands and Hugo was running:

/bin/bash -c"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"brew install hugohugo server

The site was alive atlocalhost:1313 — navigation, article cards, dark mode toggle, the works.

The Project Structure

The generated project was clean and logical:

content/posts/ — articles as Markdown files with YAML frontmatter
layouts/ — Hugo HTML templates for each page type
assets/css/ — Tailwind CSS, processed at build time
static/ — images and files served as-is
config.yaml — site title, menus, author, base URL
netlify.toml — build command, publish directory, Node version

Hugo reads all frontmatter at build time and generates a completely static site. No database. No server-side rendering. No WordPress. Just fast HTML files.

Real Developer Moments

Not everything was smooth. There were genuine “how does anyone actually do this” moments — forgetting how to copy a file path in Finder, templates not updating because of Hugo’s build cache, Git authentication failing because GitHub dropped password support in 2021.

These are the moments tutorials skip. The AI handled every one of them.

Deployed. It’s Alive.

Getting it live was surprisingly painless:

Push the project to a GitHub repo
Connect the repo to Netlify
Set build command tonpm run build, publish directory topublic/
Hit deploy

Netlify pulled the code, ran Hugo, and published the static site in under 30 seconds. Everygit push from that point triggers an automatic rebuild.

What the Live Site Looks Like

The site has:

Homepage — featured article hero, rotating video cards, Knowledge Base grid
KB — long-form technical articles in Markdown
Videos — auto-synced from a YouTube playlist via YouTube Data API v3. A Node.js script fetches the playlist at build time, writesvideos.json tostatic/, and the browser loads it at runtime. Add a video to YouTube, trigger a deploy, it appears on the site.
About — a simple bio

The video section was the most technically interesting piece. No manual uploads, no embeds to maintain — just a playlist that feeds itself.

The Takeaway

This entire site — layouts, templates, CSS, YouTube integration, Git workflow, deployment pipeline — was built by someone with no frontend background. What I brought was the ability to describe what I wanted clearly, debug errors methodically, and push through the friction points.

The tools did the heavy lifting. The judgment aboutwhat to build was still mine.

Now It’s Your Turn

The stack is:Hugo + Tailwind + GitHub + Netlify + Claude. All free. All production-grade.

Start with a prompt. Describe what you want clearly. Expect errors. Fix them one at a time. Deploy early and often.

The gap between “I have an idea” and “it’s live on the internet” has never been smaller.

The full source for this site is on GitHub atibn-Battuta/AjayW_blog.

]]>

A Field Guide to AI Chips

Ajay Walia — Wed, 29 Apr 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Cinzel:wght@500;700;900&family=IM+Fell+English:ital@0;1&family=Inter:wght@400;500;600&display=swap'); .chip-guide { --bg: #0a1424; --surface: #14253f; --surface2: #1c2d4f; --border: #3a4a6a; --bronze: #b87333; --bronze-bright: #d4a04a; --gold: #e5b85e; --parchment: #e8e0d0; --muted: #9aa6bf; --accent-purple: #8b5cf6; --accent-teal: #2dd4bf; color: var(--parchment); font-family: 'Inter', sans-serif; font-size: 17px; line-height: 1.75; } .chip-guide h1, .chip-guide h2, .chip-guide h3 { font-family: 'Cinzel', serif; letter-spacing: 0.04em; color: var(--gold); font-weight: 700; margin-top: 2.5rem; margin-bottom: 1rem; } .chip-guide h1 { font-size: clamp(2.2rem, 4vw, 3.2rem); font-weight: 900; background: linear-gradient(135deg, #e5b85e 0%, #d4a04a 50%, #b87333 100%); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; text-align: center; margin: 0.5rem 0 0.5rem; } .chip-guide h2 { font-size: clamp(1.4rem, 2.2vw, 2rem); border-bottom: 2px double var(--bronze); padding-bottom: 0.4rem; } .chip-guide h3 { font-size: clamp(1.2rem, 1.6vw, 1.5rem); color: var(--bronze-bright); } .chip-guide p { margin: 0.6rem 0 1rem; } .chip-guide strong { color: var(--gold); } .chip-guide a { color: var(--bronze-bright); } .chip-guide a:hover { color: var(--gold); } .chip-guide .hero { text-align: center; padding: 2.5rem 0 1.5rem; border-bottom: 1px solid var(--border); margin-bottom: 2rem; } .chip-guide .eyebrow { display: inline-block; font-family: 'Cinzel', serif; font-size: 0.78rem; letter-spacing: 0.35em; color: var(--bronze-bright); text-transform: uppercase; margin-bottom: 0.6rem; } .chip-guide .subtitle { font-family: 'IM Fell English', serif; font-style: italic; font-size: clamp(1.05rem, 1.4vw, 1.3rem); color: var(--muted); margin: 0.8rem auto 0; max-width: 720px; } .chip-guide .ornament { text-align: center; margin: 1.2rem 0; color: var(--bronze); font-size: 1.3rem; letter-spacing: 0.6rem; } .chip-guide table { width: 100%; border-collapse: collapse; margin: 1.5rem 0; font-size: 0.95rem; background: var(--surface); border: 1px solid var(--border); } .chip-guide table th { background: var(--surface2); color: var(--gold); font-family: 'Cinzel', serif; text-align: left; padding: 0.7rem 1rem; border-bottom: 2px solid var(--bronze); font-weight: 600; letter-spacing: 0.04em; font-size: 0.92rem; } .chip-guide table td { padding: 0.7rem 1rem; border-bottom: 1px solid var(--border); vertical-align: top; color: var(--parchment); } .chip-guide table tr:last-child td { border-bottom: none; } .chip-guide table tr:hover td { background: rgba(229, 184, 94, 0.04); } .chip-guide .stat-block { background: var(--surface); border: 2px solid var(--bronze); border-radius: 4px; margin: 1.5rem 0; font-family: 'Inter', sans-serif; } .chip-guide .stat-block-header { padding: 0.55rem 1rem; background: var(--surface2); border-bottom: 2px solid var(--bronze); font-family: 'Cinzel', serif; font-weight: 700; color: var(--gold); letter-spacing: 0.05em; font-size: 1.05rem; text-transform: uppercase; } .chip-guide .stat-block dl { padding: 0.7rem 1rem 0.8rem; margin: 0; display: grid; grid-template-columns: 8rem 1fr; row-gap: 0.4rem; column-gap: 0.6rem; } .chip-guide .stat-block dt { font-weight: 700; color: var(--bronze-bright); font-family: 'Cinzel', serif; font-size: 0.82rem; letter-spacing: 0.05em; text-transform: uppercase; margin: 0; } .chip-guide .stat-block dd { margin: 0; color: var(--parchment); font-size: 0.94rem; } .chip-guide .read-more { display: inline-block; margin-top: 0.4rem; color: var(--bronze-bright); font-family: 'Cinzel', serif; font-weight: 700; text-decoration: none; border-bottom: 1px solid var(--bronze); padding-bottom: 1px; letter-spacing: 0.05em; font-size: 0.92rem; } .chip-guide .read-more:hover { color: var(--gold); border-bottom-color: var(--gold); } .chip-guide .quote { font-family: 'IM Fell English', serif; font-style: italic; font-size: 1.18rem; color: var(--parchment); background: rgba(184, 115, 51, 0.08); border-left: 4px solid var(--bronze); padding: 1rem 1.5rem; margin: 1.5rem 0; } .chip-guide ul { padding-left: 1.4rem; } .chip-guide ul li { margin: 0.3rem 0; }

A Field Guide · 2026

A Field Guide to AI Chips

Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.

❦ ❦ ❦

Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a “where you’ll meet it” entry for each. Each section links to a deeper entry for the curious.

The Roll Call

Chip	Best for	Memory & Interconnect	Cost & Access	Notable Specimens (2026)
GPU	Training + inference	80–192GB HBM3/3e; NVLink 5, PCIe 5	$25–40K each; cloud-only at scale	NVIDIA H100, B200, GB200 NVL72; AMD MI325X
TPU	Hyperscale training	95–192GB HBM; OCS interconnect	Google Cloud only	TPU v5p, v6 Trillium
NPU	On-device AI	Shared LPDDR / unified memory	Bundled in device	Apple Neural Engine (M4), Intel AI Boost (Lunar Lake), Qualcomm Hexagon (8 Elite)
CPU	Orchestration & control plane	DDR5; PCIe 5, CXL	$1–15K; retail	Intel Xeon 6, AMD EPYC 9005
ASIC	Inference at scale; specialised training	Custom HBM / SRAM; proprietary fabric	Cloud-only	AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L
FPGA	Custom, low-latency, adaptive	DDR/HBM; reprogrammable fabric	$5–50K each; cloud	AMD Versal AI Edge, Intel Agilex 7
Edge AI	Mobile, robotics, IoT	LPDDR; low-power	$50–2000, embedded in product	NVIDIA Jetson Orin, Google Coral, Hailo-8
Emerging	Frontier R&D	Wafer-scale SRAM / photonic / analog	Mostly research, limited cloud	Cerebras (covered above), Lightmatter, Mythic

1 · GPU — Graphics Processing Unit

GPU · The Apex Predator

Class: Parallel beast
Memory: 80–192GB HBM3 / HBM3e
Interconnect: NVLink 5, PCIe 5, InfiniBand
Power: 350–1200W per die
Habitat: Hyperscale datacenters
Cost: $25–40K per card · cloud rental at scale
Best Prey: LLM training, diffusion, multimodal pretraining
Specimens: NVIDIA H100, B200, GB200 NVL72; AMD MI325X

GPUs are the apex predator of the AI hardware ecosystem in 2026. Originally designed for graphics, they turned out to be ideal for the dense matrix multiplications that dominate neural network training. NVIDIA’s H100 made the LLM era possible; B200 and the rack-scale GB200 NVL72 (72 GPUs treated as one machine, lashed together by NVLink switches) define the current frontier.

The reason GPUs dominate isn’t just parallel processing — it’s the combination of HBM (high-bandwidth memory mounted directly on the chip package), tensor cores (specialised matrix-multiply units), and a mature software ecosystem (CUDA, PyTorch, JAX) that nothing else has matched at scale. AMD’s MI325X is the only serious open-market competitor, and even it ships running CUDA-compatible code through ROCm translation.

The catch: you cannot really buy them. H100s and B200s ship into hyperscaler datacenters first and reach the open market — when they do — through Lambda, CoreWeave, AWS, and friends, rented by the hour at $2–8 each.

→ Full entry: Field Guide · GPUs

2 · TPU — Tensor Processing Unit

TPU · Google's Matmul Colossus

Class: Bespoke matrix engine
Memory: 95–192GB HBM (varies by generation)
Interconnect: OCS (Optical Circuit Switching) + ICI
Power: ~200–300W per chip
Habitat: Google Cloud (only)
Cost: Cloud rental only
Best Prey: Hyperscale training of Gemini-class models
Specimens: TPU v5p (training), v5e (inference), v6 Trillium

Google designed TPUs in-house to avoid paying NVIDIA’s margins on a workload they knew exactly — TensorFlow matrix multiplications at hyperscale. Each generation has narrowed the gap with GPUs on flexibility while widening it on energy efficiency per FLOP.

The architectural bet is the systolic array: a grid of multiply-add units that pumps data through in lockstep, achieving near-peak utilisation on matmul-heavy workloads. The trade-off is that anything outside that sweet spot (irregular memory access, highly dynamic shapes) runs less efficiently than on a GPU. The OCS-based interconnect lets Google rewire a TPU pod’s topology per job, which matters enormously at the scale of a Gemini training run.

You cannot buy a TPU. They exist exclusively inside Google Cloud, rented by the hour. Gemini was trained on them; many third parties (Anthropic for a stretch, plus enterprise customers) rent slices for their own runs.

→ Full entry: Field Guide · TPUs

3 · NPU — Neural Processing Unit

NPU · The Resident Familiar

Class: On-device specialist
Memory: Shared LPDDR / unified system memory
Interconnect: SoC fabric (on-die)
Power: 5–40W
Habitat: Laptops, phones, tablets
Cost: Bundled — no separate purchase
Best Prey: Voice, camera AI, on-device LLMs, Copilot features
Specimens: Apple Neural Engine (M4), Intel AI Boost (Lunar Lake / Arrow Lake), Qualcomm Hexagon (8 Elite, X Elite)

NPUs are the chip type most people interact with every day without knowing it. They live inside the SoC of your phone or laptop, optimised for running already-trained models locally with extreme power efficiency. Voice transcription, Face ID, Pixel’s call screening, the on-device chat in Copilot+ PCs — all NPU workloads.

The defining trait is integer-quantised math (INT8 / INT4) at very low wattage. Where a datacenter GPU might pull 700W to serve a model, an NPU runs a comparable inference on the same model — quantised down — at 5–15W, with the weights sitting in the device’s main memory because there is no discrete accelerator memory to fill.

Microsoft now requires 40+ TOPS of NPU performance for a laptop to qualify as a “Copilot+ PC” — a forcing function that pushed Qualcomm, Intel and AMD into a 12-month arms race. As of 2026, top mobile SoCs ship 50–60 TOPS of NPU performance.

→ Full entry: Field Guide · NPUs

4 · CPU — Central Processing Unit

CPU · The Foundational Workhorse

Class: General-purpose
Memory: DDR5 (system RAM)
Interconnect: PCIe 5, CXL
Power: 100–500W
Habitat: Every server, every workstation
Cost: $1–15K, retail and widely available
Best Prey: Orchestration, preprocessing, control plane, small-batch inference
Specimens: Intel Xeon 6, AMD EPYC 9005

CPUs aren’t obsolete — they’re indispensable. Every AI training run needs CPUs to feed data to the accelerators (decompression, tokenisation, augmentation), schedule jobs and run the control plane. The ratio matters: a typical training cluster pairs eight GPUs with one or two CPU sockets.

Modern server CPUs ship with AI-targeted extensions — AVX-512, AMX (Advanced Matrix Extensions), bf16 support — that let them handle small-batch inference and embedding generation reasonably well. For workloads under 7B parameters at low traffic, a CPU is often more economical than a dedicated accelerator.

What CPUs cannot do is train frontier models. The arithmetic density and memory bandwidth needed for LLM pretraining is 10–100× what a CPU delivers per watt. CPUs do the surrounding work; accelerators do the math.

→ Full entry: Field Guide · CPUs

5 · ASIC — Application-Specific Integrated Circuit

ASIC · The Purpose-Bred Specialist

Class: Fixed-function accelerator
Memory: Custom HBM / on-die SRAM
Interconnect: Proprietary fabric
Power: 75–450W per chip
Habitat: Hyperscaler clouds (AWS, Cerebras, Groq, SambaNova)
Cost: Cloud rental only
Best Prey: Inference at scale, specialised training
Specimens: AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L

ASICs are chips designed for one thing and one thing only — and they’re brutally good at that thing. AWS Inferentia2 runs production inference for Anthropic, Amazon search and Alexa at a cost-per-token that beats GPUs. Trainium2 is AWS’s training equivalent, taking aim at NVIDIA’s H100/B200 dominance. Groq’s LPU posts inference latencies — sub-1ms first-token for many models — that GPUs simply cannot match.

The architectural philosophy is “build silicon for the specific math you do most often, throw away the rest.” Cerebras takes this furthest: their Wafer-Scale Engine 3 is a single chip the size of an entire silicon wafer (900,000 cores, 44GB on-die SRAM) that eliminates the multi-GPU communication overhead which plagues distributed training.

The price of specialisation: you cannot pivot. When the dominant architecture changes — and it does (Mamba, MoE, diffusion, JEPA) — ASICs designed for the last era stop being competitive overnight. GPUs hedge their bets; ASICs commit.

→ Full entry: Field Guide · ASICs

6 · FPGA — Field-Programmable Gate Array

FPGA · The Shapeshifter

Class: Reprogrammable logic
Memory: DDR / HBM (model-dependent)
Interconnect: PCIe; custom
Power: 50–300W
Habitat: Trading desks, 5G basebands, telecom, occasional inference
Cost: $5–50K each; cloud
Best Prey: Ultra-low-latency inference, custom protocols, evolving workloads
Specimens: AMD/Xilinx Versal AI Edge, Intel Agilex 7

FPGAs occupy a strange ecological niche. Unlike ASICs, their internal wiring is reprogrammable — you can compile a new circuit into them, deploy it, and reprogram it tomorrow. This makes them ideal for workloads that change faster than chip fabrication cycles (years), or where you need an ultra-low-latency that even ASICs struggle to deliver.

In AI specifically, FPGAs are rarely the first choice for mainstream training or inference — they are slower to develop for and harder to program than GPUs. Where they shine: when the model is small enough to fit, the latency budget is brutal (single-digit microseconds), and the workload spec might shift quarterly. Microsoft used FPGAs heavily in early Bing Search ranking and Azure networking; financial firms still run them for inline ML in trading.

For most readers, FPGAs will be a “did you know?” category rather than a chip you’ll ever deploy.

→ Full entry: Field Guide · FPGAs

7 · Edge AI — Mobile, Robotics, IoT

Edge AI · The Frontier Ranger

Class: Embedded inference
Memory: LPDDR, sometimes onboard SRAM
Interconnect: PCIe, MIPI, USB
Power: 1–25W
Habitat: Drones, robots, cameras, autonomous systems, sensors
Cost: $50–2000; embedded in product
Best Prey: Real-time inference, computer vision, robotics
Specimens: NVIDIA Jetson Orin, Google Coral / Edge TPU, Hailo-8, Ambarella CV5

Edge AI chips are NPUs’ cousins — same family, different role. Where an NPU lives inside a consumer laptop alongside other compute, an edge AI chip is purpose-built for an embedded device: a security camera, a drone, a forklift, a Tesla.

The defining constraints are size, power and latency. A camera processing 4K video at 30fps cannot afford to ship frames to a cloud GPU; it has to detect motion locally, identify objects locally and signal events within tens of milliseconds — on a few watts, because the device runs on battery or is fanless.

NVIDIA’s Jetson family is the broadest platform — same CUDA software stack as their datacenter GPUs, scaled down to 7–60W. Google’s Edge TPU is the smallest, cheapest and lowest power (Coral USB stick: $40, 2W). Hailo-8 and Ambarella sit in between, targeting industrial and automotive customers.

→ Full entry: Field Guide · Edge AI

8 · Emerging Architectures

Emerging · The Frontier Beasts

Class: Experimental
Memory: Wafer-scale SRAM / photonic / analog
Interconnect: On-wafer / optical / in-memory
Power: Varies wildly
Habitat: Research labs, narrow cloud offerings
Cost: Mostly inaccessible; limited cloud
Best Prey: The next 10× efficiency leap
Specimens: Cerebras WSE-3 (wafer-scale); Lightmatter, Lightelligence (photonic); Mythic, IBM (analog/in-memory)

Three architectures sit at the frontier — promising, but not yet mainstream.

Wafer-scale (Cerebras): one single chip the size of an entire silicon wafer. Eliminates multi-chip communication entirely; presents the whole system to software as a single device. Already commercial.
Photonic / optical AI (Lightmatter, Lightelligence): perform matrix math using light interference instead of electricity. Potentially orders of magnitude lower energy per operation; currently limited to inference and constrained models.
Analog / in-memory compute (Mythic, IBM, several startups): computeinside memory arrays using analog voltage levels. Removes the von-Neumann bottleneck — the constant shuttling of data between memory and compute — entirely. Promising for low-power inference; precision limitations make training hard today.

→ Full entry: Field Guide · Emerging Architectures

Current Industry Reality (2026)

GPUs dominate training. Every frontier model — GPT-5, Claude, Gemini, Llama, Grok — is still trained on NVIDIA or AMD silicon at hyperscale.
ASICs are ascendant in inference. AWS reports more than 40% of internal inference now runs on Inferentia and Trainium; Groq leads on latency-critical applications.
NPUs are exploding on consumer devices. Every premium laptop and phone shipped in 2026 has a 40+ TOPS NPU.
CPUs remain foundational. No accelerator runs without one.
TPUs are Google-only. Gemini, Veo and Imagen were all trained on TPU v5p / v6.

Simplified View

Use case	Typical chip
Train a GPT-class model	GPU clusters (or TPU pods if you're Google)
Run ChatGPT-class inference at scale	GPUs + ASICs (Inferentia, Groq, Trainium)
AI on laptop	NPU + integrated GPU
AI on phone	Mobile NPU
Robot or drone AI	Edge AI chips (Jetson, Hailo)
Ultra-low-latency custom AI	FPGA or ASIC

The Industry Trend

The industry is moving from "general-purpose GPU everything" to "specialised chip for each layer of the stack."

Power and inference cost are now the binding constraints. A frontier model serving billions of queries spends more on inference electricity in a year than its entire training run cost. The economics force specialisation: train once on GPUs, serve forever on cheaper inference silicon. Expect the gap between training hardware (still GPU-dominant) and inference hardware (rapidly ASIC- and NPU-fragmented) to widen.

❦ ❦ ❦

]]>

Component	What It Does
Gemma 4 26B A4B (Q4_K_M)	The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. ~13 GB VRAM.
LM Studio	Local model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.
LangGraph	Orchestration graph — defines the multi-step query pipeline as a typed state machine.
LanceDB	Local vector database. One namespace (table) per agent. Fully file-based, no server process needed.
BAAI/bge-small-en-v1.5	Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.
Redis	Session memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).
FastAPI + Uvicorn	REST API gateway on port 8000. Full query/response model, session management, agent listing.
Gradio	Web chat UI on port 7860. Session management, source citations, escalation chain display.
Prometheus	Metrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.
SQLite (audit.db)	Immutable audit trail. Every query logged before response is returned. Cannot be skipped.