Ajay Walia

Aether, Rethought — The Shape Was Wrong All Along

Ajay Walia — Fri, 12 Jun 2026 00:00:00 +0000

Part III of the Aether series. Missed the first two? Start withMeet Aether (the build), thenAether, Grown Wild (what happened when it ran).

The story so far — one paragraph each

Part I: I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.

Part II: That clean idea hit reality. The router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and self-reported confidence was replaced by a formula the system computes. Thirteen agents, a live web allowlist, a computed confidence score, and a CHANGELOG.md in place of Git.

Part III (this one): The shape is wrong. The hierarchy that felt so natural — because it mirrors the org chart — turns out to optimise for the wrong things. I worked through five recognised ways to build an agentic AI system, scored each against the same criteria, and arrived at a recommendation that changes the architecture without throwing away anything we’ve built.

What we’re actually building

The output isn’t a chatbot. It’s a consulting deliverable — an architecture document advising how to run an IT transition. Every real client engagement spans multiple towers simultaneously: a move to Entra ID while modernising the network and shifting workloads to Azure touches Cloud, Network, Digital Workplace, and Security in the same breath. The value and the difficulty live in the cross-domain synthesis.

Where we started — the three-tier hierarchy

This is what Aether v2.x actually is. One model, thirteen agents, each differentiated only by system prompt and knowledge namespace. Work routes down the tree; low confidence escalates back up.

It was appealing for real reasons: it mirrors how a delivery organisation thinks, easy to explain to a client, and the RAG namespace-per-domain isolation is clean. But we copied an org chart into the control flow — and the next slide explains why that’s usually a trap.

The core insight — why the org chart is the wrong shape

This is the conceptual centre of the entire analysis. Everything that follows flows from these two points.

A strict tree only allows vertical movement — up to escalate, down to delegate. Real architecture work needs lateral collaboration. The AWS specialist can’t directly ask the Network specialist a question; it has to climb the entire tree and back down. That’s bureaucracy encoded in Python.

The second consequence is more fundamental:the deliverable is a workflow, not an organisation. Producing an architecture document is a consulting process with phases. The right structure for that process is a pipeline. We modelled the people first and the process second — we should have done it the other way round.

Five ways to build it

Rather than jump straight to a new design, I evaluated five recognised approaches against the same five axes: cross-domain capability, parallelism, auditability, simplicity, and fit for document generation.

Note the footer: RAG sits underneathall five approaches — they differ in control flow, not in whether they retrieve.

Approach 1 — Hierarchical / org-mirror

This is what Aether v2.x already is.

The pros are real — which is why we chose it. But the killer con:there is no sideways path. The AWS agent can’t ask the Network agent a question without escalating all the way up and back down. Also, escalation triggered by confidence scores sounds clean, but LLM confidence is unreliable — the trigger itself is shaky.

Verdict: Great for stable, siloed problems with a genuine audit chain. That is not what our engagements look like.

Approach 2 — Orchestrator + flat specialists

One orchestrator plans the task, fans it out to specialists runningin parallel, then synthesises. Flat: adding a domain means adding one specialist — no re-tiering.

Verdict: A strong, flexible default. A piece of the recommendation.

Approach 3 — Workflow / pipeline (process-native)

Instead of organising bywho, organise by thestages of producing the document: Discover → Assess → Design → Review → Assemble. Each stage maps to a section of the output. Human checkpoints slot between stages.

Verdict: This is the spine. The structure that mirrors how the deliverable is actually made. The so-called con — that domain experts become knowledge sources rather than autonomous agents — is arguably the correct framing.

Approach 4 — Blackboard / shared artifact

All agents read and write a common workspace — the evolving document. Strong on cross-domain consistency because there’s only one object. The hard part is concurrency control.

Key distinction: a true blackboard (free-for-all writes) is risky. A typed, controlled shared state object — which LangGraph gives us — keeps the benefit without the chaos. We use the controlled version.

Approach 5 — Single-agent baseline

One capable model. A lightweight router. Domain knowledge bases on demand. Simplest, cheapest, most reliable.

Verdict: The benchmark every fancier design has to beat. The real reason to decompose isinformation management — not model weakness. When CMDB exports, cloud inventories, and Terraform files all arrive together, no context window handles it cleanly.

The comparison

The table does the work. Our current design (Hierarchical) is weakest exactly where we need strength — cross-domain — while strong on auditability. The two approaches that score high on fit-for-doc-gen arecomplementary: strong in different places. That’s the bridge to the recommendation.

The recommendation — a hybrid

Not a ladder. Ahub. The pipeline (Discovery → Assessment → Design) flows through a single typed shared state object. Domain specialists are invoked within stages — callable skills, not autonomous routing agents. Cross-cutting reviewers (Security, Cost, Risk, Compliance) act across all stages. QA can loop work back to Design or Assessment until the rubric passes.

How the model works

Three things to highlight:

Parallel specialists write structured findings into one typed shared state — the single source of truth
Arbitration is deterministic — fires on conflict, policy breach, or missing data, never on confidence scores
The model is pluggable — local Gemma today, Claude/OpenAI/Gemini tomorrow, same knowledge, no rigid tree

Why this is the right call

The verdict, stated plainly: it’s the only option that handles cross-domain work with the auditability, governance, and evidence-traceability a client deliverable demands — at an acceptable, well-understood increase in build complexity.

The technology stack — nothing new to install

Every box in the stack is something we already run. The v3 work extends Orchestration and adds the governance/eval layer.This is not a rebuild.

The process — how a real engagement runs

Two things to notice:

Feedback arrows — review, QA, and cost/risk findings can send workback to Design or Assessment. It’s iterative, not a one-way pipeline.
Human gates — after assessment and before assembly. An architect validates the current-state picture and signs off before the document is built. Gated autonomy, not full automation — which matters when the output carries liability.

The real hard problem — evidence quality

Every downstream recommendation carries the confidence level and assumptions set at this gate.No silent guessing. The system requests more data, logs gaps in an assumption register, or proceeds while explicitly stating its confidence level.

What flows through the system — the canonical Architecture State

One typed object. Every stage, specialist, and reviewer reads from and writes to it. The document is rendered from it. Every recommendation traces back through the same object to the evidence that produced it — that traceability is what makes the deliverable auditable.

Lenses, not domains — where security and compliance live

Security is a lens, not a domain. Same for Cost, Compliance, Risk, Sustainability, and Supportability. They aren’t towers sitting next to Cloud and Network — they cut horizontally across every stage. Most rework loops originate in Security and Compliance, which is exactly why they’re first-class cross-cutting reviewers.

Key design considerations

The most important reframe:we decompose for information management, not because the model is weak. That reframes the whole multi-agent debate. And arbitration fires on deterministic events — never on confidence scores. Every recommendation must trace to evidence.

Design decisions — where the build effort actually goes

The effort allocation is the provocative bit.Only ~10% of build effort goes to the agents themselves. ~35% is evidence ingestion and normalisation. ~20% is evaluation and governance. Most teams invert these numbers — polished agents, no evaluation — and end up with a demo that can’t produce a deliverable a client would pay for. That’s the trap we’re avoiding.

Business value

The positioning:a consulting accelerator and quality platform — not a headcount-reduction play. The specific numbers (50–80% faster assessments, 60–90% less evidence-collection effort) are hypotheses to validate in a pilot, not measured results. They’re framed that way because that’s what earns trust.

Next steps

The sequencing matters:evidence ingestion and the QA rubric first, agent polish second. That’s where the quality actually lives.

The honest accounting

I built a hierarchy because it was intuitive. It is intuitive — it maps to how the organisation thinks, the escalation chain is easy to trace and audit, and it’s easy to explain to a client. Those are real advantages.

But intuitive for humans and optimal for agents are not the same thing. The hierarchy was designed to answer“who is responsible?” — an important human question that agents don’t need answered. The new design answers“what needs to happen next?” — which is the right question for a document-generation workflow.

What changes in v3:

Agents become implementation details inside the Knowledge layer — the durable value is in evidence, governance, evaluation, and the document workflow
Escalation fires on events (conflict, policy breach, missing data) — not on confidence scores
The canonical model is the design constraint — the document isn’t an afterthought, it’s what everything is structured around
Model-agnostic by default — local Gemma today, SOTA API tomorrow, same knowledge

What stays the same: LangGraph, LanceDB, BAAI/bge-small, FastAPI, SQLite, the YAML manifests, and the 13 domain specialists — which become callable skills within the Design stage rather than autonomous routing agents. Same knowledge, different invocation mechanism.

This is an extension of what we run, not a rebuild.

Questions about the design or the approach? Reach out onLinkedIn.

]]>

I Built My Own RSS Reader in an Afternoon — With AI Doing the Typing

Ajay Walia — Wed, 10 Jun 2026 00:00:00 +0000

For years my RSS setup was a compromise. I never wanted to pay for a desktop reader, so I got stuck with The Old Reader — a perfectly fine service, but using it felt like visiting a website, because that’s exactly what it was. Open a browser tab, log in, scroll, repeat tomorrow. What I actually wanted was simple: a native Mac app, my feeds downloaded and stored locally, no account, no subscription, fast enough to triage a hundred articles with my keyboard.

Then I heard about Claude’s new Fable model and thought: fine, let’s stop wishing and just build the thing. This is the story ofLumenAI — a local-first RSS reader for macOS with AI summaries — built from an empty folder to a notarizable DMG in about an hour of wall-clock time, with me acting as product owner and build verifier while the AI wrote the code.

That’s the real app, not a mockup. There’s afull captioned gallery here — the loaded feed list, OPML import, the memory footprint, and the DMG build.

The Idea

The pitch I gave the AI was one paragraph: a local RSS reader for Mac, feeds downloaded and stored on my machine, a premium feel, and treat it like a real engineering project — clear phases, and ask me questions before making decisions. That last part turned out to be the most important sentence in the whole project.

Instead of immediately generating a wall of code, it interviewed me. What stack? What does v1 include, and — just as important — what does it exclude? How should refresh work? What does “premium” mean to you, concretely? By the end of a few rounds of multiple-choice questions, we had a real spec:

v1 goals: subscribe to feeds, full article extraction, fast local search, OPML import/export, saved views, deduplication, offline reading, AI summaries, keyboard-first navigation, dark/light themes, typography controls, reader mode, reading progress.

v1 non-goals: semantic search, topic clustering, multi-device sync, social anything.

Writing down the non-goals felt almost ceremonial at the time. It wasn’t. Every time scope tried to creep, that list killed the discussion in one line.

The Stack

Every choice optimized for “native feel, local data, no servers.”

The app is Swift and SwiftUI targeting macOS 14, because nothing fakes the feel of a real Mac app. Storage is SQLite viaGRDB.swift, chosen over Apple’s SwiftData specifically for FTS5 — SQLite’s built-in full-text search engine, which gives instant search across every article ever downloaded, entirely offline. Feed parsing is FeedKit, wrapped in a normalizer layer so the rest of the app never touches a FeedKit type and JSON Feed support costs no schema changes. Full-text extraction is Mozilla’s Readability.js — the same engine behind Firefox’s reader mode — running in a hidden WKWebView with the page’s own JavaScript disabled. The reader itself is a WKWebView used purely as a rendering layer for a themed HTML template; everything around it stays SwiftUI. The project file is generated by XcodeGen from a YAML spec, which kept the AI and Xcode from ever fighting over a.xcodeproj.

The AI layer is the part I’m proudest of architecturally: a singleSummaryProvider protocol with six implementations — Apple Intelligence (on-device, appears only on macOS 26+), Ollama for local models, Claude, OpenAI, Gemini, and Disabled. One protocol method. Swapping providers is a dropdown in Settings; API keys live in the macOS Keychain.

The Seven Phases (Okay, Eight)

We numbered from zero, like civilized people.

Phase 0 — Scaffold. XcodeGen project, sandbox and network entitlements, and a three-pane shell (sidebar, article list, reader) running on sample data. The exit criterion was simply “builds and runs.” It almost did: the very first build failed with Swift’s infamous“the compiler is unable to type-check this expression in reasonable time” — the AI had written a too-clever nested closure to generate sample data. It rewrote it as a boringfor loop. A very human bug, honestly.

Phase 1 — Data layer. The real schema: feeds, folders, articles, saved views, an FTS5 index kept in sync by SQL triggers, and a three-tier deduplication identity — an article is itsguid if the feed provides one, else its normalized URL (tracking parameters stripped), else a content hash. Ten unit tests against an in-memory database before any networking existed.

Phase 2 — Feed engine. Fetching with HTTP conditional GET, so unchanged feeds cost a 304 response instead of a re-download. Feed auto-discovery, so typingdaringfireball.net finds the actual feed URL by scanning the page’s tags. RSS, Atom, and JSON Feed all normalize into one canonical model. By the end of this phase the app was genuinely usable: subscribe, read, refresh.

Phase 3 — Core UI. Folders, favicons, thumbnails, unread badges, and the thing that makes an RSS reader feel like a tool instead of a website: keyboard navigation.j/k to move, space for next unread,s to star,m to toggle read. The Reeder dialect, basically.

Phase 4 — Reading experience. Select an article and it silently fetches the source page, runs Readability.js over it, and stores clean full text — so a feed that only publishes two-line excerpts still gives you whole articles, offline, forever. Typography controls (serif/sans, size, line width), themes that follow the system, and per-article reading progress that restores when you come back. This phase also produced the best bug of the project: scroll position was saved to app state, which regenerated the reader HTML, which reloaded the page, which reset the scroll — an infinite loop the AI caught in code reviewbefore I ever built it.

Phase 5 — Search, saved views, OPML. Global FTS5 search from the toolbar, saved views (persistent named filters — “unread Swift articles from these three feeds, last 30 days”), and OPML import/export so my subscriptions could finally walk out of The Old Reader with folder structure intact.

Phase 6 — AI summaries. The provider protocol described above, plus a deliberately boring prompt: summarize in two or three sentences, be specific, no “this article discusses.” The summary renders as a tinted card above the article. The point isn’t to replace reading — it’s triage. Is this worth my next ten minutes?

Phase 7 — Polish. An app icon (generated programmatically — an RSS glyph under a sparkle on an indigo gradient), a dock badge with the unread count, render-path caching so thousand-article lists scroll smoothly, and a one-command script that builds a signed, drag-to-Applications DMG.

The Phase 7 finale, for real: one script, one signed DMG on the Desktop. More screenshots — the loaded feed list after OPML import, and the app’s 143 MB memory footprint in Activity Monitor — are on theLumenAI screenshots page.

What the Lifecycle Actually Felt Like

The loop for every phase was identical: the AI proposed decisions and asked questions, wrote the code and its tests, and then stopped — because it couldn’t compile anything. Its sandbox is Linux; you can’t build a Mac app there. So I was the build machine. ⌘U, ⌘R, report back. “Build succeeded, go ahead” became the rhythm of the afternoon.

That constraint turned out to be a feature. It forced a real checkpoint between phases — a human running the actual app — instead of an unbroken firehose of unverified code. Of the three failures across the whole project, two were caught by my builds (the type-checker timeout, and a “cannot find type” error that turned out to mean I’d forgotten to re-run XcodeGen after files were added) and one was caught by the AI re-reading its own code. Final tally:38 Swift files, 27 tests, zero runtime crashes encountered.

The other thing that surprised me: being asked questions felt like the AI respecting that it wasmy app. Tech stack, refresh cadence, dedup policy, summary length, even the app’s name — every fork in the road was a decision I made in seconds from a menu of researched options, instead of an assumption silently baked into code I’d discover three weeks later.

What I’d Tell You If You’re Tempted

Treat it like an engineering project, not a magic trick. The phases, the non-goals list, the tests, the check-in after every phase — that structure is why this worked in an hour instead of unraveling in a weekend. The AI typed every line of code, but the spec, the taste, and the “no, simpler” calls were the human contribution, and the project needed both.

And yes — the app starts instantly, works on a plane, and never asks me to log in. The subscription I avoided paying for has been replaced by the most expensive thing of all: now I want to build everything.

LumenAI is Swift/SwiftUI on macOS 14+, with GRDB, FeedKit, and Readability.js. Built with Claude Fable 5.

]]>

LLMs Are Probability Engines, Not "Thinkers"

Ajay Walia — Sun, 07 Jun 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap'); .pe-article { --bg: #070b14; --bg2: #0d1423; --bg3: #111827; --cyan: #00e5ff; --purple: #a855f7; --gold: #fbbf24; --text: #e2e8f0; --muted: #94a3b8; --border: #1e293b; --danger: #f87171; font-family: 'Space Grotesk', system-ui, sans-serif; font-size: 1.08rem; line-height: 1.85; color: var(--text); } /* TOC */ .pe-toc { background: var(--bg2); border: 1px solid var(--border); border-left: 3px solid var(--cyan); border-radius: 10px; padding: 1.25rem 1.75rem; margin: 2rem 0; } .pe-toc h3 { font-size: 0.95rem; letter-spacing: 0.18em; text-transform: uppercase; color: var(--cyan); margin: 0 0 1rem; } .pe-toc ol { padding-left: 1.3rem; margin: 0; } .pe-toc li { margin-bottom: 0.6rem; } .pe-toc a { color: var(--muted); text-decoration: none; font-size: 1.15rem; font-weight: 600; transition: color 0.2s; } .pe-toc a:hover { color: var(--cyan); } /* Video */ .pe-video { margin: 2rem 0; border-radius: 12px; overflow: hidden; border: 1px solid var(--border); background: #000; } .pe-video video { width: 100%; display: block; } .pe-video-header { background: var(--bg2); padding: 1rem 1.4rem; font-size: 1.15rem; font-weight: 600; color: var(--cyan); border-bottom: 1px solid var(--border); line-height: 1.5; } /* Typography */ .pe-article h2 { font-size: 1.75rem; font-weight: 700; color: #fff; margin: 3rem 0 0.9rem; padding-bottom: 0.45rem; border-bottom: 1px solid var(--border); } .pe-sec-num { color: var(--cyan); font-size: 1rem; font-weight: 600; display: block; margin-bottom: 0.2rem; letter-spacing: 0.1em; } .pe-article p { margin-bottom: 1.1rem; } .pe-article strong { color: #fff; } .pe-em { color: var(--gold); } /* Callouts */ .pe-callout { background: var(--bg2); border-left: 4px solid var(--purple); border-radius: 0 8px 8px 0; padding: 1.4rem 1.8rem; margin: 1.5rem 0; font-size: 1.4rem; color: var(--muted); line-height: 1.75; } .pe-callout.cy { border-color: var(--cyan); } .pe-callout.gd { border-color: var(--gold); } .pe-callout strong { color: var(--text); } /* Compare table */ .pe-table { width: 100%; border-collapse: collapse; font-size: 1rem; margin: 1.25rem 0; } .pe-table th { text-align: left; padding: 0.7rem 1rem; background: var(--bg2); color: var(--cyan); font-size: 0.85rem; letter-spacing: 0.08em; text-transform: uppercase; border-bottom: 1px solid var(--border); } .pe-table td { padding: 0.85rem 1rem; border-bottom: 1px solid var(--border); color: var(--muted); vertical-align: top; line-height: 1.6; } .pe-table td:first-child { color: var(--text); font-weight: 500; } .pe-table tr:hover td { background: var(--bg2); } /* Formula boxes */ .pe-box { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.75rem; margin: 1.75rem 0; } .pe-box-title { font-size: 0.95rem; letter-spacing: 0.15em; text-transform: uppercase; color: var(--purple); margin-bottom: 1rem; } /* Anim 1 — token prediction */ .pe-sentence { font-size: 1.3rem; font-family: 'JetBrains Mono', monospace; color: var(--text); min-height: 2rem; margin-bottom: 1.1rem; } .pe-cursor { display: inline-block; width: 2px; height: 1em; background: var(--cyan); animation: pe-blink 0.8s infinite; vertical-align: middle; margin-left: 2px; } @keyframes pe-blink { 0%,100%{opacity:1} 50%{opacity:0} } .pe-prob-bars { display: flex; flex-direction: column; gap: 0.55rem; } .pe-prob-row { display: flex; align-items: center; gap: 0.8rem; font-size: 1.1rem; } .pe-prob-lbl { width: 80px; text-align: right; color: var(--muted); font-family: 'JetBrains Mono', monospace; flex-shrink: 0; } .pe-prob-track { flex: 1; height: 26px; background: var(--bg3); border-radius: 5px; overflow: hidden; } .pe-prob-fill { height: 100%; background: var(--cyan); border-radius: 5px; transition: width 0.55s cubic-bezier(0.4,0,0.2,1); width: 0; } .pe-prob-fill.win { background: var(--gold); } .pe-prob-pct { width: 50px; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--muted); } /* Math display */ .pe-math { font-family: 'Georgia', serif; font-size: 1.45rem; color: var(--gold); text-align: center; padding: 1.3rem; background: var(--bg3); border-radius: 8px; margin-bottom: 0.9rem; } .pe-term { display: inline; opacity: 0; transition: opacity 0.4s; cursor: help; position: relative; } .pe-term.on { opacity: 1; } .pe-term:hover::after { content: attr(data-tip); position: absolute; bottom: 115%; left: 50%; transform: translateX(-50%); background: var(--bg); border: 1px solid var(--purple); color: var(--text); padding: 0.4rem 0.85rem; border-radius: 6px; font-family: 'Space Grotesk', sans-serif; font-size: 0.88rem; white-space: nowrap; z-index: 20; } .pe-anns { display: grid; grid-template-columns: 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; } .pe-ann { background: var(--bg3); border-radius: 6px; padding: 0.55rem 0.85rem; font-size: 0.93rem; opacity: 0; transition: opacity 0.5s; } .pe-ann.on { opacity: 1; } .pe-ann-sym { color: var(--gold); font-family: 'JetBrains Mono', monospace; font-weight: bold; } .pe-ann-desc { color: var(--muted); } /* Softmax */ .pe-sm-demo { display: flex; gap: 1rem; align-items: flex-start; flex-wrap: wrap; } .pe-sm-col { flex: 1; min-width: 160px; } .pe-col-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.75rem; } .pe-logit-row { display: flex; align-items: center; gap: 0.6rem; margin-bottom: 0.55rem; font-size: 0.97rem; font-family: 'JetBrains Mono', monospace; } .pe-logit-w { width: 60px; color: var(--text); } .pe-logit-v { padding: 0.22rem 0.6rem; border-radius: 4px; font-size: 0.92rem; } .pe-logit-v.neg { background: rgba(248,113,113,0.15); color: var(--danger); } .pe-logit-v.pos { background: rgba(0,229,255,0.1); color: var(--cyan); } .pe-sm-bar { height: 22px; border-radius: 4px; background: var(--purple); transition: width 0.75s cubic-bezier(0.4,0,0.2,1); width: 0; display: flex; align-items: center; padding-left: 7px; font-size: 0.86rem; color: #fff; overflow: hidden; white-space: nowrap; } .pe-arrow { display: flex; align-items: center; justify-content: center; padding-top: 1.4rem; font-size: 1.6rem; color: var(--cyan); } /* Loss */ .pe-loss-wrap { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; align-items: start; } @media(max-width:500px) { .pe-loss-wrap { grid-template-columns: 1fr; } } .pe-loss-num { font-size: 2.8rem; font-weight: 800; font-family: 'JetBrains Mono', monospace; color: var(--danger); transition: color 0.5s; line-height: 1; } .pe-loss-num.good { color: #4ade80; } .pe-loss-lbl { font-size: 0.88rem; color: var(--muted); margin-top: 0.3rem; } .pe-loss-slider label { font-size: 0.92rem; color: var(--muted); display: block; margin: 0.8rem 0 0.3rem; } input[type=range] { width: 100%; accent-color: var(--cyan); } .pe-loss-formula { background: var(--bg3); border-radius: 8px; padding: 1.1rem; font-family: 'JetBrains Mono', monospace; font-size: 1rem; color: var(--text); line-height: 2.1; } .pe-lf-hl { color: var(--gold); } .pe-lf-res { color: var(--cyan); font-weight: bold; } /* Attention */ .pe-attn-words { display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 0.9rem; } .pe-attn-word { padding: 0.4rem 0.8rem; border-radius: 6px; background: var(--bg3); border: 1px solid var(--border); font-size: 1rem; cursor: pointer; transition: all 0.2s; user-select: none; } .pe-attn-word:hover { border-color: var(--cyan); } .pe-attn-word.sel { background: rgba(0,229,255,0.12); border-color: var(--cyan); color: var(--cyan); } /* Attention formula cards */ .pe-qkv { display: grid; grid-template-columns: 1fr 1fr 1fr; gap: 0.6rem; margin-top: 0.8rem; } @media(max-width:480px) { .pe-qkv { grid-template-columns: 1fr; } } .pe-qkv-card { border-radius: 6px; padding: 0.75rem 0.9rem; font-size: 0.93rem; } /* Temperature */ .pe-temp-row-ctrl { display: flex; align-items: center; gap: 1rem; } .pe-temp-big { font-size: 1.6rem; font-family: 'JetBrains Mono', monospace; font-weight: 700; color: var(--cyan); width: 56px; flex-shrink: 0; } .pe-temp-lbl { font-size: 0.92rem; color: var(--muted); margin-top: 0.2rem; } .pe-tbars { display: flex; flex-direction: column; gap: 0.45rem; margin-top: 0.9rem; } .pe-trow { display: flex; align-items: center; gap: 0.65rem; font-size: 0.95rem; font-family: 'JetBrains Mono', monospace; } .pe-tlbl { width: 64px; color: var(--muted); text-align: right; flex-shrink: 0; } .pe-ttrack { flex: 1; height: 19px; background: var(--bg3); border-radius: 4px; overflow: hidden; } .pe-tfill { height: 100%; border-radius: 4px; background: var(--purple); transition: width 0.5s cubic-bezier(0.4,0,0.2,1); } .pe-tpct { width: 48px; text-align: right; color: var(--muted); } /* Limits */ .pe-limits { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1rem; margin: 1.5rem 0; } @media(max-width:640px) { .pe-limits { grid-template-columns: 1fr; } } .pe-limit-card { background: var(--bg2); border: 1px solid var(--border); border-radius: 12px; padding: 1.5rem 1.6rem; transition: border-color 0.2s; display: flex; flex-direction: column; gap: 0.4rem; } .pe-limit-card:hover { border-color: var(--purple); } .pe-limit-icon { font-size: 2.2rem; line-height: 1; } .pe-limit-title { font-weight: 700; color: var(--text); font-size: 1.35rem; margin: 0; } .pe-limit-desc { font-size: 1.15rem; color: var(--muted); line-height: 1.65; margin: 0; } /* Dual meters */ .pe-meters { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin: 1.25rem 0; } .pe-meter { border-radius: 8px; padding: 1.1rem; text-align: center; } .pe-meter-lbl { font-size: 0.82rem; letter-spacing: 0.1em; text-transform: uppercase; margin-bottom: 0.4rem; } .pe-meter-val { font-size: 2.2rem; font-weight: 800; } /* Buttons */ .pe-btn { margin-top: 0.9rem; background: var(--bg3); border: 1px solid var(--cyan); color: var(--cyan); padding: 0.4rem 1.1rem; border-radius: 6px; cursor: pointer; font-size: 0.82rem; font-family: 'Space Grotesk', sans-serif; transition: background 0.2s; } .pe-btn:hover { background: rgba(0,229,255,0.1); } .pe-btn-pur { border-color: var(--purple); color: var(--purple); } .pe-btn-pur:hover { background: rgba(168,85,247,0.1); } /* Interactive badge */ .pe-interactive-header { display: flex; align-items: center; justify-content: space-between; margin-bottom: 1rem; } .pe-interactive-header .pe-box-title { margin-bottom: 0; } .pe-interactive-badge { display: inline-flex; align-items: center; gap: 0.4rem; background: rgba(0,229,255,0.08); border: 1px solid var(--cyan); color: var(--cyan); font-size: 0.72rem; font-weight: 700; letter-spacing: 0.12em; text-transform: uppercase; padding: 0.3rem 0.75rem; border-radius: 999px; flex-shrink: 0; } .pe-interactive-badge::before { content: ''; width: 7px; height: 7px; border-radius: 50%; background: var(--cyan); animation: pe-pulse 1.6s ease-in-out infinite; flex-shrink: 0; } @keyframes pe-pulse { 0%, 100% { opacity: 1; transform: scale(1); } 50% { opacity: 0.4; transform: scale(0.7); } } .pe-interact-hint { display: flex; align-items: center; gap: 0.6rem; margin-top: 0.9rem; padding: 0.85rem 1.1rem; background: rgba(0,229,255,0.05); border: 1px dashed rgba(0,229,255,0.25); border-radius: 8px; font-size: 1.05rem; color: var(--muted); } .pe-interact-hint span { font-size: 1.25rem; } .pe-divider { border: none; border-top: 1px solid var(--border); margin: 2.5rem 0; } @media(max-width:520px) { .pe-anns { grid-template-columns: 1fr; } .pe-sm-demo { flex-direction: column; } .pe-arrow { transform: rotate(90deg); } }

▶ Full Video Explainer — covering how LLMs work, from next-token prediction to attention, training, and why hallucinations are inevitable

Your browser doesn't support HTML5 video.

You've used ChatGPT. You've heard the word "AI" a thousand times this year. But here's something almost nobody explains clearly: the thing powering these tools isnot intelligent in any human sense. It doesn't think. It doesn't understand. It doesn't have goals.

It is, at its core, avery sophisticated next-word predictor — a probability engine trained on the vast majority of text the internet has ever produced. Once you understand this, everything else — its strengths, its failures, its weirdness — clicks into place.

Interactive animations ahead: Press buttons and move sliders as you go — seeing the math move makes it stick.

01 —What ChatGPT and Claude actually are

The term "Artificial Intelligence" conjures images of something that thinks, reasons, and understands — a mind in a machine. That framing is compelling, but misleading when applied to today's large language models (LLMs).

What you're actually talking to is anautoregressive probabilistic model. Every word it generates is the result of asking one question, over and over again:

"Given everything written so far, what word is most likely to come next?"

That's it. Do that billions of times on internet-scale text, and you get something that looks uncannily like reasoning. But it is, fundamentally, pattern matching at extraordinary scale — not understanding, consciousness, or genuine intelligence.

What you see	What's actually happening	The catch
It "reasons"	Pattern-matches reasoning traces from training data	Breaks on genuinely novel problems
It "knows facts"	Recalls high-frequency statistical associations	Hallucinates on rare edge cases
It's "creative"	Samples from learned creative pattern spaces	Derivative — remixes, doesn't invent
It has "opinions"	Outputs tokens shaped by training + alignment	No actual beliefs internally

02 —The one job every LLM does

Let's make this concrete. Below is a live simulation of next-token prediction. Press"Predict next token" and watch the model pick the next word based on probability scores.

🎯 Next-Token Prediction

Live · Interactive

The cat sat on the

👇 Press the button to watch the model predict — one token at a time.

Notice the bars: each candidate word gets a probability score. The model doesn't "decide" in any human sense — it samples from this distribution. The highest-probability word is chosen most often, but not always. That's where both creativity and errors come from.

03 —The probability formula

Here's the mathematical heart of it.Hover each term for a plain-English tooltip, then press the button to reveal the full breakdown piece by piece.

📐 Probability Formula

Live · Interactive

🖱️ Hover any term for a plain-English tooltip. Press the button to reveal the formula step by step.

P(w_t | w ; θ) = softmax(logits_t)

wₜ —The next token to predict

w —All previous tokens (the context)

θ —Billions of learned parameters

softmax —Converts scores → probabilities (sum = 1)

Plain English:"Given everything typed so far, and everything the model learned during training, what is the probability of each possible next word?" The model scores every word in its vocabulary — 50,000+ words — and softmax turns those raw scores into probabilities that add up to exactly 1.0.

04 —Softmax: raw scores → probabilities

The model internally produces a raw score (called alogit) for every possible next word. Logits can be any number — positive, negative, large, small. They're not probabilities yet. Thesoftmax function converts them into a clean distribution. Press the button to watch the transformation.

⚡ Softmax Transform

Live · Interactive

👇 Press the button to watch raw scores transform into probabilities.

Raw Logits (scores)

→

After Softmax (probabilities)

Notice: even the most negative logit still gets a small non-zero probability after softmax. The model never completely rules anything out — it just makes some words astronomically unlikely. This is partly why LLMs occasionally produce bizarre outputs: a 0.001% token still gets picked sometimes.

05 —How it learns: cross-entropy loss

During training, the model sees a sentence with the last word hidden and makes a prediction. The training algorithm asks:"How wrong were you?" The measure of wrongness iscross-entropy loss.

The formula:ℒ = −log P(correct word). If the model assigns 100% probability to the right word, loss = 0. If it assigns 1%, loss is very high.Drag the slider to see this in action.

📉 Cross-Entropy Loss

Live · Interactive

🎚️ Drag the slider to change the model's confidence and watch the loss recalculate live.

1.47

Loss ℒ = −log(p)

Model's confidence in correct word:23%

Correct word:"lazy"
P("lazy"):0.23

ℒ = −log(0.23)
ℒ =1.47

High loss → big update

06 —The Transformer: the machine inside

The specific architecture that makes modern LLMs work is called theTransformer, introduced in a landmark 2017 Google paper. Every major LLM today — GPT-4, Claude, Gemini, Llama — is built on this design.

A Transformer processes your text through many stacked layers. Each layer has two main components:

Multi-Head Self-Attention — Every word simultaneously looks at every other word, learning which relationships matter. This is the core insight.

Feed-Forward Network — A dense neural network that processes each token's information independently, after attention has been applied.

A large model like GPT-4 stacks around 96 of these layers. With enough layers, parameters, and training data, emergent abilities appear — code generation, translation, basic reasoning — that nobody explicitly programmed. They fall out of the math at scale.

07 —Self-attention: every word watches every word

Before the Transformer, AI models processed text word by word in sequence, making it hard to connect things far apart in a sentence. Self-attention solves this by letting every word simultaneously evaluate its relationship to every other word.Click a word to see its attention weights.

🔍 Self-Attention Weights

Live · Interactive

👆 Click any word above to see how it attends to every other word. Brighter = stronger attention — notice "it" lights up "animal."

📐 The Attention Equation

Attention(Q, K, V) = softmax(QKᵀ /√d_k ) ·V

Q — Query

"What am I looking for?"

K — Key

"What does each word offer?"

V — Value

"What info do I retrieve?"

08 —How text is actually generated

When you press Send in any AI chat app, here is exactly what happens:

Tokenization — Your message splits into tokens (subwords). "unbelievable" → ["un","believ","able"].
Embedding — Each token becomes a high-dimensional vector capturing meaning and position.
Forward pass — Vectors flow through all Transformer layers. Attention and feed-forward happen, repeatedly.
Logits → Probabilities — The final layer scores every vocabulary word. Softmax converts to probabilities.
Sampling — One word is chosen based on those probabilities.
Repeat — That word is appended and the whole process runs again until the response is done.

KV Caching: The model caches Key and Value matrices from previous steps so it doesn't recompute attention from scratch every token — making long responses computationally feasible.

09 —Temperature: controlling randomness

When sampling the next word, you can control how random the selection is with a parameter calledtemperature. Drag the slider to see how it reshapes the probability distribution in real time.

🌡️ Temperature Sampling

Live · Interactive

🎚️ Drag the slider left for predictable outputs, right for creative (or chaotic) ones.

1.0

Balanced

Formula: p'ᵢ = pᵢ^1/T / Σ(pⱼ^1/T)

Low temperature (e.g. 0.2) makes the model deterministic — it almost always picks the top word. High temperature (e.g. 2.0) flattens the distribution, giving unusual words a real chance. Most production systems run between 0.7 and 1.0.

10 —Why it sometimes lies (hallucinations)

One of the most misunderstood LLM behaviors ishallucination — when the model confidently states something false. This isn't a bug to be patched away. It's a direct consequence of the architecture.

The model has no internal truth checker. No access to the real world. It only knows:what sequence of words tends to follow this sequence of words? When asked something rare or obscure, the model fills the gap with statistically plausible text — which may be completely wrong.

Analogy: Imagine someone who has read every book in a library but never left the building. Ask what the weather is like outside — they'll give a confident, well-reasoned answer based on weather descriptions they've read. It might be completely wrong.

Ground Truth Access

NONE

Statistical Plausibility

HIGH

11 —Key limitations to know

Understanding these isn't pessimism — it's how you use these tools well.

📏

Context Window

Fixed memory. Older models: ~4K tokens. Newer: up to 1M+. Anything beyond the window is completely invisible to the model.

🌀

No Persistent Memory

Every conversation starts completely fresh. The model has no memory of past sessions unless you explicitly provide them.

🎲

Stochasticity

Same prompt, potentially different outputs. The sampling process is inherently random, even at low temperatures.

🔓

Jailbreaks

Safety training is pattern-based. Clever prompting can sometimes bypass it because the model is still a pattern matcher at heart.

💭

Hallucinations

Inevitable on low-frequency knowledge. No fact-checker means confident errors are always possible. Verify important claims.

⚡

Quadratic Cost

Attention cost grows quadratically with context length. Techniques like FlashAttention mitigate this, but it's a fundamental constraint.

12 —What's next

The probability-engine core remains — but researchers are building powerful layers on top.RAG (Retrieval-Augmented Generation) gives the model access to real documents at query time, dramatically reducing hallucination on factual tasks.Agentic systems let LLMs use tools, execute code, and iterate on their outputs.Reasoning models generate long internal chains of thought before answering, improving performance on math and logic. Andmultimodal models extend the same probabilistic core to images and audio.

None of these change the fundamental nature of what an LLM is. They all sit on top of the same next-token prediction engine. Understanding that foundation is what makes you a sharper thinker about where this technology is — and isn't — going.

The bottom line: LLMs are extraordinary pattern-recognition engines that have scaled statistical prediction to the point of producing genuinely useful, sometimes astonishing outputs. They are not intelligent in any human sense. Knowing this — really knowing it — is what separates clear thinking about AI from hype.

Video generated with Grok Imagine. Animations built with vanilla JavaScript.

A curated vendor-domain allowlist. Results are restricted to Microsoft Learn, AWS, Google Cloud, Cisco, Palo Alto, NIST and others, with suffix-safe matching that blocks spoofs likecisco.com.evil.com.
Per-agent ranking.AGENT_DOMAINS ranks each specialist's preferred docs first, so the AWS agent leans on AWS documentation before anything else.
Source-agnostic results. Web hits are reshaped to look exactly like KB results (source / url / text / score), so the confidence math works identically on either.
Rich metadata. A schema ofdomain,vendor,document_type andversion_date travels with every chunk into the prompt.

Trust, Computed

Confidence is math, not vibes

In v2.6 the model ended each answer withConfidence: 0.92 and the orchestrator believed it. The trouble is that a model confidently answering an AWS question with Azure facts will happily rate itself 0.92 too. Self-report is theatre. So confidence became a number thesystem computes, before and after generation:

pre_gen= 0.6 · retrieval+ 0.4 · namespace_overlap
confidence= min(1.0, best_pre_gen+ 0.2 · citation_density)

retrieval_score — the quality of the retrieved documents. With an empty KB it is derived from web hits: 0.85 for preferred vendor domains, 0.70 otherwise.
namespace_overlap — does the query hit the agent's keywords? A strong match scores about 0.85; off-topic collapses to 0.1, all but guaranteeing escalation.
citation_density — the share of claims backed by[1] [2] sources, rewarding grounded answers with up to a 0.2 boost.

Below the 0.7 threshold, the query climbs to the parent agent, which re-retrieves against its own namespace and re-answers. One subtle fix mattered here: the strongest_best_pre_gen score is carried up the escalation chain, so a confident child's score is never erased by a weaker parent re-running the same step. The model no longer judges itself — the architecture does.

The Undergrowth

The bugs that shaped the design

Most of the architecture above exists because something broke first. The marquee disaster was the0% confidence saga — a cluster of unrelated failures that all produced the same symptom: every query inexplicably crashing to zero confidence and escalating to the top of the tree.

Silent search failure.duckduckgo_search was renamed toddgs upstream and returned an empty list. A catch-allexcept swallowed the error, so retrieval quietly went to zero.
Empty-KB zero score. With no documents ingested, the retrieval score defaulted to 0.0 — now derived from web-result quality instead.
Overwritten scores. Each escalation re-ranstep_build and erased the child's strong score, until_best_pre_gen was preserved across the chain.
Dead Tier-2 regex. Double-escaped\\b made domain routing rules unreachable, and malformed YAML broke parsing entirely.
Context overflow. The expanded universal design template pushed prompts past the local model's context limit — a 400 error until web-context sizes were tuned back down.
A frozen UI on long jobs. Design-doc generation runs for minutes and the UI silently froze, until streaming feedback and a 10-minute timeout were added.

A single catch-allexcept turned a renamed package into an invisible, week-long confidence collapse. Fail loud, not silent.

The Working Contract

How the build stays honest

With Git tracking disabled, discipline had to live somewhere. It lives in two places. The first is a hand-maintainedCHANGELOG.md that records every change, reversal and reason. The second is a behavioural contract the assistant itself follows when modifying the code — think before acting, make the smallest possible diff, verify against success criteria, keep everything auditable and reversible, and prefer less code over more. It reads less like an engineering process and more like the Law of the Jungle: a few rules everyone keeps, because the alternative is chaos.

See it running — the screenshots

Nine captioned frames from the live system: the Gemma model resident in LM Studio, the Gradio chat generating a Microsoft 365 Copilot design document at 90% confidence with live web search, the namespace-per-domain knowledge base, and the design template behind it all.

Open the screenshot gallery →

The Map So Far

Six releases, one expedition

2.6.0

Base system

Routing, RAG, agents and escalation — the first end-to-end build.

2.7.0

Grounded confidence

Dropped LLM self-report for a math formula; added web-search fallback and Mermaid diagrams in the UI.

2.7.1

Agent reshuffle

Retired the Intune, AVD, Citrix and Virtualization agents; added Microsoft DWP, End-User Virt and the network specialists.

2.7.4

Routing repairs

Fixed dead regex rules, corrupted YAMLs, and a hardcoded-path split that loaded stale manifests.

2.8.0

Web-first retrieval

Scrape vendor docs before the KB; vendor allowlist; KB fallback merge.

2.8.2

Confidence fixes

Solved 0% confidence on design docs; preserved the best score across the escalation chain.

2.9.0

Clarifying questions

Before generating a design, agents now ask two to four targeted questions — org size, compliance, stack, timeline — and fold the answers into a far more specific result.

Where We Are · What's Next

Wired, working, and honest

v2.8.2 is a working, daily-use system. The full pipeline runs route → retrieve → build → generate → score → escalate → audit, all thirteen manifests parse, routing and parent maps are consistent, and live vendor-doc search returns current, citable content. What's still open is honest too: the KB folders are wired but largely empty and need real source documents ingested; a rule-ordering overlap can still misroute shared virtualization keywords between End-User Virt and DWP; Git is off; and the context budget stays tight on the local model for large templates.

The road ahead, in order: populate the knowledge base so RAG augments rather than just the web; resolve the routing overlaps and settle End-User-Virt-versus-DWP ownership; re-enable Git and move off the manual changelog; then build a query test-suite to calibrate the confidence threshold against measured answer quality.

Lessons from the Trail

What the journey taught

Fail loud, not silent. A catch-allexcept turned a renamed package into invisible 0% confidence. Surface errors — never swallow them.
Measure what you trust. Self-reported confidence is theatre. Grounding trust in retrieval and citations made escalation actually mean something.
Narrow beats broad. Specialists with tight domains hallucinate far less than one generalist trying to know everything.
Write the changelog. With Git off, the manualCHANGELOG.md became the single source of truth for every decision and fix.

Aether began as a single sentence — "one model can't know everything." It grew into a hierarchy of grounded, self-aware experts, and the changelog is the proof of the journey. Specialise · Ground · Measure · Escalate.

]]>

RAG, Graph RAG, Agentic RAG — and How to Make Any of Them 32× Memory Efficient

Ajay Walia — Thu, 28 May 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;600&display=swap'); .rag-art { --bg: #0a1220; --bg2: #0f1a2e; --bg3: #142442; --line: #1f3358; --text: #e2e8f0; --muted: #8aa0c0; --accent: #22d3ee; --accent2: #00e5a8; --warn: #f59e0b; --danger: #ef4444; --purple: #a78bfa; font-family: 'Inter', sans-serif; color: var(--text); background: var(--bg); padding: 2rem 2.25rem; border-radius: 16px; box-shadow: 0 4px 30px rgba(0,0,0,0.55); line-height: 1.75; } .rag-art * { box-sizing: border-box; } .rag-art .section { padding: 40px 0; border-bottom: 1px solid var(--line); } .rag-art .section:last-child { border-bottom: none; } .rag-art .label { font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .22em; text-transform: uppercase; color: var(--accent); font-weight: 600; margin-bottom: 8px; } .rag-art h2 { font-family: 'Inter', sans-serif; font-size: clamp(1.5rem, 1.2rem + 1vw, 2.2rem); font-weight: 700; letter-spacing: -.02em; color: #fff; margin: 0 0 18px; } .rag-art h3 { font-family: 'Inter', sans-serif; font-size: 1.15rem; font-weight: 600; color: #fff; margin: 28px 0 10px; } .rag-art p { margin: 0 0 16px; font-size: 1rem; color: var(--text); } .rag-art ul { list-style: none; padding: 0; margin: 0 0 18px; } .rag-art ul li { padding: 4px 0 4px 22px; position: relative; font-size: .95rem; } .rag-art ul li::before { content: "▸"; position: absolute; left: 0; color: var(--accent); } .rag-art strong { color: #fff; } .rag-art code { background: var(--bg3); padding: 1px 6px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-size: .88rem; color: var(--accent2); } .rag-visual { margin: 28px 0; border-radius: 8px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--line); background: var(--bg2); } .rag-visual svg { display: block; width: 100%; height: auto; min-width: 700px; } .rag-cap { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin: -10px 0 24px; } .rag-figure { margin: 28px 0; } .rag-figure img, .rag-figure video { display: block; width: 100%; height: auto; border-radius: 8px; border: 1px solid var(--line); } .rag-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; } @keyframes ragPulseSoft { 0%,100% { opacity: .35; } 50% { opacity: 1; } } @keyframes ragPulseStrong{ 0%,100% { opacity: .55; r: 9; } 50% { opacity: 1; r: 11; } } @keyframes ragMissed { 0%,100% { opacity: .25; } 50% { opacity: .9; } } @keyframes ragFlow { from { stroke-dashoffset: 24; } to { stroke-dashoffset: 0; } } @keyframes ragDraw { from { stroke-dashoffset: 800; } to { stroke-dashoffset: 0; } } .rag-flow-line { stroke-dasharray: 6 4; animation: ragFlow 1.2s linear infinite; } .rag-traversal { stroke-dasharray: 6 4; animation: ragFlow 1.5s linear infinite; } .rag-glow { animation: ragPulseStrong 2.4s ease-in-out infinite; } .rag-missed { animation: ragMissed 3s ease-in-out infinite; } /* Sequential agent tool highlight (5 tools) */ @keyframes ragAgentT { 0%,18% { opacity: 1; } 25%,100% { opacity: .35; } } .rag-tool-1 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 0s; } .rag-tool-2 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 1s; } .rag-tool-3 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 2s; } .rag-tool-4 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 3s; } .rag-tool-5 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 4s; } /* Sequential float→binary cell transform (8 cells) */ @keyframes ragBitTransform { 0%,8% { opacity: 0; } 14%,100% { opacity: 1; } } .rag-bit-1 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 0s; } .rag-bit-2 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .35s; } .rag-bit-3 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .7s; } .rag-bit-4 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.05s; } .rag-bit-5 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.4s; } .rag-bit-6 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.75s; } .rag-bit-7 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.1s; } .rag-bit-8 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.45s; } @media (prefers-reduced-motion: reduce) { .rag-flow-line, .rag-traversal, .rag-glow, .rag-missed, .rag-tool-1,.rag-tool-2,.rag-tool-3,.rag-tool-4,.rag-tool-5, .rag-bit-1,.rag-bit-2,.rag-bit-3,.rag-bit-4,.rag-bit-5,.rag-bit-6,.rag-bit-7,.rag-bit-8 { animation: none !important; opacity: 1 !important; } } .rag-cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 14px; margin: 24px 0; } .rag-card { background: var(--bg2); border: 1px solid var(--line); border-radius: 10px; padding: 18px 18px; } .rag-card h4 { font-size: 1rem; font-weight: 600; color: #fff; margin: 0 0 8px; } .rag-card p { font-size: .88rem; color: var(--muted); margin: 0; } .rag-pull { border-left: 3px solid var(--accent); padding: 6px 0 6px 18px; margin: 28px 0; font-size: 1.08rem; color: var(--accent); font-style: italic; } .rag-pull.warn { border-color: var(--warn); color: var(--warn); } .rag-table { width: 100%; border-collapse: collapse; margin: 18px 0 26px; font-size: .9rem; } .rag-table th, .rag-table td { padding: 11px 14px; text-align: left; border-bottom: 1px solid var(--line); vertical-align: top; } .rag-table th { color: var(--accent); font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .14em; text-transform: uppercase; border-bottom: 1px solid var(--accent); font-weight: 600; } .rag-table td.k { color: #fff; font-weight: 600; width: 25%; } .rag-table td.v { color: var(--text); } .rag-table td.m { color: var(--muted); font-size: .85rem; } .rag-stat { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--line); border: 1px solid var(--line); margin: 28px 0; border-radius: 6px; overflow: hidden; } .rag-stat > div { background: var(--bg2); padding: 18px 14px; text-align: center; } .rag-stat .num { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.7rem; color: var(--accent); display: block; } .rag-stat .lbl { font-family: 'JetBrains Mono', monospace; font-size: .66rem; letter-spacing: .1em; color: var(--muted); text-transform: uppercase; margin-top: 4px; display: block; } @media (max-width: 600px) { .rag-art { padding: 1.25rem; } .rag-stat { grid-template-columns: repeat(2,1fr); } .rag-visual svg { min-width: 600px; } }

Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.

This piece does two things.First — break down RAG, Graph RAG, and Agentic RAG visually: how each works, where each one breaks, and which query type it's the right fit for.Second — show how a single technique calledbinary quantization can shrink the vector index inside any of these architectures by a factor of 32 without breaking retrieval quality. This is the trick Perplexity, Azure, and HubSpot use in production.

3Architectures
compared

32×Memory
reduction

1-hop / N-hopWhen each
variant wins

0 codeAll concept,
all visual

Part 1A · Standard RAG

The Default Pipeline — and What It's Actually Good At

Standard RAG is what most engineers mean when they say "RAG". Documents are split into chunks, each chunk is embedded into a high-dimensional vector, and those vectors are stored in a vector database. At query time, the user's question is embedded too, and the database returns the top-k chunks by similarity (usually cosine distance). Those chunks are pasted into the LLM's prompt as context, and the model answers from them.

Where Standard RAG Wins

Direct factual lookups. Single chunk contains the answer. "What is our refund policy?" → retrieves the refund policy chunk → done.
Cost and latency. One embedding call, one similarity search, one LLM call. Easy to debug.
Mature tooling. Pinecone, Weaviate, Qdrant, Milvus, pgvector — all production-ready for this pattern.

Where Standard RAG Breaks

It retrieveschunks, never the relationshipsbetween chunks. The moment the answer requires combining facts that live in different documents — or even different sections of the same document — similarity search starts missing things.

Similarity search will happily return two facts that sit close to the query in embedding space, while the missing third fact that connects them sits far away and never makes it into the context window.

Concretely, imagine a vector database storing three facts about your internal services:

The bridge fact sits too far from the query in embedding space. Similarity search has no way to find it from where it started.

Part 1B · Graph RAG

Adding a Knowledge Graph on Top

Graph RAG addresses the multi-hop problem by adding a structural layer over the documents. During indexing, an LLM extractsentities (services, people, places, concepts) and therelationships between them, building a knowledge graph alongside the vector index. At query time, the system traverses that graph instead of relying purely on embedding similarity.

How a Graph RAG Query Actually Runs

The user asks "Will checkout be affected by Friday's maintenance?". The system identifies the entities mentioned in the query (checkout,Friday maintenance), looks them up as nodes in the graph, and walks the edges between them. The traversal returns the chain of relationships, and that chain gets handed to the LLM as structured context — not as random chunks of prose.

Multi-hop reasoning

Followinguses → runs_on → affects recovers the bridge fact that pure similarity search missed.

Explainable context

Every answer comes with a traversable path. Easier to audit than "the top-5 most similar chunks said so."

Heavier to build

Entity extraction at index time is expensive. Schema design matters. Not free lunch.

Less flexible than agents

The graph schema is fixed at indexing time. Queries that need fresh tools or external sources still need help.

Part 1C · Agentic RAG

Letting the LLM Choose How to Retrieve

Agentic RAG replaces the fixed retrieval pipeline with an LLM agent that decides — at query time — which tools to invoke, which sources to query, and in what order. The agent might call a vector search, then a SQL database, then a web fetch, then a graph traversal, all in service of one question.

What "Dynamic" Actually Means

A user asks:"Has any customer raised a ticket about the checkout outage we had last Friday, and what was our response time on it?" An agentic system might:

Call theknowledge graph to confirm there was an outage on the checkout service last Friday.
CallSQL on the ticketing database to list tickets opened that day mentioning "checkout".
Call thevector DB over chat history to find related customer complaints in Slack.
Call thecode interpreter to compute average first-response time on the matching tickets.
Compose the answer.

None of that ordering was decided in advance. The agent chose it. That flexibility is the whole point — and the whole risk.

Flexible

Handles open-ended tasks that touch multiple data sources and require fresh information.

Higher latency

Several tool calls per question. A simple lookup that took 200ms in standard RAG now takes 4–8 seconds.

Harder to debug

The agent's reasoning path is non-deterministic. Reproducing a failure mode can be slippery.

Can spiral

Without tight tool authority and budgets, agents loop on themselves. Pair this with a state machine.

Part 1D · Decision

These Aren't Levels — They're Different Tools

The most common mistake is treating these as a maturity ladder you have to climb. They aren't. They solve different query types. A good system often uses all three in different parts of the same product.

Once the right architecture is in place for the query type, the next leverage point is efficiency. Every one of these three depends on a vector index somewhere underneath — and that index is where most of the memory cost lives.

Part 2 · Efficiency

How to Make Any RAG 32× More Memory Efficient

The 32× trick — float magnitudes compressed to a single sign bit per dimension.

Every RAG variant pays the same tax: it stores high-dimensional embeddings of every chunk it's ever indexed. That tax adds up fast. At ten million chunks, a standard 768-dimension float32 index needs about 30 GB just to hold the vectors — and that index has to sit in fast RAM if you want sub-second retrieval. Doubling your corpus doubles the bill.

The trick that Perplexity, Azure AI Search, and HubSpot all use in production is calledbinary quantization. It cuts the memory footprint by 32 times. The architecture above it doesn't change — Standard, Graph, or Agentic, the same trick applies.

The Memory Bill, in Numbers

The Trick: Throw Away Magnitudes, Keep the Sign

Binary quantization is structurally simple. For every dimension of every vector, ask one question:is the value positive or negative? Positive becomes1, negative becomes0. The 32-bit float is replaced by a single bit. Same dimensionality, 1/32nd the storage.

The Distance Metric Changes Too — Cosine Becomes Hamming

Float32 vectors compare viacosine similarity, which is computed from dot products. Binary vectors compare viaHamming distance: count the number of bits that differ between two vectors. On modern CPUs, this is two instructions —XOR thenpopcount — and runs at billions of comparisons per second.

The Trade-off — and the Fix

Of course, throwing away the magnitudes throws away some information. A naive binary index loses roughly 5–10% of retrieval accuracy compared to the full float32 index. Production systems solve this with atwo-stage search: use the cheap binary index to retrieve a wide net of candidates fast, then re-score the small candidate set using the original full-precision vectors.

Stage 1 is where the 32× memory win lives. The binary index is small enough to fit comfortably in CPU cache, so you can scan tens of millions of candidates in single-digit milliseconds. Stage 2 only ever touches a few hundred full-precision vectors, so the expensive cosine math is bounded.

The recall lost in stage 1 is paid back in stage 2. End-to-end retrieval quality is typically within 1% of a full float32 search, at 1/32 the hot memory.

Memory Bill at Scale — Before and After

Corpus size	Float32 only	Binary (stage 1)	Hybrid (stage 1 hot + stage 2 cold)
1 M vectors	3 GB · hot RAM	94 MB · hot RAM	94 MB hot · 3 GB cold
10 M vectors	30 GB · hot RAM	940 MB · hot RAM	940 MB hot · 30 GB cold
100 M vectors	300 GB · multi-node	9.4 GB · single node	9.4 GB hot · 300 GB cold
1 B vectors	3 TB · cluster	94 GB · single beefy node	94 GB hot · 3 TB cold tier

The shape of the curve is what matters: the hot index — the part that controls latency — stays manageable even as the corpus grows by orders of magnitude. The cold tier scales linearly but cheaply, because it only gets touched for the few hundred candidates surfaced by stage 1.

When to Reach for This

Above ~1 M vectors

Below that scale, plain float32 is fine. The complexity of two-stage retrieval isn't worth the few hundred MB you'd save.

Hot real-time queries

If your retrieval p95 needs to stay under 100ms, the binary first stage is what keeps you there as the index grows.

Cost-sensitive deployments

Saving 30 GB of RAM × 3 replicas × 12 months adds up to real money. Especially on managed vector services.

Any of the three architectures

Standard, Graph, Agentic — they all sit on a vector index somewhere. This optimisation applies everywhere they do.

Putting It Together

Architecture and Efficiency Are Orthogonal

Two decisions, independent of each other.What kind of question does my system have to answer? — that's the architecture decision. Single-hop facts go to Standard RAG. Multi-hop relationship questions go to Graph RAG. Open-ended tool-using tasks go to Agentic RAG.How big is my index going to get? — that's the efficiency decision. Above a million chunks, binary quantization plus float rescoring buys you 32× memory headroom for ~1% quality cost.

The same vector index sits underneath all three architectures. The same trick applies to all three. Pick the architecture for the query type. Apply the efficiency trick because the math works.

RAG isn't one thing — it's a layered decision. Get the architecture right for the query, then make the index small enough to keep up.

Ajay Walia · CuriousBit Knowledge Base · May 2026

]]>

Cortex Swarm: Upgrading the Traditional IT Operations with Agentic AI

Ajay Walia — Sun, 24 May 2026 00:00:00 +0000

@import url('https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&family=Noto+Sans:wght@300;400;600;700&display=swap'); .cs-article { --bg: #080810; --bg2: #0d0d1a; --bg3: #111122; --gold: #c9a227; --gold-light: #e8cc6a; --gold-dark: #7a5f0e; --blue: #1a2744; --blue2: #0f1d38; --cream: #e8dcc8; --muted: #9a8a6a; --blossom: #c4909a; --rule: rgba(201,162,39,0.22); font-family: 'Noto Sans', sans-serif; line-height: 1.75; color: var(--cream); background: var(--bg); padding: 2.5rem; border-radius: 16px; box-shadow: 0 4px 30px rgba(0,0,0,0.6); } .cs-article * { box-sizing: border-box; } .cs-section { padding: 56px 0; border-bottom: 1px solid var(--rule); } .cs-section:last-child { border-bottom: none; } .cs-label { font-size: .7rem; letter-spacing: .2em; text-transform: uppercase; color: var(--gold); font-weight: 600; margin-bottom: 10px; } .cs-article h2 { font-family: 'Noto Serif', serif; font-size: clamp(1.5rem, 3.5vw, 2.2rem); color: var(--gold-light); margin-bottom: 20px; margin-top: 0; } .cs-article h3 { font-family: 'Noto Serif', serif; font-size: 1.2rem; color: var(--gold-light); margin-bottom: 12px; margin-top: 0; } .cs-article p { margin-bottom: 18px; font-size: 1rem; color: var(--cream); opacity: 0.95; } .cs-article ul { list-style: none; padding: 0; margin-bottom: 18px; } .cs-article ul li { padding: 4px 0 4px 20px; position: relative; font-size: 1rem; } .cs-article ul li::before { content: "—"; position: absolute; left: 0; color: var(--gold); } .cs-statbar { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--rule); border: 1px solid var(--rule); margin: 36px 0; border-radius: 4px; overflow: hidden; } .cs-stat { background: var(--bg2); padding: 20px 14px; text-align: center; } .cs-stat .num { font-family: 'Noto Serif', serif; font-size: 1.8rem; color: var(--gold); display: block; } .cs-stat .lbl { font-size: .68rem; letter-spacing: .08em; color: var(--muted); text-transform: uppercase; } .cs-visual { margin: 32px 0; border-radius: 6px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--rule); } .cs-visual svg { display: block; width: 100%; height: auto; } .cs-figure { margin: 32px 0; } .cs-figure img, .cs-figure video { display: block; width: 100%; height: auto; border-radius: 6px; border: 1px solid var(--rule); } .cs-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; } .cs-hero-video { position: relative; margin: 0 0 16px 0; border-radius: 8px; overflow: hidden; border: 1px solid var(--rule); box-shadow: 0 4px 20px rgba(0,0,0,.5); } .cs-hero-video video { display: block; width: 100%; height: auto; } .cs-hero-video .overlay { position: absolute; left: 0; right: 0; bottom: 0; padding: 40px 30px 48px; background: linear-gradient(transparent, rgba(8,8,16,.92) 30%); text-align: center; pointer-events: none; } .cs-hero-video .overlay .h-title { font-family: 'Noto Serif', serif; font-size: clamp(2.2rem, 5.5vw, 4.5rem); color: #fff; font-weight: 900; letter-spacing: 2px; } .cs-hero-video .overlay .h-sub { font-size: clamp(1rem, 2vw, 1.4rem); letter-spacing: .25em; color: var(--gold); font-weight: 700; margin-top: 14px; } .cs-hero-video .overlay .h-by { font-size: clamp(0.85rem, 1.5vw, 1.15rem); letter-spacing: .15em; color: var(--muted); margin-top: 18px; font-weight: 500; } .cs-cards { display: grid; grid-template-columns: repeat(auto-fit,minmax(230px,1fr)); gap: 14px; margin: 24px 0; } .cs-card { background: var(--bg2); border: 1px solid var(--rule); border-radius: 6px; padding: 20px 18px; } .cs-agent-row { display: grid; grid-template-columns: 150px 1fr; border: 1px solid var(--rule); border-radius: 6px; overflow: hidden; margin-bottom: 10px; background: var(--bg2); } .cs-agent-label { background: var(--blue2); padding: 16px 14px; border-right: 1px solid var(--rule); display: flex; flex-direction: column; justify-content: center; } .cs-agent-label .tier { font-size: .67rem; letter-spacing: .14em; color: var(--gold); text-transform: uppercase; margin-bottom: 3px; } .cs-agent-label .role { font-family: 'Noto Serif', serif; font-size: .95rem; color: #fff; } .cs-agent-label .replace { font-size: .67rem; color: var(--muted); margin-top: 5px; } .cs-agent-body { padding: 14px 18px; } .cs-agent-body .desc { font-size: .8rem; color: var(--muted); font-style: italic; margin-bottom: 8px; } .cs-tag-row { display: flex; flex-wrap: wrap; gap: 5px; margin-top: 6px; } .cs-tag { font-size: .67rem; padding: 2px 8px; border-radius: 20px; background: rgba(201,162,39,.1); border: 1px solid var(--gold-dark); color: var(--gold-light); } .cs-timeline { position: relative; margin: 24px 0; } .cs-timeline::before { content: ''; position: absolute; left: 19px; top: 0; bottom: 0; width: 1px; background: var(--rule); } .cs-tl-item { display: grid; grid-template-columns: 38px 1fr; gap: 14px; margin-bottom: 20px; } .cs-tl-dot { width: 38px; height: 38px; border-radius: 50%; background: var(--bg2); border: 1px solid var(--gold); display: flex; align-items: center; justify-content: center; font-size: .7rem; color: var(--gold); font-weight: 700; position: relative; z-index: 1; } .cs-tl-content { padding-top: 6px; } .cs-tl-content h3 { font-size: .9rem; margin-bottom: 3px; } .cs-tl-content p { font-size: .82rem; color: var(--muted); margin: 0; } .cs-phase-grid { display: grid; grid-template-columns: repeat(3,1fr); gap: 10px; margin: 24px 0; } .cs-phase-card { background: var(--bg2); border: 1px solid var(--rule); border-radius: 6px; padding: 16px 14px; } .cs-phase-num { font-size: .65rem; letter-spacing: .14em; color: var(--gold); text-transform: uppercase; margin-bottom: 5px; } .cs-phase-title { font-family: 'Noto Serif', serif; font-size: .95rem; color: #fff; margin-bottom: 7px; } .cs-phase-trigger { font-size: .68rem; color: var(--muted); border-top: 1px solid var(--rule); padding-top: 7px; margin-top: 7px; } .cs-table { width: 100%; border-collapse: collapse; margin: 20px 0; font-size: .85rem; } .cs-table th, .cs-table td { padding: 10px 12px; text-align: left; border-bottom: 1px solid var(--rule); vertical-align: top; } .cs-table th { color: var(--gold); font-weight: 600; font-size: .7rem; letter-spacing: .1em; text-transform: uppercase; border-bottom: 1px solid var(--gold-dark); } .cs-table td.k { color: var(--gold-light); font-family: 'Noto Serif', serif; width: 30%; } .cs-table td.v { color: var(--cream); } .cs-table td.w { color: var(--muted); font-size: .8rem; } .cs-pull { border-left: 2px solid var(--gold); padding: 6px 0 6px 16px; margin: 24px 0; font-family: 'Noto Serif', serif; font-size: 1.05rem; color: var(--gold-light); font-style: italic; } .cs-ornament { text-align: center; padding: 14px 0; color: var(--gold-dark); letter-spacing: .4em; font-size: .78rem; } @media (max-width: 600px) { .cs-article { padding: 1.25rem; } .cs-statbar { grid-template-columns: repeat(2,1fr); } .cs-phase-grid { grid-template-columns: 1fr 1fr; } .cs-agent-row { grid-template-columns: 1fr; } .cs-agent-label { border-right: none; border-bottom: 1px solid var(--rule); } .cs-table { font-size: .78rem; } .cs-table td.k { width: 35%; } }

Cortex Swarm

REPLACING THE FOLLOW-THE-SUN MODEL · 2026

AJAY WALIA · DIGITAL WORKPLACE OPERATIONS · MAY 2026

Every employee depends on a Workplace Operations team they will never meet. It is the team that resets their MFA when they fly to a new country, recovers their shared mailbox when it stops syncing, pushes the Intune policy that lets them install a piece of software, and decides at 3am whether a regional O365 Or Exchange outage warrants paging a human.

This piece is about three things, in order:

First — how those teams are actually structured today, how they function day-to-day, and the structural problems they carry.
Second — where agentic AI sits in 2026, and where the field is heading over the next two to three years.
Third — how a small swarm of specialised agents can replace this team tier-for-tier, what efficiencies that produces, and the new set of challenges it creates in return.

The org chart is the answer. The five tiers that make a DWP team work for humans are the same five seams that make it work for agents.

5Autonomous agents
L1 → SDM

~137FTE mirrored
across all tiers

24×7Single team
no shift roster

1-clickAudit verify
any ticket

Part 1 · Structure

How Digital Workplace Operations Teams Are Structured Today

A DWP team exists because every employee uses IT every day, and someone has to keep that working. For a Global 500 with 10,000–100,000+ employees, the work is too broad, too multilingual, and too time-zone-spanning for an in-house team. Almost without exception, it is outsourced to a Tier-1 IT services firm — TCS, Infosys, Wipro, Accenture, HCLTech, Cognizant — running a 24×7 follow-the-sun roster across multiple delivery centres.

The Scope — What Actually Sits Under "Workplace"

The label undersells the breadth. A typical DWP contract covers seven functional areas, each with its own runbooks, its own vendors, and its own escalation paths.

Identity

Who you are — joiner / mover / leaver, password, MFA, SSO, entitlements

Access

What you can use — catalogs, licenses, groups, approvals

Productivity

Outlook, Teams, M365, OneDrive, SharePoint

Endpoint

Laptops, peripherals, network, Intune compliance, patching

Applications

Line-of-business app support, vendor liaison

Infrastructure

Connectivity, VPN, cloud health, capacity

Change & Governance

RFCs, CAB, RCAs, SLA reporting, customer comms

The Five-Tier Hierarchy

To deliver against this scope at scale, providers build a five-tier hierarchy. Each tier exists because of what the tier below it can't or shouldn't do. Tickets enter at the bottom and move upward only when scope, authority, or evidence demands it.

 

SDM
Service Delivery Manager · ~2 FTE


Architect
Design Authority · ~3 FTE


L3 Engineering
Senior Engineers · ~12 FTE


L2 Specialists
M365 · Intune · Exchange · ServiceNow · ~40 FTE · 2 shifts


L1 Service Desk
~80 FTE · 3 follow-the-sun shifts · handles ~80% of all ticket volume






↔ owns the customer relationship
↔ owns design + RCA authority
↔ root cause + change requests
↔ specialist diagnosis
FRONT LINE
Tickets enter at L1 and escalate upward only when scope, authority, or evidence demands it.

The exact FTE counts vary with employee population and contract scope. The shape — heavy at the base, narrowing to a point — is universal.

Part 2 · Function

How They Actually Function Day-to-Day

Three forces govern day-to-day operation: time zones, ticket flow, and knowledge. Understanding all three is what makes the rest of the piece make sense.

Time Zones — the Follow-the-Sun Roster

Coverage is achieved by handing tickets between geographies as the sun moves. A ticket opened in Sydney at 4pm local rolls over to Manila, then to Mumbai or Hyderabad, then to Krakow or Sofia, then to a US east-coast hub. Three or four formal shift handoffs per day, every day, forever.

 


00:00 UTC
06:00
12:00
18:00


APAC
Manila · Sydney
08:00 – 18:00 local


EMEA / India
Mumbai · Krakow · Sofia
08:00 – 18:00 local


AMER
Atlanta · Dallas
08:00 – 18:00 local





Three handoffs / day
Open tickets transfer at each
geographic boundary
Context lost at each handoff
Notes summarised, not
replayed in full
Weekend & holiday gaps
Coverage thins where
no region is in business hours

Ticket Flow — Entry, Triage, Escalation, Closure

Every employee interaction is a ticket. Most enter via chat or self-service portal, a smaller share through phone or email. From entry, the path is the same: triage at L1, attempt resolution, escalate if the agent at the current tier cannot solve it within authority and budget, then close.

 


CHAT / PORTAL
~70% of entries

PHONE
~20%

EMAIL / OTHER
~10%





L1 Triage
Categorise · prioritise





Resolve in tier
SOP · KB · tools





Escalate
L1 → L2 → L3

~20%




Close
Notify · KB tag

~80%


Higher tier owns it now,
same flow recurses


SLA CLOCK runs continuously · pauses only on AWAITING_USER · resumes on every state change
Time to acknowledge · time to resolve · time to communicate — measured per priority (P1–P4)

Knowledge — Runbooks, KBs, and Tribal Memory

Each tier owns a knowledge base scoped to its authority. L1 has SOPs for ~40 standard scenarios. L2 holds vendor documentation, Intune policy templates, and Exchange runbooks. L3 holds architecture decision records, past postmortems, and vendor escalation contacts. The Architect carries the long-term design library; the SDM holds SLA templates, comms playbooks, and historical breach reports.

A great deal also lives intribal memory — the senior engineer who happens to remember that a similar incident last August was caused by a CA policy. That memory walks out the door every time someone resigns.

The operations floor — a continent away, three shifts deep, somewhere working through your ticket right now.

Part 3 · Challenges

What's Structurally Wrong With This Model

Nothing in the model is broken; it just isn't designed for the kind of demand it now carries. The pain points below are not the fault of any one team — they are consequences ofhow the model is built. Each tier carries some version of every one of them.

Shift Gravity

Three follow-the-sun shifts every 24 hours
Context is summarised, not replayed, at every boundary
Onshore-offshore split hides inefficiency in plain sight

Quality Variance

Varies by shift, by tenure, by individual
SLA breaches cluster on weekends and holidays
The customer never sees an even service level

Attrition Tax

20–35% annual attrition at L1, lower but real higher up
4–8 weeks of training before a new hire is productive
Tribal knowledge leaves with every resignation

Inelastic Capacity

A 2× ticket spike cannot be staffed in < 24 hours
Patch-Tuesday outages routinely take SLA hits
Surge headcount is a fiction; surge overtime is what actually happens

Audit Friction

Reconstructing what happened on a ticket takes weeks
Chat logs, ticket history, and admin-tool actions live in different systems
G500 internal-audit reviews drag on for months

Language & KB Silos

Multilingual coverage means hiring native speakers locally
Knowledge bases drift between tiers, regions, and locales
New runbooks are rarely peer-reviewed for quality

The Headcount Paradox

Stack the team by volume and headcount and the same shape appears every time: an inverted pyramid where the tier carrying the most repetition is also the tier carrying the most people, the most attrition, and the lowest unit-economics. The next two parts argue this is exactly the part the next wave of agentic AI can credibly absorb.

 
SDM · ~2 FTE · 10% touches

ARCHITECT · ~3 FTE · 5% touches

L3 · ~12 FTE · 4% tickets

L2 · ~40 FTE · 15% tickets

L1 · ~80 FTE · 3 SHIFTS · 80% tickets
most volume · most repetition · highest attrition

Headcount and decisions are concentrated exactly where they are easiest to automate.

Part 4 · State of the Art

Where Agentic AI Sits in 2026

Six years, six steps — each layer of capability built on the one beneath it.

Two years ago, "AI agent" meant a chatbot with a system prompt. In 2026 it means something specific: a model that can decompose a goal, call tools to gather evidence, maintain state across turns, and stop when the work is done. The shift is real, and it is what makes the rest of this piece possible.

A Six-Year Capability Ramp

Each year since 2020 has unlocked a layer of capability that wasn't there the year before. The cumulative effect is what now allows specialised agents to do specialist work, not just general chat.

 

2020
2021
2022
2023
2024
2025
2026












Multi-agent
orchestration
Local inference
30B–70B on laptop
Reliable tool use
structured output
Reasoning & RAG
retrieval-augmented
Instruction tuning
long context
Pre-training
2026: every layer needed for autonomous tier-aligned agents is in place.

What "Agentic" Actually Means

Four ingredients distinguish an agent from a chatbot. Every component is now boring engineering — no novel research required.

 

Agent
persona + system prompt



Reasoning
Decompose · plan · reflect
Why a step, not just what



Tool Use
Structured calls · typed inputs
Real-world side effects



Memory
Conversation · KB retrieval
State across turns



Authority
Bounded scope · approvals
When to act vs escalate






An agent is a system prompt, a model, a memory, a tool registry, and rules about when to call which.

Three Independent Shifts That Made This Credible

Each on its own is interesting. Together, they remove the standard objections G500 buyers raise to bringing AI inside the perimeter.

 Local
on-device inference
No data leaves building
Compliance objection gone
Inference Economics
100%
safety-pass target
Semantic grading
Adversarial eval cases
Eval Discipline

A Capability Map by Tier — Ready Now vs Emerging

This is not a roadmap; it is an honest read of what's possible today. "Ready" means the prompt, tool set, KB and eval pattern are known. "Emerging" means the approach is understood but still being measured.

L1 · Ready Now

Front-Line Desk

Identity verification · password / MFA / unlock
Catalog software install + approval
Outlook / Teams diagnostics
Printer · peripheral pairing
KB retrieval + grounded response

L2 · Ready

App Specialist

App log structured analysis
Service health diagnostic
Intune compliance + push
Mailbox + M365 admin actions
Hypothesis-test workflows

L3 · Emerging

Senior Engineer

Infrastructure root-cause
AD attribute engineering
Kusto / log-analytics
Change request authoring
Emergency change application

Architect · Emerging

Design Authority

Change review against ADR library
P1 RCA authoring
Pattern-vs-one-off classification
Capacity-review triggers
Design-impact assessment

SDM · Emerging

Delivery Manager

Customer comms drafting
SLA dashboard + breach alerts
War-room convene flow
Weekly briefing generation
Status update cadence

Part 5 · Trajectory

Where Agentic AI Is Heading — Next 2–3 Years

The trajectory of the last six years points in a clear direction: from a single model answering a single question, toswarms of specialised agents collaborating on bounded problems under an orchestrator they cannot themselves modify.

 


LLM
2020–2023
Single model
Chat completion





Agent
2023–2024
+ Tools + Memory
Single task loop





Multi-
agent
2024–2026
Planner + executors
Verifier loops





Specialised
Swarm
2026 →
Role-aligned agents
Bounded authority





Enterprise
2027 →
Cross-function
Swarm-of-swarms

Cortex Swarm sits at the fourth stage. The fifth is what this pattern unlocks once it's proven inside one function.

Three Bets About the Next 24 Months

Bet 01

Specialisation beats generalisation

One large general agent doing everything is brittle. Five small role-aligned agents — each with its own persona, tools, and KB — are more reliable, more debuggable, and more auditable.

Bet 02

The orchestrator is the operating system

Frameworks like LangChain / LangGraph / Autogen are scaffolding. Production systems will hold their durable value in a bespoke orchestrator that owns state, audit, identity, and policy — not in any library it depends on.

Bet 03

Compliance becomes the product

The agent that wins inside a regulated enterprise is not the one with the highest benchmark — it's the one whose every action a G500 internal-audit team can replay in one click.

Part 6 · The Cortex Swarm

How a Five-Agent Swarm Replaces the Five-Tier Team

The mirror principle — the agent inherits the role the human already plays.

The idea is structurally simple.Don't reinvent the team. Mirror it. One agent per tier. Distinct persona, tools, knowledge base, and authority. The org chartis the system architecture.

If a human L2 specialist refuses to apply a config change without log evidence, the L2 agent does the same. If the Architect won't approve a change without checking the ADR library, neither does the Architect agent.

Human Team ↔ Agent Swarm — Tier for Tier

 



SDM · 2 FTE
Customer voice · SLA · war room


SDM Agent
Same persona · same tools · always on


Architect · 3 FTE
Design authority · ADRs · RCAs


Architect Agent
Pattern-vs-one-off · ADR-gated


L3 · 12 FTE
Root cause · changes · vendor escalation


L3 Agent
Same diagnostics · proposes change request


L2 · 40 FTE
App / platform · log-driven diagnosis


L2 Agent
Reads logs · hypothesis-test · bounces back


L1 · 80 FTE · 3 shifts
Front-line desk · password · catalog · printer


L1 Agent · always on
Same persona · 25 tools · escalates on rule

Each pair shares the same persona, scope and authority. The agent inherits the role; the orchestrator inherits the rules.

One Orchestrator. Five Agents. Clean Seams.

The architecture is deliberately conservative. The model proposes; the orchestrator and adapters decide whether the proposal executes. Bespoke ~250-line orchestrator. No LangChain. No LangGraph. No Autogen.

The Five Agents — in Detail

L1 · Phase 1

Service Desk

replaces ~80 FTE · 3 shifts · 80% tickets

"Polite. Fast. Scripted. Resolves common categories. Never speculates on root cause."

Password / MFA / unlock · Catalog install · Printer · Outlook/Teams basics
25 tools without further approval; mutating tools pass the identity gate
Escalates after 2 failed attempts; infra/outage/security always escalate
Target SLA: TTA < 5 min · TTR < 30 min

IdentityAccessProductivityEndpoint

L2 · Phase 2

App Specialist

replaces ~40 FTE · 2 shifts · 15% tickets

"Specialist. Methodical. Demands diagnostics before acting. Reads logs."

App log analysis · Intune compliance + push · Exchange diagnostics
Hypothesis-test workflows — escalates only after evidence is gathered
Rejects wrongly-escalated tickets back to L1 with structured rationale
Target SLA: TTA < 30 min · TTR < 4 hrs

M365IntuneExchangeServiceNow

L3 · Phase 2

Senior Engineer

replaces ~12 FTE · day + oncall · 4% tickets

"Root-cause focused. Drives change requests. Consults Architect for design-impacting fixes."

AD/Identity engineering · Kusto/log-analytics · Emergency change application
Authors RCA stubs — Architect finalises
Consults Architect for design-impacting changes
Target SLA: TTA < 2 hrs · TTR < 24 hrs

AD EngineeringInfrastructureChange MgmtSecurity

Architect · Phase 3

Design Authority

replaces ~3 FTE · 5% touches

"Asks 'is this a one-off or a pattern?' Advises and gates — not a doer."

Approves/rejects change requests against the ADR library
Authors P1 RCAs · pattern-vs-one-off classification · capacity reviews
May veto L3-proposed changes
Target SLA: Design review ≤ 2 biz days · RCA ≤ 5 days

ADR LibraryChange ReviewRCACapacity

SDM · Phase 3

Delivery Manager

replaces ~2 FTE · 10% touches

"Customer-facing. SLA-driven. Calm under pressure. Owns comms cadence."

SLA dashboard · customer comms · war-room convening · weekly briefing
Cannot make technical decisions — can demand them by SLA-bound deadline
Can convene a war room without prior approval
Target SLA: Customer comms < 1 hr (P1) · weekly briefing every Friday

SLA GovernanceCustomer CommsWar RoomWeekly Briefing

Five Collaboration Patterns

Five patterns cover every interaction between the five agents. Escalation is just the first.

Escalation

L1 → L2 → L3 with filtered conversation history per tier scope. The higher tier sees only what's relevant to its authority.

Bounce-back

L2 or L3 → L1 with structured de-escalation rationale. Cycle detection prevents loops.

Design gate

L3 → Architect viapropose_change_request; result returns viarespond_to_l3.

War-room

SDM forces L3 + Architect to sync on a single ticket thread under stricter time budgets.

Internal escalate

SDM → SP leadership (humans) on systemic breach patterns — humans re-enter the loop only when patterns demand it.

A Representative Ticket — End to End

Scenario: "Team shared mailbox stopped syncing." A representative ticket traversing all five agents under a formal 13-state machine.

A P1 Incident — 5 Minutes, No Human Paged

Region-wide Exchange Online failure. The swarm runs the entire war-room cycle while humans are asleep.

Part 7 · Efficiency

What Changes — and Why It Compounds

None of the gains below are individually surprising. The point is that all seven happen simultaneously, on the same architecture, against the same SLAs that already exist on the contract.

Dimension	Today (Human Team)	With Cortex Swarm
Coverage	3 follow-the-sun shifts with formal handoffs	Single team, always on. No handoffs. No context lost between geographies.
Capacity	Inelastic. 2× spike can't be staffed in < 24h	Elastic by definition. 10× spike = 10× concurrent agent instances.
Quality	Varies by shift, tenure, individual	Even. Same prompt, same KB, same eval bar everywhere.
Speed	P1 page → human awake → diagnose: 30–60 min	P1 detect → diagnose → propose → approve → apply: 5 min.
Languages	Each language needs native-speaker hiring	EN/HI/DE on the same agent. Locale bundle is a config file.
Audit	Reconstruct from chat + ticket + admin logs over weeks	One ticket ID → hash-chained replay of every tool call and state change.
Knowledge	Walks out with every resignation. 4–8 weeks to rebuild.	Persistent. KB is versioned. Prompts and tools are reviewed in PR.

The Compounding Effect

Speed compounds with capacity (faster resolution × elastic concurrency = shorter incident windows). Audit compounds with quality (every action is replayable, so every regression has a fix that ships in a single PR rather than a memo). Language compounds with coverage (one swarm serves every region in every supported language at the same SLA).

The org chart was never the bottleneck. The bottleneck was the shift roster underneath it.

Part 8 · New Challenges

The Honest List — What Could Go Wrong

The model is not free. It trades a familiar set of operational problems for a less familiar set of socio-technical ones. Each one below is real; each one has a specific mitigation already wired into Cortex Swarm.

Challenge	What It Looks Like	Mitigation
Trust gap	End users distrust "the bot". CIOs distrust autonomy.	Phase-gated rollout. Human approval on every mutating tool until evals plateau. Audit replay UI for skeptical buyers.
Audit scrutiny	Regulators want to know "what did the model decide and why?"	Hash-chained SHA-256 audit. Every tool call, KB chunk, and state change is replayable in one click via`/audit/verify`.
Prompt injection	Adversarial input tries to make the model exfiltrate or escalate.	Six layered defences: directive precedence, -tagged inputs, identity gate, Pydantic validation, rate limits, output filter.
Model drift	A new model version regresses on something nobody noticed.	52 eval cases as a CI gate. Semantic grading on resolution, tool-correctness, grounding, citation.`make evals` fails the build.
Long-tail edge cases	Rare scenarios the agent has never seen.	Escalation patterns. Out-of-scope intent triggers`escalate_out_of_scope()` to the next tier or to a human.
Change management	Humans whose roles dissolve. SP commercial models built on FTE counts.	The hardest one. Honest position: agents replace tier responsibilities, not the function. Senior staff move to swarm operations, KB curation, and eval authoring.
Cost & scaling	Inference cost grows with ticket volume.	Local inference (LM Studio + Qwen3-Coder-Next) keeps marginal cost low. Bigger models reserved for L3/Architect on rare paths.
Failure modes	"Right tool, hallucinated reasoning." Confident wrong answers.	Formal 13-state machine. Mutating tools callable only in`ACTING`.`RESOLVED` reachable only from`VERIFYING`.

Six Security Defences in Depth

System-prompt directive

SECURITY DIRECTIVE at highest precedence — cannot be overridden by user input.

Input/output normalisation

and treated as DATA, never instructions.

Identity-gate decorator

Orchestrator blocks cross-user mutations even if the model forgets to check.

Tool input validation

Pydantic + allow-lists. Schema mismatches rejected before execution.

Per-session rate limit

Max 30 tool calls / 10 min · 50 chat turns / hour — prevents runaway loops.

Output filter

Leaked-secret patterns stripped before reaching the UI layer.

Authority is bounded by composition: the model proposes, the orchestrator and adapters dispose.

Part 9 · Delivery

Six Phases Over ~12 Months

Phase 1 ships in 5 weeks. Each later phase is a drop-in module gated on an explicit trigger — not an arbitrary date.

Phase 01 · Now · ~50h

Foundation

L1 agent · EN chat
Stub L2/L3 · mock backends
State machine + 52 evals
Hash-chained audit

Trigger: Building now

Phase 02 · ~35h

Specialist Tiers

Full L2 + L3 agents
Real ServiceNow / AD / Intune
Real OIDC SSO
KB pruning via evals

Trigger: First pilot signed

Phase 03 · ~45h

Planning Architecture

Architect + SDM agents
Planner / Executor / Verifier
SDM dashboard + war-room

Trigger: Eval data shows plateau

Phase 04 · ~25h

Multilingual

HI + DE locale bundles
Per-locale KB ingest
KB provenance + trust tiers

Trigger: External KB integrated

Phase 05 · ~20h

Voice

Sarvam (EN/HI) · Azure (DE)
ElevenLabs alternate
Browser mic + playback

Trigger: Demand-driven

Phase 06 · ~12h

Multi-Tenant

Tenant ID propagation
Branded shells per tenant
Per-tenant SLA dashboards

Trigger: Second client signed

Closing

The Org Chart Was Always the Answer

Every senior engineer who has ever worked in IT operations recognises the five-tier shape. It's the shape that emerges every time, in every geography, in every sector, because the responsibilities map cleanly onto the kinds of decisions a team needs to make. That same shape is exactly what makes a swarm legible: each agent does what the role already does, no more, no less, and the rest of the building knows where to send its ticket.

The interesting work over the next year is not adding more agents. It is sharpening the seams between them — better identity gates, better KB provenance, faster audit replays, tighter eval cases — so that what runs in production is a system anyone in IT operations can trust without having to also be an AI specialist.

5Agents
L1 · L2 · L3 · Arch · SDM

~137FTE mirrored
across all tiers

24×7Single team
no shift roster

1-clickAudit verify
any ticket

· · ·

Ajay Walia · Digital Workplace Operations · May 2026
LinkedIn: /in/ajaywalia

]]>

RAG Chatbot from indexed public documentation

Ajay Walia — Tue, 19 May 2026 00:00:00 +0000

This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as theWeek 15 Graded Mini Project of theIITM Pravartak Professional Certificate Programme in Agentic AI and Applications. The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.

The implementation selects theTechnology domain and indexes six official pages of the GitHub REST API documentation. The assistant accepts free-form questions in a command-line interface, performs context-aware follow-up handling, returns concise answers with source attribution, and emits a fixed refusal phrase when the retrieved context does not contain the answer.

Architecture

The pipeline is split into two phases. An offline ingestion stage prepares a reusable vector index; a runtime stage answers user turns by retrieving from that index and prompting a large language model under strict grounding instructions.

Public Data Sources

Six public pages from the official GitHub Docs were saved locally as plain-text files indata/raw/:

Quickstart for GitHub REST API
Getting started with the REST API
Authenticating to the REST API
Rate limits for the REST API
Best practices for using the REST API
Troubleshooting the REST API

These pages were chosen because they form a self-contained operational knowledge set: a developer integrating with the API typically needs to consult exactly this material when handling authentication, rate-limit pressure, and recovery from common errors.

Components and Concepts

The following components and ideas underpin the implementation. Each is summarised briefly to make the design choices easy to evaluate.

Retrieval-Augmented Generation (RAG). A pattern in which a language model is grounded on retrieved passages from an external corpus rather than relying solely on parametric memory. Retrieval reduces hallucination and lets the assistant cite source material it actually consulted.

LangChain. The orchestration framework used throughout the project. It provides composable abstractions for document loaders, text splitters, embeddings, vector stores and chat models, which keeps the ingestion and runtime code small and idiomatic.

RAG Chunking. Long documents are split into smaller passages before embedding. This implementation usesRecursiveCharacterTextSplitter with a 900-character chunk size and a 150-character overlap. Splitting prefers semantic boundaries (paragraph breaks and Markdown headings) so that each chunk is internally coherent and retrieval surfaces meaningful units rather than fragments.

OpenAI Embeddings. Each chunk is converted into a high-dimensional vector using thetext-embedding-3-small model. The vector captures the semantic content of the chunk so that questions phrased differently from the source text still retrieve the right passages.

FAISS. Facebook AI Similarity Search is the local vector store used to index the embeddings. The index is persisted to disk undervectorstore/github_rest_api_faiss so that ingestion runs once and the runtime simply loads the existing index, satisfying the rubric’s reusable-index criterion.

Top-k Retrieval. At each user turn the retriever returns thek=4 chunks most similar to the query. A smallk keeps the prompt focused and reduces the chance that off-topic passages dilute the grounded answer.

Follow-up Rewriter. A short, dedicated LLM call rewrites the user’s latest question into a standalone retrieval query using recent conversation history. This is the mechanism that allows ambiguous follow-ups such as “what about rate-limit headers?” to be embedded and retrieved correctly without the model having to guess the missing antecedent.

Grounded Answerer. The final answer is produced bygpt-4o-mini at temperature zero under a system prompt that instructs the model to answer only from the retrieved context, append aSources: line naming the document titles consulted, and otherwise return the exact refusal phrase.

Refusal Behaviour. When retrieval returns no usable context, the runtime short-circuits and emits the required string verbatim:I don't have enough information in the provided documents. The refusal is also enforced by the system prompt, providing defence in depth against hallucinated answers.

Implementation Layout

The submission is organised as two thin Python modules and a Jupyter notebook.ingest.py loads the text files, removes GitHub Docs templating syntax, extracts a human-readable title from each document, chunks the text, embeds the chunks, and writes the FAISS index.chatbot.py loads the persisted index, manages conversation history, invokes the rewriter and the grounded answerer in sequence, and prints the response with source attribution. The accompanying notebook imports the same functions so that the notebook and the CLI execute identical code paths, eliminating drift between the two surfaces.

Configuration is environment-driven. The OpenAI API key is read from a.env file and an optionalOPENAI_BASE_URL is honoured so that the same code runs against the Vocareum-provided proxy without modification. No secrets are committed to source.

Sample Conversation

A short transcript illustrates the three behaviours the rubric tests — grounded answer, follow-up resolution, and refusal.

You: What headers should I send when making requests to the GitHub REST API?Assistant: GitHub REST API requests commonly use an Accept header such asapplication/vnd.github+json. Authenticated requests send an Authorizationheader with Bearer YOUR-TOKEN. GitHub also recommends a valid User-Agentvalue, such as your GitHub username or application name.Sources: Getting started with the REST API; Authenticating to the REST APIYou: What about rate limit headers?Search query: rate limit response headers GitHub REST APIAssistant: Response headers can include x-ratelimit-remaining andx-ratelimit-reset, which indicate how many requests remain in the currentperiod and when the rate limit resets.Sources: Getting started with the REST API; Rate limits for the REST APIYou: How should an integration handle rate limit errors?Assistant: If a retry-after header is present, wait that many seconds beforeretrying. If x-ratelimit-remaining is 0, wait until the time given byx-ratelimit-reset. Otherwise, wait at least one minute and then useexponentially increasing waits for repeated secondary rate-limit failures.Continuing to make requests while rate-limited may result in a ban.Sources: Best practices for using the REST API; Troubleshooting the REST API

The second exchange demonstrates the follow-up rewriter at work: the original “what about rate limit headers?” is expanded into a self-contained search query before retrieval, which is what allows the retriever to surface the correct passages despite the missing antecedent.

Now, a deliberately off-topic question — one the index cannot possibly support — triggers the safety path:

⛔ Refusal path — out-of-scope question
You: What is the best laptop to buy for running the GitHub REST API?
Assistant: I don't have enough information in the provided documents.

No chunk in the index supports an opinion on hardware, so the assistant returns the mandated refusal verbatim rather than producing a plausible-sounding but ungrounded answer. The refusal is enforced twice — once procedurally when retrieval returns empty, and once in the system prompt — so the behaviour is stable even when retrieval surfaces weakly related chunks.

Closing Observations

Three observations stand out from this exercise.

Retrieval quality is set upstream, not by the vector store. The quality of retrieval is determined less by the choice of vector store than by the upstream decisions about cleaning and chunking. Stripping GitHub Docs templating syntax — the{% data ... %} Liquid blocks, the reusables, and the[AUTOTITLE] link macros — materially improved the relevance of returned chunks because the embeddings stopped clustering around boilerplate.
Follow-up handling must be engineered, not assumed. It is not a free behaviour of the language model. A separate rewriter step that converts conversational queries into standalone search queries is the smallest reliable mechanism, and it removes a category of retrieval failures that would otherwise be invisible in casual testing.
Refusal is a feature, not a fallback. Enforcing the refusal both procedurally (short-circuit on empty context) and in the system prompt is what gives the assistant a useful posture in production: it will say nothing it cannot support, and it will say so in a predictable way.

]]>

I Built a Team of IT Architects using LLM That Live on MacBook — Meet Aether

Ajay Walia — Sat, 16 May 2026 00:00:00 +0000

Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.

I built Aether to fix that. It is a local-first, multi-agent AI system that runs a team of 10 specialist IT architecture advisors on a single MacBook Pro M5. No internet after setup. No API costs. No data leaves the machine. Just fast, cited, domain-grounded answers — available at 10pm when the deadline hits.

This is the story of how I built it, what the stack looks like under the hood, and what I learned along the way.

The Problem I Was Actually Solving

I have spent years in IT architecture — cloud, digital workplace, network, end-user computing, the works. Over that time I have accumulated a large personal knowledge base: AWS Well-Architected reviews, Intune compliance policy templates, AVD host pool sizing guides, Citrix NetScaler configurations, TOGAF artefacts, cloud adoption frameworks. The knowledge exists. The problem is retrieval — getting the right answer from the right domain quickly, without context-switching across six different documentation tabs.

Commercial AI tools are good at general answers. They are not great at answering “give me the exact OMA-URI path for configuring Windows Hello for Business through Intune on a hybrid-joined device for a tenant with MFA enforced at the Conditional Access layer.” That requires domain depth, and it requires knowing which documents to pull from.

I also wanted to explore a practical AI use case — not a demo, not a proof of concept, but something I would actually use daily. Aether became that experiment.

What Aether Is

AETHER v2.6 // SYSTEM DEFINITION

Type:      Local-First Multi-Agent AI System
Agents:    10 (3-tier hierarchy)
Model:     Gemma 4 26B A4B (Q4_K_M) — single instance
Runtime:   LM Studio → LangGraph → FastAPI → Gradio
Memory:    ~43 GB of 64 GB unified (M5 Pro)
Egress:ZERO
API cost:ZERO
Aether is alocal-first AI system that acts as a team of specialist architecture advisors — all running on your MacBook Pro M5. The headline technical trick: every one of those 10 advisors is thesame Gemma 4 26B model, loaded once. What makes each advisor different is purely the system prompt it receives and the knowledge base namespace it retrieves from.
One model. Ten personas. Zero cloud.
The Three-Tier Agent HierarchyThe agents are organised the way a real consulting firm would structure a team — from narrow specialist up to cross-domain strategist.
Tier 1 — Enterprise Architect (1 agent). Cross-domain strategy, TOGAF, Zachman, governance frameworks (GDPR, ISO 27001, HIPAA), technology investment decisions. This agent can readall knowledge base namespaces — it is the only one with that reach. The final escalation destination.
Tier 2 — Domain Architects (3 agents). Cloud Domain (multi-cloud strategy, FinOps, landing zones), Network Domain (SD-WAN, ZTNA, BGP, micro-segmentation), Digital Workplace Domain (Microsoft 365, VDI strategy, device management).
Tier 3 — Technology Architects (6 agents). AWS, Azure, GCP, Intune, AVD, Citrix. Each one is scoped tightly to its domain — deep, narrow, and fast.
The Stack, Component by ComponentComponentWhat It Does
Gemma 4 26B A4B (Q4_K_M)The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. ~13 GB VRAM.
LM StudioLocal model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.
LangGraphOrchestration graph — defines the multi-step query pipeline as a typed state machine.
LanceDBLocal vector database. One namespace (table) per agent. Fully file-based, no server process needed.
BAAI/bge-small-en-v1.5Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.
RedisSession memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).
FastAPI + UvicornREST API gateway on port 8000. Full query/response model, session management, agent listing.
GradioWeb chat UI on port 7860. Session management, source citations, escalation chain display.
PrometheusMetrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.
SQLite (audit.db)Immutable audit trail. Every query logged before response is returned. Cannot be skipped.
The full system sits comfortably in about 43 GB of the M5 Pro’s 64 GB unified memory — leaving 21 GB of headroom for the rest of the machine.
The “One Model, Ten Specialists” TrickThis is the part I get asked about most, because it sounds like it should not work.
Every agent in Aether is defined by a YAML manifest. That manifest specifies asystem_prompt, anamespace (which LanceDB table to retrieve from), atemperature, andmax_tokens. There is no model switching. There is no weight loading. The Gemma 4 26B A4B is loaded once by LM Studio and stays resident.
# agent_manifests/aws_technology_architect.yamlagent_id:aws_technology_architectdisplay_name:"AWS Technology Architect"tier:3parent_agent:cloud_domain_architectnamespace:aws_techtemperature:0.1max_tokens:2048system_prompt:| You are an AWS Technology Architect with deep expertise in the AWS Well-Architected Framework, EC2/EKS/Lambda sizing, IAM policy design, CloudFormation and CDK, GuardDuty, and cost optimisation. You draw only from AWS-specific documentation and architecture patterns. When answering, cite the source documents retrieved. End your response with a new line: 'Confidence: X.XX' (0.00–1.00) reflecting how well your knowledge base supports this answer.
Why does this work? Because Gemma 4 26B A4B is a Mixture of Experts model. It routes each token through specialist sub-networks internally — activating only about 4–6 billion parameters per inference pass, despite having 26 billion total. The practical result: it runs at near 5B speed while retaining the reasoning breadth of a much larger dense model. A single loaded instance can faithfully adopt both a narrow Citrix specialist persona and a broad enterprise strategy persona, because the MoE routing shifts for each.
The other half of the trick is the knowledge base. Each Tier 3 agent retrieves only from its own LanceDB namespace. The AWS agent never sees a Citrix document. The Intune agent never sees a GCP architecture guide. Domain knowledge is isolated by design — which means the model cannot hallucinate across domain boundaries, because the retrieval context does not cross them.
Confidence-Driven EscalationThe escalation mechanism is the design decision I am most proud of.
Every agent is instructed to append a confidence score to its response — a float between 0.00 and 1.00 representing how well its retrieved knowledge supports the answer. The orchestrator reads that score via regex. If it falls below 0.7and the agent has a parent tier definedand that parent has not already been tried, the system automatically escalates.
// ESCALATION CHAIN EXAMPLE

Query: "Give me a multi-cloud strategy covering Azure, AWS, GCP, AVD, and Citrix with network segmentation"

→ Routed to:aws_technology_architect
→ Confidence:0.41 (below 0.7 threshold)
→ Escalate to:cloud_domain_architect
→ Confidence:0.63 (still below threshold)
→ Escalate to:enterprise_architect
→ Confidence:0.88 ✓

Response appended with escalation chain for full transparency.
The response shown to the user includes the full escalation path — which level of expertise produced the final answer. This matters in enterprise contexts. It is the difference between “the AI said so” and “the enterprise-level advisor produced this after the technical specialist’s knowledge was insufficient.”
The elegance here is thatthe model participates in its own routing decision. The orchestration does not need a separate classifier to judge answer quality — the model tells you its own confidence, and the system acts on it.
The 7-Step Query PipelineEvery query passes through a typed LangGraph state graph. The state object —AetherState — carries everything between nodes: query, session ID, agent manifest, RAG results, conversation history, messages, response, confidence score, escalation flags, and escalation chain list.
01ROUTE02RETRIEVE03HISTORY04BUILD05LLM06ESCALATE?07FINALISE
01 — ROUTE. A keyword scanner maps the query to the best-fit agent, checking Tier 3 rules first (most specific), then Tier 2, then Tier 1 as catch-all. Redis caches route results for one hour, so repeated queries on the same topic skip the scan entirely.
02 — RETRIEVE. Semantic search against that agent’s LanceDB namespace — top-5 chunks returned. Documents were ingested at 500-word chunks with 50-word overlap, embedded into 384-dimensional vectors byBAAI/bge-small-en-v1.5 running on the Apple Neural Engine.
03 — HISTORY. The last three message pairs (six messages) are loaded from Redis for the session. This gives conversational continuity without letting the context window balloon.
04 — BUILD. The message payload is assembled:system_prompt + RAG documents + history + current query. The confidence instruction is appended here.
05 — LLM. The assembled payload hits LM Studio on port 1234. The orchestrator extracts the confidence score from the response text via regex before passing the response forward.
06 — ESCALATE (conditional). If confidence is below 0.7, a parent agent exists, and it has not already been tried this turn — swap in the parent’s manifest and loop back to step 02 with fresh retrieval against the parent’s namespace.
07 — FINALISE. The turn is saved to Redis. The SQLite audit record is written. The escalation chain annotation (if any) is appended to the response. Result returned to the user.
Knowledge Isolation — The Anti-Hallucination Architecture
One of the most practical decisions in Aether’s design is namespace isolation. Each Tier 3 agent retrieves only from its own LanceDB table. The AWS agent’s retrieval context will never include a Citrix StoreFront configuration guide — because those documents simply do not exist in its namespace.
This matters more than it might seem. A common failure mode in RAG systems iscross-domain contamination — where retrieval pulls in tangentially related content from a different domain, and the model confabulates a plausible-sounding but wrong answer by blending the two. Namespace isolation eliminates this at the architectural level.
The Enterprise Architect at Tier 1 is the deliberate exception — it can query all namespaces, because cross-domain synthesis is exactly what it is built for.
The Audit Trail — Because EnterpriseEvery single query is written toaudit.db (SQLite) before the response is returned. The record includes: timestamp, session ID, query text, agent used, confidence score, escalation chain, and sources cited. The audit write is wrapped in error handling so that a database failure never blocks the main query flow — but the log is never optional.
# Simplified from the finalise nodeaudit_record={"timestamp":datetime.utcnow().isoformat(),"session_id":state["session_id"],"query":state["query"],"agent_id":state["agent_id"],"confidence":state["confidence"],"escalation_chain":json.dumps(state["escalation_chain"]),"response_length":len(state["final_response"])}db.execute("INSERT INTO audit_log VALUES (:timestamp, :session_id, ...)",audit_record)
For a system giving architecture recommendations — decisions that feed into multi-million dollar cloud commitments — having an immutable audit trail of what was asked, what agent answered, at what confidence level, via what escalation path, is not a nice-to-have. It is the thing that makes it organisationally defensible.
What It Can Actually Do$aether query"Design an Azure landing zone for PCI-DSS compliance"
$aether query"Right-size EC2 instances for a memory-intensive Java application"
$aether query"Configure FSLogix profile containers for 500 AVD users"
$aether query"Zero-trust network architecture for 5,000 remote employees"
$aether query"Cloud adoption roadmap for a financial services firm"
$aether query"Intune compliance policy for iOS BYOD — hybrid-joined, MFA enforced"
The depth per domain is real. The Intune agent knows OMA-URI paths, Graph API commands, Autopilot profiles, and licensing requirements — because those are the documents I ingested into its namespace. The AVD agent knows host pool design, FSLogix sizing, MSIX app attach, and session host scaling plans. The knowledge base is only as good as what you put into it — but that is also the point. This ismy architecture knowledge, curated, searchable, and queryable at any hour.
What I Learned Building ThisPrompt engineering IS the architecture. In a system like this, the YAML manifestis the agent. The difference between a brilliant AWS specialist and a generic AI assistant is entirely in what the system prompt says and what documents back it up. Getting those prompts precise, domain-bounded, and calibrated for the right temperature took longer than any of the code.
MoE models are underrated for local multi-agent work. The choice of Gemma 4 26B A4B over a dense model was the right call. You get reasoning breadth comparable to a much larger model at the inference cost of a small one, on hardware that most architects already carry.
Confidence as a first-class citizen. Asking the model to self-assess and surface that score is one of the highest-value things I added. It makes the system honest — and it drives the escalation logic that makes the team metaphor actually work.
Namespace isolation is a practical hallucination brake. Not a theoretical one. In the first version of Aether, all documents lived in a single namespace. Cross-domain contamination was visible and annoying. Splitting into per-agent namespaces fixed it immediately.
Audit trails are not overhead — they are the point. Every enterprise AI deployment should have one. Building it into the core pipeline from day one changes how you think about what the system is producing.
What Is NextAether v2.6 is a working, daily-use system. The next version I am working toward adds a web-based ingestion UI (so loading new documents does not require touching the ingest script), structured output for architecture decision records (ADRs) in a consistent format, and inter-agent communication — where a Tier 3 agent can proactively pull context from a peer rather than waiting for the escalation chain to activate.
I built Aether because I wanted a team. It turns out a team was always available — it just needed the right prompts and a local model to bring it to life.
Have questions about the stack or want to see specific parts of the implementation? Drop a comment or reach out onLinkedIn.

]]>

Building with LLMs in 2026: The Framework Atlas

Ajay Walia — Fri, 08 May 2026 00:00:00 +0000

Four years after LLMs entered the mainstream, the single most common mistake I see architects make is spending most of their decision energy on the model. Which frontier model? GPT or Claude or Gemini? The model choice matters — but it is one decision out of roughly fifteen, and it is far from the most consequential one.

Building an LLM-powered system in 2026 is an architecture decision made across a stack of competing frameworks, each solving a well-posed problem at a specific layer. I spent several months mapping that landscape as a practitioner — the result is a 73-page whitepaper I call the Framework Atlas. This post distils it into the five things I think every architect, engineer, and senior IT leader should know before picking a single tool.

The Stack Has a Shape

There is no single AI stack, but there is a canonical shape. Every non-trivial production LLM system — whether a support chatbot, a document search engine, or a multi-agent workflow — is a composition of six layers:

Application layer. The surface your user interacts with. LangChain is the default; Semantic Kernel is the Microsoft-native choice; CrewAI leads when the app itself is agentic.

Agent layer. When a single LLM call is not enough — when the system needs to plan, call tools, or coordinate among multiple agents — this layer provides the loop. LangGraph is the most production-grounded option in 2026.

Data / retrieval layer. The memory of your system. LlamaIndex leads on orchestration; Weaviate, Pinecone, and Chroma compete at storage, each tuned for a different operational profile.

Model layer. The foundation models themselves. This layer is increasingly commoditised. The most important design decision here is not which model you start with — it is whether you can swap it without rewriting the layers above.

Serving / inference layer. How you turn a model into an endpoint. vLLM dominates throughput-bound workloads; BentoML packages models into clean APIs for teams that want to think about models, not infrastructure.

Infrastructure layer. Kubernetes, Docker, cloud, on-prem. Every framework choice depends on where you can actually deploy.

Wrapped around all six layers are three concerns that have become non-negotiable since 2024:observability and evaluation,fine-tuning and training, andguardrails and safety. If your system design has no answer for any of these three, it is under-designed. Ignoring them does not eliminate the risk — it just defers it until something breaks in production.

The Abstraction Trap

Every framework in the atlas is catalogued against eleven attributes, but the one architects under-weight most consistently isabstraction level — how much code you write versus how much the framework decides for you.

LangChain’s high abstraction makes the first demo fast and the tenth production fix slow, because you are debugging through someone else’s default decisions. FAISS’s low abstraction costs more lines but yields fewer surprises at 3am.

The operational signal:match abstraction to team seniority. Junior teams over-value high abstraction; senior teams over-value low. A mixed team benefits from a medium default — and from making the choice explicitly rather than defaulting to whatever has the best GitHub star count.

Decision Heuristics That Actually Hold

Rather than optimising at each layer independently, the atlas maps common requirements to preferred framework combinations. These are the ones I have found most durable in practice:

Requirement	Starting stack
Fast LLM prototype	LangChain + Chroma + OpenAI API
Enterprise-grade RAG	LlamaIndex + Weaviate + LangSmith
Multi-agent workflow	LangGraph (+ AutoGen for agent conversations)
High-throughput inference	vLLM + Ray Serve
Local / offline / on-prem AI	Ollama + FAISS + LangChain-local
Domain-specialised model	Axolotl (QLoRA) + vLLM + MLflow

Two things stand out from this table. First, LangChain and LlamaIndex are not competitors — they compose cleanly, with LangChain at the application layer and LlamaIndex at the retrieval layer. Second, local inference is no longer an edge case. Ollama plus a Llama-3-class model is a realistic production option for regulated industries where data sovereignty is a hard constraint.

Agents Moved to Production — With Guardrails

In 2023, autonomous agents were mostly demos. By 2026, they are in targeted production use: triage, routing, research synthesis. What changed is not the models — it is the frameworks.

LangGraph’s state-machine model gives agents deterministic control flow: you declare states, transitions, and retry policies explicitly. AutoGen models multi-agent systems as conversations, which makes it remarkably expressive for critique-revise loops and planner-executor separations. The practitioner heuristic:for production agents, LangGraph. For multi-agent conversations, AutoGen. For lightweight document workflows, CrewAI.

The critical note:never deploy autonomous agents in production unless the failure cost is bounded. The agent should draft; a human should approve. The pattern that ships is almost always a hybrid — autonomy where the stakes are low, escalation where they are not.

Guardrails have crossed from afterthought to critical infrastructure in the same period. Prompt injection is the new SQL injection. Every production system needs an input guard, an output guard, and a policy layer between them. The minimum viable defense in 2026 is: input guard → LLM → output guard. Anything less is operating without a seat belt.

The 2026 Outlook: Three Trends Worth Designing For

Agents are becoming the compile target. LangGraph, AutoGen, and CrewAI are converging on a common abstraction — a loop over an LLM with tool use and state. Expect a future that looks like the deep learning layer in 2018: multiple frontend frameworks, one common runtime. Design your agent layer to be swappable.

Retrieval is eating search. Elasticsearch, Postgres, and OpenSearch all ship vector indexes now; Weaviate and Pinecone ship BM25. The primitives have converged. The differentiator is no longer features — it is operational maturity and the team’s ability to run the infrastructure. Hybrid retrieval (vector + keyword) is the production-safe default.

Guardrails are becoming infrastructure. Today they are a library you bolt on. In two years they will be a runtime — prompt injection detection, PII scrubbing, and policy enforcement applied by default to every model invocation, the way CORS and auth middleware is applied to every HTTP request today. Get ahead of this by treating your guardrails layer as critical infrastructure now, not as a compliance checkbox later.

A Practitioner’s Closing Note

Frameworks age faster than architectures. The stack shape you design today — application, agent, retrieval, model, serving, infrastructure — will still be valid in three years. The individual framework boxes you fill it with probably will not be. The single most important design invariant isswappability at each layer. Make the layer interfaces clean, keep the framework-specific code thin, and you will be able to move when the landscape shifts — and it will.

The full Framework Atlas (v4.0, April 2026) covers all ten framework categories in detail, including comparison tables, maturity radars, cost and latency envelopes, and four reference architectures with working code. It is available below.

Download the Framework Atlas — Building with LLMs v4.0 (PDF)

]]>

LLM & Embeddings — One Predicts Words. One Maps Meaning.

Ajay Walia — Tue, 03 Mar 2026 00:00:00 +0000

.two-mech { margin: 2rem 0 2.5rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; } .two-mech svg { display: block; width: 100%; height: auto; min-width: 720px; } .two-mech-wrap { overflow-x: auto; } @media (prefers-reduced-motion: reduce) { .two-mech .tm-particle { display: none; } }

The model is what writes the email. The embedding is what finds the one you wrote last March.

Most modern AI systems are built from two fundamentally different mechanisms, and most confusion about what AI “is” comes from conflating them. LLMs aregenerative: tokens in, tokens out, with the output shaped by the prompt and the sampling settings, varying every time you ask. Embeddings aregeometric: a deterministic mapping from a word or sentence to a fixed vector, where comparisons are positional and identical input always produces identical output. Both are essential. Both are old enough to be uncontroversial. Most useful systems combine them.

What follows is theWeek 6 Graded Mini Project of theIITM Pravartak Professional Certificate Programme in Agentic AI and Applications, used here as a lens for both mechanisms across five hands-on exercises.

The two paths, side by side

The header image above shows the contrast in one frame. The LLM path is a loop with sampling — non-deterministic by design, behaviour controlled by temperature, top-p, and prompt structure. The embedding path is a one-shot lookup followed by a geometric comparison — deterministic, fast, stable.

That single distinction tells you which mechanism to reach for. If the answer needs to be written, generated, synthesized, or improvised, you want the LLM. If the answer needs to be found, ranked, deduplicated, clustered, or routed, you want embeddings. Most production systems use both because most real problems are some combination of “find the right context” and “say something useful about it.”

A quick decision table to anchor the rest of the article:

Problem	Reach for
Semantic search over a corpus	Embeddings
Conversational reply or text drafting	LLM
Near-duplicate detection or content clustering	Embeddings
Summarization of a long document	LLM
Routing a support ticket to the right team	Embeddings + a small classifier head
Question answering grounded in your docs	Both (RAG)
Image or text classification	Embeddings + a categorical head
Translation, rewriting, code generation	LLM

The exercises below show why each row works the way it does.

Exercise 1: Text generation reveals prompt and sampling sensitivity

Section A1 loadeddistilgpt2 through the Hugging Facepipeline API and generated three continuations of the same prompt:

generator=pipeline("text-generation",model="distilgpt2")generator("AI is transforming industries by",max_new_tokens=40,num_return_sequences=3,do_sample=True)

Three continuations came back from the same model, the same prompt, the same call:

“AI is transforming industries by using science to bring people together with a greater understanding of the importance of science. The new book takes an approach to both science and technology, allowing people to focus more more effectively on the basics and to…”

“AI is transforming industries by replacing the manufacturing sector with a manufacturing sector that can be turned into a manufacturing and IT sector by creating new jobs and creating new jobs. The new jobs and investment in the next decade will help spur growth…”

“AI is transforming industries by creating a new, faster, and more attractive way of generating capital and creating jobs for both the United States and Europe. This is an effective new way of doing this.”

Three different stories. None of which the model “knew” — it just produced plausible-sounding next tokens under stochastic sampling. Notice the repetitions (“manufacturing sector with a manufacturing sector”), the loops (“more more effectively”), the empty filler (“a new, faster, and more attractive way of generating capital”). DistilGPT-2 is a small model — these are the artefacts of a system that’s good at local fluency but doesn’t have a strong forward plan.

The headline insight: LLM outputs are statistical, prompt-sensitive, and unrepeatable unless you fix the seed. The same prompt can give you variety (a feature when brainstorming) or drift (a bug when consistency matters).

Exercise 2: Tokenization is where the abstraction begins

This is the section to slow down on. Take the sentence:

“LLMs are powerful tools for natural language understanding.”

A human reads eight words. The model sees ten tokens.

After BPE (Byte-Pair Encoding) with the DistilGPT-2 tokenizer:

['LL', 'Ms', 'Ġare', 'Ġpowerful', 'Ġtools', 'Ġfor', 'Ġnatural',
'Ġlanguage', 'Ġunderstanding', '.']

The stringLLMs doesn’t appear in the model’s vocabulary as a single unit, so it is split intoLL andMs. TheĠ prefix encodes “preceding space” — that’s how BPE preserves word boundaries without a separator character. The period gets its own token.

The mismatch betweenwhat a human reads andwhat the model processes has real consequences:

Cost is per token, not per word. API billing, latency, and rate limits are all token-denominated. A 1,000-word prompt to a frontier model may bill at 1,300–1,500 tokens depending on language.
Context windows are token windows. A 4,096-token context holds roughly 3,000 English words. Much less for code (whitespace and symbols inflate counts), much less again for languages with poor vocabulary coverage in the tokenizer.
Rare strings behave oddly. Brand names, technical acronyms, foreign words, internal jargon — anything outside the trained vocabulary gets fractured. Model behaviour around those fractures is harder to predict, and prompt sensitivity often hides at this layer.
The same string can tokenize differently with leading whitespace."king" and" king" are different token sequences. That’s why pasted prompts sometimes produce subtly different outputs than typed ones.

Tokenization is the lowest layer of the LLM stack and the one most engineering conversations skip. If you’re tuning prompts and getting unstable behaviour, the first place to look is what your input looks likeafter the tokenizer touches it, not what it looks like in your editor.

Exercise 3: Prompts shape what you get

Section B ran three task-shaped prompts through the same generator, withtemperature=0.8 andtop_p=0.95:

Summarization — explicit instruction with a 30-word cap.
Q&A — structured format withQ: andA: markers.
Creative — open-ended request for a 4-line poem about AI.

The summarization output respected the spirit of the constraint but drifted past 30 words on most runs — DistilGPT-2 is small enough that hard length control isn’t reliable even with explicit instructions. The Q&A output, asked for the capital of Japan, returnedI believe... — the model hedged. A larger model would say Tokyo confidently; a small model produces statistically plausible Q&A-shaped text without strong factual grounding. The creative prompt produced varied and stylistic continuations, but with the lowest grounding: fluency over precision.

Structure compresses the output space the model is sampling from. Vagueness expands it. That single sentence is most of what “prompt engineering” actually is — the rest is technique.

Exercise 4: Word embeddings encode semantic geometry

Pivot to the other mechanism. Section C1 loadedGloVe vectors (glove-wiki-gigaword-50 — 50 dimensions, trained on Wikipedia and Gigaword) via Gensim, then asked for the five nearest neighbours of three words:

Query	Top 5 neighbours (cosine similarity)
`king`	prince (0.82), queen (0.78), ii (0.77), emperor (0.77), son (0.77)
`queen`	princess (0.85), lady (0.81), elizabeth (0.79), king (0.78), prince (0.78)
`diamond`	gold (0.77), diamonds (0.77), gem (0.74), silver (0.72), jewel (0.71)

There is no generation here. Each word is mapped to a fixed 50-dimensional vector, and the “nearest neighbours” are the words whose vectors sit closest in that space by cosine similarity. The geometry was learned by training on co-occurrence — words that appear in similar contexts end up in similar positions. That’s whyking andprince are nearest neighbours, whyqueen pulls inelizabeth (the corpus has plenty of references to Queen Elizabeth), and whydiamond cleanly resolves to a jewellery cluster.

The classicking − man + woman ≈ queen analogy works in this same space; the lab didn’t run it, but the geometry is there. Embeddings don’twrite anything — theyplace things near other things. That single property is what makes them the backbone of semantic search, retrieval, deduplication, and recommendation.

Exercise 5: Sentence similarity from averaged word vectors

Section C2 extended the geometry to sentences. Five short sentences across two topics — AI/ML and jewellery — were averaged into sentence vectors (mean of their word vectors, with simple lowercase tokenization), then compared with cosine similarity.

Plotted in 2D via multidimensional scaling on the cosine distances, the clustering is unambiguous:

The numerical version:

	AI/support	ML/fraud	Jewellery	Neural/medical	Luxury/rings
AI/support	1.00	0.84	0.60	0.80	0.50
ML/fraud	0.84	1.00	0.73	0.83	0.62
Jewellery	0.60	0.73	1.00	0.58	0.88
Neural/medical	0.80	0.83	0.58	1.00	0.56
Luxury/rings	0.50	0.62	0.88	0.56	1.00

Within-cluster pairs sit at 0.84–0.88. Cross-domain pairs sit at 0.50–0.62. The grouping is exactly what you’d want a retrieval system to do.

Three caveats worth naming, because they explain why modern retrieval doesn’t actually use GloVe averages:

Averaging discards word order. “Dog bites man” and “man bites dog” produce identical sentence vectors. For most retrieval that’s tolerable; for anything where syntax carries the meaning, it isn’t.
Transformer encoders fixed this. Models like BERT, RoBERTa, and their descendants producecontextual embeddings — each token’s vector depends on the tokens around it. Pool those across a sentence and you get a representation that respects word order and disambiguates polysemy.
Sentence-BERT and friends made it production-grade. SBERT (and successors like OpenAI’stext-embedding-3, Cohere’s embeddings, Voyage, etc.) trained encoders specifically for sentence-level similarity. That’s the difference between “the demo works on five sentences” and “you can index a million documents and search them in milliseconds.”

GloVe averaging is a baseline. It’s the right baseline to start with, because it lets you see the geometry without the architecture getting in the way. Production systems start from this picture and replace the lookup step.

When both mechanisms meet

The final exercise sits at the intersection.distilbert-base-uncased-finetuned-sst-2-english is a transformer encoder (an embedding model under the hood) with a classification head fine-tuned for sentiment. Run it on three workplace-themed inputs:

Input	Label	Score
“The chatbot reduced ticket resolution time by 40% this quarter.”	POSITIVE	0.9962
“Our deployment failed repeatedly and customers were upset.”	NEGATIVE	0.9997
“The new recommendation engine is acceptable but needs tuning.”	NEGATIVE	0.9898

The third row is the interesting one, and it’s worth unpacking because it points at a problem that turns up in every enterprise deployment of pretrained models.

“Acceptable but needs tuning” is, in workplace context, alukewarm-positive — closer to “approved with caveats” than “this is bad.” The classifier scored it NEGATIVE with 0.9898 confidence. Three things are happening at once:

Domain mismatch. The model was fine-tuned on SST-2, which is movie reviews. “Needs tuning” reads negative there. In an engineering team’s language, “needs tuning” is constructive — the same words have different sentiment loadings in different domains.
No calibration on workplace text. The score is 0.9898 — extreme confidence — for what should be a borderline case. Pretrained classifiers tend to be miscalibrated on out-of-distribution inputs: they’re not just wrong, they’re confidently wrong. Calibration techniques (temperature scaling, Platt scaling, conformal prediction) exist for exactly this.
Weak supervision is the practical fix. When you can’t fine-tune (no labelled data, no budget, no time), the durable answer is to treat the classifier as one signal among several — combine it with rules, keyword filters, or a second model — rather than trusting any single number above the threshold.

Architecturally, the lesson generalises across all three Section D variants. Generation is “embedding + decoder loop.” Classification is “embedding + categorical head.” Retrieval is “embedding + cosine.” Same underlying mathematical object, different output shapes. The architectural choices around the embedding determine what the system does — and where it fails when you take it out of the domain it was trained on.

Closing observations

Three things that generalise beyond this lab.

Tokenization is where most LLM cost and quirks actually originate. It’s the lowest layer of the stack and the one most engineering conversations skip. If you’re tuning prompts and getting unstable behaviour, the first place to look is what your input looks like after the tokenizer touches it.

Embedding-based similarity is older, cheaper, and more deterministic than people remember. Before reaching for an LLM call to compare two pieces of text, embed them and compute cosine. It’s milliseconds, free, and stable. A surprising fraction of “AI features” are really embedding lookups with a confidence threshold.

Generation and similarity sit next to each other. They are not competitors. RAG is the obvious example — embeddings retrieve, the LLM generates the answer grounded in what was retrieved. TheWeek 15 RAG chatbot post is what these two mechanisms look like wired together for production.

One predicts words. One maps meaning. Knowing which one to reach for is most of the job.

]]>

Component	What It Does
Gemma 4 26B A4B (Q4_K_M)	The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. ~13 GB VRAM.
LM Studio	Local model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.
LangGraph	Orchestration graph — defines the multi-step query pipeline as a typed state machine.
LanceDB	Local vector database. One namespace (table) per agent. Fully file-based, no server process needed.
BAAI/bge-small-en-v1.5	Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.
Redis	Session memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).
FastAPI + Uvicorn	REST API gateway on port 8000. Full query/response model, session management, agent listing.
Gradio	Web chat UI on port 7860. Session management, source citations, escalation chain display.
Prometheus	Metrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.
SQLite (audit.db)	Immutable audit trail. Every query logged before response is returned. Cannot be skipped.