<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link><description>Digital workplace, artificial intelligence, cloud, security, automation, and enterprise technology notes by Ajay Walia.</description><language>en-au</language><managingEditor>Ajay Walia</managingEditor><webMaster>Ajay Walia</webMaster><copyright>Copyright 2026 Ajay Walia</copyright><lastBuildDate>Sun, 21 Jun 2026 05:46:10 +0000</lastBuildDate><atom:link href="https://curiousbit.netlify.app/tags/engineering/index.xml" rel="self" type="application/rss+xml"/><image><url>https://curiousbit.netlify.app/images/og-default.png</url><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link></image><item><title>Aether, Rethought — The Shape Was Wrong All Along</title><link>https://curiousbit.netlify.app/aether-rethought-the-shape-was-wrong-all-along/</link><guid isPermaLink="true">https://curiousbit.netlify.app/aether-rethought-the-shape-was-wrong-all-along/</guid><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Part III of the Aether series.&lt;/strong&gt; Missed the first two? Start with &lt;a href="https://curiousbit.netlify.app/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/"&gt;Meet Aether&lt;/a&gt; (the build), then &lt;a href="https://curiousbit.netlify.app/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/"&gt;Aether, Grown Wild&lt;/a&gt; (what happened when it ran).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;h2 id="the-story-so-far--one-paragraph-each"&gt;The story so far — one paragraph each&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Part I:&lt;/strong&gt; I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.&lt;/p&gt;</description><content:encoded><![CDATA[<img src="https://curiousbit.netlify.app/images/aether-rethought.png" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><blockquote><p><strong>Part III of the Aether series.</strong> Missed the first two? Start with <a href="/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/">Meet Aether</a> (the build), then <a href="/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/">Aether, Grown Wild</a> (what happened when it ran).</p></blockquote><hr><h2 id="the-story-so-far--one-paragraph-each">The story so far — one paragraph each</h2><p><strong>Part I:</strong> I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.</p><p><strong>Part II:</strong> That clean idea hit reality. The router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and self-reported confidence was replaced by a formula the system computes. Thirteen agents, a live web allowlist, a computed confidence score, and a CHANGELOG.md in place of Git.</p><p><strong>Part III (this one):</strong> The shape is wrong. The hierarchy that felt so natural — because it mirrors the org chart — turns out to optimise for the wrong things. 
I worked through five recognised ways to build an agentic AI system, scored each against the same criteria, and arrived at a recommendation that changes the architecture without throwing away anything we&rsquo;ve built.</p><hr><h2 id="what-were-actually-building">What we&rsquo;re actually building</h2><p>The output isn&rsquo;t a chatbot. It&rsquo;s a consulting deliverable — an architecture document advising how to run an IT transition. Every real client engagement spans multiple towers simultaneously: a move to Entra ID while modernising the network and shifting workloads to Azure touches Cloud, Network, Digital Workplace, and Security in the same breath. The value and the difficulty live in the cross-domain synthesis.</p><p><img src="/images/slides/slides.002.png" alt="Slide: The problem — what we are actually building. Six technology towers. The hard part is cross-domain synthesis into one coherent, defensible document."/><hr><h2 id="where-we-started--the-three-tier-hierarchy">Where we started — the three-tier hierarchy</h2><p>This is what Aether v2.x actually is. One model, thirteen agents, each differentiated only by system prompt and knowledge namespace. Work routes down the tree; low confidence escalates back up.</p><p><img src="/images/slides/slides.003.png" alt="Slide: The current Aether v2.x system — three-tier hierarchy of 13 agents across Cloud, DWP, and Network domains"/><p>It was appealing for real reasons: it mirrors how a delivery organisation thinks, it&rsquo;s easy to explain to a client, and the RAG namespace-per-domain isolation is clean. But we copied an org chart into the control flow — and the next slide explains why that&rsquo;s usually a trap.</p><hr><h2 id="the-core-insight--why-the-org-chart-is-the-wrong-shape">The core insight — why the org chart is the wrong shape</h2><p>This is the conceptual centre of the entire analysis. 
Everything that follows flows from these two points.</p><p><img src="/images/slides/slides.004.png" alt="Slide: Why mirroring the org chart is usually a trap — human orgs optimise for accountability, agent systems optimise for information flow. Two consequences: real infra work is cross-domain, and the deliverable is a workflow not an organisation."/><p>A strict tree only allows vertical movement — up to escalate, down to delegate. Real architecture work needs lateral collaboration. The AWS specialist can&rsquo;t directly ask the Network specialist a question; it has to climb the entire tree and back down. That&rsquo;s bureaucracy encoded in Python.</p><p>The second consequence is more fundamental: <strong>the deliverable is a workflow, not an organisation</strong>. Producing an architecture document is a consulting process with phases. The right structure for that process is a pipeline. We modelled the people first and the process second — we should have done it the other way round.</p><hr><h2 id="five-ways-to-build-it">Five ways to build it</h2><p>Rather than jump straight to a new design, I evaluated five recognised approaches against the same five axes: cross-domain capability, parallelism, auditability, simplicity, and fit for document generation.</p><p><img src="/images/slides/slides.005.png" alt="Slide: The section ahead — five ways to build it, each weighed on the same axes. RAG is a shared substrate underneath all five."/><p>Note the footer: RAG sits underneath <strong>all five</strong> approaches — they differ in control flow, not in whether they retrieve.</p><hr><h3 id="approach-1--hierarchical--org-mirror">Approach 1 — Hierarchical / org-mirror</h3><p>This is what Aether v2.x already is.</p><p><img src="/images/slides/slides.006.png" alt="Slide: Approach 1 of 5 — Hierarchical / org-mirror. Pros: intuitive, clear escalation, clean RAG namespaces. 
Cons: no lateral channel, brittle confidence escalation, Tier-1 becomes a bottleneck."/><p>The pros are real — which is why we chose it. But the killer con: <strong>there is no sideways path</strong>. The AWS agent can&rsquo;t ask the Network agent a question without escalating all the way up and back down. Also, escalation triggered by confidence scores sounds clean, but LLM confidence is unreliable — the trigger itself is shaky.</p><p><strong>Verdict:</strong> Great for stable, siloed problems with a genuine audit chain. That is not what our engagements look like.</p><hr><h3 id="approach-2--orchestrator--flat-specialists">Approach 2 — Orchestrator + flat specialists</h3><p>One orchestrator plans the task, fans it out to specialists running <strong>in parallel</strong>, then synthesises. Flat: adding a domain means adding one specialist — no re-tiering.</p><p><img src="/images/slides/slides.007.png" alt="Slide: Approach 2 of 5 — Orchestrator + flat specialists. Real parallelism, easy to extend, conflicts resolved in one place. Cost: orchestrator decomposition and synthesis are the hard part."/><p><strong>Verdict:</strong> A strong, flexible default. A piece of the recommendation.</p><hr><h3 id="approach-3--workflow--pipeline-process-native">Approach 3 — Workflow / pipeline (process-native)</h3><p>Instead of organising by <em>who</em>, organise by the <em>stages</em> of producing the document: Discover → Assess → Design → Review → Assemble. Each stage maps to a section of the output. Human checkpoints slot between stages.</p><p><img src="/images/slides/slides.008.png" alt="Slide: Approach 3 of 5 — Workflow / pipeline. Mirrors how the deliverable is actually made. Highly auditable: each stage = a section. The &lsquo;con&rsquo; — domains become knowledge sources — is arguably a feature."/><p><strong>Verdict:</strong> This is the spine. The structure that mirrors how the deliverable is actually made. 
The so-called con — that domain experts become knowledge sources rather than autonomous agents — is arguably the correct framing.</p><hr><h3 id="approach-4--blackboard--shared-artifact">Approach 4 — Blackboard / shared artifact</h3><p>All agents read and write a common workspace — the evolving document. Strong on cross-domain consistency because there&rsquo;s only one object. The hard part is concurrency control.</p><p><img src="/images/slides/slides.009.png" alt="Slide: Approach 4 of 5 — Blackboard / shared artifact. Good cross-domain consistency. Concurrency and conflict handling is fiddly. A typed, controlled state object (LangGraph) is safer than a true free-for-all blackboard."/><p><strong>Key distinction:</strong> a true blackboard (free-for-all writes) is risky. A typed, controlled shared state object — which LangGraph gives us — keeps the benefit without the chaos. We use the controlled version.</p><hr><h3 id="approach-5--single-agent-baseline">Approach 5 — Single-agent baseline</h3><p>One capable model. A lightweight router. Domain knowledge bases on demand. Simplest, cheapest, most reliable.</p><p><img src="/images/slides/slides.010.png" alt="Slide: Approach 5 of 5 — Single-agent baseline. Simplest and most reliable. Often beats multi-agent for doc generation. Breaks down when inputs blow past the context window."/><p><strong>Verdict:</strong> The benchmark every fancier design has to beat. The real reason to decompose is <em>information management</em> — not model weakness. When CMDB exports, cloud inventories, and Terraform files all arrive together, no context window handles it cleanly.</p><hr><h2 id="the-comparison">The comparison</h2><p><img src="/images/slides/slides.011.png" alt="Slide: Side-by-side comparison — all five approaches scored on cross-domain, parallelism, auditability, simplicity, and fit for doc gen. 
The two that score High on fit-for-doc-gen are Orchestrator+Specialists and Workflow — and they&rsquo;re complementary."/><p>The table does the work. Our current design (Hierarchical) is weakest exactly where we need strength — cross-domain — while strong on auditability. The two approaches that score high on fit-for-doc-gen are <strong>complementary</strong>: strong in different places. That&rsquo;s the bridge to the recommendation.</p><hr><h2 id="the-recommendation--a-hybrid">The recommendation — a hybrid</h2><p><img src="/images/slides/slides.013.png" alt="Slide: Recommended — a hybrid workflow + specialist platform. Hub-and-spoke structure: typed shared state at the centre, pipeline flowing through it, specialists invoked within stages, cross-cutting reviewers across all stages, QA/eval loops back."/><p>Not a ladder. A <strong>hub</strong>. The pipeline (Discovery → Assessment → Design) flows through a single typed shared state object. Domain specialists are invoked within stages — callable skills, not autonomous routing agents. Cross-cutting reviewers (Security, Cost, Risk, Compliance) act across all stages. 
QA can loop work back to Design or Assessment until the rubric passes.</p><hr><h3 id="how-the-model-works">How the model works</h3><p><img src="/images/slides/slides.014.png" alt="Slide: The recommendation in motion — pluggable model, orchestrated pipeline, parallel specialists, cross-cutting reviewers, deterministic arbitration, human gates, render and evaluate."/><p>Three things to highlight:</p><ul><li><strong>Parallel specialists</strong> write structured findings into one typed shared state — the single source of truth</li><li><strong>Arbitration is deterministic</strong> — fires on conflict, policy breach, or missing data, never on confidence scores</li><li><strong>The model is pluggable</strong> — local Gemma today, Claude/OpenAI/Gemini tomorrow, same knowledge, no rigid tree</li></ul><hr><h3 id="why-this-is-the-right-call">Why this is the right call</h3><p><img src="/images/slides/slides.015.png" alt="Slide: Why this is the right call — trade-offs honestly laid out. Pros: handles cross-domain work, governance is first-class, model-agnostic. Cons: more moving parts, orchestrator and eval are real engineering."/><p>The verdict, stated plainly: it&rsquo;s the only option that handles cross-domain work with the auditability, governance, and evidence-traceability a client deliverable demands — at an acceptable, well-understood increase in build complexity.</p><hr><h2 id="the-technology-stack--nothing-new-to-install">The technology stack — nothing new to install</h2><p><img src="/images/slides/slides.016.png" alt="Slide: The technology stack — LangGraph + FastAPI + Redis, Gemma via LM Studio, LanceDB + BAAI/bge-small + SQLite, ddgs + BeautifulSoup, Gradio, YAML manifests. Everything already running."/><p>Every box in the stack is something we already run. 
The v3 work extends Orchestration and adds the governance/eval layer. <strong>This is not a rebuild.</strong></p><hr><h2 id="the-process--how-a-real-engagement-runs">The process — how a real engagement runs</h2><p><img src="/images/slides/slides.017.png" alt="Slide: Process flow — Discovery → evidence normalise → current state → gap analysis → target design → ADRs → reviews → migration plan → assembly → QA/eval. Two human gates. Feedback loops back to design. Everything writes into one canonical model."/><p>Two things to notice:</p><ol><li><strong>Feedback arrows</strong> — review, QA, and cost/risk findings can send work <em>back</em> to Design or Assessment. It&rsquo;s iterative, not a one-way pipeline.</li><li><strong>Human gates</strong> — after assessment and before assembly. An architect validates the current-state picture and signs off before the document is built. Gated autonomy, not full automation — which matters when the output carries liability.</li></ol><hr><h2 id="the-real-hard-problem--evidence-quality">The real hard problem — evidence quality</h2><p><img src="/images/slides/slides.018.png" alt="Slide: The real hard problem — bad source data. Missing CMDB, conflicting diagrams, unknown dependencies, shadow IT. Evidence quality is an explicit early gate: score completeness, request more data, log gaps, escalate when material."/><p>Every downstream recommendation carries the confidence level and assumptions set at this gate. <strong>No silent guessing.</strong> The system requests more data, logs gaps in an assumption register, or proceeds while explicitly stating its confidence level.</p><hr><h2 id="what-flows-through-the-system--the-canonical-architecture-state">What flows through the system — the canonical Architecture State</h2><p><img src="/images/slides/slides.019.png" alt="Slide: The canonical Architecture Model — one typed object, ArchitectureState. Fields mature left to right: Evidence → Analysis → Decisions → Plan &amp; Governance → Output. 
Every recommendation traces back to the evidence that produced it."/><p>One typed object. Every stage, specialist, and reviewer reads from and writes to it. The document is rendered from it. Every recommendation traces back through the same object to the evidence that produced it — that traceability is what makes the deliverable auditable.</p><hr><h2 id="lenses-not-domains--where-security-and-compliance-live">Lenses, not domains — where security and compliance live</h2><p><img src="/images/slides/slides.020.png" alt="Slide: Lenses not domains — Security, Cost/FinOps, Compliance, Risk, Sustainability, Supportability cut horizontally across all stages. Each is a reviewer plus a checklist in the QA rubric firing on concrete triggers."/><p><strong>Security is a lens, not a domain.</strong> Same for Cost, Compliance, Risk, Sustainability, and Supportability. They aren&rsquo;t towers sitting next to Cloud and Network — they cut horizontally across every stage. Most rework loops originate in Security and Compliance, which is exactly why they&rsquo;re first-class cross-cutting reviewers.</p><hr><h2 id="key-design-considerations">Key design considerations</h2><p><img src="/images/slides/slides.021.png" alt="Slide: Six key design considerations — decompose for context not IQ, deterministic arbitration, human-in-the-loop gates, evaluation and provenance, template as rendering concern, operational reality."/><p>The most important reframe: <strong>we decompose for information management, not because the model is weak</strong>. That reframes the whole multi-agent debate. And arbitration fires on deterministic events — never on confidence scores. Every recommendation must trace to evidence.</p><hr><h2 id="design-decisions--where-the-build-effort-actually-goes">Design decisions — where the build effort actually goes</h2><p><img src="/images/slides/slides.022.png" alt="Slide: Design decisions and effort split. 35% evidence ingestion, 20% evaluation/governance, 10% agents. 
Most teams invert these numbers and ship a demo that can&rsquo;t produce a deliverable a client would pay for."/><p>The effort allocation is the provocative bit. <strong>Only ~10% of build effort goes to the agents themselves.</strong> ~35% is evidence ingestion and normalisation. ~20% is evaluation and governance. Most teams invert these numbers — polished agents, no evaluation — and end up with a demo that can&rsquo;t produce a deliverable a client would pay for. That&rsquo;s the trap we&rsquo;re avoiding.</p><hr><h2 id="business-value">Business value</h2><p><img src="/images/slides/slides.023.png" alt="Slide: Business value — consulting accelerator and quality platform. Time to assess: weeks to days. Evidence gathering: manual to automated. Document assembly: hand-built to generated and validated. Hypotheses to validate in a pilot."/><p>The positioning: <strong>a consulting accelerator and quality platform</strong> — not a headcount-reduction play. The specific numbers (50–80% faster assessments, 60–90% less evidence-collection effort) are hypotheses to validate in a pilot, not measured results. They&rsquo;re framed that way because that&rsquo;s what earns trust.</p><hr><h2 id="next-steps">Next steps</h2><p><img src="/images/slides/slides.024.png" alt="Slide: Next steps — align on approach, write v3 design, scope Cloud Migration Assessment as first engagement, stand up evidence and eval first."/><p>The sequencing matters: <strong>evidence ingestion and the QA rubric first</strong>, agent polish second. That&rsquo;s where the quality actually lives.</p><hr><h2 id="the-honest-accounting">The honest accounting</h2><p>I built a hierarchy because it was intuitive. It is intuitive — it maps to how the organisation thinks, the escalation chain is easy to trace and audit, and it&rsquo;s easy to explain to a client. Those are real advantages.</p><p>But intuitive for humans and optimal for agents are not the same thing. 
The hierarchy was designed to answer <em>&ldquo;who is responsible?&rdquo;</em> — an important human question that agents don&rsquo;t need answered. The new design answers <em>&ldquo;what needs to happen next?&rdquo;</em> — which is the right question for a document-generation workflow.</p><p><strong>What changes in v3:</strong></p><ul><li>Agents become implementation details inside the Knowledge layer — the durable value is in evidence, governance, evaluation, and the document workflow</li><li>Escalation fires on events (conflict, policy breach, missing data) — not on confidence scores</li><li>The canonical model is the design constraint — the document isn&rsquo;t an afterthought, it&rsquo;s what everything is structured around</li><li>Model-agnostic by default — local Gemma today, SOTA API tomorrow, same knowledge</li></ul><p><strong>What stays the same:</strong> LangGraph, LanceDB, BAAI/bge-small, FastAPI, SQLite, the YAML manifests, and the 13 domain specialists — which become callable skills within the Design stage rather than autonomous routing agents. Same knowledge, different invocation mechanism.</p><p>This is an extension of what we run, not a rebuild.</p><hr><p><em>Questions about the design or the approach? Reach out on <a href="https://www.linkedin.com/in/ajay-walia-8b066a1b/">LinkedIn</a>.</em></p>
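<hr><p>A minimal sketch of the event-driven escalation described in the &ldquo;What changes in v3&rdquo; list, written in Python. Only the name <code>ArchitectureState</code> and the three trigger types (conflict, policy breach, missing data) come from the post; every field, the <code>Finding</code> class, the <code>arbitration_events</code> function, and the banned-terms policy check are hypothetical stand-ins, not the real v3 design.</p>

```python
from dataclasses import dataclass, field
from enum import Enum


class Trigger(Enum):
    CONFLICT = "conflict"
    POLICY_BREACH = "policy_breach"
    MISSING_DATA = "missing_data"


@dataclass
class Finding:
    specialist: str        # hypothetical: which callable skill wrote this
    topic: str             # what the finding is about
    recommendation: str
    evidence_ids: list = field(default_factory=list)


@dataclass
class ArchitectureState:
    # Hypothetical fields; the post only names the object itself.
    findings: list = field(default_factory=list)
    banned_terms: set = field(default_factory=set)  # stand-in for real policy rules


def arbitration_events(state):
    """Return (trigger, topic) pairs. No confidence score is consulted."""
    events = []
    recs_by_topic = {}
    for f in state.findings:
        recs_by_topic.setdefault(f.topic, set()).add(f.recommendation)
        if not f.evidence_ids:  # recommendation with no evidence behind it
            events.append((Trigger.MISSING_DATA, f.topic))
        text = f.recommendation.lower()
        if any(term in text for term in state.banned_terms):
            events.append((Trigger.POLICY_BREACH, f.topic))
    for topic, recs in recs_by_topic.items():
        if len(recs) > 1:  # specialists disagree on the same topic
            events.append((Trigger.CONFLICT, topic))
    return events


state = ArchitectureState(
    findings=[
        Finding("cloud", "identity", "Move to Entra ID", ["ev-1"]),
        Finding("network", "identity", "Keep on-prem AD", ["ev-2"]),
        Finding("dwp", "endpoints", "Enable Telnet for legacy kiosks", []),
    ],
    banned_terms={"telnet"},
)
events = arbitration_events(state)
for trigger, topic in events:
    print(trigger.value, topic)
```

<p>The same input always yields the same events, which is the property the post asks of arbitration. A real rule set would presumably be far richer than a banned-terms lookup.</p>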
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/aether-rethought.png" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>I Built My Own RSS Reader in an Afternoon — With AI Doing the Typing</title><link>https://curiousbit.netlify.app/i-built-my-own-rss-reader-in-an-afternoon-with-ai-doing-the-typing/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-my-own-rss-reader-in-an-afternoon-with-ai-doing-the-typing/</guid><pubDate>Wed, 10 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;For years my RSS setup was a compromise. I never wanted to pay for a desktop reader, so I got stuck with The Old Reader — a perfectly fine service, but using it felt like visiting a website, because that&amp;rsquo;s exactly what it was. Open a browser tab, log in, scroll, repeat tomorrow. What I actually wanted was simple: a native Mac app, my feeds downloaded and stored locally, no account, no subscription, fast enough to triage a hundred articles with my keyboard.&lt;/p&gt;</description><content:encoded><![CDATA[<img src="https://curiousbit.netlify.app/images/Lumen/lumenai-hero.jpg" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>For years my RSS setup was a compromise. I never wanted to pay for a desktop reader, so I got stuck with The Old Reader — a perfectly fine service, but using it felt like visiting a website, because that&rsquo;s exactly what it was. Open a browser tab, log in, scroll, repeat tomorrow. 
What I actually wanted was simple: a native Mac app, my feeds downloaded and stored locally, no account, no subscription, fast enough to triage a hundred articles with my keyboard.</p><p>Then I heard about Claude&rsquo;s new Fable model and thought: fine, let&rsquo;s stop wishing and just build the thing. This is the story of <strong>LumenAI</strong> — a local-first RSS reader for macOS with AI summaries — built from an empty folder to a notarizable DMG in about an hour of wall-clock time, with me acting as product owner and build verifier while the AI wrote the code.</p><p><img src="/images/lumenai-screens/main-window.png" alt="LumenAI main window — three-pane layout in dark mode"/><p><em>That&rsquo;s the real app, not a mockup. There&rsquo;s a <a href="/lumenai-screens.html">full captioned gallery here</a> — the loaded feed list, OPML import, the memory footprint, and the DMG build.</em></p><hr><h2 id="the-idea">The Idea</h2><p>The pitch I gave the AI was one paragraph: a local RSS reader for Mac, feeds downloaded and stored on my machine, a premium feel, and treat it like a real engineering project — clear phases, and ask me questions before making decisions. That last part turned out to be the most important sentence in the whole project.</p><p>Instead of immediately generating a wall of code, it interviewed me. What stack? What does v1 include, and — just as important — what does it exclude? How should refresh work? What does &ldquo;premium&rdquo; mean to you, concretely? 
By the end of a few rounds of multiple-choice questions, we had a real spec:</p><p><strong>v1 goals:</strong> subscribe to feeds, full article extraction, fast local search, OPML import/export, saved views, deduplication, offline reading, AI summaries, keyboard-first navigation, dark/light themes, typography controls, reader mode, reading progress.</p><p><strong>v1 non-goals:</strong> semantic search, topic clustering, multi-device sync, social anything.</p><p>Writing down the non-goals felt almost ceremonial at the time. It wasn&rsquo;t. Every time scope tried to creep, that list killed the discussion in one line.</p><hr><h2 id="the-stack">The Stack</h2><p>Every choice optimized for &ldquo;native feel, local data, no servers.&rdquo;</p><p>The app is Swift and SwiftUI targeting macOS 14, because nothing fakes the feel of a real Mac app. Storage is SQLite via <a href="https://github.com/groue/GRDB.swift">GRDB.swift</a>, chosen over Apple&rsquo;s SwiftData specifically for FTS5 — SQLite&rsquo;s built-in full-text search engine, which gives instant search across every article ever downloaded, entirely offline. Feed parsing is FeedKit, wrapped in a normalizer layer so the rest of the app never touches a FeedKit type and JSON Feed support costs no schema changes. Full-text extraction is Mozilla&rsquo;s Readability.js — the same engine behind Firefox&rsquo;s reader mode — running in a hidden WKWebView with the page&rsquo;s own JavaScript disabled. The reader itself is a WKWebView used purely as a rendering layer for a themed HTML template; everything around it stays SwiftUI. The project file is generated by XcodeGen from a YAML spec, which kept the AI and Xcode from ever fighting over a <code>.xcodeproj</code>.</p><p>The AI layer is the part I&rsquo;m proudest of architecturally: a single <code>SummaryProvider</code> protocol with six implementations — Apple Intelligence (on-device, appears only on macOS 26+), Ollama for local models, Claude, OpenAI, Gemini, and Disabled. 
One protocol method. Swapping providers is a dropdown in Settings; API keys live in the macOS Keychain.</p><p><img src="/images/Lumen/lumenai-stack.jpg" alt="LumenAI architecture — SwiftUI over SQLite/FTS5, FeedKit and Readability.js, with six pluggable AI providers"/><hr><h2 id="the-seven-phases-okay-eight">The Seven Phases (Okay, Eight)</h2><p>We numbered from zero, like civilized people.</p><p><strong>Phase 0 — Scaffold.</strong> XcodeGen project, sandbox and network entitlements, and a three-pane shell (sidebar, article list, reader) running on sample data. The exit criterion was simply &ldquo;builds and runs.&rdquo; It almost did: the very first build failed with Swift&rsquo;s infamous <em>&ldquo;the compiler is unable to type-check this expression in reasonable time&rdquo;</em> — the AI had written a too-clever nested closure to generate sample data. It rewrote it as a boring <code>for</code> loop. A very human bug, honestly.</p><p><strong>Phase 1 — Data layer.</strong> The real schema: feeds, folders, articles, saved views, an FTS5 index kept in sync by SQL triggers, and a three-tier deduplication identity — an article is its <code>guid</code> if the feed provides one, else its normalized URL (tracking parameters stripped), else a content hash. Ten unit tests against an in-memory database before any networking existed.</p><p><strong>Phase 2 — Feed engine.</strong> Fetching with HTTP conditional GET, so unchanged feeds cost a 304 response instead of a re-download. Feed auto-discovery, so typing <code>daringfireball.net</code> finds the actual feed URL by scanning the page&rsquo;s <code>&lt;link&gt;</code> tags. RSS, Atom, and JSON Feed all normalize into one canonical model. 
By the end of this phase the app was genuinely usable: subscribe, read, refresh.</p><p><strong>Phase 3 — Core UI.</strong> Folders, favicons, thumbnails, unread badges, and the thing that makes an RSS reader feel like a tool instead of a website: keyboard navigation. <code>j</code>/<code>k</code> to move, space for next unread, <code>s</code> to star, <code>m</code> to toggle read. The Reeder dialect, basically.</p><p><strong>Phase 4 — Reading experience.</strong> Select an article and it silently fetches the source page, runs Readability.js over it, and stores clean full text — so a feed that only publishes two-line excerpts still gives you whole articles, offline, forever. Typography controls (serif/sans, size, line width), themes that follow the system, and per-article reading progress that restores when you come back. This phase also produced the best bug of the project: scroll position was saved to app state, which regenerated the reader HTML, which reloaded the page, which reset the scroll — an infinite loop the AI caught in code review <em>before</em> I ever built it.</p><p><strong>Phase 5 — Search, saved views, OPML.</strong> Global FTS5 search from the toolbar, saved views (persistent named filters — &ldquo;unread Swift articles from these three feeds, last 30 days&rdquo;), and OPML import/export so my subscriptions could finally walk out of The Old Reader with folder structure intact.</p><p><strong>Phase 6 — AI summaries.</strong> The provider protocol described above, plus a deliberately boring prompt: summarize in two or three sentences, be specific, no &ldquo;this article discusses.&rdquo; The summary renders as a tinted card above the article. The point isn&rsquo;t to replace reading — it&rsquo;s triage. 
Is this worth my next ten minutes?</p><p><img src="/images/Lumen/lumenai-reader.jpg" alt="Reader view with the AI summary card above the article (concept render)"/><p><strong>Phase 7 — Polish.</strong> An app icon (generated programmatically — an RSS glyph under a sparkle on an indigo gradient), a dock badge with the unread count, render-path caching so thousand-article lists scroll smoothly, and a one-command script that builds a signed, drag-to-Applications DMG.</p><p><img src="/images/lumenai-screens/dmg-build.png" alt="One-command DMG build — ./Scripts/make_dmg.sh producing LumenAI.dmg"/><p><em>The Phase 7 finale, for real: one script, one signed DMG on the Desktop. More screenshots — the loaded feed list after OPML import, and the app&rsquo;s 143 MB memory footprint in Activity Monitor — are on the <a href="/lumenai-screens.html">LumenAI screenshots page</a>.</em></p><hr><h2 id="what-the-lifecycle-actually-felt-like">What the Lifecycle Actually Felt Like</h2><p>The loop for every phase was identical: the AI proposed decisions and asked questions, wrote the code and its tests, and then stopped — because it couldn&rsquo;t compile anything. Its sandbox is Linux; you can&rsquo;t build a Mac app there. So I was the build machine. ⌘U, ⌘R, report back. &ldquo;Build succeeded, go ahead&rdquo; became the rhythm of the afternoon.</p><p>That constraint turned out to be a feature. It forced a real checkpoint between phases — a human running the actual app — instead of an unbroken firehose of unverified code. Of the three failures across the whole project, two were caught by my builds (the type-checker timeout, and a &ldquo;cannot find type&rdquo; error that turned out to mean I&rsquo;d forgotten to re-run XcodeGen after files were added) and one was caught by the AI re-reading its own code. 
Final tally: <strong>38 Swift files, 27 tests, zero runtime crashes encountered.</strong></p><p>The other thing that surprised me: being asked questions felt like the AI respecting that it was <em>my</em> app. Tech stack, refresh cadence, dedup policy, summary length, even the app&rsquo;s name — every fork in the road was a decision I made in seconds from a menu of researched options, instead of an assumption silently baked into code I&rsquo;d discover three weeks later.</p><hr><h2 id="what-id-tell-you-if-youre-tempted">What I&rsquo;d Tell You If You&rsquo;re Tempted</h2><p>Treat it like an engineering project, not a magic trick. The phases, the non-goals list, the tests, the check-in after every phase — that structure is why this worked in an hour instead of unraveling in a weekend. The AI typed every line of code, but the spec, the taste, and the &ldquo;no, simpler&rdquo; calls were the human contribution, and the project needed both.</p><p>And yes — the app starts instantly, works on a plane, and never asks me to log in. The subscription I avoided paying for has been replaced by the most expensive thing of all: now I want to build everything.</p><hr><p><em>LumenAI is Swift/SwiftUI on macOS 14+, with GRDB, FeedKit, and Readability.js. Built with Claude Fable 5.</em></p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/Lumen/lumenai-hero.jpg" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>Aether, Grown Wild — The Implementation Journey (v2.6 → v2.8.2)</title><link>https://curiousbit.netlify.app/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/</link><guid isPermaLink="true">https://curiousbit.netlify.app/aether-grown-wild-the-implementation-journey-v2.6-v2.8.2/</guid><pubDate>Wed, 03 Jun 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;link href="https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&amp;family=JetBrains+Mono:wght@400;600&amp;display=swap" rel="stylesheet"&gt;
&lt;style&gt;
/* ─────────────────────────────────────────────
JUNGLE BOOK PALETTE — scoped to .aether-journey
───────────────────────────────────────────── */
.aether-journey {
--bg: #0c1410;
--bg2: #111d16;
--bg3: #16271c;
--jungle: #1f3d2b;
--jungle2: #2c5239;
--olive: #8a9a52;
--olive-dk: #4a5429;
--gold: #d9a521;
--gold-lt: #f0c95a;
--gold-dk: #8a6610;
--clay: #a9683f;
--clay-lt: #c98a5b;
--cream: #e7e0cf;
--muted: #aab694;
--rule: rgba(217,165,33,0.20);
--rule2: rgba(125,139,74,0.24);
font-family: 'Noto Sans', sans-serif;
font-size: 21px;
line-height: 1.85;
color: var(--cream);
}
.aether-journey a { color:var(--gold); text-decoration:none; border-bottom:1px solid rgba(217,165,33,0.35); }
.aether-journey a:hover { color:var(--gold-lt); border-color:var(--gold-lt); }
/* ── Hero ── */
.aj-hero { border-bottom:1px solid var(--rule); position:relative; overflow:hidden; min-height:540px; display:flex; align-items:flex-end; border-radius:12px; margin-bottom:40px; }
.aj-hero-video { position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:1; }
.aj-hero-scrim { position:absolute; inset:0; z-index:2; pointer-events:none;
background:
linear-gradient(180deg, rgba(12,20,16,0.25) 0%, rgba(12,20,16,0.50) 50%, rgba(12,20,16,0.93) 100%),
radial-gradient(900px 480px at 16% 100%, rgba(31,61,43,0.55), transparent 70%); }
.aj-hero-in { padding:74px 26px 60px; max-width:960px; margin:0 auto; position:relative; z-index:3; width:100%; }
.aj-hero-meta { display:flex; gap:22px; flex-wrap:wrap; margin-top:30px; font-size:1rem; color:var(--muted); }
.aj-hero-meta strong { color:var(--gold); font-weight:600; }
.aj-hero-meta span:not(:last-child)::after { content:"·"; margin-left:22px; color:var(--olive-dk); }
.aj-kicker { display:inline-block; font-size:.84rem; letter-spacing:.2em; text-transform:uppercase; color:var(--bg); background:var(--gold); padding:6px 15px; border-radius:3px; font-weight:700; margin-bottom:22px; }
/* ── Typography ── */
.aether-journey .aj-label { font-size:.88rem; letter-spacing:.22em; text-transform:uppercase; color:var(--gold); font-weight:600; margin:54px 0 12px; }
.aether-journey h2 { font-family:'Noto Serif',serif; font-size:clamp(1.85rem,3.5vw,2.6rem); color:#fff; font-weight:700; margin:8px 0 24px; line-height:1.25; }
.aether-journey h3 { font-family:'Noto Serif',serif; font-size:1.45rem; color:var(--gold-lt); margin:34px 0 12px; }
.aether-journey p { color:var(--cream); margin:0 0 24px; }
.aether-journey .aj-lead { font-size:1.4rem; color:#fff; line-height:1.7; }
.aether-journey em.q { color:var(--clay-lt); font-style:italic; }
.aether-journey strong { color:#fff; font-weight:700; }
.aj-hero h1 { font-family:'Noto Serif',serif; font-size:clamp(2.6rem,5.5vw,4rem); line-height:1.14; color:#fff; font-weight:700; }
/* ── Lists — em-dash ── */
.aether-journey ul.leaf { list-style:none; padding:0; margin:0 0 24px; }
.aether-journey ul.leaf li { padding:0 0 20px 34px; position:relative; color:var(--cream); font-size:1.35rem; line-height:1.8; }
.aether-journey ul.leaf li::before { content:"—"; position:absolute; left:0; color:var(--olive); font-weight:700; }
.aether-journey ul.leaf li strong { color:#fff; }
/* ── Recap note ── */
.aj-recap { background:var(--bg2); border-left:3px solid var(--clay); border-radius:0 6px 6px 0; padding:22px 28px; margin:0 0 40px; }
.aj-recap p { margin:0; font-size:1.12rem; color:var(--muted); line-height:1.7; }
/* ── Pull quote ── */
.aj-pull { font-family:'Noto Serif',serif; font-size:1.6rem; line-height:1.55; color:var(--gold-lt); font-style:italic; border-left:3px solid var(--gold); padding:8px 0 8px 28px; margin:40px 0; }
/* ── Stat strip ── */
.aj-statbar { display:grid; grid-template-columns:repeat(4,1fr); gap:1px; background:var(--rule2); border:1px solid var(--rule2); margin:40px 0; border-radius:8px; overflow:hidden; }
.aj-stat { background:var(--bg2); padding:28px 18px; text-align:center; }
.aj-stat .num { font-family:'Noto Serif',serif; font-size:2.2rem; color:var(--gold); display:block; line-height:1.1; }
.aj-stat .lbl { font-size:.82rem; letter-spacing:.08em; color:var(--muted); text-transform:uppercase; margin-top:8px; }
/* ── Figures ── */
.aj-visual { margin:36px 0 12px; border-radius:10px; overflow:hidden; border:1px solid var(--rule2); background:var(--bg2); }
.aj-visual svg { display:block; width:100%; height:auto; }
.aj-cap { font-size:1.25rem; color:var(--muted); text-align:center; margin:0 0 36px; font-style:italic; line-height:1.6; }
/* ── Code / formula ── */
.aether-journey pre { background:#07100b; border:1px solid var(--rule2); border-radius:8px; padding:22px 24px; overflow-x:auto; margin:28px 0; font-family:'JetBrains Mono',monospace; font-size:.95rem; line-height:1.75; color:var(--cream); }
.aether-journey pre .c { color:var(--muted); }
.aether-journey pre .k { color:var(--gold-lt); }
.aether-journey pre .s { color:var(--olive); }
.aether-journey code.inline { font-family:'JetBrains Mono',monospace; font-size:1em; background:rgba(125,139,74,.16); color:var(--gold-lt); padding:2px 8px; border-radius:4px; border-bottom:none; }
.aj-formula { background:var(--bg3); border:1px solid var(--rule); border-radius:8px; padding:24px 28px; margin:30px 0; font-family:'JetBrains Mono',monospace; font-size:1.05rem; color:var(--gold-lt); text-align:center; line-height:2; }
.aj-formula span { color:var(--muted); }
/* ── Timeline ── */
.aj-timeline { position:relative; margin:34px 0 12px; padding-left:6px; }
.aj-timeline::before { content:''; position:absolute; left:19px; top:8px; bottom:8px; width:2px; background:linear-gradient(var(--olive),var(--clay)); }
.aj-tl { display:grid; grid-template-columns:46px 1fr; gap:18px; margin-bottom:26px; }
.aj-tl-dot { width:44px; height:44px; border-radius:50%; background:var(--bg2); border:2px solid var(--gold); display:flex; align-items:center; justify-content:center; font-size:.74rem; font-family:'JetBrains Mono',monospace; color:var(--gold); font-weight:600; position:relative; z-index:1; }
.aj-tl-body h3 { font-size:1.25rem; color:var(--gold-lt); margin:8px 0 6px; }
.aj-tl-body p { font-size:1.1rem; color:var(--muted); margin:0; line-height:1.7; }
/* ── CTA ── */
.aj-cta { background:linear-gradient(135deg,var(--jungle),var(--bg2)); border:1px solid var(--gold-dk); border-radius:12px; padding:40px 34px; margin:48px 0; text-align:center; }
.aj-cta h2 { margin-bottom:12px; }
.aj-cta p { color:var(--cream); max-width:640px; margin:0 auto 24px; font-size:1.15rem; line-height:1.7; }
.aj-btn { display:inline-block; background:var(--gold); color:var(--bg); font-weight:700; padding:16px 34px; border-radius:6px; letter-spacing:.02em; font-size:1.05rem; border:none; transition:transform .15s ease, background .15s ease; }
.aj-btn:hover { background:var(--gold-lt); color:var(--bg); transform:translateY(-2px); }
@media (max-width:640px) {
.aj-statbar { grid-template-columns:repeat(2,1fr); }
}
&lt;/style&gt;
&lt;div class="aether-journey"&gt;
&lt;div class="aj-hero"&gt;
&lt;video class="aj-hero-video" autoplay loop muted playsinline poster="/images/ather22.jpg"&gt;
&lt;source src="https://curiousbit.netlify.app/images/ather2.mp4" type="video/mp4"&gt;
&lt;/video&gt;
&lt;div class="aj-hero-scrim"&gt;&lt;/div&gt;
&lt;div class="aj-hero-in"&gt;
&lt;span class="aj-kicker"&gt;Implementation Journey · Part II&lt;/span&gt;
&lt;h1&gt;Aether, Grown Wild&lt;/h1&gt;
&lt;p class="aj-lead" style="margin-top:22px;"&gt;The first article was implementing the base idea. This one is the expedition — how the idea evolved and things got added as we moved forward, growing into a 13-agent, web-first, self-escalating system, and every bug in the undergrowth that shaped it.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ather22.jpg" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><link href="https://fonts.googleapis.com/css2?family=Noto+Serif:ital,wght@0,400;0,700;1,400&family=JetBrains+Mono:wght@400;600&display=swap" rel="stylesheet"><style>
/* ─────────────────────────────────────────────
JUNGLE BOOK PALETTE — scoped to .aether-journey
───────────────────────────────────────────── */
.aether-journey {
--bg: #0c1410;
--bg2: #111d16;
--bg3: #16271c;
--jungle: #1f3d2b;
--jungle2: #2c5239;
--olive: #8a9a52;
--olive-dk: #4a5429;
--gold: #d9a521;
--gold-lt: #f0c95a;
--gold-dk: #8a6610;
--clay: #a9683f;
--clay-lt: #c98a5b;
--cream: #e7e0cf;
--muted: #aab694;
--rule: rgba(217,165,33,0.20);
--rule2: rgba(125,139,74,0.24);
font-family: 'Noto Sans', sans-serif;
font-size: 21px;
line-height: 1.85;
color: var(--cream);
}
.aether-journey a { color:var(--gold); text-decoration:none; border-bottom:1px solid rgba(217,165,33,0.35); }
.aether-journey a:hover { color:var(--gold-lt); border-color:var(--gold-lt); }
/* ── Hero ── */
.aj-hero { border-bottom:1px solid var(--rule); position:relative; overflow:hidden; min-height:540px; display:flex; align-items:flex-end; border-radius:12px; margin-bottom:40px; }
.aj-hero-video { position:absolute; inset:0; width:100%; height:100%; object-fit:cover; z-index:1; }
.aj-hero-scrim { position:absolute; inset:0; z-index:2; pointer-events:none;
background:
linear-gradient(180deg, rgba(12,20,16,0.25) 0%, rgba(12,20,16,0.50) 50%, rgba(12,20,16,0.93) 100%),
radial-gradient(900px 480px at 16% 100%, rgba(31,61,43,0.55), transparent 70%); }
.aj-hero-in { padding:74px 26px 60px; max-width:960px; margin:0 auto; position:relative; z-index:3; width:100%; }
.aj-hero-meta { display:flex; gap:22px; flex-wrap:wrap; margin-top:30px; font-size:1rem; color:var(--muted); }
.aj-hero-meta strong { color:var(--gold); font-weight:600; }
.aj-hero-meta span:not(:last-child)::after { content:"·"; margin-left:22px; color:var(--olive-dk); }
.aj-kicker { display:inline-block; font-size:.84rem; letter-spacing:.2em; text-transform:uppercase; color:var(--bg); background:var(--gold); padding:6px 15px; border-radius:3px; font-weight:700; margin-bottom:22px; }
/* ── Typography ── */
.aether-journey .aj-label { font-size:.88rem; letter-spacing:.22em; text-transform:uppercase; color:var(--gold); font-weight:600; margin:54px 0 12px; }
.aether-journey h2 { font-family:'Noto Serif',serif; font-size:clamp(1.85rem,3.5vw,2.6rem); color:#fff; font-weight:700; margin:8px 0 24px; line-height:1.25; }
.aether-journey h3 { font-family:'Noto Serif',serif; font-size:1.45rem; color:var(--gold-lt); margin:34px 0 12px; }
.aether-journey p { color:var(--cream); margin:0 0 24px; }
.aether-journey .aj-lead { font-size:1.4rem; color:#fff; line-height:1.7; }
.aether-journey em.q { color:var(--clay-lt); font-style:italic; }
.aether-journey strong { color:#fff; font-weight:700; }
.aj-hero h1 { font-family:'Noto Serif',serif; font-size:clamp(2.6rem,5.5vw,4rem); line-height:1.14; color:#fff; font-weight:700; }
/* ── Lists — em-dash ── */
.aether-journey ul.leaf { list-style:none; padding:0; margin:0 0 24px; }
.aether-journey ul.leaf li { padding:0 0 20px 34px; position:relative; color:var(--cream); font-size:1.35rem; line-height:1.8; }
.aether-journey ul.leaf li::before { content:"—"; position:absolute; left:0; color:var(--olive); font-weight:700; }
.aether-journey ul.leaf li strong { color:#fff; }
/* ── Recap note ── */
.aj-recap { background:var(--bg2); border-left:3px solid var(--clay); border-radius:0 6px 6px 0; padding:22px 28px; margin:0 0 40px; }
.aj-recap p { margin:0; font-size:1.12rem; color:var(--muted); line-height:1.7; }
/* ── Pull quote ── */
.aj-pull { font-family:'Noto Serif',serif; font-size:1.6rem; line-height:1.55; color:var(--gold-lt); font-style:italic; border-left:3px solid var(--gold); padding:8px 0 8px 28px; margin:40px 0; }
/* ── Stat strip ── */
.aj-statbar { display:grid; grid-template-columns:repeat(4,1fr); gap:1px; background:var(--rule2); border:1px solid var(--rule2); margin:40px 0; border-radius:8px; overflow:hidden; }
.aj-stat { background:var(--bg2); padding:28px 18px; text-align:center; }
.aj-stat .num { font-family:'Noto Serif',serif; font-size:2.2rem; color:var(--gold); display:block; line-height:1.1; }
.aj-stat .lbl { font-size:.82rem; letter-spacing:.08em; color:var(--muted); text-transform:uppercase; margin-top:8px; }
/* ── Figures ── */
.aj-visual { margin:36px 0 12px; border-radius:10px; overflow:hidden; border:1px solid var(--rule2); background:var(--bg2); }
.aj-visual svg { display:block; width:100%; height:auto; }
.aj-cap { font-size:1.25rem; color:var(--muted); text-align:center; margin:0 0 36px; font-style:italic; line-height:1.6; }
/* ── Code / formula ── */
.aether-journey pre { background:#07100b; border:1px solid var(--rule2); border-radius:8px; padding:22px 24px; overflow-x:auto; margin:28px 0; font-family:'JetBrains Mono',monospace; font-size:.95rem; line-height:1.75; color:var(--cream); }
.aether-journey pre .c { color:var(--muted); }
.aether-journey pre .k { color:var(--gold-lt); }
.aether-journey pre .s { color:var(--olive); }
.aether-journey code.inline { font-family:'JetBrains Mono',monospace; font-size:1em; background:rgba(125,139,74,.16); color:var(--gold-lt); padding:2px 8px; border-radius:4px; border-bottom:none; }
.aj-formula { background:var(--bg3); border:1px solid var(--rule); border-radius:8px; padding:24px 28px; margin:30px 0; font-family:'JetBrains Mono',monospace; font-size:1.05rem; color:var(--gold-lt); text-align:center; line-height:2; }
.aj-formula span { color:var(--muted); }
/* ── Timeline ── */
.aj-timeline { position:relative; margin:34px 0 12px; padding-left:6px; }
.aj-timeline::before { content:''; position:absolute; left:19px; top:8px; bottom:8px; width:2px; background:linear-gradient(var(--olive),var(--clay)); }
.aj-tl { display:grid; grid-template-columns:46px 1fr; gap:18px; margin-bottom:26px; }
.aj-tl-dot { width:44px; height:44px; border-radius:50%; background:var(--bg2); border:2px solid var(--gold); display:flex; align-items:center; justify-content:center; font-size:.74rem; font-family:'JetBrains Mono',monospace; color:var(--gold); font-weight:600; position:relative; z-index:1; }
.aj-tl-body h3 { font-size:1.25rem; color:var(--gold-lt); margin:8px 0 6px; }
.aj-tl-body p { font-size:1.1rem; color:var(--muted); margin:0; line-height:1.7; }
/* ── CTA ── */
.aj-cta { background:linear-gradient(135deg,var(--jungle),var(--bg2)); border:1px solid var(--gold-dk); border-radius:12px; padding:40px 34px; margin:48px 0; text-align:center; }
.aj-cta h2 { margin-bottom:12px; }
.aj-cta p { color:var(--cream); max-width:640px; margin:0 auto 24px; font-size:1.15rem; line-height:1.7; }
.aj-btn { display:inline-block; background:var(--gold); color:var(--bg); font-weight:700; padding:16px 34px; border-radius:6px; letter-spacing:.02em; font-size:1.05rem; border:none; transition:transform .15s ease, background .15s ease; }
.aj-btn:hover { background:var(--gold-lt); color:var(--bg); transform:translateY(-2px); }
@media (max-width:640px) {
.aj-statbar { grid-template-columns:repeat(2,1fr); }
}</style><div class="aether-journey"><div class="aj-hero"><video class="aj-hero-video" autoplay loop muted playsinline poster="/images/ather22.jpg"><source src="/images/ather2.mp4" type="video/mp4"/></video><div class="aj-hero-scrim"></div><div class="aj-hero-in"><span class="aj-kicker">Implementation Journey · Part II</span><h1>Aether, Grown Wild</h1><p class="aj-lead" style="margin-top:22px;">The first article was implementing the base idea. This one is the expedition — how the idea evolved and things got added as we moved forward, growing into a 13-agent, web-first, self-escalating system, and every bug in the undergrowth that shaped it.</p><div class="aj-hero-meta"><span>By <strong>Ajay Walia</strong></span><span>June 2026</span><span><strong>v2.6 → v2.8.2</strong></span><span>10 min read</span></div></div></div><div class="aj-recap" style="margin-top:12px;"><p>New here? Start with the original field guide — <a href="/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/">"I Built a Team of IT Architects Using LLM That Live on MacBook — Meet Aether."</a> That post laid out the thought. This one is what happened when the thought met real queries.</p></div><p class="aj-lead">Every design survives contact with the page. Then you run it.</p><p>Aether v2.6 worked end-to-end on day one — route, retrieve, build, generate, score, escalate, audit. And almost every lesson since came from running that clean little machine against questions it had never seen before.</p><p>The original architecture made a single bet: one model can't be an expert at everything, so build a tree of narrow experts and let them escalate on doubt. The bet held. But the path from v2.6 to v2.8.2 reshaped almost everything around it.
The agent roster grew, the router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and the confidence score stopped being a thing the model <em>claimed</em> and became a thing the system <em>computed</em>.</p><div class="aj-statbar"><div class="aj-stat"><span class="num">13</span><span class="lbl">Agents · 3 tiers</span></div><div class="aj-stat"><span class="num">v2.8.2</span><span class="lbl">Current release</span></div><div class="aj-stat"><span class="num">0.7</span><span class="lbl">Escalation threshold</span></div><div class="aj-stat"><span class="num">ZERO</span><span class="lbl">Egress · API cost</span></div></div><p class="aj-label">The Delta</p><h2 id="where-the-thought-and-the-build-diverged">Where the thought and the build diverged</h2><p>The blueprint described ten agents, a Gemma 4 26B model, knowledge-base-first retrieval, and a confidence number the model appended to its own answer. Run it for a week and three of those four assumptions bend out of shape:</p><ul class="leaf"><li><strong>Ten agents became thirteen.</strong> The flat roster of technology specialists reorganised into a four-agent network sub-branch and a consolidated digital-workplace branch.</li><li><strong>Knowledge-base-first became web-first.</strong> The local knowledge base started empty, so retrieval now scrapes a vendor allowlist first and falls back to the KB only when the web is thin.</li><li><strong>Self-reported confidence became computed.</strong> "Confidence: 0.92" was theatre; a formula over retrieval quality, domain fit and citation density replaced it.</li><li><strong>An idea became a discipline.</strong> With Git disabled, a hand-written <code class="inline">CHANGELOG.md</code> turned into the single source of truth — every fix and reversal, dated.</li></ul><p class="aj-label">The Pack</p><h2 id="from-ten-to-thirteen--the-hierarchy-regrows">From ten to thirteen — the hierarchy regrows</h2><p>The three-tier shape held: one Enterprise Architect
at the top, three Domain Architects beneath, and a layer of deep technology specialists at the base. What changed was the base. The original Intune, AVD and Citrix agents were too narrow and overlapped each other, so they were folded into broader, sharper roles — and the network domain, barely a single agent before, grew a full four-specialist sub-branch.</p><div class="aj-visual"><svg viewBox="0 0 860 430" xmlns="http://www.w3.org/2000/svg" font-family="Noto Sans, sans-serif"><defs><linearGradient id="t1" x1="0" y1="0" x2="0" y2="1"><stop offset="0" stop-color="#d9a521"/><stop offset="1" stop-color="#8a6610"/></linearGradient></defs><g stroke="#4a5429" stroke-width="1.6" fill="none" opacity="0.8"><path d="M430,72 L200,150"/><path d="M430,72 L430,150"/><path d="M430,72 L660,150"/><path d="M200,182 L120,250"/><path d="M200,182 L260,250"/><path d="M430,182 L430,250"/></g><rect x="330" y="40" width="200" height="34" rx="7" fill="url(#t1)"/><text x="430" y="62" text-anchor="middle" font-size="13" font-weight="700" fill="#0c1410">TIER 1 · Enterprise Architect</text><g font-size="12" font-weight="600" fill="#e7e0cf"><rect x="120" y="150" width="160" height="32" rx="6" fill="#2c5239" stroke="#7d8b4a"/><text x="200" y="171" text-anchor="middle">Cloud Domain</text><rect x="350" y="150" width="160" height="32" rx="6" fill="#2c5239" stroke="#7d8b4a"/><text x="430" y="171" text-anchor="middle">DWP Domain</text><rect x="580" y="150" width="160" height="32" rx="6" fill="#2c5239" stroke="#7d8b4a"/><text x="660" y="171" text-anchor="middle">Network Domain</text></g><g font-size="10.5" fill="#0c1410" font-weight="600"><rect x="70" y="250" width="64" height="28" rx="5" fill="#c98a5b"/><text x="102" y="268" text-anchor="middle">AWS</text><rect x="142" y="250" width="64" height="28" rx="5" fill="#c98a5b"/><text x="174" y="268" text-anchor="middle">Azure</text><rect x="214" y="250" width="64" height="28" rx="5" fill="#c98a5b"/><text x="246" y="268" text-anchor="middle">GCP</text><rect 
x="358" y="250" width="74" height="28" rx="5" fill="#c98a5b"/><text x="395" y="268" text-anchor="middle">MS DWP</text><rect x="436" y="250" width="80" height="28" rx="5" fill="#c98a5b"/><text x="476" y="268" text-anchor="middle">End-User Virt</text><rect x="540" y="290" width="80" height="28" rx="5" fill="#c98a5b"/><text x="580" y="308" text-anchor="middle">Core Net</text><rect x="626" y="290" width="92" height="28" rx="5" fill="#c98a5b"/><text x="672" y="308" text-anchor="middle">SD-WAN/SASE</text><rect x="540" y="324" width="80" height="28" rx="5" fill="#c98a5b"/><text x="580" y="342" text-anchor="middle">Net Security</text><rect x="626" y="324" width="92" height="28" rx="5" fill="#c98a5b"/><text x="672" y="342" text-anchor="middle">NetOps AIOps</text></g><g stroke="#4a5429" stroke-width="1.4" fill="none" opacity="0.7"><path d="M660,182 L580,290 M660,182 L672,290 M660,182 L580,324 M660,182 L672,324"/></g></svg></div><p class="aj-cap">Thirteen agents across three tiers. Each domain owns a colour; specialists inherit it. 
Low confidence climbs the parent chain → Domain → Enterprise.</p><h3 id="added-to-the-pack">Added to the pack</h3><ul class="leaf"><li><strong>Microsoft DWP Technology Architect</strong> — a broad M365, Intune, Entra, Defender and Copilot specialist.</li><li><strong>End-User Virtualization</strong> — Citrix, Horizon, AVD and FSLogix under one roof.</li><li><strong>A four-agent network sub-branch</strong> — Core Networking, SD-WAN/SASE, Network Security and NetOps AIOps.</li></ul><h3 id="cut-from-the-undergrowth">Cut from the undergrowth</h3><ul class="leaf"><li><strong>The standalone Virtualization Domain Architect</strong> — folded into DWP.</li><li><strong>The Intune-only agent</strong> — replaced by the broader Microsoft DWP role.</li><li><strong>The AVD- and Citrix-specific agents</strong> — absorbed by End-User Virtualization.</li><li><strong>The retired KB taxonomy</strong> — <code class="inline">kb_intune</code>, <code class="inline">kb_avd</code> and <code class="inline">kb_citrix</code>.</li></ul><p class="aj-label">core/router.py</p><h2 id="routing--the-hardest-won-code-in-the-repo">Routing — the hardest-won code in the repo</h2><p>Routing decides which expert hears the question. It sounds trivial — match keywords, pick an agent — and it turned out to be the single biggest source of subtle, infuriating bugs.
The router now runs four ordered passes:</p><ul class="leaf"><li><strong>Forced design-doc override.</strong> Phrases like "solution design document" bypass keyword scanning entirely, and any mention of "Copilot" routes straight to the Microsoft DWP specialist.</li><li><strong>Tier-3 keyword match.</strong> The most specific specialists are scanned first, with <code class="inline">\b</code> word-boundary regex to stop false hits.</li><li><strong>Tier-2 domain match.</strong> Broader strategy keywords catch domain-level queries that no specialist claimed.</li><li><strong>Default to Enterprise.</strong> Anything unmatched falls through to the catch-all at the top.</li></ul><style>
.routing-anim-wrapper {
display: flex;
flex-direction: column;
gap: 1rem;
margin: 3rem 0;
}
@media (min-width: 1024px) {
.routing-anim-wrapper {
flex-direction: row;
}
}
.routing-anim {
flex: 1.3;
position: relative;
background: #0f172a;
border: 1px solid #1e293b;
border-radius: 1rem;
padding: 2rem;
height: 520px;
overflow: hidden;
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
box-shadow: inset 0 2px 20px rgba(0,0,0,0.5);
}
.route-debug {
flex: 0.7;
background: #020617;
border: 1px solid #1e293b;
border-radius: 1rem;
padding: 1.5rem;
height: 520px;
overflow-y: auto;
box-shadow: inset 0 0 10px rgba(0,0,0,0.8);
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
}
.debug-header {
font-size: 0.8rem;
font-weight: 700;
color: #64748b;
margin-bottom: 1rem;
letter-spacing: 0.05em;
border-bottom: 1px solid #1e293b;
padding-bottom: 0.75rem;
display: flex;
align-items: center;
gap: 0.5rem;
}
.debug-dot {
width: 8px;
height: 8px;
background: #10b981;
border-radius: 50%;
animation: pulse-dot 2s infinite;
}
@keyframes pulse-dot { 50% { opacity: 0.5; } }
.debug-code {
color: #38bdf8;
font-size: 0.8rem;
line-height: 1.6;
white-space: pre-wrap;
}
.route-line {
position: absolute;
top: 0;
bottom: 0;
left: 80px;
width: 2px;
background: repeating-linear-gradient(to bottom, #334155 0, #334155 10px, transparent 10px, transparent 20px);
z-index: 1;
}
.route-packet {
position: absolute;
left: 80px;
top: -40px;
transform: translate(-50%, -50%);
background: #0ea5e9;
color: white;
padding: 0.5rem 1rem;
border-radius: 999px;
font-size: 0.875rem;
font-weight: 600;
z-index: 20;
box-shadow: 0 0 20px rgba(14, 165, 233, 0.5);
white-space: nowrap;
transition: top 0.5s ease-in-out, background 0.3s, box-shadow 0.3s, opacity 0.3s;
}
.route-step {
position: absolute;
left: 140px;
right: 20px;
background: #1e293b;
border: 1px solid #334155;
color: #94a3b8;
padding: 1rem 1.25rem;
border-radius: 0.5rem;
text-align: left;
z-index: 10;
transition: all 0.3s ease;
transform: translateY(-50%);
}
.route-step::before {
content: '';
position: absolute;
top: 50%;
left: -60px;
transform: translate(-50%, -50%);
width: 12px;
height: 12px;
background: #0f172a;
border: 2px solid #334155;
border-radius: 50%;
transition: border-color 0.3s, background 0.3s;
z-index: 2;
}
.route-step::after {
content: '';
position: absolute;
top: 50%;
left: -60px;
width: 60px;
height: 2px;
background: #334155;
transform: translateY(-50%);
z-index: 1;
transition: background 0.3s;
}
.route-step.active-scan {
border-color: #0ea5e9;
background: rgba(14, 165, 233, 0.05);
}
.route-step.active-scan::before {
border-color: #0ea5e9;
background: #0ea5e9;
}
.route-step.active-scan::after {
background: #0ea5e9;
}
.route-step.match {
border-color: #10b981;
background: rgba(16, 185, 129, 0.05);
box-shadow: 0 0 15px rgba(16, 185, 129, 0.1);
}
.route-step.match::before {
border-color: #10b981;
background: #10b981;
}
.route-step.match::after {
background: #10b981;
}
.r-step-1 { top: 80px; }
.r-step-2 { top: 190px; }
.r-step-3 { top: 300px; }
.r-step-4 { top: 410px; }
.step-title {
font-size: 0.9rem;
font-weight: 700;
color: #f8fafc;
margin-bottom: 0.25rem;
}
.step-desc {
font-size: 0.8rem;
color: #94a3b8;
line-height: 1.4;
}</style><div class="routing-anim-wrapper not-prose" id="routing-anim-wrapper"><div class="routing-anim" id="routing-anim"><div class="route-line"></div><div class="route-step r-step-1" id="rs-1"><div class="step-title">1. Forced Override</div><div class="step-desc">Certain keywords (like "solution design document" or "Copilot") bypass deep scanning and hardcode a route to a specific agent.</div></div><div class="route-step r-step-2" id="rs-2"><div class="step-title">2. Tier-3 Specialist</div><div class="step-desc">Scans for highly specific technologies using exact word boundaries (<code class="inline">\bids\b</code>). Prevents broad terms from triggering niche specialists.</div></div><div class="route-step r-step-3" id="rs-3"><div class="step-title">3. Tier-2 Domain Architect</div><div class="step-desc">Catches broader domain-level queries (like "firewall" or "routing") if no specific L3 specialist claimed the question.</div></div><div class="route-step r-step-4" id="rs-4"><div class="step-title">4. Default to Enterprise</div><div class="step-desc">The catch-all bucket. If a query falls all the way through without matching anything, it goes to the Enterprise Architect.</div></div><div class="route-packet" id="r-packet">Starting...</div></div><div class="route-debug"><div class="debug-header"><div class="debug-dot"></div> ROUTER ENGINE TRACE</div><div class="debug-code" id="debug-log">Waiting for query...</div></div></div><script>
document.addEventListener('DOMContentLoaded', function() {
const packet = document.getElementById('r-packet');
const debugLog = document.getElementById('debug-log');
const steps = [
document.getElementById('rs-1'),
document.getElementById('rs-2'),
document.getElementById('rs-3'),
document.getElementById('rs-4')
];
if(!packet || !steps[0] || !debugLog) return;
const scenarios = [
{ query: '"How to enable Copilot"', targetStep: 1, targetName: 'Microsoft DWP' },
{ query: '"BGP flapping issue"', targetStep: 2, targetName: 'Network L3 Tech' },
{ query: '"Firewall strategy"', targetStep: 3, targetName: 'Network Sec Arch' },
{ query: '"General IT strategy"', targetStep: 4, targetName: 'Enterprise Arch' }
];
let currentScenario = 0;
const sleep = ms => new Promise(r => setTimeout(r, ms));
async function runAnimation() {
while(true) {
const s = scenarios[currentScenario];
const reqId = "req_" + Math.random().toString(36).substring(2, 8);
let debugState = {
request_id: reqId,
query: s.query.replace(/"/g, ''),
status: "scanning",
cache: "redis_miss",
scan_log: []
};
const renderDebug = () => {
debugLog.textContent = JSON.stringify(debugState, null, 2);
};
packet.style.transition = 'none';
packet.style.top = '-40px';
packet.style.opacity = '1';
packet.style.background = '#0ea5e9';
packet.style.boxShadow = '0 0 20px rgba(14, 165, 233, 0.5)';
packet.textContent = s.query;
steps.forEach(step => step.classList.remove('active-scan', 'match'));
renderDebug();
await sleep(2000);
packet.style.transition = 'top 1.5s ease-in-out, background 0.3s, box-shadow 0.3s, opacity 0.3s';
for (let i = 0; i < s.targetStep; i++) {
let stepTop = 80 + (i * 110);
packet.style.top = stepTop + 'px';
steps[i].classList.add('active-scan');
debugState.scan_log.push(`evaluating_tier_${i+1}`);
renderDebug();
await sleep(2000);
if (i < s.targetStep - 1) {
packet.style.background = '#ef4444';
packet.style.boxShadow = '0 0 20px rgba(239, 68, 68, 0.5)';
packet.textContent = "✗ No match";
debugState[`tier_${i+1}_match`] = false;
renderDebug();
await sleep(2000);
packet.style.background = '#0ea5e9';
packet.style.boxShadow = '0 0 20px rgba(14, 165, 233, 0.5)';
packet.textContent = s.query;
steps[i].classList.remove('active-scan');
} else {
steps[i].classList.remove('active-scan');
steps[i].classList.add('match');
packet.style.background = '#10b981';
packet.style.boxShadow = '0 0 30px rgba(16, 185, 129, 0.8)';
packet.textContent = "✓ Routed: " + s.targetName;
debugState.status = "routed";
debugState[`tier_${i+1}_match`] = true;
debugState.target_agent = s.targetName.replace(/ /g, '_').toLowerCase();
debugState.cache_action = "set_redis_ttl_3600";
renderDebug();
await sleep(6000);
packet.style.opacity = '0';
steps[i].classList.remove('match');
await sleep(2500);
}
}
currentScenario = (currentScenario + 1) % scenarios.length;
}
}
runAnimation();
});
</script><p>The bugs here were the kind that hide in plain sight. A bare <code class="inline">"ids"</code> keyword matched the "IDs" in <em class="q">"track record IDs"</em> and wrongly summoned the Network Security agent — fixed by swapping it for the explicit phrase <code class="inline">"intrusion detection system"</code>. Worse, the Tier-2 rules stored their keywords as double-escaped regex (<code class="inline">\\bfirewall\\b</code>), which turned the backslashes literal, so those rules could <em>never</em> match and three domain architects sat silently unreachable.
A small <code class="inline">_kw_matches()</code> helper now treats real regex as regex and plain words as plain words, while Redis caches each verdict for an hour and degrades gracefully if it goes down.</p><p class="aj-label">The Single Beast</p><h2 id="one-local-gemma-thirteen-personas">One local Gemma, thirteen personas</h2><p>The headline trick from Part I survived intact, and got sharper. There is still exactly one model resident in memory: <code class="inline">gemma-4-26b-a4b-it-mlx</code> — Google's open Gemma family, the instruction-tuned variant in an active-parameter configuration, MLX-quantized for Apple silicon and served through LM Studio's OpenAI-compatible API at <code class="inline">localhost:1234</code>. Thirteen specialists, one set of weights. The persona switch lives entirely in the YAML manifest:</p><pre><span class="c"># agents/enterprise_architect.yaml</span>
<span class="k">id:</span> enterprise_architect
<span class="k">tier:</span> 1
<span class="k">namespace:</span> kb_enterprise
<span class="k">parent:</span> null
<span class="k">temperature:</span> 0.2
<span class="k">max_tokens:</span> 3000
<span class="k">system_prompt:</span> |
  <span class="s">DOMAIN MANDATE</span> — out-of-domain query → refuse, hard-code 0.0 confidence.
  <span class="s">ENTERPRISE MANDATE</span> — always cover Identity (AD/Entra, RBAC) and HA/DR;
  state assumptions, never claim 100%.
  <span class="s">STYLE MANDATE</span> — no invented URLs, no gratuitous TOGAF/ISO name-drops,
  plain-text arrows, never LaTeX.
  <span class="s">DIAGRAM RULE</span> — emit a concise Mermaid block (≤15 nodes) for architectures.</pre><p>That manifest <em>is</em> the agent. The mandates are the lessons of a hundred bad answers, hardened into rules: out-of-domain questions get refused and forced to escalate, every architecture must address identity and disaster recovery, and decorative fake citations are banned outright. There is also a discipline the local hardware taught the hard way — large design templates once overflowed the context window and OOM-crashed LM Studio, forcing careful tuning of how much web context gets injected per query.</p><p class="aj-label">The v2.8 Redesign</p><h2 id="retrieval-flips-web-first-kb-as-fallback">Retrieval flips: web-first, KB as fallback</h2><p>This was the biggest architectural change since the blueprint. The original design retrieved from the local LanceDB knowledge base first. The problem was mundane and fatal: the knowledge base started empty. An empty KB meant a zero retrieval score, which meant zero confidence, which meant every single query escalated all the way to the Enterprise Architect for nothing.</p><p>So the pipeline flipped. The orchestrator now scrapes trusted vendor documentation <em>before</em> it touches the KB, and only falls back to local documents when the web returns less than about 1,200 characters of usable text.</p><style>
.retrieval-anim {
position: relative;
background: #0f172a;
border: 1px solid #1e293b;
border-radius: 1rem;
padding: 0 40px;
margin: 3rem 0 1rem 0;
height: 260px;
display: flex;
align-items: center;
justify-content: space-between;
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
box-shadow: inset 0 2px 20px rgba(0,0,0,0.5);
overflow: hidden;
}
.ret-line {
position: absolute;
top: 50%;
left: 40px;
right: 40px;
height: 2px;
background: #334155;
transform: translateY(-50%);
z-index: 1;
}
.ret-node-wrapper {
position: relative;
z-index: 10;
display: flex;
flex-direction: column;
align-items: center;
}
.ret-node {
padding: 0.8rem 1.2rem;
border-radius: 0.4rem;
font-size: 1rem;
font-weight: 700;
text-align: center;
color: #0c1410;
transition: all 0.3s ease;
box-shadow: 0 2px 5px rgba(0,0,0,0.3);
border: 2px solid transparent;
}
.node-green { background: #7d8b4a; }
.node-gold { background: #d9a521; }
.node-brown { background: #c98a5b; }
.ret-node.active {
box-shadow: 0 0 20px rgba(255,255,255,0.4);
transform: scale(1.15);
border-color: #f8fafc;
}
.ret-node.skipped {
opacity: 0.3;
filter: grayscale(80%);
}
.ret-packet {
position: absolute;
top: 50%;
left: -100px;
transform: translate(-50%, -50%);
background: #0ea5e9;
color: white;
padding: 0.5rem 1rem;
border-radius: 999px;
font-size: 0.9rem;
font-weight: bold;
z-index: 20;
white-space: nowrap;
box-shadow: 0 0 15px rgba(14, 165, 233, 0.6);
transition: left 0.6s ease-in-out, top 0.6s ease-in-out, opacity 0.3s;
}
.ret-log {
position: absolute;
bottom: 20px;
left: 50%;
transform: translateX(-50%);
color: #38bdf8;
font-size: 0.9rem;
background: #020617;
padding: 0.5rem 1.2rem;
border-radius: 0.5rem;
border: 1px solid #1e293b;
opacity: 0;
transition: opacity 0.3s;
white-space: nowrap;
}
.kb-label {
position: absolute;
top: 100%;
margin-top: 12px;
font-size: 0.8rem;
color: #94a3b8;
white-space: nowrap;
}
@media (max-width: 768px) {
.retrieval-anim {
flex-direction: column;
padding: 40px 0;
height: 600px;
}
.ret-line {
top: 40px; bottom: 40px; left: 50%; right: auto; width: 2px; height: auto; transform: translateX(-50%);
}
.kb-label {
left: 100%; top: 50%; margin-top: 0; margin-left: 12px; transform: translateY(-50%);
}
}</style><div class="retrieval-anim not-prose" id="ret-container"><div class="ret-line"/><div class="ret-node-wrapper"><div class="ret-node node-green" id="rn-1">Route</div></div><div class="ret-node-wrapper"><div class="ret-node node-gold" id="rn-2">Web primary</div></div><div class="ret-node-wrapper"><div class="ret-node node-brown" id="rn-3">KB fallback</div><div class="kb-label">only if web &lt; 1200 chars</div></div><div class="ret-node-wrapper"><div class="ret-node node-green" id="rn-4">History</div></div><div class="ret-node-wrapper"><div class="ret-node node-green" id="rn-5">Build</div></div><div class="ret-node-wrapper"><div class="ret-node node-gold" id="rn-6">Gemma</div></div><div class="ret-packet" id="ret-packet">Query</div><div class="ret-log" id="ret-log">Logs</div></div><p class="aj-cap" style="margin-top: 0; margin-bottom: 3rem;">The v2.8 retrieval path — web-first against a vendor allowlist, with the knowledge base merging in only when the web comes back thin.</p><script>
document.addEventListener('DOMContentLoaded', function() {
const container = document.getElementById('ret-container');
const packet = document.getElementById('ret-packet');
const log = document.getElementById('ret-log');
const nodes = [
document.getElementById('rn-1'),
document.getElementById('rn-2'),
document.getElementById('rn-3'),
document.getElementById('rn-4'),
document.getElementById('rn-5'),
document.getElementById('rn-6')
];
if(!packet || !nodes[0]) return;
const scenarios = [
{
query: '"AWS VPC Limits"',
webRes: 'Web hit: 5,420 chars (AWS Docs)',
skipKb: true
},
{
query: '"Project Aether Spec"',
webRes: 'Web hit: 120 chars (Thin)',
kbRes: 'KB hit: 8,300 chars (Local LanceDB)',
skipKb: false
}
];
let cur = 0;
const sleep = ms => new Promise(r => setTimeout(r, ms));
function moveTo(index) {
let rect = nodes[index].getBoundingClientRect();
let cRect = container.getBoundingClientRect();
packet.style.left = (rect.left - cRect.left + rect.width / 2) + 'px';
packet.style.top = (rect.top - cRect.top + rect.height / 2) + 'px';
}
async function run() {
while(true) {
const s = scenarios[cur];
packet.style.transition = 'none';
if (window.innerWidth < 768) {
packet.style.left = '50%';
packet.style.top = '-50px';
} else {
packet.style.left = '-100px';
packet.style.top = '50%';
}
packet.style.opacity = '1';
packet.textContent = s.query;
log.style.opacity = '0';
nodes.forEach(n => n.classList.remove('active', 'skipped'));
await sleep(2000);
packet.style.transition = 'left 1.5s ease-in-out, top 1.5s ease-in-out, opacity 0.3s';
// Step 0: Route
moveTo(0);
nodes[0].classList.add('active');
await sleep(2000);
nodes[0].classList.remove('active');
// Step 1: Web primary
moveTo(1);
nodes[1].classList.add('active');
log.textContent = "Scraping trusted vendor docs...";
log.style.opacity = '1';
await sleep(2500);
log.textContent = s.webRes;
await sleep(2500);
nodes[1].classList.remove('active');
// Step 2: KB
moveTo(2);
if (s.skipKb) {
nodes[2].classList.add('skipped');
log.textContent = "Skipping KB (web threshold met)";
await sleep(2500);
} else {
nodes[2].classList.add('active');
log.textContent = "Web thin. Querying local LanceDB...";
await sleep(2500);
log.textContent = s.kbRes;
await sleep(2500);
nodes[2].classList.remove('active');
}
// Step 3: History
moveTo(3);
nodes[3].classList.add('active');
log.textContent = "Appending chat history...";
await sleep(2000);
nodes[3].classList.remove('active');
// Step 4: Build
moveTo(4);
nodes[4].classList.add('active');
log.textContent = "Constructing final prompt...";
await sleep(2000);
nodes[4].classList.remove('active');
// Step 5: Gemma
moveTo(5);
nodes[5].classList.add('active');
log.textContent = "Streaming to gemma-4-26b...";
packet.style.background = '#10b981';
packet.style.boxShadow = '0 0 20px rgba(16, 185, 129, 0.6)';
await sleep(4000);
packet.style.opacity = '0';
log.style.opacity = '0';
nodes[5].classList.remove('active');
packet.style.background = '#0ea5e9';
packet.style.boxShadow = '0 0 15px rgba(14, 165, 233, 0.6)';
await sleep(3000);
cur = (cur + 1) % scenarios.length;
}
}
run();
});</script><ul class="leaf"><li><strong>A curated vendor-domain allowlist.</strong> Results are restricted to Microsoft Learn, AWS, Google Cloud, Cisco, Palo Alto, NIST and others, with suffix-safe matching that blocks spoofs like <code class="inline">cisco.com.evil.com</code>.</li><li><strong>Per-agent ranking.</strong> <code class="inline">AGENT_DOMAINS</code> ranks each specialist's preferred docs first, so the AWS agent leans on AWS documentation before anything else.</li><li><strong>Source-agnostic results.</strong> Web hits are reshaped to look exactly like KB results (<code class="inline">source / url / text / score</code>), so the confidence math works identically on either.</li><li><strong>Rich metadata.</strong> A schema of <code class="inline">domain</code>, <code class="inline">vendor</code>, <code class="inline">document_type</code> and <code class="inline">version_date</code> travels with every chunk into the prompt.</li></ul><p class="aj-label">Trust, Computed</p><h2 id="confidence-is-math-not-vibes">Confidence is math, not vibes</h2><p>In v2.6 the model ended each answer with <code class="inline">Confidence: 0.92</code> and the orchestrator believed it. The trouble is that a model confidently answering an AWS question with Azure facts will happily rate itself 0.92 too. Self-report is theatre. So confidence became a number the <em>system</em> computes, before and after generation:</p><div class="aj-formula">
pre_gen <span>=</span> 0.6 · retrieval_score <span>+</span> 0.4 · namespace_overlap<br>
confidence <span>=</span> min(1.0, best_pre_gen <span>+</span> 0.2 · citation_density)</div><ul class="leaf"><li><strong>retrieval_score</strong> — the quality of the retrieved documents. With an empty KB it is derived from web hits: 0.85 for preferred vendor domains, 0.70 otherwise.</li><li><strong>namespace_overlap</strong> — does the query hit the agent's keywords? A strong match scores about 0.85; off-topic collapses to 0.1, all but guaranteeing escalation.</li><li><strong>citation_density</strong> — the share of claims backed by <code class="inline">[1] [2]</code> sources, rewarding grounded answers with up to a 0.2 boost.</li></ul><p>Below the 0.7 threshold, the query climbs to the parent agent, which re-retrieves against its own namespace and re-answers. One subtle fix mattered here: the strongest <code class="inline">_best_pre_gen</code> score is carried up the escalation chain, so a confident child's score is never erased by a weaker parent re-running the same step. The model no longer judges itself — the architecture does.</p><p class="aj-label">The Undergrowth</p><h2 id="the-bugs-that-shaped-the-design">The bugs that shaped the design</h2><p>Most of the architecture above exists because something broke first. The marquee disaster was the <strong>0% confidence saga</strong> — a cluster of unrelated failures that all produced the same symptom: every query inexplicably crashing to zero confidence and escalating to the top of the tree.</p><ul class="leaf"><li><strong>Silent search failure.</strong> <code class="inline">duckduckgo_search</code> was renamed to <code class="inline">ddgs</code> upstream and returned an empty list.
A catch-all <code class="inline">except</code> swallowed the error, so retrieval quietly went to zero.</li><li><strong>Empty-KB zero score.</strong> With no documents ingested, the retrieval score defaulted to 0.0 — now derived from web-result quality instead.</li><li><strong>Overwritten scores.</strong> Each escalation re-ran <code class="inline">step_build</code> and erased the child's strong score, until <code class="inline">_best_pre_gen</code> was preserved across the chain.</li><li><strong>Dead Tier-2 regex.</strong> Double-escaped <code class="inline">\\b</code> made domain routing rules unreachable, and malformed YAML broke parsing entirely.</li><li><strong>Context overflow.</strong> The expanded universal design template pushed prompts past the local model's context limit — a 400 error until web-context sizes were tuned back down.</li><li><strong>A frozen UI on long jobs.</strong> Design-doc generation runs for minutes and the UI silently froze, until streaming feedback and a 10-minute timeout were added.</li></ul><p class="aj-pull">A single catch-all <code class="inline" style="font-size:1rem;">except</code> turned a renamed package into an invisible, week-long confidence collapse. Fail loud, not silent.</p><p class="aj-label">The Working Contract</p><h2 id="how-the-build-stays-honest">How the build stays honest</h2><p>With Git tracking disabled, discipline had to live somewhere. It lives in two places. The first is a hand-maintained <code class="inline">CHANGELOG.md</code> that records every change, reversal and reason. The second is a behavioural contract the assistant itself follows when modifying the code — think before acting, make the smallest possible diff, verify against success criteria, keep everything auditable and reversible, and prefer less code over more.
It reads less like an engineering process and more like the Law of the Jungle: a few rules everyone keeps, because the alternative is chaos.</p><div class="aj-cta"><h2>See it running — the screenshots</h2><p>Nine captioned frames from the live system: the Gemma model resident in LM Studio, the Gradio chat generating a Microsoft 365 Copilot design document at 90% confidence with live web search, the namespace-per-domain knowledge base, and the design template behind it all.</p><a class="aj-btn" href="/aether-screens.html" target="_blank" rel="noopener">Open the screenshot gallery →</a></div><p class="aj-label">The Map So Far</p><h2 id="six-releases-one-expedition">Six releases, one expedition</h2><div class="aj-timeline"><div class="aj-tl"><div class="aj-tl-dot">2.6.0</div><div class="aj-tl-body"><h3>Base system</h3><p>Routing, RAG, agents and escalation — the first end-to-end build.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.7.0</div><div class="aj-tl-body"><h3>Grounded confidence</h3><p>Dropped LLM self-report for a math formula; added web-search fallback and Mermaid diagrams in the UI.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.7.1</div><div class="aj-tl-body"><h3>Agent reshuffle</h3><p>Retired the Intune, AVD, Citrix and Virtualization agents; added Microsoft DWP, End-User Virt and the network specialists.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.7.4</div><div class="aj-tl-body"><h3>Routing repairs</h3><p>Fixed dead regex rules, corrupted YAMLs, and a hardcoded-path split that loaded stale manifests.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.8.0</div><div class="aj-tl-body"><h3>Web-first retrieval</h3><p>Scrape vendor docs before the KB; vendor allowlist; KB fallback merge.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.8.2</div><div class="aj-tl-body"><h3>Confidence fixes</h3><p>Solved 0% confidence on design docs; preserved the best score across the escalation 
chain.</p></div></div><div class="aj-tl"><div class="aj-tl-dot">2.9.0</div><div class="aj-tl-body"><h3>Clarifying questions</h3><p>Before generating a design, agents now ask two to four targeted questions — org size, compliance, stack, timeline — and fold the answers into a far more specific result.</p></div></div></div><p class="aj-label">Where We Are · What's Next</p><h2 id="wired-working-and-honest">Wired, working, and honest</h2><p>v2.8.2 is a working, daily-use system. The full pipeline runs route → retrieve → build → generate → score → escalate → audit, all thirteen manifests parse, routing and parent maps are consistent, and live vendor-doc search returns current, citable content. What's still open is honest too: the KB folders are wired but largely empty and need real source documents ingested; a rule-ordering overlap can still misroute shared virtualization keywords between End-User Virt and DWP; Git is off; and the context budget stays tight on the local model for large templates.</p><p>The road ahead, in order: populate the knowledge base so that local RAG augments the web results instead of the web carrying every answer alone; resolve the routing overlaps and settle End-User-Virt-versus-DWP ownership; re-enable Git and move off the manual changelog; then build a query test-suite to calibrate the confidence threshold against measured answer quality.</p><p class="aj-label">Lessons from the Trail</p><h2 id="what-the-journey-taught">What the journey taught</h2><ul class="leaf"><li><strong>Fail loud, not silent.</strong> A catch-all <code class="inline">except</code> turned a renamed package into invisible 0% confidence. Surface errors — never swallow them.</li><li><strong>Measure what you trust.</strong> Self-reported confidence is theatre.
Grounding trust in retrieval and citations made escalation actually mean something.</li><li><strong>Narrow beats broad.</strong> Specialists with tight domains hallucinate far less than one generalist trying to know everything.</li><li><strong>Write the changelog.</strong> With Git off, the manual <code class="inline">CHANGELOG.md</code> became the single source of truth for every decision and fix.</li></ul><p style="font-size:1.18rem; color:var(--clay-lt); font-style:italic; line-height:1.6; margin-top:34px;">Aether began as a single sentence — "one model can't know everything." It grew into a hierarchy of grounded, self-aware experts, and the changelog is the proof of the journey. Specialise · Ground · Measure · Escalate.</p></div>
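<p>The scoring rules from "Confidence is math, not vibes" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Aether's actual code: the weights, the cap and the 0.7 threshold come from the formula in that section, while the function names and the <code class="inline">best_along_chain</code> helper are hypothetical.</p>

```python
ESCALATION_THRESHOLD = 0.7  # threshold quoted in the article


def pre_gen_score(retrieval_score: float, namespace_overlap: float) -> float:
    """Pre-generation score: 0.6 * retrieval + 0.4 * namespace overlap."""
    return 0.6 * retrieval_score + 0.4 * namespace_overlap


def final_confidence(best_pre_gen: float, citation_density: float) -> float:
    """Post-generation score: citations add at most 0.2, capped at 1.0."""
    return min(1.0, best_pre_gen + 0.2 * citation_density)


def best_along_chain(pre_gen_scores: list[float]) -> float:
    """Hypothetical helper: keep the strongest pre-gen score seen so far,
    so a weaker parent re-run never erases a confident child's score."""
    best = 0.0
    for score in pre_gen_scores:
        best = max(best, score)
    return best


def should_escalate(confidence: float) -> bool:
    """Escalate to the parent agent when confidence is below the threshold."""
    return confidence < ESCALATION_THRESHOLD
```

<p>With preferred-vendor web hits (0.85) and a strong keyword match (0.85), the pre-generation score is 0.85 and the answer stays with the specialist. An off-topic query (0.70 retrieval, 0.1 overlap) scores 0.46, and even a full citation bonus only lifts it to 0.66, which is still below the threshold.</p>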
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ather22.jpg" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>RAG Chatbot from indexed public documentation</title><link>https://curiousbit.netlify.app/rag-chatbot-from-indexed-public-documentation/</link><guid isPermaLink="true">https://curiousbit.netlify.app/rag-chatbot-from-indexed-public-documentation/</guid><pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as the &lt;strong&gt;Week 15 Graded Mini Project&lt;/strong&gt; of the &lt;strong&gt;IITM Pravartak Professional Certificate Programme in Agentic AI and Applications&lt;/strong&gt;. The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/IITM/IIT1.jpg" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as the <strong>Week 15 Graded Mini Project</strong> of the <strong>IITM Pravartak Professional Certificate Programme in Agentic AI and Applications</strong>.
The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.</p><p>The implementation selects the <strong>Technology</strong> domain and indexes six official pages of the GitHub REST API documentation. The assistant accepts free-form questions in a command-line interface, performs context-aware follow-up handling, returns concise answers with source attribution, and emits a fixed refusal phrase when the retrieved context does not contain the answer.</p><h2 id="architecture">Architecture</h2><p>The pipeline is split into two phases. An offline ingestion stage prepares a reusable vector index; a runtime stage answers user turns by retrieving from that index and prompting a large language model under strict grounding instructions.</p><p><img src="/images/rag-github-architecture.png" alt="RAG architecture for the GitHub REST API assistant, showing ingestion and runtime phases" title="End-to-end flow: ingestion builds the FAISS index; runtime retrieves, rewrites follow-ups, and grounds the answer."/></p><h2 id="public-data-sources">Public Data Sources</h2><p>Six public pages from the official GitHub Docs were saved locally as plain-text files in <code>data/raw/</code>:</p><ul><li>Quickstart for GitHub REST API</li><li>Getting started with the REST API</li><li>Authenticating to the REST API</li><li>Rate limits for the REST API</li><li>Best practices for using the REST API</li><li>Troubleshooting the REST API</li></ul><p>These pages were chosen because they form a self-contained operational knowledge set: a developer integrating with the API typically needs to consult exactly this material when handling authentication, rate-limit pressure, and recovery from common errors.</p><h2 id="components-and-concepts">Components and Concepts</h2><p>The following components and ideas underpin the implementation.
Each is summarised briefly to make the design choices easy to evaluate.</p><p><strong>Retrieval-Augmented Generation (RAG).</strong> A pattern in which a language model is grounded on retrieved passages from an external corpus rather than relying solely on parametric memory. Retrieval reduces hallucination and lets the assistant cite source material it actually consulted.</p><p><strong>LangChain.</strong> The orchestration framework used throughout the project. It provides composable abstractions for document loaders, text splitters, embeddings, vector stores and chat models, which keeps the ingestion and runtime code small and idiomatic.</p><p><strong>RAG Chunking.</strong> Long documents are split into smaller passages before embedding. This implementation uses <code>RecursiveCharacterTextSplitter</code> with a 900-character chunk size and a 150-character overlap. Splitting prefers semantic boundaries (paragraph breaks and Markdown headings) so that each chunk is internally coherent and retrieval surfaces meaningful units rather than fragments.</p><p><strong>OpenAI Embeddings.</strong> Each chunk is converted into a high-dimensional vector using the <code>text-embedding-3-small</code> model. The vector captures the semantic content of the chunk so that questions phrased differently from the source text still retrieve the right passages.</p><p><strong>FAISS.</strong> Facebook AI Similarity Search is the local vector store used to index the embeddings. The index is persisted to disk under <code>vectorstore/github_rest_api_faiss</code> so that ingestion runs once and the runtime simply loads the existing index, satisfying the rubric&rsquo;s reusable-index criterion.</p><p><strong>Top-k Retrieval.</strong> At each user turn the retriever returns the <code>k=4</code> chunks most similar to the query.
A small <code>k</code> keeps the prompt focused and reduces the chance that off-topic passages dilute the grounded answer.</p><p><strong>Follow-up Rewriter.</strong> A short, dedicated LLM call rewrites the user&rsquo;s latest question into a standalone retrieval query using recent conversation history. This is the mechanism that allows ambiguous follow-ups such as &ldquo;what about rate-limit headers?&rdquo; to be embedded and retrieved correctly without the model having to guess the missing antecedent.</p><p><strong>Grounded Answerer.</strong> The final answer is produced by <code>gpt-4o-mini</code> at temperature zero under a system prompt that instructs the model to answer only from the retrieved context, append a <code>Sources:</code> line naming the document titles consulted, and otherwise return the exact refusal phrase.</p><p><strong>Refusal Behaviour.</strong> When retrieval returns no usable context, the runtime short-circuits and emits the required string verbatim: <code>I don't have enough information in the provided documents.</code> The refusal is also enforced by the system prompt, providing defence in depth against hallucinated answers.</p><h2 id="implementation-layout">Implementation Layout</h2><p>The submission is organised as two thin Python modules and a Jupyter notebook. <code>ingest.py</code> loads the text files, removes GitHub Docs templating syntax, extracts a human-readable title from each document, chunks the text, embeds the chunks, and writes the FAISS index. <code>chatbot.py</code> loads the persisted index, manages conversation history, invokes the rewriter and the grounded answerer in sequence, and prints the response with source attribution. The accompanying notebook imports the same functions so that the notebook and the CLI execute identical code paths, eliminating drift between the two surfaces.</p><p>Configuration is environment-driven.
The OpenAI API key is read from a <code>.env</code> file and an optional <code>OPENAI_BASE_URL</code> is honoured so that the same code runs against the Vocareum-provided proxy without modification. No secrets are committed to source.</p><h2 id="sample-conversation">Sample Conversation</h2><p>A short transcript illustrates the three behaviours the rubric tests — grounded answer, follow-up resolution, and refusal.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">You: What headers should I send when making requests to the GitHub REST API?</span></span>
<span class="line"><span class="cl">Assistant: GitHub REST API requests commonly use an Accept header such as</span></span>
<span class="line"><span class="cl">application/vnd.github+json. Authenticated requests send an Authorization</span></span>
<span class="line"><span class="cl">header with Bearer YOUR-TOKEN. GitHub also recommends a valid User-Agent</span></span>
<span class="line"><span class="cl">value, such as your GitHub username or application name.</span></span>
<span class="line"><span class="cl">Sources: Getting started with the REST API; Authenticating to the REST API</span></span>
<span class="line"><span class="cl"></span></span>
<span class="line"><span class="cl">You: What about rate limit headers?</span></span>
<span class="line"><span class="cl">Search query: rate limit response headers GitHub REST API</span></span>
<span class="line"><span class="cl">Assistant: Response headers can include x-ratelimit-remaining and</span></span>
<span class="line"><span class="cl">x-ratelimit-reset, which indicate how many requests remain in the current</span></span>
<span class="line"><span class="cl">period and when the rate limit resets.</span></span>
<span class="line"><span class="cl">Sources: Getting started with the REST API; Rate limits for the REST API</span></span>
<span class="line"><span class="cl"></span></span>
<span class="line"><span class="cl">You: How should an integration handle rate limit errors?</span></span>
<span class="line"><span class="cl">Assistant: If a retry-after header is present, wait that many seconds before</span></span>
<span class="line"><span class="cl">retrying. If x-ratelimit-remaining is 0, wait until the time given by</span></span>
<span class="line"><span class="cl">x-ratelimit-reset. Otherwise, wait at least one minute and then use</span></span>
<span class="line"><span class="cl">exponentially increasing waits for repeated secondary rate-limit failures.</span></span>
<span class="line"><span class="cl">Continuing to make requests while rate-limited may result in a ban.</span></span>
<span class="line"><span class="cl">Sources: Best practices for using the REST API; Troubleshooting the REST API</span></span></code></pre></div><p>The second exchange demonstrates the follow-up rewriter at work: the original &ldquo;what about rate limit headers?&rdquo; is expanded into a self-contained search query before retrieval, which is what allows the retriever to surface the correct passages despite the missing antecedent.</p><p>Now, a deliberately off-topic question — one the index cannot possibly support — triggers the safety path:</p><div style="background:#2a0f14;border:2px solid #ef4444;border-left:6px solid #ef4444;border-radius:10px;padding:18px 22px;margin:22px 0;font-family:'JetBrains Mono','Menlo',monospace;font-size:15px;line-height:1.7;"><div style="color:#ef4444;font-size:12px;font-weight:700;letter-spacing:0.12em;text-transform:uppercase;margin-bottom:12px;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;">⛔ Refusal path — out-of-scope question</div><div style="color:#e6e8ed;"><span style="color:#8c94a5;">You:</span> What is the best laptop to buy for running the GitHub REST API?</div><div style="color:#fca5a5;margin-top:6px;"><span style="color:#8c94a5;">Assistant:</span> I don't have enough information in the provided documents.</div></div><p>No chunk
in the index supports an opinion on hardware, so the assistant returns the mandated refusal verbatim rather than producing a plausible-sounding but ungrounded answer. The refusal is enforced twice — once procedurally when retrieval returns nothing, and once in the system prompt — so the behaviour is stable even when retrieval surfaces weakly related chunks.</p><h2 id="closing-observations">Closing Observations</h2><p>Three observations stand out from this exercise.</p><ul><li><strong>Retrieval quality is set upstream, not by the vector store.</strong> Decisions about cleaning and chunking shaped relevance more than the choice of vector store did. Stripping GitHub Docs templating syntax — the <code>{% data ... %}</code> Liquid blocks, the reusables, and the <code>[AUTOTITLE]</code> link macros — materially improved the relevance of returned chunks because the embeddings stopped clustering around boilerplate.</li><li><strong>Follow-up handling must be engineered, not assumed.</strong> It is not a free behaviour of the language model. A separate rewriter step that converts conversational queries into standalone search queries is the smallest reliable mechanism, and it removes a category of retrieval failures that would otherwise be invisible in casual testing.</li><li><strong>Refusal is a feature, not a fallback.</strong> Enforcing the refusal both procedurally (short-circuit on empty context) and in the system prompt is what gives the assistant a useful posture in production: it will say nothing it cannot support, and it will say so in a predictable way.</li></ul>
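<p>The two-layer refusal described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: <code>REFUSAL</code>, <code>SYSTEM_PROMPT</code>, <code>answer</code>, and the <code>generate</code> callable are hypothetical names, the prompt wording is invented for the example, and retrieval is assumed to hand back a plain list of text chunks.</p>

```python
from typing import Callable, List

# Refusal wording as it appears in the sample transcript.
REFUSAL = "I don't have enough information in the provided documents."

# Hypothetical system prompt: the second, prompt-level refusal layer.
SYSTEM_PROMPT = (
    "Answer only from the provided context. If the context does not "
    f"support an answer, reply exactly: {REFUSAL}"
)


def answer(question: str, chunks: List[str],
           generate: Callable[[str, str], str]) -> str:
    """Return a grounded answer, or the fixed refusal when nothing was retrieved."""
    # Layer 1 (procedural): empty retrieval means the model is never called.
    if not chunks:
        return REFUSAL
    # Layer 2 (prompt): the model is instructed to refuse on weak context.
    context = "\n\n".join(chunks)
    user_message = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(SYSTEM_PROMPT, user_message)


def fake_generate(system_prompt: str, user_message: str) -> str:
    """Stand-in for a real chat-completion call, used only for this demo."""
    return "stubbed answer"


print(answer("What is the best laptop to buy?", [], fake_generate))
```

<p>Passing the model call in as a parameter keeps the short-circuit testable without network access. In a real assistant, <code>generate</code> would wrap whichever chat-completion client is in use.</p>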
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/IITM/IIT1.jpg" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>llm</category><category>automation</category><category>engineering</category><category>Knowledge Base</category></item><item><title>I Built a Team of IT Architects using LLM That Live on MacBook — Meet Aether</title><link>https://curiousbit.netlify.app/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/</link><guid isPermaLink="true">https://curiousbit.netlify.app/i-built-a-team-of-it-architects-using-llm-that-live-on-macbook-meet-aether/</guid><pubDate>Sat, 16 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ather2.jpg" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.</p><p>I built Aether to fix that. It is a local-first, multi-agent AI system that runs a team of 10 specialist IT architecture advisors on a single MacBook Pro M5. No internet after setup. No API costs. No data leaves the machine. 
Just fast, cited, domain-grounded answers — available at 10pm when the deadline hits.</p><p>This is the story of how I built it, what the stack looks like under the hood, and what I learned along the way.</p><hr><h2 id="the-problem-i-was-actually-solving">The Problem I Was Actually Solving</h2><p>I have spent years in IT architecture — cloud, digital workplace, network, end-user computing, the works. Over that time I have accumulated a large personal knowledge base: AWS Well-Architected reviews, Intune compliance policy templates, AVD host pool sizing guides, Citrix NetScaler configurations, TOGAF artefacts, cloud adoption frameworks. The knowledge exists. The problem is retrieval — getting the right answer from the right domain quickly, without context-switching across six different documentation tabs.</p><p>Commercial AI tools are good at general answers. They are not great at answering &ldquo;give me the exact OMA-URI path for configuring Windows Hello for Business through Intune on a hybrid-joined device for a tenant with MFA enforced at the Conditional Access layer.&rdquo; That requires domain depth, and it requires knowing which documents to pull from.</p><p>I also wanted to explore a practical AI use case — not a demo, not a proof of concept, but something I would actually use daily. 
Aether became that experiment.</p><hr><h2 id="what-aether-is">What Aether Is</h2><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:10px;padding:24px 28px;margin:32px 0;font-family:'JetBrains Mono',monospace;font-size:13px;line-height:1.7;color:#e0e0e0;"><span style="color:#00e5ff;font-weight:700;">AETHER v2.6 // SYSTEM DEFINITION</span><br><br><span style="color:#b39ddb;">Type:</span> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Local-First Multi-Agent AI System<br><span style="color:#b39ddb;">Agents:</span> &nbsp;&nbsp;&nbsp;10 (3-tier hierarchy)<br><span style="color:#b39ddb;">Model:</span> &nbsp;&nbsp;&nbsp;&nbsp;Gemma 4 26B A4B (Q4_K_M) — single instance<br><span style="color:#b39ddb;">Runtime:</span> &nbsp;&nbsp;LM Studio → LangGraph → FastAPI → Gradio<br><span style="color:#b39ddb;">Memory:</span> &nbsp;&nbsp;&nbsp;~43 GB of 64 GB unified (M5 Pro)<br><span style="color:#b39ddb;">Egress:</span> &nbsp;&nbsp;&nbsp;<span style="color:#00e5ff;">ZERO</span><br><span style="color:#b39ddb;">API cost:</span> &nbsp;<span style="color:#00e5ff;">ZERO</span><br/><p>Aether is a <strong>local-first AI system that acts as a team of specialist architecture advisors — all running on your MacBook Pro M5</strong>. The headline technical trick: every one of those 10 advisors is the <em>same</em> Gemma 4 26B model, loaded once. What makes each advisor different is purely the system prompt it receives and the knowledge base namespace it retrieves from.</p><p>One model. Ten personas. 
Zero cloud.</p><hr><h2 id="the-three-tier-agent-hierarchy">The Three-Tier Agent Hierarchy</h2><p>The agents are organised the way a real consulting firm would structure a team — from narrow specialist up to cross-domain strategist.</p><p><img src="/images/aether-agent-hierarchy.svg" alt="Aether three-tier agent hierarchy: enterprise architect, domain architects, and technology architects connected by escalation paths"/><p><strong>Tier 1 — Enterprise Architect (1 agent).</strong> Cross-domain strategy, TOGAF, Zachman, governance frameworks (GDPR, ISO 27001, HIPAA), technology investment decisions. This agent can read <em>all</em> knowledge base namespaces — it is the only one with that reach. It is also the final escalation destination.</p><p><strong>Tier 2 — Domain Architects (3 agents).</strong> Cloud Domain (multi-cloud strategy, FinOps, landing zones), Network Domain (SD-WAN, ZTNA, BGP, micro-segmentation), Digital Workplace Domain (Microsoft 365, VDI strategy, device management).</p><p><strong>Tier 3 — Technology Architects (6 agents).</strong> AWS, Azure, GCP, Intune, AVD, Citrix. Each one is scoped tightly to its domain — deep, narrow, and fast.</p><hr><h2 id="the-stack-component-by-component">The Stack, Component by Component</h2><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:10px;padding:28px 32px;margin:32px 0;"><table style="width:100%;border-collapse:collapse;font-family:'JetBrains Mono',monospace;font-size:17px;line-height:1.55;"><tr style="border-bottom:1px solid #1a1a3a;"><td style="padding:14px 16px;color:#00e5ff;font-weight:700;">Component</td><td style="padding:14px 16px;color:#00e5ff;font-weight:700;">What It Does</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Gemma 4 26B A4B (Q4_K_M)</td><td style="padding:13px 16px;color:#e0e0e0;">The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. 
~13 GB VRAM.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">LM Studio</td><td style="padding:13px 16px;color:#e0e0e0;">Local model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">LangGraph</td><td style="padding:13px 16px;color:#e0e0e0;">Orchestration graph — defines the multi-step query pipeline as a typed state machine.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">LanceDB</td><td style="padding:13px 16px;color:#e0e0e0;">Local vector database. One namespace (table) per agent. Fully file-based, no server process needed.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">BAAI/bge-small-en-v1.5</td><td style="padding:13px 16px;color:#e0e0e0;">Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Redis</td><td style="padding:13px 16px;color:#e0e0e0;">Session memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">FastAPI + Uvicorn</td><td style="padding:13px 16px;color:#e0e0e0;">REST API gateway on port 8000. Full query/response model, session management, agent listing.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Gradio</td><td style="padding:13px 16px;color:#e0e0e0;">Web chat UI on port 7860. 
Session management, source citations, escalation chain display.</td></tr><tr style="border-bottom:1px solid #1a1a2a;"><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">Prometheus</td><td style="padding:13px 16px;color:#e0e0e0;">Metrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.</td></tr><tr><td style="padding:13px 16px;color:#b39ddb;vertical-align:top;">SQLite (audit.db)</td><td style="padding:13px 16px;color:#e0e0e0;">Immutable audit trail. Every query is logged before the response is returned. Cannot be skipped.</td></tr></table></div><p>The full system sits comfortably in about 43 GB of the M5 Pro&rsquo;s 64 GB unified memory — leaving 21 GB of headroom for the rest of the machine.</p><hr><h2 id="the-one-model-ten-specialists-trick">The &ldquo;One Model, Ten Specialists&rdquo; Trick</h2><p>This is the part I get asked about most, because it sounds like it should not work.</p><p>Every agent in Aether is defined by a YAML manifest. That manifest specifies a <code>system_prompt</code>, a <code>namespace</code> (which LanceDB table to retrieve from), a <code>temperature</code>, and <code>max_tokens</code>. There is no model switching. There is no weight loading. 
The Gemma 4 26B A4B is loaded once by LM Studio and stays resident.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="c"># agent_manifests/aws_technology_architect.yaml</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">agent_id</span><span class="p">:</span><span class="w"/><span class="l">aws_technology_architect</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">display_name</span><span class="p">:</span><span class="w"/><span class="s2">"AWS Technology Architect"</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">tier</span><span class="p">:</span><span class="w"/><span class="m">3</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">parent_agent</span><span class="p">:</span><span class="w"/><span class="l">cloud_domain_architect</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">namespace</span><span class="p">:</span><span class="w"/><span class="l">aws_tech</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">temperature</span><span class="p">:</span><span class="w"/><span class="m">0.1</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">max_tokens</span><span class="p">:</span><span class="w"/><span class="m">2048</span><span class="w"/></span></span><span class="line"><span class="cl"><span class="nt">system_prompt</span><span class="p">:</span><span class="w"/><span class="p">|</span><span class="sd"/></span></span><span class="line"><span class="cl"><span class="sd"> You are an AWS Technology Architect with deep expertise in the</span></span></span><span class="line"><span class="cl"><span class="sd"> AWS Well-Architected Framework, EC2/EKS/Lambda sizing, IAM 
policy</span></span></span><span class="line"><span class="cl"><span class="sd"> design, CloudFormation and CDK, GuardDuty, and cost optimisation.</span></span></span><span class="line"><span class="cl"><span class="sd"> You draw only from AWS-specific documentation and architecture</span></span></span><span class="line"><span class="cl"><span class="sd"> patterns. When answering, cite the source documents retrieved.</span></span></span><span class="line"><span class="cl"><span class="sd"> End your response with a new line: 'Confidence: X.XX' (0.00–1.00)</span></span></span><span class="line"><span class="cl"><span class="sd"> reflecting how well your knowledge base supports this answer.</span><span class="w"/></span></span></code></pre></div><p>Why does this work? Because Gemma 4 26B A4B is a Mixture of Experts model. It routes each token through specialist sub-networks internally — activating only about 4–6 billion parameters per inference pass, despite having 26 billion total. The practical result: it runs at near 5B speed while retaining the reasoning breadth of a much larger dense model. A single loaded instance can faithfully adopt both a narrow Citrix specialist persona and a broad enterprise strategy persona, because the MoE routing shifts for each.</p><p>The other half of the trick is the knowledge base. Each Tier 3 agent retrieves only from its own LanceDB namespace. The AWS agent never sees a Citrix document. The Intune agent never sees a GCP architecture guide. Domain knowledge is isolated by design — which means the model cannot hallucinate across domain boundaries, because the retrieval context does not cross them.</p><hr><h2 id="confidence-driven-escalation">Confidence-Driven Escalation</h2><p>The escalation mechanism is the design decision I am most proud of.</p><p>Every agent is instructed to append a confidence score to its response — a float between 0.00 and 1.00 representing how well its retrieved knowledge supports the answer. 
The orchestrator reads that score via regex. If it falls below 0.7 <em>and</em> the agent has a parent tier defined <em>and</em> that parent has not already been tried, the system automatically escalates.</p><div style="background:#0d0d1a;border:1px solid #7c4dff;border-radius:10px;padding:28px 32px;margin:32px 0;font-family:'JetBrains Mono',monospace;font-size:17px;line-height:1.9;color:#e0e0e0;"><span style="color:#7c4dff;font-weight:700;">// ESCALATION CHAIN EXAMPLE</span><br><br><span style="color:#00e5ff;">Query:</span> "Give me a multi-cloud strategy covering Azure, AWS, GCP, AVD, and Citrix with network segmentation"<br><br><span style="color:#b39ddb;">→</span> Routed to: <span style="color:#b39ddb;">aws_technology_architect</span><br><span style="color:#b39ddb;">→</span> Confidence: <span style="color:#f44336;">0.41</span> (below 0.7 threshold)<br><span style="color:#b39ddb;">→</span> Escalate to: <span style="color:#b39ddb;">cloud_domain_architect</span><br><span style="color:#b39ddb;">→</span> Confidence: <span style="color:#ff9800;">0.63</span> (still below threshold)<br><span style="color:#b39ddb;">→</span> Escalate to: <span style="color:#b39ddb;">enterprise_architect</span><br><span style="color:#b39ddb;">→</span> Confidence: <span style="color:#4caf50;">0.88</span> ✓<br><br><span style="color:#888;">Response appended with escalation chain for full transparency.</span></div><p>The response shown to the user includes the full escalation path — which level of expertise produced the final answer. This matters in enterprise contexts. It is the difference between &ldquo;the AI said so&rdquo; and &ldquo;the enterprise-level advisor produced this after the technical specialist&rsquo;s knowledge was insufficient.&rdquo;</p><p>The elegance here is that <strong>the model participates in its own routing decision</strong>. 
The orchestration does not need a separate classifier to judge answer quality — the model tells you its own confidence, and the system acts on it.</p><hr><h2 id="the-7-step-query-pipeline">The 7-Step Query Pipeline</h2><p>Every query passes through a typed LangGraph state graph. The state object —<code>AetherState</code> — carries everything between nodes: query, session ID, agent manifest, RAG results, conversation history, messages, response, confidence score, escalation flags, and escalation chain list.</p><div style="background:#0d0d1a;border:1px solid #00e5ff;border-radius:12px;padding:32px;margin:32px 0;overflow-x:auto;"><svg viewBox="0 0 1080 180" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:1080px;display:block;margin:0 auto;"><rect x="10" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="75" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">01</text><text x="75" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">ROUTE</text><line x1="140" y1="85" x2="163" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="165" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="230" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">02</text><text x="230" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">RETRIEVE</text><line x1="295" y1="85" x2="318" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="320" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="385" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">03</text><text x="385" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">HISTORY</text><line x1="450" 
y1="85" x2="473" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="475" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#00e5ff" stroke-width="2.5"/><text x="540" y="83" fill="#00e5ff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">04</text><text x="540" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">BUILD</text><line x1="605" y1="85" x2="628" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="630" y="40" width="130" height="90" rx="10" fill="#1a0d2a" stroke="#7c4dff" stroke-width="3"/><text x="695" y="83" fill="#7c4dff" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">05</text><text x="695" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">LLM</text><line x1="760" y1="85" x2="783" y2="85" stroke="#7c4dff" stroke-width="2" marker-end="url(#arrow2)"/><rect x="785" y="40" width="130" height="90" rx="10" fill="#1a1a0d" stroke="#ff9800" stroke-width="2.5" stroke-dasharray="6,3"/><text x="850" y="83" fill="#ff9800" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">06</text><text x="850" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">ESCALATE?</text><line x1="915" y1="85" x2="938" y2="85" stroke="#00e5ff" stroke-width="2" marker-end="url(#arrow)"/><rect x="940" y="40" width="130" height="90" rx="10" fill="#0d1a2a" stroke="#4caf50" stroke-width="2.5"/><text x="1005" y="83" fill="#4caf50" font-family="monospace" font-size="22" font-weight="bold" text-anchor="middle">07</text><text x="1005" y="110" fill="#b39ddb" font-family="monospace" font-size="18" text-anchor="middle">FINALISE</text><defs><marker id="arrow" markerWidth="8" markerHeight="8" refX="6" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#00e5ff"/></marker><marker id="arrow2" markerWidth="8" markerHeight="8" refX="6" refY="4" orient="auto"><path 
d="M0,0 L8,4 L0,8 Z" fill="#7c4dff"/></marker></defs></svg></div><p><strong>01 — ROUTE.</strong> A keyword scanner maps the query to the best-fit agent, checking Tier 3 rules first (most specific), then Tier 2, then Tier 1 as the catch-all. Redis caches route results for one hour, so repeated queries on the same topic skip the scan entirely.</p><p><strong>02 — RETRIEVE.</strong> Semantic search against that agent&rsquo;s LanceDB namespace — top-5 chunks returned. Documents were ingested in 500-word chunks with 50-word overlap, embedded into 384-dimensional vectors by <code>BAAI/bge-small-en-v1.5</code> running on the Apple Neural Engine.</p><p><strong>03 — HISTORY.</strong> The last three message pairs (six messages) are loaded from Redis for the session. This gives conversational continuity without letting the context window balloon.</p><p><strong>04 — BUILD.</strong> The message payload is assembled: <code>system_prompt + RAG documents + history + current query</code>. The confidence instruction is appended here.</p><p><strong>05 — LLM.</strong> The assembled payload hits LM Studio on port 1234. The orchestrator extracts the confidence score from the response text via regex before passing the response forward.</p><p><strong>06 — ESCALATE (conditional).</strong> If confidence is below 0.7, a parent agent exists, and it has not already been tried this turn — swap in the parent&rsquo;s manifest and loop back to step 02 with fresh retrieval against the parent&rsquo;s namespace.</p><p><strong>07 — FINALISE.</strong> The turn is saved to Redis. The SQLite audit record is written. The escalation chain annotation (if any) is appended to the response. 
The result is returned to the user.</p><hr><h2 id="knowledge-isolation--the-anti-hallucination-architecture">Knowledge Isolation — The Anti-Hallucination Architecture</h2><p><img src="/images/aether-knowledge-isolation.png" alt="Knowledge isolation diagram showing five Tier-3 agents (AWS, Azure, Intune, AVD, Citrix), each with its own walled namespace containing only that domain&rsquo;s documents, and the Tier-1 Enterprise Architect at the top connecting to all of them as the deliberate exception"/><p>One of the most practical decisions in Aether&rsquo;s design is namespace isolation. Each Tier 3 agent retrieves only from its own LanceDB table. The AWS agent&rsquo;s retrieval context will never include a Citrix StoreFront configuration guide — because those documents simply do not exist in its namespace.</p><p>This matters more than it might seem. A common failure mode in RAG systems is <em>cross-domain contamination</em> — where retrieval pulls in tangentially related content from a different domain, and the model confabulates a plausible-sounding but wrong answer by blending the two. Namespace isolation eliminates this at the architectural level.</p><p>The Enterprise Architect at Tier 1 is the deliberate exception — it can query all namespaces, because cross-domain synthesis is exactly what it is built for.</p><hr><h2 id="the-audit-trail--because-enterprise">The Audit Trail — Because Enterprise</h2><p>Every single query is written to <code>audit.db</code> (SQLite) before the response is returned. The record includes: timestamp, session ID, query text, agent used, confidence score, escalation chain, and sources cited. 
The audit write is wrapped in error handling so that a database failure never blocks the main query flow — but the log is never optional.</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="c1"># Simplified from the finalise node</span></span></span><span class="line"><span class="cl"><span class="n">audit_record</span><span class="o">=</span><span class="p">{</span></span></span><span class="line"><span class="cl"><span class="s2">"timestamp"</span><span class="p">:</span><span class="n">datetime</span><span class="o">.</span><span class="n">utcnow</span><span class="p">()</span><span class="o">.</span><span class="n">isoformat</span><span class="p">(),</span></span></span><span class="line"><span class="cl"><span class="s2">"session_id"</span><span class="p">:</span><span class="n">state</span><span class="p">[</span><span class="s2">"session_id"</span><span class="p">],</span></span></span><span class="line"><span class="cl"><span class="s2">"query"</span><span class="p">:</span><span class="n">state</span><span class="p">[</span><span class="s2">"query"</span><span class="p">],</span></span></span><span class="line"><span class="cl"><span class="s2">"agent_id"</span><span class="p">:</span><span class="n">state</span><span class="p">[</span><span class="s2">"agent_id"</span><span class="p">],</span></span></span><span class="line"><span class="cl"><span class="s2">"confidence"</span><span class="p">:</span><span class="n">state</span><span class="p">[</span><span class="s2">"confidence"</span><span class="p">],</span></span></span><span class="line"><span class="cl"><span class="s2">"escalation_chain"</span><span class="p">:</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="s2">"escalation_chain"</span><span 
class="p">]),</span></span></span><span class="line"><span class="cl"><span class="s2">"response_length"</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">state</span><span class="p">[</span><span class="s2">"final_response"</span><span class="p">])</span></span></span><span class="line"><span class="cl"><span class="p">}</span></span></span><span class="line"><span class="cl"><span class="n">db</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"INSERT INTO audit_log VALUES (:timestamp, :session_id, ...)"</span><span class="p">,</span><span class="n">audit_record</span><span class="p">)</span></span></span></code></pre></div><p>For a system giving architecture recommendations — decisions that feed into multi-million dollar cloud commitments — having an immutable audit trail of what was asked, what agent answered, at what confidence level, via what escalation path, is not a nice-to-have. It is the thing that makes it organisationally defensible.</p><hr><h2 id="what-it-can-actually-do">What It Can Actually Do</h2><div style="background:#0d0d1a;border-left:4px solid #00e5ff;padding:26px 30px;margin:28px 0;font-family:'JetBrains Mono',monospace;font-size:17px;line-height:2.0;color:#e0e0e0;border-radius:0 8px 8px 0;"><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Design an Azure landing zone for PCI-DSS compliance"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Right-size EC2 instances for a memory-intensive Java application"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Configure FSLogix profile containers for 500 AVD users"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Zero-trust 
network architecture for 5,000 remote employees"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Cloud adoption roadmap for a financial services firm"</span><br><span style="color:#666;">$</span> <span style="color:#00e5ff;">aether query</span> <span style="color:#e0e0e0;">"Intune compliance policy for iOS BYOD — hybrid-joined, MFA enforced"</span><br/><p>The depth per domain is real. The Intune agent knows OMA-URI paths, Graph API commands, Autopilot profiles, and licensing requirements — because those are the documents I ingested into its namespace. The AVD agent knows host pool design, FSLogix sizing, MSIX app attach, and session host scaling plans. The knowledge base is only as good as what you put into it — but that is also the point. This is <em>my</em> architecture knowledge, curated, searchable, and queryable at any hour.</p><hr><h2 id="what-i-learned-building-this">What I Learned Building This</h2><p><strong>Prompt engineering IS the architecture.</strong> In a system like this, the YAML manifest <em>is</em> the agent. The difference between a brilliant AWS specialist and a generic AI assistant is entirely in what the system prompt says and what documents back it up. Getting those prompts precise, domain-bounded, and calibrated for the right temperature took longer than any of the code.</p><p><strong>MoE models are underrated for local multi-agent work.</strong> The choice of Gemma 4 26B A4B over a dense model was the right call. You get reasoning breadth comparable to a much larger model at the inference cost of a small one, on hardware that most architects already carry.</p><p><strong>Confidence as a first-class citizen.</strong> Asking the model to self-assess and surface that score is one of the highest-value things I added. 
It makes the system honest — and it drives the escalation logic that makes the team metaphor actually work.</p><p><strong>Namespace isolation is a practical hallucination brake.</strong> Not a theoretical one. In the first version of Aether, all documents lived in a single namespace. Cross-domain contamination was visible and annoying. Splitting into per-agent namespaces fixed it immediately.</p><p><strong>Audit trails are not overhead — they are the point.</strong> Every enterprise AI deployment should have one. Building it into the core pipeline from day one changes how you think about what the system is producing.</p><hr><h2 id="what-is-next">What Is Next</h2><p>Aether v2.6 is a working, daily-use system. The next version I am working toward adds a web-based ingestion UI (so loading new documents does not require touching the ingest script), structured output for architecture decision records (ADRs) in a consistent format, and inter-agent communication — where a Tier 3 agent can proactively pull context from a peer rather than waiting for the escalation chain to activate.</p><p>I built Aether because I wanted a team. It turns out a team was always available — it just needed the right prompts and a local model to bring it to life.</p><hr><p><em>Have questions about the stack or want to see specific parts of the implementation? Drop a comment or reach out on <a href="https://www.linkedin.com/in/ajay-walia-8b066a1b/">LinkedIn</a>.</em></p>
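<p>For readers who want something concrete, the confidence-driven escalation described earlier could look roughly like the sketch below. It is a simplified stand-in, not the LangGraph node from Aether: <code>PARENTS</code>, <code>parse_confidence</code>, <code>run_with_escalation</code>, and the scripted responses are hypothetical, and treating a missing score as 0.0 is an assumption of this example. Only the 0.7 threshold, the <code>Confidence: X.XX</code> suffix, and the agent IDs come from the post.</p>

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

THRESHOLD = 0.7  # escalation threshold stated in the post
CONFIDENCE_RE = re.compile(r"Confidence:\s*([01](?:\.\d+)?)")

# Hypothetical parent map; in Aether this would come from each manifest's parent_agent.
PARENTS: Dict[str, Optional[str]] = {
    "aws_technology_architect": "cloud_domain_architect",
    "cloud_domain_architect": "enterprise_architect",
    "enterprise_architect": None,
}


def parse_confidence(text: str) -> float:
    """Read the 'Confidence: X.XX' value; a missing score counts as 0.0 (assumption)."""
    match = CONFIDENCE_RE.search(text)
    return float(match.group(1)) if match else 0.0


def run_with_escalation(agent_id: str,
                        ask: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Ask agents up the hierarchy until one is confident or no untried parent remains."""
    chain = [agent_id]
    while True:
        response = ask(agent_id)
        parent = PARENTS.get(agent_id)
        confident = parse_confidence(response) >= THRESHOLD
        # Stop when confident, at the top tier, or if the parent was already tried.
        if confident or parent is None or parent in chain:
            return response, chain
        agent_id = parent
        chain.append(agent_id)


# Scripted responses that mirror the escalation chain example in the post.
scripted = {
    "aws_technology_architect": "Partial answer.\nConfidence: 0.41",
    "cloud_domain_architect": "Better answer.\nConfidence: 0.63",
    "enterprise_architect": "Full strategy.\nConfidence: 0.88",
}
response, chain = run_with_escalation("aws_technology_architect", scripted.__getitem__)
print(" -> ".join(chain))
```

<p>The <code>parent in chain</code> check plays the role of the &ldquo;not already tried&rdquo; condition, so a misconfigured parent loop cannot spin forever.</p>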
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ather2.jpg" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>automation</category><category>engineering</category><category>llm</category><category>Knowledge Base</category></item><item><title>Building with LLMs in 2026: The Framework Atlas</title><link>https://curiousbit.netlify.app/building-with-llms-in-2026-the-framework-atlas/</link><guid isPermaLink="true">https://curiousbit.netlify.app/building-with-llms-in-2026-the-framework-atlas/</guid><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Four years after LLMs entered the mainstream, the single most common mistake I see architects make is spending most of their decision energy on the model. Which frontier model? GPT or Claude or Gemini? The model choice matters — but it is one decision out of roughly fifteen, and it is far from the most consequential one.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/llm-framework-atlas-cover.jpg" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Four years after LLMs entered the mainstream, the single most common mistake I see architects make is spending most of their decision energy on the model. Which frontier model? GPT or Claude or Gemini? The model choice matters — but it is one decision out of roughly fifteen, and it is far from the most consequential one.</p><p>Building an LLM-powered system in 2026 is an architecture decision made across a stack of competing frameworks, each solving a well-posed problem at a specific layer. I spent several months mapping that landscape as a practitioner — the result is a 73-page whitepaper I call the Framework Atlas. 
This post distils it into the five things I think every architect, engineer, and senior IT leader should know before picking a single tool.</p><hr><h2 id="the-stack-has-a-shape">The Stack Has a Shape</h2><p>There is no single AI stack, but there is a canonical shape. Every non-trivial production LLM system — whether a support chatbot, a document search engine, or a multi-agent workflow — is a composition of six layers:</p><p><strong>Application layer.</strong> The surface your user interacts with. LangChain is the default; Semantic Kernel is the Microsoft-native choice; CrewAI leads when the app itself is agentic.</p><p><strong>Agent layer.</strong> When a single LLM call is not enough — when the system needs to plan, call tools, or coordinate among multiple agents — this layer provides the loop. LangGraph is the most production-grounded option in 2026.</p><p><strong>Data / retrieval layer.</strong> The memory of your system. LlamaIndex leads on orchestration; Weaviate, Pinecone, and Chroma compete at storage, each tuned for a different operational profile.</p><p><strong>Model layer.</strong> The foundation models themselves. This layer is increasingly commoditised. The most important design decision here is not which model you start with — it is whether you can swap it without rewriting the layers above.</p><p><strong>Serving / inference layer.</strong> How you turn a model into an endpoint. vLLM dominates throughput-bound workloads; BentoML packages models into clean APIs for teams that want to think about models, not infrastructure.</p><p><strong>Infrastructure layer.</strong> Kubernetes, Docker, cloud, on-prem. Every framework choice depends on where you can actually deploy.</p><p>Wrapped around all six layers are three concerns that have become non-negotiable since 2024: <strong>observability and evaluation</strong>, <strong>fine-tuning and training</strong>, and <strong>guardrails and safety</strong>. 
If your system design has no answer for any one of these three, it is under-designed. Ignoring them does not eliminate the risk — it just defers it until something breaks in production.</p><p><img src="/images/llm-framework-stack.jpg" alt="The canonical AI stack — Tower of Power comic book illustration"/></p><hr><h2 id="the-abstraction-trap">The Abstraction Trap</h2><p>Every framework in the atlas is catalogued against eleven attributes, but the one architects under-weight most consistently is <strong>abstraction level</strong> — how much code you write versus how much the framework decides for you.</p><p>LangChain&rsquo;s high abstraction makes the first demo fast and the tenth production fix slow, because you are debugging through someone else&rsquo;s default decisions. FAISS&rsquo;s low abstraction costs more lines but yields fewer surprises at 3am.</p><p><img src="/images/llm-abstraction-trap.jpg" alt="The Abstraction Trap — High Abstraction vs Debug Hell comic book illustration"/></p><p>The operational signal: <strong>match abstraction to team seniority.</strong> Junior teams over-value high abstraction; senior teams over-value low. A mixed team benefits from a medium default — and from making the choice explicitly rather than defaulting to whatever has the highest GitHub star count.</p><hr><h2 id="decision-heuristics-that-actually-hold">Decision Heuristics That Actually Hold</h2><p>Rather than optimising at each layer independently, the atlas maps common requirements to preferred framework combinations. 
These are the ones I have found most durable in practice:</p><table><thead><tr><th>Requirement</th><th>Starting stack</th></tr></thead><tbody><tr><td>Fast LLM prototype</td><td>LangChain + Chroma + OpenAI API</td></tr><tr><td>Enterprise-grade RAG</td><td>LlamaIndex + Weaviate + LangSmith</td></tr><tr><td>Multi-agent workflow</td><td>LangGraph (+ AutoGen for agent conversations)</td></tr><tr><td>High-throughput inference</td><td>vLLM + Ray Serve</td></tr><tr><td>Local / offline / on-prem AI</td><td>Ollama + FAISS + LangChain-local</td></tr><tr><td>Domain-specialised model</td><td>Axolotl (QLoRA) + vLLM + MLflow</td></tr></tbody></table><p>Two things stand out from this table. First, LangChain and LlamaIndex are not competitors — they compose cleanly, with LangChain at the application layer and LlamaIndex at the retrieval layer. Second, local inference is no longer an edge case. Ollama plus a Llama-3-class model is a realistic production option for regulated industries where data sovereignty is a hard constraint.</p><p><img src="/images/llm-framework-heatmap.jpg" alt="Choose Your Weapon — framework decision map comic book illustration"/></p><hr><h2 id="agents-moved-to-production--with-guardrails">Agents Moved to Production — With Guardrails</h2><p>In 2023, autonomous agents were mostly demos. By 2026, they are in targeted production use: triage, routing, research synthesis. What changed is not the models — it is the frameworks.</p><p>LangGraph&rsquo;s state-machine model gives agents deterministic control flow: you declare states, transitions, and retry policies explicitly. AutoGen models multi-agent systems as conversations, which makes it remarkably expressive for critique-revise loops and planner-executor separations. The practitioner heuristic: <strong>for production agents, LangGraph. For multi-agent conversations, AutoGen. 
For lightweight document workflows, CrewAI.</strong></p><p>The critical note: <strong>never deploy autonomous agents in production unless the failure cost is bounded.</strong> The agent should draft; a human should approve. The pattern that ships is almost always a hybrid — autonomy where the stakes are low, escalation where they are not.</p><p><img src="/images/llm-agent-loop.jpg" alt="The Agent Loop — Plan, Act, Observe, Reflect comic book illustration"/></p><p>Guardrails have crossed from afterthought to critical infrastructure in the same period. Prompt injection is the new SQL injection. Every production system needs an input guard, an output guard, and a policy layer between them. The minimum viable defense in 2026 is: input guard → LLM → output guard. Anything less is operating without a seat belt.</p><hr><h2 id="the-2026-outlook-three-trends-worth-designing-for">The 2026 Outlook: Three Trends Worth Designing For</h2><p><strong>Agents are becoming the compile target.</strong> LangGraph, AutoGen, and CrewAI are converging on a common abstraction — a loop over an LLM with tool use and state. Expect a future that looks like the deep learning layer in 2018: multiple frontend frameworks, one common runtime. Design your agent layer to be swappable.</p><p><strong>Retrieval is eating search.</strong> Elasticsearch, Postgres, and OpenSearch all ship vector indexes now; Weaviate and Pinecone ship BM25. The primitives have converged. The differentiator is no longer features — it is operational maturity and the team&rsquo;s ability to run the infrastructure. Hybrid retrieval (vector + keyword) is the production-safe default.</p><p><strong>Guardrails are becoming infrastructure.</strong> Today they are a library you bolt on. In two years they will be a runtime — prompt injection detection, PII scrubbing, and policy enforcement applied by default to every model invocation, the way CORS and auth middleware are applied to every HTTP request today. 
Get ahead of this by treating your guardrails layer as critical infrastructure now, not as a compliance checkbox later.</p><hr><h2 id="a-practitioners-closing-note">A Practitioner&rsquo;s Closing Note</h2><p>Frameworks age faster than architectures. The stack shape you design today — application, agent, retrieval, model, serving, infrastructure — will still be valid in three years. The individual framework boxes you fill it with probably will not be. The single most important design invariant is <strong>swappability at each layer</strong>. Make the layer interfaces clean, keep the framework-specific code thin, and you will be able to move when the landscape shifts — and it will.</p><p>The full Framework Atlas (v4.0, April 2026) covers all ten framework categories in detail, including comparison tables, maturity radars, cost and latency envelopes, and four reference architectures with working code. It is available below.</p><p><a href="/docs/Building_with_LLMs_Framework_Atlas.pdf"><strong>Download the Framework Atlas — Building with LLMs v4.0 (PDF)</strong></a></p>
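<p>As a postscript, the "input guard → LLM → output guard" minimum and the swappability invariant from this post can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the Atlas: <code>Model</code>, <code>EchoModel</code>, <code>input_guard</code>, <code>output_guard</code>, and <code>guarded_call</code> are hypothetical names, and the two regular expressions are toy stand-ins for real prompt-injection and PII detection.</p>

```python
import re
from typing import Protocol


class Model(Protocol):
    """Hypothetical model-layer interface: anything that maps a prompt to text."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Offline stand-in so the sketch runs; a real client would be swapped in here."""

    def complete(self, prompt: str) -> str:
        return f"You said: {prompt}"


# Toy patterns for illustration only; real guards need far more than a regex.
INJECTION_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def input_guard(prompt: str) -> str:
    """Reject prompts that match a known injection pattern."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            raise ValueError("prompt rejected by input guard")
    return prompt


def output_guard(text: str) -> str:
    """Scrub email addresses from the model output before it leaves the system."""
    return EMAIL_PATTERN.sub("[redacted email]", text)


def guarded_call(model: Model, prompt: str) -> str:
    """input guard -> LLM -> output guard, with the model passed in so it stays swappable."""
    return output_guard(model.complete(input_guard(prompt)))
```

<p>Because <code>guarded_call</code> only depends on the <code>complete</code> method, replacing <code>EchoModel</code> with a different provider should not touch the guards, which is the point of keeping the framework-specific code thin.</p>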
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/llm-framework-atlas-cover.jpg" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>llm</category><category>frameworks</category><category>engineering</category><category>Knowledge Base</category></item><item><title>Camera Roll to Caption — Python Pipeline, Vision Model for Photo Tags</title><link>https://curiousbit.netlify.app/camera-roll-to-caption-python-pipeline-vision-model-for-photo-tags/</link><guid isPermaLink="true">https://curiousbit.netlify.app/camera-roll-to-caption-python-pipeline-vision-model-for-photo-tags/</guid><pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;p&gt;Vision models, language models, and most other generative systems are confident-but-wrong some non-trivial fraction of the time. The instinct is to fix that with better prompts, bigger models, or smarter agents. The cheaper move is usually to add a small structured review seam — a thirty-second checkpoint where a human can glance, correct, and move on.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ctp/bottlebrush.jpg" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><p>Vision models, language models, and most other generative systems are confident-but-wrong some non-trivial fraction of the time. The instinct is to fix that with better prompts, bigger models, or smarter agents. The cheaper move is usually to add a small structured review seam — a thirty-second checkpoint where a human can glance, correct, and move on.</p><p>This post is the case study for one such seam, dropped into a build I needed for myself. Of 35 garden photos handed to a vision model, <strong>74% came back with correct first-pass labels</strong>. 
After thirty seconds editing a CSV, <strong>97% were acceptable to publish</strong>. Total API cost: <strong>$0.18</strong>. Total inference time: <strong>~74 seconds</strong> at 2.1 sec/photo on <code>gpt-4o-mini</code>. The CSV was the highest-leverage code in the project — and it isn&rsquo;t really code.</p><p>Here&rsquo;s the story.</p><h2 id="the-annoyance">The annoyance</h2><p>It was a Saturday afternoon in early March. I&rsquo;d come back from a walk around the garden with thirty-five photos on my iPhone — bottlebrush in full red, honeysuckle dripping with rain, a lilly-pilly cluster doing its outrageous pink thing, and at least one inexplicable shot of an old railway station I&rsquo;d passed on the way home.</p><p>I wanted to post a handful of them with consistent little hashtag labels — <code>#bottlebrush</code>, <code>#honeysuckle</code>, <code>#flower</code> — burned into the corner like a quiet caption. Not a watermark, not a filter, just a small readable pill that says &ldquo;this is what you&rsquo;re looking at.&rdquo;</p><p>What I didn&rsquo;t want was to open each HEIC in Preview, draw a text box, fiddle with the font, export, repeat thirty-five times. So I did the only reasonable thing: I wrote a small Python tool that does it for me.</p><p><img src="/images/ctp/bottlebrush.jpg" alt="Bottlebrush hero — red Australian bottlebrush flower with a #bottlebrush hashtag pill in the bottom-right corner"/></p><h2 id="the-shape-of-the-pipeline">The shape of the pipeline</h2><pre tabindex="0"><code>┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Folder  │──▶│  Vision  │──▶│   CSV    │──▶│  Apply   │──▶│  Tagged  │
│  photos  │   │ provider │   │  review  │   │  + pill  │   │  output  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
                                   ↑
                         human-in-the-loop seam
--mode propose:  folder ─▶ vision provider ─▶ CSV
--mode apply:    CSV ─▶ render + pill ─▶ tagged output</code></pre><p>The minimal product was easy to describe. Point the script at a folder. For each image — HEIC, JPG, PNG, whatever the iPhone or my camera roll throws at it — open it, figure out what&rsquo;s in it, draw a small rounded hashtag pill into the bottom-right corner, save the result to a <code>tagged_output/</code> subfolder. No watermark across the centre of the image, no filter or colour grade, no destructive edit to the original, and no making me choose the label by hand when a vision model can have a decent first guess.</p><p>That last point is where the design got interesting.</p><h2 id="the-seam">The seam</h2><p>You could write this as a single command: walk the folder, ask the model, render the tag, done. I tried that first. The first-pass run produced a folder of beautifully tagged images, about a quarter of which were wrong in some quietly maddening way — a daisy called <code>#flower</code>, a fern called <code>#leaves</code>, the railway station called, charitably, <code>#station</code>.</p><p>So the script runs in two passes.</p><p><code>--mode propose</code> opens each image, hands it to the vision model, and writes a CSV with five columns:</p><pre tabindex="0"><code>image_path, label, score, suggested_tag, final_tag</code></pre><p><code>final_tag</code> is initialised to <code>suggested_tag</code>, but the whole point of the column is that you can edit it. Open the CSV, glance down the list, fix anything obvious — <code>flower</code> becomes <code>morning_glory</code>, <code>leaves</code> becomes <code>bamboo</code> — save, close. On this batch, <strong>9 of 35 rows needed editing</strong> (a daisy, the railway station, two ferns, the bamboo, and four generic-flower fallbacks). A thirty-second pass.</p><p><code>--mode apply</code> then reads the CSV row by row and renders the tag using whatever&rsquo;s in <code>final_tag</code>. The CSV is the human-in-the-loop seam. 
It is much cheaper than re-running inference, and it catches the cases where the model was right about the genus but wrong about the species, or just wrong.</p><h2 id="three-providers-one-interface">Three providers, one interface</h2><p>I didn&rsquo;t want to commit to one vision model — the price/quality trade-offs are too lively right now. The script supports three providers behind one interface, picked via <code>--provider local|openai|xai</code>.</p><p><strong>Local CLIP.</strong> HuggingFace&rsquo;s <code>openai/clip-vit-large-patch14</code> against a fixed candidate list. Free, offline, <strong>~0.4 sec/photo on an M3 Pro</strong>. The cost is breadth: anything outside the candidate list collapses to the nearest match. CLIP doesn&rsquo;t know what a bottlebrush is unless I tell it the word.</p><p><strong>OpenAI.</strong> <code>gpt-4o-mini</code> by default, with an opt-in <code>--high-accuracy</code> flag that retries low-confidence cases (under 0.72) on <code>gpt-4o</code>. <strong>~2.1 sec/photo, ~$0.18 for the 35-photo batch.</strong> Open-ended labels are how <code>bottlebrush</code>, <code>honeysuckle</code>, <code>fern</code>, and <code>berries</code> ended up in the CSV rather than <code>flower</code>, <code>flower</code>, <code>leaves</code>, <code>fruit</code>. <strong>22% of the batch tripped the retry threshold</strong> and went to <code>gpt-4o</code>.</p><p><img src="/images/ctp/berries.jpg" alt="Hot-pink lilly-pilly berries tagged #berries — an example of gpt-4o-mini producing a specific label rather than the generic &ldquo;fruit&rdquo;"/></p><p><strong>xAI Grok.</strong> Same OpenAI-compatible client, pointed at <code>api.x.ai</code> with <code>grok-2-vision-latest</code>. 
Useful if you&rsquo;re already on the x.ai stack or want a different model family&rsquo;s vote.</p><p>The mental model: local CLIP for batch-of-a-hundred-photos-on-a-flight, OpenAI as the daily driver, and the high-accuracy retry for exactly the case where the model says &ldquo;flower&rdquo; with 0.55 confidence and I want it to look harder before I have to.</p><p>The blue morning glory below is what generic labels look like in practice — still a decent fallback, just unspecific. The model wasn&rsquo;t wrong; it just wasn&rsquo;t curious.</p><p><img src="/images/ctp/morning-glory.jpg" alt="Blue morning glory tagged #flower — an example of the model falling back to a generic label even with the specific species clearly visible"/></p><h2 id="two-small-touches">Two small touches</h2><p>Two design choices are the difference between &ldquo;the script works&rdquo; and &ldquo;the output looks intentional.&rdquo;</p><p><strong>Style-aware contrast.</strong> The pill needs to be readable on both a bright sky and dark foliage. The script crops the bottom-right region of the image, measures the mean luminance using the standard Rec. 709 weights, and flips the colour scheme above or below a threshold:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">ImageStat</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">def</span> <span class="nf">style_aware_colors</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
</span></span><span class="line"><span class="cl">    <span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">size</span>
</span></span><span class="line"><span class="cl">    <span class="n">crop</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">crop</span><span class="p">((</span><span class="nb">int</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="mf">0.68</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">h</span> <span class="o">*</span> <span class="mf">0.80</span><span class="p">),</span> <span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">    <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="n">ImageStat</span><span class="o">.</span><span class="n">Stat</span><span class="p">(</span><span class="n">crop</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s2">"RGB"</span><span class="p">))</span><span class="o">.</span><span class="n">mean</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="n">luminance</span> <span class="o">=</span> <span class="mf">0.2126</span> <span class="o">*</span> <span class="n">r</span> <span class="o">+</span> <span class="mf">0.7152</span> <span class="o">*</span> <span class="n">g</span> <span class="o">+</span> <span class="mf">0.0722</span> <span class="o">*</span> <span class="n">b</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="n">luminance</span> <span class="o">&lt;</span> <span class="mi">140</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">245</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span>  <span class="c1"># white text, dark pill</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">245</span><span class="p">),</span> <span class="p">(</span><span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span>  <span class="c1"># black text, light pill</span>
</span></span></code></pre></div><p>Eight lines of PIL, plus the import. In this batch every photo sampled dark — gardens are mostly green and shadow in the corner — so every output got the dark pill. The bright-pill branch is still there, waiting for a photo with sky or a light wall in the corner.</p><p><strong>Save with fallback.</strong> HEIC writes occasionally fail for reasons that aren&rsquo;t worth diagnosing in a personal tool. 
The save function tries the original format first; if PIL throws, it quietly drops to JPEG with the same filename stem. Eight more lines. On this batch, <strong>3 of 35 fell back to JPEG</strong>. Without the fallback those three would have been a stack trace and a half-finished folder. With it, thirty-five of thirty-five made it through.</p><h2 id="what-id-add-next">What I&rsquo;d add next</h2><p>Multi-tag support, so a photo can be <code>#lorikeet #bottlebrush</code> when the bird showed up in the bottlebrush. EXIF preservation through the round-trip — right now PIL strips most of the metadata, which I don&rsquo;t love. A tiny review UI to replace the CSV step, either a Tkinter window or a one-page localhost app. Smarter candidate lists for the local provider, scoped by season or geography — Sydney summer has a different vocabulary than European spring.</p><p>None of these are urgent enough to displace &ldquo;the script already does what I wanted.&rdquo;</p><h2 id="closing-observations">Closing observations</h2><p>Three lessons that generalise beyond this script.</p><p><strong>Human-in-the-loop is cheap and underrated.</strong> The CSV seam between propose and apply takes thirty seconds per batch and saves me from confidently wrong outputs. For any task where a model is confident-but-wrong some non-trivial fraction of the time — RAG, codegen, moderation, enterprise copilots, agentic workflows — a structured review step pays for itself almost immediately. The CSV doesn&rsquo;t have to be elegant. It has to exist.</p><p><strong>Pluggable providers are worth the small abstraction tax even on personal tools.</strong> I went from local CLIP to <code>gpt-4o-mini</code> to Grok in the space of one afternoon without rewriting the rendering code. The interface is <code>(client, model, image) → (label, score)</code> and that&rsquo;s it. 
Once you&rsquo;ve paid that cost, you can keep up with a fast-moving model market essentially for free.</p><p><strong>Small touches decide whether a script feels finished.</strong> Luminance-aware contrast and a save-format fallback don&rsquo;t change what the tool does; they change how the output reads.</p><p>The model wasn&rsquo;t the product. The seam was.</p><hr><p>A short reel of the tagged photos in the wild: <a href="https://www.instagram.com/reel/DVhMhwIE9YKZ-4xtOHeud3rY1IO2x_3OeGzr9M0/">Instagram story</a>.</p>
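<p>For anyone who wants to try the propose/apply seam described above, here is a minimal sketch of the two passes using only the standard library. It is not the script itself: the five column names come from this post, but <code>propose_tags</code>, <code>apply_tags</code>, the single-argument <code>classify</code> callable, the <code>render</code> callback, and the lowercase-with-underscores tag normalisation are assumptions made for illustration.</p>

```python
import csv
from pathlib import Path

# Column names as listed in the post.
FIELDS = ["image_path", "label", "score", "suggested_tag", "final_tag"]


def propose_tags(images, classify, csv_path):
    """Pass 1: write one CSV row per image; final_tag starts as a copy of suggested_tag."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for image in images:
            # classify is a hypothetical callable returning (label, score).
            label, score = classify(image)
            # Assumed normalisation: "Morning Glory" becomes "morning_glory".
            tag = label.strip().lower().replace(" ", "_")
            writer.writerow({
                "image_path": str(image),
                "label": label,
                "score": f"{score:.2f}",
                "suggested_tag": tag,
                "final_tag": tag,
            })


def apply_tags(csv_path, render):
    """Pass 2: render each row with whatever the reviewer left in final_tag."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # render is a hypothetical callback that draws the pill and saves the image.
            render(Path(row["image_path"]), "#" + row["final_tag"])
            count += 1
    return count
```

<p>The human step sits between the two calls: run <code>propose_tags</code>, edit the <code>final_tag</code> column by hand, then run <code>apply_tags</code>. Nothing in the second pass talks to a model, which is why a correction costs an edit rather than another inference call.</p>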
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ctp/bottlebrush.jpg" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>automation</category><category>artificial-intelligence</category><category>build-log</category><category>engineering</category><category>Knowledge Base</category></item><item><title>A Field Guide to AI Chips</title><link>https://curiousbit.netlify.app/a-field-guide-to-ai-chips/</link><guid isPermaLink="true">https://curiousbit.netlify.app/a-field-guide-to-ai-chips/</guid><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Cinzel:wght@500;700;900&amp;family=IM+Fell+English:ital@0;1&amp;family=Inter:wght@400;500;600&amp;display=swap');
.chip-guide {
--bg: #0a1424;
--surface: #14253f;
--surface2: #1c2d4f;
--border: #3a4a6a;
--bronze: #b87333;
--bronze-bright: #d4a04a;
--gold: #e5b85e;
--parchment: #e8e0d0;
--muted: #9aa6bf;
--accent-purple: #8b5cf6;
--accent-teal: #2dd4bf;
color: var(--parchment);
font-family: 'Inter', sans-serif;
font-size: 17px;
line-height: 1.75;
}
.chip-guide h1, .chip-guide h2, .chip-guide h3 {
font-family: 'Cinzel', serif;
letter-spacing: 0.04em;
color: var(--gold);
font-weight: 700;
margin-top: 2.5rem;
margin-bottom: 1rem;
}
.chip-guide h1 {
font-size: clamp(2.2rem, 4vw, 3.2rem);
font-weight: 900;
background: linear-gradient(135deg, #e5b85e 0%, #d4a04a 50%, #b87333 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
text-align: center;
margin: 0.5rem 0 0.5rem;
}
.chip-guide h2 {
font-size: clamp(1.4rem, 2.2vw, 2rem);
border-bottom: 2px double var(--bronze);
padding-bottom: 0.4rem;
}
.chip-guide h3 {
font-size: clamp(1.2rem, 1.6vw, 1.5rem);
color: var(--bronze-bright);
}
.chip-guide p { margin: 0.6rem 0 1rem; }
.chip-guide strong { color: var(--gold); }
.chip-guide a { color: var(--bronze-bright); }
.chip-guide a:hover { color: var(--gold); }
.chip-guide .hero {
text-align: center;
padding: 2.5rem 0 1.5rem;
border-bottom: 1px solid var(--border);
margin-bottom: 2rem;
}
.chip-guide .eyebrow {
display: inline-block;
font-family: 'Cinzel', serif;
font-size: 0.78rem;
letter-spacing: 0.35em;
color: var(--bronze-bright);
text-transform: uppercase;
margin-bottom: 0.6rem;
}
.chip-guide .subtitle {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: clamp(1.05rem, 1.4vw, 1.3rem);
color: var(--muted);
margin: 0.8rem auto 0;
max-width: 720px;
}
.chip-guide .ornament {
text-align: center;
margin: 1.2rem 0;
color: var(--bronze);
font-size: 1.3rem;
letter-spacing: 0.6rem;
}
.chip-guide table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
font-size: 0.95rem;
background: var(--surface);
border: 1px solid var(--border);
}
.chip-guide table th {
background: var(--surface2);
color: var(--gold);
font-family: 'Cinzel', serif;
text-align: left;
padding: 0.7rem 1rem;
border-bottom: 2px solid var(--bronze);
font-weight: 600;
letter-spacing: 0.04em;
font-size: 0.92rem;
}
.chip-guide table td {
padding: 0.7rem 1rem;
border-bottom: 1px solid var(--border);
vertical-align: top;
color: var(--parchment);
}
.chip-guide table tr:last-child td { border-bottom: none; }
.chip-guide table tr:hover td { background: rgba(229, 184, 94, 0.04); }
.chip-guide .stat-block {
background: var(--surface);
border: 2px solid var(--bronze);
border-radius: 4px;
margin: 1.5rem 0;
font-family: 'Inter', sans-serif;
}
.chip-guide .stat-block-header {
padding: 0.55rem 1rem;
background: var(--surface2);
border-bottom: 2px solid var(--bronze);
font-family: 'Cinzel', serif;
font-weight: 700;
color: var(--gold);
letter-spacing: 0.05em;
font-size: 1.05rem;
text-transform: uppercase;
}
.chip-guide .stat-block dl {
padding: 0.7rem 1rem 0.8rem;
margin: 0;
display: grid;
grid-template-columns: 8rem 1fr;
row-gap: 0.4rem;
column-gap: 0.6rem;
}
.chip-guide .stat-block dt {
font-weight: 700;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-size: 0.82rem;
letter-spacing: 0.05em;
text-transform: uppercase;
margin: 0;
}
.chip-guide .stat-block dd {
margin: 0;
color: var(--parchment);
font-size: 0.94rem;
}
.chip-guide .read-more {
display: inline-block;
margin-top: 0.4rem;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-weight: 700;
text-decoration: none;
border-bottom: 1px solid var(--bronze);
padding-bottom: 1px;
letter-spacing: 0.05em;
font-size: 0.92rem;
}
.chip-guide .read-more:hover {
color: var(--gold);
border-bottom-color: var(--gold);
}
.chip-guide .quote {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: 1.18rem;
color: var(--parchment);
background: rgba(184, 115, 51, 0.08);
border-left: 4px solid var(--bronze);
padding: 1rem 1.5rem;
margin: 1.5rem 0;
}
.chip-guide ul { padding-left: 1.4rem; }
.chip-guide ul li { margin: 0.3rem 0; }
&lt;/style&gt;
&lt;div class="chip-guide"&gt;
&lt;div class="hero"&gt;
&lt;div class="eyebrow"&gt;A Field Guide · 2026&lt;/div&gt;
&lt;h1 id="a-field-guide-to-ai-chips"&gt;A Field Guide to AI Chips&lt;/h1&gt;
&lt;div class="subtitle"&gt;Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.&lt;/div&gt;
&lt;div class="ornament"&gt;❦ ❦ ❦&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a &amp;ldquo;where you&amp;rsquo;ll meet it&amp;rdquo; entry for each. Each section links to a deeper entry for the curious.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/ai-chips/field-guide-hero.png" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Cinzel:wght@500;700;900&family=IM+Fell+English:ital@0;1&family=Inter:wght@400;500;600&display=swap');
.chip-guide {
--bg: #0a1424;
--surface: #14253f;
--surface2: #1c2d4f;
--border: #3a4a6a;
--bronze: #b87333;
--bronze-bright: #d4a04a;
--gold: #e5b85e;
--parchment: #e8e0d0;
--muted: #9aa6bf;
--accent-purple: #8b5cf6;
--accent-teal: #2dd4bf;
color: var(--parchment);
font-family: 'Inter', sans-serif;
font-size: 17px;
line-height: 1.75;
}
.chip-guide h1, .chip-guide h2, .chip-guide h3 {
font-family: 'Cinzel', serif;
letter-spacing: 0.04em;
color: var(--gold);
font-weight: 700;
margin-top: 2.5rem;
margin-bottom: 1rem;
}
.chip-guide h1 {
font-size: clamp(2.2rem, 4vw, 3.2rem);
font-weight: 900;
background: linear-gradient(135deg, #e5b85e 0%, #d4a04a 50%, #b87333 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
background-clip: text;
text-align: center;
margin: 0.5rem 0 0.5rem;
}
.chip-guide h2 {
font-size: clamp(1.4rem, 2.2vw, 2rem);
border-bottom: 2px double var(--bronze);
padding-bottom: 0.4rem;
}
.chip-guide h3 {
font-size: clamp(1.2rem, 1.6vw, 1.5rem);
color: var(--bronze-bright);
}
.chip-guide p { margin: 0.6rem 0 1rem; }
.chip-guide strong { color: var(--gold); }
.chip-guide a { color: var(--bronze-bright); }
.chip-guide a:hover { color: var(--gold); }
.chip-guide .hero {
text-align: center;
padding: 2.5rem 0 1.5rem;
border-bottom: 1px solid var(--border);
margin-bottom: 2rem;
}
.chip-guide .eyebrow {
display: inline-block;
font-family: 'Cinzel', serif;
font-size: 0.78rem;
letter-spacing: 0.35em;
color: var(--bronze-bright);
text-transform: uppercase;
margin-bottom: 0.6rem;
}
.chip-guide .subtitle {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: clamp(1.05rem, 1.4vw, 1.3rem);
color: var(--muted);
margin: 0.8rem auto 0;
max-width: 720px;
}
.chip-guide .ornament {
text-align: center;
margin: 1.2rem 0;
color: var(--bronze);
font-size: 1.3rem;
letter-spacing: 0.6rem;
}
.chip-guide table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
font-size: 0.95rem;
background: var(--surface);
border: 1px solid var(--border);
}
.chip-guide table th {
background: var(--surface2);
color: var(--gold);
font-family: 'Cinzel', serif;
text-align: left;
padding: 0.7rem 1rem;
border-bottom: 2px solid var(--bronze);
font-weight: 600;
letter-spacing: 0.04em;
font-size: 0.92rem;
}
.chip-guide table td {
padding: 0.7rem 1rem;
border-bottom: 1px solid var(--border);
vertical-align: top;
color: var(--parchment);
}
.chip-guide table tr:last-child td { border-bottom: none; }
.chip-guide table tr:hover td { background: rgba(229, 184, 94, 0.04); }
.chip-guide .stat-block {
background: var(--surface);
border: 2px solid var(--bronze);
border-radius: 4px;
margin: 1.5rem 0;
font-family: 'Inter', sans-serif;
}
.chip-guide .stat-block-header {
padding: 0.55rem 1rem;
background: var(--surface2);
border-bottom: 2px solid var(--bronze);
font-family: 'Cinzel', serif;
font-weight: 700;
color: var(--gold);
letter-spacing: 0.05em;
font-size: 1.05rem;
text-transform: uppercase;
}
.chip-guide .stat-block dl {
padding: 0.7rem 1rem 0.8rem;
margin: 0;
display: grid;
grid-template-columns: 8rem 1fr;
row-gap: 0.4rem;
column-gap: 0.6rem;
}
.chip-guide .stat-block dt {
font-weight: 700;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-size: 0.82rem;
letter-spacing: 0.05em;
text-transform: uppercase;
margin: 0;
}
.chip-guide .stat-block dd {
margin: 0;
color: var(--parchment);
font-size: 0.94rem;
}
.chip-guide .read-more {
display: inline-block;
margin-top: 0.4rem;
color: var(--bronze-bright);
font-family: 'Cinzel', serif;
font-weight: 700;
text-decoration: none;
border-bottom: 1px solid var(--bronze);
padding-bottom: 1px;
letter-spacing: 0.05em;
font-size: 0.92rem;
}
.chip-guide .read-more:hover {
color: var(--gold);
border-bottom-color: var(--gold);
}
.chip-guide .quote {
font-family: 'IM Fell English', serif;
font-style: italic;
font-size: 1.18rem;
color: var(--parchment);
background: rgba(184, 115, 51, 0.08);
border-left: 4px solid var(--bronze);
padding: 1rem 1.5rem;
margin: 1.5rem 0;
}
.chip-guide ul { padding-left: 1.4rem; }
.chip-guide ul li { margin: 0.3rem 0; }</style><div class="chip-guide"><div class="hero"><div class="eyebrow">A Field Guide · 2026</div><h1 id="a-field-guide-to-ai-chips">A Field Guide to AI Chips</h1><div class="subtitle">Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.</div><div class="ornament">❦ ❦ ❦</div></div><p>Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a &ldquo;where you&rsquo;ll meet it&rdquo; entry for each. Each section links to a deeper entry for the curious.</p><h2 id="the-roll-call">The Roll Call</h2><table><thead><tr><th>Chip</th><th>Best for</th><th>Memory &amp; Interconnect</th><th>Cost &amp; Access</th><th>Notable Specimens (2026)</th></tr></thead><tbody><tr><td><strong>GPU</strong></td><td>Training + inference</td><td>80–192GB HBM3/3e; NVLink 5, PCIe 5</td><td>$25–40K each; cloud-only at scale</td><td>NVIDIA H100, B200, GB200 NVL72; AMD MI325X</td></tr><tr><td><strong>TPU</strong></td><td>Hyperscale training</td><td>95–192GB HBM; OCS interconnect</td><td>Google Cloud only</td><td>TPU v5p, v6 Trillium</td></tr><tr><td><strong>NPU</strong></td><td>On-device AI</td><td>Shared LPDDR / unified memory</td><td>Bundled in device</td><td>Apple Neural Engine (M4), Intel AI Boost (Lunar Lake), Qualcomm Hexagon (8 Elite)</td></tr><tr><td><strong>CPU</strong></td><td>Orchestration &amp; control plane</td><td>DDR5; PCIe 5, CXL</td><td>$1–15K; retail</td><td>Intel Xeon 6, AMD EPYC 9005</td></tr><tr><td><strong>ASIC</strong></td><td>Inference at scale; specialised training</td><td>Custom HBM / SRAM; proprietary fabric</td><td>Cloud-only</td><td>AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L</td></tr><tr><td><strong>FPGA</strong></td><td>Custom, low-latency, 
adaptive</td><td>DDR/HBM; reprogrammable fabric</td><td>$5–50K each; cloud</td><td>AMD Versal AI Edge, Intel Agilex 7</td></tr><tr><td><strong>Edge AI</strong></td><td>Mobile, robotics, IoT</td><td>LPDDR; low-power</td><td>$50–2000, embedded in product</td><td>NVIDIA Jetson Orin, Google Coral, Hailo-8</td></tr><tr><td><strong>Emerging</strong></td><td>Frontier R&amp;D</td><td>Wafer-scale SRAM / photonic / analog</td><td>Mostly research, limited cloud</td><td>Cerebras (covered above), Lightmatter, Mythic</td></tr></tbody></table><h2 id="1--gpu--graphics-processing-unit">1 · GPU — Graphics Processing Unit</h2><div class="stat-block"><div class="stat-block-header">GPU · The Apex Predator</div><dl><dt>Class</dt><dd>Parallel beast</dd><dt>Memory</dt><dd>80–192GB HBM3 / HBM3e</dd><dt>Interconnect</dt><dd>NVLink 5, PCIe 5, InfiniBand</dd><dt>Power</dt><dd>350–1200W per die</dd><dt>Habitat</dt><dd>Hyperscale datacenters</dd><dt>Cost</dt><dd>$25–40K per card · cloud rental at scale</dd><dt>Best Prey</dt><dd>LLM training, diffusion, multimodal pretraining</dd><dt>Specimens</dt><dd>NVIDIA H100, B200, GB200 NVL72; AMD MI325X</dd></dl></div><p>GPUs are the apex predator of the AI hardware ecosystem in 2026. Originally designed for graphics, they turned out to be ideal for the dense matrix multiplications that dominate neural network training. NVIDIA&rsquo;s H100 made the LLM era possible; B200 and the rack-scale GB200 NVL72 (72 GPUs treated as one machine, lashed together by NVLink switches) define the current frontier.</p><p>The reason GPUs dominate isn&rsquo;t just parallel processing — it&rsquo;s the combination of HBM (high-bandwidth memory mounted directly on the chip package), tensor cores (specialised matrix-multiply units), and a mature software ecosystem (CUDA, PyTorch, JAX) that nothing else has matched at scale. 
AMD&rsquo;s MI325X is the only serious open-market competitor, and even it leans on ROCm&rsquo;s HIP layer to port code originally written for CUDA.</p><p>The catch: you cannot really buy them. H100s and B200s ship into hyperscaler datacenters first and reach the open market — when they do — through Lambda, CoreWeave, AWS, and friends, rented at $2–8 per GPU-hour.</p><p><a class="read-more" href="/field-guide-gpus/">→ Full entry: Field Guide · GPUs</a></p><h2 id="2--tpu--tensor-processing-unit">2 · TPU — Tensor Processing Unit</h2><div class="stat-block"><div class="stat-block-header">TPU · Google's Matmul Colossus</div><dl><dt>Class</dt><dd>Bespoke matrix engine</dd><dt>Memory</dt><dd>95–192GB HBM (varies by generation)</dd><dt>Interconnect</dt><dd>OCS (Optical Circuit Switching) + ICI</dd><dt>Power</dt><dd>~200–300W per chip</dd><dt>Habitat</dt><dd>Google Cloud (only)</dd><dt>Cost</dt><dd>Cloud rental only</dd><dt>Best Prey</dt><dd>Hyperscale training of Gemini-class models</dd><dt>Specimens</dt><dd>TPU v5p (training), v5e (inference), v6 Trillium</dd></dl></div><p>Google designed TPUs in-house to avoid paying NVIDIA&rsquo;s margins on a workload they knew exactly — TensorFlow matrix multiplications at hyperscale. Each generation has narrowed the gap with GPUs on flexibility while widening it on energy efficiency per FLOP.</p><p>The architectural bet is the systolic array: a grid of multiply-add units that pumps data through in lockstep, achieving near-peak utilisation on matmul-heavy workloads. The trade-off is that anything outside that sweet spot (irregular memory access, highly dynamic shapes) runs less efficiently than on a GPU. The OCS-based interconnect lets Google rewire a TPU pod&rsquo;s topology per job, which matters enormously at the scale of a Gemini training run.</p><p>You cannot buy a TPU. They exist exclusively inside Google Cloud, rented by the hour. 
Gemini was trained on them; many third parties (Anthropic for a stretch, plus enterprise customers) rent slices for their own runs.</p><p><a class="read-more" href="/field-guide-tpus/">→ Full entry: Field Guide · TPUs</a></p><h2 id="3--npu--neural-processing-unit">3 · NPU — Neural Processing Unit</h2><div class="stat-block"><div class="stat-block-header">NPU · The Resident Familiar</div><dl><dt>Class</dt><dd>On-device specialist</dd><dt>Memory</dt><dd>Shared LPDDR / unified system memory</dd><dt>Interconnect</dt><dd>SoC fabric (on-die)</dd><dt>Power</dt><dd>5–40W</dd><dt>Habitat</dt><dd>Laptops, phones, tablets</dd><dt>Cost</dt><dd>Bundled — no separate purchase</dd><dt>Best Prey</dt><dd>Voice, camera AI, on-device LLMs, Copilot features</dd><dt>Specimens</dt><dd>Apple Neural Engine (M4), Intel AI Boost (Lunar Lake / Arrow Lake), Qualcomm Hexagon (8 Elite, X Elite)</dd></dl></div><p>NPUs are the chip type most people interact with every day without knowing it. They live inside the SoC of your phone or laptop, optimised for running already-trained models locally with extreme power efficiency. Voice transcription, Face ID, Pixel&rsquo;s call screening, the on-device chat in Copilot+ PCs — all NPU workloads.</p><p>The defining trait is integer-quantised math (INT8 / INT4) at very low wattage. Where a datacenter GPU might pull 700W to serve a model, an NPU runs a comparable inference on the same model — quantised down — at 5–15W, with the weights sitting in the device&rsquo;s main memory because there is no discrete accelerator memory to fill.</p><p>Microsoft now requires 40+ TOPS of NPU performance for a laptop to qualify as a &ldquo;Copilot+ PC&rdquo; — a forcing function that pushed Qualcomm, Intel and AMD into a 12-month arms race. 
As of 2026, top mobile SoCs ship 50–60 TOPS of NPU performance.</p><p><a class="read-more" href="/field-guide-npus/">→ Full entry: Field Guide · NPUs</a></p><h2 id="4--cpu--central-processing-unit">4 · CPU — Central Processing Unit</h2><div class="stat-block"><div class="stat-block-header">CPU · The Foundational Workhorse</div><dl><dt>Class</dt><dd>General-purpose</dd><dt>Memory</dt><dd>DDR5 (system RAM)</dd><dt>Interconnect</dt><dd>PCIe 5, CXL</dd><dt>Power</dt><dd>100–500W</dd><dt>Habitat</dt><dd>Every server, every workstation</dd><dt>Cost</dt><dd>$1–15K, retail and widely available</dd><dt>Best Prey</dt><dd>Orchestration, preprocessing, control plane, small-batch inference</dd><dt>Specimens</dt><dd>Intel Xeon 6, AMD EPYC 9005</dd></dl></div><p>CPUs aren&rsquo;t obsolete — they&rsquo;re indispensable. Every AI training run needs CPUs to feed data to the accelerators (decompression, tokenisation, augmentation), schedule jobs and run the control plane. The ratio matters: a typical training cluster pairs eight GPUs with one or two CPU sockets.</p><p>Modern server CPUs ship with AI-targeted extensions — AVX-512, AMX (Advanced Matrix Extensions), bf16 support — that let them handle small-batch inference and embedding generation reasonably well. For models under 7B parameters served at low traffic, a CPU is often more economical than a dedicated accelerator.</p><p>What CPUs cannot do is train frontier models. The arithmetic density and memory bandwidth needed for LLM pretraining are 10–100× what a CPU delivers per watt. 
CPUs do the surrounding work; accelerators do the math.</p><p><a class="read-more" href="/field-guide-cpus/">→ Full entry: Field Guide · CPUs</a></p><h2 id="5--asic--application-specific-integrated-circuit">5 · ASIC — Application-Specific Integrated Circuit</h2><div class="stat-block"><div class="stat-block-header">ASIC · The Purpose-Bred Specialist</div><dl><dt>Class</dt><dd>Fixed-function accelerator</dd><dt>Memory</dt><dd>Custom HBM / on-die SRAM</dd><dt>Interconnect</dt><dd>Proprietary fabric</dd><dt>Power</dt><dd>75–450W per chip</dd><dt>Habitat</dt><dd>Hyperscaler clouds (AWS, Cerebras, Groq, SambaNova)</dd><dt>Cost</dt><dd>Cloud rental only</dd><dt>Best Prey</dt><dd>Inference at scale, specialised training</dd><dt>Specimens</dt><dd>AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L</dd></dl></div><p>ASICs are chips designed for one thing and one thing only — and they&rsquo;re brutally good at that thing. AWS Inferentia2 runs production inference for Anthropic, Amazon search and Alexa at a cost-per-token that beats GPUs. Trainium2 is AWS&rsquo;s training equivalent, taking aim at NVIDIA&rsquo;s H100/B200 dominance. Groq&rsquo;s LPU posts inference latencies — sub-1ms first-token for many models — that GPUs simply cannot match.</p><p>The architectural philosophy is &ldquo;build silicon for the specific math you do most often, throw away the rest.&rdquo; Cerebras takes this furthest: their Wafer-Scale Engine 3 is a single chip the size of an entire silicon wafer (900,000 cores, 44GB on-die SRAM) that eliminates the multi-GPU communication overhead which plagues distributed training.</p><p>The price of specialisation: you cannot pivot. When the dominant architecture changes — and it does (Mamba, MoE, diffusion, JEPA) — ASICs designed for the last era stop being competitive overnight. 
GPUs hedge their bets; ASICs commit.</p><p><a class="read-more" href="/field-guide-asics/">→ Full entry: Field Guide · ASICs</a></p><h2 id="6--fpga--field-programmable-gate-array">6 · FPGA — Field-Programmable Gate Array</h2><div class="stat-block"><div class="stat-block-header">FPGA · The Shapeshifter</div><dl><dt>Class</dt><dd>Reprogrammable logic</dd><dt>Memory</dt><dd>DDR / HBM (model-dependent)</dd><dt>Interconnect</dt><dd>PCIe; custom</dd><dt>Power</dt><dd>50–300W</dd><dt>Habitat</dt><dd>Trading desks, 5G basebands, telecom, occasional inference</dd><dt>Cost</dt><dd>$5–50K each; cloud</dd><dt>Best Prey</dt><dd>Ultra-low-latency inference, custom protocols, evolving workloads</dd><dt>Specimens</dt><dd>AMD/Xilinx Versal AI Edge, Intel Agilex 7</dd></dl></div><p>FPGAs occupy a strange ecological niche. Unlike ASICs, their internal wiring is reprogrammable — you can compile a new circuit into them, deploy it, and reprogram it tomorrow. This makes them ideal for workloads that change faster than chip fabrication cycles (which run to years), or where you need ultra-low latency that even ASICs struggle to deliver.</p><p>In AI specifically, FPGAs are rarely the first choice for mainstream training or inference — they are slower to develop for and harder to program than GPUs. Where they shine: when the model is small enough to fit, the latency budget is brutal (single-digit microseconds), and the workload spec might shift quarterly. 
Microsoft used FPGAs heavily in early Bing Search ranking and Azure networking; financial firms still run them for inline ML in trading.</p><p>For most readers, FPGAs will be a &ldquo;did you know?&rdquo; category rather than a chip you&rsquo;ll ever deploy.</p><p><a class="read-more" href="/field-guide-fpgas/">→ Full entry: Field Guide · FPGAs</a></p><h2 id="7--edge-ai--mobile-robotics-iot">7 · Edge AI — Mobile, Robotics, IoT</h2><div class="stat-block"><div class="stat-block-header">Edge AI · The Frontier Ranger</div><dl><dt>Class</dt><dd>Embedded inference</dd><dt>Memory</dt><dd>LPDDR, sometimes onboard SRAM</dd><dt>Interconnect</dt><dd>PCIe, MIPI, USB</dd><dt>Power</dt><dd>1–25W</dd><dt>Habitat</dt><dd>Drones, robots, cameras, autonomous systems, sensors</dd><dt>Cost</dt><dd>$50–2000; embedded in product</dd><dt>Best Prey</dt><dd>Real-time inference, computer vision, robotics</dd><dt>Specimens</dt><dd>NVIDIA Jetson Orin, Google Coral / Edge TPU, Hailo-8, Ambarella CV5</dd></dl></div><p>Edge AI chips are NPUs&rsquo; cousins — same family, different role. Where an NPU lives inside a consumer laptop alongside other compute, an edge AI chip is purpose-built for an embedded device: a security camera, a drone, a forklift, a Tesla.</p><p>The defining constraints are size, power and latency. A camera processing 4K video at 30fps cannot afford to ship frames to a cloud GPU; it has to detect motion locally, identify objects locally and signal events within tens of milliseconds — on a few watts, because the device runs on battery or is fanless.</p><p>NVIDIA&rsquo;s Jetson family is the broadest platform — same CUDA software stack as their datacenter GPUs, scaled down to 7–60W. Google&rsquo;s Edge TPU is the smallest, cheapest and lowest power (Coral USB stick: $40, 2W). 
Hailo-8 and Ambarella sit in between, targeting industrial and automotive customers.</p><p><a class="read-more" href="/field-guide-edge-ai/">→ Full entry: Field Guide · Edge AI</a></p><h2 id="8--emerging-architectures">8 · Emerging Architectures</h2><div class="stat-block"><div class="stat-block-header">Emerging · The Frontier Beasts</div><dl><dt>Class</dt><dd>Experimental</dd><dt>Memory</dt><dd>Wafer-scale SRAM / photonic / analog</dd><dt>Interconnect</dt><dd>On-wafer / optical / in-memory</dd><dt>Power</dt><dd>Varies wildly</dd><dt>Habitat</dt><dd>Research labs, narrow cloud offerings</dd><dt>Cost</dt><dd>Mostly inaccessible; limited cloud</dd><dt>Best Prey</dt><dd>The next 10× efficiency leap</dd><dt>Specimens</dt><dd>Cerebras WSE-3 (wafer-scale); Lightmatter, Lightelligence (photonic); Mythic, IBM (analog/in-memory)</dd></dl></div><p>Three architectures sit at the frontier — promising, but not yet mainstream.</p><ul><li><strong>Wafer-scale</strong> (Cerebras): a single chip the size of an entire silicon wafer. Eliminates multi-chip communication entirely; presents the whole system to software as a single device. Already commercial.</li><li><strong>Photonic / optical AI</strong> (Lightmatter, Lightelligence): performs matrix math using light interference instead of electricity. Potentially orders of magnitude lower energy per operation; currently limited to inference and constrained models.</li><li><strong>Analog / in-memory compute</strong> (Mythic, IBM, several startups): computes <em>inside</em> memory arrays using analog voltage levels. Removes the von-Neumann bottleneck — the constant shuttling of data between memory and compute — entirely. 
Promising for low-power inference; precision limitations make training hard today.</li></ul><p><a class="read-more" href="/field-guide-emerging-architectures/">→ Full entry: Field Guide · Emerging Architectures</a></p><h2 id="current-industry-reality-2026">Current Industry Reality (2026)</h2><ul><li><strong>GPUs dominate training.</strong> Nearly every frontier model outside Google (GPT-5, Claude, Llama, Grok) is still trained on NVIDIA or AMD silicon at hyperscale.</li><li><strong>ASICs are ascendant in inference.</strong> AWS reports more than 40% of internal inference now runs on Inferentia and Trainium; Groq leads on latency-critical applications.</li><li><strong>NPUs are exploding on consumer devices.</strong> Every premium laptop and phone shipped in 2026 has a 40+ TOPS NPU.</li><li><strong>CPUs remain foundational.</strong> No accelerator runs without one.</li><li><strong>TPUs are Google-only.</strong> Gemini, Veo and Imagen were all trained on TPU v5p / v6.</li></ul><h2 id="simplified-view">Simplified View</h2><table><thead><tr><th>Use case</th><th>Typical chip</th></tr></thead><tbody><tr><td>Train a GPT-class model</td><td>GPU clusters (or TPU pods if you're Google)</td></tr><tr><td>Run ChatGPT-class inference at scale</td><td>GPUs + ASICs (Inferentia, Groq, Trainium)</td></tr><tr><td>AI on laptop</td><td>NPU + integrated GPU</td></tr><tr><td>AI on phone</td><td>Mobile NPU</td></tr><tr><td>Robot or drone AI</td><td>Edge AI chips (Jetson, Hailo)</td></tr><tr><td>Ultra-low-latency custom AI</td><td>FPGA or ASIC</td></tr></tbody></table><h2 id="the-industry-trend">The Industry Trend</h2><div class="quote">
The industry is moving from "general-purpose GPU everything" to "specialised chip for each layer of the stack."</div><p>Power and inference cost are now the binding constraints. A frontier model serving billions of queries spends more on inference electricity in a year than its entire training run cost. The economics force specialisation: train once on GPUs, serve forever on cheaper inference silicon. Expect the gap between training hardware (still GPU-dominant) and inference hardware (rapidly ASIC- and NPU-fragmented) to widen.</p><div class="ornament">❦ ❦ ❦</div></div>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/ai-chips/field-guide-hero.png" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>architecture</category><category>hardware</category><category>engineering</category><category>Knowledge Base</category></item><item><title>LLM &amp; Embeddings — One Predicts Words. One Maps Meaning.</title><link>https://curiousbit.netlify.app/one-predicts-words-one-maps-meaning/</link><guid isPermaLink="true">https://curiousbit.netlify.app/one-predicts-words-one-maps-meaning/</guid><pubDate>Tue, 03 Mar 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
.two-mech { margin: 2rem 0 2.5rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; }
.two-mech svg { display: block; width: 100%; height: auto; min-width: 720px; }
.two-mech-wrap { overflow-x: auto; }
@media (prefers-reduced-motion: reduce) { .two-mech .tm-particle { display: none; } }
&lt;/style&gt;
&lt;div class="two-mech"&gt;&lt;div class="two-mech-wrap"&gt;
&lt;svg viewBox="0 0 1200 720" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Animated diagram contrasting the LLM stochastic generation loop with the deterministic embedding similarity pipeline"&gt;
&lt;defs&gt;
&lt;filter id="tmGlow" x="-50%" y="-50%" width="200%" height="200%"&gt;
&lt;feGaussianBlur stdDeviation="3" result="b"/&gt;
&lt;feMerge&gt;&lt;feMergeNode in="b"/&gt;&lt;feMergeNode in="SourceGraphic"/&gt;&lt;/feMerge&gt;
&lt;/filter&gt;
&lt;marker id="tmArrowBlue" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"&gt;
&lt;path d="M0,0 L10,5 L0,10 Z" fill="#60a5fa"/&gt;
&lt;/marker&gt;
&lt;marker id="tmArrowGreen" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"&gt;
&lt;path d="M0,0 L10,5 L0,10 Z" fill="#34d399"/&gt;
&lt;/marker&gt;
&lt;marker id="tmArrowAmber" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"&gt;
&lt;path d="M0,0 L10,5 L0,10 Z" fill="#f59e0b"/&gt;
&lt;/marker&gt;
&lt;!-- Hidden paths the particles animate along --&gt;
&lt;path id="tmPathLLM"
d="M 140 240 L 320 240 L 500 240 L 680 240 L 860 240
L 860 320 Q 860 360 820 360 L 540 360 Q 500 360 500 320 L 500 280
L 500 240 L 680 240 L 860 240"
fill="none" stroke="none"/&gt;
&lt;path id="tmPathEmb"
d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540"
fill="none" stroke="none"/&gt;
&lt;path id="tmPathEmb2"
d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540"
fill="none" stroke="none"/&gt;
&lt;/defs&gt;
&lt;!-- background --&gt;
&lt;rect width="1200" height="720" fill="#0a1424"/&gt;
&lt;!-- header --&gt;
&lt;text x="120" y="60" font-family="'Space Grotesk','Inter',sans-serif" font-size="16" fill="#f59e0b" letter-spacing="3" font-weight="700"&gt;TWO MECHANISMS&lt;/text&gt;
&lt;text x="120" y="100" font-family="'Space Grotesk','Inter',sans-serif" font-size="38" fill="#ffffff" font-weight="700" letter-spacing="-.5"&gt;Generation vs Similarity&lt;/text&gt;
&lt;!-- ════════ LLM PIPELINE (top) ════════ --&gt;
&lt;text x="120" y="180" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#60a5fa" font-weight="700"&gt;LLM&lt;/text&gt;
&lt;text x="120" y="208" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic"&gt;non-deterministic · sampling&lt;/text&gt;
&lt;!-- LLM boxes --&gt;
&lt;g font-family="'Inter',sans-serif"&gt;
&lt;!-- prompt --&gt;
&lt;g&gt;
&lt;rect x="60" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="140" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;prompt&lt;/text&gt;
&lt;text x="140" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;"AI is..."&lt;/text&gt;
&lt;/g&gt;
&lt;!-- tokenize --&gt;
&lt;g&gt;
&lt;rect x="240" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="320" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;tokenize&lt;/text&gt;
&lt;text x="320" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;BPE → IDs&lt;/text&gt;
&lt;/g&gt;
&lt;!-- model --&gt;
&lt;g&gt;
&lt;rect x="420" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="500" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;model&lt;/text&gt;
&lt;text x="500" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;forward pass&lt;/text&gt;
&lt;/g&gt;
&lt;!-- sample --&gt;
&lt;g&gt;
&lt;rect x="600" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="680" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;sample&lt;/text&gt;
&lt;text x="680" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;temp / top-p&lt;/text&gt;
&lt;/g&gt;
&lt;!-- next token --&gt;
&lt;g&gt;
&lt;rect x="780" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/&gt;
&lt;text x="860" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;next token&lt;/text&gt;
&lt;text x="860" y="258" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;append, loop&lt;/text&gt;
&lt;/g&gt;
&lt;/g&gt;
&lt;!-- LLM arrows --&gt;
&lt;g stroke="#60a5fa" stroke-width="2" fill="none"&gt;
&lt;line x1="220" y1="240" x2="232" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;line x1="400" y1="240" x2="412" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;line x1="580" y1="240" x2="592" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;line x1="760" y1="240" x2="772" y2="240" marker-end="url(#tmArrowBlue)"/&gt;
&lt;/g&gt;
&lt;!-- Stochastic loop arrow (next token → model) --&gt;
&lt;path d="M 860 280 L 860 340 Q 860 360 840 360 L 520 360 Q 500 360 500 340 L 500 286"
fill="none" stroke="#f59e0b" stroke-width="2" stroke-dasharray="5,4" marker-end="url(#tmArrowAmber)"/&gt;
&lt;text x="680" y="384" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#f59e0b" font-style="italic"&gt;stochastic loop&lt;/text&gt;
&lt;!-- Pulsing rings on each LLM box (highlight as particle passes) --&gt;
&lt;rect class="tm-box-pulse" x="60" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="0s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="240" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="0.6s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="420" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.2s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="600" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.8s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="780" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="5s" begin="2.4s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;!-- Animated particle on the LLM path (loops continuously, traverses stochastic loop) --&gt;
&lt;circle class="tm-particle" r="9" fill="#60a5fa" filter="url(#tmGlow)"&gt;
&lt;animateMotion dur="5s" repeatCount="indefinite" rotate="auto"&gt;
&lt;mpath href="#tmPathLLM"/&gt;
&lt;/animateMotion&gt;
&lt;/circle&gt;
&lt;!-- ════════ EMBEDDING PIPELINE (bottom) ════════ --&gt;
&lt;text x="120" y="480" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#34d399" font-weight="700"&gt;EMBEDDING&lt;/text&gt;
&lt;text x="120" y="508" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic"&gt;deterministic · geometric&lt;/text&gt;
&lt;g font-family="'Inter',sans-serif"&gt;
&lt;!-- word --&gt;
&lt;g&gt;
&lt;rect x="60" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="140" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;word&lt;/text&gt;
&lt;text x="140" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;"king"&lt;/text&gt;
&lt;/g&gt;
&lt;!-- lookup --&gt;
&lt;g&gt;
&lt;rect x="240" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="320" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;lookup&lt;/text&gt;
&lt;text x="320" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;GloVe / SBERT&lt;/text&gt;
&lt;/g&gt;
&lt;!-- vector --&gt;
&lt;g&gt;
&lt;rect x="420" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="500" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;vector&lt;/text&gt;
&lt;text x="500" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;fixed dim&lt;/text&gt;
&lt;/g&gt;
&lt;!-- cosine --&gt;
&lt;g&gt;
&lt;rect x="600" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="680" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;cosine&lt;/text&gt;
&lt;text x="680" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;vs corpus&lt;/text&gt;
&lt;/g&gt;
&lt;!-- similarity --&gt;
&lt;g&gt;
&lt;rect x="780" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/&gt;
&lt;text x="860" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700"&gt;similarity&lt;/text&gt;
&lt;text x="860" y="558" text-anchor="middle" font-size="13" fill="#7e95b5"&gt;score · rank&lt;/text&gt;
&lt;/g&gt;
&lt;/g&gt;
&lt;g stroke="#34d399" stroke-width="2" fill="none"&gt;
&lt;line x1="220" y1="540" x2="232" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;line x1="400" y1="540" x2="412" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;line x1="580" y1="540" x2="592" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;line x1="760" y1="540" x2="772" y2="540" marker-end="url(#tmArrowGreen)"/&gt;
&lt;/g&gt;
&lt;!-- Pulsing rings on each Embedding box --&gt;
&lt;rect class="tm-box-pulse" x="60" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="0s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="240" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="0.7s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="420" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="1.4s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="600" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.1s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;rect class="tm-box-pulse" x="780" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"&gt;
&lt;animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.8s" repeatCount="indefinite"/&gt;
&lt;/rect&gt;
&lt;!-- Embedding particle (one-shot left-to-right, restarts cleanly) --&gt;
&lt;circle class="tm-particle" r="9" fill="#34d399" filter="url(#tmGlow)"&gt;
&lt;animateMotion dur="4s" repeatCount="indefinite" rotate="auto"&gt;
&lt;mpath href="#tmPathEmb"/&gt;
&lt;/animateMotion&gt;
&lt;/circle&gt;
&lt;!-- Annotation under embedding to underscore "one-shot, no loop" --&gt;
&lt;text x="500" y="624" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#34d399" font-style="italic"&gt;one-shot · same input always produces same output&lt;/text&gt;
&lt;!-- Bottom rule + caption --&gt;
&lt;line x1="60" y1="660" x2="940" y2="660" stroke="#1f3358" stroke-width="1"/&gt;
&lt;text x="500" y="690" text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic"&gt;Both paths exist in every modern NLP system. Which one you reach for depends on whether the answer needs to be &lt;tspan fill="#60a5fa" font-weight="700"&gt;written&lt;/tspan&gt; or &lt;tspan fill="#34d399" font-weight="700"&gt;found&lt;/tspan&gt;.&lt;/text&gt;
&lt;/svg&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The model is what writes the email. The embedding is what finds the one you wrote last March.&lt;/p&gt;</description><content:encoded><![CDATA[<img src="https://curiousbit.netlify.app/images/IITM/week6-mechanisms.png" alt="Engineering" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
.two-mech { margin: 2rem 0 2.5rem; border-radius: 14px; overflow: hidden; border: 1px solid #1f3358; background: #0a1424; }
.two-mech svg { display: block; width: 100%; height: auto; min-width: 720px; }
.two-mech-wrap { overflow-x: auto; }
@media (prefers-reduced-motion: reduce) { .two-mech .tm-particle { display: none; } }</style><div class="two-mech"><div class="two-mech-wrap"><svg viewBox="0 0 1200 720" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Animated diagram contrasting the LLM stochastic generation loop with the deterministic embedding similarity pipeline"><defs><filter id="tmGlow" x="-50%" y="-50%" width="200%" height="200%"><feGaussianBlur stdDeviation="3" result="b"/><feMerge><feMergeNode in="b"/><feMergeNode in="SourceGraphic"/></feMerge></filter><marker id="tmArrowBlue" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#60a5fa"/></marker><marker id="tmArrowGreen" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#34d399"/></marker><marker id="tmArrowAmber" markerWidth="10" markerHeight="10" refX="6" refY="5" orient="auto"><path d="M0,0 L10,5 L0,10 Z" fill="#f59e0b"/></marker><path id="tmPathLLM" d="M 140 240 L 320 240 L 500 240 L 680 240 L 860 240              L 860 320 Q 860 360 820 360 L 540 360 Q 500 360 500 320 L 500 280              L 500 240 L 680 240 L 860 240" fill="none" stroke="none"/><path id="tmPathEmb" d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540" fill="none" stroke="none"/><path id="tmPathEmb2" d="M 140 540 L 320 540 L 500 540 L 680 540 L 860 540" fill="none" stroke="none"/></defs><rect width="1200" height="720" fill="#0a1424"/><text x="120" y="60" font-family="'Space Grotesk','Inter',sans-serif" font-size="16" fill="#f59e0b" letter-spacing="3" font-weight="700">TWO MECHANISMS</text><text x="120" y="100" font-family="'Space Grotesk','Inter',sans-serif" font-size="38" fill="#ffffff" font-weight="700" letter-spacing="-.5">Generation vs Similarity</text><text x="120" y="180" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#60a5fa" font-weight="700">LLM</text><text x="120" y="208" font-family="'Inter',sans-serif" font-size="14" 
fill="#7e95b5" font-style="italic">non-deterministic · sampling</text><g font-family="'Inter',sans-serif"><g><rect x="60" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="140" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">prompt</text><text x="140" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">"AI is..."</text></g><g><rect x="240" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="320" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">tokenize</text><text x="320" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">BPE → IDs</text></g><g><rect x="420" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="500" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">model</text><text x="500" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">forward pass</text></g><g><rect x="600" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="680" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">sample</text><text x="680" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">temp / top-p</text></g><g><rect x="780" y="200" width="160" height="80" rx="10" fill="#0f1d33" stroke="#3b82f6" stroke-width="2"/><text x="860" y="234" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">next token</text><text x="860" y="258" text-anchor="middle" font-size="13" fill="#7e95b5">append, loop</text></g></g><g stroke="#60a5fa" stroke-width="2" fill="none"><line x1="220" y1="240" x2="232" y2="240" marker-end="url(#tmArrowBlue)"/><line x1="400" y1="240" x2="412" y2="240" marker-end="url(#tmArrowBlue)"/><line x1="580" y1="240" x2="592" y2="240" marker-end="url(#tmArrowBlue)"/><line x1="760" y1="240" x2="772" y2="240" marker-end="url(#tmArrowBlue)"/></g><path 
d="M 860 280 L 860 340 Q 860 360 840 360 L 520 360 Q 500 360 500 340 L 500 286" fill="none" stroke="#f59e0b" stroke-width="2" stroke-dasharray="5,4" marker-end="url(#tmArrowAmber)"/><text x="680" y="384" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#f59e0b" font-style="italic">stochastic loop</text><rect class="tm-box-pulse" x="60" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="0s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="240" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="0.6s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="420" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.2s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="600" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="1.8s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="780" y="200" width="160" height="80" rx="10" fill="none" stroke="#60a5fa" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="5s" begin="2.4s" repeatCount="indefinite"/></rect><circle class="tm-particle" r="9" fill="#60a5fa" filter="url(#tmGlow)"><animateMotion dur="5s" repeatCount="indefinite" rotate="auto"><mpath href="#tmPathLLM"/></animateMotion></circle><text x="120" y="480" font-family="'Space Grotesk','Inter',sans-serif" font-size="26" fill="#34d399" font-weight="700">EMBEDDING</text><text x="120" y="508" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic">deterministic · geometric</text><g 
font-family="'Inter',sans-serif"><g><rect x="60" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="140" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">word</text><text x="140" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">"king"</text></g><g><rect x="240" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="320" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">lookup</text><text x="320" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">GloVe / SBERT</text></g><g><rect x="420" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="500" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">vector</text><text x="500" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">fixed dim</text></g><g><rect x="600" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="680" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">cosine</text><text x="680" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">vs corpus</text></g><g><rect x="780" y="500" width="160" height="80" rx="10" fill="#0f1d33" stroke="#10b981" stroke-width="2"/><text x="860" y="534" text-anchor="middle" font-size="17" fill="#ffffff" font-weight="700">similarity</text><text x="860" y="558" text-anchor="middle" font-size="13" fill="#7e95b5">score · rank</text></g></g><g stroke="#34d399" stroke-width="2" fill="none"><line x1="220" y1="540" x2="232" y2="540" marker-end="url(#tmArrowGreen)"/><line x1="400" y1="540" x2="412" y2="540" marker-end="url(#tmArrowGreen)"/><line x1="580" y1="540" x2="592" y2="540" marker-end="url(#tmArrowGreen)"/><line x1="760" y1="540" x2="772" y2="540" marker-end="url(#tmArrowGreen)"/></g><rect class="tm-box-pulse" x="60" y="500" width="160" height="80" rx="10" 
fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="0s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="240" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="0.7s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="420" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="1.4s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="600" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.1s" repeatCount="indefinite"/></rect><rect class="tm-box-pulse" x="780" y="500" width="160" height="80" rx="10" fill="none" stroke="#34d399" stroke-width="3" opacity="0"><animate attributeName="opacity" values="0;1;0" dur="4s" begin="2.8s" repeatCount="indefinite"/></rect><circle class="tm-particle" r="9" fill="#34d399" filter="url(#tmGlow)"><animateMotion dur="4s" repeatCount="indefinite" rotate="auto"><mpath href="#tmPathEmb"/></animateMotion></circle><text x="500" y="624" text-anchor="middle" font-family="'Inter',sans-serif" font-size="13" fill="#34d399" font-style="italic">one-shot · same input always produces same output</text><line x1="60" y1="660" x2="940" y2="660" stroke="#1f3358" stroke-width="1"/><text x="500" y="690" text-anchor="middle" font-family="'Inter',sans-serif" font-size="14" fill="#7e95b5" font-style="italic">Both paths exist in every modern NLP system. Which one you reach for depends on whether the answer needs to be <tspan fill="#60a5fa" font-weight="700">written</tspan> or <tspan fill="#34d399" font-weight="700">found</tspan>.</text></svg></div></div><p>The model is what writes the email. 
The embedding is what finds the one you wrote last March.</p><p>Most modern AI systems are built from two fundamentally different mechanisms, and most confusion about what AI &ldquo;is&rdquo; comes from conflating them. LLMs are <em>generative</em>: tokens in, tokens out, with the output shaped by the prompt and the sampling settings, and varying from run to run whenever sampling is on. Embeddings are <em>geometric</em>: a deterministic mapping from a word or sentence to a fixed vector, where comparisons are positional and identical input always produces identical output. Both are essential. Both are old enough to be uncontroversial. Most useful systems combine them.</p><p>What follows is the <strong>Week 6 Graded Mini Project</strong> of the <strong>IITM Pravartak Professional Certificate Programme in Agentic AI and Applications</strong>, used here as a lens for both mechanisms across five hands-on exercises.</p><h2 id="the-two-paths-side-by-side">The two paths, side by side</h2><p>The diagram above shows the contrast in one frame. The LLM path is a loop with sampling: non-deterministic by design, with behaviour controlled by temperature, top-p, and prompt structure. The embedding path is a one-shot lookup followed by a geometric comparison: deterministic, fast, stable.</p><p>That single distinction tells you which mechanism to reach for. If the answer needs to be written, generated, synthesized, or improvised, you want the LLM. If the answer needs to be found, ranked, deduplicated, clustered, or routed, you want embeddings. 
Most production systems use both because most real problems are some combination of &ldquo;find the right context&rdquo; and &ldquo;say something useful about it.&rdquo;</p><p>A quick decision table to anchor the rest of the article:</p><table><thead><tr><th>Problem</th><th>Reach for</th></tr></thead><tbody><tr><td>Semantic search over a corpus</td><td>Embeddings</td></tr><tr><td>Conversational reply or text drafting</td><td>LLM</td></tr><tr><td>Near-duplicate detection or content clustering</td><td>Embeddings</td></tr><tr><td>Summarization of a long document</td><td>LLM</td></tr><tr><td>Routing a support ticket to the right team</td><td>Embeddings + a small classifier head</td></tr><tr><td>Question answering grounded in your docs</td><td>Both (RAG)</td></tr><tr><td>Image or text classification</td><td>Embeddings + a categorical head</td></tr><tr><td>Translation, rewriting, code generation</td><td>LLM</td></tr></tbody></table><p>The exercises below show why each row works the way it does.</p><h2 id="exercise-1-text-generation-reveals-prompt-and-sampling-sensitivity">Exercise 1: Text generation reveals prompt and sampling sensitivity</h2><p>Section A1 loaded <code>distilgpt2</code> through the Hugging Face <code>pipeline</code> API and generated three continuations of the same prompt:</p><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">generator</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span><span class="s2">"text-generation"</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="s2">"distilgpt2"</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="c1"># do_sample=True enables stochastic decoding, so the three sequences differ</span>
</span></span><span class="line"><span class="cl"><span class="n">generator</span><span class="p">(</span><span class="s2">"AI is transforming industries by"</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span> <span class="n">num_return_sequences</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">do_sample</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</span></span></code></pre></div><p>Three continuations came back from the same model, the same prompt, the same call:</p><blockquote><p><em>&ldquo;AI is transforming industries by using science to bring people together with a greater understanding of the importance of science. The new book takes an approach to both science and technology, allowing people to focus more more effectively on the basics and to&hellip;&rdquo;</em></p></blockquote><blockquote><p><em>&ldquo;AI is transforming industries by replacing the manufacturing sector with a manufacturing sector that can be turned into a manufacturing and IT sector by creating new jobs and creating new jobs. The new jobs and investment in the next decade will help spur growth&hellip;&rdquo;</em></p></blockquote><blockquote><p><em>&ldquo;AI is transforming industries by creating a new, faster, and more attractive way of generating capital and creating jobs for both the United States and Europe. This is an effective new way of doing this.&rdquo;</em></p></blockquote><p>Three different stories, none of which the model &ldquo;knew&rdquo;: it just produced plausible-sounding next tokens under stochastic sampling. Notice the repetitions (&ldquo;manufacturing sector with a manufacturing sector&rdquo;), the loops (&ldquo;more more effectively&rdquo;), the empty filler (&ldquo;a new, faster, and more attractive way of generating capital&rdquo;). 
DistilGPT-2 is a small model, and these are the artefacts of a system that&rsquo;s good at local fluency but doesn&rsquo;t have a strong forward plan.</p><p>The headline insight: LLM outputs are statistical, prompt-sensitive, and unrepeatable unless you fix the random seed or turn sampling off. The same prompt can give you variety (a feature when brainstorming) or drift (a bug when consistency matters).</p><h2 id="exercise-2-tokenization-is-where-the-abstraction-begins">Exercise 2: Tokenization is where the abstraction begins</h2><p>This is the section to slow down on. Take the sentence:</p><blockquote><p><em>&ldquo;LLMs are powerful tools for natural language understanding.&rdquo;</em></p></blockquote><p>A human reads eight words. The model sees ten tokens.</p><p><img src="/images/IITM/week6-tokens.png" alt="BPE tokenization of the sentence, showing each token as a coloured pill"/></p><p>After BPE (Byte-Pair Encoding) with the DistilGPT-2 tokenizer:</p><pre tabindex="0"><code>['LL', 'Ms', 'Ġare', 'Ġpowerful', 'Ġtools', 'Ġfor', 'Ġnatural',
'Ġlanguage', 'Ġunderstanding', '.']</code></pre><p>The string <code>LLMs</code> doesn&rsquo;t appear in the model&rsquo;s vocabulary as a single unit, so it is split into <code>LL</code> and <code>Ms</code>. The <code>Ġ</code> prefix encodes &ldquo;preceding space&rdquo;: that&rsquo;s how GPT-2&rsquo;s byte-level BPE preserves word boundaries without a separator character. The period gets its own token.</p><p>The mismatch between <em>what a human reads</em> and <em>what the model processes</em> has real consequences:</p><ul><li><strong>Cost is per token, not per word.</strong> API billing, latency, and rate limits are all token-denominated. A 1,000-word English prompt typically bills at roughly 1,300–1,500 tokens, and more in languages the tokenizer covers poorly.</li><li><strong>Context windows are token windows.</strong> A 4,096-token context holds roughly 3,000 English words. It holds much less code (whitespace and symbols inflate counts), and less again for languages with poor vocabulary coverage in the tokenizer.</li><li><strong>Rare strings behave oddly.</strong> Brand names, technical acronyms, foreign words, internal jargon: anything outside the trained vocabulary gets fractured. Model behaviour around those fractures is harder to predict, and prompt sensitivity often hides at this layer.</li><li><strong>The same string can tokenize differently with leading whitespace.</strong> <code>"king"</code> and <code>" king"</code> are different token sequences. That&rsquo;s why pasted prompts sometimes produce subtly different outputs from typed ones.</li></ul><p>Tokenization is the lowest layer of the LLM stack and the one most engineering conversations skip. 
If you&rsquo;re tuning prompts and getting unstable behaviour, the first place to look is what your input looks like <em>after the tokenizer touches it</em>, not what it looks like in your editor.</p><h2 id="exercise-3-prompts-shape-what-you-get">Exercise 3: Prompts shape what you get</h2><p>Section B ran three task-shaped prompts through the same generator, with <code>temperature=0.8</code> and <code>top_p=0.95</code>:</p><ul><li><strong>Summarization</strong>: explicit instruction with a 30-word cap.</li><li><strong>Q&amp;A</strong>: structured format with <code>Q:</code> and <code>A:</code> markers.</li><li><strong>Creative</strong>: open-ended request for a 4-line poem about AI.</li></ul><p>The summarization output respected the spirit of the constraint but drifted past 30 words on most runs. DistilGPT-2 is a small base model with no instruction tuning, so hard length control isn&rsquo;t reliable even with explicit instructions. The Q&amp;A prompt, which asked for the capital of Japan, returned <code>I believe...</code>, so the model hedged. A larger model would say Tokyo confidently; a small model produces statistically plausible Q&amp;A-shaped text without strong factual grounding. The creative prompt produced varied and stylistic continuations, but with the lowest grounding: fluency over precision.</p><p>Structure compresses the output space the model is sampling from. Vagueness expands it. That single idea is most of what &ldquo;prompt engineering&rdquo; actually is; the rest is technique.</p><h2 id="exercise-4-word-embeddings-encode-semantic-geometry">Exercise 4: Word embeddings encode semantic geometry</h2><p>Pivot to the other mechanism. 
Section C1 loaded <strong>GloVe</strong> vectors (<code>glove-wiki-gigaword-50</code>: 50 dimensions, trained on Wikipedia and Gigaword) via Gensim, then asked for the five nearest neighbours of three words:</p><table><thead><tr><th>Query</th><th>Top 5 neighbours (cosine similarity)</th></tr></thead><tbody><tr><td><code>king</code></td><td>prince (0.82), queen (0.78), ii (0.77), emperor (0.77), son (0.77)</td></tr><tr><td><code>queen</code></td><td>princess (0.85), lady (0.81), elizabeth (0.79), king (0.78), prince (0.78)</td></tr><tr><td><code>diamond</code></td><td>gold (0.77), diamonds (0.77), gem (0.74), silver (0.72), jewel (0.71)</td></tr></tbody></table><p>There is no generation here. Each word is mapped to a fixed 50-dimensional vector, and the &ldquo;nearest neighbours&rdquo; are the words whose vectors sit closest in that space by cosine similarity. The geometry was learned by training on co-occurrence: words that appear in similar contexts end up in similar positions. That&rsquo;s why <code>king</code> and <code>prince</code> are nearest neighbours, why <code>queen</code> pulls in <code>elizabeth</code> (the corpus has plenty of references to Queen Elizabeth), and why <code>diamond</code> cleanly resolves to a jewellery cluster.</p><p>The classic <code>king − man + woman ≈ queen</code> analogy works in this same space; the lab didn&rsquo;t run it, but the geometry is there. Embeddings don&rsquo;t <em>write</em> anything; they <em>place</em> things near other things. That single property is what makes them the backbone of semantic search, retrieval, deduplication, and recommendation.</p><h2 id="exercise-5-sentence-similarity-from-averaged-word-vectors">Exercise 5: Sentence similarity from averaged word vectors</h2><p>Section C2 extended the geometry to sentences. 
Each of five short sentences, spread across two topics (AI/ML and jewellery), was reduced to a sentence vector (the mean of its word vectors, with simple lowercase tokenization) and then compared against the others with cosine similarity.</p><p>Plotted in 2D via multidimensional scaling on the cosine distances, the clustering is unambiguous:</p><p><img src="/images/IITM/week6-clusters.png" alt="Two-dimensional cluster plot of the five sentence vectors, with the AI/ML sentences clearly separated from the jewellery sentences"/></p><p>The numerical version:</p><table><thead><tr><th></th><th>AI/support</th><th>ML/fraud</th><th>Jewellery</th><th>Neural/medical</th><th>Luxury/rings</th></tr></thead><tbody><tr><td><strong>AI/support</strong></td><td>1.00</td><td>0.84</td><td>0.60</td><td>0.80</td><td>0.50</td></tr><tr><td><strong>ML/fraud</strong></td><td>0.84</td><td>1.00</td><td>0.73</td><td>0.83</td><td>0.62</td></tr><tr><td><strong>Jewellery</strong></td><td>0.60</td><td>0.73</td><td>1.00</td><td>0.58</td><td>0.88</td></tr><tr><td><strong>Neural/medical</strong></td><td>0.80</td><td>0.83</td><td>0.58</td><td>1.00</td><td>0.56</td></tr><tr><td><strong>Luxury/rings</strong></td><td>0.50</td><td>0.62</td><td>0.88</td><td>0.56</td><td>1.00</td></tr></tbody></table><p>Within-cluster pairs sit at 0.80–0.88. Cross-domain pairs sit at 0.50–0.73. Every within-cluster pair outscores every cross-domain pair, which is exactly the grouping you&rsquo;d want from a retrieval system, although the gap between the weakest within-cluster pair (0.80) and the strongest cross-domain pair (0.73, ML/fraud against Jewellery) is narrow.</p><p>Three points worth naming, because they explain why modern retrieval doesn&rsquo;t actually use GloVe averages:</p><ul><li><strong>Averaging discards word order.</strong> &ldquo;Dog bites man&rdquo; and &ldquo;man bites dog&rdquo; produce identical sentence vectors. For most retrieval that&rsquo;s tolerable; for anything where syntax carries the meaning, it isn&rsquo;t.</li><li><strong>Transformer encoders fixed this.</strong> Models like BERT, RoBERTa, and their descendants produce <em>contextual</em> embeddings, in which each token&rsquo;s vector depends on the tokens around it. 
Pool those across a sentence and you get a representation that respects word order and disambiguates polysemy.</li><li><strong>Sentence-BERT and friends made it production-grade.</strong> SBERT (and successors like OpenAI&rsquo;s <code>text-embedding-3</code> models, Cohere&rsquo;s embeddings, Voyage, etc.) trained encoders specifically for sentence-level similarity. That&rsquo;s the difference between &ldquo;the demo works on five sentences&rdquo; and &ldquo;you can index a million documents and search them in milliseconds.&rdquo;</li></ul><p>GloVe averaging is a baseline. It&rsquo;s the right baseline to start with, because it lets you see the geometry without the architecture getting in the way. Production systems start from this picture and replace the lookup step.</p><h2 id="when-both-mechanisms-meet">When both mechanisms meet</h2><p>The last part of the lab sits at the intersection. <code>distilbert-base-uncased-finetuned-sst-2-english</code> is a transformer encoder (an embedding model under the hood) with a classification head fine-tuned for sentiment. Run it on three workplace-themed inputs:</p><table><thead><tr><th>Input</th><th>Label</th><th>Score</th></tr></thead><tbody><tr><td>&ldquo;The chatbot reduced ticket resolution time by 40% this quarter.&rdquo;</td><td>POSITIVE</td><td>0.9962</td></tr><tr><td>&ldquo;Our deployment failed repeatedly and customers were upset.&rdquo;</td><td>NEGATIVE</td><td>0.9997</td></tr><tr><td>&ldquo;The new recommendation engine is acceptable but needs tuning.&rdquo;</td><td>NEGATIVE</td><td>0.9898</td></tr></tbody></table><p>The third row is the interesting one, and it&rsquo;s worth unpacking because it points at a problem that turns up in every enterprise deployment of pretrained models.</p><p>&ldquo;Acceptable but needs tuning&rdquo; is, in workplace context, a <em>lukewarm-positive</em>, closer to &ldquo;approved with caveats&rdquo; than &ldquo;this is bad.&rdquo; The classifier scored it NEGATIVE with 0.9898 confidence. 
Three things are happening at once:</p><ul><li><strong>Domain mismatch.</strong> The model was fine-tuned on SST-2, which is movie reviews. &ldquo;Needs tuning&rdquo; reads negative there. In an engineering team&rsquo;s language, &ldquo;needs tuning&rdquo; is constructive: the same words have different sentiment loadings in different domains.</li><li><strong>No calibration on workplace text.</strong> The score is 0.9898, extreme confidence for what should be a borderline case. Pretrained classifiers tend to be miscalibrated on out-of-distribution inputs: they&rsquo;re not just wrong, they&rsquo;re confidently wrong. Calibration techniques (temperature scaling, Platt scaling, conformal prediction) exist for exactly this.</li><li><strong>Weak supervision is the practical fix.</strong> When you can&rsquo;t fine-tune (no labelled data, no budget, no time), the durable answer is to treat the classifier as one signal among several, combining it with rules, keyword filters, or a second model, rather than trusting any single number above the threshold.</li></ul><p>Architecturally, the lesson generalises across all three Section D variants. Generation is &ldquo;embedding + decoder loop.&rdquo; Classification is &ldquo;embedding + categorical head.&rdquo; Retrieval is &ldquo;embedding + cosine.&rdquo; Same underlying mathematical object, different output shapes. The architectural choices around the embedding determine what the system does, and where it fails when you take it out of the domain it was trained on.</p><h2 id="closing-observations">Closing observations</h2><p>Three things that generalise beyond this lab.</p><p><strong>Tokenization is where most LLM cost and quirks actually originate.</strong> It&rsquo;s the lowest layer of the stack and the one most engineering conversations skip. 
If you&rsquo;re tuning prompts and getting unstable behaviour, the first place to look is what your input looks like after the tokenizer touches it.</p><p><strong>Embedding-based similarity is older, cheaper, and more deterministic than people remember.</strong> Before reaching for an LLM call to compare two pieces of text, embed them and compute cosine. It takes milliseconds, costs next to nothing with a local model, and is stable. A surprising fraction of &ldquo;AI features&rdquo; are really embedding lookups with a confidence threshold.</p><p><strong>Generation and similarity sit next to each other.</strong> They are not competitors. RAG is the obvious example: embeddings retrieve, and the LLM generates the answer grounded in what was retrieved. The <a href="/rag-chatbot-for-the-github-rest-api/">Week 15 RAG chatbot post</a> is what these two mechanisms look like wired together for production.</p><p>One predicts words. One maps meaning. Knowing which one to reach for is most of the job.</p>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/IITM/week6-mechanisms.png" medium="image"><media:title type="plain">Engineering</media:title></media:content><category>artificial-intelligence</category><category>llm</category><category>machine-learning</category><category>engineering</category><category>Knowledge Base</category></item></channel></rss>