Ajay Walia

Aether, Rethought — The Shape Was Wrong All Along

Ajay Walia — Fri, 12 Jun 2026 00:00:00 +0000

Part III of the Aether series. Missed the first two? Start with Meet Aether (the build), then Aether, Grown Wild (what happened when it ran).

The story so far — one paragraph each

Part I: I built a 10-agent (later 13-agent) system that runs an entire team of IT architecture specialists on a single MacBook M5 Pro — one Gemma model, zero cloud, zero data egress. Every agent is just a YAML manifest: a different system prompt and a different knowledge-base namespace pointing at the same weights. The system escalates upward when confidence falls below 0.7.

Part II: That clean idea hit reality. The router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and self-reported confidence was replaced by a formula the system computes. Thirteen agents, a live web allowlist, a computed confidence score, and a CHANGELOG.md in place of Git.

Part III (this one): The shape is wrong. The hierarchy that felt so natural — because it mirrors the org chart — turns out to optimise for the wrong things. I worked through five recognised ways to build an agentic AI system, scored each against the same criteria, and arrived at a recommendation that changes the architecture without throwing away anything we’ve built.

What we’re actually building

The output isn’t a chatbot. It’s a consulting deliverable — an architecture document advising how to run an IT transition. Every real client engagement spans multiple towers simultaneously: a move to Entra ID while modernising the network and shifting workloads to Azure touches Cloud, Network, Digital Workplace, and Security in the same breath. The value and the difficulty live in the cross-domain synthesis.

Where we started — the three-tier hierarchy

This is what Aether v2.x actually is. One model, thirteen agents, each differentiated only by system prompt and knowledge namespace. Work routes down the tree; low confidence escalates back up.

It was appealing for real reasons: it mirrors how a delivery organisation thinks, it is easy to explain to a client, and the RAG namespace-per-domain isolation is clean. But we copied an org chart into the control flow — and the next section explains why that’s usually a trap.

The core insight — why the org chart is the wrong shape

This is the conceptual centre of the entire analysis. Everything that follows flows from these two points.

A strict tree only allows vertical movement — up to escalate, down to delegate. Real architecture work needs lateral collaboration. The AWS specialist can’t directly ask the Network specialist a question; it has to climb the entire tree and back down. That’s bureaucracy encoded in Python.
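To make the cost of vertical-only movement concrete, here is a minimal sketch that counts the hops a question takes between two specialists in a strict tree. The parent map and agent names are hypothetical and only illustrate the shape; they are not Aether's actual manifests.

```python
# Hypothetical parent map (child -> parent); names are illustrative only.
PARENT = {
    "aws_specialist": "cloud_architect",
    "network_specialist": "network_architect",
    "cloud_architect": "enterprise_architect",
    "network_architect": "enterprise_architect",
    "enterprise_architect": None,
}


def chain_to_root(agent):
    """Return the agent followed by each of its ancestors up to the root."""
    chain = []
    while agent is not None:
        chain.append(agent)
        agent = PARENT[agent]
    return chain


def tree_path(src, dst):
    """Route a message in a strict tree: climb to the lowest common ancestor, then descend."""
    up = chain_to_root(src)
    down = chain_to_root(dst)
    common = next(a for a in up if a in down)
    climb = up[: up.index(common) + 1]
    descend = list(reversed(down[: down.index(common)]))
    return climb + descend


path = tree_path("aws_specialist", "network_specialist")
print(" -> ".join(path))
print("hops:", len(path) - 1)
```

In this toy tree a single lateral question costs four hops, two of them through agents that add nothing to the answer.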

The second consequence is more fundamental: the deliverable is a workflow, not an organisation. Producing an architecture document is a consulting process with phases. The right structure for that process is a pipeline. We modelled the people first and the process second — we should have done it the other way round.

Five ways to build it

Rather than jump straight to a new design, I evaluated five recognised approaches against the same five axes: cross-domain capability, parallelism, auditability, simplicity, and fit for document generation.

Note the footer: RAG sits underneath all five approaches — they differ in control flow, not in whether they retrieve.

Approach 1 — Hierarchical / org-mirror

This is what Aether v2.x already is.

The pros are real — which is why we chose it. But the killer con: there is no sideways path. The AWS agent can’t ask the Network agent a question without escalating all the way up and back down. Also, escalation triggered by confidence scores sounds clean, but LLM confidence is unreliable — the trigger itself is shaky.

Verdict: Great for stable, siloed problems with a genuine audit chain. That is not what our engagements look like.

Approach 2 — Orchestrator + flat specialists

One orchestrator plans the task, fans it out to specialists running in parallel, then synthesises. Flat: adding a domain means adding one specialist — no re-tiering.

Verdict: A strong, flexible default. A piece of the recommendation.

Approach 3 — Workflow / pipeline (process-native)

Instead of organising by who, organise by the stages of producing the document: Discover → Assess → Design → Review → Assemble. Each stage maps to a section of the output. Human checkpoints slot between stages.

Verdict: This is the spine. The structure that mirrors how the deliverable is actually made. The so-called con — that domain experts become knowledge sources rather than autonomous agents — is arguably the correct framing.

Approach 4 — Blackboard / shared artifact

All agents read and write a common workspace — the evolving document. Strong on cross-domain consistency because there’s only one object. The hard part is concurrency control.

Key distinction: a true blackboard (free-for-all writes) is risky. A typed, controlled shared state object — which LangGraph gives us — keeps the benefit without the chaos. We use the controlled version.

Approach 5 — Single-agent baseline

One capable model. A lightweight router. Domain knowledge bases on demand. Simplest, cheapest, most reliable.

Verdict: The benchmark every fancier design has to beat. The real reason to decompose is information management — not model weakness. When CMDB exports, cloud inventories, and Terraform files all arrive together, no context window handles it cleanly.

The comparison

The table does the work. Our current design (Hierarchical) is weakest exactly where we need strength — cross-domain — while strong on auditability. The two approaches that score high on fit-for-doc-gen are complementary: strong in different places. That’s the bridge to the recommendation.

The recommendation — a hybrid

Not a ladder. A hub. The pipeline (Discovery → Assessment → Design) flows through a single typed shared state object. Domain specialists are invoked within stages — callable skills, not autonomous routing agents. Cross-cutting reviewers (Security, Cost, Risk, Compliance) act across all stages. QA can loop work back to Design or Assessment until the rubric passes.
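A minimal sketch of that control flow, assuming a toy state object and a made-up rubric: the stage functions, the `State` fields and the rule that the first draft fails QA are all hypothetical, and the loop here only returns to Design, whereas the real flow can also return to Assessment.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Stand-in for the typed shared state; fields are illustrative."""
    log: list = field(default_factory=list)
    design_revision: int = 0


def discovery(state):
    state.log.append("discovery")


def assessment(state):
    state.log.append("assessment")


def design(state):
    state.design_revision += 1
    state.log.append(f"design r{state.design_revision}")


def qa_passes(state):
    # Hypothetical rubric: pretend the first design draft fails review.
    return state.design_revision >= 2


def run_pipeline(state, max_loops=3):
    """Run the stages in order, looping Design until QA passes or the loop budget runs out."""
    discovery(state)
    assessment(state)
    for _ in range(max_loops):
        design(state)
        if qa_passes(state):
            state.log.append("qa passed")
            return state
        state.log.append("qa failed, back to design")
    raise RuntimeError("QA rubric not met within max_loops")


final = run_pipeline(State())
print(final.log)
```

The bounded loop is a deliberate choice in this sketch: a rework cycle that can never terminate is worse than one that stops and asks a human.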

How the model works

Three things to highlight:

Parallel specialists write structured findings into one typed shared state — the single source of truth
Arbitration is deterministic — fires on conflict, policy breach, or missing data, never on confidence scores
The model is pluggable — local Gemma today, Claude/OpenAI/Gemini tomorrow, same knowledge, no rigid tree
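The deterministic triggers in the second point can be sketched as a pure function over structured findings. This is an illustration under assumptions: the `Finding` fields and the event names are hypothetical, and a real conflict check would need more than comparing recommendation strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One specialist's structured finding; field names are hypothetical."""
    specialist: str
    topic: str
    recommendation: str
    evidence_ids: tuple = ()
    violates_policy: bool = False


def arbitration_triggers(findings):
    """Return the events that send work to arbitration. No confidence score is consulted."""
    events = []
    recommendations_by_topic = {}
    for f in findings:
        recommendations_by_topic.setdefault(f.topic, set()).add(f.recommendation)
        if f.violates_policy:
            events.append(("policy_breach", f.topic, f.specialist))
        if not f.evidence_ids:
            events.append(("missing_data", f.topic, f.specialist))
    for topic, recs in recommendations_by_topic.items():
        if len(recs) > 1:
            events.append(("conflict", topic, None))
    return events


findings = [
    Finding("cloud", "connectivity", "private circuit", evidence_ids=("E1",)),
    Finding("network", "connectivity", "site-to-site VPN"),
]
print(arbitration_triggers(findings))
```

Because the function looks only at the findings themselves, the same inputs always produce the same arbitration events, which is what makes the trigger auditable.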

Why this is the right call

The verdict, stated plainly: it’s the only option that handles cross-domain work with the auditability, governance, and evidence-traceability a client deliverable demands — at an acceptable, well-understood increase in build complexity.

The technology stack — nothing new to install

Every box in the stack is something we already run. The v3 work extends Orchestration and adds the governance/eval layer. This is not a rebuild.

The process — how a real engagement runs

Two things to notice:

Feedback arrows — review, QA, and cost/risk findings can send work back to Design or Assessment. It’s iterative, not a one-way pipeline.
Human gates — after assessment and before assembly. An architect validates the current-state picture and signs off before the document is built. Gated autonomy, not full automation — which matters when the output carries liability.

The real hard problem — evidence quality

Every downstream recommendation carries the confidence level and assumptions set at this gate. No silent guessing. The system requests more data, logs gaps in an assumption register, or proceeds while explicitly stating its confidence level.
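One way to picture the three outcomes is a small gate function. Everything here is an assumption made for illustration: the evidence names, the idea of a "blocking" subset that can never be assumed away, and the use of simple evidence coverage as a crude stand-in for the stated confidence level.

```python
from enum import Enum


class GateDecision(Enum):
    REQUEST_DATA = "request more data"
    PROCEED_WITH_ASSUMPTIONS = "proceed, assumptions logged"
    PROCEED = "proceed"


def evidence_gate(required, provided, blocking):
    """Decide how to continue.

    required / provided: sets of evidence item names.
    blocking: the subset of required items that must be supplied, never assumed.
    Returns (decision, items_to_request, assumption_register, coverage).
    """
    missing = required - provided
    must_request = sorted(missing & blocking)
    assumptions = [
        f"No '{item}' supplied; treated as an assumption"
        for item in sorted(missing - blocking)
    ]
    coverage = len(required & provided) / len(required) if required else 1.0
    if must_request:
        decision = GateDecision.REQUEST_DATA
    elif assumptions:
        decision = GateDecision.PROCEED_WITH_ASSUMPTIONS
    else:
        decision = GateDecision.PROCEED
    return decision, must_request, assumptions, coverage


required = {"cmdb_export", "cloud_inventory", "network_diagram"}
result = evidence_gate(required, {"cmdb_export", "cloud_inventory"}, blocking={"cmdb_export"})
print(result[0].value, result[2])
```

Whatever the gate returns travels with the work, so a later recommendation can cite both the coverage figure and the exact assumptions it rests on.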

What flows through the system — the canonical Architecture State

One typed object. Every stage, specialist, and reviewer reads from and writes to it. The document is rendered from it. Every recommendation traces back through the same object to the evidence that produced it — that traceability is what makes the deliverable auditable.
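The traceability claim can be sketched with plain dataclasses. This is not the actual Architecture State schema, which the post does not show; the class and field names are hypothetical, and the only point is that a recommendation cannot enter the state without evidence it can be traced back to.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    id: str
    source: str


@dataclass
class Recommendation:
    id: str
    text: str
    evidence_ids: list


@dataclass
class ArchitectureState:
    """Hypothetical shared state: one object that stages read from and write to."""
    evidence: dict = field(default_factory=dict)
    recommendations: list = field(default_factory=list)

    def add_evidence(self, ev):
        self.evidence[ev.id] = ev

    def add_recommendation(self, rec):
        # Reject anything that cites no evidence or cites evidence the state has never seen.
        unknown = [e for e in rec.evidence_ids if e not in self.evidence]
        if not rec.evidence_ids or unknown:
            raise ValueError(f"{rec.id} lacks traceable evidence: {unknown}")
        self.recommendations.append(rec)

    def trace(self, rec_id):
        """Return the sources behind one recommendation."""
        rec = next(r for r in self.recommendations if r.id == rec_id)
        return [self.evidence[e].source for e in rec.evidence_ids]


state = ArchitectureState()
state.add_evidence(Evidence("E1", "cmdb_export.csv"))
state.add_recommendation(Recommendation("R1", "Consolidate identity providers", ["E1"]))
print(state.trace("R1"))
```

Enforcing the check at write time, rather than auditing afterwards, means an untraceable recommendation never reaches the rendered document.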

Lenses, not domains — where security and compliance live

Security is a lens, not a domain. Same for Cost, Compliance, Risk, Sustainability, and Supportability. They aren’t towers sitting next to Cloud and Network — they cut horizontally across every stage. Most rework loops originate in Security and Compliance, which is exactly why they’re first-class cross-cutting reviewers.

Key design considerations

The most important reframe: we decompose for information management, not because the model is weak. That reframes the whole multi-agent debate. And arbitration fires on deterministic events — never on confidence scores. Every recommendation must trace to evidence.

Design decisions — where the build effort actually goes

The effort allocation is the provocative bit. Only ~10% of build effort goes to the agents themselves. ~35% is evidence ingestion and normalisation. ~20% is evaluation and governance. Most teams invert these numbers — polished agents, no evaluation — and end up with a demo that can’t produce a deliverable a client would pay for. That’s the trap we’re avoiding.

Business value

The positioning: a consulting accelerator and quality platform — not a headcount-reduction play. The specific numbers (50–80% faster assessments, 60–90% less evidence-collection effort) are hypotheses to validate in a pilot, not measured results. They’re framed that way because that’s what earns trust.

Next steps

The sequencing matters: evidence ingestion and the QA rubric first, agent polish second. That’s where the quality actually lives.

The honest accounting

I built a hierarchy because it was intuitive. It is intuitive — it maps to how the organisation thinks, the escalation chain is easy to trace and audit, and it’s easy to explain to a client. Those are real advantages.

But intuitive for humans and optimal for agents are not the same thing. The hierarchy was designed to answer “who is responsible?” — an important human question that agents don’t need answered. The new design answers “what needs to happen next?” — which is the right question for a document-generation workflow.

What changes in v3:

Agents become implementation details inside the Knowledge layer — the durable value is in evidence, governance, evaluation, and the document workflow
Escalation fires on events (conflict, policy breach, missing data) — not on confidence scores
The canonical model is the design constraint — the document isn’t an afterthought, it’s what everything is structured around
Model-agnostic by default — local Gemma today, SOTA API tomorrow, same knowledge

What stays the same: LangGraph, LanceDB, BAAI/bge-small, FastAPI, SQLite, the YAML manifests, and the 13 domain specialists — which become callable skills within the Design stage rather than autonomous routing agents. Same knowledge, different invocation mechanism.

This is an extension of what we run, not a rebuild.

Questions about the design or the approach? Reach out on LinkedIn.


I Built My Own RSS Reader in an Afternoon — With AI Doing the Typing

Ajay Walia — Wed, 10 Jun 2026 00:00:00 +0000

For years my RSS setup was a compromise. I never wanted to pay for a desktop reader, so I got stuck with The Old Reader — a perfectly fine service, but using it felt like visiting a website, because that’s exactly what it was. Open a browser tab, log in, scroll, repeat tomorrow. What I actually wanted was simple: a native Mac app, my feeds downloaded and stored locally, no account, no subscription, fast enough to triage a hundred articles with my keyboard.

Then I heard about Claude’s new Fable model and thought: fine, let’s stop wishing and just build the thing. This is the story of LumenAI — a local-first RSS reader for macOS with AI summaries — built from an empty folder to a notarizable DMG in about an hour of wall-clock time, with me acting as product owner and build verifier while the AI wrote the code.

That’s the real app, not a mockup. There’s a full captioned gallery here — the loaded feed list, OPML import, the memory footprint, and the DMG build.

The Idea

The pitch I gave the AI was one paragraph: a local RSS reader for Mac, feeds downloaded and stored on my machine, a premium feel, and treat it like a real engineering project — clear phases, and ask me questions before making decisions. That last part turned out to be the most important sentence in the whole project.

Instead of immediately generating a wall of code, it interviewed me. What stack? What does v1 include, and — just as important — what does it exclude? How should refresh work? What does “premium” mean to you, concretely? By the end of a few rounds of multiple-choice questions, we had a real spec:

v1 goals: subscribe to feeds, full article extraction, fast local search, OPML import/export, saved views, deduplication, offline reading, AI summaries, keyboard-first navigation, dark/light themes, typography controls, reader mode, reading progress.

v1 non-goals: semantic search, topic clustering, multi-device sync, social anything.

Writing down the non-goals felt almost ceremonial at the time. It wasn’t. Every time scope tried to creep, that list killed the discussion in one line.

The Stack

Every choice optimized for “native feel, local data, no servers.”

The app is Swift and SwiftUI targeting macOS 14, because nothing fakes the feel of a real Mac app. Storage is SQLite via GRDB.swift, chosen over Apple’s SwiftData specifically for FTS5 — SQLite’s built-in full-text search engine, which gives instant search across every article ever downloaded, entirely offline. Feed parsing is FeedKit, wrapped in a normalizer layer so the rest of the app never touches a FeedKit type and JSON Feed support costs no schema changes. Full-text extraction is Mozilla’s Readability.js — the same engine behind Firefox’s reader mode — running in a hidden WKWebView with the page’s own JavaScript disabled. The reader itself is a WKWebView used purely as a rendering layer for a themed HTML template; everything around it stays SwiftUI. The project file is generated by XcodeGen from a YAML spec, which kept the AI and Xcode from ever fighting over a .xcodeproj.

The AI layer is the part I’m proudest of architecturally: a single SummaryProvider protocol with six implementations — Apple Intelligence (on-device, appears only on macOS 26+), Ollama for local models, Claude, OpenAI, Gemini, and Disabled. One protocol method. Swapping providers is a dropdown in Settings; API keys live in the macOS Keychain.

The Seven Phases (Okay, Eight)

We numbered from zero, like civilized people.

Phase 0 — Scaffold. XcodeGen project, sandbox and network entitlements, and a three-pane shell (sidebar, article list, reader) running on sample data. The exit criterion was simply “builds and runs.” It almost did: the very first build failed with Swift’s infamous “the compiler is unable to type-check this expression in reasonable time” — the AI had written a too-clever nested closure to generate sample data. It rewrote it as a boring for loop. A very human bug, honestly.

Phase 1 — Data layer. The real schema: feeds, folders, articles, saved views, an FTS5 index kept in sync by SQL triggers, and a three-tier deduplication identity — an article is its guid if the feed provides one, else its normalized URL (tracking parameters stripped), else a content hash. Ten unit tests against an in-memory database before any networking existed.

Phase 2 — Feed engine. Fetching with HTTP conditional GET, so unchanged feeds cost a 304 response instead of a re-download. Feed auto-discovery, so typing daringfireball.net finds the actual feed URL by scanning the page’s tags. RSS, Atom, and JSON Feed all normalize into one canonical model. By the end of this phase the app was genuinely usable: subscribe, read, refresh.

Phase 3 — Core UI. Folders, favicons, thumbnails, unread badges, and the thing that makes an RSS reader feel like a tool instead of a website: keyboard navigation. j/k to move, space for next unread, s to star, m to toggle read. The Reeder dialect, basically.

Phase 4 — Reading experience. Select an article and it silently fetches the source page, runs Readability.js over it, and stores clean full text — so a feed that only publishes two-line excerpts still gives you whole articles, offline, forever. Typography controls (serif/sans, size, line width), themes that follow the system, and per-article reading progress that restores when you come back. This phase also produced the best bug of the project: scroll position was saved to app state, which regenerated the reader HTML, which reloaded the page, which reset the scroll — an infinite loop the AI caught in code review before I ever built it.

Phase 5 — Search, saved views, OPML. Global FTS5 search from the toolbar, saved views (persistent named filters — “unread Swift articles from these three feeds, last 30 days”), and OPML import/export so my subscriptions could finally walk out of The Old Reader with folder structure intact.

Phase 6 — AI summaries. The provider protocol described above, plus a deliberately boring prompt: summarize in two or three sentences, be specific, no “this article discusses.” The summary renders as a tinted card above the article. The point isn’t to replace reading — it’s triage. Is this worth my next ten minutes?

Phase 7 — Polish. An app icon (generated programmatically — an RSS glyph under a sparkle on an indigo gradient), a dock badge with the unread count, render-path caching so thousand-article lists scroll smoothly, and a one-command script that builds a signed, drag-to-Applications DMG.

The Phase 7 finale, for real: one script, one signed DMG on the Desktop. More screenshots — the loaded feed list after OPML import, and the app’s 143 MB memory footprint in Activity Monitor — are on the LumenAI screenshots page.
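The three-tier deduplication identity from Phase 1 is easy to sketch outside the app. LumenAI itself is Swift; the version below is Python purely for illustration, and the list of tracking parameters, the URL normalization rules and the choice of SHA-256 are all assumptions rather than what the app actually does.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions: which query parameters count as tracking noise.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url):
    """Lowercase scheme and host, drop tracking parameters, fragment and trailing slash."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES) and k.lower() not in TRACKING_KEYS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), urlencode(query), "")
    )


def article_identity(guid, url, content):
    """Prefer the feed's guid, then the normalized URL, then a hash of the content."""
    if guid:
        return "guid:" + guid
    if url:
        return "url:" + normalize_url(url)
    return "hash:" + hashlib.sha256(content.encode("utf-8")).hexdigest()


a = article_identity(None, "https://Example.com/post/?utm_source=rss&id=7", "")
b = article_identity(None, "https://example.com/post?id=7", "")
print(a == b)
```

Prefixing each identity with its tier keeps a guid from ever colliding with a URL or a hash that happens to look the same.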

What the Lifecycle Actually Felt Like

The loop for every phase was identical: the AI proposed decisions and asked questions, wrote the code and its tests, and then stopped — because it couldn’t compile anything. Its sandbox is Linux; you can’t build a Mac app there. So I was the build machine. ⌘U, ⌘R, report back. “Build succeeded, go ahead” became the rhythm of the afternoon.

That constraint turned out to be a feature. It forced a real checkpoint between phases — a human running the actual app — instead of an unbroken firehose of unverified code. Of the three failures across the whole project, two were caught by my builds (the type-checker timeout, and a “cannot find type” error that turned out to mean I’d forgotten to re-run XcodeGen after files were added) and one was caught by the AI re-reading its own code. Final tally: 38 Swift files, 27 tests, zero runtime crashes encountered.

The other thing that surprised me: being asked questions felt like the AI respecting that it was my app. Tech stack, refresh cadence, dedup policy, summary length, even the app’s name — every fork in the road was a decision I made in seconds from a menu of researched options, instead of an assumption silently baked into code I’d discover three weeks later.

What I’d Tell You If You’re Tempted

Treat it like an engineering project, not a magic trick. The phases, the non-goals list, the tests, the check-in after every phase — that structure is why this worked in an hour instead of unraveling in a weekend. The AI typed every line of code, but the spec, the taste, and the “no, simpler” calls were the human contribution, and the project needed both.

And yes — the app starts instantly, works on a plane, and never asks me to log in. The subscription I avoided paying for has been replaced by the most expensive thing of all: now I want to build everything.

LumenAI is Swift/SwiftUI on macOS 14+, with GRDB, FeedKit, and Readability.js. Built with Claude Fable 5.


Aether, Grown Wild — The Implementation Journey (v2.6 → v2.8.2)

Ajay Walia — Wed, 03 Jun 2026 00:00:00 +0000

Implementation Journey · Part II

Aether, Grown Wild

The first article was about implementing the base idea. This one is the expedition — how the idea evolved as features were added, growing into a 13-agent, web-first, self-escalating system, and every bug in the undergrowth that shaped it.

By Ajay Walia · June 2026 · v2.6 → v2.8.2 · 10 min read

New here? Start with the original field guide — "I Built a Team of IT Architects Using LLM That Live on MacBook — Meet Aether." That post laid out the thought. This one is what happened when the thought met real queries.

Every design survives contact with the page. Then you run it.

Aether v2.6 worked end-to-end on day one — route, retrieve, build, generate, score, escalate, audit. And almost every lesson since came from running that clean little machine against questions it had never seen before.

The original architecture made a single bet: one model can't be an expert at everything, so build a tree of narrow experts and let them escalate on doubt. The bet held. But the path from v2.6 to v2.8.2 reshaped almost everything around it. The agent roster grew, the router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and the confidence score stopped being a thing the model claimed and became a thing the system computed.

13: Agents · 3 tiers

v2.8.2: Current release

0.7: Escalation threshold

ZERO: Egress · API cost

The Delta

Where the thought and the build diverged

The blueprint described ten agents, a Gemma 4 26B model, knowledge-base-first retrieval, and a confidence number the model appended to its own answer. Run it for a week and three of those four assumptions bend out of shape:

Ten agents became thirteen. The flat roster of technology specialists reorganised into a four-agent network sub-branch and a consolidated digital-workplace branch.
Knowledge-base-first became web-first. The local knowledge base started empty, so retrieval now scrapes a vendor allowlist first and falls back to the KB only when the web is thin.
Self-reported confidence became computed. "Confidence: 0.92" was theatre; a formula over retrieval quality, domain fit and citation density replaced it.
An idea became a discipline. With Git disabled, a hand-written CHANGELOG.md turned into the single source of truth — every fix and reversal, dated.

The Pack

From ten to thirteen — the hierarchy regrows

The three-tier shape held: one Enterprise Architect at the top, three Domain Architects beneath, and a layer of deep technology specialists at the base. What changed was the base. The original Intune, AVD and Citrix agents were too narrow and overlapped each other, so they were folded into broader, sharper roles — and the network domain, barely a single agent before, grew a full four-specialist sub-branch.

Thirteen agents across three tiers. Each domain owns a colour; specialists inherit it. Low confidence climbs the parent chain → Domain → Enterprise.

Added to the pack

Microsoft DWP Technology Architect — a broad M365, Intune, Entra, Defender and Copilot specialist.
End-User Virtualization — Citrix, Horizon, AVD and FSLogix under one roof.
A four-agent network sub-branch — Core Networking, SD-WAN/SASE, Network Security and NetOps AIOps.

Cut from the undergrowth

The standalone Virtualization Domain Architect — folded into DWP.
The Intune-only agent — replaced by the broader Microsoft DWP role.
The AVD- and Citrix-specific agents — absorbed by End-User Virtualization.
The retired KB taxonomy — kb_intune, kb_avd and kb_citrix.

core/router.py

Routing — the hardest-won code in the repo

Routing decides which expert hears the question. It sounds trivial — match keywords, pick an agent — and it turned out to be the single biggest source of subtle, infuriating bugs. The router now runs four ordered passes:

Forced design-doc override. Phrases like "solution design document" bypass keyword scanning entirely, and any mention of "Copilot" routes straight to the Microsoft DWP specialist.
Tier-3 keyword match. The most specific specialists are scanned first, with \b word-boundary regex to stop false hits.
Tier-2 domain match. Broader strategy keywords catch domain-level queries that no specialist claimed.
Default to Enterprise. Anything unmatched falls through to the catch-all at the top.

1. Forced Override

Certain keywords (like "solution design document" or "Copilot") bypass deep scanning and hardcode a route to a specific agent.

2. Tier-3 Specialist

Scans for highly specific technologies using exact word boundaries (\bids\b). Prevents broad terms from triggering niche specialists.

3. Tier-2 Domain Architect

Catches broader domain-level queries (like "firewall" or "routing") if no specific L3 specialist claimed the question.

4. Default to Enterprise

The catch-all bucket. If a query falls all the way through without matching anything, it goes to the Enterprise Architect.
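The four passes can be sketched in a few lines. This is not the contents of core/router.py: the agent names, the keyword tables, the order of the forced rules and the target of the design-doc override are all assumptions chosen to mirror the examples above.

```python
import re

# Hypothetical rule tables; the real manifests are not shown in the post.
FORCED = [
    (r"\bcopilot\b", "microsoft_dwp"),
    (r"solution design document", "enterprise_architect"),  # assumed target
]
TIER3 = {
    "network_security": [r"\bids\b", r"\bips\b"],
    "sdwan_sase": [r"\bsd-wan\b", r"\bsase\b"],
}
TIER2 = {
    "network_architect": [r"\bfirewall\b", r"\brouting\b"],
}


def route(query):
    """Return (agent, pass_name) using four ordered passes."""
    q = query.lower()
    # Pass 1: forced overrides skip keyword scanning entirely.
    for pattern, agent in FORCED:
        if re.search(pattern, q):
            return agent, "forced"
    # Pass 2 then pass 3: most specific specialists first, then domain architects.
    for table, label in ((TIER3, "tier3"), (TIER2, "tier2")):
        for agent, patterns in table.items():
            if any(re.search(p, q) for p in patterns):
                return agent, label
    # Pass 4: nothing matched, fall through to the top of the tree.
    return "enterprise_architect", "default"


print(route("Tune our IDS rules"))
print(route("Which bids did we receive?"))
```

The second query is the reason for the word boundaries: "bids" contains "ids", and without \b it would be handed to the network security specialist.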


A curated vendor-domain allowlist. Results are restricted to Microsoft Learn, AWS, Google Cloud, Cisco, Palo Alto, NIST and others, with suffix-safe matching that blocks spoofs like cisco.com.evil.com.
Per-agent ranking. AGENT_DOMAINS ranks each specialist's preferred docs first, so the AWS agent leans on AWS documentation before anything else.
Source-agnostic results. Web hits are reshaped to look exactly like KB results (source / url / text / score), so the confidence math works identically on either.
Rich metadata. A schema of domain, vendor, document_type and version_date travels with every chunk into the prompt.
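Suffix-safe matching is the detail most worth seeing in code. A minimal sketch, assuming an allowlist of bare hostnames (the specific entries below are illustrative, not Aether's actual list): a host passes only if it equals an allowed domain or ends with a dot followed by that domain.

```python
from urllib.parse import urlsplit

# Illustrative entries only; the real allowlist is not shown in the post.
ALLOWLIST = {
    "learn.microsoft.com",
    "aws.amazon.com",
    "cloud.google.com",
    "cisco.com",
    "paloaltonetworks.com",
    "nist.gov",
}


def is_allowed(url, allowlist=ALLOWLIST):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain) for domain in allowlist)


print(is_allowed("https://www.cisco.com/c/en/us/support.html"))  # True
print(is_allowed("https://cisco.com.evil.com/docs"))             # False
print(is_allowed("https://notcisco.com/"))                       # False
```

A plain substring or prefix test would accept both spoofed hosts, because each contains the text cisco.com; anchoring the comparison to the end of the host, with the leading dot, is what rejects them.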

Trust, Computed

Confidence is math, not vibes

In v2.6 the model ended each answer with Confidence: 0.92 and the orchestrator believed it. The trouble is that a model confidently answering an AWS question with Azure facts will happily rate itself 0.92 too. Self-report is theatre. So confidence became a number the system computes, before and after generation:

pre_gen = 0.6 · retrieval_score + 0.4 · namespace_overlap
confidence = min(1.0, best_pre_gen + 0.2 · citation_density)

retrieval_score — the quality of the retrieved documents. With an empty KB it is derived from web hits: 0.85 for preferred vendor domains, 0.70 otherwise.
namespace_overlap — does the query hit the agent's keywords? A strong match scores about 0.85; off-topic collapses to 0.1, all but guaranteeing escalation.
citation_density — the share of claims backed by [1] [2] sources, rewarding grounded answers with up to a 0.2 boost.

Below the 0.7 threshold, the query climbs to the parent agent, which re-retrieves against its own namespace and re-answers. One subtle fix mattered here: the strongest _best_pre_gen score is carried up the escalation chain, so a confident child's score is never erased by a weaker parent re-running the same step. The model no longer judges itself — the architecture does.
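The formula and the carry-forward rule fit in a short sketch. The two functions follow the formula above; `run_chain`, its tuple layout and the sample scores are hypothetical and exist only to show the best pre-generation score surviving an escalation.

```python
THRESHOLD = 0.7


def pre_gen(retrieval_score, namespace_overlap):
    """Pre-generation score: retrieval quality weighted with domain fit."""
    return 0.6 * retrieval_score + 0.4 * namespace_overlap


def confidence(best_pre_gen, citation_density):
    """Final score: best pre-generation score plus a capped citation boost."""
    return min(1.0, best_pre_gen + 0.2 * citation_density)


def run_chain(steps):
    """steps: (agent, retrieval_score, namespace_overlap, citation_density), child first.

    Returns the agent that answered and its confidence. The best pre_gen seen so far
    is carried upward so a weaker parent cannot erase a stronger child score.
    """
    best = 0.0
    for agent, retrieval, overlap, density in steps:
        best = max(best, pre_gen(retrieval, overlap))
        score = confidence(best, density)
        if score >= THRESHOLD:
            break
    return agent, score


# Child: pre_gen = 0.65 with no citations, so it escalates.
# Parent: its own pre_gen is only 0.46, but the carried 0.65 plus a 0.1 boost clears 0.7.
agent, score = run_chain([
    ("specialist", 0.85, 0.35, 0.0),
    ("domain_architect", 0.70, 0.10, 0.5),
])
print(agent, round(score, 2))
```

Without the `max`, the parent in this example would score 0.46 + 0.1 = 0.56 and push the query one level higher for no good reason.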

The Undergrowth

The bugs that shaped the design

Most of the architecture above exists because something broke first. The marquee disaster was the 0% confidence saga — a cluster of unrelated failures that all produced the same symptom: every query inexplicably crashing to zero confidence and escalating to the top of the tree.

Silent search failure. duckduckgo_search was renamed to ddgs upstream and returned an empty list. A catch-all except swallowed the error, so retrieval quietly went to zero.
Empty-KB zero score. With no documents ingested, the retrieval score defaulted to 0.0 — now derived from web-result quality instead.
Overwritten scores. Each escalation re-ran step_build and erased the child's strong score, until _best_pre_gen was preserved across the chain.
Dead Tier-2 regex. Double-escaped \\b made domain routing rules unreachable, and malformed YAML broke parsing entirely.
Context overflow. The expanded universal design template pushed prompts past the local model's context limit — a 400 error until web-context sizes were tuned back down.
A frozen UI on long jobs. Design-doc generation runs for minutes and the UI silently froze, until streaming feedback and a 10-minute timeout were added.
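The dead-regex bug is easy to reproduce in isolation. The snippet below only demonstrates the regex behaviour; how the extra backslash got into the pattern is not stated in the post. One plausible route, offered as an assumption, is escaping backslashes inside a single-quoted or plain YAML scalar, where they are already literal.

```python
import re

query = "tune the ids sensors"

good = r"\bids\b"    # the regex engine sees \b: a word boundary
bad = r"\\bids\\b"   # the regex engine sees \\b: a literal backslash followed by "b"

print(bool(re.search(good, query)))  # True
print(bool(re.search(bad, query)))   # False

# The double-escaped pattern only matches text that literally contains backslashes.
print(bool(re.search(bad, r"\bids\b")))  # True
```

Nothing raises an error in the broken case, which is why the rule looked fine in the manifest and simply never fired.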

A single catch-all except turned a renamed package into an invisible, week-long confidence collapse. Fail loud, not silent.
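The difference between the two behaviours, as a minimal sketch. The `broken_backend` function is a stand-in that simulates a missing package by raising ImportError; that failure mode, the function names and the `SearchUnavailable` exception are assumptions for illustration, not Aether's retrieval code.

```python
import logging

logger = logging.getLogger("retrieval")


class SearchUnavailable(RuntimeError):
    """Raised when the search backend cannot be used at all."""


def broken_backend(query):
    # Stand-in for a search client whose package was renamed upstream.
    raise ImportError("No module named 'duckduckgo_search'")


def search_silent(query, backend):
    """The anti-pattern: any failure is indistinguishable from 'no results'."""
    try:
        return backend(query)
    except Exception:
        return []


def search_loud(query, backend):
    """Catch only what is expected, log it, and re-raise with context."""
    try:
        return backend(query)
    except ImportError as exc:
        logger.error("search backend unavailable: %s", exc)
        raise SearchUnavailable(str(exc)) from exc


print(search_silent("entra id migration", broken_backend))  # [] and no hint anything is wrong
```

An empty list is a legitimate result for a real query, so returning it on failure hides the outage inside the normal data path, where it surfaces only as a mysteriously low score.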

The Working Contract

How the build stays honest

With Git tracking disabled, discipline had to live somewhere. It lives in two places. The first is a hand-maintained CHANGELOG.md that records every change, reversal and reason. The second is a behavioural contract the assistant itself follows when modifying the code — think before acting, make the smallest possible diff, verify against success criteria, keep everything auditable and reversible, and prefer less code over more. It reads less like an engineering process and more like the Law of the Jungle: a few rules everyone keeps, because the alternative is chaos.

See it running — the screenshots

Nine captioned frames from the live system: the Gemma model resident in LM Studio, the Gradio chat generating a Microsoft 365 Copilot design document at 90% confidence with live web search, the namespace-per-domain knowledge base, and the design template behind it all.

Open the screenshot gallery →

The Map So Far

Six releases, one expedition

2.6.0

Base system

Routing, RAG, agents and escalation — the first end-to-end build.

2.7.0

Grounded confidence

Dropped LLM self-report for a math formula; added web-search fallback and Mermaid diagrams in the UI.

2.7.1

Agent reshuffle

Retired the Intune, AVD, Citrix and Virtualization agents; added Microsoft DWP, End-User Virt and the network specialists.

2.7.4

Routing repairs

Fixed dead regex rules, corrupted YAMLs, and a hardcoded-path split that loaded stale manifests.

2.8.0

Web-first retrieval

Scrape vendor docs before the KB; vendor allowlist; KB fallback merge.

2.8.2

Confidence fixes

Solved 0% confidence on design docs; preserved the best score across the escalation chain.

2.9.0

Clarifying questions

Before generating a design, agents now ask two to four targeted questions — org size, compliance, stack, timeline — and fold the answers into a far more specific result.

Where We Are · What's Next

Wired, working, and honest

v2.8.2 is a working, daily-use system. The full pipeline runs route → retrieve → build → generate → score → escalate → audit, all thirteen manifests parse, routing and parent maps are consistent, and live vendor-doc search returns current, citable content. What's still open is honest too: the KB folders are wired but largely empty and need real source documents ingested; a rule-ordering overlap can still misroute shared virtualization keywords between End-User Virt and DWP; Git is off; and the context budget stays tight on the local model for large templates.

The road ahead, in order: populate the knowledge base so RAG draws on ingested source documents rather than the web alone; resolve the routing overlaps and settle End-User-Virt-versus-DWP ownership; re-enable Git and move off the manual changelog; then build a query test-suite to calibrate the confidence threshold against measured answer quality.

Lessons from the Trail

What the journey taught

Fail loud, not silent. A catch-all except turned a renamed package into invisible 0% confidence. Surface errors — never swallow them.
Measure what you trust. Self-reported confidence is theatre. Grounding trust in retrieval and citations made escalation actually mean something.
Narrow beats broad. Specialists with tight domains hallucinate far less than one generalist trying to know everything.
Write the changelog. With Git off, the manual CHANGELOG.md became the single source of truth for every decision and fix.

Aether began as a single sentence — "one model can't know everything." It grew into a hierarchy of grounded, self-aware experts, and the changelog is the proof of the journey. Specialise · Ground · Measure · Escalate.


Cortex Swarm: Upgrading the Traditional IT Operations with Agentic AI

Ajay Walia — Sun, 24 May 2026 00:00:00 +0000


Cortex Swarm

REPLACING THE FOLLOW-THE-SUN MODEL · 2026

AJAY WALIA · DIGITAL WORKPLACE OPERATIONS · MAY 2026

Every employee depends on a Workplace Operations team they will never meet. It is the team that resets their MFA when they fly to a new country, recovers their shared mailbox when it stops syncing, pushes the Intune policy that lets them install a piece of software, and decides at 3am whether a regional O365 or Exchange outage warrants paging a human.

This piece is about three things, in order:

First — how those teams are actually structured today, how they function day-to-day, and the structural problems they carry.
Second — where agentic AI sits in 2026, and where the field is heading over the next two to three years.
Third — how a small swarm of specialised agents can replace this team tier-for-tier, what efficiencies that produces, and the new set of challenges it creates in return.

The org chart is the answer. The five tiers that make a DWP team work for humans are the same five seams that make it work for agents.

5 · Autonomous agents · L1 → SDM

~137 · FTE mirrored · across all tiers

24×7 · Single team · no shift roster

1-click · Audit verify · any ticket

Part 1 · Structure

How Digital Workplace Operations Teams Are Structured Today

A DWP team exists because every employee uses IT every day, and someone has to keep that working. For a Global 500 with 10,000–100,000+ employees, the work is too broad, too multilingual, and too time-zone-spanning for an in-house team. Almost without exception, it is outsourced to a Tier-1 IT services firm — TCS, Infosys, Wipro, Accenture, HCLTech, Cognizant — running a 24×7 follow-the-sun roster across multiple delivery centres.

The Scope — What Actually Sits Under "Workplace"

The label undersells the breadth. A typical DWP contract covers seven functional areas, each with its own runbooks, its own vendors, and its own escalation paths.

Identity

Who you are — joiner / mover / leaver, password, MFA, SSO, entitlements

Access

What you can use — catalogs, licenses, groups, approvals

Productivity

Outlook, Teams, M365, OneDrive, SharePoint

Endpoint

Laptops, peripherals, network, Intune compliance, patching

Applications

Line-of-business app support, vendor liaison

Infrastructure

Connectivity, VPN, cloud health, capacity

Change & Governance

RFCs, CAB, RCAs, SLA reporting, customer comms

The Five-Tier Hierarchy

To deliver against this scope at scale, providers build a five-tier hierarchy. Each tier exists because of what the tier below it can't or shouldn't do. Tickets enter at the bottom and move upward only when scope, authority, or evidence demands it.

 

SDM: Service Delivery Manager · ~2 FTE · owns the customer relationship
Architect: Design Authority · ~3 FTE · owns design + RCA authority
L3 Engineering: Senior Engineers · ~12 FTE · root cause + change requests
L2 Specialists: M365 · Intune · Exchange · ServiceNow · ~40 FTE · 2 shifts · specialist diagnosis
L1 Service Desk (front line): ~80 FTE · 3 follow-the-sun shifts · handles ~80% of all ticket volume

Tickets enter at L1 and escalate upward only when scope, authority, or evidence demands it.

The exact FTE counts vary with employee population and contract scope. The shape — heavy at the base, narrowing to a point — is universal.

Part 2 · Function

How They Actually Function Day-to-Day

Three forces govern day-to-day operation: time zones, ticket flow, and knowledge. Understanding all three is what makes the rest of the piece make sense.

Time Zones — the Follow-the-Sun Roster

Coverage is achieved by handing tickets between geographies as the sun moves. A ticket opened in Sydney at 4pm local rolls over to Manila, then to Mumbai or Hyderabad, then to Krakow or Sofia, then to a US east-coast hub. Three or four formal shift handoffs per day, every day, forever.

 


Timeline (00:00 UTC · 06:00 · 12:00 · 18:00), each region covering 08:00 – 18:00 local:

APAC: Manila · Sydney
EMEA / India: Mumbai · Krakow · Sofia
AMER: Atlanta · Dallas

Three handoffs / day: open tickets transfer at each geographic boundary
Context lost at each handoff: notes summarised, not replayed in full
Weekend & holiday gaps: coverage thins where no region is in business hours

Ticket Flow — Entry, Triage, Escalation, Closure

Every employee interaction is a ticket. Most enter via chat or self-service portal, a smaller share through phone or email. From entry, the path is the same: triage at L1, attempt resolution, escalate if the agent at the current tier cannot solve it within authority and budget, then close.

 


Entry channels: CHAT / PORTAL ~70% of entries · PHONE ~20% · EMAIL / OTHER ~10%

L1 Triage (categorise · prioritise) → Resolve in tier (SOP · KB · tools) → Close (notify · KB tag), ~80%
Escalate (L1 → L2 → L3), ~20%: higher tier owns it now, same flow recurses

SLA CLOCK runs continuously · pauses only on AWAITING_USER · resumes on every state change
Time to acknowledge · time to resolve · time to communicate, measured per priority (P1–P4)
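The SLA clock rule above (runs continuously, pauses only on AWAITING_USER, resumes on the next state change) can be sketched in a few lines. This is a minimal illustration, not the production clock: the class name `SlaClock`, its fields, and every state name other than AWAITING_USER are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SlaClock:
    """Accumulates SLA seconds, except while the ticket is waiting on the user."""
    state: str = "NEW"        # hypothetical initial state name
    last_change: float = 0.0  # timestamp of the most recent state change
    elapsed: float = 0.0      # SLA seconds consumed so far

    def transition(self, new_state: str, now: float) -> None:
        # Time spent in AWAITING_USER is not charged against the SLA.
        if self.state != "AWAITING_USER":
            self.elapsed += now - self.last_change
        self.state = new_state
        self.last_change = now


clock = SlaClock()
clock.transition("TRIAGE", now=60.0)          # 60 s charged
clock.transition("AWAITING_USER", now=180.0)  # 120 s charged
clock.transition("TRIAGE", now=1000.0)        # paused interval, nothing charged
clock.transition("RESOLVED", now=1060.0)      # 60 s charged
print(clock.elapsed)  # 240.0
```

Passing `now` in explicitly, rather than reading the wall clock inside the method, keeps the rule easy to test against recorded ticket histories.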

Knowledge — Runbooks, KBs, and Tribal Memory

Each tier owns a knowledge base scoped to its authority. L1 has SOPs for ~40 standard scenarios. L2 holds vendor documentation, Intune policy templates, and Exchange runbooks. L3 holds architecture decision records, past postmortems, and vendor escalation contacts. The Architect carries the long-term design library; the SDM holds SLA templates, comms playbooks, and historical breach reports.

A great deal also lives in tribal memory: the senior engineer who happens to remember that a similar incident last August was caused by a CA policy. That memory walks out the door every time someone resigns.

The operations floor — a continent away, three shifts deep, somewhere working through your ticket right now.

Part 3 · Challenges

What's Structurally Wrong With This Model

Nothing in the model is broken; it just isn't designed for the kind of demand it now carries. The pain points below are not the fault of any one team; they are consequences of how the model is built. Each tier carries some version of every one of them.

Shift Gravity

Three follow-the-sun shifts every 24 hours
Context is summarised, not replayed, at every boundary
Onshore-offshore split hides inefficiency in plain sight

Quality Variance

Varies by shift, by tenure, by individual
SLA breaches cluster on weekends and holidays
The customer never sees an even service level

Attrition Tax

20–35% annual attrition at L1, lower but real higher up
4–8 weeks of training before a new hire is productive
Tribal knowledge leaves with every resignation

Inelastic Capacity

A 2× ticket spike cannot be staffed in < 24 hours
Patch-Tuesday outages routinely take SLA hits
Surge headcount is a fiction; surge overtime is what actually happens

Audit Friction

Reconstructing what happened on a ticket takes weeks
Chat logs, ticket history, and admin-tool actions live in different systems
G500 internal-audit reviews drag on for months

Language & KB Silos

Multilingual coverage means hiring native speakers locally
Knowledge bases drift between tiers, regions, and locales
New runbooks are rarely peer-reviewed for quality

The Headcount Paradox

Stack the team by volume and headcount and the same shape appears every time: a pyramid where the tier carrying the most repetition is also the tier carrying the most people, the most attrition, and the lowest unit economics. The next two parts argue this is exactly the part the next wave of agentic AI can credibly absorb.

 
SDM · ~2 FTE · 10% touches

ARCHITECT · ~3 FTE · 5% touches

L3 · ~12 FTE · 4% tickets

L2 · ~40 FTE · 15% tickets

L1 · ~80 FTE · 3 SHIFTS · 80% tickets
most volume · most repetition · highest attrition

Headcount and decisions are concentrated exactly where they are easiest to automate.

Part 4 · State of the Art

Where Agentic AI Sits in 2026

Six years, six steps — each layer of capability built on the one beneath it.

Two years ago, "AI agent" meant a chatbot with a system prompt. In 2026 it means something specific: a model that can decompose a goal, call tools to gather evidence, maintain state across turns, and stop when the work is done. The shift is real, and it is what makes the rest of this piece possible.

A Six-Year Capability Ramp

Each year since 2020 has unlocked a layer of capability that wasn't there the year before. The cumulative effect is what now allows specialised agents to do specialist work, not just general chat.

 

Timeline: 2020 · 2021 · 2022 · 2023 · 2024 · 2025 · 2026

Capability layers, base to top:
Pre-training
Instruction tuning · long context
Reasoning & RAG · retrieval-augmented
Reliable tool use · structured output
Local inference · 30B–70B on laptop
Multi-agent orchestration

2026: every layer needed for autonomous tier-aligned agents is in place.

What "Agentic" Actually Means

Four ingredients distinguish an agent from a chatbot. Every component is now boring engineering — no novel research required.

 

Agent: persona + system prompt

Reasoning: Decompose · plan · reflect. Why a step, not just what.
Tool Use: Structured calls · typed inputs. Real-world side effects.
Memory: Conversation · KB retrieval. State across turns.
Authority: Bounded scope · approvals. When to act vs escalate.

An agent is a system prompt, a model, a memory, a tool registry, and rules about when to call which.

Three Independent Shifts That Made This Credible

Each on its own is interesting. Together, they remove the standard objections G500 buyers raise to bringing AI inside the perimeter.

Inference Economics: local, on-device inference · no data leaves the building · compliance objection gone
Eval Discipline: 100% safety-pass target · semantic grading · adversarial eval cases

A Capability Map by Tier — Ready Now vs Emerging

This is not a roadmap; it is an honest read of what's possible today. "Ready" means the prompt, tool set, KB and eval pattern are known. "Emerging" means the approach is understood but still being measured.

L1 · Ready Now

Front-Line Desk

Identity verification · password / MFA / unlock
Catalog software install + approval
Outlook / Teams diagnostics
Printer · peripheral pairing
KB retrieval + grounded response

L2 · Ready

App Specialist

App log structured analysis
Service health diagnostic
Intune compliance + push
Mailbox + M365 admin actions
Hypothesis-test workflows

L3 · Emerging

Senior Engineer

Infrastructure root-cause
AD attribute engineering
Kusto / log-analytics
Change request authoring
Emergency change application

Architect · Emerging

Design Authority

Change review against ADR library
P1 RCA authoring
Pattern-vs-one-off classification
Capacity-review triggers
Design-impact assessment

SDM · Emerging

Delivery Manager

Customer comms drafting
SLA dashboard + breach alerts
War-room convene flow
Weekly briefing generation
Status update cadence

Part 5 · Trajectory

Where Agentic AI Is Heading — Next 2–3 Years

The trajectory of the last six years points in a clear direction: from a single model answering a single question, to swarms of specialised agents collaborating on bounded problems under an orchestrator they cannot themselves modify.

 


LLM (2020–2023): Single model · Chat completion
Agent (2023–2024): + Tools + Memory · Single task loop
Multi-agent (2024–2026): Planner + executors · Verifier loops
Specialised Swarm (2026 →): Role-aligned agents · Bounded authority
Enterprise (2027 →): Cross-function · Swarm-of-swarms

Cortex Swarm sits at the fourth stage. The fifth is what this pattern unlocks once it's proven inside one function.

Three Bets About the Next 24 Months

Bet 01

Specialisation beats generalisation

One large general agent doing everything is brittle. Five small role-aligned agents — each with its own persona, tools, and KB — are more reliable, more debuggable, and more auditable.

Bet 02

The orchestrator is the operating system

Frameworks like LangChain / LangGraph / AutoGen are scaffolding. Production systems will hold their durable value in a bespoke orchestrator that owns state, audit, identity, and policy, not in any library it depends on.

Bet 03

Compliance becomes the product

The agent that wins inside a regulated enterprise is not the one with the highest benchmark — it's the one whose every action a G500 internal-audit team can replay in one click.

Part 6 · The Cortex Swarm

How a Five-Agent Swarm Replaces the Five-Tier Team

The mirror principle — the agent inherits the role the human already plays.

The idea is structurally simple. Don't reinvent the team. Mirror it. One agent per tier. Distinct persona, tools, knowledge base, and authority. The org chart is the system architecture.

If a human L2 specialist refuses to apply a config change without log evidence, the L2 agent does the same. If the Architect won't approve a change without checking the ADR library, neither does the Architect agent.

Human Team ↔ Agent Swarm — Tier for Tier

 



Human Team	Agent Swarm
SDM · 2 FTE: Customer voice · SLA · war room	SDM Agent: Same persona · same tools · always on
Architect · 3 FTE: Design authority · ADRs · RCAs	Architect Agent: Pattern-vs-one-off · ADR-gated
L3 · 12 FTE: Root cause · changes · vendor escalation	L3 Agent: Same diagnostics · proposes change request
L2 · 40 FTE: App / platform · log-driven diagnosis	L2 Agent: Reads logs · hypothesis-test · bounces back
L1 · 80 FTE · 3 shifts: Front-line desk · password · catalog · printer	L1 Agent · always on: Same persona · 25 tools · escalates on rule

Each pair shares the same persona, scope and authority. The agent inherits the role; the orchestrator inherits the rules.

One Orchestrator. Five Agents. Clean Seams.

The architecture is deliberately conservative. The model proposes; the orchestrator and adapters decide whether the proposal executes. Bespoke ~250-line orchestrator. No LangChain. No LangGraph. No AutoGen.
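The "model proposes, orchestrator decides" split can be illustrated with a small identity gate wrapped around a mutating tool. This is a sketch under stated assumptions: `Session`, `GateError`, `identity_gate`, and the `reset_mfa` tool are hypothetical names invented for the example, not the article's actual orchestrator code.

```python
from dataclasses import dataclass
from functools import wraps


class GateError(Exception):
    """Raised when the orchestrator refuses a proposed tool call."""


@dataclass(frozen=True)
class Session:
    verified_user: str  # identity established outside the model


def identity_gate(tool):
    """Wrap a mutating tool so it only runs against the session's own user."""
    @wraps(tool)
    def guarded(session: Session, target_user: str, *args, **kwargs):
        if target_user != session.verified_user:
            raise GateError(f"{tool.__name__}: cross-user mutation blocked")
        return tool(session, target_user, *args, **kwargs)
    return guarded


@identity_gate
def reset_mfa(session: Session, target_user: str) -> str:
    # Hypothetical mutating tool; a real adapter would call the identity provider.
    return f"MFA reset for {target_user}"


session = Session(verified_user="alice")
print(reset_mfa(session, "alice"))
try:
    reset_mfa(session, "bob")  # proposed by the model, refused by the gate
except GateError as err:
    print(err)
```

Because the check lives in the wrapper rather than in the prompt, it holds even when the model forgets or is persuaded to skip it.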

The Five Agents — in Detail

L1 · Phase 1

Service Desk

replaces ~80 FTE · 3 shifts · 80% tickets

"Polite. Fast. Scripted. Resolves common categories. Never speculates on root cause."

Password / MFA / unlock · Catalog install · Printer · Outlook/Teams basics
25 tools without further approval; mutating tools pass the identity gate
Escalates after 2 failed attempts; infra/outage/security always escalate
Target SLA: TTA < 5 min · TTR < 30 min

Identity · Access · Productivity · Endpoint
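The L1 escalation rule stated in this card (escalate after 2 failed attempts; infra, outage, and security always escalate) is simple enough to write as a pure function. The function name and the category strings are assumptions for illustration; the thresholds come from the card.

```python
# Categories the card says L1 must never attempt itself.
ALWAYS_ESCALATE = {"infra", "outage", "security"}
MAX_FAILED_ATTEMPTS = 2


def l1_should_escalate(category: str, failed_attempts: int) -> bool:
    """Return True when the L1 agent must hand the ticket to the next tier."""
    if category in ALWAYS_ESCALATE:
        return True
    return failed_attempts >= MAX_FAILED_ATTEMPTS


print(l1_should_escalate("password", 1))  # False
print(l1_should_escalate("password", 2))  # True
print(l1_should_escalate("outage", 0))    # True
```

Keeping the rule outside the prompt means the orchestrator can enforce it regardless of what the model proposes.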

L2 · Phase 2

App Specialist

replaces ~40 FTE · 2 shifts · 15% tickets

"Specialist. Methodical. Demands diagnostics before acting. Reads logs."

App log analysis · Intune compliance + push · Exchange diagnostics
Hypothesis-test workflows — escalates only after evidence is gathered
Rejects wrongly-escalated tickets back to L1 with structured rationale
Target SLA: TTA < 30 min · TTR < 4 hrs

M365 · Intune · Exchange · ServiceNow

L3 · Phase 2

Senior Engineer

replaces ~12 FTE · day + oncall · 4% tickets

"Root-cause focused. Drives change requests. Consults Architect for design-impacting fixes."

AD/Identity engineering · Kusto/log-analytics · Emergency change application
Authors RCA stubs — Architect finalises
Consults Architect for design-impacting changes
Target SLA: TTA < 2 hrs · TTR < 24 hrs

AD Engineering · Infrastructure · Change Mgmt · Security

Architect · Phase 3

Design Authority

replaces ~3 FTE · 5% touches

"Asks 'is this a one-off or a pattern?' Advises and gates — not a doer."

Approves/rejects change requests against the ADR library
Authors P1 RCAs · pattern-vs-one-off classification · capacity reviews
May veto L3-proposed changes
Target SLA: Design review ≤ 2 biz days · RCA ≤ 5 days

ADR Library · Change Review · RCA · Capacity

SDM · Phase 3

Delivery Manager

replaces ~2 FTE · 10% touches

"Customer-facing. SLA-driven. Calm under pressure. Owns comms cadence."

SLA dashboard · customer comms · war-room convening · weekly briefing
Cannot make technical decisions — can demand them by SLA-bound deadline
Can convene a war room without prior approval
Target SLA: Customer comms < 1 hr (P1) · weekly briefing every Friday

SLA Governance · Customer Comms · War Room · Weekly Briefing

Five Collaboration Patterns

Five patterns cover every interaction between the five agents. Escalation is just the first.

Escalation

L1 → L2 → L3 with filtered conversation history per tier scope. The higher tier sees only what's relevant to its authority.

Bounce-back

L2 or L3 → L1 with structured de-escalation rationale. Cycle detection prevents loops.

Design gate

L3 → Architect via propose_change_request; result returns via respond_to_l3.

War-room

SDM forces L3 + Architect to sync on a single ticket thread under stricter time budgets.

Internal escalate

SDM → SP leadership (humans) on systemic breach patterns — humans re-enter the loop only when patterns demand it.
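The cycle detection mentioned under Bounce-back could take many forms; the text does not say which one Cortex Swarm uses. One minimal possibility, shown purely as an assumption, is to refuse any handoff that repeats an edge already taken on the same ticket. The `Handoff` type and `would_cycle` function are hypothetical.

```python
from typing import List, Tuple

Handoff = Tuple[str, str]  # (from_tier, to_tier)


def would_cycle(history: List[Handoff], proposed: Handoff) -> bool:
    """True if the proposed handoff repeats an edge already taken on this ticket."""
    return proposed in history


history = [("L1", "L2"), ("L2", "L1"), ("L1", "L2")]
print(would_cycle(history, ("L2", "L1")))  # True: L2 already bounced this ticket once
print(would_cycle(history, ("L2", "L3")))  # False: escalating upward is still open
```

Under this sketch, a ticket that would otherwise ping-pong between two tiers is forced onto a different path, such as escalation upward or to a human.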

A Representative Ticket — End to End

Scenario: "Team shared mailbox stopped syncing." A representative ticket traversing all five agents under a formal 13-state machine.

A P1 Incident — 5 Minutes, No Human Paged

Region-wide Exchange Online failure. The swarm runs the entire war-room cycle while humans are asleep.

Part 7 · Efficiency

What Changes — and Why It Compounds

None of the gains below are individually surprising. The point is that all seven happen simultaneously, on the same architecture, against the same SLAs that already exist on the contract.

Dimension	Today (Human Team)	With Cortex Swarm
Coverage	3 follow-the-sun shifts with formal handoffs	Single team, always on. No handoffs. No context lost between geographies.
Capacity	Inelastic. 2× spike can't be staffed in < 24h	Elastic by definition. 10× spike = 10× concurrent agent instances.
Quality	Varies by shift, tenure, individual	Even. Same prompt, same KB, same eval bar everywhere.
Speed	P1 page → human awake → diagnose: 30–60 min	P1 detect → diagnose → propose → approve → apply: 5 min.
Languages	Each language needs native-speaker hiring	EN/HI/DE on the same agent. Locale bundle is a config file.
Audit	Reconstruct from chat + ticket + admin logs over weeks	One ticket ID → hash-chained replay of every tool call and state change.
Knowledge	Walks out with every resignation. 4–8 weeks to rebuild.	Persistent. KB is versioned. Prompts and tools are reviewed in PR.
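The "hash-chained replay" in the Audit row relies on a standard idea: each audit record stores a hash computed over its own content plus the previous record's hash, so editing any past record breaks every later link. The sketch below uses SHA-256 from the standard library; the record layout, field names, and function names are assumptions for illustration only.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(prev_hash: str, event: Dict) -> str:
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_event(log: List[Dict], event: Dict) -> None:
    """Append an event, linking it to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: List[Dict]) -> bool:
    """Recompute every link; any edited record makes the chain fail."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash or record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True


log: List[Dict] = []
append_event(log, {"ticket": "T-1", "type": "tool_call", "tool": "reset_mfa"})
append_event(log, {"ticket": "T-1", "type": "state_change", "to": "VERIFYING"})
print(verify(log))  # True
log[0]["event"]["tool"] = "delete_mailbox"  # tamper with history
print(verify(log))  # False
```

Verification is a single linear pass over the ticket's records, which is what makes a one-click check plausible.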

The Compounding Effect

Speed compounds with capacity (faster resolution × elastic concurrency = shorter incident windows). Audit compounds with quality (every action is replayable, so every regression has a fix that ships in a single PR rather than a memo). Language compounds with coverage (one swarm serves every region in every supported language at the same SLA).

The org chart was never the bottleneck. The bottleneck was the shift roster underneath it.

Part 8 · New Challenges

The Honest List — What Could Go Wrong

The model is not free. It trades a familiar set of operational problems for a less familiar set of socio-technical ones. Each one below is real; each one has a specific mitigation already wired into Cortex Swarm.

Challenge	What It Looks Like	Mitigation
Trust gap	End users distrust "the bot". CIOs distrust autonomy.	Phase-gated rollout. Human approval on every mutating tool until evals plateau. Audit replay UI for skeptical buyers.
Audit scrutiny	Regulators want to know "what did the model decide and why?"	Hash-chained SHA-256 audit. Every tool call, KB chunk, and state change is replayable in one click via `/audit/verify`.
Prompt injection	Adversarial input tries to make the model exfiltrate or escalate.	Six layered defences: directive precedence, tagged inputs, identity gate, Pydantic validation, rate limits, output filter.
Model drift	A new model version regresses on something nobody noticed.	52 eval cases as a CI gate. Semantic grading on resolution, tool-correctness, grounding, citation. `make evals` fails the build.
Long-tail edge cases	Rare scenarios the agent has never seen.	Escalation patterns. Out-of-scope intent triggers `escalate_out_of_scope()` to the next tier or to a human.
Change management	Humans whose roles dissolve. SP commercial models built on FTE counts.	The hardest one. Honest position: agents replace tier responsibilities, not the function. Senior staff move to swarm operations, KB curation, and eval authoring.
Cost & scaling	Inference cost grows with ticket volume.	Local inference (LM Studio + Qwen3-Coder-Next) keeps marginal cost low. Bigger models reserved for L3/Architect on rare paths.
Failure modes	"Right tool, hallucinated reasoning." Confident wrong answers.	Formal 13-state machine. Mutating tools callable only in `ACTING`. `RESOLVED` reachable only from `VERIFYING`.
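The two state-machine rules in the Failure modes row (mutating tools only in `ACTING`; `RESOLVED` only from `VERIFYING`) can be expressed as guard functions that the orchestrator runs before acting on a model proposal. Only four of the 13 states are named in this article, so the enum below is deliberately partial; the function and exception names are hypothetical.

```python
from enum import Enum


class State(Enum):
    # Only the states named in the article; the remaining ones are omitted.
    ACTING = "ACTING"
    VERIFYING = "VERIFYING"
    RESOLVED = "RESOLVED"
    AWAITING_USER = "AWAITING_USER"


class IllegalTransition(Exception):
    """Raised when a proposed step violates a state-machine rule."""


def check_transition(current: State, target: State) -> None:
    """RESOLVED may only be entered from VERIFYING."""
    if target is State.RESOLVED and current is not State.VERIFYING:
        raise IllegalTransition(f"{current.value} -> RESOLVED is not allowed")


def check_tool_call(current: State, mutating: bool) -> None:
    """Mutating tools may run only while the ticket is in ACTING."""
    if mutating and current is not State.ACTING:
        raise IllegalTransition(f"mutating tool refused in {current.value}")


check_transition(State.VERIFYING, State.RESOLVED)  # allowed, returns None
check_tool_call(State.ACTING, mutating=True)       # allowed, returns None
try:
    check_transition(State.ACTING, State.RESOLVED)
except IllegalTransition as err:
    print(err)  # ACTING -> RESOLVED is not allowed
```

The effect is that a confident but wrong model cannot close a ticket without passing through verification, whatever its reasoning says.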

Six Security Defences in Depth

System-prompt directive

SECURITY DIRECTIVE at highest precedence — cannot be overridden by user input.

Input/output normalisation

Inputs and outputs are tagged and treated as DATA, never instructions.

Identity-gate decorator

Orchestrator blocks cross-user mutations even if the model forgets to check.

Tool input validation

Pydantic + allow-lists. Schema mismatches rejected before execution.

Per-session rate limit

Max 30 tool calls / 10 min · 50 chat turns / hour — prevents runaway loops.

Output filter

Leaked-secret patterns stripped before reaching the UI layer.

Authority is bounded by composition: the model proposes, the orchestrator and adapters dispose.
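The per-session limits quoted above (30 tool calls per 10 minutes, 50 chat turns per hour) can be enforced with a sliding-window counter. Whether Cortex Swarm uses a sliding or a fixed window is not stated, so treat the algorithm choice and the class name `SlidingWindowLimiter` as assumptions.

```python
from collections import deque
from typing import Deque


class SlidingWindowLimiter:
    """Allow at most max_events within any window_seconds-long interval."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_events:
            return False
        self._events.append(now)
        return True


tool_calls = SlidingWindowLimiter(max_events=30, window_seconds=600)   # 30 / 10 min
chat_turns = SlidingWindowLimiter(max_events=50, window_seconds=3600)  # 50 / hour

results = [tool_calls.allow(float(t)) for t in range(31)]
print(results.count(True), results[-1])  # 30 False
```

A refused call is not recorded, so a runaway loop cannot extend its own lockout by retrying.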

Part 9 · Delivery

Six Phases Over ~12 Months

Phase 1 ships in 5 weeks. Each later phase is a drop-in module gated on an explicit trigger — not an arbitrary date.

Phase 01 · Now · ~50h

Foundation

L1 agent · EN chat
Stub L2/L3 · mock backends
State machine + 52 evals
Hash-chained audit

Trigger: Building now

Phase 02 · ~35h

Specialist Tiers

Full L2 + L3 agents
Real ServiceNow / AD / Intune
Real OIDC SSO
KB pruning via evals

Trigger: First pilot signed

Phase 03 · ~45h

Planning Architecture

Architect + SDM agents
Planner / Executor / Verifier
SDM dashboard + war-room

Trigger: Eval data shows plateau

Phase 04 · ~25h

Multilingual

HI + DE locale bundles
Per-locale KB ingest
KB provenance + trust tiers

Trigger: External KB integrated

Phase 05 · ~20h

Voice

Sarvam (EN/HI) · Azure (DE)
ElevenLabs alternate
Browser mic + playback

Trigger: Demand-driven

Phase 06 · ~12h

Multi-Tenant

Tenant ID propagation
Branded shells per tenant
Per-tenant SLA dashboards

Trigger: Second client signed

Closing

The Org Chart Was Always the Answer

Every senior engineer who has ever worked in IT operations recognises the five-tier shape. It's the shape that emerges every time, in every geography, in every sector, because the responsibilities map cleanly onto the kinds of decisions a team needs to make. That same shape is exactly what makes a swarm legible: each agent does what the role already does, no more, no less, and the rest of the building knows where to send its ticket.

The interesting work over the next year is not adding more agents. It is sharpening the seams between them — better identity gates, better KB provenance, faster audit replays, tighter eval cases — so that what runs in production is a system anyone in IT operations can trust without having to also be an AI specialist.

5 · Agents · L1 · L2 · L3 · Arch · SDM

~137 · FTE mirrored · across all tiers

24×7 · Single team · no shift roster

1-click · Audit verify · any ticket

· · ·

Ajay Walia · Digital Workplace Operations · May 2026
LinkedIn: /in/ajaywalia


RAG Chatbot from indexed public documentation

Ajay Walia — Tue, 19 May 2026 00:00:00 +0000

This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as the Week 15 Graded Mini Project of the IITM Pravartak Professional Certificate Programme in Agentic AI and Applications. The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.

The implementation selects the Technology domain and indexes six official pages of the GitHub REST API documentation. The assistant accepts free-form questions in a command-line interface, performs context-aware follow-up handling, returns concise answers with source attribution, and emits a fixed refusal phrase when the retrieved context does not contain the answer.

Architecture

The pipeline is split into two phases. An offline ingestion stage prepares a reusable vector index; a runtime stage answers user turns by retrieving from that index and prompting a large language model under strict grounding instructions.

Public Data Sources

Six public pages from the official GitHub Docs were saved locally as plain-text files in data/raw/:

Quickstart for GitHub REST API
Getting started with the REST API
Authenticating to the REST API
Rate limits for the REST API
Best practices for using the REST API
Troubleshooting the REST API

These pages were chosen because they form a self-contained operational knowledge set: a developer integrating with the API typically needs to consult exactly this material when handling authentication, rate-limit pressure, and recovery from common errors.

Components and Concepts

The following components and ideas underpin the implementation. Each is summarised briefly to make the design choices easy to evaluate.

Retrieval-Augmented Generation (RAG). A pattern in which a language model is grounded on retrieved passages from an external corpus rather than relying solely on parametric memory. Retrieval reduces hallucination and lets the assistant cite source material it actually consulted.

LangChain. The orchestration framework used throughout the project. It provides composable abstractions for document loaders, text splitters, embeddings, vector stores and chat models, which keeps the ingestion and runtime code small and idiomatic.

RAG Chunking. Long documents are split into smaller passages before embedding. This implementation uses RecursiveCharacterTextSplitter with a 900-character chunk size and a 150-character overlap. Splitting prefers semantic boundaries (paragraph breaks and Markdown headings) so that each chunk is internally coherent and retrieval surfaces meaningful units rather than fragments.
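To make the chunk-size and overlap parameters concrete, here is a deliberately simplified fixed-window chunker in plain Python. It is not RecursiveCharacterTextSplitter: it ignores paragraph and heading boundaries and only shows how a 900-character window advancing by 750 characters yields a 150-character overlap. The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 900, overlap: int = 150) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 2000
print([len(c) for c in chunk_text(sample)])  # [900, 900, 500]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.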

OpenAI Embeddings. Each chunk is converted into a high-dimensional vector using the text-embedding-3-small model. The vector captures the semantic content of the chunk so that questions phrased differently from the source text still retrieve the right passages.

FAISS. Facebook AI Similarity Search is the local vector store used to index the embeddings. The index is persisted to disk under vectorstore/github_rest_api_faiss so that ingestion runs once and the runtime simply loads the existing index, satisfying the rubric’s reusable-index criterion.

Top-k Retrieval. At each user turn the retriever returns the k=4 chunks most similar to the query. A small k keeps the prompt focused and reduces the chance that off-topic passages dilute the grounded answer.
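The retrieval step can be pictured as a brute-force ranking over the stored vectors. The sketch below uses cosine similarity in plain Python as a stand-in; the project's FAISS index may use a different distance metric and is far more efficient. The toy 2-D vectors and the function names are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          chunks: List[Tuple[str, Sequence[float]]],
          k: int = 4) -> List[str]:
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(chunks, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 2-D "embeddings"; real ones have hundreds of dimensions or more.
index = [
    ("rate limits", [1.0, 0.0]),
    ("authentication", [0.0, 1.0]),
    ("best practices", [1.0, 1.0]),
    ("unrelated", [-1.0, 0.0]),
    ("rate limit headers", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], index, k=2))  # ['rate limits', 'rate limit headers']
```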

Follow-up Rewriter. A short, dedicated LLM call rewrites the user’s latest question into a standalone retrieval query using recent conversation history. This is the mechanism that allows ambiguous follow-ups such as “what about rate-limit headers?” to be embedded and retrieved correctly without the model having to guess the missing antecedent.

Grounded Answerer. The final answer is produced by gpt-4o-mini at temperature zero under a system prompt that instructs the model to answer only from the retrieved context, append a Sources: line naming the document titles consulted, and otherwise return the exact refusal phrase.

Refusal Behaviour. When retrieval returns no usable context, the runtime short-circuits and emits the required string verbatim: I don't have enough information in the provided documents. The refusal is also enforced by the system prompt, providing defence in depth against hallucinated answers.
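The procedural half of that refusal can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical generate callable that wraps the grounded answerer; the function name answer is not taken from the project.

```python
from typing import Callable

REFUSAL = "I don't have enough information in the provided documents."


def answer(question: str, chunks: list[str], generate: Callable[[str, list[str]], str]) -> str:
    """Return the refusal string when there is no usable context, otherwise call the model."""
    usable = [chunk for chunk in chunks if chunk.strip()]
    if not usable:
        return REFUSAL  # short-circuit: the model is never called without context
    return generate(question, usable)


# Hypothetical stand-in for the LLM call, used only to demonstrate both paths.
def fake_generate(question: str, context: list[str]) -> str:
    return f"Answer based on {len(context)} chunk(s)."


print(answer("Best laptop?", [], fake_generate))
print(answer("Which Accept header?", ["Use application/vnd.github+json."], fake_generate))
```

Because the empty-context case never reaches the model, the refusal wording cannot drift, which is what makes the behaviour predictable.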

Implementation Layout

The submission is organised as two thin Python modules and a Jupyter notebook. ingest.py loads the text files, removes GitHub Docs templating syntax, extracts a human-readable title from each document, chunks the text, embeds the chunks, and writes the FAISS index. chatbot.py loads the persisted index, manages conversation history, invokes the rewriter and the grounded answerer in sequence, and prints the response with source attribution. The accompanying notebook imports the same functions so that the notebook and the CLI execute identical code paths, eliminating drift between the two surfaces.

Configuration is environment-driven. The OpenAI API key is read from a .env file and an optional OPENAI_BASE_URL is honoured so that the same code runs against the Vocareum-provided proxy without modification. No secrets are committed to source.

Sample Conversation

A short transcript illustrates the three behaviours the rubric tests — grounded answer, follow-up resolution, and refusal.

You: What headers should I send when making requests to the GitHub REST API?
Assistant: GitHub REST API requests commonly use an Accept header such as application/vnd.github+json. Authenticated requests send an Authorization header with Bearer YOUR-TOKEN. GitHub also recommends a valid User-Agent value, such as your GitHub username or application name.
Sources: Getting started with the REST API; Authenticating to the REST API

You: What about rate limit headers?
Search query: rate limit response headers GitHub REST API
Assistant: Response headers can include x-ratelimit-remaining and x-ratelimit-reset, which indicate how many requests remain in the current period and when the rate limit resets.
Sources: Getting started with the REST API; Rate limits for the REST API

You: How should an integration handle rate limit errors?
Assistant: If a retry-after header is present, wait that many seconds before retrying. If x-ratelimit-remaining is 0, wait until the time given by x-ratelimit-reset. Otherwise, wait at least one minute and then use exponentially increasing waits for repeated secondary rate-limit failures. Continuing to make requests while rate-limited may result in a ban.
Sources: Best practices for using the REST API; Troubleshooting the REST API

The second exchange demonstrates the follow-up rewriter at work: the original “what about rate limit headers?” is expanded into a self-contained search query before retrieval, which is what allows the retriever to surface the correct passages despite the missing antecedent.

Now, a deliberately off-topic question — one the index cannot possibly support — triggers the safety path:

⛔ Refusal path — out-of-scope question
You: What is the best laptop to buy for running the GitHub REST API?
Assistant: I don't have enough information in the provided documents.

No chunk in the index supports an opinion on hardware, so the assistant returns the mandated refusal verbatim rather than producing a plausible-sounding but ungrounded answer. The refusal is enforced twice — once procedurally when retrieval returns empty, and once in the system prompt — so the behaviour is stable even when retrieval surfaces weakly related chunks.

Closing Observations

Three observations stand out from this exercise.

Retrieval quality is set upstream, not by the vector store. The quality of retrieval is determined less by the choice of vector store than by the upstream decisions about cleaning and chunking. Stripping GitHub Docs templating syntax — the {% data ... %} Liquid blocks, the reusables, and the [AUTOTITLE] link macros — materially improved the relevance of returned chunks because the embeddings stopped clustering around boilerplate.
Follow-up handling must be engineered, not assumed. It is not a free behaviour of the language model. A separate rewriter step that converts conversational queries into standalone search queries is the smallest reliable mechanism, and it removes a category of retrieval failures that would otherwise be invisible in casual testing.
Refusal is a feature, not a fallback. Enforcing the refusal both procedurally (short-circuit on empty context) and in the system prompt is what gives the assistant a useful posture in production: it will say nothing it cannot support, and it will say so in a predictable way.
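The cleaning step behind the first observation can be approximated with two regular expressions. This is a sketch under assumptions, not the project's ingest code: the patterns cover only {% ... %} tags and [AUTOTITLE](...) links, and the function name strip_templating is hypothetical.

```python
import re

LIQUID_TAG = re.compile(r"\{%.*?%\}", re.DOTALL)         # {% data ... %}, {% ifversion ... %}, etc.
AUTOTITLE_LINK = re.compile(r"\[AUTOTITLE\]\([^)]*\)")   # [AUTOTITLE](/path/to/article)


def strip_templating(text: str) -> str:
    """Remove Liquid tags and AUTOTITLE link macros, then collapse leftover runs of spaces."""
    text = LIQUID_TAG.sub("", text)
    text = AUTOTITLE_LINK.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


raw = "{% data reusables.rest.intro %} Send a token with {% data variables.product.name %} requests."
print(strip_templating(raw))
```

Removing the macros before embedding keeps near-identical boilerplate from dominating the vectors of otherwise unrelated chunks.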

]]>

Reachy Mini Is a $299 Open-Source Robot With a Hugging Face App Store — And 10,000 People Already Have One

Ajay Walia — Sun, 17 May 2026 00:00:00 +0000

Reachy Mini is a small, open-source desktop robot made by Pollen Robotics and sold through Hugging Face. It has an expressive head, motors you can program in Python, a microphone, and a camera. You buy it as a kit, spend an afternoon assembling it, and then it is yours — hardware, software, and all.

Last week, Hugging Face shipped an agentic app store for it. Describe what you want in plain English, an AI agent writes the code and ships it to the robot, and you iterate from there. Over 200 apps are already live, built by 150+ creators, most of whom had never written a line of robotics code.

What Is Reachy Mini, Exactly

Reachy Mini is a tabletop robot — roughly the size of a desktop lamp — with an articulated head that can nod, tilt, and look around. It comes in two versions:

Reachy Mini (Wireless): Runs onboard on a Raspberry Pi 4. Battery-powered, WiFi-connected, fully autonomous. No cable to your computer required. This is the full experience.

Reachy Mini Lite: Powered via wall outlet and connected to your computer via USB. Designed for developers who want to prototype fast without worrying about battery or wireless configuration.

Both versions ship as kits. Assembly takes two to three hours following the step-by-step guide, and the hardware design files are open-source under Creative Commons BY-SA-NC — meaning you can inspect, mod, and even print replacement parts.

The software stack is Apache 2.0 open-source: Python SDK, a REST API, a JavaScript SDK for web apps, and native LLM integrations baked in.

The App Store Built on Hugging Face Spaces

Every Reachy Mini app lives on the Hugging Face Hub as an open-source repo. Searchable, forkable, and one-click installable directly from the robot’s dashboard. See an app you like? Fork the repo, ask an AI agent to modify it, publish your version. The original creator gets credit, you get a working variant in minutes.

Every app also runs in a browser-based MuJoCo simulator, so you can play with the full catalog without owning the hardware at all.

As of this week, the numbers are:

200+ apps published
150+ unique creators — most first-time robotics builders
~10,000 units shipped worldwide, with 1,000+ more going out in the next 30 days

What People Are Actually Building

This is the part worth paying attention to. A sample of what is already live in the app store:

Cook Assistant — walks you through a recipe step by step, hands-free. The robot reads the steps aloud, waits for confirmation, and moves to the next.

Language Tutor — listens to your spoken language practice, corrects accent and grammar in real-time.

Emotional Damage Chess — plays chess and reacts expressively to every move. It drops its head on a blunder (“Oh no! Big mistake!”) and cheers on a winning combination.

Reachy Phone Home — watches your desk with the camera and calls you back to work when you pick up your phone.

Red Light, Green Light — the Squid Game version, with Reachy Mini playing the doll. It turns, watches, and catches you moving.

F1 Race Commentator — calls Formula 1 races live from your desk as they happen.

Coding Teacher — teaches kids to program in a simplified scripting language, with the robot as the interactive tutor.

Plus radio, home assistants, video games, dance apps, blind tests, and more being added daily.

Joel Cohen, Age 78, Built a CEO Facilitation Robot

The story that illustrates what this platform actually unlocks:

Joel Cohen runs CEO peer groups in the Raleigh-Durham area. He has never worked in robotics. He has never written code. It took him a few days to assemble his Reachy Mini Lite — he misplaced some screws — and then he built an app.

His app is a voice-controlled AI co-facilitator for the CEO peer groups he runs on Zoom. Reachy Mini sits on his desk. When he says “Hey Reachy,” it wakes up and listens. It has a personality (his VP of Future Thinking), four facilitation modes, a bank of 60+ questions, and greets each of his 29 members by name. Mid-session, it can hot-seat a member, push back on a surface-level answer, generate a fresh question, or summarise the key themes before closing.

His description of the build process: “I built this by describing what I needed in plain English. Claude wrote the code. No SDK. No robotics background. No developer experience.”

A 78-year-old executive in North Carolina built a robotics product — in under a week — that did not exist last month.

The Agentic Toolkit: How Building Works

The new toolkit lets you describe the robot behaviour you want in plain English and an AI agent handles the rest — writing the code, running it against the simulator, shipping it to the robot, and iterating with you until it works.

The prompt they recommend to get started:

Help me build a Reachy Mini app that waves and says hello when
someone walks into the room.
Use the open-source code at https://github.com/pollen-robotics/reachy_mini
and the docs at https://huggingface.co/docs/reachy_mini/index

For those who want to go deeper, the Python SDK is clean and minimal:

from reachy_mini import ReachyMini
from reachy_mini.utils import create_head_pose

with ReachyMini() as mini:
    mini.goto_target(head=create_head_pose(z=10, roll=15, degrees=True, mm=True), duration=1.0)

Three lines to move the head. The full SDK covers motion, vision, speech input, audio output, and LLM integration.

Cost and Where to Get It

Reachy Mini starts at $299 for the Lite version. Pricing for the full Wireless model is listed at hf.co/reachy-mini, where you can also browse the app store and access the simulator without buying hardware.

The software, docs, and all 200+ apps are free and open-source. The hardware design files are open-source too. Pollen Robotics took a deliberate decision not to build a closed app store with a revenue cut — everything is forkable, auditable, and improvable by anyone.

Why This Matters

The closest parallel Hugging Face draws is the iPhone App Store in 2008 — turning a device made by one company into a platform anyone could build for. The difference here is that the hardware is open-source, the software is open-source, the apps are open-source, and the AI agent that writes your code runs on a public hub. The whole stack is forkable.

For enterprise and digital workplace practitioners, the more immediate signal is this: the barrier to building custom robotics behaviour just collapsed. The expertise is supplemented by an agent. The hardware is affordable. The integration is a public repo on a platform 40 million developers already use.

What used to take a robotics team six months now takes an afternoon — and a 78-year-old with no coding background just proved it.

Browse the app store: huggingface.co/reachy-mini#/apps · Docs: huggingface.co/docs/reachy_mini

]]>

I Built a Team of IT Architects using LLM That Live on MacBook — Meet Aether

Ajay Walia — Sat, 16 May 2026 00:00:00 +0000

Every architect has felt this at some point. You are mid-design on a complex Azure landing zone, you need a sanity check on your FSLogix profile container sizing, and the fastest path to an answer is to ping a colleague who knows AVD cold — except it is 10pm, or they are in another timezone, or that colleague simply does not exist in your organisation.

I built Aether to fix that. It is a local-first, multi-agent AI system that runs a team of 10 specialist IT architecture advisors on a single MacBook Pro M5. No internet after setup. No API costs. No data leaves the machine. Just fast, cited, domain-grounded answers — available at 10pm when the deadline hits.

This is the story of how I built it, what the stack looks like under the hood, and what I learned along the way.

The Problem I Was Actually Solving

I have spent years in IT architecture — cloud, digital workplace, network, end-user computing, the works. Over that time I have accumulated a large personal knowledge base: AWS Well-Architected reviews, Intune compliance policy templates, AVD host pool sizing guides, Citrix NetScaler configurations, TOGAF artefacts, cloud adoption frameworks. The knowledge exists. The problem is retrieval — getting the right answer from the right domain quickly, without context-switching across six different documentation tabs.

Commercial AI tools are good at general answers. They are not great at answering “give me the exact OMA-URI path for configuring Windows Hello for Business through Intune on a hybrid-joined device for a tenant with MFA enforced at the Conditional Access layer.” That requires domain depth, and it requires knowing which documents to pull from.

I also wanted to explore a practical AI use case — not a demo, not a proof of concept, but something I would actually use daily. Aether became that experiment.

What Aether Is

AETHER v2.6 // SYSTEM DEFINITION

Type:      Local-First Multi-Agent AI System
Agents:    10 (3-tier hierarchy)
Model:     Gemma 4 26B A4B (Q4_K_M) — single instance
Runtime:   LM Studio → LangGraph → FastAPI → Gradio
Memory:    ~43 GB of 64 GB unified (M5 Pro)
Egress:    ZERO
API cost:  ZERO

Aether is a local-first AI system that acts as a team of specialist architecture advisors — all running on your MacBook Pro M5. The headline technical trick: every one of those 10 advisors is the same Gemma 4 26B model, loaded once. What makes each advisor different is purely the system prompt it receives and the knowledge base namespace it retrieves from.

One model. Ten personas. Zero cloud.

The Three-Tier Agent Hierarchy

The agents are organised the way a real consulting firm would structure a team — from narrow specialist up to cross-domain strategist.

Tier 1 — Enterprise Architect (1 agent). Cross-domain strategy, TOGAF, Zachman, governance frameworks (GDPR, ISO 27001, HIPAA), technology investment decisions. This agent can read all knowledge base namespaces — it is the only one with that reach. The final escalation destination.

Tier 2 — Domain Architects (3 agents). Cloud Domain (multi-cloud strategy, FinOps, landing zones), Network Domain (SD-WAN, ZTNA, BGP, micro-segmentation), Digital Workplace Domain (Microsoft 365, VDI strategy, device management).

Tier 3 — Technology Architects (6 agents). AWS, Azure, GCP, Intune, AVD, Citrix. Each one is scoped tightly to its domain — deep, narrow, and fast.
The Stack, Component by Component

| Component | What It Does |
| --- | --- |
| Gemma 4 26B A4B (Q4_K_M) | The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. ~13 GB VRAM. |
| LM Studio | Local model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory. |
| LangGraph | Orchestration graph — defines the multi-step query pipeline as a typed state machine. |
| LanceDB | Local vector database. One namespace (table) per agent. Fully file-based, no server process needed. |
| BAAI/bge-small-en-v1.5 | Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute. |
| Redis | Session memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL). |
| FastAPI + Uvicorn | REST API gateway on port 8000. Full query/response model, session management, agent listing. |
| Gradio | Web chat UI on port 7860. Session management, source citations, escalation chain display. |
| Prometheus | Metrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns. |
| SQLite (audit.db) | Immutable audit trail. Every query logged before response is returned. Cannot be skipped. |

The full system sits comfortably in about 43 GB of the M5 Pro’s 64 GB unified memory — leaving 21 GB of headroom for the rest of the machine.

The “One Model, Ten Specialists” Trick

This is the part I get asked about most, because it sounds like it should not work.

Every agent in Aether is defined by a YAML manifest. That manifest specifies a system_prompt, a namespace (which LanceDB table to retrieve from), a temperature, and max_tokens. There is no model switching. There is no weight loading. The Gemma 4 26B A4B is loaded once by LM Studio and stays resident.
# agent_manifests/aws_technology_architect.yaml
agent_id: aws_technology_architect
display_name: "AWS Technology Architect"
tier: 3
parent_agent: cloud_domain_architect
namespace: aws_tech
temperature: 0.1
max_tokens: 2048
system_prompt: |
  You are an AWS Technology Architect with deep expertise in the AWS
  Well-Architected Framework, EC2/EKS/Lambda sizing, IAM policy design,
  CloudFormation and CDK, GuardDuty, and cost optimisation.
  You draw only from AWS-specific documentation and architecture patterns.
  When answering, cite the source documents retrieved.
  End your response with a new line: 'Confidence: X.XX' (0.00–1.00)
  reflecting how well your knowledge base supports this answer.
Why does this work? Two things combine. The persona itself comes from the system prompt and the retrieved context, which is why a single loaded instance can adopt both a narrow Citrix specialist persona and a broad enterprise strategy persona. The speed comes from the model: Gemma 4 26B A4B is a Mixture of Experts model that routes each token through specialist sub-networks internally, activating only about 4–6 billion parameters per inference pass despite having 26 billion total. The practical result is that it runs at near 5B speed while retaining the reasoning breadth of a much larger dense model.
The other half of the trick is the knowledge base. Each Tier 3 agent retrieves only from its own LanceDB namespace. The AWS agent never sees a Citrix document. The Intune agent never sees a GCP architecture guide. Domain knowledge is isolated by design, which sharply reduces cross-domain hallucination: the retrieval context never crosses those boundaries, although the model's own parametric knowledge still can.
Confidence-Driven Escalation

The escalation mechanism is the design decision I am most proud of.

Every agent is instructed to append a confidence score to its response — a float between 0.00 and 1.00 representing how well its retrieved knowledge supports the answer. The orchestrator reads that score via regex. If it falls below 0.7 and the agent has a parent tier defined and that parent has not already been tried, the system automatically escalates.
// ESCALATION CHAIN EXAMPLE

Query: "Give me a multi-cloud strategy covering Azure, AWS, GCP, AVD, and Citrix with network segmentation"

→ Routed to: aws_technology_architect
→ Confidence: 0.41 (below 0.7 threshold)
→ Escalate to: cloud_domain_architect
→ Confidence: 0.63 (still below threshold)
→ Escalate to: enterprise_architect
→ Confidence: 0.88 ✓

Response appended with escalation chain for full transparency.
The response shown to the user includes the full escalation path — which level of expertise produced the final answer. This matters in enterprise contexts. It is the difference between “the AI said so” and “the enterprise-level advisor produced this after the technical specialist’s knowledge was insufficient.”
The elegance here is that the model participates in its own routing decision. The orchestration does not need a separate classifier to judge answer quality — the model tells you its own confidence, and the system acts on it.
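The escalation rule described above can be sketched as a small loop. This is an illustration under assumptions, not the LangGraph node itself: the regex, the PARENTS table (only the AWS parent is stated in the manifest; the rest follows the example chain), the choice to treat a missing score as 0.0, and the ask callable are all hypothetical.

```python
import re
from typing import Callable

CONFIDENCE_RE = re.compile(r"Confidence:\s*([01](?:\.\d+)?)")
THRESHOLD = 0.7

# Parent links follow the example chain; a real system would read them from the manifests.
PARENTS = {
    "aws_technology_architect": "cloud_domain_architect",
    "cloud_domain_architect": "enterprise_architect",
    "enterprise_architect": None,
}


def parse_confidence(response_text: str) -> float:
    """Return the last 'Confidence: X.XX' value in the text, or 0.0 if none is present."""
    matches = CONFIDENCE_RE.findall(response_text)
    return float(matches[-1]) if matches else 0.0


def run_with_escalation(agent_id: str, ask: Callable[[str], str]) -> tuple[str, float, list[str]]:
    """Ask an agent, escalating to its parent while confidence stays below the threshold."""
    chain: list[str] = []
    while True:
        chain.append(agent_id)
        response = ask(agent_id)
        confidence = parse_confidence(response)
        parent = PARENTS.get(agent_id)
        if confidence >= THRESHOLD or parent is None or parent in chain:
            return response, confidence, chain
        agent_id = parent


# Canned responses reproducing the example chain above.
canned = {
    "aws_technology_architect": "Partial answer.\nConfidence: 0.41",
    "cloud_domain_architect": "Better answer.\nConfidence: 0.63",
    "enterprise_architect": "Full strategy.\nConfidence: 0.88",
}
_, score, path = run_with_escalation("aws_technology_architect", canned.__getitem__)
print(score, path)
```

Treating a missing score as 0.0 forces escalation when the model forgets the instruction, which is the conservative choice for a sketch like this.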
The 7-Step Query Pipeline

Every query passes through a typed LangGraph state graph. The state object — AetherState — carries everything between nodes: query, session ID, agent manifest, RAG results, conversation history, messages, response, confidence score, escalation flags, and escalation chain list.

01 ROUTE → 02 RETRIEVE → 03 HISTORY → 04 BUILD → 05 LLM → 06 ESCALATE? → 07 FINALISE
01 — ROUTE. A keyword scanner maps the query to the best-fit agent, checking Tier 3 rules first (most specific), then Tier 2, then Tier 1 as catch-all. Redis caches route results for one hour, so repeated queries on the same topic skip the scan entirely.
02 — RETRIEVE. Semantic search against that agent’s LanceDB namespace — top-5 chunks returned. Documents were ingested at 500-word chunks with 50-word overlap, embedded into 384-dimensional vectors by BAAI/bge-small-en-v1.5 running on the Apple Neural Engine.
03 — HISTORY. The last three message pairs (six messages) are loaded from Redis for the session. This gives conversational continuity without letting the context window balloon.
04 — BUILD. The message payload is assembled: system_prompt + RAG documents + history + current query. The confidence instruction is appended here.
05 — LLM. The assembled payload hits LM Studio on port 1234. The orchestrator extracts the confidence score from the response text via regex before passing the response forward.
06 — ESCALATE (conditional). If confidence is below 0.7, a parent agent exists, and it has not already been tried this turn — swap in the parent’s manifest and loop back to step 02 with fresh retrieval against the parent’s namespace.
07 — FINALISE. The turn is saved to Redis. The SQLite audit record is written. The escalation chain annotation (if any) is appended to the response. Result returned to the user.
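The ROUTE step at the top of this pipeline can be sketched as a tiered keyword scan. The rule table below is invented for illustration (only aws_technology_architect, cloud_domain_architect and enterprise_architect appear in this post; the other agent ids and all keyword lists are hypothetical), and the Redis route cache is left out.

```python
# (tier, agent_id, keywords): Tier 3 rules are checked before Tier 2.
ROUTING_RULES = [
    (3, "aws_technology_architect", ("aws", "ec2", "lambda")),
    (3, "avd_technology_architect", ("avd", "fslogix")),
    (2, "cloud_domain_architect", ("multi-cloud", "finops", "landing zone")),
    (2, "network_domain_architect", ("sd-wan", "ztna", "bgp")),
]
FALLBACK_AGENT = "enterprise_architect"  # Tier 1 acts as the catch-all


def route(query: str) -> str:
    """Return the most specific agent whose keywords appear in the query."""
    text = query.lower()
    for tier in (3, 2):
        for rule_tier, agent_id, keywords in ROUTING_RULES:
            if rule_tier == tier and any(keyword in text for keyword in keywords):
                return agent_id
    return FALLBACK_AGENT


print(route("Right-size EC2 instances for a memory-intensive Java application"))
print(route("Design a landing zone with FinOps guardrails"))
print(route("Cloud adoption roadmap for a financial services firm"))
```

Plain substring matching is the simplest possible scanner; it would also match "aws" inside an unrelated word, so a production router would likely match on word boundaries instead.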
Knowledge Isolation — The Anti-Hallucination Architecture
One of the most practical decisions in Aether’s design is namespace isolation. Each Tier 3 agent retrieves only from its own LanceDB table. The AWS agent’s retrieval context will never include a Citrix StoreFront configuration guide — because those documents simply do not exist in its namespace.
This matters more than it might seem. A common failure mode in RAG systems is cross-domain contamination — where retrieval pulls in tangentially related content from a different domain, and the model confabulates a plausible-sounding but wrong answer by blending the two. Namespace isolation eliminates this at the architectural level.
The Enterprise Architect at Tier 1 is the deliberate exception — it can query all namespaces, because cross-domain synthesis is exactly what it is built for.
The Audit Trail — Because Enterprise

Every single query is written to audit.db (SQLite) before the response is returned. The record includes: timestamp, session ID, query text, agent used, confidence score, escalation chain, and sources cited. The audit write is wrapped in error handling so that a database failure never blocks the main query flow — but the log is never optional.
# Simplified from the finalise node
audit_record = {
    "timestamp": datetime.utcnow().isoformat(),
    "session_id": state["session_id"],
    "query": state["query"],
    "agent_id": state["agent_id"],
    "confidence": state["confidence"],
    "escalation_chain": json.dumps(state["escalation_chain"]),
    "response_length": len(state["final_response"]),
}
db.execute("INSERT INTO audit_log VALUES (:timestamp, :session_id, ...)", audit_record)
For a system giving architecture recommendations — decisions that feed into multi-million dollar cloud commitments — having an immutable audit trail of what was asked, what agent answered, at what confidence level, via what escalation path, is not a nice-to-have. It is the thing that makes it organisationally defensible.
What It Can Actually Do

$ aether query "Design an Azure landing zone for PCI-DSS compliance"
$ aether query "Right-size EC2 instances for a memory-intensive Java application"
$ aether query "Configure FSLogix profile containers for 500 AVD users"
$ aether query "Zero-trust network architecture for 5,000 remote employees"
$ aether query "Cloud adoption roadmap for a financial services firm"
$ aether query "Intune compliance policy for iOS BYOD — hybrid-joined, MFA enforced"
The depth per domain is real. The Intune agent knows OMA-URI paths, Graph API commands, Autopilot profiles, and licensing requirements — because those are the documents I ingested into its namespace. The AVD agent knows host pool design, FSLogix sizing, MSIX app attach, and session host scaling plans. The knowledge base is only as good as what you put into it — but that is also the point. This is my architecture knowledge, curated, searchable, and queryable at any hour.
What I Learned Building This

Prompt engineering IS the architecture. In a system like this, the YAML manifest is the agent. The difference between a brilliant AWS specialist and a generic AI assistant is entirely in what the system prompt says and what documents back it up. Getting those prompts precise, domain-bounded, and calibrated for the right temperature took longer than any of the code.
MoE models are underrated for local multi-agent work. The choice of Gemma 4 26B A4B over a dense model was the right call. You get reasoning breadth comparable to a much larger model at the inference cost of a small one, on hardware that most architects already carry.
Confidence as a first-class citizen. Asking the model to self-assess and surface that score is one of the highest-value things I added. It makes the system honest — and it drives the escalation logic that makes the team metaphor actually work.
Namespace isolation is a practical hallucination brake. Not a theoretical one. In the first version of Aether, all documents lived in a single namespace. Cross-domain contamination was visible and annoying. Splitting into per-agent namespaces fixed it immediately.
Audit trails are not overhead — they are the point. Every enterprise AI deployment should have one. Building it into the core pipeline from day one changes how you think about what the system is producing.
What Is Next

Aether v2.6 is a working, daily-use system. The next version I am working toward adds a web-based ingestion UI (so loading new documents does not require touching the ingest script), structured output for architecture decision records (ADRs) in a consistent format, and inter-agent communication — where a Tier 3 agent can proactively pull context from a peer rather than waiting for the escalation chain to activate.
I built Aether because I wanted a team. It turns out a team was always available — it just needed the right prompts and a local model to bring it to life.
Have questions about the stack or want to see specific parts of the implementation? Drop a comment or reach out on LinkedIn.

]]>

I Built My Own Video Downloader — No Ads, No Watermarks, Three Platforms

Ajay Walia — Mon, 11 May 2026 00:00:00 +0000

Every video download site I tried felt like navigating a minefield — five ad clicks to reach a button that triggers another redirect. So I stopped using them and built a clean local tool that handles TikTok, Instagram, and X in one paste.

No watermarks. No ads. No accounts. Just a URL.

~6 min read · Node.js + React · conceptual deep-dive

The Problem With Every Downloader Site

You find a TikTok video you want to keep. Maybe it's a tutorial, a recipe, a clip you want to share somewhere offline. You Google "download TikTok without watermark" and click the first result. What follows is a ritual:

Pop-up #1 appears. You close it. A second tab opens. You close that. You find the actual download button — but it's fake and triggers another ad. The real button is somewhere underneath a consent banner. You finally click download. It starts… and gives you a watermarked file anyway.

This happens on nearly every popular downloader site. They're monetised almost entirely through advertising, and the UX is designed to maximise your exposure to that advertising — not to help you download a video. The actual download logic underneath all that noise is usually a single API call.

So I asked myself: how hard would it actually be to build a clean version of this for personal use?

What I Actually Wanted

Three things, nothing more:

No watermark, always

The tool should try its hardest to get a watermark-free file. If it can't, it tells you — it doesn't quietly give you the watermarked version pretending it's clean.

Support the platforms I actually use

TikTok, Instagram, and X (Twitter). Paste any URL from any of these three and it should just work.

Zero friction interface

One input, one button. No accounts, no CAPTCHAs, no ads. It runs locally so there's nothing to sign up for.

How It Works — The Conceptual Picture

The tool is split into two halves: a lightweight React frontend and a Node.js backend. The frontend is just the URL input box. All the interesting logic lives in the backend.

Here are the five things that happen the moment you paste a URL and hit Download:

Step 1 — Figure Out Which Platform You're On

The very first thing the backend does is inspect the URL and decide: is this TikTok, Instagram, or X? Each platform has its own URL patterns — including short links, mobile URLs, and regional variants — and the tool checks all of them.

tiktok.com · vm.tiktok.com · vt.tiktok.com

instagram.com · www.instagram.com

x.com · twitter.com · mobile.twitter.com

If the URL doesn't match any of these, the backend rejects it immediately with a clear error before wasting any time trying to fetch something it can't handle. No silent failures.
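The detection step can be sketched as a hostname lookup. The tool itself is a Node.js backend; this sketch is in Python purely for illustration, it matches only the hosts listed above (a real tool would likely cover more variants), and the names PLATFORM_HOSTS and detect_platform are hypothetical.

```python
from urllib.parse import urlparse

PLATFORM_HOSTS = {
    "tiktok": {"tiktok.com", "vm.tiktok.com", "vt.tiktok.com"},
    "instagram": {"instagram.com", "www.instagram.com"},
    "x": {"x.com", "twitter.com", "mobile.twitter.com"},
}


def detect_platform(url: str) -> str:
    """Return the platform name for a supported URL, or raise ValueError immediately."""
    host = (urlparse(url).hostname or "").lower()
    for platform, hosts in PLATFORM_HOSTS.items():
        if host in hosts:
            return platform
    raise ValueError(f"Unsupported URL: {url}")


print(detect_platform("https://vm.tiktok.com/AbcXyz"))
print(detect_platform("https://mobile.twitter.com/user/status/1"))
```

Matching on the parsed hostname rather than on the raw string avoids accepting a URL that merely contains a supported domain somewhere in its path or query.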

The backend runs through a short chain: detect → resolve → try providers → cache → serve

Step 2 — Resolve Short Links

TikTok in particular loves to generate short share links like vm.tiktok.com/AbcXyz. These redirect to the full video URL, but the download providers need the real URL to work with. So the backend follows up to five redirects to resolve the final destination before doing anything else.
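A bounded redirect loop might look like the following Python sketch (the real backend is Node.js). To keep it network-free, the HTTP call is abstracted behind a hypothetical get_redirect callable that returns the redirect target for a URL, or None when the URL does not redirect; returning the last URL seen once the limit is hit is also an assumption.

```python
from typing import Callable, Optional


def resolve_url(
    url: str,
    get_redirect: Callable[[str], Optional[str]],
    max_redirects: int = 5,
) -> str:
    """Follow redirects until the URL stops redirecting or the hop limit is reached."""
    for _ in range(max_redirects):
        target = get_redirect(url)
        if target is None:
            return url  # no further redirect: this is the final destination
        url = target
    return url  # hop limit reached; return the last URL seen


# Fake redirect table standing in for real HTTP responses.
redirects = {"https://vm.tiktok.com/AbcXyz": "https://www.tiktok.com/@user/video/123"}
print(resolve_url("https://vm.tiktok.com/AbcXyz", redirects.get))
```

The hop limit is what stops a redirect loop from hanging the request.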

Step 3 — The Provider Chain

This is the core of the tool. The backend doesn't rely on a single source for the download link — it tries multiple providers in order, and only moves to the next one if the previous failed.

Validated & resolved URL

↓

Provider 1: TikWM API — fast, HD, usually no-watermark

↓ if failed

Provider 2: yt-dlp — catches what TikWM misses, supports all 3 platforms

↓ if both fail on no-watermark

Last resort: watermarked fallback (with a clear warning shown)

↓

✓ Download link returned to frontend

The first provider, TikWM, is a public API that's fast and usually returns an HD, watermark-free file. But it occasionally struggles with newer videos or private content. That's when yt-dlp steps in — a powerful open-source tool that knows how to extract media from hundreds of platforms, and is updated constantly as platforms change their serving behaviour.

Why yt-dlp as a fallback and not primary? TikWM is faster and returns a clean pre-parsed result. yt-dlp is more capable but adds latency since it runs as a local process and parses raw platform data. Using TikWM first keeps the happy path quick.
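The ordered fallback can be sketched as a loop over provider callables. This is a Python illustration rather than the Node.js implementation; ProviderError, get_download_link and the fake providers are hypothetical stand-ins, and no real TikWM or yt-dlp call is made.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider that cannot produce a watermark-free link."""


def get_download_link(
    url: str,
    providers: list[tuple[str, Callable[[str], str]]],
    watermarked_fallback: Callable[[str], str],
) -> dict:
    """Try each provider in order; fall back to a clearly flagged watermarked link."""
    failures: list[str] = []
    for name, provider in providers:
        try:
            link = provider(url)
        except ProviderError as exc:
            failures.append(f"{name}: {exc}")
            continue  # move on to the next provider in the chain
        return {"link": link, "provider": name, "watermarked": False, "failures": failures}
    return {
        "link": watermarked_fallback(url),
        "provider": "fallback",
        "watermarked": True,  # the frontend can use this flag to show the warning
        "failures": failures,
    }


def fake_tikwm(url: str) -> str:
    raise ProviderError("video not found")


def fake_ytdlp(url: str) -> str:
    return url + "?clean=1"


result = get_download_link(
    "https://example.com/v/1",
    [("tikwm", fake_tikwm), ("yt-dlp", fake_ytdlp)],
    lambda u: u,
)
print(result["provider"], result["watermarked"])  # yt-dlp False
```

Returning the watermarked flag explicitly, instead of just a link, is what lets the interface tell the user when the clean path failed.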

Step 4 — The Proxy Download

Here's a detail that actually matters for reliability: the tool doesn't give your browser a direct link to TikTok's CDN or Instagram's servers. Instead, it registers a short-lived secure token that points back to the Node backend. When you click download, your browser hits the backend's own /api/file endpoint, which streams the video directly to you.

Why does this matter? Direct CDN links from social platforms often include authentication tokens or short expiry times. They also sometimes block downloads when accessed directly from a browser outside the platform. Running the stream through the backend sidesteps both issues — and means the download starts with a clean filename instead of a jumble of CDN parameters.

Step 5 — Caching

Once a video URL has been resolved and a download link extracted, the result is cached in memory for 30 minutes. If you (or someone else on the same local instance) pastes the same video URL again within that window, the backend returns the cached result instantly — no API calls, no yt-dlp process, just the stored answer.
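A 30-minute in-memory cache can be sketched with a dictionary and timestamps. This is a Python illustration of the idea, not the Node.js code; the class name TTLCache and the injectable clock (used here so expiry can be demonstrated without waiting) are assumptions.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after they were stored."""

    def __init__(self, ttl_seconds: float = 30 * 60, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop it so the caller re-resolves the URL
            return None
        return value


# Demonstrate expiry with a fake clock instead of sleeping for 30 minutes.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("https://example.com/v/1", {"link": "https://example.com/file.mp4"})
now[0] = 29 * 60
print(cache.get("https://example.com/v/1") is not None)  # True
now[0] = 31 * 60
print(cache.get("https://example.com/v/1"))  # None
```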

The Result

What this adds up to in practice:

🎯

No-watermark first, always

The tool tries every avenue for a clean file before falling back. The fallback is clearly labelled.

⚡

Fast on repeat URLs

Same video twice within 30 minutes? Instant response from the in-memory cache.

🎵

Audio extraction too

The tool also surfaces the audio-only track when available — useful for saving music from TikTok.

🔒

No external accounts

Runs entirely locally. Nothing to log into, nothing phoning home, no API keys required.

One URL input, three download buttons, zero pop-ups. That's the whole interface.

What's Next

The tool is currently running locally but I'm planning to deploy it on curiousbit.netlify.app so anyone who wants a clean download experience can use it without having to run Node themselves. The architecture is already production-ready — it's just a matter of pointing it at a hosting environment and wiring up the environment variables.

A few things I'd like to add before making it fully public: rate limiting per IP (to avoid abuse), a simple download history in the UI, and potentially Instagram Stories support which currently needs a different extraction path.

The code is structured as a monorepo — backend and frontend live together, share URL validation logic, and build to a single deployable package. If you want to run it yourself locally, it's a single npm install && npm run dev away.

Over to you: Have you ever got fed up enough with a broken web experience that you built your own alternative? I'd love to hear what you made — or whether you'd actually use a clean, ad-free downloader like this if it were publicly hosted. Drop your thoughts below or find me on LinkedIn.


Attention Is All You Need — The Paper That Rewired AI

Ajay Walia — Sun, 10 May 2026 00:00:00 +0000


Deep Dive · Artificial Intelligence · 2017

The five-word title that ended one era of AI and launched another. A beginner-friendly, technically honest tour through the paper that birthed every LLM you've ever heard of.

✍️ Ajay Walia · 📅 May 2026 · ⏱ ~15 min read · 🧠 Beginner → Intermediate

Panel 1 — THE SCROLL APPEARS

CHAPTER 01

The Dark Ages of Language AI

To understand why "Attention Is All You Need" was a thunderclap, you first need to appreciate how painful life was before it. Cast your mind back to 2016. AI researchers around the world were working incredibly hard on language problems — translation, summarisation, question answering — but they were doing so with a fundamental handicap baked into their tools.

The dominant models at the time were Recurrent Neural Networks (RNNs) and their smarter cousin, the Long Short-Term Memory network (LSTM). Both were designed to handle sequences: text goes in word by word, the model builds up a hidden memory state as it reads, and produces an output at the end.

The intuition seems sensible. After all, we read left to right (in English). Why shouldn't a machine? The problem, as we'll see, was catastrophic at scale.

How an RNN Actually Works

Imagine you're a goldfish with a tiny little notepad. Every time you read a new word, you scribble something on your notepad, then erase half of it to make room for the next word. By the time you reach the end of a 500-word paragraph, your notepad is a smeared mess of partial impressions. That's an RNN.

More precisely: an RNN processes tokens one at a time, left to right. At each step, it combines the current word's embedding with a hidden state vector (its "memory" from all previous words) and produces a new hidden state. That hidden state is passed forward to the next step.
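To make the notepad concrete, here is a minimal sketch of one vanilla RNN step in plain Python. The tanh activation and the weight names (W_x, W_h, b) follow the usual textbook formulation and are assumed here for illustration; nothing about them comes from a specific model.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One step: h_new = tanh(W_x @ x + W_h @ h_prev + b), using plain lists."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, W_x, W_h, b):
    """Feed a sequence of embeddings through the RNN and return the final hidden state."""
    h = [0.0] * len(b)  # the empty notepad
    for x in inputs:  # strictly sequential: step t needs the result of step t-1
        h = rnn_step(x, h, W_x, W_h, b)
    return h
```

The `for` loop in `run_rnn` is the whole story: every step depends on the hidden state from the step before, so the steps cannot run side by side.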

RNN: Sequential Processing Chain

The → cat → sat → the → ???

Step 1 → Step 2 → Step 3 → Step 4 → Step 5 (fading). Long range: ☠️

Information degrades as it passes along the chain. Early words become "forgotten."

LSTMs: A Better Notepad, Same Problem

LSTMs (invented by Hochreiter & Schmidhuber in 1997) were the RNN's upgrade. Alongside the hidden state, the now-standard LSTM has three "gates" — input, forget, and output — plus a separate "cell state" that acts as a longer-term memory. They were genuinely better at remembering things across longer sequences.

But LSTMs didn't solve the core architectural problem. They still processed one token at a time, sequentially. And at massive scale, that was the killer.

Panel 2 — THE MEMORY BALL

CHAPTER 02

Three Problems That Crippled the Old Models

Before we get to the solution, let's be precise about the pain. The pre-Transformer era had three interconnected crises, and solving any one of them would have been significant. The Transformer paper solved all three simultaneously.

⛓️

Problem 1: Sequential Bottleneck

RNNs and LSTMs process tokens one at a time. Step 2 cannot begin until Step 1 finishes. This means you cannot parallelise the work within a sequence across GPU cores. Training was agonisingly slow — weeks or months for large models.

🌫️

Problem 2: Vanishing Gradients

When you train a neural network with backpropagation, you compute gradients (error signals) and push them backwards through the chain. In a long sequence, those gradients shrink exponentially as they travel, so the model barely learns anything from early tokens.

📏

Problem 3: Long-Range Amnesia

In the sentence "The trophy didn't fit in the suitcase because it was too big" — what does "it" refer to? The trophy. A human knows instantly. An RNN processing hundreds of words between "trophy" and "it" often forgot the connection entirely.

The Telephone Game at Scale

The vanishing gradient problem is best understood through the "telephone game" (Chinese Whispers). You whisper a sentence to the first person in a chain. By the time it reaches the 20th person, the message is garbled beyond recognition. In an RNN, the gradient signal is that whisper — and long sequences were destroying it.

LSTMs reduced the garbling with their gating mechanisms, but didn't eliminate it. And crucially, every single token in the sequence still had to wait for the one before it to finish processing. At a time when researchers were starting to dream about training on billions of words, this was a scaling cliff.
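A quick back-of-the-envelope sketch shows why the whisper dies. Assume, purely for illustration, that every step of backpropagation scales the gradient by the same factor. Real networks have a different factor at every step, but the exponential trend is the same.

```python
def gradient_scale(per_step_factor: float, steps: int) -> float:
    """Size of a gradient signal after being multiplied by the same factor at every step."""
    return per_step_factor ** steps


short_chain = gradient_scale(0.9, 20)    # about 0.12: weakened but still usable
long_chain = gradient_scale(0.9, 500)    # vanishingly small: the signal is gone
exploding = gradient_scale(1.1, 500)     # the mirror-image problem: it blows up
```

With a factor of 0.9, twenty steps leave roughly 12% of the signal, while five hundred steps leave effectively nothing. A factor slightly above 1 explodes instead, which is why both problems are usually discussed together.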

Panel 3 — THE VANISHING WHISPER

CHAPTER 03

A Spark Before the Fire — Bahdanau Attention (2014)

To be accurate about the history: the 2017 paper didn't invent attention from scratch. In 2014, Dzmitry Bahdanau and colleagues published a paper that added an "attention mechanism" on top of existing encoder-decoder RNNs for machine translation.

The idea was elegant: when generating each output word, instead of squishing the entire input sentence into one fixed-size vector, the model learns to "look back" at different parts of the input and assign weights — attention scores — to each input word. Generate "Hund" in German? Pay more attention to "dog" in the English source.

🏅 The 2014 Precursor

Bahdanau et al. (2014) showed attention worked. But they bolted it on top of RNNs — the sequential backbone was still there, just with a better look-back mechanism. It was like putting a turbocharged engine in a horse-drawn carriage.

The 2017 breakthrough came when Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, working across Google Brain, Google Research, and the University of Toronto, asked a radical question: what if we got rid of the carriage altogether?

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
— Vaswani et al., Attention Is All You Need (2017)

CHAPTER 04

June 2017 — The Paper Drops

Eight researchers, most of them at Google Brain and Google Research, quietly uploaded a pre-print to arXiv on June 12, 2017. The title was almost cheeky — "Attention Is All You Need" — a pun on The Beatles' "All You Need Is Love" and a direct challenge to the field: attention mechanisms alone are sufficient. No recurrence. No convolutions. Just attention.

The abstract was direct. They proposed the Transformer architecture, showed it achieved state-of-the-art on English-to-German translation (28.4 BLEU, surpassing the previous best by more than 2 BLEU points), trained it in a fraction of the time, and made a claim that would prove prophetic: this architecture was far more parallelisable and required significantly less time to train.

At the time, the machine learning community took notice but didn't immediately grasp the full magnitude. It looked like a better translation model. What it actually was, in retrospect: the foundation of every major AI system built in the next decade.

200,000+

Citations as of 2025 — one of the most cited papers in all of computer science history

Panel 4 — THE GOOGLE BRAIN MOMENT

CHAPTER 05

Self-Attention — Every Word Watches Every Word

This is the heart of it. Everything else in the Transformer paper is (brilliant) supporting machinery. Self-attention is the engine.

Here's the core idea in plain English: when processing any word, the model looks at every other word in the sentence simultaneously and calculates a relevance score. Instead of passing information along a chain one step at a time, every token "talks to" every other token in parallel.

The Library Analogy

Imagine a library where every book can send a little messenger to every other book, asking: "Hey, are you relevant to me?" Each pair of books gives an answer — a number from 0 to 1. The books with higher scores get to "share" more of their information when the library compiles its final report.

In the sentence "The cat sat on the mat because it was tired" — when the model processes "it", the self-attention mechanism computes a score between "it" and every other word. The word "cat" gets a very high score (because "it" refers to the cat), while words like "the" and "on" get low scores. This is done in one parallel operation — no sequential chain required.

The Math Behind It (Don't Panic)

The paper formalises this with three vectors derived from each word's embedding: a Query (Q), a Key (K), and a Value (V). Think of it like a search engine:

🔍 Q / K / V Intuition

Query (Q): "What am I looking for?" — what the current word wants to know.
Key (K): "What do I offer?" — what each other word has to advertise.
Value (V): "Here's my actual content" — what each word shares if chosen.

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

The division by √d_k (the square root of the key dimension) is a stabilising trick. Without it, the dot products grow large as d_k grows and the softmax becomes extremely "peaky" (nearly all the weight lands on one word), which pushes it into regions where gradients are tiny and training stalls. The softmax itself converts each word's raw scores into a probability distribution, so the weights in every row add up to 1.0.
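The formula fits in about a dozen lines of plain Python. This is a minimal sketch for one sentence at a time, with Q, K, and V given as lists of row vectors. A real implementation would use batched matrix multiplications on a GPU, and the function names here are my own.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:  # a loop here for clarity; on a GPU all rows are computed at once
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Each row of `all_weights` is one word's view of the sentence, which is exactly what an attention heatmap visualises.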

Self-Attention Weight Matrix — "it" attending to other words

Column labels: The · cat · sat · mat

"it" → 0.05 · 0.72 · 0.04 · 0.03 · 0.10 · 0.06

"it" pays most attention to "cat" (0.72 weight) — this is how the model resolves co-reference. Higher = brighter.

The critical breakthrough isn't just that these scores are computed — it's how: all pairs are computed in parallel using matrix multiplication. A sentence of 512 tokens doesn't require 512 sequential steps. It requires one big matrix operation that modern GPUs execute extremely fast. This is the parallelisation breakthrough that made scaling possible.

Panel 5 — WORDS LOOKING AT WORDS

CHAPTER 06

Multi-Head Attention — Many Perspectives at Once

Here's where the paper goes from clever to ingenious. A single self-attention computation gives you one view of how words relate. But language is rich — words relate to each other in many different ways simultaneously.

Consider the sentence "She gave him the book she wrote":

— "she" and "him" have a grammatical subject/object relationship
— "she" (first occurrence) and "she" (second) have a co-reference relationship
— "book" and "wrote" have a semantic relationship (you write books)
— "gave" and "book" have a verb-object relationship

One attention head would have to pick one of these. Multi-head attention runs several attention computations in parallel, each in a different "subspace" of the representation. The results are then concatenated and projected back to the original dimension.

Head 1: Grammatical roles (subject, object, verb)
Head 2: Co-reference resolution ("she" = "she")
Head 3: Semantic relatedness (book ↔ wrote)
Head 4: Syntactic dependencies (verb-object)

The original Transformer used 8 attention heads. GPT-3 uses 96 per layer, and closed models like Claude don't publish their head counts. Each head develops its own specialisation during training — not by design, but emergently, because the model learns that different heads can capture different useful patterns.
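Here is a deliberately stripped-down sketch of the split-attend-concatenate pattern. One loud simplification: the paper gives every head its own learned projection matrices for Q, K, and V, plus a final output projection, whereas this sketch just slices each token vector into equal chunks so the mechanics stay visible. The function names are hypothetical.

```python
import math


def _attend(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum((e / z) * v[j] for e, v in zip(exps, V)) for j in range(len(V[0]))])
    return out


def multi_head_self_attention(X, num_heads):
    """Slice each token vector into num_heads chunks, attend per chunk, concatenate.

    Simplification: no learned per-head projections and no output projection.
    """
    d_model = len(X[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):  # heads are independent, so they can run in parallel
        lo, hi = h * d_head, (h + 1) * d_head
        X_h = [x[lo:hi] for x in X]  # this head's subspace
        head_outputs.append(_attend(X_h, X_h, X_h))
    # Concatenate the heads so every token is back at d_model dimensions.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
```

With the paper's sizes (d_model of 512 and 8 heads), each head would work in a 64-dimensional slice, so eight heads cost about the same as one full-width head.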

🎭 The Right Analogy

Multi-head attention is like having a team of editors review your essay simultaneously. One editor focuses on grammar, one on logical flow, one on vocabulary, one on argument structure. You get all their feedback at once, then synthesise it. No editor has to wait for the previous one to finish.

Panel 6 — THE EIGHT-EYED TEAM

CHAPTER 07

Positional Encoding — Teaching Order Without Recurrence

Here's a subtle but critical problem. In an RNN, word order is implicit — you literally process word 1, then word 2, then word 3. The order is baked into the architecture. But in a Transformer, all words are processed in parallel. Feed it "Dog bites man" and then "Man bites dog", and the attention mechanism alone sees the same set of words both times, so each word ends up with exactly the same representation in both sentences.

That's obviously catastrophic for language. "The bank by the river" and "river by the bank the" mean very different things.

The solution: Positional Encoding. Before feeding word embeddings into the Transformer, you add a unique positional signal to each one. The paper uses a clever combination of sine and cosine functions at different frequencies:

📐 The Formula

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where pos is the word's position, i indexes a sine/cosine pair of dimensions, and d is the model's embedding dimension. The result: each position gets a unique, smooth vector that the model can learn to interpret. The sine/cosine waves at different frequencies are like a musical chord unique to each seat in the stadium.

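Why sinusoids and not just the number 1, 2, 3...? Because a fixed offset is easy to express: for any distance k, the encoding of position pos + k is a linear function of the encoding of position pos, so the model can learn relative positions (word 2 is one step after word 1), not just absolute ones. The authors also hypothesised that the wave patterns would let the model extrapolate to sequences longer than those seen in training, and they reported that learned position embeddings performed nearly identically.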
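The two formulas translate almost word for word into code. A minimal sketch, assuming an even embedding dimension; the function name is my own.

```python
import math


def positional_encoding(pos: int, d_model: int) -> list:
    """Sinusoidal encoding for one position; assumes d_model is even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe


first = positional_encoding(0, 4)  # every sine is 0 and every cosine is 1 at position 0
```

Low values of i give fast-changing waves that separate neighbouring positions, while high values of i give slow waves that only drift over long distances. The resulting vector is simply added to the word embedding before the first layer.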

Modern variants like RoPE (Rotary Position Embedding, used in Llama and GPT-NeoX) and ALiBi have since improved on the original scheme — but they're all descendants of this 2017 insight.

Panel 7 — THE NUMBERED LINEUP

CHAPTER 08

The Full Transformer — Encoder + Decoder

The original paper was designed for sequence-to-sequence tasks — specifically machine translation. The architecture has two halves that work together: an Encoder that reads the input (the English sentence) and a Decoder that generates the output (the German translation).

The Encoder

The encoder stack (6 identical layers in the original paper) processes the entire input sentence in parallel. Each layer has two sub-components: (1) multi-head self-attention (all words attend to all other words) and (2) a feed-forward neural network applied to each position independently. Both sub-components use residual connections (the input is added back to the output) and layer normalisation. Residual connections were borrowed from computer vision (ResNet); layer normalisation came out of earlier work on stabilising recurrent networks.

The Decoder

The decoder is similar but has three sub-components per layer. The first is masked self-attention — like encoder self-attention, but masked so that when generating word N, the model can only attend to words 1 through N-1 (it can't cheat by looking at future words). The second is cross-attention — the decoder attends to the encoder's output, connecting the input sentence to the generation process. The third is the same feed-forward network as in the encoder.
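The masking itself is a small trick. A minimal sketch, assuming the raw attention scores have already been computed as a square matrix. Here row i may see columns 0 through i, which matches the description above once the decoder input is shifted right by one position, as it is in the paper. The function name is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row over the current and earlier positions only.

    scores[i][j] is the raw score of position i attending to position j.
    Positions j > i are in the future and always get weight 0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop everything after position i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (len(row) - i - 1))
    return weights
```

With all-zero scores for three positions, the first row puts all its weight on itself, the second splits evenly across two positions, and the third across three. Production code usually gets the same effect by adding negative infinity to the future scores before the softmax.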

🧬 The Encoder-Decoder Legacy

BERT (2018) uses only the encoder — great for understanding tasks (classification, named entity recognition). GPT-1/2/3/4 use only the decoder — great for generation tasks (writing, code, conversation). The full encoder-decoder design lives on in models like T5 and BART, used heavily for translation and summarisation.

One more key ingredient: Feed-Forward layers. After each attention block, every position's representation passes through a small, identical 2-layer neural network. In the original paper, the inner dimension of this network was 2048 — 4× the model's embedding dimension of 512. In GPT-3, it's 4× 12,288 = 49,152. These layers are believed to act as "fact storage" — where knowledge learned during training gets encoded.
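A minimal sketch of that 2-layer network for a single position, using the ReLU activation from the original paper, plus a helper that does the size arithmetic. The function names are my own, and the weight count ignores biases.

```python
def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position's vector x.

    W1 has shape d_model x d_ff and W2 has shape d_ff x d_model.
    """
    hidden = [
        max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hidden[j] * W2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]


def ffn_weight_count(d_model: int, expansion: int = 4) -> int:
    """Number of weights in the two matrices of one feed-forward block (biases ignored)."""
    d_ff = expansion * d_model
    return d_model * d_ff + d_ff * d_model


original_paper = ffn_weight_count(512)  # 2 * 512 * 2048
gpt3_sized = ffn_weight_count(12288)    # 2 * 12288 * 49152
```

At the original paper's sizes that is about 2.1 million weights per layer. At GPT-3's sizes it is about 1.2 billion per layer, which shows how much of a large model lives in these blocks.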

Panel 8 — THE TRANSFORMER MECHA

CHAPTER 09

The Impact Timeline — 2017 to Now

The Transformer paper wasn't just a research curiosity. It was a platform. Within a year, the entire field had pivoted. Within five years, it had generated a trillion-dollar industry. Here's the direct lineage:

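2017

"Attention Is All You Need" (ORIGIN)

Vaswani et al. (Google Brain and Google Research). The original Transformer architecture. State-of-the-art on WMT English→German translation. Base model: 65M parameters, about 12 hours of training on 8 P100 GPUs. Big model (the one that scored 28.4 BLEU): 213M parameters, 3.5 days on the same 8 GPUs.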

2018

BERT — Google (ENCODER-ONLY)

Bidirectional Encoder Representations from Transformers. 340M parameters in its larger configuration (BERT-Large). Pre-trained on masked language modelling — predict randomly hidden words. Demolished 11 NLP benchmarks on release. The model that established "pre-train, then fine-tune" as the dominant paradigm.

2018

GPT-1 — OpenAI (DECODER-ONLY)

Generative Pre-trained Transformer. 117M parameters. The first proof that a decoder-only Transformer, trained on unsupervised language modelling, could be fine-tuned for diverse tasks. OpenAI's foundational bet on the decoder path.

2019–2020

GPT-2 → GPT-3 (SCALE)

GPT-2 (1.5B params) was so good at generating text that OpenAI opted for a staged release over safety concerns. GPT-3 (175B params) — the first model to demonstrate serious few-shot learning. You could give it 3 examples and it could do a new task without any fine-tuning. The scaling laws paper (Kaplan et al.) showed empirically that loss falls predictably as models, data, and compute grow.

2022

ChatGPT — RLHF Changes Everything

GPT-3.5 fine-tuned with Reinforcement Learning from Human Feedback (RLHF). The public suddenly had a conversational interface to an LLM. 1 million users in 5 days. An estimated 100M in 2 months, which made it the fastest-growing consumer app on record at the time. Every major tech company scrambled.

2023–2026

Claude · Grok · Gemini · Llama · Mistral · GPT-4/4o

All Transformer-based. All descendants of the 2017 paper. Claude (Anthropic) adds Constitutional AI. Llama (Meta) brings open-source to the frontier. Gemini (Google) goes multimodal. Grok (xAI) takes on real-time search. The Cambrian explosion of LLMs — every one of them tracing its lineage back to June 12, 2017.

Panel 9 — THE ROCKET OF PROGRESS

CHAPTER 10

Is It The Most Important AI Paper Ever Written?

Let's be honest about this. The question is fascinating precisely because it's not entirely settled — there are serious candidates.

The Case For: Yes, Unambiguously

No single paper has had a more direct and immediate commercial and scientific impact in the modern AI era. Every frontier LLM in existence today is a Transformer. The trillion-dollar AI industry of the mid-2020s is built on this foundation. With 200,000+ citations, it sits among the most-cited machine learning papers ever published. The research it unlocked — in text, image (ViT), audio (Whisper), protein structure (AlphaFold 2), video (Sora), code (Copilot) — spans essentially every domain of AI.

The Case For Other Contenders

🏆 Other Papers That Matter

Backpropagation (Rumelhart et al., 1986) — Without the ability to train neural networks at all, there's nothing to build on.

ImageNet + AlexNet (Krizhevsky et al., 2012) — The moment deep learning proved itself to the world, launching the modern deep learning era.

Word2Vec (Mikolov et al., 2013) — Showed that word embeddings encode semantic meaning; a prerequisite for Transformer input representations.

Scaling Laws (Kaplan et al., 2020) — Proved that LLM capabilities grow predictably with compute and data, enabling the investment thesis behind GPT-3 and everything after.

RLHF (Christiano et al., 2017) — The alignment technique that turned raw LLMs into assistants humans actually want to use.

The honest verdict: In the specific context of modern generative AI (LLMs, multimodal models, and the AI products billions of people use daily), "Attention Is All You Need" is the clearest single point of origin. Without backprop the Transformer couldn't exist, but without this paper, modern AI wouldn't have become what it is. It's the right answer to the question "which paper made today's AI possible?"

"We are all standing on the shoulders of eight people who asked: what if recurrence isn't actually necessary?"— A reasonable paraphrase of the entire modern AI research community

Panel 10 — THE AI FAMILY PORTRAIT

CHAPTER 11

Where Do We Go From Here?

The Transformer is dominant but not invincible. Researchers are actively working on what comes next — and several serious challengers are emerging.

The Current Limitations

Self-attention has a quadratic complexity problem. If your sequence has N tokens, the attention matrix is N × N. Double the sequence length, quadruple the compute. For long documents — books, codebases, hours of audio — this becomes brutally expensive. The context window you experience in Claude or GPT-4 represents enormous engineering effort to extend what was originally a very limited range.
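The quadratic growth is easy to check with arithmetic. The sketch below only counts the entries of one N × N attention matrix; the function name is illustrative, and real cost also depends on head count, layer count, and kernel-level optimisations that are not modelled here.

```python
def attention_scores(num_tokens: int) -> int:
    """Entries in one N x N self-attention score matrix."""
    return num_tokens * num_tokens


baseline = attention_scores(1_000)
for n in (1_000, 2_000, 4_000):
    ratio = attention_scores(n) / baseline
    print(f"{n:>5} tokens: {attention_scores(n):>12,} scores ({ratio:.0f}x baseline)")
```

Doubling the sequence length multiplies the count by four, and doubling it again multiplies it by sixteen.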

What's Being Explored

Mamba / State Space Models (SSMs) process sequences in linear rather than quadratic time, which makes them a genuine architectural alternative; some researchers believe they could eventually rival or exceed Transformers for long-context tasks. FlashAttention (Dao et al., 2022) is an algorithmic optimisation that makes standard attention dramatically more memory-efficient without changing the math. Mixture of Experts (MoE) architectures (reportedly used in GPT-4, and confirmed by Google for Gemini 1.5) activate only a subset of parameters per token, allowing models with trillions of total parameters to run at roughly the compute cost of a much smaller model.
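To make the "subset of parameters per token" idea concrete, here is a toy top-k router in plain Python. It is a sketch under loud assumptions: the experts are scalar functions, the router logits are passed in by hand (in a real MoE layer a learned router computes them from the token), and the names `top_k_route` and `moe_forward` are illustrative rather than taken from any library.

```python
import math


def top_k_route(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    # Softmax over the chosen experts only, shifted by the peak for stability.
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


calls = []  # records which experts actually ran


def make_expert(index, scale):
    def expert(x):
        calls.append(index)
        return scale * x
    return expert


experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
output = moe_forward(10.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(output, sorted(calls))
```

Four experts exist, but only two are executed for this input, which is the source of the compute saving.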

Multimodality is the frontier. The Transformer's attention mechanism generalises naturally to images (patch tokens), audio (spectrogram tokens), video (frame tokens), and structured data. A single Transformer can in principle process all of these simultaneously — and models like GPT-4o and Gemini Ultra are moving rapidly in this direction.

The question researchers are now asking: is intelligence primarily a function of architecture, or of scale and data? The scaling laws suggest it's mostly the latter. If that's true, the Transformer need not be dethroned — it just needs to be fed more.

🔭 The Bigger Picture

We are, by most accounts, somewhere in the middle of the most important technological transition in human history. Every model at the frontier — the ones writing code, passing medical exams, generating video — shares a common ancestor. Eight researchers. One arXiv pre-print. Twelve hundred lines of Python. June 12, 2017.

Panel 11 — THE NEURAL SKY

CONCLUSION

The Seven Words That Changed Everything

We started this piece in 2016, watching RNNs and LSTMs struggle through their sequential chains, watching gradient signals vanish like whispers in a long corridor. We watched researchers work incredibly hard to coax these architectures to handle longer contexts, more complex language, bigger training sets — and hit wall after wall.

Then eight people asked a simple question (what if you just... paid attention?) and rewired the entire field.

The Transformer is not magic. It's mathematics: query-key-value lookups, scaled dot-product attention, layer normalisation, residual connections, feed-forward networks. Every piece is graspable. The genius was in the combination — and in the willingness to abandon the assumption that sequences must be processed sequentially.
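As a rough illustration of how graspable the core operation is, here is scaled dot-product attention in plain Python: softmax(QKᵀ / √d_k) · V for a single head. It is a teaching sketch, not an efficient implementation; it uses nested lists instead of tensors and leaves out masking, multiple heads, and the learned projection matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Each query returns a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
result = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(result)
```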

The models you interact with today — the ones that draft your emails, explain your code, answer your questions about transformer architecture with exhaustive detail — are all running on this foundation. Claude (the AI that helped outline this post) is a Transformer. GPT-4 is a Transformer. Grok is a Transformer. Gemini is a Transformer. They are, in the deepest technical sense, all direct descendants of that arXiv upload.

Understanding "Attention Is All You Need" is not just historical curiosity. It's the grammar of modern AI. Once you understand it, you have a lens through which almost everything in the field makes sense — the scaling laws, the context window debates, the encoder vs. decoder architecture choices, the multimodal experiments, the efficiency research.

The paper is free. The arXiv link still works. It's 15 pages and reads more clearly than most ML papers. If this post piqued your curiosity: go read it. You'll understand it now.

2017 → ∞

The year a single paper changed the trajectory of intelligence itself


OpenAI Just Killed the Voice Assistant — And Built Something Far More Dangerous

Ajay Walia — Sun, 10 May 2026 00:00:00 +0000

OpenAI just shipped three new voice models — and together they don’t just improve voice assistants. They make the very concept of a “voice assistant” feel outdated.

GPT-Realtime-2 is the first voice model with GPT-5-class reasoning. It doesn’t wait to think — it reasons out loud while keeping the conversation moving.

What Was Announced

OpenAI released a trio of voice models through its API this week:

GPT-Realtime-2 — live voice with GPT-5-class reasoning, tool calling, and interruption handling
GPT-Realtime-Translate — real-time speech translation across 70+ input languages into 13 output languages, keeping pace with the speaker
GPT-Realtime-Whisper — streaming speech-to-text that transcribes live as you speak (not after you stop)

Plus two new Chat Completions models: gpt-4o-transcribe and gpt-4o-mini-transcribe, with significantly lower word error rates than the original Whisper.

Why GPT-Realtime-2 Is a Step Change

Previous voice models followed a pattern: you speak → model listens → model thinks → model responds. Linear. Predictable. Frustrating when the request was complex.

GPT-Realtime-2 breaks that pattern. It:

Calls multiple tools simultaneously: checking your calendar, pulling data, and looking something up at the same time
Makes actions audible: says things like “checking your calendar now” or “looking that up” while it works, so the conversation doesn’t go silent
Handles corrections and interruptions naturally: you can cut in, redirect, or correct mid-sentence
Benchmarks 15.2% higher on Big Bench Audio vs. GPT-Realtime-1.5

That last point matters because Big Bench Audio tests audio intelligence: understanding complex spoken requests, not just transcription accuracy.
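The "multiple tools at the same time" behaviour can be pictured with ordinary Python concurrency. The sketch below is not the OpenAI Realtime API: `fake_tool` and the tool names are hypothetical stand-ins that simply sleep, and the point is only that three 0.2-second calls dispatched together finish in roughly 0.2 seconds rather than 0.6.

```python
import asyncio
import time


async def fake_tool(name: str, seconds: float) -> str:
    """Hypothetical tool call; the sleep stands in for network latency."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_tools_concurrently():
    # gather() starts all three calls before awaiting any of them.
    return await asyncio.gather(
        fake_tool("calendar", 0.2),
        fake_tool("data_pull", 0.2),
        fake_tool("lookup", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(run_tools_concurrently())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

An agent built this way can narrate progress while the calls are in flight instead of going silent until the slowest one returns.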

What This Means for Enterprise

If you’re building — or buying — anything with a voice interface right now, this changes your calculus.

Dial-in support bots built on older voice AI will feel laggy and scripted compared to what GPT-Realtime-2 can do. The gap between “voice assistant” and “voice agent” just widened dramatically.

Real-time translation (70 input languages → 13 output languages) is a genuine enterprise unlock for global operations, multilingual customer support, and cross-border meetings without interpreter overhead.

Streaming transcription means you can build systems that act on partial speech — not just complete utterances. Think interruption detection, real-time coaching, live subtitles that actually keep up.

The Question Worth Asking

For enterprise tech leaders: most voice AI deployments today are reactive, waiting for a complete input before they respond. GPT-Realtime-2 is proactive: it works while talking. That’s a fundamentally different UX and a fundamentally different integration model.

The platforms and products that haven’t designed for this will feel broken by comparison within 12 months.

→ Read the full announcement on OpenAI

Over to you: Are you currently using any voice AI in your enterprise stack — or actively avoiding it? What would need to be true about reliability and accuracy before you’d deploy it for customer-facing workflows?


I Built My Own Screenshot App for macOS (No More Clunky Screenshots)

Ajay Walia — Fri, 08 May 2026 00:00:00 +0000

It was a Tuesday evening. I had just taken my fourteenth screenshot of the day: a mix of Cmd+Shift+4, accidental desktop saves, and files named Screenshot 2026-05-06 at 11.43.22 PM.png scattered across my Downloads folder like confetti after a party nobody enjoyed.

I opened Finder. Forty-seven PNGs. Forty-seven. Some blurry. Some with my other windows bleeding in at the edges. Some cropped wrong because I’d sneezed mid-drag.

There had to be a better way.

The Ghost of Greenshot Past

Back in my Windows days (and I spent a lot of years in Windows, running datacentres, managing Wintel estates, building VDI platforms) I had Greenshot. If you’ve never used it, Greenshot is a free, lightweight screenshot tool that lives in your system tray. Press PrtSc, draw a box, done. Instant annotation. Instant clipboard. Instant sanity.

When I moved to macOS full-time, I expected something better. Apple builds beautiful hardware. Their software is generally excellent. And their screenshot workflow is… fine. Technically fine. But when you’re taking screenshots all day for documentation, Slack messages, architecture diagrams, and blog posts — “fine” isn’t good enough.

The native flow is: Cmd+Shift+4, drag imprecisely, file saves somewhere, you hunt for it, you paste it, you realise the crop was off, you do it again.

I Googled “Greenshot for Mac” approximately eleven times over the past year. The answers: CleanShot X (paid, subscription), Shottr (good but someone else’s decisions), or just “get used to it.”

I couldn’t get used to it. So I did what any reasonable enterprise architect does at 10pm on a Tuesday.

I decided to build it myself.

Fine. I’ll Build It.

The plan was simple: a lightweight macOS menu bar app. Lives in the status bar. One click or a hotkey and you’ve got a clean area capture, a window capture, or a full-screen grab. Auto-saves with a sensible filename. Copies to clipboard. Makes a satisfying shutter sound. Stays out of your way.

The twist: I’d build it the same way I built this blog — with Claude as my pair programmer and VS Code as the editor. I’m an architect and technologist, not a Swift developer. But that was the whole point.

I opened a new conversation, described what I wanted in plain English, and we got to work.

What AJShot Is

AJShot is a native macOS menu bar app: background-only, no Dock icon, no bloat. It sits quietly in your menu bar with a small AJ badge. Left-click opens your last screenshot. Right-click opens the action menu.

Memory footprint: around 50 MB at idle. Compare that to some Electron-based tools that idle at 300 MB just saying hello.

The right-click menu looks like this:

📷 Capture Area ⌘⇧3
🖥 Capture Window ⌘⇧4
🖥 Capture Fullscreen ⌘⇧5
─────────────────────────────
📂 Open Last Screenshot
📁 Open Screenshots Folder
─────────────────────────────
⚙️ Preferences
ℹ️ About
✕ Quit

The keyboard shortcuts are global, so they work even when AJShot is in the background. You’re deep in a Zoom call, you need a quick capture, you hit Cmd+Shift+3, draw the box, done. The screenshot is already in your designated folder and on your clipboard before you’ve even let go of the mouse.

Core Features

Capture Modes

Area Capture (⌘⇧3) — the workhorse. Full-screen overlay appears, cursor becomes a crosshair, you draw a box. Precise. Consistent. No more dragging the wrong direction and getting a sliver of your taskbar.

Window Capture (⌘⇧4) — click a window, capture just that window. No background bleed, no accidental desktop icons photobombing your documentation.

Fullscreen Capture (⌘⇧5) — captures everything. Supports multi-display setups with a “capture all” or “ask each time” behaviour you configure once.

Preferences That Actually Make Sense

Open Preferences and you get a clean panel with the options you’d actually want to configure:

Setting	Default	What it does
Launch at login	On	AJShot is always ready, even after restart
Play capture sound	On	Satisfying shutter click confirms the capture
Show thumbnail preview	On	Floating preview appears — click to open in editor
Auto-copy to clipboard	On	Screenshot lands in your clipboard automatically
Save folder	`~/Pictures/AJShot`	Your own designated screenshot home
File format	PNG	Or JPG with a quality slider
Filename template	`AJShot_{date}_{time}`	Consistent, sortable, sane filenames

The filename templating engine handles the sanitisation itself: no illegal characters, no trailing dots, no 240-character filenames. Files come out like AJShot_2026-05-09_22-14-37.png. You can find them instantly. You can sort them. Your Downloads folder stays clean.
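The real engine is Swift and is not reproduced in this post, so what follows is a Python sketch of the same idea. The `{date}` and `{time}` placeholders come from the Preferences table; the set of illegal characters, the 100-character cap, and the `screenshot` fallback name are assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed rule set: path separators, shell-hostile punctuation, control chars.
ILLEGAL = re.compile(r'[/\\:*?"<>|\x00-\x1f]')
MAX_STEM = 100  # assumed cap, well under typical filesystem limits


def render_filename(template: str, when: datetime, ext: str = "png") -> str:
    """Expand the template, then sanitise the resulting stem."""
    stem = (template
            .replace("{date}", when.strftime("%Y-%m-%d"))
            .replace("{time}", when.strftime("%H-%M-%S")))
    stem = ILLEGAL.sub("_", stem)
    # Truncate first, then trim so the cut cannot leave a trailing dot.
    stem = stem[:MAX_STEM].strip(" .") or "screenshot"
    return f"{stem}.{ext}"


stamp = datetime(2026, 5, 9, 22, 14, 37)
print(render_filename("AJShot_{date}_{time}", stamp))
print(render_filename("bad:name...", stamp))
```

Keeping the function pure (template and timestamp in, string out) makes it easy to test without touching the filesystem.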

Post-Capture Flow

After every capture, AJShot runs through a quick sequence:

Auto-saves to your configured folder with the filename template applied
Copies to clipboard (if enabled) — so you can paste immediately
Plays the shutter sound — audio confirmation that the capture happened
Shows a floating thumbnail in the corner — click it to open in the editor

You can also configure the post-capture action to always ask, always edit, always save, or always copy. Once you’ve decided how you work, it remembers.

The Real Problem It Solves

Here’s the honest comparison:

	Native macOS	AJShot
Start capture	`Cmd+Shift+4`, wait for toolbar	`Cmd+Shift+3`, instant overlay
File location	Desktop or last used folder	Always `~/Pictures/AJShot`
Filename	`Screenshot 2026-05-09 at 10.14.22 PM.png`	`AJShot_2026-05-09_22-14-22.png`
Clipboard copy	Manual extra step	Automatic
Editor	Preview.app detour	Built-in (annotation tools in progress)
Memory use	System process	~50 MB dedicated
Works in background	Partial	Full global hotkeys
Shutter sound	Yes	Yes (configurable)

The editor is where things get even more interesting. The annotation scaffolding is already built — the tools are there in the codebase: arrows, blur, text, highlights, shapes, step-number callouts. They’re not wired to the UI yet, but they’re coming. That’s the next milestone.

Tech Stack — What’s Actually Under the Hood

This is a native macOS app built entirely in Swift 5.9+, targeting macOS 12 (Monterey) and later. The UI is a hybrid of SwiftUI and AppKit: SwiftUI for the Preferences panel and modern views, AppKit where you need fine-grained control over system behaviour (status bar items, window management, capture overlays).

Dependencies (both by Sindre Sorhus, the prolific open-source developer behind half the macOS indie tool ecosystem):

KeyboardShortcuts: global hotkey registration that plays nicely with macOS security
Defaults: a type-safe, SwiftUI-friendly wrapper around UserDefaults

Capture pipeline: uses CGPreflightScreenCaptureAccess and CGRequestScreenCaptureAccess from CoreGraphics for the permission preflight, then ScreenCaptureKit for the actual pixel capture. The permission flow has a proper retry loop: if you’ve granted permission but haven’t restarted the app, it tells you exactly that and offers a one-click restart.

Build and distribution: Swift Package Manager for dependencies, Xcode 15 for the build, and a build-dmg.sh script that produces a signed AJShot-1.0.0.dmg for distribution. The DMG is already built. You can drag it to Applications and run it today.

Security considerations (because I work in security and these things matter):

Screenshot folder is set to 0700 permissions, so only you can read it
Individual screenshot files are 0600, for the same reason
Filename template engine strips illegal characters and control characters at the source
Code signing and notarization stubs are already in the README for when distribution goes wide
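The permission scheme in the first two points is plain POSIX and can be sketched in a few lines. AJShot itself is Swift; this Python version only illustrates the 0700 folder and 0600 file modes, and the helper name `save_private` and the temporary demo directory are assumptions.

```python
import os
import stat
import tempfile


def save_private(folder: str, name: str, data: bytes) -> str:
    """Write data so that only the owning user can read the folder or file."""
    os.makedirs(folder, exist_ok=True)
    os.chmod(folder, 0o700)  # owner: read, write, traverse; nobody else
    path = os.path.join(folder, name)
    # Create the file with mode 0600 from the start, not after the write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.chmod(path, 0o600)  # tighten a pre-existing file with a looser mode
    return path


demo_root = tempfile.mkdtemp()
saved = save_private(os.path.join(demo_root, "AJShot"), "example.png", b"demo")
print(oct(stat.S_IMODE(os.stat(saved).st_mode)))
```

Passing the mode to `os.open` matters: it avoids a short window in which a freshly created file would be readable under the default umask.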

The architecture is clean: App → AppDelegate → StatusBarController → CaptureManager → ScreenshotStorage. Each module does one thing. The FilenameTemplateEngine is a pure static function. The ThumbnailPresenter is decoupled from capture. Claude helped me keep it disciplined.

The Honest Struggles

No developer story is complete without the bit where things go sideways.

Screen Recording permissions nearly broke me on the first build. macOS 12+ requires explicit Screen Recording permission in System Settings, and you can’t force-trigger the dialog more than once per session. I had to build the multi-stage fallback: first launch triggers Apple’s native dialog, subsequent denials open System Settings directly, and if you’ve granted permission but not restarted, it detects that state and offers a restart button. It took longer to get that right than the actual capture code.

Swift Package Manager vs Xcode had a brief disagreement about the resource bundle for shutter.aiff. The sound file lives in AJShot/Sounds/ and has to be declared as a .process("Sounds") resource in Package.swift. Simple when you know it. Less simple at 11pm when it just silently fails to play.

The LaunchAtLogin manager required writing a LaunchAgent plist to ~/Library/LaunchAgents/. Straightforward on paper. In practice, macOS is protective of that folder in ways that aren’t well-documented, and the error messages when something goes wrong are the kind that send you to Stack Overflow threads from 2019.
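For readers who have not seen one, a LaunchAgent plist is a small property list. The sketch below builds a minimal one with Python's standard `plistlib`; `Label`, `ProgramArguments`, and `RunAtLoad` are standard launchd keys, while the bundle identifier and executable path are hypothetical placeholders, not AJShot's real values. It builds the bytes only and deliberately does not write into ~/Library/LaunchAgents/.

```python
import plistlib


def build_launch_agent(label: str, executable: str) -> bytes:
    """Return XML plist bytes for a run-at-login LaunchAgent."""
    agent = {
        "Label": label,                    # unique job name
        "ProgramArguments": [executable],  # command line to launch
        "RunAtLoad": True,                 # start when the agent is loaded
    }
    return plistlib.dumps(agent)


plist_bytes = build_launch_agent(
    "com.example.ajshot",                             # hypothetical identifier
    "/Applications/AJShot.app/Contents/MacOS/AJShot"  # hypothetical path
)
print(plist_bytes.decode("utf-8"))
```

By convention the file is saved as `<Label>.plist` inside the LaunchAgents folder.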

That’s the thing about building a native macOS app — the platform is powerful and the APIs are solid, but the surface area of “things that can quietly not work” is larger than you’d expect. Claude helped me navigate most of it. The frustration was real and so was the progress.

What’s Next

The annotation editor is the next major milestone — the scaffolding is already there. Arrow tool, blur tool, text tool, highlight, shapes, step-number callouts. When that’s wired up, AJShot becomes a complete Greenshot replacement for macOS: capture, annotate, save, share.

After that: GitHub release with a public DMG, code signing, notarization, and probably a product page here on CuriousBit.

If you’re a macOS user who’s been putting up with the native screenshot workflow out of habit — you don’t have to. And if you’re a developer (or aspiring one) who thinks you can’t build a native app without being a full-time Swift engineer — you can. You really can.

The GitHub repo is coming soon — I’ll post the link here when it’s public. In the meantime, if you’re curious about the architecture, the Swift source, or want to follow along as the editor gets built, keep an eye on this blog.

And I’m genuinely curious: what screenshot tool do you use on macOS? Are you a CleanShot loyalist? A Shottr person? Still using the native tools and somehow thriving? Drop a note; I’d love to know.

Because the best tools are the ones built out of genuine frustration with the alternative. And I was very genuinely frustrated.


I Built This Blog Without Writing a Single Line of Code (Almost)

Ajay Walia — Thu, 07 May 2026 00:00:00 +0000

It started like most late-night ideas in Sydney — a MacBook, too much coffee, and a thought that refused to go away.

I wanted a tech blog. A proper one — clean, fast, with a Knowledge Base, Videos section, and a place to put thoughts on AI and enterprise tech. The problem? Zero design skills. Zero frontend experience. I know infrastructure, architecture, platforms — but CSS gives me a headache.

The Crazy Idea

Then it clicked.

What if I just described the site I wanted, in plain English, and let AI build it? Not generate a snippet. Build the whole thing. Layouts, config, content, deployment config. Everything.

The prompt I used was simple but specific:

“Make me a clean HackerNoon-style Hugo + Tailwind blog with KB, Videos, News & Views, and About pages. Deploy-ready for Netlify.”

The Stack

The AI returned a complete project in one shot. Here’s what it chose and why it makes sense:

Hugo: a static site generator written in Go. Blazing fast builds, no database, no server to maintain. Perfect for a content blog.

Tailwind CSS: utility-first CSS. Instead of writing stylesheets, you compose classes directly in HTML. The AI could reason about it well and generate clean, consistent UI.

Netlify: one-click deployment from a GitHub repo. Push to main, site rebuilds automatically. Free tier covers everything a personal blog needs.

GitHub: version control and the bridge between local edits and live site.

The structure it generated: content/, layouts/, assets/, static/, config.yaml, netlify.toml. Exactly what you’d expect from an experienced Hugo developer.

First Roadblock — Hit in Under 5 Minutes

Real talk: the first thing I hit was a terminal error before I’d even installed anything. zsh: command not found: brew. Homebrew wasn’t on the machine.

At first I had no idea what I was looking at.

But here’s the thing about working with AI tools: you learn to debug faster because you can ask why something failed, not just how to fix it. Within a few minutes I understood: Homebrew is the Mac package manager. I needed it to install Hugo. Small win. I could read what was going wrong.

Getting Hugo Running

Three commands and Hugo was running:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install hugo
hugo server

The site was alive at localhost:1313, with navigation, article cards, dark mode toggle, the works.

The Project Structure

The generated project was clean and logical:

content/posts/ — articles as Markdown files with YAML frontmatter
layouts/ — Hugo HTML templates for each page type
assets/css/ — Tailwind CSS, processed at build time
static/ — images and files served as-is
config.yaml — site title, menus, author, base URL
netlify.toml — build command, publish directory, Node version

Hugo reads all frontmatter at build time and generates a completely static site. No database. No server-side rendering. No WordPress. Just fast HTML files.

Real Developer Moments

Not everything was smooth. There were genuine “how does anyone actually do this” moments — forgetting how to copy a file path in Finder, templates not updating because of Hugo’s build cache, Git authentication failing because GitHub dropped password support in 2021.

These are the moments tutorials skip. The AI handled every one of them.

Deployed. It’s Alive.

Getting it live was surprisingly painless:

Push the project to a GitHub repo
Connect the repo to Netlify
Set build command to npm run build, publish directory to public/
Hit deploy

Netlify pulled the code, ran Hugo, and published the static site in under 30 seconds. Every git push from that point triggers an automatic rebuild.

What the Live Site Looks Like

The site has:

Homepage: featured article hero, rotating video cards, Knowledge Base grid
KB: long-form technical articles in Markdown
Videos: auto-synced from a YouTube playlist via YouTube Data API v3. A Node.js script fetches the playlist at build time, writes videos.json to static/, and the browser loads it at runtime. Add a video to YouTube, trigger a deploy, it appears on the site.
About: a simple bio

The video section was the most technically interesting piece. No manual uploads, no embeds to maintain — just a playlist that feeds itself.

The Takeaway

This entire site — layouts, templates, CSS, YouTube integration, Git workflow, deployment pipeline — was built by someone with no frontend background. What I brought was the ability to describe what I wanted clearly, debug errors methodically, and push through the friction points.

The tools did the heavy lifting. The judgment about what to build was still mine.

Now It’s Your Turn

The stack is: Hugo + Tailwind + GitHub + Netlify + Claude. All free. All production-grade.

Start with a prompt. Describe what you want clearly. Expect errors. Fix them one at a time. Deploy early and often.

The gap between “I have an idea” and “it’s live on the internet” has never been smaller.

The full source for this site is on GitHub at ibn-Battuta/AjayW_blog.


Camera Roll to Caption — Python Pipeline, Vision Model for Photo Tags

Ajay Walia — Sat, 02 May 2026 00:00:00 +0000

Vision models, language models, and most other generative systems are confident-but-wrong some non-trivial fraction of the time. The instinct is to fix that with better prompts, bigger models, or smarter agents. The cheaper move is usually to add a small structured review seam — a thirty-second checkpoint where a human can glance, correct, and move on.

This post is the case study for one such seam, dropped into a build I needed for myself. Of 35 garden photos handed to a vision model, 74% came back with correct first-pass labels. After thirty seconds editing a CSV, 97% were acceptable to publish. Total API cost: $0.18. Total inference time: ~74 seconds at 2.1 sec/photo on gpt-4o-mini. The CSV was the highest-leverage code in the project, and it isn’t really code.

Here’s the story.

The annoyance

It was a Saturday afternoon in early March. I’d come back from a walk around the garden with thirty-five photos on my iPhone — bottlebrush in full red, honeysuckle dripping with rain, a lilly-pilly cluster doing its outrageous pink thing, and at least one inexplicable shot of an old railway station I’d passed on the way home.

I wanted to post a handful of them with consistent little hashtag labels (#bottlebrush, #honeysuckle, #flower) burned into the corner like a quiet caption. Not a watermark, not a filter, just a small readable pill that says “this is what you’re looking at.”

What I didn’t want was to open each HEIC in Preview, draw a text box, fiddle with the font, export, repeat thirty-five times. So I did the only reasonable thing: I wrote a small Python tool that does it for me.

The shape of the pipeline

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Folder  │──▶│  Vision  │──▶│   CSV    │──▶│  Apply   │──▶│  Tagged  │
│  photos  │   │ provider │   │  review  │   │ + pill   │   │  output  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
                                   ↑
                        human-in-the-loop seam

--mode propose:  folder ─▶ vision provider ─▶ CSV
--mode apply:    CSV ─▶ render + pill ─▶ tagged output

The minimal product was easy to describe. Point the script at a folder. For each image (HEIC, JPG, PNG, whatever the iPhone or my camera roll throws at it), open it, figure out what’s in it, draw a small rounded hashtag pill into the bottom-right corner, save the result to a tagged_output/ subfolder. No watermark across the centre of the image, no filter or colour grade, no destructive edit to the original, and no making me choose the label by hand when a vision model can have a decent first guess.

That last point is where the design got interesting.

The seam

You could write this as a single command: walk the folder, ask the model, render the tag, done. I tried that first. The first-pass run produced a folder of beautifully tagged images, about a quarter of which were wrong in some quietly maddening way: a daisy called #flower, a fern called #leaves, the railway station called, charitably, #station.

So the script runs in two passes.

--mode propose opens each image, hands it to the vision model, and writes a CSV with five columns:

image_path, label, score, suggested_tag, final_tag

final_tag is initialised to suggested_tag, but the whole point of the column is that you can edit it. Open the CSV, glance down the list, fix anything obvious (flower becomes morning_glory, leaves becomes bamboo), save, close. On this batch, 9 of 35 rows needed editing (a daisy, the railway station, two ferns, the bamboo, and four generic-flower fallbacks). A thirty-second pass.

--mode apply then reads the CSV row by row and renders the tag using whatever’s in final_tag. The CSV is the human-in-the-loop seam. It is much cheaper than re-running inference, and it catches the cases where the model was right about the genus but wrong about the species, or just wrong.
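The two passes can be sketched with nothing but the standard `csv` module. The five column names are the ones listed above; everything else (the function names, the stub classifier, the in-memory buffer standing in for a file, and the string edit standing in for a human) is an assumption made so the sketch runs on its own.

```python
import csv
import io

FIELDS = ["image_path", "label", "score", "suggested_tag", "final_tag"]


def propose(image_paths, classify, out_file):
    """Pass 1: ask the model once per image and write one reviewable row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    for path in image_paths:
        label, score = classify(path)
        tag = label.lower().replace(" ", "_")
        writer.writerow({"image_path": path, "label": label,
                         "score": f"{score:.2f}",
                         "suggested_tag": tag, "final_tag": tag})


def apply_tags(csv_file, render):
    """Pass 2: render whatever final_tag says; no model call happens here."""
    for row in csv.DictReader(csv_file):
        render(row["image_path"], "#" + row["final_tag"])


def stub_classify(path):  # stand-in for a vision provider
    return ("flower", 0.55) if "glory" in path else ("bottlebrush", 0.93)


buffer = io.StringIO()
propose(["IMG_0001.heic", "IMG_0002_glory.heic"], stub_classify, buffer)

# The human review step, simulated here as a one-cell edit.
edited = buffer.getvalue().replace("0.55,flower,flower",
                                   "0.55,flower,morning_glory")

rendered = []
apply_tags(io.StringIO(edited), lambda path, tag: rendered.append((path, tag)))
print(rendered)
```

Because the apply pass reads only the CSV, a correction costs one cell edit rather than another round of inference.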

Three providers, one interface

I didn’t want to commit to one vision model; the price/quality trade-offs are too lively right now. The script supports three providers behind one interface, picked via --provider local|openai|xai.

Local CLIP. HuggingFace’s openai/clip-vit-large-patch14 against a fixed candidate list. Free, offline, ~0.4 sec/photo on an M3 Pro. The cost is breadth: anything outside the candidate list collapses to the nearest match. CLIP doesn’t know what a bottlebrush is unless I tell it the word.

OpenAI. gpt-4o-mini by default, with an opt-in --high-accuracy flag that retries low-confidence cases (under 0.72) on gpt-4o. ~2.1 sec/photo, ~$0.18 for the 35-photo batch. Open-ended labels, which is how bottlebrush, honeysuckle, fern, and berries ended up in the CSV rather than flower, flower, leaves, fruit. 22% of the batch tripped the retry threshold and went to gpt-4o.

xAI Grok. Same OpenAI-compatible client, pointed at api.x.ai with grok-2-vision-latest. Useful if you’re already on the x.ai stack or want a different model family’s vote.

The mental model: local CLIP for batch-of-a-hundred-photos-on-a-flight, OpenAI as the daily driver, and the high-accuracy retry for exactly the case where the model says “flower” with 0.55 confidence and I want it to look harder before I have to.
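The high-accuracy retry reduces to one comparison. In this sketch the 0.72 threshold is the figure quoted above, while `label_with_retry` and the two stub models are hypothetical stand-ins for the real provider calls.

```python
RETRY_THRESHOLD = 0.72  # the low-confidence cut-off quoted above


def label_with_retry(image, cheap_model, strong_model, high_accuracy=False):
    """Use the cheap model first; escalate only when opted in and unsure."""
    label, score = cheap_model(image)
    if high_accuracy and score < RETRY_THRESHOLD:
        label, score = strong_model(image)
    return label, score


def cheap(image):   # stand-in for the default model
    return ("flower", 0.55)


def strong(image):  # stand-in for the retry model
    return ("morning_glory", 0.91)


print(label_with_retry(b"...", cheap, strong))
print(label_with_retry(b"...", cheap, strong, high_accuracy=True))
```

Without the flag the generic label is kept; with it, only the uncertain photos pay for the larger model.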

The blue morning glory below is what generic labels look like in practice — still a decent fallback, just unspecific. The model wasn’t wrong; it just wasn’t curious.

Two small touches

Two design choices are the difference between “the script works” and “the output looks intentional.”

Style-aware contrast. The pill needs to be readable on both a bright sky and dark foliage. The script crops the bottom-right region of the image, measures the mean luminance using the standard Rec. 709 weights, and flips the colour scheme above or below a threshold:

from PIL import ImageStat

def style_aware_colors(img):
    w, h = img.size
    crop = img.crop((int(w * 0.68), int(h * 0.80), w, h))  # bottom-right corner
    r, g, b = ImageStat.Stat(crop.convert("RGB")).mean[:3]
    luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b  # Rec. 709 weights
    if luminance < 140:
        return (255, 255, 255, 245), (0, 0, 0, 95)  # white text, dark pill
    return (0, 0, 0, 245), (255, 255, 255, 95)  # black text, light pill

Eight lines of PIL. In this batch every photo sampled dark — gardens are mostly green and shadow in the corner — so every output got the dark pill. The bright-pill branch is still there, waiting for a photo with sky or a light wall in the corner.

Save with fallback. HEIC writes occasionally fail for reasons that aren’t worth diagnosing in a personal tool. The save function tries the original format first; if PIL throws, it quietly drops to JPEG with the same filename stem. Eight more lines. On this batch, 3 of 35 fell back to JPEG. Without the fallback those three would have been a stack trace and a half-finished folder. With it, thirty-five of thirty-five made it through.
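A fallback of this shape might look like the sketch below. It is not the post's actual function: the exception types caught are an assumption about what a failed write raises, and `FakeImage` is a stand-in that mimics only the `save` and `convert` methods of a PIL image so the example runs without Pillow or a HEIC plugin.

```python
from pathlib import Path


def save_with_fallback(img, target: Path) -> Path:
    """Try the original format; on failure, write a JPEG with the same stem."""
    try:
        img.save(target)
        return target
    except (OSError, ValueError, KeyError):  # assumed failure modes
        fallback = target.with_suffix(".jpg")
        # JPEG has no alpha channel, so convert to RGB before saving.
        img.convert("RGB").save(fallback, format="JPEG")
        return fallback


class FakeImage:
    """Minimal stand-in whose HEIC writes always fail."""

    def __init__(self):
        self.saved = []

    def convert(self, mode):
        return self

    def save(self, path, format=None):
        if str(path).lower().endswith(".heic"):
            raise OSError("HEIC encoder unavailable")
        self.saved.append((Path(path).name, format))


image = FakeImage()
written = save_with_fallback(image, Path("tagged_output") / "IMG_0007.heic")
print(written.name, image.saved)
```

One caveat with this approach: a failed write can leave a partial file at the original path, so a stricter version would remove it before falling back.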

What I’d add next

Multi-tag support, so a photo can be#lorikeet #bottlebrush when the bird showed up in the bottlebrush. EXIF preservation through the round-trip — right now PIL strips most of the metadata, which I don’t love. A tiny review UI to replace the CSV step, either a Tkinter window or a one-page localhost app. Smarter candidate lists for the local provider, scoped by season or geography — Sydney summer has a different vocabulary than European spring.

None of these are urgent enough to displace “the script already does what I wanted.”

Closing observations

Three lessons that generalise beyond this script.

Human-in-the-loop is cheap and underrated. The CSV seam between propose and apply takes thirty seconds per batch and saves me from confidently wrong outputs. For any task where a model is confident-but-wrong some non-trivial fraction of the time — RAG, codegen, moderation, enterprise copilots, agentic workflows — a structured review step pays for itself almost immediately. The CSV doesn’t have to be elegant. It has to exist.

Pluggable providers are worth the small abstraction tax even on personal tools. I went from local CLIP to gpt-4o-mini to Grok in the space of one afternoon without rewriting the rendering code. The interface is (client, model, image) → (label, score) and that’s it. Once you’ve paid that cost once, you can keep up with a fast-moving model market essentially for free.
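One way to express that interface in Python is a type alias plus a lookup table keyed by the `--provider` value. The signature is the one stated above; the registry, the stub providers, and their return values are hypothetical and make no real model calls.

```python
from typing import Callable, Dict, Tuple

# (client, model, image) -> (label, score), as described in the text.
Provider = Callable[[object, str, bytes], Tuple[str, float]]


def local_stub(client, model, image):
    return ("flower", 0.61)       # placeholder result, not real CLIP output


def hosted_stub(client, model, image):
    return ("bottlebrush", 0.93)  # placeholder result, not a real API call


PROVIDERS: Dict[str, Provider] = {
    "local": local_stub,
    "openai": hosted_stub,
    "xai": hosted_stub,  # same OpenAI-compatible shape, different endpoint
}


def classify(provider_name: str, client, model: str, image: bytes):
    """Dispatch to the chosen provider; rendering code never needs to know."""
    try:
        provider = PROVIDERS[provider_name]
    except KeyError:
        raise ValueError(f"unknown provider: {provider_name}") from None
    return provider(client, model, image)


print(classify("local", None, "clip", b"..."))
print(classify("openai", None, "gpt-4o-mini", b"..."))
```

Adding a fourth provider means writing one function and one dictionary entry.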

Small touches decide whether a script feels finished. Luminance-aware contrast and a save-format fallback don’t change what the tool does; they change how the output reads.

The model wasn’t the product. The seam was.

A short reel of the tagged photos in the wild: Instagram story.


Component	What It Does
Gemma 4 26B A4B (Q4_K_M)	The single model serving all 10 agents. MoE architecture — activates ~4–6B parameters per token. ~13 GB VRAM.
LM Studio	Local model server. OpenAI-compatible API on port 1234. Model stays resident in unified memory.
LangGraph	Orchestration graph — defines the multi-step query pipeline as a typed state machine.
LanceDB	Local vector database. One namespace (table) per agent. Fully file-based, no server process needed.
BAAI/bge-small-en-v1.5	Embedding model. 384-dimensional vectors. Runs on the Apple Neural Engine — essentially free compute.
Redis	Session memory (24h TTL, rolling 3-turn window) and routing cache (1h TTL).
FastAPI + Uvicorn	REST API gateway on port 8000. Full query/response model, session management, agent listing.
Gradio	Web chat UI on port 7860. Session management, source citations, escalation chain display.
Prometheus	Metrics: query counts, latency per agent, escalation rate. Useful for understanding usage patterns.
SQLite (audit.db)	Immutable audit trail. Every query logged before response is returned. Cannot be skipped.