New here? Start with the original field guide — “I Built a Team of IT Architects Using LLM That Live on MacBook — Meet Aether.” That post laid out the thought. This one is what happened when the thought met real queries.
Every design survives contact with the page. Then you run it.
Aether v2.6 worked end-to-end on day one — route, retrieve, build, generate, score, escalate, audit. And almost every lesson since came from running that clean little machine against questions it had never seen before.
The original architecture made a single bet: one model can't be an expert at everything, so build a tree of narrow experts and let them escalate on doubt. The bet held. But the path from v2.6 to v2.8.2 reshaped almost everything around it. The agent roster grew, the router was rebuilt twice, retrieval flipped from knowledge-base-first to web-first, and the confidence score stopped being a thing the model claimed and became a thing the system computed.
The Delta
Where the thought and the build diverged
The blueprint described ten agents, a Gemma 4 26B model, knowledge-base-first retrieval, and a confidence number the model appended to its own answer. Run it for a week and three of those four assumptions bend out of shape:
- Ten agents became thirteen. The flat roster of technology specialists reorganised into a four-agent network sub-branch and a consolidated digital-workplace branch.
- Knowledge-base-first became web-first. The local knowledge base started empty, so retrieval now scrapes a vendor allowlist first and falls back to the KB only when the web is thin.
- Self-reported confidence became computed. “Confidence: 0.92” was theatre; a formula over retrieval quality, domain fit and citation density replaced it.
- An idea became a discipline. With Git disabled, a hand-written
CHANGELOG.mdturned into the single source of truth — every fix and reversal, dated.
The Pack
From ten to thirteen — the hierarchy regrows
The three-tier shape held: one Enterprise Architect at the top, three Domain Architects beneath, and a layer of deep technology specialists at the base. What changed was the base. The original Intune, AVD and Citrix agents were too narrow and overlapped each other, so they were folded into broader, sharper roles — and the network domain, barely a single agent before, grew a full four-specialist sub-branch.
Thirteen agents across three tiers. Each domain owns a colour; specialists inherit it. Low confidence climbs the parent chain → Domain → Enterprise.
Added to the pack
- Microsoft DWP Technology Architect — a broad M365, Intune, Entra, Defender and Copilot specialist.
- End-User Virtualization — Citrix, Horizon, AVD and FSLogix under one roof.
- A four-agent network sub-branch — Core Networking, SD-WAN/SASE, Network Security and NetOps AIOps.
Cut from the undergrowth
- The standalone Virtualization Domain Architect — folded into DWP.
- The Intune-only agent — replaced by the broader Microsoft DWP role.
- The AVD- and Citrix-specific agents — absorbed by End-User Virtualization.
- The retired KB taxonomy —
kb_intune,kb_avdandkb_citrix.
core/router.py
Routing — the hardest-won code in the repo
Routing decides which expert hears the question. It sounds trivial — match keywords, pick an agent — and it turned out to be the single biggest source of subtle, infuriating bugs. The router now runs four ordered passes:
- Forced design-doc override. Phrases like “solution design document” bypass keyword scanning entirely, and any mention of “Copilot” routes straight to the Microsoft DWP specialist.
- Tier-3 keyword match. The most specific specialists are scanned first, with
\bword-boundary regex to stop false hits. - Tier-2 domain match. Broader strategy keywords catch domain-level queries that no specialist claimed.
- Default to Enterprise. Anything unmatched falls through to the catch-all at the top.
The bugs here were the kind that hide in plain sight. A bare "ids" keyword matched the “IDs” in “track record IDs” and wrongly summoned the Network Security agent — fixed by swapping it for the explicit phrase "intrusion detection system". Worse, the Tier-2 rules stored their keywords as double-escaped regex (\\bfirewall\\b), which turned the backslashes literal, so those rules could never match and three domain architects sat silently unreachable. A small _kw_matches() helper now treats real regex as regex and plain words as plain words, while Redis caches each verdict for an hour and degrades gracefully if it goes down.
The Single Beast
One local Gemma, thirteen personas
The headline trick from Part I survived intact, and got sharper. There is still exactly one model resident in memory: gemma-4-26b-a4b-it-mlx — Google's open Gemma family, the instruction-tuned variant in an active-parameter configuration, MLX-quantized for Apple silicon and served through LM Studio's OpenAI-compatible API at localhost:1234. Thirteen specialists, one set of weights. The persona switch lives entirely in the YAML manifest:
# agents/enterprise_architect.yaml id: enterprise_architect tier: 1 namespace: kb_enterprise parent: null temperature: 0.2 max_tokens: 3000 system_prompt: | DOMAIN MANDATE — out-of-domain query → refuse, hard-code 0.0 confidence. ENTERPRISE MANDATE — always cover Identity (AD/Entra, RBAC) and HA/DR; state assumptions, never claim 100%. STYLE MANDATE — no invented URLs, no gratuitous TOGAF/ISO name-drops, plain-text arrows, never LaTeX. DIAGRAM RULE — emit a concise Mermaid block (≤15 nodes) for architectures.
That manifest is the agent. The mandates are the lessons of a hundred bad answers, hardened into rules: out-of-domain questions get refused and forced to escalate, every architecture must address identity and disaster recovery, and decorative fake citations are banned outright. There is also a discipline the local hardware taught the hard way — large design templates once overflowed the context window and OOM-crashed LM Studio, forcing careful tuning of how much web context gets injected per query.
The v2.8 Redesign
Retrieval flips: web-first, KB as fallback
This was the biggest architectural change since the blueprint. The original design retrieved from the local LanceDB knowledge base first. The problem was mundane and fatal: the knowledge base started empty. An empty KB meant a zero retrieval score, which meant zero confidence, which meant every single query escalated all the way to the Enterprise Architect for nothing.
So the pipeline flipped. The orchestrator now scrapes trusted vendor documentation before it touches the KB, and only falls back to local documents when the web returns less than about 1,200 characters of usable text.
The v2.8 retrieval path — web-first against a vendor allowlist, with the knowledge base merging in only when the web comes back thin.
- A curated vendor-domain allowlist. Results are restricted to Microsoft Learn, AWS, Google Cloud, Cisco, Palo Alto, NIST and others, with suffix-safe matching that blocks spoofs like
cisco.com.evil.com. - Per-agent ranking.
AGENT_DOMAINSranks each specialist's preferred docs first, so the AWS agent leans on AWS documentation before anything else. - Source-agnostic results. Web hits are reshaped to look exactly like KB results (
source / url / text / score), so the confidence math works identically on either. - Rich metadata. A schema of
domain,vendor,document_typeandversion_datetravels with every chunk into the prompt.
Trust, Computed
Confidence is math, not vibes
In v2.6 the model ended each answer with Confidence: 0.92 and the orchestrator believed it. The trouble is that a model confidently answering an AWS question with Azure facts will happily rate itself 0.92 too. Self-report is theatre. So confidence became a number the system computes, before and after generation:
confidence = min(1.0, best_pre_gen + 0.2 · citation_density)
- retrieval_score — the quality of the retrieved documents. With an empty KB it is derived from web hits: 0.85 for preferred vendor domains, 0.70 otherwise.
- namespace_overlap — does the query hit the agent's keywords? A strong match scores about 0.85; off-topic collapses to 0.1, all but guaranteeing escalation.
- citation_density — the share of claims backed by
[1] [2]sources, rewarding grounded answers with up to a 0.2 boost.
Below the 0.7 threshold, the query climbs to the parent agent, which re-retrieves against its own namespace and re-answers. One subtle fix mattered here: the strongest _best_pre_gen score is carried up the escalation chain, so a confident child's score is never erased by a weaker parent re-running the same step. The model no longer judges itself — the architecture does.
The Undergrowth
The bugs that shaped the design
Most of the architecture above exists because something broke first. The marquee disaster was the 0% confidence saga — a cluster of unrelated failures that all produced the same symptom: every query inexplicably crashing to zero confidence and escalating to the top of the tree.
- Silent search failure.
duckduckgo_searchwas renamed toddgsupstream and returned an empty list. A catch-allexceptswallowed the error, so retrieval quietly went to zero. - Empty-KB zero score. With no documents ingested, the retrieval score defaulted to 0.0 — now derived from web-result quality instead.
- Overwritten scores. Each escalation re-ran
step_buildand erased the child's strong score, until_best_pre_genwas preserved across the chain. - Dead Tier-2 regex. Double-escaped
\\bmade domain routing rules unreachable, and malformed YAML broke parsing entirely. - Context overflow. The expanded universal design template pushed prompts past the local model's context limit — a 400 error until web-context sizes were tuned back down.
- A frozen UI on long jobs. Design-doc generation runs for minutes and the UI silently froze, until streaming feedback and a 10-minute timeout were added.
A single catch-all except turned a renamed package into an invisible, week-long confidence collapse. Fail loud, not silent.
The Working Contract
How the build stays honest
With Git tracking disabled, discipline had to live somewhere. It lives in two places. The first is a hand-maintained CHANGELOG.md that records every change, reversal and reason. The second is a behavioural contract the assistant itself follows when modifying the code — think before acting, make the smallest possible diff, verify against success criteria, keep everything auditable and reversible, and prefer less code over more. It reads less like an engineering process and more like the Law of the Jungle: a few rules everyone keeps, because the alternative is chaos.
See it running — the screenshots
Nine captioned frames from the live system: the Gemma model resident in LM Studio, the Gradio chat generating a Microsoft 365 Copilot design document at 90% confidence with live web search, the namespace-per-domain knowledge base, and the design template behind it all.
Open the screenshot gallery →The Map So Far
Six releases, one expedition
Base system
Routing, RAG, agents and escalation — the first end-to-end build.
Grounded confidence
Dropped LLM self-report for a math formula; added web-search fallback and Mermaid diagrams in the UI.
Agent reshuffle
Retired the Intune, AVD, Citrix and Virtualization agents; added Microsoft DWP, End-User Virt and the network specialists.
Routing repairs
Fixed dead regex rules, corrupted YAMLs, and a hardcoded-path split that loaded stale manifests.
Web-first retrieval
Scrape vendor docs before the KB; vendor allowlist; KB fallback merge.
Confidence fixes
Solved 0% confidence on design docs; preserved the best score across the escalation chain.
Clarifying questions
Before generating a design, agents now ask two to four targeted questions — org size, compliance, stack, timeline — and fold the answers into a far more specific result.
Where We Are · What's Next
Wired, working, and honest
v2.8.2 is a working, daily-use system. The full pipeline runs route → retrieve → build → generate → score → escalate → audit, all thirteen manifests parse, routing and parent maps are consistent, and live vendor-doc search returns current, citable content. What's still open is honest too: the KB folders are wired but largely empty and need real source documents ingested; a rule-ordering overlap can still misroute shared virtualization keywords between End-User Virt and DWP; Git is off; and the context budget stays tight on the local model for large templates.
The road ahead, in order: populate the knowledge base so RAG augments rather than just the web; resolve the routing overlaps and settle End-User-Virt-versus-DWP ownership; re-enable Git and move off the manual changelog; then build a query test-suite to calibrate the confidence threshold against measured answer quality.
Lessons from the Trail
What the journey taught
- Fail loud, not silent. A catch-all
exceptturned a renamed package into invisible 0% confidence. Surface errors — never swallow them. - Measure what you trust. Self-reported confidence is theatre. Grounding trust in retrieval and citations made escalation actually mean something.
- Narrow beats broad. Specialists with tight domains hallucinate far less than one generalist trying to know everything.
- Write the changelog. With Git off, the manual
CHANGELOG.mdbecame the single source of truth for every decision and fix.
Aether began as a single sentence — “one model can't know everything.” It grew into a hierarchy of grounded, self-aware experts, and the changelog is the proof of the journey. Specialise · Ground · Measure · Escalate.