When AI Agents Go Wrong — and How to Engineer Ones That Don't


Most of the AI conversation right now is about capability — what the next model can do. This project made me sit with the opposite question: what happens when these systems are trusted, handed real decisions, and then get it wrong? That is the uncomfortable, less glamorous half of building with AI, and it is exactly where “responsible AI” stops being a slogan and starts being engineering.

The exercise had two halves. First, take a real-world AI failure apart and explain why it failed — not just that it did. Second, flip from critic to designer: pick a domain I know, imagine an AI agent operating in it, and design the guardrails that would keep it safe. Below is the thinking behind both, plus the two case studies and two domains I worked through.

What this exercise is actually teaching

Strip away the assignment framing and there are four skills underneath it:

Explain how and why AI systems fail, using evidence rather than vibes. “It was biased” is a conclusion, not an analysis. The interesting part is the mechanism.
Connect failures to ethics — fairness, accountability, transparency, safety. A technical bug becomes an ethical problem the moment it touches a real person.
Propose realistic safeguards, not “be careful” platitudes. Audits, human review gates, logging, escalation paths — things you could actually ship.
Balance autonomy against control. An agent that asks permission for everything is useless; one that asks for nothing is dangerous. Good design is about putting the human in the loop at the right moments.

That last point is the heart of it. Every safeguard is really a decision about where autonomy ends and oversight begins.
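
One way to make that boundary concrete is to route each proposed action by its stakes. The sketch below is mine and purely illustrative: the `Risk` tiers, the `ProposedAction` type, the `route` function, and the 0.8 confidence threshold are all assumptions, not part of any particular agent framework.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = 1     # e.g. summarising a document
    MEDIUM = 2  # e.g. drafting a message a human will send
    HIGH = 3    # e.g. anything touching money, care, or liberty


@dataclass
class ProposedAction:
    description: str
    risk: Risk
    confidence: float  # agent's self-reported confidence in [0, 1]


def route(action: ProposedAction, min_confidence: float = 0.8) -> str:
    """Return who decides: 'auto', 'human_review', or 'human_only'."""
    if action.risk is Risk.HIGH:
        return "human_only"    # the agent may suggest, never act
    if action.risk is Risk.MEDIUM or action.confidence < min_confidence:
        return "human_review"  # a person approves before it takes effect
    return "auto"              # low stakes and confident: proceed, but log it


print(route(ProposedAction("summarise meeting notes", Risk.LOW, 0.95)))
print(route(ProposedAction("issue a refund", Risk.HIGH, 0.99)))
```

The design choice worth noticing is that high-stakes actions ignore confidence entirely. A confident agent is not the same thing as a validated one.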

Every guardrail in this post is really a choice about where on this dial an agent should sit — and the right answer changes with the stakes of each decision.

Part 1 — Reading the autopsy of a failure

I looked at two systems that fail in completely different ways. One is a bias problem baked into the data; the other is a hallucination and accountability problem baked into deployment. Putting them side by side is the most useful thing I took from this.

Case A — COMPAS: bias that hides inside “neutral” math

COMPAS is a risk-assessment tool used in US courts to score how likely a defendant is to reoffend. Judges used those scores to help inform bail and sentencing. In 2016, ProPublica analysed more than 7,000 cases in Broward County, Florida, and found something damning: among defendants who did not go on to reoffend, Black defendants were flagged “high risk” at roughly twice the rate of white defendants (about 45% versus 23%). The errors weren’t random — they leaned in one direction.

Here’s the part that took me a moment to appreciate. Race was never an input. The model didn’t need it. It learned from historical criminal-justice data shaped by biased policing, and its questionnaire leaned on proxies — prior arrests, employment, neighbourhood, family history — that quietly correlate with race. The bias didn’t enter through a checkbox; it seeped in through the data.

The model never sees race — but it sees features that stand in for it. Bias laundered through "neutral" inputs is still bias.

And the fairness argument has a genuinely hard core. The vendor (Northpointe) responded that the tool was calibrated: a given score meant the same probability of reoffending regardless of race, which was true. The catch is mathematical: when the base rates differ between groups, an imperfect predictor cannot be equally calibrated and also have equal false-positive and false-negative rates across those groups. The two sides were optimising for different definitions of “fair,” and both were partly right. That is the lesson: “fair” is not one thing, and choosing which fairness to enforce is an ethical decision you can’t dodge with more math.
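
The tension can be checked with arithmetic. For a binary "high risk" flag, the false-positive rate follows directly from the base rate, the positive predictive value (PPV, a calibration-style measure for a yes/no flag), and the false-negative rate (FNR). The sketch below uses made-up numbers for two hypothetical groups, not COMPAS figures, and the function name is my own.

```python
def false_positive_rate(base_rate: float, ppv: float, fnr: float) -> float:
    """FPR implied by base rate, PPV and FNR for a binary classifier.

    Derived from the definitions, per unit of population:
        TP  = (1 - FNR) * base_rate
        FP  = TP * (1 - PPV) / PPV
        FPR = FP / (1 - base_rate)
    """
    if not (0 < base_rate < 1 and 0 < ppv <= 1 and 0 <= fnr <= 1):
        raise ValueError("rates must be valid probabilities")
    true_pos = (1 - fnr) * base_rate
    false_pos = true_pos * (1 - ppv) / ppv
    return false_pos / (1 - base_rate)


# Hypothetical groups: identical PPV and FNR, different base rates.
PPV, FNR = 0.6, 0.3
fpr_a = false_positive_rate(base_rate=0.5, ppv=PPV, fnr=FNR)
fpr_b = false_positive_rate(base_rate=0.4, ppv=PPV, fnr=FNR)
print(f"group A FPR: {fpr_a:.3f}")
print(f"group B FPR: {fpr_b:.3f}")
```

With PPV and FNR held equal, the group with the higher base rate necessarily gets the higher false-positive rate. Equalising the false-positive rates would mean giving up equality on one of the other two.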

The deeper failures were organisational. COMPAS was proprietary — a black box defendants couldn’t inspect or contest — and it was deployed into life-altering decisions without independent audits for subgroup fairness.
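
A subgroup audit of the kind that was missing does not have to be elaborate. Here is a minimal sketch, assuming you have outcome records with a group label, the tool's flag, and what actually happened. The `Outcome` type, the function names, the toy records, and the 0.05 tolerance are illustrative assumptions, not ProPublica's methodology.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Outcome(NamedTuple):
    group: str
    flagged_high_risk: bool
    reoffended: bool


def false_positive_rate_by_group(records: Iterable[Outcome]) -> dict[str, float]:
    """Share of people who did NOT reoffend but were flagged high risk, per group."""
    flagged = defaultdict(int)
    negatives = defaultdict(int)
    for r in records:
        if not r.reoffended:
            negatives[r.group] += 1
            if r.flagged_high_risk:
                flagged[r.group] += 1
    return {group: flagged[group] / n for group, n in negatives.items()}


def passes_audit(records: Iterable[Outcome], max_gap: float = 0.05) -> bool:
    """Pass only if the largest FPR gap between groups is within max_gap."""
    rates = false_positive_rate_by_group(records)
    if not rates:
        return True
    return max(rates.values()) - min(rates.values()) <= max_gap


# Toy data, invented for illustration only.
toy = (
    [Outcome("A", True, False)] * 2 + [Outcome("A", False, False)] * 2
    + [Outcome("B", True, False)] * 1 + [Outcome("B", False, False)] * 3
    + [Outcome("A", True, True), Outcome("B", True, True)]
)
print(false_positive_rate_by_group(toy))
print(passes_audit(toy))
```

Which metric the audit enforces is itself the ethical choice described above. This sketch checks false-positive rates only because that is the disparity ProPublica reported.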

Source: ProPublica, “Machine Bias” (Angwin et al., 2016).

Case B — Air Canada’s chatbot: a confident, costly wrong answer

The second case is more recent and, honestly, more relatable. In late 2022, Jake Moffatt used Air Canada’s website chatbot after his grandmother died, to check the airline’s bereavement-fare policy. The bot told him, confidently, that he could book now and claim the bereavement discount retroactively within 90 days. That was simply false — Air Canada’s real policy didn’t allow retroactive claims, and the bot even contradicted the airline’s own linked policy page.

When Moffatt tried to claim the refund the bot had promised, Air Canada refused, and then argued in tribunal that it shouldn’t be liable for what the chatbot said. The tribunal summarised that position as treating the chatbot as “a separate legal entity that is responsible for its own actions” and rejected it flatly: a company is responsible for all the information on its website, whether it comes from a static page or a bot. Moffatt was awarded damages.

A single ungrounded output, deployed in front of customers with no monitoring and no clear owner — and an accountability dodge the tribunal refused to accept.

What makes this a great teaching case isn’t the money (about CA$800). It’s the accountability move. The instinct to treat the AI as a third party you can blame is exactly the failure mode responsible-AI governance exists to prevent. Technically, the system generated an unverified answer that wasn’t grounded in the authoritative policy. Organisationally, it was put in front of customers on high-stakes questions with no guardrails, no monitoring, and no clear owner — a “deploy and forget” posture.
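
To show what "grounded in the authoritative policy" could mean in practice, here is a deliberately simple sketch: the bot may only quote stored policy text, and anything it cannot match goes to a human. Everything here is hypothetical. The `POLICY` store holds placeholder strings rather than any airline's real wording, and the keyword lookup stands in for whatever retrieval a production system would use. It is not a description of how Air Canada's system worked.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy store: topic keyword -> authoritative policy text.
POLICY = {
    "bereavement": "Placeholder: the authoritative bereavement policy text goes here.",
    "baggage": "Placeholder: the authoritative baggage policy text goes here.",
}


@dataclass
class Reply:
    text: str
    source: Optional[str]  # policy key the text was copied from, if any
    escalated: bool


def grounded_reply(question: str) -> Reply:
    """Answer only by quoting policy text; otherwise hand off to a human."""
    q = question.lower()
    matches = [topic for topic in POLICY if topic in q]
    if len(matches) == 1:
        topic = matches[0]
        return Reply(text=POLICY[topic], source=topic, escalated=False)
    # No match, or an ambiguous one: do not improvise an answer.
    return Reply(
        text="I can't confirm that. I'm passing you to an agent.",
        source=None,
        escalated=True,
    )


print(grounded_reply("What is your bereavement fare policy?"))
print(grounded_reply("Can I bring my cat on board?"))
```

The point of the `source` field is ownership: every answer can be traced to a document someone is responsible for keeping correct.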

Source: Moffatt v. Air Canada, 2024 BCCRT 149.

The pattern across both

COMPAS fails quietly and systematically through data; Air Canada fails loudly and individually through a single bad output. But the root causes rhyme: a system trusted beyond what it was validated for, no meaningful oversight, and unclear accountability when it broke. Bias and hallucination look different on the surface and share the same governance gap underneath.

Bias and hallucination are different symptoms of the same disease: a governance gap, not just a model bug.

Part 2 — Switching seats: designing the guardrails

Critiquing failures is the easy half. The harder, more honest half is designing an agent that wouldn’t fail the same way. I framed both designs around the same three safeguard categories — Data Privacy, Content Safety, Operational Oversight — because that structure forces you to cover the three places agents usually go wrong: the data going in, the content coming out, and the humans watching over the whole thing.

Same skeleton every time: lock down the inputs, constrain the outputs, and wrap a human-run oversight layer around the whole thing. What changes is the detail inside each box.

Domain 1 — Healthcare: a clinical decision-support chatbot

The use case: an agent inside a hospital’s records system that helps clinicians (not patients) by summarising a patient’s history and suggesting possible differential diagnoses and relevant guidelines. Crucially, it only suggests. It never diagnoses, prescribes, or talks to patients on its own. Defining what it can’t do is half the safety work.

Data Privacy: patient data is about as sensitive as it gets, so the agent runs in a HIPAA-compliant environment, masks identifiers before processing, pulls only the fields a query needs, and uses role-based access so a clinician can only see their own patients. Every access is logged.
Content Safety: the real danger is a confident, wrong clinical suggestion. So the agent is constrained to cite evidence-based guidelines, must surface its uncertainty, refuses high-risk questions like paediatric dosing (escalating to a pharmacist instead), and labels every output “decision support, not a diagnosis.”
Operational Oversight: a licensed clinician reviews and approves anything before it touches care, every recommendation is logged for traceability, accuracy is monitored continuously, and there’s a kill-switch to pull the tool if error rates spike.
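
The Data Privacy row above can be sketched in a few lines: check the role first, log every attempt, return only the requested fields, and mask identifiers. The `RecordStore` class, the `IDENTIFIERS` list, and the field names are assumptions for illustration. This is not a HIPAA de-identification procedure, just the shape of the control.

```python
from dataclasses import dataclass, field

# Illustrative identifier fields; a real list would be far longer.
IDENTIFIERS = {"name", "date_of_birth", "address", "phone"}


@dataclass
class RecordStore:
    records: dict    # patient_id -> dict of fields
    care_team: dict  # clinician_id -> set of patient_ids
    access_log: list = field(default_factory=list)

    def fetch(self, clinician_id: str, patient_id: str, fields: set) -> dict:
        allowed = patient_id in self.care_team.get(clinician_id, set())
        # Log every attempt, including denied ones.
        self.access_log.append((clinician_id, patient_id, sorted(fields), allowed))
        if not allowed:
            raise PermissionError("clinician is not on this patient's care team")
        record = self.records[patient_id]
        # Minimise: only requested fields. Mask: identifiers never leave the store.
        return {
            key: ("[MASKED]" if key in IDENTIFIERS else record[key])
            for key in fields
            if key in record
        }


store = RecordStore(
    records={"p1": {"name": "Example Patient", "allergies": "penicillin", "notes": "n/a"}},
    care_team={"c1": {"p1"}},
)
print(store.fetch("c1", "p1", {"name", "allergies"}))
```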

The thread running through it: the agent assists, the clinician decides. Autonomy is deliberately capped below the point where a wrong answer could act on its own.
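
The kill-switch listed under Operational Oversight is the easiest piece to leave as a bullet point and never build, so here is a minimal sketch of one. It assumes clinicians mark each reviewed suggestion as right or wrong. The class name, window size, and thresholds are my own placeholders and would need tuning against real review data.

```python
from collections import deque


class KillSwitch:
    """Disable the tool when the recent error rate exceeds a threshold."""

    def __init__(self, window: int = 50, max_error_rate: float = 0.1,
                 min_samples: int = 10):
        # True means a clinician marked the suggestion as wrong.
        self.outcomes = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.enabled = True

    def record(self, was_error: bool) -> None:
        self.outcomes.append(was_error)
        if len(self.outcomes) >= self.min_samples:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.max_error_rate:
                self.enabled = False  # stays off until a human re-enables it

    def reset(self) -> None:
        """Manual re-enable after a human has reviewed what went wrong."""
        self.outcomes.clear()
        self.enabled = True


switch = KillSwitch(window=10, max_error_rate=0.2, min_samples=5)
for was_error in [False, False, False, False, True, True]:
    switch.record(was_error)
print(switch.enabled)
```

The switch only trips automatically; turning the tool back on is a human decision, which keeps the asymmetry the design is aiming for.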

Domain 2 — Education: an AI teaching assistant

The use case: an agent in a college’s learning platform that helps students with course material — explaining concepts, unpacking feedback, pointing to readings, generating practice problems. It supports learning; it does not grade official work or write the assignments students submit.

Data Privacy: student records are FERPA-protected, so the same discipline applies: minimise data, mask identifiers, use role-based access so a student sees only their own data, keep audit logs, and never use student conversations to train external models without disclosure.
Content Safety: here “unsafe” has a twist — the danger isn’t just false info, it’s doing the work for the student. So the agent scaffolds and hints rather than handing over finished answers on graded work, refuses to write submittable assignments, cites course materials instead of inventing them, and routes sensitive disclosures (self-harm, harassment) to human support.
Operational Oversight: instructors configure and review how it’s used, interactions are logged for academic-integrity checks, accuracy and flagged conversations are monitored, and there’s an escalation path to a human plus a disable switch.
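
The Content Safety row above amounts to an ordering of checks, which a small router makes explicit. The `Route` names and the keyword lists are hypothetical stand-ins. Real detection of sensitive disclosures needs far more than substring matching, so treat this as the control flow only.

```python
from enum import Enum


class Route(Enum):
    HUMAN_SUPPORT = "human_support"  # sensitive disclosure: hand to a person
    REFUSE = "refuse"                # would produce submittable work
    HINT = "hint"                    # graded work: scaffold, don't solve
    EXPLAIN = "explain"              # ungraded: a full explanation is fine


# Illustrative keyword lists only.
SENSITIVE_TERMS = ("self-harm", "hurt myself", "harass")
WRITE_FOR_ME_TERMS = ("write my essay", "write my assignment", "do my homework")


def route_request(message: str, is_graded_work: bool) -> Route:
    text = message.lower()
    # The safety check comes first so no other rule can mask it.
    if any(term in text for term in SENSITIVE_TERMS):
        return Route.HUMAN_SUPPORT
    if any(term in text for term in WRITE_FOR_ME_TERMS):
        return Route.REFUSE
    return Route.HINT if is_graded_work else Route.EXPLAIN


print(route_request("Can you explain recursion?", is_graded_work=False))
print(route_request("Please write my essay on recursion", is_graded_work=True))
```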

Notice how the same three-category skeleton produces different specifics once you take the domain’s real risks seriously. In healthcare the nightmare is a wrong diagnosis; in education it’s eroding academic integrity. The structure is portable; the judgement is not.

The three rows never change. The cells do — because in healthcare the nightmare is a wrong diagnosis, and in education it's a student who never actually learned.

What I’m taking away

Three things stuck with me:

The failure is rarely the model alone. In both cases, the technical fault was amplified by an organisational gap: no audit, no human gate, no clear owner. Governance is not paperwork wrapped around the AI; it is the safety system.
“Fair” and “safe” require you to choose. COMPAS showed that you sometimes can’t satisfy every definition of fairness at once. Pretending otherwise is how you end up shipping the bias.
Good safeguards are boring on purpose. Logging, escalation, human review, kill-switches, scope limits. None of it is exciting. All of it is what stands between a useful agent and a headline.

The capability race will keep accelerating. The quieter discipline — deciding where autonomy ends and accountability begins — is the part that decides whether any of it can be trusted.