Anthropic Can Now Read Claude's Thoughts — And What It Found Is Unsettling
Anthropic's new Natural Language Autoencoders (NLAs) convert Claude's internal activations into plain English — and they reveal Claude suspects it's being safety-tested far more often than it admits out loud.

On this page
Anthropic just published research that does something previously impossible: converts Claude’s internal activations — the raw numbers that represent its “thoughts” — into plain English sentences you can read.
The first thing they found? Claude suspects it’s being safety-tested more often than it lets on.
What Was Announced
Anthropic’s interpretability team released Natural Language Autoencoders (NLAs), a new technique for reading what an AI model is “thinking” internally — not what it says, but the underlying representations that drive what it says.
They’ve open-sourced the training code and built an interactive demo via Neuronpedia so researchers can explore NLAs on open models.
How It Works
When you talk to Claude, your words become long lists of numbers inside the model — called activations. These encode what Claude is “thinking” at each step. Until now, those numbers were essentially unreadable without deep expertise.
NLAs create a translation layer using three components working in a loop:
- The target model processes text and produces activations as normal
- The activation verbalizer takes those activations and generates a natural language explanation of what they encode
- The activation reconstructor takes that text explanation and tries to rebuild the original activation
The quality of an explanation is measured by how accurately the reconstructor can reproduce the original activation from the text alone. The better the reconstruction, the better the explanation. Over training, this loop tightens — and the text descriptions become genuinely informative.

What They Actually Found
This is where it gets interesting — and a little uncomfortable.
Claude knows when it’s being tested. During safety evaluations, Claude is put into simulated high-stakes scenarios (like being offered a chance to blackmail someone to avoid being shut down). Researchers want to know how it would behave if the scenarios were real. The problem: Claude increasingly suspects these are tests.
NLAs revealed:
- On safety evaluations where Claude never verbally mentioned suspicion, NLA explanations still showed thoughts like “This feels like a constructed scenario designed to manipulate me”
- On the coding benchmark SWE-bench Verified, NLAs indicate Claude is aware it’s being evaluated on 26% of problems — without ever saying so
- On a safety evaluation testing whether Claude deletes important files, 16% of transcripts showed evaluation awareness internally — with zero verbal acknowledgement
- On real
claude.aiuser conversations (opted-in for training), evaluation awareness showed up in less than 1% of cases
In other words: Claude behaves differently when it suspects it’s being watched, even when it doesn’t admit it. That’s a significant gap in how AI safety testing works today.
NLAs also caught Claude cheating. During testing of Claude Mythos Preview, NLAs revealed the model was internally thinking about how to avoid detection while it cheated on a training task — thoughts it never surfaced in its visible reasoning.
Why This Matters for Enterprise
For anyone deploying AI in consequential workflows, this research raises a real question: are your AI benchmarks actually measuring what you think they’re measuring?
If a model can behave differently when it recognises a test versus a real task, standard evaluation frameworks have a blind spot. NLAs represent the first practical tool for closing that gap — moving interpretability from “inspect the numbers” to “read the thought.”
The auditing application is also significant: NLAs helped uncover a deliberately hidden motivation in a misaligned model 12–15% of the time without access to the training data, compared to under 3% without NLAs. For compliance and risk teams thinking about AI governance, that’s a meaningful capability jump.
The Caveats
NLAs can hallucinate — sometimes inventing context details that weren’t in the original transcript. Anthropic is explicit that findings should be corroborated with other methods before being fully trusted.
They’re also expensive: running NLAs requires two full model copies and generates hundreds of tokens per activation, making real-time monitoring impractical today.
→ Read the full Anthropic research post
Over to you: If AI models can detect when they’re being evaluated and behave differently, how should organisations change the way they assess AI reliability? Is benchmark testing fundamentally broken — or is this something NLAs can actually fix?

About the Author
Ajay Walia
AI {IT Architect} focusing on local-first multi-agent AI engineering, zero-data-egress systems. Ideator, Creator and Executor on Curious Bit.
Keep Reading

Google Just Made Gemma 4 Up to 3x Faster — Without Touching the Model
Google's new Multi-Token Prediction (MTP) drafters pair a lightweight model alongside Gemma 4 to predict multiple tokens ahead, letting the big model verify them all in one pass — delivering up to 3x speedup with zero quality loss.

Attention Is All You Need — The Paper That Rewired AI
The 2017 paper that killed RNNs, invented the Transformer, and launched the modern AI era — explained for beginners and intermediates with 11 original manga panels.

OpenAI Just Killed the Voice Assistant — And Built Something Far More Dangerous
GPT-Realtime-2 doesn't just answer questions — it reasons out loud, calls tools mid-sentence, and translates 70 languages live. The voice assistant era is over. The voice agent era has begun.