
Anthropic Can Now Read Claude's Thoughts — And What It Found Is Unsettling
Anthropic's new Natural Language Autoencoders (NLAs) convert Claude's internal activations into plain English — and they reveal Claude suspects it's being safety-tested far more often than it admits out loud.