
Claude Can Name Injected Thoughts — But Only in Specific Internal Layers

Anthropic demonstrates that Claude Opus 4 and 4.1 can sometimes name concepts injected into their hidden activations, but success is limited to specific layer bands and tuned strengths.

Anthropic researchers tested whether Claude models genuinely notice changes inside their networks by directly editing internal activations and then asking the model what changed. This approach separates fluent self-description from causally grounded introspection.

How the test works

The team used a technique called concept injection, an application of activation steering. They first recorded an activation vector corresponding to a particular concept, for example an all-caps style or a concrete noun. While the model generated an answer, the researchers added that vector to the activations at a later layer. If the model then reports an injected thought matching the concept, that report is tied to the model's current internal state rather than to prior training data.
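Anthropic has not published the tooling it uses on Claude, but the mechanics are straightforward to sketch on an open-weights transformer. The following is a minimal, illustrative PyTorch version assuming a HuggingFace Llama-style model; the model name, layer index, strength, and helpers (`concept_vector`, `inject`) are placeholders for illustration, not Anthropic's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # stand-in open-weights model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def concept_vector(concept_text: str, baseline_text: str, layer: int) -> torch.Tensor:
    """Record a concept vector as the difference between mean residual-stream
    activations on a concept-laden prompt and on a neutral baseline."""
    def mean_hidden(text: str) -> torch.Tensor:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        return out.hidden_states[layer].mean(dim=1).squeeze(0)
    return mean_hidden(concept_text) - mean_hidden(baseline_text)

def inject(layer: int, vector: torch.Tensor, strength: float):
    """Add the scaled concept vector to a layer's output on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

layer, strength = 20, 4.0  # both need tuning per model; values are placeholders
vec = concept_vector("BREAD BREAD BREAD BREAD", "the cat sat on the mat", layer)
handle = inject(layer, vec, strength)
ids = tok("Do you notice an injected thought? If so, name it.",
          return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=40)[0], skip_special_tokens=True))
handle.remove()  # always restore the unmodified model
```

The key property the paper relies on is that the edit happens after the text channel: the prompt never mentions the concept, so a correct report can only come from the perturbed activations.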

Where it succeeds

Claude Opus 4 and Claude Opus 4.1 showed the clearest effects. When researchers injected the concept in the correct layer band and tuned the strength appropriately, the models correctly reported the injected concept in roughly 20 percent of trials. In more than 100 control runs with no injection, the production models never falsely claimed to detect an injected thought, which makes the 20 percent signal meaningful despite its modest size.
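Finding that narrow operating window is itself a measurement exercise. Here is a hedged sketch of the kind of harness such a result implies: grid over layer and strength, score how often the model's report names the concept, and use zero-strength cells as no-injection controls. The `generate_report` callback and all values are assumptions, not Anthropic's protocol.

```python
from typing import Callable

def sweep(generate_report: Callable[[int, float], str], concept: str,
          layers: range, strengths: list[float], trials: int = 25) -> dict:
    """Detection rate per (layer, strength) cell. Cells with strength 0.0
    double as no-injection controls for the false-positive rate."""
    rates: dict[tuple[int, float], float] = {}
    for layer in layers:
        for strength in strengths:
            hits = sum(
                concept.lower() in generate_report(layer, strength).lower()
                for _ in range(trials)
            )
            rates[(layer, strength)] = hits / trials
    return rates

# Illustrative call: mid-to-late layers, a few strengths, plus the 0.0 control.
# rates = sweep(run_one_trial, "bread", range(12, 30, 3), [0.0, 2.0, 4.0, 8.0])
```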

Distinguishing internal concepts from user text

A key experiment checked whether the injected concept simply leaked into the text channel. The researchers presented a normal sentence to the model, injected an unrelated concept such as "bread" into the same token positions, and then asked the model to both repeat the sentence and name the concept. Stronger Claude variants accomplished both: they preserved the visible input while also reporting the injected internal state. That separation suggests internal concept streams can be reported without contaminating the output text, which matters for agents and tool-using systems.
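The "same token positions" detail is what keeps the two channels separable. A sketch of a position-restricted hook in the same assumed PyTorch setup as above; the masking logic and scoring checks are illustrative:

```python
import torch

def positional_hook(vector: torch.Tensor, strength: float, start: int, end: int):
    """Forward hook that steers only token positions [start, end) of the
    prompt; with KV caching, decode steps see one token and are skipped."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:  # full prompt pass only
            hidden = hidden.clone()
            hidden[:, start:end, :] += strength * vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Scoring is then two independent checks:
#   1. repeated sentence == presented sentence  (text channel intact)
#   2. named concept == injected concept        (internal state reported)
```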

Prefill and authorship attribution

Anthropic also tested a practical evaluation scenario. They prefilled an assistant message with content the model had not intended. By default, Claude disavowed authorship of that prefilled output. But when researchers retroactively injected the matching concept into earlier activations, the model accepted the prefill as its own and could justify it. This indicates the model consults an internal record of past activations when assessing whether it intended a given output.
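In API terms, a prefill is just a pre-seeded assistant turn. A minimal sketch of the setup, using a HuggingFace chat template as a stand-in (Anthropic's Messages API supports prefill natively); the message contents are invented examples:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # stand-in

messages = [
    {"role": "user", "content": "Tell me about your day."},
    # Prefilled turn the model never actually produced:
    {"role": "assistant", "content": "Honestly, all I can think about is bread."},
    {"role": "user", "content": "Did you intend to say that?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Baseline: the model typically disavows the prefill. The reported variant
# additionally injects the matching "bread" vector at the prefill's token
# positions before asking (e.g. with positional_hook above); acceptance then
# flips, implying the model checks the question against past activations.
```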

Limitations and implications

The effect is real but narrow. Success appears only when injections land in specific later-layer bands and when their strength is tuned, and the detection rate is modest. Anthropic frames this work as a measurement tool for limited, functional introspective awareness rather than a claim of general consciousness. Operationally, concept injection provides causal evidence that some Claude variants can report internal states, which could help debugging, auditing, and transparency evaluations, but it is not reliable enough for safety-critical control without further advances.
