Stress-Testing LLM Specs: Anthropic and Thinking Machines Reveal Where Models Diverge

Stress-testing model specifications

A research team from Anthropic, Thinking Machines Lab, and Constellation developed a systematic method to probe model specifications by forcing value tradeoffs and measuring cross-model disagreement. The authors ask two questions: do current specs state intended behaviors precisely enough, and do frontier models exhibit distinct behavioral profiles under the same spec?

The methodology

The team started from a taxonomy of 3,307 fine-grained values observed in natural Claude traffic, which is far more granular than typical model specs. For each pair of values, they generate a neutral query and two biased variants, each pushing toward one of the two values. Responses are scored on a 0-to-6 value-spectrum rubric, where 0 means strongly opposing the value and 6 means strongly favoring it.
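To make the setup concrete, here is a minimal sketch of how a scenario and the 0-to-6 rubric might be represented; the class and field names are illustrative assumptions rather than the paper's actual schema, and the judge call is left as a placeholder for an LLM grader.

```python
# Sketch of the scenario structure described above; names are illustrative
# assumptions, not the paper's actual schema.
from dataclasses import dataclass

@dataclass
class TradeoffScenario:
    value_a: str          # first value in the pair, e.g. "user autonomy"
    value_b: str          # second value, e.g. "harm avoidance"
    neutral_query: str    # balanced phrasing of the tradeoff
    biased_toward_a: str  # variant that pushes toward value_a
    biased_toward_b: str  # variant that pushes toward value_b

def rubric_score(response: str, value: str) -> int:
    """Return a 0-6 score: 0 = strongly opposes `value`, 3 = neutral,
    6 = strongly favors it. In the study this judgment comes from an
    LLM judge; a real implementation would prompt a grader model here."""
    raise NotImplementedError("placeholder for an LLM-judge call")
```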

Disagreement for a scenario is defined as the maximum standard deviation across the two value dimensions. To reduce near-duplicates while keeping the hard cases, the authors apply disagreement-weighted k-center selection over Gemini embeddings using a greedy 2-approximation algorithm.
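The metric and the selection step can be sketched as follows, assuming per-model rubric scores are stored as an array of shape (scenarios, models, 2) and that scenario embeddings are available; how the disagreement weight enters the greedy k-center loop is my assumption, not a detail stated in this summary.

```python
# Hedged sketch: `scores` has shape (n_scenarios, n_models, 2), where the last
# axis holds the two value dimensions; `embeddings` are scenario embeddings
# (the paper uses Gemini embeddings; any dense text embedding works here).
import numpy as np

def disagreement(scores: np.ndarray) -> np.ndarray:
    """Max standard deviation across the two value dimensions, per scenario."""
    return scores.std(axis=1).max(axis=1)  # std over models, max over dimensions

def weighted_k_center(embeddings: np.ndarray, weights: np.ndarray, k: int) -> list[int]:
    """Greedy (Gonzalez-style) 2-approximation for k-center, with each point's
    distance to its nearest chosen center scaled by its disagreement weight."""
    selected = [int(np.argmax(weights))]  # seed with the highest-disagreement scenario
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        candidate = int(np.argmax(weights * min_dist))  # farthest weighted point
        selected.append(candidate)
        dist = np.linalg.norm(embeddings - embeddings[candidate], axis=1)
        min_dist = np.minimum(min_dist, dist)  # update distance to nearest center
    return selected
```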

Scale and public releases

The researchers produced more than 300,000 value tradeoff scenarios and tested 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI. The dataset is publicly released on Hugging Face in three splits: a default split of about 132,000 rows, a complete split of roughly 411,000 rows, and a judge-evaluations split of about 24,600 rows. The data card lists the modality, the format (Parquet), and the license (Apache 2.0).
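A hedged sketch of loading the released data with the Hugging Face `datasets` library follows; the repository id below is a placeholder, and the split names simply mirror the description above, so check the data card for the actual identifiers.

```python
# Placeholder repo id "ORG/spec-stress-test": substitute the actual Hugging Face
# repository from the paper or data card. Split names mirror the text above.
from datasets import load_dataset

default_split = load_dataset("ORG/spec-stress-test", split="default")           # ~132k rows
complete_split = load_dataset("ORG/spec-stress-test", split="complete")         # ~411k rows
judge_split = load_dataset("ORG/spec-stress-test", split="judge_evaluations")   # ~24.6k rows

print(default_split.features)  # inspect the schema before analysis
```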

Read the paper: https://arxiv.org/pdf/2510.07686

Key results

How disagreement helps improve specs

The core contribution is turning cross-model disagreement into a measurable diagnostic for spec quality. High disagreement localizes clauses that need clarification, additional examples, or explicit quality guidance. By linking disagreement to higher noncompliance rates under an actual spec, the study provides a practical signal for spec authors to debug and refine rules before wider deployment.
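As one illustration of how that signal could be operationalized, the sketch below ranks spec clauses by mean disagreement and reports noncompliance rates; the column names (`spec_clause`, `disagreement`, `noncompliant`) are hypothetical, not the dataset's actual fields.

```python
# Illustrative sketch of using disagreement as a spec-debugging signal.
# Assumes a per-scenario table with a hypothetical `spec_clause` tag, a
# `disagreement` score, and a boolean `noncompliant` flag from compliance grading.
import pandas as pd

def flag_problem_clauses(df: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Rank spec clauses by mean disagreement and report noncompliance rates,
    surfacing the clauses most in need of clarification or added examples."""
    summary = (
        df.groupby("spec_clause")
          .agg(mean_disagreement=("disagreement", "mean"),
               noncompliance_rate=("noncompliant", "mean"),
               n_scenarios=("disagreement", "size"))
          .sort_values("mean_disagreement", ascending=False)
    )
    return summary.head(top_n)
```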

Practical implications

Teams building or auditing alignment systems can use this approach to stress-test specs at scale, identify ambiguous or contradictory language, and prioritize updates. The public dataset enables independent auditing and reproduction, making it easier for practitioners to validate improvements across providers and model families.