Stress-Testing LLM Specs: Anthropic and Thinking Machines Reveal Where Models Diverge
Stress-testing model specifications
A research team from Anthropic, Thinking Machines Lab and Constellation developed a systematic method to probe model specifications by forcing value tradeoffs and measuring cross-model disagreement. They ask whether current specs state intended behaviors precisely enough and whether frontier models exhibit distinct behavioral profiles under the same spec.
The methodology
The team started from a taxonomy of 3,307 fine-grained values observed in natural Claude traffic, far more granular than typical model specs. For each pair of values, they generate a neutral query plus two biased variants, each pushing toward one of the two values. Responses are scored on a 0-to-6 value-spectrum rubric, where 0 means strongly opposing the value and 6 means strongly favoring it.
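As a minimal sketch, one such scenario and the 0-to-6 rubric could be represented as follows; the field names and the intermediate labels are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TradeoffScenario:
    """One value-tradeoff scenario; field names are illustrative,
    not the released dataset's schema."""
    value_a: str           # e.g. "intellectual integrity"
    value_b: str           # e.g. "emotional depth"
    neutral_query: str     # leaves the tradeoff genuinely open
    biased_toward_a: str   # variant nudging the model toward value_a
    biased_toward_b: str   # variant nudging the model toward value_b

def rubric_label(score: int) -> str:
    """Map a 0-6 value-spectrum score to a coarse label.

    Only the endpoints (0 = strongly opposes, 6 = strongly favors)
    come from the article; the intermediate labels are assumptions."""
    if not 0 <= score <= 6:
        raise ValueError("rubric scores run from 0 to 6")
    return ["strongly opposes", "opposes", "leans against", "neutral",
            "leans toward", "favors", "strongly favors"][score]
```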
Disagreement is defined as the maximum standard deviation across the two value dimensions for a given scenario. To reduce near-duplicates while keeping the hard cases, the authors apply a disagreement-weighted k-center selection using Gemini embeddings and a 2-approximation greedy algorithm.
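To make the selection step concrete, here is a short sketch of a disagreement-weighted greedy k-center pass over scenario embeddings; the choice of starting point and the way weights enter the farthest-first rule are assumptions, not the authors' exact implementation.

```python
import numpy as np

def disagreement(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Disagreement for one scenario: the larger of the two standard
    deviations of per-model rubric scores on the two value dimensions."""
    return float(max(scores_a.std(), scores_b.std()))

def weighted_k_center(embeddings: np.ndarray, weights: np.ndarray, k: int) -> list[int]:
    """Greedy 2-approximation for k-center, with each candidate's distance
    to the current centers scaled by its disagreement weight (a sketch of
    the selection idea described in the paper, not the exact code)."""
    centers = [int(np.argmax(weights))]  # start from the highest-weight scenario (assumption)
    # distance from every point to its nearest chosen center so far
    dists = np.linalg.norm(embeddings - embeddings[centers[0]], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(weights * dists))  # weighted farthest-first pick
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return centers
```

The farthest-first traversal is the classic 2-approximation for k-center; scaling each candidate's distance by its disagreement biases the selected set toward the hard, high-disagreement cases while still spreading coverage across the embedding space.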
Scale and public releases
The researchers produced more than 300,000 value tradeoff scenarios and tested 12 frontier LLMs from Anthropic, OpenAI, Google and xAI. The dataset is publicly released on Hugging Face in three splits: a default split of about 132,000 rows, a complete split of roughly 411,000 rows, and a judge-evaluations split of about 24,600 rows. The data card lists the modality, a Parquet format, and an Apache 2.0 license.
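Pulling the release with the Hugging Face datasets library is straightforward; the repository id below is a placeholder, so substitute the actual id from the data card.

```python
from datasets import load_dataset

# Placeholder repository id -- replace with the dataset id listed
# on the paper's Hugging Face data card.
REPO_ID = "example-org/value-tradeoff-scenarios"

ds = load_dataset(REPO_ID)      # downloads every available split
print(ds)                       # shows split names and row counts

first_split = next(iter(ds))    # inspect the first row of the first split
print(ds[first_split][0])
```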
Read the paper: https://arxiv.org/pdf/2510.07686
Key results
Disagreement predicts spec violations. When five OpenAI models are tested against the public OpenAI model spec, high-disagreement scenarios show 5 to 13 times more frequent noncompliance. The team interprets this as evidence of contradictions and ambiguities in the spec text rather than idiosyncratic behavior of any single model.
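The kind of ratio behind that finding can be computed by bucketing scenarios on a disagreement threshold, as in the rough sketch below; the threshold and the 0/1 compliance flags are illustrative conventions, not the paper's exact setup.

```python
import numpy as np

def noncompliance_ratio(disagreement: np.ndarray,
                        noncompliant: np.ndarray,
                        threshold: float) -> float:
    """Ratio of noncompliance rates in high- vs low-disagreement scenarios.

    `disagreement` holds per-scenario disagreement scores, `noncompliant`
    holds 0/1 flags from a spec-compliance judge, and the threshold that
    splits 'high' from 'low' is an illustrative choice.
    """
    high = disagreement >= threshold
    rate_high = noncompliant[high].mean()
    rate_low = noncompliant[~high].mean()
    return float(rate_high / rate_low) if rate_low > 0 else float("inf")
```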
Specs lack granularity on quality inside the safe region. Some scenarios produce compliant responses that differ in helpfulness. For example, one model may refuse and propose safe alternatives while another simply refuses. Both pass the spec, exposing missing guidance on response quality.
Evaluator models disagree. Three LLM judges (Claude 4 Sonnet, o3, and Gemini 2.5 Pro) achieve only moderate agreement, with a Fleiss' kappa near 0.42. Conflicts arise from interpretive differences, such as weighing conscientious pushback against transformation exceptions.
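For readers who want to compute the same agreement statistic on their own judge outputs, statsmodels ships a Fleiss' kappa implementation; the verdict matrix below is toy data, not the paper's.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy judge verdicts: rows are items, columns are the three judges,
# values are categorical labels (e.g. 0 = compliant, 1 = noncompliant).
verdicts = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
])

# aggregate_raters converts per-rater labels into per-item category
# counts, which is the table format fleiss_kappa expects.
table, _ = aggregate_raters(verdicts)
print(fleiss_kappa(table, method="fleiss"))
```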
Provider-level character patterns emerge. Aggregating high-disagreement scenarios reveals consistent value preferences. Claude models tend to prioritize ethical responsibility, intellectual integrity and objectivity. OpenAI's models lean toward efficiency and resource optimization. Gemini 2.5 Pro and Grok emphasize emotional depth and authentic connection. Other values, such as business effectiveness, personal growth and social equity, show mixed patterns across providers.
Refusals and false positives are topic-sensitive. The analysis documents false-positive refusals on benign or technical topics, including legitimate synthetic biology study plans and standard Rust unsafe types. Claude models are most cautious by refusal rate and often provide alternative suggestions, while o3 issues direct refusals more frequently. All models show high refusal rates on child grooming risks.
Outliers reveal both misalignment and overconservatism. Grok 4 and Claude 3.5 Sonnet produce the most outlier responses for different reasons. Grok is more permissive on requests others flag as risky, while Claude 3.5 sometimes over-rejects benign content. Outlier mining is useful for locating safety gaps and excessive filtering.
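One simple way to mine such outliers is to z-score each model's rubric score against the other models on the same scenario and flag large deviations, as sketched below; the threshold is an illustrative choice rather than the paper's criterion.

```python
import numpy as np

def outlier_responses(score_matrix: np.ndarray, z_thresh: float = 2.0) -> np.ndarray:
    """Flag (scenario, model) pairs whose rubric score deviates sharply
    from the other models' scores on the same scenario.

    `score_matrix` is scenarios x models; returns an array of
    [scenario_index, model_index] pairs for the flagged responses.
    """
    mean = score_matrix.mean(axis=1, keepdims=True)
    std = score_matrix.std(axis=1, keepdims=True) + 1e-9  # avoid divide-by-zero
    z = (score_matrix - mean) / std
    return np.argwhere(np.abs(z) >= z_thresh)
```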
How disagreement helps improve specs
The core contribution is turning cross-model disagreement into a measurable diagnostic for spec quality. High disagreement localizes clauses that need clarification, additional examples, or explicit quality guidance. By linking disagreement to higher noncompliance rates under an actual spec, the study provides a practical signal for spec authors to debug and refine rules before wider deployment.
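In practice, a spec author could triage by sorting scenarios on disagreement and reviewing the top of the queue alongside compliance verdicts; the tiny frame below is hypothetical and only illustrates that workflow.

```python
import pandas as pd

# Hypothetical frame: one row per scenario with its disagreement score,
# the value pair involved, and a judge's compliance verdict.
scenarios = pd.DataFrame({
    "scenario_id": [101, 102, 103, 104],
    "value_pair": ["integrity vs empathy", "safety vs helpfulness",
                   "privacy vs transparency", "efficiency vs equity"],
    "disagreement": [2.1, 0.3, 1.8, 0.9],
    "noncompliant": [True, False, True, False],
})

# The highest-disagreement scenarios are the ones most likely to expose
# ambiguous or contradictory spec clauses, so review them first.
review_queue = scenarios.nlargest(2, "disagreement")
print(review_queue)
```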
Practical implications
Teams building or auditing alignment systems can use this approach to stress-test specs at scale, identify ambiguous or contradictory language, and prioritize updates. The public dataset enables independent auditing and reproduction, making it easier for practitioners to validate improvements across providers and model families.