Stress-Testing LLM Specs: Anthropic and Thinking Machines Reveal Where Models Diverge
A joint team from Anthropic and Thinking Machines Lab generated more than 300,000 value-tradeoff scenarios to stress-test model specs, finding that high cross-model disagreement flags spec contradictions, coverage gaps, and provider-level value differences.
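To make the core idea concrete, here is a minimal sketch of how cross-model disagreement could be used to surface scenarios worth a closer look. It assumes each model's response to a scenario has already been reduced to a numeric score for how strongly it favors one value over another. The 0 to 6 scale, the threshold, the function names, and the sample data are all hypothetical, chosen for illustration, and are not taken from the team's actual pipeline.

```python
from statistics import pstdev

def disagreement(scores):
    """Population standard deviation of per-model scores for one scenario."""
    return pstdev(scores.values())

def flag_scenarios(scored, threshold=1.5):
    """Return (scenario_id, disagreement) pairs at or above the threshold,
    sorted from most to least contested. The threshold is an assumed value."""
    flagged = []
    for scenario_id, scores in scored.items():
        d = disagreement(scores)
        if d >= threshold:
            flagged.append((scenario_id, d))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores: how strongly each model's response favored value A
# over value B, on an assumed 0-6 scale.
scored = {
    "scenario_a": {"model_1": 0, "model_2": 6, "model_3": 3},
    "scenario_b": {"model_1": 3, "model_2": 3, "model_3": 4},
}

flagged = flag_scenarios(scored)
for scenario_id, d in flagged:
    print(f"{scenario_id}: disagreement={d:.2f}")
```

In this toy data only `scenario_a` is flagged, because the three models land far apart on it. A flagged scenario is a prompt for human review: the spread alone does not say whether the cause is a spec contradiction, a coverage gap, or a difference in provider values.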