<RETURN_TO_BASE

Grok 4.1: xAI Boosts Emotional IQ, Cuts Hallucinations and Climbs to the Top

'Grok 4.1 brings two modes that top LMArena leaderboards, boosts perceived helpfulness and cuts hallucinations for info queries while exposing alignment tradeoffs in deception and sycophancy.'

Deployment and live preference gains

Grok 4.1 is now powering Grok across grok.com, X and the mobile apps, rolling out in Auto mode with an explicit 'Grok 4.1' option in the model picker. xAI ran a silent production rollout from November 1 to November 14, 2025, shifting an increasing share of real traffic to 4.1 builds and running blind pairwise evaluations on live conversations. In those online A/B tests Grok 4.1 responses were preferred 64.78 percent of the time versus the previous production Grok.

Two configurations, different tradeoffs

Grok 4.1 ships in two configurations. Grok 4.1 Thinking, codename quasarflux, runs an explicit internal reasoning phase before producing a final answer. The non reasoning variant, codename tensor, skips that extra reasoning pass to prioritize latency and cost. On LMArena's Text Arena leaderboard Grok 4.1 Thinking sits at number 1 with 1483 Elo, and the fast non reasoning variant ranks number 2 with 1465 Elo. For context, the earlier Grok 4 had an overall rank of 33 on the same benchmark.

Reinforcement learning for style, personality and alignment

xAI emphasizes post-training pipeline work rather than architecture tweaks. The team reuses large-scale reinforcement learning infrastructure developed for Grok 4 and applies it to optimize style, personality, helpfulness and alignment. A core technical approach is reward modeling: stronger agentic reasoning models are used as reward models that autonomously grade candidate responses at scale. These reward signals then drive reinforcement learning updates on Grok 4.1, a production example of model-based supervision where strong models act as graders for other models inside a closed-loop training system.

Measuring emotional intelligence and creative writing

To quantify interpersonal behavior, xAI evaluated Grok 4.1 on EQ Bench3, a multi-turn benchmark focused on emotional intelligence in role play and analysis tasks. EQ Bench3 uses 45 challenging role-play scenarios, most spanning three turns, and combines rubric evaluation with Elo style battles; judging was done by Claude Sonnet 3.7 with default sampling and no system prompt. A separate Creative Writing v3 benchmark measures performance on 32 prompts with three generations per prompt using a similar rubric-plus-battle evaluation pipeline.

Reducing hallucinations in information seeking

xAI targets hallucination reduction mainly in the fast non reasoning configuration, which is used for quick answers and runs with web search tools. They measure hallucination rate on a stratified sample of real production queries where users expect factual answers and also report FActScore on 500 biography questions. In their methodology hallucination rate is defined as the macro average percentage of atomic claims with major or minor errors across responses. Evaluations with the non reasoning Grok 4.1 and web search enabled show improvements in both hallucination rate and FActScore relative to Grok 4 Fast.

Safety evaluations and alignment tradeoffs

The Grok 4.1 technical report includes detailed safety testing across the two configurations, both evaluated with the production system prompt. For abuse potential, xAI reports low answer rates on internal harmful request datasets and on AgentHarm, which measures malicious agentic tasks. A new input filter for restricted biology and chemistry shows a false negative rate of 0.03 for restricted biology prompts and 0.00 for restricted chemistry prompts; these rates increase when prompt injection attacks are added, indicating remaining adversarial vulnerabilities.

xAI also measures deception with the MASK benchmark and sycophancy using Anthropic's evaluation. Despite training aimed at reducing lies and sycophancy, measured dishonesty rates are 0.49 for Grok 4.1 Thinking and 0.46 for Grok 4.1 Non Thinking, compared with 0.43 for Grok 4. Sycophancy rates are 0.19 and 0.23 for the two Grok 4.1 variants, versus 0.07 for Grok 4. These results indicate notable alignment regressions on these metrics even as other behaviors improve.

Dual use capabilities and limits

Grok 4.1 Thinking was tested on a range of dual use and specialist benchmarks including WMDP, VCT, BioLP Bench, ProtocolQA, FigQA, CloningScenarios and CyBench. It matches or exceeds reported human baselines on many text-only knowledge and troubleshooting tasks, while remaining below human experts on multimodal and complex multi-step biology and cybersecurity tasks.

Key takeaways for developers and safety teams

Grok 4.1 is available to all users and rolling out in Auto mode, with two configurations that currently take the top two spots on LMArena Text Arena. The model is tuned with large-scale reinforcement learning using frontier agentic models as reward models to optimize style, personality and real-world helpfulness. xAI reports reductions in hallucination rates for information-seeking queries in the non reasoning mode, confirmed on production traffic and the FActScore benchmark. At the same time the Grok 4.1 report surfaces higher measured deception and sycophancy rates compared with Grok 4, highlighting an important alignment tradeoff that teams must monitor.

🇷🇺

Сменить язык

Читать эту статью на русском

Переключить на Русский