#benchmarks · 09/08/2025
Inside 2025's AI Agents: What Works, What Risks, and How to Ship
A concise 2025 guide to AI agents covering what they are, where they work reliably, risks, architecture patterns, and evaluation strategies.
VERINA introduces a holistic benchmark for evaluating LLMs on verifiable code generation, combining code, formal specifications, and proofs across diverse difficulty levels.
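To make the "code plus formal specification plus proof" framing concrete, here is a minimal sketch in Lean 4 of what such a task shape can look like. It is a hypothetical illustration, not an item from the VERINA suite; the names `myMax`, `le_myMax_left`, and `le_myMax_right` are invented for this example.

```lean
-- Code: the implementation a model would be asked to generate.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification and proof: the result is an upper bound of the first input.
theorem le_myMax_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split <;> omega   -- case a ≤ b: goal a ≤ b; case ¬ a ≤ b: goal a ≤ a

-- Specification and proof: the result is an upper bound of the second input.
theorem le_myMax_right (a b : Nat) : b ≤ myMax a b := by
  unfold myMax
  split <;> omega   -- case a ≤ b: goal b ≤ b; case ¬ a ≤ b: goal b ≤ a
```

A benchmark in this style can grade the implementation, the specification, and the proof separately, which is what makes the evaluation "verifiable" rather than test-based alone.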
LifelongAgentBench introduces a novel benchmark for evaluating continuous learning in LLM-based agents, focusing on knowledge retention and adaptation across sequential tasks in dynamic environments.
Salesforce AI Research introduces innovative benchmarks, safety models, and architectures to build more reliable and capable AI agents for enterprise applications.