ARC-AGI-3 tests whether models can reason through novel problems rather than just recall patterns, a task even top systems still ...
A group of researchers has developed a new benchmark, dubbed LiveBench, to ease the task of evaluating large language models’ question-answering capabilities. The researchers released the benchmark on ...
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
A new global study of 11,500+ software developers reveals how developers use AI in 2026 and how organisations are ...
Numbers go up, AI gets better.
Researchers behind a new study say that the methods used to evaluate AI systems’ capabilities often lack rigor.
New “AI SOC LLM Leaderboard” Uniquely Measures LLMs in Realistic IT Environment to Give SOC Teams and Vendors Guidance to Pick the Best LLM for Their Organization. Simbian's industry-first benchmark ...
SEATTLE--(BUSINESS WIRE)--Thunk.AI today announced the release of a new “Hi-Fi” benchmark designed to rigorously measure the reliability of AI agentic automation. The benchmark models enterprise ...
On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance. Just a few days later, rival Inflection AI unveiled a model that it asserts ...