AI benchmarks are broken, real-world tests required
By Alexander Cole
Photo by Possessed Photography on Unsplash
Benchmarks are broken, and health AI proves it.
The mainstream way we judge AI—clear-cut, human-vs-machine tasks on narrow problems—is finally meeting the messiness of real use. A pair of Technology Review articles this week argues that the old habit of testing AI on isolated jobs and datasets is not just insufficient; it’s potentially misleading about how models behave when deployed inside teams, workflows, and high-stakes settings. The critique is blunt: we’re not evaluating AI where it actually matters, so we’re missing systemic risks, economic impacts, and the true limits of reliability.
The first piece lays out a simple but stubborn point: performance on a single benchmark does not translate into sustained, beneficial help in a real organization. AI is now embedded in teams, with multiple people, overlapping tasks, and long-running processes. Evaluation that ends after a single test scenario ignores how AI’s strengths and weaknesses unfold over time—how it blurs into daily work, how users adapt, and how failure modes compound. The author argues for benchmarks that track performance across longer horizons, within human workflows, and across organizational contexts. In other words, we need to measure how AI behaves when it co-authors documents, schedules meetings, or routes tasks inside a company—not just how it answers a question in isolation.
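To make that shift concrete, here is a minimal sketch of what a workflow-aware, long-horizon evaluation loop could look like. It is illustrative only: the scripted `WorkflowStep` tasks, the keyword-based scoring, and the toy assistant are assumptions made for the example, not a published benchmark or any vendor's harness. The point is the shape of the measurement, scoring whole multi-step episodes with context carried forward, rather than grading one isolated answer.

```python
# A minimal sketch of a long-horizon, workflow-aware evaluation loop.
# Everything here is illustrative: the assistant callable, the scripted
# workflow steps, and the scoring rule are hypothetical stand-ins. The idea
# is to score a whole multi-step episode, carry context forward, and see
# how errors compound, instead of grading one isolated prompt.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class WorkflowStep:
    task: str                      # e.g. "draft summary", "schedule meeting"
    expected_keywords: list[str]   # crude proxy for a correct contribution

@dataclass
class EpisodeResult:
    step_scores: list[float] = field(default_factory=list)

    @property
    def final_success(self) -> bool:
        # An episode only "helps" if no single step falls below threshold:
        # one bad hand-off can sink the whole workflow.
        return all(s >= 0.5 for s in self.step_scores)

def run_episode(assistant: Callable[[str, list[str]], str],
                steps: list[WorkflowStep]) -> EpisodeResult:
    """Run one multi-step workflow, feeding prior outputs back as context."""
    history: list[str] = []
    result = EpisodeResult()
    for step in steps:
        output = assistant(step.task, history)
        history.append(output)  # the next step sees everything so far
        hits = sum(kw.lower() in output.lower() for kw in step.expected_keywords)
        result.step_scores.append(hits / max(len(step.expected_keywords), 1))
    return result

def evaluate(assistant, episodes: list[list[WorkflowStep]]) -> dict:
    results = [run_episode(assistant, ep) for ep in episodes]
    return {
        "episode_success_rate": sum(r.final_success for r in results) / len(results),
        "mean_step_score": sum(s for r in results for s in r.step_scores)
                           / sum(len(r.step_scores) for r in results),
    }

if __name__ == "__main__":
    # Toy stand-in for a model call; a real harness would wrap an API client.
    def toy_assistant(task: str, history: list[str]) -> str:
        return f"Done: {task}. (context size: {len(history)})"

    demo = [[WorkflowStep("draft summary", ["summary"]),
             WorkflowStep("schedule meeting", ["meeting"])]]
    print(evaluate(toy_assistant, demo))
```

A real harness would swap the toy assistant for an API client and the keyword check for human or rubric-based grading, but the aggregation, per-episode success across a sequence of dependent steps, is exactly what single-shot benchmarks miss.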
The second article spotlights a sector where the stakes are literally life and death: health AI. In recent weeks, Big Tech players have pushed broader access to AI health tools, including Microsoft’s Copilot Health, Amazon’s Health AI, OpenAI’s ChatGPT Health, and Claude’s health-capable mode, reflecting demand for accessible, self-serve medical guidance. But as adoption scales, independent evaluation becomes indispensable. The piece warns against companies grading their own tools: products can ship with blind spots that outside experts would catch only after customers have already relied on them in critical situations. In health, where missteps can have immediate consequences, the call is clear: rigorous, external evaluation before broad release, with transparent methods and data-sharing to validate claims.
Taken together, these narratives sketch a broader move in AI governance: shift from narrow, static tests to long-horizon, workflow-aware benchmarks that reflect how AI actually operates with people over days, weeks, and months. And in high-stakes domains like health, the standard must include independent scrutiny, published methodologies, and opportunities for external validation before products reach wide audiences.
For practitioners, a few concrete takeaways emerge:
- Evaluate models inside the workflows they will actually join (co-authoring documents, scheduling meetings, routing tasks), not only on isolated prompts.
- Track performance over longer horizons, so that user adaptation and compounding failure modes show up in the numbers instead of hiding behind a single test scenario.
- In high-stakes domains such as health, require independent evaluation, published methodologies, and data-sharing before broad release.
- Treat real-user testing as part of the release cycle, not something bolted on after launch.
As the debate intensifies, the practical challenge remains: how to operationalize long-horizon, workflow-aware benchmarks without crippling development velocity or inflating costs. The answer will require collaboration across researchers, practitioners, and independent evaluators—and will likely hinge on shared datasets, transparency, and a willingness to publish negative findings for the greater good.
What this means for products shipping this quarter is not a retreat from ambitious AI goals but a recalibration: build pipelines that integrate real-user testing into the release cycle, demand external validation for high-stakes domains, and design benchmarks that reflect the messy, collaborative reality of work.
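As one hedged illustration of that recalibration, the sketch below encodes a simple release gate: broad release is blocked unless long-horizon evaluation evidence and, for high-stakes domains, independent review are on file. The field names, thresholds, and domain list are assumptions made for the example, not an industry standard or any company's actual process.

```python
# Illustrative only: a sketch of a "no external validation, no broad release"
# gate inside a release pipeline. Thresholds and fields are assumptions.
from dataclasses import dataclass

@dataclass
class ReleaseEvidence:
    domain: str                     # e.g. "health", "productivity"
    episode_success_rate: float     # from a long-horizon workflow harness
    eval_window_days: int           # how long the workflow evaluation ran
    external_audit: bool            # reviewed by an independent evaluator?
    methods_published: bool         # evaluation methodology shared publicly?

HIGH_STAKES = {"health", "finance", "legal"}

def release_blockers(ev: ReleaseEvidence) -> list[str]:
    """Return the reasons a broad release should be held back."""
    blockers = []
    if ev.episode_success_rate < 0.9:
        blockers.append("long-horizon workflow success below threshold")
    if ev.eval_window_days < 30:
        blockers.append("evaluation window shorter than one month of real use")
    if ev.domain in HIGH_STAKES and not ev.external_audit:
        blockers.append("high-stakes domain without independent evaluation")
    if ev.domain in HIGH_STAKES and not ev.methods_published:
        blockers.append("evaluation methods not published for external validation")
    return blockers

if __name__ == "__main__":
    candidate = ReleaseEvidence("health", 0.93, 45,
                                external_audit=False, methods_published=True)
    for reason in release_blockers(candidate):
        print("BLOCKED:", reason)
```

The specific numbers matter less than the design choice: workflow-level evidence and external validation become explicit preconditions in the pipeline rather than afterthoughts.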