AI benchmarks are broken, real-world tests required
By Alexander Cole
Photo by Possessed Photography on Unsplash
Benchmarks are broken, and health AI proves it.
The mainstream way we judge AI—clear-cut, human-vs-machine tasks on narrow problems—is finally meeting the messiness of real use. A pair of Technology Review articles this week argues that the old habit of testing AI on isolated jobs and datasets is not just insufficient; it’s potentially misleading about how models behave when deployed inside teams, workflows, and high-stakes settings. The critique is blunt: we’re not evaluating AI where it actually matters, so we’re missing systemic risks, economic impacts, and the true limits of reliability.
The first piece lays out a simple but stubborn point: performance on a single benchmark does not translate into sustained, beneficial help in a real organization. AI is now embedded in teams, with multiple people, overlapping tasks, and long-running processes. Evaluation that ends after a single test scenario ignores how AI’s strengths and weaknesses unfold over time—how it blurs into daily work, how users adapt, and how failure modes compound. The author argues for benchmarks that track performance across longer horizons, within human workflows, and across organizational contexts. In other words, we need to measure how AI behaves when it co-authors documents, schedules meetings, or routes tasks inside a company—not just how it answers a question in isolation.
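To make that shift concrete, here is a minimal sketch of what a workflow-aware, long-horizon evaluation loop could look like. It is illustrative only: the scripted `WorkflowStep` tasks, the keyword-based scoring, and the toy assistant are assumptions made for the example, not a published benchmark or any vendor's harness. The point is the shape of the measurement, scoring whole multi-step episodes with context carried forward, rather than grading one isolated answer.

```python
# A minimal sketch of a long-horizon, workflow-aware evaluation loop.
# Everything here is illustrative: the assistant callable, the scripted
# workflow steps, and the scoring rule are hypothetical stand-ins. The idea
# is to score a whole multi-step episode, carry context forward, and see
# how errors compound, instead of grading one isolated prompt.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class WorkflowStep:
    task: str                      # e.g. "draft summary", "schedule meeting"
    expected_keywords: list[str]   # crude proxy for a correct contribution

@dataclass
class EpisodeResult:
    step_scores: list[float] = field(default_factory=list)

    @property
    def final_success(self) -> bool:
        # An episode only "helps" if no single step falls below threshold:
        # one bad hand-off can sink the whole workflow.
        return all(s >= 0.5 for s in self.step_scores)

def run_episode(assistant: Callable[[str, list[str]], str],
                steps: list[WorkflowStep]) -> EpisodeResult:
    """Run one multi-step workflow, feeding prior outputs back as context."""
    history: list[str] = []
    result = EpisodeResult()
    for step in steps:
        output = assistant(step.task, history)
        history.append(output)  # the next step sees everything so far
        hits = sum(kw.lower() in output.lower() for kw in step.expected_keywords)
        result.step_scores.append(hits / max(len(step.expected_keywords), 1))
    return result

def evaluate(assistant, episodes: list[list[WorkflowStep]]) -> dict:
    results = [run_episode(assistant, ep) for ep in episodes]
    return {
        "episode_success_rate": sum(r.final_success for r in results) / len(results),
        "mean_step_score": sum(s for r in results for s in r.step_scores)
                           / sum(len(r.step_scores) for r in results),
    }

if __name__ == "__main__":
    # Toy stand-in for a model call; a real harness would wrap an API client.
    def toy_assistant(task: str, history: list[str]) -> str:
        return f"Done: {task}. (context size: {len(history)})"

    demo = [[WorkflowStep("draft summary", ["summary"]),
             WorkflowStep("schedule meeting", ["meeting"])]]
    print(evaluate(toy_assistant, demo))
```

A real harness would swap the toy assistant for an API client and the keyword check for human or rubric-based grading, but the aggregation, per-episode success across a sequence of dependent steps, is exactly what single-shot benchmarks miss.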
The second article spotlights a sector where the stakes are literally life and death: health AI. In recent weeks, Big Tech players have pushed broader access to AI health tools, including Microsoft’s Copilot Health, Amazon’s Health AI, OpenAI’s ChatGPT Health, and Claude’s health-capable mode, reflecting demand for accessible, self-serve medical guidance. But as adoption scales, independent evaluation becomes indispensable. The piece warns against companies grading their own tools: products can ship with blind spots that outside experts would catch only after customers have already relied on them in critical situations. In health, where missteps can have immediate consequences, the call is clear: rigorous, external evaluation before broad release, with transparent methods and data-sharing to validate claims.
Taken together, these narratives sketch a broader move in AI governance: shift from narrow, static tests to long-horizon, workflow-aware benchmarks that reflect how AI actually operates with people over days, weeks, and months. And in high-stakes domains like health, the standard must include independent scrutiny, published methodologies, and opportunities for external validation before products reach wide audiences.
For practitioners, a few concrete takeaways emerge:
- Evaluate models inside the workflows they will actually join (co-authoring documents, scheduling meetings, routing tasks), not only on isolated prompts.
- Track performance over longer horizons, so that user adaptation and compounding failure modes show up in the numbers instead of hiding behind a single test scenario.
- In high-stakes domains such as health, require independent evaluation, published methodologies, and data-sharing before broad release.
- Treat real-user testing as part of the release cycle, not something bolted on after launch.
As the debate intensifies, the practical challenge remains: how to operationalize long-horizon, workflow-aware benchmarks without crippling development velocity or inflating costs. The answer will require collaboration across researchers, practitioners, and independent evaluators—and will likely hinge on shared datasets, transparency, and a willingness to publish negative findings for the greater good.
What this means for products shipping this quarter is not a retreat from ambitious AI goals but a recalibration: build pipelines that integrate real-user testing into the release cycle, demand external validation for high-stakes domains, and design benchmarks that reflect the messy, collaborative reality of work.
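As one hedged illustration of that recalibration, the sketch below encodes a simple release gate: broad release is blocked unless long-horizon evaluation evidence and, for high-stakes domains, independent review are on file. The field names, thresholds, and domain list are assumptions made for the example, not an industry standard or any company's actual process.

```python
# Illustrative only: a sketch of a "no external validation, no broad release"
# gate inside a release pipeline. Thresholds and fields are assumptions.
from dataclasses import dataclass

@dataclass
class ReleaseEvidence:
    domain: str                     # e.g. "health", "productivity"
    episode_success_rate: float     # from a long-horizon workflow harness
    eval_window_days: int           # how long the workflow evaluation ran
    external_audit: bool            # reviewed by an independent evaluator?
    methods_published: bool         # evaluation methodology shared publicly?

HIGH_STAKES = {"health", "finance", "legal"}

def release_blockers(ev: ReleaseEvidence) -> list[str]:
    """Return the reasons a broad release should be held back."""
    blockers = []
    if ev.episode_success_rate < 0.9:
        blockers.append("long-horizon workflow success below threshold")
    if ev.eval_window_days < 30:
        blockers.append("evaluation window shorter than one month of real use")
    if ev.domain in HIGH_STAKES and not ev.external_audit:
        blockers.append("high-stakes domain without independent evaluation")
    if ev.domain in HIGH_STAKES and not ev.methods_published:
        blockers.append("evaluation methods not published for external validation")
    return blockers

if __name__ == "__main__":
    candidate = ReleaseEvidence("health", 0.93, 45,
                                external_audit=False, methods_published=True)
    for reason in release_blockers(candidate):
        print("BLOCKED:", reason)
```

The specific numbers matter less than the design choice: workflow-level evidence and external validation become explicit preconditions in the pipeline rather than afterthoughts.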