LLMs still lag human experts on data tasks
Humans beat frontier LLMs in a data analysis coding test.
Large language models are often praised for their benchmark results, but the paper reports that a rigorous new task reveals a stubborn gap between what frontier models can do and what trained human experts reliably deliver. The study introduces a benchmarking format that requires an LLM to write code to complete a data analysis problem, then compares its output against submissions from human specialists. Crucially, the researchers measure not just average accuracy but also the variance of responses and the magnitude of errors. The result: humans outperform on average across several metrics, and their performance varies far less. In other words, the question is not only who scores higher on a single metric; it is who delivers consistently correct, reproducible results in contexts where mistakes can cascade.
This matters because the conventional way benchmarks are run can paint a misleading picture. The paper argues that many tasks used to support claims that “the AI is as good as an expert” rely on content that may already appear in training data, and that they often ignore how frequently models go off the rails or make large, costly mistakes. In high-stakes settings such as data analysis workflows that guide decision making, understanding reliability and error profiles is as important as chasing higher averages. The findings indicate a need to broaden evaluation beyond mean performance to include variance, outlier behavior, and the potential for harmful or expensive errors.
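To make the point about averages concrete, here is a minimal Python sketch of a variance-aware summary. The function name `summarize_errors` and all the numbers are hypothetical illustrations, not data or code from the paper; the two lists are chosen only to show how similar mean errors can hide very different spreads and worst cases.

```python
import statistics


def summarize_errors(errors):
    """Summarize absolute errors from repeated trials of one system.

    Reports the mean alongside the sample standard deviation and the
    worst-case error, so a single bad run is not averaged away.
    """
    return {
        "mean": statistics.fmean(errors),
        "stdev": statistics.stdev(errors),
        "worst": max(errors),
    }


# Hypothetical absolute errors from five trials each (illustrative only,
# not taken from the paper).
human_errors = [0.10, 0.12, 0.09, 0.11, 0.10]
model_errors = [0.02, 0.03, 0.02, 0.45, 0.02]

human = summarize_errors(human_errors)
model = summarize_errors(model_errors)

for name, summary in (("human", human), ("model", model)):
    print(
        f"{name}: mean={summary['mean']:.3f} "
        f"stdev={summary['stdev']:.3f} worst={summary['worst']:.2f}"
    )
```

In this made-up case the two mean errors are nearly identical, yet one system has a single large miss that only the standard deviation and worst-case columns expose.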
From a practitioner's perspective, the findings translate into clear engineering and product bets. First, when designing model-enabled data tasks, the benchmark should be built to reveal variance and error magnitude, not just the average score. That means running multiple trials, testing across diverse inputs, and tracking worst-case failures rather than celebrating a single top-line result. Second, human-in-the-loop review remains valuable in high-stakes pipelines. The study’s outcomes suggest LLMs can accelerate portions of a workflow, but automated results should be paired with human review or automated cross-checks to catch outliers, logic gaps, or nonsensical code. Third, reliability engineering becomes essential: teams should build monitoring that flags outputs with high uncertainty or substantial deviation from expected patterns, and establish deterministic fallbacks for when the model’s reasoning or code generation veers off course. Fourth, transparency about capabilities matters for decision makers. According to the paper, performance does not increase monotonically with model size or claimed sophistication, so product leaders should demand robust, variance-focused benchmarks when evaluating AI-assisted tools for analysts.
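The monitoring-and-fallback idea can be sketched in a few lines of Python. This is one possible shape under stated assumptions, not a method described in the paper: the function name `review_gate`, the use of a deterministic baseline value for comparison, and the 20% relative tolerance are all hypothetical choices made for illustration.

```python
def review_gate(model_value, baseline_value, rel_tolerance=0.2):
    """Decide whether a model-produced number can be used as-is.

    Returns a (value, needs_review) pair. If the model produced nothing,
    or its result deviates from a deterministic baseline by more than
    rel_tolerance (relative to the baseline), fall back to the baseline
    and flag the case for human review. All thresholds are assumptions.
    """
    if model_value is None:
        # Model code failed or returned no result: use the fallback.
        return baseline_value, True

    # Guard against division by zero when the baseline is zero.
    scale = max(abs(baseline_value), 1e-12)
    deviation = abs(model_value - baseline_value) / scale

    if deviation > rel_tolerance:
        return baseline_value, True
    return model_value, False


# Hypothetical usage: the baseline might come from a simple, audited query.
print(review_gate(105.0, 100.0))  # within tolerance, accepted
print(review_gate(250.0, 100.0))  # large deviation, flagged
print(review_gate(None, 100.0))   # no model output, flagged
```

A gate like this does not prove the model's answer is right; it only routes suspicious outputs to a person and keeps the pipeline on a known, reproducible value in the meantime.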
If the industry wants to move from headline claims to dependable AI-powered analysis, the path runs through clearer benchmarking and disciplined risk accounting. The study’s emphasis on variance and error magnitude provides a blueprint for what to watch next: richer evaluation regimes, standardized reporting of failure modes, and governance that treats AI outputs as fallible instrument readings rather than infallible sources of truth. In the near term, teams should design data tasks with explicit failure modes in mind, preserve human oversight where it matters most, and push for benchmarks that reveal not just how well models perform, but how reliably they perform under pressure.
- Flaws in the LLM Automation Narrative / arXiv LLM/Foundation Query / Primary source / Published JUN 09, 2026 / Accessed JUN 10, 2026