Frozen LLM Beats Fine-Tuned Models at Causal Discovery

A frozen language model beat trained rivals at causal discovery.

Researchers asked whether large language models can reliably infer causal graphs from data, and the answer is blunt: fine-tuning and in-context tricks do not fix the core problem. The paper proves a kernel obstruction theorem showing that supervised fine-tuning, direct preference optimization, and in-context learning cannot reliably distinguish between causal graphs that generate similar observational data; doing so would require representations that grow without bound. In other words, the limitation is intrinsic to the learning paradigm, not a flaw of a particular model or dataset.
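The paper's theorem is not reproduced here, but the intuition behind "graphs that generate similar observational data" can be illustrated with the textbook two-variable case. The sketch below is an illustration only: the probabilities are made-up numbers, and the function names are hypothetical. It shows that X -> Y and Y -> X can produce exactly the same joint distribution while disagreeing about what happens when X is forced to a value.

```python
from fractions import Fraction as F

# Hypothetical numbers, chosen only for illustration.
p_x1 = F(1, 2)                              # P(X=1)
p_y1_given_x = {0: F(1, 5), 1: F(4, 5)}     # P(Y=1 | X=x)


def joint_x_causes_y():
    """Joint P(X, Y) generated by graph A: X -> Y."""
    joint = {}
    for x in (0, 1):
        px = p_x1 if x == 1 else 1 - p_x1
        for y in (0, 1):
            py = p_y1_given_x[x] if y == 1 else 1 - p_y1_given_x[x]
            joint[(x, y)] = px * py
    return joint


def refactor_as_y_causes_x(joint):
    """Re-express the same joint as P(Y) * P(X | Y), i.e. graph B: Y -> X."""
    p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in (0, 1) for y in (0, 1)}
    rebuilt = {(x, y): p_y[y] * p_x_given_y[(x, y)] for x in (0, 1) for y in (0, 1)}
    return p_y, rebuilt


joint_a = joint_x_causes_y()
p_y, joint_b = refactor_as_y_causes_x(joint_a)

# Observationally, the two graphs are indistinguishable.
same_observational = joint_a == joint_b

# Under the intervention do(X=1), they disagree about Y.
do_x1_graph_a = p_y1_given_x[1]   # X -> Y: forcing X shifts Y
do_x1_graph_b = p_y[1]            # Y -> X: forcing X leaves Y at its marginal

print(same_observational, do_x1_graph_a, do_x1_graph_b)
```

No amount of observational samples separates the two graphs in this toy case; only the interventional question does, which is the gap the approach described next is designed to exploit.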

To escape this trap, the paper proposes Agentic Causal Bayesian Optimization, or A-CBO. The design is simple: a frozen language model serves as an interventional oracle, answering targeted questions about what would happen under specific interventions. Meanwhile, an external Bayesian loop maintains beliefs over the candidate graphs and concentrates them in logarithmically many rounds. Because the decision process operates outside the space where the kernel obstruction applies, A-CBO converges even while the underlying model remains untouched.
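To make the division of labor concrete, here is a minimal sketch of an oracle-plus-belief loop. It is not the paper's algorithm: the four candidate graphs, the yes/no query format, the assumed error rate `eps`, the split-the-mass query rule, and the stub oracle (which stands in for the frozen LLM and simply answers from a hidden true graph) are all assumptions made for illustration.

```python
import itertools

NODES = ("A", "B", "C")

# Hypothetical candidate graphs, each a set of directed edges (cause, effect).
CANDIDATES = {
    "chain": {("A", "B"), ("B", "C")},
    "reverse_chain": {("C", "B"), ("B", "A")},
    "fork": {("B", "A"), ("B", "C")},
    "collider": {("A", "B"), ("C", "B")},
}


def descendants(edges, src):
    """Nodes reachable from src along directed edges."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        for cause, effect in edges:
            if cause == node and effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen


def predicts_change(edges, intervene_on, target):
    """Under this graph, does intervening on one node affect the target?"""
    return target in descendants(edges, intervene_on)


def make_stub_oracle(true_graph):
    """Stand-in for the frozen LLM: answers from a hidden true graph."""
    def oracle(intervene_on, target):
        return predicts_change(true_graph, intervene_on, target)
    return oracle


def run_loop(oracle, candidates, eps=0.1, threshold=0.95, max_rounds=20):
    """Bayesian update over candidate graphs from yes/no oracle answers."""
    belief = {name: 1.0 / len(candidates) for name in candidates}
    queries = list(itertools.permutations(NODES, 2))
    best = max(belief, key=belief.get)
    for round_idx in range(1, max_rounds + 1):
        def yes_mass(query):
            return sum(p for name, p in belief.items()
                       if predicts_change(candidates[name], *query))
        # Ask the question that splits current belief most evenly.
        query = min(queries, key=lambda q: abs(yes_mass(q) - 0.5))
        answer = oracle(*query)
        # Likelihood: the oracle is assumed wrong with probability eps.
        for name in belief:
            agrees = predicts_change(candidates[name], *query) == answer
            belief[name] *= (1 - eps) if agrees else eps
        total = sum(belief.values())
        belief = {name: p / total for name, p in belief.items()}
        best = max(belief, key=belief.get)
        if belief[best] >= threshold:
            return best, belief[best], round_idx
    return best, belief[best], max_rounds


best, prob, rounds = run_loop(make_stub_oracle(CANDIDATES["chain"]), CANDIDATES)
print(best, round(prob, 3), rounds)
```

In this sketch the language model is never updated; all learning lives in the `belief` dictionary. Choosing the query that splits the belief mass roughly in half is one simple way to get the kind of fast narrowing the article describes, since each informative answer can discount about half of the remaining hypotheses.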

Benchmark results anchor the claim. On Corr2Cause, A-CBO matches fine-tuned baselines without any training, showing that a fixed LLM can hold its own against models that were explicitly trained on the task. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18,000 test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing as the graph space becomes more complex. The headline here is not a marginal win; it is a demonstration that a fixed model paired with a light external loop can beat trained variants on a tougher causal discovery challenge.

For practitioners, the implications are concrete. First, expect to see more causal discovery tooling that relies on an external inference loop rather than brute-force model fine-tuning. The benchmark results suggest that a system that leaves the LLM frozen can still outperform fine-tuned baselines on larger, more complex graphs. Second, the compute story becomes more modular: you pay for running interventional queries and Bayesian rounds, not for retraining large models. Third, as graphs scale, the performance gap in favor of the A-CBO approach tends to widen, suggesting tangible gains for product features aimed at scientific discovery, policy analysis, or causal debugging inside AI pipelines.

Two practical insights stand out for builders. One, there is a hard limit to improving causal discovery with more training data alone: per the kernel obstruction, fine-tuning and in-context learning cannot separate observationally similar graphs, so surpassing a well-tuned baseline on larger problems requires interventional queries and a separate belief-updating loop. Two, the cost and complexity arrive with the external loop rather than the model itself. You’ll want to tune the budget of interventional queries and the number of Bayesian rounds, because those drive latency and cloud costs in a product setting.
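A back-of-the-envelope budget calculation shows how those two knobs interact. Everything in this sketch is an assumption for illustration: the `LoopBudget` class is hypothetical, the per-query latency and price are placeholder numbers rather than measurements, and rounds are assumed to run sequentially because each round's questions depend on the previous belief update.

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    """Hypothetical budget for an external query-and-update loop."""
    rounds: int                  # Bayesian update rounds (run sequentially)
    queries_per_round: int       # interventional questions sent per round
    seconds_per_query: float     # placeholder latency of one LLM call
    dollars_per_query: float     # placeholder price of one LLM call
    parallel_queries: int = 1    # calls that can be in flight at once

    def total_queries(self) -> int:
        return self.rounds * self.queries_per_round

    def latency_seconds(self) -> float:
        # Ceiling division: batches needed to send one round's queries.
        batches = -(-self.queries_per_round // self.parallel_queries)
        return self.rounds * batches * self.seconds_per_query

    def cost_dollars(self) -> float:
        return self.total_queries() * self.dollars_per_query


budget = LoopBudget(rounds=5, queries_per_round=4,
                    seconds_per_query=2.0, dollars_per_query=0.01,
                    parallel_queries=2)
print(budget.total_queries(), budget.latency_seconds(), budget.cost_dollars())
```

Under these assumptions, parallelism reduces latency within a round but not across rounds, so the number of rounds sets the floor on response time while the total query count sets the bill.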

Of course, limitations remain. The kernel obstruction is described as intrinsic, but real-world applicability will hinge on how robust the interventional oracle is across domains, and how well priors align with actual causal structure. The experiments cover Corr2Cause and Extended Corr2Cause; scaling to even larger graphs or different data regimes may introduce new brittleness in the interventional queries, priors, or the Bayesian updater.

In short, teams shipping this quarter that are evaluating causal discovery should consider architectures that keep the LLM frozen and pair it with an efficient external loop. The result is not just a better score on a benchmark; it is a blueprint for practical, training-free causal inference at product scale.

The Robotics Briefing