Order sensitivity breaks multimodal LLMs

By Alexander Cole | Jun 25, 2026 | 2 min read

Shuffling the inputs changes the answer in every one of the 18 multimodal LLMs audited.

The team behind Facet-Probe audited 18 frontier and open-weight multimodal large language models to test a simple but critical reliability property: does the order of the input change the final answer? The five-facet audit covers option, evidence-chunk, document-rank, image-set, and mixed-modality ordering. It uses a Bayesian item-response model to separate ordering noise from facet bias, and a same-ordering control to estimate the decoder’s intrinsic stochastic floor. The paper shows that none of the models is truly order-invariant. Across the five facets, screen-per-facet flip rates range from 24% to 50%, a wide spread that shows how sensitive current models are to the way data is presented. In a same-ordering control run on Gemini at temperature 0, the authors estimate a substantial ordering excess above the decoder-noise floor in verified cells.
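To make the same-ordering control concrete, here is a minimal Python sketch of the idea. It is not the paper's method: the authors fit a Bayesian item-response model, whereas this sketch simply subtracts one flip rate from another. The function names and the answer pairs are hypothetical and exist only for illustration.

```python
def flip_rate(pairs):
    """Fraction of (reference_answer, other_answer) pairs that disagree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one answer pair")
    return sum(ref != other for ref, other in pairs) / len(pairs)


def ordering_excess(reordered_pairs, same_order_pairs):
    """Flip rate under reordering minus the flip rate observed when the
    same ordering is run twice (a crude stand-in for the noise floor)."""
    return flip_rate(reordered_pairs) - flip_rate(same_order_pairs)


# Made-up answers for illustration only; these are not results from the paper.
reordered = [("A", "A"), ("B", "C"), ("C", "C"), ("D", "A")]  # 2 of 4 flip
repeated = [("A", "A"), ("B", "B"), ("C", "C"), ("D", "C")]   # 1 of 4 flips
excess = ordering_excess(reordered, repeated)
```

A positive excess suggests that reordering changes answers more often than rerunning the identical prompt does, which is the distinction the same-ordering control is meant to draw.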

The results are disconcerting for teams hoping to deploy multimodal systems in environments where presentation order is fluid or user-driven. The team reports that even a strong model can flip on 13.4% of trials in best-case comparisons, which shows that capability does not automatically translate into stable behavior when inputs are reorganized. When the researchers tried training-free prompt adjustments to improve robustness, they found that the effects are modality-conditional and do not transfer from text to visual reasoning. In short, prompt tinkering alone is unlikely to deliver general order robustness across modalities.

Benchmarks indicate a deeper problem: order sensitivity appears to be baked into the model’s decoding and how it fuses multimodal evidence, not just a quirk of one evaluation setup. The authors argue that evaluating LLMs with a single canonical ordering misses a fundamental reliability property that evaluation guidelines are starting to demand. They propose cross-ordering flip rate as a standard reporting axis for multimodal LLMs, inviting engineers to quantify how models behave when input streams are rearranged and to compare models on a level playing field.
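One way to picture a cross-ordering flip rate is the following Python sketch. It assumes a definition the article does not spell out: a trial counts as a flip when the answer under a reordered input differs from the answer under the original ordering. The `answer_fn` callable, the toy models, and the sample items are all hypothetical stand-ins, not the Facet-Probe harness.

```python
import itertools
import random


def cross_ordering_flip_rate(items, answer_fn, n_orderings=6, seed=0):
    """Share of reordered trials whose answer differs from the answer
    given under the original option ordering.

    `answer_fn(question, options)` must return the chosen option's text,
    not its position, so answers are comparable across orderings.
    """
    rng = random.Random(seed)
    flips = 0
    trials = 0
    for question, options in items:
        baseline = answer_fn(question, list(options))
        # Every ordering except the original one.
        perms = [p for p in itertools.permutations(options)
                 if list(p) != list(options)]
        for perm in rng.sample(perms, min(n_orderings, len(perms))):
            trials += 1
            flips += answer_fn(question, list(perm)) != baseline
    return flips / trials if trials else 0.0


# Toy stand-ins for a model (assumptions for illustration only).
def first_option_model(question, options):
    return options[0]        # fully position-biased


def stable_model(question, options):
    return min(options)      # depends on content only, never on order


items = [
    ("Which fruit is pictured?", ["apple", "banana", "cherry"]),
    ("Which animal is pictured?", ["cat", "dog", "eel"]),
]
biased_rate = cross_ordering_flip_rate(items, first_option_model)
stable_rate = cross_ordering_flip_rate(items, stable_model)
```

Comparing answers by content rather than by option letter matters here, because the letter attached to the correct option changes whenever the options are shuffled.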

From an engineering perspective, the takeaways are practical. First, this isn’t a problem you fix with smarter prompts alone; the fault line runs deeper, in training-time objectives and in architectural choices about how the model attends to and aggregates multimodal signals. Second, you should plan for latency and complexity costs if you pursue training-time remedies or architectural redesigns, because robust, order-invariant reasoning may require more explicit cross-modal alignment or invariants built into the decoder. Third, product teams should treat order robustness as a first-class reliability metric, especially for applications where input order is not strictly controlled by the system (for example, mixed-media pipelines, user-generated content, or dynamic data streams). Fourth, evaluation pipelines should routinely report cross-ordering performance, both to detect regressions and to separate gains in capability from stability across input permutations.
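As a rough illustration of the fourth point, an evaluation pipeline could track accuracy and flip rate as separate fields per facet and flag stability regressions independently of capability gains. The sketch below is one hypothetical shape for such a check; the `FacetResult` type, the 0.02 tolerance, and all numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetResult:
    facet: str        # e.g. "option", "image-set"
    accuracy: float   # capability under the canonical ordering
    flip_rate: float  # stability across reorderings


def stability_regressions(previous, current, tolerance=0.02):
    """Return facets whose flip rate rose by more than `tolerance`
    between two runs, regardless of what happened to accuracy."""
    prev_by_facet = {r.facet: r for r in previous}
    return [
        r.facet
        for r in current
        if r.facet in prev_by_facet
        and r.flip_rate - prev_by_facet[r.facet].flip_rate > tolerance
    ]


# Invented numbers for illustration; not taken from the paper.
last_release = [
    FacetResult("option", accuracy=0.81, flip_rate=0.26),
    FacetResult("image-set", accuracy=0.74, flip_rate=0.24),
]
candidate = [
    FacetResult("option", accuracy=0.83, flip_rate=0.25),
    FacetResult("image-set", accuracy=0.79, flip_rate=0.31),
]
regressed = stability_regressions(last_release, candidate)
```

In this made-up run the image-set facet gains accuracy while becoming less stable, which is the kind of trade-off a single accuracy number would hide.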

Looking ahead, the study suggests a clear research direction: beyond prompt engineering, what architectural patterns or training regimes yield true order robustness across modalities? If models routinely flip on a quarter to a half of trials under reordering, the industry will need explicit design choices, such as invariant evidence integration, robust multimodal alignment objectives, or decoding strategies that temper sensitivity to input order, to keep automated reasoning trustworthy in real-world use.

The Robotics Briefing