Memory Gaps Slow Multimodal LLMs More Than Flaws

By Alexander Cole | Jun 18, 2026 | 2 min read

Your multimodal LLM forgets what it just saw.

A new benchmark suite called RNG-Bench isolates memory from action, showing that forgetting past observations, not misjudged choices, often trips up models during multi-step tasks. The authors pit base models against two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must be recalled later, and 3D Maze, which requires turning egocentric views into a coherent spatial map. Both games run under a unified harness with three controlled difficulty axes (grid size, visual pattern, and observation modality) and a head-to-head duel protocol that normalizes instance-level variance. The hardest configurations push contexts to roughly 128K tokens and 350 image inputs per episode, a regime that frontier multimodal models are far from saturating.
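To see why a duel setup dampens instance-level variance, consider a minimal sketch in which two models play the same instances and are compared pair by pair. The function name, the tie handling, and the "higher score wins" convention are assumptions made for illustration; the article does not describe the exact scoring RNG-Bench uses.

```python
from typing import Sequence


def duel_win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of shared instances on which model A beats model B.

    Hypothetical scoring rule: higher score wins, ties count as half a win.
    Both sequences must be aligned so index i refers to the same instance.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same instances")
    if not scores_a:
        raise ValueError("at least one instance is required")
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    return (wins + 0.5 * ties) / len(scores_a)
```

Because each comparison happens on an identical board or maze, an unusually easy or hard instance affects both players equally and drops out of the result.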

The team reports that these configurations reveal a sharp truth about multimodal closed-loop control: even when decisions look sound, a model can stumble because it cannot remember what it saw earlier. The memory gap metric is designed to disentangle forgetting from poor action selection, a crucial distinction for engineering teams seeking to improve real systems. In practice, this means improving the way a model stores, retrieves, and utilizes past observations across multiple steps, rather than assuming that better immediate predictions will automatically translate into better long-horizon behavior. The paper shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making, a finding that shifts the emphasis of model improvement toward robust memory mechanisms alongside perception and planning.
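The distinction between forgetting and poor action selection can be made concrete with a toy simulation. The sketch below is not the paper's metric: it is an illustrative stand-in that plays Matching Pairs with one fixed decision rule and varies only how many past observations the player can retain. The names `play_matching_pairs`, `memory_capacity`, and `memory_gap` are hypothetical.

```python
import random
from collections import OrderedDict
from statistics import mean


def play_matching_pairs(num_pairs, memory_capacity, seed):
    """Return the number of turns (two flips each) needed to clear the board.

    memory_capacity=None means perfect memory; an integer keeps only that
    many most recent observations. The decision rule never changes.
    """
    rng = random.Random(seed)
    board = list(range(num_pairs)) * 2
    rng.shuffle(board)  # same seed gives the same board to every player
    remaining = set(range(len(board)))
    memory = OrderedDict()  # position -> card identity, oldest first
    turns = 0

    def remember(pos):
        memory[pos] = board[pos]
        memory.move_to_end(pos)
        while memory_capacity is not None and len(memory) > memory_capacity:
            memory.popitem(last=False)  # forget the oldest observation

    def recall_mate(card, exclude):
        for pos, seen in memory.items():
            if seen == card and pos != exclude:
                return pos
        return None

    def unseen(exclude=None):
        return [p for p in sorted(remaining) if p not in memory and p != exclude]

    while remaining:
        turns += 1
        # Take a remembered pair if one exists, otherwise explore.
        first = next((p for p, c in memory.items()
                      if recall_mate(c, p) is not None), None)
        if first is None:
            first = rng.choice(unseen())
        second = recall_mate(board[first], first)
        remember(first)
        if second is None:
            second = rng.choice(unseen(exclude=first))
        remember(second)
        if board[first] == board[second]:
            remaining -= {first, second}
            memory.pop(first, None)
            memory.pop(second, None)
    return turns


def memory_gap(num_pairs, memory_capacity, seeds):
    """Mean extra turns versus a perfect-memory player on the same boards."""
    return mean(
        play_matching_pairs(num_pairs, memory_capacity, s)
        - play_matching_pairs(num_pairs, None, s)
        for s in seeds
    )
```

Calling `memory_gap(8, 2, range(50))`, for example, estimates how many extra turns a player that retains only two observations needs on 8-pair boards. Since both players follow the same rule on the same boards, any gap in this toy setting comes from forgetting alone.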

On the modeling side, the work highlights a concrete path for practitioners. Fine-tuning a mid-sized model, Qwen3.5-9B, on optimal-policy rollouts and filtered demonstrations yields tangible gains on RNG-Bench and transfers to existing benchmarks without sacrificing multimodal capability. The team reports that this 9B-parameter model benefits from targeted policy data, suggesting a practical lever for teams that cannot train the largest models but still want stronger long-horizon performance. The result underlines a familiar engineering point: modest-scale models can close gaps when given curated experience that aligns their behavior with the desired cycle of observation and action.

Four practitioner takeaways emerge. First, memory management is not a nicety but a core design constraint; engineers should invest in explicit memory handling and retrieval strategies that keep past observations accessible across long interaction traces. Second, diagnostic metrics like the memory gap offer actionable visibility into whether failures come from forgetting or from decision flaws, guiding where to invest compute and data. Third, policy-oriented fine-tuning on well-curated rollout data can yield cross-benchmark gains for mid-range models without eroding multimodal performance, providing a practical upgrade path for teams with fixed compute budgets. Fourth, benchmark designs that compare models head-to-head help control for instance-level variance and stress-test memory under realistic, multi-step interaction, a pattern teams should adopt when evaluating memory-centric capabilities.

In short, RNG-Bench reframes progress for multimodal, real-time systems: the decisive bottleneck is not always what the model decides, but what it remembers. As teams push toward longer contexts and tighter loops between sensing and action, targeted memory improvements and memory-aware evaluation will be the levers that turn promising demos into reliable, production-ready behavior.

The Robotics Briefing