Transformers prove three projections unnecessary in practice

Sharing keys and values halves cache, barely bending language quality.

A focused study challenges a core assumption in modern transformers: that attention needs separate projections for query, key, and value. The team reports that several projection sharing strategies can cut memory needs substantially while keeping model quality within reach. In a sweep across synthetic tasks, vision benchmarks, and language modeling with 300 million and 1.2 billion parameter models trained on about 10 billion tokens, they compare three QKV sharing patterns. They show that keeping Q separate while tying K and V to a single projection (Q-K=V) can perform on par with, or even exceed, the standard QKV arrangement in many settings. Crucially, for language models, Q-K=V delivers about a 50 percent reduction in KV cache usage with only a 3.1 percent perplexity hit in the evaluated configurations. The work also maps how these savings compound when combined with head sharing strategies, a direction the authors describe as complementary to their projection sharing.
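To make the Q-K=V idea concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation: the function names, the tiny weight matrices, and the omission of causal masking and multiple heads are all assumptions made for illustration. The point it shows is that when one projection serves as both keys and values, only one tensor per token needs to be cached instead of two.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_tied_kv(x, w_q, w_kv):
    """Single-head attention where one projection acts as both K and V.

    Returns the output, the attention weights, and the single tensor that
    would be cached per token. No causal mask is applied in this sketch.
    """
    q = matmul(x, w_q)
    kv = matmul(x, w_kv)  # hypothetical tied projection: K = V = x @ w_kv
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in kv] for qi in q]
    weights = [softmax(r) for r in scores]
    out = matmul(weights, kv)  # the values are the same vectors as the keys
    return out, weights, kv

def cached_floats(num_tokens, dim, tied):
    """Count cached numbers per head: one tensor if tied, two (K and V) if not."""
    return num_tokens * dim * (1 if tied else 2)

# Illustrative inputs only; these numbers carry no meaning.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[0.5, -0.5], [0.25, 1.0]]
w_kv = [[1.0, 0.0], [0.5, 0.5]]
out, weights, cache = attention_tied_kv(x, w_q, w_kv)
```

In this toy setup, `cached_floats(3, 2, tied=True)` is half of `cached_floats(3, 2, tied=False)`, which is the roughly 50 percent cache reduction described above.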

The paper also examines how to keep attention useful when projections are tied. When the researchers test Q=K-V (queries and keys tied, values separate) or the extreme Q=K=V (one projection for all three), they observe symmetric attention maps that wash out directional cues. To counter this, they explore asymmetric attention using 2D positional encodings, but the central takeaway remains that tying keys and values to the same projection preserves enough expressive power for many tasks. The result is not just a theoretical curiosity; it has practical impact for edge deployment, where memory and bandwidth are at a premium.
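The symmetry problem can be seen with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not code from the paper: the matrices are made-up integers, and it inspects the raw score matrix before softmax and without positional encodings. When queries and keys come from the same projection, the score for token i attending to token j equals the score for j attending to i.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def raw_scores(x, w_q, w_k):
    """Pre-softmax attention scores Q @ K^T (scaling omitted for clarity)."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))

def is_symmetric(m):
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))

# Hypothetical integer inputs so the comparison is exact.
x = [[1, 0], [0, 1], [1, 1]]
w_shared = [[1, 2], [3, 4]]
w_q_separate = [[1, 0], [0, 1]]

tied = raw_scores(x, w_shared, w_shared)        # Q = K: same projection
untied = raw_scores(x, w_q_separate, w_shared)  # separate Q projection

print(is_symmetric(tied))    # True
print(is_symmetric(untied))  # False
```

With a separate query projection the two directions differ (here `untied[0][1]` is 3 while `untied[1][0]` is 2), which is the directional information that Q-K=V keeps and Q=K variants lose.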

On the language modeling front, the authors report compelling numbers. In their 300M and 1.2B parameter models, Q-K=V stands out as a low-risk, high-reward option. They also quantify the benefits of combining Q-K=V with grouped query attention (GQA) or multi-query attention (MQA). Specifically, pairing Q-K=V with GQA and four groups yields an 87.5 percent reduction in KV cache usage, while pairing it with MQA instead pushes the figure to 96.9 percent. Those are material memory savings, especially when multiplied across layers and scaled up to larger models or longer prompts. The team notes that such reductions can unlock practical on-device inference for models that would otherwise be limited to server-backed deployment.
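The way these savings compound can be checked with simple arithmetic. The sketch below counts cached tensors per token relative to standard multi-head attention. The head count of 16 is an assumption: the article does not state it, but it is the value that reproduces the reported 87.5 and 96.9 percent figures. The function name is hypothetical.

```python
def kv_cache_reduction(num_heads, num_kv_heads, tied_kv):
    """Fraction of KV cache saved relative to standard multi-head attention.

    Standard attention caches K and V for every head (2 * num_heads tensors).
    Sharing heads (GQA/MQA) reduces the number of cached heads; tying K and V
    (Q-K=V) reduces the tensors per cached head from two to one.
    """
    baseline = 2 * num_heads
    tensors_per_head = 1 if tied_kv else 2
    return 1 - (tensors_per_head * num_kv_heads) / baseline

HEADS = 16  # assumed head count, not stated in the article

print(kv_cache_reduction(HEADS, HEADS, tied_kv=True))  # Q-K=V alone: 0.5
print(kv_cache_reduction(HEADS, 4, tied_kv=True))      # Q-K=V + GQA, 4 groups: 0.875
print(kv_cache_reduction(HEADS, 1, tied_kv=True))      # Q-K=V + MQA: 0.96875
```

Under that assumption, the MQA case comes out at 96.875 percent, which rounds to the 96.9 percent quoted above.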

The results are framed as a concrete form of weight tying in attention, with direct, quantifiable inference memory benefits. The authors emphasize that Q-K=V preserves quality because keys and values can occupy similar representational spaces, and because attention operates in a relatively low-rank regime. They acknowledge, however, that Q=K=V can break directionality and harm performance, underscoring the need for careful choice of projection sharing pattern based on the target task and hardware constraints.

For engineers and product leaders, the takeaway is actionable. If memory constraints matter, a staged approach could start with Q-K=V and GQA or MQA to maximize cache savings with minimal accuracy loss. The paper also signals a broader design principle: weight tying in attention, when used judiciously, can yield meaningful efficiency gains without wholesale sacrifices in quality. The authors provide their code openly to help teams reproduce and experiment, reinforcing the engineering mindset that these results are about practical deployment as much as theory.

The study adds a clear, testable path for edge deployments and resource-constrained settings, a topic of increasing relevance as models grow in size and reach. It does not claim a universal replacement for QKV, but it does show that a sizable portion of the memory budget can be reclaimed through intelligent projection sharing, especially when paired with advanced attention variants.

The Robotics Briefing