Problem: In autoregressive decoding, the KV cache (all past keys & values for every head) dominates memory and bandwidth.
Idea (MLA, Multi-head Latent Attention): Don’t cache the full K/V tensors. Instead, cache a small latent vector per token and reconstruct K and V on the fly from that latent with lightweight per-head up-projections. This trades memory bandwidth for a little extra compute, which is a good deal on modern hardware. In DeepSeek-V2/V3, this cuts the KV cache dramatically while keeping quality competitive. (arXiv)
One-line summary:
Cache the small latent c (one vector per token), not the large K/V. At each step, reconstruct K = W_k·c and V = W_v·c per head from the cached latents.
Queries and softmax stay the same; RoPE needs a small decoupled-key tweak in the DeepSeek papers, but the attention math is otherwise unchanged. A minimal sketch follows.
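To make the cache-the-latent step concrete, here is a minimal single-sequence decoding sketch in PyTorch. It is illustrative only: the matrix names (`W_q`, `W_dkv`, `W_uk`, `W_uv`), the toy dimensions, and the dense per-head up-projections are my assumptions, and RoPE plus the decoupled RoPE-key path from the DeepSeek papers are omitted for brevity.

```python
# Minimal MLA-style decoding sketch (single sequence, single layer) -- illustrative,
# not DeepSeek's actual implementation. RoPE / decoupled RoPE keys are omitted.
import torch
import torch.nn.functional as F

d_model, H, d_h, r = 512, 8, 64, 64          # toy sizes; note r << H * d_h = 512

W_q   = torch.randn(H * d_h, d_model) / d_model ** 0.5   # query projection
W_dkv = torch.randn(r, d_model)       / d_model ** 0.5   # down-projection to the latent c_t
W_uk  = torch.randn(H * d_h, r)       / r ** 0.5         # K up-projection (all heads stacked)
W_uv  = torch.randn(H * d_h, r)       / r ** 0.5         # V up-projection (all heads stacked)

latent_cache = []   # we cache only c_t (r floats per token), never K or V

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """One autoregressive step for a single token embedding x_t of shape (d_model,)."""
    c_t = W_dkv @ x_t                        # (r,) -- the only thing added to the cache
    latent_cache.append(c_t)

    q = (W_q @ x_t).view(H, d_h)             # (H, d_h) queries for the current token

    C = torch.stack(latent_cache)             # (T, r) latents of all tokens so far
    K = (C @ W_uk.T).view(-1, H, d_h)         # (T, H, d_h) keys rebuilt on the fly
    V = (C @ W_uv.T).view(-1, H, d_h)         # (T, H, d_h) values rebuilt on the fly

    scores = torch.einsum('hd,thd->ht', q, K) / d_h ** 0.5
    attn = F.softmax(scores, dim=-1)           # (H, T) attention over past tokens
    out = torch.einsum('ht,thd->hd', attn, V)  # (H, d_h) per-head outputs
    return out.reshape(H * d_h)

# Usage: run a few steps and check that only tiny latents accumulate in the cache.
for _ in range(4):
    y = decode_step(torch.randn(d_model))
print(y.shape, len(latent_cache), latent_cache[0].shape)
# -> torch.Size([512]) 4 torch.Size([64])
```

The point to notice is the cache: each step appends one r-dimensional vector instead of 2·H·d_h key/value numbers, and K/V are rebuilt from all cached latents at attention time.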
Notation and shapes:
Let $x_t \in \mathbb{R}^{d_{\text{model}}}.$
$q_t = W_q x_t \;\; \in \mathbb{R}^{H \times d_h}$
$c_t = W_c x_t \;\; \in \mathbb{R}^{r}, \quad r \ll H\cdot d_h$
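Completing the shapes above (the superscripted $W_k^{(h)}, W_v^{(h)}$ are my notation for per-head slices of the full up-projections), each head’s keys and values are rebuilt from the cached latent:

$k_t^{(h)} = W_k^{(h)} c_t, \quad v_t^{(h)} = W_v^{(h)} c_t, \quad W_k^{(h)}, W_v^{(h)} \in \mathbb{R}^{d_h \times r}, \quad h = 1,\dots,H$

So the per-token, per-layer cache is $r$ numbers (just $c_t$) instead of $2 H d_h$ for full keys and values, i.e. roughly a $2 H d_h / r$ reduction.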