1) What DeepSeek changes (why MLA)

Problem: In autoregressive decoding, the KV cache (all past keys & values for every head) dominates memory and bandwidth.

Idea (MLA): Don’t cache the full K/V tensors. Instead, cache a small latent vector for each token and reconstruct K and V on the fly from that latent with lightweight per-head up-projections. This shifts the bottleneck from memory bandwidth to a bit more compute, which is a good trade on modern hardware. In DeepSeek-V2/V3 this cuts the KV cache dramatically (the DeepSeek-V2 paper reports a 93.3% reduction) while keeping quality competitive. (arXiv)
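
To make the memory argument concrete, here is a rough back-of-envelope comparison. The head count, head dimension, latent size, layer count, context length, and fp16 precision below are illustrative assumptions, not DeepSeek's published configuration.

```python
# Back-of-envelope KV-cache size: full per-head K/V vs. a single shared latent per token.
# All numbers are assumed for illustration (32 heads, head dim 128, latent dim 512,
# 60 layers, 32k-token context, fp16 = 2 bytes/element).
H, d_h, r = 32, 128, 512
bytes_per_elem, seq_len, n_layers = 2, 32_768, 60

standard_kv = 2 * H * d_h * bytes_per_elem * seq_len * n_layers  # K and V, every head, every layer
latent_only = r * bytes_per_elem * seq_len * n_layers            # one shared latent per token, per layer

print(f"standard KV cache: {standard_kv / 2**30:.1f} GiB")    # ~30.0 GiB
print(f"latent cache:      {latent_only / 2**30:.1f} GiB")    # ~1.9 GiB
print(f"reduction:         {standard_kv / latent_only:.0f}x") # ~16x
```

With these made-up dimensions the latent cache is roughly 16x smaller; the exact ratio depends entirely on how small the latent is relative to $H \cdot d_h$.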

One-line summary:

Cache C (small), not K/V (large). At each step, K = W_k·C, V = W_v·C per head.

Everything else (queries, RoPE, softmax) stays the same.
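
Here is a minimal PyTorch sketch of that decoding loop, under simplifying assumptions: a single attention layer, no RoPE, no biases, and no absorption of the up-projections into other matrices. All class, function, and dimension names (`LatentKVAttention`, `d_latent`, etc.) are illustrative, not DeepSeek's implementation.

```python
# Minimal sketch of MLA-style decoding: cache small latents, rebuild K/V per head on the fly.
import torch
import torch.nn.functional as F

class LatentKVAttention(torch.nn.Module):
    def __init__(self, d_model=1024, n_heads=8, d_head=64, d_latent=128):
        super().__init__()
        self.H, self.dh = n_heads, d_head
        self.w_q = torch.nn.Linear(d_model, n_heads * d_head, bias=False)   # queries (standard)
        self.w_c = torch.nn.Linear(d_model, d_latent, bias=False)           # down-projection to shared latent
        self.w_k = torch.nn.Linear(d_latent, n_heads * d_head, bias=False)  # per-head key up-projection
        self.w_v = torch.nn.Linear(d_latent, n_heads * d_head, bias=False)  # per-head value up-projection
        self.w_o = torch.nn.Linear(n_heads * d_head, d_model, bias=False)

    def step(self, x_t, cache):
        """One decode step. x_t: (batch, d_model). cache: past latents, (batch, t, d_latent)."""
        B = x_t.shape[0]
        c_t = self.w_c(x_t)                                    # (B, d_latent) -- the only thing we cache
        cache = torch.cat([cache, c_t.unsqueeze(1)], dim=1)    # (B, t+1, d_latent)

        q = self.w_q(x_t).view(B, self.H, 1, self.dh)          # (B, H, 1, dh)
        # Reconstruct K and V for all cached positions from the small latents.
        k = self.w_k(cache).view(B, -1, self.H, self.dh).transpose(1, 2)   # (B, H, t+1, dh)
        v = self.w_v(cache).view(B, -1, self.H, self.dh).transpose(1, 2)   # (B, H, t+1, dh)

        attn = F.scaled_dot_product_attention(q, k, v)          # standard softmax attention
        out = self.w_o(attn.transpose(1, 2).reshape(B, self.H * self.dh))
        return out, cache

# Usage: decode a few steps; only the (batch, t, d_latent) tensor is ever stored.
layer = LatentKVAttention()
cache = torch.zeros(2, 0, 128)
for _ in range(4):
    y_t, cache = layer.step(torch.randn(2, 1024), cache)
print(cache.shape)  # torch.Size([2, 4, 128])
```

Note the trade described above: K and V are recomputed from the cached latents at every step, so decoding spends a little more compute, but only the small latents ever occupy cache memory and bandwidth.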

Typical practical notes:


2) Core math (copy-pasteable)

Let $x_t \in \mathbb{R}^{d_{\text{model}}}$ be the hidden state at decoding step $t$.

1. Queries (standard):

$q_t = W_q x_t \;\; \in \mathbb{R}^{H \times d_h}$

2. Latent compression (shared across heads):

$c_t = W_c x_t \;\; \in \mathbb{R}^{r}, \quad r \ll H\cdot d_h$

3. Per-head decompression to keys/values: