1) What’s different in GPT-OSS (in plain words)
- Alternating attention pattern. Layers alternate between:
  - Global (full) attention: each token can look at all previous tokens.
  - Local (sliding-window) attention: each token only looks back over, say, the last 128 tokens. This cuts cost while still catching nearby patterns (a minimal mask sketch follows this list). (OpenAI)
- GQA (grouped-query attention). Many Q heads share a smaller set of K/V heads (e.g., 64 Q heads but only 8 K/V heads). You compute K/V once per K/V head and repeat them across the Q heads in each group, a big win for memory bandwidth and speed. (OpenAI)
- Long-context RoPE. Rotary position embeddings with large context (reported 128K). Same RoPE idea you already used; just configured for long sequences. (Hugging Face)
- Learned attention sink (per head). Each head has a learned “extra mass” added to the softmax denominator so a tiny bit of probability can fall into a “null/sink” bucket. Intuition: stabilizes training, prevents heads from over-committing, and mimics the helpful “sink tokens” trick without adding actual tokens. (Hugging Face)
- (Outside attention) MoE feed-forwards with SwiGLU and softmax-after-top-k routing; not needed to grasp the attention path we’ll code here. (Hugging Face)
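To make the local/global distinction concrete, here is a minimal PyTorch sketch of the two mask variants. The 128-token window and the even/odd layer pattern are illustrative assumptions, not the exact GPT-OSS configuration.

```python
# Minimal sketch (PyTorch) of the two mask variants used by alternating layers.
# Window size and the even/odd layer pattern are illustrative, not the real config.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True = "query i may attend to key j"; classic lower-triangular mask (j <= i).
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    return j <= i

def sliding_window_mask(seq_len: int, window: int = 128) -> torch.Tensor:
    # Causal AND within the last `window` positions: i - window < j <= i.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

seq_len, n_layers = 1024, 4
# Alternate: here even layers are local, odd layers global.
layer_masks = [
    sliding_window_mask(seq_len) if layer % 2 == 0 else causal_mask(seq_len)
    for layer in range(n_layers)
]
print(layer_masks[0].sum(-1)[-3:])  # local layer: late rows see at most 128 keys
print(layer_masks[1].sum(-1)[-3:])  # global layer: row i sees i + 1 keys
```

In practice these boolean masks become additive masks (0 where allowed, a large negative number where blocked) that are added to the attention scores before the softmax.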
2) Mental model
- Global layers: classic causal attention.
- Local layers: same math, but the mask also hides keys outside a sliding window (e.g., only the last 128 positions are visible).
- GQA: compute K, V with far fewer heads, then `repeat_kv` to match the number of Q heads (see the sketch after this list).
- Sink: compute the usual softmax numerators, but divide by (sum(exp(logits)) + exp(sink_h)) per head. (Equivalently: softmax over the N real logits plus 1 learned “ghost” logit; the ghost doesn’t contribute to the output vector.)
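A minimal PyTorch sketch of the `repeat_kv` step; the (batch, heads, seq, head_dim) layout and the 64/8 head counts are assumptions echoing the example numbers above, not a verified GPT-OSS configuration.

```python
# Minimal sketch (PyTorch) of GQA's repeat_kv: K/V are computed with few heads,
# then each K/V head is shared by a group of Q heads. Shapes are assumed.
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (B, n_kv_heads, T, D) -> (B, n_kv_heads * n_rep, T, D) with no new computation.
    if n_rep == 1:
        return x
    b, n_kv, t, d = x.shape
    return x[:, :, None, :, :].expand(b, n_kv, n_rep, t, d).reshape(b, n_kv * n_rep, t, d)

B, T, D = 2, 16, 64
n_q_heads, n_kv_heads = 64, 8
q = torch.randn(B, n_q_heads, T, D)
k = torch.randn(B, n_kv_heads, T, D)  # only 8 K heads are ever computed/cached
v = torch.randn(B, n_kv_heads, T, D)

k_rep = repeat_kv(k, n_q_heads // n_kv_heads)  # broadcast K to 64 heads to line up with Q
v_rep = repeat_kv(v, n_q_heads // n_kv_heads)
scores = q @ k_rep.transpose(-2, -1) / D**0.5  # (B, 64, T, T): one score map per Q head
print(scores.shape)
```

Real implementations often fuse the repeat into the attention kernel so the expanded K/V are never materialized; the sketch materializes them for clarity.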
What is the “attention sink”?
It’s a learned “none-of-the-above” option for each head during softmax attention.
Normally, a head must distribute 100% of its probability mass across real keys (past tokens). But sometimes none of the past tokens are actually helpful for that head at that moment. Forcing the head to pick something can create noisy, spiky gradients and random “fake matches.”
The sink gives each head a safe place to put leftover probability mass. Think of it as a ghost key with no value (so it doesn’t change the content), just a place for probability to go when the head isn’t confident.
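Here is a minimal PyTorch sketch of that “ghost key” view, matching the denominator formula in the mental-model list above; the name `sink_logit` and the shapes are illustrative assumptions, not the actual GPT-OSS code.

```python
# Minimal sketch (PyTorch) of the per-head learned sink: softmax over the N real
# (masked) logits plus one learned "ghost" logit per head.
import torch

B, H, T = 2, 4, 16                # batch, heads, sequence length
scores = torch.randn(B, H, T, T)  # pretend these are the masked attention logits
sink_logit = torch.zeros(H)       # one learned scalar per head (an nn.Parameter in practice)

# Append the ghost logit as an extra "key" column, softmax, then split it back off.
sink_col = sink_logit.view(1, H, 1, 1).expand(B, H, T, 1)
probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
attn, sink_mass = probs[..., :-1], probs[..., -1]

# Equivalent to dividing exp(scores) by (sum(exp(scores)) + exp(sink_h)): each row of
# `attn` now sums to 1 - sink_mass, and since the ghost key has no value vector, the
# "missing" mass simply scales the head's output down when nothing matches well.
print(attn.sum(-1)[0, 0, :3])  # each entry < 1.0
print(sink_mass[0, 0, :3])
```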