1) What’s different in GPT-OSS (in plain words)


2) Mental model

What is the “attention sink”?

It’s a learned “none-of-the-above” option for each head during softmax attention.

Normally, a head must distribute 100% of its probability mass across the real keys (past tokens). But sometimes none of those tokens are actually useful to that head at that moment, and forcing it to pick something anyway can create noisy, spiky gradients and spurious "fake matches."
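
To make the "must sum to 100%" point concrete, here is a tiny sketch in plain PyTorch (illustrative numbers, not GPT-OSS code): even when every score is low, a vanilla softmax still hands out all of the probability mass to the real keys.

```python
import torch
import torch.nn.functional as F

# Even when every score is low (no key is a good match), vanilla softmax
# still distributes 100% of the probability mass across the real keys.
scores = torch.tensor([-4.0, -4.2, -4.1])
probs = F.softmax(scores, dim=-1)
print(probs)        # roughly [0.37, 0.30, 0.33] — spread out, not abstaining
print(probs.sum())  # tensor(1.) — always sums to 1
```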

The sink gives each head a safe place to put leftover probability mass. Think of it as a ghost key with no value (so it doesn’t change the content), just a place for probability to go when the head isn’t confident.
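
Here is a minimal sketch of that idea in PyTorch, assuming a learned per-head scalar `sink_logit` (the name, shapes, and wiring are illustrative assumptions, not GPT-OSS's actual implementation): the sink is appended as one extra logit column before the softmax and dropped afterwards, so the real attention weights can sum to less than 1.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: (batch, heads, seq, head_dim); sink_logit: (heads,)
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5               # (B, H, T, T)
    # One "ghost key" logit per head, broadcast to every query position.
    sink = sink_logit.view(1, -1, 1, 1).expand(
        scores.shape[0], -1, scores.shape[2], 1)                # (B, H, T, 1)
    scores = torch.cat([scores, sink], dim=-1)                  # (B, H, T, T+1)
    probs = F.softmax(scores, dim=-1)
    # Drop the sink column: it has no value vector, so whatever mass it
    # absorbed simply vanishes, and the real weights now sum to <= 1.
    probs = probs[..., :-1]                                     # (B, H, T, T)
    return probs @ v                                            # (B, H, T, head_dim)

# Usage example (random tensors, no causal mask, purely for shape checking):
q = k = v = torch.randn(1, 2, 5, 8)
sink_logit = torch.zeros(2, requires_grad=True)  # a learned parameter in practice
print(attention_with_sink(q, k, v, sink_logit).shape)  # torch.Size([1, 2, 5, 8])
```

The sketch omits the causal mask and any head grouping; the only point it illustrates is that the sink column soaks up probability without contributing a value vector, which is exactly the "ghost key with no value" picture above.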