Qwen-3 Next is built to push context length and throughput way up without wrecking quality. To do that, it swaps “every layer = full softmax attention” for a hybrid stack: most layers use an efficient linear-attention kernel (Gated DeltaNet) and the rest keep full (softmax) attention. Around that, it adds GQA, RoPE, QK-norm, a dynamic cache, and a sparse MoE MLP so capacity rises while active FLOPs per token stay low.
Here’s why each piece is there:
Hybrid attention (interleaving linear + full)
Linear attention (here, Gated DeltaNet) runs in time linear in sequence length with a fixed-size state, so KV memory and latency don't explode with long prompts. But softmax attention still wins on exact recall and some reasoning-heavy tasks. Interleaving them captures most of the speed/memory win while preserving global coherence and in-context-learning (ICL) quality. This is the design Qwen calls out for long-context efficiency. (Hugging Face)
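For intuition, here's a minimal (ungated) linear-attention recurrence; the point is that the running state is a fixed d×d matrix, so memory is independent of prompt length. This is a toy sketch, not Qwen's kernel:

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Toy linear-attention recurrence. q, k, v: (T, d).
    The state S is a fixed (d, d) matrix, so memory does not grow with T,
    unlike a softmax KV cache that stores every past K/V."""
    T, d = q.shape
    S = torch.zeros(d, d)                  # fixed-size associative memory
    outputs = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])    # write: rank-1 update
        outputs.append(S.T @ q[t])         # read: query the state
    return torch.stack(outputs)

# The state stays d*d floats whether T is 1k or 1M tokens.
out = linear_attention_recurrent(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
```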
Why Gated DeltaNet specifically (vs older linear attentions/Mamba)
Gated DeltaNet combines gating (fast erase/retain) with the delta-rule (precise, targeted memory updates) and trains efficiently with chunked kernels. It beats prior linear models (incl. Mamba2/DeltaNet) on language modeling, retrieval, length extrapolation, and long-context understanding—so it’s a strong drop-in for the “fast layers.” Qwen’s code calls these “linear_attention” layers and uses the FLA Triton kernels when available. (arXiv)
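For intuition, a simplified recurrent form of the gated delta rule (scalar gates, one head, no chunking); the real layers use FLA's chunked Triton kernels and a richer parameterization:

```python
import torch

def gated_delta_step(S, k, v, alpha, beta):
    """One step of a simplified gated delta rule.
    S:     (d_v, d_k) associative memory mapping keys -> values
    alpha: gate in (0, 1) -- retain/erase old memory globally
    beta:  in (0, 1)      -- strength of the targeted delta-rule write"""
    pred = S @ k                                         # what memory currently returns for this key
    return alpha * S + beta * torch.outer(v - pred, k)   # decay, then correct only this slot

def read(S, q):
    return S @ q

# Write one key/value pair into empty memory, then read it back.
d = 4
S = torch.zeros(d, d)
k = torch.randn(d); k = k / k.norm()
v = torch.randn(d)
S = gated_delta_step(S, k, v, alpha=0.95, beta=1.0)
print(read(S, k))   # recovers v exactly here (empty memory, unit-norm key, beta=1)
```

The gate lets the model forget stale content quickly, while the delta term overwrites only what the current key addresses instead of blurring the whole state.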
Full (softmax) attention layers kept periodically
These “re-sync” information across the entire sequence so entities/topics can hop long distances. That’s how the model avoids the quality cliff you’d get with only local/linear mechanisms. Qwen’s docs and partner runtime posts highlight this hybrid as the core idea. (Hugging Face)
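As a rough sketch of "kept periodically": the schedule is just a per-layer type list read from the config. The 3:1 linear-to-full split below is an assumption for illustration, not the official ratio:

```python
# Assumed 3:1 split for illustration; the real schedule comes from the model
# config's per-layer types ("linear_attention" vs "full_attention").
num_layers, full_every = 48, 4
layer_types = [
    "full_attention" if (i + 1) % full_every == 0 else "linear_attention"
    for i in range(num_layers)
]
# -> ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention', ...]
```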
Extreme context targets
Model cards state native contexts like 262k and extendable to ≈1M tokens; you can’t make that practical if every layer stores per-head K/V across the whole prefix. The hybrid + GQA + linear-state path makes those lengths tractable on real hardware. (Hugging Face)
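A back-of-the-envelope shows why: the softmax KV cache grows linearly with the prefix, and only the full-attention layers pay it. Every number below (layer counts, KV heads, head_dim, dtype) is an illustrative assumption, not the official config:

```python
def kv_cache_bytes(seq_len, n_full_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K+V cached across the softmax (full-attention) layers.
    Linear-attention layers keep only a fixed-size state, independent of seq_len."""
    return 2 * seq_len * n_full_layers * n_kv_heads * head_dim * dtype_bytes

# 262k tokens, 12 full-attention layers, 2 KV heads (GQA), head_dim 256, bf16 (all assumed):
gb = kv_cache_bytes(262_144, 12, 2, 256, 2) / 2**30
print(f"{gb:.1f} GiB")   # ~6 GiB, vs ~24 GiB if all 48 layers kept full softmax attention
```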
GQA (Grouped-Query Attention)
Many Q heads, fewer shared K/V heads → the full-attention layers' KV cache shrinks by the grouping factor. That's crucial when you still keep some softmax layers in the stack. (You can see repeat_kv in the code.) (Hugging Face)
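A minimal version of that helper, assuming the usual (batch, num_kv_heads, seq_len, head_dim) layout used in transformers-style GQA code:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand the shared K/V heads so each group of n_rep query heads
    attends to its own copy (equivalent to repeat_interleave on the head dim)."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```

Only the small num_kv_heads tensors ever get cached; the expansion happens on the fly at attention time.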
QK-norm + RoPE
QK-norm stabilizes the dot-product scales across heads/timesteps (Qwen normalizes per head-dim). RoPE keeps relative position information and supports long-context scaling tricks; Qwen ships a Qwen3NextRotaryEmbedding with dynamic/variant init. Net effect: better stability and extrapolation at long lengths. (Hugging Face)
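An illustrative single-head sketch of both steps (real implementations carry learned norm weights and cache cos/sin per position, which is what the rotary-embedding module handles):

```python
import torch

def qk_norm_and_rope(q, k, cos, sin, eps=1e-6):
    """q, k: (..., head_dim); cos, sin: broadcastable to q/k.
    RMS-normalize queries/keys per head-dim, then apply the rotary rotation."""
    def rms_norm(x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    q, k = rms_norm(q), rms_norm(k)        # QK-norm: bounded dot-product scale
    q = q * cos + rotate_half(q) * sin     # RoPE: relative positions via rotation
    k = k * cos + rotate_half(k) * sin
    return q, k
```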
Sparse MoE MLP
Big capacity when needed, small compute per token (top-k experts + shared expert). That complements the attention savings: you keep expressivity without paying full dense-MLP costs every layer. Qwen’s 80B “A3B” card lists 512 experts with top-k activation and still targets long contexts. (Hugging Face)
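A naive sketch of the routing math (per-token loop for clarity; expert count, top-k, and layer sizes are illustrative stand-ins for the config values on the card):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, shared_expert, top_k=2):
    """x: (tokens, d_model); router_w: (d_model, n_experts);
    experts: list of per-expert MLPs; shared_expert: always-on MLP."""
    weights, idx = torch.topk(F.softmax(x @ router_w, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)    # renormalize over chosen experts
    out = shared_expert(x).clone()                       # shared expert runs for every token
    for t in range(x.size(0)):                           # route each token to its top-k experts
        for j in range(top_k):
            out[t] += weights[t, j] * experts[int(idx[t, j])](x[t])
    return out

# Tiny usage: 16 "experts" (linear maps) with 2 active per token.
d, n_exp = 8, 16
y = moe_forward(torch.randn(4, d), torch.randn(d, n_exp),
                [torch.nn.Linear(d, d) for _ in range(n_exp)], torch.nn.Linear(d, d))
```

Compute per token scales with top-k (plus the shared expert), not with the total expert count, which is how the 80B-parameter "A3B" model keeps only a few billion parameters active per token.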
Runtime & ecosystem support
The design plays nicely with modern inference stacks (vLLM, Triton kernels from FLA) for speedups on NVIDIA/AMD/Intel. That’s partly why you see quick runtime support posts emphasizing the hybrid attention path. (GitHub)
Bottom line: Qwen-3 Next chooses this approach to scale context and throughput by a large factor—while holding quality. The linear-attention layers (Gated DeltaNet) carry most of the sequence processing cheaply; periodic full-attention layers maintain global recall/ICL; GQA/RoPE/QK-norm/MoE keep the softmax layers lighter, more stable, and expressive. Public cards and docs explicitly frame this as the reason for the “Next” architecture. (Hugging Face)
Qwen3-Next mixes two token mixers in one stack: