Qwen-3 Next is built to push context length and throughput way up without wrecking quality. To do that, it swaps “every layer = full softmax attention” for a hybrid stack: most layers use an efficient linear-attention kernel (Gated DeltaNet) and the rest keep full (softmax) attention. Around that, it adds GQA, RoPE, QK-norm, a dynamic cache, and a sparse MoE MLP so capacity rises while active FLOPs per token stay low.
Here’s why each piece is there:
Hybrid attention (interleaving linear + full)
Linear attention (here, Gated DeltaNet) runs in time linear in sequence length with a fixed-size state, so KV memory and latency don't explode with long prompts. But softmax attention still wins on exact recall and some reasoning-heavy tasks. Interleaving them captures most of the speed/memory win while preserving global coherence and in-context-learning (ICL) quality. This is the design Qwen calls out for long-context efficiency. (Hugging Face)
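For intuition, here's a minimal (ungated) linear-attention recurrence; the point is that the running state is a fixed d×d matrix, so memory is independent of prompt length. This is a toy sketch, not Qwen's kernel:

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Toy linear-attention recurrence. q, k, v: (T, d).
    The state S is a fixed (d, d) matrix, so memory does not grow with T,
    unlike a softmax KV cache that stores every past K/V."""
    T, d = q.shape
    S = torch.zeros(d, d)                  # fixed-size associative memory
    outputs = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])    # write: rank-1 update
        outputs.append(S.T @ q[t])         # read: query the state
    return torch.stack(outputs)

# The state stays d*d floats whether T is 1k or 1M tokens.
out = linear_attention_recurrent(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
```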
Why Gated DeltaNet specifically (vs older linear attentions/Mamba)
Gated DeltaNet combines gating (fast erase/retain) with the delta-rule (precise, targeted memory updates) and trains efficiently with chunked kernels. It beats prior linear models (incl. Mamba2/DeltaNet) on language modeling, retrieval, length extrapolation, and long-context understanding—so it’s a strong drop-in for the “fast layers.” Qwen’s code calls these “linear_attention” layers and uses the FLA Triton kernels when available. (arXiv)
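For intuition, a simplified recurrent form of the gated delta rule (scalar gates, one head, no chunking); the real layers use FLA's chunked Triton kernels and a richer parameterization:

```python
import torch

def gated_delta_step(S, k, v, alpha, beta):
    """One step of a simplified gated delta rule.
    S:     (d_v, d_k) associative memory mapping keys -> values
    alpha: gate in (0, 1) -- retain/erase old memory globally
    beta:  in (0, 1)      -- strength of the targeted delta-rule write"""
    pred = S @ k                                         # what memory currently returns for this key
    return alpha * S + beta * torch.outer(v - pred, k)   # decay, then correct only this slot

def read(S, q):
    return S @ q

# Write one key/value pair into empty memory, then read it back.
d = 4
S = torch.zeros(d, d)
k = torch.randn(d); k = k / k.norm()
v = torch.randn(d)
S = gated_delta_step(S, k, v, alpha=0.95, beta=1.0)
print(read(S, k))   # recovers v exactly here (empty memory, unit-norm key, beta=1)
```

The gate lets the model forget stale content quickly, while the delta term overwrites only what the current key addresses instead of blurring the whole state.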
Full (softmax) attention layers kept periodically
These “re-sync” information across the entire sequence so entities/topics can hop long distances. That’s how the model avoids the quality cliff you’d get with only local/linear mechanisms. Qwen’s docs and partner runtime posts highlight this hybrid as the core idea. (Hugging Face)
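As a rough sketch of "kept periodically": the schedule is just a per-layer type list read from the config. The 3:1 linear-to-full split below is an assumption for illustration, not the official ratio:

```python
# Assumed 3:1 split for illustration; the real schedule comes from the model
# config's per-layer types ("linear_attention" vs "full_attention").
num_layers, full_every = 48, 4
layer_types = [
    "full_attention" if (i + 1) % full_every == 0 else "linear_attention"
    for i in range(num_layers)
]
# -> ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention', ...]
```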
Extreme context targets
Model cards state native contexts like 262k and extendable to ≈1M tokens; you can’t make that practical if every layer stores per-head K/V across the whole prefix. The hybrid + GQA + linear-state path makes those lengths tractable on real hardware. (Hugging Face)
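A back-of-the-envelope shows why: the softmax KV cache grows linearly with the prefix, and only the full-attention layers pay it. Every number below (layer counts, KV heads, head_dim, dtype) is an illustrative assumption, not the official config:

```python
def kv_cache_bytes(seq_len, n_full_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K+V cached across the softmax (full-attention) layers.
    Linear-attention layers keep only a fixed-size state, independent of seq_len."""
    return 2 * seq_len * n_full_layers * n_kv_heads * head_dim * dtype_bytes

# 262k tokens, 12 full-attention layers, 2 KV heads (GQA), head_dim 256, bf16 (all assumed):
gb = kv_cache_bytes(262_144, 12, 2, 256, 2) / 2**30
print(f"{gb:.1f} GiB")   # ~6 GiB, vs ~24 GiB if all 48 layers kept full softmax attention
```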
GQA (Grouped-Query Attention)
Many Q heads, fewer shared K/V heads → the full-attention layers' KV cache shrinks by the grouping factor. That's crucial when you still keep some softmax layers in the stack. (You can see repeat_kv in the code.) (Hugging Face)
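A minimal version of that helper, assuming the usual (batch, num_kv_heads, seq_len, head_dim) layout used in transformers-style GQA code:

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand the shared K/V heads so each group of n_rep query heads
    attends to its own copy (equivalent to repeat_interleave on the head dim)."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```

Only the small num_kv_heads tensors ever get cached; the expansion happens on the fly at attention time.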
QK-norm + RoPE
QK-norm stabilizes the dot-product scales across heads/timesteps (Qwen normalizes per head-dim). RoPE keeps relative position information and supports long-context scaling tricks; Qwen ships a Qwen3NextRotaryEmbedding with dynamic/variant init. Net effect: better stability and extrapolation at long lengths. (Hugging Face)
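An illustrative single-head sketch of both steps (real implementations carry learned norm weights and cache cos/sin per position, which is what the rotary-embedding module handles):

```python
import torch

def qk_norm_and_rope(q, k, cos, sin, eps=1e-6):
    """q, k: (..., head_dim); cos, sin: broadcastable to q/k.
    RMS-normalize queries/keys per head-dim, then apply the rotary rotation."""
    def rms_norm(x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    q, k = rms_norm(q), rms_norm(k)        # QK-norm: bounded dot-product scale
    q = q * cos + rotate_half(q) * sin     # RoPE: relative positions via rotation
    k = k * cos + rotate_half(k) * sin
    return q, k
```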
Sparse MoE MLP
Big capacity when needed, small compute per token (top-k experts + shared expert). That complements the attention savings: you keep expressivity without paying full dense-MLP costs every layer. Qwen’s 80B “A3B” card lists 512 experts with top-k activation and still targets long contexts. (Hugging Face)
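A naive sketch of the routing math (per-token loop for clarity; expert count, top-k, and layer sizes are illustrative stand-ins for the config values on the card):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, shared_expert, top_k=2):
    """x: (tokens, d_model); router_w: (d_model, n_experts);
    experts: list of per-expert MLPs; shared_expert: always-on MLP."""
    weights, idx = torch.topk(F.softmax(x @ router_w, dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)    # renormalize over chosen experts
    out = shared_expert(x).clone()                       # shared expert runs for every token
    for t in range(x.size(0)):                           # route each token to its top-k experts
        for j in range(top_k):
            out[t] += weights[t, j] * experts[int(idx[t, j])](x[t])
    return out

# Tiny usage: 16 "experts" (linear maps) with 2 active per token.
d, n_exp = 8, 16
y = moe_forward(torch.randn(4, d), torch.randn(d, n_exp),
                [torch.nn.Linear(d, d) for _ in range(n_exp)], torch.nn.Linear(d, d))
```

Compute per token scales with top-k (plus the shared expert), not with the total expert count, which is how the 80B-parameter "A3B" model keeps only a few billion parameters active per token.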
Runtime & ecosystem support
The design plays nicely with modern inference stacks (vLLM, Triton kernels from FLA) for speedups on NVIDIA/AMD/Intel. That’s partly why you see quick runtime support posts emphasizing the hybrid attention path. (GitHub)
Bottom line: Qwen-3 Next chooses this approach to scale context and throughput by a large factor—while holding quality. The linear-attention layers (Gated DeltaNet) carry most of the sequence processing cheaply; periodic full-attention layers maintain global recall/ICL; GQA/RoPE/QK-norm/MoE keep the softmax layers lighter, more stable, and expressive. Public cards and docs explicitly frame this as the reason for the “Next” architecture. (Hugging Face)
Qwen3-Next mixes two token mixers in one stack: