Qwen3-Next is built to push context length and throughput way up without wrecking quality. To do that, it swaps "every layer = full softmax attention" for a hybrid stack: most layers use an efficient linear-attention mixer (Gated DeltaNet) and the rest keep full (softmax) attention. Around that, it adds GQA, RoPE, QK-norm, a dynamic cache, and a sparse MoE MLP, so capacity rises while active FLOPs per token stay low.
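To make the "hybrid stack" idea concrete, here is a minimal sketch (not the actual Transformers code) of a per-layer schedule where most layers are linear attention and every Nth layer is full softmax attention. The `build_layer_types` helper and the every-4th-layer ratio are illustrative assumptions; check the model config for the real schedule.

```python
# Minimal sketch of a hybrid layer schedule. The 4-layer period is
# illustrative, not Qwen3-Next's actual ratio.
from typing import List

def build_layer_types(num_layers: int, full_attn_every: int = 4) -> List[str]:
    """Return a per-layer schedule mixing linear and full attention."""
    layer_types = []
    for layer_idx in range(num_layers):
        if (layer_idx + 1) % full_attn_every == 0:
            layer_types.append("full_attention")    # periodic softmax layer
        else:
            layer_types.append("linear_attention")  # Gated DeltaNet layer
    return layer_types

print(build_layer_types(8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```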

Here’s why each piece is there:

- The linear-attention layers (Gated DeltaNet) carry most of the sequence processing cheaply.
- Periodic full-attention layers maintain global recall and in-context learning.
- GQA, RoPE, QK-norm, and the sparse MoE MLP keep the softmax layers lighter, more stable, and expressive.

Bottom line: Qwen3-Next chooses this approach to scale context and throughput by a large factor while holding quality. Public model cards and docs explicitly frame this as the reason for the "Next" architecture. (Hugging Face)
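A back-of-envelope comparison shows why the cheap linear layers matter at long context: full-attention layers keep a KV cache that grows with sequence length, while a Gated DeltaNet layer keeps a fixed-size recurrent state per head. All dimensions below are made up for illustration, not Qwen3-Next's real ones.

```python
# Illustrative cache-memory arithmetic (all dims are placeholders).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # keys + values for every cached token in every full-attention layer
    return num_layers * 2 * num_kv_heads * head_dim * seq_len * bytes_per_el

def linear_state_bytes(num_layers, num_heads, head_dim, bytes_per_el=2):
    # one (head_dim x head_dim) state matrix per head, independent of seq_len
    return num_layers * num_heads * head_dim * head_dim * bytes_per_el

seq_len = 256_000
all_full = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, seq_len=seq_len)
hybrid = (kv_cache_bytes(num_layers=12, num_kv_heads=8, head_dim=128, seq_len=seq_len)
          + linear_state_bytes(num_layers=36, num_heads=16, head_dim=128))
print(f"all-full-attention cache: {all_full / 1e9:.1f} GB")  # ~50 GB
print(f"hybrid (1/4 full) cache:  {hybrid / 1e9:.1f} GB")    # ~13 GB
```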


Code: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_next/modeling_qwen3_next.py#L74
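The modeling file linked above dispatches per layer on the config's layer schedule. A quick way to inspect which layers are which, assuming the `Qwen/Qwen3-Next-80B-A3B-Instruct` checkpoint name and a transformers version that includes the model; the `layer_types` field is an assumption here, so the snippet falls back to printing the whole config if it is absent.

```python
# Inspect the per-layer schedule from the published config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Instruct")
layer_types = getattr(config, "layer_types", None)  # assumption: may not exist
if layer_types is not None:
    for idx, kind in enumerate(layer_types):
        print(f"layer {idx:2d}: {kind}")
else:
    print(config)  # inspect the config fields manually
```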


Qwen3-Next interleaves two token mixers in one stack:

  1. Full attention layers (classic softmax attention, but with GQA, RoPE, QK-norm, and an output gate).
  2. Linear attention layers via Gated DeltaNet (a conv + recurrent kernel with per-head states).
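To make the second mixer concrete, here is a heavily simplified, per-head, sequential sketch of a gated delta-rule recurrence, the idea behind Gated DeltaNet. The real layer also uses a short causal convolution, q/k normalization, an output gate, and a chunked parallel implementation; the shapes and gating parameterization below are illustrative assumptions, not the library's API.

```python
# Simplified gated delta-rule recurrence: a constant-size state matrix is
# decayed by alpha and updated by beta-weighted "replace the value stored
# under this key" writes, then read out with the query.
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """
    q, k:   (seq_len, d_k)  queries / keys (keys assumed roughly unit-norm)
    v:      (seq_len, d_v)  values
    alpha:  (seq_len,)      per-token decay gate in (0, 1)
    beta:   (seq_len,)      per-token write strength in (0, 1)
    Returns (seq_len, d_v) outputs from a fixed-size recurrent state.
    """
    seq_len, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)       # per-head state, independent of seq_len
    outputs = []
    for t in range(seq_len):
        q_t, k_t, v_t = q[t], k[t], v[t]
        pred = S @ k_t              # what the state currently returns for k_t
        # decay the corrected old state, then write beta-scaled v_t under k_t
        S = alpha[t] * (S - torch.outer(beta[t] * pred, k_t)) \
            + torch.outer(beta[t] * v_t, k_t)
        outputs.append(S @ q_t)     # read out with the query
    return torch.stack(outputs)

T, d_k, d_v = 8, 4, 4
out = gated_delta_rule(
    torch.randn(T, d_k),
    torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1),
    torch.randn(T, d_v),
    torch.sigmoid(torch.randn(T)),
    torch.sigmoid(torch.randn(T)),
)
print(out.shape)  # torch.Size([8, 4])
```

The point of the sketch: unlike softmax attention, nothing here grows with sequence length, so the cost per token and the memory footprint stay flat no matter how long the context gets.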