Transformer Architecture

Definition

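The Transformer is the foundational architecture of modern large language models and many multimodal models. Its core idea is to model relationships within a sequence using attention (in particular multi-head attention), replacing recurrence- or convolution-dominated designs with a more parallelizable computation. Since its introduction in 2017, it has arguably been the single most influential architectural innovation across NLP, computer vision, and multimodal learning.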

Key Papers

Paper	Year	Institution	Core contribution
Attention Is All You Need (Vaswani et al.)	2017	Google Brain	Proposes a purely attention-based architecture that dispenses with RNNs/CNNs; introduces Multi-Head Attention and Positional Encoding
BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al.)	2018	Google AI	Bidirectional encoder pre-training with Masked LM + Next Sentence Prediction
GPT: Improving Language Understanding by Generative Pre-Training (Radford et al.)	2018	OpenAI	Decoder-only autoregressive language model; starts the GPT line of architectures
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al.)	2019	CMU/Google	Segment-level recurrence + relative positional encoding; goes beyond a fixed-length context
RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al.)	2021	Zhuiyi Technology	RoPE (Rotary Position Embedding); later adopted as the standard positional encoding in most mainstream models
LLaMA: Open and Efficient Foundation Language Models (Touvron et al.)	2023	Meta	"Llama-ified Transformer": Pre-Norm + SwiGLU + RoPE + RMSNorm; became the de facto standard for open-source architectures

Architecture Evolution

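Variant	Core improvement	Representative models
Encoder-Decoder (original)	Bidirectional attention in the encoder + masked self-attention and cross-attention in the decoder	T5, BART
Encoder-Only	Bidirectional self-attention; suited to understanding and classification	BERT, RoBERTa
Decoder-Only	Causal mask + autoregressive generation	GPT series, Llama, Qwen, DeepSeek
Multi-Query Attention (MQA)	All query heads share a single Key/Value head, shrinking the KV cache	Falcon, PaLM
Grouped-Query Attention (GQA)	Query heads share Key/Value heads in groups; a middle ground between MQA and MHA	Llama 2/3, Mistral, Gemma
Flash Attention	IO-aware exact attention that reduces HBM reads and writes	Nearly all modern models (via xformers/Triton)
Multi-Head Latent Attention (MLA)	Low-rank compression of Key/Value for aggressive KV cache reduction	DeepSeek V2/V3/V4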
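
The difference between the MHA, GQA, and MQA rows comes down to how many query heads read from each Key/Value head. The sketch below illustrates that mapping under one assumption: query heads are assigned to KV heads in contiguous, equally sized groups. The function names `kv_head_for_query` and `sharing_map` are hypothetical and do not come from any library.

```python
def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Return the index of the KV head that query head `q_head` reads from.

    Assumes contiguous grouping: the first group of query heads uses KV head 0,
    the next group uses KV head 1, and so on.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def sharing_map(n_q_heads, n_kv_heads):
    """List the KV head used by every query head."""
    return [kv_head_for_query(h, n_q_heads, n_kv_heads) for h in range(n_q_heads)]


if __name__ == "__main__":
    print("MHA (8 query, 8 KV):", sharing_map(8, 8))
    print("GQA (8 query, 2 KV):", sharing_map(8, 2))
    print("MQA (8 query, 1 KV):", sharing_map(8, 1))
```

With 8 query heads, MHA caches 8 KV heads, the GQA example caches 2, and MQA caches 1, which is where the KV cache savings in the table come from.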

Core Components

1. Multi-Head Attention

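The input is passed through three linear projections to produce the Query (Q), Key (K), and Value (V) tensors:

Attention(Q,K,V) = softmax(QK^T / √d_k) · V
Multi-Head: h independent attention heads are computed in parallel; their outputs are concatenated and passed through a final linear projection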
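
The formula above can be sketched in plain Python. This is a minimal illustration, not an efficient or complete implementation: it works on lists of floats, and `multi_head` assumes identity projections (it omits the learned Q/K/V and output matrices) so that only the split, attend, and concatenate steps are visible. The optional `causal` flag is an addition for illustration and masks future positions as in decoder-only models.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head.

    Q, K: lists of n vectors of size d_k; V: list of n vectors of size d_v.
    Returns (outputs, weights), where weights[i][j] is how much position i
    attends to position j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))  # future position is masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


def multi_head(X, n_heads, causal=False):
    """Self-attention with identity projections (an assumption for brevity):
    split features into heads, attend per head, concatenate the results."""
    d_head = len(X[0]) // n_heads
    heads = []
    for h in range(n_heads):
        Xh = [row[h * d_head:(h + 1) * d_head] for row in X]
        out, _ = attention(Xh, Xh, Xh, causal)
        heads.append(out)
    return [sum((heads[h][i] for h in range(n_heads)), []) for i in range(len(X))]


if __name__ == "__main__":
    X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
    for row in multi_head(X, n_heads=2, causal=True):
        print([round(v, 3) for v in row])
```

Under the causal mask the first position can only attend to itself, so its output equals its own value vector.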

2. Positional Encoding

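Self-attention is permutation equivariant, so the model needs an explicit position signal:

Sinusoidal (original): sine and cosine functions at fixed frequencies
Learned (BERT/GPT): trainable embedding vectors
RoPE (current standard): rotation matrices that encode relative position; now the standard choice in mainstream models such as Llama, Qwen, and DeepSeek
ALiBi (Press et al.): adds a distance-proportional bias to attention scores, supporting length extrapolation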
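
To make the "relative position" property of RoPE concrete, here is a minimal sketch. It assumes the interleaved pairing of dimensions, (x0, x1), (x2, x3), and so on, with frequencies base^(-2i/d) and base 10000; some implementations pair dimensions differently, so treat the layout as an assumption. The helper names `rope` and `dot` are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of `vec`.

    Pair number p (covering indices 2p and 2p+1) is rotated by the angle
    pos * base ** (-2p / d), where d is the (even) vector length.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 2p
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Same relative offset (query is 2 positions after the key) at two
    # different absolute positions.
    print(dot(rope(q, 3), rope(k, 1)))
    print(dot(rope(q, 10), rope(k, 8)))
```

Both printed scores agree up to floating-point error, because rotating the query by position m and the key by position n leaves a dot product that depends only on m − n.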

3. Feed-Forward Network (FFN)

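Each Transformer layer contains a feed-forward network:

Original: FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂
Current standard (Llama architecture): FFN_SwiGLU(x) = (SiLU(xW_g) ⊙ xW₁) W₂, where ⊙ is element-wise multiplication
The FFN hidden dimension is typically about 2.7x to 4x the model dimension (roughly 8/3 for SwiGLU variants, 4 for the original)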
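
The two FFN variants can be compared side by side. The sketch below is illustrative only: it operates on a single token vector with weights given as nested lists, and the names `matvec`, `ffn_relu`, and `ffn_swiglu` are hypothetical. The SwiGLU form follows the gated definition, a SiLU-activated gate multiplied element-wise with an un-activated "up" projection, with no bias terms.

```python
import math


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def ffn_relu(x, W1, b1, W2, b2):
    """Original FFN: ReLU(x W1 + b1) W2 + b2."""
    h = [max(0.0, v + b) for v, b in zip(matvec(x, W1), b1)]
    return [v + b for v, b in zip(matvec(h, W2), b2)]


def ffn_swiglu(x, Wg, W1, W2):
    """Gated FFN: (SiLU(x Wg) * (x W1)) W2, with element-wise gating."""
    gate = [silu(v) for v in matvec(x, Wg)]
    up = matvec(x, W1)
    return matvec([g * u for g, u in zip(gate, up)], W2)


if __name__ == "__main__":
    x = [1.0, -1.0]
    W_up = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]      # d_model=2 -> hidden=3
    W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hidden=3 -> d_model=2
    print(ffn_relu(x, W_up, [0.0, 0.0, 0.0], W_down, [0.5, 0.5]))
    print(ffn_swiglu(x, W_up, W_up, W_down))
```

The gated variant uses three weight matrices instead of two, which is why its hidden dimension is usually smaller: with hidden size 8/3·d, three matrices cost 3 · d · (8/3)d = 8d² parameters, the same as two matrices at hidden size 4d.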

4. Normalization

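Pre-Norm (current standard): normalization is applied before each sublayer, inside the residual branch; training is more stable
Post-Norm (original): LayerNorm is applied after each sublayer's residual addition; typically requires learning-rate warmup
RMSNorm: a simplified LayerNorm that rescales by the root mean square only, with no mean subtraction, saving computation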
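
The difference between LayerNorm and RMSNorm is easiest to see numerically. The sketch below omits the learned gain and bias parameters for brevity (an assumption; real layers include at least a learned gain), and the `eps` value is an arbitrary illustrative choice.

```python
import math


def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """Divide by the root mean square; the mean is not subtracted."""
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print("LayerNorm:", [round(v, 4) for v in layer_norm(x)])
    print("RMSNorm:  ", [round(v, 4) for v in rms_norm(x)])
```

LayerNorm output is centered at zero, while RMSNorm output keeps the sign and relative offset of the input and only fixes its scale, which removes one reduction (the mean) per call.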

Architectural Patterns

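Layer structure of a modern decoder-only Transformer:

Input → Token Embedding
  ↓
[repeat for L layers]:
  Layer: RMSNorm → Self-Attention (GQA/MQA/MLA; RoPE applied to Q and K) → Residual add → RMSNorm → FFN (SwiGLU) → Residual add
  ↓
Final RMSNorm → LM Head (optionally weight-tied with the embedding) → Softmax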
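
The wiring of the layer stack above can be sketched with the sublayers passed in as plain callables. This is a toy illustration of the Pre-Norm residual pattern only: it treats the hidden state as a single vector, and `pre_norm_block` and `run_stack` are hypothetical names. Real attention mixes information across positions, which this sketch does not model.

```python
def pre_norm_block(x, norm1, attn, norm2, ffn):
    """One Pre-Norm layer.

    h   = x + attn(norm1(x))
    out = h + ffn(norm2(h))
    The residual path carries x forward unnormalized; only the sublayer
    inputs are normalized.
    """
    h = [a + b for a, b in zip(x, attn(norm1(x)))]
    return [a + b for a, b in zip(h, ffn(norm2(h)))]


def run_stack(x, layers, final_norm):
    """Apply each layer in turn, then the final normalization.

    `layers` is a list of (norm1, attn, norm2, ffn) tuples.
    """
    for norm1, attn, norm2, ffn in layers:
        x = pre_norm_block(x, norm1, attn, norm2, ffn)
    return final_norm(x)


if __name__ == "__main__":
    identity = lambda v: v
    zero = lambda v: [0.0] * len(v)     # stand-in sublayer that outputs nothing
    double = lambda v: [2 * t for t in v]  # stand-in FFN

    print(pre_norm_block([1.0, 2.0], identity, zero, identity, double))
    print(run_stack([1.0, 2.0], [(identity, zero, identity, zero)] * 3, identity))
```

When every sublayer outputs zero, the stack reduces to the identity function. That direct residual path from input to output is one common explanation for why Pre-Norm stacks are easier to train at depth.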

Engineering Significance

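Parallel training: unlike an RNN, a Transformer has no sequential dependency across positions during training, which makes large-scale GPU-parallel training practical (autoregressive decoding at inference time still proceeds token by token)
Information flow: models long-range dependencies far better than RNN/LSTM, but the O(n²) cost of attention in sequence length n is the main constraint in long-context settings
Scalability: scaling laws show empirically that Transformer performance improves predictably with parameters, data, and compute
Quantization-friendly: because the structure is regular, Transformer models (especially decoder-only ones) work well with a range of quantization methods for inference, such as GPTQ, AWQ, and the quantized formats stored in GGUF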
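
As a rough worked example of the O(n²) point above, the following back-of-the-envelope helpers count attention score entries and matrix-multiply FLOPs for one layer. They assume one multiply-add counts as 2 FLOPs and ignore the Q/K/V/output projections, softmax, and any kernel-level optimizations; the function names are hypothetical.

```python
def attention_score_entries(seq_len, n_heads):
    """Number of entries in the attention score matrices of one layer."""
    return n_heads * seq_len * seq_len


def attention_matmul_flops(seq_len, d_model):
    """Approximate FLOPs for QK^T plus weights @ V in one layer.

    QK^T costs about 2 * n * n * d and weights @ V another 2 * n * n * d,
    summed over all heads (d is the full model dimension).
    """
    return 4 * seq_len * seq_len * d_model


if __name__ == "__main__":
    for n in (2048, 4096, 8192):
        print(n, attention_score_entries(n, 32), attention_matmul_flops(n, 4096))
```

Doubling the sequence length quadruples both quantities, which is why long-context work focuses on reducing or restructuring this term.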

Why It Matters

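The Transformer is the foundation page for understanding nearly every mainstream model line, including OpenAI, Anthropic, DeepSeek, Llama, Qwen, and Google Gemini & DeepMind
It provides the underlying model capability for higher-level systems such as Retrieval Augmented Generation, AI Agents, and Multimodal Models
Many later directions, including long context (ALiBi, RoPE extensions), sparse attention, and Mixture of Experts, can be understood as evolutions of the Transformer

Related concepts: Mixture of Experts, Retrieval Augmented Generation, AI Agents, Scaling Laws, Multimodal Models, Transformer vs SSM (Mamba / RWKV / Jamba)
Related entities: OpenAI, Llama, DeepSeek, Google Gemini & DeepMind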

Engineering Pitfalls

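Positional-encoding extrapolation: quality drops sharply beyond the training context length; RoPE extension techniques (NTK-aware scaling, YaRN, temperature scaling) help but are not a silver bullet
KV cache blow-up: with multi-head attention, KV cache memory grows in proportion to layers × KV heads × head dimension × sequence length (and batch size); MLA and GQA mitigate this but do not eliminate it
Train-inference alignment: Pre-Norm with a fixed learning-rate schedule works during training, but the input distribution at inference time may drift from the training distribution
Weight tying: sharing weights between the LM head and the embedding reduces parameter count but may limit model capacity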
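
The KV cache pitfall above can be quantified with simple arithmetic. The sketch below assumes a standard cache that stores one K and one V tensor per layer at a fixed element width; the configuration in the example (32 layers, head dimension 128, 4096 tokens, 2 bytes per element) is hypothetical and chosen only for round numbers. It does not model MLA-style latent compression or paged and quantized caches.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer.

    Two tensors (K and V) per layer, each of shape
    [batch, n_kv_heads, seq_len, head_dim].
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    gib = 1024 ** 3
    mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
    mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, head_dim=128, seq_len=4096)
    print("MHA:", mha / gib, "GiB")
    print("GQA:", gqa / gib, "GiB")
    print("MQA:", mqa / gib, "GiB")
```

In this hypothetical configuration, going from 32 KV heads to 8 cuts the cache from 2 GiB to 0.5 GiB per sequence, but the cost still grows linearly with sequence length and batch size.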

Open Questions

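How will the Transformer keep evolving toward very long contexts and efficient inference?
Will its dominance be gradually eroded by new foundational architectures, such as state space and recurrent models like Mamba and RWKV?
Can linear attention replace O(n²) softmax attention without a significant loss in quality?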

Sources

raw/papers/attention-is-all-you-need-1706.03762-2026-04-26.md
raw/articles/transformer-wikipedia-summary-2026-04-26.md
"Attention Is All You Need" (Vaswani et al., 2017)
RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023)
FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al., 2022)
