A Close Reading of a Classic: ZeRO
Paper link: https://arxiv.org/pdf/1910.02054

**Ways to parallelize training**

1. Pipeline Parallelism (PP)
2. Model Parallelism (MP)
3. Data Parallelism (DP)

From the paper:

> So, how can we overcome the limitations of existing solutions and train large models more efficiently? To answer this question, we first analyze the full spectrum of memory consumption of the existing systems on model training and classify it into two parts: 1) For large models, the majority of the memory is occupied by model states which include the optimizer states (such as momentum and variances in Adam [6]), gradients, and parameters. 2) The remaining memory is consumed by activation, temporary buffers and unusable fragmented memory, which we refer to collectively as residual states. We develop ZeRO— Zero Redundancy Optimizer — to optimize memory efficiency on both while obtaining high compute and communication efficiency. As these two parts face different challenges, we develop and discuss their solutions correspondingly.

The paper proposes a new form of data parallelism, ZeRO-DP, and introduces ZeRO-R to "optimize the residual memory consumed by these three factors respectively":

> 1) For activations (stored from forward pass in order to perform backward pass), we noticed checkpointing [7] helps but not sufficient for large models. Thus ZeRO-R optimizes activation memory by identifying and removing activation replication in existing MP approaches through activation partitioning. It also offloads activations to CPU when appropriate. 2) ZeRO-R defines appropriate size for temporary buffers to strike for a balance of memory and computation efficiency. 3) We observe fragmented memory during training due to variations in the lifetime of different tensors. Lack of contiguous memory due to fragmentation can cause memory allocation failure, even when enough free memory is available. ZeRO-R proactively manages memory based on the different lifetime of tensors, preventing memory fragmentation.
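As a back-of-the-envelope sanity check on why the model states dominate memory, here is a small sketch (my own, not from the paper's code) of the per-parameter cost, assuming mixed-precision training with Adam as the paper does: fp16 parameters and gradients plus fp32 master weights, momentum, and variance add up to 16 bytes per parameter.

```python
# Per-parameter cost of the "model states" above, assuming mixed-precision
# training with Adam (fp16 params/grads, fp32 optimizer states), as in the paper.

def model_state_bytes(num_params: int) -> int:
    fp16_params = 2 * num_params    # fp16 working copy of the parameters
    fp16_grads = 2 * num_params     # fp16 gradients
    fp32_master = 4 * num_params    # fp32 master copy of the parameters
    fp32_momentum = 4 * num_params  # Adam first moment (m)
    fp32_variance = 4 * num_params  # Adam second moment (v)
    return fp16_params + fp16_grads + fp32_master + fp32_momentum + fp32_variance

for n in (1_000_000_000, 10_000_000_000):
    print(f"{n // 10**9}B params -> {model_state_bytes(n) / 1e9:.0f} GB of model states")
```

This prints 16 GB and 160 GB respectively, before counting any activations, buffers, or fragmentation, which is why a single GPU cannot hold even a 10B-parameter model under plain data parallelism.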
Finally, a summary generated by GPT:

The paper is *ZeRO: Memory Optimizations Toward Training Trillion Parameter Models*, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He of Microsoft. It later became the core technology of DeepSpeed and is one of the most important papers in large-model training systems. It proposes ZeRO (Zero Redundancy Optimizer), whose central goal is to solve the GPU memory bottleneck of large-model training. The walkthrough below goes problem, method, the three ZeRO stages, then results.

**1. The problem the paper solves**

As model sizes grow rapidly (BERT, GPT, ...), training hits one core bottleneck: GPU memory. In standard data parallelism, every GPU keeps a complete copy of the model state. Training must store three kinds of state:

- Parameters
- Gradients
- Optimizer states (e.g. Adam's m / v)

Of these, the optimizer states are the largest. With Adam, memory ≈ parameters + gradients + optimizer states, which in mixed-precision (FP16) training comes to ~16 bytes per parameter. For example:

| Model size | Memory needed |
| --- | --- |
| 1B parameters | ~16 GB |
| 10B parameters | ~160 GB |

This is why plain data parallelism cannot train large models.

**2. Limitations of existing approaches**

The paper analyzes three existing schemes:

1. Data Parallelism. Each GPU keeps the full model; only the batch is split. Problem: memory is O(model size), so it cannot scale.
2. Model Parallelism (e.g. Megatron-LM, Mesh-TensorFlow). Each layer is split across multiple GPUs. Problems: huge communication volume, complex programming, and poor multi-node efficiency (in the paper's experiment, a 40B model trained across nodes reached only 5% of peak GPU utilization).
3. Pipeline Parallelism (e.g. GPipe). Problems: pipeline bubbles and complex implementation.

**3. The core idea: zero redundancy**

The paper observes a key fact: data parallelism stores a large amount of redundant state. In DP, GPU0, GPU1, GPU2, and GPU3 each store parameters + gradients + optimizer states, and these are all identical copies. ZeRO's idea: eliminate the redundant replicas so that each GPU stores only a partition of the state.

**4. The three ZeRO stages (the most important part)**

ZeRO progressively partitions the three kinds of state.

Stage 1: optimizer state partitioning. Only the optimizer states are sharded, e.g. GPU0 holds optimizer shard 0-25%, GPU1 25-50%, GPU2 50-75%, GPU3 75-100%, while parameters and gradients remain full copies. Benefit: up to 4× memory reduction.

Stage 2: gradient partitioning. The gradients are sharded as well:

| State | Partitioned? |
| --- | --- |
| parameters | ❌ |
| gradients | ✅ |
| optimizer states | ✅ |

Benefit: ≈ 8× memory reduction.

Stage 3: parameter partitioning. Finally the parameters are sharded too, so each GPU holds only 1/N of the parameters, 1/N of the gradients, and 1/N of the optimizer states. During training, parameters are all-gathered before the forward pass and gradients are reduce-scattered after the backward pass. The result: memory = O(model_size / N). This is ZeRO's key breakthrough.

**5. ZeRO's communication pattern**

The core communication pattern:

- Forward: all-gather the parameters, collecting them before each layer's computation.
- Backward: reduce-scatter the gradients, so each gradient shard is reduced directly to its owner.

Compared with the traditional all-reduce of gradients, ZeRO is far more memory-efficient.

**6. Experimental results**

At a scale of 400 GPUs: models of up to 100B parameters were trainable, with 15 PFLOPS throughput and super-linear speedup. Compared with the state of the art: 8× larger models and 10× training performance. The authors also used it to train the 17B-parameter Turing-NLG model.

**7. Historical significance**

This paper essentially defined the modern large-model training architecture; many later techniques are extensions of it: ZeRO-Offload (spilling to CPU memory), ZeRO-Infinity (spilling to NVMe), and PyTorch FSDP (Fully Sharded Data Parallel), which is essentially an implementation of ZeRO Stage 3.

**8. One-sentence summary**

ZeRO's core contribution: by partitioning the optimizer states / gradients / parameters, it eliminates the redundant memory of data parallelism and achieves O(model_size / N) memory scaling, making trillion-parameter training possible.
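The stage-by-stage savings above can be reproduced from the memory formulas in the paper. Below is a minimal standalone calculator (a sketch, not DeepSpeed's API), where `psi` is the parameter count, `n_gpus` the data-parallel degree, and `k = 12` the bytes of optimizer state per parameter for mixed-precision Adam:

```python
# Per-GPU model-state memory under the three ZeRO-DP stages, using the
# formulas from the paper: 2 bytes each for fp16 params and grads, plus
# k = 12 bytes/param of optimizer state (fp32 master copy, momentum, variance).

def zero_dp_bytes(psi: int, n_gpus: int, stage: int, k: int = 12) -> float:
    if stage == 0:  # plain data parallelism: every GPU holds full replicas
        return (2 + 2 + k) * psi
    if stage == 1:  # partition optimizer states only
        return 2 * psi + 2 * psi + k * psi / n_gpus
    if stage == 2:  # partition optimizer states + gradients
        return 2 * psi + (2 + k) * psi / n_gpus
    if stage == 3:  # partition optimizer states + gradients + parameters
        return (2 + 2 + k) * psi / n_gpus
    raise ValueError("stage must be 0..3")

psi, n = 7_500_000_000, 64  # the paper's running example: 7.5B params, 64 GPUs
for s in range(4):
    print(f"stage {s}: {zero_dp_bytes(psi, n, s) / 1e9:.1f} GB per GPU")
```

With the paper's running example (a 7.5B-parameter model on 64 GPUs), this prints 120.0 GB under plain data parallelism, then 31.4 GB, 16.6 GB, and 1.9 GB for Stages 1, 2, and 3, matching the breakdown in the paper's Figure 1.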