CANN Cosmos NPU Multi-Card Parallel Optimization

2026/5/10 10:33:05

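Cosmos Multi-Card Parallel Optimization on Ascend NPU

Project: cann-recipes-embodied-intelligence. This project provides CANN-based optimization samples for typical models and acceleration algorithms used in embodied-intelligence workloads.
Project address: https://gitcode.com/cann/cann-recipes-embodied-intelligence

1. Optimization Overview

This work systematically improves multi-card parallel inference for the Cosmos family of world foundation models on the Ascend NPU platform. It covers two models:

- Cosmos-Transfer2.5-2B: a multi-ControlNet model for video style transfer
- Cosmos-Predict2.5-2B: a world foundation model for video generation

The focus is on enabling multi-card parallelism, including CFG (Classifier-Free Guidance) parallelism, context parallelism, and NPU device management, so that distributed inference runs efficiently in an Ascend multi-card environment. NPU-specific optimizations were also applied: Flash Attention replacement, RMSNorm fused-operator adaptation, and rotary position-embedding optimization.

2. Enabling Multi-Card Parallelism

2.1 Multi-Card Parallelism for Cosmos on NPU

Cosmos-Predict2.5 and Cosmos-Transfer2.5 can currently run multi-card parallel inference on NPU after the npu_adapt.sh script has been run.

2.2 CFG Parallelism Fix

Cosmos-Transfer2.5 natively supports video-to-video style transfer with several control modalities (depth maps, semantic segmentation, edge detection, and others). To make large-scale inference more efficient, the following parallel strategies are needed:

- CFG parallelism (Classifier-Free Guidance Parallelism): splits the NPUs into two groups that handle the conditional and unconditional denoising tasks respectively, which improves scaling on large clusters
- Context parallelism (Context Parallelism): distributes long sequences of video frames across devices, which supports very long video generation

2.3 Core Changes

2.3.1 Configuration Layer (cosmos_transfer2/config.py)

A new parallelism control parameter is added to the SetupArguments dataclass:

    # Add a new parameter to the SetupArguments dataclass
    enable_cfg_parallel: bool = False
    """Enable Classifier-Free Guidance parallelism for better scaling across more NPUs.
    Splits NPUs into two groups for conditional/unconditional denoising."""

2.3.2 Inference Layer Refactoring (Control2WorldInference.__init__)

Modified file: cosmos_transfer2/inference.py
Patch file: adaptor_patches/inference_patch.py

Key code changes:

    # Original code (official version)
    self.device_rank = 0
    process_group = None
    if args.context_parallel_size > 1:
        from megatron.core import parallel_state
        distributed.init()
        parallel_state.initialize_model_parallel(context_parallel_size=args.context_parallel_size)
        process_group = parallel_state.get_context_parallel_group()

    # Optimized code (Ascend-adapted version)
    self.device_rank = 0
    cfg_parallel = args.enable_cfg_parallel  # New: read the CFG parallelism flag
    process_group = None
    if args.context_parallel_size > 1:
        from megatron.core import parallel_state
        distributed.init()
        # Choose the context-parallel size based on cfg_parallel
        if cfg_parallel:
            # CFG parallel mode: split the cards in half, one half for the
            # conditional branch and one half for the unconditional branch
            parallel_state.initialize_model_parallel(context_parallel_size=args.context_parallel_size // 2)
        else:
            # Standard mode: use all cards for context parallelism
            parallel_state.initialize_model_parallel(context_parallel_size=args.context_parallel_size)
        process_group = parallel_state.get_context_parallel_group()

Logic:

- CFG parallel mode (enable_cfg_parallel=True): with 8 cards in total (args.context_parallel_size = 8), the effective context-parallel size is 8 // 2 = 4. Four cards handle the conditional denoising branch and four cards handle the unconditional denoising branch.
- Standard parallel mode (enable_cfg_parallel=False): all 8 cards are used for context parallelism.

The cfg_parallel flag is then passed on:

    self.inference_pipeline = ControlVideo2WorldInference(
        ...
        cfg_parallel=cfg_parallel,  # Passed to the downstream pipeline
    )

3. NPU Operator Performance Optimization

3.1 Flash Attention (FA) Replacement

3.1.1 Description

The FlashAttention operator in the source code is replaced with the npu_fusion_attention fused operator from torch_npu. See the Ascend community documentation for details on npu_fusion_attention.

3.1.2 Implementation

(1) Cosmos-Predict2.5-2B calls the torch_npu interface:

    attn_output_bnsd = torch_npu.npu_fusion_attention(
        query_bnsd, key_bnsd, value_bnsd, head_num,
        input_layout="BNSD",
        pse=None,
        atten_mask=self.atten_mask_npu,
        scale=scale,
        pre_tockens=2147483647,
        next_tockens=2147483647,
        keep_prob=1,
        sparse_mode=2,
    )[0]

(2) Cosmos-Transfer2.5-2B calls the native SDPA interface:

    attn_output_bnsd = F.scaled_dot_product_attention(
        query_bnsd, key_bnsd, value_bnsd,
        attn_mask=None, dropout_p=0.0, is_causal=True
    )

3.1.3 Where It Is Applied

Files:

- cosmos-predict2.5/cosmos_predict2/_src/reason1/networks/qwen2_5_vl.py
- cosmos-transfer2.5/cosmos_transfer2/_src/reason1/networks/qwen2_5_vl.py

3.2 RMSNorm Operator Optimization

3.2.1 Description

The custom RMSNorm implementation in the source code is replaced with the npu_rms_norm fused operator built into torch_npu. See the Ascend design documentation for details on npu_rms_norm.

3.2.2 Implementation

(1) Original implementation:

    import torch
    from torch import nn

    class RMSNorm(torch.nn.Module):
        def __init__(self, dim: int, eps: float = 1e-5):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def reset_parameters(self):
            torch.nn.init.ones_(self.weight)

        def _norm(self, x):
            # Scale by the reciprocal root mean square over the last dimension
            return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            output = self._norm(x.float()).type_as(x)
            return output * self.weight

(2) Optimized implementation:

    import torch
    import torch_npu
    from torch import nn

    class RMSNorm(torch.nn.Module):
        def __init__(self, dim: int, eps: float = 1e-5):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def reset_parameters(self):
            torch.nn.init.ones_(self.weight)

        def _norm(self, x):
            # Kept from the original implementation; forward no longer calls it
            return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The fused operator returns a tuple; element 0 is the normalized output
            output = torch_npu.npu_rms_norm(x, self.weight.float(), epsilon=self.eps)[0]
            return output

3.2.3 Where It Is Applied

Files:

- cosmos-predict2.5/cosmos_predict2/_src/predict2/networks/minimal_v4_dit.py
- cosmos-transfer2.5/cosmos_transfer2/_src/predict2/networks/minimal_v4_dit.py

3.3 Rotary Fused-Operator Adaptation

3.3.1 Description

The apply_rotary_pos_emb function that the source code imports from transformer_engine is replaced with the npu_rotary_mul fused operator built into torch_npu. See the Ascend design documentation for details on npu_rotary_mul.

3.3.2 Implementation

    def apply_rotary_pos_emb(
        x: torch.Tensor,
        freqs: torch.Tensor,
    ) -> torch.Tensor:
        radians = freqs.transpose(0, 1)
        cos = torch.cos(radians)
        sin = torch.sin(radians)
        res_rot = torch_npu.npu_rotary_mul(x, cos, sin)
        return res_rot

3.3.3 Where It Is Applied

Files:

- cosmos-predict2.5/cosmos_predict2/_src/predict2/networks/minimal_v4_dit.py
- cosmos-transfer2.5/cosmos_transfer2/_src/predict2/networks/minimal_v4_dit.py

4. Summary

This work enables multi-card parallel inference for the Cosmos models on the Ascend NPU platform and adds NPU-specific operator optimizations.

4.1 Multi-Card Parallel Optimization

Cosmos-Transfer2.5:

- Adds the enable_cfg_parallel parameter, which allows CFG parallelism and context parallelism to be combined flexibly
- Modifies the initialization logic dynamically through inference_patch.py, so no intrusive changes to the source code are required

Cosmos-Predict2.5:

- Applies the NPU adaptation patches dynamically through a monkey-patch mechanism

4.2 Common Features

- Multi-card distributed inference launched with torchrun
- Flexible configuration of the parallel strategy

4.3 Fused-Operator Optimization

- Flash Attention: npu_fusion_attention replaces the standard Flash Attention (Cosmos-Predict2.5-2B; Cosmos-Transfer2.5-2B calls the native SDPA interface)
- RMSNorm: the npu_rms_norm fused operator speeds up normalization
- Rotary position embedding: npu_rotary_mul accelerates the rotary position-embedding computation

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.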
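The card split described in section 2.3.2 can be sketched in plain Python. This is only an illustration of the sizing rule (halve the context-parallel size when enable_cfg_parallel is set). The helper names `RankPlan` and `plan_ranks` are hypothetical, and the assumptions that context-parallel groups are contiguous rank ranges and that the first group takes the conditional branch are not stated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankPlan:
    """Hypothetical helper type: where one rank sits in the CFG/CP layout."""
    rank: int
    branch: str    # "conditional", "unconditional", or "both"
    cp_group: int  # index of the context-parallel group
    cp_rank: int   # position of the rank inside its group


def plan_ranks(world_size: int, enable_cfg_parallel: bool) -> list:
    """Apply the sizing rule from section 2.3.2 to every rank.

    Assumption (not from the article): groups are contiguous rank ranges,
    and group 0 handles the conditional branch.
    """
    if world_size < 1:
        raise ValueError("world_size must be positive")
    if enable_cfg_parallel:
        if world_size % 2 != 0:
            raise ValueError("CFG parallelism needs an even number of cards")
        cp_size = world_size // 2  # half the cards per denoising branch
    else:
        cp_size = world_size       # all cards form one context-parallel group

    plans = []
    for rank in range(world_size):
        group = rank // cp_size
        if enable_cfg_parallel:
            branch = "conditional" if group == 0 else "unconditional"
        else:
            branch = "both"
        plans.append(RankPlan(rank, branch, group, rank % cp_size))
    return plans


if __name__ == "__main__":
    for plan in plan_ranks(8, enable_cfg_parallel=True):
        print(plan)
```

With 8 cards and CFG parallelism enabled, the sketch yields two groups of four, which matches the 4 + 4 split given in the article's example. The real group layout is decided by megatron.core's parallel_state and should be checked there.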
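When validating the RMSNorm replacement from section 3.2, a device-independent reference can be useful. The sketch below restates the arithmetic of the original `_norm` and `forward` methods for a single vector using only the standard library. The function name `rms_norm_reference` is hypothetical, and the expectation that npu_rms_norm matches it up to numeric precision is an assumption to verify on real hardware.

```python
import math


def rms_norm_reference(x, weight, eps=1e-5):
    """Reference RMSNorm for one vector, following the original forward():
    y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    # With unit weights and eps = 0 the output has a mean square of 1.
    out = rms_norm_reference([3.0, 4.0], [1.0, 1.0], eps=0.0)
    print(out, sum(v * v for v in out) / len(out))
```

Comparing this reference against the fused operator on a few random rows is one way to catch dtype or epsilon mismatches after the swap.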
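The rotary step in section 3.3 can likewise be checked against a small reference. The sketch assumes the "rotate half" convention, in which the result is x * cos + rotate_half(x) * sin with rotate_half(x) = concat(-x2, x1) for the two halves x1, x2 of the last dimension. That is how the default mode of npu_rotary_mul is commonly described, but it is an assumption here and should be confirmed against the Ascend documentation. The function names are hypothetical, and the angles are passed as one value per element rather than as the `freqs` tensor of the article.

```python
import math


def rotate_half(x):
    """Split x into halves (x1, x2) and return concat(-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def rotary_reference(x, angles):
    """Assumed rotate-half rotary embedding for one vector:
    out_i = x_i * cos(angle_i) + rotate_half(x)_i * sin(angle_i)
    """
    if len(x) % 2 != 0 or len(x) != len(angles):
        raise ValueError("x must have even length and match angles")
    rotated = rotate_half(x)
    return [
        xi * math.cos(a) + ri * math.sin(a)
        for xi, ri, a in zip(x, rotated, angles)
    ]


if __name__ == "__main__":
    # Element i in the first half pairs with element i + half in the second,
    # so both halves carry the same angles.
    x = [1.0, 2.0, 3.0, 4.0]
    angles = [0.3, 0.7, 0.3, 0.7]
    print(rotary_reference(x, angles))
```

Because each pair (x_i, x_{i+half}) undergoes a plane rotation, the vector norm is preserved, which gives a cheap sanity check for the fused kernel's output.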
