【202511】Cosmos-Predict2.5-02 Model Notes: World Simulation with Video Foundation Models for Physical AI【Network Architecture: DiT】【Visual Tokenizer: WAN2.1 VAE】【16 fps】
《World Simulation with Video Foundation Models for Physical AI》

3. Method

In this section, we first discuss our flow-matching formulation and then present the network architecture.

3.1. Flow Matching

We adopt flow matching (FM) (Lipman et al., 2022) for training diffusion models because of its conceptual simplicity and practical effectiveness. While FM and the Elucidated Diffusion Model (EDM) (Karras et al., 2022), which was used in [Cosmos-Predict1] (NVIDIA, 2025), are mathematically equivalent in terms of their forward and backward diffusion processes, they differ in how the denoising network is parameterized (Gao et al., 2025). In EDM, the preconditioning coefficients are chosen so that both the inputs and outputs of the denoising network are approximately standardized Gaussians, which simplifies training and improves stability. In contrast, FM selects coefficients that make the denoising network predict the velocity of the diffusion trajectory. This velocity-based formulation not only provides a more direct training target but also tends to yield smoother optimization and improved sample quality in practice.

Formally, given a data sample $\mathbf{x}$ (image or video), a noise vector $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and a timestep $t \in [0, 1]$ drawn from a logit-normal distribution, the interpolated latent $\mathbf{x}_t$ is defined as

$$\mathbf{x}_t = (1 - t)\,\mathbf{x} + t\,\boldsymbol{\epsilon}.$$

The corresponding ground-truth velocity is

$$\mathbf{v}_t = \boldsymbol{\epsilon} - \mathbf{x}.$$

The model is trained to predict $\mathbf{v}_t$ by minimizing the mean squared error (MSE) between the prediction and the ground truth:

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}, \boldsymbol{\epsilon}, \mathbf{c}, t}\,\big\| \mathbf{u}(\mathbf{x}_t, t, \mathbf{c}; \theta) - \mathbf{v}_t \big\|^2,$$

where $\mathbf{c}$ denotes conditioning information associated with $\mathbf{x}$ (e.g., text embeddings, reference frames, and other conditional inputs), $\theta$ represents the model parameters, and $\mathbf{u}(\cdot\,; \theta)$ is the predicted velocity function.
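To make the objective concrete, here is a minimal PyTorch-style sketch of one flow-matching training step under these definitions. The function name, the `velocity_model` interface, and the logit-normal parameters are illustrative placeholders, not the released implementation.

```python
import torch

def flow_matching_loss(velocity_model, x, cond, mean=0.0, std=1.0):
    """One FM training step: interpolate toward noise and regress the velocity.

    velocity_model(x_t, t, cond) -> predicted velocity, same shape as x.
    x:    clean latents, e.g. (B, C, T, H, W)
    cond: conditioning (e.g. text embeddings), passed through unchanged.
    """
    b = x.shape[0]
    # Timestep t in [0, 1] from a logit-normal distribution:
    # a Gaussian sample squashed through a sigmoid.
    t = torch.sigmoid(mean + std * torch.randn(b, device=x.device))
    t_b = t.view(b, *([1] * (x.dim() - 1)))          # broadcast over C, T, H, W

    eps = torch.randn_like(x)                        # noise sample
    x_t = (1.0 - t_b) * x + t_b * eps                # interpolated latent x_t
    v_t = eps - x                                    # ground-truth velocity v_t

    v_pred = velocity_model(x_t, t, cond)            # predicted velocity u(x_t, t, c)
    return torch.mean((v_pred - v_t) ** 2)           # MSE objective
```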
High-resolution content often contains significant redundancy, since nearby pixels are highly correlated. As a result, if the level of injected noise is too small, the model may fail to "break apart" this correlation, making it harder for the FM model to learn meaningful structure (Esser et al., 2024; Hoogeboom et al., 2023; Chen, 2023; Atzmon et al., 2024). To address this, we deliberately bias the training process toward higher noise levels. Specifically, we adopt the shifted logit-normal distribution (Esser et al., 2024). In practice, we first sample $t$ from a logit-normal distribution and then apply the monotone transformation

$$t_s = \frac{\beta t}{1 + (\beta - 1)\,t},$$

where $\beta$ is a shift hyper-parameter. This transformation reweights the distribution so that $t_s$ values are skewed toward higher noise levels. Intuitively, increasing $\beta$ makes the model encounter noisier inputs more often, which helps it learn to reconstruct the signal even when correlations are heavily disrupted. When $\beta = 1$, no shift is applied and $t_s = t$.
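The sketch below shows this sampling scheme as a minimal function; the logit-normal mean/std and the value of β are illustrative defaults, not the values used in training.

```python
import torch

def sample_shifted_timesteps(batch_size, beta=3.0, mean=0.0, std=1.0, device="cpu"):
    """Sample t from a logit-normal distribution, then shift it toward higher noise.

    beta > 1 skews t_s toward 1 (noisier inputs); beta = 1 leaves t unchanged.
    """
    # Logit-normal: the sigmoid of a Gaussian sample lies in (0, 1).
    t = torch.sigmoid(mean + std * torch.randn(batch_size, device=device))
    # Monotone shift t_s = beta * t / (1 + (beta - 1) * t).
    t_s = beta * t / (1.0 + (beta - 1.0) * t)
    return t_s
```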
3.2. Network Architecture

In [Cosmos-Predict2.5], we largely reuse the denoising network $\mathbf{u}(\cdot\,; \theta)$ introduced in [Cosmos-Predict1]'s DiT (NVIDIA, 2025), which is based on a latent diffusion model. The main architectural change is the removal of the absolute positional embeddings, keeping only the relative positional embeddings. While absolute embeddings provide a fixed spatial or temporal reference, they limit the model's ability to generalize to resolutions or sequence lengths not seen during training. By removing them, [Cosmos-Predict2.5] gains greater flexibility for handling higher-resolution content and longer video sequences during post-training. This design choice is motivated by recent progress in long-context large language models, where alternative positional encoding strategies (Peng et al., 2023; bloc97, 2023) have proven effective at extending context length without sacrificing performance. The overall velocity prediction network design is illustrated in Fig. 2, and the configurations of the two model sizes are summarized in Table 3.

Table 3: Configuration details of the [Cosmos-Predict2.5] models.

| Configuration             | Cosmos-Predict2.5-2B | Cosmos-Predict2.5-14B |
| ------------------------- | -------------------- | --------------------- |
| Number of Layers          | 32                   | 36                    |
| Model Dimension           | 2,048                | 5,120                 |
| FFN Hidden Dimension      | 8,192                | 20,480                |
| AdaLN-LoRA Dimension      | 256                  | 256                   |
| Number of Attention Heads | 16                   | 40                    |
| Head Dimension            | 128                  | 128                   |
| MLP Activation            | GELU                 | GELU                  |
| Positional Embedding      | 3D RoPE              | 3D RoPE               |

We adopt a different set of auxiliary models in [Cosmos-Predict2.5] compared to [Cosmos-Predict1], with improvements in both visual and textual representations. For the visual tokenizer, we use WAN2.1 VAE (Wan et al., 2025), a causal variational autoencoder that compresses video sequences with a compression rate of $4 \times 8 \times 8$ across the time, height, and width dimensions, respectively. This compression greatly reduces the computational cost while preserving essential spatiotemporal structure. On top of this representation, we apply the same $1 \times 2 \times 2$ patchification strategy to compress latent features further. We train our model to generate 93 frames at a time, which correspond to 24 latent frames, using 16 fps videos. Each generated video is about 5.8 seconds long.
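As a quick sanity check of these numbers, the sketch below works out the latent and token shapes implied by the 4×8×8 VAE compression and the 1×2×2 patchification. The 704×1280 input resolution is an illustrative assumption only, since the training resolution is not stated here; the frame count (93) and frame rate (16 fps) come from the text above.

```python
# Latent and token shape arithmetic for the video tokenizer + patchifier.
# Assumption: the 704x1280 resolution is illustrative; the text does not
# specify the training resolution.

frames, height, width, fps = 93, 704, 1280, 16            # height/width assumed

# WAN2.1 VAE: causal 4x temporal compression (first frame mapped on its own),
# plus 8x spatial compression in height and width.
latent_frames = (frames - 1) // 4 + 1                      # -> 24 latent frames
latent_h, latent_w = height // 8, width // 8               # -> 88 x 160

# The 1x2x2 patchification halves the spatial latent dimensions again.
tokens_per_frame = (latent_h // 2) * (latent_w // 2)       # -> 44 * 80 = 3,520
total_tokens = latent_frames * tokens_per_frame            # -> 24 * 3,520 = 84,480

clip_seconds = frames / fps                                # -> 5.8125, i.e. ~5.8 s

print(latent_frames, tokens_per_frame, total_tokens, round(clip_seconds, 2))
```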
For the text encoder, we leverage [Cosmos-Reason1] (NVIDIA, 2025) instead of the T5 encoder used in [Cosmos-Predict1]. Unlike standard approaches that rely on the output of a single transformer layer, we concatenate activations across multiple blocks for each token and project them into a 1024-dimensional space, inspired by Wang et al. (2025). This yields a sequence of embedding vectors that more faithfully captures both local and global linguistic context. During training, these embeddings are integrated into the denoising process via cross-attention layers, enabling textual prompts to directly guide video generation. Moreover, the vision encoder in [Cosmos-Reason1] supports additional visual conditional inputs for style control, which we leave as an exciting direction for future exploration.

Each [Cosmos-Predict2.5] model is designed to operate in three modes: Text2World, Image2World, and Video2World. In the Text2World setting, generation is guided solely by a text prompt. In Image2World, the model receives both a text prompt and a reference image, allowing it to ground the generated video in specific visual content. In Video2World, the model further extends this conditioning to video sequences, enabling temporally coherent continuation or transformation of input clips.

Figure 2: Overall architecture of [Cosmos-Predict2.5]. As shown on the right, in the latent space, the model applies repeated blocks of self-attention, cross-attention, and feed-forward MLP layers, modulated by adaptive layer normalization (scale, shift, gate) for a given timestep t.
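To make the block structure described in the Figure 2 caption concrete, here is a minimal PyTorch-style sketch of one such transformer block with adaptive-layer-norm (shift, scale, gate) modulation. The module names, the low-rank AdaLN projection, and all wiring details are illustrative assumptions rather than the released implementation; the dimensions follow the 2B configuration in Table 3.

```python
import torch
import torch.nn as nn

class AdaLNDiTBlock(nn.Module):
    """Illustrative DiT block: self-attention, cross-attention, and an MLP,
    each modulated by (shift, scale, gate) vectors computed from the timestep
    embedding through a low-rank (LoRA-style) AdaLN projection. Dimensions
    follow the 2B config in Table 3; the released model's wiring may differ."""

    def __init__(self, dim=2048, heads=16, ffn_dim=8192, adaln_lora_dim=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        # Low-rank AdaLN: timestep embedding -> 9 modulation vectors
        # (shift, scale, gate) for each of the three sub-layers.
        self.adaln = nn.Sequential(
            nn.SiLU(), nn.Linear(dim, adaln_lora_dim), nn.Linear(adaln_lora_dim, 9 * dim)
        )

    def forward(self, x, text_emb, t_emb):
        # x: (B, N, dim) video tokens; text_emb: (B, M, dim) text tokens
        # (assumed already projected to model dim); t_emb: (B, dim).
        mods = self.adaln(t_emb).unsqueeze(1).chunk(9, dim=-1)
        (s1, sc1, g1, s2, sc2, g2, s3, sc3, g3) = mods

        h = self.norm1(x) * (1 + sc1) + s1                       # modulate, then self-attend
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]

        h = self.norm2(x) * (1 + sc2) + s2                       # cross-attend to text tokens
        x = x + g2 * self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]

        h = self.norm3(x) * (1 + sc3) + s3                       # gated feed-forward MLP
        x = x + g3 * self.mlp(h)
        return x
```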