Deriving the DDPM Optimization Objective

The optimization objective of DDPM (Denoising Diffusion Probabilistic Models) is derived from the variational lower bound (VLB), also called the evidence lower bound (ELBO). The derivation proceeds as follows.

1. Problem definition

Goal: learn a model $p_\theta(\mathbf{x}_0)$ that approximates the true data distribution $q(\mathbf{x}_0)$.
Forward process (diffusion process):
Fix a variance schedule $\beta_1, \dots, \beta_T$ and define the Markov chain:
$q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})$
Reverse process (generative process):
Learn a parameterized Markov chain:
$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t), \quad p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$
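
To make the forward chain concrete, here is a minimal scalar sketch using only the Python standard library. The linear schedule from 1e-4 to 0.02 and the helper names (`linear_beta_schedule`, `forward_step`, `forward_chain`) are assumptions made for illustration; the text above only requires that some fixed schedule $\beta_1, \dots, \beta_T$ exists.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], linearly spaced (an assumed schedule)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t)."""
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)

def forward_chain(x0, betas, rng):
    """Run the whole forward Markov chain and return [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

rng = random.Random(0)
betas = linear_beta_schedule(1000)
trajectory = forward_chain(x0=1.0, betas=betas, rng=rng)
```

Each step shrinks the previous sample slightly and adds a small amount of Gaussian noise, so `trajectory[-1]` retains almost no information about `x0` once $T$ is large.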

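2. Optimization objective: maximizing the log-likelihood

The goal is to maximize $\log p_\theta(\mathbf{x}_0)$. Evaluating it directly requires integrating over all latent trajectories $\mathbf{x}_{1:T}$, which is intractable, so we maximize its variational lower bound instead:
$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \triangleq \text{VLB}$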
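
As a brief sketch of where this bound comes from, it follows from Jensen's inequality applied to the concave logarithm, using only the quantities already defined:

$\begin{align*} \log p_\theta(\mathbf{x}_0) &= \log \int p_\theta(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T} \\ &= \log \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &\geq \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \end{align*}$

The gap between the two sides is $D_\text{KL} \left( q(\mathbf{x}_{1:T} | \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{1:T} | \mathbf{x}_0) \right) \geq 0$, so the bound is tight exactly when the forward chain matches the model's own posterior over trajectories.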

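3. Decomposing the variational lower bound

Expand the VLB using the factorizations of the two chains (writing $\mathbb{E}_q$ for $\mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)}$):
$\begin{align*} \text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right] \end{align*}$
By the Markov property, $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0)$, and for $t \geq 2$ Bayes' rule gives $q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \, q(\mathbf{x}_t | \mathbf{x}_0) / q(\mathbf{x}_{t-1} | \mathbf{x}_0)$. Substituting this for every $t \geq 2$, the ratios $q(\mathbf{x}_{t-1} | \mathbf{x}_0) / q(\mathbf{x}_t | \mathbf{x}_0)$ telescope to $q(\mathbf{x}_1 | \mathbf{x}_0) / q(\mathbf{x}_T | \mathbf{x}_0)$, and the $q(\mathbf{x}_1 | \mathbf{x}_0)$ factor cancels against the $t = 1$ term, leaving:
$\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)} + \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T | \mathbf{x}_0)} \right]$
Writing the expected log-ratios as KL divergences gives the final form:
$\boxed{\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \right] - \sum_{t=2}^T \mathbb{E}_{q} \left[ D_\text{KL} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \right) \right] - D_\text{KL} \left( q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T) \right)}$

For the full step-by-step derivation, see the companion article "DDPM优化目标公式推导（详细）" (DDPM Optimization Objective Derivation, Detailed).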

4. Key step: simplifying the KL divergence terms

(a) Closed form of the posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$

By Bayes' rule, the posterior is Gaussian:
$q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$
where:
$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$
(with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$)
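
The closed form can be checked numerically. The scalar sketch below compares it with a direct application of Bayes' rule, combining the likelihood $q(\mathbf{x}_t | \mathbf{x}_{t-1})$ and the prior $q(\mathbf{x}_{t-1} | \mathbf{x}_0)$ by adding precisions. The function names and the toy schedule are hypothetical and chosen only for this check.

```python
import math

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{i<=t} (1 - beta_i), for t = 1..T."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def posterior_closed_form(x_t, x0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed form (t >= 2)."""
    abar = alpha_bars(betas)
    beta_t, alpha_t = betas[t - 1], 1.0 - betas[t - 1]
    abar_t, abar_prev = abar[t - 1], abar[t - 2]
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0 * x0 + coef_xt * x_t, var

def posterior_by_bayes(x_t, x0, t, betas):
    """Same posterior via a precision-weighted product of two Gaussians."""
    abar = alpha_bars(betas)
    beta_t, alpha_t = betas[t - 1], 1.0 - betas[t - 1]
    abar_prev = abar[t - 2]
    # Likelihood q(x_t | x_{t-1}), viewed as a function of x_{t-1}.
    prec_lik = alpha_t / beta_t
    lin_lik = math.sqrt(alpha_t) * x_t / beta_t
    # Prior q(x_{t-1} | x_0) = N(sqrt(abar_prev) * x0, 1 - abar_prev).
    prec_prior = 1.0 / (1.0 - abar_prev)
    lin_prior = math.sqrt(abar_prev) * x0 / (1.0 - abar_prev)
    var = 1.0 / (prec_lik + prec_prior)
    return var * (lin_lik + lin_prior), var

betas = [0.1, 0.2, 0.3, 0.4]  # toy schedule, illustrative only
closed = posterior_closed_form(x_t=0.5, x0=-1.0, t=3, betas=betas)
bayes = posterior_by_bayes(x_t=0.5, x0=-1.0, t=3, betas=betas)
```

Both routes should agree to floating-point precision, which is a quick sanity check on the coefficients of $\mathbf{x}_0$ and $\mathbf{x}_t$.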

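(b) Parameterizing the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$

Let $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$.
To match the posterior, replace the unknown $\mathbf{x}_0$ in $\tilde{\boldsymbol{\mu}}_t$ with the estimate obtained by inverting $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$ with the predicted noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$:
$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t, \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \right)$
Substituting into the closed form and simplifying gives:
$\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$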
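
The simplification is easy to get wrong by hand, so the following scalar sketch evaluates both expressions for the same inputs. The function names and the toy values of $\alpha_t$ and $\bar{\alpha}_{t-1}$ are assumptions for illustration only.

```python
import math

def mu_via_posterior(x_t, eps_hat, alpha_t, abar_t, abar_prev):
    """Plug the x_0 estimate into the posterior mean of q(x_{t-1} | x_t, x_0)."""
    beta_t = 1.0 - alpha_t
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t

def mu_simplified(x_t, eps_hat, alpha_t, abar_t):
    """The simplified form: (x_t - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(alpha_t)

# Toy values, illustrative only; abar_t must equal alpha_t * abar_prev.
alpha_t, abar_prev = 0.9, 0.5
abar_t = alpha_t * abar_prev
mu_a = mu_via_posterior(0.3, -0.7, alpha_t, abar_t, abar_prev)
mu_b = mu_simplified(0.3, -0.7, alpha_t, abar_t)
```

The two values coincide because the coefficients of $\mathbf{x}_t$ add up to $1/\sqrt{\alpha_t}$ once $\beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$ is used.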

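(c) Closed form of the KL divergence

The KL divergence between two $d$-dimensional Gaussians is:
$D_\text{KL}(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \parallel \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)) = \frac{1}{2} \left[ \log \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \text{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \right]$
Assume $\boldsymbol{\Sigma}_\theta = \sigma_t^2 \mathbf{I}$ with $\sigma_t^2$ fixed rather than learned (common choices are $\sigma_t^2 = \beta_t$ or $\tilde{\beta}_t$). The log-determinant and trace terms then do not depend on $\theta$ and are absorbed into a constant $C$:
$D_\text{KL} = \frac{1}{2\sigma_t^2} \| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta \|^2 + C$
Write $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$ and eliminate $\mathbf{x}_0$; the posterior mean $\tilde{\boldsymbol{\mu}}_t$ then has the same form as $\boldsymbol{\mu}_\theta$, with the true noise $\boldsymbol{\epsilon}$ in place of $\boldsymbol{\epsilon}_\theta$. Therefore:
$\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta = \frac{\beta_t}{\sqrt{\alpha_t} \sqrt{1 - \bar{\alpha}_t}} \left( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon} \right)$
Taking the expectation over $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$, and ignoring the additive constant:
$\boxed{D_\text{KL} \propto \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]}$
The proportionality constant is $\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}$.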
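
As a scalar sanity check of this reduction, the sketch below evaluates the general Gaussian KL formula for $d = 1$ and compares it with the weighted squared noise error. It assumes $\sigma_t^2 = \tilde{\beta}_t$, so both Gaussians have the same variance and the additive constant is exactly zero; all names and numbers are illustrative.

```python
import math

def kl_gauss_1d(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for scalar Gaussians (d = 1)."""
    return 0.5 * (math.log(var2 / var1) - 1.0 + var1 / var2
                  + (mu2 - mu1) ** 2 / var2)

def noise_weight(beta_t, alpha_t, abar_t, sigma2_t):
    """Coefficient multiplying (eps - eps_theta)^2 in the KL term."""
    return beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - abar_t))

# Toy values, illustrative only.
alpha_t, abar_prev = 0.9, 0.5
beta_t, abar_t = 1.0 - alpha_t, alpha_t * abar_prev
sigma2_t = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t  # assumed: sigma_t^2 = beta_tilde_t
x_t, eps, eps_hat = 0.3, 0.4, -0.7

def mean_from_noise(noise):
    """Mean (x_t - beta_t / sqrt(1 - abar_t) * noise) / sqrt(alpha_t)."""
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * noise) / math.sqrt(alpha_t)

# Posterior mean uses the true noise; model mean uses the predicted noise.
kl = kl_gauss_1d(mean_from_noise(eps), sigma2_t, mean_from_noise(eps_hat), sigma2_t)
weighted_mse = noise_weight(beta_t, alpha_t, abar_t, sigma2_t) * (eps - eps_hat) ** 2
```

With the other common choice, $\sigma_t^2 = \beta_t$, the two variances differ and `kl` exceeds `weighted_mse` by a constant that does not depend on the prediction.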

5. Final optimization objective

Dropping the constant terms and the per-timestep weights, the simplified DDPM objective is:
$\mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]$
where:

$t \sim \text{Uniform}\{1, \dots, T\}$
$\mathbf{x}_0 \sim q(\mathbf{x}_0)$
$\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$
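
The four sampling rules above translate directly into a Monte Carlo estimate of the loss. The sketch below is a scalar, standard-library-only illustration: `eps_model` stands in for the network $\boldsymbol{\epsilon}_\theta$, and the two toy models, the single-point dataset, and the linear schedule with $T = 100$ are all assumptions made for the example.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{i<=t} (1 - beta_i), for t = 1..T."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def simple_loss(eps_model, data, betas, n_samples, rng):
    """Monte Carlo estimate of L_simple for scalar data."""
    abar = alpha_bars(betas)
    total = 0.0
    for _ in range(n_samples):
        t = rng.randint(1, len(betas))   # t ~ Uniform{1, ..., T}
        x0 = rng.choice(data)            # x_0 ~ q(x_0), here the empirical distribution
        eps = rng.gauss(0.0, 1.0)        # eps ~ N(0, 1)
        x_t = math.sqrt(abar[t - 1]) * x0 + math.sqrt(1.0 - abar[t - 1]) * eps
        total += (eps - eps_model(x_t, t)) ** 2
    return total / n_samples

betas = [1e-4 + i * (0.02 - 1e-4) / 99 for i in range(100)]  # assumed linear schedule
data = [2.0]                                                  # toy dataset: one point
abar = alpha_bars(betas)

def zero_model(x_t, t):
    """Baseline that always predicts zero noise."""
    return 0.0

def oracle_model(x_t, t):
    """Exact noise for the single-point dataset; a real model is a trained network."""
    return (x_t - math.sqrt(abar[t - 1]) * data[0]) / math.sqrt(1.0 - abar[t - 1])

loss_zero = simple_loss(zero_model, data, betas, 2000, random.Random(0))
loss_oracle = simple_loss(oracle_model, data, betas, 2000, random.Random(0))
```

The zero predictor's loss estimates $\mathbb{E}[\epsilon^2] = 1$, while the oracle drives the loss to numerical zero; a trained network is expected to fall between these two extremes.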

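Key takeaway

DDPM trains a network $\boldsymbol{\epsilon}_\theta$ to predict the noise that was added to a sample, minimizing the mean squared error of that prediction; data is then generated by running the learned reverse chain. Up to per-timestep weighting, this objective is equivalent to matching the gradient of the log-density (the score) of the noised data, which ties DDPM closely to score-based generative models.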

Supplement: optimization perspective

Term-by-term analysis of the final simplified VLB and the optimization strategy

The final VLB is:
$\begin{align*} \text{VLB} = & \;\mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)} \Big[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \Big] \\ & - \sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ D_{\text{KL}} \Big( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \Big) \right] \\ & - D_{\text{KL}} \Big( q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T) \Big) \end{align*}$

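1. Reconstruction term

$\mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)} \Big[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \Big]$

Meaning:
Measures how well the original data $\mathbf{x}_0$ can be reconstructed from the first noisy sample $\mathbf{x}_1$.
- $q(\mathbf{x}_1 | \mathbf{x}_0)$: the first step of the forward process ($\mathbf{x}_0 \to \mathbf{x}_1$)
- $p_\theta(\mathbf{x}_0 | \mathbf{x}_1)$: the final step of the reverse (generative) process ($\mathbf{x}_1 \to \mathbf{x}_0$)
Intuition:
Evaluates the model's ability to reconstruct data at the lowest noise level ($t = 1$).
For image data, this term is usually modeled with a discrete distribution (e.g., a per-pixel cross-entropy) or a continuous one (e.g., a Gaussian likelihood).
Role in optimization:
Ensures that the generative process ends with a high-quality sample. In practice its contribution is small, because the noise level at $t = 1$ is low.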

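2. Denoising matching term

$\sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ D_{\text{KL}} \Big( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \Big) \right]$

Meaning:
This is the core term. It requires the reverse process $p_\theta$ to match the posterior $q$ of the forward process.
- $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$: the true posterior of $\mathbf{x}_{t-1}$ given $\mathbf{x}_0$ and $\mathbf{x}_t$ (a Gaussian with a closed form)
- $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$: the parameterized reverse model (predicted by a neural network)
Intuition:
At every step $t$, the distribution over $\mathbf{x}_{t-1}$ that the model predicts from $\mathbf{x}_t$ is pushed toward the theoretically optimal denoising distribution.
Key result of the derivation:
With fixed variances, this KL divergence reduces to the mean squared error of the noise prediction:
$D_{\text{KL}} \propto \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta (\mathbf{x}_t, t) \|^2$
where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}_\theta$ is the noise-prediction network.
Role in optimization:
Dominates training: it contributes $T - 1$ of the $T$ terms that depend on $\theta$ (over 99% of them once $T > 100$).
Turns a complex distribution-matching problem into simple supervised learning: train the network $\boldsymbol{\epsilon}_\theta$ to predict the added noise $\boldsymbol{\epsilon}$.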

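3. Prior matching term

$D_{\text{KL}} \Big( q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T) \Big)$

Meaning:
Measures how close the final forward distribution $q(\mathbf{x}_T | \mathbf{x}_0)$ is to the chosen prior $p(\mathbf{x}_T)$.
- $q(\mathbf{x}_T | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_T; \sqrt{\bar{\alpha}_T} \mathbf{x}_0, (1-\bar{\alpha}_T)\mathbf{I})$
- $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ (standard Gaussian)
Intuition:
Ensures that by the end of the forward process the sample is distributed approximately as a standard Gaussian, which is the starting point of the generative process.
Role in optimization:
- When $\bar{\alpha}_T \approx 0$ (which DDPM schedules are chosen to satisfy), this term approaches 0, because $q(\mathbf{x}_T|\mathbf{x}_0) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$.
- It is usually dropped during training, because it does not depend on the trainable parameters $\theta$ and its value is very small.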
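
To see how small this term is, the scalar sketch below evaluates $\bar{\alpha}_T$ and the KL divergence to $\mathcal{N}(0, 1)$. The linear schedule from 1e-4 to 0.02 with $T = 1000$ and the data value $x_0 = 1$ are assumptions made for illustration, not values fixed by the text.

```python
import math

def prior_kl_1d(x0, betas):
    """Return ( KL( q(x_T | x_0) || N(0, 1) ), alpha_bar_T ) for scalar data."""
    abar_T = 1.0
    for beta in betas:
        abar_T *= 1.0 - beta
    mean, var = math.sqrt(abar_T) * x0, 1.0 - abar_T
    # KL( N(mean, var) || N(0, 1) ) = 0.5 * (-log var - 1 + var + mean^2)
    kl = 0.5 * (-math.log(var) - 1.0 + var + mean ** 2)
    return kl, abar_T

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed linear schedule
prior_kl, abar_T = prior_kl_1d(x0=1.0, betas=betas)
```

Under these assumptions both `abar_T` and `prior_kl` come out below 1e-4, which supports dropping the term.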

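Overall optimization strategy

1. Core objective

Maximizing the lower bound (VLB) on $\log p_\theta(\mathbf{x}_0)$ is equivalent to minimizing:
$\mathcal{L}_{\text{VLB}} = -\text{VLB} = \mathcal{L}_0 + \sum_{t=2}^T \mathcal{L}_{t-1} + \mathcal{L}_T$
where:

$\mathcal{L}_0 = -\mathbb{E}_q[\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)]$ (reconstruction loss)
$\mathcal{L}_{t-1} = \mathbb{E}_q[D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))]$ for $2 \leq t \leq T$ (denoising matching loss)
$\mathcal{L}_T = D_{\text{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \parallel p(\mathbf{x}_T))$ (prior matching loss)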

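2. Simplifications used in practice

Drop $\mathcal{L}_T$:
Because $\bar{\alpha}_T \approx 0$, this term is close to 0 and can be ignored.
Simplify $\mathcal{L}_0$:
Replace the explicit discrete-distribution model (e.g., for image data) with a mean squared error.
Convert the dominant denoising matching term:
For each $t$, the derivation turns the KL divergence into a noise-prediction loss (up to an additive constant):
$\mathbb{E}_q[D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))] \propto \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2$
Uniform timestep sampling:
To stabilize training, sample $t \sim \text{Uniform}\{1,...,T\}$ and drop the per-timestep weights:
$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}} \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2$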

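3. Schematic of the three terms

Reverse (generative) process: x_T ≈ N(0,I) → [pθ(x_{T-1}|x_T)] → ... → [pθ(x_0|x_1)] → x_0
                                    ↑ match            ↑ match            ↑ match
Forward process             : x_0 → [q(x_1|x_0)] → x_1 → ... → [q(x_T|x_{T-1})] → x_T
                  reconstruction term↑   denoising matching term↑   prior matching term↑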
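
The reverse arrow in the schematic corresponds to ancestral sampling. The scalar sketch below assumes the fixed variance $\sigma_t^2 = \beta_t$ (one of the two choices mentioned earlier) and adds no noise on the final step; `eps_model`, the toy schedule, and the oracle for a single-point dataset are hypothetical stand-ins for a trained network.

```python
import math
import random

def sample(eps_model, betas, rng):
    """Ancestral sampling x_T -> x_0 for scalar data, with sigma_t^2 = beta_t (assumed)."""
    abar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        abar.append(running)
    x = rng.gauss(0.0, 1.0)                    # x_T ~ p(x_T) = N(0, 1)
    for t in range(len(betas), 0, -1):         # t = T, ..., 1
        beta_t, alpha_t = betas[t - 1], 1.0 - betas[t - 1]
        eps_hat = eps_model(x, t)
        # Mean of p_theta(x_{t-1} | x_t) in the noise-prediction parameterization.
        mean = (x - beta_t / math.sqrt(1.0 - abar[t - 1]) * eps_hat) / math.sqrt(alpha_t)
        noise = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # final step returns the mean
        x = mean + math.sqrt(beta_t) * noise
    return x

T, target = 50, 2.0
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # toy linear schedule
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)

def oracle_model(x_t, t):
    """Exact noise when the dataset is the single point `target` (illustration only)."""
    return (x_t - math.sqrt(abar[t - 1]) * target) / math.sqrt(1.0 - abar[t - 1])

x0_sample = sample(oracle_model, betas, random.Random(0))
```

With the oracle, the last step maps any $\mathbf{x}_1$ back to `target`, because $1 - \bar{\alpha}_1 = \beta_1$; with a learned network the output is only an approximate sample from the data distribution.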

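4. Why does this optimization work?

Decoupled complexity:
Matching a high-dimensional data distribution is decomposed into $T$ simple Gaussian matching tasks.
Progressive refinement:
The timestep $t$ controls the noise level, so the tasks range from easy (high noise) to hard (low noise). Training samples $t$ uniformly rather than in order; it is generation that walks through the levels step by step.
Closed-form guidance:
The analytical form of the forward posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ provides the training target.
Implicit score matching:
Predicting the noise is equivalent to learning the gradient field of the log-density of the noised data ($\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar{\alpha}_t} \, \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$).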
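
As a short sketch of why the noise and the score are related, differentiate the log-density of $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$ and substitute $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$:

$\begin{align*} \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) &= -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0}{1 - \bar{\alpha}_t} \\ &= -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}} \end{align*}$

Regressing onto $\boldsymbol{\epsilon}$ is therefore, up to the factor $-\sqrt{1 - \bar{\alpha}_t}$, regressing onto this conditional score. That the minimizer also recovers the score of the marginal $q(\mathbf{x}_t)$ is the standard denoising score matching argument, which is not derived here.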

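Summary

| Term | Meaning | Role in optimization | Handling in practice |
| --- | --- | --- | --- |
| Reconstruction term | Reconstruct $\mathbf{x}_0$ from $\mathbf{x}_1$ | Ensures final output quality | Kept, or replaced with MSE |
| Denoising matching term | Match the reverse model to the forward posterior | Core training objective ($T - 1$ of the $T$ $\theta$-dependent terms, over 99% once $T > 100$) | Converted to a noise-prediction loss |
| Prior matching term | Align $\mathbf{x}_T$ with a standard Gaussian | Ensures the correct starting point for generation | Dropped (value ≈ 0) |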