Preface
I have recently wanted to start a column on reinforcement learning, partly because DeepSeek-R1 is so popular, even though I have barely scratched the surface of LLMs myself. These posts are therefore just reading notes: nothing deep, mostly concepts, and not too many mathematical formulas. Feedback from readers is very welcome. The book being read is Maxim Lapan's Deep Reinforcement Learning Hands-On (《深度强化学习实践》).
Due to space constraints, please first read the earlier articles in this series:
Markov Processes
Markov Reward Processes
Markov Reward Processes (Part 2)
A Brief Introduction to the RL Framework Gym
Implementing a Random CartPole Agent with Gym
Starting with this article, we introduce the first RL algorithm: the cross-entropy method.
1. Deriving the Cross-Entropy Formula
1.1 Prerequisites
Before introducing the cross-entropy method, and to avoid any confusion about where it comes from, let us first walk through a short mathematical derivation:
$$E_{x \sim p(x)}[H(x)] = \int_{x} p(x)H(x)\,dx$$
In the formula above, $p(x)$ is the probability distribution over all possible policies, and $H(x)$ is the reward obtained by taking policy $x$. Our goal is the expected reward, which is why we integrate.
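As a quick illustration, this expectation can be estimated by Monte Carlo sampling. Below is a minimal sketch in Python; the Gaussian choice for $p(x)$ and the toy `reward` function standing in for $H(x)$ are purely illustrative assumptions, not anything from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):
    # A toy stand-in for H(x): the reward obtained by "policy" x.
    return -(x - 2.0) ** 2

# Assume p(x) is a standard normal distribution over policies (toy assumption).
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Monte Carlo estimate of E_{x~p(x)}[H(x)] ≈ (1/N) * sum_i H(x_i)
estimate = reward(samples).mean()
print(f"Estimated expected reward: {estimate:.4f}")
```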
However, computing $p(x)$ directly is hard, so we look for a distribution $q(x)$ that approximates $p(x)$. The formula then becomes:
$$E_{x \sim p(x)}[H(x)] = \int_{x} p(x)H(x)\,dx = \int_{x} q(x)\frac{p(x)}{q(x)}H(x)\,dx = E_{x \sim q(x)}\left[\frac{p(x)}{q(x)}H(x)\right]$$
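This identity (importance sampling) is easy to sanity-check numerically: sample from $q(x)$ and reweight each sample by $p(x)/q(x)$. The sketch below assumes simple Gaussians for $p$ and $q$ and reuses the toy `reward` function from above; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):                                   # toy H(x)
    return -(x - 2.0) ** 2

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p(x) = N(0, 1); proposal q(x) = N(1, 2) that we actually sample from.
xs_p = rng.normal(0.0, 1.0, size=200_000)
direct = reward(xs_p).mean()                     # E_{x~p}[H(x)]

xs_q = rng.normal(1.0, 2.0, size=200_000)
weights = normal_pdf(xs_q, 0.0, 1.0) / normal_pdf(xs_q, 1.0, 2.0)
importance = (weights * reward(xs_q)).mean()     # E_{x~q}[(p/q) H(x)]

print(f"direct: {direct:.3f}  importance-sampled: {importance:.3f}")
```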
We then use the KL divergence to make $q(x)$ approximate $p(x)$ step by step. The KL divergence is defined as:
$$KL(p(x)\,\|\,q(x)) = E_{x \sim p(x)}\log\frac{p(x)}{q(x)} = E_{x \sim p(x)}\log p(x) - E_{x \sim p(x)}\log q(x)$$
In the formula above, the first term is (up to sign) the entropy of $p$; since it does not depend on $q$, it can be ignored during optimization. The second term is the cross-entropy, which is the usual loss function in deep learning.
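This decomposition is easy to verify for discrete distributions: $KL(p\,\|\,q)$ equals the cross-entropy between $p$ and $q$ minus the entropy of $p$. A small sketch with made-up probabilities:

```python
import numpy as np

# Two toy discrete distributions over 4 outcomes (illustrative values).
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

entropy = -np.sum(p * np.log(p))          # H(p)     = -E_{x~p} log p(x)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)  = -E_{x~p} log q(x)
kl = np.sum(p * np.log(p / q))            # KL(p||q) =  E_{x~p} log p(x)/q(x)

# KL(p||q) = cross-entropy(p, q) - entropy(p): the two numbers match.
print(kl, cross_entropy - entropy)
```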
1.2 Deriving the Iterative Formula
From the formula in Section 1.1 we obtain:
$$E_{x \sim p(x)}[H(x)] = E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}H(x)\right]$$
Next we can use importance sampling to rewrite the KL divergence. Importance sampling is a way of estimating an expectation through another distribution $q_i(x)$. Concretely:
$$E_{x \sim p(x)}[f(x)] = \int f(x)p(x)\,dx = \int f(x)\frac{p(x)}{q_i(x)}q_i(x)\,dx = E_{x \sim q_i(x)}\left[f(x)\frac{p(x)}{q_i(x)}\right]$$
Applying this idea to the KL divergence:
$$KL(p(x)\,\|\,q_{i+1}(x)) = E_{x \sim p(x)}\log\frac{p(x)}{q_{i+1}(x)} = E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\log\frac{p(x)}{q_{i+1}(x)}\right]$$
Expanding the expression further:
$$KL(p(x)\,\|\,q_{i+1}(x)) = E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\left(\log p(x) - \log q_{i+1}(x)\right)\right]$$
Splitting the expression into two parts:
$$KL(p(x)\,\|\,q_{i+1}(x)) = E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\log p(x)\right] - E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\log q_{i+1}(x)\right]$$
Note that the first part, $E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\log p(x)\right]$, is a constant with respect to $q_{i+1}(x)$, so it can be dropped when minimizing the KL divergence:
$$\min_{q_{i+1}(x)} KL(p(x)\,\|\,q_{i+1}(x)) = \min_{q_{i+1}(x)} -E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}\log q_{i+1}(x)\right]$$
To connect this with the $H(x)$ from the original problem, suppose first that $H(x) = 1$ (no extra weighting). If $H(x) \neq 1$, we simply include $H(x)$ in the objective:
$$\min_{q_{i+1}(x)} -E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}H(x)\log q_{i+1}(x)\right]$$
The final iterative update is therefore:
$$q_{i+1}(x) = \arg\min_{q_{i+1}(x)} -E_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)}H(x)\log q_{i+1}(x)\right]$$
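In practice this iteration is implemented by sampling from $q_i$, weighting the samples (for example with an elite indicator in place of $\frac{p(x)}{q_i(x)}H(x)$, as in Section 2 below), and fitting $q_{i+1}$ by weighted maximum likelihood. Here is a minimal sketch on a toy 1-D problem, assuming a Gaussian family for $q_i(x)$; the percentile, sample size, and `reward` function are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x):                        # toy H(x), maximized at x = 2
    return -(x - 2.0) ** 2

mu, sigma = 0.0, 5.0                  # parameters of the initial q_0(x) = N(mu, sigma)
for i in range(20):
    xs = rng.normal(mu, sigma, size=200)          # sample from q_i(x)
    scores = reward(xs)
    elite_cut = np.percentile(scores, 80)         # keep roughly the top 20% of samples
    elites = xs[scores >= elite_cut]              # indicator weight: 1 for elites, 0 otherwise
    # Minimizing the weighted cross-entropy over a Gaussian family reduces to a
    # weighted maximum-likelihood fit, i.e. the mean/std of the elite samples.
    mu, sigma = elites.mean(), elites.std() + 1e-3

print(f"q_20 is roughly N({mu:.3f}, {sigma:.3f})")   # mu converges toward 2
```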
2. Mapping the Formula to RL
Substituting variables into the formula derived in the previous section gives the RL loss function:
$$\pi_{i+1}(a|s) = \arg\min_{\pi_{i+1}} -E_{z \sim \pi_i(a|s)}\left[\frac{p(x)}{\pi_i(a|s)}H(x)\log \pi_{i+1}(a|s)\right]$$
In this formula, $p(x)H(x)$ can be replaced by an indicator function: it equals 1 when an episode's reward exceeds a threshold and 0 otherwise. Optimizing with SGD then yields a policy model $\pi$ that approaches the optimal policy and thus approximates the true distribution.
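To make this concrete, here is a minimal sketch of what such a training loop could look like for CartPole, assuming the classic Gym API (`reset()` returning only the observation) and a small PyTorch policy network; the hyperparameters (`BATCH_SIZE`, `PERCENTILE`, learning rate, network size) are illustrative choices, not the book's exact code. The next article walks through this in full.

```python
import gym
import numpy as np
import torch
import torch.nn as nn

BATCH_SIZE, PERCENTILE = 16, 70                  # illustrative hyperparameters

env = gym.make("CartPole-v1")
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

# pi(a|s): a small softmax policy network.
net = nn.Sequential(nn.Linear(obs_size, 128), nn.ReLU(), nn.Linear(128, n_actions))
loss_fn = nn.CrossEntropyLoss()                  # the cross-entropy loss from the derivation
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

def play_episode():
    """Sample one episode from the current policy pi_i(a|s)."""
    obs, steps, total_reward = env.reset(), [], 0.0
    while True:
        probs = torch.softmax(net(torch.as_tensor(obs, dtype=torch.float32)), dim=-1)
        action = int(torch.multinomial(probs, num_samples=1))
        next_obs, reward, done, _ = env.step(action)
        steps.append((obs, action))
        total_reward += reward
        if done:
            return total_reward, steps
        obs = next_obs

for iteration in range(50):
    episodes = [play_episode() for _ in range(BATCH_SIZE)]
    rewards = [r for r, _ in episodes]
    threshold = np.percentile(rewards, PERCENTILE)       # the indicator's threshold
    elite_obs, elite_acts = [], []
    for r, steps in episodes:
        if r >= threshold:                               # indicator(reward >= threshold) == 1
            elite_obs += [o for o, _ in steps]
            elite_acts += [a for _, a in steps]
    # SGD step: minimize cross-entropy between pi_{i+1}(a|s) and the elite actions.
    optimizer.zero_grad()
    logits = net(torch.as_tensor(np.array(elite_obs), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(elite_acts))
    loss.backward()
    optimizer.step()
    print(f"iteration {iteration}: mean reward {np.mean(rewards):.1f}")
```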
Summary
This article is rather formula-heavy (it left me a bit dazed as well), so a deep understanding is not required. In the next article we will apply the cross-entropy method to the CartPole agent and see how much the results improve.