Painless Reinforcement Learning Notes, Lesson 1


Contents

  • Markov Decision Process
    • Definition
    • Related Concepts
  • Policy for MDP Agent
    • Definition
    • Judgement for Policy
      • Value Functions
      • TD formula for value functions
      • Relation of V and Q
      • Policy Criterion
      • Policy Improvement Theorem
      • Optimal Policy
      • Reinforcement Learning
  • Fundamental RL Algorithms: Policy Iteration
    • The Whole Procedure
    • Convergence Analysis
  • Fundamental RL Algorithms: Value Iteration

References:
1. Reinforcement Learning: State of the Art, Marco Wiering and Martijn van Otterlo
2. Reinforcement Learning: An Introduction, R. S. Sutton and A. G. Barto

Next post:
https://blog.csdn.net/Jinyindao243052/article/details/107051507

Markov Decision Process

Definition

Suppose that there is an interactive system as follows which consists of an environment and an agent.
(Figure: the agent–environment interaction loop)
Its running process is described as follows. At the beginning of step $t=0,1,\dots$, the agent observes the environment state $s_t$. It then chooses an action $a_t\in\mathcal{A}$ that is available in state $s_t$ and applies it. After action $a_t$ is applied, the environment transitions to state $s_{t+1}$ according to a probability distribution $T(s_t,a_t)$ over $\mathcal{S}$, generates a reward $r_t=R(s_t,a_t,s_{t+1})$, and reports its new state and the reward to the agent.

Here $\mathcal{S}$ is the set of possible environment states, called the state space. $\mathcal{A}$ is the set of possible agent actions, called the action space. $T:\mathcal{S}\times\mathcal{A}\to\mathcal{P}(\mathcal{S})$ is called the transition function, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over $\mathcal{S}$. The probability of ending up in state $s'$ after taking an applicable action $a$ in state $s$ is denoted $T(s,a,s')$. $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}$ is called the reward function; the reward for ending up in state $s'$ after taking action $a$ in state $s$ is $R(s,a,s')$.

Note that the probability distribution of the next state $s_{t+1}$ is determined only by $s_t$ and $a_t$, and is not directly influenced by earlier states or actions. Likewise, the reward $r_t$ is determined by $s_t$, $a_t$, and $s_{t+1}$ and is not directly influenced by earlier states, actions, or rewards. This property is called the Markov property.

In the theoretical analysis, we assume the process never stops.

The discrete-time running process of the foregoing system is called a Markov Decision Process.

Related Concepts

Discounted Cumulative Reward: Define the discounted cumulative reward of the process from step $t$ as $G_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, where $\gamma\in(0,1)$ is a parameter called the discount rate.

An MDP is usually represented by a tuple $\langle\mathcal{S},\mathcal{A},T,R,\gamma\rangle$, where $\mathcal{S}$ is its state space, $\mathcal{A}$ its action space, $T$ its transition function, $R$ its reward function, and $\gamma$ its discount rate.

History: The sequence $(s_0,a_0,s_1,a_1,\dots)$ is defined as the history of the MDP process (denoted $\tau$). The probability of $(s_t,a_t,\dots,s_{t'})$ given $s_t$ and $a_t,\dots,a_{t'-1}$ is $\prod_{i=t}^{t'-1}T(s_i,a_i,s_{i+1})$.

Trajectory: The sequence $(s_0,a_0,r_0,s_1,a_1,r_1,\dots)$ is defined as the trajectory of the MDP process (denoted $\mathcal{E}$).
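
To make the definitions concrete, here is a minimal Python sketch (the toy MDP and helper names are illustrative assumptions, not taken from the references) of how a finite MDP $\langle\mathcal{S},\mathcal{A},T,R,\gamma\rangle$ can be stored, how one step is sampled, and how a trajectory and its discounted cumulative reward $G_0$ are generated.

```python
import random

# A toy finite MDP: states and actions are integers.
# T[s][a] is a list of (next_state, probability); R[(s, a, s_next)] is the reward.
S = [0, 1]
A = [0, 1]
T = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(1, 0.8), (0, 0.2)]},
}
R = {(s, a, s2): (1.0 if s2 == 1 else 0.0) for s in S for a in A for s2 in S}
gamma = 0.9

def step(s, a):
    """Sample s' ~ T(s, a, .) and return (s', r)."""
    next_states, probs = zip(*T[s][a])
    s2 = random.choices(next_states, weights=probs)[0]
    return s2, R[(s, a, s2)]

def sample_trajectory(s0, policy, horizon=50):
    """Roll out (s_0, a_0, r_0, s_1, ...) for a finite horizon; return it with the truncated G_0."""
    traj, s, G = [], s0, 0.0
    for t in range(horizon):
        a = policy(s)
        s2, r = step(s, a)
        traj.append((s, a, r))
        G += (gamma ** t) * r  # discounted cumulative reward G_0 (truncated at the horizon)
        s = s2
    return traj, G

traj, G0 = sample_trajectory(0, policy=lambda s: random.choice(A))
print(G0)
```

The infinite-horizon return is approximated by truncating at a finite horizon; since $\gamma<1$, the truncation error shrinks geometrically.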

Policy for MDP Agent

Definition


Policy: A deterministic policy $\pi$ is a function $\pi:\mathcal{S}\to\mathcal{A}$, which tells the agent the action to take in a particular state.

A stochastic policy $\pi$ is defined as $\pi:\mathcal{S}\to\mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ denotes the set of probability distributions over $\mathcal{A}$. It tells the agent the probability of taking each action in a particular state. Denote the probability of action $a$ under $\pi(s)$ as $\pi(a\mid s)$.

Adopting a deterministic policy $\pi$ is equivalent to adopting the stochastic policy $\tilde{\pi}$ given by

$$\tilde{\pi}(a\mid s)=\begin{cases}1, & a=\pi(s)\\ 0, & a\neq\pi(s)\end{cases},\quad s\in\mathcal{S}$$

Therefore, we only discuss stochastic policies in what follows.
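
As a small illustration (the array layout is an assumption of this sketch, not from the references), a stochastic policy over a finite MDP can be stored as an $|\mathcal{S}|\times|\mathcal{A}|$ matrix of probabilities, and a deterministic policy is the special case with a single 1 per row:

```python
import numpy as np

n_states, n_actions = 2, 2

# Stochastic policy: pi[s, a] = pi(a|s); each row sums to 1 (here: uniform random).
pi_stochastic = np.full((n_states, n_actions), 1.0 / n_actions)

def deterministic_to_stochastic(pi_det, n_actions):
    """Encode a deterministic policy (array of chosen actions) as a row-stochastic matrix."""
    pi = np.zeros((len(pi_det), n_actions))
    pi[np.arange(len(pi_det)), pi_det] = 1.0
    return pi

pi_det = np.array([1, 0])  # pi(0) = 1, pi(1) = 0
print(deterministic_to_stochastic(pi_det, n_actions))
```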

Judgement for Policy

Value Functions

State Value Function (V function): The state value function of policy $\pi$ is defined as $V^{\pi}:\mathcal{S}\to\mathbb{R}$,

$$V^{\pi}(s)=\mathbb{E}_{\tau\mid s_0=s,\,\pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau\mid s_0=s,\,\pi}\{G_0\},\quad s\in\mathcal{S}$$

where $\tau=(s_0,a_0,s_1,a_1,\dots)$ is the sequence of state-action transitions. $V^{\pi}(s)$ is the expected discounted cumulative reward of the process when $s_0=s$ and policy $\pi$ is adopted.

For any sequence $[S_0,A_0,S_1,\dots]$ with $S_0=s$, $S_i\in\mathcal{S}$, $A_i\in\mathcal{A}$:

$$p\big(\tau|_{s_0=s,\pi}=[S_0,A_0,S_1,\dots]\big)=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i\mid S_i)$$

$$p\big(\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]\big)=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i\mid S_i)$$

Therefore the probability distributions of the sequences $\tau|_{s_0=s,\pi}$ and $\tau_t|_{s_t=s,\pi}$ are the same. Moreover, when $\tau|_{s_0=s,\pi}=\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]$, we have $G_0=G_t=\sum_{i=0}^{\infty}\gamma^{i}R(S_i,A_i,S_{i+1})$.

Therefore $\mathbb{E}_{\tau\mid s_0=s,\,\pi}\{G_0\}=\mathbb{E}_{\tau_t\mid s_t=s,\,\pi}\{G_t\}$. Then $V^{\pi}(s)=\mathbb{E}_{\tau_t\mid s_t=s,\,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ under the condition $s_t=s$ and following $\pi$ afterwards.

State-Action Value Function (Q function): The state-action value function is defined as $Q^{\pi}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$,

$$Q^{\pi}(s,a)=\mathbb{E}_{\tau\mid s_0=s,\,a_0=a,\,\pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau\mid s_0=s,\,a_0=a,\,\pi}\{G_0\},\quad s\in\mathcal{S},\ a\in\mathcal{A}$$

$Q^{\pi}(s,a)$ is the expected discounted cumulative reward of the process from step $0$ when $s_0=s$, $a_0=a$, and policy $\pi$ is adopted. Similarly, $Q^{\pi}(s,a)=\mathbb{E}_{\tau_t\mid s_t=s,\,a_t=a,\,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ under the condition $s_t=s$, $a_t=a$ and following $\pi$ afterwards.
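
Both value functions are expectations over trajectories, so they can be estimated by Monte Carlo rollouts. A minimal sketch (reusing the `step`, `sample_trajectory`, and `gamma` helpers assumed in the earlier toy-MDP example; truncation at a finite horizon is an approximation):

```python
def mc_estimate_v(s, policy, n_rollouts=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s): average truncated discounted return over rollouts from s."""
    returns = []
    for _ in range(n_rollouts):
        _, G0 = sample_trajectory(s, policy, horizon)
        returns.append(G0)
    return sum(returns) / len(returns)

def mc_estimate_q(s, a, policy, n_rollouts=1000, horizon=100):
    """Monte Carlo estimate of Q^pi(s, a): take a first, then follow pi afterwards."""
    returns = []
    for _ in range(n_rollouts):
        s2, r = step(s, a)
        _, G_tail = sample_trajectory(s2, policy, horizon)
        returns.append(r + gamma * G_tail)
    return sum(returns) / len(returns)
```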

TD formula for value functions

  • The TD($t$) ($t=1,2,\dots$) formula for the V function is $V^{\pi}(s)=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\big]$, $s\in\mathcal{S}$.

Proof.

$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\sum_{i=0}^{t-1}[\gamma^{i}r_{k+i}]+\gamma^{t}\,\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\sum_{i=0}^{t-1}[\gamma^{i}r_{k+i}]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})\mid s_k=s,\,\pi}\,\mathbb{E}_{(a_{k+t},\dots)\mid(s_k,\dots,s_{k+t}),\,s_k=s,\,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\sum_{i=0}^{t-1}[\gamma^{i}r_{k+i}]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})\mid s_k=s,\,\pi}V^{\pi}(s_{k+t})\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big] \end{aligned}$$

  • The TD($t$) ($t=1,2,\dots$) formula for the Q function is $Q^{\pi}(s,a)=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\big]$, $s\in\mathcal{S}$, $a\in\mathcal{A}$.

Proof.

$$\begin{aligned} Q^{\pi}(s,a)&=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\sum_{i=0}^{t-1}[\gamma^{i}r_{k+i}]+\gamma^{t}\,\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\sum_{i=0}^{t-1}[\gamma^{i}r_{k+i}]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t},a_{k+t})\mid s_k=s,\,a_k=a,\,\pi}\,\mathbb{E}_{(s_{k+t+1},\dots)\mid(s_k,\dots,s_{k+t},a_{k+t}),\,s_k=s,\,a_k=a,\,\pi}[G_{k+t}]\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\sum_{i=0}^{t-1}[\gamma^{i}r_{k+i}]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t},a_{k+t})\mid s_k=s,\,a_k=a,\,\pi}Q^{\pi}(s_{k+t},a_{k+t})\\ &=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\Big] \end{aligned}$$

Corollary.

(1)
$$V^{\pi}(s)=\sum_{a}\pi(a\mid s)\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big]$$
This is called the Bellman equation for the V function.

The proof follows directly from the TD(1) formula for V and is omitted.

(2)
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\Big[R(s,a,s')+\gamma\sum_{a'}\pi(a'\mid s')Q^{\pi}(s',a')\Big]$$
This is called the Bellman equation for the Q function.

The proof follows directly from the TD(1) formula for Q and is omitted.
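
For a finite MDP, the Bellman equation for $V^{\pi}$ is a linear system in the $|\mathcal{S}|$ unknowns $V^{\pi}(s)$, so it can be solved exactly. A minimal sketch, under the assumption (mine, not from the references) that $T$ and $R$ are given as numpy tensors of shape $(|\mathcal{S}|,|\mathcal{A}|,|\mathcal{S}|)$ and the policy as a row-stochastic matrix:

```python
import numpy as np

def evaluate_policy_exact(T, R, pi, gamma):
    """Solve V(s) = sum_a pi(a|s) sum_s' T[s,a,s'] (R[s,a,s'] + gamma V(s')) as a linear system.

    T: (S, A, S') transition probabilities; R: (S, A, S') rewards; pi: (S, A) policy.
    Returns the exact V^pi and the corresponding Q^pi.
    """
    n_states = T.shape[0]
    # P_pi[s, s'] = sum_a pi(a|s) T[s, a, s'];  r_pi[s] = expected one-step reward under pi.
    P_pi = np.einsum('sa,sap->sp', pi, T)
    r_pi = np.einsum('sa,sap,sap->s', pi, T, R)
    # Bellman equation in matrix form: (I - gamma P_pi) V = r_pi.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = np.einsum('sap,sap->sa', T, R) + gamma * np.einsum('sap,p->sa', T, V)
    return V, Q
```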

Relation of V and Q

The relation between the $V$ and $Q$ functions is given by the following theorems.

Theorem 1.
$$V^{\pi}(s)=\sum_{a}\pi(a\mid s)Q^{\pi}(s,a),\quad s\in\mathcal{S}$$
Proof.
$$\begin{aligned} V^{\pi}(s)&=\mathbb{E}_{\tau\mid s_0=s,\,\pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}\\ &=\mathbb{E}_{a_0\mid s_0=s,\,\pi}\,\mathbb{E}_{\tau\mid s_0=s,\,a_0,\,\pi}\{G_0\}\\ &=\mathbb{E}_{a_0\mid s_0=s,\,\pi}\{Q^{\pi}(s,a_0)\}\\ &=\sum_{a}\pi(a\mid s)Q^{\pi}(s,a) \end{aligned}$$
Theorem 2. For $t=1,2,\dots$, it holds that
$$V^{\pi}(s)=\mathbb{E}_{\tau_k\mid s_k=s,\,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\Big],\quad s\in\mathcal{S}$$

$$Q^{\pi}(s,a)=\mathbb{E}_{\tau_k\mid s_k=s,\,a_k=a,\,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big],\quad s\in\mathcal{S},\ a\in\mathcal{A}$$
The proof is similar to the foregoing proofs and is omitted.

Corollary.

$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big],\quad s\in\mathcal{S},\ a\in\mathcal{A}$$

Policy Criterion

A policy $\pi$ is said to be not worse than a policy $\pi'$ (written $\pi\geq\pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in\mathcal{S}$.

A policy $\pi$ is said to be strictly better than a policy $\pi'$ (written $\pi>\pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in\mathcal{S}$ and there exists a state $s_0$ with $V^{\pi}(s_0)>V^{\pi'}(s_0)$.

Policy Improvement Theorem

Theorem. For any policies $\pi$ and $\tilde\pi$ such that
$$V^{\pi}(s)\leq\sum_{a}\tilde\pi(a\mid s)Q^{\pi}(s,a),\quad s\in\mathcal{S},$$
it holds that $\tilde\pi\geq\pi$.

Proof.

$$V^{\pi}(s)=\mathbb{E}_{s_0\mid s_0=s,\,\tilde\pi}V^{\pi}(s_0)$$

$$\begin{aligned} \mathbb{E}_{s_0\mid s_0=s,\,\tilde\pi}V^{\pi}(s_0)&=\mathbb{E}_{s_0\mid s_0=s,\,\tilde\pi}\Big[\sum_{a}\pi(a\mid s_0)Q^{\pi}(s_0,a)\Big]\\ &\leq\mathbb{E}_{s_0\mid s_0=s,\,\tilde\pi}\Big[\sum_{a}\tilde\pi(a\mid s_0)Q^{\pi}(s_0,a)\Big]\\ &=\mathbb{E}_{(s_0,a_0)\mid s_0=s,\,\tilde\pi}[Q^{\pi}(s_0,a_0)]\\ &=\mathbb{E}_{(s_0,a_0)\mid s_0=s,\,\tilde\pi}\Big[\sum_{s'}T(s_0,a_0,s')\big(R(s_0,a_0,s')+\gamma V^{\pi}(s')\big)\Big]\\ &=\mathbb{E}_{(s_0,a_0)\mid s_0=s,\,\tilde\pi}\,\mathbb{E}_{s_1\mid(s_0,a_0),\,s_0=s,\,\tilde\pi}[r_0+\gamma V^{\pi}(s_1)]\\ &=\mathbb{E}_{(s_0,a_0,s_1)\mid s_0=s,\,\tilde\pi}[r_0]+\gamma\,\mathbb{E}_{(s_0,a_0,s_1)\mid s_0=s,\,\tilde\pi}[V^{\pi}(s_1)] \end{aligned}$$
For i = 0 , 1 , . . . i=0,1,... i=0,1,..., it can be derived in a similar way that

$$\mathbb{E}_{(s_0,\dots,s_i)\mid s_0=s,\,\tilde\pi}[V^{\pi}(s_i)]\leq\mathbb{E}_{(s_0,\dots,s_{i+1})\mid s_0=s,\,\tilde\pi}[r_i]+\gamma\,\mathbb{E}_{(s_0,\dots,s_{i+1})\mid s_0=s,\,\tilde\pi}[V^{\pi}(s_{i+1})]$$

Then it is easily derived that for $i=0,1,\dots$,
$$\begin{aligned} V^{\pi}(s)&\leq\sum_{l=0}^{i}\mathbb{E}_{(s_0,\dots,s_{l+1})\mid s_0=s,\,\tilde\pi}[\gamma^{l}r_{l}]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})\mid s_0=s,\,\tilde\pi}[V^{\pi}(s_{i+1})]\\ &=\mathbb{E}_{(s_0,\dots,s_{i+1})\mid s_0=s,\,\tilde\pi}\Big[\sum_{l=0}^{i}\gamma^{l}r_{l}\Big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})\mid s_0=s,\,\tilde\pi}[V^{\pi}(s_{i+1})] \end{aligned}$$
Letting $i\to\infty$ gives $V^{\pi}(s)\leq\mathbb{E}_{\tau\mid s_0=s,\,\tilde\pi}\big[\sum_{l=0}^{\infty}\gamma^{l}r_{l}\big]=V^{\tilde\pi}(s)$.

Therefore, $\tilde\pi\geq\pi$.

Corollary.

For any policy $\pi$, the greedy policy $\tilde\pi$ defined by
$$\tilde\pi(a\mid s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a')\\ 0, & \text{otherwise}\end{cases},\quad s\in\mathcal{S}$$
satisfies $\tilde\pi\geq\pi$.
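
This greedy improvement step is easy to write down for a finite MDP. A minimal sketch under the same numpy conventions assumed earlier (function name is mine):

```python
import numpy as np

def greedy_policy(Q):
    """Return the deterministic greedy policy pi~(s) = argmax_a Q(s, a) as a one-hot matrix."""
    n_states, n_actions = Q.shape
    pi = np.zeros_like(Q)
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi
```

By the corollary, alternating "evaluate $\pi$, then act greedily with respect to $Q^{\pi}$" can never make the policy worse; this is exactly the loop that policy iteration below formalizes.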

Optimal Policy

A policy $\pi^{*}$ is called an optimal policy if $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $s\in\mathcal{S}$ and for every policy $\pi\neq\pi^{*}$. (In the following content we write $V^{*}(s)$ for $V^{\pi^{*}}(s)$ and $Q^{*}(s,a)$ for $Q^{\pi^{*}}(s,a)$.)

The relation between $V^{*}$ and $Q^{*}$ is given as follows:
$$V^{*}(s)=\max_{a}Q^{*}(s,a),\quad s\in\mathcal{S}$$

An optimal policy is given by:
$$\pi^{*}(a\mid s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\\ 0, & \text{otherwise}\end{cases},\quad s\in\mathcal{S}$$
Proof.

If this did not hold, there would exist a strictly better policy by the corollary of the policy improvement theorem.

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Reinforcement Learning

Reinforcement Learning Algorithm: Algorithms that seek the optimal policy for MDPs are collectively known as reinforcement learning algorithms. We will introduce several reinforcement learning algorithms in the following content.

Fundamental RL Algorithms: Policy Iteration

We now introduce an RL algorithm called policy iteration that, at each iteration, yields a policy no worse than the previous one and ultimately converges to an optimal policy. It applies to finite MDPs (MDPs with finite state and action spaces).

Recall that for a deterministic policy $\pi$,
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma Q^{\pi}(s',\pi(s'))\big)$$
This gives an iterative way to improve an estimator of the $Q$ function:
$$\hat{Q}^{\pi}(s,a)\leftarrow\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{Q}^{\pi}(s',\pi(s'))\big)$$

The Whole Procedure

Require: initial policy $\pi$
Construct a function $\hat{Q}^{\pi}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ arbitrarily as an estimator of $Q^{\pi}$.
repeat  // policy evaluation
  $\Delta=0$
  for $s\in\mathcal{S}$, $a\in\mathcal{A}$ do
    $q=\hat{Q}^{\pi}(s,a)$
    $\hat{Q}^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{Q}^{\pi}(s',\pi(s'))\big)$
    $\Delta=\max(\Delta,\,|q-\hat{Q}^{\pi}(s,a)|)$
until $\Delta<\sigma$
policy-stable = true
for $s\in\mathcal{S}$ do  // policy improvement
  $b=\pi(s)$
  $\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}\hat{Q}^{\pi}(s,a)$
  if $b\neq\pi(s)$ then policy-stable = false
if policy-stable == false then go to policy evaluation, else return $\pi$.

Policy Evaluation
Update $\hat{Q}^{\pi}(s,a)$ iteratively as above, based on the Bellman equation.

Policy Improvement

Update $\pi(s)$ greedily:
$$\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}\hat{Q}^{\pi}(s,a)$$

*Note: Policy-stable means that $Q$ shows no improvement compared with the previous iteration, that is, $\pi_{k+1}$ is as good as $\pi_{k}$.

Reference: Wiering, M., & van Otterlo, M. (2012). Reinforcement Learning: State of the Art. Springer, p. 22.
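
Putting the two phases together, here is a minimal Python sketch of tabular policy iteration under the numpy-tensor conventions assumed earlier (helper names and tolerances are mine, not from the references):

```python
import numpy as np

def policy_iteration(T, R, gamma, sigma=1e-8, max_eval_sweeps=10_000):
    """Tabular policy iteration for a finite MDP given as numpy tensors.

    T: (S, A, S') transition probabilities; R: (S, A, S') rewards.
    Returns a deterministic policy pi (array of actions) and its Q estimate.
    """
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    Q = np.zeros((n_states, n_actions))

    while True:
        # Policy evaluation: sweep the Bellman update for Q^pi until the change is below sigma.
        for _ in range(max_eval_sweeps):
            Q_pi_next = Q[np.arange(n_states), pi]            # Q(s', pi(s')) for every s'
            Q_new = np.einsum('sap,sap->sa', T, R) \
                    + gamma * np.einsum('sap,p->sa', T, Q_pi_next)
            delta = np.max(np.abs(Q_new - Q))
            Q = Q_new
            if delta < sigma:
                break
        # Policy improvement: act greedily with respect to the current Q estimate.
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):            # policy-stable
            return pi, Q
        pi = pi_new
```

This sketch updates all $(s,a)$ entries of each sweep from the previous estimate at once, a minor departure from the in-place updates in the pseudocode above; both variants converge to $Q^{\pi}$.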

Convergence Analysis

It can be proved that policy iteration generates a monotonically improving sequence of policies. Let $\pi$ be the current policy and $\pi'(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{\pi}(s,a)$ the greedily improved policy. Then
$$\begin{aligned} V^{\pi}(s)=Q^{\pi}(s,\pi(s))&\leq Q^{\pi}(s,\pi'(s))\\ &=\sum_{s'}T(s,\pi'(s),s')\big[R(s,\pi'(s),s')+\gamma V^{\pi}(s')\big]\\ &=\mathbb{E}_{(s_0,a_0,s_1)\mid s_0=s,\,\pi'}\{r_0+\gamma V^{\pi}(s_1)\}\\ &\leq\mathbb{E}_{(s_0,a_0,s_1)\mid s_0=s,\,\pi'}\{r_0+\gamma Q^{\pi}(s_1,\pi'(s_1))\}\\ &=\mathbb{E}_{(s_0,a_0,s_1)\mid s_0=s,\,\pi'}\Big\{r_0+\gamma\sum_{s'}T(s_1,\pi'(s_1),s')\big[R(s_1,\pi'(s_1),s')+\gamma V^{\pi}(s')\big]\Big\}\\ &=\mathbb{E}_{(s_0,a_0,s_1)\mid s_0=s,\,\pi'}\big\{r_0+\gamma\,\mathbb{E}_{(a_1,s_2)\mid s_1,\,\pi'}\{r_1+\gamma V^{\pi}(s_2)\}\big\}\\ &=\mathbb{E}_{(s_0,a_0,s_1,a_1,s_2)\mid s_0=s,\,\pi'}\{r_0+\gamma r_1+\gamma^{2}V^{\pi}(s_2)\}\\ &\leq\cdots\\ &\leq\mathbb{E}_{(s_0,a_0,\dots,s_{n-1},a_{n-1},s_n)\mid s_0=s,\,\pi'}\Big\{\sum_{t=0}^{n-1}\gamma^{t}r_{t}+\gamma^{n}V^{\pi}(s_n)\Big\}\\ &\leq\cdots \end{aligned}$$
Therefore, $V^{\pi}(s)\leq\mathbb{E}_{\tau\mid s_0=s,\,\pi'}\big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\big\}=V^{\pi'}(s)$, $\forall s\in\mathcal{S}$.

Because a finite MDP has only a finite number of deterministic policies, $\pi$ must converge in a finite number of iterations.

Suppose $\pi$ converges to $\pi^{*}$. Then it is easily derived from the stopping criterion that
$$\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma V^{*}(s')\big),\quad\forall s\in\mathcal{S}$$
Then the Bellman optimality equation holds.

  • Each policy π k + 1 \pi_{k+1} πk+1 is a strictly better policy than π k \pi_{k} πk until the algorithm stops.
  • The policy iteration algorithm completely separates the evaluation and improvement phases.
  • Although policy iteration computes the optimal policy of a given MDP in finite time, it is relatively inefficient. In particular, the first step, the policy evaluation step, is computationally expensive.

Fundamental RL Algorithms: Value Iteration

From the previous sections we know that $\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$. Therefore, once $Q^{*}(s,a)$ is found, $\pi^{*}(s)$ is found as well. In this section we present an algorithm that computes $Q^{*}(s,a)$ iteratively: value iteration.

Iterative ways to derive $Q^{*}$

From the relation between $Q^{*}$ and $V^{*}$,
$$Q^{*}(s,a)=\sum_{s'\in\mathcal{S}}T(s,a,s')\big[R(s,a,s')+\gamma V^{*}(s')\big]$$
we obtain two ways to iteratively update an estimator of $Q^{*}$:

(1)
$$\hat{Q}^{*}(s,a)\leftarrow\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{V}^{*}(s')\big)$$

$$\hat{V}^{*}(s)\leftarrow\max_{a}\hat{Q}^{*}(s,a)$$
(2)
$$\hat{Q}^{*}(s,a)\leftarrow\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\max_{a'}\hat{Q}^{*}(s',a')\big)$$

Given state $s$, action $a$, reward $r$ and next state $s'$, it is possible to approximate $Q^{*}(s,a)$ by iteratively solving the Bellman recurrence
$$Q_{i+1}(s,a)=\mathbb{E}_{s'}\big[r+\gamma\max_{a'}Q_{i}(s',a')\big]$$
(Quoted from: Multiagent cooperation and competition with deep reinforcement learning)

The value iteration algorithm computes $Q^{*}$ based on these two update rules.

The Whole Procedure

Version 1:

Require: MDP model $\langle\mathcal{S},\mathcal{A},T,R,\gamma\rangle$
Construct a function $\hat{Q}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ arbitrarily as the estimate of $Q^{*}$. Construct a function $\hat{V}:\mathcal{S}\to\mathbb{R}$ arbitrarily as the estimate of $V^{*}$.
repeat
  $\Delta=0$
  for $s\in\mathcal{S}$ do
    $v=\hat{V}(s)$
    for $a\in\mathcal{A}$ do
      $\hat{Q}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{V}(s')\big)$  // update $\hat{Q}(s,a)$ with the first rule
    $\hat{V}(s)=\max_{a}\hat{Q}(s,a)$
    $\Delta=\max(\Delta,\,|v-\hat{V}(s)|)$
until $\Delta<\sigma$

Version 2:

Require: MDP model $\langle\mathcal{S},\mathcal{A},T,R,\gamma\rangle$
Construct a function $\hat{Q}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ arbitrarily as the estimate of $Q^{*}$.
repeat
  $\Delta=0$
  for $s\in\mathcal{S}$, $a\in\mathcal{A}$ do
    $q=\hat{Q}(s,a)$
    $\hat{Q}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\max\limits_{a'}\hat{Q}(s',a')\big)$  // update $\hat{Q}(s,a)$ with the second rule
    $\Delta=\max(\Delta,\,|q-\hat{Q}(s,a)|)$
until $\Delta<\sigma$
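
A minimal Python sketch of the second version, under the same numpy-tensor conventions assumed earlier (function name and tolerance are mine):

```python
import numpy as np

def value_iteration(T, R, gamma, sigma=1e-8, max_sweeps=10_000):
    """Tabular value iteration on Q: Q(s,a) <- sum_s' T[s,a,s'] (R[s,a,s'] + gamma max_a' Q(s',a'))."""
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    expected_r = np.einsum('sap,sap->sa', T, R)        # expected one-step reward
    for _ in range(max_sweeps):
        Q_next = expected_r + gamma * np.einsum('sap,p->sa', T, Q.max(axis=1))
        delta = np.max(np.abs(Q_next - Q))
        Q = Q_next
        if delta < sigma:
            break
    pi_star = Q.argmax(axis=1)                          # greedy policy extracted from the Q estimate
    return Q, pi_star
```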

Convergence Analysis

The iteration proceeds as $V_0\to Q_1\to V_1\to Q_2\to V_2\to\cdots\to V^{*}$. Value iteration is guaranteed to converge to $V^{*}$ in the limit, and then $\pi^{*}(s)=\pi_{\mathrm{greedy}}(V^{*})(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$.

Reference: Wiering, M., & van Otterlo, M. (2012). Reinforcement Learning: State of the Art. Springer, p. 22.

