In the previous two chapters we studied dynamic programming (DP) and Monte Carlo (MC) methods. DP is characterized by state transitions and estimates the state-value function by bootstrapping: the update of the current state's value depends on the existing estimates of other states' values. MC requires no model of the environment and its value estimates are independent of one another, but it only applies to episodic tasks. To obtain a method that needs no environment model, is not restricted to episodic tasks, and can also be used for continuing tasks, temporal-difference learning was developed.
Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo methods and is a central idea of reinforcement learning. The name refers to learning from the difference between value estimates at successive time steps.
TD learning of the policy state-value function vπ
One-step TD method: TD(0)
Initialize V(s) arbitrarily, ∀s ∈ S⁺
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α[R + γV(S′) − V(S)]
        S ← S′
    Until S is terminal
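To make the box above concrete, here is a minimal Python sketch of tabular TD(0) prediction. The tiny random-walk environment, the implicit random-walk policy, and all parameter values are assumptions made for this example, not part of the algorithm itself.

import random
from collections import defaultdict

def random_walk_step(state):
    """One transition of a 5-state random walk (non-terminal states 1..5,
    terminal states 0 and 6); reward +1 only on reaching state 6."""
    next_state = state + random.choice([-1, 1])
    reward = 1.0 if next_state == 6 else 0.0
    done = next_state in (0, 6)
    return next_state, reward, done

def td0_prediction(num_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)               # V(s), initialized to 0 (arbitrary)
    for _ in range(num_episodes):
        s = 3                            # Initialize S (start in the middle)
        done = False
        while not done:                  # Repeat for each step of the episode
            s_next, r, done = random_walk_step(s)   # the "policy" is the walk itself
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next                   # S <- S'
    return V

if __name__ == "__main__":
    values = td0_prediction()
    print({s: round(values[s], 2) for s in range(1, 6)})   # approx. 1/6 .. 5/6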
n-step TD method
Input: the policy π to be evaluated
Initialize V(s) arbitrarily, ∀s ∈ S
Parameters: step size α ∈ (0, 1], a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take an action according to π(·|S_t)
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then T ← t + 1
        τ ← t − n + 1   (τ is the time whose state's estimate is being updated)
        If τ ≥ 0:
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
            If τ + n ≤ T, then: G ← G + γⁿ V(S_{τ+n})      (G_τ^{(n)})
            V(S_τ) ← V(S_τ) + α[G − V(S_τ)]
    Until τ = T − 1
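Below is a Python sketch of the same n-step TD prediction. For readability it stores the whole episode rather than indexing mod n, and it assumes a hypothetical Gym-style environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) plus a `policy(state)` callable.

from collections import defaultdict

def n_step_td_prediction(env, policy, n=4, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular n-step TD prediction of V under `policy`.
    Assumed (hypothetical) interfaces:
      env.reset() -> state
      env.step(action) -> (next_state, reward, done)
      policy(state) -> action
    """
    V = defaultdict(float)                        # V(s), arbitrary init (0 here)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]    # S_0; rewards indexed from R_1
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                       # time whose estimate is updated
            if tau >= 0:
                # G = sum_{i=tau+1}^{min(tau+n, T)} gamma^(i-tau-1) * R_i
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:                   # bootstrap; V(terminal) = 0, so this
                    G += gamma ** n * V[states[tau + n]]   # matches the "<= T" test above
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:                      # Until tau = T - 1
                break
            t += 1
    return V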
The update of V(S_0) uses the rewards R_1, …, R_n together with the estimate V(S_n); the update of V(S_1) uses R_2, …, R_{n+1} together with V(S_{n+1}), and so on.
On-policy TD learning of the action-value function qπ: Sarsa
One-step TD method
Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g. ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
        S ← S′; A ← A′
    Until S is terminal
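A minimal tabular Sarsa sketch in Python, under the same assumed `env.reset()`/`env.step(a)` interface as before; the `epsilon_greedy` helper, the `actions` list, and the hyper-parameters are choices made for this example.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control (Sarsa); env.reset()/env.step(a) as assumed above."""
    Q = defaultdict(float)                       # Q(s, a); Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                       # no bootstrap from the terminal state
            else:
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            # Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next            # S <- S'; A <- A'
    return Q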
n-step TD method
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
            Else:
                Select and store an action A_{t+1} ∼ π(·|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
            If τ + n ≤ T, then: G ← G + γⁿ Q(S_{τ+n}, A_{τ+n})      (G_τ^{(n)})
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α[G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
    Until τ = T − 1
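The corresponding on-policy n-step Sarsa sketch in Python, again storing the whole episode instead of indexing mod n and reusing the hypothetical `env` interface and the `epsilon_greedy` helper from the Sarsa sketch above.

from collections import defaultdict

def n_step_sarsa(env, actions, n=4, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy n-step Sarsa; reuses the epsilon_greedy helper defined above."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s0 = env.reset()
        states = [s0]
        acts = [epsilon_greedy(Q, s0, actions, epsilon)]   # A_0 ~ pi(.|S_0)
        rewards = [0.0]                                    # rewards indexed from R_1
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    acts.append(epsilon_greedy(Q, s_next, actions, epsilon))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:                  # bootstrap on Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * (G - Q[sa])     # pi stays eps-greedy in the updated Q
            if tau == T - 1:
                break
            t += 1
    return Q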
Off-policy TD learning of the action-value function qπ: Q-learning
Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning using the following update rule:
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
One-step TD method
Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g. ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    Until S is terminal
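A minimal Q-learning sketch in Python under the same assumed environment interface; note that the behavior policy is ε-greedy in Q while the update target uses max_a Q(S′, a), which is what makes the method off-policy.

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control (Q-learning); env.reset()/env.step(a) as assumed above."""
    Q = defaultdict(float)                        # Q(s, a); Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behavior policy: eps-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # target uses max_a Q(S', a), regardless of which action is taken next
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q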
One-step TD method: Double Q-learning
Initialize Q_1(s, a) and Q_2(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily
Initialize Q_1(terminal, ·) = Q_2(terminal, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q_1 and Q_2 (e.g. ε-greedy)
        Take action A, observe R, S′
        With 0.5 probability:
            Q_1(S, A) ← Q_1(S, A) + α[R + γ Q_2(S′, argmax_a Q_1(S′, a)) − Q_1(S, A)]
        Else:
            Q_2(S, A) ← Q_2(S, A) + α[R + γ Q_1(S′, argmax_a Q_2(S′, a)) − Q_2(S, A)]
        S ← S′
    Until S is terminal
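A Double Q-learning sketch under the same assumptions: actions are chosen ε-greedily with respect to Q1 + Q2, and each update uses one table to pick the greedy action and the other to evaluate it, which reduces maximization bias.

import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Double Q-learning; env.reset()/env.step(a) as in the earlier sketches."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behave eps-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            # with probability 0.5 update Q1 using Q2's value of Q1's greedy action,
            # otherwise the symmetric update of Q2
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            best = max(actions, key=lambda x: A[(s_next, x)])
            target = r if done else r + gamma * B[(s_next, best)]
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = s_next
    return Q1, Q2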
Off-policy TD learning of the action-value function qπ (by importance sampling): Sarsa
n-step TD method
Input: behavior policy μ such that μ(a|s) > 0, ∀s ∈ S, a ∈ A
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ μ(·|S_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
            Else:
                Select and store an action A_{t+1} ∼ μ(·|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            ρ ← Π_{i=τ+1}^{min(τ+n−1, T−1)} π(A_i|S_i) / μ(A_i|S_i)      (ρ_{τ+n}^{(τ+1)})
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
            If τ + n ≤ T, then: G ← G + γⁿ Q(S_{τ+n}, A_{τ+n})      (G_τ^{(n)})
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α ρ [G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
    Until τ = T − 1
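A sketch of off-policy n-step Sarsa with importance sampling. The callables `behavior_sample`, `behavior_prob` (giving μ(a|s)), and `target_prob` (giving π(a|s)) are hypothetical interfaces introduced for the example, as is the Gym-style `env` used throughout these sketches.

from collections import defaultdict

def off_policy_n_step_sarsa(env, behavior_sample, behavior_prob, target_prob,
                            n=4, num_episodes=500, alpha=0.1, gamma=0.99):
    """n-step Sarsa driven by a behavior policy mu, corrected toward the target
    policy pi by the importance-sampling ratio rho.
    Hypothetical interfaces:
      behavior_sample(Q, s) -> action drawn from mu(.|s)
      behavior_prob(Q, s, a) -> mu(a|s); target_prob(Q, s, a) -> pi(a|s)
    """
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s0 = env.reset()
        states, acts, rewards = [s0], [behavior_sample(Q, s0)], [0.0]
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    acts.append(behavior_sample(Q, s_next))   # A_{t+1} ~ mu
            tau = t - n + 1
            if tau >= 0:
                # rho = prod_{i=tau+1}^{min(tau+n-1, T-1)} pi(A_i|S_i) / mu(A_i|S_i)
                rho = 1.0
                for i in range(tau + 1, int(min(tau + n - 1, T - 1)) + 1):
                    rho *= target_prob(Q, states[i], acts[i]) / behavior_prob(Q, states[i], acts[i])
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * rho * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q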
Off-policy TD learning of the action-value function qπ (without importance sampling): Tree Backup Algorithm
The idea of the Tree Backup algorithm is to back up the expected action value at every step. Taking the expectation over action values means evaluating Q(S, a) for every possible action a, weighted by the target policy's probabilities π(a|S).
n-step TD method
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0)
    Q_0 ← Q(S_0, A_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
                δ_t ← R_{t+1} − Q_t
            Else:
                δ_t ← R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q_t
                Select arbitrarily and store an action as A_{t+1}
                Q_{t+1} ← Q(S_{t+1}, A_{t+1})
                π_{t+1} ← π(A_{t+1}|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            E ← 1
            G ← Q_τ
            For k = τ, …, min(τ+n−1, T−1):
                G ← G + E δ_k
                E ← γ E π_{k+1}
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α[G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(a|S_τ) is ε-greedy wrt Q(S_τ, ·)
    Until τ = T − 1
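A sketch of the n-step Tree Backup update under the same assumed environment interface; here the behavior actions are chosen uniformly at random (the box allows them to be arbitrary), and the target policy π is ε-greedy in Q, so no importance-sampling ratio is needed.

import random
from collections import defaultdict

def tree_backup(env, actions, n=4, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """n-step Tree Backup: target policy pi is eps-greedy in Q; behavior is arbitrary."""
    Q = defaultdict(float)

    def pi_probs(s):
        """Probabilities pi(a|s) of the eps-greedy target policy."""
        greedy = max(actions, key=lambda x: Q[(s, x)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        return probs

    for _ in range(num_episodes):
        s0 = env.reset()
        a0 = random.choice(actions)                            # arbitrary behavior
        states, acts = [s0], [a0]
        q_vals, deltas, pis = [Q[(s0, a0)]], [], [None]        # pi_0 is never used
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                if done:
                    T = t + 1
                    deltas.append(r - q_vals[t])                       # delta_t
                else:
                    probs = pi_probs(s_next)
                    expected = sum(probs[a] * Q[(s_next, a)] for a in actions)
                    deltas.append(r + gamma * expected - q_vals[t])    # delta_t
                    a_next = random.choice(actions)                    # arbitrary A_{t+1}
                    acts.append(a_next)
                    q_vals.append(Q[(s_next, a_next)])
                    pis.append(probs[a_next])                          # pi_{t+1}
            tau = t - n + 1
            if tau >= 0:
                E, G = 1.0, q_vals[tau]
                for k in range(tau, int(min(tau + n - 1, T - 1)) + 1):
                    G += E * deltas[k]
                    if k + 1 < len(pis):          # pi_{k+1} exists unless S_{k+1} is terminal
                        E *= gamma * pis[k + 1]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * (G - Q[sa])      # pi stays eps-greedy in the updated Q
            if tau == T - 1:
                break
            t += 1
    return Q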
Off-policy TD learning of the action-value function qπ: Q(σ)
Q(σ) unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm, taking importance sampling into account.
When σ = 1, it reduces to Sarsa with importance sampling.
When σ = 0, it reduces to the Tree Backup expected-action-value update.
n-step TD method
Input: behavior policy μ such that μ(a|s) > 0, ∀s ∈ S, a ∈ A
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ μ(·|S_0)
    Q_0 ← Q(S_0, A_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
                δ_t ← R_{t+1} − Q_t
            Else:
                Select and store an action A_{t+1} ∼ μ(·|S_{t+1})
                Select and store σ_{t+1}
                Q_{t+1} ← Q(S_{t+1}, A_{t+1})
                δ_t ← R_{t+1} + γ σ_{t+1} Q_{t+1} + γ(1 − σ_{t+1}) Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q_t
                π_{t+1} ← π(A_{t+1}|S_{t+1})
                ρ_{t+1} ← π(A_{t+1}|S_{t+1}) / μ(A_{t+1}|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            ρ ← 1
            E ← 1
            G ← Q_τ
            For k = τ, …, min(τ+n−1, T−1):
                G ← G + E δ_k
                E ← γ E [(1 − σ_{k+1}) π_{k+1} + σ_{k+1}]
                ρ ← ρ (1 − σ_k + σ_k ρ_k)
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α ρ [G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(a|S_τ) is ε-greedy wrt Q(S_τ, ·)
    Until τ = T − 1
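Finally, a sketch of n-step Q(σ) under the same assumptions: the behavior policy μ is uniform random, the target policy π is ε-greedy in Q, and `sigma_fn(t)` is a hypothetical hook returning σ_t ∈ [0, 1] (returning 1 everywhere recovers importance-sampling Sarsa, 0 everywhere recovers Tree Backup).

import random
from collections import defaultdict

def q_sigma(env, actions, sigma_fn, n=4, num_episodes=500,
            alpha=0.1, gamma=0.99, epsilon=0.1):
    """n-step Q(sigma): behavior mu is uniform random, target pi is eps-greedy
    in Q, and sigma_fn(t) in [0, 1] sets the degree of sampling at step t."""
    Q = defaultdict(float)

    def pi_probs(s):
        greedy = max(actions, key=lambda x: Q[(s, x)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        return probs

    mu = 1.0 / len(actions)                       # mu(a|s) of the uniform behavior
    for _ in range(num_episodes):
        s0 = env.reset()
        a0 = random.choice(actions)               # A_0 ~ mu
        states, acts = [s0], [a0]
        q_vals, deltas = [Q[(s0, a0)]], []
        pis, rhos, sigmas = [None], [None], [None]   # index 0 entries are never used
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                if done:
                    T = t + 1
                    deltas.append(r - q_vals[t])
                else:
                    a_next = random.choice(actions)          # A_{t+1} ~ mu
                    sigma = sigma_fn(t + 1)                  # sigma_{t+1}
                    probs = pi_probs(s_next)
                    q_next = Q[(s_next, a_next)]
                    expected = sum(probs[a] * Q[(s_next, a)] for a in actions)
                    deltas.append(r + gamma * sigma * q_next
                                  + gamma * (1 - sigma) * expected - q_vals[t])
                    acts.append(a_next)
                    q_vals.append(q_next)
                    pis.append(probs[a_next])                # pi_{t+1}
                    rhos.append(probs[a_next] / mu)          # rho_{t+1}
                    sigmas.append(sigma)
            tau = t - n + 1
            if tau >= 0:
                rho, E, G = 1.0, 1.0, q_vals[tau]
                for k in range(tau, int(min(tau + n - 1, T - 1)) + 1):
                    G += E * deltas[k]
                    if k + 1 < len(pis):                     # skip once S_{k+1} is terminal
                        E *= gamma * ((1 - sigmas[k + 1]) * pis[k + 1] + sigmas[k + 1])
                    if k >= 1:                               # sigma_0 / rho_0 do not exist
                        rho *= 1 - sigmas[k] + sigmas[k] * rhos[k]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * rho * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q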
Summary: a Monte Carlo method simulates (or experiences) an episode and, only after the episode ends, uses the returns observed along it to estimate the value of each state visited. Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the newly reached state to update the estimate of the state it just left.
In this sense, a Monte Carlo method can be viewed as TD learning with the maximum possible number of steps (n equal to the episode length).