In the previous two chapters we studied dynamic programming (DP) and Monte Carlo (MC) methods. DP is characterized by state transitions and estimates the state-value function by bootstrapping: the update of the current state's value depends on the existing estimates of other states' values. MC requires no model of the environment and its value estimates are independent of one another, but it only applies to episodic tasks. To obtain a method that needs no environment model, is not restricted to episodic tasks, and can also be used for continuing tasks, temporal-difference learning was developed.
Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo methods and is a central idea of reinforcement learning. The name refers to learning from the difference between value estimates at successive time steps.
TD learning of the policy state-value function vπ
One-step TD method: TD(0)
Initialize V(s) arbitrarily, ∀s ∈ S⁺
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α[R + γV(S′) − V(S)]
        S ← S′
    Until S is terminal
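To make the box above concrete, here is a minimal Python sketch of tabular TD(0) prediction. The tiny random-walk environment, the implicit random-walk policy, and all parameter values are assumptions made for this example, not part of the algorithm itself.

import random
from collections import defaultdict

def random_walk_step(state):
    """One transition of a 5-state random walk (non-terminal states 1..5,
    terminal states 0 and 6); reward +1 only on reaching state 6."""
    next_state = state + random.choice([-1, 1])
    reward = 1.0 if next_state == 6 else 0.0
    done = next_state in (0, 6)
    return next_state, reward, done

def td0_prediction(num_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)               # V(s), initialized to 0 (arbitrary)
    for _ in range(num_episodes):
        s = 3                            # Initialize S (start in the middle)
        done = False
        while not done:                  # Repeat for each step of the episode
            s_next, r, done = random_walk_step(s)   # the "policy" is the walk itself
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next                   # S <- S'
    return V

if __name__ == "__main__":
    values = td0_prediction()
    print({s: round(values[s], 2) for s in range(1, 6)})   # approx. 1/6 .. 5/6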
n-step TD method
Input: the policy π to be evaluated
Initialize V(s) arbitrarily, ∀s ∈ S
Parameters: step size α ∈ (0, 1], a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take an action according to π(·|S_t)
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then T ← t + 1
        τ ← t − n + 1   (τ is the time whose state's estimate is being updated)
        If τ ≥ 0:
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
            If τ + n ≤ T, then: G ← G + γⁿ V(S_{τ+n})      (G_τ^{(n)})
            V(S_τ) ← V(S_τ) + α[G − V(S_τ)]
    Until τ = T − 1
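Below is a Python sketch of the same n-step TD prediction. For readability it stores the whole episode rather than indexing mod n, and it assumes a hypothetical Gym-style environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) plus a `policy(state)` callable.

from collections import defaultdict

def n_step_td_prediction(env, policy, n=4, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular n-step TD prediction of V under `policy`.
    Assumed (hypothetical) interfaces:
      env.reset() -> state
      env.step(action) -> (next_state, reward, done)
      policy(state) -> action
    """
    V = defaultdict(float)                        # V(s), arbitrary init (0 here)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]    # S_0; rewards indexed from R_1
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                       # time whose estimate is updated
            if tau >= 0:
                # G = sum_{i=tau+1}^{min(tau+n, T)} gamma^(i-tau-1) * R_i
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:                   # bootstrap; V(terminal) = 0, so this
                    G += gamma ** n * V[states[tau + n]]   # matches the "<= T" test above
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:                      # Until tau = T - 1
                break
            t += 1
    return V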
The update of V(S_0) uses the rewards R_1, …, R_n together with the estimate V(S_n); the update of V(S_1) uses R_2, …, R_{n+1} together with V(S_{n+1}), and so on.
On-policy TD learning of the action-value function qπ: Sarsa
One-step TD method
Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g. ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
        S ← S′; A ← A′
    Until S is terminal
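A minimal tabular Sarsa sketch in Python, under the same assumed `env.reset()`/`env.step(a)` interface as before; the `epsilon_greedy` helper, the `actions` list, and the hyper-parameters are choices made for this example.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control (Sarsa); env.reset()/env.step(a) as assumed above."""
    Q = defaultdict(float)                       # Q(s, a); Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                       # no bootstrap from the terminal state
            else:
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            # Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next            # S <- S'; A <- A'
    return Q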
n-step TD method
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
            Else:
                Select and store an action A_{t+1} ∼ π(·|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
            If τ + n ≤ T, then: G ← G + γⁿ Q(S_{τ+n}, A_{τ+n})      (G_τ^{(n)})
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α[G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
    Until τ = T − 1
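The corresponding on-policy n-step Sarsa sketch in Python, again storing the whole episode instead of indexing mod n and reusing the hypothetical `env` interface and the `epsilon_greedy` helper from the Sarsa sketch above.

from collections import defaultdict

def n_step_sarsa(env, actions, n=4, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy n-step Sarsa; reuses the epsilon_greedy helper defined above."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s0 = env.reset()
        states = [s0]
        acts = [epsilon_greedy(Q, s0, actions, epsilon)]   # A_0 ~ pi(.|S_0)
        rewards = [0.0]                                    # rewards indexed from R_1
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    acts.append(epsilon_greedy(Q, s_next, actions, epsilon))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:                  # bootstrap on Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * (G - Q[sa])     # pi stays eps-greedy in the updated Q
            if tau == T - 1:
                break
            t += 1
    return Q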
Off-policy TD learning of the action-value function qπ: Q-learning
Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning using the following update rule:
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
One-step TD method
Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g. ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    Until S is terminal
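A minimal Q-learning sketch in Python under the same assumed environment interface; note that the behavior policy is ε-greedy in Q while the update target uses max_a Q(S′, a), which is what makes the method off-policy.

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control (Q-learning); env.reset()/env.step(a) as assumed above."""
    Q = defaultdict(float)                        # Q(s, a); Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behavior policy: eps-greedy with respect to the current Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # target uses max_a Q(S', a), regardless of which action is taken next
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q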
One-step TD method: Double Q-learning
Initialize Q_1(s, a) and Q_2(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily
Initialize Q_1(terminal, ·) = Q_2(terminal, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q_1 and Q_2 (e.g. ε-greedy)
        Take action A, observe R, S′
        With 0.5 probability:
            Q_1(S, A) ← Q_1(S, A) + α[R + γ Q_2(S′, argmax_a Q_1(S′, a)) − Q_1(S, A)]
        Else:
            Q_2(S, A) ← Q_2(S, A) + α[R + γ Q_1(S′, argmax_a Q_2(S′, a)) − Q_2(S, A)]
        S ← S′
    Until S is terminal
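A Double Q-learning sketch under the same assumptions: actions are chosen ε-greedily with respect to Q1 + Q2, and each update uses one table to pick the greedy action and the other to evaluate it, which reduces maximization bias.

import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Double Q-learning; env.reset()/env.step(a) as in the earlier sketches."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behave eps-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            # with probability 0.5 update Q1 using Q2's value of Q1's greedy action,
            # otherwise the symmetric update of Q2
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            best = max(actions, key=lambda x: A[(s_next, x)])
            target = r if done else r + gamma * B[(s_next, best)]
            A[(s, a)] += alpha * (target - A[(s, a)])
            s = s_next
    return Q1, Q2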
Off-policy TD learning of the action-value function qπ (by importance sampling): Sarsa
n-step TD method
Input: behavior policy μ such that μ(a|s) > 0, ∀s ∈ S, a ∈ A
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ μ(·|S_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
            Else:
                Select and store an action A_{t+1} ∼ μ(·|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            ρ ← Π_{i=τ+1}^{min(τ+n−1, T−1)} π(A_i|S_i) / μ(A_i|S_i)      (ρ_{τ+n}^{(τ+1)})
            G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
            If τ + n ≤ T, then: G ← G + γⁿ Q(S_{τ+n}, A_{τ+n})      (G_τ^{(n)})
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α ρ [G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
    Until τ = T − 1
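A sketch of off-policy n-step Sarsa with importance sampling. The callables `behavior_sample`, `behavior_prob` (giving μ(a|s)), and `target_prob` (giving π(a|s)) are hypothetical interfaces introduced for the example, as is the Gym-style `env` used throughout these sketches.

from collections import defaultdict

def off_policy_n_step_sarsa(env, behavior_sample, behavior_prob, target_prob,
                            n=4, num_episodes=500, alpha=0.1, gamma=0.99):
    """n-step Sarsa driven by a behavior policy mu, corrected toward the target
    policy pi by the importance-sampling ratio rho.
    Hypothetical interfaces:
      behavior_sample(Q, s) -> action drawn from mu(.|s)
      behavior_prob(Q, s, a) -> mu(a|s); target_prob(Q, s, a) -> pi(a|s)
    """
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s0 = env.reset()
        states, acts, rewards = [s0], [behavior_sample(Q, s0)], [0.0]
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    acts.append(behavior_sample(Q, s_next))   # A_{t+1} ~ mu
            tau = t - n + 1
            if tau >= 0:
                # rho = prod_{i=tau+1}^{min(tau+n-1, T-1)} pi(A_i|S_i) / mu(A_i|S_i)
                rho = 1.0
                for i in range(tau + 1, int(min(tau + n - 1, T - 1)) + 1):
                    rho *= target_prob(Q, states[i], acts[i]) / behavior_prob(Q, states[i], acts[i])
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(states[tau + n], acts[tau + n])]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * rho * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q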
Off-policy TD learning of the action-value function qπ (without importance sampling): Tree Backup Algorithm
The idea of the Tree Backup algorithm is to back up the expected action value at every step. Taking the expectation over action values means evaluating Q(S, a) for every possible action a, weighted by the target policy's probabilities π(a|S).
n-step TD method
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0)
    Q_0 ← Q(S_0, A_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
                δ_t ← R_{t+1} − Q_t
            Else:
                δ_t ← R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q_t
                Select arbitrarily and store an action as A_{t+1}
                Q_{t+1} ← Q(S_{t+1}, A_{t+1})
                π_{t+1} ← π(A_{t+1}|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            E ← 1
            G ← Q_τ
            For k = τ, …, min(τ+n−1, T−1):
                G ← G + E δ_k
                E ← γ E π_{k+1}
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α[G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(a|S_τ) is ε-greedy wrt Q(S_τ, ·)
    Until τ = T − 1
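A sketch of the n-step Tree Backup update under the same assumed environment interface; here the behavior actions are chosen uniformly at random (the box allows them to be arbitrary), and the target policy π is ε-greedy in Q, so no importance-sampling ratio is needed.

import random
from collections import defaultdict

def tree_backup(env, actions, n=4, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """n-step Tree Backup: target policy pi is eps-greedy in Q; behavior is arbitrary."""
    Q = defaultdict(float)

    def pi_probs(s):
        """Probabilities pi(a|s) of the eps-greedy target policy."""
        greedy = max(actions, key=lambda x: Q[(s, x)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        return probs

    for _ in range(num_episodes):
        s0 = env.reset()
        a0 = random.choice(actions)                            # arbitrary behavior
        states, acts = [s0], [a0]
        q_vals, deltas, pis = [Q[(s0, a0)]], [], [None]        # pi_0 is never used
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                if done:
                    T = t + 1
                    deltas.append(r - q_vals[t])                       # delta_t
                else:
                    probs = pi_probs(s_next)
                    expected = sum(probs[a] * Q[(s_next, a)] for a in actions)
                    deltas.append(r + gamma * expected - q_vals[t])    # delta_t
                    a_next = random.choice(actions)                    # arbitrary A_{t+1}
                    acts.append(a_next)
                    q_vals.append(Q[(s_next, a_next)])
                    pis.append(probs[a_next])                          # pi_{t+1}
            tau = t - n + 1
            if tau >= 0:
                E, G = 1.0, q_vals[tau]
                for k in range(tau, int(min(tau + n - 1, T - 1)) + 1):
                    G += E * deltas[k]
                    if k + 1 < len(pis):          # pi_{k+1} exists unless S_{k+1} is terminal
                        E *= gamma * pis[k + 1]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * (G - Q[sa])      # pi stays eps-greedy in the updated Q
            if tau == T - 1:
                break
            t += 1
    return Q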
Off-policy TD learning of the action-value function qπ: Q(σ)
Q(σ) unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm, taking importance sampling into account.
When σ = 1, it reduces to Sarsa with importance sampling.
When σ = 0, it reduces to the Tree Backup expected-action-value update.
n-step TD method
Input: behavior policy μ such that μ(a|s) > 0, ∀s ∈ S, a ∈ A
Initialize Q(s, a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n
Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ μ(·|S_0)
    Q_0 ← Q(S_0, A_0)
    T ← ∞
    For t = 0, 1, 2, …:
        If t < T, then:
            Take action A_t
            Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
            If S_{t+1} is terminal, then:
                T ← t + 1
                δ_t ← R_{t+1} − Q_t
            Else:
                Select and store an action A_{t+1} ∼ μ(·|S_{t+1})
                Select and store σ_{t+1}
                Q_{t+1} ← Q(S_{t+1}, A_{t+1})
                δ_t ← R_{t+1} + γ σ_{t+1} Q_{t+1} + γ(1 − σ_{t+1}) Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q_t
                π_{t+1} ← π(A_{t+1}|S_{t+1})
                ρ_{t+1} ← π(A_{t+1}|S_{t+1}) / μ(A_{t+1}|S_{t+1})
        τ ← t − n + 1   (τ is the time whose estimate is being updated)
        If τ ≥ 0:
            ρ ← 1
            E ← 1
            G ← Q_τ
            For k = τ, …, min(τ+n−1, T−1):
                G ← G + E δ_k
                E ← γ E [(1 − σ_{k+1}) π_{k+1} + σ_{k+1}]
                ρ ← ρ (1 − σ_k + σ_k ρ_k)
            Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α ρ [G − Q(S_τ, A_τ)]
            If π is being learned, then ensure that π(a|S_τ) is ε-greedy wrt Q(S_τ, ·)
    Until τ = T − 1
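Finally, a sketch of n-step Q(σ) under the same assumptions: the behavior policy μ is uniform random, the target policy π is ε-greedy in Q, and `sigma_fn(t)` is a hypothetical hook returning σ_t ∈ [0, 1] (returning 1 everywhere recovers importance-sampling Sarsa, 0 everywhere recovers Tree Backup).

import random
from collections import defaultdict

def q_sigma(env, actions, sigma_fn, n=4, num_episodes=500,
            alpha=0.1, gamma=0.99, epsilon=0.1):
    """n-step Q(sigma): behavior mu is uniform random, target pi is eps-greedy
    in Q, and sigma_fn(t) in [0, 1] sets the degree of sampling at step t."""
    Q = defaultdict(float)

    def pi_probs(s):
        greedy = max(actions, key=lambda x: Q[(s, x)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        return probs

    mu = 1.0 / len(actions)                       # mu(a|s) of the uniform behavior
    for _ in range(num_episodes):
        s0 = env.reset()
        a0 = random.choice(actions)               # A_0 ~ mu
        states, acts = [s0], [a0]
        q_vals, deltas = [Q[(s0, a0)]], []
        pis, rhos, sigmas = [None], [None], [None]   # index 0 entries are never used
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(acts[t])
                states.append(s_next)
                if done:
                    T = t + 1
                    deltas.append(r - q_vals[t])
                else:
                    a_next = random.choice(actions)          # A_{t+1} ~ mu
                    sigma = sigma_fn(t + 1)                  # sigma_{t+1}
                    probs = pi_probs(s_next)
                    q_next = Q[(s_next, a_next)]
                    expected = sum(probs[a] * Q[(s_next, a)] for a in actions)
                    deltas.append(r + gamma * sigma * q_next
                                  + gamma * (1 - sigma) * expected - q_vals[t])
                    acts.append(a_next)
                    q_vals.append(q_next)
                    pis.append(probs[a_next])                # pi_{t+1}
                    rhos.append(probs[a_next] / mu)          # rho_{t+1}
                    sigmas.append(sigma)
            tau = t - n + 1
            if tau >= 0:
                rho, E, G = 1.0, 1.0, q_vals[tau]
                for k in range(tau, int(min(tau + n - 1, T - 1)) + 1):
                    G += E * deltas[k]
                    if k + 1 < len(pis):                     # skip once S_{k+1} is terminal
                        E *= gamma * ((1 - sigmas[k + 1]) * pis[k + 1] + sigmas[k + 1])
                    if k >= 1:                               # sigma_0 / rho_0 do not exist
                        rho *= 1 - sigmas[k] + sigmas[k] * rhos[k]
                sa = (states[tau], acts[tau])
                Q[sa] += alpha * rho * (G - Q[sa])
            if tau == T - 1:
                break
            t += 1
    return Q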
Summary: a Monte Carlo method simulates (or experiences) an episode and, only after the episode ends, uses the returns observed along it to estimate the value of each state visited. Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the newly reached state to update the estimate of the state it just left.
In this sense, a Monte Carlo method can be viewed as TD learning with the maximum possible number of steps (n equal to the episode length).