
Machine Learning - Whiteboard Derivation Series Notes (35): Dynamic Programming (DP)


These notes follow shuhuai008's whiteboard-derivation video on Bilibili: Dynamic Programming (108 min).

Index of all notes in the series: Machine Learning - Whiteboard Derivation Series Notes

I. Policy Iteration

(1) Policy Evaluation

Given an MDP with known dynamics $P(s',r|s,a)$ and a policy $\pi$, the task is to compute $V_\pi(s)$ for all $s\in S$. Define

$$V_\pi=\begin{pmatrix} V_\pi(s_1) \\ V_\pi(s_2) \\ \vdots \\ V_\pi(s_{|S|}) \end{pmatrix}_{|S|\times 1},\qquad r_\pi=\begin{pmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ \vdots \\ r_\pi(s_{|S|}) \end{pmatrix}_{|S|\times 1},\qquad P_\pi\triangleq\big[P_\pi(s,s')\big]_{|S|\times|S|}$$

Then

$$\begin{aligned} V_\pi(s)&=E_\pi[G_t\mid S_t=s]\\ &=E_\pi[R_{t+1}+\gamma V_\pi(S_{t+1})\mid S_t=s]\\ &=\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_\pi(s')\big]\\ &=\underbrace{\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\,r}_{(1)}+\underbrace{\gamma\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\,V_\pi(s')}_{(2)} \end{aligned}$$

For the first term,

$$(1)=\sum_a\pi(a|s)\cdot\underbrace{\sum_r r\,p(r|s,a)}_{r(s,a)\,\triangleq\, E[R_{t+1}\mid S_t=s,A_t=a]}=\sum_a\pi(a|s)\cdot r(s,a)\triangleq r_\pi(s)$$

For the second term,

$$(2)=\gamma\sum_a\pi(a|s)\sum_{s'}p(s'|s,a)V_\pi(s')=\gamma\sum_{s'}\underbrace{\sum_a\pi(a|s)\,p(s'|s,a)}_{P_\pi(s,s')}V_\pi(s')=\gamma\sum_{s'}P_\pi(s,s')V_\pi(s')$$

Hence

$$V_\pi(s)=r_\pi(s)+\gamma\sum_{s'}P_\pi(s,s')V_\pi(s')$$
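The decomposition into $r_\pi$ and $P_\pi$ can be assembled directly from the tabular model. Below is a minimal NumPy sketch, assuming the dynamics are stored as `P[s][a]`, a list of `(prob, s_next, reward)` triples representing $p(s',r|s,a)$, and the policy as an array `pi[s, a]`; these names and this data layout are illustrative assumptions, not taken from the original notes.

```python
import numpy as np

def build_r_and_P(P, pi, n_states, n_actions):
    """Assemble r_pi(s) = sum_a pi(a|s) r(s,a) and
    P_pi(s,s') = sum_a pi(a|s) p(s'|s,a) from the tabular model.

    P[s][a] is a list of (prob, s_next, reward) triples for p(s',r|s,a);
    pi[s, a] is the probability of taking action a in state s.
    """
    r_pi = np.zeros(n_states)
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s_next, reward in P[s][a]:
                r_pi[s] += pi[s, a] * prob * reward    # term (1): expected immediate reward
                P_pi[s, s_next] += pi[s, a] * prob     # term (2): marginalize over r
    return r_pi, P_pi
```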

1. Analytical Solution

Solve the equation above in matrix form.

Let $s_i\triangleq s$ and $s_j\triangleq s'$. Then

$$V_\pi(s_i)=r_\pi(s_i)+\gamma\sum_{j=1}^{|S|}P_\pi(s_i,s_j)V_\pi(s_j)$$

$$V_\pi=r_\pi+\gamma P_\pi V_\pi$$

$$(I-\gamma P_\pi)V_\pi=r_\pi$$

$$V_\pi=(I-\gamma P_\pi)^{-1}r_\pi$$

The computational complexity is $O(|S|^3)$.
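With $r_\pi$ and $P_\pi$ in hand (for example from the sketch above), the closed form is a single linear solve. Using `np.linalg.solve` avoids forming the inverse explicitly; this is a sketch, not code from the original notes.

```python
import numpy as np

def policy_evaluation_exact(r_pi, P_pi, gamma=0.9):
    """Solve (I - gamma * P_pi) V_pi = r_pi; O(|S|^3) in general."""
    n = r_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```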

2. Iterative Solution (Numerical Solution)

Construct a sequence $\{V_k\}$ that converges to $V_\pi$ (the convergence proof is omitted here):

$$\lim_{k\rightarrow\infty}V_k=V_\pi$$

The Bellman expectation equation gives the update rule

$$V_{k+1}(s)\triangleq\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_k(s')\big]$$
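A minimal sketch of this iterative update, reusing the assumed `P[s][a]` model and `pi[s, a]` policy representation from the earlier sketches; the stopping threshold `theta` is an implementation choice, not part of the derivation.

```python
import numpy as np

def policy_evaluation_iterative(P, pi, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Repeat V_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V_k(s')]
    until the largest change across states falls below theta."""
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    V_new[s] += pi[s, a] * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```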

(2) Policy Improvement

Policy improvement corresponds to the Bellman optimality equation.

1. Policy Improvement Theorem

Given policies $\pi$ and $\pi'$, if $q_\pi(s,\pi'(s))\ge V_\pi(s)$ for all $s\in S$, then $V_{\pi'}(s)\ge V_\pi(s)$ for all $s\in S$.

Proof:

For all $s\in S$, recall that

$$q_\pi(s,a)=\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_\pi(s')\big]=E\big[R_{t+1}+\gamma V_\pi(S_{t+1})\,\big|\,S_t=s,A_t=a\big]$$

Then

$$\begin{aligned} V_\pi(s)&\le q_\pi(s,\pi'(s))\\ &=E\big[R_{t+1}+\gamma V_\pi(S_{t+1})\,\big|\,S_t=s,A_t=\pi'(s)\big]\\ &=E_{\pi'}\big[R_{t+1}+\gamma V_\pi(S_{t+1})\,\big|\,S_t=s\big]\\ &\le E_{\pi'}\big[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1}))\,\big|\,S_t=s\big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma E_{\pi'}\big[R_{t+2}+\gamma V_\pi(S_{t+2})\,\big|\,S_{t+1}\big]\,\Big|\,S_t=s\Big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma E_{\pi'}\big[R_{t+2}\,\big|\,S_{t+1}\big]+\gamma^2 E_{\pi'}\big[V_\pi(S_{t+2})\,\big|\,S_{t+1}\big]\,\Big|\,S_t=s\Big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma R_{t+2}+\gamma^2 V_\pi(S_{t+2})\,\Big|\,S_t=s\Big]\\ &\;\;\vdots\\ &\le E_{\pi'}\Big[\underbrace{R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots}_{G_t}\,\Big|\,S_t=s\Big]\\ &=V_{\pi'}(s) \end{aligned}$$

2. Greedy Policy


Greedy policy: for all $s\in S$,

$$\pi'(s)=\arg\max_a q_\pi(s,a)$$

$$\forall s\in S,\quad V_\pi(s)\le\max_a q_\pi(s,a)=q_\pi(s,\pi'(s))$$

By the policy improvement theorem,

$$\forall s\in S,\quad V_{\pi'}(s)\ge V_\pi(s)$$
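A sketch of the greedy improvement step itself: compute $q_\pi(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma V_\pi(s')]$ from an evaluated $V_\pi$ and take the argmax for each state. The same assumed `P[s][a]` representation is used; ties are broken by the first maximizing action.

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.9):
    """Return the deterministic policy pi'(s) = argmax_a q_pi(s, a)."""
    pi_new = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward in P[s][a]:
                q[a] += prob * (reward + gamma * V[s_next])   # one-step lookahead
        pi_new[s] = int(np.argmax(q))                         # greedy action for s
    return pi_new
```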

If $V_{\pi'}=V_\pi$, then $V_{\pi'}=V_\pi=V_*$. Proof:

$$V_{\pi'}=V_\pi\;\Longrightarrow\;q_{\pi'}=q_\pi$$

For all $s\in S$ (using that the greedy $\pi'$ is deterministic),

$$\begin{aligned} V_{\pi'}(s)&=\sum_a\pi'(a|s)\,q_{\pi'}(s,a)\\ &=\sum_a\pi'(a|s)\,q_\pi(s,a)\\ &=q_\pi(s,\pi'(s))\\ &=\max_a q_\pi(s,a)\\ &=\max_a\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_\pi(s')\big]\\ &=\max_a\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_{\pi'}(s')\big] \end{aligned}$$

Therefore

$$V_{\pi'}(s)=\max_a\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_{\pi'}(s')\big]$$

Since the optimal value function satisfies the Bellman optimality equation

$$V_*(s)=\max_a\sum_{s',r}p(s',r|s,a)\big[r+\gamma V_*(s')\big],$$

we conclude $V_{\pi'}=V_*$, i.e. $V_{\pi'}=V_\pi=V_*$.
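Putting evaluation and improvement together gives the policy iteration loop this section describes. The sketch below reuses the hypothetical `policy_evaluation_iterative` and `greedy_policy` helpers defined above and stores the deterministic policy as an array of action indices; when the greedy policy stops changing, $V_{\pi'}=V_\pi$ and, by the argument above, the policy is optimal.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
    while True:
        pi = np.eye(n_actions)[policy]            # deterministic policy as one-hot rows
        V = policy_evaluation_iterative(P, pi, n_states, n_actions, gamma)
        new_policy = greedy_policy(P, V, n_states, n_actions, gamma)
        if np.array_equal(new_policy, policy):    # stable: V_{pi'} = V_pi = V_*
            return policy, V
        policy = new_policy
```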

II. Value Iteration

Value iteration is policy iteration taken to the extreme: policy evaluation is run for only a single sweep, followed immediately by policy improvement.

Although value iteration performs only one evaluation sweep, that sweep still updates every state in $S$. Simpler still is the in-place variant (a special case of asynchronous dynamic programming), which selects only a single state of $S$ for a one-step update at a time.
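A sketch of value iteration under the same assumed `P[s][a]` model: each sweep is a one-step lookahead with a max over actions in place of a full policy evaluation. An in-place (asynchronous) variant would write to `V[s]` immediately inside the state loop instead of building `V_new`.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Iterate V_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma V_k(s')]."""
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            V_new[s] = q.max()
        if np.max(np.abs(V_new - V)) < theta:
            # Recover the greedy (optimal) policy from the converged values.
            return greedy_policy(P, V_new, n_states, n_actions, gamma), V_new
        V = V_new
```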


Next chapter: Whiteboard Derivation Series Notes (36) - Word Vectors

