Mathematical Foundations of Reinforcement Learning (English Edition)

Shiyu Zhao (趙世鈺)

Publisher: Tsinghua University
Publication date: 2024-07-01
List price: $708
Sale price: $602 (15% off)
Language: Simplified Chinese
Pages: 301
ISBN: 7302658528
ISBN-13: 9787302658528
Related categories: Reinforcement, Chemistry, English
Translated from: Mathematical Foundations of Reinforcement Learning (Hardcover)

Stock is ordered from the supplier once the order is placed (approx. 4 to 6 weeks)

Product Description

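This book starts from the most basic concepts of reinforcement learning and introduces the fundamental analytical tools, including the Bellman equation and the Bellman optimality equation. It then extends to model-based and model-free reinforcement learning algorithms, and finally to reinforcement learning methods based on function approximation. The book emphasizes introducing concepts, analyzing problems, and analyzing algorithms from a mathematical perspective, rather than the programming implementation of the algorithms. It assumes no prior background in reinforcement learning and requires only some knowledge of probability theory and linear algebra. For readers who already have a foundation in reinforcement learning, the book can deepen their understanding of certain problems and offer new perspectives. It is intended for undergraduate students, graduate students, researchers, and practitioners in companies or research institutes who are interested in reinforcement learning.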

About the Author

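Shiyu Zhao is a distinguished research fellow in the AI division of the School of Engineering at Westlake University, head of the Intelligent Unmanned Systems Laboratory, and a recipient of the youth program of the national overseas high-level talent recruitment plan. He received his bachelor's and master's degrees from Beihang University and his Ph.D. from the National University of Singapore, and previously served as a Lecturer in the Department of Automatic Control and Systems Engineering at the University of Sheffield, UK. He works on developing interesting, useful, and challenging next-generation robotic systems, with a focus on control, decision-making, and perception in multi-robot systems.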

Table of Contents

Overview of this Book

Chapter 1 Basic Concepts

1.1 A grid world example

1.2 State and action

1.3 State transition

1.4 Policy

1.5 Reward

1.6 Trajectories, returns, and episodes

1.7 Markov decision processes

1.8 Summary

1.9 Q&A

Chapter 2 State Values and the Bellman Equation

2.1 Motivating example 1: Why are returns important?

2.2 Motivating example 2: How to calculate returns?

2.3 State values

2.4 The Bellman equation

2.5 Examples for illustrating the Bellman equation

2.6 Matrix-vector form of the Bellman equation

2.7 Solving state values from the Bellman equation

2.7.1 Closed-form solution

2.7.2 Iterative solution

2.7.3 Illustrative examples

2.8 From state value to action value

2.8.1 Illustrative examples

2.8.2 The Bellman equation in terms of action values

2.9 Summary

2.10 Q&A

Chapter 3 Optimal State Values and the Bellman Optimality Equation

3.1 Motivating example: How to improve policies?

3.2 Optimal state values and optimal policies

3.3 The Bellman optimality equation

3.3.1 Maximization of the right-hand side of the BOE

3.3.2 Matrix-vector form of the BOE

3.3.3 Contraction mapping theorem

3.3.4 Contraction property of the right-hand side of the BOE

3.4 Solving an optimal policy from the BOE

3.5 Factors that influence optimal policies

3.6 Summary

3.7 Q&A

Chapter 4 Value Iteration and Policy Iteration

4.1 Value iteration

4.1.1 Elementwise form and implementation

4.1.2 Illustrative examples

4.2 Policy iteration

4.2.1 Algorithm analysis

4.2.2 Elementwise form and implementation

4.2.3 Illustrative examples

4.3 Truncated policy iteration

4.3.1 Comparing value iteration and policy iteration

4.3.2 Truncated policy iteration algorithm

4.4 Summary

4.5 Q&A

Chapter 5 Monte Carlo Methods

5.1 Motivating example: Mean estimation

5.2 MC Basic: The simplest MC-based algorithm

5.2.1 Converting policy iteration to be model-free

5.2.2 The MC Basic algorithm

5.2.3 Illustrative examples

5.3 MC Exploring Starts

5.3.1 Utilizing samples more efficiently

5.3.2 Updating policies more efficiently

5.3.3 Algorithm description

5.4 MC ε-Greedy: Learning without exploring starts

5.4.1 ε-greedy policies

5.4.2 Algorithm description

5.4.3 Illustrative examples

5.5 Exploration and exploitation of ε-greedy policies

5.6 Summary

5.7 Q&A

Chapter 6 Stochastic Approximation

6.1 Motivating example: Mean estimation

6.2 Robbins-Monro algorithm

6.2.1 Convergence properties

6.2.2 Application to mean estimation

6.3 Dvoretzky's convergence theorem

6.3.1 Proof of Dvoretzky's theorem

6.3.2 Application to mean estimation

6.3.3 Application to the Robbins-Monro theorem

6.3.4 An extension of Dvoretzky's theorem

6.4 Stochastic gradient descent

6.4.1 Application to mean estimation

6.4.2 Convergence pattern of SGD

6.4.3 A deterministic formulation of SGD

6.4.4 BGD, SGD, and mini-batch GD

6.4.5 Convergence of SGD

6.5 Summary

6.6 Q&A

Chapter 7 Temporal-Difference Methods

7.1 TD learning of state values

7.1.1 Algorithm description

7.1.2 Property analysis

7.1.3 Convergence analysis

7.2 TD learning of action values: Sarsa

7.2.1 Algorithm description

7.2.2 Optimal policy learning via Sarsa

7.3 TD learning of action values: n-step Sarsa

7.4 TD learning of optimal action values: Q-learning

7.4.1 Algorithm description

7.4.2 Off-policy vs. on-policy

7.4.3 Implementation

7.4.4 Illustrative examples

7.5 A unified viewpoint

7.6 Summary

7.7 Q&A

Chapter 8 Value Function Approximation

8.1 Value representation: From table to function

8.2 TD learning of state values with function approximation

8.2.1 Objective function

8.2.2 Optimization algorithms

8.2.3 Selection of function approximators

8.2.4 Illustrative examples

8.2.5 Theoretical analysis

8.3 TD learning of action values with function approximation

8.3.1 Sarsa with function approximation

8.3.2 Q-learning with function approximation

8.4 Deep Q-learning

8.4.1 Algorithm description

8.4.2 Illustrative examples

8.5 Summary

8.6 Q&A

Chapter 9 Policy Gradient Methods

9.1 Policy representation: From table to function

9.2 Metrics for defining optimal policies

9.3 Gradients of the metrics

9.3.1 Derivation of the gradients in the discounted case

9.3.2 Derivation of the gradients in the undiscounted case

9.4 Monte Carlo policy gradient (REINFORCE)

9.5 Summary

9.6 Q&A

Chapter 10 Actor-Critic Methods

10.1 The simplest actor-critic algorithm (QAC)

10.2 Advantage actor-critic (A2C)

10.2.1 Baseline invariance

10.2.2 Algorithm description

10.3 Off-policy actor-critic

10.3.1 Importance sampling

10.3.2 The off-policy policy gradient theorem

10.3.3 Algorithm description

10.4 Deterministic actor-critic

10.4.1 The deterministic policy gradient theorem

10.4.2 Algorithm description

10.5 Summary

10.6 Q&A

Appendix A Preliminaries for Probability Theory

Appendix B Measure-Theoretic Probability Theory

Appendix C Convergence of Sequences

C.1 Convergence of deterministic sequences

C.2 Convergence of stochastic sequences

Appendix D Preliminaries for Gradient Descent

Bibliography

Symbols

Index
