[AI] Reinforcement Learning _ Content

JAsmine_log · July 27, 2024

Reinforcement Learning: An Introduction

1 Introduction

1.1 Reinforcement Learning

1.2 Examples

1.3 Elements of Reinforcement Learning

1.4 Limitations and Scope

1.5 An Extended Example: Tic-Tac-Toe

1.6 Early History of Reinforcement Learning

I Tabular Solution Methods

2 Multi-armed Bandits

2.1 A k-armed Bandit Problem

2.2 Action-value Methods

2.3 The 10-armed Testbed

2.4 Incremental Implementation

2.5 Tracking a Nonstationary Problem

2.6 Optimistic Initial Values

2.7 Upper-Confidence-Bound Action Selection

2.8 Gradient Bandit Algorithms

2.9 Associative Search (Contextual Bandits)

3 Finite Markov Decision Processes

3.1 The Agent–Environment Interface

3.2 Goals and Rewards

3.3 Returns and Episodes

3.4 Unified Notation for Episodic and Continuing Tasks

3.5 Policies and Value Functions

3.6 Optimal Policies and Optimal Value Functions

3.7 Optimality and Approximation

4 Dynamic Programming

4.1 Policy Evaluation (Prediction)

4.2 Policy Improvement

4.3 Policy Iteration

4.4 Value Iteration

4.5 Asynchronous Dynamic Programming

4.6 Generalized Policy Iteration

4.7 Efficiency of Dynamic Programming

5 Monte Carlo Methods

5.1 Monte Carlo Prediction

5.2 Monte Carlo Estimation of Action Values

5.3 Monte Carlo Control

5.4 Monte Carlo Control without Exploring Starts

5.5 Off-policy Prediction via Importance Sampling

5.6 Incremental Implementation

5.7 Off-policy Monte Carlo Control

5.8 *Discounting-aware Importance Sampling

5.9 *Per-reward Importance Sampling

6 Temporal-Difference Learning

6.1 TD Prediction

6.2 Advantages of TD Prediction Methods

6.3 Optimality of TD(0)

6.4 Sarsa: On-policy TD Control

6.5 Q-learning: Off-policy TD Control

6.6 Expected Sarsa

6.7 Maximization Bias and Double Learning

6.8 Games, Afterstates, and Other Special Cases

7 n-step Bootstrapping

7.1 n-step TD Prediction

7.2 n-step Sarsa

7.3 n-step Off-policy Learning by Importance Sampling

7.4 *Per-reward Off-policy Methods

7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

7.6 *A Unifying Algorithm: n-step Q(σ)

8 Planning and Learning with Tabular Methods

8.1 Models and Planning

8.2 Dyna: Integrating Planning, Acting, and Learning

8.3 When the Model Is Wrong

8.4 Prioritized Sweeping

8.5 Expected vs. Sample Updates

8.6 Trajectory Sampling

8.7 Real-time Dynamic Programming

8.8 Planning at Decision Time

8.9 Heuristic Search

8.10 Rollout Algorithms

II Approximate Solution Methods

9 On-policy Prediction with Approximation

9.1 Value-function Approximation

9.2 The Prediction Objective (VE)

9.3 Stochastic-gradient and Semi-gradient Methods

9.4 Linear Methods

9.5 Feature Construction for Linear Methods

9.5.1 Polynomials

9.5.2 Fourier Basis

9.5.3 Coarse Coding

9.5.4 Tile Coding

9.5.5 Radial Basis Functions

9.6 Nonlinear Function Approximation: Artificial Neural Networks

9.7 Least-Squares TD

9.8 Memory-based Function Approximation

9.9 Kernel-based Function Approximation

9.10 Looking Deeper at On-policy Learning: Interest and Emphasis

10 On-policy Control with Approximation

10.1 Episodic Semi-gradient Control

10.2 n-step Semi-gradient Sarsa

10.3 Average Reward: A New Problem Setting for Continuing Tasks

10.4 Deprecating the Discounted Setting

10.5 n-step Differential Semi-gradient Sarsa

11 *Off-policy Methods with Approximation

11.1 Semi-gradient Methods

11.2 Examples of Off-policy Divergence

11.3 The Deadly Triad

11.4 Linear Value-function Geometry

11.5 Stochastic Gradient Descent in the Bellman Error

11.6 The Bellman Error is Not Learnable

11.7 Gradient-TD Methods

11.8 Emphatic-TD Methods

11.9 Reducing Variance

12 Eligibility Traces

12.1 The λ-return

12.2 TD(λ)

12.3 n-step Truncated-return Methods

12.4 Redoing Updates: The Online λ-return Algorithm

12.5 True Online TD(λ)

12.6 Dutch Traces in Monte Carlo Learning

12.7 Sarsa(λ)

12.8 Variable λ and γ

12.9 Off-policy Eligibility Traces

12.10 Watkins’s Q(λ) to Tree-Backup(λ)

12.11 Stable Off-policy Methods with Traces

12.12 Implementation Issues

13 Policy Gradient Methods

13.1 Policy Approximation and its Advantages

13.2 The Policy Gradient Theorem

13.3 REINFORCE: Monte Carlo Policy Gradient

13.4 REINFORCE with Baseline

13.5 Actor–Critic Methods

13.6 Policy Gradient for Continuing Problems

13.7 Policy Parameterization for Continuous Actions


Reference
[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2014.
