Reinforcement Learning (Summary)
1. Difference from other ML paradigms
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non i.i.d. data)
- Agent’s actions affect the subsequent data it receives
2. Reward
- Reward R_t : a scalar feedback signal
- Agent’s job is to maximize cumulative reward
- Reward hypothesis : all goals can be described by the maximization of expected cumulative reward
3. Sequential Decision Making
- Goal: select actions to maximize total future reward
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (a greedy choice is not always the optimal one)
4. RL Agent
- Policy: agent’s behavior function
- Map from state to action
- Deterministic policy : a = \pi (s)
- Stochastic policy : \pi (a|s) = P[A_t = a | S_t = s]
- Value function: how good is each state and/or action
- Prediction of future reward. Evaluate the goodness/badness of states
- State-Value function v_{\pi}(s) = E_{\pi} [R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots | S_t = s]
- Reward : Immediate payoff (number)
- Value : Future payoff
- Model: agent’s representation of the environment
- A model predicts what the environment will do next
- P : predicts the next state P_{ss^{\prime}}^a = P[S_{t+1} = s^{\prime} | S_t = s, A_t = a ]
- R : predicts the next (immediate) reward R_s^a = E[R_{t+1} | S_t = s, A_t = a]
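Below is a minimal Python sketch of how these three components (policy, value function, model) might be represented for a small finite problem; all state/action names and numbers are illustrative assumptions, not from the notes.

```python
import random

# Toy representation of the three agent components for a small, finite problem.
# States, actions, and all numbers here are made up for illustration.
states  = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Policy: map from state to a distribution over actions, pi(a|s).
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
    "s2": {"left": 1.0, "right": 0.0},
}

# Value function: prediction of expected future (discounted) reward per state.
value = {"s0": 0.0, "s1": 1.2, "s2": -0.5}

# Model: P predicts the next state, R predicts the next immediate reward.
P = {("s0", "right"): {"s1": 0.8, "s0": 0.2}}   # P^a_{ss'}
R = {("s0", "right"): 1.0}                       # R^a_s

def sample_action(state):
    """Sample an action from the stochastic policy pi(a|s)."""
    dist = policy[state]
    return random.choices(list(dist), weights=dist.values())[0]

print(sample_action("s0"))
```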
5. Learning and Planning
- Reinforcement Learning
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
- Planning
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. deliberation, reasoning, introspection, pondering, thought, search
6. Exploration and Exploitation
- Reinforcement learning is like trial-and-error learning
- Exploration : Finds more information about the environment
- Exploitation : Exploits known information to maximize reward
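A common concrete recipe for balancing the two is ε-greedy action selection; the sketch below assumes a made-up table of action-value estimates.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(list(q_estimates))        # explore
    return max(q_estimates, key=q_estimates.get)       # exploit

# Illustrative action-value estimates for a single state.
q_estimates = {"left": 0.3, "right": 0.7}
print(epsilon_greedy(q_estimates, epsilon=0.1))
```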
7. Markov Decision Process (MDP)
- Environment is fully observable
- Current state completely characterizes the process
- Almost all RL problems can be formalized as MDPs
- Markov Property
- The future is independent of the past given the present
- A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1,\cdots , S_t]
- Once the state is known, the history may be thrown away
- State Transition Matrix
- For a Markov state s and successor state s^\prime, the state transition probability is defined by P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
- State transition matrix P : transition probabilities from all states s to all successor states s^\prime P = \begin{bmatrix}P_{11} & \cdots &P_{1n}\\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots &P_{nn} \end{bmatrix}
- Each row of the matrix sums to 1
- Markov Process
- Memoryless random process
- The future state does not depend on past states, given the present state
- A Markov Process (or Markov Chain) is a tuple <S,P>
- S is a (finite) set of states
- P is a state transition probability matrix : P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
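A small numerical sketch of a transition matrix P (each row sums to 1) and of sampling a Markov chain <S,P> from it; the 3-state chain and its probabilities are made-up numbers.

```python
import numpy as np

# Made-up 3-state transition matrix; each row P[s, :] sums to 1.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)

def sample_chain(start, steps):
    """Sample a trajectory S_0, S_1, ... using only the current state (Markov property)."""
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # next state depends only on the current state s
        path.append(int(s))
    return path

print(sample_chain(start=0, steps=10))
```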
- Markov Reward Process
- is a Markov chain with values
- A Markov Reward Process is a tuple <S,P,R,\gamma >
- S is a (finite) set of states
- P is a state transition probability matrix : P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
- R is a reward function : R_s = E[R_{t+1} | S_t = s]
- \gamma is a discount factor : \gamma \in [0,1]
- Return G_t
- Return is the total discounted reward from time-step t G_t = R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^kR_{t+k+1}
- The discount \gamma \in [0,1] is the present value of future rewards
- \gamma close to 0 leads to myopic (immediate) evaluation
- \gamma close to 1 leads to far-sighted evaluation
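A short sketch of computing G_t for a finite, made-up reward sequence, showing how γ trades off immediate against far-sighted evaluation.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0.0, 0.0, 0.0, 10.0]           # delayed reward at the end
print(discounted_return(rewards, 0.0))    # 0.0  -> myopic: ignores the delayed reward
print(discounted_return(rewards, 0.9))    # 7.29 -> far-sighted: values it
```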
- Value Function v(s)
- Value Function gives the long-term value of state s
- The state value function v(s) of an MRP is the expected return starting from state s v(s) = E[G_t|S_t = s]
- Finding the optimal value function is the ultimate goal of reinforcement learning
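Since v(s) is an expectation over returns, it can be estimated by averaging sampled returns (truncated here to a finite horizon) from many episodes starting in s; this is the Monte-Carlo idea mentioned below. The 3-state MRP in the sketch is a made-up example.

```python
import numpy as np

# Toy 3-state MRP (made-up numbers): transition matrix P, reward R_s, discount gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s, horizon=200):
    """Sample one (truncated) return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]             # R_{t+1} is determined by the state left at time t
        discount *= gamma
        s = rng.choice(len(P), p=P[s])
    return g

# Monte-Carlo estimate of v(s): average of sampled returns.
v0 = np.mean([sample_return(0) for _ in range(2000)])
print(f"v(0) ~= {v0:.2f}")
```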
- Bellman Equation
\begin{align*} v(s) &= E[G_t | S_t = s] \\ &= E[R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots| S_t = s] \\ &= E[R_{t+1} + \gamma (R_{t+2} +\gamma R_{t+3} + \cdots ) | S_t = s] \\ &= E[R_{t+1}+\gamma G_{t+1} | S_t = s] \\ &= E[R_{t+1}+\gamma v(S_{t+1}) | S_t = s] \end{align*}
v(s) = R_s + \gamma \sum_{s^\prime \in S} P_{ss^\prime} v(s^\prime)
- The value of the state-value function can be obtained through the Bellman Equation
- The Bellman Equation is a linear equation, so it can be solved directly: v = R + \gamma P v \;\Rightarrow\; (I - \gamma P) v = R \;\Rightarrow\; v = (I - \gamma P)^{-1} R (see the sketch after this list)
- Computational complexity is O(n^3) for n states
- Direct solution only possible for small MRPs
- There are many iterative methods for large MRPs
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
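A sketch of the direct solution v = (I - γP)^{-1}R on the same kind of made-up 3-state MRP as above, with a simple iterative (dynamic-programming style) sweep shown for comparison.

```python
import numpy as np

# Made-up 3-state MRP: transition matrix P, reward vector R, discount gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9

# Direct solution of the Bellman equation: v = R + gamma * P v  =>  (I - gamma*P) v = R.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative alternative (repeated Bellman backup) for comparison.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct)
print(v_iter)   # converges to the same values
```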
7. Markov Decision Process (cont.)
- Markov reward process with decisions
- Markov Decision Process is a tuple <S,A,P,R,\gamma >
- S is a (finite) set of states
- A is a finite set of actions
- P is a state transition probability matrix : P_{ss^\prime}^a = P[S_{t+1} = s^\prime | S_t = s, A_t = a]
- R is a reward function : R_s^a = E[R_{t+1} | S_t = s, A_t = a]
- \gamma is a discount factor : \gamma \in [0,1]
- Policies
- Policy \pi is a distribution over actions given states \pi (a|s) = P[A_t = a | S_t = s]
- Mapping between current state and action
- A policy fully defines the behavior of an agent
- MDP policies depend on the current state (not the history)
- Policies are stationary (time-independent)
- Value Function
- State-Value function v_{\pi}(s) : expected return starting from state s, and then following policy \pi v_{\pi}(s) = E_{\pi}[G_t | S_t = s]
- Action-Value function q_{\pi}(s,a) : expected return starting from state s, taking action a, and then following policy \pi q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t= a]
- Bellman Expectation Equation
- v_\pi (s) is related to v_\pi (s^\prime) : v_\pi (s) = E_\pi [R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t = s]
- q_\pi (s,a) is related to q_\pi (s^\prime ,a^\prime ) : q_\pi (s,a) = E_\pi [R_{t+1} + \gamma q_\pi (S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
- Bellman Expectation Equation (cont.)
- The Bellman expectation equation can be expressed concisely using the induced MRP: v_\pi = R^\pi + \gamma P^\pi v_\pi, where P_{ss^\prime}^\pi = \sum_a \pi(a|s) P_{ss^\prime}^a and R_s^\pi = \sum_a \pi(a|s) R_s^a, with direct solution v_\pi = (I - \gamma P^\pi)^{-1} R^\pi (see the sketch below)
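A sketch of evaluating a fixed policy via the induced MRP; the 2-state, 2-action MDP and the policy are made-up numbers.

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[a, s, s'] = P^a_{ss'}, R[a, s] = R^a_s.
P = np.array([[[0.9, 0.1],      # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],      # action 1
               [0.0, 1.0]]])
R = np.array([[1.0,  0.0],      # action 0
              [2.0, -1.0]])     # action 1
gamma = 0.9

# A fixed stochastic policy pi[s, a] = pi(a|s).
pi = np.array([[0.5, 0.5],
               [0.8, 0.2]])

# Induced MRP:  P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)

# Bellman expectation equation in matrix form: v_pi = R^pi + gamma * P^pi v_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Recover the action-value function: q_pi(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("ast,t->as", P, v_pi)

print("v_pi:", v_pi)
print("q_pi:", q_pi)   # indexed as q_pi[a, s]
```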
- Optimal Value function
- The optimal state-value function v_*(s) is the maximum value function over all policies v_*(s) = \max_\pi v_\pi (s)
- The optimal action-value function q_*(s,a) is the maximum action-value function over all policies q_*(s,a) = \max_\pi q_\pi (s,a)
- Optimal Policy
- Define a partial ordering over policies \pi \geq \pi^\prime \;\;if\;\; v_\pi (s) \geq v_{\pi^\prime} (s),\;\;\forall s
- ....... (to be organized later)
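One standard way to compute v_* and a greedy optimal policy for a known finite MDP is value iteration; the sketch below reuses the made-up MDP layout from the previous example (value iteration itself is a generic algorithm, not something spelled out in these notes).

```python
import numpy as np

# Same made-up 2-state, 2-action MDP layout as above: P[a, s, s'], R[a, s].
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0,  0.0],
              [2.0, -1.0]])
gamma = 0.9

# Value iteration: repeatedly apply v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ]
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->as", P, v)   # q[a, s]
    v = q.max(axis=0)                              # v_*(s) = max_a q_*(s, a)

pi_star = q.argmax(axis=0)   # greedy deterministic policy: a = argmax_a q_*(s, a)
print(v, pi_star)
```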