Reinforcement Learning (Summary)
1. Difference from other ML paradigms
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non i.i.d. data)
- Agent’s actions affect the subsequent data it receives
2. Reward
- Reward $R_t$ : a scalar feedback signal
- Agent’s job is to maximize cumulative reward
- Reward hypothesis : all goals can be described by the maximization of expected cumulative reward
3. Sequential Decision Making
- Goal: select actions to maximize total future reward
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (greedy $\leftrightarrow$ optimal)
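A quick worked illustration of the greedy-vs-optimal point above, using a hypothetical two-step problem with made-up rewards (Python sketch, not from the lecture):

```python
# Hypothetical two-step problem: action "a" pays 1 now and 0 later,
# action "b" pays 0 now and 10 later.
rewards = {"a": [1, 0], "b": [0, 10]}

greedy_action = max(rewards, key=lambda a: rewards[a][0])     # compares immediate reward only -> "a"
optimal_action = max(rewards, key=lambda a: sum(rewards[a]))  # compares total future reward  -> "b"

print(greedy_action, sum(rewards[greedy_action]))    # a 1
print(optimal_action, sum(rewards[optimal_action]))  # b 10
```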
4. RL Agent
- Policy: agent’s behavior function
- Map from state to action
- Deterministic policy : $a = \pi (s) $
- Stochastic policy : $\pi (a|s) = P[A_t = a | S_t = s] $
- Value function: how good is each state and/or action
- A prediction of future reward, used to evaluate the goodness/badness of states
- State-Value function $$v_{\pi}(s) = E_{\pi} [R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots | S_t = s] $$
- Reward : Immediate payoff (number)
- Value : Future payoff
- Model: agent’s representation of the environment
- A model predicts what the environment will do next
- $P$ : predicts the next state $$P_{ss^{\prime}}^a = P[S_{t+1} = s^{\prime} | S_t = s, A_t = a ] $$
- $R$ : predicts the next (immediate) reward $$R_s^a = E[R_{t+1} | S_t = s, A_t = a] $$
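To make the three components above concrete, here is a minimal tabular sketch for a hypothetical two-state, two-action problem; the state names, probabilities, and rewards are all illustrative and not part of the original notes:

```python
import random

# Stochastic policy pi(a|s): a distribution over actions for each state.
pi = {"s0": {"left": 0.8, "right": 0.2},
      "s1": {"left": 0.1, "right": 0.9}}

# Value function v(s): a prediction of future (discounted) reward from each state.
v = {"s0": 1.5, "s1": 0.3}

# Model of the environment: P predicts the next state, R the immediate reward.
P = {("s0", "left"):  {"s0": 1.0},
     ("s0", "right"): {"s1": 1.0},
     ("s1", "left"):  {"s0": 1.0},
     ("s1", "right"): {"s1": 1.0}}
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}

def act(state):
    """Sample an action A_t ~ pi(.|S_t) from the stochastic policy."""
    dist = pi[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(act("s0"), v["s0"], P[("s0", "right")], R[("s0", "right")])
```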
5. Learning and Planning
- Reinforcement Learning
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
- Planning
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. deliberation, reasoning, introspection, pondering, thought, search
6. Exploration and Exploitation
- Reinforcement learning is like trial-and-error learning
- Exploration : Finds more information about the environment
- Exploitation : Exploits known information to maximize reward
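The notes above frame RL as trial-and-error learning that must balance exploration and exploitation. One common (though not the only) way to do this is ε-greedy action selection; a minimal sketch with made-up action-value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (pick a random action);
    otherwise exploit the action with the highest current estimate."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # exploration
    return max(q_values, key=q_values.get)      # exploitation

q = {"left": 0.2, "right": 0.5, "stay": -0.1}   # illustrative value estimates
print(epsilon_greedy(q))
```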
7. Markov Decision Process (MDP)
- Environment is fully observable
- Current state completely characterizes the process
- Almost all RL problems can be formalized as MDPs
- Markov Property
- The future is independent of the past given the present
- A state $S_t$ is Markov if and only if $$P[S_{t+1} | S_t] = P[S_{t+1} | S_1,\cdots , S_t] $$
- Once the state is known, the history may be thrown away
- State Transition Matrix
- For a Markov state $s$ and successor state $s^\prime$, the state transition probability is defined by $$P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] $$
- State transition matrix P : transition probabilities from all states $s$ to all successor states $s^\prime$ $$P = \begin{bmatrix}P_{11} & \cdots &P_{1n}\\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots &P_{nn} \end{bmatrix} $$
- Each row of the matrix sums to 1
- Markov Process
- Memoryless random process
- The future state does not depend on past states (only on the present state)
- A Markov Process (or Markov Chain) is a tuple $\langle S,P \rangle$
- $S$ is a (finite) set of states
- $P$ is a state transition probability matrix : $P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] $
- Markov Reward Process
- is a Markov chain with values
- A Markov Reward Process is a tuple $\langle S,P,R,\gamma \rangle$
- $S$ is a (finite) set of states
- $P$ is a state transition probability matrix : $P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] $
- $R$ is a reward function : $R_s = E[R_{t+1} | S_t = s] $
- $\gamma $ is a discount factor : $\gamma \in [0,1] $
- Return $G_t$
- Return is the total discounted reward from time-step $t$ $$G_t = R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^kR_{t+k+1} $$
- The discount $\gamma \in [0,1] $ is the present value of future rewards
- $\gamma$ close to $0$ leads to "myopic" (immediate) evaluation
- $\gamma$ close to $1$ leads to "far-sighted" evaluation
- Value Function $v(s)$
- Value Function gives the long-term value of state $s$
- The state value function $v(s)$ of an MRP is the expected return starting from state $s$ $$v(s) = E[G_t|S_t = s] $$
- Finding the optimal value function is the ultimate goal of reinforcement learning
- Bellman Equation
$$\begin{align*} v(s) &= E[G_t | S_t = s] \\ &= E[R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots| S_t = s] \\ &= E[R_{t+1} + \gamma (R_{t+2} +\gamma R_{t+3} + \cdots ) | S_t = s] \\ &= E[R_{t+1}+\gamma G_{t+1} | S_t = s] \\ &= E[R_{t+1}+\gamma v(S_{t+1}) | S_t = s] \end{align*} $$
$$v(s) = R_s + \gamma \sum_{s^\prime \in S} P_{ss^\prime} v(s^\prime) $$
- The state-value function can be solved for via the Bellman Equation
- The Bellman Equation is a linear equation, so it can be solved directly $$v = R +\gamma Pv $$ $$(I-\gamma P) v = R$$ $$v = (I-\gamma P)^{-1} R$$
- Computational complexity is $O(n^3)$ for $n$ states
- Direct solution only possible for small MRPs
- There are many iterative methods for large MRPs (both the direct and iterative approaches are sketched in code below)
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
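A small numerical sketch of the MRP material in this section: computing a discounted return $G_t$, solving the Bellman equation directly via $v = (I-\gamma P)^{-1}R$, and iterating the Bellman backup for comparison. The 3-state transition matrix and rewards are made up for illustration:

```python
import numpy as np

gamma = 0.9

# Return G_t = R_{t+1} + gamma*R_{t+2} + ... for a sample reward sequence.
sample_rewards = [1.0, 0.0, 2.0, 5.0]
G = sum(gamma**k * r for k, r in enumerate(sample_rewards))

# Hypothetical 3-state MRP: each row of P sums to 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])     # R_s = E[R_{t+1} | S_t = s]

# Direct solution (O(n^3)): v = (I - gamma*P)^{-1} R.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative solution: repeatedly apply the Bellman backup v <- R + gamma*P v.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(G, v_direct, v_iter)        # v_direct and v_iter agree
```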
7. Markov Decision Process (cont.)
- Markov reward process with decisions
- A Markov Decision Process is a tuple $\langle S,A,P,R,\gamma \rangle$
- $S$ is a (finite) set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability matrix : $P_{ss^\prime}^a = P[S_{t+1} = s^\prime | S_t = s, A_t = a] $
- $R$ is a reward function : $R_s^a = E[R_{t+1} | S_t = s, A_t = a] $
- $\gamma $ is a discount factor : $\gamma \in [0,1] $
- Policies
- Policy $\pi$ is a distribution over actions given states $$\pi (a|s) = P[A_t = a | S_t = s] $$
- Mapping between current state and action
- A policy fully defines the behavior of an agent
- MDP policies depend on the current state (not the history)
- Policies are stationary (time-independent)
- Value Function
- State-Value function $v_{\pi}(s)$ : expected return starting from state $s$, and then following policy $\pi$ $$v_{\pi}(s) = E_{\pi}[G_t | S_t = s] $$
- Action-Value function $q_{\pi}(s,a)$ : expected return starting from state $s$, taking action $a$, and then following policy $\pi$ $$q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t= a] $$
- Bellman Expectation Equation
- $v_\pi(s)$ is related to $v_\pi(s^\prime)$ : $$v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] $$
- $q_\pi(s,a)$ is related to $q_\pi(s^\prime ,a^\prime )$ : $$q_{\pi}(s,a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1},A_{t+1}) | S_t = s, A_t = a] $$
- Bellman Expectation Equation (cont.)
- The Bellman expectation equation can be expressed concisely using the MRP induced by the policy $\pi$ $$v_\pi = R^\pi + \gamma P^\pi v_\pi \;\;\Rightarrow\;\; v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$$ where $R_s^\pi = \sum_{a \in A} \pi(a|s) R_s^a$ and $P_{ss^\prime}^\pi = \sum_{a \in A} \pi(a|s) P_{ss^\prime}^a$ (a code sketch follows at the end of this section)
- Optimal Value function
- The optimal state-value function $v_*(s)$ is the maximum value function over all policies $$v_*(s) = \max_\pi v_\pi (s) $$
- The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies $$q_*(s,a) = \max_\pi q_\pi (s,a) $$
- Optimal Policy
- Define a partial ordering over policies $$ \pi \geq \pi^\prime \;\;\text{if}\;\; v_\pi (s) \geq v_{\pi^\prime} (s),\;\;\forall s $$
- (To be organized later.)
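Although the optimality material above is left to be organized later, the Bellman expectation equation covered so far is already enough to evaluate a fixed policy. A minimal sketch on a hypothetical 2-state, 2-action MDP (all numbers illustrative): build the induced MRP $P^\pi, R^\pi$, solve $v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$, and recover $q_\pi(s,a) = R_s^a + \gamma \sum_{s^\prime} P_{ss^\prime}^a v_\pi(s^\prime)$:

```python
import numpy as np

gamma = 0.9

# Hypothetical MDP: P[a, s, s'] (each row sums to 1) and R[s, a].
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],      # action 0
              [[0.1, 0.9],
               [0.0, 1.0]]])     # action 1
R = np.array([[1.0, 0.0],        # R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
              [2.0, 3.0]])

# A fixed stochastic policy pi[s, a] = pi(a|s).
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# Induced MRP under pi: P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s.
P_pi = np.einsum("sa,asn->sn", pi, P)
R_pi = np.einsum("sa,sa->s", pi, R)

# Bellman expectation equation in matrix form: v_pi = (I - gamma*P^pi)^{-1} R^pi.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Action values: q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s').
q_pi = R + gamma * np.einsum("asn,n->sa", P, v_pi)

print(v_pi)
print(q_pi)   # acting greedily w.r.t. q_pi is one step of policy improvement
```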