Reinforcement Learning (Summary)

1. Difference from other ML paradigms
  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d. data)
  • Agent’s actions affect the subsequent data it receives
2. Reward
  • Reward $R_t$ : a scalar feedback signal
  • Agent’s job is to maximize cumulative reward
  • Reward hypothesis : all goals can be described by the maximization of expected cumulative reward

3. Sequential Decision Making
  • Goal: select actions to maximize total future reward
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward (greedy $\leftrightarrow$ optimal)
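
As a toy illustration of this trade-off (all numbers below are made up), the greedy choice grabs the larger immediate reward but collects less total reward than the far-sighted choice:

```python
# Hypothetical two-step problem: each option gives an immediate reward, then a later one.
rewards = {
    "left":  [1, 0],    # looks better to a greedy agent (larger immediate reward)
    "right": [0, 10],   # better in the long run (larger cumulative reward)
}

greedy_action = max(rewards, key=lambda a: rewards[a][0])     # compares only R_{t+1}
optimal_action = max(rewards, key=lambda a: sum(rewards[a]))  # compares total reward

print(greedy_action, sum(rewards[greedy_action]))     # left 1
print(optimal_action, sum(rewards[optimal_action]))   # right 10
```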

4. RL Agent
  • Policy: agent’s behavior function
    • Map from state to action
    • Deterministic policy : $a = \pi (s) $
    • Stochastic policy : $\pi (a|s) = P[A_t = a | S_t = s] $
  • Value function: how good is each state and/or action
    • Prediction of future reward; evaluates the goodness/badness of states
    • State-Value function $$v_{\pi}(s) = E_{\pi} [R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots | S_t = s] $$
    • Reward : Immediate payoff (number)
    • Value : Future payoff
  • Model: agent’s representation of the environment
    • A model predicts what the environment will do next
    • $P$ : predicts the next state $$P_{ss^{\prime}}^a = P[S_{t+1} = s^{\prime} | S_t = s, A_t = a ] $$
    • $R$ : predicts the next (immediate) reward $$R_s^a = E[R_{t+1} | S_t = s, A_t = a] $$
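
A minimal sketch of the three components above (policy, value function, model) for a made-up two-state problem; every name and number below is an illustrative assumption, not something from the original lecture:

```python
import random

# A made-up two-state, two-action problem (illustrative only).
states, actions = ["s0", "s1"], ["a0", "a1"]

# Deterministic policy: a = pi(s), a plain map from state to action.
pi_det = {"s0": "a0", "s1": "a1"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s].
pi_stoch = {"s0": {"a0": 0.7, "a1": 0.3},
            "s1": {"a0": 0.5, "a1": 0.5}}

def sample_action(s):
    """Sample an action from the stochastic policy pi(.|s)."""
    dist = pi_stoch[s]
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Value function: predicted future (discounted) reward for each state.
v = {"s0": 1.8, "s1": -0.4}           # illustrative numbers; normally these are learned

# Model: P predicts the next state, R predicts the next (immediate) reward.
P = {("s0", "a0"): {"s0": 0.9, "s1": 0.1},    # P^a_{ss'} = P[S_{t+1}=s' | S_t=s, A_t=a]
     ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
     ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
     ("s1", "a1"): {"s0": 0.0, "s1": 1.0}}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,    # R^a_s = E[R_{t+1} | S_t=s, A_t=a]
     ("s1", "a0"): 2.0, ("s1", "a1"): -1.0}

print(pi_det["s0"], sample_action("s0"), v["s1"], R[("s0", "a1")])
```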

5. Learning and Planning
  • Reinforcement Learning
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

6. Exploration and Exploitation
  • Reinforcement learning is like trial-and-error learning
  • Exploration : Finds more information about the environment
  • Exploitation : Exploits known information to maximize reward
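
Epsilon-greedy action selection is one common way to balance the two (it is not spelled out in these notes, so take this as an illustrative sketch with made-up value estimates): with probability epsilon the agent explores a random action, otherwise it exploits the highest-valued one.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Pick an action from estimated action values.

    q_estimates: dict mapping action -> current value estimate (assumed given).
    epsilon:     probability of exploring a uniformly random action.
    """
    if random.random() < epsilon:
        return random.choice(list(q_estimates))       # explore
    return max(q_estimates, key=q_estimates.get)      # exploit

# Illustrative estimates; in practice these would be learned from experience.
q = {"left": 0.4, "right": 1.2, "stay": -0.1}
print(epsilon_greedy(q))   # usually "right", occasionally a random action
```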

7. Markov Decision Process (MDP)
  • Environment is fully observable
  • Current state completely characterizes the process
  • Almost all RL problems can be formalized as MDPs
  • Markov Property
    • The future is independent of the past given the present
    • A state $S_t$ is Markov if and only if $$P[S_{t+1} | S_t] = P[S_{t+1} | S_1,\cdots , S_t] $$
    • Once the state is known, the history may be thrown away
  • State Transition Matrix
    • For a Markov state $s$ and successor state $s^\prime$, the state transition probability is defined by $$P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] $$
    • State transition matrix P : transition probabilities from all states $s$ to all successor states $s^\prime$ $$P = \begin{bmatrix}P_{11} & \cdots &P_{1n}\\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots &P_{nn} \end{bmatrix} $$
    • Each row of the matrix sums to 1
  • Markov Process
    • Memoryless random process
    • The future state does not depend on past states, given the present state
    • A Markov Process (or Markov Chain) is a tuple $<S,P>$
      • $S$ is a (finite) set of states
      • $P$ is a state transition probability matrix : $P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] $
  • Markov Reward Process
    • is a Markov chain with values
    • A Markov Reward Process is a tuple $<S,P,R,\gamma >$ 
      • $S$ is a (finite) set of states
      • $P$ is a state transition probability matrix : $P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] $
      • $R$ is a reward function : $R_s = E[R_{t+1} | S_t = s] $
      • $\gamma $ is a discount factor : $\gamma \in [0,1] $
  • Return $G_t$
    • Return is the total discounted reward from time-step $t$ $$G_t = R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^kR_{t+k+1} $$
    • The discount $\gamma \in [0,1] $ is the present value of future rewards
      • $\gamma$ close to $0$ leads to immediate (myopic) evaluation
      • $\gamma$ close to $1$ leads to far-sighted evaluation
  • Value Function $v(s)$
    • Value Function gives the long-term value of state $s$
    • The state value function $v(s)$ of an MRP is the expected return starting from state $s$ $$v(s) = E[G_t|S_t = s] $$
    • Finding the optimal value function is the ultimate goal of reinforcement learning
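
A small numerical sketch tying the pieces above together for a hypothetical 3-state MRP (the states, transition matrix, and rewards below are made up): check that each row of $P$ sums to 1, sample an episode from the Markov chain, and compute its discounted return $G_t$; averaging many such returns from a fixed start state estimates $v(s) = E[G_t | S_t = s]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state MRP <S, P, R, gamma> (numbers are illustrative).
states = ["A", "B", "C"]
P = np.array([[0.5, 0.5, 0.0],    # P[s' | s]; each row must sum to 1
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9]])
R = np.array([1.0, 0.0, -1.0])    # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

assert np.allclose(P.sum(axis=1), 1.0)   # rows of the transition matrix sum to 1

def sample_return(start=0, steps=50):
    """Sample one (truncated) episode and compute its discounted return G_t."""
    s, G, discount = start, 0.0, 1.0
    for _ in range(steps):
        G += discount * R[s]                    # reward received on leaving state s
        discount *= gamma
        s = rng.choice(len(states), p=P[s])     # Markov: next state depends only on s
    return G

# Averaging many sampled returns from state A estimates v(A) = E[G_t | S_t = A].
print(np.mean([sample_return(start=0) for _ in range(2000)]))
```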

  • Bellman Equation
$$\begin{align*} v(s) &= E[G_t | S_t = s] \\ &= E[R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots| S_t = s] \\ &= E[R_{t+1} + \gamma (R_{t+2} +\gamma R_{t+3} + \cdots ) | S_t = s] \\ &= E[R_{t+1}+\gamma G_{t+1} | S_t = s] \\ &= E[R_{t+1}+\gamma v(S_{t+1}) | S_t = s] \end{align*} $$
    • Value of the current state = immediate reward + discounted value of the successor state
$$v(s) = R_s + \gamma \sum_{s^\prime \in S} P_{ss^\prime} v(s^\prime) $$
    • The state-value function can be computed by solving the Bellman equation
    • The Bellman equation is a linear equation, so it can be solved directly $$v = R +\gamma Pv $$ $$(I-\gamma P) v = R$$ $$v = (I-\gamma P)^{-1} R$$
    • Computational complexity is $O(n^3)$ for $n$ states
    • A direct solution is only feasible for small MRPs (a small numerical example follows this list)
    • There are many iterative methods for large MRPs
      • Dynamic programming
      • Monte-Carlo evaluation
      • Temporal-Difference learning
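
A small numerical example of the direct solution, reusing the same hypothetical 3-state MRP as in the previous sketch: solve $(I - \gamma P)v = R$ with a linear solver instead of forming the inverse explicitly; the result should roughly match the sampled-return estimate above.

```python
import numpy as np

# Same hypothetical 3-state MRP as in the previous sketch (illustrative numbers).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.1, 0.0, 0.9]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma * P) v = R.
I = np.eye(len(R))
v = np.linalg.solve(I - gamma * P, R)   # O(n^3); fine for a small MRP
print(v)                                # v(A), v(B), v(C)

# Sanity check: v satisfies the Bellman equation.
assert np.allclose(v, R + gamma * P @ v)
```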

7. Markov Decision Process (cont.)
  • Markov reward process with decisions
  • Markov Decision Process is a tuple $<S,A,P,R,\gamma >$
    • $S$ is a (finite) set of states
    • $A$ is a finite set of actions
    • $P$ is a state transition probability matrix : $P_{ss^\prime}^a = P[S_{t+1} = s^\prime | S_t = s, A_t = a] $
    • $R$ is a reward function : $R_s^a = E[R_{t+1} | S_t = s, A_t = a] $
    • $\gamma $ is a discount factor : $\gamma \in [0,1] $
  • Policies
    • Policy $\pi$ is a distribution over actions given states $$\pi (a|s) = P[A_t = a | S_t = s] $$
    • Mapping between current state and action
    • A policy fully defines the behavior of an agent
    • MDP policies depend on the current state (not the history)
      • Policies are stationary (time-independent)
  • Value Function
    • State-Value function $v_{\pi}(s)$ : expected return starting from state $s$, and then following policy $\pi$ $$v_{\pi}(s) = E_{\pi}[G_t | S_t = s] $$
    • Action-Value function $q_{\pi}(s,a)$ : expected return starting from state $s$, taking action $a$, and then following policy $\pi$ $$q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t= a] $$
  • Bellman Expectation Equation
    • $v_{\pi}(s)$ can be decomposed into the immediate reward plus the discounted value of the successor state $$v_{\pi}(s) = E_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) | S_t = s] $$
    • $q_{\pi}(s,a)$ can be decomposed similarly $$q_{\pi}(s,a) = E_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] $$
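
A minimal sketch, with made-up numbers, of evaluating a fixed stochastic policy on a small MDP: averaging $P$ and $R$ over $\pi(a|s)$ yields an MRP (the "induced MRP" referred to in the next bullet), whose Bellman equation is solved for $v_\pi$; $q_\pi$ then follows from $R^a_s$ plus the discounted $v_\pi$ of the successor states.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP <S, A, P, R, gamma>; all numbers are illustrative.
gamma = 0.9
P = np.array([                      # P[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
    [[0.8, 0.2], [0.3, 0.7]],       # action a0
    [[0.1, 0.9], [0.6, 0.4]],       # action a1
])
R = np.array([[1.0, 0.0],           # R[a, s] = E[R_{t+1} | S_t=s, A_t=a]
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],          # pi[s, a] = pi(a|s), a fixed stochastic policy
               [0.2, 0.8]])

# Averaging over pi(a|s) gives the induced MRP (P_pi, R_pi).
P_pi = np.einsum("sa,ast->st", pi, P)     # P_pi[s, s']
R_pi = np.einsum("sa,as->s", pi, R)       # R_pi[s]

# Solve the Bellman expectation equation for v_pi, then read off q_pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
q_pi = R + gamma * np.einsum("ast,t->as", P, v_pi)      # q_pi[a, s]

print(v_pi)        # v_pi(s) for each state
print(q_pi.T)      # q_pi(s, a), one row per state
assert np.allclose(v_pi, np.einsum("sa,as->s", pi, q_pi))  # v_pi(s) = sum_a pi(a|s) q_pi(s,a)
```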

  • Bellman Expectation Equation (cont.)
    • The Bellman expectation equation can be expressed concisely using the MRP induced by the policy: averaging $P$ and $R$ over $\pi(a|s)$ gives $P^\pi$ and $R^\pi$, and then $$v_\pi = R^\pi + \gamma P^\pi v_\pi $$
  • Optimal Value function
    • The optimal state-value function $v_*(s)$ is the maximum value function over all policies $$v_*(s) = \max_\pi v_\pi (s) $$
    • The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies $$q_*(s,a) = \max_\pi q_\pi (s,a) $$
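
To make $v_*(s) = \max_\pi v_\pi(s)$ concrete for the same made-up 2-state, 2-action MDP as above: evaluate every deterministic policy and take the state-wise maximum (for a finite MDP a deterministic optimal policy always exists, so this brute-force search suffices here; it is a sketch, not a practical algorithm).

```python
import numpy as np
from itertools import product

# Same hypothetical 2-state, 2-action MDP as in the previous sketch (illustrative numbers).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])     # P[a, s, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                   # R[a, s]
n_s, n_a = 2, 2

def evaluate(policy):
    """v_pi for a deterministic policy given as a tuple (action in s0, action in s1)."""
    P_pi = np.array([P[policy[s], s] for s in range(n_s)])
    R_pi = np.array([R[policy[s], s] for s in range(n_s)])
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

values = {pol: evaluate(pol) for pol in product(range(n_a), repeat=n_s)}
v_star = np.max(np.stack(list(values.values())), axis=0)   # v_*(s) = max_pi v_pi(s)
best = max(values, key=lambda pol: values[pol].sum())      # a deterministic policy attaining v_*

print(v_star)                 # optimal state values
print(best, values[best])     # the best policy and its value function
```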

  • Optimal Policy
    • Define a partial ordering over policies $$ \pi \geq \pi^\prime \;\;\text{if}\;\; v_\pi (s) \geq v_{\pi^\prime} (s),\;\;\forall s $$
    • ....... (to be organized later)
