Reinforcement Learning (Summary)

1. Difference from other ML paradigms
  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d. data)
  • Agent’s actions affect the subsequent data it receives
2. Reward
  • Reward R_t : a scalar feedback signal
  • Agent’s job is to maximize cumulative reward
  • Reward hypothesis : all goals can be described by the maximization of expected cumulative reward

3. Sequential Decision Making
  • Goal: select actions to maximize total future reward
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward (the greedy choice is not always the optimal one)
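A tiny numerical illustration of this trade-off, as a sketch only (the two-step setup, actions "a"/"b", and all numbers below are made-up assumptions, not from the lecture):

```python
# Hypothetical two-step choice: action "a" pays +1 now and nothing afterwards,
# action "b" pays 0 now but leads to a +10 reward on the next step.
immediate = {"a": 1.0, "b": 0.0}
future = {"a": 0.0, "b": 10.0}
gamma = 0.9  # assumed discount factor

total = {act: immediate[act] + gamma * future[act] for act in immediate}
greedy = max(immediate, key=immediate.get)   # looks only at the next reward
optimal = max(total, key=total.get)          # looks at the total future reward

print(greedy, optimal)   # -> a b : the greedy choice is not the optimal one
```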

4. RL Agent
  • Policy: agent’s behavior function
    • Map from state to action
    • Deterministic policy : a = \pi (s)
    • Stochastic policy : \pi (a|s) = P[A_t = a | S_t = s]
  • Value function: how good is each state and/or action
    • A prediction of future reward, used to evaluate the goodness/badness of states
    • State-Value function v_{\pi}(s) = E_{\pi} [R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots | S_t = s]
    • Reward : immediate payoff (a single number)
    • Value : expected future payoff
  • Model: agent’s representation of the environment
    • A model predicts what the environment will do next
    • P : predicts the next state P_{ss^{\prime}}^a = P[S_{t+1} = s^{\prime} | S_t = s, A_t = a ]
    • R : predicts the next (immediate) reward R_s^a = E[R_{t+1} | S_t = s, A_t = a]
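A minimal sketch of these three components for a made-up two-state, two-action problem (the state/action names, probabilities, and rewards below are illustrative assumptions):

```python
import random

# Policy: deterministic a = pi(s) as a plain mapping from state to action
deterministic_pi = {"s0": "right", "s1": "left"}

# Stochastic policy pi(a|s) = P[A_t = a | S_t = s] as per-state distributions
stochastic_pi = {"s0": {"left": 0.3, "right": 0.7},
                 "s1": {"left": 0.9, "right": 0.1}}

# Value function: a prediction of expected discounted future reward per state
v = {"s0": 1.8, "s1": 0.4}   # illustrative numbers, not computed here

# Model: the agent's representation of the environment,
# P[s'|s,a] as nested dicts and R(s,a) = E[R_{t+1} | S_t = s, A_t = a]
P = {("s0", "right"): {"s0": 0.1, "s1": 0.9}, ("s0", "left"): {"s0": 1.0},
     ("s1", "right"): {"s0": 0.5, "s1": 0.5}, ("s1", "left"): {"s1": 1.0}}
R = {("s0", "right"): 0.0, ("s0", "left"): -1.0,
     ("s1", "right"): 2.0, ("s1", "left"): 0.0}

# Sampling an action from the stochastic policy in state s0
probs = stochastic_pi["s0"]
action = random.choices(list(probs), weights=list(probs.values()))[0]
```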

5. Learning and Planning
  • Reinforcement Learning
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

6. Exploration and Exploitation
  • Reinforcement learning is like trial-and-error learning
  • Exploration : Finds more information about the environment
  • Exploitation : Exploits known information to maximize reward
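One standard way to balance the two is ε-greedy action selection. The sketch below applies it to a toy multi-armed bandit; the bandit setup, the reward distributions, and ε = 0.1 are assumptions for illustration, not part of these notes:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]     # unknown to the agent
q_est = np.zeros(3)              # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1                    # fraction of steps spent exploring

for t in range(1000):
    if rng.random() < epsilon:
        a = int(rng.integers(3))       # exploration: gather more information
    else:
        a = int(np.argmax(q_est))      # exploitation: use current knowledge
    r = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average

print(np.round(q_est, 2))   # estimates approach the true means over time
```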

7. Markov Decision Process (MDP)
  • Environment is fully observable
  • Current state completely characterizes the process
  • Almost all RL problems can be formalized as MDPs
  • Markov Property
    • The future is independent of the past given the present
    • A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1,\cdots , S_t]
    • Once the state is known, the history may be thrown away
  • State Transition Matrix
    • For a Markov state s and successor state s^\prime, the state transition probability is defined by P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
    • State transition matrix P : transition probabilities from all states s to all successor states s^\prime: P = \begin{bmatrix}P_{11} & \cdots &P_{1n}\\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots &P_{nn} \end{bmatrix}
    • Each row of the matrix sums to 1
  • Markov Process
    • Memoryless random process
    • Given the present state, the future does not depend on past states
    • A Markov Process (or Markov Chain) is a tuple <S,P>
      • S is a (finite) set of states
      • P is a state transition probability matrix : P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
  • Markov Reward Process
    • is a Markov chain with values
    • A Markov Reward Process is a tuple <S,P,R,\gamma > 
      • S is a (finite) set of states
      • P is a state transition probability matrix : P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
      • R is a reward function : R_s = E[R_{t+1} | S_t = s]
      • \gamma is a discount factor : \gamma \in [0,1]
  • Return G_t
    • Return is the total discounted reward from time-step t: G_t = R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^kR_{t+k+1} (computed numerically in the sketch after this list)
    • The discount \gamma \in [0,1] is the present value of future rewards
      • \gamma close to 0 leads to immediate evaluation
      • \gamma close to 1 leads to far-sighted evaluation
  • Value Function v(s)
    • Value Function gives the long-term value of state s
    • The state value function v(s) of an MRP is the expected return starting from state s: v(s) = E[G_t|S_t = s]
    • Finding the optimal value function is the ultimate goal of reinforcement learning
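A minimal sampling sketch of the pieces above, assuming a made-up 3-state MRP (the transition matrix, rewards, and γ = 0.9 are illustrative, not from the lecture): the rows of P sum to 1, and one trajectory of the chain gives one sample of the discounted return G_0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state MRP
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])      # P_{ss'} = P[S_{t+1} = s' | S_t = s]
R = np.array([1.0, 0.0, 5.0])        # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

assert np.allclose(P.sum(axis=1), 1.0)   # each row of P sums to 1

# Sample a trajectory of the Markov chain and accumulate the discounted return
s, G, discount = 0, 0.0, 1.0
for t in range(50):
    G += discount * R[s]          # reward collected while leaving state s
    discount *= gamma
    s = rng.choice(3, p=P[s])     # S_{t+1} ~ P[. | S_t = s]

print(round(G, 3))                # one sampled return G_0 starting from state 0
```

Averaging many such sampled returns per start state approximates v(s) = E[G_t | S_t = s].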

  • Bellman Equation
\begin{align*} v(s) &= E[G_t | S_t = s] \\ &= E[R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots| S_t = s] \\ &= E[R_{t+1} + \gamma (R_{t+2} +\gamma R_{t+3} + \cdots ) | S_t = s] \\ &= E[R_{t+1}+\gamma G_{t+1} | S_t = s] \\ &= E[R_{t+1}+\gamma v(S_{t+1}) | S_t = s] \end{align*}
    • Value of the current state = immediate reward + discounted value of the successor state
v(s) = R_s + \gamma \sum_{s^\prime \in S} P_{ss^\prime} v(s^\prime)
    • The state value function can be computed via the Bellman equation
    • The Bellman equation is linear, so it can be solved directly (a numerical sketch follows this list): v = R + \gamma Pv \;\Rightarrow\; (I-\gamma P) v = R \;\Rightarrow\; v = (I-\gamma P)^{-1} R
    • Computational complexity is O(n^3) for n states
    • Direct solution only possible for small MRPs
    • There are many iterative methods for large MRPs
      • Dynamic programming
      • Monte-Carlo evaluation
      • Temporal-Difference learning
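A numerical sketch of both routes for the same illustrative 3-state MRP as above (all numbers assumed): the direct matrix solution v = (I − γP)^{-1}R, and a dynamic-programming style iteration v ← R + γPv.

```python
import numpy as np

# Same illustrative 3-state MRP as in the earlier sketch
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, 5.0])
gamma = 0.9

# Direct solution: v = (I - gamma P)^{-1} R, O(n^3) in the number of states
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative alternative: repeatedly apply the Bellman equation v <- R + gamma P v
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(np.round(v_direct, 3), np.round(v_iter, 3))   # the two solutions agree
```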

7. Markov Decision Process (cont.)
  • Markov reward process with decisions
  • Markov Decision Process is a tuple <S,A,P,R,\gamma >
    • S is a (finite) set of states
    • A is a finite set of actions
    • P is a state transition probability matrix : P_{ss^\prime}^a = P[S_{t+1} = s^\prime | S_t = s, A_t = a]
    • R is a reward function : R_s^a = E[R_{t+1} | S_t = s, A_t = a]
    • \gamma is a discount factor : \gamma \in [0,1]
  • Policies
    • A policy \pi is a distribution over actions given states: \pi (a|s) = P[A_t = a | S_t = s]
    • Mapping between current state and action
    • A policy fully defines the behavior of an agent
    • MDP policies depend on the current state (not the history)
      • Policies are stationary (time-independent)
  • Value Function
    • State-Value function v_{\pi}(s) : the expected return starting from state s and then following policy \pi, v_{\pi}(s) = E_{\pi}[G_t | S_t = s]
    • Action-Value function q_{\pi}(s,a) : the expected return starting from state s, taking action a, and then following policy \pi, q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t= a]
  • Bellman Expectation Equation
    • v_\pi (s) is related to v_\pi (s^\prime)
    • q_\pi (s,a) is related to q_\pi (s^\prime ,a^\prime )
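A minimal Monte Carlo sketch of these definitions, assuming a made-up 2-state, 2-action MDP (all transition probabilities, rewards, γ, and the policy π below are illustrative): v_π(s) is estimated by averaging sampled returns obtained while following π.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])   # P[s, a, s'] = P[S_{t+1}=s' | s, a]
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a] = E[R_{t+1} | S_t=s, A_t=a]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])    # pi[s, a] = P[A_t = a | S_t = s]
gamma, horizon, episodes = 0.9, 60, 2000

def sampled_return(s0):
    """One sampled (truncated) return G_0 starting from s0 and following pi."""
    s, G, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])
        G += discount * R[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])
    return G

# v_pi(s) ~ average of sampled returns (Monte Carlo evaluation)
v_mc = [np.mean([sampled_return(s) for _ in range(episodes)]) for s in range(2)]
print(np.round(v_mc, 2))
```

Estimating q_π(s, a) works the same way, except that the first action is fixed to a before following π.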

  • Bellman Expectation Equation (cont.)
    • The Bellman expectation equation can be expressed concisely using the MRP induced by the policy: v_\pi (s) = R_s^\pi + \gamma \sum_{s^\prime \in S} P_{ss^\prime}^\pi v_\pi (s^\prime), where P_{ss^\prime}^\pi = \sum_{a \in A} \pi (a|s) P_{ss^\prime}^a and R_s^\pi = \sum_{a \in A} \pi (a|s) R_s^a (see the sketch after this list)
  • Optimal Value function
    • The optimal state-value function v_*(s) is the maximum value function over all policies: v_*(s) = \max_\pi v_\pi (s)
    • The optimal action-value function q_*(s,a) is the maximum action-value function over all policies: q_*(s,a) = \max_\pi q_\pi (s,a)
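A sketch of the induced-MRP form of the Bellman expectation equation mentioned above, reusing the same illustrative 2-state, 2-action MDP (all numbers assumed): averaging the dynamics and rewards over π gives P^π and R^π, after which v_π is solved exactly as in the MRP case, and q_π follows from v_π.

```python
import numpy as np

# Same illustrative 2-state, 2-action MDP as in the Monte Carlo sketch
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a]
pi = np.array([[0.5, 0.5], [0.2, 0.8]])    # pi[s, a]
gamma = 0.9

# Induced MRP: average the dynamics and rewards over the policy
P_pi = np.einsum("sa,sax->sx", pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = np.einsum("sa,sa->s", pi, R)     # R^pi_s     = sum_a pi(a|s) R^a_s

# Bellman expectation equation in matrix form: v_pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s,a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("sax,x->sa", P, v_pi)

print(np.round(v_pi, 2), np.round(q_pi, 2))
```

The values returned here should match the Monte Carlo estimates from the previous sketch up to sampling error.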

  • Optimal Policy
    • Define a partial ordering over policies \pi \geq \pi^\prime \;\;if\;\; v_\pi (s) \geq v_{\pi^\prime} (s),\;\;\forall s
    • ....... (to be completed later)
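A brute-force sketch of the definition v_*(s) = max_π v_π(s) above, for the same tiny illustrative MDP (enumeration is only feasible for very small problems; a practical solver would use dynamic programming instead):

```python
import numpy as np
from itertools import product

# Same illustrative 2-state, 2-action MDP as in the earlier sketches
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [2.0, 0.0]])     # R[s, a]
gamma = 0.9

def evaluate(policy):
    """Exact v_pi for a deterministic policy given as a tuple of actions."""
    P_pi = np.array([P[s, policy[s]] for s in range(2)])
    R_pi = np.array([R[s, policy[s]] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Enumerate every deterministic policy and take the state-wise maximum value
values = {pol: evaluate(pol) for pol in product(range(2), repeat=2)}
v_star = np.max(np.array(list(values.values())), axis=0)  # v_*(s) = max_pi v_pi(s)
best = max(values, key=lambda pol: values[pol].sum())     # one optimal policy

print(np.round(v_star, 2), best)
```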
