Reinforcement Learning (Summary)
1. Difference from other ML paradigms
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non i.i.d. data)
- Agent’s actions affect the subsequent data it receives
2. Reward
- Reward R_t : a scalar feedback signal
- Agent’s job is to maximize cumulative reward
- Reward hypothesis : all goals can be described by the maximization of expected cumulative reward
3. Sequential Decision Making
- Goal: select actions to maximize total future reward
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (a greedy choice is not always the optimal one)
4. RL Agent
- Policy: agent’s behavior function
- Map from state to action
- Deterministic policy : a = \pi (s)
- Stochastic policy : \pi (a|s) = P[A_t = a | S_t = s]
- Value function: how good is each state and/or action
- Prediction of future reward. Evaluate the goodness/badness of states
- State-Value function v_{\pi}(s) = E_{\pi} [R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots | S_t = s]
- Reward : Immediate payoff (number)
- Value : Future payoff
- Model: agent’s representation of the environment
- A model predicts what the environment will do next
- P : predicts the next state P_{ss^{\prime}}^a = P[S_{t+1} = s^{\prime} | S_t = s, A_t = a ]
- R : predicts the next (immediate) reward R_s^a = E[R_{t+1} | S_t = s, A_t = a]
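Below is a minimal Python sketch of how these three components (policy, value function, model) might be represented for a small finite problem; all state/action names and numbers are illustrative assumptions, not from the notes.

```python
import random

# Toy representation of the three agent components for a small, finite problem.
# States, actions, and all numbers here are made up for illustration.
states  = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Policy: map from state to a distribution over actions, pi(a|s).
policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
    "s2": {"left": 1.0, "right": 0.0},
}

# Value function: prediction of expected future (discounted) reward per state.
value = {"s0": 0.0, "s1": 1.2, "s2": -0.5}

# Model: P predicts the next state, R predicts the next immediate reward.
P = {("s0", "right"): {"s1": 0.8, "s0": 0.2}}   # P^a_{ss'}
R = {("s0", "right"): 1.0}                       # R^a_s

def sample_action(state):
    """Sample an action from the stochastic policy pi(a|s)."""
    dist = policy[state]
    return random.choices(list(dist), weights=dist.values())[0]

print(sample_action("s0"))
```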
5. Learning and Planning
- Reinforcement Learning
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
- Planning
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. deliberation, reasoning, introspection, pondering, thought, search
6. Exploration and Exploitation
- Reinforcement learning is like trial-and-error learning
- Exploration : Finds more information about the environment
- Exploitation : Exploits known information to maximize reward
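A common concrete recipe for balancing the two is ε-greedy action selection; the sketch below assumes a made-up table of action-value estimates.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(list(q_estimates))        # explore
    return max(q_estimates, key=q_estimates.get)       # exploit

# Illustrative action-value estimates for a single state.
q_estimates = {"left": 0.3, "right": 0.7}
print(epsilon_greedy(q_estimates, epsilon=0.1))
```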
7. Markov Decision Process (MDP)
- Environment is fully observable
- Current state completely characterizes the process
- Almost all RL problems can be formalized as MDPs
- Markov Property
- The future is independent of the past given the present
- A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1,\cdots , S_t]
- Once the state is known, the history may be thrown away
- State Transition Matrix
- For a Markov state s and successor state s^\prime, the state transition probability is defined by P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
- State transition matrix P : transition probabilities from all states s to all successor states s^\prime P = \begin{bmatrix}P_{11} & \cdots &P_{1n}\\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots &P_{nn} \end{bmatrix}
- Each row of the matrix sums to 1
- Markov Process
- Memoryless random process
- The future state does not depend on past states, given the present state
- A Markov Process (or Markov Chain) is a tuple <S,P>
- S is a (finite) set of states
- P is a state transition probability matrix : P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
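A small numerical sketch of a transition matrix P (each row sums to 1) and of sampling a Markov chain <S,P> from it; the 3-state chain and its probabilities are made-up numbers.

```python
import numpy as np

# Made-up 3-state transition matrix; each row P[s, :] sums to 1.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)

def sample_chain(start, steps):
    """Sample a trajectory S_0, S_1, ... using only the current state (Markov property)."""
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # next state depends only on the current state s
        path.append(int(s))
    return path

print(sample_chain(start=0, steps=10))
```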
- Markov Reward Process
- is a Markov chain with values
- A Markov Reward Process is a tuple <S,P,R,\gamma >
- S is a (finite) set of states
- P is a state transition probability matrix : P_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s]
- R is a reward function : R_s = E[R_{t+1} | S_t = s]
- \gamma is a discount factor : \gamma \in [0,1]
- Return G_t
- Return is the total discounted reward from time-step t G_t = R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^kR_{t+k+1}
- The discount \gamma \in [0,1] is the present value of future rewards
- \gamma close to 0 leads to myopic (immediate) evaluation
- \gamma close to 1 leads to far-sighted evaluation
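A short sketch of computing G_t for a finite, made-up reward sequence, showing how γ trades off immediate against far-sighted evaluation.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for a finite episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0.0, 0.0, 0.0, 10.0]           # delayed reward at the end
print(discounted_return(rewards, 0.0))    # 0.0  -> myopic: ignores the delayed reward
print(discounted_return(rewards, 0.9))    # 7.29 -> far-sighted: values it
```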
- Value Function v(s)
- Value Function gives the long-term value of state s
- The state value function v(s) of an MRP is the expected return starting from state s v(s) = E[G_t|S_t = s]
- Finding the optimal value function is the ultimate goal of reinforcement learning
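Since v(s) is an expectation over returns, it can be estimated by averaging sampled returns (truncated here to a finite horizon) from many episodes starting in s; this is the Monte-Carlo idea mentioned below. The 3-state MRP in the sketch is a made-up example.

```python
import numpy as np

# Toy 3-state MRP (made-up numbers): transition matrix P, reward R_s, discount gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s, horizon=200):
    """Sample one (truncated) return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]             # R_{t+1} is determined by the state left at time t
        discount *= gamma
        s = rng.choice(len(P), p=P[s])
    return g

# Monte-Carlo estimate of v(s): average of sampled returns.
v0 = np.mean([sample_return(0) for _ in range(2000)])
print(f"v(0) ~= {v0:.2f}")
```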
- Bellman Equation
\begin{align*} v(s) &= E[G_t | S_t = s] \\ &= E[R_{t+1} + \gamma R_{t+2} +\gamma^2 R_{t+3} + \cdots| S_t = s] \\ &= E[R_{t+1} + \gamma (R_{t+2} +\gamma R_{t+3} + \cdots ) | S_t = s] \\ &= E[R_{t+1}+\gamma G_{t+1} | S_t = s] \\ &= E[R_{t+1}+\gamma v(S_{t+1}) | S_t = s] \end{align*}
v(s) = R_s + \gamma \sum_{s^\prime \in S} P_{ss^\prime} v(s^\prime)
- The value of the state-value function can be obtained through the Bellman Equation
- The Bellman Equation is a linear equation, so it can be solved directly: v = R + \gamma P v \;\Rightarrow\; (I - \gamma P) v = R \;\Rightarrow\; v = (I - \gamma P)^{-1} R (see the sketch after this list)
- Computational complexity is O(n^3) for n states
- Direct solution only possible for small MRPs
- There are many iterative methods for large MRPs
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
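A sketch of the direct solution v = (I - γP)^{-1}R on the same kind of made-up 3-state MRP as above, with a simple iterative (dynamic-programming style) sweep shown for comparison.

```python
import numpy as np

# Made-up 3-state MRP: transition matrix P, reward vector R, discount gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9

# Direct solution of the Bellman equation: v = R + gamma * P v  =>  (I - gamma*P) v = R.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative alternative (repeated Bellman backup) for comparison.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct)
print(v_iter)   # converges to the same values
```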
7. Markov Decision Process (cont.)
- Markov reward process with decisions
- Markov Decision Process is a tuple <S,A,P,R,\gamma >
- S is a (finite) set of states
- A is a finite set of actions
- P is a state transition probability matrix : P_{ss^\prime}^a = P[S_{t+1} = s^\prime | S_t = s, A_t = a]
- R is a reward function : R_s^a = E[R_{t+1} | S_t = s, A_t = a]
- \gamma is a discount factor : \gamma \in [0,1]
- Policies
- Policy \pi is a distribution over actions given states \pi (a|s) = P[A_t = a | S_t = s]
- Mapping between current state and action
- A policy fully defines the behavior of an agent
- MDP policies depend on the current state (not the history)
- Policies are stationary (time-independent)
- Value Function
- State-Value function v_{\pi}(s) : expected return starting from state s, and then following policy \pi v_{\pi}(s) = E_{\pi}[G_t | S_t = s]
- Action-Value function q_{\pi}(s,a) : expected return starting from state s, taking action a, and then following policy \pi q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t= a]
- Bellman Expectation Equation
- v_\pi (s) is related to v_\pi (s^\prime) : v_\pi (s) = E_\pi [R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t = s]
- q_\pi (s,a) is related to q_\pi (s^\prime ,a^\prime ) : q_\pi (s,a) = E_\pi [R_{t+1} + \gamma q_\pi (S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
- Bellman Expectation Equation (cont.)
- The Bellman expectation equation can be expressed concisely using the induced MRP: v_\pi = R^\pi + \gamma P^\pi v_\pi, where P_{ss^\prime}^\pi = \sum_a \pi(a|s) P_{ss^\prime}^a and R_s^\pi = \sum_a \pi(a|s) R_s^a, with direct solution v_\pi = (I - \gamma P^\pi)^{-1} R^\pi (see the sketch below)
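A sketch of evaluating a fixed policy via the induced MRP; the 2-state, 2-action MDP and the policy are made-up numbers.

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[a, s, s'] = P^a_{ss'}, R[a, s] = R^a_s.
P = np.array([[[0.9, 0.1],      # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],      # action 1
               [0.0, 1.0]]])
R = np.array([[1.0,  0.0],      # action 0
              [2.0, -1.0]])     # action 1
gamma = 0.9

# A fixed stochastic policy pi[s, a] = pi(a|s).
pi = np.array([[0.5, 0.5],
               [0.8, 0.2]])

# Induced MRP:  P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)

# Bellman expectation equation in matrix form: v_pi = R^pi + gamma * P^pi v_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Recover the action-value function: q_pi(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("ast,t->as", P, v_pi)

print("v_pi:", v_pi)
print("q_pi:", q_pi)   # indexed as q_pi[a, s]
```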
- Optimal Value function
- The optimal state-value function v_*(s) is the maximum value function over all policies v_*(s) = \max_\pi v_\pi (s)
- The optimal action-value function q_*(s,a) is the maximum action-value function over all policies q_*(s,a) = \max_\pi q_\pi (s,a)
- Optimal Policy
- Define a partial ordering over policies \pi \geq \pi^\prime \;\;if\;\; v_\pi (s) \geq v_{\pi^\prime} (s),\;\;\forall s
- ....... (to be organized later)
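One standard way to compute v_* and a greedy optimal policy for a known finite MDP is value iteration; the sketch below reuses the made-up MDP layout from the previous example (value iteration itself is a generic algorithm, not something spelled out in these notes).

```python
import numpy as np

# Same made-up 2-state, 2-action MDP layout as above: P[a, s, s'], R[a, s].
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0,  0.0],
              [2.0, -1.0]])
gamma = 0.9

# Value iteration: repeatedly apply v(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} v(s') ]
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->as", P, v)   # q[a, s]
    v = q.max(axis=0)                              # v_*(s) = max_a q_*(s, a)

pi_star = q.argmax(axis=0)   # greedy deterministic policy: a = argmax_a q_*(s, a)
print(v, pi_star)
```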