Reinforcement Learning (Summary)
1. Difference from other ML paradigms
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

2. Reward
- Reward $R_t$: a scalar feedback signal
- The agent's job is to maximize cumulative reward
- Reward hypothesis: all goals can be described by the maximization of expected cumulative reward

3. Sequential Decision Making
- Goal: select actions to maximize total future reward
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (greedy $\leftrightarrow$ optimal)

4. RL Agent
- Policy: the agent's behavior function
  - A map from state to action
  - Deterministic policy: $a = \pi(s)$
  - Stochastic policy: $\pi(a|s) = P[A_t = a \mid S_t = s]$ (see the policy sketch below)
- Value function: how good each state and/or action is
  - A prediction of future reward; evaluates the goodness/badness of states (see the return sketch below)
  - State-value function: $$v_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$$
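As a concrete illustration of the two policy forms above, here is a minimal Python sketch; the grid states, action set, and probabilities are made-up examples, not from the source:

```python
import random

# Hypothetical toy setup: states are grid cells, actions are moves.
ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy a = pi(s): a fixed state -> action lookup.
POLICY_TABLE = {(0, 0): "right", (0, 1): "down", (1, 1): "down"}

def deterministic_policy(state):
    return POLICY_TABLE[state]

# Stochastic policy pi(a|s) = P[A_t = a | S_t = s]: a probability
# distribution over actions, sampled each step (here the same
# distribution for every state, for brevity).
def stochastic_policy(state):
    probs = {"right": 0.7, "up": 0.1, "down": 0.1, "left": 0.1}
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

a = stochastic_policy((0, 0))  # samples an action from pi(.|s)
```

Calling `deterministic_policy` twice from the same state always returns the same action, whereas `stochastic_policy` may return different actions on each call.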
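To make the value function and the greedy-vs-optimal point concrete, here is a minimal sketch that computes the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ and averages sampled returns as a Monte Carlo estimate of $v_{\pi}(s)$; the reward sequences and the value of $\gamma$ are illustrative assumptions:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_state_value(reward_sequences, gamma=0.9):
    """Monte Carlo estimate of v_pi(s): average the returns of
    episodes that start from the same state s and follow pi."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

# Hypothetical episodes sampled from the same start state under pi:
episodes = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(mc_state_value(episodes))  # ~ v_pi(s)

# Delayed reward in action: the greedy path [1, 0, 0] wins on the
# first step, but the patient path [0, 0, 10] has a larger return.
print(discounted_return([1, 0, 0]), discounted_return([0, 0, 10]))
```

The discount factor $\gamma$ controls this trade-off: values near 0 favor immediate reward, while values near 1 weight long-term reward more heavily.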