Nov 8, 2006
-------------

- Reinforcement learning
  - grandest topic in AI
  - can subsume all of AI itself

- Two basic features of RL
  - learn by trial-and-error interactions with the environment
  - need to balance exploration and exploitation

- General setting
  - states
  - actions
  - state transition table
  - rewards table

- A simple problem: gridworld1
  - three states, organized in a row

        S1   S2   S3

  - four possible actions: 1l, 2l, 1r, 2r (move one or two squares left or right)
  - bumping into a wall keeps the state the same, but
    - gives a reward of -5 or -10 (depending on the force of impact)
  - zero rewards for legal actions

- Policies
  - a facet of the agent
  - give a mapping from states to actions
  - have value functions
    - V^pi(s)
      - expected cumulative sum of discounted rewards, if you start from
        state "s" and apply policy "pi"
      - (written out as an equation after these notes)

- Discounting: gamma
  - between 0 and 1
  - if gamma = 1, V^pi(s) becomes a straight sum

- Working out V^pi(s) for a given agent (from its policy)
  - is this the best policy possible?
  - how would you improve it?

- Different agents have different policies

- A more complicated policy
  - agent has a non-zero probability of taking two actions in a given state

- Note: two different policies can have the same value(s)

- Making things more interesting: gridworld2
  - four states plus a hidden fifth one (S5, a mine), organized as

        S1 | S2
        S3 | S4
        -------
          S5     (the hidden mine)

  - S2 and S5 are absorbing states
    - S2 is the goal state
    - S5 is a bad state to be in
  - four actions: up, down, left, right
  - state transition table
  - rewards matrix
    - bumping is not good (-5)
    - just roaming around is not good (-1)
    - into S2 gets a reward of +50
    - into S5 gets a reward of -100
    - from S2, and from S5, the agent stays there, with reward 0
  - discounting of 0.9

- Take a simple policy: Agent 006
  - S1: down
  - S2: right
  - S3: up
  - S4: left
  - S5: down
  - find V^pi (a small policy-evaluation sketch follows these notes)
    - looks like Agent 006 is not such a good agent!

- Improving the policy
  - look at the just-computed V^pi values
  - change the policy at each state to be "greedy" w.r.t. V^pi
    - i.e., do the action that appears to be best using just one-step lookahead
  - what if many actions are "equally best"?
    - sprinkle the probabilities among them
    - any way you like!

- Basic idea: policy iteration (sketched in code after these notes)
  - fix pi
  - find V^pi
  - find a new pi
  - find V^pi (again)
  - find a new pi, and so on

- Improving agents
  - from 006
  - to 006.5
  - to James Bond! (play theme music)

- The above are simple cases of RL
  - in general we are NOT given the state transition table and rewards matrix
  - will study these in future classes

- The above topics are covered
  - NOT in the "Reinforcement Learning" chapter of AIMA
  - but in the "Making Complex Decisions" chapter
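
To pin down the value-function bullet above, here is the usual way to write V^pi(s) and its one-step (Bellman) form, in LaTeX. P and R stand for the state transition and rewards tables from the general setting; the notation is standard textbook notation, not something fixed in these notes.

    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s \right],
    \qquad 0 \le \gamma \le 1

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,
                 \bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr]

The second equation is also the one-step lookahead used later when making a policy greedy with respect to V^pi.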
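
A minimal sketch of finding V^pi for Agent 006 in gridworld2. It assumes deterministic moves and that the hidden mine S5 sits just below the bottom row (so going down from S3 or S4 lands on it); those geometry details, and all the names in the code, are my reconstruction rather than the exact tables given in class.

    # Iterative policy evaluation for gridworld2 -- a sketch, not the class handout.
    GAMMA = 0.9
    STATES = ["S1", "S2", "S3", "S4", "S5"]
    ACTIONS = ["up", "down", "left", "right"]

    # Deterministic next-state table; bumping into a wall leaves the state unchanged.
    # Assumed layout:  S1 | S2        (S5 = hidden mine, assumed reachable by
    #                  S3 | S4         moving down from S3 or S4)
    #                  -------
    #                    S5
    NEXT = {
        "S1": {"up": "S1", "down": "S3", "left": "S1", "right": "S2"},
        "S3": {"up": "S1", "down": "S5", "left": "S3", "right": "S4"},
        "S4": {"up": "S2", "down": "S5", "left": "S3", "right": "S4"},
        "S2": {a: "S2" for a in ACTIONS},   # absorbing goal state
        "S5": {a: "S5" for a in ACTIONS},   # absorbing mine
    }

    def reward(s, a, s_next):
        """Rewards as listed in the notes: +50 into S2, -100 into S5,
        -5 for bumping into a wall, -1 for roaming, 0 once absorbed."""
        if s in ("S2", "S5"):
            return 0
        if s_next == "S2":
            return 50
        if s_next == "S5":
            return -100
        if s_next == s:
            return -5        # bumped into a wall
        return -1            # ordinary move

    AGENT_006 = {"S1": "down", "S2": "right", "S3": "up", "S4": "left", "S5": "down"}

    def evaluate(policy, sweeps=200):
        """Find V^pi by repeatedly applying V(s) <- r + gamma * V(s')."""
        V = {s: 0.0 for s in STATES}
        for _ in range(sweeps):
            for s in STATES:
                s_next = NEXT[s][policy[s]]
                V[s] = reward(s, policy[s], s_next) + GAMMA * V[s_next]
        return V

    print(evaluate(AGENT_006))

Under these assumed dynamics, Agent 006 just bounces between S1 and S3 at -1 per step, so V(S1) = V(S3) ~ -1/(1 - 0.9) = -10, and V(S4) ~ -10 as well: the "not such a good agent" observation in the notes.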
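
And a sketch of the greedy improvement step plus the policy-iteration loop, reusing STATES, ACTIONS, NEXT, reward, GAMMA, AGENT_006, and evaluate from the snippet above (so it inherits the same assumptions).

    def greedy(V):
        """One-step lookahead: in each state pick the action that looks best
        given the current value estimates (ties broken arbitrarily)."""
        policy = {}
        for s in STATES:
            def lookahead(a):
                s_next = NEXT[s][a]
                return reward(s, a, s_next) + GAMMA * V[s_next]
            policy[s] = max(ACTIONS, key=lookahead)
        return policy

    def policy_iteration(policy):
        """Fix pi, find V^pi, find a new pi, ... until the policy stops changing."""
        while True:
            V = evaluate(policy)
            improved = greedy(V)
            if improved == policy:
                return policy, V
            policy = improved

    best_policy, best_V = policy_iteration(AGENT_006)
    print(best_policy)   # e.g. S1: right (straight to the +50 goal) under the assumed tables
    print(best_V)

One pass of greedy() over Agent 006's V^pi is presumably the 006 -> 006.5 step from the notes; iterating until the policy stops changing is the upgrade to James Bond.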