Nov 27, 2006
-------------

- Thus far
  - we are given P and R
  - the policy and V are not given!

- From now on
  - even P and R are not given!

- Why should we consider this possibility?
  - assume you are given just a V; how can you find the policy from it?
    - Ans: you cannot! You need P and R.
  - assume a two-player game (tic-tac-toe); where do we find the P and R
    matrices from, to calculate V?
    - Ans: you cannot!

- Moral of the story
  - to use V, you need P and R
  - to find V, you need P and R

- Solution
  - switch to Q!

- V and Q: a tale of two functions
  - V^pi(s): expected cumulative discounted reward when starting from state s
    and following pi
  - Q^pi(s,a): expected cumulative discounted reward when starting from state s,
    applying action a, and thereafter following pi

- Why does this solve the problem?
  - Q doesn't require "lookahead"

- More on the differences between V and Q
  - you can only apply V to future states => you need a model for that!
  - you can apply Q to the current state (and action) => you don't need a model
    for that

- How can we do RL without the P and R matrices?
  - Monte Carlo approach: try a lot of runs!
  - Think of a "simulator"
    - takes a state and an action as input
    - produces a new state and a reward as output => you do not have access to
      its internal workings!

- How do you model a two-player game?
  - the simulator "hides" the second player
  - "fake" the other player and factor him into the stochastics of the
    environment

- Do the same thing as usual, iteratively
  - from a pi, get a Q
  - from a Q, get a pi

- A serious problem
  - what if your original policy did not have the "good" actions in it?
  - how can improvement ever come up with the best policy?
    - Ans: it cannot!
  - this is the basic conflict in RL: "exploration vs. exploitation"

- Solution: use epsilon-soft policies only!
  - make your policy always explore a little
  - outline of an epsilon-soft algorithm
    - do the best thing 90% of the time
    - pick a random action 10% of the time! (this facilitates exploration)
  - this is what you need for the final project

- Pseudocode for learning epsilon-soft policies via Monte Carlo (a Python
  sketch follows the pseudocode)

    Initialize all Q(s,a) to some number, or zero
    Initialize Returns(s,a) to an empty list
    Initialize pi to be an arbitrary epsilon-soft policy

    L1: Generate an episode using pi

        for each (s,a) occurring in the episode
            // count only the first time that (s,a) occurs
            find the return "r" following that occurrence
            // (i.e., the sum of discounted rewards)
            append "r" to the Returns(s,a) list
            Q(s,a) = average[ Returns(s,a) ]
        end for

        // now use Q(s,a) to create a policy
        for each s in the episode          // only these would have changed
            a* = argmax_a Q(s,a)           // i.e., find the a at which Q becomes maximum
            distribute 90% probability among a*
            distribute 10% probability among everybody   // including a*
            // this could have been 80-20 or 95-5; just make sure all actions
            // have a non-zero probability of being taken
            update pi
        end for

        go back to L1
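- A minimal Python sketch of the pseudocode above, under a few assumptions: the
  black-box simulator is exposed as two hypothetical functions, reset()
  returning a start state and step(state, action) returning (next_state,
  reward, done); the action list, gamma, epsilon, episode count, and step limit
  are illustrative and not from the notes. It uses first-visit returns and the
  same "90% greedy, 10% spread over all actions" epsilon-soft update.

    import random
    from collections import defaultdict

    def mc_control_epsilon_soft(reset, step, actions, num_episodes=10000,
                                gamma=0.9, epsilon=0.1, max_steps=100):
        """First-visit Monte Carlo control with an epsilon-soft policy."""
        Q = defaultdict(float)          # Q(s,a), initialized to zero
        returns = defaultdict(list)     # Returns(s,a): list of observed returns
        policy = {}                     # state -> {action: probability}

        def choose_action(state):
            # States never seen before act uniformly at random
            # (an arbitrary epsilon-soft starting policy).
            probs = policy.get(state, {a: 1.0 / len(actions) for a in actions})
            return random.choices(list(probs), weights=list(probs.values()))[0]

        for _ in range(num_episodes):
            # L1: generate an episode using pi.
            episode = []                # list of (state, action, reward)
            state, done, steps = reset(), False, 0
            while not done and steps < max_steps:
                action = choose_action(state)
                next_state, reward, done = step(state, action)
                episode.append((state, action, reward))
                state, steps = next_state, steps + 1

            # First-visit returns: walk backwards accumulating the discounted
            # return G; because we iterate backwards, the value finally stored
            # for (s,a) is the return following its first occurrence.
            G = 0.0
            first_return = {}
            for s, a, r in reversed(episode):
                G = r + gamma * G
                first_return[(s, a)] = G
            for (s, a), g in first_return.items():
                returns[(s, a)].append(g)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

            # Use Q to create a policy: epsilon-soft greedy, visited states only.
            for s in {s for s, _, _ in episode}:
                a_star = max(actions, key=lambda a: Q[(s, a)])
                # 10% spread over everybody (including a*), 90% extra to a*.
                policy[s] = {a: epsilon / len(actions) for a in actions}
                policy[s][a_star] += 1.0 - epsilon

        return policy, Q

    if __name__ == "__main__":
        # Toy stand-in simulator: a 1-D corridor with states 0..4; reaching
        # state 4 ends the episode with reward +1, every other step gives 0.
        def reset():
            return 0

        def step(state, action):    # action is -1 (left) or +1 (right)
            nxt = min(max(state + action, 0), 4)
            done = (nxt == 4)
            return nxt, (1.0 if done else 0.0), done

        policy, Q = mc_control_epsilon_soft(reset, step, actions=[-1, +1],
                                            num_episodes=2000)
        # Print the greedy action learned for each visited state.
        print({s: max(p, key=p.get) for s, p in sorted(policy.items())})

- On the toy corridor the learned greedy action should be "go right" (+1) in
  every state; the 90/10 split in the notes corresponds to the epsilon
  parameter here (epsilon = 0.1).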