Nov 19, 2003
------------

- Thus far
  - we are given P and R
  - policy and V are not given!
- From now on
  - even P and R are not given!
- Why should we consider this possibility?
  - assume you are given just a V; how can you find the policy from it?
    - Ans: you cannot! You need P and R.
  - assume a two-player game (tic-tac-toe); where do we find the P and R matrices from, to calc. V?
    - Ans: you cannot!
- Moral of the story
  - to use V, you need P and R
  - to find V, you need P and R
- Solution
  - switch to Q!
- V and Q: a tale of two functions (see the formulas below)
  - V^pi(s): expected cumulative discounted reward when starting from state s and following pi
  - Q^pi(s,a): expected cumulative discounted reward when starting from state s, applying action a, and thereafter following pi
- Why does this solve the problem?
  - Q doesn't require "lookahead"
- More on the differences between V and Q
  - you can only apply V to future states => you need a model for that!
  - you can apply Q to the current state (and action) => you don't need a model for that
- How can we do RL without the P and R matrices?
  - think of a "simulator" (see the sketch below)
    - takes a state and an action as input
    - produces a new state and a reward as output => you do not have access to its internal workings!
- How do you model a two-player game? (see the tic-tac-toe sketch at the end)
  - the simulator "hides" the second player
  - "fake" the other player and factor him into the stochastics of the environment
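
The formulas below spell out the V-vs-Q point as a supplement to the notes above. The discount factor gamma and the one-step lookahead form of the greedy policy are standard textbook notation rather than something stated in the notes, and the reward is written as R(s,a,s') only for concreteness; the notes just say "P and R matrices".

```latex
% V and Q, and the greedy-policy extraction step for each.
\begin{align}
  V^{\pi}(s)   &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ \pi \right] \\
  Q^{\pi}(s,a) &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ a_{0} = a,\ \pi \right]
\end{align}
% Greedy policy from V: needs a one-step lookahead through P and R.
\begin{equation}
  \pi_{V}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s,a,s') + \gamma V(s') \bigr]
\end{equation}
% Greedy policy from Q: no model needed.
\begin{equation}
  \pi_{Q}(s) = \arg\max_{a} Q(s,a)
\end{equation}
```

Acting greedily with V forces a sum over next states weighted by P, which is exactly the "lookahead" the notes say Q avoids: with Q, the argmax is over actions at the current state and nothing else.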
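
A minimal Python sketch of the "simulator" idea from the notes. The class and method names (ToySimulator, reset, step) and the two-state MDP hidden inside it are illustrative choices, not from the notes; the point is the interface, where the learner only samples (new state, reward) pairs and never reads P or R.

```python
# Sketch of a black-box simulator: internally it still has transition
# probabilities P and rewards R, but the learner only sees samples.

import random

class ToySimulator:
    """Black-box environment over states {0, 1} and actions {'left', 'right'}."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # Hidden model: P[s][a] is a list of (next_state, probability),
        # R[s][a] is the expected reward.  The agent never sees these.
        self._P = {
            0: {'left': [(0, 0.9), (1, 0.1)], 'right': [(1, 0.8), (0, 0.2)]},
            1: {'left': [(0, 0.7), (1, 0.3)], 'right': [(1, 0.95), (0, 0.05)]},
        }
        self._R = {0: {'left': 0.0, 'right': 1.0},
                   1: {'left': 0.0, 'right': 5.0}}

    def reset(self):
        return 0  # always start in state 0

    def step(self, state, action):
        """Sample (next_state, reward) -- the only interface the learner gets."""
        outcomes = self._P[state][action]
        next_state = self._rng.choices([s for s, _ in outcomes],
                                       weights=[p for _, p in outcomes])[0]
        reward = self._R[state][action]
        return next_state, reward


if __name__ == '__main__':
    sim = ToySimulator(seed=0)
    s = sim.reset()
    for _ in range(3):
        s, r = sim.step(s, 'right')
        print(s, r)
```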
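
Finally, a sketch of the two-player point: a tic-tac-toe simulator in which the opponent's move happens inside step(), so the learner ('X') experiences the opponent as part of the environment's stochastic transitions. The uniformly random 'O' player and the +1/0/-1 rewards are assumptions for illustration; episode-termination bookkeeping is omitted to keep the (state, action) -> (new state, reward) contract from the notes.

```python
# Sketch of "hiding" the second player inside the simulator.

import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

class TicTacToeSimulator:
    """The learner plays 'X'; a hidden random opponent plays 'O' inside step()."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def reset(self):
        return ' ' * 9  # empty board, learner ('X') to move

    def step(self, state, action):
        """action = index 0..8 of an empty cell.  Returns (next_state, reward)."""
        board = list(state)
        board[action] = 'X'
        if winner(board) == 'X':
            return ''.join(board), +1.0          # learner wins
        empty = [i for i, c in enumerate(board) if c == ' ']
        if not empty:
            return ''.join(board), 0.0           # draw
        # The hidden opponent moves; the learner just sees a stochastic transition.
        board[self._rng.choice(empty)] = 'O'
        if winner(board) == 'O':
            return ''.join(board), -1.0          # learner loses
        return ''.join(board), 0.0               # game continues
```

From the learner's side this looks exactly like the single-agent simulator above: the opponent has been "faked" and folded into the stochastics of the environment.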