Nov 19, 2003
------------

- Thus far
  - we are given P and R
  - policy and V are not given!
- From now on
  - even P and R are not given!
- Why should we consider this possibility?
  - assume you are given just a V; how can you find the policy from it?
    - Ans: you cannot! You need P and R.
  - assume a two-player game (tic-tac-toe); where do we find the P and R matrices from, to calc. V?
    - Ans: you cannot!
- Moral of the story
  - to use V, you need P and R
  - to find V, you need P and R
- Solution
  - switch to Q!
- V and Q: a tale of two functions (see the formulas below)
  - V^pi(s): expected cumulative discounted reward when starting from state s and following pi
  - Q^pi(s,a): expected cumulative discounted reward when starting from state s, applying action a, and thereafter following pi
- Why does this solve the problem?
  - Q doesn't require "lookahead"
- More on the differences between V and Q
  - you can only apply V to future states => you need a model for that!
  - you can apply Q to the current state (and action) => you don't need a model for that
- How can we do RL without the P and R matrices?
  - think of a "simulator" (see the sketch below)
    - takes a state and an action as input
    - produces a new state and a reward as output => you do not have access to its internal workings!
- How do you model a two-player game? (see the tic-tac-toe sketch at the end)
  - the simulator "hides" the second player
  - "fake" the other player and factor him into the stochastics of the environment
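
The formulas below spell out the V-vs-Q point as a supplement to the notes above. The discount factor gamma and the one-step lookahead form of the greedy policy are standard textbook notation rather than something stated in the notes, and the reward is written as R(s,a,s') only for concreteness; the notes just say "P and R matrices".

```latex
% V and Q, and the greedy-policy extraction step for each.
\begin{align}
  V^{\pi}(s)   &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ \pi \right] \\
  Q^{\pi}(s,a) &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\ a_{0} = a,\ \pi \right]
\end{align}
% Greedy policy from V: needs a one-step lookahead through P and R.
\begin{equation}
  \pi_{V}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s,a,s') + \gamma V(s') \bigr]
\end{equation}
% Greedy policy from Q: no model needed.
\begin{equation}
  \pi_{Q}(s) = \arg\max_{a} Q(s,a)
\end{equation}
```

Acting greedily with V forces a sum over next states weighted by P, which is exactly the "lookahead" the notes say Q avoids: with Q, the argmax is over actions at the current state and nothing else.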
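
A minimal Python sketch of the "simulator" idea from the notes. The class and method names (ToySimulator, reset, step) and the two-state MDP hidden inside it are illustrative choices, not from the notes; the point is the interface, where the learner only samples (new state, reward) pairs and never reads P or R.

```python
# Sketch of a black-box simulator: internally it still has transition
# probabilities P and rewards R, but the learner only sees samples.

import random

class ToySimulator:
    """Black-box environment over states {0, 1} and actions {'left', 'right'}."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        # Hidden model: P[s][a] is a list of (next_state, probability),
        # R[s][a] is the expected reward.  The agent never sees these.
        self._P = {
            0: {'left': [(0, 0.9), (1, 0.1)], 'right': [(1, 0.8), (0, 0.2)]},
            1: {'left': [(0, 0.7), (1, 0.3)], 'right': [(1, 0.95), (0, 0.05)]},
        }
        self._R = {0: {'left': 0.0, 'right': 1.0},
                   1: {'left': 0.0, 'right': 5.0}}

    def reset(self):
        return 0  # always start in state 0

    def step(self, state, action):
        """Sample (next_state, reward) -- the only interface the learner gets."""
        outcomes = self._P[state][action]
        next_state = self._rng.choices([s for s, _ in outcomes],
                                       weights=[p for _, p in outcomes])[0]
        reward = self._R[state][action]
        return next_state, reward


if __name__ == '__main__':
    sim = ToySimulator(seed=0)
    s = sim.reset()
    for _ in range(3):
        s, r = sim.step(s, 'right')
        print(s, r)
```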
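
Finally, a sketch of the two-player point: a tic-tac-toe simulator in which the opponent's move happens inside step(), so the learner ('X') experiences the opponent as part of the environment's stochastic transitions. The uniformly random 'O' player and the +1/0/-1 rewards are assumptions for illustration; episode-termination bookkeeping is omitted to keep the (state, action) -> (new state, reward) contract from the notes.

```python
# Sketch of "hiding" the second player inside the simulator.

import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

class TicTacToeSimulator:
    """The learner plays 'X'; a hidden random opponent plays 'O' inside step()."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def reset(self):
        return ' ' * 9  # empty board, learner ('X') to move

    def step(self, state, action):
        """action = index 0..8 of an empty cell.  Returns (next_state, reward)."""
        board = list(state)
        board[action] = 'X'
        if winner(board) == 'X':
            return ''.join(board), +1.0          # learner wins
        empty = [i for i, c in enumerate(board) if c == ' ']
        if not empty:
            return ''.join(board), 0.0           # draw
        # The hidden opponent moves; the learner just sees a stochastic transition.
        board[self._rng.choice(empty)] = 'O'
        if winner(board) == 'O':
            return ''.join(board), -1.0          # learner loses
        return ''.join(board), 0.0               # game continues
```

From the learner's side this looks exactly like the single-agent simulator above: the opponent has been "faked" and folded into the stochastics of the environment.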