Nov 27, 2006
-------------

- Thus far
  - we are given P and R
  - the policy and V are not given!

- From now on
  - even P and R are not given!

- Why should we consider this possibility?
  - assume you are given just a V; how can you find the policy from it?
    - Ans: you cannot! You need P and R.
  - assume a two-player game (tic-tac-toe); where do we find the P and R
    matrices from, to calculate V?
    - Ans: you cannot!

- Moral of the story
  - to use V, you need P and R
  - to find V, you need P and R

- Solution
  - switch to Q!

- V and Q: a tale of two functions
  - V^pi(s): expected cumulative discounted reward when starting from state s
    and following pi
  - Q^pi(s,a): expected cumulative discounted reward when starting from state s,
    applying action a, and thereafter following pi

- Why does this solve the problem?
  - Q doesn't require "lookahead"

- More on the differences between V and Q
  - you can only apply V to future states => you need a model for that!
  - you can apply Q to the current state (and action) => you don't need a model
    for that

- How can we do RL without the P and R matrices?
  - Monte Carlo approach: try a lot of runs!
  - Think of a "simulator"
    - takes a state and an action as input
    - produces a new state and a reward as output => you do not have access to
      its internal workings!

- How do you model a two-player game?
  - the simulator "hides" the second player
  - "fake" the other player and factor him into the stochastics of the
    environment

- Do the same thing as usual, iteratively
  - from a pi, get a Q
  - from a Q, get a pi

- A serious problem
  - what if your original policy did not have the "good" actions in it?
  - how can improvement ever come up with the best policy?
    - Ans: it cannot!
  - this is the basic conflict in RL: "exploration vs. exploitation"

- Solution: use epsilon-soft policies only!
  - make your policy always explore a little
  - outline of an epsilon-soft algorithm
    - do the best thing 90% of the time
    - pick a random action 10% of the time! (this facilitates exploration)
  - this is what you need for the final project

- Pseudocode for learning epsilon-soft policies via Monte Carlo (a Python
  sketch follows the pseudocode)

    Initialize all Q(s,a) to some number, or zero
    Initialize Returns(s,a) to an empty list
    Initialize pi to be an arbitrary epsilon-soft policy

    L1: Generate an episode using pi

        for each (s,a) occurring in the episode
            // count only the first time that (s,a) occurs
            find the return "r" following that occurrence
            // (i.e., the sum of discounted rewards)
            append "r" to the Returns(s,a) list
            Q(s,a) = average[ Returns(s,a) ]
        end for

        // now use Q(s,a) to create a policy
        for each s in the episode          // only these would have changed
            a* = argmax_a Q(s,a)           // i.e., find the a at which Q becomes maximum
            distribute 90% probability among a*
            distribute 10% probability among everybody   // including a*
            // this could have been 80-20 or 95-5; just make sure all actions
            // have a non-zero probability of being taken
            update pi
        end for

        go back to L1
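- A minimal Python sketch of the pseudocode above, under a few assumptions: the
  black-box simulator is exposed as two hypothetical functions, reset()
  returning a start state and step(state, action) returning (next_state,
  reward, done); the action list, gamma, epsilon, episode count, and step limit
  are illustrative and not from the notes. It uses first-visit returns and the
  same "90% greedy, 10% spread over all actions" epsilon-soft update.

    import random
    from collections import defaultdict

    def mc_control_epsilon_soft(reset, step, actions, num_episodes=10000,
                                gamma=0.9, epsilon=0.1, max_steps=100):
        """First-visit Monte Carlo control with an epsilon-soft policy."""
        Q = defaultdict(float)          # Q(s,a), initialized to zero
        returns = defaultdict(list)     # Returns(s,a): list of observed returns
        policy = {}                     # state -> {action: probability}

        def choose_action(state):
            # States never seen before act uniformly at random
            # (an arbitrary epsilon-soft starting policy).
            probs = policy.get(state, {a: 1.0 / len(actions) for a in actions})
            return random.choices(list(probs), weights=list(probs.values()))[0]

        for _ in range(num_episodes):
            # L1: generate an episode using pi.
            episode = []                # list of (state, action, reward)
            state, done, steps = reset(), False, 0
            while not done and steps < max_steps:
                action = choose_action(state)
                next_state, reward, done = step(state, action)
                episode.append((state, action, reward))
                state, steps = next_state, steps + 1

            # First-visit returns: walk backwards accumulating the discounted
            # return G; because we iterate backwards, the value finally stored
            # for (s,a) is the return following its first occurrence.
            G = 0.0
            first_return = {}
            for s, a, r in reversed(episode):
                G = r + gamma * G
                first_return[(s, a)] = G
            for (s, a), g in first_return.items():
                returns[(s, a)].append(g)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

            # Use Q to create a policy: epsilon-soft greedy, visited states only.
            for s in {s for s, _, _ in episode}:
                a_star = max(actions, key=lambda a: Q[(s, a)])
                # 10% spread over everybody (including a*), 90% extra to a*.
                policy[s] = {a: epsilon / len(actions) for a in actions}
                policy[s][a_star] += 1.0 - epsilon

        return policy, Q

    if __name__ == "__main__":
        # Toy stand-in simulator: a 1-D corridor with states 0..4; reaching
        # state 4 ends the episode with reward +1, every other step gives 0.
        def reset():
            return 0

        def step(state, action):    # action is -1 (left) or +1 (right)
            nxt = min(max(state + action, 0), 4)
            done = (nxt == 4)
            return nxt, (1.0 if done else 0.0), done

        policy, Q = mc_control_epsilon_soft(reset, step, actions=[-1, +1],
                                            num_episodes=2000)
        # Print the greedy action learned for each visited state.
        print({s: max(p, key=p.get) for s, p in sorted(policy.items())})

- On the toy corridor the learned greedy action should be "go right" (+1) in
  every state; the 90/10 split in the notes corresponds to the epsilon
  parameter here (epsilon = 0.1).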