Nov 8, 2006
-------------

- Reinforcement learning
  - grandest topic in AI
  - can subsume all of AI itself

- Two basic features of RL
  - learn by trial-and-error interactions with the environment
  - need to balance exploration and exploitation

- General setting
  - states
  - actions
  - state transition table
  - rewards table

- A simple problem: gridworld1
  - three states, organized in a row

        S1   S2   S3

  - four possible actions: 1l, 2l, 1r, 2r (move one or two squares left or right)
  - bumping into a wall keeps the state the same, but
    - gives a reward of -5 or -10 (depending on the force of impact)
  - zero rewards for legal actions

- Policies
  - a facet of the agent
  - give a mapping from states to actions
  - have value functions
    - V^pi(s)
      - expected cumulative sum of discounted rewards, if you start from
        state "s" and apply policy "pi"
      - (written out as an equation after these notes)

- Discounting: gamma
  - between 0 and 1
  - if gamma = 1, V^pi(s) becomes a straight sum

- Working out V^pi(s) for a given agent (from its policy)
  - is this the best policy possible?
  - how would you improve it?

- Different agents have different policies

- A more complicated policy
  - agent has a non-zero probability of taking two actions in a given state

- Note: two different policies can have the same value(s)

- Making things more interesting: gridworld2
  - four states plus a hidden fifth one (S5, a mine), organized as

        S1 | S2
        S3 | S4
        -------
          S5     (the hidden mine)

  - S2 and S5 are absorbing states
    - S2 is the goal state
    - S5 is a bad state to be in
  - four actions: up, down, left, right
  - state transition table
  - rewards matrix
    - bumping is not good (-5)
    - just roaming around is not good (-1)
    - into S2 gets a reward of +50
    - into S5 gets a reward of -100
    - from S2, and from S5, the agent stays there, with reward 0
  - discounting of 0.9

- Take a simple policy: Agent 006
  - S1: down
  - S2: right
  - S3: up
  - S4: left
  - S5: down
  - find V^pi (a small policy-evaluation sketch follows these notes)
    - looks like Agent 006 is not such a good agent!

- Improving the policy
  - look at the just-computed V^pi values
  - change the policy at each state to be "greedy" w.r.t. V^pi
    - i.e., do the action that appears to be best using just one-step lookahead
  - what if many actions are "equally best"?
    - sprinkle the probabilities among them
    - any way you like!

- Basic idea: policy iteration (sketched in code after these notes)
  - fix pi
  - find V^pi
  - find a new pi
  - find V^pi (again)
  - find a new pi, and so on

- Improving agents
  - from 006
  - to 006.5
  - to James Bond! (play theme music)

- The above are simple cases of RL
  - in general we are NOT given the state transition table and rewards matrix
  - will study these in future classes

- The above topics are covered
  - NOT in the "Reinforcement Learning" chapter of AIMA
  - but in the "Making Complex Decisions" chapter
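
To pin down the value-function bullet above, here is the usual way to write V^pi(s) and its one-step (Bellman) form, in LaTeX. P and R stand for the state transition and rewards tables from the general setting; the notation is standard textbook notation, not something fixed in these notes.

    V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s \right],
    \qquad 0 \le \gamma \le 1

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,
                 \bigl[ R(s, a, s') + \gamma\, V^{\pi}(s') \bigr]

The second equation is also the one-step lookahead used later when making a policy greedy with respect to V^pi.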
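
A minimal sketch of finding V^pi for Agent 006 in gridworld2. It assumes deterministic moves and that the hidden mine S5 sits just below the bottom row (so going down from S3 or S4 lands on it); those geometry details, and all the names in the code, are my reconstruction rather than the exact tables given in class.

    # Iterative policy evaluation for gridworld2 -- a sketch, not the class handout.
    GAMMA = 0.9
    STATES = ["S1", "S2", "S3", "S4", "S5"]
    ACTIONS = ["up", "down", "left", "right"]

    # Deterministic next-state table; bumping into a wall leaves the state unchanged.
    # Assumed layout:  S1 | S2        (S5 = hidden mine, assumed reachable by
    #                  S3 | S4         moving down from S3 or S4)
    #                  -------
    #                    S5
    NEXT = {
        "S1": {"up": "S1", "down": "S3", "left": "S1", "right": "S2"},
        "S3": {"up": "S1", "down": "S5", "left": "S3", "right": "S4"},
        "S4": {"up": "S2", "down": "S5", "left": "S3", "right": "S4"},
        "S2": {a: "S2" for a in ACTIONS},   # absorbing goal state
        "S5": {a: "S5" for a in ACTIONS},   # absorbing mine
    }

    def reward(s, a, s_next):
        """Rewards as listed in the notes: +50 into S2, -100 into S5,
        -5 for bumping into a wall, -1 for roaming, 0 once absorbed."""
        if s in ("S2", "S5"):
            return 0
        if s_next == "S2":
            return 50
        if s_next == "S5":
            return -100
        if s_next == s:
            return -5        # bumped into a wall
        return -1            # ordinary move

    AGENT_006 = {"S1": "down", "S2": "right", "S3": "up", "S4": "left", "S5": "down"}

    def evaluate(policy, sweeps=200):
        """Find V^pi by repeatedly applying V(s) <- r + gamma * V(s')."""
        V = {s: 0.0 for s in STATES}
        for _ in range(sweeps):
            for s in STATES:
                s_next = NEXT[s][policy[s]]
                V[s] = reward(s, policy[s], s_next) + GAMMA * V[s_next]
        return V

    print(evaluate(AGENT_006))

Under these assumed dynamics, Agent 006 just bounces between S1 and S3 at -1 per step, so V(S1) = V(S3) ~ -1/(1 - 0.9) = -10, and V(S4) ~ -10 as well: the "not such a good agent" observation in the notes.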
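
And a sketch of the greedy improvement step plus the policy-iteration loop, reusing STATES, ACTIONS, NEXT, reward, GAMMA, AGENT_006, and evaluate from the snippet above (so it inherits the same assumptions).

    def greedy(V):
        """One-step lookahead: in each state pick the action that looks best
        given the current value estimates (ties broken arbitrarily)."""
        policy = {}
        for s in STATES:
            def lookahead(a):
                s_next = NEXT[s][a]
                return reward(s, a, s_next) + GAMMA * V[s_next]
            policy[s] = max(ACTIONS, key=lookahead)
        return policy

    def policy_iteration(policy):
        """Fix pi, find V^pi, find a new pi, ... until the policy stops changing."""
        while True:
            V = evaluate(policy)
            improved = greedy(V)
            if improved == policy:
                return policy, V
            policy = improved

    best_policy, best_V = policy_iteration(AGENT_006)
    print(best_policy)   # e.g. S1: right (straight to the +50 goal) under the assumed tables
    print(best_V)

One pass of greedy() over Agent 006's V^pi is presumably the 006 -> 006.5 step from the notes; iterating until the policy stops changing is the upgrade to James Bond.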