Nov 29, 2006
-------------
- Continuing reinforcement learning - the temporal-difference (TD) algorithm
- In TD we can learn even before the episode is over!
  - use the reward and Q at one time step to "fix" the estimate of Q made at the previous time step
  - suitable for "lifelong" learning
- Practical examples of TD
  - estimating the driving time to reach home from the office
  - estimating the time until graduation
- Pseudocode for TD (a sketch in Python follows below)

  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
      Initialize s
      Choose a from s using an e-soft policy derived from Q
      Repeat (for each step of the episode):
          Take action a in state s
          Observe the new state s' and the reward r
          Choose action a' from s' using an e-soft policy derived from Q
          Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]
          // BUT don't do this update if a' was chosen out of curiosity
          // (exploration) and not because Q said so
          s = s'
          a = a'
      until the episode is over
  until done with all episodes

  // The pseudocode above doesn't take the two-player setting into account
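A minimal Python sketch of the TD control loop above (the update with r + gamma*Q(s',a') is the form commonly known as SARSA). The environment interface (reset()/step() returning next state, reward, done) and the hyperparameter values are assumptions for illustration, not from the lecture; the e-soft policy is implemented as simple epsilon-greedy. Note this sketch applies the update on every step and does not implement the "skip the update on curiosity-driven moves" caveat from the notes.

    import random
    from collections import defaultdict

    def epsilon_soft_action(Q, state, actions, epsilon):
        """Pick a random action with probability epsilon, else a greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def td_control(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        # env.reset() / env.step(a) -> (s', r, done) is an assumed interface
        Q = defaultdict(float)                    # Q(s,a), initialized arbitrarily (here: 0)
        for _ in range(episodes):
            s = env.reset()                       # Initialize s
            a = epsilon_soft_action(Q, s, actions, epsilon)
            done = False
            while not done:                       # Repeat for each step of the episode
                s_next, r, done = env.step(a)     # take action a, observe s' and r
                a_next = epsilon_soft_action(Q, s_next, actions, epsilon)
                # TD update: move Q(s,a) toward the one-step target r + gamma*Q(s',a')
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next             # s = s', a = a'
        return Q

Because the update uses the action a' that the policy will actually take next, learning happens step by step inside the episode, which is exactly the "learn before the episode is over" point from the lecture.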