Nov 29, 2006
-------------
- Continuing reinforcement learning - the temporal-difference (TD) algorithm
- In TD we can learn even before the episode is over!
  - use the reward and Q at one time step to "fix" the estimate of Q made at the previous time step
  - suitable for "lifelong" learning
- Practical examples of TD
  - estimating the driving time to reach home from the office
  - estimating the time until graduation
- Pseudocode for TD (a sketch in Python follows below)

  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
      Initialize s
      Choose a from s using an e-soft policy derived from Q
      Repeat (for each step of the episode):
          Take action a in state s
          Observe the new state s' and the reward r
          Choose action a' from s' using an e-soft policy derived from Q
          Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]
          // BUT don't do this update if a' was chosen out of curiosity
          // (exploration) and not because Q said so
          s = s'
          a = a'
      until the episode is over
  until done with all episodes

  // The pseudocode above doesn't take the two-player setting into account
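A minimal Python sketch of the TD control loop above (the update with r + gamma*Q(s',a') is the form commonly known as SARSA). The environment interface (reset()/step() returning next state, reward, done) and the hyperparameter values are assumptions for illustration, not from the lecture; the e-soft policy is implemented as simple epsilon-greedy. Note this sketch applies the update on every step and does not implement the "skip the update on curiosity-driven moves" caveat from the notes.

    import random
    from collections import defaultdict

    def epsilon_soft_action(Q, state, actions, epsilon):
        """Pick a random action with probability epsilon, else a greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def td_control(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        # env.reset() / env.step(a) -> (s', r, done) is an assumed interface
        Q = defaultdict(float)                    # Q(s,a), initialized arbitrarily (here: 0)
        for _ in range(episodes):
            s = env.reset()                       # Initialize s
            a = epsilon_soft_action(Q, s, actions, epsilon)
            done = False
            while not done:                       # Repeat for each step of the episode
                s_next, r, done = env.step(a)     # take action a, observe s' and r
                a_next = epsilon_soft_action(Q, s_next, actions, epsilon)
                # TD update: move Q(s,a) toward the one-step target r + gamma*Q(s',a')
                target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next             # s = s', a = a'
        return Q

Because the update uses the action a' that the policy will actually take next, learning happens step by step inside the episode, which is exactly the "learn before the episode is over" point from the lecture.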