Near-optimal Regret Bounds for Reinforcement Learning
Peter Auer, Thomas Jaksch, and Ronald Ortner (ID: T2, MU-Leoben)

New analysis of "optimism in the face of uncertainty".

Regret = sum of missed rewards (also during training!) compared to an optimal policy.

New complexity parameter for MDPs: the diameter D. How long does it take to travel from one state to another state?

Best known bounds for undiscounted RL with finite state/action space S × A:

Regret after T steps is at most $\mathrm{const} \cdot D S \sqrt{A T \log(T)}$.

PAC-like bound: the average per-step regret is at most $\varepsilon$ after $\mathrm{const} \cdot \frac{D^2 S^2 A}{\varepsilon^2} \log\!\left(\frac{D S A}{\varepsilon}\right)$ steps.
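As a concrete illustration of the diameter (not part of the original slide), the following Python sketch computes D for a small communicating MDP: for each target state it solves the "reach that state as fast as possible" problem by value iteration on expected hitting times, then takes the maximum over start/target pairs. The transition array P and the toy 3-state chain are hypothetical examples.

import numpy as np

def diameter(P, tol=1e-10, max_iter=100_000):
    # P[s, a, s2] = probability of moving to s2 when action a is taken in s.
    # D = max over ordered pairs (s, g), s != g, of the minimal expected
    # number of steps to reach g from s under the best stationary policy.
    S, A, _ = P.shape
    D = 0.0
    for g in range(S):                      # target state g
        V = np.zeros(S)                     # V[s] ~ minimal expected hitting time of g
        for _ in range(max_iter):
            Q = 1.0 + P @ V                 # one step plus best continuation, shape (S, A)
            V_new = Q.min(axis=1)
            V_new[g] = 0.0                  # already at the target: zero remaining steps
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        D = max(D, V.max())                 # worst start state for this target
    return D

# Hypothetical 3-state chain: action 0 moves left deterministically,
# action 1 moves right with probability 0.6 (otherwise stays put).
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, 2)] += 0.6
    P[s, 1, s] += 0.4

print(diameter(P))   # about 3.33: two "right" moves at 1/0.6 expected steps each

For a communicating MDP this value iteration converges to the minimal expected hitting times, so the returned number matches the notion of the diameter D referred to above.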