Near-optimal Regret Bounds for Reinforcement Learning
Peter Auer, Thomas Jaksch, and Ronald Ortner (ID: T2, MU-Leoben)

New analysis of "optimism in the face of uncertainty".

Regret = sum of missed rewards (also during training!) compared to an optimal policy.

New complexity parameter for MDPs: the diameter D. How long does it take to travel from one state to another state?

Best known bounds for undiscounted RL with finite state/action space S × A:

Regret after T steps is at most $\mathrm{const} \cdot D S \sqrt{A T \log(T)}$.

PAC-like bound: the average per-step regret is at most $\varepsilon$ after $\mathrm{const} \cdot \frac{D^2 S^2 A}{\varepsilon^2} \log\!\left(\frac{D S A}{\varepsilon}\right)$ steps.
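As a concrete illustration of the diameter (not part of the original slide), the following Python sketch computes D for a small communicating MDP: for each target state it solves the "reach that state as fast as possible" problem by value iteration on expected hitting times, then takes the maximum over start/target pairs. The transition array P and the toy 3-state chain are hypothetical examples.

import numpy as np

def diameter(P, tol=1e-10, max_iter=100_000):
    # P[s, a, s2] = probability of moving to s2 when action a is taken in s.
    # D = max over ordered pairs (s, g), s != g, of the minimal expected
    # number of steps to reach g from s under the best stationary policy.
    S, A, _ = P.shape
    D = 0.0
    for g in range(S):                      # target state g
        V = np.zeros(S)                     # V[s] ~ minimal expected hitting time of g
        for _ in range(max_iter):
            Q = 1.0 + P @ V                 # one step plus best continuation, shape (S, A)
            V_new = Q.min(axis=1)
            V_new[g] = 0.0                  # already at the target: zero remaining steps
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        D = max(D, V.max())                 # worst start state for this target
    return D

# Hypothetical 3-state chain: action 0 moves left deterministically,
# action 1 moves right with probability 0.6 (otherwise stays put).
P = np.zeros((3, 2, 3))
for s in range(3):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, 2)] += 0.6
    P[s, 1, s] += 0.4

print(diameter(P))   # about 3.33: two "right" moves at 1/0.6 expected steps each

For a communicating MDP this value iteration converges to the minimal expected hitting times, so the returned number matches the notion of the diameter D referred to above.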