Stochastic convex optimization for provably efficient apprenticeship learning

A. Kamoutsi, G. Banjac and J. Lygeros

in NeurIPS 2019 Optimization Foundations for Reinforcement Learning Workshop, Vancouver, Canada, December 2019.

@inproceedings{KBL:2019,
  author = {A. Kamoutsi and G. Banjac and J. Lygeros},
  title = {Stochastic convex optimization for provably efficient apprenticeship learning},
  booktitle = {NeurIPS 2019 Optimization Foundations for Reinforcement Learning Workshop},
  year = {2019}
}

We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations. We adopt the apprenticeship learning formalism, which assumes that the true cost function can be represented as a linear combination of some known features. Existing inverse reinforcement learning algorithms come with strong theoretical guarantees, but are computationally expensive because they use reinforcement learning or planning algorithms as a subroutine. On the other hand, state-of-the-art policy-gradient-based algorithms (such as IM-REINFORCE, IM-TRPO, and GAIL) achieve significant empirical success in challenging benchmark tasks, but are less well understood in terms of theory. With an emphasis on nonasymptotic guarantees of performance, we propose a method that directly learns a policy from expert demonstrations, bypassing the intermediate step of learning the cost function, by formulating the problem as a single convex optimization problem over occupancy measures. We develop a computationally efficient algorithm and derive high-confidence excess-loss bounds on the quality of the extracted policy, utilizing results from uncertain convex optimization and recent work on approximate linear programming for solving forward MDPs.
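
To make the occupancy-measure formulation concrete, here is a rough sketch in standard notation; the symbols below (the polytope M, features Φ, and empirical expert occupancy measure μ̂_E) are our illustration, following common apprenticeship-learning formulations in the style of Syed and Schapire, and the paper's exact program may differ. For a discounted MDP with states S, actions A, transitions P, discount γ, and initial distribution ν, the occupancy measure of a policy π is

\[
\mu_\pi(s,a) \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \, \Pr(s_t = s,\, a_t = a),
\]

and the set of valid occupancy measures is the polytope defined by the Bellman flow constraints,

\[
\mathcal{M} \;=\; \Big\{ \mu \ge 0 \;:\; \sum_{a} \mu(s,a) \;=\; (1-\gamma)\,\nu(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\,\mu(s',a') \;\;\; \forall s \in S \Big\},
\]

where each \(\mu \in \mathcal{M}\) induces a stationary policy \(\pi_\mu(a \mid s) = \mu(s,a) / \sum_{a'} \mu(s,a')\). Under the linear-cost assumption \(c_\theta = \theta^\top \phi\) with bounded \(\theta\), learning a policy that matches or outperforms the expert against every admissible cost can be written as the single convex problem

\[
\min_{\mu \in \mathcal{M}} \; \max_{\|\theta\| \le 1} \; \theta^\top \big( \Phi \mu - \Phi \hat{\mu}_E \big),
\]

where the inner maximum evaluates to the dual norm \(\|\Phi\mu - \Phi\hat{\mu}_E\|_*\), so no explicit cost function is ever recovered; this matches the abstract's description of bypassing cost learning, with large state spaces handled by treating the constraints defining \(\mathcal{M}\) only approximately, which is where the approximate-linear-programming machinery enters.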