**G. Banjac and J. Lygeros**


```bibtex
@inproceedings{BL:2019,
  author    = {G. Banjac and J. Lygeros},
  title     = {A data-driven policy iteration scheme based on linear programming},
  booktitle = {IEEE Conference on Decision and Control (CDC)},
  year      = {2019}
}
```

We consider the problem of learning discounted-cost optimal control policies for unknown deterministic discrete-time systems with continuous state and action spaces. We show that a policy evaluation step of the well-known policy iteration (PI) algorithm can be characterized as a solution to an infinite-dimensional linear program (LP). However, when approximating such an LP with a finite-dimensional program, the PI algorithm loses its nominal properties. We propose a data-driven PI scheme that ensures a certain monotonic behavior and allows for incorporation of expert knowledge on the system. A numerical example illustrates the effectiveness of the proposed algorithm.
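The evaluate/improve loop underlying PI can be illustrated on a toy problem. The sketch below runs classic policy iteration on a small *discrete* deterministic MDP with discounted cost; the paper's setting (continuous spaces, LP-based evaluation, data-driven approximation) is far more general, and the toy dynamics, costs, and function names here are made up for illustration only.

```python
import numpy as np

# Toy deterministic MDP: 4 states, 2 actions, discounted cost.
n_states, n_actions, gamma = 4, 2, 0.9
# next_state[s, a]: deterministic successor; cost[s, a]: stage cost.
next_state = np.array([[1, 2], [2, 3], [3, 0], [3, 3]])
cost = np.array([[1.0, 2.0], [1.0, 0.5], [2.0, 1.0], [0.0, 0.0]])

def evaluate(policy):
    # Policy evaluation: solve V = c_pi + gamma * P_pi V as a linear system.
    c = cost[np.arange(n_states), policy]
    P = np.zeros((n_states, n_states))
    P[np.arange(n_states), next_state[np.arange(n_states), policy]] = 1.0
    return np.linalg.solve(np.eye(n_states) - gamma * P, c)

def improve(V):
    # Policy improvement: greedy (cost-minimizing) policy w.r.t. V.
    Q = cost + gamma * V[next_state]
    return Q.argmin(axis=1)

policy = np.zeros(n_states, dtype=int)
for _ in range(20):
    V = evaluate(policy)
    new_policy = improve(V)
    if np.array_equal(new_policy, policy):
        break  # PI has converged to an optimal policy
    policy = new_policy
```

In the exact discrete setting, each improvement step is monotone (the new policy is no worse than the old one); the point of the paper is that this monotonicity is lost under finite-dimensional LP approximation and must be restored by the proposed scheme.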

**A. Kamoutsi, G. Banjac and J. Lygeros**


```bibtex
@inproceedings{KBL:2019,
  author    = {A. Kamoutsi and G. Banjac and J. Lygeros},
  title     = {Stochastic convex optimization for provably efficient apprenticeship learning},
  booktitle = {NeurIPS 2019 Optimization Foundations for Reinforcement Learning Workshop},
  year      = {2019}
}
```

We consider large-scale Markov decision processes (MDPs) with an unknown cost function and employ stochastic convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations. We adopt the apprenticeship learning formalism, which carries the assumption that the true cost function can be represented as a linear combination of some known features. Existing inverse reinforcement learning algorithms come with strong theoretical guarantees, but are computationally expensive because they use reinforcement learning or planning algorithms as a subroutine. On the other hand, state-of-the-art policy-gradient-based algorithms (like IM-REINFORCE, IM-TRPO, and GAIL) achieve significant empirical success in challenging benchmark tasks, but are less well understood in terms of theory. With an emphasis on nonasymptotic guarantees of performance, we propose a method that directly learns a policy from expert demonstrations, bypassing the intermediate step of learning the cost function, by formulating the problem as a single convex optimization problem over occupancy measures. We develop a computationally efficient algorithm and derive high-confidence excess-loss bounds on the quality of the extracted policy, utilizing results from uncertain convex optimization and recent works in approximate linear programming for solving forward MDPs.
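The occupancy-measure viewpoint referenced above can be made concrete on the *forward* problem: with a known cost, a discounted MDP is solvable as a linear program over occupancy measures mu(s, a), minimizing expected cost subject to the Bellman flow constraints. The sketch below solves such an LP for a tiny randomly generated MDP; the MDP, dimensions, and variable names are illustrative and not from the paper, whose contribution is the imitation-learning formulation built on top of this kind of program.

```python
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']: transition kernel
c = rng.uniform(0.0, 1.0, size=(nS, nA))       # stage cost (known, in this forward sketch)
nu0 = np.full(nS, 1.0 / nS)                    # initial state distribution

# Bellman flow constraints: for every state s',
#   sum_a mu(s', a) - gamma * sum_{s, a} P[s, a, s'] mu(s, a) = (1 - gamma) * nu0(s')
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = float(s == sp) - gamma * P[s, a, sp]
b_eq = (1.0 - gamma) * nu0

# Minimize expected discounted cost <c, mu> over nonnegative occupancy measures.
res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
mu = res.x.reshape(nS, nA)
# An optimal stationary policy is recovered as pi(a|s) proportional to mu(s, a).
pi = mu / mu.sum(axis=1, keepdims=True)
```

Summing the flow constraints over s' shows that any feasible mu sums to one, so mu is a genuine (normalized) discounted state-action distribution.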

**M. Schubiger, G. Banjac and J. Lygeros**


```bibtex
@article{SBL:2019,
  author = {M. Schubiger and G. Banjac and J. Lygeros},
  title  = {{GPU} acceleration of {ADMM} for large-scale quadratic programming},
  year   = {2019}
}
```

The alternating direction method of multipliers (ADMM) is a powerful operator splitting technique for solving structured convex optimization problems. Due to its relatively low per-iteration computational cost and ability to exploit sparsity in the problem data, it is particularly suitable for large-scale optimization. However, the method may still take prohibitively long to compute solutions to very large problem instances. Although ADMM is known to be parallelizable, this feature is rarely exploited in real implementations. In this paper we exploit the parallel computing architecture of a graphics processing unit (GPU) to accelerate ADMM. We build our solver on top of OSQP, a state-of-the-art implementation of ADMM for quadratic programming. Our open-source CUDA C implementation has been tested on many large-scale problems and was shown to be up to two orders of magnitude faster than the CPU implementation.
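The per-iteration structure that makes ADMM attractive here can be seen in a minimal sketch. The toy below applies ADMM to a box-constrained QP, minimize 0.5 x'Px + q'x subject to l <= x <= u, via the splitting x = z with z confined to the box: each iteration is one linear solve plus a cheap elementwise projection. This is only an illustration of that structure, not the OSQP algorithm itself, which handles general constraints l <= Ax <= u, scales and adapts its step size, and (in the GPU solver described above) accelerates the dominant linear-system step on the GPU; all names below are made up.

```python
import numpy as np

def admm_box_qp(P, q, l, u, rho=1.0, iters=500):
    """Toy ADMM for: minimize 0.5 x'Px + q'x  s.t.  l <= x <= u."""
    n = q.size
    x = np.zeros(n); z = np.zeros(n); y = np.zeros(n)  # y: scaled dual variable
    # Factor (P + rho I) once; each iteration then costs two triangular solves.
    L = np.linalg.cholesky(P + rho * np.eye(n))
    for _ in range(iters):
        rhs = rho * (z - y) - q
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # (P + rho I) x = rhs
        z = np.clip(x + y, l, u)                           # projection onto the box
        y = y + x - z                                      # dual ascent step
    return z

# Tiny example: the unconstrained minimizer [10, -2] is pushed onto the box.
P = np.diag([1.0, 2.0])
q = np.array([-10.0, 4.0])
sol = admm_box_qp(P, q, l=np.array([0.0, 0.0]), u=np.array([3.0, 3.0]))
```

Because the factorization is reused across iterations, the per-iteration work reduces to triangular solves, a matrix-vector product, and elementwise operations, which is exactly the kind of workload that maps well onto a GPU.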