Announcing Reinforcement Learning Coach v1.0.0 – Batch RL, More Algos, New APIs

Since we released RL Coach, our open source framework for training and evaluating reinforcement learning agents, in 2017, we have been working hard to add algorithms, simulation environments and features that make it useful for the machine learning research and engineering communities. Features such as benchmarks, native support for hierarchical RL, and horizontal scaling helped us demonstrate a strong and extensible foundation for agent development and training. During 2018 we also integrated Coach with Amazon SageMaker, where AWS and its customers use it to train DeepRacer and solve other challenges. We’re very happy to see the growing usage of Coach by both researchers [1], [2], [3] and engineers [4], [5] to design new algorithms or build RL-based solutions, and would love to hear how Coach helped you in your project.

In the past few months we have taken additional steps to bring RL to use cases beyond research and to grow the community of Coach users. The latest additions to Coach go beyond simulation-based learning environments, incorporate newer and stronger RL algorithms, and maintain and extend the APIs to improve usability. Today, we are very excited to announce the 1.0.0 release of RL Coach. The new release features implementations of several new algorithms (for a total of 27), support for Batch Reinforcement Learning, improved documentation, bug fixes and new APIs that enable the use of Coach as a Python library. With the 1.0.0 release we believe that the main software structure of Coach has matured and stabilized, and no major API changes are on the horizon.

Batch Reinforcement Learning

Many real-world problems lack a simulator that accurately models the environment the agent would interact with in a standard reinforcement learning setting. Often, all a data scientist has is data that was collected using a deployed policy, and this existing data must be used to learn a better policy for solving the problem. One such example is improving a drug dose management or drug admission scheduling policy for patients. In these situations, we have data based on the policy that was used with previous patients, but we cannot conduct additional experiments on the same patients to collect new data. This is where batch reinforcement learning comes in: it allows an agent to learn from a fixed dataset, and the same dataset is also used for off-policy evaluation of how good the learned policy is.
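To make the batch setting concrete, here is a minimal sketch of learning from a fixed set of logged transitions. It uses plain tabular Q-learning rather than anything from Coach, and the function names, the toy dataset and the hyperparameters are all assumptions made up for illustration. The point is only that the same batch is replayed repeatedly and no new data is ever collected.

```python
from collections import defaultdict


def batch_q_learning(transitions, n_actions, gamma=0.9, sweeps=200, lr=0.5):
    """Fit a tabular Q-function from a fixed list of logged transitions.

    Each transition is (state, action, reward, next_state, done).
    The agent never interacts with an environment: it only replays
    the same batch on every sweep.
    """
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(sweeps):
        for s, a, r, s2, done in transitions:
            # Bootstrap from the next state unless the episode ended.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += lr * (target - q[s][a])
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every state seen in the batch."""
    return {s: max(range(len(v)), key=v.__getitem__) for s, v in q.items()}


# Hypothetical logged data from some deployed policy (illustrative only).
logged = [
    (0, 0, 1.0, None, True),   # state 0, action 0: small reward, episode ends
    (0, 1, 0.0, 1, False),     # state 0, action 1: no reward, move to state 1
    (1, 0, 0.0, None, True),   # state 1, action 0: no reward, episode ends
    (1, 1, 5.0, None, True),   # state 1, action 1: large reward, episode ends
]

q = batch_q_learning(logged, n_actions=2)
policy = greedy_policy(q)
print(policy)
```

In this toy batch the learned policy prefers the delayed, larger reward. A real batch RL problem additionally has to cope with state-action pairs that the logged data never covers, which this sketch does not address.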

With the 1.0.0 release, we have added support for batch reinforcement learning in Coach, along with off-policy evaluation (OPE) of the learned policy based on data that was acquired using another policy. We have added several off-policy evaluators for contextual bandits (Direct Method, Inverse Propensity Scoring and Doubly Robust) and for reinforcement learning (Sequential Doubly Robust and Weighted Importance Sampling), and these can be used with a wide range of the off-policy RL algorithms already integrated in Coach. We also added support for a variant of the Batch-Constrained Q-learning algorithm for discrete action space problems. We encourage you to try it out with our deep dive tutorial on Batch Reinforcement Learning.
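As a rough illustration of what the three contextual bandit evaluators compute, here is a self-contained sketch using their textbook definitions. This is not Coach's implementation or API: the function names, the log format (context, action, reward, behavior probability), and the toy data are assumptions chosen for the example.

```python
def direct_method(logs, target_policy, reward_model):
    """Average the reward model's prediction under the target policy."""
    total = 0.0
    for x, _a, _r, _p_b in logs:
        probs = target_policy(x)
        total += sum(p * reward_model(x, act) for act, p in enumerate(probs))
    return total / len(logs)


def inverse_propensity_scoring(logs, target_policy):
    """Reweight logged rewards by target / behavior action probability."""
    total = 0.0
    for x, a, r, p_b in logs:
        total += target_policy(x)[a] / p_b * r
    return total / len(logs)


def doubly_robust(logs, target_policy, reward_model):
    """Direct Method baseline plus an importance-weighted correction
    for the reward model's error on the logged action."""
    total = 0.0
    for x, a, r, p_b in logs:
        probs = target_policy(x)
        baseline = sum(p * reward_model(x, act) for act, p in enumerate(probs))
        correction = probs[a] / p_b * (r - reward_model(x, a))
        total += baseline + correction
    return total / len(logs)


# Hypothetical logs: one context, two actions, uniform behavior policy.
# Action 0 always pays 0.0 and action 1 always pays 1.0.
logs = [(0, 0, 0.0, 0.5), (0, 1, 1.0, 0.5), (0, 0, 0.0, 0.5), (0, 1, 1.0, 0.5)]


def always_action_1(x):
    """Target policy to evaluate; its true value here is 1.0."""
    return [0.0, 1.0]


def biased_model(x, a):
    """Deliberately poor reward model that predicts 0.5 for everything."""
    return 0.5


dm = direct_method(logs, always_action_1, biased_model)
ips = inverse_propensity_scoring(logs, always_action_1)
dr = doubly_robust(logs, always_action_1, biased_model)
print(dm, ips, dr)
```

With these toy logs, the Direct Method inherits the bias of the poor reward model, while Inverse Propensity Scoring and Doubly Robust both recover the target policy's true value. On real data the trade-off is usually bias (Direct Method) against variance (Inverse Propensity Scoring), with Doubly Robust aiming to sit between the two.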

New Algorithms

Since our last blog post, we have added support for several new reinforcement learning agents: Sample Efficient Actor-Critic with Experience Replay (ACER), Soft Actor-Critic (SAC) and Twin-Delayed Deep Deterministic Policy Gradient (TD3). As always, when we add new RL algorithms to Coach, we aim to fully reproduce paper results, as shown in Coach Benchmarks. This is also the case with the newly added algorithms.

We’d be happy to get feedback on additional features that may be useful and on your experience using Coach. You can contact us on our GitHub repo. We’d also appreciate any contributions that can be useful for other members of the machine learning community.

Ready to check out the new Coach release? You can get started by cloning the repository and running through our Getting Started tutorial. For the latest advancements from the Intel AI research team, follow us on Twitter: @IntelAIResearch.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. © Intel Corporation.