Delivering on the Promise of Real-World Reinforcement Learning

Reinforcement learning (RL)—the branch of artificial intelligence (AI) that deals with sequential decision making to maximize rewards in an environment—is an exciting area of AI. It has already made great progress in complex game settings, where RL-trained models have defeated world champions at strategic games such as Go and StarCraft. It has also delivered early wins in research and use cases in financial services, industrial robotics, drug design, and healthcare. For example, research (https://arxiv.org/pdf/1704.07555.pdf) has shown an RL-based method tuning a sequence-based generative model for de novo molecular design. RL has also been applied to the vehicle routing problem (https://arxiv.org/pdf/1802.04240.pdf) and to learning policies that aid sepsis treatment (https://arxiv.org/pdf/1711.09602.pdf).

The excitement about RL is fully justified. RL has tremendous value in deep learning scenarios where historical labeled training data is insufficient or impractical for supervised training, as well as when the AI solution must operate in a dynamic environment.

RL is poised to empower a new generation of advanced AI solutions that can operate effectively in complex, fast-changing scenarios. We offer a perspective on what it will take to fulfill RL’s promise and what we at Intel AI Lab are doing to accelerate progress.

Challenge 1: The Algorithm Gap

RL is an emerging field, and in the past few years the research community has developed a diverse set of effective learning algorithms. Despite this progress, several challenges remain under active investigation. For example, RL systems often require many trials to learn, are sensitive to hyperparameters, and struggle to balance exploiting what has already been learned with exploring to discover more robust solutions.
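To make the exploration-exploitation tension concrete, here is a minimal epsilon-greedy bandit sketch. It is purely illustrative and not taken from CERL or RL Coach; the number of arms, the reward means, and epsilon are all assumed values.

```python
import numpy as np

# Minimal epsilon-greedy bandit: illustrates the exploit/explore trade-off.
# All numbers (3 arms, epsilon, step count) are illustrative assumptions.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # unknown to the agent
q_estimates = np.zeros(3)                # running value estimates
counts = np.zeros(3)
epsilon = 0.1                            # fraction of steps spent exploring

for step in range(1000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))             # explore: try a random arm
    else:
        action = int(np.argmax(q_estimates))      # exploit: best arm so far
    reward = rng.normal(true_means[action], 0.1)  # noisy reward signal
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]

print("estimated values:", q_estimates)
```

Too little exploration and the agent can lock onto a mediocre arm; too much and it wastes trials on arms it already knows are poor. Scaling this trade-off to high-dimensional policies is one of the open problems described above.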

To tackle these challenges, researchers from Intel AI Lab developed Collaborative Evolutionary Reinforcement Learning (CERL) (https://www.intel.ai/cerl/) (https://arxiv.org/pdf/1905.00976.pdf) (code: https://github.com/IntelAI/cerl), an approach that combines policy gradient and evolutionary methods. It is a scalable framework comprising a portfolio of policies that simultaneously explore and exploit diverse regions of the solution space. A collection of learners – typically proven algorithms like TD3 – optimize over varying time horizons, leading to this diverse portfolio. As a form of online algorithm selection, computational resources are dynamically distributed to favor the best learners. All learners contribute to and draw from a shared replay buffer to achieve greater sample efficiency. Blended with complementary evolutionary strategies, this process generates a single emergent learner that exceeds the capabilities of any individual learner and, potentially, becomes even more powerful in speeding up training and enabling new use cases.
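The overall control flow can be sketched in a few dozen lines. The following is a highly simplified, toy version of the loop described above – a portfolio of learners over different horizons, a shared replay buffer, compute favoring the best learners, and an evolutionary population seeded with their policies. The Learner class, the one-dimensional "policy", and every hyperparameter are illustrative stand-ins, not code from the IntelAI/cerl repository.

```python
import random
from collections import deque

class Learner:
    """Stub gradient-based learner (stands in for, e.g., a TD3 learner)."""
    def __init__(self, horizon):
        self.horizon = horizon        # each learner optimizes a different time horizon
        self.policy = random.random() # toy one-dimensional "policy parameter"
        self.fitness = 0.0

    def train(self, replay_buffer, steps):
        # Toy update: nudge the policy using randomly sampled shared experience.
        for _ in range(steps):
            if replay_buffer:
                _ = random.choice(replay_buffer)
            self.policy += 0.01 * (0.5 - self.policy)

def evaluate(policy, replay_buffer):
    """Toy fitness: distance to an (unknown) optimum of 0.5; rollouts feed the buffer."""
    reward = -abs(policy - 0.5)
    replay_buffer.append((policy, reward))
    return reward

replay_buffer = deque(maxlen=10_000)               # shared across all learners
learners = [Learner(h) for h in (10, 100, 1000)]   # portfolio over varying horizons
population = [random.random() for _ in range(5)]   # evolutionary population of policies

for generation in range(50):
    # 1. Evaluate learners and give the best ones more training compute.
    for l in learners:
        l.fitness = evaluate(l.policy, replay_buffer)
    ranked = sorted(learners, key=lambda l: l.fitness, reverse=True)
    for rank, l in enumerate(ranked):
        l.train(replay_buffer, steps=100 // (rank + 1))

    # 2. Evolutionary step: keep the elite, inject the best learner's policy, mutate.
    population.sort(key=lambda p: evaluate(p, replay_buffer), reverse=True)
    elite = population[0]
    population = [elite, ranked[0].policy] + [elite + random.gauss(0, 0.05) for _ in range(3)]

best = max(population, key=lambda p: -abs(p - 0.5))
print("best policy parameter:", best)
```

The key design idea is that the evolutionary population and the gradient-based learners are not competing: the learners inject strong policies into the population, and every rollout (from either side) enriches the shared replay buffer.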

RL systems are also highly complex. Published algorithms can be hard to reproduce, the code accompanying a publication may depend on a large number of variables, and it may not work as robustly when tested on other use cases. Developing robust, easy-to-reproduce RL algorithms is an important step toward wider adoption, and it is an issue that comes up repeatedly as we talk with data scientists and others who want to use RL for industrial use cases.

Challenge 2: Libraries, Tools, and Frameworks

Libraries, tools, and frameworks for RL are still maturing, and many are proprietary solutions that limit future flexibility. Intel introduced Reinforcement Learning Coach (RL Coach) in 2017 as a comprehensive, open source library and framework for developing, training, and evaluating RL agents. A recent survey of RL frameworks by Winder Research called RL Coach “the most comprehensive framework with the best documentation and a fantastic level of modularity.” Winder also praised RL Coach for having a “colossal” number of implemented algorithms and integrated RL environments, as well as the hooks it provides for deployment with Kubernetes.
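As a rough illustration of what working with RL Coach looks like, here is a minimal training script modeled on the Coach quickstart examples. Module paths and parameter classes can vary between Coach releases, so treat this as a sketch rather than a definitive API reference.

```python
# Minimal RL Coach training sketch, modeled on the Coach quickstart examples.
# Exact module paths and parameter classes may differ across Coach versions.
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule

# Wire a PPO agent to an OpenAI Gym environment with a default training schedule.
graph_manager = BasicRLGraphManager(
    agent_params=ClippedPPOAgentParameters(),
    env_params=GymVectorEnvironment(level='CartPole-v0'),
    schedule_params=SimpleSchedule()
)

# Run the train/evaluate loop defined by the schedule.
graph_manager.improve()
```

The same experiment can also be launched from the command line using one of the bundled presets, for example `coach -p CartPole_ClippedPPO`.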

RL is often paired with a simulation of the environment, enabling the use of real-time feedback to tune the learning system. But many use cases lack a rich base of simulation software. With methods such as Batch RL, reinforcement learning can also allow faster, more cost-effective training of models based on historical data. We recently implemented Batch RL support in Coach 1.0.0 to help researchers and developers apply RL when their use case does not allow for a simulator.
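The core idea behind Batch RL can be illustrated independently of any particular framework: learn a value function purely from a fixed dataset of logged transitions, with no further interaction with the environment. The sketch below is a generic fitted Q-iteration on synthetic logged data; it is not the Coach 1.0.0 Batch RL API, and the reward rule and dataset size are assumed for illustration.

```python
import numpy as np

# Generic fitted Q-iteration over a fixed, logged dataset (the core idea
# behind Batch RL). Illustrative sketch only, not the Coach Batch RL API.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# Synthetic "historical" transitions: (state, action, reward, next_state).
# In practice these would come from logged interactions with a real system.
logged = []
for _ in range(2000):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    r = 1.0 if a == s % 2 else 0.0        # reward rule is unknown to the learner
    s_next = (s + 1) % n_states
    logged.append((s, a, r, s_next))

Q = np.zeros((n_states, n_actions))
for _ in range(100):                      # repeatedly regress Q on the fixed batch
    target = Q.copy()
    for s, a, r, s_next in logged:
        target[s, a] = r + gamma * Q[s_next].max()
    Q = target                            # no new environment interaction needed

print("greedy action per state:", Q.argmax(axis=1))
```

Because the dataset is fixed, the quality of the learned policy depends heavily on how well the logged data covers the state-action space, which is one reason careful off-policy evaluation matters in Batch RL workflows.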

Challenge 3: Infrastructure Access

RL training is often compute- and memory-intensive, yet many data scientists and researchers lack access to the powerful data center infrastructure it requires. Intel has worked with Amazon to integrate RL Coach with the Amazon SageMaker* platform. Developers and researchers can use RL Coach to experiment with and deploy RL models while taking advantage of Amazon Web Services’ scalable computing services based on 2nd Generation Intel® Xeon® Scalable processors and other data center technologies from Intel. Intel continues to design our data center technologies around the requirements of AI workloads, ensuring robust performance for RL in the cloud and on premises.
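For example, an RL Coach training job on SageMaker is typically launched through the SageMaker Python SDK's RL estimator. In the sketch below, the entry-point script name, toolkit and framework versions, IAM role, and instance type are illustrative assumptions and will depend on your own account and SDK version.

```python
# Illustrative launch of an RL Coach training job via the SageMaker Python SDK.
# Entry-point script, toolkit/framework versions, IAM role, and instance type
# are assumptions that depend on your account and SDK version.
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

estimator = RLEstimator(
    entry_point="train-coach.py",        # your Coach preset / training script
    source_dir="src",
    toolkit=RLToolkit.COACH,
    toolkit_version="1.0.0",
    framework=RLFramework.TENSORFLOW,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # replace with your role
    instance_type="ml.c5.2xlarge",       # CPU instance; choose per workload
    instance_count=1,
)

estimator.fit()                          # starts the managed training job
```

SageMaker handles provisioning and tearing down the training instances, so experiments can scale up without maintaining a dedicated cluster.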

Driving RL toward Maturity

Reinforcement learning is an exciting domain that is just beginning to deliver value in AI use cases. By working together, we can significantly advance the maturity of RL and spread its benefits more broadly.

I encourage you to build on the work we have provided for RL. Contribute your RL code to the open source community so data scientists and developers can reproduce and extend it. Take advantage of RL Coach and Amazon SageMaker as needed to access high-performance infrastructure. Capitalize on the groundwork being done by RL researchers and developers. I’m excited to see what our work will make possible.