Nauta: Sailing the Seas of Distributed DL

Creating and training deep learning (DL) models typically starts on a workstation or laptop, but as projects mature, most will require the memory and compute resources of a departmental or enterprise cluster. Very large model training experiments may need to scale across multiple nodes to achieve reasonable training times.

While the deep learning ecosystem offers a variety of frameworks and tools, data scientists often find themselves dealing with the intricacies of cluster configuration as they work to move their models onto an enterprise cluster or cloud.

As a new, open source, multi-user platform incubated by Intel, Nauta complements existing tools by making it easier to run complex deep learning models on shared hardware resources. This approach to distributed deep learning frees data scientists from having to worry about setting up and managing cluster or cloud infrastructure. It also saves time for devops managers and hardware vendors who want to meet user and customer demands for high-performance deep learning capabilities.

My colleague Carlos Humberto Morales and I are excited to discuss Nauta at the Artificial Intelligence Conference in New York City. Sponsored by O’Reilly and Intel, the AI Conference is an industry-leading, four-day event that brings together deep learning professionals across the business and technology spectrum to focus on real-world AI applications.

  • At 9:20 am in the Grand Ballroom West, Carlos will provide an overview of Nauta in his keynote address, “Making Real-World Distributed Deep Learning Easy with Nauta.”
  • At 1:50 pm in the Trianon Ballroom, I’ll lead a 40-minute breakout session called “Sailing with Nauta.” The word Nauta means sailor in Latin, and Intel’s goal for Nauta is to provide smooth sailing on the vast expanse of deep learning tools. I’ll cover the motivation for and benefits of Nauta, and show how you can get started with Nauta, execute deep learning experiments and tasks, monitor progress, set up inference endpoints, manage input and output data, and customize Nauta.

Navigating the Rough Seas

Intel developed Nauta after observing common pain points experienced by DL practitioners within our company and across our customer base. The questions users wrestled with include:

  • How do I translate a “local” DL script in a way that lets me launch it on a cluster orchestration system such as Kubernetes*?
  • Once I’ve packaged the scripts in containers, how do I get the data into the training?
  • Where do I save the models when training is complete?
  • How do I orchestrate a large training job across multiple servers?
  • Once I submit my training job to the cluster, how do I know whether it has started and when it is done?
  • How do I coordinate resources and easily share results with my colleagues?

While robust point solutions exist to solve many of these problems, they often require expertise that data scientists may not have. In addition, not all point solutions can smoothly handle a mix of jobs, frameworks, and operators, along with non-DL workloads, in a cloud or cluster environment.

Smooth Sailing with Nauta

Many devops experts, large organizations, and DL teams have figured out how to answer these questions, but each group has had to search out and develop best practices for tasks such as submitting and tracking complex training jobs, tracking resources, distributing jobs, and providing visibility into job status.

To simplify this process, Nauta provides a curated set of tools based on best practices developed by the open source community. Nauta also offers extensibility (through template packs and Helm charts) to meet custom requirements. These capabilities are performance-optimized for Intel® Xeon® Scalable processor-based platforms. They’re also backed by a hefty investment in quality assurance: the Nauta team developed over 400 tests to ensure enterprise-grade reliability and interoperability.

Nauta harnesses Kubeflow* intrinsics and open source best practices to provide a robust, easy-to-learn, easy-to-use way of training deep learning models on an enterprise cluster or cloud.

Nauta runs on top of Kubernetes*, the industry-leading orchestration system and one to which Intel has contributed heavily. Nauta installs Kubernetes and adds a graphical interface that makes Kubernetes easier to learn and use. Data scientists can define Kubernetes and Docker* containers and use open source tools such as Jupyter* notebooks and TensorBoard* visualization to manage their end-to-end DL workflows efficiently. They can take advantage of resources on premises or in a secure cloud. Devops managers and hardware vendors can use Nauta to shorten time-to-market and increase convenience for the solutions they’re bringing to their user community and customer base. The result is smoother sailing for the deep learning voyage.

Go Further

We hope you’ll stop by the Intel booth to share your questions and insights with the Nauta team, check out a demo showing how to initiate and monitor TensorFlow* model training experiments with Nauta, and learn about other technologies in Intel’s AI portfolio. Nauta is a production-ready, open source solution, and I encourage all of you to check it out on GitHub. Download the repo, try it out, and add to the store of knowledge. We also invite you to follow us at @IntelAI for the latest happenings at #TheAIConf in NYC.