AI is bringing new ways to use massive amounts of data to solve problems in business and industry—and in high performance computing (HPC). As AI applications increasingly take on day-to-day use cases, HPC practitioners—like their commercial counterparts—are looking to move deep learning training off specialized laboratory hardware and software and onto the familiar Intel®-based infrastructure already in place for handling a wide variety of HPC workloads.
To enable that workload flexibility, we optimized Intel® Xeon® Scalable processors for HPC and AI. We designed Intel® Omni-Path Architecture (Intel® OPA) fabric to provide high-performance communications across large clusters of Intel® Xeon® Scalable processor-based systems. And we’ve optimized the most widely applied AI frameworks to take full advantage of processor optimizations and available cores.
Now, to help mainstream deep learning in HPC environments, we’ve developed best practices detailing the setup, installation, and procedures to run distributed deep learning training and inference using TensorFlow* with the Uber Horovod* library on Intel Xeon CPU-based deployments. The configuration we implemented is tuned to the needs of HPC.
In addition to detailed installation scripts for software components, we include installation scripts and command line arguments to install and run verification jobs using the ImageNet image database, the TensorFlow convolutional neural network benchmarks, and the ResNet-50 convolutional neural network. We describe how to create a Singularity image enabling ready distribution and installation of the solution across large clusters. And we describe typical issues that might be encountered and provide troubleshooting tips for identifying and resolving them.
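As a rough sketch of what the Singularity workflow looks like in practice (the image and definition-file names here are illustrative, not taken from the paper; the paper's scripts give the exact build recipe):

```shell
# Build a Singularity image from a definition file that installs
# TensorFlow, Horovod, and their MPI dependencies (names illustrative).
sudo singularity build tensorflow_horovod.simg tensorflow_horovod.def

# Place the image on a shared filesystem visible to all cluster nodes,
# then verify that TensorFlow and Horovod import correctly inside it.
singularity exec tensorflow_horovod.simg \
    python -c "import tensorflow; import horovod.tensorflow"
```

Because the image is a single file, distributing the solution across a large cluster reduces to copying one artifact rather than repeating the installation on every node.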
Upon successful installation, our best practices provide detailed instructions for running multiple training instances on a single node. This approach divides the cores uniformly across worker instances and uses Non-Uniform Memory Access (NUMA)-aware core affinity and data placement to exploit the local memory channels of each socket. To scale to multiple nodes, we show how to spawn multiple workers per node and use Horovod over MPI to synchronize gradients. We provide example command line parameters to train the ResNet-50 model on multiple two-socket Intel Xeon Scalable processors, saving the model periodically in a model checkpoint directory. Finally, we provide instructions for evaluating the accuracy of the trained model.
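To make the shape of such a run concrete, here is an illustrative sketch using the TensorFlow benchmark script and Open MPI. All host names, paths, core counts, and batch sizes below are hypothetical placeholders; the paper's sample scripts give the tuned values:

```shell
# Thread/affinity tuning (example values for a socket with 22 physical cores).
export OMP_NUM_THREADS=22
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0

# Launch 8 Horovod workers across four two-socket nodes:
# one rank per socket, each pinned to its socket's cores (NUMA-aware).
mpirun -np 8 -H node1:2,node2:2,node3:2,node4:2 \
    --map-by ppr:1:socket:pe=22 \
    python tf_cnn_benchmarks.py \
        --model=resnet50 --batch_size=64 \
        --variable_update=horovod \
        --num_intra_threads=22 --num_inter_threads=2 \
        --data_name=imagenet --data_dir=/path/to/imagenet \
        --train_dir=/path/to/checkpoints

# Evaluate the accuracy of the periodically saved checkpoints.
python tf_cnn_benchmarks.py \
    --model=resnet50 --eval=True \
    --data_name=imagenet --data_dir=/path/to/imagenet \
    --train_dir=/path/to/checkpoints
```

The `--map-by ppr:1:socket:pe=22` option is Open MPI's way of binding one rank per socket with a fixed core count, which keeps each worker's threads and memory allocations on the same NUMA domain.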
The same approaches—and much of the same software—might be applied to install and run TensorFlow in traditional data center environments as well. Such an environment might substitute Ethernet for Intel OPA, use Docker containers with Kubernetes* rather than Singularity with SLURM, and otherwise adapt the configuration and installation to the target environment. In either case, the objective is to enable deep learning training in an environment that’s familiar and that in most cases is already in place. As cloud service providers roll out HPC services, the configurations we describe are readily adaptable to hybrid computing environments spanning on-premises data centers and cloud HPC instances.
As AI and analytics become common workloads in both HPC and commercial computing environments, moving deep learning training and inference to industry-standard Intel Xeon Scalable processors creates a uniform infrastructure that avoids the complexity of one-off solutions. It makes it possible to access and process the data where it lives, rather than moving it through the network to special processors. And it leverages existing investments in servers, processors, and communications rather than requiring that you buy and install more.
The best practices we offer make it easy for technical teams to achieve this and to make deep learning a standard part of their environments. Download the paper and sample scripts to get started right away.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice Revision #20110804
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
© Intel Corporation