Best Practices for Scaling Deep Learning Training and Inference with TensorFlow* On Intel® Xeon® Processor-Based HPC Infrastructures

This document describes the setup, installation, and procedure for running distributed deep learning training with TensorFlow using Uber's Horovod library over MPI. The exact steps required to run the benchmark can vary with the user's environment. For large clusters, on the order of hundreds or thousands of nodes, we provide sample scripts that use the SLURM scheduler. We also list the steps for smaller systems that may not have such a scheduler configured.
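As an illustration of the SLURM-based approach described above, the following is a minimal sketch of a batch script that launches one Horovod training process per node. The script name (`run_tf_horovod.sbatch`), the training script name (`train_model.py`), and the partition name are placeholders, not the actual sample scripts provided with this document; the `mpirun` flags shown are commonly used with Horovod on Intel MPI-style launchers and may need adjustment for your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=tf_horovod          # job name shown by squeue (placeholder)
#SBATCH --nodes=4                      # number of Xeon nodes to use
#SBATCH --ntasks-per-node=1            # one Horovod worker per node
#SBATCH --time=02:00:00                # wall-clock limit

# Launch one MPI rank per node; each rank becomes a Horovod worker.
# train_model.py is a hypothetical TensorFlow training script that
# calls horovod.tensorflow's init() and wraps its optimizer with
# DistributedOptimizer.
mpirun -np "$SLURM_NTASKS" \
       -ppn 1 \
       python train_model.py
```

The script would be submitted with `sbatch run_tf_horovod.sbatch`; on systems without SLURM, the `mpirun` line alone can be run directly with an explicit host list (e.g. `-hosts node1,node2`).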

Download Script Files
