Using Intel® Xeon® for Multi-node Scaling of TensorFlow* with Horovod*

TensorFlow* is one of the leading deep learning and machine learning frameworks today. Earlier in 2017, Intel worked with Google* to incorporate optimizations for Intel® Xeon® processor-based platforms using Intel® Math Kernel Library (Intel® MKL) [1]. Optimizations such as these, across several popular frameworks, have led to orders-of-magnitude performance improvements.

Intel has mainly been reporting Intel-optimized TensorFlow performance improvements on a single node. However, many complex deep learning models must be trained on multiple nodes: they either do not fit on one machine, or their time-to-train can be significantly reduced by training on a cluster of machines. Therefore, Intel has also performed scaling studies on multi-node clusters of Intel Xeon Scalable processors. This blog shows TensorFlow distributed training performance on a cluster of Intel® Xeon® platforms (system configurations can be found at the end) using Horovod, a distributed training framework for TensorFlow.

Horovod*, which was developed by Uber*, uses the Message Passing Interface (MPI) as its main communication mechanism. It relies on MPI concepts such as allgather and allreduce to handle cross-replica communication and weight updates, and OpenMPI* can be used with Horovod to support these operations. Horovod is installed as a separate Python package. By calling Horovod's API from the deep learning model script, a regular build of TensorFlow can be used to run distributed training. With Horovod, no TensorFlow source code changes are required to support distributed training with MPI.
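To illustrate the allreduce concept, below is a minimal single-process Python simulation of a ring allreduce over per-worker gradient vectors. This is a toy sketch only: real Horovod exchanges tensor chunks between MPI ranks across machines, while here the "workers" are just lists in one process, with one chunk per worker to keep the indexing simple.

```python
def ring_allreduce(worker_grads):
    """Toy single-process simulation of a ring allreduce.

    Each inner list is one worker's gradient vector, treated as n chunks of
    one element each (n = number of workers). Every worker ends up holding
    the element-wise sum across all workers.
    """
    n = len(worker_grads)
    grads = [list(g) for g in worker_grads]
    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the complete
    # sum for chunk (i + 1) % n. Snapshot the sends first so all "messages"
    # in a step use values from the start of that step.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, grads[i][(i - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            grads[(i + 1) % n][chunk] += value
    # Phase 2, allgather: circulate the completed chunks around the ring so
    # every worker receives the full reduced vector.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, grads[i][(i + 1 - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            grads[(i + 1) % n][chunk] = value
    return grads

# Three simulated workers, each with a 3-element gradient vector.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)  # every worker now holds [12, 15, 18]
```

After reduce-scatter each worker owns the complete sum for one chunk; allgather then circulates those completed chunks, which are the same two phases an MPI-based ring allreduce performs at scale.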

Scaling Results using Uber Horovod with TensorFlow 1.7

In this section, we show the performance of Intel-optimized TensorFlow 1.7 for ResNet-50 and Inception-V3 training on up to 64 nodes containing Intel Xeon Gold processors. A real training dataset was used for these runs. As shown in the charts below, by running one MPI process per node, ResNet-50 maintained at least 89.1% scaling efficiency for up to 64 nodes, while Inception-V3 achieved at least 89.4%. With this higher throughput for ResNet-50 and Inception-V3, time-to-train is reduced significantly.

Although this study shows scaling for up to 64 nodes, we expect a similar scaling efficiency to carry over to 128 nodes.
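As a back-of-the-envelope check, scaling efficiency converts to speedup over a single node by multiplying by the node count. The sketch below uses the 89.1% ResNet-50 efficiency reported above:

```python
def weak_scaling_speedup(nodes, efficiency):
    # Speedup over one node when per-node work is fixed: the ideal linear
    # speedup (nodes) discounted by the measured scaling efficiency.
    return nodes * efficiency

# ResNet-50 at 89.1% efficiency on 64 nodes
print(round(weak_scaling_speedup(64, 0.891)))  # ~57x over a single node
```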

The user can also run the same models with 2 MPI processes on each node. As shown in the charts below, this yields up to 17% and 24% higher performance for ResNet-50 and Inception-V3 respectively, at no extra hardware cost. Please note that the batch size per node remains the same as what we used for running 1 MPI process per node.

Thus, by running two MPI processes per node, as shown in the two graphs below, ResNet-50 maintained at least 94.1% scaling efficiency for up to 64 nodes, while Inception-V3 achieved at least 87.4%. With this higher throughput for ResNet-50 and Inception-V3, time-to-train is reduced even further than with one MPI process per node.
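Since the batch size per node is held constant, each MPI process on a node trains on an equal share of it. A quick sketch of that arithmetic, using the per-node batch of 128 that appears in the run commands later in this post:

```python
# Per-node batch size is held constant; each MPI process gets an equal share.
PER_NODE_BATCH = 128

def per_process_batch(procs_per_node):
    return PER_NODE_BATCH // procs_per_node

print(per_process_batch(1))  # 128 with 1 MPI process per node
print(per_process_batch(2))  # 64 with 2 MPI processes per node
```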


Gathering and Installing Relevant Software Tools

  1. OpenMPI can be installed via yum on recent versions of CentOS, and some existing clusters already have OpenMPI available. In this blog, we use OpenMPI 3.0.0. OpenMPI can be installed following the instructions in the following link:
  2. A recent GCC version is needed; GCC 6.2 or newer is recommended. For the latest installation, the following link can be used:
  3. Python 2.7 and 3.6 were both tested.
  4. Uber Horovod supports running TensorFlow in a distributed fashion. Horovod can be installed as a standalone Python package as follows:
    pip install --no-cache-dir horovod (e.g. horovod-0.11.3)
    Please check the following link to install Horovod from source:
  5. The current TensorFlow benchmarks have recently been modified to use Horovod. You can obtain the benchmark code from GitHub:
    git clone
    cd benchmarks/scripts/tf_cnn_benchmarks
    Run as explained below.

Running the TensorFlow Benchmark using Horovod

Here, we discuss the run commands needed to run distributed TensorFlow using the Horovod framework. For the hardware platform, we use a cluster of dual-socket Intel® Xeon® Gold 6148 processor-based systems. For networking, 10Gb Ethernet is used; Mellanox InfiniBand* or Intel® Omni-Path Architecture (Intel® OPA) can also be used to network the cluster.

Running 2 MPI processes on a single node:

export LD_LIBRARY_PATH=<path to OpenMPI lib>:$LD_LIBRARY_PATH
export PATH=<path to OpenMPI bin>:$PATH
export inter_op=2
export intra_op=18 {# cores per socket}
export OMP_NUM_THREADS=18 {set to match intra_op; forwarded to each rank via mpirun -x}
export batch_size=64 
export MODEL=resnet50 {or inception3}
export python_script= {path for script}

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by socket --oversubscribe 
--report-bindings -n 2 python $python_script --mkl=True --forward_only=False --num_batches=200 
--kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False 
--optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW 
--model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> 
--data_name <dataset_name>

For 1 MPI process per node, the configuration is as follows. The other environment variables remain the same.

export intra_op=38
export OMP_NUM_THREADS=38 {set to match intra_op}
export batch_size=128 

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS --bind-to none --report-bindings  
-n 1 python $python_script --mkl=True --forward_only=False --num_batches=200 
--kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op 
--distortions=False --optimizer=sgd --batch_size=$batch_size 
--num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL 
--variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> 
--data_name <dataset_name>

Please note that if you want to train models to achieve good accuracy, use --distortions=True. You may also need to change other hyperparameters.

For running models on a multi-node cluster, we use a very similar run script to the one above. For example, to run on 64 nodes (2 MPI processes per node), where each node has two Intel Xeon Gold 6148 processors, distributed training can be launched as shown below. All the exports are the same as above.

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by node  
--report-bindings -hostfile host_names -n 128 python $python_script --mkl=True 
--forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 
--num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size
 --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod 
--horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

Here, the host_names file lists the hosts on which you wish to run the workload.
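As an illustration, a host_names file for OpenMPI lists one host per line, optionally with a slots count for the number of MPI processes to place on that host. The hostnames below are placeholders:

```text
node001 slots=2
node002 slots=2
node003 slots=2
node004 slots=2
```

With all 64 hosts listed at slots=2, the -n 128 in the mpirun command places two ranks on each node.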

What Distributed TensorFlow means for Deep Learning Training on Intel Xeon

Various efforts have been made to implement distributed TensorFlow on CPUs and GPUs, for example gRPC, VERBS, and TensorFlow's built-in MPI support; all of these technologies are incorporated in the TensorFlow codebase. Uber's Horovod is one distributed TensorFlow technology that was able to harness the power of Intel Xeon. It uses MPI underneath, with ring-based allreduce and allgather for the deep learning parameters. As shown above, Horovod on Intel Xeon scales well for existing DL benchmark models such as ResNet-50 (up to 94%) and Inception-V3 (up to 89%) on 64 nodes. In other words, time to train a DL network can be reduced by as much as 57x (ResNet-50) and 58x (Inception-V3) using 64 Xeon nodes compared to a single Xeon node. Thus, Intel currently recommends that TensorFlow users use Intel-optimized TensorFlow and Horovod with MPI for multi-node training on Intel® Xeon® Scalable processors.


The authors would like to thank Vikram Saletore, Mikhail Smorkalov, and Srinivas Sridharan for their collaboration with us on using Horovod and Horovod-based Intel MLSL with TensorFlow.