Multi-node Convergence and Scaling of Inception-Resnet-V2 Model Using Intel® Xeon® Processors

TensorFlow* is one of the most popular, flexible open source software libraries for numerical computation and large-scale machine learning (ML) and deep learning (DL). Since 2016, Intel and Google engineers have been working together to optimize TensorFlow for Intel® Xeon® processors using Intel® Math Kernel Library (Intel® MKL). This optimization improves training and inference performance of deep neural networks on Intel® CPUs.

Horovod*, a distributed training framework for TensorFlow, Keras* and PyTorch*, was developed by Uber to make distributed DL projects faster and easier to develop. The primary principles of Horovod are based on Message Passing Interface (MPI) concepts, such as allreduce and allgather, to aggregate data among multiple processes and distribute the results back to the processes. Most of the latest MPI implementations, such as OpenMPI, can be used with Horovod. Horovod can be installed as a separate Python* package. Additionally, no source code modification is necessary in TensorFlow models to support distributed training with Horovod.
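For reference, a typical Horovod integration for a TensorFlow (1.x) training script follows the pattern sketched below. This is a minimal toy example with illustrative names and values, not code from the training scripts used in this article; the patched TF-slim script described later exposes similar functionality through its --variable_update=horovod flag.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # initialize MPI-backed communication among workers

# Toy model: a single dense layer on random data, purely for illustration.
x = tf.random_normal([32, 10])
y = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(y))

# Wrap the optimizer so gradients are averaged across workers with allreduce.
opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# Rank 0 broadcasts its initial variable values to all other workers.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

Such a script is launched with mpirun, one process per rank, for example: mpirun -np 4 python train_toy.py.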

Some complex deep learning (DL) models can be trained more efficiently using a multi-node training configuration. This article presents multi-node CPU training performance on a cluster of Intel Xeon platforms using TensorFlow with Horovod. Our goal is to leverage the computational power of multiple nodes to shorten time to train and to provide guidance on appropriate hyper-parameter settings for multi-node training.

Multi-node Training Convergence

Here, the Inception-Resnet model is used to investigate how to achieve multi-node training convergence. There are two variants of this model, V1 and V2. The computational cost of Inception-Resnet-V1 is similar to that of Inception-V3, whereas Inception-Resnet-V2 is similar to Inception-V4. Both versions have similar structures but different stem layers for the reduction blocks and different hyper-parameter settings. Since Inception-Resnet-V2 is the most accurate of the Inception models, in this article we discuss our approach to reaching state-of-the-art (SOTA) accuracy for Inception-Resnet-V2 and present our multi-node convergence test results. The details of the model can be found here.

Learning Rate Scaling

In this article, we experimented with two learning rate policies for large minibatches: linear scaling and square root scaling. Researchers from Facebook have proposed[2] a linear scaling rule for the learning rate, which is a commonly used technique for distributed training of some models. However, after experimenting with this linear learning rate policy, our results showed that it did not provide good accuracy for the model. Instead, with another learning rate policy, square root scaling, we were able to achieve better Top-1 and Top-5 accuracy. Here, the Top-1 error is the percentage of the time that the model did not give the correct class the highest score. The Top-5 error is the percentage of the time that the model did not include the correct class among its top 5 highest scores.
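Concretely, if base_lr is the learning rate tuned for a single node and the training job runs across num_workers processes, the two policies compute the scaled initial learning rate as in the sketch below (the function names and example values are illustrative, not taken from the training script):

import math

def linear_scaled_lr(base_lr, num_workers):
    # Linear scaling rule: the learning rate grows proportionally to the number of workers.
    return base_lr * num_workers

def sqrt_scaled_lr(base_lr, num_workers):
    # Square root scaling: the learning rate grows with the square root of the number of workers.
    return base_lr * math.sqrt(num_workers)

print(linear_scaled_lr(0.01, 8))  # 0.08
print(sqrt_scaled_lr(0.01, 8))    # ~0.028

The scaled value is then used as the target learning rate after the warmup phase described next.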

Warmup Training

As discussed above, for distributed training the single-node learning rate is multiplied by a function of the number of workers to obtain the final learning rate. However, learning rate scaling can cause the training to diverge. With large minibatch sizes (for example, more than 32), the parameters change more drastically at the beginning of training, which can eventually cause divergence. To tackle this issue, a warmup training strategy, as discussed here[2], can be used, with less aggressive learning rates at the beginning of the training. We experimented with two types of warmup training.

  • Constant Warmup: In this scheme, a constant learning rate is used for the first few epochs of training. For Inception-Resnet-V2, however, this strategy caused spikes in the training error, so we did not use it in the final training run.
  • Gradual Warmup: In this case, the learning rate is increased gradually over the first few epochs until it reaches the target learning rate. This scheme showed healthy convergence at the beginning of training for our model, so we used it in the final run.
    For multi-node training, the global batch size can be calculated as follows:
    Global batch size = number_of_nodes * local batch size (minibatch)

After experimenting with various global batch sizes, we followed an empirical rule:

If the global batch size is greater than 1024, a gradual warmup phase is added for the first 5 epochs of training.
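A minimal sketch of this warmup rule is shown below, assuming a linear ramp from the warmup learning rate to the target learning rate, which is one common way to implement gradual warmup. The function and variable names are illustrative; the warmup and target values match the run configuration given later in this article.

def learning_rate_at_step(step, steps_per_epoch, target_lr, global_batch_size,
                          warmup_lr=0.05, warmup_epochs=5):
    # Gradual warmup: ramp linearly from warmup_lr to target_lr over the first
    # warmup_epochs epochs when the global batch size exceeds 1024; afterwards
    # return target_lr (any decay schedule is applied on top of this).
    if global_batch_size <= 1024:
        return target_lr
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return warmup_lr + (target_lr - warmup_lr) * step / warmup_steps
    return target_lr

# Illustrative numbers: 32 nodes with a local minibatch of 64 images.
global_batch_size = 32 * 64                      # 2048 > 1024, so warmup applies
steps_per_epoch = 1281167 // global_batch_size   # ImageNet training images per epoch
print(learning_rate_at_step(0, steps_per_epoch, target_lr=0.11,
                            global_batch_size=global_batch_size))  # 0.05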

Installing Prerequisite Tools

In order to train Inception-Resnet-V2 to convergence in a multi-node configuration, we needed to install various software tools, as listed below:

  1. OpenMPI* 3.0.0 as the MPI distribution. OpenMPI can be installed following the instructions at Open MPI: Version 3.0. Some existing clusters already have OpenMPI available.
  2. The latest GNU Compiler Collection* (GCC) version. GCC version 6.2 or newer is recommended. See GCC, the GNU Compiler Collection for the latest installation instructions.
  3. Python* programming language. Versions 2.7 and 3.6 have been successfully tested. We used Anaconda2 from https://www.anaconda.com/download/.
  4. Horovod distributed training framework. Horovod can be installed as a standalone Python package as follows: pip install horovod (we used horovod-0.11.3).
  5. Intel-optimized TensorFlow 1.8 with Intel MKL support. You can install it from here. We used a specific TensorFlow commit (abccb5d3cb45da0d8703b526776883df2f575c87) and then applied the patch from https://github.com/tensorflow/tensorflow/issues/17437.
  6. Inception-Resnet-V2 model. A patch (commit id: 4e71c0641d4b7bd24154ebb9ec7e62389214dd24) is applied to the TensorFlow models repository.

Multi-node Training Using TensorFlow with Horovod

In this section, we discuss the runtime commands needed to run distributed TensorFlow using the Horovod framework. For testing, we used an internal cluster of dual-socket Intel® Xeon® Gold 6148 processor nodes.

We set the following environment variables for the software tools:
export LD_LIBRARY_PATH=/path/to/gcc6.2.0/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH="/path/to/openmpi-3.0.0/lib":$LD_LIBRARY_PATH
export PATH="/path/to/openmpi-3.0.0/bin":$PATH
export OMP_NUM_THREADS=40

We then set shell variables related to performance settings and the initial hyper-parameters:
dataset_dir={/path/to/image/data}
num_intra_threads=40
num_inter_threads=2
ppn=1
nodes=32
num_nodes=$(($ppn*$nodes))
local_batch_size=64
batch_size=$(($local_batch_size*$num_nodes))
lr_single_node=0.01
initial_lr=0.11
epochs=120
num_of_images=1281167
steps_per_epochs=$(($num_of_images/$batch_size))
# max_steps is used in the run parameters below; derive it from the total number of epochs.
max_steps=$(($epochs*$steps_per_epochs))

We then set up the parameters for the run command:
params=" --max_number_of_steps=$max_steps --learning_rate=$initial_lr --batch_size=$local_batch_size --train_dir=$train_dir --dataset_dir=$dataset_dir --dataset_name=”dataset” --dataset_split_name=train --model_name=inception_resnet_v2 --num_intra_threads $num_intra_threads --num_inter_threads $num_inter_threads --clone_on_cpu=True --variable_update=horovod "

if [ $batch_size -gt 1024 ]; then
  warmup_lr=0.05
  warmup_epochs=5
  warmup_steps=$(($warmup_epochs*$steps_per_epochs))

  params+=" --warmup_epochs=$warmup_epochs --warmup_learning_rate=$warmup_lr --warmup_steps=$warmup_steps --learning_rate_decay_type polynomial --optimizer momentum --weight_decay 0.0002 --power 3.0 --num_epochs_per_decay 100"
fi

Finally, we started the MPI processes for multi-node training using the following run command:
mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc $OMP_NUM_THREADS --map-by node --mca btl_tcp_if_include eth0 -np $num_nodes -H $HOSTLIST python $script_dir/train_image_classifier.py $params 2>&1 | tee $train_dir/output-goldskx-${num_nodes}.log

Multi-node Scaling on Intel Xeon Processors

Fig. 1 – Throughput Performance Scaling for Inception-Resnet-V2

In Figure 1, we show the scaling performance of the Intel Xeon processor optimizations for TensorFlow 1.8 on Inception-Resnet-V2 training, running on up to 32 nodes containing Intel® Xeon® Gold processors. A real training dataset was used for these runs. As shown in Figure 1, by running one MPI process per node, Inception-Resnet-V2 maintained at least 80 percent scaling efficiency for up to 32 nodes. So, with the higher throughput for Inception-Resnet-V2 and Inception-V3, time to train is reduced significantly.
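For reference, scaling efficiency here means the measured multi-node throughput divided by the ideal linear scaling of the single-node throughput. The sketch below uses illustrative numbers, not our measured data:

def scaling_efficiency(throughput_n_nodes, throughput_one_node, num_nodes):
    # Efficiency = measured throughput / ideal (linear) throughput.
    return throughput_n_nodes / (throughput_one_node * num_nodes)

# Example: if one node sustains 100 images/sec and 32 nodes together sustain
# 2600 images/sec, the scaling efficiency is about 81 percent.
print(scaling_efficiency(2600.0, 100.0, 32))  # 0.8125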

Model Evaluation

It is recommended to save a model checkpoint every few epochs while training is in progress, and then to evaluate the model every few epochs using the saved checkpoints to check whether the training is converging. Tuning the training hyperparameters may be necessary if the evaluation shows increasing loss or decreasing accuracy. Below, we show the run command to evaluate the training checkpoint at 75,000 steps.

CHECKPOINT_FILE=${CHECKPOINT_DIR}/model.ckpt-75000

$ python eval_image_classifier.py \
--alsologtostderr \
--checkpoint_path=${CHECKPOINT_FILE} \
--dataset_dir=${DATASET_DIR} \
--dataset_name="dataset" \
--dataset_split_name=validation \
--model_name=inception_resnet_v2

Training Results

After completing the training in 108 epochs, we achieved near-SOTA accuracy on both the Top-1 and Top-5 metrics. The results are shown in Table 1 below.

Table 1: Inception-Resnet-V2 multi-node training results

Metric            Value
Top-1 accuracy    76.7%
Top-5 accuracy    93.2%
Epochs            108

During training, the model also produces logs of the learning rate variations and losses. These logs can be visualized with the TensorBoard* tool included in the TensorFlow framework. Figure 2 shows the warmup learning rate during the first few epochs, followed by the subsequent decay of the learning rate. In Figure 3, the loss values gradually decrease with epochs, which indicates that the training is converging.

Fig 2: Learning rate variations with number of steps

Fig 3: Loss value variations with number of steps

 

Conclusion

Training convergence of a model on a single node is well studied; training convergence on a multi-node CPU cluster is a comparatively new area. In this article, we presented how to achieve near-SOTA accuracy on a 32-node Intel Xeon processor-based cluster, including the hyper-parameter values used during the warmup and later training periods, as well as the environment settings and runtime commands needed for multi-node CPU training. The procedure presented in this article can be helpful for training other models in multi-node CPU deployments.

Acknowledgement

The authors would like to thank Wei Wang, Mahmoud Abuzaina, and Md Faijul Amin from Intel’s Artificial Intelligence Product Group (AIPG) for their valuable suggestions during the project.