Apache* MXNet* v1.5.0 Gets a Lift with Intel® DL Boost

The Apache MXNet community recently announced the v1.5.0 release of the Apache MXNet* deep learning framework. This version of Apache MXNet introduces a series of new features and optimizations targeting CPU backends, including:

  • Subgraph API and operator fusion
  • Model quantization and lower precision (INT8) inference
  • Integration of the new fused-RNN kernels
  • Support for Horovod*-based distributed training

The lower-precision (INT8) inference performance has seen gains thanks to the Intel® Deep Learning Boost (Intel® DL Boost) functionality on the recently launched 2nd Generation Intel® Xeon® Scalable processors. Intel DL Boost’s Vector Neural Network Instructions (VNNI) improve AI performance by combining three instructions into one and can be accessed through the three new Amazon EC2 C5 instances: C5.12xlarge, C5.24xlarge and C5.metal. Refer to AWS Launches New Amazon EC2 C5 Instances Featuring Intel® DL Boost for more details on how to get started.
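To make the "three instructions into one" point concrete, the sketch below emulates in plain Python what one 32-bit lane of the VNNI instruction vpdpbusd computes: four unsigned 8-bit by signed 8-bit products are summed and added to a signed 32-bit accumulator. Without VNNI, the same work on AVX-512 typically takes the sequence vpmaddubsw, vpmaddwd, vpaddd. The function name and the input values are illustrative assumptions, and the emulation does not model the 16-bit saturation that can occur in the intermediate step of the older sequence.

```python
def vnni_dot_lane(acc, activations_u8, weights_s8):
    """Emulate one 32-bit lane of a VNNI-style u8 x s8 dot-product accumulate."""
    if len(activations_u8) != 4 or len(weights_s8) != 4:
        raise ValueError("one 32-bit lane holds exactly four 8-bit values")
    for a in activations_u8:
        if not 0 <= a <= 255:
            raise ValueError("activations must fit in unsigned 8 bits")
    for w in weights_s8:
        if not -128 <= w <= 127:
            raise ValueError("weights must fit in signed 8 bits")
    # Multiply pairwise, sum the four products, and add to the accumulator.
    total = acc + sum(a * w for a, w in zip(activations_u8, weights_s8))
    # Wrap to signed 32 bits, as the non-saturating form of the instruction does.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


# Illustrative values only: 10 + (1*4 + 2*(-5) + 3*6 + 255*127)
print(vnni_dot_lane(10, [1, 2, 3, 255], [4, -5, 6, 127]))
```

A full 512-bit register holds 16 such lanes, so one instruction performs 64 multiply-accumulate operations on INT8 data.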

Performance Improvement on Computer Vision Models

Both inference throughput and latency improve significantly when operator fusion and model quantization are used with the Apache MXNet v1.5.0 build optimized for a CPU backend (pip install mxnet-mkl). Taking ResNet50 as an example on an AWS EC2 C5.24xlarge instance, the CPU-optimized Apache MXNet 1.5.0 build shows inference performance gains of ~20x on FP32 throughput and ~9x on FP32 latency over the native CPU build (pip install mxnet). With lower precision (INT8), the gains are ~82x on throughput and ~26x on latency. [1] Figures 1 and 2 illustrate the inference throughput and latency comparison of popular neural network topologies for image classification. [2]

Figure 1. Inference Throughput Speed-up on Topologies for Image Classification.

Figure 2. Inference Latency Comparison on Topologies for Image Classification.

Performance Improvement on Recurrent Neural Networks (RNNs)

In the Apache MXNet v1.5.0 build optimized for a CPU backend, the integration of the fused RNN kernels provided by the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) significantly improves inference throughput and latency for long short-term memory (LSTM) and vanilla RNN layers. A selection of input shapes from the LSTM layers used by GNMT serves as the example to show the performance improvements.

Figures 3 and 4 illustrate the inference performance comparison on RNN. [3]

Figure 3. RNN Inference Throughput Comparison.

Figure 4. RNN Inference Latency Comparison.

Horovod and Intel® MPI Library Integration

Apache MXNet v1.5.0 enables distributed training using the Horovod distributed training framework and the Intel® MPI Library. Multi-node training efficiency on a CPU backend improves through more efficient use of network bandwidth and better scaling of the deep learning models.

When training ResNet-50 v1 on ImageNet-1K across multiple CPU nodes, a 16-node cluster of Intel® Xeon® Gold 6148 processors achieves about 93% scaling efficiency (92.98% in Table 1) without any compromise of accuracy. After 90 epochs, training converges, and the top-1 training and validation accuracy reach 0.7761 and 0.7595, respectively.

Table 1 lists the training throughput and scalability results for different numbers of Intel Xeon Gold 6148 nodes and training instances.

Multi-Node Configurations          Training Throughput   FP32
Node Number    Instance Number     Augment               Scaling
1              1                   1.00                  100.00%
1              2                   1.98                  98.86%
2              4                   3.73                  93.27%
4              8                   7.46                  93.31%
8              16                  14.82                 92.65%
16             32                  29.75                 92.98%

Table 1: Multi-node Training Scalability on ResNet50 v1
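
The Scaling column appears to be the achieved speed-up divided by the number of training instances; that reading is an assumption, since the post does not define the column. The sketch below recomputes it from the rounded speed-ups in Table 1, so the results differ slightly from the published percentages, which were presumably derived from unrounded throughput (for example, 1.98 / 2 gives 99.00% here versus 98.86% in the table).

```python
# (nodes, instances, FP32 throughput relative to one instance), copied from Table 1.
TABLE_1 = [
    (1, 1, 1.00),
    (1, 2, 1.98),
    (2, 4, 3.73),
    (4, 8, 7.46),
    (8, 16, 14.82),
    (16, 32, 29.75),
]


def scaling_efficiency(speedup, instances):
    """Return scaling efficiency in percent: achieved speed-up over ideal speed-up."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return 100.0 * speedup / instances


for nodes, instances, speedup in TABLE_1:
    efficiency = scaling_efficiency(speedup, instances)
    print(f"{nodes:>2} nodes, {instances:>2} instances: {efficiency:.2f}%")
```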

Figure 5 shows that the training throughput scales well up to 32 training instances (16 nodes).

Figure 5. Multi-node Training Throughput and Speed-up with Instance Number.

Figure 6 illustrates the training and validation accuracy trends on 8 and 16 nodes of Intel Xeon Gold 6148 processors, showing that distributed training has no impact on convergence or accuracy.

Figure 6. Multi-node Training and Validation Accuracy Trends.

Conclusion

Since the first official release of MXNet with an Intel-optimized CPU backend in August 2018, we have continually worked to optimize MXNet to enhance the experience for users. This new version of MXNet includes Intel optimizations for the Subgraph API, operator fusion, RNN improvements, and enhanced quantization performance. We encourage you to get started with the latest version of MXNet and follow us at @IntelAIResearch for more news and updates.

Appendix

  • CNN throughput and latency are measured with synthetic data on a C5.24xlarge instance. Two instances run simultaneously, bound to NUMA node 0 and NUMA node 1 respectively via the numactl utility. The throughput and latency are calculated as:
    • Throughput = Throughput_instance1 + Throughput_instance2
    • Latency = (Latency_instance1 + Latency_instance2) / 2

  • RNN throughput and latency are measured with synthetic data on a C5.24xlarge instance. Two instances run simultaneously, bound to NUMA node 0 and NUMA node 1 respectively via the numactl utility. The calculations follow the same equations as for the CNN topologies in the first note above.
  • Single-instance CNN and RNN throughput and latency can be measured on a C5.12xlarge instance, which has one NUMA node and half the number of physical cores of the C5.24xlarge instance, which has two NUMA nodes (two CPU sockets).
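
The two-instance measurement described in these notes can be sketched as follows. The helper names, the script name benchmark.py, and the sample numbers are hypothetical; only the numactl options --cpunodebind and --membind and the sum/average rule come from the numactl utility and the notes above.

```python
def numactl_command(numa_node, script_args):
    """Build a command that pins one benchmark process to a single NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",  # run only on the CPUs of this node
        f"--membind={numa_node}",      # allocate memory only from this node
        *script_args,
    ]


def combine_instances(throughputs, latencies):
    """Sum the per-instance throughputs and average the per-instance latencies."""
    if not throughputs or len(throughputs) != len(latencies):
        raise ValueError("need one throughput and one latency per instance")
    return sum(throughputs), sum(latencies) / len(latencies)


# Hypothetical script name and made-up measurements, for illustration only.
for node in (0, 1):
    print(" ".join(numactl_command(node, ["python", "benchmark.py"])))
print(combine_instances([1000.0, 980.0], [2.0, 2.2]))
```

Binding both the CPUs and the memory of each process to one node keeps memory accesses local to that socket, which is why the two instances can be treated as independent and their throughputs added.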

Step-by-Step Data Reproduction:

MXNet Installation

Both the native Apache MXNet build and the build optimized for a CPU backend can be installed with pip (or pip3):

  • pip install mxnet==1.5.0
  • pip install mxnet-mkl==1.5.0

Running CNN Benchmarks:
  1. Install the corresponding version of Apache MXNet via pip.
  2. Install GluonCV via pip (or pip3): pip install gluoncv
  3. Find the scripts and steps at https://github.com/intel/optimized-models/tree/master/mxnet/blog/mxnet_v1.5_release to quantize the models and reproduce the CNN benchmarks on a C5.24xlarge or C5.12xlarge instance.
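
For readers who want to sanity-check numbers independently of the linked scripts, a generic timing harness along the following lines shows how per-batch latency and throughput relate. It is a minimal sketch, not the method used in the repository above: the function names are hypothetical, and fake_forward is a stand-in for a real model's forward pass.

```python
import time


def benchmark(run_batch, batch_size, warmup=5, iterations=20):
    """Time a callable that processes one batch; return (latency_ms, samples_per_sec)."""
    if batch_size <= 0 or iterations <= 0:
        raise ValueError("batch_size and iterations must be positive")
    for _ in range(warmup):
        run_batch()  # warm-up runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iterations       # average time per batch
    throughput = batch_size * iterations / elapsed   # samples processed per second
    return latency_ms, throughput


def fake_forward():
    """Placeholder workload standing in for one forward pass (assumption)."""
    sum(i * i for i in range(10_000))


latency_ms, throughput = benchmark(fake_forward, batch_size=32)
print(f"latency: {latency_ms:.3f} ms/batch, throughput: {throughput:.1f} samples/s")
```

Because MXNet executes operators asynchronously, a real run_batch would need to block until the output is ready (for example with wait_to_read() on the output NDArray or mx.nd.waitall()) before the timer stops; otherwise the harness measures only the time to enqueue the work.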

Many thanks to my colleagues Yixin Bao, Ying Guo, Zhiyuan Huang, Eric Lin, Wei Li, Zixuan Wei, Pengxin Yuan, Lujia Yin, and Rong Zhang for their great work on optimizing deep learning frameworks with state-of-the-art acceleration technology on Intel processors. Also, thanks to Emily Hutson, Jin Xu, Ying Hu, Jianyu Zhang, and Zhuowei Si for providing valuable feedback.

Footnotes

Table 2: Detailed Performance Data on Topologies for Image Classification (Measured on C5.24xlarge, Synthetic Dataset)

Batch Size  ResNet-18                                      ResNet-50
            mxnet  mxnet-mkl FP32      mxnet-mkl INT8      mxnet  mxnet-mkl FP32      mxnet-mkl INT8
            FP32   speed-up vs. mxnet  speed-up vs. mxnet  FP32   speed-up vs. mxnet  speed-up vs. mxnet
1           1.00   11.75               26.86               1.00   9.32                26.50
2           1.00   15.20               43.60               1.00   11.87               38.91
4           1.00   19.67               61.10               1.00   16.07               52.83
8           1.00   23.00               76.63               1.00   18.83               65.03
16          1.00   24.80               86.81               1.00   19.92               73.87
32          1.00   25.68               91.61               1.00   19.92               81.63
64          1.00   26.13               95.77               1.00   20.67               82.01

Batch Size  MobileNet v1                                   MobileNet v2
            mxnet  mxnet-mkl FP32      mxnet-mkl INT8      mxnet  mxnet-mkl FP32      mxnet-mkl INT8
            FP32   speed-up vs. mxnet  speed-up vs. mxnet  FP32   speed-up vs. mxnet  speed-up vs. mxnet
1           1.00   19.50               35.07               1.00   13.15               34.14
2           1.00   26.21               55.70               1.00   20.78               61.53
4           1.00   36.11               83.26               1.00   30.00               104.80
8           1.00   46.29               108.58              1.00   39.51               160.60
16          1.00   49.47               133.10              1.00   42.71               205.18
32          1.00   48.44               143.82              1.00   42.06               217.07
64          1.00   47.89               148.83              1.00   42.19               225.92

Batch Size  ResNet-101                                     Squeezenet1.0
            mxnet  mxnet-mkl FP32      mxnet-mkl INT8      mxnet  mxnet-mkl FP32      mxnet-mkl INT8
            FP32   speed-up vs. mxnet  speed-up vs. mxnet  FP32   speed-up vs. mxnet  speed-up vs. mxnet
1           1.00   9.50                22.91               1.00   5.68                18.16
2           1.00   12.42               30.73               1.00   9.01                29.41
4           1.00   16.03               42.36               1.00   13.61               44.14
8           1.00   18.31               58.12               1.00   17.22               58.81
16          1.00   19.56               71.00               1.00   18.54               67.53
32          1.00   20.22               80.41               1.00   17.32               70.43
64          1.00   20.73               82.89               1.00   16.63               69.44

Batch Size  Inception v3                                   ResNet-152 v2
            mxnet  mxnet-mkl FP32      mxnet-mkl INT8      mxnet  mxnet-mkl FP32      mxnet-mkl INT8
            FP32   speed-up vs. mxnet  speed-up vs. mxnet  FP32   speed-up vs. mxnet  speed-up vs. mxnet
1           1.00   11.25               25.70               1.00   7.88                9.86
2           1.00   13.59               41.41               1.00   9.55                12.02
4           1.00   19.26               57.45               1.00   11.94               15.46
8           1.00   23.50               76.14               1.00   14.09               18.01
16          1.00   25.30               92.72               1.00   15.66               21.02
32          1.00   25.54               103.10              1.00   16.12               21.43
64          1.00   25.40               108.13              1.00   16.57               19.53

Table 3: Detailed Performance Data for RNN (Measured on C5.24xlarge, Synthetic Dataset)

Shape of LSTM (N, T, C)                 Layer=8                   Layer=4
                                        MXNet 1.5  MXNet-mkl 1.5  MXNet 1.5  MXNet-mkl 1.5
LSTM Latency (ms) [1, 50, 512, 512]     1.00       7.82           1.00       10.49
LSTM Latency (ms) [1, 50, 1024, 1024]   1.00       13.21          1.00       13.07
LSTM Throughput [32, 50, 512, 512]      1.00       15.57          1.00       14.99
LSTM Throughput [32, 50, 1024, 1024]    1.00       10.71          1.00       10.72

Notices and Disclaimers