Apache* MXNet* (Incubating) Gets a Lift with Intel® DL Boost

MaryT_Intel · 12-20-2019

The Apache MXNet (Incubating) community recently announced the v1.5.0 release of the Apache MXNet* (Incubating) deep learning framework. This version of Apache MXNet (Incubating) introduces a series of new features and optimizations targeting CPU backends, including:

- Subgraph API and operator fusion
- Model quantization and lower-precision (INT8) inference
- Integration of the new fused-RNN kernels
- Support for Horovod*-based distributed training

Lower-precision (INT8) inference performance benefits from the Intel® Deep Learning Boost (Intel® DL Boost) functionality on the recently launched 2nd Generation Intel® Xeon® Scalable processors. Intel DL Boost’s Vector Neural Network Instructions (VNNI) improve AI performance by combining three instructions into one, and they can be accessed through the three new Amazon EC2 C5 instances: C5.12xlarge, C5.24xlarge and C5.metal. Refer to “AWS Launches New Amazon EC2 C5 Instances Featuring Intel® DL Boost” for more details on how to get started.
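
As a quick check before benchmarking, the sketch below tests whether a Linux machine reports the VNNI feature flag. It assumes the kernel exposes the flag as avx512_vnni in /proc/cpuinfo; the helper names cpu_flags and has_vnni are illustrative and not part of MXNet.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_vnni(cpuinfo_text):
    """Return True if the AVX-512 VNNI flag (assumed name: avx512_vnni) is listed."""
    return "avx512_vnni" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("VNNI available:", has_vnni(f.read()))
    except OSError:
        # /proc/cpuinfo only exists on Linux.
        print("/proc/cpuinfo is not readable on this system")
```

If the flag is missing, INT8 inference may still run, but it would not use the VNNI instructions described above.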

Performance Improvement on Computer Vision Models

Both inference throughput and latency improve significantly when operator fusion and model quantization are used with the Apache MXNet (Incubating) v1.5.0 build optimized for a CPU backend (pip install mxnet-mkl). Taking ResNet50 as an example on an AWS EC2 C5.24xlarge instance, the CPU-optimized 1.5.0 build delivers FP32 inference gains of ~20x in throughput and ~9x in latency over the native 1.5.0 CPU build (pip install mxnet). With lower precision (INT8), the gains are ~82x in throughput and ~26x in latency. ^[1] Figures 1 and 2 compare the inference throughput and latency of popular neural network topologies for image classification. ^[2]

^[1] Figure 1. Inference Throughput Speed-up on Topologies for Image Classification

^[2] Figure 2. Inference Latency Comparison on Topologies for Image Classification
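
To make the idea of INT8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic illustration only, not the scheme that MXNet or Intel MKL-DNN applies; the function names and the sample values are made up for this example.

```python
def quantize_int8(values):
    """Map floats to the int8 range [-127, 127] with one symmetric scale."""
    max_abs = max(abs(v) for v in values)
    # Assumption: a single scale for the whole tensor, chosen from its max magnitude.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v * scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]  # illustrative values, not real model weights
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print("int8 values:", q)
    print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each value is stored in 8 bits instead of 32, and the round-trip error stays within half a quantization step, which is why INT8 inference can trade a small amount of precision for higher throughput.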

Performance Improvement on Recurrent Neural Networks (RNNs)

In the Apache MXNet (Incubating) v1.5.0 build optimized for a CPU backend, the integration of the fused RNN kernels provided by the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) significantly improves inference throughput and latency for long short-term memory (LSTM) and vanilla RNN layers. A selection of input shapes for the LSTM layers used by GNMT serves as the example to show the performance improvements.

Figures 3 and 4 illustrate the inference performance comparison on RNN. ^[3]

^[3] Figure 3 RNN Inference Throughput Comparison

Figure 4 RNN Inference Latency Comparison

Horovod and Intel® MPI Library Integration

Apache MXNet (Incubating) v1.5.0 enables distributed training using the Horovod distributed training framework and the Intel® MPI Library. Multi-node training efficiency on a CPU backend improves through more efficient use of network bandwidth and better scaling of the deep learning models.

When training ResNet-50 v1 with ImageNet-1K on multi-node CPU, a 16-node cluster of Intel® Xeon® Gold 6148 processors achieves 92.86% scalability without any compromise of accuracy. Training converges after 90 epochs, with the top-1 training and validation accuracy reaching 0.7761 and 0.7595, respectively.

Table 1 lists the training throughput and scalability results on different nodes/instances of an Intel Xeon Gold 6148 processor.

Node Number	Instance Number	Training Throughput Speed-up	FP32 Scaling
1	1	1.00	100.00%
1	2	1.98	98.86%
2	4	3.73	93.27%
4	8	7.46	93.31%
8	16	14.82	92.65%
16	32	29.75	92.98%

Table 1: Multi-node Training Scalability on ResNet50 v1
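
The scaling column in Table 1 can be approximated from the other columns. The sketch below assumes that scaling is the training throughput gain divided by the number of instances; the source does not state this definition explicitly, but the table values agree with it to within rounding.

```python
# Rows copied from Table 1:
# (nodes, instances, throughput gain over one instance, reported FP32 scaling in %)
TABLE_1 = [
    (1, 1, 1.00, 100.00),
    (1, 2, 1.98, 98.86),
    (2, 4, 3.73, 93.27),
    (4, 8, 7.46, 93.31),
    (8, 16, 14.82, 92.65),
    (16, 32, 29.75, 92.98),
]


def scaling_efficiency(throughput_gain, instances):
    """Assumed definition: percentage of ideal linear speed-up that is achieved."""
    return 100.0 * throughput_gain / instances


if __name__ == "__main__":
    for nodes, instances, gain, reported in TABLE_1:
        computed = scaling_efficiency(gain, instances)
        print(f"{nodes:>2} nodes, {instances:>2} instances: "
              f"computed {computed:.2f}%, reported {reported:.2f}%")
```

Small differences between the computed and reported percentages are expected because the throughput gains in the table are rounded to two decimals.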

Figure 5 shows that training throughput scales well with up to 32 training instances (16 nodes).

Figure 5 Multi-node Training Throughput and Speed-up with Instance Number.

Figure 6 illustrates the training trends and validation accuracy on 8 and 16 nodes of an Intel Xeon Gold 6148 processor, which shows that distributed training has no impact on convergence and accuracy.

Figure 6 Multi-node Training and Validation Accuracy Trends

Conclusion

Since the first official release of MXNet with an Intel-optimized CPU backend in August 2018, we have continually worked to optimize MXNet to enhance the experience for users. This new version of MXNet includes Intel optimizations for the Subgraph API, operator fusion, RNN performance and quantization. We encourage you to get started with the latest version of MXNet and follow us on @IntelAIResearch for more news and updates.

Appendix

1) CNN throughput and latency are measured with synthetic data on a C5.24xlarge instance. Two benchmark instances run simultaneously, bound to NUMA node 0 and NUMA node 1 respectively via the numactl utility. The throughput and latency are calculated as:
- Throughput = Throughput instance1 + Throughput instance2
- Latency = (Latency instance1 + Latency instance2) / 2
2) RNN throughput and latency are measured with synthetic data on a C5.24xlarge instance. Two benchmark instances run simultaneously, bound to NUMA node 0 and NUMA node 1 respectively via the numactl utility. The calculations follow the same equations as for the CNN topologies described in Note 1).
3) Single-instance CNN and RNN throughput and latency can be measured on a C5.12xlarge instance. The C5.12xlarge instance has one NUMA node and half the number of physical cores of the C5.24xlarge instance, which has two NUMA nodes (two CPU sockets).
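
The measurement setup in these notes can be sketched as follows, reading the latency formula as the mean of the two instances. The script name benchmark.py is a placeholder, the use of --membind alongside --cpunodebind is an assumption (the source only says the instances are bound via numactl), and the sample numbers are illustrative rather than measured.

```python
def numa_bound_command(node, command):
    """Prefix a command so numactl pins it to one NUMA node.

    Binding memory as well as cores is an assumption of this sketch.
    """
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + list(command)


def combined_throughput(per_instance):
    """Total throughput is the sum over the simultaneously running instances."""
    return sum(per_instance)


def combined_latency(per_instance):
    """Reported latency is the average over the instances."""
    return sum(per_instance) / len(per_instance)


if __name__ == "__main__":
    # benchmark.py is a hypothetical script name used only for illustration.
    for node in (0, 1):
        print(" ".join(numa_bound_command(node, ["python", "benchmark.py"])))
    print("throughput:", combined_throughput([100.0, 120.0]))  # illustrative numbers
    print("latency:", combined_latency([4.0, 6.0]))            # illustrative numbers
```

Launching both commands at the same time, for example with subprocess.Popen, would reproduce the two-instance arrangement described for the C5.24xlarge.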

Step-by-Step Data Reproduction:

MXNet Installation

The native Apache MXNet (Incubating) build and the build optimized for a CPU backend can each be installed with pip (or pip3), for example:

pip install mxnet==1.5.0
pip install mxnet-mkl==1.5.0
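
Because both packages provide the same importable mxnet module, it can help to confirm which one is present before benchmarking. The sketch below uses the standard library importlib.metadata (Python 3.8 or later) to look up the installed distributions; the function name installed_mxnet_builds is illustrative.

```python
from importlib import metadata


def installed_mxnet_builds(names=("mxnet", "mxnet-mkl")):
    """Return a mapping of distribution name to installed version (or None)."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_mxnet_builds().items():
        print(f"{name}: {version if version else 'not installed'}")
```

For the comparisons in this post, exactly one of the two entries would be expected to show 1.5.0 in a given environment.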

Running CNN Benchmarks:

Install the corresponding version of Apache MXNet (Incubating) via pip.
Install GluonCV via pip install gluoncv.
Find the scripts and steps at https://github.com/intel/optimized-models/tree/master/mxnet/blog/mxnet_v1.5_release to quantize the models and reproduce the CNN benchmarks on a C5.24xlarge or C5.12xlarge instance.

Many thanks to my colleagues Yixin Bao, Ying Guo, Zhiyuan Huang, Eric Lin, Wei Li, Zixuan Wei, Pengxin Yuan, Lujia Yin, Rong Zhang for their great work on optimizing deep learning frameworks with the state-of-the-art accelerating technology on Intel processors. Also, thanks to Emily Hutson, Jin Xu, Ying Hu, Jianyu Zhang and Zhuowei Si for providing valuable feedback.

Footnotes

^[1] Performance results are based on testing as of 24th July 2019 by AWS and may not reflect all publicly available security updates. No product or component can be absolutely secure. Test Configuration: Reproduce Script: https://github.com/intel/optimized-models/tree/master/mxnet/blog/mxnet_v1.5_release Software: Apache MXNet (Incubating) 1.5.0 and benchmark script commit id ad4b7570b04ec13df739bd336f282bdaca690df7 Hardware: AWS EC2 C5.24xlarge Custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake) with a sustained all core Turbo frequency of 3.6GHz and single core turbo frequency of up to 3.9GHz.
^[2] Table 2 lists the detailed inference throughput and latency data corresponding to Figures 1 and 2. The numbers are relative gains over the “mxnet FP32” column, which is set as the baseline. Synthetic data is used as the dataset.
^[3] Table 3 lists the detailed inference throughput and latency data corresponding to Figures 3 and 4. The numbers are relative gains over the “MXNet 1.5” column, which is set as the baseline. Synthetic data is used as the dataset.

Table 2: Detailed Performance data on Topologies for Image Classification (Measured on C5.24xlarge, Synthetic Dataset)

Batch Size	ResNet-18 mxnet FP32 (baseline)	ResNet-18 mxnet-mkl FP32 speed-up vs. mxnet	ResNet-18 mxnet-mkl INT8 speed-up vs. mxnet	ResNet-50 mxnet FP32 (baseline)	ResNet-50 mxnet-mkl FP32 speed-up vs. mxnet	ResNet-50 mxnet-mkl INT8 speed-up vs. mxnet
1	1.00	11.75	26.86	1.00	9.32	26.50
2	1.00	15.20	43.60	1.00	11.87	38.91
4	1.00	19.67	61.10	1.00	16.07	52.83
8	1.00	23.00	76.63	1.00	18.83	65.03
16	1.00	24.80	86.81	1.00	19.92	73.87
32	1.00	25.68	91.61	1.00	19.92	81.63
64	1.00	26.13	95.77	1.00	20.67	82.01

Batch Size	MobileNet v1 mxnet FP32 (baseline)	MobileNet v1 mxnet-mkl FP32 speed-up vs. mxnet	MobileNet v1 mxnet-mkl INT8 speed-up vs. mxnet	MobileNet v2 mxnet FP32 (baseline)	MobileNet v2 mxnet-mkl FP32 speed-up vs. mxnet	MobileNet v2 mxnet-mkl INT8 speed-up vs. mxnet
1	1.00	19.50	35.07	1.00	13.15	34.14
2	1.00	26.21	55.70	1.00	20.78	61.53
4	1.00	36.11	83.26	1.00	30.00	104.80
8	1.00	46.29	108.58	1.00	39.51	160.60
16	1.00	49.47	133.10	1.00	42.71	205.18
32	1.00	48.44	143.82	1.00	42.06	217.07
64	1.00	47.89	148.83	1.00	42.19	225.92

Batch Size	ResNet-101 mxnet FP32 (baseline)	ResNet-101 mxnet-mkl FP32 speed-up vs. mxnet	ResNet-101 mxnet-mkl INT8 speed-up vs. mxnet	SqueezeNet 1.0 mxnet FP32 (baseline)	SqueezeNet 1.0 mxnet-mkl FP32 speed-up vs. mxnet	SqueezeNet 1.0 mxnet-mkl INT8 speed-up vs. mxnet
1	1.00	9.50	22.91	1.00	5.68	18.16
2	1.00	12.42	30.73	1.00	9.01	29.41
4	1.00	16.03	42.36	1.00	13.61	44.14
8	1.00	18.31	58.12	1.00	17.22	58.81
16	1.00	19.56	71.00	1.00	18.54	67.53
32	1.00	20.22	80.41	1.00	17.32	70.43
64	1.00	20.73	82.89	1.00	16.63	69.44

Batch Size	Inception v3 mxnet FP32 (baseline)	Inception v3 mxnet-mkl FP32 speed-up vs. mxnet	Inception v3 mxnet-mkl INT8 speed-up vs. mxnet	ResNet-152 v2 mxnet FP32 (baseline)	ResNet-152 v2 mxnet-mkl FP32 speed-up vs. mxnet	ResNet-152 v2 mxnet-mkl INT8 speed-up vs. mxnet
1	1.00	11.25	25.70	1.00	7.88	9.86
2	1.00	13.59	41.41	1.00	9.55	12.02
4	1.00	19.26	57.45	1.00	11.94	15.46
8	1.00	23.50	76.14	1.00	14.09	18.01
16	1.00	25.30	92.72	1.00	15.66	21.02
32	1.00	25.54	103.10	1.00	16.12	21.43
64	1.00	25.40	108.13	1.00	16.57	19.53

Table 3: Detailed Performance Data for RNN (Measured on C5.24xlarge, Synthetic Dataset)

Shape of LSTM (N, T, C)	Layer=8 MXNet 1.5 (baseline)	Layer=8 MXNet-mkl 1.5	Layer=4 MXNet 1.5 (baseline)	Layer=4 MXNet-mkl 1.5
LSTM Latency (ms) [1, 50, 512, 512]	1.00	7.82	1.00	10.49
LSTM Latency (ms) [1, 50, 1024, 1024]	1.00	13.21	1.00	13.07
LSTM Throughput [32, 50, 512, 512]	1.00	15.57	1.00	14.99
LSTM Throughput [32, 50, 1024, 1024]	1.00	10.71	1.00	10.72

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing by AWS as of 24th July 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure. Test Configuration: Reproduce Script: https://github.com/intel/optimized-models/tree/master/mxnet/blog/mxnet_v1.5_release Software: Apache MXNet (Incubating) 1.5.0 and benchmark script commit id ad4b7570b04ec13df739bd336f282bdaca690df7 Hardware: AWS EC2 C5.24xlarge Custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake) with a sustained all core Turbo frequency of 3.6GHz and single core turbo frequency of up to 3.9GHz.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel does not control or audit third-party data. You should review this content, consult other sources, and confirm whether referenced data are accurate.

Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation