MLPerf Results Validate CPUs for Deep Learning Training

I have worked on optimizing and benchmarking computer performance for more than two decades, on platforms ranging from supercomputers and database servers to mobile devices. It is always fun to highlight performance results for the product you are building and compare them with others in the industry. SPEC*, LINPACK*, and TPC* have become familiar names to many of us. Now, MLPerf* is filling the void in benchmarking for machine learning.

I am excited to see the Intel® Xeon® Scalable processor MLPerf results submitted by our team because we work on both the user side and the computer system development side of deep learning. These results show that Intel® Xeon® Scalable processors have surpassed a performance threshold where they can be an effective option for data scientists looking to run multiple workloads on their infrastructure without investing in dedicated hardware[6,7,8].

Back in 2015, I had a team working on mobile devices. We had to hire testers to manually play mobile games. It was fun for the testers at first, but it soon became boring and costly; one tester we hired quit on the same day they started. So our team built a robot to test mobile games and adopted deep learning to do it. Our game-testing robot played games automatically and found more bugs than the human testers did. We wanted to train the neural networks on the machines we already had in the lab, but they were not fast enough, so I had to allocate budget for the team to buy a GPU, an older version than the MLPerf reference GPU[9].

Today, CPUs are capable of deep learning training as well as inference. Our MLPerf Intel® Xeon® Scalable processor results compare well with the MLPerf reference GPU[9] on a variety of MLPerf deep learning training workloads[6,7,8]. For example, the single-system, two-socket Intel Xeon Scalable processor results submitted by Intel achieved a score of 0.85 on the MLPerf Image Classification benchmark (Resnet-50)[6], 1.6 on the Recommendation benchmark (Neural Collaborative Filtering, NCF)[7], and 6.3 on the Reinforcement Learning benchmark (mini GO)[8]. In all of these scores, 1.0 is defined as the score of the reference implementation on the reference GPU[9]. All of the preceding results use FP32, the numerical precision commonly used in the market today. From these MLPerf results, we can see that our game-testing robot could easily train on Intel Xeon Scalable processors today.
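To make the scoring convention concrete, here is a minimal sketch in Python of how such a relative score is derived, assuming the convention described above: the reference implementation on the reference GPU[9] defines 1.0, and a submission's score is its speedup over that reference. The times-to-train below are hypothetical placeholders chosen only to reproduce the 0.85 and 1.6 ratios, not measured values.

    # Minimal sketch: an MLPerf v0.5 score expressed as a speedup over the
    # reference implementation on the reference GPU (score 1.0 by definition).
    # The minutes below are hypothetical placeholders, not measured values.
    def mlperf_score(reference_minutes, submission_minutes):
        """Return the submission's score relative to the reference run."""
        return reference_minutes / submission_minutes

    # Hypothetical example: a benchmark whose reference run takes 100 minutes.
    print(round(mlperf_score(100.0, 117.6), 2))  # 0.85 -> slightly slower than the reference
    print(round(mlperf_score(100.0, 62.5), 2))   # 1.6  -> faster than the reference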

The deep learning and machine learning world continues to evolve from image processing using Convolutional Neural Networks (CNN) and natural language processing using Recurrent Neural Networks (RNN) to recommendation systems using MLP layers and general matrix multiply, reinforcement learning (mixing CNN and simulation) and hybrid models mixing deep learning and classical machine learning. A general purpose CPU is very adaptable to this dynamically changing environment, in addition to running existing non-DL workloads.

Enterprises have adopted CPUs for deep learning training. For example, today Datatonic* published a blog showing up to 11x cost savings and a 57 percent performance improvement when running a neural network recommender system, used in production by a top-5 UK retailer, on a Google Cloud* VM powered by Intel Xeon Scalable processors[5]. CPUs can also accommodate the large-memory models required in many domains. The pharmaceutical company Novartis used Intel Xeon Scalable processors to accelerate training of a multiscale convolutional neural network (M-CNN) on 10,000 high-content cellular microscopy images, which are much larger than typical ImageNet* images, reducing time to train from 11 hours to 31 minutes[1].

HPC customers use Intel Xeon processors for distributed training as well, as showcased at Supercomputing 2018. For instance, GENCI/CINES/INRIA trained a plant classification model for 300K species on a 1.5TByte dataset of 12 million images using 128 two-socket Intel Xeon processor-based systems[2]. Dell* EMC* and SURFsara used Intel Xeon processors to reduce training time to 11 minutes for a DenseNet-121 model[3]. CERN* showcased distributed training using 128 nodes of the TACC Stampede 2 cluster (Intel® Xeon® Platinum 8160 processor, Intel® OPA) with a 3D Generative Adversarial Network (3D GAN), achieving 94% scaling efficiency[4]. Additional examples can be found at https://software.intel.com/en-us/articles/intel-processors-for-deep-learning-training.

CPU hardware and software performance for deep learning has increased by a few orders of magnitude in the past few years. Training that used to take days or even weeks can now be done in hours or even minutes. This level of performance improvement was achieved through a combination of hardware and software. For example, current-generation Intel Xeon Scalable processors added the AVX-512 instruction set (longer vector extensions), which allows a large number of operations to be performed in parallel, along with a larger number of cores, essentially becoming a mini-supercomputer. The next-generation Intel® Xeon® Scalable processor (Cascade Lake), coming in the first half of 2019, adds Intel® Deep Learning Boost: higher-throughput, lower-numerical-precision instructions to boost deep learning inference. On the software side, the performance difference between the baseline open source deep learning software and the Intel-optimized software can be up to 275x[10] on the same Intel® Xeon® Scalable processor (as illustrated in a demo I showed at the Intel Architecture Day forum yesterday).

Over the past few years, Intel has worked with DL framework developers to optimize many popular open source frameworks, such as TensorFlow*, Caffe*, MXNet*, PyTorch*/Caffe2*, PaddlePaddle*, and Chainer*, for Intel processors. Intel has also designed its own framework, BigDL, for Apache Spark*, and the Intel® Deep Learning Deployment Toolkit for inference. Since the core computation is linear algebra, we created a new math library specifically for deep learning, the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), based on many years of experience with the Intel® Math Kernel Library (MKL) for high-performance computing. The integration of Intel MKL-DNN into the frameworks, together with the additional optimizations contributed to the frameworks to fully utilize the underlying hardware capabilities, is the key reason for the huge software performance improvement.
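For readers who want to try this kind of CPU training setup themselves, most of the benefit comes from using an Intel-optimized framework build plus a handful of threading settings that map the framework's thread pools onto the physical cores. Below is a minimal sketch for TensorFlow* 1.x; the environment variables and thread counts mirror the ones listed in the configuration details at the end of this post, but the exact values are illustrative assumptions that should be tuned for each system.

    import os

    # OpenMP / Intel MKL-DNN threading knobs (illustrative values; tune per system).
    os.environ["OMP_NUM_THREADS"] = "24"                          # roughly one thread per physical core
    os.environ["KMP_BLOCKTIME"] = "1"                             # let worker threads sleep soon after an op finishes
    os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"   # pin threads to cores

    import tensorflow as tf  # a TensorFlow 1.x build with Intel MKL-DNN optimizations

    config = tf.ConfigProto(
        intra_op_parallelism_threads=24,  # threads working inside a single op (e.g., one convolution)
        inter_op_parallelism_threads=2,   # independent ops allowed to run concurrently
    )

    with tf.Session(config=config) as sess:
        # build and run the training graph as usual
        pass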

I’ve often been asked whether CPUs are faster or slower than accelerators. Of course, accelerators have certain advantages; for a specific domain, if an accelerator is not generally faster than a CPU, then it is not much of an accelerator. Even so, given the increasing variety of deep learning workloads, in some cases a CPU may be as fast or faster, while retaining the flexibility that is core to the CPU value proposition. Thus, the more pertinent question is whether CPUs can run deep learning well enough to be an effective option for customers that don’t wish to invest in accelerators. These initial MLPerf results[6,7,8], as well as our customer examples, show that CPUs can indeed be used effectively for training. Intel’s strategy is to offer both general purpose CPUs and accelerators to meet the machine learning needs of a wide range of customers.

Looking forward, we are continuing to add new AI and deep learning features to our future generations of CPUs, like Intel Deep Learning Boost, plus bfloat16 for training, as well as additional software optimizations. Please stay tuned. For more information on Intel software optimizations, see ai.intel.com/framework-optimizations.  For more information on Intel Xeon Scalable processors, see intel.com/xeonscalable.

 Disclaimers and Configuration Details:
(1) Novartis: Measured May 25th 2018. Based on speedup for 8 nodes relative to a single node.  Node configuration:  CPU: Intel® Xeon® Gold 6148 @ 2.4GHz, 192GB memory, Hyper-threading: Enabled. NIC: Intel® Omni-Path Host Fabric Interface, TensorFlow: v1.7.0, Horovod: 0.12.1, OpenMPI: 3.0.0. OS: CentOS 7.3, OpenMPU 23.0.0, Python 2.7.5. Time to Train to converge to 99% accuracy in model. Source: https://newsroom.intel.com/news/using-deep-neural-network-acceleration-image-analysis-drug-discovery
(2) GENCI: Occigen: 3306 nodes x 2 Intel® Xeon® processors (12-14 cores). Compute Nodes: 2-socket Intel® Xeon® processor with 12 cores each @ 2.70GHz, for a total of 24 cores per node, 2 threads per core, 96 GB of DDR4, Mellanox InfiniBand Fabric Interface, dual-rail. Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview, OFI 1.5.0, PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7. Caffe*: Intel® Optimization for Caffe*: https://github.com/intel/caffe  Intel MLSL: https://github.com/intel/MLSL  Dataset: Pl@ntNet: CINES/GENCI internal dataset. Performance results are based on testing as of 10/15/2018.
(3) Intel, Dell and SURFsara collaboration: Measured 5/17/2018 on 256 nodes of 2-socket Intel® Xeon® Gold 6148 processor. Compute Nodes: 2-socket Intel® Xeon® Gold 6148F processor with 20 cores each @ 2.40GHz, for a total of 40 cores per node, 2 threads per core, L1d cache 32K, L1i cache 32K, L2 cache 1024K, L3 cache 33792K, 96 GB of DDR4, Intel® Omni-Path Host Fabric Interface, dual-rail. Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview, OFI 1.5.0, PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7. TensorFlow* 1.6: built and installed from source: https://www.tensorflow.org/install/install_sources  ResNet-50 Model: topology specs from https://github.com/tensorflow/tpu/tree/master/models/official/resnet  DenseNet-121 Model: topology specs from https://github.com/liuzhuang13/DenseNet  Convergence & Performance Model: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs  Dataset: ImageNet2012-1K: http://www.image-net.org/challenges/LSVRC/2012/  ChexNet*: https://stanfordmlgroup.github.io/projects/chexnet/  Performance measured with: OMP_NUM_THREADS=24 HOROVOD_FUSION_THRESHOLD=134217728 export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2 \ mpirun -np 512 -ppn 2 python resnet_main.py --train_batch_size 8192 --train_steps 14075 --num_intra_threads 24 --num_inter_threads 2 --mkl=True --data_dir=/scratch/04611/valeriuc/tf-1.6/tpu_rec/train --model_dir model_batch_8k_90ep --use_tpu=False --kmp_blocktime 1. https://ai.intel.com/diagnosing-lung-disease-using-deep-learning/
(4) CERN: Measured 5/17/2018 on Stampede2/TACC: https://portal.tacc.utexas.edu/user-guides/stampede2  Compute nodes: 2-socket Intel® Xeon® Platinum 8160 processor with 24 cores each @ 2.10GHz, for a total of 48 cores per node, 2 threads per core, L1d cache 32K, L1i cache 32K, L2 cache 1024K, L3 cache 33792K, 96 GB of DDR4, Intel® Omni-Path Host Fabric Interface, dual-rail. Software: Intel® MPI Library 2017 Update 4, Intel® MPI Library 2019 Technical Preview, OFI 1.5.0, PSM2 w/ Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7. TensorFlow* 1.6: built and installed from source: https://www.tensorflow.org/install/install_sources  Model: CERN* 3D GANs from https://github.com/sara-nl/3Dgan/tree/tf  Dataset: CERN* 3D GANs from https://github.com/sara-nl/3Dgan/tree/tf  Performance measured on 256 nodes with: OMP_NUM_THREADS=24 HOROVOD_FUSION_THRESHOLD=134217728 export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2 \ mpirun -np 512 -ppn 2 python resnet_main.py --train_batch_size 8 \ --num_intra_threads 24 --num_inter_threads 2 --mkl=True \ --data_dir=/path/to/gans_script.py --kmp_blocktime 1.  https://www.rdmag.com/article/2018/11/imagining-unthinkable-simulations-without-classical-monte-carlo
(5) Datatonic: see https://datatonic.com/insights/accelerate-machine-learning-on-google-cloud-with-intel-xeon-processors/
(6) Score of 0.85 on the MLPerf Image Classification benchmark (Resnet-50) 0.85X  over the MLPerf baseline(+) using a 2 chip count Intel® Xeon® Platinum 8180. MLPerf v0.5 training Closed division; system employed Intel® Optimization for Caffe* 1.1.2a with the Intel® MKL-DNN v0.16 library. Retrieved from www.mlperf.org 12 December 2018, entry 0.5.6.1. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
(7) Score of 1.6 on the Recommendation benchmark (Neural Collaborative Filtering NCF) 1.6X over the MLPerf baseline(+) using a 2 chip count Intel® Xeon® Platinum 8180. MLPerf v0.5 training Closed division; system employed Framework BigDL 0.7.0. Retrieved from www.mlperf.org 12 December 2018, entry 0.5.9.6. MLPerf name and logo are trademarks. See www.mlperf.org for more information.  
(8) Score of 6.3 on Reinforcement Learning benchmark (mini GO) 6.3X over the MLPerf baseline(+) using a 2 chip count Intel® Xeon® Platinum 8180. MLPerf v0.5 training Closed division; system employed TensorFlow 1.10.1 with the Intel® MKL-DNN v0.14 library. Retrieved from www.mlperf.org 12 December 2018, entry 0.5.10.7. MLPerf name and logo are trademarks. See www.mlperf.org for more information.
(+) MLPerf Baseline (adopted from MLPerf v0.5 Community Press Briefing): MLPerf Training v0.5 is a benchmark suite for measuring ML system speed. Each MLPerf Training benchmark is defined by a Dataset and Quality Target. MLPerf Training also provides a reference implementation for each benchmark that uses a specific model. The following table summarizes the seven benchmarks in version v0.5 of the suite.
Benchmark | Dataset | Quality Target | Reference Implementation Model
Image classification | ImageNet | 74.90% classification | Resnet-50 v1.5
Object detection (lightweight) | COCO 2017 | 21.2% mAP | SSD (Resnet-34 backbone)
Object detection (heavyweight) | COCO 2017 | 0.377 Box min AP, 0.339 Mask min AP | Mask R-CNN
Translation (recurrent) | WMT English-German | 21.8 BLEU | Neural Machine Translation
Translation (non-recurrent) | WMT English-German | 25.0 BLEU | Transformer
Recommendation | MovieLens-20M | 0.635 HR@10 | Neural Collaborative Filtering
Reinforcement learning | Pro games | 40.00% move prediction | Mini Go

MLPerf training rules: https://github.com/mlperf/policies/blob/master/training_rules.adoc
(9) MLPerf* reference system: Google Cloud Platform configuration: 16 vCPUs, Intel Skylake or later, 60 GB RAM (n1-standard-16), 1 NVIDIA* Tesla* P100 GPU, CUDA* 9.1 (9.0 for TensorFlow*), nvidia-docker2, Ubuntu* 16.04 LTS, Preemptibility: off, Automatic restart: off, 30 GB boot disk + 1 SSD persistent disk of 500 GB, docker* image: 9.1-cudnn7-runtime-ubuntu16.04 (9.0-cudnn7-devel-ubuntu16.04 for TensorFlow*)
(10) 275X Inference throughput performance improvement with Intel® Optimization for Caffe* compared to BVLC-Caffe*: Intel measured on 12/11/2018. 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT ON, turbo ON, 192GB total memory (12 slots * 16 GB, Micron 2666MHz), Intel® SSD SSDSC2KF5, Ubuntu 16.04 Kernel 4.15.0-42.generic; BIOS: SE5C620.86B.00.01.0009.101920170742 (microcode: 0x0200004d); Topology: Resnet-50 Baseline: FP32, BVLC Caffe* (https://github.com/BVLC/caffe.git) commit 99bd99795dcdf0b1d3086a8d67ab1782a8a08383 Current Performance: INT8, Intel® Optimizations for Caffe* (https://github.com/Intel/caffe.git) commit: Caffe* commit: e94b3ff41012668ac77afea7eda89f07fa360adf, MKLDNN commit: 4e333787e0d66a1dca1218e99a891d493dbc8ef1
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.  For more information go to www.intel.com/benchmarks.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Performance results may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
Intel, the Intel logo, Xeon Scalable processors, and Deep Learning Boost are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others. 
© Intel Corporation.