TensorFlow* is one of the leading deep learning and machine learning frameworks today. Earlier in 2017, Intel worked with Google to incorporate optimizations for Intel® Xeon® and Intel® Xeon Phi™ processor-based platforms using the Intel® Math Kernel Library (Intel® MKL). These optimizations improved performance by more than an order of magnitude in the best cases: up to 70x higher performance for training and up to 85x higher performance for inference.
In this blog we provide a performance update for a number of deep learning models running on the Intel Xeon Scalable processor. The Intel Xeon Scalable processor provides up to 28 cores, compared to the 22 cores of its predecessor. Additional improvements include a non-inclusive last-level cache, a larger 1MB L2 cache, faster 2666 MHz DDR4 memory, and an increase to six memory channels per CPU. In addition, the Intel Xeon Scalable processor includes Intel® Advanced Vector Extensions 512 (Intel® AVX-512), originally introduced with the Intel® Xeon Phi™ processor product line. The Intel Xeon Scalable processor introduces new Intel AVX-512 CPUID flags (AVX512BW and AVX512DQ) as well as a new capability (AVX512VL) to expand the benefits of the technology. The AVX512DQ CPUID flag covers additions aimed at high-performance computing (HPC) and machine learning workloads.
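To confirm that a given machine exposes the AVX-512 subsets named above, one option on Linux is to look at the "flags" line of /proc/cpuinfo. The following is a minimal sketch, not part of the original benchmarking setup: the helper name avx512_support and the sample text are illustrative, and the lowercase flag spellings (avx512f, avx512bw, avx512dq, avx512vl) are the ones the Linux kernel reports.

```python
# Flag names as they appear in the "flags" line of Linux /proc/cpuinfo.
AVX512_FLAGS = ("avx512f", "avx512bw", "avx512dq", "avx512vl")


def avx512_support(cpuinfo_text):
    """Return {flag: bool} for each AVX-512 subset found in cpuinfo text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present.update(value.split())
    return {flag: flag in present for flag in AVX512_FLAGS}


# Hypothetical, shortened sample; a real "flags" line is much longer.
SAMPLE = "model name\t: example cpu\nflags\t\t: fpu sse2 avx2 avx512f avx512dq avx512bw avx512vl\n"

for flag, supported in avx512_support(SAMPLE).items():
    print(flag, "yes" if supported else "no")
```

On a real system, pass the contents of /proc/cpuinfo instead of SAMPLE, for example `avx512_support(open("/proc/cpuinfo").read())`.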
The optimizations discussed in this article utilize the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). This is an open source performance library for deep learning applications, intended to accelerate deep learning frameworks on Intel® architecture. Intel MKL-DNN includes highly vectorized and threaded building blocks for implementing convolutional neural networks, with C and C++ interfaces. Note that TensorFlow currently supports the open-sourced Intel MKL-DNN as well as the DNN primitives in the closed source Intel Math Kernel Library. The version to use is selected when building TensorFlow. It is expected that support for the closed source DNN primitives will be removed from TensorFlow in the future.
Optimizing deep learning model performance on the Intel Xeon Scalable processor relies on several optimization techniques similar to those used in performance-sensitive High Performance Computing (HPC) applications:
Intel® MKL-DNN provides a number of deep learning primitives that are highly optimized for Intel Xeon Scalable processors using the techniques described above. Using these primitives inside various deep learning frameworks helps ensure that common building blocks are implemented efficiently. These include:
In TensorFlow, we implemented optimized versions of TensorFlow operations to make sure that these operations can utilize optimized MKL-DNN primitives for Intel Xeon Scalable CPUs wherever possible. While this is a necessary step to enable scalable performance on Intel® architecture, to get the best performance we implemented several additional optimizations including the following:
The following performance results were obtained for benchmark models from the TensorFlow repository at https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
To get maximum performance we tuned the following parameters specifically for the model and for the processor.
Data format: Set to NCHW format to get maximum performance. (TensorFlow's default NHWC format is not the most efficient data layout for the CPU, and it results in some additional conversion overhead.)
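To make the difference between the two layouts concrete, the sketch below computes where one element lives in a flat buffer under NHWC (channels innermost) and under NCHW (width innermost), and reorders a small buffer from one layout to the other. It is a pure-Python illustration with hypothetical helper names, not how TensorFlow or Intel MKL-DNN perform the conversion; the element-by-element copy is only meant to show why converting between layouts costs extra work.

```python
def nhwc_offset(n, h, w, c, shape):
    """Flat index of element (n, h, w, c) when channels are innermost."""
    _, H, W, C = shape
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n, c, h, w, shape):
    """Flat index of element (n, c, h, w) when width is innermost."""
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


def nhwc_to_nchw(flat, N, H, W, C):
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    out = [None] * len(flat)
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    src = nhwc_offset(n, h, w, c, (N, H, W, C))
                    dst = nchw_offset(n, c, h, w, (N, C, H, W))
                    out[dst] = flat[src]
    return out


# One 2x2 image with 3 channels; values 0..11 stored pixel by pixel (NHWC).
print(nhwc_to_nchw(list(range(12)), 1, 2, 2, 3))
# [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```

In the NCHW result, all values of one channel sit next to each other, whereas in NHWC the channels of a single pixel are adjacent.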
Settings on Intel® Xeon® Scalable processor (2 Sockets, 28 Cores each) that were used for benchmarking.
Please note: The parameter settings were carefully tuned to gain maximum performance for the specific platform.
Performance results for Training on Intel® Xeon® Scalable processor (2 Sockets – 28 Cores each), mock data.
Performance results for Inference on Intel® Xeon® Scalable processor (2 Sockets – 28 Cores each), mock data. Inference performance was measured by running forward pass only.
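For readers who want to try a comparable run, the sketch below assembles a command line for the tf_cnn_benchmarks script referenced earlier. It is an assumption-laden illustration: the flag names (--device, --mkl, --data_format, --model, --batch_size, --forward_only) are given as we understand them from that script and should be checked against its --help output for your checkout, and the model name and batch size shown are placeholders, not the tuned settings used for the results in this article.

```python
def build_benchmark_args(model, batch_size, forward_only=False):
    """Build an argument list for tf_cnn_benchmarks.py (flag names assumed)."""
    args = [
        "python", "tf_cnn_benchmarks.py",
        "--device=cpu",
        "--mkl=True",            # assumed flag for builds with Intel MKL enabled
        "--data_format=NCHW",    # layout recommended above for the CPU
        "--model={}".format(model),
        "--batch_size={}".format(batch_size),
    ]
    if forward_only:
        # Inference is measured by running the forward pass only.
        args.append("--forward_only=True")
    return args


# Placeholder model and batch size, for illustration only.
print(" ".join(build_benchmark_args("resnet50", 64)))
print(" ".join(build_benchmark_args("resnet50", 64, forward_only=True)))
```

The returned list can be passed to `subprocess.run` from the directory that contains the script; when no dataset directory is given, the script falls back to synthetic input, which corresponds to the mock data used here.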
In conclusion, TensorFlow now supports the Intel Xeon Scalable platform through the Intel MKL-DNN open source library. No additional software or configuration is required other than building TensorFlow with specific Intel MKL build settings. We continue to improve the performance of the Intel® Optimization for TensorFlow* and will update the repository regularly.
Special thanks to Intel contributors Huma Bidi, Mahmoud Abuzaina, Md Faijul Amin, Mohammad Ashraf Bhuiyan, Jayaram Bobba, Xiaoming Cui, Sheng Fu, Niranjan Hasabnis, Jing Huang, Jennifer Myers, Elmoustapha Ould-ahmed-vall, Clayne Robison, Bhavani Subramanian, Lakshay Tokas, Wei Wang, Karen Wu, and Guozhong Zhuang.
Intel® technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown.” Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #201108
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
© Intel Corporation. Intel, the Intel® logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.
(i) The results are reported at https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
(ii) Same as (i)
(iii) Refer to https://github.com/01org/mkl-dnn for more details on Intel® MKL-DNN optimized primitives
(iv) For the complete list of optimizations, refer to https://github.com/pennsate/AIM2017/raw/master/AIM-accelerating.pdf
(v) System configuration: CPU: Intel Xeon Platinum 8180 processor @ 2.50GHz; OS: CentOS 7.4; TensorFlow Source Code: https://github.com/tensorflow/tensorflow; TensorFlow Commit ID: 926fc13f7378d14fa7980963c4fe774e5922e336. Detailed configuration is as follows:
CPU: Thread(s) per core: 2; Core(s) per socket: 28; Socket(s): 2; NUMA node(s): 2; CPU family: 6; Model: 85; Model name: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz; Stepping: 4; HyperThreading: ON; Turbo: ON
Memory: 376GB (12 x 32GB); 24 slots, 12 occupied; 2666 MHz
Disks: Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)
BIOS: SE5C620.86B.00.01.0004.071220170215
OS: CentOS Linux 7.4.1708 (Core)
Kernel: 3.10.0-693.11.6.el7.x86_64