Over the past few years, Intel has done remarkable work optimizing the TensorFlow* deep learning framework on Intel® Xeon® Scalable processors. Training and inference performance benchmarks are usually measured with synthetic data. In the real world, however, data scientists train deep learning models and predict results with real data, so data preprocessing performance is also a significant part of overall model performance. In this blog, we discuss how to fully utilize the hardware capabilities of Intel® Xeon® Scalable processors and the TensorFlow tf.data APIs to improve data layer performance.
In March 2018, Google’s TensorFlow team released the tf.data library, which was specifically designed to accelerate data preprocessing in deep learning models. With the tf.data APIs, users can easily build an input data pipeline that runs in parallel with training/inference. As shown in the example in Google’s tf.data API documentation, data preprocessing can be deployed to a CPU, and training/inference can be deployed to an accelerator (GPU or TPU). While the GPU/TPU runs training/inference on one batch of data, the CPU can process the next batch at the same time. With this pipeline, the total training/inference time can be significantly reduced.
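To make the pipeline pattern concrete, here is a minimal tf.data sketch, not the blog’s actual scripts: the synthetic dataset, the `parse_image` function, and the parallelism value are placeholders, and it assumes a TensorFlow release with eager execution enabled (TF 2.x).

```python
# Minimal tf.data input pipeline sketch: parallel preprocessing plus
# prefetching so the next batch is prepared while the current one is consumed.
import tensorflow as tf

def parse_image(x):
    # Stand-in for the real JPEG decode / resize / normalization step.
    return tf.cast(x, tf.float32) / 255.0

# Synthetic stand-in for a directory of raw images.
raw = tf.data.Dataset.from_tensor_slices(tf.zeros([32, 8, 8, 3], tf.uint8))

dataset = (raw
           .map(parse_image, num_parallel_calls=4)  # preprocess in parallel
           .batch(16)
           .prefetch(1))  # overlap preprocessing with the consumer

for batch in dataset:
    pass  # feed each (16, 8, 8, 3) batch to training/inference here
```

The key call is `prefetch(1)`: it decouples the producer (data layer) from the consumer (model), which is exactly the overlap the paragraph above describes.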
We have applied the tf.data APIs to the benchmark scripts for several CNN models with the ImageNet 2012 dataset. We built a data pipeline for the images in the dataset and optimized data layer performance in two ways:
We enabled data preprocessing to run asynchronously with inference by deploying the data layer operations and the inference operations into two separate CPU thread pools.
When data scientists create deep learning models with the TensorFlow Python APIs, they typically combine data layer operations and training/inference into one computation graph. However, all the operations in one graph are deployed into the same CPU thread pool, which is not efficient. In our case, as shown in Figure 1 above, we built two TensorFlow graphs: one for the data layer operations, and the other for the inference operations. We connected the two graphs by inserting a placeholder node into the inference graph, so that a batch of preprocessed data can be retrieved from the data graph and fed into the inference graph.
We also created two TensorFlow sessions, each associated with its own CPU thread pool. We assigned the data preprocessing graph to one session and the inference graph to the other. All the data layer operations run in their assigned session, within its associated thread pool; all the inference operations run in the other session, and therefore in the other thread pool. Because the operations are separated into different CPU thread pools, the data layer threads do not interfere with the inference threads, and data preprocessing and inference can execute concurrently.
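A minimal sketch of the two-graph, two-session structure is below. It is written against the TF 1.x session API via `tf.compat.v1` (so it also runs on modern TensorFlow); the ops are trivial stand-ins, and the thread-pool sizes are illustrative values, not the tuned configuration.

```python
# Two graphs in two sessions, each session with its own CPU thread pool.
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

# Separate thread-pool configurations for the two sessions.
data_config = tf1.ConfigProto(inter_op_parallelism_threads=2,
                              intra_op_parallelism_threads=2)
infer_config = tf1.ConfigProto(inter_op_parallelism_threads=1,
                               intra_op_parallelism_threads=4)

# Graph 1: the data layer (a constant op stands in for the real pipeline).
data_graph = tf1.Graph()
with data_graph.as_default():
    next_batch = tf1.ones([4, 8], tf.float32)

# Graph 2: inference, with a placeholder as the bridge between the graphs.
infer_graph = tf1.Graph()
with infer_graph.as_default():
    images = tf1.placeholder(tf.float32, [None, 8])
    logits = tf1.matmul(images, tf1.ones([8, 2]))  # stand-in for the model

data_sess = tf1.Session(graph=data_graph, config=data_config)
infer_sess = tf1.Session(graph=infer_graph, config=infer_config)

batch = data_sess.run(next_batch)                 # runs in the data pool
preds = infer_sess.run(logits, {images: batch})   # runs in the inference pool
```

The placeholder `images` is the only connection between the two graphs: whatever the data session produces is fed into the inference session through it.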
As shown in Figure 2 above, the data preprocessing and inference are no longer sequential operations. When the inference operations perform the inference with one batch of data in the inference thread pool, the data layer operations can process the next batch of data at the same time in the data thread pool. Therefore, the total inference time will be significantly reduced.
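The timing benefit in Figure 2 can be illustrated with a pure-Python sketch (stdlib only, hypothetical sleep times standing in for real work): while the inference pool handles batch i, the data pool prepares batch i+1.

```python
# Stdlib-only sketch of pipelined preprocessing and inference.
from concurrent.futures import ThreadPoolExecutor
import time

def preprocess(i):
    time.sleep(0.05)   # stand-in for decode/resize work
    return i

def infer(batch):
    time.sleep(0.05)   # stand-in for model execution
    return batch * 2

data_pool = ThreadPoolExecutor(max_workers=1)   # "data" thread pool
infer_pool = ThreadPoolExecutor(max_workers=1)  # "inference" thread pool

n = 8
futures = []
start = time.time()
next_batch = data_pool.submit(preprocess, 0)
for i in range(n):
    batch = next_batch.result()
    if i + 1 < n:
        # Kick off preprocessing of the NEXT batch before inferring this one.
        next_batch = data_pool.submit(preprocess, i + 1)
    futures.append(infer_pool.submit(infer, batch))
results = [f.result() for f in futures]
elapsed = time.time() - start
# Sequential execution would take about n * 0.1 s; the pipelined version
# takes roughly (n + 1) * 0.05 s because the two stages overlap.
```

The numbers are toy values, but the structure mirrors the blog’s design: two dedicated pools, with the hand-off point being the only synchronization between them.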
In addition, we created and assigned different CPU configurations to each TensorFlow session. Data layer operations have their own patterns of parallelism, which may differ from those of deep learning models. With separate CPU configurations for the data layer and for inference, we can adjust the parallelism parameters of each independently.
Last but not least, to find the CPU configurations that give the best performance for the overall end-to-end deep learning flow, from data preprocessing to inference, we used TensorTuner to tune the following parameters on 2nd gen Intel Xeon Scalable processors:
data_inter_op_parallelism_threads: maximum number of data graph nodes that can be executed in parallel.
data_intra_op_parallelism_threads: maximum number of threads that can be used to execute one data graph node.
inference_inter_op_parallelism_threads: maximum number of inference graph nodes that can be executed in parallel.
inference_intra_op_parallelism_threads: maximum number of threads that can be used to execute one inference graph node.
OMP_NUM_THREADS: maximum number of threads used to execute one MKL-backed graph node.
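The five knobs above can be mapped onto two session configurations plus one environment variable; the sketch below shows that mapping with hypothetical values (the actual tuned values are not listed in this blog), using `tf.compat.v1` for the session-era API.

```python
# Hypothetical mapping of the tuned parameters onto TensorFlow configs.
import os
import tensorflow as tf

# OMP_NUM_THREADS controls MKL-backed ops; it should be set before the
# OpenMP runtime initializes (i.e., before the first session runs).
os.environ["OMP_NUM_THREADS"] = "28"

data_config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=4,   # data_inter_op_parallelism_threads
    intra_op_parallelism_threads=4)   # data_intra_op_parallelism_threads

inference_config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=2,   # inference_inter_op_parallelism_threads
    intra_op_parallelism_threads=28)  # inference_intra_op_parallelism_threads
```

Each `ConfigProto` is then passed to its corresponding session, so the data graph and the inference graph get independently tuned thread pools.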
We have implemented the optimizations described in this blog for the inference of several popular deep learning models for image classification and object detection. Chart 1 below shows the INT8 quantized model performance comparison between real data (ImageNet 2012 for CNN models, COCO 2017 for SSD_VGG16) and synthetic data, measured with one socket and one instance on 28-core 2nd gen Intel Xeon Scalable processors. As can be seen, real data performance reaches up to 95% of synthetic data performance for SSD_VGG16.
In conclusion, we implemented a TensorFlow data pipeline in our benchmark scripts for the inference of several deep learning models. We tuned the data pipeline configuration according to Intel CPU hardware capabilities and significantly improved deep learning performance. For implementation details, please check the following Python scripts in the Intel AI Model Zoo.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit www.intel.com/benchmarks.
Performance results are based on testing as of 5/22/2019 by Intel Corporation and may not reflect all publicly available security updates. No product or component can be absolutely secure.
2nd Gen Intel Xeon Scalable Processor Platform:
2 socket Intel® Xeon® Platinum 8280 Processor, 28 cores, HT On, Turbo ON, Total Memory 384 GB (12 slots/ 32GB/ 2934 MHz), BIOS: SE5C620.86B.0D.01.0438.032620191658, CentOS 7.6, 4.19.5-1.el7.elrepo.x86_64, Deep Learning Framework: https://hub.docker.com/r/intelaipg/intel-optimized-tensorflow:1.14-pre-rc0-devel-mkl-py3 (https://github.com/tensorflow/tensorflow.git commit: f78b725d10e1386b614621465810b9e79558bd08), Compiler: gcc 6.3.0, MKL DNN version: v0.18, Datatype: INT8
Performance was measured with ImageNet 2012 for CNN models and COCO 2017 for SSD_VGG16, as well as with synthetic data, using a minibatch size of 128 for CNN models and a minibatch size of 1 for SSD_VGG16.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice Revision #20110804
Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation