Optimizing TensorFlow Data Layer Performance on Intel® Xeon® Scalable Processors

Over the past few years, Intel has done remarkable work in optimizing the TensorFlow* deep learning framework on Intel® Xeon® Scalable processors. Training/inference performance benchmarks are usually measured with synthetic data. In the real world, however, data scientists use real data to train deep learning models and predict the results. Data preprocessing performance is also a significant part of overall performance for deep learning models. In this blog, we will discuss how to fully utilize the hardware capabilities of Intel® Xeon® Scalable processors and TensorFlow tf.data APIs to improve the data layer performance.

Use TensorFlow tf.data APIs for Data Preprocessing

In March 2018, Google's TensorFlow team released the tf.data library, which was specifically designed to accelerate data preprocessing in deep learning models. With the tf.data APIs, users can easily build an input data pipeline that runs in parallel with training/inference. As shown in the example in Google's tf.data API documentation, data preprocessing can be deployed to a CPU while training/inference runs on an accelerator (GPU or TPU). While the GPU/TPU runs training/inference on one batch of data, the CPU can preprocess the next batch at the same time. With this pipeline, the total training/inference time can be significantly reduced.

We have applied the tf.data APIs to the benchmark scripts for several CNN models with the ImageNet 2012 dataset. We built the data pipeline for the images in the dataset and optimized the data layer performance in two ways:

  1. Inside the data pipeline, we extracted the data in parallel from persistent storage, transformed the data in parallel, and used fused data operations (such as map_and_batch()) whenever possible. Finally, we loaded the preprocessed data into the data buffer. Furthermore, with 2nd generation Intel Xeon Scalable processors, we tuned the configurable parameters in the tf.data APIs, including the number of files to open in parallel for data extraction, the number of TFRecords to read from each file, the number of threads to transform the data in parallel, and the buffer size to cache the preprocessed data. A sketch of such a pipeline appears after this list.
  2. Outside the data pipeline, we enabled data preprocessing and inference to run in parallel by deploying them into two different thread pools. The details are discussed in the next section.
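
As an illustration of this extract-transform-load structure, here is a minimal sketch, assuming ImageNet-style TFRecord files and the TF 1.x-era tf.data API mentioned above (parallel_interleave and map_and_batch). The parse_and_preprocess and build_dataset functions, the feature keys, and all parallelism/buffer values are illustrative placeholders, not our tuned benchmark settings.

```python
import tensorflow as tf

def parse_and_preprocess(serialized_example):
    # Hypothetical parser: decode one ImageNet-style TFRecord example and
    # resize the image; real preprocessing also crops, normalizes, etc.
    features = tf.io.parse_single_example(
        serialized_example,
        {'image/encoded': tf.io.FixedLenFeature([], tf.string)})
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
    return tf.image.resize(image, [224, 224])

def build_dataset(file_pattern, batch_size):
    files = tf.data.Dataset.list_files(file_pattern)
    # Extract: read several TFRecord files in parallel.
    dataset = files.apply(
        tf.data.experimental.parallel_interleave(
            tf.data.TFRecordDataset, cycle_length=8))
    # Transform + batch: fused map_and_batch() with parallel calls.
    dataset = dataset.apply(
        tf.data.experimental.map_and_batch(
            parse_and_preprocess,
            batch_size=batch_size,
            num_parallel_calls=16))
    # Load: keep a small buffer of preprocessed batches ready for the consumer.
    return dataset.prefetch(buffer_size=2)
```

The cycle_length, num_parallel_calls, and prefetch buffer size in this sketch correspond to the tunable parameters listed in item 1 above.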

Deploy Data Preprocessing and Training/Inference to Separate CPU Thread Pools

We enabled data preprocessing to run asynchronously with inference by deploying the data layer operations and the inference operations into two separate CPU thread pools.

Figure 1: Deployment of data layer operations and inference on CPU.

When data scientists create deep learning models with the TensorFlow Python APIs, they typically combine the data layer operations and training/inference into one computation graph. However, all the operations in a single graph are deployed into the same CPU thread pool, which is not efficient. In our case, as shown in Figure 1 above, we built two TensorFlow graphs: one for the data layer operations, and the other for the inference operations. We connected the two graphs by inserting a placeholder node into the inference graph, so that a batch of preprocessed data can be retrieved from the data graph and fed into the inference graph.

We also created two TensorFlow sessions, each associated with its own CPU thread pool. We assigned the data preprocessing graph to one session and the inference graph to the other. All the data layer operations run in their assigned session, within its associated thread pool; all the inference operations run in the other session, and therefore in the other thread pool. Because the operations are separated into different CPU thread pools, the data layer threads do not interfere with the inference threads, and data preprocessing and inference can execute concurrently.

Figure 2: Data pipeline on CPU.

As shown in Figure 2 above, data preprocessing and inference are no longer sequential. While the inference operations process one batch of data in the inference thread pool, the data layer operations can prepare the next batch in the data thread pool. Therefore, the total inference time is significantly reduced.

In addition, we created and assigned a different CPU configuration to each TensorFlow session. Data layer operations have their own parallelism characteristics, which may differ from those of the deep learning model. With separate CPU configurations for the data layer operations and for inference, we can adjust the parallelism parameters of each independently.

The code sketch below illustrates how the optimizations described above can be implemented.
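
This is a minimal, illustrative sketch rather than the exact benchmark code: it uses TF 1.x-style sessions via tf.compat.v1, the build_dataset() helper from the earlier pipeline sketch, a trivial reduce_mean stand-in for the real inference model, a placeholder file pattern, and example thread counts.

```python
import tensorflow as tf

# Data graph: holds the tf.data pipeline (build_dataset() is the
# hypothetical helper sketched earlier).
data_graph = tf.Graph()
with data_graph.as_default():
    dataset = build_dataset('/path/to/imagenet-*.tfrecord', batch_size=128)
    next_batch = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()

# Inference graph: a trivial stand-in computation; in practice the frozen
# ResNet/Inception/SSD model graph would be imported here instead.
infer_graph = tf.Graph()
with infer_graph.as_default():
    # Placeholder node that connects the two graphs: preprocessed batches
    # produced by the data session are fed in here.
    input_images = tf.compat.v1.placeholder(
        tf.float32, shape=[None, 224, 224, 3], name='input')
    predictions = tf.reduce_mean(input_images, axis=[1, 2])  # stand-in model

# One session (and therefore one CPU thread pool) per graph; the thread
# counts are example values, not tuned settings.
data_sess = tf.compat.v1.Session(
    graph=data_graph,
    config=tf.compat.v1.ConfigProto(inter_op_parallelism_threads=4,
                                    intra_op_parallelism_threads=4))
infer_sess = tf.compat.v1.Session(
    graph=infer_graph,
    config=tf.compat.v1.ConfigProto(inter_op_parallelism_threads=2,
                                    intra_op_parallelism_threads=24))

# Because each session owns its own thread pool, fetching the next batch
# from data_sess (e.g. in a background thread) can overlap with inference
# on the previous batch in infer_sess.
batch = data_sess.run(next_batch)
results = infer_sess.run(predictions, feed_dict={input_images: batch})
```

In the actual benchmark flow, the data session keeps producing batches in the background so that preprocessing of the next batch overlaps with inference on the current one.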

Last but not least, to find the CPU configurations that give the best performance for the overall end-to-end deep learning flow, from data preprocessing to inference, we used TensorTuner to tune the following parameters on 2nd generation Intel Xeon Scalable processors (a sketch of how these parameters map to runtime settings follows the list):

  1. data_inter_op_parallelism_threads: maximum number of data graph nodes that can be executed in parallel.
  2. data_intra_op_parallelism_threads: maximum number of threads that can be used to execute one data graph node.
  3. inference_inter_op_parallelism_threads: maximum number of inference graph nodes that can be executed in parallel.
  4. inference_intra_op_parallelism_threads: maximum number of threads that can be used to execute one inference graph node.
  5. OMP_NUM_THREADS: maximum number of threads used to execute one MKL-backed graph node.
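
For illustration only, here is roughly how these parameters map to the two sessions' ConfigProto settings and the OMP_NUM_THREADS environment variable; the values shown are placeholders, not TensorTuner results.

```python
import os
import tensorflow as tf

# Set before TensorFlow/MKL initializes; threads per MKL-backed graph node.
os.environ['OMP_NUM_THREADS'] = '24'

# Thread-pool settings for the data preprocessing session.
data_config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=4,   # data_inter_op_parallelism_threads
    intra_op_parallelism_threads=4)   # data_intra_op_parallelism_threads

# Thread-pool settings for the inference session.
inference_config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=2,   # inference_inter_op_parallelism_threads
    intra_op_parallelism_threads=24)  # inference_intra_op_parallelism_threads
```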

Performance Comparison

We have implemented the optimizations described in this blog for the inference of several popular deep learning models for image classification and object detection. Chart 1 below shows the INT8 quantized model performance comparison between real data (ImageNet 2012 for the CNN models, COCO 2017 for SSD_VGG16) and synthetic data, measured with one socket and one instance on a 28-core 2nd generation Intel Xeon Scalable processor. As can be seen, real data performance reaches up to 95% of synthetic data performance for SSD_VGG16.

Chart 1: Performance comparison of INT8 quantized models between real data and synthetic data. Performance was measured with ImageNet 2012 for the CNN models and COCO 2017 for SSD_VGG16, as well as synthetic data, with a minibatch size of 128 for the CNN models and a minibatch size of 1 for SSD_VGG16. See complete configuration details in the appendix.

Summary

In conclusion, we implemented a TensorFlow data pipeline in our benchmark scripts for the inference of several deep learning models. We tuned the data pipeline configurations according to the Intel CPU hardware capabilities and significantly improved the deep learning performance. For implementation details, please check the following Python scripts in the Intel AI Model Zoo.

Model | Scripts
ResNet-50 | eval_image_classifier_inference.py, preprocessing.py
ResNet-101 | eval_image_classifier_inference.py, preprocessing.py
InceptionV3 | eval_image_classifier_inference.py, preprocessing.py
SSD_VGG16 | eval_ssd.py

Notices and Disclaimers

System configuration