BigDL Model Inference with Intel® DL Boost

Deep learning plays an increasingly important role in various artificial intelligence (AI) applications such as image classification, object detection, speech recognition, and recommendation engines. But once a trained model is deployed in the real world, strict business requirements and environments can present big challenges to deep learning inference. For example, industrial inspections have hard requirements for prediction latency, and Internet businesses need to process, and make predictions on, enormous volumes of data every day. Latency and throughput are therefore two key indicators of model inference performance, and low-precision inference is one method widely studied and applied in industry to improve both.

BigDL, an open source distributed deep learning framework that was released by Intel in 2016, provides robust training and inference support to customers like JD.com, Mastercard, and Dell EMC. The BigDL team has been continuously improving the user experience by adding more usability and richer deep learning features, as well as optimizing performance to help users reduce training and inference costs. With the latest release announced in March 2019, BigDL added INT8 inference support with Intel® Deep Learning Boost (Intel® DL Boost) on 2nd generation Intel® Xeon® Scalable processors, using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).

In this article, we describe how to generate INT8 models from FP32 models in BigDL and present performance and accuracy results measured on two typical models. Finally, we discuss how BigDL integrates Intel MKL-DNN as the underlying runtime computing engine to accelerate training and inference.

Quantization in BigDL

Converting an FP32 model into an INT8 model takes two steps: first generate the quantization scales, and then call the quantization API at runtime when doing inference.

Scales are generated from input samples. You need to create the same preprocessing pipeline used by the validation dataset during training and then call the calcScales API to generate scales for each layer:

  
// Build calibration samples with the same preprocessing pipeline as the validation dataset
val inputs: Seq[Tensor[Float]] = ??? // placeholder: your preprocessed input samples
inputs.foreach(input => {
    model.forward(input)    // forward pass so activation statistics are available
    model.calcScales(input) // compute the per-layer quantization scales
})
  
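For intuition, each scale maps the FP32 value range observed during calibration onto the INT8 range. In a common symmetric scheme (a sketch of the general idea only; BigDL's exact quantization formula may differ), the per-tensor scale $s$ and the quantized value are

$$s = \frac{127}{\max_i \lvert x_i \rvert}, \qquad x_{\mathrm{INT8}} = \mathrm{round}(s \cdot x)$$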

After generating the scales, you simply need to call quantize at runtime to enable INT8 inference.

  
val quantizedModel = model.quantize()
  
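Putting the two steps together, the following is a minimal end-to-end sketch; the model path and the input shape are illustrative placeholders, and it assumes the BigDL engine has already been initialized with the MKL-DNN backend enabled.

  
import com.intel.analytics.bigdl.nn.Module
import com.intel.analytics.bigdl.numeric.NumericFloat
import com.intel.analytics.bigdl.tensor.Tensor

// Load a trained FP32 model (hypothetical path) and switch to evaluation mode
val model = Module.loadModule[Float]("/path/to/fp32_model.bigdl")
model.evaluate()

// ... generate the per-layer scales from calibration samples as shown above ...

// Quantize the calibrated model and run INT8 inference
// (a 224x224 RGB input is just an example shape)
val quantizedModel = model.quantize()
val output = quantizedModel.forward(Tensor[Float](1, 3, 224, 224).rand())
quantizedModel.saveModule("/path/to/int8_model.bigdl", overWrite = true)
  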

Accuracy and performance

We selected two typical CNN models, ResNet-50 v1.0 and VGG-16, and measured both FP32 and INT8 inference on the ImageNet dataset; the complete test configurations are listed in the appendix. On the previous generation of Intel Xeon Scalable processors, the mainstream practice is FP32 inference; with Intel DL Boost introduced on 2nd generation Intel Xeon Scalable processors, the recommended approach is INT8 inference, which leverages the new Vector Neural Network Instructions (VNNI).

Table 1 shows the latency comparison between FP32 and INT8. BigDL achieves a 2.83x latency reduction for ResNet-50 and 2.05x for VGG-16.

Model | FP32 latency (ms), Intel® Xeon® Platinum 8180 processor | INT8 latency (ms), Intel® Xeon® Platinum 8280 processor | Reduction ratio
ResNet-50 | 5.201 | 1.839 | 2.83
VGG-16 | 8.993 | 4.396 | 2.05

Table 1: Latency comparison. Complete configuration details in appendix.

Table 2 shows the single-node throughput speedup between FP32 and INT8. We selected the batch size that gives the best throughput for each model. BigDL achieves a 3.44x speedup for ResNet-50 and 3.65x for VGG-16.

Model | Batch size | FP32 throughput (img/s), Intel® Xeon® Platinum 8180 processor | INT8 throughput (img/s), Intel® Xeon® Platinum 8280 processor | Speedup ratio
ResNet-50 | 64 | 624.412 | 2149.67 | 3.44
VGG-16 | 128 | 196.244 | 717.243 | 3.65

Table 2: Throughput comparison. Complete configuration details in appendix.

Table 3 shows the FP32 and INT8 accuracy. The accuracy loss is no more than 0.2% for both top-1 and top-5.

Model | Top N | FP32 accuracy, Intel® Xeon® Platinum 8180 processor | INT8 accuracy, Intel® Xeon® Platinum 8280 processor | Accuracy loss
ResNet-50 | Top 1 | 76.11% | 75.91% | 0.2%
ResNet-50 | Top 5 | 92.80% | 92.70% | 0.1%
VGG-16 | Top 1 | 70.70% | 70.64% | 0.06%
VGG-16 | Top 5 | 89.75% | 89.74% | 0.01%

Table 3: Accuracy loss. Complete configuration details in appendix.

INT8 implementation in BigDL

BigDL has now fully integrated Intel MKL-DNN as the underlying runtime computing engine to accelerate training and inference. Leveraging Intel MKL-DNN breaks down into several parts: runtime tensor management, layer/operation management per data type, reorder management, and layer/operation fusion.

  • Runtime tensor management. BigDL tensors are managed by the Java Virtual Machine (JVM), while Intel MKL-DNN prefers aligned native memory allocation, so we map each JVM tensor to a native DNNTensor created at runtime. The native tensor size is determined by the JVM tensor and the runtime data type (FP32 or INT8), and BigDL manages its lifecycle.
  • Layer/operation management. Intel MKL-DNN not only provides matrix/tensor calculations; it also offers powerful implementations of common neural network operations. We use the primitives provided by Intel MKL-DNN to manage the model's layer/operation topology at runtime, and the calculations for all operations supported by Intel MKL-DNN are performed by the Intel MKL-DNN engine.
  • Reorder management. Every Intel MKL-DNN operation has a platform-specific preferred input format for best performance; for example, 2-D convolution on the Intel® Xeon® Platinum 8180 processor expects NCHW16C as its input format. The actual input format, however, is often plain NCHW. To avoid the performance drop caused by operating on the original format, we reorder the input into the internally preferred format. Figures 1 and 2 show the format difference and why reordering improves performance, and Figure 3 shows the dynamic memory reordering inserted while the model is compiled; a small index-mapping sketch after Figure 3 makes the blocked layout concrete.

Figure 1: Direct convolution on an NCHW memory layout. Part of the output for 8 input channels is computed by the formula shown: the input and kernel values are gathered from discrete memory locations, multiplied, and accumulated. This operation can be accelerated with an Intel® AVX instruction that consolidates the multiplication and addition into one SIMD (fused multiply-add) instruction.

Figure 2: We needed to reorder the data in memory for better computing performance, as shown above.

Figure 3: Dynamic runtime reorder. The yellow circles are the reorder operations. The solid line is the original graph structure and the dotted line is the graph structure with reordering operations.
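
To make the preferred blocked format concrete, here is a standalone sketch (plain Scala arithmetic, not BigDL or Intel MKL-DNN code) of how an element at logical position (n, c, h, w) of an NCHW tensor maps into the NCHW16C layout, in which channels are grouped into blocks of 16 so that the 16 FP32 values consumed by one 512-bit vector instruction sit next to each other in memory:

  
// Flat offset of element (n, c, h, w) in the blocked layout [N][ceil(C/16)][H][W][16]
def nchw16cOffset(n: Int, c: Int, h: Int, w: Int, C: Int, H: Int, W: Int): Int = {
  val cBlocks = (C + 15) / 16 // number of 16-channel blocks
  val cb = c / 16             // which channel block the element belongs to
  val cr = c % 16             // position inside that block
  (((n * cBlocks + cb) * H + h) * W + w) * 16 + cr
}
  

In plain NCHW, the same 16 channel values are H × W elements apart, so gathering them for each vector operation is expensive; after a single reorder into the blocked layout they are contiguous, which is why the reorder cost pays for itself.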

  • Layer/operation fusion. By analyzing the model topology, we found that adjacent layers/operations can often be merged into one while keeping the same semantics. For example, for a 2-D convolution followed by batch normalization, the final output for input $x$ is

$$\mathrm{out}(x) = \gamma \cdot \frac{(W \ast x + b) - E}{\sqrt{V + \epsilon}} + \beta$$

In the above formula, $E$ is the learned mean, $V$ is the variance, $\gamma$ and $\beta$ are the batch normalization parameters, and $\epsilon$ is a small constant added for numerical stability.

Let's define

$$W' = \frac{\gamma}{\sqrt{V + \epsilon}}\, W, \qquad b' = \frac{\gamma\,(b - E)}{\sqrt{V + \epsilon}} + \beta$$

Then the output becomes

$$\mathrm{out}(x) = W' \ast x + b'$$

That is, the two layers can be merged into a single 2-D convolution by updating the weights and bias of the convolution layer.
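
As a standalone illustration of this folding arithmetic (plain Scala, not BigDL's internal implementation), the update for one convolution output channel could look like the following; eps is the usual small batch-normalization stability constant:

  
// Fold batch normalization (gamma, beta, learned mean E, variance V) into the
// weights and bias of a single convolution output channel.
def foldBnIntoConv(weights: Array[Float], bias: Float,
                   gamma: Float, beta: Float,
                   mean: Float, variance: Float,
                   eps: Float = 1e-5f): (Array[Float], Float) = {
  val scale = gamma / math.sqrt(variance + eps).toFloat
  val fusedWeights = weights.map(_ * scale)    // W' = gamma / sqrt(V + eps) * W
  val fusedBias = scale * (bias - mean) + beta // b' = gamma * (b - E) / sqrt(V + eps) + beta
  (fusedWeights, fusedBias)
}
  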

Fusion reduces the model size and accelerates computation. Currently, BigDL supports four types of fusion: Convolution + BatchNormalization, Convolution + ReLU, BatchNormalization + ReLU, and Convolution + Sum.

Enabling FP32 and INT8 follows the same steps; the only difference is the data type we use to manage the primitives at runtime.

Summary

In this article, we described the design of INT8 support in the BigDL 0.8.0 release and showed how to generate a quantized model from the user's perspective. The test results show that, compared with FP32, BigDL delivers significant improvements in both latency and throughput with Intel® DL Boost on 2nd gen Intel Xeon Scalable processors, with minimal accuracy loss. We plan to enable more layers and support more optimized models in future releases as we continue this optimization work.

Appendix

Test configurations:

Configuration | FP32 | INT8
Platform | Intel® Xeon® Platinum 8180 | Intel® Xeon® Platinum 8280
Nodes | 1 | 1
Sockets | 2 | 2
Cores/socket | 28 | 28
Threads/socket | 56 | 56
System DDR | 12 slots / 16 GB / 2666 | 12 slots / 16 GB / 2933
Storage | Intel SSD S3610 Series 480 GB | Intel SSD S3610 Series 480 GB
OS | CentOS 7.5 | CentOS 7.5
OS kernel | 3.10.0-862.el7.x86_64 | 3.10.0-862.el7.x86_64
BigDL version | 0.8.0 | 0.8.0
Compiler | Java 1.8 | Java 1.8
MKL-DNN version | 0.17 | 0.17

Notices and Disclaimers