INT8 Inference Support in PaddlePaddle on 2nd Generation Intel® Xeon® Scalable Processors

PaddlePaddle* (PArallel Distributed Deep LEarning) is an open-source deep learning framework developed by Baidu both for internal use and for the broader deep learning community. Since 2016, Intel and Baidu have been working together to optimize PaddlePaddle performance for deep learning training and inference to meet Baidu’s critical online deployment requirements.

While deep neural networks (DNNs) show state-of-the-art (SOTA) accuracy for a wide range of computation tasks, they still face challenges in enterprise-scale deployment due to the high computational complexity of inference workloads. INT8 inference is one of the key techniques being actively studied to address this challenge.

Baidu recently announced the release of PaddlePaddle v1.3, the first deep learning framework that supports INT8 inference with Intel® Deep Learning Boost (Intel® DL Boost) on 2nd generation Intel® Xeon® Scalable processors [1], using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). This new generation of processors also includes integer Vector Neural Network Instructions (VNNI), which improve the throughput of multiply-add-accumulate operations on INT8 data types and help improve the performance of the low-precision convolution and matrix-matrix multiplication operations used in deep neural networks. With this hardware acceleration, low-precision inference can compute more operations per second, reduce memory access requirements, better utilize the cache, and deliver higher throughput and lower latency.

In this article, we first discuss the offline INT8 inference support recently implemented in the PaddlePaddle v1.3 release. We then describe how to enable an INT8 model and demonstrate its performance on 2nd gen Intel Xeon Scalable processors. Finally, we present how the INT8 solution integrates with the overall PaddlePaddle slim framework.

1. INT8 Inference

INT8 inference requires two major components:

  1. a calibration tool
  2. INT8 ops/kernels

1.1 Calibration tool

The calibration tool is used to collect tensor statistics by running FP32 inference on a calibration dataset, transform the graph from FP32 to INT8, and generate the INT8 model. Figure 1 shows the basic flow of the calibration tool, including tensor statistics collection and graph transformation; the gray boxes represent the inputs/outputs of the calibration tool and the blue boxes represent the modules in the calibration tool. The tool is provided as a Python* utility in the PaddlePaddle v1.3 release. Please refer to the Python code for more details.

Figure 1. Basic flow of the calibration tool. The gray boxes represent the inputs/outputs of the calibration tool and the blue boxes represent the modules in the calibration tool.

  • Tensor Statistics Collection. The user needs to select a calibration dataset that is representative of the data the model will see when deployed in production. With the user-prepared calibration dataset and the optimized FP32 model, sampling collects the FP32 activation tensors. Please refer to the sample data Python API (paddle.fluid.contrib.Calibrator.sample_data) for more details. Note that users can control the calibration run by specifying the batch size and the number of iterations.
  • Graph Transformation. Graph transformation performs the following four steps using the collected FP32 tensors (see the scale-computation sketch after this list). Please refer to the save_int8_model Python API for more details.
    1. Compute the scales using the user-specified quantization algorithm
      • Quantization algorithms: MAX and KL. MAX, the default, uses the maximum absolute value of the tensor; KL is an alternative that selects a threshold based on KL divergence (relative entropy). Users can switch between the two algorithms to meet the accuracy goal.
      • Formulas to compute quantization scales from the collected tensor statistics:
        • Activation scale = 255 / MAX, if the entire tensor is non-negative (e.g., after ReLU)
        • Activation scale = 127 / MAX, if the tensor has negative values
        • Weight scale = 127 / MAX
    2. Analyze the graph and insert quantize/dequantize ops based on the supported INT8 op list (Conv2D and Pool2D)
    3. Add the quantization scales to the Conv op as attributes
    4. Save the quantized INT8 model for inference deployment
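
As an illustration of these formulas, the sketch below computes MAX-based scales from sampled FP32 tensors with NumPy. The helper names are ours for illustration only; they are not part of the PaddlePaddle calibration API.

    import numpy as np

    def activation_scale(tensor):
        """Illustrative helper: MAX-based scale for an activation tensor."""
        max_abs = np.abs(tensor).max()
        if tensor.min() >= 0.0:        # entire tensor is non-negative (e.g., after ReLU)
            return 255.0 / max_abs     # quantize to unsigned INT8 (U8)
        return 127.0 / max_abs         # quantize to signed INT8 (S8)

    def weight_scale(weights):
        """Illustrative helper: MAX-based scale for convolution weights (S8)."""
        return 127.0 / np.abs(weights).max()

    # Example: an activation sampled after ReLU during calibration
    act = np.array([0.0, 0.5, 3.2, 1.7], dtype=np.float32)
    print(activation_scale(act))       # 255 / 3.2 ~= 79.7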

With the calibration tool, the FP32 model is quantized to an INT8 model. Figure 2 shows the typical graph transformation in two stages: 1) inserting QuantizeOp and DequantizeOp around each ConvOp as an intermediate stage, and 2) optimizing the intermediate graph by eliminating the unnecessary QuantizeOp/DequantizeOp pairs (shown in the gray mask boxes). Moreover, the quantization scale is added as an attribute of QuantizeConvOp in the INT8 graph.
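
To make the two stages concrete, the schematic below traces a small Conv/Pool chain through the transformation. It is our simplified illustration of the flow described above, not literal output of the tool.

    # FP32 graph (fragment):
    #   Conv2D(FP32) -> Pool2D(FP32) -> Conv2D(FP32)
    #
    # Stage 1: insert Quantize/Dequantize around each supported op:
    #   Quantize -> Conv2D(INT8) -> Dequantize -> Quantize -> Pool2D(INT8)
    #     -> Dequantize -> Quantize -> Conv2D(INT8) -> Dequantize
    #
    # Stage 2: remove back-to-back Dequantize/Quantize pairs:
    #   Quantize -> Conv2D(INT8) -> Pool2D(INT8) -> Conv2D(INT8) -> Dequantize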


Figure 2. Graph Transformation from FP32 to INT8 Model.

1.2 INT8 ops/kernels

PaddlePaddle v1.3 supports four INT8 ops/kernels: two newly added INT8 ops (Quantize and Dequantize) and two newly added INT8 kernels (Conv2D and Pool2D). These were selected first because they are widely used in popular CNN models and form the minimum set needed to demonstrate INT8 capability. We define the supported INT8 ops with their input and output data types (S8 for signed INT8, U8 for unsigned INT8); a quantize/dequantize round-trip sketch follows the list:

  • INT8 Op Quantize: support quantization from FP32 to S8/U8
  • INT8 Op Dequantize: support dequantization from S8/U8 to FP32
  • INT8 Kernel Conv: support INT8 Conv2D computation; input is S8/U8 and weight is S8; output is S8/U8/FP32
  • INT8 Kernel Pool: support INT8 Pool2D computation; input is S8/U8; output is same as input
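
The data-type rules above determine how values are mapped to and from INT8. The following is a minimal NumPy sketch of the quantize/dequantize round trip implied by these rules; it is our illustration, not the Intel MKL-DNN kernel implementation.

    import numpy as np

    def quantize(x, scale, signed):
        """Map FP32 values to S8 or U8 with a precomputed scale."""
        lo, hi = (-127, 127) if signed else (0, 255)
        q = np.clip(np.round(x * scale), lo, hi)
        return q.astype(np.int8 if signed else np.uint8)

    def dequantize(q, scale):
        """Map S8/U8 values back to FP32."""
        return q.astype(np.float32) / scale

    x = np.array([0.0, 0.5, 3.2], dtype=np.float32)  # e.g., a ReLU output
    scale = 255.0 / 3.2                              # activation scale (MAX algorithm)
    q = quantize(x, scale, signed=False)             # U8 values: [0, 40, 255]
    print(dequantize(q, scale))                      # ~= [0.0, 0.502, 3.2]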

More INT8 ops/kernels are under development to support additional models required by Baidu, including Requantize, Concat, Reshape, Transpose, and MatMul.

2. INT8 Enabling and Results

We selected two classical CNNs, ResNet-50 and MobileNet-V1. Below, we provide instructions on how to enable the INT8 model from FP32 using ResNet-50 as an example, and then demonstrate the performance on 2nd gen Intel Xeon Scalable processors with Intel DL Boost.

2.1 INT8 Enabling

There are three simple steps to enable the INT8 model based on the existing FP32 model and Python predictor:

  1. Construct the calibration object. It accepts the FP32 model, the calibration algorithm, and the parameters that control sampling during FP32 inference.

        calibrator = paddle.fluid.contrib.Calibrator(  # Step 1
            program=infer_program,        # required, FP32 program
            pretrained_model=model_path,  # required, FP32 pretrained model
            algo=algo,                    # required, calibration algorithm; MAX (default) or KL
            exe=exe,                      # required, executor
            output=int8_model,            # required, INT8 model
            feed_var_names=feed_dict,     # required, feed dict
            fetch_list=fetch_targets)     # required, fetch targets

  2. Run FP32 inference on the calibration dataset and sample the tensor data.

        _, acc1, _ = exe.run(
            program,
            feed={feed_dict[0]: image,
                  feed_dict[1]: label},
            fetch_list=fetch_targets)
        calibrator.sample_data()          # Step 2

  3. Save the INT8 model.

        calibrator.save_int8_model()      # Step 3

The INT8 model is generated after following the three steps above. Full instructions are provided in the README in PaddlePaddle’s public repo, and the sample code is also included as a test case. With the generated model, users can perform INT8 inference through either the Python API or the C-API, achieving reasonable accuracy and improved performance compared to FP32 inference.
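
As a usage sketch, the saved INT8 model can be loaded and run with the standard fluid inference API. The directory name and the random input below are placeholders; substitute the output path passed to the Calibrator and a real preprocessed batch.

    import numpy as np
    import paddle.fluid as fluid

    place = fluid.CPUPlace()
    exe = fluid.Executor(place)

    # Load the model produced by calibrator.save_int8_model();
    # "int8_model" is a placeholder directory name.
    [program, feed_names, fetch_targets] = fluid.io.load_inference_model(
        dirname="int8_model", executor=exe)

    # Placeholder input batch (NCHW); use real preprocessed images in practice.
    image = np.random.random((1, 3, 224, 224)).astype("float32")

    results = exe.run(program,
                      feed={feed_names[0]: image},
                      fetch_list=fetch_targets)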

2.2 INT8 Results

We measured both accuracy and performance of ResNet-50 and MobileNet-V1 on 2nd gen Intel Xeon Scalable processors and demonstrated reasonable accuracy and a solid performance gain for INT8 inference compared with FP32 inference.

Model             FP32 Top-1    INT8 Top-1    Diff (FP32 - INT8)
ResNet-50-V1.5    76.63%        76.23%        0.40%
MobileNet-V1      70.78%        70.47%        0.31%

Table 1. FP32 and INT8 Accuracy.

Table 1 shows the Top-1 accuracy measured on the full ImageNet validation dataset of 50,000 images [2]. ResNet-50 v1.5 and MobileNet-V1 show Top-1 accuracy losses of 0.40% and 0.31%, respectively. An accuracy loss within 1% of FP32 is commonly accepted as the standard for INT8 inference.

Model             FP32 Throughput     INT8 Throughput     Ratio (INT8/FP32)
                  (images/second)     (images/second)
ResNet-50-V1.5    11.54               32.2                2.8
MobileNet-V1      49.21               108.37              2.2

Table 2. FP32 and INT8 Throughput Performance (single batch size, single core).

Table 2 shows the throughput of FP32 and INT8 inference on a single socket of the Intel® Xeon® Gold 6271 processor. Throughput is measured with a single batch size on a single core, per the business deployment requirement from Baidu. The table demonstrates a significant performance boost of 2.2x to 2.8x (e.g., 32.2 / 11.54 ≈ 2.8 for ResNet-50) from Intel DL Boost with INT8 over FP32.

3. Impact of INT8 Quantization on PaddlePaddle Slim Framework

PaddlePaddle defines the slim framework as consisting of quantization, pruning, and distillation. As part of quantization, the INT8 calibration tool in v1.3 will be refined based on the unified PASS strategy and integrated into the slim framework as post-training quantization. In particular, the Python graph transformation will be adapted into a set of new APIs on top of the corresponding C++ passes, and a C++ predictor will be developed as well.

Figure 3: PaddlePaddle Slim Framework.

Moreover, quantization-aware training will leverage the existing, highly efficient Intel® MKL-DNN based INT8 ops/kernels for INT8 training on Intel Xeon Scalable processors and 2nd gen Intel Xeon Scalable processors. Quantization-aware training simulates the quantization of weights and activations in both the forward and backward passes, and therefore models the effects of inference-time quantization during training. Post-training quantization is a lightweight solution that generates the quantized model with a calibration tool; when it does not meet the accuracy goal, quantization-aware training can further mitigate the accuracy loss by simulating quantization errors during training. Figure 4 illustrates the relationship of the key components within the PaddlePaddle slim framework.
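
As a rough illustration of the "simulated quantization" mentioned above, quantization-aware training typically inserts fake-quantize operations that round values to the INT8 grid while keeping them in FP32, so the quantization error is visible to the training loss. The sketch below is ours, not PaddlePaddle's implementation.

    import numpy as np

    def fake_quantize(x, num_bits=8):
        """Simulate symmetric INT8 quantization in FP32 (round-trip through the INT8 grid)."""
        qmax = 2 ** (num_bits - 1) - 1        # 127 for INT8
        scale = qmax / np.abs(x).max()
        q = np.clip(np.round(x * scale), -qmax, qmax)
        return q / scale                      # still FP32, but carries the quantization error

    w = np.random.randn(3, 3).astype(np.float32)
    w_q = fake_quantize(w)                    # used in the forward pass during training
    print(np.abs(w - w_q).max())              # the simulated quantization error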


Figure 4. INT8 Ops/Kernels and Calibration Support for Slim Framework.

4. Summary

In this article, we discussed the design of INT8 inference support in Baidu’s PaddlePaddle v1.3 release. We also provided instructions on how to enable an INT8 model from the existing FP32 model and demonstrated the improved performance on 2nd gen Intel Xeon Scalable processors with Intel DL Boost.

In the future, we plan to enable more INT8 models and more INT8 ops/kernels to improve inference and meet Baidu’s requirements. Moreover, we would like to collaborate with Baidu on more contributions to the PaddlePaddle slim framework.
