Introducing int8 quantization for fast CPU inference using OpenVINO

Deep learning framework optimizations and tools that streamline deployment are advancing the adoption of inference applications on Intel® platforms. Reducing model precision is an efficient way to accelerate inference on processors that support low precision math, because it lowers memory bandwidth requirements and increases the number of operations per cycle. Currently, there are two mainstream ways to achieve a reduction in model precision:

  • Training in low precision from the beginning or subsequent fine-tuning for low precision if possible. This method requires access to training infrastructure, dataset, and knowledge of training parameters and procedure.
  • Post-training quantization without involvement of any training process whatsoever.

The two methods differ not only in the complexity of the procedure, but also in the accuracy of the final model. Quantization to 8 bits has been thoroughly investigated, and in the majority of cases it can be performed without retraining. Lower precisions typically suffer from quality loss as a result of post-training quantization and require the use of training-based methods.
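
To make the precision tradeoff concrete, here is a minimal sketch of symmetric linear quantization in plain Python. It is only an illustration of the general idea, not the scheme the toolkit necessarily uses; the function names and the sample values are made up for this example.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers using one symmetric scale factor."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Largest round-trip error introduced by quantizing to `bits` bits."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [-1.0, -0.42, -0.07, 0.0, 0.13, 0.5, 0.91]   # made-up sample values
error_int8 = max_abs_error(weights, bits=8)
error_int4 = max_abs_error(weights, bits=4)
print(f"max error at 8 bits: {error_int8:.5f}")
print(f"max error at 4 bits: {error_int4:.5f}")
```

With these sample values the round-trip error at 4 bits is more than an order of magnitude larger than at 8 bits, which is consistent with lower precisions being harder to reach without retraining.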

The post-training quantization process evaluates model accuracy with two goals: to reduce execution precision for as many layers as possible to achieve high performance, and to keep model accuracy as close as possible to the original. This process usually results in mixed precision models, which combine fp32 (high accuracy) and int8 (high performance) layers. It is a tradeoff between keeping accuracy high and accelerating the model as much as possible.
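
One way to picture this search is a greedy loop that moves layers to int8 one at a time and keeps each change only if accuracy stays within a budget. The sketch below is a simplified illustration under stated assumptions, not the algorithm the calibration tool actually implements: the layer names, the per-layer costs, and the `toy_evaluate` function are all hypothetical stand-ins for a real validation run.

```python
def choose_int8_layers(layers, evaluate, baseline_accuracy, max_drop):
    """Greedily move layers to int8 while the accuracy drop stays within max_drop."""
    int8_layers = []
    for layer in layers:
        candidate = int8_layers + [layer]
        if baseline_accuracy - evaluate(candidate) <= max_drop:
            int8_layers = candidate   # drop is acceptable: run this layer in int8
        # otherwise the layer stays in fp32
    return int8_layers


# Hypothetical accuracy cost (percentage points) of running each layer in int8.
HYPOTHETICAL_COST = {"conv1": 0.6, "conv2": 0.1, "conv3": 0.2, "fc": 0.3}
BASELINE = 76.0   # made-up fp32 accuracy, in percent


def toy_evaluate(int8_layers):
    """Stand-in for a real validation run; assumes per-layer costs simply add up."""
    return BASELINE - sum(HYPOTHETICAL_COST[name] for name in int8_layers)


selected = choose_int8_layers(list(HYPOTHETICAL_COST), toy_evaluate, BASELINE, max_drop=0.5)
print("int8 layers:", selected)
print("fp32 layers:", [name for name in HYPOTHETICAL_COST if name not in selected])
```

In this toy setup, `conv1` and `fc` stay in fp32 because moving either of them to int8 would push the total drop past the 0.5 point budget, so the result is a mixed precision model.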

Intel® processors deliver high throughput for both floating point and integer workloads, which makes them a good target for mixed precision models. The latest release of the Intel® Distribution of OpenVINO™ toolkit, a developer toolkit that accelerates high performance computer vision and deep learning inference, includes a post-training quantization process with corresponding support for int8 model inference on Intel® processors.

To enhance performance on these workloads with the Intel Distribution of OpenVINO toolkit, the following steps are required:

  1. Convert the model from original framework format using the Model Optimizer tool. This will output the model in Intermediate Representation (IR) format.
  2. Perform model calibration using the calibration tool within the Intel Distribution of OpenVINO toolkit. It accepts the model in IR format and is framework-agnostic.
  3. Use the updated model in IR format to perform inference.

If you have used the toolkit before, steps 1 and 3 are not new and are sufficient to run a model using the Deep Learning Inference Engine. Step 2 is the only additional step required to perform actual analysis of the model for low precision quantization.

Following the toolkit’s guidelines, the resulting IR file can be used across all Intel platforms already supported by the toolkit for inference (VPUs, HDDL, FPGAs, integrated graphics) without limitations. A switch to int8 precision will happen automatically on CPU-based platforms that support reduced precision inference.

Model calibration is performed using the source model and a validation dataset. A large training dataset is not required; it is sufficient to provide a small subset. Typically, a few hundred images are more than enough. You will also need to specify an accuracy restriction, i.e. the maximum accuracy degradation that is acceptable for the resulting model.
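
The two inputs described above, a small subset and an accuracy restriction, can be sketched as follows. This is an illustrative example only and does not reproduce the calibration tool's interface: the validation set size, the subset size of 300, the threshold of 1.0 percentage point, and all predictions are made-up values.

```python
import random


def top1_accuracy(predictions, labels):
    """Percentage of samples whose predicted class matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * correct / len(labels)


def within_threshold(fp32_preds, int8_preds, labels, max_drop):
    """Return (ok, drop), where drop is the absolute top-1 drop in percentage points."""
    drop = top1_accuracy(fp32_preds, labels) - top1_accuracy(int8_preds, labels)
    return drop <= max_drop, drop


# Pick a few hundred images from a hypothetical 50,000-image validation set.
rng = random.Random(0)                       # fixed seed for a reproducible subset
subset_indices = rng.sample(range(50_000), 300)

# Made-up labels and predictions for the subset (10 classes).
labels = [i % 10 for i in range(len(subset_indices))]
fp32_preds = [(y + 1) % 10 if i < 60 else y for i, y in enumerate(labels)]   # 60 mistakes
int8_preds = [(y + 1) % 10 if i < 62 else y for i, y in enumerate(labels)]   # 62 mistakes

ok, drop = within_threshold(fp32_preds, int8_preds, labels, max_drop=1.0)
print(f"accuracy drop: {drop:.2f} points, within threshold: {ok}")
```

Here the drop is measured in absolute percentage points, the same convention as the "absolute accuracy drop" column in Table 1.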

The current calibration tool supports calibration of classification and object detection SSD models on the ImageNet and VOC2007 datasets. The calibration tool is released as open source, and users can extend it for new datasets as well as for new domains of neural networks.

Performance improvements from the int8 quantization process vary depending on the model; below are some examples of models for different Intel processors. It is worth mentioning that the use of quantized models also reduces memory consumption during inference, which is also shown in Table 1 below.

Table 1: Performance Improvements^

| Topology name | Speedup int8 vs fp32: Intel® Xeon® Platinum 8160 Processor, Intel® AVX-512 | Speedup int8 vs fp32: Intel® Core™ i7 8700 Processor, Intel® AVX2 | Speedup int8 vs fp32: Intel Atom® E3900 Processor, SSE4.2 | Memory footprint gain: Intel Core i7 8700 Processor, Intel AVX2 | Absolute accuracy drop vs original fp32 model |
|---|---|---|---|---|---|
| Inception V1 | 1.28x | 1.31x | 1.27x | 0.93x | Top 1: 0.16%, Top 5: 0.05% |
| Inception V3 | 1.76x | 1.76x | 1.41x | 0.76x | Top 1: 0.01%, Top 5: 0.01% |
| Inception V4 | 1.6x | 1.82x | 1.21x | 0.64x | Top 1: 0.35%, Top 5: 0.14% |
| ResNet 50 | 2.15x | 1.67x | 1.71x | 0.78x | Top 1: 0.07%, Top 5: 0.02% |
| ResNet 101 | 1.9x | 1.77x | 1.96x | 0.69x | Top 1: 0.27%, Top 5: 0.11% |
| DenseNet_201 | 1.58x | 1.67x | 1.86x | 0.68x | Top 1: 0.05%, Top 5: 0.07% |
| Mobilenet v1 | 1.6x | 1.56x | 2.1x | 0.77x | Top 1: 0.19%, Top 5: 0.2% |
| SSD300 (VGG16) | 1.42x | 1.66x | 1.72x | 0.84x | mAP: 0.03% |
| SSD (MobileNet) | 1.5x | 1.36x | 2.1x | 0.9x | mAP: 0.24% |

To make it easier for you to get higher performance, we have released pre-trained models from Open Model Zoo with quantization information already included. That way, additional performance gains will be available out of the box.

In conclusion, executing a post-training quantization process using the Intel Distribution of OpenVINO toolkit allows you to unleash additional performance while keeping the original model’s quality and without the substantial effort needed to convert a model to int8 precision. No additional training knowledge or datasets are required. You can download the Intel Distribution of OpenVINO toolkit here; an open-source version is also available.

Additional reading:

Intel Distribution of OpenVINO toolkit documentation:

Notices and Disclaimers:

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

^Configurations:

Intel® Xeon® Platinum 8160 Processor @ 2.10 GHz with 48MB RAM, OS: Ubuntu 16.04, kernel: 4.4.0-87-generic

Intel® Core™ i7-8700 Processor @ 3.20GHz with 16 MB RAM, OS: Ubuntu 16.04.3 LTS, Kernel: 4.15.0-29-generic

Intel Atom® Processor E3900 (Apollo Lake RVP1A) @ 1.60GHz with 2MB RAM, OS: Poky (Yocto Project Reference Distro) 2.0.3 (jethro), Kernel: 4.1.27-yocto-standard