Neural network quantization and execution in low precision have been widely adopted as an optimization method that can achieve significant acceleration while maintaining accuracy. Lowering computational precision to 8 bits can be achieved without model re-training in the majority of cases, and only requires a small fine-tuning step in the remaining ones. Meanwhile, quantization results in substantial speedup and higher system throughput, which is beneficial to deployment use cases.
The Intel® Distribution of OpenVINO™ toolkit is a software tool suite that accelerates applications and algorithms with high-performance, deep learning inference deployed from the edge to the cloud. The previous version of the toolkit, released in October 2019, introduced a new technique for post-training model quantization in order to convert models into low precision without re-training while also improving latency. With the latest release, version 2020.1, we continue to make improvements in both enabling a more streamlined development experience in low-precision quantization and optimizing deep learning performance on Intel® architecture-based platforms. The toolkit speeds up inference on a broad range of Intel hardware: Intel Xeon Scalable and Core CPUs for general-purpose compute, Intel Movidius VPUs for dedicated media and vision applications, and FPGAs for flexible programming logic and scale.
Highlights of these improvements include a redesigned Post-training Optimization Tool, quantization-aware training support in the Neural Network Compression Framework, and unified low-precision model representation and execution in the runtime. Let’s walk through the key changes in neural network quantization and low-precision execution that are included in release 2020.1.
Quantization is the replacement of floating-point arithmetic with integer arithmetic. Since the range of floating-point values is very large compared to the range of actual activations and weights, the typical quantization approach is to represent a range of activation and weight values with a discrete set of points. Consequently, for int8 quantization, the range is represented by 256 discrete values, matching the number of values that 8 bits can encode. Several quantization modes are possible, and two of them are considered mainstream (see Fig. 1): symmetric and asymmetric quantization.
The main difference between these two modes is that symmetric quantization is more hardware-friendly and produces higher speedup, while the asymmetric one introduces extra computations and requires hardware-specific tweaks and considerations. However, asymmetric mode has the potential to more accurately represent the original range and improve accuracy. In practice, symmetric quantization is considered a baseline for model acceleration on CPU and integrated GPU, while asymmetric can be used in special cases, such as quantizing non-ReLU models (ELU, PReLU, GELU, etc.).
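To make the two modes concrete, here is a small NumPy sketch (the tensor values and ranges are invented for illustration) that derives the scale and zero-point parameters for each mode and compares round-trip error on a skewed tensor of the kind a non-ReLU activation might produce:

```python
import numpy as np

def symmetric_params(x, num_bits=8):
    # Symmetric mode: zero-point fixed at 0, scale covers the max magnitude.
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for int8
    scale = float(np.max(np.abs(x))) / qmax
    return scale, 0

def asymmetric_params(x, num_bits=8):
    # Asymmetric mode: the full [min, max] range is mapped onto [0, 255].
    qmax = 2 ** num_bits - 1                        # 255 for uint8
    lo, hi = float(np.min(x)), float(np.max(x))
    scale = (hi - lo) / qmax
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin, qmax):
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# A skewed range, mostly positive with a small negative tail:
x = np.array([-0.2, 0.0, 0.5, 1.0, 2.0], dtype=np.float32)

s_scale, s_zp = symmetric_params(x)
a_scale, a_zp = asymmetric_params(x)
x_sym = dequantize(quantize(x, s_scale, s_zp, -128, 127), s_scale, s_zp)
x_asym = dequantize(quantize(x, a_scale, a_zp, 0, 255), a_scale, a_zp)
```

On this skewed range the asymmetric mode spends all 256 levels on the observed [min, max] interval, so its round-trip error is smaller, which matches the accuracy argument above; the symmetric mode wastes part of its range on negative values that never occur.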
The Post-training Optimization Tool (POT) is a re-designed version of our previous Calibration tool and will be released in the Intel Distribution of OpenVINO toolkit version 2020.1. The main purpose of this tool is to perform model optimizations after training. As we discussed in previous posts, post-training optimization is attractive due to its streamlined development process, which does not require fine-tuning.
We had several main objectives for the tool redesign.
The primary goal of this release is INT8 quantization, which is supported by next-generation Intel architecture, including Intel Xeon Scalable processors with Intel Deep Learning Boost. INT8 leverages the compounding performance gains of both hardware and software improvements. All quantization features are available on the command line or through the Intel Distribution of OpenVINO toolkit’s visual interface, the Deep Learning Workbench. The general flow remains the same: the tool accepts the intermediate representation (IR) of the trained model and a dataset as input, and produces a quantized IR that can be consumed by the Inference Engine in the same way as any other IR. This simplifies the deployment of low-precision applications.
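To show what "IR plus dataset in, quantized IR out" looks like in practice, here is a sketch of a POT configuration, written as a Python dict for readability (the tool itself consumes a JSON file). The field names follow the general shape of the POT config, but they may differ between releases, so consult the POT documentation for the exact schema:

```python
# Illustrative POT configuration sketch; file names and field names are
# assumptions for this example, not copied from the release documentation.
pot_config = {
    "model": {
        "model_name": "resnet50_int8",
        "model": "resnet50.xml",        # FP32 IR produced by Model Optimizer
        "weights": "resnet50.bin",
    },
    "compression": {
        "target_device": "CPU",
        "algorithms": [
            {
                "name": "DefaultQuantization",
                "params": {
                    "preset": "performance",   # symmetric quantization
                    "stat_subset_size": 300,   # calibration samples to collect stats on
                },
            }
        ],
    },
}
```

The output of the tool is a new `.xml`/`.bin` IR pair with FakeQuantize operations inserted, which the Inference Engine loads exactly like the original FP32 IR.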
The Post-training Optimization Tool provides multiple quantization and accompanying algorithms that help restore accuracy after quantizing weights and activations. These algorithms can form independent optimization pipelines that are applied to quantize one or multiple models. In the 2020.1 release, we focused on providing two proven combinations of algorithms: the Default Quantization pipeline and the Accuracy Aware pipeline, described below.
The Default Quantization pipeline is designed to do a fast, accurate 8-bit quantization of neural networks. It is a pipeline of three algorithms that are applied to the model sequentially.
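The sketch below is a generic illustration of how such a pipeline composes: independent stages applied to the model one after another. The stages shown, a toy min/max weight quantization and a placeholder bias correction, are simplified stand-ins for illustration, not the actual POT algorithms:

```python
import numpy as np

def min_max_quantize_weights(model):
    # Toy stage: pick a per-tensor symmetric int8 range from the observed
    # min/max and store the quantize-dequantize'd weights.
    for name, w in model["weights"].items():
        scale = max(abs(w.min()), abs(w.max())) / 127.0
        model["weights"][name] = np.clip(np.round(w / scale), -128, 127) * scale
    return model

def bias_correction(model):
    # Placeholder for a stage that would compensate the quantization-induced
    # shift in layer outputs (simplified stand-in).
    return model

class Pipeline:
    """Applies its stages to the model sequentially."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, model):
        for stage in self.stages:
            model = stage(model)
        return model

model = {"weights": {"conv1": np.array([-1.0, 0.25, 0.5, 1.0])}}
quantized = Pipeline([min_max_quantize_weights, bias_correction]).run(model)
```

The point of the pipeline abstraction is that stages can be recombined: the Accuracy Aware pipeline described next reuses the same quantization stages inside an accuracy-checking loop.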
The Accuracy Aware pipeline is designed to perform accurate 8-bit quantization while keeping the accuracy drop within a predefined budget, such as 1%. This may cost some performance compared to the Default Quantization pipeline, because some layers can be reverted back to the original precision. Generally, the pipeline first applies the default quantization algorithms and then iteratively reverts the layers that contribute most to the accuracy drop until the drop fits the predefined budget.
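A hedged sketch of that loop, with invented layer names and per-layer accuracy impacts (the real tool measures accuracy on a validation set rather than using known impacts):

```python
def accuracy_aware(layers, impact, fp32_accuracy, evaluate, max_drop=0.01):
    """Quantize everything, then revert the most damaging layers to FP32
    until the accuracy drop fits within max_drop."""
    quantized = set(layers)                      # start fully quantized
    # Visit layers in order of how much accuracy they cost when quantized.
    for layer in sorted(layers, key=lambda l: impact[l], reverse=True):
        if fp32_accuracy - evaluate(quantized) <= max_drop:
            break                                # accuracy budget met
        quantized.discard(layer)                 # revert this layer to FP32
    return quantized

layers = ["conv1", "conv2", "fc"]
impact = {"conv1": 0.004, "conv2": 0.015, "fc": 0.002}   # invented accuracy costs
fp32_acc = 0.76

def evaluate(quantized):
    # Toy evaluator: accuracy drops by the summed impact of quantized layers.
    return fp32_acc - sum(impact[l] for l in quantized)

kept = accuracy_aware(layers, impact, fp32_acc, evaluate)
# conv2, the most damaging layer, is reverted; conv1 and fc stay quantized.
```

Because only the most sensitive layers fall back to FP32, most of the model still runs in INT8, which is why the performance cost relative to the Default Quantization pipeline is usually modest.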
Comparison with other frameworks will be conducted and performance benchmarks will be posted on docs.openvinotoolkit.org.
To provide training capabilities to the OpenVINO community, we are releasing support of low-precision models in the Neural Network Compression Framework (NNCF) which is a part of OpenVINO Training Extensions. These Training Extensions are intended to streamline the development of deep learning models and accelerate the time-to-inference. NNCF is built on top of the PyTorch framework and supports a wide range of DL models for various use cases. It also implements quantization-aware training supporting different quantization modes and settings.
One of the most important features of NNCF is automatic graph transformation: when the model is wrapped, the additional layers required for quantization-aware fine-tuning are inserted automatically. This simplifies the quantization process because the user is not required to be an expert in the quantization flow. To modify a custom training pipeline so that it produces a compressed network, typically only 10-15 lines need to be added to the user’s PyTorch code. In most cases the model is able to restore the original FP32 accuracy after several epochs of fine-tuning. When fine-tuning finishes, the model can be exported to the ONNX format, which can then be used via the regular OpenVINO flow, i.e., Model Optimizer and Inference Engine.
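As an illustration of how small the change is, a minimal NNCF-style setup might look like the following. The config keys and the wrapping call follow NNCF's documented usage, but treat them as illustrative and consult the NNCF repository for the current API (the training-loop lines are shown as comments since they require torch and nncf to run):

```python
# Minimal NNCF-style configuration for 8-bit quantization-aware training.
# Key names are assumptions based on NNCF's documented config format.
nncf_config = {
    "input_info": {"sample_size": [1, 3, 224, 224]},   # model input shape
    "compression": {"algorithm": "quantization"},
}

# In a real pipeline (requires torch and nncf, not executed here):
#   compression_ctrl, model = create_compressed_model(model, nncf_config)
#   ... run the existing training loop on `model` for a few epochs ...
#   compression_ctrl.export_model("model_int8.onnx")
```

The wrapped model trains with the same optimizer and loss as before; the inserted fake-quantization layers simply make the forward pass see quantized values so the weights adapt to them.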
For more details about NNCF QAT and supported models please refer to the framework documentation on GitHub.
Two different quantization paths pose a challenge for unified model representation and execution. OpenVINO represents models quantized through frameworks and via post-training quantization with the FakeQuantize primitive. It can represent different types of operations, such as Quantize, Dequantize, Re-Quantize, or even QuantizeDequantize, due to its ability to map an input range to an arbitrary output range. This means that most quantized models can be expressed using this operation, no matter whether a model was obtained using QAT or post-training methods.
It is necessary to perform several graph transformation passes on a quantized network to convert it into a form suitable for low-precision inference. The FakeQuantize primitive represents two consecutive operations: quantization, which produces integer values in the [0, 255] interval, and dequantization, which returns these values back to the floating-point range. In the first stage, the runtime splits FakeQuantize into these two consecutive operations. The second stage attempts to optimize dequantization by propagating it down through the execution graph and fusing it with other layers using equivalent mathematical transformations. In the last pass, pattern-specific optimizations are applied. The transformation passes component is common for all target devices but can be configured to take into account the features of a particular device.
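The first stage of the split can be illustrated in NumPy: a single FakeQuantize op is mathematically identical to its quantize half (integer values in [0, 255]) followed by its dequantize half. The input range here is invented for the example:

```python
import numpy as np

def fake_quantize(x, in_lo, in_hi, levels=256):
    # FakeQuantize as one op: snap [in_lo, in_hi] onto `levels` discrete
    # points and map back to floating point; out-of-range inputs saturate.
    scale = (in_hi - in_lo) / (levels - 1)
    q = np.clip(np.round((x - in_lo) / scale), 0, levels - 1)
    return q * scale + in_lo

def quantize(x, in_lo, in_hi, levels=256):
    # First half after the split: integer values in [0, levels - 1].
    scale = (in_hi - in_lo) / (levels - 1)
    return np.clip(np.round((x - in_lo) / scale), 0, levels - 1)

def dequantize(q, in_lo, in_hi, levels=256):
    # Second half: back to the floating-point range. This is the part the
    # runtime propagates down the graph and fuses into subsequent layers.
    scale = (in_hi - in_lo) / (levels - 1)
    return q * scale + in_lo

x = np.linspace(-1.5, 1.5, 7).astype(np.float32)
split = dequantize(quantize(x, -1.0, 1.0), -1.0, 1.0)
fused = fake_quantize(x, -1.0, 1.0)
# The split form is numerically identical to the single FakeQuantize op.
```

Once dequantization is isolated, pushing it past a linear layer is just an application of distributivity, which is what makes the fusion passes equivalent mathematical transformations rather than approximations.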
The new Post-training Optimization Tool in the Intel Distribution of OpenVINO toolkit 2020.1 enables significant acceleration with little or no degradation in accuracy using model quantization. This enhanced pipeline reduces model size while streamlining the development process, with no model re-training or fine-tuning required. Accelerate deep learning inference on Intel architecture platforms today by using the Intel Distribution of OpenVINO toolkit.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No product can be absolutely secure.
OPTIMIZATION NOTICE: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #2010804
Intel, the Intel logo, OpenVINO, and other Intel marks are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. ©Intel Corporation 2020