One of the biggest challenges in AI is achieving high-performance deep learning inference that runs at real-world scale on existing infrastructure. Combining efficiency and flexibility, the Intel® Distribution of OpenVINO™ toolkit (a developer tool suite whose name stands for Open Visual Inference and Neural Network Optimization) accelerates high-performance deep learning inference deployments.
Latency, or the execution time of a single inference, is critical for real-time services. Typically, approaches to minimizing latency focus on the performance of single inference requests, limiting parallelism to the individual input instance. This often means that real-time inference applications cannot take advantage of the computational efficiencies that batching (combining many input images to achieve optimal throughput) provides, because large batch sizes come with a latency penalty. To address this gap, the latest release of the Intel Distribution of OpenVINO toolkit includes a CPU “throughput” mode.
This new mode allows efficient parallel execution of multiple inference requests on the same CNN, greatly improving throughput. In addition to the reuse of filter weights in convolutional operations (also available with batching), the finer execution granularity of the new mode further improves cache utilization. In “throughput” mode, CPU cores are evenly distributed between parallel inference requests, following the general “parallelize the outermost loop first” rule of thumb. This also greatly reduces the amount of scheduling and synchronization compared to a latency-oriented approach, in which every CNN operation is parallelized internally over the full number of CPU cores.
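To make the "evenly distributed" idea concrete, here is a minimal Python sketch of splitting CPU cores between parallel inference requests. It only illustrates the partitioning rule and is not the toolkit's actual scheduler; the function name `distribute_cores` and the choice of contiguous core groups are assumptions made for this example.

```python
from typing import List


def distribute_cores(num_cores: int, num_requests: int) -> List[List[int]]:
    """Split core indices 0..num_cores-1 into num_requests contiguous groups.

    Hypothetical helper: group sizes differ by at most one core, so every
    parallel request gets a near-equal share of the machine.
    """
    if num_requests < 1 or num_requests > num_cores:
        raise ValueError("need 1 <= num_requests <= num_cores")

    base, extra = divmod(num_cores, num_requests)
    groups = []
    start = 0
    for i in range(num_requests):
        # The first `extra` groups absorb the remainder, one core each.
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


if __name__ == "__main__":
    # One request owns every core (latency-oriented); four requests get two cores each.
    print(distribute_cores(8, 1))
    print(distribute_cores(8, 4))
```

With a single request, every operation has to be parallelized across all cores; with several requests, each one runs on its own smaller group, which is where the reduced scheduling and synchronization comes from.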
The resulting speedup from the new mode is particularly strong on high-end servers, but also significant on other Intel® architecture-based systems, as shown in Table 1.
|Topology\Machine|Dual-Socket Intel® Xeon® Platinum 8180 Processor|Intel® Core™ i7-8700K Processor|
Together with the general threading refactoring also introduced in the R5 release, this means the toolkit does not require tuning OMP_NUM_THREADS, KMP_AFFINITY, and other machine-specific settings to achieve these performance improvements; they can be realized with the “out of the box” Intel Distribution of OpenVINO toolkit configuration.
Beyond raw performance, additional advantages of the toolkit’s “throughput” mode include:
Let’s measure latency versus throughput for the approaches discussed in this post. The figure below compares the new “throughput” mode of the Intel Distribution of OpenVINO toolkit with the conventional approach that uses batching:
A few observations on the figure above:
To simplify benchmarking, the Intel Distribution of OpenVINO toolkit features a dedicated Benchmark App that lets you vary the number of inference requests running in parallel from the command line. The rule of thumb is to test up to the number of CPU cores in your machine. For example, on an 8-core processor, compare the performance of “-nireq 1” (a latency-oriented scenario with a single request) with that of 2, 4, and 8 requests. In addition to the number of inference requests, you can also vary the batch size from the command line to find the throughput sweet spot.
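The sweep described above can be scripted. The following is a rough Python sketch that only builds the Benchmark App command lines and does not run them. The model path `model.xml` and the helper names are hypothetical. The flags `-m`, `-d`, `-api`, `-nireq`, and `-b` are assumed to match your release, so check them against the output of `benchmark_app -h` (some releases also require an input via `-i`). Adding the core count itself when it is not a power of two is a choice made for this example.

```python
import shlex
from typing import List


def nireq_sweep(num_cores: int) -> List[int]:
    """Request counts to try: 1, 2, 4, ... up to the number of CPU cores."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    values = []
    n = 1
    while n <= num_cores:
        values.append(n)
        n *= 2
    # Assumption for this sketch: also test the full core count
    # when it is not itself a power of two (e.g. 6 cores).
    if values[-1] != num_cores:
        values.append(num_cores)
    return values


def benchmark_commands(model_path: str, num_cores: int,
                       batch_size: int = 1,
                       app: str = "benchmark_app") -> List[List[str]]:
    """Build one Benchmark App command per request count (nothing is executed)."""
    return [
        [app, "-m", model_path, "-d", "CPU", "-api", "async",
         "-nireq", str(n), "-b", str(batch_size)]
        for n in nireq_sweep(num_cores)
    ]


if __name__ == "__main__":
    # "model.xml" is a placeholder path, not a file shipped with the toolkit.
    for cmd in benchmark_commands("model.xml", num_cores=8):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Running each printed command and comparing the reported throughput and latency shows where additional parallel requests stop paying off on a given machine. Repeating the sweep with a different `batch_size` covers the batch-size dimension mentioned above.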
The new CPU “throughput mode” in the Intel Distribution of OpenVINO toolkit enables support for finer execution granularity for throughput-oriented inference scenarios. This brings a significant performance boost for both data centers and inference at the network edge.
We discussed other CPU-specific features in the latest Intel Distribution of OpenVINO toolkit release in a previous blog post, including post-training quantization and support for int8 model inference on Intel® processors. The toolkit’s throughput mode is fully compatible with int8 and brings further performance improvements.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Intel® Xeon® Platinum 8180 Processor @ 2.50 GHz with 32 GB of memory, OS: Ubuntu 16.04, kernel: 4.4.0-87-generic
Intel® Core™ i7-8700 Processor @ 3.20 GHz with 16 GB of memory, OS: Ubuntu 16.04.3 LTS, kernel: 4.15.0-29-generic
Performance results are based on testing as of December 18, 2018 by Intel Corporation and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice Revision #20110804
Intel, the Intel logo, Xeon, Core, and Atom are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.