Inference in AI is the process of evaluating a trained neural network model on real-world samples to gain useful information. Inference is used in all AI application domains – object detection, image classification and segmentation, speech recognition, machine translation, and others. Recent studies show that the demand for inference requests in data centers and cloud-based services is expected to grow significantly in the coming years. A majority of data centers run on CPUs today, so we want to ensure inference workloads run on CPUs with the highest efficiency.
Intel AI researchers have already shown that running multiple instances affinitized to a subset of cores in an Intel® processor can help to scale inference efficiency. This mechanism is called multi-streaming. This technique leads to better core utilization and also localizes memory accesses to the memory channels of the CPU socket to which the cores belong.
In this work, we discuss parallel batching – a way to further boost multi-stream performance by creating child processes within each inference stream. This technique is especially helpful in neural network models where the input minibatches can contain data of varying lengths. A good example is raw text. A batch of sentences to be translated from English to German can contain the following examples:
“Behind every exquisite thing that existed, there was something tragic”
“Increased safety for pedestrians”
“Sleepless in New York”
Typically, the sentences in a batch are padded to the length of the longest sentence; in this example, all the sentences would be padded to the length of the first (longest) sentence. This wastes compute cycles on padding tokens. Hence, the input sentences are sorted so that sentences of similar lengths are grouped together. Even then, when the sorted batches are processed sequentially, the batches with short sentences tend to underutilize the CPU cores.
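To make the padding cost concrete, here is a small sketch of padding versus length-sorted batching (the token sequences and helper names are illustrative assumptions, not part of the original implementation):

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in a batch to the longest sequence in that batch."""
    max_len = max(len(s) for s in batch)
    return [s + [pad_id] * (max_len - len(s)) for s in batch]

def length_sorted_batches(seqs, batch_size):
    """Sort by length first, so each batch holds similarly sized sequences."""
    ordered = sorted(seqs, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def batch_slots(batches):
    """Total compute slots (real tokens plus padding) across padded batches."""
    return sum(len(b) * max(len(s) for s in b) for b in batches)

# Token sequences of lengths 10, 4, and 4, loosely modeled on the sentences above.
seqs = [list(range(10)), list(range(4)), list(range(4))]

print(batch_slots([seqs]))                          # one mixed batch: 3 * 10 = 30 slots
print(batch_slots(length_sorted_batches(seqs, 2)))  # [4, 4] + [10]: 8 + 10 = 18 slots
```

With only 18 real tokens in the dataset, the single mixed batch spends 12 of its 30 slots on padding, while the length-sorted batching here wastes none.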
Our approach, known as parallel batching, solves this problem by packing inference requests together, spawning multiple processes within each inference stream, as shown in Figure 1.
The main motivation for our parallel batching technique is to take into account the performance differences between batches, which occur due to their varying sequence lengths and, hence, varying compute requirements, as shown in Figure 2.
Multiple parallel batching techniques are offered by the TensorFlow* serving APIs and the TensorFlow batch function. However, these techniques do not consider the varying batch times for resource allocation. Serially executing these batches is undesirable, as batches with shorter sentences fail to utilize the cores efficiently. One way to improve efficiency is to pack batches of longer sentences in parallel with successive batches of shorter sentences. In addition, we affinitize the processes to mutually exclusive cores.
Our methodology in TensorFlow is shown in algorithm 1:
Algorithm 1: Parallel batching
Input: Dataset χ
Input: Mini batch size b
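Only the inputs of the listing survive in this copy. As a rough illustration of the approach described above, here is a minimal Python sketch; the helper names, the placeholder worker body, and the core-numbering scheme are assumptions, not the original implementation:

```python
import multiprocessing as mp
import os

def run_batch(args):
    """Worker: pin this process to its core group, then run inference on one batch."""
    batch, cores = args
    try:
        # Restrict this process to its mutually exclusive core group.
        os.sched_setaffinity(0, cores & os.sched_getaffinity(0))
    except (AttributeError, OSError, ValueError):
        pass  # affinity control unavailable on this platform; run unpinned
    # A real worker would invoke the TensorFlow translation model here;
    # returning the batch size keeps the sketch self-contained.
    return len(batch)

def parallel_batching(batches, num_procs, cores_per_proc):
    """Run successive batches in parallel, one child process per core group."""
    # Carve the stream's cores into mutually exclusive groups, one per process.
    groups = [set(range(p * cores_per_proc, (p + 1) * cores_per_proc))
              for p in range(num_procs)]
    results = []
    with mp.Pool(num_procs) as pool:
        # Pack each batch of longer sentences with successive shorter-sentence
        # batches, so they execute concurrently instead of serially.
        for i in range(0, len(batches), num_procs):
            chunk = batches[i:i + num_procs]
            results.extend(pool.map(run_batch,
                                    [(b, groups[j]) for j, b in enumerate(chunk)]))
    return results
```

For the experiments below, this would correspond to four worker processes per inference stream, each pinned to a disjoint slice of the stream's cores.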
Experimental setup: To test our methodology, we used an Int8-quantized Transformer English-German language translation model in TensorFlow. We ran our tests on a two-socket Intel® Xeon® Platinum 8268 processor system with 24 cores per socket. We used the newstest2014 dataset with 3003 sentences. Throughput is reported as seq/s, or the total number of sequences translated relative to the total time taken to process all the batches.
The dataset consisting of 3003 sentences is sorted based on the number of tokens per sentence and batched into sizes of 64. The relative performance improvement with the parallel batching technique is shown in Figure 3. We tested with two inference streams per CPU node and four processes per inference stream.
The comparison of the serial execution and parallel batching for the 47 input batches is shown in Figure 4. We observe a relative improvement of 43% in throughput with the parallel batching technique over the serial baseline, due to higher core utilization while, at the same time, making full use of the available memory bandwidth.
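The 43% figure follows directly from the raw times listed in the configuration disclosures at the end of this article (163 s serial, 114 s parallel batching, for all 3003 sentences); a quick check:

```python
# Sanity check of the reported throughput numbers.
sentences = 3003
batch_size = 64
num_batches = -(sentences // -batch_size)  # ceil(3003 / 64) = 47 batches

serial_tput = sentences / 163              # ~18.4 seq/s (Configuration 1)
parallel_tput = sentences / 114            # ~26.3 seq/s (Configuration 2)
gain = parallel_tput / serial_tput - 1     # 163/114 - 1, i.e. ~43%

print(f"{num_batches} batches: {serial_tput:.1f} -> {parallel_tput:.1f} seq/s (+{gain:.0%})")
```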
We have presented an inference technique known as parallel batching. Using this technique, we are able to improve the efficiency of inference requests and achieve higher throughput by balancing the compute and memory bandwidth requirements. This is demonstrated by a 43% relative throughput performance improvement for inference on the Transformer-LT benchmarks.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit www.intel.com/benchmarks.
Configuration 1 (serial execution): Intel® Xeon® Platinum 8268 Processor with 35.75M Cache, 2.90 GHz. DDR memory configuration: 12 slots / 32GB / 2933 run speed. Total Memory: 384GB. Storage: 1x Intel 480GB SSD OS Drive. OS: CentOS Linux release 7.4.1708 (Core). Kernel: 3.10.0-693.17.1.el7.x86_64. Spectre meltdown checker: Mitigated. Workload & version: Transformer LT with Int8 quantization. Compiler: GCC 7.2.0. Libraries: Intel® MKL. Frameworks: TensorFlow. Dataset: newstest2014: 3003 Eng-Ger Sentence Pairs. Topology: Int8 quantized version of TensorFlow Official Transformer-LT Base. Batch size: 64. Raw Results (units): 163s.
Configuration 2 (parallel batching): Intel® Xeon® Platinum 8268 Processor with 35.75M Cache, 2.90 GHz. DDR memory configuration: 12 slots / 32GB / 2933 run speed. Total Memory: 384GB. Storage: 1x Intel 480GB SSD OS Drive. OS: CentOS Linux release 7.4.1708 (Core). Kernel: 3.10.0-693.17.1.el7.x86_64. Spectre meltdown checker: Mitigated. Workload & version: Transformer LT with Int8 quantization. Compiler: GCC 7.2.0. Libraries: Intel® MKL. Frameworks: TensorFlow. Dataset: newstest2014: 3003 Eng-Ger Sentence Pairs. Topology: Int8 quantized version of TensorFlow Official Transformer-LT Base. Batch size: 64. Raw Results (units): 114s.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Performance results are based on testing done by Intel Corporation as of February 25, 2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation