Improving Inference Efficiency on CPUs with Parallel Batching

Inference in AI is the process of evaluating a trained neural network model on real-world samples to gain useful information. Inference is used in all AI application domains, including object detection, image classification and segmentation, speech recognition, and machine translation. Recent studies show that the demand for inference requests in data centers and cloud-based services is expected to grow significantly in the coming years. A majority of data centers run on CPUs today, so we want to ensure that inference workloads run on CPUs with the highest efficiency.

Intel AI researchers have already shown that running multiple instances affinitized to a subset of cores in an Intel® processor can help to scale inference efficiency. This mechanism is called multi-streaming. This technique leads to better core utilization and also localizes memory accesses to the memory channels of the CPU socket to which the cores belong.

In this work, we discuss parallel batching, a way to further boost multi-stream performance by creating child processes within each inference stream. This technique is especially helpful for neural network models whose input minibatches can contain data of varying lengths. A good example is raw text. A batch of four sentences to be translated from English to German can contain the following example sentences:

“Hello world”
“Behind every exquisite thing that existed, there was something tragic”
“Increased safety for pedestrians”
“Sleepless in New York”

Typically, sentences in a batch are padded to the length of the longest sentence. In this example, all the sentences would be padded to the length of the second sentence, which wastes compute cycles on padding tokens. To reduce this waste, the input sentences are sorted so that sentences of similar length are grouped into the same batch. However, when the sorted batches are then processed sequentially, the batches with short sentences tend to underutilize the CPU cores.
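To make the cost of padding concrete, the short Python sketch below counts tokens for the four example sentences. It is only an illustration: whitespace splitting stands in for a real tokenizer, and the helper names `token_count` and `padded_cost` are hypothetical.

```python
# Whitespace splitting stands in for a real tokenizer (an assumption for
# illustration; a translation model would use its own subword vocabulary).
def token_count(sentence):
    return len(sentence.split())


def padded_cost(batch):
    """Tokens processed when every sentence is padded to the longest one."""
    return max(token_count(s) for s in batch) * len(batch)


sentences = [
    "Hello world",
    "Behind every exquisite thing that existed, there was something tragic",
    "Increased safety for pedestrians",
    "Sleepless in New York",
]

real_tokens = sum(token_count(s) for s in sentences)

# One batch of four: everything is padded to the 10-token second sentence.
unsorted_cost = padded_cost(sentences)

# Sort by length, then form two batches of two similar-length sentences.
ordered = sorted(sentences, key=token_count)
sorted_cost = padded_cost(ordered[:2]) + padded_cost(ordered[2:])

print(real_tokens, unsorted_cost, sorted_cost)  # prints: 20 40 28
```

Under these assumptions, half of the 40 padded tokens in the single batch are padding, while grouping by length brings the total down to 28.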

Our approach, known as parallel batching, addresses this problem by spawning multiple processes within each inference stream so that several batches are processed in parallel, as shown in Figure 1.

Figure 1: Process Pinning.

Parallel Batching

The main motivation for our parallel batching technique is to take into account the performance differences between batches, which occur because batches vary in sentence length and therefore in the compute they require, as shown in Figure 2.

Figure 2: Batch processing times.

Multiple parallel batching techniques are offered by the TensorFlow* serving APIs and the TensorFlow batch function. However, these techniques do not consider the varying batch times for resource allocation. Serially executing these batches is undesirable as batches with shorter sentences fail to utilize the cores efficiently. One way to improve the efficiency of the batches is to pack them in parallel with successive batches of shorter sentences. In addition, we affinitize the processes to mutually exclusive cores.

Our methodology in TensorFlow is shown in Algorithm 1:

  • Each inference stream has its own batch queue. This avoids the overhead of creating a new TensorFlow session for every inference batch.
  • For each inference stream, we create multiple child processes and use CPU affinity settings to pin each child process to a different set of cores within the set of cores assigned to that inference stream.
  • The child processes dequeue batches asynchronously from the batch queue and perform inference.
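The nested core assignment described above can be sketched as a simple partitioning step. The snippet below is a minimal illustration, not the authors' code: `split_cores` is a hypothetical helper, and the 24 consecutive core IDs, two streams, and four child processes per stream are example values chosen for the sketch.

```python
def split_cores(cores, n_parts):
    """Split a list of core IDs into n_parts contiguous, mutually exclusive groups."""
    if len(cores) % n_parts != 0:
        raise ValueError("this sketch expects the cores to divide evenly")
    size = len(cores) // n_parts
    return [cores[i * size:(i + 1) * size] for i in range(n_parts)]


# Hypothetical layout: core IDs 0..23 belong to one socket.
socket_cores = list(range(24))

# First level: one group of cores per inference stream.
stream_cores = split_cores(socket_cores, 2)

# Second level: within each stream, one group of cores per child process.
child_cores = [split_cores(cores, 4) for cores in stream_cores]

for stream_id, groups in enumerate(child_cores):
    print(f"stream {stream_id}: {groups}")
```

Real core numbering depends on the machine (for example, on how hyperthreading siblings are enumerated), so the ID lists should be derived from the actual topology rather than assumed to be consecutive.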

Algorithm 1: Parallel batching
Input: Dataset χ
Input: Mini batch size b

  1. Create an input batch queue
  2. Create child processes
  3. Affinitize the child processes to different sets of cores through CPU affinity settings
  4. Add B input batches with batch size b to the input queue
  5. For i = 0 to B − 1 do
  6.     Each child process gets the next batch from the input queue
  7.     Perform inference on batch i
  8. End for
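The steps of Algorithm 1 can be sketched for a single inference stream with Python's standard `multiprocessing` module. This is a sketch under stated assumptions rather than the original implementation: `fake_inference` is a hypothetical stand-in for the TensorFlow model call, `child` and `run_parallel_batching` are illustrative names, and pinning uses `os.sched_setaffinity`, which is only available on Linux.

```python
import multiprocessing as mp
import os


def fake_inference(batch):
    """Hypothetical stand-in for the model: returns a token count per sentence."""
    return [len(sentence.split()) for sentence in batch]


def child(core_ids, batch_queue, result_queue):
    # Step 3: pin this process to its own cores (Linux only; skipped elsewhere).
    if hasattr(os, "sched_setaffinity"):
        wanted = set(core_ids) & os.sched_getaffinity(0)
        if wanted:
            os.sched_setaffinity(0, wanted)
    # A real child would create its TensorFlow session once here and reuse it.
    while True:
        item = batch_queue.get()          # Step 6: take the next batch
        if item is None:                  # sentinel: no more batches
            break
        index, batch = item
        result_queue.put((index, fake_inference(batch)))  # Step 7


def run_parallel_batching(batches, core_sets):
    # Prefer "fork" where it exists so the sketch also works when pasted into
    # a script without a __main__ guard; otherwise use the platform default.
    method = "fork" if "fork" in mp.get_all_start_methods() else None
    ctx = mp.get_context(method)
    batch_queue = ctx.Queue()             # Step 1
    result_queue = ctx.Queue()
    children = [
        ctx.Process(target=child, args=(cores, batch_queue, result_queue))
        for cores in core_sets            # Step 2: one child per core set
    ]
    for proc in children:
        proc.start()
    for item in enumerate(batches):       # Step 4
        batch_queue.put(item)
    for _ in children:
        batch_queue.put(None)
    # Drain results before joining so no child blocks on a full pipe.
    results = dict(result_queue.get(timeout=60) for _ in batches)
    for proc in children:
        proc.join()
    return [results[i] for i in range(len(batches))]


if __name__ == "__main__":
    demo = [["Hello world"], ["Sleepless in New York", "Increased safety"]]
    print(run_parallel_batching(demo, core_sets=[[0], [1]]))
```

Because each child pulls from the shared queue as soon as it is free, a child that draws a batch of short sentences simply moves on to the next batch instead of leaving its cores idle.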

Performance Results

Experimental setup: To test our methodology, we used an Int8 quantized Transformer English-German language translation model in TensorFlow. We ran our tests on a two-socket Intel® Xeon® Platinum 8268 system with 24 cores per socket. We used the newstest2014 dataset with 3003 sentences. Throughput is reported in sequences per second (seq/s): the total number of sequences translated divided by the total time taken to process all the batches.
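As a small illustration of the throughput metric, the sketch below times a run over all batches and divides the number of sequences by the elapsed time. The function name `measure_throughput`, the `infer` callable, and the injectable `clock` parameter are assumptions made for this example.

```python
import time


def measure_throughput(batches, infer, clock=time.perf_counter):
    """Return sequences per second over the whole run (all batches)."""
    start = clock()
    n_sequences = 0
    for batch in batches:
        infer(batch)                  # hypothetical model call
        n_sequences += len(batch)
    elapsed = clock() - start
    return n_sequences / elapsed
```

Timing the whole run, rather than averaging per-batch rates, matches the definition above and keeps slow batches from being underweighted.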

Figure 3: Parallel Batching Performance Improvement.

The dataset of 3003 sentences is sorted by the number of tokens per sentence and split into batches of 64 sentences. The relative performance improvement with the parallel batching technique is shown in Figure 3. We tested with two inference streams per CPU node and four processes per inference stream.
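The batch count that follows from these numbers can be checked with a few lines of Python. The helper `batch_sizes` is a hypothetical name used only for this arithmetic.

```python
def batch_sizes(n_items, batch_size):
    """Sizes of the batches obtained by chunking n_items into batch_size pieces."""
    full, rest = divmod(n_items, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


sizes = batch_sizes(3003, 64)
print(len(sizes), sizes[-1])  # prints: 47 59
```

So 3003 sentences yield 46 full batches of 64 and one final batch of 59, for 47 batches in total.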

The comparison of serial execution and parallel batching for the 47 input batches is shown in Figure 4. We observe a 43% relative improvement in throughput with parallel batching over the serial baseline, due to higher core utilization while, at the same time, maxing out the memory bandwidth.

Figure 4: Comparison of the serial and parallel execution techniques.

We have presented an inference technique known as parallel batching. Using this technique, we are able to improve the efficiency of inference requests and achieve higher throughput by balancing the compute and memory bandwidth requirements. This is demonstrated by a 43% relative throughput performance improvement for inference on the Transformer-LT benchmarks.
