Intel® Nervana™ NNP-I Shows Best-in-Class Throughput on BERT NLP Model

Natural Language Processing (NLP) is projected to be a 43 billion dollar business by 2025. Cutting-edge models like Google’s BERT (Bidirectional Encoder Representations from Transformers) are poised to accelerate the adoption of NLP by helping computers understand language more like humans do. At CES 2020, we revealed that our Intel® Nervana™ Neural Network Processor for Inference (NNP-I) runs BERT-base up to 1.6x faster than the Nvidia T4 (within a 75W power envelope). By optimizing BERT and other NLP methods on Intel architecture, customers will be able to efficiently deploy new services and products in this quickly growing market.

Background on BERT

First introduced by Google in 2018, BERT is a pre-trained deep learning model that delivers state-of-the-art results on many natural language processing (NLP) tasks like Q&A, sequence tagging, and sentiment extraction (Fig. 1).

Working with BERT involves three steps:

  1. Pre-training: This is where the heavy computing occurs. In this step, BERT is pre-trained on a masked language modeling task in which words are either hidden or swapped with other random words. Training consumes very large amounts of unlabeled text (e.g., the entire Wikipedia dataset) in a semi-supervised manner, so no data labeling is needed. This step is done once to create the model, which is then used for fine-tuning and inference.
  2. Fine-tuning: This step consists of reasonably fast, supervised learning of specific tasks, such as question answering or sentiment analysis, using a small amount of in-domain labeled data.
  3. Inference: During inference, the fine-tuned model is loaded, and prediction is invoked. The fine-tuned model is based on the extremely large pre-trained model (BERT-base contains 110 million parameters). Therefore, a very large feed-forward calculation occurs, which is computationally intensive compared to traditional supervised learning inference.
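
The masking idea behind the pre-training step can be sketched in a few lines of Python. This is a minimal illustration, not the actual BERT data pipeline: the function name `mask_tokens`, the toy vocabulary, and the whitespace tokenization are assumptions made for the example. The 15% selection rate and the 80/10/10 split between hiding, swapping, and keeping a word follow the original BERT paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked language modeling (illustrative sketch).

    Returns (corrupted, labels). labels[i] holds the original token at the
    positions the model must predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # hide the word
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # swap in a random word
            else:
                corrupted.append(tok)                # leave the word unchanged
        else:
            labels.append(None)                      # not a prediction target
            corrupted.append(tok)
    return corrupted, labels

# Hypothetical toy input; real pre-training uses subword tokens and a large corpus.
sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.3, seed=1)
```

Because the labels come from the text itself, no human annotation is needed, which is what lets this step scale to very large corpora.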

Figure 1: Transfer Learning and BERT; POS = Positive; NEG = Negative.

BERT Performance on Intel Nervana NNP-I

A major factor in the 1.6x performance advantage over the Nvidia T4 is the combination of the NNP-I hardware architecture and the 8-bit quantization (Q8BERT) that Intel researchers presented last fall, which achieved a best-in-class accuracy-compression ratio for BERT-base. The recipe is available in Intel’s NLP Architect library, which builds on Hugging Face’s Transformers API.
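
To give a feel for what 8-bit quantization does to a weight tensor, here is a minimal sketch of symmetric linear quantization in plain Python. It is an illustration under stated assumptions, not the Q8BERT recipe itself: the function names and sample weights are made up for the example, and the sketch shows only the float-to-int8 mapping, not the quantization-aware fine-tuning used to preserve accuracy.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit representation."""
    return [q * scale for q in quantized]

# Hypothetical weights, chosen only to make the mapping easy to follow.
weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now occupies one byte instead of four, and the matrix multiplications can run on integer hardware, at the cost of a small, bounded rounding error.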

For this performance benchmark comparison, we optimized the BERT implementation for maximum throughput, independent of batch size and latency, but within a similar sub-75W power envelope and PCIe form factor. According to Nvidia’s published product performance figures as of February 3, 2020, the T4 has a maximum throughput of 827 sentences/second at batch size 8. At CES 2020, we presented a maximum throughput of 1334 sentences/second for the Intel Nervana NNP-I. As of January 20, 2020, additional software optimizations had raised that throughput to 1560 sentences/second.
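
The speedup ratios follow directly from these throughput figures. The short calculation below uses only the numbers quoted above; the variable names are ours.

```python
# Published and measured throughput, in sentences/second.
t4_throughput = 827        # Nvidia T4, batch size 8
nnpi_ces = 1334            # Intel Nervana NNP-I, as presented at CES 2020
nnpi_jan20 = 1560          # Intel Nervana NNP-I, as of January 20, 2020

speedup_ces = nnpi_ces / t4_throughput
speedup_jan20 = nnpi_jan20 / t4_throughput
software_gain = nnpi_jan20 / nnpi_ces

print(f"CES 2020 vs. T4:   {speedup_ces:.2f}x")    # 1.61x
print(f"Jan 20 vs. T4:     {speedup_jan20:.2f}x")  # 1.89x
print(f"Software gain:     {software_gain:.2f}x")  # 1.17x
```

The CES figure corresponds to the 1.6x headline number, and the later software optimizations alone account for roughly a further 17% gain.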

Although its throughput is already best-in-class, we expect the Intel Nervana NNP-I’s performance on NLP tasks to keep improving as the software stack matures and optimizations continue.

Intel Nervana NNP-I + BERT

The Intel Nervana NNP-I has twelve Inference Compute Engines (ICE). Each has a high-performance matrix multiplication engine that supports 8-bit quantization and FP16 precision, along with a highly capable Tensilica Vision Q6 digital signal processor (DSP) with FP16 vector processing units (VPUs) and large local SRAMs. This allows BERT execution to be mapped completely onto the ICE cores. The matrix engine achieves up to 92 TOPS and runs the 8-bit MLPs.

The Tensilica DSP is a highly programmable, high-performance 512-bit VLIW vector machine that performs the element-wise, softmax, layer normalization, and transpose layers in FP16 precision. Quantizing the MLPs to 8 bits while running the non-GEMM operations in FP16 allows very high performance on the Intel Nervana NNP-I.

In addition, the large SRAM inside the ICE enables all intermediate results (between layers) to be stored inside the IP, maintaining data locality and proximity to the execution units. This reduces the external bandwidth and conserves power. The 24MB last level cache (LLC) that is shared across all ICE cores also reduces the bandwidth required for parameter fetch from memory considerably.

For more details on the Intel Nervana NNP-I solution, see our presentation from Hot Chips 2019: “Spring Hill (NNP-I 1000) Intel’s Data Center Inference Chip.”

Batch 2×6 Run on Intel Nervana NNP-I

BERT and similar workloads can be compiled and run on the Intel Nervana NNP-I in different modes that target best latency, best throughput, or maximal throughput at a given latency.

In this blog, the BERT measurements refer to the 2×6 throughput mode illustrated in Figure 2. In this mode, a batch of two is created and run on a pair of Inference Compute Engine (ICE) cores. Since the Intel Nervana NNP-I has 12 ICE cores, six parallel, asynchronous batch-two inferences can run on the machine simultaneously. Each pair of ICE cores runs one batch-two inference; this mode provides very high throughput while allowing the software to use a very small batch size.
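
The partitioning behind the 2×6 mode can be sketched as follows. This is only an illustration of the bookkeeping: the function names are hypothetical, the round-robin assignment is a stand-in for the device's asynchronous scheduling, and nothing here reflects the actual NNP-I compiler or runtime API.

```python
NUM_ICE = 12        # ICE cores on the device
ICE_PER_BATCH = 2   # cores cooperating on one inference
BATCH_SIZE = 2      # sentences per inference

def ice_groups(num_ice=NUM_ICE, group_size=ICE_PER_BATCH):
    """Split the ICE core indices into fixed groups, e.g. (0, 1), (2, 3), ..."""
    return [tuple(range(i, i + group_size)) for i in range(0, num_ice, group_size)]

def assign_batches(sentences, groups, batch_size=BATCH_SIZE):
    """Cut the input into small batches and hand them to the groups in turn."""
    batches = [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]
    schedule = {group: [] for group in groups}
    for index, batch in enumerate(batches):
        schedule[groups[index % len(groups)]].append(batch)
    return schedule

groups = ice_groups()
sentences = [f"sentence {n}" for n in range(12)]   # placeholder inputs
schedule = assign_batches(sentences, groups)

# Sentences processed concurrently when every group is busy.
in_flight = len(groups) * BATCH_SIZE
```

With six groups each working on a batch of two, twelve sentences are in flight at once even though no single inference ever sees a batch larger than two.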

Figure 2: Batch2x6 - Throughput mode

Figure 3: BERT performance on Intel Nervana NNP-I. Projections based on Intel internal measurements using pre-production hardware/software as of January 3 and January 20, 2020. All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice. Nvidia T4 published results as of February 3, 2020. Batch size 8 was selected because it depicts Nvidia’s best throughput performance for comparison purposes.

Continued Improvements for NLP

Though a relatively new method, BERT is already being used in a variety of tasks. Google uses BERT in its core search and ranking algorithms to better understand the subtle meanings of words and phrases in searches and to match queries with relevant results. Researchers have published a paper on aspect-based sentiment analysis using BERT, while others have proposed using BERT to generate multiple-choice questions or in quantitative trading algorithms. Because BERT has shown state-of-the-art results in a wide variety of areas, including Q&A, named entity recognition, classification, and more, we expect it to be a crucial component of future NLP tasks.

Visit the NLP Architect website to explore our new features for NLP optimization in production, and follow us on Twitter for the latest updates from the Intel AI Lab.