NLP Architect: Efficient Training and Optimized Inference of Transformer Models

In the past year, we’ve seen major advances in the field of natural language processing (NLP) with the rise of pre-trained language models such as ULMFiT, ELMo, BERT, GPT and RoBERTa. These models were trained on large collections of text and are used to extract the contextual information of words. They have improved a vast collection of NLP tasks by serving as feature extractors or as generic pre-trained contextual extractors that can be tuned for specific tasks. Transformer-based models, such as BERT and GPT, stand out among these models because they have shown great improvement by allowing the classifier to fine-tune the model’s parameters while training on the target task (often referred to as the fine-tuning phase). However, the use of these generic pre-trained models comes at a cost: while delivering great accuracy, the number of parameters and the depth of the model impose a heavy computational burden, making them a challenge to train and deploy for inference.

Transformers in NLP Architect

With the fifth release of NLP Architect, an open source library of NLP models from Intel AI Lab, we integrated Transformer-based models that utilize pre-trained language models (using the pytorch-transformers GitHub repository) to kick-start NLP model training. The Transformer base class we developed can load predefined models, such as BERT, XLNet and XLM. It can also handle the life-cycle of a model, including creation and configuration of the model, loading the pre-trained weights and handling the training and inference process. The model can also be configured beyond the predefined pre-trained models; developers can customize the topology if only partial pre-trained weights are needed or if additional components are added.
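To illustrate the underlying mechanism, here is a minimal sketch of loading a pre-trained BERT model and extracting contextual representations using the pytorch-transformers API directly (the NLP Architect Transformer base class wraps this kind of life-cycle handling; the model name below is just an example):

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# Load pre-trained weights and the matching tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Encode a sentence and extract contextual token representations
tokens = tokenizer.tokenize('[CLS] NLP Architect integrates Transformer models . [SEP]')
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]  # shape: (1, seq_len, hidden_size)
```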

Training a model on top of a pre-trained Transformer is done by first adding a classifier head, or any additional network component trained on the target task, and then fine-tuning the Transformer weights along with the parameters of the classifier. An example of a simple classifier for sorting sentences into categories is a feed-forward layer with softmax attached to the last layer of the Transformer model.
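For illustration, a minimal PyTorch sketch of such a head is shown below; this is not the NLP Architect implementation, just a simplified example of attaching a feed-forward classifier to a pre-trained BERT model so that both are fine-tuned together:

```python
import torch.nn as nn
from pytorch_transformers import BertModel

class SentenceClassifier(nn.Module):
    """Simplified sequence-classification head: a feed-forward layer
    (softmax is applied inside the cross-entropy loss) on top of BERT."""
    def __init__(self, num_classes, model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        last_hidden = self.bert(input_ids, attention_mask=attention_mask)[0]
        cls_repr = last_hidden[:, 0]      # representation of the [CLS] token
        return self.classifier(cls_repr)  # class logits

# Fine-tuning updates both the BERT weights and the classifier parameters:
# loss = nn.CrossEntropyLoss()(model(input_ids), labels); loss.backward(); optimizer.step()
```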

In our latest release, NLP Architect v0.5, we have included two such classifiers. One is a sequence classifier, useful for classifying sentences or documents into a predefined set of classes: for example, the sentiment conveyed in a sentence, a sentence’s category, or entailment. The other is a token classifier, which classifies individual words into predefined categories: for example, part-of-speech (POS) tags or named entity (NER) tags. In a future release, we will add classifiers relevant to other NLP tasks.
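The token classifier differs from the sequence classifier mainly in that the output layer is applied to every token rather than to a single pooled sentence representation; a hypothetical minimal sketch:

```python
import torch.nn as nn
from pytorch_transformers import BertModel

class TokenClassifier(nn.Module):
    """Simplified token-classification head: a per-token linear layer
    producing tag logits (e.g. POS or NER labels) from BERT's last layer."""
    def __init__(self, num_tags, model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask=None):
        last_hidden = self.bert(input_ids, attention_mask=attention_mask)[0]
        return self.classifier(last_hidden)  # shape: (batch, seq_len, num_tags)
```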

Quantized BERT (8-bit)

As mentioned in the opening paragraph, BERT-based models demand high levels of computation, memory and power usage. Model compression methods such as weight pruning, quantization and model distillation reduce the resources needed for model inference and therefore are crucial for using these models in production environments.

Researchers from Intel Labs have already demonstrated the potential of optimizing BERT inference performance on CPUs. We applied quantization-aware training during the fine-tuning process of BERT and simulated 8-bit quantized inference using FP32 variables. Such a model can easily be converted to Int8 variables, resulting in a 75% reduction in model size, since each 32-bit weight is stored in only 8 bits.
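The core idea of simulating quantized inference during training can be sketched as follows; this is a simplified, hypothetical fake-quantization routine (per-tensor, symmetric) rather than the exact scheme used in NLP Architect:

```python
import torch

def fake_quantize(x, num_bits=8):
    """Simulate 8-bit quantization while keeping FP32 variables,
    as done during quantization-aware training (simplified sketch)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale factor
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_dq = x_int * scale                           # dequantize back to FP32
    # Straight-through estimator: the forward pass uses the quantized values,
    # the backward pass lets gradients flow as if no rounding had happened.
    return x + (x_dq - x).detach()
```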

To test our method, we evaluated our model on the GLUE benchmark and showed that the error induced by quantization is less than 1%. We invite readers to delve into a detailed explanation of this work in this blog post and on our website. We implemented the quantization method as a base model in NLP Architect, allowing developers to apply quantization-aware training to BERT pre-trained models with any classification heads they desire.

Distilling BERT into smaller LSTM- and CNN-based models

Another common approach for reducing computation and model size is distilling the knowledge of a larger, more accurate model into a smaller model. This approach is often referred to as model distillation or teacher-student training, and was introduced by Hinton et al. in 2015.

Building upon the high task accuracy that fine-tuned BERT models can achieve, we can use them as teacher models to train smaller student models. Accordingly, the first step is to fine-tune a BERT model on a specific task. Distillation then consists of training the student model while running the teacher model (BERT in our scenario) in inference mode on the same examples used to train the student. The distillation signal is obtained by estimating the distance between the teacher and student probability distributions, for example by applying KL divergence to the models’ outputs (logits), and adding this term to the student model’s loss. More details on this process can be found on our website.
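A common way to realize this, sketched below under the assumption of a standard softened-logits KL term (the exact loss used in NLP Architect may differ), is to mix the distillation term with the usual hard-label loss:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hypothetical student loss: KL divergence between the teacher and student
    distributions (softened with temperature T) added to the hard-label loss."""
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (T * T)
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```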

In NLP Architect v0.5 we used distillation to train small NER models with the help of BERT-based models. We used two robust tagging models (CNN-LSTM and ID-CNN) as student models and BERT-base and BERT-large as teacher models. These student models use pre-trained word embeddings and several CNN or LSTM layers, and apply softmax and CRF layers to classify words into tags. The teacher networks were fine-tuned on the target task before being used to train the student models, as shown in the sketch after Figure 1.

Figure 1: Distillation Process.
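For intuition, here is a stripped-down sketch of such a student tagger: pre-trained word embeddings feeding a BiLSTM encoder and a per-token output layer (the CRF layer and the CNN variant are omitted for brevity; this is not the NLP Architect implementation):

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Simplified student tagger: pre-trained word embeddings, a BiLSTM
    encoder and a per-token linear layer (CRF decoding omitted)."""
    def __init__(self, embedding_weights, num_tags, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=False)
        self.lstm = nn.LSTM(embedding_weights.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(emb)        # (batch, seq_len, 2 * hidden_size)
        return self.classifier(hidden)    # tag logits per token
```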

Our evaluations produced student models with accuracy competitive with BERT, especially in low-resource tagging tasks where very little training data is available, while having significantly fewer parameters than BERT-large.[1] As a result, we were able to train a model with accuracy competitive with BERT that showed a 27 to 45x improvement in time to completion when running inference on CPU. The distillation method is included in NLP Architect v0.5, along with a procedure to train the taggers mentioned above using BERT as a teacher.

NLP Architect as a model-oriented library for optimized models

NLP Architect was designed to provide a library of state-of-the-art NLP models for developing applications. In addition, the library provides a platform for sharing optimized models developed by our Intel AI Lab team of data scientists with the greater research community, in order to develop and further optimize NLP-based neural networks.

Starting with release 0.5 of NLP Architect, we are planning to improve many aspects of the library, shaping it to be more model-oriented, with the ‘process-train-save-run’ workflow in mind. In addition, we are streamlining usage by adding a command-line interface for running procedures, improving package organization, and simplifying the process of contributing components. We invite you to visit the NLP Architect website for more details on each model, API documentation and tutorials on how to use the library.

In future releases of NLP Architect we plan to continue improving the library: providing simple Pythonic interfaces for running inference on trained models, revising our model-serving API and, of course, adding novel and optimized models based on the work we are doing in the Intel AI Lab. Follow us on Twitter for the latest updates on our research, and if you’re attending the O’Reilly AI Conference in San Jose, you can hear Moshe Wasserblat from the NLP Architect team talk more about the challenges and opportunities of NLP.