Future Directions for NLP in Commercial Environments

Natural Language Processing (NLP) has made a major leap in the past decade, both in theory and in practical integration into broadly deployed industry solutions. From interacting with virtual assistants, to texting with a flight-booking chatbot, to mining call center interactions for insights into customer satisfaction, NLP is everywhere today. In 2018, deep learning (DL) for NLP finally hit its stride, with neural networks outperforming traditional machine learning (ML) methods in a wide range of NLP tasks, and even surpassing reported human performance on some benchmarks in the challenging areas of question answering and machine translation. The field seems poised to continue its advances in using and understanding natural language. However, deploying NLP solutions in commercial environments raises a number of significant challenges that need to be addressed. Chief among them are learning effectively from a small set of examples and scaling solutions across different domains. A few recent technology advances and practices hold great promise for improving scalability and robustness, and they embody a shift in how business organizations consume computing resources and deploy NLP applications.

Prior “State-of-the-Art”: BiLSTM Hegemony

In 1997, Sepp Hochreiter and Jürgen Schmidhuber introduced long short-term memory (LSTM), a recurrent neural network (RNN) architecture whose memory cells store information efficiently over extended time intervals. LSTMs have since become a key tool for capturing long-range dependencies in NLP tasks. Two decades later, the bidirectional variant (BiLSTM) has driven many of the advances in NLP, powered by neural network (NN) computing resources, word-embedding representations and greater access to useful datasets. As Chris Manning, director of the Stanford Artificial Intelligence Laboratory, explains: “No matter what the [NLP] task, you throw a BiLSTM at it.” [1]

Barriers to Large-Scale Deployment and Impact

Many of the NLP use cases in industry fall under the category of text analytics [2], which targets the extraction of symbolic information from unstructured text in order to infer useful insights. Practical applications of this technology include the extraction of concepts, events and topics from emails, online reviews, tweets, call center voice interactions, survey results, and other types of textual feedback, in order to analyze customer opinions, needs or levels of satisfaction with products and services.

However, organizations are generating textual data at explosive rates and increasingly need to exploit this unstructured data to reduce operational costs and improve competitiveness. The current DL paradigms cannot keep pace.

To expound: today, data science teams typically build a separate DL model for each specific NLP task. Data-hungry DL models must be fed large amounts of annotated data, which is time-consuming and costly to acquire because annotation requires domain expertise. At the same time, business environments are dynamic and complex, making it impractical to collect and label enough data for each scenario in a given domain within a practical time frame. Although much progress has been made in DL for NLP, current models cannot cope with these challenges, which is a major obstacle to the deployment of DL-based NLP in commercial environments.

Transfer Learning and NLP

Encouraging results were recently achieved using transfer learning for various NLP tasks. As defined by Jeremy Howard, founding researcher at fast.ai: “Transfer learning refers to the use of a model that has been trained to solve one problem (such as classifying images from ImageNet) as the basis to solve some other somewhat similar problem.” Effective NLP is highly dependent on the context of the language being analyzed. New NLP practices apply transfer learning by learning language structures from domains or tasks that are rich in labeled examples and then applying the learned model, with some adaptation, to a different domain or task.

A new set of approaches with effective transfer learning methods has emerged: Embeddings from Language Models (ELMo), Universal Language Model Fine-tuning (ULMFiT), Transformer and, most recently, Bidirectional Encoder Representations from Transformers (BERT). These approaches demonstrate that models pre-trained on one task, such as language modeling (LM), can be used successfully for other NLP tasks, surpassing the previous state of the art and reaching high accuracy with far less data than training from scratch. For example, fast.ai has demonstrated that the ULMFiT method “significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data.”

NLP transfer learning is likely to replace the traditional approach of training from scratch on a large amount of labeled data with fine-tuning a pre-trained model on a small amount of in-domain labeled data. The NLP transfer learning approach includes the following steps:

  1. Pre-train learning – This is where the heavy computational lifting occurs. In this step, a model is pre-trained to solve a generic task such as language modeling. The training consumes very large amounts of unlabeled textual data (e.g. the entire Wikipedia dataset) in an unsupervised (self-supervised) manner, so no manual data labeling is needed.
  2. Fine-tune learning – This step consists of reasonably fast, supervised learning of specific tasks such as named entity recognition (NER) or sentiment analysis using a small amount of in-domain labeled data. The pre-trained model can be held fixed or can be fine-tuned during this training.
  3. Inference – In this step, the fine-tuned model is loaded and prediction is invoked. Because the fine-tuned model is built on the very large pre-trained model, each prediction requires a very large feed-forward calculation, which is computationally intensive compared to inference with traditional supervised models. BERT is an example of such a model; its large configuration contains 24 transformer blocks, with a total of 340M parameters.
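
The three steps above can be sketched end to end in a few lines of Python. This is a minimal toy sketch, not how ELMo, ULMFiT or BERT are implemented: co-occurrence word vectors stand in for language-model pre-training, the pre-trained vectors are held fixed, and the fine-tuned part is a small logistic-regression head. The corpus, the labeled examples and all function names are hypothetical, invented for illustration.

```python
import math

# Hypothetical toy data, invented for this sketch.
UNLABELED = [
    "great superb excellent",
    "superb wonderful great",
    "awful dreadful terrible",
    "dreadful horrible awful",
]
LABELED = [("great excellent", 1), ("awful terrible", 0)]  # 1 = positive


def pretrain(corpus):
    """Step 1: learn one vector per word from unlabeled text (no labels used)."""
    vocab = sorted({w for s in corpus for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for sentence in corpus:
        words = sentence.split()
        for w in words:
            for c in words:  # every word in the sentence counts as context
                vectors[w][index[c]] += 1.0
    for w, v in vectors.items():  # scale each vector to unit length
        norm = math.sqrt(sum(x * x for x in v))
        vectors[w] = [x / norm for x in v]
    return vectors


def encode(vectors, text):
    """Average the pre-trained vectors of the known words in `text`."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(vectors, labeled, epochs=200, lr=1.0):
    """Step 2: fit a small classifier head; the pre-trained vectors stay fixed."""
    feats = [encode(vectors, text) for text, _ in labeled]
    weights = [0.0] * len(feats[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, (_, y) in zip(feats, labeled):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi / len(feats)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def predict(vectors, weights, text):
    """Step 3: inference with the pre-trained encoder plus the fine-tuned head."""
    x = encode(vectors, text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


vectors = pretrain(UNLABELED)
weights = fine_tune(vectors, LABELED)
print(predict(vectors, weights, "superb wonderful"))   # probability of "positive"
print(predict(vectors, weights, "dreadful horrible"))
```

The two test phrases use only words that never appear in the labeled examples. The classifier can still score them because their vectors were learned from the unlabeled corpus in Step 1, which is the effect transfer learning relies on. In a real system, Step 1 would be replaced by a large pre-trained language model, and Step 2 might also update the pre-trained weights rather than holding them fixed.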

Figure 1: Transfer learning steps applied to a sentiment analysis task.

The Implications of Transfer Learning on Computing Resources

In practice, NLP applications deployed in businesses will most likely focus on the fine-tuning and inference steps (Steps 2-3 of Figure 1), leveraging generic pre-trained models (Step 1). This development has a major impact on the way business organizations will consume computing resources, shifting the emphasis from the training stage to the inference stage. Inference hardware will have to handle loading large models and the correspondingly heavy feed-forward calculations, and we therefore anticipate significant changes in the design of inference computing architectures.
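
To get a feel for the inference-side cost, the following back-of-the-envelope sketch estimates how much memory the weights of a 340M-parameter model occupy at different numeric precisions, and roughly how many floating-point operations one forward pass needs. The bytes-per-parameter table and the rule of thumb of about two operations per parameter per token are assumptions of this sketch, not figures from the sources cited here; activations, attention over long inputs, and framework overhead are ignored.

```python
PARAMS = 340_000_000  # BERT parameter count quoted in the text above

# Assumed storage sizes for common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


def forward_flops(num_params, num_tokens):
    """Rough rule of thumb (an assumption): one multiply and one add
    per parameter for every input token."""
    return 2 * num_params * num_tokens


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_memory_gib(PARAMS, size):.2f} GiB of weights")

tokens = 128  # hypothetical input length
print(f"~{forward_flops(PARAMS, tokens) / 1e9:.0f} GFLOPs per {tokens}-token input")
```

Under these assumptions, the weights alone take about 1.27 GiB at 32-bit precision, which is why lower-precision formats and smaller derived models are attractive for deployment.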

A related line of work is the development of software frameworks that reduce model size and computational cost for specific tasks, such as the work presented in Transformer to CNN.

Examples from the Intel AI Lab

The Intel AI Lab is working on efficient transfer learning techniques for training with small amounts of data, and on tools that let non-DL/NLP experts scale and adapt models to new domains. The models and tools are available in NLP Architect, an open and flexible library for the NLP research and development community.

Here are two examples that demonstrate the way NLP Architect leverages transfer learning to mitigate the challenges in commercial deployments:

  1. The Term Set Expansion application is a transfer-learning-based tool for non-DL/NLP experts (e.g. business analysts). It provides an iterative, end-to-end workflow through a friendly user interface for extracting domain-specific entities and refining the results. Organizations can thus speed up the process of building and maintaining a taxonomy for their NLP use cases.
  2. At NeurIPS 2018, the Intel AI Lab team presented an aspect-based sentiment analysis (ABSA) tool that demonstrates how language structure (e.g. grammar) can assist in acquiring sentiment terms for a new domain in an unsupervised manner. The algorithm does not require training a new model for each domain and can continuously learn from incoming data.


We are witnessing a paradigm shift in the way data scientists deploy NLP applications in business environments: adapting large pre-trained models with relatively small amounts of data, instead of the traditional approach of training from scratch with large amounts of data per task and then performing inference. The implications of this shift are very promising for the deployment of NLP applications in commercial settings. For example, the field of computer vision went through accelerated adoption with the rise of transfer learning (e.g. from models pre-trained on ImageNet), which enabled the productization of the technology. We expect a similar development in the field of NLP.