Building Skip-Thought Vectors for Document Understanding

The neon™ deep learning framework was created by Nervana Systems to deliver industry-leading performance. As of 2018, the neon framework is no longer supported. We recommend that customers consider the Intel-optimized frameworks instead.

The idea of converting natural language processing (NLP) into a problem of vector space mathematics using deep learning models has been around since 2013. A word vector, from word2vec [1], uses a string of numbers to represent a word’s meaning as it relates to other words, i.e., its context, learned through training. Starting from word vectors, one can train neural networks to build vector representations for phrases, sentences, and then documents. This significantly improves the scalability of NLP algorithms to new language data. Instead of traditional rule-based algorithms that do not adapt, we can feed enough text to a neural network and let the training process extract useful intrinsic features.

Such vector representations, used for words, sentences, documents, or even images and other forms of inputs, can be generalized as representing “thoughts”. The term “thought vector” has been popularized by Geoffrey Hinton, a prominent deep-learning researcher. In this blog, we present a recent model for representing sentences, skip-thoughts (a name also suggested by Geoffrey Hinton) [2]. We will also introduce an implementation of this model on our platform, as part of Nervana neon and Nervana NLP engine, which enables document understanding and potential applications beyond.

Skip-thought vectors

The Skip-thought model [2] was inspired by the skip-gram structure used in word2vec [1], which is based on the idea that a word’s meaning is embedded by the surrounding words. Similarly, in contiguous text, nearby sentences provide rich semantic and contextual information. The skip-thought model is trained to reconstruct the surrounding sentences to map sentences that share semantic and syntactic properties into similar vectors.

The training data is only required to be contiguous, which means that vast amounts of unlabeled text can easily be used for such a model. The learned sentence vectors are highly generic and can be reused for many different tasks by learning an additional mapping, such as a classification layer. This approach allows leveraging easily accessible unlabeled data to obtain high-quality representations before applying scarcer labeled data for specific applications.

Encoder-decoder architecture

The model is based on an encoder-decoder architecture [3]. All variants of this architecture share a common goal: encoding source inputs into fixed-length vector representations, and then feeding such vectors through a “narrow passage” to decode into a target output. The narrow passage forces the network to select and abstract a small number of important features, building the connection between a source and a target.

In the case of Neural Machine Translation [3], the input sentence is in a source language (English), and the target sentence is its translation in a target language (French). In the case of generating image descriptions [5], the source input is an image, and the target output is a caption sentence that describes the image. Finally, in a dialogue system [3], the input is a question, and the target output is the answer. With the Skip-thought model, the encoding of a source sentence is mapped to two target sentences: one being the preceding sentence, the other the subsequent sentence. This mapping is illustrated in Figure 1, which is from [2].


Figure 1. The Skip-thought model attempts to predict the preceding sentence (in red) and the subsequent sentence (in green), given a source sentence (in grey)

The detailed implementation is shown in the figure below. The network first uses a lookup table layer (sometimes called a word embedding layer) to convert each word into a vector. Then an encoder, built from recurrent neural network (RNN) layers, bi-directional RNN (BiRNN) layers, or a combination of both, captures the temporal patterns of the sequential word vectors. The hidden states of the encoder are fed as representations of the inputs into two separate decoders (one to predict the preceding sentence, the other the subsequent sentence). Each decoder uses its own set of recurrent layers, and both share the lookup table layer with the encoder.
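As a rough illustration of this architecture (not the neon implementation, which uses GRU layers and parameters learned at scale), the forward pass can be sketched in a few lines of NumPy. The vocabulary size, dimensions, and token ids below are made up, and a vanilla RNN stands in for the recurrent layers:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 50, 8, 16

# Shared lookup table (word embedding matrix), used by the encoder and both decoders.
embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

def encode(token_ids, Wx, Wh, b):
    """Run a vanilla RNN over the sentence; the final hidden state
    serves as the sentence vector."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        h = np.tanh(embeddings[t] @ Wx + h @ Wh + b)
    return h

def decode_logits(sent_vec, target_ids, Wx, Wh, Wc, Wo, b):
    """Teacher-forced decoder: each step combines the previous target word,
    the decoder state, and the encoder's sentence vector, then projects
    onto the vocabulary to score the next word."""
    h = np.zeros(hidden_dim)
    logits = []
    for t in target_ids:
        h = np.tanh(embeddings[t] @ Wx + h @ Wh + sent_vec @ Wc + b)
        logits.append(h @ Wo)
    return np.array(logits)

# Randomly initialized encoder parameters.
enc = (rng.normal(scale=0.1, size=(embed_dim, hidden_dim)),
       rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
       np.zeros(hidden_dim))

# One of the two decoders (previous-sentence); the next-sentence decoder
# would have its own, separate parameters of the same shapes.
dec_prev = (rng.normal(scale=0.1, size=(embed_dim, hidden_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, vocab_size)),
            np.zeros(hidden_dim))

source = [3, 17, 42, 7]        # token ids of the source sentence
prev_sentence = [5, 9, 2]      # token ids of the preceding sentence

sent_vec = encode(source, *enc)
prev_logits = decode_logits(sent_vec, prev_sentence, *dec_prev)
```

A full implementation would add the second decoder, a softmax cross-entropy cost over both decoders' outputs, and back-propagation through all parameters, including the shared lookup table.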


Figure 2. Network architecture

Intuitively speaking, the encoder generates a representation of the input sentence itself. Back-propagating costs from the decoders during training also enables the encoder to capture the relationship of the input sentence to its surrounding sentences. The encoder therefore captures both syntactic and semantic properties. Sharing the lookup table between the encoder and decoders yields high-quality word embeddings not only for source sentences but also for their context. After training, the lookup table layer and the encoder are kept as a feature extractor for new input sentences. Following [2], our implementation of the model used the BookCorpus dataset [4] for training.

As a way of probing the model’s learned mapping, we used the feature extractor to obtain the sentence vectors as the encoder RNN’s final states. Figure 3 shows 2D projections of these vectors. Visualization tools are a component of Nervana’s NLP service that lets users explore a large document collection interactively and easily see semantic relationships within the data. A user can also extract features for sentences from a new document and retrieve similar sentences to see how the new content relates semantically to past content.
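Retrieving similar sentences in this way reduces to nearest-neighbor search over the extracted vectors. A minimal sketch, with made-up two-dimensional vectors standing in for real encoder outputs:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query_vec, sentence_vecs, k=2):
    """Rank stored sentence vectors by cosine similarity to the query
    and return the indices of the top-k matches."""
    scores = [cosine_sim(query_vec, v) for v in sentence_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Toy sentence vectors; in practice these come from the trained encoder.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(most_similar(np.array([1.0, 0.05]), vecs))  # → [0, 1]
```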


Figure 3. Spatial distribution of sentences from the BookCorpus dataset based on sentence vectors. Each point corresponds to a sentence. The original sentences are displayed when the user interacts with the display. Two closely related sentences are highlighted here.

In a general encoder-decoder network, the encoder and decoder subnetworks can be implemented differently depending on the modalities of the input data. The subnetworks are often implemented as RNNs when dealing with language or temporal sequences, as in skip-thoughts [2] and machine translation [3]. If either subnetwork processes images, as in image description generation [5], convolutional neural networks (CNNs) are used to extract image features.

In terms of the connections in the narrow passage, some models use the last hidden state of the RNN encoder, while others feed features from more time steps and add extra transformations within the connection, a so-called “attention mechanism” [5][6][7][8]. Please refer to those papers for more details.
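To give a flavor of the idea (the cited papers differ in the details; some use additive rather than dot-product scoring), one attention step can be sketched as follows, with arbitrary dimensions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention: score each encoder time step against the
    current decoder state, normalize the scores into weights, and return
    the weighted sum of encoder states as the context vector."""
    scores = encoder_states @ decoder_state   # one score per time step
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted combination
    return context, weights

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(5, 8))   # hidden state at each of 5 time steps
dec_state = rng.normal(size=8)
context, weights = attend(dec_state, enc_states)
```

The decoder then consumes the context vector instead of (or in addition to) a single fixed sentence vector, so different time steps of the source can dominate at different decoding steps.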

Topic mapping from sentence vectors

The original paper evaluated the capability of the trained encoder as a generic feature extractor for 8 different tasks including semantic relatedness, paraphrase detection, and classification.  For some tasks, the sentence vectors could be used directly, and in other tasks, an additional classifier would be trained.

Similarly, one of the applications we can enable is topical analysis. Using some labeled data, one can train an extra classification module to map the sentence representation to different sentence-level topics. Then a user can run the topical analysis model on a collection of documents and visualize how the sentences in these documents are grouped through our Nervana NLP service. Figure 4 illustrates how the model is able to categorize sentences (represented by points) into topics (represented by colors).  We have also incorporated other search features. For example, a keyword search will highlight all the sentences mentioning the specified keyword, so one can see how certain entities are associated with topics.
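One simple way to build such a classification module, sketched here with synthetic clusters standing in for real sentence vectors and their labeled topics, is a softmax layer trained with gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sentence vectors" drawn from two clusters, standing in for
# frozen encoder outputs; y holds their topic labels.
X = np.vstack([rng.normal(loc=-1.0, size=(20, 4)),
               rng.normal(loc=+1.0, size=(20, 4))])
y = np.array([0] * 20 + [1] * 20)

W = np.zeros((4, 2))   # one weight column per topic
b = np.zeros(2)

for _ in range(200):   # plain gradient descent on softmax cross-entropy
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1.0        # dL/dlogits
    W -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean(axis=0)

pred = (X @ W + b).argmax(axis=1)
print((pred == y).mean())   # training accuracy on the toy clusters
```

Because the sentence vectors are generic, only this small classifier needs labeled data; the encoder itself stays fixed.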


Figure 4. Sentences grouped based on predicted topics. Each point corresponds to a sentence. Each color corresponds to a topic.

Applying to domain-specific documents

We released a basic implementation of the skip-thought model as part of the neon 1.8 release. The example uses the same BookCorpus dataset and configurations as the paper. When applying the model to domain-specific documents for more practical use cases, there are a few things to consider in order to fully utilize the algorithm.

Learning meaningful representations for words is a first step towards understanding natural language. Just as humans pick up the specialized vocabulary of their chosen professions, we can further train the word2vec network to learn the nuances of a dataset. The previously published and publicly available results were trained on a corpus of 100 billion words from the Google News dataset and did not have good representations for certain words encountered in domain-specific documents. Take financial news and reports as an example: financial documents often include acronyms such as EPS (earnings per share) and EBITDA (earnings before interest, taxes, depreciation and amortization), and terms like flow (as in cash-flow) and yield that have specific meanings in a financial context. Fine-tuning the word2vec model on one of our customer’s datasets led to significantly improved representations for such terms.
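For intuition on what such continued training does, here is a toy sketch of the skip-gram negative-sampling update that word2vec-style fine-tuning performs. The mini-vocabulary and vectors are made up; real tooling would handle vocabulary updates and sampling at scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-vocabulary with domain terms; in practice these
# rows would be initialized from a pretrained word2vec model.
vocab = {"eps": 0, "earnings": 1, "share": 2, "yield": 3, "the": 4}
dim = 8
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # target-word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(target, context, negatives, lr=0.05):
    """One skip-gram negative-sampling update: pull the target vector
    toward its observed context word and push it away from sampled negatives."""
    v = W_in[target].copy()
    grad_v = np.zeros(dim)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = lr * (sigmoid(v @ W_out[c]) - label)
        grad_v += g * W_out[c]
        W_out[c] -= g * v
    W_in[target] -= grad_v

# Repeated co-occurrence of "eps" and "earnings" in domain text
# raises the score the model assigns to that pair.
before = W_in[vocab["eps"]] @ W_out[vocab["earnings"]]
for _ in range(50):
    sgd_step(vocab["eps"], vocab["earnings"], negatives=[vocab["the"]])
after = W_in[vocab["eps"]] @ W_out[vocab["earnings"]]
```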

As we said earlier, models like skip-thoughts allow us to leverage easily accessible unlabeled data for first-step pre-training. When applying this algorithm or other NLP models to a particular use case, we should consider what sources of data already exist, how unlabeled or labeled data can be acquired, and how to structure an NLP pipeline to maximize the reusability of each component.

Nervana NLP engine and Nervana Cloud

Nervana’s NLP engine incorporates ideas from published state-of-the-art models, including the ones we mentioned above, and also supports capabilities such as sentiment analysis, language translation, question answering, and speech-to-text conversion. We open-sourced many examples related to NLP and RNNs as part of neon, Nervana’s deep learning framework, to help our developer community. We also recently open-sourced an end-to-end speech-to-text model with an accompanying blog post. Anyone can download those models and run training or inference locally.

For our commercial applications, our team typically develops the deep learning models using neon, then trains and eventually deploys them using Nervana Cloud. Nervana Cloud, a hosted platform for deep learning (DL), enables businesses to develop and deploy state-of-the-art, high-accuracy AI solutions in record time for applications such as image classification, video activity detection, and natural language processing (NLP).

With Nervana Cloud, a developer can quickly build, train, and deploy deep learning models without needing to build their own hardware and software infrastructure. This provides customers with considerable gains in speed and ease of use compared to tackling the same problems with other frameworks and cloud services. Nervana also provides comprehensive training sessions covering topics from the basics of machine learning and deep learning to cloud and API usage. Using Nervana Cloud, a web application can submit a query to a REST API endpoint of a deployed model and receive the predicted results along with a confidence score that the application can act on.
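A minimal sketch of such a client, using only the Python standard library; the endpoint URL and the payload/response field names are hypothetical, not the actual Nervana Cloud API:

```python
import json
import urllib.request

def build_request(text, url="https://example.com/models/topic/predict"):
    """Build a JSON POST request for a deployed model endpoint.
    The URL and payload schema here are illustrative assumptions."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(url, data=payload,
                                  headers={"Content-Type": "application/json"})

def parse_response(body):
    """Pull the prediction and its confidence out of a JSON response body."""
    result = json.loads(body)
    return result["prediction"], result["confidence"]

# A canned response of the assumed shape, standing in for a live call;
# urllib.request.urlopen(build_request(...)) would return a similar body.
label, conf = parse_response('{"prediction": "earnings", "confidence": 0.92}')
```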

Going Beyond

Using deep learning based approaches such as skip-thoughts and the others mentioned above allowed us to build an end-to-end model that required no engineered features or domain-specific pre-processing. As a consequence, the same approach can readily be reused for other data sources and domains with minimal modifications. This scalability and robustness determine how well NLP algorithms can handle the remarkable flexibility of natural language.

Similar approaches can be used to go beyond representations and semantic search, to document classification and understanding, and eventually document summarization or generation. They serve as the basis for personal assistants, chatbots, and other agents whose purpose is to augment and entertain human beings, and will help to solve problems in finance, healthcare, and many other domains.

Nervana has expertise in many applications of Deep Learning. In the area of Natural Language Processing and Understanding, Nervana has developed models for topic classification, document search, sentiment analysis, machine translation, Q&A systems, and speech-to-text. For more information email


[1] Distributed Representations of Words and Phrases and their Compositionality
[2] Skip-Thought Vectors
[3] Sequence to Sequence Learning with Neural Networks
[4] Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
[5] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
[6] Neural Machine Translation by Jointly Learning to Align and Translate
[7] Grammar as a Foreign Language
[8] Teaching Machines to Read and Comprehend