Exploring Term Set Expansion with NLP Architect

Natural language processing (NLP) is an exciting field within artificial intelligence (AI) that seeks to process and analyze large volumes of human language data, essentially helping machines to better understand what we’re saying. From search engines to voice-command applications like Siri*, Alexa*, and Google Voice*, modern NLP requires state-of-the-art deep learning in order to make sense of our words and phrases.

Inside the Intel AI Lab, we have been exploring NLP by running term set expansion using the capabilities of our NLP Architect open source library. The work has a number of potential use cases, which I’ll discuss a bit more below. But first, some background.

NLP Architect is an open source Python* library for natural language processing and natural language understanding. NLP Architect contains a number of components, including most common word sense, named entity recognition, intent extraction, supervised sentiment, and more. It was first released by Intel AI Lab at our inaugural Intel® AI DevCon in May 2018, and version 0.3 was released in November 2018.

Term set expansion is the task of expanding a given partial set of terms into a more complete set that belongs to the same semantic class. Let’s look at metals as an example. The term ‘zinc’ would be expanded to include “cobalt, nickel, magnesium, copper, chromium,” and the like. Another example would be the inclusion of acronyms or aliases. But we can go further with an example like ocean, which expands to include “seas, reef, coastlines, icebergs, tidal currents,” and more (Fig. 1).

Let’s look at another example also illustrating term sense disambiguation. The term ‘orange’ would be expanded to include both ‘red’ (color) and ‘apple’ (fruit) but the expanded set of the terms ‘orange, yellow’ would include ‘red’ and not ‘apple’. Moreover, we do not capture only the term sense, but also the granularity within a semantic class. The expanded set of the terms ‘orange’ and ‘banana’ (fruits) would include both ‘apple’ and ‘lemon’ while the expanded set of the terms ‘orange’ and ‘grapefruit’ (citrus fruits) would include ‘lemon’ but not ‘apple’.
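To make the sense-disambiguation effect concrete, here is a minimal sketch. It is not the NLP Architect implementation: it assumes a small table of hand-made toy vectors (`TOY_VECTORS`), a simple centroid-plus-cosine scoring rule, and an arbitrary `threshold` of 0.55, all chosen purely for illustration. It shows only how adding a second seed term narrows the sense; it does not model the granularity effect described above.

```python
import math

# Hand-made toy vectors (an assumption for illustration only).
# Dimensions: (color, fruit, citrus, metal). A real system would learn
# term representations from a corpus instead of hard-coding them.
TOY_VECTORS = {
    "orange": (1.0, 1.0, 0.6, 0.0),
    "yellow": (1.0, 0.0, 0.0, 0.0),
    "red":    (1.0, 0.0, 0.0, 0.0),
    "apple":  (0.0, 1.0, 0.0, 0.0),
    "banana": (0.0, 1.0, 0.0, 0.0),
    "lemon":  (0.0, 1.0, 0.8, 0.0),
    "zinc":   (0.0, 0.0, 0.0, 1.0),
    "copper": (0.0, 0.0, 0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(seeds, vectors=TOY_VECTORS, threshold=0.55):
    """Return non-seed terms whose similarity to the seed centroid
    is at least `threshold`, best match first."""
    seed_vecs = [vectors[s] for s in seeds]
    # Average the seed vectors dimension by dimension.
    centroid = [sum(dim) / len(seed_vecs) for dim in zip(*seed_vecs)]
    scored = [
        (cosine(centroid, vec), term)
        for term, vec in vectors.items()
        if term not in seeds
    ]
    kept = [(score, term) for score, term in scored if score >= threshold]
    # Sort by descending score, then alphabetically to break ties.
    return [term for score, term in sorted(kept, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    print(expand(["orange"]))            # ambiguous seed: colors and fruits
    print(expand(["orange", "yellow"]))  # second seed pins down the color sense
```

With these toy vectors, the single seed ‘orange’ pulls in both color and fruit terms, while the seed pair ‘orange, yellow’ shifts the centroid toward the color dimension so that fruit terms fall below the threshold.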

Figure 1

Our pre-trained model was trained on a subset of English Wikipedia* articles as a sample corpus and is integrated into a simple web application. You can see a demo in this video. Our term set expansion project isn’t being developed for a specific use case. Rather, it’s a way for us to explore how new algorithms and architectures can improve operations, first within Intel and then in the broader community. Let’s take a look at some of the use cases inside Intel.

Qualifying Applicants

The first use case for term set expansion comes from Intel’s human resources (HR) department. As anyone who works in technology knows, skills overlap across different fields. For example, if the HR department is seeking an applicant with a certain set of AI skills, a résumé might describe those skills with any of a variety of related phrases: machine learning, deep learning, artificial intelligence, AI, TensorFlow*, Jupyter*, Python, inference, data mining, computer vision, training models, and more.

To better filter applicants, Intel HR uses term set expansion to create a more manageable pool of qualified candidates. At any point, Intel may have thousands of open positions and hundreds of thousands of résumés and CVs on file. Previously, HR might manually search a candidate’s LinkedIn profile for specific experience or education to get a better understanding of their abilities, a process that could take several minutes per applicant. Today, applicants can be searched in seconds for specific types of skills. The system returns matches in ranked order to the HR agent, who can review them and then send them to management for the next phase of consideration. This filtering can also help eliminate potential bias, as HR isn’t looking at names, photos, or other information about an applicant, only skill sets.
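As a rough illustration of ranking by skills only, the sketch below scores each résumé by how many terms from an expanded skill set it mentions. It is not Intel’s HR system: the scoring rule, the function names (`count_matches`, `rank_candidates`), the candidate IDs, and the résumé snippets are all hypothetical. The expanded term list is taken from the example phrases above.

```python
import re

# Expanded skill set, using the related phrases listed in the text above.
EXPANDED_SKILLS = [
    "machine learning", "deep learning", "artificial intelligence", "AI",
    "TensorFlow", "Jupyter", "Python", "inference", "data mining",
    "computer vision", "training models",
]

# Hypothetical résumé snippets keyed by an anonymous ID: no names or
# photos are involved, only the text describing skills.
RESUMES = {
    "cand-001": "Built computer vision pipelines in Python; deep learning with TensorFlow.",
    "cand-002": "Managed retail store operations and staff training.",
    "cand-003": "Data mining and machine learning for fraud detection.",
}


def count_matches(text, terms):
    """Count how many distinct terms occur in the text as whole words or
    phrases, ignoring case (so 'AI' does not match inside 'retail')."""
    return sum(
        1
        for term in terms
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text, re.IGNORECASE)
    )


def rank_candidates(resumes, terms):
    """Return (candidate_id, score) pairs, highest score first."""
    scores = [(cid, count_matches(text, terms)) for cid, text in resumes.items()]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for candidate_id, score in rank_candidates(RESUMES, EXPANDED_SKILLS):
        print(candidate_id, score)
```

A count of matched terms is the simplest possible ranking signal. A production system would presumably weight terms and handle spelling variants, which this sketch leaves out.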

Term set expansion can also help current employees find appropriate future positions within Intel that require a similar background. An employee can provide the system with their résumé and it will return a list of open jobs that match their experience.

Classifying Software Bugs

Another use case for term set expansion is the correct identification of software bugs. The taxonomy can be complex and cannot be based on keywords alone. From acronyms to technical terms in different languages, there can be more than a dozen variations of the name for the same component. Bug reports come in from across the world, often from teams or individuals who have been working on an issue without knowing that someone else is attempting to solve the same problem. This duplication of effort is costly. To create a better taxonomy, term set expansion can learn to group the various names for a component under the same class. After a taxonomy for specific technical terms is built, the full expansion can be used to match terms across different bug reports.

In a test, Intel used three hundred technical terms and extended them to a full list, then developed a search for each term. This improved precision in matching similar bug reports by more than 10%, helping engineers come together faster to issue fixes.
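The matching step can be sketched as follows. This is a simplified illustration rather than the search Intel built: the taxonomy entries, bug IDs, report texts, and function names (`components`, `related_pairs`) are hypothetical. The idea is to map every name variant in an expanded taxonomy to one canonical component, then flag reports that mention the same component.

```python
import re
from itertools import combinations

# Hypothetical taxonomy: canonical component -> expanded list of name variants.
TAXONOMY = {
    "graphics_driver": ["graphics driver", "gfx driver", "gpu driver"],
    "wifi": ["wifi", "wi-fi", "wlan", "wireless lan"],
}

# Hypothetical bug reports keyed by ID.
REPORTS = {
    "BUG-1": "GFX driver crashes after resume from sleep",
    "BUG-2": "Crash in GPU driver when waking from suspend",
    "BUG-3": "WLAN drops connection every hour",
    "BUG-4": "Typo in installer license text",
}


def components(text, taxonomy=TAXONOMY):
    """Return the set of canonical components mentioned in the text."""
    found = set()
    for canonical, variants in taxonomy.items():
        for variant in variants:
            # Match the variant as a whole word or phrase, ignoring case.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            if re.search(pattern, text, re.IGNORECASE):
                found.add(canonical)
                break
    return found


def related_pairs(reports, taxonomy=TAXONOMY):
    """Return pairs of report IDs that mention at least one common component."""
    tagged = {rid: components(text, taxonomy) for rid, text in reports.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(tagged), 2)
        if tagged[a] & tagged[b]
    ]


if __name__ == "__main__":
    print(related_pairs(REPORTS))
```

A plain keyword search for "GFX driver" would miss the report that says "GPU driver". Normalizing both to the same component is what lets the two reports be paired.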

The two use cases within Intel are promising, but the potential for others to use the tool is also very exciting for our team.

Try It Yourself

One of the biggest potential opportunities for our term set expansion application is its ability to help non-experts create taxonomies. Named entity recognition (NER) is an important NLP task that business organizations use to extract product names, company names, and technical terms. Our term set expansion solution, accessible via a friendly web-based user interface, is a useful tool that lets people who are not deep learning experts extract entities and refine the results. Users don’t need annotated data: the input can be unstructured text, and the training process is unsupervised. Term set expansion with NLP Architect will give more users the ability to create their own taxonomies, creating more use cases and future projects.

We have made tutorial instructions available on Jupyter showing how to install NLP Architect, train a model, and run the term set expansion. We encourage you to try it yourself, and we are excited to see what use cases others will create. For more information, please review the academic paper on this topic that was presented in November 2018 at the Conference on Empirical Methods in Natural Language Processing (EMNLP). The research was done in collaboration with Professors Ido Dagan and Yoav Goldberg from Bar-Ilan University. If you’re interested in further NLP projects, please visit our NLP Architect site to see all of the components, models, and solutions.