Extracting Semantic Relations using External Knowledge Resources with NLP Architect

The Intel AI Lab recently released version 0.3 of NLP Architect, a Python* library for exploring state-of-the-art deep learning topologies and techniques for natural language processing (NLP) and natural language understanding (NLU). One of the highlights of this release is a set of new features based on external knowledge resources, which let users extract the semantic relationship between entity mentions or between event mentions. The supported external knowledge resources are Wikipedia*, WordNet*, VerbOcean*, the Stanford Coref Dictionary* and embedding distance (currently using GloVe and ELMo). We intend to expand the list in the near future, so stay tuned. In this post, we focus on extracting semantic relations from Wikipedia.

Motivation

The goal of our work is to utilize external knowledge resources to determine whether two or more mentions of a term share the same semantic meaning or have a specific semantic relation between them. The following are a few examples of such semantic relations:

  • Acronyms: “NYC” and “New York City” or “DUI” and “driving under the influence”
  • Abbreviations: “prof” and “professor” or “Alabama” and “AL”
  • Is-a: “Donald Trump” is a “president” or “Donald Trump” is the “President of the United States”
  • General Synonyms: “apartment” and “flat”
  • Proper Noun Synonyms: “Big Blue” and “IBM” or “United States” and “America”

Finding semantic relations between mentions is an important task in natural language processing (NLP). It gives a model more contextual information than the individual terms provide, and improves the robustness of industrial NLP applications (e.g., NER, intent extraction, and cross-document coreference).

Wikipedia Relation Types

Semantic relations such as those in the examples above can be extracted from a static data set (such as WordNet), but such resources are usually limited in scope. Wikipedia, on the other hand, is a large, dynamic textual data set with over 5 million articles in the English version alone. It is a fount of information whose features enable us to extract further semantic relations. For example, mention hyperlinks that redirect to the same page (wiki-redirect-feature) imply that both mentions refer to the same element in the real world (and are therefore most likely acronyms, abbreviations or aliases). Metadata such as disambiguation pages, categories and aliases are also good indicators of semantic similarity.

Given a pair of entity mentions or event mentions, our API checks the pair against six Wikipedia relation types, listed in Table 1, and reports a binary result for each type. These binary relation features can then be used by ML or sieve algorithms in semantic-based applications (e.g., coreference resolution). For example, the pair (New York City, NYC) yields the binary vector [1,0,1,0,0,0], which means True for the Redirect Link and Category types and False for the other four types.
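As a minimal sketch of how such a binary vector could be assembled, assuming the relation order of Table 1: the WikiRelation enum and the to_binary_vector helper below are hypothetical stand-ins written for this illustration and are not part of NLP Architect.

```python
from enum import Enum


class WikiRelation(Enum):
    """Hypothetical stand-in for the six relation types, in Table 1 order."""
    REDIRECT_LINK = 0
    DISAMBIGUATION = 1
    CATEGORY = 2
    ALIASES = 3
    TITLE_PARENTHESIS = 4
    BE_COMP = 5


def to_binary_vector(found):
    """Return a 0/1 list with one slot per relation type, in enum order."""
    return [1 if relation in found else 0 for relation in WikiRelation]


# Relations reported in the text for the pair (New York City, NYC).
nyc_relations = {WikiRelation.REDIRECT_LINK, WikiRelation.CATEGORY}
print(to_binary_vector(nyc_relations))  # [1, 0, 1, 0, 0, 0]
```

A fixed slot order like this is what lets a downstream classifier treat each relation type as a separate feature.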

Table 1: Wikipedia Relation Types

Wikipedia Relation Type   Description
REDIRECT_LINK             Mentions redirect to the same page
DISAMBIGUATION            One of the mentions appears as a disambiguation link in the other mention page
CATEGORY                  Mentions share at least one category type
ALIASES                   Mentions are aliases
TITLE_PARENTHESIS         Mention found in parenthesis at disambiguation link of other mention
BE_COMP                   Mentions follow Is-A pattern in the 1st sentence

Mentions pairs (examples):

  • X = “New York City”, Y = “NYC”
  • X = “IT”, Y = “Information technology”
  • X = “It”, Y = “film”
  • X = “Donald Trump”, Y = “President of the United States”
  • X = “Donald Trump”, Y = “POTUS”
  • X = “Ellen DeGeneres”, Y = “television host”
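
To make the TITLE_PARENTHESIS idea more concrete, here is a toy sketch of checking whether one mention appears in parentheses in a link title of the other mention. This is our own illustration and not the library's implementation; the function name, the regular expression and the list of link titles are all assumptions made up for the example.

```python
import re

# Matches a title with a trailing parenthetical, e.g. "It (film)".
TITLE_PAREN = re.compile(r"^(?P<name>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def title_parenthesis_match(mention_x, mention_y, link_titles):
    """Return True if some title reads "<mention_x> (<mention_y>)", ignoring case."""
    for title in link_titles:
        match = TITLE_PAREN.match(title.strip())
        if not match:
            continue  # title has no trailing parenthetical
        same_name = match.group("name").lower() == mention_x.lower()
        same_qualifier = match.group("qualifier").lower() == mention_y.lower()
        if same_name and same_qualifier:
            return True
    return False


# Hypothetical disambiguation link titles for the mention "It".
links = ["It (film)", "It (novel)", "Information technology"]
print(title_parenthesis_match("It", "film", links))
print(title_parenthesis_match("It", "Information technology", links))
```

The first call finds a title of the form "It (film)"; the second does not, because "Information technology" carries no parenthetical.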

Now that we’ve provided examples of relation types, let’s take a closer look at how we’ve enabled the semantic relation feature in the newest version of NLP Architect.
First, we installed NLP Architect:

pip install nlp-architect

Then we imported the resources used in this example:

  
>>> from nlp_architect.data.cdc_resources.relations.wikipedia_relation_extraction import WikipediaRelationExtraction
>>> from nlp_architect.data.cdc_resources.relations.relation_types_enums import WikipediaSearchMethod, RelationType
>>> from nlp_architect.common.cdc.mention_data import MentionDataLight      
  

Next, we created an instance of WikipediaRelationExtraction which we used to extract Wikipedia relations.

We chose between two main initialization options:

  1. Initiation of an instance that will query the online site directly. This requires no preprocessing of any kind and works out-of-the-box, but access time is slow (1-2 seconds per pair).

          
    wiki = WikipediaRelationExtraction(WikipediaSearchMethod.ONLINE)
          
        
  2. Initiation of an instance that will query a local Elasticsearch index. This option requires a one-time preprocessing step to create the Elasticsearch index (as described here). The preprocessing might take 2-5 hours depending on hardware, but yields fast access time (1-2 milliseconds per pair).

    wiki = WikipediaRelationExtraction(WikipediaSearchMethod.ELASTIC, host='localhost', port='9200', index='enwiki_v2')

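A rough back-of-the-envelope sketch of the trade-off between the two options, using only the ranges quoted above (1-2 seconds per pair online, 2-5 hours of one-time indexing, 1-2 milliseconds per pair afterwards). The function name is ours, and the figures are the quoted ranges rather than measurements.

```python
def break_even_pairs(index_build_hours, online_sec_per_pair, elastic_sec_per_pair):
    """Number of mention pairs at which the one-time indexing cost pays off."""
    build_seconds = index_build_hours * 3600
    saved_per_pair = online_sec_per_pair - elastic_sec_per_pair
    return build_seconds / saved_per_pair


# Best case for the index: fast build (2 h), slow online queries (2 s per pair).
low = break_even_pairs(2, 2.0, 0.002)
# Worst case for the index: slow build (5 h), fast online queries (1 s per pair).
high = break_even_pairs(5, 1.0, 0.001)
print(f"Indexing pays off after roughly {low:,.0f} to {high:,.0f} pairs")
```

Under these assumptions the local index becomes worthwhile somewhere between a few thousand and roughly eighteen thousand mention pairs.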

Next, we created two mentions for which to extract relations:

  
>>> mention_x = MentionDataLight('nyc')
>>> mention_y = MentionDataLight('New York City')      
  

Finally, we queried WikipediaRelationExtraction to extract all relations between those mentions:

  
>>> wiki_relations = wiki.extract_all_relations(mention_x, mention_y)
>>> print('Wikipedia Relations-', str(wiki_relations))
Wikipedia Relations- {<RelationType.WIKIPEDIA_REDIRECT_LINK: 1>, <RelationType.WIKIPEDIA_CATEGORY: 5>}
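
The returned set can feed a simple sieve-style rule. The sketch below is only an illustration: the RelationType enum is a stand-in that mirrors the two members visible in the output above so that the snippet runs without NLP Architect installed, and the choice of which relations count as strong evidence is our assumption.

```python
from enum import Enum


class RelationType(Enum):
    """Stand-in with only the two members visible in the output above."""
    WIKIPEDIA_REDIRECT_LINK = 1
    WIKIPEDIA_CATEGORY = 5


# Assumption: which relations count as strong evidence is a design choice.
STRONG_EVIDENCE = {RelationType.WIKIPEDIA_REDIRECT_LINK}


def likely_same_entity(relations):
    """Sieve-style rule: True if any strong-evidence relation was extracted."""
    return not STRONG_EVIDENCE.isdisjoint(relations)


wiki_relations = {RelationType.WIKIPEDIA_REDIRECT_LINK, RelationType.WIKIPEDIA_CATEGORY}
print(likely_same_entity(wiki_relations))
print(likely_same_entity({RelationType.WIKIPEDIA_CATEGORY}))
```

With the real library, the same membership test would be applied to the set returned by extract_all_relations.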
  

The following link includes the Jupyter* file for the above example.

In a similar way, we can extract specific semantic relation types from other knowledge resources. See here for the full class definitions of the knowledge resources supported by NLP Architect. See Table 2 for their respective class names and semantic relation types:

Table 2: Knowledge Resources, Class Names and Semantic Relation Types

Knowledge resource           Class Name                        Semantic relation types
Wikipedia                    WikipediaRelationExtraction       BE_COMP, TITLE_PARENTHESIS, DISAMBIGUATION, CATEGORY,
                                                               REDIRECT_LINK, ALIASES, PART_OF_SAME_NAME
VerbOcean                    VerboceanRelationExtraction       VERBOCEAN_MATCH
WordNet                      WordNetRelationExtraction         SAME_SYNSET_ENTITY, SAME_SYNSET_EVENT,
                                                               PARTIAL_SYNSET_MATCH, DERIVATIONALLY
Stanford Coref Dictionary    ReferentDictRelationExtraction    REFERENT_DICT
WordEmbedding (GloVe, ELMo)  WordEmbeddingRelationExtraction   WORD_EMBEDDING_MATCH
Computational                ComputedRelationExtraction        EXACT_STRING, FUZZY_FIT, FUZZY_HEAD_FIT, SAME_HEAD_LEMMA
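
Relations from several resources can be pooled for a single mention pair. The sketch below assumes each resource class exposes the same extract_all_relations(mention_x, mention_y) method shown earlier for Wikipedia; StubExtractor and collect_relations are hypothetical helpers, and the stubbed lookups are illustrative only, so the snippet runs without any external resource.

```python
class StubExtractor:
    """Hypothetical stand-in for a knowledge-resource class from Table 2."""

    def __init__(self, lookup):
        # lookup maps a (mention_x, mention_y) pair to a set of relation names.
        self.lookup = lookup

    def extract_all_relations(self, mention_x, mention_y):
        return self.lookup.get((mention_x, mention_y), set())


def collect_relations(extractors, mention_x, mention_y):
    """Union the relations reported by every extractor for one mention pair."""
    relations = set()
    for extractor in extractors:
        relations |= extractor.extract_all_relations(mention_x, mention_y)
    return relations


# Illustrative lookups only; real extractors would query their resources.
wiki_stub = StubExtractor({("nyc", "New York City"): {"REDIRECT_LINK", "CATEGORY"}})
computed_stub = StubExtractor({("nyc", "nyc"): {"EXACT_STRING"}})

print(sorted(collect_relations([wiki_stub, computed_stub], "nyc", "New York City")))
```

Keeping the extractors behind one method name means a new resource can be added to the list without changing the pooling logic.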

You can find full documentation on how to extract semantic relations from external resources here. Please follow us on @IntelAIDev for the latest updates on NLP Architect.

Acknowledgments:

We would like to thank Shany Barham and Prof. Ido Dagan of the Bar-Ilan University NLP Research Lab for their help, guidance and contribution to this project.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at ai.intel.com.

Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.