At the Intel AI Lab, we are constantly investing in research to advance natural language processing (NLP). Today, I am proud to announce the work we are doing with Shany Barhom, Vered Shwartz, and Professor Ido Dagan from Bar-Ilan University on a new paper about recognizing coreferring event and entity mentions. This paper is one of many accepted to the annual meeting of the Association for Computational Linguistics (ACL 2019).
Correctly identifying events and entities across documents is a critical task for NLP. If a system cannot determine that different documents refer to the same event or entity, its potential to scale is severely limited. Many NLP applications, such as single- and multi-document summarization, knowledge representation, and information retrieval, will benefit from this cross-document capability. In the paper, we propose a neural architecture for cross-document coreference resolution that can be used for text summarization or open knowledge representation. As far as we know, this is the first paper on cross-document entity coreference, setting a new baseline for the task. In addition, we present a new state-of-the-art model for cross-document event coreference.
Coreference occurs when two or more expressions, within a single document or across different documents, refer to the same thing. Recognizing that multiple texts refer to the same entity or event is an important NLP task. Entity mentions are mostly noun phrases, such as named entities (Dr. Donna Strickland), locations, times, and common nouns (plane, dog, table). Event mentions are more complex and consist of three types: verbal predicates (such as “acquire”), nominalizations (such as “acquisition”), and concrete events (such as “the 21st Oscars”). As an example, take the sentences:
A system focusing solely on event coreference may find it difficult to recognize that “goes to” and “awarded” are coreferring, while a joint model would leverage the coreference between their arguments.
For our research we used the ECB+ corpus, which comprises over 900 texts found via Google Search. The news documents are divided into 43 topics, and each topic is divided into two sub-topics for increased difficulty. For example, one topic could be the Oscars, with one sub-topic covering news about Ellen DeGeneres hosting the awards ceremony and the other covering news about Hugh Jackman hosting it.
Though the documents are partially annotated for within-document (WD) and cross-document (CD) coreference, the data posed many challenges: low lexical variability, many singleton clusters (clusters with a single mention), and clusters with few mentions.
Using a joint model inspired by the work of Lee et al. (2013), we are able to cluster event and entity mentions based on the arguments they share, as identified by semantic role labeling. The idea is that if two events share entity arguments, or two entities take part in the same events, then they might be coreferring.
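To make the shared-argument intuition concrete, here is a minimal Python sketch. It is not the scoring function from the paper, which is a learned neural model; it is a hand-written overlap measure used only for illustration. The names `EventMention`, `roles`, and `shared_argument_ratio`, the role labels, and the entity cluster IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EventMention:
    text: str
    # Hypothetical representation: semantic role label -> ID of the entity
    # cluster that currently fills that role.
    roles: dict


def shared_argument_ratio(a: EventMention, b: EventMention) -> float:
    """Fraction of the roles both mentions have that point to the same entity cluster."""
    common_roles = set(a.roles) & set(b.roles)
    if not common_roles:
        return 0.0
    matches = sum(1 for role in common_roles if a.roles[role] == b.roles[role])
    return matches / len(common_roles)


# Toy data (assumed, not taken from the corpus).
goes_to = EventMention("goes to", {"ARG1": "prize", "ARG2": "laureate"})
awarded = EventMention("awarded", {"ARG1": "prize", "ARG2": "laureate"})
hosted = EventMention("hosted", {"ARG0": "host", "ARG1": "ceremony"})

print(shared_argument_ratio(goes_to, awarded))  # all shared roles agree
print(shared_argument_ratio(goes_to, hosted))   # the one shared role disagrees
```

Even though “goes to” and “awarded” have no words in common, their arguments resolve to the same entity clusters, which is the kind of signal a joint model can exploit.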
At inference time, our model (Fig. 1) takes the set of Google news documents as a single folder, without knowing which documents belong to which topic or sub-topic. We then use a k-means clustering algorithm to separate the documents into their appropriate sub-topics.
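The document clustering step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not our released code: the TF-IDF features, the whitespace tokenizer, the farthest-first initialization (chosen only to keep the example deterministic), and the four toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in set(u) | set(v))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds the values key by key
    return {k: w / len(vectors) for k, w in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Plain k-means; returns one cluster label per input vector."""
    # Farthest-first initialization: start from the first document, then keep
    # adding the document farthest from all centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


# Toy documents for two sub-topics of one topic (assumed data).
docs = [
    "ellen degeneres hosts oscars",
    "comedian ellen degeneres hosts",
    "hugh jackman hosts oscars",
    "actor hugh jackman hosts",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

In this toy run the two Ellen DeGeneres documents land in one cluster and the two Hugh Jackman documents in the other, even though both sub-topics share words such as “hosts” and “oscars”.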
Once the documents are in their sub-topics, for event mentions (verbal predicates, nominalizations, concrete events) we use the gold (labeled) mentions and initialize them as singleton clusters, so that each event is in a cluster by itself. For entity mentions (names, locations, times, common nouns), we use Stanford CorefAnnotator to create the initial WD clusters.
Next, we merge the initial clusters using a pairwise scoring model that takes two clusters of the same type (event or entity). This pairwise process is iterative: it alternates between merging two entity clusters and merging two event clusters, and it stops when no further merges are possible, as shown in Figure 2. (For more details about our pairwise scoring model, or how we train it, read sections 3.1-3.4 in the full academic paper.)
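The alternating merge loop can be sketched as below. This is a simplified illustration, not the trained model: the two scorers are hand-written stand-ins for the neural pairwise scorer, the threshold of 0.5 is arbitrary, entity mentions start as singletons here for brevity (rather than from within-document coreference output), and all names and toy mentions are hypothetical.

```python
def best_pair(clusters, score):
    """Return (score, i, j) for the highest-scoring pair, or None if there is no pair."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            s = score(clusters[i], clusters[j])
            if best is None or s > best[0]:
                best = (s, i, j)
    return best


def merge_step(clusters, score, threshold):
    """Merge the best-scoring pair in place; return True if a merge happened."""
    best = best_pair(clusters, score)
    if best is None or best[0] < threshold:
        return False
    _, i, j = best
    clusters[i] = clusters[i] | clusters[j]
    del clusters[j]  # j > i, so index i is unaffected
    return True


def joint_clustering(entities, events, entity_score, event_score, threshold=0.5):
    """Alternate one entity merge and one event merge until neither side changes."""
    changed = True
    while changed:
        merged_entity = merge_step(entities, entity_score, threshold)
        merged_event = merge_step(events, event_score, threshold)
        changed = merged_entity or merged_event
    return entities, events


def token_overlap(cluster_a, cluster_b):
    """Toy entity scorer: 1.0 if the two clusters share any lowercased token."""
    tokens_a = {tok for m in cluster_a for tok in m.lower().split()}
    tokens_b = {tok for m in cluster_b for tok in m.lower().split()}
    return 1.0 if tokens_a & tokens_b else 0.0


def make_event_scorer(entity_clusters, arguments):
    """Toy event scorer: Jaccard overlap of the entity clusters filling the arguments."""
    def cluster_of(mention):
        return next(c for c in entity_clusters if mention in c)

    def score(events_a, events_b):
        ids_a = {cluster_of(arg) for ev in events_a for arg in arguments[ev]}
        ids_b = {cluster_of(arg) for ev in events_b for arg in arguments[ev]}
        union = ids_a | ids_b
        return len(ids_a & ids_b) / len(union) if union else 0.0

    return score


# Toy mentions and event arguments (assumed data).
entity_mentions = ["Donna Strickland", "Nobel prize", "Prof. Strickland", "the prize"]
arguments = {
    "goes to": ["Donna Strickland", "Nobel prize"],
    "awarded": ["Prof. Strickland", "the prize"],
}
entity_clusters = [frozenset([m]) for m in entity_mentions]  # singleton start
event_clusters = [frozenset([e]) for e in arguments]         # singleton start

entities, events = joint_clustering(
    entity_clusters, event_clusters,
    token_overlap, make_event_scorer(entity_clusters, arguments))
print(entities)
print(events)
```

In this toy run the two event mentions cannot be merged in the first round, because only one of their two arguments has been linked so far. Once the second pair of entity mentions is merged, the event scorer sees full argument agreement and merges “goes to” with “awarded”, which illustrates how decisions on one mention type feed decisions on the other.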
The end result of our model is a set of clusters, each grouping the event or entity mentions that corefer within their topic. Our model improves on the previous state-of-the-art event coreference model on ECB+, while providing the first entity coreference results on the corpus, setting a new baseline for the task.
Search and recommendation engines, voice-command applications, and future technologies will require coreference resolution across vast numbers of documents to advance the field of NLP for academic and business use cases. I am extremely proud of the work we have accomplished with the team at Bar-Ilan University, and I am looking forward to continuing to work with them to improve the model. Our future work focuses on investigating ways to minimize pipeline errors, incorporating a mention prediction component, and adding some additional elements that I am excited to reveal in the coming months.
We will be presenting our paper as an oral presentation on Wednesday, July 31st at ACL 2019. If you are attending the event, please come by the Discourse and Pragmatics session at 10:30 AM to learn more about our efforts. For more information, visit the research portion of the Intel AI website at intel.ai/research and follow us on Twitter: @IntelAIResearch. Code will also be available in our open-source NLP Architect project.