Monday, September 9, 2019

2019-09-09: Introducing sumgram, a tool for generating the most frequent conjoined ngrams

Comparison of top 20 (first column) bigrams, top 20 (second column) six-grams, and top 20 (third column) sumgrams (conjoined ngrams) generated by sumgram for a collection of documents about the 2014 Ebola Virus Outbreak. Proper nouns of more than two words (e.g., "centers for disease control and prevention") are split when generating bigrams, sumgram strives to remedy this. Generating six-grams surfaces non-salient six-grams. Click image to expand.

A Web archive collection consists of groups of webpages that share a common topic e.g., “Ebola virus” or “Hurricane Harvey.” One of the most common tasks involved in understanding the "aboutness" of a collection is generating the top k (e.g., k = 20) ngrams. For example, given a collection about Ebola Virus, we could generate the top 20 bigrams as presented in Fig. 1. This simple operation of calculating the most frequent bigrams unveils useful bigrams that help us understand the focus of the collection, and may be used as a summary for the collection. For example, the most frequent bigram, "ebola virus" validates our prior knowledge about the collection topic, and the second bigram "west africa," provides geographical information related to the disease. 

Closer inspection however exposes a serious native defect of the bigrams summary - splitting multi-word proper nouns. Since the bigrams method splits terms into word pairs, all proper nouns with more than two words are split. For example, the trigram "world health organization" was split into two bigrams ("world health" - rank 7) and ("health organization" - rank 10). Also the six-gram "centers for disease control and prevention" was split into three bigrams: "disease control" - rank 9, "centers disease" - rank 12, and "control prevention" - rank 13. The splitting of multi-word proper nouns was easy to detect in this example because I am familiar with the Ebola Virus subject,  so it is possible that the split might go unnoticed if I am presented with bigrams for a collection I am not familiar with. The need to avoid splitting or fixing the splitting of multi-word proper nouns motivated the development of sumgram.

Ngrams vs. Sumgrams
To estimate the aboutness of a collection we could generated bigrams (n=2) or trigrams (n=3). Irrespective of the value of n, we only generate terms of the same ngram class. For example, in Fig. 1, bigrams were generated, so it is not possible to find a trigram term in the list. Fixing split multi-word proper nouns involves "gluing" together ngrams. For example, if we sum the bigrams "world health" and "health organization," the result is the trigram "world health organization." This means in our effort to conjoin split bigrams we could end up with a summary that consists of bigrams, trigrams, four-grams, five-gram, six-grams, etc (Fig. 1, column 1). A higher-order ngram (e.g., "world health organization") generated by conjoining lower-order ngrams (e.g., "world health" and "health organization"), we call a sumgram. In other words, sumgrams are formed by conjoining multiple lower-order ngrams. This means a collection summarized with sumgrams could include multiple different ngram classes (bigrams, trigrams, four-grams, five-grams, etc.). For example, as seen in Fig. 1, column 1, the first term ("ebola virus") is a bigram, the second term ("in west africa"), a trigram, the sixth - four-gram, the eighth - six-gram.


Performance consideration and Named Entity Recognition
Initially, we considered applying Named Entity Recognition (NER) as a means to identify and avoid splitting multi-word proper nouns. With a Named Entity Recognition (NER) system, one could easily label a text collection with entity labels (e.g., PERSON, LOCATION, and ORGANIZATION), and instruct the ngram generator to avoid splitting ngrams that have those labels, as a means to remedy the split ngrams problem. However, we decided not to apply NER to resolve split ngrams because NER would impose additional performance overhead upon sumgram. It was important to keep sumgrams as lightweight as possible without compromising the quality of results. There are some phrases such as "direct contact with" and "health care workers," that sumgram could generate unlike NER. However, NER unlike sumgrams provides the benefit of labeling ngrams (e.g, "CDC" - Organization) although with additional performance cost.


Sumgrams vs. Collocations
Shortly after I announced sumgram, Dan Ofer asked a very important question which I appreciate: "Why use this instead of phrase/coallocation detection? How is it different?"

The output of sumgram can be likened to the output of a collocation detection method. But even though sumgram's attempt to conjoin split ngrams can be likened to identifying collocations, sumgram is primarily a method for generating summaries, finding collocations (or conjoining split ngrams) is secondary. So the short answer to Dan's question is that both methods are different. Sumgram generates summaries by finding conjoined ngrams that are representative of the entire collection and not merely groups of words that co-occur. The focus of collocation method is not to summarize collections but to detect word groups that frequently occur together. The github "Sumgrams vs. Collocations" section expands on the differences.

The Sumgram Tool
We created a python tool, sumgram that implements two novel algorithms (pos_glue_split_ngrams and mvg_window_glue_split_ngramsresponsible for gluing split multi-word proper nouns. In addition to sumgram, we also released NwalaTextUtils, a collection of functions for processing text such as:

  1. derefURI(): Dereference URIs, returning HTML
  2. cleanHtml(): Remove boilerplate from HTML, returning plaintext
  3. getPgTitleFrmHTML(): Return text from within HTML title tag
  4. parallelGetTxtFrmURIs(): Dereference and remove boilerplate from URIs in parallel
  5. parallelGetTxtFrmFiles(): Return text from files (remove boilerplate from HTML) in parallel
  6. parallelTask(): Generic function for parallelizing other functions

The Github documentation explains how to install and use sumgram in addition to the utilization of sumgram with NwalaTextUtils.parallelGetTxtFrmURIs().

I would like to express my gratitude to Shawn Jones for his feedback, pull requests, and advice on improving sumgram and NwalaTextUtils. Additionally, I would like to thank Sawood Alam for reviewing sumgram. We encourage you to use sumgram and we welcome all feedback.

-- Alexander C. Nwala (@acnwala)

No comments:

Post a Comment