Posts labeled "text processing"

2019-09-09: Introducing sumgram, a tool for generating the most frequent conjoined ngrams

Fig. 1: Comparison of the top 20 bigrams (first column), top 20 six-grams (second column), and top 20 sumgrams (conjoined ngrams, third column) generated by sumgram for a collection of documents about the 2014 Ebola Virus Outbreak. Proper nouns of more than two words (e.g., "centers for disease control and prevention") are split when generating bigrams; sumgram strives to remedy this. Generating six-grams surfaces non-salient six-grams.

A Web archive collection consists of groups of webpages that share a common topic, e.g., "Ebola virus" or "Hurricane Harvey." One of the most common tasks involved in understanding the "aboutness" of a collection is generating the top k (e.g., k = 20) ngrams. For example, given a collection about the Ebola virus, we could generate the top 20 bigrams presented in Fig. 1. This simple operation of calculating the most frequent bigrams unveils useful bigrams that help us understand the focus of the collection, and m...
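The top-k ngram counting described above can be sketched in a few lines of plain Python. This is a minimal illustration of frequency counting, not sumgram itself (sumgram additionally "conjoins" overlapping ngrams into longer phrases), and the sample sentences are invented for illustration:

```python
# Minimal sketch: count the k most frequent n-grams across a
# collection of documents. Tokenization here is a naive
# lowercase-and-split; real pipelines would also strip punctuation.
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield successive n-token tuples from a token list."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

def top_k_ngrams(documents, n=2, k=20):
    """Count n-grams across all documents; return the k most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.lower().split(), n))
    return counts.most_common(k)

# Invented sample documents (not from the Ebola collection)
docs = [
    "the centers for disease control and prevention issued guidance",
    "the centers for disease control and prevention tracked the ebola virus outbreak",
]
print(top_k_ngrams(docs, n=2, k=5))
```

Note how the bigram counter fragments the proper noun "centers for disease control and prevention" into pieces like ("centers", "for"); that fragmentation is exactly what sumgram's conjoining step is meant to repair.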

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a very straightforward job, given that Python has long offered sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can be used as starting points for future XML parsing jobs. CDATA. CDATA (Character Data) is seen in the values of many XML fields. Strings inside a CDATA section are not parsed; in other words, they are kept exactly as they are, including markup. One example is <script> <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]> </script> Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what the encoding of the text is before ope...
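The CDATA behavior described above can be seen with Python's standard library alone: ElementTree returns the CDATA body as ordinary element text, with the embedded markup left unparsed. A short sketch, using the snippet from the post:

```python
# Sketch: a CDATA section's contents come back verbatim as element
# text, including the embedded <message> tags, which are NOT parsed
# into child elements.
import xml.etree.ElementTree as ET

xml = "<script><![CDATA[ <message> Welcome to TutorialsPoint </message> ]]></script>"
root = ET.fromstring(xml)

print(root.text)        # the CDATA body, markup included
print(list(root))       # no child elements were created
```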

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (junk text), HTML, JavaScript, comments, and CSS text.
Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include JavaScript, HTML, comments, or CSS text.
Fig. 3: Boilerplate removal result for the Justext method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text() and NLTK's (OLD) clean_html(), but the page title is absent.
Fig. 4: Boilerplate removal result for the Python-goose method for a news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but the page title and first paragraph are absent.
Fig. 5: Boilerplate...
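The weakness attributed to naive extraction in Fig. 1 is easy to reproduce: any extractor that collects all character data will keep CSS and JavaScript unless <script> and <style> contents are skipped explicitly. A standard-library sketch of that skipping, with an invented HTML snippet for illustration (this is not any of the surveyed tools):

```python
# Sketch: extract visible text while skipping <script>/<style>
# contents, using only Python's standard library.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script and style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# Invented page: CSS and JavaScript surround one real paragraph
html = ("<html><head><style>body{color:red}</style></head>"
        "<body><script>var junk=1;</script>"
        "<p>Actual article text.</p></body></html>")

extractor = TextExtractor()
extractor.feed(html)
print(" ".join(extractor.parts))
```

Dropping this skip logic (i.e., always appending in handle_data) reproduces the junk-laden output the post shows for get_text().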