Monday, March 20, 2017

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (Junk text), HTML, Javascript, comments, and CSS text.
Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include Javascript, HTML, comments, or CSS text.
Fig. 3: Boilerplate removal result for Justext method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text() and NLTK's (OLD) clean_html() methods, but the page title is absent.
Fig. 4: Boilerplate removal result for Python-goose method for this news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but page title and first paragraph are absent.
Fig. 5: Boilerplate removal result for Python-boilerpipe (ArticleExtractor) method for a news website. Extracted text includes less extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext.
Boilerplate removal refers to the task of extracting the main text content of webpages. This is done by removing content such as navigation links and header and footer sections. Even though this task is a common prerequisite for most text processing tasks, I have not found an authoritative, versatile solution. In order to better understand how some common options for boilerplate removal perform against one another, I developed a simple experiment to measure how well the methods perform when compared to a gold standard text extraction method (myself). Python-boilerpipe (ArticleExtractor mode) performed best on my small sample of 10 news documents, with an average Jaccard Index score of 0.7530 and a median Jaccard Index score of 0.8964. The Jaccard score for each document and boilerplate removal method was calculated over the sets of words created from the text extracted from the news document and the gold standard text.

Some common boilerplate removal methods
  1. BeautifulSoup's get_text()
    • Description: BeautifulSoup is a very popular (if not the most popular) Python library for parsing HTML. It offers a text extraction method, get_text(), which can be invoked on a tag element such as the body element of a webpage. Empirically, get_text() does not do a good job of removing the Javascript, HTML markup, comments, and CSS text of webpages, and it includes extraneous text along with the extracted text.
    • Recommendation: I don't recommend exclusive use of get_text() for boilerplate removal.
  2. NLTK's (OLD) clean_html()
    • Description: The Natural Language Toolkit (NLTK) used to provide a method called clean_html() for boilerplate removal. This method used regular expressions to parse and subsequently remove HTML, Javascript, CSS, comments, and white space. However, NLTK has since deprecated this implementation and suggests the use of BeautifulSoup's get_text() method, which, as we have already seen, does not do a good job.
    • Recommendation: This method does a good job of removing HTML, Javascript, CSS, comments, and white space. However, it retains boilerplate text such as navigation link text and header and footer text. Therefore, if your application is not sensitive to extraneous text and you just care about including all the text from a page, this method is sufficient.
  3. Justext
    • Description: According to Mišo Belica, the creator of Justext, it was designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources. Justext also provides an online demo.
    • Recommendation: Justext is a decent boilerplate removal method that performed almost as well as the best boilerplate removal method in my experiment (Python-boilerpipe). But note that Justext may omit page titles.
  4. Python-goose
    • Description: Python-goose is a Python rewrite of an application originally written in Java and subsequently Scala. According to the author, the goal of Goose is to process news articles or article-type pages and extract the main body text, metadata, and most probable image candidate.
    • Recommendation: Python-goose is a decent boilerplate removal method, but it was outperformed by Python-boilerpipe. Also note that Python-goose may omit page titles just like Justext.
  5. Python-boilerpipe
    • Description: Python-boilerpipe is a Python wrapper for the original Java library for boilerplate removal and text extraction from HTML pages.
    • Recommendation: Python-boilerpipe outperformed all the other boilerplate removal methods in my small test sample. I currently use this method as the boilerplate removal method for my applications.
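The regular-expression approach described for NLTK's old clean_html() can be illustrated with a minimal sketch. This is not NLTK's actual code: the function name strip_markup, the patterns, and the sample page are my own assumptions, written only to show why such a method removes markup but keeps navigation and footer text.

```python
import re


def strip_markup(html):
    """Hypothetical regex-based cleaner (not NLTK's implementation)."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", html)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Replace every remaining tag with a space.
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # Collapse runs of white space.
    return re.sub(r"\s+", " ", text).strip()


# Made-up page with a navigation list, one article paragraph, and a footer.
page = """<html><head><title>Storm hits coast</title>
<style>p { color: red; }</style>
<script>var tracker = 1;</script></head>
<body><!-- ad slot --><ul><li>Home</li><li>Sports</li></ul>
<p>The storm made landfall on Monday.</p>
<div>Copyright 2017</div></body></html>"""

print(strip_markup(page))
```

In this sketch the script, style, and comment content disappear, but "Home", "Sports", and "Copyright 2017" remain in the output next to the article sentence. That is the kind of extraneous text the recommendation for method 2 warns about.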
The 10 news documents have the following corresponding gold standard text documents:

  1. Gold standard text for news document - 1
  2. Gold standard text for news document - 2
  3. Gold standard text for news document - 3
  4. Gold standard text for news document - 4
  5. Gold standard text for news document - 5
  6. Gold standard text for news document - 6
  7. Gold standard text for news document - 7
  8. Gold standard text for news document - 8
  9. Gold standard text for news document - 9
  10. Gold standard text for news document - 10
The HTML for the 10 news documents was retrieved by dereferencing each of the 10 URLs with curl. This means the boilerplate removal methods operated on just the HTML (without running Javascript). I also ran the boilerplate removal methods on archived copies of the 10 documents from archive.is. The rationale was that since archive.is runs Javascript and transforms the original page, this might affect the results. My experiment showed that running boilerplate removal on the archived copies reduced the similarity between the gold standard texts and the output texts for all the boilerplate removal methods except BeautifulSoup's get_text() method (Table 2).

Second, for each document, I manually copied the text I considered to be the main body of the document, creating a total of 10 gold standard texts. Third, I removed the boilerplate from the 10 documents using the 8 methods outlined in Table 1. This led to a total of 80 extracted text documents (10 for each boilerplate removal method). Fourth, for each of the 80 documents, I computed the Jaccard Index (the size of the intersection divided by the size of the union of both sets) between the document and its respective gold standard. Fifth, for each of the 8 boilerplate removal methods, I computed the average and median of the Jaccard scores for the 10 news documents (Table 1).
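The scoring steps can be sketched in a few lines of Python. The tokenization here (lowercasing and splitting on non-word characters) is an assumption of mine, as are the function names and the toy texts, so scores from this sketch will not necessarily match the ones in the tables.

```python
import re
from statistics import mean, median


def word_set(text):
    # Assumed tokenization: lowercase, then split on non-word characters.
    return set(re.findall(r"\w+", text.lower()))


def jaccard(text_a, text_b):
    """Size of the intersection divided by size of the union of word sets."""
    a, b = word_set(text_a), word_set(text_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty texts count as identical.
        return 1.0
    return len(a & b) / len(union)


# Toy example: the extracted text keeps navigation and footer words.
gold = "The storm made landfall on Monday."
extracted = "Home Sports The storm made landfall on Monday. Copyright 2017"
score = jaccard(extracted, gold)
print(round(score, 4))

# Per-method summary over (extracted, gold) pairs, one pair per document.
pairs = [(extracted, gold), (gold, gold)]
scores = [jaccard(e, g) for e, g in pairs]
print(round(mean(scores), 4), round(median(scores), 4))
```

In the toy pair, 6 of the 10 distinct words are shared, so the score is 0.6. A method that returned the gold text exactly would score 1.0.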

Result

Table 1: Boilerplate removal results for live web news documents

Index  Method                                     Average Jaccard Index (10 documents)  Median Jaccard Index (10 documents)
1      BeautifulSoup's get_text()                 0.1959                                0.2201
2      NLTK's (OLD) clean_html()                  0.3847                                0.3479
3      Justext                                    0.7134                                0.8339
4      Python-goose                               0.7009                                0.6822
5      Python-boilerpipe.ArticleExtractor         0.7530                                0.8964
6      Python-boilerpipe.DefaultExtractor         0.6706                                0.7073
7      Python-boilerpipe.CanolaExtractor          0.6227                                0.6472
8      Python-boilerpipe.LargestContentExtractor  0.6188                                0.6444


Table 2: Boilerplate removal results for archived news documents, showing lower similarity than the live web versions (Table 1) for all methods except BeautifulSoup's get_text()
Index  Method                                     Average Jaccard Index (10 documents)  Median Jaccard Index (10 documents)
1      BeautifulSoup's get_text()                 0.2630                                0.2687
2      NLTK's (OLD) clean_html()                  0.3365                                0.3232
3      Justext                                    0.5956                                0.6414
4      Python-goose                               0.4209                                0.4289
5      Python-boilerpipe.ArticleExtractor         0.6240                                0.7121
6      Python-boilerpipe.DefaultExtractor         0.5534                                0.7010
7      Python-boilerpipe.CanolaExtractor          0.5028                                0.5274
8      Python-boilerpipe.LargestContentExtractor  0.4961                                0.4669
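To make the live versus archived comparison concrete, the short sketch below subtracts the Table 1 averages from the Table 2 averages. The numbers are copied from the two tables; the variable names are my own.

```python
# Average Jaccard Index per method, copied from Table 1 (live web).
live = {
    "BeautifulSoup's get_text()": 0.1959,
    "NLTK's (OLD) clean_html()": 0.3847,
    "Justext": 0.7134,
    "Python-goose": 0.7009,
    "Python-boilerpipe.ArticleExtractor": 0.7530,
    "Python-boilerpipe.DefaultExtractor": 0.6706,
    "Python-boilerpipe.CanolaExtractor": 0.6227,
    "Python-boilerpipe.LargestContentExtractor": 0.6188,
}

# Average Jaccard Index per method, copied from Table 2 (archive.is).
archived = {
    "BeautifulSoup's get_text()": 0.2630,
    "NLTK's (OLD) clean_html()": 0.3365,
    "Justext": 0.5956,
    "Python-goose": 0.4209,
    "Python-boilerpipe.ArticleExtractor": 0.6240,
    "Python-boilerpipe.DefaultExtractor": 0.5534,
    "Python-boilerpipe.CanolaExtractor": 0.5028,
    "Python-boilerpipe.LargestContentExtractor": 0.4961,
}

# Change in the average when moving from live pages to archived copies.
delta = {method: round(archived[method] - live[method], 4) for method in live}
improved = [method for method, change in delta.items() if change > 0]
largest_drop = min(delta, key=delta.get)

for method, change in delta.items():
    print(f"{method:<43} {change:+.4f}")
```

Only BeautifulSoup's get_text() has a positive change (+0.0671). The largest drop in the average is for Python-goose (-0.2800).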

Python-boilerpipe (ArticleExtractor mode) outperformed all the other methods. I acknowledge that this experiment is by no means rigorous, for important reasons that include:
  • The test sample is very small.
  • Only news documents were considered.
  • The use of the Jaccard similarity measure forces documents to be represented as sets. This eliminates order (the permutation of words) and duplicate words. Consequently, if a boilerplate removal method omits some occurrences of a word, this information will be lost in the Jaccard similarity calculation.
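The last limitation can be seen in a small sketch. The set-based function mirrors the measure used in this experiment. The multiset (count-aware) variant was not used in the experiment and is shown only for contrast; the function names and the toy sentences are my own.

```python
from collections import Counter


def set_jaccard(words_a, words_b):
    # Set-based Jaccard: duplicates and word order are discarded.
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


def multiset_jaccard(words_a, words_b):
    # Count-aware variant (not used in the experiment): the intersection
    # takes the minimum count per word, the union takes the maximum.
    a, b = Counter(words_a), Counter(words_b)
    return sum((a & b).values()) / sum((a | b).values())


gold = "the cat saw the other cat".split()
shuffled = "cat other the saw cat the".split()  # same words, new order
deduped = "the cat saw other".split()           # repeated words dropped

print(set_jaccard(gold, shuffled))                  # order is invisible
print(set_jaccard(gold, deduped))                   # dropped repeats are invisible
print(round(multiset_jaccard(gold, deduped), 4))    # counts expose the loss
```

Both set-based scores are 1.0 even though the second output lost two word occurrences. The count-aware variant gives 4/6 for that case, but it still ignores word order.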
Nevertheless, I believe this small experiment sheds some light on the different behaviors of the boilerplate removal methods. For example, BeautifulSoup's get_text() does not do a good job of removing HTML, Javascript, CSS, and comments, unlike NLTK's clean_html(), which does a good job of removing these but includes extraneous text. Also, Justext and Python-goose do not include a large body of extraneous text, even though they may omit a news article's title. Finally, based on these experimental results, Python-boilerpipe (ArticleExtractor mode) was the best boilerplate removal method on this sample.
--Nwala
