2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (junk text), HTML, JavaScript, comments, and CSS text.

Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include JavaScript, HTML, comments, or CSS text.

Fig. 3: Boilerplate removal result for Justext method for a news website. Extracted text includes less extraneous text than BeautifulSoup's get_text() and NLTK's (OLD) clean_html() methods, but the page title is absent.

Fig. 4: Boilerplate removal result for Python-goose method for this news website. Unlike BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, the result contains no extraneous text, but the page title and first paragraph are absent.

Fig. 5: Boilerplate...