Posts

Showing posts with the label Python-goose

2017-03-20: A survey of 5 boilerplate removal methods

Image
Fig. 1: Boilerplate removal result for  BeautifulSoup's get_text()  method for a   news website . Extracted text includes extraneous text (Junk text), HTML, Javascript, comments, and CSS text. Fig. 2: Boilerplate removal result for  NLTK's (OLD) clean_html()  method for a   news website .  Extracted text includes  e xtraneous text, but does not include Javascript, HTML, comments or CSS text. Fig. 3: Boilerplate removal result for  Justext  method for a   news website .  Extracted text includes  s maller extraneous text compared to BeautifulSoup's get_text() and NLTK's (OLD) clean_html() method, but the page title is absent. Fig. 4: Boilerplate removal result for   Python-goose  method for this   news website . No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but page title and first paragraph are absent. Fig. 5: Boilerplate...