Posts

Showing posts with the label text mining

2021-05-10: Chronicling the life-cycle of top news stories with StoryGraphBot

Fig. 1 (Click on image to expand): Story Attention Dynamics chart illustrating the life-cycle of two top news stories from May 18, 2018 to May 19, 2018. Each line (red or blue) represents a top news story. The x-axis represents time, while the y-axis represents the average degree of the Connected Components (the representation of a story). Within our window of observation, the Santa Fe High School Shooting story received peak attention on Friday, May 18, 2018 at 4:40 PM. This attention then waned, with its lowest point coinciding with the rise of a new story, the Royal Wedding of Prince Harry and Meghan Markle. News stories are born expected or unexpected, big or small, compete for attention with sibling stories or enjoy the spotlight alone, live short or long lives, and exit through death or hibernation. Since August 2017, every 10 minutes, StoryGraph has been quantifying the attention given to news stories. In the past three years, we have seen threats of war, hurrica...

2021-01-20: 366 dots in 2020 - top news stories of 2020

Fig. 1 (Click image to expand): 366 dots in 2020 - Top news stories for 366 days in 2020. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all the 144 story graphs for a given day. The x-axis represents time, the y-axis represents the average degree of the GCC. The annotations (and legend) represented by colored dots were assigned semi-automatically. I join the chorus to say 2020 was a year like no other, shaped by three historic events: the Coronavirus pandemic, the protests surrounding the Black Lives Matter movement, and the US Presidential elections. According to StoryGraph, in 2018, the top news story was the Kavanaugh hearings. In 2019, it was the Mueller Report. Similar to 2018 and 2019, we analyzed all news stories collected by StoryGraph at 10-minute intervals every day in 2020 to identify the top news stories of 2020. Recall how we identify top news stories, explained briefly in 365 dots in 201...
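The metric in the caption above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not StoryGraph's actual code: each story graph is assumed to be an undirected edge list without self-loops, the GCC is assumed to be the connected component with the most nodes, and the function names (`gcc_average_degree`, `top_dot_for_day`) are hypothetical. With one graph every 10 minutes, a day yields 24 x 6 = 144 graphs.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map; duplicate edges collapse."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def components(adj):
    """Return the connected components as lists of nodes (BFS)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, nodes = deque([start]), []
        while queue:
            node = queue.popleft()
            nodes.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        comps.append(nodes)
    return comps


def gcc_average_degree(edges):
    """Average degree of the largest component (assumed GCC) of one graph."""
    adj = build_adjacency(edges)
    comps = components(adj)
    if not comps:
        return 0.0
    gcc = max(comps, key=len)  # assumption: "giant" means most nodes
    return sum(len(adj[n]) for n in gcc) / len(gcc)


def top_dot_for_day(graphs):
    """Pick the day's dot: the largest GCC average degree over all graphs."""
    return max((gcc_average_degree(g) for g in graphs), default=0.0)
```

For example, a day whose graphs are a triangle plus a stray edge, and a four-node path, would yield the triangle's GCC as the dot, since every node in a triangle has degree 2.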

2020-01-04: 365 dots in 2019 - top news stories of 2019

Fig. 1 (Click on image to expand): 365 dots in 2019 - News stories for 365 days in 2019. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all the 144 story graphs for a given day. The x-axis represents time, the y-axis represents the average degree of the GCC. In March 2019 I published "365 dots in 2018", where I presented the top stories for each day in 2018 according to StoryGraph. Now that 2019 is over, it is natural to ask: what were the top news stories of 2019? News organizations will often publish "the year's top stories" or "year in review" (e.g., CNN, CBS, FoxNews), but the selection criteria are not always made explicit. The closest to a selection criterion I have seen from news organizations is the presentation of their most viewed (or most popular) news stories. But this criterion is not accessible to ordinary users who cannot access the private traffic sta...

2019-03-05: 365 dots in 2018 - top news stories of 2018

Fig. 1: News stories for 365 days in 2018. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all the 144 story graphs for a given day. The x-axis represents time, the y-axis represents the average degree of the selected GCC. Click to expand image. There was no shortage of big news headlines in 2018. Amidst this abundance, a natural question is: what were the top news stories of 2018? Several news organizations published lists of candidate top stories for 2018, such as CNN's most popular stories and videos of 2018, and The year in review: Top news stories of 2018 month by month from CBS. Even though such lists from respectable news organizations pass the "seems right" test, they mostly present the top news stories without explaining their selection process. In other words, they often do not state why one story made the list and another did not. We ...

2017-03-20: A survey of 5 boilerplate removal methods

Fig. 1: Boilerplate removal result for BeautifulSoup's get_text() method for a news website. Extracted text includes extraneous text (junk text), HTML, JavaScript, comments, and CSS text. Fig. 2: Boilerplate removal result for NLTK's (OLD) clean_html() method for a news website. Extracted text includes extraneous text, but does not include JavaScript, HTML, comments, or CSS text. Fig. 3: Boilerplate removal result for the Justext method for a news website. Extracted text includes less extraneous text than BeautifulSoup's get_text() and NLTK's (OLD) clean_html() methods, but the page title is absent. Fig. 4: Boilerplate removal result for the Python-goose method for this news website. No extraneous text compared to BeautifulSoup's get_text(), NLTK's (OLD) clean_html(), and Justext, but the page title and first paragraph are absent. Fig. 5: Boilerplate...
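To illustrate why plain text extraction can leak JavaScript and CSS, as the captions above describe, here is a minimal sketch built on Python's standard-library `html.parser`. It is not one of the surveyed methods and does not reproduce their internals; the `TextExtractor` class, the `extract_text` helper, and the sample page are all hypothetical. Skipping `script` and `style` removes code and stylesheet text, but navigation text still survives, which is the kind of extraneous text that dedicated boilerplate removers target.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, optionally ignoring text inside given tags."""

    def __init__(self, skip_tags=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies also arrive here as plain data.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, skip_tags=()):
    """Return the page's text nodes joined by single spaces."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, for illustration only.
PAGE = (
    "<html><head><title>T</title><style>p{color:red}</style>"
    "<script>var x=1;</script></head>"
    "<body><nav>Home</nav><p>Story text.</p></body></html>"
)

naive = extract_text(PAGE)                                  # keeps JS and CSS text
cleaned = extract_text(PAGE, skip_tags=("script", "style"))  # drops them
```

Here `cleaned` still contains the navigation label "Home" alongside the story text, so filtering tags alone does not amount to boilerplate removal.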