2020-01-13: Review of WS-DL's 2019

The Web Science and Digital Libraries (WS-DL) Research Group continued to grow in 2019! In this post we recap the highlights of calendar year 2019, including two Ph.D. graduates, seven incoming Ph.D. students, and a new assistant professor.  But don't wait another year to find out what we're up to: follow us on Twitter (@WebSciDL) to keep up to date.

Students and faculty

We are most proud of the fact that we graduated two Ph.D. students this year:

We now have five WS-DL faculty with the addition of:
With the benefit of five faculty, we've been able to offer a suite of WS-DL undergraduate and graduate classes that we're especially proud of.  In Spring 2019 we had five WS-DL classes (Web Science, Info Vis, IR, HCI, Archiving Forensics), and in Fall 2019 we had a record six WS-DL classes (Web Programming, Web Server Design, Data Science, Data Vis, Emergent Tech, Intelligent UIs)!  We can't replace Lulwah and Mat, but to fill those classes and keep our five faculty engaged we added many new full-time and part-time graduate students this year, including:

In 2019 we also had four students advance their avatars on the "Ph.D. Crush board" from "Ph.D. student" to "Ph.D. candidate":

Publications and presentations

We had 30 publications in 2019: three journal articles, two book chapters, 18 conference or workshop papers (including a best paper award plus two best paper nominations), and eight technical reports.  This total does not include the 2019 publications from Dr. Ashok since those were in the pipeline prior to his joining WS-DL; his contributions will be included in the 2020 summary.  

With five faculty and numerous students, manually keeping an exhaustive publication list up to date is now difficult.  It's easier to look at the "@WebSciDL" label on Google Scholar or to look at our individual pages (Nelson, Weigle, Wu, Jayarathna, Ashok). 

The highlights of the many conferences and workshops we attended this year include:

Research presentations and outreach

In addition to travel affiliated with formal publications, we presented our research in a variety of venues, both local and abroad, as well as for a range of audiences, from peers to K-12 students.  Some of the highlights include:

Software, data sets, services

Our scholarly output is not limited to conventional publications or presentations: we also seek to advance the state of the art through release of software, data sets, and proof-of-concept services.  Some of these that we either released or made significant updates to in the course of 2019 include:

  • Alexander Nwala's "Storygraph" is all three and then some: storygraph.cs.odu.edu is a demo service and data store, there's a GitHub repo, and there are blog posts describing "Top News Stories in 2018" and "Top News Stories in 2019".  Storygraph has over two years of RSS feeds from 17 different news sources and computes term overlap to measure when the stories "agree" (share terms) and when they diverge.  Follow @StoryGraphBot for more info. 
  • Nauman Siddique (@m_nsiddique) released a data set of Twitter handles for members of the 116th US Congress (blog, GitHub repo) as part of his research in uncovering deleted tweets.  Creating this data set was not nearly as straightforward as one might imagine, and the many sources that purport to have such lists do not agree with each other.
  • In further support of his research on deleted tweets, Nauman Siddique (with Sawood Alam) released "TweetedAt" (blog, service, GitHub repo), which extracts the datetime of a tweet from the tweet id itself (note that the Twitter API will not provide metadata for deleted tweets). What separates this library/service from others is that, although the datetime is encoded in the id only for tweets created with the "Snowflake" algorithm (ca. late 2010 onward), TweetedAt can also estimate the datetime of pre-Snowflake ids.
  • Mohamed Aturban released a data set of 16k Mementos (URLs of archived pages) that we have been periodically replaying for over one year (GitHub repo, tech report).  The formal publication is forthcoming, but it informed Mohamed's four-part blog series about "Where did the archive go?" (Part 1: Library and Archives Canada, Part 2: National Library of Ireland, Part 3: Public Record Office of Northern Ireland, Part 4: WebCite).  
  • Alexander Nwala released "sumgram", which identifies the most frequent conjoined ngrams in text.  For example, a conventional search for frequent three-grams would find "World Health Organization" but would split "Centers for Disease Control and Prevention" into "Centers for Disease" and "Control and Prevention"; sumgram conjoins such fragments and yields the full six-gram.  The blog post documents the algorithm and the GitHub repo has already seen significant external activity. 
  • Sawood Alam and Mohamed Aturban released the "Archival Fixity" and "Archival fixity manifest server" code used in their JCDL 2019 paper "Archive Assisted Archival Fixity Verification Framework". 
  • Shawn Jones has been updating the suite of software packages in the Dark and Stormy Archives Framework, the most significant of which has been Raintale (blog, GitHub repo, site).  Raintale is functionally a replacement for Storify: it works in conjunction with MementoEmbed to take a series of URLs (in our case, URLs for archived pages) and render the list as HTML, Markdown, Jekyll, Wikitext, Twitter, or Facebook output. 
  • Sawood Alam released a GitHub repo for MementoMap, the complement to his best paper nominated "MementoMap Framework for Flexible and Adaptive Web Archive Profiling".  It is an especially niche application, but MementoMap will eventually prove invaluable for those who would like to summarize the holdings of web archives.
  • Undergraduates Abigail Mabe and Dhruv Patel made significant updates to tmvis (site, GitHub repo), which creates thumbnail-based visualizations of how a single web page has changed over time.  These improvements include animated GIFs, a histogram of the number of available mementos over time, and fine-grained controls such as custom time ranges and removing specific thumbnails from consideration.
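
To make the TweetedAt item above a little more concrete, here is a minimal sketch of how a datetime can be recovered from a Snowflake-era tweet id.  It assumes the commonly documented Snowflake layout (millisecond timestamp in the bits above the low 22, offset from an epoch of 1288834974657 ms); the function name is ours for illustration and is not TweetedAt's actual API.  It does not attempt the pre-Snowflake estimation that TweetedAt also provides.

```python
from datetime import datetime, timedelta, timezone

# Assumed Snowflake epoch offset in milliseconds (2010-11-04T01:42:54.657Z).
TWITTER_EPOCH_MS = 1288834974657
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Hypothetical helper: decode the UTC creation time from a Snowflake tweet id."""
    # The low 22 bits hold worker and sequence numbers; the rest is the timestamp.
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    # Use timedelta arithmetic to avoid floating-point rounding of milliseconds.
    return UNIX_EPOCH + timedelta(milliseconds=timestamp_ms)


if __name__ == "__main__":
    # An id whose timestamp bits equal 1 decodes to one millisecond after the epoch offset.
    print(snowflake_to_datetime(1 << 22).isoformat())
```

Because the timestamp lives in the id itself, this works even when the tweet has been deleted and the API returns no metadata, which is what makes the approach useful for deleted-tweet research.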

Other contributions

Our research group made a number of other contributions in 2019 that do not necessarily fit into any of the above categories.  Some of the highlights include: 


As usual, we all submitted many proposals this year, with a variety of internal and external collaborators.  Dr. Wu had a very successful year, receiving two external grants as well as two internal grants (one with Dr. Weigle):

Looking forward to 2020!

In 2020 we expect to have several more Ph.D. students graduate, increase our external funding, and host more external visitors.  We'll continue to offer some of the same courses (e.g., Web Science, Data Science, Data Vis) but we're also offering several new courses for 2020, such as Data Mining, AI, and NLP.  We're always looking for motivated and talented students to work with us on research problems, so please sign up for one of the WS-DL courses if you'd like to be considered. 

WS-DL annual reviews are also available for 2018, 2017, 2016, 2015, 2014, and 2013.  Finally, we'd like to thank all those who have complimented our blog, students, publications, code, or the WS-DL research group in general.  We really appreciate the feedback, some of which we include below.