Sunday, January 7, 2018

2018-01-07: Review of WS-DL's 2017

The Web Science and Digital Libraries Research Group had a steady 2017, with one MS student graduated, one research grant awarded ($75k), 10 publications, and 15 trips to conferences, workshops, hackathons, internships, etc.  In the previous four years (2013--2016) we graduated five PhD and three MS students, so the focus for this year was "recruiting" and we did pick up seven new students: three PhD and four MS.  We had so many new and prospective students that Dr. Weigle and I created a new CS 891 web archiving seminar to introduce them to web archiving and graduate school basics.

We had 10 publications in 2017:
  • Mohamed Aturban published a tech report about the difficulties in simply computing fixity information about archived web pages (spoiler alert: it's a lot harder than you might think; blog post).  
  • Corren McCoy published a tech report about ranking universities by their "engagement" with Twitter.  
  • Yasmin AlNoamany, now a post-doc at UC Berkeley, published two papers based on her dissertation about storytelling: a tech report about the different kinds of stories that are possible for summarizing archival collections, and a paper at Web Science 2017 about how our automatically created stories are indistinguishable from those created by experts.
  • Lulwah Alkwai published an extended version of her JCDL 2015 best student paper in ACM TOIS about the archival rate of web pages in Arabic, English, Danish, and Korean languages (spoiler alert: English (72%), Arabic (53%), Danish (35%), and Korean (32%)).
  • The rest of our publications came from JCDL 2017:
    •  Alexander published a paper about his 2016 summer internship at Harvard and the Local Memory Project, which allows for archival collection building based on material from local news outlets. 
    • Justin Brunelle, now a lead researcher at Mitre, published the last paper derived from his dissertation.  Spoiler alert: if you use headless crawling to activate all the JavaScript, embedded media, iframes, etc., be prepared for your crawl time to slow and your storage to balloon.
    • John Berlin had a poster about the WAIL project, which makes it easy to run Heritrix and the Wayback Machine on your laptop (those who have tried know how hard this was before WAIL!).
    • Sawood Alam had a proof-of-concept short paper about "ServiceWorker", a new JavaScript API in web browsers that allows for rewriting URIs in web pages and could have significant impact on how we transform web pages in archives.  I unexpectedly had to present this paper because, thanks to a flight cancellation the day before, John and Sawood were in a taxi headed to the venue during the scheduled presentation time!
    • Mat Kelly had a poster (and a separate, lengthy tech report) about how difficult it is to simply count how many archived versions of a web page an archive has (spoiler alert: it has to do with deduping, scheme transition of http-->https, status code conflation, etc.).  This won best poster at JCDL 2017!
We were fortunate to be able to travel to about 15 different workshops, conferences, and hackathons:

WS-DL did not host any external visitors this year, but we were active with the colloquium series in the department and the broader university community:
In the popular press, we had two main coverage areas:
  • RJI ran three separate articles about Shawn, John, and Mat participating in the 2016 "Dodging the Memory Hole" meeting. 
  • On a less auspicious note, it turns out that Sawood and I had inadvertently uncovered the Optionsbleed bug three years ago, but failed to recognize it as an attack. This fact was covered in several articles, sometimes with the spin of us withholding or otherwise being cavalier with the information.
We've continued to update existing software and datasets, and to release new ones, via our GitHub account. Given the evolving nature of software and data, it can sometimes be difficult to pin down a specific release date, but this year our significant releases and updates include:
For funding, we were fortunate to continue our string of eight consecutive years with new funding.  The NEH and IMLS awarded us a $75k, 18-month grant, "Visualizing Webpage Changes Over Time", for which Dr. Weigle is the PI and I'm the Co-PI.  This is an area we've recognized as important for some time, and we're excited to have a separate project dedicated to visualizing archived web pages.

Another point that you can probably infer from the discussion above, but that I want to make explicit, is that we're especially happy to be able to continue to work with so many of our alumni.  The nature of certain jobs inevitably takes some people outside of the WS-DL orbit, but as you can see above, in 2017 we were fortunate to continue to work closely with Martin (2011) now at LANL, Yasmin (2016) now at Berkeley, and Justin (2016) now at Mitre.

WS-DL annual reviews are also available for 2016, 2015, 2014, and 2013.  Finally, I'd like to thank all those who at various conferences and meetings have complimented our blog, students, and WS-DL in general.  We really appreciate the feedback, some of which we include below.


