2019-01-07: Review of WS-DL's 2018
We had a great #jcdl2018 / #wadl2018 / @jcdl2018! Here is the 2018 @WebSciDL reunion photo, but we're getting so large we weren't all there at once!— Michael L. Nelson (@phonedude_mln) June 7, 2018
Not pictured:
Incoming fac (Fall 2018): @OpenMaze @fanchyna
Alum: @johnaberlin
See you June 2-6, 2019 @iSchoolUI @JCDLConf pic.twitter.com/GmKpqKg0Zs
The Web Science and Digital Libraries Research Group had a strong year, with the most significant event being our expansion from two professors to four. Beginning in Fall 2018, we added two new assistant professors:
- Dr. Sampath Jayarathna (@OpenMaze) joined us from the CSDL of Texas A&M and most recently from California State Polytechnic University (Cal Poly Pomona).
- Dr. Jian Wu (@fanchyna) joined us from the CiteSeerX group at Penn State University.
Dr. Michele Weigle and I also had an eventful 2018: she was promoted to full professor and I received a joint appointment with the Virginia Modeling, Analysis & Simulation Center (VMASC).
In 2018 we also had three MS students graduate, two students do internships, two students advance to PhD candidacy, one new research grant ($248k) awarded, 11 publications, and 11 trips to conferences, workshops, hackathons, internships, etc.
- Grant Atkins finished his joint BS/MS in 2018, and then spent six months at LANL working with Dr. Martin Klein and Dr. Herbert Van de Sompel on the joint LANL-ODU Scholarly Orphans project. Grant is currently interviewing at a number of places. Update: Grant has accepted a job at MITRE.
- John Berlin finished his MS thesis in 2018 on the topic of high-fidelity web archiving, and then joined the Webrecorder group at Rhizome. We're still hopeful that he will resume his PhD studies with us, based on the invaluable experience he's gaining working with Ilya Kreymer, Dragan Espenschied, Anna Perricci, et al. at Rhizome.
- Maheedhar Gunnam finished his MS in 2018. His MS project was "How I Changed Over Time: A webservice to summarize TimeMaps based on SimHashed HTML content" and he now works at NetApp in Pittsburgh.
- Miranda Smith interned at NASA Langley's Autonomy Incubator over the summer.
- Mat Kelly passed his candidacy exam in July.
- Lulwah Alkwai passed her candidacy exam in April (slides).
Congratulations to @LulwahMA for passing her PhD candidacy exam! Congratulations to her advisor @weiglemc too! @WebSciDL pic.twitter.com/0zGSr4iE9o— Michael L. Nelson (@phonedude_mln) April 26, 2018
Congratulations to @machawk1, @WebSciDL and @oducs‘s newest PhD Candidate! Great job today! pic.twitter.com/NNvXXbiSfx— Michele Weigle (@weiglemc) July 31, 2018
#NIFS intern @mir_smi is kicking off our @NASA_Langley #larcAi exit presentations today. Armed with skills gained in the #ODU @WebSciDL group, Miranda supported @NASAaero #ATTRACTOR data collection for our #HMI research using Mechanical Turk. pic.twitter.com/cbvRFxrUGM— Autonomy Incubator (@AutonomyIncub8r) July 25, 2018
We had 11 publications in 2018. This total does not include the 2018 publications from Drs. Wu and Jayarathna since those were already in the pipeline prior to them joining WS-DL; their contributions will be included in the 2019 summary. Our students' publications this year mainly centered around three conferences, with one "best poster" award and two "best paper" nominations:
- Grant Atkins and Shawn Jones attended iPres 2018 and presented three papers: "The Many Shapes of Archive-It" (Shawn Jones and Alexander Nwala), "The Off-Topic Memento Toolkit" (Shawn Jones), and "Measuring News Similarity Across Ten U.S. News Sites" (Grant Atkins and Shawn Jones). Grant and Shawn's paper was nominated for the best paper award. Action shots: Shawn, Grant.
- Alexander Nwala attended ACM Hypertext 2018 and published "Bootstrapping Web Archive Collections from Social Media" (action shots).
- Of course, JCDL 2018 was our flagship conference again, and we participated in the pre-conference doctoral consortium, the main conference, and two post-conference workshops: Web Archiving and Digital Libraries (WADL) and Knowledge Discovery from Digital Libraries (KDDL). For JCDL 2018 we had:
- Two posters: one from Sawood Alam and Mat Kelly, "Unobtrusive and Extensible Archival Replay Banners Using Custom Elements", and one from Mohamed Aturban, Sawood Alam, and Mat Kelly, "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation". Mohamed's poster won the "best poster" award.
- Two full papers: Alexander Nwala's "Scraping SERPs for Archival Seeds: It Matters When You Start", and Mat Kelly's "A Framework for Aggregating Private and Public Web Archives". Mat's paper was nominated for best paper.
- Corren McCoy's paper at KDDL, "Mining the Web to approximate university rankings," was revised and accepted in the journal Information Discovery and Delivery.
- We had several presentations at WADL, but we were also honored to have Dr. Michele Weigle deliver the keynote, "Enabling Personal Use of Web Archives".
- We also had two publications without student co-authors (a rarity in our field): with Herbert Van de Sompel, I published a chapter entitled "Adding the Dimension of Time to HTTP" in the book "The SAGE Handbook of Web History", edited by Ian Milligan and Niels Brügger. Dr. Weigle published an invited article in SSRC Parameters entitled "On the Importance of Web Archiving". (2020-03-01 edit: "Parameters" is now "Items" and there is a new URI for the article.)
#ipres2018 @WebSciDL member @grantcatkins is now presenting “Measuring News Similarity Across Ten U.S. News Sites” from co-authors @acnwala @phonedude_mln @weiglemc https://t.co/guyxm4Q26B as part of the Web Archiving session led by @AnnaPerricci pic.twitter.com/K2FuMxRyhi— Shawn M. Jones (@shawnmjones) September 25, 2018
Research direction for @WebSciDL : Bring #webarchiving to the browser natively (i.e, not via plugins). @weiglemc provides a compelling list of possibilities at #wadl2018. pic.twitter.com/8yChwkzd33— Justin Littman (@justin_littman) June 6, 2018
#jcdl2018 Using reputation scores and followers on @Twitter to find social media profiles for universities, "University Twitter Engagement: Using Twitter Followers to Rank Universities" by @CorrenMcCoy https://t.co/hILDimq46I pic.twitter.com/0qkOcTuGWJ— Shawn M. Jones (@shawnmjones) June 6, 2018
.@johnaberlin at #wadl2018: Content Security Policy, js loading/tricks block lo-fidelity replay of some archived pages. Approach to handle: client-side URL rewriting (vs typical server-side rewriting). For #webarchiving aficionados: makes https://t.co/dChLa8bnPv replayable! pic.twitter.com/E13sSsC9Sx— Justin Littman (@justin_littman) June 6, 2018
.@acnwala:"we don't have enough curators to capture seeds for all events", @internetarchive @archiveitorg often send out requests for seeds from volunteers. Collection building often begins with a search-can we use search engine result pages to help find seeds as well? #jcdl2018 pic.twitter.com/WgB289Hfll— Shawn M. Jones (@shawnmjones) June 5, 2018
Great slides by @machawk1 as he walks us through personas working with their own personal web archives, supplemented by other personal and public collections. #jcdl2018 pic.twitter.com/qTJMXNdWTu— Ian Milligan (@ianmilligan1) June 5, 2018
@ibnesayeed’s poster “Unobtrusive and Extensible Archival Replay Banners Using Custom Elements” #JCDL2018 pic.twitter.com/aajlQWE4eL— Hany Alsalmi (@HanyAlsalmi) June 4, 2018
@maturban1’s poster “ArchiveNow: Simplified, Extensible, Multi-Archive Preservation.” #JCDL2018 pic.twitter.com/y76YJ5APZR— Hany Alsalmi (@HanyAlsalmi) June 5, 2018
.@phonedude_mln and @mart1nkle1n at #jcdl2018 #MinuteMadness talking about signposting, more info at https://t.co/yIRYHSSHwx pic.twitter.com/yBvE4kCq3H— Shawn M. Jones (@shawnmjones) June 4, 2018
.@maturban1 from @WebSciDL presents "Establishing and Verifying Fixity of Archived Web Pages" @jcdl2018 #jcdl2018 #DoctoralConsortium pic.twitter.com/kRa75Tb2vz— Shawn M. Jones (@shawnmjones) June 3, 2018
.@acnwala from @WebSciDL presents "Bootstrapping Web Archive Collections of Stories from Micro-collections in Social Media" @jcdl2018 #jcdl2018 #DoctoralConsortium pic.twitter.com/Y2ckCioXjC— Shawn M. Jones (@shawnmjones) June 3, 2018
Improving Collection Understanding in Web Archives by @shawnmjones @WebSciDL at #JCDL2018 #JCDLDC2018 #DoctoralConsortium pic.twitter.com/AKzNgNZWpA— Sawood Alam (@ibnesayeed) June 3, 2018
2/2 @WebSciDL having a marathon @jcdl2018 #jcdl2018 practice session @CorrenMcCoy pic.twitter.com/ba0sUhD2wE— Michael L. Nelson (@phonedude_mln) May 24, 2018
Just when @weiglemc and I were talking about recruiting new students for @WebSciDL, in walks @justinfbrunelle, @JuneeWright, Brayden, and Connor! We might have to wait a few years though... pic.twitter.com/P7xRuslxc0— Michael L. Nelson (@phonedude_mln) May 23, 2018
In addition to iPres, Hypertext, and JCDL/JCDL-DC/WADL/KDDL, we attended 12 other events:
- Drs. Jayarathna, Wu, Weigle, and I went to the Library of Congress right before the holidays, where Dr. Weigle presented "WS-DL’s Work towards Enabling Personal Use of Web Archives".
- Dr. Wu presented a poster ("CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset") at the IEEE Big Data 2018 conference in December in Seattle.
- Mat Kelly and I attended the CNI Fall 2018 Membership Meeting in December in Washington DC, where I presented "Blockchain Can Not Be Used To Verify Replayed Archived Web Pages".
- In November, I was scheduled to give a talk at the UNC Blockchain Symposium. Unfortunately, the day ended early due to a water main break and I did not get a chance to give my presentation.
- I gave a seminar at Va Tech, my alma mater (Go Hokies!), in November, entitled "Weaponized Web Archives: Provenance Laundering of Short Order Evidence". Dr. Cal Ribbens, now the department chair, introduced me. I took "CS 3414 Numerical Methods" from him in Spring 1990! So naturally, I took the opportunity to discuss a late homework assignment...
- In November, Nauman Siddique attended the Archives Unleashed Datathon in Vancouver.
- In September, I traveled to LANL to attend Dr. Herbert Van de Sompel's exit presentation, "(Almost) Two Decades at LANL".
- In July/August, Sawood Alam went to San Francisco for the Distributed Web Summit.
- WS-DL Alumna Erika Siregar attended the useR! Conference in Australia (we're still holding out hope that she returns for her PhD!).
- Shawn Jones and Brian Griffin attended the Archives Unleashed Datathon in Toronto in April.
- I presented at the National Forum on Ethics and Archiving the Web (EAW) in NYC in March (Martin Klein was also supposed to be there, but mother nature had other plans).
- Michele Weigle and I attended the NEH Office of Digital Humanities project directors' meeting in March, where Dr. Weigle gave a three-minute lightning talk.
After much delay, we made it. @weiglemc giving the overview presentation to @LC_Labs et al. https://t.co/XGA0s1KUOg @WebSciDL pic.twitter.com/bYp2lLTPW5— Michael L. Nelson (@phonedude_mln) December 18, 2018
Michael and I were at @VT_CS today for his seminar on Weaponizing Web Archives. Ed Fox introduced him and then @phonedude_mln spent the next hour educating us and presenting false provenance to show that @bwilliams of @11thHour is the real OG w/ apologies to @SnoopDogg of course. pic.twitter.com/6dj8PEDsIA— B. Danette Allen (@DrDanetteAllen) November 3, 2018
A selfie for posterity. With @phonedude_mln and @mart1nkle1n after my farewell presentation @LosAlamosNatLab https://t.co/aOU6sYDEzW pic.twitter.com/BnXewWJzNc— Herbert @hvdsomp@octodon.social (@hvdsomp) September 28, 2018
Nice look at useR 2018 from @webscidl's @erikaris #rstats // 2018-09-03: Trip Report for useR! 2018 Conference https://t.co/JwqGSIyRNz pic.twitter.com/nOI6qPiwzv— boB Rudis (@hrbrmstr) September 4, 2018
.@weiglemc discussing @NEH_ODH impact at @WebSciDL, including: https://t.co/V7mWVSogO0 #ODHatTEN pic.twitter.com/zVFFbyflai— Michael L. Nelson (@phonedude_mln) February 9, 2018
Another @WebSciDL luncheon, including our newest member, Scarlett Anne Kelly (daughter of @machawk1)! pic.twitter.com/XSxIQ1uXJZ— Michael L. Nelson (@phonedude_mln) March 8, 2018
We were fortunate enough to host Michael Herzog and several other members of Hochschule Magdeburg-Stendal in March.
We were very happy to learn from @WebSciDL @ibnesayeed @phonedude_mln @OConnorBrewing the last days. Find our report here https://t.co/PpH8O9sNFX pic.twitter.com/eylCFr8Dls— Michael A. Herzog (@maherzog) March 23, 2018
For internal and local outreach, we did several presentations, seminars, and colloquiums within ODU and Norfolk.
- In December, Dr. Jayarathna presented "Introduction to Computer Programming" at the Norfolk City Jail.
- In October, Dr. Wu presented "CiteSeerX: Mining Scholarly Big Data" at a Data Science Colloquium.
- In September, Dr. Jayarathna gave one of the talks, "The Human Eye and the Brain: Peeking at the Future of Neuro-Information Retrieval (Neuro-IR)", at the initial College of Sciences "Science Friday".
- Sawood Alam supported Dr. Yaohang Li's Machine Learning and Data Science Summer Camp in July, with presentations about web archiving, WS-DL research, and tools like Docker.
- In April, I gave a CS departmental colloquium (based on my presentation at EAW the previous month).
Dr. Jian Wu (@fanchyna) presenting "CiteSeerX: Mining Scholarly Big Data" -- data science colloquium. @WebSciDL @oducs pic.twitter.com/HLSh7hIn6D— Michael L. Nelson (@phonedude_mln) October 12, 2018
Sampath Jayarathna (@OpenMaze, of @WebSciDL) giving one of the talks at the initial ODU College of Sciences "Science Friday". pic.twitter.com/XitEjw5TGm— Michael L. Nelson (@phonedude_mln) September 21, 2018
Yesterday, I introduced #WebArchiving and #Docker to 30+ #HighSchool students in the #MachineLearning & #DataScience #SummerCamp sponsored by @NASA #VSGC at @oducs @ODU. A report is coming this weekend on @WebSciDL blog. Photo Credit: Dr. Yaohang Li https://t.co/ezl6yiJePV pic.twitter.com/4DCW62xeBz— Sawood Alam (@ibnesayeed) June 28, 2018
Michael Nelson @phonedude_mln warning that web archives will be weaponized to alter trustworthy content at @oducs CS colloquium. @WebSciDL pic.twitter.com/9ttyQP3ozq— Michele Weigle (@weiglemc) April 6, 2018
This year was exceptionally good for public outreach about web archiving. I was quoted in the Washington Post, The Atlantic, and Vox, culminating in an interview on CNN in April. The entire story is nicely summarized in an ODU press release.
Michael explained web archiving live on CNN this morning. Host Michael Smerconish commented that "I think the professor just made a pretty compelling case..." @WebSciDL @phonedude_mln pic.twitter.com/BjyGO7Bdrq— B. Danette Allen (@DrDanetteAllen) April 28, 2018
We've continued to update existing software and datasets, and to release new ones, via our GitHub account. Given the nature of software and data, it can sometimes be difficult to pin down a specific release date, but this year our significant releases and updates include:
- Mohamed Aturban released a new version of ArchiveNow (a tool for simultaneous submission of web pages to multiple archives) with support for several new archives.
- Shawn Jones had a productive 2018, releasing:
- MementoEmbed, which generates archive-aware social cards.
- Archive-It Utilities, a suite of tools to aid the extraction of metadata from Archive-It collections.
- the Off-Topic Memento Toolkit, a complete re-implementation of the tools originally developed by Yasmin AlNoamany (PhD, 2016).
- John Berlin had an especially productive 2018; the full listing of code and data can be found in his MS thesis review blog post, but some of the highlights include:
- Wayback++: A Chrome and Firefox browser extension that brings client-side rewriting to the Internet Archive's Wayback Machine
- WAIL Electron: An updated, Electron-based version of Mat Kelly's original Python-based WAIL (Web Archiving Integration Layer)
- Squidwarc: A high fidelity archival crawler that uses Chrome or Chrome Headless
- node-warc: Parse Web Archive (WARC) files or create WARC files using Electron or the chrome-remote-interface.
- Miranda Smith released FollowerCountHistory, a tool that tracks the growth of Twitter followers using web archives for historical data.
- Alexander Nwala and Sawood Alam "dockerized" Stanford's CoreNLP.
- Alexander Nwala reviewed a series of URL diversity indexes and introduced the WS-DL Diversity Index, which can have profiles for the entire URL, hostname, or domain. He also provided software that implements the measure.
- Sawood Alam released Reconstructive, a client-side ServiceWorker module for archival replay.
- Grant Atkins investigated the problem of paywalls in web archives and built a classifier using TensorFlow and Puppeteer for identifying archived paywalls.
- Alexander Nwala released the data set for his JCDL 2018 paper.
- Hussam Hallak performed a preliminary analysis that showed up to two-thirds of web traffic is private and will never show up in public web archives.
- Shawn Jones did an analysis of how well guidelines.gov and qualitymeasures.ahrq.gov were archived before they were shut down in July.
- Shawn Jones, in researching what would eventually become the MementoEmbed service, reviewed a number of social card and web page surrogate approaches.
- Plinio Vargas and Sawood Alam showed that cookies are why archived Twitter pages often have non-English language templates (i.e., for "Submit", "Followers", "Following"). This is the one time that JavaScript is not the problem!
- I posted about the evidence in web archives in the case of Joy Reid's blog (which led to the CNN interview listed above).
- Nauman Siddique reviewed the feasibility of using web archives to extract deleted tweets, working from the well-known issue of Breitbart's deleted Super Bowl tweet.
- Alexander Nwala provided a guest blog post for the NIH entitled "Why We Need Humans to Curate Web Collections", which reviews how to augment and amplify human effort in seed selection for web archives.
- Alexander Nwala created the @StoryGraphBot account, for a collection of analyses about linkage and event detection in news outlets.
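To give a flavor of the URL diversity idea mentioned in the list above, here is a minimal sketch. It is not Alexander's released software: the formula (distinct keys minus one, divided by total URLs minus one) and the names `diversity`, `PROFILES`, `_hostname`, and `_domain` are assumptions made for illustration, and the domain profile uses a crude "last two labels" guess rather than the Public Suffix List.

```python
from urllib.parse import urlsplit


def _hostname(url: str) -> str:
    """Return the lowercased hostname of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def _domain(url: str) -> str:
    """Crude registered-domain guess: the last two hostname labels.

    Assumption for illustration only; a real implementation would consult
    the Public Suffix List (e.g., to handle 'example.co.uk' correctly).
    """
    labels = _hostname(url).split(".")
    return ".".join(labels[-2:])


# Hypothetical profile table: each profile maps a URL to the key
# whose distinct values are counted.
PROFILES = {
    "url": lambda u: u.strip(),
    "hostname": _hostname,
    "domain": _domain,
}


def diversity(urls, profile="url"):
    """Normalized unique-count diversity in [0, 1] (assumed formula).

    0 means every URL maps to the same key; 1 means every key is distinct.
    """
    keys = [PROFILES[profile](u) for u in urls]
    if len(keys) < 2:
        return 0.0
    return (len(set(keys)) - 1) / (len(keys) - 1)


seeds = [
    "https://www.cnn.com/a",
    "https://www.cnn.com/b",
    "https://edition.cnn.com/c",
    "https://www.vox.com/d",
]
for name in PROFILES:
    print(name, round(diversity(seeds, name), 3))
```

The same four seeds look fully diverse as URLs but much less so by hostname or domain, which is why it helps to report the measure per profile.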
WS-DL annual reviews are also available for 2017, 2016, 2015, 2014, and 2013. Finally, we'd like to thank all those who have complimented our blog, students, and the WS-DL research group in general. We really appreciate the feedback, some of which we include below.
--Michael
The @WebSciDL team members have ace conference trip reports and publish them on their blog, which is seriously helpful for those of us who aren't able to attend "all the things" https://t.co/E3ap38M0oI— boB Rudis (@hrbrmstr) June 12, 2018
Annnnd, I didn't realize @shawnmjones (also of Old Dominion team) had analyzed this exact problem and blogged. Post shows the nitty-gritty of his workflow. If you haven't encountered mementos before, read this from @hvdsomp first https://t.co/YEVfexHb0c https://t.co/XmVPew7bJh— Eileen Clancy (@clancynewyork) July 16, 2018
So many great tools by the @WebSciDL crew!! Only if I had them when I was working on the history of the University of Bologna website! #WADL2018 #jcdl2018— federico nanni (@f_nanni) June 6, 2018
A @WebSciDL contributor — @acnwala — came up w/a rly clever way to measure the diversity of a corpus/collection of URLs : https://t.co/5UccLtZolW : that has tons of potential applications, esp in cybersecurity. Here's a fledgling #rstats package for it: https://t.co/0xF8wja3sR— boB Rudis (@hrbrmstr) May 4, 2018
Indeed! We are really thrilled to announce that @johnaberlin will be joining the Webrecorder project as a Senior Backend Developer and bringing his #webarchiving expertise from @WebSciDL to the team here at @rhizome! https://t.co/80eqpayZtV— Webrecorder (@webrecorder_io) May 3, 2018
Appreciate another great @WebSciDL trip report & enjoyed climbing the Scala learning curve w/ you, @brian3354. https://t.co/fcopud4ZvR— Justin Littman (@justin_littman) May 16, 2018
Brilliant Shawn. Thank you so much for your sage effort... You preserved knowledge that we cannot afford to lose!!— Susan Knowles (@luvnow4all) July 16, 2018
Doubled the # of news outlet Twitter account we're collecting w/ @SocialFeedMgr to over 10K thanks to @localmem project (https://t.co/U43nUn8WT6). Thanks @acnwala @WebSciDL!— Justin Littman (@justin_littman) July 23, 2018
Excellent overview of web archiving, including value to scholars, tools/techniques, research challenges. Also, if you're not familiar w/ the outstanding work being done by @WebSciDL, this provides highlights. https://t.co/JKjdUigzyQ— Justin Littman (@justin_littman) September 24, 2018