Posts

2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive in San Francisco, CA. The workshop was two days long, February 23-24, 2017, and took place in conjunction with a National Web Symposium, hosted at the Internet Archive, February 23-24. Four members of the Web Science and Digital Library group (WSDL) from Old Dominion University had the opportunity to attend: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and Web Archive Hackathon 2.0. Sawood Alam (@ibnesayeed) tweeted on February 25, 2017: "@WebSciDL at @internetarchive after Archives Unleashed 3.0 wrap up. We have a winner of #HackArchives pic.twitter.com/vYLi89yap0" This workshop was supported by the Internet Archive, Rutgers University, and the University of Waterloo. The workshop brought together a small group of around 20 researchers who worked together to develop new open source tools t

2017-03-02: National Symposium on Web Archiving Interoperability Trip Report

The National Symposium on Web Archiving Interoperability was held February 21-22, 2017 at the Internet Archive in San Francisco, CA. The symposium was held as part of the IMLS-funded "WASAPI" project, which is researching "web archiving systems APIs". The participants are Internet Archive's Archive-It, Stanford University Libraries (DLSS and LOCKSS), University of North Texas, and Rutgers University. There were nearly 50 attendees from a variety of international institutions. Jefferson Bailey and Nicholas Taylor began the day with a review of the WASAPI project: "Building API-Based Web Archiving Systems and Services". They also led a discussion about soliciting usage scenarios and feedback from potential users (see the results from their 2016 survey). You can track the WASAPI developments at their GitHub repo, where they have the WASAPI Data Transfer API General Specification (for the transfer of WARC files, WAT files, etc.)

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI

A small part of my research is to ensure that certain web pages are preserved in public web archives so that they remain available and retrievable whenever they are needed in the future. As archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as The Internet Archive, WebCite, Perma.cc, and Archive.is. If, for any reason, one archive stops serving temporarily or permanently, it is likely that copies can be fetched from other archives. With Archive Now, one command like:

    $ archivenow --all www.cnn.com

is sufficient for the current CNN homepage to be captured and preserved by all archives configured in this Python library. Archive Now allows you to accomplish the following major tasks: a web page can be pushed into one archive; a web page can be pushed into multiple archives; a web page can be pushed into all archi
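The single-command usage quoted in this excerpt can also be driven from a script. The following is a minimal sketch, assuming the archivenow command-line tool is installed and on the PATH; the helper names `build_archivenow_command` and `push_to_all_archives` are hypothetical and are not part of Archive Now itself. Only the `--all` flag shown in the excerpt is used.

```python
import shutil
import subprocess


def build_archivenow_command(url):
    """Return the argument list equivalent to `archivenow --all <url>`."""
    if not url or url.isspace():
        raise ValueError("a non-empty URL is required")
    return ["archivenow", "--all", url.strip()]


def push_to_all_archives(url):
    """Run the command and return whatever it prints to standard output.

    Assumption: the archivenow executable is installed and on PATH.
    This call needs network access, since it asks live archives to capture the page.
    """
    if shutil.which("archivenow") is None:
        raise FileNotFoundError("archivenow is not installed or not on PATH")
    result = subprocess.run(
        build_archivenow_command(url),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `push_to_all_archives("www.cnn.com")` would run the same command as the excerpt. Building the argument list separately keeps the command easy to inspect without contacting any archive.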

2017-02-13: Electric WAILs and Ham

Mat Kelly recently posted "Lipstick or Ham: Next Steps For WAIL", in which he spoke about the past, present, and potential future of WAIL. Web Archiving Integration Layer (WAIL) is a tool that seeks to address the disparity between institutional and individual archiving tools by providing one-click configuration and utilization of both Heritrix and Wayback from a user's personal computer. I am here to speak on the realization of WAIL's future by introducing WAIL-Electron.

WAIL-Electron

WAIL has been completely revised from a Python application into an Electron application built with modern Web technologies. Electron combines a Chromium (Chrome) browser with Node.js, allowing native desktop applications to be created using only HTML, CSS, and JavaScript. The move to Electron has brought with it many improvements, the most important of which is the ability to update and package WAIL for the three major operating systems: Linux, MacOS, and Windows. Support for these

2017-01-23: Finding URLs on Twitter - A simple recommendation

A prompt from Twitter indicating no search results

As part of a research experiment, I needed to find URLs embedded in tweets using Twitter's web search service. Most of the URLs were much older than 7 days, so the Twitter search API was not an option: it searches only a sample of tweets published in the past 7 days. I therefore used the web search service. I began the experiment by pasting URLs from tweets into the search box on twitter.com:

Searching Twitter for a URL by pasting the URL into the search box

I noticed I was able to find some URLs embedded in tweets, but this was not always the case. Based on my observations, finding the URLs was not correlated with the age of the tweet. I discussed this observation with Ed Summers and he recommended adding a "url:" prefix to the URL before searching. For example, if the search URL is "http://www.cnn.com", he recommended searching for " url: http://
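The "url:" recommendation can be sketched in a few lines of Python. The excerpt is cut off before showing the full query, so the exact form used here (the prefix joined directly to the URL, with no space) is an assumption, as is the `https://twitter.com/search?q=` address used to build a clickable search link. The function names are hypothetical.

```python
from urllib.parse import quote


def twitter_url_query(target_url):
    """Prefix the URL with "url:" (assumed form: no space after the colon)."""
    return "url:" + target_url.strip()


def twitter_search_link(target_url):
    """Build a web search link for the prefixed query.

    Assumption: the web search page accepts the query in a `q` parameter.
    The query is percent-encoded so that ":" and "/" survive inside the link.
    """
    return "https://twitter.com/search?q=" + quote(twitter_url_query(target_url), safe="")
```

For the example URL from the post, `twitter_url_query("http://www.cnn.com")` gives `url:http://www.cnn.com`, which is the text that would be typed into the search box.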

2017-01-20: CNN.com has been unarchivable since November 1st, 2016

CNN.com has been unarchivable since 2016-11-01T15:01:31, at least by the common web archiving systems employed by the Internet Archive, archive.is, and webcitation.org. The last known correctly archived page in the Internet Archive's Wayback Machine is from 2016-11-01T13:15:40, with all versions since then producing some kind of error (including today's, 2017-01-20T09:16:50). This means that the most popular web archives have no record of the period from immediately before the presidential election through at least today's presidential inauguration. Given the political controversy surrounding the election, one might conclude this is part of some grand conspiracy equivalent to those found in the TV series The X-Files. But rest assured, this is not the case; the page was archived as is, and the reasons behind the archival failure are not as fantastical as those found in the show. As we will explain below, other archival systems have successfully archived CNN.com
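To make the three timestamps in this excerpt concrete, the sketch below converts them into Wayback Machine capture addresses and computes the gaps between them. It assumes the usual `https://web.archive.org/web/<14-digit timestamp>/<URL>` address form and uses `http://www.cnn.com/` as the page address, which is an assumption; the helper names are hypothetical.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"


def parse_capture_time(iso_text):
    """Parse a timestamp written as YYYY-MM-DDTHH:MM:SS."""
    return datetime.strptime(iso_text, "%Y-%m-%dT%H:%M:%S")


def wayback_url(iso_text, page_url):
    """Build a Wayback Machine address for the capture nearest the given time."""
    stamp = parse_capture_time(iso_text).strftime("%Y%m%d%H%M%S")
    return WAYBACK_PREFIX + stamp + "/" + page_url


# Timestamps quoted in the post.
last_good = parse_capture_time("2016-11-01T13:15:40")
first_bad = parse_capture_time("2016-11-01T15:01:31")
written = parse_capture_time("2017-01-20T09:16:50")

# Time between the last correct capture and the first failing one.
window = first_bad - last_good

# Length of the outage as of the day the post was written.
outage = written - first_bad
```

Here `window` works out to 1 hour, 45 minutes, and 51 seconds, and `outage` to 79 days, 18 hours, 15 minutes, and 19 seconds.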