Posts

2014-09-02: WARCMerge: Merging Multiple WARC files into a single WARC file

WARCMerge is a new tool for organizing WARC files. The name describes what it does: merging multiple WARC files into a single one. In web archiving, WARC files can be generated by well-known web crawlers such as Heritrix and the Wget command, or by state-of-the-art tools like WARCreate/WAIL and Webrecorder.io, which were developed to support personal web archiving. WARC files contain records not only for HTTP responses and metadata elements but also for all original HTTP requests. With those WARC files, any replay tool (e.g., Wayback Machine) can be used to reconstruct and display the original web pages. I would emphasize here that a single WARC file may consist of records related to different web sites. In other words, multiple web sites can be archived in the same WARC file. This Python program runs in three different modes. In the first mode, the program sequentially reads records one by one from different WARC files and combines them into a new file in which an extr
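As a rough illustration of the merging idea (not WARCMerge's actual implementation), here is a minimal Python sketch. It relies on the fact that a WARC file is a sequence of self-delimiting records, so uncompressed WARC files can be combined by appending them in order. The function name `merge_warc_files` and the file names in the usage note are hypothetical, and the sketch does not validate records or add any records of its own.

```python
import shutil
from pathlib import Path


def merge_warc_files(sources, destination):
    """Append the bytes of each source WARC file, in order, to `destination`.

    Assumption: every source is an uncompressed, well-formed WARC file.
    No records are parsed, validated, or added here.
    """
    destination = Path(destination)
    with destination.open("wb") as out:
        for source in sources:
            with Path(source).open("rb") as src:
                # Stream the file in chunks so large archives
                # are never loaded fully into memory.
                shutil.copyfileobj(src, out)
    return destination
```

A call such as `merge_warc_files(["a.warc", "b.warc"], "merged.warc")` would write the records of `a.warc` followed by those of `b.warc`.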

2014-08-28: InfoVis 2013 Class Projects

(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.) I've been teaching the graduate Information Visualization course since Fall 2011. In this series of posts, I'm highlighting a few of the projects from each course offering. (Previous posts: Fall 2011, Fall 2012) In Spring 2013, I taught an Applied Visual Analytics course that asked students to create visualizations based on the "Life in Hampton Roads" annual survey performed by the Social Science Research Center at ODU. In Fall 2013, I taught the traditional InfoVis course that allowed students to choose their own topics. (All class projects are listed in my InfoVis Gallery.) Life in Hampton Roads. Created by Ben Pitts, Adarsh Sriram Sathyanarayana, Rahul Ganta. This project (currently available at https://ws-dl.cs.odu.edu/vis/LIHR/) provides a visualization of the "Life in Hampton Roads" survey for 20

2014-08-26: Memento 101 -- An Overview of Memento in 101 Slides

In preparation for the upcoming "Tools and Techniques for Revisiting Online Scholarly Content" tutorial at JCDL 2014, Herbert and I have revamped the canonical slide deck for Memento, and have called it "Memento 101" for the 101 slides it contains. The previous slide deck was from May 2011 and was no longer current with the RFC (December 2013). The slides cover basic and intermediate Memento concepts, with pointers for some of the more detailed and esoteric bits (like patterns 2, 3, and 4, as well as the special cases) of interest to only the most hard-core archive wonks. The JCDL 2014 tutorial will choose a subset of these slides, combined with updates from the Hiberlink project and various demos. If you find yourself in need of explaining Memento, please feel free to use these slides in part or in whole (PPT is available for download from slideshare). --Michael & Herbert

2014-08-22: One WS-DL Class Offered for Fall 2014

This fall, only one WS-DL class will be offered: CS 495/595 Introduction to Web Science, Thursdays, 4:20-7:00, r. 2120. This class approaches the Web as a phenomenon to be studied in its own right. In this class we will explore a variety of tools (e.g., Python, R, D3) as well as applications (e.g., social networks, recommendations, clustering, classification) that are commonly used in analyzing the socio-technical structures of the Web. The class will be similar to the fall 2013 offering. Right now we're planning on offering in spring 2015: CS 418 Web Programming, CS 725/825 Information Visualization, and CS 751/851 Introduction to Digital Libraries. --Michael

2014-07-25: Digital Preservation 2014 Trip Report

Mat Kelly and Dr. Michael L. Nelson travel to Washington, DC to report on their current research and to learn about others' work in the field. On July 22 and 23, 2014, Dr. Michael Nelson (@phonedude_mln) and I (@machawk1) attended Digital Preservation 2014 in Washington, DC. This was my fourth consecutive NDIIPP (@ndiipp) / NDSA (@ndsa2) meeting (see trip reports from Digital Preservation 2011, 2012, 2013). With the largest attendance yet (300+) and compressed into two days, the schedule was jam-packed with interesting talks. Per usual, videos for most of the presentations are included inline below. Day One: Micah Altman (@drmaltman) led the presentations with information about the NDSA and asked, regarding Amazon claiming reliability of 99.99999999999% for uptime, "What do the eleven nines mean?". "There are a number of risk that we know about [as archivists] that Amazon doesn't", he sa

2014-07-22: "Archive What I See Now" Project Funded by NEH Office of Digital Humanities

We are grateful for the continued support of the National Endowment for the Humanities and their Office of Digital Humanities for our "Archive What I See Now" project. In 2013, we received support for 1 year through a Digital Humanities Start-Up Grant. This week, along with our collaborator Dr. Liza Potts from Michigan State, we were awarded a 3-year Digital Humanities Implementation Grant. We are excited to be one of the seven projects selected this year. Our project goals are two-fold: (1) to enable users to generate files suitable for use by large-scale archives (i.e., WARC files) with tools as simple as the "bookmarking" or "save page as" approaches that they already know; and (2) to enable users to access the archived resources in their browser through one of the available add-ons or through a local version of the Wayback Machine (wayback). Our innovation is in allowing individuals to "archive what I see now". The user can create a st

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc. You can also describe it as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages. Most of the time JavaScript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident. First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received. This is occasionally useful (like loading a file from local disk), but more often than not seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME ty
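For readers who want to check a page for this tag, the following is a minimal Python sketch using only the standard library's `html.parser`. The names `MetaRefreshFinder`, `parse_refresh`, and `find_meta_refresh` are hypothetical, and the parsing of the `content` value assumes the common `delay; url=target` form. Browsers accept more variations than this handles.

```python
from html.parser import HTMLParser


def parse_refresh(content):
    """Split a refresh `content` value into (delay_seconds, target_url).

    Assumes the common "delay; url=target" form; either part may be missing.
    """
    delay_text, _, rest = content.partition(";")
    try:
        delay = float(delay_text.strip())
    except ValueError:
        delay = None
    target = None
    rest = rest.strip()
    if rest.lower().startswith("url="):
        # Drop the "url=" prefix and any surrounding quotes.
        target = rest[4:].strip().strip("'\"")
    return delay, target


class MetaRefreshFinder(HTMLParser):
    """Collect (delay, target) pairs from <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values may be None when written without a value.
        if (attributes.get("http-equiv") or "").lower() == "refresh":
            self.refreshes.append(parse_refresh(attributes.get("content") or ""))


def find_meta_refresh(html):
    """Return a list of (delay, target) tuples found in the HTML text."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.refreshes
```

Calling `find_meta_refresh` on a page's HTML returns one tuple per refresh tag, which makes it easy to flag archived pages that may jump elsewhere after loading.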