Posts

2015-09-08: Releasing an Open Source Python Project, the Services That Brought py-memento-client to Life

The LANL Library Prototyping Team recently received correspondence from a member of the Wikipedia team requesting Python code that could find the best URI-M for an archived web page based on the date of the page revision. Collaborating with Wikipedia, Harihar Shankar, Herbert Van de Sompel, Michael Nelson, and I created the py-memento-client Python library to suit the needs of pywikibot. Over the course of library development, Wikipedia suggested two services that we had not used before, Travis CI and PyPI. We were very pleased with the results of those services and learned quite a bit from the experience. We have been using GitHub for years, and also include it here as part of the development toolchain for this Python project. We present three online services that solved the following problems for our Python library: Where do we store source code and documentation for the long term? - GitHub How do we ensure the project is well tested in an

2015-09-01: From Student To Researcher II

After successfully defending my Master's Thesis, I accepted a position as a Graduate Research Assistant at Los Alamos National Laboratory (LANL) Library's Digital Library Research and Prototyping Team. I now work directly for Herbert Van de Sompel, in collaboration with my advisor, Michael Nelson. I had worked for years as a software engineer before re-entering academia in 2010 to finish my Master's Degree. I originally just wanted to be able to apply for jobs that required a Master's Degree in Computer Science, but while working on my thesis I discovered that I had more of a passion for research than I had expected, so I became a PhD student in Computer Science at Old Dominion University. During my Master's Degree, I had taken coursework that counts toward my PhD, so I am free to accept this extended internship while I complete my PhD dissertation. LANL is a fascinating place to work. I

2015-08-28: Original Header Replay Considered Coherent

Introduction
As web archives have advanced over time, their ability to capture and play back web content has grown. The Memento Protocol, specified in RFC 7089, is an HTTP extension that bridges the present and past web by allowing time-based content negotiation. Now that Memento is operational at many web archives, analysis of archive content is simplified. Over the past several years, I have conducted analysis of web archive temporal coherence. Some of the results of this analysis will be published at Hypertext'15. This blog post discusses one implication of the research: the benefits achieved when web archives play back original headers.
Archive Headers and Original Headers
Consider the headers (Figure 1) returned for a logo from the ODU Computer Science Home Page as archived on Wed, 29 Apr 2015 15:15:23 GMT.
HTTP/1.1 200 OK
Content-Type: image/gif
Last-Modified: Wed, 29 Apr 2015 15:15:23 GMT
Figure 1. No Original Header Playback
Try to answer the
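To make the time-based content negotiation mentioned above concrete: RFC 7089 has a client state the desired datetime in an Accept-Datetime request header, formatted as an HTTP date. The following is only a minimal sketch of building that header with the Python standard library; the helper name accept_datetime_header is hypothetical and is not taken from the post or from any Memento library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build an Accept-Datetime request header (RFC 7089) for the given datetime.

    Hypothetical helper for illustration; naive datetimes are assumed to be UTC.
    """
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # format_datetime(usegmt=True) requires a datetime whose tzinfo is UTC
    when = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}


# The capture datetime quoted in the excerpt above
headers = accept_datetime_header(
    datetime(2015, 4, 29, 15, 15, 23, tzinfo=timezone.utc)
)
print(headers)  # {'Accept-Datetime': 'Wed, 29 Apr 2015 15:15:23 GMT'}
```

A dictionary like this could be passed as the request headers of any HTTP client when contacting a Memento TimeGate; the sketch makes no network call itself.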

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting

Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit the Internet Archive in San Francisco to share our latest work with a group of web archiving pioneers from around the world. The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi, and Helge Holzmann from L3S. First, we briefly introduced ourselves to one another, describing the purpose and nature of our work to IA. Then, Nejdl introduced the Alexandria project and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that will allow users to visualize and collaboratively interact with Archive-It collections by adding new resources in the form of tags and comments. Furthermore, it contains a collaborative search and sharing platform. I presented the off-topic detection work with a live demo for the

2015-08-18: Three WS-DL Classes Offered for Fall 2015

The Web Science and Digital Libraries Group is offering three classes this fall. Unfortunately there are no undergraduate offerings this semester, but there are three graduate classes covering the full WS-DL spectrum:
CS 695 - NoSQL Databases (CRN 21159) will be offered by Dr. Cartledge. While we've used NoSQL databases in a variety of classes in the past, this is the first time we've offered a class entirely on this topic. This is a good complement to the CS 495/595 Big Data class he offered last spring.
CS 734/834 - Introduction to Information Retrieval (CRNs 19986 & 20004) will be offered by Dr. Nelson. Although the number and name have slightly changed, this will be similar to previous offerings of this class (e.g., see CS 895 spring 2014). This class will broadly cover the foundations of information retrieval.
CS 791/891 - Visualization Seminar (CRNs 12619 & 12620) will be taught by Dr. Weigle. This P/F course will cover the fund

2015-07-27: Upcoming Colloquium, Visit from Herbert Van de Sompel

On Wednesday, August 5, 2015, Herbert Van de Sompel (Los Alamos National Laboratory) will give a colloquium in the ODU Computer Science Department entitled "A Perspective on Archiving the Scholarly Web". It will be held in the third floor E&CS conference room (r. 3316) at 11am. Space is somewhat limited (the first floor auditorium is being renovated), but all are welcome to attend. The abstract for his talk is:
A Perspective on Archiving the Scholarly Web
As the scholarly communication system evolves to become natively web-based and starts supporting the communication of a wide variety of objects, the manner in which its essential functions -- registration, certification, awareness, archiving -- are fulfilled co-evolves. Illustrations of the changing implementation of these functions will be used to arrive at a high-level characterization of a future scholarly communication system and of the objects that will be communicated. The focus will then shift to the fu

2015-07-24: ICSU World Data System Webinar #6: Web-Centric Solutions for Web-Based Scholarship

Earlier this week Herbert Van de Sompel gave a webinar for the ICSU World Data System entitled "Web-Centric Solutions for Web-Based Scholarship". It's a short and simple review of some of the interoperability projects we've worked on since 1999, including OAI-PMH, OAI-ORE, and Memento. He ends with a short nod to his simple but powerful "Signposting the Scholarly Web" proposal, but the slides in the appendix give the full description. The main point of this presentation was to document how each project successively further embraced the web, not just as a transport protocol but fully adopting the semantics as part of the protocol. Herbert and I then had a fun email discussion about how the web, scholarly communication, and digital libraries were different in 1999 (the time of OAI-PMH & our initial collaboration) and now. Some highlights include: Although Google existed, it was not the hegemonic force that it is today, and co

2015-07-22: I Can Haz Memento

Inspired by the "#icanhazpdf" movement and built upon the Memento service, I Can Haz Memento attempts to expand the awareness of Web Archiving through Twitter. Given a URL (for a page) in a tweet with the hashtag "#icanhazmemento," the I Can Haz Memento service replies to the tweet with a link pointing to an archived version of the page closest to the time of the tweet. As a result, the archived version closest to the time of the tweet likely reflects what the user intended to share at the time. Consider a scenario where Jane shares a link in a tweet to the front page of CNN about a story on healthcare. Given the fluid nature of the news cycle, at some point the story about healthcare would be replaced by a fresh story; thus the intent behind Jane's tweet (the healthcare story) is misrepresented by her original link, which now leads to the new story. This is where I Can Haz Memento comes i
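The selection step described above, picking the archived version closest in time to the tweet, can be sketched as follows. This is only an illustration under stated assumptions, not the service's actual implementation: the function closest_memento, the capture list, and the example.org URIs are all hypothetical, and a real service would obtain the captures from a Memento TimeMap or TimeGate.

```python
from datetime import datetime, timezone


def closest_memento(mementos, target):
    """Return the (uri, capture_datetime) pair nearest in time to target.

    mementos: non-empty iterable of (uri, timezone-aware datetime) pairs.
    """
    mementos = list(mementos)
    if not mementos:
        raise ValueError("no mementos to choose from")
    # Smallest absolute time difference wins, whether before or after target
    return min(mementos, key=lambda m: abs(m[1] - target))


# Hypothetical captures of one page (placeholder URIs)
captures = [
    ("https://archive.example.org/20150701000000/http://www.example.com/",
     datetime(2015, 7, 1, 0, 0, tzinfo=timezone.utc)),
    ("https://archive.example.org/20150720120000/http://www.example.com/",
     datetime(2015, 7, 20, 12, 0, tzinfo=timezone.utc)),
    ("https://archive.example.org/20150725000000/http://www.example.com/",
     datetime(2015, 7, 25, 0, 0, tzinfo=timezone.utc)),
]

# Hypothetical time at which the tweet was posted
tweet_time = datetime(2015, 7, 21, 9, 30, tzinfo=timezone.utc)
best_uri, best_dt = closest_memento(captures, tweet_time)
print(best_uri)  # the 2015-07-20 12:00 capture is nearest to the tweet
```

Using the absolute difference means a capture made shortly after the tweet can be preferred over a much older one made before it, which matches the "closest to the time of the tweet" wording.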

2015-07-07: WADL 2015 Trip Report

The Workshop on Web Archiving and Digital Libraries (WADL) 2015 was scheduled for the last day of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2015, and it started on time. When I entered the workshop room, I realized we needed a couple more chairs to accommodate all the participants, which was a good problem to have. The session started with a brief informal introduction of individual participants, and then moved straight into the lightning talks. Gerhard Gossen opened the lightning talk session with his presentation on "The iCrawl System for Focused and Integrated Web Archive Crawling". It was a short description of how iCrawl can be used to create archives for current events, targeted primarily to researchers and journalists. The demonstration illustrated how to search on the Web and Twitter for trending topics to find good seed URLs, manually add seed URLs and keywords, extract entities, configure crawling basic po