Posts

2016-08-25: Documenting the Now Advisory Board Meeting Trip Report

On August 21-23, 2016, I attended the Advisory Board Meeting for the Documenting the Now (DocNow) project at Washington University in St. Louis. The DocNow project, funded by the Andrew Mellon Foundation, "aims to collect, archive, and provide access to social media feeds chronicling historically significant events, particularly concerning social justice." In practice, this means providing a friendly interface for interacting with trending events on Twitter (e.g., #BlackLivesMatter and affiliated hashtags). This is significant because tools like twarc (created by Ed Summers, the technical lead for DocNow), a widely used Twitter archiving command line tool, are beyond the reach of non-expert users. DocNow has a strong project team and a diverse advisory board, of which I am honored to be a member. The team has been pretty active on GitHub, Slack, Twitter, etc., but those are no substitute for an extended face-to-face meeting. The day began on th

2016-08-25: Two WS-DL Classes Offered for Fall 2016

Two Web Science & Digital Library (WS-DL) courses will be offered in Fall 2016: CS 418/518 "Web Programming", Tuesdays 4:20-7:00 pm (CRNs 16680 & 16681), will be offered by Dr. Justin F. Brunelle. This will be an updated version of CS 418 last taught in Spring 2015 by Mat Kelly. This class is currently full, but watch for openings. CS 734/834 "Introduction to Information Retrieval", Thursdays 4:20-7:00 pm (CRNs 15028 & 15038), will be offered by Dr. Michael L. Nelson. This will be an updated version of the same course most recently taught in Fall 2015. Note that Dr. Michele Weigle is not teaching this semester. Obviously there is demand for CS 418/518, but if you're considering CS 734/834 you might be interested in this student's quote from a recent exit exam: [and] Dr. Nelson’s Information Retrieval course are the two which I feel have prepared me most for job interviews and work in the working world of computer science.

2016-08-15: Mementos In the Raw, Take Two

In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives. Ilya Kremer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued, because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC 7240. Prefer can solve this problem in an intuitive way, much like their original suggestion of content negotiation. To recap, most web archives augment mementos when presenting them to the user, often for usability or legal purposes. The figures below show examples of these augmentations. Figure 1: The PRONI web archive augments mementos for user experience; augmentations outlined in red Fi
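
To make the Prefer idea above concrete, here is a minimal sketch of a client attaching a Prefer header (RFC 7240) to a request for a memento. The preference token `raw`, the helper name `build_raw_memento_request`, and the archive URL are all assumptions made for illustration; the excerpt does not say which preference values an archive would honor. The request is only constructed, not sent.

```python
from urllib.request import Request


def build_raw_memento_request(urim: str, preference: str = "raw") -> Request:
    """Build (but do not send) a request asking for unaltered content.

    The default token "raw" is a hypothetical preference value; a real
    archive would document which tokens it understands.
    """
    request = Request(urim)
    # Prefer is a standard request header defined by RFC 7240.
    request.add_header("Prefer", preference)
    return request


# Hypothetical memento URL, used only for illustration.
req = build_raw_memento_request(
    "https://archive.example.org/web/20160815000000/http://example.com/"
)
print(req.get_header("Prefer"))
```

Under RFC 7240 a server that honors a preference can say so with the Preference-Applied response header, so a client could check that header to see whether it received the unaltered version.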

2016-07-24: Improve research code with static type checking

The Pain of Late Bug Detection [The web] is big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is... [1] When it comes to quick implementation, Python is an efficient language used by many web archiving projects. Indeed, a quick search of GitHub for WARC and Python yields a list of 80 projects and forks. Python is also the language used for my research into the temporal coherence of existing web archive holdings. The sheer size of the Web means lots of variation and lots of low-frequency edge cases. These variations and edge cases are naturally reflected in web archive holdings. Code used to research the Web and web archives naturally contains many, many code branches. Python struggles under these conditions. It struggles because minor changes can easily introduce bugs that go undetected until much later. And later for Python means at run time. Indeed the sheer number of edge cases introduces code branches that are exercised so infrequent
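
As a small illustration of the point above, the sketch below annotates a rarely exercised edge case (a missing or malformed header) with type hints. The function names and the Content-Length scenario are hypothetical, and a static checker such as mypy is assumed; the excerpt does not name the tool the post uses.

```python
from typing import Optional


def parse_content_length(header_value: Optional[str]) -> Optional[int]:
    """Return the header as an int, or None when it is absent or malformed."""
    if header_value is None:
        return None
    value = header_value.strip()
    if not value.isdecimal():
        return None
    return int(value)


def body_size_kb(header_value: Optional[str]) -> float:
    """Convert a Content-Length header value to kilobytes (0.0 if unknown)."""
    length = parse_content_length(header_value)
    # Without this guard, `length / 1024` would only fail at run time, and
    # only on the rare record with a missing header. A static checker can
    # flag the unguarded division before the code ever runs.
    if length is None:
        return 0.0
    return length / 1024


print(body_size_kb("2048"))
print(body_size_kb(None))
```

The annotations cost little to write, and the `Optional[int]` return type is what lets a checker notice a branch that forgets the `None` case.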

2016-07-21: Dockerizing ArchiveSpark - A Tale of Pair Hacking

"Some doctors prescribe application of sandalwood paste to remedy headache, but making the paste and applying it is no less of a headache." -- an Urdu proverb This is the translation of a couplet from an Urdu poem which is often used as a proverb. This couplet nicely reflects my feeling when Vinay Goel from the Internet Archive was demonstrating how suitable ArchiveSpark was for our IMLS Museums data analysis during the Archives Unleashed 2.0 Datathon at the Library of Congress, Washington DC on June 14, 2016. ArchiveSpark allows easy data extraction, derivation, and analysis from standard web archive files (such as CDX and WARC). In the back of my mind I was thinking that it seems nice, cool, and awesome to use ArchiveSpark (or Warcbase) for the task, and certainly a good idea for serious archive data analysis, but perhaps overkill for a two-day hackathon event. Installing and configuring these tools would have required us to set up a Hadoop cluster, Jupyter

2016-07-18: Tweet Visibility Dynamics in a Tweet Conversation Graph

A Portion of a Tweet Conversation about the Ebola Virus We conducted another study in the same spirit as the first, as part of our research (funded by IMLS) to build collections for stories or events. This time we sought to understand how to extract not just a single tweet, but the conversation to which the tweet belongs. We explored how the visibility of tweets in a conversation graph changes based on the tweet selected. A need for archiving tweet conversations Archiving tweets usually involves collecting tweets associated with a given hashtag. Even though this provides a "clean" way of collecting tweets about the event associated with the hashtag, something important is often missed - conversations. Not all tweets about a particular topic will have the given hashtag, including portions of a threaded conversation, even if the initial tweet contained the hashtag. This is unfortunate because conversations may provide contextual information about tweet

2016-07-07: Signposting the Scholarly Web

The web site for "Signposting the Scholarly Web" recently went online. There is a ton of great content available and since it takes some time to process it all, I'll give some of the highlights here. First, this is the culmination of ideas that have been brewing for some time (see this early 2015 short video, although some of the ideas can arguably be traced to this 2014 presentation). Most recently, our presentation at CNI Fall 2015, our 2015 D-Lib Magazine article, and our 2016 tech report advanced the concepts. Here's the short version: the purpose is to make a standard, machine-readable method for web robots and other clients to "follow their nose" as they encounter scholarly material on the web. Think of it as similar (in purpose if not technique) to Facebook's Open Graph or FOAF, but for publications, slides, data sets, etc. Currently there are three basic functions in Signposting: Discovering rich, structured, bibliograph

2016-07-01: Fulbright Enrichment Seminar - Lab to Market: Entrepreneurship and Technological Innovation Enrichment (May 24 - 28, 2016)

One of the most valuable experiences that I have had in my life is being a Fulbright scholar. Before winning this scholarship, I was an employee at BPS-Statistics Indonesia. I started working there right after I received my B.S. in Computational Statistics from the Institute of Statistics in Jakarta, Indonesia. I had worked there for 3 years when suddenly I felt like I was stuck in a comfort zone. Science and knowledge, especially those related to technology, are growing very rapidly. There is a lot of new information out there, which I could not get if I did not get out of my office building. I needed to upgrade my education for a better career in the future. I started applying for many scholarships to study abroad and got many rejections before I was finally invited to an interview for Fulbright. After a very long selection process (it took almost a year from when I submitted my application), I was fortunate enough to receive a Fulbright scholarship. Now, I am pursuing an M.S. in Computer