Posts

Showing posts from July, 2016

2016-07-24: Improve research code with static type checking

The Pain of Late Bug Detection

[The web] is big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is... [1]

When it comes to quick implementation, Python is an efficient language used by many web archiving projects. Indeed, a quick search of GitHub for WARC and Python yields a list of 80 projects and forks. Python is also the language used for my research into the temporal coherence of existing web archive holdings.

The sheer size of the Web means lots of variation and lots of low-frequency edge cases. These variations and edge cases are naturally reflected in web archive holdings. Code used to research the Web and web archives naturally contains many, many code branches.

Python struggles under these conditions. It struggles because minor changes can easily introduce bugs that go undetected until much later. And later for Python means at run time. Indeed, the sheer number of edge cases introduces code branches that are exercised so infrequently that code…
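To make the late-detection problem concrete, here is a minimal sketch of a rarely exercised branch. The function names `parse_status` and `is_redirect` are hypothetical and not taken from the post, and the sketch assumes a static type checker such as mypy (the excerpt above does not name a tool). The annotations let the checker see that `parse_status` can return `None`, so it can flag an unguarded comparison before the code ever runs on real archive data.

```python
from typing import Optional


def parse_status(line: str) -> Optional[int]:
    """Return the HTTP status code from a status line, or None if malformed."""
    parts = line.split()
    if len(parts) >= 2 and parts[1].isdigit():
        return int(parts[1])
    # Rare edge case: a malformed status line yields no usable code.
    return None


def is_redirect(line: str) -> bool:
    """Report whether a status line carries a 3xx status code."""
    status = parse_status(line)
    # If this guard is removed, `300 <= status` raises TypeError at run time,
    # but only when a malformed line finally shows up. A static checker can
    # report the unguarded comparison on an Optional[int] ahead of time.
    if status is None:
        return False
    return 300 <= status < 400


if __name__ == "__main__":
    print(is_redirect("HTTP/1.1 302 Found"))
    print(is_redirect("garbage"))
```

The design choice is to encode the edge case in the return type (`Optional[int]`) rather than in a comment, so every caller is forced to handle it.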

2016-07-21: Dockerizing ArchiveSpark - A Tale of Pair Hacking

"Some doctors prescribe application of sandalwood paste to remedy headache, but making the paste and applying it is no less of a headache." -- an Urdu proverb

This is the translation of a couplet from an Urdu poem that is often used as a proverb. This couplet nicely reflects my feeling when Vinay Goel from the Internet Archive was demonstrating how suitable ArchiveSpark was for our IMLS Museums data analysis during the Archives Unleashed 2.0 Datathon at the Library of Congress, Washington DC on June 14, 2016. ArchiveSpark allows easy data extraction, derivation, and analysis from standard web archive files (such as CDX and WARC). In the back of my mind I was thinking that it seems nice, cool, and awesome to use ArchiveSpark (or Warcbase) for the task, and certainly a good idea for serious archive data analysis, but perhaps overkill for a two-day hackathon event. Installing and configuring these tools would have required us to set up a Hadoop cluster, Jupyter notebook, Spark

2016-07-18: Tweet Visibility Dynamics in a Tweet Conversation Graph

A Portion of a Tweet Conversation about the Ebola Virus
We conducted another study in the same spirit as the first, as part of our research (funded by IMLS) to build collections for stories or events. This time we sought to understand how to extract not just a single tweet, but the conversation to which the tweet belongs. We explored how the visibility of tweets in a conversation graph changes based on the tweet selected.

A need for archiving tweet conversations

Archiving tweets usually involves collecting tweets associated with a given hashtag. Even though this provides a "clean" way of collecting tweets about the event associated with the hashtag, something important is often missed: conversations. Not all tweets about a particular topic will have the given hashtag, including portions of a threaded conversation, even if the initial tweet contained the hashtag. This is unfortunate because conversations may provide contextual information about tweets.

@acnwala Agreed that…

2016-07-07: Signposting the Scholarly Web

The web site for "Signposting the Scholarly Web" recently went online. There is a ton of great content available, and since it takes some time to process it all, I'll give some of the highlights here.

First, this is the culmination of ideas that have been brewing for some time (see this early 2015 short video, although some of the ideas can arguably be traced to this 2014 presentation).  Most recently, our presentation at CNI Fall 2015, our 2015 D-Lib Magazine article, and our 2016 tech report advanced the concepts.

Here's the short version: the purpose is to make a standard, machine-readable method for web robots and other clients to "follow their nose" as they encounter scholarly material on the web.  Think of it as similar (in purpose if not technique) to Facebook's Open Graph or FOAF, but for publications, slides, data sets, etc. 

Currently there are three basic functions in Signposting:
Discovering rich, structured, bibliographic metadata from web …

2016-07-01: Fulbright Enrichment Seminar - Lab to Market: Entrepreneurship and Technological Innovation Enrichment (May 24 - 28, 2016)

One of the most valuable experiences of my life is being a Fulbright scholar. Before winning this scholarship, I was an employee at BPS-Statistics Indonesia. I started working there right after I received my B.S. in Computational Statistics from the Institute of Statistics in Jakarta, Indonesia. I had worked there for three years when I suddenly felt that I was stuck in a comfort zone. Science and knowledge, especially those related to technology, are growing very rapidly. There is a lot of new information out there, which I could not get if I did not get out of my office building. I needed to upgrade my education for a better career in the future. I started applying for many scholarships to study abroad and received many rejections before I was finally invited to an interview for Fulbright. After a very long selection process (it took almost a year from the time I submitted my application), I was fortunate enough to receive a Fulbright scholarship. Now, I am pursuing an M.S. in Computer Science at…