Posts

2017-10-16: Visualizing Webpage Changes Over Time - new NEH Digital Humanities Advancement Grant

In August, we were excited to be awarded an 18-month Digital Humanities Advancement Grant from the National Endowment for the Humanities (NEH) and the Institute of Museum and Library Services (IMLS). Our project, "Visualizing Webpage Changes Over Time", was one of 31 awards made through this joint NEH/IMLS program (award announcement).

Michele C. Weigle and Michael L. Nelson - ODU
Deborah Kempe - Frick Art Reference Library and New York Art Resources Consortium
Pamela Graham and Alex Thurman - Columbia University Libraries
Oct 2017 – Mar 2019, $75,000

As web archives grow in importance and size, techniques for understanding how a web page changes through time need to adapt from an assumption of scarcity (just a few copies of a page, no more than a few weeks or months apart) to one of abundance (tens of thousands of copies of a page, spanning as much as 20 years). This project, a joint effort among ODU, the New York Art Resources Consortium (NYARC), and

2017-09-19: Carbon Dating the Web, version 4.0

This release of Carbon Date, dubbed Carbon Date v4.0, introduces new features to track testing and enforce standard Python formatting conventions. We've also decided to switch from MementoProxy and take advantage of the Memgator Aggregator tool built by Sawood Alam. Of course, with new APIs come new bugs that need to be addressed, such as this exception handling issue. Fortunately, the new tools being integrated into the project will allow our team to catch and address these issues more quickly than before, as explained below. The previous version of this project, Carbon Date 3.0, added Pubdate extraction, Twitter searching, and Bing search. We found that Bing has changed its API to allow only 30-day trials with 1000 requests per month unless someone wants to pay. We also discovered a few more use cases for the Pubdate extraction by applying Pubdate to the mementos retrieved from Memgator. By default, Memgator provides t
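As a rough illustration of working with mementos from an aggregator, the sketch below picks the earliest memento out of a TimeMap. It assumes the TimeMap is in the link format described in RFC 7089; the sample TimeMap, its URIs, and the function name `earliest_memento` are made up for illustration, and this is not Carbon Date's own code.

```python
import re
from email.utils import parsedate_to_datetime

# Matches one link-format entry whose rel value contains the word "memento"
# and captures its URI and datetime attribute.
MEMENTO = re.compile(
    r'<([^>]+)>;[^<]*?rel="[^"]*\bmemento\b[^"]*";[^<]*?datetime="([^"]+)"'
)


def earliest_memento(timemap_text):
    """Return (uri, datetime) for the oldest memento, or None if there are none."""
    mementos = [
        (parsedate_to_datetime(when), uri)
        for uri, when in MEMENTO.findall(timemap_text)
    ]
    if not mementos:
        return None
    when, uri = min(mementos)
    return uri, when


# Hypothetical TimeMap for illustration only.
SAMPLE_TIMEMAP = """\
<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20100615120000/http://example.com/>; rel="memento"; datetime="Tue, 15 Jun 2010 12:00:00 GMT",
<http://archive.example/20050101000000/http://example.com/>; rel="first memento"; datetime="Sat, 01 Jan 2005 00:00:00 GMT",
"""

if __name__ == "__main__":
    print(earliest_memento(SAMPLE_TIMEMAP))
```

The earliest memento is only a lower bound on how long a page has existed in the archives, which is why a tool of this kind would plausibly combine it with other signals such as the Pubdate extraction mentioned above.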

2017-09-13: Pagination Considered Harmful to Archiving

Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015
Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015
https://web.archive.org/web/20151030092546/https://www.usnews.com/education/best-global-universities/rankings

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages that make up the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (28%) had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursadar and Shipman (2017). Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews,

2017-08-27: Four WS-DL Classes Offered for Fall 2017

An unprecedented four Web Science & Digital Library (WS-DL) courses will be offered in Fall 2017:

CS 418/518 "Web Programming", Tuesdays 4:20-7:00 pm (CRNs 14356 & 14357), will be offered by Dr. Justin F. Brunelle. This will be an updated version of CS 418/518 last taught in Fall 2016 by Dr. Brunelle.
CS 734/834 "Introduction to Information Retrieval", Thursdays 4:20-7:00 pm (CRNs 13631 & 13639) offered by Dr. Michael L. Nelson. This will be an updated version of the same course most recently taught in Fall 2016.
CS 725/825 "Information Visualization", Fridays 8:30-11 am (CRNs 19344 & 19345) offered by Dr. Michele C. Weigle. This will be an updated version of the same course most recently taught in Spring 2017.
CS 791/891 "Web Archiving Seminar", Wednesdays 2:00-4:30 pm (CRNs 21302 & 21303) offered by Drs. Nelson & Weigle. This is a new seminar for incoming and prospective WS-DL students wh

2017-08-27: Media Manipulation research at the Berkman Klein Center at Harvard University Trip Report

A photo of me inside "The Yellow House" - The Berkman Klein Center for Internet & Society

On June 5, 2017, I started work as an intern at the Berkman Klein Center for Internet & Society at Harvard University under the supervision of Dr. Rob Faris, the Research Director for the Berkman Klein Center. This was a wonderful opportunity to conduct research on news media, and my second consecutive summer of research at Harvard. The Berkman Klein Center is an interdisciplinary research center that studies ways to tackle some of the biggest challenges on the Internet. Located in a yellow house at the Harvard Law School, the Center is committed to studying the development, dynamics, norms and standards of cyberspace. The Center has produced many significant contributions, such as the review of ICANN (Internet Corporation for Assigned Names and Numbers) and the founding of the DPLA (Digital Public Library of America). During the first week of my Inter

2017-08-26: rel="bookmark" also does not mean what you think it means

Extending our previous discussion about how the proposed rel="identifier" is different from rel="canonical" (spoiler alert: "canonical" is only for pages with duplicative text), here I summarize various discussions about why we can't use rel="bookmark" for the proposed scenarios. We've already given a brief review of why rel="bookmark" won't work (spoiler alert: it is explicitly prohibited for HTML <link> elements or HTTP Link: headers) but here we more deeply explore the likely original semantics. I say "likely original semantics" because: the short phrases in the IANA link relations registry ("Gives a permanent link to use for bookmarking purposes") and the HTML5 specification ("Gives the permalink for the nearest ancestor section") are not especially clear, nor is the example in the HTML5 specification. rel="bookmark" exists to address a problem, anonymous co

2017-08-25: University Twitter Engagement: Using Twitter Followers to Rank Universities

Figure 1: Summing primary and secondary followers for @ODUNow

Our University Twitter Engagement (UTE) rank is based on the friend and extended follower network of primary and affiliated secondary Twitter accounts referenced on a university's home page. We show that UTE has a significant, positive correlation with expert university reputation rankings (e.g., USN&WR, THE, ARWU) as well as rankings by endowment, enrollment, and athletic expenditures (EEE). As illustrated in Figure 1, we bootstrapped the process by starting with the URI for the university's homepage obtained from the detailed institutional profile information in the ranking lists. For each URI, we navigated to the associated webpage and searched the HTML source for links to valid Twitter handles. Once the Twitter screen name was identified, the Twitter GET users/show API was used to retrieve the URI from the profile of each user name. If the domain of the URI matched exactly or resolved to the known do
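The two local steps described above, finding Twitter handles in a homepage's HTML and comparing a profile URI's domain with the university's, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the regular expression, the function names, and the sample HTML are all assumptions, and the sketch neither calls the Twitter API nor follows redirects to see where a URI resolves.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: an href pointing at twitter.com/<screen_name>.
# Screen names are limited to 15 letters, digits, or underscores.
TWITTER_LINK = re.compile(
    r"""href=["']https?://(?:www\.)?twitter\.com/@?([A-Za-z0-9_]{1,15})/?["']""",
    re.IGNORECASE,
)


def find_twitter_handles(html):
    """Return screen names linked from the page, in order, ignoring case duplicates."""
    handles = []
    seen = set()
    for name in TWITTER_LINK.findall(html):
        key = name.lower()
        if key not in seen:
            seen.add(key)
            handles.append(name)
    return handles


def host_of(uri):
    """Lower-cased host of a URI with any leading 'www.' removed."""
    host = (urlparse(uri).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_domain(profile_uri, university_uri):
    """True when both URIs point at exactly the same host."""
    return host_of(profile_uri) == host_of(university_uri)


# Made-up homepage fragment for illustration only.
SAMPLE_HTML = (
    '<a href="https://twitter.com/ODUNow">Follow us</a> '
    '<a href="https://twitter.com/share?url=x">Share</a> '
    "<a href='http://www.twitter.com/odunow/'>Twitter</a>"
)

if __name__ == "__main__":
    print(find_twitter_handles(SAMPLE_HTML))
    print(same_domain("http://www.odu.edu/", "https://odu.edu/about"))
```

Requiring the closing quote right after the screen name keeps widget links such as twitter.com/share?url=... from being mistaken for handles.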

2017-08-14: Introducing Web Archiving and Docker to Summer Workshop Interns

Last Wednesday, August 9, 2017, I was invited to give a talk to some summer interns of the Computer Science Department at Old Dominion University. Every summer our department invites some undergraduate students from India and hosts them for about a month as summer interns working on projects in one of our research labs. During this period, various research groups introduce their work to the interns to encourage them to become potential graduate applicants. The interns also act as academic ambassadors who motivate their colleagues back in India to pursue higher studies. This year, Mr. Ajay Gupta invited a group of 20 students from Acharya Institute of Technology and B.N.M. Institute of Technology and supervised them during their stay at Old Dominion University. As I was last year, I was again selected from the Web Science and Digital Libraries Research Group to introduce them to the concept of web archiving and the research of our lab. An overview of the talk can be fo

2017-08-11: Where Can We Post Stories Summarizing Web Archive Collections?

A social card generated by Facebook for my previous blog post.

Rich links, snippet, social snippet, social media card, Twitter card, embedded representation, rich object, social card. These visualizations of web objects now pervade our existence on and off of the Web. The concept has been used to render web documents as results in academic research projects, like in Omar Alonso's "What's Happening and What Happened: Searching the Social Web". oEmbed is a standard for producing rich embedded representations of web objects for a variety of consuming services. Google experiments with using richer objects in their search results, even including images and other content from pages. Facebook, Twitter, Tumblr, Storify, and other tools use these cards. They have become so ubiquitous that services that do not produce these cards, like Google Hangouts, seem antiquated. These cards also no longer just sit within the confines of the web browser, being used