Tuesday, December 20, 2016

2016-12-20: Archiving Pages with Embedded Tweets

I'm from Louisiana and used Archive-It to build a collection of webpages about the August 2016 flood there (https://www.archive-it.org/collections/7760/).

One of the pages I came across, Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy', highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page (click for full page):

Live page, screenshot generated on Sep 9, 2016

To me, the most important resources here were the tweets and their pictures, so I'll focus here on how well they were archived.

First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on Twitter’s servers which converts the <blockquote> into a fully-rendered Tweet."

Here's the first embedded tweet (https://twitter.com/vernonernst/status/765398679649943552), with a picture of a long line of trucks pulling their boats to join the Cajun Navy.
First embedded tweet - live web

Here's the source for this embedded tweet:
<blockquote class="twitter-tweet" data-width="500"> <p lang="en" dir="ltr">
<a target="_blank" href="https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a> </p> — Vernon Ernst (@vernonernst) <a href="https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a> </blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

When the widgets.js script executes in the browser, it transforms the <blockquote class="twitter-tweet"> element into a <twitterwidget>:
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552">
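A quick way to tell which form a capture preserved is to scan its HTML for these two signatures. The helper below is a hypothetical sketch of mine, not part of any of the tools discussed here:

```python
import re

def embed_state(html):
    """Classify how an embedded tweet appears in a page's HTML.

    Returns 'rendered' if widgets.js has already replaced the embed
    with a <twitterwidget>, 'unrendered' if the original
    <blockquote class="twitter-tweet"> embed code is still present,
    and 'absent' if neither signature is found.
    """
    if re.search(r'<twitterwidget\b', html, re.IGNORECASE):
        return 'rendered'
    if re.search(r'<blockquote[^>]*class="[^"]*twitter-tweet', html, re.IGNORECASE):
        return 'unrendered'
    return 'absent'
```

Run against the snippets in this post, the live-web embed code and the Archive-It replay both classify as 'unrendered', while the transformed DOM classifies as 'rendered'.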

Now, let's consider how the various archives handle this.


Since I'd been using Archive-It to create the collection, that was the first tool I used to capture the page. Archive-It uses the Internet Archive's Heritrix crawler and Wayback Machine for replay. I set the crawler to archive the page and embedded resources, but not to follow links. No special scoping rules were applied.

Archive-It, captured on Aug 18, 2016
Here's how the first embedded tweet displayed in Archive-It:
Embedded tweet as displayed in Archive-It

Here's the source (as rendered in the DOM) upon playback in Archive-It's Wayback Machine:
<blockquote class="twitter-tweet twitter-tweet-error" data-conversation="none" data-width="500" data-twitter-extracted-i1479916001246582557="true">
<p lang="en" dir="ltr"> <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/CajunNavy?src=hash" target="_blank" rel="external nofollow">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/LouisianaFlood?src=hash" target="_blank" rel="external nofollow">#LouisianaFlood</a> <a href="http://wayback.archive-it.org/7760/20160818180453/https://t.co/HaugQ7Jvgg" target="_blank" rel="external nofollow">pic.twitter.com/HaugQ7Jvgg</a> </p> <p>— Vernon Ernst (@vernonernst) <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/vernonernst/status/765398679649943552" target="_blank" rel="external nofollow">August 16, 2016</a></p></blockquote>
<p> <script async="" 
src="//wayback.archive-it.org/7760/20160818180453js_/http://platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script> </p>

Except for the links being re-written to point to the archive, this is the same as the original embed source, rather than the transformed version.  Upon playback, although widgets.js was archived (http://wayback.archive-it.org/7760/20160818180456js_/http://platform.twitter.com/widgets.js?4fad35), it is not able to modify the DOM as it does on the live web (widgets.js loads additional JavaScript that was not archived).


Next up is the on-demand service webrecorder.io, which is able to replay the embedded tweets just as on the live web.


Webrecorder.io, viewed Sep 29, 2016

The HTML source (https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/) looks similar to the original embed (except for re-written links):
<blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr"><a target="_blank" href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world!  <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a></p>&mdash; Vernon Ernst (@vernonernst) <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote>
<script async src="//wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

Upon playback, we see that webrecorder.io is able to fully execute the widgets.js script, so the transformed HTML looks like the live web (with the inserted <twitterwidget> element):
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552"></twitterwidget>
<script async="" src="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

Note that widgets.js is archived and is loaded from webrecorder.io, not the live web.


archive.is is another on-demand archiving service.  As with webrecorder.io, the embedded tweets are shown as on the live web.

archive.is, captured Sep 9, 2016

archive.is executes and then flattens JavaScript, so although the embedded tweet looks similar to how it's rendered in webrecorder.io and on the live web, the source is completely different:
<article style="direction:ltr;display:block;">

<a href="https://archive.is/o/5JcKx/twitter.com/vernonernst/status/765398679649943552/photo/1" style="color:rgb(43, 123, 185);text-decoration:none;display:block;position:absolute;top:0px;left:0px;width:100%;height:328px;line-height:0;background-color: rgb(255, 255, 255); outline: invert none 0px; "><img alt="View image on Twitter" src="http://archive.is/5JcKx/fc15e4b873d8a1977fbd6b959c166d7b4ea75d9d" title="View image on Twitter" style="width:438px;max-width:100%;max-height:100%;line-height:0;height:auto;border-width: 0px; border-style: none; border-color: white; "></a>

<blockquote cite="https://twitter.com/vernonernst/status/765398679649943552" style="list-style: none outside none; border-width: medium; border-style: none; margin: 0px; padding: 0px; border-color: white; ">

on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://archive.is/o/5JcKx/https://twitter.com/hashtag/LouisianaFlood?src=hash" style="direction:ltr;background-color: transparent; color:rgb(43, 123, 185);text-decoration:none;outline: invert none 0px; "><span>#</span><span>LouisianaFlood</span></a>


WARCreate is a Google Chrome extension that our group developed to allow users to archive the page they are currently viewing in their browser.  It was last actively updated in 2014, though we are currently working on updates to be released in 2017.

The image below shows the result of the page being captured with WARCreate and replayed in webarchiveplayer.

WARCreate, captured Sep 9, 2016, replayed in webarchiveplayer
Upon replay, WARCreate is not able to display the tweet at all.  Here's the close-up of where the tweets should be:

WARCreate capture replayed in webarchiveplayer, with tweets missing
Examining both the WARC file and the source of the archived page helps to explain what's happening.

Inside the WARC, we see:
<h4>In stepped a group known as the <E2><80><9C>Cajun Navy<E2><80><9D>:</h4>
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
<p><script async="" src="//platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>

This is the same markup that's in the DOM upon replay in webarchiveplayer, except for the script source being written to localhost:
<h4>In stepped a group known as the “Cajun Navy”:</h4>
<twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
<p><script async="" src="//localhost:8090/20160822124810js_///platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>

WARCreate captures the HTML after the page has fully loaded.  So what's happening here is that the page loads, widgets.js executes, the DOM is changed (thus the <twitterwidget> tag), and then WARCreate saves the transformed HTML. But we don't get the widgets.js script needed to properly render <twitterwidget>. Our expectation is that with fixes to allow WARCreate to archive the loaded JavaScript, the embedded tweet would be displayed as on the live web.

Each of these four archiving tools operates on the embedded tweet in a different way, highlighting the complexities of archiving asynchronously loaded JavaScript and DOM transformations.
  • Archive-It (Heritrix/Wayback) - archives the HTML returned in the HTTP response and JavaScript loaded from the HTML
  • Webrecorder.io - archives the HTML returned in the HTTP response, JavaScript loaded from the HTML, and JavaScript loaded after execution in the browser
  • Archive.is - fully loads the webpage, executes JavaScript, rewrites the resulting HTML, and archives the rewritten HTML
  • WARCreate - fully loads the webpage, executes JavaScript, and archives the transformed HTML
It is useful to examine how different archiving tools and playback engines render complex webpages, especially those that contain embedded media.  Our forthcoming update to the Archival Acid Test will include tests for embedded content replay.


Monday, November 21, 2016

2016-11-21: WS-DL Celebration of #IA20

The Web Science & Digital Library Research Group celebrated the 20th Anniversary of the Internet Archive with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving.  This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). 

Normally our group posts about research developments and technical analysis of web archiving, but for #IA20 we had members of our group write mostly non-technical stories drawn from personal experiences and interests that are made possible by web archiving.  We are often asked "Why archive the web?" and we hope these blog posts will help provide you with some answers.
We've collected these links and more material related to #IA20 in both a Storify story and a Twitter moment; we hope you can take the time to explore them further.  We'd like to thank everyone at the Internet Archive for 20 years of yeoman's work, the many other archives that have come on-line more recently, and all of the WS-DL members who made the time to provide their personal stories about the impacts and opportunities of web archiving.


Wednesday, November 16, 2016

2016-11-16: Reminiscing About The Days of Cyber War Between Indonesia and Australia

Image is taken from Wikipedia

Indonesia and Australia are neighboring countries that, as often happens between neighbors, have a hot-and-cold relationship. History has recorded a number of disputes between Indonesia and Australia, from the East Timor disintegration (now Timor Leste) in 1999 to the Bali Nine case (the execution of Australian drug smugglers) in 2015. One of the issues that has really caused a stir in Indonesia-Australia relations is the spying imbroglio conducted by Australia against Indonesia. The tension arose when the Australian newspaper The Sydney Morning Herald published an article titled Exposed: Australia's Asia spy network and a video titled Spying at Australian diplomatic facilities on October 31, 2013. They revealed, via one of Edward Snowden's leaks, that Australia had been spying on Indonesia since 1999. This startling fact surely enraged Indonesia's government and, most definitely, the people of Indonesia.

Indonesia strongly demanded clarification and an explanation by summoning Australia's ambassador, Greg Moriarty. Indonesia also demanded that Australia apologize. But Australia refused, arguing that such surveillance is something every government does to protect its country. The situation became more serious when it was also divulged that an Australian security agency had attempted to listen in on Indonesian President Susilo Bambang Yudhoyono's cell phone in 2009. Yet Tony Abbott, Australia's prime minister at the time, still refused to give either an explanation or an apology. This caused President Yudhoyono to accuse Tony Abbott of 'belittling' Indonesia's response to the issue. All of this made the already enraged Indonesian public even more furious. Furthermore, Indonesians judged that their government was too slow in following up on and responding to the issue.

Image is taken from The Australian

To channel their frustration and anger, a group of Indonesian hacktivists named 'Anonymous Indonesia' launched attacks on hundreds of Australian websites that were chosen randomly. They hacked and defaced those websites to spread the message 'stop spying on Indonesia'. Over 170 Australian websites were hacked during November 2013, including government websites such as those of the Australian Secret Intelligence Service (ASIS), the Australian Security Intelligence Organisation (ASIO), and the Department of Foreign Affairs and Trade (DFAT).

Australian hackers also took revenge by attacking several important Indonesian websites, such as those of the Ministry of Law and Human Rights and Indonesia's national airline, Garuda Indonesia. But the number of attacked websites was not nearly as large as the number attacked by the Indonesians. These websites have since recovered and look as if the attacks never happened. Fortunately, those who never heard of this spying row can use the Internet Archive to go back in time and see how those websites looked when they were attacked. Unfortunately, not all of the attacked websites have archives for November 2013. For example, according to The Sydney Morning Herald and the Australian Broadcasting Corporation, the ASIS website was hacked on November 11, 2013. The Australian also reported that the ASIO website was hacked on November 13, 2013. But these incidents were not archived by the Internet Archive, as we cannot see any snapshot for the given dates.



However, we are lucky enough to have sufficient examples to give us a clear idea of the cyber war that once took place between Indonesia and Australia.

-- Erika (@erikaris)

2016-11-16: Introducing the Local Memory Project

Collage made from screenshot of local news websites across the US
The national news media has different priorities than the local news media. If one seeks to build a collection about local events, the national news media may be insufficient, with the exception of local news which “bubbles” up to the national news media. Irrespective of this “bubbling” of some local news to the national surface, the perspective and reporting of national news differs from local news for the same events. Also, it is well known that big multinational news organizations routinely cite the reports of smaller local news organizations for many stories. Consequently, local news media is fundamental to journalism.

It is important to consult local sources affected by local events, hence the need for a system that helps small communities build collections of web resources from local sources about important local events. The need for such a system was first (to the best of my knowledge) outlined by Harvard LIL. Given Harvard LIL's interest in facilitating participatory archiving by local communities and libraries, and our IMLS-funded interest in building collections for stories and events, my summer fellowship at Harvard LIL provided a good opportunity to collaborate on the Local Memory Project.

Our goal is to provide a suite of tools under the umbrella of the Local Memory Project to help users and small communities discover, collect, build, archive, and share collections of stories for important local events from local sources.

Local Memory Project dataset

We currently have a public json US dataset scraped from USNPL of:
  • 5,992 Newspapers 
  • 1,061 TV stations, and 
  • 2,539 Radio stations
The dataset structure is documented and comprises the media website; Twitter, Facebook, and YouTube links; RSS/OpenSearch links; and the geo-coordinates of the cities or counties in which the local media organizations reside. I strongly believe this dataset could be essential to media research.
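As a sketch of what working with the dataset might look like, here is a small example that filters records by state and media type. The field names ('state', 'type', 'name', 'website') follow the API sample later in this post, and the miniature records below are hand-made for illustration:

```python
def media_in_state(records, state, media_type=None):
    """Filter local media records by state abbreviation and,
    optionally, by a substring of the media type field."""
    hits = []
    for rec in records:
        if rec.get('state') != state:
            continue
        if media_type and media_type not in rec.get('type', ''):
            continue
        hits.append((rec['name'], rec['website']))
    return hits

# Hypothetical miniature of the dataset:
sample = [
    {"name": "Cambridge Chronicle", "state": "MA",
     "type": "Newspaper - cityCounty",
     "website": "http://cambridge.wickedlocal.com/"},
    {"name": "WHRB 95.3 FM", "state": "MA",
     "type": "Radio - Harvard Radio", "website": "http://www.whrb.org/"},
]
print(media_in_state(sample, "MA", "Newspaper"))
# [('Cambridge Chronicle', 'http://cambridge.wickedlocal.com/')]
```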

There are currently 3 services offered by the Local Memory Project:

1. Local Memory Project - Google Chrome extension:

This service is an implementation of Adam Ziegler and Anastasia Aizman's idea for a utility that helps one build a collection for a local event which did not receive national coverage. Consequently, given a story expressed as a query and a place represented by a zip code, the Google Chrome extension performs the following operations:
  1. Retrieve a list of local news (newspaper and TV station) websites that serve the zip code
  2. Search Google for stories about the query from each of the local news websites retrieved in step 1
The result is a collection of stories for the query from local news sources.
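Step 2 amounts to one site-restricted Google query per outlet. A minimal sketch of the query construction (the domain below is a hypothetical stand-in, and this is not the extension's actual code):

```python
from urllib.parse import quote_plus

def site_queries(query, news_sites):
    """Build one site-restricted Google search URL per local news
    website, mirroring steps 1-2 above (illustrative only)."""
    return [
        'https://www.google.com/search?q=' +
        quote_plus('{} site:{}'.format(query, site))
        for site in news_sites
    ]

# Hypothetical domain for a Miami-area outlet:
urls = site_queries('zika virus', ['miamitimesonline.com'])
```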

For example, given the problem of building a collection for Zika virus for Miami Florida, we issue the following inputs (Figure 1) to the Google Chrome Extension and click "Submit":
Figure 1: Google Chrome Extension, input for building a collection about Zika virus for Miami FL
After the submit button is pressed, the application issues the "zika virus" query to Google with the site directive for newspapers and TV stations in the 33101 area.

Figure 2: Google Chrome Extension, search in progress. Current search in image targets stories about Zika virus from Miami Times
After the search, the result (Figure 3) was saved remotely.
Figure 3: A subset (see complete) of the collection about Zika virus built for the Miami FL area.
Here are examples of other collections built with the Google Chrome Extension (Figures 4 and 5):
Figure 4: A subset (see complete) of the collection about Simone Biles' return for Houston Texas
Figure 5: A subset (see complete) of the collection about Protesters and Police for Norfolk Virginia
The Google Chrome extension also offers customized settings that suit different collection building needs:
Figure 6: Google Chrome Extension Settings (Part 1)
Figure 7: Google Chrome Extension Settings (Part 2)
  1. Google max pages: The number of Google search pages to visit for each news source. Increase if you want to explore more Google pages since the default value is 1 page.
  2. Google Page load delay (seconds): This time delay between loading Google search pages ensures a throttled request.
  3. Google Search FROM date: Filter your search for news articles crawled from this date. This comes in handy if a query spans multiple time periods, but the curator is interested in a definite time period.
  4. Google Search TO date: Filter your search for news articles before this date. This comes in handy especially when combined with 3, it can be used to collect documents within a start and end time window.
  5. Archive Page load delay (seconds): Time delay between loading pages to be archived. You can increase this time if you want to have the chance to do something (such as hit archive again) before the next archived page loads automatically. This is tailored to archive.is.
  6. Download type: Download to your machine for a personal collection (in json or txt format). But if you choose to share, save remotely (you should!)
  7. Collection filename: Custom filename for collection about to be saved.
  8. Collection name: Custom name for your collection. It's good practice to label collections.
  9. Upload a saved collection (.json): For json collections saved locally, you may upload them to revisualize the collection.
  10. Show Thumbnail: A flag that decides whether to send a remote request to get a card (thumbnail summary) for the link. Since cards require multiple GET requests, you may choose to switch this off if you have a large collection.
  11. Google news: The default search of the extension is the generic Google search page. Check this box to search the Google News vertical instead.
  12. Add website to existing collection: Add a website to an existing collection.
2. Local Memory Project - Geo service:

The Google Chrome extension utilizes the Geo service to find media sources that serve a zip code. This service is an implementation of Dr. Michael Nelson's idea for a service that supplies an ordered list of media outlets based on their proximity to a user-specified zip code.

Figure 8: List of top 10 Newspapers, Radio and TV station closest to zip code 23529 (Norfolk, VA)
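Proximity ordering like this can be sketched with the standard haversine (great-circle) distance formula. The coordinates and outlet names below are illustrative, and the Geo service's actual implementation may differ:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points,
    one plausible way to order outlets by proximity to a zip code."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Order hypothetical outlets by distance from a query point:
outlets = [
    ("Outlet A", 42.3791, -71.1280),   # Cambridge, MA area
    ("Outlet B", 36.8852, -76.3059),   # Norfolk, VA area
]
query = (42.3601, -71.0589)            # approximate Boston coordinates
outlets.sort(key=lambda o: haversine_miles(query[0], query[1], o[1], o[2]))
```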

3. Local Memory Project - API:

The Local Memory Project Geo website is meant for human users, while the API website targets machine users. Therefore, it provides the same services as the Geo website but returns json output (as opposed to HTML). For example, below is a subset of the output (see complete) corresponding to a request for 10 news media sites in order of proximity to Cambridge, MA.
  "Lat": 42.379146, 
  "Long": -71.12803, 
  "city": "Cambridge", 
  "collection": [
      "Facebook": "https://www.facebook.com/CambridgeChronicle", 
      "Twitter": "http://www.twitter.com/cambridgechron", 
      "Video": "http://www.youtube.com/user/cambchron", 
      "cityCountyName": "Cambridge", 
      "cityCountyNameLat": 42.379146, 
      "cityCountyNameLong": -71.12803, 
      "country": "USA", 
      "miles": 0.0, 
      "name": "Cambridge Chronicle", 
      "openSearch": [], 
      "rss": [], 
      "state": "MA", 
      "type": "Newspaper - cityCounty", 
      "website": "http://cambridge.wickedlocal.com/"
      "Facebook": "https://www.facebook.com/pages/WHRB-953FM/369941405267", 
      "Twitter": "http://www.twitter.com/WHRB", 
      "Video": "http://www.youtube.com/user/WHRBsportsFM", 
      "cityCountyName": "Cambridge", 
      "cityCountyNameLat": 42.379146, 
      "cityCountyNameLong": -71.12803, 
      "country": "USA", 
      "miles": 0.0, 
      "name": "WHRB 95.3 FM", 
      "openSearch": [], 
      "rss": [], 
      "state": "MA", 
      "type": "Radio - Harvard Radio", 
      "website": "http://www.whrb.org/"
    }, ...
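As a sketch of how a machine client might consume this output, the example below groups outlets by their broad media type (the part of the "type" field before " - "). The response excerpt is a hand-trimmed, hypothetical miniature of the fields shown above:

```python
import json

# Hypothetical excerpt of an API response, with fields as shown above:
response_text = '''
{"city": "Cambridge",
 "collection": [
   {"name": "Cambridge Chronicle", "type": "Newspaper - cityCounty", "miles": 0.0},
   {"name": "WHRB 95.3 FM", "type": "Radio - Harvard Radio", "miles": 0.0}
 ]}
'''

data = json.loads(response_text)
# Group outlet names by their broad media type:
by_type = {}
for rec in data['collection']:
    kind = rec['type'].split(' - ')[0]
    by_type.setdefault(kind, []).append(rec['name'])
print(by_type)
# {'Newspaper': ['Cambridge Chronicle'], 'Radio': ['WHRB 95.3 FM']}
```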

Saving a collection built with the Google Chrome Extension

A collection built on a user's machine can be saved in one of two ways:
  1. Save locally: this serves as a way to keep a collection private. Saving can be done by clicking "Download collection" in the Generic settings section of the extension settings. A collection can be saved in json or plaintext format. The json format permits the collection to be reloaded through "upload a saved collection" in the Generic settings section of the extension settings. The plaintext format does not permit reloading into the extension, but contains all the links which make up the collection.
  2. Save remotely: in order to be able to share the collection you built locally with the world, you need to save remotely by clicking the "Save remotely" button on the frontpage of the application. This leads to a dialog requesting a mandatory unique collection author name (if one doesn't exist) and an optional collection name (Figure 10). After supplying the inputs the application saves the collection remotely and the user is presented with a link to the collection (Figure 11).
Before a collection is saved locally or remotely, you may choose to exclude an entire news source (all links from a given source) or a single link, as described in Figure 9:
Figure 9: Exclusion options before saving locally/remotely
Figure 10: Saving a collection prompts a dialog requesting a mandatory unique collection author name and an optional collection name
Figure 11: A link is presented after a collection is saved remotely

Archiving a collection built with the Google Chrome Extension

Saving is the first step to make a collection persist after it is built. However, archiving ensures that the links referenced in a collection persist even if the content is moved or deleted. Our application currently integrates archiving via Archive.is, but we plan to expand the archiving capability to include other public web archives.

In order to archive your collection, click the "Archive collection" button on the frontpage of the application. This leads to a dialog similar to the saving dialog which requests a mandatory unique collection author name (if one doesn't exist) and an optional collection name. Subsequently, the application archives the collection by first archiving the front page which contains all the local news sources, and secondly, the application archives the individual links which make up the collection (Figure 12). You may choose to stop the archiving operation at any time by clicking "Stop" on the archiving update orange-colored message bar. At the end of the archiving process, you get a short URI corresponding to the archived collection (Figure 13).
Figure 12: Archiving in progress
Figure 13: When the archiving is complete, a short link pointing to the archived collection is presented

Community collection building with the Google Chrome Extension

We envision a community of users contributing to a single collection for a story. Even though the collections are built in isolation, we consider a situation in which we can group collections around a single theme. To begin this process, the Google Chrome Extension lets you share a locally built collection on Twitter by clicking the "Tweet" button (Figure 14).
Figure 14: Tweet button enables sharing the collection

This means that if user 1 and user 2 locally build collections for Hurricane Hermine, they may use the hashtags #localmemory and #hurricanehermine when sharing their collections. Consequently, all Hurricane Hermine-related collections can be found on Twitter via those hashtags. We encourage users to include #localmemory and the collection hashtags in tweets when sharing collections. We also encourage you to follow the Local Memory Project on Twitter.
The local news media is a vital organ of journalism, but one in decline. We hope that by providing free and open source tools for collection building, we can contribute in some capacity to its revival.

I am thankful for everyone who has contributed to the ongoing success of this project. From Adam, Anastasia, Matt, Jack and the rest of the Harvard LIL team, to my Supervisor Dr. Nelson and Dr. Weigle, and Christie Moffat at the National Library of Medicine, as well as Sawood and Mat and the rest of my colleagues at WSDL, thank you.

2018-02-24 Edit: We needed to extract links from Google as part of a larger research project to quantify "refinding" stories on Google. This required issuing queries to Google every day and collecting links.
Fortunately, the Local Memory Project Google Chrome extension is well suited for this task because it enables downloading JSON files consisting of links extracted from Google for the issued queries. However, the standard version of the extension only permits issuing queries to Google and extracting links from local news media organizations. This "local media" restriction was not required in the research project. In other words, our research project required a standard Google search in which links from all kinds of media sources (local and non-local) are included in the search results. As a result, I added this functionality to the extension.

Additionally, a different research interest required us to extract tweets from conversation threads. The Twitter API does not permit this, so we similarly adapted the Local Memory Project Google Chrome extension for this task.

To summarize, here are the additional features of the extension:

  1. Extract links from standard Google searches: If you would like to extract links from Google for a given query (e.g., "winter olympics") set the zip code textbox to 0 (Figure 15). This instructs the extension to initiate a standard Google search and extract links for the specified number of pages (Figure 6, annotation 1). The extracted links may be downloaded locally as JSON files or saved (Figure 10-11) and/or archived (Figure 12-13) for remote and persistent access.
    extract links from Google
    Figure 15: Extract links from Google for query "winter olympics." This is achieved by setting the zip code textbox to "0"
  2. Extract tweets from Twitter: In order to extract tweets from a Twitter timeline, search, or threaded conversation, copy the URI from the address bar, paste it into the Tweet URL textbox (Figure 16), and press the Extract tweets button.
    Figure 16: Extract tweets from Twitter search for query: "winter olympics"
    For example, in order to extract tweets from your timeline, the Tweet URL textbox is set to "https://twitter.com/". In order to extract tweets from the search query "winter olympics," the Tweet URL textbox is set to "https://twitter.com/search?q=winter%20olympics&src=typd". In order to extract tweets from the hashtag "#WinterOlympics" the Tweet URL is set to "https://twitter.com/hashtag/WinterOlympics?src=hash". In order to extract tweets from a threaded conversation (e.g., Figure 17), set the Tweet URL to the conversation URL, e.g., "https://twitter.com/NewYorker/status/964220343278858240". The easiest way to get these tweet URLs is the address bar of your browser. The extension is able to scroll and click in order to load more tweets.
    Figure 17: The extension can extract tweets from threaded conversations
-- Nwala (@acnwala)

Monday, November 7, 2016

2016-11-07: Linking to Persistent Identifiers with rel="identifier"

(2018-06-08 update: as a result of community feedback, we replaced "identifier" with "cite-as".  We chose to keep the examples below intact, but please substitute rel="cite-as" whenever you see rel="identifier". -- MLN&HVDS)

Do you remember hearing about that study that found that people who are "good" at swearing actually have a large vocabulary, refuting the conventional wisdom about a "poverty-of-vocabulary"?  The DOI (digital object identifier) for the 2015 study is*:
http://dx.doi.org/10.1016/j.langsci.2014.12.003

But if you read about it in the popular press, such as the Independent or US News & World Report, you'll see that they linked to:
http://www.sciencedirect.com/science/article/pii/S038800011400151X

The problem is that although the DOI is the preferred link, browsers follow a series of redirects from the DOI to the ScienceDirect link, which is then displayed in the address bar of the browser, and that's the URI that most people are going to copy and paste when linking to the page.  Here's a curl session showing just the HTTP status codes and corresponding Location: headers for the redirection:

$ curl -iL --silent http://dx.doi.org/10.1016/j.langsci.2014.12.003 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS038800011400151X%3Fvia%253Dihub&key=072c950bffe98b3883e1fa0935fb56a6f1a1b364
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X
HTTP/1.1 200 OK

Most publishers follow this model of a series of redirects to implement authentication, tracking, etc. While DOI use has made significant progress in scholarly literature, many times the final URL is the one that is linked to instead of the more stable DOI (see the study by Herbert, Martin, and Shawn presented at WWW 2016 for more information).  Furthermore, while sometimes the mapping between the final URL and DOI is obvious (e.g., http://dx.doi.org/10.1371/journal.pone.0115253 --> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253), the above example proves that's not always the case.
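
The status/Location pairs in a transcript like the one above can be pulled out with a few lines of Python; a sketch that mirrors the egrep filter used in the curl session:

```python
import re

# Mirrors the egrep filter above: keep status lines and Location: headers.
HOP = re.compile(r'^(HTTP/1\.1 \d{3} .*|[Ll]ocation: .*)$', re.MULTILINE)

def hops(transcript):
    """Extract the redirect hops (status lines and Location headers)
    from a raw `curl -i` transcript."""
    return HOP.findall(transcript)

transcript = """HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
HTTP/1.1 200 OK
Content-Type: text/html
"""
print(hops(transcript))
# ['HTTP/1.1 303 See Other',
#  'Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X',
#  'HTTP/1.1 200 OK']
```
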

Ad-hoc linking back to DOIs

One of the obstacles limiting the correct linking is that there is no standard, machine-readable method for the HTML from the final URI to link back to its DOI (and by "DOI" we also mean all other persistent identifiers, such as handles, purls, arks, etc.).  In practice, each publisher adopts its own strategy for specifying DOIs in <meta> HTML elements:

In http://link.springer.com/article/10.1007%2Fs00799-016-0184-4 we see:

<meta name="citation_publisher" content="Springer Berlin Heidelberg"/>
<meta name="citation_title" content="Web archive profiling through CDX summarization"/>
<meta name="citation_doi" content="10.1007/s00799-016-0184-4"/>
<meta name="citation_language" content="en"/>
<meta name="citation_abstract_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_fulltext_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_pdf_url" content="http://link.springer.com/content/pdf/10.1007%2Fs00799-016-0184-4.pdf"/>

In http://www.dlib.org/dlib/january16/brunelle/01brunelle.html we see:

<meta charset="utf-8" />
<meta id="DOI" content="10.1045/january2016-brunelle" />
<meta itemprop="datePublished" content="2016-01-16" />
<meta id="description" content="D-Lib Magazine" />

In http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253 we see:

<meta name="citation_doi" content="10.1371/journal.pone.0115253" />
<meta name="dc.identifier" content="10.1371/journal.pone.0115253" />

In https://www.computer.org/csdl/proceedings/jcdl/2014/5569/00/06970187-abs.html we see:

<meta name='doi' content='10.1109/JCDL.2014.6970187' />

And in http://ieeexplore.ieee.org/document/754918/ there are no HTML elements specifying the corresponding DOI.  Furthermore, <meta> HTML elements can only appear in HTML -- which means you can't provide such links for PDF, CSV, Zip, or other non-HTML representations.  For example, NASA uses handles for the persistent identifiers of the PDF versions of their reports:

$ curl -IL http://hdl.handle.net/2060/19940023070
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Expires: Thu, 03 Nov 2016 17:47:07 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 221
Date: Thu, 03 Nov 2016 17:47:07 GMT

HTTP/1.1 301 Moved Permanently
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Location: https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Content-Length: 984250
Content-Type: application/pdf

And the final PDF obviously cannot use HTML elements to link back to its handle.
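
A consumer that wants the DOI from such pages is forced to probe each publisher's convention in turn. Here is a sketch using Python's html.parser that covers the <meta> variants shown above (the find_doi helper and the set of names are ours, not a standard API):

```python
from html.parser import HTMLParser

# Meta names observed above: Springer and PLOS use citation_doi, PLOS also
# dc.identifier, D-Lib uses id="DOI", and the IEEE Computer Society name='doi'.
DOI_KEYS = {"citation_doi", "doi", "dc.identifier"}

class DOIMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.doi = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.doi:
            return
        a = dict(attrs)
        key = (a.get("name") or a.get("id") or "").lower()
        if key in DOI_KEYS:
            self.doi = a.get("content")

def find_doi(html):
    """Return the first DOI declared in a known <meta> convention, else None."""
    p = DOIMetaParser()
    p.feed(html)
    return p.doi

print(find_doi('<meta name="citation_doi" content="10.1007/s00799-016-0184-4"/>'))
# 10.1007/s00799-016-0184-4
```
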

To address these shortcomings, and in support of our larger vision of Signposting the Scholarly Web, we are proposing a new IANA link relation type, rel="identifier", that will support linking from the final URL in the redirection chain (a.k.a. the "locating URI") back to the persistent identifier that ideally one would use to start the resolution.  For example, in the NASA example above the PDF would link back to its handle with the proposed Link header:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT

Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier"
Content-Length: 984250
Content-Type: application/pdf
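
On the client side, discovering the persistent identifier would then reduce to parsing the Link header. A minimal sketch (not a full RFC 5988 parser -- it assumes no commas inside URIs -- and the helper name is ours):

```python
def parse_link_header(value):
    """Return a dict mapping each rel value in a Link: header to its
    target URI. A sketch: splits naively on commas and semicolons."""
    links = {}
    for part in value.split(","):
        segs = part.split(";")
        url = segs[0].strip().strip("<>")
        for seg in segs[1:]:
            k, _, v = seg.strip().partition("=")
            if k.strip() == "rel":
                # rel can carry several space-separated values
                for rel in v.strip().strip('"').split():
                    links[rel] = url
    return links

hdr = ('<http://hdl.handle.net/2060/19940023070>; rel="identifier", '
       '<http://go.nasa.gov/2fkvyya>; rel="shortlink"')
print(parse_link_header(hdr)["identifier"])
# http://hdl.handle.net/2060/19940023070
```
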

And in the Language Sciences example that we began with, the final HTTP response (which returns the HTML landing page) would use the Link header like this:

HTTP/1.1 200 OK
Last-Modified: Fri, 04 Nov 2016 00:36:50 GMT
Content-Type: text/html
X-TransKey: 11/03/2016 20:36:50 EDT#2847_006#2415#
X-Cnection: close
X-RE-Ref: 0 1478219810005195
Server: www.sciencedirect.com
Vary: Accept-Encoding, User-Agent
Expires: Fri, 04 Nov 2016 00:36:50 GMT
Cache-Control: max-age=0, no-cache, no-store

Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"

But it's not just the landing page that would link back to the DOI, but also the constituent resources that are also part of a DOI-identified object.  Below is a request and response for the PDF file in the Language Sciences example, and it carries the same Link: response header as the landing page:

$ curl -IL --silent "http://ac.els-cdn.com/S038800011400151X/1-s2.0-S038800011400151X-main.pdf?_tid=338820f0-a442-11e6-9f85-00000aab0f6b&acdnat=1478451672_5338d66f1f3bb88219cd780bc046bedf"
HTTP/1.1 200 OK
Accept-Ranges: bytes
Allow: GET
Content-Type: application/pdf
ETag: "047508b07a69416a9472c3ac02c5a9a01"
Last-Modified: Thu, 15 Oct 2015 08:11:25 GMT
Server: Apache-Coyote/1.1
X-ELS-Authentication: SDAKAMAI
X-ELS-ReqId: 67961728-708b-4cbb-af64-bb68f1da03ea
X-ELS-ResourceVersion: V1
X-ELS-ServerId: ip-10-93-46-150.els.vpc.local_CloudAttachmentRetrieval_prod
X-ELS-SIZE: 417655
X-ELS-Status: OK
Content-Length: 417655
Expires: Sun, 06 Nov 2016 16:59:44 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sun, 06 Nov 2016 16:59:44 GMT
Connection: keep-alive

Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"

At first glance, a number of existing rel types (some registered and some not) would seem to be suitable:
  • rel="canonical"
  • rel="alternate" 
  • rel="duplicate" 
  • rel="related"
  • rel="bookmark"
  • rel="permalink"
  • rel="shortlink"
It turns out they all do something different.  Below we explain why these rel types are not suitable for linking to persistent identifiers.

rel="canonical"

This would seem to be a likely candidate and it is widely used, but it actually exists for a different purpose: to "identify content that is either duplicative or a superset of the content at the context (referring) IRI." Quoting from RFC 6596:
If the preferred version of a IRI and its content exists at:

   http://www.example.com/page.php?item=purse

Then duplicate content IRIs such as:

   http://m.example.com/page.php?item=purse
   http://www.example.com/page.php?item=purse&category=bags
   http://www.example.com/page.php?item=purse&sid=1234

may designate the canonical link relation in HTML as specified in
[REC-html401-19991224]:

   <link rel="canonical"
         href="http://www.example.com/page.php?item=purse">
In the representative cases shown above, the DOI, handle, etc. is neither duplicative nor a superset of the content.  For example, the URI of the NASA report PDF clearly bears some relation to its handle, but the PDF URI is clearly not duplicative nor a superset of the handle.  This is reinforced by the semantics of the "303 See Other" redirection, which indicates there are two different resources with two different URIs**.  rel="canonical" is ultimately about establishing primacy among the (possibly) many URI aliases for a single resource.  For SEO purposes, this avoids splitting PageRank.

Furthermore, publishers like Springer are already using rel="canonical" to specify a preferred URI in their chain of redirects (note the Link header in the final 200 response):

$ curl -IL http://dx.doi.org/10.1007/978-3-319-43997-6_35
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
Expires: Mon, 31 Oct 2016 20:52:26 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 191
Date: Mon, 31 Oct 2016 20:40:48 GMT

HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=UTF-8
Location: http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
Server: Jetty(9.2.14.v20151106)
X-Environment: live
X-Origin-Server: 19t9ulj5bca
X-Vcap-Request-Id: 48d17c7e-2556-4cff-4b2b-0e6fbae94237
Content-Length: 0
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448562:07a49aef;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=d9cf189bedb640a9b5d55c9d0;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Link: <http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35>; rel="canonical"
Server: openresty
X-Environment: live
X-Origin-Server: 19ta3iq6v47
X-Served-By: core-internal.live.cf.private.springer.com
X-Ua-Compatible: IE=Edge,chrome=1
X-Vcap-Request-Id: 5a458b2c-de85-42cd-7157-022c440a9668
X-Vcap-Request-Id: 54b0e2dc-7766-4c00-4f95-d33bdb6c427a
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448766:c35e0847;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=1d67fdfb47ab4a5f94b43326e;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

And some publishers use it inconsistently.  In this Elsevier example, the content from http://dx.doi.org/10.1016/j.acra.2015.10.004 is indexed at three different URIs:

Even if we accept that the PubMed version is a different resource (i.e., hosted at NLM instead of Elsevier) and should have a separate URI, Elsevier still maintains two different URIs for this article:
http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract
http://www.sciencedirect.com/science/article/pii/S1076633215004535

The DOI resolves to the former URI (academicradiology.org), but it is the latter (sciencedirect.com) that specifies rel="canonical" in its HTML (and not in the HTTP response header):

<link rel="canonical" href="http://www.sciencedirect.com/science/article/pii/S1076633215004535">

Presumably this is to distinguish this URI from the various URIs that you get when starting with http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 instead of the DOI:

$ curl -iL --silent http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectPrefsPerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1076633215004535%3Fvia%253Dihub&key=07077ac16f0a77a870586ac94ad3c000cfa1973f
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535
HTTP/1.1 200 OK

In summary, although "canonical" seems promising at first, the semantics are different from what we propose and publishers are already using it for internal linking purposes.  This eliminates "canonical" from consideration. 

rel="alternate"

This rel type has been around for a while and has some reserved historical definitions for stylesheets and RSS/Atom, but the general semantics for "alternate" is to provide "an alternate representation of the current document."  In practice, this means surfacing different representations for the same resource, but varying in Content-type (e.g., application/pdf vs. text/html) and/or Content-Language (e.g., en vs. fr).  Since a DOI, for example, is not simply a different representation of the same resource, "alternate" is removed from consideration.

rel="duplicate"

RFC 6249 specifies how a resource can state that resources with different URIs are in fact byte-for-byte equivalent.  "duplicate" might be suitable for stating equivalence between the PDFs linked at both http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract and http://www.sciencedirect.com/science/article/pii/S1076633215004535, but we can't use it to link back to http://dx.doi.org/10.1016/j.acra.2015.10.004.

rel="related"

Defined in RFC 4287, "related" is probably the closest to what we propose, but its semantics are purposefully vague.  A DOI is certainly related to its locating URI, but it is also related to a lot of other resources: the other articles in a journal issue, other publications by the authors, citing articles, etc. Using "related" to link to DOIs could be ambiguous, and would eventually lead to parsing the linked URI for strings like "dx.doi.org", "handle.net", etc. -- not what we want to encourage.

rel="bookmark"

We initially hoped this could mean "when you press <control-D>, use this URI instead of the one in your address bar."  Unfortunately, "bookmark" is instead used to identify permalinks for different sections of the document that it appears in.  As a result, it's not even defined for Link: HTTP headers, and is thus eliminated from consideration.

rel="permalink"

It turns out that "permalink" was intended for what we thought "bookmark" would be used for, but although it was proposed, it was never registered nor did it gain significant traction ("bookmark" was used instead).  It is most closely associated with the historical problem of creating deep links within blogs and as such we choose not to resurrect it for persistent identifiers.

rel="shortlink"

We include this one mostly for completeness, since its semantics arguably provide the opposite of what we want: instead of a link to a persistent identifier, it allows linking to a shortened URI.  Despite its widespread use, it is actually not registered.

The ecosystem around persistent identifiers is fundamentally different than that of shortened URIs even though they may look similar to the untrained eye.  Putting aside the preservation nightmare scenario of bit.ly going out of business or Twitter deprecating t.co, "shortlink" could be used to complement "identifier".  Revisiting the NASA example from above, the two rel types could be combined to link to both the handle and the nasa.gov branded shortened URI:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT

Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
      <http://go.nasa.gov/2fkvyya>; rel="shortlink" 
Content-Length: 984250
Content-Type: application/pdf

Combining rel="identifier" with other Links

The "shortlink" example above illustrates that "identifier" can be combined with other rel types for more expressive responses.  Here we extend the NASA example further with rel="self":

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT

Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
 <http://go.nasa.gov/2fkvyya>; rel="shortlink",
 <http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf>; rel="self"
Content-Length: 984250
Content-Type: application/pdf

Now the HTTP response for the PDF is self-contained and unambiguously lists all of the appropriate URIs.  We could also combine rel="identifier" with version information.  arXiv.org does not issue DOIs or handles, but it does mint its own persistent identifiers.  Here we show, using rel types from RFC 5829, how version 1 of an eprint could link to version 2 (both the next and the current version) as well as to the persistent identifier (which we also know to be the "latest-version"):

$ curl -I https://arxiv.org/abs/1212.6177v1
HTTP/1.1 200 OK
Date: Fri, 04 Nov 2016 02:31:19 GMT
Server: Apache
ETag: "Tue, 08 Jan 2013 01:02:17 GMT"
Expires: Sat, 05 Nov 2016 00:00:00 GMT
Strict-Transport-Security: max-age=31536000
Set-Cookie: browser=; path=/; max-age=946080000; domain=.arxiv.org
Last-Modified: Tue, 08 Jan 2013 01:02:17 GMT

Link: <https://arxiv.org/abs/1212.6177>; rel="identifier latest-version",
      <https://arxiv.org/abs/1212.6177v2>; rel="successor-version",
      <https://arxiv.org/abs/1212.6177v1>; rel="self" 
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=utf-8
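
Note that the rel attribute can carry several space-separated values ("identifier latest-version" above), so a client checking for a relation type has to treat it as a list. A sketch (the has_rel helper is ours, and it assumes no commas inside URIs):

```python
def has_rel(link_value, rel):
    """True if any link in a Link: header carries `rel` among its
    (possibly space-separated) relation types. A sketch, not a
    full Link header parser."""
    for part in link_value.split(","):
        for param in part.split(";")[1:]:
            k, _, v = param.strip().partition("=")
            if k.strip() == "rel" and rel in v.strip().strip('"').split():
                return True
    return False

hdr = '<https://arxiv.org/abs/1212.6177>; rel="identifier latest-version"'
print(has_rel(hdr, "latest-version"))  # True
```
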

The Signposting web site has further examples of how rel="identifier" can be used to express the relationship between the persistent identifiers, the "landing page", the "publication resources" (e.g., the PDF, PPT), and the combination of both the landing page and publication resources.  We encourage you to explore the analyses of existing publishers (e.g., Nature) and repository systems (e.g., DSpace, Eprints).

In summary, we propose rel="identifier" to standardize linking to DOIs, handles, and other persistent identifiers.  HTML <meta> tags can't be used as headers in HTTP responses, and existing rel types such as "canonical" and "bookmark" have different semantics.

We welcome feedback about this proposal, which we intend to eventually standardize with an RFC and register with IANA. Herbert will cover these issues at PIDapalooza, and we will include the slides here after the conference.

2016-11-10 Edit: Herbert's PIDapalooza slides are now available:

--Michael & Herbert

* 2017-08-04 edit: Strictly speaking, a DOI by itself is not actually a URI (i.e., "doi:" is not a registered scheme with IANA) and there are various ways to turn DOIs into HTTP URIs (useful for dereferencing on the web) or info URIs (useful for when dereferencing is not desired).  Without loss of generality, web-based discussions typically assume the promotion of DOIs to HTTP URIs.  Common forms use the resolvers run by CNRI; historically this meant dx.doi.org:

$ curl -I http://dx.doi.org/10.1016/j.langsci.2014.12.003
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
Expires: Fri, 04 Aug 2017 19:52:21 GMT
Link: <https://api.elsevier.com/content/usage/doi/>; rel="dul"
Content-Type: text/html;charset=utf-8
Content-Length: 207
Date: Fri, 04 Aug 2017 19:21:50 GMT

I think now just doi.org is preferred:

$ curl -I http://doi.org/10.1016/j.langsci.2014.12.003
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
Expires: Fri, 04 Aug 2017 19:45:44 GMT
Link: <https://api.elsevier.com/content/usage/doi/>; rel="dul"
Content-Type: text/html;charset=utf-8
Content-Length: 207
Date: Fri, 04 Aug 2017 19:21:54 GMT

Even the following is possible (but not preferred):

$ curl -I http://hdl.handle.net/10.1016/j.langsci.2014.12.003
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
Expires: Fri, 04 Aug 2017 19:51:53 GMT
Link: <https://api.elsevier.com/content/usage/doi/>; rel="dul"
Content-Type: text/html;charset=utf-8
Content-Length: 207
Date: Fri, 04 Aug 2017 19:22:04 GMT
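
All three resolver forms can be generated mechanically from a bare DOI string; a trivial sketch (the helper name is ours):

```python
def doi_to_urls(doi):
    """Promote a bare DOI (e.g. "10.1016/j.langsci.2014.12.003") into
    dereferenceable HTTP URIs via the CNRI-run resolvers shown above."""
    return ["http://doi.org/" + doi,            # now the preferred form
            "http://dx.doi.org/" + doi,         # historical form
            "http://hdl.handle.net/" + doi]     # possible, but not preferred

print(doi_to_urls("10.1016/j.langsci.2014.12.003")[0])
# http://doi.org/10.1016/j.langsci.2014.12.003
```
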

** Technically, a DOI is a "digital identifier of an object" rather than "identifier of a digital object", and thus there is not a representation associated with the resource identified by a DOI (i.e., not an information resource).  Relationships like "canonical", "alternate", etc. only apply to information resources, and thus are not applicable to most persistent identifiers.  Interested readers are encouraged to further explore the HTTPRange-14 issue.