Sunday, January 15, 2017

2017-01-15: Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data"

Example: original URI vs. trusty URI
Based on the paper:

Kuhn, T., Dumontier, M.: Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. Proceedings of European Semantic Web Conference (ESWC) pp. 395–410 (2014).

A trusty URI is a URI that contains a cryptographic hash value of the content it identifies. The authors introduce this technique to make digital artifacts, especially those related to scholarly publications, immutable, verifiable, and permanent. Assuming that a trusty URI, once created, is linked from other resources or stored by a third party, it becomes possible to detect whether the content the trusty URI identifies has been tampered with or manipulated in transit (e.g., trusty URIs can help prevent man-in-the-middle attacks). In addition, trusty URIs make it possible to verify the content even if it is no longer available at the original URI but can still be retrieved from other locations, such as Google's cache or web archives (e.g., the Internet Archive).

The core contribution of this paper is the ability to create trusty URIs for different kinds of content. Two modules are proposed: in module F, the hash is calculated over the byte-level file content, while in module R, the hash is calculated over RDF graphs. The paper introduces an algorithm to generate the hash value for RDF graphs independent of any serialization syntax (e.g., N-Quads or TriX). Moreover, the authors investigated how trusty URIs work with structured documents (nanopublications). Nanopublications are small RDF graphs (named graphs, one of the main concepts of the Semantic Web) that describe scientific statements. A nanopublication is itself a named graph consisting of multiple named graphs: the "assertion" contains the actual scientific statement, such as "malaria is transmitted by mosquitos" in the example below; the "provenance" describes how the statement in the "assertion" was originally derived; and the "publication information" records details such as who created the nanopublication and when.

A nanopublication: basic elements from http://nanopub.org
Nanopublications may cite other nanopublications, resulting in a complex citation tree. Trusty URIs are designed not only to validate nanopublications individually but also to validate the whole citation tree. The nanopublication example shown below, about the statement "malaria is transmitted by mosquitos", is from the paper ("The anatomy of a nanopublication") and is in TriG format:

@prefix swan: <http://swan.mindinformatics.org/ontologies/1.2/pav.owl> .
@prefix cw: <http://conceptwiki.org/index.php/Concept>.
@prefix swp: <http://www.w3.org/2004/03/trix/swp-1/>.
@prefix : <http://www.example.org/thisDocument#> .

:G1 = { cw:malaria cw:isTransmittedBy cw:mosquitoes }
:G2 = { :G1 swan:importedBy cw:TextExtractor,
:G1 swan:createdOn "2009-09-03"^^xsd:date,
:G1 swan:authoredBy cw:BobSmith }
:G3 = { :G2 swp:assertedBy cw:SomeOrganization }

In addition to these two modules, the authors plan to define new modules for more types of content (e.g., hypertext/HTML) in the future.

The example below illustrates the general structure of trusty URIs:



The artifact code, everything after r1, is the part that makes this URI a trusty URI. The first character of this code (R) identifies the module; in the example, R indicates that this trusty URI was generated from an RDF graph. The second character (A) specifies the version of that module. The remaining characters (5..0) represent the hash value of the content. All hash values are generated with the SHA-256 algorithm. I think it would be more useful to allow users to select a preferred cryptographic hash function instead of enforcing a single one, although this might require adding characters to the artifact code to identify the selected hash function. The InterPlanetary File System (IPFS), for example, uses Multihash as a mechanism to prefix the resulting hash value with an identifier that maps to a particular hash function. Similar to trusty URIs, resources in the IPFS network are addressed by hash values calculated over their content. For instance, the first two characters "Qm" in the IPFS address "/ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V" indicate that SHA-256 is the hash function used to generate the hash value "ZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V".
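As a minimal illustration (not part of the paper or its libraries), an artifact code that appears as the final dot-separated component of a URI can be split into these three parts as follows; the example uses the nanopublication trusty URI discussed later in this post:

def split_artifact_code(trusty_uri):
    # The artifact code is everything after the last '.' in this example.
    artifact_code = trusty_uri.rsplit('.', 1)[-1]
    module = artifact_code[0]       # e.g., 'R' for RDF graphs, 'F' for byte-level files
    version = artifact_code[1]      # e.g., 'A'
    hash_chars = artifact_code[2:]  # encoded SHA-256 hash of the content
    return module, version, hash_chars

print(split_artifact_code("http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg"))
# ('R', 'A', 'q2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg')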

Here are some differences between the approach of using trusty URIs and other related ideas as mentioned in the paper:

  • Trusty URIs can be used to identify and verify resources on the web, while in systems like the Git version control system, hash values are used to verify commits within Git repositories only. The same applies to IPFS, where hashes in addresses (e.g., /ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V) are used to verify files within the IPFS network only.
  • Hashes in trusty URIs can be generated for different kinds of content, while in Git or ni-URIs, hash values are computed on the byte-level content only.
  • Trusty URIs support self-references (i.e., when trusty URIs are included in the content).

The same authors published a follow-up to their ESWC paper ("Making digital artifacts on the web verifiable and reliable") in which they describe in more detail how to generate trusty URIs for content of type RA (multiple RDF graphs) and type RB (a single RDF graph); RB was not included in the original paper. In addition, this more recent version graphically describes the structure of trusty URIs.

While calculating the hash value for content of type F (byte-level file content) is straightforward, multiple steps are required to calculate the hash value for content of type R (RDF graphs), such as converting any serialization (e.g., N-Quads or TriG) into RDF triples, sorting the triples lexicographically, serializing the graph into a single string, escaping newline characters as "\n", and handling self-references and blank nodes.
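The sketch below illustrates the general idea only; it is not the authors' algorithm and will not reproduce the library's artifact codes, since the real modules use a specific canonicalization and Base64 alphabet and additionally normalize self-references and blank nodes:

import base64
import hashlib

def simplified_file_code(path):
    # Module F idea: hash the raw bytes of the file.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    return "FA" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def simplified_rdf_code(quads):
    # Module R idea: sort the statements, serialize them into a single
    # string (escaping embedded newlines), and hash that string.
    lines = []
    for subj, pred, obj, graph in sorted(quads):
        lines.append(" ".join(t.replace("\n", "\\n") for t in (subj, pred, obj, graph)))
    canonical = "\n".join(lines) + "\n"
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return "RA" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")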

To evaluate their approach, the authors used the Java implementation to create trusty URIs for 156,026 small structured data files (nanopublications) in different serialization formats (N-Quads and TriX). When these files were checked, again using the Java implementation, all of them were successfully verified as matching their trusty URIs. In addition, they tested modified copies of these nanopublications. Results are shown in the figure below:


Examples of using trusty URIs:

[1] Trusty URI for byte-level content

Let's say that I have published my paper on the web at http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.pdf, and somebody links to it or saves the link somewhere. Now, if I intentionally (or not) change the content of the paper, for example, by modifying some statistics, adding a chart, correcting a typo, or even replacing the PDF with something completely different (read about content drift), anyone who downloads the paper after these changes by dereferencing the original URI will not be able to tell that the original content has been tampered with. Trusty URIs may solve this problem. For testing, I used trustyuri-python, the Python implementation, to generate the artifact code for the PDF file "tpdl-2015.pdf":

%python ProcessFile.py tpdl-2015.pdf

The file (tpdl-2015.pdf) is renamed to (tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf), containing the artifact code (FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao) as part of its name; in the paper, this is called a trusty file. Finally, I published this trusty file on the web at the trusty URI (http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf). Anyone with this trusty URI can verify the original content using the trustyuri-python library, for example:

%python CheckFile.py http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf
Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao

As you can see, the output "Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao" indicates that the hash value in the trusty URI is identical to the hash value of the content, which means that this resource contains the correct, intended content.

To see how the library detects changes in the original content available at http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf, I replaced all occurrences of the number "61" with the number "71" in the content. Here are the commands I used to apply these changes:

%pdftk tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf output tmp.pdf uncompress
%sed -i 's/61/71/g' tmp.pdf
%pdftk tmp.pdf output tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf compress

The figures below show the document before and after the changes:

Before changes
After changes
The library detected that the original resource has been changed:

$python CheckFile.py http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf
*** INCORRECT HASH ***

[2] Trusty URI for RDF content

I downloaded this nanopublication serialized in XML from "https://github.com/trustyuri/trustyuri-java/blob/master/src/main/resources/examples/nanopub1-pre.xml":




This nanopublication (RDF file) can be transformed into a trusty file using:

$python TransformRdf.py nanopub1-pre.xml http://trustyuri.net/examples/nanopub1

The Python script "TransformRdf.py" performs multiple steps to transform this RDF file into the trusty file "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml". As mentioned above, these steps include generating RDF triples, sorting those triples, handling self-references, etc. The Python library uses the second argument, "http://trustyuri.net/examples/nanopub1", treated as the original URI, to manage self-references by replacing all occurrences of "http://trustyuri.net/examples/nanopub1" with "http://trustyuri.net/examples/nanopub1. " in the original XML file. You may have noticed that this string ends with a '.' and a blank space. Once the artifact code is generated, the new trusty file "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml" is created. In this trusty file, all occurrences of "http://trustyuri.net/examples/nanopub1. " are replaced with "http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo". The trusty file is shown below:



To verify this trusty file, we can use the following command, which results in "Correct hash", meaning the content is verified to be correct. Again, to handle self-references, the Python library replaces all occurrences of "http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo" with "http://trustyuri.net/examples/nanopub1. " before recomputing the hash.

%python CheckFile.py nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg

Or, if the trusty file is published on the web, with the following command:

%python CheckFile.py http://www.cs.odu.edu/~maturban/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg
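The following sketch illustrates the self-reference handling described above; it is a simplification for illustration, not trustyuri-python's actual code (the real library also deals with URI fragments such as #pubinfo):

original_uri = "http://trustyuri.net/examples/nanopub1"
artifact_code = "RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg"

def strip_self_references(content):
    # Replace occurrences of the trusty URI with the original URI followed by
    # ". " so that the hashed content does not depend on its own hash value.
    return content.replace(original_uri + "." + artifact_code, original_uri + ". ")

with open("nanopub1." + artifact_code + ".xml") as f:
    preimage = strip_self_references(f.read())
# The hash is then recomputed over `preimage` and compared with the hash
# characters in `artifact_code`.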

What we are trying to do with trusty URIs:

We are working on a project, funded by the Andrew W. Mellon Foundation, to automatically capture and archive the scholarly record on the web. One part of this project is to come up with a mechanism by which we can verify the fixity of archived resources, to ensure that these resources have not been tampered with or corrupted. In general, we collect information about the archived resources and generate a manifest file. This file is then pushed to multiple archives so it can be used later. Herbert Van de Sompel, from Los Alamos National Laboratory, pointed us to the idea of using trusty URIs to identify and verify web resources. In this way, we have manifest files to verify archived resources and trusty URIs to verify these manifests.

Resources:

    --Mohamed Aturban


    Sunday, January 8, 2017

    2017-01-08: Review of WS-DL's 2016

    Sawood and Mat show off the InterPlanetary Wayback poster at JCDL 2016

    The Web Science and Digital Libraries Research Group had a productive 2016, with two Ph.D. students and one M.S. student graduating, one large research grant awarded ($830k), 16 publications, and 15 trips to conferences, workshops, hackathons, etc.

    For student graduations, we had:
    Other student advancements:
    We had 16 publications in 2016:

     
    In late April, we had Herbert, Harish Shankar, and Shawn Jones visit from LANL.  Herbert has been here many times, but this was the first visit to Norfolk for Harish.  It was also on this visit that Shawn took his breadth exam.


    In addition to the fun road trip to JCDL 2016 in New Jersey (which included beers on the Cape May-Lewes Ferry!), our group traveled to:
    WS-DL at JCDL 2016 Reception in Newark, NJ
    Alex shows off his poster at JCDL 2016
    Although we did not travel to San Francisco for the 20th Anniversary of the Internet Archive, we did celebrate locally with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving.  This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). We write plenty of papers, blog posts, etc. about technical issues and the mechanics of web archiving, but I'm especially proud of how we were able to assemble a wide array of personal stories about the social impact of web archiving.  I encourage you to take the time to go through these posts:


    We had only one popular press story about our research this year, with Tech.Co's "You Can’t Trust the Internet to Continue Existing" citing Hany SalahEldeen's 2012 TPDL paper about the rate of loss of resources shared via Twitter.

    We released several software packages and data sets in 2016:
    In April we were extremely fortunate to receive a major research award, along with Herbert Van de Sompel at LANL, from the Andrew Mellon Foundation:
    This project will address a number of areas, including: Signposting, automated assessment of web archiving quality, verification of archival integrity, and automating the archiving of non-journal scholarly output.  We will soon be releasing several research outputs as a result of this grant.

    WS-DL reviews are also available for 2015, 2014, and 2013.  We're happy to have graduated Greg, Yasmin, and Justin; and we're hoping that we can get Erika back for a PhD after her MS is completed.  I'll close with celebratory images of me (one dignified, one less so...) with Dr. AlNoamany and Dr. Brunelle; may 2017 bring similarly joyous and proud moments.

    --Michael



    Saturday, January 7, 2017

    2017-01-07: Two WS-DL Classes Offered for Spring 2017



    Two WS-DL classes are offered for Spring 2017:

    Information Visualization is being offered both online (CRNs 26614/26617 (HR), 26615/26618  (VA), 26616/26619 (US)) and on-campus (CRN 24698/24699).  Web Science is offered on-campus only (CRNs 25728/25729).  Although it's not a WS-DL course per se, WS-DL member Corren McCoy is also teaching CS 462/562 Cybersecurity Fundamentals this semester (see this F15 offering from Dr. Weigle for an idea about its contents).

    --Michael

    Tuesday, December 20, 2016

    2016-12-20: Archiving Pages with Embedded Tweets

    I'm from Louisiana and used Archive-It to build a collection of webpages about the September flood there (https://www.archive-it.org/collections/7760/).

    One of the pages I came across, Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy', highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page (click for full page):

    Live page, screenshot generated on Sep 9, 2016

    To me, the most important resources here were the tweets and their pictures, so I'll focus here on how well they were archived.

    First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on Twitter’s servers which converts the <blockquote> into a fully-rendered Tweet."

    Here's the first embedded tweet (https://twitter.com/vernonernst/status/765398679649943552), with a picture of a long line of trucks pulling their boats to join the Cajun Navy.
    First embedded tweet - live web

    Here's the source for this embedded tweet:
    <blockquote class="twitter-tweet" data-width="500"> <p lang="en" dir="ltr">
    <a target="_blank" href="https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a> </p> — Vernon Ernst (@vernonernst) <a href="https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a> </blockquote>
    <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

    When the widgets.js script executes in the browser, it transforms the <blockquote class="twitter-tweet"> element into a <twitterwidget>:
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552">

    Now, let's consider how the various archives handle this.

    Archive-It

    Since I'd been using Archive-It to create the collection, that was the first tool I used to capture the page. Archive-It uses the Internet Archive's Heritrix crawler and Wayback Machine for replay. I set the crawler to archive the page and embedded resources, but not to follow links. No special scoping rules were applied.

    http://wayback.archive-it.org/7760/20160818180453/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/
    Archive-It, captured on Aug 18, 2016
    Here's how the first embedded tweet displayed in Archive-It:
    Embedded tweet as displayed in Archive-It


    Here's the source (as rendered in the DOM) upon playback in Archive-It's Wayback Machine:
    <blockquote class="twitter-tweet twitter-tweet-error" data-conversation="none" data-width="500" data-twitter-extracted-i1479916001246582557="true">
    <p lang="en" dir="ltr"> <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/CajunNavy?src=hash" target="_blank" rel="external nofollow">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/LouisianaFlood?src=hash" target="_blank" rel="external nofollow">#LouisianaFlood</a> <a href="http://wayback.archive-it.org/7760/20160818180453/https://t.co/HaugQ7Jvgg" target="_blank" rel="external nofollow">pic.twitter.com/HaugQ7Jvgg</a> </p> <p>— Vernon Ernst (@vernonernst) <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/vernonernst/status/765398679649943552" target="_blank" rel="external nofollow">August 16, 2016</a></p></blockquote>
    <p> <script async="" 
    src="//wayback.archive-it.org/7760/20160818180453js_/http://platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script> </p>

    Except for the links being re-written to point to the archive, this is the same as the original embed source, rather than the transformed version.  Upon playback, although widgets.js was archived (http://wayback.archive-it.org/7760/20160818180456js_/http://platform.twitter.com/widgets.js?4fad35), it is not able to modify the DOM as it does on the live web (widgets.js loads additional JavaScript that was not archived).

    webrecorder.io

    Next up is the on-demand service, webrecorder.io. Webrecorder.io is able to replay the embedded tweets as on the live web.

    https://webrecorder.io/mweigle/south-louisiana-flood---2016/20160909144135/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/

    Webrecorder.io, viewed Sep 29, 2016

    The HTML source (https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/) looks similar to the original embed (except for re-written links):
    <blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr"><a target="_blank" href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world!  <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a></p>&mdash; Vernon Ernst (@vernonernst) <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote>
    <script async src="//wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

    Upon playback, we see that webrecorder.io is able to fully execute the widgets.js script, so the transformed HTML looks like the live web (with the inserted <twitterwidget> element):
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552"></twitterwidget>
    <script async="" src="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

    Note that widgets.js is archived and is loaded from webrecorder.io, not the live web.

    archive.is

    archive.is is another on-demand archiving service.  As with webrecorder.io, the embedded tweets are shown as on the live web.

    http://archive.is/5JcKx
    archive.is, captured Sep 9, 2016

    archive.is executes and then flattens JavaScript, so although the embedded tweet looks similar to how it's rendered in webrecorder.io and on the live web, the source is completely different:
    <article style="direction:ltr;display:block;">
    ...

    <a href="https://archive.is/o/5JcKx/twitter.com/vernonernst/status/765398679649943552/photo/1" style="color:rgb(43, 123, 185);text-decoration:none;display:block;position:absolute;top:0px;left:0px;width:100%;height:328px;line-height:0;background-color: rgb(255, 255, 255); outline: invert none 0px; "><img alt="View image on Twitter" src="http://archive.is/5JcKx/fc15e4b873d8a1977fbd6b959c166d7b4ea75d9d" title="View image on Twitter" style="width:438px;max-width:100%;max-height:100%;line-height:0;height:auto;border-width: 0px; border-style: none; border-color: white; "></a>
    ...

    </article>
    ...
    <blockquote cite="https://twitter.com/vernonernst/status/765398679649943552" style="list-style: none outside none; border-width: medium; border-style: none; margin: 0px; padding: 0px; border-color: white; ">
    ...

    <span>#</span><span>CajunNavy</span></a>
    on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://archive.is/o/5JcKx/https://twitter.com/hashtag/LouisianaFlood?src=hash" style="direction:ltr;background-color: transparent; color:rgb(43, 123, 185);text-decoration:none;outline: invert none 0px; "><span>#</span><span>LouisianaFlood</span></a>
    </p>
    ...
    </blockquote>


    WARCreate

    WARCreate is a Google Chrome extension that our group developed to allow users to archive the page they are currently viewing in their browser.  It was last actively updated in 2014, though we are currently working on updates to be released in 2017.

    The image below shows the result of the page being captured with WARCreate and replayed in webarchiveplayer:

    WARCreate, captured Sep 9, 2016, replayed in webarchiveplayer
    Upon replay, WARCreate is not able to display the tweet at all.  Here's the close-up of where the tweets should be:

    WARCreate capture replayed in webarchiveplayer, with tweets missing
    Examining both the WARC file and the source of the archived page helps to explain what's happening.

    Inside the WARC, we see:
    <h4>In stepped a group known as the <E2><80><9C>Cajun Navy<E2><80><9D>:</h4>
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
    <p><script async="" src="//platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>


    This is the same markup that's in the DOM upon replay in webarchiveplayer, except for the script source being written to localhost:
    <h4>In stepped a group known as the “Cajun Navy”:</h4>
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
    <p><script async="" src="//localhost:8090/20160822124810js_///platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>


    WARCreate captures the HTML after the page has fully loaded.  So what's happening here is that the page loads, widgets.js is executed, the DOM is changed (thus the <twitterwidget> tag), and then WARCreate saves the transformed HTML. But what we don't get is the widgets.js script needed to properly display <twitterwidget>. Our expectation is that with fixes to allow WARCreate to archive the loaded JavaScript, the embedded tweet would be displayed as on the live web.

    Discussion
     
    Each of these four archiving tools operates on the embedded tweet in a different way, highlighting the complexities of archiving asynchronously loaded JavaScript and DOM transformations.
    • Archive-It (Heritrix/Wayback) - archives the HTML returned in the HTTP response and JavaScript loaded from the HTML
    • Webrecorder.io - archives the HTML returned in the HTTP response, JavaScript loaded from the HTML, and JavaScript loaded after execution in the browser
    • Archive.is - fully loads the webpage, executes JavaScript, rewrites the resulting HTML, and archives the rewritten HTML
    • WARCreate - fully loads the webpage, executes JavaScript, and archives the transformed HTML
    It is useful to examine how different archiving tools and playback engines render complex webpages, especially those that contain embedded media.  Our forthcoming update to the Archival Acid Test will include tests for embedded content replay.

    -Michele

    Monday, November 21, 2016

    2016-11-21: WS-DL Celebration of #IA20



    The Web Science & Digital Library Research Group celebrated the 20th Anniversary of the Internet Archive with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving.  This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). 

    Normally our group posts about research developments and technical analysis of web archiving, but for #IA20 we had members of our group write mostly non-technical stories drawn from personal experiences and interests that are made possible by web archiving.  We are often asked "Why archive the web?" and we hope these blog posts will help provide you with some answers.
    We've collected these links and more material related to #IA20 in both a Storify story and a Twitter moment; we hope you can take the time to explore them further.  We'd like to thank everyone at the Internet Archive for 20 years of yeoman's work, the many other archives that have come on-line more recently, and all of the WS-DL members who made the time to provide their personal stories about the impacts and opportunities of web archiving.

    --Michael

    Wednesday, November 16, 2016

    2016-11-16: Reminiscing About The Days of Cyber War Between Indonesia and Australia


    Image is taken from Wikipedia

    Indonesia and Australia are neighboring countries that, like many neighbors, have a hot-and-cold relationship. History has recorded a number of disputes between Indonesia and Australia, from the East Timor disintegration (now Timor Leste) in 1999 to the Bali Nine case (the execution of Australian drug smugglers) in 2015. One of the issues that really caused a stir in Indonesia-Australia relations is the spying imbroglio conducted by Australia toward Indonesia. The tension arose when an Australian newspaper, The Sydney Morning Herald, published an article titled Exposed: Australia's Asia spy network and a video titled Spying at Australian diplomatic facilities on October 31st, 2013. They revealed one of Edward Snowden's leaks: that Australia had been spying on Indonesia since 1999. This startling fact enraged Indonesia's government and, most definitely, the people of Indonesia.

    Indonesia strongly demanded clarification and an explanation by summoning Australia's ambassador, Greg Moriarty. Indonesia also demanded that Australia apologize, but Australia refused, arguing that this is something every government does to protect its country. The situation became more serious when it was also divulged that an Australian security agency had attempted to listen in on Indonesian President Susilo Bambang Yudhoyono's cell phone in 2009. Yet Tony Abbott, Australia's prime minister at the time, still refused to give either an explanation or an apology. This caused President Yudhoyono to accuse Tony Abbott of 'belittling' Indonesia's response to the issue. All of this made the already enraged Indonesian public even more furious. Furthermore, Indonesians judged that their government was too slow in following up on and responding to the issue.

    Image is taken from The Australian

    To channel their frustration and anger, a group of Indonesian hacktivists named 'Anonymous Indonesia' launched a number of attacks against hundreds of randomly chosen Australian websites. They hacked and defaced those websites to spread the message 'stop spying on Indonesia'. Over 170 Australian websites were hacked during November 2013, some of them government websites such as those of the Australian Secret Intelligence Service (ASIS), the Australian Security Intelligence Organisation (ASIO), and the Department of Foreign Affairs and Trade (DFAT).

    Australian hackers also took revenge by attacking several important Indonesian websites, such as those of the Ministry of Law and Human Rights and Indonesia's national airline, Garuda Indonesia. However, the number of attacked websites was not as large as the number attacked by the Indonesians. These websites have since recovered and look as if the attacks never happened. Fortunately, those who never heard of this spying row can use the Internet Archive to go back in time and see how those websites looked when they were attacked. Unfortunately, not all of the attacked websites have archives for November 2013. For example, according to the Sydney Morning Herald and the Australian Broadcasting Corporation, the ASIS website was hacked on November 11, 2013. The Australian newspaper also reported that the ASIO website was hacked on November 13, 2013. But these incidents were not archived by the Internet Archive, as we cannot see any snapshots for the given dates:

    https://web.archive.org/web/20130101000000*/http://asis.gov.au

    https://web.archive.org/web/20130101000000*/http://asio.gov.au


    However, we are lucky enough to have sufficient examples to give us a clear idea of the cyber war that once took place between Indonesia and Australia.

    http://web.archive.org/web/20130520072344/http://australianprovidores.com.au

    http://web.archive.org/web/20131106225110/http://www.danzaco.com.au/

    http://web.archive.org/web/20131112141017/http://defence.gov.au/

    http://web.archive.org/web/20131107064017/http://dmresearch.com.au

    http://web.archive.org/web/20131109094537/http://www.flufferzcarwashcafe.com.au/

    http://web.archive.org/web/20131105222138/http://smartwiredhomes.com.au

                       - Erika (@erikaris)-

    2016-11-16: Introducing the Local Memory Project

    Collage made from screenshot of local news websites across the US
    The national news media has different priorities than the local news media. If one seeks to build a collection about local events, the national news media may be insufficient, with the exception of local news which “bubbles” up to the national news media. Irrespective of this “bubbling” of some local news to the national surface, the perspective and reporting of national news differ from those of local news for the same events. Also, it is well known that big multinational news organizations routinely cite the reports of smaller local news organizations for many stories. Consequently, local news media is fundamental to journalism.



    It is important to consult local sources affected by local events, hence the need for a system that helps small communities build collections of web resources from local sources for important local events. The need for such a system was first (to the best of my knowledge) outlined by Harvard LIL. Given Harvard LIL's interest in helping facilitate participatory archiving by local communities and libraries, and our IMLS-funded interest in building collections for stories and events, my summer fellowship at Harvard LIL provided a good opportunity to collaborate on the Local Memory Project.

    Our goal is to provide a suite of tools under the umbrella of the Local Memory Project to help users and small communities discover, collect, build, archive, and share collections of stories for important local events from local sources.

    Local Memory Project dataset

    We currently have a public JSON dataset of US media outlets, scraped from USNPL, consisting of:
    • 5,992 Newspapers 
    • 1,061 TV stations, and 
    • 2,539 Radio stations
    The dataset structure is documented and comprises the media outlet's website; Twitter, Facebook, and YouTube links; RSS/OpenSearch links; and the geo-coordinates of the cities or counties in which the local media organizations reside. I strongly believe this dataset could be valuable to the media research community.

    There are currently 3 services offered by the Local Memory Project:

    1. Local Memory Project - Google Chrome extension:

    This service is an implementation of Adam Ziegler and Anastasia Aizman's idea for a utility that helps one build a collection for a local event which did not receive national coverage. Consequently, given a story (expressed as a query) and a place (represented by a zip code), the Google Chrome extension performs the following operations:
    1. Retrieve a list of local news (newspaper and TV station) websites that serve the zip code
    2. Search Google for stories about the query from each of the local news websites retrieved in step 1
    The result is a collection of stories about the query from local news sources.

    For example, given the problem of building a collection for Zika virus for Miami Florida, we issue the following inputs (Figure 1) to the Google Chrome Extension and click "Submit":
    Figure 1: Google Chrome Extension, input for building a collection about Zika virus for Miami FL
    After the submit button is pressed, the application issues the "zika virus" query to Google with the site directive for the newspapers and TV stations serving the 33101 area.
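    For illustration only, such site-scoped queries could look like the sketch below; the domain names are hypothetical placeholders, not the extension's actual output:

query = "zika virus"
# Hypothetical local news domains returned for zip code 33101:
local_sources = ["example-miami-newspaper.com", "example-miami-tv.com"]

for domain in local_sources:
    print("{} site:{}".format(query, domain))
# zika virus site:example-miami-newspaper.com
# zika virus site:example-miami-tv.com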

    Figure 2: Google Chrome Extension, search in progress. Current search in image targets stories about Zika virus from Miami Times
    After the search, the result (Figure 3) was saved remotely.
    Figure 3: A subset (see complete) of the collection about Zika virus built for the Miami FL area.
    Here are examples of other collections built with the Google Chrome Extension (Figures 4 and 5):
    Figure 4: A subset (see complete) of the collection about Simone Biles' return for Houston Texas
    Figure 5: A subset (see complete) of the collection about Protesters and Police for Norfolk Virginia
    The Google Chrome extension also offers customized settings that suit different collection building needs:
    Figure 6: Google Chrome Extension Settings (Part 1)
    Figure 7: Google Chrome Extension Settings (Part 2)
    1. Google max pages: The number of Google search pages to visit for each news source. Increase if you want to explore more Google pages since the default value is 1 page.
    2. Google Page load delay (seconds): This time delay between loading Google search pages ensures a throttled request.
    3. Google Search FROM date: Filter your search for news articles crawled from this date. This comes in handy if a query spans multiple time periods, but the curator is interested in a definite time period.
    4. Google Search TO date: Filter your search for news articles before this date. This comes in handy especially when combined with 3, it can be used to collect documents within a start and end time window.
    5. Archive Page load delay (seconds): Time delay between loading pages to be archived. You can increase this time if you want to have the chance to do something (such as hit archive again) before the next archived page loads automatically. This is tailored to archive.is.
    6. Download type: Download the collection to your machine (in JSON or TXT format) for a personal copy. But if you choose to share, save remotely (you should!)
    7. Collection filename: Custom filename for collection about to be saved.
    8. Collection name: Custom name for your collection. It's good practice to label collections.
    9. Upload a saved collection (.json): For json collections saved locally, you may upload them to revisualize the collection.
    10. Show Thumbnail: A flag that decides whether to send a remote request to get a card (thumbnail summary) for the link. Since cards require multiple GET requests, you may choose to switch this off if you have a large collection.
    11. Google news: The default search of the extension is the generic Google search page. Check this box to search the Google News vertical instead.
    12. Add website to existing collection: Add a website to an existing collection.
    2. Local Memory Project - Geo service:

    The Google Chrome extension utilizes the Geo service to find media sources that serve a zip code. This service is an implementation of Dr. Michael Nelson's idea for a service that supplies an ordered list of media outlets based on their proximity to a user-specified zip code.

    Figure 8: List of top 10 Newspapers, Radio and TV station closest to zip code 23529 (Norfolk, VA)

    3. Local Memory Project - API:



    The Local Memory Project Geo website is meant for human users, while the API website targets machine users. Therefore, it provides the same services as the Geo website but returns JSON output (as opposed to HTML). For example, below is a subset of the output (see complete) corresponding to a request for 10 news media sites in order of proximity to Cambridge, MA.
    {
      "Lat": 42.379146, 
      "Long": -71.12803, 
      "city": "Cambridge", 
      "collection": [
        {
          "Facebook": "https://www.facebook.com/CambridgeChronicle", 
          "Twitter": "http://www.twitter.com/cambridgechron", 
          "Video": "http://www.youtube.com/user/cambchron", 
          "cityCountyName": "Cambridge", 
          "cityCountyNameLat": 42.379146, 
          "cityCountyNameLong": -71.12803, 
          "country": "USA", 
          "miles": 0.0, 
          "name": "Cambridge Chronicle", 
          "openSearch": [], 
          "rss": [], 
          "state": "MA", 
          "type": "Newspaper - cityCounty", 
          "website": "http://cambridge.wickedlocal.com/"
        }, 
        {
          "Facebook": "https://www.facebook.com/pages/WHRB-953FM/369941405267", 
          "Twitter": "http://www.twitter.com/WHRB", 
          "Video": "http://www.youtube.com/user/WHRBsportsFM", 
          "cityCountyName": "Cambridge", 
          "cityCountyNameLat": 42.379146, 
          "cityCountyNameLong": -71.12803, 
          "country": "USA", 
          "miles": 0.0, 
          "name": "WHRB 95.3 FM", 
          "openSearch": [], 
          "rss": [], 
          "state": "MA", 
          "type": "Radio - Harvard Radio", 
          "website": "http://www.whrb.org/"
        }, ...
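
    As a small usage sketch (not part of the project's code), a client could sort the returned outlets by distance and list their websites; the two entries below are copied from the response shown above:

response = {
    "city": "Cambridge",
    "collection": [
        {"name": "Cambridge Chronicle", "miles": 0.0,
         "website": "http://cambridge.wickedlocal.com/"},
        {"name": "WHRB 95.3 FM", "miles": 0.0,
         "website": "http://www.whrb.org/"},
    ],
}

for outlet in sorted(response["collection"], key=lambda o: o["miles"]):
    print("{:<22} {:>5.1f} mi  {}".format(outlet["name"], outlet["miles"], outlet["website"]))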
    


    Saving a collection built with the Google Chrome Extension

    A collection built on a user's machine can be saved in one of two ways:
    1. Save locally: this serves as a way to keep a collection private. Saving can be done by clicking "Download collection" in the Generic settings section of the extension settings. A collection can be saved in json or plaintext format. The json format permits the collection to be reloaded through "upload a saved collection" in the Generic settings section of the extension settings. The plaintext format does not permit reloading into the extension, but contains all the links which make up the collection.
    2. Save remotely: in order to be able to share the collection you built locally with the world, you need to save remotely by clicking the "Save remotely" button on the frontpage of the application. This leads to a dialog requesting a mandatory unique collection author name (if one doesn't exist) and an optional collection name (Figure 10). After supplying the inputs the application saves the collection remotely and the user is presented with a link to the collection (Figure 11).
    Before a collection is saved locally or remotely, you may choose to exclude an entire news source (all links from a given source) or a single news source as described by Figure 9:
    Figure 9: Exclusion options before saving locally/remotely
    Figure 10: Saving a collection prompts a dialog requesting a mandatory unique collection author name and an optional collection name
    Figure 11: A link is presented after a collection is saved remotely

    Archiving a collection built with the Google Chrome Extension



    Saving is the first step to make a collection persist after it is built. However, archiving ensures that the links referenced in a collection persist even if the content is moved or deleted. Our application currently integrates archiving via Archive.is, but we plan to expand the archiving capability to include other public web archives.

    In order to archive your collection, click the "Archive collection" button on the frontpage of the application. This leads to a dialog similar to the saving dialog which requests a mandatory unique collection author name (if one doesn't exist) and an optional collection name. Subsequently, the application archives the collection by first archiving the front page which contains all the local news sources, and secondly, the application archives the individual links which make up the collection (Figure 12). You may choose to stop the archiving operation at any time by clicking "Stop" on the archiving update orange-colored message bar. At the end of the archiving process, you get a short URI corresponding to the archived collection (Figure 13).
    Figure 12: Archiving in progress
    Figure 13: When the archiving is complete, a short link pointing to the archived collection is presented

    Community collection building with the Google Chrome Extension


    We envision a community of users contributing to a single collection for a story. Even though collections are built in isolation, we consider a situation in which we can group collections around a single theme. To begin this process, the Google Chrome Extension lets you share a locally built collection on Twitter by clicking the "Tweet" button (Figure 14).
    Figure 14: Tweet button enables sharing the collection

    This means that if user 1 and user 2 locally build collections for Hurricane Hermine, they may use the hashtags #localmemory and #hurricanehermine when sharing the collections. Consequently, all Hurricane Hermine-related collections can be found on Twitter via these hashtags. We encourage users to include #localmemory and the collection hashtags in tweets when sharing collections. We also encourage you to follow the Local Memory Project on Twitter.
    The local news media is a vital organ of journalism, but one in decline. We hope that by providing free and open source tools for collection building, we can contribute in some capacity to its revival.

    I am thankful to everyone who has contributed to the ongoing success of this project. From Adam, Anastasia, Matt, Jack, and the rest of the Harvard LIL team, to my supervisor Dr. Nelson and Dr. Weigle, and Christie Moffat at the National Library of Medicine, as well as Sawood and Mat and the rest of my colleagues at WSDL, thank you.
    --Nwala