Tuesday, September 10, 2019

2019-09-10: Where did the archive go? Part 2: National Library of Ireland


In the previous post, we provided some details about changes in the Library and Archives Canada web archive. After they upgraded their web archive replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In part 2 of this four-part series, we focus on the movement of a collection from the National Library of Ireland (NLI).


In May 2018, we discovered that 979 mementos from the NLI collection that were originally archived at the European Archive (europarchive.org) had been moved to the Internet Memory Foundation archive (internetmemory.org). Then in September 2018, we found that the collection of mementos had been moved to Archive-It (archive-it.org). We found that 192 of the 979 mementos cannot be found in Archive-It (i.e., they are missing mementos).

For example, the memento from the European Archive:


has been moved to the Internet Memory Foundation (IMF) archive at:

http://collections.internetmemory.org/nli/20141013204117/http://www.defense.gov/

before it ended up at Archive-It:

http://wayback.archive-it.org/10702/20141013204117/http://www.defense.gov/

The representations of the three mementos are illustrated in the figure below.



There were no changes in the 979 mementos (other than their URIs) when they moved from the European Archive to the IMF archive (even the archival banner remained the same, as the figure above shows), but we found some significant changes upon the move from IMF to Archive-It, which are the focus of this post.

We refer to the archive from which mementos were moved (i.e., internetmemory.org) as the "original archive", and we use the "new archive" to refer to the archive to which the mementos were moved (i.e., archive-it.org). A memento is identified by a URI-M as defined in the Memento framework.
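The relationship between a URI-M in the original archive and its counterpart in the new archive can be sketched in a few lines of Python. This is a minimal sketch that assumes, based on the example URI-Ms above, that only the collection prefix changes while the 14-digit timestamp and the URI-R are kept. The function name `to_new_uri_m` is hypothetical, and as the observations below show, the resulting URI-M is only a candidate: the new archive may redirect it elsewhere or not hold it at all.

```python
from typing import Optional

# Prefixes taken from the example URI-Ms in this post.
IMF_PREFIX = "http://collections.internetmemory.org/nli/"
ARCHIVE_IT_PREFIX = "http://wayback.archive-it.org/10702/"


def to_new_uri_m(original_uri_m: str) -> Optional[str]:
    """Map an IMF URI-M to the candidate Archive-It URI-M.

    Returns None when the input does not start with the IMF prefix.
    """
    if not original_uri_m.startswith(IMF_PREFIX):
        return None
    # Keep everything after the prefix: <14-digit timestamp>/<URI-R>
    return ARCHIVE_IT_PREFIX + original_uri_m[len(IMF_PREFIX):]


print(to_new_uri_m(
    "http://collections.internetmemory.org/nli/20141013204117/http://www.defense.gov/"
))
```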

Our observations about changes in the NLI collection (from IMF to Archive-It) are as follows:

Observation 1: The functionality of the original archival banner is gone

Users of the European Archive and IMF were able to navigate through available mementos via the custom archival banner (marked in red in the top two screenshots in the figure above). Via this banner, the original archive allowed users to view the available mementos and the representation of a selected memento in the same page. Archive-It, on the other hand, uses the standard playback banner (marked in red in the bottom screenshot in the figure above). The new archive's banner informs users that they are viewing an "archived" web page and contains multiple links. One of the links leads to a web page in archive-it.org that shows all available mementos in the archive, as shown in the figure below:





Observation 2: The original archive is no longer reachable

After mementos were moved from internetmemory.org, the archive became unreachable as the following cURL session shows:

$ date
Tue May 21 08:03:51 EDT 2019

$ curl http://www.internetmemory.org
curl: (7) Failed to connect to www.internetmemory.org port 80: Operation timed out



In addition to IMF, the European Archive (europarchive.org) is also no longer maintained: it was shut down, and the domain name was purchased by another entity and now serves spam.



The movement of mementos from these two archives will affect link integrity across web resources that contain links to mementos from the European Archive or IMF. As mentioned in the previous post, there are actions that original archives can perform to maintain link integrity via "follow-your-nose" from the old URI-Ms to the corresponding URI-Ms in the new archive.


For example, the Library and Archives Canada archive changed its domain name from collectionscanada.gc.ca to webarchive.bac-lac.gc.ca (described in part 1), and because the archive still controls the original domain name collectionscanada.gc.ca, it could (even though it currently does not) redirect requests for URI-Ms in collectionscanada.gc.ca to the new archive at webarchive.bac-lac.gc.ca. For instance, if the original archive uses the Apache web server, a mod_rewrite rule can be used to perform automatic redirects:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]
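To make the effect of the rule above concrete, here is a rough Python equivalent of the mapping it performs. It is a sketch only: it reuses the pattern and target from the rule, the function name `redirect_target` is hypothetical, and the timestamp and URI-R in the usage example are made up for illustration. A real server would also send the result in the Location header of a 301 response.

```python
import re
from typing import Optional

# Same pattern and target as the mod_rewrite rule above.
OLD_PATH = re.compile(r"^/webarchives/(\d{14})/(.+)")
NEW_BASE = "http://webarchive.bac-lac.gc.ca:8080/wayback/"


def redirect_target(request_path: str) -> Optional[str]:
    """Return the new URI-M for an old request path, or None if the rule does not apply."""
    match = OLD_PATH.match(request_path)
    if match is None:
        return None
    timestamp, uri_r = match.groups()
    return f"{NEW_BASE}{timestamp}/{uri_r}"


# Hypothetical request path (timestamp and URI-R are illustrative only).
print(redirect_target("/webarchives/20060208075019/http://example.com/"))
# A path without a 14-digit timestamp is left alone.
print(redirect_target("/webarchives/2006/http://example.com/"))
```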

But this approach is impractical in the case of the European Archive and IMF because:

  • The archives no longer exist, so there is no maintained web server available to issue the redirects.
  • Even if it still existed, the archive might decide not to issue redirects for former customers in order to increase lock-in.

In the upcoming post, we will describe the movement of mementos from the Public Record Office of Northern Ireland (PRONI) web archive to Archive-It. PRONI still controls and maintains the original domain name webarchive.proni.gov.uk, so it is possible for PRONI to issue redirects to the new URI-Ms in Archive-It.

Observation 3: Not all mementos are available in the new archive

As defined in the previous post, a memento is missing when at least one of the Memento-Datetime, the URI-R, or the final HTTP status code of the memento from the original archive is not identical to the value of the corresponding memento from the new archive. In this study, we found 192 missing mementos (out of 979) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. We give two examples of missing mementos. The first example shows a memento that cannot be found in the new archive with the same Memento-Datetime as it had in the original archive. When requesting the URI-M:

http://collections.internetmemory.org/nli/20121221162201/http://bbc.co.uk/news/

from the original archive (internetmemory.org) on September 03, 2018, the archive responded with "200 OK" with the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Fri, 21 Dec 2012 16:22:01 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/


from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/10702/20121221163248/http://www.bbc.co.uk/news/

Although the representations of both mementos are identical (except for the archival banners), as the figure below shows, we consider the memento from the original archive as missing because the two mementos have different Memento-Datetime values (Fri, 21 Dec 2012 16:32:48 GMT in the new archive), a delta of about 10 minutes. Even though the 10-minute delta might not be semantically significant (apparently just a change in the canonicalization of the URI-R, with bbc.co.uk redirecting to www.bbc.co.uk), we do not consider the mementos to be the same because their Memento-Datetime values are not identical.

$ curl --head --location --silent http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"
HTTP/1.1 302 Found
Location: /10702/20121221163248/http://www.bbc.co.uk/news/
HTTP/1.1 200 OK
Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
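The delta in this example can be computed directly from the filtered cURL output. The sketch below assumes the output has the shape shown above (status lines, Location, and Memento-Datetime) and that the requested datetime is the 14-digit timestamp embedded in the URI-M. The helper names `requested_datetime` and `final_response` are hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Filtered cURL output copied from the session above.
CURL_OUTPUT = """\
HTTP/1.1 302 Found
Location: /10702/20121221163248/http://www.bbc.co.uk/news/
HTTP/1.1 200 OK
Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
"""


def requested_datetime(uri_m):
    """Return the 14-digit timestamp embedded in a URI-M as a UTC datetime."""
    match = re.search(r"/(\d{14})/", uri_m)
    if match is None:
        raise ValueError("no 14-digit timestamp in " + uri_m)
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def final_response(header_lines):
    """Return (status_code, memento_datetime) for the last response in a redirect chain."""
    status, memento_dt = None, None
    for line in header_lines.splitlines():
        if line.startswith("HTTP/"):
            status = int(line.split()[1])
            memento_dt = None  # headers seen so far belonged to an earlier hop
        elif line.lower().startswith("memento-datetime:"):
            memento_dt = parsedate_to_datetime(line.split(":", 1)[1].strip())
    return status, memento_dt


requested = requested_datetime(
    "http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/"
)
status, returned = final_response(CURL_OUTPUT)
delta_seconds = (returned - requested).total_seconds()
print(status, delta_seconds)
```

For this session the difference is 647 seconds, which is the "about 10 minutes" mentioned above.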



The second example shows a memento that has different values of the Memento-Datetime and URI-R compared to the corresponding values from the original archive. When requesting the memento:

http://collections.internetmemory.org/nli/20121223122758/http://www.whitehouse.gov/

on September 03, 2018, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:


WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/
WARC-Date: 2018-09-03T16:31:30Z
WARC-Record-ID: <urn:uuid:d03e5020-af96-11e8-9d72-f10b53f82929>
Content-Type: application/http; msgtype=response
Content-Length: 1694

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:31:19 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 28318986 28187250
Memento-Datetime: Sun, 23 Dec 2012 12:27:58 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/>; rel="memento"; datetime="Sun, 23 Dec 2012 12:27:58 GMT", <http://collections.internetmemory.org/nli/content/20110223072152/http://www.whitehouse.gov/>; rel="first memento"; datetime="Wed, 23 Feb 2011 07:21:52 GMT", <http://collections.internetmemory.org/nli/content/20180528183514/http://www.whitehouse.gov/>; rel="last memento"; datetime="Mon, 28 May 2018 18:35:14 GMT", <http://collections.internetmemory.org/nli/content/20121221220430/http://www.whitehouse.gov/>; rel="prev memento"; datetime="Fri, 21 Dec 2012 22:04:30 GMT", <http://collections.internetmemory.org/nli/content/20131208014833/http://www.whitehouse.gov/>; rel="next memento"; datetime="Sun, 08 Dec 2013 01:48:33 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.whitehouse.gov/>; rel="timegate", <http://www.whitehouse.gov/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.whitehouse.gov/>; rel="timemap"; type="application/link-format"
Content-Length: 287


<HTML><head>
<title>[ARCHIVED CONTENT] Access Denied</title>
</head><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;wwws&#46;whitehouse&#46;gov&#47;" on this server.<P>
Reference&#32;&#35;18&#46;d8407b5c&#46;1356265678&#46;2324d94
</BODY>
</HTML>

When requesting the corresponding memento from archive-it.org:

http://wayback.archive-it.org/10702/20121223122758/http://www.whitehouse.gov/

the request redirected to another URI-M:

http://wayback.archive-it.org/10702/20121221222130/http://www.whitehouse.gov/administration/eop/nec/speeches/gene-sperling-remarks-economic-club-washington

which returns "200 OK". Notice that not only do the Memento-Datetime values differ, but so do the URI-Rs. The representations of both mementos from the original and new archives are shown below:



Observation 4: The two archives handle archival 4xx/5xx responses differently

The replay tool in the original archive (internetmemory.org) is configured so that it returns the status code "200 OK" for archival 4xx/5xx. 

For example, when requesting the memento:

http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/


on September 03, 2018, the original archive returned "200 OK" for an archival "503 Service Unavailable" as the WARC record below shows:


WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/
WARC-Date: 2018-09-03T16:46:51Z
WARC-Record-ID: <urn:uuid:f4f2e910-af98-11e8-8de6-6f058c4e494a>
Content-Type: application/http; msgtype=response
Content-Length: 271841

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:39 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 28349831
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Connection: keep-alive
Accept-Ranges: bytes
...

Even the response for the inner iframe, in which the archived content is loaded, had the status code "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/
WARC-Date: 2018-09-03T16:46:51Z
WARC-Record-ID: <urn:uuid:f500cbc0-af98-11e8-8de6-6f058c4e494a>
Content-Type: application/http; msgtype=response
Content-Length: 2642

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:40 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 27227468 27453379
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/>; rel="memento"; datetime="Sun, 21 Oct 2012 20:36:47 GMT", <http://collections.internetmemory.org/nli/content/20110221192317/http://www.amazon.com/>; rel="first memento"; datetime="Mon, 21 Feb 2011 19:23:17 GMT", <http://collections.internetmemory.org/nli/content/20180711130159/http://www.amazon.com/>; rel="last memento"; datetime="Wed, 11 Jul 2018 13:01:59 GMT", <http://collections.internetmemory.org/nli/content/20121016174254/http://www.amazon.com/>; rel="prev memento"; datetime="Tue, 16 Oct 2012 17:42:54 GMT", <http://collections.internetmemory.org/nli/content/20121025120853/http://www.amazon.com/>; rel="next memento"; datetime="Thu, 25 Oct 2012 12:08:53 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.amazon.com/>; rel="timegate", <http://www.amazon.com/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.amazon.com/>; rel="timemap"; type="application/link-format"

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<title>[ARCHIVED CONTENT] 500 Service Unavailable Error</title>
</head>
<body style="padding:1% 10%;font-family:Verdana,Arial,Helvetica,sans-serif">
  <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/"><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/img09/x-site/other/a_com_logo_200x56.gif" alt="Amazon.com" width="200" height="56" border="0"/></a>
  <table>
    <tr>
      <td valign="top" style="width:52%;font-size:10pt"><br/><h2 style="color:#E47911">Oops!</h2><p>We're very sorry, but we're having trouble doing what you just asked us to do. Please give us another chance--click the Back button on your browser and try your request again. Or start from the beginning on our <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/">homepage</a>.</p></td>
      <th><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/x-locale/common/errors-alerts/product-fan-500.jpg" alt="product collage"/></th>
    </tr>
  </table>
</body>

</html>
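The Link header in the records above carries the Memento relations for each response (original, timegate, timemap, and the first/prev/next/last mementos), which is useful when comparing what two archives report for the same URI-R. The following is a minimal sketch of extracting those relations in Python. It assumes every parameter value is double-quoted, as in the headers shown here, and it keys the result by the full rel string (for example "first memento"). The function name `parse_link_header` is hypothetical, and the sample value is a shortened copy of the amazon.com header above.

```python
import re

# One link-value: <URI> followed by zero or more ;name="value" parameters.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return {rel string: {"uri": ..., other parameters}} for a Link header value."""
    relations = {}
    for uri, raw_params in LINK_VALUE.findall(value):
        params = dict(PARAM.findall(raw_params))
        rel = params.pop("rel", "")
        relations[rel] = {"uri": uri, **params}
    return relations


# Shortened copy of the Link header from the amazon.com record above.
SAMPLE = (
    '<http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/>; '
    'rel="memento"; datetime="Sun, 21 Oct 2012 20:36:47 GMT", '
    '<http://www.amazon.com/>; rel="original", '
    '<http://collections.internetmemory.org/nli/content/timemap/http://www.amazon.com/>; '
    'rel="timemap"; type="application/link-format"'
)

links = parse_link_header(SAMPLE)
print(links["original"]["uri"])
print(links["memento"]["datetime"])
```

Keying by the full rel string keeps the sketch short. A stricter parser would split rel into its individual relation types, since a single link can carry more than one.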

When requesting the corresponding memento:

http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

Archive-It properly returned the status code 503 for the archival 503:

$ curl -I http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

HTTP/1.1 503 Service Unavailable
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Link: <http://www.amazon.com/>; rel="original", <https://wayback.archive-it.org/10702/timemap/link/http://www.amazon.com/>; rel="timemap"; 
...

Observation 5: The HTTP status code may change in the new archive

The HTTP status code of a URI-M in the new archive might not be identical to the HTTP status code of the corresponding URI-M in the original archive. For example, the HTTP request for the URI-M:

http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/


to the original archive resulted in "200 OK", as the excerpt of the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/
WARC-Date: 2018-09-03T08:40:43Z
WARC-Record-ID: <urn:uuid:0b947600-af55-11e8-9b13-5bce71cafd38>
Content-Type: application/http; msgtype=response
Content-Length: 28447

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 08:40:30 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 27888670
Memento-Datetime: Sun, 23 Dec 2012 03:18:37 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="first memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="last memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="prev memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="next memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/timegate/http://www2008.org/>; rel="timegate", <http://www2008.org/>; rel="original", <http://collections.internetmemory.org/nli/timemap/http://www2008.org/>; rel="timemap"; type="application/link-format"



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
...

The representation of the memento is illustrated below:


In internetmemory.org

The request to the corresponding URI-M:

http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head --silent http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Content-Type: text/html;charset=utf-8
Content-Length: 4902
Date: Thu, 05 Sep 2019 08:28:27 GMT

The list of all 979 URI-Ms is appended below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
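Given these fields, deciding whether a memento is missing reduces to checking the last three. The sketch below is one way to express that check in Python, under loudly stated assumptions: each row is a dictionary keyed by the field names above, `delta` is a number of seconds, and the two "same" fields are booleans. The actual encoding of the file is not described in this post, and the rows in the example are illustrative, not copied from the file.

```python
from typing import Mapping


def is_missing(row: Mapping[str, object]) -> bool:
    """A memento is missing if its Memento-Datetime, URI-R, or final status code changed."""
    unchanged = (
        row["delta"] == 0
        and row["same_final_URI-Rs"] is True
        and row["same_final_URI-Ms_status_code"] is True
    )
    return not unchanged


# Illustrative rows only (assumed encoding, not taken from the appended file).
rows = [
    {"delta": 0, "same_final_URI-Rs": True, "same_final_URI-Ms_status_code": True},
    {"delta": 647, "same_final_URI-Rs": False, "same_final_URI-Ms_status_code": True},
    {"delta": 0, "same_final_URI-Rs": True, "same_final_URI-Ms_status_code": False},
]

missing = [row for row in rows if is_missing(row)]
print(f"{len(missing)} of {len(rows)} illustrative rows are missing")
```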

Conclusions

We did not find any changes in the 979 mementos of the National Library of Ireland (NLI) collection when they were moved from europarchive.org to internetmemory.org in May 2018. Both archives used the same replay tool and archival banners. The NLI collection was then moved to archive-it.org in September 2018. We found that 192 out of the 979 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that the functionality that used to be available in the original archival banner is gone in the new archive, and that internetmemory.org and archive-it.org react differently to requests for archival 4xx/5xx responses.

In the upcoming posts we will provide some details about changes in the archives:
--Mohamed Aturban






2 comments:

  1. Thanks for this. I used the NLI Web Archive for a case-study, with a sample dataset of URL/archived web pages, and presented it at a conference in summer 2017. Of course, as part of this, I linked to some of the URL/archived web pages for effect. In late 2018, I wanted to expand the dataset further, but the original dataset was no longer available through the URLs, and thereafter trying to find the same thing in the archive-it collection was problematic. So thank you for this - it explains a lot! Also, raises the challenge of research reproducibility when an archive moves.

  2. Hi @SCHealy

    Hopefully replacing:

    http://collection.europarchive.org/nli/
    and/or
    http://collections.internetmemory.org/nli/

    with:

    http://wayback.archive-it.org/10702/

    Will get you started. You won't necessarily find the exact datestamp, but hopefully you'll still find something. We'd also be interested in knowing which URLs are completely gone.

    regards,

    Michael
