Wednesday, September 25, 2019

2019-09-25: Where did the archive go? Part 3: Public Record Office of Northern Ireland


In Where did the archive go? Part 1, we provided some details about changes in the Library and Archives Canada archive. After they upgraded their replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In Part 2, we focused on the movement of the National Library of Ireland (NLI) collection. Mementos from the NLI collection were moved from the European Archive to the Internet Memory Foundation (IMF) archive. Then, they were moved to Archive-It. We found that 192 mementos, out of 979, cannot be found in Archive-It.

In part 3 of this four-part series, we focus on changes in the Public Record Office of Northern Ireland (PRONI) Web Archive. In October 2018, mementos in the PRONI archive were moved to Archive-It (archive-it.org). We discovered that 114 mementos, out of 469, can no longer be found in Archive-It (i.e., missing mementos).

We refer to the archive from which mementos have moved as the "original archive" (i.e., PRONI archive), and we use the "new archive" to refer to the archive to which the mementos have moved (i.e., Archive-It). A memento is identified by a URI-M as defined in the Memento framework.

We have several observations about the changes in the PRONI web archive:

Observation 1: The HTTP request to a URI-M in the PRONI archive does not redirect to the corresponding URI-M in the new archive

As shown in the cURL session below, every request to a memento (URI-M) in the PRONI archive will return the HTTP status code "404 Not Found":

$ curl --head --location http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

HTTP/1.1 302 Found
Cache-Control: no-cache
Content-length: 0
Location: https://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
HTTP/2 404
date: Fri, 20 Sep 2019 08:13:45 GMT
server: Apache/2.4.18 (Ubuntu)
content-type: text/html; charset=iso-8859-1

PRONI did not leave a machine readable method of locating the new URI-Ms. However, we were able to manually discover the corresponding URI-Ms in Archive-It. For example, the memento:

http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

is now available at:

http://wayback.archive-it.org/11112/20100218151844/http://www.berr.gov.uk/

The representations of both mementos are illustrated in the figure below:


Unlike the European Archive and IMF, the Public Record Office of Northern Ireland (PRONI) still owns the domain name of the original archive, webarchive.proni.gov.uk. Therefore, to maintain link integrity via "follow-your-nose", PRONI could issue redirects (even though it currently does not) to the corresponding URI-Ms in Archive-It. For example, since PRONI uses the Apache web server, the mod_rewrite rule that could be used to perform automatic redirects is:

# With mod_rewrite
RewriteEngine on
RewriteRule "^/(\d{14})/(.+)" http://wayback.archive-it.org/11112/$1/$2 [L,R=301]
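
The same mapping can be checked offline before a rewrite rule is deployed. The following Python sketch mirrors the regular expression in the rule above. The function name proni_to_archive_it is hypothetical, and the collection identifier 11112 is taken from the Archive-It URI-M shown earlier.

```python
import re
from typing import Optional

# Prefix of the Archive-It collection that now holds the PRONI mementos
# (taken from the example URI-M above).
ARCHIVE_IT_PREFIX = "http://wayback.archive-it.org/11112/"

# Same pattern as the mod_rewrite rule: a 14-digit timestamp followed by the URI-R.
_PRONI_URIM = re.compile(r"^https?://webarchive\.proni\.gov\.uk/(\d{14})/(.+)$")


def proni_to_archive_it(urim: str) -> Optional[str]:
    """Map a PRONI URI-M to the corresponding Archive-It URI-M (hypothetical helper).

    Returns None when the input does not look like a PRONI URI-M.
    """
    match = _PRONI_URIM.match(urim)
    if match is None:
        return None
    timestamp, urir = match.groups()
    return f"{ARCHIVE_IT_PREFIX}{timestamp}/{urir}"


print(proni_to_archive_it(
    "http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/"
))
```

As Observation 3 shows, a URI-M produced this way is only a candidate: the new archive may still redirect it to a different memento or return 404.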

Observation 2: The functionality of the original archival banner is gone

As with the archival banners provided by the European Archive and IMF, the custom archival banner of the PRONI archive (marked in red in the top screenshot in the figure above) allowed users to navigate through available mementos, viewing the list of available mementos and the representation of a selected memento in the same page. Archive-It, on the other hand, now uses the standard playback banner (marked in red in the bottom screenshot in the figure above). This banner does not offer the same functionality as the original archive's banner. It informs users that they are viewing an "archived" web page and shows multiple links. One of the links leads to a web page that shows all available mementos in archive-it.org. For example, the figure below shows the available mementos for the web page http://www.berr.gov.uk/:



Observation 3: Not all mementos are available in the new archive

We define a memento as "missing" if the values of the Memento-Datetime, the URI-R, and the final HTTP status code of the memento from the original archive are not all identical to the values of the corresponding memento in the new archive. In this study, we found 114 missing mementos (out of 469) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. For example, when requesting the URI-M:

http://webarchive.proni.gov.uk/20160901021637/https://www.flickr.com/

from the original archive (PRONI) on 2017-12-01, the archive responded with "200 OK" with the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Thu, 01 September 2016 02:16:37 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/11112/20160901021637/https://www.flickr.com/

from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/11112/20170401014520/https://www.flickr.com/

Although the representations of both mementos are identical (except for the archival banners), as shown in the figure below, we consider the memento from the original archive as missing because the two mementos have different values of the Memento-Datetime (i.e., Sat, 01 Apr 2017 01:45:20 GMT in the new archive), for a delta of about 211 days.

$ curl --head --location --silent http://wayback.archive-it.org/11112/20160901021637/http://www.flickr.com/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170401014520/https://www.flickr.com/
HTTP/1.1 200 OK
Memento-Datetime: Sat, 01 Apr 2017 01:45:20 GMT



We found that 63 of the 114 missing mementos differ in Memento-Datetime by less than 11 seconds. For example, the request to the memento:

http://webarchive.proni.gov.uk/20170102004044/http://www.fws.gov/

from the original archive on 2017-11-18 returned a "302" redirect to:

http://webarchive.proni.gov.uk/20170102004044/https://fws.gov/

The request to the corresponding memento:

http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/

from the new archive redirects to the memento:

http://wayback.archive-it.org/11112/20170102004051/https://www.fws.gov/

as the cURL session below shows:

$ curl --head --location --silent http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170102004051/https://www.fws.gov/
HTTP/1.1 200 OK
Memento-Datetime: Mon, 02 Jan 2017 00:40:51 GMT

There is a 7-second difference between the values of the Memento-Datetime, which might not be semantically significant (apparently just a change in the canonicalization of the URIs, with http://www.fws.gov/ redirecting to https://www.fws.gov/). Nevertheless, we do not consider the memento in the original archive to be identical to the corresponding memento in the new archive because the values of the Memento-Datetime differ.
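
The delta between two mementos can be computed directly from their URI-Ms. The sketch below is a minimal example that assumes the 14-digit timestamp in the final URI-M equals the Memento-Datetime header. This holds for the examples in this observation but is not guaranteed in general, so the header itself is the authoritative value. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# A 14-digit Wayback-style timestamp (YYYYMMDDhhmmss) enclosed in slashes.
_TIMESTAMP = re.compile(r"/(\d{14})/")


def urim_datetime(urim: str) -> datetime:
    """Extract the 14-digit timestamp from a URI-M (hypothetical helper)."""
    match = _TIMESTAMP.search(urim)
    if match is None:
        raise ValueError(f"no 14-digit timestamp in {urim!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def memento_delta(original_urim: str, new_urim: str) -> timedelta:
    """Absolute difference between the timestamps of two URI-Ms."""
    return abs(urim_datetime(new_urim) - urim_datetime(original_urim))


delta = memento_delta(
    "http://webarchive.proni.gov.uk/20170102004044/http://www.fws.gov/",
    "http://wayback.archive-it.org/11112/20170102004051/https://www.fws.gov/",
)
print(delta.total_seconds())
```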

When the new archive receives an archived collection from the original archive, it may apply some post-crawling techniques to the received files (e.g., WARC files) including deduplication, spam filtering, and indexing. This may result in mementos in the new archive that have different values of the Memento-Datetime compared to their corresponding values in the original archive.

Observation 4: PRONI provides a list of original pages' URIs (URI-Rs)

Mementos in the PRONI archive were moved to Archive-It under the archival collection https://archive-it.org/collections/11112/:


As shown in Observation 1, requests to URI-Ms in PRONI do not redirect to Archive-It. However, webarchive.proni.gov.uk provides a list of all original resources' URIs (URI-Rs) for which mementos have been created as the following figure shows:



For instance, suppose we are interested in finding the memento in Archive-It that corresponds to the PRONI memento:
URI-M = http://webarchive.proni.gov.uk/20150318223351/http://www.afbini.gov.uk/

which has:
URI-R = http://www.afbini.gov.uk/
Memento-Datetime = Wed, 18 Mar 2015 22:33:51 GMT

From the index at webarchive.proni.gov.uk, we can click on the URI-R www.afbini.gov.uk, which will redirect to an Archive-It HTML page that contains all available mementos for the selected URI-R as shown in the figure below:



Finally, we choose the capture from 2015-03-18, since it has the same Memento-Datetime as the memento in the original archive. The representation of the memento is shown below:



Although the PRONI archive does not issue "301" redirects to URI-Ms in the new archive (i.e., PRONI does not provide a direct mapping between original URI-Ms and new URI-Ms), users of the archive can indirectly find the corresponding URI-Ms as explained above.
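
The manual lookup described above amounts to choosing, from the captures listed for a URI-R, the one whose datetime matches the original Memento-Datetime. A minimal sketch, assuming the capture datetimes have already been collected from the listing page (the candidate list below is made-up sample data, not the actual Archive-It listing, and closest_capture is a hypothetical name):

```python
from datetime import datetime
from typing import Iterable


def closest_capture(target: datetime, captures: Iterable[datetime]) -> datetime:
    """Return the capture datetime nearest to target; the earliest wins ties."""
    ordered = sorted(captures)
    if not ordered:
        raise ValueError("no captures to choose from")
    return min(ordered, key=lambda capture: abs(capture - target))


# Memento-Datetime of the PRONI memento from the example above.
target = datetime(2015, 3, 18, 22, 33, 51)

# Hypothetical capture datetimes for http://www.afbini.gov.uk/.
captures = [
    datetime(2014, 6, 1, 12, 0, 0),
    datetime(2015, 3, 18, 22, 33, 51),
    datetime(2016, 1, 5, 8, 30, 0),
]

best = closest_capture(target, captures)
print(best, "exact match" if best == target else "nearest only")
```

If the nearest capture is not an exact match, the memento would count as missing under the definition in Observation 3.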

Observation 5: Archival 4xx/5xx responses are handled differently

Similar to the European Archive and IMF, the replay tool in the original archive (proni.gov.uk) is configured so that it returns the status code "200 OK" for archival 4xx/5xx. For example, when requesting the memento:

http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/

on 2017-11-18, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81bb8530-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 60568

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
...

Even the HTTP response for the inner iframe in which the archived content is loaded had the status code "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/content/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81c967e0-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 4587

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 24849777 24360264
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<title>[ARCHIVED CONTENT] 403 FORBIDDEN : LOGGED BY www.megalithic.co.uk</title>
</head>
<body style=\"font:Arial Black,Arial\">
<p><b><font color="#FF0000" size="+2" face="Verdana, Arial, Helvetica, sans-serif" style="text-decoration:blink; color:#FF0000; background:#000000;">&nbsp;&nbsp;&nbsp;&nbsp; 403 FORBIDDEN! &nbsp;&nbsp;&nbsp;&nbsp;</font></b></p>
<b>
<p><font color="#FF0000">You have been blocked from the Megalithic Portal by our defence robot.<br>Possibly the IP address you are accessing from has been banned for previous bad behavior or you have attempted a hostile action.<br>If you think this is an error please click the Trouble Ticket link below to communicate with the site admin.</font></p>
...

When requesting the corresponding memento:

http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

Archive-It properly returned the status code 403 for the archival 403:

$ curl --head http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

HTTP/1.1 403 Forbidden
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
X-Archive-Guessed-Charset: windows-1252
X-Archive-Orig-Server: Apache/2.2.15 (CentOS)
X-Archive-Orig-Connection: close
X-Archive-Orig-X-Powered-By: PHP/5.3.3
X-Archive-Orig-Status: 403 FORBIDDEN
X-Archive-Orig-Content-Length: 3206
X-Archive-Orig-Date: Tue, 16 Feb 2016 15:39:59 GMT
X-Archive-Orig-Warning: 199 www.megalithic.co.uk:80 You_are_abusive/hacking/spamming_www.megalithic.co.uk
X-Archive-Orig-Content-Type: text/html; charset=iso-8859-1
...
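
In the response above, Archive-It exposes the headers of the original capture with an X-Archive-Orig- prefix. A minimal sketch for recovering the archival status code from such a response, assuming the headers are available as a name-to-value mapping (the function name is hypothetical, and the header may not be present in every response):

```python
from typing import Mapping, Optional


def archival_status(headers: Mapping[str, str]) -> Optional[int]:
    """Return the status code recorded in X-Archive-Orig-Status, if any."""
    for name, value in headers.items():
        if name.lower() == "x-archive-orig-status":
            tokens = value.split()
            # The value looks like "403 FORBIDDEN"; the first token is the code.
            if tokens and tokens[0].isdigit():
                return int(tokens[0])
            return None
    return None


# Header values copied from the cURL session above.
headers = {
    "Memento-Datetime": "Tue, 16 Feb 2016 15:40:00 GMT",
    "X-Archive-Orig-Status": "403 FORBIDDEN",
}
print(archival_status(headers))
```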

The representations of both mementos are illustrated below:



Observation 6: Some URI-Rs were removed from PRONI

Mementos may disappear when moving from the original archive to the new archive. For example, the request to the URI-M:

http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/

from the original archive resulted in "200 OK" as the excerpt of the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/
WARC-Date: 2017-11-18T02:06:22Z
WARC-Record-ID: <urn:uuid:13136370-cc05-11e7-83e9-19ddf7ecdbd2>
Content-Type: application/http; msgtype=response
Content-Length: 28657

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 02:06:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 20429973
Memento-Datetime: Tue, 08 Apr 2014 18:55:12 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="first memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="last memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="prev memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="next memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/timegate/http://www.www126.com/>; rel="timegate", <http://www.www126.com/>; rel="original", <http://webarchive.proni.gov.uk/timemap/http://www.www126.com/>; rel="timemap"; type="application/link-format"
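
The Link header in the record above carries the Memento relations (memento, first/last/prev/next memento, timegate, original, and timemap). A minimal sketch for pulling these apart, assuming the header value is available as a single string. The regular expressions handle the quoted-parameter form seen here, not every form a Link header may take, and parse_link_header is a hypothetical name.

```python
import re
from typing import Dict, List

# One link: <URI> followed by zero or more ;name="value" parameters.
_LINK = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
_PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> List[Dict[str, object]]:
    """Split a Link header value into URI, list of rel values, and datetime."""
    links = []
    for uri, params in _LINK.findall(value):
        attrs = dict(_PARAM.findall(params))
        links.append({
            "uri": uri,
            "rel": attrs.get("rel", "").split(),  # e.g. ["first", "memento"]
            "datetime": attrs.get("datetime"),    # None for timegate, original, timemap
        })
    return links


# Shortened version of the Link header shown above.
value = (
    '<http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; '
    'rel="first memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", '
    '<http://webarchive.proni.gov.uk/timegate/http://www.www126.com/>; rel="timegate", '
    '<http://www.www126.com/>; rel="original"'
)
for link in parse_link_header(value):
    print(link["rel"], link["uri"])
```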

The representation of the memento is illustrated below:

The request to the corresponding URI-M:

http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Content-Length: 4910
Date: Tue, 24 Sep 2019 02:00:45 GMT

Before transferring collections to the new archive, the original archive may review them and remove URI-Rs/URI-Ms that are considered off-topic (see also the Off-Topic Memento Toolkit) or spam (e.g., the URI-R www.www126.com is about auto insurance).

Observation 7: PRONI may have used the European Archive and IMF as hosting services

PRONI used an archival banner that is similar to the one used by the European Archive and IMF. Furthermore, the three archives returned a similar set of HTTP response headers to requests for mementos. Values of multiple response headers (e.g., Server and Via) are identical as shown below:

The European Archive:
URI-M = http://collection.europarchive.org/nli/20161213111140/http://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collection.europarchive.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2017-12-01T18:15:32Z
WARC-Record-ID: <urn:uuid:9e705c20-d6c3-11e7-9f9e-5371622c3ef9>
Content-Type: application/http; msgtype=response
Content-Length: 159855

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 18:15:18 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 15695216
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

The IMF archive:
URI-M = http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2018-09-03T16:33:41Z
WARC-Record-ID: <urn:uuid:1e3173c0-af97-11e8-8819-6df9b412b877>
Content-Type: application/http; msgtype=response
Content-Length: 300060

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:33:27 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 11721161
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link:...

The PRONI archive:
URI-M = http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
WARC-Date: 2017-12-01T01:59:56Z
WARC-Record-ID: <urn:uuid:5452da60-d63b-11e7-91e2-8bf9abaf94b4>
Content-Type: application/http; msgtype=response
Content-Length: 36908

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 01:59:43 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 16262557
Memento-Datetime: Thu, 18 Feb 2010 15:18:44 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

We found that the PRONI archival collection was listed as one of the collections maintained by the European Archive and IMF, as the figure below shows:

https://web.archive.org/web/20180707131510/http://collections.internetmemory.org/

Even though the PRONI collection was moved from the European Archive to IMF, URI-Ms served by proni.gov.uk did not change. It is possible that the PRONI archive followed a strategy of serving mementos under proni.gov.uk while using the hosting services provided by the European Archive and IMF. Thus, the regular users of the PRONI archive did not notice any change in URI-Ms. We do not think custom domains are available with Archive-It, so PRONI was unable to continue hosting its mementos in its own URI namespace.

The list of all 469 URI-Ms is appended below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
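
The last three fields are enough to apply the definition of a missing memento from Observation 3. A minimal sketch, assuming each line of the file has been parsed into a dictionary keyed by the field names above, with delta as an integer and the two same_* fields as booleans. The on-disk format is not specified here, and the sample rows are hypothetical.

```python
from typing import Any, Mapping


def is_missing(row: Mapping[str, Any]) -> bool:
    """A memento is missing unless datetime, URI-R, and status code all match."""
    return not (
        row["delta"] == 0
        and row["same_final_URI-Rs"]
        and row["same_final_URI-Ms_status_code"]
    )


# Hypothetical rows illustrating the three outcomes.
rows = [
    {"delta": 0, "same_final_URI-Rs": True, "same_final_URI-Ms_status_code": True},
    {"delta": 7, "same_final_URI-Rs": True, "same_final_URI-Ms_status_code": True},
    {"delta": 0, "same_final_URI-Rs": True, "same_final_URI-Ms_status_code": False},
]

missing = [row for row in rows if is_missing(row)]
print(f"{len(missing)} of {len(rows)} sample rows are missing")
```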

Conclusions

We described the movement of mementos from the PRONI archive (proni.gov.uk) to archive-it.org in October 2018. We found that 114 out of the 469 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that the functionality that used to be available in the original archival banner is gone. We noticed that the two archives, proni.gov.uk and archive-it.org, respond differently to requests for archival 4xx/5xx responses. In the upcoming post, we will provide some details about changes in webcitation.org.


--Mohamed Aturban




Tuesday, September 10, 2019

2019-09-10: Where did the archive go? Part 2: National Library of Ireland


In the previous post, we provided some details about changes in the archive Library and Archives Canada. After they upgraded their web archive replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In part 2 of this four part series, we focus on the movement of a collection from the National Library of Ireland (NLI).


In May 2018, we discovered that 979 mementos from the NLI collection that were originally archived at the European Archive (europarchive.org) had been moved to the Internet Memory Foundation archive (internetmemory.org). Then in September 2018, we found that the collection of mementos had been moved to Archive-It (archive-it.org). We found that 192 mementos, out of 979, cannot be found in Archive-It (i.e., missing mementos).

For example, the memento from the European Archive:


has been moved to the Internet Memory Foundation (IMF) archive at:

http://collections.internetmemory.org/nli/20141013204117/http://www.defense.gov/

before it ended up at Archive-It:

http://wayback.archive-it.org/10702/20141013204117/http://www.defense.gov/

The representations of the three mementos are illustrated in the figure below.



There were no changes in the 979 mementos (other than their URIs) when they moved from the European Archive to the IMF archive (even the archival banner remained the same as the figure above shows), but we found some significant changes upon the move from IMF to Archive-It which we will focus on in this post.

We refer to the archive from which mementos were moved (i.e., internetmemory.org) as the "original archive", and we use the "new archive" to refer to the archive to which the mementos were moved (i.e., archive-it.org). A memento is identified by a URI-M as defined in the Memento framework.

Our observations about changes in the NLI collection (from IMF to Archive-It) are as follows:

Observation 1: The functionality of the original archival banner is gone

Users of the European Archive and IMF were able to navigate through available mementos via the custom archival banner (marked in red in the top two screenshots in the figure above). Via this banner, the original archive allows users to view the available mementos and the representation of a selected memento in the same page. Archive-It, on the other hand, now uses the standard playback banner (marked in red in the bottom screenshot in the figure above). This new archive's banner contains information to inform users that they are viewing an "archived" web page. This banner also contains multiple links. One of the links will take you to a web page in archive-it.org that shows all available mementos in the archive as shown in the figure below:





Observation 2: The original archive is no longer reachable

After mementos were moved from internetmemory.org, the archive became unreachable as the following cURL session shows:

$ date
Tue May 21 08:03:51 EDT 2019

$ curl http://www.internetmemory.org
curl: (7) Failed to connect to www.internetmemory.org port 80: Operation timed out



In addition to IMF, the European Archive (europarchive.org) is also no longer maintained: it was shut down, and its domain name was purchased by another entity and now serves spam.



The movement of mementos from these two archives will affect link integrity across web resources that contain links to mementos from the European Archive or IMF. As mentioned in the previous post, there are actions that original archives can perform to maintain link integrity via "follow-your-nose" from the old URI-Ms to the corresponding URI-Ms in the new archive.


For example, the archive Library and Archives Canada changed its domain name from collectionscanada.gc.ca to webarchive.bac-lac.gc.ca (described in part 1), and because the archive still controls the original domain name collectionscanada.gc.ca, the archive could (even though it currently does not) redirect requests for URI-Ms in collectionscanada.gc.ca to the new archive webarchive.bac-lac.gc.ca. For instance, if the original archive uses the Apache web server, the following mod_rewrite rule can be used to perform automatic redirects:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]

But these practices become impractical in the case of the European Archive and IMF because:

  • The archives no longer exist (e.g., the European Archive and IMF), so there is no maintained web server available to issue the redirects.
  • Even if it still existed, the archive might decide not to issue redirects for former customers in order to increase lock-in.

In the upcoming post, we will describe the movement of mementos from the Public Record Office of Northern Ireland (PRONI) web archive to Archive-It. The PRONI organization still controls and maintains the original domain name webarchive.proni.gov.uk, so it is possible for PRONI to issue redirects to the new URI-Ms in Archive-It.

Observation 3: Not all mementos are available in the new archive

As defined in the previous post, a missing memento occurs when the values of the Memento-Datetime, the URI-R, and the final HTTP status code of the memento from the original archive are not all identical to the values of the corresponding memento from the new archive. In this study, we found 192 missing mementos (out of 979) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. We give two examples of missing mementos. The first example shows a memento that cannot be found in the new archive with the same Memento-Datetime as it had in the original archive. When requesting the URI-M:

http://collections.internetmemory.org/nli/20121221162201/http://bbc.co.uk/news/

from the original archive (internetmemory.org) on September 03, 2018, the archive responded with "200 OK" with the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Fri, 21 Dec 2012 16:22:01 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/


from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/10702/20121221163248/http://www.bbc.co.uk/news/

Although the representations of both mementos are identical (except for the archival banners), as shown in the figure below, we consider the memento from the original archive as missing because the two mementos have different values of the Memento-Datetime (i.e., Fri, 21 Dec 2012 16:32:48 GMT in the new archive), for a delta of about 10 minutes. Even though the 10-minute delta might not be semantically significant (apparently just a change in the canonicalization of the URI-R, with bbc.co.uk redirecting to www.bbc.co.uk), we do not consider it to be the same since the values of the Memento-Datetime are not identical.

$ curl --head --location --silent http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /10702/20121221163248/http://www.bbc.co.uk/news/
HTTP/1.1 200 OK
Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
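
Filtered cURL output like the session above can be reduced to the values we compare: the chain of status codes, the last redirect target, and the Memento-Datetime. A minimal sketch, assuming the output has been captured as text (the function name summarize_head_output is hypothetical):

```python
from email.utils import parsedate_to_datetime


def summarize_head_output(text: str):
    """Return (status codes, last Location, Memento-Datetime) from `curl --head --location` output."""
    statuses, location, memento_datetime = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("http/"):
            # Status line, e.g. "HTTP/1.1 302 Found" or "HTTP/2 404".
            statuses.append(int(line.split()[1]))
        elif lowered.startswith("location:"):
            location = line.split(":", 1)[1].strip()
        elif lowered.startswith("memento-datetime:"):
            memento_datetime = parsedate_to_datetime(line.split(":", 1)[1].strip())
    return statuses, location, memento_datetime


# Output copied from the cURL session above.
output = """\
HTTP/1.1 302 Found
Location: /10702/20121221163248/http://www.bbc.co.uk/news/
HTTP/1.1 200 OK
Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
"""
statuses, location, memento_datetime = summarize_head_output(output)
print(statuses, location, memento_datetime.isoformat())
```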



The second example shows a memento that has different values of the Memento-Datetime and URI-R compared to the corresponding values from the original archive. When requesting the memento:

http://collections.internetmemory.org/nli/20121223122758/http://www.whitehouse.gov/

on September 03, 2018, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:


WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/
WARC-Date: 2018-09-03T16:31:30Z
WARC-Record-ID: <urn:uuid:d03e5020-af96-11e8-9d72-f10b53f82929>
Content-Type: application/http; msgtype=response
Content-Length: 1694

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:31:19 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 28318986 28187250
Memento-Datetime: Sun, 23 Dec 2012 12:27:58 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/>; rel="memento"; datetime="Sun, 23 Dec 2012 12:27:58 GMT", <http://collections.internetmemory.org/nli/content/20110223072152/http://www.whitehouse.gov/>; rel="first memento"; datetime="Wed, 23 Feb 2011 07:21:52 GMT", <http://collections.internetmemory.org/nli/content/20180528183514/http://www.whitehouse.gov/>; rel="last memento"; datetime="Mon, 28 May 2018 18:35:14 GMT", <http://collections.internetmemory.org/nli/content/20121221220430/http://www.whitehouse.gov/>; rel="prev memento"; datetime="Fri, 21 Dec 2012 22:04:30 GMT", <http://collections.internetmemory.org/nli/content/20131208014833/http://www.whitehouse.gov/>; rel="next memento"; datetime="Sun, 08 Dec 2013 01:48:33 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.whitehouse.gov/>; rel="timegate", <http://www.whitehouse.gov/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.whitehouse.gov/>; rel="timemap"; type="application/link-format"
Content-Length: 287


<HTML><head>
<title>[ARCHIVED CONTENT] Access Denied</title>
</head><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;wwws&#46;whitehouse&#46;gov&#47;" on this server.<P>
Reference&#32;&#35;18&#46;d8407b5c&#46;1356265678&#46;2324d94
</BODY>
</HTML>

When requesting the corresponding memento from archive-it.org:

http://wayback.archive-it.org/10702/20121223122758/http://www.whitehouse.gov/

the request redirected to another URI-M:

http://wayback.archive-it.org/10702/20121221222130/http://www.whitehouse.gov/administration/eop/nec/speeches/gene-sperling-remarks-economic-club-washington

which returns "200 OK". Notice that not only the values of the Memento-Datetime differ but also the URI-Rs. The representations of both mementos from the original and new archives are shown below:



Observation 4: Both archives handle the archival 4xx/5xx differently

The replay tool in the original archive (internetmemory.org) is configured so that it returns the status code "200 OK" for archival 4xx/5xx. 

For example, when requesting the memento:

http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/


on September 03, 2018, the original archive returned "200 OK" for an archival "503 Service Unavailable" as the WARC record below shows:


WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/
WARC-Date: 2018-09-03T16:46:51Z
WARC-Record-ID: <urn:uuid:f4f2e910-af98-11e8-8de6-6f058c4e494a>
Content-Type: application/http; msgtype=response
Content-Length: 271841

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:39 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 28349831
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Connection: keep-alive
Accept-Ranges: bytes
...

Even the HTTP response for the inner iframe in which the archived content is loaded had the status code "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/
WARC-Date: 2018-09-03T16:46:51Z
WARC-Record-ID: <urn:uuid:f500cbc0-af98-11e8-8de6-6f058c4e494a>
Content-Type: application/http; msgtype=response
Content-Length: 2642

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:40 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 27227468 27453379
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/>; rel="memento"; datetime="Sun, 21 Oct 2012 20:36:47 GMT", <http://collections.internetmemory.org/nli/content/20110221192317/http://www.amazon.com/>; rel="first memento"; datetime="Mon, 21 Feb 2011 19:23:17 GMT", <http://collections.internetmemory.org/nli/content/20180711130159/http://www.amazon.com/>; rel="last memento"; datetime="Wed, 11 Jul 2018 13:01:59 GMT", <http://collections.internetmemory.org/nli/content/20121016174254/http://www.amazon.com/>; rel="prev memento"; datetime="Tue, 16 Oct 2012 17:42:54 GMT", <http://collections.internetmemory.org/nli/content/20121025120853/http://www.amazon.com/>; rel="next memento"; datetime="Thu, 25 Oct 2012 12:08:53 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.amazon.com/>; rel="timegate", <http://www.amazon.com/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.amazon.com/>; rel="timemap"; type="application/link-format"

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<title>[ARCHIVED CONTENT] 500 Service Unavailable Error</title>
</head>
<body style="padding:1% 10%;font-family:Verdana,Arial,Helvetica,sans-serif">
  <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/"><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/img09/x-site/other/a_com_logo_200x56.gif" alt="Amazon.com" width="200" height="56" border="0"/></a>
  <table>
    <tr>
      <td valign="top" style="width:52%;font-size:10pt"><br/><h2 style="color:#E47911">Oops!</h2><p>We're very sorry, but we're having trouble doing what you just asked us to do. Please give us another chance--click the Back button on your browser and try your request again. Or start from the beginning on our <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/">homepage</a>.</p></td>
      <th><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/x-locale/common/errors-alerts/product-fan-500.jpg" alt="product collage"/></th>
    </tr>
  </table>
</body>

</html>
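The status line and Memento-Datetime in WARC response records like the two above can also be read programmatically. The following is a minimal Python sketch, not the tooling used for this study; the function name parse_replayed_response and the SAMPLE excerpt (shortened from the first record above) are only illustrative.

```python
import re
from email.utils import parsedate_to_datetime


def parse_replayed_response(record: str):
    """Return (status_code, memento_datetime) for the text of one WARC response record.

    The WARC headers come first; the replayed HTTP response starts at the
    first line that begins with "HTTP/".
    """
    status_match = re.search(r"^HTTP/\d(?:\.\d)?\s+(\d{3})", record, flags=re.MULTILINE)
    if status_match is None:
        raise ValueError("no HTTP status line found in record")
    status = int(status_match.group(1))

    # Memento-Datetime is optional; return None when the header is absent.
    md_match = re.search(
        r"^Memento-Datetime:\s*(.+)$", record, flags=re.MULTILINE | re.IGNORECASE
    )
    memento_datetime = (
        parsedate_to_datetime(md_match.group(1).strip()) if md_match else None
    )
    return status, memento_datetime


# Shortened excerpt of the first WARC record shown above.
SAMPLE = """WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/
Content-Type: application/http; msgtype=response

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:39 GMT
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
"""

if __name__ == "__main__":
    print(parse_replayed_response(SAMPLE))
```

For this excerpt the status is the 200 sent by the replay tool, even though the archived page itself is an error page, which is the behavior described in this observation.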

When requesting the corresponding memento:

http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

Archive-It properly returned the status code 503 for the archival 503:

$ curl -I http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

HTTP/1.1 503 Service Unavailable
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Link: <http://www.amazon.com/>; rel="original", <https://wayback.archive-it.org/10702/timemap/link/http://www.amazon.com/>; rel="timemap"; 
...
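In the examples in this post, the corresponding URI-M in Archive-It keeps the 14-digit timestamp and the URI-R and only swaps the archive prefix. A minimal Python sketch of that rewriting follows. It assumes the two prefixes seen in the URI-Ms above (collection 10702 in Archive-It), and the helper name to_archive_it is hypothetical. It only builds a candidate URI-M; as the observations show, Archive-It may still redirect it or return an error.

```python
import re

# Prefixes taken from the URI-Ms shown in this post.
IMF_URIM = re.compile(
    r"^https?://collections\.internetmemory\.org/nli/(\d{14})/(.+)$"
)
ARCHIVE_IT_PREFIX = "http://wayback.archive-it.org/10702/"


def to_archive_it(original_urim: str) -> str:
    """Build the candidate Archive-It URI-M for an internetmemory.org URI-M."""
    match = IMF_URIM.match(original_urim)
    if match is None:
        raise ValueError(f"not an internetmemory.org NLI URI-M: {original_urim}")
    timestamp, uri_r = match.groups()
    # Same timestamp and URI-R, different archive prefix.
    return f"{ARCHIVE_IT_PREFIX}{timestamp}/{uri_r}"


if __name__ == "__main__":
    print(to_archive_it(
        "http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/"
    ))
```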

Observation 5: The HTTP status code may change in the new archive

The HTTP status codes of URI-Ms in the new archive might not be identical to the HTTP status codes of the corresponding URI-Ms in the original archive. For example, the HTTP request for the URI-M:

http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/


to the original archive resulted in "200 OK" as the excerpt of the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/
WARC-Date: 2018-09-03T08:40:43Z
WARC-Record-ID: <urn:uuid:0b947600-af55-11e8-9b13-5bce71cafd38>
Content-Type: application/http; msgtype=response
Content-Length: 28447

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 08:40:30 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 27888670
Memento-Datetime: Sun, 23 Dec 2012 03:18:37 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="first memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="last memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="prev memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="next memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/timegate/http://www2008.org/>; rel="timegate", <http://www2008.org/>; rel="original", <http://collections.internetmemory.org/nli/timemap/http://www2008.org/>; rel="timemap"; type="application/link-format"



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
...

The representation of the memento is illustrated below:


In internetmemory.org

The request to the corresponding URI-M:

http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head --silent http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Content-Type: text/html;charset=utf-8
Content-Length: 4902
Date: Thu, 05 Sep 2019 08:28:27 GMT
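When cURL follows redirects (--head --location), it prints one header block per response, so the final status code is the one on the last status line. The sketch below shows one way to pull the chain of status codes out of saved cURL output. The function names are illustrative, and this is not necessarily how the status codes in this study were collected.

```python
import re
from typing import List

# Matches status lines such as "HTTP/1.1 404 Not Found" or "HTTP/2 404".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})", re.MULTILINE)


def status_chain(curl_head_output: str) -> List[int]:
    """Return every HTTP status code in the output, in the order received."""
    return [int(code) for code in STATUS_LINE.findall(curl_head_output)]


def final_status(curl_head_output: str) -> int:
    """Return the status code of the last response in the output."""
    chain = status_chain(curl_head_output)
    if not chain:
        raise ValueError("no HTTP status line found")
    return chain[-1]


if __name__ == "__main__":
    # Beginning of the Archive-It response shown above.
    session = "HTTP/1.1 404 Not Found\nServer: Apache-Coyote/1.1\n"
    print(status_chain(session), final_status(session))
```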

The list of all 979 URI-Ms is appended below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
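The last three fields can be derived from the final URI-Ms and their status codes. The Python sketch below is illustrative only. It assumes that the 14-digit timestamp in a URI-M equals its Memento-Datetime (true for the examples in this post) and that delta is an absolute difference. The helper names split_urim and compare are hypothetical.

```python
import re
from datetime import datetime, timezone

# <scheme>://<host>/<optional collection path>/<14-digit timestamp>/<URI-R>
URIM = re.compile(r"^https?://[^/]+/(?:.*?/)?(\d{14})/(.+)$")


def split_urim(urim: str):
    """Split a URI-M into (datetime in UTC, URI-R)."""
    match = URIM.match(urim)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {urim}")
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc), match.group(2)


def compare(final_original: str, original_status: int,
            final_new: str, new_status: int) -> dict:
    """Compute delta, same_final_URI-Rs and same_final_URI-Ms_status_code."""
    dt_original, uri_r_original = split_urim(final_original)
    dt_new, uri_r_new = split_urim(final_new)
    return {
        # Assumption: delta is reported as an absolute number of seconds.
        "delta": abs(int((dt_new - dt_original).total_seconds())),
        "same_final_URI-Rs": uri_r_original == uri_r_new,
        "same_final_URI-Ms_status_code": original_status == new_status,
    }


if __name__ == "__main__":
    # The www2008.org example from Observation 5.
    print(compare(
        "http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/", 200,
        "http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/", 404,
    ))
```

For the www2008.org pair, the timestamps and URI-Rs match and only the status codes differ.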

Conclusions

We did not find any changes in the 979 mementos of the National Library of Ireland (NLI) collection when they were moved from europarchive.org to internetmemory.org in May 2018. Both archives used the same replay tool and archival banners. The NLI collection was then moved to archive-it.org in September 2018. We found that 192 out of the 979 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that functionality that used to be available in the original archival banner is no longer available in the new archive. Finally, we noticed that internetmemory.org and archive-it.org respond differently to requests for archival 4xx/5xx responses.

In the upcoming posts we will provide some details about changes in the archives:
--Mohamed Aturban