Wednesday, September 25, 2019

2019-09-25: Where did the archive go? Part 3: Public Record Office of Northern Ireland


In Where did the archive go? Part 1, we provided some details about changes in the Library and Archives Canada archive. After they upgraded their replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In Part 2, we focused on the movement of the National Library of Ireland (NLI) collection. Mementos from the NLI collection were moved from the European Archive to the Internet Memory Foundation (IMF) archive. Then, they were moved to Archive-It. We found that 192 mementos, out of 979, cannot be found in Archive-It.

In Part 3 of this four-part series, we focus on changes in the Public Record Office of Northern Ireland (PRONI) Web Archive. In October 2018, mementos in the PRONI archive were moved to Archive-It (archive-it.org). We discovered that 114 out of 469 mementos cannot be found in Archive-It (i.e., they are missing mementos).

We refer to the archive from which mementos have moved as the "original archive" (i.e., PRONI archive), and we use the "new archive" to refer to the archive to which the mementos have moved (i.e., Archive-It). A memento is identified by a URI-M as defined in the Memento framework.

We have several observations about the changes in the PRONI web archive:

Observation 1: The HTTP request to a URI-M in the PRONI archive does not redirect to the corresponding URI-M in the new archive

As shown in the cURL session below, every request to a memento (URI-M) in the PRONI archive now ends with the HTTP status code "404 Not Found" (here after a redirect from HTTP to HTTPS):

$ curl --head --location http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

HTTP/1.1 302 Found
Cache-Control: no-cache
Content-length: 0
Location: https://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
HTTP/2 404
date: Fri, 20 Sep 2019 08:13:45 GMT
server: Apache/2.4.18 (Ubuntu)
content-type: text/html; charset=iso-8859-1
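
When checking many URI-Ms, it can help to reduce each cURL session to its chain of status codes. The following is a minimal Python sketch, assuming the text produced by `curl --head --location` has been captured as a string; the function name `status_chain` is hypothetical and not part of any tool used in the study.

```python
import re

# Matches status lines such as "HTTP/1.1 302 Found" or "HTTP/2 404".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})", re.MULTILINE)


def status_chain(head_output: str) -> list[int]:
    """Return every HTTP status code found in the output, in order."""
    return [int(code) for code in STATUS_LINE.findall(head_output)]


# Abridged copy of the session shown above.
session = """HTTP/1.1 302 Found
Cache-Control: no-cache
Content-length: 0
Location: https://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
HTTP/2 404
date: Fri, 20 Sep 2019 08:13:45 GMT
"""

chain = status_chain(session)
print(chain)      # [302, 404]
print(chain[-1])  # 404, the final status code after following redirects
```

The last element of the chain is the "final HTTP status code" that the rest of this post compares between the two archives.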

PRONI did not leave a machine-readable method of locating the new URI-Ms. However, we were able to manually discover the corresponding URI-Ms in Archive-It. For example, the memento:

http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

is now available at:

http://wayback.archive-it.org/11112/20100218151844/http://www.berr.gov.uk/

The representations of both mementos are illustrated in the figure below:


Unlike the European Archive and IMF, the Public Record Office of Northern Ireland (PRONI) still owns the domain name of the original archive, webarchive.proni.gov.uk. Therefore, to maintain link integrity via "follow-your-nose", PRONI could issue redirects (even though it currently does not) to the corresponding URI-Ms in Archive-It. For example, since PRONI uses the Apache web server, the mod_rewrite rule that could be used to perform automatic redirects is:

# With mod_rewrite
RewriteEngine on
RewriteRule "^/(\d{14})/(.+)" http://wayback.archive-it.org/11112/$1/$2 [L,R=301]
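
For clients that cannot wait for a server-side redirect, the same mapping can be applied locally. This is a small Python sketch of the rewrite rule above, not something PRONI or Archive-It provides; the function name `to_archive_it` is hypothetical, and the collection identifier 11112 is taken from the Archive-It URI-Ms shown in this post.

```python
import re

# Same idea as the RewriteRule: a 14-digit timestamp followed by the URI-R.
PRONI_URI_M = re.compile(r"^https?://webarchive\.proni\.gov\.uk/(\d{14})/(.+)$")


def to_archive_it(uri_m: str, collection: str = "11112"):
    """Map a PRONI URI-M to an Archive-It URI-M, or return None if it does not match."""
    match = PRONI_URI_M.match(uri_m)
    if match is None:
        return None
    timestamp, uri_r = match.groups()
    return f"http://wayback.archive-it.org/{collection}/{timestamp}/{uri_r}"


print(to_archive_it("http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/"))
# http://wayback.archive-it.org/11112/20100218151844/http://www.berr.gov.uk/
```

The mapping only constructs a candidate URI-M; whether the new archive actually holds a memento with that exact timestamp still has to be checked with a request.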

Observation 2: The functionality of the original archival banner is gone

Similar to the archival banners provided by the European Archive and IMF, the custom archival banner of the PRONI archive (marked in red in the top screenshot in the figure above) let users navigate through the available mementos. Users could view the list of available mementos and the representation of a selected memento on the same page. Archive-It, on the other hand, uses its standard playback banner (marked in red in the bottom screenshot in the figure above), which does not offer the same functionality as the original archive's banner. The new banner informs users that they are viewing an "archived" web page and shows multiple links. One of the links leads to a web page that lists all available mementos in archive-it.org. For example, the figure below shows the available mementos for the web page http://www.berr.gov.uk/:



Observation 3: Not all mementos are available in the new archive

We define a memento as "missing" if the values of the Memento-Datetime, the URI-R, and the final HTTP status code of the memento from the original archive are not all identical to the values of the corresponding memento in the new archive. In this study, we found 114 missing mementos (out of 469) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. For example, when requesting the URI-M:

http://webarchive.proni.gov.uk/20160901021637/https://www.flickr.com/

from the original archive (PRONI) on 2017-12-01, the archive responded with "200 OK" and the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Thu, 01 Sep 2016 02:16:37 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/11112/20160901021637/https://www.flickr.com/

from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/11112/20170401014520/https://www.flickr.com/

Although the representations of both mementos are identical (except for the archival banners), as shown in the figure below, we consider the memento from the original archive as missing because the two mementos have different Memento-Datetime values (i.e., Sat, 01 Apr 2017 01:45:20 GMT in the new archive), a delta of about 212 days.

$ curl --head --location --silent http://wayback.archive-it.org/11112/20160901021637/http://www.flickr.com/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170401014520/https://www.flickr.com/
HTTP/1.1 200 OK
Memento-Datetime: Sat, 01 Apr 2017 01:45:20 GMT
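
The delta between two mementos can be computed directly from the 14-digit timestamps in their URI-Ms. The sketch below assumes that the timestamp in the final URI-M equals the Memento-Datetime header, as it does in the session above; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_timestamp(ts: str) -> datetime:
    """Parse the 14-digit timestamp of a URI-M (YYYYMMDDhhmmss, interpreted as UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def delta_seconds(original_ts: str, new_ts: str) -> int:
    """Absolute difference, in seconds, between two URI-M timestamps."""
    diff = parse_timestamp(new_ts) - parse_timestamp(original_ts)
    return abs(int(diff.total_seconds()))


delta = delta_seconds("20160901021637", "20170401014520")
print(delta)          # 18314923
print(delta / 86400)  # just under 212 days
```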



We found that 63 of the 114 missing mementos differ in Memento-Datetime by less than 11 seconds. For example, the request to the memento:

http://webarchive.proni.gov.uk/20170102004044/http://www.fws.gov/

from the original archive on 2017-11-18 returned a "302" redirect to

http://webarchive.proni.gov.uk/20170102004044/https://fws.gov/

The request to the corresponding memento:

http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/

from the new archive redirects to the memento:

http://wayback.archive-it.org/11112/20170102004051/https://www.fws.gov/

as the cURL session below shows:

$ curl --head --location --silent http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170102004051/https://www.fws.gov/
HTTP/1.1 200 OK
Memento-Datetime: Mon, 02 Jan 2017 00:40:51 GMT

There is a 7-second difference between the timestamps of the two URI-Ms (00:40:44 versus 00:40:51), which might not be semantically significant (apparently just a change in the canonicalization of the URIs, with http://www.fws.gov/ redirecting to https://www.fws.gov/), but we do not consider the memento in the original archive to be identical to the corresponding memento in the new archive because the Memento-Datetime values differ.

When the new archive receives an archived collection from the original archive, it may apply some post-crawling techniques to the received files (e.g., WARC files) including deduplication, spam filtering, and indexing. This may result in mementos in the new archive that have different values of the Memento-Datetime compared to their corresponding values in the original archive.
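
The definition of a missing memento used in this observation can be written as a three-way comparison. This is an illustrative Python sketch rather than the code used in the study; the `Capture` class and `is_missing` function are hypothetical names, and the URI-Rs are compared as plain strings, which is an assumption (the post does not say whether any canonicalization was applied).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """Values observed for the final URI-M after following redirects."""
    timestamp: str    # 14-digit Memento-Datetime, e.g. "20160901021637"
    uri_r: str        # URI of the original resource
    status_code: int  # final HTTP status code


def is_missing(original: Capture, new: Capture) -> bool:
    """A memento is missing if any of the three values changed after the move."""
    return (
        original.timestamp != new.timestamp
        or original.uri_r != new.uri_r
        or original.status_code != new.status_code
    )


# The flickr.com example from this observation.
proni = Capture("20160901021637", "https://www.flickr.com/", 200)
archive_it = Capture("20170401014520", "https://www.flickr.com/", 200)

print(is_missing(proni, archive_it))  # True: the Memento-Datetime differs
print(is_missing(proni, proni))       # False: all three values are identical
```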

Observation 4: PRONI provides a list of original pages' URIs (URI-Rs)

Mementos in the PRONI archive were moved to Archive-It under the archival collection https://archive-it.org/collections/11112/:


As shown in Observation 1, requests to URI-Ms in PRONI do not redirect to Archive-It. However, webarchive.proni.gov.uk provides a list of all original resources' URIs (URI-Rs) for which mementos have been created as the following figure shows:



For instance, suppose we want to find the memento in Archive-It that corresponds to the PRONI memento:
URI-M = http://webarchive.proni.gov.uk/20150318223351/http://www.afbini.gov.uk/

which has:
URI-R = http://www.afbini.gov.uk/
Memento-Datetime = Wed, 18 Mar 2015 22:33:51 GMT

From the index at webarchive.proni.gov.uk, we can click on the URI-R www.afbini.gov.uk, which will redirect to an Archive-It HTML page that contains all available mementos for the selected URI-R as shown in the figure below:



Finally, we choose the capture from 2015-03-18 since it matches the date of the Memento-Datetime in the original archive. The representation of the memento is shown below:



Although the PRONI archive does not issue "301" redirects to URI-Ms in the new archive (i.e., PRONI does not provide a direct mapping between original URI-Ms and new URI-Ms), users of the archive can indirectly find the corresponding URI-Ms as explained above.

Observation 5: Archival 4xx/5xx responses are handled differently

Similar to the European Archive and IMF, the replay tool in the original archive (proni.gov.uk) was configured so that it returned the status code "200 OK" for archival 4xx/5xx responses. For example, when requesting the memento:

http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/

on 2017-11-18, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81bb8530-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 60568

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
...

Even the response for the inner iframe, in which the archived content is loaded, had the status "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/content/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81c967e0-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 4587

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 24849777 24360264
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<title>[ARCHIVED CONTENT] 403 FORBIDDEN : LOGGED BY www.megalithic.co.uk</title>
</head>
<body style=\"font:Arial Black,Arial\">
<p><b><font color="#FF0000" size="+2" face="Verdana, Arial, Helvetica, sans-serif" style="text-decoration:blink; color:#FF0000; background:#000000;">&nbsp;&nbsp;&nbsp;&nbsp; 403 FORBIDDEN! &nbsp;&nbsp;&nbsp;&nbsp;</font></b></p>
<b>
<p><font color="#FF0000">You have been blocked from the Megalithic Portal by our defence robot.<br>Possibly the IP address you are accessing from has been banned for previous bad behavior or you have attempted a hostile action.<br>If you think this is an error please click the Trouble Ticket link below to communicate with the site admin.</font></p>
...

When requesting the corresponding memento:

http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

Archive-It properly returned the status code 403 for the archival 403:

$ curl --head http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

HTTP/1.1 403 Forbidden
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
X-Archive-Guessed-Charset: windows-1252
X-Archive-Orig-Server: Apache/2.2.15 (CentOS)
X-Archive-Orig-Connection: close
X-Archive-Orig-X-Powered-By: PHP/5.3.3
X-Archive-Orig-Status: 403 FORBIDDEN
X-Archive-Orig-Content-Length: 3206
X-Archive-Orig-Date: Tue, 16 Feb 2016 15:39:59 GMT
X-Archive-Orig-Warning: 199 www.megalithic.co.uk:80 You_are_abusive/hacking/spamming_www.megalithic.co.uk
X-Archive-Orig-Content-Type: text/html; charset=iso-8859-1
...
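
A client can use the X-Archive-Orig-Status header seen in the response above to recover the status code recorded at capture time. The Python sketch below assumes the response headers are already available as a dictionary; whether every Archive-It response carries this header is an assumption, and both function names are hypothetical.

```python
def archival_status(headers: dict):
    """Return the status code recorded at capture time, or None if it is not exposed."""
    for name, value in headers.items():
        if name.lower() == "x-archive-orig-status":
            parts = value.split()
            if parts and parts[0].isdigit():
                return int(parts[0])
    return None


def hides_archival_error(replay_status: int, original_status) -> bool:
    """True when the replay answers 200 although the capture itself was a 4xx/5xx."""
    return replay_status == 200 and original_status is not None and original_status >= 400


# Two of the headers from the Archive-It response above.
headers = {
    "Memento-Datetime": "Tue, 16 Feb 2016 15:40:00 GMT",
    "X-Archive-Orig-Status": "403 FORBIDDEN",
}

print(archival_status(headers))         # 403
print(hides_archival_error(403, 403))   # False: Archive-It passes the 403 through
print(hides_archival_error(200, 403))   # True: the behavior of the original archive
```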

The representations of both mementos are illustrated below:



Observation 6: Some URI-Rs were removed from PRONI

Mementos may disappear when moving from the original archive to the new archive. For example, the request to the URI-M:

http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/

from the original archive resulted in "200 OK" as the excerpt of the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/
WARC-Date: 2017-11-18T02:06:22Z
WARC-Record-ID: <urn:uuid:13136370-cc05-11e7-83e9-19ddf7ecdbd2>
Content-Type: application/http; msgtype=response
Content-Length: 28657

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 02:06:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 20429973
Memento-Datetime: Tue, 08 Apr 2014 18:55:12 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="first memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="last memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="prev memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="next memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/timegate/http://www.www126.com/>; rel="timegate", <http://www.www126.com/>; rel="original", <http://webarchive.proni.gov.uk/timemap/http://www.www126.com/>; rel="timemap"; type="application/link-format"
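
The Link header above carries the Memento relations (original, timegate, timemap, and the memento links). The following Python sketch extracts them with regular expressions; it only handles the `<uri>; name="value"` form seen in this header and is not a complete Link header parser. The function names are hypothetical.

```python
import re

# One entry: "<uri>" followed by zero or more ; name="value" parameters.
LINK_ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value: str) -> list[dict]:
    """Split a Link header value into dictionaries with 'uri' plus its parameters."""
    links = []
    for uri, params in LINK_ENTRY.findall(value):
        entry = {"uri": uri}
        entry.update(PARAM.findall(params))
        links.append(entry)
    return links


def find_rel(links: list[dict], rel: str) -> list[str]:
    """Return the URIs whose rel attribute contains the given relation type."""
    return [link["uri"] for link in links if rel in link.get("rel", "").split()]


# Abridged version of the header above.
link_value = (
    '<http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; '
    'rel="first memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", '
    '<http://webarchive.proni.gov.uk/timegate/http://www.www126.com/>; rel="timegate", '
    '<http://www.www126.com/>; rel="original", '
    '<http://webarchive.proni.gov.uk/timemap/http://www.www126.com/>; '
    'rel="timemap"; type="application/link-format"'
)

links = parse_link_header(link_value)
print(find_rel(links, "original"))  # ['http://www.www126.com/']
print(find_rel(links, "timemap"))   # ['http://webarchive.proni.gov.uk/timemap/http://www.www126.com/']
```

The commas inside the datetime values are the reason the entries are matched with a pattern instead of being split on commas.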

The representation of the memento is illustrated below:

The request to the corresponding URI-M:

http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Content-Length: 4910
Date: Tue, 24 Sep 2019 02:00:45 GMT

Before transferring collections to the new archive, the original archive may have reviewed its collections and removed URI-Rs/URI-Ms that were considered off topic (you may also read about the off-topic memento toolkit) or spam (e.g., the URI-R www.www126.com is about auto insurance).

Observation 7: PRONI may have used the European Archive and IMF as hosting services

PRONI used an archival banner that is similar to what the European Archive and IMF used. Furthermore, the three archives returned a similar set of HTTP response headers to requests for mementos. Values of multiple response headers (e.g., Server and Via) are identical as shown below:

The European Archive:
URI-M = http://collection.europarchive.org/nli/20161213111140/http://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collection.europarchive.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2017-12-01T18:15:32Z
WARC-Record-ID: <urn:uuid:9e705c20-d6c3-11e7-9f9e-5371622c3ef9>
Content-Type: application/http; msgtype=response
Content-Length: 159855

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 18:15:18 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 15695216
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

The IMF archive:
URI-M = http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2018-09-03T16:33:41Z
WARC-Record-ID: <urn:uuid:1e3173c0-af97-11e8-8819-6df9b412b877>
Content-Type: application/http; msgtype=response
Content-Length: 300060

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:33:27 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 11721161
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link:...

The PRONI archive:
URI-M = http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
WARC-Date: 2017-12-01T01:59:56Z
WARC-Record-ID: <urn:uuid:5452da60-d63b-11e7-91e2-8bf9abaf94b4>
Content-Type: application/http; msgtype=response
Content-Length: 36908

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 01:59:43 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 16262557
Memento-Datetime: Thu, 18 Feb 2010 15:18:44 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...
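
The comparison of the three responses can be automated by intersecting their headers. This Python sketch uses a subset of the header values shown above; the function name `shared_headers` is hypothetical, and identical values are only a hint of shared infrastructure, not proof of it.

```python
def shared_headers(*responses: dict) -> dict:
    """Return the headers (lower-cased names) whose values are identical in every response."""
    if not responses:
        return {}
    normalized = [{name.lower(): value for name, value in r.items()} for r in responses]
    first = normalized[0]
    common_names = set(first).intersection(*normalized[1:])
    return {
        name: first[name]
        for name in sorted(common_names)
        if all(r[name] == first[name] for r in normalized)
    }


# Subset of the response headers listed above for each archive.
european_archive = {"Server": "Apache/2.4.10", "Via": "1.1 varnish-v4",
                    "cache-control": "max-age=86400", "X-Varnish": "15695216"}
imf = {"Server": "Apache/2.4.10", "Via": "1.1 varnish-v4",
       "cache-control": "max-age=86400", "X-Varnish": "11721161"}
proni = {"Server": "Apache/2.4.10", "Via": "1.1 varnish-v4",
         "cache-control": "max-age=86400", "X-Varnish": "16262557"}

print(shared_headers(european_archive, imf, proni))
# {'cache-control': 'max-age=86400', 'server': 'Apache/2.4.10', 'via': '1.1 varnish-v4'}
```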

We found that the PRONI archival collection was listed as one of the collections maintained by the European Archive and IMF as the figure below shows:

https://web.archive.org/web/20180707131510/http://collections.internetmemory.org/

Even though the PRONI collection was moved from the European Archive to IMF, URI-Ms served by proni.gov.uk did not change. It is possible that the PRONI archive followed a strategy of serving mementos under proni.gov.uk while using the hosting services provided by the European Archive and IMF. Thus, the regular users of the PRONI archive did not notice any change in URI-Ms. We do not think custom domains are available with Archive-It, so PRONI was presumably unable to continue to host its mementos in its own URI namespace.

The list of all 469 URI-Ms is appended below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
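
The fields listed above are enough to recount the missing mementos. The sketch below assumes the file is a CSV with a header row that uses the field names in parentheses and that the two boolean fields are written as "True"/"False"; both are assumptions about the file layout. The sample rows are illustrative values built from examples in this post, not rows copied from the published file, and only the columns needed for the count are included.

```python
import csv
import io


def count_missing(csv_text: str) -> tuple[int, int]:
    """Return (missing, total) using the delta and the two same_* fields of each row."""
    missing = total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        unchanged = (
            int(row["delta"]) == 0
            and row["same_final_URI-Rs"] == "True"
            and row["same_final_URI-Ms_status_code"] == "True"
        )
        if not unchanged:
            missing += 1
    return missing, total


# Illustrative rows only (hypothetical encoding of the fields).
sample = """original_URI-M,delta,same_final_URI-Rs,same_final_URI-Ms_status_code
http://webarchive.proni.gov.uk/20160901021637/https://www.flickr.com/,18314923,True,True
http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/,0,True,False
http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/,0,True,True
"""

print(count_missing(sample))  # (2, 3)
```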

Conclusions

We described the movement of mementos from the PRONI archive (proni.gov.uk) to archive-it.org in October 2018. We found that 114 out of the 469 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that the functionality that used to be available in the original archival banner is gone. We noticed that the two archives, proni.gov.uk and archive-it.org, react differently to requests for archival 4xx/5xx responses. In the upcoming post, we will provide some details about changes in webcitation.org.


--Mohamed Aturban



