Friday, August 30, 2019

2019-08-30: Where did the archive go? Part 1: Library and Archives Canada


Web archives are established with the objective of providing permanent access to archived web pages, or mementos. However, in our 14-month study of 16,627 mementos from 17 public web archives, we found that three web archives changed their base URLs and did not leave a machine-readable method of locating their new URLs. We were able to discover the new URLs of all three archives manually. A fourth archive has partially ceased operations.

(1) Library and Archives Canada (collectionscanada.gc.ca)
Around May 2018, mementos in this archive were moved to a new archive (webarchive.bac-lac.gc.ca), which has a different domain name. We noticed that 49 mementos (out of 351) cannot be found in the new archive.

(2) The National Library of Ireland (NLI) 
Around May 2018, the European Archive (europarchive.org) was shut down and the domain name was purchased by another entity. The National Library of Ireland (NLI) collection preserved by this archive was moved to another archive (internetmemory.org). At that point, all 979 mementos could be retrieved from internetmemory.org (i.e., no missing mementos). Around September 2018, internetmemory.org became unreachable (timeout error), and the NLI collection was moved again, this time to archive-it.org. The other archived collections in internetmemory.org may also have been moved to archive-it.org or to other archives. After this second move, the number of missing NLI mementos is 192 (out of 979).

(3) Public Record Office of Northern Ireland (PRONI) (webarchive.proni.gov.uk)
Around October 2018, all mementos preserved by this archive were moved to archive-it.org. The PRONI archive's homepage is still online and shows a list of web pages' URLs (not mementos' URLs). Clicking on any of these URLs redirects to an HTML page in archive-it.org that shows the available mementos (i.e., the TimeMap) associated with the selected URL. The number of missing mementos from PRONI is 114 (out of 469).

(4) WebCite (webcitation.org)
The archive was unreachable (timeout error) for about a month (from June 6, 2019 to July 8, 2019). It no longer accepts new archiving requests, but it still provides access to all preserved mementos.

Library and Archives Canada 

In this post, we provide some details about changes in the archive Library and Archives Canada. Changes in the other three archives will be described in upcoming posts.

We refer to the archive from which mementos have moved as the "original archive", and we use the "new archive" to refer to the archive to which the mementos have moved. A memento is identified by a URI-M as defined in the Memento framework. 

In our study, we have 351 mementos from collectionscanada.gc.ca. Around May 2018, 302 of those mementos were moved to webarchive.bac-lac.gc.ca (the remaining 49 mementos are lost). For instance, the memento:

http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/

is now available at:

http://webarchive.bac-lac.gc.ca:8080/wayback/20051228174058/http://nationalatlas.gov/

The representations of both mementos are illustrated in the figure below. The original archive uses the green banner (left) while the new archive uses the yellow banner (right):



We have several observations about the change in the archive Library and Archives Canada:


Observation 1: The HTTP request of a URI-M from the original archive does not redirect to the corresponding URI-M in the new archive

The institution (Library and Archives Canada) that has developed the new archive (webarchive.bac-lac.gc.ca) still controls and maintains the domain name of the original archive (www.collectionscanada.gc.ca). Thus, it would be possible for requests of mementos (URI-Ms) to the original archive to redirect to the corresponding URI-Ms in the new archive. However, we found that every memento request to the original archive redirected to the home page of the new archive as shown below:

$ curl --head --location --silent http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/ | egrep -i "(HTTP/1.1|^location:)"

Location: http://www.bac-lac.gc.ca/eng/discover/archives-web-government/Pages/web-archives.aspx
HTTP/1.1 302 Found
Location: http://webarchive.bac-lac.gc.ca/?lang=en
HTTP/1.1 200

Here is the representation of the home page of the new archive:



We had to intervene manually to detect the corresponding URI-Ms of the mementos in the new archive, which can be done by replacing "www.collectionscanada.gc.ca/webarchives" with "webarchive.bac-lac.gc.ca:8080/wayback" in the URI-Ms of the original archive.
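The manual mapping described above amounts to a prefix substitution. Here is a minimal Python sketch of it, assuming every URI-M from the original archive starts with the prefix quoted above. The function name to_new_urim and the two constants are our own illustrative names, not part of any archive's tooling:

```python
OLD_PREFIX = "http://www.collectionscanada.gc.ca/webarchives/"
NEW_PREFIX = "http://webarchive.bac-lac.gc.ca:8080/wayback/"


def to_new_urim(original_urim: str) -> str:
    """Map a URI-M in the original archive to its expected URI-M in the new archive."""
    if not original_urim.startswith(OLD_PREFIX):
        raise ValueError(f"not a URI-M from the original archive: {original_urim}")
    # Only the archive prefix changes; the timestamp and the URI-R are kept as-is.
    return NEW_PREFIX + original_urim[len(OLD_PREFIX):]


print(to_new_urim(
    "http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/"
))
```

The result is only the expected location: whether the new archive actually holds a memento with that exact timestamp still has to be checked by requesting it.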

This reminds us of The End of Term Archive (eot.us.archive.org) which was established with the goal of preserving the United States government web (.gov). The domain name (eot.us.archive.org) is still under the control of the Internet Archive (archive.org). The example below shows how the HTTP request to a URI-M in the End of Term Archive redirects to the corresponding URI-M in the Internet Archive. This practice maintains link integrity via "follow-your-nose" from the old URI-M to the new URI-M.

$ curl --head --location --silent http://eot.us.archive.org/eot/20120520120841/http://www2.ed.gov/espanol/parents/academic/matematicas/brochure.pdf | egrep -i "(HTTP/|^location:)"

HTTP/1.1 302 Found
Location: https://web.archive.org/web/20120520120841/http://www2.ed.gov/espanol/parents/academic/matematicas/brochure.pdf
HTTP/2 200

The original archive could rewrite its URI-Ms and have them redirect (301 Moved Permanently) to their corresponding URI-Ms in the new archive. For example, on the Apache web server, the mod_rewrite module can be used to perform automatic redirects and rewrite requested URIs on the fly. Here is a rewrite rule example that the original archive could use to redirect requests to the new archive:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]

If the original archive serves only mementos under /webarchives, then the mod_rewrite rule would be even simpler:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1  [L,R=301]


Observation 2: Not all mementos are available in the new archive

Each memento (URI-M) represents a prior version of an original web page (URI-R) at a particular datetime (Memento-Datetime). The timestamp, usually included in a URI-M, is identical to the value of the response header Memento-Datetime. 

For example, for:

URI-M = http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/

we have:

Memento-Datetime = Wed, 08 Feb 2006 07:50:19 GMT
URI-R = http://www.cdc.gov/
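The decomposition above can be sketched in a few lines of Python. This is only an illustration, assuming the URI-M has the shape prefix/14-digit-timestamp/URI-R used by both archives in this post; the authoritative Memento-Datetime is always the response header, and the names split_urim and URIM_PATTERN are ours:

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed URI-M shape: <archive prefix>/<14-digit timestamp>/<URI-R>
URIM_PATTERN = re.compile(r"^(?P<prefix>.+?)/(?P<timestamp>\d{14})/(?P<urir>.+)$")


def split_urim(urim: str) -> tuple[str, str]:
    """Return (Memento-Datetime derived from the timestamp, URI-R) for a URI-M."""
    match = URIM_PATTERN.match(urim)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in: {urim}")
    # Archive timestamps are in UTC (YYYYMMDDhhmmss).
    captured = datetime.strptime(match["timestamp"], "%Y%m%d%H%M%S")
    captured = captured.replace(tzinfo=timezone.utc)
    # Format the datetime the way HTTP headers such as Memento-Datetime do.
    return format_datetime(captured, usegmt=True), match["urir"]


print(split_urim(
    "http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/"
))
```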

For a URI-M from the original archive, if the values of the Memento-Datetime, the URI-R, and the final HTTP status code are not all identical to the values of the corresponding URI-M from the new archive, we call it a missing memento.
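As a sketch of this definition (not the code used in the study), the comparison can be written as follows. The Observation record and the is_missing function are hypothetical names introduced for illustration; the values in the example are the ones reported for the cdc.gov memento in this post:

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """What is recorded for one URI-M after following redirects (hypothetical record)."""
    memento_datetime: str  # value of the Memento-Datetime of the final URI-M
    urir: str              # original resource of the final URI-M
    status_code: int       # final HTTP status code


def is_missing(original: Observation, new: Observation) -> bool:
    """A memento is missing if any of the three values differs between the archives."""
    return (
        original.memento_datetime != new.memento_datetime
        or original.urir != new.urir
        or original.status_code != new.status_code
    )


before = Observation("Wed, 08 Feb 2006 07:50:19 GMT", "http://www.cdc.gov/", 200)
after = Observation("Thu, 26 Oct 2006 06:02:47 GMT", "http://www.cdc.gov/", 200)
print(is_missing(before, after))
```

Note that this definition compares metadata only; it does not compare the content of the two representations.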

In this study, we found that 49 mementos (out of 351) cannot be retrieved from the new archive. Instead, the archive responds with other mementos that have different Memento-Datetimes. Those mementos may (or may not) have the same content as the mementos returned by the original archive. For example, when we requested the URI-M:

http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/

from the original archive (www.collectionscanada.gc.ca) on February 27, 2018, we received the HTTP status "200 OK" with the following representation (the Memento-Datetime of this memento is Wed, 08 Feb 2006 07:50:19 GMT):


In www.collectionscanada.gc.ca
Then, we requested the corresponding URI-M:

http://webarchive.bac-lac.gc.ca:8080/wayback/20060208075019/http://www.cdc.gov/

from the new archive. As shown in the cURL session below, the request redirected to another URI-M:

http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/

This memento has a different Memento-Datetime (Thu, 26 Oct 2006 06:02:47 GMT) for a delta of about 260 days. The content of this memento (the figure below) in the new archive is different from the content of the memento that used to be available in the original archive (the figure above).
In webarchive.bac-lac.gc.ca
$ curl --head --location --silent http://webarchive.bac-lac.gc.ca:8080/wayback/20060208075019/http://www.cdc.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/
HTTP/1.1 200 OK
Memento-Datetime: Thu, 26 Oct 2006 06:02:47 GMT
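The delta of about 260 days mentioned above can be reproduced from the two 14-digit timestamps alone. A small Python sketch (the helper name timestamp_delta is ours):

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # the 14-digit timestamps used in URI-Ms


def timestamp_delta(first: str, second: str) -> timedelta:
    """Absolute time difference between two 14-digit archive timestamps."""
    return abs(
        datetime.strptime(second, TIMESTAMP_FORMAT)
        - datetime.strptime(first, TIMESTAMP_FORMAT)
    )


delta = timestamp_delta("20060208075019", "20061026060247")
# Express the difference in days; it is slightly less than 260 days.
print(delta.total_seconds() / 86400)
```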

The figure below shows a set of screenshots of the memento taken over 14 months. The screenshots with a blue border are representations of the memento in the original archive (www.collectionscanada.gc.ca) before it was moved to the new archive. The screenshots with a red border show the home page of the new archive, captured before we manually detected the corresponding URI-Ms in the new archive. The screenshots with a green border show the representations resulting from requesting the memento from the new archive (webarchive.bac-lac.gc.ca). The representation before the archive's change (blue border) is different from the representation of the memento after the change (green border).


The memento was replayed 33 times within 14 months.

Observation 3: New features available in the new archive because of the upgraded replay tool

The new archive (webarchive.bac-lac.gc.ca) uses an updated version of OpenWayback (i.e., OpenWayback Release 1.6.0 or later) that enables new features, such as raw mementos and Memento support. These features were not supported by the original archive, which was running OpenWayback Release 1.4 (or earlier).

Raw mementos

At replay time, archives transform the original content of web pages to appropriately replay them (e.g., in a user’s browser). Archives add their own banners to provide metadata about both the memento being viewed and the original page. Archives also rewrite links of embedded resources in a page so that these resources are retrieved from the archive, not from the original server.

Many archives allow accessing unaltered, or raw, archived content (i.e., retrieving the archived original content without any type of transformation by the archive). The most common mechanism to retrieve the raw mementos is by adding "id_" after the timestamp in the requested URI-M.

The feature of retrieving the raw mementos was not provided by the original archive (www.collectionscanada.gc.ca). However, it is supported by the new archive. For example, to retrieve the raw content of the memento identified by the URI-M

http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/

we add "id_" after the timestamp as shown in the cURL session below:

$ curl --head --location --silent http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247id_/http://www.cdc.gov/ | egrep -i "(HTTP/|^Memento-Datetime)"

HTTP/1.1 200 OK
Memento-Datetime: Thu, 26 Oct 2006 06:02:47 GMT
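Building the raw URI-M from a regular URI-M can be automated. The sketch below assumes the archive supports the "id_" flag and that the first 14-digit path segment is the timestamp; the function name to_raw_urim is ours:

```python
import re


def to_raw_urim(urim: str) -> str:
    """Insert the "id_" flag after the 14-digit timestamp of a URI-M."""
    # Only the first timestamp is rewritten, in case the URI-R contains digits too.
    raw_urim, replacements = re.subn(r"/(\d{14})/", r"/\1id_/", urim, count=1)
    if replacements == 0:
        raise ValueError(f"no 14-digit timestamp found in: {urim}")
    return raw_urim


print(to_raw_urim(
    "http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/"
))
```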

Memento support

The Memento protocol is supported by most public web archives including the Internet Archive. The protocol introduces two HTTP headers for content negotiation. First, Accept-Datetime is an HTTP Request header through which a client can request a prior state of a web resource by providing the preferred datetime, for example,

Accept-Datetime: Mon, 09 Jan 2017 11:21:57 GMT.

Second, the Memento-Datetime HTTP Response header is sent by a server to indicate the datetime at which the resource was captured, for instance,

Memento-Datetime: Sun, 08 Jan 2017 09:15:41 GMT.

The Memento protocol also defines:

  • TimeMap: A resource that provides a list of mementos (URI-Ms) for a particular original resource, 
  • TimeGate: A resource that supports content negotiation based on datetime to access prior versions of an original resource. 
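To make the datetime negotiation concrete, here is a Python sketch that builds (but does not send) a request to a TimeGate with an Accept-Datetime header. The helper name build_timegate_request is ours, and the TimeGate URI in the example is the one advertised by the new archive for http://www.cdc.gov/:

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate: str, preferred: datetime) -> Request:
    """Prepare a HEAD request asking a TimeGate for the memento closest to `preferred`.

    `preferred` should be a timezone-aware datetime; it is converted to GMT.
    """
    value = format_datetime(preferred.astimezone(timezone.utc), usegmt=True)
    return Request(timegate, headers={"Accept-Datetime": value}, method="HEAD")


request = build_timegate_request(
    "http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/",
    datetime(2017, 1, 9, 11, 21, 57, tzinfo=timezone.utc),
)
# urllib normalizes the capitalization of header names; HTTP treats them case-insensitively.
print(request.get_method(), request.full_url, request.header_items())
```

Passing the request to urllib.request.urlopen would perform the negotiation; a TimeGate typically answers with a redirect to the selected URI-M, whose response carries the Memento-Datetime header.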
The cURL session below shows the TimeMap of the original resource (http://www.cdc.gov/) available in the new archive. The TimeMap indicates that the memento with the Memento-Datetime Wed, 08 Feb 2006 07:50:19 GMT (as described above) is not available in the new archive.

$ curl http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/

<http://www.cdc.gov/>; rel="original",
<http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/>; rel="self"; type="application/link-format"; from="Thu, 26 Oct 2006 06:02:47 GMT"; until="Fri, 09 Oct 2015 13:26:42 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/>; rel="timegate",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/>; rel="first memento"; datetime="Thu, 26 Oct 2006 06:02:47 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20151009132642/http://www.cdc.gov/>; rel="last memento"; datetime="Fri, 09 Oct 2015 13:26:42 GMT"
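Checking a TimeMap for a particular Memento-Datetime can also be done programmatically. The following Python sketch uses a simplified parser for the link format that is sufficient for the response above, but it is not a complete implementation of the format; the names parse_timemap and memento_datetimes are ours:

```python
import re

TIMEMAP = """<http://www.cdc.gov/>; rel="original",
<http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/>; rel="self"; type="application/link-format"; from="Thu, 26 Oct 2006 06:02:47 GMT"; until="Fri, 09 Oct 2015 13:26:42 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/>; rel="timegate",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/>; rel="first memento"; datetime="Thu, 26 Oct 2006 06:02:47 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20151009132642/http://www.cdc.gov/>; rel="last memento"; datetime="Fri, 09 Oct 2015 13:26:42 GMT"
"""


def parse_timemap(text: str) -> list[tuple[str, dict[str, str]]]:
    """Split a link-format TimeMap into (URI, attributes) pairs (simplified parser)."""
    entries = []
    # Entries are separated by a comma that is followed by the next "<URI>".
    for chunk in re.split(r",\s*(?=<)", text.strip()):
        uri = re.match(r"<([^>]*)>", chunk)
        if uri is None:
            continue
        attributes = dict(re.findall(r'(\w+)="([^"]*)"', chunk))
        entries.append((uri.group(1), attributes))
    return entries


def memento_datetimes(text: str) -> list[str]:
    """Return the datetime of every entry whose rel value includes "memento"."""
    return [
        attributes["datetime"]
        for _, attributes in parse_timemap(text)
        if "memento" in attributes.get("rel", "").split()
    ]


datetimes = memento_datetimes(TIMEMAP)
print("Wed, 08 Feb 2006 07:50:19 GMT" in datetimes)
```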

It is possible that two archives use the same version of OpenWayback but with different configuration options, such as whether to support the Memento framework or not:

 <bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
  <property name="accessPointPath" value="${wayback.url.prefix}/wayback/"/>
  <property name="internalPort" value="${wayback.url.port}"/>
  <property name="serveStatic" value="true" />
  <property name="bounceToReplayPrefix" value="false" />
  <property name="bounceToQueryPrefix" value="false" />
  <property name="enableMemento" value="true" />

or how to respond to (raw) archival redirects (thanks to Alex Osborne for help in locating this information):

<!-- WARN CLIENT ABOUT PATH REDIRECTS -->
<bean class="org.archive.wayback.replay.selector.RedirectSelector">
 <property name="renderer">
   <bean class="org.archive.wayback.replay.JSPReplayRenderer">
     <property name="targetJsp" value="/WEB-INF/replay/UrlRedirectNotice.jsp" />
   </bean>
 </property>
</bean>
...
<!-- Explicit (via "id_" flag) IDENTITY/RAW REPLAY -->
<bean class="org.archive.wayback.replay.selector.IdentityRequestSelector">
  <property name="renderer" ref="identityreplayrenderer"/>
</bean>


Observation 4: The HTTP status code may change in the new archive 

The HTTP status codes of URI-Ms in the new archive might not be identical to the HTTP status codes of the corresponding URI-Ms in the original archive. For example, the HTTP request of the URI-M:

http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.berlin.gc.ca/

to the original archive resulted in the following "302" redirects before it ended up with the HTTP status code "404":

http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.berlin.gc.ca/ (302)
http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (302)
http://www.collectionscanada.gc.ca/webarchives/20070220181204/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (302)
http://www.collectionscanada.gc.ca/webarchives/20070220181204/http://www.international.gc.ca/global/errors/404.asp?404%3Bhttp://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (404)

When we requested the corresponding URI-M from the new archive, it ended up with the HTTP status code "200":

http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181041/http://www.berlin.gc.ca/ (Redirect by JavaScript (JS))
http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181041/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (Redirect by JS)
http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181204/http://www.international.gc.ca/global/errors/404.asp?404%3Bhttp://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (Redirect by JS)
http://webarchive.bac-lac.gc.ca:8080/wayback/20071115025620/http://www.international.gc.ca/canada-europa/germany/ (302) 
http://webarchive.bac-lac.gc.ca:8080/wayback/20071115023828/http://www.international.gc.ca/canada-europa/germany/ (200)

The list of all 351 URI-Ms is shown below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M). 
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs). The different URI-Rs are labeled with "No", otherwise "-".
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code). The different status codes are labeled with "No", otherwise "-".
  • The first 49 rows contain the information of the missing mementos.


Conclusions

When Library and Archives Canada migrated their archive around May 2018, 49 of the 351 mementos we were tracking resurfaced in the new archive with a change in Memento-Datetime, URI-R, or the final HTTP status code. In three cases, the HTTP status codes of mementos in the new archive differ from the status codes in the original archive. Also, updating or upgrading a web archival replay tool (e.g., OpenWayback or PyWb) may affect how migrated mementos are indexed and replayed. In general, with any memento migration, we recommend that, when possible, requests for mementos to the original archive be redirected to their corresponding mementos in the new archive (e.g., the case of the End of Term Archive explained above).

In the upcoming posts, we will provide some details about the changes in the other three archives (NLI, PRONI, and WebCite).

--Mohamed Aturban
