Tuesday, October 1, 2019

2019-10-01: Attending the Machine Learning + Libraries Summit at the Library of Congress


On September 20, 2019, I attended the Machine Learning + Libraries Summit at the Library of Congress. The aim of the meeting was to gather computer and information scientists, engineers, data scientists, and librarians from reputable universities, government agencies, and industrial companies to generate ideas on questions such as how to expand the services of digital libraries, how to establish good collaborations with other groups on machine learning projects, and what factors to consider when designing a sustainable machine learning project, especially in the digital library domain. In the initial solicitation, the focus was cultural heritage, but the discussion went far beyond that.

The meeting featured many interesting lightning talks. Unfortunately, because of the relatively short time allocated, many questions and discussions had to go offline. The organizers also arranged several activities to stimulate brainstorming and teamwork among people from different places. I took notes on the speakers and presentations that are most relevant to my research.

The summit organizers solicited many other potentially interesting topics, but because there was not enough time, they opened a Google doc to create a "look book" in which people could post 3 slides highlighting their research and potential contributions to the project. The presentations were organized into 3 sections.

Section 1: Existing Projects
* Leen-Kiat Soh, Liz Lorang: University of Nebraska-Lincoln
  Their group focuses on newspapers and manuscript collections. The project, called Aida, is an exploratory effort in image segmentation and visual content extraction.

* Thu Phuong 'Lisa' Nguyen, Hoover Institution Library & Archives, Stanford University
  Her group is processing a digital collection of scanned documents published in the USA from 1898 to 1945. They are working toward extracting meaningful data through document layout analysis, page-level segmentation, and article segmentation. The text can be arranged in different directions (from left to right or from top to bottom), and some documents mix scripts, e.g., English and Japanese.

* Kurt Luther:  Assistant Professor of Computer Science and (by courtesy) History, Virginia Tech
  Kurt leads a group working on a project called Civil War Photo Sleuth, which combines crowdsourcing and face recognition to identify historical portraits. They have about 4 million portraits today, but only 10-20% are identified. They have developed a crowdsourcing platform with about 10k registered users.

* Ben Lee + Michael Haley Goldman: United States Holocaust Memorial Museum
  Ben and Michael are working on a project that involves 190 million images from WWII. Their goal is to trace missing family members. This dataset is an invaluable resource for Holocaust survivors and their families, as well as for Holocaust researchers. They mostly use random forest models and template matching methods.

* Harish Maringanti: Associate Dean for IT & Digital Library Services; J. Willard Marriott Library at University of Utah

* David Smith: Associate Professor, Computer Science, Northeastern University
  David introduced his work on Networked Texts.

* Helena Sarin: Founder, Neural Bricolage
* Nick Adams: Goodly Labs

Section 2: Partnerships
* Mark Williams: Media Ecology Lab, Dartmouth College
  Mark mentioned an annotation tool called SAT "Semantic Annotation Tool".
* Karen Cariani: WGBH Media Library and Archives
* Josh Hadro + Tom Cramer: IIIF, Stanford Libraries
* Mia Ridge + Adam Farquhar: British Library
* Kate Murray: Library of Congress
* Representatives from the Smithsonian Institution Data Lab, National Museum of American History
  Rebecca from the Smithsonian OCIO Data Science Lab talked about machine learning at the Smithsonian. Some interesting and potentially useful tools include the Google Vision API, ResNet50, and VGG. Their experiments indicate that the Google tool achieves high performance but is not customizable, whereas ResNet50 and VGG have far lower success rates but can be customized and retrained.

* Jon Dunn + Shawn Averkamp: Indiana University Bloomington, AVP
* Peter Leonard: Yale DH Lab
  Peter talked about their project called PixPlot, which is a web interface to visualize about 30k images from Lund, Sweden. The source code is at https://github.com/YaleDHLab/pix-plot. The website is https://dhlab.yale.edu/projects/pixplot/.

Section 3: Future Possibilities & Challenges
* Michael Lesk: Rutgers University
  Michael talked about a duplicate image detector tool at NMAH, which covers between 1 and 2 TB of images stored on legacy hardware and network directories. The goal is to determine whether there are duplicates and, if so, which images have higher quality.

* Heather Yager: MIT Libraries
* Audrey Altman: Digital Public Library of America
* Hannah Davis: Independent Research Artist
  Hannah mentioned an interesting tagger: https://www.tagtog.net/

In addition, the summit arranged open discussions and activities to stimulate the attendees' thinking. Some noted questions were:
* How do we communicate machine learning results/outputs to end-users?
* How does one get ML products from the pilot to production?
* Do you know of existing practical guidelines for successful machine learning project agreements?
* How can we overcome the challenges of accessing variable resources across varying contexts, such as infrastructure, licensing, and copyright structure?
* Which criteria would you use to evaluate services, whether for providers or for internal/external use?

Another activity asked attendees at different tables to form groups and discuss factors to consider when collaborating on machine learning projects. Some noted points include:
* Standardizing and documenting data
* Clarity of roles and communication
* Managing user expectations and regularly sharing documented progress
* Organizational and political factors in getting the project done
* Having the right reasons, the right people, and the right plan, and making sure the project has value

Below are the people I met, including both old and new friends.

* Stephen Downie from UIUC. He introduced to me some useful tools in HathiTrust that I can borrow for my ETD project.
* Tom Cramer from Stanford. Tom was leading a team to work on a similar project on mining ETDs. He also introduced the yewno.com website, which they are working with, to transform information in ETDs into knowledge.
* Kurt Luther from Virginia Tech at Arlington. Kurt was doing a historical portrait study.
* Wayne Graham from CLIR.
* Heather Yager from MIT. Heather and I had a brief chat on accessing ETDs from DSpace in MIT libraries.
* David Smith from Northeastern. David is an expert on image processing. He introduced me to hOCR, which is exactly the tool I was looking for to identify bounding boxes of text on a scanned document.
* Michael Lesk from Rutgers. A senior but still energetic information scientist. He knew Dr. C. Lee Giles.
* Kate Zwaard, the chief of National Digital Initiatives at the Library of Congress.

Overall, the summit was very successful. The attendees presented real-world problems and discussed very practical questions. The logistics were also good. Eileen Jakeway did an excellent job communicating with people before and after the summit, including a follow-up survey. I thank Dr. Michael Nelson for telling me to register for this meeting.

I made a wise decision to stay overnight before the meeting. The traffic from Springfield to the Library of Congress was terrible, with 3 accidents in the morning. I was lucky to find a parking spot costing $18 a day near LOC. The trip back took 1 hour longer than the map estimate because of construction. But the weather was fine and the people were friendly!

Jian Wu

Wednesday, September 25, 2019

2019-09-25: Where did the archive go? Part 3: Public Record Office of Northern Ireland


In Where did the archive go? Part 1, we provided some details about changes in the Library and Archives Canada web archive. After they upgraded their replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In Part 2, we focused on the movement of the National Library of Ireland (NLI) collection. Mementos from the NLI collection were moved from the European Archive to the Internet Memory Foundation (IMF) archive. Then, they were moved to Archive-It. We found that 192 mementos, out of 979, cannot be found in Archive-It.

In part 3 of this four-part series, we focus on changes in the Public Record Office of Northern Ireland (PRONI) Web Archive. In October 2018, mementos in the PRONI archive were moved to Archive-It (archive-it.org). We discovered that 114 mementos, out of 469, can no longer be found in Archive-It (i.e., missing mementos).

We refer to the archive from which mementos have moved as the "original archive" (i.e., PRONI archive), and we use the "new archive" to refer to the archive to which the mementos have moved (i.e., Archive-It). A memento is identified by a URI-M as defined in the Memento framework.

We have several observations about the changes in the PRONI web archive:

Observation 1: The HTTP request to a URI-M in PRONI archive does not redirect to the corresponding URI-M in the new archive

As shown in the cURL session below, every request to a memento (URI-M) in the PRONI archive will return the HTTP status code "404 Not Found":

$ curl --head --location http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

HTTP/1.1 302 Found
Cache-Control: no-cache
Content-length: 0
Location: https://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
HTTP/2 404
date: Fri, 20 Sep 2019 08:13:45 GMT
server: Apache/2.4.18 (Ubuntu)
content-type: text/html; charset=iso-8859-1

PRONI did not leave a machine-readable method of locating the new URI-Ms. However, we were able to manually discover the corresponding URI-Ms in Archive-It. For example, the memento:

http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

is now available at:

http://wayback.archive-it.org/11112/20100218151844/http://www.berr.gov.uk/

The representations of both mementos are illustrated in the figure below:


Unlike the European Archive and IMF, the Public Record Office of Northern Ireland (PRONI) still owns the domain name of the original archive, webarchive.proni.gov.uk. Therefore, to maintain link integrity via "follow-your-nose", PRONI could issue redirects (even though it currently does not) to the corresponding URI-Ms in Archive-It. For example, since PRONI uses the Apache web server, the mod_rewrite rule that could be used to perform automatic redirects is:

# With mod_rewrite
RewriteEngine on
RewriteRule "^/(\d{14})/(.+)" http://wayback.archive-it.org/11112/$1/$2 [L,R=301]
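
The same mapping can also be applied outside of Apache, for instance to rewrite links to PRONI mementos in existing documents. The Python sketch below mirrors the rewrite rule above. It is only an illustration: the function name proni_to_archive_it is hypothetical, and the collection ID 11112 is taken from the Archive-It URI-Ms shown earlier.

```python
import re

# Mirrors the mod_rewrite rule: a 14-digit timestamp followed by the URI-R.
PRONI_URI_M = re.compile(r"^https?://webarchive\.proni\.gov\.uk/(\d{14})/(.+)$")


def proni_to_archive_it(uri_m):
    """Return the expected Archive-It URI-M for a PRONI URI-M, or None if it does not match."""
    match = PRONI_URI_M.match(uri_m)
    if match is None:
        return None
    timestamp, uri_r = match.groups()
    # 11112 is the Archive-It collection that received the PRONI mementos.
    return f"http://wayback.archive-it.org/11112/{timestamp}/{uri_r}"


print(proni_to_archive_it(
    "http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/"))
```

As Observation 3 shows, the URI-M produced this way is only a candidate: Archive-It may redirect it to a memento with a different Memento-Datetime or respond with an error.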

Observation 2: The functionality of the original archival banner is gone

Similar to the archival banners provided by the European Archive and IMF, the custom archival banner of the PRONI archive (marked in red in the top screenshot in the figure above) let users navigate through the available mementos. Users were able to view the available mementos and the representation of a selected memento in the same page. Archive-It, on the other hand, now uses the standard playback banner (marked in red in the bottom screenshot in the figure above). This banner does not have the same functionality as the original archive's banner. The new archive's banner informs users that they are viewing an "archived" web page and shows multiple links. One of the links redirects to a web page that shows all available mementos in archive-it.org. For example, the figure below shows the available mementos for the web page http://www.berr.gov.uk/:



Observation 3: Not all mementos are available in the new archive

We define a memento as "missing" if the Memento-Datetime, the URI-R, or the final HTTP status code of the memento from the original archive is not identical to the value for the corresponding memento in the new archive. In this study, we found 114 missing mementos (out of 469) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. For example, when requesting the URI-M:

http://webarchive.proni.gov.uk/20160901021637/https://www.flickr.com/

from the original archive (PRONI) on 2017-12-01, the archive responded with "200 OK" with the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Thu, 01 Sep 2016 02:16:37 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/11112/20160901021637/https://www.flickr.com/

from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/11112/20170401014520/https://www.flickr.com/

Although the representations of both mementos are identical (except for the archival banners), as the figure below shows, we consider the memento from the original archive missing because the two mementos have different values of the Memento-Datetime (i.e., Sat, 01 Apr 2017 01:45:20 GMT in the new archive), a delta of about 211 days.

$ curl --head --location --silent http://wayback.archive-it.org/11112/20160901021637/http://www.flickr.com/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170401014520/https://www.flickr.com/
HTTP/1.1 200 OK
Memento-Datetime: Sat, 01 Apr 2017 01:45:20 GMT
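
The comparison behind this definition can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the Memento tuple and the delta_seconds and is_missing helpers are hypothetical names, and the values come from the flickr.com example above.

```python
from email.utils import parsedate_to_datetime
from typing import NamedTuple


class Memento(NamedTuple):
    memento_datetime: str  # value of the Memento-Datetime response header
    uri_r: str             # URI of the original resource
    status_code: int       # final HTTP status code after following redirects


def delta_seconds(original, new):
    """Absolute difference between the two Memento-Datetimes, in seconds."""
    first = parsedate_to_datetime(original.memento_datetime)
    second = parsedate_to_datetime(new.memento_datetime)
    return abs(int((second - first).total_seconds()))


def is_missing(original, new):
    """A memento counts as missing if any of the three values differs."""
    return (delta_seconds(original, new) != 0
            or original.uri_r != new.uri_r
            or original.status_code != new.status_code)


original = Memento("Thu, 01 Sep 2016 02:16:37 GMT", "https://www.flickr.com/", 200)
new = Memento("Sat, 01 Apr 2017 01:45:20 GMT", "https://www.flickr.com/", 200)

# Report whether the memento is missing and the delta in whole days.
print(is_missing(original, new), delta_seconds(original, new) // 86400)
```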



We found that, for 63 of the 114 missing mementos, the values of the Memento-Datetime differ by less than 11 seconds. For example, the request to the memento:

http://webarchive.proni.gov.uk/20170102004044/http://www.fws.gov/

from the original archive on 2017-11-18 returned "302" redirect to

http://webarchive.proni.gov.uk/20170102004044/https://fws.gov/

The request to the corresponding memento:

http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/

from the new archive redirects to the memento:

http://wayback.archive-it.org/11112/20170102004051/https://www.fws.gov/

as the cURL session below shows:

$ curl --head --location --silent http://wayback.archive-it.org/11112/20170102004044/http://www.fws.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: /11112/20170102004051/https://www.fws.gov/
HTTP/1.1 200 OK
Memento-Datetime: Mon, 02 Jan 2017 00:40:51 GMT

There is a 10-second difference between the values of the Memento-Datetime, which might not be semantically significant (apparently just a change in the canonicalization of the URIs, with http://www.fws.gov/ redirecting to https://www.fws.gov/), but we do not consider the memento in the original archive identical to the corresponding memento in the new archive because of the difference in the values of the Memento-Datetime.
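
When checking many URI-Ms, the filtered cURL output above can be processed programmatically to obtain the final status code and the Memento-Datetime of the last response in the redirect chain. The following Python sketch assumes the output of curl --head --location is available as text; parse_redirect_chain is a hypothetical helper written for this illustration.

```python
def parse_redirect_chain(head_output):
    """Split `curl --head --location` output into one dict per HTTP response."""
    responses = []
    for line in head_output.splitlines():
        line = line.strip()
        if line.upper().startswith("HTTP/"):
            # Status line, e.g. "HTTP/1.1 302 Found" or "HTTP/2 404".
            parts = line.split(maxsplit=2)
            responses.append({"status": int(parts[1]), "headers": {}})
        elif ":" in line and responses:
            # Header line; split on the first colon only so URIs stay intact.
            name, value = line.split(":", 1)
            responses[-1]["headers"][name.strip().lower()] = value.strip()
    return responses


OUTPUT = """HTTP/1.1 302 Found
Location: /11112/20170102004051/https://www.fws.gov/
HTTP/1.1 200 OK
Memento-Datetime: Mon, 02 Jan 2017 00:40:51 GMT
"""

chain = parse_redirect_chain(OUTPUT)
final = chain[-1]
print(final["status"], final["headers"].get("memento-datetime"))
```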

When the new archive receives an archived collection from the original archive, it may apply some post-crawling techniques to the received files (e.g., WARC files), including deduplication, spam filtering, and indexing. As a result, mementos in the new archive may have values of the Memento-Datetime that differ from the corresponding values in the original archive.

Observation 4: PRONI provides a list of original pages' URIs (URI-Rs)

Mementos in the PRONI archive were moved to Archive-It under the archival collection https://archive-it.org/collections/11112/:


As shown in Observation 1, requests to URI-Ms in PRONI do not redirect to Archive-It. However, webarchive.proni.gov.uk provides a list of all original resources' URIs (URI-Rs) for which mementos have been created as the following figure shows:



For instance, suppose we want to find the memento in Archive-It that corresponds to the PRONI memento:
URI-M = http://webarchive.proni.gov.uk/20150318223351/http://www.afbini.gov.uk/

which has:
URI-R = http://www.afbini.gov.uk/
Memento-Datetime = Wed, 18 Mar 2015 22:33:51 GMT

From the index at webarchive.proni.gov.uk, we can click on the URI-R www.afbini.gov.uk, which will redirect to an Archive-It HTML page that contains all available mementos for the selected URI-R as shown in the figure below:



Finally, we choose the memento captured on 2015-03-18 because its date matches the Memento-Datetime in the original archive. The representation of the memento is shown below:



Although the PRONI archive does not issue "301" redirects to URI-Ms in the new archive (i.e., PRONI does not provide a direct mapping between original URI-Ms and new URI-Ms), users of the archive can indirectly find the corresponding URI-Ms as explained above.

Observation 5: Archival 4xx/5xx responses are handled differently

Similar to the European Archive and IMF, the replay tool in the original archive (proni.gov.uk) was configured to return the status code "200 OK" for archival 4xx/5xx responses. For example, when requesting the memento:

http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/

on 2017-11-18, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81bb8530-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 60568

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
...

Even the HTTP status code of the inner iframe in which the archived content is loaded was "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/content/20160216154000/http://www.megalithic.co.uk/
WARC-Date: 2017-11-18T03:35:22Z
WARC-Record-ID: <urn:uuid:81c967e0-cc11-11e7-9c05-ff972ac7f9f2>
Content-Type: application/http; msgtype=response
Content-Length: 4587

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 03:35:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 24849777 24360264
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<title>[ARCHIVED CONTENT] 403 FORBIDDEN : LOGGED BY www.megalithic.co.uk</title>
</head>
<body style=\"font:Arial Black,Arial\">
<p><b><font color="#FF0000" size="+2" face="Verdana, Arial, Helvetica, sans-serif" style="text-decoration:blink; color:#FF0000; background:#000000;">&nbsp;&nbsp;&nbsp;&nbsp; 403 FORBIDDEN! &nbsp;&nbsp;&nbsp;&nbsp;</font></b></p>
<b>
<p><font color="#FF0000">You have been blocked from the Megalithic Portal by our defence robot.<br>Possibly the IP address you are accessing from has been banned for previous bad behavior or you have attempted a hostile action.<br>If you think this is an error please click the Trouble Ticket link below to communicate with the site admin.</font></p>
...

When requesting the corresponding memento:

http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

Archive-It properly returned the status code 403 for the archival 403:

$ curl --head http://wayback.archive-it.org/11112/20160216154000/http://www.megalithic.co.uk/

HTTP/1.1 403 Forbidden
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Tue, 16 Feb 2016 15:40:00 GMT
X-Archive-Guessed-Charset: windows-1252
X-Archive-Orig-Server: Apache/2.2.15 (CentOS)
X-Archive-Orig-Connection: close
X-Archive-Orig-X-Powered-By: PHP/5.3.3
X-Archive-Orig-Status: 403 FORBIDDEN
X-Archive-Orig-Content-Length: 3206
X-Archive-Orig-Date: Tue, 16 Feb 2016 15:39:59 GMT
X-Archive-Orig-Warning: 199 www.megalithic.co.uk:80 You_are_abusive/hacking/spamming_www.megalithic.co.uk
X-Archive-Orig-Content-Type: text/html; charset=iso-8859-1
...

The representations of both mementos are illustrated below:



Observation 6: Some URI-Rs were removed from PRONI

Mementos may disappear when moving from the original archive to the new archive. For example, the request to the URI-M:

http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/

from the original archive resulted in "200 OK" as the part of the WARC below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/
WARC-Date: 2017-11-18T02:06:22Z
WARC-Record-ID: <urn:uuid:13136370-cc05-11e7-83e9-19ddf7ecdbd2>
Content-Type: application/http; msgtype=response
Content-Length: 28657

HTTP/1.1 200 OK
Date: Sat, 18 Nov 2017 02:06:11 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 20429973
Memento-Datetime: Tue, 08 Apr 2014 18:55:12 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="first memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="last memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="prev memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/20140408185512/http://www.www126.com/>; rel="next memento"; datetime="Tue, 08 Apr 2014 18:55:12 GMT", <http://webarchive.proni.gov.uk/timegate/http://www.www126.com/>; rel="timegate", <http://www.www126.com/>; rel="original", <http://webarchive.proni.gov.uk/timemap/http://www.www126.com/>; rel="timemap"; type="application/link-format"

The representation of the memento is illustrated below:

The request to the corresponding URI-M:

http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head http://wayback.archive-it.org/11112/20140408185512/http://www.www126.com/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Content-Length: 4910
Date: Tue, 24 Sep 2019 02:00:45 GMT

Before transferring collections to the new archive, it is possible that the original archive reviews collections and removes URI-Rs/URI-Ms that are considered off topic (you may also read about the off-topic memento toolkit) or spam (e.g., the URI-R www.www126.com is about auto insurance).  

Observation 7: PRONI may have used the European Archive and IMF as hosting services

PRONI used an archival banner that is similar to what the European Archive and IMF used. Furthermore, the three archives returned a similar set of HTTP response headers to requests for mementos. Values of multiple response headers (e.g., Server and Via) are identical, as shown below:

The European Archive:
URI-M = http://collection.europarchive.org/nli/20161213111140/http://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collection.europarchive.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2017-12-01T18:15:32Z
WARC-Record-ID: <urn:uuid:9e705c20-d6c3-11e7-9f9e-5371622c3ef9>
Content-Type: application/http; msgtype=response
Content-Length: 159855

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 18:15:18 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 15695216
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

The IMF archive:
URI-M = http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20161213111140/https://wordpress.org/
WARC-Date: 2018-09-03T16:33:41Z
WARC-Record-ID: <urn:uuid:1e3173c0-af97-11e8-8819-6df9b412b877>
Content-Type: application/http; msgtype=response
Content-Length: 300060

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:33:27 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 11721161
Memento-Datetime: Tue, 13 Dec 2016 06:03:04 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link:...

The PRONI archive:
URI-M = http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://webarchive.proni.gov.uk/20100218151844/http://www.berr.gov.uk/
WARC-Date: 2017-12-01T01:59:56Z
WARC-Record-ID: <urn:uuid:5452da60-d63b-11e7-91e2-8bf9abaf94b4>
Content-Type: application/http; msgtype=response
Content-Length: 36908

HTTP/1.1 200 OK
Date: Fri, 01 Dec 2017 01:59:43 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 16262557
Memento-Datetime: Thu, 18 Feb 2010 15:18:44 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: ...

We found that the PRONI archival collection was listed as one of the collections maintained by the European Archive and IMF, as the figure below shows:

https://web.archive.org/web/20180707131510/http://collections.internetmemory.org/

Even though the PRONI collection was moved from the European Archive to IMF, URI-Ms served by proni.gov.uk did not change. It is possible that the PRONI archive followed a strategy of serving mementos under proni.gov.uk while using the hosting services provided by the European Archive and IMF. Thus, the regular users of the PRONI archive did not notice any change in URI-Ms. We do not think custom domains are available with Archive-It, so PRONI was unable to continue to host their mementos in their own URI namespace.

The list of all 469 URI-Ms is appended below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
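
Given these fields, the number of missing mementos can be recomputed from the file. The Python sketch below is only an illustration under several assumptions: that the file is comma-separated with the field names above as its header row, and that the two boolean fields are written as "True" or "False". The three sample rows are made up for the example and are not taken from the actual file.

```python
import csv
import io


def count_missing(rows):
    """Count rows where the Memento-Datetime, URI-R, or status code differs."""
    missing = 0
    for row in rows:
        identical = (int(row["delta"]) == 0
                     and row["same_final_URI-Rs"] == "True"
                     and row["same_final_URI-Ms_status_code"] == "True")
        if not identical:
            missing += 1
    return missing


# Hypothetical sample: one unchanged memento, one with a different
# Memento-Datetime, and one with a different final status code.
SAMPLE = """delta,same_final_URI-Rs,same_final_URI-Ms_status_code
0,True,True
18314923,True,True
0,True,False
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(count_missing(rows), "of", len(rows), "mementos are missing")
```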

Conclusions

We described the movement of mementos from the PRONI archive (proni.gov.uk) to archive-it.org in October 2018. We found that 114 out of the 469 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that the functionality that used to be available in the original archival banner is gone. We noticed that the two archives, proni.gov.uk and archive-it.org, react differently to requests for archival 4xx/5xx responses. In the upcoming post, we will provide some details about changes in webcitation.org.


--Mohamed Aturban




Tuesday, September 10, 2019

2019-09-10: Where did the archive go? Part 2: National Library of Ireland


In the previous post, we provided some details about changes in the Library and Archives Canada web archive. After they upgraded their web archive replay system, we were no longer able to find 49 out of 351 mementos (archived web pages). In part 2 of this four-part series, we focus on the movement of a collection from the National Library of Ireland (NLI).


In May 2018, we discovered that 979 mementos from the NLI collection that were originally archived at the European Archive (europarchive.org) had been moved to the Internet Memory Foundation archive (internetmemory.org). Then in September 2018, we found that the collection of mementos had been moved to Archive-It (archive-it.org). We found that 192 mementos, out of 979, cannot be found in Archive-It (i.e., missing mementos).

For example, the memento from the European Archive:


has been moved to the Internet Memory Foundation (IMF) archive at:

http://collections.internetmemory.org/nli/20141013204117/http://www.defense.gov/

before it ended up at Archive-it:

http://wayback.archive-it.org/10702/20141013204117/http://www.defense.gov/

The representations of the three mementos are illustrated in the figure below.



There were no changes in the 979 mementos (other than their URIs) when they moved from the European Archive to the IMF archive (even the archival banner remained the same as the figure above shows), but we found some significant changes upon the move from IMF to Archive-It which we will focus on in this post.

We refer to the archive from which mementos were moved (i.e., internetmemory.org) as the "original archive", and we use the "new archive" to refer to the archive to which the mementos were moved (i.e., archive-it.org). A memento is identified by a URI-M as defined in the Memento framework.

Our observations about changes in the NLI collection (from IMF to Archive-It) are as follows:

Observation 1: The functionality of the original archival banner is gone

Users of the European Archive and IMF were able to navigate through available mementos via the custom archival banner (marked in red in the top two screenshots in the figure above). Via this banner, the original archive allows users to view the available mementos and the representation of a selected memento in the same page. Archive-It, on the other hand, now uses the standard playback banner (marked in red in the bottom screenshot in the figure above). This new archive's banner contains information to inform users that they are viewing an "archived" web page. This banner also contains multiple links. One of the links will take you to a web page in archive-it.org that shows all available mementos in the archive as shown in the figure below:





Observation 2: The original archive is no longer reachable

After mementos were moved from internetmemory.org, the archive became unreachable as the following cURL session shows:

$ date
Tue May 21 08:03:51 EDT 2019

$ curl http://www.internetmemory.org
curl: (7) Failed to connect to www.internetmemory.org port 80: Operation timed out



In addition to IMF, the European Archive (europarchive.org) is also no longer maintained: it was shut down, and the domain name was purchased by another entity and now serves spam.



The movement of mementos from these two archives will affect link integrity across web resources that contain links to mementos from the European Archive or IMF. As mentioned in the previous post, there are actions that original archives can perform to maintain link integrity via "follow-your-nose" from the old URI-Ms to the corresponding URI-Ms in the new archive.


For example, Library and Archives Canada changed its archive's domain name from collectionscanada.gc.ca to webarchive.bac-lac.gc.ca (described in part 1). Because the archive still controls the original domain name collectionscanada.gc.ca, it could (even though it currently does not) redirect requests for URI-Ms in collectionscanada.gc.ca to the new archive at webarchive.bac-lac.gc.ca. For instance, if the original archive uses the Apache web server, a mod_rewrite rule can be used to perform automatic redirects:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]
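
To make the mapping in the rule above concrete, here is a minimal Python sketch of the same old-to-new URI-M translation. It only illustrates the pattern; it is not how either archive is implemented. The function name redirect_target and the example.com path are hypothetical.

```python
import re

# Same pattern as the RewriteRule above: a 14-digit timestamp followed by the URI-R.
OLD_URI_M_PATH = re.compile(r"^/webarchives/(\d{14})/(.+)")
NEW_PREFIX = "http://webarchive.bac-lac.gc.ca:8080/wayback"


def redirect_target(path):
    """Return the new-archive URI-M for an old URI-M path, or None if it does not match."""
    match = OLD_URI_M_PATH.match(path)
    if match is None:
        return None
    timestamp, uri_r = match.groups()
    return f"{NEW_PREFIX}/{timestamp}/{uri_r}"


# Hypothetical old path, used only to show the shape of the result.
print(redirect_target("/webarchives/20121221162201/http://example.com/"))
# http://webarchive.bac-lac.gc.ca:8080/wayback/20121221162201/http://example.com/
```

A request path that does not look like a URI-M (for example, the archive's home page) yields None and would be served normally.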

But these practices become impractical in the case of the European Archive and IMF because:

  • The archives no longer exist, so there is no maintained web server available to issue the redirects.
  • Even if it still existed, the archive might decide not to issue redirects for former customers in order to increase lock-in.

In the upcoming post, we will describe the movement of mementos from the Public Record Office of Northern Ireland (PRONI) web archive to Archive-It. The PRONI organization still controls and maintains the original domain name webarchive.proni.gov.uk, so it is possible for PRONI to issue redirects to the new URI-Ms in Archive-It.

Observation 3: Not all mementos are available in the new archive

As defined in the previous post, a memento is considered missing when the values of the Memento-Datetime, the URI-R, and the final HTTP status code of the memento from the original archive are not all identical to the values of the corresponding memento from the new archive. In this study, we found 192 missing mementos (out of 979) that cannot be retrieved from the new archive. Instead, the new archive responds with other mementos that have different values for the Memento-Datetime, the URI-R, or the HTTP status code. We give two examples of missing mementos. The first example shows a memento that cannot be found in the new archive with the same Memento-Datetime that it had in the original archive. When requesting the URI-M:

http://collections.internetmemory.org/nli/20121221162201/http://bbc.co.uk/news/

from the original archive (internetmemory.org) on September 03, 2018, the archive responded with "200 OK" with the representation shown in the top screenshot in the figure below. The Memento-Datetime of this memento was Fri, 21 Dec 2012 16:22:01 GMT. Then, we requested the corresponding URI-M:

http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/


from the new archive (archive-it.org). As shown in the cURL session below, the request redirected to another URI-M:

http://wayback.archive-it.org/10702/20121221163248/http://www.bbc.co.uk/news/

Although the representations of both mementos are identical (except for the archival banners), as shown in the figure below, we consider the memento from the original archive as missing because the two mementos have different Memento-Datetime values (i.e., Fri, 21 Dec 2012 16:32:48 GMT in the new archive), a delta of about 10 minutes. Even though the 10 minute delta might not be semantically significant (apparently just a change in the canonicalization of the URI-R, with bbc.co.uk redirecting to www.bbc.co.uk), we do not consider it to be the same since the values of the Memento-Datetime are not identical.

$ curl --head --location --silent http://wayback.archive-it.org/10702/20121221162201/http://bbc.co.uk/news/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"
HTTP/1.1 302 Found
Location: /10702/20121221163248/http://www.bbc.co.uk/news/
HTTP/1.1 200 OK
Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
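
The delta in this first example can be recomputed from the filtered cURL output. The following Python sketch is only an illustration under stated assumptions: the helper name final_response is hypothetical, and the header dump is copied from the session above rather than fetched live.

```python
from email.utils import parsedate_to_datetime


def final_response(header_dump):
    """Return (status_code, headers) of the last response in a curl --head --location dump."""
    status, headers = None, {}
    for line in header_dump.splitlines():
        line = line.strip()
        if line.startswith("HTTP/"):
            # A new status line starts a new response, so drop the previous headers.
            status, headers = int(line.split()[1]), {}
        elif ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return status, headers


# Filtered output of the cURL session above (new archive).
dump = """HTTP/1.1 302 Found
Location: /10702/20121221163248/http://www.bbc.co.uk/news/
HTTP/1.1 200 OK
Memento-Datetime: Fri, 21 Dec 2012 16:32:48 GMT
"""

# Memento-Datetime reported by the original archive for the same URI-R.
original_dt = parsedate_to_datetime("Fri, 21 Dec 2012 16:22:01 GMT")

status, headers = final_response(dump)
new_dt = parsedate_to_datetime(headers["memento-datetime"])
delta_seconds = int((new_dt - original_dt).total_seconds())

print(status, delta_seconds)  # 200 647
```

A delta of 647 seconds is the "about 10 minutes" mentioned above. Because it is not zero, the memento counts as missing under the definition used in this study.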



The second example shows a memento that has different values of the Memento-Datetime and URI-R compared to the corresponding values from the original archive. When requesting the memento:

http://collections.internetmemory.org/nli/20121223122758/http://www.whitehouse.gov/

on September 03, 2018, the original archive returned "200 OK" for an archival "403 Forbidden" as the WARC record below shows:


WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/
WARC-Date: 2018-09-03T16:31:30Z
WARC-Record-ID: <urn:uuid:d03e5020-af96-11e8-9d72-f10b53f82929>
Content-Type: application/http; msgtype=response
Content-Length: 1694

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:31:19 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
X-Varnish: 28318986 28187250
Memento-Datetime: Sun, 23 Dec 2012 12:27:58 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/content/20121223122758/http://www.whitehouse.gov/>; rel="memento"; datetime="Sun, 23 Dec 2012 12:27:58 GMT", <http://collections.internetmemory.org/nli/content/20110223072152/http://www.whitehouse.gov/>; rel="first memento"; datetime="Wed, 23 Feb 2011 07:21:52 GMT", <http://collections.internetmemory.org/nli/content/20180528183514/http://www.whitehouse.gov/>; rel="last memento"; datetime="Mon, 28 May 2018 18:35:14 GMT", <http://collections.internetmemory.org/nli/content/20121221220430/http://www.whitehouse.gov/>; rel="prev memento"; datetime="Fri, 21 Dec 2012 22:04:30 GMT", <http://collections.internetmemory.org/nli/content/20131208014833/http://www.whitehouse.gov/>; rel="next memento"; datetime="Sun, 08 Dec 2013 01:48:33 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.whitehouse.gov/>; rel="timegate", <http://www.whitehouse.gov/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.whitehouse.gov/>; rel="timemap"; type="application/link-format"
Content-Length: 287

<HTML><head>
<title>[ARCHIVED CONTENT] Access Denied</title>
</head><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;wwws&#46;whitehouse&#46;gov&#47;" on this server.<P>
Reference&#32;&#35;18&#46;d8407b5c&#46;1356265678&#46;2324d94
</BODY>
</HTML>

When requesting the corresponding memento from archive-it.org:

http://wayback.archive-it.org/10702/20121223122758/http://www.whitehouse.gov/

the request redirected to another URI-M:

http://wayback.archive-it.org/10702/20121221222130/http://www.whitehouse.gov/administration/eop/nec/speeches/gene-sperling-remarks-economic-club-washington

which returned "200 OK". Notice that not only do the values of the Memento-Datetime differ, but so do the URI-Rs. The representations of both mementos from the original and new archives are shown below:
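
Both differences in this second example can be read directly off the two URI-Ms. The Python sketch below assumes that a URI-M embeds a 14-digit UTC timestamp followed by the URI-R, which matches the URI-Ms and Memento-Datetime headers shown in this post. The helper name split_uri_m is hypothetical.

```python
import re
from datetime import datetime, timezone

# First "/<14 digits>/" segment, then everything after it as the URI-R.
URI_M_PATTERN = re.compile(r"/(\d{14})/(.+)$")


def split_uri_m(uri_m):
    """Split a URI-M into (Memento-Datetime, URI-R)."""
    match = URI_M_PATTERN.search(uri_m)
    if match is None:
        raise ValueError(f"not a recognizable URI-M: {uri_m}")
    timestamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return timestamp.replace(tzinfo=timezone.utc), match.group(2)


original_dt, original_uri_r = split_uri_m(
    "http://collections.internetmemory.org/nli/20121223122758/http://www.whitehouse.gov/"
)
new_dt, new_uri_r = split_uri_m(
    "http://wayback.archive-it.org/10702/20121221222130/http://www.whitehouse.gov/"
    "administration/eop/nec/speeches/gene-sperling-remarks-economic-club-washington"
)

print("same Memento-Datetime:", original_dt == new_dt)
print("same URI-R:", original_uri_r == new_uri_r)
print("delta (seconds):", int((original_dt - new_dt).total_seconds()))
```

Both comparisons print False for this pair, so the memento is classified as missing on two of the three criteria.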



Observation 4: The two archives handle archival 4xx/5xx responses differently

The replay tool in the original archive (internetmemory.org) is configured so that it returns the status code "200 OK" for archival 4xx/5xx. 

For example, when requesting the memento:

http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/


on September 03, 2018, the original archive returned "200 OK" for an archival "503 Service Unavailable" as the WARC record below shows:


WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/
WARC-Date: 2018-09-03T16:46:51Z
WARC-Record-ID: <urn:uuid:f4f2e910-af98-11e8-8de6-6f058c4e494a>
Content-Type: application/http; msgtype=response
Content-Length: 271841

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:39 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 28349831
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Connection: keep-alive
Accept-Ranges: bytes
...

Even the request for the inner iframe, in which the archived content is loaded, returned "200 OK":

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/
WARC-Date: 2018-09-03T16:46:51Z
WARC-Record-ID: <urn:uuid:f500cbc0-af98-11e8-8de6-6f058c4e494a>
Content-Type: application/http; msgtype=response
Content-Length: 2642

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 16:46:40 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 27227468 27453379
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/>; rel="memento"; datetime="Sun, 21 Oct 2012 20:36:47 GMT", <http://collections.internetmemory.org/nli/content/20110221192317/http://www.amazon.com/>; rel="first memento"; datetime="Mon, 21 Feb 2011 19:23:17 GMT", <http://collections.internetmemory.org/nli/content/20180711130159/http://www.amazon.com/>; rel="last memento"; datetime="Wed, 11 Jul 2018 13:01:59 GMT", <http://collections.internetmemory.org/nli/content/20121016174254/http://www.amazon.com/>; rel="prev memento"; datetime="Tue, 16 Oct 2012 17:42:54 GMT", <http://collections.internetmemory.org/nli/content/20121025120853/http://www.amazon.com/>; rel="next memento"; datetime="Thu, 25 Oct 2012 12:08:53 GMT", <http://collections.internetmemory.org/nli/content/timegate/http://www.amazon.com/>; rel="timegate", <http://www.amazon.com/>; rel="original", <http://collections.internetmemory.org/nli/content/timemap/http://www.amazon.com/>; rel="timemap"; type="application/link-format"

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<title>[ARCHIVED CONTENT] 500 Service Unavailable Error</title>
</head>
<body style="padding:1% 10%;font-family:Verdana,Arial,Helvetica,sans-serif">
  <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/"><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/img09/x-site/other/a_com_logo_200x56.gif" alt="Amazon.com" width="200" height="56" border="0"/></a>
  <table>
    <tr>
      <td valign="top" style="width:52%;font-size:10pt"><br/><h2 style="color:#E47911">Oops!</h2><p>We're very sorry, but we're having trouble doing what you just asked us to do. Please give us another chance--click the Back button on your browser and try your request again. Or start from the beginning on our <a  target="_top" href="http://collections.internetmemory.org/nli/20121021203647/http://www.amazon.com/">homepage</a>.</p></td>
      <th><img src="http://collections.internetmemory.org/nli/content/20121021203647/http://ecx.images-amazon.com/images/G/01/x-locale/common/errors-alerts/product-fan-500.jpg" alt="product collage"/></th>
    </tr>
  </table>
</body>

</html>

When requesting the corresponding memento:

http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

Archive-It properly returned the status code 503 for the archival 503:

$ curl -I http://wayback.archive-it.org/10702/20121021203647/http://www.amazon.com/

HTTP/1.1 503 Service Unavailable
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Memento-Datetime: Sun, 21 Oct 2012 20:36:47 GMT
Link: <http://www.amazon.com/>; rel="original", <https://wayback.archive-it.org/10702/timemap/link/http://www.amazon.com/>; rel="timemap"; 
...
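
The Link headers in the responses above carry the Memento relations (original, timemap, timegate, first/prev/next/last memento). Splitting them on commas breaks the datetime values, which contain commas themselves, so the following Python sketch uses a regular expression instead. It is a simplified parser that assumes every parameter value is double-quoted, as in the headers shown here. The names parse_link_header and find_rel are hypothetical, and the sample is a shortened copy of the internetmemory.org header.

```python
import re

# One link-value: <target> followed by zero or more ; name="value" parameters.
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_header(value):
    """Parse a Link header value into a list of dicts with 'uri' plus its parameters."""
    links = []
    for target, params in LINK_PATTERN.findall(value):
        links.append({"uri": target, **dict(PARAM_PATTERN.findall(params))})
    return links


def find_rel(links, rel):
    """Return the URIs whose rel attribute contains the given relation type."""
    return [link["uri"] for link in links if rel in link.get("rel", "").split()]


# Shortened copy of the Link header returned by the original archive.
header = (
    '<http://collections.internetmemory.org/nli/content/20121021203647/http://www.amazon.com/>; '
    'rel="memento"; datetime="Sun, 21 Oct 2012 20:36:47 GMT", '
    '<http://www.amazon.com/>; rel="original", '
    '<http://collections.internetmemory.org/nli/content/timemap/http://www.amazon.com/>; '
    'rel="timemap"; type="application/link-format"'
)

links = parse_link_header(header)
print(find_rel(links, "original"))  # ['http://www.amazon.com/']
print(links[0]["datetime"])         # Sun, 21 Oct 2012 20:36:47 GMT
```

Because rel values such as "first memento" hold several relation types separated by spaces, find_rel splits the value before matching.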

Observation 5: The HTTP status code may change in the new archive

The HTTP status code of a URI-M in the new archive might not be identical to the HTTP status code of the corresponding URI-M in the original archive. For example, the HTTP request for the URI-M:

http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/


to the original archive resulted in "200 OK", as the excerpt of the WARC record below shows:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/
WARC-Date: 2018-09-03T08:40:43Z
WARC-Record-ID: <urn:uuid:0b947600-af55-11e8-9b13-5bce71cafd38>
Content-Type: application/http; msgtype=response
Content-Length: 28447

HTTP/1.1 200 OK
Date: Mon, 03 Sep 2018 08:40:30 GMT
Server: Apache/2.4.10
Age: 0
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
Via: 1.1 varnish-v4
cache-control: max-age=86400
Transfer-Encoding: chunked
X-Varnish: 27888670
Memento-Datetime: Sun, 23 Dec 2012 03:18:37 GMT
Connection: keep-alive
Accept-Ranges: bytes
Link: <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="first memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="last memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="prev memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/20121223031837/http://www2008.org/>; rel="next memento"; datetime="Sun, 23 Dec 2012 03:18:37 GMT", <http://collections.internetmemory.org/nli/timegate/http://www2008.org/>; rel="timegate", <http://www2008.org/>; rel="original", <http://collections.internetmemory.org/nli/timemap/http://www2008.org/>; rel="timemap"; type="application/link-format"



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
...

The representation of the memento is illustrated below:


In internetmemory.org

The request to the corresponding URI-M:

http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

from Archive-It results in "404 Not Found" as the cURL session below shows:

$ curl --head --silent http://wayback.archive-it.org/10702/20121223031837/http://www2008.org/

HTTP/1.1 404 Not Found
Server: Apache-Coyote/1.1
Content-Security-Policy-Report-Only: default-src 'self' 'unsafe-inline' 'unsafe-eval' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org data: blob: ; frame-src 'self' *.archive-it.org archive-it.org *.qa-archive-it.org qa-archive-it.org archive.org *.archive.org https://*.archive-it.org https://archive-it.org https://*.qa-archive-it.org https://qa-archive-it.org https://archive.org https://*.archive.org ; report-uri https://partner.archive-it.org/csp-report
Content-Type: text/html;charset=utf-8
Content-Length: 4902
Date: Thu, 05 Sep 2019 08:28:27 GMT

The list of all 979 URI-Ms is appended below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M).
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs).
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code).
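
As a rough illustration of how these fields could be used to count missing mementos, the Python sketch below reads a small in-memory sample. The file format is an assumption: comma-separated values with a header row and True/False flags. Only a subset of the columns is included, the first row is derived from the bbc.co.uk example above, and the second row is hypothetical.

```python
import csv
import io

# Assumed layout (CSV with a header row); the actual file may differ.
sample = io.StringIO(
    "original_URI-M,final_new_URI-M,delta,same_final_URI-Rs,same_final_URI-Ms_status_code\n"
    # Derived from the bbc.co.uk example in Observation 3.
    "http://collections.internetmemory.org/nli/20121221162201/http://bbc.co.uk/news/,"
    "http://wayback.archive-it.org/10702/20121221163248/http://www.bbc.co.uk/news/,"
    "647,False,True\n"
    # Hypothetical row for a memento that moved without any change.
    "http://collections.internetmemory.org/nli/20120101000000/http://example.com/,"
    "http://wayback.archive-it.org/10702/20120101000000/http://example.com/,"
    "0,True,True\n"
)


def is_missing(row):
    """A memento is missing if the datetime, the URI-R, or the status code changed."""
    return (
        int(row["delta"]) != 0
        or row["same_final_URI-Rs"] != "True"
        or row["same_final_URI-Ms_status_code"] != "True"
    )


rows = list(csv.DictReader(sample))
missing = [row["original_URI-M"] for row in rows if is_missing(row)]
print(f"{len(missing)} missing out of {len(rows)}")  # 1 missing out of 2
```

Rows whose new URI-M returns no Memento-Datetime at all (for example, a 404) would need separate handling of the delta field, which this sketch does not attempt.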

Conclusions

We did not find any changes in the 979 mementos of the National Library of Ireland (NLI) collection when they were moved from europarchive.org to internetmemory.org in May 2018. Both archives had used the same replay tool and archival banners. The NLI collection was then moved to archive-it.org in September 2018. We found that 192 out of the 979 mementos resurfaced in archive-it.org with a change in Memento-Datetime, URI-R, or the final HTTP status code. We also found that the functionality that used to be available in the original archival banner is gone from the new archive. Finally, we noticed that the two archives, internetmemory.org and archive-it.org, react differently to requests for archival 4xx/5xx responses.

In the upcoming posts we will provide some details about changes in the archives:
--Mohamed Aturban