Monday, October 21, 2019

2019-10-21: Where did the archive go? Part 4: WebCite

webcitation.org
We previously described changes in the following web archives:
In the last part of this four-part series, we focus on changes in webcitation.org (WebCite). The WebCite archive has been operational in its current form since at least 2004 and was the first archive to offer an on-demand archiving service by allowing users to submit URLs of web pages. Around 2019-06-07, the archive became unreachable. The Wayback Machine indicates that no mementos were captured for the domain webcitation.org between 2019-06-07 and 2019-07-09 (about a month), which is the longest period in 2019 with no mementos for WebCite in the Internet Archive:

https://web.archive.org/web/*/webcitation.org

The host webcitation.org was not resolving as shown in the cURL session below:

$ date
Mon Jul 01 01:33:15 EDT 2019


$ curl -I www.webcitation.org
curl: (6) Could not resolve host: www.webcitation.org
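
The same check can be scripted when many hosts need to be monitored. The following is a minimal sketch, assuming only that a failed DNS lookup is what we want to detect; the function name resolves is ours and is not part of the study's tooling.

```python
import socket


def resolves(hostname):
    """Return True if the name resolves to at least one address."""
    try:
        # Port 80 is only a placeholder here; no connection is opened.
        socket.getaddrinfo(hostname, 80)
        return True
    except socket.gaierror:
        # The Python counterpart of curl's "(6) Could not resolve host".
        return False
```

Calling resolves("www.webcitation.org") during the outage would have returned False. A True result only means the name resolves, not that the archive is serving mementos.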


We were conducting a study on a set of mementos from WebCite when the archive became inaccessible. The study included downloading the mementos and storing them locally in WARC files. Because the archive was unreachable, the WARC files contained only request records, with no response records, as shown below for the memento whose URI (URI-M) was http://www.webcitation.org/5ekDHBAVN:

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/5ekDHBAVN
WARC-Date: 2019-07-09T02:01:52Z
WARC-Concurrent-To: <urn:uuid:8519ea60-a1ed-11e9-82a3-4d5f15d9881d>
WARC-Record-ID: <urn:uuid:851c5b60-a1ed-11e9-82a3-4d5f15d9881d>
Content-Type: application/http; msgtype=request
Content-Length: 303


GET /5ekDHBAVN HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
X-DevTools-Emulate-Network-Conditions-Client-Id: 5B0BDCB102052B8A7EF0772D50B85540
Host: www.webcitation.org

<EOF>
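
Requests that never received a response can be found by scanning the WARC headers. The sketch below is a simplified illustration that works on WARC content already decoded to text and looks only at the WARC-Type and WARC-Target-URI headers. The helper names record_types_by_uri and uris_without_response are hypothetical, and a dedicated WARC library would be more robust for real (binary, possibly compressed) files.

```python
from collections import defaultdict


def record_types_by_uri(warc_text):
    """Map each WARC-Target-URI to the set of WARC-Type values seen for it."""
    types = defaultdict(set)
    headers = None  # a dict while inside a WARC header block, otherwise None

    def flush(h):
        if h and "warc-type" in h and "warc-target-uri" in h:
            types[h["warc-target-uri"]].add(h["warc-type"])

    for line in warc_text.splitlines():
        if line.startswith("WARC/"):
            flush(headers)  # previous header block ended without a blank line
            headers = {}
        elif headers is not None:
            if not line.strip():  # a blank line ends the WARC header block
                flush(headers)
                headers = None
            elif ":" in line:
                name, value = line.split(":", 1)
                headers[name.strip().lower()] = value.strip()
    flush(headers)
    return types


def uris_without_response(warc_text):
    """Return the target URIs that have no response record, sorted."""
    by_uri = record_types_by_uri(warc_text)
    return sorted(uri for uri, seen in by_uri.items() if "response" not in seen)
```

Applied to the record above, uris_without_response would report http://www.webcitation.org/5ekDHBAVN. A payload line that happens to start with "WARC/" would confuse this simplified scanner, which is one reason to treat it as a sketch only.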

 

The WebCite archive was back online on 2019-07-09 with a significant change: the archive no longer accepts archiving requests, as its homepage indicates (see the first screenshot above).

Our WARC records below show that the archive came back online at about 2019-07-09T13:17:16Z. Note that the WARC-Date below is roughly 11 hours later than the one in the WARC record above: the WebCite archive was still down at 2019-07-09T02:01:52Z and was online again by 13:17:16Z:


WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
WARC-Concurrent-To: <urn:uuid:df5de3b0-a24b-11e9-aaf9-bb34816ea2ff>
WARC-Record-ID: <urn:uuid:df5f6a50-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=request
Content-Length: 414


GET /6MIxPlJUQ HTTP/1.1
Pragma: no-cache
Accept-Encoding: gzip, deflate
Host: www.webcitation.org
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Cache-Control: no-cache
Connection: keep-alive

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
WARC-Record-ID: <urn:uuid:df5fb870-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=response
Content-Length: 1212


HTTP/1.1 200 OK
Pragma: no-cache
Date: Tue, 09 Jul 2019 13:16:47 GMT
Server: Apache/2.4.29 (Ubuntu)
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
Set-Cookie: PHPSESSID=ugssainrn4pne801m58d41lm2r; path=/
Cache-Control: no-store, no-cache, must-revalidate
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
Content-Length: 814
Expires: Thu, 19 Nov 1981 08:52:00 GMT


<!DOCTYPE html PUBLIC
...
The archive was still down less than a minute before 2019-07-09T13:17:16Z, as there were no response records in the WARC files:

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6E2fTSO15
WARC-Date: 2019-07-09T13:16:39Z
WARC-Concurrent-To: <urn:uuid:c94dd940-a24b-11e9-8d2f-a5805b26a392>
WARC-Record-ID: <urn:uuid:c9515bb0-a24b-11e9-8d2f-a5805b26a392>
Content-Type: application/http; msgtype=request
Content-Length: 303

GET /6E2fTSO15 HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
X-DevTools-Emulate-Network-Conditions-Client-Id: 78DDE3B763F42A09787D0EBA241C9C4A
Host: www.webcitation.org
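
The three WARC-Date values quoted in this post bound the moment the archive came back. As a small worked example (the variable names are ours), the intervals can be computed directly:

```python
from datetime import datetime, timezone


def parse_warc_date(value):
    """Parse a WARC-Date such as 2019-07-09T13:17:16Z into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


earlier_failure = parse_warc_date("2019-07-09T02:01:52Z")  # request only
last_failure = parse_warc_date("2019-07-09T13:16:39Z")     # request only
first_success = parse_warc_date("2019-07-09T13:17:16Z")    # request and response

print(first_success - last_failure)     # 0:00:37
print(first_success - earlier_failure)  # 11:15:24
```

In other words, our crawl narrows the return of the archive to a window of 37 seconds, under the assumption that the missing response at 13:16:39Z was caused by the outage rather than by a transient network error on our side.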

The archive has had some issues in the past related to funding limitations:



https://archive.fo/eAETp
One of the main objectives for which WebCite was established was to reduce the impact of reference rot by allowing researchers and authors of scientific work to archive cited web pages. The archive's instability, both in providing archiving services and in staying accessible, raises important questions:

  • Is there any plan by the web archiving community to recover web pages archived by WebCite if the archive is gone?
  • Why didn't the archive give notice (e.g., on its homepage) before it became unreachable? Advance notice would have given users time to prepare, for example by downloading a particular set of mementos.
  • Did the archived content change before the archive came back online?
  • The archive now neither accepts archiving requests nor crawls web pages. Is there any plan by the archive to resume the on-demand archiving service in the future?
If WebCite disappears, the structure of the URI-M used by the archive makes it difficult to recover mementos from other web archives. This is because the URI-M of a memento (e.g., www.webcitation.org/5BmjfFFB1) does not give any additional information about the memento. These shortened URI-Ms are also used by other archives, such as archive.is and perma.cc. In contrast, the majority of web archives that employ one of the Wayback Machine’s implementations (e.g., OpenWayback and PyWb) use the URI-M structure illustrated below. Note that archive.is and perma.cc also support these Wayback Machine-style URI-Ms (i.e., each memento has two URI-Ms):

URI-M structure typically used by Wayback-style archives

This common URI-M structure provides two pieces of information: the original page's URI (URI-R) and the creation datetime of the memento (Memento-Datetime). This information can then be used to look up similar (or even identical) archived pages in other web archives using services like the LANL Memento aggregator. With the URI-M structure used by WebCite, it is not possible to recover mementos using only the URI-M. The Robust Links article introduces a mechanism to allow a user to link to an original URI and at the same time describe the state of the URI so that, in the future, users will be able to obtain information about the URI even if it disappears from the live web.
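
To make the contrast concrete, here is a minimal sketch of extracting the URI-R and Memento-Datetime from a Wayback-style URI-M. It assumes the common layout of an archive prefix, a 14-digit timestamp (optionally followed by a two-letter modifier such as id_), and then the URI-R. The function names and the example URI-M are illustrative and may not correspond to an actual capture.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed layout: <archive prefix>/<YYYYMMDDhhmmss>[<xx>_]/<URI-R>
WAYBACK_URIM = re.compile(
    r"^(?P<prefix>https?://.+?)/(?P<ts>\d{14})(?:[a-z]{2}_)?/(?P<urir>.+)$"
)


def parse_urim(urim):
    """Return (URI-R, Memento-Datetime) for a Wayback-style URI-M, else None."""
    match = WAYBACK_URIM.match(urim)
    if match is None:
        return None  # e.g., a shortened WebCite URI-M carries neither value
    memento_dt = datetime.strptime(match.group("ts"), "%Y%m%d%H%M%S")
    return match.group("urir"), memento_dt.replace(tzinfo=timezone.utc)


def accept_datetime(memento_dt):
    """Format a datetime as an HTTP-date, the form used by Accept-Datetime."""
    return format_datetime(memento_dt, usegmt=True)


urir, memento_dt = parse_urim(
    "https://web.archive.org/web/20190709131716/http://www.webcitation.org/"
)
print(urir)                         # http://www.webcitation.org/
print(accept_datetime(memento_dt))  # Tue, 09 Jul 2019 13:17:16 GMT
print(parse_urim("http://www.webcitation.org/5BmjfFFB1"))  # None
```

The URI-R and the Accept-Datetime value are what a Memento TimeGate, such as the one offered by an aggregator, needs in order to find the closest capture in other archives. The None result for the WebCite URI-M shows why no such lookup can be built from the shortened form alone.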

The WebCite archive is a well-known web archive and one of the few on-demand archiving services. It is important that the web archiving community take further steps to guarantee the long-term preservation and stability of the pages archived at webcitation.org.





--Mohamed Aturban

