2019-10-21: Where did the archive go? Part 4: WebCite
webcitation.org
- In Where did the archive go? Part 1, we provided some details about changes in the archive of Library and Archives Canada. After they upgraded their replay system, we were no longer able to find 49 out of 351 mementos (archived web pages).
- In Part 2, we focused on the movement of the National Library of Ireland (NLI) collection. Mementos from the NLI collection were moved from the European Archive to the Internet Memory Foundation (IMF) archive. Then, they were moved to Archive-It. We found that 192 of the 979 mementos cannot be found in Archive-It.
- In Part 3, we described changes in the Public Record Office of Northern Ireland (PRONI) Web Archive. Mementos in the PRONI archive were moved to Archive-It (archive-it.org). We discovered that 114 mementos, out of 469, can no longer be found in Archive-It (i.e., missing mementos).
https://web.archive.org/web/*/webcitation.org
$ date
Mon Jul 01 01:33:15 EDT 2019
$ curl -I www.webcitation.org
curl: (6) Could not resolve host: www.webcitation.org
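The same kind of reachability check can be repeated on a schedule and logged with a timestamp. The following Python sketch is one possible way to do it, not the method used in our study. The function names `host_resolves` and `availability_entry` are hypothetical, and the resolver is passed in as a parameter only so that the check can be exercised without network access.

```python
import socket
from datetime import datetime, timezone


def host_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address.

    A DNS failure (what curl reports as error 6) surfaces in Python
    as socket.gaierror, which is treated here as "does not resolve".
    """
    try:
        return len(resolver(hostname, 80)) > 0
    except socket.gaierror:
        return False


def availability_entry(hostname, resolver=socket.getaddrinfo, now=None):
    """Build one log entry recording when the check ran and its outcome."""
    now = now or datetime.now(timezone.utc)
    return {
        "checked_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host": hostname,
        "resolves": host_resolves(hostname, resolver),
    }

# Example call (performs a real DNS lookup when run):
#   availability_entry("www.webcitation.org")
```

Recording the check time in UTC keeps the log directly comparable with WARC-Date values, which are also expressed in UTC.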
We were conducting a study on a set of mementos from WebCite when the archive became inaccessible. The study included downloading the mementos and storing them locally in WARC files. Because the archive was unreachable, the WARC files contained only request records, with no response records, as shown below for the memento whose URI (URI-M) was http://www.webcitation.org/5ekDHBAVN:
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/5ekDHBAVN
WARC-Date: 2019-07-09T02:01:52Z
WARC-Concurrent-To: <urn:uuid:8519ea60-a1ed-11e9-82a3-4d5f15d9881d>
WARC-Record-ID: <urn:uuid:851c5b60-a1ed-11e9-82a3-4d5f15d9881d>
Content-Type: application/http; msgtype=request
Content-Length: 303
GET /5ekDHBAVN HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
X-DevTools-Emulate-Network-Conditions-Client-Id: 5B0BDCB102052B8A7EF0772D50B85540
Host: www.webcitation.org
<EOF>
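Finding such unanswered requests in a larger set of WARC files can be automated. The Python sketch below is an illustration under simplifying assumptions, not the tooling used in our study: it reads uncompressed WARC header text like the excerpts in this post, looks only at the WARC-Type and WARC-Target-URI fields, and pairs requests with responses by target URI. The names `parse_records` and `unanswered_requests` are hypothetical, and the sample text is abbreviated from the records shown in this post.

```python
import re

# Only the two WARC header fields needed for pairing are extracted.
HEADER_RE = re.compile(r"^(WARC-Type|WARC-Target-URI):\s*(.+?)\s*$")


def parse_records(warc_text):
    """Split WARC text into records, keeping the two fields of interest."""
    records, current = [], None
    for line in warc_text.splitlines():
        if line.strip() == "WARC/1.0":
            # A version line marks the start of a new record.
            current = {}
            records.append(current)
        elif current is not None:
            match = HEADER_RE.match(line)
            if match:
                # Keep the first occurrence of each field in a record.
                current.setdefault(match.group(1), match.group(2))
    return records


def unanswered_requests(warc_text):
    """Return target URIs that have a request record but no response record."""
    requested, answered = [], set()
    for record in parse_records(warc_text):
        uri = record.get("WARC-Target-URI")
        if uri is None:
            continue
        if record.get("WARC-Type") == "request" and uri not in requested:
            requested.append(uri)
        elif record.get("WARC-Type") == "response":
            answered.add(uri)
    return [uri for uri in requested if uri not in answered]


# Abbreviated sample: one request without a response, one answered request.
SAMPLE = """\
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/5ekDHBAVN
WARC-Date: 2019-07-09T02:01:52Z

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
"""

missing = unanswered_requests(SAMPLE)
print(missing)
```

For compressed or large WARC files, a dedicated WARC-reading library would be a better fit than this line-based scan.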
The WebCite archive was back online on 2019-07-09 with a significant change: it no longer accepts archiving requests, as its homepage indicates (see the first screenshot above).
Our WARC records below show that the archive came back online at about 2019-07-09T13:17:16Z. Note that there is a difference of roughly 11 hours between the WARC-Date value below and the one in the WARC record above. The WebCite archive was still down at 2019-07-09T02:01:52Z, and it was online again by 13:17:16Z:
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
WARC-Concurrent-To: <urn:uuid:df5de3b0-a24b-11e9-aaf9-bb34816ea2ff>
WARC-Record-ID: <urn:uuid:df5f6a50-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=request
Content-Length: 414
GET /6MIxPlJUQ HTTP/1.1
Pragma: no-cache
Accept-Encoding: gzip, deflate
Host: www.webcitation.org
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Cache-Control: no-cache
Connection: keep-alive
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.webcitation.org/6MIxPlJUQ
WARC-Date: 2019-07-09T13:17:16Z
WARC-Record-ID: <urn:uuid:df5fb870-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=response
Content-Length: 1212
HTTP/1.1 200 OK
Pragma: no-cache
Date: Tue, 09 Jul 2019 13:16:47 GMT
Server: Apache/2.4.29 (Ubuntu)
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
Set-Cookie: PHPSESSID=ugssainrn4pne801m58d41lm2r; path=/
Cache-Control: no-store, no-cache, must-revalidate
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
Content-Length: 814
Expires: Thu, 19 Nov 1981 08:52:00 GMT
<!DOCTYPE html PUBLIC
...
The archive was still down less than a minute earlier, at 2019-07-09T13:16:39Z, as the WARC files contain no response record for that request:
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://www.webcitation.org/6E2fTSO15
WARC-Date: 2019-07-09T13:16:39Z
WARC-Concurrent-To: <urn:uuid:c94dd940-a24b-11e9-8d2f-a5805b26a392>
WARC-Record-ID: <urn:uuid:c9515bb0-a24b-11e9-8d2f-a5805b26a392>
Content-Type: application/http; msgtype=request
Content-Length: 303
GET /6E2fTSO15 HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (maturban@odu.edu)
X-DevTools-Emulate-Network-Conditions-Client-Id: 78DDE3B763F42A09787D0EBA241C9C4A
Host: www.webcitation.org
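The timing argument above can be made explicit by bracketing the recovery between the last request that received no response and the first one that did. The Python sketch below illustrates this with the three WARC-Date values quoted in this post. The function names `parse_warc_date` and `recovery_window` are hypothetical, and the boolean flag stands for "a response record exists for this request".

```python
from datetime import datetime, timezone


def parse_warc_date(value):
    """Parse a WARC-Date such as 2019-07-09T13:17:16Z into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def recovery_window(observations):
    """Return (last_failure, first_success), or None if no request was answered.

    observations is an iterable of (WARC-Date string, has_response) pairs.
    last_failure is None when the earliest observation already succeeded.
    """
    timeline = sorted((parse_warc_date(date), ok) for date, ok in observations)
    last_failure = None
    for when, ok in timeline:
        if ok:
            return last_failure, when
        last_failure = when
    return None


# WARC-Date values taken from the records shown in this post.
OBSERVATIONS = [
    ("2019-07-09T02:01:52Z", False),
    ("2019-07-09T13:16:39Z", False),
    ("2019-07-09T13:17:16Z", True),
]

last_down, first_up = recovery_window(OBSERVATIONS)
gap_seconds = (first_up - last_down).total_seconds()
print(last_down.isoformat(), first_up.isoformat(), gap_seconds)
```

With these three observations the window is 37 seconds wide, so the archive came back online between 13:16:39Z and 13:17:16Z on 2019-07-09.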
The archive has had some issues in the past related to funding limitations:
#WebCite will stop accepting new submissions end of 2013, unless they reach their fundraising goals http://t.co/nug32Zzc . Please support.— Ahmed AlSum (@aalsum) February 16, 2013
https://archive.fo/eAETp
One of the main objectives for which WebCite was established was to reduce the impact of reference rot by allowing researchers and authors of scientific work to archive cited web pages. The archive's instability in providing archiving services, together with its being inaccessible from time to time, raises important questions:
- Is there any plan by the web archiving community to recover web pages archived by WebCite if the archive is gone?
- Why didn't the archive give notice (e.g., on its homepage) before it became unreachable? This would have given users some time to prepare for different scenarios, such as downloading a particular set of mementos.
- Has the archived content changed before the archive came back online?
- The archive now neither accepts archiving requests nor crawls web pages. Is there any plan by the archive to resume the on-demand archiving service in the future?
URI-M structure typically used by Wayback-style archives
The WebCite archive is a well-known web archive and one of the few on-demand archiving services. It is important that the web archiving community takes further steps to guarantee long-term preservation and stability of those archived pages at webcitation.org.
https://t.co/C46aEumrEM - the uncontested pioneer for fighting link rot in #scholcomm - should really consider handing over its activities to an org with some guarantees re long-term stability. And #Memento support ;-) / cc @permacc— Herbert @hvdsomp@octodon.social (@hvdsomp) August 1, 2019
--Mohamed Aturban
See also:
Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, Where Did the Web Archive Go?, TPDL 2021. (arXiv:2108.05939)