Thursday, October 31, 2019

2019-10-31: Continuing Education to Advance Web Archiving (CEDWARC)

Note: This blog post may be updated with additional links to slides and other resources as they become publicly available.

On October 28, 2019, web archiving experts met with librarians and archivists at the George Washington University in Washington, DC. As part of the Continuing Education to Advance Web Archiving (CEDWARC) effort, we covered several different modules related to tools and technologies for web archives. The event consisted of morning overview presentations and afternoon lab portions. Here I will provide an overview of the topics we covered.

Web Archiving Fundamentals

Prior to the event, Edward A. Fox, Martin Klein, Anna Perricci, and Zhiwu Xie created a brief tutorial covering the fundamentals of web archiving. This tutorial, shown below, was distributed as a video to attendees so they could familiarize themselves with the concepts we would discuss at CEDWARC.

Zhiwu Xie kicked off the event with a refresher of this tutorial. He stressed the complexities of archiving web content due to the number of different resources necessary to reconstruct a web page at a later time. He mentioned that it was necessary not just to capture all of these resources, but also to replay them properly. Improper replay can lead to temporal inconsistencies, as has been covered on this blog by Scott Ainsworth. He further covered WARCs and other concepts related to the archiving and replay of web pages, such as provenance.



Memento

Now that the attendees were familiar with web archives, Martin Klein provided a deep dive into what they could accomplish with Memento. Klein covered how Memento allows users to find mementos for a resource in multiple web archives via the Memento Aggregator. He also touched on recent machine learning work to improve the performance of the Memento Aggregator.

Klein highlighted how to use the Memento browser extension, available for Chrome and Firefox. He mentioned how one could use Memento with Wikipedia, and echoed my frustration with trying to get Wikipedia to adopt the protocol. He closed by introducing various Memento Time Travel APIs available.


Social Feed Manager

Laura Wrubel and Dan Kerchner covered Social Feed Manager, a tool by George Washington University that helps researchers build social media archives from Twitter, Tumblr, Flickr, and Sina Weibo. SFM does more than archive the pages of social media content. It also acquires content available via each API, preserving identifiers, location information, and other data not present on a given post's web page.


Storytelling with Web Archives

I presented work from the Dark and Stormy Archives Project on using storytelling techniques with web archives. I introduced the concept of a social card: a summary of the content of a single web page. Storytelling services like Wakelet combine social cards to summarize a topic. We can use this same concept to summarize web archives. I broke storytelling with web archives into two actions: selecting the mementos for our story and visualizing those mementos.

I briefly covered the problems of scale with selecting mementos manually before moving on to the steps of AlNoamany's Algorithm. WS-DL alumna Yasmin AlNoamany developed this algorithm to produce a set of representative mementos from a web archive collection. AlNoamany's Algorithm can be executed using our Off-Topic Memento Toolkit.
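The toolkit supports several similarity measures for detecting mementos that have drifted off-topic. The sketch below illustrates the general idea, not the toolkit's actual code: compare each memento's text against the collection's first memento with a simple term-frequency cosine similarity, and flag anything below an arbitrary threshold.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between simple term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def off_topic(first_text, later_texts, threshold=0.15):
    """Flag mementos whose similarity to the first memento falls below threshold."""
    return [t for t in later_texts if cosine(first_text, t) < threshold]

# A hijacked domain's "for sale" page scores near zero against the original topic.
print(off_topic("hurricane relief donations",
                ["hurricane relief supplies donations", "domain for sale cheap"],
                threshold=0.3))
```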

Visualizing mementos requires social cards that describe the underlying content well without confusing the viewer. Because existing card services introduce confusion when applied to mementos, we created MementoEmbed to produce surrogates for mementos. Building on MementoEmbed, we created Raintale to produce stories from lists of memento URLs (URI-Ms).

In the afternoon, I conducted a series of laboratory exercises with the attendees using these tools.



ArchiveSpark

Helge Holzmann presented ArchiveSpark, a tool for efficiently processing and extracting data from web archive collections. ArchiveSpark provides filters and other tools to reduce the data and provide it in a variety of accessible formats. It achieves efficient access by first extracting information from CDX and other metadata files before directly processing WARC content.

ArchiveSpark uses Apache Spark to run multiple processes in parallel. Users employ Scala to filter and process their collections to extract data. Helge emphasized that data is provided in a JSON format that was inspired by the Twitter API. He closed by showing how one can use ArchiveSpark with Jupyter notebooks.


Archives Unleashed

Samantha Fritz and Sarah McTavish highlighted several tools provided by the Archives Unleashed Project. WS-DL members have been to several Archives Unleashed events, and I was excited to see these tools introduced to a new audience.

The team briefly covered the capabilities of the Archives Unleashed Toolkit (AUT). AUT employs Hadoop and Apache Spark to allow users to perform collection analytics, text analysis, named-entity recognition, network analysis, image analysis, and more. From there, they introduced Archives Unleashed Cloud (AUK) for extracting datasets from one's own Archive-It collections. These datasets can then be consumed and further analyzed in Archives Unleashed Notebooks. Finally, they covered Warclight, which provides a discovery layer for web archives.


Event Archiving

Ed Fox closed out our topics by detailing the event archiving work done by the Virginia Tech team. He talked about the issues with using social media posts to supply URLs for events so that web archives could then quickly capture them. After some early poor results, his team has worked extensively on improving the quality of seeds through the use of topic modeling, named entity extraction, location information, and more. This work is currently reflected in the GETAR (Global Event and Trend Archive Research) project.

In the afternoon session, he helped attendees acquire seed URLs via the Event Focused Crawler (EFC). Using code from Code Ocean and Docker containers, we were able to perform a focused crawl to locate additional URLs about an event. In addition to EFC, we classified Twitter accounts using TwiRole.


Update on 2019/10/31 at 20:42 GMT: The original version neglected to include the afternoon Webrecorder laboratory session, which I did not attend. Thanks to Anna Perricci for providing us with a link to her slides and some information about the session.

In the afternoon, Anna Perricci presented a laboratory titled "Human scale web collecting for individuals and institutions" which was about using Webrecorder. Unfortunately, I was not able to engage in these exercises because I was at Ed Fox's Event Archiving session. Webrecorder was part of the curriculum because it is a free, open source tool that demonstrates some essential web archiving elements. Her session covered manual use of Webrecorder as well as its newer autopilot capabilities.


CEDWARC's goal was to educate and provide outreach to the greater librarian and archiving community. We introduced the attendees to a number of tools and concepts. Based on the response to our initial announcement and the results from our sessions, I think we have succeeded. I look forward to potential future events of this type.

-- Shawn M. Jones

Monday, October 28, 2019

2019-10-28: The interaction between search engine caches and web archives

News articles from Indian newspapers about a corruption case involving an Indian doctor. The left images show screenshots of the articles from the print newspaper. The right images show the URLs for the articles returning 404 pages.

My brother, a lawyer in India, recently sent me two screenshots, shown in Figures 1 and 2, of a news article about a corruption case involving a renowned doctor from India. To proceed with legal action against the newspapers for publishing the article, my brother needed evidence about the publication of the articles, so he sought my help in finding the URLs of the articles shown in the screenshots. The news articles were published in an English language newspaper, The Asian Age, and a Hindi language newspaper, Punjab Kesari.

Figure 1: Screenshot of the news article from the English language newspaper, The Asian Age shared with me by my brother

Figure 2: Screenshot of the news article from the Hindi language newspaper, Punjab Kesari shared with me by my brother

Finding URLs for the screenshot of the news articles

I searched the websites of The Asian Age and Punjab Kesari for the articles and found links to the articles (shown in the Original URL row of Tables 1 and 2) but they both redirected to a 404 page, as shown in Figures 3 and 4. Fortunately, we found search engine (SE) cached copies of both articles in the Google and Bing caches, as shown in Figures 5 and 6.

Plinio Vargas in his post "Link to Web Archives, not Search Engine Caches" talks about the ephemeral nature of the SE cache URLs and highlights the reason for linking to the web archives over the SE cache URLs. Furthermore, Dr. Michael Nelson in his post "Russell Westbrook, Shane Keisel, Fake Twitter Accounts, and Web Archives" has already shown us the use of SE cache URLs and the web archives to find answers to real world problems.

Figure 3: A 404 page appears on accessing the news article from Punjab Kesari
Figure 4: A 404 page appears on accessing the news article from The Asian Age 

cURL response for The Asian Age news article, which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL ""
HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:07 GMT
Server: Apache/2.4.7 (Ubuntu)
Cache-Control: max-age=300
Expires: Fri, 20 Sep 2019 18:40:07 GMT
Connection: close
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 301 Moved Permanently
Date: Fri, 20 Sep 2019 18:35:08 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=dsp7g2kkn5sfk2eggaftg3un84; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: /404.html
X-Cache: MISS from
Connection: close
Content-Type: text/html

HTTP/1.1 200 OK
Date: Fri, 20 Sep 2019 18:35:10 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.29
Set-Cookie: PHPSESSID=koaujt0tiaqgjvafa5je1djps5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Cache: MISS from
Connection: close
Content-Type: text/html

Figure 5: Bing Cache  

Figure 6: Google Cache

cURL response for the Punjab Kesari news article, which redirects to a 404 page
msiddique@wsdl-3102-03:~/Desktop/Test$ curl -IL ""
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Connection: keep-alive
Cache-Control: private
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 18:45:12 GMT
X-Cache: Miss from cloudfront
Via: 1.1 (CloudFront)
X-Amz-Cf-Pop: IAD79-C2

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 76757
Connection: keep-alive
Cache-Control: public, no-cache="Set-Cookie", max-age=15000
Expires: Fri, 20 Sep 2019 17:17:08 GMT
Last-Modified: Fri, 20 Sep 2019 13:07:08 GMT
Server: Microsoft-IIS/8.0
Date: Fri, 20 Sep 2019 13:07:08 GMT
Vary: Accept-Encoding,Cookie
X-Cache: Hit from cloudfront
Via: 1.1 (CloudFront)
X-Amz-Cf-Pop: IAD79-C2
X-Amz-Cf-Id: 5PzkcGPXziNxfNLDffTV3-V6Ks2w3FQiEUWnHMzfZm_aDKfyBKjw7A==
Age: 20281

Push the cached URLs to multiple web archives

We pushed the Bing and Google cache URLs (URI-R-SEs) for both news articles to the Internet Archive and other web archives. The URI-Ms for the URI-R-SEs are shown in Tables 1 and 2. We can use ArchiveNow to automate pushing URLs to multiple web archives. We also captured WARC files of the URI-R-SEs for the articles using Webrecorder and stored the WARCs locally.

Table 1: Links to the original URL, SE cache URLs, and the mementos for The Asian Age news article.
The Asian Age News Article
Original URL (URI-R)
Google Cache URL (URI-R-SE)
Mementos for Google Cache (URI-M) Internet Archive
Bing Cache URL (URI-R-SE)
Mementos for Bing Cache (URI-M) Internet Archive

Table 2: Links to the original URL, SE cache URLs, and the mementos for Punjab Kesari news article.
Punjab Kesari News Article
Original URL (URI-R)
Google Cache URL (URI-R-SE) No results found
Mementos for Google Cache (URI-M) No Mementos
Bing Cache URL (URI-R-SE)
Mementos for Bing Cache (URI-M) Internet Archive

Accessing the Cache URLs in the Web Archives  

Web archives index mementos by their URI-R. A SE cache URI-M can only be accessed by users who know the URI-R-SE, which is mostly opaque as a result of various parameters and encodings. As shown in Figure 7, the URI-R-SE for the same web resource may vary by geographic location, which means that the same web resource may be indexed under different URI-R-SEs in the web archives.

In the US, the Bing cache URL for The Asian Age news article is

In India, the Bing cache URL for The Asian Age news article is

Figure 7: The Bing Cache URL for the US (left) is 200 and the one for India 
is 404 (right)

Pushing the URI-R-SE to multiple web archives not only makes it accessible from multiple web archives, but also allows some web archives to be leveraged to find mementos in the other web archives. As shown in Figure 8, extracts the URI-R of the article from the URI-R-SE and indexes the URI-Ms for the URI-R-SE under both the URI-R and the URI-R-SE. As shown in Figure 9, we accessed a memento from the Internet Archive for the URI-R-SE using the extracted URI-R-SE from, which is what the other web archives consider the URI-R.

Figure 8: lists the Bing cache URL for the memento upon searching for the URL of the web page, which can be used to search in other web archives

Figure 9: Using the Bing cache URL from to retrieve mementos of the search engine cache from the Internet Archive

Figure 10: Memento of a SE cache which did not capture the intended content

Figure 11: Google indexed a document from the Internet Archive which lists the memento from for The Asian Age news article

As shown in Figure 10, the Internet Archive has archived Bing's soft 404 for the URI-R-SE. Fortunately, as shown in Figure 8, its memento was archived before the URI-R-SE became a soft 404. At times, we can find URI-Ms to a 404 page indexed in Google search results. As shown in Figure 11, the Google search result for The Asian Age news article listed a document from the Internet Archive which contains the URI-M from for the news article.
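Recovering the URI-R from a URI-R-SE is often possible for Google's cache, because the cache URL commonly carries the original URL in its `q` parameter as `cache:<URI-R>`. The sketch below assumes that form (Bing's cache URLs are more opaque); the example URL is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def uri_r_from_google_cache(uri_r_se: str) -> str:
    """Recover the original URI-R from a Google cache URL, assuming the
    common form .../search?q=cache:<URI-R>. Returns "" when absent."""
    q = parse_qs(urlparse(uri_r_se).query).get("q", [""])[0]
    return q[len("cache:"):] if q.startswith("cache:") else ""

# Hypothetical Google cache URL for illustration.
cache_url = ""
print(uri_r_from_google_cache(cache_url))
```

The recovered URI-R can then be used to query other archives, which index their mementos under it.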

Sometimes SE caches have pages that are missing (404) from the live web but not yet archived. We should push SE cache URLs (URI-R-SEs) to multiple web archives. We can automate the process of saving URLs to multiple web archives simultaneously by using ArchiveNow. We can use web archives like to get the URI-R-SE using the URI-R of the resource, which can then be used to search the other web archives for mementos of the URI-R-SEs.
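ArchiveNow wraps the submission endpoints of several archives. For the Internet Archive alone, archiving a URI-R-SE amounts to requesting the Save Page Now endpoint; the sketch below only constructs that request URL, with the endpoint form (`<url>`) assumed from the public service.

```python
from urllib.parse import quote

def save_page_now_url(uri: str) -> str:
    """Internet Archive Save Page Now request URL for a given URI.
    Endpoint form assumed; tools like ArchiveNow wrap this and other
    archives' submission endpoints."""
    return "" + quote(uri, safe=":/?&=")

print(save_page_now_url(""))
```

Issuing a GET against this URL (with urllib or requests) asks the Internet Archive to capture the page on demand.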

My studies in web archiving helped me solve a real-world problem posed by my brother, who needed the URLs of the news articles for which he had screenshots. I found those URLs in SE caches and pushed them to multiple web archives, and he will use the resulting mementos in his legal proceedings.

-- Mohammed Nauman Siddique

Friday, October 25, 2019

2019-10-25: Summary of "Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online"

Figure 1 Disclosed Vulnerabilities by Year (Source: CVE Details)

The number of software vulnerabilities discovered and disclosed to the public is steadily increasing every year. As shown in Figure 1, in 2018 alone, more than 16,000 Common Vulnerabilities and Exposures (CVE) identifiers were assigned by various CVE Numbering Authorities (CNA). CNAs are organizations from around the world that are authorized to assign CVE IDs to vulnerabilities affecting products within their distinct, agreed-upon scope. Given the volume of data and limited skilled cyber security resources, organizations are challenged to identify the vulnerabilities that pose the greatest risk to their technology resources.

One of the key reasons the current approaches to cyber vulnerability remediation are ineffective is that organizations cannot effectively determine whether a given vulnerability poses a meaningful threat. In their paper,  "Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online", Almukaynizi et al. draw on a body of work that seeks to define an exploit prediction model that leverages data from online sources generated by the white-hat community (i.e., ethical hackers). The white-hat data is combined with vulnerability mentions scraped from the dark web to provide an early predictor of exploits that could appear "in the wild" (i.e., real world attacks).

Video: What is the dark web? And, what will you find there? (Source:

Common Vulnerability Scoring System (CVSS) Explained
The CVSS is a framework for rating the severity of security vulnerabilities in software and hardware. Operated by the Forum of Incident Response and Security Teams (FIRST), the CVSS uses a publicly disclosed algorithm to determine three severity rating scores: Base, Temporal, and Environmental. The scores are numeric and range from 0.0 through 10.0, with 10.0 being the most severe. According to the most recent version of the CVSS, v3.0:
  • A score of 0.0 receives a "None" rating. 
  • A score of 0.1-3.9 receives a "Low" rating. 
  • A score of 4.0-6.9 receives a "Medium" rating. 
  • A score of 7.0-8.9 receives a "High" rating. 
  • A score of 9.0-10.0 receives a "Critical" rating. 
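These bands translate directly into code; a minimal mapping from a CVSS v3.0 base score to its qualitative rating:

```python
def cvss_v3_rating(score: float) -> str:
    """Map a CVSS v3.0 base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print(cvss_v3_rating(9.8))  # → Critical
```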
As shown in Figure 2, the score is computed according to elements and subscores required for each of the three metric groups.
  • The Base score is the metric most relied upon by enterprise organizations and reflects the inherent qualities of a vulnerability. 
  • The Temporal score represents the qualities of the vulnerability that change over time.  
  • The Environmental score represents the qualities of the vulnerability that are specific to the affected user's environment. 
Figure 2 CVSS Metric Groups (Source: FIRST)

Because they are specific to an organization's environment, the temporal and environmental metrics are normally not reflected in the reported CVSS base score, but they can be calculated independently using the equations published in the FIRST Common Vulnerability Scoring v3.1 specification document. The CVSS allows organizations to prioritize which vulnerabilities to remediate first and to gauge the overall impact of the vulnerabilities on their systems. A consistent finding in this stream of research is that the status quo for how organizations address vulnerability remediation is often less than optimal and has significant room for improvement. With that in mind, Almukaynizi et al. present an alternative prediction model which they evaluate against the standard CVSS methodology.

Exploit Prediction Model
Figure 3 depicts the individual elements that Almukaynizi et al. use to describe the three phases of their exploit prediction model. The phases are Data Collection, Feature Extraction, and Prediction.

Figure 3 Exploit Prediction Model (Source: Almukaynizi)
Data Collection
This phase is used to build a dataset for further analysis. Vulnerability data is assimilated from:
  • NVD. 12,598 vulnerabilities (unique CVE IDs) were disclosed and published in the National Vulnerability Database (NVD) between 2015 and 2016. For each vulnerability, the authors gathered the description, CVSS base score, and scoring vector.
  • EDB (white-hat community). Exploit Database is an archive of public, proof of concept (PoC) exploits and corresponding vulnerable software, developed for use by penetration testers and vulnerability researchers. The PoC exploits can often be mapped to a CVE ID. Using the unique CVE-IDs from the NVD database for the time period between 2015 and 2016, the authors queried the EDB to determine whether a PoC exploit was available. 799 of the vulnerabilities in the 2015 to 2016 data set were found to have verified PoC exploits. For each PoC, the authors gathered the date the PoC exploit was posted.
  • ZDI (vulnerability detection community). The Zero Day Initiative (ZDI) encourages the reporting of verified zero-day vulnerabilities privately to the affected vendors by financially rewarding researchers; a process which is sometimes referred to as a bug bounty. The authors queried this database to identify 824 vulnerabilities in the 2015 to 2016 data set that were common with the NVD.
  • DW (black-hat community). The authors built a system which crawls marketplace sites and forums on the dark web to collect data related to malicious hacking. They used a machine learning approach to identify content of interest and exclude irrelevant postings (e.g., pornography). They retained any postings which specifically have a CVE number or could be mapped from a Microsoft Security Bulletin to a corresponding CVE ID. They found 378 unique CVE mentions between 2015 and 2016.
  • Attack Signatures (Ground truth). Symantec's anti-virus and intrusion detection attack signatures were used to identify exploits actually used in the wild, not just PoC exploits. Some attack signatures are mapped to the CVE ID of the exploited vulnerability; these were correlated with NVD, EDB, ZDI, and DW. The authors noted this database may be biased towards products from certain vendors (e.g., Microsoft, Adobe).
Table I shows the number of vulnerabilities exploited as compared to the ones disclosed for all the data sources considered.
Source: Almukaynizi

Feature Extraction
A summary of features extracted from all the data sources mentioned is provided in Table II.
Source: Almukaynizi
  • The NVD description provides information on the vulnerability and the capabilities attackers will gain if they exploit it. Contextual information gleaned from DW was appended to the NVD description. Here, the authors observed foreign languages in use, which they translated into English using the Google Translate API. The text features were analyzed using Term Frequency-Inverse Document Frequency (TF-IDF) to create a vocabulary of the 1,000 most frequent words in the entire data set; common words were eliminated since they are not important features.
  • The NVD provides CVSS base scores and vectors which indicate the severity of the vulnerability. The categorical components of the vector include Access Complexity, Authentication, Confidentiality, Integrity, and Availability. All possible categories of features were vectorized then assigned a value of "1" or "0" to denote whether the category is present or not.
  • The DW feeds are posted in different languages; most notably in English, Chinese, Russian, and Swedish. The language of the DW post is used rather than the description since important information can be lost during the translation process.
  • The presence of PoC on either EDB, DW, or ZDI increases the likelihood of a vulnerability being exploited. This is treated as a binary feature.
The authors employed several supervised machine learning approaches to perform binary classification on the selected features, predicting whether or not a vulnerability would be exploited.
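The categorical encoding described above can be sketched as follows. The category list and CVSS vector here are illustrative, and this is not the authors' actual feature pipeline: each possible CVSS vector component becomes a 0/1 feature, and the binary PoC indicator is appended.

```python
def vectorize(cvss_vector: str, has_poc: bool, categories):
    """One-hot encode the categorical CVSS vector components and append
    the binary proof-of-concept feature (illustrative sketch)."""
    present = set(cvss_vector.split("/"))   # e.g. {"AV:N", "AC:L", ...}
    return [1 if c in present else 0 for c in categories] + [int(has_poc)]

# Hypothetical category list; a real pipeline enumerates every component value.
categories = ["AV:N", "AV:L", "AC:L", "AC:H"]
print(vectorize("AV:N/AC:L/Au:N/C:P/I:P/A:P", True, categories))
# → [1, 0, 1, 0, 1]
```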
Vulnerability and Exploit Analysis
As shown in Table III, Almukaynizi et al. assessed the importance of aggregating disparate data sources by first analyzing the likelihood of exploitation based on the coverage of each source. Then, they conducted a language based analysis to identify any socio-cultural factors present in the DW sites which might influence exploit likelihood. Table III presents the percentage of exploited vulnerabilities that appeared in each source along with results for the intersection.

Table III

                                          EDB    ZDI    DW   EDB ∪ ZDI   EDB ∪ ZDI ∪ DW
Number of vulnerabilities                 799    824    378      1180          1791
Number of exploited vulnerabilities        74     95     52       140           164
Percentage of exploited vulnerabilities   21%    31%    17%       46%           54%
Percentage of total vulnerabilities      6.3%   6.5%   3.0%      9.3%         14.2%

Source: Almukaynizi

The authors determined that 2.4% of the vulnerabilities disclosed in the NVD are exploited in the wild. The correct prediction of exploit likelihood increased when additional data sources were included. This was balanced by the fact that each data community (EDB, ZDI, DW) operates under a distinct set of guidelines (e.g., white hat, researchers, hackers).

In the DW, four languages were detected, which resulted in noticeable variations in the exploit likelihood. English and Chinese have more vulnerability mentions (n=242 and n=112, respectively) than Russian and Swedish (n=13 and n=11, respectively). Chinese postings exhibited the lowest exploitation rate (~10%). However, 46% of the vulnerabilities mentioned in Russian postings were exploited. Figure 4 shows the language analysis based on vulnerability mentions.
Figure 4 Exploited Vulnerabilities by Language (Source: Almukaynizi)

Performance Evaluation
Experiments using the exploit prediction model were examined using different supervised machine learning algorithms including Support Vector Machine (SVM), Random Forest (RF), Naive Bayes Classifier (NB), and Logistic Regression (LOG-REG). Random Forest, which is based on generating multiple decision trees, was found to provide the best F1 measure to determine classes of exploited versus not exploited vulnerabilities.

Their classifier was evaluated based on precision, recall, and Receiver Operating Characteristic (ROC) curves. If minimizing the number of incorrectly flagged vulnerabilities is the goal, then high precision is desired. If minimizing the number of undetected vulnerabilities is the goal, then high recall is desired. To avoid temporal intermixing, the NVD data was sorted by disclosure date and the first 70% was used for training and the rest for testing. This was necessary so that future PoC events would not influence the prediction of past events (i.e., a vulnerability is published before its exploitation date). Table IV shows the precision, recall, and corresponding F1 measure for vulnerabilities mentioned on DW, ZDI, and EDB. DW information was able to identify exploited vulnerabilities with the highest precision, at 0.67.
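The chronological split can be sketched in a few lines; the record layout (a `disclosed` date field) is hypothetical. The key point is sorting by disclosure date and cutting without shuffling, so nothing from the "future" leaks into training.

```python
def temporal_split(records, train_frac=0.7):
    """Sort by disclosure date and split chronologically (no shuffling),
    so future events cannot influence predictions of past events."""
    ordered = sorted(records, key=lambda r: r["disclosed"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

data = [{"cve": f"CVE-2015-{i:04d}", "disclosed": f"2015-{i:02d}-01"}
        for i in range(1, 11)]
train, test = temporal_split(data)
print(len(train), len(test))  # → 7 3
```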

Table IV

Source: Almukaynizi
Almukaynizi et al. indicate promising results based on their random forest classification scheme. It should be noted that random forest outputs a confidence score for each sample which can be evaluated against a user-defined threshold for predicting a vulnerability as exploited. While the authors acknowledge the threshold can be varied in accordance with other factors of importance to the organization (e.g., vendors), they do not disclose the hard-cut threshold used during their experiments. It is also noteworthy that false negatives that received the lowest confidence scores shared common features (e.g., Adobe Flash and Acrobat Reader), base scores, and similar descriptions in the NVD. A similar observation was noted among the false positives where all predicted exploited vulnerabilities existed in Microsoft products. The inherent class imbalance in vulnerability data may also be a contributing factor along with perceived biases in the Symantec attack signatures which provide ground truth. In the future, the authors hope to enhance their exploit prediction model by expanding the vulnerability data sources to include social media sites and online blogs.

-- Corren McCoy (@correnmccoy)

Almukaynizi, Mohammed, et al. "Proactive identification of exploits in the wild through vulnerability mentions online." 2017 International Conference on Cyber Conflict (CyCon US). IEEE, 2017.

Monday, October 21, 2019

2019-10-21: Where did the archive go? Part 4: WebCite
We previously described changes in the following web archives:
In the last part of this four part series, we focus on changes in (WebCite). The WebCite archive has been operational in its current form since at least 2004 and was the first archive to offer an on-demand archiving service by allowing users to submit URLs of web pages. Around 2019-06-07, the archive became unreachable. The Wayback Machine indicates that there were no mementos captured for the domain between 2019-06-07 and 2019-07-09 (for about a month), which is the longest period of time in 2019 that has no mementos for WebCite in the Internet Archive:*/

The host was not resolving as shown in the cURL session below:

$ date
Mon Jul 01 01:33:15 EDT 2019

$ curl -I
curl: (6) Could not resolve host:

We were conducting a study on a set of mementos from WebCite when the archive was inaccessible. The study included downloading the mementos and storing them locally in WARC files. Because the archive was unreachable, the WARC files contained only request records, with no response records, as shown below (the URI of the memento (URI-M) was

WARC-Type: request
WARC-Date: 2019-07-09T02:01:52Z
WARC-Concurrent-To: <urn:uuid:8519ea60-a1ed-11e9-82a3-4d5f15d9881d>
WARC-Record-ID: <urn:uuid:851c5b60-a1ed-11e9-82a3-4d5f15d9881d>
Content-Type: application/http; msgtype=request
Content-Length: 303

Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (
X-DevTools-Emulate-Network-Conditions-Client-Id: 5B0BDCB102052B8A7EF0772D50B85540



The WebCite archive was back online on 2019-07-09 with a significant change; the archive no longer accepts archiving requests as its homepage indicates (e.g., the first screenshot above). 

Our WARC records below show that the archive came back online by 2019-07-09T13:17:16Z. Note the few hours' difference between the WARC-Date below and the one in the WARC record above: the archive was still down at 2019-07-09T02:01:52Z but online again around 13:17:16Z:

WARC-Type: request
WARC-Date: 2019-07-09T13:17:16Z
WARC-Concurrent-To: <urn:uuid:df5de3b0-a24b-11e9-aaf9-bb34816ea2ff>
WARC-Record-ID: <urn:uuid:df5f6a50-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=request
Content-Length: 414

Pragma: no-cache
Accept-Encoding: gzip, deflate
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Cache-Control: no-cache
Connection: keep-alive

WARC-Type: response
WARC-Date: 2019-07-09T13:17:16Z
WARC-Record-ID: <urn:uuid:df5fb870-a24b-11e9-aaf9-bb34816ea2ff>
Content-Type: application/http; msgtype=response
Content-Length: 1212

HTTP/1.1 200 OK
Pragma: no-cache
Date: Tue, 09 Jul 2019 13:16:47 GMT
Server: Apache/2.4.29 (Ubuntu)
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
Set-Cookie: PHPSESSID=ugssainrn4pne801m58d41lm2r; path=/
Cache-Control: no-store, no-cache, must-revalidate
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
Content-Length: 814
Expires: Thu, 19 Nov 1981 08:52:00 GMT

The archive was still down a few seconds before 2019-07-09T13:17:16Z, as there were no response records in the WARC files:

WARC-Type: request
WARC-Date: 2019-07-09T13:16:39Z
WARC-Concurrent-To: <urn:uuid:c94dd940-a24b-11e9-8d2f-a5805b26a392>
WARC-Record-ID: <urn:uuid:c9515bb0-a24b-11e9-8d2f-a5805b26a392>
Content-Type: application/http; msgtype=request
Content-Length: 303

GET /6E2fTSO15 HTTP/1.1
Upgrade-Insecure-Requests: 1
User-Agent: Web Science and Digital Libraries Group (@WebSciDL); Project/archives_fixity; Contact/Mohamed Aturban (
X-DevTools-Emulate-Network-Conditions-Client-Id: 78DDE3B763F42A09787D0EBA241C9C4A

The archive has had some issues in the past related to funding limitations:
One of the main objectives for which WebCite was established was to reduce the impact of reference rot by allowing researchers and authors of scientific work to archive cited web pages. The instability in providing archiving services, and the archive being inaccessible from time to time, raise important questions:

  • Is there any plan by the web archiving community to recover web pages archived by WebCite if the archive is gone?
  • Why didn't the archive give notice (e.g., on its homepage) before it became unreachable? Notice would give users some time to deal with different scenarios, such as downloading a particular set of mementos.
  • Did the archived content change before the archive came back online?
  • The archive now neither accepts archiving requests nor crawls web pages. Is there any plan by the archive to resume the on-demand archiving service in the future?
If WebCite disappears, the structure of the URI-M used by the archive will make it difficult to recover mementos from other web archives. This is because the URI-M of a memento (e.g., does not give any additional information about the memento. These shortened URI-Ms are also used by other archives, such as and In contrast, the majority of web archives that employ one of the Wayback Machine's implementations (e.g., OpenWayback and PyWb) use the URI-M structure illustrated below. Note that and also support these Wayback Machine-style URI-Ms (i.e., each memento has two URI-Ms):

URI-M structure typically used by Wayback-style archives

This common URI-M structure provides two pieces of information: the original page's URI (URI-R) and the creation datetime of the memento (Memento-Datetime). This information can then be used to look up similar (or even identical) archived pages in other web archives using services like the LANL Memento aggregator. With the URI-M structure used by WebCite, it is not possible to recover mementos using only the URI-M. The Robust Links article introduces a mechanism that allows a user to link to an original URI and, at the same time, describe the state of the URI so that, in the future, users can obtain information about the URI even if it disappears from the live web.

The WebCite archive is a well-known web archive and one of the few on-demand archiving services. It is important that the web archiving community takes further steps to guarantee long-term preservation and stability of those archived pages at

--Mohamed Aturban