2018-07-18: Why We Need Private Web Archives: Almost Two-Thirds of Web Traffic IS NOT Publicly Archivable

Google.com mementos from May 8th 1999 on the Internet Archive

In terms of the ability to be archived in public web archives, web pages fall into one of two categories: publicly archivable, or not publicly archivable.

1. Publicly Archivable Web Pages:

These pages are archivable by public archives. The pages can be accessed without login/authentication. In other words, these pages do not reside behind a paywall. Grant Atkins examined paywalls in the Internet Archive for news sites and found that web pages behind paywalls may actually be redirecting to a login page at crawl time. A good example of a publicly archivable page is Dr. Steven Zeil's page since no authentication is required to view the page. Furthermore, it does not use client-side scripts (i.e., Ajax) to load additional content, so what you see in the web browser and what you can replay from public web archives are exactly the same.

Screen shot from Dr. Steven Zeil's page captured on 2018-07-02

Memento for Dr. Zeil's page on the Internet Archive captured on 2017-12-02

Some web pages provide "personalized" content depending on the GeoIP of the requester. In these cases, what you see in the browser and what you can replay from public web archives are nearly the same, except for some minor personalization/GeoIP related changes. For example, a user requesting https://www.islamicfinder.org from Suffolk, Virginia will see the prayer times for the closest major city (Norfolk, Virginia). On the other hand, when the Internet Archive crawls the page, it sees the prayer times for San Bruno, California. This is likely because the crawling/archiving is happening from San Francisco, California. The two pages, otherwise, are exactly the same!

The live version of https://www.islamicfinder.org for a user in Suffolk, VA on 2018-07-02

Memento for https://www.islamicfinder.org from the Internet Archive captured on 2018-06-22

Some social media sites, like Twitter, are publicly archivable, and the Internet Archive captures most of their content. Twitter's home page is personalized, so user-specific content, like "Who to Follow" and "Trends for you", is not captured, but the tweets are. Also, some Twitter services require authentication.

@twitter live web page

@twitter memento from the Internet Archive captured on 2016-05-18

The archived memento for the @twitter web page shows a message that cookies are used and that they are important for an enhanced user experience. Nevertheless, the main content of the page, the tweets, is preserved (or at least the top-k tweets, since the crawler does not automatically scroll at archive time to activate the Ajax-based pagination, cf. Corren McCoy's "Pagination Considered Harmful to Archiving").

Message from Twitter about the use of cookies to enhance the user experience

Also, deep links to individual tweets are archivable.

Memento for a deep link to a tweet on the Internet Archive captured on 2013-01-18

2. Not Publicly Archivable Web Pages:

In terms of web traffic, search engines are at the top. According to SimilarWeb, Google is number one, with a 10.10% share of all web traffic. The Internet Archive crawls it on a regular basis and has over 0.5 million mementos as of 2018-05-01 (cf. Mat Kelly's tech report about the difficulty in counting the number of mementos). The captured mementos are exact copies in appearance, but obviously they are not functioning search pages.

As of 2018-05-01 the IA has 552,652 mementos of Google.com

Google.com memento from May 8th 1999 on the Internet Archive played on 2018-05-01

It is possible to push a search result page from Google to a public web archive like archive.is, but that is not how web archives are normally used.

A Google search query for "Machine Learning" on 2018-06-18 archived in archive.is

Furthermore, it is not viable for web archives to try to archive search engine result pages (SERPs), because the infinite number of possible search queries and query syntaxes yields an infinite number of possible URIs. Even if we preserve a single SERP from June 2018 (as shown above), we are unable to issue new queries against a June 2018 version of Google. Maps and other applications that depend on user interaction are similar: individual pages may be archived, but we typically don't consider the entire application "archived" (cf. Michael Nelson's "Game Walkthroughs As A Metaphor for Web Preservation").

Even when web archives use headless browsers to overcome the Ajax problem, there can be additional challenges. For example, I pushed a page from Google Maps with an address in Chesapeake, Virginia to archive.today and the result was a page from Google support (in Russian) telling me that I (or more accurately, archive.today) need to update the browser in order to use Google Maps! While technically not a paywall, this is similar to Grant's study mentioned above in that there is now something in the web archive corresponding to that Google Maps URI, but it does not match the users' expectations. It also reveals a clue about the GeoIP of archive.today.

Google Maps page for the address 4940 S Military HWY, Chesapeake, VA 23321 pushed to archive.today on 2018-07-02

Memento for the Google Maps page I pushed to archive.today on 2018-07-02

It is worth mentioning there are emerging tools like Webrecorder, WARCreate, WAIL, and Memento Tracer for personal web archiving (or community tools in the case of Tracer), but even if/when the Internet Archive replaces Heritrix with Brozzler and resolves the problems with Ajax, their Wayback Machine cannot be expected to have pages requiring authentication, nor pages with effectively infinite inputs like search engines and maps.

Social media pages respond differently when web archives' crawlers try to crawl and archive them. Public web archives might have mementos of some social media pages, but these sites often require a login before they will serve a page's representation; otherwise, the request is redirected. Another obstacle to archiving social media pages is their heavy use of client-side scripts that, for example, fetch new content when the page is scrolled or when comments are shown or hidden, with no change in the URI. Facebook, for example, does not allow web archives' crawlers to access the majority of its pages. The Internet Archive's Wayback Machine returned 1,699 mementos for the former president's official Facebook page, but when I opened one of these mementos, it returned the infamous Facebook login or register page.

1,699 mementos for the official Facebook page of Mr. Obama, former U.S. president as of 2018-05-01

The memento captured on 2017-02-10 is showing the login page of Facebook

There are a few exceptions where the Internet Archive is able to archive some user-contributed Facebook pages.

Memento for a Facebook page in the Internet Archive captured on 2012-03-02

Also, it seems like archive.is is using a dummy account ("Nathan") to authenticate, view, and archive some Facebook pages.

Memento for a Facebook page in archive.is captured on 2018-06-21

With the previous exceptions in mind, it is still safe to say that Facebook pages are not publicly archivable.

LinkedIn behaves the same way as Facebook. The notifications page has 46 mementos as of 2018-05-29, but they are entirely empty. The live page contains notifications about contacts, such as birthdays, job anniversaries, new jobs, and so on. This page is completely personalized and requires a cookie or login to display information that is related to the user, and therefore the Internet Archive has no way of downloading its representation.

My account's notification page on LinkedIn

Memento of LinkedIn's notification page

The last example I would like to share is Amazon's "yourstore" page. I chose this example because it contains recommended items (another clean example of a personalized web page). The recommendations are based on the user's behavior. In my case, Amazon recommended electronics, automotive tools, and Prime Video.

My Amazon's page (live) on 2018-05-02

As of 2018-05-02, I found 111 mementos for Amazon's "yourstore" page in the Internet Archive, and opened one of them to see what had been captured.

Mementos for Amazon's yourstore page in the Internet Archive on 2018-05-02

As I expected, the page has a redirect to another page that asks for a login. It returned a 302 response code when it was crawled by the Internet Archive. The actual content of the original page was not archived because the IA crawler does not provide credentials to download the content of the page. The representation saved to the Internet Archive is for a resource different from the originally crawled page.

IA crawler was redirected to a login page and captured it instead
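The redirect behavior described above can be sketched as a small check over a capture's recorded status code and Location header. This is only an illustrative heuristic, not how the Internet Archive classifies captures: the function name, the list of path hints, and the example URLs are all assumptions made for the sketch.

```python
from urllib.parse import urlsplit

# Hypothetical path fragments that suggest a login page; a real check
# would need a list tuned to the sites being studied.
LOGIN_HINTS = ("login", "signin", "sign-in")

REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_login_redirect(status, location):
    """Return True if a capture looks like a redirect to a login page.

    status   -- HTTP status code recorded at crawl time
    location -- value of the Location header, or None if absent
    """
    if status not in REDIRECT_CODES or not location:
        return False
    path = urlsplit(location).path.lower()
    return any(hint in path for hint in LOGIN_HINTS)


# Illustrative inputs only (example.com stands in for a real site).
print(is_login_redirect(302, "https://example.com/ap/signin?next=yourstore"))
print(is_login_redirect(200, None))
```

A capture flagged this way has something stored under the original URI, but the stored representation belongs to the login page rather than the page the crawler asked for.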

There are many web sites with this behavior, so it is safe to assume that for some web sites, even when there are plenty of mementos, they all might return a soft 404.

Estimating the amount of archivable web traffic:

To explore the amount of web traffic that is archivable, I examined the top 100 sites as ranked by Alexa, and manually constructed a data set of those 100 sites using traffic analysis services from SimilarWeb and Semrush.

The data was collected on 2018-02-23, and I captured three web traffic measures offered by both services: total visits, unique visits, and pages/visit.

Total visits is the total number of non-unique visits from last month.

Unique visits is the number of unique visits from last month.

Pages/visit is the average number of visited pages per user's visit.

I determined whether or not a website is archivable based on the discussion I provided earlier, and put it all together in a CSV file to use later as input for my script. Suggestions, feedback, and pull requests are always welcome!

The data set used in the experiment

Using Python 3, I wrote a simple script that calculates the percentage of web traffic that is publicly archivable. I am assuming that the top 100 sites are a good representative sample of the whole web. I am aware that 100 sites is a small number compared to the 1.8 billion live websites on the Internet, but according to SimilarWeb, the top 100 sites receive 48.86% of the entire traffic on the web, which is consistent with a Pareto distribution. The program offers six different results, each of which is based on a certain measure or a combination of measures: total visits, unique visits, and pages/visit. Flags can be set to control which measures are used in the calculation. If no flags are set, the program shows all the results using all three measures and their combinations. I came up with this formula to calculate the percentage of publicly archivable websites based on all three measures combined:

1. Multiply the pages/visit by visits for each web site from both SimilarWeb and SemRush
2. Take the average for both sources, SimilarWeb and SemRush
3. Take the average of unique visits for each website from SimilarWeb and SemRush
4. Add the numbers obtained in 2 and 3
5. Add the number obtained in 4 for all archivable websites
6. Add the number obtained in 4 for all non-archivable websites
7. Add the numbers obtained in 5 and 6 to get the total
8. Calculate the percentage of the numbers obtained in 5 and 6 from the total, obtained in 7
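The eight steps above can be sketched in Python as follows. This is a minimal sketch written from the description in this post, not the script published with the data set: the record layout, field names, and function names are assumptions, and the numbers in the usage example are made up purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SiteTraffic:
    # Hypothetical record layout; the columns in the actual CSV may differ.
    name: str
    archivable: bool
    sw_visits: float           # SimilarWeb total visits
    sw_unique: float           # SimilarWeb unique visits
    sw_pages_per_visit: float  # SimilarWeb pages/visit
    sr_visits: float           # Semrush total visits
    sr_unique: float           # Semrush unique visits
    sr_pages_per_visit: float  # Semrush pages/visit


def site_score(s):
    # Steps 1-2: pages/visit times visits per source, averaged over both sources.
    page_views = (s.sw_visits * s.sw_pages_per_visit
                  + s.sr_visits * s.sr_pages_per_visit) / 2
    # Step 3: average of unique visits over both sources.
    unique = (s.sw_unique + s.sr_unique) / 2
    # Step 4: add the two numbers.
    return page_views + unique


def archivable_split(sites):
    """Return (archivable %, non-archivable %) of the combined score."""
    archivable = sum(site_score(s) for s in sites if s.archivable)          # step 5
    non_archivable = sum(site_score(s) for s in sites if not s.archivable)  # step 6
    total = archivable + non_archivable                                     # step 7
    return 100 * archivable / total, 100 * non_archivable / total           # step 8


# Made-up numbers for two sites, only to exercise the formula.
sites = [
    SiteTraffic("site-a", True, 100, 50, 2, 100, 50, 2),
    SiteTraffic("site-b", False, 300, 150, 2, 300, 150, 2),
]
print(archivable_split(sites))
```

Note that step 4 adds a page-view count to a visitor count, so the combined score is a weighting device rather than a quantity with a single unit.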

Using all measures, I found that 65.30% of the traffic of the top 100 sites is not archivable by public web archives. The program and the data set are available on GitHub.

Now, it is possible to discuss three different scenarios and compute a range. If the top 100 sites receive 48.86% of the traffic, and 65.30% of that traffic is not publicly archivable, then:

1. If all of the remaining web traffic is publicly archivable, then 31.91% of the entire web traffic is not publicly archivable (65.30 * 0.4886 = 31.91).
2. If the remaining web traffic is similar to the traffic from the top 100 sites, then 65.30% of the entire web traffic is not publicly archivable.
3. Finally, if all of the remaining web traffic is not publicly archivable, then only 16.95% of the entire web traffic is archivable (34.70 * 0.4886 = 16.95). This means that 83.05% of the entire web traffic is not publicly archivable.

So the percentage of not publicly archivable web traffic is between 31.91% and 83.05%. More likely, it is close to 65.30% (the second case).
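The three scenarios are the same weighted average with a different assumption about the traffic outside the top 100 sites. The sketch below reproduces the arithmetic; the constants come from the figures quoted in this post, while the function and variable names are mine.

```python
TOP100_SHARE = 0.4886               # share of all web traffic going to the top 100 sites
TOP100_NOT_ARCHIVABLE_PCT = 65.30   # % of top-100 traffic that is not publicly archivable


def overall_not_archivable(rest_not_archivable_pct):
    """Percent of all web traffic that is not publicly archivable, given an
    assumed percentage for the traffic outside the top 100 sites."""
    return (TOP100_SHARE * TOP100_NOT_ARCHIVABLE_PCT
            + (1 - TOP100_SHARE) * rest_not_archivable_pct)


lower = overall_not_archivable(0.0)                         # rest fully archivable
likely = overall_not_archivable(TOP100_NOT_ARCHIVABLE_PCT)  # rest resembles the top 100
upper = overall_not_archivable(100.0)                       # rest not archivable at all

for label, value in (("lower", lower), ("likely", likely), ("upper", upper)):
    print(f"{label}: {value:.2f}%")
```

Writing it as one function makes the dependence explicit: the bounds move only with the two measured inputs, while the estimate in between depends entirely on what is assumed about the remaining 51.14% of traffic.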

I would like to emphasize that since the top 100 websites are mainly Google, Bing, Yahoo, etc., and their derivatives, the nature of these top sites is the determining factor of my results. However, since the range has been calculated, it is safe to say that at least 1/3 of the entire web traffic is not publicly archivable. This percentage demonstrates the need for private web archives. There are a few available tools that address this problem: Webrecorder, WARCreate, and WAIL. Public web archiving sites like the Internet Archive, archive.is, and others will never be able to preserve personalized or private web pages like emails, bank accounts, etc.

Take Away Message:

Personal web archiving is crucial since at least 31.91% of the entire web traffic is not archivable by public web archives. This is due to the increased use of personalized/private web pages and the use of technologies that hinder the ability of web archives' crawlers to crawl and archive these pages. The experiment shows that the percentage of not publicly archivable web traffic can be as high as 83.05%, but the more likely case is that around 65% of web traffic is not publicly archivable. Unfortunately, no matter how good public web archives get at capturing web pages, there will always be a significant number of web pages that are not publicly archivable. This emphasizes the need for personal web archiving tools, such as Webrecorder, WARCreate, and WAIL, possibly combined with a collaboratively maintained repository of how to interact with complex sites, as introduced by Memento Tracer. Even if Ajax-related web archiving problems were eliminated, no less than 1/3 of web traffic is to sites that will otherwise never appear in public web archives.

--
Hussam Hallak

Web Science and Digital Libraries Research Group