| Google.com mementos from May 8th, 1999, on the Internet Archive |
In terms of the ability to be archived in public web archives, web pages fall into one of two categories: publicly archivable, or not publicly archivable.
1. Publicly Archivable Web Pages:
These pages can be archived by public web archives because they can be accessed without login or authentication; in other words, they do not reside behind a paywall.
Grant Atkins examined
paywalls in the Internet Archive for news sites and found that web pages behind paywalls may actually redirect to a login page at crawl time. A good example of a publicly archivable page is
Dr. Steven Zeil's page, since no authentication is required to view it. Furthermore, it does not use client-side scripts (i.e., Ajax) to load additional content, so what you see in the web browser and what you can replay from public web archives are exactly the same.
Some web pages provide "personalized" content depending on the GeoIP of the requester. In these cases, what you see in the browser and what you can replay from public web archives are nearly the same, except for some minor personalization/GeoIP related changes. For example, a user requesting
https://www.islamicfinder.org from Suffolk, Virginia will see the prayer times for the closest major city (Norfolk, Virginia). On the other hand, when the Internet Archive crawls the page, it sees the prayer times for San Bruno, California, likely because the crawling/archiving happens from San Francisco, California. Otherwise, the two pages are exactly the same!
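For illustration, here is a minimal sketch (not part of the original experiment; the city names and the crude comparison heuristic are assumptions) of how one could fetch the closest memento of a page via the Wayback Machine's availability API and compare it against the live page:

```python
# Sketch: compare a live, GeoIP-personalized page with its closest memento.
# Requires the third-party "requests" library.
import requests

live_uri = "https://www.islamicfinder.org"

# Ask the Wayback Machine's availability API for the closest memento.
resp = requests.get("https://archive.org/wayback/available",
                    params={"url": live_uri}).json()
closest = resp.get("archived_snapshots", {}).get("closest")

if closest:
    live_html = requests.get(live_uri).text
    memento_html = requests.get(closest["url"]).text
    # Crude personalization check: which city does each copy mention?
    for city in ("Norfolk", "San Bruno"):
        print(f"{city}: live={city in live_html}, memento={city in memento_html}")
else:
    print("No memento found for", live_uri)
```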
Some social media sites, like
Twitter, are publicly archivable and the Internet Archive captures most of their content. Twitter's home page is personalized, so user-specific content, like "Who to Follow" and "Trends for you", is not captured, but the tweets are. Also, some Twitter services require authentication.
The archived memento for the @twitter web page shows a message that cookies are used and are important for an enhanced user experience; nevertheless, the main content of the page, the tweets, is preserved (or at least the top-k tweets, since the crawler does not automatically scroll at archive time to trigger the Ajax-based pagination; cf. Corren McCoy's "Pagination Considered Harmful to Archiving").
Also, deep links to individual tweets are archivable.
2. Not Publicly Archivable Web Pages:
In terms of web traffic, search engines are at the top. According to
SimilarWeb, Google is number one; its share is 10.10% of the entire web traffic. The Internet Archive crawls it on a regular basis and has over 0.5 million mementos of it as of 2018-05-01 (cf.
Mat Kelly's tech report about
the difficulty in counting the number of mementos). The captured mementos look exactly like the live page, but they are obviously not functioning search pages.
It is possible to push a
search result page from Google to a public web archive like
archive.is, but that is not how web archives are normally used.
Furthermore, it is not viable for web archives to try to archive search engine results pages (SERPs) because there is an effectively infinite number of possible URIs, owing to the infinite number of possible search queries and syntax. So even if we preserve a single SERP from June 2018 (as shown above), we are unable to issue new queries against a June 2018 version of Google. Maps and other applications that depend on user interaction are similar: individual pages may be archived, but we typically don't consider the entire application "archived" (cf.
Michael Nelson's "
Game Walkthroughs As A Metaphor for Web Preservation").
Even when web archives use headless browsers to overcome the Ajax problem, there can be additional challenges. For example, I pushed a page from Google Maps with an address in Chesapeake, Virginia to archive.today, and the result was a page from Google support (in Russian) telling me that I (or more accurately, archive.today) needed to update the browser in order to use Google Maps! While technically not a paywall, this is similar to Grant's study mentioned above in that there is now something in the web archive corresponding to that Google Maps URI, but it does not match the user's expectations. It also reveals a clue about the GeoIP of archive.today.
It is worth mentioning that there are emerging tools like
Webrecorder,
WARCreate,
WAIL, and
Memento Tracer for personal web archiving (or community tools in the case of Tracer), but even if/when the Internet Archive replaces
Heritrix with
Brozzler and resolves the problems with Ajax, its Wayback Machine cannot be expected to have pages requiring authentication, nor pages with effectively infinite inputs like search engines and maps.
Social media pages respond differently when web archives' crawlers try to crawl and archive them. Public web archives might have mementos of some social media pages; however, the sites often require a login before allowing the download of a page's representation, and otherwise redirect the request. Another obstacle to archiving social media pages is their heavy use of client-side scripts that, for example, fetch new content when the page is scrolled or when comments are hidden/shown, with no change in the URI. Facebook, for example, does not allow web archives' crawlers to access the majority of its pages. The Internet Archive's Wayback Machine returned
1,699 mementos for
the former president's official Facebook page, but when I opened one of these mementos, it returned the infamous Facebook login-or-register page.
| The memento captured on 2017-02-10 shows the login page of Facebook |
There are a few exceptions where the Internet Archive is able to archive some user-contributed Facebook pages.
Also, it seems like
archive.is is using a dummy account ("Nathan") to authenticate, view, and archive some Facebook pages.
With the previous exceptions in mind, it is still safe to say that Facebook pages are not publicly archivable.
LinkedIn behaves the same way as Facebook. The
notifications page has
46 mementos as of 2018-05-29, but they are entirely empty. The live page contains notifications about contacts, such as who has a birthday, a job anniversary, a new job, and so on. This page is completely personalized and requires a cookie or login to display information related to the user; therefore, the Internet Archive has no way of downloading its representation.
The last example I would like to share is
Amazon's "yourstore" page. I chose this example because it contains recommended items (another clean example for personalized web pages). The recommendations are based on the user's behavior. In my case, Amazon recommended electronics, automotive tools, and prime video.
As of 2018-05-02, I found
111 mementos for "
my Amazon's your store page" in the Internet Archive, and opened one of them to see what had been captured.
As I expected, the page redirects to another page that asks for a login; it returned a 302 response code when it was crawled by the Internet Archive. The actual content of the original page was not archived because the IA crawler does not provide credentials to download it. The representation saved in the Internet Archive is therefore for a resource different from the originally crawled page.
There are many websites with this behavior, so it is safe to assume that for some sites, even when there are plenty of mementos, they all might return a
soft 404.
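As an illustration, here is a minimal sketch (the memento URI and the heuristics are assumptions, not part of the original experiment) of how one might flag mementos that merely captured a redirect to a login page:

```python
# Sketch: detect mementos that are really "soft 404s" (login/register pages).
# Requires the third-party "requests" library.
import requests

def looks_like_login(memento_uri):
    """Follow redirects and report whether the final page resembles a login form."""
    resp = requests.get(memento_uri, allow_redirects=True, timeout=30)
    final_uri = resp.url.lower()
    body = resp.text.lower()
    return ("login" in final_uri or "signin" in final_uri
            or 'type="password"' in body)

# Hypothetical memento URI; substitute one of the mementos discussed above.
print(looks_like_login("https://web.archive.org/web/2017/https://www.example.com/"))
```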
Estimating the amount of archivable web traffic:
To explore the amount of web traffic that is archivable, I examined the top 100 sites as ranked by
Alexa, and manually constructed a
data set of those 100 sites using traffic analysis services from
SimilarWeb and
Semrush.
The data was collected on 2018-02-23, and I captured three web traffic measures offered by both websites: total visits, unique visits, and pages/visit.
- Total visits is the total number of non-unique visits over the previous month.
- Unique visits is the number of unique visits over the previous month.
- Pages/visit is the average number of pages viewed per visit.
I determined whether or not a website is archivable based on the discussion I provided earlier, and put it all together in a
CSV file to use later as input for
my script. Suggestions, feedback, and pull requests are always welcome!
Using Python 3, I wrote a
simple script that calculates the percentage of web traffic that is publicly archivable. I am assuming that the top 100 sites are a good representation of the whole web. I am aware that 100 sites is a small number compared to
1.8 billion live websites on the Internet, but according to SimilarWeb, the top 100 sites receive 48.86% of the entire traffic on the web, which is consistent with a
Pareto distribution. The program offers six different results, each of which is based on a certain measure or a combination of measures: total visits, unique visits, and pages/visit. Flags can be set to control which measures are used in the calculation. If no flags are set, the program shows all the results using all three measures and their combination. I came up with the following procedure to calculate the percentage of publicly archivable traffic based on all three measures combined (sketched in code after the list):
1. Multiply the pages/visit by the visits for each website, from both SimilarWeb and SemRush.
2. Take the average of the two numbers from step 1 (SimilarWeb and SemRush).
3. Take the average of the unique visits for each website from SimilarWeb and SemRush.
4. Add the numbers obtained in steps 2 and 3.
5. Sum the number obtained in step 4 over all archivable websites.
6. Sum the number obtained in step 4 over all non-archivable websites.
7. Add the numbers obtained in steps 5 and 6 to get the total.
8. Calculate the percentages of the numbers obtained in steps 5 and 6 relative to the total obtained in step 7.
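For illustration, a minimal sketch of this calculation (this is not the actual script from GitHub; the CSV file name and column names are assumptions):

```python
# Sketch of the combined-measure calculation described above.
# The CSV layout (file name and column names) is hypothetical.
import csv

archivable_total = 0.0
non_archivable_total = 0.0

with open("top100.csv") as f:
    for row in csv.DictReader(f):
        # Steps 1-2: pages/visit * visits, averaged over SimilarWeb and SemRush.
        pv = sum(float(row[f"{s}_pages_per_visit"]) * float(row[f"{s}_visits"])
                 for s in ("similarweb", "semrush")) / 2
        # Step 3: unique visits, averaged over the two sources.
        uv = sum(float(row[f"{s}_unique_visits"])
                 for s in ("similarweb", "semrush")) / 2
        weight = pv + uv                          # step 4
        if row["archivable"].strip().lower() == "yes":
            archivable_total += weight            # step 5
        else:
            non_archivable_total += weight        # step 6

total = archivable_total + non_archivable_total  # step 7
print(f"Not publicly archivable: {100 * non_archivable_total / total:.2f}%")  # step 8
print(f"Publicly archivable:     {100 * archivable_total / total:.2f}%")
```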
Using all measures, I found that 65.30% of the traffic of the top 100 sites is not archivable by public web archives. The program and the data set are available on
GitHub.
Now, it is possible to discuss three different scenarios and compute a range. If the top 100 sites receive 48.86% of the traffic, and 65.30% of that traffic is not publicly archivable, then:
- If all of the remaining web traffic is publicly archivable, then 31.91% of the entire web traffic is not publicly archivable. 65.30 * 0.4886 = 31.91.
- If the remaining web traffic is similar to the traffic from the top 100 sites, then 65.30% of the entire web traffic is not publicly archivable.
- Finally, if all of the remaining web traffic is not publicly archivable, then only 16.95% of the entire web traffic is archivable. 34.7 * 0.4886 = 16.95. This means that 83.05% of the entire web traffic is not publicly archivable.
So the percentage of not publicly archivable web traffic is between 31.91% and 83.05%. More likely, it is close to 65.30% (the second case).
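As a quick sanity check, the bounds above can be reproduced with a few lines (the numbers are taken directly from the text):

```python
# Recompute the range of non-archivable traffic from the three scenarios above.
top100_share = 0.4886            # fraction of all web traffic going to the top 100 sites
not_archivable_top100 = 0.6530   # fraction of top-100 traffic that is not archivable

lower = not_archivable_top100 * top100_share                        # rest fully archivable
upper = not_archivable_top100 * top100_share + (1 - top100_share)   # rest not archivable
print(f"Not publicly archivable: {lower:.2%} to {upper:.2%}")       # ~31.91% to ~83.05%
```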
I would like to emphasize that since the top 100 websites are mainly Google, Bing, Yahoo, etc., and their derivatives, the nature of these top sites is the determining factor in my results. However, since the range has been calculated, it is safe to say that at least 1/3 of the entire web traffic is not publicly archivable. This percentage underscores the necessity of private web archives. There are a few available tools that address this problem:
Webrecorder,
WARCreate, and
WAIL. Public web archiving sites like the
Internet Archive,
archive.is, and others will never be able to preserve personalized or private web pages like emails, bank accounts, etc.
Take Away Message:
Personal web archiving is crucial since at least 31.91% of the entire web traffic is not archivable by public web archives. This is due to the increased use of personalized/private web pages and the use of technologies that hinder the ability of web archives' crawlers to crawl and archive these pages. The experiment shows that the percentage of not publicly archivable web traffic can be as high as 83.05%, but the more likely case is that around 65% of web traffic is not publicly archivable. Unfortunately, no matter how good public web archives get at capturing web pages, there will always be a significant number of web pages that are not publicly archivable. This emphasizes the need for personal web archiving tools, such as
Webrecorder,
WARCreate, and
WAIL - possibly combined with a collaboratively-maintained repository of how to interact with complex sites, as introduced by
Memento Tracer. Even if Ajax-related web archiving problems were eliminated, no less than 1/3 of web traffic is to sites that will otherwise never appear in public web archives.
--
Hussam Hallak