Sunday, July 22, 2018

2018-07-22: Tic-Tac-Toe and Magic Square Made Me a Problem Solver and Programmer

"How did you learn programming?", a student asked me in a recent summer camp. Dr. Yaohang Li organized the Machine Learning and Data Science Summer Camp for High School students of the Hampton Roads metropolitan region at the Department of Computer Science, Old Dominion University from June 25 to July 9, 2018. The camp was funded by the Virginia Space Grant Consortium. More than 30 students participated in it. They were introduced to a variety of topics such as Data Structures, Statistics, Python, R, Machine Learning, Game Programming, Public Datasets, Web Archiving, and Docker etc. in the form of discussions, hands-on labs, and lectures by professors and graduate students. I was invited to give a lecture about my research and Docker. At the end of my talk I solicited questions and distributed Docker swag.

The question "How did you learn programming?" led me to draw Tic-Tac-Toe Game and a 3x3 Magic Square on the white board. Then I told them a more than a decade old story of the early days of my bachelors degree when I had recently got my very first computer. One day while brainstorming on random ideas, I realized the striking similarity between the winning criteria of a Tic-Tac-Toe game and sums of 15 using three numbers of a 3x3 Magic Square that uses unique numbers from one to nine. The similarity has to do with their three rows, three columns, and two diagonals. After confirming that there are only eight combinations of selecting three unique numbers from one to nine whose sum is 15, I was sure that those are all placed at strategic locations in a magic square and there is no other possibility left for another such combination. If we assign values to each block of the Tic-Tac-Toe game according the Magic Square and store list of values acquired by the two players, we can decide potential winning moves in the next step by trying various combinations of two acquired vales of a player and subtracting it from 15. For example, if places 4 and 3 are acquired by the red (cross sign) player then a potential winning move would be place 8 (15-4-3=8). With this basic idea of checking potential wining move, when the computer is playing against a human, I could set strategies of first checking for the possibility of winning moves by the computer and if none are available then check for the possibility of the next winning moves by the human player and block them. While there are many other approaches to solve this problem, my idea was sufficient to get me excited and try to write a program for it.

By that time I only had a basic understanding of programming constructs such as variables, conditions, loops, and functions in the C programming language, learned as part of the introductory Computer Science curriculum. While C is a great language for many reasons, it was not an exciting language for me as a beginner. If I were to write the Tic-Tac-Toe game in C, I would have ended up with a text-based user interface in the terminal, which is not what I was looking for. I asked someone about the possibility of writing software with a graphical user interface (GUI), and he suggested that I try Visual Basic. So I went to the library, got a book on VB6, and studied it for about a week. Then I was ready to create a small window with nine buttons arranged in a 3x3 grid. When one of these buttons was clicked, a colored label (a circle or a cross) would be placed and a callback function would be called with an argument holding the value associated with the position of the button (as per the Magic Square arrangement). The callback function could then update the game state and play the next move. Later, the game was improved with different modes and settings.
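I no longer have that VB6 program, but the same wiring can be sketched with Python's bundled tkinter: nine gridded buttons, each reporting its magic-square value to a callback. A hypothetical reconstruction of the design, not the original code:

```python
# Magic-square value assigned to each (row, col) button position.
MAGIC = [[2, 7, 6],
         [9, 5, 1],
         [4, 3, 8]]

def magic_value(row, col):
    """Value passed to the click callback for the button at (row, col)."""
    return MAGIC[row][col]

def build_board(root, on_click):
    """Lay out nine buttons in a 3x3 grid; clicking a button reports
    its magic-square value to the callback, mirroring the VB6 design."""
    import tkinter as tk  # imported here so the helpers above stay GUI-free
    for r in range(3):
        for c in range(3):
            value = magic_value(r, c)
            tk.Button(root, text=" ", width=6, height=3,
                      command=lambda v=value: on_click(v)).grid(row=r, column=c)

if __name__ == "__main__":
    import tkinter as tk
    root = tk.Tk()
    root.title("Tic-Tac-Toe")
    build_board(root, lambda v: print("clicked magic value", v))
    root.mainloop()
```

The callback receives only the magic-square value, so the game logic never needs to know about rows and columns at all.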

One day, I excitedly shared my program and approach with a professor (who now works for Microsoft). He said this technique is explored in an algorithms book too. This made me feel a little deflated because I was not the first one to come up with the idea. However, I was equally happy that I had discovered it independently and that it had already been validated by some smart people.

This was not the only occasion when I had an idea and needed the right tool to express it. Over time, my curiosity led me to many more challenges, ideas for potential solutions, and the exploration of numerous suitable tools, techniques, and programming languages.

My talk was scheduled for Wednesday, June 27, 2018. I started by introducing myself, the WS-DL Research Group, and the basics of Web Archiving, and then briefly talked about my Archive Profiling research. Without going too much into the technical details, I tried to explain the need for Memento Routing and how Archive Profiles can help achieve it.

Luckily, Dr. Michele Weigle had already introduced Web Archiving to them the day before my talk. When I started mentioning Web Archives, they knew what I was talking about. This helped me cut my talk down and save some time to talk about other things and the Q/A session.

I then put my Docker Campus Ambassador hat on and started with the Dockerizing ArchiveSpark story. Then I briefly described what Docker is, where it can be useful, and how it works. I walked them through a code example to illustrate the procedure of working with Docker. As expected, it was their first encounter with Docker, and many of them had no experience with the Linux operating system either, so I tried to keep things as simple as possible.

I had a lot of stickers and some leftover T-shirts from my previous Docker event, so I gave them to those who asked any questions. A couple of days later, Dr. Li told me that the students were very excited about Docker and especially those T-shirts, so I decided to give a few more of those away. For that, I asked them a few questions related to my earlier talk, and whoever was able to recall the answers got a T-shirt.

Overall, I think it was a successful summer camp. I am positive that those High School students had a great learning experience and exposure to some research techniques that can be helpful in their careers, and that some of them might be encouraged to pursue a graduate degree one day. As a research university, ODU has many talented graduate students with a variety of expertise and experiences that can benefit the community at large. I think more such programs should be organized in the Department of Computer Science and various other departments of the university.

It was a fun experience for me as I interacted with High School students here in the USA for the first time. They were all energetic, excited, and engaging. Good luck to all who were part of this two-week-long event. And now you know how I learned programming!

Sawood Alam

Wednesday, July 18, 2018

2018-07-18: HyperText and Social Media (HT) Trip Report

Leaping Tiger statue next to the College of Arts at Towson University
From July 9 - 12, the 2018 ACM Conference on Hypertext and Social Media (HT) took place at the College of Arts at Towson University in Baltimore, Maryland. Researchers from around the world presented the results of complete or ongoing work in tutorial, poster, and paper sessions. Also, during the conference I had the opportunity to present a full paper: "Bootstrapping Web Archive Collections from Social Media" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson.

Day 1 (July 9, 2018)

The first day of the conference was dedicated to a tutorial (Efficient Auto-generation of Taxonomies for Structured Knowledge Discovery and Organization) and three workshops:
  1. Human Factors in Hypertext (HUMAN)
  2. Opinion Mining, Summarization and Diversification
  3. Narrative and Hypertext
I attended the Opinion Mining, Summarization and Diversification workshop. The workshop started with a talk titled: "On Reviews, Ratings and Collaborative Filtering," presented by Dr. Oren Sar Shalom, principal data scientist at Intuit, Israel. Next, Ophélie Fraisier, a PhD student studying stance analysis on social media at Paul Sabatier University, France, presented: "Politics on Twitter: A Panorama," in which she surveyed methods of analyzing tweets to study and detect polarization and stances, as well as election prediction and political engagement.
Next, Jaishree Ranganathan, a PhD student at the University of North Carolina, Charlotte, presented: "Automatic Detection of Emotions in Twitter Data - A Scalable Decision Tree Classification Method."
Finally, Amin Salehi, a PhD student at Arizona State University, presented: "From Individual Opinion Mining to Collective Opinion Mining." He showed how collective opinion mining can help capture the drivers behind opinions as opposed to individual opinion mining (or sentiment) which identifies single individual attitudes toward an item.

Day 2 (July 10, 2018)

The conference officially began on day 2 with a keynote: "Lessons in Search Data" by Dr. Seth Stephens-Davidowitz, a data scientist and NYT bestselling author of: "Everybody Lies."
In his keynote, Dr. Stephens-Davidowitz revealed insights gained from search data ranging from racism to child abuse. He also discussed a phenomenon in which people are likely to lie to pollsters (social desirability bias) but are honest to Google ("Digital Truth Serum") because Google incentivizes telling the truth. The paper sessions followed the keynote with two full papers and a short paper presentation.

The first (full) paper of day 2 in the Computational Social Science session: "Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora," was presented by Shubhanshu Mishra, a PhD student at the iSchool of the University of Illinois at Urbana-Champaign. He showed correlations between user-level and tweet-level metadata by addressing two questions: "Do tweets from users with similar Twitter characteristics have similar sentiments?" and "What meta-data features of tweets and users correlate with tweet sentiment?" 
Next, Dr. Fred Morstatter presented a full paper: "Mining and Forecasting Career Trajectories of Music Artists," in which he showed that their dataset generated from concert discovery platforms can be used to predict important career milestones (e.g., signing by a major music label) of musicians.
Next, Dr. Nikolaos Aletras, a research associate at University College London, Media Futures Group, presented a short paper: "Predicting Twitter User Socioeconomic Attributes with Network and Language Information." He described a method of predicting the occupational class and income of Twitter users by using information extracted from their extended networks.
After a break, the Machine Learning session began with a full paper (Best Paper Runner-Up): "Joint Distributed Representation of Text and Structure of Semi-Structured Documents," presented by Samiulla Shaikh, a software engineer and researcher at IBM India Research Labs.
Next, Dr. Oren Sar Shalom presented a short paper titled: "As Stable As You Are: Re-ranking Search Results using Query-Drift Analysis," in which he presented the merits of using query-drift analysis for search re-ranking. This was followed by a short paper presentation titled: "Embedding Networks with Edge Attributes," by Palash Goyal, a PhD student at University of Southern California. In his presentation, he showed a new approach to learn node embeddings that uses the edges and associated labels.
Another short paper presentation (Recommendation System session) by Dr. Oren Sar Shalom followed. It was titled: "A Collaborative Filtering Method for Handling Diverse and Repetitive User-Item Interactions." He presented a collaborative filtering model that captures multiple complex user-item interactions without any prior domain knowledge.
Next, Ashwini Tonge, a PhD student at Kansas State University, presented a short paper titled: "Privacy-Aware Tag Recommendation for Image Sharing," in which she presented a means of tagging images on social media that improves the quality of user annotations while preserving users' privacy sharing patterns.
Finally, Palash Goyal presented another short paper titled: "Recommending Teammates with Deep Neural Networks."

The day 2 closing keynote, by Leslie Sage, director of data science at DevResults, followed a break that featured a brief screening of the 2018 World Cup semi-final game between France and Belgium. In her keynote, she presented the challenges experienced in the application of big data toward international development.

Day 3 (July 11, 2018)

Day 3 of the conference began with a keynote: "Insecure Machine Learning Systems and Their Impact on the Web" by Dr. Ben Zhao, Neubauer Professor of Computer Science at the University of Chicago. He highlighted many milestones of machine learning by showing problems it has solved in natural language processing and computer vision. However, he showed that opaque machine learning systems are vulnerable to attacks by agents with malicious intent, and he expressed the idea that these critical issues must be addressed, especially given the rush to deploy machine learning systems.
Following the keynote, I presented our full paper: "Bootstrapping Web Archive Collections from Social Media" in the Temporal session. I highlighted the importance of web archive collections as a means of preserving the historical record of important events, and the seeds (URLs) from which they are formed. The seeds are collected by expert curators, but we do not have enough experts to collect seeds in a world of rapidly unfolding events. Consequently, I proposed exploiting the collective domain expertise of web users by generating seeds from social media collections, and showed, through a novel suite of measures, that seeds generated from social media are similar to those generated by experts.

Next, Paul Mousset, a PhD student at Paul Sabatier University, presented a full paper: "Studying the Spatio-Temporal Dynamics of Small-Scale Events in Twitter," in which he presented his work into the granular identification and characterization of event types on Twitter.
Next, Dr. Nuno Moniz, invited Professor at the Sciences College of the University of Porto, presented a short paper: "The Utility Problem of Web Content Popularity Prediction." He demonstrated that state-of-the-art approaches for predicting web content popularity have been optimized for improving the predictability of average behavior of data: items with low levels of popularity.
Next, Samiulla Shaikh presented the first full paper (Nelson Newcomer Award winner) of the Semantic session: "Know Thy Neighbors, and More! Studying the Role of Context in Entity Recommendation," in which he showed how to efficiently explore a knowledge graph for the purpose of entity recommendation by utilizing contextual information to help select a subset of entities in the knowledge graph.
Samiulla Shaikh also presented a short paper: "Content Driven Enrichment of Formal Text using Concept Definitions and Applications," in which he showed a method of making formal text more readable to non-expert users through text enrichment, e.g., highlighting definitions and fetching definitions from external data sources.
Next, Yihan Lu, a PhD student at Arizona State University, presented a short paper: "Modeling Semantics Between Programming Codes and Annotations." He presented the results of investigating a systematic method to examine annotation semantics and its relationship with source code. He also showed their model, which predicts concepts in programming code annotations. Such annotations could be useful to new programmers.
Following a break, the User Behavior session began. Dr. Tarmo Robal, a research scientist at the Tallinn University of Technology, Estonia, presented a full paper: "IntelliEye: Enhancing MOOC Learners' Video Watching Experience with Real-Time Attention Tracking." He introduced IntelliEye, a system that monitors students watching video lessons and detects when they are distracted and intervenes in an attempt to refocus their attention.
Next, Dr. Ujwal Gadiraju, a postdoctoral researcher at L3S Research Center, Germany, presented a full paper: "SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing." He presented his findings from investigating the role of task similarity in microtask crowdsourcing on platforms such as Amazon Mechanical Turk and its effect on market dynamics.
Next, Xinyi Zhang, a computer science PhD candidate at UC Santa Barbara, presented a short paper: "Penny Auctions are Predictable: Predicting and profiling user behavior on DealDash." She showed that penny auction sites such as DealDash are vulnerable to modeling and adversarial attacks by showing that both the timing and source of bids are highly predictable and users can be easily classified into groups based on their bidding behaviors.
Shortly after another break, the Hypertext paper sessions began. Dr. Charlie Hargood, senior lecturer at Bournemouth University, UK, and Dr. David Millard, associate professor at the University of Southampton, UK, presented a full paper: "The StoryPlaces Platform: Building a Web-Based Locative Hypertext System." They presented StoryPlaces, an open source authoring tool designed for the creation of locative hypertext systems.
Next, Sharath Srivatsa, a Masters student at International Institute of Information Technology, India, presented a full paper: "Narrative Plot Comparison Based on a Bag-of-actors Document Model." He presented an abstract "bag-of-actors" document model for indexing, retrieving, and comparing documents based on their narrative structures. The model resolves the actors in the plot and their corresponding actions.
Next, Dr. Claus Atzenbeck, professor at Hof University, Germany, presented a short paper: "Mother: An Integrated Approach to Hypertext Domains." He stated that the Dexter Hypertext Reference Model which was developed to provide a generic model for node-link hypertext systems does not match the need of Component-Based Open Hypermedia Systems (CB-OHS), and proposed how this can be remedied by introducing Mother, a system that implements link support.
The final (short) paper of the day, "VAnnotatoR: A Framework for Generating Multimodal Hypertexts," was presented by Giuseppe Abrami. He introduced a virtual reality and augmented reality framework for generating multimodal hypertexts called VAnnotatoR. The framework enables the annotation and linkage of texts, images and their segments with walk-on-able animations of places and buildings.
The conference banquet at Rusty Scupper followed the last paper presentation. The next HyperText conference was announced at the banquet.

Day 4 (July 12, 2018)

The final day of the conference began with a keynote: "The US National Library of Medicine: A Platform for Biomedical Discovery and Data-Powered Health," presented by Elizabeth Kittrie, strategic advisor for data and open science at the National Library of Medicine (NLM). She discussed the roles the NLM serves, such as provider of health data for biomedical research and discovery. She also discussed the challenges that arise from the rapid growth of biomedical data, shifting paradigms of data sharing, as well as the role of libraries in providing access to digital health information.
The Privacy session of exclusively full papers followed the keynote. Ghazaleh Beigi, a PhD student at Arizona State University presented: "Securing Social Media User Data - An Adversarial Approach." She showed a privacy vulnerability that arises from the anonymization of social media data by demonstrating an adversarial attack specialized for social media data.
Next, Mizanur Rahman, a PhD student at Florida International University, presented: "Search Rank Fraud De-Anonymization in Online Systems." The bots and automatic methods session with two full paper presentations followed.
Diego Perna, a researcher at the University of Calabria, Italy, presented: "Learning to Rank Social Bots." Given recent reports about the use of bots to spread misinformation/disinformation on the web in order to sway public opinion, Diego Perna proposed a machine-learning framework for identifying and ranking online social network accounts based on their degree similarity to bots.
Next, David Smith, a researcher at the University of Florida, presented: "An Approximately Optimal Bot for Non-Submodular Social Reconnaissance." He noted that studies of how social bots befriend real users in order to collect sensitive information operate on the premise that the likelihood of users accepting bot friend requests is fixed, a premise contradicted by empirical evidence. Subsequently, he presented his work addressing this limitation.
The News session began shortly after a break with a full paper (Best Paper Award) presentation from Lemei Zhang, a PhD candidate from the Norwegian University of Science and Technology: "A Deep Joint Network for Session-based News Recommendations with Contextual Augmentation." She highlighted some of the issues news recommendation systems suffer from, such as the fast update rate of news articles and the lack of user profiles. She then proposed a news recommendation system that combines user click events within sessions and news contextual features to predict the next click behavior of a user.
Next, Lucy Wang, senior data scientist at Buzzfeed, presented a short paper: "Dynamics and Prediction of Clicks on News from Twitter."
Next, Sofiane Abbar, senior software/research engineer at Qatar Computing Research Institute, presented via a YouTube video: "To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content." He proposed a new approach for predicting the popularity of news articles before they are published. The approach is based on observations regarding article similarity and topicality and complements existing content-based methods.

Next, two full papers (Community Detection session) were presented by Ophélie Fraisier and Amin Salehi. Ophélie Fraisier presented: "Stance Classification through Proximity-based Community Detection." She proposed the Sequential Community-based Stance Detection (SCSD) model for stance (online viewpoint) detection. It is a semi-supervised ensemble algorithm which considers multiple signals that inform stance detection. Next, Amin Salehi presented: "Sentiment-driven Community Profiling and Detection on Social Media." He presented a method of profiling social media communities based on their sentiment toward topics and proposed a method of detecting such communities and identifying motives behind their formation.
For pictures and notes complementary to this blogpost see Shubhanshu Mishra's notes.

I would like to thank the organizers of the conference, the hosts, Towson University College of Arts, as well as IMLS for funding our research.
-- Nwala (@acnwala)

2018-07-18: Why We Need Private Web Archives: Almost Two-Thirds of Web Traffic IS NOT Publicly Archivable
In terms of the ability to be archived in public web archives, web pages fall into one of two categories: publicly archivable, or not publicly archivable.

1. Publicly Archivable Web Pages:

These pages are archivable by public archives. The pages can be accessed without login/authentication. In other words, these pages do not reside behind a paywall. Grant Atkins examined paywalls in the Internet Archive for news sites and found that web pages behind paywalls may actually redirect to a login page at crawl time. A good example of a publicly archivable page is Dr. Steven Zeil's page, since no authentication is required to view the page. Furthermore, it does not use client-side scripts (i.e., Ajax) to load additional content, so what you see in the web browser and what you can replay from public web archives are exactly the same.

Screen shot from Dr. Steven Zeil's page captured on 2018-07-02
Memento for Dr. Zeil's page on the Internet Archive captured on 2017-12-02 
Some web pages provide "personalized" content depending on the GeoIP of the requester. In these cases, what you see in the browser and what you can replay from public web archives are nearly the same, except for some minor personalization/GeoIP related changes. For example, a user requesting a prayer-times page from Suffolk, Virginia will see the prayer times for the closest major city (Norfolk, Virginia). On the other hand, when the Internet Archive crawls the page, it sees the prayer times for San Bruno, California. This is likely because the crawling/archiving happens from San Francisco, California. The two pages, otherwise, are exactly the same!

The live version of the page for a user in Suffolk, VA on 2018-07-02
Memento of the page from the Internet Archive captured on 2018-06-22
Some social media sites, like Twitter, are publicly archivable and the Internet Archive captures most of their content. Twitter's home page is personalized, so user-specific contents, like "Who to Follow" and "Trends for you" are not captured, but the tweets are. Also, some Twitter services require authentication.

@twitter live web page
@twitter memento from the Internet Archive captured on 2016-05-18

The archived memento for the @twitter web page shows a message that cookies are used and are important for an enhanced user experience; nevertheless, the main content of the page, the tweets, is preserved (or at least the top-k tweets, since the crawler does not automatically scroll at archive time to activate the Ajax-based pagination; cf. Corren McCoy's "Pagination Considered Harmful to Archiving").

Message from Twitter about cookies use to enhance user experience
Also, deep links to individual tweets are archivable.
Memento for a deep link to a tweet on the Internet Archive captured on 2013-01-18

2. Not Publicly Archivable Web Pages:

In terms of web traffic, search engines are at the top. According to SimilarWeb, Google is number one; its share is 10.10% of the entire web traffic. The Internet Archive crawls it on a regular basis and has over 0.5 million mementos as of 2018-05-01 (cf. Mat Kelly's tech report about the difficulty in counting the number of mementos). The captured mementos are exact copies in appearance, but obviously not functioning search pages.
As of 2018-05-01, the IA has 552,652 mementos of Google's home page
Google memento from May 8th, 1999, on the Internet Archive, replayed on 2018-05-01
It is possible to push a search result page from Google to a public web archive, but that is not how web archives are normally used.
A Google search query for "Machine Learning" on 2018-06-18 archived in
Furthermore, it is not viable for web archives to try to archive search engine result pages (SERPs) because there is an effectively infinite number of possible URIs, due to the infinite number of search queries and syntaxes; even if we preserve a single SERP from June 2018 (as shown above), we are unable to issue new queries against a June 2018 version of Google. Maps and other applications that depend on user interaction are similar: individual pages may be archived, but we typically don't consider the entire application "archived" (cf. Michael Nelson's "Game Walkthroughs As A Metaphor for Web Preservation").

Even when web archives use headless browsers to overcome the Ajax problem, there can be additional challenges. For example, I pushed a Google Maps page with an address in Chesapeake, Virginia to a public web archive, and the result was a page from Google support (in Russian) telling me that I (or, more accurately, the archive's crawler) need to update the browser in order to use Google Maps! While technically not a paywall, this is similar to Grant's study mentioned above in that there is now something in the web archive corresponding to that Google Maps URI, but it does not match users' expectations. It also reveals a clue about the GeoIP of the archive's crawler.
Google Maps page for the address 4940 S Military HWY, Chesapeake, VA 23321, pushed to the archive on 2018-07-02
Memento for the Google Maps page I pushed on 2018-07-02
It is worth mentioning that there are emerging tools like Webrecorder, WARCreate, WAIL, and Memento Tracer for personal web archiving (or community tools in the case of Tracer), but even if/when the Internet Archive replaces Heritrix with Brozzler and resolves the problems with Ajax, its Wayback Machine cannot be expected to have pages requiring authentication, nor pages with effectively infinite inputs like search engines and maps.

Social media pages respond differently when web archives' crawlers try to crawl and archive them. Public web archives might have mementos of some social media pages; however, the sites often require a login before allowing the download of a page's representation. Otherwise, a redirection takes place. Another obstacle facing the archiving of social media pages is their heavy use of client-side scripts that will, for example, fetch new content when the page is scrolled or when comments are hidden/shown with no change in the URI. Facebook, for example, does not allow web archives' crawlers to access the majority of its pages. The Internet Archive's Wayback Machine returned 1,699 mementos for the former president's official Facebook page, but when I opened one of these mementos, it returned the infamous Facebook login-or-register page.
1,699 mementos for the official Facebook page of Mr. Obama, former U.S. president as of 2018-05-01

The memento captured on 2017-02-10 shows the login page of Facebook
There are a few exceptions where the Internet Archive is able to archive some user-contributed Facebook pages.

Memento for a Facebook page in the Internet Archive captured on 2012-03-02
Also, it seems like the archive is using a dummy account ("Nathan") to authenticate, view, and archive some Facebook pages.

Memento for a Facebook page in the archive, captured on 2018-06-21
With the previous exceptions in mind, it is still safe to say that Facebook pages are not publicly archivable.

LinkedIn shares the same behavior as Facebook. The notifications page has 46 mementos as of 2018-05-29, but they are entirely empty. The live page contains notifications about contacts, such as who is having a birthday or a job anniversary, who got a new job, and so on. This page is completely personalized and requires a cookie or login to display user-specific information; therefore, the Internet Archive has no way of downloading its representation.

My account's notifications page on LinkedIn
Memento of LinkedIn's notifications page

The last example I would like to share is Amazon's "yourstore" page. I chose this example because it contains recommended items (another clear example of a personalized web page). The recommendations are based on the user's behavior; in my case, Amazon recommended electronics, automotive tools, and Prime Video.

My Amazon's page (live) on 2018-05-02
As of 2018-05-02, I found 111 mementos for my Amazon "yourstore" page in the Internet Archive, and opened one of them to see what had been captured.

Mementos for Amazon's yourstore page in the Internet Archive on 2018-05-02
As I expected, the page redirected to another page that asks for a login; it returned a 302 response code when it was crawled by the Internet Archive. The actual content of the original page was not archived because the IA crawler does not provide credentials to download it. The representation saved in the Internet Archive is for a resource different from the originally crawled page.

IA crawler was redirected to a login page and captured it instead
Login page captured in the IA instead of the crawled page
There are many web sites with this behavior, so it is safe to assume that for some web sites, even when there are plenty of mementos, they all might return a soft 404.

Estimating the amount of archivable web traffic:

To explore the amount of web traffic that is archivable, I examined the top 100 sites as ranked by Alexa and manually constructed a data set of those 100 sites using traffic analysis services from SimilarWeb and SemRush.

The data was collected on 2018-02-23, and I captured three web traffic measures offered by both websites: total visits, unique visits, and pages/visit.
  • Total visits is the total number of visits, including repeat visits, from the last month.
  • Unique visits is the number of unique visits from the last month.
  • Pages/visit is the average number of pages viewed per visit.
I determined whether or not each website is archivable based on the discussion above, and put it all together in a CSV file to use later as input for my script. Suggestions, feedback, and pull requests are always welcome!
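To give a sense of what such a data set looks like, here is a minimal sketch of loading it with Python's csv module. The column names and numbers below are made up for illustration; they are not the actual columns or values of my data set:

```python
import csv
import io

# Hypothetical layout: one row per site, an archivable flag, and the three
# measures from each traffic-analysis service (columns are illustrative).
sample_csv = """site,archivable,sw_visits,sw_unique,sw_pages_per_visit,sr_visits,sr_unique,sr_pages_per_visit
example-search.com,1,60000000000,2000000000,8.5,55000000000,1900000000,8.1
example-social.com,0,25000000000,1500000000,12.3,24000000000,1400000000,11.9
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    label = "archivable" if row["archivable"] == "1" else "not archivable"
    print(row["site"], label)
```

In the real data set, the file would be read from disk instead of an in-memory string.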

The data set used in the experiment
Using Python 3, I wrote a simple script that calculates the percentage of web traffic that is publicly archivable. I am assuming that the top 100 sites are a good representative of the whole web. I am aware that 100 sites is a small number compared to the 1.8 billion live websites on the Internet, but according to SimilarWeb, the top 100 sites receive 48.86% of the entire traffic on the web, which is consistent with a Pareto distribution. The program offers six different results, each based on a single measure or a combination of measures: total visits, unique visits, and pages/visit. Flags can be set to control which measures are used in the calculation; if no flags are set, the program shows all the results using all three measures and their combination. I came up with the following formula to calculate the percentage of publicly archivable websites based on all three measures combined:
  1. Multiply pages/visit by total visits for each website, from both SimilarWeb and SEMrush
  2. Average the numbers obtained in step 1 from the two sources
  3. Average the unique visits for each website from SimilarWeb and SEMrush
  4. Add the numbers obtained in steps 2 and 3
  5. Sum the number obtained in step 4 over all archivable websites
  6. Sum the number obtained in step 4 over all non-archivable websites
  7. Add the numbers obtained in steps 5 and 6 to get the total
  8. Calculate the percentages of the numbers obtained in steps 5 and 6 relative to the total obtained in step 7
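The eight steps above can be sketched in a few lines of Python 3. This is my reading of the formula, not the actual script, and the two-site records below use made-up numbers purely for illustration:

```python
# Each record carries the archivable flag plus the three measures from
# both services (sw = SimilarWeb, sr = SEMrush; values are invented).
sites = [
    {"archivable": True,  "sw_visits": 100.0, "sw_ppv": 8.0,  "sw_unique": 40.0,
                          "sr_visits": 90.0,  "sr_ppv": 7.0,  "sr_unique": 35.0},
    {"archivable": False, "sw_visits": 200.0, "sw_ppv": 12.0, "sw_unique": 80.0,
                          "sr_visits": 210.0, "sr_ppv": 11.0, "sr_unique": 85.0},
]

def site_weight(s):
    # Steps 1-2: pages/visit times visits, averaged over the two sources.
    pages_times_visits = (s["sw_visits"] * s["sw_ppv"]
                          + s["sr_visits"] * s["sr_ppv"]) / 2
    # Step 3: unique visits averaged over the two sources.
    unique = (s["sw_unique"] + s["sr_unique"]) / 2
    # Step 4: add the two averages.
    return pages_times_visits + unique

# Steps 5-8: sum the weights per group, then express each group as a
# percentage of the grand total.
archivable = sum(site_weight(s) for s in sites if s["archivable"])
non_archivable = sum(site_weight(s) for s in sites if not s["archivable"])
total = archivable + non_archivable
pct_non_archivable = 100 * non_archivable / total
```

Running this same calculation over the full data set is what produces the 65.30% figure below.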
Using all measures, I found that 65.30% of the traffic of the top 100 sites is not archivable by public web archives. The program and the data set are available on GitHub.

Now, it is possible to discuss three different scenarios and compute a range. If the top 100 sites receive 48.86% of all web traffic, and 65.30% of that traffic is not publicly archivable, then:

  1. If all of the remaining web traffic is publicly archivable, then 31.91% of the entire web traffic is not publicly archivable (65.30% × 0.4886 = 31.91%).
  2. If the remaining web traffic is similar to the traffic of the top 100 sites, then 65.30% of the entire web traffic is not publicly archivable.
  3. Finally, if none of the remaining web traffic is publicly archivable, then only 16.95% of the entire web traffic is archivable (34.70% × 0.4886 = 16.95%), meaning that 83.05% of the entire web traffic is not publicly archivable.

So the percentage of web traffic that is not publicly archivable is between 31.91% and 83.05%, and most likely close to 65.30% (the second case).
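The arithmetic behind the three cases reduces to a few lines:

```python
# The two inputs from the experiment: the top 100 sites' share of all web
# traffic (per SimilarWeb) and the non-archivable share of that traffic.
top100_share = 0.4886
not_archivable_top100 = 0.6530

# Case 1: everything outside the top 100 is archivable (lower bound).
lower_bound = not_archivable_top100 * top100_share          # ~0.3191

# Case 2: the rest of the web behaves like the top 100.
middle = not_archivable_top100                              # 0.6530

# Case 3: nothing outside the top 100 is archivable (upper bound).
archivable_overall = (1 - not_archivable_top100) * top100_share  # ~0.1695
upper_bound = 1 - archivable_overall                             # ~0.8305

print(f"{lower_bound:.2%} to {upper_bound:.2%}")
```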

I would like to emphasize that since the top 100 websites are dominated by Google, Bing, Yahoo, etc., and their derivatives, the nature of these top sites is the determining factor in my results. However, given the calculated range, it is safe to say that at least 1/3 of the entire web traffic is not publicly archivable. This percentage underscores the necessity of private web archives. A few tools are available to address this problem: Webrecorder, WARCreate, and WAIL. Public web archiving sites like the Internet Archive and others will never be able to preserve personalized or private web pages such as emails, bank accounts, etc.

Take Away Message:

Personal web archiving is crucial because at least 31.91% of the entire web traffic is not archivable by public web archives. This is due to the increased use of personalized/private web pages and of technologies that hinder the ability of web archives' crawlers to crawl and archive these pages. The experiment shows that the percentage of web traffic that is not publicly archivable can be as high as 83.05%, but the more likely case is that around 65% of web traffic is not publicly archivable. Unfortunately, no matter how good public web archives get at capturing web pages, there will always be a significant number of web pages that are not publicly archivable. This emphasizes the need for personal web archiving tools, such as Webrecorder, WARCreate, and WAIL, possibly combined with a collaboratively maintained repository of how to interact with complex sites, as introduced by Memento Tracer. Even if Ajax-related web archiving problems were eliminated, no less than 1/3 of web traffic goes to sites that will otherwise never appear in public web archives.

Hussam Hallak