Friday, February 8, 2019

2019-02-08: Google+ Is Being Shuttered, Have We Preserved Enough of It?


In 2017 I reviewed many storytelling, curation, and social media tools to determine which might be a suitable replacement for Storify, should the need arise. Google+ was one of the tools under consideration. It did not make the list of top three, but it did produce quality surrogates.

On January 31, 2019, Sean Keane of CNET published an article indicating that the consumer version of Google+ will shut down on April 2, 2019. I knew that the social media service was being shut down in August, but I was surprised to see the new date. Google's blog mentions that they changed the deadline on December 10, 2018, for security reasons. David Rosenthal's recent blog post cites Google+ as yet another example of Google moving up a service decommission date.


This blog post is long because I am trying to answer several useful questions for would-be Google+ archivists. Here are the main bullet points:
  • End users can create a Google Takeout archive of their Google+ content. The pages from the archive do not use the familiar Google+ stylesheets. The archive only includes images that you explicitly posted to Google+.
  • Google+ pages load more content when a user scrolls. Webrecorder.io is the only web archiving tool that I know of that can capture this content.
  • Google+ consists of mostly empty, unused profiles. We can detect empty, unused profiles by page size. Profile pages less than 568kB are likely empty.
  • The robots.txt for plus.google.com does not block web archives.
  • Even when only considering estimates of active profiles, I estimate that less than 1% of Google+ is archived in either the Internet Archive or Archive.today.
  • I sampled some Google+ mementos from the Internet Archive and found a mean Memento Damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.
Google+ will join the long list of shuttered Web platforms. Verizon will be shuttering some former AOL and Yahoo! services in the next year. Here are some more recent examples:
Sometimes the service is not shuttered, but large swaths of content are removed, such as with Tumblr's recent crackdown on porn blogs, and Flickr's mass deletion of the photos of non-paying users.

The content of these services represents serious value for historians. Thus Geocities, Vine, and Tumblr were the targets of concerted hard-won archiving efforts.

Google launched Google+ in 2011. Writers have been declaring Google+ dead since its launch. Google+ has been unsuccessful for many reasons. Here are some mentioned in the news over the years:
As seen below, Google+ still has active users. I lost interest in 2016, but WS-DL Member Sawood Alam, Dave Matthews Band, and Marvel Entertainment still post content to the social network. Barack Obama did not last as long as I did, with his last post in 2015.

I stopped posting to Google+ in 2016.
WS-DL member Sawood Alam is a more active Google+ member, having posted 17 weeks ago.

Dave Matthews Band uses Google+ to advertise concerts. Their last post was 1 week ago.

Marvel Entertainment was still posting to Google+ while I was writing this blog post.

Barack Obama lost interest in Google+. His last post was on March 6, 2015.

Back in July of 2018, I analyzed how much of the U.S. Government's AHRQ websites were archived. Google+ is much bigger than those two sites. Google+ allows users to share content with small groups or the public. In this blog post, I focus primarily on public content and current content.

I will use the following Memento terminology in this blog post:
  • original resource - a live web resource
  • memento - an observation, a capture, of a live web resource
  • URI-R - the URI of an original resource
  • URI-M - the URI of a memento
ArchiveTeam has a wiki page devoted to the shutdown of Google+. They list the archiving status as "Not saved yet." As shown below, I have found less than 1% of Google+ pages in the Internet Archive or Archive.today.

In the spirit of my 2017 blog post about saving data from Storify, I cover how one can acquire their own Google+ data. My goal is to provide information for archivists trying to capture the accounts under their care. Finally, in the spirit of the AHRQ post, I discuss how I determined how much of Google+ is probably archived.

Saving Google+ Data

Google Takeout


There are professional services like PageFreezer that specialize in preserving Google+ content for professionals and companies. Here I focus on how individuals might save their content.

Google Takeout allows users to acquire their data from all of Google's services. 


Google provides Google Takeout as a way to download personal content for any of their services. After logging into Google Takeout, it presents you with a list of services. Click "Select None" and then scroll down until you see the Google+ entries.

Select "Google+ Stream" to get the content of your "main stream" (i.e., your posts). There are additional services from which you can download Google+ data. "Google+ Circles" allows you to download vCard-formatted data for your Google+ contacts. "Google+ Communities" allows you to download the content for your communities.

Once you have selected the desired services, click Next. Then click Create Archive on the following page. You will receive an email with a link allowing you to download your archive.

From the email sent by Google, a link to a page like the one in the screenshot allows one to download their data.

The Takeout archive is a zip file that decompresses into a folder containing an HTML file and a set of folders. These HTML pages include your posts, events, information about posts that you +1'd, comments you made on others' posts, poll votes, and photos.

Note that the actual files of some of these images are not part of this archive. It does include your profile pictures and pictures that you uploaded to posts. Images from any Google+ albums you created are also available. With a few exceptions, references to images from within the HTML files in the archive are all absolute URIs pointing to googleusercontent.com.  They will no longer function if googleusercontent.com is shut down. Anyone trying to use this Google Takeout archive will need to do some additional crawling for the missing image content.
Google Takeout (right) does not save some formatting elements in your Google+ posts (left). The image, in this case, was included in my Google Takeout download because it is one that I posted to the service.
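
To plan the additional crawling mentioned above, one could first list the remote image URIs that the Takeout HTML files reference. The following is a minimal sketch, not part of Google Takeout itself. The function names and the folder name in the usage note are hypothetical, and it assumes the images appear in ordinary img tags with absolute src values.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def remote_image_uris(html_text, host_suffix="googleusercontent.com"):
    """Return image URIs in html_text whose host is host_suffix or a subdomain of it."""
    collector = ImageSrcCollector()
    collector.feed(html_text)
    found = []
    for src in collector.sources:
        host = urlparse(src).hostname or ""
        if host == host_suffix or host.endswith("." + host_suffix):
            found.append(src)
    return found


def scan_takeout_folder(folder):
    """Return a sorted list of unique remote image URIs from all HTML files under folder."""
    uris = set()
    for path in Path(folder).rglob("*.html"):
        text = path.read_text(encoding="utf-8", errors="replace")
        uris.update(remote_image_uris(text))
    return sorted(uris)
```

Calling scan_takeout_folder("Takeout") on an unzipped archive (the folder name is an assumption) would produce a list of URIs that could then be handed to a crawler or to one of the archiving tools discussed later in this post.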

Webrecorder.io


One could use webrecorder.io to preserve their profile pages. Webrecorder saves web content to WARCs for use in many web archive playback tools. I chose Webrecorder because Google+ pages require scrolling to load all content, and scrolling is a feature with which Webrecorder assists.

A screenshot of my public Google+ profile replayed in Webrecorder.io.
One of Webrecorder's strengths is the ability to authenticate to services. We should be able to use this authentication ability to capture private Google+ data.

I tried saving it using my native Firefox, but that did not work well. Unfortunately, as shown below, sometimes Google's cookie infrastructure got in the way of authenticating with Google from within Webrecorder.io.

In Firefox, Google does not allow me to log into my Google+ account via Webrecorder.io.

I recommend changing the internal Webrecorder.io browser to Chrome to preserve your profile page. I tried to patch the recording a few times to capture all of the JavaScript and images. Even in these cases, I was unable to record all private posts. If someone else has better luck with Webrecorder and their private data, please indicate how you got it to work in the comments.

Other Web Archiving Tools

The following methods only work on your public Google+ pages. Google+ supports a robots.txt that does not block web archives.

The robots.txt for plus.google.com as of February 5, 2019, is shown below:



You can manually browse through each of your Google+ pages and save them to multiple web archives using the Mink Chrome Extension. The screenshots below show it in action saving my public Google+ profile.

The Mink Chrome Extension in action, click to enlarge. Click the Mink icon to show the banner (far left), and then click on the "Archive Page To..." button (center left). From there choose the archive to which you wish to save the page (center right), or select "All Three Archives" to save to multiple archives. The far right displays a WebCite memento of my profile saved using this process.
Archive.is and the Internet Archive both support forms where you can insert a URI and have it saved. Using the URIs of your Google+ public profile, collections, and other content, manually submit them to these forms and the content will be saved.

The Internet Archive (left) has a Save Page Now form as part of the Wayback Machine.
archive.today (right) has similar functionality on its front page.
My Google+ profile saved using the Internet Archive's Save Page Now form.
If you have the URIs of your Google+ profile, collections, communities, posts, photos, and so on in a file and wish to push them to web archives automatically, submit them to the ArchiveNow tool. ArchiveNow can save them to archive.is, archive.st, the Internet Archive, and WebCite by default. It also provides support for Perma.cc if you have a Perma.cc API key.
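
For readers who only need the Internet Archive, the same file-driven idea can be sketched without ArchiveNow by requesting the Save Page Now address for each URI. This is not ArchiveNow's implementation. It is a minimal sketch that assumes the public https://web.archive.org/save/ endpoint accepts a plain GET for the URI appended to it; the function names are hypothetical, and rate limits or responses may differ in practice.

```python
import time
import urllib.request

SAVE_PAGE_NOW = "https://web.archive.org/save/"


def read_uris(lines):
    """Return non-empty, non-comment lines with surrounding whitespace removed."""
    uris = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            uris.append(line)
    return uris


def save_request_uri(uri_r):
    """Build the Save Page Now request URI for an original resource (URI-R)."""
    return SAVE_PAGE_NOW + uri_r


def submit_all(uris, delay_seconds=10, timeout=120):
    """Request a capture of each URI-R in turn and return (uri, HTTP status) pairs."""
    results = []
    for uri_r in uris:
        with urllib.request.urlopen(save_request_uri(uri_r), timeout=timeout) as response:
            results.append((uri_r, response.status))
        time.sleep(delay_seconds)  # be polite to the archive between requests
    return results
```

A typical use would open a text file with one URI per line, pass the file object to read_uris, and then pass the result to submit_all. The delay between requests is an arbitrary choice, not a documented requirement.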

Current Archiving Status of Google+

How Much of Google+ Should Be Archived?

This section is not about making relevance judgments based on the historical importance of specific Google+ pages. A more serious problem exists. Most Google+ profiles are indeed empty. Google made it quite difficult to enroll in Google's services without signing up for Google+ at the same time. At one time, if one wanted a Google account for Gmail, Android, Google Sheets, Hangouts, or a host of other services, they would inadvertently be signed up for Google+. Acquiring an actual count of active users has been difficult because Google reported engagement numbers for all services as if they were for Google+. President Obama, Tyra Banks, and Steven Spielberg have all hosted Google Hangouts. This participation can be misleading, as Google Hangouts and Photos were the features users most often used, and these users may not have maintained a Google+ profile. Again, there are a lot of empty Google+ profiles.

In 2015, Forbes wrote that less than 1% of users (111 million) are genuinely active, citing a study done by Stone Temple Consulting. In 2018, Statistics Brain estimated 295 million active Google+ accounts.

As archivists trying to increase the quality of our captures, we need to detect the empty Google+ profiles. Crawlers start with seeds from sitemaps. I reviewed the robots.txt for plus.google.com and found four sitemaps, one of which focuses on profiles. The sitemap at http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml consists of 50,000 additional sitemaps. Due to the number and size of the files, I did not download them all to get an exact profile count. Each consists of between 67,000 and 68,000 URIs for an estimated total of 3,375,000,000 Google+ profiles.
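
The estimated total can be reproduced with simple arithmetic. The sketch below assumes that the midpoint of the 67,000 to 68,000 range was used as the per-sitemap count; the post does not state this explicitly, but that assumption yields the same figure.

```python
SITEMAP_COUNT = 50_000          # sitemaps listed in profiles-sitemap.xml
URIS_PER_SITEMAP_LOW = 67_000   # lower bound of URIs per sitemap
URIS_PER_SITEMAP_HIGH = 68_000  # upper bound of URIs per sitemap


def estimate_profiles(sitemap_count, low, high):
    """Estimate total profile URIs using the midpoint of the per-sitemap range."""
    midpoint = (low + high) / 2
    return int(sitemap_count * midpoint)


total_profiles = estimate_profiles(SITEMAP_COUNT, URIS_PER_SITEMAP_LOW, URIS_PER_SITEMAP_HIGH)
print(f"{total_profiles:,}")
```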


An example of an "empty" Google+ profile.


How do we detect accounts that were never used, like the one shown above? The sheer number of URIs makes it challenging to perform an extensive lexical analysis in a short amount of time, so I took a random sample of 601 profile page URIs from the sitemap. I chose the sample size using the Sample Size Calculator provided by Qualtrics and verified it with similar calculators provided by SurveyMonkey, Raosoft, the Australian Bureau of Statistics, and Creative Research Systems. This sample size represents a confidence level of 95% and a margin of error of ±4%.
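
The sample size can be checked with a common formula: Cochran's formula with a finite population correction. Whether the calculators named above implement exactly this formula is an assumption, as are the z-score of 1.96 for 95% confidence and the conservative proportion of 0.5. The function name is hypothetical.

```python
import math


def sample_size(population, margin_of_error, z_score=1.96, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Cochran's formula for an effectively infinite population
    n0 = (z_score ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    # Adjust for the finite population size
    adjusted = n0 / (1 + (n0 - 1) / population)
    return math.ceil(adjusted)


# Population: the estimated 3,375,000,000 profile URIs; margin of error: 4%
print(sample_size(3_375_000_000, 0.04))
```

Under these assumptions the formula gives 601 for the profile sitemap. Applied to the 83,162 URI-Rs reported by the Internet Archive with a ±2% margin of error, it gives the 2,334 used later in this post.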

Detecting unused and empty profiles is similar to the off-topic problem that I tackled for web archive collections, and it turns out that the size of the page is a good indicator of a profile being unused.  I attempted to download all 601 URIs with wget, but 18 returned a 404 status. A manual review of this sample indicated that profiles of size 568kB or higher contain at least one post. Anyone attempting to detect an empty Google+ profile can issue an HTTP HEAD and record the byte size from the Content-Length header. If the byte size is less than 568kB, then the page likely represents an empty profile and can be ignored.
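
The HEAD-based check described above might look like the following minimal sketch. The function names are hypothetical. It assumes that the server reports a Content-Length header, which is not guaranteed, and that kB in this post means 1,024 bytes; the latter is an inference from the post's own figures, where a 720,352 byte profile is described as 703kB.

```python
import urllib.request

BYTES_PER_KB = 1024  # assumption inferred from the post's byte and kB figures
EMPTY_PROFILE_THRESHOLD = 568 * BYTES_PER_KB  # 568kB expressed in bytes


def content_length(uri, timeout=30):
    """Issue an HTTP HEAD request and return Content-Length in bytes, or None if absent."""
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def is_likely_empty(size_in_bytes, threshold=EMPTY_PROFILE_THRESHOLD):
    """Return True when a profile page is smaller than the empty-profile threshold."""
    return size_in_bytes < threshold
```

A crawler could call content_length on each sitemap URI and skip those for which is_likely_empty returns True, falling back to a full download when no Content-Length is reported.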

One could automate this detection using a tool like curl. Below we see a command that extracts the status line, date, and content-length for an "empty" Google+ profile of 567,342 bytes:




The same command for a different profile URI shows a size of 720,352 bytes for a non-empty Google+ profile:



An example of a 703kB Google+ profile with posts from 3 weeks ago.

An example of Google+ loading more posts in response to user scrolling. Note the partial blue circle on the bottom of the page, indicating that more posts will load.

As seen above, Google+ profiles load more posts on scroll. Profiles at 663kB or greater have filled the first "page" of scrolling. Any Google+ profile larger than this has more posts to view. Unfortunately, crawling tools must execute a scroll event on the page to load this additional content. Web archive recording tools that do not automatically scroll the page will not record this content.
This histogram displays counts of the file sizes of the downloaded Google+ profile pages. Most profiles are empty, hence a significant spike for the bin containing 554kB.
From my sample 57/601 (9.32%) had content larger than 568kB. Only 12/601 (2.00%) had content larger than 663kB, potentially indicating active users. By applying this 2% to the total number of profiles, we estimate that 67.5 million profiles are active. Of course, based on the sample size calculator, my estimate may be off by as much as 4%, leaving an upper estimate of 135 million, which is between the 111 million number from the 2015 Stone Temple study and the 295 million number from the 2018 StatisticsBrain web page. The inconsistencies are likely the result of the sitemap not reporting all profiles for the entire history of Google+ as well as differences in the definition of a profile between these studies.

I looked at various news sources that had linked to Google+ profiles. The profile URIs from the sitemaps do not correspond to those often shared and linked by users. For example, my vanity Google+ profile URI is https://plus.google.com/+ShawnMJones, but it is displayed in the sitemap as a numeric profile URI https://plus.google.com/115814011603595795989. Google+ uses the canonical link relation to link these two URIs but reveals this relation only in the HTML of these pages. For a tool to discover this relationship, it must dereference each sitemap profile URI and download the full page, an expensive discovery process at scale. If Google had placed these relations in the HTTP Link header, then archivists could use an HTTP HEAD to discover the relationship. The additional URI forms make it difficult to use profile URIs from sitemaps alone for analysis.
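
Once a profile page has been downloaded, the canonical relation can be read from its HTML. The sketch below uses only Python's standard library. The class and function names are hypothetical, and the markup in the usage note is an illustrative assumption rather than a copy of what Google+ actually serves.

```python
from html.parser import HTMLParser


class CanonicalLinkFinder(HTMLParser):
    """Record the href of the first <link> element whose rel includes 'canonical'."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URI declared in html_text, or None if there is none."""
    finder = CanonicalLinkFinder()
    finder.feed(html_text)
    return finder.canonical
```

For example, feeding find_canonical a page whose head contains a link element with rel="canonical" and an href of https://plus.google.com/+ShawnMJones would return that href. The cost noted above remains: each page must be fetched in full before it can be parsed.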

The content of the pages found at the vanity and numeric profile URIs is slightly different. Their SHA256 hashes do not match. A review in vimdiff indicates that the differences are self-referential identifiers in the HTML (i.e., JavaScript variables containing +ShawnMJones vs. 115814011603595795989), a nonce that is calculated by Google+ and inserted into the content when it serves each page, and some additional JavaScript. Visually they look the same when rendered.

How much of Google+ is archived?


The lack of easy canonicalization of profile URIs makes it challenging to use the URIs found in sitemaps for web archive analysis. I chose instead to evaluate the holdings reported by two existing web archives.

For comparison, I used numbers from the sitemaps downloaded directly from plus.google.com.
I use these totals for comparison in the following sections.
Internet Archive Search Engine Result Pages
I turned to the Internet Archive to understand how many Google+ pages exist in its holdings. I downloaded the data file used in the AJAX call that produces the page shown in the screenshot below.

The Internet Archive reports 83,162 URI-Rs captured for plus.google.com.

The Internet Archive reports 83,162 URI-Rs captured. As shown in the table below, I further analyzed the data file and broke it into profiles, posts, communities, collections, and other by URI.

Category      # in Internet Archive   % of Total from Sitemap
Collections   1                       0.00000572%
Communities   0                       0%
Posts         12,946                  Not reported in sitemap
Profiles      65,000                  0.00193%
Topics        0                       0%
Other         5,217                   Not reported in sitemap

The archived profile page URIs are both of the vanity and numeric types. Without dereferencing each, it is difficult to determine how much overlap exists. Assuming no overlap, the Internet Archive possesses 65,000 profile pages, which is far less than 1% of 3 billion profiles and 0.0481% of our estimate of 135 million active profiles from the previous section.
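
The percentages in this paragraph follow from the totals estimated earlier in this post. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
TOTAL_PROFILES = 3_375_000_000      # estimated profile URIs in the sitemap
ACTIVE_UPPER_ESTIMATE = 135_000_000  # upper estimate of active profiles
ARCHIVED_PROFILES = 65_000           # profile URI-Rs found in the Internet Archive


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return part / whole * 100


print(f"{percent_of(ARCHIVED_PROFILES, TOTAL_PROFILES):.5f}% of all profiles")
print(f"{percent_of(ARCHIVED_PROFILES, ACTIVE_UPPER_ESTIMATE):.4f}% of active profiles")
```

Because the vanity and numeric URIs may overlap, these values are upper bounds on the share of distinct profiles captured.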

I randomly sampled 2,334 URI-Rs from this list, corresponding to a confidence level of 95% and a margin of error of ±2%. I downloaded TimeMaps for these URI-Rs and calculated a mean of 67.24 mementos per original resource.
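
Counting mementos in a downloaded TimeMap can be sketched as follows. This assumes TimeMaps in the link format described by the Memento protocol (RFC 7089), with quoted parameter values. The regular expressions and function names are my own illustration, not the script used for the numbers above.

```python
import re
from statistics import mean

# One link-format entry: <URI> followed by zero or more ; name="value" parameters
LINK_ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
REL_PARAM = re.compile(r'rel\s*=\s*"([^"]*)"')


def count_mementos(timemap_text):
    """Count entries whose rel value includes the token 'memento'."""
    count = 0
    for match in LINK_ENTRY.finditer(timemap_text):
        rel = REL_PARAM.search(match.group(2))
        if rel and "memento" in rel.group(1).split():
            count += 1
    return count


def mean_mementos(timemaps):
    """Return the mean memento count over an iterable of TimeMap texts."""
    return mean(count_mementos(text) for text in timemaps)
```

Matching on the whole token "memento" counts entries marked "first memento" or "last memento" while skipping the "original", "self", and "timemap" entries.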
Archive.today Search Engine Result Pages
As shown in the screenshot below, Archive.today also provides a search interface on its web site.

Archive.today reports 2551 URI-Rs captured for plus.google.com.

Archive.today reports 2,551 URI-Rs, but scraping its search pages returns 3,061 URI-Rs. I analyzed the URI-Rs returned from the scraping script to place them into the categories shown in the table below.

Category      # in Archive.today   % of Total from Sitemap
Collections   10                   0.0000572%
Communities   0                    0%
Photos        22                   Not reported in sitemap
Posts         1,994                Not reported in sitemap
Profiles      989                  0.0000293%
Topics        1                    0.248%
Other         45                   Not reported in sitemap


Archive.today contains 989 profiles, a tiny percent of the 3 billion suggested by the sitemap and the 135 million active profile estimate that we generated from the previous section.

Archive.today is Memento-compliant, so I attempted to download TimeMaps for these URI-Rs. For 354 URI-Rs, I received 404s for their TimeMaps, leaving me with 2707 TimeMaps. Using these TimeMaps, I calculated a mean of 1.44 mementos per original resource.

Are these mementos of good quality?

It is not enough for archives merely to contain mementos. Their quality is relevant as well. Crawling web content often results in missing embedded resources such as stylesheets. Fortunately, Justin Brunelle developed an algorithm for scoring the quality of a memento that takes missing embedded resources into account. Erika Siregar developed the Memento Damage tool based on Justin's algorithm so that we can calculate these scores. I used Memento Damage to score the quality of some mementos from the Internet Archive.

The histogram of memento damage scores from our random sample shows that most have a damage score of 0.
Memento damage takes a long time to calculate, so I needed to keep the sample size small. I randomly sampled 383 URI-Rs from the list acquired from the Internet Archive and downloaded their TimeMaps. I acquired a list of 383 URI-Ms by randomly sampling 1 URI-M from each TimeMap. I then fed these URI-Ms into a local instance of the Memento Damage tool. The Memento Damage tool experienced errors for 41 URI-Ms.

This memento has the highest damage score of 0.941 in our sample. The raw size of its base page is 635 kB.


The mean damage score for these mementos is 0.347. A score of 0 indicates no damage. This score may be misleading, however, because more content is loaded via JavaScript when the user scrolls down the page. Most crawling software does not trigger this JavaScript code and hence misses this content.

The screenshot above displays the largest memento in our sample. The base page has a size of 1.3 MB and a damage score of 0.0. It is not a profile page, but a page for a single post with comments.
The screenshot above displays the smallest memento in our sample with a size greater than zero and no errors while computing damage. This single post page redirects to a page not captured by the Internet Archive. The base page has a size of 71kB and a damage score of 0.516.
The screenshot above displays a memento for a profile page of size 568kB, the lower bound of pages with posts from our earlier live sample. It has a memento damage score of 0.

This histogram displays the file sizes in our memento sample. Note how most have a size between 600kB and 700kB. 

As an alternative to memento damage, I also downloaded the raw memento content of the 383 mementos to examine their sizes.  The HTML has a mean size of 466kB and a median of 500kB. In this sample, we have mementos of posts and other types of pages mixed in. Post pages appear to be smaller. The memento of a profile page shown below still contains posts at 532kB. Mementos for profile pages smaller than this had just a user name and no posts. It is possible that the true lower bound in size is around 532kB.

This memento demonstrates a possible new lower bound in profile size at 532kB. The Internet Archive captured it in January of 2019.

Discussion and Conclusions


Google+ is being shut down on April 2, 2019. What direct evidence will future historians have of its existence? We have less than two months to preserve much of Google+. In this post, I detailed how users might preserve their profiles with Google Takeout, Webrecorder.io, or other web archiving tools.

I mentioned that there are questions about how many active users ever existed on Google+. In Google's attempt to make all of its services "social" it conflated the number of active Google+ users with active users of other Google services. Third-party estimates of active Google+ users over the years have ranged from 111 million to 295 million.  With a sample size of 601 profiles from the profile sitemap at plus.google.com, I estimated that the number might be as high as 135 million.

To archive non-empty Google+ pages, we have to be able to detect pages that are empty. I analyzed a small sample of Google+ profile pages and discovered that pages of size 663kB or larger contain enough posts to fill the first "page" of scrolling. I also discovered that inactive profile pages tend to be less than 568kB. Using the HEAD method of HTTP and the Content-Length header, archivists can use this value to detect unused or rarely used Google+ profiles before downloading their content.

I estimated how much of Google+ exists in public web archives. Scraping URIs from the search result pages of the Internet Archive, the most extensive web archive, reveals only 83,162 URI-Rs for Google+. Archive.today only reveals 2,551 URI-Rs. Both have less than 1% of the totals of different Google+ page categories found in the sitemap. The fact that so few are archived may indicate that few archiving crawls found Google+ profiles because few web pages linked to them.

I sampled some mementos from the Internet Archive and found a mean damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.

Because Google+ uses page scrolling to allow users to load more content, many mementos will likely be of poor quality if recorded outside of tools like Webrecorder.io. With the sheer number of pages to preserve, we may have to choose quantity over quality.

If a sizable sample of those profiles is considered to be valuable to historians, then web archives have much catching up to do.

A concerted effort will be necessary to acquire a significant number of profile pages by April 2, 2019. My recommendations are for users to archive their public profile URIs with ArchiveNow, Mink, or the save page forms at the Internet Archive or Archive.today. Archivists looking to archive Google+ more generally should download the topics sitemap and at least capture the 404 (four hundred four, not 404 status) topics pages using these same tools. Enterprising archivists can search news sources, like this Huffington Post article and this Forbes article, that feature popular and famous Google+ users. Sadly, because of the lack of links, much of the data from these articles is not in a machine-readable form.  A Google+ archivist would need to search Google+ for these profile page URIs manually. Once that is done, the archivist can then save these URIs using the tools mentioned above.

Due to its lower usage compared to other social networks and its controversial history, some may ask "Is Google+ worth archiving?" Only future historians will know, and by then it will be too late, so we must act now.

-- Shawn M. Jones
