, should the need arise.
was one of the tools under consideration. It did not make the list of top three, but it did produce quality
.
indicating that the consumer version of Google+ will shut down on April 2, 2019. I knew that the social media service was being shut down in August, but I was surprised to see the new date.
mentions that they changed the deadline on December 10, 2018, for security reasons.
cites Google+ as yet another example of Google moving up a service decommission date.
This blog post is long because I am trying to answer several useful questions for would-be Google+ archivists. Here are the main bullet points:
Google+ will join the long list of shuttered Web platforms. Verizon
. Here are some more recent examples:
Sometimes the service is not shuttered, but large swaths of content are removed, such as with
.
The content of these services represents
serious value for historians. Thus
Geocities,
Vine, and
Tumblr were the targets of concerted hard-won archiving efforts.
Google launched Google+
in 2011. Writers have been declaring Google+ dead since its launch. Google+ has been unsuccessful for many reasons. Here are some mentioned in the news over the years:
- Google+ executed a poor rollout of company brand pages, with Google even going so far as to delete brand pages made by companies before deciding that brand pages were allowed.
- It suffered from internal Google personnel and management changes, as well as a lack of vision for the product. Google+ was led by Vic Gundotra, who left Google with no succession plan or future goals for Google+. Pundits cite mismanagement of the product as another potential factor.
- Google+ lacks third-party posting support from social networking tools. It was eliminated as a storytelling tool candidate in my prior blog post because it had no writable API.
- Google separated the tools provided by Google+, such as Hangouts and Photos, from the main Google+ web site. This separation had the effect of driving users away from Google+ as a destination.
- Other Google services integrated Google+ too well, which was confusing for users.
- As noted by Google employees, Google+ was designed to meet Google's requirements and not those of its users. It was designed to keep track of your identity across disconnected Google services. It was not designed to be an online destination in itself, like Facebook or Twitter.
- Circles, a process whereby users could categorize their friends and then control who sees their posts, was confusing for some users.
- Google+ had poor UI design, especially on mobile.
- Google did not understand Facebook's network effect. Many potential Google+ users were already on Facebook. Creating a Facebook clone was not enough to get them to devote time to a new social network.
As seen below, Google+ still has active users. I lost interest in 2016, but WS-DL Member
Sawood Alam, Dave Matthews Band, and Marvel Entertainment still post content to the social network. Barack Obama did not last as long as I did, with his last post in 2015.
|
WS-DL member Sawood Alam is a more active Google+ member, having posted 17 weeks ago. |
|
Barack Obama lost interest in Google+. His last post was on March 6, 2015. |
Back in July of 2018,
I analyzed how much of the U.S. Government's AHRQ websites were archived. Google+ is much bigger than those two sites. Google+ allows users to share content with small groups or the public. In this blog post, I focus primarily on public content and current content.
I will use the following
Memento terminology in this blog post:
- original resource - a live web resource
- memento - an observation, a capture, of a live web resource
- URI-R - the URI of an original resource
- URI-M - the URI of a memento
ArchiveTeam has
a wiki page devoted to the shutdown of Google+. They list the archiving status as "Not saved yet." As shown below, I have found less than 1% of Google+ pages in the Internet Archive or Archive.today.
Update on 2019/03/18 at 16:07 GMT:
ArchiveTeam's archiving status has been updated to "In progress...". According to this article by the Verge, there is a concerted effort now underway by ArchiveTeam and the Internet Archive to archive parts of Google+. There are limitations to web archiving, as only up to 500 comments can be archived per post. To help in these efforts, please read the rest of this post so that you can ensure that your own Google+ data is preserved.
Update on 2019/03/18 at 16:21 GMT:
On Twitter, Sawood Alam has mentioned that this Reddit post has more information on ArchiveTeam's efforts. For live tracking of the process, visit this page.
Update on 2019/03/26 at 12:21 GMT:
The UK Web Archive has been asking for recommendations on UK-based Google+ accounts suitable for preservation. See the tweet below for more information:
Update on 2019/03/27 at 20:08 GMT:
Via Google+, Edward Morbius informed me that the Google+ group Google+ Mass Migration is discussing migrating from Google+. Topics range from preservation of content to migration to other services.
In the spirit of my 2017 blog post about
saving data from Storify, I cover how one can acquire their own Google+ data. My goal is to provide information for archivists trying to capture the accounts under their care. Finally, in the spirit of
the AHRQ post, I discuss how I determined how much of Google+ is probably archived.
Saving Google+ Data
Google Takeout
There are professional services like
PageFreezer that specialize in preserving Google+ content for professionals and companies. Here I focus on how individuals might save their content.
|
Google Takeout allows users to acquire their data from all of Google's services. |
Google provides
Google Takeout as a way to download personal content for any of their services. After logging into Google Takeout, it presents you with a list of services. Click "Select None" and then scroll down until you see the Google+ entries.
Select "Google+ Stream" to get the content of your "main stream" (i.e., your posts). There are additional services from which you can download Google+ data. "Google+ Circles" allows you to download vCard-formatted data for your Google+ contacts. "Google+ Communities" allows you to download the content for your communities.
Once you have selected the desired services, click
Next. Then click
Create Archive on the following page. You will receive an email with a link allowing you to download your archive.
|
From the email sent by Google, a link to a page like the one in the screenshot allows one to download their data. |
The Takeout archive is a zip file that decompresses into a folder containing an HTML file and a set of folders. These HTML pages include your posts, events, information about posts that you +1'd, comments you made on others' posts, poll votes, and photos.
Note that the actual files of some of these images are not part of this archive. It does include your profile pictures and pictures that you uploaded to posts. Images from any Google+ albums you created are also available. With a few exceptions, references to images from within the HTML files in the archive are all absolute URIs pointing to googleusercontent.com. They will no longer function if googleusercontent.com is shut down. Anyone trying to use this Google Takeout archive will need to do some additional crawling for the missing image content.
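As a rough illustration of the additional crawling step, the sketch below scans the text of one Takeout HTML file and lists the absolute image URIs that point to googleusercontent.com, so that they can be handed to a crawler. The class and function names and the sample HTML are hypothetical, and the sketch assumes the images appear as ordinary `<img src="...">` elements.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageURICollector(HTMLParser):
    """Collect absolute <img src> URIs hosted under googleusercontent.com."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Relative references have no hostname and are skipped.
        host = urlparse(src).hostname or ""
        if host == "googleusercontent.com" or host.endswith(".googleusercontent.com"):
            self.uris.append(src)


def remote_image_uris(html_text):
    """Return the googleusercontent.com image URIs found in html_text."""
    collector = ImageURICollector()
    collector.feed(html_text)
    return collector.uris


# Hypothetical snippet standing in for one HTML file from a Takeout archive.
sample = (
    '<div><img src="https://lh3.googleusercontent.com/example-image">'
    '<img src="local/photo.jpg"></div>'
)
print(remote_image_uris(sample))
```

The resulting list could then be downloaded or submitted to a web archive before the hosting service changes.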
|
Google Takeout (right) does not save some formatting elements in your Google+ posts (left). The image, in this case, was included in my Google Takeout download because it is one that I posted to the service. |
Webrecorder.io
One could use webrecorder.io to preserve their profile pages. Webrecorder saves web content to WARCs for use in many web archive playback tools. I chose Webrecorder because Google+ pages require scrolling to load all content, and scrolling is a feature with which Webrecorder assists.
|
A screenshot of my public Google+ profile replayed in Webrecorder.io. |
One of Webrecorder's strengths is the ability to authenticate to services. We should be able to use this authentication ability to capture private Google+ data.
I tried saving my profile using my native Firefox, but that did not work well. Unfortunately, as shown below, sometimes Google's cookie infrastructure got in the way of authenticating with Google from within Webrecorder.io.
|
In Firefox, Google does not allow me to log into my Google+ account via Webrecorder.io. |
I recommend changing the internal Webrecorder.io browser to Chrome to preserve your profile page. I tried to patch the recording a few times to capture all of the JavaScript and images. Even in these cases, I was unable to record all private posts.
If someone else has better luck with Webrecorder and their private data, please indicate how you got it to work in the comments.
Other Web Archiving Tools
The following methods only work on your public Google+ pages. Google+ supports a robots.txt that
does not block web archives.
The robots.txt for plus.google.com as of February 5, 2019, is shown below:
You can manually browse through each of your Google+ pages and save them to multiple web archives using the
Mink Chrome Extension. The screenshots below show it in action saving my public Google+ profile.
|
The Mink Chrome Extension in action, click to enlarge. Click the Mink icon to show the banner (far left), and then click on the "Archive Page To..." button (center left). From there choose the archive to which you wish to save the page (center right), or select "All Three Archives" to save to multiple archives. The far right displays a WebCite memento of my profile saved using this process. |
Archive.is and the Internet Archive both support forms where you can insert a URI and have it saved. Using the URIs of your Google+ public profile, collections, and other content, manually submit them to these forms and the content will be saved.
|
The Internet Archive (left) has a Save Page Now form as part of the Wayback Machine.
archive.today (right) has similar functionality on its front page. |
If you have all of your Google+ profile, collection, community, post, photo, and so on URIs in a file and wish to push them to web archives automatically, submit them to the
ArchiveNow tool. ArchiveNow can save them to archive.is, archive.st, the Internet Archive, and WebCite by default. It also provides support for Perma.cc if you have a Perma.cc API key.
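For readers who want to see the general shape of pushing a file of URIs to an archive, here is a minimal sketch. It is not ArchiveNow; it only targets the Internet Archive's Save Page Now endpoint (`https://web.archive.org/save/` followed by the URI). The function names, the comment-line convention for the input file, and the User-Agent string are assumptions made for illustration.

```python
import urllib.request

SAVE_PREFIX = "https://web.archive.org/save/"


def read_uris(lines):
    """Return non-blank, non-comment lines with surrounding whitespace removed."""
    cleaned = (line.strip() for line in lines)
    return [line for line in cleaned if line and not line.startswith("#")]


def save_page_now_url(uri):
    """Build the Save Page Now request URL for one original resource."""
    return SAVE_PREFIX + uri.strip()


def submit(uri, timeout=60):
    """Ask the Internet Archive to capture uri and return the HTTP status code."""
    request = urllib.request.Request(
        save_page_now_url(uri),
        headers={"User-Agent": "gplus-archiving-sketch"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example input, as it might appear in a file with one URI per line.
uris = read_uris(["# my public profile", "https://plus.google.com/+ShawnMJones", ""])
for uri in uris:
    print(save_page_now_url(uri))  # call submit(uri) to actually request a capture
```

A real run should pause between submissions, since archives may throttle rapid requests.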
Current Archiving Status of Google+
How Much of Google+ Should Be Archived?
This section is not about making relevance judgments based on the historical importance of specific Google+ pages. A more serious problem exists. Most Google+ profiles are indeed empty. Google made it quite difficult to enroll in Google's services without signing up for Google+ at the same time. At one time, if one wanted a Google account for Gmail, Android, Google Sheets, Hangouts, or a host of other services,
they would inadvertently be signed up for Google+. Acquiring an actual count of active users has been difficult because
Google reported engagement numbers for all services as if they were for Google+.
President Obama,
Tyra Banks, and
Steven Spielberg have all hosted Google Hangouts. This participation can be misleading, as Google Hangouts and Photos were features most often used by users, and these users may not have maintained a Google+ profile. Again, there are a lot of empty Google+ profiles.
In 2015,
Forbes wrote that less than 1% of users (111 million) are genuinely active, citing
a study done by Stone Temple Consulting. In 2018,
Statistics Brain estimated 295 million active Google+ accounts.
As archivists trying to increase the quality of our captures, we need to detect the empty Google+ profiles. Crawlers start with seeds from
sitemaps. I reviewed the robots.txt for plus.google.com and found four sitemaps, one of which focuses on profiles. The sitemap at
http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml consists of 50,000 additional sitemaps. Due to the number and size of the files, I did not download them all to get an exact profile count. Each consists of between 67,000 and 68,000 URIs, for an estimated total of 3,375,000,000 Google+ profiles (50,000 sitemaps at roughly 67,500 URIs each).
How do we detect accounts that were never used, like the one shown above? The sheer number of URIs makes it challenging to perform an extensive lexical analysis in a short amount of time, so I took a random sample of 601 profile page URIs from the sitemap. I chose the sample size using the
Sample Size Calculator provided by Qualtrics and verified it with similar calculators provided by
SurveyMonkey,
Raosoft,
Australian Bureau of Statistics, and
Creative Research Systems. These sample sizes represent a confidence level of 95% and a margin of error of ±4%.
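The sample size of 601 can be reproduced with the standard formula for estimating a proportion, with a finite population correction. The sketch below assumes a z-score of 1.96 for the 95% confidence level and the most conservative proportion of 0.5; the calculators listed above may use slightly different internals, so treat this as an approximation of what they do.

```python
import math


def sample_size(population, margin_of_error, z=1.96, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Size needed for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    # Shrink the sample slightly when the population is finite.
    adjusted = n0 / (1 + (n0 - 1) / population)
    return math.ceil(adjusted)


# Estimated 3,375,000,000 profiles, margin of error of 4%.
print(sample_size(3_375_000_000, 0.04))  # prints 601
```

With a population this large the correction has almost no effect, so the result is driven by the confidence level and the margin of error.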
Detecting unused and empty profiles is
similar to the off-topic problem that I tackled for web archive collections, and it turns out that
the size of the page is a good indicator of a profile being unused. I attempted to download all 601 URIs with
wget, but 18 returned a 404 status. A manual review of this sample indicated that profiles of size 568kB or higher contain at least one post. Anyone attempting to detect an empty Google+ profile can issue an HTTP
HEAD and record the byte size from the
Content-Length header. If the byte size is less than 568kB, then the page likely represents an empty profile and can be ignored.
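A hedged Python sketch of the same check follows. It assumes that the "kB" figures in this post mean 1024 bytes (720,352 bytes is described as 703kB) and that the server returns a `Content-Length` header in response to `HEAD`. The function names are illustrative.

```python
import urllib.request

# Assumption: 1 kB = 1024 bytes, inferred from the byte sizes reported in this post.
EMPTY_THRESHOLD_BYTES = 568 * 1024


def content_length(uri, timeout=30):
    """Issue an HTTP HEAD request and return Content-Length as an int, or None."""
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def looks_empty(size_in_bytes):
    """True when a profile page is smaller than the 568kB lower bound for posts."""
    return size_in_bytes < EMPTY_THRESHOLD_BYTES


# Byte sizes reported in this post for an "empty" and a non-empty profile.
for size in (567_342, 720_352):
    print(size, "likely empty" if looks_empty(size) else "likely has posts")
```

Calling `content_length()` on a profile URI and passing the result to `looks_empty()` would let a crawler skip unused profiles without downloading the page body.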
One could automate this detection using a tool like curl. Below we see a command that extracts the status line, date, and content-length for an "empty" Google+ profile of 567,342 bytes:
The same command for a different profile URI shows a size of 720,352 bytes for a non-empty Google+ profile:
|
An example of a 703kB Google+ profile with posts from 3 weeks ago. |
|
An example of Google+ loading more posts in response to user scrolling. Note the partial blue circle on the bottom of the page, indicating that more posts will load. |
As seen above, Google+ profiles load more posts on scroll. Profiles at 663kB or greater have filled the first "page" of scrolling. Any Google+ profile larger than this has more posts to view. Unfortunately, crawling tools must execute a scroll event on the page to load this additional content. Web archive recording tools that do not automatically scroll the page will not record this content.
|
This histogram displays counts of the file sizes of the downloaded Google+ profile pages. Most profiles are empty, hence a significant spike for the bin containing 554kB. |
From my sample 57/601 (9.32%) had content larger than 568kB. Only 12/601 (2.00%) had content larger than 663kB, potentially indicating active users. By applying this 2% to the total number of profiles, we estimate that 67.5 million profiles are active. Of course, based on the sample size calculator, my estimate may be off by as much as 4%, leaving an upper estimate of 135 million, which is between the 111 million number from the 2015 Stone Temple study and the 295 million number from the 2018 StatisticsBrain web page. The inconsistencies are likely the result of the sitemap not reporting all profiles for the entire history of Google+ as well as differences in the definition of a profile between these studies.
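The arithmetic behind these estimates can be checked with a few lines of Python. The midpoint of 67,500 URIs per sitemap is an assumption, and reading the 135 million upper estimate as 4% of all profiles is my interpretation of the calculation described above.

```python
SITEMAP_FILES = 50_000
URIS_PER_SITEMAP = 67_500  # assumed midpoint of the 67,000 to 68,000 range

total_profiles = SITEMAP_FILES * URIS_PER_SITEMAP


def share_of_profiles(percent):
    """Number of profiles corresponding to a whole-number percentage of the total."""
    return total_profiles * percent // 100


# 12 of the 601 sampled profiles exceeded 663kB, which rounds to 2%.
active_percent = round(100 * 12 / 601)

print(f"{total_profiles:,} profiles in the sitemap")
print(f"{share_of_profiles(active_percent):,} estimated active profiles")
print(f"{share_of_profiles(4):,} upper estimate at 4%")
```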
I looked at various news sources that had linked to Google+ profiles. The profile URIs from the sitemaps do not correspond to those often shared and linked by users. For example, my
vanity Google+ profile URI is https://plus.google.com/+ShawnMJones, but it is displayed in the sitemap as a
numeric profile URI https://plus.google.com/115814011603595795989. Google+ uses the
canonical link relation to link these two URIs but reveals this relation in the HTML of these pages. For a tool to discover this relationship, it must dereference each sitemap profile URI, an expensive discovery process at scale. If Google had placed these relations in the HTTP Link header, then archivists could use an HTTP HEAD to discover the relationship. The additional URI forms make it difficult to use profile URIs from sitemaps alone for analysis.
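To make the cost concrete, discovering the relation means downloading each page and parsing its HTML, roughly as in the sketch below. The sample markup is hypothetical, and it assumes the vanity URI is the target of the canonical link, which the text above does not state.

```python
from html.parser import HTMLParser


class CanonicalLinkFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def canonical_uri(html_text):
    """Return the canonical URI declared in html_text, or None if absent."""
    finder = CanonicalLinkFinder()
    finder.feed(html_text)
    return finder.canonical


# Hypothetical markup standing in for the page at the numeric profile URI.
sample = (
    "<html><head>"
    '<link rel="canonical" href="https://plus.google.com/+ShawnMJones">'
    "</head><body></body></html>"
)
print(canonical_uri(sample))
```

Each call requires the full page body, which at more than 500kB per profile is what makes this approach expensive across billions of URIs.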
The content of the pages found at the vanity and numeric profile URIs is slightly different. Their SHA256 hashes do not match. A review in vimdiff indicates that the differences are self-referential identifiers in the HTML (i.e., JavaScript variables containing +ShawnMJones vs. 115814011603595795989), a
nonce that is calculated by Google+ and inserted into the content when it serves each page, and some additional JavaScript. Visually they look the same when rendered.
How much of Google+ is archived?
The lack of easy canonicalization of profile URIs makes it challenging to use the URIs found in sitemaps for web archive analysis. I chose instead to evaluate the holdings reported by two existing web archives.
For comparison, I used numbers from the sitemaps downloaded directly from plus.google.com.
I use these totals for comparison in the following sections.
Internet Archive Search Engine Result Pages
I turned to the Internet Archive to understand how many Google+ pages exist in its holdings. I downloaded the data file used in the AJAX call that produces the page shown in the screenshot below.
The Internet Archive reports 83,162 URI-Rs captured. Shown in the table below, I further analyzed the data file and broke it into profiles, posts, communities, collections, and other by URI.
Category | # in Internet Archive | % of Total from Sitemap |
Collections | 1 | 0.00000572% |
Communities | 0 | 0% |
Posts | 12,946 | Not reported in sitemap |
Profiles | 65,000 | 0.00193% |
Topics | 0 | 0% |
Other | 5,217 | Not reported in sitemap |
The archived profile page URIs are both of the vanity and numeric types. Without dereferencing each, it is difficult to determine how much overlap exists. Assuming no overlap, the Internet Archive possesses 65,000 profile pages, which is far less than 1% of 3 billion profiles and 0.0481% of our estimate of 135 million active profiles from the previous section.
I randomly sampled 2,334 URI-Rs from this list, corresponding to a confidence level of 95% and a margin of error of ±2%. I downloaded TimeMaps for these URI-Rs and calculated a mean of 67.24 mementos per original resource.
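The mean can be computed by counting the memento entries in each TimeMap. The sketch below assumes link-format TimeMaps, in which each memento entry carries a `rel` value containing the token `memento`. The sample TimeMap and its archive.example URIs are hypothetical.

```python
import re
from statistics import mean

REL_PATTERN = re.compile(r'rel="([^"]*)"')


def count_mementos(timemap_text):
    """Count entries whose rel value includes the token 'memento'."""
    count = 0
    for rel_value in REL_PATTERN.findall(timemap_text):
        # Matches "memento", "first memento", and "last memento",
        # but not "original", "self", "timemap", or "timegate".
        if "memento" in rel_value.split():
            count += 1
    return count


# Hypothetical link-format TimeMap with two mementos.
sample_timemap = """\
<https://plus.google.com/+Example>; rel="original",
<http://archive.example/timemap/link/https://plus.google.com/+Example>; rel="self"; type="application/link-format",
<http://archive.example/20180101000000/https://plus.google.com/+Example>; rel="first memento"; datetime="Mon, 01 Jan 2018 00:00:00 GMT",
<http://archive.example/20190101000000/https://plus.google.com/+Example>; rel="last memento"; datetime="Tue, 01 Jan 2019 00:00:00 GMT"
"""

counts = [count_mementos(sample_timemap)]
print(mean(counts))
```

In practice `counts` would hold one entry per downloaded TimeMap, and the mean of that list gives the mementos per original resource.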
Archive.today Search Engine Result Pages
As shown in the screenshot below, Archive.today also provides a search interface on its web site.
Archive.today reports 2,551 URI-Rs, but scraping its search pages returns 3,061 URI-Rs. I analyzed the URI-Rs returned from the scraping script to place them into the categories shown in the table below.
Category | # in Archive.today | % of Total from Sitemap |
Collections | 10 | 0.0000572% |
Communities | 0 | 0% |
Photos | 22 | Not reported in sitemap |
Posts | 1,994 | Not reported in sitemap |
Profiles | 989 | 0.0000293% |
Topics | 1 | 0.248% |
Other | 45 | Not reported in sitemap |
Archive.today contains 989 profiles, a tiny percent of the 3 billion suggested by the sitemap and the 135 million active profile estimate that we generated from the previous section.
Archive.today is Memento-compliant, so I attempted to download TimeMaps for these URI-Rs. For 354 URI-Rs, I received 404s for their TimeMaps, leaving me with 2707 TimeMaps. Using these TimeMaps, I calculated a mean of 1.44 mementos per original resource.
Are these mementos of good quality?
It is not enough for archives merely to contain mementos. Their quality matters as well. Crawling web content often results in missing embedded resources such as stylesheets. Fortunately, Justin Brunelle
developed an algorithm for scoring the quality of a memento that takes missing embedded resources into account.
Erika Siregar developed
the Memento Damage tool based on Justin's algorithm so that we can calculate these scores. I used the Memento Damage tool to score the quality of some mementos from the Internet Archive.
|
The histogram of memento damage scores from our random sample shows that most have a damage score of 0. |
Memento damage takes a long time to calculate, so I needed to keep the sample size small. I randomly sampled 383 URI-Rs from the list acquired from the Internet Archive and downloaded their TimeMaps. I acquired a list of 383 URI-Ms by randomly sampling 1 URI-M from each TimeMap. I then fed these URI-Ms into a local instance of the Memento Damage tool. The Memento Damage tool experienced errors for 41 URI-Ms.
|
This memento has the highest damage score of 0.941 in our sample. The raw size of its base page is 635 kB. |
The mean damage score for these mementos is 0.347. A score of 0 indicates no damage. This score may be misleading, however, because more content is loaded via JavaScript when the user scrolls down the page. Most crawling software does not trigger this JavaScript code and hence misses this content.
|
The screenshot above displays the largest memento in our sample. The base page has a size of 1.3 MB and a damage score of 0.0. It is not a profile page, but a page for a single post with comments. |
|
The screenshot above displays the smallest memento in our sample with a size greater than zero and no errors while computing damage. This single post page redirects to a page not captured by the Internet Archive. The base page has a size of 71kB and a damage score of 0.516. |
|
|
The screenshot above displays a memento for a profile page of size 568kB, the lower bound of pages with posts from our earlier live sample. It has a memento damage score of 0. |
|
This histogram displays the file sizes in our memento sample. Note how most have a size between 600kB and 700kB. |
As an alternative to memento damage, I also downloaded the
raw memento content of the 383 mementos to examine their sizes. The HTML has a mean size of 466kB and a median of 500kB. In this sample, we have mementos of posts and other types of pages mixed in. Post pages appear to be smaller. The memento of a profile page shown below still contains posts at 532kB. Mementos for profile pages smaller than this had just a user name and no posts. It is possible that the true lower bound in size is around 532kB.
|
This memento demonstrates a possible new lower bound in profile size at 532kB. The Internet Archive captured it in January of 2019. |
Discussion and Conclusions
Google+ is being shut down on April 2, 2019. What direct evidence will future historians have of its existence? We have less than two months to preserve much of Google+. In this post, I detailed how users might preserve their profiles with Google Takeout, Webrecorder.io, and other web archiving tools.
I mentioned that there are questions about how many active users ever existed on Google+. In Google's attempt to make all of its services "social" it conflated the number of active Google+ users with active users of other Google services. Third-party estimates of active Google+ users over the years have ranged from 111 million to 295 million. With a sample size of 601 profiles from the profile sitemap at plus.google.com, I estimated that the number might be as high as 135 million.
To archive non-empty Google+ pages, we have to be able to detect pages that are empty. I analyzed a small sample of Google+ profile pages and discovered that pages of size 663kB or larger contain enough posts to fill the first "page" of scrolling. I also discovered that inactive profile pages tend to be less than 568kB. Using the
HEAD method of HTTP and the
Content-Length header, archivists can use this value to detect unused or rarely used Google+ profiles before downloading their content.
I estimated how much of Google+ exists in public web archives. Scraping URIs from the search result pages of the Internet Archive, the most extensive web archive, reveals only 83,162 URI-Rs for Google+. Archive.today only reveals 2,551 URI-Rs. Both have less than 1% of the totals of different Google+ page categories found in the sitemap. The fact that so few are archived may indicate that few archiving crawls found Google+ profiles because few web pages linked to them.
I sampled some mementos from the Internet Archive and found a mean damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.
Because Google+ uses page scrolling to allow users to load more content, many mementos will likely be of poor quality if recorded outside of tools like Webrecorder.io. With the sheer number of pages to preserve, we may have to choose quantity over quality.
If a sizable sample of those profiles is considered to be valuable to historians, then web archives have much catching up to do.
A concerted effort will be necessary to acquire a significant number of profile pages by April 2, 2019. My recommendations are for users to archive their public profile URIs with
ArchiveNow,
Mink, or the save page forms at the Internet Archive or Archive.today. Archivists looking to archive Google+ more generally should download the topics sitemap and at least capture the 404 (four hundred four, not 404 status) topics pages using these same tools. Enterprising archivists can search news sources,
like this Huffington Post article and
this Forbes article, that feature popular and famous Google+ users. Sadly, because of the lack of links, much of the data from these articles is not in a machine-readable form. A Google+ archivist would need to search Google+ for these profile page URIs manually. Once that is done, the archivist can then save these URIs using the tools mentioned above.
Update on 2019/03/27 at 20:23 GMT:
The ArchiveTeam is engaged in a heroic effort with the Internet Archive to try to archive as much of Google+ as possible. As of this moment, their project tracker shows that 20,004,087 items (and rising) have been archived.
Due to its lower usage compared to other social networks and its controversial history, some may ask "Is Google+ worth archiving?" Only future historians will know, and by then it will be too late, so we must act now.
--
Shawn M. Jones
You may find https://archivebox.io useful as well...
Sawood Alam just Tweeted that the Verge has an article on an effort by the Internet Archive and ArchiveTeam to preserve Google+:
https://twitter.com/ibnesayeed/status/1107668321632223233
https://www.theverge.com/2019/3/17/18269707/internet-archive-archiveteam-preserving-public-google-plus-posts