2020-11-04: How well is Instagram archived?

Figure 1:  Snapshots of Katy Perry’s account page on the three leading social media platforms: Instagram, Facebook, and Twitter.

A little bit about Instagram

In 2020, social media is one of the most popular ways for people to connect with one another. Instagram (IG or Insta) is a photo and video-sharing social networking service and the 6th most popular social network worldwide, with over 1 billion active users. It was created by Kevin Systrom and Mike Krieger and launched on iOS in October 2010. It remained independent until Facebook acquired it in 2012. Recent statistics show that about six in 10 Instagram users log in at least once daily, making it the second most logged-in social media site for daily use after Facebook.

Katy Perry as a proxy for all of Instagram

In this study, I use Katy Perry’s Instagram account as a proxy for all of Instagram. She created her Instagram account in 2012 and, as of Oct 1, 2020, she is the 20th most popular person on Instagram, with over 100 million followers. As of Sept 24, 2020, she has 1642 posts in her Instagram feed, a manageable number of posts to investigate individually. Through this study, I found that despite her popularity, only about 1/3 (520 out of 1642) of her individual posts on Instagram are archived in public web archives.


Figure 2: A snapshot of Katy Perry’s Instagram account taken on 2020-10-11. 

Katy posts the same info across all platforms: 

I have noticed that Katy Perry typically posts the same content on all three platforms. Let’s compare the URL structure on Facebook, Twitter, and Instagram for a particular post/status.  


  • Facebook - facebook.com/katyperry/posts/{17DigitNumericID}

Example: https://www.facebook.com/katyperry/posts/10157480148716466


  • Twitter -  twitter.com/katyperry/status/{19DigitNumericID}

Example: https://twitter.com/katyperry/status/1315365116792242178


  • Instagram - instagram.com/p/{11CharacterID}

Example: https://www.instagram.com/p/CGNsqMCBXW_/


On Instagram, post URLs are completely opaque: from the post URL alone, we cannot tell that the post belongs to Katy’s account without dereferencing it. Figure 3 shows how the example post above appears on the three accounts. The content is more or less the same, but the URL structures are completely different. Facebook and Twitter URLs are semi-opaque, with the account name visible in the URL. Because these URLs are constructed left-to-right, we can use prefix searches: the Internet Archive’s Wayback CDX Server API supports a “prefix” search that helps us obtain deep links to mementos. Unfortunately, this works only for Facebook and Twitter posts. Instagram post URLs do not include the account name, so a prefix search cannot isolate one account’s posts.
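To make the difference concrete, here is a minimal sketch of how a prefix query against the Wayback CDX Server API could be built. The helper name `cdx_prefix_query` and the choice of the `fl`, `collapse`, and `limit` values are my own illustrative assumptions; the code only builds the query URLs and does not send any requests.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_prefix_query(url_prefix: str, limit: int = 50) -> str:
    """Build a CDX query URL listing captures of every URL under url_prefix.

    Hypothetical helper for illustration; it only constructs the URL.
    """
    params = {
        "url": url_prefix,
        "matchType": "prefix",        # match all URLs that start with url_prefix
        "output": "json",
        "fl": "original,timestamp",   # fields to return for each capture
        "collapse": "urlkey",         # keep one row per distinct URL
        "limit": limit,
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


# The account name is part of the prefix, so this targets one account's statuses.
twitter_query = cdx_prefix_query("twitter.com/katyperry/status/")

# No account name in the path: this prefix covers posts from every Instagram account.
instagram_query = cdx_prefix_query("instagram.com/p/")

print(twitter_query)
print(instagram_query)
```

The Twitter query is scoped to a single account, while the Instagram query has nothing in its prefix that ties the results to Katy Perry.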


Figure 3: An example of Katy Perry posting the same content across the three different platforms.

How many mementos of her account page are available across different web archives?

I quantified the number of mementos available for her account page on all three leading social media platforms (Facebook, Twitter, and Instagram). This data was collected through MemGator on Oct 11, 2020. The MemGator service, built by Sawood Alam, allows us to retrieve the TimeMap for a particular URL by aggregating data from 16 different archives. The collected data is summarized below (Table 1). Of the three services, Katy Perry's Twitter account has the highest number of copies in the web archives, followed by Facebook. Her Instagram account has the lowest number of mementos.



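| Web archive | Instagram | Facebook | Twitter |
|---|---|---|---|
| Internet Archive | 1803 | 2032 | 4025 |
| Archive-It | 108 | 1234 | 1185 |
| archive.today | 7 | - | 2 |
| Library of Congress | 4 | 1 | - |
| UK Web Archive | - | 16 | 58 |
| Australian Web Archive | - | 37 | 2 |
| Portuguese Web Archive | 2 | - | 1 |
| Total | 1924 | 3320 | 5273 |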

Table 1: Captures of Katy Perry’s social media account pages across different public web archives
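As a rough sketch of how per-archive counts like those in Table 1 could be derived, the snippet below counts mementos by archive hostname in a link-format TimeMap. The function name `mementos_per_archive` is hypothetical, and `SAMPLE_TIMEMAP` is hand-written for illustration only; it is not real MemGator output and its datetimes are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches one link-format entry of the form: <URI>; rel="..."
LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def mementos_per_archive(timemap_text: str) -> Counter:
    """Count entries whose rel value includes "memento", grouped by hostname."""
    counts = Counter()
    for uri, rel in LINK_RE.findall(timemap_text):
        # rel may be "memento", "first memento", or "last memento"
        if "memento" in rel.split():
            counts[urlsplit(uri).hostname] += 1
    return counts


# Hand-written sample for illustration (invented datetimes, not real data).
SAMPLE_TIMEMAP = """
<https://www.instagram.com/katyperry/>; rel="original",
<https://memgator.cs.odu.edu/timemap/link/https://www.instagram.com/katyperry/>; rel="self"; type="application/link-format",
<https://web.archive.org/web/20200101000000/https://www.instagram.com/katyperry/>; rel="first memento"; datetime="Wed, 01 Jan 2020 00:00:00 GMT",
<https://archive.ph/20200201000000/https://www.instagram.com/katyperry/>; rel="memento"; datetime="Sat, 01 Feb 2020 00:00:00 GMT",
<https://web.archive.org/web/20200301000000/https://www.instagram.com/katyperry/>; rel="last memento"; datetime="Sun, 01 Mar 2020 00:00:00 GMT",
"""

counts = mementos_per_archive(SAMPLE_TIMEMAP)
print(counts)
```

Grouping by hostname is a simplification: an archive that serves mementos from several hostnames would need its hostnames mapped to a single name first.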

How often are her account pages archived?

I also measured how often her account page is archived, that is, on how many days between the first and the last memento at least one memento was captured. This data was collected through MemGator on Oct 13, 2020, and is summarized in Table 2.



| | Instagram | Facebook | Twitter |
|---|---|---|---|
| First memento | Mon, 12 Nov 2012 08:59:33 | Thu, 05 Feb 2009 11:38:42 | Sun, 11 May 2008 11:06:18 |
| Last memento | Mon, 05 Oct 2020 20:10:55 | Wed, 07 Oct 2020 06:02:10 | Fri, 09 Oct 2020 10:33:54 |
| Total days | 2885 | 4263 | 4535 |
| Total mementos | 1924 | 3320 | 5273 |
| Memento count >= 1 (number of days) | 802 | 872 | 1863 |
| Percentage | 27.80% | 20.46% | 41.08% |

Table 2: How often are the account pages archived
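The rows of Table 2 can be reproduced from a list of memento datetimes with a calculation along these lines. This is a minimal sketch: the function name `archive_coverage` is hypothetical, and I assume "Total days" counts both the first and the last day, which matches the 2885 days reported for Instagram. The second datetime in the sample is invented to show that several mementos on one day count once.

```python
from datetime import datetime


def archive_coverage(memento_datetimes):
    """Return (total_days, days_with_a_memento, percentage_of_days_covered)."""
    days = {dt.date() for dt in memento_datetimes}   # distinct calendar days
    total_days = (max(days) - min(days)).days + 1    # inclusive span (assumption)
    return total_days, len(days), 100 * len(days) / total_days


sample = [
    datetime(2012, 11, 12, 8, 59, 33),   # first Instagram memento (Table 2)
    datetime(2012, 11, 12, 17, 0, 0),    # invented: second capture on the same day
    datetime(2020, 10, 5, 20, 10, 55),   # last Instagram memento (Table 2)
]

total_days, archived_days, percentage = archive_coverage(sample)
print(total_days, archived_days, round(percentage, 2))

# Percentage row of Table 2 for Instagram: 802 archived days out of 2885
print(round(100 * 802 / 2885, 2))
```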


One thing to note here is that having this many mementos doesn’t mean that all of them replay well. Quantifying how many of these mementos replay well is out of scope for this study. As a side note, I have observed that many web archiving tools face archiving and replay issues on Instagram pages. However, Conifer can successfully archive and replay Instagram pages. You can either scroll through the required pages manually or use autopilot mode, which has a specialized behavior for Instagram user accounts that automatically scrolls through the user’s posts, including comments and replies. More information can be found in their user guide.

Follower count using historical data from web archives

Next, I looked at how the number of followers of her Instagram account grew over time, using historical data from web archives to extract the follower counts. There are 1920 mementos for her Instagram account page (collected through MemGator on Oct 08, 2020), of which the earliest is from 2012-11-12T08:59:33Z and the latest is from 2020-10-05T20:10:55Z. Like most web pages, the Instagram UI has undergone several layout changes over the years, which made it difficult to extract the follower count. To overcome this challenge, I handled each layout as a separate case, which made it possible to scrape the follower count from her account page. The code used for scraping the follower count from the mementos is available on GitHub.


Dataset

  • Number of mementos: 1920

  • For each memento:

    • Memento-datetime (from MemGator)

    • Memento-datetime (of the landing-page)

    • Extracted value for the follower count (raw)

    • Follower count (processed follower count in millions)
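The step from the raw extracted value to the follower count in millions could look something like the sketch below. The raw formats handled here (plain digits with thousands separators, and values with a "k" or "m" suffix) are my assumptions about what different layouts might display; the actual scraping code may handle other cases. The function name `followers_in_millions` is hypothetical.

```python
import re

_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000}
_RAW_PATTERN = re.compile(r"^\s*([\d.,]+)\s*([km]?)\s*$", re.IGNORECASE)


def followers_in_millions(raw: str) -> float:
    """Convert a raw follower string such as "107.5m" to a count in millions."""
    match = _RAW_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"unrecognized follower count: {raw!r}")
    number, suffix = match.groups()
    # Treat commas as thousands separators (assumption about the raw text).
    value = float(number.replace(",", "")) * _MULTIPLIER[suffix.lower()]
    return value / 1_000_000


for raw in ["1,234,567", "53.2k", "107.5m"]:
    print(raw, "->", followers_in_millions(raw))
```

Rejecting unrecognized strings with an error, rather than guessing, makes it easier to spot mementos whose layout needs a new case.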


The graph (Figure 4), created using R, shows the follower count from Jan 02, 2013, to Jun 12, 2020. The solid blue line is the linear regression line that I fitted, and the dotted blue line extends it to show how the follower count would grow over the coming years. If growth continues at this rate, Katy Perry will reach 150 million followers on Dec 26, 2022, and 200 million followers on Jan 01, 2026.


Figure 4: Follower count using historical data from web archives. The blue line indicates the fitted linear regression line and it is extended (blue dotted line) to predict the follower count. 
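The extrapolation in Figure 4 was done in R; the following Python sketch shows the same idea with an ordinary least-squares line. The observations here are synthetic and perfectly linear, chosen only so the result is easy to check by hand; they are not the follower counts from the study. The function names are hypothetical.

```python
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def date_reaching(target, observations):
    """Extrapolate the fitted line to the date when the count reaches target.

    observations is a list of (date, follower_count_in_millions) pairs.
    """
    origin = observations[0][0]
    xs = [(day - origin).days for day, _ in observations]
    ys = [count for _, count in observations]
    slope, intercept = fit_line(xs, ys)
    days_needed = (target - intercept) / slope
    return origin + timedelta(days=round(days_needed))


# Synthetic data: 10 million followers, growing by 5 million every 100 days.
synthetic = [(date(2020, 1, 1) + timedelta(days=100 * i), 10.0 + 5.0 * i)
             for i in range(5)]

predicted = date_reaching(50.0, synthetic)
print(predicted)
```

A straight line assumes a constant growth rate, so the predicted dates are only as good as that assumption.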


We can find the follower count growth over a specific period of time as long as the account is popular enough to be archived often, so that there are enough mementos to extract the data from and plot the follower growth. Interestingly, we can also use this method to find the follower growth of deleted accounts, assuming the account was archived often while it existed. Nauman Siddique did a similar study to check historical Twitter follower counts via web archives.

Individual posts in her Instagram feed

Data

As of Sept 24, 2020, Katy Perry had a total of 1642 posts in her Instagram feed. After exhausting all other approaches to collecting the individual post data, I used Chrome DevTools to capture the network traffic in HTTP Archive (HAR) format while scrolling down her Instagram feed until I reached her first post. I then used Haralyzer to extract the data fields required to compile the dataset of her individual posts. A snippet of the HAR data containing the extracted fields is shown below.


        "shortcode": "CEav2jYB12H",
        "edge_media_to_comment": {
            "count": 7596,
            "page_info": {
                "has_next_page": true,
                "end_cursor": ""
            }
        },
        "edge_media_to_sponsor_user": {
            "edges": []
        },
        "comments_disabled": false,
        "taken_at_timestamp": 1598585366,
        "edge_media_preview_like": {
            "count": 1447080,
            "edges": [{
                "node": {
                    "id": "2016325463",
                    "profile_pic_url": "https://instagram.forf1-3.fna.fbcdn.net/v/",
                    "username": "hiru_savi"
                }
            }]
        },
        "owner": {
            "id": "407964088",
            "username": "katyperry"
        },
        "location": {
            "id": "111670616996681",
            "has_public_page": true,
            "name": "Promo Smiles",
            "slug": "promo-smiles"
        },
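For readers who want to try this without Haralyzer, here is a minimal sketch that pulls the same fields out of a HAR file using only Python's standard `json` module. This is a swapped-in approach, not the script used in the study. It relies on the standard HAR layout (`log.entries[].response.content.text`) and searches each JSON response body for objects that carry both a `shortcode` and a `taken_at_timestamp`. The wrapper structure in `sample_body` is invented; only the field names and values come from the snippet above.

```python
import json


def iter_dicts(obj):
    """Yield every dict nested anywhere inside obj."""
    if isinstance(obj, dict):
        yield obj
        for value in obj.values():
            yield from iter_dicts(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_dicts(item)


def posts_from_har(har: dict) -> dict:
    """Map each shortcode found in the HAR to its timestamp, likes, and comments."""
    posts = {}
    for entry in har["log"]["entries"]:
        text = entry.get("response", {}).get("content", {}).get("text")
        if not text:
            continue
        try:
            body = json.loads(text)
        except ValueError:   # response body is not JSON (HTML, images, ...)
            continue
        for node in iter_dicts(body):
            if "shortcode" in node and "taken_at_timestamp" in node:
                posts[node["shortcode"]] = {
                    "taken_at_timestamp": node["taken_at_timestamp"],
                    "likes": node.get("edge_media_preview_like", {}).get("count"),
                    "comments": node.get("edge_media_to_comment", {}).get("count"),
                }
    return posts


# Tiny hand-built HAR for illustration; the "data"/"edges" wrapper is invented.
sample_body = {"data": {"edges": [{"node": {
    "shortcode": "CEav2jYB12H",
    "edge_media_to_comment": {"count": 7596},
    "taken_at_timestamp": 1598585366,
    "edge_media_preview_like": {"count": 1447080},
}}]}}
sample_har = {"log": {"entries": [
    {"response": {"content": {"text": json.dumps(sample_body)}}},
    {"response": {"content": {"text": "<html>not json</html>"}}},
]}}

posts = posts_from_har(sample_har)
print(posts)
```

Keying the result by shortcode also removes duplicates when the same post appears in more than one response.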



Using the shortcode present in the data, I could form the post URL, since it has the form “https://www.instagram.com/p/{shortcode}/”. The HAR data contained the posted time in Unix time format, which I converted into ISO 8601. For each individual post URL, I obtained the number of mementos in the Internet Archive by requesting the post's TimeMap through the Wayback CDX Server API. The scripts used for each step are available on GitHub.
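These two conversions are small enough to sketch directly. The function names are hypothetical, and `count_captures` assumes the CDX response was requested with `output=json`, where the first row lists the field names; the `cdx_rows` value below is invented for illustration.

```python
from datetime import datetime, timezone


def post_url(shortcode: str) -> str:
    """Form the Instagram post URL from a shortcode."""
    return f"https://www.instagram.com/p/{shortcode}/"


def unix_to_iso8601(timestamp: int) -> str:
    """Convert Unix time (seconds) to an ISO 8601 string in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def count_captures(cdx_rows: list) -> int:
    """Count captures in a parsed CDX JSON response (first row is the header)."""
    return max(len(cdx_rows) - 1, 0)


print(post_url("CEav2jYB12H"))       # https://www.instagram.com/p/CEav2jYB12H/
print(unix_to_iso8601(1598585366))   # 2020-08-28T03:29:26Z

# Invented CDX response for illustration: a header row plus two captures.
cdx_rows = [["timestamp", "original"],
            ["20200901000000", "https://www.instagram.com/p/CEav2jYB12H/"],
            ["20200902000000", "https://www.instagram.com/p/CEav2jYB12H/"]]
print(count_captures(cdx_rows))      # 2
```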


Dataset

  • Number of posts: 1642

  • For each individual post:

    • Shortcode

    • URL

    • Posted time (in Unix)

    • Posted time (in UTC)

    • Number of likes

    • Number of comments

    • Engagement: SUM(likes, comments)

    • Number of mementos

    • Ranking by number of likes

    • Ranking by number of comments

    • Ranking by engagement

    • Ranking by number of mementos

    • Number of URL variations

    • URL variations

Distribution of the number of mementos for the Individual posts

To visualize the distribution of the number of mementos per post, I plotted a histogram (Figure 5). The x-axis shows the number of mementos for a post and the y-axis shows the frequency (how many posts have that number of mementos). Note that the x-axis uses different bin sizes to show the distribution better: counts from 0 to 10 each get their own bin (bin size of 1), counts from 11 to 100 form a single bin covering 90 values, and counts from 101 onwards are grouped into bins of 100. At first glance, you can see that the number of posts with zero mementos is very high: 68.33% of the posts in her account (1122 out of 1642, or approximately 2 out of 3) have zero mementos.


Figure 5: Distribution of the number of mementos for individual posts. The x-axis has different bin sizes (a bin size of 1 memento from 0-10, one bin covering 11-100, and a bin size of 100 mementos from 101 onwards).
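The uneven binning used for Figure 5 can be expressed as a small labeling function. This is a sketch of the binning rule as described, not the plotting code itself; the function name `bin_label` and the sample counts are illustrative.

```python
from collections import Counter


def bin_label(memento_count: int) -> str:
    """Return the histogram bin label for a post's memento count."""
    if memento_count < 0:
        raise ValueError("memento count cannot be negative")
    if memento_count <= 10:
        return str(memento_count)          # 0-10: one bin per value
    if memento_count <= 100:
        return "11-100"                    # a single bin covering 90 values
    lower = ((memento_count - 101) // 100) * 100 + 101
    return f"{lower}-{lower + 99}"         # 101-200, 201-300, ...


# Illustrative memento counts (not the study's data).
sample_counts = [0, 0, 3, 11, 100, 101, 250]
histogram = Counter(bin_label(count) for count in sample_counts)
print(histogram)
```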

Correlation between likes, comments, total engagement, and mementos

I ranked the data according to the number of likes, comments, engagement, and mementos. Using those ranking values, I then computed the Kendall rank correlation coefficient to check for the correlation, using Kendall’s Tau-b value because it accounts for ties. Figure 6 shows four scatterplots of different field values created using R.


Figure 6: Scatterplots comparing the rankings by likes, comments, engagement, and mementos for individual posts.


  • Weak positive correlation between the number of mementos and likes

    • Kendall’s Tau-b value: 0.2171644, p-value: 4.705449e-31


  • Weak positive correlation between the number of mementos and comments

    • Kendall’s Tau-b value: 0.340613, p-value: 8.089941e-74


  • Moderately positive correlation between the number of likes and comments

    • Kendall’s Tau-b value: 0.551134, p-value: 1.733003e-245


  • Weak positive correlation between the number of mementos and engagement: SUM(likes, comments)

    • Kendall’s Tau-b value: 0.2203515, p-value: 6.369405e-32
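The values above were computed in R. For readers unfamiliar with Tau-b, here is a minimal pure-Python sketch of the coefficient itself (without the p-value), using the standard definition: the difference between concordant and discordant pairs, divided by a denominator that is corrected for ties in each variable. The quadratic loop is fine for about 1600 posts but is not optimized, and the function name is my own.

```python
from collections import Counter
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's Tau-b for two equal-length sequences, with tie correction."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
            # product == 0: the pair is tied in x or y and counts for neither
    n0 = n * (n - 1) // 2                                       # all pairs
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())     # pairs tied in x
    n2 = sum(t * (t - 1) // 2 for t in Counter(y).values())     # pairs tied in y
    denominator = sqrt((n0 - n1) * (n0 - n2))
    if denominator == 0:
        raise ValueError("Tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Small example with one tie in each variable:
# 4 concordant pairs, 0 discordant, n0 = 6, n1 = n2 = 1, so 4 / 5 = 0.8
print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))
```

The tie correction matters here because many posts share the same memento count (most have zero), so a large share of pairs are tied.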


Slides: How well is Instagram archived? A quantitative case study using Katy Perry’s Instagram account.

Key Takeaways:

In this study, I used Katy Perry’s Instagram account as a proxy for all of Instagram to understand and quantify how well Instagram is archived. I used her account pages on the other two leading social media platforms (Facebook and Twitter) to establish a baseline. I also created a follower count growth curve for Katy Perry on Instagram based on historical data from web archives. The key findings of this study are as follows:


  • Twitter is well-archived, Facebook is not well-archived as compared to Twitter, but they're both miles ahead of Instagram. 


  • Even though Katy Perry is the 20th most popular person on IG, only about 1/3 (520 out of 1642) of her posts are archived in public web archives.


  • The general assumption is that the number of copies in web archives acts as a proxy for how popular a URL is. The data shows a weak positive correlation between engagement and archiving, which supports this assumption.

Acknowledgments

This study began as a course project that I have since expanded. The advice given by Dr. Michael Nelson and Dr. Michele Weigle has been a great help in compiling this blog post.


Himarsha Jayanetti (@HimarshaJ)
