2021-02-20: Creation Time and Published Time Are Not the Same: Estimating the Instagram Epoch
During the process of examining how the published datetime of a post can be extracted with the use of only the URL of an Instagram (IG) post, we have uncovered a discrepancy between the published time present in the HTML and the time that can be extracted from the shortcode (hereinafter creation time) of an IG post. During the early stages of this study, we assumed both these values to be the same, similar to Twitter. The time extracted from a Twitter ID is the same as what’s displayed in the JSON object (and HTML) for the tweet.
Relative to its popularity, Instagram is understudied in academic research compared to Twitter. As of 2021, Twitter has 330 million monthly active users and Instagram has 1 billion active users. As of Feb 11, 2021, Google Scholar returns 7.52 million hits for “twitter” and 1.54 million hits for “instagram”. This results in an “active-users/Google Scholar hits ratio” of 43.88 for Twitter and 649.35 for Instagram, almost 15 times larger.
We began this study while exploring the possibilities of bridging this gap between the two platforms in academic research. We looked at different methods to extract the published time associated with a post including the creation time embedded in their media ID and then obtained an estimate for the epoch value used by Instagram while creating their media IDs. We learned that there is a delta between the published time and creation time, which could have implications for applications that depend on sorting events by IG publishing time.
Published time in the HTML
We can obtain the published time of an Instagram post at two different places in the HTML. The date is shown at the bottom of the post in "human-readable" form, and the same date appears if you hover over it. If you inspect this element, you can see the datetime of the post in ISO 8601 format (Figure 02). The published time can also be found in the JSON located in one of the <script> tags in the HTML body (Figure 03), where it is given as a Unix timestamp. Converting the Unix timestamp obtained from the JSON (Figure 03) to ISO 8601 gives the same time as the one displayed at the bottom of the post (Figure 02).
Figure 02: The published datetime (in ISO 8601) of the post found in the HTML (2020-10-08T21:12:34Z) explored via the inspect element option in the Google Chrome browser.
"graphql": {
"shortcode_media": {
"__typename": "GraphImage",
"id": "2415680307434230462",
"shortcode": "CGGOHDXhhK-",
"dimensions": {
"height": 1350,
"width": 1080
},
…
"comments_disabled": false,
"commenting_disabled_for_viewer": false,
"taken_at_timestamp": 1602191554,
"edge_media_preview_like": {
"count": 8,
"edges": [
{
"node": {
"id": "1919166011",
"is_verified": false,
"profile_pic_url": "https://instagram.forf1-4.fna.fbcdn.net/v/t51.2885-19/s150x150/102645018_541091879918535_2546629646016782682_n.jpg?_nc_ht=instagram.forf1-4.fna.fbcdn.net&_nc_ohc=PhA3b4UVuJsAX8qT-8x&tp=1&oh=05bd5c8c6357f3d1269a2d4ce7787550&oe=604F7B66",
"username": "kritika.garg_"
}
}
]
}
Figure 03: A snippet from the JSON file which contains the published datetime (in Unix) of the same post that is shown in Figure 02.
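The conversion between the two representations can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name unix_to_iso8601 is ours and does not come from the IG page.

```python
from datetime import datetime, timezone

def unix_to_iso8601(ts: int) -> str:
    """Format a Unix timestamp (in seconds) as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# taken_at_timestamp from the JSON in Figure 03
print(unix_to_iso8601(1602191554))  # 2020-10-08T21:12:34Z
```

The result matches the ISO 8601 datetime shown in Figure 02.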
In addition to the above two places, the datetime is also encoded in the shortcode of any IG post. We’ll look into that in detail in the following section.
Creation time from media object's shortcode
Let us now look at how we can obtain the creation time of the shortcode using just the URL of the post. The shortcode for a media object can be found in its URL, which follows the format “https://www.instagram.com/p/{shortcode}”. Here, the shortcode is the base64 encoding of the base 10 media ID assigned to a particular media object. Detailed information on how the media ID is built is available in the blog post “Sharding & IDs at Instagram” published in Instagram Engineering. According to the information provided in the aforementioned blog post, each of their IDs consists of 64 bits in total where the first 41 bits give us the time in milliseconds (Figure 04) since their internal epoch (hereinafter IG epoch).
For example, Figure 04 illustrates this conversion in the order of steps mentioned below for the post having the shortcode “CGGOHDXhhK-”.
1. Media ID and shortcode of the post.
2. Obtaining the 64-bit binary equivalent of the media ID.
3. Selecting the first 41-bits.
4. Converting the 41-bits to its decimal equivalent (gives the number of milliseconds since IG epoch).
1. 2415680307434230462 (CGGOHDXhhK-)
2. 0010000110000110001110000111000011010111100001100001001010111110
3. 00100001100001100011100001110000110101111
4. 287971533231
Figure 04: Creation time of a post in milliseconds since IG epoch. Please note that as the media ID is said to be a 64-bit integer, we should first convert it into binary and add extra padding bits to make it a 64-bit integer before taking the first 41 bits which correspond to the ID creation time.
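The four steps can be sketched in Python as follows. We assume the shortcode uses the URL-safe base64 alphabet (A-Z, a-z, 0-9, "-", "_"); the post does not spell this alphabet out, but with it the shortcode from Figure 03 decodes to the media ID shown there. The function names are ours.

```python
# Assumed alphabet: URL-safe base64 ordering (not stated explicitly in the post).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def shortcode_to_media_id(shortcode: str) -> int:
    """Decode a shortcode into the base 10 media ID (6 bits per character)."""
    media_id = 0
    for ch in shortcode:
        media_id = media_id * 64 + ALPHABET.index(ch)
    return media_id

def ms_since_ig_epoch(media_id: int) -> int:
    """Keep the first 41 of 64 bits by dropping the lowest 64 - 41 = 23 bits."""
    return media_id >> 23

media_id = shortcode_to_media_id("CGGOHDXhhK-")
bits = format(media_id, "064b")        # zero-padded to 64 bits

print(media_id)                        # 2415680307434230462
print(bits)                            # step 2
print(bits[:41])                       # step 3
print(ms_since_ig_epoch(media_id))     # 287971533231
```

Shifting right by 23 bits is equivalent to padding to 64 bits and reading the first 41 bits as an integer, so the padding step only matters when working with the binary string directly.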
Given the IG epoch, we would be able to calculate the creation time of the post. Although their internal epoch is not disclosed by IG, we can estimate its value.
Estimating the epoch value used by Instagram
Let’s now estimate the epoch value used by IG. From the previous sections, we know the following two values:
Published time: The value extracted from the HTML (taken_at_timestamp), which is the number of seconds from the UNIX epoch (say, Tp).
Creation time: The value extracted from the media ID, which is the number of milliseconds from the IG epoch (say, Tc).
By using the above two values, we can easily obtain an estimate for the IG epoch. Be mindful of the unit conversions here during calculations.
IG epoch = Tp - (Tc/1000) seconds
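As a worked instance of this formula, the sketch below plugs in the values for the post in Figures 03 and 04. The function name is ours; the subtraction is done in integer milliseconds to keep the unit conversion explicit.

```python
from datetime import datetime, timezone

def estimate_ig_epoch(tp_seconds: int, tc_millis: int) -> float:
    """IG epoch estimate in Unix seconds: Tp - (Tc / 1000)."""
    return (tp_seconds * 1000 - tc_millis) / 1000

# Tp = taken_at_timestamp (Figure 03), Tc = milliseconds since IG epoch (Figure 04)
estimate = estimate_ig_epoch(1602191554, 287971533231)
print(estimate)  # 1314220020.769

# Truncating to whole seconds and formatting in UTC
print(datetime.fromtimestamp(int(estimate), tz=timezone.utc).isoformat())
# 2011-08-24T21:07:00+00:00
```

Because Tp has one-second resolution while Tc has millisecond resolution, a single estimate carries a fractional part; here it truncates to 21:07:00 and would round to 21:07:01.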
Dataset
We have created a dataset of IG epoch estimates calculated using 1000 shortcodes. Without loss of generality, we sampled the 1000 shortcodes from a previous study about Katy Perry's IG account. The code used for estimating the epoch is available on GitHub. The dataset contains posts having a single media item (either one image or one video) as well as posts containing multiple media items. Multiple media posts, which are known within the community as “carousel posts”, can have multiple images and/or videos. The construction of carousel posts will be discussed further in the next section.
Looking at the estimates obtained for the IG epoch in our dataset, we can see that 76.4% (764 out of 1000) of the values point to either 2011-08-24T21:07:00Z or 2011-08-24T21:07:01Z, where the two values only differ by one second. However, the epoch calculated with single video posts was off by a greater margin, as much as 60 min 32 sec.
Below is a summary of the estimated epoch values we obtained with the use of our dataset:
2011-08-24T21:07:00Z = 21.5% (215 out of 1000)
2011-08-24T21:07:01Z = 54.9% (549 out of 1000)
Every other estimate = 23.6% (236 out of 1000)
We picked the earliest datetime estimate (2011-08-24T21:07:00Z) as the epoch value. However, over half of the estimated values we obtained were one second later (2011-08-24T21:07:01Z). The other 23.6% of posts, which point to a range of different epoch estimates with a greater margin (as much as 60 min 32 sec), happen to be single video posts. (Not all single video posts belong to that 23.6%, meaning there are single video posts with zero or one second delta value.) This aroused our curiosity and made us look deeper into this.
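With an epoch value chosen, the creation time of any media ID and its delta from the published time can be computed directly. The sketch below assumes the epoch estimate selected above (2011-08-24T21:07:00Z, i.e. 1314220020 in Unix seconds); the function names are ours and the constant is an estimate, not a value disclosed by Instagram.

```python
from datetime import datetime, timezone

IG_EPOCH_S = 1314220020  # assumed: 2011-08-24T21:07:00Z, the estimate chosen in this post

def creation_time_ms(media_id: int) -> int:
    """Creation time of a media ID in Unix milliseconds."""
    return IG_EPOCH_S * 1000 + (media_id >> 23)

def delta_seconds(media_id: int, taken_at_timestamp: int) -> float:
    """Published time minus creation time, in seconds."""
    return (taken_at_timestamp * 1000 - creation_time_ms(media_id)) / 1000

media_id = 2415680307434230462           # "id" from Figure 03
created = creation_time_ms(media_id)
print(datetime.fromtimestamp(created / 1000, tz=timezone.utc).isoformat())
print(delta_seconds(media_id, 1602191554))  # 0.769
```

For this image post the delta is under one second, which is consistent with the zero or one second deltas seen for most posts in the dataset.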
The difference between the creation time and published time
At first, we thought that the reason for this difference between the creation time and published time could be dependent on the length of the video. To test this hypothesis, we calculated the difference between our selected IG epoch value of 2011-08-24T21:07:00Z and the datetime values in the “epoch_estimate (utc)” column to create the “delta” column in our dataset. We then filtered the single video posts and plotted the duration of the video against the delta values as shown in Figure 05(a). We also computed the Kendall rank correlation coefficient using Kendall’s Tau-b value to check for any correlation.
p-value = 0.4205415
Kendall Tau-b = 0.234224
Figure 05(a): Relationship between off_by value (delta) vs vid_length (duration of the video) for all the single video posts. p-value = 0.4205415, Kendall Tau-b = 0.234224
Additionally, we excluded the videos with video duration greater than 60 seconds (IGTV videos) and plotted the duration of the remaining videos against their delta values as shown in Figure 05(b). We have calculated the Kendall tau-b value for these posts as well.
p-value = 0.1785421
Kendall Tau-b = 0.1937192
Figure 05(b): Relationship between off_by value (delta) vs vid_length (duration of the video) for single video posts with video duration <= 60 sec. Please note that the data used in Figure 05(b) is a subset of the data used in Figure 05(a). p-value = 0.1785421, Kendall Tau-b = 0.1937192
Although both Kendall coefficients indicate a weak positive correlation between the video duration and the delta value, both p-values are high (0.42 and 0.18, well above the conventional 0.05 threshold). We therefore found no statistically significant relationship between the video duration and the delta value.
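For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Kendall's Tau-b. It is not the code used for the figures above, it does not compute a p-value, and the sample values are made up for illustration only. In practice a library routine such as scipy.stats.kendalltau, which returns both the coefficient and a p-value, is the more convenient choice.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's Tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)), with tie correction."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1          # pair tied in x (n1)
            if dy == 0:
                ties_y += 1          # pair tied in y (n2)
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2            # total number of pairs
    denom = sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical values, NOT taken from the dataset
vid_length = [8, 15, 15, 30, 59]     # seconds
delta = [40, 12, 95, 33, 61]         # seconds
print(kendall_tau_b(vid_length, delta))
```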
Our next assumption was that this delta might be affected by the complexity of the video. To verify this, we posted two videos of different complexities (same video length, but different file sizes) and used their shortcodes and published times to estimate the IG epoch. However, the two epoch estimates were similar to each other regardless of the file size.
Our next hypothesis was that this delta is affected by the difference between the time users hit publish on a post and the time it finishes posting to get published to the web (processing time). What we considered as the processing time is the summation of upload time, the time it takes for the video conversion, and any other additional time Instagram takes until it gets posted on the web. The process of posting is shown in Figure 06.
Figure 06: The process of posting a media item to Instagram from the time the user hits post until the time it gets published.
We then checked whether the processing time for the video is what causes this difference. To test this hypothesis, we posted the same video (video length = 11 sec) ten times under two different upload speeds (Figure 07). The first five videos (1-5) were uploaded at 2.18 Mbps and the next five at 0.02 Mbps, about two orders of magnitude slower. We uploaded the same video each time to hold constant other aspects such as video duration and video complexity. We also manually timed how long the video took to complete processing in both scenarios: approximately 35 sec at 2.18 Mbps and approximately 11 min at 0.02 Mbps.
Figure 07: The two different network upload speeds used.
Table 01 shows the outcome of the above test. The delta associated with the first five shortcodes is small (ranging from 23 sec to 49 sec), whereas the delta associated with the final five shortcodes is considerably larger (ranging from 10 min 30 sec to 11 min 48 sec). These two groups of delta values correspond to the approximate processing times we measured manually (about 35 sec and 11 min, respectively). This supports our hypothesis that the delta is affected by the time between the user hitting "publish" and the video being uploaded and published into the user feed.
Table 01: Results of the test conducted to see the effect of upload time/upload speed on the difference between the creation time and published time on IG.
We also thought the delta would be much higher for a post with multiple videos as it might take an even longer time to process/upload as compared to a single video post. However, that’s not the case, based on the structure of how a multiple media post is built.
Carousel post: Multiple images or videos
IG allows users to post up to 10 images and/or videos in a single post. If you look back at the dataset, you can see that the delta value is either 0 or 1 sec for all carousel posts. This means that there is either no difference or only a single second difference between the creation time and published time, even if there are multiple videos and/or images involved.
Let us consider a multiple media post with an image and a video. Figure 08 shows a JSON snippet from the HTML body that explains how the structure of a carousel post is built.
"graphql": {
"shortcode_media": {
"__typename": "GraphSidecar",
"id": "2487219965527166501",
"shortcode": "CKEYWF7gY4l",
"dimensions": {
"height": 1080,
"width": 1080
},
...
"edge_sidecar_to_children": {
"edges": [
{
"node": {
"__typename": "GraphImage",
"id": "2487219962515789569",
"shortcode": "CKEYWDIA5cB",
"dimensions": {
"height": 1080,
"width": 1080
},
...
{
"node": {
"__typename": "GraphVideo",
"id": "2487219810312905327",
"shortcode": "CKEYT1YA-Jv",
"dimensions": {
"height": 750,
"width": 750
},
Figure 08: A snippet from the JSON file of a post with an image and a video displaying how the structure of a multiple media post is built.
There is a shortcode for the post ("CKEYWF7gY4l") and there are also shortcodes for child 1 ("CKEYWDIA5cB") and child 2 ("CKEYT1YA-Jv") under the “edge_sidecar_to_children” key. Any request made directly to the URLs constructed with a child shortcode will redirect to the main post URL (Figures 09 and 10).
$ curl -A "googlebot" -ILs https://www.instagram.com/p/CKEYWDIA5cB/
HTTP/2 302
content-type: text/html; charset=utf-8
location: https://www.instagram.com/p/CKEYWF7gY4l
vary: Accept-Language, Cookie
content-language: en
date: Sun, 07 Feb 2021 03:07:58 GMT
HTTP/2 301
content-type: text/html; charset=utf-8
location: https://www.instagram.com/p/CKEYWF7gY4l/
vary: Accept-Language, Cookie
date: Sun, 07 Feb 2021 03:07:58 GMT
HTTP/2 200
content-type: text/html; charset=utf-8
vary: Cookie, Accept-Language, Accept-Encoding
content-language: en
date: Sun, 07 Feb 2021 03:07:58 GMT
Figure 09: A direct request to the URL constructed using the shortcode for child post 1
$ curl -A "googlebot" -ILs https://www.instagram.com/p/CKEYT1YA-Jv/
HTTP/2 302
content-type: text/html; charset=utf-8
location: https://www.instagram.com/p/CKEYWF7gY4l
vary: Accept-Language, Cookie
content-language: en
date: Sun, 07 Feb 2021 03:10:03 GMT
HTTP/2 301
content-type: text/html; charset=utf-8
location: https://www.instagram.com/p/CKEYWF7gY4l/
vary: Accept-Language, Cookie
date: Sun, 07 Feb 2021 03:10:03 GMT
HTTP/2 200
content-type: text/html; charset=utf-8
vary: Cookie, Accept-Language, Accept-Encoding
content-language: en
date: Sun, 07 Feb 2021 03:10:03 GMT
Figure 10: A direct request to the URL constructed using the shortcode for child post 2
Table 02 shows the creation time of each shortcode, the published time in the HTML, and the delta between the two. The creation times of the main post shortcode and child 1’s shortcode align with the published time in the HTML, with a delta of only one second, whereas the creation time of child 2’s shortcode is 19 sec earlier than the published time in the HTML. This means that the media ID of child 2 was created before those of child 1 and the main post. We assume that IG creates the main post media ID at the very end of the posting process.
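The ordering of the three IDs can be checked without knowing the IG epoch, because the epoch cancels out when two creation times are subtracted. The sketch below uses the three "id" values from Figure 08 and compares each one to the main post; the variable and function names are ours.

```python
# Media IDs copied from the JSON in Figure 08
MEDIA_IDS = {
    "main post": 2487219965527166501,
    "child 1":   2487219962515789569,
    "child 2":   2487219810312905327,
}

def ms_since_ig_epoch(media_id: int) -> int:
    """First 41 bits of the 64-bit ID: milliseconds since the IG epoch."""
    return media_id >> 23

main_ms = ms_since_ig_epoch(MEDIA_IDS["main post"])
for name, media_id in MEDIA_IDS.items():
    offset_s = (ms_since_ig_epoch(media_id) - main_ms) / 1000
    print(f"{name}: {offset_s:+.3f} s relative to the main post ID")
# main post: +0.000 s relative to the main post ID
# child 1: -0.359 s relative to the main post ID
# child 2: -18.503 s relative to the main post ID
```

The video child's ID precedes the main post's ID by about 18.5 seconds, while the image child's ID precedes it by well under a second, which fits the assumption that the main post ID is created last.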
Comparison with Twitter
Twitter uses Snowflake as its internal service to generate Twitter IDs. They shifted to this method of ID generation since they needed to generate lots of 64-bit IDs per second which will still be roughly sortable. Also, this new method allows for distributed ID creation whereas the prior MySQL-based technique was centralized and not scalable. Performing a similar test by limiting the upload speed and posting the same video twice to the Twitter feed, we were able to verify that the time extracted from a Twitter ID is the same as what’s displayed in the JSON object for the tweet. Note that the creation time of the ID is obtained from TweetedAt, a service and library built by Mohammed Nauman Siddique and Sawood Alam that makes it easy to extract the datetime from Twitter IDs. Table 03 shows the outcome of the test.
Table 03: Results of the test conducted to see the effect of upload time/upload speed on the difference between the creation time and published time on Twitter.
As shown in Table 03, the delta is zero, meaning that there is no difference between the creation time and published time regardless of the limitations to the upload speed to prolong the processing time of the post.
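For comparison, the timestamp embedded in a Snowflake-style Twitter ID can be recovered in much the same way as for IG. This is a rough sketch, not the TweetedAt implementation: it assumes the widely documented Snowflake layout (timestamp in the bits above the lowest 22) and the commonly cited Twitter epoch of 1288834974657 ms, neither of which is stated in this post, and it does not apply to pre-Snowflake tweet IDs. The ID in the usage example is synthetic, built from a chosen timestamp purely for illustration.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # assumed: commonly cited Snowflake epoch (Nov 2010)

def tweet_id_to_unix_ms(tweet_id: int) -> int:
    """Drop the lowest 22 bits and add the Twitter epoch to get Unix milliseconds."""
    return (tweet_id >> 22) + TWITTER_EPOCH_MS

# Synthetic ID (hypothetical, not a real tweet): encode a known time, then decode it.
known_ms = 1612667398000
synthetic_id = (known_ms - TWITTER_EPOCH_MS) << 22
decoded_ms = tweet_id_to_unix_ms(synthetic_id)
print(datetime.fromtimestamp(decoded_ms / 1000, tz=timezone.utc).isoformat())
```

The structure mirrors the IG case; the difference reported in this post is not in how the ID is decoded but in whether the decoded time coincides with the published time.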
Reasons why this delta could be important
Understandably, these are two different events (ID creation and post publishing), and Twitter and Instagram decided to handle them differently. Twitter defines them to be the same, whereas at Instagram they are the same only for posts with the shortest processing times, which occurred 76.4% of the time (zero or one second delta) in our sample. However, we can think of a few instances where this difference between the two timings (delta) can be of importance. One could use this delta value to further study network connection related factors. As users use the IG mobile app to post pictures, we could use this delta value to understand how someone’s mobile-network connection speeds have changed over time. This could also give away location information based on mobile-network connection speed ranges vs home WiFi speed ranges. The results may not be precise, but at least we could identify user clusters based on connection speeds.
Another use case of this delta could be at instances where priority claiming occurs. For example, a game organized via IG where one should complete a certain challenge and post a video of it, and the winners will be chosen according to the time of the post. In such a scenario it is only fair to use the creation time instead of published time to pick the winners. The contestants should not be penalized for having bad connection upload speeds.
Conclusion
In this study, we have shown how we can extract the published time of a post from the HTML of any Instagram post. We have also shown that it is possible to obtain the creation time of the ID from the media ID/shortcode of a post if the epoch value used by IG is known. Although Instagram has never disclosed this value, we have reasons to believe that the epoch value used by IG is 2011-08-24T21:07:00Z. We reached this conclusion after careful analysis of estimates obtained for the IG epoch by using the published time and creation time extracted from a collection of 1000 IG post shortcodes.
Even though one would expect the two values, published time and creation time, to be the same, we have discovered that they differ. The difference is the time it takes for a post to be completely processed and published on the web once the user hits publish. Several factors can affect the processing time, but it seems to be dominated by the upload speed. Even though this delta is small, it can still matter in scenarios that depend on the chronological ordering of posts to determine which post came first, especially when fine-grained time resolution is important.