2022-03-30: GitHub is not an archive - GitHub Pages

Most GitHub users know *.github.io as the domain for GitHub-hosted websites. But before there was *.github.io (for example, https://elescamilla.github.io), there was *.github.com (for example, https://elescamilla.github.com), which had the exact same functionality. What caused the change?

On April 5, 2013, GitHub released a statement that they would be deprecating *.github.com for security reasons. In the post, they said that "all traffic will be redirected to the new *.github.io location indefinitely, so you won't have to change any links". However, on January 29, 2021, they released an updated statement that they would stop redirecting *.github.com to *.github.io starting April 15, 2021, to further address security concerns. They recommended that users "remove any external references to *.github.com". To encourage users to update any external links, they scheduled two "brown out" dates and notified users of the upcoming change.

But...

In most situations, it is difficult to modify URLs in a publication where the content is permanent. For example, in the arXiv corpus that I am studying as part of the CoSAI project, 335 PDF publications from 2011 to 2021 reference "aplpy.github.com". However, this page no longer exists, and users are no longer automatically redirected to the "aplpy.github.io" page that has replaced it. An example of a publication referencing "aplpy.github.com" is shown below.

Captured from https://arxiv.org/abs/2011.08829, page 18

This is just one example of a broken GitHub Pages link. This publication now permanently contains a broken link even though the Web page was available when the publication was submitted. For more on the topic of URL integrity in the academic corpus, see Klein et al., 2014 and Jones et al., 2016.
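
For readers who hold a list of such references, the mapping from the legacy host to its replacement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name legacy_pages_to_io and the set NON_PAGES_SUBDOMAINS are hypothetical, the list of excluded hosts is illustrative and not exhaustive, and the rewritten URL only works if the project still publishes a site under the same name.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: an illustrative, non-exhaustive set of *.github.com hosts that
# are GitHub services rather than legacy GitHub Pages sites.
NON_PAGES_SUBDOMAINS = {"www", "gist", "api", "raw", "help", "status"}


def legacy_pages_to_io(url: str) -> str:
    """Return the *.github.io form of a legacy *.github.com Pages URL.

    URLs that do not look like legacy Pages URLs are returned unchanged.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # A legacy Pages host has exactly three labels: <name>.github.com
    if len(labels) != 3 or labels[1:] != ["github", "com"]:
        return url
    if labels[0] in NON_PAGES_SUBDOMAINS:
        return url
    new_netloc = f"{labels[0]}.github.io"
    if parts.port is not None:
        new_netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, new_netloc, parts.path, parts.query, parts.fragment)
    )


print(legacy_pages_to_io("https://aplpy.github.com"))       # https://aplpy.github.io
print(legacy_pages_to_io("https://github.com/aplpy/aplpy"))  # unchanged
```

A rewrite like this helps a reader or a crawler, but it does not repair the publication itself: the PDF still contains the original link.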

https://aplpy.github.com has a 404 HTTP response 

$ curl -is https://aplpy.github.com | head -10
HTTP/1.1 404 Not Found
Connection: keep-alive
Content-Length: 9581
Server: GitHub.com
Content-Type: text/html; charset=utf-8
x-pages-interstitial: 1
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
X-GitHub-Request-Id: 9662:239D:F8B449:177A80F:62437811
Accept-Ranges: bytes
Date: Tue, 29 Mar 2022 21:20:17 GMT

and displays the following:



The page suggests that the user go to the updated URL to find what they were looking for. The linked page (https://aplpy.github.io) returns a 200 HTTP response

$ curl -is https://aplpy.github.io | head -15
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 6912
Server: GitHub.com
Content-Type: text/html; charset=utf-8
permissions-policy: interest-cohort=()
Last-Modified: Sat, 16 Jun 2018 20:43:22 GMT
Access-Control-Allow-Origin: *
ETag: "5b25766a-1b00"
expires: Tue, 29 Mar 2022 15:14:26 GMT
Cache-Control: max-age=600
x-proxy-cache: MISS
X-GitHub-Request-Id: 51BE:3C52:5658F8:C5C838:62431FFA
Accept-Ranges: bytes
Date: Tue, 29 Mar 2022 21:22:24 GMT

and displays the following:



However, GitHub has shown that transitional measures like this one, which ease the move from *.github.com to *.github.io, are subject to deprecation over time. So even this helpful message, which points users to the page they are likely looking for, is not guaranteed to be permanent.
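
When checking many links, the header output that curl -is prints, as in the two transcripts above, can be parsed programmatically instead of read by eye. The following is a minimal sketch, not a full HTTP parser: the function name parse_response_head is hypothetical, and it assumes the text starts with a status line followed by one header per line.

```python
def parse_response_head(raw: str):
    """Split the head of an HTTP response, as printed by `curl -is`,
    into (status_code, headers). Header names are lowercased."""
    lines = [line.rstrip("\r") for line in raw.strip().splitlines()]
    # The status line looks like "HTTP/1.1 404 Not Found".
    status_code = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line marks the end of the headers
        # Split on the first colon only, so values such as dates and
        # URLs keep their own colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


# Excerpt of the response shown above for https://aplpy.github.com
sample = """HTTP/1.1 404 Not Found
Connection: keep-alive
Content-Length: 9581
Server: GitHub.com
"""
status, headers = parse_response_head(sample)
print(status, headers["server"])  # 404 GitHub.com
```

A status in the 300 range would additionally carry a location header, which is how the former redirect from *.github.com to *.github.io was expressed.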

Why does this matter?

GitHub is an incredibly popular Git hosting platform for software development, and it doesn't appear to be going anywhere anytime soon. But GitHub, like the rest of the Web, is not permanent. Files and pages stored on GitHub are not guaranteed to be available.

This is where Web archiving comes in. Web archiving helps ensure that content on the Web remains available in the future.

There are 900 mementos of http://aplpy.github.com available in the Internet Archive alone at https://web.archive.org/web/*/http://aplpy.github.com/. The first memento is from March 2, 2011: 


And the latest memento is from February 23, 2022:


with the following curl response headers: 
$ curl -is https://web.archive.org/web/20220223200804/http://aplpy.github.com/ | head -25
HTTP/1.1 404 Not Found
Server: nginx/1.19.5
Date: Tue, 29 Mar 2022 21:28:15 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 19335
Connection: keep-alive
x-archive-orig-server: GitHub.com
x-archive-orig-x-pages-interstitial: 1
x-archive-orig-content-security-policy: default-src 'none'; style-src 'unsafe-inline'; img-src data:; connect-src 'self'
x-archive-orig-x-github-request-id: 79C2:39BE:AB8B4:150E77:62169424
x-archive-orig-content-length: 9581
x-archive-orig-accept-ranges: bytes
x-archive-orig-date: Wed, 23 Feb 2022 20:08:04 GMT
x-archive-orig-via: 1.1 varnish
x-archive-orig-age: 0
x-archive-orig-connection: keep-alive
x-archive-orig-x-served-by: cache-sjc10046-SJC
x-archive-orig-x-cache: MISS
x-archive-orig-x-cache-hits: 0
x-archive-orig-x-timer: S1645646885.714892,VS0,VE68
x-archive-orig-vary: Accept-Encoding
x-archive-orig-x-fastly-request-id: bf0b2be8a3fdc757390c0898be26536e09655876
x-archive-guessed-content-type: text/html
x-archive-guessed-charset: utf-8
memento-datetime: Wed, 23 Feb 2022 20:08:04 GMT

However, as I mentioned earlier, GitHub used to redirect from *.github.com to *.github.io. The Internet Archive captured the redirect as seen in the image below:



and in the following curl response headers: 
$ curl -Is -H "Accept-datetime: Sat, 27 Mar 2021 03:27:32 GMT" https://web.archive.org/web/20210327032732/https://aplpy.github.com/ | head -25
HTTP/1.1 301 Moved Permanently
Server: nginx/1.19.5
Date: Tue, 29 Mar 2022 21:32:42 GMT
Content-Type: text/html
Content-Length: 162
Connection: keep-alive
x-archive-orig-connection: keep-alive
x-archive-orig-content-length: 162
x-archive-orig-server: GitHub.com
location: https://web.archive.org/web/20210327032732/http://aplpy.github.io/
x-archive-orig-x-github-request-id: 5F6C:49AA:10BCA9:1DB3E7:605EA623
x-archive-orig-accept-ranges: bytes
x-archive-orig-date: Sat, 27 Mar 2021 03:27:32 GMT
x-archive-orig-via: 1.1 varnish
x-archive-orig-age: 0
x-archive-orig-x-served-by: cache-sjc10072-SJC
x-archive-orig-x-cache: MISS
x-archive-orig-x-cache-hits: 0
x-archive-orig-x-timer: S1616815652.059438,VS0,VE22
x-archive-orig-vary: Accept-Encoding
x-archive-orig-x-fastly-request-id: 212700e23721274eb9d9b595922151667aee92c7
cache-control: max-age=1800
memento-datetime: Sat, 27 Mar 2021 03:27:32 GMT
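
The archived URLs used in the curl commands above follow the pattern web.archive.org/web/<14-digit timestamp>/<original URL>, and the timestamp corresponds to the memento-datetime header in the responses. The sketch below splits such a URL and formats the timestamp as an HTTP date. The names split_wayback_url and http_date are hypothetical, and the pattern is deliberately simplified: it assumes a plain 14-digit timestamp with no replay modifier after it.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Simplified pattern: 14-digit timestamp (YYYYMMDDhhmmss), then the original URL.
URI_M = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_wayback_url(uri_m: str):
    """Return (capture datetime in UTC, original URL) for a memento URL."""
    match = URI_M.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognized memento URL: {uri_m}")
    timestamp, original = match.groups()
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return captured, original


def http_date(dt: datetime) -> str:
    """Format an aware UTC datetime the way HTTP headers do."""
    return format_datetime(dt, usegmt=True)


captured, original = split_wayback_url(
    "https://web.archive.org/web/20220223200804/http://aplpy.github.com/"
)
print(original)             # http://aplpy.github.com/
print(http_date(captured))  # Wed, 23 Feb 2022 20:08:04 GMT
```

The printed date matches the memento-datetime header of the latest memento shown earlier, which is a quick way to confirm which capture a given archived URL refers to.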

While GitHub is still incredibly popular, the content stored in GitHub is not guaranteed to be permanent. Web archiving is a way of preserving what is otherwise ephemeral. 



