2019-12-21: Preserving Open Source Software with GitHub's Arctic Vault

Source: Techworm

GitHub is used by more than 40 million developers and currently hosts more than 100 million repositories. In early November 2019, GitHub shared plans to open the Arctic Code Vault, an effort to store and preserve open source software like Flutter and TensorFlow. With this endeavor, code for all open source projects will be stored on specialized ultra-durable 3,500-foot film with frames that include 8.8 million pixels each, designed to last 1,000 years. The data can be read by a computer or a human with a magnifying glass in case of a global power outage.
"Our primary mission is to preserve open source software for future generations. We also intend the GitHub Archive Program to serve as a testament to the importance of the open source community. It’s our hope that it will, both now and in the future, further publicize the worldwide open source movement; contribute to greater adoption of open source and open data policies worldwide; and encourage long-term thinking." (Excerpt: GitHub Archive Program website)
GitHub is partnering with the Stanford Libraries, the Long Now Foundation, the Internet Archive, the Software Heritage Foundation, Piql, Microsoft Research, and Oxford University's Bodleian Library to preserve the world’s open source code. These partners represent the warm and cold tiers in the pace layer strategy GitHub has adopted for archiving code. Each institution provides redundancy by storing multiple copies across various data formats and locations, including a very-long-term archive called the GitHub Arctic Code Vault.

The Arctic World Archive
The Arctic World Archive (AWA), located in Svalbard, Norway, is a vault intended to preserve the world’s digital heritage and make it available for future generations. The AWA already holds art collections, the Vatican’s 1,500-year-old manuscripts, and even film clips of the Brazilian football player Pelé. In collaboration with its clients, the AWA determines whether to store content in a digital format or in a visual format so that text and images are human readable. The open source code on GitHub will be maintained in a decommissioned coal mine repurposed for the AWA. Archivists believe that cold, near-constant conditions help preserve film. Svalbard is one of the northernmost inhabited places on Earth, and its permafrost can extend hundreds of meters below the surface. While Svalbard is affected by climate change, warming is likely to affect only the outermost few meters of permafrost in the foreseeable future and is not expected to threaten the stability of the mine. The AWA caters to any country, institution, or company in need of ultra-secure storage and was likely chosen by GitHub because:
  • Svalbard is a Norwegian archipelago situated approximately 1,300 kilometers from the North Pole, essentially out of reach for most cyber attacks.
  • The Svalbard Treaty, signed in 1920, declares the area to be a demilitarized zone (DMZ) with no military activity, ensuring the data will not become a casualty in a military conflict. 43 nations, including the United States, Russia, and China, signed the treaty.
  • Its unique location and its geopolitical and climatic stability make it a suitable place for safe long-term storage. No electricity or other human intervention is needed, as the climatic conditions in the Arctic are ideal for long-term archival of film.

Contents of the 2020 Snapshot
Earlier in 2019, thousands of popular GitHub projects like Blockchain, WordPress, and programming languages like Rust and Ruby were added to the archive. Next year, the Arctic Code Vault will be extended to include all public GitHub repositories. The first snapshot will take place on February 2, 2020. For anyone with an active repository, the associated code will automatically be included in the snapshot. The snapshot will consist of the default branch of each repository, excluding any binaries larger than 100 KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data. The snapshot will also include "significant dormant repositories as determined by stars, dependencies, and an advisory panel", according to GitHub. The advisory panel will include experts from a range of fields, including anthropology, archaeology, history, linguistics, archival science, and futurism. Currently, the Archive Program Advisors include:
Over time, GitHub will develop a cadence to store code once a year or every two years, and a way for open source projects to retrieve code, but those processes are still being developed. The Frequently Asked Questions (FAQs) state that GitHub plans to evaluate the archived film reels and their current state every five years. The current film technology used in archiving, developed by Piql, is coated in iron oxide powder. This medium has a lifespan of 500 years as measured by the International Organization for Standardization (ISO); simulated aging tests indicate Piql’s film will last twice as long. Depending upon GitHub's evaluation results, another snapshot may be taken and archived in the cold storage facility. However, this is neither guaranteed nor yet decided.
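The packaging rules described for the snapshot (one TAR file per repository, with binaries larger than 100 KB left out) can be sketched roughly as follows. This is only an illustration under stated assumptions, not GitHub's actual tooling: the function names are hypothetical, "100 KB" is read as 100 × 1024 bytes, a file is treated as binary if a NUL byte appears near its start, and the directory passed in is assumed to be a checkout of the default branch.

```python
import os
import tarfile
from typing import List

# Assumption: "100 KB" is interpreted as 100 * 1024 bytes.
MAX_BINARY_BYTES = 100 * 1024
# Assumption: only the first few thousand bytes are inspected.
SNIFF_BYTES = 8000


def looks_binary(head: bytes) -> bool:
    """Heuristic (an assumption, not GitHub's rule): NUL byte means binary."""
    return b"\0" in head


def package_snapshot(repo_dir: str, tar_path: str) -> List[str]:
    """Write a single TAR for repo_dir, skipping large binaries.

    Returns the repository-relative paths that were archived.
    """
    archived = []
    with tarfile.open(tar_path, "w") as tar:
        for root, dirs, files in os.walk(repo_dir):
            # Skip version-control metadata and keep traversal order stable.
            dirs[:] = sorted(d for d in dirs if d != ".git")
            for name in sorted(files):
                full = os.path.join(root, name)
                with open(full, "rb") as fh:
                    head = fh.read(SNIFF_BYTES)
                too_big = os.path.getsize(full) > MAX_BINARY_BYTES
                if looks_binary(head) and too_big:
                    continue  # excluded: binary larger than the limit
                rel = os.path.relpath(full, repo_dir)
                tar.add(full, arcname=rel)
                archived.append(rel)
    return archived
```

Calling `package_snapshot("my-repo", "my-repo.tar")` on a local checkout (both paths are placeholders) would produce one archive containing all text files and any binaries at or under the size limit. The QR encoding step mentioned in the article is not covered here.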

Future Proofing Open Source Code
What will software look like 1,000 years from now? Developers and archivists can only guess. Meanwhile, GitHub is working to ensure that today’s most important building blocks make it to tomorrow. Much of today's technology is powered by open source software. It’s a hidden cornerstone and shared foundation for future development efforts. The mission of the GitHub Archive Program is to preserve that legacy for generations to come.
"There is a long history of lost technologies from which the world would have benefited, as well as abandoned technologies which found unexpected new uses, from Roman concrete, or the anti-malarial DFDT, to the hunt for mothballed Saturn V blueprints after the Challenger disaster. It is easy to envision a future in which today’s software is seen as a quaint and long-forgotten irrelevancy until an unexpected need for it arises. Like any backup, the GitHub Archive Program is also intended for currently unforeseeable futures as well." (Excerpt: GitHub Archive Program website)

Besides the Archive Program, GitHub is also working on Microsoft’s Project Silica to archive all active public repositories for over 10,000 years by writing them into quartz glass platters using a femtosecond laser. For anyone who wants to safeguard their own code in the GitHub Arctic Code Vault, there's still time to do so. On the other hand, GitHub will only archive public repositories, so opting out is as simple as making your repository private, which is a free feature for all users. One point to note about GitHub's archival plan is that code depends on underlying infrastructure to run (e.g., hardware, supporting libraries, assembly language, compilers). It's unknown whether GitHub will also include these elements in the AWA. As an alternative, the archive will include a Tech Tree that provides an overview of the archive and how to use it. The Tech Tree will serve as a quick start manual on software development and computing, bundled with a user guide for the archive. The archive will also include information and guidance for applying open source, with context for how we use it today, in case future readers need to rebuild technologies from scratch. Answers to other common questions can be found in the GitHub Archive Program FAQs.

The Case for Cold Storage?
While the digital preservation of open source code may be culturally significant, David Rosenthal's "Seeds or Code?" blog post offers a slightly different perspective on GitHub's endeavor. He contends the AWA initiative is a publicity stunt conceived by Microsoft, which acquired GitHub in 2018. Rosenthal questions whether the AWA would rank high on the list of basic necessities if the world were sufficiently devastated. Further, he draws a comparison to other scientific projects that are inspiring in their intent but, in his opinion, similarly lack practical use cases. For emphasis, Rosenthal draws attention to these projects:

  • Clock of the Long Now, which ticks once per year and is designed to accurately keep time for 10,000 years.
  • Voyager Golden Records, which contain Earth sounds and images intended for any intelligent extraterrestrial life form that may find them.

The AWA itself is a cold storage facility that developers may never need until the day comes that they do. Currently, many organizations rely on open source software, which means the AWA may represent a crucial building block in any restorative process. However, GitHub hosts both high-quality and poorly designed projects, with no visible means to distinguish between them. As a result, the AWA will house code in both categories, which may present a challenge to the future of software development. While Rosenthal discounts the media hype surrounding the AWA, he readily acknowledges the importance of the software archives and accessible pace layer partnerships that increase awareness of digital preservation overall.

--Corren McCoy (@correnmccoy)

GitHub Archive Program, Preserving open source software for future generations, N.D., Retrieved from https://archiveprogram.github.com/ on 01-December-2019.