2025-02-11: Getting to the Source of the (Memento) Damage


    I've previously written about the Memento Damage project, originally started by Dr. Justin Brunelle: a Web service designed to estimate the amount of damage to a web archive by assessing its missing resources. Previously, while working on the Memento Tracer project, funded by the Alfred P. Sloan Foundation, I adapted parts of the project to give special consideration to the damage weighting for Web-hosted repository pages.
I have been making further updates to the Memento Damage project over the course of this year that help improve this analysis and damage estimation. The most prominent is a secondary crawler component for analyzing an archived repository and its source tree. Web-hosted Git repositories live on centralized Web platforms, the largest being GitHub, along with other major platforms such as GitLab, Bitbucket, and SourceForge. The source files for a Git project are stored "behind the scenes" on these platforms, and although visitors can still access and directly download the code (in most cases), the primary means of interacting with a repository's files and structure is through the varied HTML representations generated by each Web host. The first and most prominent of these is a project's home, or landing, page. This page often displays a table of the root of the project's source tree, surrounded by metadata such as a project description, a list of contributors, links to a project-specific issue discussion board, a rendering of the project's README file, and more, as shown in Figure 1.

Fig. 1: Screenshot of the out-of-development PhantomJS project's GitHub home page.

    Previous work by Emily Escamilla examined scholarly references to GitHub repositories, highlighting both their importance to academic research and the need to preserve them. Her work found that one out of five arXiv articles cited a GitHub repository. This shows the rising popularity of software in academic research, but her work also showed that as repository URIs shift, the immutable references to them in published papers are left stale and invalid. Software Heritage is an institution seeking to preserve such Web-hosted Git repositories, but its focus does not encompass a hosted Git project in its entirety. As the highlighted sections in Figure 1 show, Software Heritage focuses exclusively on the source tree itself, highlighted in blue. This exclusivity can leave out important external information about the repository, such as related forks, issue discussions, and pending code changes held in pull or merge requests.

    Traditionally, the Memento Damage project has focused on estimating the damage to individual archived HTML pages directly (though the project also works on any live HTML Web page). With archived software repositories, each source file, and even each source directory index, is represented by its own HTML page. These pages generally include an HTML-formatted representation of the source code itself, with tags and styling behind the scenes to render the code properly, in addition to extra information surrounding it on the page, such as miscellaneous links, file metadata, and the Web host's own header and footer. Depending on the Web host, there is often a link directly to a raw text version of the source code as well, though in my research I found that if a source page is archived in the Wayback Machine, its raw counterpart is rarely archived alongside it.
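As a rough illustration of how this can be checked, the sketch below queries the Wayback Machine's CDX API for both a source file's HTML page and its raw counterpart; the repository and file paths are hypothetical examples.

```python
# A minimal sketch of checking, via the Wayback Machine's CDX API, whether a
# source file's HTML page and its raw counterpart are both archived. The
# example repository and file paths are hypothetical.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def is_archived(uri: str) -> bool:
    """Return True if the Wayback Machine holds at least one 200 capture of uri."""
    resp = requests.get(CDX_ENDPOINT, params={
        "url": uri,
        "output": "json",
        "filter": "statuscode:200",
        "limit": 1,
    })
    resp.raise_for_status()
    rows = resp.json() if resp.text.strip() else []
    return len(rows) > 1  # the first row of a JSON CDX response is the header

html_uri = "https://github.com/example/project/blob/main/src/app.py"
raw_uri = "https://raw.githubusercontent.com/example/project/main/src/app.py"

print("HTML page archived:", is_archived(html_uri))
print("Raw file archived: ", is_archived(raw_uri))
```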



Both Web sites and source repositories are interlinked networks of entities. In the case of a repository, though, these connections are much more tightly coupled. While a Web site can retain its overall functionality should its FAQ or a blog post's page go missing, the same is not necessarily true of a software project. Should the Web page of a source file fail to be archived and available, the entire archived copy of that source project might be invalidated.

    With previous work on the Memento Tracer project, we started to adapt the Memento Damage project to specialize in the analysis of software projects. I have expanded this work further by creating a dedicated crawler, apart from Memento Damage's main crawler, that focuses on exploring the source code of a hosted software project. To this point, the Memento Damage project has estimated damage in situ, without the need for contextual reference to an original live page. While this still holds for the analysis of individual pages, it is less effective for measuring a repository's structure. Without context, we can examine the archived directory index pages for a repository, as well as the import statements of retrieved source files, to help build an estimate of the repository's structure, but this is limited by the availability of archived source files. Due to the dynamic nature and structure of Web-hosted software repositories, crawlers can have a difficult time indexing all source files, leaving gaps in availability. Our findings show that as the number of files in a repository increases, the archived percentage tends to decrease, as shown in Figure 2 below for over 5,400 analyzed repositories.

Fig. 2: Total number of source files per repository, on a logarithmic scale, compared to the percentage of source file pages available in the Wayback Machine.
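To make the import-statement idea above concrete, below is a minimal sketch, not the project's actual crawler code, of inferring candidate gaps in a partially archived repository from the imports of its recovered Python files; the sample file contents are hypothetical.

```python
# A minimal sketch (not the project's actual crawler) of estimating gaps in a
# partially archived repository from the import statements of recovered
# Python source files. The sample file contents below are hypothetical.
import ast
import sys

def top_level_imports(source: str) -> set[str]:
    """Collect the top-level module names imported by a Python source file."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

# Hypothetical contents of two source files recovered from the archive.
recovered = {
    "app.py": "import utils\nfrom models import User\n",
    "models.py": "import datetime\n",
}

known = {name.removesuffix(".py") for name in recovered}
imported = set().union(*(top_level_imports(src) for src in recovered.values()))

# Modules the recovered code references but the archive lacks are candidate
# gaps in the repository's source tree (standard library modules excluded).
print("possibly missing:", imported - known - set(sys.stdlib_module_names))
```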


    This problem isn't simply due to the number of files, but also partially due to the structure of source repositories. As most crawlers and Web archiving indexers do not habitually crawl the entire depth of a target website, the nested directory structure of source repositories makes it easy for a crawler to miss source file Web pages. While these pages are readily linked, if crawlers do not follow the links down the source tree, the pages will never be available in the archives. Additionally, while each linked source file is presented via its HTML representation, it also has a secondary "raw" URI, shown below in Figure 3, containing a plaintext version of the source file. We found that these raw URIs are rarely archived, as they are often served from a different URI host and/or a differently patterned URI path, and are only linked from a source file's HTML page or via API.

Fig. 3: A source file's secondary "raw" URI, which serves a plaintext version of the file.


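To illustrate why the two representations rarely share archive coverage, below is a minimal sketch of the GitHub-specific mapping from a source file's HTML "blob" URI to its raw counterpart; other hosts follow their own patterns (GitLab, for example, keeps the same host but swaps a path segment).

```python
# A minimal sketch of GitHub's mapping from a source file's HTML ("blob") URI
# to its raw counterpart: the raw file is served from a different host with a
# different path pattern, so archiving one does not imply archiving the other.
def github_raw_uri(blob_uri: str) -> str:
    # https://github.com/{owner}/{repo}/blob/{ref}/{path}
    #   -> https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
    prefix = "https://github.com/"
    if not blob_uri.startswith(prefix):
        raise ValueError("not a github.com URI")
    owner, repo, marker, rest = blob_uri[len(prefix):].split("/", 3)
    if marker != "blob":
        raise ValueError("not a blob (HTML source file) URI")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"

print(github_raw_uri("https://github.com/ariya/phantomjs/blob/master/README.md"))
# https://raw.githubusercontent.com/ariya/phantomjs/master/README.md
```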
Most crawlers visit, or are triggered from, a hosted repository's landing page, which might be the only page that gets archived. This leads to the skewed availability shown in Figure 4, where the top-level source files are the most available, with a sharp drop-off at subsequent levels. Note that Figure 4 depicts the availability of only those repositories with some level of availability. If all repositories were included in the chart, even those with no availability at all, the percentage of root-level source Web pages available would drop to around 15.33%.

Fig. 4: Availability of source file HTML representations for archived repositories with non-zero availability.
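As a simple illustration of how such per-level availability can be computed, here is a minimal sketch that buckets a repository's source file pages by their depth in the source tree; the file listing and archived set are hypothetical.

```python
# A minimal sketch of bucketing a repository's source file pages by depth in
# the source tree, mirroring the per-level availability of Figure 4. The file
# listing and the set of archived paths below are hypothetical.
from collections import Counter

def depth(path: str) -> int:
    """Depth of a file below the repository root (root-level files are 0)."""
    return path.strip("/").count("/")

all_files = ["README.md", "setup.py", "src/app.py", "src/util/io.py"]
archived = {"README.md", "setup.py", "src/app.py"}

total, available = Counter(), Counter()
for f in all_files:
    total[depth(f)] += 1
    available[depth(f)] += f in archived

for level in sorted(total):
    print(f"depth {level}: {available[level] / total[level]:.0%} archived")
```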


    Better archiving methods, or at least specialized ones that account for hosted source repositories, are needed to address this issue of archived source availability. Even so, this might not help repositories that are only partially archived and no longer exist on the live Web to be recrawled. One approach in this case might be to combine the source code preserved by Software Heritage with any pages preserved by the Internet Archive that Software Heritage does not cover, forming a holistic view of the archived repository.

    A future direction we are considering is a tool to help rebuild incomplete archived source projects. Machine learning is one obvious aid for this task: given a set of partial source files for an archived project, can we recreate the missing pieces using the available source code and the context held within it, as well as the other related archived project pages? Function and source file names and import statements can provide clues to the underlying structure or purpose of missing source code, but other project pages can help as well. An archived issue page, for instance, can shed light both on source code not available in the archives and on problems in the existing code, either of which would aid in attempting to rebuild the project. We hope to generate as faithful a recreation as possible, but functionality is the main goal, and in many cases synthesized solutions might be inexact or only partially complete. We will post more on this in the future as our research develops.

    As a final side note, we have also been working to optimize the project with UI updates and with changes that use Microsoft's Playwright library, lessening certain project dependencies and allowing users to run the project with underlying browsers other than Chromium. We will post more about these on the project's GitHub page as well as in future blog posts. We welcome feature requests from users, which can be submitted as an issue ticket.
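For example, Playwright's Python API exposes all three of its browser engines behind a uniform interface, which is what makes swapping browsers straightforward; the snippet below is a minimal sketch rather than the project's actual integration code.

```python
# A minimal sketch (not the project's actual integration code) showing how
# Playwright exposes its browser engines behind one interface, letting the
# underlying browser be swapped with a single argument.
from playwright.sync_api import sync_playwright

def render(uri: str, engine: str = "chromium") -> str:
    """Render uri with the chosen engine and return the final page HTML."""
    with sync_playwright() as p:
        browser = getattr(p, engine).launch()  # chromium, firefox, or webkit
        page = browser.new_page()
        page.goto(uri)
        html = page.content()
        browser.close()
    return html

# Analyze the same page under a non-Chromium engine.
html = render("https://github.com/ariya/phantomjs", engine="firefox")
```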

- David