2021-10-14: Summer Internship at LANL

Last summer, I had an opportunity to work as a research intern at the Los Alamos National Laboratory (LANL) with the Research and Prototyping (Proto) team of the LANL Research Library. The position was my first internship in the US, and I am extremely grateful for the opportunity. Due to the pandemic, I worked remotely from Norfolk. I worked on the Memento validator project, updating the memento validation and compliance testing infrastructure under Dr. Lyudmila Balakireva.

Los Alamos National Laboratory (https://www.lanl.gov/)

About LANL

Los Alamos National Laboratory (LANL) is a Federally Funded Research and Development Center (FFRDC) that addresses national security challenges through scientific research and development. The research areas of the laboratory include nuclear security, national defense, energy, counter-terrorism, and the environment. The laboratory hires approximately 2000 students each year to work on scientific research and development projects as part of its summer internship program. During the internships, the students work on a project to gain experience in research and development.

The library's research and development arm pursues research on various aspects of information infrastructure in the digital age. The Library Research and Prototyping team (Proto team) under the Research Library explores aspects of scholarly communication, covering infrastructure, interoperability, and persistence. One of the team's critical contributions is standardizing and building the infrastructure for Mementos in web archiving. The Memento Framework enables access to prior versions of digital resources, serving both the persistence and the interoperability of information.

Project

During the internship, I worked on improving the Memento validator applications. The Memento validator tests the compliance of web resources with the Memento specification. The application is intended for use by the community, such as web archives, researchers, librarians, and general users, with different levels of technical expertise.

Memento Protocol

The Memento protocol serves as the framework for providing time-based access to resource states over the Hypertext Transfer Protocol (HTTP). Using the Memento protocol, the user can negotiate content in the time dimension, such as requesting a resource as it existed at a specific point in time. The Memento protocol defines four types of resources:

Original Resource (URI-R): A web resource for which we want to find a prior version.
Memento (URI-M): A web resource that represents the prior state of the Original Resource.
TimeGate (URI-G): A resource that provides access to previous states of the resource (i.e., Mementos) using datetime negotiation.
TimeMap (URI-T): A resource that serializes all Mementos for an Original Resource.

The Memento Framework specifies the expected behavior of each type of resource and how the resources need to be interconnected.
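To make datetime negotiation concrete, here is a minimal sketch of the request header a client would send to a TimeGate. The Accept-Datetime header name comes from the Memento protocol; the helper name accept_datetime_headers is made up for this illustration and is not part of the validator.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(moment: datetime) -> dict:
    """Build the header a client sends to a TimeGate (URI-G).

    Hypothetical helper for illustration only.
    """
    # The protocol expresses datetimes in the HTTP date format, in GMT.
    as_gmt = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_gmt, usegmt=True)}


headers = accept_datetime_headers(datetime(2021, 10, 14, tzinfo=timezone.utc))
print(headers)  # {'Accept-Datetime': 'Thu, 14 Oct 2021 00:00:00 GMT'}
```

A client would attach these headers to a GET or HEAD request for the URI-G, and the TimeGate would respond by pointing to the Memento closest to the requested datetime.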

Memento Validator

A Memento validator is an application that tests the Memento implementation of a resource. A validator does this by performing a series of tests and reporting the results. For example, testing the validity of the URI is the first step of a validator application. If the URI is valid, the validator continues with the remaining tests. If it is not, the validator responds to the user by describing the error in the URI.
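The flow described above (check the URI first, and only then run the remaining tests) can be sketched roughly as follows. This is not the actual validator code: CheckResult, check_uri, validate, and the placeholder test are names invented for this sketch, and the URI check is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlsplit


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def check_uri(uri: str) -> CheckResult:
    """First test: the URI must be an absolute HTTP(S) URI with a host."""
    try:
        parts = urlsplit(uri)
    except ValueError as error:  # e.g. a malformed IPv6 host
        return CheckResult("uri-valid", False, str(error))
    if parts.scheme not in ("http", "https"):
        return CheckResult("uri-valid", False, f"unsupported scheme in {uri!r}")
    if not parts.netloc:
        return CheckResult("uri-valid", False, f"missing host in {uri!r}")
    return CheckResult("uri-valid", True)


def validate(uri: str, later_tests: List[Callable[[str], CheckResult]]) -> List[CheckResult]:
    """Run the URI check first; run the other tests only if it passes."""
    first = check_uri(uri)
    if not first.passed:
        return [first]  # report the URI error and stop
    return [first] + [test(uri) for test in later_tests]


def placeholder_test(uri: str) -> CheckResult:
    """Stand-in for a real compliance test (assumption for the example)."""
    return CheckResult("placeholder", True)


for result in validate("https://example.org/", [placeholder_test]):
    print(result)
for result in validate("not a uri", [placeholder_test]):
    print(result)
```

Stopping at the first failed URI check keeps the report focused: the later tests would all fail for the same underlying reason.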

Errors such as invalid URIs can hinder traversal in the time dimension. For instance, an invalid TimeGate can prevent us from discovering prior versions of a particular resource. Testing and validating resources for their compliance with the Memento Framework is therefore essential. Validators help in this process by identifying the issues with our implementation.

In short, Memento validators play a vital role in keeping time travel on the web working smoothly.

Existing Validators

When I started my internship, LANL had two validator applications, the web validator and the daily validator. The web validator offered a web interface through which a user could test the Memento compliance of a URI. The daily validator was an internal application that reported on the Memento compliance of popular web archives. However, each application had its limitations and needed updates.

My task during the project was to improve the Memento validator applications by addressing weaknesses and adding new features. For this, I first had to go through the existing applications and identify their shortcomings. Based on the initial analysis of the existing validator source, I found that both applications had limited modularity. Further, the applications used Python 2.7, which is no longer supported by the Python Software Foundation and the community. Additionally, although both applications provided the same functionality, they did not share a codebase, which would have helped with long-term maintenance.

Memento validator logo

New Memento Validator

Based on these findings, I decided to restructure the validator applications around a shared codebase and shared resources. As a solution, I decided to develop a core library for performing Memento validation. The core library provides functionality at different levels of granularity:

It includes validation at the resource level, such as validating an Original Resource (URI-R) for compliance with the URI-R specifications.
It includes validation at the attribute level, such as validating that a URI-R has a valid TimeGate in the Link header.

An application will consume the core library at the desired level and perform accordingly at the application level. In this manner, the web validator and the daily validator will utilize the core library. As a result, the codebase will be shared and modular, with easier maintenance.
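The two levels of granularity could look something like the sketch below. It is only an illustration of the idea, not the library's real interface: the function names, the report keys, and the choice to model a parsed Link header as a dictionary from relation type to target URIs are all assumptions made for this example.

```python
from typing import Dict, List
from urllib.parse import urlsplit

# Assumed model of a parsed Link header: relation type -> target URIs.
Links = Dict[str, List[str]]


def timegate_present(links: Links) -> bool:
    """Attribute level: is there at least one rel="timegate" link?"""
    return bool(links.get("timegate"))


def timegate_absolute(links: Links) -> bool:
    """Attribute level: every TimeGate target is an absolute HTTP(S) URI."""
    targets = links.get("timegate", [])
    return bool(targets) and all(
        urlsplit(t).scheme in ("http", "https") and bool(urlsplit(t).netloc)
        for t in targets
    )


def validate_original_resource(links: Links) -> Dict[str, bool]:
    """Resource level: run the attribute checks for a URI-R and collect them."""
    return {
        "timegate-present": timegate_present(links),
        "timegate-absolute": timegate_absolute(links),
    }


report = validate_original_resource(
    {"timegate": ["https://example.org/timegate/https://example.org/"]}
)
print(report)
```

An application that only needs a pass/fail answer can call the resource-level function, while one that wants to explain a failure can call the individual attribute checks.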

Further, with the core library's help, we can also generate automated test scripts along with a unit testing library. Through automation, we can speed up development and testing processes.

Upon presenting my findings and the proposal, I got additional ideas from my supervisor to further improve validator applications. Some of them were:

Use the configuration data from the Memento Aggregator when developing the daily validator.
Compile a summary report for the daily validator.
Add containerization to the application with Docker.

Because of the new design, the suggestions required changes only at the application level.

Implementation

Once we finalized the plan, I moved on to coding the application. To ensure long-term language support, I decided to use Python 3.8. Furthermore, to generate comprehensive automated documentation, I used Python type hints and Sphinx.

For building the web validator, I decided to decouple the user interface by introducing an HTTP Application Programming Interface (API). The web front-end (user interface) consumes the HTTP API. In this manner, we have an additional interface for automating tests. To develop the web API, I used the Python Flask framework and the resource-level validation test results from the library. For the front-end user interface, I created a single-page application using HTML and TypeScript. For the daily validator, I developed a Python script that can be executed daily using a cron job.

In addition to the mentioned applications, I developed a command-line application that can validate a given resource, helpful for testing using command-line interfaces.

Memento validator command-line interface

Deployment

At the end of the project, I deployed the web validator application and generated documentation to Memento lab LANL servers. The web validator application is available through http://labs.mementoweb.org/validator/app/ and the documentation through http://labs.mementoweb.org/validator/docs/. In addition, I created a user manual describing the library, process flows, and deployment techniques. We plan to launch a public release of the application after LANL approval.


Updated Memento validator (Web validator)

Side Project

Most aspects of Memento validation are associated with the Link header. As a result, I got an opportunity to explore several techniques for parsing the Link header used in popular applications. I put together a simple Python library that includes a selected set of these parsing procedures. Since some of the techniques were implemented in other languages, such as Java and Go, I had to learn the syntax and methods of those programming languages. We also plan a public release of the link parser library upon approval from LANL.
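To give a feel for what parsing a Link header involves, here is a small regex-based sketch. It is not one of the techniques from the library and it does not cover the full Link header grammar; the function name parse_link_header and the example URIs are made up. The main point it shows is that quoted values such as datetime contain commas, so the header cannot simply be split on commas.

```python
import re
from typing import Dict, List, Tuple

# One parameter: ;name or ;name=value, where value is quoted or a bare token.
_PARAM = r'\s*;\s*[\w*-]+\s*(?:=\s*(?:"[^"]*"|[^\s;,"]+))?'
# One link: <target> followed by zero or more parameters.
_LINK = re.compile(r'<([^>]*)>((?:' + _PARAM + r')*)')
_PARAM_PARTS = re.compile(r';\s*([\w*-]+)\s*(?:=\s*(?:"([^"]*)"|([^\s;,"]+)))?')


def parse_link_header(value: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (target URI, parameters) pairs found in a Link header value."""
    links = []
    for target, raw_params in _LINK.findall(value):
        params = {}
        for name, quoted, bare in _PARAM_PARTS.findall(raw_params):
            # Exactly one of quoted/bare is non-empty for a given parameter.
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


header = (
    '<https://example.org/>; rel="original", '
    '<https://example.org/timegate/>; rel=timegate, '
    '<https://example.org/m/1>; rel="memento"; '
    'datetime="Thu, 14 Oct 2021 00:00:00 GMT"'
)
for target, params in parse_link_header(header):
    print(target, params)
```

A rel value can hold several space-separated relation types (for example "first memento"); this sketch leaves such values as a single string for the caller to split.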

Future Directions

At the time of writing, neither the Memento validator package nor the link parser library is publicly available. However, the web validator and the HTTP API are publicly accessible. Upon approval from LANL, we plan to open-source the validator library and the application, which will provide users with additional ways of testing Memento compliance. Furthermore, it will enable a broader community to contribute to and improve the application through testing, development, and documentation.

Team and Meetings

Throughout the program, I had weekly meetings with Dr. Balakireva and bi-weekly team meetings with the Proto team. During the one-on-one meetings, we discussed the progress of the project and set goals for the next meeting. The team meetings were for team-wide discussions, such as announcements. We also briefly discussed our projects there, which made the meetings an excellent opportunity for alpha testing the applications we developed.

The team culture combines innovation and collaboration, with open and honest feedback among team members and the freedom to experiment. Within the team, we relied on trust between members rather than micro-management to achieve our overall goals.

Events

Even as a teleworking member, I had plenty of opportunities to broaden my knowledge through various events at the laboratory. These sessions discussed topics such as machine learning and high-performance computing from different perspectives. Further, there were laboratory-wide symposiums, where other team members presented their projects.

In the latter part of the internship, we also got the opportunity to present our projects at the Mini Summer Student Symposium (MiSuSup) organized by the Research Library. In the symposium, I discussed the importance of the Memento Framework, compliance, and the role of the validator. Further, I showcased a few features of the new Memento validator.

Me presenting Memento validator at MiSuSup 2021

Final thoughts

The internship provided me with an excellent learning opportunity to improve my understanding of the Memento Framework and the importance of compliance. Further, I gained valuable experience in building compliance toolsets and in avoiding common pitfalls. Through the project, I also gained experience in the end-to-end development of an application and in helpful tools such as Sphinx. Overall, I firmly believe that the internship provided me with invaluable experience in research and development, which will help strengthen my research projects.

I want to thank my Ph.D. advisor, Dr. Sampath Jayarathna, for encouraging me to apply for an internship at LANL. I'm also grateful to my internship supervisor, Dr. Luda Balakireva, who helped me immensely during the project. Further, I would like to extend my gratitude to Dr. Martin Klein, who recommended me for this position and guided the project. I am honored and thankful to have worked with the Proto team and the diverse, multi-disciplinary team at LANL as a Research Intern.

-- Bhanuka Mahanama (@mahanama94)

Web Science and Digital Libraries Research Group