2020-12-31: A Doctoral Degree Takes the Lifetime of a Bird and Then Some


It was a fine afternoon, and I had just finished my weekly meeting with my advisor, Michael L. Nelson, in which we discussed a few potential topics for my doctoral research after the completion of my master's degree. I dropped my meeting notes on the desk of my cubicle and went downstairs for a walk on the Old Dominion University (ODU) campus. The weather was pleasant and sunny, so I sat on a bench near a small pond behind the Computer Science building to enjoy the fountain in the pond and to watch turtles sunbathing, a few small ones stacked on the backs of the big ones. I noticed a long-stemmed grass around the edge of the pond that reminded me of a grass we used for crafting in my village in India during my childhood. I picked a few stems, took them upstairs, washed them, wove them into a bird, and hung the bird on the wall of my cubicle. Later, it was moved to the Web Science and Digital Libraries (WSDL) Research Group's PhD Crush board, where it dried up, turned gray, and lasted for years. This grass bird served as my silent companion during the late nights and weekends I spent in the WSDL lab wrangling data, writing papers, and typesetting my dissertation.


Dissertation Defense

On December 4, 2020, I defended my dissertation on a Zoom call (thanks to the pandemic) with poor lighting and an awkward camera angle. On a positive note, a presentation on a video call allowed my family members, friends, and colleagues to participate in the event from around the globe along with the following committee members:

Below are the recording of my dissertation defense, which includes both the presentation and the public Q&A, and the slides of my talk.



After the committee members digitally signed the necessary papers, my advisor announced the success publicly.



Usually, it would have been followed by a dissertation defense feast in the WSDL lab, but the pandemic prevented this tradition from happening (let's call it "pending"). The pandemic also postponed the commencement ceremony, which was later celebrated on May 6, 2021, in an unusual manner in the S.B. Ballard Stadium, with social distancing, instead of the usual Ted Constant Convocation Center. Dr. Michele C. Weigle attended the ceremony with me, but many traditions were cut short, so we did not share the stage. We did make sure that my avatar on the WSDL PhD Crush board was advanced to the Alumni cloud.



My dissertation is publicly accessible from ODU Digital Commons.

Sawood Alam. "MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing". Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, 2020. DOI:10.25777/5vnk-s536


MementoMap Framework for Web Archive Profiling

My research introduced a web archive profiling framework called "MementoMap". With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in Memento aggregators. A memento is a past version of a web page, and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, a Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). The MementoMap framework provides a flexible and efficient means to learn about and express the Holdings and/or Voids of a web archive.

Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). Our research explores strategies in between these two extremes. Our web archive profiling work was initially funded by the International Internet Preservation Consortium (IIPC), which I continued as my dissertation topic.

My dissertation addresses the following three primary research questions related to profiling web archives:

  • RQ1: How to learn about the Holdings and Voids of an archive?
  • RQ2: How to build an archive profile that will best summarize an archive's Holdings/Voids and allow for dissemination and exchange?
  • RQ3: How to utilize archive profiles for the routing of URI lookup requests?

We ran MemGator, a Memento aggregator we developed, at ODU for over three years and collected access logs. We then acquired the complete set of CDX (Capture Index) files from the Arquivo.pt web archive and cross-checked the two to see what web archives are archiving and what users are looking for. We learned that a significant number of artifacts that are archived frequently are rarely accessed, and a significant number of resources that users are interested in are not present in web archives. We also realized that neither profiling a web archive for its holdings (content-based profiling) nor profiling what users are trying to access (usage-based profiling) tells the complete story. This means that an ideal profile would be application dependent. Moreover, not every kind of data that can be used for profiling a web archive is accessible to everyone. Hence, the MementoMap framework is kept generic and flexible: we explored various possibilities and described ways to profile archives for specific applications based on the available data. These approaches and their applications have their own pros and cons as well as associated efficiency metrics.


We also looked at the cross-archive overlap of four sets of one million sample URIs each in three different web archives of varying sizes. We found that not many lookup URIs were present in any archive, and even fewer URIs were present in more than one archive. This means that aggregating results from various web archives would have an additive advantage, yielding better discovery of mementos for users.


A naive Memento aggregator would poll all the known web archives for every URI lookup, but only a few would return good results for a given URI. This is where MementoMaps of the known archives can be useful: they help the aggregator make an informed decision to poll only the archives that are likely to return good results.
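
As a rough illustration of this kind of selective polling, here is a minimal Python sketch. It is not MemGator's implementation: the `select_archives` function, the archive names, and the profile structure (a list of host-reversed key prefixes per archive) are all assumptions made up for this example.

```python
def select_archives(key, profiles):
    """Return the archives whose profile has a prefix covering the key.

    `profiles` maps a (hypothetical) archive name to a list of key
    prefixes that the archive is believed to hold.
    """
    selected = []
    for archive, prefixes in profiles.items():
        if any(key.startswith(prefix) for prefix in prefixes):
            selected.append(archive)
    return selected


# Hypothetical profiles for three imaginary archives.
profiles = {
    "archive-a": ["com,example)/", "org,"],
    "archive-b": ["pt,"],
    "archive-c": ["com,example)/blog/"],
}

# Only archive-a and archive-c would be polled for this lookup key.
to_poll = select_archives("com,example)/blog/post-1", profiles)
```

A naive aggregator corresponds to returning every key of `profiles` regardless of the lookup; the sketch skips archives whose profiles give no hint that they hold the requested URI.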


Analyzing the CDX data of web archives can give complete knowledge of their Holdings, and analyzing their access logs can reveal what is not present in a web archive. However, these datasets are usually only accessible to the archives themselves and evolve rapidly. We introduced an algorithm, the "Random Searcher Model (RSM)", to address this issue for archives that allow fulltext searching. With this approach, we can incrementally learn about their Holdings as an external observer by performing keyword searches and collecting the links returned in the results. We use the content of the returned pages to discover more query terms, hence our algorithm operates in a language-independent manner. RSM has various tunable parameters and operation modes to address different use cases.
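
The general idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the RSM as published: the `random_searcher` function, its `max_queries` parameter, and the in-memory `CORPUS` standing in for an archive's fulltext search interface are all hypothetical.

```python
import random
import re


def random_searcher(search, seed_terms, max_queries=10, rng=None):
    """Discover URIs by repeatedly searching for a randomly picked term.

    `search(term)` is assumed to return (uri, page_text) pairs. Words
    found in returned pages are added to the pool of future query terms.
    """
    rng = rng or random.Random(0)
    vocabulary = set(seed_terms)
    used = set()
    discovered = set()
    for _ in range(max_queries):
        candidates = sorted(vocabulary - used)
        if not candidates:
            break  # no unused query terms left
        term = rng.choice(candidates)
        used.add(term)
        for uri, text in search(term):
            discovered.add(uri)
            # Learn new query terms from the returned page content.
            vocabulary.update(re.findall(r"\w+", text.lower()))
    return discovered


# A toy stand-in for an archive's fulltext search (made-up data).
CORPUS = {
    "example.com/a": "web archive profile",
    "example.com/b": "profile routing",
    "example.org/c": "routing aggregator",
}


def toy_search(term):
    return [(uri, text) for uri, text in CORPUS.items() if term in text.split()]


found = random_searcher(toy_search, ["web"], max_queries=20)
```

Starting from the single seed term "web", the searcher reaches pages that do not contain the seed at all, because each result contributes new terms to query next.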


URIs in web archives are usually indexed using their Sort-friendly URI Reordering Transform (SURT) keys for improved spatial locality and canonicalization. We extended SURT to support wildcards that represent rolled-up sub-trees of the URI space. This way, we can express the Holdings and/or Voids of a set of related URIs with fewer lines in the MementoMap file instead of enumerating them all one by one.
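
To make this concrete, below is a simplified Python sketch of a SURT-like key and a trailing-wildcard match. It is an approximation for illustration only: real SURT canonicalization handles many more cases, and the `surt_key` and `matches` helpers as well as the exact wildcard placement are assumptions of this example rather than the MementoMap specification.

```python
from urllib.parse import urlsplit


def surt_key(uri):
    """Build a simplified SURT-like key: reversed host labels, then path."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # common canonicalization step, simplified here
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def matches(pattern, key):
    """True if key equals pattern or falls under a trailing-* pattern.

    Plain string-prefix matching is used, so patterns are expected to
    end at a label or path-segment boundary.
    """
    if pattern.endswith("*"):
        return key.startswith(pattern[:-1])
    return key == pattern


key = surt_key("https://www.example.com/a/b?x=1")   # "com,example)/a/b?x=1"
covered = matches("com,example)/a/*", key)           # True
```

Because the host labels are reversed, all keys of a domain and its paths sort next to each other, which is what lets a single wildcard line stand in for a whole sub-tree.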


We developed a single-pass, memory-efficient, and parallelization-friendly algorithm for MementoMap generation that discovers dense sub-trees of the URI space of an archive and replaces them with their corresponding wildcard representations. This operation is applied incrementally while consuming sorted index data. Our algorithm can also be applied to an existing MementoMap file to compact it further with more aggressive compaction parameters if the resulting output is too large.
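
The flavor of such a roll-up can be shown with a much simpler Python sketch. It is not the algorithm from the dissertation: it rolls up only one path level, and the `parent` and `roll_up` helpers and the `threshold` parameter are hypothetical names chosen for this example.

```python
from itertools import groupby


def parent(key):
    """Return the parent sub-tree of a key by dropping its last path segment."""
    head, _, _ = key.rstrip("/").rpartition("/")
    return head + "/"


def roll_up(sorted_keys, threshold=3):
    """Replace dense groups of sibling keys with a single wildcard key.

    Consumes sorted keys in one pass and keeps only one group of
    adjacent siblings in memory at a time.
    """
    compacted = []
    for prefix, group in groupby(sorted_keys, key=parent):
        siblings = list(group)
        if len(siblings) >= threshold:
            compacted.append(prefix + "*")  # dense sub-tree: roll it up
        else:
            compacted.extend(siblings)      # sparse: keep keys as they are
    return compacted


keys = [
    "com,example)/a/1",
    "com,example)/a/2",
    "com,example)/a/3",
    "com,example)/b/1",
    "org,example)/x",
]
compact = roll_up(keys)
```

Lowering `threshold` makes the compaction more aggressive, trading a smaller profile for more false positives during routing.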


We introduced a file format called "Unified Key Value Store (UKVS)" for serialization of MementoMaps. Our analysis suggested that if Arquivo.pt had published a 119 MB MementoMap file of its Archival Holdings, it could have avoided over 60% of the wasted traffic from MemGator without any false negatives. In addition, an Archival Voids profile of about 2.4k URIs that were each accessed hundreds of times (or more) could have avoided an additional 8.4% of wasted traffic without any false negatives (i.e., 100% recall).


Publications and Software

My doctoral research yielded numerous peer-reviewed papers, technical reports, presentations, and blog posts. Below is a list of publications most relevant to my dissertation.

  1. Sawood Alam, Michele C. Weigle, Michael L. Nelson. "Profiling Web Archival Voids for Memento Routing". Proceedings of the 21st ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Online, 2021. DOI:10.1109/JCDL52503.2021.00027
  2. Sawood Alam, Michele C. Weigle, Michael L. Nelson, Fernando Melo, Daniel Bicho, Daniel Gomes. "MementoMap Framework for Flexible and Adaptive Web Archive Profiling". Proceedings of the 19th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Urbana-Champaign, Illinois, USA, 2019. DOI:10.1109/JCDL.2019.00033
  3. Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson. "Unobtrusive and Extensible Archival Replay Banners Using Custom Elements". Proceedings of the 18th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Fort Worth, Texas, USA, 2018. DOI:10.1145/3197026.3203881
  4. Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson. "Client-side Reconstruction of Composite Mementos Using ServiceWorker". Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Toronto, Ontario, Canada, 2017. DOI:10.1109/JCDL.2017.7991579
  5. Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, David S. H. Rosenthal. "Web Archive Profiling Through Fulltext Search". Proceedings of the 20th International Conference on Theory and Practice of Digital Libraries (TPDL), Hannover, Germany, 2016. DOI:10.1007/978-3-319-43997-6_10
  6. Sawood Alam, Mat Kelly, Michael L. Nelson. "InterPlanetary Wayback: The Permanent Web Archive". Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Newark, New Jersey, USA, 2016. DOI:10.1145/2910896.2925467
  7. Sawood Alam, Michael L. Nelson. "MemGator - A Portable Concurrent Memento Aggregator". Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Newark, New Jersey, USA, 2016. DOI:10.1145/2910896.2925452
  8. Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila Balakireva, Harihar Shankar, David S. H. Rosenthal. "Web Archive Profiling Through CDX Summarization". International Journal on Digital Libraries (IJDL). 17.3 pp. 223–238. 2016. DOI:10.1007/s00799-016-0184-4
  9. Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila Balakireva, Harihar Shankar, David S. H. Rosenthal. "Web Archive Profiling Through CDX Summarization". Proceedings of the 19th International Conference on Theory and Practice of Digital Libraries (TPDL), Poznan, Poland, 2015. DOI:10.1007/978-3-319-24592-8_1


My research also resulted in numerous open-source software releases and contributions that affect various aspects of the web archiving ecosystem. Below is a list of only a few of the most relevant ones.


Academic and Professional Journey

I am a first-generation student. I completed my BTech degree in Computer Engineering at the Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India, in 2008. Then I worked for Belzabar Software Design, India, as a Computer Scientist for one year. Dr. Mohammad Zubair invited me to do research at ODU. I worked with him for over a year, then joined the WSDL Research Group, led by Dr. Michael L. Nelson, as it aligned well with my interests. I completed my master's degree in Computer Science at ODU and continued with the PhD program.

During my academic life at ODU, I got the opportunity to travel to the UK, Canada, Germany, Poland, and many places in the USA for academic conferences and datathons in the Digital Libraries and Web Archiving disciplines, which helped me grow my professional network significantly. I co-authored papers with researchers from other institutions and organizations (both American and international) such as Stanford University, Los Alamos National Laboratory (LANL), and the Portuguese Web Archive. I have actively participated in the events and discussions of IIPC, standards-making bodies like the Internet Engineering Task Force (IETF), emerging technologies like the Decentralized Web (DWeb), and Web evolution tracking reports like the Web Almanac from the HTTP Archive. I served various journals and conferences as a reviewer and Program Committee member.

While at ODU, I established the ODU Linux Users Group and promoted open-source software. I was selected as the Docker Campus Ambassador for ODU and was among the first few to earn the Docker Certified Associate certification. I gave numerous talks in classrooms and during campus events on various programming languages, frameworks, tools, and technologies.



I served as a teaching assistant for various courses and helped create a new course on Cryptocurrencies/Blockchain. Towards the end of my PhD program, I offered a course on Web Server Design in which my students were tasked with reading various HTTP RFCs and building their own standards-compliant web servers. I built an HTTP server testing framework and incorporated state-of-the-art tools and technologies (such as Git and Docker) in my course, not only to automate the deployment and testing of students' projects but also to expose my students to the modern toolchains and practices used in the industry.

Apart from my academic interests in the Web and Web Archiving disciplines, I have also been working towards promoting the Urdu language on the Web and have served UrduWeb as an administrator for over a decade. In addition to leading numerous Urdu projects, my interest in the language resulted in a research paper on Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages.

The year 2020 has been a rewarding one for me, as three significant things happened in my professional life this year:



On September 1, 2020, I joined the Internet Archive (IA) and worked part time while finalizing my dissertation. After one month, on October 1, 2020, I started working full time. My current role at IA is Web and Data Scientist for the Wayback Machine. In addition to solving engineering challenges of the Wayback Machine at scale, I am responsible for connecting to and establishing relationships with the broader academic and research community. I continue to engage with and serve various consortia (e.g., IIPC), standards-making bodies, and academic conferences and journals in related disciplines in various roles. As part of my role at IA, I interact with both academic and industry researchers and practitioners to provide support with the Wayback Machine data sets and APIs, collaborate on research work and grants, identify and incorporate open-source work that can make our services more useful and accessible, and establish partnerships on matters of mutual interest. Moreover, I serve as an advisory board member on a few academic programs, mentor graduate students, and offer guest lectures.


Wrap Up

So, what happened to that grass bird, you might ask. Unfortunately, it did not survive my PhD. I last saw it in its place in March 2020, before the university advised people to work from home and avoid university buildings after the surge of coronavirus disease (COVID-19). When I went back to the WSDL lab in December 2020 for my dissertation defense, I wanted to use it as a prop during my presentation, but I could not find it anywhere. Whether it was dumped into the trash during the sanitization of the lab or something else happened to it remains an unresolved mystery. This reminds us how brittle our world is and that preserving history requires active attention and maintenance. Digital data is often even more volatile and short-lived than physical artifacts. I still wish that the bird had fallen into a cavity behind a bookshelf or a table and survives under benign neglect, to be found several years later, when this blog post or a copy of it in one of many web archives would provide some context to the curious discoverers. Hopefully, that discovery would be aided by our MemGator tool and the MementoMap framework to perform a selective lookup in only those web archives where a copy of this blog post is likely to be present.


--

Dr. Sawood Alam
Web and Data Scientist
Wayback Machine, Internet Archive


NOTE: A draft of this post was published on December 31, 2020, but it was taken back soon after. The finalized version was published a year later on December 31, 2021.

