2019-09-02: So Long, and Thanks for All the Frogs

Mat Kelly has received his PhD. This is the Final Blog Post

On May 7th, 2019, after a very long trek as a PhD student, I successfully defended my dissertation, "Aggregating Private and Public Web Archives Using the Mementity Framework" (slides). The tome (physical height still to be determined), originally titled "A Framework for Aggregating Private and Public Web Archives", consisted of exactly that and a bit more. The crux of the work was first presented under that original title (hence the change) in a paper nominated for best paper at JCDL 2018 (arXiv). The extended version addressed issues that did not fit within the 10-page conference paper limit. In this post I will provide a very high-level synopsis of the work's contribution to the area of web archiving and a round-up of my experience as a PhD student.

First, I will describe the "mementity" concept, which we introduced as an alternative to the already overloaded "entity" nomenclature. In the parlance of the framework, a mementity (Memento entity) is a realized or implemented concept like a Memento TimeGate or Memento Aggregator. In the framework, we introduced three new mementities:

Memento Meta-Aggregator (MMA)
For allowing advanced aggregation of archives like subsetting, supplementing, and filtering of archival sources
Private Web Archive Adapter (PWAA)
For integrating access to private web archives where dereferencing a URI is insufficient
StarGate (SG)
For advanced querying of web archives based on attributes of the mementos

The concepts behind the mementities were developed progressively. The first requirement was to give clients more control over which archival sources are used for aggregation. MMAs allow exactly this. Through the HTTP Prefer header (RFC 7240), they also let clients specify a precedence for the archives queried (cf. sending requests to all known archives at the same time) and request short-circuiting, i.e., halting the querying of subsequent archives once a specified condition is met.
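As a rough illustration of precedence and short-circuiting, here is a minimal sketch. It is not the dissertation's implementation: archives are modeled as plain Python functions, and the names `aggregate`, `precedence`, and `stop_after` are assumptions made for this example. In the framework, the client would convey these preferences to the MMA through the Prefer header rather than as function arguments.

```python
from typing import Callable, Dict, List, Optional

# Assumption: each archive is modeled as a function from a URI-R to a list of URI-Ms.
Archive = Callable[[str], List[str]]


def aggregate(
    uri_r: str,
    archives: Dict[str, Archive],
    precedence: List[str],
    stop_after: Optional[int] = None,
) -> List[str]:
    """Query archives one at a time, in the client's preferred order.

    stop_after is a hypothetical short-circuit condition: once at least that
    many mementos have been collected, the remaining archives are not queried.
    """
    mementos: List[str] = []
    for name in precedence:
        if name not in archives:
            continue  # the client named an archive this aggregator does not know
        mementos.extend(archives[name](uri_r))
        if stop_after is not None and len(mementos) >= stop_after:
            break  # short-circuit: skip all lower-precedence archives
    return mementos


# Demo with stand-in archives that record when they are queried.
queried: List[str] = []


def make_archive(name: str, count: int) -> Archive:
    def query(uri_r: str) -> List[str]:
        queried.append(name)
        return [f"https://{name}.example/{i}/{uri_r}" for i in range(count)]
    return query


demo_archives = {
    "first": make_archive("first", 0),    # holds nothing for this URI-R
    "second": make_archive("second", 2),  # holds two mementos
    "third": make_archive("third", 5),    # never reached in this demo
}
demo_result = aggregate(
    "http://example.com/",
    demo_archives,
    precedence=["first", "second", "third"],
    stop_after=1,
)
```

In the demo, the third archive is never contacted because the second one already satisfies the condition. Avoiding that unnecessary request is the saving that short-circuiting is meant to provide.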

Aggregating public and private web archives may not be straightforward, as additional query parameters, e.g., credentials, may be needed to access a private archive's mementos. Through the base-case usage of OAuth 2 (RFC 6749), access patterns used on the live web can be systematically translated to access to private web archives. If an MMA is aware that an archive is private, it can delegate the authentication dance to a separate mementity, the PWAA.
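The delegation can be pictured with a small sketch. This is a simplification under stated assumptions, not the framework's actual interface: the class and function names are hypothetical, the access tokens are assumed to have been obtained already through an OAuth 2 flow, and no network requests are made. The only standard detail used is the `Authorization: Bearer` header for presenting an OAuth 2 access token.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ArchiveInfo:
    name: str
    timemap_url: str
    private: bool = False


class PrivateWebArchiveAdapter:
    """Hypothetical stand-in for a PWAA that holds access tokens."""

    def __init__(self, tokens: Dict[str, str]) -> None:
        # Assumption: maps archive name -> previously obtained OAuth 2 access token.
        self._tokens = tokens

    def headers_for(self, archive: ArchiveInfo) -> Optional[Dict[str, str]]:
        token = self._tokens.get(archive.name)
        if token is None:
            return None  # no authorization available for this archive
        return {"Authorization": f"Bearer {token}"}


def plan_requests(
    archives: List[ArchiveInfo], pwaa: PrivateWebArchiveAdapter
) -> List[Tuple[str, Dict[str, str]]]:
    """Return the (URL, headers) pairs an aggregator would send."""
    plan: List[Tuple[str, Dict[str, str]]] = []
    for archive in archives:
        if not archive.private:
            plan.append((archive.timemap_url, {}))  # public: query directly
            continue
        headers = pwaa.headers_for(archive)  # private: delegate to the adapter
        if headers is not None:
            plan.append((archive.timemap_url, headers))
        # Private archives without a token are skipped in this sketch.
    return plan


demo_plan = plan_requests(
    [
        ArchiveInfo("public", "https://public.example/timemap/"),
        ArchiveInfo("mine", "https://mine.example/timemap/", private=True),
        ArchiveInfo("theirs", "https://theirs.example/timemap/", private=True),
    ],
    PrivateWebArchiveAdapter({"mine": "example-token"}),
)
```

The design point is that the aggregator itself never handles credentials. It only needs to know which archives are private, and everything specific to authentication stays inside the adapter.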

Because querying web archives can be temporally, computationally, and spatially expensive when performed in bulk, exposing attributes of an archive's holdings in a TimeMap allows for more sophisticated querying. For example, when creating summaries of a URI over time, generating a SimHash of the HTML of each memento allows for detection of significant changes in the page and identifies likely candidates for inclusion (per Ahmed AlSum's ECIR 2014 paper). We encountered this issue when initially implementing the web archive summarization visualization for the Web Archiving Collaboration Conference. Retaining these SimHashes, once calculated, makes generating summaries much quicker. Doing so amounts to populating TimeMaps with attributes of mementos beyond time. The addition of this arbitrary, wildcard (*) set of attributes semantically renames TimeMaps to StarMaps per the framework. Being able to filter on these attributes requires communicating with endpoints, like one that generates SimHashes for a URI-M. Delegating this role to a separate mementity, the StarGate, allows for client-side negotiation of web archives in dimensions beyond time. We initially explored this in our work for WADL 2018.
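To make the SimHash idea concrete, here is a minimal sketch of the general technique, not the code used in the dissertation or in the summarization tool. The whitespace tokenization, the 64-bit fingerprint, the threshold value, and the dictionary layout of a StarMap entry are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.split():
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_candidates(starmap: List[Dict], threshold: int = 8) -> List[str]:
    """Keep a memento when its stored SimHash differs from that of the
    most recently kept memento by at least `threshold` bits."""
    kept: List[str] = []
    last = None
    for entry in starmap:  # entries are assumed to be in temporal order
        if last is None or hamming(entry["simhash"], last) >= threshold:
            kept.append(entry["uri_m"])
            last = entry["simhash"]
    return kept


# Hypothetical StarMap entries with fingerprints that were calculated earlier.
demo_starmap = [
    {"uri_m": "m1", "simhash": 0b0},
    {"uri_m": "m2", "simhash": 0b1},     # 1 bit from m1: treated as a minor change
    {"uri_m": "m3", "simhash": 0xFFFF},  # 16 bits from m1: treated as significant
]
demo_kept = select_candidates(demo_starmap)
```

Because `select_candidates` only reads stored fingerprints, a summary can be regenerated without fetching or hashing any memento again. That is the benefit of keeping the attribute alongside each entry.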

The framework can be implemented piecemeal; no mementity relies on another. Its power for aggregating private and public web archives, however, is most apparent when all mementities are used together. This was the fundamental component of the dissertation that I defended in May.

"But May...", you might say, "why the delay?"

Following a defense, one must edit and refine the document per the committee's recommendations. Luckily, aside from adding an appendix, some clarifications, and stylistic changes, my document did not require extensive changes. I applied these changes and submitted the document to the ODU College of Sciences on June 7, 2019.

The college approved, and I submitted the document to ProQuest. About two and a half months later, I heard back; the sole requested change was an incorrect page number (thanks, LaTeX!), which I promptly fixed. After a couple more weeks and some pings to ODU's ProQuest representative, my dissertation was approved.

This completes my time as an academic student. My next role will keep me in academia, but on the other side of the table.

During my term as a graduate student, I had 19 peer-reviewed publications (worth 72 WS-DL publication points), collaborated with authors/presenters from 8 different institutions¹, and wrote 29 blog posts (inclusive). I also lived in four cities (Charleston, Goose Creek, Virginia Beach, Portsmouth) in two locales (Lowcountry, Tidewater), had a child, worked for three employers, and most importantly, climbed to the top of the PhD Crush board.

—Mat (@machawk1)

¹ Old Dominion University, Los Alamos National Laboratory, Clemson University, Science Systems and Applications, Inc., NASA Langley Research Center, BMW Group, Intel Corporation, Protocol Labs