2021-09-01: From Student To Researcher III

After graduating, I officially accepted a position in Los Alamos National Laboratory's Information Sciences Division (CCS-3) working for Diane Oyen. On October 4, 2021, I will no longer be a member of the Los Alamos National Laboratory (LANL) Research Library and I will instead move to LANL's CCS-3 division. As I noted in my previous transitional blog posts from 2015, I originally just wanted to find better computing job opportunities than the US Navy could provide me. Along the way to a master's thesis, I discovered that I actually enjoy research. My research work with wikis resulted in Herbert Van de Sompel offering me a student position at LANL. After living in Virginia almost all of my life, I moved to New Mexico to discover what new opportunities awaited me. I have lived in Santa Fe, a roughly 45-minute drive from Los Alamos, ever since.

My next step is to learn more about computer vision and machine learning from CCS-3, while also introducing CCS-3 to the wonderful data sources that are web archives. In my work on Dark and Stormy Archives, I have often wanted to extend our stories' current multimedia experience even further. Why stop with summarizing text? Why not images? What about audio? Video? So much of our life online lives inside non-textual forms. I hope to satisfy some of this curiosity in my work with CCS-3.

So, what got me here? In this post I will briefly summarize the work in my dissertation and talk about the future.

Dark and Stormy Archives

The Dark and Stormy Archives (DSA) project applies social media storytelling techniques to summarize web archive collections. It is an idea that was initially developed by Michael Nelson, Michele Weigle, and Yasmin AlNoamany. The project takes its name from a writing prompt made popular by the Peanuts comic strip. DSA has taken many forms over the years. I revived the idea for my dissertation work.

Web archive collections are vast, consisting of thousands of mementos. It is difficult for someone to understand a collection as a whole, or even a specific aspect of a collection. This problem is compounded when we realize that there are more than 14,000 collections in Archive-It alone. We need a method to not only understand a collection at a glance, but also compare collections to each other.

The DSA method leverages the existing paradigm of social media storytelling to visualize representative mementos from a web archive collection. We refer to these representative mementos as exemplars because they are selected to serve as examples of some aspect of the collection or the overall collection's topic. A visualization that summarizes a collection through these exemplars using social media storytelling is a story. Many such stories exist throughout social media. One example is the Twitter Moment, like the one below about COVID-19 vaccines. Note how it is constructed of tweets, but each contains a link to a web resource and summarizes that web resource with a title, description, and striking image.

DSA was a good fit for me because it allowed me to transition from a software engineer to a researcher. There are numerous software projects in this problem space. How do we select the documents for our stories? Hypercane. How do we visualize a memento? MementoEmbed. How do we create and visualize a story? Raintale. We combine these tools to produce stories from web archive collections.

This in itself was quite the undertaking, but, as Martin Klein often reminds me, we researchers do not get PhDs for writing software. These software projects helped me understand and answer research questions that led to publications.

How do we detect off-topic mementos in a web archive collection?

The Off-Topic Memento Toolkit, published at iPres 2018

What structural features exist for web archive collections? What types of web archive collections exist? Can we predict the type of web archive collection using structural features?

The Many Shapes of Archive-It, published at iPres 2018

What surrogate (visualization) of a single memento works best for helping users understand a web archive collection?

Social Cards Probably Provide For Better Understanding Of Web Archive Collections, published at ACM CIKM 2019

How are current platforms insufficient for visualizing mementos? How do we tell stories from web archives using our components?

MementoEmbed and Raintale for Web Archive Storytelling, presented at WADL 2020

How do we extend these concepts outside of Archive-It to summarize other collections of mementos, such as with news?

SHARI -- An Integration of Tools to Visualize the Story of the Day, presented at WADL 2020

How do we automatically select striking images to visualize individual mementos when there is no metadata to help us generate good surrogates?

Automatically Selecting Striking Images for Social Cards, presented at ACM Web Science 2021

What algorithmic primitives exist to help us select exemplars from web archive collections?

Hypercane: Intelligent Sampling for Web Archive Collections, a poster to be presented at JCDL 2021
Hypercane: Toolkit for Summarizing Large Collections of Archived Webpages, to be published in SIGWEB Fall Newsletter in 2021

When examining web page metadata, what do the page authors' choices tell us about their priorities?

It's All About The Cards: Sharing on Social Media Encouraged HTML Metadata Growth, to be presented at JCDL 2021

Identifying the problems to be solved was key. With those problems identified, I could then break them up. From there, I was able to derive research questions and develop a model. With this model I could group the "small" research questions from each paper into more general research questions for the dissertation. I would like to thank Mark Graham for his conversations in this space that helped me think about this more abstractly.

When trying to actually create rich stories, I recognized that we did not just need to select exemplars. We also needed to generate story metadata, such as the title of the collection, a striking image to represent the story as a whole, information about top named entities, and more. This meant that Hypercane could not just be a tool for selecting exemplars. Because story metadata might also help us select exemplars, both tasks fit together in a single tool.

We initially conceived of MementoEmbed as a tool for generating a surrogate for an individual memento in the form of something like a card, thumbnail, or word cloud. Over time I began to realize that creating the surrogate alone was insufficient for the overall problem space. We needed something to generate document metadata. Though we showed that social cards are best for collection understanding, I kept struggling with the concept of a "perfect card" and realized that there were a lot of possible visualizations for archives.

While working on the CIKM paper, we needed a tool that could generate different types of surrogates. This led to the development of MementoEmbed's API and the genesis of Raintale. With Raintale we could visualize the story and distribute the story. For our experiments we needed to generate many different types of stories. This led to Raintale's template interface, which we have improved throughout the project.

A lesson that I repeatedly learned was that one size did not fit all when it came to selecting exemplars from web archive collections. We could not dictate a single algorithm for all archivists and collection types. Instead, we now offer a set of primitives so that users can create their own selection algorithms and ultimately their own stories. The primitives are sample, filter, cluster, score, and order. By combining these primitives in different ways we get different sets of exemplars, and produce different stories. The DSA4 diagram below shows one such possible algorithm using these primitives.
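To make the idea of combining primitives concrete, here is a minimal Python sketch. It is not Hypercane's actual interface: the function names, the dictionary fields (`on_topic`, `year`, `text`, `datetime`), and the toy scoring rule are all assumptions made up for illustration. It only shows how sample, filter, cluster, score, and order can be chained into one possible selection algorithm.

```python
import random
from collections import defaultdict

# Illustrative only: mementos are plain dicts with hypothetical fields
# ("on_topic", "year", "text", "datetime"). None of this is Hypercane's API.

def sample(mementos, k, seed=0):
    """Sample: pick at most k mementos at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(mementos, min(k, len(mementos)))

def filter_on_topic(mementos):
    """Filter: keep only mementos flagged as on-topic."""
    return [m for m in mementos if m["on_topic"]]

def cluster_by(mementos, key):
    """Cluster: group mementos that share the same value for `key`."""
    clusters = defaultdict(list)
    for m in mementos:
        clusters[m[key]].append(m)
    return dict(clusters)

def score(mementos):
    """Score: toy rule (an assumption) where longer text scores higher."""
    return [dict(m, score=len(m["text"])) for m in mementos]

def order(mementos, key):
    """Order: sort mementos by the given field."""
    return sorted(mementos, key=lambda m: m[key])

def select_exemplars(mementos, k=100):
    """One possible algorithm: sample, filter, cluster by year,
    keep the top-scoring memento per cluster, then order by datetime."""
    candidates = filter_on_topic(sample(mementos, k))
    clusters = cluster_by(candidates, "year")
    best = [max(score(group), key=lambda m: m["score"])
            for group in clusters.values()]
    return order(best, "datetime")
```

Swapping the clustering key, the scoring rule, or the order of the steps yields a different set of exemplars, which is the point of offering primitives rather than a single fixed algorithm.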

From this work, we have generalized the DSA so that it can also work with other collections, rather than only those at Archive-It. There are so many directions ahead of DSA. Thanks to a grant from the IIPC, we are currently piloting the software tools in collaboration with the National Library of Australia. We have applied them to mementos from the Internet Archive, created stories mixing mementos from the Internet Archive and Archive.Today, and shown that they work well with mementos from many Memento-compliant web archives, like Arquivo.pt or the UK Web Archive. The magic lies in the Memento protocol, which allows us to support many web archives by programming against a single consistent interface. Thanks to Martin Klein, I've contributed to the book chapter Interoperability for Accessing Versions of Web Resources with the Memento Protocol from The Past Web: Exploring Web Archives, which describes Memento and its importance to accessing and building capabilities, like storytelling, on top of web archives.
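As a small illustration of that single consistent interface, the sketch below prepares a Memento datetime negotiation request in Python. Under the Memento protocol (RFC 7089), a client asks a TimeGate for the version of a page closest to a desired time by sending an `Accept-Datetime` header. The function names are my own, and the TimeGate base URL and the convention of appending the original URI to it are assumptions; each archive documents its own TimeGate endpoint. The code builds the request but does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(when):
    """Format a timezone-aware datetime as an HTTP date in GMT,
    the format used for the Accept-Datetime header."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)

def timegate_request(timegate_base, original_uri, when):
    """Return (url, headers) for a datetime negotiation request.

    timegate_base is a placeholder for an archive's TimeGate endpoint;
    appending the original URI to it is an assumed convention.
    """
    url = timegate_base + original_uri
    headers = {"Accept-Datetime": accept_datetime_header(when)}
    return url, headers

# Example with a made-up TimeGate base URL.
url, headers = timegate_request(
    "https://archive.example/timegate/",
    "http://example.com/",
    datetime(2021, 8, 5, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

Because every Memento-compliant archive understands the same header, only the TimeGate base needs to change to query a different archive.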

My involvement with DSA will lessen over time, but DSA is far from complete. There are so many more directions in which to take it. There is a lot of video in web archives that is not yet addressed. Audio is another format ripe for summarization -- how do we tackle that?

Defending my Dissertation on the DSA and Becoming Dr. Jones

On August 5, 2021, I defended this work before a live audience and my committee.

I really appreciate the input of all committee members toward making my dissertation better. Below is a video of the defense. The video includes me talking over the slides and the Q&A session afterward.

Improving Collection Understanding For Web Archives With Storytelling: Shining Light Into Dark and Stormy Archives - PhD Defense from Shawn Jones

I've had problems uploading the slides to SlideShare and Google Slides. SlideShare downsampled the slides (see above) to such an extent that many of my screenshots are illegible. Google Slides refused to allow the upload of my PowerPoint presentation. Not having the time to work around these issues, I resolved to entrust a good quality version of my defense slides to the Internet Archive for safekeeping.

Thanks to everyone who attended my defense!

Thanks to @phonedude_mln @weiglemc @mart1nkle1n @fanchyna @OpenMaze & Jose Padilla for serving on my committee and calling me Dr. Jones for the first time! #webarchiving #phdlife #storytelling #summarization https://t.co/QCrHmgzdOZ
— Shawn M. Jones, PhD (@shawnmjones) August 5, 2021

And, weeks later, after haggling over formatting with the College of Sciences, Old Dominion University awarded me the degree of Doctor of Philosophy in Computer Science.

It's official: my knowledge suffices! I am now Dr. Jones. pic.twitter.com/cTqtbMChdk
— Shawn M. Jones, PhD (@shawnmjones) August 27, 2021

What's next for Dr. Jones?

I'm going to throw everything I can at becoming a productive member of Los Alamos National Laboratory CCS-3. This will involve learning from people like Diane Oyen, Kari Sentz, Juan Castorena, Michael Kucer, Judith Cohn, Reid Rivenburgh, and others that I have yet to meet. My goal is to become a staff scientist by the end of my postdoc.

I look forward to publishing more work, working on new projects, and meeting new colleagues at fresh venues. I will continue to improve my online presence. I have a healthy Twitter following, am trying to figure out how to use Facebook, and will be extending my presence to LinkedIn. You will see me continue to post things of interest to this blog, and I am revamping my website to continue to share my adventures.

For the last several years, as leader of the DSA, I've been asking people "what story will you tell with web archives?" Now I write the next chapter of my own story.

-- Shawn M. Jones, PhD

Web Science and Digital Libraries Research Group