2020-09-28: My report card to my mother

On July 28, 2020, I defended my PhD dissertation, Bootstrapping Web Archive Collections From Micro-Collections in Social Media, the culmination of a 12-year journey that began when I arrived in the US from Nigeria in 2007.
Alexander C. Nwala, PhD Thesis: Bootstrapping Web Archive Collections From Micro-Collections in Social Media
I remain grateful to God and the many who made this possible, beginning with my father, Alexander E. Nwala, my mother, Comfort C. Nwala, whose vision I have realized, and my supervisors, Dr. Michael Nelson and Dr. Michele Weigle, for guiding me through every paved and dirt road along my journey. I am very excited to join the Observatory on Social Media (OSoMe) and the Networks & agents Network (NaN) research group at Indiana University as a post-doc, under the supervision of Dr. Filippo Menczer. My research will focus on (mis/dis)information diffusion and the detection and countering of online manipulation.

Summarizing the trajectory of my PhD research over the past six years is not an easy task. This blog post is an attempt. 

2014 -- 2017: Twitter Bots, SERPs, Local News, & Media Manipulation 
I joined WS-DL in May 2014, having just concluded my Master's thesis (Generating Combinatorial Objects: A New Perspective), where I re-invented the wheel by developing a method to generate Combinations as part of the broader task of enumerating the combinatorial states of placing balls into bins. My PhD research began with two Twitter bots: ICanHazMemento and What Did It Look Like. The former archives URLs on Twitter when invoked with the hashtag #icanhazmemento; the latter generates an animated GIF showing what a website looked like over the years and posts it on a Tumblr blog.
My first study of Search Engine Result Pages (SERPs) began with training a classifier to classify web queries as either scholar (e.g., "genetically engineered mice") or non-scholar (e.g., "pizza near me"). The logistic regression classifier was trained on a labelled dataset of Google SERPs. This research (published at ACM/IEEE JCDL 2016) demonstrated the utility of SERPs beyond their conventional use of fulfilling an informational request encoded in a search query.

In 2016, I collaborated with researchers at the Harvard Library Innovation Lab to develop the Local Memory Project (LMP), which provides tools to help users build and archive collections of local news stories from local news organizations. The LMP research was published as part of the ACM/IEEE JCDL 2017 conference proceedings. I returned to Harvard in 2017 to research media manipulation in the 2016 US Presidential election in collaboration with Dr. Rob Faris at the Berkman Klein Center. This research inspired StoryGraph (more on StoryGraph).

This preliminary work informed my PhD research by emphasizing the importance of geographical and temporal constraints when building collections for stories and events.

2017 -- 2018: Generating seeds from SERPs and comparing collections
Between 2017 and 2018, I conducted an experiment to quantify retrievability, or the ease of re-finding news stories on SERPs. This was done by issuing seven queries semi-automatically to Google every day for over seven months (2017-05-25 to 2018-01-12) to learn when stories disappear from the Google SERP. Our results and recommendations were presented at JCDL 2018.
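To make the idea of "learning when stories disappear" concrete, here is a minimal Python sketch. It is not the code used in the study: the function name serp_visibility, the snapshot format (a dict mapping each day to the list of URLs seen on the SERP that day), and the example URLs are all assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List, Tuple


def serp_visibility(
    snapshots: Dict[date, List[str]]
) -> Dict[str, Tuple[date, date, int]]:
    """Map each URL to (first_seen, last_seen, span_in_days).

    `snapshots` is a hypothetical structure: one list of result URLs per day
    for a single query. Gaps (a story vanishing and later reappearing) are
    ignored; only the first and last sightings are recorded.
    """
    first_seen: Dict[str, date] = {}
    last_seen: Dict[str, date] = {}
    for day in sorted(snapshots):
        for url in snapshots[day]:
            first_seen.setdefault(url, day)  # keep the earliest sighting
            last_seen[url] = day             # overwrite with the latest sighting
    return {
        url: (first_seen[url], last_seen[url],
              (last_seen[url] - first_seen[url]).days + 1)
        for url in first_seen
    }


# Illustrative data only (not from the experiment).
example = {
    date(2017, 5, 25): ["https://example.com/a", "https://example.com/b"],
    date(2017, 5, 26): ["https://example.com/a"],
    date(2017, 5, 27): ["https://example.com/a", "https://example.com/c"],
}
visibility = serp_visibility(example)
```

A URL whose last_seen is earlier than the final snapshot day has dropped off the SERP for that query, which is the event the experiment was designed to time.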
My first contribution toward comparing human and machine-generated seeds (published at ACM HyperText 2018) led to the Collection Characterizing Suite (CCS). The CCS consists of seven metrics that help profile collections of seeds and serve as a baseline for comparing collections, a task that is difficult since collections often cater to different needs. Additionally, the CCS provided the foundation for the research contribution that concluded my PhD research: the Quality Proxies.
2019 -- 2020: Micro-collections and Quality Proxies (QPs)
In 2019, I extensively studied social media as a source for seeds and proposed the post class system for labeling social media posts irrespective of platform. More importantly, we proposed Micro-collections, a major contribution of my PhD dissertation (published at ACM/IEEE JCDL 2019), as a valuable source for generating seeds. Micro-collections are social media posts that contain URLs that are gathered by humans as a demonstration of domain expertise and editorial activity, using the existing tools of social media platforms. On Twitter they manifest as threaded conversations posted by single or multiple authors.
MCQP framework overview for bootstrapping Web archive collections from Micro-collections in Social Media. The numbers shown represent the stages of the framework.
It is insufficient to generate seeds without establishing their quality, but this raises a new challenge: "How do you define and quantify quality?" The final major contribution of my PhD work was the proposal of Quality Proxies for assigning quality scores to seed URLs extracted from social media posts. A QP assigns a quality trait to a seed within a single dimension. Seeds can be assigned a quality score by selecting different combinations of Quality Proxies, which map to different notions of quality across multiple dimensions such as popularity, reputation, geographical proximity, etc. The QP framework is flexible (enables multiple definitions of quality), robust (operates with subsets), explainable, and extensible. Our evaluation results showed that Quality Proxies resulted in the selection of quality seeds with increased precision, both when novelty is prioritized and when it is not.
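The idea of combining single-dimension proxies into one score can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the proxy functions (popularity, reputation), the seed fields (shares, domain_reputation), the plain averaging, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Seed = Dict[str, object]
QualityProxy = Callable[[Seed], float]  # each proxy scores one dimension in [0, 1]


def popularity(seed: Seed) -> float:
    # Hypothetical proxy: share count capped at 100 and scaled to [0, 1].
    return min(float(seed.get("shares", 0)) / 100.0, 1.0)


def reputation(seed: Seed) -> float:
    # Hypothetical proxy: a precomputed domain reputation already in [0, 1].
    return float(seed.get("domain_reputation", 0.0))


def quality_score(seed: Seed, proxies: Sequence[QualityProxy]) -> float:
    """Average the selected proxies (an assumed way to combine them)."""
    if not proxies:
        return 0.0
    return sum(qp(seed) for qp in proxies) / len(proxies)


def select_seeds(seeds: List[Seed], proxies: Sequence[QualityProxy],
                 threshold: float = 0.5) -> List[Seed]:
    """Keep seeds whose combined score meets the (assumed) threshold."""
    return [s for s in seeds if quality_score(s, proxies) >= threshold]


# Illustrative seeds only.
seeds = [
    {"url": "https://example.com/a", "shares": 50, "domain_reputation": 0.9},
    {"url": "https://example.com/b", "shares": 10, "domain_reputation": 0.2},
]
selected = select_seeds(seeds, [popularity, reputation])
```

Because the score depends only on whichever proxies are passed in, a different notion of quality is expressed by passing a different subset, and a new dimension is added by writing one more function. That mirrors the flexibility, subset operation, and extensibility described above.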

The combination of Micro-collections (MC) and Quality Proxies (QP) produced our MCQP framework for Bootstrapping Web Archive Collections From Micro-Collections in Social Media.
Advice for Prospective PhD students
I was recently asked what advice I would give to prospective PhD students as part of an interview for a news story published by the ODU College of Sciences. Here is an excerpt of my response, organized in two parts: before and during the PhD.
First, ensure you're pursuing a PhD for the right reasons. Don’t pursue it because you're pressured to or you think the "Dr." title sounds cool. But if you have a passion to learn and develop solutions, you're tenacious, and you don't mind studying the same ideas for 4 - 8 years, then you're on the right course.

Second, during your PhD, it's ok not to know anything about a topic or area you plan to research, or what to do next. In other words, you have to be comfortable with "not knowing." However, you have to learn how to navigate your way out of that space by exploiting all the resources at your disposal. Value your contribution; you have something to offer. Be curious, don't be satisfied at the first glimpse of progress, always verify good and bad results, and always ask questions of yourself and of others. Learn from others. Take advantage of the community of researchers within and across your discipline; don't be isolated. Build networks. Relationships are important; value and maintain them.

Third, during your PhD, take care of yourself, don't let the work consume you. Manage your stress and mental/physical health. I run and love riding bikes. Find what you enjoy to de-stress. 

Finally, during your PhD find a way to serve and contribute to the development of others. Success is incomplete if you haven't brought others along.

-- Alexander C. Nwala (@acnwala)