2018-06-08: Joint Conference on Digital Libraries (JCDL) Doctoral Consortium Trip Report




On June 3, 2018, PhD students arrived in Fort Worth, Texas, to attend the Joint Conference on Digital Libraries Doctoral Consortium, a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. The event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and the organizers, but also by those who were genuinely interested in emerging work. As usual, I live-tweeted the event to capture salient points. It was a very enjoyable experience for all.

Thanks very much to the chairs. In this post I will cover the work of all accepted students, three of whom are from the Web Science and Digital Libraries Research Group at Old Dominion University. I would also like to thank the assigned mentors of the Doctoral Consortium, who provided insight and guidance not only to their own assigned students, but to the rest of us as well.

WS-DL Presentations



Shawn M. Jones




How does a researcher differentiate between web archive collections that cover the same topic? Some web archive collections consist of more than 100,000 seeds, each with multiple mementos, and Archive-It contained more than 8,000 collections as of the end of 2016. Existing metadata in Archive-It collections is insufficient because it is produced by different curators from different organizations applying different content standards and different rules of interpretation. In my doctoral consortium submission, I proposed improving upon the solution piloted by Yasmin AlNoamany, who generated a series of representative mementos and submitted them to the social media storytelling platform Storify in order to provide a summary of each collection.

As part of my preliminary work, I presented findings that will be published at iPres 2018. We discovered four semantic categories of Archive-It collections: collections where an organization archived itself, collections about a specific subject, collections about expected events or time periods, and collections about spontaneous events. The collections AlNoamany used in her work fit into the last category, which also turned out to be the smallest, meaning that many other types of collections were not evaluated by her method. She showed that humans could not tell the difference between her automatically generated stories and stories generated by humans, but she did not provide evidence that the visualization was useful for collection understanding. We also have the problem that Storify is no longer in service, something I mentioned in a previous blog post.

My plan includes developing a flexible framework that allows us to test different methods of selecting representative mementos. This framework will also allow us to test different types of visualizations using those representative mementos, some of which may make use of different social media platforms. I plan to evaluate these collections by first creating user tasks that indicate whether a user understands aspects of a collection, and then evaluating different solutions via user testing. The solutions that score best in this testing will address a large problem inherent to the scale of web archives.


Alexander Nwala




How do we find high-quality seeds for generating web archive collections? Alexander is focusing on a different aspect of web archive collections than I am: while I analyze existing collections, he builds collections from seeds supplied by social media users. He notes that users often create "micro-collections" of web resources, typically surrounding an event. Using examples like Ebola epidemics, the Ukraine crisis, and school shootings, Alexander asks whether seeds generated from social media are comparable to those generated by professional curators. He also seeks quantitative methods for evaluating collections, and he wants to evaluate the quality of collections at scale.
He demonstrated the results of using a prototype system that extracts seeds from social media and compared these seeds to those extracted from Google search engine result pages (SERPs). He discovered that, when using SERPs, the probability of finding a URI for a news story diminishes with time. He introduced measures such as distribution of topics, distribution of sources, distribution in time, content diversity, collection exposure, target audience, and more. He covered some of his work on the Local Memory Project as well as work that will be presented at JCDL 2018 and Hypertext 2018. He intends to do further research on hubs and authorities in social media, as well as on evaluating the quality of collections. Alexander's work will help ensure that good-quality seeds make it into web archives, addressing an aspect of curation that has long been a concern for web archives.
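To make one of those measures concrete, here is a minimal sketch of how a "distribution of sources" score might be computed over a set of seed URIs. The entropy-based formulation, the function name, and the example seeds are my own illustrative assumptions, not Alexander's actual implementation.

```python
import math
from collections import Counter
from urllib.parse import urlparse

def source_entropy(seed_uris):
    """Shannon entropy over the hostnames of the seeds; higher values
    suggest the collection draws on a more diverse set of sources."""
    hosts = [urlparse(uri).netloc for uri in seed_uris]
    counts = Counter(hosts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical seeds for an Ebola-related micro-collection
seeds = [
    "https://www.cdc.gov/vhf/ebola/index.html",
    "https://www.who.int/health-topics/ebola",
    "https://www.bbc.com/news/world-africa-28755033",
    "https://www.cdc.gov/vhf/ebola/outbreaks/index.html",
]
print(source_entropy(seeds))  # 3 distinct hosts over 4 seeds = 1.5 bits
```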


Mohamed Aturban



How can we verify the content of web archives? Mohamed presented his work on fixity for mementos. He described problems with temporal violations and playback. He asked whether different web archives agree on the content of mementos produced for the same live resource at the same time, and he showed how "evil" archives could manipulate memento content to produce a different page than existed at the time of capture. So, how do we ensure that a memento has not been altered since the time of capture?

He demonstrated that the playback engine used by a web archive can inadvertently change the displayed memento, so providing a timestamped hash of the memento's HTML alone is not enough. He proposes generating a cryptographic hash for the memento and each of its embedded resources and then generating a manifest of these hashes. The manifests are themselves stored as mementos in multiple web archives. I expect this work to be quite important to the archiving community, addressing a concern that many professional archivists have had for quite some time.
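As a rough illustration of the manifest idea (not Mohamed's actual implementation; the function names and the use of requests and BeautifulSoup are my own assumptions), one could hash a memento's HTML together with its embedded resources like this:

```python
import hashlib
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def sha256_of(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def build_manifest(memento_uri: str) -> dict:
    """Fetch a memento and record hashes for its HTML and embedded resources."""
    response = requests.get(memento_uri)
    manifest = {memento_uri: sha256_of(response.content)}
    soup = BeautifulSoup(response.content, "html.parser")
    # Collect embedded resources: images, stylesheets, and scripts
    for tag, attr in (("img", "src"), ("link", "href"), ("script", "src")):
        for element in soup.find_all(tag):
            resource_uri = element.get(attr)
            if not resource_uri:
                continue
            resource_uri = urljoin(memento_uri, resource_uri)
            try:
                r = requests.get(resource_uri, timeout=10)
                manifest[resource_uri] = sha256_of(r.content)
            except requests.RequestException:
                manifest[resource_uri] = None  # resource unreachable at hashing time
    return manifest
```

The resulting manifest would then need to be timestamped and archived independently so it can later be compared against a replayed memento.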


Other Work Presented



André Greiner-Petter



Research papers use equations all the time. Unfortunately, there is no good method for comparing equations or providing semantic information about them. André Greiner-Petter is working on a method for enriching the equations used in research papers. This will have a variety of uses, such as detecting plagiarism or finding related literature.


Timothy Kanke



How are people using Wikidata? I attended a session on Wikidata at WikiConference USA 2014 but have not really examined it since. Will it be useful for me? How do I participate? Who is involved? Timothy Kanke seeks to answer all of these questions. The Wikidata project has grown over the last few years, feeding information back into the Wikipedia community. Kanke will study the Wikidata community and provide an overview for those who want to use its content, giving us all an understanding of the ways in which Wikidata can work for the scholarly community.

Hany Alsalmi



How many languages do you use for searching? What is the precision of the results when you switch languages, even for the same query? Hany Alsalmi noticed that users who search in English were getting different results than when they searched for the same term in Arabic. Alsalmi will perform studies on users of the Saudi Digital Library to understand how they perform their searches and how successful those searches are. He will also record their reactions to search results, out of concern that users will quit in frustration if the results are insufficient. His work will have implications for search engines in the Arabic-speaking world.

Corinna Breitinger




Scholarly recommendation systems examine papers using text similarity. Can we do better? What about the figures, citations, and equations? Corinna Breitinger will take these text-independent semantic markers into consideration in developing a new recommender approach targeted at STEM fields. She will then create a new visualization concept that helps users view and navigate a collection of similar literature. Such a system will help spot redundant research and find related research in the field.

Susanne Putze



How is research data managed? How can we make data management a “first-class citizen”? Doing so would increase the amount of data shared by researchers and improve its quality. Susanne Putze has extended experiment models to improve data documentation. She will create prototypes and evaluate how well they address data management in the scholarly process. From there she will work on improving knowledge discovery using these prototypes. Her research has implications for how we handle our data and incorporate it into scholarly communications.

Stephen Abrams



How successful are digital preservation efforts? Stephen Abrams is working on creating metrics for this purpose. He plans to evaluate digital preservation from the perspective of communications rather than through preservation management concepts like the quantity, age, or quality of preserved material. Thanks to his presentation I will now examine terms like “verisimilitude”, “semiotic”, and “truthlikeness”. When he is done, we should have better metrics for evaluating things like the trustworthiness of preserved material. His work is more general and theoretical than Mohamed’s, but there is a loose connection to be sure.

Tirthankar Ghosal




Why are papers rejected by editors? Have we done a good job identifying what makes our paper novel? What if we could spot such complex issues in our papers prior to submission? Tirthankar Ghosal seeks to address these concerns by using AI techniques to help researchers and editors more easily identify papers that are likely to be rejected. He has already done some work examining reasons for desk rejections. He will identify methods for detecting what makes a paper novel, whether a paper fits a given journal, and whether it is of sufficient quality to be accepted, and he will create benchmark data that can be used to evaluate papers in the future. His work has large implications for scholarly communication and may affect not only the way we write, but also how submissions are handled in the future.

What Next?


I would like to thank all participants for their input and insight throughout the event. Hearing their feedback for the other participants was quite informative to me as well. We will all have improved candidacy proposals as a result of their input and, more importantly, will use it to improve our contributions to the world.
Updated on 2018/06/09 at 20:50 EDT with embed of Mohamed Aturban's Slideshare.
--Shawn M. Jones
