2018-06-08: Joint Conference on Digital Libraries (JCDL) Doctoral Consortium Trip Report

On June 3, 2018, PhD students arrived in Fort Worth, Texas to attend the Joint Conference on Digital Libraries Doctoral Consortium. This is a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. This event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and organizers, but also by those who were genuinely interested in emerging work. As usual, I live-tweeted the event to capture salient points. It was a very enjoyable experience for all.

Thanks very much to the chairs. In this post I will cover the work of all accepted students, three of whom are from the Web Science and Digital Libraries Research Group at Old Dominion University. I would also like to thank the assigned mentors of the Doctoral Consortium, who provided insight and guidance not only to their own assigned students, but to the rest of us as well.

WS-DL Presentations

Shawn M. Jones

How does a researcher differentiate between web archive collections that cover the same topic? Some web archive collections consist of 100,000+ seeds, each with multiple mementos. There were more than 8,000 collections in Archive-It as of the end of 2016. Existing metadata in Archive-It collections is insufficient because the metadata is produced by different curators from different organizations applying different content standards and different rules of interpretation.

As part of my doctoral consortium submission, I proposed improving upon the solution piloted by Yasmin AlNoamany. She generated a series of representative mementos and then submitted them to the social media storytelling platform Storify in order to provide a summary of each collection.

As part of my preliminary work I presented some findings that will be published at iPres 2018. We discovered four semantic categories of Archive-It collections: collections where an organization archived itself, collections about a specific subject, collections about expected events or time periods, and collections about spontaneous events. The collections AlNoamany used in her work fit into the last category. This also turned out to be the smallest category of collections, meaning that there are many other types of collections not evaluated by her method.

She showed that humans could not tell the difference between her automatically generated stories and other stories generated by humans. She did not, however, provide evidence that the visualization was useful for collection understanding. We also have the problem that Storify is no longer in service, something that I mentioned in a previous blog post.

My plan includes developing a flexible framework that allows us to test different methods of selecting representative mementos. This framework will also allow us to test different types of visualizations using those representative mementos. Some of these visualizations may make use of different social media platforms.

I plan to evaluate these approaches by first creating user tasks that indicate whether a user understands aspects of a collection. With these tasks I then intend to evaluate different solutions via user testing. The solutions that score best in testing will address a large problem inherent to the scale of web archives.

Alexander Nwala

How do we find high quality seeds for generating web archive collections? Alexander is focusing on a different aspect of web archive collections than I am. I am analyzing existing collections. He is building collections from seeds supplied by social media users. He notes that users often create "micro-collections" of web resources, typically surrounding an event. Using examples like Ebola epidemics, the Ukraine crisis, and school shootings, Alexander asks whether seeds generated by social media are comparable to those generated by professional curators. He also seeks quantitative methods for evaluating collections. Finally, he wants to evaluate the quality of collections at scale.
He demonstrated the results of using a prototype system that extracts seeds from social media and compared these seeds to those extracted from Google search engine result pages (SERPs). He discovered that, when using SERPs, the probability of finding a URI for a news story diminishes with time. He introduced evaluation measures such as distribution of topics, distribution of sources, distribution in time, content diversity, collection exposure, target audience, and more. He covered some of his work on the Local Memory Project as well as work that will be presented at JCDL 2018 and Hypertext 2018. He intends to do further research on hubs and authorities in social media, as well as on evaluating the quality of collections. Alexander's work aims to help ensure that good quality seeds make it into web archives, addressing an aspect of curation that has long been an area of concern in web archives.
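To make a measure like "distribution of sources" concrete, here is a minimal Python sketch. It is not Alexander's implementation: the function names, the example URIs, and the use of Jaccard overlap to compare a social media seed list against a SERP seed list are all my own assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def source_distribution(seed_uris):
    """Return the fraction of seeds contributed by each hostname."""
    hosts = [urlsplit(uri).hostname or "" for uri in seed_uris]
    total = len(hosts)
    if total == 0:
        return {}
    return {host: count / total for host, count in Counter(hosts).items()}


def jaccard_overlap(seeds_a, seeds_b):
    """Return |A intersect B| / |A union B| for two collections of seed URIs."""
    a, b = set(seeds_a), set(seeds_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical seed lists, invented for this example.
social_seeds = [
    "https://news.example/a",
    "https://news.example/b",
    "https://blog.example/c",
]
serp_seeds = [
    "https://news.example/a",
    "https://wire.example/d",
]

print(source_distribution(social_seeds))
print(jaccard_overlap(social_seeds, serp_seeds))  # 1 shared URI out of 4 distinct -> 0.25
```

A collection dominated by one hostname would show up as a very skewed distribution, which is one simple way to put a number on how varied a set of seeds is.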

Mohamed Aturban

How can we verify the content of web archives? Mohamed presented his work on fixity for mementos. He described problems with temporal violations and playback. He asked whether different web archives agreed on the content of mementos produced for the same live resource at the same time. He showed how "evil" archives could potentially manipulate memento content to produce a different page than existed at the time of capture. So, how do we ensure that a memento has not been altered since the time of capture?

He demonstrated that the playback engine used by a web archive can inadvertently change the result of the displayed memento. Just providing a timestamped hash of the memento HTML is not enough. He proposes generating a cryptographic hash for the memento and all embedded resources and then generating a manifest of these hashes. These manifests will then themselves be stored as mementos in multiple web archives. I expect this work to be quite important to the archiving community, addressing a concern that many professional archivists have had for quite some time.
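The idea of hashing a memento together with its embedded resources and recording the results in a manifest can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Mohamed's implementation: the choice of SHA-256, the manifest layout, the function names, and the example URIs are all hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(memento_uri: str, resources: Dict[str, bytes]) -> dict:
    """Hash the root HTML and every embedded resource, then collect the digests.

    `resources` maps each URI to the raw bytes an archive returned for it.
    The manifest layout is hypothetical, chosen only for illustration.
    """
    hashes = {uri: sha256_hex(body) for uri, body in sorted(resources.items())}
    manifest = {"memento": memento_uri, "algorithm": "sha256", "hashes": hashes}
    # Hash a canonical JSON form so the manifest itself can be verified later.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_hash"] = sha256_hex(canonical.encode("utf-8"))
    return manifest


def changed_resources(old: dict, new: dict) -> List[str]:
    """List URIs whose digests differ, or that appear in only one manifest."""
    uris = set(old["hashes"]) | set(new["hashes"])
    return sorted(u for u in uris if old["hashes"].get(u) != new["hashes"].get(u))


# Hypothetical capture: one HTML page and one embedded image.
memento = "https://archive.example/20180603000000/https://example.com/"
captured = {
    "https://example.com/": b"<html><img src='logo.png'></html>",
    "https://example.com/logo.png": b"fake image bytes",
}
original = build_manifest(memento, captured)

# Simulate an archive that later serves altered HTML for the same memento.
tampered = dict(captured)
tampered["https://example.com/"] = b"<html>altered</html>"
later = build_manifest(memento, tampered)

print(changed_resources(original, later))  # ['https://example.com/']
```

Because the digest covers every embedded resource, swapping an image or script would be detected just as a change to the HTML would. Storing copies of the manifest in several independent archives is what would make it hard for any single archive to alter both the memento and its record.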

Other Work Presented

André Greiner-Petter

Research papers use equations all of the time. Unfortunately, there isn't a good method of comparing equations or providing semantic information about them. André Greiner-Petter is working on creating a method of enriching the equations used in research papers. This will have a variety of uses, such as detecting plagiarism or finding related literature.

Timothy Kanke

How are people using Wikidata? I had attended a session on Wikidata at WikiConference USA 2014, but have not really examined it since. Will it be useful for me? How do I participate? Who is involved? Timothy Kanke seeks to understand the answers to all of these questions. The Wikidata project has grown over the last few years, feeding information back into the Wikipedia community. Kanke will study the Wikidata community and provide a good overview for those who want to use its content. His work should give us all an understanding of the ways in which Wikidata can work for the scholarly community.

Hany Alsalmi

How many languages do you use for searching? What is the precision of the results when you switch languages, even for the same query? Hany Alsalmi noticed that users who searched in English got different results than when they searched for the same term in Arabic. Alsalmi will perform studies on users of the Saudi Digital Library to understand how they perform their searches and how successful those searches are. He will also record their reactions to search results, with the concern being that users will quit in frustration if the results are insufficient. His work will have implications for search engines in the Arabic-speaking world.

Corinna Breitinger

Scholarly recommendation systems examine papers using text similarity. Can we do better? What about the figures, citations, and equations? Corinna Breitinger will take all of these text-independent semantic markers into consideration with the development of a new recommender approach targeted at STEM fields. Once that is done, she will create a new visualization concept that will help users view and navigate a collection of similar literature. Such a system will help spot redundant research and also help us find related research in the field.
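For readers unfamiliar with the text-similarity baseline mentioned above, here is a minimal Python sketch of one common form of it: cosine similarity over term counts. This is my own illustration, not Corinna's approach. The tokenizer, the function names, and the example strings are assumptions.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphanumeric tokens in a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return the cosine of the angle between two term-count vectors."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical abstracts, invented for this example.
paper_1 = "Summarizing web archive collections with representative mementos"
paper_2 = "Selecting representative mementos from web archive collections"
paper_3 = "Cross-language search behavior in digital libraries"

print(cosine_similarity(paper_1, paper_2))
print(cosine_similarity(paper_1, paper_3))
```

A measure like this only sees shared words. Two papers that use the same equation or cite the same sources but describe them with different vocabulary would score low, which is the gap that figures, citations, and equations are meant to fill.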

Susanne Putze

How is research data managed? How can we facilitate making data management a “first-class citizen”? Doing so would increase the amount of data shared by researchers as well as improve its quality. Susanne Putze has extended experiment models to improve data documentation. She will create prototypes and evaluate how well they work to address data management in the scholarly process. From there she will begin the process of improving knowledge discovery using these prototypes. Her research has implications for how we handle our data and incorporate it into scholarly communications.

Stephen Abrams

How successful are digital preservation efforts? Stephen Abrams is working on creating metrics for this purpose. He is planning on evaluating digital preservation from the perspective of communications rather than through preservation management concepts like quantities, ages, or quality of preserved material. Thanks to his presentation I will now examine terms like “verisimilitude”, “semiotic”, and “truthlikeness”. When he is done, we should have better metrics to evaluate things like the trustworthiness of preserved material. His work is more general and theoretical than Mohamed’s, but there is a loose connection to be sure.

Tirthankar Ghosal

Why are papers rejected by editors? Have we done a good job identifying what makes our paper novel? What if we could spot such complex issues in our papers prior to submission? Tirthankar Ghosal seeks to help address these concerns by using AI techniques to help researchers and editors more easily identify papers that will likely be rejected. He has already done some work examining reasons for desk rejections. He will identify methods for detecting what makes a paper novel, whether a paper is fit for a given journal, and whether it is of sufficient quality to be accepted. Lastly, he will create benchmark data that can be used to evaluate papers in the future. His work has large implications for scholarly communication and may affect not only the way we write, but also how submissions are handled in the future.

What Next?

I would like to thank all participants for their input and insight throughout the event. Hearing their feedback for other participants was quite informative to me as well. We will all have improved candidacy proposals as a result of their input and, more importantly, will use this input to improve our contributions to the world.
Updated on 2018/06/09 at 20:50 EDT with embed of Mohamed Aturban's Slideshare.
--Shawn M. Jones