2018-06-11: Knowledge Discovery From Digital Libraries (KDDL) Workshop Trip Report from JCDL2018

Fort Worth Museum of Science & History 9/11 Tribute

The theme of the workshop on Knowledge Discovery from Digital Libraries (KDDL) was to uncover hidden relationships between data with techniques from artificial intelligence, mathematics, statistics, and algorithms. The workshop organizers, who included ODU Computer Science alumna Dr. Hui Shi, Dr. Wu He, and Dr. Guandong Xu, identified the following objectives for us to explore:
  • Existing and novel techniques to extract and present knowledge from digital libraries;
  • Advanced ways to organize and maintain digital libraries to facilitate knowledge discovery;
  • Knowledge discovery applications in business; and
  • New challenges and technologies brought to the area of knowledge discovery and digital libraries.

The KDDL workshop consisted of three paper presentations, which are summarized here.

Presentation 1: I presented my work on Mining the Web to Approximate University Rankings, which is based on the tech report "University Twitter Engagement: Using Twitter Followers to Rank Universities" (https://arxiv.org/abs/1708.05790) and was discussed in an earlier blog post.

This paper presented an alternative methodology for approximating the academic ranking of a university using social media; specifically, the university's Twitter followers. We identified a strategy for discovering official Twitter accounts, along with a comparative analysis of metrics mined from the web that could be predictors of high academic rank (e.g., athletic expenditures, undergraduate enrollment, endowment value). As expected, schools with more financial resources, reflected in larger enrollments, bigger endowments, and heavier investment in their sports programs, tend to have more Twitter (@Twitter) followers. We also discovered that smaller schools like Wake Forest University can enhance their reputation when they employ faculty with national name recognition (e.g., Melissa Harris-Perry (@MHarris-Perry)). For those wishing to perform further analysis, we have posted all of the ranking and supporting data used in this study in the oduwsdl GitHub repository. This includes a social-media-rich data set containing over 1 million Twitter profiles, ranking data, and other institutional demographics.
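
To make the idea of comparing a follower-based ranking against a published academic ranking concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the school names, follower counts, and academic ranks are invented, and Spearman's rank correlation is used only as one common way to compare two rankings, not necessarily the measure used in the tech report. The simple formula below also assumes there are no tied values.

```python
def ranks(values, descending=True):
    """Return 1-based ranks for values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: (school, Twitter followers, published academic rank)
schools = [
    ("School A", 900_000, 1),
    ("School B", 450_000, 2),
    ("School C", 600_000, 3),
    ("School D", 120_000, 5),
    ("School E", 200_000, 4),
]

# More followers means a better (lower) follower-based rank.
follower_ranks = ranks([followers for _, followers, _ in schools])
academic_ranks = [rank for _, _, rank in schools]

rho = spearman(follower_ranks, academic_ranks)
print(f"Spearman's rho between the two rankings: {rho:.2f}")
```

A value of rho near 1 would suggest the follower-based ordering tracks the published ranking closely, while a value near 0 would suggest little relationship.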

Presentation 2: Basic Science and Technological Innovation: A Classification of Research Publications was presented by Dr. Robert M. Patton, Oak Ridge National Laboratory. This paper explored the context required for funding decision makers, sponsors, and the general public to determine the value of research publications. Core questions addressed the accessibility of massive digital libraries and methods related to the identification of new discoveries, data sets, publications in disparate journals, and new software codes. Dr. Patton asserted that research evaluation has become increasingly complicated and that citation analysis alone is insufficient when considered in the context of the people who control the flow of funding. His presentation of evaluation techniques included altmetrics along with a comparison of Bohr’s, Edison’s, and Pasteur’s quadrants as classifiers, which use the wording of titles and abstracts in conjunction with domain-specific terminology.

A Classification of Research Publications

Presentation 3: Introducing Math QA -- A Math Aware Question Answering System was presented by Felix Hamborg, University of Konstanz. This paper presented a software tool that allows a user to enter a textual request for a math formula (e.g., What is the formula for …?) in English or Hindi and then returns the required parameters and the actual formula from Wikidata. The authors mined 40 million articles in Wikidata, searching for <math> tags to identify 17 thousand general and geometric formulas. They defined a QA System workflow consisting of three distinct modules for calculation, question parsing, and formula retrieval. Their discovery of geometric formulas (e.g., polygons, curves) was slightly more complex, as these formulas can include a nested hierarchy of related data that required traversal of the associated Wikidata subsections. Following evaluation and comparison to a commercial engine, exported information was parsed and ported back into Wikidata. The authors' source code and data are available in their GitHub repository (http://github.com/ag-gipp/MathQa).

A Math Aware Question Answering System
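
The three-module workflow described for Math QA (question parsing, formula retrieval, calculation) can be illustrated with a toy sketch. This is not the authors' implementation: the in-memory `FORMULAS` table is a hypothetical stand-in for Wikidata, the regular expression handles only one English question pattern, and all function names are my own.

```python
import math
import re

# Hypothetical stand-in for formula data that the real system reads from Wikidata.
FORMULAS = {
    "area of a circle": {
        "formula": "A = pi * r^2",
        "parameters": ["r"],
        "compute": lambda r: math.pi * r ** 2,
    },
    "kinetic energy": {
        "formula": "E = 1/2 * m * v^2",
        "parameters": ["m", "v"],
        "compute": lambda m, v: 0.5 * m * v ** 2,
    },
}

QUESTION_PATTERN = re.compile(
    r"what is the formula for (?:the |an? )?(.+?)\??$", re.IGNORECASE
)


def parse_question(question):
    """Question parsing: pull the concept name out of the question text."""
    match = QUESTION_PATTERN.match(question.strip())
    return match.group(1).lower() if match else None


def retrieve_formula(concept):
    """Formula retrieval: look the concept up in the toy formula table."""
    return FORMULAS.get(concept)


def calculate(entry, values):
    """Calculation: evaluate the formula once every parameter has a value."""
    missing = [name for name in entry["parameters"] if name not in values]
    if missing:
        raise ValueError(f"missing parameter values: {missing}")
    return entry["compute"](**{name: values[name] for name in entry["parameters"]})


def answer(question, **values):
    """Run the three modules in order; return (formula, parameters, result)."""
    concept = parse_question(question)
    entry = retrieve_formula(concept) if concept else None
    if entry is None:
        return None
    result = calculate(entry, values) if values else None
    return entry["formula"], entry["parameters"], result


print(answer("What is the formula for kinetic energy?", m=2.0, v=3.0))
```

Calling `answer` without parameter values returns only the formula text and the list of parameters the user would need to supply, which mirrors the two-step interaction described above.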

Following the paper presentations, the workshop participants divided into two groups for a breakout session in which we discussed Challenges and Research Trends in Knowledge Discovery from Digital Libraries and Beyond. Each group was asked to offer opinions and provide summary responses for each of the following topics:
  • What are your reactions to the paper presentations? What did you learn that you didn’t previously know?
  • What are the current techniques, applications, and/or research questions that you are addressing in Knowledge Discovery from Digital Libraries and Beyond? What are the biggest impediments or challenges limiting Knowledge Discovery from Digital Libraries and Beyond?
  • What are your top priorities in implementing Knowledge Discovery from Digital Libraries and Beyond? 
  • What resources and/or support do you need to implement? 
  • What areas will you recommend for research? How do you think artificial intelligence (AI) can benefit knowledge discovery in digital libraries? 
  • Suggestions for coordination of research and future collaboration.

Collectively, my group's responses centered on the themes of data curation with less reliance on subject matter experts, methods or tools to make data more self-documenting, and new strategies for relationship extraction between linked entities. There was also considerable discussion related to reproducible research using common repositories and formats conducive to sharing data (e.g., XML) and open access to both software and the peer review process.

I would like to thank Old Dominion University for the Graduate Student Travel Award which helped to facilitate my participation in the JCDL conference and this workshop.

--Corren (@correnmccoy)