Tuesday, June 12, 2018

2018-06-11: Knowledge Discovery From Digital Libraries (KDDL) Workshop Trip Report from JCDL2018


Fort Worth Museum of Science & History 9/11 Tribute

The theme of the Knowledge Discovery from Digital Libraries (KDDL) workshop was uncovering hidden relationships within data using techniques from artificial intelligence, mathematics, statistics, and algorithms. The workshop organizers, who included ODU Computer Science alumna Dr. Hui Shi, along with Dr. Wu He and Dr. Guandong Xu, identified the following objectives for us to explore:
  • Existing and novel techniques to extract and present knowledge from digital libraries;
  • Advanced ways to organize and maintain digital libraries to facilitate knowledge discovery;
  • Knowledge discovery applications in business; and
  • New challenges and technologies brought to the area of knowledge discovery and digital libraries.

The KDDL workshop consisted of three paper presentations which are summarized here.

Presentation 1: I presented my work on Mining the Web to Approximate University Rankings, which is based on the tech report "University Twitter Engagement: Using Twitter Followers to Rank Universities" (https://arxiv.org/abs/1708.05790) and was discussed in an earlier blog post.


This paper presented an alternative methodology for approximating the academic ranking of a university using social media; specifically, the university's Twitter followers. We identified a strategy for discovering official Twitter accounts along with a comparative analysis of metrics mined from the web which could be predictors of high academic rank (e.g., athletic expenditures, undergraduate enrollment, endowment value). As expected, schools with more financial resources tend to have more Twitter (@Twitter) followers based on larger enrollments, big endowments, and big investments in their sport programs. We also discovered that smaller schools like Wake Forest University can enhance their reputation when they employ faculty with national name recognition (e.g., Melissa Harris-Perry (@MHarrisPerry)). For those wishing to perform further analysis, we have posted all of the ranking and supporting data used in this study, which include a social media-rich data set containing over 1 million Twitter profiles, ranking data, and other institutional demographics, in the oduwsdl GitHub repository.
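For readers who want a concrete sense of the kind of comparison involved, below is a minimal, hypothetical sketch (not the paper's methodology or data) showing how a web-mined metric such as Twitter follower count can be checked against a published academic rank with a Spearman rank correlation; the numbers are invented for illustration.

    from scipy.stats import spearmanr

    # Hypothetical toy data (not from the study): official rank (lower = better)
    # and a web-mined metric (Twitter follower counts) for six schools.
    official_rank = [1, 2, 3, 4, 5, 6]
    twitter_followers = [950_000, 810_000, 400_000, 450_000, 120_000, 90_000]

    # A strongly negative rho suggests more followers coincide with a better (lower) rank.
    rho, p_value = spearmanr(official_rank, twitter_followers)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")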

Presentation 2: Basic Science and Technological Innovation: A Classification of Research Publications was presented by Dr. Robert M. Patton, Oak Ridge National Laboratory. This paper explored the context required for funding decision makers, sponsors, and the general public to determine the value of research publications. Core questions addressed the accessibility of massive digital libraries and methods related to the identification of new discoveries, data sets, publications in disparate journals, and new software codes. Dr. Patton asserted that research evaluation has become increasingly complicated and that citation analysis alone is insufficient when considered within the context of the people who control the flow of funding. His presentation of evaluation techniques included altmetrics along with a comparison of Bohr’s, Edison’s, and Pasteur’s quadrants as classifiers which use the wording of titles and abstracts in conjunction with domain-specific terminology.

A Classification of Research Publications


Presentation 3: Introducing Math QA -- A Math Aware Question Answering System was presented by Felix Hamborg, University of Konstanz. This paper presented a software tool that allows a user to enter a textual request for a math formula (e.g., What is the formula for …?) in English or Hindi and then be presented with the required parameters and the actual formula from Wikidata. The authors mined 40 million articles in Wikidata, searching for <math> tags to identify 17 thousand general and geometric formulas. They defined a QA system workflow consisting of three distinct modules for calculation, question parsing, and formula retrieval. Their discovery of geometric formulas (e.g., polygons, curves) was slightly more complex, as these formulas can include a nested hierarchy of related data that required traversal of the associated Wikidata subsections. Following evaluation and comparison to a commercial engine, exported information was parsed and ported back into Wikidata. The authors' source code and data are available in their GitHub repository (http://github.com/ag-gipp/MathQa).
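As a rough illustration of that three-module workflow, here is a minimal Python sketch; the hard-coded formula table stands in for the Wikidata-derived store and every function name is hypothetical, so it mirrors the shape of the system rather than the authors' implementation.

    import re

    # Hypothetical stand-in for the Wikidata-backed formula store:
    # formula name -> (required parameter names, callable that evaluates it)
    FORMULAS = {
        "the area of a circle": (["r"], lambda r: 3.141592653589793 * r ** 2),
        "the volume of a sphere": (["r"], lambda r: 4.0 / 3.0 * 3.141592653589793 * r ** 3),
    }

    def parse_question(question):
        """Question-parsing module: pull the formula name out of a
        'What is the formula for ...?' style request."""
        match = re.search(r"formula for (.+?)\??$", question.strip(), re.IGNORECASE)
        return match.group(1).lower() if match else None

    def retrieve_formula(name):
        """Formula-retrieval module: look the name up in the (stand-in) store."""
        return FORMULAS.get(name)

    def answer(question, **values):
        """Calculation module: evaluate the retrieved formula with supplied values."""
        entry = retrieve_formula(parse_question(question))
        if entry is None:
            return None
        params, formula = entry
        return formula(*[values[p] for p in params])

    print(answer("What is the formula for the area of a circle?", r=2.0))  # ~12.566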

A Math Aware Question Answering System

Following the paper presentations, the workshop participants divided into two groups to conduct a breakout session where we discussed Challenges and Research Trends in Knowledge Discovery from Digital Libraries and Beyond.  Each group was asked to offer opinions and provide summary responses for each of the following topics:
  • What are your reactions to the paper presentations? What did you learn that you didn’t previously know?
  • What are the current techniques, applications, and/or research questions that you are addressing in Knowledge Discovery from Digital Libraries and Beyond? What are the biggest impediments or challenges limiting Knowledge Discovery from Digital Libraries and Beyond?
  • What are your top priorities in implementing Knowledge Discovery from Digital Libraries and Beyond? 
  • What resources and/or support do you need to implement? 
  • What areas will you recommend for research? How do you think artificial intelligence (AI) can benefit knowledge discovery in digital libraries? 
  • Suggestions for coordination of research and future collaboration.

Collectively, my group's responses centered on the themes of data curation with less reliance on subject matter experts, methods or tools to make data more self-documenting, and new strategies for relationship extraction between linked entities. There was also considerable discussion related to reproducible research using common repositories and formats conducive to sharing data (e.g., XML) and open access to both software and the peer review process.

I would like to thank Old Dominion University for the Graduate Student Travel Award which helped to facilitate my participation in the JCDL conference and this workshop.

--Corren (@correnmccoy)

Monday, June 11, 2018

2018-06-11: Web Archiving and Digital Libraries (WADL) Workshop Trip Report from JCDL2018

Mat Kelly reports on the Web Archiving and Digital Libraries (WADL) Workshop 2018 that occurred in Fort Worth, Texas.


On June 6, 2018, after attending JCDL 2018 (trip report), WS-DL members attended the Web Archiving and Digital Libraries 2018 Workshop (#wadl2018) in Fort Worth, Texas (see trip reports from WADL 2017, 2016, 2015, 2013). WS-DL contributed multiple presentations to the workshop, including the workshop keynote by my PhD advisor, all of which I discuss below.

The Project Panel

Martin Klein (@mart1nkle1n) initially welcomed the workshop attendees and had the group of 26-or-so participants give a quick overview of who they were and their interest in attending. He then introduced Zhiwu Xie (@zxie) of Virginia Tech to begin the series of presentations reporting on the kickoff of the IMLS-funded project (as established at WADL 2017) "Continuing Education to Advance Web Archiving". A distinguishing feature of this project compared to others, Zhiwu said, is that it uses project-based problem solving rather than producing surveys and lectures. He highlighted a collection of curriculum modules that apply existing practice (event archiving) to various Web archiving tools (e.g., Social Feed Manager (SFM), ArchiveSpark, and Archives Unleashed Toolkit) to build an understanding of the fundamentals (e.g., web, data science, big data) and produce experience in libraries, archives, and programming. The focus is on individuals who have some prior experience with archives rather than on training those with no experience in the area.

ODU WS-DL's Michael Nelson (@phonedude_mln) continued by explaining that one motivation is to encourage storytelling using Web archives and how that has been hampered by the recent closing of Storify. Some recent work of the group (including the in-development project MementoEmbed) would allow this concept to be revitalized despite Storify's demise through systematic "card" generation for mementos, allowing a more persistent (in the preservation sense) version of the story to be extracted and retained.

Justin Littman (@justin_littman) of George Washington University Libraries continued the project description by describing Social Feed Manager (SFM) and emphasized that what you get from the Twitter API may well differ from what you get from the Web interface. The purpose of SFM is to be an easy-to-use, self-service Web interface that drives down the barriers to collecting social media data for academic research.

Ian Milligan (@ianmilligan1) continued by giving a quick run-down of his group's Archives Unleashed projects, noting a realization during the project's development that not all historians like working with the command line and Scala. He then briefly described the project's filter-analyze-aggregate-visualize approach to making large collections of Web archives more usable for research.

Wrapping up the project report, Ed Fox described Virginia Tech's initial attempts at performing crawls with Heritrix via Archive-It and how noisy the results were. He emphasized that the typical crawling approach of starting with seed URIs harvested from tweets does not work well. The event model his group is developing and evaluating will help guide the crawling procedure.

Ed's presentation completed the series of reports for the IMLS project panel and was followed by a series of individual presentations.

Individual Presentations

John Berlin (@johnaberlin) started off with an abbreviated version of his Master's Thesis titled, "Swimming In A Sea Of JavaScript, Or: How I Learned To Stop Worrying And Love High-Fidelity Replay". While John had recently given his defense in April (see his post for more details), this presentation focused on some of the more problematic aspects of archival replay caused by JavaScript. He highlighted specific instances where the efforts of a replay system to accurately replay JavaScript varied from causing a page to display a completely blank viewport (see CNN.com has been unarchivable since November 1st, 2016) to the representation being hijacked to declare Brian Williams as the originator of "Gin and Juice" long before Snoop Dogg(y Dogg). John has created a Chrome and Firefox extension he dubbed "Wayback Plus Plus" that mitigates JavaScript-based replay issues using client-side redirects. See his presentation for more details.

The workshop participants then took a break to grab a boxed lunch, after which Ed Fox again presented, this time "A Study of Historical Short URLs in Event Collections of Tweets". In this work Ed highlighted the number of tweets in their collections that contained URLs, namely that 10% had two URLs and less than 0.5% had three or more. From this collection, his group analyzed how many of the linked URLs are still accessible in the Internet Archive's Wayback Machine, emphasizing that the Wayback Machine does not cover much of what appears in the Twitter data he has gathered. His group also analyzed the time difference between when a tweet with URLs was posted and when the URLs were archived, finding that 50% were archived within 5 days of the tweet being posted.
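As a small illustration of that kind of measurement, the sketch below queries the Internet Archive's public Wayback Machine availability API for the capture closest to a tweet's timestamp and reports the delay in days; note that the closest capture is not necessarily the first one made after the tweet, and the analysis in the paper is more involved than this.

    from datetime import datetime
    import requests

    def closest_capture_delay_days(url, tweet_time):
        """Ask the Wayback Machine availability API for the capture closest to the
        tweet's timestamp and return the delay in days (None if never captured)."""
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url, "timestamp": tweet_time.strftime("%Y%m%d%H%M%S")},
            timeout=30,
        )
        resp.raise_for_status()
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if not closest or not closest.get("available"):
            return None
        captured = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
        return abs((captured - tweet_time).total_seconds()) / 86400.0

    # Example: delay between a hypothetical June 1, 2018 tweet linking a page
    # and the nearest Wayback Machine capture of that page.
    print(closest_capture_delay_days("https://www.cnn.com/", datetime(2018, 6, 1)))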

Keynote

The workshop keynote, "Enabling Personal Use of Web Archives", was next, presented by my PhD advisor Dr. Michele C. Weigle (@weiglemc). Her presentation began with a high-level overview of the needs of those who want to perform personal Web archiving and the tools that the WS-DL group has created over the years to address those needs. She highlighted the group's early work in identifying disasters in existing archives, segueing into the realization that many archive users are unaware that there are archives beyond the Internet Archive.

In her (our) group's efforts to encourage Web users to Archive What They See Now, they created the WARCreate Chrome extension to create WARC files from any Web page. To resolve the issue of what a user is to do with their WARCs, they then created the Web Archiving Integration Layer (WAIL) (and later an Electron version) to allow individuals to control both the preservation and replay processes. To give users a better picture of the archived Web as they browse, they created the Chrome extension Mink, which shows how well archived (in terms of quantity of mementos) the URI currently being viewed is and optionally (and easily) submits that URI to one to three Web archives.

Dr. Weigle also highlighted the work of other WS-DL students of past and present like Yasmin Anwar's (@yasmina_anwar) Dark and Stormy Archives (DSA) and Shawn Jones' (@shawnmjones) upcoming MementoEmbed tool.

Following the tool review, Dr. Weigle asked, "What if browsers could natively interpret and replay WARCs?" She gave a high-level review of what could be possible if the compatibility barriers between the archived and live Web were resolved through live Web tools that could natively interact with the archived Web. In one example, she provided a screenshot in which, in place of the "secure" badge a browser provides, the browser is aware that it is viewing an archived page and indicates as much.

Libby Hemphill (@libbyh) presented next with "Developing a Social Media Archive at ICPSR", in which her group seeks to make data useful for people in the long-distant future who want to understand how we live today. She mentioned how messy the ethical challenges of archiving social media data can be and that people have different levels of comfort depending on what sort of research their social media content is used for. She outlined the architecture of their social media archive SOMAR, covering federating data to follow the terms of service, rehydrating tweets to follow the terms of research, and other aspects of the social-media-to-research-data process.

The workshop then took another break with a simultaneous poster session including a poster by Justin Littman titled, "Supporting social media research at scale" and WS-DL's Sawood Alam's (@ibnesayeed) "A Survey of Archival Replay Banners". Just prior to their poster presentations, each gave a lightning talk as a quick overview to entice attendees into stopping by.

After the break, WS-DL's Mohamed Aturban (@maturban1) presented "It is Hard to Compute Fixity on Archived Web Pages". Mohamed's work highlighted that subtle changes in content may be difficult to detect using conventional hashing methods to compute the fixity of Web pages. He emphasized that computing the fixity of the root HTML page of a memento is not enough; the fixity must also be computed over all embedded resources. With an approach utilizing Merkle trees, he generates a hash of the composite memento that represents the fixity of all embedded resources. In one example, highlighted in his recent post and tech report, Mohamed showed the manipulation of Climate Change data.
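To make the idea concrete, here is a minimal sketch of the approach (my own illustration, not Mohamed's implementation): hash each embedded resource of a memento, then combine the hashes Merkle-tree style into a single root hash that stands for the fixity of the composite memento.

    import hashlib

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def composite_memento_hash(resources):
        """Combine per-resource hashes into one Merkle-style root hash.
        `resources` maps each URI-M to the raw bytes of its response."""
        level = [sha256(body) for _, body in sorted(resources.items())]
        if not level:
            return None
        while len(level) > 1:
            if len(level) % 2:                    # duplicate the last hash on odd levels
                level.append(level[-1])
            level = [sha256((a + b).encode()) for a, b in zip(level[0::2], level[1::2])]
        return level[0]

    # Toy composite memento: the root HTML page plus two embedded resources.
    resources = {
        "https://example.org/page.html": b"<html>...</html>",
        "https://example.org/style.css": b"body { color: black }",
        "https://example.org/logo.png": b"\x89PNG...",
    }
    print(composite_memento_hash(resources))  # one fixity value for the whole page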

To wrap up the presentations for the workshop, I (Mat Kelly, @machawk1) presented "Client-Assisted Memento Aggregation Using the Prefer Header". This work highlighted one particular aspect of my presentation the previous day at JCDL 2018 (see blog post), namely how the framework from that presentation facilitates specifying which archives are aggregated using Memento. A previous investigation by Jones, Van de Sompel et al. (see "Mementos in the Raw, Take Two") used the HTTP Prefer header to allow a client to request the un-rewritten version of mementos from an archival replay system. In my work, I imagined a more capable Memento aggregator that would expose the archives it aggregates and allow a client, basing its customizations on the aggregator's response, to customize the set of archives aggregated by sending that set as base64-encoded data in the Prefer request header.
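A hypothetical client-side sketch of that interaction follows; the aggregator endpoint and the exact Prefer header syntax are assumptions made only for illustration, since the work itself defines the actual semantics.

    import base64
    import json
    import requests

    # Hypothetical aggregator endpoint and preference name (the real grammar may differ).
    AGGREGATOR = "https://aggregator.example.org/timemap/link/"
    archives = ["https://web.archive.org/web/", "https://archive.today/"]

    # Encode the client's chosen archive set as base64 and send it via the Prefer header.
    token = base64.b64encode(json.dumps(archives).encode()).decode()
    response = requests.get(
        AGGREGATOR + "https://www.cnn.com/",
        headers={"Prefer": f'archives="{token}"'},
        timeout=30,
    )
    print(response.status_code)
    print(response.headers.get("Preference-Applied"))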

Closing

When I was through with the final presentation, Ed Fox began the wrap-up of the workshop, opening the floor to all attendees for comments and recommendations for the future of the workshop. With the discussion finished, the workshop came to a close. As usual, I found this workshop extremely informative, even though I was familiar with much of the participants' previous work. I hope, as other attendees also expressed, to encourage other fields to become involved and to present their ongoing work and ideas at this informal workshop. Doing so, from the perspective of both an attendee and a presenter, has proven valuable.

Mat (@machawk1)

Friday, June 8, 2018

2018-06-08: Joint Conference on Digital Libraries (JCDL) Doctoral Consortium Trip Report




On June 3, 2018, PhD students arrived in Fort Worth, Texas to attend the Joint Conference on Digital Libraries Doctoral Consortium. This is a pre-conference event associated with the ACM and IEEE-CS Joint Conference on Digital Libraries. This event gives PhD students a forum in which to discuss their dissertation work with others in the field. The Doctoral Consortium was well attended, not only by the presenting PhD students, their advisors/supervisors, and organizers, but also by those who were genuinely interested in emerging work. As usual, I live-tweeted the event to capture salient points. It was a very enjoyable experience for all.

Thanks very much to the chairs. In this post I will cover the work of all accepted students, three of whom are from the Web Science and Digital Libraries Research Group at Old Dominion University. I would also like to thank the assigned mentors of the Doctoral Consortium, who provided insight and guidance not only to their own assigned students, but to the rest of us as well.

WS-DL Presentations



Shawn M. Jones




How does a researcher differentiate between web archive collections that cover the same topic? Some web archive collections consist of 100,000+ seeds, each with multiple mementos. There were more than 8,000 collections in Archive-It as of the end of 2016. Existing metadata in Archive-It collections is insufficient because it is produced by different curators from different organizations applying different content standards and different rules of interpretation.

As part of my doctoral consortium submission, I proposed improving upon the solution piloted by Yasmin AlNoamany. She generated a series of representative mementos and then submitted them to the social media storytelling platform Storify in order to provide a summary of each collection. As part of my preliminary work I presented some findings that will be published at iPres 2018. We discovered four semantic categories of Archive-It collections: collections where an organization archived itself, collections about a specific subject, collections about expected events or time periods, and collections about spontaneous events. The collections AlNoamany used in her work fit into the last category. This also turned out to be the smallest category of collections, meaning that there are many other types of collections not evaluated by her method. She showed that humans could not tell the difference between her automatically generated stories and other stories generated by humans. She did not, however, provide evidence that the visualization was useful for collection understanding. We also have the problem that Storify is no longer in service, something that I mentioned in a previous blog post.

My plan includes developing a flexible framework that allows us to test different methods of selecting representative mementos. This framework will also allow us to test different types of visualizations using those representative mementos. Some of these visualizations may make use of different social media platforms. I plan to evaluate these collections by first creating user tasks that give us some idea that a user understands aspects of a collection. With these tasks I intend to then evaluate different solutions via user testing. The solutions that score best from the testing will address a large problem inherent to the scale of web archives.


Alexander Nwala




How do we find high quality seeds for generating web archive collections? Alexander is focusing on a different aspect of web archive collections than I am. I am analyzing existing collections. He is building collections from seeds supplied by social media users. He notes that users often create "micro-collections" of web resources, typically surrounding an event. Using examples like ebola epidemics, the Ukraine crisis, and school shootings, Alexander asks if seeds generated by social media are comparable to those generated by professional curators. He also seeks quantitative methods for evaluating collections. Finally, he wants to evaluate the quality of collections at scale.
He demonstrated the results of using a prototype system that extracts seeds from social media and compared these seeds to those extracted from Google search engine result pages (SERPs). He discovered that, when using SERPs, the probability of finding a URI for a news story diminishes with time. He introduced evaluation measures such as distribution of topics, distribution of sources, distribution in time, content diversity, collection exposure, and target audience. He covered some of his work on the Local Memory Project as well as work that will be presented at JCDL 2018 and Hypertext 2018. He intends to do further research on hubs and authorities in social media, as well as evaluating the quality of collections. Alexander will ensure that good-quality seeds make it into web archives, addressing an aspect of curation that has long been an area of concern in web archives.


Mohamed Aturban



How can we verify the content of web archives? Mohamed presented his work on fixity for mementos. He described problems with temporal violations and playback. He asked whether different web archives agreed on the content of mementos produced for the same live resource at the same time. He showed how "evil" archives could potentially manipulate memento content to produce a different page than existed at the time of capture. So, how do we ensure that the memento was unaltered since the time of capture?

He demonstrated that the playback engine used by a web archive can inadvertently change the result of the displayed memento. Just providing a timestamped hash of the memento HTML is not enough. He proposes generating a cryptographic hash for the memento and all embedded resources and then generating a manifest of these hashes. This manifest is then itself stored as a memento in multiple web archives. I expect this work to be quite important to the archiving community, addressing a concern that many professional archivists have had for quite some time.


Other Work Presented



André Greiner-Petter



Research papers use equations all of the time. Unfortunately, there isn't a good method of comparing equations or providing semantic information about them. André Greiner-Petter is working on creating a method of enriching the equations used in research papers. This will have a variety of uses, such as detecting plagiarism or finding related literature.


Timothy Kanke



How are people using Wikidata? I had attended a session on Wikidata at WikiConference USA 2014, but have not really examined it since. Will it be useful for me? How do I participate? Who is involved? Timothy Kanke seeks to understand the answers to all of these questions. The Wikidata project has grown over the last few years, feeding information back into the Wikipedia community. Kanke will study the Wikidata community and provide a good overview for those who want to use its content. Using his work, we will all have an understanding of the overall ways in which Wikidata can work for the scholarly community.

Hany Alsalmi



How many languages do you use for searching? What is the precision of the results when you switch languages, even for the same query? Hany Alsalmi noticed that users who search in English were getting different results than when they searched for the same term in Arabic. Alsalmi will perform studies on users of the Saudi Digital Library to understand how they perform their searches and how successful those searches are. He will also record their reactions to search results, with the concern being that the user will quit in frustration if the results are insufficient. His work will have implications for search engines in the Arabic-speaking world.

Corinna Breitinger




Scholarly recommendation systems examine papers using text similarity. Can we do better? What about the figures, citations, and equations? Corinna Breitinger will take all of these text-independent semantic markers into consideration with the development of a new recommender approach targeted at STEM fields. Once that is done, she will create a new visualization concept that will help users view and navigate a collection of similar literature. The benefits of such a system will help spot redundant research and also help us find related research in the field.

Susanne Putze



How is research data managed? How can we facilitate making data management a “first-class citizen”? To do so would improve the amount of data shared by researchers as well as its quality. Susanne Putze has extended experiment models to improve data documentation. She will create prototypes and evaluate how well they work to address data management in the scholarly process. From there she will begin the process of improving knowledge discovery using these prototypes. Her research has implications for how we handle our data and incorporate it into scholarly communications.

Stephen Abrams



How successful are digital preservation efforts? Stephen Abrams is working on creating metrics for this purpose. He is planning on evaluating digital preservation from the perspective of communications rather than through preservation management concepts like the quantity, age, or quality of preserved material. Thanks to his presentation I will now examine terms like “verisimilitude”, “semiotic”, and “truthlikeness”. When he is done, we should have better metrics to evaluate things like the trustworthiness of preserved material. His work is more general and theoretical than Mohamed’s, but there is a loose connection to be sure.

Tirthankar Ghosal




Why are papers rejected by editors? Have we done a good job identifying what makes our paper novel? What if we could spot such complex issues in our papers prior to submission? Tirthankar Ghosal seeks to help address these concerns by using AI techniques to help researchers and editors more easily identify papers that will likely be rejected. He has already done some work examining reasons for desk rejections. He will identify methods for detecting what makes a paper novel, if a paper is fit for a given journal, if it is of sufficient quality to be accepted, and lastly create benchmark data that can be used to evaluate papers in the future. His work has large implications for scholarly communication and may affect not only the way we write, but also how submissions are handled in the future.

What Next?


I would like to thank all participants for their input and insight throughout the event. Hearing their feedback for other participants was quite informative to me as well. We will all have improved candidacy proposals as a result of their input and, more importantly, will use this input to improve our contributions to the world.
Updated on 2018/06/09 at 20:50 EDT with embed of Mohamed Aturban's Slideshare.
--Shawn M. Jones

2018-06-08: Joint Conference on Digital Libraries (JCDL) 2018 Trip Report

The gathering place at the Cattle Raisers Museum, Fort Worth, Texas 
This year's 18th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) took place at the University of North Texas (Fort Worth, Texas). From June 3 to 6, members of WSDL attended paper sessions, workshops, tutorials, panels, and a doctoral consortium.

The theme of this year's conference was "From Data to Wisdom: Resilient Integration across Societies, Disciplines, and Systems." The conference provided researchers across multiple disciplines, ranging from Digital Libraries and Web Science to Library and Information Science, with the opportunity to communicate the findings of their research.

Day 1 (June 3, 2018)

The first day of the conference was dedicated to the doctoral consortium, tutorials, and workshops. The doctoral consortium provided an opportunity for Ph.D. students in the early phases of their dissertations to present their thesis and research plans and receive constructive feedback. I will provide a link to the Doctoral Consortium blog post when it becomes available.

Day 2 (June 4, 2018)

The conference officially began on the second day with Dr. Jiangping Chen's introduction of the conference and the keynote speaker - Dr. Trevor Owens. Dr. Trevor Owens is a librarian, researcher and policy maker and the first head of Digital Content Management for library services at the Library of Congress. His talk was titled: "We have interesting problems." 

It started with a highlight of Ben Shneiderman's The New ABCs of Research, which provides students with guidance on how to succeed in research and offers senior researchers and policy makers guidance on how to respond to new problems and apply new technologies. The new ABCs of research may be broadly summarized with two acronyms from the book: ABC (Applied, Basic, and Combined) and SED (Science, Engineering, and Design).
Additionally, he presented NDP@3, an IMLS framework for investments in digital infrastructures for libraries. He also presented multiple IMLS-funded projects such as Image Analysis for Archival Discovery (AIDA), which explores various ways to use millions of images representing the digitized cultural record.
Next he talked about some resources at the Library of Congress Labs such as:
  • Library of Congress Colors: provides the capability of exploring the colors in the Library of Congress collections.
  • LC for Robots: provides a list of APIs, data and tutorials for exploring the digital collections at the Library of Congress.
Following the keynote were three concurrent paper sessions with the themes Use, Collection Building, and Semantics & Linking. I will briefly describe the papers presented in two of the paper sessions.


Paper session 1B (Day 2)


Myriam Traub (best paper nominee), a PhD student at Centrum Wiskunde & Informatica (CWI) presented a full paper titled: "Impact of Crowdsourcing OCR Improvements on Retrievability Bias." She discussed how  crowd-sourced correction of OCR errors affects the retrievability of documents in a historic newspaper corpus in a digital library.
Three short papers followed Traub's presentation. First, Karen Harker, a Collection Assessment Librarian at the University of North Texas Libraries, presented: "Applying the Analytic Hierarchy Process to an Institutional Repository Collection." She discussed the application of the Analytic Hierarchy Process (AHP) to create a model for evaluating the collection development strategies of institutions. Second, Douglas Kennard presented: "Computer-Assisted Crowd Transcription of the U.S. Census with Personalized Assignments for Better Accuracy and Participation," where he introduced the Open Genealogy Data census transcription project that strives to make census data readily available to researchers and digital libraries. This was achieved through the use of automatic handwriting recognition to bootstrap their census database and subsequent crowd-sourced correction of the data through a web interface. Finally, Mandy Neumann, a research associate at the Institute of Information Science at TH Köln, presented: "Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp." She explored different features for ranking conference candidates using a pseudo-relevance assessment.


Paper session 1C (Day 2)


Dr. Federico Nanni (best paper nominee), a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim presented the first of three full papers titled: "Entity-Aspect Linking: Providing Fine-Grained Semantics of Entities in Context," in which he introduced a method for obtaining specific descriptions of entities in text by retrieving the most related section from Wikipedia.
Next, Gary Munnelly, a PhD student at the School of Computer Science and Statistics (SCSS) at Trinity College Dublin presented: "Investigating Entity Linking in Early English Legal Documents," discussing the effectiveness of different entity linking systems for the task of disambiguating named entities in 17th century depositions obtained during the 1641 Irish rebellion.
Finally, Dr. Ahmed Tayeh presented: "An Analysis of Cross-Document Linking Mechanisms," where he discussed different strategies for linking or associating information across physical and digital documents. The titles of other papers presented in a parallel session (1A) include:

Open Cross-Document Linking Service Based on a Plug-in Architecture from Ahmed Tayeh


Paper session 2A (Day 2)


Two full papers were presented after a break. The first, titled: "Putting Dates on the Map: Harvesting and Analyzing Street Names with Date Mentions and their Explanations," was presented by Rosita Andrade. She presented her research on the automated analysis of street names with date references around the world and showed that "temporal streets" are frequently used to commemorate important events such as a political change in a country.
Next, Dr. Philipp Mayr, a deputy department head and team leader at the GESIS department Knowledge Technologies for the Social Sciences, presented: "Contextualised Browsing in a Digital Library's Living Lab." He presented two approaches that contextualize browsing in a digital library. The first approach is based on document similarity and the second utilizes implicit session information (e.g., queries and document metadata from sessions of users).


Paper session 3A (Day 2)


Three concurrent paper sessions followed Dr. Philipp Mayr's presentation. Dr. Dominika Tkaczyk, a researcher and data scientist at the Applied Data Analysis Lab at the University of Warsaw (Poland), presented: "Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers," in which she presented the results of comparing different methods for parsing scholarly article references.
Anne Lauscher, a PhD student at the University of Mannheim presented: "Linked Open Citation Database: Enabling Libraries to Contribute to an Open and Interconnected Citation Graph." She presented the current state of the workflow and implementation of the Linked Open Citation Database project, which is a distributed infrastructure based on linked data technology for efficiently cataloging citations in libraries.


Paper session 3C (Day 2)


Norman Meuschke, a PhD student at the University of Konstanz, presented: "An Adaptive Image-based Plagiarism Detection Approach," in which he discussed his analysis of images in academic documents to detect disguised forms of plagiarism with approaches such as perceptual hashing, ratio hashing and position-aware OCR text matching. 


Hisham Benotman presented his work: "Extending Multiple Diagram Navigation with Internal Diagram And Collection Connections." He discussed extending multiple diagram navigation (MDN) so that diagram-to-content queries reach related collection documents not directly connected to the diagrams.
Other papers presented in a parallel session (3B) include:
Minute madness followed the paper sessions: an activity in which poster presenters were given one minute to advertise their respective posters to the conference attendees. The poster session began immediately after the minute madness.





Day 3 (June 5, 2018)

Day 3 of the conference began with Dr. Niall Gaffney's keynote. Dr. Niall Gaffney is an Astronomer and Director of Data Intensive Computing at the Texas Advanced Computing Center (TACC). He started by emphasizing the importance of scientific reproducibility before moving on to show some of the projects supported by the computational machinery at TACC such as Firefly.
Two concurrent paper sessions followed a short break.

Paper session 4A (Day 3)


Dr. Gianmaria Silvello, an assistant professor at the Department of Information Engineering of the University of Padua presented a full paper titled: "Evaluation of Conformance Checkers for Long-Term Preservation of Multimedia Documents." He discussed his project about the development of an evaluation framework for validating the conformance of long-term preservation by assessing correctness, usability and usefulness.
Next, Dr. Pavlos Fafalios, a researcher at the L3S Research Center in Germany, presented a full paper titled: "Ranking Archived Documents for Structured Queries on Semantic Layers," in which he proposed two ranking models for archived documents that consider the similarity of documents to entities, the timeliness of documents, and the temporal relations between the entities.
The final paper in this session, presented by someone other than its authors, was a short paper titled: "Modeling Author Contribution Rate With Blockchain." Three concurrent paper sessions (all full papers) followed after a break.


Paper session 4B (Day 3)


Florian Mai, a graduate student at Kiel University in Germany was the first presenter of the paper session on Text Collections. He presented a full paper titled: "Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text," in which he presented the findings from investigating how deep learning models obtained from training on titles compare to deep learning models obtained from training on full-texts.
Next, Chris Holstrom, a PhD student from the Information School at the University of Washington, presented a short paper: "Social Tagging: Organic and Retroactive Folksonomies," in which he showed that tags on MetaFilter and Ask MetaFilter follow a power law distribution and that retroactive taggers do not use "organization" tags like professional indexers do.
Next, Jens Willkomm, a PhD student at the Karlsruhe Institute of Technology in Germany, presented a full paper titled: "A Query Algebra for Temporal Text Corpora." He proposed a novel query algebra for accessing and analyzing words in large text corpora.


Paper session 5A (Day 3)


Omar Alonso (best paper nominee) presented a full paper titled: "How it Happened:  Discovering and Archiving the Evolution of a Story Using Social Signals." He introduced a method of showing the evolution of stories from the perspective of social media users as well as the articles that include social media as supporting evidence.
Tobias Backes, a researcher at GESIS, presented his paper titled: "Keep it Simple: Effective Unsupervised Author Disambiguation with Relative Frequencies." He addressed the problem of author name homonymy in the Web of Science by proposing a novel probabilistic similarity measure for author name disambiguation based on feature overlap.
The last paper (best paper nominee) presented in this session was titled: "Digital History meets Microblogging: Analyzing Collective Memories in Twitter."


Paper session 5B (Day 3)


Noah Siegel, a researcher at the Allen Institute for Artificial Intelligence, presented a full paper titled: "Extracting Scientific Figures with Distantly Supervised Neural Networks," where he introduced a system for extracting figures from a large number of scientific documents without human intervention.
Next, André Greiner-Petter presented his full paper titled: "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context." He presented a new approach for mathematical format conversion that utilizes textual information to reduce the error rate. Additionally, he evaluated state-of-the-art tools for mathematical conversion and provided a public, manually-created gold standard dataset for mathematical format conversion.

Next, Yuta Kobayashi presented a paper titled: "Citation Recommendation Using Distributed Representation of Discourse Facets in Scientific Articles," which demonstrated the effectiveness of using facets of scientific articles such as "objective," "method," and "result" for citation recommendation by learning a multi-vector representation of scientific articles, in which each vector represents a facet in the article.

Paper session 5C (Day 3)


Catherine Marshall, an adjunct professor at Texas A&M University, presented: "Biography, Ephemera, and the Future of Social Media Archiving." She presented her findings from answering the following question: "Will the addition of new digital sources such as records repositories, digital libraries, social media, and collections of ephemera change biographical research practices?" She demonstrated how new digital resources unravel a subject's social network, thus exposing biographical information that was formerly invisible.
Next, I presented our full paper titled: "Scraping SERPs for Archival Seeds: It Matters When You Start" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson. In my presentation, I first highlighted the importance of web archive collections for studying important historical events ranging from elections to disease outbreaks. Next, I showed that search engines (specifically Google) can be used to generate seeds. Finally, I showed that it becomes harder to find the older URLs of news stories over time, so seed generators that utilize search engines should begin early and persist in order to capture the evolution of an event.

Next, Mat Kelly (best paper nominee), a fellow PhD student at Old Dominion University and member of WSDL, presented his full paper titled: "A Framework for Aggregating Private and Public Web Archives." He showed how his framework provides a means of combining public web archive captures and private web captures (e.g., banking and social media information) without compromising sensitive information included in the private captures. This work utilizes Sawood Alam's MemGator, a Memento aggregator that supports multiple serialization formats such as Link, JSON, and CDXJ.


Paper session 6A (Day 3)


The last paper session on Topic Modeling and Detection consisted of three full papers. First, Julian Risch (best paper nominee), a PhD student at Hasso-Plattner Institute (Germany) presented: "My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections." He presented a topic model combined with automatic domain term extraction and phrase segmentation that distinguishes collection-specific and collection-independent words based on information entropy.
Next, Dr. Ralf Krestel, the head of Web Science Research Group & Senior Researcher at Hasso-Plattner Institute (Germany) presented his full paper titled: "WELDA: Enhancing Topic Models by Incorporating Local Word Context." He proposed a new topic model called WELDA that combines word embeddings (WE) and Latent Dirichlet Allocation (LDA).
Finally, Angelo Salatino, a PhD student at the Knowledge Media Institute (UK) presented a full paper titled: "AUGUR: Forecasting the Emergence of New Research Topics." He introduced AUGUR, which is a new approach for the early detection of research topics in order to help stakeholders such as universities, institutional funding bodies, academic publishers and companies recognize new research trends.

A dinner at the Fort Worth Museum of Science and History followed after a break. The best poster award was presented to Mohamed Aturban, a fellow PhD student at Old Dominion University and member of WSDL, for his poster "ArchiveNow: Simplified, Extensible, Multi-Archive Preservation."
Dr. Federico Nanni (Providing Fine-Grained Semantics of Entities in Context) and Myriam Traub (Impact of Crowdsourcing OCR Improvements on Retrievability Bias) tied for the Vannevar Bush best paper award. Myriam Traub also won the best student paper award.


Day 4 (June 6, 2018)

Day 4 began with a keynote from Dr. Carly Strasser, Director of Strategic Development for the Collaborative Knowledge Foundation. Her keynote, "Open Source Tech for Scholarly Communication: Why It Matters," illustrated the problems in the submission, production, and delivery of scholarly communication. She talked about the disjoint nature (silos) of the various stages of scholarly communication, as well as expensive delivery, slow production, and static, less interoperable output.

She also presented a vision of scholarly communication that consists of living documents that link to open source code and data, a cheaper delivery system, faster production and more interoperable and dynamic output. Additionally, she talked about the organizations working to achieve various aspects of this vision.
The main conference gave way to workshops and a preview of JCDL 2019 which is scheduled to take place at the School of Information Sciences at the University of Illinois, Urbana-Champaign from June 2-6, 2019.
I would like to thank the organizers of the conference, the hosts (the University of North Texas (UNT) College of Information and the UNT Health Science Center), and SIGIR for the travel grants. Here are other trip reports: the Doctoral Consortium report (from Shawn Jones), a preview of the WADL (Web Archiving and Digital Libraries) workshop from Jasmine Mulliken, Digital Production Associate at Stanford University Press, Mat Kelly's WADL trip report, and Corren McCoy's Knowledge Discovery From Digital Libraries (KDDL) Workshop trip report. Dr. Min-Yen Kan set up a repository for all the slides from JCDL 2018; please upload your slides if you have not already done so.

-- Nwala (@acnwala)