2013-07-26: ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2013

The Old Dominion University Web Science and Digital Libraries (WSDL) research group was well-represented at the JCDL 2013 conference – Digital Libraries at the Crossroads. We arrived in Indianapolis, Indiana on Sunday night.

While Hany SalahEldeen and I took time on Monday to ready our presentations, Scott Ainsworth and Yasmin AlNoamany presented at the Doctoral Consortium. Scott presented his research on reducing temporal drift in the archives, and Yasmin presented her work on creating a story from mementos. Their presentations (and the Doctoral Consortium itself) are discussed in more detail in their blog posting.

Day 1

After opening remarks from J. Stephen Downie and Robert H. McDonald, Clifford Lynch gave the opening keynote of the conference entitled "Building Social Scale Information Infrastructure: Challenges of Coherence, Interoperability and Priority."

Lynch posed a series of questions that are influencing the research areas in digital libraries. To begin, he mentioned that a number of systems could be considered digital libraries, such as the national security systems tracking people and actions, or health-care systems tracking patients. The big challenge we are facing is how to think about massive enterprise systems and prioritize activities in such large, community-controlled environments. PubMed provides an example of a canonical collection of literature in a discipline, but our current models are institutionalized by publisher or collection topic as opposed to an entire discipline.

The next topic of discussion centered around the notion of getting physical objects that may exist in people's homes or private collections into digital libraries. Additionally, what does it mean to control the access rights to this content? He referenced Herbert Van de Sompel's presentation on data in archives, and the idea that the data should exist only in the context of content and creator, not archive and curator.

Lynch followed this up by mentioning that we have no good ways to assess the health of the stewardship environment. He touched on the need to assess how much of the web is archived, how much of the web is discoverable and archivable, and how well we are capturing the target content (the last two of which are discussed in my WADL presentation). We have worked or are currently working on each of these questions in the WSDL group.

Finally, Lynch closed with his position that digital stewardship is becoming an engineering problem, and should be treated as such with appropriate risk management, modeling and simulation, and business models (such as those presented by David Rosenthal). These engineering systems will reflect the values of the discipline by the policies put in place (such as privacy, access rights, and collection replication).

My colleagues and I went to our first round of paper presentations – Preservation I – to support session chair and WSDL alumnus Martin Klein (currently at Los Alamos National Laboratory Research Library) and our current student Scott Ainsworth.

The first paper of the session was from Ivan Subotic entitled A Distributed Archival Network for Process-Oriented Autonomic Long-Term Digital Preservation. Ivan proposed an archival scenario and the associated requirements for a digital preservation system. The distributed archival system, DISTARNET, was proposed and a prototype centering around object containers was presented by Ivan during his talk.

Scott presented his work on Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Random Walks Through a Web Archive. His paper was also nominated as a candidate for best student paper. Scott discussed the temporal drift experienced by users when navigating (or walking) between mementos in the archives, and how this drift changes depending on walk length. The drift is greatly reduced when MementoFox is utilized by the user during the walk.

Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Acyclic Walks Through a Web Archive from ScottAinsworth
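
To make the idea of drift concrete, here is a minimal Python sketch; it is not the code from Scott's paper. It assumes that a sliding target means each lookup uses the datetime of the memento just visited, while a sticky target (the MementoFox-style behavior) keeps the originally requested datetime for the whole walk. The `closest` and `walk` helpers and all capture dates are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def closest(captures, target):
    """Return the capture datetime nearest to target (captures must be sorted and non-empty)."""
    i = bisect_left(captures, target)
    candidates = captures[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c - target))


def walk(pages, start_target, sticky):
    """Visit pages in order and return the drift in days from start_target at each step.

    Assumption: sticky=True always looks up start_target; sticky=False (sliding)
    looks up the datetime of the memento reached on the previous step.
    """
    target = start_target
    drifts = []
    for captures in pages:
        memento = closest(captures, target)
        drifts.append(abs((memento - start_target).days))
        if not sticky:
            target = memento  # sliding: the next lookup starts from where we landed
    return drifts


# Hypothetical capture dates for three pages visited in order.
pages = [
    [datetime(2005, 1, 1), datetime(2006, 1, 1)],
    [datetime(2005, 7, 1), datetime(2006, 6, 1)],
    [datetime(2005, 8, 15), datetime(2007, 3, 1)],
]
start = datetime(2005, 8, 1)
print("sticky :", walk(pages, start, sticky=True))
print("sliding:", walk(pages, start, sticky=False))
```

With these made-up dates the sticky walk stays within a few months of the starting datetime, while the sliding walk moves further away at every step.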

Kyle Rimkus and Tom Habing tag-team presented their paper Medusa at the University of Illinois at Urbana-Champaign: A Digital Preservation Service Based on PREMIS and gave a brief description of the MEDUSA project. MEDUSA facilitates the movement between archival platforms. Their presentation discussed the implementation of PREMIS which allows objects to form relationships within the archives.

WSDL alumnus Frank McCown presented the paper he wrote with first author Richard Schneider entitled First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites. Frank discussed the difference between the mobile and desktop versions of the web, and how crawlers can detect or discover the URI of a mobile version of a site (if it exists).

After lunch, I attended a panel discussion on Managing Big Data and Big Metadata. The panelists were Michael Khoo, Stacy T. Kowalczyk, and Matthew W. Mayernik. Khoo kicked the panel off with a brief introduction of the big data discipline and posed the question that the panel would discuss: "how can digital library research inform big data and big metadata?" Kowalczyk discussed "HathiTrust Research Center: Big Data for Digital Humanities." Her talk mentioned that terms used in big data are not well defined, and that digital libraries (Google, Internet Archive, etc.) are examples of specialized big data collections. Khoo raised a few more points about management and interoperability of big data between federated institutions. Mayernik presented "Managing Big Data and Big Metadata: Contributions from Digital Libraries." His work is more theoretical and focuses on how to create collections for sharing. He described digital libraries as technologies and sociological institutions for which convergence is not inevitable. That is, these institutions implement different policies for managing different types of big data that are not necessarily compatible.

A particularly intriguing question asked about the differences between big data in industry and in libraries. One difference is the user base: industry has specialized users and treats the data more as part of a process, while libraries must accommodate a wider variety of users and treat the data as the goal (to create collections).

A second question asked if there was a difference between "big" and "much" data. Big data is not suitable for traditional processing, querying, scientific research, etc. while "much" data is more suited to traditional handling.

The session completed with a long discussion on what constitutes data, and how sampling techniques can be used to reduce "big" data to "small" data while learning almost as much from the sets.

In the last session of the day, I attended the Information Clustering session (in no small part because of the two Best Paper Nominee presentations).

Weimao Ke presented his paper Information-theoretic Term Weighting Schemes for Document Clustering, a Vannevar Bush Best Paper Award Nominee. Ke discussed methods for extracting information from documents in a collection, drawing additional information from them, and clustering the documents based on the information retrieved. The proposed LIT method produces the best clustering results with k-means clustering, and can produce better results than TF/IDF.

Kazunari Sugiyama presented his paper Exploiting Potential Citation Papers in Scholarly Paper Recommendation, another Vannevar Bush Best Paper Award Nominee. Sugiyama discussed the relationships between citation and reference papers and how they can be used to recommend academic papers to authors. He also used fragments (such as the abstract or conclusion) to refine the recommendations.

Peter Organisciak presented his paper Addressing diverse corpora with cluster-based term weighting. Heterogeneous language in a corpus can be problematic -- Inverse Document Frequency (IDF) becomes less valuable when texts are from mixed domains, time periods, or are multilingual. Classifying or clustering documents helps increase the value of IDF. Organisciak used his method to show that the English language has changed over time.
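
A small Python sketch can illustrate why mixing languages weakens IDF. This is only an illustration of the general idea, not Organisciak's method: the `idf` and `clustered_idf` functions, the toy documents, and the hand-assigned cluster labels are all assumptions made for the example.

```python
import math
from collections import defaultdict


def idf(docs):
    """Plain inverse document frequency, log(N / df), for every term in docs."""
    n = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for term in set(doc.split()):
            df[term] += 1
    return {term: math.log(n / count) for term, count in df.items()}


def clustered_idf(docs, labels):
    """Compute IDF separately inside each cluster (labels are assumed to be given)."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return {label: idf(group) for label, group in groups.items()}


# Toy bilingual corpus; the labels stand in for the output of some clustering step.
docs = ["the cat sat", "the dog ran", "der hund lief", "der vogel sang"]
labels = ["en", "en", "de", "de"]

global_scores = idf(docs)
per_cluster = clustered_idf(docs, labels)
print("global  idf('the'):", global_scores["the"])
print("cluster idf('the'):", per_cluster["en"]["the"])
```

In the mixed corpus "the" looks moderately informative because it only appears in half of the documents. Inside the English cluster it appears in every document, so its weight drops to zero, which is what we would expect from a stop word.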

Xuemei Gong presented her paper Interactive Search Result Clustering: A Study of User Behavior and Retrieval Effectiveness. She showed that a scatter/gather system is more difficult to use than a classic search engine interface, but that scatter/gather is more useful, and can improve user learning.

Day 2

Our second day at JCDL began with a keynote from Jill Cousins entitled "Why Europeana?"

Cousins discussed the value -- culturally and monetarily -- of Europeana which began as a method of reflecting the diversity of the European web. With the help of activists, it facilitates data aggregation, distribution, and user engagement. Europeana constructed an aggregation infrastructure of digital libraries to deliver archival data (an infrastructure that now spans beyond 2,300 content providers).

Europeana's distribution of the archival material was a challenge due to licensing concerns by content owners. However, Europeana offers an API utilized by 770 organizations and several services that deliver content.

Current operational challenges at Europeana include multilingualism -- the site cannot accept queries in multiple languages and effectively return the requested information. Additionally, engaging users is a continuing challenge, particularly with the goal of drawing traffic in the same order of magnitude as Wikipedia.

Cousins outlined three impacts of Europeana. The first is Europeana's support of economic growth in that the cultural material is being used to improve other services. The second is that Europeana connects Europe and the rest of the world through community engagement and heritage services. The third is making cultural material available to everyone.

Budget cuts have severely reduced Europeana's ability to effectively deliver on these impacts, and they developed new goals to improve economic return on investment and find alternate funding sources. One proposed source is the Incubator, a research think tank that will support startups. The main source of revenue is service-oriented offerings within the impacts, such as enabling industry, providing a licensing framework, or incubation services.

The next steps for Europeana are to receive support from the governmental bodies and become self-sufficient by 2020.

The first session I attended was Specialist DLs, moderated by Michael L. Nelson.

Annika Hinze presented her paper Tipple: Location-Triggered Mobile Access to a Digital Library for audio book. This work (embodied as a mobile application) allows location-specific sections in narrative books to be matched with the current location of a reader. For example, a book chapter set in the Hamilton Gardens would play when the reader is walking through the Hamilton Gardens. Hinze presented evaluations of the software.

Paul Bogen presented his paper Redeye: A Digital Library for Forensic Document Triage. Redeye helps an undisclosed sponsor filter relevant information on targets from noise in a large corpus of scanned documents. The system uses entity extraction, cross-linking, and machine translation and utilizes an ingestion pipeline, a repository, and a workbench.

Katrina Fenlon presented her paper Local Histories in Global Digital Libraries: Identifying Demand and Evaluating Coverage. She discussed the results of a survey of libraries on the demand for different granularities of historical topics of interest, with local historical topics (as opposed to state or world granularities) being in the highest demand.

Laurent Pugin presented his paper Instrument distribution and music notation search for enhancing bibliographic music score retrieval. He presented an effort to inventory music scores that exist in multiple sources around the world. The main goal of this paper is to utilize the rich metadata of the set to provide the most effective search for the users.

The second session of day two was Web Replication.

Hany kicked off the session well with his paper Reading the Correct History? Modeling Temporal Intention in Resource Sharing. Hany discussed his continuing studies of the differences between what we meant to share over social media and what is actually observed.

Reading the Correct History? Modeling Temporal Intention in Resource Sharing from heinestien

In what was (in my clearly unbiased opinion) the best presentation of the day, I presented my study of Memento TimeMaps entitled An Evaluation of Caching Policies for Memento TimeMaps. In this work, I discussed the change patterns of TimeMaps which should intuitively be monotonically increasing but in practice, sometimes decrease in cardinality. I proposed and evaluated a caching strategy based on our observations to better serve Memento users while limiting the load on the archives.

An Evaluation of Caching Policies for Memento TimeMaps from Justin Brunelle
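
As a rough illustration of the kind of policy this observation suggests, here is a minimal Python sketch. It is not the policy evaluated in the paper: the `TimeMapCache` class, its time-to-live parameter, and the rule of keeping the larger copy when a fresh TimeMap shrinks are all assumptions made for the example, and `fetch` stands in for whatever function actually requests a TimeMap from an archive.

```python
import time


class TimeMapCache:
    """Toy cache: reuse a TimeMap until it expires, and never replace it with a smaller one."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.time):
        self.fetch = fetch        # hypothetical callable: URI -> list of memento URIs
        self.ttl = ttl_seconds    # assumed expiry time, not a value from the paper
        self.clock = clock
        self.entries = {}         # URI -> (time fetched, list of memento URIs)

    def get(self, uri):
        now = self.clock()
        cached = self.entries.get(uri)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]      # still fresh: no load on the archive
        mementos = self.fetch(uri)
        if cached is not None and len(mementos) < len(cached[1]):
            # The TimeMap shrank, which should not happen; keep the larger copy.
            mementos = cached[1]
        self.entries[uri] = (now, mementos)
        return mementos


# Simulated archive responses and a fake clock so the example is deterministic.
responses = iter([["m1", "m2", "m3"], ["m1", "m2"], ["m1", "m2", "m3", "m4"]])
now = [0]
cache = TimeMapCache(lambda uri: next(responses), ttl_seconds=10, clock=lambda: now[0])

sizes = []
for t in (0, 5, 20, 40):
    now[0] = t
    sizes.append(len(cache.get("http://example.com/")))
print(sizes)
```

The second request is served from the cache, the third ignores a TimeMap that lost a memento, and the fourth accepts the larger TimeMap once the archive reports it.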

Continuing the ODU dominance of this session, Martin presented his paper Extending Sitemaps for ResourceSync. He presented how ResourceSync uses Sitemaps as a resource list, including the extended information that ResourceSync uses to enhance Sitemaps.

Even as the odd man out (having not attended ODU), Min-Yen Kan presented an exceptional paper written by Bamdad Bahrani entitled Multimodal Alignment of Scholarly Documents and Their Presentations. Their work generated an alignment map to match the content of a conference proceedings paper with the slides presented during the conference. The major contribution of this paper was the inclusion of visual content.

The session on Data was next.

Maximilian Scherer presented a paper entitled Visual-Interactive Querying for Multivariate Research Data Repositories Using Bag-of-Words. The authors presented a bag-of-words algorithm for discovering data in a repository.

Ixchel M. Faniel presented her paper on The Challenges of Digging Data: A Study of Context in Archaeological Data Reuse. This paper discusses the use and reuse of data of a specialized digital library for archaeologists, and the custom ontologies and reuse patterns that it implements.

Jesse Prabawa Gozali presented the paper Constructing an Anonymous Dataset From the Personal Digital Photo Libraries of Mac App Store Users. This presentation explored methods of retrieving information from photo collection and explored the types of information available to researchers.

Miao Chen presented the paper Modeling Heterogeneous Data Resources for Social-Ecological Research: A Data-Centric Perspective. This presentation discussed an ontology for representing and organizing information in a repository using sampling from an existing dataset.

The poster session rounded out day 2. Most importantly, Ahmed Alsum presented his poster ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph. Dr. Nelson's tweet (complete with YouTube video) captures his Minute Madness presentation perfectly.

the best ever minute madness presentation? https://t.co/9ddknsoxtM #JCDL13 w/ @aalsum
— Michael L. Nelson (@phonedude_mln) July 25, 2013

Day 3

In the last session of the conference, we attended the Preservation II presentations.

Yasmin presented her paper entitled Access Patterns for Robots and Humans in Web Archives which discussed how we can use archive access logs to determine the frequency with which robots and human users access the archives with the intention of better serving all users.

Access Patterns for Robots and Humans in Web Archives from Yasmina Anwar

Krešimir Đuretec presented a paper he wrote with Christoph Becker entitled Free Benchmark Corpora for Preservation Experiments: Using Model-Driven Engineering to Generate Data Sets. This work provides a method for automatically generating a corpus that serves as an alternative to Govdocs and provides a ground truth dataset for use in benchmarking.

Hendrik Schöneberg presented his paper A scalable, distributed and dynamic workflow system for digitization processes. This work presents a workflow for quickly processing large amounts of images and image data (from scans of manuscripts) for curation and management.

In a paper that I assume was close to Dr. Nelson's heart, Otávio A. B. Penatti presented a paper he wrote with Lin Tzy Li entitled Domain-specific Image Geocoding: A Case Study on Virginia Tech Building Photos. This work evaluated the descriptors of an image for geocoding. It then inferred and assigned a physical, geographical location to digital objects (in this case, photos of Virginia Tech campus buildings) for placement on a map.

To close out the conference, David De Roure gave his keynote entitled "Social Machines of Science and Scholarship."

De Roure discussed how scientific information and publications have changed over time, and how "the paper" can be improved as a way to disseminate scientific findings. The first failure of the paper is that the data cannot be put into the paper -- the container is inappropriate. The second failure is that an experiment cannot be reconstructed based on a paper alone. The third failure is that publications are becoming targeted at increasingly specialized audiences. The fourth failure is that research records are not [natively] machine readable. The fifth failure is that authorship has moved from single scientists to potentially thousands of researchers (that is more suitable to Hollywood-style credits). The sixth failure is that quality control is not able to keep up with publication speed. The seventh failure is that regulations force specific reporting. The eighth and final failure is that researchers are frustrated by increasing inefficiencies in scholarly communications.

He followed up these failures by discussing how scientists operate today. This included the relationship between the increasing data, computation, storage, and people involved in research.

De Roure also spoke about how automation, data collection, and other aspects of science are changing as technology is changing along with the world around it. He also mentioned that with the increase in data, we also have an increase of methods to work with it.

De Roure's notion of people and papers as knowledge objects that operate as a linked-data system is interesting. These objects can be exchanged, reused, and handled in workflows similarly to the web objects with which we are more familiar. Coming to the point of the talk, he also discussed examples of successful social machines (such as reCAPTCHA and Wikipedia) that evolve with society -- or more specifically, as a result of society. With the increase in social machines, they've started forming their own, larger social machine ecosystems by interacting. These social machines that we help design are supporting and enabling the research environments in which we operate.

We closed out our conference by attending the Web Archiving and Digital Libraries (WADL) workshop. Notes on the workshop will appear in a future blog posting.

We had a wonderful time visiting Indianapolis, learned a lot, and gathered great ideas to incorporate into our current and future works. We look forward to next year's DL2014 conference (a joint JCDL and TPDL conference) which was announced to be in London.

--Justin F. Brunelle
