Thursday, September 18, 2014

2014-09-18: Digital Libraries 2014 (DL2014) Trip Report

Mat Kelly, Justin F. Brunelle and Dr. Michael L. Nelson travel to London, UK to report on the Digital Libraries 2014 Conference.                           

On September 9th through 11th, 2014, Dr. Nelson (@phonedude_mln), Justin (@justinfbrunelle), and I (@machawk1) attended the Digital Libraries 2014 conference (a composite of the JCDL (see trip reports for 2013, 2012, 2011) and TPDL (see trip reports for 2013 and 2012) conferences this year) in London, England. Prior to the conference, Justin and I attended the DL2014 Doctoral Consortium, which occurred on September 8th.

The main conference on September 9th opened with George Buchanan (@GeorgeRBuchanan) indicating that this year's conference was a combination of both TPDL and JCDL from previous years. With the first digital libraries conference being in 1994, this year marked the 20th anniversary of the conference. George celebrated this by singing Happy Birthday to the conference and introduced Ian Locks, the Master of the Worshipful Company of Stationers and Newspaper Makers, to continue the introduction.

Ian first gave a primer on the history of his organization as a "chain gang that dated back 1000 years" that "reduced the level of corruption when people could not read or write". His organization became Britain's first library of deposit, wherein printed works needed to be deposited with them, and the organization was central to the first copyright in the world in 1710.

Ian then gave way to Martin Klein (@mart1nkle1n), who gave insight into the behind-the-scenes dynamics of the conference. He stated that the programming committee had 183 members to allocate reviews for every submission received. The committee's goal was to have four first level reviews. Of the papers received, 38 countries were represented with the largest number coming from the U.S. followed by the U.K. then Germany. The acceptance rate for full papers was 29% while the rate for short papers was 32%. 33 posters and 12 demos were also accepted. Interestingly, the country with the highest acceptance rate that submitted over five papers was Brazil, with over half of their papers accepted.

Martin then segued to introducing the keynote speaker, Professor Dieter Fellner of the Fraunhofer Institute. Dieter's theme consisted mostly of the different means and issues in digitally preserving 3-dimensional objects. He described the digitization as, "A grand opportunity for society, research, and economy but a grand challenge for digital libraries." In reference to object recovery for preservation before or after an act of loss he said, "if we cannot physically preserve an object, having a digital artifact is second best." Dieter then went on to tell of the inaccuracies of preserving artifacts under a single or insufficient lighting condition. To evaluate how well an object is preserved, he spoke of a "Digital Artifact Turing Test": first create photos of a 3D artifact, then make a 3D model. If you can't tell the difference, then the capture can be deemed successful and representative.

Dieter continued with some approaches they have used to achieve better lighting conditions and how varying the lighting conditions has uncovered data that was previously hard to capture accurately. As an example, he showed a piece of driftwood from Easter Island that had an ancient etched message that was very subtle to see and thus would likely be unknowingly used as firewood. By varying the light conditions when preserving the object, the ancient writing was exposed and preserved for later translation once more is known about the language.

Another example he gave concerned scans made today of ancient objects and how accurately we can replicate the original color, citing the discolored bust of Nefertiti. Further inspection using variously colored lighting during scanning produced potentially better results for a capture.

After a short break, the meeting resumed with simultaneous sessions. I attended the "Web archives and memory" session where WS-DL's Michael Nelson led off with "When Should I Make Preservation Copies of Myself?", a work related to WS-DL's recent alumnus Chuck Cartledge's PhD dissertation. In his presentation, Michael spoke of the preservation of objects, particularly of Chuck's now-famous ancestor "Josie", of whom he had a physical photo over a hundred years old with some small bits of metadata hand-written on the back. With respect to modeling the self-preservation of the corresponding digital object of the Josie photo (e.g., a scanned image on Flickr), Michael described the "movement" of how this image should propagate in the model in a way akin to Craig Reynolds's Boids, with the desired behaviors of collision avoidance, velocity matching, and flock centering. The set of duplicated objects in a variety of locations can be described with the "small world" concept and will not create a lattice structure in its propagation scheme.

Chuck's implementation work includes adding a linked image embedded on the web page (using the HTML "link" tag and not the "a" tag) that allows the user to specify that they would like the object preserved. Michael then described the three policies used for duplication to ensure optimal spread of a resource, which included one-at-a-time until a limit is hit, as aggressively as possible until a soft limit then one-at-a-time until a hard limit, or a super aggressive policy of duplication until the hard limit is hit. From Chuck's work, Michael said, "It pays to be aggressive, conservation doesn't work for preservation. What we envisioned", he continued, "was to create objects that live longer than the people that created them."
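The three duplication policies can be illustrated with a toy simulation. This is only a sketch under stated assumptions and not Chuck's implementation: the policy labels ("conservative", "moderate", "aggressive"), the function name, and the choice to model "as aggressively as possible" as doubling the copy count per step are all hypothetical.

```python
def copies_over_time(policy, soft_limit, hard_limit, steps):
    """Return the number of copies held after each step under a toy policy.

    policy: "conservative" (one at a time until hard_limit),
            "moderate" (fast until soft_limit, then one at a time),
            "aggressive" (fast until hard_limit).
    "Fast" is modelled as doubling per step, purely as an assumption.
    """
    copies = 1  # the original object
    history = []
    for _ in range(steps):
        if policy == "conservative":
            copies = min(copies + 1, hard_limit)
        elif policy == "moderate":
            if copies < soft_limit:
                copies = min(copies * 2, soft_limit)
            else:
                copies = min(copies + 1, hard_limit)
        elif policy == "aggressive":
            copies = min(copies * 2, hard_limit)
        else:
            raise ValueError(f"unknown policy: {policy}")
        history.append(copies)
    return history


if __name__ == "__main__":
    for name in ("conservative", "moderate", "aggressive"):
        print(name, copies_over_time(name, soft_limit=4, hard_limit=8, steps=5))
```

In this toy model the aggressive policy reaches the hard limit in the fewest steps, which is the intuition behind the "it pays to be aggressive" remark.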

Cathy Marshall (@ccmarshall) followed Michael with "An Argument for Archiving Facebook as a Heterogeneous Personal Store". From her previous studies, she found that users were apathetic about preserving their Facebook contents. "Why we should archive Facebook despite the users not caring about archiving it?", she said. "Evidence has suggested that people are not going to archive their stuff in any kind of consistent way in the long term." Most users in her study could not think of anything to save on Facebook, assuming that if Facebook died, those files would also live somewhere else and could be recovered. Despite this, she attempted to identify what users found most important in their Facebook contents: 50% of the users said that they find the most value in their photos, 35% said they would carry over their contacts if they needed to, and beyond that they did not care about much else.

Michael Day (@bindonlane) followed Cathy with a review of recent work at the British Library. His group has been making attempts to implement preservation concepts from extremely large collections of digital material. They have published the British Library Content Strategy, which attempts to guide their efforts.

When Michael was done presenting, I (Mat Kelly, @machawk1) presented my paper "The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript". The purpose of the work was to evaluate different web archiving tools and web sites in a way similar to the Acid Tests originally designed for web standards, but with more clarity and with tests of specific facets of the web with which web archiving tools have trouble.

After my presentation came lunch and another set of concurrent sessions. For this session I attended the "Digital Libraries: Evolving from Collections to Communities?" panel with Brian Beaton, Jill Cousins (@JilCos), and Herbert Van de Sompel (@hvdsomp) with Deanna Marcum and Karen Calhoun as moderators.

Jill stated, "Europeana can't be everything to everybody but we can provide the data that everyone reuses." The group spoke about the Europeana 2020 strategy and how it aims to fulfill 3 principles: "share data, improve access conditions to data and create value in what we're doing". Brian asked, "Can laypeople do managed forms of expert work? That question has been answered. The real issue is determining how crowd-sourced projects can remain sustainable.", referencing previous discussion on the integration of crowd sourcing to fund preservation studies and efforts. He continued, "I think we're at the moment where we are at a lot of competitors entering race [for crowd sourcing platforms]. Lots of non-profits are turning to integration of crowd-funded platforms. I'm curious to see what happens where more competition emerges for crowd-sourcing."

Following the panel and a short break, Alexander Ororbia of Penn State presented "Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities" relating to CiteSeerX, "The scholarly big data platform". The application he described relates to research management, collaboration discovery, citation recommendation, expert search, etc. and uses technologies like a private cloud, HDFS, NoSQL, MapReduce, and crawl scheduling.

Following Alex, Per Møldrup-Dalum (@perdalum) presented "Bridging the Gap Between Real World Repositories and Scalable Preservation Environments". In his presentation, Per spoke of Hadoop and his work in digitizing 32 million scanned newspaper pages using OCR and ensuring the digitization was valid according to his group's preservation policy. In accomplishing this, he created a "stager" and "loader" as proof-of-concept implementations using the SCAPE APIs. In doing this, Per wanted to emphasize the reusability of the products he produced, as his work was mostly based on the reuse of other projects.

After Per, Yinlin Chen described their work on utilizing the ACM Digital Library (DL) data set as the basis for a project on finding good feature representations that minimize the differences between source and target domains of selected documents.

C. Lee Giles of Penn State came next with his presentation "The Feasibility of Investing in Manual Correction of Metadata for a Large-Scale Digital Library". In this work, he sought to build a classifier using a truth discovery process using metadata from Google Scholar. He found that a "web labeling system" seemed more promising compared to simple models of crowdsourcing the classification.

This finished the presentations for the first day of the conference and the poster session followed. In the poster session, I presented my work on developing a Google Chrome extension called Mink (now publicly available) that attempts to integrate the live and archived web viewing experience.

Day 2

George Buchanan started the second day by introducing Professor Jane Ohlmeyer of Trinity College, Dublin (@tcddublin) and her work relating to the 1641 Depositions, the records of massacre, atrocity and ethnic cleansing in seventeenth-century Ireland. These testimonies related to the Irish Rebellion around the 22nd of October in 1641, where Catholics robbed, murdered, and pillaged their Protestant neighbors. From what's documented of the conflict, Jane noted that "we only hear one side of the suffering and we don't have reports of how the Catholics suffered, were massacred, etc.", referring to the accounts being mostly collected from a single perspective of the conflict. Jane highlighted one particular account by Anne Butler from the 7th of September, 1642, in which Anne first explained who she was and then described how neighbors with whom she had previously interacted daily in the market threatened her during the conflict solely because she was Protestant. "It's as if the depositions are allowing us to hear the story of those that suffered through fear and conflicts.", Jane said, referencing the accounts. The depositions had originally been donated in 1741 and held in Trinity College, locked away due to their controversial documentation of the conflict. The accounts consist of over 19,000 pages (about 3.5 million words) and include 8,000 witness testimonies of events related to the 1641 rebellion. Multiple parties had attempted to publish the accounts in the past (including an attempt in 1930), but they had previously been censored by the Irish government because of their graphic nature. Now that the parties involved in the conflict are at peace, Jane's group is doing further work to preserve the accounts while dealing with various features of the writing (e.g., multiple spellings, document bleed-through, inconsistent data collection patterns) that might otherwise be lost were the documents naively digitized.

Jane's group has since launched a website (in 2010) to ensure that the documents are accessible to the public, and it currently has over 17,000 registered users. All of the data they have added is open source. For the launch, they had both Mary McAleese and Ian Paisley (who was notoriously anti-Catholic) together, with Paisley surprisingly saying that he advocated the publication of the documents, as it "promoted learning", and he encouraged the documents be made accessible in the classroom to 14, 15, and 16 year olds so that society could "remember the past but not bound by the past". Through the digitization process, Jane's group has looked to other more recent (and some currently ongoing) controversial conflicts and how the accounts of those conflicts can be documented and released in a way that is appropriate to the respective affected society.

Following Jane's keynote (and a coffee break), the conference split into concurrent sessions. I attended the "Browsing and Searching" session, where Edie Rasmussen introduced Dana Mckay (@girlfromthenaki), who began her presentation, "Borrowing rates of neighbouring books as evidence for browsing". In her work, she sought to explore the concept of browsing with respect to the various digital platforms for doing so (e.g., for books on Amazon) vs. the analog of browsing in a library. With library-based browsing, a patron is able to maintain physical context and see other nearby books as shelved by the library. "Browsing is part of the human information seeking process", she said, following with the quote, "The experience of browsing a physical library is enough to dissuade people to use e-books." In her work, she used 6 physical libraries as a sample set and checked how frequently physically nearby books had been borrowed relative to an initially checked out book. In preliminary research, she found that, from her sample set, over 50% of the books had ever been borrowed and just 12% had been borrowed in the last year. In an attempt to quantify browsing, she first split her set of libraries into two sets consisting of those that used the Dewey Decimal system and those that used the Library of Congress system of organization. She first tested 100 random books, checked if they had been borrowed on day Y, then checked the physically nearby books to see if they had been borrowed the day before. From her study, Dana found that there is definitely a causal effect on the location of books borrowed and that, especially in libraries, browsing has an effect on usage.
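The neighbor check described above can be sketched roughly as follows. The data layout is an assumption made for illustration: shelf positions and days are plain integers, loans are (position, day) pairs, and `neighbor_rate` and `radius` are hypothetical names, not anything from Dana's study.

```python
def neighbor_rate(loans, sample, radius=1):
    """Fraction of sampled loans that had a shelf neighbor borrowed the day before.

    loans:  set of (shelf_position, day) pairs for every recorded loan
    sample: list of (shelf_position, day) loans chosen for testing
    radius: how many shelf positions on each side count as "nearby"
    """
    if not sample:
        return 0.0
    hits = 0
    for position, day in sample:
        # Positions within `radius` of the sampled book, excluding the book itself.
        nearby = [p for p in range(position - radius, position + radius + 1)
                  if p != position]
        if any((p, day - 1) in loans for p in nearby):
            hits += 1
    return hits / len(sample)


if __name__ == "__main__":
    loans = {(10, 5), (11, 4), (20, 7), (30, 9), (28, 8)}
    sample = [(10, 5), (20, 7), (30, 9)]
    print(neighbor_rate(loans, sample, radius=1))
    print(neighbor_rate(loans, sample, radius=2))
```

Comparing this rate against the same rate for randomly chosen, non-adjacent books would be one way to judge whether proximity matters, though that comparison is not shown here.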

Javier Lacasta followed Dana with "Improving the visibility of geospatial data on the Web". In the work, his group's objective was to identify, classify, interrelate, and facilitate access to geospatial web services. They wanted to create an automatic process that identified existing geospatial services on the web by using an XML specification. From the service discovery, Javier wanted to extract the content of fields containing the resource's title, description, thematic keywords, content date, spatial coordinates, textual descriptions of the place, and the creator of the service. By doing this, they hoped to harmonize the content for consistency between services. Further, they wanted a means of classifying the services by assigning values from a controlled vocabulary. The study, he said, ought to be applicable to other fields, though his discovery of services was largely limited by the lack of content for these types of services on the web.

Martyn Harris was next with "The Anatomy of a Search and Mining System for Digital Humanities" where he looked at the barriers for tool adoption in the digital humanities spectrum. He found that documentation and usability evaluation were mostly neglected, so he looked toward "dogfooding" in developing his own tool using context-dependent toolsets. An initial prototype uses a treemap for navigating the Old Testament and considers the probability of querying each document.

Õnne Mets followed Martyn with "Increasing the visibility of library records via a consortial search engine". The target for the study was the search engine behind the National Library of Estonia, which provides an e-books on-demand (EOD) service as well as a service for digitizing public domain books into e-book form. Their service has been implemented in 37 libraries in 12 countries and provides an "EOD button" that sends the request to the respective library to scan and transfer the images from the physical book. Their service provides a central point for users to discover EOD eligible books and uses OAI-PMH to harvest and batch upload the book files via FTP. Despite the service's interface, Õnne said that 89% of the hits on their search interface came directly to their landing pages via Google. From this, Õnne concluded that collaboration with a consortial search engine does in fact make collections of digitized books more visible, which increases the potential audience.
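For readers unfamiliar with OAI-PMH harvesting, a minimal sketch of the request and response handling might look like the following. The base URL and the sample response are placeholders invented for illustration (the talk did not give an endpoint); the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the Dublin Core namespace come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When continuing a harvest, resumptionToken is an exclusive argument,
    so metadataPrefix is omitted in that case.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return every dc:title found in an OAI-PMH response body."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(DC_NS + "title")]


# A hand-written sample response, used here instead of a live request.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example digitized book</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(extract_titles(SAMPLE))
```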

The conference then broke for lunch but returned with Daniel Hasan Dalip's presentation of "Quality Assessment of Collaborative Content With Minimal Information". In this work, Daniel investigated how users can better utilize the larger amount of information created in web documents using a multi-view approach that indicates a meaningful view of quality. As a use case, he divided a Wikipedia article into different views representing the evidence conveyed in the article. Using Support Vector Regression (SVR), he worked to identify features within the document with a low prediction error. He concluded that using the algorithm allows the feature set of 68 features to be reduced by 15%, 18%, and 25% for three sample data sets, "MUPPET", "STARWAR", and "WIKIPEDIA", respectively.

As Daniel's presentation was going on, Justin viewed Adam Jatowt's presentation "A Framework for Analyzing Semantic Change of Words Across Time". In this work, Adam showed that words changed meaning over time using tools to verify words' evolution. He first took 5-grams from The Corpus of Historical American English (COHA) on Google Books and measured both the frequency and the temporal entropy of each 5-gram. He found that if a word is popular in one decade, it's usually popular in the next decade. He also investigated similarity based on context (i.e., the position in a sentence). Through the study he discovered word similarities, as was the case with the word "nice" being synonymous with "pleasant" around the year 1850.
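One plausible reading of "temporal entropy" is the Shannon entropy of an n-gram's frequency distribution across decades: low when usage is concentrated in one period, high when it is spread evenly. The sketch below assumes that definition, which may differ from the one Adam used, and the counts are made up.

```python
import math


def temporal_entropy(counts_by_decade):
    """Shannon entropy (in bits) of an n-gram's usage across decades.

    counts_by_decade maps a decade label to the raw count in that decade.
    """
    total = sum(counts_by_decade.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts_by_decade.values():
        if count > 0:
            share = count / total  # fraction of all usage falling in this decade
            entropy -= share * math.log2(share)
    return entropy


if __name__ == "__main__":
    steady = {1850: 10, 1860: 10, 1870: 10, 1880: 10}  # evenly spread usage
    bursty = {1850: 40, 1860: 0, 1870: 0, 1880: 0}     # usage in one decade only
    print(temporal_entropy(steady), temporal_entropy(bursty))
```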

I then joined Justin for Nikolaos Aletras's (@nikaletras) presentation, "Representing Topics Labels for Exploring Digital Libraries". In this work, Nikolaos stated that the problem with online documents is that they have no structure, metadata, or manually created classification system accompanying them, which makes it difficult to explore and find specific information. He created unsupervised topic models that were data-driven and captured the themes discussed within the documents. The documents were then represented as a distribution over various topics. To accomplish this, he developed a topic model pipeline where a set of documents acted as the input with the output consisting of two matrices: topic-word (probability of each word on a given topic) and topic-document (probability of each document given the topic). He then used his trained model in a task of identifying as many documents relevant to a set of queries as possible within 3 minutes in a document collection using document models. The data set used was a subset of the Reuters Corpus from Rose et al. 2002. This data set had already been manually classified, so it could be used for model verification. From the data set, 20 subject categories were used to generate a topic model. 84 topics were produced and provided as an alternative means of browsing the documents.

Han Xu presented next with "Topical Establishment Leveraging Literature Evolution" where he attempted to discover research topics from a collection of papers and to measure how well a given topic is recognized by the community. First, Han's group identified research topics whose recognition can be described as either persistent, withering or booming. Their approach was inspired by bidirectional mutual reinforcement between papers and topical recognition. By using the weight of a topic as a sum of its recognitions in papers, he could compare against PageRank and RALEX (their previous work using random walks) and show that their own approach was more suitable, as it was designed to take literature evolution into account, unlike PageRank.

Fuminori Kimura was next with "A Method to Support Analysis of Personal Relationship through Place Names Extracted from Documents", a followup study on previous research for extracting personal relationships through place names. In this work, they extracted personal names and place names and counted the co-occurrence between them. Next, they created a feature vector for each person, then calculated the personal relationships and stored the result in a database for further analysis. When a personal name and a place name appeared in the same paragraph, they hypothesized, it is an indicator of a relationship between the person and the location. Using cosine similarity and clustering, Fuminori found that initial tests of their work on Japanese historical documents could produce a relationship network graph of closely related people backed by their common relationships with locations.
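The co-occurrence and cosine similarity steps can be sketched as below. This is an illustrative approximation, not Fuminori's pipeline: names are matched by naive substring search rather than proper named-entity extraction, and the people, places, and paragraphs are invented.

```python
import math
from collections import Counter


def cooccurrence_vectors(paragraphs, people, places):
    """Count, per person, how often each place appears in the same paragraph."""
    vectors = {person: Counter() for person in people}
    for text in paragraphs:
        places_here = [place for place in places if place in text]
        for person in people:
            if person in text:
                vectors[person].update(places_here)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    paragraphs = ["Aoi visited Kyoto.", "Ben visited Kyoto.", "Chie stayed in Edo."]
    vectors = cooccurrence_vectors(paragraphs, ["Aoi", "Ben", "Chie"], ["Kyoto", "Edo"])
    print(cosine(vectors["Aoi"], vectors["Ben"]))   # share a place
    print(cosine(vectors["Aoi"], vectors["Chie"]))  # share no place
```

Pairs with high similarity would then be candidates for edges in the relationship network graph.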

After a short break, the final set of concurrent sessions started, where I attended Christine Borgman's (@scitechprof) presentation of "The Ups and Downs of Knowledge Infrastructures in Science: Implications for Data Management". In this work, she spoke of how countries in Europe, the U.S., and other parts of the world are now requiring scholars to release the data from their studies and questioned what sort of digital libraries we should be building for this data. Her work was reporting on the progress from the Alfred P. Sloan Foundation's study of 4 different scientific processes and how they make and use data. "What kind of new professionals should be prepared for data mining", she asked. She described four different projects in a 2x2 matrix where two had large amounts of data and two were projects that were just ramping up (with each project of the four holding a unique combination of these traits). The four projects (Center for Embedded Networked Sensing (CENS), Sloan Digital Sky Survey (SDSS), Center for Dark Energy Biosphere Investigations (C-DEBI), and the Large Synoptic Survey Telescope (LSST)) each either had previous methods of storing the data or were proposing ways to handle, store, and filter the large amount of data to come. "You don't just trickle the data out as it comes across the instruments. You must clean, filter, document, and release very specific blocks.", she said of some projects releasing the cleaned data sets while others were planning to opt to release the raw data to the public. "Each data is accompanied by a paper with 250 authors", she said, highlighting that they were greatly used as a basis for much further research.

Carl Lagoze of the University of Michigan presented next with "CED2AR: The Comprehensive Extensible Data Documentation and Access Repository", which he described as "yet another metadata repository collection system." In a deal between the NSF and the Census Bureau, he worked to make better use of the Census Bureau's huge amount of data. The aim of further work on the data was to increase the emphasis on having scientists make data available on the network and to make the data useful for replicating methods, verifying/validating studies, and taking advantage of the results. A key facet of the census data is that it is highly controlled and confidential, with the latter describing both the content itself as well as the metadata of the content. Because of this, both identity and provenance were key issues that had to be dealt with in the controlled data study. Regarding the mixing of this confidential data with public data, Carl said, "Taking controlled data spaces and mixing it with uncontrolled data spaces creates a new data problem in data integrity and scientific integrity."

David Bainbridge presented next with "Big Brother is Watching You - But in a Good Way" where he initially presented the use case of having had something on his screen earlier but not being able to remember the specifics of some of the text. His group has created a system that records and remembers text that has been displayed on a machine running X Windows (think: Linux) and allows the collected data to be searchable with graphical recall.

During the presentation, David gave a live demo wherein he visited a website, which was immediately indexed and became searchable as well as showing results from earlier relevant browsing sessions.

Rachael Kotarski (@RachPK) presented next with "A comparative analysis of the HSS & HEP data submission workflows" where she worked with a UK data archive looking for social science data. She noted that users registering for an ORCID greatly helps with the mining process and takes only five minutes.

Nikos Houssos (@nhoussos) presented the last paper of the day with "An Open Cultural Digital Content Infrastructure" where he spoke of 70 cultural heritage projects costing about 60 million Euros and how his group has helped associate successful validation with funding cash flows. By building a suite of services for repositories, they have provided a single point of access for these services through aggregation and harvesting. Much of the back-end, he said, is largely automated checking and compliance for safe keeping.

Nikos closed out the sessions for Day 2. Following the sessions, the conference dinner was held at The Mermaid Function Centre.

Day 3

The third day of Digital Libraries 2014 was short, but leading off was ODU WS-DL's own Justin Brunelle (@justinfbrunelle) with "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources". In this paper, we (I was a co-author) investigated the effects of missing resources on an archived web page and how the impact of a resource is not sufficiently evaluated on an each-resource-has-equal-weight basis. Instead, using measures of size, position, and centrality, Justin developed an algorithm to weight a missing resource's impact (i.e., "Damage") to a web page if not captured by the archive. He initially used the example of the web comic XKCD and how a single resource (the main comic) has much more importance for the page's purpose than all other resources on the page. When missing a stylesheet, the algorithm considers the background color of the page and the concentration of content with the assumption that if the stylesheet is missing and important, most of the content will be in the left third of the page.
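The idea of weighting missing resources unequally can be illustrated with a toy score. This is a simplified sketch and not the algorithm from the paper: position is folded into a single `centrality` value, the equal weights are arbitrary assumptions, and the resource names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    url: str
    area_fraction: float  # share of the page's pixels the resource covers, 0..1
    centrality: float     # 1.0 at the center of the viewport, 0.0 at the edge
    missing: bool         # True if the archive failed to capture it


def importance(resource, w_size=0.5, w_centrality=0.5):
    """Toy importance: a weighted mix of size and centrality (weights assumed)."""
    return w_size * resource.area_fraction + w_centrality * resource.centrality


def damage(resources):
    """Share of total importance carried by resources that are missing."""
    total = sum(importance(r) for r in resources)
    if total == 0:
        return 0.0
    lost = sum(importance(r) for r in resources if r.missing)
    return lost / total


if __name__ == "__main__":
    page = [
        Resource("comic.png", area_fraction=0.6, centrality=1.0, missing=True),
        Resource("icon.png", area_fraction=0.01, centrality=0.0, missing=False),
    ]
    print(damage(page))  # losing the large, central image dominates the score
```

Under an equal-weight count, either resource going missing would score the same 50%; the weighted version separates the two cases sharply.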

Hugo Huurdeman (@TimelessFuture) followed Justin with "Finding Pages on the Unarchived Web" by first asking, "Given that we cannot crawl lost web pages, how can we recover the content lost?" Working with the web archive of the National Library of the Netherlands, which consists of about 10 terabytes of data from 2007, they focused on a subset of this data for 2012 with the temporal span of a year. From this they extracted the data for processing and sought to answer three research questions:

  1. Can we recover a significant fraction of unarchived pages?
  2. How rich are the representations for the unarchived pages?
  3. Are these representations rich enough to characterize the content?

Using a measure involving Mean Reciprocal Rank, they took the average scores of the first correct result of each query while utilizing keywords within the URLs for non-homepages. A second measure of "Success Rate" allowed them to evaluate that 46.7% of homepages and 46% of non-homepages could have a summary generated if never preserved. Their approach claimed to "Reconstruct significant parts of the unarchived web." based on descriptions and link evidence pointing to the unpreserved pages.
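As a reminder of how these measures are typically computed, here is a small sketch of Mean Reciprocal Rank and a success rate at rank k. The success-rate definition (the correct result appears in the top k) is a common convention assumed here and may not match the exact definition used in the paper; the query data is made up.

```python
def reciprocal_rank(ranked_results, correct):
    """1/position of the first correct result, or 0.0 if it never appears."""
    for position, result in enumerate(ranked_results, start=1):
        if result == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_results, correct) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(results, correct) for results, correct in queries) / len(queries)


def success_rate(queries, k):
    """Fraction of queries whose correct result appears in the top k."""
    if not queries:
        return 0.0
    hits = sum(1 for results, correct in queries if correct in results[:k])
    return hits / len(queries)


if __name__ == "__main__":
    queries = [
        (["a", "b", "c"], "a"),  # correct at rank 1
        (["x", "y", "z"], "y"),  # correct at rank 2
        (["p", "q"], "r"),       # correct result never returned
    ]
    print(mean_reciprocal_rank(queries))
    print(success_rate(queries, k=1), success_rate(queries, k=2))
```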

Nattiya Kanhabua presented last in the session with "What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia" where she investigated the scenario of a computer that forgets intentionally and how that plays into digital preservation. "Forgetting plays a crucial role for human remembering and life.", she said. Nattiya spoke of "managed forgetting", i.e., to remember the right information. "Individuals' memories are subject to a fast forgetting process." She referenced various psychological studies to correlate the preservation process with "flashbulb memories". For a case study, they looked at the Wikipedia view logs as signal for collective memory, as they're publicly available traffic over a long span of time. "Looking at page views does not directly reflect how people forget; significant patterns are a good estimate for public remembering.", she said. Their approach developed a "remembering score" to rank related past events and identify features (e.g., time, location) as having a high correlation with remembering.

Following a short final break, the final paper presentations of the conference commenced. I was able to attend the last two presentations of the conference, where C. Lee Giles of Penn State University presented "RefSeer: A Citation Recommendation System", a citation recommendation system based on the content of an entire manuscript query. His work served as an example of how to build a tool on top of other systems through integration. To further facilitate this, the system contains a novel language translation method and is intended to help users write papers better.

Hamed Alhoori presented the last paper of the conference with "Do Altmetrics Follow the Crowd or Does the Crowd Follow Altmetrics?" where he used bookmarks as metrics. His work found that journal-level altmetrics have significant correlation among themselves compared with the weak correlations within article-level altmetrics. Further, they found that Mendeley and Twitter have the highest usage and coverage of scholarly activities.

Following Hamed's presentation, George Buchanan provided information on the next year's JCDL 2015 and TPDL 2015 (which would again be split into two locations) and what ODU WS-DL was waiting for: the announcements for best papers. For best student paper, the nominees were:

  • Glauber Dias Gonçalves, Flavio Vinicius Diniz de Figueiredo, Marcos Andre Goncalves and Jussara Marques de Almeida. Characterizing Scholar Popularity: A Case Study in the Computer Science Research Community
  • Daniel Hasan Dalip, Harlley Lima, Marcos Gonçalves, Marco Cristo and Pável Calado. Quality Assessment of Collaborative Content With Minimal Information
  • Justin F. Brunelle, Mat Kelly, Hany Salaheldeen, Michele C. Weigle and Michael Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources

For best paper, the nominees were:

  • Chuck Cartledge and Michael Nelson. When Should I Make Preservation Copies of Myself?
  • David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, John Wilkerson and Nick Stramp. Detecting and Modeling Local Text Reuse
  • Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar and Arjen P. de Vries. Finding Pages on the Unarchived Web

The results (above tweet) served as a great finish to a conference with many fantastic papers that we will be exploring in-depth for the next year.

— Mat
