Friday, December 14, 2018

2018-12-14: CNI Fall 2018 Trip Report

Mat Kelly reports on his recent trip to Washington, DC for the CNI Fall 2018 meeting

I (Mat Kelly, @machawk1) attended my first CNI (#cni18f) meeting on December 10-11, 2018, an atypical venue for a PhD student, and am reporting my trip experience (also see previous trip reports from Fall 2017, Spring 2017, Spring 2016, Fall 2015, and Fall 2009).

Dr. Nelson (@phonedude_mln) and I left Norfolk, VA for DC, having questioned whether the roads would be clear after an unseasonably significant snowstorm the night before (they were).

The conference was split up into eight sessions with up to 7 separate presentations being given concurrently in each session, which required attendees to choose a session. Between each session was a break, which allowed for networking and informal discussions. The eight sessions I chose to attend were:

  1. Collaboration by Design: Library as Hub for Creative Problem Solving Space
  2. From Prototype to Production: Turning Good Ideas into Useful Library Services
  3. First Steps in Research Data Management Under Constraints of a National Security Laboratory
  4. Blockchain: What's Not To Like?
  5. The State of Digital Preservation: A Snapshot of Triumphs, Gaps, and Open Research Questions
  6. What Is the Future of Libraries in Academic Research?
  7. Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
  8. Building Infrastructure and Services for Open Access to Research

Also be sure to check out Dale Askey's report of the CNI 2018 Fall Membership Meeting. With so many concurrent sessions, he had a different experience.

Day One

In the Opening Plenary prior to the concurrent sessions, Cliff Lynch described his concerns with the suggestions for using blockchain as the panacea for data problems. He discounted blockchain's practicality as a solution to most of the problems to which it is applied. He expressed more concern about, but also enthusiasm for, the use of machine learning (ML), while stating his wariness of ML's alignment with AI. Without training sets, he noted further, ML does not do well. He also noted that if there is bias in training data, the classifiers learn to internalize the biases.

Cliff continued by briefly discussing the rollout of 5G: while it will create competition in home cable-based Internet access, it will not fix the digital divide. Those that don't currently have access will likely not gain access with the introduction of 5G. He went on to describe his concerns over IoT devices and the emulation of old systems, including the security implications of reintroducing old, unpatched software.

He then mentioned the upcoming sunsetting of The Digital Preservation Network (DPN) and how their handling of the phase-out process is a good example of what is at stake in terms of good behavior (cf. "we're shutting down in 2 weeks"). DPN's approach is very systematic in that they are going through their holdings and figuring out where the contents need to be returned, where other copies of these contents are being held, etc. As was relevant, he also mentioned the recently announced formalization of a succession plan by CLOCKSS for when the time comes that the organization ceases operation.

Continuing, Cliff referenced Lisa Spiro's Plenary Talk at Personal Digital Archiving (PDA) 2018 this past April (which WS-DL attended in 2017, 2013, 2012, and 2011) and the dialogue that occurred following Hurricane Harvey on enabling and archiving the experiences of those affected. He emphasized that as exemplified in the cases of natural disasters like the hurricane, the recent wildfires in California, etc., we are in an increasing state of denial about how increasingly valuable the collections on our sites are.

Cliff also referenced recent developments in scholarly communication with respect to open access, namely the raising of the technical bar with the deposit of articles with a strict DTD as prescribed by the European Plan S. The plan requires researchers who receive state funding to publish their work in open repositories or journals. He mentioned that for large open-access journals like PLoS and "big commercial players", doing so is not much of a problem compared to the hardship that will be endured by smaller "labor of love" journals like those administered using OJS. He also lamented the quantification of impact measurement in non-reproducible ways and the potential long-term implications of using measures like this. In contrast, he noted that when journal editors get together and change the rules to reflect desired behaviors in a community, they can be a powerful force for change. He used the example of genomic journals requiring data submission to GenBank prior to any consideration of submission.

After touching on a few more points, Cliff welcomed the attendees and thus began the concurrent sessions.

Collaboration by Design: Library as Hub for Creative Problem Solving Space

The first session I attended was very interactive. The three speakers (Elliot Felix, Julia Maddox, and Mary Ann Mavrinac) gave a high-level overview of the iZone system as deployed at the University of Rochester. They first asked the attendees to go to a site or text their replies to a prompt about the role of libraries and their needs, and the presentation screen enumerated the responses as they came in in real time.

The purpose of the iZone system as they described was to serve as a collaboration hub for innovation for the students to explore ideas of social or community benefit. The system seemed open-ended but the organization helpful to students where they "didn't have a methodology to do research or didn't know how to form teams."

Though the iZone team tried to encourage an "entrepreneurial" mindset, their vision statement intentionally did not include the word, as they found that students did not like the connotations of the word. The presenters then had the audience members fill out a sort-of Mad Lib, supplied on the attendees' seats, stating:

For __audience__ who __motivation__ we deliver __product/service__ with __unique characteristic__ that benefit __benefit__.

Most of those who supplied a response followed a similar style: offering students some service for the benefit of whatever the library at their institution offered. Of course, the iZone representatives provided their own, relating to offering a "creative space for problem solving".

Describing another barrier with students, the presenters borrowed a tactic that Bob McKim used on the first day of classes while he was still teaching: having students draw their neighbor on a sheet of paper for 20 seconds. Having the audience at CNI do this was to demonstrate that "we fear judgement of peers" and "Throughout our education and upbringing, we worry about society's reaction to creative thoughts and urges, no matter how untamed they may be."

This process was an example of how they (at iZone) would help all students to become resilient, creative problem solvers.

Slides for this presentation are available (PDF)

From Prototype to Production: Turning Good Ideas into Useful Library Services

After a short break, I attended the session presented by Andrew K. Pace (@andrewkpace) of OCLC and Holly Tomren (@htomren) of Temple University Libraries. Andrew described a workflow that OCLC Research has been using to ensure that prototypes that are created and experimented with do not end up sitting in the research department without going to production. He described their work on two prototype-to-production projects: IIIF integration into a digital discovery environment and a prototype for digital discovery of linked data in Wikibase. His progression from prototyping to production consisted of 5 steps:

  1. Creating a product vision statement
  2. Justifying the project using a "lean canvas" to determine effort behind a good idea.
  3. "Put the band together" consisting of assembling those fit to do the prototyping with "stereophonic fidelity" (here he cited Helen Keller with "Happiness is best attained through fidelity to a worthy purpose")
  4. Setting Team Expectations using OCLC Community Center (after failing to effectively use a listserv) to concretely declare a finishing date that they had to stick to, to prevent research staff from having to manage the project after completion.
  5. Accepting the outcome with a Fail Fast & Hard approach, stating that "you should be disappointed if you wind up with something that looks exactly like you expected to build from the start."

Holly then spoke of her experience at Temple piloting the PASSAGE project (the above Wikibase project) from May to September 2018. An example use case they used in their pilot was asking users to annotate the Welsh translation of James and the Giant Peach. One question asked was which properties should be associated with the original work and which with the translation.

Another such example was a portrait of Caroline Still Anderson from Temple University Libraries' Charles L. Blockson Afro-American Collection, with deliberation on attributes like "depicts" rather than "subject" in describing the work. In a discussion with the community, they sought to clarify the distinction between the photo itself and the subject in the photo. "Which properties belong to which entity?", they asked, noting the potential loss of context if you did not click through. To further emphasize this point, she discussed a photo titled "Civil Rights demonstration at Girard College August 3, 1965", where a primary attribute of "Philadelphia" would remove too much context in favor of more descriptive attributes of the subject in the photo like "Event: demonstration" and "Location: Girard College".

These sorts of descriptions, Holly summarized, needed a cascading, inheritance style of description relative to other entities. This experience was much different from her familiarity with using MARC records to describe entities.
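To make the "cascading, inheritance style" idea concrete, here is a minimal sketch of my own (not the PASSAGE or Wikibase data model). The `Entity` class, its `lookup` method, and the property names `author` and `language` are all hypothetical; the only thing taken from the talk is the question of which properties belong to the original work and which to the translation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """A described thing (a work, a translation, a photograph, ...)."""
    label: str
    properties: dict = field(default_factory=dict)
    parent: Optional["Entity"] = None  # the entity this one derives from, if any

    def lookup(self, name):
        """Return (value, label of the entity that supplied it).

        The entity's own properties win; otherwise the lookup cascades
        up the parent chain. Returns (None, None) if nothing is found.
        """
        node = self
        while node is not None:
            if name in node.properties:
                return node.properties[name], node.label
            node = node.parent
        return None, None


# Hypothetical property assignments, for illustration only.
original = Entity(
    "James and the Giant Peach",
    {"author": "Roald Dahl", "language": "English"},
)
translation = Entity("Welsh translation", {"language": "Welsh"}, parent=original)

print(translation.lookup("language"))  # stated on the translation itself
print(translation.lookup("author"))    # inherited from the original work
```

Returning the supplying entity's label alongside the value keeps visible the context that would otherwise be lost if a reader did not "click through" to the related entity.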

First Steps in Research Data Management Under Constraints of a National Security Laboratory

Martin Klein (@mart1nkle1n) and Brian Cain (@briancain101) of Los Alamos National Laboratory (LANL) presented next, with Martin initially highlighting a 2013 OSTP memorandum stating that all federal agencies with over $100 million in R&D research are required to store their data and make it publicly accessible to search, retrieve, and analyze. LANL, as one of 17 national labs under the US Department of Energy with $12 billion in R&D funding (much greater than $100 million), was required to abide.

Brian highlighted a series of interviews at other institutions as well as in-depth interviews about data at their own institution. Responses to these interviews expressed a desire for a centralized storage solution that would replace storing data locally and give more assurance of its location "after the postdoc has left".

Martin documented an unnecessarily laborious process of having to burn data to a DVD, walk it to their review and release system (RASSTI), and then, once complete, physically walk the approval to a second location. He reported that this was a "Humungous pain" and thus "lots of people don't do this even though they should". He noted that the lab has started initiatives to look into where money goes, tracing it from the starting point of an idea to funding, to papers, patents, etc.

He went on to describe the model used by the Open Science Framework (OSF) to bring together portability measures the researchers at LANL were already used to. Based on OSF, they created "Nucleus", a scholarly commons to connect the entire cycle and act as the glue that sits in the middle of research workflow and productivity portals. Nucleus can connect to storage solutions like GitLab and other authentication systems (or whatever their researchers are used to and want to reuse) to act as a central means of access. As a prototype, Martin's group established an ownCloud instance to demonstrate the power of a sync-n-share solution for their lab. The intention is for Nucleus to make the process of submitting datasets to RASSTI, obtaining approval, and complying much less laborious.

Blockchain: What's Not To Like?

David Rosenthal presented the final session of the day that I attended, one that was much anticipated based on the promotion in the official CNI blog post. As is convention, Rosenthal's presentation consisted of a near-literal reading of his (then-) upcoming blog post with an identical title. Go and read that to do his very interesting and opinionated presentation justice.

As a very high-level summary, Rosenthal emphasized the attractiveness but misapplication of blockchain with respect to usage in libraries. He repeatedly referred to Satoshi Nakamoto's revolutionary idea to have the consensus concept decentralized, which is often the problematic part of these sorts of systems. The application of the ideas, though, he summarized, and the side effects (e.g., potential manipulation of consensus, high trading latency) of Bitcoin as an exemplification of blockchain highlighted the real-world use case and the corresponding issues.

Rosenthal repeatedly mentioned the pump-and-dump schemes that allow for price manipulation and to "create money out of thin air". Upon completion of his talk and some less formal, opinionated thoughts on Austrian-led efforts for promotion of blockchain/Bitcoin (through venues of universities, companies, etc.), Dr. Nelson asked "Where are we in 5 years?"

Rosenthal answered with his prediction: "Cryptocurrency has been devaluing for a year. It is hard to sustain a belief that cryptocurrencies will continue 'going up'; miners are getting kicked out of mining. This is a death spiral. If it gets to this level, someone can exploit it. This has happened to small altcoins. You can see instances of using Amazon computing power to mount attacks".

Day 1 of CNI finished with a reception consisting of networking and some decent crab cakes. In a gesture of cosmic unconsciousness, Dr. Nelson preferred the plates of shrimp.

Day Two

Day two of the CNI Fall 2018 meeting started with breakfast and one of the four sessions of the day.

The State of Digital Preservation: A Snapshot of Triumphs, Gaps, and Open Research Questions

The first session I attended was presented by Roger C. Schonfeld (@rschon) & Oya Rieger (@OyaRieger) of Ithaka S+R (of recent DSHR infamy), who reported on a recent open-ended study with "21 subject experts" to identify outstanding perspectives and issues in digital preservation. Oya noted that the interviewees were not necessarily a representative sample.

Oya referenced her report, recently published in October 2018 titled, "The State of Digital Preservation in 2018" that highlights the landscape of Web archiving, among other things, and how to transition the bits preserved for future use. In the report she (summarily) asked:

  1. What is working well now?
  2. What are your thoughts on how the community is preparing for new content types and formats?
  3. Are you aware of any new research in practices and their impact?
  4. What areas need further attention?
  5. If you were writing a new preservation grant, what would be your focus?

From the paper, she noted that there are evolving priorities in research libraries, which are already spread thin. She questioned whether digital preservation is a priority for libraries' overall role in the community. Oya referenced the recent Harper's article, "Toward an ethical archive of the web" with a thought-provoking pull quote of "When no one is likely to lay eyes on a particular post or web page ever again, can it really be considered preserved?"

What Is the Future of Libraries in Academic Research?

Tom Hickerson, John Brosz (@jbrosz), and Suzanne Goopy of the University of Calgary presented next, noting that academic research has changed and asking whether libraries have adapted. Through support of the Mellon Foundation, their group explored a multitude of projects, which John enumerated. They sought to develop a new model for working with campus scholars using a research platform as well as providing equipment to augment the library's technical offerings.

Suzanne, a self-described "library user", described Story Map ECM (Empathic Cultural Mapping), which helps identify small data in big data and vice versa. This system merges personal stories of newcomers to Calgary using a map to show (for example) how an adjustment to bus routes in Calgary can affect a newcomer's health.

Tom closed the session by emphasizing the need to be able to support a diversity of research endeavors through a research platform to offer economy of scale instead of one-off solutions. Of the 12 projects that John described, he stated, there was only one instance where they asked for a resource that the library had to try to subscribe to, emphasizing the under-utilized availability of library resources. Even with this case, he mentioned, it was an unconventional example of access. "By having a common association with a research project", he continued, "these various silos of activity have developed new relationships with each other and strengthened our collegial involvement."

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

WS-DL's own Dr. Michael L. Nelson (@phonedude_mln) presented the second of two sessions relating to blockchain that I attended at CNI, greeting the attendees with Explosions in the Sky (see related posts). He started with a recent blog post in which Peter Todd claimed to "Carbon Date (almost) the Entire Internet" and the contained caveat stating "In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshot". Todd's report was more applicable to ensuring that IA holdings like music recordings have not been altered (Nelson stated that "It's great to know that your Grateful Dead recording has not been modified") but is not as useful for validating Web pages. The fixity techniques Todd used are too naive to be applicable to Web archiving.

Nelson then demonstrated this using a memento of a Web page recording his travel log over the years. When this page is retrieved with curl from different archives, each reports a different content length due to how the content has been amended at replay time. This served as a base example of the runtime manipulation of a page without any external resources. However, most pages contain embedded resources, JavaScript that can manipulate the HTML, etc., which cause an increasing level of variability in the content length, the content preserved, and the content served at time of replay.
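As a rough illustration of why a naive hash-based fixity check breaks down here, the following is my own sketch and not code from the talk. The `replay` function, the archive hostnames, and the banner markup are all hypothetical stand-ins for what real archives do at replay time (rewriting links and injecting a banner); the point is only that the same captured bytes yield different lengths and digests once each archive amends them.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex digest of the payload, the kind of value a fixity check would store."""
    return hashlib.sha256(payload).hexdigest()


def replay(raw: bytes, archive_prefix: bytes, banner: bytes) -> bytes:
    """Mimic replay-time amendment (hypothetical): rewrite absolute links to
    point back into the archive and inject an archive banner into the body."""
    rewritten = raw.replace(b'href="http://', b'href="' + archive_prefix + b'http://')
    return rewritten.replace(b"<body>", b"<body>" + banner)


# One hypothetical captured page, served by two hypothetical archives.
raw = b'<html><body><a href="http://example.com/trips.html">Trips</a></body></html>'
archive_a = replay(raw, b"https://archive-a.example/web/2018/", b"<div>Archive A banner</div>")
archive_b = replay(raw, b"https://archive-b.example/2018/", b"<!-- replayed by B -->")

for name, body in [("raw", raw), ("archive A", archive_a), ("archive B", archive_b)]:
    print(name, len(body), sha256_hex(body)[:16])
```

All three digests differ even though the underlying capture is identical, so a hash recorded from one replay says nothing about whether another replay of the same memento has been tampered with.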

As a potential solution, Nelson facetiously suggested that the whole page with all embedded resources could be loaded, a snapshot taken, and the snapshot hashed; however, he demonstrated a simple example where a rotating image changed at runtime via JavaScript would indicate a change in presentation despite no change in contents, so discarded this approach.

Nelson then highlighted work relating to temporal violations in the archive where, because of the difference in time of preservation of embedded resources, pages that never existed are presented as the historical record of the live Web.

Even when viewing the same memento over time, the problem remains: what one sees at one time in an archive may be different later -- hardly a characteristic one would expect from an "archive". As an example, Nelson replayed a memento of the raw and rewritten versions of the homepage of the future losers of the 2018 Peach Bowl. By doing so 35 times between November 2017 and October 2018, Nelson noted variance in the very same memento, even when stable (e.g., images failed to load), as well as an archival outage due to a subsequently self-reported upgrade. Nelson found that in 11 months, 11% of the URLs they surveyed disappeared or changed. This conclusion was supported by observing 16,627 URI-Ms over that time frame and observing from 17 different archives an 87.92% result of the hash of a page being two different values within that time frame. The conclusion: You cannot replay twice the same archived page (with a noted apology to Heraclitus).
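The kind of measurement described above can be sketched in a few lines. This is my own minimal illustration, not the method or code used in the study: the function name `changed_fraction`, the `(uri_m, content_hash)` input shape, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def changed_fraction(observations):
    """Fraction of URI-Ms whose replayed content hashed to more than one value.

    `observations` is an iterable of (uri_m, content_hash) pairs gathered
    from repeated replays of the same mementos over some time frame.
    """
    hashes_seen = defaultdict(set)
    for uri_m, digest in observations:
        hashes_seen[uri_m].add(digest)
    if not hashes_seen:
        return 0.0
    changed = sum(1 for digests in hashes_seen.values() if len(digests) > 1)
    return changed / len(hashes_seen)


# Hypothetical observations: two replays each of four mementos.
sample = [
    ("urim-1", "aaa"), ("urim-1", "aaa"),  # stable
    ("urim-2", "bbb"), ("urim-2", "ccc"),  # changed between replays
    ("urim-3", "ddd"), ("urim-3", "ddd"),  # stable
    ("urim-4", "eee"), ("urim-4", "fff"),  # changed between replays
]
print(changed_fraction(sample))
```

A memento counts as changed as soon as two replays disagree, regardless of how many replays were made, which is a deliberately coarse choice for the sketch.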

As a final analogy, Nelson played a video of a scene from Monty Python and the Holy Grail, alluding to the guards as the archive and the king as the user.

Building Infrastructure and Services for Open Access to Research

The final session was presented by Jefferson Bailey (@jefferson_bail), Jason Priem (@jasonpriem, presenting remotely via Zoom), and Nick Shockey (@nshockey). Jason initially described his motivations and efforts in creating Unpaywall, which seeks to create an open database of scholarly articles. He emphasized that their work is open source and was delighted to see the reuse of their early prototypes by Plum Analytics. All data that Unpaywall collects is available using their data APIs, which serve about 1 million calls per day and are "well used and fast".

Jason emphasized that his organization behind Unpaywall (Impactstory) is a non-profit, so it cannot be "acquired" in the traditional sense. Unpaywall seeks to be 98% accurate in the level of open access in returned results and works with publishers and authors to increase the degree of openness of work if unsatisfactory.

He and the co-owner of Impactstory published a paper titled "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles" that categorized these degrees of open access and quantified the current state of open access articles in the scholarly record. Some of these articles from the 1920s, he stated, were listed as Open Access even though the concept did not exist at the time. He predicted that within 8 years, 80% of articles will be Open Access based on current trends. Unpaywall has a browser extension freely available.
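For readers who have not used the Unpaywall data API, here is a small sketch of how a lookup by DOI might be prepared and interpreted. It is my own illustration and not something shown in the talk. The helper names `unpaywall_url` and `best_oa_link` are hypothetical, the DOI and sample record are made up, and the endpoint shape and the field names (`is_oa`, `best_oa_location`, `url_for_pdf`, `url`) reflect my understanding of the v2 API and should be checked against Unpaywall's current documentation. The code deliberately makes no network call.

```python
from urllib.parse import quote

# Assumed root of the Unpaywall v2 REST API; verify against the official docs.
API_ROOT = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for a DOI; the API asks callers to identify
    themselves with an email address in the query string."""
    return f"{API_ROOT}{quote(doi, safe='/')}?email={quote(email)}"


def best_oa_link(record: dict):
    """Pick a link to an open copy from a parsed JSON response, or None.

    Field names here are assumptions based on my reading of the v2 schema.
    """
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")


# Hypothetical DOI and hand-written sample record, not real API output.
print(unpaywall_url("10.1234/example", "me@example.org"))
sample_record = {
    "doi": "10.1234/example",
    "is_oa": True,
    "best_oa_location": {
        "url": "https://repository.example/item/1",
        "url_for_pdf": None,
    },
}
print(best_oa_link(sample_record))
```

Preferring the PDF link and falling back to the landing page is just one reasonable choice; a real client would also want rate limiting and error handling around the HTTP request itself.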

Jefferson described Internet Archive's efforts at preservation in general with projects like GifCities, a search engine on top of archived Geocities for all GIFs contained within, and a collection of every PowerPoint in military domains (about 60,000 in number). Relating to the other presenters' objectives, he provided the IA one-liner objective to "Build a complete, use-oriented, highly available graph and archive of every publicly access scholarly article with bibliographic metadata and full-text, enhanced with public identifier metadata, linked with associated data/blog/etc, with a priority on long-tail, at-risk publications and distributed, machine-readable access."

He mentioned that a lot of groups (inclusive of Unpaywall) are doing work in aggregating Open Access articles. They are taking three approaches toward facilitating this:

  • Top-down: using lists, IDs, etc to target harvesting
  • Middle-sideways: Integrating with OA public systems and platforms
  • Bottom-up: using open source tools, algorithms, and machine learning to identify extant works, assess quality of preservation, identify additional materials.

Jefferson referenced IA's use of Grobid for metadata extraction and through their focus on the not-so-well archived, they found 2 million long tail articles that have DOIs that are not archived. Of those he found, 2 out of 3 articles were duplicates. With these removed, IA currently has about 10 million articles in their collection. Their goal is to build a huge knowledge graph of what is archived, what is out there, etc. Once they have that, they can build services on top of it.

Nick presented last of the group and first mentioned he was filling in for Joe McArthur (@Mcarthur_Joe). Nick introduced Open Access Button, which provides free alternatives to paywalled articles with a single click. If they are unable to, their service "finds a way to make the article open access for you". They recently switched from a model of user tools to institutional tooling (with a library focus). Their tools, as Nick reported, were able to find Open Access versions for 23.2% of ILL requests using Open Access or Unpaywall. They are currently building a way to deposit an article when an Open Access version is not available, using a simple drag-and-drop procedure after notifying authors. This tool can also be embedded in institutions' Web pages for easier accessibility for authors to facilitate more Open Access works.

Slides for this presentation are available (PDF).

Closing Plenary

And with that session (and a break), Cliff Lynch began the closing plenary of CNI by introducing Patricia Flatley Brennan, director of the National Library of Medicine (NLM). She initially described NLM's efforts to create Trust in Data. "What does a library do?", she said, "We fundamentally create trust. The substrate is data."

She noted that the NLM is best known for its products and services like PubMed, the MEDLINE database, the Visible Human Project, etc.

"There has never been a greater need for trusted, secure, accessible, valued information in this world.", she said, "Libraries, data scientists, and networked information specialists are essential to the future." Despite the "big fence" around the physical NLM campus in Bethesda, the library is open for visits. She described a refactoring of PubMed via PubMed Labs to create a relevance-based ranking tool instead of reverse temporal order. This would also entail a new user interface. Both of these improvements were informed by the observation that 80% of the people that launch a PubMed search never go to the second page.

...and finally

Upon completion of the primary presentation and prior to audience questions, Dr. Nelson and I left to beat the DC traffic back to Norfolk. Patricia's slides are promised to be available soon from CNI, which I will later include in this post.

Overall, CNI was an interesting meeting and quite different from those I am used to attending. The heavier, less technical focus was an interesting perspective and made me even more aware that there is quite a lot of what is done in libraries of which I have only a high-level idea. As a PhD student, in Computer Science no less, I am grateful for the rare opportunity to see the presentations in person when I have only ever been able to view them via Twitter from afar. Beyond this post I have also taken extensive notes for many topics that I plan to explore in the near future to make myself aware of current work and research going on at other institutions.

—Mat (@machawk1)
