Thursday, October 11, 2018

2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018, held in Boston, MA. Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It.

iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. This year, however, the organizers created collaborative notes for every session and shared them with all attendees to help those who couldn't attend a given session. All the presentation materials and associated papers were also made available via Google Drive.

Day 1 (September 24, 2018): Workshops & Tutorials

On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to get their registration lanyards and iPRES swag.

Afterwards, there were scheduled workshops and tutorials to enjoy throughout the day. Registrants needed to sign up early to get into these workshops. Many different topics were available for attendees to choose from, all listed on the Open Science Framework event page. Shawn and I chose to attend: 
  • Archiving Email: Strategies, Tools, Techniques. A tutorial by: Christopher John Prom and Tricia Patterson.
  • Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop). A workshop by: Anna Perricci.
Our first session, on Archiving Email, consisted of talks and small group discussions on various topics and tools for archiving email. It started with talks on the adoption of email preservation systems in our organizations. In our group discussion, we found that few organizations have email preservation systems. I found the research ideas and topics stemming from these talks very interesting, especially the prospect of studying natural language in email content.
Many of the difficulties of archiving email unsurprisingly revolve around issues of privacy. These difficulties include requesting and acquiring emails from users, discovering and disclosing sensitive information inside emails, and other ethical decisions involved in preserving emails.

Email preservation also has the challenge of curating at scale. As one can imagine, going through millions of emails in a collection can be time consuming and repetitive, which requires the development of new tools to address these challenges.
This session also introduced many interesting tools for archiving and exploring emails.

At the end of the session, many different workflows for archiving email with these tools were explained thoroughly. These workflows covered migrations with different tools, accessing disk images of stored emails and attachments via emulation, and bit-level preservation.

Following the email archiving session, we continued on to the Human Scale Web Collecting for Individuals and Institutions session presented by Anna Perricci from the Webrecorder team.

Having used Webrecorder before, I was very excited for this session. Anna walked through the process of registering and starting your first collection. She explained how to start sessions and how collections are formed as easily as clicking different links on a website. Webrecorder can handle JavaScript replay very efficiently. For example, videos streamed from a website like Vine or YouTube are recorded from the user's perspective and are then available for replay later. Other examples included automated scrolling through Twitter feeds and capturing interactive news stories from the New York Times.
During the presentation, Anna showed Webrecorder's capability of extracting mementos from other web archives to repair missing content. For example, it managed to take CNN mementos from the Internet Archive past November 1, 2016 and then fix their replay by aggregating resources from other web archives and also the live web - although this could also be potentially harmful. This is an example of Time Travel Reconstruct implemented in pywb.

Ilya Kreymer presented the use of Docker containers for emulating different browser environments and how this could play an important role in replaying specific content like Flash. He demonstrated various open source tools available on GitHub, including pywb, Webrecorder WARC player, warcio, and warcit.
Ilya also teased Webrecorder's Auto Archiver Prototype, a system that understands how Scalar websites work and can anticipate URI patterns and other behaviors for these platforms. Auto Archiver automates the capture of many different web resources on a website, including video and other sources.
Webrecorder Scalar automation demo for a Scalar website

To finish the first day, attendees were transported to a reception hosted at the MIT Samberg Conference Center accompanied by a great view of Boston.

Day 2 (September 25, 2018): Paper Presentations and Lightning Talks

To start the day, attendees gathered for the plenary session, which was opened by a statement from Chris Bourg.

Eve Blau then continued the session by presenting Urban Intermedia: City, Archive, Narrative, the capstone project of the Harvard Mellon Urban Initiative, which is funded by a Mellon Foundation grant. It is a collaborative effort across multiple institutions of architecture, design, and humanities. Using multimedia and visual constructs, it looked at processes and practices that shape geographical boundaries, focusing on blind spots in:
  • Planned / unplanned - informal processes
  • Migration / mobility, patterns, modalities of inclusion & exclusion
  • Dynamic of nature & technology, urban ecologies
After the keynote I hurried over to open the Web Preservation session with my paper, Measuring News Similarity Across Ten U.S. News Sites. I explained our methodology for selecting archived news sites, the top-news-selectors tool we created for mining archived news, how the similarity of news collections was calculated, the events that produced peaks in similarity, and how the U.S. election was recognized as a significant event among many of the news sites.

Following my presentation, Shawn Jones presented his paper The Off-Topic Memento Toolkit. Shawn's presentation focused on the many different use cases of Archive-It and then detailed how many of these collections can go off topic. Examples include pages that have missing resources at a point in time, content drift that causes different languages to be included in a collection, and site redesigns. This led to the development of the Off-Topic Memento Toolkit, which detects these off-topic mementos inside a collection by collecting each memento and assigning it a score, testing multiple different measures. The study showed that Word Count had the highest accuracy and best F1 score for detecting off-topic mementos.
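To make the word count idea concrete, here is a minimal sketch of how such a measure might work. It assumes each memento is scored by its relative change in word count against the first memento of the same page, and that a memento is flagged once the score drops below a cutoff. The function names and the -0.7 threshold are hypothetical choices for illustration; this is not the toolkit's actual API or its tuned value.

```python
def word_count(text):
    """Count whitespace-separated tokens in a memento's extracted text."""
    return len(text.split())


def word_count_score(first_text, memento_text):
    """Relative change in word count versus the first memento.

    0.0 means the count is unchanged; -1.0 means every word is gone.
    """
    base = word_count(first_text)
    if base == 0:
        return 0.0  # nothing to compare against
    return (word_count(memento_text) - base) / base


def flag_off_topic(texts, threshold=-0.7):
    """Return indices of mementos whose score falls below the threshold.

    `texts` is ordered by memento-datetime; `threshold` is a hypothetical
    cutoff chosen only for this example.
    """
    if not texts:
        return []
    first = texts[0]
    return [
        index
        for index, text in enumerate(texts)
        if word_count_score(first, text) < threshold
    ]
```

A page that shrinks from a full article to a one-line error message loses most of its words, so it would be flagged, while a page that loses a sentence or two would not.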

Shawn also presented his paper The Many Shapes of Archive-It. He explained how to understand Archive-It collections using their content, metadata (Dublin Core and custom fields), and collection structure, as well as the issues that come with these methods. Using 9,351 collections from Archive-It as data, Shawn explained the concept of growth curves for collections, which compare seed count, memento count, and memento-datetime. Using different classifiers, Shawn showed that the structural features of a collection can be used to predict its semantic category, with Random Forest found to be the best classifier.

Following lunch, I headed to the amphitheater to see Dragan Espenschied's short paper presentation Fencing Apparently Infinite Objects. Dragan questioned how objects, meaning a file or a collection of files, are bound in digital preservation. He introduced the concept of "performative boundaries" to describe the different potentials of an object - bound, blurry, and boundless - using software examples such as early 2000s Microsoft Word (bound), Apple's QuickTime (blurry), and Instagram (boundless). He shared productive approaches for future replay of these objects:

  • Emulation of auxiliary machines
  • Synthetic stub services or simulations
  • Capture network traffic and re-enact on access 

Dragan Espenschied presenting on Apparently Infinite Objects 
The next presentation was Digital Preservation in Contemporary Visual Art Workflows by Laura Molloy, who presented remotely. Her presentation argued that preserving one's own digital work is not a regular part of what is taught at art school, and that it should be. Digital technologies are widely used today for creating art in a variety of formats. When various artists were asked about digital preservation, this is how they answered:
“It’s not the kind of thing that gets taught in art school, is it?”
“You don’t need to be trained in [using and preserving digital objects]. It’s got to be instinctive and you just need to keep it very simple. Those technical things are invented by IT guys who don’t have any imagination.” 
The third presentation was by Morgane Stricot for her short paper Open the museum’s gates to pirates: Hacking for the sake of digital art preservation. Morgane explained that software dependency is a large threat to digital art and that supporting media archaeology is required to preserve some forms of it. Backups of older operating systems (OS) on disks help avoid incompatibility issues. She also detailed how copyright restrictions make some software, for example older versions of Mac OS, difficult to find, and how many pirates as well as "harmless hackers" have cracks that provide access to these OS environments, while some remain unsalvageable.
The final paper presentation was by Claudia Roeck on her long paper Evaluation of preservation strategies for an interactive, software-based artwork with complex behavior using the case study Horizons (2008) by Geert Mul. Claudia explored different possible preservation strategies for software, such as reprogramming in a different programming language, migration of software, virtualization, and emulation, as well as the significant properties that determine which qualities one would want to preserve. She used Horizons as an example project to explore the use cases and determined that reprogramming was one of the options suitable for it. However, she stated that there was no clear winner for the best preservation strategy in the mid-term of the work.
For the rest of the day, lightning talks were available to the attendees, and the room became packed with viewers. Some of these talks introduced preservation games to be held the next day, such as Save my Bits, Obsolescence, Digital Preservation Storage Criteria Game, and more. Ilya, from Webrecorder, gave a lightning talk showing a demo of the new Auto Archiver prototype for Webrecorder.

After the proceedings, another fantastic reception was held, this time at the Harvard Art Museum.

Harvard Art Museum at night

Day 3 (September 26, 2018): Minute Madness, Poster Sessions, and Awards 

This day opened with a review of iPRES's achievements and challenges over the past 15 years, in a panel discussion composed of William Kilbride, Eld Zierau, Cal Lee, and Barbara Sierman. Achievements included the innovation of new research as well as the courage to share and collaborate among peers with similar research. This led to iPRES's adoption of cross-domain preservation in libraries, archives, and digital art. Some of the challenges include archivists having to decide what to do with past and future data, and conforming to the OAIS standard.
After talking about the past 15 years, it was time to talk about the next 15 years with a panel discussion composed of Corey Davis, Micky Lindlar, Sally Vermaaten, and Paul Wheatley. This panel discussed how to make it possible for more people to attend in the future. They discussed possible organizational models to emulate for regional meetings, such as code4lib and NDSR. There were also suggestions for updates to the Code of Conduct and discussion of the value it should hold in the future.
After the discussion panels it was time for minute madness. I had seen videos of this before, but this was the first time I had seen it in person. I found it somewhat theatrical. Most presenters had one minute to pitch their research so that we would later visit them during the poster session, while some of them put on a show, like Remco van Veenendaal. The topics included workflow integration, new portals for preserving digital materials, code ethics, and timelines detailing file formats.

After the minute madness, attendees wandered around to view the posters. The first poster I visited, conveniently, referenced work from our WSDL group!
Another interesting poster consisted of research into file format usage over time.
I was also surprised at the number of tools and technologies behind some of the new preservation platforms for government agencies that have emerged, like Vitam, the French government IT program for digital archiving.

Vitam poster presentation for their digital archiving architecture
Following the poster sessions, I went back to paper presentations, where Tomasz Miksa presented his long paper Defining requirements for machine-actionable data management plans. This talk covered machine-actionable data management plans (maDMPs), which are living documents automated by information collection systems and notification systems. He showed how current formatted data management systems could be transformed to reuse existing standards such as Dublin Core and PREMIS.
Alex Green then went on to present her short paper Using blockchain to engender trust in public digital archives. She explained that archivists alter, migrate, normalize, and sometimes otherwise change digital files, but there is little proof that a researcher receives an authentic copy of a digital file. The ARCHANGEL project proposed using blockchain to verify the integrity of these files and their provenance. It is still unknown whether blockchain will prevail as a lasting technology, as it is still very new. David Rosenthal wrote a review of this paper, which can be found on his blog.
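As a rough illustration of the underlying idea, and not of ARCHANGEL's actual design, the sketch below keeps a simple hash chain of file digests. Each entry commits to the previous entry's hash, so editing an earlier record makes the chain fail verification. A real blockchain adds distribution and consensus on top of this; the function names, field names, and the all-zero genesis value are assumptions made for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder for "no previous entry"


def record_hash(record):
    """Hash a record deterministically by serializing it with sorted keys."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(chain, file_digest, note):
    """Append an entry that commits to the previous entry's hash."""
    previous = chain[-1]["entry_hash"] if chain else GENESIS
    body = {"file_digest": file_digest, "note": note, "previous": previous}
    entry = dict(body, entry_hash=record_hash(body))
    chain.append(entry)
    return entry


def chain_is_intact(chain):
    """Recompute every hash and check each link back to the genesis value."""
    previous = GENESIS
    for entry in chain:
        body = {key: entry[key] for key in ("file_digest", "note", "previous")}
        if entry["previous"] != previous:
            return False
        if entry["entry_hash"] != record_hash(body):
            return False
        previous = entry["entry_hash"]
    return True
```

An archive could append one entry when a file is ingested and another after each migration or normalization; a researcher who recomputes a file's digest can then check it against an entry in a chain that still verifies.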
I then went on to the Storage Organization and Integrity session to see the long paper presentation Checksums on Modern Filesystems, or: On the virtuous consumption of CPU cycles by Alex Garnett and Mike Winter. The talk focused on computing checksums on files to detect bit rot in digital objects and compared different approaches for verifying bit-level preservation. It showed that data integrity can be achieved when the underlying storage, such as a ZFS filesystem, is dedicated to digital preservation. This work builds a bridge between digital preservation practices and high-performance computing for detecting bit rot.
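For readers unfamiliar with file-level fixity checks, the following is a minimal sketch of the general technique: record a checksum for every file, then recompute later and report anything that no longer matches. It is not the authors' method. SHA-256 and the names `sha256_of`, `build_manifest`, and `verify_manifest` are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its current checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose checksum has changed."""
    root = Path(root)
    failures = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative)
    return failures
```

Running `verify_manifest` on a schedule is the CPU cost the paper's title alludes to; a checksumming filesystem such as ZFS performs a comparable check at the block level.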

After this presentation I stayed for the short paper presentation The Oxford Common File Layout by David Wilcox. The Oxford Common File Layout (OCFL) is an effort to define a shared approach to file hierarchy for long-term preservation. The goal of this layout is to provide structure at scale, be ready for migrations, minimize file transfers, and be manageable by many different applications. With a set of defined principles for this file layout, such as the ability to log transactions on digital objects, there is a plan for a draft spec release sometime at the end of 2018.
This day closed with the award ceremony for best poster, short paper, and long paper. My paper, Measuring News Similarity Across Ten U.S. News Sites, was nominated for best long paper but did not win. The winners were as follows:
  • Best short paper: PREMIS 3 OWL Ontology: Engaging Sets of Linked Data
  • Best long paper: The Rescue of the Danish Bits - A case study of the rescue of bits and how the digital preservation community supported it by Eld Zierau
  • Best poster award: Precise & Persistent Web Archive References by Eld Zierau

Day 4 (September 27, 2018): Conference Wrap-up

The final day of iPRES 2018 was composed of paper presentations, discussion panels, community discussions, and games. I chose to attend the paper presentations.

The first paper presentation I viewed was Between creators and keepers: How HNI builds its digital archive by Ania Molenda. Over 4 million documents were recorded to track progressive thinking in Dutch architecture. When converting these materials and moving them into a digital archive, many issues were observed, such as duplicate materials, file formats with complex dependencies, the time and effort needed to digitize the multitude of documents, and knowledge for accessing these documents being lost over time with no standards in place.

Afterwards I watched the presentation on Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Abigail Adams. This covered the acquisition of three different floppy disk collections dating from 1977 to 1989! This presentation introduced me to some unfamiliar hardware, software, and encodings required for attempting to recover data from floppy disk media, as well as a workflow for data recovery from these floppies.

The last paper presentation I viewed was Email Preservation at Scale: Preliminary Findings Supporting the Use of Predictive Coding by Joanne Kaczmarek and Brent West. Having already been to the email preservation tutorial, I was excited for this presentation, and I was not let down. Using 20 GB of publicly available emails, they applied two different methods, a capstone approach and a predictive coding approach, for discovering sensitive content inside emails. With the predictive coding approach, which uses machine learning to train on and then predict the classification of documents, they showed preliminary results suggesting that automatic classification is capable of handling emails at scale.

As a final farewell, attendees were handed bags of tulip buds and told this:
"An Honorary Award will be presented to the people with the best tulip pictures."
It seems William Kilbride, among others, has already gotten a leg up on the competition.
This marks the end of my first academic conference as well as my first visit to Boston, Massachusetts. It was an enjoyable experience with a lot of exposure to diverse research fields in digital preservation. I look forward to submitting work to this conference again and hearing about future research in the realm of digital preservation.

Resources for iPRES 2018:
