2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018 located in Boston, MA, for which both Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It.

iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. This year, however, the organizers created and shared collaborative notes for all sessions to help attendees who couldn't be at every one. All the presentation materials and associated papers were also made available via Google Drive.

Day 1 (September 24, 2018): Workshops & Tutorials

On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to get their registration lanyards and iPRES swag.

Registration desk for #ipres2018 is almost open! We’ll have a total of almost 400 participants! pic.twitter.com/HiXGK9pBgn
— Micky Lindlar (@MickyLindlar) September 24, 2018

Afterwards, there were scheduled workshops and tutorials to enjoy throughout the day. Registrants needed to sign up early to get into these workshops. There were many different topics for attendees to choose from, all listed on the Open Science Framework event page. Shawn and I chose to attend:

Archiving Email: Strategies, Tools, Techniques. A tutorial by: Christopher John Prom and Tricia Patterson.
Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop). A workshop by: Anna Perricci.

Our first session on Archiving Email consisted of talks and small group discussions on various topics and tools for archiving email. It started with talks on the adoption of email preservation systems in our organizations. Within our group discussion, we found that few organizations have email preservation systems. I found the research ideas and topics stemming from these talks very interesting, especially the prospect of studying natural language in email content.

#ipres2018 is kicked off by @chrisprom and Tricia Patterson: "Archiving Email: Strategies, Tools, Techniques", augmenting a recent report "The Future of Email Archives" by Council on Library and Information Resources (CLIR) - https://t.co/JbVMwCww1S pic.twitter.com/fQXV8f39s7
— Shawn M. Jones (@shawnmjones) September 24, 2018

Many of the difficulties of archiving email unsurprisingly revolve around privacy. They range from requesting and acquiring emails from users, to discovering and disclosing sensitive information inside emails, to other ethical decisions about preserving emails.

Email preservation also has the challenge of curating at scale. As one can imagine, going through millions of emails in a collection can be time-consuming and repetitive, which calls for the development of new tools.

#ipres2018 @WilliamKilbride is leading a discussion on email preservation - We need more tools and skills "otherwise we wouldn't need this workshop today at all." - Is email an official record? - "Archiving is not storage. Backup and preservation are different." pic.twitter.com/i7Fg7Y2uif
— Shawn M. Jones (@shawnmjones) September 24, 2018

This workshop also introduced many interesting tools for archiving and exploring emails, including:

Listening to Ric Ferrante from Smitsonian give demo on DArcMail a Python-based e-mail archiving system that they developed. #ipres2018 pic.twitter.com/vVZhlIt4uv
— Michael Moosberger (@MoosbergerM) September 24, 2018

.@WendyGogel demonstrates EASi from @Harvard "to prepare born-digital materials for submission to the Harvard University Library Digital Repository Service (DRS)" as part of the #ipres2018 workshop "Archiving Email: Strategies, Tools, Techniques"https://t.co/WZME8n351B pic.twitter.com/a1RCIDIFcl
— Shawn M. Jones (@shawnmjones) September 24, 2018

#ipres2018 @glynnedw presents ePADD for email archives, which "dedups messages", "resolves names of correspondents into one correspondent", "does a lot of entity extraction", handles image attachments, and provides search capabilities for things like PIIhttps://t.co/TaWrbecZVQ pic.twitter.com/Xe6SPwuFhy
— Shawn M. Jones (@shawnmjones) September 24, 2018

Camille Tyndall Watson from @NCArchives demonstrates TOMES (Transforming Online Mail with Embedded Semantics). It uses its own dictionary for PII based on NC statutory laws, @Stanford work on named entities. It is available as several packages: https://t.co/lLQvWZtikY #ipres2018 pic.twitter.com/wFBUDZ3rPi
— Shawn M. Jones (@shawnmjones) September 24, 2018

Many different workflows for archiving email, several of them using the aforementioned tools, were explained thoroughly at the end of the session. These workflows covered migrations with different tools, accessing disk images of stored emails and attachments via emulation, and bit-level preservation.

@chrisprom showing basic bit level workflow for email archiving #s101 #ipres2018 pic.twitter.com/rLkzufcjB9
— Brent M. West (@BrentMWestCU) September 24, 2018

Following the email archiving session, we continued on to the Human Scale Web Collecting for Individuals and Institutions session presented by Anna Perricci from the Webrecorder team.

Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop at iPRES 2018) from Anna Perricci

Having used Webrecorder before, I was very excited for this session. Anna walked through the process of registering and starting your first collection. She explained how to start sessions and how collections are built up simply by clicking different links on a website. Webrecorder can handle JavaScript replay very efficiently. For example, videos streamed from a website like Vine or YouTube are recorded from the user's perspective and are then available for replay later. Other examples included automated scrolling through Twitter feeds and capturing interactive news stories from the New York Times.

#ipres2018 @AnnaPerricci shares “Snowfall”, a story about an avalanche by @nytimes containing a lot of interactive content - it was well captured by @webrecorder_io which emphasizes this tool’s importance as part of the historians’ toolkit https://t.co/1dtoKzMmnY pic.twitter.com/s0tWS3WDk7
— Shawn M. Jones (@shawnmjones) September 24, 2018

During the presentation, Anna showed Webrecorder's ability to extract mementos from other web archives in order to repair missing content. For example, it managed to take CNN mementos from the Internet Archive dated after November 1, 2016 and fix their replay by aggregating resources from other web archives and also the live web, although this approach could also be potentially harmful. This is an example of Time Travel Reconstruct implemented in pywb.

Ilya Kreymer presented the use of Docker containers for emulating different browser environments and how this could play an important role in replaying specific content like Flash. He demonstrated various open source tools available on GitHub, including pywb, Webrecorder WARC player, warcio, and warcit.

Brownian Motion. @webrecorder_io using Docker to emulate a website in the original Firefox browser environment. Of huge importance for replaying Flash web applications. #iPRES2018 pic.twitter.com/6mjFjlqYVF
— Edith Halvarsson (@EdithHalvarsson) September 24, 2018

Ilya also teased Webrecorder's Auto Archiver Prototype, a system that understands how Scalar websites work and can anticipate URI patterns and other behaviors for this platform. Auto Archiver automates the capture of many different web resources on a website, including video and other sources.

#ipres2018 @IlyaKreymer demonstrates "Webrecorder Auto Archiver Prototype for Scalar" using automation to capture many web resources, including video - currently experimental and right now only works for sites using the Scalar publishing platformhttps://t.co/sSjaUiO7tm pic.twitter.com/IkWiBtyn9z
— Shawn M. Jones (@shawnmjones) September 24, 2018

Webrecorder Scalar automation demo for a Scalar website

To finish the first day, attendees were transported to a reception hosted at the MIT Samberg Conference Center accompanied by a great view of Boston.

#ipres2018 view from MIT. pic.twitter.com/Cb1UxmEV0j
— tre berney (@treberney) September 25, 2018

Day 2 (September 25, 2018): Paper Presentations and Lightning Talks

To start the day, attendees gathered for the plenary session, which was opened by a statement from Chris Bourg.

“Preservation is an act of care and caring” great opening comment from @mchris4duke reminds me of some of @nowviskie’s take on importance of an ethic of care for digital infrastructures https://t.co/ZxZ1nHSNCB #ipres2018 pic.twitter.com/h6J6msLDNn
— Trevor Owens 💾🗄🕚 (@tjowens) September 25, 2018

Eve Blau then continued the session by presenting Urban Intermedia: City, Archive, Narrative, the capstone project of the Harvard Mellon Urban Initiative, a Mellon Foundation project. It is a collaborative effort across multiple institutions of architecture, design, and the humanities. Using multimedia and visual constructs, it looked at the processes and practices that shape geographical boundaries, focusing on blind spots in:

Planned / unplanned - informal processes
Migration / mobility, patterns, modalities of inclusion & exclusion
Dynamic of nature & technology, urban ecologies

After the keynote, I hurried over to open the Web Preservation session with my paper on Measuring News Similarity Across Ten U.S. News Sites. I explained our methodology for selecting archived news sites, the tool we created for mining archived news (top-news-selectors), how the similarity of news collections was calculated, the events that peaked in similarity, and how the U.S. election was recognized as a significant event among many of the news sites.

Measuring News Similarity Across Ten U.S. News Sites from Grant Atkins

Following my presentation, Shawn Jones presented his paper The Off-Topic Memento Toolkit. Shawn's presentation focused on the many different use cases of Archive-It, and then detailed how many of these collections can go off topic. Examples include pages that are missing resources at a point in time, content drift that causes different languages to be included in a collection, and site redesigns. This led to the development of the Off-Topic Memento Toolkit, which detects off-topic mementos inside a collection by collecting each memento and assigning it a score, with multiple different measures tested. In this study, Word Count had the highest accuracy and best F1 score for detecting off-topic mementos.
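To give a feel for how a word-count measure can flag off-topic mementos, here is a minimal sketch. It is not the toolkit's actual code: the function names are hypothetical, the comparison against the first memento is a simplifying assumption, and the threshold is only an illustrative value.

```python
import re


def word_count(text):
    """Count word-like tokens in a block of plain text."""
    return len(re.findall(r"\w+", text))


def word_count_change(first_text, later_text):
    """Relative change in word count versus the first memento.

    A value of -1.0 means every word was lost; 0.0 means no change.
    """
    first = word_count(first_text)
    if first == 0:
        # Nothing to compare against, so report no change.
        return 0.0
    return (word_count(later_text) - first) / first


def flag_off_topic(memento_texts, threshold=-0.85):
    """Return one boolean per memento; True marks a likely off-topic memento.

    The threshold is an illustrative assumption, not a value taken from the paper.
    """
    if not memento_texts:
        return []
    first = memento_texts[0]
    return [word_count_change(first, text) < threshold for text in memento_texts]
```

A memento whose text has collapsed to a short error message, for instance, would show a large negative change and be flagged, while ordinary edits to a page would not.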

Thanks to @shawnmjones for all the tweets about the @webrecorder_io workshop at #ipres2018 yesterday. He’s now giving an excellent presentation in session 205 pic.twitter.com/I8Xx06ByWb
— Anna Perricci (@AnnaPerricci) September 25, 2018

The Off-Topic Memento Toolkit from Shawn Jones

Shawn also presented his paper The Many Shapes of Archive-It. He explained how to understand Archive-It collections using their content, metadata (Dublin Core and custom fields), and collection structure, as well as the issues that come with these methods. Using 9,351 collections from Archive-It as data, Shawn explained the concept of growth curves for collections, which compare seed count, memento count, and memento-datetime. Using different classifiers, Shawn showed that the structural features of a collection can be used to predict its semantic category, with Random Forest found to be the best classifier.

The Many Shapes of Archive-It from Shawn Jones

Following lunch, I headed to the amphitheater to see Dragan Espenschied's short paper presentation Fencing Apparently Infinite Objects. Dragan questioned how objects, synonymous here with a file or a collection of files, are bound in digital preservation. He introduced the concept of "performative boundaries" to describe the different potentials of an object: bound, blurry, and boundless. He illustrated these with software examples such as early 2000s Microsoft Word (bound), Apple's QuickTime (blurry), and Instagram (boundless). He shared productive approaches for future replay of these objects:

Emulation of auxiliary machines
Synthetic stub services or simulations
Capture network traffic and re-enact on access

Dragan Espenschied presenting on Apparently Infinite Objects

#truth #iPres2018 pic.twitter.com/OneOqRGEBh
— Kelly Stewart (@kellyannewithe) September 25, 2018

The next presentation was Digital Preservation in Contemporary Visual Art Workflows by Laura Molloy, who presented remotely. Her presentation argued that digital preservation of one's own work is not usually part of what is taught at art school, and that it should be. Digital technologies are widely used today for creating art in a variety of formats. When asked about digital preservation, various artists answered:

“It’s not the kind of thing that gets taught in art school, is it?”

“You don’t need to be trained in [using and preserving digital objects]. It’s got to be instinctive and you just need to keep it very simple. Those technical things are invented by IT guys who don’t have any imagination.”

Artist’s workflow diagrams in @LM_HATII ‘s presentation — ties back into the workflow diagram workshop yesterday #ipres2018 pic.twitter.com/wXHSnVYA1L
— Ariel Weinberg (@arielweinberg) September 25, 2018

The third presentation was by Morgane Stricot for her short paper Open the museum’s gates to pirates: Hacking for the sake of digital art preservation. Morgane explained that software dependency is a large threat to digital art and that supporting media archaeology is required to preserve some forms of it. Backups of older operating systems (OS) on disks help avoid incompatibility issues. She also detailed how copyright-restricted systems, for example older versions of Mac OS, are difficult to find, and how many pirates as well as "harmless hackers" have cracks that give access to these OS environments, while some environments are unsalvageable.

Morgane Stricot presents "Open the museum’s gates to pirates: Hacking for the sake of digital art preservation." #ipres2018 discussing how we can still provide access to abandoned software pic.twitter.com/7azGr2veBP
— Shawn M. Jones (@shawnmjones) September 25, 2018

The final paper presentation was given by Claudia Roeck on her long paper Evaluation of preservation strategies for an interactive, software-based artwork with complex behavior using the case study Horizons (2008) by Geert Mul. Claudia explored different possible preservation strategies for software, such as reprogramming in a different programming language, migration, virtualization, and emulation, as well as the significant properties that determine which qualities one would want to preserve. She used Horizons as an example project to explore the use cases and determined that reprogramming was one of the options suitable for it. However, she stated that there was no clear winner for the best preservation strategy in the mid-term for the work.

#ipres2018 How do we preserve complex multimedia digital artwork? What do we preserve? Can we preserve the behavior and interactivity? What technical dependencies exist? How do we know that we have created the right environment to reproduce it faithfully? pic.twitter.com/ebzg4e3s17
— Shawn M. Jones (@shawnmjones) September 25, 2018

For the rest of the day, lightning talks were available to attendees, and the room became packed with viewers. Some of these talks introduced preservation games to be held the next day, such as Save my Bits, Obsolescence, Digital Preservation Storage Criteria Game, and more. Ilya, from Webrecorder, gave a lightning talk showing a demo of the new Auto Archiver prototype for Webrecorder.

Oh my goodness, the #ipres2018 ad hoc session is packed out!! First up: four talks on original #digipres graphics. pic.twitter.com/pkROmtKoLf
— Maureen Pennock (@mopennock) September 25, 2018

First #ipres2018 ad hoc session is a hit! Packed house, speakers screaming at the top of their lungs to make their visuals, games and projects heard! 🗣️ 🔈 👏 pic.twitter.com/CvdcgQgkAI
— Erwin Verbruggen (@erwinverb) September 25, 2018

#ipres2018 @IlyaKreymer is presenting “Preservation of Scalar-based works” where he uses @webrecorder_io along with a custom prototype system to automate the capture of websites published with the Scalar publishing platform pic.twitter.com/JC39FijLy8
— Shawn M. Jones (@shawnmjones) September 25, 2018

After the proceedings another fantastic reception was held, this time at the Harvard Art Museum.

Harvard Art Museum at night

Day 3 (September 26, 2018): Minute Madness, Poster Sessions, and Awards

This day opened with a review of iPRES's achievements and challenges over the past 15 years in a panel discussion composed of William Kilbride, Eld Zierau, Cal Lee, and Barbara Sierman. Achievements included the innovation of new research as well as the courage to share and collaborate among peers with similar research. This led to iPRES's adoption of cross-domain preservation in libraries, archives, and digital art. Some of the challenges include archivists having to decide what to do with past and future data, and conforming to the OAIS standard.

@EldZierau says the building bridges is a key component of what has been happening during the past 15 years #ipres2018 pic.twitter.com/X7YLlk5gfD
— Kari Smith (@karirene69) September 26, 2018

After talking about the past 15 years, it was time to talk about the next 15 years in a panel discussion composed of Corey Davis, Micky Lindlar, Sally Vermaaten, and Paul Wheatley. This panel discussed what would make it possible for more people to attend in the future. They discussed possible organizational models to emulate for regional meetings, such as code4lib and NDSR. There were also suggestions for updates to the Code of Conduct and discussion of the value it should hold going forward.

@Educopia Jessica Meyerson kicking off the Looking Ahead to iPRES 30 panel #iPRES2018 pic.twitter.com/nmUMb9dbou
— sam meister (@samalanmeister) September 26, 2018

After the discussion panels, it was time for minute madness. I had seen videos of this before, but this was the first time I had seen it in person. I found it somewhat theatrical. Most presenters had a minute to pitch their research so that we would later visit them during the poster session, while some of them put on a show, like Remco van Veenendaal. The topics ranged from workflow integration and new portals for preserving digital materials to code ethics and timelines detailing file formats.

Time for the minute madness. #iPres2018 join @RvanVeenendaal and delve deeper into the world of significant significant properties pic.twitter.com/WN4bnM0hSN
— pepijn lucker (@pepijnlucker) September 26, 2018

The best tag line so far in #ipres2018 minute madness is: "we built a time machine", related to the poster "Time travel with PRONOM - "The fourth dimension of DROID" by @MickyLindlar and @YvonneTunnat. I also like the use of #LEGO. pic.twitter.com/rt4FqMF9WY
— Shawn M. Jones (@shawnmjones) September 26, 2018

After the minute madness, attendees wandered around to view the posters. The first poster I visited conveniently referenced work from our WSDL group!

#ipres2018 “Persistent Web Identifier (for web archives)” including metadata for web collections and also mentions Memento support, presented by Eld Zierau of the Royal Danish Library https://t.co/7uxyNZxvFr pic.twitter.com/qeHX0EM13k
— Shawn M. Jones (@shawnmjones) September 26, 2018

Another interesting poster consisted of research into file format usage over time.

@NKrabben file format usage over time based on NYPL digital collections, last modified dates #ipres2018 pic.twitter.com/GEjx0QGqke
— Brent M. West (@BrentMWestCU) September 26, 2018

I was also surprised at the number of tools and technologies behind some of the new preservation platforms for government agencies that had emerged, like Vitam, the French government IT program for digital archiving.

Vitam poster presentation for their digital archiving architecture

Following the poster sessions, I went back to paper presentations, where Tomasz Miksa presented his long paper Defining requirements for machine-actionable data management plans. This talk covered machine-actionable data management plans (maDMPs), which represent living documents that are automated by information collection systems and notification systems. He showed how data management systems in their current formats could be transformed to reuse existing standards such as Dublin Core and PREMIS.

#ipres2018 @miksa_tomasz is presenting "Defining requirements for machine-actionable data management plans" as part of the "Machines in Action" section chaired by @alizaleventhal pic.twitter.com/49bqjG7OSZ
— Shawn M. Jones (@shawnmjones) September 26, 2018

Alex Green then went on to present her short paper Using blockchain to engender trust in public digital archives. She explained that archivists alter, migrate, normalize, and otherwise change digital files, yet there is little proof that a researcher receives an authentic copy of a digital file. The ARCHANGEL project proposed using blockchain to verify the integrity of these files and their provenance. It is still unknown whether blockchain will prevail as a lasting technology, as it is still very new. David Rosenthal wrote a review of this paper, which can be found on his blog.
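The integrity idea behind this kind of proposal can be illustrated with a tiny hash-chained ledger. This is only a sketch under my own assumptions, not ARCHANGEL's design: there is no distributed consensus here, and every name in it is hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def file_fingerprint(data):
    """SHA-256 fingerprint of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def _hash_record(record):
    """Hash the fields of a ledger record, excluding its own record_hash."""
    payload = {key: record[key] for key in ("file_id", "fingerprint", "previous_hash")}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain, file_id, data):
    """Append a fingerprint for file_id, linked to the previous record."""
    previous_hash = chain[-1]["record_hash"] if chain else GENESIS_HASH
    record = {
        "file_id": file_id,
        "fingerprint": file_fingerprint(data),
        "previous_hash": previous_hash,
    }
    record["record_hash"] = _hash_record(record)
    chain.append(record)
    return record


def chain_is_intact(chain):
    """Check that no record was altered and that every link still matches."""
    previous_hash = GENESIS_HASH
    for record in chain:
        if record["previous_hash"] != previous_hash:
            return False
        if record["record_hash"] != _hash_record(record):
            return False
        previous_hash = record["record_hash"]
    return True


def is_authentic(chain, file_id, data):
    """True if the ledger is intact and data matches the latest fingerprint."""
    if not chain_is_intact(chain):
        return False
    matches = [record for record in chain if record["file_id"] == file_id]
    return bool(matches) and matches[-1]["fingerprint"] == file_fingerprint(data)
```

A researcher holding a copy of a file could call `is_authentic` with the archive's ledger. Any change to the file, or to an earlier ledger record, makes the check fail.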

#ipres2018 Alex Green presents "Using blockchain to engender trust in public digital archives" pic.twitter.com/ufWLBt2Mau
— Shawn M. Jones (@shawnmjones) September 26, 2018

I then went on to the Storage Organization and Integrity session to see the long paper presentation Checksums on Modern Filesystems, or: On the virtuous consumption of CPU cycles by Alex Garnett and Mike Winter. The talk focused on computing checksums on files to guard against bit rot in digital objects and compared different approaches for verifying bit-level preservation. It showed that data integrity can be achieved when infrastructure, such as a filesystem like ZFS, is dedicated to digital preservation. This work builds a bridge between digital preservation practices and high-performance computing for detecting bit rot.
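For readers unfamiliar with fixity checking, the application-level version of this idea looks roughly like the sketch below. It is not the authors' code, and the manifest layout and function names are my own assumptions. It records a SHA-256 checksum per file and later reports any file whose checksum has changed or which has gone missing.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose checksum differs."""
    root = Path(root)
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures
```

Filesystems such as ZFS perform comparable checks at the block level, which is part of why the talk compared the two approaches.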

I thought hell would freeze over before the wonderful zfs would be serious discussed in the digital preservation world. 😉 Fi-na-lly! #ipres2018 pic.twitter.com/EsdYKXWIFo
— Dragan Espenschied (@despens) September 26, 2018

After this presentation, I stayed for the short paper presentation The Oxford Common File Layout by David Wilcox. The Oxford Common File Layout (OCFL) is an effort to define a shared approach to file hierarchy for long-term preservation. The goals of this layout are to provide structure at scale, to be ready for migrations while minimizing file transfers, and to be manageable by many different applications. The layout follows a set of defined principles, such as the ability to log transactions on digital objects, and there is a plan for a draft spec release sometime at the end of 2018.

David Wilcox introduces the #ipres2018 attendees to the Oxford Common File Layout or OCFL for short. Great to see so many in the room had heard of it already #DP0C pic.twitter.com/hKjlRncxjd
— James Mooney (@JamesMooneyUK) September 26, 2018

This day closed with the award ceremony for best poster, best short paper, and best long paper. My paper, Measuring News Similarity Across Ten U.S. News Sites, was nominated for best long paper but did not win. The winners were as follows:

Best short paper: PREMIS 3 OWL Ontology: Engaging Sets of Linked Data
Best long paper: The Rescue of the Danish Bits - A case study of the rescue of bits and how the digital preservation community supported it by Eld Zierau
Best poster award: Precise & Persistent Web Archive References by Eld Zierau

Day 4 (September 27, 2018): Conference Wrap-up

The final day of iPRES 2018 was composed of paper presentations, discussion panels, community discussions, and games. I chose to attend the paper presentations.

The first paper presentation I viewed was Between creators and keepers: How HNI builds its digital archive by Ania Molenda. Over 4 million documents were recorded to track progressive thinking in Dutch architecture. When converting and pushing these materials into a digital archive, many issues were observed, such as duplicate materials, file formats with complex dependencies, the time and effort needed to digitize the multitude of documents, and knowledge about accessing these documents being lost over time with no standards in place.

Between creators and keepers: How HNI builds its digital archive A short paper by: Ania Molenda #ipres2018 pic.twitter.com/DfNRSzvo02
— swabie (@swabie123) September 27, 2018

Afterwards, I watched the presentation on Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Abigail Adams. It covered the acquisition of three different floppy disk collections dating from 1977 to 1989! This presentation introduced me to some unfamiliar hardware, software, and encodings required for attempting to recover data from floppy disk media, as well as a workflow for data recovery from these floppies.

Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Walker Sampson, Abby Adams and Austin Roche #s401 #ipres2018 pic.twitter.com/lF7t0lhPM5
— Brent M. West (@BrentMWestCU) September 27, 2018

The last paper presentation I viewed was Email Preservation at Scale: Preliminary Findings Supporting the Use of Predictive Coding by Joanne Kaczmarek and Brent West. Having already been to the email preservation workshop, I was excited for this presentation, and I was not let down. Using 20 GB of publicly available emails, they applied two different methods, a capstone approach and a predictive coding approach, for discovering sensitive content inside emails. With the predictive coding approach, which uses machine learning to train on and then classify documents, they showed preliminary results suggesting that automatic classification is capable of handling emails at scale.

Using machine learning to appraise 4.7 million objects - no way a human could do that volume but do you trust the algorithm? #ipres2018 pic.twitter.com/SjMmFN1B1A
— Jon Tilbury (@dPreservation) September 27, 2018

As a final farewell, attendees were handed bags of tulip buds and told this:

"An Honorary Award will be presented to the people with the best tulip pictures."

It seems William Kilbride, among others, has already gotten a leg up on the competition.

Paging @iPres2018 and @ipres2019: the bulbs have arrived safely at Kilbride Towers. Now planted. Looking forward to a flower based tribute to European Unity in Spring 2019. #iPres2018 #ipres2019 cc @DigitaalErfgoed pic.twitter.com/eqdn3v9P7z
— William Kilbride (@WilliamKilbride) September 30, 2018

This marks the end of my first academic conference as well as my first visit to Boston, Massachusetts. It was an enjoyable experience with a lot of exposure to diverse research fields in digital preservation. I look forward to submitting work to this conference again and hearing about future research in the realm of digital preservation.

Resources for iPRES 2018:

Looking for all of the tweets used in this blog post? They can be found at this URL: https://twitter.com/search?f=tweets&vertical=default&q=%23ipres2018&src=typd
Collaborative Notes
Presentation materials and papers

Web Science and Digital Libraries Research Group