2021-07-14: Web Archiving Conference (WAC) 2021 Trip Report

This was our (Travis, Kritika, and Himarsha) first time attending IIPC’s Web Archiving Conference (WAC). Usually this conference is an in-person event (IIPC General Assembly and WAC trip reports: 2011, 2012, 2013, 2014, 2015, 2016, and 2017), but this year the conference was held online. Remo was the conference platform for this event. The presentations were pre-recorded and shown during the conference, and instead of live presentations for the submitted papers, there were Q&A sessions. The invited panels were delivered live during the event. When more presentation slides and videos become available we will add the links to the blog post. Archive-It also wrote a blog post for this conference.

Day One

Sessions 1 through 10 were held on the first day of the event. We attended the sessions that were scheduled during times that were ideal for North American time zones. The sessions we focused on for the first day are sessions 6, 8, 9, and 10.
Chase Dooley chaired session 6 and the topic for this session was Archiving Communities.
Hélène Brousseau (@HeleneBrousseau) from Artexte Information Centre presented “Contemporary Art Knowledge: A Community-based Approach to Web Archiving”. She discussed recent changes in publication practices in contemporary art. She also argued that the creator of a website should archive it for legacy purposes, because the creator knows which parts of the website are important. Conifer makes it easier for artists to archive their web publications and websites. Conifer is well suited for this, because it can archive the audio, visual, and interactive content included on this type of website. If an artist’s web publications and websites share similar web resources, then Memento Tracer (which was presented in session 10) may also be useful, because the artist could create one trace that could be applied to a whole class of similar web publications or websites.

Samantha Fritz (@SamVFritz) from University of Waterloo presented (slides) “Building Community Through Archives Unleashed Datathons”. She discussed some of the techniques the Archives Unleashed team used for community building and engagement. The Archives Unleashed team hosted a series of datathons (WS-DL trip reports for each datathon) to improve the participants’ skills and to build community around web archives. During the datathon, teams would be formed and each team worked on a project that involved data from a web archive collection.


From interviews with datathon participants, four major themes emerged: 

  1. Contributed to participants’ skill-building

  2. Exposed participants to interdisciplinary perspectives

  3. Fostered community formation

  4. Fostered a sense of belonging


Their community engagement model is a six-stage model.

  1. Scope: identify the problem or questions and identify the stakeholders that make up the community

  2. Inform: provide information to the community so that they can understand the current problems and possible solutions

  3. Consult: conduct an open dialog to gather feedback from the community 

  4. Involve: work with the community to ensure community concerns are considered

  5. Collaborate: foster interdisciplinary collaborations and establish partnerships

  6. Empower: encourage and support scholars in building the skills needed to work with web archives


Kristinn Sigurðsson (@kristsi) from National and University Library of Iceland chaired session 8 and the topic for this session was Modeling, Packaging, Reusing.


Sawood Alam (@ibnesayeed), an ODU WS-DL alumnus and Web and Data Scientist at the Internet Archive, presented (slides) his work on "Readying Web Archives to Consume and Leverage Web Bundles". Web Bundle/Packaging allows users to download a website in the form of a single bundle file. This bundle can be shared offline and loaded from any server as if the website were loaded from the original source. He described the potential of Web Bundles for addressing challenges in web archiving. He discussed enabling discovery, content negotiation, and ingestion of Web Bundles for archival purposes, which may enable more complete and coherent crawling. He also touched upon decomposing bundled HTTP exchanges for efficient storage and indexing, and dynamically generating bundles for composite mementos at replay time. This work has also been discussed in a DHSR blog post.
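The content negotiation idea can be illustrated with a small sketch: a client advertises that it prefers a Web Bundle via the Accept header and falls back to regular HTML otherwise. The application/webbundle media type comes from the Web Packaging drafts; whether any given server (example.org below is a placeholder) honors this negotiation is an assumption made purely for illustration.

```python
import requests

# Hypothetical sketch: ask a server (example.org is a placeholder) for a
# Web Bundle representation of a page, falling back to plain HTML.
# Server support for this negotiation is an assumption.
url = "https://example.org/article"
headers = {"Accept": "application/webbundle, text/html;q=0.8"}

response = requests.get(url, headers=headers)
content_type = response.headers.get("Content-Type", "")

if "application/webbundle" in content_type:
    # Save the bundle for later offline sharing or archival ingestion.
    with open("article.wbn", "wb") as f:
        f.write(response.content)
    print("Received a Web Bundle:", len(response.content), "bytes")
else:
    print("Server returned a regular representation:", content_type)
```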


Yinlin Chen (@Yinlin_Chen) from Virginia Tech Libraries delivered the second presentation of this session on “Analyzing WARC on Serverless Computing”. He presented a serverless architecture platform using AWS cloud-native services (AWS Lambda, Batch, ECS, etc.) that enables resilient, scalable, and cost-effective processing of WARC files. As a proof of concept, they tested the platform for scalability, cost, and performance using selected Common Crawl data stored in AWS S3. Their analysis shows that the platform processed large amounts of uncompressed data in a few minutes.
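A minimal sketch of this kind of serverless WARC processing, assuming an AWS Lambda function triggered by S3 object-created events and using the warcio library to iterate over records; the event layout shown and the choice of "count HTML responses" as the processing step are assumptions, not the presenters’ actual code.

```python
import boto3
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client("s3")

def handler(event, context):
    """Count HTML responses in a (possibly gzipped) WARC file stored in S3.

    Sketch only: assumes the Lambda is triggered by an S3 object-created event.
    """
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"]

    html_responses = 0
    for warc_record in ArchiveIterator(body):
        if warc_record.rec_type != "response":
            continue
        content_type = warc_record.http_headers.get_header("Content-Type", "")
        if "text/html" in content_type:
            html_responses += 1

    return {"bucket": bucket, "key": key, "html_responses": html_responses}
```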


Jefferson Bailey (@jefferson_bail) from the Internet Archive chaired session 9 and the presenters discussed the topic of Quality Assurance (QA) & Control. It was interesting to see how manual and automated quality control practices are adopted by web archives.

Brenda Reyes Ayala (@CamtheWicked) from University of Alberta presented “Detecting Quality Problems in Archived Websites Using Image Similarity”. She explained their method of improving QA using image similarity by comparing the appearance of the original website to that of the archived website. They created tools that generate screenshots of the live websites and their archived counterparts. They used image similarity metrics such as Structural Similarity Index (SSIM), the Mean Squared Error (MSE), and vector distance to identify the difference between the live and archived versions.
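A rough sketch of the comparison step she described, assuming the live and archived screenshots already exist on disk and were captured at the same viewport size; Pillow, NumPy, and scikit-image are used here as stand-ins for whatever tooling the authors built.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def load_gray(path, size=(1024, 768)):
    """Load a screenshot, convert to grayscale, and resize to a common size."""
    image = Image.open(path).convert("L").resize(size)
    return np.asarray(image, dtype=np.float64) / 255.0

live = load_gray("live_screenshot.png")          # screenshot of the live page
archived = load_gray("archived_screenshot.png")  # screenshot of the memento

mse = np.mean((live - archived) ** 2)
ssim_score = structural_similarity(live, archived, data_range=1.0)

print(f"MSE:  {mse:.4f}  (0 means identical)")
print(f"SSIM: {ssim_score:.4f} (1 means identical)")
```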

Patricia Klambauer (@pkbaclac) and Tom J. Smyth (@smythbound) from Library and Archives Canada presented “The Black Hole Of Quality Control: Toward a Framework for Managing QC Effort to Ensure Value”. They described the lessons learned from building the quality assurance and control framework at Library and Archives Canada. They also discussed key tools and techniques including: initial scoping of web collections, approaching QC as a “scrum” project, graphing QC technical complexity, etc.   
Jeffrey Van Der Hoeven (@jrvdhoeven), Ben O’Brien (@ob1_ben_ob), Hanna Koppelaar, Trienka Rohrbach (@trienka), Andrea Goethals (@AndreaGoethals), and Steve Knight presented “Improving the Quality of Web Harvests Using Web Curator Tool”. This work is a collaboration between the National Library of New Zealand and the National Library of the Netherlands. They talked about Web Curator Tool (WCT), an open source workflow management tool for selective web archiving that allows crawling websites and performing QA. They presented version 4 of WCT, which allows crawl patching using Webrecorder, screenshot generation, and integration with the pywb viewer.


Ben O'Brien from National Library of New Zealand chaired session 10 and the topic for this session was Archiving Frameworks & Tools.

Martin Klein (@mart1nkle1n) from Los Alamos National Laboratory and a WS-DL alumnus presented (slides) “Memento Tracer - An Innovative Approach Towards Balancing Web Archiving at Scale and Quality”. During this presentation he discussed Memento Tracer, a new approach that aims to balance capturing at scale with providing high-quality captures. A curator can create a trace, which is a set of actions that a crawler must perform in order to archive all of the content of a web resource. These traces can be shared with other users via a publicly accessible repository. Traces can help with archiving a class of similar resources, such as one trace for archiving GitHub repositories and another for SlideShare presentations. Webrecorder’s autopilot feature is similar to using a trace from Memento Tracer. Autopilot makes use of behaviors prepared by the Webrecorder team to simulate human-like interactions on the web page so that more content can be archived in an automated way. Using a Memento Tracer trace differs from Webrecorder’s autopilot because the user decides which automated interactions will be applied to a class of web resources.
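To make the idea of a trace concrete, here is a small sketch in which a trace is expressed as a list of browser actions and replayed with Selenium against every URI in a class of pages. The JSON structure, the CSS selectors, and the example URIs are purely illustrative assumptions; Memento Tracer’s actual trace format (captured via its browser extension) is richer than this.

```python
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

# An illustrative "trace": click every element matching a selector, e.g. to
# expand hidden sections on pages of the same class. This is NOT the real
# Memento Tracer trace format.
trace = json.loads("""
{
  "actions": [
    {"type": "click_all", "selector": "a.load-more"},
    {"type": "click_all", "selector": "button.show-details"}
  ]
}
""")

uris = ["https://example.org/project/1", "https://example.org/project/2"]

driver = webdriver.Chrome()
for uri in uris:
    driver.get(uri)
    for action in trace["actions"]:
        if action["type"] == "click_all":
            for element in driver.find_elements(By.CSS_SELECTOR, action["selector"]):
                element.click()
    # At this point a capture tool would record the fully expanded page.
driver.quit()
```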

Ilya Kreymer (@IlyaKreymer) from Webrecorder and Humbert Hardy from the National Film Board of Canada presented “Not Gone in a Flash! Developing a Flash-capable remote browser emulation system”. Webrecorder and the National Film Board (NFB) of Canada worked together to update a remote browser that can support Flash. During the presentation, different approaches for streaming the audio and video of remote browsers were mentioned. VNC was used for streaming the video. The audio could be streamed with WebRTC or WebSocket. The ideal combination is VNC video + WebRTC audio, because it has the lowest audio lag. Different approaches for emulating Flash were also mentioned, such as using Ruffle, a remote browser, or in-browser emulators. A remote browser named “pywb-remote-browsers” created by Webrecorder can be used to replay web pages that have Flash content. Chrome 84 is the latest version of their remote browser that supports Flash.

Mat Kelly (@machawk1) from Drexel University and a WS-DL alumnus presented (slides) “WASAPIfying Private Web Archiving Tools for Persistence and Collaboration”. He described WASAPI and some of his tools that can make use of it: the Web Archiving Integration Layer (WAIL) and InterPlanetary Wayback (ipwb). WASAPI is a framework for transmitting WARCs using a standard API. So far, only Archive-It and Webrecorder have implemented the server-side component of the framework. WAIL has a feature similar to the Wayback Machine’s Save Page Now, except the files are stored locally on the user’s computer. InterPlanetary Wayback can be used to create a distributed personal web archive by disseminating WARC files into the IPFS network.
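As an illustration of the WASAPI data-transfer API, the sketch below lists the WARC files for one collection from Archive-It’s WASAPI endpoint and downloads them. It assumes valid Archive-It partner credentials and a collection ID (both placeholders here), and the response fields follow my reading of the published WASAPI specification, so verify them against the live API.

```python
import requests

# Sketch of a WASAPI client against Archive-It's endpoint; the credentials
# and COLLECTION_ID are placeholders, and access requires a partner account.
WASAPI_URL = "https://partner.archive-it.org/wasapi/v1/webdata"
AUTH = ("my-username", "my-password")   # placeholder credentials
COLLECTION_ID = 12345                   # placeholder collection

response = requests.get(WASAPI_URL, auth=AUTH, params={"collection": COLLECTION_ID})
response.raise_for_status()
result = response.json()

for webdata_file in result.get("files", []):
    # Each file entry lists one or more download locations and its checksums.
    location = webdata_file["locations"][0]
    print("Downloading", webdata_file["filename"])
    with requests.get(location, auth=AUTH, stream=True) as warc:
        with open(webdata_file["filename"], "wb") as out:
            for chunk in warc.iter_content(chunk_size=1 << 20):
                out.write(chunk)
```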

Day Two

The sessions for the last day of the conference were sessions 11 through 19. We focused on sessions 14, 15, 16, and 19.

Sawood Alam chaired session 14 and the topic for this session was Non-traditional Archives.

Anne Helmond (@silvertje) from University of Amsterdam presented "Platform and app histories: Assessing source availability in web archives and app repositories", where she discussed the difficulties of archiving platforms and apps. Their continuous updates make platforms and apps difficult to archive: each update overwrites the previous version of an app or platform, so some of its history is overwritten as well. She also described how researchers can use web archives and software repositories to reconstruct platform and app histories. For apps, web archives are used to get metadata and app repositories are used to get previous versions of the app. She then described an analysis of how well various platforms and apps are represented across web archives. For this analysis, the Memento Time Travel API was used via MemGator to retrieve URLs for social media platforms and app detail pages from all web archives that support the Memento protocol. After the URLs were gathered, three dimensions were used to analyze how well a platform or app is archived.


Three dimensions for analyzing how well a platform or app is archived (a rough computation sketch follows this list):

  1. Volume of availability which is the number of mementos held

  2. Depth of availability which is the number of days, months, or years between the first and last mementos

  3. Breadth of availability which is the number of archives holding the mementos
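
A rough sketch of how these three dimensions could be computed from a TimeMap returned by a MemGator instance. It assumes the public aggregator at memgator.cs.odu.edu and a particular shape for its JSON TimeMap (a "mementos" object containing a "list" of entries with "uri" and "datetime" fields), so adjust the parsing if the response format differs.

```python
from datetime import datetime
from urllib.parse import urlsplit

import requests

# Assumed public MemGator instance and JSON TimeMap layout; verify both
# before relying on this sketch.
MEMGATOR = "https://memgator.cs.odu.edu/timemap/json/"
uri_r = "https://example.com/"

timemap = requests.get(MEMGATOR + uri_r).json()
mementos = timemap.get("mementos", {}).get("list", [])

# 1. Volume of availability: total number of mementos held.
volume = len(mementos)

# 2. Depth of availability: days between the first and last memento
#    (datetime format assumed to be ISO 8601 with a trailing Z).
datetimes = sorted(datetime.strptime(m["datetime"], "%Y-%m-%dT%H:%M:%SZ")
                   for m in mementos)
depth_days = (datetimes[-1] - datetimes[0]).days if datetimes else 0

# 3. Breadth of availability: number of distinct archives holding mementos,
#    approximated here by the hostname of each memento URI.
breadth = len({urlsplit(m["uri"]).hostname for m in mementos})

print(volume, depth_days, breadth)
```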

 
Michael Kurzmeier (@MKRZMR) from Maynooth University presented “Website Defacements: Finding Hacktivism in Web Archives”, where he discussed defaced websites that have been hacked and altered. Defaced websites are ephemeral web resources, because they are usually restored quickly. During the presentation he also mentioned that there are community-maintained cybercrime archives that have archived defaced websites. Most of these cybercrime archives are in a suspended state and do not accept new submissions. Some tools that could help with detecting website defacements are the Off-Topic Memento Toolkit and Timemap Visualization (TMVis).


Yves Maurer (@yvesmaurer) from National Library of Luxembourg chaired session 15 and the topic for this session was Archive Profile Summarization.

For the first presentation, Sawood Alam introduced the MementoMap framework for summarizing archival holdings. He started the presentation by introducing the MemGator service, a tool that he built himself. MemGator aggregates TimeMaps across different public web archives by broadcasting a lookup request to all known web archives. He explained that selectively polling the archives that are likely to return good results is a more efficient alternative to the wasteful and problematic nature of broadcasting. The proposed solution is to profile web archives using the MementoMap framework. Sawood talked about the main components of this framework: Ingestion, Summarization and Serialization, and Memento Routing. Ingestion is where you learn about an archive through CDX datasets, access logs, etc. The second component focuses on summarizing the findings about archival holdings, which is the output of this framework. The final component is where the framework is put to use by integrating that output with Memento aggregators. He also announced that the MementoMap framework and its components are now ready for adoption by web archives.
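The routing idea can be illustrated with a toy profile lookup: before broadcasting a request, an aggregator checks whether a URI falls under any SURT-style prefix that an archive has claimed to hold. The profile entries and frequencies below are made up, and this is a deliberate simplification of the actual MementoMap serialization and lookup algorithm.

```python
from urllib.parse import urlsplit

def surt_key(uri):
    """Produce a simplified SURT-style key, e.g. 'org,example)/path'."""
    parts = urlsplit(uri)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"{host}){parts.path or '/'}"

# Toy archive profile: SURT prefixes an archive claims to hold, with rough
# frequencies. Real MementoMaps use a richer wildcard syntax and file format.
profile = {
    "uk,ac,": 120000,
    "is,": 950000,
    "org,example)/blog": 40,
}

def archive_might_hold(uri, profile):
    key = surt_key(uri)
    return any(key.startswith(prefix) for prefix in profile)

# Only route a lookup to this archive if the profile suggests a match.
print(archive_might_hold("https://example.org/blog/post-1", profile))  # True
print(archive_might_hold("https://unrelated.net/page", profile))       # False
```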


Shawn M. Jones (@shawnmjones) of Los Alamos National Laboratory and WS-DL presented (slides) MementoEmbed and Raintale, two software components of the Dark and Stormy Archives (@StormyArchives) project. He started his presentation by explaining different use cases for storytelling with web archives. MementoEmbed is an archive-aware surrogate service (social card or thumbnail) for summarizing an individual memento. A user of Raintale can tell a story with the social cards generated by MementoEmbed. Raintale takes a list of mementos as input (these mementos can be chosen manually by the user or generated by Hypercane) and creates stories in the form of HTML or Markdown, or can even publish them directly to services like Twitter. He closed with the current state and future of these tools: the Dark and Stormy Archives toolkit is currently being tested with the National Library of Australia, and users can expect a GUI for Raintale in the future.
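For example, a social card for a single memento might be requested from a locally running MementoEmbed instance roughly as follows; the port and service path reflect my reading of the MementoEmbed documentation and should be verified against the version you run, and the memento URI is a placeholder.

```python
import requests

# Assumes MementoEmbed is running locally (e.g. via its Docker image) on
# port 5550; verify the port and endpoint path for your installation.
MEMENTOEMBED = "http://localhost:5550/services/memento/socialcard/"
urim = "https://web.archive.org/web/20210101000000/https://example.com/"

response = requests.get(MEMENTOEMBED + urim)
response.raise_for_status()

# The service returns an HTML snippet that can be embedded in a story page.
with open("socialcard.html", "w") as f:
    f.write(response.text)
```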
Barbara Signori and Kai Jauslin (@kjauslin) from the Swiss National Library presented “Interactive Collage of Websites: A Deep Dive Into the Web Archive Switzerland”. During this presentation they discussed the redesign of the Swiss National Library’s web archive. Their web archive is now more visual and has a collage that shows many snapshots at once. Each snapshot is a screenshot of the start page of an archived web page. The collage is interactive and allows users to click and zoom into certain sections of the collage. If a snapshot is selected, an authorized user can view the archived web page and switch between earlier and later versions of the page. The collage also supports full-text search; snapshots matching a search are highlighted while the other snapshots are masked out. Some of the tools and frameworks used for the collage are pywb, OpenWayback, Puppeteer, and Vue.js. Another feature is that the user can switch between pywb and OpenWayback when viewing a snapshot.
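The collage’s snapshots are essentially start-page screenshots; a comparable thumbnail could be generated with a headless browser, sketched here using Playwright for Python instead of the Puppeteer setup the Swiss National Library used. The replay URL and viewport size are placeholders.

```python
from playwright.sync_api import sync_playwright

# Placeholder replay URL of an archived start page; the library's actual
# pipeline uses Puppeteer, so this Playwright sketch is only a stand-in.
urim = "https://example-archive.org/wayback/20200101000000/https://example.ch/"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 960})
    page.goto(urim, wait_until="networkidle")
    page.screenshot(path="snapshot.png")
    browser.close()
```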


Samantha Abrams (@smbrms) from University of Wisconsin-Madison chaired session 16 and the topic for this session was Research Into Archives.

Katie Mackinnon (@ktcmackinnon) from University of Toronto presented “Ethical Approaches to Researching Youth Cultures in Historical Web Archives”. During her presentation she discussed some of the ethical issues with using web archives for research, including the “Right to be Forgotten” and privacy issues, especially for young people. Data is often not handled in a privacy-preserving way; for example, consent is not obtained from users and children’s rights are not taken into account.

Samantha Fritz (@SamVFritz) from University of Waterloo presented (slides) “Accessible Web Archives: Rethinking and Designing Usable Infrastructure for Sustainable Research Platforms”, where she discussed some of the accessibility barriers that exist when working with web archives. One barrier is the skill set usually required: working with large amounts of data, high-performance computing, and the command line. One of the goals of the Archives Unleashed Project is to make it easier to work with web archive data so that researchers do not need all of these skills. To achieve this goal, the Archives Unleashed team created several tools that assist others with exploring and analyzing web archives, including the Archives Unleashed Cloud and the Archives Unleashed Toolkit. WS-DL member Travis Reid has written blog posts about using the Archives Unleashed Cloud (Working With Archives Unleashed Cloud) and the Archives Unleashed Toolkit (Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane). The Archives Unleashed Cloud has recently shut down, but its functionality will later be integrated with Archive-It.


Day 2 of WAC came to an end with the invited panel session on Trust In Web Archiving. The session was chaired by Martin Klein (Chair of the WAC Programme Committee) and Yves Maurer. The panel of speakers were Michael L. Nelson (@phonedude_mln), Ada Lerner (@AdaLerner), Takuya Watanabe (@twatanabe1203), and Clare Stanton (@clare__stanton). The panel discussed why users of web archives should not blindly trust what is present in the archives.

Michael L. Nelson talked about his vision for trust in Web Archiving in 2025, which is to have “hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives”.


During the presentation (which is embedded below) he describes some of the current issues with web archives that need to be addressed in order to achieve this vision:

  • Hundreds: The number of archives may not continue to grow because there are “innumerable examples that point toward centralization / consolidation”.

  • Publicly available: Around two-thirds of web traffic is not publicly archivable based on a previous estimate.

  • Independent: Suggested having multiple independent observations, because copies archived at the same time can differ based on factors like GeoIP, personalization, and CDN status.

  • Interoperable: “Homogeneity is not true interoperability” and “true interoperability comes through the hard work of protocols and standards".

  • Robust: Discussed some of the recent works related to robustness and malicious .html and .js files in web archives. Links to the related works are included on slides 18 and 19.

  • Auditable: Discussed some issues with auditing web pages from web archives: replaying the same archived web page can produce different results, conventional fixity-based approaches do not work, and we cannot depend on a web archive itself for fixity (“Where did the archive go?” series: parts 1, 2, 3, and 4) since the archive can change or die.

  • Cooperating: Mentioned that APIs are necessary but not sufficient and that we need to be able to preserve and audit data like WARC and HAR as rendered through software like pywb.

  • Web: Showed images of different apps that could be web applications. Anne Helmond’s presentation from session 14 also discusses the issues with archiving apps. 

  • Archive: We must also accommodate any system that supports rehosting/revisions.

 

During Ada Lerner’s presentation she described an attack where the attacker could replace images and text on an archived web page with newer content. This type of attack can occur when the archived web page includes live web resources and the attacker owns the domain associated with those live resources. In her paper, this attack is referred to as Archive-Escape Abuse. Justin Brunelle’s (@justinfbrunelle) previous work has also identified examples of leakage in web archives, where older archived web pages included recent live content. The Archive-Escape Abuse attack exploits one of the three Wayback Machine vulnerabilities mentioned in Ada Lerner’s paper.


The three types of Wayback Machine vulnerabilities identified in her paper: 

  1. The first type of vulnerability occurs when live web content is included on an archived web page, which allows an attacker to change the content if they own the domain associated with the live web content. Archive-Escape Abuse can exploit this vulnerability.

  2. The second type of vulnerability is caused by the lack of same-origin policy enforcement, which can result in different sources interfering with each other. Since the Wayback Machine serves all archived content from a single origin, the browser cannot enforce the same-origin policy, which allows third parties included in <iframe>s to modify data on the main web page. To exploit this vulnerability, the attacker must include their payload in an <iframe> before the web page is archived.

  3. The third type of vulnerability occurs because the Wayback Machine uses nearest-neighbor timestamp matching which is an issue when a resource is not successfully archived. If the attacker knows that a resource is missing then it is possible in some cases for a malicious payload to be used for the missing resource. To perform this attack, the attacker must be the owner of the domain for the missing resource and the malicious payload must be the only successfully archived version of the resource.


During Takuya Watanabe’s presentation, he discussed five types of attacks that can target users of web rehosting services: persistent man-in-the-middle attacks, abusing privileges to access various resources, stealing credentials, stealing browser history, and session hijacking/injection.


The Wayback Machine is vulnerable to three of the attacks mentioned in his paper:

  1. Persistent man-in-the-middle (MITM) attacks that exploit the application cache (AppCache), which can result in the browser being compromised and sensitive information being leaked to the attacker.

  2. Privilege abuse to access the user’s resources such as their camera, microphone, or GPS. This attack relies on a permission prompt that asks the user to allow web.archive.org to use these resources.

  3. History theft (stealing browser history) by exploiting localStorage to get data that can be used to fingerprint visited websites.


Clare Stanton’s presentation was about Perma.cc. Perma.cc is a solution for legal citation that was developed at Harvard Law School. Perma.cc can be used by scholars, courts, and others to create permanent records of web pages that they cite. She also discussed how Perma.cc is different from other web archives.


In her previous work, she listed some differences between Perma.cc and Archive-It:

  • A Perma.cc record is a high-fidelity web capture that allows the user to click through images, view animations, and scroll through the content

  • The users don’t have to host or store the archived web pages, because the web pages are added to Harvard’s collection

  • The users have control over the privacy of the records and can decide when to make the records public


IIPC’s WAC proved very valuable for our research. It was wonderful to hear about the great research happening across the many domains of web archiving. The conference touched upon multiple topics such as:

 

  • Web Archives for preserving important events, contemporary arts, apps, defaced websites, etc.

  • Strengthening the web archiving community through Datathons

  • Using emerging technologies such as Web Bundle and AWS cloud-native services with web archives

  • Improving the QA & control in web archives by automating the process

  • Different tools built by researchers to support web archives

  • Summarizing large web archives

  • Ethical practices while archiving

  • Making web archives more accessible to researchers

  • Security vulnerabilities in web archives

 

The IIPC conference ended on a high note with all the presenters and attendees networking in Remo. We got to communicate with various researchers and understand their work in the web archiving field on both days. Also, the Web Science & Digital Libraries Research Group (@WebSciDL) members did not forget to stop for a picture at the end of Day 2.

--
Travis Reid (@TReid803), Kritika Garg (@kritika_garg), and Himarsha Jayanetti (@HimarshaJ)
