This was the first time that we (Travis, Kritika, and Himarsha) attended IIPC’s Web Archiving Conference (WAC). The conference is usually held in person (IIPC General Assembly and WAC trip reports: 2011, 2012, 2013, 2014, 2015, 2016, and 2017), but this year it was held online using Remo as the conference platform. The presentations were pre-recorded and shown during the conference, and the live sessions for submitted papers were Q&A sessions instead of presentations. The invited panels were delivered live during the event. We will add links to this blog post as more presentation slides and videos become available. Archive-It also wrote a blog post about this conference.

Day One

Sessions 1 through 10 were held on the first day of the event. We attended the sessions that were scheduled during times that were ideal for North American time zones. The sessions we focused on for the first day are sessions 6, 8, 9, and 10.

Chase Dooley chaired session 6 and the topic for this session was Archiving Communities.

Hélène Brousseau (@HeleneBrousseau) from Artexte Information Centre presented “Contemporary Art Knowledge: A Community-based Approach to Web Archiving”. She discussed the recent changes in publication practices for contemporary arts. She also mentioned that the creator of a website should archive it for legacy purposes, because the creator should know which parts of the website are important. Conifer makes it easier for artists to archive their web publications and websites, and it is well suited to this task because it can archive the audio, visual, and interactive content found on this type of website. If an artist’s web publications and websites contain similar web resources, then Memento Tracer (which was presented in session 10) may also be useful, because the artist could create one trace that could be used on a class of similar web publications or websites.

Samantha Fritz (@SamVFritz) from University of Waterloo presented (slides) “Building Community Through Archives Unleashed Datathons”. She discussed some of the techniques the Archives Unleashed team used for community building and engagement. The Archives Unleashed team hosted a series of datathons (WS-DL trip reports for each datathon) to improve the participants’ skills and to build community around web archives. During the datathon, teams would be formed and each team worked on a project that involved data from a web archive collection.

From interviews with datathon participants, there were four major themes:

Contributed to skill-building
Exposed participants to interdisciplinary perspectives
Fostered community formation
Fostered a sense of belonging

Their community engagement model is a six-stage model.

Scope: identify the problem or questions and identify the stakeholders that make up the community
Inform: provide information to the community so that they can understand the current problems and possible solutions
Consult: conduct an open dialog to gather feedback from the community
Involve: work with the community to ensure community concerns are considered
Collaborate: foster interdisciplinary collaborations and establish partnerships
Empower: encourage and support scholars in building the skills needed to work with web archives

Kristinn Sigurðsson (@kristsi) from National and University Library of Iceland chaired session 8 and the topic for this session was Modeling, Packaging, Reusing.

Now is #iipcWAC21 session 8: "Modeling, Packaging, Re-Using", chaired by @kristsi with work from
* @ibnesayeed @weiglemc @phonedude_mln @mart1nkle1n @hvdsomp "Readying Web Archives to Consume and Leverage Web Bundles"
* @Yinlin_Chen "Analyzing WARC on serverless computing" pic.twitter.com/uS9fNGZdCc
— Shawn M. Jones (@shawnmjones) June 15, 2021

Sawood Alam (@ibnesayeed), an ODU WS-DL alumnus and Web and Data Scientist at the Internet Archive presented (slides) his work on "Readying Web Archives to Consume and Leverage Web Bundles". Web Bundle/Packaging allows users to download a website in the form of a single bundle file. This bundle can be shared offline and loaded from any server as if the website is loaded from the original source. He described the potential of Web Bundle in addressing the challenges related to web archiving. He discussed enabling discovery, content negotiation, and ingestion of Web Bundles for archival purposes which may provide complete and more coherent crawling. He also touched upon decomposing bundled HTTP Exchanges for efficient storage, indexing and dynamically generating bundles for composite mementos at replay. This work has also been discussed in a DHSR blog post.

.@ibnesayeed of @internetarchive @WebSciDL presenting:

"Readying Web Archives to Consume and Leverage Web Bundles"

video: https://t.co/hntGvs7Oz7
slides: https://t.co/nN4PLNiAgF
report: https://t.co/lS6x7AAL8a #iipcWAC21 early #WebArchiveWednesday pic.twitter.com/5dLPqsxLe1
— Michael L. Nelson (@phonedude_mln) June 15, 2021

Yinlin Chen (@Yinlin_Chen) from Virginia Tech Libraries delivered the second presentation of this session, “Analyzing WARC on Serverless Computing”. He presented a serverless architecture platform using AWS cloud-native services (AWS Lambda, Batch, ECS, etc.) that enables resilient, scalable, and cost-effective processing of WARC files. As a proof of concept, they tested the platform for scalability, cost, and performance using selected Common Crawl data stored in Amazon S3. Their analysis shows that the platform processed large uncompressed datasets in a few minutes.

"Analyzing WARC on serverless computing" is by @Yinlin_Chen from @virginia_tech . More information on his work is available at https://t.co/hJ5a94k37m
— Shawn M. Jones (@shawnmjones) June 15, 2021

Jefferson Bailey (@jefferson_bail) from the Internet Archive chaired session 9 and the presenters discussed the topic of Quality Assurance (QA) & Control. It was interesting to see how manual and automated quality control practices are adopted by web archives.

Brenda Reyes Ayala (@CamtheWicked) from University of Alberta presented “Detecting Quality Problems in Archived Websites Using Image Similarity”. She explained their method of improving QA using image similarity by comparing the appearance of the original website to that of the archived website. They created tools that generate screenshots of the live websites and their archived counterparts. They used image similarity metrics such as Structural Similarity Index (SSIM), the Mean Squared Error (MSE), and vector distance to identify the difference between the live and archived versions.
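
As a rough illustration of the comparison described above, the following Python sketch computes two of the named metrics, MSE and a Euclidean vector distance, on tiny grayscale images represented as nested lists. The function names and toy pixel values are hypothetical and are not taken from the presenters' code, and SSIM is left out to keep the example short.

```python
import math

def mean_squared_error(live, archived):
    """Average squared pixel difference between two same-sized grayscale images."""
    if len(live) != len(archived) or any(len(a) != len(b) for a, b in zip(live, archived)):
        raise ValueError("screenshots must have the same dimensions")
    diffs = [(p - q) ** 2
             for row_live, row_archived in zip(live, archived)
             for p, q in zip(row_live, row_archived)]
    return sum(diffs) / len(diffs)

def vector_distance(live, archived):
    """Euclidean distance between the flattened pixel vectors of two images."""
    flat_live = [p for row in live for p in row]
    flat_archived = [p for row in archived for p in row]
    return math.dist(flat_live, flat_archived)

# Hypothetical 2x2 screenshots: one pixel differs, as if an image failed to replay.
live_shot = [[255, 255], [0, 0]]
archived_shot = [[255, 255], [0, 255]]

mse = mean_squared_error(live_shot, archived_shot)
distance = vector_distance(live_shot, archived_shot)
```

With both metrics, a value of zero means the two screenshots are pixel-identical, and larger values indicate a larger visual difference between the live and archived versions.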

@CamtheWicked is talking about comparing screenshots between the original and archived website to measure archive quality. #iipcWAC21

Code: https://t.co/G5ERGyNPL6
Slides: https://t.co/qhFY2YtSmE pic.twitter.com/dEUDL0jQPi
— Shawn M. Jones (@shawnmjones) June 15, 2021

Patricia Klambauer (@pkbaclac) and Tom J. Smyth (@smythbound) from Library and Archives Canada presented “The Black Hole Of Quality Control: Toward a Framework for Managing QC Effort to Ensure Value”. They described the lessons learned from building the quality assurance and control framework at Library and Archives Canada. They also discussed key tools and techniques including: initial scoping of web collections, approaching QC as a “scrum” project, graphing QC technical complexity, etc.

@pkbaclac from Library and Archives Canada is talking about the "Black Hole of quality control" and different collection measures. #iipcWAC21 https://t.co/5towTk9wVn pic.twitter.com/taQPAQGXJX
— Shawn M. Jones (@shawnmjones) June 15, 2021

Jeffrey Van Der Hoeven (@jrvdhoeven), Ben O’Brien (@ob1_ben_ob), Hanna Koppelaar, Trienka Rohrbach (@trienka), Andrea Goethals (@AndreaGoethals), and Steve Knight presented “Improving the Quality of Web Harvests Using Web Curator Tool”. This work is a collaboration by the National Library of New Zealand and the National Library of the Netherlands. They talked about Web Curator Tool (WCT), which is an open source workflow management tool for selective web archiving that allows crawling websites and performing QA. They presented version 4 of WCT, which allows crawl patching using Webrecorder, screenshot generation, and integration with the Pywb viewer.

@jrvdhoeven of @openpreserve is talking about "Improving the quality of web harvests using Web Curator Tool" #iipcWAC21 https://t.co/vJobV7l8sT pic.twitter.com/drTDFrmhAG
— Shawn M. Jones (@shawnmjones) June 15, 2021

Ben O'Brien from National Library of New Zealand chaired session 10 and the topic for this session was Archiving Frameworks & Tools.

#iipcWAC21 Now @ob1_ben_ob is chairing session 10: "Archiving Frameworks & Tools" with work by @mart1nkle1n @hvdsomp @IlyaKreymer @machawk1 pic.twitter.com/mcqEsi0CPn
— Shawn M. Jones (@shawnmjones) June 15, 2021

Martin Klein (@mart1nkle1n) from Los Alamos National Laboratory and a WS-DL alumnus presented (slides) “Memento Tracer - An Innovative Approach Towards Balancing Web Archiving at Scale and Quality”. During this presentation he discussed Memento Tracer, which is a new approach that aims to balance capturing at scale with providing high-quality captures. A curator can create a trace, which is the set of actions that the crawler needs to perform in order to archive all of the content needed for a web resource. These traces can be shared with other users via a publicly accessible repository. Traces can help with archiving resources of the same type, for example a trace for archiving a GitHub repository or a trace for a SlideShare presentation. Webrecorder’s autopilot feature is similar to using a trace from Memento Tracer. Autopilot makes use of behaviors that have been prepared by the Webrecorder team, which simulate human-like interactions on the web page so that more content can be archived in an automated way. Using a Memento Tracer trace is different from Webrecorder’s autopilot feature because the user can decide which automated interactions will be used on a class of web resources.

Now @mart1nkle1n is presenting Memento Tracer a tool that finds a balance between human scale web archiving and massive crawls. Record your trace and apply it to many pages on the same site. #iipcWAC21
* https://t.co/MpsDnTgyhY
* https://t.co/f8dWNs9WRZ
* https://t.co/9wBbLVUVON pic.twitter.com/vc2ABLKAeX
— Shawn M. Jones (@shawnmjones) June 15, 2021

Ilya Kreymer (@IlyaKreymer) from Webrecorder and Humbert Hardy from National Film Board of Canada presented “Not Gone in a Flash! Developing a Flash-capable remote browser emulation system”. Webrecorder and the National Film Board (NFB) of Canada worked together while updating a remote browser that can support Flash. During the presentation, different approaches for streaming the audio and video for remote browsers were mentioned. VNC was used for streaming the video. The audio could be streamed with WebRTC or WebSocket. The ideal combination is VNC video with WebRTC audio, because it has the lowest lag for audio. Different approaches for emulating Flash were also mentioned, like using Ruffle, a remote browser, or in-browser emulators. A remote browser named “pywb-remote-browsers” that was created by Webrecorder can be used to replay web pages that have Flash content. Chrome 84 is the latest version for their remote browser that supports Flash.

@IlyaKreymer is presenting "Not gone in a Flash! Developing a Flash-capable remote browser emulation system" - running older web sites in older browsers to rerun Flash sites. #iipcWAC21 https://t.co/KiAKrOFhrE pic.twitter.com/rG2yftCFzU
— Shawn M. Jones (@shawnmjones) June 15, 2021

Mat Kelly (@machawk1) from Drexel University and a WS-DL alumnus presented (slides) “WASAPIfying Private Web Archiving Tools for Persistence and Collaboration”. He described WASAPI and some of his tools that can make use of WASAPI, which are the Web Archiving Integration Layer (WAIL) and InterPlanetary Wayback (ipwb). WASAPI is a framework for transmitting WARCs using a standard API. Only Archive-It and Webrecorder have implemented the server side component of the framework. WAIL has a feature similar to Wayback Machine’s Save Page Now feature except the files are stored locally on the user's computer. InterPlanetary Wayback can be used to create a distributed personal web archive. InterPlanetary Wayback disseminates the WARC files into the IPFS network.

#iipcWAC21 @machawk1 from @DrexelUniv is presenting "WASAPIfying private web archiving tools for persistence and collaboration" - a proof of concept of desktop-based web archiving tools, see:https://t.co/FD75ULN4Ou https://t.co/SOaXY1JlEl https://t.co/U4xaxG21Vs pic.twitter.com/YcEGv7emKt
— Shawn M. Jones (@shawnmjones) June 15, 2021

#iipcWAC21 @ob1_ben_ob closes with a statement on the need for standards so that tools can work together. Here are some standards related to #webarchiving:
CDXJ: https://t.co/ulbTzM5276
WARC: https://t.co/ZU3DfxBiSu
Memento: https://t.co/zDWGUxUdr3
CDX: https://t.co/RokfRRcEZw pic.twitter.com/xB6PG7wn5d
— Shawn M. Jones (@shawnmjones) June 15, 2021

Day Two

The sessions for the last day of the conference were sessions 11 through 19. We focused on sessions 14, 15, 16, and 19.

Sawood Alam chaired session 14 and the topic for this session was Non-traditional Archives.

.@ibnesayeed of @internetarchive @WebSciDL is chairing #iipcWAC21 Day-2 session-14 panel discussion on non-traditional archives. @silvertje @fvandervlist @MKRZMR #webarchivewednesday pic.twitter.com/yrUVVZk6kV
— Himarsha Jayanetti (@HimarshaJ) June 16, 2021

Anne Helmond (@silvertje) from University of Amsterdam presented "Platform and app histories: Assessing source availability in web archives and app repositories" where she discussed the difficulties of archiving platforms and apps. The continuous updates for platforms and apps make them difficult to archive. Each update overwrites the previous version of an app or platform, which results in some of the history being overwritten as well. She also described how researchers can use web archives and software repositories to reconstruct platform and app histories. For apps, web archives are used to get metadata and app repositories are used to get previous versions of the app. She then described an analysis of how well various platforms and apps are represented across web archives. For this analysis, the Memento Time Travel API was used via MemGator to retrieve URLs for social media platforms and app detail pages from all web archives that support the Memento protocol. After the URLs were gathered, three dimensions were used to analyze how well a platform or app is archived.

Three dimensions for analyzing how well a platform or app is archived:

Volume of availability which is the number of mementos held
Depth of availability which is the number of days, months, or years between the first and last mementos
Breadth of availability which is the number of archives holding the mementos
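
To make the three dimensions concrete, here is a minimal Python sketch that computes them for a single URL. It assumes the mementos have already been gathered into a list of (archive name, capture datetime) pairs; that input format, the function name, and the sample data are hypothetical and are not taken from the presentation.

```python
from datetime import datetime

def availability(mementos):
    """Compute volume, depth, and breadth for one URL.

    mementos: list of (archive_name, capture_datetime) pairs (assumed format).
    """
    if not mementos:
        return {"volume": 0, "depth_days": 0, "breadth": 0}
    times = [captured for _, captured in mementos]
    return {
        # Volume: number of mementos held
        "volume": len(mementos),
        # Depth: time between the first and last mementos, here in days
        "depth_days": (max(times) - min(times)).days,
        # Breadth: number of distinct archives holding the mementos
        "breadth": len({archive for archive, _ in mementos}),
    }

# Made-up sample data for illustration only.
sample = [
    ("archive-a", datetime(2010, 1, 1)),
    ("archive-a", datetime(2015, 6, 1)),
    ("archive-b", datetime(2020, 1, 1)),
]
result = availability(sample)
```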

Michael Kurzmeier (@MKRZMR) from Maynooth University presented “Website Defacements: Finding Hacktivism in Web Archives” where he discussed defaced websites, which are websites that have been hacked and altered. Defaced websites are ephemeral web resources, because they are usually restored quickly. During the presentation he also mentioned that there are community-maintained cybercrime archives that have archived defaced websites. Most of these cybercrime archives are in a suspended state and do not accept new submissions. Some tools that could help with detecting website defacements are the Off-Topic Memento Toolkit and Timemap Visualization (TMVis).

re: detecting defacements, @shawnmjones's Off Topic Memento Toolkit would likely help.https://t.co/IcTJCYNzhP #iipcWAC21
— Michael L. Nelson (@phonedude_mln) June 16, 2021

TMVis can also be used to view significant changes in an archived webpage's content over time, highlighting not only page design updates, but also potentially periods of defacement or outage https://t.co/Cj61g1XYyG https://t.co/0l5o2ldnn2 #iipcWAC21 https://t.co/ydLYGjl5B6
— Michele Weigle (@weiglemc) June 16, 2021

Yves Maurer (@yvesmaurer) from National Library of Luxembourg chaired session 15 and the topic for this session was Archive Profile Summarization.

How to summarize the web archive?#IIPCWAC21
1. https://t.co/CvpZV4EQTe by @shawnmjones @WebSciDL
2. MementoMap (https://t.co/6OGUR4KasL) by @ibnesayeed @phonedude_mln
3. Interactive collage (https://t.co/JOlPcbgZjz) by Web Archive Switzerland.#WebArchiveWednesday pic.twitter.com/X1Nu3TOqU3
— kritika garg (@kritika_garg) June 16, 2021

For the first presentation, Sawood Alam introduced the MementoMap framework for summarizing archival holdings. He started the presentation by introducing the MemGator Service, which is a tool built by Sawood Alam himself. MemGator aggregates TimeMaps across different public web archives by broadcasting a lookup request to all known web archives. He introduced how selectively polling archives that are likely to return good results is a more efficient way to avoid the wasteful and problematic nature of broadcasting. The proposed solution is to profile web archives using the MementoMap framework. Sawood talked about the main components of this framework: Ingestion, Summarization and Serialization, and Memento Routing. Ingestion is where you learn about an archive through CDX datasets, access logs, etc. The second component of this framework focuses on summarizing the findings about archival holdings which is the output of this framework. The final component is where the framework is put to use by integrating the said output with memento aggregators. Also, he announced that the MementoMap framework and its components are now ready for adoption by Web Archives.
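
The routing idea can be sketched in a few lines of Python. This is only a simplified illustration: the profile structure below (archive name mapped to domain suffixes) is a hypothetical stand-in and is not the actual MementoMap format, and the archive names and the `candidate_archives` function are made up for this example.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-written profiles: archive name -> domain suffixes it is believed to hold.
PROFILES = {
    "archive-a": {"example.com", "gov"},
    "archive-b": {"example.org"},
    "archive-c": {"com"},
}

def candidate_archives(url, profiles=PROFILES):
    """Return only the archives whose profile covers the URL's host."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # All suffixes of the host, e.g. www.example.com -> www.example.com, example.com, com
    suffixes = {".".join(labels[i:]) for i in range(len(labels))}
    return sorted(name for name, held in profiles.items() if held & suffixes)

# An aggregator would poll only these archives instead of broadcasting to all of them.
targets = candidate_archives("https://www.example.com/page")
```

In this toy setup a lookup for `www.example.com` is routed to two of the three archives, and a URL that matches no profile is routed to none, which is the saving that profiling is meant to provide over broadcasting.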

Shawn M. Jones (@shawnmjones) of Los Alamos National Laboratory and WS-DL presented (slides) MementoEmbed and Raintale, two software components of the Dark and Stormy Archives (@StormyArchives) project. He started his presentation by explaining the different use cases of storytelling with web archives. MementoEmbed is an archive-aware service that generates surrogates (social cards or thumbnails) for summarizing an individual memento. A user of Raintale can tell a story with the social cards generated through MementoEmbed. He explained that Raintale takes a list of mementos as input (these mementos can either be chosen manually by the user or generated by Hypercane) and creates stories in formats such as HTML or Markdown, or publishes them directly to services like Twitter. He closed with the current state and future of these tools: the Dark and Stormy Archives toolkit is currently being tested with the National Library of Australia, and users can expect a GUI for Raintale in the future.

#IIPCWAC21 sess15@shawnmjones @WebSciDL presenting MementoEmbed and Raintale tools that could be used for summarizing the web archives.https://t.co/0YQJw4PlOZ #WebArchiveWednesday pic.twitter.com/9vtF2uhlMc
— kritika garg (@kritika_garg) June 16, 2021

The slides for my talk are available at this link. We cover why we developed these tools and how MementoEmbed and Raintale work together. #iipcWAC21 https://t.co/zl69ELcZvl
— Shawn M. Jones (@shawnmjones) June 16, 2021

Barbara Signori and Kai Jauslin (@kjauslin) from the Swiss National Library presented “Interactive Collage of Websites: A Deep Dive Into the Web Archive Switzerland”. During this presentation they discussed the redesign of the Swiss National Library’s web archive. Their web archive is now more visual and has a collage that shows many snapshots at once. Each snapshot is a screenshot of the start page for an archived web page. The collage is interactive and allows users to click and zoom into certain sections of the collage. If a snapshot is selected then an authorized user can view an archived web page and can switch between earlier and later versions of the web page. The collage also supports full text search and if a snapshot is associated with the search then it will be highlighted and the other snapshots will be masked out. Some of the tools and frameworks used for the collage are pywb, OpenWayback, Puppeteer, and Vue JS. Another feature is that the user can switch between using pywb or OpenWayback when viewing a snapshot.

Samantha Abrams (@smbrms) from University of Wisconsin-Madison chaired session 16 and the topic for this session was Research Into Archives.

#iipcWAC21 Now @smbrms is chairing a session on "Research Into Archives" featuring work by @ktcmackinnon and @SamVFritz.https://t.co/QzGndKfyjU pic.twitter.com/tjOM003DMY
— Shawn M. Jones (@shawnmjones) June 16, 2021

Katie Mackinnon (@ktcmackinnon) from University of Toronto presented “Ethical Approaches to Researching Youth Cultures in Historical Web Archives”. During her presentation she discussed some of the ethical issues with using web archives for research. Some of the issues mentioned are the “Right to be Forgotten” and privacy, especially for young people. Data is usually not handled in a privacy-preserving way; for example, consent is not obtained from the users and children's rights are not balanced.

Now @ktcmackinnon is providing an overview of "Ethical approaches to researching youth cultures in historical web archives" - just because we can do certain research doesn't mean we should#iipcWAC21 #WebArchiveWednesday https://t.co/Jzjj0hRo3R pic.twitter.com/t132KM7rgM
— Shawn M. Jones (@shawnmjones) June 16, 2021

Samantha Fritz (@SamVFritz) from University of Waterloo presented (slides) “Accessible Web Archives: Rethinking and Designing Usable Infrastructure for Sustainable Research Platforms” where she discussed some of the accessibility barriers that exist when working with web archives. One barrier is the set of skills required, such as handling large amounts of data, using high performance computing, and working from the command line. One of the goals of the Archives Unleashed Project is to make it easier to work with web archive data so that researchers do not need all of these skills. To achieve this goal, the Archives Unleashed team created several tools that assist others with exploring and analyzing web archives, including the Archives Unleashed Cloud and the Archives Unleashed Toolkit. WS-DL member Travis Reid has written blog posts about using Archives Unleashed Cloud (Working With Archives Unleashed Cloud) and Archives Unleashed Toolkit (Creating Collection Growth Curves With Archives Unleashed Toolkit And Hypercane). The Archives Unleashed Cloud has recently shut down, but it will be integrated with Archive-It later.

#iipcWAC21 #WebArchiveWednesday @SamVFritz is now discussing "Accessible Web Archives: Rethinking and Designing Usable Infrastructure for Sustainable Research Platforms" - usability only exists if it is accessiblehttps://t.co/1siBmEjN2T pic.twitter.com/5TjzEpU4xZ
— Shawn M. Jones (@shawnmjones) June 16, 2021

Great discussion on ethical approaches to researching in web archives and accessibility of web archives for researchers by @SamVFritz (@unleasharchives), @ktcmackinnon, and @smbrms. @NetPreserve #IIPCWAC21 session 16https://t.co/ylO6I5vIYc https://t.co/TeUGFSJZXa pic.twitter.com/UbkJONEOed
— kritika garg (@kritika_garg) June 16, 2021

Day 2 of WAC came to an end with the invited panelists session on Trust In Web Archiving. The session was chaired by Martin Klein (Chair of the WAC Programme Committee) and Yves Maurer. The panel of speakers were Michael L. Nelson (@phonedude_mln), Ada Lerner (@AdaLerner), Takuya Watanabe (@twatanabe1203), and Clare Stanton (@clare__stanton). The panel discussed why the web archive user should not blindly trust what is present in the archives.

Now #iipcWAC21 has an Invited Panel: "Trust in Web Archives" chaired by @mart1nkle1n and @yvesmaurer with panelists @AdaLerner @phonedude_mln @clare__stanton @twatanabe1203 #WebArchiveWednesday https://t.co/OsQy6pA4Ky pic.twitter.com/cLzgPAl3nv
— Shawn M. Jones (@shawnmjones) June 16, 2021

TRUST IN WEB ARCHIVING, the last session of #IIPCWAC21
Links to the panelist's work:
1. @AdaLerner: https://t.co/4u4bGXDTsf
2. @phonedude_mln: https://t.co/PkUeE4XOd0
3. @twatanabe1203: https://t.co/FSik83zuv7
4. @clare__stanton: https://t.co/e2OfvUfHRy, https://t.co/QQtV3vUzKA pic.twitter.com/nuZVxGyGk7
— kritika garg (@kritika_garg) June 16, 2021

Michael L. Nelson talked about his vision for trust in Web Archiving in 2025, which is to have “hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives”.

During the presentation (which is embedded below) he describes some of the current issues with web archives that need to be addressed in order to achieve this vision:

Hundreds: The number of archives may not continue to grow because there are “innumerable examples that point toward centralization / consolidation”.
Publicly available: Around two-thirds of web traffic is not publicly archivable based on a previous estimate.
Independent: Suggested having multiple independent observations, because copies archived at the same time can differ based on factors like GeoIP, personalization, and CDN status.
Interoperable: “Homogeneity is not true interoperability” and “true interoperability comes through the hard work of protocols and standards".
Robust: Discussed some of the recent works related to robustness and malicious .html and .js files in web archives. Links to the related works are included on slides 18 and 19.
Auditable: Discussed some issues with auditing web pages from web archives: replaying the same archived web page can produce different results, conventional fixity-based approaches do not work, and we cannot depend on a web archive for fixity (“Where did the archive go?” series: parts 1, 2, 3, and 4) since the archive can change or die.
Cooperating: Mentioned that APIs are necessary but not sufficient and that we need to be able to preserve and audit data like WARC and HAR as rendered through software like pywb.
Web: Showed images of different apps that could be web applications. Anne Helmond’s presentation from session 14 also discusses the issues with archiving apps.
Archive: We must also accommodate any system that supports rehosting/revisions.

#iipcWAC21 @phonedude_mln is presenting "Web Archiving in the Year 2025" featuring work by @ScottGAinsworth @ibnesayeed @maturban1 @johnaberlin @justinfbrunelle @kritika_garg @Hussam_A_Hallak @HimarshaJ @machawk1 @weiglemc #WebArchiveWednesday
Slides: https://t.co/NUzlMy8S2y pic.twitter.com/FMrtXNipUw
— Shawn M. Jones (@shawnmjones) June 16, 2021

During Ada Lerner’s presentation she described an attack where the attacker could replace images and text on an archived web page with newer content. This type of attack can occur when the archived web page includes live resources and the attacker owns the domain associated with those resources. In her paper, this attack is referred to as Archive-Escape Abuse. Justin Brunelle’s (@justinfbrunelle) previous work has also identified examples of leakage in web archives, where certain older archived web pages included recent live content. The Archive-Escape Abuse attack exploits one of the three Wayback Machine vulnerabilities that were mentioned in Ada Lerner’s paper.

The three types of Wayback Machine vulnerabilities identified in her paper:

The first type of vulnerability occurs when live web content is included on an archived web page, which allows an attacker to change the content if they own the domain associated with the live web content. Archive-Escape Abuse can exploit this vulnerability.
The second type of vulnerability is caused by the same-origin policy not being enforced, which can result in different sources interfering with each other. Since the Wayback Machine loads all of the content, the browser cannot enforce the same-origin policy, which allows third parties included in <iframe> elements to modify data on the main web page. To exploit this second vulnerability, the attacker must include their payload in an <iframe> before the web page is archived.
The third type of vulnerability occurs because the Wayback Machine uses nearest-neighbor timestamp matching which is an issue when a resource is not successfully archived. If the attacker knows that a resource is missing then it is possible in some cases for a malicious payload to be used for the missing resource. To perform this attack, the attacker must be the owner of the domain for the missing resource and the malicious payload must be the only successfully archived version of the resource.
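
The third vulnerability is easier to follow with a simplified model of nearest-neighbor timestamp matching. The Python sketch below is not the Wayback Machine's actual implementation; the `nearest_capture` function, the record layout, and the dates are hypothetical and only illustrate why a lone capture can be selected even when it was made years after the page.

```python
from datetime import datetime

def nearest_capture(requested, captures):
    """Return the capture whose timestamp is closest to the requested datetime."""
    return min(captures, key=lambda capture: abs(capture["time"] - requested))

# Hypothetical archived page from 2012 that references a script.
page_time = datetime(2012, 5, 1)

# Case 1: the script was archived close to the page, so that capture is chosen.
healthy = [
    {"time": datetime(2012, 4, 30), "note": "original script"},
    {"time": datetime(2019, 3, 1), "note": "later capture"},
]
chosen_healthy = nearest_capture(page_time, healthy)

# Case 2: the script was never archived at the time. The only capture was made
# years later, so it is the nearest neighbor no matter how far away it is.
only_late = [{"time": datetime(2019, 3, 1), "note": "later capture"}]
chosen_late = nearest_capture(page_time, only_late)
```

The second case mirrors the precondition stated above: the later capture is used for the 2012 page only because it is the sole successfully archived version of the resource.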

#iipcWAC21 @AdaLerner is presenting "People may have motivations to manipulate web archives" where she demonstrates the ways of manipulating data in web archives without compromising the servers or the client machine pic.twitter.com/lggETUs3JM
— Shawn M. Jones (@shawnmjones) June 16, 2021

During Takuya Watanabe’s presentation, he discussed the five different types of attacks that can target users of web rehosting services. The five attacks are persistent man-in-the-middle attacks, privilege abuse to access various resources, credential theft, browser history theft, and session hijacking/injection.

Wayback Machine is vulnerable to three of the attacks mentioned in his paper:

Persistent man-in-the-middle (MITM) attacks, which exploit the application cache (AppCache) and can result in the browser being compromised and sensitive information being leaked to the attacker.
Privilege abuse to access the user’s resources like their camera, microphone, or GPS. This attack involves a permission notification that asks the user to allow web.archive.org to use their resources.
History theft (stealing browser history) by exploiting localStorage to get data that can be used to fingerprint visited websites.

#iipcWAC21 @twatanabe1203 is presenting "Melting Pot of Origins: Compromising the Intermediary Web Services that Rehost Websites" where he discusses the security issues with all services that rehost websites, including #webarchives #WebArchiveWednesday pic.twitter.com/wLJISIo078
— Shawn M. Jones (@shawnmjones) June 16, 2021

Clare Stanton’s presentation was about Perma.cc. Perma.cc is a solution for legal citation that was developed at Harvard Law School. Perma.cc can be used by scholars, courts, and others to create permanent records of web pages that they cite. She also discussed how Perma.cc is different from other web archives.

In her previous work, she listed some differences between Perma.cc and Archive-It:

A Perma.cc record is a high fidelity web capture that allows the user to click through images, view animations, and scroll down with the content
The users don’t have to host or store the archived web pages, because the web pages are added to Harvard’s collection
The users have control over the privacy of the records and can decide when to make the records public

at #iipcWAC21 @clare__stanton from @permacc is presenting "Trust in https://t.co/V35pa33UIg" - Perma is designed to be a citation tool for academics and those working in the court system pic.twitter.com/BdiaLxbr2I
— Shawn M. Jones (@shawnmjones) June 16, 2021

IIPC’s WAC proved very valuable for our research. It was wonderful to listen to great research happening in web archiving in all the different domains. The conference touched upon multiple topics such as:

Web Archives for preserving important events, contemporary arts, apps, defaced websites, etc.
Strengthening the web archiving community through Datathons
Using emerging technologies such as Web Bundle and AWS cloud-native services with web archives
Improving the QA & control in web archives by automating the process
Different tools built by researchers to support web archives
Summarizing large web archives
Ethical practices while archiving
Making web archives more accessible to researchers
Security vulnerabilities in web archives

The IIPC conference ended on a high note with all the presenters and attendees networking in Remo. We got to communicate with various researchers and understand their work in the web archiving field on both days. Also, the Web Science & Digital Libraries Research Group (@WebSciDL) members did not forget to stop for a picture at the end of Day 2.

The @WebSciDL members stopped for a photo op at #iipcWAC21 - pictured are @shawnmjones @kritika_garg @HimarshaJ @mart1nkle1n @machawk1 @phonedude_mln @phonedude_mln @ibnesayeed pic.twitter.com/mM3RDAwkTE
— Shawn M. Jones (@shawnmjones) June 16, 2021

Travis Reid (@TReid803), Kritika Garg (@kritika_garg), and Himarsha Jayanetti (@HimarshaJ)

Web Science and Digital Libraries Research Group

2021-07-14: Web Archiving Conference (WAC) 2021 Trip Report
