2021-07-14: Web Archiving Conference (WAC) 2021 Trip Report
This was our (Travis, Kritika, and Himarsha) first time attending IIPC’s Web Archiving Conference (WAC). Usually this conference is an in-person event (IIPC General Assembly and WAC trip reports: 2011, 2012, 2013, 2014, 2015, 2016, and 2017), but this year the conference was held online. Remo was the conference platform for this event. The presentations for this conference were pre-recorded and shown during the conference, and instead of having presentations for the papers submitted, there were Q&A sessions. The invited panels were delivered live during the event. When more presentation slides and videos become available we will add the links to the blog post. Archive-It also wrote a blog post for this conference.
Day One
Samantha Fritz (@SamVFritz) from University of Waterloo presented (slides) “Building Community Through Archives Unleashed Datathons”. She discussed some of the techniques the Archives Unleashed team used for community building and engagement. The Archives Unleashed team hosted a series of datathons (WS-DL trip reports for each datathon) to improve the participants’ skills and to build community around web archives. During the datathon, teams would be formed and each team worked on a project that involved data from a web archive collection.
From interviews with datathon participants, there were four major themes:
Datathon contributes to skill-building
Exposed participants to interdisciplinary perspectives
Fostered community formation
Fostered a sense of belonging
Their community engagement model is a six-stage model.
Scope: identify the problem or questions and identify the stakeholders that make up the community
Inform: provide information to the community so that they can understand the current problems and possible solutions
Consult: conduct an open dialog to gather feedback from the community
Involve: work with the community to ensure community concerns are considered
Collaborate: foster interdisciplinary collaborations and establish partnerships
Empower: encourage and support scholars in building the skills needed to work with web archives
Now is #iipcWAC21 session 8: "Modeling, Packaging, Re-Using", chaired by @kristsi with work from
— Shawn M. Jones (@shawnmjones) June 15, 2021
* @ibnesayeed @weiglemc @phonedude_mln @mart1nkle1n @hvdsomp "Readying Web Archives to Consume and Leverage Web Bundles"
* @Yinlin_Chen "Analyzing WARC on serverless computing" pic.twitter.com/uS9fNGZdCc
Sawood Alam (@ibnesayeed), an ODU WS-DL alumnus and Web and Data Scientist at the Internet Archive, presented (slides) his work on "Readying Web Archives to Consume and Leverage Web Bundles". Web Bundle/Packaging allows users to download a website as a single bundle file. This bundle can be shared offline and loaded from any server as if the website were loaded from the original source. He described the potential of Web Bundles for addressing challenges in web archiving. He discussed enabling discovery, content negotiation, and ingestion of Web Bundles for archival purposes, which may provide more complete and coherent crawling. He also touched upon decomposing bundled HTTP Exchanges for efficient storage and indexing, and dynamically generating bundles for composite mementos at replay. This work has also been discussed in a DHSR blog post.
.@ibnesayeed of @internetarchive @WebSciDL presenting:
— Michael L. Nelson (@phonedude_mln) June 15, 2021
"Readying Web Archives to Consume and Leverage Web Bundles"
video: https://t.co/hntGvs7Oz7
slides: https://t.co/nN4PLNiAgF
report: https://t.co/lS6x7AAL8a #iipcWAC21 early #WebArchiveWednesday pic.twitter.com/5dLPqsxLe1
"Analyzing WARC on serverless computing" is by @Yinlin_Chen from @virginia_tech . More information on his work is available at https://t.co/hJ5a94k37m
— Shawn M. Jones (@shawnmjones) June 15, 2021
@CamtheWicked is talking about comparing screenshots between the original and archived website to measure archive quality. #iipcWAC21
— Shawn M. Jones (@shawnmjones) June 15, 2021
Code: https://t.co/G5ERGyNPL6
Slides: https://t.co/qhFY2YtSmE pic.twitter.com/dEUDL0jQPi
@pkbaclac from Library and Archives Canada is talking about the "Black Hole of quality control" and different collection measures. #iipcWAC21 https://t.co/5towTk9wVn pic.twitter.com/taQPAQGXJX
— Shawn M. Jones (@shawnmjones) June 15, 2021
@jrvdhoeven of @openpreserve is talking about "Improving the quality of web harvests using Web Curator Tool" #iipcWAC21 https://t.co/vJobV7l8sT pic.twitter.com/drTDFrmhAG
— Shawn M. Jones (@shawnmjones) June 15, 2021
#iipcWAC21 Now @ob1_ben_ob is chairing session 10: "Archiving Frameworks & Tools" with work by @mart1nkle1n @hvdsomp @IlyaKreymer @machawk1 pic.twitter.com/mcqEsi0CPn
— Shawn M. Jones (@shawnmjones) June 15, 2021
Now @mart1nkle1n is presenting Memento Tracer a tool that finds a balance between human scale web archiving and massive crawls. Record your trace and apply it to many pages on the same site. #iipcWAC21
— Shawn M. Jones (@shawnmjones) June 15, 2021
* https://t.co/MpsDnTgyhY
* https://t.co/f8dWNs9WRZ
* https://t.co/9wBbLVUVON pic.twitter.com/vc2ABLKAeX
@IlyaKreymer is presenting "Not gone in a Flash! Developing a Flash-capable remote browser emulation system" - running older web sites in older browsers to rerun Flash sites. #iipcWAC21 https://t.co/KiAKrOFhrE pic.twitter.com/rG2yftCFzU
— Shawn M. Jones (@shawnmjones) June 15, 2021
#iipcWAC21 @machawk1 from @DrexelUniv is presenting "WASAPIfying private web archiving tools for persistence and collaboration" - a proof of concept of desktop-based web archiving tools, see: https://t.co/FD75ULN4Ou https://t.co/SOaXY1JlEl https://t.co/U4xaxG21Vs pic.twitter.com/YcEGv7emKt
— Shawn M. Jones (@shawnmjones) June 15, 2021
#iipcWAC21 @ob1_ben_ob closes with a statement on the need for standards so that tools can work together. Here are some standards related to #webarchiving:
— Shawn M. Jones (@shawnmjones) June 15, 2021
CDXJ: https://t.co/ulbTzM5276
WARC: https://t.co/ZU3DfxBiSu
Memento: https://t.co/zDWGUxUdr3
CDX: https://t.co/RokfRRcEZw pic.twitter.com/xB6PG7wn5d
Day Two
.@ibnesayeed of @internetarchive @WebSciDL is chairing #iipcWAC21 Day-2 session-14 panel discussion on non-traditional archives. @silvertje @fvandervlist @MKRZMR #webarchivewednesday pic.twitter.com/yrUVVZk6kV
— Himarsha Jayanetti (@HimarshaJ) June 16, 2021
Anne Helmond (@silvertje) from University of Amsterdam presented "Platform and app histories: Assessing source availability in web archives and app repositories" where she discussed the difficulties of archiving platforms and apps. Continuous updates make platforms and apps difficult to archive: each update overwrites the previous version, so part of the history is overwritten as well. She also described how researchers can use web archives and software repositories to reconstruct platform and app histories. For apps, web archives are used to get metadata and app repositories are used to get previous versions of the app. She then described an analysis of how well various platforms and apps are represented across web archives. For this analysis, the Memento Time Travel API was used via MemGator to retrieve URLs for social media platforms and app detail pages from all web archives that support the Memento protocol. After the URLs were gathered, three dimensions were used to analyze how well a platform or app is archived.
Three dimensions for analyzing how well a platform or app is archived:
Volume of availability which is the number of mementos held
Depth of availability which is the number of days, months, or years between the first and last mementos
Breadth of availability which is the number of archives holding the mementos
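As a rough illustration of the three dimensions above, the following minimal sketch computes volume, depth, and breadth from a list of mementos. The `Memento` record, the `availability` function, and the archive labels are hypothetical names chosen for this example; they are not taken from the presentation, and depth is measured here in days only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Memento:
    """One archived capture: when it was taken and which archive holds it."""
    captured: datetime
    archive: str  # hypothetical archive label, e.g. "archive-a"


def availability(mementos):
    """Return (volume, depth_in_days, breadth) for a list of mementos."""
    if not mementos:
        return 0, 0, 0
    times = [m.captured for m in mementos]
    volume = len(mementos)                        # number of mementos held
    depth_days = (max(times) - min(times)).days   # span from first to last capture
    breadth = len({m.archive for m in mementos})  # distinct archives holding copies
    return volume, depth_days, breadth


# Made-up sample data for illustration only.
sample = [
    Memento(datetime(2010, 1, 1), "archive-a"),
    Memento(datetime(2015, 1, 1), "archive-b"),
    Memento(datetime(2020, 1, 1), "archive-a"),
]
print(availability(sample))  # (3, 3652, 2)
```

In practice the input list would come from a TimeMap aggregated across archives rather than from hand-written records.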
re: detecting defacements, @shawnmjones's Off Topic Memento Toolkit would likely help. https://t.co/IcTJCYNzhP #iipcWAC21
— Michael L. Nelson (@phonedude_mln) June 16, 2021
TMVis can also be used to view significant changes in an archived webpage's content over time, highlighting not only page design updates, but also potentially periods of defacement or outage https://t.co/Cj61g1XYyG https://t.co/0l5o2ldnn2 #iipcWAC21 https://t.co/ydLYGjl5B6
— Michele Weigle (@weiglemc) June 16, 2021
How to summarize the web archive? #IIPCWAC21
— kritika garg (@kritika_garg) June 16, 2021
1. https://t.co/CvpZV4EQTe by @shawnmjones @WebSciDL
2. MementoMap (https://t.co/6OGUR4KasL) by @ibnesayeed @phonedude_mln
3. Interactive collage (https://t.co/JOlPcbgZjz) by Web Archive Switzerland. #WebArchiveWednesday pic.twitter.com/X1Nu3TOqU3
#IIPCWAC21 sess15 @shawnmjones @WebSciDL presenting MementoEmbed and Raintale tools that could be used for summarizing the web archives. https://t.co/0YQJw4PlOZ #WebArchiveWednesday pic.twitter.com/9vtF2uhlMc
— kritika garg (@kritika_garg) June 16, 2021
Barbara Signori and Kai Jauslin (@kjauslin) from the Swiss National Library presented “Interactive Collage of Websites: A Deep Dive Into the Web Archive Switzerland”. During this presentation they discussed the redesign of the Swiss National Library’s web archive. Their web archive is now more visual and has a collage that shows many snapshots at once. Each snapshot is a screenshot of the start page for an archived web page. The collage is interactive and allows users to click and zoom into certain sections of the collage. If a snapshot is selected, an authorized user can view the archived web page and switch between earlier and later versions of it. The collage also supports full-text search: snapshots that match the search are highlighted and the other snapshots are masked out. Some of the tools and frameworks used for the collage are pywb, OpenWayback, Puppeteer, and Vue.js. Another feature is that the user can switch between pywb and OpenWayback when viewing a snapshot.
The slides for my talk are available at this link. We cover why we developed these tools and how MementoEmbed and Raintale work together. #iipcWAC21 https://t.co/zl69ELcZvl
— Shawn M. Jones (@shawnmjones) June 16, 2021
#iipcWAC21 Now @smbrms is chairing a session on "Research Into Archives" featuring work by @ktcmackinnon and @SamVFritz. https://t.co/QzGndKfyjU pic.twitter.com/tjOM003DMY
— Shawn M. Jones (@shawnmjones) June 16, 2021
Now @ktcmackinnon is providing an overview of "Ethical approaches to researching youth cultures in historical web archives" - just because we can do certain research doesn't mean we should #iipcWAC21 #WebArchiveWednesday https://t.co/Jzjj0hRo3R pic.twitter.com/t132KM7rgM
— Shawn M. Jones (@shawnmjones) June 16, 2021
#iipcWAC21 #WebArchiveWednesday @SamVFritz is now discussing "Accessible Web Archives: Rethinking and Designing Usable Infrastructure for Sustainable Research Platforms" - usability only exists if it is accessible https://t.co/1siBmEjN2T pic.twitter.com/5TjzEpU4xZ
— Shawn M. Jones (@shawnmjones) June 16, 2021
Great discussion on ethical approaches to researching in web archives and accessibility of web archives for researchers by @SamVFritz (@unleasharchives), @ktcmackinnon, and @smbrms. @NetPreserve #IIPCWAC21 session 16 https://t.co/ylO6I5vIYc https://t.co/TeUGFSJZXa pic.twitter.com/UbkJONEOed
— kritika garg (@kritika_garg) June 16, 2021
Now #iipcWAC21 has an Invited Panel: "Trust in Web Archives" chaired by @mart1nkle1n and @yvesmaurer with panelists @AdaLerner @phonedude_mln @clare__stanton @twatanabe1203 #WebArchiveWednesday https://t.co/OsQy6pA4Ky pic.twitter.com/cLzgPAl3nv
— Shawn M. Jones (@shawnmjones) June 16, 2021
TRUST IN WEB ARCHIVING, the last session of #IIPCWAC21
— kritika garg (@kritika_garg) June 16, 2021
Links to the panelist's work:
1. @AdaLerner: https://t.co/4u4bGXDTsf
2. @phonedude_mln: https://t.co/PkUeE4XOd0
3. @twatanabe1203: https://t.co/FSik83zuv7
4. @clare__stanton: https://t.co/e2OfvUfHRy, https://t.co/QQtV3vUzKA pic.twitter.com/nuZVxGyGk7
Michael L. Nelson talked about his vision for trust in Web Archiving in 2025, which is to have “hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives”.
During the presentation (which is embedded below) he describes some of the current issues with web archives that need to be addressed in order to achieve this vision:
Hundreds: The number of archives may not continue to grow because there are “innumerable examples that point toward centralization / consolidation”.
Publicly available: Around two-thirds of web traffic is not publicly archivable based on a previous estimate.
Independent: Suggested having multiple independent observations, because copies of the same page archived at the same time can differ based on factors like GeoIP, personalization, and CDN status.
Interoperable: “Homogeneity is not true interoperability” and “true interoperability comes through the hard work of protocols and standards".
Robust: Discussed some of the recent works related to robustness and malicious .html and .js files in web archives. Links to the related works are included on slides 18 and 19.
Auditable: Discussed some issues with auditing web pages from web archives: replaying the same archived web page can produce different results, conventional fixity-based approaches do not work, and we can’t depend on a web archive for fixity (“Where did the archive go?” series: parts 1, 2, 3, and 4) since the archive can change or die.
Cooperating: Mentioned that APIs are necessary but not sufficient and that we need to be able to preserve and audit data like WARC and HAR as rendered through software like pywb.
Web: Showed images of different apps that could be web applications. Anne Helmond’s presentation from session 14 also discusses the issues with archiving apps.
Archive: We must also accommodate any system that supports rehosting/revisions.
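To make the "Auditable" point above concrete, here is a minimal sketch of why a naive hash-based fixity check can fail on replayed pages. The two response bodies are made up for illustration; the assumption is only that some replay-time markup varies between requests while the archived text stays the same.

```python
import hashlib


def digest(body: bytes) -> str:
    """Return the SHA-256 hex digest of a replayed response body."""
    return hashlib.sha256(body).hexdigest()


# Hypothetical replays of the same memento; only replay-time markup differs.
replay_1 = b"<html><!-- replayed 10:00:01 --><p>Archived text</p></html>"
replay_2 = b"<html><!-- replayed 10:00:02 --><p>Archived text</p></html>"

# The archived text is identical, yet the digests disagree.
print(digest(replay_1) == digest(replay_2))  # False
```

A hash over the full replayed response therefore cannot, on its own, show that the underlying capture is unchanged.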
#iipcWAC21 @phonedude_mln is presenting "Web Archiving in the Year 2025" featuring work by @ScottGAinsworth @ibnesayeed @maturban1 @johnaberlin @justinfbrunelle @kritika_garg @Hussam_A_Hallak @HimarshaJ @machawk1 @weiglemc #WebArchiveWednesday
— Shawn M. Jones (@shawnmjones) June 16, 2021
Slides: https://t.co/NUzlMy8S2y pic.twitter.com/FMrtXNipUw
During her presentation, Ada Lerner described an attack in which the attacker could replace images and text on an archived web page with newer content. This type of attack can occur when the archived web page includes live resources and the attacker owns the domain associated with those resources. In her paper, this attack is referred to as Archive-Escape Abuse. Justin Brunelle’s (@justinfbrunelle) previous work has also identified examples of leakage in web archives, where certain older archived web pages had recent live content included on the web page. The Archive-Escape Abuse attack exploits one of the three Wayback Machine vulnerabilities that were mentioned in Ada Lerner’s paper.
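The kind of live-web reference described above can be spotted, in simple cases, by scanning replayed HTML for embedded resources whose URLs point outside the archive. The sketch below is illustrative only and is not the method from either paper; `ARCHIVE_HOST`, `LiveResourceFinder`, `find_live_resources`, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

ARCHIVE_HOST = "web.archive.org"  # assumption: replay is served from this host


class LiveResourceFinder(HTMLParser):
    """Collect embedded-resource URLs that would load from outside the archive."""

    EMBED_TAGS = {"img", "script", "iframe", "link"}

    def __init__(self):
        super().__init__()
        self.live_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.EMBED_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).netloc
                # Relative URLs have no host and resolve against the archive.
                if host and host != ARCHIVE_HOST:
                    self.live_urls.append(value)


def find_live_resources(html):
    """Return the URLs of embedded resources not served by the archive host."""
    finder = LiveResourceFinder()
    finder.feed(html)
    return finder.live_urls


# Made-up replayed page: two rewritten resources and one that escapes.
page = """
<img src="https://web.archive.org/web/20120101000000im_/http://example.com/logo.png">
<script src="http://ads.example.net/widget.js"></script>
<img src="/web/20120101000000im_/http://example.com/photo.jpg">
"""
print(find_live_resources(page))  # ['http://ads.example.net/widget.js']
```

A static scan like this misses URLs that are built by JavaScript at replay time, so it would understate the leakage on script-heavy pages.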
The three types of Wayback Machine vulnerabilities identified in her paper:
The first type of vulnerability occurs when live web content is included on an archived web page, which allows an attacker to change the content if they own the domain associated with the live web content. Archive-Escape Abuse can exploit this vulnerability.
The second type of vulnerability is caused by the same-origin policy not being enforced, which can result in different sources interfering with each other. Since the Wayback Machine serves all of the archived content from its own domain, the browser cannot enforce the same-origin policy, which allows third parties included in <iframe> elements to modify data on the main web page. To exploit this second vulnerability, the attacker must include their payload in an <iframe> before the web page is archived.
The third type of vulnerability occurs because the Wayback Machine uses nearest-neighbor timestamp matching which is an issue when a resource is not successfully archived. If the attacker knows that a resource is missing then it is possible in some cases for a malicious payload to be used for the missing resource. To perform this attack, the attacker must be the owner of the domain for the missing resource and the malicious payload must be the only successfully archived version of the resource.
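The nearest-neighbor matching behind the third vulnerability can be sketched as follows. This is a simplified model, not the Wayback Machine's actual implementation; `nearest_capture` and the sample captures are hypothetical. It shows that when the only capture of a resource is the attacker's, that capture is selected no matter how far it is from the requested time.

```python
from datetime import datetime


def nearest_capture(captures, requested):
    """Return the (timestamp, label) capture closest in time to the request."""
    return min(captures, key=lambda capture: abs(capture[0] - requested))


# The page was archived in 2011, but the embedded resource was never captured then.
requested = datetime(2011, 6, 1)

# Made-up index for the resource: the only capture is a later, malicious one.
captures = [(datetime(2019, 3, 1), "attacker payload")]

when, label = nearest_capture(captures, requested)
print(label, "selected,", (when - requested).days, "days from the requested time")
```

With a legitimate capture close to the requested time in the index, the same function would pick that one instead, which is why the attack requires the payload to be the only archived version.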
#iipcWAC21 @AdaLerner is presenting "People may have motivations to manipulate web archives" where she demonstrates the ways of manipulating data in web archives without compromising the servers or the client machine pic.twitter.com/lggETUs3JM
— Shawn M. Jones (@shawnmjones) June 16, 2021
During Takuya Watanabe’s presentation, he discussed five different types of attacks that can target users of web rehosting services. The five attacks are persistent man-in-the-middle attack, abusing privileges to access various resources, stealing credentials, stealing browser history, and session hijacking/injection.
Wayback Machine is vulnerable to three of the attacks mentioned in his paper:
Persistent man-in-the-middle (MITM) attacks when application cache (AppCache) is exploited which can result in the browser being compromised and sensitive information being leaked to the attacker.
Privilege abuse to access the user’s resources like their camera, microphone, or GPS. This attack involves permission notification that requires the user to allow web.archive.org to use their resources.
History theft (stealing browser history) by exploiting localStorage to get data that can be used to fingerprint visited websites.
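One reason browser features such as localStorage and permissions become shared in a rehosting service is that every rehosted site ends up under the service's own origin. The sketch below illustrates this with URL parsing only; the `origin` helper and the example sites are assumptions, and the replay URLs follow the familiar `web.archive.org/web/<timestamp>/<url>` pattern.

```python
from urllib.parse import urlsplit


def origin(url):
    """Return the (scheme, host, port) triple browsers use as the origin."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return parts.scheme, parts.hostname, parts.port or default_port


# Hypothetical live sites and their rehosted (archived) counterparts.
live_a = "https://bank.example/login"
live_b = "https://evil.example/page"
replay_a = "https://web.archive.org/web/20200101000000/https://bank.example/login"
replay_b = "https://web.archive.org/web/20200101000000/https://evil.example/page"

print(origin(live_a) == origin(live_b))      # False: live sites are isolated
print(origin(replay_a) == origin(replay_b))  # True: rehosted copies share an origin
```

Because both replayed pages share one origin, storage or permissions granted while viewing one can be reachable from the other.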
#iipcWAC21 @twatanabe1203 is presenting "Melting Pot of Origins: Compromising the Intermediary Web Services that Rehost Websites" where he discusses the security issues with all services that rehost websites, including #webarchives #WebArchiveWednesday pic.twitter.com/wLJISIo078
— Shawn M. Jones (@shawnmjones) June 16, 2021
Clare Stanton’s presentation was about Perma.cc. Perma.cc is a solution for legal citation that was developed at Harvard Law School. Perma.cc can be used by scholars, courts, and others to create permanent records of web pages that they cite. She also discussed how Perma.cc is different from other web archives.
In her previous work, she listed some differences between Perma.cc and Archive-It:
A Perma.cc record is a high fidelity web capture that allows the user to click through images, view animations, and scroll down with the content
The users don’t have to host or store the archived web pages, because the web pages are added to Harvard’s collection
The users have control over the privacy of the records and can decide when to make the records public
at #iipcWAC21 @clare__stanton from @permacc is presenting "Trust in https://t.co/V35pa33UIg" - Perma is designed to be a citation tool for academics and those working in the court system pic.twitter.com/BdiaLxbr2I
— Shawn M. Jones (@shawnmjones) June 16, 2021
IIPC’s WAC proved very valuable for our research. It was wonderful to listen to great research happening in web archiving in all the different domains. The conference touched upon multiple topics such as:
Web Archives for preserving important events, contemporary arts, apps, defaced websites, etc.
Strengthening the web archiving community through Datathons
Using emerging technologies such as Web Bundle and AWS cloud-native services with web archives
Improving quality assurance and quality control in web archives through automation
Different tools built by researchers to support web archives
Summarizing large web archives
Ethical practices while archiving
Making web archives more accessible to researchers
Security vulnerabilities in web archives
The IIPC conference ended on a high note with all the presenters and attendees networking in Remo. We got to communicate with various researchers and understand their work in the web archiving field on both days. Also, the Web Science & Digital Libraries Research Group (@WebSciDL) members did not forget to stop for a picture at the end of Day 2.
The @WebSciDL members stopped for a photo op at #iipcWAC21 - pictured are @shawnmjones @kritika_garg @HimarshaJ @mart1nkle1n @machawk1 @phonedude_mln @ibnesayeed pic.twitter.com/mM3RDAwkTE
— Shawn M. Jones (@shawnmjones) June 16, 2021