2023-05-31: Towards an Ethical Framework for Full-Text Search in Web Archives

Figure 1: Arquivo.pt's full-text search engine makes it effortless to search archived content on GeoCities. The exploitative search results for the query "party site:geocities.com", not included in the figure, show the dangers of full-text search without an ethical framework.

On International Women's Day, I wanted to highlight how web archives have preserved progress towards gender equality. Both Archive-It and the UK Web Archive have dedicated collections about this topic. 

While the Internet Archive does not have any dedicated collections, it is possible to conduct a keyword/metadata search across the Wayback Machine holdings. I performed a Google search (Internet Archive International Women's Day) which led me to the general Internet Archive search interface for this topic. I then selected the radio button to search archived web sites. 

Figure 2: The Internet Archive general search page can query for keywords across multiple kinds of archives, controlled by a radio button.

Selecting the radio button led me to the matching Wayback Machine search results page.

Figure 3: Searching the Internet Archive's Wayback Machine for "subject:international women's day" via clicking a radio button.


Figure 4: The first page of search results from the Internet Archive's Wayback Machine for "subject:international women's day" includes off-topic sites such as a Russian mail order bride website.

Unfortunately, none of these websites are about International Women's Day. The 7th result is about Russian mail order brides, and many of the other top results are spam websites. I was attempting to find mementos about progress towards gender equality, and instead I was given the exact opposite: mementos about marginalized women. Removing the "subject:" modifier from the query returns much more relevant results, but many users would not attempt this tactic, especially since the original search results page was suggested via clicking a radio button. This example highlights the need for ethical full-text search in web archives because it shows how standard algorithms falter when applied to underrepresented groups. 

Summary of Ethical Concerns

Many researchers have grappled with the ethics of web archiving itself, and full-text search adds another layer of ethical concerns on top of the problems already identified. In "Lessons from archives: strategies for collecting sociocultural data in machine learning" (2020), Unso Jo (@unsojo) and Timnit Gebru (@timnitGebru@dair-community.social) identify consent, privacy, transparency, and representation as factors in whether a dataset is ethical. Crawling the web is considered "laissez-faire data collection" because the origins of the data with respect to privacy and consent are not considered. Many larger web archives operate on an opt-out model, rather than an opt-in model. Content owners may withdraw consent and have their websites removed from an archive, but they never actively gave consent in the first place. Additional concerns about consent arise when the website is an aggregate of multiple content creators, such as a web forum. The forum owner, along with the users who wrote the content, should all have withdrawal privileges. Privacy concerns especially apply to marginalized groups, such as minors. Some web archives do provide transparency through provenance data on captures. Finally, ensuring that marginalized groups have adequate representation is especially important in aggregate settings, such as machine learning and full-text search.

Bergis Jules (@BergisJules), Ed Summers (@edsu@social.coop), and Vernon Mitchell, Jr. (@vcmitchelljr) examined ethical concerns in web archiving in the 2018 Documenting the Now white paper, "Ethical Considerations for Archiving Social Media Content Generated by Contemporary Social Movements: Challenges, Opportunities, and Recommendations." Like Jo and Gebru, they identified consent as a major hurdle in web archiving. They flagged misinformation as problematic, since users examining collections with misinformation would need context in order to fully understand and verify the authenticity of the information. Representation of marginalized groups was also identified as an important issue. There is potential for commercial entities to exploit this data. Also, when archivists are members of the community they are archiving, it brings a different perspective to both consent and representation. The authors are clear that these practices apply to all aspects of web archiving, from collecting data to providing access to that data. Since full-text search is a very powerful way to access web archives, these ethical concerns are amplified if left unaddressed.

Jimmy Lin (@lintool), Ian Milligan (@ianmilligan1), Douglas Oard, Nick Ruest (@ruebot), and Katie Shilton also looked into the ethics of full-text search for youth websites like GeoCities in "We Could, but Should We? Ethical Considerations for Providing Access to GeoCities and Other Historical Digital Collections" in 2020. They stated that many people are comfortable with web archives because they provide "privacy by obscurity." Adding full-text search to web archives would eliminate this privacy. Because many of the content creators on GeoCities were minors, being able to find text and images from their youth would be exploitative. A second concern is related to consent: it is unlikely that users who used aliases would have consented to their content being archived, and it is also likely that they would withdraw consent because they could lose their anonymity with the technology available today.

In "Databound: Histories of Growing Up on the World Wide Web" (2022), Katie Mackinnon (@ktcmackinnon) laid out a framework for conducting research in web archives, especially with marginalized groups. She also conducted user studies on how participants felt about re-experiencing their personal data in web archives. Her web archive research framework is centered around the idea that data belongs to people, and those people need to have discovery paths for their data along with sovereignty. Mackinnon's user studies confirm Lin et al.'s hypothesis that users would be uncomfortable with images and text from their youth being available today. Participants expressed anxiety about not knowing whether their personal data was archived because it was hard to discover. Full-text search would provide a remedy to this problem but would amplify privacy concerns. Participants were also uncomfortable with the data from their youth being preserved in web archives. One user stated, "No one wants this online! No one wants these things in an archive!" These people had never consented to their data being archived many years ago and should have more control over it now. As we saw in the introduction to this blog post, women are often marginalized on the Internet. Mackinnon stresses that web archives simply reflect this phenomenon and advocates that stronger privacy measures are needed for women, trans, and non-binary people in web archives.

Ethical Framework Recommendations

Each author above gave recommendations for ethical guidelines when web archiving, and the recommendations are complementary. 

Representation: Jo and Gebru, like Jules et al., recommended that marginalized groups have agency and be a part of the archiving process. Jo and Gebru, as well as Lin et al., suggested that each project create a code of ethics. Lin et al. specified this further by suggesting that the original purpose and discovery goals of the website be considered when determining if full-text search is appropriate for the collection.

Consent and Transparency: Jules et al. state that web crawls are not collected with explicit permission, and that collections need to employ additional procedures that involve clear-cut consent when collecting data. Mackinnon recommends informed consent crawls as well as soliciting donation of data. Jo and Gebru advised that data sets should include provenance information to help identify consent concerns. The Internet Archive's Wayback Machine does include provenance information about its crawls and its Save Page Now feature, and other web archives should also adopt this practice. Documenting the Now has created a Python library called waybackprov to detect provenance information from Wayback captures. Andy Jackson (@anjacks0n) also advocated for transparency through provenance in his 2015 blog entry, "The provenance of web archives." Mackinnon notes that consent is necessary but not sufficient, and that participants need to be involved in the digital afterlife of their data, which ties back into representation. Regarding full-text search, only collections with vetted consent should have full-text search enabled. There are some smaller web archives that do operate on an opt-in model, so those web archives would have fewer barriers to providing ethical access via full-text search.

Privacy: Both Mackinnon and Lin et al. suggested anonymization as a form of privacy control. Mackinnon refers to this process as data orphaning, which allows the content to remain in the collection with permission from the author in exchange for anonymity. Jo and Gebru, along with Mackinnon, advocate for active screening for private data during the curation process. Mackinnon also advocates for data sovereignty, especially related to youth data and the Right to be Forgotten. In order to incorporate these ideas into full-text search, the same right to erasure policies that apply to live web search engines like Google should apply to web archive search engines. Web archives should also develop a protocol for screening for data that should be kept private from search.
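One way to picture how these privacy controls might sit in front of a search engine is a filter that runs before any capture is added to the full-text index. The sketch below is an illustration under stated assumptions, not a description of any existing archive's pipeline: `Capture`, `is_indexable`, the `consent_vetted` flag, and the erasure request set are all hypothetical names, and the email regular expression is only a stand-in for a much more careful private-data screening protocol.

```python
import re
from dataclasses import dataclass

# Deliberately simple stand-in for "private data"; a real screening
# protocol would need to cover far more than email addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class Capture:
    """Hypothetical record for one archived page."""
    url: str
    text: str
    consent_vetted: bool  # assumption: curators record vetted consent per capture


def is_indexable(capture: Capture, erasure_requests: set) -> bool:
    """Return True only if the capture passes every privacy gate."""
    if not capture.consent_vetted:
        return False  # only collections with vetted consent get full-text search
    if capture.url in erasure_requests:
        return False  # honor right-to-erasure requests, as live web search engines do
    if EMAIL_PATTERN.search(capture.text):
        return False  # screened out: text appears to contain private data
    return True


# Made-up example data using the reserved example.org domain.
erasure_requests = {"https://example.org/b"}
captures = [
    Capture("https://example.org/a", "Notes from our community meeting.", True),
    Capture("https://example.org/b", "An old personal homepage.", True),
    Capture("https://example.org/c", "Contact me at kid@example.com", True),
    Capture("https://example.org/d", "A forum post crawled without consent.", False),
]

indexable = [c.url for c in captures if is_indexable(c, erasure_requests)]
print(indexable)
```

In this sketch a capture that fails a gate stays in the collection and is simply excluded from search, which mirrors the distinction drawn above between preserving content and making it discoverable.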

Mackinnon also developed care ethics scaffolding. For representation, she emphasizes that the research or product needs to benefit the communities that produced the data, and that active measures must be taken to make sure that the communities are not taken advantage of. For consent, the people who created the data must explicitly give permission for their data to be archived and kept. Implementing these policies for web archives with full-text search will mean having the communities who created the data as active stakeholders in the development of the search tool.

Ethics are not Absolute

For some web archive collections, it is straightforward to resolve ethical concerns. Webpages like GeoCities, which include sensitive content created by minors, are not appropriate for full-text indexing. US federal webpages are appropriate for full-text indexing because the publications of the US government are public documents that are not entitled to any privacy. What about everything in between these extremes? When do privacy rights prevail, and when do other factors hold more influence?

In "Preservation Acts: Toward an ethical archive of the web," Nora Caplan-Bricker highlighted the ethical dilemmas that web archive researchers face. The article recounts how Bergis Jules chose to preserve and publish a dataset of deleted tweets because of their perceived historical value, rather than remove the tweets as required by Twitter's developer terms of service at the time. In "Using Historical Twitter Data for Research: Ethical Challenges of Tweet Deletions," Jim Maddock, Kate Starbird (@katestarbird), and Robert Mason chose to remove deleted tweets from their dataset, even though it directly hindered their research. They also likened deleting a tweet to withdrawing consent. Web archives contain a large amount of deleted content, which adds to the ethical complexity of making this content discoverable via full-text search. It's possible to index the changes on web pages, but it will not be appropriate to do this for every collection. The collection curators will need to weigh the historical importance of the collection versus the privacy rights of the content creators.

Politicians are afforded less privacy than ordinary citizens. One reason for this is that public figures have a lower expectation of privacy by their own career choice. A second reason for this is that the enfranchised public has a right to know how politicians' publicized political beliefs have changed over time. Another group of people with a lower expectation of privacy are people who have committed crimes. These people delete publicly posted content to hide evidence of illegal activity and motive, which is significantly different than a typical citizen updating their webpage. Curators of these types of web archive collections will need to delineate between content relevant to public knowledge and personal content with a typical expectation of privacy. 

Sometimes activists will even archive evidence of a crime in progress, which is what happened with the January 6 United States Capitol attack. Rioters did not consent to having their social media posts about the attack archived, but these archived posts were later used for suspect identification as well as motive for sentencing. Proactive archiving preserves evidence that can be used by law enforcement for justice, but Edward Chapman showed how it can also lead to vigilantism such as doxxing in his thesis, Crowdsourced Archiving of the January 6th US Capitol Insurrection. Doxxing is particularly problematic when it leads to misidentification. One solution could be to give law enforcement more access to these kinds of archives than the general public to prevent vigilantism, but this is too idealistic: would they have prosecuted anyone if the riot had gone differently? The racial and economic biases in the US justice system also erode trust in law enforcement. Chapman recommends developing an ethical framework to address this issue for future archival crowdsourcing efforts.

Archived web content from different sources may necessitate different levels of access. Possible levels include internal/employee access, on-site access for researchers, remote access for researchers, public access without full-text search, and public access with full-text search. For example, on-site access requirements could increase privacy by preventing data scraping, one of the concerns brought up by Jules et al. Many archives already implement different access levels: the Library of Congress web archive makes many items available on-site only, and Archive Today provides off-site researcher access. In other countries, these considerations are legal rather than ethical. The Non-Print Legal Deposit regulations in the UK specify that creators must opt in to make their captures available off-site. The curators of each web archive collection should assess their collection contents to decide what level of access is appropriate. Users should be informed of the reasons for the chosen access level.
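The access tiers listed above form an ordered scale, so a curator's decision for each collection could be recorded as a ceiling on that scale together with the reason for it. The following is a minimal Python sketch under that assumption. The names `AccessLevel`, `CollectionPolicy`, and `allows` are hypothetical and do not come from any web archive's software, and the two example policies are illustrative values based on the extremes discussed earlier in this post, not recommendations.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessLevel(IntEnum):
    """Access tiers from the paragraph above, ordered from most to least restrictive."""
    INTERNAL = 1
    ONSITE_RESEARCHER = 2
    REMOTE_RESEARCHER = 3
    PUBLIC_NO_FULL_TEXT = 4
    PUBLIC_FULL_TEXT = 5


@dataclass(frozen=True)
class CollectionPolicy:
    """Hypothetical record of a curator's access decision for one collection."""
    name: str
    max_level: AccessLevel
    rationale: str  # shown to users so they know why this level was chosen

    def allows(self, requested: AccessLevel) -> bool:
        # A request is permitted only if it is no more open than the curator's ceiling.
        return requested <= self.max_level


# Illustrative values only; the ceiling for a real collection is the curator's call.
geocities = CollectionPolicy(
    name="GeoCities",
    max_level=AccessLevel.PUBLIC_NO_FULL_TEXT,
    rationale="Contains sensitive content created by minors.",
)
us_federal = CollectionPolicy(
    name="US federal webpages",
    max_level=AccessLevel.PUBLIC_FULL_TEXT,
    rationale="Government publications are public documents.",
)

for policy in (geocities, us_federal):
    searchable = policy.allows(AccessLevel.PUBLIC_FULL_TEXT)
    print(f"{policy.name}: full-text search allowed = {searchable} ({policy.rationale})")
```

Storing the rationale next to the level is a design choice that follows from the last point above: the same record that gates access can also supply the explanation shown to users.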

Outlook

Full-text search in web archives is becoming more common. It's important for web archiving organizations to create ethical frameworks before implementing this technology on a wide scale. If the underlying datasets are not created ethically, the search results will perpetuate algorithmic bias and further marginalize underrepresented communities. Using an ethical framework will strengthen users' trust. Curators will need to consider the ethical framework in conjunction with other important factors, such as historical significance, to decide on the appropriate privacy and access measures for their specific collections.

Researchers continue to think critically about how to archive the web ethically. The National Forum on Ethics and Archiving the Web allows for the community to share and synthesize ideas. Read the WS-DL Trip Report on EAW 2018.

Acknowledgements

I would like to thank Dr. Weigle and Dr. Nelson for their help in shaping my thoughts on this topic and for helping me think in a critical and well-rounded way. Dr. Weigle also recently gave a talk highlighting ethics, "Using Web Archives to Document Social Movements and Disinformation - Practice, Ethics, and Challenges," which helped give me the push to write this up.


-Lesley Frew

References

1. Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20). Association for Computing Machinery, New York, NY, USA, 306–316. https://doi.org/10.1145/3351095.3372829

2. Bergis Jules, Ed Summers, and Vernon Mitchell. "Documenting the now white paper: ethical considerations for archiving social media content generated by contemporary social movements: challenges, opportunities, and recommendations." 2018. https://www.docnow.io/docs/docnow-whitepaper-2018.pdf

3. Katie Mackinnon. Databound: Histories of Growing Up on the World Wide Web. Doctoral dissertation, University of Toronto, Canada. 2022. https://hdl.handle.net/1807/125246

4. Jimmy Lin, Ian Milligan, Douglas W. Oard, Nick Ruest, and Katie Shilton. 2020. We Could, but Should We? Ethical Considerations for Providing Access to GeoCities and Other Historical Digital Collections. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (CHIIR '20). Association for Computing Machinery, New York, NY, USA, 135–144. https://doi.org/10.1145/3343413.3377980

5. Andy Jackson. The provenance of web archives. 20th November 2015. https://anjackson.net/2015/11/20/provenance-of-web-archives/

6. Nora Caplan-Bricker. Preservation Acts: Toward an ethical archive of the web. Harper's Magazine, December 2018. https://harpers.org/archive/2018/12/preservation-acts-archiving-twitter-social-media-movements/

7. Jim Maddock, Kate Starbird and Robert M. Mason. Using Historical Twitter Data for Research: Ethical Challenges of Tweet Deletions. Presented at CSCW ’15 Workshop on Ethics at the 2015 Conference on Computer Supported Cooperative Work (CSCW 2015), Vancouver, Canada. http://faculty.washington.edu/kstarbi/maddock_starbird_tweet_deletions.pdf

8. Edward Chapman. Crowdsourced Archiving of the January 6th US Capitol Insurrection: An r/DataHoarders Case Study. 2021. Master's Thesis, Marquette University. https://epublications.marquette.edu/theses_open/685

