2025-02-10: Creating a Dataset of Archived Web Ads

Figure 1: Themes view for the web page we created to display ads from our dataset

One of the goals for the Saving Ads project was to create a dataset of advertisements from the live web. To construct our dataset, we randomly selected websites from SimilarWeb’s top websites worldwide (including all categories except “Adult”), rendered a web page from each website, and if the page loaded ads, archived it. We repeated this process until we had collected at least 250 ads.
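The selection loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `collect_ad_pages` and the `count_ads` callable are hypothetical names, and `count_ads` stands in for the manual step of rendering a page and counting the ads it loads.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple


def collect_ad_pages(
    candidate_sites: Iterable[str],
    count_ads: Callable[[str], int],
    target_ads: int = 250,
    seed: Optional[int] = None,
) -> Tuple[List[Tuple[str, int]], int]:
    """Visit sites in random order, keeping pages that load ads, until the target is met."""
    rng = random.Random(seed)
    sites = list(candidate_sites)
    rng.shuffle(sites)  # random selection from the candidate list

    archived: List[Tuple[str, int]] = []
    total = 0
    for site in sites:
        if total >= target_ads:
            break  # goal reached, stop selecting pages
        n = count_ads(site)  # hypothetical: render the page and count its ads
        if n > 0:
            archived.append((site, n))  # this is where the page would be archived
            total += n
    return archived, total
```

Because the loop only checks the total before visiting the next site, the final count can overshoot the target, which is consistent with ending at 279 ads for a goal of 250.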

Ultimately, we selected 17 web pages to archive, resulting in the collection of 279 advertisements (Table 1). To archive these web ads we used four web archiving services and three browser-based tools:

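Web archiving services: Internet Archive's Save Page Now, Arquivo.pt, archive.today, and Conifer

Browser-based tools: ArchiveWeb.page, Browsertrix Crawler, and Brozzler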

The four web archiving services (Save Page Now, Arquivo.pt, archive.today, and Conifer) archived two web pages each, ArchiveWeb.page and Browsertrix Crawler archived four web pages each, and Brozzler archived one web page. We did not archive four web pages with every web archiving tool because we had already reached our goal of 250 advertisements. We successfully archived (that is, captured all of the resources needed for) nearly all of these ads (273 of 279).

Web Page	Number of Ads Archived	Web Archiving Tool	Replay System
https://canalturf.com	66	Save Page Now	Wayback Machine
https://www.lequipe.fr/Tous-sports/Actualites/Le-flash-sports-du-5-avril/1389820	37	ArchiveWeb.page	ReplayWeb.page
https://www.leroymerlin.com.br/	31	Arquivo.pt	Arquivo.pt
https://www.ign.com/articles/the-last-of-us-season-1-review	24	ArchiveWeb.page	ReplayWeb.page
https://www.facebook.com/	24	ArchiveWeb.page	ReplayWeb.page
https://www.cnn.com/	23	ArchiveWeb.page	ReplayWeb.page
https://www.marketwatch.com/	18	Brozzler	ReplayWeb.page
https://www.diy.com/	13	Conifer	Conifer
https://www.realtor.com/news/unique-homes/frank-lloyd-wright-designed-home-in-tulsa-ok-lists-for-7-9m/	11	Browsertrix Crawler	ReplayWeb.page
https://mortalkombat.fandom.com/wiki/Tag_Team_Ladder	8	Browsertrix Crawler	ReplayWeb.page
https://tokopedia.com	6	Arquivo.pt	Arquivo.pt
https://www.deviantart.com/kvacm/art/Hellstone-Ruins-860415274	5	Browsertrix Crawler	ReplayWeb.page
https://unsplash.com/t/wallpapers	3	archive.today	archive.today
https://sports.yahoo.com	3	Conifer	Conifer
https://www.vidal.ru/novosti/kak-potreblenie-razlichnyh-doz-alkogolya-vliyaet-na-smertnost-11744	2	Save Page Now	Wayback Machine
https://canalturf.com	2	ArchiveBot	Wayback Machine
https://canalturf.com	1	Perma.cc	Wayback Machine
https://www.youtube.com/watch?v=PZShwWiepeY	1	Browsertrix Crawler	ReplayWeb.page
https://www.tripadvisor.it/Tourism-g186338-London_England-Vacations.html	1	archive.today	archive.today

Table 1: The number of archived ads from each web page that we archived when creating the dataset. When replaying a web page (https://canalturf.com) archived by Save Page Now, three of the ads that were loaded by Wayback Machine were archived by ArchiveBot and Perma.cc.

Six ads were not fully archived. One required a specific user interaction (clicking on a play button) to load all of the ad resources. Three ads requested unarchived JavaScript and HTML files during replay. We used ReplayWeb.page's URL prefix search and removed the query string from the requested URI to see if a resource with a similar URI was archived, but did not find these JavaScript and HTML files in ArchiveWeb.page's output file (WACZ). For one Flashtalking ad, it was not possible to determine if the ad was successfully archived because this type of ad cannot be replayed outside of its ad iframe (this problem with Flashtalking ads was described in a previous blog post). This prevented us from comparing the ad resources that would load on the live web and during replay. The last partially archived ad was missing images because the URL, which was dynamically generated at crawl time, included an “e” query string parameter that prevented some of the images from loading (examples: Ad with e parameter | Ad without e parameter).
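The check for a similarly named archived resource can be illustrated with a short sketch. It is not ReplayWeb.page's search code; it only mimics the matching idea (drop the query string, then do a prefix match) on a plain list of archived URIs. The function names and the example URIs are hypothetical.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def strip_query(uri: str) -> str:
    """Return the URI without its query string or fragment."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def find_similar(requested_uri: str, archived_uris: List[str]) -> List[str]:
    """Prefix search: archived URIs that start with the query-less requested URI."""
    prefix = strip_query(requested_uri)
    return [uri for uri in archived_uris if uri.startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical URIs for illustration only.
    archived = [
        "https://ads.example/creative/main.js?cb=111",
        "https://ads.example/other.js",
    ]
    print(find_similar("https://ads.example/creative/main.js?cb=999", archived))
    print(find_similar("https://ads.example/missing.html?x=1", archived))
```

An empty result corresponds to the situation described above, where no resource with a similar URI was present in the WACZ file.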

Finding Archived Ads’ Web Resources

By using ReplayWeb.page's URL search feature and our Display Archived Ads tool, we found that 55 out of 279 advertisements were not replayable when loading the archived containing web page (the web page that loaded the ad during a crawling session). We used ReplayWeb.page’s URL search feature to identify ads in our WARC and WACZ files. This feature allows users to specify the MIME type, which facilitates searching for a specific ad type like image ads. It also allows for a prefix search, which we used to identify resources associated with ad services like Flashtalking, Innovid, Amazon Ad Server, and Google AdSense.

Our Display Archived Ads tool (Figure 2 and Video 1) is used to display most of the HTML, image, and video files inside of a given WARC file. Our tool depends on the warcio, pywb, and Selenium software packages. warcio retrieves the URLs for the web resources from the WARC file. We used pywb to replay archived resources inside an iframe. We replayed the archived resources in an iframe in order to display multiple ads on the same web page. Selenium was used to open a web browser and execute the JavaScript necessary to display the ads. Our tool offered two affordances. First, it enabled us to filter out some of the known ad resources that remain invisible during replay (an example filtered file is pixel.gif, a commonly used image for ad services that only shows a few white pixels), thereby speeding up the review of ad resources. Second, by allowing us to display the live version of an ad beside the archived version, the tool showed problems with replay.
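As a rough idea of the first step, the sketch below lists response records from a WARC file with warcio and keeps HTML, image, and video resources while skipping known invisible files. It is a simplified illustration, not the tool's actual code: the function names, the MIME prefixes, and the one-entry filter list are assumptions (only pixel.gif is named in this post), and the pywb and Selenium steps are omitted.

```python
from typing import Iterable, Iterator, Tuple

# Assumed MIME prefixes for resources worth displaying.
DISPLAYABLE_PREFIXES = ("text/html", "image/", "video/")
# Hypothetical filter list; pixel.gif is the one example given in the post.
INVISIBLE_FILENAMES = ("pixel.gif",)


def iter_warc_resources(warc_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (URL, Content-Type) for each response record in a WARC file."""
    # Imported here so the filter below can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            mime = ""
            if record.http_headers is not None:
                mime = record.http_headers.get_header("Content-Type") or ""
            yield url, mime


def candidate_ads(resources: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Keep HTML, image, and video URLs, skipping known invisible files."""
    for url, mime in resources:
        mime = mime.split(";")[0].strip().lower()  # drop charset and similar
        filename = url.split("?")[0].rstrip("/").rsplit("/", 1)[-1]
        if filename in INVISIBLE_FILENAMES:
            continue
        if mime.startswith(DISPLAYABLE_PREFIXES):
            yield url
```

With a local file, `list(candidate_ads(iter_warc_resources("example.warc.gz")))` would give the URLs to review, where `example.warc.gz` is a placeholder path.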

Figure 2: Our tool for displaying potential ads, which loads the live web ad beside the archived version of the same ad.

Video 1: Demo video for Display Archived Ads tool

Replaying Archived Ads

To replay the archived advertisements, we used four web archiving services (Internet Archive’s Wayback Machine, Arquivo.pt, archive.today, and Conifer) and three other replay systems (ReplayWeb.page, pywb, and OpenWayback). Web ads that we archived with a web archiving service were replayed with the replay system of that same web archive. We used ReplayWeb.page to replay the archived ads from our WARC and WACZ files because, at the beginning of 2023, ReplayWeb.page could replay more dynamically loaded web resources than pywb and OpenWayback. pywb (version 2.7.3) failed to replay archived web ads that relied upon an Amazon ad iframe. We did not select OpenWayback (version 2.4.0) as the preferred replay system for the web pages we archived because during replay it loaded live web advertisements instead of the archived ads. This problem of live web resources being loaded during replay has been discussed by Brunelle (“Zombies in the Archives”) and Lerner et al. (“Rewriting History: Changing the Archived Web from the Present”).

Categorizing Advertisements

We organized our 279 ads into five categories:

Image
Video
Embedded web page
Text-only
Combination

The first three types are associated with one web resource that is viewable outside of the containing web page provided the user knows the URI associated with the resource. Figures 3 (image ad), 4 (video ad), and 5 (embedded web page ad) show examples. Text-only ads (Figure 6) cannot be viewed outside of the containing web page because the web page loads the text. The combination category comprises ads (Figure 7) that rely upon multiple resources and are constructed inside of the containing web page or ad iframe. Like text ads, combination ads cannot be viewed outside of the containing web page.

Figure 3: An example image ad loaded outside of the containing web page. Ad’s URI-M: https://conifer.rhizome.org/treid003/2023-05-16-archiving-ads-on-sportsyahoocom/https://s.yimg.com/bx/adb/20230310154032235.jpg

Figure 4: An example video ad loaded outside of the containing web page. WACZ: https://zenodo.org/record/8057942/files/2023_06_07_archiving_ads_on_lequipe_ArchiveWeb_page.wacz?download=1 | Ad’s URI-R: https://azv.adsrvr.org/thetradedesk-ads-video/2nwniqr/f2rcr8v/g5gzcxa29115fa8fdeed4ef88089cec513d745e4.mp4

Figure 5: An example web page ad loaded outside of the containing web page. WARC: https://zenodo.org/record/7601187/files/2023-01-11_00-59-34_ads_on_fandom_browsertrix_crawler.warc.gz?download=1 | Ad’s URI-R: https://s0.2mdn.net/sadbundle/13045786678919115269/CCD2C_5568424_300x600_MF_CP_APPLY_NA_NR_EN_V1_H5_BD_2022_042025/index.html

Figure 6: An example text-only ad for a sponsored news article loaded in the containing web page. WACZ: https://zenodo.org/record/8057942/files/2023_06_07_archiving_ads_on_lequipe_ArchiveWeb_page.wacz?download=1 | Containing web page’s URI-R: https://www.lequipe.fr/Tous-sports/Actualites/Le-flash-sports-du-5-avril/1389820

Figure 7: An example combination ad loaded in the containing web page. This ad uses three images and one video. WACZ: https://zenodo.org/record/8000975/files/2023-02-07-ads-on-ign_ArchiveWeb_page.wacz?download=1 | Containing web page’s URI-R: https://www.ign.com/articles/the-last-of-us-season-1-review

Next, we coded each ad topically. Each ad was assigned to a single theme. Table 2 shows the 24 themes and the corresponding number of ads for each. Most themes (17 out of 24) aligned with SimilarWeb’s website categories. The seven themes not associated with SimilarWeb’s categories were “Internet and Mobile Service Provider”, “Politics”, “Funeral Services”, “Charity”, “Military”, “Sponsored Brand”, and “Unknown”. The “Unknown” theme refers to ads that we were not able to replay and could not view on the live web. Notably, the Military theme consisted exclusively of video ads. In contrast, the other themes with more than three ads included multiple ad types.

Theme	Number of Ads
Shopping	85
Finance	27
Vehicle and Automotive	23
Business Services	21
Travel	19
Entertainment	16
Health	15
Real Estate	15
News	14
Unknown	6
Internet and Mobile Service Provider	5
Art	4
Beauty and Cosmetics	4
Gaming	4
Military	4
Food and Drink	3
Gambling and Fantasy Sports	3
Computer Security	2
Fitness and Sports	2
Pets and Animals	2
Sponsored Brand	2
Charity	1
Funeral Services	1
Politics	1

Table 2: List of themes used for the ads in our dataset.

Creating a Web Page To Display Ads From Our Dataset

We created a web page (https://savingads.github.io/themed_ad_collections.html) to display ads from our dataset. This web page has three views which are determined by the query string:

Themes view: Shows all of the ad themes (Figure 1). No query string parameter is used for this view.
Collection view: Shows ad previews for ads from the same theme (Figure 8). The “collection” parameter is included in the URL for this view (Shopping collection example: https://savingads.github.io/themed_ad_collections.html?collection=Shopping).
Ad details view: Shows the archived ad and all of its information from the dataset (Figure 9). The “ad” query string parameter is included in the URL for this view and it is set to a hash code value computed from the string of the contents (URLs for the web resources and text) for the ad (Samsung Neo QLED TV Ad example: https://savingads.github.io/themed_ad_collections.html?ad=-62639933).
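The way the query string selects a view can be sketched as follows. This is an illustration, not the page's actual code. The post does not say which hash function produces the “ad” value; the signed 32-bit, Java-style string hash below is an assumption (suggested only by the negative example value), and giving “ad” precedence over “collection” is likewise an assumption.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit


def select_view(url: str) -> Tuple[str, Optional[str]]:
    """Pick the view from the query string: ad details, collection, or themes."""
    params = parse_qs(urlsplit(url).query)
    if "ad" in params:  # assumed precedence if both parameters are present
        return ("ad_details", params["ad"][0])
    if "collection" in params:
        return ("collection", params["collection"][0])
    return ("themes", None)  # no query string parameter


def string_hash_32(text: str) -> int:
    """Assumed hash: h = 31*h + code point, wrapped to a signed 32-bit integer.

    Matches Java's String.hashCode for characters in the Basic Multilingual Plane.
    """
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


if __name__ == "__main__":
    base = "https://savingads.github.io/themed_ad_collections.html"
    print(select_view(base))
    print(select_view(base + "?collection=Shopping"))
    print(select_view(base + "?ad=-62639933"))
```

A hash of the ad's contents gives each ad a compact identifier for the URL, although two different ads could in principle collide on the same value.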

Figure 8: Example Collection view for shopping ads.

On the ad details view, we embedded the replay of the archived ad and provided links for the ad's resources and the containing web page. For ads that were archived using ArchiveWeb.page, Browsertrix Crawler, or Brozzler, we embedded the replay of the ad using ReplayWeb.page. When replaying these ad resources, we included an option (Figure 10) for selecting either the latest version of ReplayWeb.page or a version released during 2023. The first option in this list is the recommended version, which is the one that was in use around the time the ad was archived.

Figure 9: Example Ad details view for a Samsung Neo QLED TV Ad.

Figure 10: Option for selecting the version of ReplayWeb.page to use when loading an archived ad’s web resource.

Summary

We described the methods used to create our dataset of 279 ads archived from the live web. Our dataset was created by archiving 17 web pages from SimilarWeb's top websites worldwide. When archiving these web pages, we utilized four web archiving services (Internet Archive's Save Page Now, Arquivo.pt, archive.today, and Conifer) and three browser-based tools (ArchiveWeb.page, Browsertrix Crawler, and Brozzler). We replayed these archived web pages with four web archiving services (Internet Archive's Wayback Machine, Arquivo.pt, archive.today, and Conifer) and three other replay systems (ReplayWeb.page, pywb, and OpenWayback). We also created a web page that allows us to view all of the information from the dataset including the replay of the archived ads.

--Travis Reid (@TReid803)


Web Science and Digital Libraries Research Group