2025-02-10: Creating a Dataset of Archived Web Ads

 

Figure 1: Themes view for the web page we created to display ads from our dataset

One of the goals for the Saving Ads project was to create a dataset of advertisements from the live web. To construct our dataset, we randomly selected websites from SimilarWeb’s top websites worldwide (including all categories except “Adult”), rendered a web page from each website, and if the page loaded ads, archived it. We repeated this process until we had collected at least 250 ads.
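
The selection loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used in the project: `collect_ads`, `count_ads`, and `archive` are hypothetical names, and the work of rendering a page and counting its ads is left to whatever the caller passes in.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple


def collect_ads(
    candidate_sites: Iterable[str],
    count_ads: Callable[[str], int],   # hypothetical: render a page, return how many ads it loaded
    archive: Callable[[str], None],    # hypothetical: archive the page with some tool
    target_ads: int = 250,
    seed: Optional[int] = None,
) -> Tuple[List[Tuple[str, int]], int]:
    """Visit sites in random order and archive those that load ads,
    stopping once at least `target_ads` ads have been collected."""
    rng = random.Random(seed)
    remaining = list(candidate_sites)
    rng.shuffle(remaining)

    archived: List[Tuple[str, int]] = []
    total = 0
    for site in remaining:
        if total >= target_ads:
            break
        n = count_ads(site)
        if n == 0:
            continue  # the page loaded no ads, so it is skipped
        archive(site)
        archived.append((site, n))
        total += n
    return archived, total
```

Because the loop stops only after the target is met, the final count can overshoot the goal, which is consistent with ending up at 279 ads rather than exactly 250.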

Ultimately, we selected 17 web pages to archive, resulting in the collection of 279 advertisements (Table 1). To archive these web ads we used four web archiving services and three browser-based tools. The four web archiving services (Save Page Now, Arquivo.pt, archive.today, and Conifer) archived two web pages each, ArchiveWeb.page and Browsertrix Crawler archived four web pages each, and Brozzler archived one web page. We did not archive four web pages with every tool because we had already reached our goal of 250 advertisements. We successfully archived (that is, captured all of the needed resources for) 273 of the 279 ads.

| Web Page | Number of Ads Archived | Web Archiving Tool | Replay System |
| --- | --- | --- | --- |
| https://canalturf.com | 66 | Save Page Now | Wayback Machine |
| https://www.lequipe.fr/Tous-sports/Actualites/Le-flash-sports-du-5-avril/1389820 | 37 | ArchiveWeb.page | ReplayWeb.page |
| https://www.leroymerlin.com.br/ | 31 | Arquivo.pt | Arquivo.pt |
| https://www.ign.com/articles/the-last-of-us-season-1-review | 24 | ArchiveWeb.page | ReplayWeb.page |
| https://www.facebook.com/ | 24 | ArchiveWeb.page | ReplayWeb.page |
| https://www.cnn.com/ | 23 | ArchiveWeb.page | ReplayWeb.page |
| https://www.marketwatch.com/ | 18 | Brozzler | ReplayWeb.page |
| https://www.diy.com/ | 13 | Conifer | Conifer |
| https://www.realtor.com/news/unique-homes/frank-lloyd-wright-designed-home-in-tulsa-ok-lists-for-7-9m/ | 11 | Browsertrix Crawler | ReplayWeb.page |
| https://mortalkombat.fandom.com/wiki/Tag_Team_Ladder | 8 | Browsertrix Crawler | ReplayWeb.page |
| https://tokopedia.com | 6 | Arquivo.pt | Arquivo.pt |
| https://www.deviantart.com/kvacm/art/Hellstone-Ruins-860415274 | 5 | Browsertrix Crawler | ReplayWeb.page |
| https://unsplash.com/t/wallpapers | 3 | archive.today | archive.today |
| https://sports.yahoo.com | 3 | Conifer | Conifer |
| https://www.vidal.ru/novosti/kak-potreblenie-razlichnyh-doz-alkogolya-vliyaet-na-smertnost-11744 | 2 | Save Page Now | Wayback Machine |
| https://canalturf.com | 2 | ArchiveBot | Wayback Machine |
| https://canalturf.com | 1 | Perma.cc | Wayback Machine |
| https://www.youtube.com/watch?v=PZShwWiepeY | 1 | Browsertrix Crawler | ReplayWeb.page |
| https://www.tripadvisor.it/Tourism-g186338-London_England-Vacations.html | 1 | archive.today | archive.today |


Table 1: The number of archived ads from each web page that we archived when creating the dataset. When replaying a web page (https://canalturf.com) archived by Save Page Now, three of the ads that were loaded by Wayback Machine were archived by ArchiveBot and Perma.cc.
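
As a quick consistency check on Table 1, the rows can be tallied in a few lines of Python. The data below is copied from the table (URLs are shortened to the host for readability), and the variable names are only illustrative. The totals should match the figures quoted in the text: 279 ads from 17 distinct web pages.

```python
from collections import Counter

# (web page host, ads archived, archiving tool), copied from Table 1.
rows = [
    ("canalturf.com", 66, "Save Page Now"),
    ("lequipe.fr", 37, "ArchiveWeb.page"),
    ("leroymerlin.com.br", 31, "Arquivo.pt"),
    ("ign.com", 24, "ArchiveWeb.page"),
    ("facebook.com", 24, "ArchiveWeb.page"),
    ("cnn.com", 23, "ArchiveWeb.page"),
    ("marketwatch.com", 18, "Brozzler"),
    ("diy.com", 13, "Conifer"),
    ("realtor.com", 11, "Browsertrix Crawler"),
    ("mortalkombat.fandom.com", 8, "Browsertrix Crawler"),
    ("tokopedia.com", 6, "Arquivo.pt"),
    ("deviantart.com", 5, "Browsertrix Crawler"),
    ("unsplash.com", 3, "archive.today"),
    ("sports.yahoo.com", 3, "Conifer"),
    ("vidal.ru", 2, "Save Page Now"),
    ("canalturf.com", 2, "ArchiveBot"),
    ("canalturf.com", 1, "Perma.cc"),
    ("youtube.com", 1, "Browsertrix Crawler"),
    ("tripadvisor.it", 1, "archive.today"),
]

total_ads = sum(count for _, count, _ in rows)
unique_pages = {page for page, _, _ in rows}
pages_per_tool = Counter(tool for _, _, tool in rows)

print(total_ads, len(unique_pages))
print(pages_per_tool)
```

The ArchiveBot and Perma.cc rows appear in the per-tool count only because their captures were loaded by the Wayback Machine during replay of https://canalturf.com; they were not among the tools we used to archive pages.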

Six ads were not fully archived. One required a specific user interaction (clicking on a play button) to load all of the ad resources. Three ads requested unarchived JavaScript and HTML files during replay. We used ReplayWeb.page's URL prefix search and removed the query string from the requested URI to see if a resource with a similar URI was archived, but did not find these JavaScript and HTML files in ArchiveWeb.page's output file (WACZ). For one Flashtalking ad, it was not possible to determine if the ad was successfully archived because this type of ad cannot be replayed outside of its ad iframe (this problem with Flashtalking ads was described in a previous blog post). This prevented us from comparing the ad resources that would load on the live web and during replay. The last partially archived ad was missing images because the URL, which was dynamically generated at crawl time, included an “e” query string parameter that prevented some of the images from loading (compare: Ad with e parameter | Ad without e parameter).
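
The check for a similar archived URI can be illustrated with a small sketch. We did this search by hand in ReplayWeb.page, so the code below is only an approximation of that manual step. `strip_query` and `similar_archived` are hypothetical helper names, and the example URIs are placeholders.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def strip_query(uri: str) -> str:
    """Return the URI without its query string or fragment."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def similar_archived(requested_uri: str, archived_uris: Iterable[str]) -> List[str]:
    """Prefix search: archived URIs that begin with the query-less requested URI."""
    prefix = strip_query(requested_uri)
    return [uri for uri in archived_uris if uri.startswith(prefix)]


# Placeholder URIs for illustration only.
archived = [
    "https://ads.example.com/creative/ad.js?cb=111",
    "https://ads.example.com/other/frame.html",
]
matches = similar_archived("https://ads.example.com/creative/ad.js?cb=999", archived)
```

An empty result from a search like this corresponds to the case described above, where no resource with a similar URI was present in the WACZ.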

Finding Archived Ads’ Web Resources

By using ReplayWeb.page's URL search feature and our Display Archived Ads tool, we found that 55 out of 279 advertisements were not replayable when loading the archived containing web page (the web page that loaded the ad during a crawling session). We used ReplayWeb.page’s URL search feature to identify ads in our WARC and WACZ files. This feature allows users to specify the MIME type, which facilitates searching for a specific ad type like image ads. It also allows for a prefix search, which we used to identify resources associated with ad services like Flashtalking, Innovid, Amazon Ad Server, and Google AdSense.
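
To make the two search options concrete, here is a minimal sketch of a combined URL-prefix and MIME-type filter over a list of archived resources. It mimics the behavior described above and is not ReplayWeb.page's implementation. `ArchivedResource` and `search` are hypothetical names, and the example hosts are placeholders rather than real ad-service domains.

```python
from typing import Iterable, List, NamedTuple


class ArchivedResource(NamedTuple):
    url: str
    mime: str


def search(
    resources: Iterable[ArchivedResource],
    url_prefix: str = "",
    mime_prefix: str = "",
) -> List[ArchivedResource]:
    """Keep resources whose URL and MIME type start with the given prefixes.
    An empty prefix matches everything."""
    return [
        r for r in resources
        if r.url.startswith(url_prefix) and r.mime.startswith(mime_prefix)
    ]


# Placeholder index entries for illustration only.
index = [
    ArchivedResource("https://adservice.example.com/banner.jpg", "image/jpeg"),
    ArchivedResource("https://adservice.example.com/frame.html", "text/html"),
    ArchivedResource("https://news.example.org/photo.png", "image/png"),
]

image_ads = search(index, url_prefix="https://adservice.example.com/", mime_prefix="image/")
```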

Our Display Archived Ads tool (Figure 2 and Video 1) displays most of the HTML, image, and video files inside a given WARC file. The tool depends on the warcio, pywb, and Selenium software packages. warcio retrieves the URLs for the web resources from the WARC file. pywb replays the archived resources inside iframes, which lets us display multiple ads on the same web page. Selenium opens a web browser and executes the JavaScript necessary to display the ads. The tool offered two affordances. First, it enabled us to filter out some of the known ad resources that remain invisible during replay (an example filtered file is pixel.gif, a commonly used image for ad services that only shows a few white pixels), thereby speeding up the review of ad resources. Second, by allowing us to display the live version of an ad beside the archived version, the tool showed problems with replay.
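
The URL-extraction and filtering steps can be sketched as follows. This is a simplified illustration rather than the tool's actual code. The helper names and the `INVISIBLE_FILENAMES` set are assumptions (only pixel.gif is named above), and the pywb and Selenium parts are omitted. warcio is a third-party package and is imported only when `candidate_urls` is called.

```python
from typing import List
from urllib.parse import urlsplit

# MIME-type prefixes for the kinds of files the tool displays (HTML, image, video).
DISPLAYABLE_PREFIXES = ("text/html", "image/", "video/")
# Assumed filter list; pixel.gif is the only example named in the text.
INVISIBLE_FILENAMES = {"pixel.gif"}


def is_candidate(url: str, mime: str) -> bool:
    """True if the resource could be a visible ad: displayable type, not a known tracking pixel."""
    filename = urlsplit(url).path.rsplit("/", 1)[-1].lower()
    return mime.lower().startswith(DISPLAYABLE_PREFIXES) and filename not in INVISIBLE_FILENAMES


def candidate_urls(warc_path: str) -> List[str]:
    """Collect URLs of potential ad resources from the response records of a WARC file."""
    from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio

    urls = []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            mime = record.http_headers.get_header("Content-Type") or ""
            if url and is_candidate(url, mime):
                urls.append(url)
    return urls
```

Each returned URL would then be handed to the replay system and loaded in its own iframe for review.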

Figure 2: Our tool for displaying potential ads, showing the live web ad beside the archived version of the same ad.


Video 1: Demo video for Display Archived Ads tool

Replaying Archived Ads

To replay the archived advertisements, we used four web archiving services (Internet Archive’s Wayback Machine, Arquivo.pt, archive.today, and Conifer) and three other replay systems (ReplayWeb.page, pywb, and OpenWayback). Ads that we archived with a web archiving service were replayed with the replay service of that same web archive. We used ReplayWeb.page to replay the archived ads from our WARC and WACZ files, because at the beginning of 2023, ReplayWeb.page could replay more dynamically loaded web resources than pywb and OpenWayback. pywb (version 2.7.3) failed to replay archived web ads that relied upon an Amazon ad iframe. We did not select OpenWayback (version 2.4.0) as the preferred replay system for the web pages we archived, because during replay it loaded live web advertisements instead of the archived ads. This problem of live web resources being loaded during replay has been discussed by Brunelle (“Zombies in the Archives”) and Lerner et al. (“Rewriting History: Changing the Archived Web from the Present”).

Categorizing Advertisements

We organized our 279 ads into five categories: 
  • Image
  • Video
  • Embedded web page
  • Text-only
  • Combination
The first three types are associated with one web resource that is viewable outside of the containing web page provided the user knows the URI associated with the resource. Figures 3 (image ad), 4 (video ad), and 5 (embedded web page ad) show examples. Text-only ads (Figure 6) cannot be viewed outside of the containing web page because the web page loads the text. The combination category comprises ads (Figure 7) that rely upon multiple resources and are constructed inside of the containing web page or ad iframe. Like text ads, combination ads cannot be viewed outside of the containing web page.

Figure 3: An example image ad loaded outside of the containing web page. Ad’s URI-M: https://conifer.rhizome.org/treid003/2023-05-16-archiving-ads-on-sportsyahoocom/https://s.yimg.com/bx/adb/20230310154032235.jpg




Figure 6: An example text-only ad for a sponsored news article loaded in the containing web page. WACZ: https://zenodo.org/record/8057942/files/2023_06_07_archiving_ads_on_lequipe_ArchiveWeb_page.wacz?download=1 | Containing web page’s URI-R: https://www.lequipe.fr/Tous-sports/Actualites/Le-flash-sports-du-5-avril/1389820


Figure 7: An example combination ad loaded in the containing web page. This ad uses three images and one video. WACZ: https://zenodo.org/record/8000975/files/2023-02-07-ads-on-ign_ArchiveWeb_page.wacz?download=1 | Containing web page’s URI-R: https://www.ign.com/articles/the-last-of-us-season-1-review

Next, we coded each ad topically. Each ad was assigned to a single theme. Table 2 shows the 24 themes and the corresponding number of ads for each. Most themes (17 out of 24) aligned with SimilarWeb’s website categories. The seven themes not associated with SimilarWeb’s categories were “Internet and Mobile Service Provider”, “Politics”, “Funeral Services”, “Charity”, “Military”, “Sponsored Brand”, and “Unknown”. The “Unknown” theme refers to ads that we were not able to replay and could not view on the live web. Notably, the Military theme was exclusively video ads. In contrast, the other themes with more than three ads included multiple ad types.

| Theme | Number of Ads |
| --- | --- |
| Shopping | 85 |
| Finance | 27 |
| Vehicle and Automotive | 23 |
| Business Services | 21 |
| Travel | 19 |
| Entertainment | 16 |
| Health | 15 |
| Real Estate | 15 |
| News | 14 |
| Unknown | 6 |
| Internet and Mobile Service Provider | 5 |
| Art | 4 |
| Beauty and Cosmetics | 4 |
| Gaming | 4 |
| Military | 4 |
| Food and Drink | 3 |
| Gambling and Fantasy Sports | 3 |
| Computer Security | 2 |
| Fitness and Sports | 2 |
| Pets and Animals | 2 |
| Sponsored Brand | 2 |
| Charity | 1 |
| Funeral Services | 1 |
| Politics | 1 |

Table 2: List of themes used for the ads in our dataset.

Creating a Web Page To Display Ads From Our Dataset

We created a web page (https://savingads.github.io/themed_ad_collections.html) to display ads from our dataset. This web page has three views which are determined by the query string:
  1. Themes view: Shows all of the ad themes (Figure 1). No query string parameter is used for this view.
  2. Collection view: Shows ad previews for ads from the same theme (Figure 8). The “collection” parameter is included in the URL for this view (Shopping collection example: https://savingads.github.io/themed_ad_collections.html?collection=Shopping).
  3. Ad details view: Shows the archived ad and all of its information from the dataset (Figure 9). The “ad” query string parameter is included in the URL for this view and it is set to a hash code value computed from the string of the contents (URLs for the web resources and text) for the ad (Samsung Neo QLED TV Ad example: https://savingads.github.io/themed_ad_collections.html?ad=-62639933).
Figure 8: Example Collection view for shopping ads.
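
The way the query string selects a view can be sketched as follows. This is an illustrative Python sketch, not the page's own script. `select_view` is a hypothetical name, and giving “ad” precedence over “collection” when both are present is an assumption.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit


def select_view(url: str) -> Tuple[str, Optional[str]]:
    """Map a page URL to (view name, parameter value) based on its query string."""
    params = parse_qs(urlsplit(url).query)
    if "ad" in params:
        return ("ad_details", params["ad"][0])
    if "collection" in params:
        return ("collection", params["collection"][0])
    return ("themes", None)  # no recognized parameter: show all themes


base = "https://savingads.github.io/themed_ad_collections.html"
themes_view = select_view(base)
collection_view = select_view(base + "?collection=Shopping")
ad_view = select_view(base + "?ad=-62639933")
```

The value of the “ad” parameter is treated here as an opaque identifier; the sketch does not attempt to reproduce how the hash code is computed.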

On the ad details view, we embedded the replay of the archived ad and provided links for the ad's resources and the containing web page. For ads that were archived using ArchiveWeb.page, Browsertrix Crawler, or Brozzler, we embedded the replay of the ad using ReplayWeb.page. When replaying these ad resources, we included an option (Figure 10) for selecting either the latest version of ReplayWeb.page or a version released during 2023. The first option in this list is the recommended version, which was in use around the time the ad was archived.

Figure 9: Example Ad details view for a Samsung Neo QLED TV Ad.


Figure 10: Option for selecting the version of ReplayWeb.page to use when loading an archived ad’s web resource.

Summary

We described the methods used to create our dataset of 279 ads archived from the live web. Our dataset was created by archiving 17 web pages from SimilarWeb's top websites worldwide. When archiving these web pages, we utilized four web archiving services (Internet Archive's Save Page Now, Arquivo.pt, archive.today, and Conifer) and three browser-based tools (ArchiveWeb.page, Browsertrix Crawler, and Brozzler). We replayed these archived web pages with four web archiving services (Internet Archive's Wayback Machine, Arquivo.pt, archive.today, and Conifer) and three other replay systems (ReplayWeb.page, pywb, and OpenWayback). We also created a web page that allows us to view all of the information from the dataset including the replay of the archived ads.

--Travis Reid (@TReid803)
