2024-12-04: Problems With Replaying Ads That Use iframes
Figure 1: An example web page that loaded Google ads.
- The first problem involves Google (AdSense and Ad Manager) and Amazon (Ad Server) ad services dynamically generating URLs with random values that prevent an ad from being replayed, because the random number generator’s seed is not the same during the crawl and replay sessions.
- The second problem is that JavaScript from the Flashtalking ad service prevents an embedded web page ad from being replayed outside of its ad iframe.
- The last problem is that whether an ad loads during replay can depend on the web browser used.
Google SafeFrame
Problem Overview: Archived ads that use Google SafeFrames are injected into an iframe and are difficult to replay from a web archive, because the replay system cannot generate the same random value in the subdomain for the SafeFrame’s URL that was generated during the crawling session.
Problem Description:
Some Google ads (such as embedded web page ads) use SafeFrame, which is an iframe based on Interactive Advertising Bureau’s (IAB) specifications. SafeFrames are used instead of regular iframes, because SafeFrames allow the publisher's web page to communicate with the ad while also restricting the activity between the ad and the web page. This helps minimize the problems that can occur when a malicious ad is loaded. There are two types of Google SafeFrames. One type does not include a random value in the URL. However, this type of Google SafeFrame is now disallowed, since Google Publisher Tag (GPT) API’s SafeFrame configuration option useUniqueDomain is deprecated. The other type of SafeFrame includes a random value in the URL’s subdomain to “isolate SafeFrame content and provide stronger security guarantees” (Google, “Render creatives using SafeFrame”). Figure 2 shows an example Google SafeFrame URL with a dynamically generated random subdomain. This type of Google SafeFrame is difficult to replay, because the random value generated by Google’s pubads_impl_2023020201.js script during replay differs from the random value that was originally generated at crawl time. This results in a SafeFrame URL being generated during replay for an ad that may not have previously been archived or that may not even exist. Figure 3 shows an example ad that failed to load because the Google SafeFrame URL was different during replay time. When ReplayWeb.page, pywb, Conifer, or Wayback Machine attempt to load a Google SafeFrame for an advertisement, an HTTP status code of 404 (Not Found) is returned.
Figure 2: Example URI for a Google SafeFrame. The subdomain contains a random value that is dynamically generated when loading an ad.
Figure 3: Different SafeFrame URLs during crawl and replay sessions. Google’s pubads_impl.js (WACZ | URI-R: https://securepubads.g.doubleclick.net/pagead/managed/js/gpt/m202308210101/pubads_impl.js?cb=31077272) generates the random SafeFrame URL.
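The 404 described above can be illustrated with a minimal sketch of an exact-match lookup. The archive index, the subdomain labels, and the SafeFrame host and path used here are assumptions made for illustration; they are not copied from a real capture, and real replay systems use more elaborate indexes.

```javascript
// Hypothetical archive index: URLs captured at crawl time mapped to a status code.
// The SafeFrame host and path are illustrative assumptions.
const archived = new Map([
  ["https://a1b2c3d4.safeframe.googlesyndication.com/safeframe/1-0-40/html/container.html", 200],
]);

// Exact-match lookup: any URL that was not captured resolves to 404.
function replayStatus(url) {
  return archived.has(url) ? archived.get(url) : 404;
}

// Builds a SafeFrame-style URL from a random subdomain label.
function safeFrameUrl(label) {
  return `https://${label}.safeframe.googlesyndication.com/safeframe/1-0-40/html/container.html`;
}

const crawlStatus = replayStatus(safeFrameUrl("a1b2c3d4"));    // label seen at crawl time
const replayedStatus = replayStatus(safeFrameUrl("9f8e7d6c")); // label generated at replay time
console.log(crawlStatus, replayedStatus);
```

Because the random label is part of the host name, the URL requested at replay time has no counterpart in the index unless the label happens to repeat.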
We created an example web page to determine how the random values are generated for a Google SafeFrame. Our web page (Figure 4) uses ad code from Google’s pubads_impl_2023020201.js script to generate random numbers and Google SafeFrame URLs. First, our web page generates random numbers using the same functions that are used for Google SafeFrames (Math.random() and window.crypto.getRandomValues()). This let us determine whether the replay system generates the same random numbers that were generated during crawl time and whether the random numbers differ each time the web page is replayed. To check the random values that were generated during crawl time, we used ArchiveWeb.page and recorded a video of the crawling session. Second, we used the code from Google’s pubads_impl_2023020201.js script to generate the random value for each SafeFrame URL. In the first section of the web page, two SafeFrame URLs were generated: the first used Math.random() and the second used window.crypto.getRandomValues(). As in the first step, we checked whether the values generated during crawl time and replay time were the same. Lastly, to check whether the replay system successfully replayed the archived Google SafeFrames, we created and embedded an iframe into the web page for each SafeFrame URL that was generated. A Google SafeFrame that is loaded without an ad appears as a blank iframe with no content.
Figure 4: Demo web page (shown during the crawling session) that generates random values and Google SafeFrames.
Video 1: Archiving and replaying our demo web page that generated Google SafeFrames
Table 1 shows the random values generated for the Google SafeFrame URLs during crawl and replay sessions. The random values generated during replay time and crawl time differed, because the seeds for the random number generators that were used during the crawling session were different from the seeds that ReplayWeb.page used during replay (Figures 4 and 5).
Table 1: Random values in Google SafeFrame URLs generated during crawl time and replay time. Video: https://youtu.be/IzGMVmLyYGQ?t=2697
Figure 5: Demo web page shown during the replay session. The values generated differ from crawl time (Figure 4). WACZ | URI-R: https://treid003.github.io/random_Values_external_JS_with_async.html
Replay systems use rewriting tools like Wombat.js to overwrite the seed for the random number generators, which results in a more consistent replay where the random values generated should be the same each time the web page is replayed. The JavaScript code used to initialize Math.random() and crypto.getRandomValues() is shown in Listings 1 and 2. When initializing Math.random() (Listing 1), Wombat.js overwrites the random number generator’s seed (on line 7) with an expression that uses the time when the resource was archived. Wombat.js initializes crypto.getRandomValues() (Listing 2) by overwriting the function (lines 6-10). During the crawling session the seed is selected differently, because Math.random() and crypto.getRandomValues() have different implementations. For Math.random(), the seed selection is an “implementation-defined” strategy, so each implementation chooses its own approach without recommendations from the standard specification. For crypto.getRandomValues(), the W3C API specification states that the random number generator should be seeded with a high-quality entropy source like “/dev/urandom”, an operating system entropy source that returns random bytes from a pseudorandom number generator fed with environmental noise gathered from device drivers and other sources.
Listing 1: Wombat.js overriding the seed for Math.random() (WACZ)
Listing 2: Wombat.js overriding the crypto.getRandomValues() function (WACZ)
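The idea behind Listings 1 and 2 can be sketched as follows. This is a simplified illustration and not the Wombat.js source: the function names, the stand-in window object, the capture timestamp, and the linear congruential constants are all assumptions made for this example.

```javascript
// Sketch of a seeded replacement for Math.random().
// The linear congruential constants are an assumption chosen for illustration.
function installSeededRandom(win, seed) {
  let state = Math.floor(seed) % 233280;
  win.Math.random = function random() {
    state = (state * 9301 + 49297) % 233280;
    return state / 233280;
  };
}

// Sketch of a getRandomValues() replacement that fills the array from the
// seeded Math.random() instead of an operating system entropy source.
function installSeededCrypto(win) {
  win.crypto.getRandomValues = function getRandomValues(array) {
    for (let i = 0; i < array.length; i++) {
      array[i] = Math.floor(win.Math.random() * 4294967296);
    }
    return array;
  };
}

// Hypothetical stand-in for the window object of a replayed page.
function makeWindow(archiveTimestamp) {
  const win = { Math: Object.create(Math), crypto: {} };
  installSeededRandom(win, archiveTimestamp);
  installSeededCrypto(win);
  return win;
}

// Two replays seeded from the same capture time yield identical sequences.
const ts = 1675300000; // hypothetical capture time in seconds
const first = makeWindow(ts);
const second = makeWindow(ts);
const a = [first.Math.random(), first.Math.random()];
const b = [second.Math.random(), second.Math.random()];
const wordsA = first.crypto.getRandomValues(new Uint32Array(2));
const wordsB = second.crypto.getRandomValues(new Uint32Array(2));
console.log(a, b, wordsA, wordsB);
```

Seeding from the capture time makes repeated replays agree with each other, but it does not reproduce the values from the crawling session, where the browser chose its own seed.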
Loading Google ads in a containing web page (the web page that loaded the ad during the crawling session) is an example where overwriting the seed for the random number generators does not result in a consistent replay of the random numbers used when loading web resources. In this case, the random subdomain used in the Google SafeFrame URL can be different each time the archived web page is loaded (Table 2). Part of the reason that the Google SafeFrame subdomain differed on replay is that the SafeFrame is loaded when the ad slot is near the viewport, not immediately upon replay. Delaying the load of the ads can result in different timings in the network communications which “lead to a varying execution order and thus a different order of pop-requests from the ‘random’ number sequence” (Kiesel et al., “WASP: Web Archiving and Search Personalized”).
Table 2: When loading the ad iframe (Google SafeFrame) for a Google ad, the random value in the Google SafeFrame URL can be different each time the archived web page is replayed.
WARC | URI-R:https://mortalkombat.fandom.com/wiki/Tag_Team_Ladder
To test this case, we updated the demo web page to include two sections where the random values are dynamically generated based on user interaction (Figure 6). The first section that was added includes buttons that when clicked generate the random values (Figure 6a). The second section that was added generates random values when the user scrolls to that section of the web page (Figure 6b). We replayed the updated demo web page five times and changed the number of buttons clicked on each replay so that the number of function calls to Math.random() and crypto.getRandomValues() is different before reaching the last section. Having a different number of function calls to the random number functions before generating the last two Google SafeFrame URLs resulted in different random values in the subdomain for the URLs (shown in Table 3), which is similar to what happened in Table 2.
(a) Random values are generated after clicking a button
(b) Random values generated after scrolling to this part of the web page
Figure 6: New sections added to the demo web page that require user interaction to generate the random values.
Table 3: The random value in a Google SafeFrame URL can change on each replay when the number of function calls to Math.random() and crypto.getRandomValues() made before creating the SafeFrame URL differs. For this example, each button click resulted in two extra function calls to Math.random() and crypto.getRandomValues(). If the number of function calls to Math.random() and crypto.getRandomValues() is the same on each replay, then the random values generated will be the same, as in Table 4. (WACZ | URL: https://treid003.github.io/random_Values_external_JS_with_async.html)
When loading a Google ad, if multiple JavaScript files call Math.random() and crypto.getRandomValues() before the ad is loaded (e.g., when multiple ad services run on a web page), those calls change the position in the random number sequence, which causes the random value included in the Google SafeFrame URL to vary. If the number of function calls to the random number functions is the same, then the random values will be consistent, as in the first version of our demo web page, where replaying the archived web page multiple times resulted in the same random values (Table 4).
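The effect of earlier function calls on the generated value can be sketched with a small seeded generator. The generator constants, the seed, and the eight-character label format are assumptions made for illustration; they are not taken from Google's script or from Wombat.js.

```javascript
// Seeded generator with illustrative constants (an assumption for this sketch).
function seededRandom(seed) {
  let state = seed % 233280;
  return function random() {
    state = (state * 9301 + 49297) % 233280;
    return state / 233280;
  };
}

// Derives a subdomain-like label from the next value in the sequence.
// The label format is hypothetical.
function nextLabel(random) {
  return Math.floor(random() * 0xffffffff).toString(16).padStart(8, "0");
}

// Simulates one replay: the seed is fixed, but other scripts or button clicks
// may consume values from the sequence before the SafeFrame label is created.
function replay(extraCalls) {
  const random = seededRandom(20230202);
  for (let i = 0; i < extraCalls; i++) {
    random();
  }
  return nextLabel(random);
}

const noClicks = replay(0);
const noClicksAgain = replay(0);
const oneClick = replay(2); // one button click adds two calls in this sketch
console.log(noClicks, noClicksAgain, oneClick);
```

With a fixed seed the label depends only on how many values were consumed before it, which mirrors the pattern of stable values in Table 4 and varying values in Table 3.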
Table 4: Random values in Google SafeFrame URLs are the same when replaying the archived web page multiple times with the same number of function calls to the random number functions before creating the Google SafeFrames.
WACZ | URI-R: https://treid003.github.io/random_Values_external_JS_with_async.html
One takeaway from working on this example is that web archive replay systems cannot generate the same random value that was generated during crawl time. This problem will impact dynamically loaded web resources that use random values.
Amazon Ad iframe
Problem Overview: Amazon's ad service used a random number in the query string for the ad iframe’s URL and the ad bid URL. This results in unarchived URLs being generated during replay time, which can prevent a replay system from loading an ad.
Problem Description:
Amazon’s ad iframe also uses a random value in the iframe’s URL, but it is located in the query string instead of the subdomain (Figure 7). The presence of a random value in the query string results in the replay system generating an Amazon ad iframe URL that was not archived during crawl time. This can prevent an Amazon ad from loading during replay. To replay an Amazon ad, a crawler must archive both the embedded web page for the Amazon ad iframe and the ad loaded inside it. We encountered two challenges with replaying Amazon ads.
Figure 7: Example URI for an Amazon ad iframe. The rnd parameter in the query string contains a random value that is dynamically generated when loading an ad.
Pywb version 2.7.3 failed to replay some Amazon ads even when the ad resources were archived successfully. Ads that used Amazon’s iframe failed to replay because a URL for the ad bid contained incorrect ws and pid query string parameters (Figure 8). The values for these parameters were dynamically generated and differed during the crawling and replay sessions. In contrast, ReplayWeb.page is the only replay system that we have seen so far that can successfully replay this type of Amazon ad. Amazon’s ad iframe stores a random value in the rnd parameter of the URL’s query string. ReplayWeb.page’s approach to fuzzy matching made it possible to replay these Amazon ads even when the rnd parameter generated during replay differed from the one generated during crawl time (Figure 9). When a request made during replay has different query string parameters than the request made during crawl time, fuzzy matching is used to match it with a response that was captured during crawl time (Kiesel et al., “Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment”). A random value in the query string did not prevent ReplayWeb.page from loading an Amazon ad, but a random value in the subdomain of a Google ad’s URL did, because the random value is part of the subdomain rather than a query string parameter.
Figure 8: Pywb 2.7.3 failed to load Amazon ads, because an ad bid URL failed to load.
Figure 9: When replaying an Amazon ad iframe, the rnd parameter is not the same as the original value that is in the URI-R. Even though an incorrect URI-M is generated, ReplayWeb.page is able to load the ad. WACZ | URI-R: https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=...
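The difference between a random query parameter and a random subdomain can be sketched with a simplified matcher. This is not ReplayWeb.page's actual algorithm: the list of volatile parameters, the helper names, and the example URLs (including the b and rnd values and the SafeFrame host labels) are assumptions made for illustration.

```javascript
// Query parameters treated as volatile in this sketch (an assumption).
const VOLATILE_PARAMS = new Set(["rnd"]);

// Reduces a URL to a comparison key: host, path, and the non-volatile
// query parameters in sorted order.
function fuzzyKey(rawUrl) {
  const url = new URL(rawUrl);
  const kept = [...url.searchParams.entries()]
    .filter(([name]) => !VOLATILE_PARAMS.has(name))
    .sort(([x], [y]) => x.localeCompare(y))
    .map(([name, value]) => `${name}=${value}`);
  return `${url.host}${url.pathname}?${kept.join("&")}`;
}

// Returns the first captured URL whose key equals the requested URL's key.
function findCapture(captures, requested) {
  const key = fuzzyKey(requested);
  return captures.find((captured) => fuzzyKey(captured) === key) || null;
}

// Hypothetical crawl-time captures.
const captures = [
  "https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=EXAMPLE&rnd=111",
  "https://a1b2c3d4.safeframe.googlesyndication.com/safeframe/1-0-40/html/container.html",
];

// Replay-time requests with freshly generated random values.
const amazonMatch = findCapture(
  captures,
  "https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=EXAMPLE&rnd=999"
);
const safeFrameMatch = findCapture(
  captures,
  "https://9f8e7d6c.safeframe.googlesyndication.com/safeframe/1-0-40/html/container.html"
);
console.log(amazonMatch, safeFrameMatch);
```

Ignoring a volatile parameter lets the Amazon request fall back to its capture, while the SafeFrame request differs in the host name, which a query-string rule does not cover.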
Amazon ads’ use of random numbers in their iframe URL caused another problem. Multiple ads may use the same base ad iframe URL, but with different query strings. This prevents some of the ads from being shown during replay because of how ReplayWeb.page uses fuzzy matching. If multiple ad iframe URLs only differ by the query string, then the same ad will get selected and replayed when loading Amazon ads.
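The collision described above can be sketched under the assumption that the matcher falls back to comparing URLs without their query strings. The capture list, the ad names, and the query values are hypothetical, and a real replay system's matching rules may differ.

```javascript
// Hypothetical crawl-time captures of two different ads that share a base URL.
const adCaptures = [
  { url: "https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=AD_ONE&rnd=111", ad: "first ad" },
  { url: "https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=AD_TWO&rnd=222", ad: "second ad" },
];

// Base URL without the query string.
function baseOf(rawUrl) {
  const url = new URL(rawUrl);
  return url.origin + url.pathname;
}

// Fallback matcher that only compares base URLs, so the first capture
// with the same base is always returned.
function pickCapture(requested) {
  return adCaptures.find((capture) => baseOf(capture.url) === baseOf(requested)) || null;
}

// Two ad slots request different URLs during replay.
const slotOne = pickCapture("https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=AD_ONE&rnd=333");
const slotTwo = pickCapture("https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=AD_TWO&rnd=444");
console.log(slotOne.ad, slotTwo.ad);
```

Both slots resolve to the same capture in this sketch, so the second ad is never shown even though it was archived.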
Flashtalking
Problem Overview: When loading the URI for an embedded web page ad (outside of the ad iframe) that uses Flashtalking and another ad service (such as Google or Amazon), the JavaScript from Flashtalking’s service will dynamically generate a URL that does not exist, preventing the ad’s web resources from loading during replay and on the live web.
Problem Description:
Flashtalking’s ad service also used a dynamically generated URL that prevented the replay of an ad. We were unable to replay the embedded web page ads that we archived during 2023 outside the containing web page when the ad used both an ad iframe and Flashtalking’s ad service. In these cases, the JavaScript for Flashtalking’s ad service loads a web resource that was not archived. Figure 10 shows an example of an embedded web page ad outside of its ad iframe. The error message shown in Figure 10 is associated with an incorrect resource being loaded that prevents the ad from being replayed.
The URI-R highlighted on the right side of Figure 10:
https://cdn.flashtalking.com/richLoads/300x600_Master_Richload/index.html
is not associated with the current ad. The correct URI-R that should have been loaded is:
https://cdn.flashtalking.com/173980/300x600_Master_Richload_Compressed/index.html
which includes the ad id (173980).
The Richload URI includes the ad id when replaying the embedded web page ad in an Amazon ad iframe, which enables replay of the other ad resources (Figure 11). However, even if we try to access this web page ad outside of the ad iframe on the live web, the web page will use an incorrect Richload URI and the ad will not load (Figure 12).
Figure 10: Replaying a successfully archived embedded web page ad outside of its ad iframe. This ad uses Flashtalking and Amazon ad services.
WACZ | URI-R:https://cdn.flashtalking.com/173980/4163777/index.html
Figure 11: This embedded web page ad will successfully replay when it is loaded inside of an Amazon ad iframe. The correct Richload URI will be loaded when the ad is in the iframe. This ad uses Flashtalking and Amazon ad services. WACZ | Web page for ad iframe URI-R: https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=JAkcZL99KtnLJUwYJ8dHHdIAAAGGLy8dDAEAAAxWAQBhcHNfdHhuX2JpZDEgICBOL0EgICAgICAgICAgICA_KuR0&rnd=8954498773591675828862700&pp=1wff280&p=e44jk0&crid=arcgnw6w
| Richload URI-R: https://cdn.flashtalking.com/173980/300x600_Master_Richload_Compressed/index.html
Figure 12: The live web version of this embedded web page ad also fails to load outside of the web page used for the ad iframe.
URI-R: https://cdn.flashtalking.com/173980/4163777/index.html
Replay of An Ad Can Be Different Based On The Web Browser Used
Problem Overview: During 2023, we used ReplayWeb.page to replay the same archived web page in Firefox and Chrome, but ads failed to load when using Chrome. This problem occurred because the browsers’ service worker implementations differed: in Chrome, the service worker was not able to access the resources inside of an iframe that used “about:blank” for the src attribute.
Problem Description:
The last replay problem involved a replay system that used service workers. In January of 2023, we archived and replayed a web page (https://www.scmp.com/news/china/society/article/3049489/...) that included an ad whose successful replay depended upon the web browser used to load the archived web page (Figure 13). The replay system for this example (ReplayWeb.page) used service workers. When replay systems use service workers, the replay of an archived web page can differ depending upon a browser’s implementation of those service workers. For this example, we observed this problem when an ad used an iframe with a src attribute value of “about:blank”.
(a) Replay session using Firefox
(b) Replay session using Chrome
Figure 13: The replay of an ad differed depending on the web browser used.
WACZ | URI-R: https://www.scmp.com/news/china/society/article/3049489/...
| Video of replay session: https://www.youtube.com/watch?v=gCW15i-5teQ
Video 2: The replay of an ad differed depending on the web browser used.
When we used ReplayWeb.page with Firefox version 109.0, the image ad loaded (Figure 13a), but the ad failed to load when using Chrome version 109.0.5414 (Figure 13b). After identifying this problem, we created an issue on ReplayWeb.page’s GitHub repository. One of the comments that addressed this problem mentioned that the service worker did not gain control of the ad iframe, which led to leaked requests. These leaked requests resulted in a 404 status code during replay for a resource that had been successfully archived during crawl time. There is a Chromium bug related to this issue, where the service worker is not able to access the resources loaded in an “about:blank” iframe. A ReplayWeb.page update fixed this by overriding the document.write() method with a blob URL created by the service worker.
Summary
We identified three replay problems with loading web ads. The first problem was caused by the JavaScript used by Google's and Amazon's ad services, which dynamically generated a random value used in the URL for an ad's iframe. Because the seed for a random number generator is often itself set randomly, the seed had a different value at crawl time and replay time. The second problem was caused by the Flashtalking ad service’s JavaScript when replaying an ad outside of its ad iframe. The last replay problem occurred because of a Chromium bug that prevented the replay system from accessing the resources loaded in an “about:blank” iframe.
--Travis Reid (@TReid803)