2014-06-18: Google and JavaScript
In this blog post, we detail three short tests in which we challenge the Google crawler's ability to index JavaScript-dependent representations. After an introduction to the problem space, we describe our three tests as introduced below.
- String and DOM modification: we modify a string and insert it into the DOM. Without the ability to execute JavaScript on the client, the string will not be indexed by the Google crawler.
- Anchor Tag Translation: we decode an encoded URI and add it to the DOM using JavaScript. The Google crawler should index the decoded URI after discovering it from the JavaScript-dependent representation.
- Redirection via JavaScript: we use JavaScript to build a URI and redirect the browser to the newly built URI. The Google crawler should be able to index the resource to which JavaScript redirects.
Introduction
JavaScript continues to create challenges for web crawlers run by web archives and search engines. To summarize the problem, our web browsers are equipped with the ability to execute JavaScript on the client, while crawlers commonly do not have the same ability. As such, content created -- or requested, as in the case of Ajax -- by JavaScript is often missed by web crawlers. We discuss this problem and its impacts in more depth in our TPDL '13 paper.
Archival institutions and search engines are attempting to mitigate the impact JavaScript has on their archival and indexing effectiveness. For example, Archive-It has integrated Umbra into its archival process in an effort to capture representations dependent upon JavaScript. Google has also announced that its crawler will index content created by JavaScript. There is evidence that Google's crawler has been able to index JavaScript-dependent representations in the past, but the company has now announced a commitment to improve the capability and use it more widely.
We wanted to investigate how well the Google solution could index JavaScript-dependent representations. We created a set of three extremely simple tests to gain some insight into how Google's crawler operated.
Test 1: String and DOM Modification
To challenge the Google crawler in our first test, we constructed a test page with an MD5 hash string "1dca5a41ced5d3176fd495fc42179722" embedded in the Document Object Model (DOM). The page includes a JavaScript function that changes the hash string by performing a ROT13 translation on page load. The function overwrites the initial string with the ROT13-translated string "1qpn5n41prq5q3176sq495sp42179722".
Before the page was published, both hash strings returned 0 results when searched in Google. Now, Google returns a result for the ROT13-translated string that JavaScript wrote into the DOM (1qpn5n41prq5q3176sq495sp42179722) but not for the original string (1dca5a41ced5d3176fd495fc42179722). The Google crawler passed this test: it accurately crawled and indexed this JavaScript-dependent representation.
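The translation step in this test can be sketched in a few lines of JavaScript. This is a minimal illustration rather than the code from the actual test page: the helper name `translateElement` and the element id `hash` are assumptions made for the example.

```javascript
// ROT13: rotate each ASCII letter 13 places; digits and punctuation pass through.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // 65 is "A", 97 is "a"
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// Hypothetical helper: overwrite an element's text with its ROT13 translation.
function translateElement(element) {
  element.textContent = rot13(element.textContent);
}

// Only attach the handler when running in a browser.
// The element id "hash" is an assumption, not taken from the test page.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    translateElement(document.getElementById("hash"));
  });
}
```

A client that does not execute the load handler sees only the original string in the markup, which is why the two strings make a useful signal for whether JavaScript was run.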
Test 2: Anchor Tag Translation
Continuing our investigation with a second test, we wanted to determine whether Google could discover a URI to add to its frontier if the anchor tag is generated by JavaScript and only inserted into the DOM after page load. We constructed a page that uses JavaScript to ROT13 decode the string "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy" to get a decoded URI. The JavaScript inserts an anchor tag linking to the decoded URI. This test evaluates whether the Google crawler will extract the URI from the anchor tag after JavaScript performs the insertion or if the crawler only indexes the original DOM before it is modified by JavaScript.
The representation of the resource identified by the decoded URI contains the MD5 hash string "75ab17894f6805a8ad15920e0c7e628b". At the time of this blog posting's publication, this string returned 0 results in Google. To protect our experiment from contamination (i.e., linking to the resource from a source other than the JavaScript-reliant page), we will not post the URI of the hidden resource in this blog.
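As a rough sketch of the mechanism, the page's script might decode the string and append an anchor element as shown here. The helper name `insertDecodedLink`, the link text, and the choice to pass the document in as a parameter are all assumptions for illustration; the actual test page may be written differently.

```javascript
// ROT13: rotate each ASCII letter 13 places; other characters pass through.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // 65 is "A", 97 is "a"
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// Hypothetical helper: decode a ROT13-encoded URI and append a link to it.
// `doc` is any object offering the DOM's createElement and body.appendChild.
function insertDecodedLink(doc, encodedUri) {
  const anchor = doc.createElement("a");
  anchor.href = rot13(encodedUri); // the decoded URI exists only after this runs
  anchor.textContent = "decoded link"; // link text is an assumption
  doc.body.appendChild(anchor);
  return anchor;
}

// Only attach the handler when running in a browser.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    insertDecodedLink(document, "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy");
  });
}
```

Because the decoded URI never appears as a literal in the page source, a crawler that only parses the delivered markup has no anchor to extract.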
Test 3: Redirection via JavaScript
In a third test, we created two pages. The first, called "Google Test Page 1", was linked from my homepage. This page has the MD5 hash string "d41d8cd98f00b204e9800998ecf8427e" embedded in the DOM. When the page finishes loading in the browser and its onload event fires, a JavaScript function changes the original hash string to a new hash string, "4e4eb73eaad5476aea48b1a849e49fb3". After the DOM is changed, JavaScript constructs a URI string and redirects the browser to the second page.
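The sequence described here (change the hash, build a URI, redirect) could look roughly like the following. This is a sketch under stated assumptions: the helper names, the element id `hash`, and the `http://example.com` target are placeholders, not the values used on the real test pages.

```javascript
// Hypothetical helper: assemble the redirect target from parts at run time,
// so the full URI does not appear as a single literal in the page source.
function buildRedirectUri(origin, pathParts) {
  return origin + "/" + pathParts.join("/");
}

// Swap the hash text, then send the browser to the constructed URI.
// `win` and `element` are parameters so the logic can run outside a browser.
function swapHashAndRedirect(win, element, newHash, targetUri) {
  element.textContent = newHash;
  win.location.href = targetUri;
}

// Only attach the handler when running in a browser.
if (typeof window !== "undefined") {
  window.onload = () => {
    // Placeholder target; the real second page's URI is not given here.
    const target = buildRedirectUri("http://example.com", ["tests", "page2.html"]);
    swapHashAndRedirect(
      window,
      document.getElementById("hash"), // element id is an assumption
      "4e4eb73eaad5476aea48b1a849e49fb3",
      target
    );
  };
}
```

Assigning to `location.href` triggers a navigation in the browser, so a crawler has to execute the onload handler to learn where the first page leads.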