Wednesday, June 18, 2014

2014-06-18: Google and JavaScript


In this blog post, we detail three short tests in which we challenge the Google crawler's ability to index JavaScript-dependent representations. After an introduction to the problem space, we describe our three tests as introduced below.
  1. String and DOM modification: we modify a string and insert it into the DOM. Without the ability to execute JavaScript on the client, the string will not be indexed by the Google crawler.
  2. Anchor Tag Translation: we decode an encoded URI and add it to the DOM using JavaScript. The Google crawler should index the decoded URI after discovering it from the JavaScript-dependent representation.
  3. Redirection via JavaScript: we use JavaScript to build a URI and redirect the browser to the newly built URI. The Google crawler should be able to index the resource to which JavaScript redirects.

Introduction

JavaScript continues to create challenges for web crawlers run by web archives and search engines. To summarize the problem, our web browsers are equipped with the ability to execute JavaScript on the client, while crawlers commonly do not have the same ability. As such, content created -- or requested, as in the case of Ajax -- by JavaScript is often missed by web crawlers. We discuss this problem and its impacts in more depth in our TPDL '13 paper.

Archival institutions and search engines are attempting to mitigate the impact JavaScript has on their archival and indexing effectiveness. For example, Archive-It has integrated Umbra into its archival process in an effort to capture representations dependent upon JavaScript. Google has announced that its crawler will index content created by JavaScript, as well. There is evidence that Google's crawler has been able to index JavaScript-dependent representations in the past, but Google has announced a commitment to improve the capability and use it more widely.

We wanted to investigate how well the Google solution could index JavaScript-dependent representations. We created a set of three extremely simple tests to gain some insight into how Google's crawler operated.

Test 1: String and DOM Modification

To challenge the Google crawler in our first test, we constructed a test page with an MD5 hash string "1dca5a41ced5d3176fd495fc42179722" embedded in the Document Object Model (DOM). The page includes a JavaScript function that changes the hash string by performing a ROT13 translation on page load. The function overwrites the initial string with the ROT13 translated string "1qpn5n41prq5q3176sq495sp42179722".
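As a rough illustration of the translation described above, a ROT13 function of this kind can be sketched as follows. This is a minimal sketch and not the code used on the test page: the function name rot13 and the element id "hash" are assumptions made for illustration, and the DOM update is guarded so the snippet also runs outside a browser.

```javascript
// Hypothetical helper: rotate each ASCII letter 13 places. Digits and
// punctuation pass through unchanged.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of "A" or "a"
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

const original = "1dca5a41ced5d3176fd495fc42179722";
const translated = rot13(original);

// In a browser, overwrite the string in the DOM once the page has loaded.
// The element id "hash" is an assumption, not taken from the test page.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", () => {
    const el = document.getElementById("hash");
    if (el) el.textContent = translated;
  });
}

console.log(translated); // 1qpn5n41prq5q3176sq495sp42179722
```

A crawler that does not execute JavaScript would only ever see the original string in the delivered HTML, which is why searching for the translated string is a useful signal.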

Before the page was published, both hash strings returned 0 results when searched in Google. Now, Google shows the result of the JavaScript ROT13 translation that was embedded in the DOM (1qpn5n41prq5q3176sq495sp42179722) but not the original string (1dca5a41ced5d3176fd495fc42179722). The Google crawler passed this test: it accurately crawled and indexed this JavaScript-dependent representation.

Test 2: Anchor Tag Translation

Continuing our investigation with a second test, we wanted to determine whether Google could discover a URI and add it to its frontier when the anchor tag is generated by JavaScript and inserted into the DOM only after page load. We constructed a page that uses JavaScript to ROT13 decode the string "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy" to get a decoded URI. The JavaScript then inserts an anchor tag linking to the decoded URI. This test evaluates whether the Google crawler extracts the URI from the anchor tag after JavaScript performs the insertion, or whether the crawler only indexes the original DOM before it is modified by JavaScript.
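The decode-and-insert step might look roughly like the sketch below. It is an illustration under assumptions, not the page's actual source: the names rot13 and insertHiddenLink are hypothetical, and the surrounding text and link text are taken from the description of the test page. The document object is passed in as a parameter so the function can be exercised outside a browser.

```javascript
// Hypothetical ROT13 helper. ROT13 is its own inverse, so the same
// function both encodes and decodes.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97;
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

const encoded = "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy";

// Build the anchor with DOM calls so the link never appears in the HTML
// that the server delivers. The function name is an assumption.
function insertHiddenLink(doc, href) {
  const anchor = doc.createElement("a");
  anchor.href = href;
  anchor.textContent = "HIDDEN!";
  const paragraph = doc.createElement("p");
  paragraph.textContent = "The deep web link is: ";
  paragraph.appendChild(anchor);
  doc.body.appendChild(paragraph);
  return anchor;
}

// In a browser, run the insertion only after the page has loaded.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", () => insertHiddenLink(document, rot13(encoded)));
}
```

Because the decoded URI exists only after the script runs, a crawler can add it to its frontier only if it both executes the JavaScript and extracts links from the modified DOM.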

The representation of the resource identified by the decoded URI contains the MD5 hash string "75ab17894f6805a8ad15920e0c7e628b". At the time of this blog posting's publication, this string returned 0 results in Google. To protect our experiment from contamination (i.e., linking to the resource from a source other than the JavaScript-reliant page), we will not post the URI of the hidden resource in this blog.


The text surrounding the anchor tag is "The deep web link is: " followed by the anchor tag with the target being the decoded URI and the text of "HIDDEN!". If we search for the text surrounding the anchor tag, we receive a single result which includes the link to the decoded URI. However, at the time of this blog posting's publication, the Google crawler had not discovered the hidden resource identified by the decoded URI. It appears Google's crawler is not extracting URIs for its frontier from JavaScript-reliant resources.

Test 3: Redirection via JavaScript

In a third test, we created two pages. The first, called "Google Test Page 1", was linked from my homepage. This page has an MD5 hash string, "d41d8cd98f00b204e9800998ecf8427e", embedded in the DOM.

A JavaScript function changes the hash string to "4e4eb73eaad5476aea48b1a849e49fb3" when the page's onload event fires; that is, once the page finishes loading in the browser, the original hash string is replaced with a new one. After the DOM is changed, JavaScript constructs a URI string and redirects the browser to another page.



The redirect logic includes an impossible case (1==0 always evaluates to "false") in which the redirect URI is testerpage1.php. This page does not exist. We put in this false URI to try to trick the Google crawler into indexing a page that never existed. (Google was not fooled.)

In the case that actually executes, JavaScript constructs the URI of testerpage2.php, a page that has the hash string "13bbd0f0352dc9f61f8a3d8b015aef67" embedded in the DOM. Prior to this blog post, this page was not linked from anywhere, so Google could not discover it without executing the JavaScript redirect embedded in Google Test Page 1. When we first searched for the hash string, Google returned 0 results.
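The redirect logic described above can be sketched along the following lines. This is a hedged reconstruction rather than the page's real source: the function names, the element id "hash", and the way the URI string is assembled are all assumptions. Only the 1==0 condition, the two page names, and the replacement hash string come from the description of the test.

```javascript
// Hypothetical reconstruction: build the redirect target as a string.
function buildRedirectUri() {
  const base = "testerpage"; // assumed way of assembling the URI
  if (1 == 0) {
    // Impossible branch: this page does not exist and is never requested.
    return base + "1.php";
  }
  return base + "2.php";
}

// Swap the hash string in the DOM first, then redirect the browser.
function onPageLoad(win, doc) {
  const el = doc.getElementById("hash"); // element id is an assumption
  if (el) el.textContent = "4e4eb73eaad5476aea48b1a849e49fb3";
  win.location.href = buildRedirectUri();
}

// In a browser, wire the handler to the load event.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener("load", () => onPageLoad(window, document));
}
```

A crawler that merely scans the script text for URI-like strings could be tempted by the unreachable branch, whereas a crawler that executes the script would only ever request testerpage2.php.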

testerpage2.php also writes to a text file whenever the page is loaded. We waited for a string to appear in the text file, indicating that the page had been loaded. After that point, when we searched Google for the hash string in testerpage2.php, we received a result that shows the content and hash of testerpage2.php, but shows the URI of the original Google Test Page 1.


While some may argue that the search result in our third test should show the URI of testerpage2.php, Google has chosen to provide the original URI rather than the URI of the redirect target.

Conclusion

This very simple test set shows that Google is effectively executing JavaScript and indexing the resulting representation. However, the crawler is not expanding its frontier to include URIs that are generated by JavaScript. In all, Google shows that crawling resources reliant on JavaScript is possible at Web scale, but more work remains to properly crawl all JavaScript-reliant representations.

--Justin F. Brunelle
