2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax making a request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (BrowsertrixWebRecorder.ioCrawlJAX) but these are slightly outside the scope of what we want to do.  We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher fidelity archives a page-at-a-time; this is an example of an implementation of Selenium, but does not match our goal of automatically running. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), and matches our goal of replaying interactions; WebRecorder.io is another appropriate use-case for Selenium, but does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automatic Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript. As such, it provides a tight integration between the loaded page and its DOM and the code. This allows code to be easily directly injected into the target page, and native DOM interaction to be performed. As such, PhantomJS provides a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are several APIs including Java, Python, Perl, etc. that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as does PhantomJS. However, it provides better utilities for automated action through mouse movements.

Based on our experimentation, Selenium is a better tool for canned interaction. For example, a pre-scripted set of clicks, drags, etc. A summary of the differences between PhantomJS, Selenium, and VisualEvent (to be explored later in this post) is presented in the below table. Note that our speed testing is based on brief observation and should be used as a relative comparison rather than a definitive measurement.

OperationHeadlessFull-BrowserJavaScript bookmarklet and code
Speed (seconds)2.5-84-10< 1 (on user click)
DOM IntegrationClose integration3rd partyClose integration/embedded
DOM Event ExtractionSemi-reliableSemi-reliable100% reliable
DOM InteractionScripted, native, on-demandScriptedNone

To summarize, PhantomJS is faster (because it's headless), and more closely integrated with the DOM than Selenium (because it loads a full browser). PhantomJS is more closely coupled with the browser, DOM, and the client-side events than Selenium. However, by using a native browser, Selenium defers the responsibility of keeping up with advances of web technologies such as JavaScript to the browser rather than maintain the responsibility within the archival tool. This will prove to be beneficial as JavaScript, HTML5, and other client-side programming languages evolve and emerge.

Sources online (e.g., Stack OverflowReal PythonVilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded and pre-scripted events be performed using Selenium and adaptive or extracted events be performed in PhantomJS.

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test  (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i, which is the user interaction test but pass all others. This indicates that both Selenium and PhantomJS have difficulty in identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop downs).
Fig 6. The Acid Test is identical for PhantomJS and Selenium, failing the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events, not an archival utility, but can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of the DOM Event Extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent has a reverse approach to discovering the event handlers attached to DOM elements. Our approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover the attached event handlers. VisualEvent starts with the JavaScript, gets all of the JavaScript functions and understands which DOM elements reference those functions and determines whether these are event handlers. VisualEvent then displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) visually through an overlay in the browser. We removed the visual aspects and leverage the JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive elements are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract interactive elements, and use PhantomJS to interact with those interactive elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture.Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop downs menus on KBB.com.

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case: automatically identifying a set of DOM interactions; other experiment conditions and goals may be better suited for Selenium and other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but hope that our experiences are beneficial for others.

--Justin F. Brunelle


  1. Hi Justin,

    Thanks for mentioning WebRecorder and Browsertrix. I think there are several different
    areas here to be addressed, and several tools will need to work together.

    WebRecorder is specifically designed for user-driven high-fidelity on-demand archiving and should capture interactive content.

    Browsertrix is a brand new tool for automatic browsers to browse through an archiving backend, such as WebRecorder. Thus far, I've only had time to implement single page
    archiving with no behaviors, the simplest use case. The other key parts needed to make Browertrix a full-fledged crawler are: link extraction and page behaviors.

    I have not yet had a chance to implement either, but something I'd like to do soon.

    For link extraction, that would entail extracting other links from page to discover
    new pages to be browsed, and can be done with document.querySelectorAll()
    and variations, and follow specific restrictions, such as scope, hops, regex restrictions etc.. much like a traditional crawler.

    For page behaviors is where things get complicated/interesting. I was thinking of
    using canned behaviors, similar to what Umbra does, but VisualEvent may be
    another option. I will take a look at this tool to see if it can be integrated into Browsertrix.

    As for Selenium, I think it is especially good for running different browsers
    through a common interface. I have seen many cases where Chrome and Firefox
    fetch different urls (think streaming video), so that it may be needed to archive
    a page in both to get a complete capture. PhantomJS of course is a lot faster
    on its own, but is not a 'real' user-facing browser. I was considering adding it
    to Browsertrix, either directly or through Selenium.

    If you'd like to contribute to Browsertrix, let me know, as I think one of the next steps
    there is the page behavior component, and VisualEvent may well be the solution.

  2. Ilya:
    Thanks for your comment, and I agree with your input 100%. We have not looked at the differences between how various browsers handle deferred representations and streaming content, but this would be an excellent area of future investigation for our research.

    In my opinion, Selenium is perfect for Browsertrix (or any other similar service) due to its ability to take a set of interactions or events (recorded from users or "canned") and monitor the representation.

    VisualEvent does an excellent job at identifying event handlers as well as the DOM elements to which they are attached. AI found that a simple and easy way to make use of the code is to -- as I mentioned -- remove the visual aspects of the code and poll the DOM every few milliseconds to see when events have finished firing. We are still working on this workflow, but it shows promise.


Post a Comment