Friday, June 26, 2015

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.
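
To make the limitation concrete, here is a minimal sketch (not Heritrix's actual extraction code) of literal-URI scanning, showing how a URI assembled at runtime escapes discovery. All names and URIs in the sketch are hypothetical:

```javascript
// A crawler that only scans script text for literal URIs finds complete
// strings but not URIs constructed by concatenation at runtime.
function extractLiteralUris(scriptText) {
  // Grab anything that looks like an absolute URI inside the script text.
  return scriptText.match(/https?:\/\/[^"'\s]+/g) || [];
}

const script = `
  var api = "https://example.com/api/models";        // discoverable
  var host = "https://example.com/";
  var target = host + "ajax/years?make=" + make;     // invisible to a static scan
`;

console.log(extractLiteralUris(script));
// [ 'https://example.com/api/models', 'https://example.com/' ]
// The constructed "ajax/years" URI never appears as a literal, so a
// non-executing crawler never requests it.
```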

For example, the Kelley Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data that populates the "Model" and "Year" drop-down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of the page (Figure 5), we see that the drop-down menus are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate them.

Fig 5. The memento is not completely functional due to its reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher-fidelity archives a page at a time; this is an example implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), which matches our goal of replaying interactions; it is another appropriate use case for Selenium, but it does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automatic Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility scripted in JavaScript, which provides tight integration between the code and the loaded page's DOM. This allows code to be injected directly into the target page and native DOM interactions to be performed. As a result, PhantomJS provides a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.
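
A sketch of the kind of in-page scan a PhantomJS script might run via page.evaluate() illustrates the limitation; the element objects below are stand-ins for real DOM nodes, not PhantomJS API objects:

```javascript
// Handlers assigned via on* properties are visible to a property scan;
// handlers registered with addEventListener leave no inspectable property,
// which is why the onchange events on the KBB drop-downs go undetected.
function findDeclaredHandlers(elements) {
  const found = [];
  elements.forEach(function (el) {
    ['onclick', 'onchange'].forEach(function (evt) {
      if (typeof el[evt] === 'function') {
        found.push({ tag: el.tag, event: evt });
      }
    });
  });
  return found;
}

const mockDom = [
  { tag: 'button', onclick: function () {} },           // discoverable
  { tag: 'select' /* onchange via addEventListener */ } // invisible
];

console.log(findDeclaredHandlers(mockDom));
// Only the button's onclick is reported.
```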

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. Several language bindings (Java, Python, Perl, etc.) can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automating actions such as mouse movements.

Based on our experimentation, Selenium is a better tool for canned interactions: for example, a pre-scripted set of clicks, drags, etc. A summary of the differences between PhantomJS, Selenium, and VisualEvent (to be explored later in this post) is presented in the table below. Note that our speed measurements are based on brief observation and should be read as a relative comparison rather than a definitive benchmark.

Operation            | PhantomJS (headless)         | Selenium (full browser) | VisualEvent (JavaScript bookmarklet and code)
Speed (seconds)      | 2.5-8                        | 4-10                    | < 1 (on user click)
DOM Integration      | Close integration            | 3rd party               | Close integration/embedded
DOM Event Extraction | Semi-reliable                | Semi-reliable           | 100% reliable
DOM Interaction      | Scripted, native, on-demand  | Scripted                | None
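
Such a canned interaction can be sketched as a pre-scripted sequence of steps. The driver below is a recording stub standing in for a real Selenium binding (e.g., selenium-webdriver), and the selectors and values are hypothetical:

```javascript
// A "canned" interaction of the kind Selenium excels at: a fixed,
// pre-scripted sequence replayed verbatim against the page.
const cannedSteps = [
  { action: 'click',  selector: '#make' },
  { action: 'select', selector: '#make',  value: 'Honda' },
  { action: 'select', selector: '#model', value: 'Civic' }
];

function replay(driver, steps) {
  steps.forEach(function (step) { driver.perform(step); });
  return driver.log;
}

// Stub driver that records actions instead of driving a real browser.
const stubDriver = {
  log: [],
  perform: function (step) { this.log.push(step.action + ' ' + step.selector); }
};

console.log(replay(stubDriver, cannedSteps));
// [ 'click #make', 'select #make', 'select #model' ]
```

The point of the sketch is that every step is known a priori; nothing is discovered from the page at run time, which is exactly where Selenium is strongest.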

To summarize, PhantomJS is faster (because it is headless) and more closely coupled with the browser, DOM, and client-side events than Selenium (which loads a full browser). However, by using a native browser, Selenium defers the responsibility of keeping up with advances in web technologies such as JavaScript to the browser rather than keeping that responsibility within the archival tool. This will prove beneficial as JavaScript, HTML5, and other client-side technologies evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded, pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i (the user interaction test) but pass all others. This indicates that both Selenium and PhantomJS have difficulty identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the drop-down menus).
Fig 6. The acid test result is identical for PhantomJS and Selenium, with both failing the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events, not an archival utility, but it can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of DOM event extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes a reverse approach to discovering the event handlers attached to DOM elements. Our original approach -- which proved ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover their attached event handlers. VisualEvent instead starts with the JavaScript: it enumerates all of the JavaScript functions, determines which DOM elements reference those functions, and decides whether those functions are event handlers. VisualEvent then visually displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) through an overlay in the browser. We removed the visual aspects and leveraged the JavaScript functions to extract the interactive elements of the page.
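
The contrast between the two directions can be sketched with stand-in data structures; these are illustrative only and are not VisualEvent's actual internals:

```javascript
// Reverse direction: instead of probing each element for handlers, start
// from the known JavaScript functions and work out which elements
// reference them, marking those elements as interactive.
function mapHandlersToElements(functions, elements) {
  const interactive = [];
  elements.forEach(function (el) {
    const hits = functions.filter(function (fn) {
      return el.listeners && el.listeners.indexOf(fn.name) !== -1;
    });
    if (hits.length > 0) {
      interactive.push({
        tag: el.tag,
        handlers: hits.map(function (f) { return f.name; })
      });
    }
  });
  return interactive;
}

const fns = [{ name: 'updateModels' }, { name: 'unusedHelper' }];
const els = [
  { tag: 'select', listeners: ['updateModels'] }, // interactive
  { tag: 'div' }                                  // no handlers
];

console.log(mapHandlersToElements(fns, els));
// [ { tag: 'select', handlers: [ 'updateModels' ] } ]
```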

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive element are pulled from the JavaScript and displayed on the page as well.

We use PhantomJS to inject the VisualEvent code into a page, extract its interactive elements, and then interact with those elements. This discovers client-side states that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop-down menus on the Kelley Blue Book page.

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case (automatically identifying a set of DOM interactions); other experimental conditions and goals may be better suited to Selenium and other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.

--Justin F. Brunelle

2015-06-26: JCDL 2015 Doctoral Consortium

Mat Kelly attended and presented at the JCDL 2015 Doctoral Consortium. This is his report.                           

Evaluating progress between milestones in a PhD program is difficult due to the inherent open-endedness of research. A means of evaluating whether a student's topic is sound and has merit while still early on in his career is to attend a doctoral consortium. Such an event, as the one held at the annual Joint Conference on Digital Libraries (JCDL), has previously provided a platform for WS-DL students (see 2014, 2013, 2012, and others) to network with faculty and researchers from other institutions as well as observe the approach that other PhD students at the same point in their career use to explain their respective topics.

As the wheels have turned, I have shown enough progress in my research for it to be suitable for preliminary presentation at the 2015 JCDL Doctoral Consortium, and did so this past Sunday in Knoxville, Tennessee. Along with seven other graduate students from universities throughout the world, I gave a twenty-minute presentation with ten to twenty minutes of feedback from an audience of the other presenting graduate students, faculty, and researchers.

Kazunari Sugiyama of the National University of Singapore (where Hany SalahEldeen recently spent a semester as a research intern) welcomed everyone and briefly described the format of the consortium before getting underway.

The Presentations

The presentations were broken up into four topical categories. In the first section, "User's Relevance in Search", Sally Jo Cunningham introduced the two upcoming speakers. Sampath Jayarathna (@OpenMaze) of Texas A&M University was the first presenter of the day with his topic, "Unifying Implicit and Explicit Feedback for Multi-Application User Interest Modeling". In his research, he asked users to type short queries, which he used to investigate methods for search optimization. He asked, "Can we combine implicit and semi-explicit feedback to create a unified user interest model based on multiple everyday applications?" Using a browser-based annotation tool, users in his study were able to provide relevance feedback on the search results via explicit and implicit feedback. One of his hypotheses is that, given a user model, he should be able to compare the model against the explicit feedback the user provides to deliver better relevance of results.

After Sampath, Kathy Brennan (@knbrennan) of the University of North Carolina presented her topic, "User Relevance Assessment of Personal Finance Information: What is the Role of Cognitive Abilities?". In her presentation she alluded to the similarities between buying a washer and dryer and obtaining a mortgage with respect to being indicators of a person's cognitive abilities. "Even for really intelligent people, understanding prime and subprime rates can be a challenge," she said. One study she described analyzed rounding behavior, with stock prices as an example of the critical details observed by an individual. By psychometrically testing 69 different abilities as users analyzed documents for relevance, she found that someone with lower cognitive abilities has a lower threshold for relevance and thus marks more documents as relevant than those with higher cognitive abilities. "However," she said, "those with a higher cognitive ability were doing a lot more in the same amount of time as those with lower cognitive abilities."

After a short coffee break, Richard Furuta of Texas A&M University introduced the two speakers of the second session, titled "Analysis and Construction of Archive". Yingying Yu of Dalian Maritime University presented first in this session with "Simulate the Evolution of Scientific Publication Repository via Agent-based Modeling". In her research, she is seeking candidate co-authors for academic publications based on a model that includes venue, popularity, and author importance as a partial set of parameters. "Sometimes scholars only focus on homogeneous networks," she said.

Mat Kelly (@machawk1, your author) presented second in the session with "A Framework for Aggregating Private and Public Web Archives". In my work, I described the issues of integrating private and public web archives with respect to access restrictions, privacy issues, and other concerns that would arise were the archives' results to be aggregated.

The conference then broke for boxed lunch and informal discussions amongst the attendees.

After resuming sessions after the lunch break, George Buchanan (@GeorgeRBuchanan) of City University of London welcomed everybody and introduced the two speakers of the third session of the day, "User Generated Contents for Better Service".

Faith Okite-Amughoro (@okitefay) of the University of KwaZulu-Natal presented her topic, "The Effectiveness of Web 2.0 in Marketing Academic Library Services in Nigerian Universities: a Case Study of Selected Universities in South-South Nigeria". Faith noted that there has not been any assessment of how the libraries in her region of study have used Web 2.0 to market their services. "The real challenge is not how to manage their collection, staff and technology," she said, "but to turn these resources into services." She found that the most-used Web 2.0 tools were social networking, video sharing, blogs, and generally places where users could add content themselves.

Following Faith, Ziad Matni (@ziadmatni) of Rutgers University presented his topic, "Using Social Media Data to Measure and Influence Community Well-Being". Ziad asked, "How can we gauge how well people are doing in their local communities through the data that they generate on social media?" He is currently looking for useful measures of the components of community well-being and their relationships with collective feelings of stress and tranquility (as defined in his work). He hopes to focus on one or two social indicators and to understand the factors that correlate the sentiment expressed on social media with a geographical community's well-being.

After Ziad's presentation, the group took a coffee break then started the last presentation session of the day, "Mining Valuable Contents". Kazunari Sugiyama (who welcomed the group at the beginning of the day) introduced the two speakers of the session.

The first presentation in this session was from Kahyun Choi of the University of Illinois at Urbana-Champaign, who presented her work, "From Lyrics to Their Interpretations: Automated Reading between the Lines". In her work, she is trying to find the source of subject information in songs, on the assumption that machines may have difficulty analyzing songs' lyrics. She has three general research questions: the first relates lyrics to their interpretations, the second asks whether topic modeling can discover the subject of the interpretations, and the third concerns reliably obtaining the interpretations from the lyrics. She is training and testing a subject classifier, for which she collected lyrics and their interpretations from a song-interpretation website. From this she obtained eight subject categories: religion, sex, drugs, parents, war, places, ex-lover, and death. With 100 songs in each category, she assigned each song only one subject. She then used the top ten interpretations per song to prevent the results from being skewed by songs with a large number of interpretations.

The final group presentation of the day was to come from Mumini Olatunji Omisore of the Federal University of Technology with "A Classification Model for Mining Research Publications from Crowdsourced Data". Because of visa issues, he was unable to attend but had planned to present via Skype or Google Hangouts. Despite changing wireless configurations, services, and many other attempts, the bandwidth at the conference venue proved insufficient and he was unable to present. A contingency was set up between him and the doctoral consortium organizers to review his slides.


Following the attempts to allow Mumini to present remotely, the consortium broke up into groups of four (two students and two doctors) for private consultations. The doctors in my group (Drs. Edie Rasmussen and Michael Nelson) provided extremely helpful feedback on both my presentation and my research objectives. Particularly valuable was their discussion of how I could improve the evaluation of my proposed research.

Overall, the JCDL Doctoral Consortium was a very valuable experience. By viewing how other PhD students were approaching their research and obtaining critical feedback on mine, I believe the experience to be priceless for improving the quality of one's PhD research.

— Mat (@machawk1)

Edit: Subsequent to this post, Lulwah reported on the main portion of the JCDL 2015 conference and Sawood reported on the WADL workshop at JCDL 2015.

Tuesday, June 9, 2015

2015-06-09: Mobile Mink merges the mobile and desktop webs

As part of my 9-to-5 job at The MITRE Corporation, I lead several STEM outreach efforts in the local academic community. One of our partnerships, with the New Horizons Governor's School for Science and Technology, pairs high school seniors with professionals in STEM careers. Wes Jordan has been working with me since October 2014 as part of this program and for his senior mentorship project, a requirement for graduation from the Governor's School.

Wes has developed Mobile Mink (soon to be available in the Google Play store). Inspired by Mat Kelly's Mink add-on for Chrome, Wes adapted the functionality to an Android application. This blog post discusses the motivation for and operation of Mobile Mink.


The growth of the mobile web has encouraged web archivists to focus on ensuring its thorough archiving. However, mobile URIs are not as prevalent in the archives as their non-mobile (or, as we will refer to them, desktop) counterparts. This is apparent when we compare the TimeMaps of the Android Central site's desktop and mobile URIs.

TimeMap of the desktop Android Central URI
 The 2014 TimeMap in the Internet Archive of the desktop Android Central URI includes a large number of mementos with a small number of gaps in archival coverage.
TimeMap of the mobile Android Central URI
Alternatively, the TimeMap in the Internet Archive of the mobile Android Central URI has far fewer mementos and many more gaps in archival coverage.

This example illustrates the discrepancy between archival coverage of mobile and desktop URIs. Additionally, as humans we can understand that these two URIs represent content from the same site: Android Central. The connection between the URIs exists on the live web, with mobile user-agents triggering a redirect to the mobile URI. This connection is lost during archiving.
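
The redirect behavior described above can be sketched with hypothetical server logic (the hosts and User-Agent patterns below are illustrative only):

```javascript
// A mobile User-Agent triggers a redirect to the mobile URI. An archival
// crawler that never sends a mobile User-Agent only ever sees, and
// therefore only archives, the desktop side of this connection.
function respond(userAgent) {
  if (/Mobile|Android|iPhone/.test(userAgent)) {
    return { status: 301, location: 'http://m.example.com/' }; // hypothetical host
  }
  return { status: 200, body: 'desktop page' };
}

console.log(respond('Mozilla/5.0 (iPhone; ...)').status);     // 301
console.log(respond('Mozilla/5.0 (Windows NT 6.1)').status);  // 200
```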

The representations of the mobile and desktop URIs are different, even though a human will recognize the content as largely the same. Because archives commonly index by URI and archival datetime only, a machine may not be able to understand that these URIs are related.
The desktop Android Central representation
The mobile Android Central representation

Mobile Mink helps merge the mobile and desktop TimeMaps while also providing a mechanism to increase the archival coverage of mobile URIs. We detail these features in the Implementation section.


Mobile Mink provides users with a merged TimeMap of the mobile and desktop versions of the same site. We use the URI permutations detailed in McCown's work to transform desktop URIs to mobile URIs and mobile URIs to desktop URIs. This process allows Mobile Mink to establish the connection between mobile and desktop URIs.
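
A simplified sketch of the permutation idea follows; the rules below are illustrative stand-ins for the fuller set of McCown-style permutations and are not Mobile Mink's actual code:

```javascript
// Rewrite a desktop host to candidate mobile hosts, and a mobile host
// back to its desktop form, so both TimeMaps can be requested.
function toMobileCandidates(host) {
  const bare = host.replace(/^www\./, '');
  return ['m.' + bare, 'mobile.' + bare]; // common mobile-host patterns
}

function toDesktopCandidate(host) {
  return 'www.' + host.replace(/^(m|mobile)\./, '');
}

console.log(toMobileCandidates('www.androidcentral.com'));
// [ 'm.androidcentral.com', 'mobile.androidcentral.com' ]
console.log(toDesktopCandidate('m.androidcentral.com'));
// www.androidcentral.com
```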

Merged TimeMap
With the mobile and desktop URIs identified, Mobile Mink uses Memento to retrieve the TimeMaps of both the desktop and mobile versions of the site. Mobile Mink merges all of the returned TimeMaps and sorts the mementos temporally, identifying the mementos of the mobile URIs with an orange icon of a mobile phone and the mementos of the desktop URIs with a green icon of a PC monitor.
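
The merge-and-sort step can be sketched as follows; the memento objects are simplified stand-ins for real Memento TimeMap entries:

```javascript
// Merge desktop and mobile mementos into one temporally ordered TimeMap,
// tagging each entry with its source so the UI can show the right icon.
function mergeTimeMaps(desktopMementos, mobileMementos) {
  const tag = function (list, source) {
    return list.map(function (m) {
      return { datetime: m.datetime, uri: m.uri, source: source };
    });
  };
  return tag(desktopMementos, 'desktop')
    .concat(tag(mobileMementos, 'mobile'))
    .sort(function (a, b) { return a.datetime.localeCompare(b.datetime); });
}

const merged = mergeTimeMaps(
  [{ datetime: '2013-05-01T00:00:00Z', uri: 'd1' },
   { datetime: '2014-07-01T00:00:00Z', uri: 'd2' }],
  [{ datetime: '2014-01-15T00:00:00Z', uri: 'm1' }]
);

console.log(merged.map(function (m) { return m.source; }));
// [ 'desktop', 'mobile', 'desktop' ]
```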

To mitigate the discrepancy in archival coverage between the mobile and desktop URIs of web resources, Mobile Mink provides an option that allows users to push the mobile and desktop URIs to the Internet Archive's Save Page Now feature and to a second on-demand archival service. This allows Mobile Mink's users to actively archive mobile resources that might not otherwise be archived.

These features mirror the functionality of Mink by providing users with a TimeMap of the site currently being viewed, but extend it by providing the merged mobile and desktop TimeMap. Mink also provides a feature to submit URIs for on-demand archiving, but Mobile Mink extends this functionality by submitting both the mobile and desktop URIs to the two archival services.


The video below provides a demo of Mobile Mink. We use the Chrome browser and navigate to the Android Central site, which redirects us to its mobile version. From the browser menu, we select the "Share" option. When we select the "View Mementos" option, Mobile Mink provides the aggregate TimeMap. Selecting the icon in the top right corner, we can access the menu to submit the mobile and desktop URIs to the on-demand archival services.

Next Steps

We plan to release Mobile Mink in the Google Play store in the next few weeks. In the meantime, please feel free to download and use the app from Wes's GitHub repository and to provide feedback through its issue tracker. We will continue to test and refine the software moving forward.

Wes's demo of Mobile Mink was accepted at JCDL 2015. Because he is graduating in June and preparing to start his collegiate career at Virginia Tech, someone from the WS-DL lab will present his work on his behalf. However, we hope to convince Wes to come to the Dark Side and join the WS-DL lab in the future. We have cookies.

--Justin F. Brunelle

2015-06-09: Web Archiving Collaboration: New Tools and Models Trip Report

Mat Kelly and Michele Weigle travel to and present at the Web Archiving Collaboration Conference in NYC.                           

On June 4 and 5, 2015, Dr. Weigle (@weiglemc) and I (@machawk1) traveled to New York City to attend the Web Archiving Collaboration conference held at the Columbia School of International and Public Affairs. The conference gave us an opportunity to present our work from the incentive award provided to us by Columbia University Libraries and the Andrew W. Mellon Foundation in 2014.

Robert Wolven of Columbia University Libraries started off the conference with welcoming the audience and emphasizing the variety of presentations that were to occur on that day. He then introduced Jim Neal, the keynote speaker.

Jim Neal started by noting the challenges of "repository chaos": namely, which version of a document should be cited for online resources when multiple versions exist. "Born-digital content must deal with integrity," he said, "and remain as unimpaired and undivided as possible to ensure scholarly access."

Brian Carver (@brianwc) and Michael Lissner (@mlissner) of Free Law Project (@freelawproject) followed the keynote with Brian first stating, "Too frequently I encounter public access systems that have utterly useless tools on top of them and I think that is unfair." He described his project's efforts to make available court data from the wide variety of systems digitally deployed by various courts on the web. "A one-size-fits-all solution cannot guarantee this across hundreds of different court websites.", he stated, further explaining that each site needs its own algorithm of scraping to extract content.

To facilitate the crowdsourcing of scraping algorithms, he has created a system where users can supply "recipes" to extract content from the courts' sites as it is posted. "Everything I work with is in the public domain. If anyone says otherwise, I will fight them about it," he remarked regarding the demands people have brought to him after finding their names in the now-accessible court documents. "We still find courts using WordPerfect. They can cling to old technology like no one else."

Free Law Project slides

Shailin Thomas (@shailinthomas) and Jack Cushman from the Berkman Center for Internet and Society, Harvard University, spoke next about link rot: of the digital citations in the Harvard Law Review over the last 10 years, 73% of the online links were broken, and over 50% of the links cited by the Supreme Court are broken. They continued by describing the Perma API and the project's recent Memento compliance. Perma.cc slides

After a short break, Deborah Kempe (@nyarcist) of the Frick Art Reference Library described her recent observation that art is making a digital shift to the Internet. She has been working with Archive-It for quality assurance of captured websites and with Hanzo Archives for on-demand captures of sites that her organization found particularly challenging. One example of the latter is Wangechi Mutu's site, which has an animation on its homepage that Archive-It was unable to capture but Hanzo was.

In the same session, Lily Pregill (@technelily) of NYARC stated, "We needed a discovery system to unite NYARC arcade and our Archive-It collection. We anticipated creating yet another silo of an archive." While she stated that the user interface is still under construction, it does allow the results of her organization's archive to be supplemented with results from Archive-It.

New York Art Resources Consortium (NYARC) slides

Following Lily in the session, Anna Perricci (@AnnaPerricci) of Columbia University Libraries talked about the Contemporary Composers Web Archive, which consists of 11 participating curators and 56 sites currently available in Archive-It. The "Ivies Plus" collaboration has Columbia building web archives with seeds chosen by subject specialists from Ivy League universities along with a few other universities.

Ivies Plus slides

In the same session, Alex Thurman (@athurman) (also from Columbia) presented on the IIPC Collaborative Collection Initiative. He referenced the varying legal environments among members' countries: some members can do full TLD crawling, while others (namely in the U.S.) have no protection from copyright. He spoke of the preservation of Olympics web sites from 2010, 2012, and 2014, the 2014 games being the first whose logo contained a web address. "Though Archive-It had a higher upfront cost," he said of the initial weighing of various options for Olympic website archiving, "it was all-inclusive of preservation, indexing, metadata, replay, etc." To publish their collections, they are looking into utilizing the .int TLD, which is reserved for internationally significant information but is underutilized: only about 100 .int sites exist, all of which have research value.

International Internet Preservation Consortium collaborative collections slides

The conference then broke for a provided lunch then started with Lightning Talks.

To start off the lightning talks, Michael Lissner (@mlissner) spoke about RECAP: what it is, what it has done, and what is next for the project. Much of the content contained within the Public Access to Court Electronic Records (PACER) system consists of paywalled public-domain documents. Obtaining the documents costs users ten cents per page with a three-dollar maximum. "To download the Lehman Brothers proceedings would cost $27,000," he said. His system leverages the user's browser, via the extension framework, to save a copy of each user's downloads to the Internet Archive, and first queries the archive to see if the document has been previously downloaded.

Dragan Espenschied (@despens) gave the next lightning talk, on preserving digital art pieces, namely those on the web. He noted one particular example where the artist extensively used scrollbars, which are less commonplace in user interfaces today. To accurately re-experience the work, he fired up a browser-based MacOS 9 emulator.

Jefferson Bailey (@jefferson_bail) followed Dragan with his work investigating archive access methods that are not URI-centric. He has begun working with WATs (web archive transformations), LGAs (longitudinal graph analyses), and WANEs (web archive named entities).

Dan Chudnov (@dchud) then spoke of his work at GWU Libraries. He had developed Social Feed Manager, a Django application to collect social media data from Twitter. Previously, researchers had been copying and pasting tweets into Excel documents. His tool automated this process. "We want to 1. see how to present this stuff, 2. do analytics to see what's in the data, and 3. find out how to document the now. What do you collect for live events? What keywords are arising? Whose info should you collect?", he said.

Jack Cushman gave the next lightning talk, about a site that is trying to build a strong dark archive: the concept would prevent archivists from reading material within it until certain conditions are met. Examples where this would be applicable include the IRA archive at Boston College, Hillary Clinton's e-mails, etc.

With the completion of the Lightning Talks, Jimmy Lin (@lintool) of University of Maryland and Ian Milligan (@ianmilligan1) of University of Waterloo rhetorically asked, "When does an event become history?" stating that history is written 20 to 30 years after an event has occurred. "History of the 60s was written in the 30s. Where are the Monica Lewinsky web pages now? We are getting ready to write the history of the 1990s.", Jimmy said. "Users can't do much with current web archives. It's hard to develop tools for non-existent users. We need deep collaborations between users (archivists, journalists, historians, digital humanists, etc.) and tool builders. What would a modern archiving platform built on big data infrastructure look like?" He compared his recent work in creating warcbase with the monolithic OpenWayback Tomcat application. "Existing tools are not adequate."

Warcbase: Building a scalable platform on HBase and Hadoop slides (part 1)

Ian then talked about warcbase, an open-source platform for managing web archives with Hadoop and HBase. WARC data is ingested into HBase, and Spark is used for text analysis and services.

Warcbase: Building a scalable platform on HBase and Hadoop slides (part 2)
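The core storage idea is that all captures of a URL land in one HBase row, keyed so that pages from the same site sort together, with each capture stored under its timestamp. The sketch below illustrates that row-key scheme using a plain Python dict standing in for HBase; the SURT-style key format and function names are my own illustrative assumptions, not warcbase's exact implementation.

```python
from urllib.parse import urlparse

def surt_key(url):
    """Build a SURT-style row key: reversed domain plus path, so
    captures from the same site sort adjacently (illustrative only,
    not warcbase's exact key format)."""
    parts = urlparse(url)
    host = ",".join(reversed(parts.netloc.split(".")))
    return f"{host}){parts.path or '/'}"

# A plain dict stands in for an HBase table: row key -> {timestamp: content}.
table = {}

def ingest(url, timestamp, content):
    """Store one capture; versions of the same URL share a row."""
    table.setdefault(surt_key(url), {})[timestamp] = content

ingest("http://example.com/page", "20150601120000", b"<html>v1</html>")
ingest("http://example.com/page", "20150615120000", b"<html>v2</html>")

# All captures of the URL are one row; versions are columns by timestamp.
versions = table[surt_key("http://example.com/page")]
```

With this layout, replaying a page at a given moment is a point lookup on (key, timestamp), and temporal analysis is a scan across a row's timestamps.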

Zhiwu Xie (@zxie) of Virginia Tech then presented his group's work on maintaining web site persistence when the original site is no longer available. Using an approach akin to a proxy server, the content served when the site was last available continues to be served in lieu of the live site. "If we have an archive that archives every change of that web site and the website goes down, we can use the archive to fill the downtimes," he said.

Archiving transactions towards an uninterruptible web service slides
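The mechanism can be pictured as a proxy that records every response it relays and replays the most recent capture when the origin stops answering. The following is a minimal sketch of that fallback logic, assuming a pluggable `fetch_live` callable; the class and helper names are hypothetical, not the presenters' actual implementation.

```python
class TransactionalArchiveProxy:
    """Toy proxy: archive each response served for a URL, and replay
    the latest capture when the origin is down (illustrative sketch)."""
    def __init__(self, fetch_live):
        self.fetch_live = fetch_live  # callable: url -> body, may raise
        self.archive = {}             # url -> last successfully served body

    def get(self, url):
        try:
            body = self.fetch_live(url)
            self.archive[url] = body  # record the transaction
            return body, "live"
        except Exception:
            if url in self.archive:   # origin down: fill the downtime
                return self.archive[url], "archived"
            raise

def flaky_fetch_factory(responses):
    """Hypothetical stand-in for an origin server that later goes down;
    None in the sequence simulates an outage."""
    it = iter(responses)
    def fetch(url):
        r = next(it)
        if r is None:
            raise ConnectionError("origin down")
        return r
    return fetch

proxy = TransactionalArchiveProxy(flaky_fetch_factory([b"v1", None]))
first = proxy.get("http://example.org/")   # served live: (b"v1", "live")
second = proxy.get("http://example.org/")  # outage, replayed: (b"v1", "archived")
```

Because every served response is archived, the proxy can always reproduce the site's last-available state, which is the "fill the downtimes" behavior described in the talk.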

Mat Kelly (@machawk1, your author) presented next with "Visualizing digital collections of web archives", where I described the SimHash archival summarization strategy to efficiently generate a visual representation of how a web page changed over time. In the work, I created a stand-alone interface, a Wayback add-on, and an embeddable service for generating a summarization of a live web page. At the close of the presentation, I attempted a live demo.

WS-DL's own Michele Weigle (@weiglemc) next presented Yasmin's (@yasmina_anwar) work on Detecting Off-Topic Pages. The recently accepted TPDL 2015 paper examines how pages in Archive-It collections have changed over time and how to detect when a page is no longer relevant to what the archivist intended to capture. She evaluated six similarity metrics and found that cosine similarity performed the best.
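Cosine similarity in this setting compares a later capture of a page against an earlier, known-on-topic capture; a score near 1.0 means the term distributions still match, while a score near 0.0 flags likely topic drift. A minimal term-frequency sketch of that comparison (my own toy example, not the paper's exact feature pipeline):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Cosine similarity of term-frequency vectors: 1.0 for identical
    term distributions, 0.0 when no terms are shared."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# A later capture that stays on topic scores high against the first
# capture; a hijacked or parked page scores near zero.
on_topic = cosine_similarity("human rights report archive",
                             "human rights report update")   # 0.75
off_topic = cosine_similarity("human rights report archive",
                              "domain for sale buy now")     # 0.0
```

Thresholding such a score per collection is one way an archivist could be alerted that a seed URL no longer serves the content the crawl was set up to capture.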

In the final presentation of the day, Andrea Goethals (@andreagoethals) of Harvard Library and Stephen Abrams of California Digital Library discussed the difficulties of keeping up with web archiving locally, citing outdated tools and systems. A hierarchical diagram of a potential collaborative model they showed piqued the audience's interest as being overcomplicated for smaller archives.

Exploring a national collaborative model for web archiving slides

To close out the day, Robert Wolven gave a synopsis of the challenges to come and expressed his hope that there was something for everyone.

Day 2

The second day of the conference contained multiple concurrent topical sessions that were somewhat open-ended to facilitate more group discussion. I initially attended David Rosenthal's talk, where he discussed the need for tools and APIs that integrate into various systems to standardize access. "A random URL on the web has less than a 50% chance of getting preserved anywhere," he said. "We need to use resources as efficiently as possible to up that percentage."

DSHR then discussed repairing archives for bit-level integrity and LOCKSS's approach to accomplishing it. "How would we go about establishing a standard archival vocabulary?" he asked. "'Crawl scope' means something different in Archive-It vs. other systems."
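The essence of bit-level repair is that independent replicas periodically compare hashes of their copies, and a copy that disagrees with the consensus is overwritten from a copy that agrees. The toy sketch below shows a simple majority vote over replica digests; the real LOCKSS polling protocol is far more elaborate (nonced hashes, peer sampling, rate limiting), so this is only an illustration of the audit-and-repair idea.

```python
import hashlib
from collections import Counter

def repair_by_majority(replicas):
    """Hash each replica, treat the majority digest as presumed-good,
    and replace any disagreeing copy with a copy that matches it.
    Toy illustration of audit-and-repair, not the LOCKSS protocol."""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    winner, _ = Counter(digests).most_common(1)[0]
    good = replicas[digests.index(winner)]
    return [good if d != winner else r for d, r in zip(digests, replicas)]

# Three healthy copies outvote one bit-rotted copy, which gets repaired.
copies = [b"content", b"content", b"c0rrupted", b"content"]
repaired = repair_by_majority(copies)
```

The scheme only works while damaged copies remain a minority, which is why regular audits across genuinely independent replicas matter.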

I then changed rooms to catch the last half hour of Dragan Espenschied's session, where he discussed pywb in more depth. The software allows people to create their own public and private archives and offers a pass-through model in which it does not record login information. Further, it can capture embedded YouTube videos and Google Maps.

Following the first set of concurrent sessions, I attended Ian Milligan's demo of using warcbase to analyze Canadian Political Parties (a private repo as of this writing, to be made public once cleaned up). He also demonstrated using Web Archives for Historical Research. In the subsequent and final presentation of day 2, Jefferson Bailey demonstrated Vinay Goel's (@vinaygo) Archive Research Services Workshop, which was created as an introduction to data mining and computational tools and methods for working with web archives, aimed at researchers, developers, and general users. The system utilizes the WAT, LGA, and WANE derived data formats that Jefferson spoke of in his Day 1 Lightning Talk.

After Jefferson's talk, Robert Wolven again collected everyone into a single session to go over what was discussed in each session on the second day and gave a final closing.

Overall, the conference was very interesting and highly relevant to my research in web archiving. I hope to dig into some of the projects and resources I learned about and to follow up with the contacts I made at the Columbia Web Archiving Collaboration conference.

— Mat (@machawk1)