Saturday, August 3, 2013

2013-07-26: Web Archiving and Digital Libraries workshop - WADL 2013 Trip Report

On July 25th and 26th, 2013, the WS-DL group attended the Web Archiving and Digital Libraries Workshop, which was co-located with JCDL 2013 in Indianapolis, IN.
Ed Fox, from Virginia Tech, opened the workshop by greeting the attendees. Then, Andreas Paepcke gave two presentations. The first was entitled "ArcSpread: Enabling Web Archive Analysis for non-CS experts". In this presentation, Andreas showed how to make web archives useful to people outside computer science. ArcSpread uses a spreadsheet interface to help the user extract information from the web archive. It supports analysis activities such as filtering, aggregating, classifying, and manual coding. The output is a spreadsheet that can answer questions about a specific query (e.g., Hurricane Katrina), such as which pages contain given words, which images are associated with the term, which place and people names appear, and which names are most frequent. ArcSpread relies on a sheet engine backed by a Hadoop cluster of 60 nodes.

The second presentation was entitled "Applying web archives to real-time group source prediction of speech". Andreas showed how the computer can help speech-impaired users communicate via text-to-speech. EchoTree is displayed as a word tree that proposes multiple conversation directions, so that the impaired person, with help from the conversation partner, can guess where the conversation is going. EchoTree saved the disabled person 11% to 14% of the typing time needed to communicate with others.

After that, Ylva Braaten, from the United Nations, gave a short presentation about her role as a reference librarian at the United Nations, helping people reach the information they need. She gave the example of a client who asked for a map as it appeared in 2008. The map could not be found on the UN website, but with the help of the Wayback Machine they were able to retrieve it.

The next presentation was from Martin Klein, from Los Alamos National Lab, entitled "SiteStory: Archiving done differently". Martin compared traditional archiving methods (through crawlers) with transactional archiving techniques. SiteStory is a transactional archive that provides Memento-compliant access to its data. Justin Brunelle, from Old Dominion University, presented his work on evaluating SiteStory using ApacheBench. Martin then presented another project, Hiberlink, which aims to provide full access to the scholarly web. The pilot study showed that only 72% to 78% of the URIs used in scholarly articles are available in the archives or on the live Web.
Hany SalahEldeen, from Old Dominion University, gave a presentation entitled "Temporal User Intention Modeling in Social Media". Hany introduced the concept of user intention when sharing resources on social media. He demonstrated a new tool, "TimeLord", that records the user's preferred snapshot for resources shared in historical tweets.

Then, Michael Szajewski, from Ball State University, gave a presentation entitled "Needs and Obstacles for a Web Archiving Initiative at the Ball State University Libraries". They currently have two digital repositories: the Digital Media Repository (powered by CONTENTdm) and the Cardinal Scholar institutional repository (powered by DSpace). Michael explained that a web archive is needed to develop methods for preserving digital scholarship projects, student groups' websites, and the university's social media activities. The obstacles include limited human resources and a high rate of personnel turnover.

Yasmin AlNoamany, from Old Dominion University, gave a presentation entitled "Who and What links to the Internet Archive". Yasmin presented the results of her analysis of the log files from the Internet Archive's Wayback Machine. She discussed interesting results about the external web sites that refer to the Internet Archive, and also presented an analysis of the most used languages in web archives.
My presentation was the last one of the first day. I presented my research on profiling web archives. The results showed that the Internet Archive has the widest coverage among the archives, while the national archives do a good job for their own national domains. We used these results to optimize the query routing technique in the Memento Aggregator.
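To make the routing idea concrete, here is a minimal sketch of how archive profiles could drive query routing. It is not the Memento Aggregator's actual implementation: the archive names, the coverage numbers, the `min_coverage` threshold, and the `top_k` cutoff are all hypothetical values chosen only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: for each archive, the fraction of sampled URIs
# per top-level domain that the archive held. These numbers are invented.
PROFILES = {
    "general-archive": {"com": 0.8, "uk": 0.7, "pt": 0.6},
    "uk-national-archive": {"uk": 0.9},
    "pt-national-archive": {"pt": 0.85},
}


def tld_of(uri):
    """Return the last label of the URI's host name, e.g. 'uk'."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1]


def route(uri, profiles=PROFILES, min_coverage=0.5, top_k=2):
    """Pick the archives most likely to hold the URI, based on its TLD.

    Archives below min_coverage for this TLD are skipped, so the
    aggregator would not send them a request at all.
    """
    tld = tld_of(uri)
    scored = [(coverage.get(tld, 0.0), name) for name, coverage in profiles.items()]
    ranked = sorted((s for s in scored if s[0] >= min_coverage), reverse=True)
    return [name for _, name in ranked[:top_k]]


if __name__ == "__main__":
    print(route("http://www.example.co.uk/"))
    print(route("http://example.com/"))
```

Under these assumptions, a request for a .uk URI would go to the national archive first and to the general archive second, while archives with no coverage for that domain are not queried.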
Frank McCown, from Harding University, started the workshop's second day with a presentation entitled "Archiving the Mobile Web". Frank discussed the problem of archiving the mobile version of a website and covered the different techniques used to implement the mobile web. The mobile web is not well archived, even by the Internet Archive. For example, even though IA has held copies for years, not all of the pages render successfully. Monica Yarbrough, from Harding University, presented Mobile Finder, a tool to discover the mobile version of any website. Mobile Finder uses both desktop and mobile user-agents and compares the content returned for each. If Mobile Finder cannot find the mobile version this way, it tries to guess alternative URIs. Mobile Finder is backed by a web service that compares the different versions. The results showed that it was accurate in 96% of the cases. Keith Enlow, from Harding University, presented Heritrix Mobile, an extension of the Heritrix crawler that captures the mobile version of a website. Heritrix Mobile captures the desktop webpage by default, then uses Mobile Finder to discover the mobile webpage. They updated the Heritrix configuration to include the mobile user agent.
Scott Ainsworth, from Old Dominion University, gave a presentation entitled "Temporal Spread in Archived Composite Resources". Scott showed that a rendered memento may include embedded resources from different datetimes. The mean difference between the memento datetime and the datetimes of its embedded resources was 3 months. Scott discussed the temporal coherence of mementos and proposed different policies to improve it.
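The two steps described for Mobile Finder (comparing what a desktop and a mobile user-agent receive, then guessing alternative URIs) can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the user-agent strings, the guessed URI patterns, the similarity measure, and the 0.9 threshold are all assumptions.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen

# Hypothetical user-agent strings; any desktop/mobile pair would do.
DESKTOP_UA = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/28.0 Safari/537.36"
MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 Mobile Safari/8536.25"


def fetch(uri, user_agent):
    """Fetch a page while presenting the given User-Agent header."""
    request = Request(uri, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def candidate_mobile_uris(uri):
    """Guess common mobile URI patterns (assumed, not Mobile Finder's list)."""
    parts = urlsplit(uri)
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    guesses = [
        urlunsplit((parts.scheme, prefix + bare, parts.path, parts.query, ""))
        for prefix in ("m.", "mobile.")
    ]
    guesses.append(
        urlunsplit((parts.scheme, host, "/m" + (parts.path or "/"), parts.query, ""))
    )
    return guesses


def looks_like_distinct_mobile_version(desktop_html, mobile_html, threshold=0.9):
    """Treat the two responses as different versions if they are dissimilar."""
    ratio = SequenceMatcher(None, desktop_html, mobile_html).ratio()
    return ratio < threshold
```

A caller would first compare `fetch(uri, DESKTOP_UA)` with `fetch(uri, MOBILE_UA)`, and fall back to trying each result of `candidate_mobile_uris(uri)` only if the two responses look the same.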
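As a rough illustration of the measurement, the sketch below computes the mean delta between a root memento's datetime and the datetimes of its embedded resources. The use of absolute differences and the definition of spread as latest minus earliest datetime are assumptions for this example, and the sample datetimes are made up.

```python
from datetime import datetime, timedelta


def temporal_deltas(root_datetime, embedded_datetimes):
    """Absolute time difference between the root memento and each embedded resource."""
    return [abs(d - root_datetime) for d in embedded_datetimes]


def mean_delta(deltas):
    """Average of a non-empty list of timedeltas."""
    return sum(deltas, timedelta(0)) / len(deltas)


def temporal_spread(root_datetime, embedded_datetimes):
    """Assumed definition: latest minus earliest datetime in the composite page."""
    all_datetimes = [root_datetime] + list(embedded_datetimes)
    return max(all_datetimes) - min(all_datetimes)


if __name__ == "__main__":
    # Hypothetical capture times for a page, its stylesheet, and an image.
    root = datetime(2013, 1, 1)
    embedded = [datetime(2012, 12, 1), datetime(2013, 1, 11)]
    print(mean_delta(temporal_deltas(root, embedded)))
    print(temporal_spread(root, embedded))
```

A page whose embedded resources were all captured at the same moment as the root would have a mean delta and a spread of zero under these definitions.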

Fernanda Balbi, from the Brazilian Development Bank (BNDES), presented "The role of BNDES in Brazil preservation of heritage". Fernanda spoke about BNDES and its support for preserving memorial collections and strengthening relevant cultural institutions and historical sites. BNDES provides planning, management, and monitoring, with an eye to sustainability, so that the information is kept through the years with long-term effects. Then, Fernanda showed the Brasiliana digital library project as an example of the BNDES effort to preserve Brazilian culture.

Justin Brunelle, from Old Dominion University, presented "How I spend my summer vacations?". Justin discussed his research on the archivability of resources when captured with different tools. He found that some unimportant resources may be missed (zombies in the archive), and that not all embedded resources are created equal (for example, a missing stylesheet may affect the page layout, but it does not affect the information). Justin proposed some techniques to measure the archivability of a page.

Ed Fox, from Virginia Tech, presented "Crisis, Tragedy, and Recovery Network (CTRnet) Digital Library Project". Ed started with the goal behind the CTRnet project, which is to collect, analyze, and visualize disaster information with a DL. The project brings together members from different disciplines. The disaster web page archive has 45 collections and the disaster Twitter archive has 120 tweet archives. Other applications based on the Twitter information are "Visualizing Emergency Phases in Tweets" and "Water Main break visualization". After that, Ed explained the "Integrated Digital Events Archive and Library (IDEAL)" project as an extension of CTRnet. IDEAL focuses on detecting events and triggering the archiving of the event data.

The workshop closed with an open discussion about various aspects of web archiving, such as funding and publishing.

Ahmed AlSum

