Wednesday, May 1, 2013

2013-04-22: IIPC GA 2013

From April 22 to 26, Michael Nelson and I attended the International Internet Preservation Consortium (IIPC) General Assembly 2013, hosted by the National and University Library of Slovenia in Ljubljana, Slovenia. This year marks the ten-year anniversary of the IIPC, and the GA theme was "What were the past challenges, and how can we plan the future of IIPC?". Also this year, Old Dominion University became an official member of the IIPC. The GA was organized into five days.

Day 1: Monday, April 22, 2013
IIPC General Assembly.

Mateja Komel Snoj, the director of the National and University Library Slovenia, and Alenka Kavčič-Čolić, the Head of the Library Research Center at the National and University Library Slovenia, opened the day, welcomed the attendees, and expressed their pleasure at hosting the IIPC GA in Slovenia. Mateja emphasized the importance of digital preservation and the role of the National and University Library Slovenia in preserving the Slovenian digital heritage. Then the steering committee chair, Cathy Hartman from the University of North Texas, started her welcome note by thanking the hosting institution and the planning committee for their effort in organizing the GA this year. Cathy announced the theme for the GA, "How can we plan for the future?", and the nominations for the IIPC outstanding award for 2011-2012. Helen Hockx-Yu from the British Library spoke about IIPC @ 10. Helen showed a video from IIPC GA 2011 that discussed the web archiving challenges at that time. Helen defined the keywords of IIPC's success as "Collaboration" in common projects (e.g., Heritrix) and "Cost Effectiveness" in using the budget to secure more projects (e.g., WARC ISO standards). She highlighted the importance of web archiving for all fields: computer science, political science, journalism, etc. She ended by reminding the IIPC members to think productively about where they are going.

The new-member update presentations started with Michael Nelson from Old Dominion University. Michael gave an overview of the web preservation research in the Web Science and Digital Library group.

The second new member was Valério Pereira da Silva from Organisation Information Archivierung (OIA). OIA provides web archiving services (collecting the web, hosting, and providing access) to its partners, including BMW and Siemens. Aaron Bins and Abigail Potter, the former program and communication officers respectively, gave an update about their efforts from 2010 to 2013. Aaron extended the IIPC programs beyond the technical track. Abigail announced that IIPC has 44 members from around the world. Her efforts included a promotional video, a DPC award nomination, a web archiving use case report, and a LinkedIn group. Mary Pitt, the new program and communication officer, introduced herself to the community and described her background. Finally, Clément Oury, IIPC treasurer, from Bibliothèque nationale de France, gave an update about the prior financial year.

After the break, Hansueli Locher from the Swiss National Library presented how to access e-Helvetica. e-Helvetica combines existing tools with newly developed interfaces to address its challenges: web-based integration of different digital materials, scalability, full-text indexing for different languages, metadata, and access rights. Then it was time for my presentation, "ArcLink: Additional API support for Wayback Machines". ArcLink is a complete system to extract, preserve, and access the link structure of the archived web. ArcLink has an XML/RDF interface that could be integrated with a Wayback Machine.

The morning session concluded with members' updates. The British Library announced the new legal deposit legislation for capturing digital content in the UK, and the challenge of defining what counts as a UK website. Kris Carpenter from the Internet Archive announced the completion of the "End of Term" collaborative project, which preserved public government information at the end of US presidential and congressional terms. They had 5000+ URIs.

After lunch, we had the "10 years of IIPC" panel, facilitated by Cathy Hartman, with panelists Gordon Mohr (Internet Archive), Birgit Henriksen (Royal Library of Denmark), and Thorsteinn Hallgrimsson (National and University Library of Iceland). The panelists recalled the early formation of IIPC in 2003. They agreed on the goals established at the start: collaboration between the small countries and the big institutions, and the development of new tools. IIPC members have a broad view of their problems, and they can share both problems and solutions. They discussed the main challenges of the 10 years. Initially, the challenge was how to convince the home organizations of IIPC's ideas, and then how to promote those ideas to the world. From the harvesting perspective, the evolution of the web and its interfaces made crawling and replaying the web harder. From the access perspective, the challenges were the lack of a defined set of APIs and tools shared between the archives, and the quality of the archived corpus.

In the next session, Cathy Hartman and Gildas Illien provided an overview of the discussion paper that covered the previous and future efforts of the web archives. The discussion paper posed several questions to the community: Does IIPC need any update to its organizational structure? How can communication between the members be increased? What should we access and what should we not? The community members raised interesting issues, such as outreach to other institutions (e.g., UNESCO and Google), making web archiving more attractive to users (e.g., Personal Digital Archiving tools), and focusing on the issues common to the members. After that, we divided into three discussion groups to examine these ideas more closely. The results from these groups were discussed at the beginning of the second day.

Day 2: Tuesday, April 23, 2013
Workshops and Small Group Meetings - IIPC members Only.

The second day started with an overview of the outcomes from the discussion groups. The suggestions included continuing the development of common tools like Heritrix and Wayback, reaching out to other organizations, educating the members about the existence of the Memento APIs, and making use of the surveys conducted between the institutions.

After the break, two workshops ran in parallel. I attended the quality assurance workshop, which focused on the quality assurance tools that help the digital curator. Helen Hockx-Yu from BL presented the new update to the QA module in Web Curator Tool for selective crawls. The update reduces the required number of clicks, which makes the process easier for the curators. The tool provides the curator with a thumbnail and a colored indicator of the page quality. The tool also allows curators to enter their feedback about the bad pages. The next tool from BL was presented by Andy Jackson. Andy gave a demonstration of using Monitrix, a Heritrix crawl monitoring tool, in a domain crawl. Monitrix gives statistics about the different crawling processes, such as rate of capture, crawling by TLD, redirection, and robots exclusion. Monitrix has detailed information about crawled domains by number of URIs, MIME type, and discovered viruses. Brenda Reyes, from the University of North Texas, gave a presentation about the current state and practices of QA in web archives. Brenda found that there is little public information about QA activities. In her study, she interviewed curators from UNT and the Library of Congress, in addition to emailing the other web archives. Brenda asked people to complete a survey about the quality of the web archive.
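To make the kind of per-TLD and per-MIME-type statistics mentioned above concrete, here is a minimal Python sketch. It is not Monitrix's implementation: the `CRAWL_LOG` entries and the `tld` helper are made up for illustration, and a real crawl log carries many more fields.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical crawl log: (captured URI, reported MIME type).
CRAWL_LOG = [
    ("http://www.bl.uk/", "text/html"),
    ("http://www.bl.uk/logo.png", "image/png"),
    ("http://example.co.uk/", "text/html"),
    ("http://example.si/", "text/html"),
]


def tld(url):
    """Return the last label of the host name, e.g. 'uk' for www.bl.uk."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


# Tally captures by top-level domain and by MIME type.
by_tld = Counter(tld(url) for url, _ in CRAWL_LOG)
by_mime = Counter(mime for _, mime in CRAWL_LOG)

print(by_tld.most_common())
print(by_mime.most_common())
```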

The next session was the Memento Aggregator update. Herbert Van de Sompel, from Los Alamos National Lab, announced the new Memento Internet Draft 7.0. Herbert covered various topics related to Memento, such as the SiteStory transactional archive, Wikipedia's Memento compliance challenge, and the IIPC aggregator progress. Herbert explained the two current Memento aggregator implementations, distributed search and metadata harvesting, and explored two novel Memento aggregator architectures, rule-based and context-dependent. Herbert urged the web archives to become Memento compliant, especially by enabling the Wayback Machine Memento interface. Michael Nelson, from Old Dominion University, presented various results related to the Memento Aggregator. Michael showed that the best caching policy for a distributed search aggregator is 15 days. Michael found that the coverage of the archived web depends on the source of the URIs; his research on profiling the web archives illustrated the coverage per TLD and language.
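As a rough illustration of the distributed search approach with a cache, here is a small Python sketch. It is only a toy under stated assumptions, not the actual aggregator: the `Aggregator` class, its `archives` mapping of lookup functions, and the `timemap` method are hypothetical names, and no real archive is contacted. The 15-day value is the figure reported in the talk.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=15)  # the caching period reported in the talk


class Aggregator:
    """Toy distributed-search aggregator: ask every archive, cache the merged answer."""

    def __init__(self, archives, ttl=CACHE_TTL):
        # Hypothetical interface: archive name -> function(uri) -> list of capture datetimes.
        self.archives = archives
        self.ttl = ttl
        self.cache = {}  # uri -> (time the answer was cached, merged list)

    def timemap(self, uri, now=None):
        now = now or datetime.now(timezone.utc)
        hit = self.cache.get(uri)
        if hit and now - hit[0] < self.ttl:
            return hit[1]  # still fresh, so no archive is queried
        # Cache miss or stale entry: query every archive and merge by capture time.
        merged = sorted(
            (captured, name)
            for name, lookup in self.archives.items()
            for captured in lookup(uri)
        )
        self.cache[uri] = (now, merged)
        return merged
```

With this sketch, a second request for the same URI within 15 days is answered from the cache, and a request on or after the fifteenth day queries all archives again.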

David Rosenthal, from Stanford University, discussed "How to Make Memento Successful". Rosenthal discussed some problems with Memento and how to solve them. The first problem is that Memento needs client support, such as the Mementofox plugin, or gateway support, such as the British Library gateway. The second problem is the quality of the archived web pages: Memento should be able to indicate quality and filter pages based on it. Rosenthal also discussed the funding and publicity problems of the Memento aggregator. The open technical discussion about Memento emphasized the usage of Memento beyond web archives, such as in content management systems. Also, the Memento team should change the Memento language from requirements to benefits to encourage the archives to become compliant.

The education and training workshop discussed the various education and training opportunities provided by IIPC, such as the Web Archive Workshop at BnF, PhD sponsorship, and Staff Exchange.

In the evening, we had a tour of Ljubljana that included the National and University Library Slovenia; then we had dinner at the City Castle of Ljubljana.

Day 3: Wednesday, April 24, 2013
Working Group Meetings - IIPC Members.

The working group (Harvesting, Preservation, and Access) meetings were held on the third day. I attended the access working group. During the member updates, the National Diet Library of Japan announced the new interface for the Web Archiving Project (WARP), and California Digital Library announced the new update to the WAS service to include a Wayback Machine interface. Columbia University demonstrated the Human Rights Web Archive. Nicolas Giraud, from BnF, presented a Web Archive access plugin for the Internet Archive's Wayback Machine that indexes large CDX files using Elastic Search. The members of the access working group discussed the status of the shared projects, in addition to the new initiative to open the development of the Wayback Machine to a wider range of developers.
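For readers unfamiliar with CDX files, they are plain-text indexes with one line per capture. The Python sketch below shows the general idea of turning such lines into records grouped by URL key, which is the step that comes before loading them into any search index. It does not reproduce the BnF plugin and makes no Elastic Search calls; the `COLUMNS` order and the `SAMPLE` lines are assumptions for illustration, since real CDX files declare their own column layout in a header line.

```python
from collections import defaultdict

# Assumed column order for this sketch; real CDX files declare theirs in a header line.
COLUMNS = ["urlkey", "timestamp", "original", "mimetype", "status", "filename"]

# Hypothetical sample data, not taken from any real archive.
SAMPLE = """\
si,example)/ 20130101120000 http://example.si/ text/html 200 crawl-1.warc.gz
si,example)/ 20130422080000 http://example.si/ text/html 200 crawl-2.warc.gz
si,example)/about 20130422080500 http://example.si/about text/html 404 crawl-2.warc.gz
"""


def parse_cdx(text):
    """Yield one dict per capture line, skipping blank lines and the ' CDX' header."""
    for line in text.splitlines():
        if not line.strip() or line.startswith(" CDX"):
            continue
        yield dict(zip(COLUMNS, line.split()))


def build_index(records):
    """Group captures by URL key, oldest capture first."""
    index = defaultdict(list)
    for rec in records:
        index[rec["urlkey"]].append(rec)
    for captures in index.values():
        captures.sort(key=lambda r: r["timestamp"])
    return index


index = build_index(parse_cdx(SAMPLE))
latest = index["si,example)/"][-1]
print(latest["timestamp"], latest["filename"])
```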

Day 4: Thursday, April 25, 2013
Scholarly Access to Web Archives: Progress, Requirements and Challenges - OPEN TO PUBLIC.

The fourth day focused on the use of web archives by scholars and researchers. Janko Klasinc, from the National and University Library Slovenia, presented web archiving at the National and University Library of Slovenia. Janko announced that the Slovenian web archive is now public. The legal deposit law took effect in 2006. Janko explained the challenges of crawling .si domains, and demonstrated the new Wayback Annotator tool that enables users to create and annotate their own collections of archived pages.

Session 1: Scholarly Use of Web Archives 
Niels Brügger, from Aarhus University, gave a presentation entitled "Scholarly Use of Web Archives Initiatives and Fundamental Tools". Brügger gave an overview of three main projects: NetLab, RESAW, and FUTARC. NetLab is a research infrastructure project to study internet materials as part of the Danish Digital Humanities Lab. NetLab has various subprojects: Digital footprints (an API to get data from Facebook), Network analysis of Danish parliamentary elections 2001-2011, and Cross media production and communication. The RESAW project aims to integrate the existing national research infrastructures to study archived web materials; it includes 40 web archives, research groups, and scholars. RESAW is driven by researchers' questions and extends the web materials to include emails and apps. The FUTARC project aims to develop fundamental tools to study the web archives. FUTARC studied corpus creation problems such as identifying the URLs, the completeness that should be considered during archiving, and versioning challenges that include inconsistency in the temporal and spatial dimensions.

Sophie Gebeil, from Aix-Marseille-Université, presented "Memories of North African immigration on the web". Sophie showed that the web enables North African immigrants to remember forgotten events, such as the march for equality in 1983 and the general moves for independence in 1961. The witnesses of these events put their comments on the web. Historians need to understand web archives in order to use them in their research.

Meghan Dougherty, from Loyola University Chicago, presented "Researcher Engagement with Web Archives". Meghan conducted a set of interviews with archivists, technicians, and researchers between 2008 and 2010. Meghan summarized the researchers' obstacles in using web archives as follows: collaboration between the researchers and the technicians; the need to build customized tools that fit their work; and the need for help in selecting the right tool for their needs.

Helen Hockx-Yu, from the British Library, presented "Web archives in the eyes of scholars". In 2012, they surveyed 49 researchers divided into two groups: those who already use the Archive for research (26%) and those who have not used the Archive (74%). The study summarized what web archives require to fulfill scholarly use, such as availability of the data online; access to the collection crawl log, policy, and configuration; and the quality of the archived version. Helen found that accessing websites that are no longer available is the first use case for the web archive, so we can't compete with the live web.

Session 2: Research Tools for Scholarly Use of Web Archives 
Ditte Laursen and Bjarne Andersen presented their work on automated screen capturing to study the interplay between the real-time internet and live television in cross-media productions. The case study was "The Voice", which was broadcast on TV concurrently with updates and discussions on Facebook and the TV homepage. The Danish archive summarized the challenges as synchronization between the crawlers, the crawler's ability to capture micro changes, and the limitations of archiving streamed content. The Danish archive developed a tool to record the various streams concurrently; the tool comes with a web interface to manage the recording jobs.

Andrew Jackson, from the British Library, presented "Seeing in the dark: Discovery and data‐mining of restricted web archives". Andy discussed the problem of how to provide access to a restricted archive. The case study was the JISC collection of .uk sites from 1996-2010, where the challenge was to make the data publicly accessible. Andy suggested publishing the data as CC0 open data sets, providing richer APIs, and building analytical access services. From the .uk collection, Andy showed results for GeoIndexes, top-level domain linkage, formats, and MIME types. For the .uk collection, BL provides WAT files and full-text search.

Julien Masanès, from Internet Memory Foundation, presented "Leveraging social web to propel your archiving campaign". Julien started with the researchers' challenges in using the web archive: the temporal dimension of the data, the metadata about the data, and the format of the data (i.e., researchers can't use WARC files). Internet Memory realized the need for an infrastructure for research. First, they replaced WARCs with BigTable technologies. Second, they built an Extractor module on top of the data store. Finally, the user can build an execution chain pipeline on top of the Extractor. They have already built applications for near-duplicate multimedia video, named entity evolution, and temporal query.

Then, Julien demonstrated the Archive The Net tool, which finds materials relevant to a special event, based on the user's query, for archiving. The tool has three stages: Setup, where the user submits a new crawl including a name and description, in addition to target keywords and hashtags; Discovery, where the tool constructs a set of Twitter users who share links about the topic; and Collection, where the tool collects, crawls, and ranks the links to create a campaign for browsing or preservation. You can try the tool on Beta version.
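The three stages can be sketched in a few lines of Python. This is only an illustration of the flow, not the Archive The Net implementation: the tweet records, the `discover_users` and `collect_links` functions, and the ranking rule (count how many tweets from discovered users share a link) are all assumptions, since the talk did not detail how links are ranked.

```python
from collections import Counter

# Setup: the campaign's target keywords and hashtags (hypothetical values).
KEYWORDS = ["#iipc"]

# Hypothetical tweets; a real tool would read these from Twitter.
TWEETS = [
    {"user": "alice", "text": "#iipc GA opens in Ljubljana", "links": ["http://a.example/"]},
    {"user": "bob", "text": "Slides from #IIPC", "links": ["http://a.example/", "http://b.example/"]},
    {"user": "carol", "text": "Lunch", "links": ["http://c.example/"]},
    {"user": "alice", "text": "More reading", "links": ["http://a.example/"]},
    {"user": "dave", "text": "#iipc was fun", "links": []},
]


def discover_users(tweets, keywords):
    """Discovery: users who shared at least one link in a tweet matching a keyword."""
    wanted = [k.lower() for k in keywords]
    return {
        t["user"]
        for t in tweets
        if t["links"] and any(k in t["text"].lower() for k in wanted)
    }


def collect_links(tweets, users):
    """Collection: rank links by how many tweets from discovered users share them."""
    counts = Counter()
    for t in tweets:
        if t["user"] in users:
            counts.update(set(t["links"]))
    return [link for link, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


users = discover_users(TWEETS, KEYWORDS)
seeds = collect_links(TWEETS, users)
print(seeds)
```

The ranked `seeds` list would then be handed to a crawler; that step is left out of the sketch.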

Hugo Huurdeman, from the WebART team, presented "WebART: Facilitating Scholarly Use Of Web Archives". The WebART project aims to combine the development of the web archive with research activity in order to develop tools that help researchers. In addition to the Wayback Machine interface, WebART is supported by WebArtist, a full-text search for a Dutch news aggregator. WebART hosted a winter school at the University of Amsterdam where they gave the tools to researchers for analysis. The school's work resulted in various applications, for example, word frequency, co-word analysis, outlink analysis, geomapping location wire service, and image temporal listing. Hugo summarized the challenges as follows: data quality and quantity, the search system, and the user interface.

The last session was given by Clément Oury, from the National Library of France, about "Building a master's degree on digital archiving". The degree is offered by enssib, a librarian school in France. The colloquium combines theoretical modules, such as legal deposit and heritage, research-driven archiving, and third-party archiving, with practical experience in selection, harvesting, and access.

Day 5: Friday, April 26, 2013
Workshops - OPEN TO PUBLIC.

The last day was open to the public. IIPC ran two workshops: Twittervane, a crowdsourcing technique to capture seed URLs from Twitter; and Future Web, which discussed the new evolution of the Web and the current challenges in capturing, replaying, and data mining for archiving.

Abbey Potter from the Library of Congress wrote a great report about IIPC @10. Rosalie Lack, from California Digital Library, wrote about Five things she learned at IIPC. The presentations from the IIPC GA are available in the Resources section of the IIPC website.
Ahmed AlSum

1 comment:

  1. Thanks, Ahmed! Great report. Much easier than flying to Slovenia.