Friday, May 18, 2012

2012-04-30: IIPC 2012 GA, A Week with Archivists!


The International Internet Preservation Consortium (IIPC) held its 2012 annual general assembly from April 30 to May 4, 2012, at the Library of Congress in Washington, D.C. I compiled this report from tweets tagged #IIPC12 and my personal notes. In doing so, I tried to assess how complete a view of an event its tweets can give. More details about this approach will be in the blog comments.

The first day, April 30, 2012, was open to the public. It was entitled "The Broad Value of Web Archives: Demonstrated Use". @hhockx: #iipc12 opened just now by Martha Anderson. Laura Campbell welcomes the participants. @netpreserve: IIPC starts with 11 members, now has 42. @gregorylisa: Apropos quote at #IIPC12 "The great use of life is to spend it for something that will outlast it." William James.

@netpreserve: Gildas Illien from BnF sets the stage for researcher use case panel. @cleymour: Gildas Illien: from 3 to 6 university libraries in IIPC in 2012, signal of a stronger link with researchers. @cleymour: Kalev Leetaru from university of Illinois opens his speech with a demonstration of what big data is. Leetaru's talk was entitled "A decade and a half of archiving the web for data mining: Lessons learned and how users use web archives". @MarthaBunton: 8 billion words a day added to written record on Twitter daily. @MarthaBunton: Most web archives today are black boxes--need context of captures. @saraaubry: crawl scope and policies are helpful to evaluate the original websites and its archives. @HRDocumentation: What's worth keeping on the web? Need to document why we choose to capture certain content. @cleymour: big issue: no way to access the archive in its entirety, because archives designed primarily for a small usage.


@netpreserve: Next up: Ian Soboroff from NIST. His talk was entitled "How web archives are used in the Text REtrieval Conference (TREC)". @lljohnston: Ian Soboroff talking about research into search and the TREC program. Search is not a fully solved problem. @marthaindc: Everything you do has search built into it--email, mobile phone, word processors. Ian Soboroff: search is indispensable. Ian discussed how to decide whether a response is a good answer to a search query.

@netpreserve: Bruce Hoffman from Georgetown University now talking about use of web archives for terrorism research. Between 1998 and 2006, new communication media came into use. @alexisan75: Hoffman: terrorists can frame their own message on Web in ways not possible before Internet. @marthaindc: if you don't have a website, you don't exist as a terrorist. @agrotke: Most terrorist websites today password protected. Major barrier for archivists.

@netpreserve: Now up: Stuart Shulman from UMASS on challenges for researching the social web. @cleymour: access costs are less related to the size in bytes of a collection but to the number of items. @lljohnston: Storing data does not mean access, the former is getting cheaper, the latter is expensive. Shulman gave some examples such as: @gregorylisa: Interesting: infoextractor.org takes social media comments and extracts them as XML.

@netpreserve: Afternoon sessions begin with session on trends in archive use. Claude Mussou and David Rapin from Institut national de l'audiovisuel (INA) gave a talk entitled "Data Mining in News Data from Multiple Media". @cleymour: presenting Observatoire Transmedia (web, radio, TV, press) with the example of GM corn debate end 2011. The process starts with homogenizing the metadata into a common document format, then extracts images from the Web (images) and TV (keyframes), and finally runs a time-aware search engine.

@netpreserve: Next is Monica Omodei from National Library of Australia speaking on trends in Pandora archive. @kboughida: Bad news no legal deposit legislation (Australia) missed web collecting content lack of permission. After analyzing the access logs, they found several patterns of use, such as: @hhockx: People tend to use web archives to look for websites which have disappeared from the live web, former Prime Minister website that has been changed after the election, redirection to a new name or used as an archive from the live site.

@kboughida: Now Clément Oury and Peter Stirling from Bibliothèque nationale de France (BnF) presented "Actual and potential users of the BnF web archives: experiences and expectations". BnF offers on-site access only, and access is restricted to researchers. There are three ways of consulting the archive: full-text search (very limited), URL search, and guided tools for selecting sites. @kboughida: BNF: researchers in political and social sciences are the most frequent users + some personal searches. For detailed results of the BnF study on potential users' needs: here.

The following session covered business use of Web archives. @netpreserve: Rod Wittenberg from Reed Technology on web archives use in legal industry. @hhockx: Archiving websites as evidence to fight piracy and infringement. @cleymour: Rod Wittenberg: examples of cases where the judge decided if an archived website was authenticated or not. Reed Technology preserves the fake sites at the highest quality so that the captures can be used in court. @kboughida: Rod Wittenberg: re Rule 901. AUTHENTICATING OR IDENTIFYING EVIDENCE for courts Lawyers create pdf from web pages. @kboughida: Rod Wittenberg: sha-1 is used for every web page as digital signature. Accepted as legal.

Another example from the industry was Hanzo Archive. Mark Williams gave a talk entitled "Web archives to meet regulatory, management, e-discovery and cultural heritage needs". @hanzoman: Where web archives fit into E-Discovery Reference Model EDRM.

@netpreserve: Last panel of the day: web archive use in the public sphere. @gregorylisa: Next up at #IIPC12, my colleague Kathleen Kenney on CINCH. She spoke about "Harvesting from the harvest" for automatic extraction of state government publications from web archives. @kboughida: CINCH "capture ingest and checksum" funded by @US_IMLS. @kboughida: CINCH will use duracloud for archiving.

@kboughida: Now Leïla Medjkoune from Internet Memory: How can Web Archives become a critical component of today's Internet? @gregorylisa: Internet Memory provides redirection service. Directs users to web archives instead of giving a 404 error.
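The redirection idea can be sketched in a few lines. This is not Internet Memory's implementation; the archive prefix below is a made-up placeholder and the function name is hypothetical. It only shows the decision: when the live site answers 404, send the user to an archived copy instead.

```python
# Placeholder archive location, assumed for illustration only.
ARCHIVE_PREFIX = "https://archive.example.org/web/"


def resolve(url: str, live_status: int, archive_prefix: str = ARCHIVE_PREFIX):
    """Return (status, location) for a request, falling back to the archive on 404."""
    if live_status == 404:
        # Redirect to the archived copy rather than showing an error page.
        return 302, archive_prefix + url
    # Any other status is passed through unchanged.
    return live_status, url
```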

@kboughida: Now Kent Norsworthy from UT Austin Web Archiving as part of a Research Library Special Collection: the Latin American Gov Docs. @cleymour: Kent Norsworthy: web archiving activity may help redefining the role and identity of the research library. @tjowens: Great web archive special collection "The Latin American Government Documents Archive". @netpreserve: MarthaBunton wraps up the open session with a story on origami and web archives. Thanks for a great day!

Tuesday May 1, 2012 - General Assembly


@netpreserve: About to start day 2 of #IIPC12 - the general assembly meeting of IIPC members. @netpreserve: You may have noticed we went from orange to green with our avatar, we are rolling out our new logo today! @kboughida: @MarthaBunton: explaining the new logo: Angle brackets for tech work. Blue color of trust + green

@netpreserve: New member updates this a.m.: welcome Columbia University. @MarthaBunton: Robert Wolven, Columbia University Libraries, new IIPC member. Wolven gave an overview of "Web Archiving at Columbia University: Collecting Web Content for Research". @kboughida: R.Wolven @columbialib collecting focus is thematic and complementing archival collections, output of Columbia web output. The currently available collections are Human Rights, Historic Preservation, New York City Religious Institutions, and Carnegie Corporation.

@kboughida: Daniel Chudnov, George Washington University talking now about GW libraries and web archiving. By the end of 2012, their goal is to select in depth and capture a modest collection of web archive materials. Collection areas are history and international relations, matching research at GW. Dan also confirmed that @kboughida: developers are people too (audience laughs).

@netpreserve: Welcome new IIPC member National Library of Estonia. Jaanus Kõuts gave a talk entitled "Estonian Web Archive: Preserving the Estonian Mind". @agrotke: Estonia: legal deposit law in 2006 covered web publications. @lljohnston: Amazing level of use of web and e-services in Estonia. Obviously a need for web preservation with so much online. @cleymour: An interesting paper on Estonia and Internet to complement Jaanus Kõuts speech at the Battle Of Internet.

@netpreserve: We now welcome Herbert Van de Sompel and Los Alamos National Laboratory - also new in 2012. @kboughida: LANL contribution aDORe, mod_oai, OAI ORE, memento. @kboughida: google sitemap was influenced by OAI but didn't work out later. @cleymour: Los Alamos will soon release a server-side transactional web archiving software that is able to produce WARC files.

After that, we started the IIPC project updates. @MarthaBunton: Memento aggregator project is bringing together IIPC metadata: Rob Sanderson from Los Alamos National Laboratory. The project's goal is to aggregate the metadata of the IIPC's distributed archives. @kboughida: 4 BnF family members: How to fit in? Integrate a web archiving program in your organization. @netpreserve: IIPC project updates: BnF staff onstage announcing plans to share their daily experience archiving at Paris workshop. @agrotke: Bnf workshop covers internal advocacy, training, integrating web archiving into operations and workflows, more looks fantastic.

@netpreserve: Nicholas Clarke from netarchive.dk updating on JhoNAS. Helen Hockx-Yu from British Library gave an update about Twittervane, a crowdsourcing approach to web archiving. @MarthaBunton: Crowdsourcing selection for web archives using Twitter references. @lljohnston: @hhockx Presenting on Twittervane project to mine URLs from tweets to aid in just-in-time web archiving selection.
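As a rough illustration of mining URLs from tweets, here is a minimal sketch. It is not Twittervane's code: the regular expression, the function names, and the ranking by number of tweets are all assumptions made for this example, and shortened links are not resolved.

```python
import re
from collections import Counter

# Simplified pattern: anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweet: str) -> list:
    """Return the URLs found in one tweet, with trailing punctuation stripped."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(tweet)]


def rank_urls(tweets) -> list:
    """Count how many tweets mention each URL and return (url, count) pairs, most mentioned first."""
    counts = Counter()
    for tweet in tweets:
        # Use a set so a URL repeated inside one tweet is counted once.
        counts.update(set(extract_urls(tweet)))
    return counts.most_common()


sample = [
    "Read http://example.org/a now!",
    "Also http://example.org/a, and http://example.org/b",
    "no links here",
]
ranking = rank_urls(sample)
```

The most frequently mentioned URLs would then be candidates for a curator to review before crawling.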

Then, the IIPC Member Updates & Announcements session began. @saraaubry: LoC Abbie Grotke and N. Taylor on stage to present Library of Congress updates. @hvdsomp: Library of Congress web archive grows 6 TB per month with total 315+ terabytes of content collected.

Rick Fitzgerald from the Library of Congress gave an update about "HIVE for LC Web Archives: Web Archives and Automatic Subject Indexing". @hvdsomp: Ongoing experiment at Library of Congress looks into automatic classification of web archive with LCSH.

Leïla Medjkoune from Internet Memory gave an update about LAWA (Longitudinal Analytics of Web Archive Data). Analysis of Web data is essential for many R&D services; the challenges include scalability, selection of essential resources, multilingual support, and the time dimension.

Masaki Shibata from National Diet Library gave an update about "Web Archiving in 2012 at National Diet Library". Masaki presented "Web Archiving 3.11 Japanese earthquake & Tsunami". Crawling started on a daily basis; over time the frequency dropped to weekly, then monthly. The volume of data is about 4 TB/month. He announced that a new deduplication system will be in use by the end of 2012.

David Walls from Government Printing Office spoke about "Challenges and Opportunities in the Absence of Legal Deposit". GPO faces the challenge of designing a web harvesting and archiving program that provides permanent public access to born-digital, web-based Federal Government information. @cleymour: GPO signing MOUs with Federal Agencies to harvest their publications.

Helen Hockx-Yu from British Library gave the British Library update. @netpreserve: Helen highlighting recent activities at British Library: new access tool, QA module improvements to web curator tool.

@kboughida: Barbara Sierman from KB National Library of the Netherlands talking about SCAPE project SCAlable Preservation Environments. Five IIPC members were involved in SCAPE to provide infrastructure and tools for scalable preservation actions, a framework for automated, QA preservation workflows and integration of these components with policy-based automated preservation planning and watch.

Mar Pérez Morillo from National Library of Spain gave an update about the Spanish Legal Deposit Law. @MarthaBunton: Legal Deposit law in Spain updated in 2011 to include web sites.

@netpreserve: Final member update of the day from Libor Coufal from Czech National Library about Havel collaborative archive. After the death of Václav Havel, they called for collaboration with other institutions. They collected 479 URLs from 14 countries; the crawl currently stands at 256.3 GB.

Wednesday May 2, 2012 - Working Group meetings

Working group meetings were held on the third day. @netpreserve: Working group meetings starting today: access, harvesting, and preservation.

Thursday May 3 2012 - Workshops & Cross Working Group meetings

The fourth day was for workshops and cross working group meetings. The Web Archiving "Lifecycles" Workshop started with an introduction from Kris Carpenter from Internet Archive. Kris discussed the main challenges in the Web archiving life cycle, and an open discussion of these challenges followed. The NetarchiveSuite workshop ran in parallel. It covered the curatorial and technical aspects of integrating NetarchiveSuite into an automated Web harvesting workflow and using it daily, with a focus on crawl preparation: scheduling, packaging, configuring, and data structuring.

There were two other afternoon workshops. The Legal Roundtable brought together Web archivists and lawyers to discuss and compare the impact of international and national legislation and policies on Web archiving activities. In parallel ran the "Harvesting and Preserving the Future Web" workshop, which was divided into three panel discussions: Capture, Replay, and Scale. David Rosenthal wrote about this workshop in his blog.

Friday May 4, 2012 - Workshops

The last day had three workshops. The crowdsourcing workshop discussed crowdsourcing methods to engage the public and improve the selection of and access to Web archives. The Unified Digital Format Registry (UDFR) workshop was a full-day workshop on understanding the UDFR system and its service. The final workshop was the ISO workshop on metrics and quality.

Abby Potter wrote a complete report about the IIPC 2012 GA on The Signal.

MLN 2012-06-24 edit: The IIPC 2012 GA was also mentioned in the June 2012 LC Digital Preservation Newsletter.  Also, Kalev Leetaru's talk has been posted in three parts at LC's blog (part 1, part 2, part 3). 

----
Ahmed AlSum