2015-05-09: IIPC General Assembly 2015 Trip Report

The day before the International Internet Preservation Consortium (IIPC) General Assembly 2015 we landed in San Francisco, where some delicious Egyptian dishes were waiting for us. Thank you Ahmed, Yasmin, Moustafa, Adrian, and Yusuf for hosting us. It was a great way to spend the evening before the IIPC GA, and we were delighted to see you all after a long time.

@WebSciDL reunion before #iipcGA15 @phonedude_mln @mart1nkle1n @yasmina_anwar @ibnesayeed @hvdsomp @mousta pic.twitter.com/orSFWGULSW
— Ahmed AlSum (@aalsum) April 27, 2015

Day 1

We (Sawood Alam, Michael L. Nelson, and Herbert Van de Sompel) entered the conference hall a few minutes after the session had started, when Michael Keller from Stanford University Libraries was about to leave the stage after his welcome speech. IIPC Chair Paul Wagner gave brief opening remarks and invited the keynote speaker, Vinton Cerf from Google, to the stage. His talk, titled "Digital Vellum: Interacting with Digital Objects Over Centuries", was informative and delightful. He mentioned that high-density, low-cost storage media are evolving, but the devices to read them might not last long. While mentioning Internet-connected picture frames and surfboards, he added that we should not forget about security. To emphasize the security aspect, he gave the example that grandparents would love to see their grandchildren in those picture frames, but would not be very happy to see something they do not expect.
Moving on to software emulators, he invited Mahadev Satyanarayanan (Satya) from Carnegie Mellon University to talk about their software archive and emulator called Olive Archive. Satya gave various live demos including the Great American History Machine, ChemCollective (a copy of the website frozen at a certain time), PowerPoint 4.0 running in Windows 3.1, and the Oregon Trail, all powered by their virtual machines and running in a web browser. He also talked about the architecture of the Olive Archive and how, in the future, multiple instances could be launched and orchestrated to emulate a subset of the Internet for applications that rely on external services, with some instances running those services independently.
In the Q&A session someone asked Cerf how to ask big companies like Google to provide the data from their Crisis Response efforts for archiving once they are done with it. Cerf responded, "you just did," while acknowledging the importance of such data for archival. Here are some tweets that captured the moments:

High density, low cost storage media, but the devices to read them may not last long, says @vgcerf #iipcGA15
— Michael Widner (@mwidner) April 27, 2015

@vgcerf explains that there are not many places storing software, which you need in the future (to interpreting the bits). #iipcga15
— Helen Hockx (@hhockx) April 27, 2015

Cerf not sure that Google keeps its own software very long if it’s no longer used #iipcGA15
— Jane Winters (@jfwinters) April 27, 2015

Now seeing an example of the Great American History Machine, created in the late 1980s. Working now! #iipcGA15 pic.twitter.com/ZhxHP0zYFz
— Ian Milligan (@ianmilligan1) April 27, 2015

#iipcga15 Mahadev Satanarayanan highlights how @OliveArchive cleanly separates VM storage & execution, and how entire stack is open source
— Ina DL Web (@inadlweb) April 27, 2015

#iipcGA15 Awesome keynotes! Time for (hard) questions. pic.twitter.com/AcoHtkTs60
— Sabine Hartmann (@skhartmann) April 27, 2015

After the break Niels Brügger and Ditte Laursen presented their case study of the Danish websphere under the title "Studying a nation's websphere over time: analytical and methodological considerations". Their study covered website content, file types, file sizes, backgrounds, fonts, layout, and, more importantly, the domain names. They also raised points such as the size of the ".dk" domain, geolocation, the inter- and intra-domain link network, and whether Danish websites are actually in the Danish language. They talked about some crawling challenges. Their domain name analysis shows that only 10% of owners own 50% of all the ".dk" domains. I suspected that this result might be due to private domain name registrations, so I talked to them later; they said they had not considered private registrations, but would revisit their analysis.

Brugger: size of .dk domain, file types, size of individual websites, where are websites located, networks within & between sites #iipcGA15
— Jane Winters (@jfwinters) April 27, 2015

#iipcGA15 Domain owners change. Top 10% of domain owners own 50% of the domains.
— Jackie Dooley (@minniedw) April 27, 2015

Between 2012 and 2015 14% of .dk domains changed ownership #iipcGA15
— Jane Winters (@jfwinters) April 27, 2015

"No 1:1 relation between Danish national archive and the Danish national web domain" #iipcGA15
— Yasmina Anwar (@yasmina_anwar) April 27, 2015

Andy Jackson from the British Library took the stage with his presentation titled "Ten years of the UK web archive: what have we saved?". This case study covers three collections: the Open Archive, the Legal Deposit Archive, and the JISC Historical Archive. These collections store over eight billion resources in over 160TB of compressed files and are now growing by about two billion resources per year. With the help of a nice graph he illustrated that not all ".uk" domains are interlinked, so to maximize coverage the crawlers need to include other popular TLDs such as ".com". He also presented an analysis of reference rot and content drift utilizing the "ssdeep" fuzzy hash algorithm. Their analysis shows that 50% of resources are unrecognizable or gone after one year, 60% after two years, and 65% after three years.
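
To make the idea of measuring content drift a little more concrete, here is a minimal sketch of how an archived copy could be compared with the live page. It is only an illustration: it uses `difflib` from the Python standard library as a stand-in for ssdeep (it is not the ssdeep algorithm), and the function names, labels, and the 0.5 threshold are my own assumptions, not values from Andy's talk.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(old_text: str, new_text: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two versions of a page.

    Stand-in for a fuzzy hash comparison; this is NOT ssdeep.
    """
    return SequenceMatcher(None, old_text, new_text).ratio()


def classify(archived: str, live: Optional[str], threshold: float = 0.5) -> str:
    """Label a resource by comparing its archived copy with the live one.

    `threshold` is an arbitrary illustrative cut-off (an assumption).
    """
    if live is None:
        return "gone"            # reference rot: the URL no longer resolves
    if archived == live:
        return "identical"
    if similarity(archived, live) >= threshold:
        return "similar"         # some content drift, still recognizable
    return "unrecognizable"      # heavy content drift


if __name__ == "__main__":
    print(classify("The quick brown fox jumps", "The quick brown fox leaps"))
```

Running such a comparison over the same set of URLs one, two, and three years after capture would yield the kind of "unrecognizable or gone" percentages quoted above.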

Fascinating - the halo around the middle cluster are .uk sites you can only find through other TLDs. #iipcGA15 pic.twitter.com/NvjJUfcJo0
— Ian Milligan (@ianmilligan1) April 27, 2015

Vanishing .uk domains. @anjacks0n at #iipcGA15 pic.twitter.com/t6H4ba0feV
— Katrin Weller (@kwelle) April 27, 2015

I had lunch with Scott Fisher from the California Digital Library. I told him about the various digital library and archiving research projects we are working on at Old Dominion University, and he described the holdings of his library and the challenges they have in upgrading their Wayback to bring Memento support.
After lunch, the keynote speaker of the second session, Cathy Marshall from Texas A&M University, took the stage with a very interesting title, "Should we archive Facebook? Why the users are wrong and the NSA is right". She motivated her talk with some interview-style dialogues built around the primary question, "Do you archive Facebook?", to which the answer was mostly "No!". She highlighted that people have developed a [wrong] sense that Facebook is taking care of their stuff, so they do not have to. She also noted that people usually do not value their Facebook content, or they think it has immediate value but no archival value. In a large survey she asked whether Facebook should be archived; three fourths objected and half of them said "No" unconditionally. In the later part of her talk, she built the story of the marriage of Hal Keeler and Joan Vollmer by stitching together various cuttings from local newspapers. I am not sure I could fully appreciate the story due to the cultural difference, but I laughed when everyone else did. I did, however, follow her efforts and intention to highlight the need to archive social media for future historians. And if someone asks me whether the NSA is right, my answer would be, "Yes, if they do it correctly with all the context included."

"click bait" slide for @ccmarshall's talk at #iipcGA15 pic.twitter.com/pW2wJNEO0z
— Michael L. Nelson (@phonedude_mln) April 27, 2015

Interview questions regarding to the preservation of data on fb #iipcGA15 pic.twitter.com/4GmKZ4EVvi
— Yasmina Anwar (@yasmina_anwar) April 27, 2015

Archiving Facebook's public vs. private data: maybe not the same challenge #iipcGA15
— Emmanuelle Bermes (@figoblog) April 27, 2015

C. Marshall: impossible to reconstruct a story like Vollmer's today because you'd have to rely on Facebook volatile data #iipcGA15
— Emmanuelle Bermes (@figoblog) April 27, 2015

The majority of #iipcGA15 attendees love the idea of #facebook archive!!
— Yasmina Anwar (@yasmina_anwar) April 27, 2015

#iipcGA15 try #WAIL and #WARCreate by @WebSciDL @machawk1 to #archive #Facebook https://t.co/jRl0moLpAz
— Sawood Alam (@ibnesayeed) April 27, 2015

Meghan Dougherty from Loyola University Chicago and Annette Markham from Aarhus University presented their talk "Generating granular evidence of lived experience with the Web: archiving everyday digitally lived life". They illustrated how people, sometimes intentionally and sometimes unintentionally, record moments of their lives with different media. Among the various visual illustrations, I particularly liked the video of a street artist playing with a ring, which was posted on Facebook in a very different context from the one in which it appeared on YouTube. They ended their talk with a hilarious video about Friendster.

#iipcGA15 context matters https://t.co/FaVn23uZob
— Sawood Alam (@ibnesayeed) May 8, 2015

@mdocx1 questions how well web archives capture everyday digital lived life. #iipcGA15
— Helen Hockx (@hhockx) April 27, 2015

Megan Dougherty quote rebecca solnit tyranny of the quantifiable that which can be measured takes priority over that which cannot #iipcGA15
— rosalie lack (@rosalielack) April 27, 2015

"She forgets the camera, or rather treats the laptop camera as a close friend"—@mdocx1 on our relationship w/our digital lives #iipcGA15
— David Moles (@chronodm) April 27, 2015

@mdocx1 suggests a StoryCorp of the Web - link for the original StoryCorp http://t.co/kNG97fqMo4 which goes to @librarycongress #iipcGA15
— Abbie Grotke (@agrotke) April 27, 2015

#iipcGA15 @mdocx1 So funny video on "Friendster discovered by an Internet archeologist" fromThe Onion YT account : https://t.co/nHDl0k6Asd
— Ina DL Web (@inadlweb) April 27, 2015

Susan Aasman from the University of Groningen presented her talk "Everyday saving practices: "small data" and digital heritage strategies". The talk was full of motivation for why people should care about personal archives of their daily life moments. She described how the service Kodak Gallery launched in 2001 with the tag-line "live forever" and closed in 2012 after transferring billions of images to Shutterfly, which was only available to US customers. As a result, people from other countries lost their photo memories. She also played Johan Kramer's Bye Bye Super 8 video, which was amusing and motivating for personal archiving.

#iipcGA15 @aasmanna Family memories as technological memories. Engaging talk about personal archiving. pic.twitter.com/yskvw7FxHY
— Sabine Hartmann (@skhartmann) April 27, 2015

In 2001 Kodak launched a website promising to preserve everyone’s photos online... and failed @ISSN_IC #iipcGA15
— Emmanuelle Bermes (@figoblog) April 27, 2015

Aasman's project "Changing platforms of ritualized memory practices: the cultural dynamics of home movies" http://t.co/ZCkazCMBnb #iipcGA15
— Katrin Weller (@kwelle) April 27, 2015

Johan Kramer - Bye Bye Super 8 https://t.co/3rFSnS5hL8 #iipcGA15
— Michael L. Nelson (@phonedude_mln) April 27, 2015

After a short break Jane Winters from the Institute of Historical Research, Helen Hockx-Yu from the British Library, and Josh Cowls from the Oxford Internet Institute took the stage with their topic "Big UK domain data for Arts and Humanities", also known as the BUDDAH project. Jane highlighted the value of archives for research and described the development of a framework to help researchers leverage the archives. She illustrated the interface for the Big Data analysis of the BUDDAH project, described the planned output, and presented various case studies showing what can be done with that data.

What is a web archive ? https://t.co/TaGnIQBCwF #iipcGA15
— Emmanuelle Bermes (@figoblog) April 30, 2015

BUDDAH: big uk domain data for arts & humanities http://t.co/dT0lfTYEEq aims at valuing web archives as a material for research #iipcGA15
— Emmanuelle Bermes (@figoblog) April 27, 2015

#iipcGA15 ... inform collection development and access at BL, train researchers in use of big data. Great acronyn: the BUDDHA project.
— Jackie Dooley (@minniedw) April 27, 2015

Helen Hockx-Yu began her talk "Co-developing access to the UK Web Archive" with reference to the earlier talk by Andy. She noted that a scenario that fits everyone's needs is difficult to find. She described the high-level requirements, including query building, corpus formation, annotation and curation, and in-corpus and whole-dataset analysis. She illustrated the SHINE interface, which provides features like full-text search, multi-facet filters, query history, and result export.

#iipcGA15 @hhockx presents SHINE prototype for advanced academic use of web archive
— Ina DL Web (@inadlweb) April 27, 2015

#iipcGA15 @hhockx SHINE project : FT search, multifacet search, Ngram,trend analysis and access to data behind http://t.co/CTI5U1vx9e
— Ina DL Web (@inadlweb) April 27, 2015

"It's really not a choice between 'big' or 'small' data… what we really need is the flexibility to move between the two."—@hhockx #iipcGA15
— David Moles (@chronodm) April 27, 2015

Finally, Josh Cowls presented his talk about the book "The Web as History: Using Web Archives to Understand the Past and the Present", to which he contributed a chapter. He talked about four second-level domains under the ".uk" TLD (".co.uk", ".org.uk", ".ac.uk", and ".gov.uk") and how they are interlinked. He described the growth of the web presence of the BBC and British universities.

Awesome chart!! @joshcowls at #iipcGA15 pic.twitter.com/HpsbIBsJKw
— Michael Corbett (@Reloaded2Boot) April 27, 2015

#iipcGA15 Mapping the UK Webspace: Fifteen Years of British Universities on the Web http://t.co/voe0rXJYT7
— Michael L. Nelson (@phonedude_mln) April 27, 2015

Fun fact #iipcGA15 : to demonstrate the interest of web archives as material for history, write a book and put it online
— Emmanuelle Bermes (@figoblog) April 27, 2015

#iipcGA15 @JoshCowls In the book The web as history a focus on the evolution of BBC online presence
— Ina DL Web (@inadlweb) April 27, 2015

IIPC Chair Paul Wagner concluded the day by emphasizing that we have only started to scratch the surface. He also noted in his concluding remarks that context matters.

Day 2

Herbert Van de Sompel from Los Alamos National Laboratory started the second day's sessions by talking about "Memento Time Travel". He started with a brief introduction to Memento, followed by a bag full of announcements. For ease of use in JavaScript clients, Memento now supports JSON responses along with the traditional Link format. The Memento aggregator now provides responses in two modes, DIY (Do It Yourself) and WDI (We Do It). The service now also allows exporting the Time Travel Archive Registry in a structured format. Thanks to default Memento support in Open Wayback, various web archives now natively support Memento. There is an extension available for MediaWiki to enable Memento support in it. Herbert described Robust Links (Hiberlink) and how they can be used to avoid reference rot. He said that usage of their service is growing, so they upgraded the infrastructure and are now using the Amazon cloud for hosting services. He noted that, going forward, everyone will be able to participate by running Memento service instances in a distributed manner to provide load balancing. He also demonstrated Ilya's work on constructing composite mementos from various sources to minimize temporal inconsistencies while visualizing the sources of the mementos.
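
For readers who have not seen the traditional Link format mentioned above, here is a small sketch of how a client might pull mementos out of a Memento-style `Link` header. The sample header is made up (`a.example.org` and `archive.example.net` are placeholders), the parser assumes every parameter value is double-quoted, and the function names are my own; it is a simplification for illustration, not the official client code.

```python
import re
from email.utils import parsedate_to_datetime

# One link-value: <URI> followed by zero or more ;name="value" parameters.
# Simplifying assumption: all parameter values are double-quoted.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

# Made-up sample header; the hosts are placeholders, not real archives.
SAMPLE_LINK_HEADER = ", ".join([
    '<http://a.example.org/>; rel="original"',
    '<http://archive.example.net/timemap/http://a.example.org/>'
    '; rel="timemap"; type="application/link-format"',
    '<http://archive.example.net/20100101000000/http://a.example.org/>'
    '; rel="last memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT"',
    '<http://archive.example.net/20010911203610/http://a.example.org/>'
    '; rel="first memento"; datetime="Tue, 11 Sep 2001 20:36:10 GMT"',
])


def parse_link_header(value):
    """Turn a Link header string into a list of dicts (uri, rel, datetime)."""
    links = []
    for uri, params in LINK_RE.findall(value):
        attrs = dict(PARAM_RE.findall(params))
        link = {"uri": uri, "rel": attrs.get("rel", "").split()}
        if "datetime" in attrs:
            # HTTP-date strings parse with the standard library
            link["datetime"] = parsedate_to_datetime(attrs["datetime"])
        links.append(link)
    return links


def mementos(links):
    """Return only the memento links, oldest capture first."""
    found = [l for l in links if "memento" in l["rel"] and "datetime" in l]
    return sorted(found, key=lambda l: l["datetime"])


if __name__ == "__main__":
    for m in mementos(parse_link_header(SAMPLE_LINK_HEADER)):
        print(m["datetime"].isoformat(), m["uri"])
```

The JSON responses announced in the talk spare JavaScript clients this kind of string parsing, which is presumably why they were added.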

.@hvdsomp kicking off #iipcGA15 Day One with a presentation on Memento. They’ve got a fantastic, well-documented API! http://t.co/YIGMTDbzGx
— Ian Milligan (@ianmilligan1) April 28, 2015

Day 2 of #iipcGA15 starts with Herbert Van de sompel pic.twitter.com/bD3aGk2elM
— IIPC (@NetPreserve) April 28, 2015

Memento for Chrome #iipcGA15 https://t.co/dMyGIZ3zXg adds #memento capability for your browswer. see also: http://t.co/ValYQrslkD
— Michael L. Nelson (@phonedude_mln) April 28, 2015

Another #Memento extension for #Chrome: Mink. see: http://t.co/QzAXp6IOsm #iipcGA15
— Michael L. Nelson (@phonedude_mln) April 28, 2015

18 public web archives + http://t.co/YdWmuXgLak + http://t.co/AwA61HKm4w + #MediaWikis aggregated. see: http://t.co/5GpeUFY4H6 #iipcGA15
— Michael L. Nelson (@phonedude_mln) April 28, 2015

Good guide on Robust Links from the Memento Project - what you can do as a web page author, user, etc. #iipcGA15 http://t.co/Dlh9lDoquJ
— Ian Milligan (@ianmilligan1) April 28, 2015

Hiberlink http://t.co/bJC1cr1d4A addresses "reference rot" in scholarly citations #iipcGA15
— Michael Widner (@mwidner) April 28, 2015

some results of the #Hiberlink project on reference rot in scholcom http://t.co/V0jsh8QEuX @PLOSONE @hvdsomp #iipcGA15
— Martin Klein (@mart1nkle1n) April 28, 2015

Time travel reconstruct. @hvdsomp in #iipcGA15 pic.twitter.com/nCU1CArQGK
— Yasmina Anwar (@yasmina_anwar) April 28, 2015

Replay of archived websites relies on patching resources not necessarily crawled at the same time. Not something known to users. #iipcGA15
— Helen Hockx (@hhockx) April 28, 2015

Daniel Gomes from the Portuguese Web Archive talked about "Web Archive Information Retrieval". He started by classifying web archive information needs into three categories: Navigational, Informational, and Transactional. He noted that the usual way of accessing an archive is lookup by URL, which users might not know. An alternative is full-text search, which poses the challenge of relevance. Daniel described various relevance models in great detail and how to select features to maximize relevance. He announced that all the datasets and code are available for free under an open source license. The code is hosted on Google Code, but because Google has announced that it is sunsetting the service, the code will be migrated to GitHub soon.

#iipcGA15 Map of web archiving around the world. We are doing well. But still so much room to grow. @dcgomes77 pic.twitter.com/cR2BGWiouO
— Sabine Hartmann (@skhartmann) April 28, 2015

"A full text index is like a huge book glossary" @dcgomes77 #iipcGA15
— Yasmina Anwar (@yasmina_anwar) April 28, 2015

Machine learning for web archives content discovery.. using in links, term frequency, etc. Very cool. #iipcGA15 pic.twitter.com/YV9zjwR66o
— Ian Milligan (@ianmilligan1) April 28, 2015

#iipcGA15 68 ranking features--too many to put into production. Using URL, title, text body, anchor text of incoming link.
— Jackie Dooley (@minniedw) April 28, 2015

Search the Past with the Portuguese Web Archive #iipcGA15 http://t.co/ap6tRwNJh6 #www2013 https://t.co/0xRGZTYTMC
— Michael L. Nelson (@phonedude_mln) April 28, 2015

Here’s the Google Code repository for the Portuguese Web Archive: looking forward to checking it all out. #iipcGA15 https://t.co/MeBRdRO0Ta
— Ian Milligan (@ianmilligan1) April 28, 2015

After this talk, there was a short break followed by the announcement that the remaining sessions of the day would run in two parallel tracks. It was hard to choose one track over the other, but I can watch the sessions I missed later, when the video recordings are made available. Later, the parallel sessions were interfering with each other, so the microphone was turned off.

@NetPreserve only if I could #Memento #TimeTravel to attend both sessions. I will be looking forward for the #iipcGA15 session recordings.
— Sawood Alam (@ibnesayeed) April 28, 2015

#iipcGA15 Sometimes it is good we can still get by without microphones.
— Sabine Hartmann (@skhartmann) April 28, 2015

After the break Ilya Kreymer gave a live demo of his recent work, "Web Archiving for all: Building WebRecorder.io". He acknowledged the collaboration with Rhizome and announced the availability of an invite-only beta implementation of WebRecorder. He demonstrated how WebRecorder can be used to perform personal archiving in What You See Is What You Archive (WYSIWYA) mode.

#iipcGA15 use #WebRecorder.io beta for #personal #archiving #WYSIWYA like #Facebook #Twitter @webrecorder_io
— Sawood Alam (@ibnesayeed) April 28, 2015

#iipcGA15 Webrecorder.io: on-demand archiving via browser. WYSIWYA: what u see is what you archive. Available to anybody. quality > quality.
— Jackie Dooley (@minniedw) April 28, 2015

Demo-ing webrecorder.io ability to record Facebook while you're logged in and Vines - good stuff! #iipcGA15 #webarchiving
— Web Archiving RT (@WebArch_RT) April 28, 2015

Ilya Kremer: http://t.co/xb01j9mJC6 built on the top of pywb https://t.co/yG0snopyd8 and warcprox https://t.co/2e0QcHsmfl #iipcGA15
— Ahmed AlSum (@aalsum) April 28, 2015

Ilya Kreymer is looking for collaborators, developers, UI designers, and archivists to move webrecorder.io to the next level #iipcGA15
— Ahmed AlSum (@aalsum) April 28, 2015

public demos for beta webrecorder.io http://t.co/J1eqeqQW3H -- also #Memento compliant #iipcGA15
— Michael L. Nelson (@phonedude_mln) April 28, 2015

Zhiwu Xie from Virginia Tech presented "Archiving transactions towards an uninterruptible web service". He described an indirection layer between the web application server and the client that archives each successful response; when the server returns a 4xx/5xx failure response, it serves the most recent copy of the resource from the transactional archive. From the client's perspective it is similar in functionality to services like CloudFlare, but it has the added advantage of building a transactional archive for website owners. Zhiwu demonstrated the implementation by repeatedly reloading two web pages, one served through the UWS and the other connected directly to a web application server that returned the current timestamp with random failures. He mentioned that the system is not yet ready for prime time.
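
The indirection layer described above can be sketched in a few lines. This is only my own toy model of the idea, not Zhiwu's implementation: the class name, the in-memory dictionary used as the "transactional archive", and the choice to answer with status 200 when falling back are all assumptions made for illustration.

```python
class UninterruptibleProxy:
    """Toy indirection layer between clients and a web application server.

    `origin` is any callable taking a path and returning (status, body);
    it stands in for the real server (an assumption for this sketch).
    """

    def __init__(self, origin):
        self.origin = origin
        self.archive = {}  # path -> list of archived bodies, oldest first

    def get(self, path):
        status, body = self.origin(path)
        if 200 <= status < 300:
            # Archive every successful response as it passes through.
            self.archive.setdefault(path, []).append(body)
            return status, body
        if status >= 400 and path in self.archive:
            # Server failed: serve the most recent archived copy instead.
            return 200, self.archive[path][-1]
        # Nothing archived yet (or a non-error status): pass it through.
        return status, body


if __name__ == "__main__":
    scripted = iter([(200, "v1"), (500, "error"), (200, "v2"), (404, "missing")])
    proxy = UninterruptibleProxy(lambda path: next(scripted))
    for _ in range(4):
        print(proxy.get("/"))
```

With the scripted origin in the usage example, the client never sees the 500 or the 404; it gets the last good copy instead, while the archive quietly accumulates every successful version.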

@zxie suggests web archiving is working as UPS for websites during down time, similar to what we had in U.S. government shutdown #iipcGA15
— Ahmed AlSum (@aalsum) April 28, 2015

Uninterruptible Web Service, in diagram form #iipcGA15 pic.twitter.com/RugYwBMaSQ
— Mouse Reeve (@tripofmice) April 28, 2015

Xie: project uses SiteStory as a back up for website availability...Patch in archived version of a web page if 500 error occurs #iipcGA15
— Web Archiving RT (@WebArch_RT) April 28, 2015

During the lunch break I sat with Andy, Kristinn, and Roger, and we had a free-style conversation on advanced crawlers, CDX indexer memory error issues, the possibility of implementing a CDX indexer in Go, separating the data and view layers in Wayback for easy customization, some YouTube videos such as "Is Your Red The Same as My Red?" and the hilarious "If Google was a Guy", TED talks such as "Can we create new senses for humans?", "Evacuated Tube Transport Technologies (ET3)", and the likely weather in Iceland around the time IIPC GA 2016 is scheduled.

#iipcGA15 announced #iipcGA16 is scheduled on April 11, 2016 in Reykjavik, Iceland.
— Sawood Alam (@ibnesayeed) April 29, 2015

Jefferson Bailey presented his talk on "Web Archives as research datasets". With various examples and illustrations from Archive-It collections he established the point that web archives are great sources of data for many kinds of research. He noted that WAT is a compact and easily parsable metadata file format whose files are about 18% of the size of the WARC data files.

Two elements when thinking about research data: collections, and derivation @internetarchive #iipcGA15 pic.twitter.com/0wI2Pd4auj
— Yasmina Anwar (@yasmina_anwar) April 28, 2015

#iipcGA15 @jefferson_bail takes us through web archives as research datasets. pic.twitter.com/AR9Ae9YMx5
— Sabine Hartmann (@skhartmann) April 28, 2015

@jefferson_bail of the @internetarchive presenting research datasets: web archives are mature and ready for data-driven analysis. #iipcGA15
— Helen Hockx (@hhockx) April 28, 2015

#iipcGA15 Those WATs: 18% size of a WARC. In JSON, easily analyzed/parsed.
— Jackie Dooley (@minniedw) April 28, 2015

#iipcGA15 WANE = web archive named entities. Uses Stanford NER tool. Entities from colls (names, titles etc). Less than 1% of WARC size.
— Jackie Dooley (@minniedw) April 28, 2015

#iipcGA15 Failed research ideas still provide useful insight in @jefferson_bail talk. Great that is shared as well.
— Sabine Hartmann (@skhartmann) April 28, 2015

@jefferson_bail just showed amazing visualisation of linked images within a fashion blog collection. #iipcga15
— Helen Hockx (@hhockx) April 28, 2015

Ian Milligan from the University of Waterloo presented his talk on "WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Three Types of Web Archives". He described the importance of web archives and why historians should use them. His talk was primarily based on three case studies: the Wide Web Scrape, the GeoCities End-of-Life Torrent, and the Archive-It Longitudinal Collections (Canadian Political Parties & Labour Organizations). I enjoyed his style of storytelling, some mesmerizing visualizations, and in particular the GeoCities case study. He noted that the GeoCities data was not in the form of WARC files; instead it was a regular Wget crawl.

"Web Archives offer windows into lives of everyday people". @ianmilligan1 presenting use cases of 3 web archive datasets. #iipcga15
— Helen Hockx (@hhockx) April 28, 2015

Visualizing the link structure of the wide web scrape by @ianmilligan1 #iipcGA15 http://t.co/7E9K18igS4
— Michael Widner (@mwidner) April 28, 2015

@ianmilligan1 #iipcGA15 answers do researchers want metadata or content analysis? #CDX vs #WARC pic.twitter.com/yG8eqmz0YY
— Sawood Alam (@ibnesayeed) April 28, 2015

@ianmilligan1 finds WAT files useful and offer the right details: these are sweat spot between the light CDX and heavy WARCs. #iipcGA15
— Helen Hockx (@hhockx) April 28, 2015

"Web archives will profoundly change the work of historians" says @ianmilligan1 #iipcGA15
— Michael Widner (@mwidner) April 28, 2015

Here’s my slide deck from last week’s #iipcGA15: “WARCs, WATs, and gets: Opportunity and Challenge for a Historian.” http://t.co/3kx3KMH3UN
— Ian Milligan (@ianmilligan1) May 4, 2015

After a short break Ahmed AlSum from the Stanford University Libraries (and a WS-DL alumnus) presented his work on "Restoring the oldest U.S. website". He described how he turned yearly backup files of the SLAC website from 1992 to 1999 into WARC and CDX files with the help of Wget, applying some manual changes to make the result look as if it had been captured in those early days. These transformations were necessary to allow a modern Open Wayback system to replay it correctly. Ahmed briefly handed the microphone over to Joan Winters, who was responsible for taking backups of the website in the early days, and she described how they did it. Ahmed also mentioned that the Wayback codebase had 1996 hardcoded as the earliest year, which was fixed by making it configurable.
As an afterthought, I would love to see this effort combined with Satya's Olive Archive, so that everything from the server stack to the browser experience can be replicated as close to the original environment as possible.

@aalsum is trying to #restore the #oldest US website #iipcGA15 pic.twitter.com/xq39xlPBM1
— Sawood Alam (@ibnesayeed) April 28, 2015

#iipcGA15 SLAC archivist Joan Winters talks about the earliest website outside Europe. pic.twitter.com/LU3UMrJRu3
— Sabine Hartmann (@skhartmann) April 28, 2015

Amazing, painstaking work by @aalsum to reconstruct the SLAC website, the US’s first. Interviews, primary resource reviews, etc. #iipcGA15
— Ian Milligan (@ianmilligan1) April 28, 2015

#iipcGA15 Homepage wasn't a concept in 1991. One entry page with two internal links. First page said "Someday there will be text here." :)
— Jackie Dooley (@minniedw) April 28, 2015

blog posts on the oldest U.S. website (@SLAClab) from @aalsum http://t.co/25N9PU2BTU and @nullhandle http://t.co/WkG2OdLJTD #iipcGA15
— Nicholas Taylor (@nullhandle) April 28, 2015

Evolution of #SLAC #homepage @aalsum #iipcGA15, we don't have homepage! pic.twitter.com/qqAG1vE94B
— Sawood Alam (@ibnesayeed) April 28, 2015

The slides from my talk about Restoring US First website http://t.co/4JyB3hTZEV #iipcGA15
— Ahmed AlSum (@aalsum) April 29, 2015

Federico Nanni from the University of Bologna presented "Reconstructing a lost website". Looking at the schedule, my first impression was that it would be a talk about tools to restore any lost website and reconstruct all its pages and links with the help of archives. I was wondering whether they were aware of Warrick, a tool that was developed at Old Dominion University with this very objective. But it turned out to be a case study of the website of the world's oldest university, established around 1088. One of the many challenges he mentioned in reconstructing the university website was the exclusion of the site from the Wayback Machine for unknown reasons, which they tried to resolve together with the Internet Archive. Amusingly, one of the many sources of snapshots was a clone of the site prepared by student protesters.

#iipcGA15 No national web archive for Italy. U Bologna excluded from Wayback Machine, so hard to reconstruct its web history. Undaunted!
— Jackie Dooley (@minniedw) April 28, 2015

#iipcGA15 Frederico Nanni exposes how he was able to retrieve Universiy-ty of Bologna web archive when @internetarchive had excluded it
— Ina DL Web (@inadlweb) April 28, 2015

Reconstruct a lost website:sometimes we still need persons to ask questions and do the paper trail and wait for student protests. #iipcGA15
— susan aasman (@aasmanna) April 28, 2015

The last speaker of the second day, Michael L. Nelson from Old Dominion University, presented the work of his student Scott G. Ainsworth, "Evaluating the temporal coherence of archived pages". With the example of the Weather Underground site he demonstrated how archives can construct unrealistic pages due to temporal violations. He noted that, among the various categories of temporal violations, there is a provable temporal violation in at least 5% of cases. He also noted that a temporal violation is not always a concern.
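
To give a feel for what a temporal violation looks like in data, here is a small sketch. It is my own simplification and not the method from Scott's work: it treats an embedded resource as a provable violation when its Last-Modified time is later than the capture time of the root page (so the replayed content could not have been part of the page when the root was captured), and it reports the temporal spread of a composite page as the gap between its earliest and latest captures. All names and sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def temporal_spread(root_captured, embedded_captured):
    """Gap between the earliest and latest capture in a composite page."""
    times = [root_captured, *embedded_captured]
    return max(times) - min(times)


def provably_violative(root_captured, embedded_last_modified):
    """True if the embedded content was modified after the root was captured.

    This is one simplified reading of a "provable" violation (an assumption).
    """
    return embedded_last_modified > root_captured


if __name__ == "__main__":
    root = datetime(2004, 12, 9, 19, 9)            # hypothetical root capture
    image_captured = datetime(2005, 9, 12, 13, 0)  # hypothetical image capture
    image_modified = datetime(2005, 9, 12, 12, 30)
    print(temporal_spread(root, [image_captured]))
    print(provably_violative(root, image_modified))
```

In this made-up case the replayed page would pair a 2004 root with an image that did not exist until 2005, which is the kind of page that "never existed" on the live web.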

#iipcGA15 How much of web archived? Sources vary. Are archives stable? Nope. Temporal drift while browsing? Yep bec sparse crawls.
— Jackie Dooley (@minniedw) April 28, 2015

.@phonedudemln showing how this Wayback page _never existed! Mashing together temporal elements. #iipcGA15 pic.twitter.com/0lD6Qw1Mxb
— Ian Milligan (@ianmilligan1) April 28, 2015

@hhockx @phonedude_mln #iipcGA15 pic.twitter.com/pDVMAJl5X2
— Sawood Alam (@ibnesayeed) April 28, 2015

Listening to the talk by @phonedude_mln reminds us why HTTP headers matter. #iipcGA15
— Mark Phillips (@vphill) April 28, 2015

Evaluating the Temporal Coherence of Archived Pages http://t.co/9TCLWnvinY #iipcGA15 @hvdsomp @Galsondor @WebSciDL http://t.co/mHPuTrDF8k
— Michael L. Nelson (@phonedude_mln) April 28, 2015

Day 3

The third day's sessions were held in the Internet Archive (IA) building in San Francisco instead of the usual Li Ka Shing Center at Stanford University, Palo Alto. A couple of buses transported us to the IA, and we enjoyed the bus trip through the valley as the weather was very good. The IA staff were very humble and welcoming. The emulator of classic games installed in the IA lobby turned out to be the prime center of attraction. We learned some interesting facts about the IA: the building was a church that was acquired because of its similarity to the IA logo, and the pillows in the hall were contributed by various websites, with their domain names and logos printed on them.

.@hhockx and @anjacks0n hard at work at the @internetarchive #hadoken pic.twitter.com/c05RG8OgRz
— PsypherPunk (@PsypherPunk) April 29, 2015

Very excited today! Our session of #iipcga15 is @internetarchive pic.twitter.com/1XVtJi5u3o
— Mar Pérez Morillo (@mpmorillo) April 29, 2015

@internetarchive acquired a church to make it the main office because it matches the logo. #iipcGA15 pic.twitter.com/8RiYtgqUFX
— Sawood Alam (@ibnesayeed) April 29, 2015

Sessions before lunch were mainly related to consortium management and logistics. These included the Welcome to the Internet Archive by Brewster Kahle, the Chair address by Paul Wagner, the Communication report by Jason Webber, the Treasurer report by Peter Stirling, and the Consortium renewal by the chair, followed by break-out discussions to gather ideas and opinions from the IIPC members on various topics. Also, the date and venue of the next general assembly were announced: April 11, 2016 in Reykjavik, Iceland.

#iipcGA15 @brewster_kahle & @pnwagner To get us through the programme of the day. pic.twitter.com/JkK3BZ7x9q
— Sabine Hartmann (@skhartmann) April 29, 2015

#iipcGA16 will be in Reykjavík, Iceland #iipcGA15 pic.twitter.com/PZ67bjKVbR
— Kristinn Sigurðsson (@kristsi) April 29, 2015

After the lunch break, your author, Sawood Alam from Old Dominion University, presented the progress report on the "Profiling web archives" project, funded by IIPC. With the help of some examples and scenarios he made the point that the long tail of archives matters. He acknowledged the growing number of Memento-compliant archives and the growing use of the Memento aggregator service. For the Memento aggregator to perform efficiently, it needs query routing support in addition to caching, which only helps when requests are repeated before the cache expires. He then acknowledged two earlier profiling efforts: a complete knowledge profile by Sanderson and a minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground between them for other possibilities. He evaluated his findings and concluded that his work so far gained up to 22% routing precision at less than 5% of the cost of the complete knowledge profile, without any false negatives. Sawood also announced the availability of the code to generate and benchmark profiles in a GitHub repository. In a later wrap-up session the chair, Paul Wagner, referred to Sawood's motivation slide in his own words: "sometimes good enough is not good enough."
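
As a rough illustration of what query routing with a profile means, the sketch below builds the simplest kind of profile mentioned above, a TLD-only one, and uses it to decide which archives an aggregator should ask about a URI. This is not the code from the GitHub repository; the archive names, the holdings, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def tld_of(uri: str) -> str:
    """Return the last label of the host name, e.g. 'uk' for bbc.co.uk."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def build_tld_profile(archived_uris):
    """A TLD-only profile: the set of TLDs seen in an archive's holdings."""
    return {tld_of(uri) for uri in archived_uris}


def route(lookup_uri: str, profiles: dict) -> list:
    """Pick only the archives whose profile contains the TLD of the lookup URI."""
    tld = tld_of(lookup_uri)
    return [name for name, tlds in profiles.items() if tld in tlds]


# Hypothetical archives and holdings, for illustration only.
profiles = {
    "archive-a": build_tld_profile(["http://bbc.co.uk/news", "http://example.com/"]),
    "archive-b": build_tld_profile(["http://example.is/", "http://example.com/about"]),
}

print(route("http://example.co.uk/page", profiles))
print(route("http://example.com/", profiles))
print(route("http://example.jp/", profiles))
```

A profile built from the complete holdings never skips an archive that has the resource, but a TLD is so coarse that many of the selected archives will still return nothing. Finer-grained keys improve precision at the cost of a larger profile, and that trade-off is the middle ground the talk explored.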

#iipcGA15 Profiling Web Archives talk by @ibnesayeed of Old Dominion University pic.twitter.com/mpYTKcwZUi
— Sabine Hartmann (@skhartmann) April 29, 2015

@ibnesayeed proposes web archiving profiling approach easier than @azaroth42 CDX aggregation & accurate than @aalsum URL sampling. #iipcGA15
— Ahmed AlSum (@aalsum) April 29, 2015

@ibnesayeed: Web archiving profiling code is available at https://t.co/6ue4XZT4wR #iipcGA15
— Ahmed AlSum (@aalsum) April 29, 2015

@anjacks0n suggests adding web archiving profile from @ibnesayeed presentation in the openWayback #iipcGA15 https://t.co/qIrm4s2yH8
— Ahmed AlSum (@aalsum) April 29, 2015

Slides of my talk on Profiling Web Archives at #iipcGA15 http://t.co/VJUl8DMqaj @WebSciDL @ibnesayeed
— Sawood Alam (@ibnesayeed) May 7, 2015

During the break various IA staff members gave us a tour of the IA facility, including the book scanners, the television archive, an ATM, the storage racks, and the music and video archive where they convert data from old recording media such as vinyl discs and cassettes.

On the @brewster_kahle tour of the @internetarchive. Excited to explore! #iipcGA15 pic.twitter.com/Q7iy4ffIdS
— Ian Milligan (@ianmilligan1) April 29, 2015

I think my favourite moment of #iipcGA15 was experiencing the hum and shimmer of the @internetarchive servers... pic.twitter.com/ijfRxRUKp7
— Andy Jackson (@anjacks0n) May 7, 2015

After the break, historian and writer Abby Smith Rumsey talked about "The Future of Memory in the Digital Age". Her talk was full of insightful and quotable statements. I will quote one of my favorites and leave the rest in the form of tweets. She says, "ask not what we can afford to save; ask what we can afford to lose".

#iipcGA15 Historians are the only people qualified to predict the future. Says Abby Smith Rumsey. pic.twitter.com/PofC4zB0mb
— Sabine Hartmann (@skhartmann) April 29, 2015

#iipcGA15 Abby Smith Rumsey : what is at stake with preservation is the survival of the species. Humans have known how to pass knowledge
— Ina DL Web (@inadlweb) April 29, 2015

Components of memory: starts by forgetting/ filter what is irrelevant/ keep what will be valuable in the future Abby #iipcGA15
— Emmanuelle Bermes (@figoblog) April 29, 2015

Rumsey: Scale is an issue with information, it always has been (no exception re: web archiving) #iipcGA15
— Abbie Grotke (@agrotke) April 29, 2015

Collect and make available, don't curate, allow the future to judge the value. #iipcGA15
— Andy Jackson (@anjacks0n) April 29, 2015

#iipcga15 Abby Smith Rumsey: "ask not what we can afford to save; ask what we can afford to lose"
— Dan Kerchner (@DanKerchner) April 29, 2015

Finally the founder of the Internet Archive, Brewster Kahle, took the stage and talked about digital archiving and the role of the IA through various initiatives including the book archive, music archive, and TV archive, to name a few. He described the zero-sum book lending model used by the Open Library for books that are not free for unlimited distribution. He invited all the archivists to create a common, collective, distributed library where people can share resources such as computing power, storage, manpower, expertise, and connections. During the Q&A session I asked whether, when he thinks about collaboration, he envisions a model similar to inter-library loan, where peer libraries refer to other places through external links when they lack a resource that others hold, or one in which they copy each other's resources. He responded, "both."

#iipcGA15 @brewster_kahle talks at the IIPC GA. How can the whole web archiving community collaborate @NetPreserve pic.twitter.com/M43uoGjc9V
— Sabine Hartmann (@skhartmann) April 29, 2015

Brewster Khale: we need to develop unexpected uses of our digital libraries #iipcga15
— Emmanuelle Bermes (@figoblog) April 29, 2015

300k books are available for lending at @internetarchive one at a time #iipcGA15
— Emmanuelle Bermes (@figoblog) April 29, 2015

Online music: IA started with concerts when music artists authorized it, in exchange for free storage and bandwidth #iipcga15
— Emmanuelle Bermes (@figoblog) April 29, 2015

@brewster_kahle 100+ libraries participating in @openlibrary are buying, digitizing, and loaning non-rights cleared ebooks 1/time #IIPCGA15
— Tom Smyth (@smythbound) April 29, 2015

#iipcGA15 @brewster_kahle presents the news archives : tens of news TV channels comprehensively archived since 2009 https://t.co/VE4eGvJbEI
— ISSN Int. Centre (@ISSN_IC) April 30, 2015

Personal digital archiving next step for Internet Archive, according to Brewster Kahle #iipcGA15, #pda15
— susan aasman (@aasmanna) April 30, 2015

@brewster_kahle Please don't through any book, film, video, CD, DVD or any material away, just give it to @internetarchive #iipcGA15
— Ahmed AlSum (@aalsum) April 30, 2015

#iipcGA15 @brewster_kahle : why not build libraries together ? Cooperative collection dvlpmt, distributed preservation & cloud/local access
— Ina DL Web (@inadlweb) April 30, 2015

#iipcGA15 @brewster_kahle : we should fight against the "winner takes all" idea behind the large centralized library repositories
— Clément Oury (@cleymour) April 30, 2015

"If somebody says 'We'll license it back to you, and you can be on the advisory committee…'—run the other way."—@brewster_kahle #iipcGA15
— David Moles (@chronodm) April 30, 2015

The chair gave a wrap-up talk and formally ended the third day's session. The buses still had some time before leaving, so people engaged in conversation, games, and photographs while enjoying drinks and food. I particularly enjoyed a local ice cream named "It's-It", recommended by an IA staff member. Lori Donovan from the Internet Archive approached Mohamed Farag and me and initiated a good conversation about possible collaboration on archiving projects. We also talked about a project that the WS-DL group at Old Dominion University was working on a few years ago to identify disaster-related news and archive it. Our conversation ended with a group selfie of the three of us.

"@aalsum: #iipcGA15 group photo in front of @internetarchive https://t.co/aEXNJmpMMo" Great looking gang !!
— Paul N. Wagner (@pnwagner) April 30, 2015

Day 4

On the fourth day Sara Aubry presented her talk on "Harvesting Digital Newspapers Behind Paywalls" in Berge Hall A, where the Harvesting Working Group had gathered while IIPC's communication strategy session was going on in Hall B. She discussed her experience of working with news publishers to make their content more crawler friendly. The crawling and replay challenges include paywalls that require authentication to grant access to the content and seed URIs that contain a date string which changes daily. They modified the Wayback to meet their needs, but the modifications have not been committed back to the upstream repository. She said that if the changes are useful to the community, they can be pushed to the main repository.

#iipcGA15 @saraaubry : 23 press titles accessible upon payment, representing more than 200 local editions, are harvested every day
— ISSN Int. Centre (@ISSN_IC) April 30, 2015

@saraaubry discusses working with news publishers to make their sites more crawler friendly - generally was positive experience #iipcGA15
— Abbie Grotke (@agrotke) April 30, 2015

#iipcGA15 @saraaubry : @DLWebBnF is using ARK identifiers for a federated search on several versions of URLs of the same press title
— ISSN Int. Centre (@ISSN_IC) April 30, 2015

New Wayback features presented by @saraaubry #iipcGA15 pic.twitter.com/4fL7w09k3T
— webmiriam (@webmiriam1) April 30, 2015

@DLWebBnF is identifying alternative ways for collecting, deposit from publishers thru FTP, deposit from press aggregators #iipcGA15
— Abbie Grotke (@agrotke) April 30, 2015

Roger Coram presented his talk on "Supplementing Crawls with PhantomJS". I found his talk quite relevant to the work of my colleague Justin Brunelle. This is a necessary step to improve the quality of crawls, especially as sites become more interactive through extensive use of JavaScript. For some pages, he uses CSS selectors and takes screenshots to complement the rendering later.

@hhockx @PsypherPunk's #iipcGA15 presentation about #PhantomJS is very relevant to what @justinfbrunelle @WebSciDL is working on.
— Sawood Alam (@ibnesayeed) April 30, 2015

At #iipcGA15 @PsypherPunk talking about how we also store rendered home pages as potentially clickable image maps, or Google maps div as img
— Andy Jackson (@anjacks0n) April 30, 2015

Blog post by @PsypherPunk Archiving Screenshots: http://t.co/lmbWWx0wWr #iipcGA15
— Helen Hockx (@hhockx) April 30, 2015

HTTP Archive (HAR) format mentioned by @PsypherPunk: https://t.co/JMVK1yXWns #iipcGA15
— Helen Hockx (@hhockx) April 30, 2015

Hadn't come across CrawlJax before - looks interesting. http://t.co/wpwyRQxcyP #iipcGA15
— Andy Jackson (@anjacks0n) April 30, 2015

Kristinn Sigurðsson engaged everyone in a discussion about the "Future of Heritrix". He started with the question, "is Heritrix dead?" and I said to myself, "can we afford this?". This ignited a discussion about what can be done to increase activity in its development. I asked what is slowing down the development of Heritrix: is it a lack of ideas and new feature requests, or are there not enough contributors to continue the development? There was no clear answer to this question, but it helped keep the discussion going. I also suggested that if new developers are afraid of making changes that would break the system and discourage upgrades, we could introduce a plug-in architecture in which new features are added as optional add-ons.

Now at Harvesting Group: Is Heritrix dead? @kristsi No, but it needs sustainability and support. #iipcGA15
— Mar Pérez Morillo (@mpmorillo) April 30, 2015

Harvesting Working Group discussion on Heritrix: We need a framework where multiple crawlers can exist #iipcGA15
— Abbie Grotke (@agrotke) April 30, 2015

@anjacks0n suggests the future of Heritrix should be using the Archive Proxy #iipcGA15
— Ahmed AlSum (@aalsum) April 30, 2015

Helen Hockx-Yu took the microphone and talked about Open Wayback development. She gave a brief introduction to the development workflow and the periodic telecons. She also talked about the short- and long-term development goals, including better customization and internationalization support, displaying more metadata, ways to minimize live leaks, and acknowledging and visualizing temporal coherence.

Requirements for Open Wayback presented by @hhockx #iipcGA15 pic.twitter.com/nWoCcSS7wH
— webmiriam (@webmiriam1) April 30, 2015

If you are interested in joining OpenWayback development, you can send an email to the group https://t.co/9R6m3aJkqW … #iipcGA15 @hhockx
— Ahmed AlSum (@aalsum) April 30, 2015

After a short break Tom Cramer gave his talk on "APIs and Collaborative Software Development for Digital Libraries". He formally classified software development models into five categories. He suggested that IIPC take the position of unifying the high-level API for each category of archiving tools so that they can be used interchangeably. This was very appealing to me because I had been thinking along the same lines and have done some architectural design of an orchestration system that achieves the same goal via a layer of indirection.

Different types of open source development (regardless of license) according to @tcramer #iipcGA15 pic.twitter.com/2CdJ2cGQyN
— Emmanuelle Bermes (@figoblog) April 30, 2015

A majority of open source software is actually "sole source" software (1 dev) or "closed source" (1 team or company) @tcramer #iipcGA15
— Emmanuelle Bermes (@figoblog) April 30, 2015

#iipcGA15 @tcramer: free to use software doesn't mean it's a distributed or scalable.
— Ahmed AlSum (@aalsum) April 30, 2015

#iipcGA15 @tcramer: presents @GeoBlacklight http://t.co/967Nxy5yTR, Hydra http://t.co/OBFJDYCcgJ, @FedoraRepo http://t.co/6PkW2t3yVB
— Ahmed AlSum (@aalsum) April 30, 2015

#iipcGA15 @tcramer Reason for success, open source fundamentals: Transparency, Inclusivity, Merit, agility, Quality, and Value.
— Ahmed AlSum (@aalsum) April 30, 2015

No grants were abused in the making of this project/community @tcramer #iipcGA15 pic.twitter.com/nxcr9iscEi
— Sawood Alam (@ibnesayeed) April 30, 2015

Daniel Vargas from LOCKSS presented his talk on "Streamlining deployment of web archiving tools" and demonstrated the use of Docker containers for deployment. He also demonstrated the use of plain WARC files on a regular file system and in HDFS with Hadoop clusters. I was glad to see someone else deploying the Wayback Machine in containers, as I was pushing some changes to the Open Wayback repository that will make containerization of Wayback easier.

Daniel Vargas is doing a demo on running OpenWayback instance using @docker container #iipcGA15
— Ahmed AlSum (@aalsum) April 30, 2015

LOCKSS can extract WARC files to play in Open Wayback through a Docker container #iipcGA15
— Emmanuelle Bermes (@figoblog) April 30, 2015

Nice demo running OpenWayback in Docker - https://t.co/8Zu2xMFRLg #iipcGA15
— Andy Jackson (@anjacks0n) April 30, 2015

During the lunch break Hunter Stern from the IA approached me and told me about the Umbra project, which supplements the crawling of JS-rich pages. Kristinn, a few other people, and I talked about the precision of time in HTTP/2.0, but no one was sure whether it had changed from one-second granularity to anything finer, such as milliseconds or microseconds. Later I asked this question on the IETF HTTP WG mailing list, and the response suggests that no change was made. After lunch there was a short open mic session in which every speaker got four minutes to introduce the exciting things they are working on. Unfortunately, due to the shortage of time I could not participate in it.

@kristsi Not only does HTTP/2.0 still use second granularity, it still lugs timestamps around in ASCII format @bsdphk http://t.co/b1j3kGbK4X
— Sawood Alam (@ibnesayeed) May 11, 2015
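
To see what one-second granularity means in practice, the short sketch below formats two made-up timestamps that are 400 milliseconds apart as HTTP dates, using Python's standard library. The timestamps are arbitrary examples.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Two distinct moments within the same second (arbitrary example values).
first = datetime(2015, 4, 30, 12, 0, 0, 100000, tzinfo=timezone.utc)
second = first + timedelta(milliseconds=400)

# The HTTP date format has no field for fractions of a second.
first_header = format_datetime(first, usegmt=True)
second_header = format_datetime(second, usegmt=True)

print(first_header)
print(second_header)
print(first != second, first_header == second_header)
```

Both moments serialize to the same header value, so two captures or modifications within the same second cannot be told apart from date headers alone.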

After the lunch break the Access Working Group gathered to talk about "Data mining and WAT files: format, tools and use cases". Peter Stirling, Sara Aubry, Vinay Goel, and Andy Jackson gave talks on "Using WAT at the BnF to map the First World War", "The WAT format and tools for creating WAT files", and "Use cases at Internet Archive and the British Library". Vinay had some really neat, interactive visualizations based on the WAT files. I talked to Vinay during the break and we came up with some interesting ideas to work on, such as building a content store indexed by hashes that uses WAT files in conjunction for replay, and a WebSocket-based BOINC implementation in JavaScript to perform Hadoop-style distributed research operations on IA data on users' machines.
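
The content store idea was only a conversation, so the following is just a sketch of the general technique of content addressing, not a description of anything that was built. Payloads are stored once under their SHA-1 digest, and a plain dictionary stands in for the capture metadata that would map a URI and a timestamp to a digest at replay time. The class name, the index layout, and the sample data are all hypothetical.

```python
import hashlib


class ContentStore:
    """Stores each distinct payload once, keyed by its SHA-1 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        digest = hashlib.sha1(payload).hexdigest()
        # Identical payloads share one entry, which deduplicates storage.
        self._blobs.setdefault(digest, payload)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
index = {}  # (URI, 14-digit timestamp) -> digest; a stand-in for capture metadata

captures = [
    ("http://example.com/", "20150427120000", b"<html>hello</html>"),
    ("http://example.com/", "20150428120000", b"<html>hello</html>"),  # unchanged
    ("http://example.com/", "20150429120000", b"<html>updated</html>"),
]
for uri, timestamp, payload in captures:
    index[(uri, timestamp)] = store.put(payload)

# Replay looks up the digest for a capture and fetches the payload by hash.
replayed = store.get(index[("http://example.com/", "20150428120000")])
print(len(index), len(store), replayed)
```

Three captures are indexed, but only two payloads are stored because the first two captures are byte-for-byte identical.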

Peter Stirling at the BnF has been using WATs to map web archives relating to the First World War. Excited to see how it’s going! #iipcGA15
— Ian Milligan (@ianmilligan1) April 30, 2015

Peter Stirling at #iipcGA15 : analyze web archives to understand the use of digitized heritage documents on websites related to WWI
— ISSN Int. Centre (@ISSN_IC) April 30, 2015

There are technical, legal and organizational challenges to the set up of a data mining service for researchers at @ActuBnF #iipcGA15
— Emmanuelle Bermes (@figoblog) April 30, 2015

#iipcGA15 @saraaubry : WAT files were for the first time presented to the community at the 2011 @NetPreserve General Assembly
— ISSN Int. Centre (@ISSN_IC) April 30, 2015

#iipcGA15 @saraaubry More tools to extract WAT files from WARC https://t.co/H9SrZD8YWK
— Ahmed AlSum (@aalsum) April 30, 2015

#iipcGA15 Vinay Goel: if you have WAT files, you can directly produce CDX files from them
— ISSN Int. Centre (@ISSN_IC) April 30, 2015

Vinay Goel-WAT files provide contextual info for users about collections such as which domains where crawled, which urls, etc. #iipcGA15
— rosalie lack (@rosalielack) April 30, 2015

@vinaygo @internetarchive mentioned @ibnesayeed @WebSciDL #ArchiveProfiling work at #iipcGA15 w/ nice #visualization http://t.co/PaDcwRlNGs
— Sawood Alam (@ibnesayeed) April 30, 2015

.@anjacks0n has a repo for WAT files as well: wat-mining. https://t.co/hrhVVUVD6s #iipcGA15
— Ian Milligan (@ianmilligan1) April 30, 2015

After a short break the Access Working Group talked about "Full-text search for web archives and Solr". Anshum Gupta, Andy Jackson, and Alex Thurman presented "Apache Solr: 5.0 and beyond", "Full-text search for web archives at the British Library", and "Solr-based full-text search in Columbia's Human Rights Web Archive" respectively. Anshum's talk covered the technical aspects of Solr, while the other two talks were more like case studies.

Web Archive architecture drawn by @anjacks0n #iipcga15 pic.twitter.com/NTc8NjlzW4
— Ahmed AlSum (@aalsum) April 30, 2015

Historians prefer transparency and want to know how things work under the hood. They "hate" things like stemming. #iipcGA15 @anjacks0n #solr
— Helen Hockx (@hhockx) April 30, 2015

Size of index at BL is 15TB... So impossible to have as much RAM as index size (SolR recommendation) @anjacks0n #iipcGA15
— Emmanuelle Bermes (@figoblog) April 30, 2015

.@anjacks0n giving a live demo of the amazing UK Web Archive’s Shine interface. Big Data trending up! http://t.co/zpTMgddDyT #iipcGA15
— Ian Milligan (@ianmilligan1) April 30, 2015

BL's index is split in 24 shards (see https://t.co/XQEEK10J8r ) #iipcGA15
— Emmanuelle Bermes (@figoblog) April 30, 2015

#iipcGA15 @anshumgupta gives an interesting talk about @SolrLucene 5.0 new features.
— Ahmed AlSum (@aalsum) April 30, 2015

.@athurman discussing Columbia’s implementation of full-text search w/ the Human Rights Web Archive #iipcGA15 http://t.co/02yVjvPpxg
— Ian Milligan (@ianmilligan1) April 30, 2015

A search in this web archive for Chippewa (for example) allows expansion to these other terms too. Cool. #iipcGA15 pic.twitter.com/reFs8hUxj0
— Ian Milligan (@ianmilligan1) April 30, 2015

Day 5

On the last day of the conference the Collection Development and Preservation Working Groups discussed their current state and plans in separate parallel tracks. Before the break I attended the Collection Development Working Group. They demonstrated the Archive-It account functionality. I expressed the need for a web-based API to interact with the Archive-It service. I gave the example of a project I was working on a few years ago in which a feed reader periodically read news feeds and sent the articles to a disaster classifier that Yasmin AlNoamany and Sawood Alam (me) built. If the classifier placed a news article in the disaster category, we wanted to archive that page immediately. Unfortunately, Archive-It did not provide a way to do that programmatically (short of page scraping or a headless browser), so we ended up using the WebCite service instead.
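
The workflow in that project can be sketched as a small pipeline. This is not the original code: the classifier below is a trivial keyword check standing in for the real one, and the archiving step is an injected callable because the point of the story is that no programmatic archiving API was available. All names and sample articles are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Article = Tuple[str, str]  # (URI, text) pairs as a feed reader might deliver them


def archive_disaster_news(articles: Iterable[Article],
                          is_disaster: Callable[[str], bool],
                          archive: Callable[[str], None]) -> List[str]:
    """Request archiving for every article that the classifier flags."""
    archived = []
    for uri, text in articles:
        if is_disaster(text):
            archive(uri)  # in a real system: a call to an archiving service
            archived.append(uri)
    return archived


def keyword_classifier(text: str) -> bool:
    """Stand-in classifier: flags text that mentions a few disaster words."""
    return any(word in text.lower() for word in ("earthquake", "flood", "hurricane"))


# Made-up feed items; requests are collected in a list instead of being sent.
feed = [
    ("http://example.com/news/1", "Earthquake strikes the coast"),
    ("http://example.com/news/2", "Local team wins the final"),
]
requested = []
result = archive_disaster_news(feed, keyword_classifier, requested.append)
print(result)
```

With a web-based API, the `archive` callable would simply wrap one HTTP request, which is why such an API would make this kind of event-driven archiving straightforward.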

#iipcGA15 @agrotke leads us into the morning session of the last day of the GA. We have collection development today. pic.twitter.com/xQ35y58Vbe
— Sabine Hartmann (@skhartmann) May 1, 2015

After the break I moved to the Preservation Working Group track, where I had a talk scheduled. David S.H. Rosenthal presented his talk on "LOCKSS: Collaborative Distributed Web Archiving For Libraries". He described how LOCKSS works and how it has benefited the publishing industry. He described how Crawljax is used in LOCKSS to capture content that is loaded via Ajax. He also noted that most publishing sites try not to rely on Ajax, and if they do, they provide some other means of crawling their content in order to maintain their search engine ranking.

Legal framework for LOCKSS: obtain explicit permission to crawl & preserve permission along w/ content #iipcGA15
— Emmanuelle Bermes (@figoblog) May 1, 2015

LOCKSS has a peer-to-peer protocol for verification and repair of the content in the boxes, w/ authorization check #iipcGA15
— Emmanuelle Bermes (@figoblog) May 1, 2015

LOCKSS supports a variety of stds incl. Memento, OpenUrl, http content negotiation,WARC import and export, bibliographic metadata #iipcGA15
— Emmanuelle Bermes (@figoblog) May 1, 2015

#iipcGA15 LOCKSS runs "Red Hat" model: free open source software but paid support
— ISSN Int. Centre (@ISSN_IC) May 1, 2015

DSHR: @crawljax web capture seeding turned out to be less crucial than originally thought due to journals' focus on @google #SEO #iipcGA15
— Nicholas Taylor (@nullhandle) May 1, 2015

#iipcGA15 Rosenthal: all components of LOCKSS processing chains should interact through web services
— ISSN Int. Centre (@ISSN_IC) May 1, 2015

David Rosenthal #LOCKSS mentioned @ibnesayeed @WebSciDL #ArchiveProfiling project in his talk #iipcGA15 pic.twitter.com/8x87EGV0Vv
— Sawood Alam (@ibnesayeed) May 1, 2015

Sawood Alam (me) happened to be the last presenter of the conference, with his talk on "Archive Profile Serialization". This talk was a continuation of his earlier talk at the IA. He described what should be kept in profiles and how it should be organized. He also talked briefly about the implications of each data organization strategy. Finally he talked about the file format to be used and how it can affect the usefulness of the profiles. He noted that single-node file formats such as XML, JSON, and YAML are not suitable for profiles, and he proposed an alternative format that is a fusion of the CDX and JSON formats. Kristinn's feedback was that this seems to be the right approach to serializing such data, but he strongly suggested naming the file format something other than CDXJSON.
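
The post does not spell out the proposed format, so the sketch below only illustrates the general idea of fusing the two styles under a stated assumption: each line holds a lookup key, a space, and a compact JSON object, and the lines are sorted by key. The keys and the field names are invented for this example.

```python
import json


def serialize(profile: dict) -> str:
    """Write one sorted line per key: '<key> <compact JSON object>'."""
    lines = [
        f"{key} {json.dumps(stats, sort_keys=True, separators=(',', ':'))}"
        for key, stats in sorted(profile.items())
    ]
    return "\n".join(lines) + "\n"


def parse_line(line: str):
    """Split a line at the first space into its key and its JSON payload."""
    key, _, payload = line.rstrip("\n").partition(" ")
    return key, json.loads(payload)


# Hypothetical profile entries: keys and field names are assumptions.
profile = {
    "uk,co,bbc)/": {"uris": 120, "mementos": 950},
    "com,example)/": {"uris": 3, "mementos": 7},
}

text = serialize(profile)
print(text, end="")

restored = dict(parse_line(line) for line in text.splitlines())
print(restored == profile)
```

Unlike a single JSON or XML document, a file like this can be read line by line, searched with a binary search, and merged with ordinary sort tools without loading the whole profile into memory.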

Slides of my talk on Profiling Serialization at #iipcGA15 http://t.co/GsfV5ICWch @WebSciDL @ibnesayeed
— Sawood Alam (@ibnesayeed) May 8, 2015

While we were having lunch, the chair took the opportunity to wrap up the day and the conference. I would like to thank all the organizing team members, especially Jason Webber, Sabine Hartmann, Nicholas Taylor, and Ahmed AlSum, for organizing the event and making it possible.

The #iipcGA15 comes to an end, thanks to everyone who made this such a great week! pic.twitter.com/z4TIHbZzsI
— webmiriam (@webmiriam1) May 1, 2015

In the afternoon Ahmed AlSum took me to the Computer History Museum where Marc Weber gave us a tour. It was a great place to visit after such an intense week.

#iipcGA15 The fun continues. pic.twitter.com/XEKjLbQ8Dv
— IIPC (@NetPreserve) May 1, 2015

Marc Weber is giving #iipcGA15 visitors a tour @ComputerHistory museum pic.twitter.com/nAwiqvoWK7
— Ahmed AlSum (@aalsum) May 1, 2015

Missed Talks

Due to the parallel tracks I missed some sessions that I wanted to attend, such as "SoLoGlo - an archiving and analysis service" by Martin Klein, "Web archive content analysis" by Mohamed Farag, "Identifying national parts of the internet" by Eld Zierau, "Warcbase: Building a scalable platform on HBase and Hadoop" by Jimmy Lin, "WARCrefs for deduplicating web archives" by Youssef Eldakar, and "WARC Standard Revision Workshop" by Clément Oury, to name a few. I hope the video recordings will be available soon. Meanwhile I was following the related tweets.

Visualization of tweet locations during Charlie Hebdo events @mart1nkle1n at #iipcGA15 pic.twitter.com/XiBi7WaFCr
— Katrin Weller (@kwelle) April 28, 2015

Farag: Event Focused Crawler (EFC) helps curators focus the crawls and improve quality #iipcGA15
— Abbie Grotke (@agrotke) April 28, 2015

#iipcGA15 @cleymour introducing the WARC standard revision process. Do IIPC members need changes or evolutions?
— Dépôt légal Web BnF (@DLWebBnF) April 28, 2015

Youssef Eldakar from Bibliotheca Alexandrina is presenting issues (and solutions!) related the duplicates in WARC files at #iipcGA15
— ISSN Int. Centre (@ISSN_IC) April 28, 2015

Conclusions

IIPC GA 2015 was a fantastic event. I had a great time, met a lot of new people along with some whom I had known only on the Web, shared my ideas, and learned from others. It was the most amazing full week I have ever had. I appreciate the efforts of everyone who made this possible, including the organizers, presenters, and attendees.

Resources

Please let us know of any links to resources related to IIPC GA 2015 so that we can include them below.

Official

Aggregations

Blog Posts

Web Archiving in 2015 -- a Quick Redux of IIPC's General Assembly at Stanford - Tom Cramer
Notes from IIPC General Assembly 2015 - Jefferson Bailey
Let Them Emulate! - Andy Jackson
IIPC 2015 Recap - Ian Milligan
IIPC trabalha para salvar a memória da internet (Portuguese) - Carlos Eduardo Entini
IIPC GA 2015, jour 1 : « context matters » (French) - Emmanuelle Bermes
IIPC GA 2015, jour 2 : WARC, WAT, WET et WANE (French) - Emmanuelle Bermes
Talk at IIPC General Assembly - David Rosenthal
Assemblée Générale de l’IIPC à Palo Alto (French) - Claude Mussou
Looking ahead from the 2015 IIPC General Assembly - Nicholas Taylor

Tools

Update (May 12, 2015): Added reference to HTTP/2.0 time resolution and some more blog posts.
Update (May 22, 2015): Added more blogs and tool references.
Update (June 1, 2015): Added link to the video recording playlist.
--
Sawood Alam

Web Science and Digital Libraries Research Group