The day before International Internet Preservation Consortium (IIPC) General Assembly 2015 we landed in San Francisco and some delicious Egyptian dishes were waiting for us. Thank you Ahmed, Yasmin, Moustafa, Adrian, and Yusuf for hosting us. It was a great way to spend the evening before IIPC GA and we were delighted to see you people after long time.
We (Sawood Alam, Michael L. Nelson, and Herbert Van de Sompel) entered in the conference hall a few minutes after the session was started and Michael Keller from Stanford University Libraries was about to leave the stage after the welcome speech. IIPC Chair Paul Wagner gave brief opening remarks and invited the keynote speaker Vinton Cerf from Google on the stage. The title of the talk was "Digital Vellum: Interacting with Digital Objects Over Centuries" and it was such an informative and delightful talk. He mentioned that the high density low cast storage media is evolving, but the devices to read them might not last long. While mentioning Internet connected picture frames and surf boards he added, we should not forget about the security. To emphasize the security aspect he gave an example that grand parents would love to see their grand children in those picture frames, but will not be very happy if they see something which they do not expect.
Moving on to software emulators he invited Mahadev Satyanarayanan from Carnegie Mellon University to talk about their software archive and emulator called Olive Archive. Satya gave various live demos including the Great American History Machine, ChemCollective (a copy of the website frozen at certain time), PowerPoint 4.0 running in Windows 3.1, and the Oregon Trail, all powered by their virtual machines and running in a web browser. He also talked about the architecture of the Olive Archive and how in future multiple instances can be launched and orchestrated to emulate the subset of the Internet for applications that rely on external services where some instances might run those services independently.
In the QA session someone asked Cerf, how to ask big companies like Google to provide the data about their Crisis Response efforts for archiving after they are done with it? Cerf responded, "you just did." while acknowledging the importance of such data for archival. Here are some tweets that were capturing the moments:
Cerf not sure that Google keeps its own software very long if it’s no longer used #iipcGA15— Jane Winters (@jfwinters) April 27, 2015
After the break Niels Brügger and Ditte Laursen presented their case study of Danish websphere under the title "Studying a nation's websphere over time: analytical and methodological considerations". Their study covered website content, file types, file sizes, backgrounds, fonts, layout and more importantly the domain names. They also raised the points like size of the ".dk" domain, geolocation, inter and intra domain link network, and if the Danish websites are actually in Danish language? They talked about some crawling challenges. Their domain name analysis tells that only 10% owners own 50% of all the ".dk" domains. I suspected that this result might be due to the private domain name registrations, so I talked to them later and they said, they did not think about private registrations, but they will revisit their analysis.
Brugger: size of .dk domain, file types, size of individual websites, where are websites located, networks within & between sites #iipcGA15— Jane Winters (@jfwinters) April 27, 2015
#iipcGA15 Domain owners change. Top 10% of domain owners own 50% of the domains.— Jackie Dooley (@minniedw) April 27, 2015
Between 2012 and 2015 14% of .dk domains changed ownership #iipcGA15— Jane Winters (@jfwinters) April 27, 2015
"No 1:1 relation between Danish national archive and the Danish national web domain" #iipcGA15— Yasmina Anwar (@yasmina_anwar) April 27, 2015
Andy Jackson from the British Library took the stage with his presentation title "Ten years of the UK web archive: what have we saved?". This case study covers three collections including Open Archive, Legal Deposit Archive, and JISC Historical Archive. These collections store over eight billion resources in over 160TB compressed files and now adding about two billion resources per year. With the help of a nice graph he illustrated that not all ".uk" domains are interlinked, so to maximize the coverage the crawlers need to include other popular TLDs such as ".com". He also presented the analysis of reference rot and content drift utilizing the "ssdeep" fuzzy hash algorithm. Their analysis tells that 50% of resources are unrecognizable or gone after oner year, 60% after two years and 65% after three years.
I had lunch together with Scott Fisher from the California Digital Library. I told him about various digital library and archiving related research projects we are working on at Old Dominion University and he described the holdings of his library and the phalanges they have in upgrading their Wayback to bring Memento support.
After the lunch, keynote speaker of the second session Cathy Marshall from the Texas A&M University took the stage with a very interesting title, "Should we archive Facebook? Why the users are wrong and the NSA is right". She motivated her talk by some interview style dialogues with the primary question, "Do you archive Facebook?" and mostly the answer was "No!". She highlighted that people have developed [wrong] sense that Facebook is taking care of their stuff, so they do not have to. She also noted that people usually do not value their Facebook content or they think it has immediate value, but no archival value. In a large survey she asked should Facebook be archived?, three fourth objected and half of them said "No" unconditionally. In the later part of her talk, she build the story of the marriage of Hal Keeler and Joan Vollmer by stitching various cuttings from local news papers. I am not sure if I could fully appreciate the story due to the cultural difference, but I laughed when everyone else did. Although I did follow her efforts and intention to highlight the need of archiving social media for future historians. And if asks me, is NSA is right? my answer would be, "Yes!, if they do it correctly with all the context included."
Archiving Facebook's public vs. private data: maybe not the same challenge #iipcGA15— Emmanuelle Bermes (@figoblog) April 27, 2015
C. Marshall: impossible to reconstruct a story like Vollmer's today because you'd have to rely on Facebook volatile data #iipcGA15— Emmanuelle Bermes (@figoblog) April 27, 2015
Meghan Dougherty from Loyola University Chicago and Annette Markham from Aarhus University presented their talk "Generating granular evidence of lived experience with the Web: archiving everyday digitally lived life". They illustrated how sometimes intentionally or unintentionally people record moments of their life with different media. Among various visual illustrations, I particularly liked the video of a street artist playing with a ring that was posted on Facebook in a very different context than the context it appeared in YouTube. They ended their talk with a hilarious video of Friendster.
Megan Dougherty quote rebecca solnit tyranny of the quantifiable that which can be measured takes priority over that which cannot #iipcGA15— rosalie lack (@rosalielack) April 27, 2015
Susan Aasman from University of Groningen presented her talk "Everyday saving practices: "small data" and digital heritage strategies". This talk was full of motivation, why people should care about personal archive of their daily life moments. She described how the service Kodak Gallery launched in 2001 with the tag-line, "live forever", and closed in 2012 after transferring billions of images to Shutterfy which was only available for US customers. As a result, people from other countries have lost their photo memories. She also played the Bye Bye Super 8 video of Johan Kramer that was amusing and motivating for personal archiving.
After a short beak Jane Winters from the Institute of Historical Research, Helen Hockx-Yu from the British Library, and Josh Cowls from the Oxford Internet Institute took the stage with their topic "Big UK domain data for Arts and Humanities" also known as BUDDAH project. Jane highlighted the value of archives for research and described the development of a framework to help researchers leverage the archives. She illustrated the interface of the Big Data analysis of BUDDAH project, described the planned output, and various case studies showing what can be done with that data.
#iipcGA15 ... inform collection development and access at BL, train researchers in use of big data. Great acronyn: the BUDDHA project.— Jackie Dooley (@minniedw) April 27, 2015
Helen Hockx-Yu began her talk "Co-developing access to the UK Web Archive" with reference to the earlier talk by Andy. She noted that a scenario that fits everyone's need is difficult. She described the high level requirements including query building, corpus formation, annotation and cuuration, in-corpus and whole-dataset analysis. She illustrated the SHINE interface that provides features like full-text search, multi-facet filters, query history, and result export.
Finally, Josh Cowls presented his talk about the book "The Web as History: Using Web Archives to Understand the Past and the Present" in which he contributed a chapter. He talked about the four second level domain from ".uk" TLD including ".co.uk", ".org.uk", ".ac.uk", and ".gov.uk" and how they are interlinked. He described the growth of web presence of the BBC and British universities.
Fun fact #iipcGA15 : to demonstrate the interest of web archives as material for history, write a book and put it online— Emmanuelle Bermes (@figoblog) April 27, 2015
IIPC Chair Paul Wagner concluded the day by emphasizing that we have only started scratching the surface. He also noted in his concluding remarks that the context matters.
Daniel Gomes from the Portuguese Web Archive talked about "Web Archive Information Retrieval". He started classifying web archive information needs in three categories including Navigational, Informational, and Transactional. He noted that the usual way of accessing archive is URL searching which might not be known to the users. An alternate method is full-text search that poses the challenge of relevance. Daniel described various relevance models in great detail and how to select features to maximize the relevance. He announced that all the dataset and code is available for free and under open source license. The code is hosted on Google Code, but due to their announcement of sunsetting the service the code will be migrated to GitHub soon.
Replay of archived websites relies on patching resources not necessarily crawled at the same time. Not something known to users. #iipcGA15— Helen Hockx (@hhockx) April 28, 2015
#iipcGA15 68 ranking features--too many to put into production. Using URL, title, text body, anchor text of incoming link.— Jackie Dooley (@minniedw) April 28, 2015
After this talk, there was a short break followed by the announcement that remaining sessions of the day will have two parallel tracks. It was a hard decision to choose one track or the other, but I can watch the missed sessions latter when the video recordings are made available. Later the parallel sessions were interfering each other so the microphone was turned off.
#iipcGA15 Sometimes it is good we can still get by without microphones.— Sabine Hartmann (@skhartmann) April 28, 2015
After the break Ilya Kreymer gave a live demo of his recent work "Web Archiving for all: Building WebRecorder.io". He acknowledged the collaboration with Rhizome and announced the availability of invite only beta implementation of the WebRecorder. He demonstrated how WebRecorder can be used perform personal archiving in What You See Is What You Archive (WYSIWYA) mode.
#iipcGA15 Webrecorder.io: on-demand archiving via browser. WYSIWYA: what u see is what you archive. Available to anybody. quality > quality.— Jackie Dooley (@minniedw) April 28, 2015
Ilya Kreymer is looking for collaborators, developers, UI designers, and archivists to move webrecorder.io to the next level #iipcGA15— Ahmed AlSum (@aalsum) April 28, 2015
Zhiwu Xie from VirginiaTech presented "Archiving transactions towards an uninterruptible web service". He described an indirection layer between the web application server and the client that archives each successful response and when server returns 4xx/5xx failure responses, it serves the most recent copy of the resource from the transactional archive. It is similar to services like CloudFlare in functionality from clients' perspective, but it has added advantage of building a transactional archive for website owners. Zhiwu demonstrated the implementation by reloading two web pages multiple times of which one was utilizing the UWS and the other was directly connected to the web application server that was returning the current timestamp with random failures. He mentioned that the system is not ready for the prime time yet.
Xie: project uses SiteStory as a back up for website availability...Patch in archived version of a web page if 500 error occurs #iipcGA15— Web Archiving RT (@WebArch_RT) April 28, 2015
During the lunch break I was with Andy, Kristinn, and Roger where we had free style conversation on advanced crawlers, CDX indexer memory error issues, the possibility of implementing CDX indexer in Go, separating data and view layers in Wayback for easy customization, some YouTube videos such as "Is Your Red The Same as My Red?", hilarious "If Google was a Guy", Ted talks such as "Can we create new senses for humans?", "Evacuated Tube Transport Technologies (ET3)", and the possible weather of Iceland around the time IIPC GA 2016 is scheduled.
Jefferson Bailey presented his talk on "Web Archives as research datasets". With various examples and illustrations from Archive-It collections he established the point that web archives are great sources of data for various researches. He acknowledged that WAT is a compact and easily parsable metadata file format that is about 18% of the WARC data files.
#iipcGA15 Those WATs: 18% size of a WARC. In JSON, easily analyzed/parsed.— Jackie Dooley (@minniedw) April 28, 2015
#iipcGA15 WANE = web archive named entities. Uses Stanford NER tool. Entities from colls (names, titles etc). Less than 1% of WARC size.— Jackie Dooley (@minniedw) April 28, 2015
Ian Milligan from the University of Waterloo presented his talk on "WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Three Types of Web Archives". He described the importance of web archives and why historians should use web archives. His talk was primarily based on three case studies including Wide Web Scrape, GeoCities End-of-Life Torrent, and Archive-It Longitudinal Collections, Canadian Political Parties & Labour Organizations. I enjoyed his style of storytelling, some mesmerizing visualizations, and in particular the GeoCities case study. He noted that the GeoCities data was not the form of WARC files, instead it was regular Wget crawl.
After a short break Ahmed AlSum from the Stanford University Library (and a WS-DL alumnus) presented his work on "Restoring the oldest U.S. website". He described how he turned yearly backup files of SLAC website from 1992 to 1999 into WARC and CDX files with the help of Wget and by applying some manual changes to mimic the effect as if it was captured in those early days. These transforms were necessary to allow modern Open Wayback system to correctly replay it. Ahmed briefly handed the microphone over to Joan Winters who was responsible to take backups of the website in early days and she described how they did it. Ahmed also mentioned that the Wayback codebase had hardcoded 1996 as the earliest year that was fixed by making it configurable.
As an after thought I would love to see this effort combined with Satya's Olive Archive so that from the server stack to the browser experience all can be replicated as close to the original environment as possible.
#iipcGA15 Homepage wasn't a concept in 1991. One entry page with two internal links. First page said "Someday there will be text here." :)— Jackie Dooley (@minniedw) April 28, 2015
Federico Nanni from the University of Bologana presented "Reconstructing a lost website". Looking at the schedule, my first impression was that it is going to be a talk about tools to restore any lost websites and reconstruct all the pages and links with the help of archives. I was wondering if they are aware of Warrick, a tool that was developed at Old Dominion University with this very objective. But, it turned out to be a case study of the world's oldest university established around 1088. One of the many challenges in reconstructing the university website he mentioned was the exclusion of the site from the Wayback Machine for unknown reasons which they tried to resolve together with Internet Archive. Amusingly, one of the many sources of collecting snapshots includes a clone of the site prepared by student protesters.
#iipcGA15 No national web archive for Italy. U Bologna excluded from Wayback Machine, so hard to reconstruct its web history. Undaunted!— Jackie Dooley (@minniedw) April 28, 2015
Reconstruct a lost website:sometimes we still need persons to ask questions and do the paper trail and wait for student protests. #iipcGA15— susan aasman (@aasmanna) April 28, 2015
Last speaker of the second day Michael L. Nelson from Old Dominion University presented the work of his student Scott G. Ainsworth "Evaluating the temporal coherence of archived pages". With an example of Weather Underground site he demonstrated how unrealistic pages can be constructed by archives due to the temporal violations. He acknowledged that among various categories of temporal violations, there are at least 5% cases where there exists a provable temporal violation. He also noted that temporal violation is not always a concern.
#iipcGA15 How much of web archived? Sources vary. Are archives stable? Nope. Temporal drift while browsing? Yep bec sparse crawls.— Jackie Dooley (@minniedw) April 28, 2015
The third day sessions were in the Internet Archive building, San Francisco instead of the usual Li Ka Shing Center at Stanford University, Palo Alto. A couple of buses transported us to the IA and we enjoyed the bus trip in the valley as the weather was very good. IA staff was very humble and welcoming. The emulator of classical games installed in the lobby of IA turned out to be the prime center of attraction. We came to know some interesting facts about the IA such as the building was a church which was acquired because of its similarity with the IA logo and the pillows in the hall were contributed by various websites with the domain name and logo printed on them.
Sessions before lunch were mainly related to consortium management and logistics these include Welcome to the Internet Archive by Brewster Kahle, Chair address by Paul Wagner, Communication report by Jason Webber, Treasurer report by Peter Stirling, and Consortium renewal by the chair followed by break-out discussions to gather ideas and opinion from the IIPC members on various topics. Also, the date and venue for the next general assembly was announced to be on April 11, 2016 in Reykjavik, Iceland.
After the lunch break, your author, Sawood Alam from Old Dominion University presented the progress report on "Profiling web archives" project, funded by IIPC. With the help of some examples and scenarios he established the point that the long tail of archive matters. He acknowledged the growing number of Memento compliant archives and the growth of use of Memento aggregator service. In order for the Memento aggregator to perform efficiently, it needs query routing support apart from caching which only helps when the requests are repeated before cache expires. Then he acknowledged two earlier profiling efforts one being a complete knowledge profile by Sanderson and the other minimalistic TLD only profile by AlSum. He described the limitations of the two profiles and explored the middle ground for various other possibilities. He evaluated his findings and concluded that his work so far gained up to 22% routing precision with less than 5% cost relative to the complete knowledge profile without any false negatives. Sawood also announced the availability of the code to generate profiles and benchmark them in a GitHub repository. In a later wrap-up session the chair Paul Wagner referred to Sawood's motivation slide in his own words, "sometimes good enough is not good enough."
In the break various IA staff members gave us tour of the IA facility including book scanners, television archive, an ATM, storage rack, music and video archive where they convert data from old recording media such as vinyl discs and cassettes.
After the break a historian and writer Abby Smith Rumsey talked about "The Future of Memory in the Digital Age". Her talk was full of insightful and quotable statements. I will quote one of my favorite and will leave the rest in the form of tweets. Se says, "ask not what we can afford to save; ask what we can afford to lose".
#iipcGA15 Abby Smith Rumsey : what is at stake with preservation is the survival of the species. Humans have known how to pass knowledge— Ina DL Web (@inadlweb) April 29, 2015
Components of memory: starts by forgetting/ filter what is irrelevant/ keep what will be valuable in the future Abby #iipcGA15— Emmanuelle Bermes (@figoblog) April 29, 2015
Rumsey: Scale is an issue with information, it always has been (no exception re: web archiving) #iipcGA15— Abbie Grotke (@agrotke) April 29, 2015
Collect and make available, don't curate, allow the future to judge the value. #iipcGA15— Andy Jackson (@anjacks0n) April 29, 2015
#iipcga15 Abby Smith Rumsey: "ask not what we can afford to save; ask what we can afford to lose"— Dan Kerchner (@DanKerchner) April 29, 2015
Finally the founder of the Internet Archive, Brewster Kahle took the stage and talked about digital archiving and the role of IA in the form of various initiatives including book archive, music archive, and TV archive to name a few. He described the zero-sum book lending model utilized by the Open Library for the books that are not free for unlimited distribution. He invited all the archivists to create a common collective distributed library where people can share their resources such as computing power, storage, man power, expertise, and connections. During the QA session I asked when he thinks about collaboration, is he envisioning a model similar to the inter-library loan where peer libraries will refer to the other places in the form of external links if they don't have the resources but others do or in contrast they will copy the resources of each other? He responded, "both."
Brewster Khale: we need to develop unexpected uses of our digital libraries #iipcga15— Emmanuelle Bermes (@figoblog) April 29, 2015
Online music: IA started with concerts when music artists authorized it, in exchange for free storage and bandwidth #iipcga15— Emmanuelle Bermes (@figoblog) April 29, 2015
The chair gave a wrap-up talk and formally ended the third day session. Buses still had some time before they leave, so people were engaged in conversation, games and photographs while enjoying drinks and food. I particularly enjoyed a local ice cream named "It's-It" recommended by an IA staff. Lori Donovan from Internet Archive approached me and Mohamed Farag and initiated a good conversation about possible collaboration on archiving projects. We also talked about a project that WS-DL group at Old Domionion University was working on a few years ago to identify disaster related news and archive them. Our conversation ended up with a group selfie of three of us.
On fourth day Sara Aubry presented her talk on "Harvesting Digital Newspapers Behind Paywalls" in Berge Hall A where Harvesting Working Group was gathered while IIPC's communication strategy session was going on in Hall B. She discussed her experience of working with news publishers to make their content more crawler friendly. Some of the crawling and replay challenges include paywalls requiring authentication to grant access to the content and inclusion of the daily changing date string in the seed URIs. They modified the Wayback to fulfill their needs, but the modifications are not committed back to the upstream repository. She said, if it is useful for the community then the changes can be pushed out in the main repository.
Kristinn Sigurðsson engaged everyone to talk about the "Future of Heritrix". He started with the question, "is Heritrix dead?" and I said to myself, "can we afford this?". This ignited the talk about what can be done to increase the activity on its development. I asked the question, what is slowing down the development of Heritrix, is it out of ideas and new feature requests or there are not enough contributors to continue the development? There was no clear answer to this question, but it helped continuing the discussion. I also suggested that if new developers are afraid of making changes that will break the system and will discourage upgrades then can we introduce plug-in architecture where new features can be added as optional add-ons.
Harvesting Working Group discussion on Heritrix: We need a framework where multiple crawlers can exist #iipcGA15— Abbie Grotke (@agrotke) April 30, 2015
Helen Hockx-Yu took the microphone and talked about the Open Wayback development. She gave brief introduction of the development workflow and periodic telecon. She also talked about the short and long term development goals including better customization and internationalization support, display more metadata, ways to minimize the live leaks, and acknowledge/visualize the temporal coherence.
After a short break Tom Cramer gave his talk on "APIs and Collaborative Software Development for Digital Libraries". He formally categorized the software development models in five categories. He suggested IIPC to take the position to unify the high level API for each category of the archiving tools so that they can co-operate interchangeably. This was very appealing to me because I was thinking on the same lines and have done some architectural design of an orchestration system that achieves the same goal via a layer of indirection.
Daniel Vargas from LOCKSS presented his talk on "Streamlining deployment of web archiving tools" and demonstrated usage of Docker containers for deployment. He also demonstrated the use of plain WARC files on regular file system and in HDFS with Hadoop clusters. I was glad to see someone else deplying Wayback machine in containers as I was pushing some changes to the Open Wayback repository that will make containerization of Wayback easier.
LOCKSS can extract WARC files to play in Open Wayback through a Docker container #iipcGA15— Emmanuelle Bermes (@figoblog) April 30, 2015
During the lunch break Hunter Stern from IA approached me and told me about the Umbra project to supplement the crawling of JS-rich pages. Kristinn, me, and a few more people talked about the precision of time in HTTP/2.0, but no one was sure if it was changed from one second granularity to anything smaller such as millisecond or microsecond. Later I asked this question in the IETF HTTP WG mailing list and the response suggests that there was no change made to it. After the lunch there was a short open mic session where every speaker has got four minutes to introduce exciting stuff that they are working on. Unfortunately, due to the shortage of time I could not participate in it.
Peter Stirling at the BnF has been using WATs to map web archives relating to the First World War. Excited to see how it’s going! #iipcGA15— Ian Milligan (@ianmilligan1) April 30, 2015
Peter Stirling at #iipcGA15 : analyze web archives to understand the use of digitized heritage documents on websites related to WWI— ISSN Int. Centre (@ISSN_IC) April 30, 2015
#iipcGA15 Vinay Goel: if you have WAT files, you can directly produce CDX files from them— ISSN Int. Centre (@ISSN_IC) April 30, 2015
Vinay Goel-WAT files provide contextual info for users about collections such as which domains where crawled, which urls, etc. #iipcGA15— rosalie lack (@rosalielack) April 30, 2015
After a short break Access Working Group talked about "Full-text search for web archives and Solr". Anshum Gupta, Andy Jackson, and Alex Thurman presented "Apache Solr: 5.0 and beyond", "Full-text search for web archives at the British Library", and "Solr-based full-text search in Columbia's Human Rights Web Archive" respectively. Anshum's talk was on technical aspects of Solr while the other two talks were more towards a case study.
On the last day of the conference Collection Development and Preservation Working Groups were discussing their current state and plans in separate parallel tracks. Before the break I attended Collection Development Working Group. They demonstrated Archive-It account functionality. I expressed the need of a web based API to interact with the Archive-It service. I gave the example of a project I was working on a few years ago in which a feed reader periodically reads from news feeds and sends it to a disaster classifier that Yasmin AlNoamany and Sawood Alam (me) built. If the classifier classifies the news article to be in disaster category, we wanted to archive that page immediately. Unfortunately, Archive-It did not provide a way to programmatically do that (unless we use page scraping or some headless browser), so we ended up using WebCite service for that.
After the break I moved to the Preservation Working Group track where I had a talk scheduled. David S.H. Rosenthal presented his talk on "LOCKSS: Collaborative Distributed Web Archiving For Libraries". He described the working of LOCKSS and how it benefited the publishing industry. He described how Crawljax is used in LOCKSS to capture content that are loaded via Ajax. He also noted that most of the publishing sites try not to rely on Ajax and if they do, they provide some other means to crawl their content to maintain the search engine ranking.
Legal framework for LOCKSS: obtain explicit permission to crawl & preserve permission along w/ content #iipcGA15— Emmanuelle Bermes (@figoblog) May 1, 2015
LOCKSS has a peer-to-peer protocol for verification and repair of the content in the boxes, w/ authorization check #iipcGA15— Emmanuelle Bermes (@figoblog) May 1, 2015
LOCKSS supports a variety of stds incl. Memento, OpenUrl, http content negotiation,WARC import and export, bibliographic metadata #iipcGA15— Emmanuelle Bermes (@figoblog) May 1, 2015
#iipcGA15 LOCKSS runs "Red Hat" model: free open source software but paid support— ISSN Int. Centre (@ISSN_IC) May 1, 2015
#iipcGA15 Rosenthal: all components of LOCKSS processing chains should interact through web services— ISSN Int. Centre (@ISSN_IC) May 1, 2015
Sawood Alam (me) happened to be the last presenter of the conference where he presented his talk on "Archive Profile Serialization". This talk was in continuation with his earlier talk at IA. He described what should be kept in profiles and how should it be organized. He also talked briefly about the implications of each data organization strategy. Finally he talked about the file format to be used and how it can affect the usefulness of the profiles. He noted that XML, JSON, and YAML like single node file formats are not suitable for profiles and he proposed an alternative format that is a fusion of CDX and JSON formats. Kristinn provided his feedback that it seems the right approach of serialization of such data, but he strongly suggested to name the file format something other than CDXJSON.
While we were having lunch, the chair took the opportunity to wrap-up the day and the conference. And now I would like to thank all the organizing team members especially Jason Webber, Sabine Hartmann, Nicholas Taylor, and Ahmed AlSum for organizing and making the event possible.
Due to the parallel tracks I missed some sessions that I wanted to attend such as "SoLoGlo - an archiving and analysis service" by Martin Klein, "Web archive content analysis" by Mohammed Farag, "Identifying national parts of the internet" by Eld Zierau, "Warcbase: Building a scalable platform on HBase and Hadoop" by Jimmy Lin, "WARCrefs for deduplicating web archives" by Youssef Eldakar, and "WARC Standard Revision Workshop" by Clément Oury to name a few. I hope the videos recordings will be available soon. Meanwhile I was following the related tweets.
Farag: Event Focused Crawler (EFC) helps curators focus the crawls and improve quality #iipcGA15— Abbie Grotke (@agrotke) April 28, 2015
Youssef Eldakar from Bibliotheca Alexandrina is presenting issues (and solutions!) related the duplicates in WARC files at #iipcGA15— ISSN Int. Centre (@ISSN_IC) April 28, 2015
IIPC GA 2015 was a fantastic event. I had great time, met a lot of new people and some of those whom I knew on the Web, shared my ideas and learned from others. It was the most amazing one complete week I ever had. I appreciate the efforts of everyone who made this possible including organizers, presenters, and attendees.
Please let us know the links of various resources related to IIPC GA 2015 to include below.
- GA2015 OVERVIEW
- GA2015 SCHEDULE
- VIDEOS: Presentations from the 2 open days of the IIPCGA15
- What are they saying about IIPC GA2015?
- Flickr - IIPC
- Flickr - Ahmed AlSum
- #iipcGA15 Tweets
- Archived tweets, URIs, and embedded media for #iipcga15
- Popularity #iipcGA15 urls in Tweets
- Web Archiving in 2015 -- a Quick Redux of IIPC's General Assembly at Stanford - Tom Cramer
- Notes from IIPC General Assembly 2015 - Jefferson Bailey
- Let Them Emulate! - Andy Jackson
- IIPC 2015 Recap - Ian Milligan
- IIPC trabalha para salvar a memória da internet (Portuguese) - Carlos Eduardo Entini
- IIPC GA 2015, jour 1 : « context matters » (French) - Emmanuelle Bermes
- IIPC GA 2015, jour 2 : WARC, WAT, WET et WANE (French) - Emmanuelle Bermes
- Talk at IIPC General Assembly - David Rosenthal
- Assemblée Générale de l’IIPC à Palo Alto (French) - Claude Mussou
- Looking ahead from the 2015 IIPC General Assembly - Nicholas Taylor
- Olive Archive
- WebRecorder.io Beta
- Portuguese Web Archive Technologies
- Archive Profiler
- warcprox - WARC writing MITM HTTP/S proxy
- PyWb 0.9.7
- Heritrix 3
- LOCKSS - Lots Of Copies Keep Stuff Safe
- IIPC 2015 - GA - WAT Analysis Examples
- Chrome Extension - Memento Time Travel
- Chrome Extension - Mink
- WAIL - Web Archiving Integration Layer
- Memento - Time Travel
- Memento - API
- Robust Links
- What Did It Look Like?
- Docker OpenWayback
- Docker PyWb
- Open Library
- US first website (SLAC) on Stanford Web Archive
- Library of Congress > Archived Web Sites
Update (May 12, 2015): Added reference to HTTP/2.0 time resolution and some more blog posts.
Update (May 22, 2015): Added more blogs and tool references.
Update (June 1, 2015): Added link to the video recording playlist.