Thursday, July 28, 2011

2011-07-28: Web Video Discussing Preservation Disappears After 24 Hours

One week ago (July 21, 2011) I was fortunate enough to be invited to speak about Web Archiving on Canada AM, sort of like the Today Show or Good Morning America in the US. I was asked to appear on the program in part because of the July 17, 2011 article in the Washington Post, which followed a July 6, 2011 blog post for the Chronicle of Higher Education, which was based on a June 23, 2011 blog post about our JCDL 2011 paper "How Much of the Web is Archived?". In other words, the process went like this: step 1 - get lucky & step 2 - let preferential attachment do its thing.

I was able to do the appearance in Washington DC, while attending the NDSA/NDIIPP 2011 Partner Meetup. The morning of July 21, I took a taxi to an ABC studio in DC, did the interview (about 4 minutes) and took a taxi back to the conference in time to make the morning session. I had not been on TV before and was both nervous and excited. The local and Canadian crew made the entire experience painless and the whole interview was over right as I started to get comfortable.

Given the short time, I tried to stress two topics: the first is that the ODU/LANL Memento project is not a new archive, but rather a way to leverage all existing web archives at once (this is a common misunderstanding we've experienced in the past). The other point I tried to make was that much of our cultural discourse occurs on the web and we should try to preserve as much of that as possible (including things like lolcats) because we (collectively) do a bad job at predicting what will be important in the future. Shortly after airing, the video segment was available on-line at:

http://www.ctv.ca/canadaam/?video=504307

As the URI suggests, this is the homepage for Canada AM (http://www.ctv.ca/canadaam/), but with an argument ("?video=504307") specifying which video segment (i.e., each individual story -- not the entire morning's show) to display. I shared the video URI with colleagues, friends, and family and was enjoying my 4 minutes of fame (I should still have 11 left in the bank). I had not made a local copy of the video because their web site obfuscated the actual URI of the streaming video, I had to finish the rest of the conference and drive back to Norfolk, and I thought I would have the time to figure it out after I returned.

So imagine my surprise on Friday at about lunch time when I reload the URI and do not see the video, but instead a newly redesigned Canada AM web page! The video of me making the point that we should save web resources lasted approximately 24 hours. I don't mean to seem ungrateful for the opportunity Canada AM afforded me, but as a professor I try to see everything as a teaching opportunity, so here it goes...

Sometime on Friday morning (July 22), the entire web site was redesigned and the old URIs no longer worked (cf. "Cool URIs Don't Change"). The video id was an argument and is now silently ignored, so even worse than a 404 you now get a "soft 404":

% curl -I http://www.ctv.ca/canadaam/\?video=504307
HTTP/1.1 200 OK
Server: Apache/2.2.14 (Ubuntu)
Content-Type: text/html
X-Varnish: 2550613724
Date: Thu, 28 Jul 2011 16:55:48 GMT
Connection: keep-alive


The soft 404 means people clicking on the original video link in Facebook, Twitter, email, etc. won't even see an error page -- they see the new site, but without the video or indication that the video is missing. The new site has a link titled "watch full shows", with the URI:
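
One common heuristic for detecting soft 404s is to compare the page for the URI in question against the page returned for a deliberately bogus variant; if both come back 200 with essentially the same body, the server is ignoring the distinguishing argument. A minimal sketch of that idea (the responses here are hypothetical, not Canada AM's actual pages):

```python
# Sketch of a soft-404 heuristic: a real resource should differ from a
# nonsense variant of its URI; a soft 404 serves the same page for both.

def is_soft_404(real_response, bogus_response):
    """Each response is a (status_code, body_bytes) tuple."""
    real_status, real_body = real_response
    bogus_status, bogus_body = bogus_response
    return (real_status == 200 and bogus_status == 200
            and real_body == bogus_body)

# Hypothetical example: the same homepage comes back for any video id.
homepage = b"<html>Canada AM</html>"
print(is_soft_404((200, homepage), (200, homepage)))          # True: soft 404
print(is_soft_404((200, b"<html>video</html>"), (404, b"")))  # False: hard 404
```

In practice the two bodies rarely match byte-for-byte (timestamps, ads), so real implementations compare document similarity rather than equality.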

http://www.ctv.ca/canadaAMPlayer/index.html

This page is textually described as the "Canada AM Video Archive", but the archive begins on July 22, 2011 -- one day after my appearance! The new segments are available at URIs of the form:

http://www.ctv.ca/canadaAMPlayer/index.html?video=504933

The older videos are not available, not even as an argument to the new URI, which also returns a soft 404 (i.e., the video is not available despite the 200 response):

% curl -I http://www.ctv.ca/canadaAMPlayer/index.html\?video=504307
HTTP/1.1 200 OK
Server: Apache/2.2.14 (Ubuntu)
Content-Type: text/html
X-Varnish: 2550976182
Date: Thu, 28 Jul 2011 17:35:35 GMT
Connection: keep-alive


The video ids appear to continue the old sequence (i.e., they did not start over at "1"), so URL rewriting could easily keep all the old video URIs working, unless whatever CMS hosted those videos has been retired with no migration path forward.
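
Since the server identifies itself as Apache, a single mod_rewrite rule could have preserved the old URIs. A hypothetical sketch (the paths and parameter name are inferred from the URIs above; the site's actual configuration is of course unknown):

```apache
RewriteEngine On
# Hypothetical rule: send old-style /canadaam/?video=NNN requests to the
# new player. The query string (which carries the video id) is preserved
# automatically because the substitution contains no query of its own.
RewriteCond %{QUERY_STRING} (^|&)video=\d+
RewriteRule ^/?canadaam/?$ /canadaAMPlayer/index.html [R=301,L]
```

A 301 would also let search engines and archives follow the old links to the new location.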

Here are some screen shots of the newly redesigned home page (left) and the video archive page (right) from July 22:

Of course, I did not think to make a screen shot of the original home page, or the page of my video because I thought it would live longer than 24 hours! I was able to find a recent (December 8, 2010) copy in the Internet Archive's Wayback Machine:

http://web.archive.org/web/20101208084455/http://www.ctv.ca/canadaam/

And I also pushed the two pages above to WebCite, which nicely contrasts two styles of giving URIs for archived pages (URI-M in Memento parlance):


http://www.webcitation.org/60NizRC0o
http://www.webcitation.org/60Nj60H8D

The IA's URIs violate the W3C "good practice" of URI opacity, but they sure are handy for humans. WebCite actually offers both styles of URIs, for example the latter of the two URIs above is equivalent to:

http://www.webcitation.org/query?url=http%3A%2F%2Fwww.ctv.ca%2FcanadaAMPlayer%2Findex.html&date=2011-07-22

But the resulting URI encoding, while technically correct, is not conducive to easy memorizing and exploration by humans. Different styles of using a URI as an argument to another URI will be explored in a future blog post.
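
The query-style URI can be built mechanically; the only subtlety is percent-encoding the embedded URI (':' becomes %3A, '/' becomes %2F) so it cannot be confused with the enclosing URI's own structure. A short sketch (the helper function name is my own):

```python
from urllib.parse import urlencode

# Build the "URI as an argument" style of WebCite URI shown above.
def webcite_query_uri(url, date):
    return "http://www.webcitation.org/query?" + urlencode({"url": url, "date": date})

uri = webcite_query_uri("http://www.ctv.ca/canadaAMPlayer/index.html", "2011-07-22")
print(uri)
# http://www.webcitation.org/query?url=http%3A%2F%2Fwww.ctv.ca%2FcanadaAMPlayer%2Findex.html&date=2011-07-22
```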

Fortunately I was given a DVD of the session, from which I was able to rip a copy and upload it to YouTube, provided below with the dual interests of vanity and pedagogy. I'm not sure about its status with respect to copyright, so it might disappear in the future as well. It should be covered under fair use, but I would not count on it. However, that is also a topic for another blog post...



--Michael

2012-05-30 Update: Apparently Canada AM did create a new page about the video, including a nice, anonymously authored summary of the material with direct quotes from me:


It appears to be authored on July 24, 2011, not just via the byline but through the HTTP response headers as well.  For example, look at the "Last-Modified" header for this image that appears in the page:

% curl -I http://images.ctv.ca/archives/CTVNews/img2/20110721/470_professor_nelson_110721_225128.jpg
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2
Last-Modified: Sun, 24 Jul 2011 10:52:59 GMT
ETag: "a9e08e-51a4-807938c0"
Accept-Ranges: bytes
Content-Length: 20900
Content-Type: image/jpeg
Date: Wed, 30 May 2012 14:02:21 GMT
Connection: keep-alive
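
HTTP date headers like Last-Modified use the RFC 1123 format, which the Python standard library parses directly -- handy when dating content this way:

```python
from email.utils import parsedate_to_datetime

# Parse the Last-Modified value from the response above into a datetime.
last_modified = parsedate_to_datetime("Sun, 24 Jul 2011 10:52:59 GMT")
print(last_modified.isoformat())  # 2011-07-24T10:52:59+00:00
```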

I originally wrote the above article on July 28, 2011 and I was unable to find any trace of my appearance on their site.  Perhaps I just missed it, or perhaps it was written but not yet linked.  This nicely illustrates the premise behind Martin Klein's PhD research: things rarely disappear completely, they just move to a new location; the trick is finding them.

Thursday, July 21, 2011

2011-07-25: NDSA/NDIIPP Partner Meetup 2011 Trip Report

The NDSA/NDIIPP (@ndiipp) Partner Meetup took place July 19-21 at the Hyatt Regency Washington on Capitol Hill in Washington, DC. Technical and non-technical attendees alike came together as a consortium of archivists, librarians, digital media specialists, and other concerned parties. Three representatives from the ODU Web Sciences and Digital Libraries group attended to make archivists aware of the tools they had developed toward the common goal of web archiving.

WS-DL’s Contributions to the NDSA/NDIIPP Meetup

Mat Kelly presented the Mozilla Firefox add-on Archive Facebook to a breakout group of presentations specifically targeting web archiving. The redesigned and re-architected add-on allows users to archive the content of their Facebook accounts, producing a truly WYSIWYG capture versus Facebook’s native offering of a content dump.

Vivens Ndatinya showed the workings of a tool he is currently building with his presentation, “Creating Persistent Links to YouTube Music Videos”. The software serves as a medium between a user and YouTube where, if a music video has been deleted or removed, the proxy will search for a comparable or official substitute and seamlessly forward the user to the resource for which he/she was looking.

Michael Nelson presented "How Much of the Web is Archived?", which was also presented at JCDL 2011. By sampling URIs from DMOZ, Delicious, bit.ly, and search engines and cross-referencing them with various archives, they were able to estimate the likelihood that a URI is archived and to answer the question of how much of the web is archived with, "It depends on the source of the URIs".


The Speakers

Martha Anderson (@MarthaBunton), the director of program management for the National Digital Information Infrastructure and Preservation Program (NDIIPP), exclaimed that “We are growing!” in her introductory presentation, citing the increasing number of members in the group and the broadening scope of the members’ specializations. She introduced the conference theme, "Make It Work", and stated that the conference’s 3 days were organized around the respective keywords of “Open”, in that all presenters were committed to openness; “Solve”, where all speakers presented studies on creative approaches toward solving their problems; and “Connect”, which had a focus on community building and relationships.

Tim O’Reilly (@timoreilly), founder and CEO of O’Reilly Media Inc., kicked off the list of speakers, providing insightful one-liners such as “Forgetting makes room for new things”, “Design more systems that have their own memory” and “We’re engaged in the wholesale destruction of our history”. He listed two of his past failures where his process of archiving could have been improved:

In 1993 he created one of the first websites but neglected to archive it. “Things that turn out to be historic”, he stated, “aren’t deemed to be historical at the time” – a theme that reverberated through many other presentations.

His second preservation failure was in 1998 when he attended the inaugural Open Source Summit (link?), where the term “Open Source” was officially born. Learning from his 1993 failure, he diligently built an archive and linked to all of the relevant content, but neglected to archive the deep links, which meant the information his coverage pointed to was no longer available at time of access.

O’Reilly rhetorically queried the audience, “What kind of tools do we need in the everyday practice of the digital world to encourage preservation?” He stated that we have to consider widely divergent scenarios if we are to archive effectively. He reiterated that the tools we have should be adapted to make it more likely that archives survive when things go awry. “What matters?”, Tim asked, again referencing his two failures and answering his own question. He emphasized that our current perspective of what matters is temporally subjective and that we are likely neglecting to archive collections we now consider trivial.

To close up, Tim emphasized that there should be an exception in copyright for the sake of archiving so that our past will be preserved.

Yancey Strickler (@ystrickler) came on next to speak about his project Kickstarter, a funding platform for creative projects. Kickstarter works on an all-or-nothing approach to fund-raising where users can offer monetary support for projects they believe worthwhile, with no commitment if the project fails to get funded. Yancey spoke of a tipping point in the funding process: a large majority of projects that attain 30% of their goal go on to reach, and sometimes exceed, the threshold. Those that donate are forbidden from being rewarded with equity, but the fundee usually provides something priceless in return, like a photo for a donor from a project in which a girl wished to sail around the world, or the ability to be first to purchase a potentially popular iPod accessory that failed to get traditional backing.

Kickstarter takes only a small cut (5% of the raised funds) to remain sustainable, and receives it only if the project gets funded. With this, Kickstarter and the projects both grow. “One day”, Yancey said, “we’ll hopefully be a cultural institution”.

Michael Edson (@mpedson) of the Smithsonian Institution came on next after a short break with his presentation “Let us go boldly into the present”. Michael emphasized that the time to archive is now and that “today is the future that all of the visionaries wrote about.” To do so, he gave five “design patterns” that we should exhibit to assure that the present is archived:

  1. Extra-terrestrial Space Auditor is best depicted by an extraterrestrial that examines an organization, blind to its current workings, and provokes the organization to ask itself whether it is performing as it should in terms of business practices, HR, etc., or has been skewed in operation by the baggage of the last epoch.
  2. On Ramp and Loading Docks encourages the mindset that successful preservation is not about building infrastructure but rather creating movement.
  3. Edge to Core suggests that the best work is done on the fringes of an institution where subject matter experts exist. “An organization”, Michael said, “should develop a process that brings in and bootstraps these experts so their ideas can scale.”
  4. Self Awareness about organization change patterns states that there are predictable miscommunications and general crankiness in an organization between innovators and management.
  5. Focus on the mission was Michael’s observation that, of the 80 to 90 organizations he had spoken to in the last few years, the ones not struggling in their pursuit of worth know the outcomes they want in society.

After Michael, Aaron Presnall (blog) of the Jefferson Institute came on to speak about “Tools for informing public decision-making”. A continuing project of his was to assist those at the National Archives of Serbia in archiving their documents. Many of these documents are of great interest, as they document the recent struggles and secession of the country and have immediate applications (such as implications for war crimes) if preserved. Using the tools available, some unconventional, Aaron assisted those interested in moving from the dissolving stacks of papers to a digital form. He then built a management tool and genericized it to allow it to be reused in instances beyond Serbia. He has since been approached by Bosnia, which wishes to do the same as Serbia, and because of the generic setup Aaron created, the information Bosnia has to offer will not be lost.

With Aaron being the last presenter for the day, Abby Rumsey moderated a panel discussion/Q&A with all of the Day One speakers, first hoping to address Martha's question, "How do we make it work?" She first asked Aaron how to connect the demand for archiving with the supply of skill, and whether something needs to be in place to make these connections easier. He replied with the need to communicate the success of individual cases to a much broader audience, convey the lessons learned, and establish best practices for performing such an archiving session. He admitted that it's difficult "to make archiving sexy" but popularized projects such as History Pin get people thinking and both energize and popularize the task of archiving.

Tim O'Reilly expounded on Aaron's reply, referencing a collection of railway edition books from the 1880s that were bound by people who found the works both valuable and beautiful. "When some individual finds something that would otherwise be disposable and finds it beautiful and a keepsake", Tim said, "that's a wonderful impulse for preservation". He continued, "When we allow things to be reused by individuals, it really appeals to the value of fair use." He went on to speak about how intellectual property fights against preservation, and said that what we can do to preserve things of value is to give them more freedom.

Abby then questioned Michael Edson about how his approach of Edge-to-Core has had an impact on The Smithsonian. Michael gave the example of how the Smithsonian handled the inception of the world-wide web with no business process in-place. "Because the institution took a decentralized approach to managing content and ideas", he said, "there was no existing infrastructure to make order out of the web. It's been a series of opportunistic efforts to pick the pieces of the low hanging fruit and bring them to the center of the organization to achieve scalability and a greater impact."

Yancey was then asked, "How do you get something where the connections are so profoundly personal into something that really scales to the level we think about with digital preservation?", citing Wikipedia's scaling issues. Yancey alluded to Wikipedia's moderation challenges in terms of curation with, "What happens if I'm a guy who knows a lot about a topic you're concerned with archiving and I decide to reach out and tell you everything I know and all of the ways you're wrong? What do I get to contribute? Do I have any voice whatsoever?" Aaron replied, "Exactly, that's a tremendous challenge -- while 80% of the time you're right, 20% of the time you could be fundamentally, deeply, troublingly wrong."

The Q&A was followed by a reception accompanied by 30-or-so poster displays. Of particular interest to the WS-DL members was the ACE Audit Manager and Integrity Management System, an integrity auditing system for archives, which would prove useful in both the Memento and Archive Facebook projects. This closed out day one.

Day two started with a presentation from Helen Hockx-Yu from the British Library. "In the UK", she said, "there are two archives - the UK Web Archive and the UK Government Web Archive." She added that there is pending legislation that would limit the viewing of archives to on-site within the library. "Web archiving in the UK", she said, "is only 10 years old at the British Library - much younger than the Internet Archive." One notable part of the collection, which she said the British Library found accidentally, is the oldest archived website - the British Library's own website from 1995, which was found stored away on a library server.

Tricia Cruse of the California Digital Library spoke next about "Curation approaches in a public university system", stating "We're seeing an ever-increasing amount and degree of diversity of content. While our budgets were going down, we have had to do more with less." She also spoke of EZID, a system for users to create unique identifiers for their archived content; UC3 Merritt, a place where researchers can collaborate and data can be stored and shared; and Digital Curation for Excel (DCXL), an open source Microsoft Excel add-in that makes versioning, archiving, and applying unique identifiers easier when working in Excel.

Jack Brighton of WILL, a radio/television station in Illinois, spoke of "Archiving at Web Speed". Jack spoke of his efforts in preserving the station's broadcasts using PBCore and emphasized the need to adapt the archiving process to make it as painless as possible for those who did not necessarily see the value of the content at the current time.


Ben Vershbow (@subsublibrary) from the New York Public Library finished up the first session of the day with his presentation, "Bringing in the Crowd". He cited a project his group created, "What's on the Menu?", which was a crowd-sourced effort to transcribe old menus. He believes that there is an untapped reservoir of time and through crowd coordination and building datasets, people will be willing to devote their time for free.

Subsequent to Ben's presentation, the crowd broke up into three groups for workshops. The three topics of the workshops were "And the winner is..: How does a community recognize achievement?", "Tales from the crypt: What are the emerging practices of large scale storage" and "Special Interest Session: Web Archiving: Pecha Kucha and discussion of emerging topics in Web archiving". Because Vivens and Mat presented in the third of these, the WS-DL members attended and participated in that session.

Presentations resumed after the breakout session with the theme of Open Source Tools and Community. The first presentation was by MacKenzie Smith (website) of MIT with "Exhibit3@MIT: Lessons learned from 10 years of the Simile Project for building library open source software". MacKenzie stated that "Everybody's a curator" and asked, "If we're creating these tools for the public, how can we assure that these tools will flow into the organizations, as many die? When you're doing a project that's open source, you need to design for that community from the beginning." MacKenzie went on to say that metrics should be used to gauge an open source project's chance of success, that a project is more likely to be sustainable if it has an audience "outside of this room" (i.e., outside of the archiving community), and that maintenance of the code has to be done by those who are committed, not just casual developers.

Sharon Leon (website) of George Mason University then presented "Omeka: from digital exhibits to web publishing platform". Omeka is a plug-in based Content Management System (CMS) modeled off of Wordpress that emphasizes extensibility. Sharon repeatedly emphasized the openness of the platform and that her group "specifically fights against Flash for re-use", as wrapping content in a Flash-based application limits access to the content within. She also mentioned that in developing a grant-funded open source project, one should not spend all of the funds on the development of the project but rather should put funds toward workshops, outreach and marketing of the product.

Michele Kimpton (website) spoke of ways to go beyond grant funding once it's exhausted with "Building and sustaining open source communities through the life cycle: Dspace, Fedora and DuraCloud case studies". Her group has created a write-up on the Meetup.

Following Michele was another breakout session of concurrent workshops on the topics of "Tools at risk", "I can haz standardz" and "Developing cutting-edge internship programs in digital preservation: What are the essential elements?". The WS-DL group attended "I can haz standardz", which disappointingly was more about the difficulty the non-technical have in building tools for data management than about the standards themselves. As the group was all of a technical mind, this was clearly the wrong workshop of the three to attend.

After another short break came a third set of concurrent workshops: "Digital preservation in a box: What are the key resources for digital preservation and education and outreach?", "Slaying the dragons: What is at risk and how do we rescue it?" and "The Challenge challenge: What are ways we can spark digital preservation innovation?". The WS-DL group attended the third of the three. There, the attendees were broken into groups, with each group tasked to discuss a single topic in depth and with varying concerns in each group. Unlike the previous workshop, one topic was specifically technical - that of investigating how one assures archive integrity from a host and how to go about performing an audit on the collections stored. The WS-DL group, along with Michelle Gallinger (@mgallinger), Professor Micah Beck (website), Mike Smorul (@msmorul) and a couple of others, devised the Storage Ping concept, which would require those that host collections to enable a client-induced check on the server's collection integrity.

Day 3 started out with an introduction by Martha Anderson, followed by the first presenter, David Rosenthal (website) of Stanford University on "Cloud Storage for LOCKSS Boxes". LOCKSS (Lots Of Copies Keep Stuff Safe) boxes are dedicated computers with local storage that communicate with each other and repair any damage to data. David discussed speed challenges he encountered when developing his system and conveyed a method of assuring the integrity of data, and its existence on a remote server, by prepending a nonce. He has recently been working with students at Carnegie Mellon University to develop a crawling process that he described as being "a pretty robust approach to form filling." He also expressed some difficulty he has had in the past with archiving AJAX-based content but emphasized that his archiving process is different from others', as he does not use Heritrix, the crawler used by the Internet Archive.
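
The nonce idea can be sketched as follows (an illustration of the general technique, not David's actual protocol): the auditor sends a fresh random nonce, and the server must return the hash of the nonce prepended to the data, which it can only compute if it still holds the data. Because the nonce is new each time, a previously computed hash cannot be replayed.

```python
import hashlib
import os

def make_challenge():
    return os.urandom(16)  # fresh random nonce per challenge

def prove_possession(nonce, data):
    # Server side: hash the nonce prepended to the stored bytes.
    return hashlib.sha256(nonce + data).hexdigest()

def verify(nonce, data, proof):
    # Auditor side: recompute against its own copy and compare.
    return proof == hashlib.sha256(nonce + data).hexdigest()

data = b"archived web content"
nonce = make_challenge()
proof = prove_possession(nonce, data)
print(verify(nonce, data, proof))        # True
print(verify(nonce, b"tampered", proof)) # False
```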

After David, Cal Lee (website) of UNC Chapel Hill analyzed the four NDIIPP State projects:

  1. Persistent Digital Archives and Library System (PeDALS) by Arizona
  2. A Model Technological and Social Architecture for the Preservation of State Government Digital Information by Minnesota Historical Society
  3. Geomap (GIS data, headed by the North Carolina Center for Geographic Information and Analysis)
  4. Multi-state Preservation Consortium by Washington State Archives

The questions he asked about each project included:

  • What are the main factors that drove the project in the first place?
  • What brought these about?
  • Who was involved and why?
  • What were the activities they engaged in before this?

Following Cal was Robert Horton from the Minnesota Historical Society, who presented a slide-less report of his NDIIPP-sponsored project. Robert spoke of a soon-to-be-enacted uniform law for the preservation of, authentication of, and access to electronic legislative records. The legislation will require the use of digital signatures to sign all legislative content online.

Peter Krogh (@peterkrogh) of the American Society of Media Photographers spoke next with, "Extending the reach of www.dpBestflow.org". Peter had been investigating means of collaboration and methods to get people to archive by conveying the task of archiving in a way that will appeal to the would-be archivist.

After a break, summaries of the 2010 DPA finalists sponsored by the Library of Congress were presented. WS-DL's own Michael L. Nelson (website) reported on the Memento project (joint work with Herbert Van de Sompel (who gave the original presentation in London in December 2010) and Robert Sanderson of LANL), which was referenced multiple times by other presenters throughout the meetup. Dr. Nelson stated that there is currently a disconnect in viewing web archives, as there is no seamless way to move between the past and the present. Memento overcomes "being stuck in the perpetual now" by leveraging content that already exists in the web archives and providing a bi-directional means to view different versions of a web site on-the-fly. Michael stated that Memento does not create web archives but rather puts the notion of time onto the web.

Following Michael's presentation was Fran Berman (website) of Rensselaer Polytechnic Institute with "Economics and Digital Preservation", a final report of the Blue Ribbon Task Force (BRTF), whose mission is to promote sustainable digital preservation and access. Fran spoke of BRTF's investigation of the technical, economic, and social problem. "Infrastructure is not free", she said, "and the preservation of and access to our data is not free. Because it is not free and because there are so many interesting solutions, you see it as a multivariate problem." She stated that the Task Force wanted to do a deep dive into the economics of the cost of digital preservation.

"Our charge was to do roughly three things", Fran enumerated:

  1. Assemble a representative group of experts with broad perspective and influence.
  2. Look at the problem space: how can we structure and understand it in a way that helps us take action?
  3. Come up with actionable recommendations.

The BRTF created a report with its recommendations.

The final presentation of the conference was by Kari Kraus (@karikraus) of the University of Maryland with "Preserving Virtual Worlds" (Jerry McDonough gave the original presentation in London). Kari spoke of her attempts at preserving virtual worlds, repeatedly referencing examples from Second Life. The project was a multi-institution, multi-disciplinary effort by the University of Illinois at Urbana-Champaign, Stanford University, the Rochester Institute of Technology and the University of Maryland that investigated preserving virtual worlds for their aesthetic merit as well as their economic significance. "We believe there is tremendous cultural importance to these artifacts", she said. "We believe video games represent the limit case of what we can do with digital preservation. If we can figure out how to save a classic first-person shooter game like Doom, we'll have a better chance of preserving computational simulations of genetic evolution or climate change or the galactic behavior of star systems."

She said that their mission was very practical: they needed to ingest game bits into institutional repositories and provide packaging standards for doing so. Other virtual worlds she mentioned investigating were Spacewar, Adventure (interactive fiction) and Mystery House (interactive fiction), among others.

In Closing

Neither of the WS-DL student presenters had presented at a meetup/conference of this caliber before, which made the experience more than worthwhile. Much was learned about the various efforts of the archiving community and WS-DL's projects gained exposure. Further, we were made aware of others' efforts and found some resources that we hope to integrate into our research in the near future.


— Mat Kelly

2011-07-21: Towards a Machine-Actionable Scholarly Communication System

I've told all the members of my research group they should watch this, so I thought I might as well make the same recommendation to the rest of the world... Herbert Van de Sompel presented "Towards a Machine-Actionable Scholarly Communication System" at LIBER 2011 in Barcelona, Spain on June 30, 2011.

You really have to simultaneously watch the video and review the slides to get the full impact of the presentation. The first part is a succinct review of various projects, but starting at slide 16 ("nanopublications") things really get interesting. Well worth the 40 minute investment.





--Michael

Tuesday, July 5, 2011

2011-07-05: JCDL 2011 Trip Report

JCDL 2011 (#jcdl2011) was held June 13–16 in Ottawa, Ontario, Canada. The weather was beautiful and the conference sessions wonderful. The ODU Web Sciences and Digital Libraries team was fortunate enough to have six of its members attend, present three short papers, and demonstrate the Synchronicity Firefox extension.

Our Contributions to JCDL 2011

Ahmed Alsum presented How Much of the Web is Archived? This paper approximates the amount of the Web that is archived using four URI sources. From this data, we observe significant variation in archival rate in URIs from different sources. So, how much of the web is archived? It depends on which web you mean. (pdf, slides).


Martin Klein presented Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures, which details a method for discovering missing web pages (the dreaded 404). Martin also demonstrated Synchronicity, a Firefox extension that uses lexical signatures (and other methods) for automatically rediscovering missing web pages in real time (pdf, slides).



Abdulla Alasaadi presented Persistent Annotations Deserve New URIs, which describes a method for creating new persistent URIs for annotations and creating persistent, independent archived versions of all resources involved in the annotation (pdf, slides).

The Web Archive Globalization Workshop

After the conference, the members of the Web Science and Digital Libraries team attended the Web Archive Globalization Workshop. This workshop focused on current initiatives and future possibilities. Eric Hetzner provided insight into the California Digital Library's web archiving activities. The Library of Congress's Nicholas Taylor told us about the Library's digital preservation initiatives. Brad Tofel of the Internet Archive gave us the lowdown on the future of web archive formats (ARC, WARC, WAT) and the Wayback Machine. Robert Sanderson of the LANL Research Library provided an overview of current Memento infrastructure. There was much discussion about current archiving challenges, including management of huge volumes of information, copyright considerations, and the challenges of making the archives accessible to researchers and the public. (slides)

The workshop was organized by:

  • Frank McCown, Harding University
  • Hector Garcia-Molina, Stanford University
  • Michael L. Nelson, Old Dominion University
  • Andreas Paepcke, Stanford University

Keynotes

The opening keynote, "Leaving the Cathedral and Entering the Bazaar: Library and Archives Canada Engages Canada’s Digital Society," was given by Daniel J. Caron, the current Librarian and Archivist of Canada. Mr. Caron discussed the issues and opportunities faced by national libraries as they transition from an analog to a digital environment. He compared the situation to the cultural and process differences put forward by Eric S. Raymond in The Cathedral and the Bazaar. It was an excellent talk and I really got the impression that Mr. Caron understood the transition required and the chaos inherent in a technological change of this magnitude.

Wednesday's opening talk was given by IBM's Joan Morris DiMicco. "Data Narratives: Telling Stories with Data" (slides) focused on current research at IBM into data visualization as a storytelling medium. She defined a story as concrete, temporal, purposeful, and emotional, and gave brief presentations of Many Bills (visualizing legislative text), SaNDVis (social relationship search), and Second Messenger (the impact of visualizations on group behavior).

Christopher R. Barnes, the director of NEPTUNE Canada, described the NEPTUNE Canada cabled ocean observatory using many wonderful illustrations and photographs. He then went on to describe the digital library problem he and his team face: the 4+ (and growing) gigabytes of data collected daily by the project. This data is used by over 8,000 users. Storage, cataloging, and access are ever-growing challenges the digital library and preservation communities can help with.

Session Highlights

Two or three sessions ran simultaneously during the conference, so I was not able to attend all presentations.

Session 1 presented automated methods to assist human understanding of texts. There were full papers on improving understanding of historical word sense variation (Measuring Historical Word Sense Variation) and improving information extraction from PDF books (Structure Extractions from PDF-based Book Documents), and a short paper on using syntactic dependency parse trees to learn expected patterns between lexical arguments (Word Order Matters: Measuring Topic Coherence with Lexical Structure).

Session 5 explored rediscovery of missing web content, a topic near and dear to us. This session included two of our short papers and full papers on using patterns to efficiently implement web archiving (Archiving the Web Using Page Changes Patterns: A Case Study) and identifying academic home pages (On Identifying Academic Homepages for Digital Libraries).

The impact of copyright on access and use was covered in session 7. The attitudes of the social-media savvy were explored in The Ownership and Reuse of Visual Media, and the implications of data quality problems in national bibliographies in Using National Bibliographies for Rights Clearance.

Session 8 looked at methods to annotate the Web. Rob Sanderson presented SharedCanvas (preprint, slides). There was also a paper on combining superimposed information with digital libraries (Use of Subimages in Fish Species Identification: A Qualitative Study). Our Persistent Annotations Deserve New URIs short paper was also presented in this session.

Sessions 11 and 12 looked at the needs and abilities of users and at improving the digital library experience. Understanding Digital Library Adoption: A Use Diffusion Approach and In the Bookshop: Examining Popular Search Strategies studied how users interact with digital libraries. Improving recommendations was examined from several perspectives (A Social Network-Aware Top-N Recommender System using GPU, Serendipitous Recommendation for Scholarly Papers Considering Relations Among Researchers, and Product Review Summarization from a Deeper Perspective).

Other Perspectives on JCDL 2011

— Scott G. Ainsworth