Friday, July 26, 2013

2013-07-26: Digital Preservation 2013 Trip Report

The time of year has again arrived for conferences related to our research area of web science and digital libraries. While much of our group will be representing the university at the Joint Conference on Digital Libraries (JCDL) in Indianapolis (trip report), I was given the opportunity to attend Digital Preservation 2013 in Alexandria, Virginia.

Being much closer to home in Hampton Roads, this is the third year running that I have attended this conference (2012 Trip Report, 2011 Trip Report), having presented a digital preservation tool at each: Archive Facebook in 2011 and WARCreate in 2012. Following up on the recent public release of WARCreate (see the announcement), this year I gave a presentation titled "WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy" covering WARCreate; another package I had created, Web Archiving Integration Layer (WAIL), originally unveiled at Personal Digital Archiving 2013 in February (Trip Report); and how all of the pieces fit together.

Long before it was my turn to present, however, the lineup included a fantastic cast of other presenters. To start off the conference, Bill LeFurgy (@blefurgy) gave the welcoming remarks.

Day One

Bill started by noting that this was the 9th year of the annual NDIIPP meeting and that a lot had changed in that time. Reminiscing about the conference's start in 2004, he remarked on how much progress has been made in preservation efforts since then. "One of the principal goals was to build a community around the process of digital stewardship," he said. Bill then introduced the first speaker of the conference, Hilary Mason.

Hilary Mason (@hmason) is the chief scientist at bit.ly. Hilary started her presentation, titled "Humans and Data", by noting that, being an engineer, she was there to learn. She offered her expertise on "how engineers and startup people think about preservation when they think about it at all," she described, "...which is not that often. That's the punchline."

Commenting on social behavior, she referenced a reddit thread that posed the question "If someone from the 1950s suddenly appeared today, what would be the most difficult thing to explain to them about life today?" to which the top answer was, "I possess a device, in my pocket, that is capable of accessing the entirety of information known to man. I use it to look at pictures of cats and get in arguments with strangers."

"That's the Internet!", said exclaimed, "While technology and our technical capability has changed very rapidly, human nature has not changed at all."

She continued on to speak of the origins of bit.ly and how it was a complete accident. bit.ly began as part of a feature of another product spun out of a company called Betaworks. "They [Betaworks] had this brilliant idea that when you're reading a news article on the Internet, other people are reading that article at the same time and yet our experience of that is very lonely. It is not in any way social," she continued. "So they thought, what if we added a social layer to news consumption? They built a system where you could see the mouse cursors of everybody else on the news article with you. So you can guess what happened then." She described the behavior of the users, who did the exact opposite of what was intended: they swore at each other and chased each other around the screen.

"It was horrible!", she said, "It had the opposite social effect that the product was intended to have. But two different things that were useful came out of it. One of them was bit.ly, which was just a little way to share content in that tool."

Along with most of the presentations at Digital Preservation 2013, I captured this one on digital video and made it viewable here.

Part One
Part Two

Following Hilary, Sarah Werner (@wykenhimself) of the Folger Shakespeare Library presented "Disembodying the Past to Preserve It". She spoke of collections of indulgences and how, because the physical items were not considered valuable, the ones that did survive were reused as waste paper and thus found in bindings of other saved items. "Being treated as disposable is how they survived." she said.

She continued by describing The Great Parchment Book, a collection of 165 leaves recording a survey, compiled in 1639, of all of the estates managed by the City of London, which was badly damaged by a fire in 1786.

"Through careful preservation about 50% of the text was recovered but the brittle wrinkled parchment remained an intractable obstacle for further work.", she said, "After extensive physical preservation work, the UCL [University College London] team was able to virtually un-wrinkle the pages. "

She continued, "About 90% of the text of The Great Parchment Book is now readable and available for examination online as images of the leaves, enhanced images, or transcription of the text. In both of these cases, digitization makes available objects for study that would otherwise be restricted either because they're too fragile to handle or they're too dispersed to work with."

After a short break, Micah Altman (@drmaltman), the Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, formally announced the 2014 National Digital Stewardship Agenda and gave a brief overview of the document.

Describing why such a document is drawn up, he said, "Effective digital stewardship is vital for maintaining authenticity of public records...and information on how to do it, what to do, and what's going on is distributed across practice, research, sectors, disciplines, communities of practice. There's a diversity of perspectives in organizations that are involved. So that sort of sounds like us." More on the reasoning for the document can be seen in the full video.

Following Micah, Leslie Johnston (@lljohnston) of Library of Congress introduced the next panel, titled "Creative Approaches to Content Preservation", which "is only a panel in name," she stated, noting that the format was closer to a series of presentations with subsequent questions to the group than to a traditional panel.

Anne Wootton (@annewootton), one half of Pop-Up Archive (the other being Bailey Smith (@baileyspace)), started the panel by first describing her organization and then "starting with the tale of an archive," referencing the Kitchen Sisters and her organization's work with them after being approached with an "archival crisis". The Sisters had been working in public radio for decades, recording thousands of hours of sound, and had these recordings stored on a variety of media in a variety of places.

For their Master's thesis at Berkeley, Anne and Bailey surveyed the digital archiving and public media ecosystems to see if they could identify a solution that would meet the Kitchen Sisters' needs while keeping in mind their limited resources, workflow, and lack of technical proficiency. "We saw the need for an inexpensive tool," she said, "that could be used by oral history, archives, and media creators alike to store and/or create access to their materials safely and make it discoverable in a way that would be standardized with their industries." Their initial efforts were in creating plugins for Omeka.

Travis May of the Federal Reserve Bank of Saint Louis followed Anne, describing his work on FRED, an economic database with over 83,000 economic series from 57 different sources with a majority of the data coming from the United States.

Cal Lee of the University of North Carolina at Chapel Hill was up next with "Taking Bitstream Seriously". "The category of dealing with everything we get is pretty large...relates to trying to be a little bit more systematic to try to deal with messy situations where we get this kind of media," he said after referencing existing systematic tools for transfer like BagIt. Cal went on to describe his project, BitCurator (funded by the Andrew W. Mellon Foundation), which is soon headed into phase II. "The main goals are to develop and disseminate a package and support open source tools that can help people apply digital forensics methods," he said. "There are two main things that aren't traditionally addressed by the digital forensics field itself: building these things into library/archival workflows and supporting provisioning of public access to data."

The famed Jason Scott (@textfiles) of Archive Team took the stage next in patriotic attire (and his trademark black hat) and began, "I am the harbinger of death. I am the angel of death. I am the sad grim reaper that sits at the crossroads of your lost and dying dreams. I am the boatman on the River Styx who takes your hard drive from you and rides with you across the river to your utter destiny." He continued, humorously, "When the handshakes no longer happen and when the smiles fade - that's where I am living. I am living in this world because I helped found something called Archive Team, archiveteam.org."

He went on to describe a few of Archive Team's recent projects, including saving all of Xanga (project page), for which he described progress indicating that the preservation won't end well, and Snapjoy (now down, project page), for which he had more hope due to its small number of users.

Jason emphasized that there are many online communities that are "real shifting sand" that have no guarantees or laws preventing them from going away. "The fundamental question", he said, "is 'Is an online presence a valid humanitarian concern?'".

"Unfortunately, we are now the victims of the 'brogrammer/journalist' complex, which has worked together to really convince us that the place to put all of our stuff is with people we don't know for reasons we don't know until they decide that they're done with us...or have they've sold it to Google.", he said. "We have three virtues within Archive Team: Rage, Paranoia, and Kleptomania. So basically, we're very angry about these things going away, we have an enormous paranoia about things that might go away at any given time, and we take everything as fast as we can."

Jason went on with further allusions and anecdotes about Archive Team and their projects but the video (below) does his presentation better justice.

With the panel complete, a series of lightning talks followed.

Lightning Talks

William Ying of ARTStor presented "ARTstor Shared Shelf Preservation Plan Based on the NDSA Levels of Digital Preservation".


Abbie Grotke (@grotke) of Library of Congress presented "Content Working Group Case Studies"


Kim Schroeder of Wayne State University presented "Realities of Digital Preservation — What Are the Concerns and the Practice?"


David Brunton of Library of Congress presented "The Importance of Being Developers"


Cathy N. Hartman of University of North Texas presented "International Internet Preservation Consortium: Update"


Patrick Loughney of Library of Congress presented "The Library of Congress National Recording Preservation Plan"


Christina Drummond of University of North Texas presented "Your VO 'Lab' Results Are in: What NDSA Members Think of the NDSA"


Yvonne Ng (@ng_yvonne) of WITNESS presented "The Activists' Guide to Archiving Video"


With Yvonne closing up the lightning talks, Barrie Howard of NDIIPP excused the audience from the first day and encouraged everyone to view the poster session just outside of the main presentation room.



Day 2

Lisa Green (@boudicca), Director of Common Crawl (@commoncrawl), started off Day 2 (with an introduction by Bill LeFurgy) with her presentation "Digital Preservation for Machine-Scale Access and Analysis". Citing Hilary's work from Day 1, she said, "By machine scale I believe we have to be doing digital preservation in such a way that enables us to do data science on information we're preserving."

Lisa continued by giving a history of how we have moved from archiving hard-bound information to machine-readable information. "By the end of the 20th century, we had significantly increased our storage capacity. At this point, we were able to store and move around megabytes of data very easily. This was about the time that some of the really forward thinking people at Library of Congress started thinking about digital preservation. We can store so much information now that we needed a new unit to even wrap our heads around the amount of storage we have: a 'Library of Congress' worth of information."

She continued, describing how rare books are set up for display without being accessible (e.g., behind glass), and juxtaposed them with Google Books and the Google Ngram Viewer, in how the latter do not necessarily give direct access to the information in the former. "We're not building a time capsule here. We're not putting things away so that they're safe for future generations and maybe we take a peek at them now and then." She then cited the Library of Congress' mission statement:
The Library's mission is to make its resources available and useful to the Congress and the American people and to sustain and preserve a universal collection of knowledge and creativity for future generations.

"To me, the first part is the most important part - To make its resources available and useful. What good is collecting all of the information if we're not pushing forward the boundaries of human knowledge? I would propose that some efforts in digital preservation are focused a little too much on the second part, to preserve and sustain, and not enough on the available and useful.", she said.

Emily Gore (@ncschistory) of the Digital Public Library of America followed Lisa. She spoke of the various partners that have contributed data to the organization and said that "we free our data. Our data is your data. Our partners' data is your data. You can download the complete repository of data you've given us. Do with it what you will," noting that, by default, partners' data is under a CC0 license.

As with the first day of the conference, a panel followed titled "Green Bytes: Sustainable Approaches to Digital Stewardship" with an introduction by Erin Engle (@erinengle).

Green Bytes Panel

David Rosenthal of Stanford University

Kris Carpenter of Internet Archive

Krishna Kant of George Mason University and the National Science Foundation

With the completion of the panel, the crowd was given a half hour to preview the workshops to follow. The five workshops/sessions occurred simultaneously with five different topics:

Workshops/Sessions

While the other sessions were very relevant to our interests at WS-DL, I presented at the Web Archiving session and so cannot give an account of them.

The Tools of the Trade: The Library of Congress Perspective session contained presentations titled "World Digital Library" by Sandy Bostian of Library of Congress; "Jukebox" by Sam Brylawski of the University of California, Santa Barbara; and "Congress.gov" by Andrew Weber (@atweber) of the Law Library of Congress.

The Digital Curation Education and Curriculum session started with the first presentation, titled "National Digital Stewardship Residency Program", by Kris Nelson of Library of Congress, Bob Horton of IMLS, Andrea Goethals (@andreagoethals) of Harvard University, Jefferson Bailey (@jefferson_bail) of Metropolitan New York Library Council, and Prue Adler of Association of Research Libraries. The second presentation of the session was titled "Closing the Digital Curation Gaps: Getting Started Guide" by Helen Tibbo of UNC at Chapel Hill.

The Digital Preservation Tools session contained presentations titled "WGBH Media Library and Archives" by Karen Cariani of WGBH and "DSpace and Fedora Commons: A Comparison of Projects" by Wayne State University students.

The Managing Software Projects session was more panel-like, with David Brunton of Library of Congress, Lisa LaPlant of GPO, and Daniel Chudnov of George Washington University Libraries, moderated by Kate Zwaard (@kzwa) of Library of Congress.

Post-Panel Q&A

After the simultaneous session, the crowd was excused for lunch, where the NDSA Innovation Awards were presented by Jefferson Bailey (@jefferson_bail).

Following lunch was a short break then another set of simultaneous workshops and sessions.

The Web Archiving session contained presentations titled "WARCreate and WAIL" by Mat Kelly (@machawk1) of Old Dominion University and "DuraCloud and Archive-It Integration: Preserving Web Collections" by Carissa Smith of DuraCloud.

The Digital Preservation Services session contained presentations titled "Digital Preservation Network" by David Minor of UCSD and "Integrating Repositories for Research Data Sharing" by Stephen Abrams of UCC.

The Graduate Curriculum in Digital Preservation session was panel-like involving Jane Zhang of Catholic University, Anthony Cocciolo of Pratt Institute, Kara Van Malssen of AudioVisual Preservation Solutions and Jefferson Bailey of Metropolitan New York Library Council.

The Digital Stewardship Tools from the Library of Congress session contained presentations titled "EDeposit and DMS" by Anupama Rai and Laura Graham of Library of Congress, "NDNP/ChronAm" by David Brunton of Library of Congress, and "Viewshare" by Camille Salas of Library of Congress.

The Project Pitching session required prior sign-up and involved three funding agencies: Institute of Museum and Library Services, National Historical Publications and Records Commission, and National Endowment for the Humanities.

The final session of the day was another panel, titled "Innovative Approaches to Digital Stewardship".

Innovative Approaches to Digital Stewardship

Amy Robinson of EyeWire.

Rodrigo Davies of MIT Center for Civic Media

Aaron Straup Cope of Cooper-Hewitt Museum Labs

After Aaron's presentation, the three presenters fielded questions.

With the completion of the panel, the conference wrapped up and the crowd was adjourned.

— Mat

Thursday, July 25, 2013

2013-07-26: ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2013




While Hany SalahEldeen and I took time on Monday to ready our presentations, Scott Ainsworth and Yasmin AlNoamany presented at the Doctoral Consortium. Scott presented his research on temporal drift in the archives, and Yasmin presented her work on creating a story from mementos. Their presentations (and the doctoral consortium) are discussed in more detail in their blog posting.

Day 1

After opening remarks from J. Stephen Downie and Robert H. McDonald, Clifford Lynch gave the opening keynote of the conference entitled "Building Social Scale Information Infrastructure: Challenges of Coherence, Interoperability and Priority."

Lynch posed a series of questions that are influencing research areas in digital libraries. To begin, he mentioned that a number of systems could be considered digital libraries, such as national security systems tracking people and actions, or health-care systems tracking patients. The big challenge we are facing is how to think about massive enterprise systems and prioritize activities in such large, community-controlled environments. PubMed provides an example of a canonical collection of literature in a discipline, but our current models are institutionalized by publisher or collection topic as opposed to an entire discipline.
The next discussion topic centered on the notion of getting physical objects that may exist in peoples' homes or private collections into digital libraries. Additionally, what does it mean to control the access rights to this content? He referenced Herbert Van de Sompel's presentation on data in archives, and the idea that data should exist only in the context of content and creator, not archive and curator.

Lynch followed this up by mentioning that we have no good ways to assess the health of the stewardship environment. He touched on the need to assess how much of the web is archived, how much of the web is discoverable and archivable, and how well we are capturing the target content (the last two of which are discussed in my WADL presentation). We have worked or are currently working on each of these questions in the WSDL group.

Finally, Lynch closed with his position that digital stewardship is becoming an engineering problem, and should be treated as such with appropriate risk management, modeling and simulation, and business models (such as those presented by David Rosenthal). These engineering systems will reflect the values of the discipline by the policies put in place (such as privacy, access rights, and collection replication).

My colleagues and I went to our first round of paper presentations – Preservation I – to support session chair and WSDL alumnus Martin Klein (currently at Los Alamos National Laboratory Research Library) and our current student Scott Ainsworth.

The first paper of the session was from Ivan Subotic entitled A Distributed Archival Network for Process-Oriented Autonomic Long-Term Digital Preservation. Ivan proposed an archival scenario and the associated requirements for a digital preservation system. The distributed archival system, DISTARNET, was proposed and a prototype centering around object containers was presented by Ivan during his talk.

Scott presented his work on Evaluating Sliding and Sticky Target Policies by Measuring Temporal Drift in Random Walks Through a Web Archive. His paper was also nominated as a candidate for best student paper. Scott discussed the temporal drift experienced by users when navigating (or walking) between mementos in the archives, and how this drift changes depending on walk length. The drift is greatly reduced when MementoFox is utilized by the user during the walk.



Kyle Rimkus and Tom Habing tag-team presented their paper Medusa at the University of Illinois at Urbana-Champaign: A Digital Preservation Service Based on PREMIS and gave a brief description of the Medusa project. Medusa facilitates the movement between archival platforms. Their presentation discussed the implementation of PREMIS, which allows objects to form relationships within the archives.

WSDL alumnus Frank McCown presented the paper he wrote with first author Richard Schneider entitled First Steps in Archiving the Mobile Web: Automated Discovery of Mobile Websites. Frank discussed the difference between the mobile and desktop versions of the web, and how crawlers can detect or discover the URI of a mobile version of a site (if it exists).

After lunch, I attended a panel discussion on Managing Big Data and Big Metadata with Michael Khoo, Stacy T. Kowalczyk, and Matthew W. Mayernik. Khoo kicked the panel off with a brief introduction of the big data discipline and posed the question that the panel would discuss: "how can digital library research inform big data and big metadata?" Kowalczyk discussed "HathiTrust Research Center: Big Data for Digital Humanities." Her talk mentioned that terms used in big data are not well defined, and that digital libraries (Google, Internet Archive, etc.) are examples of specialized big data collections. Khoo raised a few more points about management and interoperability of big data between federated institutions. Mayernik presented "Managing Big Data and Big Metadata: Contributions from Digital Libraries." His work is more theoretical and focuses on how to create collections for sharing. He described digital libraries as technologies and sociological institutions for which convergence is not inevitable. That is, these institutions implement different policies for managing different types of big data that are not necessarily compatible.

A particularly intriguing question asked for differences between big data in industry and in the libraries. One difference is the user base: industry has specialized users and treats data as part of the process, while libraries must accommodate a wider variety of users and treat data as the goal (to create collections).

A second question asked if there was a difference between "big" and "much" data. Big data is not suitable for traditional processing, querying, scientific research, etc. while "much" data is more suited to traditional handling.

The session completed with a long discussion on what constitutes data, and how sampling techniques can be used to reduce "big" data to "small" data while learning almost as much from the sets.

In the last session of the day, I attended the Information Clustering session (in no small part because of the two Best Paper Nominee presentations).

Weimao Ke presented his paper Information-theoretic Term Weighting Schemes for Document Clustering, a Vannevar Bush Best Paper Award Nominee. Ke discussed methods for extracting information from documents in a collection, drawing additional information from them, and clustering the documents based on the information retrieved. The proposed LIT method produces the best clustering results with k-means clustering, and can produce better results than TF/IDF.

Kazunari Sugiyama presented his paper Exploiting Potential Citation Papers in Scholarly Paper Recommendation, another Vannevar Bush Best Paper Award Nominee. Sugiyama discussed the relationships between citation and reference papers and how they can be used to recommend academic papers to authors. He also used fragments (such as the abstract or conclusion) to refine the recommendations.

Peter Organisciak presented his paper Addressing diverse corpora with cluster-based term weighting. Heterogeneous language in a corpus can be problematic -- Inverse Document Frequency (IDF) becomes less valuable when texts are from mixed domains, time periods, or are multilingual. Classifying or clustering documents helps increase the value of IDF. Organisciak used his method to show that the English language has changed over time.

Xuemei Gong presented her paper Interactive Search Result Clustering: A Study of User Behavior and Retrieval Effectiveness. She showed that a scatter/gather system is more difficult to use than a classic search engine interface, but that scatter/gather is more useful, and can improve user learning.

Day 2

Our second day at JCDL began with a keynote from Jill Cousins entitled "Why Europeana?"
Cousins discussed the value -- culturally and monetarily -- of Europeana, which began as a method of reflecting the diversity of the European web. With the help of activists, it facilitates data aggregation, distribution, and user engagement. Europeana constructed an aggregation infrastructure of digital libraries to deliver archival data (an infrastructure that now spans more than 2,300 content providers).

Europeana's distribution of the archival material was a challenge due to licensing concerns by content owners. However, Europeana offers an API utilized by 770 organizations and several services that deliver content.

Current operational challenges at Europeana include multilingualism -- the site cannot take queries in multiple languages and effectively return the requested information. Additionally, engaging users is a continuing challenge, particularly with the goal of drawing traffic in the same order of magnitude as Wikipedia.

Cousins outlined three impacts of Europeana. The first is Europeana's support of economic growth in that the cultural material is being used to improve other services. The second is that Europeana connects Europe and the rest of the world through community engagement and heritage services. The third is making cultural material available to everyone.

Budget cuts have severely reduced Europeana's ability to effectively deliver on these impacts, so they have developed new goals to improve economic return on investment and to find alternate funding sources. One proposed source is the Incubator, a research think tank that will support startups. The main source of revenue is service-oriented offerings within the impacts, such as enabling industry, providing a license framework, or incubation services.
The next steps for Europeana are to receive support from the governmental bodies and become self-sufficient by 2020.

The first session I attended was Specialist DLs, moderated by Michael L. Nelson.

Annika Hinze presented her paper Tipple: Location-Triggered Mobile Access to a Digital Library for audio book. This work (embodied as a mobile application) allows location-specific sections in narrative books to be matched with the current location of a reader. For example, a book chapter set in the Hamilton Gardens would play when the reader is walking through the Hamilton Gardens. Hinze presented evaluations of the software.

Paul Bogen presented his paper Redeye: A Digital Library for Forensic Document Triage. Redeye helps an undisclosed sponsor filter relevant information on targets from noise in a large corpus of scanned documents. The system uses entity extraction, cross-linking, and machine translation and utilizes an ingestion pipeline, a repository, and a workbench.

Katrina Fenlon presented her paper Local Histories in Global Digital Libraries: Identifying Demand and Evaluating Coverage. She discussed the results of a survey of libraries on the demand for historical topics of interest at different granularities, with local historical topics (as opposed to state or world granularities) being in the highest demand.

Laurent Pugin presented his paper Instrument distribution and music notation search for enhancing bibliographic music score retrieval. He presented an effort to inventory music scores that exist in multiple sources around the world. The main goal of this paper is to utilize the rich metadata of the set to provide the most effective search for the users.

The second session of day two was Web Replication.

Hany kicked off the session well with his paper Reading the Correct History? Modeling Temporal Intention in Resource Sharing. Hany discussed his continuing studies of the differences between what we mean to share over social media and what is actually observed.



In what was (in my clearly unbiased opinion) the best presentation of the day, I presented my study of Memento TimeMaps entitled An Evaluation of Caching Policies for Memento TimeMaps. In this work, I discussed the change patterns of TimeMaps which should intuitively be monotonically increasing but in practice, sometimes decrease in cardinality. I proposed and evaluated a caching strategy based on our observations to better serve Memento users while limiting the load on the archives.



Continuing the ODU dominance of this session, Martin presented his paper Extending Sitemaps for ResourceSync. He presented how ResourceSync uses Sitemaps as a resource list, including the extended information that ResourceSync uses to enhance Sitemaps.

Even as the odd-man out (having not attended ODU), Min-Yen Kan presented an exceptional paper written by Bamdad Bahrani entitled Multimodal Alignment of Scholarly Documents and Their Presentations. Their work generated an alignment map to match a conference proceedings paper content with the slides presented during the conference. The major contribution of this paper was the inclusion of visual content.

The session on Data was next.

Maximilian Scherer presented a paper entitled Visual-Interactive Querying for Multivariate Research Data Repositories Using Bag-of-Words. The authors presented a bag-of-words algorithm for discovering data in a repository.

Ixchel M. Faniel presented her paper on The Challenges of Digging Data: A Study of Context in Archaeological Data Reuse. This paper discusses the use and reuse of data of a specialized digital library for archaeologists, and the custom ontologies and reuse patterns that it implements.

Jesse Prabawa Gozali presented the paper Constructing an Anonymous Dataset From the Personal Digital Photo Libraries of Mac App Store Users. This presentation explored methods of retrieving information from photo collections and the types of information available to researchers.

Miao Chen presented the paper Modeling Heterogeneous Data Resources for Social-Ecological Research: A Data-Centric Perspective. This presentation discussed an ontology for representing and organizing information in a repository using sampling from an existing dataset.

The poster session rounded out day 2. Most importantly, Ahmed Alsum presented his poster ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph. Dr. Nelson's tweet (complete with YouTube video) captures his Minute Madness presentation perfectly.


Day 3

In the last session of the conference, we attended the Preservation II presentations.

Yasmin presented her paper entitled Access Patterns for Robots and Humans in Web Archives which discussed how we can use archive access logs to determine the frequency with which robots and human users access the archives with the intention of better serving all users.



Krešimir Đuretec presented a paper he wrote with Christoph Becker entitled Free Benchmark Corpora for Preservation Experiments: Using Model-Driven Engineering to Generate Data Sets. This work provides a method for automatically generating a corpus that serves as an alternative to Govdocs and provides a ground-truth dataset for use in benchmarking.

Hendrik Schöneberg presented his paper A scalable, distributed and dynamic workflow system for digitization processes. This work presents a workflow for quickly processing large amounts of images and image data (from scans of manuscripts) for curation and management.

In a paper that I assume was close to Dr. Nelson's heart, Otávio A. B. Penatti presented a paper he wrote with Lin Tzy Li entitled Domain-specific Image Geocoding: A Case Study on Virginia Tech Building Photos. This work evaluated image descriptors for geocoding, then inferred and assigned a physical, geographical location to digital objects (in this case, photos of Virginia Tech campus buildings) for placement on a map.

To close out the conference, David De Roure gave his keynote entitled "Social Machines of Science and Scholarship."

De Roure discussed how scientific information and publications have changed over time, and how "the paper" can be improved as a way to disseminate scientific findings. The first failure of the paper is that the data cannot be put into the paper -- the container is inappropriate. The second failure is that an experiment cannot be reconstructed based on a paper alone. The third failure is that publications are becoming targeted at increasingly specialized audiences. The fourth failure is that research records are not [natively] machine readable. The fifth failure is that authorship has moved from single scientists to potentially thousands of researchers (that is more suitable to Hollywood-style credits). The sixth failure is that quality control is not able to keep up with publication speed. The seventh failure is that regulations force specific reporting. The eighth and final failure is that researchers are frustrated by increasing inefficiencies in scholarly communications.

He followed up these failures by discussing how scientists operate  today. This included the relationship between the increasing data, computation, storage, and people involved in research.

De Roure also spoke about how automation, data collection, and other aspects of science are changing as technology is changing along with the world around it. He also mentioned that with the increase in data, we also have an increase of methods to work with it.

De Roure's notion of people and papers as knowledge objects that operate as a linked-data system is interesting. These objects can be exchanged, reused, and handled in workflows similarly to the web objects with which we are more familiar. Coming to the point of the talk, he also discussed examples of successful social machines (such as reCAPTCHA and Wikipedia) that evolve with society -- or more specifically, as a result of society. With the increase in social machines, they've started forming their own, larger social machine ecosystems by interacting. These social machines that we help design are supporting and enabling the research environments in which we operate.

We closed out our conference by attending the Web Archiving and Digital Libraries (WADL) workshop. Notes on the workshop will appear in a future blog posting.



We had a wonderful time visiting Indianapolis, learned a lot, and gathered great ideas to incorporate into our current and future works. We look forward to next year's DL2014 conference (a joint JCDL and TPDL conference) which was announced to be in London.








--Justin F. Brunelle




2013-07-22: JCDL 2013 Doctoral Consortium

The JCDL 2013 Doctoral Consortium is a workshop for Ph.D. students from all over the world who are in the early phases of their dissertation work.  Students present their thesis and research plan and a panel of prominent professors and experienced practitioners in the field of Digital Libraries provides feedback in a constructive atmosphere.  Yasmin AlNoamany and Scott Ainsworth had the privilege of presenting papers at this year's Doctoral Consortium.

Scott Ainsworth, Michael Nelson, & Yasmin AlNoamany

User Interaction

The first session focused on user interaction and was chaired by George Buchanan.  The session began with Erik Choi presenting his work on understanding the motivations behind the questions users ask in Internet Q&A forums.  Prior work in this area has focused on the use and content of Q&A forums; Erik's work focuses on why users ask questions, examining motivation, expectations, and the relationship between them.

Yasmin AlNoamany presented her work on using web archives to enhance the permanence of web storytelling.  Existing sites such as Storify allow users to create stories, but the web is inherently ephemeral and the stories degrade as web content is lost or moved.  Yasmin's work uses Memento and web archives to add stability to storytelling content.  Yasmin's slides are below.



Most existing user-profiling techniques produce a monolithic view of the user.  Users, on the other hand, use the web for many tasks—in essence changing the profile as they switch from task to task.  Chao Xu's work focuses on detecting these task changes in order to respond to the user's current needs.

Network Analysis

The network analysis session was chaired by Xiao Hu and included two presentations.  The Topical and Weighted Factor Graph (TWFG) was proposed by Lili Lin as a way to determine topic expertise within scholarly communities.  TWFG combines topic relevance and expert authority (determined using PageRank) into a single score to enable expert finding (akin to a Google search).

Given the high volume of scholarly articles available electronically, it seems natural for digital libraries to assist in the process of discovering relevant work.  Zhuoren Jiang proposed a system that will incorporate topical changes in publications over time to enhance the computer-aided discovery process.

Data and Archiving

Richard Furuta chaired the session on data and archiving.  First up was Xin Shuai, who is studying the effects social networks (Twitter, Facebook, etc.) have on the dissemination and impact of scholarly information on society.

Su Inn Park proposed PerCon, a digital library system that will allow management and analysis of diverse but related datasets.  Ultimately, PerCon will use an agent-based, mixed-initiative user interaction model.

The third presentation was by Scott Ainsworth, who is studying the temporal coherence of existing data in web archives (e.g., the Internet Archive).  The goal of this work is to characterize existing web archive content and to produce browsing and recomposition heuristics that implement user priorities (e.g., speed, accuracy).  The slides for this presentation are below.


Social Factors

The final session addressed social factors and was chaired by Ingeborg Sølvberg.  Social Digital Libraries, digital libraries that include significant social features, were addressed by Adam Worrall.  Adam's particular focus is the role digital libraries play in collaboration, communities, and other social contexts.

Nathan Hall presented his exploratory study on faculty attitudes and socio-technical factors that affect scholarly communication and data sharing practices.  The study used a phenomenological approach to examine faculty attitudes toward institutional repositories.

The final presentation was by Jose Antonio Olvera.  Jose is applying computational intelligence to self-preserving digital objects.  Powered by a social network, this approach will enable a "preservation is to share" paradigm.
After the presentations, Sally Jo Cunningham summarized the panel's comments:
  • Focus: pick 1 problem and stick to it.
  • Communication: develop your elevator pitch; this will help you communicate with others and help you stay focused.
  • Audience: who is going to use your work and what are they going to do with it?
  • Evaluation: how will you know when you are done? How will you know if you made things better?
  • Plan your work and work your plan.
We would like to thank the Doctoral Consortium co-chairs, Sally Jo Cunningham and Edie Rasmussen, and the many reviewers and panel members for their time and valuable feedback.

—Scott G. Ainsworth




Monday, July 22, 2013

2013-07-15: Temporal Intention Relevancy Model (TIRM) Data Set

On the third anniversary of the Haiti earthquake, President Barack Obama held a press conference and discussed the need to keep helping the Haitian community and to invest more in rebuilding the economy. A user who was watching the press conference tweeted about it on the 14th of January and provided a link to the streamed news. A couple of days later, when I read this tweet and clicked on the link, instead of seeing anything related to the press conference, Haiti, or President Obama, I got a stream feed of the Mercedes-Benz Superdome in New Orleans in preparation for the 2013 Super Bowl. It is worth mentioning that at the time of writing this blog the tweet above had actually been deleted, proving that social posts don't persist throughout time, as we discussed in our earlier post.

This scenario illustrates the problem we are trying to detect, model, and solve: the inconsistency between what is intended at the time of sharing and what the reader sees at the time of clicking the link in the tweet.
It is evident that resources change, relocate, or even disappear. In some cases this is tolerable, but in others it is not, especially when the shared content is significantly important (e.g., related to a revolution, protest, or corruption claims).

From these observations we decided to perform experiments to detect and model this "user intention" of the author at the time of tweeting and measure how accurately it is perceived by the reader at any point in time. In our JCDL 2013 paper, we deduced that the problem of intention is not straightforward and in order to correctly model it a mapping should be performed to transform the intention problem to a relevancy and change problem. 

We initially utilized Amazon's Mechanical Turk in a direct manner to collect data from workers about intention; unfortunately, this approach produced very low inter-rater agreement.
After a closer look at the most popular tasks on Mechanical Turk, we found out that categorization and classification problems are the most prominent. The questions asked of the workers are simpler and require far less explanation.

We introduce the Temporal Intention Relevancy Model (TIRM) to illustrate the mapping between intention and relevancy. Let's consider the following tweet from Pfizer.  The tweet has a link which leads to a newsletter that is updated with the latest announcements of the company.

At any point in time this page is still relevant to the tweet; thus we can deduce that the intention behind posting this tweet is to check whatever the current state of the page is. In other words, if the page has changed from its initial state at the time of tweeting and it is still relevant, we can assume the intention is: current state.

Similarly, we notice a different pattern upon inspecting a tweet posted on the day Michael Jackson died and linking to CNN.com. The front page of CNN.com has definitely changed since the time of the tweet and the content is no longer relevant to the tweet.
Thus, the author's intention was for the reader to see the state of the page at the time he tweeted about it. In conclusion, if the page changed and is no longer relevant to the tweet we can assume that the author's intention is: past state of the resource. So, we dig it up from the web archives.

In a large number of social posts the resource remains unchanged and still relevant to the post. In this case we assume that the intended state is the state of the resource at the point in time when the author published the post, but since it is unchanged a current version will do as well.
Finally, when the resource has changed and has never been related to the post, we do not have enough information to decide which intention the author wanted to convey. This scenario happens often in spam posts.
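Taken together, the four cases above form a small decision procedure over what we can observe about the linked resource: whether it has changed since the post and whether it is (or ever was) relevant to it. Below is a minimal sketch of that mapping in Python; it is only my illustration of the rules described above, while the actual model in the paper is a classifier trained on many more features.

def tirm_intention(changed, relevant_now, relevant_when_posted=True):
    """Infer the author's temporal intention from observable signals.

    Illustrative sketch of the TIRM mapping described above, not code
    from the paper.
    """
    if not changed and relevant_now:
        # Unchanged and still relevant: the past state is what the author
        # saw, but a current version will do just as well.
        return "past state (current version acceptable)"
    if changed and relevant_now:
        # Changed but still relevant: the author likely wanted the reader
        # to see whatever the current state of the page is.
        return "current state"
    if changed and not relevant_now and relevant_when_posted:
        # Changed and no longer relevant: the author intended the past
        # state, so we dig it up from the web archives.
        return "past state (consult the archives)"
    # Never related to the post: not enough information to decide;
    # this pattern shows up often in spam posts.
    return "undecided (possible spam)"

print(tirm_intention(changed=True, relevant_now=False))  # -> past state (consult the archives)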


We use Mechanical Turk to collect the training data for our model, along with multiple features related to the social post, such as its nature, archivability, social presence, and the resource's content.


The resulting dataset was utilized in extracting 39 different textual and semantic features that were used to train a classifier to implement the TIRM. We argue that this gold-standard dataset will pave the way for future temporal intention-based studies. Currently, we are extending the experiments and refining the utilized features.

For further details, please refer to the paper:

Hany M. SalahEldeen, Michael L. Nelson. Reading the Correct History? Modeling Temporal Intention in Resource Sharing. Proceedings of the Joint Conference on Digital Libraries JCDL 2013, Indianapolis, Indiana. 2013, also available as a technical report http://arxiv.org/abs/1307.4063

- Hany SalahEldeen

Monday, July 15, 2013

2013-07-15: Wayback Machine Upgrades Memento Support

Just over a week ago, the Internet Archive upgraded their support for Memento in the Wayback Machine.  The Wayback Machine has had native Memento support for about 2.5 years, but they've just recently implemented a number of changes and now the Wayback Machine and version 08 of the Memento Internet Draft are synchronized.  The changes will be mostly unseen by casual users, but developers will appreciate the changes that should make things even simpler.  Perhaps even more importantly, these changes have been reflected in the open source version of the Wayback Machine, so the numerous sites that are running this software (for example, see the IIPC member list) should enjoy native Memento support upon their next upgrade.

The first and most significant change is that there is now just a single URI prefix for mementos (URI-M).  Previously, the URI-M discovered through the Wayback Machine's UI was different from the URI-M discovered through the Memento interface (e.g., using the MementoFox add-on).  For example, for the original resource thecribs.com (@ 2003-09-30) you used to have both:

Wayback UI: http://web.archive.org/web/20030930231814/http://www.thecribs.com/

Memento: http://api.wayback.archive.org/memento/20030930231814/http://www.thecribs.com/

(The second URI is not linked; the api.wayback.archive.org/* interface is now turned off and those URIs now produce 404s.)

The problem was that web.archive.org URIs rewrote the URIs in the HTML to point back into the archive (i.e., "Archival Replay Mode"), but lacked the necessary Memento-Datetime and Link HTTP response headers. The api.wayback.archive.org URIs had the necessary HTTP response headers, but lacked the rewritten HTML for Archival Replay Mode.  So while both types of URIs (web.archive.org and api.wayback.archive.org) worked in their respective environments, a Memento user could not share (via email, Twitter, etc.) an api.wayback.archive.org URI with a non-Memento user, and likewise a Memento user would not have the additional Memento functionality with a web.archive.org URI.

Long story short: a single URI does it all now:
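For example, here is a quick sanity check in Python (a sketch using the third-party requests library, which is assumed to be installed) against the thecribs.com URI-M above; the single web.archive.org URI now returns the Memento response headers alongside the rewritten archival replay HTML:

import requests

# The single web.archive.org URI-M now serves both roles: rewritten HTML
# for archival replay plus the Memento-Datetime and Link response headers.
uri_m = "http://web.archive.org/web/20030930231814/http://www.thecribs.com/"

response = requests.head(uri_m, allow_redirects=True)
print(response.headers.get("Memento-Datetime"))  # archival datetime of this memento
print(response.headers.get("Link"))              # original, timegate, timemap, and neighboring mementos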



Never noticed the dual URI thing?  That's fine; neither did most other people.  I included the above details only to document how things used to work in case you run across an old-style api.wayback.archive.org URI.  Otherwise, don't worry about it.

The URI merger also changes the base URIs for the TimeMaps and TimeGates:

http://web.archive.org/web/timemap/link/{URI-R}
http://web.archive.org/web/{URI-R}
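As a sketch of how a client might exercise these endpoints (again Python with the requests library, and www.cnn.com standing in as the URI-R), datetime negotiation against the TimeGate uses the Accept-Datetime request header:

import requests

uri_r = "http://www.cnn.com/"  # an original resource (URI-R)

# TimeMap: the list of known mementos for the URI-R in application/link-format.
timemap = requests.get("http://web.archive.org/web/timemap/link/" + uri_r)
print(timemap.text[:300])

# TimeGate: ask for the memento closest to a desired datetime.
timegate = requests.get(
    "http://web.archive.org/web/" + uri_r,
    headers={"Accept-Datetime": "Tue, 01 Jan 2013 00:00:00 GMT"},
    allow_redirects=False,
)
print(timegate.status_code)              # typically a redirect status
print(timegate.headers.get("Location"))  # the selected URI-M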

The second change that may impact people is that TimeMaps now support paging.  The page size is large (currently 10,000), but popular sites like www.cnn.com have > 14,000 mementos.  Instead of having explicit "page 1", "page 2", etc., paged TimeMaps now have a "self" link with "from" and "until" parameters to indicate the left-hand and right-hand temporal endpoints, respectively, for this TimeMap.  It then links to the next TimeMaps with a "from" parameter to indicate the left-hand temporal endpoint of the next page (the "until" value might not be known if the last page is still being "filled", so to speak).  It is easier to look at the example:



Together, the multiple pages form a single logical TimeMap and the pages are only for convenience of transport.  The server determines how many links go into a single page.  Most TimeMaps have < 10,000 URI-Ms so you might not notice this change right away, but please be aware that your applications can no longer assume they're getting the entire TimeMap with a single HTTP GET.
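A minimal paging-aware client might therefore look something like the sketch below (Python with requests). It assumes, per the description above, that additional pages are advertised as further rel="timemap" links, and the regular expression is only a stand-in for a proper link-format parser:

import re
import requests

def fetch_full_timemap(timemap_uri):
    """Collect every page of a (possibly paged) TimeMap.

    Sketch only: additional pages are assumed to appear as further
    rel="timemap" links carrying "from"/"until" attributes.
    """
    seen, queue, pages = set(), [timemap_uri], []
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        body = requests.get(uri).text
        pages.append(body)
        for target, rel in re.findall(r'<([^>]+)>;\s*rel="([^"]*)"', body):
            if "timemap" in rel:
                queue.append(target)
    return pages

pages = fetch_full_timemap(
    "http://web.archive.org/web/timemap/link/http://www.cnn.com/")
print(len(pages), "page(s) retrieved")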

The third change is about defining a standard way for the archive to tell the client "this is not a memento, so do not attempt memento processing on it"*.  This is new in section 4.5.8 of version 8 of the Internet Draft.  The idea is that most of the resources embedded in, for example, http://web.archive.org/web/20030930231814/http://www.thecribs.com/ are mementos captured at some point in the past.  However, some of the images, javascript, etc. are injected by the archive to assist in playback and are not actual mementos and thus the client should not attempt negotiation on those resources.  Rather than having clients maintain regular expressions for what is and what is not a memento at various archives, the server can now just send back this HTTP response header:

Link: <http://mementoweb.org/terms/donotnegotiate>; rel="type"

Here is the full HTTP response for http://web.archive.org/static/js/jwplayer/jwplayer.js, a javascript file injected into the archived HTML to assist in the archival playback:



If you study the HTTP responses for both http://web.archive.org/web/20030930231814/http://www.thecribs.com/ and http://web.archive.org/static/js/jwplayer/jwplayer.js, you will see that the former has "X-Archive-Playback: 1" and the latter has "X-Archive-Playback: 0".  In summary, section 4.5.8 of the Internet Draft just standardizes the current "X-Archive-Playback: 0" header with a Link header that is applicable to all kinds of Memento archives (and not just Wayback Machines).
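A client-side check for the new relation could then be as simple as this sketch (Python with requests; the substring test stands in for real Link-header parsing):

import requests

DO_NOT_NEGOTIATE = "http://mementoweb.org/terms/donotnegotiate"

def should_negotiate(uri):
    """Return False when the archive marks a resource as not-a-memento."""
    headers = requests.head(uri, allow_redirects=True).headers
    return DO_NOT_NEGOTIATE not in headers.get("Link", "")

# Injected playback asset: expected to carry the do-not-negotiate link.
print(should_negotiate("http://web.archive.org/static/js/jwplayer/jwplayer.js"))

# An actual memento: expected to be negotiable.
print(should_negotiate("http://web.archive.org/web/20030930231814/http://www.thecribs.com/"))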

We hope you will give the new Wayback Memento interfaces a test drive and let us know if you see any errors or have additional comments.  The new interfaces were integrated in the LANL and ODU aggregators last week, so if you are using those you should have seen a switch already.  We'd like to thank Ilya Kremer (IA) and Lyudmila Balakireva (LANL) for all of their feedback and efforts during this implementation and  Kris Carpenter (IA) for her continued support of Memento. 

--Michael





* or, if you prefer: "All these URIs are mementos except this one.  Attempt no negotiation there.  Use them together.  Use them in peace."

Wednesday, July 10, 2013

2013-07-10: WARCreate and WAIL: WARC, Wayback and Heritrix Made Easy

As the Web Science and Digital Libraries Research Group, we regularly interact with end users as well as developers who are interested in digital preservation. One of our goals is to assist in making web preservation accessible to regular users instead of just power users.  As computer scientists, we frequently pursue this goal by creating software. A few digital preservation software packages that were created by WS-DLers include:


Because shrimp, that's why.

  • Warrick - a utility for reconstructing (or recovering) a website using various archives and caches.
  • Synchronicity - a Firefox extension that supports the user in rediscovering missing web pages
  • mcurl - a command-line memento client
and two that are dear to my heart:
And other sea creatures

I had developed these two packages for JCDL2012 and PDA2013, respectively, with the former being given the Future Steward award from the National Digital Stewardship Alliance (NDSA) at Digital Preservation 2012.  While this was all swell for our group, one problem remained that again surfaced at PDA2013 in February, where WAIL (which was a spin-off of the WARCreate server decoupling) was unveiled. At PDA, I made sure well beforehand that WAIL was available to the public in a double-click-and-go binary (pre-compiled executable, i.e., App) form. While we keep all of the software we develop free and open source, WARCreate remained experimental and thus never "released", per se, though anyone could download the source and try it if they were really eager.

As noted above, we are technical people and, as we learned with WAIL, users are more willing to try software when the barriers (e.g., compiling from source) are minimized. With WARCreate getting its first reference citation, it was time to formally release the tool, in binary form, for public consumption - ready or not.

WARCreate is now available for download in the Chrome Web Store.
To use it:
  1. Enable WARCreate in Chrome
  2. Navigate to a webpage
  3. Click the WARCreate logo on the right of the address bar
  4. Hit the "Generate WARC" button
Within seconds, a Web ARChive (WARC) file will be created of the currently viewed webpage and saved to your downloads folder. Alternatively, WARCreate may crash or not behave 100 percent as expected, but I will gladly address bugs reported by e-mail or through GitHub issues, or you can confront me personally at Digital Preservation 2013 on July 24, 2013 in Alexandria, Virginia, where I will be presenting on WAIL and WARCreate. There are sure to be bugs; however, pre-release software is better than no-release software.

— Mat (@machawk1)