Thursday, January 15, 2015

2015-01-15: The Winter 2015 Federal Cloud Computing Summit



On January 14th-15th, I attended the Federal Cloud Computing Summit in Washington, D.C., a recurring event in which I have participated in the past. In my continuing role as the MITRE-ATARC Collaboration Session lead, I assisted the host organization, the Advanced Technology And Research Center (ATARC), in organizing and running the MITRE-ATARC Collaboration Sessions. The summit is designed to allow Government representatives to meet and collaborate with industry, academic, and other Government cloud computing practitioners on the current challenges in cloud computing.

The collaboration sessions continue to be highly valued within the government and industry. The Winter 2015 Summit had over 400 government or academic registrants and more than 100 industry registrants. The whitepaper summarizing the Summer 2014 collaboration sessions is now available.

A Government-only discussion of FedRAMP and the future of its policies was held at 11:00, before the collaboration sessions began.
At its conclusion, the collaboration sessions convened, with four sessions focusing on the following topics.
  • Challenge Area 1: When to choose Public, Private, Government, or Hybrid clouds?
  • Challenge Area 2: The umbrella of acquisition: Contracting pain points and best practices
  • Challenge Area 3: Tiered architecture: Mitigating concerns of geography, access management, and other cloud security constraints
  • Challenge Area 4: The role of cloud computing in emerging technologies
Because participants are protected by the Chatham House Rule, I cannot elaborate on the Government representation or discussions in the collaboration sessions. MITRE will continue its practice of releasing a summary document after the Summit (for reference, see the Summer 2014 and Winter 2013 summit whitepapers).

On January 15th, I attended the Summit, a conference-style series of panels and speakers with an industry trade show held before the event and during lunch. From 3:25 to 4:10, I moderated a panel of Government representatives from each of the collaboration sessions in a question-and-answer session about the outcomes of the previous day's collaboration sessions.

To follow along on Twitter, you can refer to the Federal Cloud Computing Summit Handle (@cloudfeds), the ATARC Handle (@atarclabs), and the #cloudfeds hashtag.

This was the fourth Federal Summit event in which I have participated, including the Winter 2013 and Summer 2014 Cloud Summits and the 2013 Big Data Summit. They are great events that the Government participants have consistently identified as high-value. The events also garner a decent amount of press in the federal news outlets and at MITRE. Please refer to the fedsummits.com list of press for the most recent articles about the summit.

We are continuing to expand and improve the summits, particularly with respect to the impact on academia. Stay tuned for news from future summits!

--Justin F. Brunelle

Saturday, January 3, 2015

2015-01-03: Review of WS-DL's 2014

The Web Science and Digital Libraries Research Group's 2014 was even better than our 2013.  First, we graduated two PhD students and had many other students advance their status:
In April we introduced our now famous "PhD Crush" board that allows us to track students' progress through the various hoops they must jump through.  Although it started as sort of a joke, it's quite handy and popular -- I now wish we had instituted it long ago. 

We had 15 publications in 2014, including:
JCDL was especially successful, with Justin's paper "Not all mementos are created equal: Measuring the impact of missing resources" winning "best student paper" (Daniel Hasan from UFMG also won a separate "best student paper" award), and Chuck's paper "When should I make preservation copies of myself?" winning the Vannevar Bush Best Paper award.  It is truly a great honor to have won both best paper awards at JCDL this year (pictures: Justin accepting his award, and me accepting on behalf of Chuck).  In the last two years at JCDL & TPDL, that's three best paper awards and one nomination.  The bar is being raised for future students.

In addition to the conference paper presentations, we traveled to and presented at a number of conferences that do not have formal proceedings:
We were also fortunate enough to visit and host visitors in 2014:
We also released (or updated) a number of software packages for public use, including:
Our coverage in the popular press continued, with highlights including:
  • I appeared on the video podcast "This Week in Law" #279 to discuss web archiving.
  • I was interviewed for the German radio program "DRadio Wissen". 
We were more successful on the funding front this year, winning the following grants:
All of this adds up to a very busy and successful 2014.  Looking ahead to 2015, in addition to continued publication and funding success, we expect to graduate one MS and one PhD student and to host another visiting researcher (Michael Herzog, Magdeburg-Stendal University). 

Thanks to everyone that made 2014 such a great success, and here's to a great start to 2015!

--Michael





Saturday, December 20, 2014

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes its CMS the old links break, forcing Jon to manually refind each article's new URI and update his page.  For example, the old URI:

http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html

is currently:

http://www.infoworld.com/article/2660595/application-development/xquery-and-the-power-of-learning-by-example.html

The same content had at least one other URI as well, from at least 2009--2012:

http://www.infoworld.com/d/developer-world/xquery-and-power-learning-example-924

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles.  In fairness, InfoWorld is still redirecting the second URI to the current URI:



And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then quit; currently the original URI returns a 404:



Jon's approach is to give up on tracking the different URIs for his hundreds of articles and instead use a combination of metadata (title & author) and the "site:" operator submitted to a search engine to locate the current URI (side note: this approach is quite similar to OpenURL).  For example, the link for the article above would become:

http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22

Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).   
  • One solution to linking to a SE for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the latter point, including the original URI (and its publication date), the SE URI, and the archived URI would result in HTML that looks like:
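
(A minimal sketch of what such a link could look like follows; the data-* attribute names and the archived URI shown are illustrative assumptions, not the exact markup from the "Missing Link" document, and the publication date is inferred from the original URI's path.)

    <!-- The visible link targets the search engine query (Jon's approach),
         while data-* attributes (names are illustrative) retain the original
         URI, its publication date, and an archived copy for clients and
         web archives to use. -->
    <a href="http://www.bing.com/search?q=site%3Ainfoworld.com+%22jon+udell%22+%22XQuery+and+the+power+of+learning+by+example%22"
       data-originalurl="http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html"
       data-versiondate="2006-11-15"
       data-versionurl="https://web.archive.org/web/2006/http://www.infoworld.com/article/06/11/15/47OPstrategic_1.html">
      XQuery and the power of learning by example</a>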



I posted a comment saying that a search engine's robots.txt file would prevent archives like the Internet Archive from archiving the SERPs, so the archives would not discover (and archive) the new URIs themselves.  In an email conversation Martin made the point that rewriting the link to a search engine assumes that the search engine's URI structure isn't going to change (anyone want to bet how many links to msn.com or live.com queries are still working?).  It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that's not always true for general web pages, whose titles often change (see "Is This A Good Title?"). 

In summary, Jon's use of SERPs as interstitial pages to combat link rot is an interesting solution to a common problem, at least for those who wish to maintain publication (or similar) lists.  While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them and betting on the long-term stability of SEs.  The solution we need is a method to include more than one URI per HTML link, such as proposed in the "Missing Link" document.

--Michael

Thursday, November 20, 2014

2014-11-20: Archive-It Partners Meeting 2014



I attended the 2014 Archive-It Partners Meeting in Montgomery, AL on November 18.  The meeting attendees are representatives from Archive-It partners with interests ranging from archiving webpages about art and music to archiving government webpages.  (Presentation slides are now available on the Archive-It wiki.)  This is ODU's third consecutive Partners Meeting (see trip reports from 2012 and 2013).

The morning program was focused on presentations from partners who are building collections.  Here's a brief overview of each of those.

Penny Baker and Susan Roeper from the Clark Art Institute talked about their experience in archiving the 2013 Venice Biennale international art exhibition (Archive-It collection) and plans for the upcoming exhibition.  Their collection includes exhibition catalogs, monographs, and press releases about the event.  The material also includes a number of videos (mainly from vimeo), which Archive-It can now capture.

Beth Downs from the Montana State Library (Archive-It collection) spoke about working with partners around the state to fulfill the state mandate to make all government documents publicly available and working to make the materials available to state employees, librarians, teachers, and the general public.  One of the nice things they've added to their site footer is a Page History link that goes directly to the Archive-It Wayback calendar page for the current page.


Beth has also provided instructions for the state agencies on how to include the Page History link and how to embed a search box for the archive on their pages.  This could easily be adapted to point to other state government archives or to the general Internet Archive Wayback Machine.
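
(As a sketch of the idea, not Montana's actual implementation, a footer link only needs to send the current page's URL to the Archive-It Wayback calendar; the collection ID below is a placeholder.)

    <!-- Hypothetical "Page History" footer link; 1234 is a placeholder
         Archive-It collection ID.  Swapping the base URL for
         https://web.archive.org/web/*/ points the link at the general
         Internet Archive Wayback Machine instead. -->
    <a href="#"
       onclick="window.location = 'https://wayback.archive-it.org/1234/*/' + window.location.href; return false;">
      Page History</a>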

Dory Bower from the US Government Printing Office talked about the FDLP (Federal Depository Library Program) Web Archive (Archive-It collections).  They have several archiving strategies and use Archive-It mainly for the more content-rich websites along with born-digital materials.

Heather Slania, Director of the Betty Boyd Dettre Library and Research Center at the National Museum of Women in the Arts (Archive-It collections), spoke about the challenges of capturing dynamic content from artists' websites.  This includes animation, video (mainly vimeo), and other types of Internet art. She has initially focused on capturing the websites of a selection of Internet artists.  These sites include over 6000 videos (from just 30 artists).  The next step is to archive the work of video artists and web comics.  As part of this project, she has been considering what types of materials are currently capture-able and categorizing the amount of loss in the archived sites.  This is related to our group's recent work on measuring memento damage (pdf, slides) and investigating the archivability of websites over time (pdf at arXiv, slides).

Nicholas Taylor from Stanford University Libraries gave an overview of the 2013 NDSA (National Digital Stewardship Alliance) Survey Report (pdf).  The latest survey was conducted in 2013 and the first was done in 2011.  NDSA's goal is to conduct this every 2 years.  Nicholas had lots of great stats in his slides, but here are a few that I noted:
  • 50% of respondents were university programs
  • 7% affiliated with IIPC, 33% with NDSA, 45% with the Web Archiving Roundtable, 71% with Archive-It
  • many are concerned with capturing social media, databases, and video
  • about 80% of respondents are using external services for archiving, such as Archive-It
  • 80% haven't transferred data to their local repository
  • many are using tools that don't support WARC (but the percentage using WARC has increased since 2011)
Abbie Nordenhaug and Sara Grimm from the Wisconsin Historical Society (Archive-It collections) presented next.  They're just getting started archiving in a systematic manner.  Their state agency partners have websites ranging from highly dynamic to fairly static.  So far, they've set up monthly, quarterly, semi-annual, and annual crawls for those sites.

After these presentations, it was time for lunch.  Since we were in Alabama, I found my way to Dreamland BBQ.

After lunch, the presentations focused on collaborations, an update on 2014-2015 Archive-It plans, BOF breakout sessions, and strategies and services.

Anna Perricci from Columbia University Libraries spoke about their experiences with collaborative web archiving projects (Archive-It collections), including the Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) collection and the Contemporary Composers Web Archive (CCWA) collection.

Kent Underwood, Head of the Avery Fisher Center for Music and Media at the NYU Libraries, spoke about web archiving for music history (Archive-It collection).  Kent gave an eloquent argument for web archiving:  "Today’s websites will become tomorrow’s historical documents, and archival websites must certainly be an integral part of tomorrow’s libraries. But websites are fragile and impermanent, and they cannot endure as historical documents without active curatorial attention and intervention. We must act quickly to curate and preserve the memory of the Internet now, while we have the chance, so that researchers of tomorrow will have the opportunity to discover their own past. The decisions and actions that we take today in web archiving will be crucial in determining what our descendants know and understand about their musical history and culture."

Patricia Carlson from Mount Dora High School in Florida spoke about Archive-It's K-12 Archiving Program and its impact on her students (Mount Dora's Archive-It collection).  She talked about its role in introducing her students to primary sources and metadata.  She's also been able to use things that they already do (like tag people on Facebook) as examples of adding metadata. The students have even made a video chronicling their archiving experiences.

After the updates on ongoing collaborations, Lori Donovan and Maria LaCalle from Archive-It gave an overview of Archive-It's 2014 activities and upcoming plans for 2015.  Archive-It currently has 330 partners in 48 US states (only missing Arkansas and North Dakota!) and 16 countries.  In 2014, with version 4.9, Archive-It began crawling certain pages with Heritrix and Umbra, which allows Heritrix to access sites in the same way a browser would.  This allows for capture of client-side scripting (such as JavaScript) and improves the capture of social media sites.  There were several new features in the 5.0 release, among them integration with Google Analytics. There will be both a winter 2014 release and a spring/summer 2015 release.  Several new features are planned for the spring/summer release, including a visual/UI redesign of the web app, the ability to move and share seeds between collections, the ability to manually rank metadata facets on the public site, enhanced integration with archive.org, an updated Wayback look and feel, and linking related pages on the Wayback calendar (in case the URI changed over time).

After a short break, we divided up into BOF groups:
  • Archive.org v2
  • Researcher Services
  • Cross-archive collaboration
  • QA (quality assurance)
  • Archiving video, audio, animations, social media
  • State Libraries
I attended the Research Services BOF, led by Jefferson Bailey and Vinay Goel from Internet Archive and Archive-It.  Jefferson and Vinay described their intentions with launching research services and asked for feedback and requests.  The idea is to use the Internet Archive's big data infrastructure to process data and provide smaller datasets of derived data to partners from their collections.  This would allow researchers to work on smaller datasets that would be manageable without necessarily needing big data tools.  This could also be used to provide a teaser as to what's in the collection, highlight link structure in the collection, etc.  One of the initial goals is to seed example use cases of these derivative datasets to show others what might be possible.  The ultimate goal is to help people get more value from the archive.  Jefferson and Vinay talked in more detail about what's upcoming in the last talk of the meeting (see below). Most of the other participants in the BOF were interested in ways that their users could make research use out of their archived collections.

After the BOF breakout, the final session featured talks on strategies and services.

First up was yours truly (Michele Weigle from the WS-DL research group at Old Dominion University).  My talk was a quick update on several of our ongoing projects, funded by the NEH Office of Digital Humanities and the Columbia University Libraries Web Archiving Incentives program.


The tools I mentioned (WARCreate, WAIL, and Mink) are all available from our Software page.  If you try them out, please let us know what you think (contact info is on the last slide).

Mohamed Farag from Virginia Tech's CTRnet research group presented their work on an event focused crawler (EFC).  Their previous work on automatic seed generation from URIs shared on Twitter produced lots of seeds, but not all of them were relevant.  The new work allows a curator to select high quality seed URIs and then uses the event focused crawler (EFC) to retrieve webpages that are highly similar to the seeds.  The EFC can also read WARCs and perform text analysis (entities, topics, etc.) from them.  This enables event modeling, describing what happened, where, and when.

In the final presentation of the meeting, Jefferson Bailey and Vinay Goel from Internet Archive spoke about building Archive-It Research Services, planned to launch in January 2015. The goals are to expand access models to web archives, enable new insights into collections, and facilitate computational analysis.  The plan is to leverage the Internet Archive's infrastructure for large-scale processing.  This could result in increasing the use, visibility, and value of Archive-It collections.  Initially, three main types of datasets are planned:
  • WAT - consists of key metadata from a WARC file, includes text data (title, meta-keywords, description) and link data (including anchor text) for HTML
  • LGA - longitudinal graph analysis - what links to what over time
  • WANE - web archive named entities
All of these datasets are significantly smaller than the original WARC files.  Jefferson and Vinay have built several visualizations based on some of this data for demonstration and will be putting some of these online.  Their future work includes developing programmatic APIs, custom datasets, and custom processing.

All in all, it was a great meeting with lots of interesting presentations. It was good to see some familiar faces and to finally meet others I'd previously only emailed with.  It was also nice to be in an audience where I didn't have to motivate the need for web archiving.

There were several people live-tweeting the meeting (#ait14).  I'll conclude with some of the tweets.


-Michele

Friday, November 14, 2014

2014-11-14: Carbon Dating the Web, version 2.0



For over a year, Hany SalahEldeen's Carbon Date service has been out of service, mainly because of API changes in some of the underlying modules on which the service is built. Consequently, I have taken up the responsibility of maintaining the service, beginning with the following changes, now available in Carbon Date v2.0.

Carbon Date v2.0


The Carbon Date service now makes requests to the different modules (archives, backlinks, etc.) concurrently through threading.
The server framework has been changed from Bottle to CherryPy, which is still a minimalist Python WSGI framework, but a more robust one that features a threaded server.
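
As a rough sketch of this concurrency model (the function names and module set below are invented for illustration; the actual Carbon Date modules are organized differently), the service can fan out one thread per date source and wait for all of them to report:

    import threading
    import time

    # Placeholder date-source functions standing in for the real Carbon Date
    # modules (archives, backlinks, etc.); each writes its earliest estimate
    # for the URI into the shared results dict.
    def query_archives(uri, results):
        time.sleep(0.1)                 # stand-in for the network requests
        results['archives'] = None      # would hold an ISO 8601 date string

    def query_backlinks(uri, results):
        time.sleep(0.5)                 # backlinks are the slowest source
        results['backlinks'] = None

    def carbon_date(uri):
        """Fan out one thread per date source and collect their estimates."""
        results = {}
        threads = [threading.Thread(target=f, args=(uri, results))
                   for f in (query_archives, query_backlinks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()    # total runtime ~ the slowest module, not the sum
        return results

    if __name__ == '__main__':
        print(carbon_date('http://cnn.com'))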

How to use the Carbon Date service

There are three ways:
  • Through the website, http://cd.cs.odu.edu/: Given that carbon dating is highly computationally intensive, the site should be used just for small tests, as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally (local.py or server.py).
  • Through the local server (server.py): The second way to use the Carbon Date service is through the local server application which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.
  • Through the local application (local.py): The third way to use the Carbon Date service is through the local python application which can be found at the following repository: https://github.com/HanySalahEldeen/CarbonDate. Consult README.md for instructions on how to install the application.

The backlinks efficiency problem

Upon running the Carbon Date service, you will notice a significant difference in the runtime of the backlinks module compared to the other modules; this is because the most expensive operation in the carbon dating process is carbon dating the backlinks. Consequently, in the local application (local.py), the backlinks module is switched off by default and reactivated with the --compute-backlinks option. For example, to Carbon Date cnn.com with the backlinks module switched on:
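
    # Assumed invocation (the argument order is illustrative; consult the
    # repository's README.md for the exact syntax):
    $ python local.py cnn.com --compute-backlinks
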
Some effort was put towards optimizing the backlinks module; however, my conclusion is that the current implementation cannot be optimized further.

This is because of the following cascade of operations associated with the inlinks:



Given a single backlink (an incoming link, or inlink, to the URL), the application retrieves all mementos of that backlink page (which could range from tens to hundreds). Thereafter, the application searches those mementos for the first occurrence of the target URL.

At first glance, one may suggest binary search, since the mementos are in chronological order. However, the presence of the URL in a memento is not monotonic over time: the link may appear, disappear, and reappear across mementos. Checking the midpoint memento for the URL therefore tells us nothing about which half of the list contains the first occurrence, so we cannot narrow the search space by half. A linear scan is the only reliable method.
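
A minimal sketch of that linear scan, assuming the mementos of the backlink page have already been fetched as a chronologically ordered list of (datetime, html) pairs (the function and variable names are illustrative, not the actual Carbon Date code):

    def first_occurrence(mementos, target_uri):
        """Return the datetime of the earliest memento of the backlink page
        that links to target_uri, or None if the link never appears.

        Because the link can appear, vanish, and reappear over time,
        membership is not monotonic, so we cannot binary search for the
        first occurrence; a full left-to-right scan is required.
        """
        for memento_datetime, html in mementos:
            if target_uri in html:      # simple containment test for the link
                return memento_datetime
        return None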

I am grateful to everyone who contributed to the debugging of Carbon Date, such as George Micros and the members of the Old Dominion University Introduction to Web Science class (Fall 2014). Further recommendations or comments about how this service can be improved are welcome and appreciated.

--Nwala

Sunday, November 9, 2014

2014-11-09: Four WS-DL Classes for Spring 2015

We're excited to announce that four Web Science & Digital Library (WS-DL) courses will be offered in Spring 2015:
Web Programming, Big Data, Information Visualization, & Digital Libraries -- we have you covered for spring 2015.  

--Michael

Monday, October 27, 2014

2014-10-27: 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent

Herbert and I attended the "404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent" workshop at the Georgetown Law Library on Friday, October 24, 2014.  Although the origins of this workshop are many, catalysts for it probably include the recent Liebler & Liebert study about link rot in Supreme Court opinions, and the paper by Zittrain, Albert, and Lessig about Perma.cc and the problem of link rot in the scholarly and legal record, along with the resulting popular media coverage (e.g., NPR and the NYT). 

The speakers were naturally drawn from the legal community at large, but some notable exceptions included David Walls from the GPO, Jefferson Bailey from the Internet Archive, and Herbert Van de Sompel from LANL. The event was streamed and recorded, and videos + slides will be available from the Georgetown site soon so I will only hit the highlights below. 

After a welcome from Michelle Wu, the director of the Georgetown Law Library, the workshop started with an excellent keynote from the always entertaining Jonathan Zittrain, called "Cites and Sites: A Call To Arms".  The theme of the talk centered around "Core Purpose of .edu", which he broke down into:
  1. Cultivation of Scholarly Skills
  2. Access to the world's information
  3. Freely disseminating what we know
  4. Contributing actively and fiercely to the development of free information platforms



For each bullet he gave numerous anecdotes and examples; some innovative, and some humorous and/or sad.  For the last point he mentioned Memento, Perma.cc, and timed-release crypto.

Next up was a panel with David Walls (GPO), Karen Eltis (University of Ottawa), and Ed Walters (Fastcase).  David mentioned the Federal Depository Library Program Web Archive, Karen talked about the web giving us "Permanence where we don't want it and transience where we require longevity" (I tweeted about our TPDL 2011 paper that showed that for music videos on YouTube, individual URIs die all the time but the content just shows up elsewhere), and Ed generated a buzz in the audience when he announced that in rendering their pages they ignore the links because of the problem of link rot.  (Panel notes from Aaron Kirschenfeld.)

The next panel had Raizel Liebler (Yale), author of another legal link rot study mentioned above and an author of one of the useful handouts about links in the 2013-2014 Supreme Court documents.  Rod Wittenberg (Reed Tech) talked about the findings of the Chesapeake Digital Preservation Group and gave a data dump about link rot in Lexis-Nexis and the resulting commercial impact (wait for the slides).  (Panel notes from Aaron Kirschenfeld.)

After lunch, Roger Skalbeck (Georgetown) gave a webmaster's take on the problem, talking about best practices, URL rewriting, and other topics -- as well as coining the wonderful phrase "link rot deniers".  During this talk I also tweeted TimBL's classic 1998 resource "Cool URIs Don't Change". 

Next was Jefferson Bailey (IA) and Herbert.  Jefferson talked about web archiving, the IA, and won approval from the audience for his references to Lionel Hutz and HTTP status dogs.  Herbert's talk was entitled "Creating Pockets of Persistence", and covered a variety of topics, obviously including Memento and Hiberlink.




The point is to examine web archiving activities with an eye to the goal of making access to the past web:
  1. Persistent
  2. Precise
  3. Seamless
Even though this was a gathering of legal scholars, the point was to focus on technologies and approaches that are useful across all interested communities.  He also gave examples from our "Thoughts on Referencing, Linking, Reference Rot" (aka "missing link") document, which was also included in the list of handouts.  The point of this effort is to enhance existing links (with archived versions, mirror versions, etc.), but not at the expense of removing the link to the original URI and the datetime of the intended link.  See our previous blog post on this paper and a similar one for Wikipedia.

The closing session featured Leah Prescott (Georgetown; subbing for Carolyn Cox), Kim Dulin (Harvard), and E. Dana Neacșu (Columbia).  Leah talked some more about the Chesapeake Digital Preservation Group and how their model of placing materials in a repository doesn't completely map to the Perma.cc model of web archiving (note: this actually has fascinating implications for Memento that are beyond the scope of this post).  Kim gave an overview of Harvard's Perma.cc archive, and Dana gave an overview of a prior archiving project at Columbia.  Note that Perma.cc recently received a Mellon Foundation grant (via Columbia) to add Memento capability.

Thanks to Leah Prescott and everyone else that organized this event.  It was an engaging, relevant, and timely workshop.  Herbert and I met several possible collaborators that we will be following up with. 




Resources:

-- Michael