2013-09-06: Wolfram Data Summit 2013 Trip Report

I was fortunate enough to be invited to present at the 2013 Wolfram Data Summit in Washington DC, September 5-6, 2013.  My talk was about the future of web archiving, but the focus of the data summit was "big data".  As such, there was a variety of disciplines represented at the summit since the unifying factor was the scale of the data.  Logistics dictated that I missed several of the presentations, but many of the ones I did attend were very engaging.  The slides will be posted at the Wolfram site later, but I'll provide some short summaries below (2013-11-26 edit: the presentations are now available).

First was Greg Newby presenting about Project Gutenberg, the long-running collection of free ebooks.  His focus was on PG as a portable collection, which is subtly different from universal access from different interfaces (even if the interface is just Google).  The emphasis was on PG as a collection to be explored and as a base on which personalized services can be built.  During the question and answer period someone asked "what's next for Project Gutenberg?", and during lunch the next day Greg, others, and I talked about PG and Open Annotation, and maybe uploading some content to Rap Genius (I got the idea from Rob Sanderson).

Andrew Ng gave a Skype presentation (which, unlike most video presentations, worked rather well) about Coursera.  I'm rather skeptical about most universities' stampede for MOOCs, but I should probably start looking for quality segments in Coursera to augment my own classes.

Another really engaging talk was by Paul Lamere of The Echo Nest.  With lots of illustrative examples using pop music, Paul gave one of the most well-received presentations of the summit.  We learned that band names are not getting longer (I was surprised, I thought they were, but older conventions like "Herb Alpert and the Tijuana Brass" make for long names), metal fans are more "passionate" (defined as replays/favorites) than dubstep fans (that one was easy), and that we can easily tell human drummers from machines by analyzing variances in the signal (his example was the variations in "So Lonely").  Paul's blog, Music Machinery, is worth checking out.

Eric Newburger of the US Census Bureau gave an excellent presentation about how Census data is the original "big data".  Tufte fans will enjoy checking out his presentation (prior presentations are available at the moment).  He made a good pitch for using Census data for ground truth for a variety of business purposes, but you really should check out some of the early visualizations. 

Ryan Cordell and David Smith of Northeastern gave a great presentation about "infectious texts", a project to mine early US newspapers for early "viral" memes.  Apparently early newspapers were equal parts news, fiction, and apocryphal stories half-way between truth and fiction, and editors would fill their local papers with large-scale copying from other newspapers, with and without attribution.  The project analyzes the types of stories chosen for 19th century retweeting, the networks of reuse (which don't always match geography and population networks), their temporal patterns, etc.  During the Q&A period and later during lunch we speculated about identifying timeless stories (e.g., the soldier returning from war) and reintroducing them to Facebook & Twitter to see if they reignite.  The project uses LC data from the Chronicling America project, and the OCR data is especially noisy and requires a host of tricks to align and find the reused portions.

Roger Macdonald of the Internet Archive discussed the Television Archive, which features 2M+ hours of TV news.  I'm guilty of thinking the Internet Archive is just web pages (of which they have some 338B), but they have a great deal more: 30k software titles, 600k books, 900k films/movies, 1M audio recordings (many concerts), and 2M ebooks.  The TV news archive features a very attractive and useful interface for browsing, search, and sharing its content. 

Leslie Johnston from the Library of Congress gave an overview of LC's collections and services.  Most of these I was already familiar with, but I'll mention two sites that I was not aware of.  First, the venerable THOMAS will be replaced with a new congress.gov (see the beta version now), which will soon feature APIs for accessing the data behind the site.  See these reviews: O'Reilly, TechPresident.  I was also unaware of id.loc.gov, a site that gathers the various naming, standards, and vocabulary functions into one place.  I knew LC performed this function, but I didn't know of this particular site.

Eric Rabkin gave a fascinating talk about the analysis of titles of works of science fiction and what that revealed about the society that they reflect.  Quoting from his "Genre Evolution Project" page:
We study literature as a living thing, able to adapt to society’s desires and able to influence those desires. Currently, we are tracking the evolution of pulp science fiction short stories published between 1926 and 1999. Just as a biologist might ask the question, “How does a preference for mating with red-eyed males effect eye color distribution in seven generations of fruit flies?” the GEP might ask, “How does the increasing representation of women as authors of science fiction affect the treatment of medicine in the 1960s and beyond?”
In addition to the slides (when they're available), you might be interested in his SF course on Coursera.

I gave the last presentation of the day, talking about trends in web archiving.  I gave a high-level overview of some of our recent JCDL and TPDL papers, as well as mentioning long-running projects like Memento and how they integrate the various public web archives, most of which most people have never heard of.



Since mine was the last presentation of the summit, we had an extended question and answer period with a handful of people who were not in a hurry to leave and jump into DC traffic.  I ended up meeting my friend Terry for dinner and then headed back to Norfolk at about 7:45 that evening.

Overall this was a really interesting summit and I enjoyed the multidisciplinary nature of the presentations.  I regret that I ended up missing as many as I did, but that's how things worked out.  I would definitely recommend the 2014 summit.  While waiting for the 2013 presentations to be posted, you might want to check out the presentations from 2012, 2011, and 2010.

--Michael

2020-01-23 Edit: Updated SlideShare embed code -- MLN
