Friday, May 21, 2010

2010-05-21: Travel Report for LDOW, WWW, DOE, OAC

I've just finished up a pretty busy four-week stretch that involved one workshop, one conference, one proposal review panel, the space shuttle, a working group meeting, and the end of the spring semester.

In the last week of April I went to Raleigh, NC for the Linked Data on the Web Workshop (LDOW 2010) and the World Wide Web Conference (WWW 2010). I drove down to Raleigh Monday evening after giving the last lecture (on Memento) in my CS 751/851 class. From the WS-DL team, Scott Ainsworth and Jeff Shipman were also able to attend the pre-conference workshops WS-REST 2010 and LDOW 2010, but they both had to return to work afterwards and missed the WWW conference itself. WS-DL alumnus Frank McCown was able to attend WWW, and it was good catching up with him.

From the Memento team, Herbert & Rob were there for the entire week as well. We had a Memento paper at LDOW:
Herbert Van de Sompel, Robert Sanderson, Michael L. Nelson, Lyudmila L. Balakireva, Harihar Shankar, Scott Ainsworth, An HTTP-Based Versioning Mechanism for Linked Data, Proceedings of Linked Data on the Web (LDOW2010), 2010.
Herbert gave the presentation, which was very well received:



Among the many good contacts we made at LDOW were John Sheridan & Jeni Tennison from the data.gov.uk project (see Jeni's blog and their LDOW paper for more info); the US equivalents would be data.gov (see this blog post for a comparison between data.gov and data.gov.uk) and citability.org.

The WWW conference itself was good, but I did not get to attend as much of it as I would have liked: a dataset dynamics meetup, Memento planning meetings, and other events limited the number of paper sessions I could make. danah boyd's keynote was particularly good, both challenging and somewhat controversial, the latter of which I especially value in keynotes. Rather than attempt a summary, I'll just link to danah's transcript and a nice summary from Elon. But perhaps the most thought-provoking highlight of WWW was the panel "Search is Dead! Long Live Search", of which Gene Golovchinsky has written a nice summary.

After WWW, it was back to Norfolk and exam week for my class. After the exam session, it was another Monday night drive, this time to Rockville MD for a Department of Energy review panel. I'm not supposed to say which panel it was, but it was my first DOE panel and it was illuminating for me to contrast this panel with the various NSF panels on which I've served. After the panel, it was back to Norfolk to grade exams & final projects before Saturday's commencement.

The following week Danette and I drove down to NASA Kennedy Space Center to watch the launch of STS-132. The week after that I flew out to Albuquerque, New Mexico for a working group meeting of the Open Annotation Collaboration (OAC) project. Some of the topics included the relationship between OAC and ORE (see, for example, the pre-meeting discussion on the OAC list) as well as discussion about whether or not an Annotation is an "event" (in the sense of, for example, LODE). Presumably meeting minutes will be published soon reflecting OAC's status on these and other issues.

--Michael

Wednesday, May 5, 2010

2010-05-11: How Good is Your Graph? A Question Posed to Hypertext 2010

Usually the first response to a question like that is: Huh, what kind of a question is that and why should I care? Here is a short answer to the caring part (the rest of why this is important is at the end): a good graph can keep data safe even after the person who created the data is gone.

The most common interpretation of "graph" is some sort of X-Y plot that shows how one value is affected by another. But in the context of this question, a graph is a structure made up of edges and vertices (think of edges as HTML hypertext links and vertices as pages, and WWW sites become a graph). Now that we have a graph, the next part of the puzzle is: what does "good" mean and how do you measure it?

That is at the heart of my paper "Analysis of Graphs for Digital Preservation Suitability", which I will be presenting at Hypertext 2010. I look at different types of graphs that are characterized by (on average) how many edges connect a vertex to its neighbors. Taking this back to WWW pages, some pages have more links to other pages than others. In the world of graph theory, the number of edges that a vertex has is called the "degree" of the vertex. The figure is a histogram showing the "degreeness" of different types of graphs (each with 1000 vertices), specifically: Power Law, Random, Unsupervised Small-World, and Watts-Strogatz Small-World. The figure shows that for the Power Law graph, over 600 vertices have only 2 edges connecting them to other vertices in the graph. Now that we've defined some different types of graphs, we can go on and talk about what constitutes a "good" graph.
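If you want to get a feel for these degree distributions yourself, here is a minimal sketch (not the code from the paper) that builds three of the four graph types with the networkx Python library and prints their degree histograms. The generator parameters are my own illustrative choices, and the Unsupervised Small-World (USW) graph is the paper's own construction, so it is not available in the library and is omitted here.

# A minimal sketch (not the paper's code) comparing degree histograms for
# 1000-vertex graphs, using the networkx library. Generator parameters are
# illustrative; the USW graph type from the paper is not part of networkx.
import networkx as nx

n = 1000
graphs = {
    "Power Law (Barabasi-Albert)": nx.barabasi_albert_graph(n, 2),
    "Random (Erdos-Renyi)": nx.erdos_renyi_graph(n, 4.0 / n),
    "Watts-Strogatz Small-World": nx.watts_strogatz_graph(n, 4, 0.1),
}

for name, g in graphs.items():
    print(name)
    # degree_histogram(g)[d] = number of vertices with degree d
    for degree, count in enumerate(nx.degree_histogram(g)):
        if count:
            print(f"  degree {degree:3d}: {count} vertices")

With these (assumed) parameters, the Barabasi-Albert graph shows the same qualitative shape as the Power Law histogram in the figure: most vertices have a very small degree, while a few act as highly connected hubs.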


In the WWW world, sometimes the link that a person hopes will lead them to the next page doesn't. This may be because the page has moved, the web site has moved, or some other reason. "Good" in my paper is how well the graph continues to allow someone to follow a series of edges (links) from one vertex (page) to another, all the while someone else is actively trying to "break" the graph so that you can't get from one vertex to another. To quantify how good one graph is compared to another, I set up a game between an attacker looking to break the graph and a repair person trying to repair the graph in between attacks.


Adding to the complexity of the problem: what percentage of the total graph are the attacker and the repair person allowed to see during their respective turns, and how much of that small piece are they allowed to affect? After all, if either the attacker or the repairer could see the entire graph, it would be too easy to find the best place to attack or to repair the damage. Details on different combinations of what an attacker might go after (edges or vertices), how much of the graph he is allowed to see in order to select which edge or vertex to remove, how much of the graph the repairer is allowed to see, and what kind of repair she can make are part of the paper (an arXiv copy is here). Spoiler alert!! (The USW is the best.)
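To make the game concrete, here is a toy sketch in Python (again using networkx). The 10% visibility, the attacker's "remove the highest-degree vertex I can see" rule, and the repairer's random edge repair are all my own simplifications for illustration, not the rules or strategies studied in the paper.

# A toy sketch of one attack/repair game under limited visibility.
# The visibility fraction, attack rule, and repair rule below are
# illustrative simplifications, not the paper's actual rules.
import random
import networkx as nx

def play_round(g, visible_fraction=0.10):
    """One attack turn and one repair turn, each seeing only a sample of the graph."""
    # Attacker: look at a sample of vertices and remove the highest-degree one seen.
    nodes = list(g.nodes())
    sample = random.sample(nodes, max(2, int(visible_fraction * len(nodes))))
    g.remove_node(max(sample, key=g.degree))
    # Repairer: look at another sample and add an edge between two sampled vertices.
    nodes = list(g.nodes())
    sample = random.sample(nodes, max(2, int(visible_fraction * len(nodes))))
    u, v = random.sample(sample, 2)
    g.add_edge(u, v)

g = nx.watts_strogatz_graph(1000, 4, 0.1)
for _ in range(50):
    play_round(g)

# "Good" here: how much of the graph is still reachable after repeated attacks.
largest = max(nx.connected_components(g), key=len)
print(f"Largest connected component: {len(largest)} of {g.number_of_nodes()} vertices")

Running this kind of simulation over many rounds, and over many graph types and attacker/repairer strategies, is the general flavor of the comparison; the paper's own experiments are what actually back the USW result.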


Back to the second part of the answer to the first question. Now that we have identified a graph that can remain connected and functional after being repeatedly attacked, we can talk about how the contents of a page could still be reached and recovered when things are unreachable or gone: when the originator is gone, when the originator's original site is gone, or when the digital information at the original site is no longer supported by the current technology (think about a WordPerfect file as part of a web page). Looking at solving these types of questions is what digital preservation is about and what Dr. Michael L. Nelson and I are researching.


I'm looking forward to the Hypertext conference and its lively, informative, and thought-provoking presentations.


I hope that you have a nice day and I'll talk to you later.


-- Chuck