Wednesday, May 5, 2010

2010-05-11: How Good is Your Graph? A Question Posed to Hypertext 2010

Usually the first response to a question like that is: huh, what kind of question is that, and why should I care? Here is a short answer to the caring part (the rest of why this is important comes at the end): a good graph can keep data safe even after the person who created the data is gone.

The most common interpretation of "graph" is some sort of X-Y plot that shows how one value is affected by another. But in the context of this question, a graph is a system made up of edges and vertices (think of edges as HTML hypertext links and vertices as pages, and the WWW becomes a graph). Now that we have a graph, the next part of the puzzle is: what does "good" mean, and how do you measure it?
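To make the edges-and-vertices picture concrete, here is a minimal sketch of a hypertext graph as a Python adjacency map. The page names are hypothetical; each page (vertex) maps to the set of pages it links to (edges).

```python
# A tiny hypothetical website as a graph: vertices are pages,
# edges are the hypertext links between them.
web = {
    "home": {"about", "papers"},
    "about": {"home"},
    "papers": {"home", "ht2010"},
    "ht2010": {"papers"},
}

# Following a link is just a set-membership step:
assert "papers" in web["home"]
```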

That is at the heart of my paper "Analysis of Graphs for Digital Preservation Suitability" that I will be presenting at Hypertext 2010. I look at different types of graphs that are characterized by (on average) how many edges connect a vertex to its neighbors. Taking this back to WWW pages, some pages have more links to other pages than others. In the world of graph theory, the number of edges that a vertex has is called the "degree" of the vertex. The figure is a histogram showing the "degreeness" of different types of graphs (each with 1000 vertices), specifically: Power Law, Random, Unsupervised Small-World and Watts-Strogatz Small-World. The figure shows that for the Power Law graph, over 600 vertices have only 2 edges connecting them to other vertices in the graph. Now that we've defined some different types of graphs, we can go on and talk about what constitutes a "good" graph.
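A degree histogram like the one in the figure is easy to compute once a graph is in hand. The sketch below (not the paper's code) builds a random graph in the Erdos-Renyi style and counts how many vertices have each degree; the graph size and edge probability are illustrative choices.

```python
import random
from collections import Counter

def random_graph(n, p, seed=0):
    """Build an undirected random graph: each possible edge exists
    independently with probability p (Erdos-Renyi style)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def degree_histogram(adj):
    """Count how many vertices have each degree."""
    return Counter(len(neighbors) for neighbors in adj.values())

g = random_graph(1000, 0.004)
hist = degree_histogram(g)
# A random graph's degrees cluster around n*p (about 4 here), unlike a
# power-law graph, where most vertices sit at the very low end.
```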

In the WWW, sometimes the link that a person hopes will lead them to the next page doesn't. This may be because the page has moved, the web site has moved, or some other reason. "Good" in my paper means how well the graph continues to allow someone to follow a series of edges (links) from one vertex (page) to another, all the while someone else is actively trying to "break" the graph so that you can't get from one vertex to another. To quantify how good one graph is compared to another, I set up a game between an attacker looking to break the graph and a repair person trying to repair the graph in between attacks.
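"Can you still get from one page to another" is a reachability question, which a breadth-first search answers directly. This is a generic sketch, not the paper's measurement; the 3-page site and the removed page are hypothetical.

```python
from collections import deque

def reachable(adj, src, dst):
    """Breadth-first search: can we still follow links from src to dst?"""
    seen, queue = {src}, deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            return True
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

# Hypothetical 3-page site; deleting "papers" breaks home -> ht2010.
web = {"home": {"papers"}, "papers": {"home", "ht2010"}, "ht2010": {"papers"}}
assert reachable(web, "home", "ht2010")
broken = {v: nbrs - {"papers"} for v, nbrs in web.items() if v != "papers"}
assert not reachable(broken, "home", "ht2010")
```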

Adding to the complexity of the problem: what percentage of the total graph are the attacker and the repair person allowed to see during their respective turns, and how much of that small piece are they allowed to affect? After all, if either the attacker or the repairer could see the entire graph, it would be too easy to find the best place to attack or to repair the damage. Details on the different combinations (what an attacker might go after, edges or vertices; how much of the graph he is allowed to see in order to select which edge or vertex to remove; how much of the graph the repairer is allowed to see; and what kind of repair she can make) are part of the paper (an arXiv copy is here). Spoiler alert!! (The USW is the best.)
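A toy version of the limited-visibility game can be sketched in a few functions. The paper's actual rules and strategies differ; here, as an illustrative assumption, the attacker samples 5% of the vertices and deletes the highest-degree one it can see, the repairer samples its own 5% and links its two lowest-degree vertices, and "goodness" after some rounds is measured by the size of the largest connected component.

```python
import random

def largest_component(adj):
    """Size of the largest connected cluster of vertices (iterative DFS)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

def attack(adj, rng, view=0.05):
    """Attacker sees a random `view` fraction of vertices and
    deletes the highest-degree one it can see."""
    visible = rng.sample(sorted(adj), max(1, int(view * len(adj))))
    victim = max(visible, key=lambda v: len(adj[v]))
    for w in adj.pop(victim):
        adj[w].discard(victim)

def repair(adj, rng, view=0.05):
    """Repairer sees its own random sample and links the two
    lowest-degree vertices it can see."""
    visible = rng.sample(sorted(adj), max(2, int(view * len(adj))))
    a, b = sorted(visible, key=lambda v: len(adj[v]))[:2]
    adj[a].add(b)
    adj[b].add(a)

# Play 20 rounds on a 200-vertex ring (every vertex starts with degree 2).
rng = random.Random(1)
n = 200
ring = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}
for _ in range(20):
    attack(ring, rng)
    repair(ring, rng)
print(largest_component(ring), "of", len(ring), "vertices still connected")
```

Swapping in different graph types (power law, small-world) and different visibility fractions is what lets one graph be scored against another.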

Back to the second part of the answer to the first question. Now that we have identified a graph that can remain connected and functional after being attacked repeatedly, we can talk about how the contents of a page could still be reached and recovered when things are unreachable or gone: when the originator is gone, when the originator's original site is gone, or when the digital information at the original site is no longer supported by current technology (think of a WordPerfect file as part of a web page). Solving these types of questions is what digital preservation is about, and it is what Dr. Michael L. Nelson and I are researching.

I'm looking forward to the Hypertext conference and its lively, informative, and thought-provoking presentations.

I hope that you have a nice day and I'll talk to you later.

-- Chuck
