Friday, March 25, 2011

2011-03-25: OAC Phase II Workshop Trip Report

I've just finished attending the Open Annotation Collaboration (OAC) Phase II Workshop in Chicago, IL (March 24-25, 2011). The quality of the presentations was very high and I was surprised at how much the OAC community has grown in a relatively short time. Although I've served on OAC technical review panels before and my student, Abdulla Alasaadi, has worked on a small prototype (to be presented at JCDL 2011) for using SVG instead of the W3C Media Fragments for specifying an annotation target, I haven't been keeping up with the OAC community as closely as I should.

The Workshop has all the presentations online, as well as a wiki that contains various commentary, use cases, etc. (also, the hash tag is "#oacwkshp"). Although all of the presentations generated a lot of discussion from the attendees, the presentations that I learned the most from were:
Also of special note was the OAC model overview from Herbert and Rob, but I was already pretty familiar with that. Again, I encourage you to look through all the slides since each presentation was well received.

Thanks to Tim Cole for organizing such a successful workshop.

-- Michael

Monday, March 21, 2011

2011-03-21: Grasshopper, prepare yourself. It is time to speak of graphs and digital libraries and other things.

Announcing the publication of an Old Dominion University Computer Science Department technical report and an homage to David Carradine, Keye Luke and the television series Kung Fu.

"Grasshopper."

"Yes, Master Po?"

"Grasshopper, you have passed many tests of strength, agility and stamina. But that is not enough. There are other trials you must pass before you are permitted to attempt to lift the fiery brazier. I will ask you a series of questions.

“Let us begin. What is a graph?"

"Master; a graph is a mathematical construct made of objects that may, or may not be connected to each other."

"Grasshopper, how does a graph relate to digital libraries and the world where we live?"

"Master; a graph is composed of nodes (or vertices) that can be connected in a pairwise manner with edges (or arcs). In the world of Facebook, people take the place of nodes and the connection that is made when one person “friends” another creates an edge. In the World Wide Web, pages can be nodes and navigational links can be edges. In digital libraries, a complex digital object, with all its contents and metadata, can be a node and the URIs of other digital objects are its edges. In our Shaolin temple, you and I can be nodes and your teachings are our edge."

"Grasshopper, explain what you mean by objects that in a graph may or may not be connected to each other."

"Master; one can think of the Internet as a graph made up of routers as nodes and cables as edges. If a cable between two routers is severed then the Internet can still function. Not as fully as before, but it will still function."

"Grasshopper, are you saying that a graph that is not connected can still function, al beit at a lower level?"

"Yes, Master."

"Grasshopper, I have in one hand a graph and in the other a hira shuriken. I will answer three questions and then you must cause as much damage as you can to the graph."

"Master; may I see the graph?"

"No. That is one question. I will tell you the name of one node. It is called 5."

"Master; to whom is 5 connected?"

"5 is connected to 4, 6, 8 and 9. That is two questions."

"Master who is connected to 4, 6, 8 and 9?"

"4 is connected to 2, 3 and 5. 6 is connected to 5, 7 and 11. 8 is connected to 2, 5 and 7. 9 is connected to 5 and 10. That is three questions. Now Grasshopper, you must select one node to remove with the shuriken."

"Master, I choose to attack node 5 because it has a the highest a 1 as its vertex centrality betweenness, while all others are far less than 0.5."

"Grasshopper, you have chosen wisely. Now, when node 5 is removed, how much damage has been done to the graph you have discovered?"

"Master, the damage is 0.29 after the first deletion."

"Grasshopper, here is the total graph. What will be the damage to the
discovered graph and to the total graph after the 5th deletion if I were to tell you the friends of the friends of the friends of 5?"

"Master, the damage to the discovered graph would be 0.89 and 0.68 for the total graph."


"Grasshopper, you have answered well. How and where did you learn these things?"

"Master; I read the technical report: Connectivity Damage to a Graph by the Removal of an Edge or a Vertex.”

“Grasshopper, tell me more about this report.”

"Master; it has an abstract that reads:

“The approach of quantifying the damage inflicted on a graph in Albert, Jeong and Barabási’s (AJB) report “Error and Attack Tolerance of Complex Networks” using the size of the largest connected component and the average size of the remaining components does not capture our intuitive idea of the damage to a graph caused by disconnections. We evaluate an alternative metric based on average inverse path lengths (AIPLs) that better fits our intuition that a graph can still be reasonably functional even when it is disconnected. We compare our metric with AJB’s using a test set of graphs and report the differences. AJB’s report should not be confused with a report by Crucitti et al. with the same name.

“Based on our analysis of graphs of different sizes and types, and using various numerical and statistical tools; the ratio of the average inverse path lengths of a connected graph of the same size as the sum of the size of the fragments of the disconnected graph can be used as a metric about the damage of a graph by the removal of an edge or a node. This damage is reported in the range (0,1) where 0 means that the removal had no effect on the graph’s capability to perform its functions. A 1 means that the graph is totally dysfunctional. We exercise our metric on a Collection of sample graphs that have been subjected to various attack profiles that focus on edge, node or degree betweenness values.

“We believe that this metric can be used to quantify the damage done to the graph by an attacker, and that it can be used in evaluating the positive effect of adding additional edges to an existing graph.”

“Grasshopper, where did you find this report?”

"Master; I found it at: http://arxiv.org/abs/1103.3075"

"Grasshopper, go and prepare for your next trial."

"Yes, Master."

Dr. Michael Nelson played the part of Master Po. The part of Grasshopper is poorly played.

Chuck Cartledge

Wednesday, March 9, 2011

2011-03-09: Adventures with the Delicious API

I recently conducted an experiment on tags provided by the bookmarking site delicious.com. The goal was to obtain a decent-sized sample set of URIs and the tags that users have used to annotate them. The website provides a "recent" tool that automatically redirects to a somewhat random URI that was recently annotated by some Delicious user. By parsing the HTTP headers I was able to grab the redirect URI and thereby build a corpus of 5000 URIs. The URI for the tool is http://www.delicious.com/recent/?random=1.
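For the curious, here is a rough Python sketch of that harvesting step (a hypothetical reconstruction, not the actual script used for the experiment): request the recent tool, refuse to follow the redirect, and record the Location header until enough URIs have been collected.

import urllib.error
import urllib.request

RECENT = "http://www.delicious.com/recent/?random=1"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib surface the 3xx response instead of following it.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect())

def random_recent_uri():
    # Return the URI the "recent" tool would have redirected to.
    try:
        opener.open(RECENT)
    except urllib.error.HTTPError as e:
        if 300 <= e.code < 400:
            return e.headers.get("Location")
        raise
    return None

corpus = set()
while len(corpus) < 5000:          # the experiment stopped at 5000 URIs
    uri = random_recent_uri()
    if uri:
        corpus.add(uri)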
As the second step I needed to obtain the corresponding tags for each URI. I tried to be a good programmer and used the Delicious API to query for the tags instead of parsing the web interface. In order to use the API (v1) you need an account with Delicious/Yahoo. The request for

https://username:pwd@api.del.icio.us/v1/posts/suggest?url=http://www.google.com/

for example returns an XML-formatted response with the top five popular tags:

search google search engine engine web

The API returns at most the top five tags per URI even though there may be more than five visible through the web interface.
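Here is a sketch of that API call in Python (hypothetical code, not the script used for the experiment; the XML handling is kept deliberately generic because the exact element names of the response are not reproduced in this post, and the v1 API required HTTP basic authentication over HTTPS):

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def suggested_tags(uri, user, pwd):
    # Query the v1 posts/suggest endpoint with HTTP basic auth and return
    # whatever tag text the XML response carries (parse kept generic on purpose).
    endpoint = ("https://api.del.icio.us/v1/posts/suggest?url="
                + urllib.parse.quote(uri, safe=""))
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, "https://api.del.icio.us/", user, pwd)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    root = ET.fromstring(opener.open(endpoint).read())
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]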
I split my URI set into five batches and, with the same account and from the same IP address, ran a thousand queries per batch, all within 30 minutes. To my surprise, roughly 50% of the URIs did not return any tags even though they are indexed by Delicious. My intentions were good, but a 50% loss was too much, so I turned my attention to screen scraping the HTML page. You need to generate the MD5 hash value of each URI (including the http:// prefix) and append it to http://www.delicious.com/url/. For example, for http://www.google.com you need to request

http://www.delicious.com/url/ff90821feeb2b02a33a6f9fc8e5f3fcd

By parsing the source with simple regular expressions you can extract at most the top 30 tags and the frequency with which users have applied each tag to the URI. This path turned out to be fast and reliable, and it provides better results since you get more than just five tags.
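A rough sketch of that scraping path follows (hypothetical helper names, and the regular expression is only illustrative; the real pattern depends on the page markup Delicious serves):

import hashlib
import re
import urllib.request

def delicious_history_url(uri):
    # Build the /url/<md5> page for a bookmarked URI. The hash must be taken
    # over the exact string Delicious indexed, scheme (and any trailing slash)
    # included, or the page will not be found.
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest()
    return "http://www.delicious.com/url/" + digest

def scrape_tags(uri):
    # Return (tag, count) pairs parsed out of the public bookmark-history page.
    html = urllib.request.urlopen(delicious_history_url(uri)).read().decode("utf-8")
    # Illustrative pattern only: the class names below are assumptions.
    return re.findall(r'class="tagName">([^<]+)</a>.*?class="tagCount">(\d+)',
                      html, re.S)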

The discrepancy between the API and the web interface, however, raised some questions, so I will share some statistics about my data and offer a few theories to explain the observed behavior:
I only collected 4969 unique URIs. Apparently the recent tool distinguishes between, e.g., google.com and www.google.com, and possibly www.google.com/.

The API did not return any tags for 78 URIs, but the web interface provided tags for all 4969 URIs. Maybe the API accesses a smaller index than the web interface? The recent tool, however, may pull data from the "live" index. Similar behavior was observed by Frank McCown for search engine caches (JCDL 2007).

I got down from the original roughly 50% failure rate to 78 URIs by distributing the queries over five different IP addresses and re-querying the API dozens of times, stretched over an entire day. The API seems to be sensitive to high-frequency requests, or is simply not very powerful.

For those 78 URIs, the web interface returned a mean of 23.2 tags with a standard deviation of 7.8. The minimum number of tags was two (for one URI) and the maximum was 30 (for 38 URIs). 51 of the 78 URIs had 20 or more tags and 73 had 10 or more tags through the web interface. This just underlines the point: the API is not reliable.

I further found that in 465 cases the API returned fewer than five tags while the web interface returned more. This "under-reporting" (meaning the API should have reported a full top five) is another strong indicator that the API pulls from a smaller and possibly dated index.

One can argue whether or not the order of tags matters. I found that out of the 4891 URIs with tags from the API, 1759 had a different order compared to the web interface data, and 191 times I observed a change at rank 1. These changes account for 718 cases where terms were added to or removed from the union of both tag sets (API vs. web interface); on average, 1.11 tags moved in or out of the intersection of the two sets.
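To make the bookkeeping concrete, a small hypothetical helper along these lines can report, for one URI, whether the order differs, whether rank 1 changed, and how many tags fall outside the intersection of the two lists:

def compare_tags(api_tags, web_tags):
    # Compare the API's (at most five) tags against the same number of
    # top tags from the web interface.
    web_top = web_tags[:len(api_tags)]
    return {
        "order_differs": api_tags != web_top,
        "rank1_changed": bool(api_tags) and bool(web_tags)
                         and api_tags[0] != web_tags[0],
        "outside_intersection": len(set(api_tags) ^ set(web_top)),
    }

print(compare_tags(["search", "google", "engine"],
                   ["google", "search", "web", "engine", "seo"]))
# -> order differs, rank 1 changed, and two tags ("engine", "web") fall outside the intersection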

The moral of all this? As much as you may appreciate an API, in the case of Delicious you can obtain more (better?) data by screen scraping the HTML page.

--
martin

Friday, March 4, 2011

2011-03-04: Personal Digital Archiving Conference 2011

Last week, along with Dr. Nelson, I attended the 2nd annual Personal Digital Archiving conference, held at the Internet Archive in the heart of the foggy city, San Francisco. The weather was not on our side, as the sunny state was facing its worst weather in quite a while. That didn't dampen my spirits, though, as I was excited to be in a room with experts and passionate geniuses whose collective IQ could cause an integer overflow!



The general atmosphere was really nice; participants were very friendly and eager to introduce themselves and get to know you. I was exposed to a ton of ideas, projects and insights, sometimes over coffee and other times just going up and down the stairs. My only regret is that I don't have a contact card of my own, even though I collected a bunch of them; I've got to get me some of these!

So that readers can relive this experience with me, I have divided the conference into two days, each in turn divided into sessions. I will try to highlight a thing or two from each session, and I will try to find the videos for the entire conference.

Day 1:

At 9am the conference started with Brewster Kahle and Jeff Ubois introducing Cathy Marshall from Microsoft Research, who gave an amazing keynote entitled "People are People and things change". It was a really insightful speech summarizing the problems we face in dealing with data and backups. She gave examples from her own experience with her computer and the process of "not" backing it up, and with her tweets, which needed to be backed up as well. She noted that even when we do back things up, we tend to replicate entire folders and make copies rather than maintain an organized list of the resources, and that people always think archiving their data should be done by someone else.

Gary Wright from FamilySearch gave an insight from his paper entitled "Preserving your family history records digitally" (Legacy Dox) about best archiving practices. He also introduced the Millenniata Disc for data preservation. Jeremy Leighton John from the British Library came up next and discussed ways of processing digital manuscripts. Evan Carroll, the author and founder of TheDigitalBeyond, followed. He discussed a very interesting question: What happens to your digital assets when you die? Who gains access to them? Do you want them to be destroyed? He also discussed why certain assets grow to have more importance because of the sentimental value behind them.

After the break, Ellysa Cahoy and Scott McDonald from Penn State University proposed their ideas and a project in which faculty members help further the archiving process at their own level. Judith Zissman used her software design expertise to discuss a very interesting idea about Agile Archiving, namely how to apply the Agile Manifesto to the personal archiving process. I wish she could have given further examples, though. After that, Stan James, who later became my friend, introduced "The Smallest Day", a project Stan and his father set up together to collect, archive, arrange, tag and connect all their family photos, documents, letters, postcards and more. It is an awesome project that utilizes lots of interesting technologies like Dragon voice recognition software, Mechanical Turk, Ancestry.com, etc. Lori Kendall followed, discussing and defining the concept of "personal" with regard to archiving, giving an example from her ancestors' photos. Jason Scott followed with a wonderful speech, first presenting himself as a collector and then arguing that it is not enough to keep things safe; we also need to make those collections accessible and available online. He discussed in agony the catastrophe of the shutdown of GeoCities and its consequences. On a side note, Jason Scott's cat has 1.4 million followers on Twitter and is ranked in the top 200 accounts to follow!


Lunch was next; it was quite refreshing to discuss ideas and thoughts with other people. We had a tour of the Internet Archive and I took some photos; we saw microfilm readers and book scanners (not the first time I have seen one, as I saw a whole bunch of them in the Alexandria Library). After lunch, Birkin Diana from Brown University talked about metadata on archived items and ways to enhance this metadata. Kate Legg from NCAR (the National Center for Atmospheric Research) presented the center's project on enabling ease of access to archived content, and cited the Warren Washington collection they hold as a model that could be adopted for subsequent collections. From Bookism, Jay Datema discussed the issue of compatibility and creating archiving standards (suggesting: why not make an archive.txt file along the lines of robots.txt and humans.txt?). He also talked about Jeremy John's paper "The future of saving our past". The next session was by Ben Gross and Evan Prodromou, who started by discussing the social data aspect of archiving. They covered FOAF, SIOC, Atom, OpenDD, POCO and Activity Streams. Marc A. Smith discussed, in a very interesting way, where in the social graph an individual is located. He utilized MapXL and NodeXL to visualize the US Senate, and to my surprise (and given my lack of political knowledge) clusters started to appear showing the right and left wings. From Berkeley, Ray Larson came up next and introduced the SNAC project, also discussing authority control and showing the two possibilities: having several names for one person, and having several persons with the same name.

Finances and economics were the theme of the next panel, where Jeff Ubois, Brewster Kahle, Steve Griffin and David Rosenthal discussed the costs of archiving and showed that the bottleneck is in the scanning process, as 80% of the cost goes into the human part of the process. Ten cents per page was the claimed figure, and they discussed how a box of paper can cost from $200 to $680 to archive. These are pay-once models like PrestoPrime, which costs $2000 per terabyte to preserve forever. Notably, the hardware is the smallest cost, at merely $50/TB. The LOCKSS system was introduced as well by Rosenthal. The closing keynote was given by Brian Fitzpatrick from Google and DataLiberation.org. He showed statistics from the Internet cut-off in Egypt last month as an example of control over data. He argued that there is a necessity to free data from the framework beneath it and to introduce an Import/Export button in all products.

After dinner and the reception, the demo session started with Joanne Lang and her project "About One", an amazing tool to gather information, data, documents and content for the family to help organize and manage life. The slogan was that small pieces of information can build a connected life. Michael Ashenfelder from the Library of Congress talked next. Debbie Weissman discussed the possibility of claiming ownership of preserved content. Laura Welcher from the Long Now Foundation came up next and introduced the Rosetta Project, which aims to archive 7000 languages for fear that some languages may go extinct. She also introduced a hierarchy of languages, language commons and sources in a wiki-like theme. Susan Kostal from San Francisco magazine discussed the concept of digital hoarding and its relation to physical hoarding. Then Jonathan Good (with whom I had a very nice chat earlier about the Egyptian revolution) gave a demo of his project 1000memories.com, showing how friends and loved ones can remember a person who has passed away by collectively gathering his or her photos and testimonials, or even starting a grant in his or her name. He also showed a dedicated page for the 384 martyrs who died in the Egyptian revolution, each photo linked to a dedicated profile, so that people don't forget who those people were and can get to know what their lives were like. The day was concluded by Denim Smith's presentation on his project "My Internet Cooperation".

Day 2:

The second day was also really interesting, but definitely shorter. It started with a keynote speech by Clifford Lynch, who gave a very insightful talk about the different forms of exposure of personal documents to the public. He discussed how the personal archiving concept evolved from individual private shards, accessed individually, to shared content with the spread of social media, until it finally reached the public domain. He also argued that we need an archive "button" in lots of digital media.

Daniel Reetz of the DIY Book Scanner gave a very interesting talk with a different focus: cameras and their technologies. It started when he made an Instructable on how to build a cheap book scanner. He discussed how cameras vary in power and how production is shaped by users' requirements for enhancements. He argued that users sometimes want the best modified photo rather than the best "real" one: slimming cameras, face-enhancement lenses, etc. He wondered if it would have been better to invest in adding "document capturing" capabilities to cameras, perhaps OCR too. Dwight Swanson followed, discussing home movies, their evolution and their archival. Rich Gibson from the Gigapan Project showed how extreme close-up images can tell more stories (like zooming in on the designer tag in an Italian fashion runway shot).

After the break, two poets and writers, Devin Becker and Collier Nogues, presented a survey of a broad group of writers and their methods of saving their documents and writings, and of archiving and organizing them. Hong Zhang, a PhD student from the University of Illinois, followed, then Jason Zalinger from Rensselaer Polytechnic Institute in New York. Jason presented his study on possible enhancements to Google's Gmail by adding these concepts: a Forget label for unwanted emails, Digital Regret for undoing a send, Sleep on It for postponing confirmation of a send, a Word Cloud, etc. Aiden Doherty and Cathal Gurrin from Dublin City University presented a very interesting and intriguing concept, the LifeLog: a small wearable device that logs and takes snapshots, GPS coordinates, temperature sensor readings, etc., and stores them in a searchable memory platform. They have been wearing these devices for the last 4.5 years!

Ted Nelson gave an awesome speech about how things would have been better had they been designed differently from the beginning, arguing that documents on the computer are the biggest example. His slides didn't work initially, but later that day he showed an amazing demo of Xanadu, a project he has been working on for a long time; it introduces a very new data structure, multidimensional cells, which I found fascinating! Ed Feigenbaum from Stanford introduced SALT (Self Archiving Legacy Toolkit) and talked about the initiative they started at Stanford. Then Christina Engelbart, daughter of Douglas Engelbart (the inventor of the mouse), gave a presentation about her institute's work in collecting the digital artifacts of her father's legacy.

Lunch was, as on the day before, a good opportunity to mix, mingle and exchange ideas. It helped a lot that it was sunny, so most people had lunch outside in the sun. When we came back, Cal Lee from the University of North Carolina introduced the forensics aspect of digital preservation. Richard Cox from the University of Pittsburgh followed next, then Mark Matienzo from Yale University Library and Amelia Abreu from the University of Washington.

Gordon Bell from Microsoft Research came up next with a talk about his life experience on the health front. He illustrated how he gathered all his records, from the very first one decades ago, in order to have a better picture of his health situation after he had a heart attack; the MyLifeBits project captures this initiative. Khaled Hassounah came up next and introduced a very successful PHR (Personal Health Record) service named MedHelp. Then Linda Branagan from Medweb discussed the difference between an EMR (Electronic Medical Record) and a PHR (Health Panel Video).



After this final break, Elizabeth Churchill from Yahoo! Research led a panel discussing forensics in the digital world. Kam Woods from the University of North Carolina presented the Forensic Toolkit Imager, and Sam Meister from the University of Maryland discussed data from failed businesses, like the Sherwood case.

As a grand finale, the author Rudy Rucker gave a very interesting talk filled with insightful thoughts, humor and sarcasm, discussing digital immortality through the creation of a digital replica of one's thoughts and memories, which he named the LifeBox. The LifeBox acts as a bot that can imitate your responses and answer and give opinions based on your thoughts and memories; it can persist forever, even after your death, for your great-grandchildren.

In summary, it was an amazing conference, not just because I attended those 48 sessions, but because it gave me a priceless opportunity to meet these bright individuals and broaden my scope of thought. As a matter of fact, I was inspired to come up with several ideas for my thesis proposal!

Also, on another note, I found out that the size of the Internet is 20x8x8 ft and that it is located in a parking lot in Santa Clara, California.

For more about the conference from other perspectives, please check out Collin Thorman's blog posts, the Library of Congress's news page, Dick Eastman's blog, Christina Engelbart's Collective IQ post, Ellysa Cahoy's blog, Don Hawkins's article, The Waki Librarian's posts for day 1 and day 2, and #PDA2011 on Twitter.

(2011-03-20 Update:) I have associated links to the video recordings of each of the sessions; click a session's link and it will display the recording. Photos from the conference can be seen here, courtesy of the Internet Archive and Jeff Ubois.

-- Hany SalahEldeen