Wednesday, February 29, 2012

2012-02-24: Personal Digital Archiving 2012

For the third consecutive year, the Personal Digital Archiving conference took place at the Internet Archive in San Francisco, CA. Ahmed and I attended a diverse range of fascinating sessions on how people think about creating and preserving personal digital archives. The environment was very nice and friendly (there were a baby and a dog on the second day ^_^).
The conference was held on Feb. 24 and Feb. 25, 2012. The first day started at 9:00 am with a welcome keynote by Brewster Kahle about the Internet Archive and personal archives. Brewster gave a quick introduction to the history of the Internet Archive and asked an important question: “what would we want out of the Internet Archive in terms of preserving stuff that individuals are creating?” Answering it requires knowing how to collect these materials and make them useful, which is where the people at this conference come in.
Mike Ashenfelder from the Library of Congress gave a talk entitled “Personal Digital Archive Advice for the General Public” (video). He briefly described the LoC's main role in archiving, as well as its efforts in personal digital archiving. At the end, he gave advice for the general public: identify what you want to save, decide what is most important to you, organize the content, and save copies in different places. The take-home message can be summarized in three words: outreach, educate, simplify.
Stan James gave a talk entitled “How my Family Archives Affected Others” (video). James gave an update since last year: he and his father started a project together to collect and archive all their family photos, documents, letters, postcards and more, and they have gathered more than 20,000 files over the past year. He realized that his family is ahead of the curve. He began with the start of the story: a photo of his grandmother, who burned all their letters after she had let her children read them. He encouraged his family to collaborate in archiving their stuff. He mentioned that his dad is still scanning photos and that the project brings the family together. James tried many tools for uploading photos: Picasa does not let you enter dates before 1970 (the start of Unix timestamps), and Google Plus does not let you edit the date at all. In the end, he used Facebook Timeline to upload and organize his dad’s 25,000 photos.
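To make the 1970 limit concrete, here is a minimal sketch of why a pre-1970 photo date is awkward for a tool that stores dates as non-negative Unix timestamps. That storage choice is an assumption for illustration only (the talk did not describe Picasa's internals), and the photo date is made up.

```python
from datetime import datetime, timezone

# The Unix epoch: timestamps count seconds from this moment.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_seconds(moment):
    """Seconds since the Unix epoch; negative for dates before 1970."""
    return int((moment - EPOCH).total_seconds())


def fits_unsigned_timestamp(moment):
    """True if the date can be stored as a non-negative Unix timestamp."""
    return unix_seconds(moment) >= 0


# Hypothetical scan date for an old family photo.
family_photo = datetime(1955, 6, 1, tzinfo=timezone.utc)
print(unix_seconds(family_photo))             # a negative number
print(fits_unsigned_timestamp(family_photo))  # False
```

Any tool that rejects negative values would refuse such a date, which matches the behavior James described.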
Jerry Michalski gave a talk entitled “What I’ve learned from gardening my Brain” (video). He presented Brain, mind-mapping software that can be used in place of bookmarking. He has been using it for 15 years and shared his own brain as a case study to explain the benefits of the tool. He mentioned that Brain helps in organizing stream-of-consciousness thinking.
Jo An Morfin-Guerrero from University of Bristol presented her work “Unstable Archives: Performing the Franko B Archive” (video). She started her session with a case study of an artist (Franko B) who began trying to archive his pictures, media and all his work a long time ago. It was part of her Ph.D. research on the preservation of media and the different artistic practices used to organize and archive Franko B’s records.
In the Media Types session, I learned of useful tools for preserving emails, bookmarks, and photos. Peter Chan and Sudheendra Hangal from Stanford University gave two talks, entitled “Processing and Delivering Email Archives in Special Collections using Muse” (video) and “Putting Personal Archives to Work: Reminiscence, Search and Browsing” (video) respectively, about Muse, a project for archiving emails. It performs sentiment analysis and shows a slideshow of the images in the attachments. Muse gives good insight into trends in emails over time, such as the topics of the month. They mentioned that there are many challenges in email archiving, such as copyright and privacy, sensitive information, description (for creating metadata), and delivery. I found a publication list which has more information about Muse.
Aaron Straup Cope presented his personal work “Parallel-flickr”. He started his talk with a question: “What would happen if flickr went away tomorrow?” He developed Parallel-flickr, software that uses the Flickr API to pull out photos and photo information.
Maciej Ceglowski from Pinboard gave a talk entitled “Remember the Web? Practical challenges of Bookmarking for Keeps” (video). Ceglowski presented Pinboard, a paid bookmarking site which was founded in 2009 and holds 9 million archived bookmarks and 4 TB of stored web content. Pinboard downloads the full content of the bookmark and stores it on a Pinboard server. Ceglowski said that “The search engine does not replace the need for your own bookmarks.”
Personally, I liked the idea of keeping the content of each bookmark; I used to save on my local machine the content of each file I found useful.
After lunch came the Social Network Data session. Marc A. Smith from the Social Media Research Foundation gave a talk entitled “Arc-chiving: saving social links for study” (video) about NodeXL, an open tool for visualizing the connections in social media data and converting them into graphs. NodeXL works with Excel 2007 and 2010.
I tried it myself in my Information Visualization class, where I created very cool graphs and gained good insight into my connections on Twitter and Facebook. The graph on the right shows how most of my friends on Facebook don't assign a relationship status. This is a Group-in-a-Box (GIB) layout of the Facebook network using the “Circular layout”. Clustering was done based on the relationship status of friends. The gray cluster represents the friends with no assigned relationship status.
Megan Alicia Winget from University of Texas at Austin gave a theoretical talk entitled “Personal Interaction Archiving: Saving our Attitudes, Beliefs, and Interests” (video), contrasting thing-based behavior (looking at archived objects) with interaction-based behavior (Yelp, Facebook, etc.). She raised a question: what are people saving when they upload their annotations? She is studying the annotations and highlights that people make in ebook tools and then share. Winget said that “The bookmarking tools can be thought of as a new form of commonplace books”.
Jonathan Harris from Cowbird gave a talk entitled “Cowbird: A public library of human experience” (video). He presented Cowbird, a site where people upload stories. He mentioned that his long-term goal is to build a public library of human experience.
Before the last keynote, there were many interesting Lightning Talks. Denim Smith presented milifemap, where people can upload photos and personal videos, create private diaries, and share their thoughts (video). It has a timeline visualization for organizing and presenting people’s personal content.
Christopher Prom from UIUC gave a lightning talk entitled “iKive: Towards a Trusted Personal Archives Service” (video). He presented iKive, a research project that lets people easily save their personal digital files to a trusted location.
Carly Strasser from California Digital Library gave a talk entitled “Digital Curation for Excel (DCXL)” (video). She presented the DCXL project, which aims to facilitate publishing, sharing, and organizing data in ways that benefit others. The main result of the project will be an open source add-in for Microsoft Excel that will assist scientists in preparing their Excel data for sharing. Initial ideas include generating metadata, incorporating links to scientific data repositories and their requirements, and using controlled domain-specific vocabularies.
At the end of the day, Brewster Kahle gave a closing keynote entitled “A Data Archiving Service” (video). He gave several examples of archiving initiatives. He estimated the cost of archiving a book page at 25 cents and the cost of storing 1 TB at $2000, which is 40x the raw cost of a 1 TB hard disk.
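The arithmetic behind these estimates is easy to check. The sketch below only rearranges the numbers quoted above; the 300-page book is a hypothetical example, not a figure from the keynote.

```python
# Figures quoted in the keynote summary above.
COST_PER_PAGE = 0.25          # dollars to archive one book page
ARCHIVE_COST_PER_TB = 2000.0  # dollars to store 1 TB in the archive
MARKUP_OVER_RAW_DISK = 40.0   # archive cost as a multiple of raw disk cost


def implied_raw_cost(service_cost, markup):
    """Raw hardware cost implied by a service cost and its markup."""
    return service_cost / markup


def book_cost(pages, cost_per_page=COST_PER_PAGE):
    """Cost of archiving a whole book at a flat per-page rate."""
    return pages * cost_per_page


print(implied_raw_cost(ARCHIVE_COST_PER_TB, MARKUP_OVER_RAW_DISK))  # 50.0
print(book_cost(300))  # 75.0 for a hypothetical 300-page book
```

In other words, the 40x figure implies a raw disk price of about $50 per TB.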
Day 2 started with a keynote by Cathy Marshall from Microsoft Research entitled “Ownership, aggregation and re-use of Personal Data” (video). She opened with an interesting question, “Whose Content is it anyway?”, and an example of how pictures on the public web and social media get reused. She did a study of user behaviors around using and reusing images and found that everyone believes you can keep anything you find online: “It’s yours.” About preserving social media, she said that “people can’t make a go of it on their own. Therefore, we need institutional archives to help with preserving social media”.
The User Studies session started with Sarah Kim from University of Texas at Austin, who gave a talk entitled “What is your plan for your personal digital archives after your lifetime? Learning from individuals” (video). As part of her Ph.D. dissertation, Kim presented the results of case studies of 20 people. She asked them “what is your plan for your personal digital archives after your lifetime?” Most of them had never thought about it before, let alone planned for it. She found that there are reasons to leave something behind, and she identified a few basic categories into which people may fall: delete all; create a condensed version; sort and distribute to designated entities (e.g. kids, colleagues); write in a will what to do, including disposal and access methods; allow a caretaker or others to manage; expect that materials will be lost or deleted.
Debbie Weissmann gave a talk entitled “Personal Archiving in Not Personal Spaces” (video). She gave some examples of people who were fired after tweeting or posting personal opinions or problems about work. She raised an interesting question: “What are the laws concerning that kind of thing?” So far, there are no specific rules about this issue.
Lori Kendall from UIUC gave a talk entitled “Use of Personal Archives: Family History Works” (video). She argued that genealogy is becoming more of a thing due to’s popularity and to the many TV shows and news shows dedicated to it. She also argued that genealogy is cool because it puts you in the middle, as opposed to being “just another node”, and that the individualistic society of Americans can foster some love for the idea of a self-contained ecosystem based on ancestry.
The Academics session started with interesting talks, followed by a panel. John Butler from University of Minnesota gave a talk entitled “Practices in digital scholarship and personal archiving” (video). He conducted a study of research behaviors (a survey of 50 grad students and faculty). He spoke about managing data to ensure that they stay current and can be reused. He presented a data management plan and best practices for preserving data. The key findings: there is a strong diversity in the resources and media used; methods learned in traditional contexts are not easily transferred to a digital context; and researchers have unique collections to share, but they want to do so under personally specified conditions.
Ellysa Stern Cahoy from Penn State University Libraries gave an interesting presentation entitled "Faculty Member as Microlibrarian: Critical Literacies for Personal Scholarly Archiving" (video). Her talk focused more on how students (or perhaps the scholars of tomorrow), rather than faculty, locate and use information. She talked about the ACRL Information Literacy Competency Standards for Higher Education (2000), including the inclusion of effective competencies and a greater connection with K-12 learning standards.
The panel was entitled “What's being Lost, What's being Saved: Practices in digital scholarship and personal archiving” (video). It was shared among Smiljana Antonijevic from Royal Netherlands Academy of Arts and Sciences, John Butler from University of Minnesota, Laura Gurak from University of Minnesota, and Ellysa Stern Cahoy from Penn State University Libraries. Each of them covered a specific area of personal archiving relevant to academia. One of the interesting questions was about archiving emails, because every time we switch to another system, it is hard to preserve emails. Brewster said “IA has moved away from email to Skype. It is dead here, except for official purposes”.
After the panel we had lunch, and then the Post Lunch session started. Jason Scott from Internet Archive gave a very nice talk about his work at IA entitled “Archive Team and the Case of the Widespread Recognition” (video). He mentioned some of his team's achievements during the last year. He said that “Google is a library or archive like a supermarket is a food museum.”
The Commercial Services session started with Maciej Ceglowski's second talk, entitled “The Business of Web Archiving”, again about Pinboard. This time he presented the business models around social bookmarking: charge money (Zootool, Instapaper, Diigo, Pinboard, etc.), burn money by “finding a sponsor” (Delicious, Yahoo Bookmarks, Google Bookmarks), or offer a free service and fail (Magnolia, 3 years; MyWeb, 6 years; Xmarks). He also summarized Pinboard's 2011 financials and the risks Pinboard faces.

The graph on the right shows HTTP requests per minute for two different periods: Dec. 9-11 in blue and Dec. 16-18 in green. The blue bars show web traffic one week before the news that Yahoo would sunset Delicious; the green bars show how traffic spiked immediately after the plans to ‘sunset’ Delicious became public. The image shows how important it is to people to save their bookmarks: at the first sign of danger, people stampeded away to save them.
Jed Lau from Memoir Tree gave a talk entitled “Digital Archive for the Elderly: Facilitating Old-Fashioned Storytelling” (video). He started his talk with a story about his grandmother, who told him many stories and whom he missed after she died. This served as an introduction to the company. Memoir Tree is an iPhone app that makes it easy to tell, show, and share history through recorded audio and photos. He ran a live demo with the audience by asking them “what is your favorite ice cream flavor?”, then recorded the answers and showed the result instantly.
Stacy Colleen Kozakavich gave a talk entitled “Every House has a History”. She said “you can research the history of your house yourself”.
The Economics session started around 3:30 with a great talk by David S. H. Rosenthal from Stanford University entitled “Modeling the economics of long-term storage” (video). One of the interesting questions he tried to answer was “Why would we believe that in the future storage costs will drop at varying rates?”.
The Lightning Talks of the second day presented many interesting tools. Matt Zimmerman gave a talk entitled “An Open Source personal data platform” (video). Zimmerman presented Singly, which allows users to put all of their data (personal photos, places, links, contacts) into a single structured place. He also talked about the Locker Project, an open source personal data store, still under active development, where you can collect and store all of your personal digital information.
Eric C. Cook from University of Michigan focused on digital photography in his lightning talk, entitled “Personal Digital Photography and the Implications of Selective Positive Representation” (video).
Jerome McDonough from UIUC gave a lightning talk entitled “Deep Personal Significance: Computer Gaming & the Notion of Significant Properties” (video). He introduced an ongoing research project called Preserving Virtual Worlds 2, funded by the Library of Congress’s National Digital Information Infrastructure and Preservation Program (NDIIPP). The project deals with the digital preservation of complex media. It focuses on finding significant properties for a variety of educational games and game series in order to provide a set of best practices for preserving the materials through virtualization technologies and migration, as well as an analysis of how the preservation process is documented.
In summary, the presentations were thought-provoking and creative. I was happy to meet that mix of archivists, programmers, and researchers, each with a different approach to personal archiving. Many presentations were about tools for preserving personal media. A few presentations were about case studies in which the researchers worked directly with people and reported their findings. I heard about many interesting tools, such as Muse, Pinboard, etc. The conference next year will take place at the University of Maryland.
Quotes from the speakers
  • “Google is a library or archive like a supermarket is a food museum” - Jason Scott
  • “Everyone takes pictures of the wedding, never of the divorce.”
  • “You can research the history of your house yourself.” - Stacy Colleen Kozakavich
  • “Software engineers are now social engineers.” - Jonathan Harris
  • “Everything on the web feels so disposable.” - Jonathan Harris
  • “Sometimes we make things more complicated and we need to simplify.”
  • “The process of figuring out where to put thoughts—has to be something you enjoy.” - Jerry Michalski
  • “The search engine does not replace the need for your own bookmarks.” - Maciej Ceglowski
For more about the conference from other perspectives, please check out the Litbrarian blog, Chris Prom's blog, Ellysa Cahoy's blog, The Wiki Librarian's day1 and day2, the Personal Digital Archives' blog, Mike Ashenfelder's blog, the DCXL's blog post, and #PDA12 on Twitter.
I'll add the videos for the talks later.

(2012-04-08 Update:) I have added links to the video recording of each session.

Saturday, February 11, 2012

2012-02-11: Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone.

The Egyptian revolution on the 25th of January 2011 was unlike any other revolution in history because of the role of social media. Several blogs, Storify entries, web pages, and YouTube channels were created to document the revolution. Several books were even published documenting the 18 days. All of these contributions were made by the public, not historians, utilizing the tools of web 2.0. As a result of all these contributions we have an enormous body of digital content, including thousands of posts, tweets, images, videos and sound files narrating and documenting the revolution. Unfortunately, at the first anniversary of this revolution, over 10% of this digital content is already gone.

Websites like Twitter, YouTube, Facebook, Storify, 1000Memories, Blogger and IAmJan25 have allowed the public to document the events of the revolution in real time. Storify, for example, allows the user to create a story from a time-ordered collection of tweets, links, images, posts, map locations or videos. 1000Memories, on the other hand, allows the user to keep the memory of a loved one who has passed away by creating collections about them, including photos, notes, testimonials, videos and other mementos. IAmJan25 is a website that serves mainly as a hub for all the videos and images about the Egyptian revolution sent to the website administrators.

It is fascinating to read the amalgamated stories assembled from the tweets, Facebook posts, links, images, videos, map-taggings, etc. from the authors who were experiencing and documenting these events as they occurred. These social media contributions give great insight into what happened in the revolution and feed the curiosity of readers by letting them relive those moments with the authors.

Even in the period when Internet and cellular services were shut down, people still took photos and videos, which they later posted on social networks. You can often find videos and images documenting the same incident from multiple angles, which reminded me of the movie "Vantage Point".

As an Egyptian in the WS-DL research group at ODU, I find web preservation of the Revolution of particular interest. Fearing that the legacy was starting to vanish, we conducted an experiment to measure the amount of missing digital artifacts related to the revolution. To do so, we assembled a number of web sites that had a broad mixture of tweets, images, and videos contributed by the general public. Although we cannot say whether this collection is representative of all such resources documenting the revolution, each of these resources was deemed important enough by somebody to have been included in a collection.


As stated earlier, there are several resources that curate the Egyptian Revolution, and we wanted to investigate as many of them as possible. At the same time we needed to diversify our resources and the types of digital artifacts embedded in them. Tweets, videos, images, embedded links, entire web pages and books were included in our investigation. For the sake of consistency, we limited our analysis to resources created within the same time frame. For this purpose, the period from the 20th of January until the 1st of March was selected as our temporal filter. Finally, to remove the possibility of transient errors skewing the results, we repeated our experiment 3 times over a period of three weeks before declaring a resource missing.
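The "three strikes" rule described above can be sketched as follows. This is only an illustration of the idea, not the scripts we actually ran: `checker` stands for any function that returns True when a resource loads, and the rounds would in practice be spread over three weeks rather than run back to back.

```python
def declare_missing(check_results):
    """A resource counts as missing only if it failed every repeated check."""
    return len(check_results) > 0 and not any(check_results)


def missing_after_rounds(urls, checker, rounds=3):
    """Run `checker` on every URL once per round and return the URLs that
    failed in all rounds. Transient failures are forgiven."""
    history = {url: [] for url in urls}
    for _ in range(rounds):
        for url in urls:
            history[url].append(bool(checker(url)))
    return [url for url in urls if declare_missing(history[url])]


# Usage with a stand-in checker (hypothetical resource names):
attempts = {}

def fake_checker(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == "dead-video":
        return False                # never loads
    if url == "flaky-image":
        return attempts[url] > 1    # fails once, then recovers
    return True

print(missing_after_rounds(["ok-link", "flaky-image", "dead-video"], fake_checker))
# ['dead-video']
```

Only the resource that fails in every round is reported, so a server hiccup on one day does not inflate the loss figures.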

Our test collection consisted of:
  • Storify stories
  • The IAmJan25 website
  • The 1000Memories website
  • The book “Tweets from Tahrir”
In the next sections we describe each experiment in detail.

As mentioned earlier, Storify is a website that enables users to create stories by collecting references to other media (for example: tweets, images, videos, links and more) and arranging them in a sequential, time-based manner. For our experiment we collected stories posted by members in our investigation period, from the 20th of January until the 1st of March. In those entries we counted the missing images that we could not view, as shown.

Here is an example of a tweet with a resource (image).

And here is an example of a tweet with a missing resource (image):

And finally here are the results of analyzing Storify:

Storify Results
Entry Type | Number of Missing Entries | Total Number of Entries | Percentage Missing
Videos | 3 | 26 | 11.54%
Images | 19 | 179 | 10.61%
Links | 2 | 17 | 11.76%
Total | 24 | 222 | 10.81%
Total number of tweets: 367
Total number of tweets with images: 177
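For readers who want to re-derive the percentages, here is a small sketch that recomputes the Storify table from the raw counts. The dictionary layout and function names are just one way to organize it.

```python
# (missing, total) counts per entry type, copied from the Storify results.
storify_counts = {
    "Videos": (3, 26),
    "Images": (19, 179),
    "Links": (2, 17),
}


def percent_missing(missing, total):
    """Percentage of entries that could not be retrieved, to two decimals."""
    return round(100.0 * missing / total, 2)


def totals(counts):
    """Sum the missing and total columns across all entry types."""
    missing = sum(m for m, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return missing, total


for entry_type, (missing, total) in storify_counts.items():
    print(entry_type, percent_missing(missing, total))

all_missing, all_total = totals(storify_counts)
print("Total", all_missing, all_total, percent_missing(all_missing, all_total))
```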

Some entire websites, like IAmJan25, were dedicated to serving as a collection hub of media curating the revolution. The administrators of the website received selected videos and images of notable events and actions during the revolution and published them as two separate collections. Those images and videos were selected by users, who vouched for their importance and sent a reference to the resource to the web site. We examined all of those resources and found several missing ones, as shown.

Videos page with missing videos:

And images page with missing images:

We stored the downloaded pages of the entire website (the 130 pages of images and the 85 pages of videos) as of 24/01/2012. The URLs of the videos and images extracted from all those pages can be found there too.

Finally here are the results of the analysis of IAmJan25:

IAmJan25 Results
Resource Type | Number of Missing Resources | Total Number of Resources | Percentage Missing
Videos | 338 | 2387 | 14.16%
Images | 221 | 2928 | 7.55%
Total | 559 | 5315 | 10.52%

At PDA2011 last spring, Jonathan Good gave a talk about his website 1000Memories. He mentioned a special page "" which was created to remember the martyrs of the revolution and to describe their lives. It was a wonderful resource where the families and friends of the martyrs could post pictures, notes, videos and testimonials about their lives. Unfortunately, this entire web site became unavailable sometime between the 18th of July 2011 and the 20th of January 2012, when we first started investigating it.

Using curl, we see that the web site returns an HTTP response of "503 Service Unavailable", although the main site is still available.

hany:/> curl -I
HTTP/1.1 503 Service Unavailable
Server: nginx
Date: Mon, 06 Feb 2012 22:06:47 GMT
Connection: keep-alive
Content-Length: 606
X-Varnish: 414798630
Age: 0
Via: 1.1 varnish

hany:/> curl -I
HTTP/1.1 200 OK
Server: nginx
Date: Mon, 06 Feb 2012 22:10:13 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 200 OK
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Etag: "877b51b486000c9e329c729d8d51793f"
Cache-Control: max-age=0, private, must-revalidate
Set-Cookie: _1000memories_session=BAh7B0kiD3Nlc3Npb25fa
e90f76840d4715d3bb07003e9124bdbe176a; domain=.1000memor; path=/; HttpOnly
X-Runtime: 0.237096
Content-Length: 116483
X-Varnish: 1502886930
Age: 0
Via: 1.1 varnish
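The same check can be scripted. The sketch below mimics `curl -I` with Python's standard library and sorts the status code into a rough bucket; the bucket names are my own labels, and the URL in the usage comment is a placeholder since the original address is not shown above.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Send a HEAD request (like `curl -I`) and return (status, headers)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers)
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions but still carry
        # the status code and headers the server sent.
        return error.code, dict(error.headers)


def classify(status):
    """Rough bucket for an HTTP status code (labels are illustrative)."""
    if 200 <= status < 300:
        return "available"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "other"


# Usage (placeholder URL, requires network access):
#   status, headers = head_status("http://example.com/")
#   print(status, classify(status))
print(classify(503))  # server error
print(classify(200))  # available
```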

Luckily, the Wayback Machine had several archived copies, the last of which is dated 2011-07-18; after that the resource disappeared. An estimate of the resources lost:
  • Information about 403 martyrs.
  • 81 personal photos.
  • Several aggregated pages for each martyr.

This example shows the importance of preserving a resource, as it might, as in this case, get lost or disappear permanently.
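One way to look for such archived copies programmatically is the Internet Archive's Wayback availability endpoint. The sketch below builds a query and reads the closest snapshot out of a response; the sample response is made up for illustration, with field names based on my understanding of that API, so treat the details as assumptions.

```python
import json
import urllib.parse

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp):
    """Build a query URL asking for the snapshot closest to `timestamp`
    (format YYYYMMDD)."""
    params = urllib.parse.urlencode({"url": page_url, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + params


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) of the closest capture, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


# Illustrative response for a placeholder page, not real data.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20110718000000",
            "url": "http://web.archive.org/web/20110718000000/http://example.com/",
        }
    }
})

print(availability_query("http://example.com/", "20110718"))
print(closest_snapshot(sample))
print(closest_snapshot('{"archived_snapshots": {}}'))  # None
```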

Tweets from Tahrir:
Several books were published in the last year documenting the revolution. To bridge the gap between books and digital media we picked a book entitled “Tweets from Tahrir”, which was published later in 2011. As the name states, this book tells a story formed from the tweets people posted during the revolution and the clashes with the past regime. We treated this book as a collection of tweets, focused on the tweeted media (in this case images), and tried to reproduce them.

The image below is from a page in the book showing a snapshot of a tweet with an embedded resource. The embedded resource in this case is an image and is still present at the time of investigation.

In the image below, by contrast, the resource embedded in the tweet (again an image) is gone.

Reading the book, you will notice several photos taken by professional photographers and websites. Those photos were presented in the book courtesy of the photographers. The rest of the photos in the book are images taken by individuals and tweeted to the public during the revolution. We need to state this difference to clarify our results. In our analysis we disregarded the professionally acquired images, as those sources are most keen on preserving their own collections and they regulate their public dissemination. We focussed only on the images published by the public in the form of tweets. Of the 1118 tweets in the book, we found 23 with embedded images. We tried to trace the embedded link in each and reproduce those images. Here are the results of analyzing the whole book:

Tweets from Tahrir Results
Number of Images that we can't reproduce | Number of Tweets with Images | Percentage Missing | Total Number of Tweets | Tweets with Images Percentage
7 | 23 | 30.43% | 1180 | 2.28%

Something worth noting: while analyzing the images in the tweets in the book, we came across one image whose link, when followed, gave a totally unexpected result, an image of Miley Cyrus. After investigating further, it turned out that there was a typo in this tweet as printed in the book. The link in the tweet pointed to a snapshot of Tahrir Square, but the "f" missing from the end of the link in the book made it resolve to the Miley Cyrus tweet instead.


While we cannot claim that the sample we investigated is a statistically significant sample of the entire web collection documenting the revolution, we believe that the number and diversity of the selected resources provide a representative sample. Also we exhaustively analyzed all the resources in this collection and we encourage the public to provide us with other similar resources that fit the profile so we can extend the investigation.

The book Tweets from Tahrir and the 1000Memories website both show the huge importance of web preservation. Following a different preservation method, the book did a good job of preserving the resources (in this case images and tweets), yet made them hard to reproduce. The snapshot in the Internet Archive is also a great example of online preservation of important resources; in the case of 1000Memories, it lets us reproduce a resource that is otherwise completely gone.

The survival of a Storify entry, a blog, or a media collection like 1000Memories or IAmJan25 depends entirely on the survival of the resource on the provider's website. Since those entries are included by reference, any change to or loss of the original resource translates into a total loss in the entries of the story. This loss of a resource, for example a video, could be caused by the owner deleting the video, the owner being banned or removed, the owner's subscription expiring, or even the publishing facility going down completely (for example, YouTube shutting down).

Here are the final results aggregating all the resources from all parts of the experiment and accumulating them according to type:

Accumulated Results
Resource Type | Number of Missing Resources | Total Number of Resources | Percentage Missing
Videos | 341 | 2413 | 14.13%
Images | 247 | 3107 | 7.95%
Links | 2 | 17 | 11.76%
Total | 590 | 5537 | 10.66%

In conclusion, after only one year, more than 10% of the media that we thought we had stored for future generations is gone. If the decay continues at the same rate and we do nothing to preserve this digital heritage of the revolution, in less than 10 years there will be no story to tell future generations, and we will lose these magnificent collections that show what thousands of books could not convey.
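The "less than 10 years" figure follows if the same absolute amount disappears every year. The sketch below works that out from the accumulated totals, and, purely as a contrast that is my own assumption rather than part of the study, also shows what would remain if the loss compounded on whatever is still left.

```python
# Accumulated totals from the experiment: 590 missing out of 5537 resources.
MISSING = 590
TOTAL = 5537
annual_loss = MISSING / TOTAL  # roughly 0.1066 after one year


def years_until_gone_linear(loss_per_year):
    """Years until nothing is left if the same absolute share of the
    original collection vanishes each year."""
    return 1.0 / loss_per_year


def fraction_left_compound(loss_per_year, years):
    """Fraction remaining if each year removes the same share of what is
    still left (an alternative assumption, not the one used in the text)."""
    return (1.0 - loss_per_year) ** years


print(round(years_until_gone_linear(annual_loss), 1))     # about 9.4 years
print(round(fraction_left_compound(annual_loss, 10), 2))  # about 0.32
```

Under the linear reading the collection is gone in a little over nine years; under the compounding reading roughly a third would survive a decade, which is still a heavy loss.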

- Hany SalahEldeen

Sunday, February 5, 2012

2012-02-05: Superbowl 46

Superbowl 46 is today, and whether you love football or just watch for the commercials, you are in for some entertainment tonight. Tonight's game looks like one of the closest games in recent history.
There is no doubt that New England has a great offense led by Tom Brady. New England has an Offensive Passing Efficiency of 7.65 yards per play compared to a league average of 5.97. The Giants, led by Eli Manning, are not far behind with a Passing Efficiency of 7.32 yards per play. Both teams are in the top five for offensive passing. However, the differences are more dramatic on the defensive side of the house. The Giants have given up 5.97 yards per play, which is the league average. The Patriots rank 29th in pass defense and have given up an average of 6.68 yards per play.
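To put numbers on "more dramatic", here is a quick sketch comparing the gaps on each side of the ball using only the figures quoted above. It is a back-of-the-envelope comparison, not one of the prediction algorithms.

```python
# Passing yards per play, as quoted above.
LEAGUE_AVERAGE = 5.97
passing = {
    "Patriots": {"offense": 7.65, "defense_allowed": 6.68},
    "Giants": {"offense": 7.32, "defense_allowed": 5.97},
}

# How much better the Patriots pass than the Giants.
offense_gap = round(
    passing["Patriots"]["offense"] - passing["Giants"]["offense"], 2
)
# How much more the Patriots give up than the Giants.
defense_gap = round(
    passing["Patriots"]["defense_allowed"] - passing["Giants"]["defense_allowed"], 2
)

print(offense_gap)  # 0.33 yards per play in New England's favor
print(defense_gap)  # 0.71 yards per play in New York's favor
```

The defensive gap is roughly twice the offensive one, which is why the pass defense matters so much in this matchup.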
Running the algorithms the same way we have all year has the Patriots winning the Superbowl. The predicted margin of victory matches the Vegas Line exactly, so this will be a close game. Because this is Science, here are the outputs of the algorithms run in the exact same manner as they have been all year.
Favorite Spread Underdog Discrete Pagerank

Now, with that out of the way, let's add a bit of logic. The Superbowl is not played at either of the teams' home fields. New England is considered the home team, but they are not playing at their home stadium. They are playing at the home stadium of the Indianapolis Colts, and the quarterback for the Colts is Peyton Manning, the brother of Eli Manning, who is the quarterback for the Giants. Additionally, the Patriots are rivals of the Colts, and the home crowd is not likely to be favorably disposed to the Patriots. Following that train of thought, I swapped the home team to the Giants and ran the algorithms again. The Eigen Vector algorithm did not change, as it does not take home team into account. The SVM algorithm switched its vote to the Giants, and the output of the Neural network model dropped below the Vegas Line.
Taking this into account, we think that the Giants will either cover the spread or win the Superbowl outright.

-- Greg Szalkowski