Friday, October 23, 2015

2015-12-22: 60% of Web Annotations are Orphaned or in Danger of Being Orphaned

Figure 1. An Annotation is defined by OAC as a set of connected resources
In our TPDL paper, we studied 6281 highlighted text annotations (out of 7744 annotations) available in the annotation system in January 2015. The main goal was to investigate the prevalence of orphaned annotations, where neither a live Web page nor an archived copy of the web page contains the text that had previously been annotated.

Recently, we applied the same analysis as in our TPDL paper to a larger number of annotations. Figure 2 illustrates that the number of annotations in the system has been increasing since July 2013. Our TPDL paper focused on the 7744 annotations available in January 2015. Our updated paper, available on arXiv, analyzed the 20,133 highlighted text annotations (out of 33,946 total annotations) available in August 2015. In this post, I will focus on reporting the results of our arXiv paper.
Figure 2. January 2015: dataset used in TPDL paper; August 2015: dataset used in arXiv version

Based on my experience analyzing web annotations in the system, I have seen annotations created just to test how the system works (e.g., some annotations contain the tag "test"). Although some annotations may not be beneficial, the majority are valuable to the community in different ways. For example, 9 of the 10 most annotated websites are related to education, academic research, or publishing.

The annotation system offers free accounts that allow users to annotate the Web by, for example, adding tags or notes to highlighted text or to a web page as a whole. It also supports collaborative work by letting users reply to each other's comments, as shown in Figure 3.

Figure 3. Annotating the Web Using the Annotation System

Web pages are not fixed resources; they may change or become unavailable at any time, and these changes can affect the associated annotations. Figure 4 shows the target web page as it appeared in December 2014. The highlighted text “Scientific feedback for Climate Change information online” in the web page was annotated with “After reading about your project at MIT news, I visited your page and ...”. In August 2015, this annotation could no longer be attached to the target web page because the highlighted text no longer appeared on the page, as shown in Figure 5. Although the live Web version of the page has changed and the annotation was in danger of being orphaned, the original version that was annotated has been archived and is available at the Internet Archive. The annotation could be re-attached to this archived resource, or memento.
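To make the idea of an annotation being "attached" concrete, here is a minimal sketch in Python. It assumes that attachment can be approximated by checking whether the highlighted text still occurs in the page text after whitespace and case are normalized; the actual attachment test used in our study may differ. The function names and both sample page texts are made up for illustration; only the highlighted string comes from the example above.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_attached(highlighted_text: str, page_text: str) -> bool:
    """Return True if the highlighted text still occurs in the page text.

    Assumption: a plain substring match on normalized text is a good enough
    stand-in for attachment in this sketch.
    """
    needle = normalize(highlighted_text)
    return bool(needle) and needle in normalize(page_text)


highlight = "Scientific feedback for Climate Change information online"

# Invented page texts, standing in for the December 2014 and August 2015 versions.
page_before = "Welcome!\nScientific feedback for Climate Change   information online"
page_after = "Welcome!\nThis page has been redesigned."

print(is_attached(highlight, page_before))
print(is_attached(highlight, page_after))
```

The same check can be run against the text of a memento to decide whether an annotation can be re-attached to an archived copy.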

Figure 4. The target web page in December 2014
Figure 5. The target web page in August 2015
Because web pages are changing, the status of annotations is also affected. We can classify web annotations into 4 categories based on the attachment to their target live web pages and to mementos:
  • Safe - The annotation can be attached to the target live web page and also to at least one memento. 
  • In Danger - The annotation can be attached to the target live web page but it is not attached to any mementos. In this case, if the live web page is changed such that the associated annotations become unattached, then these annotations, unfortunately, would become orphaned.
  • Re-attached - The annotation is no longer attached to the live web page but, fortunately, it can be reattached to at least one memento from public web archives. 
  • Orphaned - The annotation is neither attached to the live web page nor any mementos.
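The four categories above boil down to two yes/no questions: is the annotation attached to the live web page, and is it attached to at least one memento? A small Python sketch of that mapping follows. The function name and signature are my own for illustration and are not taken from our analysis code.

```python
def classify(attached_to_live: bool, attached_to_memento: bool) -> str:
    """Map the two attachment checks to one of the four annotation categories."""
    if attached_to_live:
        # Still attached to the live page: safe only if an archive also has the text.
        return "Safe" if attached_to_memento else "In Danger"
    # Not attached to the live page: recoverable only through an archive.
    return "Re-attached" if attached_to_memento else "Orphaned"


for live in (True, False):
    for memento in (True, False):
        print(live, memento, classify(live, memento))
```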

Safe and re-attached annotations can be recovered with web archives, so they are in a better situation than the other two categories. We want to make annotations in the second category (In Danger) safe or re-attachable by archiving their target web pages. Unfortunately, we can do nothing about annotations in the Orphaned category. They are lost.

We used the LANL Memento Aggregator to look for archived copies of web pages (mementos) in the public archives. To be more specific, we looked for the mementos closest to each annotation's creation date. In the example shown in Figure 4, we would need to find the closest mementos captured immediately before and after the annotation creation date (December 3, 2014 at 12:47 AM for that web page).

Figure 6(a) shows an example where mementos are available before and after the annotation creation date. In this example, only M1 and M3 will be tested to see if the associated annotations can be re-attached to these mementos. Figure 6(b) shows mementos that are only available before the annotation creation date while Figure 6(c) shows mementos that are only available after the annotation date. Finally, Figure 6(d) shows annotations that have no existing mementos for their target web pages in the web archives.
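The selection of mementos around the annotation creation date can be sketched as follows. This is a simplified illustration that assumes we already have a list of memento capture datetimes for a target page (for example, parsed from a TimeMap); it does not query the aggregator. The function name is hypothetical and the capture datetimes are invented; only the creation date comes from the example above.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def closest_mementos(
    created: datetime, captures: List[datetime]
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the closest capture at or before `created` and the closest after it.

    Either element is None when no memento exists on that side, which covers
    the four situations in Figure 6.
    """
    before = [c for c in captures if c <= created]
    after = [c for c in captures if c > created]
    return (max(before) if before else None, min(after) if after else None)


created = datetime(2014, 12, 3, 0, 47)  # annotation creation date from the example

# Invented capture datetimes for illustration only.
captures = [
    datetime(2014, 10, 1, 8, 0),
    datetime(2014, 11, 20, 13, 30),
    datetime(2015, 1, 5, 9, 15),
    datetime(2015, 6, 30, 22, 0),
]

print(closest_mementos(created, captures))
print(closest_mementos(created, []))
```

Only the two mementos returned here would then be tested for attachment, rather than every capture in the archives.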

Figure 6. Discovering Mementos for Annotations' Target Web Pages
After we discovered the mementos closest to each annotation's creation date and checked whether annotations were still attached to their live web pages and to mementos, we reached the conclusion illustrated in Figure 7. It shows that 19% of annotations are orphaned while 41% are in danger of being orphaned. The remaining 40% of annotations are in an acceptable situation: 37% are considered safe, while only 3% can be re-attached using archives. The results also indicate that if mementos are available for an annotation's target web page, there is a high chance that the annotation can be re-attached. In addition, a copy of the same memento can be available in different web archives.

Figure 7. The Status of  Current Annotations

As we can see, with 60% of annotations orphaned or in danger of being orphaned, archiving web pages at the time of annotation is important to avoid orphaned annotations.
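As a quick sanity check on the arithmetic, the short Python snippet below adds up the rounded percentages reported in this post. The variable names are mine; the numbers are the ones given above.

```python
# Rounded shares (in percent) of the highlighted text annotations, as reported above.
status_share = {"Safe": 37, "In Danger": 41, "Re-attached": 3, "Orphaned": 19}

total = sum(status_share.values())
at_risk = status_share["Orphaned"] + status_share["In Danger"]
recoverable = status_share["Safe"] + status_share["Re-attached"]

print(f"total: {total}%")
print(f"orphaned or in danger: {at_risk}%")
print(f"safe or re-attached: {recoverable}%")
```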

-- Mohamed Aturban
