Thursday, August 30, 2018

2018-08-30: Excited to Join WS-DL group in ODU!

I am an outlier compared with most computer scientists because I spent 10 years in a field called "Astronomy and Astrophysics". Very few computer scientists have followed the same path as me, transferring from a seemingly unrelated major. But this is where my passion is, so I did it, and I made it!

Right after I graduated with a PhD in 2011, I joined the CiteSeerX group directed by Dr. C. Lee Giles at IST, Penn State University. I worked as a DBA for web crawling at the beginning and soon became the tech leader of the search engine, and recently the Co-PI of an NSF-awarded proposal on CiteSeerX. I spent six years, an unusually long time, as a postdoc and was then promoted to teaching faculty. However, I kept moving on, because I wanted to do research!

Luckily, Michael and Michele did not mind taking the risk and betting on me as a tenure-track faculty member at Old Dominion University. So I accepted the offer and became a member of the Web Science and Digital Libraries group at ODU CS.

I appreciate the many CS faculty members, including but not limited to Dr. Jing He, Dr. Cong Wang, and Dr. Ravi Mukkamala, who helped me before and after I moved to Virginia Beach. Michael and Michele have already given me tremendous guidance on how to be successful. I am also glad to know Dr. Sampath Jayarathna and Dr. Jiangwen Sun as new colleagues. It is unbelievable that Sampath and I submitted our first NSF proposal before the first class began this fall!

I cherish my old friends at Penn State. I also look forward to doing more exciting work in this new position!

Posted by Jian Wu at ECSB, Norfolk, VA

Saturday, August 25, 2018

2018-08-25: Four WS-DL Classes Offered for Fall 2018


Four WS-DL classes are offered for Fall 2018:
Dr. Michele C. Weigle is not teaching this semester.

Our current plan for courses in Spring 2019 is to offer a record five WS-DL courses:
  • CS 432/532 Web Science, Alexander Nwala
  • CS 725/825 Information Visualization, Dr. Michele C. Weigle
  • CS 734/834 Information Retrieval, Dr. Jian Wu
  • CS 795/895 Human-Computer Interaction (HCI), Dr. Sampath Jayarathna
  • CS 795/895 Web Archiving Forensics, Dr. Michael L. Nelson
Note that CS 418, 431, and 432 all count for the CS Web Programming minor.  

--Michael


Wednesday, August 1, 2018

2018-08-01: A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages


As commonly seen on Facebook and Twitter, the social card is a type of surrogate that provides clues as to what is behind a URI. In this case, the URI is from Google and the social card makes it clear that the document behind this long URI is directions.
As I described to the audience of Dodging the Memory Hole last year, surrogates provide the reader with some clue of what exists behind a URI. The social card is one type of surrogate. Above we see a comparison between a Google URI and a social card generated from that URI. Unless a reader understands the structure of all URIs at google.com, they will not know what the referenced content is about until they click on it. The social card, on the other hand, provides clues to the reader that the underlying URI provides directions from Old Dominion University to Los Alamos National Laboratory. Surrogates allow readers to pierce the veil of the URI's opaqueness.

With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. I have discovered that there exist services that provide users with embeds. These embeds allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Our research team blog posts serve as containing pages for embeds all of the time. We typically use embeddable surrogates of tweets, videos from YouTube, and presentations from Slideshare, but surrogates can be generated for a variety of other resources as well. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill in the gap with our own embeddable surrogate service: MementoEmbed.


A recent WS-DL blog post containing embeddable surrogates of Slideshare presentations.


Blast Theory

Sam Pearson and Clara Garcia Fraile are in residence for one month Sam Pearson and Clara Garcia Fraile are in residence for one month working on a new project called In My Shoes. They are developin


MementoEmbed is the first archive-aware embeddable surrogate service. This means it can include memento-specific information such as the memento-datetime, the archive from which a memento originates, and the memento's original resource domain name. In the MementoEmbed social card above, we see the following information:
  • from the resource itself:
    • title — "Blast Theory"
    • a description conveying some information of what the resource is about — "Sam Pearson and Clara Garcia..."
    • a striking image from the resource conveying some visual aspect of aboutness
    • its original web site favicon — the bold "B" in the lower left corner
    • its original domain name — "BLASTTHEORY.CO.UK"
    • its memento-datetime — 2009-05-22T22:12:51 Z
    • a link to its current version — under "Current version"
    • a link to other versions — under "Other Versions"
  • from the archive containing the resource:
    • the domain name of the archive — "WEBARCHIVE.ORG.UK"
    • the favicon of the archive — the white "UKWA" on the aqua background
    • a link to the memento in the archive — accessible via the links in the title and the memento-datetime
Most of this information is not provided by services for live web resources, such as Embed.ly.

MementoEmbed is a deployable service that currently generates social cards, like the one above, and thumbnails. As with most software I announce, MementoEmbed is still in its alpha prototype phase, meaning that crashes and poor output are to be expected. A bleeding edge demo is available at http://mementoembed.ws-dl.cs.odu.edu. The source code is available from https://github.com/oduwsdl/MementoEmbed. Its documentation is growing at https://mementoembed.readthedocs.io/en/latest/.

In spite of its simplicity in concept, MementoEmbed is an ambitious project: it must support parsing and processing not only the web concepts and technologies of today, but also all that have ever existed. With this breadth in mind, I know that MementoEmbed does not yet handle all memento cases, and that is where you can help by submitting issue reports that let us improve it.

But why use MementoEmbed instead of some other service? What are the goals of MementoEmbed? How does it work? What does the future of MementoEmbed look like?

Why MementoEmbed?


Why should someone use MementoEmbed and not some other embedding service? I reviewed several embedding services mentioned on the web. The examples in this section will demonstrate some embeds using a memento of the New York Times front page from 2005 preserved by the Internet Archive, shown below.

This is a screenshot of the example New York Times memento used in the rest of this section. Its memento-datetime is June 2, 2005 at 19:45:24 GMT and it is preserved by the Internet Archive. This page was selected because it contains a lot of content, including images.
I reviewed Embed.ly, embed.rocks, Iframely, noembed, microlink, and autoembed. As of this writing, the autoembed service appears to be gone. The noembed service only provides embeds for a small number of web sites and does not support web archives. Iframely responds with errors for memento URIs, as shown below.
Iframely fails to generate an embed for a memento of a New York Times page at the Internet Archive. The error message is misleading. There are multiple images on this page.
What the Iframely parsers see for this memento according to their web application.
What Iframely generates for the current New York Times web page (as of July 29, 2018 at 18:23:15 GMT).


Embed.ly, embed.rocks, and microlink are the only services that attempt to generate embeds for mementos. Unfortunately, none of them is fully archive-aware. One of the goals of a good surrogate is to convey some level of aboutness with respect to the underlying web resource. Mementos are documents with their own topics. They are typically not about the archives that contain them. Intermixing these two concepts of document content and archive information, without clear separation, produces surrogates that can confuse users. The microlink screenshot below shows an embed that fails to convey the aboutness of its underlying memento. The microlink service is not archive-aware. In this example, microlink mixes the Internet Archive favicon and Internet Archive banner with the title from the original resource. The embed.rocks example below does not fare much better, appearing to attribute the New York Times article to web.archive.org. What is the resource behind this surrogate really about? This mixing of resources weakens the surrogate's ability to convey the aboutness of the memento.

As seen in the screenshot of a social card for our example New York Times memento from 2005, microlink conflates original resource information and archive information.
The embed.rocks social card does not fare much better, attributing the New York Times page to web.archive.org.

Embed.ly does a better job, but still falls short. In the screenshot below, an embed was created for the same resource. It contains the title of the resource as well as a short description and even a striking image from the memento itself. Unfortunately, it contains no information about the original resource, potentially implying that someone at archive.org is serving content for the New York Times. Even worse, in a world where readers are concerned about fake news, this surrogate may lead an informed reader to believe that this is a link to a counterfeit resource because it does not come from nytimes.com.
This screenshot of an embed for the same New York Times memento shows how well embed.ly performs. While the image and description convey more aboutness for the original resource, there is only attribution information about the archive.
Below, the same resource is represented as a social card in MementoEmbed. MementoEmbed chose the New York Times logo as the striking image for this page. This card incorporates elements used in other surrogates, such as the title of the page, a description, and a striking image pulled from the page content. Further down, I annotate the card and show how the information exists in separate areas of the card. MementoEmbed places archive information and the original resource information into their own areas of the card, visually providing separation between these concepts to reduce confusion.

A screenshot of the same New York Times memento in MementoEmbed.



This is not to imply that cards generated by Embed.ly or other services should not be used, just that they appear to be tailored to live web resources. MementoEmbed is strictly designed for use with mementos and strives to occupy that space.

Goals of MementoEmbed


MementoEmbed has the following goals in mind.

  1. The system shall provide archive-aware surrogates of mementos
  2. The system shall be deployable by others
  3. Surrogates shall degrade gracefully
  4. Surrogates shall have limited or no dependency on an external service
  5. Not just humans, but machines shall be able to generate surrogates
I have demonstrated how we meet the first goal in the prior section. In the following subsections I'll provide an overview of how well the current service meets these other goals.

Deployable by others



I did not want MementoEmbed to be another centralized service. My goal is that eventually web archives can run their own copies of MementoEmbed. Visitors to those archives will be able to create their own embeds from mementos they find. The embeds can be used in blog posts and other web pages and thus help these archives promote themselves.

MementoEmbed is a Python Flask application that can be run from a Docker container. Again, it is in its alpha prototype phase, but thanks to the expertise of fellow WS-DL member Sawood Alam, others can download the current version from DockerHub.

Type the following to acquire the MementoEmbed Docker image:

docker pull oduwsdl/mementoembed

Type the following to create a container from the image and run it on TCP port 5550:

docker run -it --rm -p 5550:5550 oduwsdl/mementoembed

Inside the container, the service runs on port 5550. The -p flag maps the container's port 5550 to your local port 5550. From here, the user can access the container at http://localhost:5550, where they are greeted with the page below.

The welcome page for MementoEmbed.

Surrogates that degrade gracefully



Prior to executing any JavaScript, MementoEmbed's social cards use the blockquote, div, and p tags. After JavaScript, these tags are augmented with styles, images, and other information. This means that if the MementoEmbed JavaScript resource is not available, the social card is still viewable in a browser, as seen below.

A MementoEmbed social card generated for a memento from the Portuguese Web Archive.


The same social card rendered without the associated JavaScript.


Surrogates with limited or no external dependencies


All web resources are ephemeral, and embedding services are no exception. If an embed service fails or otherwise disappears, what happens to its embeds? Consider Embed.ly. The embed code for Embed.ly is typically less than 100 bytes in length. They achieve this small size because their embeds contain the title of the represented page, the represented URI, and a URI to a JavaScript resource. Everything else is loaded from their service via that JavaScript resource. Web page authors trade a small embed code for dependency on an outside service. Once that JavaScript is executed and a page is rendered, the embed grows to around 2kB. What has the web page author using the embed really gained from the small size? They have less to copy and paste, but their page size still grows once rendered. Also, in order for their page to render, it now relies on the speed and dependability of yet another external service. This is why Embed.ly cards sometimes experience a delay when the containing page is being rendered.

Privacy can be another concern. Embedded resources result in additional requests to web servers outside of the one providing the containing page. This means that an embed not only potentially conveys information about which pages it is embedded in, but also who is visiting these pages. If a web page author does not wish to share their audience with an outside service, then they might want to reconsider embeds.

Thinking about this from the perspective of web archives, I decided that MementoEmbed can do better. I started thinking about how its embeds could outlive MementoEmbed while at the same time offering privacy to visiting users.

MementoEmbed offers thumbnails as data URIs so that pages using these thumbnails do not depend on MementoEmbed.
Currently, MementoEmbed provides surrogates either as social cards or thumbnails. In response to requests for thumbnails, MementoEmbed provides an embed as a data URI, as shown above. Data URI support for images in browsers is well established at this point. A web page containing the data URI can render it without relying upon any MementoEmbed service, thus removing an external dependency. Of course, one can also save the thumbnail locally and upload it to their own server.
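
To make the idea concrete, here is a minimal Python sketch of how image bytes can be turned into a data URI and placed in an img element. The function names are hypothetical and are not taken from the MementoEmbed source; the sketch only illustrates the general base64 data URI technique.

```python
import base64
import html


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def to_img_tag(image_bytes: bytes, alt_text: str) -> str:
    """Build an img element whose source is embedded in the page itself."""
    # The alt text is escaped so it cannot break out of the attribute.
    safe_alt = html.escape(alt_text, quote=True)
    return f'<img src="{to_data_uri(image_bytes)}" alt="{safe_alt}">'
```

Because the image travels inside the HTML, a page that contains the resulting tag issues no extra request to render it. The cost is size: base64 encoding inflates the image by roughly one third.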

MementoEmbed offers the option of using data URIs for images and favicons in social cards so that these embedded resources are not dependent on outside services.
For social cards, I tried to take the use of data URIs a step further. As seen in the screenshot above, MementoEmbed allows the user to use data URIs in their social card rather than just relying upon external resources for favicons and images. This makes the embeds larger, but ensures that they do not rely upon external services.

As noted in the previous section, MementoEmbed includes some basic data and simple HTML to allow for degradation. CSS and images are later added by JavaScript loaded from the MementoEmbed service. To eliminate this dependency, I am currently working on an option that will allow the user (or machine) to request an HTML-only social card.

Not just for humans


The documentation provides information on the growing web API that I am developing for MementoEmbed. For the sake of brevity, I will talk about how a machine can request a social card or a thumbnail here.

MementoEmbed uses similar tactics to other web archive frameworks. Each service has its own URI "stem" and the URI-M to be operated on is appended to this stem.

Firefox displays a social card produced by the machine endpoint /services/product/socialcard at http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
To request a social card, a URI-M is appended to the endpoint /services/product/socialcard/. For example, consider a system that wants to request a social card for the memento at http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ from the MementoEmbed service running at mementoembed.ws-dl.cs.odu.edu. The client would visit: http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the HTML and JavaScript necessary to render the social card, as seen in the above screenshot.

Firefox displays a thumbnail produced by the machine endpoint /services/product/thumbnail at http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
Likewise, to request a thumbnail for the same URI-M from the same service, the machine would visit the endpoint at /services/product/thumbnail at the URI http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the image as shown in the above Firefox screenshot. The thumbnail service returns thumbnails in the PNG image format.
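
The URI construction described above can be sketched in a few lines of Python. The helper name service_uri is hypothetical; the endpoint paths and example URIs are the ones given in this post. As in those examples, the URI-M is appended as-is, without percent-encoding.

```python
SOCIALCARD_ENDPOINT = "/services/product/socialcard"
THUMBNAIL_ENDPOINT = "/services/product/thumbnail"


def service_uri(base: str, endpoint: str, urim: str) -> str:
    """Append a URI-M to a MementoEmbed service stem."""
    # Normalize slashes so exactly one separates each part.
    return f"{base.rstrip('/')}/{endpoint.strip('/')}/{urim}"


if __name__ == "__main__":
    urim = "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"
    print(service_uri("http://mementoembed.ws-dl.cs.odu.edu", THUMBNAIL_ENDPOINT, urim))
```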

Clients can use the Prefer header from RFC 7240 to control the generation of these surrogates. I have written about the Prefer header before, and Mat Kelly is using it in his work as well. Simply put, the client uses the Prefer header to request certain behavior from a server with respect to a resource. The server responds with a Preference-Applied header indicating which behaviors were applied in the response.

For example, to change the width of a thumbnail to 500 pixels, a client would generate a Prefer header containing the thumbnail_width option. If one were to use curl, the HTTP request headers to a local instance of MementoEmbed would look like this, with the Prefer header on the last line:

GET /services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ HTTP/1.1
Host: localhost:5550
User-Agent: curl/7.54.0
Accept: */*
Prefer: thumbnail_width=500

And the MementoEmbed service would respond with the following headers, including a Preference-Applied header:

HTTP/1.0 200 OK
Content-Type: image/png
Content-Length: 216380
Preference-Applied: viewport_width=1024,viewport_height=768,thumbnail_width=500,thumbnail_height=375,timeout=15,remove_banner=no
Server: Werkzeug/0.14.1 Python/3.6.5
Date: Sun, 29 Jul 2018 21:08:19 GMT

The server indicates that the thumbnail returned has not only a width of 500 pixels, but also a height of 375 pixels. Also included are other preferences used in its creation, like the size of the browser viewport, the number of seconds MementoEmbed waited before giving up on a response from the archive, and whether or not the archive banner was removed.
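
A client written in Python might handle this exchange roughly as follows. This is a sketch using only the standard library; the function names are hypothetical. Joining several preferences with commas follows the general Prefer syntax of RFC 7240 and mirrors the Preference-Applied value shown above, but only the single thumbnail_width preference is demonstrated in this post.

```python
import urllib.request


def build_request(uri: str, preferences: dict) -> urllib.request.Request:
    """Create a GET request that carries preferences in a Prefer header."""
    request = urllib.request.Request(uri)
    if preferences:
        prefer = ",".join(f"{name}={value}" for name, value in preferences.items())
        request.add_header("Prefer", prefer)
    return request


def parse_preference_applied(header_value: str) -> dict:
    """Split a value such as 'a=1,b=2' into {'a': '1', 'b': '2'}."""
    applied = {}
    for item in header_value.split(","):
        name, _, value = item.strip().partition("=")
        if name:
            applied[name] = value
    return applied


def fetch_thumbnail(uri: str, preferences: dict):
    """Return (PNG bytes, applied preferences). Requires a reachable service."""
    with urllib.request.urlopen(build_request(uri, preferences)) as response:
        applied = parse_preference_applied(response.headers.get("Preference-Applied", ""))
        return response.read(), applied
```

Reading Preference-Applied back matters because the server fills in values the client never asked for, such as the thumbnail height derived from the requested width.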

The social card service also supports preferences for whether or not to use data URIs for images and favicons.

Other service endpoints exist, like /services/memento/archivedata, to provide parts of information used in social cards. In addition to these services, I am also developing an oEmbed endpoint for MementoEmbed.

Brief Overview of MementoEmbed Internals



Here I will briefly cover some of the libraries and algorithms used by MementoEmbed. The Memento protocol is a key part of what allows MementoEmbed to work. MementoEmbed uses the Memento protocol to discover the original resource domain, locate favicons, and of course to find a memento's memento-datetime.
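
As an illustration of what the Memento protocol (RFC 7089) exposes, the sketch below reads a memento's Memento-Datetime header and finds the original resource through the Link header relation "original". It is not MementoEmbed's code: the function names are hypothetical, the headers are passed in as a plain dict, and the Link parsing assumes quoted parameter values.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.parse import urlparse

# One Link entry: <target> followed by ;name="value" parameters (quoted values assumed).
LINK_ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
REL_PARAM = re.compile(r'\brel\s*=\s*"([^"]*)"')


def memento_datetime(headers: dict) -> datetime:
    """Parse the HTTP-date in the Memento-Datetime header."""
    return parsedate_to_datetime(headers["Memento-Datetime"])


def original_uri(headers: dict) -> Optional[str]:
    """Return the target of the Link entry whose rel includes 'original'."""
    for match in LINK_ENTRY.finditer(headers.get("Link", "")):
        rel = REL_PARAM.search(match.group(2))
        if rel and "original" in rel.group(1).split():
            return match.group(1)
    return None


def original_domain(headers: dict) -> Optional[str]:
    """Return the host name of the original resource, if one is advertised."""
    uri = original_uri(headers)
    return urlparse(uri).hostname if uri else None
```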

If metadata is present in HTML meta tags, then MementoEmbed uses those values for the social card. MementoEmbed favors Open Graph metadata tags first, followed by Twitter card metadata, and then resorts to mining the HTML page for items like title, description, and striking image.
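
The order of preference can be sketched for the title field as follows. To stay self-contained, this example uses Python's standard html.parser rather than BeautifulSoup, so it is a stand-in for the approach and not the actual MementoEmbed code; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect meta tag values and the text of the title element."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            # Open Graph uses property=, Twitter cards typically use name=.
            key = attributes.get("property") or attributes.get("name")
            content = attributes.get("content")
            if key and content:
                self.meta.setdefault(key.lower(), content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def choose_title(html_text: str) -> str:
    """Prefer og:title, then twitter:title, then the title element."""
    collector = MetaCollector()
    collector.feed(html_text)
    for key in ("og:title", "twitter:title"):
        if key in collector.meta:
            return collector.meta[key]
    return collector.title.strip()
```

The same fallback chain would apply to the description and image fields, with the final step replaced by the mining techniques described next.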

Titles are extracted for social cards using BeautifulSoup. The description is generated using readability-lxml. This library provides scores for paragraphs in an HTML document. Based on comments from the readability code, the paragraph with the highest score is considered to be "good content". The highest scored paragraph is selected for use in the description and truncated to the first 197 characters so it will fit into the card. If readability fails for some reason, MementoEmbed falls back to building one large paragraph from the content using justext and taking the first 197 characters from it, a process Grusky et al. refer to as Lede-3.
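
The selection and truncation logic can be sketched without the libraries themselves. In this hypothetical helper, scored_paragraphs stands in for whatever readability-lxml produces and fallback_text for the text extracted with justext; neither library is called here, and the function name is illustrative.

```python
DESCRIPTION_LIMIT = 197  # character limit stated in the text above


def choose_description(scored_paragraphs, fallback_text: str) -> str:
    """Pick the highest-scored paragraph, or fall back to the page text.

    scored_paragraphs is a list of (score, text) pairs and may be empty.
    """
    if scored_paragraphs:
        _, best = max(scored_paragraphs, key=lambda pair: pair[0])
        return best[:DESCRIPTION_LIMIT]
    # Fallback: collapse whitespace into one large paragraph, then truncate.
    return " ".join(fallback_text.split())[:DESCRIPTION_LIMIT]
```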

Striking image selection is a difficult problem. To support our machine endpoints, I needed to find a method that would select an image without any user intervention. There are several research papers offering different solutions for image selection based on machine learning. I was concerned about performance, so I opted to use some heuristics instead. Currently, MementoEmbed employs an algorithm that scores images using the equation below.



where S is the score, N is the number of images on the page, n is the current image position on the page, s is the size of the image in pixels, h is the number of bars in the image histogram containing a value of 0, and r is the ratio of width to height. The variables k1 through k4 are weights. This equation is built on several observations. Images earlier in a page (a low value of n) tend to be more important. Larger images (a high s) tend to be preferred. Images with a histogram consisting of many 0s tend to be mostly text, and are likely advertisements or navigational elements. Images whose width is much greater than their height (a high value for r) tend to be banner ads. For performance, the first 15 images on a page are scored. If the highest scoring image meets some threshold, then it is selected. If no images meet that threshold, then the next 15 are loaded and evaluated.
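
One linear form consistent with these observations is S = k1(N - n) + k2*s - k3*h - k4*r. That form is an assumption made for illustration and is not necessarily the exact equation MementoEmbed uses. The Python sketch below applies it together with the batch-of-15 threshold search described above. The weights, the threshold, and all names in the code are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageFeatures:
    position: int    # n: position of the image on the page (0 = first)
    size: int        # s: image size in pixels (width * height)
    empty_bins: int  # h: histogram bars containing a value of 0
    ratio: float     # r: width divided by height


def score(img: ImageFeatures, total: int,
          k1: float = 1.0, k2: float = 1e-4, k3: float = 0.1, k4: float = 1.0) -> float:
    """Assumed linear score: reward early, large images; penalize
    text-like histograms and banner-like aspect ratios."""
    return (k1 * (total - img.position) + k2 * img.size
            - k3 * img.empty_bins - k4 * img.ratio)


def select_striking_image(images: List[ImageFeatures], threshold: float = 0.0,
                          batch_size: int = 15) -> Optional[ImageFeatures]:
    """Score images in batches; return the first batch winner that meets the threshold."""
    total = len(images)
    for start in range(0, total, batch_size):
        batch = images[start:start + batch_size]
        best = max(batch, key=lambda img: score(img, total))
        if score(best, total) >= threshold:
            return best
    return None  # no image met the threshold
```

Stopping at the first batch whose winner clears the threshold means later images are never evaluated unless the early ones are poor candidates, which is the performance trade-off the heuristic is after.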

The thumbnails are generated by a call from Flask to Puppeteer. MementoEmbed includes a Python class that can make this cross-language call, provided a user has Puppeteer installed. If requested by the user, MementoEmbed uses its knowledge of various archives to produce a thumbnail without the archive banner. This only works for some archives. For Wayback archives, information for choosing URI-Ms without banners was gathered from Table 9 of John Berlin's Master's thesis.

The Future of MementoEmbed



MementoEmbed has many possibilities. I have already mentioned that MementoEmbed will support features like an oEmbed endpoint and HTML-only social cards. In the foreseeable future, I will address language-specific issues and problems with certain web constructs, like framesets and ancient character sets. I also foresee the need for additional social card preferences, like changes to width and height as well as a preference for a vertical rather than horizontal card. One could even use content negotiation to request thumbnails in formats other than PNG.

The striking image selection algorithm will be improved. At the moment the weights are set at what works based on my limited testing. It is likely new weights, a new equation, or even a new algorithm could be employed at some point. Feedback from the community will guide these decisions.

Some other ideas that I have considered involve new forms of surrogates. Simple alterations to existing surrogates are possible, like social cards that contain thumbnails or social cards without any images. More complex concepts like Teevan's Visual Snippets or Woodruff's enhanced thumbnails would require a lot of work, but are possible within the framework of MementoEmbed.

A lot of it will depend on the needs of the community. Thanks to Sawood Alam, Mat Kelly, Grant Atkins, Michael Nelson, and Michele Weigle for already providing feedback. As more people experience MementoEmbed, they will no doubt come up with ideas I had not considered, so please try our demo at http://mementoembed.ws-dl.cs.odu.edu or look at the source code in GitHub at https://github.com/oduwsdl/MementoEmbed. Most importantly, report any issues or ideas to our GitHub issue tracker: https://github.com/oduwsdl/MementoEmbed/issues.


--Shawn M. Jones