Monday, September 3, 2018

2018-09-03: Trip Report for useR! 2018 Conference




This year I was really lucky to have my abstract and poster accepted for the useR! 2018 conference. useR! is the annual conference of the international R user and developer community. The fourteenth annual conference was held in the Southern Hemisphere, in Brisbane, Australia, from July 10-13, 2018. This four-day conference consisted of nineteen 3-hour tutorials, seven keynote speeches, and more than 200 contributed talks, lightning talks, and posters on using, extending, and deploying R. This year, the program gathered almost 600 users of the data analysis language R, from all corners of the world and from all levels of R expertise.

Distribution map of useR! 2018 participants across the globe


Fortunately, I was also granted a travel scholarship from useR! 2018, which allowed me to attend the conference, including the tutorial sessions, for free (thank you, useR! 2018).

Day 1 (July 10, 2018): Registration and Tutorial

The conference was held at the Brisbane Convention and Exhibition Centre (BCEC). Each participant registered at the secretariat desk and received a goodie bag containing a t-shirt, a pair of socks, and a lanyard (if lucky). The name tags could be picked up from a board where they were ordered by last name.

The Secretariat Desk


T-shirt and name tag from useR! 2018


useR! 2018 was identified with hexagonal shapes, which could be found everywhere at useR! 2018: the name tags, the hex stickers, and of course, the amazing hexwall designed by Mitchell O'Hara-Wild. He also wrote a blog post about how he created the hexwall. There was also a hexwall photo contest, where all conference attendees were asked to take a picture with the hexwall and post it on Twitter with the hashtag #hexwall.

Me and the hexwall


The R tutorials were conducted in parallel sessions from Tuesday to Wednesday morning (July 10-11, 2018). Each participant could attend a maximum of three tutorials. The first tutorial that I attended was Wrangling Data in the Tidyverse by Simon Jackson.

This was my first time using the Tidyverse, and once I got familiar with it, I found it really helpful for data transformation and visualization. Using example data from booking.com, we got hands-on experience with various data wrangling techniques such as handling missing values, reshaping, filtering, and selecting data. The thing I love most about the Tidyverse is the dplyr package. It comes with a very interesting feature, the pipe (%>%), which allows us to chain together many operations.
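
To give a flavor of what that looks like, here is a minimal sketch (my own example using dplyr's built-in starwars data, not the tutorial's booking.com data):

library(dplyr)

# Chain several wrangling steps together with the pipe (%>%)
starwars %>%
  filter(!is.na(height)) %>%                  # handle missing values
  select(name, species, height) %>%           # keep only the columns we need
  group_by(species) %>%
  summarise(mean_height = mean(height)) %>%   # summarise by group
  arrange(desc(mean_height))                  # sort the result

Each verb takes a data frame in and returns a data frame, which is what makes chaining so readable.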

In the second tutorial, Statistical Models for Sport in R by Stephanie Kovalchik, we learned how to use R to implement statistical models that are common in sports statistics. The tutorial consisted of three parts:
  1. web scraping to gather and clean public sports data using RSelenium and rvest
  2. exploring data with graphics
  3. implementing several models: Bradley-Terry paired comparison models, the Pythagorean expectation, Generalized Additive Models, and Forecasting with Bayes (a minimal sketch of the Pythagorean expectation follows this list).
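
As a taste of the modeling part, here is a tiny sketch of the Pythagorean expectation (my own illustration, not Kovalchik's tutorial code): the expected winning fraction of a team based on the points it scores and allows.

# Pythagorean expectation: expected win fraction from points scored and allowed
pythagorean <- function(scored, allowed, exponent = 2) {
  scored^exponent / (scored^exponent + allowed^exponent)
}

pythagorean(750, 680)   # a team scoring 750 and allowing 680: ~0.55
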
During the tutorial sessions, I met three other Indonesians who are currently studying in Australia as Ph.D. students (small world!).
Indonesian students at useR! 2018

Day 2 (July 11, 2018): Tutorial, Opening Ceremony, and Poster Presentation. 

Tutorial

The morning session was filled with tutorials continuing the series that began the day before. I attended the tutorial Follow Me: Introduction to social media analysis in R by Maria Prokofieva, Saskia Freytag, and Anna Quaglieri.

Dr. Maria Prokofieva talked about social media analytics using R

During this 2.5-hour tutorial, we learned how to use the R packages twitteR and rtweet to extract data from Twitter and then convert the tweets in the text column into tokens using tidytext. In general, the whole process is quite similar to what I learned in the Web Science class taught by Dr. Michael Nelson at Old Dominion University (ODU), except that all of the processing is done in R instead of Python. At the end of the session, we were given a challenge to compare tweets mentioning Harry with tweets mentioning Meghan over the royal wedding time period. The answers were to be uploaded to Twitter using the hashtags #useR2018, #rstats, and #socialmediachallenge. All tutorial materials are available on R-Ladies Melbourne's GitHub.
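
As an illustration of that workflow (not the exact tutorial code, and it assumes you have Twitter API credentials configured for rtweet), the collection and tokenization steps look roughly like this:

library(rtweet)
library(tidytext)
library(dplyr)

# Pull recent tweets mentioning Harry, then tokenize the text column
harry <- search_tweets("Harry", n = 1000, include_rts = FALSE)

harry_words <- harry %>%
  select(text) %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # most frequent words first

The same steps applied to a "Meghan" query would give the other half of the comparison challenge.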

R-Ladies Gathering

There was an R-Ladies gathering that took place during lunch after the tutorial session. It was an excellent opportunity to meet other amazing R-Ladies members who have done various projects and research in R and have had their packages published on CRAN. It was really inspiring to hear their stories of promoting gender diversity in the R community. There are 75 R-Ladies groups spread across the globe. Unfortunately, there is no R-Ladies group in Indonesia at this moment. Maybe I should start one?
With Jenny Bryan during the R-Ladies meeting

Opening Ceremony, Keynote Speeches, and Poster Lightning Talk

At 1.30 pm, all conference attendees gathered in the auditorium for the Opening Ceremony. The event started with a Welcome to Country performance by Songwoman Maroochy, followed by an opening speech delivered by the useR! 2018 chief organizer, Professor Di Cook from the Department of Econometrics and Business Statistics at Monash University. In her remarks, Professor Cook encouraged all attendees to enjoy the meeting, learn as much as we can, and be cognizant of ensuring that others have a good experience, too.

Opening speech by Professor Di Cook

By the way, for those who are curious, here's a sneak peek of the Songwoman Maroochy performance.



Next, we had a keynote speech by Steph de Silva, Beyond Syntax, Towards Culture: The Potentiality of Deep Open Source Communities.

After the keynote speech, there was a poster lightning talk session, where every presenter was given a chance to advertise their work, let everyone know what it is about, and encourage people to come and see it during the poster session.
My poster lightning talk
Before ending the opening ceremony, there was another keynote speech by Kelly O'Briant of RStudio titled RStudio 2018 - Who We are and What We Do.

Poster Session.

The poster session wrapped up the day. I am so grateful that useR! 2018 used all-electronic posters, so we did not have to bother printing a large poster and carrying it across the globe all the way to Australia. There were two poster sessions: one on Wednesday evening and another during lunch on Thursday. For the poster presentations, the conference committee provided twenty 47-inch TVs with HDMI connections for our laptops. This way, if someone asked, we could directly run a demo or show a specific part of our code on the TV as well.

At this conference, I presented a poster titled AnalevR: An Interactive R-Based Analysis Environment for Utilizing BPS-Statistics Indonesia Data. The project idea originated from a challenge we face at BPS-Statistics Indonesia. BPS produces a massive amount of strategic data every year. However, these data are still underutilized by public users because of several issues, such as bureaucratic procedures, fees, and the long waiting time to get requested data processed. That is why we introduced AnalevR, an online R-based analysis environment that allows anyone, anywhere, to access BPS data and perform analyses by typing R code in a notebook-like interface and getting the output immediately. The project is still a prototype and currently under development. The poster and the code are available on my GitHub.
Me during the poster session

Day 3 (July 12, 2018): Keynote Speech, Talk, Poster Presentation, and Conference Dinner

The agenda for day 3 was packed with two keynote speeches, several talks, a poster presentation, and the conference dinner.

Keynote Speech

The first keynote speech was The Grammar of Animation by Thomas Lin Pedersen (video, slides). In his speech, Pedersen explained that a visualization falls somewhere among three dimensions of DataViz nirvana: static, interactive, and animated. Each dimension has its own pros and cons. Mara Averick's tweet below gives a clearer illustration of this.
Pedersen implements this grammar concept by rewriting the gganimate package, which extends the ggplot2 package to include descriptions of animation such as transitions, views, and shadows. He made his presentation even more engaging by showing an example that channels Hans Rosling's 200 Countries, 200 Years, 4 Minutes visualization. The example was made using the transition_time() function in the gganimate package.
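
A Rosling-style animation along those lines can be sketched as follows (my reconstruction, assuming the gapminder package for the data; the exact code from the talk may differ):

library(ggplot2)
library(gganimate)
library(gapminder)   # country-level life expectancy and GDP data

p <- ggplot(gapminder,
            aes(gdpPercap, lifeExp, size = pop, colour = continent)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  labs(title = "Year: {frame_time}",
       x = "GDP per capita", y = "Life expectancy") +
  transition_time(year)   # animate the plot over the year variable

animate(p)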

The second keynote speech was Adventures with R: Two Stories of Analyses and a New Perspective on Data by Bill Venables. He discussed two recent analyses, one from psycholinguistics and the other from fisheries, that show the versatility of R for tackling the full range of challenges facing the statistician/modeller adventurer. He also compared Statistics and Data Science and discussed how they relate to each other. The emerging field of data science is not a natural successor of statistics; there are some subtle differences between them. Professor Venables said that both are important and connected domains, but we have to think of them as bifurcating to some extent rather than taking on each other's roles. Things work best when the domain expert and the analyst work hand in hand.
Professor Venables ended his speech by mentioning two quotes that I would like to requote here:

"The relationship between Mathematics and Statistics is like that between chemistry and winemaking. You can bring as much chemistry as you can to winemaking, but it takes more than chemistry to make a drinkable dry red wine." 

"Everyone here is smart, distinguish yourself by being kind."


There was a tribute to Bill Venables at the end of the event.

The Talk Sessions

There were 18 parallel sessions of talks conducted from 10.30 am to 4.50 pm. The sessions were held in three blocks, separated by two tea breaks and one lunch break. I managed to attend eight talks that covered topics in data handling and visualization.
  1. Statistical Inference: A Tidy Approach using R by Chester Ismay.
    Chester Ismay from DataCamp introduced the infer package, which was created to implement common classical inferential techniques in a tidyverse-friendly framework that is expressive of the underlying procedure (a minimal sketch of the workflow appears after this list). There are four main objectives of this package:
    1. Dataframe in, dataframe out
    2. Compose tests and intervals with pipes
    3. Unite computational and approximation methods
    4. Reading a chain of infer code should describe the inferential procedure
  2. Data Preprocessing using Recipes by Max Kuhn.
    Max Kuhn of RStudio gave a talk about the recipes package, which aims at preprocessing data for predictive modeling (a minimal sketch also appears after this list). recipes works in three steps (recipe → prep → bake):
    1. Create a recipe, which is the blueprint of how your data will be processed. No data has been modified at this point.
    2. Prepare the recipe using the training set. 
    3. Bake the training set and the test set. At this step, the actual modification will take place.
  3. Build Scalable Shiny Applications for Employee Attrition Prediction on Azure Cloud by Le Zhang
    Le Zhang of Microsoft delivered a talk about building a model for employee attrition prediction and deploying the analytical solution as a Shiny-based web service on the Azure cloud. The project is available on GitHub.
  4. Moving from Prototype to Production in R: A Look Inside the Machine Learning Infrastructure at Netflix by Bryan Galvin
    Bryan Galvin of Netflix gave the audience a look inside the machine learning infrastructure at Netflix. Galvin briefly explained how Netflix moves models to production using R and a microframework named Metaflow. Here's the link to the slides.
  5. Rjs: Going Hand in Hand with Javascript by Jackson Kwok
    rjs is a package designed for utilizing JavaScript's visualization libraries together with R's modeling packages to build tailor-made interactive apps. I think this package is super cool, and it was an absolute highlight for me at useR! 2018. I will definitely spend some time learning this package. Below is an example of an rjs implementation. Check the complete project on GitHub.
  6. Shiny meets Electron: Turn your Shiny App into a Standalone Desktop App in No Time by Katie Sasso
    Katie Sasso of Columbus Collaboratory shared how the Columbus Collaboratory team overcame the barriers of using Shiny for large enterprise consulting by coupling R Portable and Electron. The result is a Shiny app in a stand-alone executable format. The details of her presentation, along with the source code and a tutorial video, are available on her GitHub.
  7. Combining R and Python with GraalVM by Stepan Sindelar
    Stepan Sindelar of Oracle Labs told us how to combine R and Python into a polyglot application running on GraalVM, which enables us to operate on the same data without copying it when crossing language boundaries.
  8. Large Scale Data Visualization with Deck.gl and Shiny by Ian Hansel.
    Ian Hansel of Verge Labs talked about how to integrate deck.gl, a web data visualization framework released by Uber, with Shiny using the R package deckard.
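
For the first two talks above, here are two minimal sketches of the workflows described (illustrative only, using datasets bundled with the packages rather than the speakers' own examples). First, the infer pipeline, which is dataframe in, dataframe out, composed with pipes:

library(infer)
library(dplyr)

# Simulation-based test of whether the mean of `hours` differs from 40,
# using the gss data bundled with infer
gss %>%
  specify(response = hours) %>%                  # declare the response variable
  hypothesize(null = "point", mu = 40) %>%       # state the null hypothesis
  generate(reps = 1000, type = "bootstrap") %>%  # simulate under the null
  calculate(stat = "mean")                       # compute the statistic per replicate

And the recipes workflow (recipe → prep → bake) on a built-in dataset:

library(recipes)

# 1. Create the recipe: a blueprint only, no data modified yet
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# 2. Prepare the recipe: estimate centers and scales from the training set
prepped <- prep(rec, training = mtcars)

# 3. Bake: the actual modification happens here
baked <- bake(prepped, new_data = mtcars)
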
Conference Dinner

The conference dinner ticket

The conference dinner could only be attended by people who had a ticket. I was fortunate because, as a scholarship recipient, I got a free ticket for the dinner (again, thank you, useR! 2018 and R-Ladies Melbourne). There was a trivia quiz at the end of the dinner. All attendees were grouped by the table they were sitting at and had to team up to answer all the questions on the question sheets. The solution for the quiz can be found here. The teams that won the quiz received free books as prizes.

The conference dinner and the trivia quiz
Day 4 (July 13, 2018): Keynote Speech, Talk, and Closing Ceremony

Keynote Speech

The last day of the conference started with a keynote speech, Teaching R to New Users: From tapply to Tidyverse by Roger Peng. In his talk, Dr. Peng discussed teaching R and selling R to new users. It can be difficult to describe the value proposition of R to someone who has never seen it before: is it an interactive system for data analysis, or is it a sophisticated programming language for software developers? To answer this, Dr. Peng quoted a remark from John Chambers (one of the creators of the S language):

"The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important."

I think this is the beauty of R that attracts me: I did not have to jump directly into developing things, but could instead gradually transition into programming. To sum up, Dr. Peng shared keywords that can be useful in selling R to new users: free, open source, graphics, reproducibility - reporting - automation, R packages + community, RStudio, transferable skills, and jobs ($$).

Some tips for selling R by Dr. Roger Peng
The second keynote speech was R for Psychological Science by Danielle Navarro (video, slides). Dr. Navarro shared her experience teaching R to psychology students. Fear is apparently the main challenge that prevents students from learning. She also talked about the difficulty she faced finding a good textbook for her class, which finally led her to write her own lecture notes. Her notes tried to address student fears by using a relaxed style. This worked so well that she ended up with her own book and won a teaching award. Dr. Navarro ended her talk by encouraging everyone to conquer their fears and climb the mountain of R. It might not be easy to avoid the 'dragon' at the top, but there are always people who will support and help us. She reminded our community that we are stronger when we are kind to each other.
The third and last keynote was Code Smells and Feels by Jenny Bryan. She shared tips and tricks on how to write code elegantly so that it is easier to understand and cheaper to modify. Some code smells even have official names, such as Primitive Obsession and Inappropriate Intimacy.
Here are some tips that I summarize from her talk:
  1. Write simple conditions
  2. Use helper functions
  3. Handle class properly
  4. Return and exit early
  5. Use polymorphism
  6. Use switch() if you need to dispatch different logic based on a string (a minimal sketch follows this list).
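
Here is a small sketch of that last tip (my own example, not one from the talk):

# Dispatch different logic based on a string with switch()
centre <- function(x, type = c("mean", "median", "trimmed")) {
  type <- match.arg(type)
  switch(type,
         mean    = mean(x),
         median  = median(x),
         trimmed = mean(x, trim = 0.1))
}

centre(c(1, 2, 3, 100), "median")   # returns 2.5
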
Besides the three great keynotes above, I also attended several short talks:
  1. Tidy forecasting in R by Rob Hyndman
  2. jstor: An R Package for Analysing Scientific Articles by Thomas Klebel
  3. What is in a name? 20 Years of R Release Management by Peter Dalgaard
  4. Sustainability Community Investment in Action - A Look at Some of the R Consortium Funded Grant Projects and Working Groups by Joseph Rickert
  5. What We are Doing in the R Consortium Funds by various funded researchers

Closing Ceremony

The closing speech was delivered by Professor Di Cook from the Department of Econometrics and Business Statistics at Monash University. There was also a small handover ceremony between Di Cook and Nathalie Vialaneix, who will organize next year's useR! 2019 in Toulouse, France.
At the end of the ceremony, the winners of the hexwall photo contest, who were chosen randomly, were announced.
It was indeed a delightful experience for me. I went home happy, with a list of homework and new packages to learn. For those who did not make it to the useR! 2018 conference, do not feel FOMO: all talks and keynote speeches are posted online on the R Consortium's YouTube account.

I would like to thank Professor Di Cook of Monash University as well as R-Ladies Melbourne for giving me a scholarship and making it possible for me to attend this conference. I also would like to congratulate the entire useR! 2018 organizing committee for their brilliant efforts in making this event a great success. I am really looking forward to joining next year's useR! 2019, which will be held from July 9-12, 2019, in Toulouse, France. So, do not miss the updates: check its website and follow the Twitter account @UseR2019_Conf with the hashtag #useR2019.

@erikaris

2018-09-03: Let's compare memento damage measures!

It is always nice getting a Google Scholar alert that one of my papers has been cited. In this case, I learned that the paper "Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment" (to appear in the ACM Journal of Data and Information Quality) cited a paper that I wrote during my doctoral studies with fellow PhD students Mat Kelly and Hany SalahEldeen and our advisors Michael Nelson and Michele Weigle. More specifically, the Reproducible Web Corpora paper (by Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen, and Martin Potthast) is a very important and well-executed follow-on to our paper "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources" (a best student paper at JCDL 2014 and subsequently published in the International Journal on Digital Libraries).

In this blog post, I will be providing a quick recap and analysis of the Kiesel paper from the perspective of an author of the paper that provides the Brunelle15 metric used as the benchmark measure in the Kiesel 2018 paper.

Sunday, September 2, 2018

2018-09-02: Sampath Jayarathna (Assistant Professor, Computer Science)

I am really excited to be part of Old Dominion University and the WS-DL group. I joined the faculty at Old Dominion University in 2018. Before that, I was a tenure-track assistant professor for two years at California State Polytechnic University (Cal Poly Pomona). I am truly grateful to Frank Shipman, Oleg Komogortsev, Richard Furuta, Dilma Da Silva, and Cecilia Aragon for their help throughout this faculty search. It is sad to say goodbye to my colleagues at Cal Poly, but I am excited to have an amazing bunch of mentors and colleagues here at ODU: Michael, Michele, Nikos, Ravi, Jian, Cong, Shubham, Anne, and many more. It is truly amazing that I was able to team up and put together two NSF proposals (CRII and REU Site) within a short period of time.

I received my Ph.D. in Computer Science from Texas A&M University in 2016, advised by Frank Shipman. I was a member of the Center for the Study of Digital Libraries (CSDL) group. In 2012, I did a 6-month internship at Knowledge Based Systems Inc., College Station, TX, to build a collaborative analysis tool for the JackalFish enterprise search tool. I earned my MS degree from Texas State University-San Marcos in 2010, where I worked with Oleg Komogortsev in the area of oculomotor systems research, eye tracking, and biometrics using eye movements. I spent the summer of 2009 at Lawrence Berkeley National Lab with Cecilia Aragon (currently a professor at UW Seattle) and Deb Agarwal on a very cool eye-movement-based biometrics project.

My undergraduate degree is a B.S. in Computer Science (First Class Honors, similar to the Latin honor summa cum laude) from the University of Peradeniya, Sri Lanka, in 2006.
I am an avid gardener; my wife says I have a “green thumb”, something to do with coming from a tropical island. Most of my plants did not survive the 20-day west-to-east-coast journey.

Sri Lankan "King Coconut"


I grow a variety of vegetables, including tomatoes, watermelon, leafy greens, chilies, and some exotic tropical fruits and veggies. It is exciting to see what I can do with long, hot summers and four-season weather.

My academic Genealogy, Bucket List, Goodreads Bookshelf, YouTube playlist, IMDB lists of favorite TV-shows, and Movies.

--Sampath

Thursday, August 30, 2018

2018-08-30: Excited to Join WS-DL group in ODU!

I am an outlier compared with most computer scientists because I spent 10 years in a field called "Astronomy and Astrophysics". Very few computer scientists have followed the same path as me, transferring from a seemingly unrelated major. But this is where my passion is, so I did it, and I made it!

Right after I received my PhD in 2011, I joined the CiteSeerX group directed by Dr. C. Lee Giles at IST, Penn State University. I worked as a DBA for web crawling at the beginning, soon became the tech lead of the search engine, and recently the Co-PI of an NSF-awarded proposal on CiteSeerX. I spent six years there, an unusually long time, as a postdoc and was then promoted to a teaching faculty position. However, I kept moving on, because I wanted to do research!

Luckily, Michael and Michele did not mind taking the risk and bet on me to be a tenure-track faculty member at Old Dominion University. So I accepted the offer and became a member of the Web Science and Digital Libraries group at ODU CS.

I appreciate the many CS faculty members, including but not limited to Dr. Jing He, Dr. Cong Wang, and Dr. Ravi Mukkamala, who helped me before and after I moved to Virginia Beach. Michael and Michele have already given me tremendous guidance on how to be successful. I am also glad to know Dr. Sampath Jayarathna and Dr. Jiangwen Sun as new colleagues. It is unbelievable that Sampath and I submitted our first NSF proposal before the first class began this fall!

I cherish my old friends at Penn State. I also look forward to doing more exciting work in this new position!

Posted by Jian Wu at ECSB, Norfolk, VA

Saturday, August 25, 2018

2018-08-25: Four WS-DL Classes Offered for Fall 2018


Four WS-DL classes are offered for Fall 2018:
Dr. Michele C. Weigle is not teaching this semester.

Our current plan for courses in Spring 2019 is to offer a record five WS-DL courses:
  • CS 432/532 Web Science, Alexander Nwala
  • CS 725/825 Information Visualization, Dr. Michele C. Weigle
  • CS 734/834 Information Retrieval, Dr. Jian Wu
  • CS 795/895 Human-Computer Interaction (HCI), Dr. Sampath Jayarathna
  • CS 795/895 Web Archiving Forensics, Dr. Michael L. Nelson
Note that CS 418, 431, and 432 all count for the CS Web Programming minor.  

--Michael


Wednesday, August 1, 2018

2018-08-01: A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages


As commonly seen on Facebook and Twitter, the social card is a type of surrogate that provides clues as to what is behind a URI. In this case, the URI is from Google and the social card makes it clear that the document behind this long URI is directions.
As I described to the audience of Dodging the Memory Hole last year, surrogates provide the reader with some clue of what exists behind a URI. The social card is one type of surrogate. Above we see a comparison between a Google URI and a social card generated from that URI. Unless a reader understands the structure of all URIs at google.com, they will not know what the referenced content is about until they click on it. The social card, on the other hand, provides clues to the reader that the underlying URI provides directions from Old Dominion University to Los Alamos National Laboratory. Surrogates allow readers to pierce the veil of the URI's opaqueness.

With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. I have discovered that there exist services that provide users with embeds. These embeds allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Our research team blog posts serve as containing pages for embeds all of the time. We typically use embeddable surrogates of tweets, videos from YouTube, and presentations from Slideshare, but surrogates can be generated for a variety of other resources as well. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill in the gap with our own embeddable surrogate service: MementoEmbed.


A recent WS-DL blog post containing embeddable surrogates of Slideshare presentations.


An example MementoEmbed social card for a memento of the Blast Theory home page. The card's title reads "Blast Theory" and its description reads "Sam Pearson and Clara Garcia Fraile are in residence for one month working on a new project called In My Shoes. They are developin…"

MementoEmbed is the first archive-aware embeddable surrogate service. This means it can include memento-specific information such as the memento-datetime, the archive from which a memento originates, and the memento's original resource domain name. In the MementoEmbed social card above, we see the following information:
  • from the resource itself:
    • title — "Blast Theory"
    • a description conveying some information of what the resource is about — "Sam Pearson and Clara Garcia..."
    • a striking image from the resource conveying some visual aspect of aboutness
    • its original web site favicon — the bold "B" in the lower left corner
    • its original domain name — "BLASTTHEORY.CO.UK"
    • its memento-datetime — 2009-05-22T22:12:51Z
    • a link to its current version — under "Current version"
    • a link to other versions — under "Other Versions"
  • from the archive containing the resource:
    • the domain name of the archive — "WEBARCHIVE.ORG.UK"
    • the favicon of the archive — the white "UKWA" on the aqua background
    • a link to the memento in the archive — accessible via the links in the title and the memento-datetime
Most of this information is not provided by services for live web resources, such as Embed.ly.

MementoEmbed is a deployable service that currently generates social cards, like the one above, and thumbnails. As with most software I announce, MementoEmbed is still in its alpha prototype phase, meaning that crashes and poor output are to be expected. A bleeding edge demo is available at http://mementoembed.ws-dl.cs.odu.edu. The source code is available from https://github.com/oduwsdl/MementoEmbed. Its documentation is growing at https://mementoembed.readthedocs.io/en/latest/.

In spite of its simplicity in concept, MementoEmbed is an ambitious project, requiring that it not only support parsing and processing of the different web concepts and technologies of today, but all that have ever existed. With this breadth of potential in mind, I know that MementoEmbed does not yet currently handle all memento cases, but that is where you can help contribute by submitting issue reports that help us improve it.

But why use MementoEmbed instead of some other service? What are the goals of MementoEmbed? How does it work? What does the future of MementoEmbed look like?

Why MementoEmbed?


Why should someone use MementoEmbed and not some other embedding service? I reviewed several embedding services mentioned on the web. The examples in this section will demonstrate some embeds using a memento of the New York Times front page from 2005 preserved by the Internet Archive, shown below.

This is a screenshot of the example New York Times memento used in the rest of this section. Its memento-datetime is June 2, 2005 at 19:45:24 GMT and it is preserved by the Internet Archive. This page was selected because it contains a lot of content, including images.
I reviewed Embed.ly, embed.rocks, Iframely, noembed, microlink, and autoembed. As of this writing, the autoembed service appears to be gone. The noembed service only provides embeds for a small number of web sites and does not support web archives. Iframely responds with errors for memento URIs, as shown below.
Iframely fails to generate an embed for a memento of a New York Times page at the Internet Archive. The error message is misleading. There are multiple images on this page.
What the Iframely parsers see for this memento according to their web application.
What Iframely generates for the current New York Times web page (as of July 29, 2018 at 18:23:15 GMT).


Embed.ly, embed.rocks, and microlink are the only services that attempt to generate embeds for mementos. Unfortunately, none of them are fully archive-aware. One of the goals of a good surrogate is to convey some level of aboutness with respect to the underlying web resource. Mementos are documents with their own topics. They are typically not about the archives that contain them. Intermixing these two concepts of document content and archive information, without clear separation, produces surrogates that can confuse users. The microlink screenshot below shows an embed that fails to convey the aboutness of its underlying memento. The microlink service is not archive-aware. In this example, microlink mixes the Internet Archive favicon and Internet Archive banner with the title from the original resource. The embed.rocks example below does not fare much better, appearing to attribute the New York Times article to web.archive.org. What is the resource behind this surrogate really about? This mixing of resources weakens the surrogate's ability to convey the aboutness of the memento.

As seen in the screenshot of a social card for our example New York Times memento from 2005, microlink conflates  original resource information and archive information.
The embed.rocks social card does not fare much better, attributing the New York Times page to web.archive.org.

Embed.ly does a better job, but still falls short. In the screenshot below an embed was created for the same resource. It contains the title of the resource as well as a short description and even a striking image from the memento itself. Unfortunately, it contains no information about the original resource, potentially implying that someone at archive.org is serving content for the New York Times. Even worse, in the world where readers are concerned about fake news this surrogate may lead an informed reader to believe that this is a link to a counterfeit resource because it does not come from nytimes.com.
This screenshot of an embed for the same New York Times memento shows how well embed.ly performs. While the image and description convey more aboutness for the original resource, there is only attribution information about the archive.
Below, the same resource is represented as a social card in MementoEmbed. MementoEmbed chose the New York Times logo as the striking image for this page. This card incorporates elements used in other surrogates, such as the title of the page, a description, and a striking image pulled from the page content. Further down, I annotate the card and show how the information exists in separate areas of the card. MementoEmbed places archive information and the original resource information into their own areas of the card, visually providing separation between these concepts to reduce confusion.

A screenshot of the same New York Times memento in MementoEmbed.



This is not to imply that cards generated by Embed.ly or other services should not be used, just that they appear to be tailored to live web resources. MementoEmbed is strictly designed for use with mementos and strives to occupy that space.

Goals of MementoEmbed


MementoEmbed has the following goals in mind.

  1. The system shall provide archive-aware surrogates of mementos
  2. The system shall be deployable by others
  3. Surrogates shall degrade gracefully
  4. Surrogates shall have limited or no dependency on an external service
  5. Not just humans, but machines shall be able to generate surrogates
I have demonstrated how we meet the first goal in the prior section. In the following subsections I'll provide an overview of how well the current service meets these other goals.

Deployable by others



I did not want MementoEmbed to be another centralized service. My goal is that eventually web archives can run their own copies of MementoEmbed. Visitors to those archives will be able to create their own embeds from mementos they find. The embeds can be used in blog posts and other web pages and thus help these archives promote themselves.

MementoEmbed is a Python Flask application that can be run from a Docker container. Again, it is in its alpha prototype phase, but thanks to the expertise of fellow WS-DL member Sawood Alam, others can download the current version from DockerHub.

Type the following to acquire the MementoEmbed Docker image:

docker pull oduwsdl/mementoembed

Type the following to create a container from the image and run it on TCP port 5550:

docker run -it --rm -p 5550:5550 oduwsdl/mementoembed

Inside the container, the service runs on port 5550. The -p flag maps the container's port 5550 to your local port 5550.  From here, the user can access the container at http://localhost:5550 and they are greeted with the page below.

The welcome page for MementoEmbed.

Surrogates that degrade gracefully



Prior to executing any JavaScript, MementoEmbed's social cards use the blockquote, div, and p tags. After JavaScript, these tags are augmented with styles, images, and other information. This means that if the MementoEmbed JavaScript resource is not available, the social card is still viewable in a browser, as seen below.

A MementoEmbed social card generated for a memento from the Portuguese Web Archive.


The same social card rendered without the associated JavaScript.


Surrogates with limited or no external dependencies


All web resources are ephemeral, and embedding services are no exception. If an embed service fails or otherwise disappears, what happens to its embeds? Consider Embed.ly. The embed code for Embed.ly is typically less than 100 bytes in length. They achieve this small size because their embeds contain the title of the represented page, the represented URI, and a URI to a JavaScript resource. Everything else is loaded from their service via that JavaScript resource. Web page authors trade a small embed code for dependency on an outside service. Once that JavaScript is executed and a page is rendered, the embed grows to around 2kB. What has the web page author using the embed really gained from the small size? They have less to copy and paste, but their page size still grows once rendered. Also, in order for their page to render, it now relies on the speed and dependability of yet another external service. This is why Embed.ly cards sometimes experience a delay when the containing page is being rendered.

Privacy can be another concern. Embedded resources result in additional requests to web servers outside of the one providing the containing page. This means that an embed not only potentially conveys information about which pages it is embedded in, but also who is visiting these pages. If a web page author does not wish to share their audience with an outside service, then they might want to reconsider embeds.

Thinking about this from the perspective of web archives, I decided that MementoEmbed can do better. I started thinking about how its embeds could outlive MementoEmbed while at the same time offering privacy to visiting users.

MementoEmbed offers thumbnails as data URIs so that pages using these thumbnails do not depend on MementoEmbed.
Currently, MementoEmbed provides surrogates either as social cards or thumbnails. In response to requests for thumbnails, MementoEmbed provides an embed as a data URI, as shown above. Data URI support for images in browsers is well established at this point. A web page containing the data URI can render it without relying upon any MementoEmbed service, thus removing an external dependency. Of course, one can also save the thumbnail locally and upload it to their own server.

MementoEmbed offers the option of using data URIs for images and favicons in social cards so that these embedded resources are not dependent on outside services.
For social cards, I tried to take the use of data URIs a step further. As seen in the screenshot above, MementoEmbed allows the user to use data URIs in their social card rather than just relying upon external resources for favicons and images. This makes the embeds larger, but ensures that they do not rely upon external services.

As noted in the previous section, MementoEmbed includes some basic data and simple HTML to allow for degradation. CSS and images are later added by JavaScript loaded from the MementoEmbed service. To eliminate this dependency, I am currently working on an option that will allow the user (or machine) to request an HTML-only social card.

Not just for humans


The documentation provides information on the growing web API that I am developing for MementoEmbed. For the sake of brevity, I will talk about how a machine can request a social card or a thumbnail here.

MementoEmbed uses similar tactics to other web archive frameworks. Each service has its own URI "stem" and the URI-M to be operated on is appended to this stem.

Firefox displays a social card produced by the machine endpoint /services/product/socialcard at http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
To request a social card, a URI-M is appended to the endpoint /services/product/socialcard/. For example, consider a system that wants to request a social card for the memento at http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ from the MementoEmbed service running at mementoembed.ws-dl.cs.odu.edu. The client would visit: http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the HTML and JavaScript necessary to render the social card, as seen in the above screenshot.

Firefox displays a thumbnail produced by the machine endpoint /services/product/thumbnail at http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
Likewise, to request a thumbnail for the same URI-M from the same service, the machine would visit the endpoint at /services/product/thumbnail at the URI http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the image as shown in the above Firefox screenshot. The thumbnail service returns thumbnails in the PNG image format.

Clients can use the Prefer header from RFC 7240 to control the generation of these surrogates. I have written about the Prefer header before, and Mat Kelly is using it in his work as well. Simply put, the client uses the Prefer header to request certain behaviors from the server with respect to a resource. The server responds with a Preference-Applied header indicating which behaviors exist in the response.

For example, to change the width of a thumbnail to 500 pixels, a client would generate a Prefer header containing the thumbnail_width option. If one were to use curl, the HTTP request headers to a local instance of MementoEmbed would look like this, with the Prefer header marked red for emphasis:

GET /services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ HTTP/1.1
Host: localhost:5550
User-Agent: curl/7.54.0
Accept: */*
Prefer: thumbnail_width=500

And the MementoEmbed service would respond with the following headers, with the Preference-Applied header marked red for emphasis:

HTTP/1.0 200 OK
Content-Type: image/png
Content-Length: 216380
Preference-Applied: viewport_width=1024,viewport_height=768,thumbnail_width=500,thumbnail_height=375,timeout=15,remove_banner=no
Server: Werkzeug/0.14.1 Python/3.6.5
Date: Sun, 29 Jul 2018 21:08:19 GMT

The server indicates that the thumbnail returned has not only a width of 500 pixels, but also a height of 375 pixels. Also included are other preferences used in its creation, like the size of the browser viewport, the number of seconds MementoEmbed waited before giving up on a response from the archive, and whether or not the archive banner was removed.
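
For completeness, the same request can be made from any HTTP client. Here is a small sketch in R using httr, assuming a local MementoEmbed instance on port 5550 as in the Docker example above:

library(httr)

urim <- "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"

# Ask for a 500-pixel-wide thumbnail via the Prefer header
resp <- GET(paste0("http://localhost:5550/services/product/thumbnail/", urim),
            add_headers(Prefer = "thumbnail_width=500"))

headers(resp)[["preference-applied"]]            # preferences the service applied
writeBin(content(resp, "raw"), "thumbnail.png")  # save the returned PNG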

The social card service also supports preferences for whether or not to use data URIs for images and favicons.

Other service endpoints exist, like /services/memento/archivedata, to provide parts of information used in social cards. In addition to these services, I am also developing an oEmbed endpoint for MementoEmbed.

Brief Overview of MementoEmbed Internals



Here I will briefly cover some of the libraries and algorithms used by MementoEmbed. The Memento protocol is a key part of what allows MementoEmbed to work. MementoEmbed uses the Memento protocol to discover the original resource domain, locate favicons, and of course to find a memento's memento-datetime.

If metadata is present in HTML meta tags, then MementoEmbed uses those values for the social card. MementoEmbed favors Open Graph metadata tags first, followed by Twitter card metadata, and then resorts to mining the HTML page for items like title, description, and striking image.

Titles are extracted for social cards using BeautifulSoup. The description is generated using readability-lxml. This library provides scores for paragraphs in an HTML document. Based on comments in the readability code, the paragraph with the highest score is considered to be "good content". The highest-scoring paragraph is selected for use in the description and truncated to the first 197 characters so it will fit into the card. If readability fails for some reason, MementoEmbed falls back to building one large paragraph from the content using justext and taking the first 197 characters from it, a process Grusky et al. refer to as Lede-3.

Striking image selection is a difficult problem. To support our machine endpoints, I needed to find a method that would select an image without any user intervention. There are several research papers offering different solutions for image selection based on machine learning. I was concerned about performance, so I opted to use some heuristics instead. Currently, MementoEmbed employs an algorithm that scores images using the equation below.



where S is the score, N is the number of images on the page, n is the current image position on the page, s is the size of the image in pixels, h is the number of bars in the image histogram containing a value of 0, and r is the ratio of width to height. The variables k1 through k4 are weights. This equation is built on several observations. Images earlier in a page (a low value of n) tend to be more important. Larger images (a high s) tend to be preferred. Images with a histogram consisting of many 0s tend to be mostly text, and are likely advertisements or navigational elements. Images whose width is much greater than their height (a high value for r) tend to be banner ads. For performance, the first 15 images on a page are scored. If the highest scoring image meets some threshold, then it is selected. If no images meet that threshold, then the next 15 are loaded and evaluated.
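
In rough terms, the score takes a form like the following (a sketch consistent with the description above; the exact expression and weight placement in MementoEmbed may differ):

$$ S = k_1 \frac{N - n}{N} + k_2\, s - k_3\, h - k_4\, r $$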

The thumbnails are generated by a call from Flask to Puppeteer. MementoEmbed includes a Python class that can make this cross-language call, provided the user has Puppeteer installed. If requested by the user, MementoEmbed uses its knowledge of various archives to produce a thumbnail without the archive banner. This only works for some archives. For Wayback-based archives, information for choosing URI-Ms without banners was gathered from Table 9 of John Berlin's master's thesis.

The Future of MementoEmbed



MementoEmbed has many possibilities. I have already mentioned that MementoEmbed will support features like an oEmbed endpoint and HTML-only social cards. In the foreseeable future, I will address language-specific issues and problems with certain web constructs, like framesets and ancient character sets. I also foresee the need for additional social card preferences, like changes to width and height, as well as a preference for a vertical rather than horizontal card. One could even use content negotiation to request thumbnails in formats other than PNG.

The striking image selection algorithm will be improved. At the moment the weights are set at what works based on my limited testing. It is likely new weights, a new equation, or even a new algorithm could be employed at some point. Feedback from the community will guide these decisions.

Some other ideas that I have considered involve new forms of surrogates. Simple alterations to existing surrogates are possible, like social cards that contain thumbnails or social cards without any images. More complex concepts like Teevan's Visual Snippets or Woodruff's enhanced thumbnails would require a lot of work, but are possible within the framework of MementoEmbed.

A lot of it will depend on the needs of the community. Thanks to Sawood Alam, Mat Kelly, Grant Atkins, Michael Nelson, and Michele Weigle for already providing feedback. As more people experience MementoEmbed, they will no doubt come up with ideas I had not considered, so please try our demo at http://mementoembed.ws-dl.cs.odu.edu or look at the source code in GitHub at https://github.com/oduwsdl/MementoEmbed. Most importantly, report any issues or ideas to our GitHub issue tracker: https://github.com/oduwsdl/MementoEmbed/issues.


--Shawn M. Jones