Wednesday, August 1, 2018

2018-08-01: A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages


As commonly seen on Facebook and Twitter, the social card is a type of surrogate that provides clues as to what is behind a URI. In this case, the URI is from Google and the social card makes it clear that the document behind this long URI is directions.
As I described to the audience of Dodging the Memory Hole last year, surrogates provide the reader with some clue of what exists behind a URI. The social card is one type of surrogate. Above we see a comparison between a Google URI and a social card generated from that URI. Unless a reader understands the structure of all URIs at google.com, they will not know what the referenced content is about until they click on it. The social card, on the other hand, provides clues to the reader that the underlying URI provides directions from Old Dominion University to Los Alamos National Laboratory. Surrogates allow readers to pierce the veil of the URI's opaqueness.

With the death of Storify, I've been examining alternatives for summarizing web archive collections. Key to these summaries are surrogates. Several services provide users with embeds: snippets that allow an author to insert a surrogate into the HTML of their blog post or other web page. These containing pages often use the surrogate to further illustrate some concept from the surrounding content. Our research team's blog posts serve as containing pages for embeds all of the time. We typically use embeddable surrogates of tweets, videos from YouTube, and presentations from Slideshare, but surrogates can be generated for a variety of other resources as well. Unfortunately, not all services generate good surrogates for mementos. After some reading, I came to the conclusion that we can fill this gap with our own embeddable surrogate service: MementoEmbed.


A recent WS-DL blog post containing embeddable surrogates of Slideshare presentations.


An example MementoEmbed social card for a memento of the Blast Theory home page. The card's title reads "Blast Theory" and its description reads "Sam Pearson and Clara Garcia Fraile are in residence for one month working on a new project called In My Shoes. They are developin"


MementoEmbed is the first archive-aware embeddable surrogate service. This means it can include memento-specific information such as the memento-datetime, the archive from which a memento originates, and the memento's original resource domain name. In the MementoEmbed social card above, we see the following information:
  • from the resource itself:
    • title — "Blast Theory"
    • a description conveying some information of what the resource is about — "Sam Pearson and Clara Garcia..."
    • a striking image from the resource conveying some visual aspect of aboutness
    • its original web site favicon — the bold "B" in the lower left corner
    • its original domain name — "BLASTTHEORY.CO.UK"
    • its memento-datetime — 2009-05-22T22:12:51Z
    • a link to its current version — under "Current version"
    • a link to other versions — under "Other Versions"
  • from the archive containing the resource:
    • the domain name of the archive — "WEBARCHIVE.ORG.UK"
    • the favicon of the archive — the white "UKWA" on the aqua background
    • a link to the memento in the archive — accessible via the links in the title and the memento-datetime
Most of this information is not provided by services for live web resources, such as Embed.ly.

MementoEmbed is a deployable service that currently generates social cards, like the one above, and thumbnails. As with most software I announce, MementoEmbed is still in its alpha prototype phase, meaning that crashes and poor output are to be expected. A bleeding-edge demo is available at http://mementoembed.ws-dl.cs.odu.edu. The source code is available from https://github.com/oduwsdl/MementoEmbed. Its documentation is growing at https://mementoembed.readthedocs.io/en/latest/.

In spite of its conceptual simplicity, MementoEmbed is an ambitious project, requiring that it support parsing and processing not only of the web concepts and technologies of today, but of all that have ever existed. With this breadth in mind, I know that MementoEmbed does not yet handle all memento cases, but that is where you can help by submitting issue reports that help us improve it.

But why use MementoEmbed instead of some other service? What are the goals of MementoEmbed? How does it work? What does the future of MementoEmbed look like?

Why MementoEmbed?


Why should someone use MementoEmbed and not some other embedding service? I reviewed several embedding services mentioned on the web. The examples in this section will demonstrate some embeds using a memento of the New York Times front page from 2005 preserved by the Internet Archive, shown below.

This is a screenshot of the example New York Times memento used in the rest of this section. Its memento-datetime is June 2, 2005 at 19:45:24 GMT and it is preserved by the Internet Archive. This page was selected because it contains a lot of content, including images.
I reviewed Embed.ly, embed.rocks, Iframely, noembed, microlink, and autoembed. As of this writing, the autoembed service appears to be gone. The noembed service only provides embeds for a small number of web sites and does not support web archives. Iframely responds with errors for memento URIs, as shown below.
Iframely fails to generate an embed for a memento of a New York Times page at the Internet Archive. The error message is misleading. There are multiple images on this page.
What the Iframely parsers see for this memento according to their web application.
What Iframely generates for the current New York Times web page (as of July 29, 2018 at 18:23:15 GMT).


Embed.ly, embed.rocks, and microlink are the only services that attempt to generate embeds for mementos. Unfortunately, none of them are fully archive-aware. One of the goals of a good surrogate is to convey some level of aboutness with respect to the underlying web resource. Mementos are documents with their own topics. They are typically not about the archives that contain them. Intermixing these two concepts of document content and archive information, without clear separation, produces surrogates that can confuse users. The microlink screenshot below shows an embed that fails to convey the aboutness of its underlying memento. The microlink service is not archive-aware. In this example, microlink mixes the Internet Archive favicon and Internet Archive banner with the title from the original resource. The embed.rocks example below does not fare much better, appearing to attribute the New York Times article to web.archive.org. What is the resource behind this surrogate really about? This mixing of resources weakens the surrogate's ability to convey the aboutness of the memento.

As seen in the screenshot of a social card for our example New York Times memento from 2005, microlink conflates original resource information and archive information.
The embed.rocks social card does not fare much better, attributing the New York Times page to web.archive.org.

Embed.ly does a better job, but still falls short. In the screenshot below, an embed was created for the same resource. It contains the title of the resource as well as a short description and even a striking image from the memento itself. Unfortunately, it contains no information about the original resource, potentially implying that someone at archive.org is serving content for the New York Times. Even worse, in a world where readers are concerned about fake news, this surrogate may lead an informed reader to believe that this is a link to a counterfeit resource because it does not come from nytimes.com.
This screenshot of an embed for the same New York Times memento shows how well embed.ly performs. While the image and description convey more of the original resource's aboutness, the only attribution information given is about the archive.
Below, the same resource is represented as a social card in MementoEmbed. MementoEmbed chose the New York Times logo as the striking image for this page. This card incorporates elements used in other surrogates, such as the title of the page, a description, and a striking image pulled from the page content. Further down, I annotate the card and show how the information exists in separate areas of the card. MementoEmbed places archive information and the original resource information into their own areas of the card, visually providing separation between these concepts to reduce confusion.

A screenshot of the same New York Times memento in MementoEmbed.



This is not to imply that cards generated by Embed.ly or other services should not be used, just that they appear to be tailored to live web resources. MementoEmbed is strictly designed for use with mementos and strives to occupy that space.

Goals of MementoEmbed


MementoEmbed has the following goals in mind.

  1. The system shall provide archive-aware surrogates of mementos
  2. The system shall be deployable by others
  3. Surrogates shall degrade gracefully
  4. Surrogates shall have limited or no dependency on an external service
  5. Not just humans, but machines shall be able to generate surrogates
I have demonstrated how we meet the first goal in the prior section. In the following subsections I'll provide an overview of how well the current service meets these other goals.

Deployable by others



I did not want MementoEmbed to be another centralized service. My goal is that eventually web archives can run their own copies of MementoEmbed. Visitors to those archives will be able to create their own embeds from mementos they find. The embeds can be used in blog posts and other web pages and thus help these archives promote themselves.

MementoEmbed is a Python Flask application that can be run from a Docker container. Again, it is in its alpha prototype phase, but thanks to the expertise of fellow WS-DL member Sawood Alam, others can download the current version from DockerHub.

Type the following to acquire the MementoEmbed Docker image:

docker pull oduwsdl/mementoembed

Type the following to create a container from the image and run it on TCP port 5550:

docker run -it --rm -p 5550:5550 oduwsdl/mementoembed

Inside the container, the service runs on port 5550. The -p flag maps the container's port 5550 to your local port 5550. From there, users can access the service at http://localhost:5550, where they are greeted with the page below.

The welcome page for MementoEmbed.

Surrogates that degrade gracefully



Prior to executing any JavaScript, MementoEmbed's social cards use the blockquote, div, and p tags. After JavaScript, these tags are augmented with styles, images, and other information. This means that if the MementoEmbed JavaScript resource is not available, the social card is still viewable in a browser, as seen below.

A MementoEmbed social card generated for a memento from the Portuguese Web Archive.


The same social card rendered without the associated JavaScript.
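To illustrate the approach, here is a minimal sketch of what such degradable markup might look like. The class names, text, and script URI are hypothetical rather than MementoEmbed's actual output; the script element at the end is what augments the card when JavaScript is available:

<blockquote class="mementoembed" cite="http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/">
  <div class="me-title">
    <p><a href="http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/">Title of the memento</a></p>
  </div>
  <div class="me-description">
    <p>A short description mined from the memento's content.</p>
  </div>
</blockquote>
<script async src="https://example.org/mementoembed.js"></script>

If the script never loads, the browser still renders the blockquote, div, and p elements as readable text.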


Surrogates with limited or no external dependencies


All web resources are ephemeral, and embedding services are no exception. If an embed service fails or otherwise disappears, what happens to its embeds? Consider Embed.ly. The embed code for Embed.ly is typically less than 100 bytes in length. They achieve this small size because their embeds contain the title of the represented page, the represented URI, and a URI to a JavaScript resource. Everything else is loaded from their service via that JavaScript resource. Web page authors trade a small embed code for dependency on an outside service. Once that JavaScript is executed and a page is rendered, the embed grows to around 2kB. What has the web page author using the embed really gained from the small size? They have less to copy and paste, but their page size still grows once rendered. Also, in order for their page to render, it now relies on the speed and dependability of yet another external service. This is why Embed.ly cards sometimes experience a delay when the containing page is being rendered.

Privacy can be another concern. Embedded resources result in additional requests to web servers outside of the one providing the containing page. This means that an embed not only potentially conveys information about which pages it is embedded in, but also who is visiting these pages. If a web page author does not wish to share their audience with an outside service, then they might want to reconsider embeds.

Thinking about this from the perspective of web archives, I decided that MementoEmbed can do better. I started thinking about how its embeds could outlive MementoEmbed while at the same time offering privacy to visiting users.

MementoEmbed offers thumbnails as data URIs so that pages using these thumbnails do not depend on MementoEmbed.
Currently, MementoEmbed provides surrogates either as social cards or thumbnails. In response to requests for thumbnails, MementoEmbed provides an embed as a data URI, as shown above. Data URI support for images in browsers is well established at this point. A web page containing the data URI can render it without relying upon any MementoEmbed service, thus removing an external dependency. Of course, one can also save the thumbnail locally and upload it to their own server.
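For example, a thumbnail returned as a data URI can be pasted into a page as a self-contained img element; the base64 payload below is truncated for illustration:

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..." alt="thumbnail of a memento">

Because the image bytes travel inside the src attribute itself, the browser needs no additional network request to display the thumbnail.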

MementoEmbed offers the option of using data URIs for images and favicons in social cards so that these embedded resources are not dependent on outside services.
For social cards, I tried to take the use of data URIs a step further. As seen in the screenshot above, MementoEmbed allows the user to use data URIs in their social card rather than just relying upon external resources for favicons and images. This makes the embeds larger, but ensures that they do not rely upon external services.

As noted in the previous section, MementoEmbed includes some basic data and simple HTML to allow for degradation. CSS and images are later added by JavaScript loaded from the MementoEmbed service. To eliminate this dependency, I am currently working on an option that will allow the user (or machine) to request an HTML-only social card.

Not just for humans


The documentation provides information on the growing web API that I am developing for MementoEmbed. For the sake of brevity, I will only cover here how a machine can request a social card or a thumbnail.

MementoEmbed uses tactics similar to those of other web archive frameworks. Each service has its own URI "stem", and the URI-M to be operated on is appended to this stem.

Firefox displays a social card produced by the machine endpoint /services/product/socialcard at http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
To request a social card, a URI-M is appended to the endpoint /services/product/socialcard/. For example, consider a system that wants to request a social card for the memento at http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ from the MementoEmbed service running at mementoembed.ws-dl.cs.odu.edu. The client would visit: http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the HTML and JavaScript necessary to render the social card, as seen in the above screenshot.

Firefox displays a thumbnail produced by the machine endpoint /services/product/thumbnail at http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/.
Likewise, to request a thumbnail for the same URI-M from the same service, the machine would visit the endpoint at /services/product/thumbnail at the URI http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ and receive the image as shown in the above Firefox screenshot. The thumbnail service returns thumbnails in the PNG image format.
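As a concrete example, a client could fetch both products with curl, using the -o flag to write each response body to a file:

curl -o socialcard.html "http://mementoembed.ws-dl.cs.odu.edu/services/product/socialcard/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"

curl -o thumbnail.png "http://mementoembed.ws-dl.cs.odu.edu/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"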

Clients can use the Prefer header from RFC 7240 to control the generation of these surrogates. I have written about the Prefer header before, and Mat Kelly is using it in his work as well. Simply put, the client uses the Prefer header to request certain behaviors from a server with respect to a resource. The server responds with a Preference-Applied header indicating which behaviors exist in the response.

For example, to change the width of a thumbnail to 500 pixels, a client would generate a Prefer header containing the thumbnail_width option. If one were to use curl, the HTTP request headers to a local instance of MementoEmbed would look like this, with the Prefer header marked red for emphasis:

GET /services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/ HTTP/1.1
Host: localhost:5550
User-Agent: curl/7.54.0
Accept: */*
Prefer: thumbnail_width=500

And the MementoEmbed service would respond with the following headers, with the Preference-Applied header marked red for emphasis:

HTTP/1.0 200 OK
Content-Type: image/png
Content-Length: 216380
Preference-Applied: viewport_width=1024,viewport_height=768,thumbnail_width=500,thumbnail_height=375,timeout=15,remove_banner=no
Server: Werkzeug/0.14.1 Python/3.6.5
Date: Sun, 29 Jul 2018 21:08:19 GMT

The server indicates that the thumbnail returned has not only a width of 500 pixels, but also a height of 375 pixels. Also included are other preferences used in its creation, like the size of the browser viewport, the number of seconds MementoEmbed waited before giving up on a response from the archive, and whether or not the archive banner was removed.
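For reference, a curl invocation that produces an exchange like the one above might look like the following, where -H adds the Prefer header and -D - dumps the response headers to the terminal:

curl -D - -o thumbnail.png -H "Prefer: thumbnail_width=500" "http://localhost:5550/services/product/thumbnail/http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"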

The social card service also supports preferences for whether or not to use data URIs for images and favicons.

Other service endpoints exist, like /services/memento/archivedata, to provide individual pieces of the information used in social cards. In addition to these services, I am also developing an oEmbed endpoint for MementoEmbed.

Brief Overview of MementoEmbed Internals



Here I will briefly cover some of the libraries and algorithms used by MementoEmbed. The Memento protocol is a key part of what allows MementoEmbed to work. MementoEmbed uses the Memento protocol to discover the original resource domain, locate favicons, and of course to find a memento's memento-datetime.

If metadata is present in HTML meta tags, then MementoEmbed uses those values for the social card. MementoEmbed favors Open Graph metadata tags first, followed by Twitter card metadata, and then resorts to mining the HTML page for items like title, description, and striking image.

Titles are extracted for social cards using BeautifulSoup. The description is generated using readability-lxml. This library provides scores for paragraphs in an HTML document. Based on comments in the readability code, the paragraph with the highest score is considered to be "good content". The highest-scoring paragraph is selected for use in the description and truncated to the first 197 characters so it will fit into the card. If readability fails for some reason, MementoEmbed falls back to building one large paragraph from the content using justext and taking the first 197 characters from it, a process Grusky et al. refer to as Lede-3.
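As a rough sketch of this extraction path (simplified, and not MementoEmbed's exact code), the title and description could be produced as follows:

import requests
from bs4 import BeautifulSoup
from readability import Document  # from the readability-lxml package

urim = "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/"
html = requests.get(urim).text

# title extraction with BeautifulSoup
title = BeautifulSoup(html, "html.parser").title.string

# readability-lxml returns its highest-scoring "good content" as HTML;
# strip the markup and truncate to 197 characters for the card description
content_html = Document(html).summary()
text = BeautifulSoup(content_html, "html.parser").get_text(" ", strip=True)
description = text[:197]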

Striking image selection is a difficult problem. To support our machine endpoints, I needed to find a method that would select an image without any user intervention. There are several research papers offering different solutions for image selection based on machine learning. I was concerned about performance, so I opted to use some heuristics instead. Currently, MementoEmbed employs an algorithm that scores images using the equation below:

S = k1·(N - n)/N + k2·s - k3·h - k4·r

where S is the score, N is the number of images on the page, n is the current image's position on the page, s is the size of the image in pixels, h is the number of bars in the image histogram containing a value of 0, and r is the ratio of width to height. The variables k1 through k4 are weights. This equation is built on several observations. Images earlier in a page (a low value of n) tend to be more important. Larger images (a high s) tend to be preferred. Images with a histogram consisting of many 0s tend to be mostly text, and are likely advertisements or navigational elements. Images whose width is much greater than their height (a high value of r) tend to be banner ads. For performance, only the first 15 images on a page are scored. If the highest-scoring image meets some threshold, then it is selected. If no images meet that threshold, then the next 15 are loaded and evaluated.
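The sketch below illustrates this heuristic with the Pillow imaging library; the exact combination of terms, the weights, and the selection threshold are stand-ins for whatever the current MementoEmbed release uses:

from PIL import Image

def score_image(im, n, N, k1=1.0, k2=1.0, k3=1.0, k4=1.0):
    # im is the nth of N images discovered on the page
    s = im.width * im.height          # size of the image in pixels
    h = im.histogram().count(0)       # histogram bars containing a value of 0
    r = im.width / im.height          # ratio of width to height
    # earlier and larger images score higher; text-heavy images (many empty
    # histogram bars) and banner-shaped images (high r) are penalized
    return k1 * (N - n) / N + k2 * s - k3 * h - k4 * r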

The thumbnails are generated by a call from Flask to puppeteer. MementoEmbed includes a Python class that can make this cross-language call, provided the user has puppeteer installed. If requested by the user, MementoEmbed uses its knowledge of various archives to produce a thumbnail without the archive banner. This only works for some archives. For Wayback archives, information for choosing URI-Ms without banners was gathered from Table 9 of John Berlin's master's thesis.
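As a sketch of that cross-language call (the script name and arguments are hypothetical, not MementoEmbed's actual interface), the service could shell out to a puppeteer script using Python's subprocess module:

import subprocess

# invoke a (hypothetical) puppeteer script that loads the URI-M in headless
# Chromium and writes a PNG screenshot, giving up after 15 seconds
subprocess.run(
    ["node", "rasterize.js",
     "http://web.archive.org/web/20180128152127/http://www.cs.odu.edu/~mkelly/",
     "thumbnail.png"],
    check=True, timeout=15)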

The Future of MementoEmbed



MementoEmbed has many possibilities. I have already mentioned that MementoEmbed will support features like an oEmbed endpoint and HTML-only social cards. In the foreseeable future, I will address language-specific issues and problems with certain web constructs, like framesets and ancient character sets. I also foresee the need for additional social card preferences, like changes to width and height, as well as a preference for a vertical rather than horizontal card. One could even use content negotiation to request thumbnails in formats other than PNG.

The striking image selection algorithm will be improved. At the moment, the weights are set to values that work based on my limited testing. It is likely that new weights, a new equation, or even a new algorithm will be employed at some point. Feedback from the community will guide these decisions.

Some other ideas that I have considered involve new forms of surrogates. Simple alterations to existing surrogates are possible, like social cards that contain thumbnails or social cards without any images. More complex concepts like Teevan's Visual Snippets or Woodruff's enhanced thumbnails would require a lot of work, but are possible within the framework of MementoEmbed.

A lot of it will depend on the needs of the community. Thanks to Sawood Alam, Mat Kelly, Grant Atkins, Michael Nelson, and Michele Weigle for already providing feedback. As more people experience MementoEmbed, they will no doubt come up with ideas I had not considered, so please try our demo at http://mementoembed.ws-dl.cs.odu.edu or look at the source code on GitHub at https://github.com/oduwsdl/MementoEmbed. Most importantly, report any issues or ideas to our GitHub issue tracker: https://github.com/oduwsdl/MementoEmbed/issues.


--Shawn M. Jones

Sunday, July 22, 2018

2018-07-22: Tic-Tac-Toe and Magic Square Made Me a Problem Solver and Programmer


"How did you learn programming?", a student asked me in a recent summer camp. Dr. Yaohang Li organized the Machine Learning and Data Science Summer Camp for High School students of the Hampton Roads metropolitan region at the Department of Computer Science, Old Dominion University from June 25 to July 9, 2018. The camp was funded by the Virginia Space Grant Consortium. More than 30 students participated in it. They were introduced to a variety of topics such as Data Structures, Statistics, Python, R, Machine Learning, Game Programming, Public Datasets, Web Archiving, and Docker etc. in the form of discussions, hands-on labs, and lectures by professors and graduate students. I was invited to give a lecture about my research and Docker. At the end of my talk I solicited questions and distributed Docker swag.

The question "How did you learn programming?" led me to draw Tic-Tac-Toe Game and a 3x3 Magic Square on the white board. Then I told them a more than a decade old story of the early days of my bachelors degree when I had recently got my very first computer. One day while brainstorming on random ideas, I realized the striking similarity between the winning criteria of a Tic-Tac-Toe game and sums of 15 using three numbers of a 3x3 Magic Square that uses unique numbers from one to nine. The similarity has to do with their three rows, three columns, and two diagonals. After confirming that there are only eight combinations of selecting three unique numbers from one to nine whose sum is 15, I was sure that those are all placed at strategic locations in a magic square and there is no other possibility left for another such combination. If we assign values to each block of the Tic-Tac-Toe game according the Magic Square and store list of values acquired by the two players, we can decide potential winning moves in the next step by trying various combinations of two acquired vales of a player and subtracting it from 15. For example, if places 4 and 3 are acquired by the red (cross sign) player then a potential winning move would be place 8 (15-4-3=8). With this basic idea of checking potential wining move, when the computer is playing against a human, I could set strategies of first checking for the possibility of winning moves by the computer and if none are available then check for the possibility of the next winning moves by the human player and block them. While there are many other approaches to solve this problem, my idea was sufficient to get me excited and try to write a program for it.

By that time, I had only a basic understanding of programming constructs such as variables, conditions, loops, and functions in the C programming language, learned as part of the introductory Computer Science curriculum. While C is a great language for many reasons, it was not an exciting language for me as a beginner. If I were to write the Tic-Tac-Toe game in C, I would have ended up writing something with a text-based user interface in the terminal, which is not what I was looking for. I asked someone about the possibility of writing software with a graphical user interface (GUI) and he suggested that I try Visual Basic. So I went to the library, got a book on VB6, and studied it for about a week. Now I was ready to create a small window with nine buttons arranged in a 3x3 grid. When one of these buttons was clicked, a colored label (a circle or a cross) would be placed and a callback function would be called with an argument associated with the value of the button's position (as per the Magic Square arrangement). The callback function could then update states and play the next move. Later, the game was improved with different modes and settings.

One day, I excitedly shared my program and approach with a professor (who is now working for Microsoft). He said this technique is explored in an algorithms book too. This made me feel a little underwhelmed, because I was not the first one to come up with the idea. However, I was equally happy that I had discovered it independently and that it had already been validated by some smart people.

This was not the only occasion when I had an idea and needed the right tool to express it. Over time, my curiosity led me to many more challenges, ideas for potential solutions, and the exploration of numerous suitable tools, techniques, and programming languages.


My talk was scheduled for Wednesday, June 27, 2018. I started by introducing myself, the WS-DL Research Group, and the basics of Web Archiving, and then briefly talked about my Archive Profiling research. Without going too much into the technical details, I tried to explain the need for Memento Routing and how Archive Profiles can help to achieve it.


Luckily, Dr. Michele Weigle had already introduced Web Archiving to them the day before my talk. When I started mentioning web archives, they knew what I was talking about. This helped me cut my talk down and save some time for other topics and the Q/A session.


I then put my Docker Campus Ambassador hat on and started with the Dockerizing ArchiveSpark story. Then I briefly described what Docker is, where it can be useful, and how it works. I walked them through a code example to illustrate the procedure of working with Docker. As expected, it was their first encounter with Docker, and many of them had no experience with the Linux operating system either, so I tried to keep things as simple as possible.


I had a lot of stickers and some leftover T-shirts from my previous Docker event, so I gave them to those who asked any questions. A couple of days later, Dr. Li told me that the students were very excited about Docker and especially those T-shirts, so I decided to give a few more of them away. For that, I asked the students a few questions related to my earlier talk, and whoever was able to recall the answers got a T-shirt.


Overall, I think it was a successful summer camp. I am positive that those high school students had a great learning experience and exposure to some research techniques that can be helpful in their careers, and some of them might be encouraged to pursue a graduate degree one day. Being a research university, ODU is enriched with many talented graduate students with a variety of expertise and experiences that can benefit the community at large. I think more such programs should be organized in the Department of Computer Science and various other departments of the university.


It was a fun experience for me, as I interacted with high school students here in the USA for the first time. They were all energetic, excited, and engaging. Good luck to all who were part of this two-week-long event. And now you know how I learned programming!

--
Sawood Alam

Wednesday, July 18, 2018

2018-07-18: HyperText and Social Media (HT) Trip Report


Leaping Tiger statue next to the College of Arts at Towson University
From July 9 - 12, the 2018 ACM Conference on Hypertext and Social Media (HT) took place at the College of Arts at Towson University in Baltimore, Maryland. Researchers from around the world presented the results of complete or ongoing work in tutorial, poster, and paper sessions. Also, during the conference I had the opportunity to present a full paper: "Bootstrapping Web Archive Collections from Social Media" on behalf of co-authors Dr. Michele Weigle and Dr. Michael Nelson.

Day 1 (July 9, 2018)


The first day of the conference was dedicated to a tutorial (Efficient Auto-generation of Taxonomies for Structured Knowledge Discovery and Organization) and three workshops:
  1. Human Factors in Hypertext (HUMAN)
  2. Opinion Mining, Summarization and Diversification
  3. Narrative and Hypertext
I attended the Opinion Mining, Summarization and Diversification workshop. The workshop started with a talk titled: "On Reviews, Ratings and Collaborative Filtering," presented by Dr. Oren Sar Shalom, principal data scientist at Intuit, Israel. Next, Ophélie Fraisier, a PhD student studying stance analysis on social media at Paul Sabatier University, France, presented: "Politics on Twitter : A Panorama," in which she surveyed methods of analyzing tweets to study and detect polarization and stances, as well as election prediction and political engagement.
Next, Jaishree Ranganathan, a PhD student at the University of North Carolina, Charlotte, presented: "Automatic Detection of Emotions in Twitter Data - A Scalable Decision Tree Classification Method."
Finally, Amin Salehi, a PhD student at Arizona State University, presented: "From Individual Opinion Mining to Collective Opinion Mining." He showed how collective opinion mining can help capture the drivers behind opinions as opposed to individual opinion mining (or sentiment) which identifies single individual attitudes toward an item.

Day 2 (July 10, 2018)


The conference officially began on day 2 with a keynote: "Lessons in Search Data" by Dr. Seth Stephens-Davidowitz, a data scientist and NYT bestselling author of: "Everybody Lies."
In his keynote, Dr. Stephens-Davidowitz revealed insights gained from search data ranging from racism to child abuse. He also discussed a phenomenon in which people are likely to lie to pollsters (social desirability bias) but are honest to Google ("Digital Truth Serum") because Google incentivizes telling the truth. The paper sessions followed the keynote with two full papers and a short paper presentation.


The first (full) paper of day 2, in the Computational Social Science session, "Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora," was presented by Shubhanshu Mishra, a PhD student at the iSchool of the University of Illinois at Urbana-Champaign. He showed correlations between sentiment and both user-level and tweet-level metadata by addressing two questions: "Do tweets from users with similar Twitter characteristics have similar sentiments?" and "What meta-data features of tweets and users correlate with tweet sentiment?"
Next, Dr. Fred Morstatter presented a full paper: "Mining and Forecasting Career Trajectories of Music Artists," in which he showed that their dataset generated from concert discovery platforms can be used to predict important career milestones (e.g., signing by a major music label) of musicians.
Next, Dr. Nikolaos Aletras, a research associate at the University College London, Media Futures Group, presented a short paper: "Predicting Twitter User Socioeconomic Attributes with Network and Language Information." He described a method of predicting the occupational class and income of Twitter users by using information extracted from their extended networks.
After a break, the Machine Learning session began with a full paper (Best Paper Runner-Up): "Joint Distributed Representation of Text and Structure of Semi-Structured Documents," presented by Samiulla Shaikh, a software engineer and researcher at IBM India Research Labs.
Next, Dr. Oren Sar Shalom presented a short paper titled: "As Stable As You Are: Re-ranking Search Results using Query-Drift Analysis," in which he presented the merits of using query-drift analysis for search re-ranking. This was followed by a short paper presentation titled: "Embedding Networks with Edge Attributes," by Palash Goyal, a PhD student at University of Southern California. In his presentation, he showed a new approach to learn node embeddings that uses the edges and associated labels.
Another short paper presentation (Recommendation System session) by Dr. Oren Sar Shalom followed. It was titled: "A Collaborative Filtering Method for Handling Diverse and Repetitive User-Item Interactions." He presented a collaborative filtering model that captures multiple complex user-item interactions without any prior domain knowledge.
Next, Ashwini Tonge, a PhD student at Kansas State University presented a short paper titled: "Privacy-Aware Tag Recommendation for Image Sharing," in which she presented a means of tagging images on social media in order to improve the quality of user annotations while preserving user privacy sharing patterns.
Finally, Palash Goyal presented another short paper titled: "Recommending Teammates with Deep Neural Networks."

The day 2 closing keynote, by Leslie Sage, director of data science at DevResults, followed after a break that featured a brief screening of the 2018 World Cup semi-final game between France and Belgium. In her keynote, she presented the challenges experienced in the application of big data toward international development.

Day 3 (July 11, 2018)


Day 3 of the conference began with a keynote: "Insecure Machine Learning Systems and Their Impact on the Web" by Dr. Ben Zhao, Neubauer Professor of Computer Science at the University of Chicago. He highlighted many milestones of machine learning by showing problems it has solved in natural language processing and computer vision. But he also showed that opaque machine learning systems are vulnerable to attack by agents with malicious intent, and he expressed the idea that these critical issues must be addressed, especially given the rush to deploy machine learning systems.
Following the keynote, I presented our full paper: "Bootstrapping Web Archive Collections from Social Media" in the Temporal session. I highlighted the importance of web archive collections as a means of preserving the historical record of important events, and the seeds (URLs) from which they are formed. The seeds are collected by expert curators, but we do not have enough experts to collect seeds in a world of rapidly unfolding events. Consequently, I proposed exploiting the collective domain expertise of web users by generating seeds from social media collections, and showed through a novel suite of measures that seeds generated from social media are similar to those generated by experts.

Next, Paul Mousset, a PhD student at Paul Sabatier University, presented a full paper: "Studying the Spatio-Temporal Dynamics of Small-Scale Events in Twitter," in which he presented his work into the granular identification and characterization of event types on Twitter.
Next, Dr. Nuno Moniz, invited Professor at the Sciences College of the University of Porto, presented a short paper: "The Utility Problem of Web Content Popularity Prediction." He demonstrated that state-of-the-art approaches for predicting web content popularity have been optimized for improving the predictability of average behavior of data: items with low levels of popularity.
Next, Samiulla Shaikh (again) presented the first full paper (Nelson Newcomer Award winner) of the Semantic session: "Know Thy Neighbors, and More! Studying the Role of Context in Entity Recommendation," in which he showed how to efficiently explore a knowledge graph for the purpose of entity recommendation by utilizing contextual information to help in the selection of a subset of entities in a knowledge graph.
Shaikh then presented a short paper: "Content Driven Enrichment of Formal Text using Concept Definitions and Applications," in which he showed a method of making formal text more readable to non-expert users through text enrichment, e.g., highlighting definitions and fetching definitions from external data sources.
Next, Yihan Lu, a PhD student at Arizona State University, presented a short paper: "Modeling Semantics Between Programming Codes and Annotations." He presented the results of investigating a systematic method to examine annotation semantics and its relationship with source code. He also showed their model, which predicts concepts in programming code annotations. Such annotations could be useful to new programmers.
Following a break, the User Behavior session began. Dr. Tarmo Robal, a research scientist at the Tallinn University of Technology, Estonia, presented a full paper: "IntelliEye: Enhancing MOOC Learners' Video Watching Experience with Real-Time Attention Tracking." He introduced IntelliEye, a system that monitors students watching video lessons and detects when they are distracted and intervenes in an attempt to refocus their attention.
Next, Dr. Ujwal Gadiraju, a postdoctoral researcher at L3S Research Center, Germany, presented a full paper: "SimilarHITs: Revealing the Role of Task Similarity in Microtask Crowdsourcing." He presented his findings from investigating the role of task similarity in microtask crowdsourcing on platforms such as Amazon Mechanical Turk and its effect on market dynamics.
Next, Xinyi Zhang, a computer science PhD candidate at UC Santa Barbara, presented a short paper: "Penny Auctions are Predictable: Predicting and profiling user behavior on DealDash." She showed that penny auction sites such as DealDash are vulnerable to modeling and adversarial attacks by showing that both the timing and source of bids are highly predictable and users can be easily classified into groups based on their bidding behaviors.
Shortly after another break, the Hypertext paper sessions began. Dr. Charlie Hargood, senior lecturer at Bournemouth University, UK, and Dr. David Millard, associate professor at the University of Southampton, UK, presented a full paper: "The StoryPlaces Platform: Building a Web-Based Locative Hypertext System." They presented StoryPlaces, an open source authoring tool designed for the creation of locative hypertext systems.
Next, Sharath Srivatsa, a Masters student at International Institute of Information Technology, India, presented a full paper: "Narrative Plot Comparison Based on a Bag-of-actors Document Model." He presented an abstract "bag-of-actors" document model for indexing, retrieving, and comparing documents based on their narrative structures. The model resolves the actors in the plot and their corresponding actions.
Next, Dr. Claus Atzenbeck, professor at Hof University, Germany, presented a short paper: "Mother: An Integrated Approach to Hypertext Domains." He stated that the Dexter Hypertext Reference Model, which was developed to provide a generic model for node-link hypertext systems, does not match the needs of Component-Based Open Hypermedia Systems (CB-OHS), and he proposed how this can be remedied by introducing Mother, a system that implements link support.
The final (short) paper of the day, "VAnnotatoR: A Framework for Generating Multimodal Hypertexts," was presented by Giuseppe Abrami. He introduced a virtual reality and augmented reality framework for generating multimodal hypertexts called VAnnotatoR. The framework enables the annotation and linkage of texts, images and their segments with walk-on-able animations of places and buildings.
The conference banquet at Rusty Scupper followed the last paper presentation. The next HyperText conference was announced at the banquet.

Day 4 (July 12, 2018)


The final day of the conference featured multiple paper presentations.
The day began with a keynote: "The US National Library of Medicine: A Platform for Biomedical Discovery and Data-Powered Health," presented by Elizabeth Kittrie, strategic advisor for data and open science at the National Library of Medicine (NLM). She discussed the roles the NLM serves, such as provider of health data for biomedical research and discovery. She also discussed the challenges that arise from the rapid growth of biomedical data, shifting paradigms of data sharing, as well as the role of libraries in providing access to digital health information.
The Privacy session, consisting exclusively of full papers, followed the keynote. Ghazaleh Beigi, a PhD student at Arizona State University, presented: "Securing Social Media User Data - An Adversarial Approach." She showed a privacy vulnerability that arises from the anonymization of social media data by demonstrating an adversarial attack specialized for social media data.
Next, Mizanur Rahman, a PhD student at Florida International University, presented: "Search Rank Fraud De-Anonymization in Online Systems." The bots and automatic methods session with two full paper presentations followed.
Diego Perna, a researcher at the University of Calabria, Italy, presented: "Learning to Rank Social Bots." Given recent reports about the use of bots to spread misinformation/disinformation on the web in order to sway public opinion, Diego Perna proposed a machine-learning framework for identifying and ranking online social network accounts based on their degree of similarity to bots.
Next, David Smith, a researcher at the University of Florida, presented: "An Approximately Optimal Bot for Non-Submodular Social Reconnaissance." He noted that studies showing how social bots befriend real users as part of an effort to collect sensitive information operate on the premise that the likelihood of users accepting bot friend requests is fixed, a constraint contradicted by empirical evidence. Subsequently, he presented his work addressing this limitation.
The News session began shortly after a break with a full paper (Best Paper Award) presentation from Lemei Zhang, a PhD candidate at the Norwegian University of Science and Technology: "A Deep Joint Network for Session-based News Recommendations with Contextual Augmentation." She highlighted some of the issues news recommendation systems suffer from, such as the fast update rate of news articles and the lack of user profiles. Next, she proposed a news recommendation system that combines user click events within sessions and news contextual features to predict the next click behavior of a user.
Next, Lucy Wang, senior data scientist at Buzzfeed, presented a short paper: "Dynamics and Prediction of Clicks on News from Twitter."
Next, Sofiane Abbar, senior software/research engineer at the Qatar Computing Research Institute, presented via a YouTube video: "To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content." He proposed a new approach for predicting the popularity of news articles before they are published. The approach is based on observations regarding article similarity and topicality, and it complements existing content-based methods.

Next, two full papers (Community Detection session) were presented by Ophélie Fraisier and Amin Salehi. Ophélie Fraisier presented: "Stance Classification through Proximity-based Community Detection." She proposed the Sequential Community-based Stance Detection (SCSD) model for stance (online viewpoint) detection. It is a semi-supervised ensemble algorithm which considers multiple signals that inform stance detection. Next, Amin Salehi presented: "Sentiment-driven Community Profiling and Detection on Social Media." He presented a method of profiling social media communities based on their sentiment toward topics, and proposed a method of detecting such communities and identifying the motives behind their formation.
For pictures and notes complementary to this blog post, see Shubhanshu Mishra's notes.

I would like to thank the organizers of the conference, the hosts, Towson University College of Arts, as well as IMLS for funding our research.
-- Nwala (@acnwala)