Friday, May 31, 2019

2019-06-03: Metadata on Datasets Saves You Time

When I joined ODU in Spring 2019, my first task was to explore datasets in digital libraries, with the hope of discovering ways to help users find data and to help data find its way to users. This led to some interesting findings that I will elaborate on in this post.

First things first, let's take a look at what tools and platforms are available that attempt to make things easier for users to find and visualize data. A quick Google Search provided a link to this awesome GitHub repository which contains a list of topic-centric public dataset repositories. This collection proved useful to gather the types of dataset descriptions available at present.

The first dataset collection I explored was Kaggle. Here, the most upvoted dataset (as of May 31, 2019) was a CSV file with the topic "Credit Card Fraud Detection". Taking a quick look at the data, the first two columns provide a textual description of the content, but the rest do not. Since I'm not the maintainer of that dataset (hence the term distributed digital collections), I wasn't allowed to contribute improvements to its metadata.
Figure 1: "Credit Card Fraud Detection" Dataset in Kaggle [Link]

One useful feature that's prominent on Kaggle (and on most publicly available datasets) is that they provide textual descriptions of the content. However, the semantics of the data fields and the links between the files in a dataset were either buried in that description or not included at all. Only a handful of datasets actually documented the data fields.
Figure 2: Metadata of "Credit Card Fraud Detection" Dataset in Kaggle [Link]

If you have no expertise in a particular domain but are only interested in using publicly available data to test a hypothesis, encountering datasets with inadequate documentation is inevitable, because the semantics of most publicly available datasets are vague and arcane.

This provided us enough motivation to dig a little deeper to find a way to change this trend in digital libraries. We formulated a metadata schema and envisioned a file system, DFS, which aims to reverse this state of ambiguity and bring more sense to the datasets.

Quoting from our poster publication "DFS: A Dataset File System for Data Discovering Users" on JCDL 2019 [link to paper]:
Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research efficiency. In this paper we propose DFS, a file system to standardize the metadata representation of datasets, and DDU, a scalable architecture based on DFS for semi-automated metadata generation and data recommendation on the cloud. We discuss how DFS and DDU lays groundwork for automatic dataset aggregation, how it integrates with existing data wrangling and machine learning tools, and explores their implications on datasets stored in digital libraries.
We published an extended version of the paper on arXiv [Link] that elaborates on the two components that help achieve our goal:
  • DFS - A metadata-based file system to standardize the metadata of datasets
  • DDU - A data recommendation architecture based on DFS to bring data closer to users
DFS isn't the next new thing; rather, it's a solution to the lack of a systematic way of describing datasets in enough detail to make them sensible to an end user. It provides the means to manage versions of data, and ensures that no important information about the dataset is left out. Most importantly, it provides a machine-understandable format to define dataset schematics. The JSON shown below is a description of a dataset in DFS meta format.
Figure 3: Sample Metafile in DFS (Shortened for Brevity)
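
To give a rough feel for what field-level, machine-readable metadata buys you, here is a minimal sketch in Python. Note that every key in it (name, version, files, fields, unit, and so on) is a hypothetical placeholder chosen for illustration; it is not the actual DFS schema, which is described in the paper. The helper simply reports fields that have no description, which is the gap discussed earlier in this post.

```python
import json

# Hypothetical metafile: these keys are illustrative assumptions,
# NOT the actual DFS schema (see the paper for the real format).
metafile = {
    "name": "example-dataset",
    "version": "1.0.0",
    "description": "Toy dataset used to illustrate field-level metadata.",
    "files": [
        {
            "path": "transactions.csv",
            "fields": [
                {"name": "Time", "type": "number", "unit": "seconds",
                 "description": "Seconds elapsed since the first record."},
                {"name": "Amount", "type": "number", "unit": "USD",
                 "description": "Transaction amount."},
                {"name": "Class", "type": "integer",
                 "description": "1 for fraud, 0 otherwise."},
            ],
        }
    ],
}


def undocumented_fields(meta):
    """Return (file path, field name) pairs that lack a description."""
    missing = []
    for f in meta.get("files", []):
        for field in f.get("fields", []):
            if not field.get("description"):
                missing.append((f["path"], field["name"]))
    return missing


serialized = json.dumps(metafile, indent=2)
print(serialized)
print("Undocumented fields:", undocumented_fields(metafile))
```

Because the description lives next to each field rather than in free text, a tool can check for missing documentation before anyone tries to use the data.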

On the other hand, DDU (or Data Discovering Users) is an architecture that we envisioned to simplify the process of plugging in data to test out hypotheses. Assuming that each dataset has metadata compliant with the proposed DFS schema, the goal was to automate data preprocessing and machine learning, while providing a visualization of the steps taken to reach the final results. So if you are not a domain expert, but still want to test out a hypothesis on that domain, you could easily discover a set of datasets that match your need, plug them into the DDU SaaS, and voila! You just got the results needed to validate your hypothesis, with a visualization of the steps followed to get them.
Figure 4: DDU Architecture

As of now, we are working hard to bring DFS to as many datasets as possible. For starters, we aim to automatically generate DFS metadata for EEG and Eye Tracking data acquired in real time. The goal is to intercept live data from Lab Streaming Layer [Link], and generate metadata as the data files are generated.

But the biggest question is, does this theory hold true for all domains of research? We plan to answer this in our future work.

- Yasith Jayawardana

Wednesday, May 29, 2019

2019-05-29: In The Battle of the Surrogates: Social Cards Probably Win

Web archive collections provide meaning by sampling specific resources from the web. We want to summarize these resources by sampling mementos from those collections and visualizing them as a social media story.
On Tuesday, we released our latest pre-print "Social Cards Probably Provide Better Understanding of Web Archive Collections". My work builds on AlNoamany's work of using social media storytelling to provide a visualization that summarizes web archive collections. In previous blog posts I discussed different storytelling services. A key component of their capability to convey understanding is the surrogate, a small visualization of a web page that provides a summary of that page, like the surrogate within the Twitter Tweet example shown below. However, there are many types of surrogates. We want to use a group of surrogates together as a story to provide a summary of a web archive collection. Which type of surrogate works best for helping users understand the underlying collection?

An annotated tweet containing a surrogate referring to one of my prior blog posts.

Dr. Nelson, Dr. Weigle, and I iterated for several months to produce this study. Using Mechanical Turk, we evaluated six different surrogate types and discovered that the social card, as produced by our MementoEmbed project, probably provides better understanding of the overall collection than the surrogate currently employed in the Archive-It interface.

How Much Information Do We Get From the Surrogates on the Archive-It Collection Page?

As seen in this screenshot, each Archive-It collection page contains surrogates of its seeds. For most collections, how much information do the surrogates provide the user about the collection? (link to collection in screenshot)

Archive-It allows curators to supply optional metadata on seeds. We analyzed how much information might be available to a user viewing such metadata and found that 54.60% of all Archive-It seeds have no metadata. As shown in the scatter plot below, we discovered that, as the number of seeds in a collection increases, the average number of metadata fields decreases.

As the number of seeds increases, we see a decrease in the mean number of metadata fields per collection.

Without this metadata, an Archive-It surrogate consists of the seed URL, the number of mementos, and the first and last memento-datetimes, as shown below. Is this enough for a user to glean meaning about the underlying documents?

A minimal Archive-It surrogate

We adapted some of Lulwah Alkwai's recent work (link forthcoming) and determined that seed URLs do contain some information that may lead to understanding. An Euler diagram counting the URLs that contain some of this information is shown below. Thus, seed URLs may still help with collection understanding.
An Euler diagram showing the number of Archive-It seed URLs that contain different categories of information.

In the paper, we also highlight the top 10 metadata fields in use and define the different information classes found in seed URLs.

In a Story, Which Surrogate Best Supports Collection Understanding?

Brief Methodology

The figures below show the different types of surrogates that we displayed to participants. Each story consisted of a set of mementos visualized as surrogates in a given order. We varied the surrogates but did not change the order of the mementos. The mementos for each story had been chosen by human curators from AlNoamany's previous work and are available as a Figshare dataset. In our pre-print, we chose stories from four different collections to display to participants.

Our first surrogate type is the de-facto Archive-It interface that users would encounter when trying to understand a web archive collection. We used our own Archive-It Utilities to gather the metadata from the Archive-It collection in order to generate these surrogates.

A screenshot of part of an example story using surrogates from the Archive-It interface.

Our second is the browser thumbnail, commonly used by web archives. We employed MementoEmbed to generate these thumbnails.

A screenshot of an example story using browser thumbnails.

Next was the social card, as produced by MementoEmbed.

A screenshot of an example story using social cards.

The next three surrogates we displayed to users were combinations of social cards and browser thumbnails.

A screenshot of an example story using social cards next to browser thumbnails.

A screenshot of an example story using social cards, but with thumbnails instead of striking images

A screenshot of an example story using social cards, but where thumbnails appear when the user hovers over the striking image.

For each participant, we showed them the story using a given surrogate for 30 seconds. We then refreshed the web page and presented them with six surrogates of the same type as the story that they had just viewed. Two surrogates represented pages from the collection, but the other four were drawn from different collections. We asked them to select the two surrogates from the six that they believed belonged to the same collection. We recorded all mouse hovers and clicks over links and images.

Brief Results

Our results show no significant difference in response times at p < 0.05, but they do show a difference in answer accuracy for social cards vs. the Archive-It interface at p = 0.0569 and social cards side-by-side with thumbnails at p = 0.0770. The paper further details these results overall and per collection. Even though our use case is different, our results are similar to those in a 2013 IR study performed by Capra et al.

More users interacted with thumbnails than any other surrogate element. We assume that the user was attempting to zoom in and see the thumbnail better. Also, more users clicked on thumbnails to read the web page behind the surrogate than they did for social cards. In fact, social cards had the fewest participants interacting with them compared to other surrogate types. We assume that this means that most users were satisfied with the information provided by the social card and did not feel the need to interact as much.

The Future

In this post, I briefly summarized our recent pre-print "Social Cards Probably Provide Better Understanding of Web Archive Collections." This is not the end, however. We are planning more studies to further examine different types of storytelling with future participants. Our work has implications not only for our own web archive summarization efforts, but for any storytelling tool that employs surrogates.

-- Shawn M. Jones

Thank you @assertpub for letting us know that this pre-print was the #1 paper on arXiv in the Digital Libraries category for May 29, 2019.

Tuesday, May 14, 2019

2019-05-14: Back to Pennsylvania - Artificial Intelligence for Data Discovery and Reuse (AIDR 2019)

AIDR 2019
The 2019 Artificial Intelligence for Data Discovery and Reuse conference, supported by the National Science Foundation, was held at Carnegie Mellon University, Pittsburgh, PA, between May 13 and May 15, 2019. It is called a conference, but it is more like a workshop. There are only plenary meetings (and a small poster session), and the presentations are not all about frontiers of research. Many of them are research reviews in which the speakers try to connect their work with "data reuse". The presenters come from various domains, from text mining to computer vision, from medical imaging to self-driving cars, etc. Another difference from regular CS conferences is that the list of accepted presenters is based only on the abstracts they submitted. The full papers are submitted later.

Because CiteSeerX collects a lot of data from the Web, and our group does a lot of work on information extraction and classification and reuses a lot of data for training AI models, Dr. Lee Giles recommended that I give a presentation. My title was "CiteSeerX: Reuse and Discovery for Scholarly Big Data". In general, the talk was well received. One person asked how we plan to collect annotations from authors and readers by crowdsourcing. My answer was to take advantage of the CiteSeerX platform, but we need to collect more papers (especially more recent papers) and build better author profiles before sending out the requests. I will compile everything into a 4-page paper.

In my 1 1/2 days at CMU, I listened to two keynotes. The first was given by Tom Mitchell, one of the pioneers of machine learning and the chair of the machine learning department. His talk was on "Discovery from Brain Image Data". I once attended a webinar by him on a similar topic. His research was on connecting natural language with brain activity, studying how brains react to spoken-language stimuli. Here are some takeaways: (1) it takes about 400 ms for the brain to fully take in a word such as "coffee"; (2) the reaction happens in different regions of the brain and it is dynamic (changing over time). The data was collected using fMRI for several people, and there was quite a bit of work to denoise the fMRI signals to filter out other ongoing activities.

The second keynote was given by Glen de Vries, the president of a startup company called Medidata. Glen talked about how Medidata improves drug testing confidence by using synthetic data. The presentation was given in a very professional way (like a TED presentation), but Dr. Lee Giles commented that he was using a statistical method called "boosting", and Glen agreed.

Another interesting talk was given by Natasha Noy from Google. Her talk was about the recently launched search engine called "Google Dataset Search". According to Natasha, the idea was proposed in one of her blog posts in 2017, and the search engine went online in September 2018. Unfortunately, because it was not well advertised, very few people know about it. I personally learned of it only two weeks ago. The search engine uses data crawled by Google. The backend uses basic methods to identify public tools annotated with the schema in schema.org, which defines a comprehensive list of fields for metadata of semantic entities. I explored this schema in 2016. The schema can be used for CiteSeerX, replacing Dublin Core, but it does not cover semantically typed entities and relationships. So currently, it is good for metadata management. The datasets indexed were also limited to certain domains. Another interesting data search engine was Auctus, a dataset search engine tailored for data augmentation. It searches data using data as input.
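
Google Dataset Search discovers datasets through schema.org Dataset markup embedded in web pages. As a rough illustration of what such an annotation might look like, the sketch below builds a small JSON-LD record in Python. The property names come from the schema.org Dataset vocabulary, but every value and URL is a made-up placeholder, and the helper function name is my own.

```python
import json

# All values below are placeholders; only the property names come from
# the schema.org Dataset vocabulary.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example scholarly metadata sample",
    "description": "A placeholder description of what the dataset contains.",
    "url": "https://example.org/datasets/sample",
    "keywords": ["scholarly big data", "metadata"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Lab"},
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/datasets/sample.csv",
        }
    ],
}


def to_jsonld_script(record):
    """Wrap a JSON-LD record in the script tag used to embed it in HTML."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


snippet = to_jsonld_script(dataset)
print(snippet)
```

A page that carries a block like this in its HTML describes its dataset in a form a crawler can read without guessing at the surrounding prose.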

Other interesting talks were:
  • Dr. Cornelia Caragea gave two presentations, one on "keyphrase extraction" - she is an expert in this field, and one on "web archiving" - in collaboration with Mark Phillips of UNT.
  • Matias Carrasco Kind, an astronomer, talked about "Searching for similarities and anomalies in galaxy images".
At the conference, I met with Dr. C. Lee Giles and Dr. Cornelia Caragea. All of us were very glad to see each other. We had a very pleasant dinner at a restaurant called "spoon". I had a lunch conversation with Dr. Beth Plale, an NSF program director. She gave me some good suggestions for how to survive as a tenure-track professor. I also had brief conversations with Natasha Noy of Google AI and Martin Klein of Los Alamos National Lab.

Overall, the conference experience was very good and I learned a lot by listening to top speakers from CMU. The registration fee was low and they served breakfast, lunch, and a banquet (which I could not attend). The city of Pittsburgh is still cool and windy, but I felt quite used to it because I lived in Pennsylvania for 14 years! The Cathedral of Learning reminded me of the good old days when I was visiting my friend Dr. Jintao Liu. He used to be a graduate student at UPitt and is now a professor at Tsinghua University. By the way, the SuperShuttle service was not very good. The front desk canceled my trip from the airport to my hotel because she wasn't able to contact the driver. I had to take a taxi. I used Uber on the way back. It was quick and inexpensive.

Jian Wu

Monday, May 6, 2019

2019-05-06: Twitter broke my scrapers

Fig. 1: The old tweet DIV showing four (data-tweet-id, data-conversation-id, data-screen-name, and tweet-text) attributes with meaningful names. These attributes are absent in the new tweet DIV (Fig. 2).
On April 23, 2019, my Twitter desktop layout changed. I initially thought a glitch caused me to see the mobile layout on my desktop instead of the standard desktop layout, but I soon learned this was no accident. I was part of a subset of Twitter users who did not have the option to opt in to try the new layout.
New desktop look 
While others might have focused on the cosmetic or functional changes, my immediate concern was to understand the extent of the structural changes to the Twitter DOM. So I immediately opened my Google Chrome Developer Tools to inspect the Twitter DOM, and I was displeased to learn that the changes to the layout seeped beyond the cosmetic new looks of the icons into the DOM. This meant that I would have to rewrite all my research applications built to scrape data from the old Twitter layout.
Old Twitter desktop look
At the moment, I am unsure if it would be possible to extract all the data previously accessible from the old layout. It is important to note that scraping goes against Twitter's Terms of Service and that Twitter offers an API that fulfills some requests, invalidating the need for scraping. However, the Twitter API is limited in search and, most importantly, does not offer a method for extracting all tweets from a conversation. Extracting tweets from a conversation is a task fundamental to my PhD research, so I scrape Twitter privately for research. In this blogpost, I will use the tweet below to highlight some of the major changes to the Twitter DOM, specifically the tweet DIV, by comparing the old and new layouts.
Fig. 2: In the new tweet DIV, semantic items (e.g., the four semantic items in Fig. 1) are absent or obscured.
Old Tweet DIV vs New Tweet DIV
The most consequential (to me) structural difference between the old and new tweet DIVs is that the old tweet DIV includes many attributes with meaningful names while the new tweet DIV does not. In fact, in the old tweet layout, the fundamental unit, the tweet, was explicitly labeled a "tweet" by a DIV with classname="tweet," unlike the new layout. Let us consider the difference between the old and new tweet DIVs from the perspective of the four important attributes marked in Fig. 1:
  1. data-tweet-id: In the old layout, data-tweet-id (contains the tweet ID - a string that uniquely identifies a tweet) was explicitly marked. In the new layout, the data-tweet-id attribute is absent.
  2. data-conversation-id: This attribute, absent in the new layout and present in the old layout, is responsible for chaining tweets, and is thus required for identifying tweets in a reply or conversation thread. A tweet that is a reply includes the Tweet ID of its parent tweet as the value of the data-conversation-id attribute.
  3. data-screen-name: The data-screen-name attribute labels the Twitter handle of the tweet author. This attribute is marked explicitly in the old tweet DIV, but not in the new tweet DIV.
  4. tweet-text: Within the old tweet DIV, the DIV with class name, "tweet-text," marks the text of the tweet, but in the new tweet DIV, there is no such semantic label for the tweet-text.
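
To make the four attributes above concrete, here is a minimal sketch of how they could be read from old-layout markup using only Python's standard library. The sample HTML is a simplified, made-up stand-in for the real old tweet DIV, the class and function names are my own, and giving the root tweet its own ID as data-conversation-id is an assumption made for the example. This is not the code used in my research applications.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Simplified, hypothetical stand-in for old-layout markup. The root tweet
# carrying its own ID as data-conversation-id is an assumption.
SAMPLE = """
<div class="tweet" data-tweet-id="1001" data-conversation-id="1001" data-screen-name="alice">
  <div class="tweet-text">Original tweet</div>
</div>
<div class="tweet" data-tweet-id="1002" data-conversation-id="1001" data-screen-name="bob">
  <div class="tweet-text">A reply to <b>alice</b></div>
</div>
"""


class OldTweetParser(HTMLParser):
    """Collect the four semantic attributes from old-layout tweet DIVs."""

    def __init__(self):
        super().__init__()
        self.tweets = []
        self._text_tag = None   # tag name of the open tweet-text element
        self._text_depth = 0    # nesting depth of that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._text_tag is not None:
            # Already inside the tweet text: only track nesting.
            if tag == self._text_tag:
                self._text_depth += 1
            return
        if tag == "div" and "tweet" in classes and "data-tweet-id" in attrs:
            self.tweets.append({
                "tweet_id": attrs["data-tweet-id"],
                "conversation_id": attrs.get("data-conversation-id"),
                "screen_name": attrs.get("data-screen-name"),
                "text": "",
            })
        elif "tweet-text" in classes and self.tweets:
            self._text_tag = tag
            self._text_depth = 1

    def handle_endtag(self, tag):
        if self._text_tag is not None and tag == self._text_tag:
            self._text_depth -= 1
            if self._text_depth == 0:
                self._text_tag = None

    def handle_data(self, data):
        if self._text_tag is not None:
            self.tweets[-1]["text"] += data


def group_by_conversation(tweets):
    """Chain tweet IDs into threads keyed by data-conversation-id."""
    threads = defaultdict(list)
    for t in tweets:
        threads[t["conversation_id"]].append(t["tweet_id"])
    return dict(threads)


parser = OldTweetParser()
parser.feed(SAMPLE)
threads = group_by_conversation(parser.tweets)
print(parser.tweets)
print(threads)
```

Every lookup in this sketch keys on a meaningful attribute or class name. Without those names in the new tweet DIV, there is nothing equivalent to match on.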
It is not clear if the structural changes to the Twitter DOM are merely coincidental with the rollout of the new layout or if the removal of semantic attributes is part of an intentional effort to discourage scraping. Whatever the actual reason, the consequence is obvious - scraping Twitter has just gotten harder.

Update (2019-06-29)
I previously had a paragraph in this blogpost discussing what I thought was a glitch:
I noticed that reloading my timeline caused Twitter to load and subsequently quickly remove sponsored tweets from my timeline.
During a discussion with Sawood about Twitter Ads, he raised the issue of Adblock, which quickly made me realize that "the glitch" might not be a glitch, but Adblock in action. I have had Adblock on for so long that I failed to realize that it interfered with my conclusion. Further investigation confirmed that I had incorrectly attributed Adblock's removal of promoted tweets from my timeline to a glitch.
-- Alexander C. Nwala (@acnwala)

Friday, May 3, 2019

2019-05-03: Selected Conferences and Orders in WS, DL, IR, DS, NLP, AI

The timing of research work is usually less predictable than that of homework. You may submit a paper next year, but you cannot submit your homework the next year. Even if there is a target deadline, the results may not be delivered on time. Even if the results are ready, the papers may not be in good shape, especially for papers written by students. Even if papers are submitted, they can be rejected. Therefore, it is usually useful to decide where to submit the work next.

I used to struggle to find the next deadline for my work, so I compiled this timeline, sorted by month. The deadlines are not intended to be accurate because they change every year. They can also be extended. The deadlines may vary depending on the submission type: full paper, short paper, poster, etc. The focus is on the approximate chronological order in which the deadlines happen. One can always visit each conference's website for the exact deadline. It is also not intended to be exhaustive, as it focuses on popular conferences. I also do not want the list to be too crowded, but it can be updated by adding new conferences.

The list below is made for people in the Web Science Digital Libraries Group (WS-DL) at ODU, but it can be generalized to researchers working in Web Science, Digital Libraries, Information Retrieval, Data Science, Natural Language Processing, and Artificial Intelligence to better plan where research works can be disseminated. 

There is a conference that is not included below: International Conference on Very Large Data Bases (VLDB), which has a monthly deadline. The submission opens on the 20th and ends on the first of the next month. 
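
For a rolling schedule like this, the next deadline can be worked out from today's date. The sketch below is a minimal illustration with function names of my own choosing. It assumes the deadline is the first day of each month and that the 1st itself still counts as inside the window; check the VLDB call for papers for the exact cutoff.

```python
from datetime import date


def next_monthly_deadline(today):
    """Return the next first-of-the-month deadline on or after `today`."""
    if today.day == 1:
        return today
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


def window_is_open(today):
    """True from the 20th through the 1st of the following month (assumed inclusive)."""
    return today.day >= 20 or today.day == 1


print(next_monthly_deadline(date(2019, 5, 3)))  # 2019-06-01
print(window_is_open(date(2019, 5, 3)))         # False
```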

  • USENIX ATC (full) (January 15, 2020)
  • JCDL (full/short/poster) (January 19, 2020)
  • SIGIR (full/short) (January 22, 2020)
  • ICDAR (full) (February 15, 2019)
  • IJCAI (full) (February 15, 2019)
  • KDD (full/short) (Feb 3, 2019)
  • ACM Web Science (full/short/poster) (Feb 18, 2019)
  • ACL (full/short) (March 4, 2019)
  • COLING (full) (March 16, 2018)
  • ISWC (full) (April 3, 2019) 
  • ECML-PKDD (full) (April 5, 2019)
  • DocEng (full) (April 9, 2019)
  • TPDL (full/short/poster) (April 15, 2019) 
  • RecSys (full/short) (April 23, 2019, Copenhagen, Denmark)
  • IRI (full) (May 2, 2019)
  • ICTIR (full/short) (May 15, 2019)
  • IRI (full) (May 18, 2019)
  • DocEng (short) (May 21, 2019)
  • EMNLP (full/short) (May 21, 2019)
  • IJCNLP (full/short) (May 21, 2019)
  • CIKM (full/short) (May 22, 2019) 
  • NIPS (full) (May 23, 2019)
  • CoNLL (full) (May 31, 2019)
  • ICDM (full) (June 5, 2019)
  • K-CAP (full/short) (June 22, 2019)

  • WSDM (full) (August 16, 2019)
  • IEEE Big Data (full) (August 19, 2019, Los Angeles, CA), poster due later
  • AAAI, IAAI (full) (September 5, 2019, New York City, NY)
  • SAC (regular/research) (September 15, 2019, Czech Republic)
  • iConference (full/short/poster) (September 16, 2019)
  • ECIR (full/short) (October 1, 2019, Lisbon, Portugal)
  • SDM (full) (October 11, 2019, Alberta, Canada)
  • WWW (full/short) (October 14, 2019, Taipei, Taiwan)
  • CHIIR (full/short) (October 15, 2019, Vancouver, Canada)

  • ICWS (full/short) (December 6, 2019 -- early bird, February 5 -- regular)
  • NAACL-HLT (full/short) (December 10, 2019, Seattle, WA) 
  • IntelliSys (full/poster) (December 15, 2019, Amsterdam, The Netherlands)
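
Since the list is meant to be read by position in the calendar rather than by exact date, a small helper can answer "what comes next?" for any given day. The sketch below is only an illustration: it uses a handful of entries from the list, drops the years, and wraps around at the end of the year. The function and variable names are my own.

```python
from datetime import date

# A few (name, month, day) entries taken from the list above; years are
# dropped because only the approximate position in the calendar matters.
DEADLINES = [
    ("JCDL", 1, 19),
    ("SIGIR", 1, 22),
    ("ACL", 3, 4),
    ("TPDL", 4, 15),
    ("CIKM", 5, 22),
    ("WSDM", 8, 16),
    ("WWW", 10, 14),
    ("NAACL-HLT", 12, 10),
]


def upcoming(today, deadlines=DEADLINES, limit=3):
    """Return the next `limit` conference names, wrapping around the year end."""
    key = (today.month, today.day)
    ordered = sorted(deadlines, key=lambda d: (d[1], d[2]))
    later = [d for d in ordered if (d[1], d[2]) >= key]
    earlier = [d for d in ordered if (d[1], d[2]) < key]
    return [name for name, _, _ in (later + earlier)[:limit]]


print(upcoming(date(2019, 5, 3)))
```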

Jian Wu