Showing posts from May, 2019

2019-06-03: Metadata on Datasets Saves You Time

When I joined ODU in Spring 2019, my first task was to explore datasets in digital libraries, with the hope of discovering ways to help users find data and to help data find its way to users. This led to some interesting findings that I will elaborate on in this post. First things first, let's take a look at the tools and platforms that attempt to make it easier for users to find and visualize data. A quick Google search provided a link to this awesome GitHub repository, which contains a list of topic-centric public dataset repositories. This collection proved useful for gathering the types of dataset descriptions available at present. The first dataset collection I explored was Kaggle. Here, the most upvoted dataset (as of May 31, 2019) was a CSV file with the topic "Credit Card Fraud Detection". Taking a quick look at the data, the first two columns provide a textual description of the content, but the rest do not. Since I'm not the mai

2019-05-29: In The Battle of the Surrogates: Social Cards Probably Win

Web archive collections provide meaning by sampling specific resources from the web. We want to summarize these resources by sampling mementos from those collections and visualizing them as a social media story. On Tuesday, we released our latest pre-print "Social Cards Probably Provide Better Understanding of Web Archive Collections" (ACM published version). My work builds on AlNoamany's work of using social media storytelling to provide a visualization that summarizes web archive collections. In previous blog posts I discussed different storytelling services. A key component of their capability to convey understanding is the surrogate, a small visualization of a web page that provides a summary of that page, like the surrogate within the Twitter Tweet example shown below. However, there are many types of surrogates. We want to use a group of surrogates together as a story to provide a summary of a web archive collection. Which type of surrogate works best fo

2019-05-14: Back to Pennsylvania - Artificial Intelligence for Data Discovery and Reuse (AIDR 2019)

The 2019 Artificial Intelligence for Data Discovery and Reuse conference, supported by the National Science Foundation, was held at Carnegie Mellon University, Pittsburgh, PA, between May 13 and May 15, 2019. It is called a conference, but it is more like a workshop. There are only plenary meetings (and a small poster session), and the presentations are not all about frontiers of research. Many of them are research reviews in which the speakers try to connect their work with "data reuse". The presenters come from various domains, from text mining to computer vision, from medical imaging to self-driving cars, etc. Another difference from regular CS conferences is that the list of accepted presenters is based only on the abstracts they submitted. The full papers are submitted later. Because CiteSeerX collects a lot of data from the Web, our group does a lot of work on information extraction, classification, and reuses a lot of data for training AI models, Dr. Lee Giles re

2019-05-06: Twitter broke my scrapers

Fig. 1: The old tweet DIV showing four attributes with meaningful names (data-tweet-id, data-conversation-id, data-screen-name, and tweet-text). These attributes are absent in the new tweet DIV (Fig. 2). On April 23, 2019, my Twitter desktop layout changed. I initially thought a glitch caused me to see the mobile layout on my desktop instead of the standard desktop layout, but I soon learned this was no accident. I was part of a subset of Twitter users who did not have the option to opt in to try the new layout. (Image: New desktop look.) While others might have focused on the cosmetic or functional changes, my immediate concern was to understand the extent of the structural changes to the Twitter DOM. So I immediately opened my Google Chrome Developer Tools to inspect the Twitter DOM, and I was displeased to learn that the changes to the layout seeped beyond the cosmetic new looks of the icons into the DOM. This meant that I would have to rewrite all my research applicati
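To make the dependence on those named attributes concrete, here is a minimal sketch of a scraper that keys on them. It is not the author's code: the two HTML samples are hand-written stand-ins (not real Twitter markup), the class name `OldTweetParser` and the function `extract_old_tweets` are hypothetical, and treating `tweet-text` as a CSS class on a child element is an assumption. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the OLD layout (assumption, not real Twitter markup).
SAMPLE_OLD_TWEET = """
<div class="tweet" data-tweet-id="123" data-conversation-id="123"
     data-screen-name="example_user">
  <p class="tweet-text">Hello, <b>web</b> archives!</p>
</div>
"""

# Hand-written stand-in for the NEW layout: no meaningfully named attributes.
SAMPLE_NEW_TWEET = '<div class="css-abc123"><span>Hello, web archives!</span></div>'


class OldTweetParser(HTMLParser):
    """Collects tweets by looking for the old layout's data-* attributes."""

    def __init__(self):
        super().__init__()
        self.tweets = []
        self._text_tag = None  # tag name of the open tweet-text element
        self._depth = 0        # nesting depth of same-named tags inside it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "data-tweet-id" in attrs:
            # A tweet container: record its identifying attributes.
            self.tweets.append({
                "tweet_id": attrs["data-tweet-id"],
                "conversation_id": attrs.get("data-conversation-id"),
                "screen_name": attrs.get("data-screen-name"),
                "text": "",
            })
        elif self._text_tag is None and self.tweets and "tweet-text" in classes:
            # Start of the tweet text element for the most recent tweet.
            self._text_tag = tag
            self._depth = 1
        elif tag == self._text_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._text_tag:
            self._depth -= 1
            if self._depth == 0:
                self._text_tag = None

    def handle_data(self, data):
        # Accumulate text only while inside the tweet-text element.
        if self._text_tag is not None:
            self.tweets[-1]["text"] += data


def extract_old_tweets(html):
    parser = OldTweetParser()
    parser.feed(html)
    parser.close()
    return parser.tweets


if __name__ == "__main__":
    print(extract_old_tweets(SAMPLE_OLD_TWEET))
    print(extract_old_tweets(SAMPLE_NEW_TWEET))
```

Run against the second sample, the same parser finds nothing, which is the failure mode a layout change of this kind would cause for any scraper built on the old attribute names.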

2019-05-03: Selected Conferences and Orders in WS, DL, IR, DS, NLP, AI, CV

The timing of research work is usually less predictable than that of homework. You may submit a paper next year, but you cannot submit your homework the next year. Even if there is a target deadline, the results may not be delivered on time. Even if the results are ready, the papers may not be in good shape, especially papers written by students. Even if papers are submitted, they can be rejected. Therefore, it is usually useful to decide where to submit the work next. I used to struggle to find the next deadline for my work, so I compiled this timeline, sorted by month. The deadlines are not intended to be accurate because they change every year. They can also be extended. The deadlines may vary depending on the submission type: full paper, short paper, poster, etc. The focus is on the approximate chronological order in which the deadlines happen. One can always visit the conferences' websites for the exact deadlines. It is also not intended to be exhaustive as it