Friday, May 31, 2019

2019-06-03: Metadata on Datasets Saves You Time

When I joined ODU in Spring 2019, my first task was to explore datasets in digital libraries, hoping to discover ways to help users find data, and to help data find its way to users. This led to some interesting findings, which I elaborate on in this post.

First things first, let's take a look at the tools and platforms available that attempt to make it easier for users to find and visualize data. A quick Google search led me to this awesome GitHub repository, which contains a list of topic-centric public dataset repositories. This collection proved useful for surveying the types of dataset descriptions available at present.

The first dataset collection I explored was Kaggle. Here, the most upvoted dataset (as of May 31, 2019) was a CSV file on the topic "Credit Card Fraud Detection". Taking a quick look at the data, the first two columns provide a textual description of their content, but the rest do not. Since I'm not the maintainer of that dataset (hence the term distributed digital collections), I wasn't allowed to contribute improvements to its metadata.
Figure 1: "Credit Card Fraud Detection" Dataset in Kaggle [Link]

One feature prominent on Kaggle (and most public dataset repositories) is a free-text description of the content. However, the semantics of the data fields and the links between the files in a dataset were either buried in that description or not included at all. Only a handful of datasets actually documented their data fields.
Figure 2: Metadata of "Credit Card Fraud Detection" Dataset in Kaggle [Link]
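To make this concrete, here is a minimal Python sketch of the problem: given a dataset's free-text field documentation, flag the columns a new user has no semantics for. The column names and descriptions below are illustrative, not taken from the actual Kaggle dataset page.

```python
import pandas as pd

# Field-level documentation extracted from a dataset's description.
# Only some columns are covered (names are illustrative).
field_docs = {
    "Time": "Seconds elapsed since the first transaction",
    "Amount": "Transaction amount",
    # the anonymized columns carry no documentation at all
}

# A stand-in for the dataset's column layout.
df = pd.DataFrame(columns=["Time", "V1", "V2", "Amount", "Class"])

# Columns whose meaning a newcomer cannot recover from the metadata.
undocumented = [col for col in df.columns if col not in field_docs]
print(undocumented)  # → ['V1', 'V2', 'Class']
```

A machine-readable, per-field schema would make a check like this trivial; with only free-text descriptions, it requires a human to read and interpret the prose.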

If you have no expertise in a particular domain but want to use publicly available data to test a hypothesis, encountering datasets with inadequate documentation is inevitable: the semantics of most publicly available datasets are vague and arcane.

This gave us enough motivation to dig a little deeper and find a way to change this trend in digital libraries. We formulated a metadata schema and envisioned a file system, DFS, that aims to reverse this state of ambiguity and bring more sense to datasets.

Quoting from our poster paper "DFS: A Dataset File System for Data Discovering Users" at JCDL 2019 [link to paper]:
Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research efficiency. In this paper we propose DFS, a file system to standardize the metadata representation of datasets, and DDU, a scalable architecture based on DFS for semi-automated metadata generation and data recommendation on the cloud. We discuss how DFS and DDU lay the groundwork for automatic dataset aggregation, how they integrate with existing data wrangling and machine learning tools, and explore their implications on datasets stored in digital libraries.
We published an extended version of the paper on arXiv [Link] that elaborates more on the two components that help achieve our goal:
  • DFS - A metadata-based file system to standardize the metadata of datasets
  • DDU - A data recommendation architecture based on DFS to bring data closer to users
DFS isn't the next new thing; rather, it addresses the lack of a systematic way to describe datasets in enough detail to make sense to an end user. It provides the means to manage versions of data, and ensures that no important information about the dataset is omitted. Most importantly, it provides a machine-understandable format to define dataset schemas. The JSON shown below is a description of a dataset in the DFS meta format.
Figure 3: Sample Metafile in DFS (Shortened for Brevity)
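For readers without access to the figure, the Python snippet below builds an illustrative metafile in the spirit of DFS. The field names here are hypothetical placeholders, not the actual DFS schema; see the paper for the real specification.

```python
import json

# Illustrative dataset metafile in the spirit of DFS: per-file,
# per-field semantics plus versioning, all machine-readable.
# Key names are hypothetical, not the published DFS schema.
metafile = {
    "name": "credit-card-fraud-detection",
    "version": "1.0.0",  # dataset versioning
    "description": "Anonymized credit card transactions",
    "files": [
        {
            "path": "creditcard.csv",
            "format": "csv",
            "fields": [
                {"name": "Time", "type": "float",
                 "description": "Seconds since first transaction"},
                {"name": "Amount", "type": "float",
                 "description": "Transaction amount"},
            ],
        }
    ],
}

print(json.dumps(metafile, indent=2))
```

Because every field carries a type and description, a tool can validate, join, or preprocess the data without a human reading prose documentation first.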

On the other hand, DDU (Data Discovering Users) is an architecture we envisioned to simplify the process of plugging data into hypothesis tests. Assuming each dataset has metadata compliant with the proposed DFS schema, the goal is to automate data preprocessing and machine learning, while visualizing the steps taken to reach the final results. So even if you are not a domain expert, you could discover a set of datasets that match your need, plug them into the DDU SaaS, and voila! You get the results needed to validate your hypothesis, along with a visualization of the steps followed to obtain them.
Figure 4: DDU Architecture
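The DDU workflow described above can be sketched in Python as a pipeline that records a trace of its own steps. None of these functions belong to a real library; they simply name the stages (discover via DFS metadata, preprocess, learn, report) under the assumption that every dataset carries a DFS-style description.

```python
# Hypothetical sketch of the DDU workflow: each stage appends to a
# step trace, which is what the final visualization would render.

def discover_datasets(query, catalog):
    """Match dataset metafiles in a catalog against the user's need."""
    return [m for m in catalog if query.lower() in m["description"].lower()]

def run_pipeline(query, catalog):
    steps = [f"discover: '{query}'"]
    datasets = discover_datasets(query, catalog)
    steps.append(f"matched {len(datasets)} dataset(s)")
    steps.append("preprocess: impute, normalize (driven by field types)")
    steps.append("learn: fit candidate models, pick best")
    return datasets, steps  # results plus a trace of the steps taken

catalog = [
    {"name": "fraud", "description": "Credit card fraud transactions"},
    {"name": "eeg", "description": "EEG recordings"},
]
datasets, steps = run_pipeline("fraud", catalog)
print(datasets[0]["name"], len(steps))  # → fraud 4
```

The step trace is the point: a non-expert gets not just a result, but a record of how the system reached it.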

As of now, we are working hard to bring DFS to as many datasets as possible. For starters, we aim to automatically generate DFS metadata for EEG and eye-tracking data acquired in real time. The goal is to intercept live data from Lab Streaming Layer [Link] and generate metadata as the data files are written.
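A minimal sketch of that idea, without the actual Lab Streaming Layer API: a writer that accumulates metadata as each sample arrives, then emits a DFS-style metafile alongside the recording. The class, schema keys, and simulated samples are all hypothetical.

```python
import json
import time

# Hypothetical sketch: accumulate metadata while samples stream in
# (e.g. intercepted from Lab Streaming Layer), then emit a metafile.
class LiveMetadataWriter:
    def __init__(self, stream_name, channels, srate):
        self.meta = {
            "name": stream_name,
            "fields": [{"name": c, "type": "float"} for c in channels],
            "nominal_srate": srate,
            "samples": 0,
        }

    def on_sample(self, sample, timestamp):
        # Update counts and timing as each sample arrives.
        self.meta["samples"] += 1
        self.meta["last_timestamp"] = timestamp

    def flush(self):
        # Serialize the metafile to store next to the data file.
        return json.dumps(self.meta, indent=2)

writer = LiveMetadataWriter("eeg-demo", ["Fp1", "Fp2"], srate=256)
for _ in range(3):  # simulated incoming EEG samples
    writer.on_sample([0.1, 0.2], timestamp=time.time())
print(writer.flush())
```

In the real setup, `on_sample` would be fed by the live stream, so the metadata stays in sync with the data file as it grows.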

But the biggest question is: does this approach hold for all domains of research? We plan to answer this in our future work.

- Yasith Jayawardana
