2022-06-17: StreamingHub - Building Reusable, Reproducible Workflows


As researchers, we often write code that others (including ourselves) might reuse in the future. Especially during the early stages of research (i.e., the exploration phase), we hack bits and pieces of code together to test out different hypotheses. Upon discovering a few hypotheses that work, our focus shifts towards rigorous testing, academic writing, and publication.

So when should we pay attention to the reusability of research assets (i.e., data and code)? Ideally, as early as possible; yet this is easier said than done. For instance, during the exploration phase, it's often impractical to allocate time towards keeping research assets reusable. After all, it's uncertain whether the hypotheses would even work. In every phase that follows, we get consumed with testing, academic writing, and publication. For this reason, our concern for reusability becomes more of an afterthought.

Given the cumulative nature of research, non-reusable research assets have profound implications for future work. Over time, researchers may find prior work (including your own) that suits their needs, and try to build upon it. This is when life becomes difficult: code without documentation, for instance, is hard to "build upon".

When producing quality research, research assets should ideally speak for themselves. Research is not an error-free process; mistakes can happen, which may lead to incorrect conclusions. Half-baked research is rarely a strong basis for derivative work. To ensure discipline in scientific reporting, both the data collection procedures and the data-driven workflows that generate results should be transparent, reproducible, and thereby verifiable.

What can we do about it?

There are several strategies to ensure the reusability and reproducibility of research assets. For instance, documenting usage instructions and commenting hard-to-understand areas of code improves reusability. Moreover, having the research process well-documented, and implementing workflows in a modular fashion, improves reproducibility. In fact, improving these aspects has been a broad subject of research. Research contributions in this field can be categorized into three aspects: Metadata, Scientific Workflows, and Visual Programming.

Metadata

By definition, Metadata is "data that provide information about other data". Its purpose is to summarize basic information about data, which, in turn, simplifies how we work with data. Imagine being given a bunch of data without any knowledge of what they are. Here, you may spend some time inspecting the data, just to get an idea of it. Even if you do get an idea, your understanding may be incomplete, or worse, different from what the data actually means. This is where metadata serves its purpose; having metadata lets others know exactly what the creator wants them to know, and lets them build upon that.

Depending on the problem being investigated, researchers may collect fresh data (e.g., user studies, application logs, sensory readings), reuse already collected data (e.g., publicly available datasets), or do both. In all cases, the data being collected or reused should contain metadata to convey what the data means, and preferably, how it was collected. In such scenarios, having quality metadata helps to ensure reusability and reproducibility.

Scientific Workflows

Documenting the research process is critical to creating reproducible research assets. However, research processes can be fairly complex, and thereby painstaking to document in detail. In such cases, verifying the integrity of results becomes even more difficult. This is where 'scientific workflows' help; a scientific workflow, by definition, is "the description of a process for accomplishing a scientific objective, usually expressed in terms of tasks and their dependencies". Scientific Workflow Management Systems (e.g., Kepler, Pegasus) let users design workflows either visually, e.g., using data flow diagrams, or programmatically, using a domain-specific language. This makes it easy to share workflows that are runnable, thereby enabling others to verify both the research process and the results obtained.
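To make the idea concrete, here is a minimal sketch (in Python) of a workflow expressed as tasks and their dependencies; the task names and logic are purely illustrative and not tied to any particular workflow system.

```python
# A minimal, hypothetical workflow expressed as tasks and their dependencies.
# Task names and logic are illustrative only.

def load_data():
    return [1.0, 2.5, 3.2, 4.8]

def clean_data(raw):
    return [x for x in raw if x > 0]

def analyze(clean):
    return sum(clean) / len(clean)

def report(result):
    print(f"mean = {result:.2f}")

# Each task is paired with the names of the tasks it depends on.
TASKS = {
    "load":    (load_data,  []),
    "clean":   (clean_data, ["load"]),
    "analyze": (analyze,    ["clean"]),
    "report":  (report,     ["analyze"]),
}

def run(name, cache=None):
    """Run a task after (recursively) running its dependencies."""
    cache = {} if cache is None else cache
    if name not in cache:
        func, deps = TASKS[name]
        cache[name] = func(*[run(d, cache) for d in deps])
    return cache[name]

run("report")  # runs load -> clean -> analyze -> report, in dependency order
```

Because the dependencies are explicit, anyone can re-run the same sequence of steps and verify the result, which is exactly what workflow management systems offer at a much larger scale.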

Visual Programming

Visual programming is "a type of programming language that lets humans describe processes using illustration". It allows scientists to develop applications based on visual flowcharting and logic diagramming. Visual programming is beneficial for several reasons; primarily, applications built using visual programming are more accessible to novices. Moreover, having a visual representation makes it easy to understand applications at a conceptual level, analogous to how data is described via metadata.

Software such as Node-RED, Orange, NeuroPype, and Orchest provide visual-programming interfaces to design data-driven workflows. Orange is geared towards exploratory data analysis and interactive data visualization, while NeuroPype is geared towards neuroimaging applications. Node-RED and Orchest, however, are more generic, and allow users to build data-driven applications with ease.

Can we combine the strengths of metadata, scientific workflows, and visual programming to make the research process even more productive? This is what we propose through StreamingHub.

StreamingHub

First, I'll introduce StreamingHub. Next, I'll explain how we built scientific workflows for different domains using it, and the lessons we learnt by doing so.

Stream processing is inevitable; the need for stream processing stems from real-time applications such as stock prediction, fraud detection, self-driving cars, and weather prediction. For applications like these, latency is a critical factor that governs their practical use. After all, what good is a self-driving car if it cannot detect and avert hazards in real-time?

StreamingHub is a stream-oriented approach for data analysis and visualization. It consists of four components.

  • Data Description System (DDS)
  • Data Mux
  • Workflow Designer
  • Operations Dashboard

Data Description System (DDS)

DDS is a collection of metadata schemas for describing data streams, data sets, and data analytics. It provides three schemas: 1) Datasource schema, 2) Dataset schema, and 3) Analytic schema. Each schema is a blueprint of the fields needed to provide data-level insights.

Metadata created from these schemas may look as follows.

Datasource Metadata (Example)

Dataset Metadata (Example)

Analytic Metadata (Example)
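As a rough illustration, the sketch below (written as Python dictionaries) shows the kind of fields such metadata might carry; the field names are hypothetical and should not be read as the actual DDS schemas.

```python
# Hypothetical sketches of DDS-style metadata, expressed as Python dicts.
# Field names are illustrative; consult the DDS schemas for the actual format.

datasource_metadata = {
    "device": "Example Eye Tracker",          # hypothetical sensor
    "streams": [
        {"name": "gaze", "channels": ["x", "y"], "frequency": 60, "unit": "px"},
        {"name": "pupil", "channels": ["diameter"], "frequency": 60, "unit": "mm"},
    ],
}

dataset_metadata = {
    "name": "example-gaze-dataset",            # hypothetical dataset
    "description": "Gaze recordings from a hypothetical user study",
    "sources": [datasource_metadata],          # which device(s) produced the data
    "groups": ["participant", "task"],         # how recordings are organized
}

analytic_metadata = {
    "name": "ivt-fixation-filter",             # hypothetical analytic
    "inputs": ["gaze"],                        # streams consumed
    "outputs": [
        {"name": "fixations", "channels": ["x", "y", "duration"]},
    ],
    "parameters": {"velocity_threshold": 30},  # deg/s, illustrative value
}
```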

Data Mux

The Data Mux operates as a bridge between connected sensors, datasets, and data streams. It uses DDS metadata to create the data streams needed for a task, and streams both data and DDS metadata.

The Data Mux provides three modes of execution:

  • Replay - stream data from a dataset
  • Simulate - generate and stream simulated data from a dataset
  • Live - stream real-time data from connected sensors

In replay mode, the Data Mux reads and streams data (files) at their recorded (sampling) frequency. In simulate mode, it generates and streams synthetic data (guided by test cases) at their expected sampling frequency. In live mode, it connects to (live) sensory data sources and streams them. It utilizes DDS dataset/analytic metadata in the replay and simulate modes, and DDS datasource metadata in the live mode. When generating analytic outputs from data streams, the metadata from both the input source(s) and analytic process(es) are propagated into the output data to minimize the required manual labor for making analytics reusable, reproducible, and thus verifiable.
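To illustrate the replay mode, here is a minimal sketch of streaming recorded samples at their sampling frequency while propagating the dataset metadata; the function and field names are assumptions made for illustration, not the actual Data Mux API.

```python
import time

# Hypothetical sketch of "replay" mode: stream recorded samples at the
# recorded sampling frequency, carrying the dataset metadata along with them.

def replay(recording, metadata, publish):
    """Emit each sample of `recording` at the recorded sampling frequency."""
    interval = 1.0 / metadata["frequency"]   # e.g., 1/60 s for a 60 Hz stream
    for sample in recording:
        publish({"data": sample, "metadata": metadata})  # propagate metadata
        time.sleep(interval)                 # pace the stream to real time

# Usage (illustrative): print each streamed sample
gaze_metadata = {"name": "gaze", "channels": ["x", "y"], "frequency": 60}
gaze_recording = [(512, 384), (515, 380), (520, 377)]
replay(gaze_recording, gaze_metadata, publish=print)
```

Keeping the metadata attached to every emitted sample is what lets downstream analytics inherit the provenance of their inputs without extra manual bookkeeping.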

Workflow Designer


The Workflow Designer uses Node-RED to design and execute workflows. Node-RED is a programming tool for wiring together hardware devices, APIs, and online services to build real-time IoT applications. It is built on Node.js, and takes full advantage of its event-driven, non-blocking model. This makes it ideal to run at the edge of the network on low-cost hardware such as the Raspberry Pi, as well as in the cloud. Node-RED provides a sleek, browser-based visual flow editor. It allows users to drag and drop nodes onto a canvas, and wire them together to build workflows. By default, Node-RED comes with a variety of nodes for handling IO, code execution, event routing, visualization, logging, and more. It also facilitates creating new nodes in the form of sub-flows, and using them within workflows like any other node. Flows created in this manner can then be deployed with a single click, and imported/exported for reuse and sharing.

Depending on need, users may create sub-flows that implement custom logic. These sub-flows can later be reused within complex flows, or even shared with others. The image above shows an eye movement analysis workflow we designed using Node-RED. Here, the nodes defined as Stream Selector, IVT Filter, and Synthesizer are eye-movement specific sub-flows that we implemented ourselves.
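For context, an I-VT (velocity-threshold identification) filter classifies each gaze sample as part of a fixation or a saccade by comparing its point-to-point velocity against a threshold. The sketch below is a simplified, hypothetical version of that logic; our actual sub-flow is implemented as Node-RED nodes, and its details may differ.

```python
# Simplified, hypothetical sketch of an I-VT (velocity threshold) filter:
# classify each gaze sample as "fixation" or "saccade" by comparing its
# point-to-point velocity against a threshold.

def ivt_filter(samples, frequency, threshold):
    """samples: list of (x, y) gaze positions; frequency: sampling rate (Hz);
    threshold: velocity threshold, in the same spatial unit per second."""
    labels = ["fixation"]                    # first sample has no velocity
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        velocity = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 * frequency
        labels.append("saccade" if velocity > threshold else "fixation")
    return labels

# Usage (illustrative values)
print(ivt_filter([(512, 384), (514, 385), (600, 420)], frequency=60, threshold=1000))
# -> ['fixation', 'fixation', 'saccade']
```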

Operations Dashboard

The operations dashboard is also generated using Node-RED. It allows users to perform interactive visualizations and data-stream control actions. Here, users have the option to visualize the data generated at any point in the analytics workflow. All visualizations are dynamic, and updated in real-time as new data is received. Moreover, the available visualization options are determined by data type. The image above shows the operations dashboard that we created for an eye movement analysis task.
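As a rough illustration of picking visualizations by data type, consider the following sketch; the mapping is hypothetical and not the dashboard's actual logic.

```python
# Illustrative only: choosing visualization options by data type.
# The mapping below is hypothetical, not StreamingHub's own.

VISUALIZATIONS = {
    "timeseries":  ["line chart"],
    "2d-points":   ["scatter plot", "heatmap"],
    "categorical": ["bar chart"],
}

def options_for(data_type):
    """Return the visualizations applicable to a given stream data type."""
    return VISUALIZATIONS.get(data_type, ["table"])

print(options_for("2d-points"))  # ['scatter plot', 'heatmap']
```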

The video linked below shows a quick end-to-end demo of StreamingHub in action. Here, I show how a dataset can be replayed using the StreamingHub Data Mux, and subsequently processed/visualized via a domain-specific workflow in Node-RED.

In the future, we plan to include five data-stream control actions in the operations dashboard: start, stop, pause, resume, and seek. By doing this, we hope to enable users to inspect data streams temporally and perform visual analytics; a particularly useful feature when analyzing high-frequency, high-dimensional data. If you're interested in learning more, please refer to my paper titled "StreamingHub: Interactive Stream Analysis Workflows" at JCDL 2022 [Preprint]. Also check out my presentation on StreamingHub at the 2021 WS-DL Research Expo below.
