2020-04-16: Visual Data Analysis with Streaming-hub
Streaming-hub [Link]
In my previous post, I elaborated on how dataset metadata could be standardized in a manner that enables researchers to efficiently discover and reuse data already collected for past studies. Adopting such a standard brings a host of benefits to research communities – such as simplified data sharing, massively collaborative research, and automated data pre-processing.
However, formulating and adopting such a standard would take years, if not decades, unless 1) the public realizes that its practical benefits outweigh the initial hassle of transition, and 2) tools and libraries are built to ease workflows after the transition. My previous post tries to address the first concern by introducing DFS and DDU. In this post, I describe our work towards addressing the second concern.
Recently, I worked on generalizing the concept of DFS into the domain of streaming data. Conceptually, data stored in repositories is not real-time, i.e., the user can read the data from start to end with no time constraints. Tools for pre-processing and analyzing this kind of data are quite common (e.g., WEKA, Orange, Trifacta Wrangler). Streaming data, however, is inherently real-time, so this is not the case: we only learn the nature of the data as we receive more and more of it. Processing it in real time adds a further challenge. In such scenarios, having prior knowledge about the data streams helps.
In the context of sensory data, this knowledge could be sensor specifications such as the range of measurement, the error margin, and the signal-to-noise ratio (SNR). It could also be stream-specific information such as a textual description of the stream, the number of channels, and the type of data. This led us to create an alternate metadata representation for data "streams", which we termed meta-streams.
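To make this concrete, here is a minimal sketch of what a meta-stream record could hold, using only the kinds of information listed above. The class names, field names, and example values are my own placeholders for illustration; they are not the actual DFS meta-stream schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class SensorSpec:
    # Hypothetical field names; the post only lists the kinds of information.
    measurement_range: Tuple[float, float]  # (min, max) measurable value
    error_margin: float                     # absolute measurement error
    snr_db: float                           # signal-to-noise ratio, in decibels


@dataclass
class MetaStream:
    description: str     # textual description of the stream
    channels: List[str]  # one label per channel
    dtype: str           # data type shared by all channels
    sensor: SensorSpec   # specifications of the device producing the stream

    @property
    def channel_count(self) -> int:
        return len(self.channels)

    def to_json(self) -> str:
        # Serialize the nested record so it can travel with the stream.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values, not real device specifications.
pupil = MetaStream(
    description="Pupil diameter from an eye tracker",
    channels=["left", "right"],
    dtype="float32",
    sensor=SensorSpec(measurement_range=(0.0, 10.0), error_margin=0.1, snr_db=20.0),
)

print(pupil.channel_count)
print(pupil.to_json())
```

A consumer that receives such a record before any samples arrive already knows how many channels to expect, what type they are, and how much to trust each measurement.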
We now have two metadata representations in DFS: 1) meta-files, and 2) meta-streams. Though they serve similar purposes, they target two different types of data: stored files and live streams, respectively.
When exploring which fields and descriptions are necessary for meta-streams, we used two sensory devices from two domains: 1) PupilLabs Core – which is an Eye Tracker, and 2) Empatica E4 – which is a health-sensing wristband. Using two domains instead of one helped us generalize the specification better.
Data scientists and other analytics professionals often use interactive visualization at the end of a workflow to communicate findings to a wider audience. Visualization scientists, however, claim that interactive representations of data can also be used during exploratory analysis itself. With that in mind, we focused on creating a toolkit that could leverage the abstractions in DFS to simplify data analysis workflows.
We based our work on Orange, an open-source data visualization, machine learning, and data mining toolkit with an extensible, visual programming front-end for exploratory data analysis and interactive data visualization. We chose Orange for its extensibility and its visual programming front-end. Its extensibility allows us to add our own functionality to Orange by creating widgets, which can encapsulate any functionality that can be implemented in Python. For instance, we created an Orange widget that discovers available data streams through LabStreamingLayer (LSL). Its visual programming front-end provides a variety of drag-and-drop graphical elements, i.e., widgets, that can be chained together to create data analysis workflows. This approach provides a simple yet intuitive alternative to writing code.
Our toolkit, streaming-hub, is still in development and is publicly available on GitHub. The figure below shows the data analysis workflow that we envision with DFS, DDU, and Streaming-hub.
Proposed architecture with DFS, DDU, and streaming-hub
In this architecture, all communication happens through sockets, via LabStreamingLayer (LSL). We leveraged eight abstractions in LSL to create our communication workflows:
- Sample: A single measurement of all channels from a device.
- Chunk: A group of multiple samples sent in one transmission (for improved throughput), as opposed to sending one sample per transmission (for improved latency).
- Metadata: The information about the stream (apart from the raw data) that is stored and transmitted as XML data (akin to a file header).
- Stream: The combination of sampled data from a device with metadata. A stream can have a regular/irregular sampling rate, and one or more channels. All data within a stream should have the same data type.
- Stream Outlet: For making data streams available on the lab network.
- Stream Inlet: For receiving data from a single connected outlet.
- Resolver: A service discovery feature to resolve streams present on the lab network, according to content-based queries (e.g., by name, content-type, or queries on metadata).
- Built-in clock: To time-stamp transmitted samples so that they can be mutually synchronized.
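The trade-off between samples and chunks can be illustrated with a small toy model. This is plain Python and not the LSL API: the `Sample` type and the `chunked` helper are hypothetical names I introduce only for this sketch.

```python
from typing import Iterable, Iterator, List, Tuple

# A sample: (timestamp, one value per channel). Hypothetical type for this sketch.
Sample = Tuple[float, List[float]]


def chunked(samples: Iterable[Sample], chunk_size: int) -> Iterator[List[Sample]]:
    """Group samples into chunks of up to chunk_size samples each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Sample] = []
    for sample in samples:
        chunk.append(sample)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield chunk


# Five two-channel samples taken at a regular 10 Hz rate.
samples = [(i / 10, [float(i), float(-i)]) for i in range(5)]

per_sample = list(chunked(samples, 1))  # one sample per transmission
per_chunk = list(chunked(samples, 2))   # up to two samples per transmission

print(len(per_sample), "transmissions when sending sample by sample")
print(len(per_chunk), "transmissions when sending in chunks of two")
```

Sending sample by sample delivers each measurement as soon as it exists, while chunking reduces the number of transmissions at the cost of waiting for a chunk to fill.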
In streaming-hub, the Inlet receives data streams and their metadata, via LSL. These streams are then synchronized and aggregated. Next, this data passes through several pre-processing steps, and arrives at the analysis stage. Once data is analyzed, the results are streamed back into LSL in the form of analytic-streams. Analytic-streams have the same properties as data-streams, and could be treated as such. Persisting analytic-streams on disk enables analytics reuse, which reduces computational overhead by avoiding repetition. Further, data and analytics could be visualized and persisted at any stage in the workflow.
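As a rough illustration of the synchronize-and-aggregate step, the sketch below pairs samples from two time-stamped streams by nearest timestamp. This is only one simple possibility that I am assuming for the example; it is not necessarily how streaming-hub does it, and the function names, stream contents, and tolerance are all made up.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# A single-channel stream: (timestamp, value) pairs sorted by timestamp.
Stream = List[Tuple[float, float]]


def nearest(stream: Stream, times: List[float], t: float,
            tolerance: float) -> Optional[float]:
    """Return the value whose timestamp is closest to t, if within tolerance."""
    i = bisect_left(times, t)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = stream[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    ts, value = min(candidates, key=lambda s: abs(s[0] - t))
    return value if abs(ts - t) <= tolerance else None


def aggregate(a: Stream, b: Stream,
              tolerance: float) -> List[Tuple[float, float, float]]:
    """Pair each sample of a with the nearest-in-time sample of b."""
    times_b = [ts for ts, _ in b]
    rows = []
    for t, value_a in a:
        value_b = nearest(b, times_b, t, tolerance)
        if value_b is not None:  # drop samples with no close enough partner
            rows.append((t, value_a, value_b))
    return rows


# Made-up readings from two hypothetical devices sharing a common clock.
pupil = [(0.00, 3.1), (0.10, 3.2), (0.20, 3.3)]
eda = [(0.01, 0.40), (0.12, 0.42), (0.35, 0.45)]

rows = aggregate(pupil, eda, tolerance=0.05)
print(rows)
```

Matching like this only makes sense because the samples carry timestamps from a common clock; without them, there would be no principled way to line up streams from different devices.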
Sample data analysis workflow generated using Streaming-hub
The success of this architecture depends on how well meta-streams and meta-files generalize. Both DFS and Streaming-hub are still in their infancy and have yet to reach a stage where they are usable by the general public. This is by no means an easy task, but it is our roadmap for the future.
-- Yasith Jayawardana (@yasithmilinda)