2020-04-26: Large Scale Networking (LSN) Workshop on Huge Data

Between April 13 and 14, 2020, I attended the Large Scale Networking (LSN) workshop on Huge Data. This workshop was supported by NSF and organized by Clemson University (Dr. Kuang-Ching Wang), the University of Virginia (Dr. Ronald Hutchins), and the University of Kentucky (Dr. James Griffioen and Dr. Zongming Fei). It was supposed to be held in Chicago, IL, but due to the coronavirus pandemic, the whole workshop was moved online. The workshop consisted of four topic sessions:

  1. Data generation (6 presentations)
  2. Data storage (7 presentations)
  3. Data movement (14 presentations)
  4. Data processing and security (14 presentations)
Each speaker was given only 5 minutes for a flash presentation highlighting their work. The workshop also had four breakout sessions:
  1. New Areas of Research Beyond Big Data
  2. New Types of Data & Ways to Get Them
  3. Collaboration across Disciplines
  4. Critical Research Infrastructure Needed Beyond Big Data
Dr. C. Lee Giles and I contributed a white paper titled "Scholarly Very Large Data: Challenges For Digital Libraries". Dr. Craig Partridge from Colorado State University gave the keynote. Two NSF officials, Deepankar Medhi and Erwin Gianchandani, also attended the conference and gave short speeches at the beginning.

The workshop aimed to gather people from non-CS disciplines to present instances of "huge data" and their applications, and people from CS to present software and hardware strategies and discuss future solutions for dealing with this type of data. Attendees therefore came from many different fields, such as astronomy, medical science, geology, and meteorology.

First, what is "huge data", and how does it differ from "big data"? This was discussed extensively in the breakout sessions. People tend to agree that "big data", traditionally defined using the five V's (volume, velocity, variety, veracity, and value), is usually composed of many small pieces of data. It is hard to process on a single machine, so you need a cluster to process the data in parallel, using frameworks such as MapReduce or MPI. "Huge data" is intrinsically big and begins to overwhelm the infrastructure (you may not even have the infrastructure to store all of it).
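To make the distinction concrete, here is a minimal sketch (mine, in Python, not from the workshop) of the map-reduce pattern that suits "big data" made of many small records; "huge data" is the regime where even this kind of parallelism strains the infrastructure, because the raw data may be too large to stage anywhere in the first place. The shard file names and record format are hypothetical.

```python
# Minimal map-reduce sketch: count records per category across many small files.
# Hypothetical inputs; in practice a framework (Hadoop/Spark, MPI) distributes this
# across a cluster rather than a single machine's process pool.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_one(path):
    """Map step: turn one small file into partial counts."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            category = line.split(",")[0]  # assumed CSV-like records
            counts[category] += 1
    return counts

def reduce_two(a, b):
    """Reduce step: merge two partial results."""
    a.update(b)  # Counter.update adds counts
    return a

if __name__ == "__main__":
    files = [f"shard_{i:04d}.csv" for i in range(100)]  # hypothetical shards
    with Pool() as pool:
        partials = pool.map(map_one, files)
    total = reduce(reduce_two, partials, Counter())
    print(total.most_common(5))
```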

People may think "huge data" sets are "special cases", but they are not; they are becoming the norm. For example, the Large Synoptic Survey Telescope (LSST) will generate on average 15 TB of data per night. The Polar Geospatial project accumulates enormous volumes of images. Image data in the medical and biomaterial fields, generated by confocal and multiphoton microscopes, can be huge: imaging a mouse brain needs more than 25 PB of storage, a primate brain as much as 800 PB, and a human brain as much as 14 EB.
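Some back-of-the-envelope arithmetic (my own, using the figures quoted above plus an assumed number of usable observing nights) shows how quickly these rates turn into archive-scale totals:

```python
# Rough scale arithmetic; only the 15 TB/night and brain-storage figures come
# from the talks, the number of observing nights is my assumption.
TB, PB, EB = 1e12, 1e15, 1e18

lsst_per_night = 15 * TB
nights_per_year = 300                          # assumed usable observing nights
print(lsst_per_night * nights_per_year / PB)   # ~4.5 PB/year of raw images

print(14 * EB / (25 * PB))                     # human brain ~560x the mouse brain
```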

As in an in-person conference, the workshop had concurrent sessions, but all slides were made available. Below, I outline some of the presentations that impressed me most.

Keynote by Dr. Craig Partridge from Colorado State University, best known for his contributions to the technologies of the internet. Craig started by talking about data transfer errors across the internet. Work from 20 years ago showed that most end-to-end errors occurred in hosts, routers, and middleboxes (Stone & Partridge, SIGCOMM 2000). One study in 2011 suggested that as WiFi data rates increased, error rates jumped substantially (as high as 34%; Feher, Access Networks 2011). Nowadays, roughly 1 in every 121 huge file transfers delivers bad data. Because of this, many scientists may be using bad data unknowingly.
I am a little skeptical about this. I've been transferring small and big files across servers for the CiteSeerX project. How do I know whether the files I transferred had errors? Are errors frequent when I transfer files between local machines and a public cloud like Google Drive? Banks also transfer large amounts of data from ATMs, personal computers, and mobile devices; do they see lots of errors as well?
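For what it's worth, the usual answer to my own question is end-to-end checksums: transfer tools such as Globus and rsync can verify them, and a minimal do-it-yourself check (a sketch with hypothetical file paths) looks like this:

```python
# Verify that a transferred file matches the source by comparing SHA-256 digests.
# Paths are hypothetical; in practice you compute each digest on its own host.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

src_digest = sha256_of("/data/source/papers.tar.gz")       # on the sending side
dst_digest = sha256_of("/data/destination/papers.tar.gz")  # on the receiving side
print("OK" if src_digest == dst_digest else "CORRUPTED IN TRANSIT")
```

Part of Stone & Partridge's point is that the TCP checksum is only 16 bits and misses some corruption, so an application-level digest like this is what actually catches it.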

Data Generation

Modern Research Imaging and X-omics by Jim Kenyon (University of Michigan)
The presentation talks about collecting, moving, storing, and computing on huge medical imaging data, and includes a diagram of the DNA sequencing data growth curve (Stephens 2015, PLOS ONE). The presenter also includes a table showing that the Cai Lab has a maximum throughput of 1037 TB/day! The potential solutions include compression techniques whose fidelity loss is inconsequential for the signals of interest, alternative representation formats (something different from a strict voxel representation), fiber extension of the data center network core to high-volume instruments, improved file-movement automation through tools like Globus, distributed storage and compute, and an AWS Snowball-like service.

Huge Data in Medicine by V.K. Cody Bumgardner (University of Kentucky)
The presentation talks about huge data in pathology, the study of disease that bridges science and medicine. Pathology generates a huge amount of data and is integrating AI techniques to enable clinical-grade computational pathology. The figure below shows the digital image growth in UK PACS radiology. However, their current infrastructure may not be able to keep up with the data growth.

Computational Challenges in Genomics Data Analyses by Soumya Rao (University of Missouri-Kansas City)
The presenter talks about challenges in genomics. The size of genomic data in public archives has reached the order of multiple petabytes. The challenges of handling, storage, transfer, and analysis include the high cost of setting up and maintaining servers and the computationally intensive analysis. The institute is seeking better infrastructure, such as hyper-converged clusters, state/nationwide or global grids, and machine learning, artificial intelligence, and cloud computing, to mitigate the bottlenecks.

Huge data for connected vehicles by Yunsheng Wang (Kettering University)
The presenter stressed that connected and autonomous vehicles (CAVs) are one of the major technological drivers in the automotive domain today. The presenter talked about Mobile Edge Computing, which brings computational resources, storage, and services closer to the consumers. This could be a solution to the huge amount of communication data (1-10 EB/month) exchanged among millions of cars.

Trustworthy AI for Huge Data Generation and Processing From IoT Devices by Hongxin Hu (Clemson University)
The presenter lists multiple challenges, such as the storage and speed of on-device AI and the security and privacy of on-device AI. The solutions include designing new compression mechanisms that lead to compressed DNN models that are not only accurate but also robust against adversarial attacks. Another research task is to develop a hardware-assisted approach based on trusted execution environments.

Data Storage

Data Enabled Radio Astronomy by David M. Halstead (CIO, NRAO)
The National Radio Astronomy Observatory is a facility of the National Science Foundation operated under cooperative agreement by Associated Universities, Inc. Its two large ground-based telescopes are the VLA (Very Large Array, NM) and ALMA (Atacama Large mm/sub-mm Array, Chile). Data storage and transfer are an enormous challenge for the VLA (see the diagram below). Each dish generates up to 320 Gbps, and to meet the requirements of interferometry, the signal must get back to the central correlator within half a second. The output of the correlator reaches about 70 Gbps, with archive growth of about 240 PB/year.
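As a sanity check (my arithmetic, not NRAO's), a sustained 70 Gbps correlator output is indeed on the order of the quoted archive growth, assuming a high duty cycle:

```python
# Convert a sustained correlator output rate to annual archive growth.
# The duty cycle is my assumption; the figure quoted in the talk is ~240 PB/year.
gbps = 70
bytes_per_second = gbps * 1e9 / 8
seconds_per_year = 365 * 24 * 3600
duty_cycle = 0.85                      # assumed fraction of time data is recorded
pb_per_year = bytes_per_second * seconds_per_year * duty_cycle / 1e15
print(round(pb_per_year))              # ~235 PB/year
```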


HL-LHC Data Challenges by Frank Würthwein (SDSC/UCSD)
Frank talked about the data generated by the LHC (Large Hadron Collider). ATLAS and CMS are two general-purpose detectors at the LHC. They expect to accumulate an exabyte of data per year starting in 2028. The first challenge is to take an exabyte of raw data from a distributed archive spread across a dozen national labs worldwide. The distributed nature is non-negotiable because no single country is prepared to provide all of the resources. This is done once a year, and every three years the processing volume grows by roughly a factor of three. The second challenge is data processing. ATLAS and CMS each have more than 1000 scientists, from a few hundred institutions in more than 50 countries, who want to analyze these data. The distributed nature of the data storage and access poses great challenges.

Scholarly Very Large Data: Challenges for Digital Libraries by Jian Wu (Old Dominion University)
This is my paper. It describes the current status of the CiteSeerX project, an open access (OA) digital library search engine indexing more than 10 million academic documents. The presentation outlines upcoming challenges in storing and indexing 35 million academic papers and making them freely available. The total size of the data is up to half a petabyte. This may not be "huge" compared with astronomical or medical image data (which is why I called it "very large"), but hosting it in an academic setting using commodity hardware and open-source software is non-trivial.

Data Movement

BigData Express: Toward Predictable, Schedulable and High-Performance Data Transfer by Wenji Wu (Fermilab)
BigData Express is a schedulable, predictable, and high-performance data transfer service. It implements a peer-to-peer, scalable, and extensible data transfer model with a visually appealing and easy-to-use web portal. The major components include:
  • BigData Express Web Portal
  • BigData Express Scheduler
  • AmoebaNet (network as a service)
  • mdtmFTP (high-performance data transfer engine)
  • DTN (data transfer node) agent (management and configuration)
  • Storage agent 
  • Data transfer launch 
BigData Express has already been released under the Apache License 2.0. The current version is 1.5.

Characterizing networking as experienced by users by Igor Sfiligoi (UCSD - San Diego Supercomputer Center)
This is a very technical talk. The motivation of this work is that reliable and performant networking is now a critical part of doing science; however, high-throughput networking does not play well with firewalls, and network performance tends to degrade if not monitored. As the number of DTNs has exploded, management has become an issue. Igor emphasized that testing DTNs is not enough, because most data transfers are many-to-many and most science computing does not happen on DTNs. He then argued that domain-specific tests are needed. The punch line is that being able to use DTNs for ad-hoc tests has tremendous value, and the PRP/TNRP approach provides the path.
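A toy version of this kind of user-level measurement (my sketch with a placeholder URL, not Igor's actual test harness) is simply to time a real transfer end to end on the application's own path, rather than trusting a DTN-to-DTN benchmark:

```python
# Time an actual end-to-end download and report the achieved throughput.
# The URL is a placeholder; a real test would fetch from the application's own data path.
import time
import urllib.request

url = "https://example.org/testdata/1GB.bin"   # hypothetical test object
start = time.monotonic()
total_bytes = 0
with urllib.request.urlopen(url) as resp:
    while True:
        chunk = resp.read(1 << 20)             # read in 1 MiB chunks
        if not chunk:
            break
        total_bytes += len(chunk)
elapsed = time.monotonic() - start
print(f"{total_bytes / elapsed / 1e6:.1f} MB/s over {elapsed:.1f} s")
```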

SDN/NDN Integrated Big Data Ecosystem for Big Science by Edmund Yeh (Northeastern University)
The presentation talks about the SANDIE project. It starts by listing huge data applications such as the LHC, LSST, SKA (Square Kilometer Array), and the Event Horizon Telescope, and outlines the challenges of big data systems. It points out the gap between application needs and existing networks/systems:
  • Current computer networks/systems focus on addresses, processes, servers, connections
  • Consequently, existing security solutions focus on securing data containers and delivery pipes
  • Applications care about data 
Data-centric networking applies a data-centric approach to system and network design, providing system support throughout the whole data lifecycle. They employ so-called named data networking (NDN). The idea is to develop an NDN naming scheme for fast access and efficient communication in HEP (high energy physics) and other fields. The SANDIE results showed greatly improved throughput and delay performance: they achieved over 6.7 Gbps throughput (single thread) between an NDN-DPDK-based consumer and producer, and the optimized caching and forwarding algorithms decreased download times by a factor of 10.
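I am not an NDN expert, but the naming idea can be illustrated with a toy, entirely hypothetical hierarchical name for a HEP data object; the network routes on (and caches by) these names rather than on host addresses:

```python
# Toy illustration of data-centric naming (hypothetical scheme, not the actual
# SANDIE convention): a file is addressed by a hierarchical name and fetched as
# named segments, so any cache holding a segment can answer, regardless of host.
dataset = "/ndn/hep/cms/store/mc/ttbar/AOD"    # hypothetical name prefix
segment_size = 8 * 1024                        # bytes per named segment
file_size = 2 * 1024**3                        # a 2 GiB data object

num_segments = (file_size + segment_size - 1) // segment_size
interests = [f"{dataset}/file0001.root/seg={i}" for i in range(3)]
print(num_segments, "segments; first interests:", interests)
```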

Data Processing and Security 

I did not attend this session. From reviewing the slides, most presentations are about systems, and I am not really in this field, so I only list the titles and presenters.

The big takeaways from the key observations session are listed below.
  1. Huge data is no longer an issue for only a small part of science; it has grown. Nowadays it exists in astronomy, gene sequencing, medicine, and archeology (LIDAR studies). To cope with the issue, one must define the content and trade off data volume against computation.
  2. These problems are unlikely to be solved in silos within individual domains. Domain science, applications, and infrastructure must be considered together.
  3. We may not need to keep all of this huge data, as some of it is time-dependent; sometimes we do need to keep it all, and knowing the difference is important. We should look at metadata to decide whether to process all of the data or only a subset, and which subset.
  4. The workshop reconfirms the direction that data should be the new "narrow waist" of the architecture. A new holistic architecture is needed.
Overall, the workshop was a success. In virtual meetings, people tend to listen instead of talk, but this was the best the organizers could do at this point. In the breakout sessions, although there were about 100 people in the Zoom session, most people were just listening to a few domain leaders talk. Physical meetings can push people to talk (otherwise they would feel embarrassed facing each other).

The workshop used Slack for data sharing and after-session discussion, but it was not well used, probably because people were tired of typing. The chat function in Zoom was used for Q&A, but typing is much slower than talking, so it was not favored and was finally dropped on the second day. There were four breakout sessions, but in the end only two were well attended, I think, though I might be wrong. The lightning talks were too short (only 5 minutes): many people flashed through slides or talked very fast, and some did not have any slides at all. The organizers were very patient and friendly, which kept the meeting running smoothly.

Jian Wu
