2019-02-14: CISE papers need a shake -- spend more time on the data section


A Crucial Step for Averting AI Disasters

I know this is a large topic and I may not have enough evidence to convince everyone, but based on my experience reviewing journal articles and conference papers, I strongly feel that computer and information science and engineering (CISE) papers need to devote more text to describing and analyzing the data.

This argument partially comes from my background in astronomy and astrophysics. Astronomers and astrophysicists usually spend a huge chunk of their papers talking about the data they adopt, including but not limited to where the data were collected, why they did not use another dataset, how the raw data were pre-processed, and why they ruled out particular outliers. They also analyze the data and report statistical properties, trends, or biases to ensure that the points in their plots are legitimate.

In contrast, in many of the papers I have read and reviewed, even at top conferences, CISE people do not often do such work. They usually assume that because a dataset has been used before, they can use it too. Many emphasize the size of the data, but few look into its structure, completeness, taxonomy, noise, and potential outliers. The consequence is that they spend a lot of space on algorithms and report results better than the baselines, but that is not a guarantee of anything. Good CISE papers usually discuss the bias and potential risks caused by the data, but good papers are rare, even at top conferences.
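To make this concrete, here is a minimal sketch of the kind of check a data section could report for a single numeric column: how complete it is, and which points look like outliers. The function name `describe_column`, the use of `None` for missing values, the 1.5 x IQR outlier rule, and the sample numbers are all my own assumptions for illustration, not a standard taken from any particular paper.

```python
import statistics


def describe_column(values):
    """Summarize one numeric column; None marks a missing entry (an assumption)."""
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values)

    # Quartiles from the standard library; the 1.5 * IQR fence is one
    # common rule of thumb for flagging outliers, not the only choice.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {
        "n": len(values),
        "completeness": completeness,
        "median": statistics.median(present),
        "outliers": outliers,
    }


# Hypothetical measurements, made up purely for illustration.
sample = [1.0, 1.2, 0.9, None, 1.1, 1.0, 9.5, 1.3, None, 0.8]
report = describe_column(sample)
print(report)
```

On this made-up sample the value 9.5 is flagged and two of the ten entries are missing. Whether to keep or drop such points is exactly the kind of decision the paper should state and justify.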

Algorithms are one of the pillars of CISE, but this does not mean they are everything. An algorithm only provides the framework, like a photo frame. The data is like the photo. Without the right photo, the picture (frame+photo) will not look pleasing. Even if it looks pleasing with one particular photo, it won't with others. Of course, no algorithm will fit all data, but at the very least the paper should discuss what types of data the algorithm should be applied to.

The good news is that many CISE people have started paying attention to this problem. At the IEEE Big Data Conference, Blaise Aguera y Arcas, the Google AI director, emphasized that AI algorithms have to be accompanied by the right data to be ethical and useful. Recently, a WSJ article titled "A Crucial Step for Averting AI Disasters" echoed the idea. The article quoted Douglas Merrill: “The answer to almost every question in machine learning is more data.” I would supplement this by adding "right" after "more". If we claim we are doing Data Science, how can we neglect the first part?

Jian Wu 
