2019-05-14: Back to Pennsylvania - Artificial Intelligence for Data Discovery and Reuse (AIDR 2019)

AIDR 2019
The 2019 Artificial Intelligence for Data Discovery and Reuse conference, supported by the National Science Foundation, was held at Carnegie Mellon University, Pittsburgh, PA, from May 13 to May 15, 2019. It is called a conference, but it is more like a workshop. There are only plenary sessions (and a small poster session), and the presentations are not all about research frontiers. Many of them are research reviews in which the speakers try to connect their work to "data reuse". The presenters come from various domains, from text mining to computer vision, from medical imaging to self-driving cars, etc. Another difference from regular CS conferences is that presenters are accepted based only on the abstracts they submitted. The full papers are submitted later.

Because CiteSeerX collects a lot of data from the Web, and our group does a lot of work on information extraction and classification and reuses a lot of data for training AI models, Dr. Lee Giles recommended that I give a presentation. My title was "CiteSeerX: Reuse and Discovery for Scholarly Big Data". In general, the talk was well received. One person asked how we plan to collect annotations from authors and readers by crowdsourcing. My answer was to take advantage of the CiteSeerX platform, but we need to collect more papers (especially more recent papers) and build better author profiles before sending out the requests. I will compile everything into a 4-page paper.

In my 1 1/2 days at CMU, I listened to two keynotes. The first was given by Tom Mitchell, one of the pioneers of machine learning and the chair of the machine learning department. His talk was on "Discovery from Brain Image Data". I had previously attended a webinar by him on a similar topic. His research connects natural language with brain activity, studying how the brain reacts to spoken-language stimuli. Here are some takeaways: (1) it takes about 400 ms for the brain to fully process a word such as "coffee"; (2) the reaction happens in different regions of the brain and it is dynamic (changing over time). The data was collected using fMRI from several people, and it took quite a bit of work to denoise the fMRI signals and filter out other ongoing brain activity.

The second keynote was given by Glen de Vries, the president of a startup company called Medidata. Glen talked about how Medidata improves confidence in drug testing by using synthetic data. The presentation was given in a very professional way (like a TED talk), but Dr. Lee Giles commented that he was using a statistical method called "boosting", and Glen agreed.

Another interesting talk was given by Natasha Noy from Google. Her talk was about the recently launched search engine called "Google Dataset Search". According to Natasha, the idea was proposed in one of her blog posts in 2017. The search engine went online in September 2018. Unfortunately, because it was not well advertised, very few people know about it. I personally learned about it only two weeks ago. The search engine uses the data crawled by Google. The backend uses basic methods to identify public datasets annotated with the schema.org vocabulary, which defines a comprehensive list of metadata fields for semantic entities. I explored this schema in 2016. The schema could be used for CiteSeerX, replacing Dublin Core, but it does not cover semantically typed entities and relationships. So currently, it is good for metadata management. The indexed datasets were also limited to certain domains. Another interesting data search engine is Auctus, a dataset search engine tailored for data augmentation. It searches for data using data as input.
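
For readers who have not seen schema.org markup, here is a minimal sketch in Python of the kind of annotation a dataset page might embed. This is only my illustration, not how Google Dataset Search or CiteSeerX is implemented. The helper names (build_dataset_jsonld, wrap_in_script_tag), the example values, and the example.org URL are all made up; only the property names (name, description, url, keywords, license, creator) come from the schema.org Dataset type.

```python
import json


def build_dataset_jsonld(name, description, url, keywords, license_url, creator_name):
    """Return a schema.org Dataset description as a JSON-LD string.

    Hypothetical helper for illustration; the property names follow the
    schema.org Dataset type, everything else is an assumption.
    """
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
        "license": license_url,
        "creator": {"@type": "Organization", "name": creator_name},
    }
    return json.dumps(record, indent=2)


def wrap_in_script_tag(jsonld):
    """Wrap JSON-LD in the script tag that a web page would embed."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


if __name__ == "__main__":
    # All values below are placeholders, not a real dataset record.
    markup = build_dataset_jsonld(
        name="Example scholarly metadata sample",
        description="A made-up sample of paper metadata used for illustration.",
        url="https://example.org/datasets/sample",
        keywords=["scholarly big data", "metadata"],
        license_url="https://creativecommons.org/licenses/by/4.0/",
        creator_name="Example Research Group",
    )
    print(wrap_in_script_tag(markup))
```

A crawler that understands schema.org can then pick up the dataset's name, description, and license from the page without any site-specific parsing, which is what makes this kind of markup attractive for metadata management.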

Other interesting talks included:
  • Dr. Cornelia Caragea gave two presentations, one on "keyphrase extraction" - she is an expert in this field, and one on "web archiving" - in collaboration with Mark Phillips of UNT.
  • Matias Carrasco Kind, an astronomer, talked about "Searching for similarities and anomalies in galaxy images".

At the conference, I met with Dr. C. Lee Giles and Dr. Cornelia Caragea. All of us were very glad to see each other. We had a very pleasant dinner at a restaurant called "Spoon". I had a lunch conversation with Dr. Beth Plale, an NSF program director. She gave me some good suggestions on how to survive as a tenure-track professor. I also had brief conversations with Natasha Noy of Google AI and Martin Klein of Los Alamos National Lab.

Overall, the conference experience was very good and I learned a lot by listening to top speakers from CMU. The registration fee was low and they served breakfast, lunch, and a banquet (which I could not attend). The city of Pittsburgh is still cool and windy, but I am quite used to it because I lived in Pennsylvania for 14 years! The Cathedral of Learning reminded me of the good old days when I visited my friend Dr. Jintao Liu. He used to be a graduate student at Pitt and is now a professor at Tsinghua University. By the way, the SuperShuttle service was not very good. The front desk canceled my trip from the airport to my hotel because the agent was not able to contact the driver. I had to take a taxi. I used Uber on the way back. It was quick and inexpensive.

Jian Wu