2018-12-14: New Insights into Big Data: A Trip to IEEE Big Data 2018
IEEE Big Data 2018 was held at the Westin Seattle Hotel from December 10 to December 13, 2018. More than 1,100 people registered. Acceptance rates varied between 13% and 24% across tracks, with an average of 19%. I had a poster accepted, titled “CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset”, co-authored with C. Lee Giles, two of his graduate students (Bharath and Shaurya), and an undergraduate student who produced the preliminary results (Jianyu Mao). I attended the conference on Day 2 and Day 3 and left the conference hotel after the keynote on Day 3.
Insights from Personal Meetings
The most important part of attending conferences is meeting old friends and making new ones. Old friends I met include Kyle Williams (Microsoft Bing), Mu Qiao (IBM, chair of the I&G track), Yang Song (Google AI, co-chair of the I&G track), Manlin Li (Google Cloud), and Madian Khabsa (Apple Siri).
Kyle introduced his recent project on inferring recommendations from dialogs. He also committed to giving an invited talk for my IR class in the spring semester.
Mu mentioned his project on anomaly detection in time-series data.
Yang talked about his previous work on CiteSeerX and Microsoft Academic Search. He said that one big obstacle to adopting MAS (and all other digital library search engines) is that none of them is comparable to Google Scholar in terms of completeness. The reason is simple: people want to see higher citation counts for their papers. He suggested that I shift my focus to mining information from the text that publishers do not make available.
Madian told me that although I might think nobody uses Siri, there are still quite a lot of usage logs. One reason Siri is not perfect is its relatively small team compared with Google's and Microsoft's. He also said that it is a good time to apply for academic jobs these days, because industry pays far more than universities and thus attracts the best PhDs in AI.
I also introduced myself to Aidong Zhang, an NSF IIS director. Apparently, she knows Yaohang Li and Jing He well. I sent her my CV. I also met Huaglory Tianfield and Liqiang Wang of the University of Central Florida.
Insights from Keynote Speakers
The two keynote speakers I liked best were Blaise Aguera y Arcas from Google AI (who, incidentally, is Yang Song's boss) and Xuedong Huang from Microsoft.
Blaise’s talk started from the first neural network paper by McCulloch & Pitts (1943), now cited 16k+ times according to Google Scholar. He reviewed the development of AI since 2006, the year when deep learning researchers started to appear at mainstream CS conferences. He talked about Jeff Dean, the director of Google Brain, and the recent paper by Bonawitz et al. (2016). He highlighted the recent progress on Federated Learning, i.e., training deep neural networks on decentralized data. Finally, he made a very good point: a successful application depends not only on the model but also on the data. He gave the example of a project that attempted to predict sexual orientation from facial features; those features depend strongly on the shooting angle of the photograph, so the model makes wrong predictions. In contrast, a work predicting criminality from facial features of standardized ID photographs achieved very accurate results.
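For readers unfamiliar with the term, here is a minimal sketch of the federated averaging idea behind Federated Learning: each client updates its own copy of the model on local data, and a server averages the resulting models, weighted by client data size. This is my own toy illustration (linear regression with NumPy stands in for a deep network), not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])  # hypothetical ground-truth weights

def make_client(n):
    """Generate one client's private, decentralized dataset."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

clients = [make_client(n) for n in (50, 80, 120)]

def local_update(w, X, y, lr=0.05, epochs=5):
    """Each client runs a few epochs of gradient descent locally."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

w_global = np.zeros(2)
for _ in range(20):  # communication rounds
    sizes = np.array([len(y) for _, y in clients])
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    # Server step: average client models, weighted by data size.
    w_global = np.average(local_ws, axis=0, weights=sizes)

print("recovered weights:", w_global)  # close to [2, -3]
```

The raw data never leaves the clients; only model parameters are exchanged, which is the privacy appeal Blaise emphasized.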
Xuedong Huang’s talk was also comprehensive. He focused on the impact of big data on natural language processing, using Microsoft products as case studies. One of the most encouraging results is that Microsoft has developed effective real-time translation tools that can facilitate team meetings held in different languages. This implies that if TTS (text-to-speech) becomes sufficiently sophisticated, people may no longer need to learn a foreign language. He also reminded the audience that big data is a vehicle, not the final destination; knowledge is the final destination. He admitted that current techniques are not yet sophisticated at denoising data.
The other keynote speeches were not very impressive to me. I always feel that although it is fine for keynote speakers to talk about their own research or products, they should stand at a higher vantage point, surveying the broad problems the community is interested in, rather than focusing on a few narrowly defined problems with too much jargon and too many definitions and equations.
Impressive Talks
I went to presentations and posters selectively. My impression was that streaming data, temporal data, and anomaly detection are becoming more and more popular. Below are some talks I was particularly interested in.
BigSR: real-time expressive RDF stream reasoning on modern Big Data platforms (Session L9: Recommendation Systems and Stream Data Management)
The motivation is to use a semantics-based method to facilitate anomaly detection. This was the first time I had heard of Apache Flink. BigSR and Ray are promising replacements for Spark. I had just taken a Spark training session by PSC the week before, and there are already systems faster than Spark!
Unsupervised Threshold Autoencoder to Analyze and Understand Sentence Elements (Annual Workshop on Big Data Analytics)
The author worked on a multiclass classification problem using an autoencoder. He found that the performance of the model depends on hyperparameters such as the number of hidden layers and/or neurons. I commented that this was likely an artifact of his relatively small training set (44k samples); with far more training data, the differences between model architectures may diminish. The author also did not clearly explain how he handled the imbalance of training samples across categories; a common remedy is sketched below.
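One standard way to handle such imbalance is to reweight the loss by inverse class frequency. A minimal sketch with scikit-learn, using hypothetical data (this is my suggestion, not the author's method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced multiclass data: class 2 is rare.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.choice(3, size=1000, p=[0.70, 0.25, 0.05])

# 'balanced' weighs each class by n_samples / (n_classes * count),
# so mistakes on the rare class cost proportionally more.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.arange(3), y=y)
print(dict(zip(range(3), np.round(weights, 2))))

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

The same idea (per-class weights in the loss) carries over to autoencoder-plus-classifier setups in any deep learning framework.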
Forecasting and Anomaly Detection on Application Metrics using LSTM (In Intelligent Data Mining Workshop)
The two challenges are (1) interpretability (explaining why a sample is flagged as anomalous) and (2) rarity (quantifying how rare an abnormal sample is). The author used Pegasos, a stochastic sub-gradient algorithm for solving the SVM optimization problem.
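For reference, Pegasos (Shalev-Shwartz et al.) minimizes the regularized hinge loss by stochastic sub-gradient steps with a decaying learning rate. Below is a minimal sketch of the linear variant on a toy dataset of my own making (the original paper also covers kernels, which is what non-linear classification would need):

```python
import numpy as np

def pegasos(X, y, lam=0.01, T=10000, seed=0):
    """Pegasos: stochastic sub-gradient solver for the linear SVM
    objective lam/2 * ||w||^2 + mean hinge loss. Labels must be +/-1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)            # sample one training point
        eta = 1.0 / (lam * t)          # decaying step size
        if y[i] * (w @ X[i]) < 1:      # margin violated: hinge term active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                          # only the regularizer contributes
            w = (1 - eta * lam) * w
    return w

# Toy linearly separable data, just to exercise the solver.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) + np.array([[2.0, 2.0]]) * rng.choice([-1, 1], size=(200, 1))
y = np.sign(X[:, 0] + X[:, 1])
w = pegasos(X, y)
print(f"training accuracy: {np.mean(np.sign(X @ w) == y):.2f}")
```

Each iteration touches a single example, which is what makes Pegasos attractive for the large metric streams the talk dealt with.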
Multi-layer Embedding Neural Architecture with External Memory for Large-Scale Text Categorization, Mississippi State (In Intelligent Data Mining Workshop)
The authors attempt to capture long-range correlations by storing more memory in LSTM nodes. The idea looks intuitive, but I am skeptical about (1) how useful it is for scholarly data, as the model was trained on news articles, and (2) whether the overhead is significant when classifying big data.
A machine learning based NL question and answering system for healthcare data search using complex queries (In Health Data Workshop)
The author attempts to classify all incoming questions into six categories. Although this particular model looks simplistic (the author admitted scalability issues), it may be a good idea to map all questions into a narrow range of question types. This greatly reduces the dimensionality and may also be useful for summarization.
Conference Organization, Transportation, and the City of Seattle
The organization was very good, although registration was quite expensive ($700). The conference was well sponsored by Baidu and another Chinese company. One impressive part of this conference was a hackathon, which asked participants to solve a practical problem in 24 hours. I think JCDL should do something like this. The results may not be the best, but it pushes participants to think intensively within a very limited time window.
The conference venue is located in downtown Seattle. Transportation is super convenient, with bus, light rail, and monorail stations nearby connecting to all the places of interest. Pike Place Market, where the first Starbucks store is located, is a 10-minute walk away. There are many restaurants offering cuisines from all over the world. I stayed at the Mediterranean Inn, one mile from the venue, still within walking distance. The Expedia combo (hotel + flight) cost me $850 for a three-night hotel stay and a round-trip flight from ORF to SEA.
Seattle is a beautiful city. It drizzles frequently in this season, so locals like to wear waterproof hoodies. People are nice. I got a chance to visit the University of Washington Library, whose grand reading room is often compared to the Hogwarts halls in Harry Potter.
Jian Wu