Posts

Showing posts with the label jianwu

2019-05-03: Selected Conferences and Orders in WS, DL, IR, DS, NLP, AI, CV

The time at which research work gets finished is usually less predictable than that of homework. You may submit a paper next year, but you cannot submit your homework next year. Even if there is a target deadline, the results may not be delivered on time. Even if the results are ready, the papers may not be in good shape, especially papers written by students. Even if papers are submitted, they can be rejected. Therefore, it is usually useful to decide in advance where to submit the work next. I used to struggle to find the next deadline for my work, so I compiled this timeline, sorted by month. The deadlines are not intended to be accurate because they change every year. They can also be extended. The deadlines may vary depending on the submission type: full paper, short paper, poster, etc. The focus is on the approximate chronological order in which the deadlines happen. One can always visit a conference's website for the exact deadline. It is also not intended to be exhaustive as it

2019-02-02: Two Days in Hawaii - the 33rd AAAI Conference on Artificial Intelligence (AAAI-19)

The 33rd AAAI Conference on Artificial Intelligence, the 31st Innovative Applications of Artificial Intelligence Conference, and the 9th Symposium on Educational Advances in Artificial Intelligence were held at the Hilton Hawaiian Village, Honolulu, Hawaii. I had one paper accepted by IAAI 2019 on Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets, coauthored with Athar Sefid (my student at PSU), Jing Zhao (my mentee at PSU), Lu Liu (a graduate student who published a Nature Letter), Cornelia Caragea (my collaborator at UIC), Prasenjit Mitra, and C. Lee Giles. This year, AAAI received the greatest number of submissions, 7095, which is double the number of submissions in 2018. There were 18191 reviews collected, and over 95% of papers had 3 reviews. There were 1147 papers accepted, which accounts for 16.2% of all submissions. This is the lowest acceptance rate in history. There were in total 122 technical sessions, 460 oral presentations (15 min talk) and

2018-12-17: CoQA Challenge: Machine Reading Competition Recent Result

CoQA is a dataset containing more than 127,000 questions with answers collected from more than 8,000 conversations. Each conversation is a series of questions and answers about a passage. One example passage is below: Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found

2018-12-14: New Insight to Big Data: Trip to IEEE Big Data 2018

IEEE Big Data 2018 was held at the Westin Seattle Hotel between December 10 and December 13, 2018. More than 1100 people registered. The acceptance rates vary from 13% to 24%, with an average rate of 19%. I had a poster accepted titled “CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset”, co-authored with C. Lee Giles, two of his graduate students (Bharath and Shaurya), as well as an undergraduate student who produced preliminary results (Jianyu Mao). I attended the conference on Day 2 and Day 3 and left the conference hotel after the keynote on Day 3. Insights from Personal Meetings: The most important reason to attend conferences is to meet old friends and make new ones. Old friends I met include Kyle Williams (Microsoft Bing), Mu Qiao (IBM, chair of I&G track), Yang Song (Google AI, co-chair of I&G track), Manlin Li (Google Cloud), and Madian Khabsa (Apple Siri). Kyle introduced the recent project on recomme

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate

Google Scholar has long been regarded as a digital library containing the most complete collection of scholarly papers and patents. For a digital library, completeness is very important because otherwise, you cannot guarantee the citation count of a paper, or equivalently the in-degree of a node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than any other digital library with fancy functions. Today, I found two very interesting aspects of Google Scholar: one is clever and one is silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides and counts their citations separately. If you search "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view as I attached. Note that there are three results. The first is a paper on IEEE. The second actually contains a list of completely different authors. These people

2018-11-09: Grok Pattern

Grok is a way to match a text line against a regular expression, map specific parts of the line into dedicated fields, and perform actions based on this mapping. Grok patterns are (usually long) regular expressions that are widely used in log parsing. With tons of search engine logs, how to effectively parse them and extract useful metadata for analytics, training, and prediction has become a key problem in mining big text data. In this article, Ran Ramati gives a beginner's guide to the Grok patterns used in Logstash, one of the powerful tools in the Elastic Stack (the other two are Kibana and Elasticsearch): https://logz.io/blog/logstash-grok/ The StreamSets webpage gives a list of Grok pattern examples: https://streamsets.com/documentation/datacollector/3.4.3/help/datacollector/UserGuide/Apx-GrokPatterns/GrokPatterns_title.html The recent paper by Huawei research lab in China summarizes and compares a number of log parsing tools: https://arxiv.org/abs/1811.03509 I am k
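To make the idea of mapping parts of a line into named fields concrete, here is a minimal sketch in plain Python. It is not Logstash and does not use any Grok library: the `BASE_PATTERNS` table, the helper names, and the sample log line are all assumptions made for illustration, and the regular expressions are much simpler than the ones shipped with real Grok pattern libraries.

```python
import re

# Hypothetical, tiny stand-in for a Grok pattern library (illustration only).
BASE_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "URIPATH": r"/\S*",
}


def grok_to_regex(pattern):
    """Expand every %{NAME:field} token into a named capture group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, BASE_PATTERNS[name])

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, pattern))


def parse_line(pattern, line):
    """Return a dict of field name -> matched text, or None if no match."""
    m = grok_to_regex(pattern).match(line)
    return m.groupdict() if m else None


# Made-up access-log line and a Grok-style pattern describing it.
PATTERN = ("%{IP:client} %{WORD:method} %{URIPATH:request} "
           "%{NUMBER:bytes} %{NUMBER:duration}")
fields = parse_line(PATTERN, "55.3.244.1 GET /index.html 15824 0.043")
print(fields)
```

Each `%{NAME:field}` token becomes a named group, so the matched pieces come back as a dictionary keyed by field name, which is roughly the mapping step described above.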

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a very straightforward job, given that Python has been around for a long time and has sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can be used as starting points for future XML parsing jobs. CDATA. CDATA is seen in the values of many XML fields. CDATA means Character Data. Strings inside a CDATA section are not parsed. In other words, they are kept as they are, including markup. One example is <script> <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]> </script> Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding the text uses before opening it and re
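The CDATA behavior can be checked with a short sketch using only the standard library's `xml.etree.ElementTree`. This is an illustration, not necessarily the approach taken in the post; the helper name `cdata_text` is made up, and the snippet is the example above with the surrounding whitespace removed.

```python
import xml.etree.ElementTree as ET

# The CDATA example from the text, without the extra whitespace.
XML_SNIPPET = (
    "<script><![CDATA[<message> Welcome to TutorialsPoint </message>]]></script>"
)


def cdata_text(xml_string):
    """Return the text of the root element; CDATA content comes back verbatim."""
    root = ET.fromstring(xml_string)
    return root.text


raw = cdata_text(XML_SNIPPET)
print(raw)  # the markup inside CDATA is plain text, not a child element

# If the CDATA payload is itself well-formed XML, it can be parsed in a second pass.
inner = ET.fromstring(raw)
print(inner.tag, repr(inner.text))
```

The first parse sees no `<message>` child element at all, only a string. A second parse is needed to treat that string as markup, and it works only when the payload happens to be well-formed XML.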

2018-08-30: Excited to Join WS-DL group in ODU!

I am an outlier compared with most computer scientists because I spent 10 years in a field called "Astronomy and Astrophysics". Very few computer scientists followed the same path as me, transferring from a seemingly irrelevant major. But this is where my passion is, so I did it, and I made it! Right after I graduated with a PhD in 2011, I joined the CiteSeerX group directed by Dr. C. Lee Giles at IST, Penn State University. I worked as a DBA for web crawling at the beginning and soon became the tech leader of the search engine, and recently the Co-PI of an NSF awarded proposal on CiteSeerX. I spent six years, an unusually long time, as a postdoc and then was promoted to a teaching faculty position. However, I kept moving on, because I wanted to do research! Luckily, Michael and Michele did not mind taking the risk and betting on me to be a tenure-track faculty member at Old Dominion University. So I accepted the offer and became a member of the Web Science Digital Library group at ODU