2019-11-20: Trip Report to K-CAP 2019

Between November 18 and 20, I attended the 2019 International Conference on Knowledge Capture (K-CAP 2019). K-CAP is an ACM-sponsored conference rated “A” in the ERA conference rating system, and it is held once every two years. Its European counterpart is EKAW (unfortunately rated “B”), which is also held every two years. I had papers accepted at K-CAP 2015 and 2017.

This year, I co-authored a short paper titled “Searching for Evidence of Scientific News in Scholarly Big Data” (poster link). The first author is my co-advised student Reshad Hoque at ODU. I also co-authored and presented a long paper titled “Automatic Slide Generation for Scientific Papers” at the 3rd International Workshop on Capturing Scientific Knowledge (SciKnow 2019). The first author is my co-advised student Athar Sefid at Penn State. Due to my tight schedule, I had to return right after the keynote by Peter Clark on the first day, so this trip report summarizes the SciKnow workshop, the tutorial on “Build a large-scale cross-lingual text search engine from scratch”, the poster session for short papers, and the keynote session.

At the beginning of the SciKnow workshop, Dr. Yolanda Gil gave the keynote speech titled “Hypothesis-driven data analysis”. Yolanda and her group have been working on ontology linking and knowledge extraction for a long time. She presented a lot of top-level research, including how to automate hypothesis testing, which is related to the very popular topic of R&R (repeatability & reproducibility). She also talked about workflow alignment and merging; model coupling, combination, and distribution; and the ontology of future scientific papers. The talk wrapped up with an open question on how to use automatic hypothesis testing for decision making. The topics were very timely and interesting, though a little diverse. She mentioned several existing and ongoing projects, such as WINGS and MINT (Model INTegration). She also introduced a website, https://www.scientificpaperofthefuture.org/. Her group has been holding sessions to train scientists to write papers in a more structured, complete, repeatable, and reproducible manner. I definitely agree with her about the way to write future scientific papers. The only question I had was how hypotheses are represented and how they can be tested and updated automatically.

The SciKnow 2019 workshop featured 4 long papers and 4 short papers. Each long paper was given 20 minutes: a 15-minute presentation and 5 minutes of Q&A.

In the presentation titled “Semantic Workflows and machine learning for the assessment of carbon”, the authors attempted to build a classifier to distinguish between grass, trees, water, and impervious regions. I was impressed that the presenter had manually annotated several thousand Google Earth images.

Another talk that interested me was about a knowledge graph (KG) system called ORKG (Open Research Knowledge Graph). The KG was designed to answer two questions: (i) How to compare research contributions in a graph-based system? (ii) How to specify and visualize research contribution comparisons in a user interface? Unlike most KG systems, which are constructed using NLP techniques, this KG is built by crowdsourcing. The current analysis is based on a relatively small sample (on the order of tens). Although human-labeled data tend to be more accurate, scaling up the system is a problem the authors need to overcome.

Other useful resources I noted in the workshop include
·      The GLUE benchmark for NLP tasks
·      Language and word embedding models (some I did not know): ULMFiT, OpenAI GPT, BERT, and XLNet, which differ in their input units: characters (ELMo), subwords (GPT, BERT), and n-grams (fastText)
·      Bio2RDF: an API to generate RDF triples in biological domains
·      A knowledge graph mapping tool called YARRRML
·      The WDPlus project and its framework called T2WML

I missed the morning tutorial called “Hybrid Techniques for Knowledge-based NLP - Knowledge Graphs meet Machine Learning and all their friends” by Jose Manuel Gomez-Perez, Ronald Denaux and Raul Ortega, but attended one called “Building a large scale cross language text search engine” by Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia and Oscar Corcho from Universidad Politécnica de Madrid (UPM). The tutorial started with simple IR concepts such as tokenization and lemmatization. It then dove into topic models such as Latent Dirichlet Allocation, how to represent documents as topic vectors, and how to apply approximate nearest neighbor search to classify documents. For the cross-lingual part, they used Spanish. Asian languages (e.g., CJK) were not covered. The tutorial presenter used Google Classroom to present all the source code and results. They used Docker to encapsulate everything needed for the demos, so we could run them on our personal computers. These tools can be borrowed for my future courses and tutorials.
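
The idea of classifying a document by its nearest neighbor in topic space can be sketched as follows. This is only a minimal illustration, not the tutorial's code: the topic vectors, labels, and function names are made up, cosine similarity is my choice of similarity measure, and an exact brute-force search stands in for the approximate nearest neighbor index that would be needed at scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_label(query, corpus):
    """Return the label of the corpus document whose topic vector is most similar to the query."""
    best = max(corpus, key=lambda doc: cosine(query, doc["topics"]))
    return best["label"]

# Hypothetical 3-topic distributions, as a topic model such as LDA might produce.
corpus = [
    {"label": "biology", "topics": [0.80, 0.15, 0.05]},
    {"label": "physics", "topics": [0.10, 0.85, 0.05]},
    {"label": "economics", "topics": [0.05, 0.10, 0.85]},
]

# Topic vector of a new, unlabeled document (also hypothetical).
query = [0.70, 0.20, 0.10]
print(nearest_label(query, corpus))
```

Brute-force search compares the query against every document, which is fine for a toy corpus but too slow for a large collection; that is where an approximate index becomes useful.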

The poster session was held at the Information Sciences Institute (ISI), about a 25-minute walk from the conference hotel (yes, I walked there). Some interesting posters drew my attention.

  • Jointly Learning from Social Media and Environmental Data for Typhoon Intensity Prediction. In this study, entities are extracted from social media data and encoded into vector representations, which are then concatenated with conventional environmental data to predict the intensity of typhoons. About 100k tweets were collected over a period of about 10 years. They used spaCy for NER and then ConceptNet for semantic embedding. The architecture includes a unidirectional LSTM (used to encode deep semantic features) and a feedforward network (with dropout) followed by a softmax to generate a probability. My question was how useful the model is because (i) to collect a sufficient amount of data about a typhoon, we may have to wait so long that we miss the typhoon; (ii) the intensity of a typhoon may change over time. But the feature analysis in this paper does indicate that social media features are more important than environmental features, which was surprising.

  • Understanding Financial Transaction Documents using Natural Language Processing. The authors developed a pretty sophisticated way to detect ineligible reimbursement items in financial systems. The proposed system uses a customized Tesseract OCR model (which may need retraining) to extract text from scanned or photographed reimbursement reports. It then performs entity extraction and semantic analysis to identify items that do not comply with certain restrictions. For example, “Spa treatment massage” is identified as an ineligible item. The system is developed for commercial use, so it relies on some proprietary datasets and tools. The evaluation shows decent performance, with F1=85% on 6000 test samples.

The first keynote was given by Peter Clark, the director of the Aristo project at the Allen Institute for AI (AI2). The Aristo project aims to build an intelligent system that is able to capture and reason with scientific knowledge. The system starts by converting problems and relevant information into structured knowledge. They tried relational database tables and knowledge graphs and found that the latter work better. They are now able to answer multiple-choice questions at the 8th grade level, such as “Which month has the longest daytime in New York City?” In fact, the accuracy on the NY Regents 8th Grade exam (NDMC, the non-diagram multiple-choice questions) exceeded 90% in 2019, using language models plus specialist solvers.

One question that Aristo failed is:
Which of these objects will most likely float in water?
(correct) Table tennis ball. (wrong) Hard rubber ball.

Other failed questions are of the reading-comprehension type. For example:

A student wants to determine the effect of garlic on the growth of a fungus species. Several samples of fungus cultures are grown in the same amount of agar and light. Each sample is given a different amount of garlic. What is the independent variable in this investigation?
(correct) Amount of garlic. (wrong) Amount of growth.

Aristo is also unable to answer a traditional math question such as this one:
Two trains travel toward each other at different speeds, v1 and v2, starting a distance s apart. How long does it take for them to meet?
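
For reference, the expected answer follows from relative speed. Here is a short derivation, assuming the trains move toward each other at constant speeds; the symbol t (not in the original question) denotes the time until they meet, at which point the two distances covered add up to s:

```latex
\begin{align*}
v_1 t + v_2 t &= s \\
t &= \frac{s}{v_1 + v_2}
\end{align*}
```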

In summary, Aristo has gained surprising success with language modeling. The project finds that structure is not essential for many tasks, and that pattern matching alone can answer many questions. But the method falls short on numerous types of questions, implying that many other aspects of AI are missing. Structured reasoning and knowledge capture, but with more language-like representations, may be the way forward.

Marina del Rey is in the vicinity of Los Angeles and has very beautiful beach scenery. One nearby beach is Venice Beach, which has a very long fishing pier. But I just took a brief look due to my tight schedule. USC and UCLA both have some buildings in this town. Public transportation is crowded, especially from the airport, so I decided to take Uber. Carpooling can save some money, but at the cost of time, because the driver is obligated to pick up at least 2 passengers. On one ride, the second passenger canceled the trip at the last minute. The hotel I stayed in is called the Jolly Roger Hotel, a very small but affordable hotel. People were very friendly, but the Thai restaurant near my hotel was very disappointing.

The most important thing at a conference is to meet people. People I already knew include
·      Yolanda Gil: Research Professor of Computer Science and Spatial Sciences and Principal Scientist at USC Information Sciences Institute
·      Jay Pujara: Research Assistant Professor of Computer Science
·      Ken Barker: IBM
·      Krutarth Patel: KSU
·      Peter Clark: AllenAI
Some new friends include
·      Andre Valdestilhas: a Brazilian graduate student studying in Germany
·      Enrico Daga: Open University Knowledge Media Institute
·      Tim Lebo: Air Force Research Lab
·      Prateek Jain: Director of Data Science at AppZen

Jian Wu