2020-08-03: ACM SIGIR 2020 Non-Trip Report

On July 25, more than 1,300 registrants around the world opened their laptops and started attending SIGIR 2020. Via Zoom and the streaming capabilities of the conference web portal, we were able to watch speakers, raise hands, ask questions, and chat with attendees.  SIGIR 2020 was in Xi'an, China, but few registrants could attend in person due to the travel restrictions imposed to curb the COVID-19 pandemic. The SIGIR 2020 program committee successfully converted their in-person conference to an online variant. They reduced conference fees so more people could attend and made much of the live streaming content available to the public.

ACM SIGIR is the premier conference for information retrieval. I have always wanted to attend. I have no papers to present, but Los Alamos National Laboratory was kind enough to pay for my time and attendance. SIGIR 2020 was my first pandemic video conference. I witnessed impressive work in this rapidly changing landscape. The age of term frequencies, inverted indices, and other statistical techniques for simulating understanding seems to be fading, slowly giving way to language models, neural networks, and models of evaluation that more directly incorporate and emulate user behavior. Research in delivering documents via search engines is being supplanted by research in anticipating users' information needs by combining all kinds of contextual information. Of course, I continually wondered how we could integrate these new bodies of knowledge into our work with web archives, where problems of information retrieval and collection understanding continue to hinder their broader accessibility and utility.


Geoffrey Hinton

Turing Award Winner, Fellow of the Canadian Institute for Advanced Research, Engineering Fellow at Google, and Chief Scientific Advisor of the Vector Institute Geoffrey Hinton presented our first keynote, "The Next Generation of Neural Networks." Hinton is also an Emeritus Distinguished Professor at the University of Toronto. Hinton led us through supervised, reinforcement, and unsupervised learning to bring us to the concept of autoencoders: "a way to implement unsupervised learning by using supervised learning." Autoencoders know nothing about the data at the start but train themselves as we expose them to more data. He said that this process did not work in the past because researchers used the wrong artificial neurons, initialized weights poorly, and did not have fast enough computers to make this work. In 2006, we were able to revive the concept of deep learning by beginning to overcome these problems. He covered different autoencoder types, including the BERT autoencoder, which learns from text that has had some of its words removed. BERT learns embeddings for each word by comparing it to other words' embeddings in the same sentence. BERT employs transformers for these comparisons. From here, we have created language models that allow us to predict upcoming words based on the words already observed. We repeat this process until we have a document. Hinton mentioned that GPT-3, another language model, generates articles so well that "it kind of passes the Turing Test." He then went on to discuss other work in neural networks, such as t-SNE and SimCLR.

Zongben Xu

Machine learning master Zongben Xu delivered the second keynote, "On Presuppositions of Machine Learning: A Meta Theory." He is the director of Pazhou Lab in Guangzhou and the National Lab for Big Data Analytics in Xi'an. Xu provided a model for how all machine learning should function. According to his meta-theory, machine learning takes place in a hypothesis space, analyzing data according to some loss function and then preventing overfitting by applying a regularizer. He indicated that choosing the optimal settings for these elements is a "chicken and egg problem" because we would need to know the optimal solution to the problem, yet we are often performing machine learning as part of the path to discovering that solution. He covered many issues with existing machine learning models and offered a range of better solutions. One idea is to compute distances within Banach rather than Euclidean space. Another is to model the noise in the data to discover the best answers to the problem. He showed how researchers had applied these improvements to problems like discovering moving items in video and improving dosing for CT imaging. He closed by discussing how machine learning is benefitting from insights from cognitive science, like curriculum learning and self-paced learning.
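To make the loss-plus-regularizer idea concrete, here is a minimal sketch. It is not taken from Xu's talk: the linear model, the squared loss, the L2 penalty, the toy data, and the name regularized_risk are all assumptions chosen for illustration.

```python
def regularized_risk(weights, data, lam):
    """Mean squared loss of a linear model plus an L2 penalty on its weights.

    Illustrative only: squared loss and an L2 regularizer are assumed here,
    not prescribed by the talk.
    """
    loss = sum(
        (sum(w * x for w, x in zip(weights, xs)) - y) ** 2 for xs, y in data
    ) / len(data)
    penalty = lam * sum(w * w for w in weights)
    return loss + penalty


# Toy data (hypothetical): each item is (features, target).
data = [((1.0, 0.0), 1.0), ((0.0, 1.0), 2.0), ((1.0, 1.0), 3.0)]

perfect_fit = (1.0, 2.0)  # zero training loss
shrunk = (0.9, 1.8)       # small training loss, smaller weights

print(regularized_risk(perfect_fit, data, lam=0.0))
print(regularized_risk(perfect_fit, data, lam=0.1))
print(regularized_risk(shrunk, data, lam=0.1))
```

With lam set to 0 the perfect fit wins, but once the penalty is switched on the slightly shrunk weights score lower. This is the trade-off a regularizer introduces, and picking lam well is one face of the "chicken and egg problem" described above.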

Ellen M. Voorhees

ACM Fellow, Washington Academy of Sciences Fellow, and Text REtrieval Conference (TREC) manager Ellen M. Voorhees presented the third keynote, "Coopetition in IR Research." Voorhees is a Senior Research Scientist at the US National Institute of Standards and Technology (NIST). TREC is one of the essential families of datasets for evaluating information retrieval. Voorhees defined coopetition as cooperative competition toward a common goal of producing better systems. Per Voorhees, "competing may give you a bigger piece of the pie while cooperation makes the whole pie bigger." She detailed how the community cooperates to generate TREC datasets, which are then used by competing information retrieval solutions to improve their results. These concepts are built upon the Cranfield Paradigm, which establishes test collections of fixed documents with associated queries and relevance judgements. Reusing test collections is less expensive than running user tests for every system. By using them, "we lose realism, but we retain control over more variables at less cost." She mentioned that it is best to have multiple test collections for each search task so that individual solutions do not customize their results based on the peculiarity of a given collection. She spoke about the dangers of researchers misusing collections. Researchers must consider to what task each collection should be applied or their study may be flawed. She closed by talking about the TREC-COVID collection, a test collection built in real time from the volatile CORD-19 collection of evolving COVID-19 research, and by highlighting how building it has provided crucial insights into the volatile nature of research and the changing nature of relevance.

Norbert Fuhr

Gerald Salton Award Winner Norbert Fuhr presented our fourth keynote, "Proof by Experimentation? Towards Better IR Research." Fuhr is a professor of Computer Science at the University of Duisburg-Essen. He started by discussing a March 2020 study that reported promising results when using hydroxychloroquine for treating COVID-19. Fuhr discussed why this study was flawed and how it did not support the use of this drug to cure COVID-19. These results were his first example of a flawed study. He covered the concepts of internal validity - the data supports the claims - and external validity - the extent to which the results can be generalized. He provided an analysis to support why MRR/ERR and AP are problematic measures. When comparing relative improvements, Fuhr stated that we should consider the effect size instead of simple arithmetic means. Fuhr cautioned against using the phrase "our experiments prove" because experiments cannot prove universally valid statements. He also cautioned against performing multiple tests without correction. He mentioned that the current form of leaderboards for comparing IR systems is "too naïve." We need a better way of reporting comparisons. Fuhr stressed that conferences and journals need to accept papers with null results to give us a better understanding of the actual state of the art. Fuhr ended with a call to action for SIGIR, as the premier IR venue, to mirror ideas from medical science for more rigorous experimentation standards so that we have better ideas on how results were obtained and whether they are generalizable. His presentation generated quite a stir as he mentioned how many papers accepted at past SIGIR conferences had flaws in their methodology.
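As a minimal sketch of what correcting for multiple tests can look like, the snippet below applies the Bonferroni correction, one well-known option. The report does not say which correction Fuhr recommended, so the choice of Bonferroni, the function name, and the p-values are all assumptions for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests stay significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests,
    which bounds the chance of any false positive across the whole family.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from comparing one system against four baselines.
p_values = [0.004, 0.02, 0.03, 0.2]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)

print(uncorrected)  # three comparisons look significant
print(corrected)    # only one survives the corrected threshold of 0.0125
```

Bonferroni is conservative. Less strict alternatives exist, but the point is the same: the more comparisons an experiment runs, the stronger each individual result has to be.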

Elizabeth F. Churchill

ACM Fellow, ACM Vice President, and Google Director of User Experience Elizabeth Churchill presented our fifth keynote, "From Information to Assistance." Churchill is also a member of the ACM’s CHI Academy and an ACM Distinguished Scientist and Distinguished Speaker. Churchill started by stating that information seeking is an ancient problem. The goal has always been to assist users. A system must provide the right information, at the right place, at the right time, and in the right format. She mentioned that the "right format" is a central area of growing research involving information ergonomics, information design, and modality. Churchill stressed that information is social and that interaction allows us to acquire answers and also learn new questions in the process. She covered the Paco project, allowing researchers to interview participants in the moment of their information need so that we get a clearer picture of their behavior in context. 

She noted that we do not often use one app to accomplish an information-seeking task, and most of us do not use the same combination of apps in the same way. We do not interact with all devices in the same way. We speak with smart speakers while typing into laptops for the same information need. These differing device experiences lead us to require new forms of output, such as personality for smart speakers. For these interactions she provided an overview of the Material Design and Flutter projects - platforms for developing interfaces for different devices. She closed by providing examples of the psychological aspects of system design and use.

Dacheng Tao

ACM and IEEE Fellow Dacheng Tao presented our sixth and final keynote, "How Deep Learning Works for Information Retrieval." Tao is a Professor of Computer Science at the University of Sydney, a Fellow of the Australian Academy of Science, and the Inaugural Director of the UBTECH Sydney Artificial Intelligence Centre. Tao took us on a journey through traditional information retrieval approaches by describing the boolean model, hash model, vector space model, and probabilistic model (e.g., LDA). He noted that these approaches have several issues. They suffer from human bias due to handcrafted features. There are issues with statistical learning where it performs poorly at query understanding. Finally, these methods are poor at handling advanced queries. 

Deep learning approaches provide better word representation because they can convey relationships between words. They allow for better overall text representation and sentence structure analysis. Transformers, like the aforementioned BERT, realize this better text representation. Deep learning language modeling is not confined to a single language, but instead allows the system to translate and analyze across multiple languages. Deep learning is not limited to text - images can be processed alongside text to further improve the model. With deep learning, our systems better understand what we say or type, can convert it into the appropriate queries, and then produce the best response. Deep learning systems can also write text and analyze video for the relationships between objects. He noted that "deep learning is data hungry." We do not yet understand why it works, but he did discuss some breakthroughs in getting closer to figuring that out. Tao is convinced that deep learning is the direction of the near future not only for Information Retrieval but for computing in general.

Summer School

SIGIR 2020 Summer School is something I have not yet encountered at a conference. This event brought us subject matter experts from across the IR industry. They covered topics dear to them and shared their professional journeys, connecting their work to paint a broader picture of contribution.

ACM Fellow Susan Dumais from Microsoft Research covered the topic of personalized search. She started our journey with tales of the 1990s when web search engines ran locally on client machines, and Lycos could provide a shortlist of the top 5% of sites on the web. Now there are billions of web sites and trillions of pages. "Now, it is hard to imagine a world where you didn't have information at your fingertips or by speech." Because of this massive amount of content, personalization in web search becomes key to making the web useful. Dumais covered issues with ambiguity (e.g., does SIGIR refer to the Special Interest Group on Information Retrieval or the Special Inspector General for Iraq Reconstruction) and their dependence on knowledge about the searcher. She discussed personalized search where a server returns numerous results to a client, and the local machine maintains information about the searcher. This method has the benefit of privacy but does not scale across devices and does not leverage information from the community. She covered the concept of "Short + Long," where the system keeps a history of search behavior to predict the best results for a given user. Additionally, the user's physical and temporal locations matter to a system when it ranks their results. Her talk provided a good overview of these concepts and how they all contribute to helping us find the correct information to accomplish our tasks.

Jimmy Lin from the University of Waterloo highlighted the progress of bringing natural language processing (NLP) and information retrieval together. His lifelong hypothesis is that "IR makes NLP useful and NLP makes IR interesting." He described his work on the START project at MIT's CSAIL group in the late 1990s through IBM's Watson in 2011 before covering more recent efforts. He noted that early papers in the 1990s suggested that "understanding" may not help search because NLP from that era was not as successful as contemporary solutions like BM25. It was not until the development of language modeling with BERT that NLP was able to successfully improve search. Lin added that after BERT, the newer language model T5 ranked even better. He concluded that self-supervision is the key to making this process successful, and transformers are the first instance of making the marriage of NLP and IR work. His talk was an excellent segue to the next talk about language models.

Luke Zettlemoyer from the University of Washington gave us a crash course in language models. Language models allow us to create a probability distribution for a set of words, but go beyond merely computing term frequencies and similar ideas. By training neural networks on existing corpora, language models allow us to incorporate context based on how we use words in writing. This allows language models to handle different meanings – e.g., bank as an institution vs. as the land near a river. Language models are used to generate text for machine translation, speech recognition, retrieval, and summarization. He covered ELMo, GPT-1, BERT, GPT-2, GPT-3, T5, BART, and MARGE. In my summarization work, I've been trying to determine where this technology would fit. I'm trying to summarize a corpus through a representative sample rather than sentence generation, but it is possible that sentence generation could be used to create a better intelligent sample.

Mark Sanderson from RMIT next covered something I struggle with during each experiment: evaluation. As I've mentioned to other students, I can create pretty surrogates and whole stories, but if I have no way of evaluating them I have not established that they are better than other solutions. Sanderson covered constructing test collections, measuring search systems, and finished with active areas of research in evaluation. He mentioned that building a test collection in the past was easier because there were fewer documents, but at web scale, search engines need to operate on trillions of documents. He covered the sampling techniques of query pooling and system pooling for building collections from the web. Query pooling runs different queries on the same topic against the same search engine to produce a sample. System pooling runs the same query across multiple search engines to produce a sample. With the sample, we employ expert humans to judge the relevance of each document in the sample against the query. He covered different relevance measures, such as precision, precision at n, reciprocal rank, average precision, DCG, and nDCG. When comparing systems, he noted that one could not merely rely upon the mean values of a measure but instead must provide the results of some type of significance test in order to demonstrate that the results are better than random. He mentioned that evidence supports that nDCG correlates best with user preferences. One of the open work concepts he mentioned has implications for my own surrogate work. Search engine results are typically presented as cards with text snippets. In user testing, the users make choices on relevance based on these snippets. The human judges against which they are compared base their relevance judgement on the whole document. This is not a one-to-one comparison. He ended by mentioning that a future area of research would be to evaluate the impact of search engines and their performance on society.
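The measures named above can be sketched in a few lines. This is a simplified illustration rather than code from the talk. It assumes a single ranked list of graded relevance judgements (0 means not relevant), that every relevant document appears somewhere in the list, and one common DCG formulation in which gain is divided by log2(rank + 1). All function names are illustrative.

```python
import math


def precision_at_n(rels, n):
    """Fraction of the top n results that are relevant."""
    return sum(1 for r in rels[:n] if r > 0) / n


def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def average_precision(rels):
    """Mean of the precision values at each relevant result.

    Assumes every relevant document is present in the ranked list.
    """
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg(rels):
    """Discounted cumulative gain: graded gain discounted by log2(rank + 1)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels, start=1))


def ndcg(rels):
    """DCG normalized by the DCG of the ideal ordering of the same judgements."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0


ranking = [0, 1, 0, 1]  # hypothetical judgements for ranks 1 through 4
print(precision_at_n(ranking, 2), reciprocal_rank(ranking),
      average_precision(ranking), ndcg(ranking))
```

Averaging any of these over a set of queries gives the mean values mentioned above (for example, MAP or MRR). Those means still need a significance test before one system can be declared better than another.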

Maarten de Rijke from the University of Amsterdam covered Interactive Information Retrieval (IIR). He stressed that we need to think of IIR as more than just a set of ranked documents returned by a search query. Instead, he proposed a more abstract model where input is based on a user's query, environment, state, and more. The output from the system is an action, which could be a results list, the start of a further conversation, or something specific to the system. He wants developers to consider interactions first-class citizens when they build information retrieval systems. We can evaluate the effectiveness of these user actions by user studies, user panels, and log analysis, asking the question of "do people behave differently under these conditions and why?" He covered the concept of mixed-initiative interactions, asking questions like "how much initiative should a system take?" If it takes too little, people will not use it. If it takes too much, it may scare people off. We need to analyze what happens with human-to-human interactions to inform the development of successful human-to-machine interactions. He covered other concepts like unbiased and counterfactual learning-to-rank, the concept of "talking with a document," the concept of "talking with structured information," "talking with a collection of items," information goals, SERP-based conversations, and query formulation.


I attended two tutorials. The IIR one by ChengXiang Zhai seemed to be a good prep for the conference. I do not know as much about Question Answering, so I decided to let Rishiraj Saha Roy and Avishek Anand educate me on this exciting field.

Interactive Information Retrieval: Models, Algorithms, and Evaluation

ChengXiang Zhai from the Department of Computer Science at the University of Illinois at Urbana-Champaign provided us with a good overview of IIR (slides). He started off by introducing IIR as a subarea of computer science and information science, indicating that there are many perspectives that integrate to make IIR possible, from human-computer interaction to structuring information. He provided a brief historical overview of the field. Zhai covered concepts like Belkin's Anomalous State of Knowledge (ASK) hypothesis, which may have implications for how I evaluate the effectiveness of my web archive social media story summaries. Combining the ASK hypothesis with Oddy's THOMAS concept of man-machine dialogue suggests that dynamic user modeling is key to producing better IIR systems. Zhai provided a brief overview of other concepts including Bates' berry picking model, Ellis' behavioral model, Cutting's scatter/gather, and the Okapi system. Zhai spent some time educating us on the benefits of modeling IIR as cooperative game playing. If we treat the user and the system as two separate players seeking different goals, then they can arrive at a beneficial result. Zhai detailed the four key elements of this framework: collaboration, communication, cognition (incorporating the ASK hypothesis), and cost (minimizing it for both user and system). Given the situation, information on the user, a search history, a corpus, and a query, the system tries to choose the best response (result) to satisfy the user. The system's decision process can be modeled as a partially observable Markov decision process. It consists of a loss function that tries to minimize the cost to the system and the effort of the user while maximizing the utility of the result. This concept can be specialized into the Interface Card Model, which provides a general model for optimizing any interactive system.

Zhai went on to cover probability ranking, models of economics, handling conversational search, and much more than I can cover here. I recommend reviewing the slides as part of building a good reading list on the theories that make up IIR.

Question Answering over Curated and Open Web Sources

Rishiraj Saha Roy, Senior Researcher from the Max Planck Institute for Informatics, and Avishek Anand, Assistant Professor at the L3S Research Center, provided an excellent overview of Question Answering over knowledge graphs and textual sources (tutorial web site with slides and sources, abstract). They mentioned how Question Answering (QA) is vital for search. It allows users to save time by employing a system like Siri to give them specific answers to simple questions like "What are some films directed by Nolan?" QA even goes so far as to answer more complex questions involving multiple entities and relations as well as context within conversations. QA relies upon knowledge graphs like YAGO, DBpedia, Freebase, or Wikidata to supply the relationships and entities necessary to provide answers. These systems use technologies like SPARQL to unite triples across varying datasets in order to answer a given question. Saha Roy covered the complexities of how this is done, discussing concepts like templates, reification, qualifiers, and facts. He used the QAnswer project as an example of a working system that implemented these concepts over Wikidata before going into the KEQA model, complex QA with TextRay, heterogeneous QA with PullNet, and conversational QA with CONVEX and the MaSP model. Anand covered how QA is AI-complete because of the context and other external knowledge needed by the system to properly solve the problem. Anand mentioned the SQuAD dataset while describing how we use Cloze tests, span extraction, multiple choice answers, and free form answers to test and evaluate QA systems. He highlighted how to use neural networks to build language models (like BERT) that can better understand the terms needed to discover answers. Anand covered upcoming challenges and detailed how one can create their own QA system. Their slides are an important source of recommended reading for anyone who wants to get into this topic.
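To illustrate how triples can answer the Nolan question, here is a toy in-memory sketch. It is not how QAnswer or any of the systems above are implemented, and it does not use SPARQL or real Wikidata identifiers. The predicate names directed_by and instance_of, the tiny triple list, and both functions are assumptions made for illustration.

```python
# A toy knowledge graph of (subject, predicate, object) triples.
# Predicate names are hypothetical stand-ins, not real Wikidata properties.
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "instance_of", "film"),
    ("Dunkirk", "directed_by", "Christopher Nolan"),
    ("Dunkirk", "instance_of", "film"),
    ("Alien", "directed_by", "Ridley Scott"),
    ("Alien", "instance_of", "film"),
]


def subjects(predicate, obj, triples=TRIPLES):
    """Return every subject that has the given predicate and object."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def films_directed_by(director):
    """Join two triple patterns, much like a two-pattern graph query would."""
    return sorted(subjects("directed_by", director) & subjects("instance_of", "film"))


print(films_directed_by("Christopher Nolan"))
```

The hard part of QA, and the focus of the tutorial, is everything before this step: mapping a free-text question such as "films directed by Nolan" to the right entity and relation in the graph.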

Selected Paper Presentations

As is usual, I could not watch or digest all paper presentations so here I will provide an overview of a few select papers. I viewed a lot of excellent work.

Omar Khattab from Stanford presented "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." BERT is amazingly effective for certain problem spaces, but is quite computationally expensive. ColBERT reduces this cost by encoding queries and documents separately and delaying the interaction between them until the final scoring step, hence "late interaction." They show that ColBERT outperforms non-BERT models and is comparable in effectiveness with existing BERT models, but far faster.
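A minimal sketch of the late-interaction scoring step, assuming the token embeddings have already been produced by some encoder. The two-dimensional vectors below are made up for illustration and do not come from BERT. Each query token is matched to its most similar document token, and those maxima are summed.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d) for qv in q
    )


# Hypothetical token embeddings (one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each

print(late_interaction_score(query, doc_a))
print(late_interaction_score(query, doc_b))
```

Because the document vectors do not depend on the query, they can be computed and indexed ahead of time, leaving only the query encoding and this cheap comparison for search time.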

Fan Zhang from Tsinghua University presented "Cascade or Recency: Constructing Better Evaluation Metrics for Session Search." Most search metrics favor the cascade hypothesis: lower-ranked search results are assigned lower weights. The recency effect - where users recall best the most recent item they viewed - also plays an important role. Zhang et al. developed session-based metrics that incorporate this recency effect, attempting to better reflect what results users actually favor. Their work has implications for anyone who is attempting to measure how well their system ranks results.

Delong Ma from Northeastern University presented "Efficient Graph Query Processing over Geo-Distributed Datacenters," work he completed with co-authors Yuan, Wen, Ma, Wang, and Chen. Yuan et al. noted that graph computing frameworks "may not work well for geographically distributed datacenters" because they would require a lot of data transfer between datacenters. To address this, the authors propose GeoGraph, which favors datacenters closer to the user, breaks the query up among datacenters, and synthesizes a result. While watching this I was considering how such technologies might be applied to the LANL Memento Aggregator or a distributed form of MemGator, but URL query routing does not have the same issues as graph query processing.

Xinyan Dai from the University of Hong Kong presented "Convolutional Embedding for Edit Distance". While processing mementos from web archives, our research team might employ string edit distance to measure the similarity of sentences or whole documents.  Dai et al. apply a convolutional neural network to the problem of string edit distance and show that it performs better than using a recurrent neural network. Their results show that their method is both more accurate and faster than prior attempts.
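For reference, the quantity being approximated can be computed exactly with the classic dynamic-programming (Levenshtein) algorithm sketched below. This is the textbook baseline, not Dai et al.'s convolutional method, and the function name is illustrative. Its cost grows with the product of the two input lengths, which is part of why learned embeddings are attractive for large collections.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions.

    Works on any two sequences, so it can compare strings character by
    character or lists of words token by token.
    """
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning the first i items of a into nothing takes i deletions
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("the web archive".split(), "a web archive".split()))
```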

Yu Zhang from the University of Illinois at Urbana-Champaign presented "Minimally Supervised Categorization of Text with Metadata." Zhang et al. incorporate tags and other metadata into minimally supervised topic modeling with MetaCat. They show that MetaCat achieves higher F1 scores than competing technologies on a variety of datasets. I was interested in their method because topic modeling is one of the improvements I want to bring to web archive collection summarization.

Paridhi Maheshwari from Adobe presented "Learning Colour Representations of Search Queries" where they associated image search queries with colors in hopes of improving results. The authors note that "a significant fraction of user queries have an inherent color associated with them." For example, a query including the word "lemon" may indicate that images containing the color yellow have a higher probability of satisfying the user's information need and thus should be ranked accordingly. Of course, some query terms map to many more colors. Maheshwari et al. apply a recurrent neural network to the problem and demonstrate that their system improves results for users.


Even though I had flipped my clock to attend sessions on Beijing time, I was only able to make it to parts of two workshops.

Deep Natural Language Processing for Search and Recommendation

The DeepNLP workshop, led by Bo Long from LinkedIn, covered several aspects of applying deep learning to various IR problems. The keynote "Explanatory Natural Language Processing: Formulation, Methods, and Evaluation" was delivered by Professor Qiaozhu Mei from the University of Michigan. Mei introduced the concept of explainable machine learning. He identified a potential problem with it by quoting Geoffrey Hinton: "we can get the neural network to cook up a story... you could train a neural network to come up with an explanation... and the explanations don't have to be how they did it." Moving beyond this, Mei detailed some potential methods for providing explanations. Feature-based explanations detail how certain features affect the output. Example-based explanations provide evidence from the training data. He suggested LIME's ability to highlight text that informs a model may be a path forward, but then identified recent work whereby his student demonstrated that explanations provided by ML do not necessarily help humans make more critical judgements of results. Most participants in their studies agreed with the ML judgement with no more critical analysis. Rather than focusing on explainability, Mei suggested solving the more general problem we tend to have with ML judgements: trust.

ACL Fellow and IEEE Fellow Hang Li gave our second keynote, "Deep Learning and Natural Language Processing: Current and Future." Li took us on a fascinating journey about thinking and neural processing, mapping ideas from Damasio's work matching neural representations to images. He mentioned that, when asked a question, humans will internally generate images when trying to reason out the answer. Deep learning for natural language processing is the process of "mimicking human behaviors using neural processing tools." He provided an overview of issues with deep learning models and natural language processing, such as the issue of adequacy where data bias creates inadequate representations. This leads to the problem that "deep networks may exhibit pathological behavior." He then went on to provide a brief overview of how BERT and GPT bring these two problem areas together. He closed by mentioning that deep learning can solve classification tasks and relevancy is a classification task, so NLP and deep learning are a natural match.

In addition to the keynotes, various presenters detailed how their systems applied Deep NLP to different problem spaces. Wubo Li discussed augmenting existing small multimodal (text, image, video) datasets and showed improvements in their prototype at Didi Chuxing. Yiming Qiu presented work on rewriting queries based on click data to improve e-commerce results at JD.com. Marc Brette demonstrated how Salesforce applies NLP to augment queries with new entities based on existing user data to make their CRM more effective. Of specific interest to me was work presented by Haiyang Xu from Didi Chuxing that improves the accuracy of generated summaries by applying BERT to extract important words that then inform an improved abstractive summary.

Bridging the Gap between Information Science, Information Retrieval and Data Science (BIRDS)

Due to the time zone changes and the extremely long day, I was only able to attend part of BIRDS, led by Ingo Frommholz. Carlos Castillo gave the keynote "Fairness and Transparency in Rankings." Castillo discussed how search results are not just documents. For the user, they could be products, potential mates, businesses, social groups, or opportunities. His point is that results affect people's lives, so discrimination in search unfairly impacts those users. Castillo showed several examples of SERP results providing better exposure for men over women when it came to job searches. He suggested that we need transparency in search rankings to address these fairness issues. Maybe a nutritional label for search results is a good model.

I watched several interesting paper presentations from the workshop. Riccardo Guidotti gave an invited talk on the challenges with explaining the decisions produced by machine learning, arguing that explanations are dependent on the needs and capabilities of the end user. Steven Zimmerman presented work to help protect users from misinformation returned in SERPs and from recommender systems. Amit Kumar Jaiswal presented his work applying reinforcement learning and a quantum probabilistic framework to modeling a user engaged in information foraging. I discussed Pirolli and Card's information foraging model in my candidacy proposal and was excited to see someone automating the process for the purpose of analysis.

Thanks to the Organizers and Fellow Participants

I was humbled by the work presented at SIGIR. I appreciate the hard work of the organizers to move this conference from being an in-person conference to an online one. I was impressed with the work I saw presented. I appreciate the kindness everyone showed me as I asked questions and interacted during our social sessions. I hope one day to come back to SIGIR to share my own work.