2021-09-15: The 21st ACM Symposium on Document Engineering (DocEng 2021) Trip Report

The 21st ACM Symposium on Document Engineering (DocEng 2021) was organized by the University of Limerick, Ireland and held virtually (due to the COVID-19 pandemic) from August 24th to 27th. This year, the conference received a total of 24 full paper submissions and 41 short paper submissions, of which 8 full papers (33%) and 20 short papers (49%) were accepted. The conference was sponsored by ACM SIGWEB (ACM Special Interest Group on Hypertext and the Web) in co-operation with ACM SIGDOC (ACM Special Interest Group on Design of Communication). The general chairs of DocEng 2021 were Dr. Patrick Healy and Dr. Mihai Bilauca of the University of Limerick, Ireland, and the program chair was Dr. Alexandra Bonnici of the University of Malta, Malta.

I submitted two short papers to this year's conference, of which one was accepted. This paper, titled "Metadata-Driven Eye Tracking for Real-Time Applications", was co-authored by Gavindya Jayawardena and Dr. Sampath Jayarathna from Old Dominion University, USA, and Dr. Andrew T. Duchowski from Clemson University, USA.

Day 1

The first day of the conference was dedicated to two tutorial sessions. One tutorial session, "Domain-specific Modelling in Document Engineering", was conducted by Dr. Verislav Djukic from Djukic Software GmbH, Germany, and Dr. Juha-Pekka Tolvanen from Metacase, Finland. The other tutorial session, "Document Engineering Issues in Malware Analysis", was conducted by Dr. Charles Nicholas and Dr. Robert Joyce from the University of Maryland Baltimore County, USA, and Dr. Steven Simske from Colorado State University, USA.

Day 2

Keynote I

The second day of the conference started off with the keynote "Searching Harsh Documents" by Dr. Ophir Frieder from Georgetown University, USA. Dr. Frieder focuses on scalable health information processing systems. He is a member of the Computer Science faculty at Georgetown University and the biostatistics, bioinformatics and biomathematics faculty at the Georgetown University Medical Center. He is also the Lead Science and Technology Advisor for Aurora: The Business Forge and the Chief Scientific Officer of Invaryant, Inc.

In this keynote, Dr. Frieder delved into the challenges of searching "harsh" document collections, such as natively non-digital documents, multilingual documents, documents that include non-textual components, corrupted documents, or any combination thereof. He discussed several techniques for image binarization, ways to cope with poor spelling, and the handling of standard and non-standard documents. He also discussed existing solutions for different aspects of machine readability, and the importance of integrating them to improve machine readability overall.

Before ending the keynote, Dr. Frieder discussed ongoing research on engineering chat-bots to mimic certain individuals by analyzing their social data, such as writings, conversations, and images. This raised the question of what adverse effects may arise from doing so.

Session 1 - Document Content Analysis

Dr. Frieder's keynote was followed by the first paper session of the day, "Document Content Analysis", chaired by Dr. Bessat Kassaie from the University of Waterloo, Canada. This session consisted of 5 papers.

Session 2 - Generation, Manipulation, and Presentation

The second paper session of the day, "Generation, Manipulation, and Presentation", was chaired by Dr. Steven Bagley from the University of Nottingham, UK, and consisted of 3 papers.

Day 3

Keynote II

Day 3 of DocEng 2021 started off with the keynote "20 Years of Physical Document and Product Protection Using Digital Methods" by Dr. Justin Picard. Dr. Picard is the Chief Technology Officer (CTO) of Scantrust SA, Switzerland, a product authentication and traceability company that he co-founded in 2013, and the inventor of the copy detection pattern, a digital authentication technology for detecting product and document counterfeiting. He is also a co-founder of the NGO Black Market Watch, where he developed a methodology to assess the impacts of illicit trade.

In this keynote, Dr. Picard reviewed some of the counterfeit detection techniques developed in the last 20 years, including printed digital watermarks, copy detection patterns, and secure QR Codes. He explained how these technologies allow users to verify the authenticity of digitized content with their smartphones, and in turn, help to digitize physical documents, packaging, and products. He provided real-world examples of counterfeit detection from industrial applications, and discussed some of the current research problems in this area.

Session 3 - Security and Sensitive Documents

Dr. Picard's keynote was followed by the first paper session of the day, "Security and Sensitive Documents". This session was chaired by Dr. Charles Nicholas from the University of Maryland Baltimore County, USA, and consisted of 4 papers.

Session 4 - Applications and User Experiences

The next session was "Applications and User Experiences", chaired by Dr. Dick Bulterman from the Centrum Wiskunde & Informatica (CWI), Netherlands. This session consisted of 5 papers.

The goal of our paper "Metadata-Driven Eye Tracking for Real-Time Applications" was to demonstrate the benefits of adapting FAIR metadata standards to collect data, build workflows, and validate results in eye tracking research. Here, we introduced the issues encountered when conducting eye tracking research in the vast landscape of proprietary and vendor-specific eye tracking software, and proposed an approach to work around these issues. In this approach, we first use our DFS metadata format to describe eye trackers and the datasets collected using them. Next, we use this metadata to "replay" datasets and effectively simulate real-time data streams.

To verify that this approach indeed works, we created DFS metadata for two eye trackers (Tobii Pro X2-60 and SR Research EyeLink-1000) and two datasets collected using them (ADHD-SIN and N-BACK), replayed the datasets using this metadata, and conducted several real-time eye movement analysis and synthesis tasks using the replayed data. Based on our results, we discussed how well this approach works, and how it could be generalized beyond eye tracking applications.
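To make the "replay" idea concrete, here is a minimal sketch of how a recorded dataset could be re-emitted at its original pace to mimic a live stream. This is not the implementation from our paper: the `METADATA` record, its field names, and the `replay` function are hypothetical and do not reflect the actual DFS schema. The sketch assumes each recorded row carries a timestamp in seconds under the field `t`, and falls back to the nominal sampling interval otherwise.

```python
import time
from typing import Any, Callable, Dict, Iterable, Iterator, Sequence

# Hypothetical metadata record. The keys are illustrative only and are
# NOT the DFS schema described in the paper.
METADATA: Dict[str, Any] = {
    "device": "example-eye-tracker",
    "sampling_rate_hz": 60,
    "fields": ["t", "x", "y"],  # column order of each recorded row
}


def replay(
    rows: Iterable[Sequence[Any]],
    metadata: Dict[str, Any],
    sleep: Callable[[float], Any] = time.sleep,
    speed: float = 1.0,
) -> Iterator[Dict[str, Any]]:
    """Yield recorded rows as dicts, pausing between rows like a live stream.

    The pause is the gap between consecutive recorded timestamps (field "t",
    assumed to be in seconds). If a timestamp is missing, the nominal
    sampling interval from the metadata is used instead.
    """
    fields = metadata["fields"]
    nominal = 1.0 / metadata["sampling_rate_hz"]
    prev_t = None
    for i, row in enumerate(rows):
        sample = dict(zip(fields, row))
        t = sample.get("t")
        if i > 0:
            if t is not None and prev_t is not None:
                delay = t - prev_t
            else:
                delay = nominal
            # Negative gaps (out-of-order timestamps) are clamped to zero.
            sleep(max(delay, 0.0) / speed)
        prev_t = t
        yield sample
```

A consumer would simply iterate, e.g. `for sample in replay(rows, METADATA): ...`, and process each sample as if it had just arrived from the device. The `sleep` and `speed` parameters are there so the pacing can be sped up or stubbed out when testing.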

ACM Town Hall

The next session was a discussion on the ACM Special Interest Group on Hypertext and the Web (SIGWEB). This session, named "ACM Town Hall", was conducted by Dr. Peter Brusilovsky of the University of Pittsburgh, USA. In this session, Dr. Brusilovsky introduced SIGWEB and its four major conferences: 1) ACM Conference on Hypertext and Social Media, 2) ACM Symposium on Document Engineering (i.e., this conference), 3) ACM Web Science Conference, and 4) The Web Conference. He also explained why one should become a SIGWEB member.

Day 4

The fourth and final day of the conference began with an information session on the upcoming ACM DocEng 2022 by Dr. Matthew Hardy, the Director of Engineering at Adobe. Dr. Hardy announced that the next DocEng conference will be hosted by Adobe at the Adobe Campus in San Jose, California, USA. He added that this conference would hopefully be in-person rather than virtual.

Session 5 - Systems for Visual Document Analysis

The session by Dr. Hardy was followed by the first paper session of the day, "Systems for Visual Document Analysis". It was chaired by Dr. Tamir Hassan from Round-Trip PDF Solutions, Vienna, Austria, and consisted of 7 papers.

Session 6 - Collections, Systems, and Management

The next session, "Collections, Systems, and Management", was the last paper session of the conference. It was chaired by Dr. Angelo Di Iorio from the University of Bologna, Italy, and consisted of 4 papers.

Binarization Challenge Summary

At the end of the last paper session of the conference, Dr. Steven Simske from Colorado State University, USA presented the results of the Binarization Challenge, "Time-Quality Competition on Binarizing Photographed Documents". The competition was carried out using four smartphone cameras. Images taken with each smartphone were first binarized using state-of-the-art algorithms. The results were then compared on two measures: the proportion of black to white pixels in the binarized image relative to the ground truth image, and the Levenshtein distance between the OCR-extracted text (via Google Cloud Vision) and the ground truth text.
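As a rough illustration of these two measures, the sketch below computes a Levenshtein (edit) distance between two strings and compares the fraction of black pixels in two binary images. It is not the competition's scoring code: the function names are hypothetical, images are assumed to be lists of rows with 0 for black and 1 for white, and the exact pixel-proportion formula used in the challenge is not specified here, so the absolute difference of black-pixel fractions stands in for it.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def black_fraction(image: Sequence[Sequence[int]]) -> float:
    """Fraction of pixels that are black (0) in a binary image."""
    total = sum(len(row) for row in image)
    if total == 0:
        return 0.0
    black = sum(1 for row in image for pixel in row if pixel == 0)
    return black / total


def black_fraction_gap(
    binarized: Sequence[Sequence[int]],
    ground_truth: Sequence[Sequence[int]],
) -> float:
    """Absolute difference in black-pixel fraction (an assumed stand-in
    for the challenge's pixel-proportion measure)."""
    return abs(black_fraction(binarized) - black_fraction(ground_truth))
```

In this sketch, lower is better for both values: a small edit distance means the OCR output stays close to the ground truth text, and a small gap means the binarized image has roughly the same amount of ink as the ground truth image.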


The binarization challenge summary was followed by the DocEng 2021 Awards Ceremony. Here, the best student paper award was given to the paper "On Minimizing Cost in Legal Document Review Workflows" by Eugene Yang et al.

The best paper award was given to the paper "A Novel Approach on the Joined De-Identification of Textual and Relational Data with a Modified Mondrian Algorithm" by Fabian Singhofer et al.

Birds of a Feather

The awards ceremony was followed by the Birds of a Feather (BOaF) presentation session. While two presentations were originally planned for this session, only one was given: Dr. Charles Nicholas from the University of Maryland Baltimore County, USA presented "How to get a book published in Document Engineering".

Following this, Dr. Steven Simske, Dr. Alexandra Bonnici, Dr. Patrick Healy, and Dr. Mihai Bilauca delivered the closing remarks, and with this, the DocEng 2021 conference came to an end.

-- Yasith Jayawardana (@yasithmilinda)