2025-05-24: ACM ICMI 2024 Trip Report

 



The 26th ACM International Conference on Multimodal Interaction (ICMI 2024) was held from November 4 to 8 in San José, Costa Rica. This premier international forum brought together researchers and practitioners working at the intersection of multimodal artificial intelligence and social interaction. The conference focused on advancing methods and systems that integrate multiple input and output modalities, and explored both technical challenges and real-world applications.

In this blog post, I share my experience attending the conference in person, the engaging discussions I took part in, and key takeaways from the sessions, presentations, and interactions throughout the week. Our full paper, "Improving Usability of Data Charts in Multimodal Documents for Low Vision Users," was presented as a poster on Day 3 (Thursday, 7 November 2024) during the 16:30 to 18:00 session.



ICMI 2024 followed a single-track format that included keynote talks, full and short paper presentations, poster sessions, live demonstrations, and doctoral spotlight papers. Its multidisciplinary program combined scientific innovation and technical modeling with a strong emphasis on societal impact.

Keynote Speakers: 

Keynote 1: A human-centered approach to design multimodal conversational systems by Heloisa Candello, IBM Brazil


This keynote offered a compelling reflection on how to center human values when designing multimodal conversational systems. Drawing from projects in domains like cultural heritage, finance, and micro-entrepreneurship, the talk emphasized that conversational user interfaces must go beyond functionality to consider bias, accountability, and trust. One striking example involved using chatbots to engage with underrepresented women entrepreneurs, where values like creditworthiness emerged as complex, culturally embedded concepts. Rather than treating these issues as technical problems alone, the talk advocated for deeply contextual, participatory design approaches that align with users’ lived experiences. I found the discussion particularly relevant as generative conversational AI continues to evolve rapidly, raising new questions around social impact and responsible system behavior. The keynote made a strong case for embedding ethical reflection into every stage of multimodal system development, especially when these systems serve diverse, real-world communities.

Keynote 2: 3D Calling with Codec Avatars by Yaser Sheikh, Meta, USA


This keynote explored the future of communication through 3D calling, where lifelike avatars allow people to share virtual spaces in ways that closely resemble face-to-face interaction. Yaser Sheikh introduced codec avatars, AI-generated models that replicate a person’s appearance, voice, and behavior with high fidelity. These avatars rely on neural networks and advanced capture systems capable of recording subtle social cues in three dimensions. The talk positioned 3D calling as the next step in the evolution of telecommunication, moving beyond video conferencing toward immersive social presence. What stood out was the goal of creating interactions that feel as natural and authentic as real life. The technical demands are significant, involving breakthroughs in both perception and rendering. Still, the potential for transforming remote collaboration, virtual reality, and human-computer interaction is substantial. This keynote presented a compelling vision of digitally mediated communication grounded in realism, presence, and social connection.

Keynote 3: Mapping Minds: The Science of Conversation and the Future of Conversational AI by Thalia Wheatley, Dartmouth College/Santa Fe Institute, USA


This keynote examined the deep cognitive and neural foundations of human conversation and what they mean for the future of conversational AI. Thalia Wheatley argued that the real innovation in human communication is not language alone, but the shared mental map that allows people to align meanings in real time. Drawing on research from psychology and neuroscience, the talk unpacked conversation as a complex, multi-channel coordination of signals involving timing, intent, emotion, and mutual understanding. This dynamic interplay defines how we connect and co-construct meaning. Wheatley challenged the audience to rethink conversational AI not just as a language processing task, but as an attempt to model the subtle choreography of human minds in sync. I found this perspective especially relevant as we continue to push for more socially aware and responsive AI systems. The talk offered both scientific insight and a roadmap for building machines that can genuinely participate in human dialogue.



Keynote 4: Greta, what else? Our research towards building socially interactive agents by Catherine Pelachaud, CNRS-ISIR, Sorbonne Université, France


This keynote traced the evolution of research behind Greta, a virtual agent platform designed for rich social interaction. Catherine Pelachaud shared how her team progressed from modeling emotional expressions to building agents that engage as active conversational partners. The talk focused on the integration of adaptive mechanisms such as imitation, synchronization, and conversational strategies that allow agents to respond dynamically during interaction. Evaluation studies were used to assess the impact of these features on user perception and interaction quality. What stood out was the commitment to modeling nuanced social behavior and making agents not just expressive, but socially responsive. The presentation offered a valuable view into the long-term development of socially interactive agents and the design choices involved in making them believable and effective in real-time communication.

Best Paper: Mitigation of gender bias in automatic facial non-verbal behaviors generation for interactive social agents


This year’s Best Paper tackled a critical and often overlooked issue in the design of socially interactive agents: gender bias in the generation of facial non-verbal behaviors. The study showed that existing models frequently reproduce and even amplify gendered patterns found in training data. The authors introduced FairGenderGen, a model designed to reduce gender-specific cues by incorporating a gradient reversal layer during training. This approach minimized gender-related distinctions in generated behaviors while preserving their natural timing and expressiveness. Evaluation results confirmed that the model lowered bias without compromising the perceived quality of the behaviors. The work stands out for combining rigorous technical development with a clear ethical motivation. It also raises important questions about the balance between realistic behavior synthesis and fairness, and how perceptions of believability may differ based on gender expectations in human-computer interaction.
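To make the debiasing idea concrete, here is a minimal PyTorch sketch of a gradient reversal layer attached to a behavior generator. This only illustrates the general adversarial technique named in the paper; the layer sizes, the two-class gender adversary, and the `DebiasedBehaviorGenerator` wrapper are my assumptions, not the authors' FairGenderGen implementation.

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class DebiasedBehaviorGenerator(nn.Module):
    """Hypothetical generator: a shared encoder feeds a behavior decoder, while a
    gender classifier is trained through the gradient reversal layer so the shared
    features become uninformative about gender."""
    def __init__(self, in_dim=128, hid=256, out_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.decoder = nn.Linear(hid, out_dim)   # facial behavior parameters
        self.gender_head = nn.Linear(hid, 2)     # adversary

    def forward(self, x, lambd=1.0):
        h = self.encoder(x)
        behavior = self.decoder(h)
        gender_logits = self.gender_head(grad_reverse(h, lambd))
        return behavior, gender_logits
```

Because the reversal layer flips the adversary's gradients before they reach the shared encoder, the encoder is pushed toward features that still drive plausible facial behavior but carry little gender information.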


Best Paper Runner Up: Online Multimodal End-of-Turn Prediction for Three-party Conversations


This paper introduced a real-time, multimodal system for predicting end-of-turn moments in three-party conversations. Accurate turn-taking is essential for natural interaction in spoken dialogue systems, but most existing models either lag in responsiveness or oversimplify complex dynamics like overlap and interruption. The authors addressed these challenges by combining a window-based approach with a fusion of multimodal features, including gaze, prosody, gesture, and linguistic context. Their model, which integrates DistilBERT and GRU layers, predicts turn shifts every 100 milliseconds and accounts for different turn-ending types such as interruption, overlapping, and clean transitions. The study also involved a new annotated dataset with synchronized gaze and motion capture data. Results showed a substantial performance improvement over traditional IPU-based models, particularly in handling nuanced conversational cues. This work sets a new benchmark for building socially aware agents that can manage multi-party dialogue with precision and fluidity.
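As a rough illustration of this kind of frame-level fusion, the sketch below concatenates per-frame text, gaze, prosody, and gesture features and passes them through a GRU that outputs a turn-state prediction every 100 ms. The feature dimensions, the class inventory, and the use of precomputed DistilBERT embeddings are assumptions for the example, not the authors' architecture.

```python
import torch
from torch import nn

class EndOfTurnPredictor(nn.Module):
    """Illustrative fusion model: per-frame (100 ms) multimodal features -> GRU -> turn-state logits.
    Classes here are assumed: hold, clean transition, overlap, interruption."""
    def __init__(self, text_dim=768, gaze_dim=6, prosody_dim=4, gesture_dim=16,
                 hidden=128, num_classes=4):
        super().__init__()
        fused_dim = text_dim + gaze_dim + prosody_dim + gesture_dim
        self.gru = nn.GRU(fused_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, text_emb, gaze, prosody, gesture):
        # Each input: (batch, frames, feature_dim), aligned on a 100 ms grid.
        x = torch.cat([text_emb, gaze, prosody, gesture], dim=-1)
        out, _ = self.gru(x)     # one hidden state per 100 ms frame
        return self.head(out)    # per-frame turn-state logits
```

Run incrementally over a sliding window, such a model yields a prediction at every frame rather than only at pause boundaries, which is the responsiveness advantage the paper highlights over IPU-based baselines.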


Our Paper: Improving Usability of Data Charts in Multimodal Documents for Low Vision Users


Our paper addressed a critical accessibility gap in the usability of data charts for low-vision users, particularly on smartphones. While prior solutions have focused on blind screen reader users, they often neglect the residual visual capabilities of individuals with low vision who rely on screen magnifiers. We introduced ChartSync, a multimodal interface that transforms static charts into interactive slideshows, each offering magnified views of key data points alongside tailored audio narration. The system uses a combination of computer vision, prompt-engineered language models, and user-centered design to link charts with related text and surface important data facts. In our evaluation with 12 low-vision participants, ChartSync significantly outperformed traditional screen magnifiers and state-of-the-art alternatives in terms of usability, comprehension, and cognitive load. Presenting this work at ICMI was a valuable opportunity to engage with researchers in accessibility and multimodal interaction, and to receive constructive feedback on future directions such as desktop deployment and dynamic skimming features.


Grand Challenges

Grand Challenges at ICMI serve as focused community efforts to tackle open problems in multimodal interaction by providing shared datasets, clearly defined tasks, and evaluation protocols. They are designed to stimulate collaboration, benchmark progress, and drive innovation in emerging or underexplored areas of research. ICMI 2024 included two main Grand Challenges: EVAC, focused on multimodal affect recognition, and ERR, centered on detecting interaction failures in human-robot interaction.

Grand Challenge 1: EVAC

The Empathic Virtual Agent Challenge (EVAC) focused on the recognition of affective states in multimodal interactions, using a new dataset collected in cognitive training scenarios with older adults. Participants tackled two tasks: predicting the presence and intensity of core affective labels such as "confident," "anxious," and "frustrated," and estimating appraisal dimensions like novelty, goal conduciveness, and coping, both in summary and time-continuous formats. The winning team leveraged state-of-the-art foundational models across speech, language, and vision, combining them through late fusion to outperform unimodal baselines. What stood out was the challenge's emphasis on real-world complexity, working with naturalistic, French-language data from therapeutic contexts, and requiring models to operate across asynchronous and noisy multimodal inputs. The challenge highlighted the potential of emotion-aware AI for health and assistive technologies, while also exposing the limits of current methods in capturing subtle, temporally evolving emotional cues.
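For readers unfamiliar with the term, late fusion simply means combining the outputs of modality-specific models rather than their internal features. Below is a minimal sketch with placeholder weights and toy predictions; it is not the winning team's code.

```python
import numpy as np

def late_fusion(unimodal_scores, weights=None):
    """Combine per-modality predictions (e.g., affect-intensity scores or class
    probabilities) by a weighted average. Each entry in `unimodal_scores` is an
    array of shape (n_samples, n_targets) from one modality-specific model."""
    scores = np.stack(unimodal_scores, axis=0)    # (n_modalities, n_samples, n_targets)
    if weights is None:
        weights = np.ones(len(unimodal_scores))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return np.tensordot(weights, scores, axes=1)  # (n_samples, n_targets)

# Hypothetical usage: speech, text, and vision models each score "frustrated" intensity.
speech_pred = np.array([[0.7], [0.2]])
text_pred   = np.array([[0.6], [0.1]])
vision_pred = np.array([[0.8], [0.3]])
fused = late_fusion([speech_pred, text_pred, vision_pred], weights=[0.5, 0.3, 0.2])
```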

Grand Challenge 2: ERR

This challenge focused on detecting interaction failures in human-robot interaction (HRI) by analyzing multimodal signals from users. Participants were asked to identify three types of events: robot mistakes, user awkwardness, and interaction ruptures. These were labeled based on observable disruptions in the interaction. The dataset was collected in a workplace setting where a robotic wellbeing coach engaged with users. It included facial action units, speech features, and posture data. The winning approach used a time series classification pipeline with MiniRocket classifiers, relying heavily on conversational turn-taking cues. Other strong entries employed GRUs, LSTMs, and modality-specific encoders or fusion techniques. Teams faced challenges related to class imbalance and the subtle nature of the rupture events, which often unfolded over time. The ERR challenge offered a timely benchmark for building more socially aware robots that can detect and respond to failures as they happen. It also demonstrated the value of real-world, multimodal data in advancing HRI research.
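As a sketch of what a MiniRocket-based pipeline looks like in practice, the snippet below assumes the sktime implementation (`MiniRocketMultivariate`) with the usual ridge-classifier head; the randomly generated multimodal windows and the label set are placeholders, not the winning team's code or data.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sktime.transformations.panel.rocket import MiniRocketMultivariate

# Placeholder data: interaction windows with multimodal channels
# (facial action units, speech features, posture) sampled over time.
n_windows, n_channels, n_timepoints = 200, 20, 300
X = np.random.randn(n_windows, n_channels, n_timepoints)
y = np.random.choice(
    ["no_event", "robot_mistake", "user_awkwardness", "interaction_rupture"],
    size=n_windows,
)

# MiniRocket turns each window into thousands of random-convolution features;
# a linear ridge classifier on top is the standard ROCKET-family recipe.
minirocket = MiniRocketMultivariate()
X_features = minirocket.fit_transform(X)

classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
classifier.fit(X_features, y)
print(classifier.predict(minirocket.transform(X[:5])))
```

Class imbalance, which the teams reported as a major difficulty, would typically be handled on top of such a pipeline with class weighting or resampling.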


Travel experience:

Beyond the conference sessions, exploring Costa Rica offered an unforgettable complement to the ICMI experience. I visited La Paz Waterfall Gardens and Volcán Poás National Park, where the towering crater and tranquil lake provided breathtaking views. The trails were surrounded by rich biodiversity, with vibrant flora and sightings of toucans, hummingbirds, and butterflies creating a vivid encounter with the region's unique wildlife. As a coffee enthusiast, it was a great experience to sample and purchase high-quality peaberry coffee, a local specialty known for its smooth flavor and rarity. The trip also offered a chance to savor traditional Costa Rican cuisine and engage with the country's welcoming culture. Traveling with a group of researchers made the journey even more enriching. I was able to form connections not only with attendees from ICMI but also with those from CSCW, which was colocated, fostering cross-community conversations in both formal and informal settings.


Conclusion

Attending ICMI 2024 was both professionally rewarding and personally enriching. The conference showcased cutting-edge research in multimodal interaction, fostered thoughtful discussions on the future of socially aware systems, and highlighted important ethical and human-centered considerations in AI design. Presenting our work, engaging with a diverse research community, and participating in focused workshops and challenges provided valuable insight into the field’s evolving directions. Beyond the academic sessions, the experience of exploring Costa Rica and building new connections across communities added depth to the journey. I leave with a renewed sense of curiosity, several collaborative ideas, and deep appreciation for the vibrant and inclusive spirit of the ICMI community.


Acknowledgements

I would like to thank the ICMI 2024 conference organizers for curating an engaging and thoughtful program. I am especially grateful to my advisor, Dr. Vikas G. Ashok, and the WS-DL research group for their continued guidance and support. I also sincerely thank the ODU ISAB, the College of Sciences Graduate School (CSGS), and the Department of Computer Science at Old Dominion University for their generous support in funding my conference registration and travel. Finally, heartfelt thanks to all the colleagues, fellow researchers, and friends who made this trip a memorable and enriching experience.

- AKSHAY KOLGAR NAYAK @AkshayKNayak7










