2024-08-20: Paper Summary: "Enabling Uniform Computer Interaction Experience for Blind Users through Large Language Models"
The ACM ASSETS Conference, formally the International ACM SIGACCESS Conference on Computers and Accessibility, is a leading forum for research on the design, evaluation, and use of computing and information technologies that assist people with disabilities and older adults. The conference covers a wide range of topics in accessible computing, including assistive technologies, user interfaces, and inclusive design practices. In this blog post, I write about the most recent work co-authored by Dr. Vikas Ashok, titled "Enabling Uniform Computer Interaction Experience for Blind Users through Large Language Models," published at ASSETS '24.
Motivation
Blind individuals who rely on screen readers, such as JAWS, VoiceOver, and NVDA, face significant challenges when interacting with computer applications designed primarily for visual interaction and point-and-click use. These graphical user interfaces often require complex and varying keyboard shortcuts, making it difficult for blind users to navigate and perform tasks efficiently. For instance, even when performing similar functions, different applications require different shortcuts, adding to the cognitive load. Additionally, inconsistencies across platforms, where different screen readers have distinct shortcut mappings for the same actions, further complicate the user experience. As a result, blind users often take much longer to complete tasks than sighted individuals. The SAVANT system was proposed to address these issues, offering a more uniform interaction experience across different applications.
SAVANT Design
WOZ Study and Dataset Collection
In developing SAVANT, the researchers first needed to understand the types of natural language commands blind users might issue while interacting with computer applications. To gather this information, they conducted a Wizard-of-Oz (WOZ) study with 11 blind participants, who were asked to perform various tasks across five applications: Excel, Gmail, File Explorer, Word, and Zoom. The study yielded a total of 145 natural language commands, which were then manually annotated with the corresponding control elements and their values within the applications. This dataset was crucial, as it served as the foundation for training and testing the models used in SAVANT, ensuring that the system could accurately interpret and respond to the kinds of commands blind users actually issue.
Figure 1 Kodandaram et al.: LLM-powered automation of screen reader steps prompted by user commands. This figure illustrates how the SAVANT system interprets a natural language command, identifies the relevant control elements, and automates the corresponding sequence of screen reader actions, enabling uniform interaction across different application interfaces.
Building on the insights gained from the WOZ study, the researchers designed SAVANT's architecture around two main components: the Preprocessing Component and the Runtime Component.
Preprocessing Component
The Preprocessing Component plays a crucial role in setting up SAVANT by constructing a non-visual representation of the application’s graphical user interface (GUI), the Application Control Tree (ACT). The ACT includes detailed information about all the control elements within an application, such as buttons, menus, and other interactive elements, along with their relationships and actions. Additionally, the Preprocessing Component develops a dataset of few-shot examples for each application, stored in the Few-Shot Examples Dataset (FED). These few-shot examples are small samples of natural language commands paired with corresponding actions, which help the system understand and interpret similar commands from users. Importantly, this preprocessing is done only once for each application, laying the groundwork for how SAVANT will interact with that application in real-time.
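To make the two preprocessing artifacts concrete, here is a minimal sketch of what an ACT node and a FED entry might look like. The field names, control names, and the `find_control` helper are illustrative assumptions for this post, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ACTNode:
    """One control element in the Application Control Tree (ACT)."""
    name: str                  # accessible name, e.g. "Send"
    role: str                  # control type, e.g. "button", "editbox"
    actions: list[str]         # supported actions, e.g. ["invoke"]
    children: list["ACTNode"] = field(default_factory=list)

@dataclass
class FewShotExample:
    """One FED entry: a command paired with its target control and action."""
    command: str               # e.g. "send this email"
    control: str               # name of the matching ACT control element
    action: str                # screen reader action to perform

# Illustrative ACT fragment for a hypothetical email client
act_root = ACTNode("Mail Window", "window", [], [
    ACTNode("Send", "button", ["invoke"]),
    ACTNode("To", "editbox", ["set_value"]),
])

fed = [
    FewShotExample("send this email", "Send", "invoke"),
    FewShotExample("address it to Alice", "To", "set_value"),
]

def find_control(node, name):
    """Depth-first lookup of a control element by accessible name."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find_control(child, name)
        if hit is not None:
            return hit
    return None
```

Because this structure is built once per application, the runtime side only ever reads from it, which keeps per-command latency low.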
Runtime Component
Once the preprocessing is complete, the Runtime Component takes over during user interaction. When a user issues a natural language command, the Runtime Component interprets this command using the information stored in the ACT and FED. It begins by matching the command to the most relevant few-shot examples from the FED, using a semantic similarity search powered by a BERT model. This search ensures that SAVANT accurately interprets the command and identifies the application's correct control elements and actions. The system then generates the appropriate sequence of screen reader actions needed to execute the command, effectively automating what would otherwise be a complex and manual process for the user. This sequence of actions is executed in the application, allowing the user to interact with the software more naturally and efficiently.
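The FED lookup can be pictured as nearest-neighbour search in embedding space. In the toy sketch below, a bag-of-words embedding and cosine similarity stand in for the BERT sentence encoder purely for illustration; the example commands and controls are invented:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; SAVANT uses a BERT model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical few-shot examples for an email client
fed = {
    "send this email": ("Send button", "invoke"),
    "attach a file to the message": ("Attach button", "invoke"),
    "reply to the sender": ("Reply button", "invoke"),
}

def best_match(command):
    """Return the (control, action) of the most similar FED example."""
    q = embed(command)
    return max(fed.items(), key=lambda kv: cosine(q, embed(kv[0])))[1]

print(best_match("please send the email now"))  # → ('Send button', 'invoke')
```

The retrieved example then anchors the LLM prompt, so the model grounds its output in commands it has seen paired with known controls rather than guessing freely.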
Figure 2 Kodandaram et al.: Architectural workflow of the SAVANT system. This diagram outlines the two main components of SAVANT: the Preprocessing Component, which constructs the Application Control Tree and Few-Shot Examples Dataset, and the Runtime Component, which interprets user commands, generates control element pairs, and executes the corresponding screen reader actions.
To further enhance usability, the Runtime Component is designed to handle challenges that arise during interaction. For example, if a user's command is ambiguous or involves multiple steps, the system can break it down into simpler sub-tasks or present the user with options for clarification. This capability is crucial for keeping SAVANT flexible and responsive to the diverse ways users phrase their commands. The architecture also provides feedback to users, confirming when an action has completed or prompting them to reissue a command if necessary. By automating the interaction process and reducing reliance on complex keyboard shortcuts, SAVANT offers a more consistent and seamless experience across different applications.
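The ambiguity-and-feedback behaviour described above can be sketched as a simple dispatch over how many controls a command resolves to. This is a hypothetical illustration of the interaction pattern, not the paper's implementation:

```python
def handle_command(command, matches):
    """Hypothetical dispatch: act when the command maps to exactly one
    control, ask for clarification when it maps to several, and prompt
    the user to rephrase when it maps to none."""
    if len(matches) == 1:
        return f"Executing '{command}' on {matches[0]}. Done."
    if len(matches) > 1:
        options = ", ".join(matches)
        return f"Did you mean one of: {options}?"
    return "Sorry, no matching control was found. Please rephrase."

print(handle_command("send", ["Send button"]))
print(handle_command("save", ["Save", "Save As"]))
```

Every branch returns spoken feedback, mirroring the paper's point that confirming or re-prompting the user is as important as executing the action itself.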
SAVANT Evaluation
The user study to evaluate SAVANT's effectiveness revealed significant improvements in efficiency and usability for blind screen reader users. When comparing SAVANT to traditional screen readers and the state-of-the-art AccessBolt interface, the results showed that SAVANT allowed participants to complete tasks much more quickly and with fewer key presses. On average, participants spent 101 seconds per task using SAVANT, compared to 159 seconds with AccessBolt and 395 seconds with traditional screen readers. The reduction in key presses was equally impressive, with participants pressing only 44 keys per task with SAVANT, compared to 163 keys with AccessBolt and 319 with traditional screen readers. These numbers underscore SAVANT's ability to streamline interactions and reduce the cognitive load on users.
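For perspective, the reported averages imply roughly a 74% reduction in task time and an 86% reduction in key presses relative to the traditional screen reader baseline:

```python
# Average per-task figures reported in the study
time_savant, time_jaws = 101, 395   # seconds
keys_savant, keys_jaws = 44, 319    # key presses

time_reduction = 1 - time_savant / time_jaws
keys_reduction = 1 - keys_savant / keys_jaws
print(f"time reduction: {time_reduction:.0%}")       # → 74%
print(f"key-press reduction: {keys_reduction:.0%}")  # → 86%
```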
Table 1 Kodandaram et al.: Tasks assigned to participants during the user study across six applications. This table details the specific tasks participants were required to complete in different applications, along with the depth of user interactions needed to navigate from the root to the target element. The tasks were designed to evaluate the usability and efficiency of the SAVANT system compared to traditional screen reader methods.
In terms of task completion rates, SAVANT also outperformed the other methods. Participants successfully completed 18 out of 22 tasks using SAVANT, compared to 14 with AccessBolt and only 6 with traditional screen readers. This higher completion rate is particularly important, as it demonstrates SAVANT's ability to handle a wide range of tasks across different applications. Additionally, SAVANT's use of natural language commands meant that participants did not need to remember specific keyboard shortcuts for each application, further simplifying the interaction process. The system’s ability to interpret and execute user commands accurately was reflected in its 78% accuracy rate in mapping natural language commands to the correct application controls.
Figure 3 Kodandaram et al.: Comparative analysis of task completion times and key presses across different study conditions. The box plots illustrate the differences in efficiency and user interaction between the JAWS screen reader, AccessBolt, and the SAVANT system, highlighting SAVANT's superior performance in reducing both task completion time and the number of key presses required.
The subjective experience of the participants also favored SAVANT. In the Single Ease Question (SEQ) scores, participants rated tasks an average of 6.57 out of 7 for ease of use with SAVANT, compared to 5.6 with AccessBolt and just 2 with traditional screen readers. Furthermore, the NASA-TLX scores, which measure perceived workload, were significantly lower for SAVANT, with an average score of 16.16, compared to 28.93 for AccessBolt and 49.9 for traditional screen readers. These results indicate that SAVANT not only made tasks easier to complete but also reduced the overall effort and frustration experienced by users. Overall, the study's findings highlight SAVANT's potential to transform the way blind users interact with computer applications, making the process more efficient, accessible, and user-friendly.
Limitations and Future Work
While the SAVANT system has demonstrated significant potential in enhancing the interaction experience for blind users with desktop applications, it also has certain limitations. Addressing these limitations through future work will be crucial to further improving its functionality and user experience.
- SAVANT currently struggles with handling tasks involving pop-up windows and sub-windows due to their unpredictable nature and context-dependent controls. Future work should focus on developing sophisticated methods for predicting and managing these elements to improve interaction capabilities.
- The system is currently limited to simple, single-action commands, making it inadequate for more complex tasks that require multiple steps. Future research should aim to incorporate support for complex, multi-step commands, possibly through task decomposition models.
- SAVANT's ability to handle tasks that involve multiple applications is limited. Enhancing its capability to manage such tasks seamlessly across different applications will be an important area of future development.
- Currently, SAVANT uses a push-to-talk mechanism for activation, which may not be accessible for all users. Introducing a wake-word feature would make the system more user-friendly, particularly for those with additional disabilities.
Conclusion
The SAVANT system marks a significant advancement in assistive technology, offering blind users a more efficient and uniform way to interact with diverse desktop applications through natural language commands. Despite its promising capabilities, SAVANT faces challenges, particularly in handling pop-ups, complex commands, and multi-application tasks, which highlights the need for further refinement and development. This research is a valuable step toward creating more inclusive digital environments. By addressing the current limitations and expanding SAVANT's capabilities, future iterations could significantly enhance its usability and impact, making it an even more indispensable tool for blind and visually impaired users. With continued innovation, SAVANT could set a new standard in accessible technology.
References
S. R. Kodandaram, U. Uckun, X. Bi, I. V. Ramakrishnan, and V. Ashok, "Enabling Uniform Computer Interaction Experience for Blind Users through Large Language Models," in Proceedings of the International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '24), 2024. arXiv preprint arXiv:2407.19537.
- AKSHAY KOLGAR NAYAK @AkshayKNayak7