2021-05-03: Automated Filtering of Eye Movements Using Dynamic Areas of Interest in Different Granularity Levels

Fig. 1: The Workflow of Object Detection and Segmentation for Dynamic AOI Filtering and Eye Movement Processing.

Most eye-tracking experiments involve areas of interest (AOIs) for the analysis of eye gaze data. This is because people attend to only a few areas in a given stimulus, and an analysis of eye movements within them can provide important clues to the underlying physiological functions supporting the allocation of visual attention resources. For instance, in user interface interaction, the number of fixations within a user interface component indicates the efficiency of finding that component among others.

Analysis of eye movements is mostly done within static AOIs, though analysis using dynamic AOIs, such as those in videos, has recently gained traction. One example is the visual and statistical analysis of viewers’ experience using eye movement data on video feeds. A potential application of dynamic AOIs is dynamically controlled magnification around the centers of interest to aid people with visual impairments.

While there are tools to define static AOIs and extract eye movement data, they may require users to manually draw the boundaries of AOIs on eye tracking stimuli to generate AOI-mapped gaze locations. For instance, Tobii Pro Studio eye tracking software allows users to export AOI-mapped eye movement data, but it requires them to first draw the boundaries of the AOIs on static stimuli.

Dynamic AOIs on video streams require users to manually draw boundaries of AOIs and adjust them on a frame-by-frame basis. Evolving from there, the tool introduced in "Let's look at the cockpit: exploring mobile eye-tracking for observational research on the flight deck" offers a more feasible approach: it uses computer vision techniques to map eye movements to objects of interest, using a template of the desired object derived from a single selected frame of the eye tracking stimulus video. If an AOI is detected in a frame, the tool checks whether the raw eye gaze coordinates for that frame fall within the bounds of the AOI. However, it only works for pre-recorded eye tracking stimuli, and the AOI template must be specified manually beforehand.

Manually defining AOIs by annotating them frame by frame takes considerable time and effort for long video sequences. To overcome these challenges, in "Automated Filtering of Eye Movements using Dynamic AOI in Multiple Granularity Levels", we proposed a dynamic AOI-mapped gaze extraction workflow that uses pre-trained object detectors and object instance segmentation models to detect dynamic AOIs in video streams. To extract the dynamic AOI-mapped eye movements, we used RAEMAP, our eye movement processing framework.

Object Detectors and Segmentation Models

For our implementation, we used pre-trained convolutional neural network-based object detectors and object instance segmentation models to detect dynamic AOIs. They were pre-trained on the MS COCO image dataset.

We used YOLOv3 and three variants of the faster region-based convolutional neural network (Faster R-CNN) with different backbones (ResNet-50-FPN, ResNet-101-FPN, and ResNet-50-DC5). Altogether, we used four object detectors.

For object instance segmentation, we selected three models to achieve a higher level of granularity in AOIs. All three are variants of Mask R-CNN, which extends Faster R-CNN by adding a branch for object mask prediction in parallel with the existing bounding box recognition branch. The three variants use different backbones (ResNet-50-FPN, ResNet-101-FPN, and ResNet-50-DC5).

Eye Movements Dataset

We used a publicly available eye-tracking dataset (Eye tracking database for standard video sequences) collected from 15 participants as they watched 12 video sequences: Foreman, Bus, City, Crew, Flower Garden, Mother and Daughter, Soccer, Stefan, Mobile Calendar, Harbor, Hall Monitor, and Tempete. The dataset includes the gaze locations of participants, heat maps, and sample MATLAB demo files that show how to use the data. The participants (2 F and 13 M) were aged between 18 and 30 and had normal or corrected-to-normal vision.

Each row in the original data files corresponds to a particular frame in the video sequence, and the columns provide the (x, y) gaze coordinates of all participants at that frame. The dataset also provided a binary flag matrix indicating the accuracy of gaze locations.

For this study, we selected only four of the twelve video sequences (Foreman, Bus, Mother and Daughter, and Hall Monitor), namely those whose dominant objects already appear as class labels in the COCO names list. We pre-processed the gaze data of each participant separately for each video and filtered out the incorrect gaze locations based on the binary flag matrix. The goal was to separate the gaze locations of each participant before passing them into RAEMAP, since calculating the eye gaze metrics of each participant requires separate processing of the raw data.
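
The per-participant pre-processing described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the column layout (x0, y0, x1, y1, ...), the flag convention (1 = valid), and the function name split_valid_gaze are all assumptions made for the example.

```python
import numpy as np

def split_valid_gaze(gaze, flags):
    """Split a frame-by-participant gaze table into per-participant arrays.

    gaze  : (n_frames, 2 * n_participants); columns ASSUMED to be x0, y0, x1, y1, ...
    flags : (n_frames, n_participants); ASSUMED 1 = valid sample, 0 = incorrect sample
    Returns {participant_index: array of rows (frame, x, y)} with invalid rows dropped.
    """
    gaze = np.asarray(gaze, dtype=float)
    flags = np.asarray(flags).astype(bool)
    n_frames, n_participants = flags.shape
    # View the table as one (x, y) pair per frame and participant.
    xy = gaze.reshape(n_frames, n_participants, 2)
    frames = np.arange(n_frames)
    per_participant = {}
    for p in range(n_participants):
        keep = flags[:, p]
        # Keep the frame index so each sample can later be matched to its AOI.
        per_participant[p] = np.column_stack((frames[keep], xy[keep, p, :]))
    return per_participant
```

Keeping the frame index next to each (x, y) sample matters here because the AOIs are dynamic: every gaze sample has to be compared against the AOI detected in its own frame.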

Fig. 2: AOIs in Video Sequences. (a,b,c) - manually annotated AOIs using BeaverDam annotation tool. (d,e,f) - bounding boxes representing lower level of granularity, and pixel-wise masks (shaded area) representing a higher level of granularity of dynamic AOI. 

The Workflow of Dynamic AOI Detection 

Next, we loaded the pre-trained object detectors and COCO object names (class labels) using OpenCV and Detectron (COCO-Detection). Then, we configured the RAEMAP eye movement processing pipeline to use these object detectors and object instance segmentation models to dynamically detect AOIs in each frame. For each frame, the object detectors output the COCO class label and the location of each detected object in the form of bounding box coordinates, whereas the object instance segmentation models output a pixel-wise mask of each detected object (see Figure 1). The goal is to let users define an object of interest for each video, so that RAEMAP can process the eye tracking stimuli and dynamically detect the corresponding AOIs.

We identified one dominant object in each video sequence and defined it as the AOI label for that video sequence before processing the eye movement data. Based on the defined object of interest, we processed the videos offline to detect the corresponding dynamic AOIs using the object detectors and object instance segmentation models. For each video sequence, we applied an object detector and dynamically detected the AOIs at each frame. We repeated this step for each object detector and obtained a prediction for the bounding box coordinates at each frame of the video sequence. Similarly, we applied the object instance segmentation models and obtained a prediction for the pixel-wise mask of the dynamic AOI at each frame.
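
To make the per-frame selection step concrete, here is a small sketch of how the detections for one frame could be reduced to the AOI for that frame. The Detection record, the select_aoi helper, and the 0.5 score threshold are hypothetical; the post does not say how multiple detections of the same class were handled, so picking the highest-scoring one is an assumption.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

class Detection(NamedTuple):
    """One detector output for a frame (hypothetical record, not a library type)."""
    label: str                               # COCO class name, e.g. "person"
    score: float                             # detector confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)

def select_aoi(detections: Sequence[Detection],
               aoi_label: str,
               min_score: float = 0.5) -> Optional[Detection]:
    """Return the most confident detection of the AOI class, or None if absent."""
    candidates = [d for d in detections
                  if d.label == aoi_label and d.score >= min_score]
    # ASSUMPTION: when several objects share the label, keep the top-scoring one.
    return max(candidates, key=lambda d: d.score, default=None)
```

For a frame in the Bus sequence, for example, calling select_aoi(frame_detections, "bus") would return the bounding box to use as that frame's AOI, or None when the detector misses the object.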

Filter Eye Movements within identified Dynamic AOIs 

Next, we utilized the RAEMAP to filter and extract eye gaze data within the bounding boxes or pixel-wise masks of dynamic AOIs by checking if the eye movements fall within the boundaries of detected dynamic AOIs. 
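
The two containment checks, one per granularity level, can be illustrated with the sketch below. It is not RAEMAP's actual code: the function names, the (x_min, y_min, x_max, y_max) box layout, and the convention that pixel (col, row) covers the half-open square starting at that coordinate are assumptions for the example.

```python
import math
import numpy as np

def gaze_in_box(x, y, box):
    """Lower granularity: is the gaze point inside the AOI bounding box?"""
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def gaze_in_mask(x, y, mask):
    """Higher granularity: does the gaze point land on a masked (object) pixel?

    mask is a 2-D array indexed as mask[row, col], truthy on object pixels.
    """
    mask = np.asarray(mask, dtype=bool)
    # ASSUMPTION: pixel (col, row) covers [col, col + 1) x [row, row + 1).
    row, col = math.floor(y), math.floor(x)
    height, width = mask.shape
    if not (0 <= row < height and 0 <= col < width):
        return False  # gaze fell outside the frame
    return bool(mask[row, col])
```

A gaze point can sit inside the bounding box yet miss the mask, for instance in a corner of the box that the object does not cover, which is why the mask is the stricter of the two filters.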

Evaluation of AOI-Mapped Eye Movements

For the evaluation, we created a ground truth dataset by manually annotating each video sequence with the expected AOI in the form of bounding boxes using the BeaverDam video annotation tool (GitHub). BeaverDam is designed for drawing bounding boxes on video frames and annotating them with class labels. Video annotations made in BeaverDam can be exported in JSON file format. Exported annotations consist of bounding box coordinates at each marked frame along with the linear interpolation parameter. We generated four JSON objects, one for each video file, and linearly interpolated the bounding boxes between the start and end frames to obtain a continuous annotation. We used this interpolated result as the ground truth of dynamic AOIs. Then, we filtered the raw eye tracking data to keep only the gaze points that fall inside the boundaries of the manually annotated dynamic AOIs.

Fig. 3: Visualization of manually annotated AOIs (blue), AOIs detected by faster R-CNN object detector (red), the gaze positions of a participant (green) in five consecutive video frames in Bus and Foreman video sequences.
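
The linear interpolation of manually annotated keyframes into a continuous ground truth can be sketched as follows. The input here is a simplified stand-in, a dictionary mapping frame indices to (x, y, w, h) boxes, and not BeaverDam's actual JSON schema; the function name interpolate_boxes is likewise hypothetical.

```python
def interpolate_boxes(keyframes):
    """Linearly interpolate bounding boxes between annotated keyframes.

    keyframes : {frame_index: (x, y, w, h)} for the manually marked frames
                (simplified stand-in for the exported annotation).
    Returns {frame_index: (x, y, w, h)} for every frame from the first
    keyframe to the last, inclusive.
    """
    frames = sorted(keyframes)
    if not frames:
        return {}
    boxes = {}
    for start, end in zip(frames, frames[1:]):
        b0, b1 = keyframes[start], keyframes[end]
        span = end - start
        for f in range(start, end):
            t = (f - start) / span  # 0.0 at the start keyframe, approaching 1.0
            boxes[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    # The last keyframe is not covered by the loop above; copy it as floats.
    boxes[frames[-1]] = tuple(float(v) for v in keyframes[frames[-1]])
    return boxes
```

With keyframes at frames 0 and 4, the box at frame 2 lies exactly halfway between the two annotated boxes in position and size.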

Finally, we compared the two levels of granularity in dynamic AOIs in terms of the percentage of eye movements captured after filtering, relative to a baseline model. We observed that AOI-mapped eye movements generated by object detectors captured a maximum of 60% of eye movements, whereas object instance segmentation models captured only about 30% of eye movements. Since the share of filtered eye movements decreases as the granularity of the dynamic AOIs increases, we anticipate that this will contribute towards a layered analysis of both positional and advanced eye gaze metrics in the future.
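
As a rough illustration of the kind of figure reported above, the capture percentage for one participant and one model could be computed as below. The post does not spell out the exact baseline or denominator, so treating the percentage as the share of valid gaze samples that fall inside the detected AOI is an assumption, as is the function name.

```python
def capture_percentage(inside_flags):
    """Percentage of gaze samples that fell inside the AOI.

    inside_flags : iterable of booleans, one per valid gaze sample,
                   True when the sample was inside the AOI for its frame.
    ASSUMPTION: the denominator is the number of valid gaze samples.
    """
    flags = [bool(f) for f in inside_flags]
    if not flags:
        return 0.0  # no valid samples, nothing captured
    return 100.0 * sum(flags) / len(flags)
```

Running the same gaze samples through a bounding-box check and a mask check and comparing the two percentages gives the per-granularity comparison described in this section.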


The advantage of the introduced approach is that it eliminates the need for manual annotation of AOIs in dynamic eye tracking stimuli. Our proposed methodology for detecting dynamic AOIs is the initial step towards extracting eye movement data from dynamically detected AOIs in real time. This work will help develop a dynamic AOI-mapped eye movement filtering and transition workflow. In the future, we will apply our proposed computer vision-based dynamic AOI detection architecture in real time to generate AOI-mapped eye movements.

--Gavindya Jayawardena (@Gavindya2)