2022-12-31: Paper Summary: "Beyond Classifiers: Remote Sensing Change Detection with Metric Learning" Zhang et al.

Semantic mapping of changes between images using Triplet Loss Metric Learning, Fig 8. from Zhang et al.

I talked about two kinds of trust in my previous two posts, Evaluating Trust in User-Data Networks: What Can We Learn from Waze? and Trust Management in Multi-Agent Systems via Deep Reinforcement Learning. In the former, we looked at trust as a measure of the accuracy of data provided by a user, and in the latter we looked at evaluating the behavior of the user to measure trust in that particular user. The difference is subtle but apparent - a user with trustworthy historical behavior consistently provides accurate data. Conversely, one who provides data with inconsistent accuracy can be considered a qualitatively inferior data provider with measurably lower trust.

This implies an exploitable attack vector, however. If we award implicit trust based on historical behavior, what happens when a historically trustworthy user suddenly provides the system with either false or malicious data? In other words, what if someone flips a switch and turns a good guy into a bad guy?

There's a method of triggering exploration in an RL setting that we can look to as a first step toward answering this question, called Surprise-based Intrinsic Motivation. In Surprise-based Intrinsic Motivation, the agent essentially predicts future experience and is encouraged to explore new behavior when it receives an experience that sufficiently deviates from its expectation. This allows the agent to widen its understanding of its environment. Surprise in this sense is intuitive: it should be alarming if all of your experience in taking an action has resulted in similar outcomes and suddenly something else happens. The curious and scientifically minded amongst us would naturally want to understand why that surprising outcome happened and experiment further to gain a better understanding than they originally possessed.

Essentially, surprise is a measure of the difference between the inductive knowledge an agent has about the dynamics of its environment and the real dynamics of the system - the greater that difference the more surprised the agent. 

In Beyond Classifiers: Remote Sensing Change Detection with Metric Learning, the authors use metric learning to generate a semantic understanding of the changes between satellite images taken at different times. In metric learning, a model is built which projects detected objects into an embedding space. This embedding organizes objects such that the distance between them shrinks as their similarity grows. In other words, two objects that are close together are more similar than two objects that are far away from each other. It has been successful in various tasks such as face recognition, person re-identification, vehicle re-identification, tracking, and image retrieval. There have been several methods developed for metric learning, including modifications to the softmax loss and methods that directly optimize distances, such as contrastive loss and triplet loss. The contrastive loss method learns a globally coherent function that evenly maps the data to the output manifold, while triplet loss uses a triplet of samples to ensure that an image of a specific class is closer to other images of the same class than to any image of a different class. These methods have a close relationship to change detection and are used in this context in the paper.
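To make the contrastive loss idea concrete, here is a minimal NumPy sketch of one common formulation: similar pairs are penalized by their squared distance, and dissimilar pairs are penalized only when they sit closer than a margin. The function name, the margin value, and the squared-hinge form are my assumptions for illustration, not necessarily what the paper uses.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss over a batch of embedding pairs (illustrative form).

    emb_a, emb_b: arrays of shape (N, D)
    same: array of shape (N,), 1 for a similar pair and 0 for a dissimilar pair
    margin: assumed hyperparameter; dissimilar pairs farther apart than this cost nothing
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    same = np.asarray(same, dtype=float)
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    pull = same * dist ** 2                                    # similar pairs: penalize distance
    push = (1.0 - same) * np.maximum(0.0, margin - dist) ** 2  # dissimilar pairs: penalize closeness
    return float(np.mean(pull + push))
```

A similar pair sitting at the same point costs nothing, while a dissimilar pair at the same point costs the full squared margin.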

The authors call out two difficulties with prior work on metric learning-based change detection: data imbalance and the source of the triplet pair. The data imbalance problem is that changes usually make up a very small part of an image pair, with all the other parts of the images remaining relatively static. Prior works treated the pixels from unchanged regions of the images as equal to the pixels from changed regions. The loss measured over unchanged regions understandably overwhelms the loss from changed regions, so there needs to be a loss that gives the changed pixels much greater weight. The source of triplet pair problem arises because prior works built their triplet pairs by augmenting the bi-temporal images, and those pairs don't necessarily reflect the actual change between the images.
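A small numeric sketch shows why the imbalance matters. Averaging a per-pixel loss over all pixels lets the unchanged majority dominate, while averaging each group separately gives the changed pixels equal say. This balancing scheme and the function name are hypothetical, chosen only to illustrate the problem; the paper addresses it through how triplets are collected.

```python
import numpy as np

def balanced_pixel_loss(per_pixel_loss, changed_mask):
    """Average the loss inside each group (changed / unchanged), then average the groups.

    This is an illustrative re-weighting, not the paper's loss.
    """
    loss = np.asarray(per_pixel_loss, dtype=float)
    mask = np.asarray(changed_mask, dtype=bool)
    terms = []
    if mask.any():
        terms.append(loss[mask].mean())      # mean loss over changed pixels
    if (~mask).any():
        terms.append(loss[~mask].mean())     # mean loss over unchanged pixels
    return float(np.mean(terms)) if terms else 0.0

# 99 unchanged pixels with a small loss, 1 changed pixel with a large loss
loss = np.array([0.1] * 99 + [1.0])
mask = np.array([False] * 99 + [True])
plain = float(loss.mean())                   # dominated by the unchanged pixels
balanced = balanced_pixel_loss(loss, mask)   # changed pixel counts as much as the rest
```

In this toy case the plain mean barely notices the changed pixel, while the balanced version weights it as heavily as the other 99 combined.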

Proposed Method and Implementation

In the authors' method, standalone images are processed through a shared encoder to extract features. These features are then fed into a decoder, which aggregates the multi-scale features to upsample the resolution and output a feature embedding of a certain dimension. This step can be seen as semantic segmentation without label supervision for a single image. During training, the embeddings are adjusted based on the supervision of different loss functions (such as contrastive loss or triplet loss). During inference, the bi-temporal embeddings are compared using the cosine similarity and converted into a change probability map to indicate changed or unchanged pixels. This change probability map is then upsampled to output per-pixel predictions.
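The last step, upsampling the low-resolution change probability map back to per-pixel predictions, can be sketched as below. The interpolation mode isn't stated in this summary, so nearest-neighbour repetition is an assumption, as is the default factor of 4 (taken from the H/4 x W/4 embedding size described in the next section).

```python
import numpy as np

def upsample_nearest(prob_map, factor=4):
    """Upsample a 2-D probability map by repeating each cell `factor` times per axis.

    Nearest-neighbour upsampling is an assumed choice for illustration.
    """
    prob_map = np.asarray(prob_map, dtype=float)
    return np.repeat(np.repeat(prob_map, factor, axis=0), factor, axis=1)

small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
full = upsample_nearest(small)  # shape (8, 8)
```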

The proposed baseline networks with contrastive loss. These networks use embeddings from pair-wise or triple-wise images to compute different loss functions for training, Fig 2. from Zhang et al.

In order to train the proposed model, the authors need to calculate the probability of local pixel-wise change between images as a target. They do this in a four-step calculation:

1. Normalize extracted feature vectors with l2 normalization

For a pair of images I_1 and I_2 with shape (H, W, 3), the feature embeddings F_1 and F_2 have the shape (H/4, W/4, D), where F_(k,i,j) stands for the raw feature value of the kth channel at position (i, j) and f_(k,i,j) for the corresponding l2-normalized feature (0 < i < H/4, 0 < j < W/4).
        

2. Calculate the pixel-wise cosine similarity
where s^(1,2)_(i,j) stands for similarity at (i, j) position between two images (0 < i < H/4, 0 < j < W/4).

3. Calculate the Euclidean distance d^(1,2)_(i,j) using the normalized features
4. Finally, a "change score" p^(1,2)_(i,j) representing the probability of change can be written by a linear transformation

A larger p^(1,2)_(i,j) indicates a higher change probability.
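Written out, the four steps look roughly like this. The first three follow from the standard definitions of l2 normalization, cosine similarity, and Euclidean distance between unit vectors. The exact linear transformation in step 4 is my assumption (dividing by the maximum possible distance of 2 so the score lands in [0, 1]), so check it against the paper. The superscript (n) indexes the two images.

```latex
% Step 1: l2 normalization along the channel dimension
f^{(n)}_{k,i,j} = \frac{F^{(n)}_{k,i,j}}{\sqrt{\sum_{k'=1}^{D} \left(F^{(n)}_{k',i,j}\right)^{2}}}, \qquad n \in \{1, 2\}

% Step 2: pixel-wise cosine similarity
s^{(1,2)}_{i,j} = \sum_{k=1}^{D} f^{(1)}_{k,i,j} \, f^{(2)}_{k,i,j}

% Step 3: Euclidean distance between the normalized features
d^{(1,2)}_{i,j} = \sqrt{\sum_{k=1}^{D} \left(f^{(1)}_{k,i,j} - f^{(2)}_{k,i,j}\right)^{2}} = \sqrt{2 - 2\, s^{(1,2)}_{i,j}}

% Step 4: change score (assumed form of the linear transformation)
p^{(1,2)}_{i,j} = \frac{d^{(1,2)}_{i,j}}{2} \in [0, 1]
```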

The authors binarize the classification as p^(1,2)_(i,j) > 0.5 = Changed and p^(1,2)_(i,j) <= 0.5 = Unchanged. The authors' proposed method involves optimizing p^(1,2)_(i,j) to generate this classification.
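Here is a minimal NumPy sketch of the whole calculation, from raw embeddings to a binary change map. The function names are hypothetical, and the final rescaling (distance divided by 2) is an assumed form of the linear transformation, picked so that the score falls in [0, 1].

```python
import numpy as np

def change_score(F1, F2, eps=1e-12):
    """Pixel-wise change score from two embeddings of shape (H/4, W/4, D)."""
    F1 = np.asarray(F1, dtype=float)
    F2 = np.asarray(F2, dtype=float)
    # Step 1: l2-normalize each pixel's feature vector along the channel axis
    f1 = F1 / (np.linalg.norm(F1, axis=-1, keepdims=True) + eps)
    f2 = F2 / (np.linalg.norm(F2, axis=-1, keepdims=True) + eps)
    # Step 2: cosine similarity is the dot product of the normalized vectors
    s = np.sum(f1 * f2, axis=-1)
    # Step 3: Euclidean distance between unit vectors, d = sqrt(2 - 2s)
    d = np.sqrt(np.clip(2.0 - 2.0 * s, 0.0, None))
    # Step 4: assumed linear rescaling of d from [0, 2] to [0, 1]
    return d / 2.0

def binarize(p, threshold=0.5):
    """True where a pixel is classified as changed."""
    return np.asarray(p) > threshold

same = change_score([[[1.0, 0.0]]], [[[2.0, 0.0]]])       # same direction
opposite = change_score([[[1.0, 0.0]]], [[[-1.0, 0.0]]])  # opposite direction
orthogonal = change_score([[[1.0, 0.0]]], [[[0.0, 3.0]]]) # unrelated features
```

Identical directions score near 0, opposite directions score 1, and orthogonal features land at about 0.71, which is on the "changed" side of the threshold.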

Of course, for the model to learn, a loss function is needed to measure the model's predictive performance during training. This is done by calculating a sound similarity metric between samples and comparing it to the predicted p^(1,2)_(i,j). The authors use triplet loss to improve on the contrastive loss computed between two samples. Like contrastive loss, triplet loss pulls samples from the same class closer together and pushes samples from different classes farther apart, but it adds a third sample: an anchor that is compared against both a positive and a negative sample. The triplet loss can be written as the distance between the anchor and positive samples minus the distance between the anchor and negative samples, plus a margin. This relaxation helps improve the feature representation between and within classes. For example, in face recognition, two samples with the same ID would be pulled closer together, while being pushed away from samples with a different ID.
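As a quick illustration, the triplet loss described above can be sketched in a few lines of NumPy. The clamp at zero is the standard formulation and is assumed here; the margin value and the use of Euclidean distance are also assumptions for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Mean triplet loss over a batch of (anchor, positive, negative) embeddings.

    Each input has shape (N, D). The loss is zero once the negative is at least
    `margin` farther from the anchor than the positive is.
    """
    anchor, positive, negative = (
        np.asarray(x, dtype=float) for x in (anchor, positive, negative)
    )
    d_ap = np.linalg.norm(anchor - positive, axis=-1)  # anchor-to-positive distance
    d_an = np.linalg.norm(anchor - negative, axis=-1)  # anchor-to-negative distance
    return float(np.mean(np.maximum(0.0, d_ap - d_an + margin)))

# Well-separated triplet: positive on top of the anchor, negative far away
easy = triplet_loss([[0.0, 0.0]], [[0.0, 0.0]], [[3.0, 4.0]])
# Badly ordered triplet: positive far away, negative on top of the anchor
hard = triplet_loss([[0.0, 0.0]], [[3.0, 4.0]], [[0.0, 0.0]])
```

The easy triplet contributes no loss, while the badly ordered one is penalized by the full distance gap plus the margin.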

However, one of the main problems with employing triplet loss is collecting negative samples in unbalanced sparse data, like in change detection. The reason for this imbalance is that unchanged pixels often greatly outnumber changed pixels. The authors also have the problem of not having clearly labeled classifications in the data set. So, they introduce a method to collect triplet loss sources.
Triplet loss from changed regions, where the positive sample is randomly sampled from the same phase as the anchor and pixels are shifted, Fig 3. from Zhang et al.


Triplet loss from unchanged regions, where the negative sample is generated by injecting a random image into the same pixel region as the anchor, Fig 3. from Zhang et al.

Essentially, they are creating a third sample from two bi-temporal images. In the case of changed regions, they create a sample that is closer in similarity to the anchor by shifting the image and randomly sampling a part of the original anchor region. Conversely, for unchanged regions, they create a negative sample by injecting a random image into the anchor region to force less similarity between it and the anchor than between the anchor and the positive sample. Finally, for the semantic change detection data, they formulate the triplet pairs by comparing the pixel embeddings between phases. The positive sample includes pixels from phase 1 with the same semantic labeling as the anchor and the negative includes pixels from phase 2 with a different semantic labeling from the anchor, where the anchor is from phase 1.
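The semantic case is the easiest of the three to sketch in code. Given label maps for both phases and an anchor pixel in phase 1, the positive is drawn from phase-1 pixels that share the anchor's label and the negative from phase-2 pixels with a different label. The function name, the uniform random sampling, and the choice to skip anchors with no candidates are all my assumptions; the paper may select pairs differently.

```python
import numpy as np

def sample_semantic_triplet(labels1, labels2, anchor_pos, rng):
    """Pick (positive, negative) pixel positions for an anchor pixel in phase 1.

    labels1, labels2: integer label maps of shape (H, W) for phase 1 and phase 2
    anchor_pos: (row, col) tuple of the anchor pixel in phase 1
    Returns None when no valid positive or negative exists.
    """
    labels1 = np.asarray(labels1)
    labels2 = np.asarray(labels2)
    anchor_label = labels1[anchor_pos]
    pos_mask = labels1 == anchor_label   # phase-1 pixels with the anchor's label
    pos_mask[anchor_pos] = False         # never pair the anchor with itself
    neg_mask = labels2 != anchor_label   # phase-2 pixels with a different label
    pos_candidates = np.argwhere(pos_mask)
    neg_candidates = np.argwhere(neg_mask)
    if len(pos_candidates) == 0 or len(neg_candidates) == 0:
        return None
    pos = pos_candidates[rng.integers(len(pos_candidates))]
    neg = neg_candidates[rng.integers(len(neg_candidates))]
    return tuple(int(v) for v in pos), tuple(int(v) for v in neg)

rng = np.random.default_rng(0)
labels1 = np.array([[0, 0], [1, 1]])
labels2 = np.array([[0, 1], [1, 1]])
pos, neg = sample_semantic_triplet(labels1, labels2, (0, 0), rng)
```

The returned positions index into the phase-1 and phase-2 embedding maps respectively, so the three embeddings can be fed straight into a triplet loss.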

Datasets


The LEVIR-CD dataset consists of 637 pairs of bi-temporal remote sensing images collected from the Google Earth platform. These images have a size of 1024x1024 pixels and a spatial resolution of 0.5 meters. The dataset is primarily focused on changes in building instances and includes both building appearance and disappearance. It is officially split into three parts: train, validation, and test, with 445, 64, and 128 pairs, respectively. 

The SYSU-CD dataset includes 20,000 pairs of 0.5-meter aerial images taken in Hong Kong between 2007 and 2014. These images, which have a size of 256x256 pixels, depict changes in port construction and maintenance in Hong Kong and other international and Asia-Pacific areas. The dataset also includes changes in high-rise buildings, which can be difficult to identify due to deviations and shadows. The official split for this dataset is 12,000 images for training, 4,000 for validation, and 4,000 for testing. 

The SECOND dataset is a large-scale benchmark dataset for semantic change detection (SCD) and consists of bi-temporal high-resolution optical images from various aerial platforms and sensors. These images depict several cities in China, including Hangzhou, Chengdu, and Shanghai, and have a size of 512x512 pixels with a spatial resolution ranging from 0.5 to 3 meters per pixel. The dataset includes 30 different types of semantic changes, including unchanged, non-vegetated ground surface, trees, low vegetation, water, buildings, and playgrounds. The original validation set for this dataset is not available, so the experimental setting of a 4:1 split is followed, with 2375 image pairs for training and 593 for testing.

Results

Binary Change Detection

The authors compare their proposed method against several other methods for detecting changes in images: FC-EF combines bi-temporal images and processes them through a ConvNet; FC-Siam-Diff and FC-Siam-Conc both extract multi-level features from a Siamese ConvNet and use the feature difference or concatenation, respectively, to detect changes; DTCDSCN uses a dual attention module to detect changes by exploiting the inter-dependencies between channels and spatial positions of CNN features; STANet is a Siamese-based network that uses spatial-temporal attention for change detection; IFNet is a multi-scale feature concatenation method with attention modules for change map reconstruction; SNUNet is a multi-level feature concatenation method with a densely connected Siamese network for change detection; BIT uses a transformer-based approach and feature differencing to obtain the change map; and DSAMNet is a deeply supervised attention metric-based network that learns change maps through deep metric learning with convolutional block attention modules.

The authors' proposed method performs better than other methods in terms of F1 and Intersection over Union (IoU), with improvements of 3.24% and 2.89% for LEVIR-CD and SYSU-CD, respectively. FC-EF, FC-Siam-Diff, FC-Siam-Conc, DTCDSCN, IFNet, SNUNet, and BIT use two-class classifiers for change prediction, and the proposed method outperforms them with a modified baseline. STANet and DSAMNet also use metric learning, but the proposed method demonstrates the importance of discrimination during optimization, particularly with regard to data imbalance and triplet pair selection. TBSRL is not open-sourced and therefore not included in their comparison. In addition to quantitative comparisons, the proposed method also produces fewer false positives and false negatives when compared visually to other state-of-the-art methods on SYSU-CD test images. Both the quantitative and qualitative comparisons support the superiority of the proposed method.
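For readers less familiar with the two metrics, both are computed from the counts of true positives, false positives, and false negatives on the "changed" class. Here is a minimal sketch; the function name and the choice to return 0 when there are no positives at all are my own conventions.

```python
import numpy as np

def f1_and_iou(pred, target):
    """F1 score and IoU for the changed class, from boolean prediction and label maps."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))    # changed pixels correctly detected
    fp = int(np.sum(pred & ~target))   # unchanged pixels flagged as changed
    fn = int(np.sum(~pred & target))   # changed pixels that were missed
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return f1, iou

# One hit, one false alarm, one miss, one correct rejection
f1, iou = f1_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

Note that neither metric counts true negatives, which is why both stay informative when unchanged pixels dominate the image.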
Quantitative performance between authors' binary change detection method and others, Table 1. from Zhang et al.


Visual comparison with other binary change detection methods, Fig 5. (rotated) from Zhang et al.

Semantic Change Detection

In this comparison, several classifier-based methods for semantic change detection are considered. The proposed method includes an extra triplet loss and is supervised by both binary and semantic signals. During training, extra triplet pairs are collected based on semantic labels, and class embeddings are computed after training. During inference, the changed region is identified using binary change detection, and these foreground regions are then compared to the pre-computed class embeddings to obtain the predicted class. The authors' method is then compared to other semantic change detection methods, which include Unet++ (a variant of Unet which enables more connections among multi-scale features), Resnet-GRU and Resnet-LSTM (which extract features with a Resnet encoder and then use Gated Recurrent Units (GRU) or Long Short-Term Memory units (LSTM) for change detection), FC-EF, FC-Siam-Conc, FC-Siam-Diff, and IFNet. HRSCD (str2, str3, str4) decouples the binary and semantic branches, but has a simple model structure and thus limited performance on complex semantic change detection datasets. Bi-SRNet has the highest accuracy among these methods, with skip connections between the temporal branches and the CD branch. The proposed method, which uses shared embeddings learned from both binary and semantic supervision, outperforms the state-of-the-art methods in all metrics and produces more accurate classifications in the changed locations when compared visually. Both the quantitative and qualitative comparisons demonstrate the effectiveness of metric-based methods in semantic change detection.
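The inference step described above, comparing changed pixels to pre-computed class embeddings, amounts to a nearest-prototype lookup. A minimal sketch, assuming cosine similarity is the comparison and using -1 as a hypothetical marker for unchanged pixels:

```python
import numpy as np

def classify_changed_pixels(embeddings, changed_mask, class_embeddings, eps=1e-12):
    """Assign each changed pixel the class whose embedding is most similar.

    embeddings: (H, W, D) pixel embeddings for one phase
    changed_mask: (H, W) boolean output of binary change detection
    class_embeddings: (C, D) one pre-computed embedding per class
    Returns an (H, W) integer map; unchanged pixels are marked -1 (assumed convention).
    """
    emb = np.asarray(embeddings, dtype=float)
    protos = np.asarray(class_embeddings, dtype=float)
    mask = np.asarray(changed_mask, dtype=bool)
    emb = emb / (np.linalg.norm(emb, axis=-1, keepdims=True) + eps)
    protos = protos / (np.linalg.norm(protos, axis=-1, keepdims=True) + eps)
    sim = emb @ protos.T               # (H, W, C) cosine similarities
    pred = np.argmax(sim, axis=-1)     # best-matching class per pixel
    return np.where(mask, pred, -1)

emb = np.array([[[1.0, 0.0], [0.0, 2.0]]])        # two pixels, D = 2
protos = np.array([[1.0, 0.0], [0.0, 1.0]])       # two classes
both = classify_changed_pixels(emb, [[True, True]], protos)
one = classify_changed_pixels(emb, [[True, False]], protos)
```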
Quantitative performance between authors' semantic change detection method and others, Table 2. from Zhang et al.

Visual comparison with other semantic change detection methods, Fig 6. (rotated) from Zhang et al.

Visual comparison between different loss metrics, Fig 7. (rotated) from Zhang et al.

The above figure illustrates the change detections produced by different loss functions. Some false positives may mislead the baseline. However, the performance is improved with triplet loss in both changed and unchanged regions, as false positive regions are reduced due to more effective discriminative learning. The best performance is achieved when using triplet loss in both changed and unchanged regions, supporting the authors' conclusion.

Visual comparison of feature embeddings and probability mapping, Fig 8. from Zhang et al.

Here, the red boxes highlight areas where pixel embedding results in higher activations. These areas are often where significant changes occur. To further visualize the triplet pairs, the authors calculated the probability of change at each pixel. A brighter value indicates a higher probability of change. The results show that using the triplet loss in unchanged regions provides additional supervision as the additional unpaired probability map is reasonable.

Final Thoughts

This paper presents a path to calculating a baseline surrogate for surprise as a result of unexpected changes presented in aerial drone data. I believe a good direction would be to extend this method to use a predicted and current data sample pair instead of the bi-temporal pair used in the paper. If we have a sound model to predict changes given known history, we can compare its predictions against any data submitted, and use instances of significant change to measure surprise in the system and trigger a trust audit of the user and their submitted data.
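To make that idea a bit more tangible, here is a rough sketch of how a change probability map between a predicted and a submitted sample might be turned into a surprise signal. Everything here is hypothetical: the function names, the use of the changed-pixel fraction as the surprise measure, and both thresholds are placeholders I picked for illustration, not something from the paper.

```python
import numpy as np

def surprise_from_change_map(change_prob, pixel_threshold=0.5):
    """Fraction of pixels whose change probability exceeds the pixel threshold."""
    change_prob = np.asarray(change_prob, dtype=float)
    return float(np.mean(change_prob > pixel_threshold))

def needs_trust_audit(change_prob, surprise_threshold=0.2):
    """Flag a submission for audit when too much of it differs from the prediction."""
    return surprise_from_change_map(change_prob) > surprise_threshold

# Change map between the predicted sample and what the user actually submitted
expected = np.zeros((2, 2))                       # submission matches the prediction
unexpected = np.array([[0.9, 0.1], [0.2, 0.3]])   # one of four pixels changed
```

In practice the thresholds would need to be tuned against how much change is normal for the area and time gap in question.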

Sources:

Zhang, Y.; Li, W.; Wang, Y.; Wang, Z.; Li, H. Beyond Classifiers: Remote Sensing Change Detection with Metric Learning. Remote Sens. 2022, 14, 4478. https://doi.org/10.3390/rs14184478

- Jim Ecker
