An Adversarial Self-Learning Method for Cross-City Adaptation in Semantic Segmentation

Semantic segmentation is an important task in the visual system of self-driving cars. Semantic segmentation models based on convolutional neural networks (CNNs) and trained with large numbers of annotated labels may not work well in environments that differ from the training set, because of the domain gap between the training and test domains. To reduce the distance between the source and target domains, domain adaptation methods have been proposed for unsupervised training with an unlabeled target domain. In addition to reducing the domain shift, we propose a self-learning method that enhances the predicted probabilities for the target domain. To obtain more accurate target-domain probability maps from a segmentation model trained on the source domain, we propose an adversarial self-learning method that consists of a domain adaptation part and a self-learning part. The method maximizes the predicted probabilities in the target-domain probability maps produced by the segmentation model, after that model has been adapted with the domain adaptation method. In Cityscapes-to-NTHU cross-city adaptation experiments, the adversarial self-learning method achieves state-of-the-art results compared with recently proposed domain adaptation methods.


I. INTRODUCTION
With a visual system for self-driving cars, we can realize line and road detection [1], traffic sign recognition [2], depth estimation [3], object detection [4], and semantic segmentation [5] based on image processing techniques. For understanding urban scenes, semantic segmentation plays a significant role in the visual system. Unlike image recognition, which is an image-wise classification problem, semantic segmentation is a pixel-wise classification task that assigns a label to each pixel of the image. With the improvement of CNN architectures, the performance of semantic segmentation systems [6]-[15] has increased significantly within a few years. Supervised learning of CNN-based semantic segmentation systems requires a large number of high-quality annotated images [16]. For training semantic segmentation systems for urban scene understanding, the Cityscapes [16] and Mapillary Vistas [17] datasets provide thousands of finely annotated real-world images from multiple cities. Because pixel-level annotation is expensive, synthetic datasets [18], [19], whose labels are generated rapidly from computer games, have also been provided for training semantic segmentation models. When a semantic segmentation model pre-trained on real-world or synthetic datasets is used to predict images from scenes that are not in the training datasets, it may not perform well because of the domain shift [20]-[23] between the training and testing datasets. Retraining the model on the testing datasets is not feasible when annotated labels are insufficient.
To deal with the domain shift problem in semantic segmentation systems, domain adaptation methods [24]-[29] based on generative adversarial networks (GANs) [30], [31] have been proposed to reduce the divergence between the two domains (the training and testing datasets). In our proposed method, we adapt the segmentation model trained on the source domain (training dataset) by enhancing the probabilities of the object predictions, i.e., the probability maps output by the segmentation system for the target domain (testing dataset). Our contributions in this paper are as follows:
As a self-learning method, we calculate the cross-entropy loss with pseudo labels obtained from the probability maps of the target domain to enhance the object prediction probabilities.
To obtain accurate pseudo labels for the self-learning, we use an output space domain adaptation method based on GANs to reduce the domain gap for the probability maps of the target domain.
We propose an adversarial self-learning method that combines the domain adaptation method and the self-learning method. We apply the proposed method to real-world cross-city adaptation, and the experiments show that it achieves state-of-the-art results.

II. RELATED WORKS
In this section, we introduce semantic segmentation systems based on CNN models and domain adaptation methods based on GANs from recent research.
Semantic Segmentation. Semantic segmentation systems have developed rapidly with CNN models in recent years. Following the architecture of the fully convolutional network (FCN) [6], a segmentation system contains two parts: the feature extractor and the classifier module. The feature extractors use image recognition models such as AlexNet [32], VGGNet [33], GoogLeNet [34], and ResNet [35], pre-trained on the ImageNet [36] and Microsoft COCO [37] datasets, to extract feature maps for the images from the segmentation datasets. The classifier modules use deconvolution layers for pixel-wise classification, keeping the channels and sizes of the probability maps generated from the extracted feature maps consistent. Unlike the FCN [6] model, the U-Net [38] and SegNet [15] models propose an architecture consisting of encoder and decoder modules. Instead of using pre-trained feature extractors, these encoder-decoder models, which generate probability maps directly from the input images, are trained on the semantic segmentation datasets as end-to-end systems. Like FCN, DeeplabV2 [7] uses the pre-trained ResNet101 [35] as the backbone for feature extraction. Instead of deconvolution layers, DeeplabV2 uses atrous spatial pyramid pooling (ASPP) [7] as the classifier module, which applies dilated convolutions with multiple filters at different rates to capture multi-scale spatial context of the image.
Huachen Yu and Jianming Yang
Domain Adaptation. The GAN-based [30] domain adaptation methods used for semantic segmentation systems in recent research can be divided into three classes: feature adaptation, output adaptation, and image adaptation. The feature adaptation methods [27], [28] use a discriminator to measure the distance between the distributions of the feature maps extracted from source-domain and target-domain images with pre-trained backbones such as VGG16 [33] and ResNet101 [35]. By reducing the distance between the feature maps of the source and target domains, the classification results for the target domain, which are based on these feature maps, become similar to those for the source domain.
Because the semantic segmentation system produces structured outputs, the output adaptation methods [24], [29] directly use a discriminator to measure the divergence between the probability maps output by the segmentation system. With adversarial training of the segmentation network, the distributions of the probability maps from the target and source domains are brought as close as possible. Because the difference in image styles is a cause of the domain shift in semantic segmentation systems, the image adaptation methods [25], [26] use GANs for image-to-image translation. Since the images generated from the target-domain images by GAN-based style transfer have styles similar to the source-domain images, the domain gap for the segmentation system is reduced.

III. PROPOSED METHOD
In this paper, we propose a new algorithm that consists of adversarial learning and self-learning methods applied to the output space of the segmentation network, to deal with the domain shift between the source and target domains in semantic segmentation systems. In this section, we explain the algorithm flow of the proposed method and the loss functions used for system optimization in detail.

A. Architecture Overview
As shown in Fig. 1, the proposed method can be divided into two parts: the segmentation network G and the discriminator module D. As in a standard semantic segmentation system, input images and annotated labels (the ground truth) from the source domain are used to calculate the cross-entropy loss that trains the weights of the segmentation network G in a supervised manner. For the domain adaptation part, which reduces the domain shift between the source and target domains, the adversarial loss for target-domain images, which corresponds to the Jensen-Shannon (JS) divergence estimated by the discriminator module as proposed by Goodfellow et al. [30], is used to fine-tune the trained weights of the segmentation network. Once the divergence between the output spaces of the source and target domains has been reduced by the adversarial loss, the target-domain probability maps output by the segmentation network can be used for a self-learning process. To enhance the confidence of the object probabilities in the target-domain probability maps, the cross-entropy loss between the probability maps and the pseudo labels obtained from those probability maps is calculated and used to adapt the weights of the segmentation network. As in generative adversarial learning, the adversarial loss from the discriminator D is used to adapt the segmentation network G, while the weights of the discriminator D are trained with the probability maps of the source and target domains so that D can distinguish the domain of each output generated by the segmentation network G.
Fig. 1. Overview of the proposed algorithm. The proposed method is composed of the segmentation network and the discriminator module. We use the given images and annotated labels from the source domain for supervised training of the segmentation network. To reduce the domain shift between the source and target domains, images from the target domain are used to adapt the segmentation network through unsupervised training. We use the probability maps of the source and target domains obtained from the segmentation network to train the discriminator module.

B. Loss Function
Segmentation Network Training. As introduced in Section III-A, we use the images and annotated labels from the source domain to train the segmentation network G in a supervised manner, and we use the images from the target domain to adapt G so as to reduce the domain shift between the source and target domains. With source and target images $I_s, I_t \in \mathbb{R}^{H \times W \times 3}$ (H and W are the height and width of the images), the overall loss function can be expressed as:
$$\mathcal{L}(I_s, I_t) = \mathcal{L}_{seg}(I_s) + \lambda_{adv}\,\mathcal{L}_{adv}(I_t) + \lambda_{sl}\,\mathcal{L}_{sl}(I_t)$$
where $\mathcal{L}_{seg}(I_s)$ is the cross-entropy loss between the predictions for the source-domain images $I_s$ and the annotated labels $Y_s$, $\mathcal{L}_{adv}(I_t)$ is the adversarial loss for the probability maps of the target-domain images $I_t$ calculated by the discriminator D, and $\mathcal{L}_{sl}(I_t)$ is the self-learning loss, i.e., the cross-entropy loss between the probability maps obtained from the target-domain images and the pseudo labels. $\lambda_{adv}$ and $\lambda_{sl}$ are the weights for the adversarial loss $\mathcal{L}_{adv}(I_t)$ and the self-learning loss $\mathcal{L}_{sl}(I_t)$.
Cross-entropy loss for supervised learning. We use the images $I_s$ and annotated labels $Y_s$ to train the weights of the segmentation network G. With C the number of categories, $G(I_s) \in \mathbb{R}^{H \times W \times C}$ is the output of the segmentation network. Before the loss is calculated, $G(I_s)$ is normalized with a softmax layer. Defining the normalized probability maps as $\hat{G}(I_s) = \mathrm{Softmax}(G(I_s))$, the cross-entropy loss $\mathcal{L}_{seg}(I_s)$ on the source domain is:
$$\mathcal{L}_{seg}(I_s) = -\sum_{h,w}\sum_{c=1}^{C} Y_s^{(h,w,c)} \log \hat{G}(I_s)^{(h,w,c)}$$
where $Y_s^{(h,w,c)}$ is the one-hot encoding of the annotated labels $Y_s$.
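The softmax normalization and the source-domain cross-entropy described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, labels are assumed to be integer class indices that are converted to one-hot form, and the loss is averaged over pixels, which the text does not specify.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def segmentation_loss(logits, labels, eps=1e-12):
    """Cross-entropy between softmax(logits), shape (H, W, C), and integer labels, shape (H, W).

    Averaging over pixels is an assumption made for this sketch.
    """
    num_classes = logits.shape[-1]
    probs = softmax(logits)                  # normalized probability maps
    one_hot = np.eye(num_classes)[labels]    # (H, W, C) one-hot labels
    per_pixel = -(one_hot * np.log(probs + eps)).sum(axis=-1)
    return float(per_pixel.mean())

# Toy example: a 2x2 "image" with 3 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 3))
labels = np.array([[0, 1], [2, 1]])
loss = segmentation_loss(logits, labels)
```

With all-zero network outputs the softmax is uniform, so the loss equals log(C), which is a convenient sanity check for an untrained classifier.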
Adversarial loss for unsupervised learning. We use the discriminator D, which distinguishes the domain of the probability maps output by the segmentation network G, to calculate the adversarial loss $\mathcal{L}_{adv}$. As the input to D, we use the normalized probability maps $\hat{G}(I_t)$ rather than the raw outputs $G(I_t)$ generated from the target-domain images $I_t$. Since D is a classifier that distinguishes the domains of the probability maps and we label the probability maps of the source domain with 1, the adversarial loss for $I_t$ can be defined as:
$$\mathcal{L}_{adv}(I_t) = -\sum_{h,w} \log D(\hat{G}(I_t))^{(h,w)}$$
where $D(\cdot)^{(h,w)}$ is the source-domain probability predicted by D at output location (h, w). By minimizing the adversarial loss to adapt the segmentation network G, the distribution of the target-domain probability maps is brought close to that of the source domain.
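As a rough illustration of the adversarial loss described above, the following NumPy sketch treats the discriminator output as a map of probabilities that each patch comes from the source domain (label 1). The function name, the averaging over locations, and the toy discriminator outputs are assumptions made for this example only.

```python
import numpy as np

def adversarial_loss(d_source_prob, eps=1e-12):
    """Mean of -log D(.) over discriminator output locations.

    d_source_prob: array with values in (0, 1), the discriminator's
    estimate that each patch of a TARGET-domain probability map comes
    from the source domain (label 1).
    """
    return float(-np.log(d_source_prob + eps).mean())

# Toy discriminator outputs (hypothetical values).
fooled = np.full((8, 16), 0.9)      # D believes the target maps look like source maps
not_fooled = np.full((8, 16), 0.1)  # D easily recognizes the target maps

loss_fooled = adversarial_loss(fooled)
loss_not_fooled = adversarial_loss(not_fooled)
```

The loss is small when the discriminator is fooled, so minimizing it with respect to the segmentation network pushes the target-domain probability maps toward the source-domain distribution.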
Self-learning loss for unsupervised learning. As a self-learning method, we use the cross-entropy loss between the probability maps obtained from the target-domain images $I_t$ and the pseudo labels $\hat{Y}_t$ derived from those probability maps, to maximize the probabilities of the objects (road, sidewalk, tree, etc.) at the pixels of the target-domain images. For consistency with the adversarial loss, we use the normalized probability maps $\hat{G}(I_t)$ instead of $G(I_t)$ to calculate the self-learning loss. The pseudo labels are obtained from $\hat{G}(I_t)$ with an argmax function over the classes, $\hat{Y}_t = \mathrm{argmax}(\hat{G}(I_t))$. The self-learning loss, i.e., the cross-entropy loss between $\hat{G}(I_t)$ and $\hat{Y}_t$, can be defined as:
$$\mathcal{L}_{sl}(I_t) = -\sum_{h,w}\sum_{c=1}^{C} \hat{Y}_t^{(h,w,c)} \log \hat{G}(I_t)^{(h,w,c)}$$
where $\hat{Y}_t^{(h,w,c)}$ is the one-hot encoding of the pseudo labels $\hat{Y}_t$.
Discriminator Network Training. The discriminator D is a two-class classifier that distinguishes the domains of the normalized probability maps $\hat{G}(I_s)$ and $\hat{G}(I_t)$. With source-domain maps labeled 1 and target-domain maps labeled 0, the overall loss used to train the weights of D can be expressed as:
$$\mathcal{L}_{D}(I_s, I_t) = -\sum_{h,w}\left[\log D(\hat{G}(I_s))^{(h,w)} + \log\left(1 - D(\hat{G}(I_t))^{(h,w)}\right)\right]$$
As in generative adversarial learning, minimizing the discriminator loss $\mathcal{L}_{D}(I_s, I_t)$ yields an estimate of the JS divergence [30] between the distributions of the probability maps from the source and target domains.
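The pseudo-label self-learning loss and the discriminator loss described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, losses are averaged rather than summed, and the target domain is assumed to carry label 0 because the source domain is labeled 1.

```python
import numpy as np

def pseudo_labels(probs):
    """Pseudo labels: the most probable class at each pixel of an (H, W, C) map."""
    return probs.argmax(axis=-1)

def self_learning_loss(probs, eps=1e-12):
    """Cross-entropy between probability maps and their own argmax pseudo labels.

    With one-hot pseudo labels this reduces to -log(max_c probs) per pixel.
    Averaging over pixels is an assumption made for this sketch.
    """
    labels = pseudo_labels(probs)
    picked = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    return float(-np.log(picked + eps).mean())

def discriminator_loss(d_on_source, d_on_target, eps=1e-12):
    """Binary cross-entropy with source maps labeled 1 and target maps labeled 0."""
    source_term = -np.log(d_on_source + eps).mean()
    target_term = -np.log(1.0 - d_on_target + eps).mean()
    return float(source_term + target_term)

# One-pixel probability maps over 3 classes (hypothetical values).
confident = np.array([[[0.9, 0.05, 0.05]]])
uncertain = np.array([[[0.4, 0.3, 0.3]]])
```

Because the pseudo label is the argmax of the same map, the self-learning loss only decreases when the winning class probability grows, which is why the accuracy of the pseudo labels matters and why the output space is adapted first.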

IV. EXPERIMENTS AND RESULTS
In this section, we use experimental results to validate the effectiveness of the proposed domain adaptation method for semantic segmentation. For the real-world cross-city adaptation experiments, we describe the network architecture (segmentation network G and discriminator D), the parameter and optimizer settings, and the datasets and environment of the experiments, and then discuss the results in detail.

A. Network Architecture
Segmentation Network G. We use DeeplabV2 [7] as the base architecture for the segmentation network G. We adopt the ResNet101 [35] network pre-trained on the ImageNet dataset as the backbone of G for feature extraction, and we use the ASPP [7] module as the decoder of G to obtain the probability maps from the feature maps. Finally, to match the input size, we use an up-sampling layer to resize the probability maps from the ASPP module.
Discriminator D. To focus on local patches of the input probability maps, we use PatchGAN [39] as the base architecture for the discriminator D. The discriminator is composed of 5 convolution blocks with output channels {64, 128, 256, 512, 1}. Each block consists of a convolution layer with kernel size 4, stride 2, and padding 1, followed by a LeakyReLU [40] activation layer with negative slope 0.2.
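To make the patch-level behavior of this discriminator concrete, the sketch below applies the standard convolution output-size formula to the five blocks described above. It assumes that the discriminator receives probability maps at the experiment image size of 256 by 512; the function names are hypothetical.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def discriminator_output_shape(height, width, num_blocks=5):
    """Apply the same kernel-4, stride-2, padding-1 convolution num_blocks times."""
    for _ in range(num_blocks):
        height = conv_output_size(height)
        width = conv_output_size(width)
    return height, width

shape = discriminator_output_shape(256, 512)
print(shape)  # (8, 16)
```

Each block halves the spatial resolution, so under this assumption the discriminator emits an 8 by 16 grid of domain predictions, one per local patch of the probability map, rather than a single score per image.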

B. Parameters and Optimizers
International Journal of Machine Learning and Computing, Vol. 10, No. 5, September 2020
For the weight $\lambda_{adv}$ of the adversarial loss, we follow AdaptSegNet [24] and adopt 0.001 to obtain a sensitive adaptation effect. We also choose 0.001 for the weight $\lambda_{sl}$ of the self-learning loss, for consistency with the adversarial learning. We choose the stochastic gradient descent (SGD) optimizer [41] for the segmentation network G, with the initial learning rate set to $2.5 \times 10^{-4}$, momentum set to 0.9, and weight decay set to $10^{-4}$. For the discriminator D, we use the Adam optimizer [41] with the initial learning rate set to $10^{-4}$. During training, the learning rates of the optimizers for G and D are decreased with polynomial decay with the power set to 0.9.
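The polynomial learning-rate decay mentioned above can be sketched as follows. The exact formula is an assumption: it is the common "poly" schedule in which the base rate is multiplied by (1 - iteration / max_iterations) raised to the power. The function name is hypothetical, and the 100000-iteration training length is taken from the experiment settings reported in this paper.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

MAX_ITERATIONS = 100_000  # training length reported in the experiments

# Learning rates for G (SGD, 2.5e-4) and D (Adam, 1e-4) at a few iterations.
for it in (0, 50_000, 99_999):
    lr_g = poly_lr(2.5e-4, it, MAX_ITERATIONS)
    lr_d = poly_lr(1e-4, it, MAX_ITERATIONS)
    print(it, lr_g, lr_d)
```

Under this schedule both learning rates start at their initial values and fall smoothly toward zero at the final iteration.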

C. Datasets and Environments
Datasets. In the experiments, we use the Cityscapes [16] dataset as the source domain and the cross-city dataset [28] as the target domain to implement domain adaptation for the semantic segmentation system. The Cityscapes dataset contains 5000 high-quality pixel-wise annotated images from 50 cities around Europe. The dataset focuses on urban street scenes and is labeled with 30 classes. Cityscapes consists of training, validation, and testing parts; we use only the training set, which contains 2975 images, to train the segmentation network. The NTHU cross-city dataset [28], collected from four cities (Rome, Rio, Tokyo, and Taipei), shows appearances different from those of the Cityscapes dataset. For each city, 3200 unannotated images are used for adaptation to the domain shift and 100 annotated images are used to validate the adaptation effect of the system. In this paper, to prevent the domain adaptation system from over-fitting, we choose Rio, Tokyo, and Taipei, which are not European cities, for the adaptation experiments. The image size for the experiments is set to a height of 256 and a width of 512.
Experiments environment. Our proposed adversarial self-learning method is implemented with the PyTorch framework. We train the segmentation network G and the discriminator D on an NVIDIA GTX 1080 Ti GPU for 100000 iterations, which takes about 12 hours. We save the weights every 3000 iterations and validate the system on the testing set of the cross-city dataset.

D. Overview Results
In this paper, since the cross-city dataset is labeled with 13 classes, we use the mean intersection over union (mIoU) [42], i.e., the mean IoU over the 13 classes, as the metric for the semantic segmentation system. Table I presents the segmentation performance for the three cities (Rio, Tokyo, and Taipei) when transferring from the Cityscapes dataset. In Table I, SW, BLDG, TL, TS, VEG, and Motor. stand for Sidewalk, Building, Traffic Light, Traffic Sign, Vegetation, and Motorbike; AL and SL stand for adaptation learning and self-learning. To show its advantages in dealing with the domain shift problem, our proposed adversarial self-learning method (AL+SL) is compared with the feature adaptation method (AL(Feature)) of [28] and the output space adaptation method (AL(Outputs)) proposed in [24]. Because, as noted in previous research, a deeper network achieves better feature representation and segmentation results, we use ResNet101 as the backbone for all experiments. Table I shows that adapting either the feature maps or the output space is effective: compared with no adaptation, domain adaptation reduces the domain shift in the segmentation system. Compared with the feature adaptation method, which reduces the divergence between the feature maps of the source and target domains, the output space adaptation method, which directly reduces the divergence of the pixel-wise classification results, achieves better mIoU. The experimental results also show that the proposed self-learning method (SL), which uses the cross-entropy loss with pseudo labels, reduces the domain shift for the segmentation system.
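For reference, the mIoU metric used in this section can be computed from a confusion matrix as in the NumPy sketch below. The function names are hypothetical, and two simplifications are assumed: there is no "ignore" label, and classes absent from both the prediction and the ground truth are skipped when averaging. Whether IoU is accumulated over the whole test set or per image is not stated in the text; accumulating one confusion matrix over all test images is the usual choice.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Count (target, prediction) pairs; rows are ground truth, columns are predictions."""
    index = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU = TP / (TP + FP + FN), skipping classes absent from both maps."""
    cm = confusion_matrix(pred, target, num_classes)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return float((tp[valid] / union[valid]).mean())

# Toy 2x3 label maps with 3 classes.
target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])

# Per-class IoU here is 1/3, 2/3 and 1/2, so the mean is 0.5.
miou = mean_iou(pred, target, num_classes=3)
```

For the cross-city experiments, `num_classes` would be 13, matching the number of labeled classes in the target dataset.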
For real-world cross-city adaptation, Table I shows that the self-learning method performs better than the domain adaptation methods. Because the pseudo labels used in self-learning are obtained from the output probability maps, we propose the adversarial self-learning method, which uses output space adaptation to reduce the divergence between the probability maps of the source and target domains before self-learning on the target outputs. Table I shows that the adversarial self-learning method achieves state-of-the-art results compared with the baseline (no adaptation) and the domain adaptation methods proposed in recent years. In Fig. 2, we select some semantic segmentation outputs from the three cities to show the effectiveness of the adversarial self-learning method. In Fig. 2, GT stands for the ground truth (the human-annotated data), and the results of our proposed method are compared with those of the baseline and the output space adaptation method. To show the effectiveness of the adversarial self-learning method, we focus on the regions marked with red bounding boxes. In the Tokyo results, the sky region is not classified well by the baseline alone, whereas the domain adaptation methods resolve this domain shift problem. In the Taipei results, the middle region of the outputs shows that, combined with the self-learning method, which enhances the probability maps, the persons present in the scene are detected, unlike with the output adaptation method alone. In the middle-left regions of the Rio results, the adversarial self-learning method reduces the noisy regions in the predicted trees compared with the baseline and the output adaptation method.
Across all the experimental results, the adversarial self-learning method achieves state-of-the-art results.
Fig. 2. Example results generated by the Cityscapes to cross-city adaptation system for the three cities. For each city, the original image, the ground truth, and the predicted label maps generated by the compared adaptation methods and our proposed method are shown.

V. CONCLUSION
In this paper, to deal with the domain shift between different cities in semantic segmentation systems, we proposed an unsupervised learning method that requires annotations only for the source images. The proposed method uses self-learning with pseudo labels to enhance the confidence of the output probability maps, combined with output domain adaptation to enhance the confidence of the pseudo labels. In the Cityscapes to cross-city experiments, our method achieves state-of-the-art results for the domain shift problem. We hope that the proposed method can also perform well on synthetic-to-real segmentation tasks.