A cycle-consistent reciprocal network for visual correspondence
28 December 2022
Zhiquan He, Donghong Zheng, Wenming Cao
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125064V (2022) https://doi.org/10.1117/12.2662501
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
Visual correspondence refers to building dense correspondences between two or more images of the same category. Ideally, the keypoints predicted by the model can be mapped back to the source image's keypoints through the same type of network. In practice, however, the predicted keypoints usually do not map back perfectly. To strengthen the cycle-consistency of the model, we propose a cycle-consistent reciprocal network. The network uses joint loss functions to alternately train a forward and an inverse model, so that the two models are subject to cycle constraints and each performs better with the help of the other. Experimental results show that the model improves on three popular benchmarks and sets a new state of the art on PF-WILLOW.

1. INTRODUCTION

Establishing visual correspondence is a fundamental problem that has long concerned the computer vision community. The task aims to establish pixel-level correspondences between two or more semantically similar images, which has proven useful in a variety of applications such as object detection [1], scene understanding [2], and semantic segmentation [3]. With the development of deep networks and the availability of abundant data, great breakthroughs have been made in representation learning for establishing visual correspondences [4, 5]. However, the task remains challenging because images of the same class exhibit large intra-class variations due to illumination, scale, translation, blur, occlusion, and so on [6-9].

Geometric constraints are an effective way to reduce the number of uncertain candidate regions and are adopted by many methods. Recent approaches use neighbourhood consensus [10-12] to establish semantic correspondence. The first learnable neighbourhood consensus network (NC-Net) [10] uses a 4D tensor to store pixel-level matching scores and refines it through neighbourhood consensus based on local spatial context. Li et al. [11] extend NC-Net to the adaptive neighbourhood consensus network (ANC-Net) with non-isotropic 4D convolution kernels. Lee et al. [12] introduce PatchMatch Neighbourhood Consensus (PMNC), which uses PatchMatch [13] to iteratively find the candidate regions with the highest similarity. However, these approaches mainly focus on translation and are heavily influenced by the quality of the initial correlation map computed from the feature representation.

To address the problems above, the latest methods [14-17] enhance the feature representation. Cho et al. [16] focus on the cost aggregation stage, aiming to reduce the effect of background clutter and achieve global consensus among refined correlation maps. Zhao et al. [14] propose a multi-scale matching network (MMNet) to improve the network's ability to handle scale changes. Convolutional Hough Matching Networks (CHMNet) [17] extend 4D convolution to 6D convolution, adding the dimension of scale. However, these methods ignore the cycle-consistency of the model during training: the predicted position should map back to the starting position in the source image through the same type of network.

In this work, we introduce a cycle-consistent reciprocal network to improve the performance of CHMNet, and we call the improved model CCR-CHM. Cycle consistency is a simple but useful technique in machine learning that can also be used to train visual correspondence models. Taking a pair of cat images as an example, we want to map the cat's eyes in the source image to the cat's eyes in the target image, and the predicted target position should in turn be mapped back to the cat's eyes in the source image by an inverse network. To build this mapping relationship, we use cycle constraints to train the forward and inverse networks alternately so that the two networks improve together during training. The experimental results show that the model improves on three standard benchmark datasets and sets a new state of the art on PF-WILLOW [18].

This paper is organized as follows. Section 2 reviews related work. Section 3 presents the architecture of the cycle-consistent reciprocal network. Section 4 reports experimental results in comparison with other methods and presents ablation studies. Finally, Section 5 gives a brief conclusion.

2. RELATED WORK

2.1 Semantic correspondence

Early works [18, 19] mostly used hand-crafted features such as SIFT and ORB to establish semantic correspondence; such features capture insufficient semantic patterns and struggle with background clutter, non-rigid deformation and blur. Convolutional neural networks (CNNs) are powerful at extracting deep feature representations [20, 21] and soon became popular for semantic correspondence [22-25]. These methods use a CNN pretrained on image classification as the backbone and then apply different algorithms to build a correlation map from the CNN output. Rocco et al. [26] propose a CNN regression model that estimates affine transformation parameters. NC-Net [10] uses 4D convolution for neighbourhood consensus, which can effectively filter unreliable matches. PMNC and ANC-Net build on this basis, improving computational efficiency and accuracy. Jeon et al. [27] construct a multiple affine network with a pyramid structure, realizing coarse-to-fine estimation. Subsequent methods [14, 15, 17, 24] mostly focus on enhancing the feature representation and on building the correlation map efficiently.

2.2 Convolutional Hough matching

The Hough transform is a classical algorithm for object detection that recognizes objects of a specified shape in an image by voting in parameter space, and it has proven effective in non-rigid image matching [28]. Min et al. [17] propose a trainable convolutional Hough matching layer. They also use a CNN as the feature extractor and rescale the image features, extending the 4D correlation tensor to 6D. They design a trainable convolutional Hough matching kernel combined with geometric constraints and apply it to high-dimensional 4D and 6D convolutions, which makes impressive progress in the accuracy of sparse keypoint prediction.

2.3 Cycle-consistency

Cycle-consistency is widely used in practice. A typical application is machine translation, where the model should be roughly consistent between translation and back-translation. In computer vision, cycle-consistency has been applied to action prediction [29], image-to-image translation [30] and dense image alignment [31]. CycleGAN [30] uses a cycle-consistency loss to learn a pixel-wise mapping for image-to-image translation. Inspired by this work, we propose a cycle-consistent reciprocal network and use joint loss functions to iteratively train the visual correspondence model.

3. METHOD

Let X = {x1, x2, ⋯, xN} be the keypoints in the source image I. A visual correspondence model needs to map X to the corresponding ground-truth keypoints Y = {y1, y2, ⋯, yN} in the target image I′. As shown in Figure 1, cycle-consistency for visual correspondence expresses the following relationship: if the predicted positions Ŷ in I′ are perfect, mapping Ŷ through the same type of model returns the starting points X.

Figure 1. Illustration of cycle-consistency.

To encourage cycle-consistency of the model, we propose a reciprocal network based on cycle-consistency. Figure 2 illustrates the architecture of our network. First, we train two networks: a forward network Fθ, which predicts the target positions Ŷ, and an inverse network Gφ, which predicts the source positions X̂. Fθ is trained as usual, whereas for Gφ the source and target images are swapped at the input, so that the flow estimated by Gφ maps from the target images to the source images. Note that the target positions are given to the model as a known condition during training; we offer an ablation study in Section 4.3. We hope the following mapping relationships hold if Fθ and Gφ are well trained:

Figure 2. The architecture of the cycle-consistent reciprocal network. First, the forward and inverse networks are trained independently. Then, the first prediction loss L1 and the cycle-consistency loss L2 are used to alternately train the forward and inverse networks.

[Equations (1) and (2): equation images, not reproduced in this text.]
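Read against the surrounding prose, the two mapping relationships can plausibly be written as below. The notation is an assumption reconstructed from the text (forward network Fθ, inverse network Gφ, source keypoints X, target keypoints Y); the original equations may differ in form, for example in how the image pair enters as an argument.

```latex
\begin{align}
G_{\varphi}\bigl(F_{\theta}(X)\bigr) &\approx X \tag{1} \\
F_{\theta}\bigl(G_{\varphi}(Y)\bigr) &\approx Y \tag{2}
\end{align}
```

Under this reading, relation (1) starts from the source keypoints, predicts their target positions and maps the prediction back, while relation (2) does the same starting from the target keypoints.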

The two networks are strongly correlated and can help each other improve. We use the cycle-consistency constraint (1) to verify the accuracy of the forward prediction Ŷ. If the inverse network Gφ is well trained and the first prediction Ŷ is accurate, the second prediction, obtained by mapping Ŷ back through Gφ, will be close to the starting positions X with high probability. The same reasoning applies to equation (2). Each loss term is computed as an L2 loss. As shown in Figure 3, we define L1 as the first prediction loss and L2 as the cycle-consistency loss, that is, the loss of the re-prediction made from the first prediction results.

Figure 3. Illustration of joint loss.
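The two loss terms can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `forward_net` and `inverse_net` are toy stand-ins for Fθ and Gφ (fixed shifts, not trained networks), and the loss is taken to be the mean Euclidean distance between keypoints, which is one common way to realise an L2 keypoint loss.

```python
import math

def l2_loss(pred, target):
    """Mean Euclidean distance between two equally long lists of (x, y) points."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def forward_net(points):
    # Toy stand-in for F_theta: shift right by 10 px, with a 1 px error in y.
    return [(x + 10.0, y + 1.0) for x, y in points]

def inverse_net(points):
    # Toy stand-in for G_phi: shift left by 10 px (does not undo the y error).
    return [(x - 10.0, y) for x, y in points]

X = [(5.0, 5.0), (20.0, 8.0)]    # source keypoints
Y = [(15.0, 5.0), (30.0, 8.0)]   # ground-truth target keypoints

Y_hat = forward_net(X)           # first prediction
X_cycle = inverse_net(Y_hat)     # second prediction, mapped back to the source

first_prediction_loss = l2_loss(Y_hat, Y)  # plays the role of L1 in the text
cycle_loss = l2_loss(X_cycle, X)           # plays the role of L2 in the text
```

In this toy case the 1 px error made by the forward mapping shows up in both terms, which is the intuition behind using the cycle term as a second check on the forward prediction.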

Once the forward network Fθ and the inverse network Gφ have been trained, we proceed to the next step, reciprocal training. The reciprocal network leverages both Fθ and Gφ and trains them iteratively so that the two networks benefit from each other. During training, we update the parameters of only one model and freeze the parameters of the other. In our experiments, the model being trained is switched between Fθ and Gφ every 3 epochs.
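The alternating schedule can be sketched as follows. The helper name `trainable_network`, the 0-based epoch counting and the choice to start with the forward network are assumptions for illustration; the text only states that one network is frozen while the other is updated and that the roles are exchanged every 3 epochs.

```python
def trainable_network(epoch, switch_every=3):
    """Return which network is updated at a 0-based epoch; the other stays frozen."""
    return "forward" if (epoch // switch_every) % 2 == 0 else "inverse"

# Epochs 0-2 update the forward network, 3-5 the inverse network, and so on.
schedule = [trainable_network(e) for e in range(9)]
```

In a real training loop the returned label would decide which model's parameters receive gradient updates in that epoch.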

To train Fθ and Gφ, we define two joint loss functions Jθ and Jφ as follows, where λ takes a value between 0 and 1; we set λ = 0.5.

[Equations defining Jθ and Jφ: equation images, not reproduced in this text.]
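One plausible form of the joint losses, consistent with the description of L1 as the first prediction loss, L2 as the cycle-consistency loss and λ as a weight between 0 and 1, is sketched below. The weighted-sum form and the argument notation are assumptions; the original equations are not available in this text.

```latex
\begin{align}
J_{\theta} &= L_{1}\bigl(F_{\theta}(X),\, Y\bigr) + \lambda\, L_{2}\bigl(G_{\varphi}(F_{\theta}(X)),\, X\bigr) \\
J_{\varphi} &= L_{1}\bigl(G_{\varphi}(Y),\, X\bigr) + \lambda\, L_{2}\bigl(F_{\theta}(G_{\varphi}(Y)),\, Y\bigr)
\end{align}
```

Under this reading, λ = 0.5 gives the cycle term half the weight of the direct prediction term, and only θ is updated when minimizing Jθ (only φ when minimizing Jφ).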

The inverse network Gφ can be regarded as a cycle constraint that re-checks the accuracy of the predictions generated by Fθ. The loss L2 encourages the two networks to be consistent with each other, while the loss L1 requires each of them to predict well on its own. During training of the cycle-consistent reciprocal network, the two networks are trained alternately, and each improves noticeably with the help of the other.

4. EXPERIMENT

4.1 Datasets and metrics

PF-WILLOW [18], PF-PASCAL [32] and SPair-71k [33] are three standard benchmark datasets with sparse keypoint annotations for visual correspondence. The most difficult is SPair-71k [33], which includes 70,958 image pairs from 18 categories with large intra-class variations. PF-WILLOW [18] and PF-PASCAL [32] contain 900 image pairs from 4 categories and 1,351 image pairs from 20 categories, respectively. For a fair comparison, we follow the previous evaluation protocol: the model is trained on the training split of PF-PASCAL [32] and evaluated on the test splits of PF-PASCAL [32] and PF-WILLOW [18]. For SPair-71k [33], the model is trained on its training set and evaluated on its test set.

The percentage of correct keypoints (PCK) is used as the evaluation metric. Given the predicted keypoints kpred output by the model, a keypoint is counted as correct if it satisfies ‖kpred − kgt‖₂ ≤ α · max(H, W), where α ∈ {0.05, 0.1} is a threshold and kgt is the ground-truth keypoint in the target image. H and W are the height and width of either the image (αimg) or the object bounding box (αbbox).
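The PCK criterion can be computed as in the following minimal sketch. The function name `pck` and the sample coordinates are made up for illustration; only the threshold rule comes from the definition above.

```python
import math

def pck(pred, gt, height, width, alpha):
    """Fraction of keypoints with ||k_pred - k_gt||_2 <= alpha * max(H, W)."""
    threshold = alpha * max(height, width)
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)

# Hypothetical predictions and ground truth for one image pair.
pred = [(10.0, 10.0), (50.0, 40.0), (90.0, 95.0)]
gt = [(12.0, 11.0), (50.0, 60.0), (90.0, 100.0)]

# With H = 100 and W = 80 the threshold is 0.1 * 100 = 10 px.
score = pck(pred, gt, height=100, width=80, alpha=0.1)
```

Here the first and third keypoints fall within 10 px of their ground truth and the second does not, so two of three keypoints count as correct. Passing the bounding-box size instead of the image size as `height` and `width` gives the αbbox variant.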

4.2 Experiment results

Figure 4 shows 3 image pairs evaluated on the test split of PF-PASCAL with the threshold αimg = 0.05. The top row shows the results of CHMNet, and the bottom row shows the results of the improved network. In each pair, the left image is the source and the right image is the target. Red and green lines denote wrong and correct predictions, respectively. The ground truth in the target image is marked with solid yellow circles.

Figure 4. Matching results of the original network and the improved network.

We compare our results with other methods in Table 1. Bold numbers indicate the best performance, and underlined numbers indicate the second best. Our model improves on every metric over the original CHMNet, especially on PF-WILLOW. Specifically, PCK on PF-WILLOW increases by 3.3 percentage points at αbbox = 0.05 and by 2.4 percentage points at αbbox = 0.1, which sets a new state of the art on PF-WILLOW. On PF-PASCAL, our model also performs better at both thresholds and ranks second in the list.

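Table 1. Comparison with other methods.

Methods | SPair-71k PCK @ αbbox = 0.1 | PF-PASCAL PCK @ αimg = 0.05 | PF-PASCAL PCK @ αimg = 0.1 | PF-WILLOW PCK @ αbbox = 0.05 | PF-WILLOW PCK @ αbbox = 0.1
--- | --- | --- | --- | --- | ---
NC-Net [10] | 20.1 | 54.3 | 78.9 | - | -
ANC-Net [11] | - | - | 86.1 | - | -
HPF [23] | 28.2 | 60.1 | 84.8 | - | -
SCOT [34] | 35.6 | 63.1 | 85.4 | - | -
DHPF [15] | 37.3 | 75.7 | 90.7 | 49.5 | 77.6
PMNC [12] | 50.4 | 82.4 | 90.6 | - | -
MMNet-FCN [14] | 50.4 | 81.1 | 91.6 | - | -
Cats [16] | 49.9 | 75.4 | 92.6 | 50.3 | 79.2
CHMNet [17] | 46.3 | 80.1 | 91.6 | 52.7 | 79.4
CCR-CHM (Ours) | 46.7 | 81.1 | 92.3 | 56.0 | 81.8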
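The gains over the baseline can be checked directly from the CHMNet and CCR-CHM rows of Table 1. The short script below is only an arithmetic check; the dictionary keys are labels chosen here for readability.

```python
# PCK values (%) copied from Table 1 for the two rows being compared.
chmnet = {"SPair@0.1": 46.3, "PASCAL@0.05": 80.1, "PASCAL@0.1": 91.6,
          "WILLOW@0.05": 52.7, "WILLOW@0.1": 79.4}
ccr_chm = {"SPair@0.1": 46.7, "PASCAL@0.05": 81.1, "PASCAL@0.1": 92.3,
           "WILLOW@0.05": 56.0, "WILLOW@0.1": 81.8}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {k: round(ccr_chm[k] - chmnet[k], 1) for k in chmnet}
```

The differences are absolute percentage points, not relative improvements, and the largest ones are on PF-WILLOW.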

4.3 Ablation study

Exchanging the positions of the source and target images can be regarded as a form of image augmentation. We therefore study the effect of this operation on the performance of the original network. In this ablation, we swap the input images every other epoch.
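The swap used in this ablation can be sketched as below. The function name `order_pair`, the dictionary layout of a sample and the choice to swap on odd epochs are assumptions for illustration; the text only states that the inputs are swapped every other epoch.

```python
def order_pair(source, target, epoch):
    """Swap source and target on odd (0-based) epochs; keep the order otherwise."""
    if epoch % 2 == 1:
        return target, source
    return source, target

# Hypothetical sample: each side carries an image name and its keypoints.
pair = ({"image": "src.jpg", "kps": [(5, 5)]},
        {"image": "trg.jpg", "kps": [(15, 5)]})

epoch0 = order_pair(*pair, epoch=0)  # original order
epoch1 = order_pair(*pair, epoch=1)  # swapped order
```

Because the keypoint annotations travel with their images, the swapped pair is still a valid training sample with the roles of source and target reversed.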

In addition, we remove the reciprocal network and train the forward network with cycle-consistency only. Specifically, we replace the inverse network with the forward network itself, so that the forward network is trained for cycle-consistency against itself. The results in Table 2 show that reciprocal training is effective: neither image augmentation nor self cycle-consistency alone matches the full CCR-CHM model on any metric.

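Table 2. Ablation study of CCR-CHM.

Methods | SPair-71k PCK @ αbbox = 0.1 | PF-PASCAL PCK @ αimg = 0.05 | PF-PASCAL PCK @ αimg = 0.1 | PF-WILLOW PCK @ αbbox = 0.05 | PF-WILLOW PCK @ αbbox = 0.1
--- | --- | --- | --- | --- | ---
Baseline | 46.3 | 80.1 | 91.6 | 52.7 | 79.4
+ image augmentation | 46.4 | 80.5 | 91.7 | 53.3 | 79.5
+ cycle-consistency | 46.2 | 81.0 | 91.3 | 53.1 | 77.0
CCR-CHM (Ours) | 46.7 | 81.1 | 92.3 | 56.0 | 81.8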

5. CONCLUSION

In this paper, we introduce a cycle-consistent reciprocal network for visual correspondence, which uses joint loss functions to train forward and inverse networks alternately, so that the two networks improve together. We apply our network to the training of CHMNet, and the resulting model performs better on the test splits of three standard benchmarks. The evaluation results show that our network can improve the performance of deep-learning-based visual correspondence models, and ablation studies verify its effectiveness. We believe further research on cycle-consistency can help to establish visual correspondence.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China under grants 61971290, 61771322, 61871186, and the Fundamental Research Foundation of Shenzhen under Grant JCYJ20190808160815125.

REFERENCES

[1] Milletari, F., Ahmadi, S. A., Kroll, C., et al., “Hough-CNN: Deep learning for segmentation of deep brain regions in MRI and ultrasound,” Computer Vision and Image Understanding, 164, 92–102 (2017). https://doi.org/10.1016/j.cviu.2017.04.002

[2] Lee, J., Kim, D., Ponce, J., et al., “SFNET: Learning object-aware semantic correspondence,” in Proc. CVPR, 2278–2287 (2019).

[3] Hur, J. and Roth, S., “Joint optical flow and temporally consistent semantic segmentation,” in Proc. ECCV, 163–177 (2016).

[4] Hu, J., Shen, L. and Sun, G., “Squeeze-and-excitation networks,” in Proc. CVPR, 7132–7141 (2018).

[5] Huang, G., Liu, Z., Van Der Maaten, L., et al., “Densely connected convolutional networks,” in Proc. CVPR, 4700–4708 (2017).

[6] Seo, P. H., Lee, J., Jung, D., et al., “Attentive semantic alignment with offset-aware correlation kernels,” in Proc. ECCV, 349–364 (2018).

[7] Kim, S., Min, D., Ham, B., et al., “FCSS: Fully convolutional self-similarity for dense semantic correspondence,” in Proc. CVPR, 6560–6569 (2017).

[8] Ufer, N. and Ommer, B., “Deep semantic feature matching,” in Proc. CVPR, 5929–5938 (2017).

[9] Long, J. L., Zhang, N. and Darrell, T., “Do convnets learn correspondence?,” Advances in Neural Information Processing Systems, 27 (2014).

[10] Rocco, I., Cimpoi, M., Arandjelović, R., et al., “Neighbourhood consensus networks,” Advances in Neural Information Processing Systems, 1651–1662 (2018).

[11] Li, S., Han, K., Costain, T. W., et al., “Correspondence networks with adaptive neighbourhood consensus,” in Proc. CVPR, 10196–10205 (2020).

[12] Lee, J. Y., DeGol, J., Fragoso, V., et al., “Patchmatch-based neighborhood consensus for semantic correspondence,” in Proc. CVPR, 13153–13163 (2021).

[13] Barnes, C., Shechtman, E., Goldman, D. B., et al., “The generalized patchmatch correspondence algorithm,” in Proc. ECCV, 29–43 (2010).

[14] Zhao, D., Song, Z., Ji, Z., et al., “Multi-scale matching networks for semantic correspondence,” in Proc. CVPR, 3354–3364 (2021).

[15] Min, J., Lee, J., Ponce, J., et al., “Learning to compose hypercolumns for visual correspondence,” in Proc. ECCV, 346–363 (2020).

[16] Cho, S., Hong, S., Jeon, S., et al., “Cats: Cost aggregation transformers for visual correspondence,” Advances in Neural Information Processing Systems, 34, 9011–9023 (2021).

[17] Min, J. and Cho, M., “Convolutional Hough matching networks,” in Proc. CVPR, 2940–2950 (2021).

[18] Ham, B., Cho, M., Schmid, C., et al., “Proposal flow,” in Proc. CVPR, 3475–3484 (2016).

[19] Yuen, C. J. and Torralba, A., “Sift flow: Dense correspondence across scenes and its applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 33 (5), 978–994 (2010).

[20] He, K., Zhang, X., Ren, S., et al., “Deep residual learning for image recognition,” in Proc. CVPR, 770–778 (2016).

[21] Huang, G., Liu, Z., Van Der Maaten, L., et al., “Densely connected convolutional networks,” in Proc. CVPR, 4700–4708 (2017).

[22] Han, K., Rezende, R. S., Ham, B., et al., “SCNET: Learning semantic correspondence,” in Proc. ICCV, 1831–1840 (2017).

[23] Min, J., Lee, J., Ponce, J., et al., “Hyperpixel flow: Semantic correspondence with multi-layer neural features,” in Proc. ICCV, 3395–3404 (2019).

[24] Seo, P. H., Lee, J., Jung, D., et al., “Attentive semantic alignment with offset-aware correlation kernels,” in Proc. ECCV, 349–364 (2018).

[25] Kim, S., Lin, S., Jeon, S. R., et al., “Recurrent transformer networks for semantic correspondence,” Advances in Neural Information Processing Systems, 31 (2018).

[26] Rocco, I., Arandjelovic, R. and Sivic, J., “Convolutional neural network architecture for geometric matching,” in Proc. CVPR, 6148–6157 (2017).

[27] Jeon, S., Kim, S., Min, D., et al., “PARN: Pyramidal affine regression networks for dense semantic correspondence,” in Proc. ECCV, 351–366 (2018).

[28] Cho, M., Kwak, S., Schmid, C., et al., “Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals,” in Proc. CVPR, 1201–1210 (2015).

[29] Pang, G., Wang, X., Hu, J., et al., “DBDNet: Learning bi-directional dynamics for early action prediction,” in IJCAI, 897–903 (2019).

[30] Zhu, J. Y., Park, T., Isola, P., et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2223–2232 (2017).

[31] Zhou, T., Lee, Y., Yu, S. X., et al., “Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences,” in Proc. CVPR, 1191–1200 (2015).

[32] Ham, B., Cho, M., Schmid, C., et al., “Proposal flow: Semantic correspondences from object proposals,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 40 (7), 1711–1725 (2017). https://doi.org/10.1109/TPAMI.2017.2724510

[33] Min, J., Lee, J., Ponce, J., et al., “Spair-71k: A large-scale benchmark for semantic correspondence,” arXiv preprint arXiv:1908.10543 (2019).

[34] Liu, Y., Zhu, L., Yamada, M., et al., “Semantic correspondence as an optimal transport problem,” in Proc. CVPR, 4463–4472 (2020).
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
KEYWORDS
Visualization

Convolution

Computer vision technology

Image segmentation

Machine vision
