1. INTRODUCTION

Object detection technology is widely applied in fields such as aerial photography, express delivery, and urban monitoring. In remote sensing, however, several challenges make accurate detection more difficult: the small number of labeled samples, the small size of objects in remote sensing images (RSIs), which typically occupy only a few dozen pixels against a complex background [1], and the diverse scales and multiple categories of objects [2]. These challenges pose a significant obstacle to general object detectors built on ordinary convolutional networks.

Modern detectors typically use pure convolutional networks as feature extractors, such as the VGG [3] and ResNet [4] backbones used by Faster R-CNN [5] and RetinaNet [6]. The YOLO series of detectors [7], on the other hand, uses Darknet backbones designed for efficient feature extraction. However, convolutional networks are limited in capturing global contextual information because convolution is a local operation. Transformers, in contrast, capture inter-dependencies among image feature patches on a global scale through multi-head self-attention while preserving the spatial information needed for detection. In addition, detection models for aerial images need stronger domain adaptability and a dynamic receptive field to handle viewpoint changes. The study in [8] showed that, compared with CNNs, vision transformers are more robust to severe occlusion, perturbation, and domain shift. Adding transformer layers to a pure convolutional backbone therefore incorporates more contextual information and yields better feature representations.

Currently, most object detection techniques are designed for a single modality, such as RGB or infrared (IR) [9], [10]. Consequently, their ability to recognize objects on the Earth's surface remains limited by the lack of complementary information between modalities [11]. As imaging technology flourishes, RSIs collected from multiple modalities have become available and provide an opportunity to improve detection accuracy. For example, as shown in Figure 1, fusing two modalities (RGB and IR) can effectively improve detection accuracy in RSIs.

On the other hand, object sizes in RSIs vary greatly, and the representational power of single-layer feature maps in convolutional neural networks is limited, so multi-scale features must be represented and processed effectively. A classical approach combines low-level and high-level features through summation or concatenation, but naive summation or concatenation can lead to feature mismatch and performance degradation. We therefore introduce learnable weights that capture the importance of different input features while repeatedly applying top-down and bottom-up multi-scale feature fusion; a minimal sketch of this weighted fusion idea is given at the end of this section.

In summary, this article makes the following contributions: a hybrid backbone, STR-Darknet, that embeds transformer layers into CSP-Darknet to capture global context (Section 3.2); a lightweight weighted bidirectional neck, Tini-BiFPN, for multi-scale feature aggregation (Section 3.3); and a study of multimodal (RGB and IR) fusion strategies, from which pixel-level fusion is selected (Section 3.4).
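To make the weighted fusion idea concrete, the snippet below is a minimal, hypothetical PyTorch sketch of a fast normalized fusion node in the style of BiFPN [12]. It assumes the input feature maps have already been resized to a common resolution and channel width; the class name, channel count, and post-fusion convolution are illustrative choices, not the exact module used in Tini-BiFPN.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion: combine N same-shaped feature maps
    with learnable, non-negative importance weights (BiFPN-style)."""

    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        # One learnable scalar weight per input feature map.
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        # A light convolution after fusion, as in typical FPN/BiFPN nodes.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, feats):
        # feats: list of tensors, each of shape (B, C, H, W) at the same scale.
        w = torch.relu(self.weights)        # keep weights non-negative
        w = w / (w.sum() + self.eps)        # normalize so the weights sum to 1
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Usage: fuse a lateral P4 map with a top-down feature (P5 upsampled to P4's size).
if __name__ == "__main__":
    p4 = torch.randn(1, 256, 40, 40)
    p5_up = torch.randn(1, 256, 40, 40)
    node = WeightedFusion(num_inputs=2, channels=256)
    print(node([p4, p5_up]).shape)          # torch.Size([1, 256, 40, 40])
```

Because the weights are learned, the network can down-weight a scale whose features mismatch the others instead of blindly summing or concatenating them.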
2. RELATED WORKS

2.1. Transformer

Swin Transformer. The Swin Transformer is an attention-based backbone that computes self-attention within local windows and uses relative positional embeddings, shifting the windows between successive layers so that information is exchanged across window boundaries. This hierarchical design captures long-range dependencies at manageable cost and makes the architecture usable as a general-purpose feature extractor for visual data. When used as the backbone of detectors such as Faster R-CNN and Mask R-CNN, it has shown strong performance on benchmark object detection datasets such as COCO and PASCAL VOC, offering a promising alternative to pure convolutional backbones.

Cross-Transformer. The Cross-Transformer introduces a self-attention mechanism across modalities, bringing information from multiple modalities together in one model. Its core idea is to use self-attention both to capture relationships within each modality and to model interactions across modalities: each modality is first encoded separately to compute intra-modal correlations, and a cross-modal attention layer then uses the corresponding attention weights to fuse information between the modalities. Through this cross-modal self-attention, the Cross-Transformer can effectively mine the feature links between modalities and thus better understand and process multimodal data. A sketch of this cross-attention mechanism is given at the end of this section.

2.2. Multi-scale feature fusion

Multi-scale feature fusion is a computer vision technique used to improve object detection, particularly for objects of different sizes. It integrates features from different layers or levels of a convolutional neural network (CNN) to obtain a more complete and informative representation of an image. Different levels of a CNN capture different degrees of abstraction and spatial resolution: features from lower layers have higher spatial resolution, which helps in detecting small objects, whereas features from higher layers are better suited to larger and more complex objects. By fusing features at multiple scales, a model can benefit from the strengths of each layer and achieve better overall detection performance. Multi-scale fusion can be implemented in several ways, including skip connections that pass features between layers, feature pyramid networks (FPNs) that merge features across resolutions, and top-down and bottom-up pathways that aggregate information from different scales. These techniques have proven successful in tasks such as object detection, semantic segmentation, and image classification.
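The following is a minimal, hypothetical PyTorch sketch of the cross-modal attention idea described in Section 2.1 and used later in Section 3.4: queries come from one modality, while keys and values from both modalities are concatenated along the token dimension. The module name, dimensions, and weight sharing between directions are illustrative assumptions, not the exact layers of MSTR-YOLO.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Multi-head cross-attention between two token sequences (e.g. RGB and IR
    patch embeddings). Queries come from one branch; keys and values are the
    concatenation of both branches along the token dimension."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_q, x_other):
        # x_q:     (B, N, C) tokens of the querying modality
        # x_other: (B, M, C) tokens of the other modality
        kv = torch.cat([x_q, x_other], dim=1)      # [K1, K2] / [V1, V2]
        out, _ = self.attn(query=x_q, key=kv, value=kv)
        return out

# Usage: fuse RGB tokens with IR tokens in both directions.
if __name__ == "__main__":
    B, N, C = 2, 196, 256
    x_rgb = torch.randn(B, N, C)
    x_ir = torch.randn(B, N, C)
    cross = CrossModalAttention(dim=C, num_heads=4)
    rgb_fused = cross(x_rgb, x_ir)   # RGB queries attend to [RGB, IR] keys/values
    ir_fused = cross(x_ir, x_rgb)    # IR queries attend to [IR, RGB] keys/values
    print(rgb_fused.shape, ir_fused.shape)
```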
3. METHODOLOGY

3.1. Overall Architecture

The proposed network, MSTR-YOLO (shown in Figure 1), is a hybrid model that combines convolution and self-attention. First, we use STR-Darknet (Section 3.2) as the backbone; it removes the Focus module and integrates multi-head self-attention into the original CSP-Darknet to extract more discriminative features. Second, Tini-BiFPN (Section 3.3), which replaces PANet, aggregates features from different backbone levels. Lastly, we explore different fusion methods and, for its high computational efficiency, select pixel-level fusion to fuse the IR and RGB modalities (Section 3.4).

3.2. STR-Darknet

The Focus module in the YOLOv5 backbone gathers pixel values from the input image and reconstructs them into smaller complementary images: the spatial size of the reconstructed maps decreases as the number of channels increases, which reduces resolution and loses spatial information for small targets. We therefore replace the Focus module with a CBS module, since detecting small targets relies on higher resolution. To improve semantic discriminability and mitigate class confusion in large-scale, complex remote sensing scenes, collecting and correlating scene information from a large neighbourhood helps the network learn relationships between objects. Convolutional networks, however, are limited in capturing global context because of the locality of convolution, whereas transformers can globally attend to dependencies between image feature patches through multi-head self-attention while preserving sufficient spatial information for detection. To enhance the transferability of learned features and capture long-range context, we propose the STR-Darknet backbone. Its design is straightforward (Figure 3): we embed Swin Transformer (STR) layers into the top CSPDark block to apply global self-attention to the 2D feature maps. Note that when the network is relatively shallow and the feature maps are relatively large, applying transformer layers too early to enforce boundary regression can discard meaningful contextual information; in STR-Darknet, transformer layers are therefore applied only to P5, not to P3 and P4. A simplified sketch of such a block is shown below.
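The following is a minimal, hypothetical PyTorch sketch of embedding a transformer layer on a 2D feature map such as P5: the map is flattened into a token sequence, passed through multi-head self-attention and an MLP with residual connections, and reshaped back. For brevity it uses full global attention rather than the shifted-window scheme of the actual Swin Transformer, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock2D(nn.Module):
    """Apply a transformer encoder layer to a (B, C, H, W) feature map by
    treating each spatial location as a token. Simplified stand-in for an
    STR layer on P5 (global attention instead of shifted windows)."""

    def __init__(self, channels: int, num_heads: int = 4, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        t = self.norm1(tokens)
        attn_out, _ = self.attn(t, t, t)                # global self-attention
        tokens = tokens + attn_out                      # residual connection
        tokens = tokens + self.mlp(self.norm2(tokens))  # feed-forward + residual
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Usage: P5 of a 640x640 input is a small map, so attention stays cheap.
if __name__ == "__main__":
    p5 = torch.randn(1, 512, 20, 20)
    block = TransformerBlock2D(channels=512, num_heads=8)
    print(block(p5).shape)   # torch.Size([1, 512, 20, 20])
```

Keeping such a block only at P5, where the spatial size is smallest, keeps the quadratic cost of self-attention manageable, which matches the design choice described above.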
3.3. Tini-BiFPN

The size differences between objects in RSIs can be significant, and the representational ability of single-scale feature maps in convolutional neural networks is limited, so multi-scale features must be represented and processed effectively. The traditional top-down FPN [16][12] is inherently limited by its one-way information flow. To address this, PANet [13] adds an additional bottom-up path aggregation network, as shown in Figure 4(b). Cross-scale connections were studied further in [14][15][16], where a simple and efficient weighted bidirectional feature pyramid network (BiFPN), shown in Figure 4(c), introduces two optimizations for cross-scale connections. Inspired by BiFPN, we design a lightweight neck network, Tini-BiFPN, which not only simplifies BiFPN to fit the P5 structure but also introduces the Swin Transformer into it, as shown in Figure 4(d) and Figure 5.

3.4. Multimodal Fusion

Multimodal fusion is an effective approach for integrating diverse information from multiple sensors: the more information is used to distinguish objects, the better the detection performance that can be achieved. Three fusion strategies are prominent: decision-level, feature-level, and pixel-level fusion. Decision-level fusion is not considered in this paper because of its high computational cost; instead, we describe our feature-level and pixel-level fusion techniques, which integrate information at different processing depths within the network.

Figures 1 and 2 illustrate how the feature-level fusion of the various blocks works. To ensure a fair comparison, the IR image is expanded to three bands. The fusion operations Cross-1, Cross-2, and Cross-3 denote feature-level fusion performed at the low, mid, and high levels of the network, respectively. For pixel-level fusion (RGB+IR), we normalize the input RGB and IR images to the interval [0, 1] and then combine them; this requires relatively little computation compared with the other fusion methods, which fuse information at later stages, and thus speeds up inference. As Section 4.4.3 will show, pixel-level fusion combines the complementary information of the two modalities better than feature-level fusion.

For feature-level fusion, we fuse features at three scales with a cross-transformer. In multi-head cross-attention, the input patch sequence Xrgb is mapped to Q1, K1, V1 and Xir to Q2, K2, V2 in head_i (i = 1, ..., h, where h is the number of heads), following the Q-K-V attention of the transformer. As illustrated in Figure 4, the cross-attention layer aggregates the key-value (K-V) pairs of the two branches to establish attention: to pass information from the IR branch to the RGB branch, the K-V pairs of the two branches are concatenated, and, similarly, to aggregate K-V pairs from the RGB branch to the IR branch, the K-V pairs of the two branches are also concatenated. Here the operator [·, ·] denotes concatenation along the token dimension. The cross-attention feature map for the RGB branch can then be computed as

Attention_i = softmax( Q1 [K1, K2]^T / sqrt(d_k) ) [V1, V2],

with the symmetric expression using Q2 for the IR branch, where d_k is the per-head key dimension. A minimal sketch of the simpler pixel-level strategy that we finally adopt is given below.
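As a concrete illustration, the following is a minimal sketch of pixel-level fusion under the assumption that the two normalized inputs are combined by a simple weighted average; the exact combination operator is not spelled out above, so the `alpha` blend here is an illustrative choice rather than the authors' exact formulation.

```python
import torch

def pixel_level_fusion(rgb: torch.Tensor, ir: torch.Tensor, alpha: float = 0.5):
    """Fuse RGB and IR images at the pixel level, before the backbone.

    rgb: (B, 3, H, W) image tensor, arbitrary dynamic range
    ir:  (B, 1, H, W) infrared image tensor, arbitrary dynamic range
    Returns a (B, 3, H, W) tensor normalized to [0, 1].
    """
    def to_unit_range(x):
        # Normalize each image in the batch to [0, 1].
        x_min = x.amin(dim=(1, 2, 3), keepdim=True)
        x_max = x.amax(dim=(1, 2, 3), keepdim=True)
        return (x - x_min) / (x_max - x_min + 1e-8)

    rgb_n = to_unit_range(rgb)
    ir_n = to_unit_range(ir).expand_as(rgb_n)   # broadcast IR to three bands
    # Illustrative combination: convex blend of the two normalized modalities.
    return alpha * rgb_n + (1.0 - alpha) * ir_n

# Usage
if __name__ == "__main__":
    fused = pixel_level_fusion(torch.rand(1, 3, 512, 512) * 255,
                               torch.rand(1, 1, 512, 512) * 255)
    print(fused.shape, float(fused.min()), float(fused.max()))
```

Because the fusion happens before the first convolution, the rest of the network is unchanged, which is why this strategy adds essentially no parameters or FLOPs.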
3.5. Loss Function

The loss function of our network consists of two parts, the detection loss Lv and the SR construction loss Ls:

L = λ1 · Lv + λ2 · Ls,

where λ1 and λ2 are coefficients that balance the two training tasks. The L1 loss (rather than the L2 loss) is used for the SR construction loss Ls between the input image X and the SR result S:

Ls = || S − X ||_1.

The detection loss consists of three components: an objectness loss Lobj (whether an object is present), a localization loss Lloc, and a classification loss Lcls. The prediction loss can be expressed as

Lv = Σ_j ( λloc · aj · Lloc^j + λobj · bj · Lobj^j + λcls · cj · Lcls^j ),

where j indexes the output layer of the head, the weights aj, bj, and cj are the per-layer weights of the three loss terms, and λloc, λobj, and λcls regulate the emphasis placed on box coordinates and dimensions, objectness and no-objectness, and classification errors.

4. EXPERIMENTS AND ANALYSIS

4.1. Datasets

The VEDAI dataset is used for multi-class vehicle detection in aerial images. It contains 3640 vehicle instances across 9 categories, including boats, cars, campers, planes, pick-ups, tractors, trucks, vans, and an "other" category. The dataset comprises 1210 aerial images of size 1024x1024 with four uncompressed channels: three RGB color channels and one additional near-infrared channel. We evaluate our model on VEDAI and report mAP (averaged over the 10 IoU thresholds from 0.5 to 0.95) and AP50.

Table 1: Distribution of Available Class Instances in the VEDAI Dataset Across 10 Folds.

4.2. Training Environment and Details

Our proposed framework is implemented in PyTorch and runs on a workstation equipped with an NVIDIA 3090 GPU. The VEDAI dataset is used to train our MSTR-YOLO model. For training, we employ the standard stochastic gradient descent (SGD) optimizer with Nesterov accelerated gradients, a momentum of 0.937, and a weight decay of 0.0005. The batch size is set to 2 (8). The initial learning rate is 0.01. The entire training process consists of 300 epochs and takes approximately 12 hours to complete.

4.3. Model Evaluation

The accuracy assessment evaluates the agreement and discrepancies between the detection results and the reference mask. To evaluate the compared methods, we use recall, precision, and mAP (mean average precision) as accuracy metrics. Precision and recall are defined as

Precision = TP / (TP + FP),    Recall = TP / (TP + FN),

where TP is the number of correctly detected positive samples, FP is the number of false positives, and FN is the number of missed positives. The mAP is a comprehensive indicator that averages the AP values of all categories, where each AP is the area under the precision-recall curve computed by integration. This metric quantifies the overall accuracy of an object detection model and accounts for the trade-off between precision and recall. A short sketch of these computations follows.
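The sketch below illustrates these definitions, assuming TP/FP/FN counts have already been obtained by matching predictions to ground truth at a given IoU threshold; the AP helper integrates a precision-recall curve numerically, which is one common way to approximate the area under the curve.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    recalls/precisions: arrays obtained by sweeping the confidence threshold,
    with recalls sorted in ascending order.
    """
    # Pad the curve so it spans recall 0..1.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically decreasing (standard interpolation step).
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate: sum of (delta recall) * interpolated precision.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Usage: mAP is then the mean of the per-class AP values
# (averaged over IoU thresholds 0.5:0.95 for the COCO-style mAP).
if __name__ == "__main__":
    print(precision_recall(tp=80, fp=20, fn=10))
    rec = np.array([0.1, 0.4, 0.8, 0.9])
    prec = np.array([1.0, 0.9, 0.7, 0.5])
    print(average_precision(rec, prec))
```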
4.4. Result Analysis and Ablation Experiments

We validate the effectiveness of the proposed method through a series of ablation experiments on the first fold of the validation set. These experiments allow us to analyze the contribution of each component on the VEDAI (Vehicle Detection in Aerial Imagery) dataset.

4.4.1. Ablation of STR-Darknet

After integrating multi-head self-attention into the original CSP-Darknet and removing the Focus module, the mean precision and mean recall of the framework improve. In particular, the mean recall of YOLOv5s improves by 21.99% (51.1% → 73.09%). Removing the Focus module not only prevents resolution degradation but also retains the spatial information of small objects in RSIs. Moreover, the use of self-attention has a significant impact on detecting small objects, which is widely recognized as an important and challenging task in real-world detection systems.

4.4.2. Ablation of Tini-BiFPN

To further validate the effectiveness of Tini-BiFPN, we compare the modified network with the original network and find that its accuracy is higher than that of the original network. The results are shown in Table 3.

Table 2: Ablation of STR-Darknet.

Table 3: Ablation of Tini-BiFPN.
4.4.3. Ablation of Multimodal Fusion

To evaluate the devised fusion methods, we conducted experiments with the pixel-level and feature-level fusion techniques described in Section 3.4. The results are presented in Tables 5, 6, and 7. The pixel-level fusion method achieves the best performance among all the compared methods, with a parameter size of 9.5084 M and an mAP50 of 75.86%. We therefore choose pixel-level fusion as our final fusion strategy; it exhibits highly competitive performance on the VEDAI multimodal dataset, whose objects are difficult to distinguish.

Table 4: Ablation of Multimodal Fusion.
Table 5: Comparison Results of Pixel-level and Feature-level Fusion in MSTR-YOLO for the Multimodal Dataset on the First Fold of the Validation Set.
Table 6: Class-wise Average Precision (AP), Mean Average Precision (mAP50), Parameters, and GFLOPs for the Proposed MSTR-YOLO, YOLOv3, YOLOv4, and YOLOv5s-x (IR Modal Configurations on the VEDAI Dataset).
Table 7: Class-wise Average Precision (AP), Mean Average Precision (mAP50), Parameters, and GFLOPs for the Proposed MSTR-YOLO, YOLOv3, YOLOv4, and YOLOv5s-x (RGB Modal Configurations on the VEDAI Dataset).
Table 8: Class-wise Average Precision (AP), Mean Average Precision (mAP50), Parameters, and GFLOPs for the Proposed MSTR-YOLO, YOLOv3, YOLOv4, and YOLOv5s-x (Multimodal Configurations on the VEDAI Dataset).
4.4.4. Comparisons with Previous Methods

The results clearly demonstrate that MSTR-YOLO outperforms the other frameworks, achieving higher AP and mAP50 scores. Notably, in the multimodal setting, MSTR-YOLO surpasses YOLOv5x by a significant 13.19% in mAP50. The detection performance for the boat, truck, van, and other categories is notably improved in MSTR-YOLO compared with the other methods.

5. CONCLUSION AND FUTURE WORK

In summary, this paper proposes the MSTR-YOLO algorithm, which organically combines the Transformer and the CNN while achieving a balance between efficiency and performance. MSTR-YOLO uses YOLOv5n-P5 as the baseline, the STR block to strengthen the connection between the backbone network and global information, and Tini-BiFPN to strengthen feature extraction while lightening the network. We also compared pixel-level and feature-level fusion and ultimately chose pixel-level fusion for its better results. In the future, we will continue to explore effective ways of combining pixel-level and feature-level fusion to improve detection performance and to investigate further possibilities for multimodal fusion.

6. REFERENCES
[1] Zhuo Zheng, Yanfei Zhong, Junjue Wang, and Ailong Ma, "Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4096–4105 (2020).
[2] Zhipeng Deng, Hao Sun, Shilin Zhou, Juanping Zhao, Lin Lei, and Huanxin Zou, "Multi-scale object detection in remote sensing imagery with convolutional neural networks," ISPRS Journal of Photogrammetry and Remote Sensing, 145, 3–22 (2018). https://doi.org/10.1016/j.isprsjprs.2018.04.003
[3] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," (2014).
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, 28 (2015).
[6] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2980–2988 (2017).
[7] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
[8] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H. Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, "Intriguing properties of vision transformers," Advances in Neural Information Processing Systems, 34, 23296–23308 (2021).
[9] Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, and Qikai Lu, "Learning RoI transformer for oriented object detection in aerial images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2849–2858 (2019).
[10] Zikun Liu, Hongzhen Wang, Lubin Weng, and Yiping Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geoscience and Remote Sensing Letters, 13(8), 1074–1078 (2016). https://doi.org/10.1109/LGRS.2016.2565705
[11] Danfeng Hong, Lianru Gao, Naoto Yokoya, Jing Yao, Jocelyn Chanussot, Qian Du, and Bing Zhang, "More diverse means better: Multimodal deep learning meets remote-sensing imagery classification," IEEE Transactions on Geoscience and Remote Sensing, 59(5), 4340–4354 (2020).
[12] Mingxing Tan, Ruoming Pang, and Quoc V. Le, "EfficientDet: Scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10781–10790 (2020).
[13] Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, Yonghye Kwon, Kalen Michael, Jiacong Fang, Zeng Yifu, Colin Wong, Diego Montes, et al., "ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation," Zenodo (2022).
[14] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, "FCOS: Fully convolutional one-stage object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 9627–9636 (2019).
[15] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun, "YOLOX: Exceeding YOLO series in 2021," (2021).
[16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587 (2014).