KEYWORDS: Education and training, Data modeling, Video, Performance modeling, Feature extraction, Video acceleration, Process modeling, Detection and tracking algorithms, Convolution, Contamination
In the object tracking process, if only the first frame is used as the matching template, changes in the target's appearance often lead to poor tracking results, or even tracking failure, for classic Siamese trackers. To address this issue, UpdateNet uses the first frame as the template and regularly updates it by combining the accumulated template from previous frames with the currently predicted template. However, combining templates tends to introduce background information that can contaminate the template representation. To obtain an accurate template and sense target changes in a timely manner, this article introduces Squeeze-and-Excitation channel attention and a selective mechanism into UpdateNet. The channel attention mechanism re-weights the channel-wise concatenated template information so that important information is highlighted. The confidence score of the Siamese network's tracking prediction determines whether the corresponding frame participates in template accumulation, and a threshold is set to exclude severely contaminated predicted templates. The article also uses a more detailed parameter-tuning method that enables UpdateNet to converge faster and adapt better. We apply the improved UpdateNet to the DaSiamRPN tracker, and evaluations on the VOT2016 and VOT2018 datasets show that our methods effectively improve the performance of UpdateNet.
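The two mechanisms described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the SE projection matrices here are random stand-ins for learned fully connected layers, and the threshold of 0.8 and the linear blending rate are assumed values chosen for the example.

```python
import numpy as np

def se_channel_weights(feat, reduction=4):
    """Squeeze-and-Excitation style channel re-weighting (illustrative sketch).

    feat: (C, H, W) feature map. The two matrices w1/w2 stand in for the
    learned FC layers of a real SE block.
    """
    c = feat.shape[0]
    squeezed = feat.mean(axis=(1, 2))                # global average pool -> (C,)
    rng = np.random.default_rng(0)                   # fixed random weights for the sketch
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ squeezed, 0.0)          # ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate in (0, 1)
    return feat * weights[:, None, None]             # scale each channel

def update_template(acc_template, pred_template, confidence,
                    threshold=0.8, rate=0.1):
    """Confidence-gated template accumulation (illustrative sketch).

    A frame whose prediction confidence falls below `threshold` is excluded
    from accumulation, so severely contaminated predicted templates never
    enter the accumulated template. The blend rule is a simple linear mix.
    """
    if confidence < threshold:
        return acc_template                          # skip contaminated frame
    return (1.0 - rate) * acc_template + rate * pred_template
```

In a real tracker, `confidence` would come from the Siamese network's response map, and the blending would be performed by the learned UpdateNet rather than a fixed linear rule.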
Landmark detection is a crucial task that focuses on identifying specific landmarks representing distinct object features within an image. However, existing methods primarily adopt Convolutional Neural Networks (CNNs) as the backbone for feature extraction, and CNNs struggle to capture global contextual information. Because landmarks exhibit structural dependencies, these methods fail to exploit such relationships effectively, leading to lower detection accuracy. To overcome these limitations, we propose a novel framework, CTBN, which combines CNNs, a Transformer, and Convolutional Block Attention Modules (CBAM) to enhance feature representation for landmark detection. In our framework, the CNNs extract fine-grained local features, the Transformer captures global context, and CBAM adaptively refines features by focusing on important regions and channels. We validate our method on multiple datasets, including CelebA, AFLW, and DeepFashion. The experiments demonstrate that our method achieves superior performance and significantly improves the accuracy of landmark detection.
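The CBAM refinement step described above applies channel attention followed by spatial attention. The following is a minimal NumPy sketch under simplifying assumptions: the shared-MLP weights are random stand-ins for learned parameters, and the spatial branch gates the summed average/max maps with a sigmoid instead of the 7x7 convolution used in the original CBAM.

```python
import numpy as np

def cbam(feat, reduction=2):
    """Minimal CBAM-style feature refinement (illustrative sketch).

    feat: (C, H, W) feature map. Channel attention passes avg- and
    max-pooled channel descriptors through a shared MLP; spatial attention
    gates locations using channel-wise avg and max maps.
    """
    c, h, w = feat.shape
    rng = np.random.default_rng(0)                   # random stand-in weights
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1

    def mlp(x):
        return w2 @ np.maximum(w1 @ x, 0.0)          # shared two-layer MLP

    # channel attention: shared MLP over avg- and max-pooled descriptors
    ch = 1.0 / (1.0 + np.exp(-(mlp(feat.mean(axis=(1, 2)))
                               + mlp(feat.max(axis=(1, 2))))))
    feat = feat * ch[:, None, None]

    # spatial attention: channel-wise avg and max maps, sigmoid gate
    # (the original CBAM applies a 7x7 conv here; summing is a simplification)
    sp = 1.0 / (1.0 + np.exp(-(feat.mean(axis=0) + feat.max(axis=0))))
    return feat * sp[None, :, :]
```

In the proposed framework, a block like this would sit after the CNN/Transformer features to emphasize informative channels and spatial regions before the landmark prediction head.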
Scene recognition has a wide range of applications in autonomous driving, security monitoring, smart homes, and other areas. Although traditional methods have achieved good results in this field, deep-learning methods are now dominant. In this paper, we conduct an extensive comparison of the performance of five deep-learning models on a common dataset to reveal their strengths and weaknesses. Experimental results show that, of the five deep-learning models, ConvNeXt performs best. In addition, all five models outperform a traditional method that was once the state of the art.