Through hash learning, deep-hashing-based image retrieval encodes images into fixed-length hash codes for fast retrieval and matching. However, previous deep hash retrieval models based on convolutional neural networks extract only local image information through convolution and pooling operations, and therefore require deeper networks to capture long-range dependencies, which leads to high model complexity and computational cost. In this paper, we propose a vision Transformer model based on self-attention to learn long-range dependencies in images and enhance the ability to extract image features. Furthermore, we propose a fused loss function that combines a hash contrastive loss, a classification loss, and a quantization loss to fully exploit image label information, improving the quality of the hash codes by learning more latent semantic information. Experimental results on two datasets and multiple hash code lengths demonstrate that the proposed method outperforms several classical CNN-based deep hash retrieval methods as well as two Transformer-based hash retrieval methods.
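The abstract does not give the exact formulation of the fused objective. The PyTorch sketch below illustrates one plausible reading, in which a pairwise hash contrastive loss, a cross-entropy classification loss on a label head, and a quantization penalty pushing the continuous codes toward ±1 are combined with weighting coefficients. The class name `FusedHashLoss`, the weights `alpha`/`beta`/`gamma`, and the contrastive margin are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedHashLoss(nn.Module):
    """Sketch of a multi-loss fusion for deep hashing:
    contrastive + classification + quantization (weights are placeholders)."""

    def __init__(self, alpha=1.0, beta=1.0, gamma=0.1, margin=2.0):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.margin = margin
        self.ce = nn.CrossEntropyLoss()

    def forward(self, h, logits, labels):
        # h:      (B, K) real-valued hash outputs (e.g., tanh activations)
        # logits: (B, C) outputs of a classification head
        # labels: (B,)   integer class labels
        # Pairwise similarity derived from labels: 1 if same class, else 0.
        sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
        dist = torch.cdist(h, h, p=2)  # pairwise Euclidean distances between codes
        contrastive = (sim * dist.pow(2)
                       + (1 - sim) * F.relu(self.margin - dist).pow(2)).mean()
        classification = self.ce(logits, labels)    # uses the image label information
        quantization = (h.abs() - 1).pow(2).mean()  # pushes code entries toward ±1
        return (self.alpha * contrastive
                + self.beta * classification
                + self.gamma * quantization)
```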