Retail product detection in fisheye camera scenes suffers from heavy object occlusion and deformation, as well as difficulty in distinguishing products with small fine-grained differences, so accurately classifying and localizing products in these images is challenging for computer vision. We propose an efficient product detection network, EPformer, which fuses a vision transformer with a convolutional neural network to reliably detect retail products in fisheye images. To handle dense product occlusion, we employ a shifted-window strategy that lets feature information interact across windows, enabling more precise detection. To address the severe product deformation caused by fisheye cameras, we develop a deformation image processing module that requires no explicit correction and embed it into the path aggregation network, allowing the model to efficiently capture geometric changes in products and fuse features. To distinguish fine-grained products, we design an effective coordinate squeeze-excitation (ECSE) attention module that captures fine-grained texture and boundary differences between individual products in both spatial and channel relationships; training the ECSE module in tandem with the decoupled head resolves the fine-grained discrimination problem. Experimental results demonstrate that EPformer is a strong product detection model, achieving a mean average precision 4.9% higher than the state-of-the-art method (YOLOX) on a fisheye product image dataset. In addition, EPformer effectively detects products in fisheye images on a Jetson Xavier NX embedded device, meeting the requirements of realistic application scenarios.
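The abstract does not give the internals of the ECSE module, but the general idea of combining a channel-wise squeeze-excitation gate with per-row and per-column (coordinate) gates can be sketched as follows. This is a minimal toy illustration of the information flow only: the gates here come directly from pooled activations, with no learned weights, and all function names are hypothetical, not the authors' implementation.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ecse_like_attention(fmap):
    """Toy sketch of coordinate squeeze-excitation-style attention.

    fmap is a C x H x W nested list of floats. Each channel is rescaled by
    the product of a channel gate (global average pool, squeeze-excitation
    style) and row/column coordinate gates (averages along width and
    height). A real module would insert learned bottleneck layers before
    the sigmoids; this version omits them purely for illustration.
    """
    out = []
    for ch in fmap:
        h, w = len(ch), len(ch[0])
        # Channel "squeeze": global average pool -> scalar gate.
        g_c = _sigmoid(sum(sum(row) for row in ch) / (h * w))
        # Coordinate pools: average along width (per row) and height (per column).
        g_h = [_sigmoid(sum(row) / w) for row in ch]
        g_w = [_sigmoid(sum(ch[i][j] for i in range(h)) / h) for j in range(w)]
        # Rescale each position by channel, row, and column gates.
        out.append([[ch[i][j] * g_c * g_h[i] * g_w[j] for j in range(w)]
                    for i in range(h)])
    return out
```

Because the column gates depend on where activations sit in the grid, two channels with identical global statistics but different spatial layouts are reweighted differently, which is the property that helps separate fine-grained texture and boundary cues.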
Keywords: deformation, windows, object detection, education and training, image processing, feature extraction, cameras