Accurate crowd counting in congested scenes remains challenging because of the trade-off between efficiency and generalization. To address this issue, we propose a mobile-friendly solution for deploying counting networks in scenarios that demand high response speed. To bring the potential of global crowd representations to lightweight counting models, this work presents a novel mobile vision transformer architecture for crowd counting (CCMTNet), which aims to improve both the efficiency and the generality of real-time crowd counting on resource-constrained computing devices. The architecture interleaves a linear CNN backbone with self-attention blocks, giving the model both local feature extraction and global high-dimensional crowd information processing at low computational cost. In addition, several experimental networks of different scales built on the proposed architecture are comprehensively evaluated to balance the accuracy lost as computational cost is compressed. Extensive experiments on three mainstream crowd counting datasets demonstrate the effectiveness of the proposed network. In particular, CCMTNet shows that counting accuracy and efficiency can be reconciled in comparison with traditional lightweight CNN networks.
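The abstract does not give the exact layer configuration, but the core idea of interleaving convolutions with self-attention can be illustrated by a minimal sketch. The block below is an illustrative assumption (layer sizes, names, and ordering are not the authors' published design): a depthwise-separable convolution extracts local features cheaply, and a multi-head self-attention layer over flattened spatial tokens captures global crowd context.

# Minimal sketch of a hybrid CNN/self-attention block in the spirit of
# CCMTNet's interleaved design. All sizes and names are illustrative
# assumptions, not the published configuration.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        # Local feature extraction: depthwise-separable conv keeps compute low.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global crowd context: self-attention over flattened spatial tokens.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.local(x)                      # (B, C, H, W) local features
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)  # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Usage: features = HybridBlock(64)(torch.randn(1, 64, 32, 32))

Stacking such blocks keeps the parameter count close to a pure CNN while letting each stage mix in global information, which is the efficiency/accuracy balance the abstract describes.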
In recent years, great progress has been made in the study of crowd counting. Although the crowd counting networks proposed to solve different problems have achieved satisfactory results, variations in crowd density and scale within the same scene still degrade overall counting performance. To deal with this problem, we propose a Multi-Scale Attention Grading Crowd Counting Network (MSAGNet), which attends to different crowd densities in a scene through an attention mechanism and fuses multi-scale information to reduce scale differences. Specifically, the grading attention feature extraction module focuses on regions of different density and adaptively assigns corresponding weights to them: dense regions receive larger weights, so the model concentrates on those parts and trains on them more accurately and effectively. In addition, the multi-scale density feature fusion module fuses the feature maps containing density information to generate the final feature maps; these carry attention information at different scales and are mapped to the estimated density maps. This method attends to regions of different density in the same scene while simultaneously fusing multi-scale information and attention weights, which effectively addresses the difficulty of counting dense regions. Extensive experiments on existing crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF-QNRF) show that our method effectively improves counting performance.
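The abstract's two modules suggest a simple pattern: per-scale spatial attention followed by concatenation and a density head. The sketch below is a hedged illustration under that assumption; the class name, scale set, and layer widths are hypothetical, not the published MSAGNet design.

# Minimal sketch of attention-weighted multi-scale fusion in the spirit of
# MSAGNet's grading attention and density fusion modules. Names and sizes
# are illustrative assumptions, not the published design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionFusion(nn.Module):
    def __init__(self, channels: int = 64, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One spatial-attention head per scale: dense regions get larger weights.
        self.attn_heads = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid()) for _ in scales
        )
        # 1x1 conv fuses the concatenated per-scale features; the final head
        # maps the fused features to a single-channel density map.
        self.fuse = nn.Conv2d(channels * len(scales), channels, 1)
        self.density = nn.Conv2d(channels, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        weighted = []
        for s, head in zip(self.scales, self.attn_heads):
            f = F.avg_pool2d(x, s) if s > 1 else x   # coarser scale
            f = f * head(f)                          # density-graded weighting
            weighted.append(F.interpolate(f, size=(h, w), mode="bilinear",
                                          align_corners=False))
        fused = self.fuse(torch.cat(weighted, dim=1))
        return self.density(fused)                   # estimated density map

# Usage: density_map = MultiScaleAttentionFusion(64)(torch.randn(1, 64, 64, 64))

Summing the output density map over all pixels yields the crowd count estimate, which is how density-map-based counters are typically evaluated on the datasets listed above.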