Crowd counting is a popular and challenging research topic in computer vision because of the variation in human head scales and the interference of background noise. Some existing methods use multi-level feature fusion to handle scale variation, but involving shallow features in the fusion process can make background noise interference more serious. In this paper, we propose a Multilevel Information Sharing Network based on Residual Attention (RA-MISNet) to solve this problem. RA-MISNet consists of a feature extraction component, an information sharing module and a residual attention density map estimator. In addition to handling the multi-scale problem, the proposed method adopts a residual attention mechanism to refine the crowd distribution information in the shared features at all levels, which reduces the interference of complex textured backgrounds on density map regression. Furthermore, because label noise is severe in high-density crowd areas, we design a Regional Multi-level Segmentation Loss (RMS Loss) that divides a single crowd image into multi-level density regions with different label noise rates and applies supervision constraints of corresponding granularity to each density level region. Extensive experiments on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF) demonstrate the effectiveness and superiority of the proposed methods.
This paper compares four top models for crowd counting and evaluates their highlights based on their performance. In DSNet, a dense dilated convolution block was proposed, in which the dilated layers are densely connected to each other in order to preserve information from continuously varied scales. Three blocks are cascaded and linked by dense residual connections to widen the range of scales covered by the network, and a novel multi-scale density level consistency loss was introduced to improve performance. In SFANet, two main components were proposed: a VGG backbone CNN as the front-end feature extractor and a dual-path multi-scale fusion network as the back end that generates the density map. One path highlights the crowded regions in the image, while the other path is responsible for fusing multi-scale features and generating the final high-quality, high-resolution density maps. In MANet (Multi-scale Attention Network), a new soft attention mechanism was presented that learns a set of masks, and a scale-aware loss function was introduced to regularize and guide the learning of the different branches so that each specializes on a specific scale. In Bayesian Loss, a novel loss function was used to construct a density contribution model from the point annotations. We also analyzed the results of the four convolutional neural networks, extracted the common patterns in their network structures and identified promising directions for researchers in this fast-growing area.
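To make the relationship between point annotations and density maps concrete, the following is a minimal sketch of the conventional fixed-kernel construction: each annotated head contributes a normalized Gaussian, so the map sums to the crowd count. It is not the procedure of any of the surveyed models (Bayesian Loss in particular supervises from the points through a different formulation). The function name `density_map`, the integer `(row, col)` point format, the in-bounds assumption, and the `sigma` default are all illustrative assumptions.

```python
import math


def density_map(points, height, width, sigma=1.0):
    """Build a density map from head annotations (hypothetical helper).

    points: list of (row, col) integer pixel coordinates, assumed to lie
            inside the image.
    Each point spreads a total mass of exactly 1 over a truncated Gaussian
    window, so the sum over the whole map equals len(points).
    """
    dmap = [[0.0] * width for _ in range(height)]
    radius = max(1, int(3 * sigma))  # truncate the kernel at about 3 sigma
    for py, px in points:
        weights = {}
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                dist2 = (y - py) ** 2 + (x - px) ** 2
                weights[(y, x)] = math.exp(-dist2 / (2.0 * sigma ** 2))
        # Renormalize over in-bounds pixels so border heads still count as 1.
        total = sum(weights.values())
        for (y, x), w in weights.items():
            dmap[y][x] += w / total
    return dmap


if __name__ == "__main__":
    heads = [(0, 0), (5, 7), (9, 9)]
    dmap = density_map(heads, height=10, width=12, sigma=1.5)
    estimated_count = sum(sum(row) for row in dmap)
    print(round(estimated_count, 6))
```

Integrating a predicted density map in the same way is how a regression network's output is turned into a count.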
Crowd counting is an important part of crowd analysis and is of great significance to crowd control and management. Convolutional neural network (CNN) based crowd counting methods are widely used to address the insufficient counting accuracy caused by heavy occlusion, background clutter, and head scale and perspective changes in crowd scenes. The multi-column convolutional neural network (MCNN) is a CNN-based crowd counting method that adapts to head scale variation in crowd scenes by combining three single-column networks whose convolution kernels have different sizes (large, medium and small). However, because the MCNN network is relatively shallow, its receptive field is limited, which reduces its adaptability to large scale variations. In addition, due to insufficient training data, MCNN requires a cumbersome pre-training strategy in which each single-column convolutional neural network is pre-trained individually before the columns are combined. In this paper, a crowd counting method based on a multi-column dilated convolutional neural network is proposed. Dilated convolution is used to enlarge the receptive field of the network so that it adapts better to head scale variations. In each training iteration, image patches are randomly cropped from the original training images to further expand the training data, so that training can be carried out without tedious pre-training. Experimental results on the public ShanghaiTech dataset show that the proposed method counts crowds more accurately than MCNN, which indicates that it is more robust to head scale variations in crowd scenes.
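The two ideas in this abstract, a larger receptive field from dilation and data expansion by random cropping, can be illustrated with a short sketch. It uses the standard receptive-field recurrence for stacked convolutions and a plain-list random crop; the layer configurations, helper names (`receptive_field`, `random_crop`) and patch sizes are illustrative assumptions, not the architecture or settings of the paper.

```python
import random


def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    layers: list of (kernel_size, dilation, stride) tuples, input to output.
    A dilated kernel spans dilation * (kernel_size - 1) + 1 input positions.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, dilation, stride in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a random crop_h x crop_w patch from a 2-D list of pixels."""
    h, w = len(image), len(image[0])
    if crop_h > h or crop_w > w:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


if __name__ == "__main__":
    plain = [(3, 1, 1)] * 3    # three ordinary 3x3 convolutions
    dilated = [(3, 2, 1)] * 3  # same kernels and weights count, dilation 2
    print(receptive_field(plain), receptive_field(dilated))  # 7 13

    image = [[r * 10 + c for c in range(8)] for r in range(6)]
    patch = random_crop(image, 3, 4, rng=random.Random(0))
    print(len(patch), len(patch[0]))  # 3 4
```

With the same number of weights, the dilated stack covers 13 input pixels per side instead of 7, which is the effect the abstract relies on for adapting to larger heads. Drawing a fresh crop in every iteration gives the network a different patch each time without storing an enlarged dataset.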