Abnormal event detection based on appearance repair and motion consistency
Lunzheng Tan, Cheng He
28 December 2022
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125063U (2022) https://doi.org/10.1117/12.2662614
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
Abnormal event detection in video surveillance is an active research field, but the imbalance between positive and negative samples in surveillance video makes the task challenging. In this paper, we propose a new abnormal event detection method based on appearance repair and motion consistency. Specifically, the input image is partially masked and fed into our proposed appearance repair autoencoder for reconstruction, and the motion consistency of images is enforced by our proposed optical flow network. Experimental results on the UCSD and CUHK Avenue datasets demonstrate the superior detection performance of our method.

1. INTRODUCTION

Research on abnormal event detection has achieved notable results, and such methods have been applied in public settings. However, many problems remain unsolved in complex scenarios, including pedestrian occlusion, imbalanced training sample categories, and the difficulty of defining anomalous events. Automatically detecting the occurrence of anomalous behavior is therefore an important problem.

Previous researchers have done a great deal of work on abnormal event detection. According to how features are extracted, abnormal event detection methods can be divided into two categories: traditional hand-crafted feature methods [1-5] and deep learning methods [6-11].

On the basis of previous studies [7, 11], we propose a video anomaly detection (VAD) method based on appearance repair and motion consistency. Specifically, we consider both the spatial (appearance) and temporal (motion) information of video frames. Instead of reconstructing the whole image as in previous work, we reconstruct the image locally by masking small regions of it. The network is trained only on normal events, and repair reconstruction relies on the neighborhood of each masked region. In the inference phase, reconstructing a removed anomalous region is therefore difficult, while reconstructing normal regions is easy. As a result, the appearance error gap between normal and abnormal events is enlarged, which improves detection performance. In addition, since motion consistency is also crucial for abnormal event detection, we propose an optical flow network to better reconstruct the temporal information between frames.

2. RELATED WORK

In general, current anomaly detection methods fall into two broad categories: methods based on traditional hand-crafted features and methods based on deep learning. We introduce them in turn.

2.1 Traditional hand-crafted features

Generally speaking, since anomalous events mainly involve appearance and motion, such methods extract features that represent the appearance and motion in the image. Mehran et al. [1] represented objects in video frames with particle flow and introduced social forces to model them. Martinel et al. [4] clustered the target motion trajectories in the video dataset and constructed a clustered trajectory tree that can estimate motion paths. Hasan et al. [5] used trajectory-like HOG and HOF features, concatenated features of multiple frames along the time axis, and detected abnormal events using the reconstruction error between the autoencoder output and the ground truth.

2.2 Deep learning methods

Sabokrou et al. [6] used the fully convolutional layers of AlexNet to extract deep features of the input video and sent the features to cascaded Gaussian classifiers for anomaly detection. Li et al. [7] introduced an optical flow network to better predict temporal information and performed anomaly detection through motion consistency. Alahi et al. [8] proposed an unsupervised learning method that builds an LSTM network for each pedestrian within a given time window and predicts the pedestrian's current position from his or her historical positions. Nguyen et al. [10] improved the classic U-Net by combining a reconstruction network with a generative adversarial network, using adversarial training to generate more realistic predicted frames. Zavrtanik et al. [11] proposed an image inpainting method based on a U-Net variant, which reconstructs image appearance well but lacks motion information.

3. PROPOSED METHOD

Abnormal event detection identifies behavior that does not conform to the expected behavior in a video scene. We therefore propose to use image repair reconstruction for abnormal event detection: masked normal video frames carry enough information to be reconstructed successfully, whereas the network cannot reconstruct frames containing anomalous events well. However, both temporal and spatial information are important features. Image repair reconstruction alone only imposes appearance constraints, which capture spatial information, so we also add motion constraints to the objective function. Specifically, an optical flow network is introduced to preserve the temporal information of normal events, thereby improving detection performance. An overview of our model is shown in Figure 1. The model consists of two parts. The first part treats abnormal event detection as a repair reconstruction problem and uses the proposed appearance repair based autoencoder for multi-scale image repair, learning the appearance (spatial) structure of normal events. The second part learns the motion (temporal) structure of normal events through our proposed motion-consistent optical flow network.

Figure 1. An overview of our network framework.

3.1 Appearance repair based autoencoder

We randomly sample small regions in each frame of the input image and delete them; that is, the randomly selected regions are set to zero in the input image and then repaired by the trained network. Specifically, each input frame is divided into a set of square grids of size k × k, and several grids are randomly selected and removed.
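As an illustration of this masking step, the following minimal PyTorch sketch zeroes out randomly chosen k × k grid cells of a frame. The paper gives no code, so the function name mask_random_grids and the mask_ratio parameter are our assumptions:

```python
import torch

def mask_random_grids(frame: torch.Tensor, grid_size: int = 16,
                      mask_ratio: float = 0.25) -> torch.Tensor:
    """Zero out randomly chosen grid_size x grid_size squares of a frame.

    frame: tensor of shape (C, H, W); H and W are assumed divisible by
    grid_size for simplicity. mask_ratio is a hypothetical knob for how
    many grid cells to remove; the paper does not specify it.
    """
    c, h, w = frame.shape
    rows, cols = h // grid_size, w // grid_size
    n_cells = rows * cols
    n_masked = max(1, int(n_cells * mask_ratio))

    masked = frame.clone()
    # Randomly pick grid cells and set them to zero.
    for idx in torch.randperm(n_cells)[:n_masked].tolist():
        r = (idx // cols) * grid_size
        col = (idx % cols) * grid_size
        masked[:, r:r + grid_size, col:col + grid_size] = 0.0
    return masked
```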

The accuracy of our image reconstruction depends on the size of the regions masked in the testing phase. Since abnormal event detection relies on the similarity between the reconstructed image and the input image, detection performance depends on the ratio between the grid size k and the size of the anomalous region. If k is much larger than the anomalous region, the frame cannot be reconstructed accurately; if k is too small, the reconstruction network can reconstruct anomalies well from neighboring regions. Because anomalies come in different sizes, detection must consider multiple scales: a more reliable reconstruction error map is obtained by combining multiple reconstructions of a single image generated with multiple values of k.
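As a sketch of this multi-scale aggregation, reusing the hypothetical mask_random_grids helper above and assuming the trained autoencoder is exposed as a callable model, the error maps from several values of k could be averaged as follows:

```python
import torch

def multiscale_error_map(frame, model, grid_sizes=(8, 16, 32)):
    """Average per-pixel squared reconstruction errors over several scales k.

    frame: (C, H, W) tensor; model: callable mapping a (1, C, H, W) masked
    input to a (1, C, H, W) reconstruction (an assumed interface, not the
    paper's exact one).
    """
    maps = []
    for k in grid_sizes:
        masked = mask_random_grids(frame, grid_size=k)
        recon = model(masked.unsqueeze(0)).squeeze(0)
        maps.append((recon - frame).pow(2).mean(dim=0))  # (H, W) error map
    return torch.stack(maps).mean(dim=0)
```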

Our proposed appearance repair based autoencoder is a variant of U-Net; its architecture is shown in Figure 2. The encoder and decoder have symmetric structures, each consisting of a series of blocks. Each block contains a convolution layer, a BatchNorm layer, and a ReLU activation, and skip connections pass features between corresponding layers so that details can be reconstructed accurately.

Figure 2. Appearance repair based autoencoder.
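The exact layer counts and channel widths are given only in Figure 2; as a structural illustration, a minimal PyTorch sketch of one Conv-BatchNorm-ReLU block and a two-level encoder-decoder with a single skip connection (channel sizes are our assumptions) might look like this:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One encoder/decoder block: Convolution -> BatchNorm -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class TinyRepairAE(nn.Module):
    """Illustrative two-level U-Net-style autoencoder with one skip connection."""
    def __init__(self):
        super().__init__()
        self.enc1 = ConvBlock(3, 32)
        self.down = nn.MaxPool2d(2)
        self.enc2 = ConvBlock(32, 64)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = ConvBlock(64 + 32, 32)   # concatenated skip features
        self.out = nn.Conv2d(32, 3, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        # Skip connection: decoder sees both upsampled deep features and e1.
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.out(d1)
```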

Our appearance repair based autoencoder learns the common appearance (spatial) patterns of normal events. The network takes the ℓ2 distance between the ground truth I and the output Î of the reconstruction network as an intensity loss:

$$L_{int} = \left\| \hat{I} - I \right\|_2^2$$

A disadvantage of using only the intensity loss above is blurring in the output, so we add a gradient constraint. The gradient loss is defined as:

$$L_{grad} = \sum_{i \in \{x, y\}} \left\| \, \lvert g_i(\hat{I}) \rvert - \lvert g_i(I) \rvert \, \right\|_1$$

where g_i denotes the image gradient along the i-axis.
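Assuming the standard forms above, a minimal PyTorch sketch of the two appearance losses (using a mean rather than a sum, a common normalization choice) is:

```python
import torch

def intensity_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Squared l2 (intensity) loss between reconstruction and ground truth."""
    return torch.mean((recon - target) ** 2)

def gradient_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between absolute image gradients along the x- and y-axes."""
    def abs_grads(img):
        gx = (img[..., :, 1:] - img[..., :, :-1]).abs()  # horizontal gradient
        gy = (img[..., 1:, :] - img[..., :-1, :]).abs()  # vertical gradient
        return gx, gy

    rx, ry = abs_grads(recon)
    tx, ty = abs_grads(target)
    return torch.mean(torch.abs(rx - tx)) + torch.mean(torch.abs(ry - ty))
```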

3.2 Optical flow network based on motion consistency

We construct an optical flow network; the output frame of the appearance autoencoder and the actual frame are each fed into it to compute the motion loss. Our optical flow network is a variant of FlowNet, a stack of convolutional layers. In our framework, the deep features of the last four convolutional layers are extracted to compute the motion loss.

The motion loss of the optical flow network is:

$$L_{flow} = \sum_{j=1}^{4} \frac{1}{C_j H_j W_j} \left\| \varphi_j(\hat{I}) - \varphi_j(I) \right\|_1$$

where j indexes the extracted convolutional layers; C_j, H_j and W_j are the channel number, height, and width of the j-th layer's output; and φ_j(·) is the feature map produced by the j-th layer.
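Given feature maps φ_j extracted from the last four convolutional layers of the flow network (how those features are hooked out of the network is an implementation detail we leave abstract), the motion loss could be sketched as:

```python
import torch

def motion_loss(feats_recon, feats_real):
    """Sum of normalized L1 distances between flow-network features.

    feats_recon / feats_real: lists of four tensors of shape (C_j, H_j, W_j),
    taken from the last four convolutional layers when the reconstructed and
    the real frame are passed through the optical flow network.
    """
    loss = torch.zeros(())
    for fr, fg in zip(feats_recon, feats_real):
        c, h, w = fr.shape
        loss = loss + torch.sum(torch.abs(fr - fg)) / (c * h * w)
    return loss
```

In training, the three losses would presumably be combined into one objective, e.g. L = L_int + λ_grad · L_grad + λ_flow · L_flow, with weights the paper does not specify.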

4. EXPERIMENTS

In this section, we first conduct ablation experiments to verify the effect of the different parts of our proposed network, and then compare our method against other existing abnormal event detection methods. We conduct experiments on standard benchmark datasets for abnormal event detection: UCSD (Ped1 and Ped2) [12] and CUHK Avenue [13].

4.1 Abnormal event detection

Examples of image repair and reconstruction produced by our method on the three benchmark datasets are shown in Figure 3. Our method restores the input images well.

Figure 3. Experimental examples on three benchmark datasets.

Table 1 shows that our method achieves competitive performance compared with other popular methods. On the CUHK Avenue dataset, our framework achieves the best frame-level AUC of 85.5%. Our method is one of only two methods that exceed 90% on UCSD Ped1, and it achieves the highest frame-level AUC of 95.9% on UCSD Ped2. In summary, our method is competitive with state-of-the-art methods.

Table 1. AUC of different methods on the Avenue, Ped1, Ped2 datasets.

Method                CUHK Avenue    UCSD Ped1    UCSD Ped2
MPCCA+SFA [12]        N/A            66.8%        61.3%
AbnormalGAN [14]      N/A            97.4%        93.5%
Unmasking [15]        80.6%          68.4%        82.2%
PredictionNet [16]    84.9%          83.1%        95.4%
Ours                  85.5%          90.4%        95.9%

4.2 Ablation studies

In this section, we explore the effect of the different objective constraints on abnormal event detection by removing them one at a time on the benchmark datasets.

Qualitative evaluation of motion constraints: Figure 4 shows the optical flow generated with and without the motion constraint. The optical flow produced by the detection network trained with motion constraints is closer to reality, which shows that considering temporal information helps the network generate motion-consistent output and enhances its ability to detect abnormal events.

Figure 4. Optical flow differences with and without motion constraints, input images and reconstructed images.

We also remove different constraints for quantitative analysis, verifying the impact of each objective loss on our detection network; the results are presented in Table 2. All three constraints improve the detection ability of our framework: the more constraints, the higher the frame-level AUC and the better the detection performance.

Table 2. Experimental results of ablating the different losses (frame-level AUC on the CUHK Avenue dataset).

Losses    L_int    L_int + L_grad    L_int + L_flow    L_int + L_grad + L_flow
AUC       71.2%    79.6%             81.3%             85.5%

5. CONCLUSION

In this paper, we propose a novel abnormal event detection framework that, like many previous works, focuses on the temporal and spatial information of the input data. In the training phase, parts of each image are masked, and the proposed image repair network repairs and reconstructs the masked image. Meanwhile, the proposed optical flow network learns the temporal information of images, enabling the framework to learn motion consistency. In the testing phase, the framework is used for abnormal event detection. Experiments on three benchmark datasets show that our method has superior detection ability compared with other popular methods.

ACKNOWLEDGEMENTS

This work was supported by the Characteristic Innovation Project of Colleges and Universities in Guangdong Province, "Research on pedestrian motion recognition method in traffic scene based on deep learning and feature fusion" (2021KTSCX315), and by the Zhongshan Polytechnic doctoral research start-up project "Research on key technologies of video surveillance based on vehicle terminal" (KYG2103).

REFERENCES

[1] Mehran, R., Oyama, A. and Shah, M., "Abnormal crowd behavior detection using social force model," CVPR, 935-942 (2009).
[2] Yu, B., Liu, Y. and Sun, Q., "A content-adaptively sparse reconstruction method for abnormal events detection with low rank property," IEEE Trans. Syst. Man Cybern. Syst., 47(4), 704-716 (2017). https://doi.org/10.1109/TSMC.2016.2638048
[3] Sabokrou, M., Fathy, M., Hoseini, M. and Klette, R., "Real-time anomaly detection and localization in crowded scenes," CVPR, 56-62 (2015).
[4] Martinel, N., Micheloni, C., Piciarelli, C. and Foresti, G. L., "Camera selection for adaptive human computer interface," IEEE Trans. Syst. Man Cybern. Syst., 44(5), 653-664 (2014). https://doi.org/10.1109/TSMC.2013.2279661
[5] Hasan, M., Choi, J. and Neumann, J., "Learning temporal regularity in video sequences," CVPR, 733-742 (2016).
[6] Sabokrou, M. and Fayyaz, M., "Deep-anomaly: Fully convolutional neural network for fast anomaly detection in crowded scenes," Comput. Vis. Image Underst., 172, 88-97 (2018). https://doi.org/10.1016/j.cviu.2018.02.006
[7] Li, J., Huang, Q., Du, Y., Zhen, X., Chen, S. and Shao, L., "Variational abnormal behavior detection with motion consistency," IEEE Trans. Image Process., 31, 275-286 (2022). https://doi.org/10.1109/TIP.2021.3130545
[8] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L. and Savarese, S., "Social LSTM: Human trajectory prediction in crowded spaces," CVPR, 961-971 (2016).
[9] Fernando, T., Denman, S. and Sridharan, S., "Soft + hardwired attention: An LSTM framework for human trajectory prediction and abnormal event detection," Neural Networks, 108, 466-478 (2018). https://doi.org/10.1016/j.neunet.2018.09.002
[10] Nguyen, T. N. and Meunier, J., "Anomaly detection in video sequence with appearance-motion correspondence," ICCV, 1273-1283 (2019).
[11] Zavrtanik, V., Kristan, M. and Skocaj, D., "Reconstruction by inpainting for visual anomaly detection," Pattern Recogn., 112 (2021). https://doi.org/10.1016/j.patcog.2020.107706
[12] Mahadevan, V., Li, W., Bhalodia, V. and Vasconcelos, N., "Anomaly detection in crowded scenes," ICIP, 1220-1223 (2010).
[13] Lu, C., Shi, J. and Jia, J., "Abnormal event detection at 150 FPS in MATLAB," ICCV, 2720-2727 (2013).
[14] Ravanbakhsh, M., Nabi, M., Sangineto, E., Marcenaro, L., Regazzoni, C. and Sebe, N., "Abnormal event detection in videos using generative adversarial nets," ICIP, 1577-1581 (2017).
[15] Ionescu, R., Smeureanu, S., Alexe, S. and Popescu, M., "Unmasking the abnormal events in video," ICCV, 2914-2922 (2017).
[16] Liu, W., Luo, W., Lian, D. and Gao, S., "Future frame prediction for anomaly detection - a new baseline," CVPR, 6536-6545 (2018).