Recent research on video violence detection has made great progress with the development of multi-modal fusion techniques. However, existing approaches still face major challenges in real-time violence detection because the fused features are fixed and cannot be further fine-tuned. We address this challenge by exploring a multi-modal fusion method that analyzes violent content across multiple modalities, i.e., audio, optical flow, and RGB images. We propose a unified network called VioNets, which contains both a cross-attention graph convolutional network (GCN) module and a bidirectional gated recurrent unit (Bi-GRU) module for fusing the different modalities. First, the cross-attention GCN module extracts cross-modal spatial–temporal features. The Bi-GRU module is then applied to capture both past and future context at each time step of the single-modal features. As a result, the model retains important single-modal information in the extracted features while using the cross-modal features to improve detection accuracy. Experiments conducted on the XD-Violence dataset show that the proposed method achieves an average precision of 80.59% and an inference time of 0.16 s with 1.82 M parameters.
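To make the fusion design concrete, below is a minimal PyTorch sketch of the idea the abstract describes: cross-attention builds a cross-modal affinity matrix that a GCN-style layer propagates over, while a Bi-GRU preserves per-modality past and future context, and the two streams are concatenated for prediction. All module names, dimensions, the choice of RGB as the anchor modality, and the graph construction are assumptions for illustration; the paper's actual VioNets architecture may differ.

```python
# Hypothetical sketch of cross-attention GCN + Bi-GRU fusion; not the
# authors' released code. Feature dimensions and modules are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionGCN(nn.Module):
    """Cross-modal attention builds an affinity (adjacency) matrix between
    two modalities; one GCN-style layer then propagates features over it."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.gcn = nn.Linear(dim, dim)  # shared weight for graph propagation

    def forward(self, x_a, x_b):
        # x_a, x_b: (batch, time, dim) features from two modalities.
        attn = torch.softmax(
            self.query(x_a) @ self.key(x_b).transpose(1, 2) / x_a.size(-1) ** 0.5,
            dim=-1,
        )  # (batch, T_a, T_b) cross-modal adjacency
        # Propagate modality-b features onto modality-a nodes, with residual.
        return F.relu(self.gcn(attn @ x_b)) + x_a


class VioNetsSketch(nn.Module):
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.cross_rgb_flow = CrossAttentionGCN(dim)
        self.cross_rgb_audio = CrossAttentionGCN(dim)
        # Bi-GRU keeps per-modality past and future context at each step.
        self.bigru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(dim + 2 * hidden, 1)

    def forward(self, rgb, flow, audio):
        # Each input: (batch, time, dim) pre-extracted features.
        cross = self.cross_rgb_flow(rgb, flow) + self.cross_rgb_audio(rgb, audio)
        single, _ = self.bigru(rgb)  # (batch, time, 2 * hidden)
        fused = torch.cat([cross, single], dim=-1)
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # per-step scores


# Usage example: violence scores for a clip with 32 feature steps.
model = VioNetsSketch()
rgb, flow, audio = (torch.randn(2, 32, 128) for _ in range(3))
print(model(rgb, flow, audio).shape)  # torch.Size([2, 32])
```

Concatenating the cross-modal stream with the Bi-GRU output, rather than replacing it, mirrors the abstract's point that important single-modal information is retained alongside the cross-modal features.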
Keywords: Video, RGB color model, Video surveillance, Feature extraction, Optical flow, Visualization, Feature fusion