7 April 2023
VioNets: efficient multi-modal fusion method based on bidirectional gate recurrent unit and cross-attention graph convolutional network for video violence detection
Wuyan Liang, Xiaolong Xu, Xiao Fu
Abstract

Recent research on video violence detection has made great progress with the development of multi-modal fusion techniques. However, existing approaches still struggle with real-time violence detection because their fused features are fixed and cannot be fine-tuned further. We address this challenge by exploring a multi-modal fusion method that analyzes violence information from multiple modalities, i.e., audio, optical flow, and RGB images. We propose a unified network called VioNets, which contains a cross-attention graph convolutional network (GCN) module and a bidirectional gate recurrent unit (Bi-GRU) module for fusing information from the different modalities. First, the cross-attention GCN module extracts cross-modal spatial–temporal features. The Bi-GRU module is then applied to the single-modal features to capture both past and future context at each time step. As a result, the model retains important single-modal information in the extracted features while using the cross-modal features to improve detection accuracy. Experiments on the XD-Violence dataset show that the proposed method achieves an average precision of 80.59% with an inference time of 0.16 s and 1.82 M parameters.
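To make the phrase "past and future context at each time step" concrete, the following is a minimal, generic sketch of a bidirectional GRU pass in plain Python. It is not the authors' implementation: the helper names (`make_gru`, `gru_step`, `run`, `bi_gru`), the random weight initialization, the omission of bias terms, and the toy dimensions are all assumptions made for illustration. One GRU reads the sequence forward, a second reads it backward, and the two hidden states are concatenated per time step.

```python
import math
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def make_gru(in_dim, hid_dim, seed):
    """Hypothetical helper: random input/recurrent weights per gate (no biases)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {g: (mat(hid_dim, in_dim), mat(hid_dim, hid_dim)) for g in ("z", "r", "n")}

def gru_step(p, x, h):
    """One GRU update: update gate z, reset gate r, candidate state n."""
    def pre(gate, hidden):
        w, u = p[gate]
        return [a + b for a, b in zip(matvec(w, x), matvec(u, hidden))]
    z = [sigmoid(a) for a in pre("z", h)]
    r = [sigmoid(a) for a in pre("r", h)]
    n = [math.tanh(a) for a in pre("n", [ri * hi for ri, hi in zip(r, h)])]
    # Blend the previous state and the candidate state using the update gate.
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def run(p, seq, hid_dim):
    """Run a GRU over seq from a zero state; return the hidden state at every step."""
    h = [0.0] * hid_dim
    out = []
    for x in seq:
        h = gru_step(p, x, h)
        out.append(h)
    return out

def bi_gru(seq, fwd, bwd, hid_dim):
    """Concatenate forward (past context) and backward (future context) states."""
    f = run(fwd, seq, hid_dim)
    b = run(bwd, seq[::-1], hid_dim)[::-1]  # re-align backward states with time
    return [fi + bi for fi, bi in zip(f, b)]

# Toy usage: 3 time steps of 3-dimensional single-modal features, hidden size 4.
fwd = make_gru(3, 4, seed=0)
bwd = make_gru(3, 4, seed=1)
seq = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
features = bi_gru(seq, fwd, bwd, 4)  # 3 vectors, each of length 8
```

In this sketch, the first half of each output vector depends only on the current and earlier inputs, while the second half depends only on the current and later inputs, which is what gives every time step access to both directions of context.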

© 2023 SPIE and IS&T
Wuyan Liang, Xiaolong Xu, and Xiao Fu, "VioNets: efficient multi-modal fusion method based on bidirectional gate recurrent unit and cross-attention graph convolutional network for video violence detection," Journal of Electronic Imaging 32(2), 023031 (7 April 2023). https://doi.org/10.1117/1.JEI.32.2.023031
Received: 28 November 2022; Accepted: 21 March 2023; Published: 7 April 2023
KEYWORDS
Video

RGB color model

Video surveillance

Feature extraction

Optical flow

Visualization

Feature fusion
