Abstract:
Violence recognition is challenging because it must be performed on videos acquired by a large
number of surveillance cameras at any time and place. A recognition system should make reliable
detections in real time and inform surveillance personnel promptly when violent crimes take place.
Therefore, this study focuses on efficient violence recognition for real-time, on-device operation,
enabling easy expansion into a
surveillance system with numerous cameras. In this work, we propose a novel violence detection
pipeline that can be combined with conventional 2-dimensional Convolutional Neural Networks
(2D CNNs). In particular, frame-grouping is proposed to give the 2D CNNs the ability to
learn spatio-temporal representations in videos. It is a simple processing method that averages the
channels of each input frame and groups three consecutive channel-averaged frames into a single
input for the 2D CNNs.
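
The frame-grouping step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `frame_grouping`, the `stride` parameter, the `(T, H, W, C)` array layout, and the choice of non-overlapping groups by default are assumptions made here for illustration, since the abstract does not specify them.

```python
import numpy as np


def frame_grouping(frames: np.ndarray, stride: int = 3) -> np.ndarray:
    """Group channel-averaged frames into three-channel inputs.

    frames: array of shape (T, H, W, C), e.g. C = 3 for RGB video.
    stride: step between group start indices (hypothetical parameter;
            3 gives non-overlapping groups, 1 gives a sliding window).
    Returns an array of shape (N, H, W, 3).
    """
    if frames.ndim != 4:
        raise ValueError("expected frames with shape (T, H, W, C)")
    if stride < 1:
        raise ValueError("stride must be a positive integer")

    # Step 1: average over the channel axis, one single-channel image per frame.
    averaged = frames.astype(np.float32).mean(axis=-1)  # shape (T, H, W)

    num_frames = averaged.shape[0]
    if num_frames < 3:
        raise ValueError("need at least three frames to form a group")

    # Step 2: stack three consecutive averaged frames along a new last axis,
    # so each group looks like an ordinary three-channel image to a 2D CNN.
    groups = [
        np.stack([averaged[s], averaged[s + 1], averaged[s + 2]], axis=-1)
        for s in range(0, num_frames - 2, stride)
    ]
    return np.stack(groups, axis=0)


if __name__ == "__main__":
    video = np.random.rand(9, 224, 224, 3)  # nine synthetic RGB frames
    grouped = frame_grouping(video)
    print(grouped.shape)
```

Because each group has the same shape as a single RGB image, it can in principle be fed to an unmodified 2D CNN backbone, while the three channels now carry motion across consecutive time steps instead of color.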
Furthermore, lightweight spatial and temporal attention modules are included that consistently
improve violence recognition performance. The proposed pipeline significantly outperforms 2D CNNs
followed by Long Short-Term Memory (LSTM) and has much lower computational complexity than
existing 3D-CNN-based methods.
In particular, MobileNetV3 and EfficientNet-B0 with our proposed modules achieved state-of-the-art
performance on six different violence datasets.