Deep Feature-based Anomaly Detection for Video Surveillance

Paper | Report


Figure 1: Complete flow diagram of the proposed solution. It consists of three phases: Phase 1 (Input), Phase 2 (Feature Extraction), and Phase 3 (Classification).

Abstract

Detecting anomalies in video surveillance is a challenging task that requires distinguishing normal from abnormal behaviour. Surveillance systems must not only identify and monitor unusual human actions but also separate normal from anomalous actions within very large volumes of video data. In this study, we propose an intelligent video surveillance system that uses deep feature-based anomaly detection to identify anomalous events in a video stream. Our approach uses a two-stream deep learning model with I3D as the feature extraction component, an architecture that has demonstrated effectiveness in action recognition and detection tasks. We evaluate the proposed system on the UCF-Crime dataset, consisting of videos of normal and abnormal occurrences, and achieve an AUC of 87.52%. Our results demonstrate the efficacy of the proposed method in identifying anomalous events and show significant improvement over state-of-the-art methods.

Figure 2: A complete pipeline of the proposed solution in a step-wise manner. Phase 1 is the input sub-network, where normal and anomalous videos are divided into 32 segments each and passed to Phase 2, a two-stream I3D feature extractor that uses both RGB frames and optical flow to extract features. These features feed a five-layer fully connected neural network, trained with a modified MIL ranking loss function, which performs classification in Phase 3.

Phase 1

Input: UCF-Crime [1] is a large-scale dataset of 1900 untrimmed surveillance videos, totaling 128 hours, recorded by indoor and outdoor cameras. The dataset contains a roughly equal number of normal and anomalous videos in both the training and testing sets. It covers 13 types of anomalies: abuse, arrest, arson, assault, road accidents, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. Sultani et al. [1] proposed this dataset to overcome the limitations of earlier video anomaly detection datasets, which lacked diversity in both the types of anomalies and the number of videos.

Only video-level labels are available for training; frame-level annotations, gathered by multiple annotators, are provided solely for the testing set. The training set has 800 normal and 810 anomalous videos, while the testing set has 150 normal and 140 anomalous videos, with all 13 anomaly types occurring at various temporal locations and some videos containing multiple anomalies.

 

Phase 2

Pre-Processing

Resizing: The first step is to resize the input video frames to a fixed size to ensure uniformity across all frames.

Center and Random Cropping: The second step is to crop the video frame. In center cropping, the central region of the frame is retained for processing; in random cropping, a randomly selected region of the frame is cropped out to create a new sample, which also serves as data augmentation.

Normalization: The next step is to normalize the pixel values of the cropped video frames. This is done to ensure that the data has zero mean and unit variance, which helps in better convergence during training.

RGB to BGR Conversion: The two-stream I3D model expects input frames in BGR order, which differs from the RGB order produced by most video decoding libraries. The RGB frames are therefore converted to BGR before being fed into the model.

Optical Flow Computation: For the second stream of the two-stream I3D model, optical flow is computed on the input video frames. Optical flow is the pattern of apparent motion of objects across adjacent frames, and it captures the temporal information of the video.

Optical Flow Normalization: Finally, the optical flow values are normalized to zero mean and unit variance, like the RGB frames, to ensure consistency between the two streams. A combined sketch of these pre-processing steps follows.
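The steps above can be combined into one small routine. Below is a minimal sketch using OpenCV; the 256→224 resize/crop sizes, the per-frame normalization statistics, and the Farneback flow method are assumptions, as the exact settings are not specified here (TV-L1 flow is also common in I3D pipelines).

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize, center-crop, normalize, and convert an RGB frame to BGR."""
    frame = cv2.resize(frame_rgb, (256, 256))               # fixed-size resize
    off = (256 - size) // 2                                 # center crop; a random
    frame = frame[off:off + size, off:off + size]           # offset gives random crop
    frame = frame.astype(np.float32)
    frame = (frame - frame.mean()) / (frame.std() + 1e-6)   # zero mean, unit variance
    return frame[..., ::-1]                                 # RGB -> BGR for I3D

def flow_between(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow between adjacent grayscale frames, normalized like RGB."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return (flow - flow.mean()) / (flow.std() + 1e-6)
```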

 

Feature Extraction

Raw videos cannot be used directly for anomaly prediction, so the important features must first be extracted from them. For this purpose, the Two-Stream Inflated 3D ConvNet (I3D) [2] is used, which is built by inflating 2D ConvNets into 3D. I3D learns spatiotemporal features from video by extending successful ImageNet architecture designs and their parameters along the temporal dimension. The model employs a two-stream architecture, where one stream handles RGB frames and the other handles optical flow. It has 25 layers, including 9 Inception modules, which enable efficient feature extraction by concatenating multiple filter sizes in parallel. The I3D model is pre-trained on Kinetics and outperforms earlier state-of-the-art methods in action classification. The novel part of this approach is using a modified two-stream I3D model as the feature extractor for video anomaly detection on the UCF-Crime dataset. The original I3D implementation was in TensorFlow; here it has been modified and re-implemented in PyTorch, as sketched below.
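As a rough illustration, two-stream feature extraction with a PyTorch I3D port might look as follows. The `InceptionI3d` class, its `extract_features()` method, and the checkpoint names follow the widely used pytorch-i3d port and are assumptions about the exact implementation used here.

```python
import torch
# Assumes a PyTorch I3D port (e.g. pytorch-i3d) exposing InceptionI3d with
# an extract_features() method; class and checkpoint names are assumptions.
from pytorch_i3d import InceptionI3d

rgb_i3d = InceptionI3d(num_classes=400, in_channels=3)   # Kinetics-pretrained
rgb_i3d.load_state_dict(torch.load("rgb_imagenet.pt"))
flow_i3d = InceptionI3d(num_classes=400, in_channels=2)
flow_i3d.load_state_dict(torch.load("flow_imagenet.pt"))
rgb_i3d.eval(); flow_i3d.eval()

@torch.no_grad()
def extract_features(rgb_clip: torch.Tensor, flow_clip: torch.Tensor) -> torch.Tensor:
    """Clips are (N, C, T, H, W); each stream yields a 1024-d descriptor."""
    rgb_feat = rgb_i3d.extract_features(rgb_clip).mean(dim=(2, 3, 4))
    flow_feat = flow_i3d.extract_features(flow_clip).mean(dim=(2, 3, 4))
    return torch.cat([rgb_feat, flow_feat], dim=1)       # 2048-d two-stream feature
```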

Phase 3

Model Architecture

A video is divided into 32 segments that are treated as instances in a bag, following Sultani et al. [1]. Overlapping temporal segments of different scales were experimented with, but they did not enhance detection accuracy. Each minibatch is formed by randomly selecting 30 positive and 30 negative bags; a sketch of the segment construction is shown below.
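A minimal sketch of the segment construction, assuming clip-level features are averaged into each of the 32 segments (the exact pooling is not specified here):

```python
import numpy as np

def to_32_segments(clip_features: np.ndarray, num_segments: int = 32) -> np.ndarray:
    """Pool (num_clips, 2048) clip features into a (32, 2048) bag of instances.
    Averaging the clips that fall into each segment is an assumption."""
    n = len(clip_features)
    idx = np.linspace(0, n, num_segments + 1, dtype=int)
    return np.stack([
        clip_features[idx[i]:max(idx[i + 1], idx[i] + 1)].mean(axis=0)
        for i in range(num_segments)
    ])
```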

The extracted 2048-dimensional features are provided as input to a five-layer fully connected neural network. The first four layers are followed by the ReLU activation function, which introduces non-linearity and avoids the vanishing gradient problem. The last layer is followed by the sigmoid activation function to score inputs as normal or abnormal; its smooth gradient makes it suitable for backpropagation. Dropout regularization of 10% is applied at the fourth layer to prevent overfitting.
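A minimal PyTorch sketch of this classifier; the hidden-layer widths are assumptions, since only the 2048-dimensional input, the depth, the activations, and the 10% dropout are stated above.

```python
import torch.nn as nn

class AnomalyClassifier(nn.Sequential):
    """Five-layer FC classifier; hidden widths are assumptions."""
    def __init__(self, in_dim: int = 2048):
        super().__init__(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(), nn.Dropout(p=0.1),  # dropout at the 4th layer
            nn.Linear(32, 1), nn.Sigmoid(),                    # anomaly score in [0, 1]
        )
```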

 

Results

Figure 3: Training Progress

Figure 4: Testing Progress

The training progress of the proposed solution on the UCF-Crime dataset is shown in Figure 3. It plots the training loss, computed with the modified version of the MIL ranking loss function, against epochs. The magnitude of the loss decreases sharply until roughly the 30th epoch, after which it remains stable, with a few ups and downs until the last epoch. To avoid overfitting, early-stopping regularization is employed during training with a patience of 10 and a minimum delta of 0.001, using the training loss as the criterion.
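For reference, the MIL ranking loss of Sultani et al. [1], on which the modified loss is based, can be sketched as follows; the lambda weights follow [1], and the exact modification applied here is not reproduced.

```python
import torch

def mil_ranking_loss(scores_anom: torch.Tensor, scores_norm: torch.Tensor,
                     lam1: float = 8e-5, lam2: float = 8e-5) -> torch.Tensor:
    """Base MIL ranking loss of Sultani et al. [1] over (batch, 32) segment scores."""
    # Hinge ranking between the highest-scoring anomalous and normal segments.
    hinge = torch.relu(1.0 - scores_anom.max(dim=1).values
                       + scores_norm.max(dim=1).values).mean()
    # Temporal smoothness and sparsity constraints on anomalous-bag scores.
    smooth = ((scores_anom[:, 1:] - scores_anom[:, :-1]) ** 2).sum(dim=1).mean()
    sparsity = scores_anom.sum(dim=1).mean()
    return hinge + lam1 * smooth + lam2 * sparsity
```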

The testing progression is shown in Figure 4, which displays the AUC progression alongside the loss at each epoch.

Figure 5: Video Representation

Figure 6: Graph Representation

Figure 5 shows video-level snippets from the abuse videos, one of the 13 anomaly classes in the UCF-Crime dataset, together with the FPS value and a real-time prediction score for each frame. Figure 6 shows the temporal analysis of the anomaly predictions for the videos used in Figure 5; the higher the prediction value, the more likely the event is anomalous.

Download Links 

    UCF-Crime Dataset: Dropbox

    Extracted features of the UCF-Crime dataset: Google Drive

    GitHub: GitHub

References

[1] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-World Anomaly Detection in Surveillance Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6479–6488, 2018.

[2] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2018. arXiv: 1705.07750 [cs.CV].