Research: Scene summarization

Content-based video scene summary and analysis

Active research area. Follow github.com/vframeio for updates.

Consider the following scenario: a typical video in the Syrian Archive's collection is around 2.5 minutes long at 29.97 FPS, or about 4,500 frames per video. The Syrian Archive has over 3,314,000 videos, or roughly 15 billion video frames, that need to be analyzed. But the average speed of object detection algorithms is around 30 FPS, and the maximum number of usable processing threads per workstation is 4 (due to disk I/O and CPU/GPU resource limits). Analyzing 15 billion frames at 30 FPS would therefore take 5,787 days (nearly 16 years) on a single thread, or about 4 years on a maxed-out parallelized workstation. How is it even possible to analyze this many videos for evidence of illegal munitions? Faster object detection algorithms? Better hardware? More computers? More parallelization?
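
These estimates can be reproduced with a quick back-of-envelope calculation; the Python sketch below simply restates the numbers quoted above, and its results land within rounding of the figures in the text.

```python
# Back-of-envelope estimate of the processing time described above.
frames_per_video = 2.5 * 60 * 29.97          # ~4,500 frames per 2.5-minute video
total_frames = 3_314_000 * frames_per_video  # ~15 billion frames
detection_fps = 30                           # average object detection throughput
threads = 4                                  # usable threads per workstation

days_single_thread = total_frames / detection_fps / 86_400
days_parallel = days_single_thread / threads

print(f"total frames:  {total_frames:,.0f}")                       # ~14.9 billion
print(f"single thread: {days_single_thread:,.0f} days (~16 years)")
print(f"{threads} threads:     {days_parallel:,.0f} days (~4 years)")
```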

Sometimes, less is more. Over 99% of these 15 billion frames can be ignored by using scene summarization to omit similar video frames. Scene summarization, as its name implies, summarizes the scenes of a video into their most representative frames, acting as a content compression algorithm. But what exactly is a "scene"?

Existing open source scene summarization and detection tools found during research rely on techniques such as measuring changes in color and brightness to detect changes in a scene. PySceneDetect's content-aware scene detection, for example, analyzes video "based on changes between frames in the HSV color space". Since color and brightness do not always indicate a change in content, a better approach is to use a convolutional neural network (CNN) feature vector that represents the frame based on the objects it contains. However, computing CNN features for every frame is slow. To speed up feature extraction, a perceptual hash can be precomputed to determine when there is enough change between frames to warrant it. The perceptual hash step can in turn be sped up by first using a Canny edge filter to drop frames without any edge content, and sped up even more by resizing the video frame down to about 160 pixels wide. Altogether, this approach has been used at VFRAME to summarize over 2 million videos on a single workstation using a Python image processing workflow.
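
To make the cascade concrete, the sketch below shows one way it could be implemented with OpenCV, Pillow, and the imagehash package. This is not VFRAME's actual code: the threshold values are assumptions, and the CNN feature extraction step is left out (the generator only yields the frames that survive the cheap filters and are worth featurizing).

```python
# Minimal sketch of the frame-filtering cascade described above.
import cv2
import imagehash
import numpy as np
from PIL import Image

EDGE_THRESHOLD = 0.01    # assumed: min fraction of edge pixels to keep a frame
PHASH_THRESHOLD = 6      # assumed: min Hamming distance to count as "changed"

def keyframe_candidates(video_path):
    """Yield frames that pass the cheap filters and warrant CNN features."""
    cap = cv2.VideoCapture(video_path)
    last_hash = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 1. Downscale to ~160 px wide to speed up all later steps.
        h, w = frame.shape[:2]
        small = cv2.resize(frame, (160, int(160 * h / w)))
        gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
        # 2. Canny edge filter: drop frames with almost no edge content.
        edges = cv2.Canny(gray, 100, 200)
        if np.count_nonzero(edges) / edges.size < EDGE_THRESHOLD:
            continue
        # 3. Perceptual hash: skip frames too similar to the last kept frame.
        phash = imagehash.phash(Image.fromarray(gray))
        if last_hash is not None and phash - last_hash < PHASH_THRESHOLD:
            continue
        last_hash = phash
        # 4. Only now is the (slower) CNN feature vector worth computing.
        yield small
    cap.release()
```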

Since the scene-summarized keyframes represent (theoretically) all of the different views and scenes in the video, object detection can now be run on only those frames. The scene summarization includes three levels of density: short, medium, and expanded. The short set includes the 15 most diverse frames, while the expanded set includes approximately 10 to 100 frames. Once the videos are summarized, the expanded set is run through object detection, which now takes only 1 to 3 seconds per video. To be thorough, videos with object detection matches in the summarized keyframes can then be analyzed frame-by-frame to track and estimate the number of unique objects.
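
Selecting the most diverse frames from the CNN feature vectors can be done in several ways; the sketch below uses a simple greedy farthest-point strategy as an illustration, not necessarily VFRAME's method. The medium set size, and the `frames`/`detector` names in the usage comment, are assumptions for illustration.

```python
import numpy as np

# Density levels for the summaries; short and expanded sizes follow the text
# above, the medium size is an assumption for illustration.
DENSITIES = {"short": 15, "medium": 40, "expanded": 100}

def select_keyframes(features: np.ndarray, n: int) -> list[int]:
    """Greedily pick up to n mutually dissimilar frames (farthest-point)."""
    n = min(n, len(features))
    chosen = [0]                                   # seed with the first frame
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(n - 1):
        if dists.max() == 0:                       # remaining frames are duplicates
            break
        idx = int(np.argmax(dists))                # farthest from the chosen set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return sorted(chosen)

# Object detection then runs only on the expanded set, e.g.:
#   keyframes = select_keyframes(frame_features, DENSITIES["expanded"])
#   detections = [detector.detect(frames[i]) for i in keyframes]
```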

As new object detection models are trained, the pre-computed set of keyframes is used again to check for evidence of illegal munitions. The runtime to analyze the entire corpus is now only about 11 days on a single workstation, instead of roughly 4 years. Other techniques are used to further reduce this time, including skipping videos with resolutions or durations below a certain threshold, or videos that have already been marked as containing evidence. Overall, with these tweaks, batch processing the entire archive takes about 8 days on a single workstation.
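
The skip rules amount to a simple predicate over each video's metadata. A sketch is shown below; the threshold values and metadata field names are assumptions chosen purely for illustration, not VFRAME's actual configuration.

```python
# Illustrative pre-filter; thresholds and metadata field names are assumptions.
MIN_WIDTH_PX = 320       # assumed minimum frame width
MIN_DURATION_SEC = 3.0   # assumed minimum duration

def should_process(meta: dict) -> bool:
    """Skip videos that are too small, too short, or already marked as evidence."""
    if meta.get("marked_as_evidence"):
        return False
    if meta.get("width", 0) < MIN_WIDTH_PX:
        return False
    if meta.get("duration_sec", 0.0) < MIN_DURATION_SEC:
        return False
    return True
```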

Scene summarization is an important technique for working with large datasets. Often, most of the video data is not useful and can be ignored. Pairing CNN feature extractors with pre-computed perceptual hashes and Canny edge filters can significantly improve the capability of small research teams to analyze millions of videos.

VFRAME's approach to scene summarization is still under development, but its output is briefly illustrated below:

In practice, the algorithm generates a group of images for each video that looks like the examples below. Overall, the number of frames (and the amount of data) has been reduced by over 99%, yet the video is still understandable. Researchers can use these images to quickly preview the video before watching it, which also helps reduce exposure to graphic and traumatic content.

Video Duration   FPS     Number of Frames   Number of Summary Frames   Data Reduction
0:59             29.97   1,768              12                         99.33%
5:07             29.97   9,200              12                         99.87%
1:34             29.97   2,817              12                         99.58%

Example 1

Original video: 5:07, or 9,200 frames.
The 12 most representative frames of the video.

Example 2

Original video: 1:34, or 2,817 frames.
The 12 most representative frames of the video.

This research has been supported by funding from PrototypeFund (DE).