Research: Media Attribute Analysis

Active research area. See github.com/vframeio for updates.

Simple scripts for preliminary analysis of large video datasets can help projects run more smoothly by identifying hardware requirements, estimating processing time, and visualizing the distribution of video frame sizes to determine which object detection algorithms are most suitable. This research post introduces one of the techniques VFRAME uses, called media attribute analysis, to quickly begin understanding large video datasets.

Before analyzing hundreds of thousands of videos it can be helpful to visualize the distribution of video attributes, such as frame size and duration. A simple approach to visualizing this information is to use the PyMediaInfo library, a Python wrapper around the original MediaInfo library, to plot the distribution of dimensions, duration, and frames per second. To avoid confusion, the term media attributes is used here to describe immutable attributes of a video or image file, and does not refer to other content-based metadata attributes such as those inferred by neural networks.

Media attribute analysis is a rather simple tool, but it can reveal useful insights about the structure of video datasets, particularly when analyzing hundreds of thousands or millions of videos from disparate sources with varying resolutions and formats. Because it runs much faster than keyframe analysis, it is the first step VFRAME uses to analyze large OSINT video datasets. Actual processing speed depends on your storage medium's read speed, the number of threads available, and overall computational capacity. In practice this means roughly 4 – 400 videos per second, with the lower end on network-attached storage (NAS) devices and the upper end on SSDs.

For this example, 100,000 videos were located starting from the BrownMoses YouTube channel and then following a list of subscriptions and subscribers. The header data for each video file (e.g. mp4) is parsed with PyMediaInfo, and the duration, frames per second (FPS), width, and height are saved to a CSV for easy use with Pandas. The code examples below are abbreviated and designed to get you started.

# Extracting basic video attributes using PyMediaInfo
from pymediainfo import MediaInfo

fp_in = 'path/to/your/videos/myvideo.mp4'

# Get media attributes and keep the first video track
attrs = MediaInfo.parse(fp_in).to_data()
video_attrs = [x for x in attrs['tracks'] if x['track_type'] == 'Video'][0]

# Access single attributes
width = video_attrs.get('width', 0)
height = video_attrs.get('height', 0)
duration = int(video_attrs.get('duration', 0))  # milliseconds
frame_rate = float(video_attrs.get('frame_rate', 0))
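
The snippet above handles a single file. To build the CSV used throughout the rest of this post, the same attributes can be extracted for every video and written out one row per file. Below is a minimal sketch of that loop, assuming a flat folder of mp4 files, hypothetical paths, and a column order matching the sample rows further down; the frame_count and codec_id field names are assumptions about the MediaInfo output.

# Minimal sketch: extract attributes for a folder of videos and write a CSV
# (column order and the frame_count/codec_id fields are assumptions)
import csv
from pathlib import Path
from pymediainfo import MediaInfo

rows = []
for fp in Path('path/to/your/videos/').glob('*.mp4'):
    tracks = MediaInfo.parse(str(fp)).to_data()['tracks']
    video_tracks = [x for x in tracks if x['track_type'] == 'Video']
    if not video_tracks:
        continue  # skip files with no video track
    v = video_tracks[0]
    w, h = int(v.get('width', 0)), int(v.get('height', 0))
    aspect = round(w / h, 3) if h else 0
    rows.append([
        fp.name,                        # filename
        fp.suffix.lstrip('.'),          # container/extension
        True,                           # valid video track found
        w, h, aspect,
        int(v.get('frame_count', 0)),   # total frames
        v.get('codec_id', ''),          # e.g. avc1
        float(v.get('duration', 0)),    # milliseconds
        float(v.get('frame_rate', 0)),  # FPS
    ])

with open('path/to/metadata.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)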

# setup matplotlib vars
figsize = (12, 6)

# Sample CSV output from the VFRAME mediainfo extractor
# (columns appear to be: filename, ext, valid, width, height, aspect ratio,
#  frame count, codec, duration in ms, frame rate)
W7AY5_EblJo.mp4, mp4, True, 1280, 720, 1.778, 4932, avc1, 197280.0, 25.0
AoR8fOHrCH4.mp4, mp4, True, 1280, 720, 1.778, 562, avc1, 22480.0, 25.0
Psy4_7Rlz6Q.mp4, mp4, True, 1280, 720, 1.778, 5699, avc1, 227960.0, 25.0
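
The later steps operate on a Pandas DataFrame (df). A minimal sketch for loading the CSV, assuming the rows have no header and the column names inferred from the sample above:

# Load the mediainfo CSV into Pandas (column names are assumptions)
import pandas as pd

cols = ['filename', 'ext', 'valid', 'width', 'height', 'aspect_ratio',
        'frame_count', 'codec', 'duration', 'frame_rate']
df = pd.read_csv('path/to/metadata.csv', header=None, names=cols, skipinitialspace=True)
n_videos = len(df)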

Width and Height

First, the heights and widths of all 100,000 videos are plotted using matplotlib, with the sizes grouped into 100 pixel intervals. This shows 2 dominant clusters for height and about 4 varying clusters for width. The visualization is still somewhat primitive, but plotting the raw height and width first is useful for estimating how many clusters are needed for the next step, which uses K-Means clustering.
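
A minimal sketch of this plot, assuming the df, n_videos, and figsize variables defined above:

# Plot the width and height distributions in 100 pixel bins
import matplotlib.pyplot as plt

plt.figure(figsize=figsize)
bins = list(range(0, int(df.width.max()) + 100, 100))  # 100 pixel intervals
plt.hist([df.width.values, df.height.values], bins, label=['Width', 'Height'])
plt.title(f'Frame Size Distribution over {n_videos:,} Videos')
plt.xlabel('Pixels')
plt.ylabel('Videos')
plt.legend(loc='upper right')
plt.show()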

K-Means Clustering

Next, K-Means clustering is used to group the width and height data. The number of clusters depends on your data and some experimentation; refer to your height and width plots to estimate how many clusters your video dataset contains. Based on the plots above, 6 clusters were chosen for this analysis.

Import K-Means from the scikit-learn library, choose the number of clusters, and then fit your (width, height) data points. The K-Means algorithm will find the center points that best represent K groups. The plot below shows the color-coded results for each of the 6 clusters. If the clustering does not match your data well, adjust K and re-fit until it does.

# Generate K clusters for (width, height) data points
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# get and group w,h
heights = list(df.height.values)
widths = list(df.width.values)
X = np.array([np.array([w, h]) for w, h in zip(widths, heights)])

# init and fit kmeans
opt_n_clusters = 6
kmeans = KMeans(n_clusters=opt_n_clusters)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_

# plot kmeans to visually inspect cluster accuracy
plt.figure(figsize=figsize)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)
plt.title(f'K-Means Clusters for {n_videos:,} Videos')
plt.ylabel("Height")
plt.xlabel("Width")
plt.show()

In the last step of the frame size distribution analysis, sort the cluster centers by width and plot them as a plain bar graph.

# Sort k-means cluster data into a list of dicts
from operator import itemgetter

size_results = []
cluster_ids = list(y_kmeans)
centers = kmeans.cluster_centers_
for cluster_id in range(opt_n_clusters):
    n_found = cluster_ids.count(cluster_id)
    center = centers[cluster_id]
    # round the cluster center to the nearest 10 pixels
    dim = int(round(round(center[0]) / 10) * 10), int(round(round(center[1]) / 10) * 10)
    print(cluster_id, n_found, dim)
    o = {
        'width': dim[0],
        'height': dim[1],
        'count': n_found,
        'cluster_id': cluster_id,
        'label': f'{dim[0]}x{dim[1]}',
    }
    size_results.append(o)

# sort by width
size_results = sorted(size_results, key=itemgetter('width'))
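
A minimal sketch of the final bar graph, plotting the video count per rounded cluster size from the size_results list built above:

# Plot the sorted cluster sizes as a bar graph
labels = [x['label'] for x in size_results]
counts = [x['count'] for x in size_results]
plt.figure(figsize=figsize)
plt.bar(labels, counts)
plt.title(f'Frame Size Clusters for {n_videos:,} Videos')
plt.xlabel('Width x Height')
plt.ylabel('Videos')
plt.show()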

Unsurprisingly, most of the 100,000 videos are 1280×720 pixels, a common format for videos shared online. The next clusters had more overlap and could be re-clustered to provide more size options for lower-resolution videos. For now, these values are fine and show a strong cluster around 640 and another around 380. The 150 videos in the 150×120 pixel cluster and the 3,398 videos at 330×250 can be deprioritized in an object detection workflow for small cluster munitions because the resolution is likely too low. Deprioritizing low-resolution videos helps reduce overall processing time. Depending on what you are looking for, the second cluster could also be deprioritized, since finding small objects in videos smaller than 400×400 pixels is still unlikely.

Duration

Visualizing the duration of videos helps in understanding how conflicts are being documented. The plot below clearly shows a strong trend: the length of a video is inversely related to how often it appears in the dataset. Generally, most videos are rather short and very few are long. More precisely, 30.5% of videos were under 1 minute, 45.2% were under 1.5 minutes, and 56.7% were under 2 minutes. This makes sense because video documentation often involves capturing a specific incident, object, or moment. But it is interesting to see how strong this trend is across 100,000 videos.
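
These percentages can be computed directly from the DataFrame. A minimal sketch, assuming duration is stored in milliseconds as in the sample CSV above:

# Share of videos under 1, 1.5, and 2 minutes (duration assumed in milliseconds)
for minutes in (1, 1.5, 2):
    pct = 100 * (df.duration < minutes * 60 * 1000).mean()
    print(f'under {minutes} min: {pct:.1f}%')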

Duration is also useful when estimating the number of hours needed for manual review. The total duration of this video dataset (df.duration.sum() / 1000 / 60 / 60 / 8) would be about 820 8-hour days (or 273 full 24-hour days), virtually impossible to watch. Now, instead of calling it a dataset of 100,000 videos, it can more accurately be described as 273 days or 6,500 hours of footage.
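
The same back-of-the-envelope conversion as a small sketch, again assuming duration in milliseconds:

# Total footage in hours, 8-hour workdays, and 24-hour days
total_hours = df.duration.sum() / 1000 / 60 / 60
print(f'{total_hours:,.0f} hours')
print(f'{total_hours / 8:,.0f} 8-hour days')
print(f'{total_hours / 24:,.0f} 24-hour days')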

Frames Per Second, Frame Count

Plotting the FPS provides some understanding of the types of cameras being used, but comes with a caveat. Videos posted online are often compressed or converted into different formats. While it appears that most videos are 25 FPS, keep in mind that this may be the result of conversion scripts and not the actual frame rate of the camera. For raw videos, the FPS could be used to group videos by capture device; here it is not that useful.

And finally, the frame count can also be used to make processing time projections. In this dataset there were a total of 598,283,235 video frames (df.frame_count.sum()). If your workstation is capable of analyzing video at 30 FPS, you would need about 230 days to process the dataset, an impractical amount of time. Fortunately, not every frame needs to be analyzed.
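
A minimal sketch of this projection, with the analysis throughput as an assumed parameter:

# Projected processing time at a given analysis throughput
total_frames = df.frame_count.sum()
throughput_fps = 30  # assumed analysis speed in frames per second
days = total_frames / throughput_fps / 60 / 60 / 24
print(f'{total_frames:,} frames -> about {days:.0f} days at {throughput_fps} FPS')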

# setup plot
plt.figure(figsize=figsize)
plt.title(f'Frames Per Second Distribution over {n_videos:,} Videos')
plt.ylabel("Videos")
plt.xlabel("Frames Per Second")

# set 1 FPS histogram bins (edges from 15 to 29 FPS)
n_bins = 30
bin_size = 1
bins = list(range(15, n_bins * bin_size, bin_size))

# plot data
x = df['frame_rate'].values
plt.hist([x], bins, label=['FPS'])
plt.legend(loc='upper right')
plt.show()

Summary

Analyzing these attributes for 100,000 videos required only about 2 hours of compute time (much faster than keyframe analysis), yet provided several helpful preliminary insights about the resources needed for further analysis, the typical frame sizes, and how to speed up processing by deprioritizing low-resolution videos.

Without using any neural networks, artificial intelligence, or novel algorithms, this analysis:

- visualized the frame size distribution and grouped the 100,000 videos into 6 resolution clusters
- flagged low-resolution clusters that can be deprioritized to reduce processing time
- estimated the total footage at roughly 6,500 hours (about 273 full days) of video
- projected that frame-by-frame processing at 30 FPS would take about 230 days

The VFRAME toolkit (still under development) provides two scripts to run this analysis:

# Create mediainfo CSV
vframe dev meta -i path/to/videos -o path/to/metadata.csv

# Create plots
vframe dev plot_meta -i path/to/metadata.csv -o path/to/plots/