YouTube gets real-time video segmentation: Here’s how this technology works
The new segmentation technology will allow creators to replace and modify the background, increasing videos’ production value without specialised equipment.
Google has brought real-time, on-device mobile video segmentation to the YouTube app by integrating the technology into stories, a new lightweight video format designed specifically for YouTube creators that is currently in beta.
The new segmentation technology will allow creators to replace and modify the background, effortlessly increasing videos' production value without specialised equipment, Google's research blog noted.
"Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process or requires a studio environment with a green screen for real-time background removal. In order to enable users to create this effect live in the viewfinder, we designed a new technique that is suitable for mobile phones," the blog read.
The technology uses machine learning, specifically convolutional neural networks, to solve a semantic segmentation task. To provide high-quality data for the machine learning pipeline, the developers annotated thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel-accurate locations of foreground elements such as hair, glasses, neck, skin, and lips, plus a general background label. In cross-validation, the annotations achieved 98 percent Intersection-Over-Union (IOU) of human annotator quality.
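For readers unfamiliar with the metric, IOU measures how closely two masks agree: the number of pixels that both masks label as foreground, divided by the number that either mask labels as foreground. The Python sketch below is illustrative only. The function name and the small hand-made masks are assumptions for the example and do not come from Google's blog.

```python
def intersection_over_union(mask_a, mask_b):
    """Return the IOU of two same-shaped binary masks (nested lists of 0/1)."""
    same_rows = len(mask_a) == len(mask_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(mask_a, mask_b)):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels marked foreground in both masks
    union = 0         # pixels marked foreground in at least one mask
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1

    # Two empty masks agree perfectly; return 1.0 rather than divide by zero.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 3x4 masks from two annotators labelling the same image.
annotator_1 = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
annotator_2 = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]

iou = intersection_over_union(annotator_1, annotator_2)
print(f"IOU: {iou:.3f}")
```

Here the two masks share five foreground pixels out of seven marked by either annotator, so the score is 5/7. A score of 1.0 would mean the masks are identical.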
The specific segmentation task is to compute a binary mask separating the foreground from the background for every input frame (three channels, RGB) of the video. To achieve temporal consistency, the mask computed for the previous frame is passed as a prior by concatenating it as a fourth channel to the current RGB input frame, the developers said in the blog.
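The idea of feeding the previous frame's mask back in as a fourth channel can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not Google's implementation: the function names are hypothetical, the first frame is assumed to use an all-zero prior, and `toy_predict_mask` is a brightness threshold standing in for the real neural network.

```python
def stack_prior_mask(rgb_frame, prior_mask=None):
    """Append the previous mask to each RGB pixel as a fourth channel.

    rgb_frame is a height x width grid of (r, g, b) values. If there is no
    previous mask (assumed here for the first frame), use all zeros.
    """
    height = len(rgb_frame)
    width = len(rgb_frame[0]) if height else 0
    if prior_mask is None:
        prior_mask = [[0.0] * width for _ in range(height)]
    return [
        [list(pixel) + [prior_mask[y][x]] for x, pixel in enumerate(row)]
        for y, row in enumerate(rgb_frame)
    ]


def toy_predict_mask(net_input):
    """Stand-in for the network: marks bright pixels as foreground.

    It ignores the fourth channel; a trained model would use it as a prior.
    """
    return [
        [1.0 if sum(px[:3]) / 3 > 0.5 else 0.0 for px in row]
        for row in net_input
    ]


def segment_video(frames, predict_mask):
    """Run the predictor frame by frame, carrying each mask to the next frame."""
    prior = None
    masks = []
    for frame in frames:
        net_input = stack_prior_mask(frame, prior)  # four channels per pixel
        prior = predict_mask(net_input)             # becomes the next prior
        masks.append(prior)
    return masks


# A hypothetical two-frame, 2x2 video with channel values in [0, 1].
frame = [
    [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1)],
    [(0.2, 0.2, 0.2), (0.8, 0.8, 0.8)],
]
masks = segment_video([frame, frame], toy_predict_mask)
second_input = stack_prior_mask(frame, masks[0])
print(second_input[0][0])
```

The loop is the relevant part: each frame's output mask becomes part of the next frame's input, which is what lets a model keep its masks stable from one frame to the next.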
Google noted that the limited rollout in YouTube stories will be used to test the technology on this first set of effects, with a rollout across all versions to follow in the near future.
"As we improve and expand our segmentation technology to more labels, we plan to integrate it into Google's broader Augmented Reality services," the blog noted.