‘End-to-end’ approach generates favorable results in comparison with classical codecs

Unveiling their work at the Conference on Neural Information Processing Systems in Vancouver, British Columbia, the UCI/Disney Research team members showed that their compressor, while still in an early phase, yielded less distortion and significantly lower bits-per-pixel rates than classical coding-decoding algorithms such as H.265 when trained on specialized video content. On downscaled, publicly available YouTube videos, it achieved comparable results.

“Ultimately, every video compression approach works on a trade-off,” said research team leader Stephan Mandt, UCI assistant professor of computer science, who began the project while employed at Disney Research. “If I’m allowing for larger file sizes, then I can have better image quality. If I want to have a short, really small file size, then I have to tolerate some errors. The hope is that our neural network-based approach does a better trade-off overall between file size and quality.”
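The trade-off Mandt describes can be illustrated with a toy experiment. The sketch below is not the team’s method: it assumes synthetic Gaussian "pixel" values, uses plain rounding with a hypothetical step size as the only lossy operation, and treats empirical entropy as a rough proxy for file size. Coarser rounding should lower the bit rate while raising the error.

```python
import math
import random
from collections import Counter


def quantize(values, step):
    """Map each value to the index of the nearest multiple of `step`."""
    return [round(v / step) for v in values]


def entropy_bits(symbols):
    """Empirical entropy in bits per symbol, a rough proxy for file size."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mse(values, indices, step):
    """Mean squared error between the originals and their rounded versions."""
    return sum((v - i * step) ** 2 for v, i in zip(values, indices)) / len(values)


random.seed(0)
# Assumption: synthetic data standing in for pixel intensities.
pixels = [random.gauss(128.0, 30.0) for _ in range(10_000)]

results = []
for step in (1.0, 4.0, 16.0, 64.0):
    idx = quantize(pixels, step)
    results.append((step, entropy_bits(idx), mse(pixels, idx, step)))

for step, rate, distortion in results:
    print(f"step={step:5.1f}  rate={rate:5.2f} bits/value  distortion={distortion:8.2f}")
```

Each row is one point on a rate-distortion curve. A better codec, in this picture, is one whose curve sits lower: less distortion at the same rate, or a lower rate at the same distortion.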

Video compression, even as done by traditional codecs, relies heavily on predictive capabilities, Mandt said: “Intuitively, the better a compression algorithm is at predicting the next frame of a video – given what happened in the previous frames – the less it has to memorize. If you see a person walking in a particular direction, you can predict how that video will continue in the future, which means you have less to remember and less to store.”
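A minimal sketch of that intuition, under stated assumptions: the "video" is reduced to a single horizontal position per frame, and the predictor is a hypothetical constant-velocity guess rather than a neural network. The encoder stores only how wrong each prediction was, and the decoder repeats the same predictions to recover the original exactly.

```python
def predict(prev2, prev1):
    """Constant-velocity guess: assume the motion continues unchanged."""
    return 2 * prev1 - prev2


def encode(positions):
    """Keep the first two positions, then only the prediction errors."""
    residuals = [
        positions[t] - predict(positions[t - 2], positions[t - 1])
        for t in range(2, len(positions))
    ]
    return positions[:2], residuals


def decode(head, residuals):
    """Replay the same predictions and add back the stored errors."""
    out = list(head)
    for r in residuals:
        out.append(predict(out[-2], out[-1]) + r)
    return out


# Assumption: a walker moving 3 pixels per frame with an occasional 1-pixel wobble.
positions = [10 + 3 * t + (1 if t % 7 == 0 else 0) for t in range(50)]

head, residuals = encode(positions)
restored = decode(head, residuals)

print("distinct raw values:     ", len(set(positions)))
print("distinct residual values:", len(set(residuals)))
print("exact reconstruction:    ", restored == positions)
```

Because the prediction is usually right, most residuals are zero and the rest are tiny, so far fewer distinct values need to be stored than in the raw sequence.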

Current compression algorithms perform this task using heavily engineered solutions, such as computing the linear displacement of small, localized patches relative to their position in the previous frame. In contrast, deep neural networks take a data-centric approach and learn the video’s underlying dynamics from large datasets of video material.
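The patch-displacement idea used by classical codecs can be sketched as an exhaustive block-matching search. This is a simplified illustration, not the procedure of H.265 or any specific codec: the frame size, patch size, search radius and the sum-of-absolute-differences cost are all assumptions chosen for brevity.

```python
def sad(prev, curr, py, px, cy, cx, size):
    """Sum of absolute differences between two size x size patches."""
    return sum(
        abs(prev[py + i][px + j] - curr[cy + i][cx + j])
        for i in range(size)
        for j in range(size)
    )


def find_displacement(prev, curr, y, x, size=4, search=3):
    """Return (cost, dy, dx): the offset into `prev` that best matches
    the patch whose top-left corner is (y, x) in `curr`."""
    height, width = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = y + dy, x + dx
            # Skip candidate patches that would fall outside the frame.
            if 0 <= py <= height - size and 0 <= px <= width - size:
                cost = sad(prev, curr, py, px, y, x, size)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best


def frame_with_square(top, left, size=4, side=16):
    """Synthetic frame: black background with one textured square."""
    frame = [[0] * side for _ in range(side)]
    for i in range(size):
        for j in range(size):
            frame[top + i][left + j] = 100 + 10 * i + j
    return frame


# The square moves one row down and two columns right between frames.
prev_frame = frame_with_square(top=5, left=4)
curr_frame = frame_with_square(top=6, left=6)

match = find_displacement(prev_frame, curr_frame, y=6, x=6)
print("cost, dy, dx =", match)
```

With a perfect match, the encoder only needs to store the two-number offset instead of the 16 pixel values in the patch. The search rule itself is fixed by hand, which is the "heavily engineered" aspect that a learned model replaces.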

These data-driven methods, enabled by advances in deep learning over the past decade, show promise for shrinking video file sizes in future generations of video compression codecs.

Combining novel and traditional steps

The first step of the UCI/Disney Research team’s innovation is to reduce the dimensionality of each video frame using a so-called variational autoencoder. This is a neural network that processes each frame through a sequence of operations, producing a condensed array of numbers. The autoencoder then tries to undo this operation to ensure that the array contains enough information to restore the video frame. “You can think of the autoencoder as having an hourglass shape,” Mandt said. “It has a low-dimensional, compact version of the image in the middle; this is how we compress every frame into something smaller.”

Then the algorithm attempts to guess the next compressed version of an image given what has gone before, relying on an AI-based technique called a “deep generative model.” Mandt noted that other researchers have done work in this area, so this particular method is not unique. What sets the UCI/Disney Research team’s efforts apart is what follows.

Next, the algorithm encodes frame content by rounding the autoencoder’s real-valued array to integers, which are easier to store than real numbers with their many decimal places. The final step is to apply lossless compression to the integer array, allowing for its exact restoration. Crucially, this lossless step is informed by the neural network’s prediction of which video frame to expect next, which makes it extremely efficient.
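The rounding and lossless steps can be sketched as follows. This is only an illustration under assumptions: the latent values are made up, and Python’s general-purpose `zlib` stands in for the lossless coder. In the approach described here, the coder is instead driven by the neural network’s prediction, which `zlib` does not attempt.

```python
import struct
import zlib


def quantize(latent):
    """Round each real-valued latent entry to the nearest integer (lossy)."""
    return [round(z) for z in latent]


def pack(ints):
    """Serialize as 16-bit signed integers, then compress losslessly."""
    return zlib.compress(struct.pack(f"<{len(ints)}h", *ints))


def unpack(blob, count):
    """Invert `pack`: decompress and read back `count` integers."""
    return list(struct.unpack(f"<{count}h", zlib.decompress(blob)))


# Assumption: a hypothetical latent array as an autoencoder might produce.
latent = [0.12, -1.87, 3.49, 2.51, -0.5, 7.02, 0.0, -3.3]

ints = quantize(latent)
blob = pack(ints)
restored = unpack(blob, len(ints))

max_error = max(abs(z - q) for z, q in zip(latent, ints))
print("integers:        ", ints)
print("exact round trip:", restored == ints)
print("max rounding err:", max_error)
```

All of the information loss happens in `quantize`, where each value moves by at most half a unit. Everything after that point is exactly reversible.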

“For example, a language composed of letters from a finite alphabet can be perfectly compressed and uncompressed without any loss,” Mandt said. “By discretizing the latent frames of a video, we have created a discrete, countable alphabet, and now we apply lossless compression to it to reduce the file size even more.”
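Why a predictive model helps the lossless step can be shown with ideal code lengths: a symbol assigned probability p costs about -log2(p) bits, a length that entropy coders such as arithmetic coding closely approach. The sketch below assumes a hypothetical nine-symbol alphabet and compares a model with no expectations against a simple "sticky" model that expects the previous symbol to repeat. The sticky model is a stand-in for illustration, not the team’s deep generative model.

```python
import math

# Assumption: a small integer alphabet, as produced by rounding latents.
ALPHABET = list(range(-4, 5))


def uniform_model(history):
    """No expectations: every symbol is equally likely."""
    return {s: 1 / len(ALPHABET) for s in ALPHABET}


def sticky_model(history, stay=0.8):
    """Expect the previous symbol to repeat with probability `stay`."""
    if not history:
        return uniform_model(history)
    other = (1 - stay) / (len(ALPHABET) - 1)
    return {s: (stay if s == history[-1] else other) for s in ALPHABET}


def ideal_bits(symbols, model):
    """Total ideal code length: sum of -log2 p(symbol | history)."""
    total = 0.0
    for t, s in enumerate(symbols):
        total += -math.log2(model(symbols[:t])[s])
    return total


symbols = [2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1]

uniform_cost = ideal_bits(symbols, uniform_model)
sticky_cost = ideal_bits(symbols, sticky_model)
print(f"uniform model: {uniform_cost:.2f} bits")
print(f"sticky model:  {sticky_cost:.2f} bits")
```

Both models describe the same sequence without loss. The one that predicts better needs fewer bits, and a mispredicted symbol costs more than it would under the uniform model.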

He said that these steps, as a whole, make this approach an “end-to-end” video compression algorithm: “The real contribution here was to combine this neural network-based deep generative video prediction model with everything else that belongs to compression algorithms, such as rounding and model-based lossless compression.”

Mandt added that he and his collaborators will continue to work toward a real, applicable version of the video compressor. One challenge is that they might need to compress the neural network itself, along with the video.

“Because the receiver requires a trained neural network for reconstructing the video, you might also have to think about how you transmit it along with the data,” Mandt said. “There are lots of open questions still. It’s a very early stage.”