# Stephan Mandt @ ICBINB Seminar Series - 2023

## Details

Title : Stephan Mandt @ ICBINB Seminar Series
Author(s): I Can’t Believe It’s Not Better!
Link(s) : https://www.youtube.com/watch?v=v6DWk7chTJQ

## Rough Notes

### Why Neural Compression Hasn't Taken Off (Yet)

Motivation: Video compression (~80% of all consumer traffic), health care data, edge computing, decentralized machine learning, point clouds used in self-driving cars, AR and VR content.

With video and audio streaming, we can do specialized coding, e.g. NVIDIA achieved ~10x compression.

LiDAR data can contain tens of billions of points.

In practice, lossless compression requires:

- Discrete observations e.g. on a grid.
- A discrete probability model e.g. by discretizing a learned continuous density.
- A coding scheme (always exists, e.g. a Huffman tree, which assigns bitstreams to events such that rarer events get longer codewords).
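The Huffman-tree idea mentioned above can be sketched in a few lines: repeatedly merge the two least frequent subtrees, then read codewords off the branches. This is a minimal illustration, not a production coder; the symbol frequencies are made up.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a Huffman code: rarer symbols end up deeper in the tree,
    so they receive longer codewords."""
    # Heap entries: (frequency, unique tiebreaker, subtree).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees into one node.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    # Walk the tree, appending '0'/'1' at each branch.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = Counter("aaaabbbccd")  # 'a' is common, 'd' is rare
codes = huffman_codes(freqs)   # 'a' gets a shorter codeword than 'd'
```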

Lossy compression on the other hand may involve some form of truncation/rounding, followed by lossless compression afterwards.

In industry, lossy compression is often done by transform coding, which uses an encoder transform \(f:X\to Z\) and a decoder transform \(g:Z\to X\), so that quantization and coding happen in the latent space \(Z\).
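The transform-coding pipeline above can be sketched with a toy orthonormal transform (a 2-point Haar transform, chosen here purely for illustration): transform to the latent space, quantize there, then invert. The step size is an assumed parameter.

```python
import numpy as np

def encode(x, T, step):
    """Encoder transform f: X -> Z, then quantize in latent space."""
    z = T @ x
    return np.round(z / step).astype(int)  # integer symbols to entropy-code

def decode(q, T, step):
    """Dequantize, then decoder transform g: Z -> X."""
    z_hat = q * step
    return T.T @ z_hat  # T is orthonormal, so its inverse is its transpose

# Toy orthonormal transform (2-point Haar): average and difference.
# Neighbouring pixels are correlated, so the difference channel is near zero
# and cheap to code.
T = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
x = np.array([10.0, 10.4])
q = encode(x, T, step=0.5)
x_hat = decode(q, T, step=0.5)  # close to x, up to quantization error
```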

Rate is the expected code length/file size. Distortion measures the average reconstruction error. The Rate-Distortion (R-D) curve shows their relation, and we want both to be low.

Neural compression is about learning compression algorithms from data. Ideally, we would optimize the R-D performance. Some problems:

- Quantization (e.g. rounding) is not differentiable.
- The entropy model is discrete, giving a probability mass function.

A workaround is to add uniform noise \(u\sim \mathcal{U}(-0.5,0.5)\) in place of rounding during training, relaxing the discrete mass function to a continuous density. Training with uniform noise is equivalent to training a particular Variational Autoencoder (VAE).
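The noise relaxation can be sketched as follows: at training time the non-differentiable rounding is replaced by additive \(\mathcal{U}(-0.5,0.5)\) noise, while at test time the latent is actually rounded. This is a minimal NumPy illustration of the substitution, not a full VAE training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z, training):
    """Training: additive U(-0.5, 0.5) noise as a differentiable proxy
    for rounding. Test time: hard rounding."""
    if training:
        return z + rng.uniform(-0.5, 0.5, size=z.shape)
    return np.round(z)

z = np.array([1.2, -0.7, 3.4])
z_train = quantize(z, training=True)   # noisy, but smooth in z
z_test = quantize(z, training=False)   # integer symbols for entropy coding
```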

There is a connection between image compression and deep latent variable modelling, further improved by using e.g. autoregressive structure.

But are we approaching optimality? Neural image codecs have beaten most non-learned ones in R-D metrics, yet the gains seem to be saturating. Are we reaching a fundamental limit? Such a limit exists: the information rate-distortion function \(R(D)\). Classical algorithms are also much faster; neural codecs are roughly an order of magnitude too slow to be competitive with classical ones. Other obstacles include the difficulty of defining a coding standard, friction in changing deployed systems, etc.

Are things that bad? A closer look at \(R(D)\): it is a function of the data source alone, and characterizes its compressibility. Given an expected distortion tolerance \(D\), \(R(D)\) is the lowest achievable rate of any block code. It is in general not available in analytical form. Work by the speaker derives upper and lower bounds on this function that can be computed from iid samples.
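One of the few sources where \(R(D)\) is known in closed form is the iid Gaussian under squared-error distortion, where \(R(D)=\max\!\big(0,\tfrac{1}{2}\log_2(\sigma^2/D)\big)\) bits per sample. A minimal sketch of this textbook formula (not the speaker's sample-based bounds):

```python
import numpy as np

def gaussian_rate_distortion(sigma2, D):
    """R(D) for an iid Gaussian source with variance sigma2 under
    squared-error distortion: max(0, 0.5 * log2(sigma2 / D)) bits/sample."""
    return max(0.0, 0.5 * np.log2(sigma2 / D))

# Halving the tolerated distortion costs exactly half a bit per sample.
r1 = gaussian_rate_distortion(1.0, 0.25)   # 1.0 bit/sample
r2 = gaussian_rate_distortion(1.0, 0.125)  # 1.5 bits/sample
```

Once \(D\) exceeds the source variance, the optimal strategy is to transmit nothing and reconstruct the mean, so the rate is zero.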

What holds for the future?

- Exploiting encoder-decoder asymmetries: E.g. streaming services have \(X\) videos/movies and \(Y\) users; if \(Y>X\), each video is encoded once but decoded many times, so spending more time encoding may help.
- Improving inference: Amortization gap, discretization gap (stochastic Gumbel annealing), marginalization gap (lossy bits-back coding).
- Faster decoding with parsimonious models: Focus on lightweight models, sacrificing some \(R(D)\) performance. One inspiration is JPEG, which is a linear autoencoder. E.g. lossless compression with probabilistic circuits.
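The observation that JPEG is essentially a linear autoencoder can be made concrete: its DCT is a fixed orthonormal linear transform, so the decoder is just the encoder's transpose. A sketch of the 8-point DCT-II basis (the 1-D building block of JPEG's 8x8 transform):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis, the 1-D building block of JPEG's
    8x8 block transform."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2)          # rescale DC row so that C @ C.T = I
    return C

C = dct_matrix()
block = np.arange(8, dtype=float)  # a smooth 1-D "image" row
coeffs = C @ block                 # encoder: linear analysis transform
recon = C.T @ coeffs               # decoder: the transpose inverts it
```

On smooth inputs the energy concentrates in the low-frequency coefficients, which is what makes quantizing and coding them cheap.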
- Focus on perceptual metrics over distortions: Mean squared error may not be the best quantity. There is a perception-distortion trade-off that applies to generative modelling more broadly: perceptual quality and low distortion compete against each other.

In summary:

- Neural codecs are better than classical codecs if we ignore computation time.
- Neural compression is currently not 10x better.
- On the R-D and realism front, neural compression does have an advantage over traditional approaches, and there is lots of promise there.
- More promise for non-traditional data types (e.g. point clouds, 3D, etc.).