
Scalable Neural Video Representations with Learnable Positional Features

https://arxiv.org/abs/2210.06823 

I've recently become interested in video representation, so I picked this topic. Since this post is mostly a summary, the equations are written in LaTeX, so I recommend reading it side by side with the paper!!

 

What are Coordinate-Based Neural Representations (CNRs)?

⇒ Instead of storing signal outputs on a coordinate grid (e.g. image pixels), CNRs represent each signal as a compact, continuous, parameterized neural network.

→ Why? A coordinate grid’s memory requirement scales unfavorably with resolution and dimension (e.g. a 600-frame 1920×1080 RGB video already takes 1920 × 1080 × 600 × 3 bytes ≈ 3.7 GB as a raw pixel grid)

 

Several works have attempted to exploit CNRs for representing video signals

→ How? By learning a neural network $f : \mathbb{R}^3 \rightarrow \mathbb{R}^3$ with $f(x, y, t) = (r, g, b)$ (see the sketch after this list)

→ Conventional CNRs fail to encode large-scale videos, but this can be mitigated by designing a CNR specialized for videos.

→ NeRV proposes a CNR structure that models the video signal continuously only along the temporal dimension (time), allowing radical variations along the spatial axes.
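To make the coordinate-to-RGB idea concrete, here is a minimal PyTorch sketch of such a CNR. The layer sizes are illustrative choices of mine, and I omit the Fourier features or sinusoidal activations that practical CNRs typically add:

```python
import torch
import torch.nn as nn

class VideoCNR(nn.Module):
    """Plain coordinate-based representation f(x, y, t) -> (r, g, b)."""
    def __init__(self, hidden: int = 256, depth: int = 4):
        super().__init__()
        layers = [nn.Linear(3, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, 3), nn.Sigmoid()]  # RGB in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) space-time points (x, y, t) in [0, 1]^3
        return self.net(coords)

model = VideoCNR()
rgb = model(torch.rand(1024, 3))  # query 1024 random space-time points
```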

However, CNRs suffer from severe compute-inefficiency

 

Some methods have been proposed to alleviate the compute-inefficiency.

→ Separating CNR $f$ into two parts : Coordinate-to-latent mapping $g_\theta (x,y,t)=\mathbf{z}$ & Latent-to-RGB mapping $h_{\phi}(\mathbf{z})=(r, g, b)$

→ $g_\theta$ is defined with latent grids (whose shape resembles a grid representation of the given signal)
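A rough sketch of this two-part design, assuming a single video-like 3D latent grid read out by trilinear interpolation via `F.grid_sample`; the grid resolution and latent dimension are placeholder values of mine, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridCNR(nn.Module):
    """Coordinate-to-latent via grid lookup (g_theta),
    then latent-to-RGB via a small MLP (h_phi)."""
    def __init__(self, C: int = 16, T: int = 32, H: int = 64, W: int = 64):
        super().__init__()
        # g_theta: learnable video-like latent grid, shape (1, C, T, H, W)
        self.grid = nn.Parameter(0.01 * torch.randn(1, C, T, H, W))
        # h_phi: tiny MLP mapping a C-dim latent code to RGB
        self.mlp = nn.Sequential(nn.Linear(C, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) as (x, y, t) in [0, 1]; grid_sample expects [-1, 1],
        # with the last dim ordered (x, y, t) to index the (W, H, T) axes
        pts = (coords * 2 - 1).view(1, -1, 1, 1, 3)            # (1, N, 1, 1, 3)
        z = F.grid_sample(self.grid, pts, align_corners=True)  # (1, C, N, 1, 1)
        z = z.view(self.grid.shape[1], -1).t()                 # (N, C)
        return self.mlp(z)
```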

However, this architecture severely sacrifices parameter-efficiency

 

This paper introduces NVP, a novel CNR for videos

NVP presents learnable positional features that effectively amortize a given video as “2D and 3D” latent grids with succinct parameters.

(amortize? → spreading out the cost of operation or resource over multiple uses or time periods)

  1. NVP presents two types of latent grids for constructing these mappings

⇒ Latent keyframes + Sparse positional features

  2. NVP proposes a compute/memory-efficient compression procedure to further reduce the parameters $\theta$ by incorporating existing image and video codecs

So, what is NVP (Neural Video Representations with learnable positional features)?

Given a video signal $\mathbf{v} := (\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_T)$ ($T$ video frames), this paper aims to find a compact neural representation $f_{\mathbf{w}}$ with parameters $\mathbf{w}$ (from this, we are able to reconstruct the original video with high quality)

 

To do this, this paper represents the video using a neural network $f_{\mathbf{w}} : \mathbb{R}^3 \rightarrow \mathbb{R}^3$. The neural network maps space-time coordinates $(x, y, t)$ of the video to corresponding RGB values $(r, g, b)$. Here, we optimize $f_{\mathbf{w}}$ with a reconstruction objective (e.g. mean-squared error)
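A minimal sketch of this fitting loop, assuming a float `video` tensor of shape (T, H, W, 3) with values in [0, 1] and either of the model sketches above; batch size, step count, and learning rate are placeholders:

```python
import torch

def fit(model, video, steps=1000, batch=4096, lr=1e-3):
    """Fit f_w to a video by regressing RGB at random space-time coordinates."""
    T, H, W, _ = video.shape
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        # sample random pixel indices and normalize them to [0, 1]^3 coords
        t = torch.randint(0, T, (batch,))
        y = torch.randint(0, H, (batch,))
        x = torch.randint(0, W, (batch,))
        coords = torch.stack([x / (W - 1), y / (H - 1), t / (T - 1)], dim=-1)
        loss = ((model(coords.float()) - video[t, y, x]) ** 2).mean()  # MSE
        opt.zero_grad()
        loss.backward()
        opt.step()
```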

 

This paper also resolves the tremendous training-time cost by designing “learnable positional features” → these encode a video with high quality while keeping compute/parameter efficiency intact.

Architecture of NVP

The video CNR $f_{\mathbf{w}}$ is composed of two functions with parameterization $\mathbf{w}:=(\theta, \phi)$

→ Coordinate-to-latent mapping $g_\theta$ & latent-to-RGB mapping $h_\phi$

 

We can decompose $g_\theta$ as $g_{\theta_{xy}} \times g_{\theta_{xt}} \times g_{\theta_{yt}} \times g_{\theta_{xyt}}$

→ $g_{\theta_{xy}}, g_{\theta_{xt}}, g_{\theta_{yt}}$ are formalized with image-like 2D latent spatial grids $\mathbf{U}_{\theta_{xy}}, \mathbf{U}_{\theta_{xt}}, \mathbf{U}_{\theta_{yt}}$

→ $g_{\theta_{xyt}}$ is designed with a video-like sparse 3D latent grid $\mathbf{U}_{\theta_{xyt}}$
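A sketch of how the three 2D latents might be gathered and concatenated via bilinear interpolation; all grid sizes and the channel width are toy values of mine, and $\mathbf{z}_{xyt}$ from the sparse 3D grid would be sampled analogously (trilinearly) and appended:

```python
import torch
import torch.nn.functional as F

def sample_2d(grid: torch.Tensor, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Bilinearly interpolate a (1, C, H, W) latent grid at points (u, v) in [0, 1]."""
    pts = (torch.stack([u, v], dim=-1) * 2 - 1).view(1, -1, 1, 2)  # to [-1, 1]
    z = F.grid_sample(grid, pts, align_corners=True)               # (1, C, N, 1)
    return z.view(grid.shape[1], -1).t()                           # (N, C)

# three image-like 2D latent keyframe grids (toy sizes)
U_xy, U_xt, U_yt = (torch.randn(1, 8, 64, 64) for _ in range(3))
x, y, t = torch.rand(1024), torch.rand(1024), torch.rand(1024)
z = torch.cat([
    sample_2d(U_xy, x, y),   # z_xy: shared across time
    sample_2d(U_xt, x, t),   # z_xt: shared across the y-axis
    sample_2d(U_yt, y, t),   # z_yt: shared across the x-axis
], dim=-1)                   # z_xyt from the sparse 3D grid is appended too
```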

 

We present the latent-to-RGB mapping $h_{\phi}(\mathbf{z}_{xy}, \mathbf{z}_{xt}, \mathbf{z}_{yt}, \mathbf{z}_{xyt}) = (r, g, b)$ as an MLP modulated by another neural network.

 

Assume that all input coordinates $(x, y, t)$ of $g_\theta$ (and $f_\mathbf{w}$) lie in $[0, 1]^3 \subset \mathbb{R}^3$

 

Learnable latent keyframes

For a given input $(x, y, t)$, we compute the latent vectors $\mathbf{z}_{xy}, \mathbf{z}_{xt}, \mathbf{z}_{yt}$ from $\mathbf{U}_{\theta_{xy}}, \mathbf{U}_{\theta_{xt}}, \mathbf{U}_{\theta_{yt}}$ individually.

Each $\mathbf{U}$ is a collection of $L$ 2D spatial grids of $C$-dimensional latent codes $u_{ij}$, where the $l$-th grid has resolution $H_l \times W_l$

 

Since $\mathbf{U}_{\theta_{yt}}$ is shared over the x-axis, we compute the latent vector $\mathbf{z}_{yt}$ by considering only the $y$ and $t$ values of a given coordinate
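Reusing the `sample_2d` helper from the previous sketch, the multi-resolution keyframe lookup could look as follows; the level count, channel width, and concatenation across levels are my assumptions rather than the paper's exact settings:

```python
import torch

# L = 3 levels of the latent keyframe U_yt at increasing resolutions H_l x W_l
levels = [torch.randn(1, 4, 16 * 2**l, 16 * 2**l) for l in range(3)]

def z_yt(y: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # U_yt is shared over the x-axis, so only (y, t) select the latent code;
    # per-level codes are concatenated across the L resolutions
    return torch.cat([sample_2d(g, y, t) for g in levels], dim=-1)

z = z_yt(torch.rand(1024), torch.rand(1024))  # shape (1024, 3 * 4)
```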

 

Modulated implicit function

This paper designs $h_\phi$ as a $K$-layer MLP (the synthesizer), modulated by a separate modulator network

→ Latent vector $\mathbf{z}$ and time coordinate $t$ are passed through the modulator and synthesizer.

The modulator uses ReLU activations, while the synthesizer uses sinusoidal activations
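A sketch of this modulated MLP; the exact wiring here (the modulator consuming $(\mathbf{z}, t)$, element-wise multiplicative modulation, the synthesizer starting from $\mathbf{z}$) is my assumption of one common design, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn as nn

class ModulatedMLP(nn.Module):
    """K-layer synthesizer with sinusoidal activations, scaled layer-wise
    by a ReLU modulator that consumes the latent z and the time t."""
    def __init__(self, z_dim: int, hidden: int = 128, K: int = 3):
        super().__init__()
        dims_m = [z_dim + 1] + [hidden] * (K - 1)   # modulator input: (z, t)
        dims_s = [z_dim] + [hidden] * (K - 1)       # synthesizer input: z
        self.mod = nn.ModuleList(nn.Linear(d, hidden) for d in dims_m)
        self.syn = nn.ModuleList(nn.Linear(d, hidden) for d in dims_s)
        self.out = nn.Linear(hidden, 3)

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        m = torch.cat([z, t[:, None]], dim=-1)
        h = z
        for mod_k, syn_k in zip(self.mod, self.syn):
            m = torch.relu(mod_k(m))     # ReLU modulator path
            h = torch.sin(syn_k(h)) * m  # sinusoidal activation, modulated
        return self.out(h)               # (r, g, b)

mlp = ModulatedMLP(z_dim=24)
rgb = mlp(torch.randn(1024, 24), torch.rand(1024))
```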

Compression Procedure of NVP

The compression pipeline does not require re-training → it reduces the number of bits while preserving video quality

Main idea: incorporate existing image and video codecs that have shown promise for compressing pixel grids

→ Focus on compressing the latent keyframes $\mathbf{U}_{\theta_{xy}}, \mathbf{U}_{\theta_{xt}}, \mathbf{U}_{\theta_{yt}}$ and the sparse positional features $\mathbf{U}_{\theta_{xyt}}$ (sparse: only a few of the values are non-zero)

→ Specifically, quantize $\mathbf{U}_{\theta_{xyt}}$ and $\mathbf{U}_{\theta_{xy}}, \mathbf{U}_{\theta_{xt}}, \mathbf{U}_{\theta_{yt}}$ as 3D/2D grids of 8-bit latent codes and regard them as video and image pixel grids, where the number of channels becomes the dimension of the latent codes (each code corresponds to one grid cell)

⇒ This procedure significantly reduces the number of parameters while notably maintaining video quality
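A minimal sketch of the 8-bit quantization step that would precede handing the grids to an image/video codec; the min-max scheme is a common choice and an assumption on my part:

```python
import torch

def quantize_8bit(grid: torch.Tensor):
    """Min-max quantize a float latent grid to uint8, keeping the range
    needed to dequantize at decoding time."""
    lo, hi = grid.min(), grid.max()
    q = ((grid - lo) / (hi - lo) * 255).round().to(torch.uint8)
    return q, lo, hi

def dequantize(q: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    return q.float() / 255 * (hi - lo) + lo

U_xy = torch.randn(1, 16, 64, 64)   # a latent keyframe grid
q, lo, hi = quantize_8bit(U_xy)     # uint8 "image" with 16 channels
U_xy_rec = dequantize(q, lo, hi)    # approximate grid used after compression
```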
