Singular Value Decomposition (SVD)

$$X = U \Sigma V^\top$$

where $x_1, \dots, x_m$ denote the columns of $X \in \mathbb{R}^{n \times m}$, which can be any data flattened into column vectors.
For example, an individual $x_k$ could be an image, or a snapshot of some physical process through time.

$U \in \mathbb{R}^{n \times n}$ holds the left singular vectors: eigenvectors which form the eigenbasis for the column space of $X$. These columns are ordered by the amount of variance they explain in the data (“importance”). They are the principal components of the data, containing information about the column space of $X$.
$V \in \mathbb{R}^{m \times m}$ holds the right singular vectors. They tell us the combination of the left singular vectors that makes up each of the original data points (an “eigenmixture”), containing information about the row space of $X$.
$\Sigma \in \mathbb{R}^{n \times m}$ is a diagonal matrix containing the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on the diagonal in descending order. They tell us the importance of each singular vector in the decomposition.
$U$ and $V$ are unitary, so $U U^\top = U^\top U = I_n$ and $V V^\top = V^\top V = I_m$.

This SVD is guaranteed to exist for any matrix and is unique up to the signs of the singular vectors (flipping $u_k$ and $v_k$ together), but that doesn’t matter since the spanned subspaces are still the same.
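
A quick NumPy sketch of these properties (the $6 \times 4$ shape and random data are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))  # n=6, m=4: four data points as columns

# Full SVD: U is n x n, Vt is m x m, s holds the singular values (descending).
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the rectangular n x m diagonal matrix Sigma from the singular values.
Sigma = np.zeros_like(X)
np.fill_diagonal(Sigma, s)

assert np.allclose(U @ Sigma @ Vt, X)     # X = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(6))    # U is unitary
assert np.allclose(Vt @ Vt.T, np.eye(4))  # V is unitary
```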

“Economy SVD”

A more efficient (and mathematically equivalent) version of the SVD when $n \gg m$ (e.g. when we have a few hundred or thousand megapixel images), where we only compute the meaningful components:

$$X = \hat{U} \hat{\Sigma} V^\top$$

where $\hat{U} \in \mathbb{R}^{n \times m}$ contains only the first $m$ columns of $U$, $\hat{\Sigma} \in \mathbb{R}^{m \times m}$ contains only the first $m$ singular values[^1], and $V$ is unchanged; instead of $n$ columns, we only keep $m$ columns in $\hat{U}$ and $\hat{\Sigma}$.

We can do this because $X$ has at most $m$ linearly independent columns, which means that every column of $X$ is a linear combination of the first $m$ columns of $U$ – the eigenvectors; the remaining $n - m$ columns of $U$ only meet zero rows of $\Sigma$ and never contribute to the product.

While these trailing columns are not meaningful for representing $X$, without them the orthonormal basis is not complete[^1], i.e. $\hat{U}$ is no longer unitary:

$$\hat{U}^\top \hat{U} = I_m \quad \text{but} \quad \hat{U} \hat{U}^\top \neq I_n$$
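
In NumPy the economy SVD corresponds to `full_matrices=False`; a minimal sketch (shapes again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))  # n >> m

U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U_hat.shape, s.shape, Vt.shape)  # (1000, 5) (5,) (5, 5)

assert np.allclose(U_hat * s @ Vt, X)           # X = U_hat Sigma_hat V^T still holds
assert np.allclose(U_hat.T @ U_hat, np.eye(5))  # columns are still orthonormal,
# ...but U_hat @ U_hat.T is a 1000 x 1000 projection onto col(X), not the identity.
```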

SVD as Sum of Outer Products

The SVD can be written as a sum of rank-1 matrices formed by outer products of corresponding singular vectors, weighted by singular values:

$$X = \sum_{k=1}^{m} \sigma_k u_k v_k^\top = \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top + \dots + \sigma_m u_m v_m^\top$$

Each term $\sigma_k u_k v_k^\top$ is a rank-1 matrix by construction – fully determined by a single column $u_k$ and a single row $v_k^\top$ – and captures a fundamental (orthogonal) direction of variation in the data, weighted by the corresponding singular value $\sigma_k$.

This sum of rank-1 matrices increasingly improves the approximation of $X$ as terms are added (like denoising![^2]).
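
A short sketch of this accumulation on random data, where the reconstruction error shrinks with every added rank-1 term:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Accumulate sigma_k * u_k v_k^T and watch the residual shrink.
approx = np.zeros_like(X)
for k in range(len(s)):
    approx += s[k] * np.outer(U[:, k], Vt[k, :])
    print(k + 1, np.linalg.norm(X - approx))  # monotonically decreasing error
```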

Theorem

The ordering by importance allows us to approximate the data matrix $X$ by truncating the singular value matrix to a smaller rank $r$:

$$\tilde{X} = \tilde{U} \tilde{\Sigma} \tilde{V}^\top = \sum_{k=1}^{r} \sigma_k u_k v_k^\top$$

The Eckart–Young theorem tells us that the best possible approximation with a given rank $r$ is exactly this truncation of the SVD:

$$\operatorname*{argmin}_{\tilde{X},\ \operatorname{rank}(\tilde{X}) = r} \| X - \tilde{X} \|_F = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$$

$\| \cdot \|_F$ is the Frobenius norm (the square root of the sum of squares of all elements).
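
A sketch of the truncation on a synthetic low-rank-plus-noise matrix; it also checks the known identity that the Frobenius error of the truncation equals the root of the sum of the discarded squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-2 signal plus a little noise.
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 50))
X += 0.01 * rng.standard_normal((100, 50))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
X_r = U[:, :r] * s[:r] @ Vt[:r, :]  # best rank-r approximation (Eckart-Young)

err = np.linalg.norm(X - X_r, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[r:] ** 2)))
```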

Intuitive interpretation:


$$X^\top X = V \Sigma^\top U^\top U \Sigma V^\top = V \Sigma^2 V^\top$$

($U$’s columns are still orthogonal, so $U^\top U = I$.)
This is a diagonalization of $X^\top X$ with the eigenvectors $V$ and eigenvalues $\Sigma^2$!
→ The columns of $V$ are the eigenvectors of the second-moment column-wise “correlation matrix” $X^\top X$, and the singular values are the square roots of the eigenvalues of $X^\top X$, quantifying the importance of each eigenvector for the correlation matrix.
Similarly, $X X^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma^2 U^\top$.
→ The columns of $U$ are the eigenvectors of the second-moment row-wise “correlation matrix” $X X^\top$.
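
A numerical check of this relationship on random data (eigenvectors are compared up to sign, since signs are not unique):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of the column-wise correlation matrix X^T X.
evals, evecs = np.linalg.eigh(X.T @ X)      # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

assert np.allclose(np.sqrt(evals), s)            # sigma_k = sqrt(lambda_k)
assert np.allclose(np.abs(evecs), np.abs(Vt.T))  # eigenvectors = columns of V
```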

SVDs are a data-driven generalization of Fourier transforms, where the unitary matrices $U$ and $V$ rotate the data into another coordinate system where things are simpler (principal directions that explain most of the column / row space).

Geometric interpretation:
$\Sigma$ turns the unit sphere into an ellipsoid (stretching / squishing along the principal axes).
$U$ and $V^\top$ are the parts that rotate the data (and a rectangular $\Sigma$ potentially squashes it onto lower dimensions).
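
A small sketch with an arbitrary 2 x 2 matrix, mapping the unit circle to an ellipse whose semi-axes are the singular values:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

# Points on the unit circle...
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])

# ...are rotated by V^T, stretched by Sigma into an ellipse with semi-axes
# s[0] and s[1], then rotated again by U.
ellipse = U @ np.diag(s) @ Vt @ circle
assert np.allclose(ellipse, A @ circle)

# All mapped points lie between the smallest and largest singular value.
radii = np.linalg.norm(ellipse, axis=0)
assert s[1] - 1e-9 <= radii.min() and radii.max() <= s[0] + 1e-9
```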

The SVD allows us to generalize solving linear systems of equations to non-square matrices: the Moore–Penrose pseudoinverse $X^+ = V \Sigma^+ U^\top$ (invert each nonzero singular value) yields the least-squares solution $\tilde{x} = X^+ b$ of $X \tilde{x} = b$.
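
A sketch of this for an overdetermined system with random data, cross-checked against NumPy's built-in `pinv` and `lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))  # more equations than unknowns
b = rng.standard_normal(10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Pseudoinverse solution x = V Sigma^-1 U^T b (invert the nonzero singular values).
x = Vt.T @ ((U.T @ b) / s)

assert np.allclose(x, np.linalg.pinv(A) @ b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```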

Footnotes

[^1]: For $n > m$ this is simply omitting the zero padding of $\Sigma$, which was only necessary to be able to multiply $U \Sigma$ and cancelled these columns of $U$ anyways.

[^2]: You can think of denoising with the SVD as the “reverse” process, approximating the clean signal: for a noisy data matrix $X$, we omit smaller singular values not to compress, but because the noise typically contributes more to smaller singular values and has less structured patterns than the true signal.