Singular Value Decomposition (SVD)

$$X = U \Sigma V^\top$$

where $x_1, \dots, x_m$ denote the columns of $X \in \mathbb{R}^{n \times m}$, which can be any data flattened into column vectors.
For example, an individual $x_k$ could be an image, or a snapshot of some physical process through time.

$U \in \mathbb{R}^{n \times n}$ holds the left singular vectors: eigenvectors which form the eigenbasis for the column space of $X$. These columns are ordered by the amount of variance they explain in the data (“importance”). They are the principal components of the data, containing information about the column space of $X$.
$V \in \mathbb{R}^{m \times m}$ holds the right singular vectors. They tell us the combination of the left singular vectors that makes up each of the original data points (an “eigenmixture”), containing information about the row space of $X$.
$\Sigma \in \mathbb{R}^{n \times m}$ is a diagonal matrix containing the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on the diagonal in descending order. They tell us the importance of each singular vector in the decomposition.
$U$ and $V$ are unitary, so $U U^\top = U^\top U = I_n$ and $V V^\top = V^\top V = I_m$.

This SVD is guaranteed to exist for any matrix and is unique up to the signs of the singular vectors (flipping $u_k$ and $v_k$ together), but that doesn’t matter since the spanned subspaces are still the same.
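
A quick NumPy sketch of these properties (the $6 \times 4$ shape and random data are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))  # n=6, m=4: four data points as columns

# Full SVD: U is n x n, Vt is m x m, s holds the singular values (descending).
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the rectangular n x m diagonal matrix Sigma from the singular values.
Sigma = np.zeros_like(X)
np.fill_diagonal(Sigma, s)

assert np.allclose(U @ Sigma @ Vt, X)     # X = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(6))    # U is unitary
assert np.allclose(Vt @ Vt.T, np.eye(4))  # V is unitary
```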

“Economy SVD”

A more efficient (and mathematically equivalent) version of the SVD when $n \gg m$ (e.g. when we have a few hundred or thousand megapixel images), where we only compute the meaningful components:

$$X = \hat{U} \hat{\Sigma} V^\top$$

where $\hat{U} \in \mathbb{R}^{n \times m}$ contains only the first $m$ columns of $U$, $\hat{\Sigma} \in \mathbb{R}^{m \times m}$ contains only the first $m$ singular values[^1], and $V$ is unchanged; instead of $n$ columns, we only keep $m$ columns in $\hat{U}$ and $\hat{\Sigma}$.

We can do this because $X$ has at most $m$ linearly independent columns, which means that every column of $X$ is a linear combination of the first $m$ columns of $U$ – the eigenvectors; the remaining $n - m$ columns of $U$ only meet zero rows of $\Sigma$ and never contribute to the product.

While these trailing columns are not meaningful for representing $X$, without them the orthonormal basis is not complete[^1], i.e. $\hat{U}$ is no longer unitary:

$$\hat{U}^\top \hat{U} = I_m \quad \text{but} \quad \hat{U} \hat{U}^\top \neq I_n$$
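
In NumPy the economy SVD corresponds to `full_matrices=False`; a minimal sketch (shapes again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))  # n >> m

U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)
print(U_hat.shape, s.shape, Vt.shape)  # (1000, 5) (5,) (5, 5)

assert np.allclose(U_hat * s @ Vt, X)           # X = U_hat Sigma_hat V^T still holds
assert np.allclose(U_hat.T @ U_hat, np.eye(5))  # columns are still orthonormal,
# ...but U_hat @ U_hat.T is a 1000 x 1000 projection onto col(X), not the identity.
```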

SVD as Sum of Outer Products

The SVD can be written as a sum of rank-1 matrices formed by outer products of corresponding singular vectors, weighted by singular values:

$$X = \sum_{k=1}^{m} \sigma_k u_k v_k^\top = \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top + \dots + \sigma_m u_m v_m^\top$$

Each term $\sigma_k u_k v_k^\top$ is a rank-1 matrix by construction – fully determined by a single column $u_k$ and a single row $v_k^\top$ – and captures a fundamental (orthogonal) direction of variation in the data, weighted by the corresponding singular value $\sigma_k$.

This sum of rank-1 matrices increasingly improves the approximation of $X$ as terms are added (like denoising![^2]).
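
A short sketch of this accumulation on random data, where the reconstruction error shrinks with every added rank-1 term:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Accumulate sigma_k * u_k v_k^T and watch the residual shrink.
approx = np.zeros_like(X)
for k in range(len(s)):
    approx += s[k] * np.outer(U[:, k], Vt[k, :])
    print(k + 1, np.linalg.norm(X - approx))  # monotonically decreasing error
```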

Theorem

The ordering by importance allows us to approximate the data matrix $X$ by truncating the singular value matrix to a smaller rank $r$:

$$\tilde{X} = \tilde{U} \tilde{\Sigma} \tilde{V}^\top = \sum_{k=1}^{r} \sigma_k u_k v_k^\top$$

The Eckart–Young theorem tells us that the best possible approximation with a given rank $r$ is exactly this truncation of the SVD:

$$\operatorname*{argmin}_{\tilde{X},\ \operatorname{rank}(\tilde{X}) = r} \| X - \tilde{X} \|_F = \tilde{U} \tilde{\Sigma} \tilde{V}^\top$$

$\| \cdot \|_F$ is the Frobenius norm (the square root of the sum of squares of all elements).
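
A sketch of the truncation on a synthetic low-rank-plus-noise matrix; it also checks the known identity that the Frobenius error of the truncation equals the root of the sum of the discarded squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-2 signal plus a little noise.
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 50))
X += 0.01 * rng.standard_normal((100, 50))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
X_r = U[:, :r] * s[:r] @ Vt[:r, :]  # best rank-r approximation (Eckart-Young)

err = np.linalg.norm(X - X_r, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[r:] ** 2)))
```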

Intuitive interpretation:


$$X^\top X = V \Sigma^\top U^\top U \Sigma V^\top = V \Sigma^2 V^\top$$

($U$’s columns are still orthogonal, so $U^\top U = I$.)
This is a diagonalization of $X^\top X$ with the eigenvectors $V$ and eigenvalues $\Sigma^2$!
→ The columns of $V$ are the eigenvectors of the second-moment column-wise “correlation matrix” $X^\top X$, and the singular values are the square roots of the eigenvalues of $X^\top X$, quantifying the importance of each eigenvector for the correlation matrix.
Similarly, $X X^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma^2 U^\top$.
→ The columns of $U$ are the eigenvectors of the second-moment row-wise “correlation matrix” $X X^\top$.
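
A numerical check of this relationship on random data (eigenvectors are compared up to sign, since signs are not unique):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of the column-wise correlation matrix X^T X.
evals, evecs = np.linalg.eigh(X.T @ X)      # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

assert np.allclose(np.sqrt(evals), s)            # sigma_k = sqrt(lambda_k)
assert np.allclose(np.abs(evecs), np.abs(Vt.T))  # eigenvectors = columns of V
```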

SVDs are a data-driven generalization of Fourier transforms, where the unitary matrices $U$ and $V$ rotate the data into another coordinate system where things are simpler (principal directions that explain most of the column / row space).

Geometric interpretation:
$\Sigma$ turns the unit sphere into an ellipsoid (stretching / squishing along the principal axes).
$U$ and $V^\top$ are the parts that rotate the data (and a rectangular $\Sigma$ potentially squashes it onto lower dimensions).
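
A small sketch with an arbitrary 2 x 2 matrix, mapping the unit circle to an ellipse whose semi-axes are the singular values:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

# Points on the unit circle...
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])

# ...are rotated by V^T, stretched by Sigma into an ellipse with semi-axes
# s[0] and s[1], then rotated again by U.
ellipse = U @ np.diag(s) @ Vt @ circle
assert np.allclose(ellipse, A @ circle)

# All mapped points lie between the smallest and largest singular value.
radii = np.linalg.norm(ellipse, axis=0)
assert s[1] - 1e-9 <= radii.min() and radii.max() <= s[0] + 1e-9
```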

The SVD allows us to generalize solving linear systems of equations to non-square matrices: the Moore–Penrose pseudoinverse $X^+ = V \Sigma^+ U^\top$ (invert each nonzero singular value) yields the least-squares solution $\tilde{x} = X^+ b$ of $X \tilde{x} = b$.
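
A sketch of this for an overdetermined system with random data, cross-checked against NumPy's built-in `pinv` and `lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))  # more equations than unknowns
b = rng.standard_normal(10)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Pseudoinverse solution x = V Sigma^-1 U^T b (invert the nonzero singular values).
x = Vt.T @ ((U.T @ b) / s)

assert np.allclose(x, np.linalg.pinv(A) @ b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```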

Footnotes

[^1]: For $n > m$ this is simply omitting the zero padding of $\Sigma$, which was only necessary to be able to multiply $U \Sigma$ and cancelled these columns of $U$ anyways.

[^2]: You can think of denoising with the SVD as the “reverse” process, approximating the clean signal: for a noisy data matrix $X$, we omit smaller singular values not to compress, but because the noise typically contributes more to smaller singular values and has less structured patterns than the true signal.