Ch 5 Lecture 3

Intuition about SVD

Thinking again about matrix diagonalization

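If A is a symmetric n \times n matrix,

then we can write it as A = PDP^{T} where D is a diagonal matrix of eigenvalues and P is an orthogonal matrix whose columns are eigenvectors of A. (A matrix that is only non-defective can be written as A = PDP^{-1}, but P is then not orthogonal in general.)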
We can find the product of A with a vector \mathbf{x} where \mathbf{x} is a column vector of length n. This product will also be a column vector of length n.
What’s the intuition behind the diagonalization?
The matrix P^{T} is a change of basis matrix. It takes a vector \mathbf{x} expressed in the standard basis and expresses it in terms of the basis of eigenvectors of A.
- This works because P is orthogonal.
The matrix D is a scaling matrix. It scales the eigenvectors by the corresponding eigenvalues. (If A is not full rank, then D will have some zeros on the diagonal.)
The matrix P is the inverse of P^{T} (because P is orthogonal). It just takes the scaled eigenvectors and expresses them in terms of the standard basis.
This all works because P is orthogonal and there’s a correspondence between the rows of P^{T} and the columns of P.
- This inverse projection is what we want because the columns of P are eigenvectors of A.
- We can see this most easily by imagining that \mathbf{x} is a multiple of one of the eigenvectors.
- P^{T}\mathbf{x} will give the coordinates of \mathbf{x} in the basis of eigenvectors. For instance, if P^{T}\mathbf{x}=\begin{bmatrix} 1 \\0\\ \vdots \\ 0\end{bmatrix}, we know that \mathbf{x} equals the first eigenvector.
  - After multiplication by the diagonal matrix, we’ll just have \begin{bmatrix} \lambda_{1} \\0\\ \vdots \\ 0\end{bmatrix}.
- Then P will take this back to a multiple of the first eigenvector, back in the standard basis.
- That’s what we want because we are supposing that \mathbf{x} is a multiple of the first eigenvector.
- The same logic will hold when \mathbf{x} is a weighted sum of the eigenvectors.
  - Because P is orthogonal, P^{T}\mathbf{x} will give the weights of the eigenvectors in the basis of eigenvectors.
- P will take it back to the weighted sum of the eigenvectors in the standard basis, with the new weights being the original weights multiplied by the eigenvalues.
The same logic will work for the SVD, where U and V are orthogonal and there’s a correspondence between the rows of V^{T} and the columns of U.

When we have some zeros on the diagonal, multiplying by D also zeroes out the corresponding coordinates. This removes the components of \mathbf{x} that are in the null space of A, so only the part of \mathbf{x} in the row space of A contributes to A\mathbf{x}.
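The three steps described above (change basis with P^{T}, scale with D, change back with P) can be checked numerically. This is a minimal sketch using NumPy; the symmetric matrix and the vector are made up for illustration and are not from the lecture.

```python
import numpy as np

# Illustrative symmetric matrix and vector (assumptions, not from the lecture).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
x = np.array([3.0, -1.0])

# For a symmetric matrix, eigh returns real eigenvalues and
# orthonormal eigenvectors stored as the columns of P.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

coords = P.T @ x      # step 1: coordinates of x in the eigenvector basis
scaled = D @ coords   # step 2: scale each coordinate by its eigenvalue
result = P @ scaled   # step 3: express the result in the standard basis

print(np.allclose(result, A @ x))     # the three steps reproduce A x
print(np.allclose(P @ D @ P.T, A))    # A = P D P^T
print(np.allclose(P.T @ P, np.eye(2)))  # P is orthogonal
```

All three checks should print `True` for any symmetric A; for a non-symmetric matrix the eigenvectors are generally not orthogonal and P^{T} would have to be replaced by P^{-1}.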

The same logic, applied to a general matrix

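If A is an m \times n matrix (not necessarily square),

then we can write it as A = U\Sigma V^{T} where \Sigma is an m \times n diagonal matrix of singular values, U is an m \times m orthogonal matrix of left singular vectors, and V is an n \times n orthogonal matrix of right singular vectors.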
We can find the product of A with a vector \mathbf{x} where \mathbf{x} is a column vector of length n. This product will now be a column vector of length m.
The matrix V^{T} is a change of basis matrix. It takes a vector \mathbf{x} expressed in the standard basis and expresses it in terms of the basis of right singular vectors of A.
The matrix \Sigma is a scaling matrix. It scales the singular vectors by the corresponding singular values. (If A is not full rank, then \Sigma will have some zeros on the diagonal. If A is not square, then \Sigma will also have extra rows or columns of zeros.)
The matrix U is not generally the inverse of V^{T}. It takes the scaled singular vectors and expresses them in terms of the standard basis of \mathbb{R}^m.
- Specifically, the first column of U must be the correct vector to express the first row of V^{T} in \mathbb{R}^m.
- That is, we need the first column of U to be A\mathbf{v}_1 (divided by the scaling factor).
- This is because \Sigma V^{T}\mathbf{v}_1=\sigma_1\begin{bmatrix} 1 \\0\\ \vdots \\ 0\end{bmatrix}, so that U\Sigma V^{T}\mathbf{v}_1=\sigma_1 U \begin{bmatrix} 1 \\0\\ \vdots \\ 0\end{bmatrix}=\sigma_1\mathbf{u}_1.
- So we’d better have \mathbf{u}_1=A\mathbf{v}_1/\sigma_1.
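The relation \mathbf{u}_j = A\mathbf{v}_j/\sigma_j can be checked against a library SVD. The sketch below uses `np.linalg.svd`, which returns U, the singular values, and V^{T}; the 3 \times 2 matrix is an arbitrary illustration, not one from the lecture.

```python
import numpy as np

# Illustrative 3 x 2 matrix of rank 2 (an assumption, not from the lecture).
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# U is 3 x 3, s holds the 2 singular values, Vt is V^T (2 x 2).
U, s, Vt = np.linalg.svd(A)

for j in range(len(s)):
    v_j = Vt[j]            # row j of V^T is the j-th right singular vector
    u_j = A @ v_j / s[j]   # transform by A, then divide by sigma_j
    print(j, np.allclose(u_j, U[:, j]))
```

Each line should report `True`: the j-th column of U is exactly the image of \mathbf{v}_j under A, rescaled to unit length.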

## SVD

- Assume m \geq n (the case m<n is similar).
- Let B=A^{T} A and let \lambda_{1} \geq \lambda_{2} \geq \cdots \geq \lambda_{n} be the eigenvalues of B. Since B is symmetric and positive semidefinite, these are real and non-negative, so we can write \lambda_{k}=\sigma_{k}^{2} with \sigma_{k} \geq 0.
- Find an orthonormal basis of eigenvectors of B: B \mathbf{v}_{k}=\sigma_{k}^{2} \mathbf{v}_{k}, k=1,2, \ldots, n
- Let V=\left[\mathbf{v}_{1}, \mathbf{v}_{2}, \ldots, \mathbf{v}_{n}\right]. Then V is an orthogonal n \times n matrix.
- We may assume for some index r that \sigma_{r+1}, \sigma_{r+2}, \ldots, \sigma_{n} are zero, while \sigma_{r} \neq 0.
- Next set \mathbf{u}_{j}=\frac{1}{\sigma_{j}} A \mathbf{v}_{j}, j=1,2, \ldots, r.
- In other words, \mathbf{u}_{j} is the vector that you get when you transform \mathbf{v}_{j} by A and then divide it by the singular value \sigma_{j}.
- These are orthonormal vectors in \mathbb{R}^{m} since

\mathbf{u}_{j}^{T} \mathbf{u}_{k}=\frac{1}{\sigma_{j} \sigma_{k}} \mathbf{v}_{j}^{T} A^{T} A \mathbf{v}_{k}=\frac{1}{\sigma_{j} \sigma_{k}} \mathbf{v}_{j}^{T} B \mathbf{v}_{k}=\frac{\sigma_{k}^{2}}{\sigma_{j} \sigma_{k}} \mathbf{v}_{j}^{T} \mathbf{v}_{k}= \begin{cases}0, & \text { if } j \neq k \\ 1, & \text { if } j=k\end{cases}

Now expand this set to an orthonormal basis \mathbf{u}_{1}, \mathbf{u}_{2}, \ldots, \mathbf{u}_{m} of \mathbb{R}^{m}. This is possible by Theorem 4.7 in Section 4.3. Set U=\left[\mathbf{u}_{1}, \mathbf{u}_{2}, \ldots, \mathbf{u}_{m}\right]. This matrix is orthogonal. We calculate that if k>r, then \mathbf{u}_{j}^{T} A \mathbf{v}_{k}=0 since A \mathbf{v}_{k}=\mathbf{0}, and if k \leq r, then

\mathbf{u}_{j}^{T} A \mathbf{v}_{k}=\sigma_{k} \mathbf{u}_{j}^{T} \mathbf{u}_{k}=\left\{\begin{array}{l} 0, \text { if } j \neq k, \\ \sigma_{k}, \text { if } j=k \end{array}\right.

U^{T} A V=\left[\mathbf{u}_{j}^{T} A \mathbf{v}_{k}\right]=\Sigma, which is the desired SVD.

And because U and V are orthogonal, their inverses are their transposes, so A=U \Sigma V^{T}.

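Note also that the columns of U are eigenvectors of A A^{T} and the columns of V are eigenvectors of A^{T} A. If we find them separately this way, we still need to choose the signs so that A \mathbf{v}_{j}=\sigma_{j} \mathbf{u}_{j}.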
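The construction in the proof can be followed step by step in code. This is only a sketch under stated assumptions: it assumes m \geq n as in the notes, the function name `svd_via_eig` and the tolerance `tol` are hypothetical, and QR is used as one convenient way to extend \mathbf{u}_1, \ldots, \mathbf{u}_r to an orthonormal basis (the proof itself only says such an extension exists). In practice a library routine such as `np.linalg.svd` is the better choice numerically.

```python
import numpy as np

def svd_via_eig(A, tol=1e-6):
    """Hypothetical helper: SVD of A (m >= n) built from eigenvectors of A^T A."""
    m, n = A.shape
    # Eigen-decomposition of B = A^T A (symmetric, so eigh applies).
    lam, V = np.linalg.eigh(A.T @ A)
    # eigh returns ascending eigenvalues; reorder to descending.
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    # Singular values are square roots of the (non-negative) eigenvalues.
    sigma = np.sqrt(np.clip(lam, 0.0, None))
    # r = number of singular values treated as nonzero (relative threshold).
    r = int(np.sum(sigma > tol * sigma[0]))
    # u_j = A v_j / sigma_j for j = 1..r (each column divided by its sigma).
    U_r = (A @ V[:, :r]) / sigma[:r]
    # Extend to an orthonormal basis of R^m: QR of [U_r | I] gives an
    # orthogonal Q whose first r columns span the same space as U_r.
    Q, _ = np.linalg.qr(np.hstack([U_r, np.eye(m)]))
    U = Q.copy()
    U[:, :r] = U_r   # keep the exact u_j (QR may flip their signs)
    # Sigma is m x n with the singular values on its diagonal.
    Sigma = np.zeros((m, n))
    Sigma[:n, :n] = np.diag(sigma)
    return U, Sigma, V

# Illustrative matrix (an assumption, not from the lecture).
A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
U, Sigma, V = svd_via_eig(A)
print(np.allclose(U @ Sigma @ V.T, A))   # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3)))   # U is orthogonal
```

Overwriting the first r columns of `Q` with `U_r` keeps the pairing A\mathbf{v}_j = \sigma_j \mathbf{u}_j intact, which is the sign consistency the proof relies on.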

Recap

A=U S V^{T}

where

\begin{array}{lccccc} & & & U & V & S \\ & & &\left(\begin{array}{ccc} \mid & & \mid \\ \mathbf{u}_{1} & \cdots & \mathbf{u}_{m} \\ \mid & & \mid \end{array}\right) &\left(\begin{array}{ccc} \mid & & \mid \\ \mathbf{v}_{1} & \cdots & \mathbf{v}_{n} \\ \mid & & \mid \end{array}\right) &\left(\begin{array}{cccccc} \sigma_1 & & & & & \\ & \sigma_2 & & & & \\ & & \ddots & & & \\ & & & \sigma_r & & \\ & & & & \ddots & \\ & & & & & 0 \end{array}\right)\\ \text{eigenvectors of} & & & A A^{T} & A^{T} A \\ \end{array}

Example

A=\left(\begin{array}{ccc} 3 & 2 & 2 \\ 2 & 3 & -2 \end{array}\right)

\begin{array}{cc} A A^{T}=\left(\begin{array}{cc} 17 & 8 \\ 8 & 17 \end{array}\right) & A^{T} A=\left(\begin{array}{ccc} 13 & 12 & 2 \\ 12 & 13 & -2 \\ 2 & -2 & 8 \end{array}\right) \\ \begin{array}{c} \text { eigenvalues: } \lambda_{1}=25, \lambda_{2}=9 \\ \text { eigenvectors } \end{array} & \begin{array}{c} \text { eigenvalues: } \lambda_{1}=25, \lambda_{2}=9, \lambda_{3}=0 \\ \text { eigenvectors } \end{array} \\ u_{1}=\binom{1 / \sqrt{2}}{1 / \sqrt{2}} \quad u_{2}=\binom{1 / \sqrt{2}}{-1 / \sqrt{2}} & v_{1}=\left(\begin{array}{c} 1 / \sqrt{2} \\ 1 / \sqrt{2} \\ 0 \end{array}\right) \quad v_{2}=\left(\begin{array}{c} 1 / \sqrt{18} \\ -1 / \sqrt{18} \\ 4 / \sqrt{18} \end{array}\right) \quad v_{3}=\left(\begin{array}{c} 2 / 3 \\ -2 / 3 \\ -1 / 3 \end{array}\right) \end{array}

SVD decomposition of A: A=U S V^{T}=\left(\begin{array}{cc} 1 / \sqrt{2} & 1 / \sqrt{2} \\ 1 / \sqrt{2} & -1 / \sqrt{2} \end{array}\right)\left(\begin{array}{ccc} 5 & 0 & 0 \\ 0 & 3 & 0 \end{array}\right)\left(\begin{array}{rrr} 1 / \sqrt{2} & 1 / \sqrt{2} & 0 \\ 1 / \sqrt{18} & -1 / \sqrt{18} & 4 / \sqrt{18} \\ 2 / 3 & -2 / 3 & -1 / 3 \end{array}\right)
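The factors in this example can be multiplied out numerically as a sanity check. The sketch below simply types in the matrices from the example and compares the product with A; it also compares the singular values with those reported by `np.linalg.svd`.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

s2, s18 = np.sqrt(2.0), np.sqrt(18.0)

# Factors copied from the worked example.
U = np.array([[1 / s2,  1 / s2],
              [1 / s2, -1 / s2]])
S = np.array([[5.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
Vt = np.array([[1 / s2,   1 / s2,  0.0],
               [1 / s18, -1 / s18, 4 / s18],
               [2 / 3,   -2 / 3,  -1 / 3]])

print(np.allclose(U @ S @ Vt, A))            # the product recovers A
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # rows of V^T are orthonormal
print(np.linalg.svd(A, compute_uv=False))    # singular values, approximately 5 and 3
```

A library SVD may return some singular vectors with opposite signs; that is still a valid SVD as long as \mathbf{u}_j and \mathbf{v}_j flip together.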

Reformulating SVD

SVD as a sum of rank-one matrices

A=\sigma_{1} u_{1} v_{1}^{T}+\ldots+\sigma_{r} u_{r} v_{r}^{T}

Back to our example

A=\left(\begin{array}{ccc} 3 & 2 & 2 \\ 2 & 3 & -2 \end{array}\right)

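Can decompose as

A=5 u_{1} v_{1}^{T}+3 u_{2} v_{2}^{T}=\left(\begin{array}{ccc} 5/2 & 5/2 & 0 \\ 5/2 & 5/2 & 0 \end{array}\right)+\left(\begin{array}{ccc} 1/2 & -1/2 & 2 \\ -1/2 & 1/2 & -2 \end{array}\right)=\left(\begin{array}{ccc}3 & 2 & 2 \\ 2 & 3 & -2\end{array}\right)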
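The rank-one form can be checked for the example matrix. This sketch builds each term \sigma_k u_k v_k^{T} with `np.outer` from the output of `np.linalg.svd` and sums them; the variable names are illustrative.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# Reduced SVD: U is 2 x 2, s has 2 entries, Vt is 2 x 3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# One rank-one matrix per nonzero singular value.
terms = [s[k] * np.outer(U[:, k], Vt[k]) for k in range(len(s))]

for k, term in enumerate(terms):
    print(f"term {k + 1} (rank {np.linalg.matrix_rank(term)}):")
    print(np.round(term, 6))

print(np.allclose(sum(terms), A))   # the terms add up to A
```

Each term is unaffected by the sign ambiguity of the singular vectors, because u_k and v_k change sign together.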

Uses of SVD

Moore-Penrose pseudoinverse

If we have a linear system

\begin{aligned} A x & =b \\ \end{aligned}

and A is invertible, then we can solve for x:

\begin{aligned} x & =A^{-1} b \end{aligned}

If A is not invertible, we can define instead a pseudoinverse A^{+}

Define A^{+} in order to minimize the least squares error:

\left\|\mathbf{A} \mathbf{A}^{+}-\mathbf{I}_{\mathbf{n}}\right\|_{2}

Then we can estimate x as

\begin{aligned} A x & =b \\ x & \approx A^{+} b \end{aligned}
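As a numerical illustration of x \approx A^{+} b, the sketch below computes a pseudoinverse from the SVD using the standard formula A^{+}=V \Sigma^{+} U^{T}, where \Sigma^{+} inverts the nonzero singular values. That formula is a well-known result stated here as an assumption, since these notes do not derive it at this point; the small overdetermined system is made up for illustration.

```python
import numpy as np

# Illustrative overdetermined system (an assumption, not from the lecture).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Reduced SVD: U is 3 x 2, s has 2 entries, Vt is 2 x 2.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Invert only the singular values that are safely nonzero.
s_inv = np.array([1.0 / x if x > 1e-12 else 0.0 for x in s])
A_plus = Vt.T @ np.diag(s_inv) @ U.T

x_hat = A_plus @ b

# Compare with NumPy's built-in pseudoinverse and least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A_plus, np.linalg.pinv(A)))
print(np.allclose(x_hat, x_lstsq))
```

Here A has no exact inverse, so `x_hat` is the least-squares estimate rather than an exact solution of A x = b.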

Finding the form of the pseudoinverse

Example

Matrices as data

Example: Height and weight

A^T=\left[\begin{array}{rrrrrrrrrrrr}2.9 & -1.5 & 0.1 & -1.0 & 2.1 & -4.0 & -2.0 & 2.2 & 0.2 & 2.0 & 1.5 & -2.5 \\ 4.0 & -0.9 & 0.0 & -1.0 & 3.0 & -5.0 & -3.5 & 2.6 & 1.0 & 3.5 & 1.0 & -4.7\end{array}\right]

Covariance matrix

Covariance: \begin{aligned} \sigma_{a b}^{2} & =\operatorname{cov}(a, b)=\mathrm{E}[(a-\bar{a})(b-\bar{b})] \\ \sigma_{a}^{2} & =\operatorname{var}(a)=\operatorname{cov}(a, a)=\mathrm{E}\left[(a-\bar{a})^{2}\right] \end{aligned}

Covariance matrix:

\mathbf{\Sigma}=\left(\begin{array}{cccc} E\left[\left(x_{1}-\mu_{1}\right)\left(x_{1}-\mu_{1}\right)\right] & E\left[\left(x_{1}-\mu_{1}\right)\left(x_{2}-\mu_{2}\right)\right] & \ldots & E\left[\left(x_{1}-\mu_{1}\right)\left(x_{p}-\mu_{p}\right)\right] \\ E\left[\left(x_{2}-\mu_{2}\right)\left(x_{1}-\mu_{1}\right)\right] & E\left[\left(x_{2}-\mu_{2}\right)\left(x_{2}-\mu_{2}\right)\right] & \ldots & E\left[\left(x_{2}-\mu_{2}\right)\left(x_{p}-\mu_{p}\right)\right] \\ \vdots & \vdots & \ddots & \vdots \\ E\left[\left(x_{p}-\mu_{p}\right)\left(x_{1}-\mu_{1}\right)\right] & E\left[\left(x_{p}-\mu_{p}\right)\left(x_{2}-\mu_{2}\right)\right] & \ldots & E\left[\left(x_{p}-\mu_{p}\right)\left(x_{p}-\mu_{p}\right)\right] \end{array}\right)

\Sigma=\mathrm{E}\left[(X-\bar{X})(X-\bar{X})^{\mathrm{T}}\right]

\Sigma=\frac{X X^{\mathrm{T}}}{n} \quad \text { (if } X \text { is already zero centered) }

For our dataset,

\text { Sample covariance } S^{2}=\frac{A^{T} A}{(\mathrm{n}-1)}=\frac{1}{11}\left[\begin{array}{rr} 53.46 & 73.42 \\ 73.42 & 107.16 \end{array}\right]
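The sample covariance above can be reproduced directly from the data matrix. This sketch types in the height and weight values from the example (already zero centered) and compares A^{T} A /(n-1) with `np.cov`; the variable names are illustrative.

```python
import numpy as np

# A^T from the example: 2 variables (rows) by 12 observations (columns).
data = np.array([
    [2.9, -1.5, 0.1, -1.0, 2.1, -4.0, -2.0, 2.2, 0.2, 2.0, 1.5, -2.5],
    [4.0, -0.9, 0.0, -1.0, 3.0, -5.0, -3.5, 2.6, 1.0, 3.5, 1.0, -4.7],
])
A = data.T            # 12 x 2: one row per observation
n = A.shape[0]

print(np.allclose(A.mean(axis=0), 0.0))   # columns are already zero centered

gram = A.T @ A                 # 2 x 2 matrix of sums of products
S2 = gram / (n - 1)            # sample covariance
print(np.round(gram, 2))
print(np.round(S2, 4))

# np.cov centers the data itself; rowvar=False means columns are variables.
print(np.allclose(S2, np.cov(A, rowvar=False)))
```

Because the columns already have mean zero, dividing A^{T} A by n-1 gives the same matrix that `np.cov` computes after its own centering step.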

Plotting

The V^{T} matrix (each row is a column of V, i.e. a right singular vector):

[[-0.57294952 -0.81959066]
 [ 0.81959066 -0.57294952]]

Walk through why these vectors are actually the eigenvectors of the covariance matrix.
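One way to see it: if A=U \Sigma V^{T}, then A^{T} A=V \Sigma^{T} U^{T} U \Sigma V^{T}=V\left(\Sigma^{T} \Sigma\right) V^{T}, so each \mathbf{v}_k satisfies A^{T} A \mathbf{v}_k=\sigma_k^{2} \mathbf{v}_k. Dividing by n-1 does not change the eigenvectors, so the \mathbf{v}_k are also eigenvectors of the sample covariance, with eigenvalues \sigma_k^{2}/(n-1). The sketch below checks this numerically for the height and weight data; the variable names are illustrative.

```python
import numpy as np

# A^T from the example: 2 variables (rows) by 12 observations (columns).
data = np.array([
    [2.9, -1.5, 0.1, -1.0, 2.1, -4.0, -2.0, 2.2, 0.2, 2.0, 1.5, -2.5],
    [4.0, -0.9, 0.0, -1.0, 3.0, -5.0, -3.5, 2.6, 1.0, 3.5, 1.0, -4.7],
])
A = data.T
n = A.shape[0]
S2 = A.T @ A / (n - 1)          # sample covariance (data are zero centered)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(len(s)):
    v_k = Vt[k]                 # k-th right singular vector (row of V^T)
    eigval = s[k] ** 2 / (n - 1)
    # v_k is an eigenvector of S2 with eigenvalue sigma_k^2 / (n - 1)
    print(k, np.allclose(S2 @ v_k, eigval * v_k))
```

The signs of the singular vectors depend on the library, so the printed V^{T} may differ from another run by a factor of -1 in a row; the eigenvector check is unaffected by that.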

The components of the data matrix

The columns of the U matrix

What if we had a different orthogonal matrix for V, one whose columns weren’t eigenvectors of the covariance matrix?

[[4.86       6.67454545]
 [6.67454545 9.74181818]]

The columns of the U matrix are no longer orthogonal.
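This can be seen numerically. If we insist on A=U^{\prime} \Sigma^{\prime} V^{\prime T} for some other orthogonal V^{\prime}, then U^{\prime} \Sigma^{\prime}=A V^{\prime}, so the columns of U^{\prime} are rescaled columns of A V^{\prime}. The sketch below compares the Gram matrix of A V (using the SVD's V) with that of A V^{\prime}, where V^{\prime} is a rotation by 30 degrees chosen arbitrarily for illustration.

```python
import numpy as np

# A^T from the example: 2 variables (rows) by 12 observations (columns).
data = np.array([
    [2.9, -1.5, 0.1, -1.0, 2.1, -4.0, -2.0, 2.2, 0.2, 2.0, 1.5, -2.5],
    [4.0, -0.9, 0.0, -1.0, 3.0, -5.0, -3.5, 2.6, 1.0, 3.5, 1.0, -4.7],
])
A = data.T

_, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# An arbitrary orthogonal matrix (rotation by 30 degrees); an assumption
# made only for illustration.
theta = np.deg2rad(30.0)
V_alt = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

# Off-diagonal entries of these Gram matrices are the dot products
# between the two columns of A V and of A V_alt.
gram_svd = (A @ V).T @ (A @ V)
gram_alt = (A @ V_alt).T @ (A @ V_alt)

print(np.round(gram_svd, 6))   # diagonal: columns of A V are orthogonal
print(np.round(gram_alt, 6))   # nonzero off-diagonal: columns are not orthogonal
```

The Gram matrix of A V^{\prime} equals V^{\prime T} A^{T} A V^{\prime}, which is diagonal only when the columns of V^{\prime} are eigenvectors of A^{T} A.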

The U matrix as a heat plot