Ch5 Lecture 5

Bridge

Bridging from SVD to PCA

In the previous lecture, we used the SVD to understand the geometry of a dataset.

We thought about data as a cloud of points and asked:

  • which directions mix together,
  • which directions remain pure,
  • and how the covariance matrix reveals the natural coordinate system of the data.

The singular vectors gave us something powerful:

a set of directions where variation in the data becomes decoupled.


A different question

Instead of asking

What directions describe the intrinsic geometry of the data?

Suppose we ask now:

If I want to look at this high‑dimensional cloud, which direction should I look from?

In high dimensions we cannot see the cloud directly. The only thing we can do is project it onto lower‑dimensional spaces.

Choosing a direction \mathbf{u} means:

  • project each data point onto that direction,
  • look at the shadow the data casts there.

So now directions become viewpoints.


Good and bad viewpoints

Some viewing directions are uninformative:

  • all projected points pile together,
  • the structure of the data disappears.

Other directions reveal structure:

  • points spread out,
  • patterns become visible.

The natural goal becomes:

Choose viewing directions where the projected data spreads out as much as possible.

This point of view is called PCA.

Principal Component Analysis asks:

Which viewing direction makes the data appear maximally spread out?

We measure this spread using variance of the projected points.

Let’s use some different notation for a minute.

Suppose our centered data matrix is X, with each row representing an observation and each column representing a variable.

If we project onto a unit direction \mathbf{u}, we get one coordinate per observation, which we can stack into a vector \mathbf{\psi}.

\mathbf{\psi} = X \mathbf{u}

What is the variance of the projected data?

Since the data has been centered, the variance is given by

\frac{1}{n} \sum_{i=1}^n \psi_i^2 = \frac{1}{n} \mathbf{\psi}^T \mathbf{\psi}

We can rewrite this in terms of the original data matrix as

\frac{1}{n} \mathbf{\psi}^T \mathbf{\psi} = \frac{1}{n} \mathbf{u}^T X^T X \mathbf{u}

We recognize \frac{1}{n} X^T X as the covariance matrix of the data.

So the variance of the projected data is given by

\mathbf{u}^T C \mathbf{u},

where C = \frac{1}{n} X^T X is the covariance matrix of X.

Remember, we want to find the direction \mathbf{u} that maximizes the variance of the projected data.

So we want the direction \mathbf{u} that maximizes \mathbf{u}^T C \mathbf{u}.
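The identity above is easy to check numerically. Here is a minimal sketch (with randomly generated toy data) verifying that the variance of the projected coordinates \frac{1}{n} \mathbf{\psi}^T \mathbf{\psi} equals the quadratic form \mathbf{u}^T C \mathbf{u}:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)          # center the data

C = X.T @ X / X.shape[0]        # covariance matrix C = (1/n) X^T X

u = np.array([1.0, 0.0, 0.0])   # a candidate unit direction
psi = X @ u                     # projected coordinate of each observation

var_direct = np.mean(psi**2)    # (1/n) psi^T psi
var_quadratic = u @ C @ u       # u^T C u

# the two expressions for the projected variance agree
assert np.isclose(var_direct, var_quadratic)
```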

Connection

We already solved this problem — just without calling it PCA.

The direction that maximizes \mathbf{u}^T C \mathbf{u} is the eigenvector of C with the largest eigenvalue.

And of course, the eigenvectors of the covariance matrix are the same as the singular vectors of the centered data matrix.
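A quick numerical sketch (toy data) of this connection: the top eigenvector of C matches the first right singular vector of the centered X up to sign, and the top eigenvalue is \sigma_1^2 / n.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
n = X.shape[0]

C = X.T @ X / n

# eigendecomposition of the covariance matrix (eigh returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(C)
top_eigvec = eigvecs[:, -1]

# SVD of the centered data matrix (singular values come out descending)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]

# same direction up to sign, and lambda_1 = sigma_1^2 / n
assert np.isclose(abs(top_eigvec @ v1), 1.0)
assert np.isclose(eigvals[-1], S[0]**2 / n)
```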

Now we reinterpret them:

  • before: they were intrinsic geometric directions of the data cloud,
  • now: they are the best viewing directions for seeing structure.

PCA is simply the viewing interpretation of what SVD already discovered.


Conceptual summary

SVD told us the coordinate system in which the data naturally wants to be described.

PCA asks how we should look at the data to understand it.

They are two perspectives on the same object:

  • SVD — discover the natural axes of variation.
  • PCA — view the data along those axes.

It is the same geometry, seen through the lens of projection and visualization.

High-dimensional data: variance and projection

Data as points in \mathbb{R}^p, row vs column views, and choosing directions to visualize.

How spread out is the data along a particular direction?

Suppose we have n data points in p dimensions. We can represent the data as a matrix X of size n \times p. The data points are represented as rows in the matrix, and we have subtracted the mean along each dimension from the data.
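As a minimal sketch of this setup (with made-up numbers): rows are observations, columns are variables, and we subtract each column's mean so every variable has mean zero.

```python
import numpy as np

# toy data: n = 5 observations (rows), p = 3 variables (columns)
X_raw = np.array([[ 2.0, 10.0, 1.0],
                  [ 4.0, 12.0, 0.0],
                  [ 6.0,  8.0, 2.0],
                  [ 8.0, 14.0, 1.0],
                  [10.0,  6.0, 1.0]])

# center: subtract the mean along each dimension (column)
X = X_raw - X_raw.mean(axis=0)

assert np.allclose(X.mean(axis=0), 0.0)
```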

Visualizing the high-dimensional data

city region price_index_no_rent price_index_with_rent gross_salaries net_salaries work_hours_year paid_vacations_year gross_buying_power ... mechanic construction_worker metalworker cook_chef factory_manager engineer bank_clerk executive_secretary salesperson textile_worker
0 Amsterdam91 Amsterdam Central Europe 65.6 65.7 56.9 49.0 1714.0 31.9 86.7 ... 11924.0 12661.0 14536.0 14402.0 25924.0 24786.0 14871.0 14871.0 11857.0 10852.0
1 Athenes91 Athens Southern Europe 53.8 55.6 30.2 30.4 1792.0 23.5 56.1 ... 8574.0 9847.0 14402.0 14068.0 13800.0 14804.0 9914.0 6900.0 4555.0 5761.0
2 Bogota91 Bogota South America 37.9 39.3 10.1 11.5 2152.0 17.4 26.6 ... 4354.0 1206.0 4823.0 13934.0 12192.0 12259.0 2345.0 5024.0 2278.0 2814.0
3 Bombay91 Mumbai South Asia and Australia 30.3 39.9 6.0 5.3 2052.0 30.6 19.9 ... 1809.0 737.0 2479.0 2412.0 3751.0 2880.0 2345.0 1809.0 1072.0 1206.0
4 Bruxelles91 Brussels Central Europe 73.8 72.2 68.2 50.5 1708.0 24.6 92.4 ... 10450.0 12192.0 17350.0 19159.0 31016.0 24518.0 19293.0 13800.0 10718.0 10182.0

5 rows × 41 columns

Focusing on 12 variables

We might choose to focus on only 12 (!) of the 41 variables in the dataset, corresponding to the average wages of workers in 12 specific occupations in each city.

city teacher bus_driver mechanic construction_worker metalworker cook_chef factory_manager engineer bank_clerk executive_secretary salesperson textile_worker
0 Amsterdam 15608.0 17819.0 11924.0 12661.0 14536.0 14402.0 25924.0 24786.0 14871.0 14871.0 11857.0 10852.0
1 Athens 7972.0 9445.0 8574.0 9847.0 14402.0 14068.0 13800.0 14804.0 9914.0 6900.0 4555.0 5761.0
2 Bogota 2144.0 2412.0 4354.0 1206.0 4823.0 13934.0 12192.0 12259.0 2345.0 5024.0 2278.0 2814.0
3 Mumbai 1005.0 1340.0 1809.0 737.0 2479.0 2412.0 3751.0 2880.0 2345.0 1809.0 1072.0 1206.0
4 Brussels 14001.0 14068.0 10450.0 12192.0 17350.0 19159.0 31016.0 24518.0 19293.0 13800.0 10718.0 10182.0

How can we think about the data in this 12-dimensional space?

Clouds of row-points

Clouds of column-points

Projection onto fewer dimensions

To visualize data, we need to project it onto 2d (or 3d) subspaces. But which ones?

These are all equivalent:

  • maximize variance of projected data

  • minimize squared distances between data points and their projections

  • keep distances between points as similar as possible in original vs projected space
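The equivalence of the first two criteria can be seen numerically: for any unit direction, the projected variance plus the mean squared reconstruction error is a constant (the total variance), so maximizing one minimizes the other. A sketch with toy 2D data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)

def projected_variance(u):
    return np.mean((X @ u) ** 2)

def reconstruction_error(u):
    proj = np.outer(X @ u, u)      # project each point onto the line through u
    return np.mean(np.sum((X - proj) ** 2, axis=1))

# Pythagoras: variance along u + reconstruction error = total variance
total = np.mean(np.sum(X**2, axis=1))
for theta in np.linspace(0, np.pi, 7):
    u = np.array([np.cos(theta), np.sin(theta)])
    assert np.isclose(projected_variance(u) + reconstruction_error(u), total)
```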

Example in the space of column points

Example

What you should be able to do

  • Interpret data as clouds of row-points and column-points.
  • State the goal of projection (maximize variance, minimize reconstruction error, or preserve distances).

The first principal component

Goal: find the direction of maximum variance; definition and link to eigendecomposition.

Goal

We’d like to know in which directions in \mathbb{R}^p the data has the highest variance.

Direction of maximum variance

To find the direction of maximum variance, we need to find the unit vector \mathbf{u} that maximizes \mathbf{u}^T C \mathbf{u}.

Which direction gives the maximum variance?

The first principal component of a data matrix X is the eigenvector corresponding to the largest eigenvalue of the covariance matrix of the data.

In terms of the singular value decomposition of X, the first principal component is the first right singular vector \mathbf{v}_1 of X.

The variance of the data along the i-th principal component is given by the corresponding eigenvalue \lambda_i of C, which equals \sigma_i^2 / n in terms of the singular values of X.

Shopping baskets: exploring high-dimensional data

Load data, visualize in a few dimensions, and look at correlations before PCA.

Motivation

  • High-dimensional data (e.g. many items per observation) is hard to visualize.
  • Shopping baskets: 2000 trips, 42 food items — we want to see structure before using PCA.
  • Strategy: load data → look at 2D/3D views → inspect correlation matrices.

Method

  • Load the dataset (rows = foods, columns = baskets, or transposed).
  • Visualize in a few dimensions: scatter plots for two foods, 3D bar charts for counts.
  • Summarize associations: correlation matrix and heatmap.
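The correlation step above can be sketched as follows, using a hypothetical mini-basket table (food names borrowed from the dataset, counts made up) rather than the real data:

```python
import numpy as np
import pandas as pd

# hypothetical mini-basket data: rows = trips, columns = foods
rng = np.random.default_rng(3)
baskets = pd.DataFrame(rng.poisson(1.0, size=(50, 3)),
                       columns=["7up", "pepsi", "lasagna"])

# correlation matrix: pairwise associations between foods,
# the quantity shown in the heatmap
corr = baskets.corr()
print(corr.round(2))
```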

Example: Load and inspect the data

0 1 2 3 4 5 6 7 8 9 ... 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
name
7up 0 0 0 0 0 0 1 0 0 0 ... 1 1 0 0 0 0 2 0 0 1
lasagna 0 0 0 0 0 0 1 0 1 0 ... 0 2 1 0 0 0 0 1 1 0
pepsi 0 0 0 0 0 0 0 0 0 0 ... 1 0 2 0 0 2 0 0 0 0
yop 0 0 0 2 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
red.wine 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 2 2 0 0 0 0

5 rows × 2000 columns

  • 2000 observations × 42 variables (number of times each food was purchased per trip).
  • Goal: visualize in a few dimensions of the original space.

3D view of two-food counts

Many pairs of foods

We can look at many combinations.

Correlations between foods

Maybe we can learn more from the correlations?

Correlation heatmap (10 foods)

  • Some patterns are visible, but the full picture is still hard to see.
  • Next step: standardize and fit PCA to find directions of maximum variance.

Next: PCA

Now perform PCA on the data.

Code
# standardize the data, then fit PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
food_scaled = scaler.fit_transform(food)

pca = PCA(n_components=4)
pca.fit(food_scaled)
print(f'Explained variance %: {pca.explained_variance_ratio_*100}')
Explained variance %: [8.82557345 8.35536632 7.78271503 5.81231212]

Interpreting Principal Components

PCA gives us new dimensions — but what do they mean? We visualize loadings and scores.

Motivation

  • PCA reduces dimensions and explains variance.
  • We want to understand what each principal component captures.
  • Two key objects: loadings (variables) and scores (observations).

Method: Loadings vs scores

  • Loadings (components_): each variable’s contribution to each PC. Scatter variables in PC space to see which cluster together.
  • Scores (transform): each observation’s position in PC space.
  • Bar plots: order observations by PC value to see who is high or low on each dimension.
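The two objects can be sketched with sklearn on toy data: `components_` holds the loadings (one row per PC, one column per variable), while `transform` returns the scores (one row per observation).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))    # toy data: 100 observations, 5 variables

pca = PCA(n_components=2)
scores = pca.fit_transform(X)    # observations in PC space: shape (100, 2)
loadings = pca.components_       # variables' contributions: shape (2, 5)

assert scores.shape == (100, 2)
assert loadings.shape == (2, 5)
```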

Example: Loadings (variables in PC space)

Example: Loadings (PC3 vs PC4)

Example: Scores (individuals in PC space)

Example: Bar plots by PC value

Some real-world data

Applying PCA to genome data to explore population structure.

Motivation

  • Genome data from multiple populations
  • Goal: explore whether genetic variation correlates with population structure
  • Binarize variants and project to principal components

Method — Data pipeline

  1. Read and preprocess: extract population/gender, compute mode per trait (most frequent value)
  2. Binarize: 0 if individual has mode for that trait, 1 otherwise (mutation away from mode)
  3. Fit PCA, transform, append metadata for visualization
Code
import numpy as np
import pandas as pd

def readAndProcessData():
    """
    Read the raw text file into a dataframe, keeping the population and
    gender columns separate from the genetic data.

    We also compute the mode of each attribute or trait (column);
    the mode is just the most frequently occurring value.

    return: dataframe (df), modal traits (modes), population and gender
    for each individual (row)
    """
    df = pd.read_csv('p4dataset2020.txt', header=None, delim_whitespace=True)
    gender = df[1]
    population = df[2]
    print(np.unique(population))

    # drop the metadata columns, keeping only the genetic data
    df.drop(df.columns[[0, 1, 2]], axis=1, inplace=True)
    modes = np.array(df.mode().values[0, :])
    return df, modes, population, gender
['ACB' 'ASW' 'ESN' 'GWD' 'LWK' 'MSL' 'YRI']
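Step 2 of the pipeline (binarization) can be sketched on a hypothetical mini genetic table, made-up letters standing in for the real variant data:

```python
import numpy as np
import pandas as pd

# hypothetical mini genetic table: rows = individuals, columns = traits
df = pd.DataFrame({0: ["G", "A", "A", "A", "G"],
                   1: ["T", "T", "T", "C", "C"]})

# mode = most frequent value per trait (column): here "A" and "T"
modes = df.mode().values[0, :]

# binarize: 0 if the individual carries the modal value,
# 1 otherwise (a mutation away from the mode)
X = (df.values != modes).astype(int)
print(X)
```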

Examples — Worked pipeline

3 4 5 6 7 8 9 10 11 12 ... 10094 10095 10096 10097 10098 10099 10100 10101 10102 10103
0 G G T T A A C A C C ... T A T A A T T T G A
1 A A T T A G C A T T ... G C T G A T C T G G
2 A A T T A A G A C C ... G C T G A T C T G G
3 A A T C A A G A C C ... G A T G A T C T G G
4 G A T C G A C A C C ... G C T G A T C T G G

5 rows × 10101 columns

0 1 2 3 4 5 6 7 8 9 ... 10091 10092 10093 10094 10095 10096 10097 10098 10099 10100
0 0 1 0 1 0 1 1 0 0 0 ... 0 1 1 1 0 0 1 0 0 1
1 1 0 0 1 0 0 1 0 1 1 ... 1 0 1 0 0 0 0 0 0 0
2 1 0 0 1 0 1 0 0 0 0 ... 1 0 1 0 0 0 0 0 0 0
3 1 0 0 0 0 1 0 0 0 0 ... 1 1 1 0 0 0 0 0 0 0
4 0 0 0 0 1 1 1 0 0 0 ... 1 0 1 0 0 0 0 0 0 0

5 rows × 10101 columns

Code
from sklearn.decomposition import PCA

pca = PCA(n_components=6)
pca.fit(X)
# data points projected along the principal components, with metadata appended
PC1 PC2 PC3 PC4 PC5 PC6 population gender
0 4.783047 -0.885985 3.194163 -0.547733 -2.338509 -0.103942 ACB 1
1 12.949482 2.357771 -0.735125 -3.726529 -2.297325 -1.168944 ACB 2
2 9.835774 0.659151 -1.353148 -2.172695 -1.617898 -1.051721 ACB 2
3 -0.340330 -0.742929 3.523026 0.684339 -2.378477 -1.254986 ACB 1
4 3.782883 0.497806 -2.674773 -0.806439 -0.727355 -0.017556 ACB 2