Ch5 Lecture 5

How spread out is the data along a particular direction?

Suppose we have n data points in p dimensions. We can represent the data as an n \times p matrix X, with one data point per row. We assume the data has been mean-centered: the mean along each dimension (column) has been subtracted, so every column of X has mean zero.
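As a concrete sketch of this setup in NumPy (the array name X_raw for the uncentered data is hypothetical):

Code
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 5))  # hypothetical raw data: n = 100 points in p = 5 dimensions

# subtract each column's mean so that every dimension has mean zero
X = X_raw - X_raw.mean(axis=0)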

Visualizing the high-dimensional data

KIDEN city region price_index_no_rent price_index_with_rent gross_salaries net_salaries work_hours_year paid_vacations_year gross_buying_power ... mechanic construction_worker metalworker cook_chef factory_manager engineer bank_clerk executive_secretary salesperson textile_worker
0 Amsterdam91 Amsterdam Central Europe 65.6 65.7 56.9 49.0 1714.0 31.9 86.7 ... 11924.0 12661.0 14536.0 14402.0 25924.0 24786.0 14871.0 14871.0 11857.0 10852.0
1 Athenes91 Athens Southern Europe 53.8 55.6 30.2 30.4 1792.0 23.5 56.1 ... 8574.0 9847.0 14402.0 14068.0 13800.0 14804.0 9914.0 6900.0 4555.0 5761.0
2 Bogota91 Bogota South America 37.9 39.3 10.1 11.5 2152.0 17.4 26.6 ... 4354.0 1206.0 4823.0 13934.0 12192.0 12259.0 2345.0 5024.0 2278.0 2814.0
3 Bombay91 Mumbai South Asia and Australia 30.3 39.9 6.0 5.3 2052.0 30.6 19.9 ... 1809.0 737.0 2479.0 2412.0 3751.0 2880.0 2345.0 1809.0 1072.0 1206.0
4 Bruxelles91 Brussels Central Europe 73.8 72.2 68.2 50.5 1708.0 24.6 92.4 ... 10450.0 12192.0 17350.0 19159.0 31016.0 24518.0 19293.0 13800.0 10718.0 10182.0

5 rows × 41 columns

We might choose to focus on only 12 (!) of the 41 variables in the dataset, corresponding to the average wages of workers in 12 specific occupations in each city.
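As a sketch of how that selection might look in pandas (assuming the full table above is loaded in a dataframe called cities, a hypothetical name):

Code
# the 12 occupation wage columns, plus the city label
wage_cols = ['teacher', 'bus_driver', 'mechanic', 'construction_worker',
             'metalworker', 'cook_chef', 'factory_manager', 'engineer',
             'bank_clerk', 'executive_secretary', 'salesperson', 'textile_worker']
wages = cities[['city'] + wage_cols]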

city teacher bus_driver mechanic construction_worker metalworker cook_chef factory_manager engineer bank_clerk executive_secretary salesperson textile_worker
0 Amsterdam 15608.0 17819.0 11924.0 12661.0 14536.0 14402.0 25924.0 24786.0 14871.0 14871.0 11857.0 10852.0
1 Athens 7972.0 9445.0 8574.0 9847.0 14402.0 14068.0 13800.0 14804.0 9914.0 6900.0 4555.0 5761.0
2 Bogota 2144.0 2412.0 4354.0 1206.0 4823.0 13934.0 12192.0 12259.0 2345.0 5024.0 2278.0 2814.0
3 Mumbai 1005.0 1340.0 1809.0 737.0 2479.0 2412.0 3751.0 2880.0 2345.0 1809.0 1072.0 1206.0
4 Brussels 14001.0 14068.0 10450.0 12192.0 17350.0 19159.0 31016.0 24518.0 19293.0 13800.0 10718.0 10182.0

How can we think about the data in this 12-dimensional space?

Clouds of row-points

Clouds of column-points

Projection onto fewer dimensions

To visualize data, we need to project it onto 2d (or 3d) subspaces. But which ones?

These are all equivalent:

  • maximize variance of projected data (see the numerical sketch after this list)

  • minimize squared distances between data points and their projections

  • keep distances between points as similar as possible in original vs projected space
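As a numerical illustration of the first criterion (a minimal sketch on synthetic data), the variance along the top principal direction is never smaller than the variance along any other unit vector:

Code
import numpy as np

rng = np.random.default_rng(0)
# synthetic mean-centered data: 200 points in 5 dimensions with unequal spread
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)

# direction of maximum variance: the first right singular vector of X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]
print('variance along v1:', np.var(X @ v1))

for _ in range(3):
    u = rng.normal(size=5)
    u /= np.linalg.norm(u)  # a random unit vector
    print('variance along random u:', np.var(X @ u))  # always at most the variance along v1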

Example in the space of column points

Example

Goal

We’d like to know in which directions in \mathbb{R}^p the data has the highest variance.

Direction of maximum variance

To find the direction of maximum variance, we need to find the unit vector \mathbf{u} that maximizes \mathbf{u}^T C \mathbf{u}, where C = \frac{1}{n-1} X^T X is the sample covariance matrix of the (mean-centered) data and \mathbf{u}^T C \mathbf{u} is the variance of the projected data X\mathbf{u}.

We start by finding the eigendecomposition of the covariance matrix C: C = V \Lambda V^T.

V is a matrix whose columns are the eigenvectors of C, and \Lambda is a diagonal matrix whose diagonal elements are the eigenvalues of C.

(Note that the eigenvectors of C are exactly the right singular vectors of the data matrix X, and the eigenvalues of C are the squared singular values of X, scaled by 1/(n-1).)

Then we can express \mathbf{u} in terms of the eigenvectors of C: \mathbf{u} = \sum_{i=1}^p a_i \mathbf{v}_i, where \mathbf{v}_i are the eigenvectors of C. Because \mathbf{u} is a unit vector, the squared coefficients must sum to one: \sum_{i=1}^p a_i^2 = 1.

Now we have that C \mathbf{u} = \sum_{i=1}^p a_i C \mathbf{v}_i = \sum_{i=1}^p a_i \lambda_i \mathbf{v}_i, where \lambda_i are the eigenvalues of C.

So then \mathbf{u}^T C \mathbf{u} = \sum_{i,j=1}^p a_i a_j \lambda_j \mathbf{v}_i^T \mathbf{v}_j = \sum_{i,j=1}^p a_i a_j \lambda_j \delta_{ij} = \sum_{i=1}^p a_i^2 \lambda_i, since the eigenvectors are orthonormal (\mathbf{v}_i^T \mathbf{v}_j = \delta_{ij}).

Which direction gives the maximum variance?


Since \sum_{i=1}^p a_i^2 = 1, the quantity \sum_{i=1}^p a_i^2 \lambda_i is a weighted average of the eigenvalues, and it is maximized by placing all of the weight on the largest eigenvalue: a_1 = 1 and a_i = 0 otherwise. The first principal component of a data matrix X is therefore the eigenvector corresponding to the largest eigenvalue of the covariance matrix of the data.

In terms of the singular value decomposition of X, the first principal component is the first right singular vector of X, \mathbf{v}_1.

The variance of the data along each principal component is given by the corresponding eigenvalue \lambda_i of C, or equivalently by the square of the corresponding singular value of X, scaled by 1/(n-1): \lambda_i = \sigma_i^2/(n-1).
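A quick numerical check of this correspondence (a sketch on synthetic data; note that np.linalg.eigh returns eigenvalues in ascending order, so we reverse them):

Code
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)  # mean-centered data matrix
n = X.shape[0]

C = X.T @ X / (n - 1)  # sample covariance matrix

# eigenvalues of C, sorted in descending order
eigvals = np.linalg.eigh(C)[0][::-1]

# singular values of X
svals = np.linalg.svd(X, compute_uv=False)

print(eigvals)
print(svals**2 / (n - 1))  # matches the eigenvalues of C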

Example dataset: shopping baskets

0 1 2 3 4 5 6 7 8 9 ... 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
name
7up 0 0 0 0 0 0 1 0 0 0 ... 1 1 0 0 0 0 2 0 0 1
lasagna 0 0 0 0 0 0 1 0 1 0 ... 0 2 1 0 0 0 0 1 1 0
pepsi 0 0 0 0 0 0 0 0 0 0 ... 1 0 2 0 0 2 0 0 0 0
yop 0 0 0 2 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
red.wine 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 2 2 0 0 0 0

5 rows × 2000 columns

The data consist of 2000 observations of 42 variables each! The variables are the number of times each of 42 different food items was purchased in a particular shopping trip. (The table above is displayed with the items as rows and the trips as columns.)

Let’s try visualizing the data in a few of the dimensions of the original space.

We can look at many combinations…

Maybe we can learn more from the correlations?
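One way to compute and display those correlations (a sketch, assuming food is the 2000 × 42 trips-by-items dataframe used in the PCA code below, i.e. the transpose of the table displayed above):

Code
import matplotlib.pyplot as plt

# pairwise correlations between the 42 food items across the 2000 trips
item_corr = food.corr()

plt.imshow(item_corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='correlation')
plt.show()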

OK, it looks like there are some patterns here. But it’s hard to get a real sense of them.

Now perform PCA on the data.

Code
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# standardize the data to zero mean and unit variance
scaler = StandardScaler()
food_scaled = scaler.fit_transform(food)

# find the first four principal components of the data
pca = PCA(n_components=4)
pca.fit(food_scaled)
print(f'Explained variance %: {pca.explained_variance_ratio_*100}')
Explained variance %: [8.82557345 8.35536634 7.782715   5.81231614]
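To actually look at the data in the new coordinates, we can project each shopping trip onto the first two components (a sketch continuing from the fitted pca object above; the plotting details are assumptions):

Code
import matplotlib.pyplot as plt

# coordinates of each shopping trip in the plane of the first two PCs
food_pc = pca.transform(food_scaled)

plt.scatter(food_pc[:, 0], food_pc[:, 1], s=5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()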


Meaning of the principal components
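The components are directions in the original 42-dimensional space, so one way to interpret them is to inspect their loadings: the weight each component places on each original variable. A sketch (again assuming the fitted pca object and that food has the item names as its columns):

Code
import numpy as np

# each row of pca.components_ is a unit vector in the original variable space;
# the largest-magnitude entries show which food items dominate that component
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:5]
    print(f'PC{i+1}:', [(food.columns[j], round(component[j], 2)) for j in top])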

Some real-world data

Code
import numpy as np
import pandas as pd

def readAndProcessData():
    """
    Read the raw text file into a dataframe, keeping the population and gender
    labels separate from the genetic data.

    We also calculate the population mode for each attribute or trait (column).
    Note that the mode is just the most frequently occurring trait.

    return: dataframe (df), modal traits (modes), population and gender for each individual (row)
    """
    df = pd.read_csv('p4dataset2020.txt', header=None, delim_whitespace=True)
    gender = df[1]
    population = df[2]
    print(np.unique(population))

    # drop the identifier, gender, and population columns, leaving only the genetic data
    df.drop(df.columns[[0, 1, 2]], axis=1, inplace=True)

    # modal (most frequent) trait at each position
    modes = np.array(df.mode().values[0, :])
    return df, modes, population, gender
['ACB' 'ASW' 'ESN' 'GWD' 'LWK' 'MSL' 'YRI']
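Calling the function yields the dataframe whose first rows are shown below (a usage sketch; the unpacking order follows the function’s return statement):

Code
df, modes, population, gender = readAndProcessData()
df.head()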

3 4 5 6 7 8 9 10 11 12 ... 10094 10095 10096 10097 10098 10099 10100 10101 10102 10103
0 G G T T A A C A C C ... T A T A A T T T G A
1 A A T T A G C A T T ... G C T G A T C T G G
2 A A T T A A G A C C ... G C T G A T C T G G
3 A A T C A A G A C C ... G A T G A T C T G G
4 G A T C G A C A C C ... G C T G A T C T G G

5 rows × 10101 columns
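The second table is the binarized version of the data: a 1 wherever an individual’s nucleotide differs from the population mode at that position, and a 0 where it matches. A minimal sketch of how it might be constructed (the name X matches the PCA code below; the exact construction is an assumption):

Code
# 1 where an individual differs from the modal nucleotide at a position, 0 otherwise
X = (df.values != modes).astype(int)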

0 1 2 3 4 5 6 7 8 9 ... 10091 10092 10093 10094 10095 10096 10097 10098 10099 10100
0 0 1 0 1 0 1 1 0 0 0 ... 0 1 1 1 0 0 1 0 0 1
1 1 0 0 1 0 0 1 0 1 1 ... 1 0 1 0 0 0 0 0 0 0
2 1 0 0 1 0 1 0 0 0 0 ... 1 0 1 0 0 0 0 0 0 0
3 1 0 0 0 0 1 0 0 0 0 ... 1 1 1 0 0 0 0 0 0 0
4 0 0 0 0 1 1 1 0 0 0 ... 1 0 1 0 0 0 0 0 0 0

5 rows × 10101 columns

Code
pca = PCA(n_components=6)
# project the data points onto the first six principal components,
# then attach the population and gender labels, giving the table below
proj_df = pd.DataFrame(pca.fit_transform(X), columns=[f'PC{i+1}' for i in range(6)])
proj_df['population'] = population
proj_df['gender'] = gender
PC1 PC2 PC3 PC4 PC5 PC6 population gender
0 4.787182 -0.882031 3.288092 -0.801862 -3.087300 1.273904 ACB 1
1 12.953420 2.376716 -0.646826 -3.687080 -3.109056 -1.207336 ACB 2
2 9.840719 0.670118 -1.260007 -2.213958 -2.789986 -0.231344 ACB 2
3 -0.338584 -0.749133 3.587250 0.099608 -1.338398 0.173472 ACB 1
4 3.779919 0.506308 -2.747067 -1.039848 1.498731 0.616361 ACB 2