2020, Jan 27

Suppose that after applying Principal Component Analysis (PCA) to your dataset, you want to understand how much each of the original variables contributes to the principal components. How can we do that?

In PCA, given a mean-centered dataset $X$ with $n$ samples and $p$ variables, the first principal component $PC_1$ is given by a linear combination of the original variables $X_1, X_2, ..., X_p$:

$$PC_1 = w_{11}X_1 + w_{12}X_2 + \dots + w_{1p}X_p = \mathbf{X}\mathbf{w_1}$$

The first principal component $PC_1$ represents the component that retains the maximum variance of the data. $\mathbf{w_1}$ corresponds to an eigenvector of the covariance matrix

$$\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{n-1}, \qquad \mathbf{C}\mathbf{w_1} = \lambda_1\mathbf{w_1},$$

and the elements $w_{1j}$ of the eigenvector are also known as loadings.

PCA loadings are the coefficients of the linear combination of the original variables from which the principal components (PCs) are constructed.
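To see concretely that each principal component is such a linear combination, here is a small sketch on synthetic data (the random matrix is only illustrative): the first column of the transformed data equals the centered data multiplied by the first row of pca.components_.

```python
import numpy as np
from sklearn.decomposition import PCA

# illustrative synthetic data, mean-centered
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(X)
w1 = pca.components_[0]   # loadings (weights) of the first PC
pc1 = X @ w1              # linear combination of the original variables

# identical to the first column returned by pca.transform
print(np.allclose(pc1, pca.transform(X)[:, 0]))  # True
```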

Here is an example of how to apply PCA with scikit-learn on the Iris dataset.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import scale

# load the Iris dataset and standardize the features
iris = datasets.load_iris()
X = scale(iris.data)
y = iris.target

# apply PCA
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(X)


To get the loadings, we just need to access the components_ attribute of the fitted sklearn.decomposition.PCA object.

loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=iris.feature_names)

                        PC1       PC2
sepal length (cm)  0.521066  0.377418
sepal width (cm)  -0.269347  0.923296
petal length (cm)  0.580413  0.024492
petal width (cm)   0.564857  0.066942


The columns of the dataframe contain the eigenvectors associated with the first two principal components. Each element represents a loading, namely how much (the weight) each original variable contributes to the corresponding principal component.

Note: in R, the same matrix is available as the element rotation of the list returned by the function prcomp().

Another useful way to interpret PCA is by computing the correlations between the original variables and the principal components. How can we do that?

To compute PCA, available libraries first compute the singular value decomposition (SVD) of the original dataset:

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top$$

The columns of $\mathbf{V}$ contain the principal axes, $\mathbf{S}$ is a diagonal matrix containing the singular values, and the columns of $\mathbf{U}$ are the principal components scaled to unit norm.
Standardized PCs are given by $\sqrt{N-1}\mathbf{U}$.
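These relations can be checked numerically. The sketch below (on illustrative random data) compares NumPy's SVD of the centered data with scikit-learn's PCA; the comparisons use absolute values because singular vectors are defined only up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

# illustrative mean-centered data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
pca = PCA().fit(X)

# the rows of Vt are the principal axes, i.e. pca.components_ (up to sign)
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))      # True

# the scores X V = U S match pca.transform(X) (up to sign)
print(np.allclose(np.abs(U * S), np.abs(pca.transform(X))))  # True

# standardized PCs sqrt(N-1) * U have unit sample variance
Z = np.sqrt(X.shape[0] - 1) * U
print(np.allclose(Z.std(axis=0, ddof=1), 1.0))               # True
```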

As we have seen before, the covariance matrix is defined as

$$\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{N-1} = \frac{\mathbf{V}\mathbf{S}^2\mathbf{V}^\top}{N-1} = \mathbf{V}\frac{\mathbf{S}^2}{N-1}\mathbf{V}^\top.$$

This means that the principal axes $\mathbf{V}$ are eigenvectors of the covariance matrix and that the diagonal entries of $\mathbf E=\frac{\mathbf S^2}{N-1}$ are its eigenvalues.
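A quick numerical sanity check of this statement, again on illustrative random data (the column scales are chosen only to keep the eigenvalues well separated):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
X = X - X.mean(axis=0)
N = X.shape[0]

C = X.T @ X / (N - 1)                 # covariance matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
E = S**2 / (N - 1)                    # candidate eigenvalues, descending

# eigh returns eigenvalues in ascending order; reverse to compare
evals, evecs = np.linalg.eigh(C)
print(np.allclose(evals[::-1], E))                        # True
# the eigenvectors match the columns of V up to sign
print(np.allclose(np.abs(evecs[:, ::-1]), np.abs(Vt.T)))  # True
```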

To compute the loading matrix, namely the correlations between the original variables and the principal components, we just need to compute the cross-covariance matrix

$$\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = \frac{\mathbf{X}^\top\mathbf{Y}}{N-1}.$$

In our case, $\mathbf{X}$ contains the (standardized) original variables and $\mathbf{Y} = \sqrt{N-1}\,\mathbf{U}$ contains the standardized principal components, so

$$\mathbf{L} = \frac{\mathbf{X}^\top\sqrt{N-1}\,\mathbf{U}}{N-1} = \frac{\mathbf{V}\mathbf{S}\mathbf{U}^\top\mathbf{U}}{\sqrt{N-1}} = \mathbf{V}\frac{\mathbf{S}}{\sqrt{N-1}} = \mathbf{V}\sqrt{\mathbf{E}}.$$

As you can see, from a numerical point of view, the loadings $\mathbf{L}$ are equal to the principal axes $\mathbf{V}$ scaled by the square roots of the eigenvalues associated with the components.

Therefore, if we want to compute the loading matrix with scikit-learn we just need to remember that

• $\mathbf{V}$ is stored in pca.components_.T
• $\sqrt{\mathbf E}$ is given by np.sqrt(pca.explained_variance_)

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)


Here each entry of the matrix contains the correlation between an original variable and a principal component. For example, the original variable sepal length (cm) and the first principal component PC1 have a correlation of $0.89$.
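As a final check, the sketch below recomputes these correlations directly with np.corrcoef on the Iris data and compares them with the loading matrix. One assumption to note: the data is standardized here with ddof=1 so that the sample correlations match the loadings exactly (sklearn's scale uses ddof=0, which introduces a small constant factor of $\sqrt{N/(N-1)}$).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
# standardize with ddof=1 so the correlations match the loadings exactly
X = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# correlation between every original variable and every PC
corr = np.array([[np.corrcoef(X[:, i], scores[:, j])[0, 1]
                  for j in range(scores.shape[1])]
                 for i in range(X.shape[1])])

print(np.allclose(corr, loadings))    # True
print(round(abs(loadings[0, 0]), 2))  # 0.89
```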