2020, Jan 27

Suppose that after applying Principal Component Analysis (PCA) to your dataset, you want to understand how much each of the original variables contributes to the principal components. How can we do that?

In PCA, given a mean-centered dataset $X$ with $n$ samples and $p$ variables, the first principal component $PC_1$ is given by a linear combination of the original variables $X_1, X_2, ..., X_p$:

$$PC_1 = w_{11}X_1 + w_{12}X_2 + \dots + w_{1p}X_p = \mathbf{X}\mathbf{w_1}$$

The first principal component $PC_1$ represents the component that retains the maximum variance of the data. $\mathbf{w_1}$ corresponds to an eigenvector of the covariance matrix

$$\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{n-1}, \qquad \mathbf{C}\mathbf{w_1} = \lambda_1\mathbf{w_1},$$

and the elements $w_{1j}$ of the eigenvector are also known as loadings.

PCA loadings are the coefficients of the linear combination of the original variables from which the principal components (PCs) are constructed.
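To see concretely that each principal component is such a linear combination, here is a small sketch on synthetic data (the random matrix is only illustrative): the first column of the transformed data equals the centered data multiplied by the first row of pca.components_.

```python
import numpy as np
from sklearn.decomposition import PCA

# illustrative synthetic data, mean-centered
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(X)
w1 = pca.components_[0]   # loadings (weights) of the first PC
pc1 = X @ w1              # linear combination of the original variables

# identical to the first column returned by pca.transform
print(np.allclose(pc1, pca.transform(X)[:, 0]))  # True
```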

Here is an example of how to apply PCA with scikit-learn on the Iris dataset.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import scale

# load the Iris dataset and standardize the features
iris = datasets.load_iris()
X = scale(iris.data)
y = iris.target

# apply PCA
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(X)


To get the loadings, we just need to access the components_ attribute of the fitted sklearn.decomposition.PCA object.

loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=iris.feature_names)

                        PC1       PC2
sepal length (cm)  0.521066  0.377418
sepal width (cm)  -0.269347  0.923296
petal length (cm)  0.580413  0.024492
petal width (cm)   0.564857  0.066942


The columns of the dataframe contain the eigenvectors associated with the first two principal components. Each element represents a loading, namely how much (the weight) each original variable contributes to the corresponding principal component.

Note: in R, the same matrix is available as the element rotation of the list returned by the function prcomp().

Another useful way to interpret PCA is by computing the correlations between the original variables and the principal components. How can we do that?

To compute PCA, available libraries first compute the singular value decomposition (SVD) of the original dataset:

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top$$

The columns of $\mathbf{V}$ contain the principal axes, $\mathbf{S}$ is a diagonal matrix containing the singular values, and the columns of $\mathbf{U}$ are the principal components scaled to unit norm.
Standardized PCs are given by $\sqrt{N-1}\mathbf{U}$.
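These relations can be checked numerically. The sketch below (on illustrative random data) compares NumPy's SVD of the centered data with scikit-learn's PCA; the comparisons use absolute values because singular vectors are defined only up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

# illustrative mean-centered data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
pca = PCA().fit(X)

# the rows of Vt are the principal axes, i.e. pca.components_ (up to sign)
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))      # True

# the scores X V = U S match pca.transform(X) (up to sign)
print(np.allclose(np.abs(U * S), np.abs(pca.transform(X))))  # True

# standardized PCs sqrt(N-1) * U have unit sample variance
Z = np.sqrt(X.shape[0] - 1) * U
print(np.allclose(Z.std(axis=0, ddof=1), 1.0))               # True
```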

As we have seen before, the covariance matrix is defined as

$$\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{N-1} = \frac{\mathbf{V}\mathbf{S}^2\mathbf{V}^\top}{N-1} = \mathbf{V}\frac{\mathbf{S}^2}{N-1}\mathbf{V}^\top.$$

This means that the principal axes $\mathbf{V}$ are eigenvectors of the covariance matrix and that the diagonal entries of $\mathbf E=\frac{\mathbf S^2}{N-1}$ are its eigenvalues.
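A quick numerical sanity check of this statement, again on illustrative random data (the column scales are chosen only to keep the eigenvalues well separated):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
X = X - X.mean(axis=0)
N = X.shape[0]

C = X.T @ X / (N - 1)                 # covariance matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
E = S**2 / (N - 1)                    # candidate eigenvalues, descending

# eigh returns eigenvalues in ascending order; reverse to compare
evals, evecs = np.linalg.eigh(C)
print(np.allclose(evals[::-1], E))                        # True
# the eigenvectors match the columns of V up to sign
print(np.allclose(np.abs(evecs[:, ::-1]), np.abs(Vt.T)))  # True
```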

To compute the loading matrix, namely the correlations between the original variables and the principal components, we just need to compute the cross-covariance matrix

$$\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = \frac{\mathbf{X}^\top\mathbf{Y}}{N-1}.$$

In our case, $\mathbf{X}$ contains the (standardized) original variables and $\mathbf{Y} = \sqrt{N-1}\,\mathbf{U}$ contains the standardized principal components, so

$$\mathbf{L} = \frac{\mathbf{X}^\top\sqrt{N-1}\,\mathbf{U}}{N-1} = \frac{\mathbf{V}\mathbf{S}\mathbf{U}^\top\mathbf{U}}{\sqrt{N-1}} = \mathbf{V}\frac{\mathbf{S}}{\sqrt{N-1}} = \mathbf{V}\sqrt{\mathbf{E}}.$$

As you can see, from a numerical point of view, the loadings $\mathbf{L}$ are equal to the principal axes $\mathbf{V}$ scaled by the square roots of the eigenvalues associated with the components.

Therefore, if we want to compute the loading matrix with scikit-learn we just need to remember that

• $\mathbf{V}$ is stored in pca.components_.T
• $\sqrt{\mathbf E}$ is given by np.sqrt(pca.explained_variance_)

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)


Here each entry of the matrix contains the correlation between an original variable and a principal component. For example, the original variable sepal length (cm) and the first principal component PC1 have a correlation of $0.89$.
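As a final check, the sketch below recomputes these correlations directly with np.corrcoef on the Iris data and compares them with the loading matrix. One assumption to note: the data is standardized here with ddof=1 so that the sample correlations match the loadings exactly (sklearn's scale uses ddof=0, which introduces a small constant factor of $\sqrt{N/(N-1)}$).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
# standardize with ddof=1 so the correlations match the loadings exactly
X = (iris.data - iris.data.mean(axis=0)) / iris.data.std(axis=0, ddof=1)

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# correlation between every original variable and every PC
corr = np.array([[np.corrcoef(X[:, i], scores[:, j])[0, 1]
                  for j in range(scores.shape[1])]
                 for i in range(X.shape[1])])

print(np.allclose(corr, loadings))    # True
print(round(abs(loadings[0, 0]), 2))  # 0.89
```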