Suppose that after applying Principal Component Analysis (PCA) to your dataset, you want to understand how much each original variable contributes to the principal components. How can we do that?
In PCA, given a mean-centered dataset $X$ with $n$ samples and $p$ variables, the first principal component $z_1$ is given by the linear combination of the original variables $X_1, \dots, X_p$

$$z_1 = X w_1 = w_{11} X_1 + w_{12} X_2 + \dots + w_{1p} X_p$$

The first principal component is the component that retains the maximum variance of the data. The weight vector $w_1$ is the eigenvector of the covariance matrix

$$C = \frac{X^\top X}{n-1}$$

associated with the largest eigenvalue, and the elements $w_{1j}$ of this eigenvector are also known as loadings.
PCA loadings are the coefficients of the linear combination of the original variables from which the principal components (PCs) are constructed.
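As a quick numerical check of this definition, the sketch below uses a small synthetic dataset (the random data and the variable names are illustrative assumptions, not part of the Iris example). It takes the leading eigenvector of the sample covariance matrix as the loading vector and confirms that the resulting linear combination has a variance equal to the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: 200 samples, 3 correlated variables (illustrative only)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(200, 3)) @ A
X = X - X.mean(axis=0)            # mean-center each variable

n = X.shape[0]
C = X.T @ X / (n - 1)             # sample covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(C)
w1 = eigvecs[:, -1]               # loadings of the first principal component
z1 = X @ w1                       # PC1 scores: linear combination of the variables

var_z1 = z1.var(ddof=1)           # sample variance of the PC1 scores
print(np.isclose(var_z1, eigvals[-1]))
```

The sign of w1 is arbitrary: flipping it gives an equally valid eigenvector, so loadings are only defined up to a sign per component.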
Loadings with scikit-learn
Here is an example of how to apply PCA with scikit-learn on the Iris dataset.
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import scale

# load iris dataset
iris = datasets.load_iris()
X = scale(iris.data)
y = iris.target

# apply PCA
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(X)
```
To get the loadings, we just need to access the attribute components_ of the fitted sklearn.decomposition.PCA object.
```python
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=iris.feature_names)
loadings
```

```
                        PC1       PC2
sepal length (cm)  0.521066  0.377418
sepal width (cm)  -0.269347  0.923296
petal length (cm)  0.580413  0.024492
petal width (cm)   0.564857  0.066942
```
The columns of the dataframe contain the eigenvectors associated with the first two principal components. Each element represents a loading, namely how much (the weight) each original variable contributes to the corresponding principal component.
Note: In R, the same matrix is available as the rotation element of the object returned by the function prcomp().
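To make the "coefficients of a linear combination" reading concrete, here is a minimal sketch that rebuilds the scikit-learn scores by hand from the loadings. The names X_std, scores and manual_scores are illustrative choices; the check only relies on the documented mean_ and components_ attributes of the fitted PCA object.

```python
import numpy as np
from sklearn import datasets, decomposition
from sklearn.preprocessing import scale

iris = datasets.load_iris()
X_std = scale(iris.data)                     # standardized variables

pca = decomposition.PCA(n_components=2)
scores = pca.fit_transform(X_std)            # principal component scores

# each score is a linear combination of the centered variables,
# weighted by the loadings stored in the rows of components_
manual_scores = (X_std - pca.mean_) @ pca.components_.T

print(np.allclose(scores, manual_scores))
```

Each column of pca.components_.T therefore holds the weights that turn one observation into its coordinate on the corresponding principal component.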
Another useful way to interpret PCA is by computing the correlations between the original variables and the principal components. How can we do that?
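To compute PCA, available libraries first compute the singular value decomposition (SVD) of the original dataset

$$X = U S V^\top$$

The columns of $V$ contain the principal axes, $S$ is a diagonal matrix containing the singular values, and the columns of $U$ are the principal components scaled to unit norm.
Standardized PCs are given by $\sqrt{n-1}\,U$, where $n$ is the number of samples.
As we have seen before, the covariance matrix is defined as

$$C = \frac{X^\top X}{n-1} = \frac{V S U^\top U S V^\top}{n-1} = V \frac{S^2}{n-1} V^\top = V \Lambda V^\top$$

This means that the principal axes $V$ are eigenvectors of the covariance matrix and the diagonal entries of $\Lambda = S^2/(n-1)$ are its eigenvalues.
To compute the loading matrix, namely the correlations between the original variables and the principal components, we just need to compute the cross-covariance matrix (which coincides with the correlation matrix when the original variables are standardized to unit variance):

$$\operatorname{Cov}(X, Y) = \frac{X^\top Y}{n-1}$$

In our case, $X$ contains the original variables, and $Y$ contains the standardized principal components, so

$$\operatorname{Cov}(X, Y) = \frac{X^\top \sqrt{n-1}\,U}{n-1} = \frac{V S U^\top U}{\sqrt{n-1}} = \frac{V S}{\sqrt{n-1}} = V \Lambda^{1/2}$$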
(Note: derivation adapted from here).
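The derivation can be checked numerically with NumPy alone. The following sketch uses synthetic data (an assumption made for illustration) and verifies that the cross-covariance between the centered variables and the standardized principal components equals the principal axes scaled by the square roots of the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
# illustrative data: 100 samples, 4 correlated variables
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                # mean-center
n = X.shape[0]

# thin SVD: X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

eigenvalues = s**2 / (n - 1)          # eigenvalues of the covariance matrix
Y = np.sqrt(n - 1) * U                # standardized principal components

cross_cov = X.T @ Y / (n - 1)         # cross-covariance between X and Y
loadings = V * np.sqrt(eigenvalues)   # scales column j of V by sqrt(lambda_j)

print(np.allclose(cross_cov, loadings))
```

Broadcasting V * np.sqrt(eigenvalues) multiplies each column of V by the matching square-root eigenvalue, which is the same as the matrix product of V with the diagonal matrix of square-root eigenvalues.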
As you can see, from a numerical point of view, the loadings are equal to the entries of the eigenvectors multiplied by the square root of the eigenvalue associated with the component.
Therefore, if we want to compute the loading matrix with scikit-learn we just need to remember that
- the eigenvector matrix $V$ is stored in pca.components_.T
- $\Lambda^{1/2}$, the square roots of the eigenvalues, is given by np.sqrt(pca.explained_variance_)
```python
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
loading_matrix = pd.DataFrame(loadings, columns=['PC1', 'PC2'], index=iris.feature_names)
loading_matrix
```

```
                        PC1       PC2
sepal length (cm)  0.893151  0.362039
sepal width (cm)  -0.461684  0.885673
petal length (cm)  0.994877  0.023494
petal width (cm)   0.968212  0.064214
```
Here each entry of the matrix contains the correlation between an original variable and a principal component. For example, the original variable sepal length (cm) and the first principal component PC1 have a correlation of about 0.89.
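One way to sanity-check this interpretation is to compute the Pearson correlations directly with np.corrcoef and compare them with the loading matrix. The sketch below is a hedged check, not part of the original code. It accounts for a small detail: scale() standardizes with the population standard deviation (dividing by n), while explained_variance_ uses n - 1, so the two results should agree up to a factor of sqrt((n - 1) / n), which is about 0.997 for the 150 Iris samples.

```python
import numpy as np
from sklearn import datasets, decomposition
from sklearn.preprocessing import scale

iris = datasets.load_iris()
X_std = scale(iris.data)
n = X_std.shape[0]

pca = decomposition.PCA(n_components=2)
scores = pca.fit_transform(X_std)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Pearson correlation between every original variable and every PC
corr = np.array([[np.corrcoef(X_std[:, i], scores[:, j])[0, 1]
                  for j in range(scores.shape[1])]
                 for i in range(X_std.shape[1])])

# scale() uses the population std (ddof=0), whereas explained_variance_
# uses n - 1, hence the sqrt((n - 1) / n) correction factor
print(np.allclose(corr, loadings * np.sqrt((n - 1) / n)))
```

For larger datasets the correction factor approaches 1, and the loading matrix can be read directly as a correlation matrix.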
You can find the code here.