🧬 PCA

Principal Component Analysis + Score Plot + Scree Plot

What Is PCA (Principal Component Analysis)?

Principal Component Analysis is a dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. Each component captures a direction of maximum variance in the data, with the first component explaining the most variance, the second explaining the most remaining variance (orthogonal to the first), and so on.

PCA is used extensively in genomics (analyzing gene expression across thousands of genes), metabolomics (profiling hundreds of metabolites), quality control (monitoring multivariate processes), image compression, and exploratory data analysis. Any time you have more variables than you can visualize or interpret individually, PCA can help.

How PCA Works (Conceptual Overview)

Imagine you have data with 10 measured variables per sample. Each sample is a point in 10-dimensional space. PCA finds the directions in that space along which the data varies the most.

  1. Standardize the data. Center each variable to mean zero and (optionally) scale to unit variance. This prevents variables with larger units from dominating the analysis.
  2. Compute the covariance (or correlation) matrix. This captures how each pair of variables co-varies.
  3. Extract eigenvalues and eigenvectors. Eigenvectors define the directions of principal components; eigenvalues indicate how much variance each component explains.
  4. Project the data. Transform the original data onto the new principal component axes to get scores.

The result: you can often represent the essential structure of 10, 50, or even 1000 variables in just 2–3 principal components, making visualization and interpretation feasible.
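The four steps above can be sketched directly in NumPy (a minimal illustration, not a production implementation; for real work a library routine such as scikit-learn's `PCA` is preferable):

```python
import numpy as np

def pca_from_scratch(X, n_components=2):
    """PCA via eigendecomposition of the correlation matrix.

    X is an (n_samples, n_variables) array; returns the sample scores,
    all eigenvalues (variance explained per component), and eigenvectors.
    """
    # Step 1: standardize (center to mean zero, scale to unit variance).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the standardized data
    # (equal to the correlation matrix of the raw data).
    C = np.cov(Xs, rowvar=False)

    # Step 3: eigenvalues and eigenvectors; eigh returns them in ascending
    # order, so reverse to put the largest-variance component first.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: project onto the leading eigenvectors to get the scores.
    scores = Xs @ eigvecs[:, :n_components]
    return scores, eigvals, eigvecs
```

Two useful sanity checks: the variance of each score column equals the corresponding eigenvalue, and (because the correlation matrix is used) the eigenvalues sum to the number of variables.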

How to Read a Scree Plot

The scree plot displays the eigenvalue (or proportion of variance explained) for each principal component in descending order. It is named after the geological term for rubble at the base of a cliff — because the plot typically looks like a steep cliff followed by a flat "scree."

The "Elbow" Rule

Look for an "elbow" in the scree plot — the point where the curve bends sharply and the subsequent components contribute relatively little additional variance. Components before the elbow are retained; those after are considered noise.

Kaiser's Rule

Retain components with eigenvalues greater than 1 (when using the correlation matrix). The rationale: a component should explain at least as much variance as a single original variable.

Cumulative Variance Threshold

Retain enough components to explain a target proportion of total variance (commonly 70–90%). For example, if the first 3 components explain 85% of variance, that may be sufficient for most purposes.
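Unlike the elbow rule, which is judged by eye, Kaiser's rule and the cumulative-variance threshold are mechanical and easy to apply in code (a small sketch; the `retention_rules` helper and the example eigenvalues are illustrative):

```python
import numpy as np

def retention_rules(eigenvalues, target=0.85):
    """Apply two common retention rules to correlation-matrix eigenvalues.

    Returns (k_kaiser, k_cumulative): the number of components kept by
    Kaiser's rule and by the cumulative-variance threshold, respectively.
    """
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]

    # Kaiser's rule: keep components whose eigenvalue exceeds 1.
    k_kaiser = int(np.sum(ev > 1))

    # Cumulative variance: smallest k whose first k components together
    # explain at least the target proportion of total variance.
    cum = np.cumsum(ev) / ev.sum()
    k_cumulative = int(np.searchsorted(cum, target) + 1)

    return k_kaiser, k_cumulative
```

For eigenvalues `[3.0, 1.5, 0.3, 0.2]`, the first two components explain 90% of the variance, so both rules retain two components at the default 85% target.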

Understanding Score Plots and Loadings

Score Plot (PC1 vs. PC2)

The score plot shows each sample projected onto the first two (or more) principal components. Samples that are similar across many variables will cluster together. The score plot is your primary tool for detecting groupings, outliers, and batch effects.

In genomics, you might see healthy controls cluster on one side and disease patients on another. In quality control, a batch that drifts over time may trace a path across the plot.
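As a sketch of how such groupings surface in the scores (assuming NumPy and scikit-learn are available; the "control"/"disease" data, group sizes, and effect size are simulated for illustration), the snippet below builds two groups that differ on a third of their variables and projects them onto the first two components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Simulated study: 20 "control" and 20 "disease" samples, 30 variables,
# with the disease group shifted on the first 10 variables only.
controls = rng.normal(0.0, 1.0, size=(20, 30))
disease = rng.normal(0.0, 1.0, size=(20, 30))
disease[:, :10] += 2.0
X = np.vstack([controls, disease])

# Standardize, then project onto the first two principal components.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# A scatter of scores[:, 0] vs. scores[:, 1] would show the two groups
# as separate clusters, split primarily along PC1.
```

Because the group difference spans many correlated variables, it dominates the leading direction of variance, so the two clusters separate along PC1 even though no single variable was told apart in advance.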

Loadings

Loadings describe how much each original variable contributes to each principal component. A variable with a high absolute loading on PC1 strongly influences that component's scores. Examining loadings tells you which variables drive the observed patterns.

A biplot combines scores and loadings in one graph, showing both sample relationships and the variables responsible for those relationships.
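In scikit-learn, for instance, the rows of `PCA.components_` hold the unit-length eigenvectors; one common convention rescales them by the square root of the explained variance to obtain loadings. A small sketch with simulated variables (the names `var_a` through `var_d` are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three variables driven by a shared factor, plus one independent variable.
base = rng.normal(size=200)
X = np.column_stack([
    base + rng.normal(scale=0.1, size=200),  # var_a
    base + rng.normal(scale=0.1, size=200),  # var_b
    base + rng.normal(scale=0.1, size=200),  # var_c
    rng.normal(size=200),                    # var_d (independent)
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(X)

# components_[i] is the eigenvector for PC i; scaling by the square root
# of the explained variance gives loadings on the scale of correlations.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
```

Here `var_a` through `var_c` share a common factor, so they load heavily on PC1, while the independent `var_d` loads near zero: reading the loadings immediately identifies which variables drive the first component.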

When to Use PCA

  1. You have many correlated, numeric variables and want a compact summary of the dominant patterns.
  2. You need to visualize high-dimensional samples in two or three dimensions.
  3. You want to reduce multicollinearity before a downstream model (e.g., principal component regression).
  4. You are screening for clusters, outliers, or batch effects in exploratory analysis.

When NOT to Use PCA

  1. Your variables are mostly uncorrelated, so PCA offers little compression.
  2. The structure of interest is strongly nonlinear; consider kernel PCA, t-SNE, or UMAP instead.
  3. You need results stated in terms of the original variables: components are linear mixtures and can be hard to interpret.
  4. Your data are categorical; PCA assumes continuous, numeric inputs (multiple correspondence analysis is the categorical analogue).

Practical Tips

  1. Always standardize first. Unless all variables share the same unit and scale, use the correlation matrix (which standardizes implicitly).
  2. Check for outliers before PCA. Extreme outliers can dominate the first component, masking the true data structure.
  3. Report the variance explained. Always state how much variance your retained components capture.
  4. Label your axes. PC1 and PC2 are abstract — annotate with the percentage of variance explained: "PC1 (45.2%)" and "PC2 (18.7%)".

Frequently Asked Questions

How many samples do I need for PCA?

A common guideline is at least 5–10 observations per variable, though PCA can be computed with fewer samples than variables (common in genomics with thousands of genes). In such cases, the number of meaningful components is limited to min(n−1, p) where n is samples and p is variables.
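This limit is easy to verify numerically (a sketch assuming scikit-learn; the 10 × 500 "gene expression" matrix is simulated):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# More variables than samples: 10 samples, 500 simulated "genes".
X = rng.normal(size=(10, 500))

pca = PCA().fit(X)

# Centering removes one degree of freedom, so at most n - 1 = 9 components
# carry nonzero variance, no matter how many variables there are.
n_nonzero = np.sum(pca.explained_variance_ > 1e-10)
```

scikit-learn computes min(n, p) = 10 components here, but the tenth has (numerically) zero variance: only nine directions are meaningful.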

Should I use the covariance matrix or the correlation matrix?

Use the correlation matrix (equivalently, standardize first) when your variables have different units or vastly different scales. Use the covariance matrix when all variables share the same unit and you want to preserve the original variance structure (e.g., all measurements in mg/dL).
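The practical difference is easy to demonstrate (assuming scikit-learn; the two variables and their scales are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Two independent variables on very different scales.
X = np.column_stack([
    rng.normal(100, 50, size=300),  # large-scale variable
    rng.normal(0, 0.5, size=300),   # small-scale variable
])

# Covariance-matrix PCA (raw data): the large-scale variable's variance
# swamps the other, so PC1 absorbs essentially everything.
cov_ratio = PCA().fit(X).explained_variance_ratio_[0]

# Correlation-matrix PCA (standardize first): with independent variables
# on equal footing, each component explains roughly half the variance.
corr_ratio = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_[0]
```

On the raw data PC1 explains nearly 100% of the variance purely because of units, whereas after standardization it explains about 50%, which is the honest answer for two independent variables.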

Is PCA the same as Factor Analysis?

No. PCA is a data reduction technique that extracts components explaining maximum variance. Factor Analysis models latent factors that cause the observed correlations. PCA is more commonly used for practical dimensionality reduction; Factor Analysis is more appropriate when you have a theoretical model of underlying constructs.

Can PCA handle missing data?

Standard PCA requires complete data. For datasets with missing values, options include: (1) removing incomplete rows (if few), (2) imputation (mean, KNN, or multiple imputation), or (3) specialized algorithms like NIPALS that handle missing values directly.
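Option (2) can be sketched with scikit-learn's `SimpleImputer` followed by ordinary PCA (the data and the missingness rate are simulated for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))

# Knock out roughly 5% of values at random to simulate missing data.
mask = rng.random(X.shape) < 0.05
X_missing = X.copy()
X_missing[mask] = np.nan

# Mean imputation per column, then standard PCA on the completed matrix.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)
scores = PCA(n_components=2).fit_transform(X_imputed)
```

Mean imputation is the crudest of the options listed: it shrinks variance and can attenuate correlations, so for more than a few percent missing values KNN or multiple imputation is usually preferable.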

This tool is free forever. If it saved you time, consider buying me a coffee.