🧬 PCA
Principal Component Analysis + Score Plot + Scree Plot
What Is PCA (Principal Component Analysis)?
Principal Component Analysis is a dimensionality reduction technique that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components. Each component captures a direction of maximum variance in the data, with the first component explaining the most variance, the second explaining the most remaining variance (orthogonal to the first), and so on.
PCA is used extensively in genomics (analyzing gene expression across thousands of genes), metabolomics (profiling hundreds of metabolites), quality control (monitoring multivariate processes), image compression, and exploratory data analysis. Any time you have more variables than you can visualize or interpret individually, PCA can help.
How PCA Works (Conceptual Overview)
Imagine you have data with 10 measured variables per sample. Each sample is a point in 10-dimensional space. PCA finds the directions in that space along which the data varies the most.
1. Standardize the data. Center each variable to mean zero and (optionally) scale to unit variance. This prevents variables with larger units from dominating the analysis.
2. Compute the covariance (or correlation) matrix. This captures how each pair of variables co-varies.
3. Extract eigenvalues and eigenvectors. Eigenvectors define the directions of principal components; eigenvalues indicate how much variance each component explains.
4. Project the data. Transform the original data onto the new principal component axes to get scores.
The result: you can often represent the essential structure of 10, 50, or even 1000 variables in just 2–3 principal components, making visualization and interpretation feasible.
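The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the dataset and variable names are invented for the example), not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 100 samples x 10 correlated variables,
# built from 2 latent factors plus a little noise
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(100, 10))

# 1. Standardize: mean zero, unit variance per variable
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Correlation matrix (= covariance of the standardized data)
R = np.cov(Z, rowvar=False)

# 3. Eigendecomposition; sort components by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the first two principal components
scores = Z @ eigvecs[:, :2]
print(scores.shape)  # (100, 2)
print(eigvals[:2] / eigvals.sum())  # proportion of variance explained
```

Because the data were generated from two latent factors, the first two eigenvalues should account for nearly all of the variance here.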
How to Read a Scree Plot
The scree plot displays the eigenvalue (or proportion of variance explained) for each principal component in descending order. It is named after the geological term for rubble at the base of a cliff — because the plot typically looks like a steep cliff followed by a flat "scree."
The "Elbow" Rule
Look for an "elbow" in the scree plot — the point where the curve bends sharply and the subsequent components contribute relatively little additional variance. Components before the elbow are retained; those after are considered noise.
Kaiser's Rule
Retain components with eigenvalues greater than 1 (when using the correlation matrix). The rationale: a component should explain at least as much variance as a single original variable.
Cumulative Variance Threshold
Retain enough components to explain a target proportion of total variance (commonly 70–90%). For example, if the first 3 components explain 85% of variance, that may be sufficient for most purposes.
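All three retention rules are easy to apply once you have the eigenvalues. A small sketch using made-up eigenvalues from a hypothetical 8-variable, correlation-matrix PCA:

```python
import numpy as np

# Hypothetical eigenvalues (correlation-matrix PCA of 8 variables;
# they sum to 8, the number of variables)
eigvals = np.array([3.2, 1.9, 1.1, 0.7, 0.5, 0.3, 0.2, 0.1])

prop = eigvals / eigvals.sum()  # proportion of variance per component
cum = np.cumsum(prop)           # cumulative proportion

kaiser = int(np.sum(eigvals > 1))                # Kaiser's rule
threshold = int(np.searchsorted(cum, 0.80) + 1)  # components to reach 80%

print(kaiser)     # 3 components have eigenvalue > 1
print(threshold)  # 4 components needed to reach 80% cumulative variance
```

The two rules can disagree, as they do here (3 vs. 4 components); the scree plot's elbow is usually the tiebreaker.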
Understanding Score Plots and Loadings
Score Plot (PC1 vs. PC2)
The score plot shows each sample projected onto the first two (or more) principal components. Samples that are similar across many variables will cluster together. The score plot is your primary tool for detecting groupings, outliers, and batch effects.
In genomics, you might see healthy controls cluster on one side and disease patients on another. In quality control, a batch that drifts over time may trace a path across the plot.
Loadings
Loadings describe how much each original variable contributes to each principal component. A variable with a high absolute loading on PC1 strongly influences that component's scores. Examining loadings tells you which variables drive the observed patterns.
A biplot combines scores and loadings in one graph, showing both sample relationships and the variables responsible for those relationships.
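To make the scores/loadings distinction concrete, here is a NumPy sketch on synthetic data in which two variables are deliberately made to track each other; the PC1 loadings then single out that pair as the driver. The data and seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=60)  # variable 1 tracks variable 0

# Standardize, then eigendecompose the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                   # sample coordinates (what a score plot shows)
loadings = eigvecs * np.sqrt(eigvals)  # correlation of each variable with each PC

# Variables 0 and 1 should carry large absolute loadings on PC1,
# because their shared variation is the dominant direction
print(np.abs(loadings[:, 0]).round(2))
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the score plot; overlaying arrows for the rows of `loadings` gives a biplot.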
When to Use PCA
- Exploratory analysis. Before any hypothesis testing, PCA reveals structure, clusters, and outliers in multivariate data.
- Dimensionality reduction. Reduce 100 variables to 3–5 components for use as inputs to regression, clustering, or classification.
- Multicollinearity. When predictor variables are highly correlated, PCA can create orthogonal components for regression (PCR — Principal Component Regression).
- Visualization. Project high-dimensional data into 2D or 3D for plotting.
- Quality control. Monitor multivariate processes and detect out-of-spec batches.
When NOT to Use PCA
- When variables are already uncorrelated. PCA gains power from correlations. If all variables are independent, PCA offers no reduction.
- When interpretability is critical. Principal components are linear combinations of all original variables, which can be difficult to interpret biologically. Consider factor analysis if you need interpretable latent factors.
- For non-linear relationships. PCA captures linear structure. For non-linear dimensionality reduction, consider t-SNE or UMAP.
Practical Tips
- Standardize first. Unless all variables share the same unit and scale, standardize before PCA or, equivalently, use the correlation matrix (which standardizes implicitly).
- Check for outliers before PCA. Extreme outliers can dominate the first component, masking the true data structure.
- Report the variance explained. Always state how much variance your retained components capture.
- Label your axes. PC1 and PC2 are abstract — annotate with the percentage of variance explained: "PC1 (45.2%)" and "PC2 (18.7%)".
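The axis labels in that last tip can be generated directly from the eigenvalues. A trivial sketch (the eigenvalues below are invented for illustration):

```python
import numpy as np

eigvals = np.array([4.5, 1.9, 1.2, 0.9, 0.8, 0.7])  # hypothetical, sum = 10
pct = 100 * eigvals / eigvals.sum()

xlabel = f"PC1 ({pct[0]:.1f}%)"
ylabel = f"PC2 ({pct[1]:.1f}%)"
print(xlabel, ylabel)  # PC1 (45.0%) PC2 (19.0%)
```

These strings can be passed straight to your plotting library's axis-label calls.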
Frequently Asked Questions
How many samples do I need for PCA?
A common guideline is at least 5–10 observations per variable, though PCA can be computed with fewer samples than variables (common in genomics with thousands of genes). In such cases, the number of meaningful components is limited to min(n−1, p) where n is samples and p is variables.
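The min(n−1, p) limit is easy to verify numerically: with 8 samples and 100 variables, centering removes one degree of freedom, so at most 7 components can carry any variance. A quick check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 100))  # 8 samples, 100 variables (n << p)

Z = X - X.mean(axis=0)  # centering alone suffices for this rank check
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))
nonzero = int(np.sum(eigvals > 1e-8))
print(nonzero)  # at most n - 1 = 7
```

The remaining 93 eigenvalues are zero up to floating-point noise, which is why genomics PCAs with thousands of genes still produce only a handful of meaningful components.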
Should I use the covariance matrix or the correlation matrix?
Use the correlation matrix (equivalently, standardize first) when your variables have different units or vastly different scales. Use the covariance matrix when all variables share the same unit and you want to preserve the original variance structure (e.g., all measurements in mg/dL).
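The difference is easy to see numerically. In this sketch (synthetic data, scaling chosen for the example), one variable is given a much larger unit; covariance-matrix PCA lets it dominate PC1, while correlation-matrix PCA does not:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X[:, 0] *= 100.0  # variable 0 measured in much larger units

def top_pc_loading(M):
    """Absolute eigenvector entries of the largest-eigenvalue component."""
    vals, vecs = np.linalg.eigh(np.cov(M, rowvar=False))
    return np.abs(vecs[:, np.argmax(vals)])

cov_pc1 = top_pc_loading(X)                                   # covariance PCA
cor_pc1 = top_pc_loading((X - X.mean(0)) / X.std(0, ddof=1))  # correlation PCA

print(cov_pc1.round(2))  # variable 0 dominates (entry near 1)
print(cor_pc1.round(2))
```

With the covariance matrix, PC1 is essentially just variable 0 rescaled; the correlation matrix puts all three variables on an equal footing.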
Is PCA the same as Factor Analysis?
No. PCA is a data reduction technique that extracts components explaining maximum variance. Factor Analysis models latent factors that cause the observed correlations. PCA is more commonly used for practical dimensionality reduction; Factor Analysis is more appropriate when you have a theoretical model of underlying constructs.
Can PCA handle missing data?
Standard PCA requires complete data. For datasets with missing values, options include: (1) removing incomplete rows (if few), (2) imputation (mean, KNN, or multiple imputation), or (3) specialized algorithms like NIPALS that handle missing values directly.
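Option (2), simple mean imputation, can be sketched in a few lines of NumPy (synthetic data with values knocked out at random; for real work, KNN or multiple imputation is usually preferable to column means):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% missing completely at random

# Column-mean imputation: replace each NaN with its variable's mean
col_means = np.nanmean(X, axis=0)
X_imp = np.where(np.isnan(X), col_means, X)

# PCA then proceeds on the completed matrix as usual
Z = (X_imp - X_imp.mean(axis=0)) / X_imp.std(axis=0, ddof=1)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]
print(eigvals.round(2))
```

Note that mean imputation shrinks the imputed variables toward their centers, which slightly deflates the corresponding variances; with more than a few percent missing, this can distort the leading components.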
This tool is free forever. If it saved you time, consider buying me a coffee.