📈 Linear Regression

Simple linear regression + confidence band + residuals

What Is Linear Regression?

Linear regression is a method for modeling the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting a straight line through the data. The simplest form — simple linear regression — uses one predictor and fits the equation:

Y = β0 + β1X + ε

Where β0 is the intercept (value of Y when X = 0), β1 is the slope (change in Y for a one-unit increase in X), and ε represents the error term (the part of Y not explained by X).
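As a minimal sketch (the data here are invented for illustration, e.g. study hours vs. exam score), the intercept and slope of that equation can be estimated with the closed-form ordinary least-squares formulas:

```python
import numpy as np

# Hypothetical data: study hours (X) vs. exam score (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 57.0, 61.0, 68.0, 70.0, 77.0])

# Closed-form least-squares estimates:
# beta1 = sum of XY cross-deviations / sum of squared X deviations
x_mean, y_mean = X.mean(), Y.mean()
beta1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean  # the fitted line passes through (x̄, ȳ)

print(f"Y = {beta0:.2f} + {beta1:.2f}·X")  # prints: Y = 47.07 + 4.89·X
```

In practice you would use a library routine (`scipy.stats.linregress`, `numpy.polyfit`), but the two formulas above are exactly what those routines compute for the simple one-predictor case.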

Linear regression is arguably the most widely used statistical method in science. It underpins calibration curves in analytical chemistry, dose-response relationships in pharmacology, growth curves in biology, and predictive models in machine learning.

Understanding R² (Coefficient of Determination)

R² tells you what proportion of the variance in Y is explained by X. It ranges from 0 to 1: R² = 0 means the model explains none of the variation in Y, while R² = 1 means the regression line passes through every data point exactly.

Context matters enormously when evaluating R². In controlled laboratory experiments, R² > 0.95 is common and expected (e.g., standard curves). In behavioral or social science research, R² = 0.3 might be excellent. Never judge R² in isolation — consider the field, the noise level of the measurements, and the complexity of the phenomenon.
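R² follows directly from the definition above: one minus the ratio of unexplained to total variation. A short sketch, using the same made-up data as before:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 57.0, 61.0, 68.0, 70.0, 77.0])

# Fit the line, then compare leftover (residual) variation to total variation
beta1, beta0 = np.polyfit(X, Y, 1)
Y_hat = beta0 + beta1 * X
ss_res = np.sum((Y - Y_hat) ** 2)      # variation NOT explained by the line
ss_tot = np.sum((Y - Y.mean()) ** 2)   # total variation in Y
r_squared = 1 - ss_res / ss_tot        # ≈ 0.988 for this tidy example
```

A near-perfect R² like this is typical of a clean calibration-style dataset; real behavioral data would produce a much smaller value, as the paragraph above notes.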

The Slope and Its Interpretation

The slope (β1) is the heart of regression analysis. It tells you: for every one-unit increase in X, Y changes by β1 units, on average. The sign indicates direction (positive = both increase together; negative = one increases as the other decreases). The magnitude indicates the rate of change.

The p-value for the slope tests the null hypothesis that β1 = 0 (i.e., X has no linear effect on Y). A significant p-value means the slope is distinguishable from zero. The 95% confidence interval for the slope gives you a range of plausible values.
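Both quantities come straight out of `scipy.stats.linregress`; the 95% CI is the estimate plus or minus a t-critical value times the standard error, with n − 2 degrees of freedom (data again hypothetical):

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 57.0, 61.0, 68.0, 70.0, 77.0])

res = stats.linregress(X, Y)  # slope, intercept, r, p-value, stderr

# 95% CI for the slope: estimate ± t_crit · SE, with df = n - 2
t_crit = stats.t.ppf(0.975, df=len(X) - 2)
ci_low = res.slope - t_crit * res.stderr
ci_high = res.slope + t_crit * res.stderr
```

If `ci_low` and `ci_high` both sit on the same side of zero, the slope's p-value will be below .05; the two summaries are two views of the same test.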

Why Residual Analysis Matters

Residuals are the differences between observed Y values and the values predicted by the regression line. Examining residuals is not optional — it is essential. The regression equation can always be computed, but it is only meaningful if the underlying assumptions hold.

What to Look for in Residual Plots

A healthy residual plot shows points scattered randomly around zero with no visible pattern. A curved pattern suggests the relationship is not actually linear; a funnel shape (spread that widens or narrows across X) suggests non-constant variance; isolated points far from the rest flag potential outliers or influential observations.

Assumptions of Linear Regression

  1. Linearity. The relationship between X and Y is linear. Check with a scatter plot and residual plot.
  2. Independence. Observations are independent of each other. Violated in time-series data or clustered designs.
  3. Normality of residuals. Residuals should be approximately normally distributed. Check with a Q-Q plot or histogram of residuals. Moderate violations are tolerable with large samples.
  4. Homoscedasticity. The variance of residuals should be constant across all levels of X. Check by plotting residuals vs. predicted values.
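Two of these checks can be done numerically rather than by eye. The sketch below (same illustrative data) computes the residuals and applies a Shapiro-Wilk test for assumption 3; this is one common choice of normality test, not the only one:

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 57.0, 61.0, 68.0, 70.0, 77.0])

beta1, beta0 = np.polyfit(X, Y, 1)
residuals = Y - (beta0 + beta1 * X)
# Least-squares residuals always sum to (numerically) zero by construction

# Shapiro-Wilk test of residual normality (assumption 3):
# a LARGE p-value means no evidence against normality
w_stat, p_value = stats.shapiro(residuals)
```

Linearity and homoscedasticity (assumptions 1 and 4) are still best judged from the residual-vs-predicted plot itself, as described above.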

Confidence Bands vs. Prediction Bands

The confidence band shows the uncertainty around the regression line itself (the mean of Y at each X). It is narrower and answers: "Where might the true regression line be?"

The prediction band shows the uncertainty for a new individual observation. It is always wider than the confidence band because it accounts for both the uncertainty in the line and the natural scatter of individual data points.
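The "always wider" claim follows from the two standard-error formulas: the prediction band's SE contains an extra 1 under the square root for the scatter of an individual point. A sketch evaluating both half-widths at a single X value (data hypothetical):

```python
import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 57.0, 61.0, 68.0, 70.0, 77.0])
n = len(X)

beta1, beta0 = np.polyfit(X, Y, 1)
s = np.sqrt(np.sum((Y - (beta0 + beta1 * X)) ** 2) / (n - 2))  # residual SD
sxx = np.sum((X - X.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 3.5  # evaluate both bands at this X
se_mean = s * np.sqrt(1/n + (x0 - X.mean())**2 / sxx)      # line uncertainty only
se_pred = s * np.sqrt(1 + 1/n + (x0 - X.mean())**2 / sxx)  # + individual scatter

conf_half = t_crit * se_mean  # half-width of the confidence band at x0
pred_half = t_crit * se_pred  # half-width of the prediction band at x0 (wider)
```

Both bands are narrowest at the mean of X and flare outward toward the edges of the data, which is another reason extrapolation is risky.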

Beyond Simple Linear Regression

This calculator performs simple linear regression (one predictor). For more complex scenarios: use multiple regression when you have several predictors, polynomial regression when the relationship is curved, and logistic regression when the outcome is binary (yes/no).

Frequently Asked Questions

Can I use regression for prediction?

Yes, but only within the range of your observed X values (interpolation). Extrapolating beyond your data range is risky because the linear relationship may not hold outside that range. Always state the range of X values in your data when making predictions.

My R² is low. Does that mean the regression is useless?

Not necessarily. A low R² means the predictor explains only a small fraction of variance, but the slope might still be statistically significant and practically meaningful. For example, in epidemiology, a predictor that explains 5% of disease risk can still have major public health implications.

What is the difference between correlation and regression?

Correlation measures the strength of association between two variables symmetrically (r of X vs. Y equals r of Y vs. X). Regression is directional: it predicts Y from X, producing a specific equation. Correlation does not distinguish predictor from outcome; regression does.
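The asymmetry is easy to demonstrate: swapping X and Y leaves r unchanged but produces a different regression slope (illustrative data again):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([52.0, 57.0, 61.0, 68.0, 70.0, 77.0])

r_xy = np.corrcoef(X, Y)[0, 1]
r_yx = np.corrcoef(Y, X)[0, 1]           # identical: correlation is symmetric

slope_y_on_x = np.polyfit(X, Y, 1)[0]    # predicts Y from X
slope_x_on_y = np.polyfit(Y, X, 1)[0]    # predicts X from Y — a different number
```

The two slopes are linked to correlation by a tidy identity: their product equals r², which is one way to see that regression carries directional information that correlation does not.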

How do I report regression results?

APA style: "A simple linear regression revealed that study hours significantly predicted exam score, β = 3.45, SE = 0.78, t(48) = 4.42, p < .001, R² = .29." Include the regression equation, R², and ideally a scatter plot with the regression line.

This tool is free forever. If it saved you time, consider buying me a coffee.

☕ Buy me a coffee