Regression fits a line y = b0 + b1*x to data. The least squares method chooses b0 and b1 to minimize the sum of squared residuals. R-squared measures the fraction of variance in y explained by the line, and inference for the slope tells you whether the observed relationship is statistically distinguishable from zero.
Least squares line
The slope b1 = r * (sy / sx) and the intercept b0 = y-bar - b1 * x-bar. These minimize the sum of squared residuals. The correlation r measures linear association; the regression line passes through (x-bar, y-bar).
# Least squares regression
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar)**2 for x in xs)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
print(f"y = {b0:.2f} + {b1:.2f} * x")
print(f"Predicted y at x=5: {b0 + b1 * 5:.2f}")
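As a quick check on the formula above, the slope can also be computed as b1 = r * (sy / sx); on this data it agrees exactly with SS_xy / SS_xx. A minimal sketch:

```python
# Verify b1 = r * (sy / sx) equals SS_xy / SS_xx on the same toy data
import math

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
ss_xx = sum((x - x_bar)**2 for x in xs)
ss_yy = sum((y - y_bar)**2 for y in ys)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
r = ss_xy / math.sqrt(ss_xx * ss_yy)          # correlation coefficient
sx = math.sqrt(ss_xx / (n - 1))               # sample std dev of x
sy = math.sqrt(ss_yy / (n - 1))               # sample std dev of y
print(f"r = {r:.4f}")
print(f"r * (sy/sx) = {r * sy / sx:.4f}")
print(f"ss_xy/ss_xx = {ss_xy / ss_xx:.4f}")
```

The (n - 1) factors in sx and sy cancel in the ratio, which is why the two slope formulas are algebraically identical.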
Residuals
A residual is the observed value minus the predicted value: e = y - y-hat. Residuals should show no pattern when plotted against x or y-hat. Patterns indicate the linear model is missing something: curvature, heteroscedasticity, or outliers.
# Residuals
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
b0, b1 = 47.43, 5.43  # coefficients from the least squares fit above
residuals = [y - (b0 + b1*x) for x, y in zip(xs, ys)]
print(f"Residuals: {[round(r, 2) for r in residuals]}")
print(f"Sum: {sum(residuals):.4f}")
print(f"SSE: {sum(r**2 for r in residuals):.2f}")
R-squared
R-squared = 1 - SSE/SST, where SST is the total sum of squares. It measures the proportion of variance in y explained by x. An R-squared of 0.95 means the regression captures 95% of the variability. It equals the square of the correlation coefficient r.
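A sketch of this calculation on the same toy data, checking R-squared against the definition 1 - SSE/SST:

```python
# R-squared = 1 - SSE/SST for the toy data above
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
ss_xx = sum((x - x_bar)**2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1*x))**2 for x, y in zip(xs, ys))  # error sum of squares
sst = sum((y - y_bar)**2 for y in ys)                    # total sum of squares
r_sq = 1 - sse / sst
print(f"R-squared = {r_sq:.4f}")
```

For this data R-squared is close to 1, consistent with the near-linear pattern of the points.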
Inference for the slope
To test H0: beta1 = 0 (no linear relationship), compute t = b1 / SE(b1), where SE(b1) = sqrt(MSE / SS_xx) and MSE = SSE / (n - 2). This statistic follows a t-distribution with n - 2 degrees of freedom. A confidence interval for the slope is b1 +/- t* * SE(b1).
# Inference for slope
import math
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
b0, b1 = 47.43, 5.43  # coefficients from the least squares fit above
n = len(xs)
residuals = [y - (b0 + b1*x) for x, y in zip(xs, ys)]
mse = sum(r**2 for r in residuals) / (n - 2)
x_bar = sum(xs) / n
ss_xx = sum((x - x_bar)**2 for x in xs)
se_b1 = math.sqrt(mse / ss_xx)
t = b1 / se_b1
print(f"t = {t:.3f}, SE(b1) = {se_b1:.4f}")
print(f"Reject H0 at the 5% level? {'Yes' if abs(t) > 2.571 else 'No'}")  # 2.571: t critical value, 5 df
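The confidence interval b1 +/- t* * SE(b1) mentioned above can be computed the same way; a sketch using the tabulated t critical value for 5 degrees of freedom:

```python
# 95% confidence interval for the slope: b1 +/- t* * SE(b1)
import math

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
ss_xx = sum((x - x_bar)**2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_xx
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1*x) for x, y in zip(xs, ys)]
mse = sum(e**2 for e in residuals) / (n - 2)
se_b1 = math.sqrt(mse / ss_xx)
t_star = 2.571  # t critical value for 95% confidence, n - 2 = 5 df
lo, hi = b1 - t_star * se_b1, b1 + t_star * se_b1
print(f"95% CI for slope: ({lo:.3f}, {hi:.3f})")
```

Because the interval excludes zero, it agrees with the hypothesis test: the slope is significant at the 5% level.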
Prediction intervals
A confidence interval targets the mean response at x. A prediction interval targets a single new observation. The prediction interval is wider because it adds individual-level noise on top of estimation uncertainty: SE_pred = sqrt(MSE * (1 + 1/n + (x - x-bar)^2 / SS_xx)).
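A sketch contrasting the two intervals at x = 5 on the same toy data; the only difference is the extra "1 +" inside SE_pred:

```python
# Prediction interval vs confidence interval for the mean response at x = 5
import math

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [52, 58, 65, 70, 74, 80, 85]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
ss_xx = sum((x - x_bar)**2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_xx
b0 = y_bar - b1 * x_bar
mse = sum((y - (b0 + b1*x))**2 for x, y in zip(xs, ys)) / (n - 2)
x_new = 5
y_hat = b0 + b1 * x_new
se_mean = math.sqrt(mse * (1/n + (x_new - x_bar)**2 / ss_xx))      # mean response
se_pred = math.sqrt(mse * (1 + 1/n + (x_new - x_bar)**2 / ss_xx))  # new observation
t_star = 2.571  # 95% confidence, n - 2 = 5 df
print(f"CI for mean response: {y_hat:.2f} +/- {t_star * se_mean:.2f}")
print(f"PI for new observation: {y_hat:.2f} +/- {t_star * se_pred:.2f}")
```

The prediction interval is always wider, and both intervals widen as x_new moves away from x-bar.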