2026-01-06
So far we’ve focused on inference:
Now: Prediction — using data to forecast outcomes
How do we make predictions?
Why not just memorize?
Classic dataset: heights of parents and children
| | family | father | mother | gender | height | kids | male | female |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | M | 73.2 | 4 | True | False |
| 1 | 1 | 78.5 | 67.0 | F | 69.2 | 4 | False | True |
| 2 | 1 | 78.5 | 67.0 | F | 69.0 | 4 | False | True |
| 3 | 1 | 78.5 | 67.0 | F | 69.0 | 4 | False | True |
| 4 | 2 | 75.5 | 66.5 | M | 73.5 | 4 | True | False |
Task: Predict height of a new child (no other information)
Solution: Use the average height!
Mean height: 66.76 inches
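
As a rough sketch of why the average is a sensible default, the snippet below computes the mean of the five child heights visible in the table and checks that it has a smaller total squared error than nearby guesses. It uses only those five rows, so its result (70.78) differs from the full-dataset mean of 66.76 quoted above. The helper name `sum_squared_error` is made up for this illustration.

```python
from statistics import mean

# Child heights (inches) from the five rows shown in the table above.
# This is NOT the full dataset, so the mean differs from 66.76.
heights = [73.2, 69.2, 69.0, 69.0, 73.5]


def sum_squared_error(guess, values):
    """Total squared error if we predict `guess` for every value."""
    return sum((guess - v) ** 2 for v in values)


prediction = mean(heights)  # 70.78 for these five rows only

# Guessing one inch lower or higher than the mean gives a larger error.
error_at_mean = sum_squared_error(prediction, heights)
error_below = sum_squared_error(prediction - 1.0, heights)
error_above = sum_squared_error(prediction + 1.0, heights)
print(prediction, error_at_mean, error_below, error_above)
```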

What if we have more information?
If the other columns (for example, the parents' heights) are related to the child's height, they should inform our prediction!

Taller parents → taller children (generally)
But how much taller?
We need to quantify this relationship.
Problem: Father and mother heights are on different scales
Solution: Standardize (z-score) heights
\[z = \frac{\text{height} - \text{mean}}{\text{std dev}}\]
Then combine: midparent height = average of z-scores
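
A minimal sketch of the standardize-then-combine step, using only the five rows shown in the table. The slides do not say whether the population or sample standard deviation is used; this sketch assumes the population version (`statistics.pstdev`). The helper name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev

# Parent heights (inches) from the five rows shown in the table above.
father = [78.5, 78.5, 78.5, 78.5, 75.5]
mother = [67.0, 67.0, 67.0, 67.0, 66.5]


def z_scores(values):
    """Standardize: subtract the mean, then divide by the std dev."""
    mu = mean(values)
    sd = pstdev(values)  # population std dev (an assumption of this sketch)
    return [(v - mu) / sd for v in values]


father_z = z_scores(father)
mother_z = z_scores(mother)

# Midparent height: the average of the two parents' z-scores per child.
midparent = [(f + m) / 2 for f, m in zip(father_z, mother_z)]
print(midparent)
```

After standardizing, both parents are measured in the same unit (standard deviations from their own group's mean), so averaging them is meaningful.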

Model the relationship as a line:
\[\text{predicted height} = \text{slope} \times \text{midparent} + \text{intercept}\]
Ordinary Least Squares (OLS): Find the line that minimizes squared errors
\[\text{minimize } \sum_i (\text{predicted}_i - \text{actual}_i)^2\]
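
For a single predictor, the minimizing line has a closed form: the slope is the sum of products of deviations divided by the sum of squared deviations of the predictor, and the intercept makes the line pass through the point of means. The sketch below applies this to a small hypothetical dataset (made-up numbers, not the lecture's data), so its slope will not match the value reported on the slides. The function name `fit_ols` is also an assumption.

```python
from statistics import mean

# Hypothetical toy data (NOT the lecture's dataset):
# x = midparent height in z-score units, y = child height in inches.
x = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [64.0, 66.0, 66.5, 68.0, 69.5]


def fit_ols(x, y):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)
    return slope, intercept


slope, intercept = fit_ols(x, y)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# prints: slope = 2.60, intercept = 66.80
```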

Slope ≈ 1.7
Interpretation: For every 1-unit increase in midparent height (parents who are, on average, one standard deviation taller), predicted child height increases by about 1.7 inches.

Mean error: -2.43e-14 (zero, up to floating-point rounding)
Root Mean Squared Error: 3.39 inches
Naive RMSE (just use mean): 3.58 inches
Using parent info improves predictions! (3.39 vs 3.58 inches)
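
The comparison above can be sketched as follows, again on hypothetical toy numbers rather than the lecture's dataset, so the RMSE values will not match 3.39 and 3.58. The sketch fits the line, then computes the mean error, the model RMSE, and the RMSE of always predicting the mean. The helper name `rmse` is an assumption.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data (NOT the lecture's dataset).
x = [-1.0, -0.5, 0.0, 0.5, 1.0]      # midparent height (z-score units)
y = [64.0, 66.0, 66.5, 68.0, 69.5]   # child height (inches)

# Fit the OLS line with the closed-form formulas.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(mean((p - a) ** 2 for p, a in zip(predicted, actual)))


line_pred = [slope * xi + intercept for xi in x]
naive_pred = [y_bar] * len(y)  # always predict the mean

mean_error = mean(p - a for p, a in zip(line_pred, y))
model_rmse = rmse(line_pred, y)
naive_rmse = rmse(naive_pred, y)
print(mean_error, model_rmse, naive_rmse)
```

On the data used to fit it, an OLS line with an intercept can never have a larger RMSE than the mean-only prediction, because the mean-only prediction is itself a line (slope 0).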
Notice: residuals are approximately normal!
This motivates the probabilistic regression model:
\[Y \sim \mathcal{N}(\beta_0 + \beta_1 X, \sigma^2)\]
The response \(Y\) is a random variable with mean \(\beta_0 + \beta_1 X\) and variance \(\sigma^2\).
Because \(Y\) is random:
Correlation coefficient (\(r\)): measures linear relationship strength
| Value | Interpretation |
|---|---|
| \(r = 1\) | Perfect positive correlation |
| \(r = -1\) | Perfect negative correlation |
| \(r = 0\) | No linear relationship |
Key insight: When both variables are standardized, the slope equals the correlation!
\[r = \text{slope when } X \text{ and } Y \text{ are z-scored}\]
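
The identity can be checked numerically. The sketch below computes \(r\) directly from its definition, then fits an OLS slope after z-scoring both variables, and compares the two. The data are hypothetical toy numbers, and the helper names (`z_scores`, `ols_slope`, `correlation`) are made up for this illustration. The population standard deviation is assumed, but the result holds for either convention as long as both variables use the same one.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical toy data (NOT the lecture's dataset).
x = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [64.0, 66.0, 66.5, 68.0, 69.5]


def z_scores(values):
    """Standardize values to mean 0 and (population) std dev 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def ols_slope(x, y):
    """Closed-form OLS slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def correlation(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


r = correlation(x, y)
slope_z = ols_slope(z_scores(x), z_scores(y))
print(r, slope_z)  # the two values agree up to rounding
```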


Linear Regression
Correlation
Probabilistic Interpretation
Regression Inference and Multiple Regression