
2026-01-06
Today we formalize the ideas from last lecture with data generating processes and statistical models.
A key insight: all data is generated by some underlying process
Thinking about data generating processes (DGPs) is fundamental to analyzing data.
A statistical model is a formal mathematical representation of a DGP.
DGP: Flipping a coin (heads or tails)
Statistical Model: Bernoulli distribution
\[P(X) = \begin{cases} p & \text{if } X = 1 ~\text{(heads)} \\ 1 - p & \text{if } X = 0 ~\text{(tails)} \end{cases}\]
For a fair coin: \(p = 0.5\)
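A minimal simulation sketch (plain Python; the helper name `flip_coin` is just for illustration): draw from a Bernoulli DGP and estimate \(p\) from the simulated flips.

```python
import random

def flip_coin(p=0.5):
    """One Bernoulli draw: 1 = heads with probability p, 0 = tails."""
    return 1 if random.random() < p else 0

# Simulate 1,000 flips of a fair coin and estimate p from the data.
flips = [flip_coin(p=0.5) for _ in range(1_000)]
p_hat = sum(flips) / len(flips)
print(f"Estimated P(heads) = {p_hat:.3f}")  # should land near 0.5
```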
The Bernoulli model ignores many real-world factors, but it accurately describes the outcomes, and that’s what matters!
Roll a die 100 times. Record whether each roll is even or odd.
Result: P(even) = P(odd) = 0.5
This is statistically identical to flipping a fair coin!
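A quick sketch to check this equivalence by simulation (illustrative code, not part of the lecture materials):

```python
import random

n = 100
rolls = [random.randint(1, 6) for _ in range(n)]     # die rolls: 1..6
evens = sum(1 for r in rolls if r % 2 == 0)          # count even rolls
heads = sum(random.randint(0, 1) for _ in range(n))  # fair-coin heads

print(f"P(even)  ≈ {evens / n:.2f}")   # both hover around 0.5
print(f"P(heads) ≈ {heads / n:.2f}")
```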

Find a simple model that captures essential features of the DGP.
With limited data, it is hard to tell whether you have a good model.
Extreme example: Flip a coin once
You can’t learn much about tendencies from a single observation.
Scenario: You and your roommate flip a coin to decide who takes out the trash. In 10 flips, heads comes up only 3 times.
Your roommate: “That’s not fair! You rigged the coin!”
Is your roommate justified?
| Scenario | P(3 heads in 10 flips) |
|---|---|
| Fair coin (p = 0.5) | ~11.7% |
| Biased coin (p = 0.25) | ~25.0% |
With 10 flips, you have some information but not enough to be confident.
| Scenario | P(30 heads in 100 flips) |
|---|---|
| Fair coin (p = 0.5) | ~0.002% |
| Biased coin (p = 0.25) | ~4.6% |
With more data, evidence becomes much stronger!
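Both tables follow directly from the binomial probability \(P(k \text{ heads in } n \text{ flips}) = \binom{n}{k} p^k (1-p)^{n-k}\). A short sketch reproduces the numbers (the helper `binom_pmf` is defined here only for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact probability of k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(f"{binom_pmf(3, 10, 0.5):.1%}")     # ~11.7%  fair coin, 3 heads in 10
print(f"{binom_pmf(3, 10, 0.25):.1%}")    # ~25.0%  biased coin, 3 heads in 10
print(f"{binom_pmf(30, 100, 0.5):.4%}")   # ~0.002% fair coin, 30 heads in 100
print(f"{binom_pmf(30, 100, 0.25):.1%}")  # ~4.6%   biased coin, 30 heads in 100
```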
As sample size increases, the sample mean converges to the population mean.
Formally: Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with expected value \(\mathbb{E}[X]\)
\[\mathbb{P} \left[\lim_{n \to \infty} \bar{X}_n = \mathbb{E}[X]\right] = 1\]
The sample mean \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i\) converges to the true mean.
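A running-average sketch (plain Python; the seed is only to make the illustration reproducible) shows what the LLN describes, using fair-die rolls with \(\mathbb{E}[X] = 3.5\):

```python
import random

random.seed(1)  # reproducible illustration only
rolls = [random.randint(1, 6) for _ in range(100_000)]

# The running sample mean settles near E[X] = 3.5 as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    x_bar = sum(rolls[:n]) / n
    print(f"n = {n:>7}: sample mean = {x_bar:.3f}")
```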


As sample size increases, the sample mean fluctuates less and settles near the true mean.
Practical implication: more data → more confident conclusions
Notice: even with high variability in small samples, the sample mean is centered around the true mean.
An estimator is unbiased if its expected value equals the true parameter.
\[\mathbb{E}[\bar{X}] = \mu\]
As \(n \to \infty\), the variance of this estimator, \(\sigma^2/n\), shrinks to zero; this is what drives the convergence in the LLN.
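A small sketch to see both claims at once (the helper `sample_mean` is illustrative): across many repeated samples of fair-die rolls, the average of the sample means sits near \(\mu = 3.5\) (unbiasedness), while their spread shrinks as \(n\) grows.

```python
import random
from statistics import mean, stdev

def sample_mean(n):
    """Mean of n fair-die rolls; the true mean is mu = 3.5."""
    return mean(random.randint(1, 6) for _ in range(n))

random.seed(2)
for n in (5, 50, 500):
    means = [sample_mean(n) for _ in range(2_000)]
    # The average of sample means stays near 3.5; their sd shrinks with n.
    print(f"n = {n:>3}: mean of sample means = {mean(means):.3f}, sd = {stdev(means):.3f}")
```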
The distribution of the sample mean approaches a normal distribution, regardless of the original distribution!
Formally: Let \(X_1, \ldots, X_n\) be i.i.d. with mean \(\mu\) and finite variance \(\sigma^2\). Then
\[\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} N\left(0, \sigma^2\right), \quad \text{i.e., for large } n, \quad \bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)\]
Even if \(X\) is not normal, the average of many \(X\)’s is approximately normal!
Rolling a die: uniform distribution from 1 to 6
But what about the average of many dice rolls?
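A sketch of the answer by simulation: average \(n = 30\) rolls, repeat many times, and the averages cluster around 3.5 with spread close to the CLT's \(\sigma/\sqrt{n}\) (for a fair die, \(\sigma = \sqrt{35/12} \approx 1.71\)).

```python
import random
from statistics import mean, stdev

random.seed(3)
n = 30                                 # rolls per average
averages = [mean(random.randint(1, 6) for _ in range(n)) for _ in range(5_000)]

mu, sigma = 3.5, (35 / 12) ** 0.5      # mean and sd of a single fair-die roll
se = sigma / n**0.5

print(f"mean of averages  = {mean(averages):.3f}  (CLT predicts {mu})")
print(f"sd of averages    = {stdev(averages):.3f}  (CLT predicts {se:.3f})")

# Roughly 68% of the averages fall within one SE of mu, as a normal would.
share = sum(abs(a - mu) <= se for a in averages) / len(averages)
print(f"share within 1 SE = {share:.2f}")
```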


Even when we don’t know the original distribution, the CLT tells us that the distribution of the sample mean is approximately normal.
The normal distribution is well understood and fully described by its mean and variance, which makes it easy to work with.
The CLT tells us the sample mean has variance \(\frac{\sigma^2}{n}\)
Standard error = standard deviation of the sample mean:
\[SE = \frac{\sigma}{\sqrt{n}}\]
As \(n\) increases, SE decreases → estimates become more precise
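A tiny sketch of the formula in action, again using fair-die rolls (\(\sigma \approx 1.71\)):

```python
sigma = (35 / 12) ** 0.5          # sd of a single fair-die roll, about 1.71

for n in (10, 100, 1_000, 10_000):
    se = sigma / n**0.5
    print(f"n = {n:>6}: SE = {se:.4f}")  # SE shrinks like 1 / sqrt(n)
```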

Data Generating Processes
Law of Large Numbers
Central Limit Theorem
Hypothesis Testing