Lecture 07: Permutation Tests

Joseph Rudoler

2026-01-06

Recap

Bootstrap:

  • Resample with replacement from data
  • Estimate distribution of any statistic
  • Build confidence intervals

Today: Permutation tests — another powerful resampling method

Permutation Tests

Like bootstrap, permutation tests are non-parametric:

  • No assumptions about underlying distribution
  • Rely on resampling

Key difference: resample without replacement
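
A quick illustration of the distinction (a minimal sketch; the array is made up):

Code
import numpy as np

rng = np.random.default_rng(0)
z = np.array([1, 2, 3, 4, 5])

# Bootstrap-style resampling: WITH replacement, so values may repeat
print(rng.choice(z, size=len(z), replace=True))

# Permutation-style resampling: WITHOUT replacement, same values in a new order
print(rng.permutation(z))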

The Setup

We have two samples \(X\) and \(Y\).

Question: Do they come from different distributions?

Null hypothesis (\(H_0\)): Both samples come from the same distribution.

The Logic

If \(H_0\) is true:

  • \(X\) and \(Y\) are drawn from the same distribution
  • We can combine them: \(Z = X \cup Y\)
  • Any random split of \(Z\) is just as valid as the original

Key insight: Under \(H_0\), the labels (X vs Y) are exchangeable!

Permutation Test Algorithm

  1. Compute observed test statistic (e.g., difference in means)
  2. Combine samples: \(Z = X \cup Y\)
  3. Randomly permute (shuffle) \(Z\)
  4. Split into new “X” and “Y” of original sizes
  5. Compute test statistic on shuffled data
  6. Repeat steps 3-5 many times
  7. p-value = proportion of permuted statistics at least as extreme as the observed one (≥ observed for a one-sided test; compare absolute values for a two-sided test)

Permutation Test Code

Code
import numpy as np

def permutation_test(test_func, x, y, num_permutations=10000, rng=None, one_sided=True):
    # Test statistic on the original labeling of the data
    observed_stat = test_func(x, y)
    # Under H0 the labels are exchangeable, so pool the samples
    combined = np.concatenate([x, y])
    if rng is None:
        rng = np.random.default_rng()

    count = 0
    for _ in range(num_permutations):
        # Shuffle, then split back into groups of the original sizes
        permuted = rng.permutation(combined)
        x_perm = permuted[:len(x)]
        y_perm = permuted[len(x):]
        permuted_stat = test_func(x_perm, y_perm)

        # Count permuted statistics at least as extreme as the observed one
        if one_sided:
            if permuted_stat >= observed_stat:
                count += 1
        else:
            if np.abs(permuted_stat) >= np.abs(observed_stat):
                count += 1

    return count / num_permutations

Application: NBA Scoring

Is SGA (Shai Gilgeous-Alexander) a better scorer than Giannis (Antetokounmpo)?

Permutation test p-value: 0.0336
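
A sketch of how this number could be produced with the permutation_test function above. The arrays sga_points and giannis_points are hypothetical; they stand in for per-game scoring data not shown here.

Code
def mean_diff(x, y):
    # One-sided statistic: how much higher is x's average?
    return np.mean(x) - np.mean(y)

# sga_points and giannis_points: hypothetical arrays of points per game
p_value = permutation_test(mean_diff, sga_points, giannis_points,
                           num_permutations=10000, one_sided=True)
print(f"Permutation test p-value: {p_value:.4f}")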

Bootstrap vs Permutation

Both can be used for hypothesis testing, but for two-sample comparisons permutation tests are often more powerful.

Power = probability of correctly rejecting \(H_0\) when it’s false

A more powerful test detects true effects more reliably.
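
Power can itself be estimated by simulation: generate many datasets where \(H_0\) is false, run the test on each, and count how often it rejects. A minimal sketch reusing mean_diff and permutation_test from above (the effect size, sample sizes, and repetition counts are arbitrary choices for illustration):

Code
rng = np.random.default_rng(0)
alpha = 0.05
num_datasets = 200

rejections = 0
for _ in range(num_datasets):
    # Simulate data where H0 is false: the group means truly differ by 1
    x = rng.normal(loc=1.0, size=20)
    y = rng.normal(loc=0.0, size=20)
    p = permutation_test(mean_diff, x, y, num_permutations=500,
                         rng=rng, one_sided=True)
    if p < alpha:
        rejections += 1

print(f"Estimated power: {rejections / num_datasets:.2f}")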

There Is Really Only One Test!

All statistical tests follow the same pattern:

  1. Compute a test statistic on observed data
  2. Choose a null hypothesis / model
  3. Generate null distribution (analytically or by simulation)
  4. Compare observed statistic to null distribution → p-value
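
As an illustration, here is the pattern applied to a made-up coin-flip example (the observed count is invented): the statistic is the number of heads, the null model is a fair coin, and the null distribution is simulated.

Code
import numpy as np

# 1. Test statistic on observed data: number of heads in 100 flips
observed_heads = 62

# 2. Null model: a fair coin
# 3. Simulate the null distribution of the statistic
rng = np.random.default_rng(0)
null_heads = rng.binomial(n=100, p=0.5, size=10000)

# 4. p-value: fraction of simulated statistics at least as extreme
p_value = np.mean(null_heads >= observed_heads)
print(f"p-value: {p_value:.4f}")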

Named Tests Are Special Cases

Most “named” tests (t-test, chi-square, ANOVA, etc.) are just:

  • Specific test statistics
  • Specific null distributions
  • Often derived analytically (before computers!)

Simulation-based methods (bootstrap, permutation) are more flexible!

Example: Two-Sample t-test

The t-test compares means of two groups.

Assumptions: within each group, the data are normally distributed (the pooled version also assumes equal variances).

Under this assumption, the test statistic follows a t-distribution.
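
For reference, the pooled two-sample t statistic is

\[
t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1) s_x^2 + (n_2 - 1) s_y^2}{n_1 + n_2 - 2},
\]

where \(s_x^2\) and \(s_y^2\) are the sample variances.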

t-Distribution vs Normal
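
A quick numerical check of the heavier tails (a minimal sketch; df = 18 matches the two-sample example below):

Code
from scipy import stats

# Tail probability P(|T| > 2): larger for t, especially at small df
for df in [2, 5, 18]:
    print(f"t (df={df:>2}): {2 * stats.t.sf(2, df=df):.4f}")
print(f"normal:     {2 * stats.norm.sf(2):.4f}")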

t-Distribution Properties

  • Heavier tails than normal (more extreme values)
  • Converges to normal as sample size increases
  • Degrees of freedom (df) = \(n_1 + n_2 - 2\) for two samples

Comparing Methods

Let’s compare:

  1. Parametric t-test (using scipy)
  2. Manual calculation
  3. Sampling from t-distribution
  4. Permutation test

The Comparison

Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(43)
samples_a = rng.uniform(low=-1, high=3, size=10)
samples_b = rng.uniform(low=-3, high=1, size=10)

# 1. Parametric t-test
t_result = stats.ttest_ind(samples_a, samples_b, equal_var=True)
print(f"Parametric t-test p-value: {t_result.pvalue:.6f}")

# 2. Manual t-statistic (this simplified pooled formula is valid here
#    because both samples have the same size n)
diff_means = np.mean(samples_a) - np.mean(samples_b)
pooled_var = np.var(samples_a, ddof=1) + np.var(samples_b, ddof=1)
pooled_std_error = np.sqrt(pooled_var / len(samples_a))
t_stat = diff_means / pooled_std_error
t_abs = np.abs(t_stat)
df = len(samples_a) + len(samples_b) - 2
p_value_manual = 2 * stats.t.sf(t_abs, df=df)
print(f"Manual calculation p-value: {p_value_manual:.6f}")

# 3. Sampling from t-distribution
t_samples = rng.standard_t(df=df, size=10000)
p_value_sim = np.mean(np.abs(t_samples) >= t_abs)
print(f"Simulated t-distribution p-value: {p_value_sim:.6f}")
Parametric t-test p-value: 0.014783
Manual calculation p-value: 0.014783
Simulated t-distribution p-value: 0.014400

Permutation t-Test

Code
def permutation_test_t(x, y, num_permutations=10000, rng=None):
    # Use the t statistic itself as the test statistic
    observed_stat = stats.ttest_ind(x, y, equal_var=True).statistic
    combined = np.concatenate([x, y])
    if rng is None:
        rng = np.random.default_rng()

    permuted_stats = []
    for _ in range(num_permutations):
        # Shuffle and recompute the t statistic on the relabeled groups
        permuted = rng.permutation(combined)
        x_perm = permuted[:len(x)]
        y_perm = permuted[len(x):]
        permuted_stat = stats.ttest_ind(x_perm, y_perm, equal_var=True).statistic
        permuted_stats.append(permuted_stat)

    permuted_stats = np.array(permuted_stats)
    # Two-sided p-value: fraction of permuted |t| at least as large as observed
    p_value = np.mean(np.abs(permuted_stats) >= np.abs(observed_stat))
    return permuted_stats, p_value

permuted_stats, p_value_perm = permutation_test_t(samples_a, samples_b, rng=rng)
print(f"Permutation test p-value: {p_value_perm:.6f}")
Permutation test p-value: 0.015700

All Methods Agree!

Method                      p-value
--------------------------  --------
Parametric t-test           0.0148
Manual calculation          0.0148
Simulated t-distribution    ~0.0144
Permutation test            ~0.0157

They’re all the same test, just computed differently!

Visualizing the Agreement
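
A sketch of this comparison, reusing permuted_stats, t_stat, and df from the code above (the plotting choices are arbitrary):

Code
import matplotlib.pyplot as plt

xs = np.linspace(-5, 5, 200)
# Permutation null distribution as a density histogram
plt.hist(permuted_stats, bins=50, density=True, alpha=0.5,
         label="Permutation null")
# Analytical null: the t-distribution with the same degrees of freedom
plt.plot(xs, stats.t.pdf(xs, df=df), label=f"t-distribution (df={df})")
plt.axvline(t_stat, color="k", linestyle="--", label="Observed t")
plt.xlabel("t statistic")
plt.ylabel("Density")
plt.legend()
plt.show()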

Why This Matters

The null distributions are nearly identical!

This is why all the tests give similar p-values.

Key insight: Named tests are just specific implementations of the general hypothesis testing framework.

When to Use What?

Parametric tests (t-test, etc.):

  • Fast (no simulation needed)
  • Well-understood mathematically
  • Require distributional assumptions

Simulation-based tests (permutation, bootstrap):

  • More flexible
  • No distributional assumptions
  • Computationally intensive
  • Useful for complex statistics

Summary

Permutation Tests

  • Shuffle labels under null hypothesis
  • Compare observed statistic to permuted distribution
  • Non-parametric and powerful

The Unified Framework

  1. Test statistic
  2. Null distribution (analytical or simulated)
  3. p-value

All statistical tests are variations of this pattern!

Next Time

Linear Regression

  • Predicting outcomes from features
  • Quantifying relationships in data
  • Connecting inference to prediction