Assignment 04: Hypothesis Testing and Bootstrapping

Published

July 31, 2025

Hypothesis testing with flights data

This is the least structured problem you have received to date. The goal is for you to explore the given dataset and apply hypothesis testing techniques to turn vague questions into concrete statistical tests.

Below is code to import a dataset of flights across the US in January 2025.

Think about how you might use this data to design hypothesis tests that answer the following questions:

1. Does the time of day affect flight delays?
2. Do flights from different airlines have different delay patterns? (Hint: is there a difference in how long flights are delayed, or in how often they are delayed?)

import pandas as pd

url = (
    "https://www.dropbox.com/scl/fi/"
    "nnlww9mk1vmevn1lytywr/flights_ontime.csv"
    "?rlkey=iska1a863ezg640lvd86wgoky&dl=1"
)

# 1) read FL_DATE and pull times in as strings
time_cols = ["ARR_TIME", "DEP_TIME", "CRS_DEP_TIME", "CRS_ARR_TIME"]

df = (
    pd.read_csv(
        url,
        parse_dates=["FL_DATE"],
        date_format="%m/%d/%Y %I:%M:%S %p",     # strptime format string for FL_DATE
        dtype={col: str for col in time_cols},  # force them to string
    )
    .dropna(subset=time_cols)                   # drop rows missing any of the times
)

# 2) define a helper to zero-pad & parse “hhmm”
def hhmm_to_time(col):
    return (
        pd.to_datetime(
            col.str.zfill(4),     # “1”→“0001”, “59”→“0059”, “1323”→“1323”
            format="%H%M",
            errors="coerce"       # invalid (e.g. “2400”) → NaT
        )
        .dt.time                  # extract python datetime.time
    )

# 3) apply it to all of them in one go
df[time_cols] = df[time_cols].apply(hhmm_to_time)
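
As a starting point for the two questions above, here is a minimal sketch of one possible approach: a permutation test for a difference in mean delay (how long) and a percentile bootstrap interval for a difference in delay rates (how often). It runs on a small synthetic table rather than the real data. The column names CRS_DEP_HOUR, DEP_DELAY, and OP_CARRIER, the 17:00 cutoff, and the 15-minute "delayed" threshold are all assumptions made for illustration; check df.columns for the actual names and choose your own definitions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the flights table (NOT the real data).
# CRS_DEP_HOUR, DEP_DELAY and OP_CARRIER are hypothetical column names.
n = 2000
toy = pd.DataFrame({
    "CRS_DEP_HOUR": rng.integers(5, 23, size=n),        # scheduled hour, 5..22
    "OP_CARRIER": rng.choice(["AA", "BB"], size=n),     # two made-up carriers
})
# Build in a known effect: flights at 17:00 or later get 5 extra minutes.
toy["DEP_DELAY"] = rng.exponential(10, size=n) + 5 * (toy["CRS_DEP_HOUR"] >= 17)


def permutation_test(x, y, n_perm=2000, rng=rng):
    """Two-sided permutation test for a difference in means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                  # shuffle group labels
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)


def bootstrap_ci(x, y, stat, n_boot=2000, alpha=0.05, rng=rng):
    """Percentile bootstrap interval for stat(x) - stat(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        bx = rng.choice(x, size=len(x), replace=True)   # resample each group
        by = rng.choice(y, size=len(y), replace=True)
        diffs[i] = stat(bx) - stat(by)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi


# Question 1 (how long): evening vs. earlier departures.
late = toy["CRS_DEP_HOUR"] >= 17
observed, p_value = permutation_test(
    toy.loc[late, "DEP_DELAY"], toy.loc[~late, "DEP_DELAY"]
)
print(f"mean delay difference: {observed:.2f} min, p = {p_value:.4f}")

# Question 2 (how often): share of flights delayed more than 15 minutes.
delayed = (toy["DEP_DELAY"] > 15).astype(float)
is_aa = toy["OP_CARRIER"] == "AA"
lo, hi = bootstrap_ci(delayed[is_aa], delayed[~is_aa], np.mean)
print(f"95% bootstrap CI for difference in delay rate: [{lo:.3f}, {hi:.3f}]")
```

On the real table, the scheduled departure hour could be taken from the datetime.time values in CRS_DEP_TIME through their .hour attribute. Before trusting either result, state the null and alternative hypotheses explicitly and decide whether a difference in means, medians, or proportions is the quantity that best matches the question.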