Understanding Uncertainty – Why Statistics?

Course Goals

Our basic goals for the course are:

to build a strong intuition about data, where it comes from, and what questions it can answer.
to learn the basic computational skills needed to manipulate and analyze data. Working with data also helps build the intuition described in the first goal!

Why statistics?

Statistics is, essentially, the study of data and how to use it. People argue about the purpose of statistics, but basically you can do 3 things with data:

description
inference
prediction

Description

Let’s load in some data and take a look at it.

The dataset contains Airbnb listings in New York City, including prices, locations, and other features.

	id	name	host_id	host_identity_verified	host_name	borough	neighbourhood	lat	long	country	...	service_fee	minimum_nights	number_of_reviews	last_review	reviews_per_month	review_rate_number	calculated_host_listings_count	availability_365	house_rules	license
0	1001254	Clean & quiet apt home by the park	80014485718	unconfirmed	Madaline	brooklyn	Kensington	40.64749	-73.97237	United States	...	$193	10.0	9.0	10/19/2021	0.21	4.0	6.0	286.0	Clean up and treat the home the way you'd like...	NaN
1	1002102	Skylit Midtown Castle	52335172823	verified	Jenna	manhattan	Midtown	40.75362	-73.98377	United States	...	$28	30.0	45.0	5/21/2022	0.38	4.0	2.0	228.0	Pet friendly but please confirm with me if the...	NaN
2	1002403	THE VILLAGE OF HARLEM....NEW YORK !	78829239556	NaN	Elise	manhattan	Harlem	40.80902	-73.94190	United States	...	$124	3.0	0.0	NaN	NaN	5.0	1.0	352.0	I encourage you to use my kitchen, cooking and...	NaN

3 rows × 26 columns

Now there’s a lot you can do, but let’s start by visualizing the prices of listings.

Computing summary statistics is easy: the mean (average), the standard deviation (roughly, the typical distance of a value from the mean), and the quartiles (the values below which 25%, 50%, and 75% of the data fall).

count    102316.000000
mean        625.291665
std         331.677344
min          50.000000
25%         340.000000
50%         624.000000
75%         913.000000
max        1200.000000
Name: price, dtype: float64
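As a minimal sketch of how such a summary can be computed, the snippet below uses only Python's standard library. The `prices` list is made up for illustration and is not taken from the Airbnb dataset; the "inclusive" quantile method is chosen on the assumption that it matches the linear interpolation used in the summary above.

```python
import statistics

# Hypothetical nightly prices in dollars (illustrative only, not from the dataset).
prices = [120, 340, 560, 624, 700, 913, 1100]

mean_price = statistics.mean(prices)
std_price = statistics.stdev(prices)  # sample standard deviation (divides by n - 1)

# Cut points that split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")

print(f"mean = {mean_price:.2f}, std = {std_price:.2f}")
print(f"25% = {q1}, 50% = {median}, 75% = {q3}")
```

Half of the values lie between the 25% and 75% cut points, which is often a more robust picture of "typical" prices than the mean alone.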

We can even use specialized libraries to make use of the geographic information in the data. For example, we can use the geopandas library to plot the locations of listings on a map of New York City.

Let’s look at our Airbnb data again. What if instead of looking at the entire dataset, we only looked at a small “sample” or subset of the data?

Sample $\neq$ Population

Population

the entire set of data that you are interested in.

Sample

a subset of a population.

A random sample is a sample that is selected randomly from the population.

Example: Airbnb listings in New York City

We want to know the average price of Airbnb listings in New York City.

population: all Airbnb listings in New York City
sample: a smaller subset of those listings, which may or may not be representative of the entire population.
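The difference between a representative and an unrepresentative sample can be sketched with simulated data. Everything below is hypothetical: the "population" is 10,000 made-up prices, not real listings, and the seed is fixed only so the sketch is reproducible.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up prices between $50 and $1,200.
population = [random.randint(50, 1200) for _ in range(10_000)]

# Random sample: 100 listings chosen uniformly, without replacement.
random_sample = random.sample(population, k=100)

# Non-random sample: only the 100 cheapest listings.
cheapest_sample = sorted(population)[:100]

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
cheapest_mean = statistics.mean(cheapest_sample)

print(f"population mean:      {pop_mean:.1f}")
print(f"random sample mean:   {random_mean:.1f}")
print(f"cheapest sample mean: {cheapest_mean:.1f}")
```

The random sample's mean should land reasonably close to the population mean, while the "cheapest listings" sample misses badly no matter how carefully its mean is computed.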

What is the population?

Flexible definition:

Average price of all short-term rentals in New York City? Population: all rentals (not just Airbnb listings) in New York City.

Often, the population is actually more abstract or theoretical

Average price of all possible Airbnb listings in New York City? Population: all potential listings, not just the ones that currently exist.

Descriptive statistics are useful for understanding the data at hand, but they don’t necessarily tell us much about the world outside of the data. For that, we need to do something more.

Quiz: restaurant survey

Inference

What if we want to answer questions about a population based on a sample?

This is where inference comes in.

Use the given sample to infer something about the population.

How do we do this if we can’t ever see the entire population?

Need a link which connects the sample to the population
Treat the sample as the outcome of a data-generating process (DGP).

There is always a DGP

A data-generating process (DGP) is a theoretical construct that describes how data is generated in a population.

Encompasses all the factors that influence the data (incl. the mechanisms and relationships between variables).
There has to be a DGP, even if we don’t know what it is.
The DGP is the process that generates the data we observe.
The full, true DGP is usually unknown.
- We can make assumptions about it and use them to draw inferences about the population; those inferences are only as reliable as the assumptions behind them.

Statistical models

When the full DGP is too complicated / unknown, we use a model

simplified mathematical representation of the DGP
allows us to make inferences about the population based on the sample
ultimately a kind of guess about where your data come from.

Example: Airbnb listings.

Assume that all Airbnb listings in New York City are equally likely to be in any one of the five boroughs.
Probability of a listing being in Manhattan is 1/5, the probability of it being in Brooklyn is 1/5, etc.

Then we can look at the actual sample of listings and see if it matches our assumption:

Question: “If we assume that all boroughs are equally likely to produce each listing, how likely is it that we would see the distribution of listings that we actually observe?”

question about the probability of the sample, given a certain model of the DGP
it intuitively seems unlikely that we would see so many more listings in Manhattan and Brooklyn than in the other boroughs if all boroughs were equally likely to produce listings.
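One way to make this question concrete is to compute the probability of a set of borough counts under the "equally likely" model, which is a multinomial distribution. The sketch below is illustrative only: the `observed` counts are hypothetical numbers for a sample of 100 listings, not the actual counts from the dataset, and `multinomial_log_prob` is a helper written for this example.

```python
import math

def multinomial_log_prob(counts, probs):
    """Log-probability of observing `counts` under a multinomial model."""
    n = sum(counts)
    # log of n! / (c_1! * c_2! * ... * c_k!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(p) for c, p in zip(counts, probs))

# Model: each of the five boroughs is equally likely.
uniform_model = [1 / 5] * 5

# Hypothetical counts for 100 listings (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
observed = [43, 41, 13, 2, 1]
# What a perfectly even split would look like.
balanced = [20, 20, 20, 20, 20]

log_p_observed = multinomial_log_prob(observed, uniform_model)
log_p_balanced = multinomial_log_prob(balanced, uniform_model)

print(f"log P(observed | uniform model) = {log_p_observed:.1f}")
print(f"log P(balanced | uniform model) = {log_p_balanced:.1f}")
```

Log-probabilities are used because the raw probabilities are tiny; the more negative the value, the less likely the counts are under the model.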

Evaluating models

What should we do now?

Now that we realize our sample is very unlikely under our model, perhaps we should reconsider the model.
Model is just a “guess” about the DGP, while the sample is real data that we have observed.

Unlikely data or unlikely model?

There are two main culprits when we see a sample that is unlikely under our model:

The sample! Think of this as “luck of the draw”. This is only really a risk if your sample is small or systematically biased in some way. Usually, if you collect enough data at random, the sample will start to look more like the population. If you flip a fair coin 5 times, you might get all tails (there’s about a 3% chance of this happening, since $1/2^5 \approx 0.031$); if you flip it 100 times, there’s virtually no chance that you’ll get all tails (less than $10^{-30}$).
The model! This means that our assumptions about the DGP are incorrect or incomplete. This is a more serious problem, and it won’t go away just by collecting more data.
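The coin-flip numbers above can be checked directly. This short sketch assumes a fair coin, so each flip lands tails with probability 0.5 independently of the others.

```python
# Probability that every flip of a fair coin lands tails.
p_all_tails_5 = 0.5 ** 5      # 5 flips
p_all_tails_100 = 0.5 ** 100  # 100 flips

print(f"5 flips:   {p_all_tails_5:.5f}")
print(f"100 flips: {p_all_tails_100:.3e}")
```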

Statistical inference is basically just a bunch of mathematical machinery and techniques that help us to quantify this guesswork precisely and make it rigorous.

Inference requires domain knowledge

Don’t try this at home!

We just said that statistical inference makes guesswork rigorous, but this is not the whole story.

We will always do a much better job of inference if we have a good understanding of the DGP and the context of the data.

This requires domain knowledge and subject matter expertise.

In the Airbnb example:

Assuming that all boroughs are equally likely to produce listings is a pretty bad assumption
- Manhattan sees vastly more tourism than the other boroughs
- Brooklyn and Queens have by far the most residents according to recent census data.

Prediction

Prediction is the process of using a model to estimate outcomes for unseen (or future) data.

Back to the Airbnb data: we might want to predict which borough a new listing belongs to based on its features (e.g., listing type, review ratings, price, etc.).

To that end we will fit a predictive model to the data. Basic idea of the model:

we assume the features of the listing (e.g., price) are related to the probability of it being in a certain borough
- e.g., perhaps more expensive listings are more likely to be in Manhattan

Fitting a model

Models generally have parameters, which are adjustable values that affect the model’s behavior. Think of them like “knobs” you can turn to tune the model to do what you want, like adjusting the volume or the bass/treble on a speaker.

For example, a model of a coin flip has a single parameter: the probability of landing on heads.

If you turn the knob to 0.5, you get a fair coin;
if you turn it to 1.0, you get a coin that always lands on heads;
if you turn it to 0.0, you get a coin that always lands on tails.

Fitting a model means adjusting the parameters of the model so that it best matches the data. This is usually done by minimizing some kind of error function, which measures how poorly the model fits the data (lower error means a better fit).
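To illustrate "turning the knob", here is a minimal sketch that fits the coin model to a made-up sequence of flips. It assumes the negative log-likelihood as the error function and simply tries a grid of candidate values; real fitting procedures are usually more sophisticated, and the `flips` data is hypothetical.

```python
import math

# Hypothetical observed flips: 7 heads and 3 tails.
flips = "HHTHHTHHHT"
heads = flips.count("H")
tails = flips.count("T")

def negative_log_likelihood(p):
    """Error function: how poorly a coin with P(heads) = p explains the flips."""
    return -(heads * math.log(p) + tails * math.log(1 - p))

# Try knob settings 0.01, 0.02, ..., 0.99 and keep the one with the lowest error.
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=negative_log_likelihood)

print(f"best-fitting P(heads): {best_p}")
```

For this model the best setting coincides with the observed fraction of heads, which is a useful sanity check on the search.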

Predicting the borough of a listing

========================================
Prediction Accuracy: 45.38%
========================================

Evaluating predictions

Ok, so the model is around 45% accurate at predicting the borough of a listing.

What is a “good” prediction rate?

For discussion / reflection: What is a “good” prediction rate or accuracy? Is 45% good? What about 60%? 80%? How would you tell?
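One possible starting point for this discussion is to compare the model against simple baselines, such as guessing uniformly at random or always guessing the most common borough. The sketch below uses hypothetical borough labels, not the actual dataset, so the numbers it prints say nothing about the real model.

```python
from collections import Counter

# Hypothetical true boroughs for 100 listings (illustrative only).
true_labels = (
    ["manhattan"] * 43
    + ["brooklyn"] * 41
    + ["queens"] * 13
    + ["bronx"] * 2
    + ["staten island"] * 1
)

# Baseline 1: guess one of the boroughs uniformly at random.
uniform_guess_accuracy = 1 / len(set(true_labels))

# Baseline 2: always predict the most common borough.
majority, majority_count = Counter(true_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(true_labels)

print(f"uniform guessing accuracy:  {uniform_guess_accuracy:.0%}")
print(f"always guess '{majority}': {baseline_accuracy:.0%}")
```

An accuracy figure is easier to judge once it is placed next to baselines like these.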

Model predictions (distribution)

Now let’s take a look at the distribution of the model’s predictions.

It looks like the model is a bit crude (it predicts no listings in the Bronx or Staten Island), but it does at least capture the general trend that listings are more likely to be in Manhattan and Brooklyn than in the other boroughs.

Summary

3 objectives of data analysis: description, inference, and prediction.

Hopefully you now have a better understanding of what statistics is supposed to help you do with data. Of course, we haven’t actually gone into any of the details of how to do anything. (Don’t worry, we’ll get there!)

Up next:

basic programming concepts that are important for data science.
After that we will learn some foundational concepts in probability that will help us think about data and models more rigorously.

From there, the sky is the limit! We’ll cover a wide range of topics, including statistical inference, uncertainty quantification, machine learning, and more.

Since we haven’t learned any programming or statistics yet, we won’t have any real exercises for this lecture. There’s just a quick Assignment 0 to make sure you are set up to run Python code for future assignments.
