Lecture 03: Sampling and Simulation
Now that we have a good understanding of the basics of probability, we can start to explore how we deal with randomness computationally.
Sampling from probability distributions
A sample is a subset of data drawn from a more general population. That population can be thought of as a probability distribution – this distribution essentially describes how likely you are to observe different values when you sample from it.
We will quickly review some important concepts related to sampling.
Independent and identically distributed (IID) sampling
When we sample from a probability distribution, we often assume that the samples are independent and identically distributed (IID). This means that each sample is drawn from the same distribution and that the samples do not influence each other.
Coin flips are a good example of IID sampling. If you flip a fair coin multiple times, each flip has the same probability of being heads or tails (this is the “identically distributed” part), and the outcome of one flip does not affect the outcome of another (this is the “independent” part). The same is true for rolling a die!
We often apply this concept to more complex random processes as well, where we do not have such a clear understanding of the underlying process. For example, if we are sampling the heights of people in a city, we might assume that each person’s height is drawn from the same distribution (the distribution of heights in that city) and that one person’s height does not affect another’s. Whether or not the IID assumption holds in practice is an important question to consider when analyzing data – for example, do you think that the heights of people in a family are independent of each other?
Sampling with and without replacement
Another important concept in sampling is the distinction between sampling with replacement and sampling without replacement.
Sampling with replacement means that after we draw a sample from the population, we put it back before drawing the next sample. This means that the same object / instance can be selected multiple times.
Sampling without replacement means that once we draw a sample, we do not put it back before drawing the next sample. This means that each individual can only be selected once. This can introduce dependencies between samples, as the population changes after each draw.
Simulating a random sample
We can simulate a random process by sampling from a corresponding probability distribution.
Programmatic random sampling is not truly random, but rather “pseudo-random.” This means that the numbers generated are determined by an initial value called a “seed”. If you use the same seed, you will get the same sequence of random numbers. This is useful for reproducibility in experiments and simulations.
If you don’t specify a seed, the random number generator (RNG) will use a default seed that is typically based on the current date and time, which means that you will get different results each time you run the code.
There are built-in functions in many programming languages, including Python, that allow us to sample from common probability distributions. For example, in Python’s NumPy library, we can use numpy.random
module to sample from various distributions like uniform, normal, binomial, etc.
The normal distribution is one of the most commonly used probability distributions in statistics. It is useful for modeling lots of real-world data, especially when the data tends to cluster around a mean (or average) value. The normal distribution is defined by two parameters: the mean (average) and the standard deviation (which measures how spread out the data is around the mean).
For example, to sample 100 values from a normal distribution with mean 0 and standard deviation 1, you can use:
Code
= np.random.default_rng(seed=42) # Create a random number generator with a fixed seed
rng = rng.normal(loc=0, scale=1, size=100)
samples print("Samples:\n", samples)
=10, density=True)
plt.hist(samples, bins plt.show()
Samples:
[ 0.30471708 -1.03998411 0.7504512 0.94056472 -1.95103519 -1.30217951
0.1278404 -0.31624259 -0.01680116 -0.85304393 0.87939797 0.77779194
0.0660307 1.12724121 0.46750934 -0.85929246 0.36875078 -0.9588826
0.8784503 -0.04992591 -0.18486236 -0.68092954 1.22254134 -0.15452948
-0.42832782 -0.35213355 0.53230919 0.36544406 0.41273261 0.430821
2.1416476 -0.40641502 -0.51224273 -0.81377273 0.61597942 1.12897229
-0.11394746 -0.84015648 -0.82448122 0.65059279 0.74325417 0.54315427
-0.66550971 0.23216132 0.11668581 0.2186886 0.87142878 0.22359555
0.67891356 0.06757907 0.2891194 0.63128823 -1.45715582 -0.31967122
-0.47037265 -0.63887785 -0.27514225 1.49494131 -0.86583112 0.96827835
-1.68286977 -0.33488503 0.16275307 0.58622233 0.71122658 0.79334724
-0.34872507 -0.46235179 0.85797588 -0.19130432 -1.27568632 -1.13328721
-0.91945229 0.49716074 0.14242574 0.69048535 -0.42725265 0.15853969
0.62559039 -0.30934654 0.45677524 -0.66192594 -0.36305385 -0.38173789
-1.19583965 0.48697248 -0.46940234 0.01249412 0.48074666 0.44653118
0.66538511 -0.09848548 -0.42329831 -0.07971821 -1.68733443 -1.44711247
-1.32269961 -0.99724683 0.39977423 -0.90547906]
If you have a dataset and you want to sample from it, you can use the numpy.random.choice
function to randomly select elements from the dataset (with or without replacement). If your dataset is in a pandas DataFrame, you can also use the sample
method to randomly select rows from the DataFrame.
Code
# Sampling without replacement
= np.random.default_rng(seed=42)
rng = rng.choice(samples, size=10, replace=False)
subsample print("Sample:\n", subsample)
# another way to sample; note RNG can be different in different packages
=["Sample"]).sample(n=10, replace=False, random_state=42) pd.DataFrame(samples, columns
Sample:
[-1.32269961 -1.13328721 -0.01680116 -1.68286977 0.54315427 -1.68733443
0.85797588 -0.85304393 -0.04992591 -0.36305385]
Sample | |
---|---|
83 | -0.381738 |
53 | -0.319671 |
70 | -1.275686 |
45 | 0.218689 |
44 | 0.116686 |
39 | 0.650593 |
22 | 1.222541 |
80 | 0.456775 |
10 | 0.879398 |
0 | 0.304717 |
If you want to sample from a custom distribution, you can also use the numpy.random.choice
function to sample from a list of values with specified probabilities.
Here’s an example of how to sample 100 dice rolls with a rigged die that has a 50% chance of rolling a 6, and a 10% chance of rolling each of the other numbers (1-5):
Code
= [1, 2, 3, 4, 5, 6]
possible_rolls = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
probabilities =probabilities, size=100) rng.choice(possible_rolls, p
array([1, 3, 4, 2, 4, 6, 1, 6, 6, 4, 2, 6, 2, 4, 6, 6, 6, 4, 6, 6, 4, 6,
1, 3, 6, 2, 2, 6, 6, 6, 6, 6, 2, 5, 4, 1, 6, 5, 5, 2, 2, 6, 1, 3,
3, 5, 6, 5, 1, 3, 6, 6, 6, 6, 6, 1, 4, 6, 4, 6, 5, 6, 6, 6, 6, 6,
6, 4, 5, 6, 2, 4, 6, 6, 2, 4, 2, 1, 6, 5, 4, 6, 1, 3, 6, 6, 6, 6,
5, 6, 6, 6, 6, 6, 6, 2, 6, 5, 6, 6])
Simulating more complex processes
Sometimes real-world processes are complex, and the samples we take are not independent. The simplest version of non-independence is sampling without replacement.
Consider dealing poker hands from a standard deck of cards. When you deal a hand, you draw cards one at a time, and each card drawn affects the next card that can be drawn (because you do not put the card back into the deck).
Code
# Make a deck of cards (Ace is 1, King is 13)
= np.arange(1, 14).repeat(4) # 4 suits, each with cards 1 to 13
deck print("Deck of cards before shuffling:\n", deck)
= np.random.permutation(deck)
deck print("Deck of cards after shuffling:\n", deck)
# deal 2 cards to each of 4 players
= np.random.default_rng(seed=21)
rng # Get flat indices of 8 cards from deck
# we want to know exactly which cards are dealt
= rng.choice(len(deck), size=8, replace=False)
chosen_indices = deck[chosen_indices].reshape(4, 2)
hands print("Hands dealt to players:\n", hands)
# remove the dealt cards from the deck
= np.delete(deck, chosen_indices)
remaining_deck print("Remaining cards in the deck:\n", remaining_deck)
= np.random.choice(remaining_deck, size=5, replace=False)
board print("Community cards on the board:\n", board)
Deck of cards before shuffling:
[ 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6
7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 11 11 11 11 12 12 12 12
13 13 13 13]
Deck of cards after shuffling:
[12 3 3 1 5 13 3 6 13 4 8 5 11 7 8 3 7 12 2 11 1 10 9 13
9 11 7 5 6 8 1 10 10 11 4 2 4 8 6 13 6 10 12 9 2 7 4 12
2 1 5 9]
Hands dealt to players:
[[ 1 5]
[ 2 7]
[12 8]
[ 9 2]]
Remaining cards in the deck:
[12 3 3 1 13 3 6 13 4 8 5 11 8 3 7 11 1 10 13 9 11 7 5 6
1 10 10 11 4 4 8 6 13 6 10 12 9 2 7 4 12 2 5 9]
Community cards on the board:
[ 5 4 1 3 13]
But it can get even more complex than that. In many real-world scenarios, the process of generating data involves multiple steps or conditions that affect the outcome.
In these cases simulation might not be as straightforward as sampling from a single distribution (which takes just one or two lines of code). We then tend to write loops that simulate the process step by step, keeping track of the state of things as we go along.
Let’s consider an example of a musician busking for money in Rittenhouse Square. The musician’s earnings might depend on various factors like the weather and and the number of passersby. To keep it simple, let’s assume that the musician earns $3 for every passerby who stops to listen. Of course, not every passerby will stop – let’s pretend every passerby has the same 20% chance of stopping.
The musician might want to know how much money they can expect to earn in a day of busking. We can simulate this process by generating a random number of passersby and then calculating the earnings based on the stopping probability.
The Poisson distribution is commonly used to model the number of events that occur in a fixed interval of time or space, given a known average rate of occurrence. It assumes that the events occur independently and at a constant average rate. In our example, we can use the Poisson distribution to model the number of passersby in a given time period (e.g., one hour of busking).
Code
= 5
n_days # simulate whether it rains each day
= np.random.default_rng(seed=42)
rng = rng.uniform(0., 0.7, size=n_days)
rain_probabilities
= 0 # initialize a variable to keep track of total earnings
total_earnings for day in range(n_days):
# For each day, decide if it rains based on the probability
= rng.binomial(n=1, p=rain_probabilities[day])
did_it_rain print(f"Day {day + 1} ({rain_probabilities[day]:.2%} chance): {'Rain' if did_it_rain else 'No rain'}")
# Based on the outcome, the number of passersby changes
if did_it_rain:
= rng.poisson(lam=50) # fewer passersby when it rains
passersby else:
= rng.poisson(lam=200) # more passersby when it doesn't rain
passersby print(f"\t Number of passersby: {passersby}")
# Simulate the number who stop to listen to the busker
= rng.binomial(n=passersby, p=0.2) # 20% of passersby stop
listeners print(f"\t Number of listeners: {listeners}")
# Compute the busker's daily earnings
= 3 * listeners # $3 per listener
earnings print(f"\t Daily earnings: ${earnings}")
print("-" * 40)
+= earnings
total_earnings
print(f"Total earnings over {n_days} days: ${total_earnings}")
Day 1 (54.18% chance): No rain
Number of passersby: 211
Number of listeners: 41
Daily earnings: $123
----------------------------------------
Day 2 (30.72% chance): No rain
Number of passersby: 226
Number of listeners: 54
Daily earnings: $162
----------------------------------------
Day 3 (60.10% chance): Rain
Number of passersby: 51
Number of listeners: 13
Daily earnings: $39
----------------------------------------
Day 4 (48.82% chance): Rain
Number of passersby: 56
Number of listeners: 17
Daily earnings: $51
----------------------------------------
Day 5 (6.59% chance): No rain
Number of passersby: 212
Number of listeners: 34
Daily earnings: $102
----------------------------------------
Total earnings over 5 days: $477
The key here is to re-use the logic of the weekly busking simulation in a loop that runs 1000 times. Each time we run the simulation, we get a different weekly outcome based on the random number generator. By averaging these outcomes, we can get a good estimate of the expected earnings over a week of busking.
import numpy as np
def busk_one_week(rng, n_days=7):
"""Simulate the earnings of a busker in Rittenhouse Square over the course of a week.
Args:
rng: A NumPy random number generator.
n_days: The number of days to simulate (default is 7).
Returns:
total_earnings: The total earnings over the week.
"""
= rng.uniform(0., 0.7, size=n_days)
rain_probabilities
= 0 # initialize a variable to keep track of total earnings
total_earnings for day in range(n_days):
# For each day, decide if it rains based on the probability
= rng.binomial(n=1, p=rain_probabilities[day])
did_it_rain # Based on the outcome, the number of passersby changes
if did_it_rain:
= rng.poisson(lam=50) # fewer passersby when it rains
passersby else:
= rng.poisson(lam=200) # more passersby when it doesn't rain
passersby # Simulate the number who stop to listen to the busker
= rng.binomial(n=passersby, p=0.2) # 20% of passersby stop
listeners # Compute the busker's daily earnings
= 3 * listeners # $3 per listener
earnings
+= earnings
total_earnings
return total_earnings
= 1000
n_simulations = 33
random_seed
= np.random.default_rng(seed=random_seed)
rng
# Use the above function to simulate the expected earnings of a busker in Rittenhouse Square over the course of a week.
# Run the simulation 1000 times and calculate the average earnings over 1000 simulations.
= [busk_one_week(rng) for _ in range(n_simulations)]
earnings_by_week = np.mean(earnings_by_week)
average_earnings average_earnings