2025-07-31
\[ \renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\mathbb{V}\text{ar}} \newcommand{\Cov}{\mathbb{C}\text{ov}} \]
Most of you are probably familiar with the basic intuition of probability: essentially it measures how likely an event is to occur.
In mathematical terms, the probability \(\P\) of an event \(A\) is defined as:
\[\begin{align*} \P(A) &= \frac{\text{ \# of outcomes where } A \text{ occurs}}{\text{ total \# of outcomes}} \\ \end{align*}\]
By definition, this quantity is always between 0 and 1: it is 0 when \(A\) never occurs and 1 when \(A\) always occurs.
The classical example of probability is flipping a coin. When you flip a fair coin, there are two possible outcomes: heads (\(H\)) and tails (\(T\)).
If we let \(A\) be the event that the coin lands on heads, then we can compute the probability of \(A\) as follows:
\[\begin{align*} \P(\text{H}) &= \frac{\text{ \# of heads}}{\text{ total \# of outcomes}} \\ &= \frac{1}{2} \\ \end{align*}\]
This matches our intuition that a fair coin has a 50% chance of landing on heads.
An important property of probabilities is that the sum of the probabilities of all possible outcomes must equal 1. This is like saying “there’s a 100% chance that something will happen”.
In our coin flip example, we have two possible outcomes: heads and tails. If the coin flip is not heads, it must be tails. In other words, the events \(H\) and \(T\) cover 100% of the possible outcomes. So we can write: \[\begin{align*} \P(H) + \P(T) &= 1 \\ \frac{1}{2} + \frac{1}{2} &= 1 \end{align*}\]
When we know the events we are interested in make up all of the possible outcomes, we can use this property to compute probabilities.
For example, for any event \(A\), the event either happens or it doesn’t. So we can compute the probability of the event not occurring as: \[\begin{align*} \P(\text{not}~ A) &= 1 - \P(A) \end{align*}\]
But what if we flip the coin twice?
Now there are four possible outcomes: \(HH\), \(HT\), \(TH\), and \(TT\).
If we let \(B\) be the event that at least one coin lands on heads, we can compute the probability of \(B\) as follows:
\[\begin{align*} \P(B) &= \frac{\text{ \# of outcomes with at least one head}}{\text{ total \# of outcomes}} \\ &= \frac{|\{HH, HT, TH\}|}{|\{HH, HT, TH, TT\}|} \\ &= \frac{3}{4} \\ \end{align*}\]
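This kind of counting is easy to check in code. A minimal sketch (the variable names are ours) that enumerates the four outcomes and counts those with at least one head:

```python
from itertools import product
from fractions import Fraction

# All outcomes of two coin flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# Event B: at least one flip is heads
b_outcomes = [o for o in outcomes if "H" in o]

p_b = Fraction(len(b_outcomes), len(outcomes))
print(p_b)  # 3/4
```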
What is the probability of getting heads on the first flip AND the second flip (i.e., the event \(C = \{HH\}\))?
What about the probability of getting heads on the first flip OR the second flip?
This is actually the same event as \(B\) before (at least 1 heads), so we can use the same calculation:
\(\P(B) = \P(H_1 ~\text{or}~ H_2) = \frac{3}{4}\).
In the above, we used \(H_1\) and \(H_2\) to denote heads on the first and second flips, respectively.
The notation \(H_1 ~\text{and}~ H_2\) means both flips are heads, while \(H_1 ~\text{or}~ H_2\) means at least one flip is heads.
In probability theory, we often use the symbols \(\cap\) and \(\cup\) to denote “and” and “or” respectively.
So we could also write \(\P(H_1 \cap H_2)\) for the probability of both flips being heads, and \(\P(H_1 \cup H_2)\) for the probability of at least one flip being heads.
Technically, this is set notation where \(\cap\) means intersection (the event where both \(H_1\) and \(H_2\) occur), while \(\cup\) means union (the event where either \(H_1\) or \(H_2\) occurs).
There are some important rules for calculating probabilities of multiple events. In particular, if you have two events \(A\) and \(B\), the following rules hold:
Addition rule: For any two events \(A\) and \(B\), the probability of either \(A\) or \(B\) occurring is given by: \[ \P(A \cup B) = \P(A) + \P(B) - \P(A \cap B) \] This last term, \(\P(A \cap B)\), is necessary to avoid double counting the outcomes where both \(A\) and \(B\) occur.
Note that if \(A\) and \(B\) are mutually exclusive (i.e., they cannot both occur at the same time), then \(\P(A \cap B) = 0\), and the formula simplifies to: \[ \P(A \cup B) = \P(A) + \P(B)\]
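We can verify the addition rule on the two-flip example with plain Python sets (a sketch; the event definitions are ours):

```python
from fractions import Fraction

# Sample space for two coin flips
omega = {"HH", "HT", "TH", "TT"}

A = {o for o in omega if o[0] == "H"}   # heads on first flip
B = {o for o in omega if o[1] == "H"}   # heads on second flip

def P(event):
    # Classical probability: favorable outcomes / total outcomes
    return Fraction(len(event), len(omega))

# P(A or B) computed directly vs. via the addition rule
lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)
print(lhs, rhs)  # 3/4 3/4
```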
Visualizing sets of events
The following image illustrates the addition rule for two events \(A\) and \(B\) using a Venn diagram.
Conditional probability
The notation \(\P(B | A)\) is read as “the probability of \(B\) given \(A\)”. It represents the probability of event \(B\) occurring under the condition that event \(A\) has already occurred.
We make these adjustments in our heads all of the time.
Let’s think about this in the context of our coin flips. If we know that the first flip is heads (\(H_1\)), then only two outcomes are possible (\(HH\) and \(HT\)) instead of four (\(HH\), \(HT\), \(TH\), \(TT\)).
The multiplication rule helps us calculate the probability of multiple events happening, as long as we know how one event affects the other (i.e., the conditional probability).
Consider a deck of cards (52 cards total, 13 of each suit). I might ask you, “What is the probability of drawing a club on the first draw and a club on the second draw? (Assuming you do not replace the first card.)”
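In its general form, the multiplication rule is \(\P(A \cap B) = \P(A) \cdot \P(B \mid A)\), so for two clubs without replacement we multiply \(\frac{13}{52}\) by \(\frac{12}{51}\). A quick sketch with exact fractions:

```python
from fractions import Fraction

p_first_club = Fraction(13, 52)   # 13 clubs out of 52 cards
p_second_club = Fraction(12, 51)  # 12 clubs left among the 51 remaining cards

# Multiplication rule: P(club and club) = P(club) * P(club | first was club)
p_both = p_first_club * p_second_club
print(p_both)  # 1/17
```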
Two events \(A\) and \(B\) are said to be independent if the occurrence of one does not affect the probability of the other.
How does this relate to the multiplication rule?
This means that for independent events, the multiplication rule simplifies to:
\[\P(A \cap B) = \P(A) \cdot \P(B)\]
Our coin flip example illustrates this nicely.
So the probability of both flips being heads is simply
\[\P(H_1 \cap H_2) = \P(H_1) \cdot \P(H_2) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}\]
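A quick simulation (the seed is arbitrary) shows the empirical frequency of \(HH\) settling near \(\frac{1}{4}\):

```python
import random

random.seed(0)  # arbitrary seed for reproducibility
n = 100_000

# Each trial: two independent fair flips; count trials where both are heads
both_heads = sum(
    random.random() < 0.5 and random.random() < 0.5
    for _ in range(n)
)
print(both_heads / n)  # close to 0.25
```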
Counting gets confusing and cumbersome quickly, especially when we have many events or outcomes.
Say that I want to know the probability of getting exactly one head when flipping a coin 5 times.
Let’s think about the case where the first flip is heads. Since the flips are independent, the probability of heads on the first flip followed by tails on the other four is \(\frac{1}{2} \cdot \left(\frac{1}{2}\right)^4 = \frac{1}{32}\).
Are we done? As it stands, this is only the probability of getting heads on the first flip and tails on all other flips – but the single head could just as well come up on any of the five flips.
There are two common types of outcomes we want to count: permutations and combinations.
A combination is a selection of items or events without regard to the order in which they occur. For example, the number of ways 1 out of 5 flips could be heads.
An important and intuitive way to think about combinations is that we are choosing an item from a set. In our example, we are choosing 1 flip to be heads out of 5 flips.
\[ \begin{align*} \text{Flip 1 is heads} &= \{1, 0, 0, 0, 0\} \\ \text{Flip 2 is heads} &= \{0, 1, 0, 0, 0\} \\ \text{Flip 3 is heads} &= \{0, 0, 1, 0, 0\} \\ \vdots \\ \text{Flip 5 is heads} &= \{0, 0, 0, 0, 1\} \end{align*} \]
It is clear here that there are 5 possible “slots” where we can place a head.
What if we want to know the number of ways to choose 2 flips to be heads out of 5 flips? Naturally the logic above still applies, and the 5 flips we counted above are still all valid placements for one of the two heads. Now we just need to consider the second head.
Let’s take the first row from above, where the first flip is heads. Given that the first flip is heads, how many ways can we choose a second flip to also be heads? The second flip can be any of the remaining 4 flips, so there are 4 possible choices. \[ \begin{align*} \text{Flip 1 and 2 are heads} &= \{1, 1, 0, 0, 0\} \\ \text{Flip 1 and 3 are heads} &= \{1, 0, 1, 0, 0\} \\ \text{Flip 1 and 4 are heads} &= \{1, 0, 0, 1, 0\} \\ \text{Flip 1 and 5 are heads} &= \{1, 0, 0, 0, 1\} \end{align*} \]
Now let’s consider the second row, where the second flip is heads. Given that the second flip is heads, how many ways can we choose a first flip to also be heads? The first flip can be any of the remaining 4 flips, so there are again 4 possible choices.
\[ \begin{align*} \text{Flip 2 and 1 are heads} &= \{1, 1, 0, 0, 0\} \\ \text{Flip 2 and 3 are heads} &= \{0, 1, 1, 0, 0\} \\ \text{Flip 2 and 4 are heads} &= \{0, 1, 0, 1, 0\} \\ \text{Flip 2 and 5 are heads} &= \{0, 1, 0, 0, 1\} \end{align*} \]
You would continue this process for the third, fourth, and fifth flips. So for each of the 5 flips, you can choose any of the remaining 4 flips to be heads.
This gives us a total of \(5 \cdot 4 = 20\) ways to choose 2 flips to be heads out of 5 flips.
But wait! We already counted the combination of flips 2 and 1 earlier (just in a different order – where flip 1 was heads first).
This illustrates the key distinction between combinations and permutations. A permutation is an arrangement of items or events in a specific order.
Counting all of the possible permutations of a sequence is straightforward. Using the logic above, you just assign “slots” in a sequence to each of the items you are arranging.
Each time you allocate a slot, you have one fewer item to place in the remaining slots. So for a sequence of length \(n\), the number of permutations is: \[ n! = n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1 \]
Now, if we want to count combinations instead of permutations, we start with the number of permutations and then discount to account for the fact that the order does not matter.
Namely, the number of combinations of \(k\) items from a set of \(n\) items is given by the formula: \[ \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!} \]
This formula takes the number of ordered ways to pick \(k\) items from \(n\) (which is \(\frac{n!}{(n-k)!}\)) and divides by \(k!\), the number of orderings of the chosen items, since order does not matter for combinations.
In our example, we have \(n = 5\) (the number of flips) and \(k = 2\) (the number of heads). So the number of combinations of 2 heads from 5 flips is: \[ \binom{5}{2} = \frac{5!}{2! \cdot (5-2)!} = \frac{5!}{2! \cdot 3!} = \frac{5 \cdot 4 \cdot 3!}{2 \cdot 1 \cdot 3!} = \frac{5 \cdot 4}{2 \cdot 1} = 10 \]
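We can sanity-check this count by brute force: enumerate all \(2^5 = 32\) sequences of five flips, count those with exactly two heads, and compare against `math.comb`:

```python
from itertools import product
import math

# All 2**5 = 32 possible sequences of 5 coin flips
sequences = list(product("HT", repeat=5))

# Count sequences with exactly two heads
two_heads = sum(1 for s in sequences if s.count("H") == 2)

print(two_heads)        # 10
print(math.comb(5, 2))  # 10
```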
This concludes the first part of our lecture on probability. We have covered the basic concepts of probability, how to compute probabilities for single and multiple events, and the distinction between combinations and permutations. Next we will talk about probability functions, which allow us to concisely describe the probability of outcomes that have many possible values.
Thinking about probability in terms of counting outcomes is useful, and it is always a good idea to keep that intuition in mind if you ever get stuck.
However, it is often more convenient to work with probability functions.
Functions map inputs to outputs
Functions are just a “map” that tells you what output to expect for each input. A probability function is a special type of function that maps inputs to probabilities in the range \([0, 1]\).
This might seem a bit redundant because we’re just presenting the same information in a new format.
But probability functions are much more concise!
For example, if we have a die with six sides, we can define a probability function \(f\) as follows: \[ f(x) = \begin{cases} \frac{1}{6} & \text{if } x = 1, 2, 3, 4, 5, 6 \\ 0 & \text{otherwise} \end{cases} \]
But we can also use the same function to describe the probability of rolling a die with any number of sides. For example, if we have a die with \(k\) sides, we can define a probability function \(f\) as follows:
\[ f(x) = \begin{cases} \frac{1}{k} & \text{if } x = 1, 2, \ldots, k \\ 0 & \text{otherwise} \end{cases} \] This is much more concise than writing out the probability for each possible outcome, and it allows us to easily generalize to any number of sides.
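Here is a direct translation of this probability function into code (a sketch; the names `f` and `k` are ours), along with a check that the probabilities sum to 1:

```python
from fractions import Fraction

def f(x, k):
    """Probability function for a fair k-sided die."""
    if isinstance(x, int) and 1 <= x <= k:
        return Fraction(1, k)
    return Fraction(0)

# Probabilities over all outcomes of a 6-sided die sum to 1
total = sum(f(x, 6) for x in range(1, 7))
print(f(3, 6), total)  # 1/6 1
```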
A random variable is a quantity whose value is determined by the outcome of a random process.
We denote random variables with capital letters, like \(X\) or \(Y\). The specific values that a random variable can take on in a particular instance are usually denoted with lowercase letters, like \(x\) or \(y\).
We use probability functions to describe the probabilities associated with random variables.
For example, let \(X\) be a random variable that represents the outcome of flipping a fair coin. The probability function for \(X\) would be: \[ f(x) = \P (X = x) = \begin{cases} \frac{1}{2} & \text{if } x = 0 \text{ (heads)} \\ \frac{1}{2} & \text{if } x = 1 \text{ (tails)} \\ 0 & \text{otherwise} \end{cases} \]
Bernoulli random variable
The above is an example of a Bernoulli random variable, which takes on the value 1 with probability \(p\) and the value 0 with probability \(1 - p\). In our case, \(p = \frac{1}{2}\) for a fair coin.
As mentioned above, we can also think about random variables with continuous values. For example, let \(Y\) be a random variable that represents the height of a person in centimeters. Let’s assume that every person’s height is equally likely to be between 150 cm and 200 cm (this is not true, of course). The probability function for \(Y\) would be: \[ f(y) = \begin{cases} \frac{1}{50} & \text{if } 150 \leq y \leq 200 \\ 0 & \text{otherwise} \end{cases} \] (For continuous random variables, \(f\) is a probability density rather than the probability of an exact value; more on this below.)
Uniform random variable
The above is an example of a uniform random variable, which takes on values in a continuous range with equal probability. In our case, the range is from 150 cm to 200 cm, and the probability density function is \(\frac{1}{50}\).
In statistics, we treat our data as a random variable (or a collection of random variables). What this means is that we assume that the data we observe is just one possible outcome of a random process.
This is a powerful assumption because it allows us to use probability theory to make inferences about the underlying process that generated the data. This is going to be a key idea in the next lecture and throughout the course.
We call the probability function for a random variable a probability distribution.
Distributions can be discrete or continuous, depending on the type of random variable.
Let’s say we have a random variable \(X\), but we don’t know the exact probability function. Instead, we have a set of observed data points \(\{x_1, x_2, \ldots, x_n\}\) that we believe are individual realizations of \(X\). In other words, we have a sample of data that we think is representative of the underlying random variable.
How can we visualize this data to understand the distribution of \(X\)? The simplest solution is to just plot how many times each value occurs in the data. We can use a bar chart to visualize the counts of each value.
```python
import matplotlib.pyplot as plt
import numpy as np

# A small sample of observed values (realizations of X)
x = np.array([0, 0, 1, 1, 0])

# Bar chart of how many times each value occurs
plt.figure(figsize=(8, 5))
plt.hist(x, bins=np.arange(-0.5, 2.5, 1), density=False, align="mid", rwidth=0.8)
plt.xticks([0, 1])
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
```
Consider a bunch of dice rolls. If we roll a die 1,000 times, we would expect to see each number appear roughly a similar number of times.
If we plot the frequencies of each roll, we should see a discrete uniform distribution, where each number from 1 to 6 has approximately the same height. Let’s check it out:
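A simulation along these lines might look like the following sketch (the seed and variable names are ours; the numbers shown below were generated separately):

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
rolls = rng.integers(1, 7, size=1000)   # 1000 rolls of a fair six-sided die

# Count how often each face came up
values, counts = np.unique(rolls, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```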
Total number of dice rolls: 1000
| | rolls |
|---|---|
| 0 | 1 |
| 1 | 5 |
| 2 | 4 |
| 3 | 3 |
| 4 | 3 |
| 5 | 6 |
| 6 | 1 |
| 7 | 5 |
| 8 | 2 |
| 9 | 1 |
Now the plot:
Notice that when the \(y\)-axis represents probabilities, the heights of the bars sum to 1. This is because the total probability of all possible outcomes must equal 1!
What about continuous random variables? In this case, we cannot just count the number of occurrences of each value, because there are infinitely many possible values.
Instead, we can use a histogram to visualize the distribution of the data.
The histogram is a graphical representation that summarizes the distribution of a dataset.
The prices of Airbnb listings from back in the first lecture are a good example of a continuous random variable. The resolution (cents) is so small that basically every price is unique. So we cannot just count the number of occurrences of each price. Instead, we can create a histogram to visualize the distribution of prices across bins.
Area under a probability distribution
At the beginning of this lecture, we said that the probability of all possible outcomes must sum up to 1. This is true for both discrete and continuous random variables. For discrete random variables, the sum of the probabilities of all possible outcomes equals 1. For continuous random variables, the area under the probability density function (PDF) must equal 1.
For discrete: \[ \sum_{x} f(x) = 1 \] For continuous: \[ \int_{-\infty}^{\infty} f(x) \, dx = 1 \]
We can use the same idea to compute the probability of a continuous random variable falling within a certain range. For example, if we want to know the probability that a continuous random variable \(X\) falls between \(a\) and \(b\), we can compute the area under the PDF from \(a\) to \(b\): \[ \P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx \]
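Using the uniform height example from earlier, the probability that a height falls between 160 cm and 170 cm is the area of a \(10 \times \frac{1}{50}\) rectangle, i.e., \(0.2\). A numerical check with a simple Riemann sum (the step size is chosen arbitrarily):

```python
import numpy as np

a, b = 160, 170
dx = 0.001
y = np.arange(a, b, dx)          # grid over [a, b)
pdf = np.full_like(y, 1 / 50)    # uniform density on [150, 200]

area = np.sum(pdf * dx)          # approximates the integral of f from a to b
print(round(area, 3))  # 0.2
```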
We are often interested in the average value of a random variable.
Why do we need an average?
Instead, think about what would happen if we repeated the random process many times and took the average.

- Values that occur more frequently will tend to have a larger impact on the average, while values that occur less frequently will have a smaller impact.
- For example, at a casino roulette table perhaps you place a bet that has a 10% chance of winning. You might bet $1 and win $10 ($9 net profit) on one game, but if you lose $1 on each of the next 9 games you’re not making money in the long run. Even though a $9 profit sounds great, the fact that it happens so infrequently (and you lose $1 90% of the time) means that your average profit is actually zero.
Gambling warning: the house always wins
Actually, at real casinos, the games are designed so that “the house always wins” in the long run. So they would not let you bet $10 to win $100 ($90 profit) with a 10% chance – they would give you worse odds, like a 9% chance of winning $100 for a $10 bet.
In the short term this is hardly noticeable – you’re actually quite likely to win a few times! But in the long term, the house edge means that you will lose money if you keep playing. This is why casinos are profitable businesses.
We can formalize this idea that the average gives more weight to values that occur more frequently.
The expectation (or expected value) of a random variable \(X\) gives the average value of \(X\) over many instances. It is denoted as \(\mathbb{E}[X]\) or \(\mu_X\). The expectation is calculated as follows: \[ \mathbb{E}[X] = \sum_{x} x \cdot f(x) \] where \(f(x)\) is the probability function of \(X\). For continuous random variables, the sum is replaced with an integral: \[ \mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \]
The way to think about this is that the expectation is a weighted average of all possible values of \(X\), where the weights are the probabilities of each value.
So in our roulette example, you can either lose $1 (with 90% probability) or win $9 (with 10% probability). The expectation would be: \[ \begin{align*} \mathbb{E}[X] &= \sum_{x \in \{-1, 9\}} x \cdot f(x) \\ &= (-1) \cdot 0.9 + 9 \cdot 0.1 \\ &= -0.9 + 0.9 \\ &= 0 \end{align*} \]
This is also the same as what you get if you just take the average of the outcomes. Say we play roulette 10 times, and we net $9 on one game (the $10 payout minus the $1 bet) and lose $1 on the other 9 games. The average outcome is: \[ \begin{align*} \frac{1}{10} \left( -1 \cdot 9 + 9 \cdot 1 \right) &= -1 \cdot 0.9 + 9 \cdot 0.1 \\ &= -0.9 + 0.9 \\ &= 0 \end{align*} \]
So for a finite dataset, or set of outcomes, we can estimate the expected value by taking the average of the outcomes. This is often written as \(\bar{X}\), and referred to as the sample mean. \[ \mathbb{E}[X] \approx \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
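We can watch the sample mean settle toward the expected value in a sketch simulation of the roulette bet (seed arbitrary): net \(+9\) with probability 0.1, net \(-1\) otherwise.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n = 100_000

# Each game: win $9 net with probability 0.1, else lose $1
outcomes = np.where(rng.random(n) < 0.1, 9, -1)

print(outcomes.mean())  # close to the expected value of 0
```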
This approximation becomes more accurate as the number of samples \(n\) increases. We will talk about this more in a future lecture.
The average is a useful summary of a random variable’s central tendency, but it does not tell us anything about how spread out the values are.
Consider the roulette example again. If we play roulette many times, it does not matter how much we bet on each game – the average amount we can expect to win or lose is always zero. You can bet $1 or $10,000 on each game, but the average outcome is still zero.
Of course, the outcome of each game is not zero. Sometimes you win, sometimes you lose, and the amount you win or lose changes drastically depending on how much you bet.
How can we quantify this spread?
We need a statistic that captures the typical distance between the values of the random variable and the average value.
Why distance from the average?
Let’s imagine for a moment that there was a casino (a very poorly run casino) that let you place bets that win no matter what – the only question is how much you win. Let’s take an example where the payouts still differ by $10: you get $5 90% of the time (when you would have “lost”) and $15 the remaining 10% of the time (when you “win”).
In this case, the expected value is: \[ \mathbb{E}[X] = \sum_{x } x \cdot f(x) = 5 \cdot 0.9 + 15 \cdot 0.1 = 4.5 + 1.5 = 6 \] So you can expect to win $6 per game on average.
The amount by which the winnings vary, though, is exactly the same as in the original roulette game. How can we capture this notion mathematically?
The answer is simple: we subtract the average from each value of \(X\): \[X' = X - \mathbb{E}[X]\] This gives us a new random variable \(X'\) that represents the distance from the average. Notice that this new random variable has an average of zero, just like the original roulette game.
\[\mathbb{E}[X'] = \sum_{x} (x - \mathbb{E}[X]) \cdot f(x) = (5-6) \cdot 0.9 + (15-6) \cdot 0.1 = (-1) \cdot 0.9 + 9 \cdot 0.1 = -0.9 + 0.9 = 0\]
or more generally: \[\mathbb{E}[X'] = \mathbb{E}[X - \mathbb{E}[X]] = \mathbb{E}[X] - \mathbb{E}[X] = 0\]
So let’s compute exactly that - the distance from the average. The formula for distance between two vectors \(x\) and \(y\) is: \[ d^2 = \sum_{i} (x_i - y_i)^2 \] where \(x_i\) and \(y_i\) are the elements of the two vectors.
In our case, we want to compute the distance between the values of the random variable \(X\) and the average value \(\mathbb{E}[X]\). \[ d^2 = \sum_x (x - \mathbb{E}[X])^2 \]
Now we’re getting somewhere! However, this is adding up all of the squared distances – that means that the more values we have, the larger the distance will be.
This is not quite right – instead we want to compute the average distance from the mean in order to get a sense of how spread out the values typically are.
So we need to divide by the number of values \(n\): \[ d^2_\text{avg} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mathbb{E}[X])^2 \] where \(x_1, \ldots, x_n\) are the observed values.
Something should feel familiar about this expression. Recall that the average is related to the expectation.
If we replace the average with the expectation, we get the formula for the variance of a random variable \(X\): \[ \text{Var}(X) = \sigma^2(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \sum_x (x - \mathbb{E}[X])^2 \cdot f(x) \]
The variance tells us how spread out the values of a random variable are around the average.
The variance is a useful statistic, but it is not in the same units as the original random variable. For example, if \(X\) represents the amount of money you win or lose in dollars, then the variance is in dollars squared. This can make it difficult to interpret.
So we often take the square root of the variance to get the standard deviation: \[ \text{SD}(X) = \sigma(X) = \sqrt{\text{Var}(X)} = \sqrt{\mathbb{E}[(X - \mathbb{E}[X])^2]} \]
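For the roulette bet, we can compute the variance and standard deviation directly from the definition (since \(\mathbb{E}[X] = 0\), the variance is just the weighted average of the squared outcomes):

```python
# Outcomes and probabilities for the roulette bet: lose $1 or net $9
outcomes = [-1, 9]
probs = [0.9, 0.1]

mean = sum(x * p for x, p in zip(outcomes, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(outcomes, probs))
sd = var ** 0.5

print(round(mean, 10), round(var, 10), round(sd, 10))  # 0.0 9.0 3.0
```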
Like with expected value, we can replace the expectation with the sample mean to get an estimate of the standard deviation (or variance) from a finite dataset: \[ \hat{\sigma}(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})^2} \]
Sample variance vs. population variance
Technically, the formula above is an imperfect estimate of the population standard deviation. It’s in general a little bit too small, because the sample mean \(\bar{X}\) does not perfectly represent the population mean \(\mathbb{E}[X]\). We can correct for this by dividing by \(n-1\) instead of \(n\): \[ \hat{\sigma}(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2} \] This is called the sample standard deviation.
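NumPy exposes this correction via the `ddof` (“delta degrees of freedom”) argument of `std`: `ddof=0` divides by \(n\), while `ddof=1` divides by \(n-1\). A small demonstration on made-up data:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sd_population = data.std(ddof=0)  # divides by n
sd_sample = data.std(ddof=1)      # divides by n - 1 (bias-corrected)

# The sample (n - 1) version is always slightly larger
print(sd_population, sd_sample)
```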
Why is the initial estimate too small? In a small dataset, the sample mean “overfits” the data, meaning it is closer to the individual data points than the true population mean. Let’s think about this in terms of coin flips. If we flip a coin once, the sample mean is either \(\bar{X}=0\) or \(1\), depending on whether we got heads or tails. But the true population mean is \(\mathbb{E}[X]=\frac{1}{2}\). If we compute the standard deviation using the original formula, the distance from the sample mean is exactly 0! So the standard deviation is also 0 (either \((1-1)^2\) or \((0-0)^2\)).
By contrast, the true (population) standard deviation is \(\sigma(X) = \sqrt{\mathbb{E}[(X - \mathbb{E}[X])^2]} = \sqrt{\frac{1}{4}} = \frac{1}{2}\), which is larger than the estimate using the sample mean.
This bias in computing the standard deviation gets smaller as the sample size \(n\) increases, so for large datasets the difference is negligible. In smaller datasets, though, it is important to use the \(n-1\) correction to get a more accurate estimate of the population standard deviation.
Certain probability distributions are very common, and their corresponding probability functions are well-known. It is not necessary to memorize these distributions, but it is useful to be familiar with them and their properties.
Distribution | Type | Probability Function | Parameters | Mean | Variance |
---|---|---|---|---|---|
Bernoulli | Discrete | \(f(x) = p^x (1-p)^{1-x}\) | \(p \in [0, 1]\) | \(p\) | \(p(1-p)\) |
Binomial | Discrete | \(f(x) = \binom{n}{x} p^x (1-p)^{n-x}\) | \(n \in \mathbb{N}, p \in [0, 1]\) | \(np\) | \(np(1-p)\) |
Poisson | Discrete | \(f(x) = \frac{\lambda^x e^{-\lambda}}{x!}\) | \(\lambda > 0\) | \(\lambda\) | \(\lambda\) |
Uniform | Continuous | \(f(x) = \frac{1}{b-a}\) | \(a < b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) |
Normal | Continuous | \(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) | \(\mu \in \mathbb{R}, \sigma > 0\) | \(\mu\) | \(\sigma^2\) |
Exponential | Continuous | \(f(x) = \lambda e^{-\lambda x}\) | \(\lambda > 0\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) |
The plots below show the probability functions for each of these distributions.
This lecture introduced many important concepts from probability theory that will be useful throughout the course. Probability gives us a mathematical language and toolkit for reasoning about uncertainty and randomness in data, by thinking about possible outcomes and their likelihoods.
In particular, we covered:
Going forward, these concepts will be foundational for statistical modeling and designing good simulations and statistical tests.
Assignment 2 will give you a chance to work through some of these concepts in more detail, so be sure to check it out!