

Probability

Laplace’s Rule:

$P(A) = \frac{|A|}{|\Omega|}$

$|A|$ … number of events where $A$ occurs / number of ways $A$ can happen.
$|\Omega|$ … number of possible events.

Probability is just combinatorics, i.e. the math of combinations and proportions.

Thinking of probability geometrically, as proportions, is incredibly helpful:
Pictured: The sample space formed by all possible outcomes when rolling two dice.
The event $\{S = s\}$, where $S$ is a random variable representing the dice sum, can be represented as an area (yellow) in this space, and its probability is the proportion of the area of this event to the total area of the sample space → $P(S = s) = \frac{|\{S = s\}|}{|\Omega|}$
Generally, you can think of the sample space as a unit square (area = 1).

Roll two dice. What is the probability that at least one die is a 5?

By tedious counting: of the 36 equally likely outcomes, 11 contain at least one 5, so

$P(\text{at least one 5}) = \frac{11}{36}$

… or we take the problem apart into disjoint cases: either the first die is a 5, or it’s not and the second die is a 5: $\frac{1}{6} + \frac{5}{6} \cdot \frac{1}{6} = \frac{11}{36}$. Equivalently, via the complement of the event (no die is a 5): $1 - \left(\frac{5}{6}\right)^2 = \frac{11}{36}$.

Visually, it’s easy to see the relation to binomial coefficients (→ the probability of getting a 5 on at least one die splits into exactly one 5 plus exactly two 5s: $\frac{10}{36} + \frac{1}{36} = \frac{11}{36}$) and how probabilities stack up.
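The counting above is small enough to verify by brute-force enumeration; a minimal sketch using exact fractions (the variable names are my own):

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two dice.
omega = list(product(range(1, 7), repeat=2))

# Event: at least one die shows a 5.
at_least_one = [o for o in omega if 5 in o]
p = Fraction(len(at_least_one), len(omega))
print(p)  # 11/36

# Binomial split: exactly one 5 plus exactly two 5s.
exactly_one = sum(1 for o in omega if o.count(5) == 1)  # 10 outcomes
exactly_two = sum(1 for o in omega if o.count(5) == 2)  # 1 outcome
assert Fraction(exactly_one + exactly_two, len(omega)) == p

# Complement: 1 - P(no die shows a 5).
assert 1 - Fraction(5 * 5, 36) == p
```

All three counting strategies agree, which is the point: they are just different decompositions of the same area in the sample space.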

Probabilistic models are not just useful for truly random things, but also for things that are too complex to model exactly.

Properties of probability

The probability function is a map from subsets of the sample space $\Omega$ (the set of all possible outcomes of a random experiment) to the real numbers: $P\colon 2^{\Omega} \to \mathbb{R}$

$P(\Omega) = 1$ … the probability of something happening is 1.
$P(\emptyset) = 0$ … the probability of nothing happening is 0.
$P(A) \geq 0$ … the probability of an event is always non-negative.
$A \subseteq B \Rightarrow P(A) \leq P(B)$ … a subset of events has a smaller probability than the set it’s a subset of.
$P(\neg A) = 1 - P(A)$ … the probability of $A$ not happening. Also sometimes denoted as $\bar{A}$ (surprise, but without the log).
If $A$ and $B$ are disjoint (can’t occur at the same time), the probability of either happening is the sum of the probability of each happening: $P(A \cup B) = P(A) + P(B)$

For non-disjoint events, we need to subtract the intersection so we don’t count it twice (inclusion-exclusion principle): $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

The “counting” from earlier is just asking about the relative sizes of sets; proportions:
“$A$ and $B$ happened”: $P(A \cap B) = \frac{|A \cap B|}{|\Omega|}$ “$A$ or $B$ happened”: $P(A \cup B) = \frac{|A \cup B|}{|\Omega|}$
Now it’s also clearer what happens when we multiply or add probabilities:
Adding corresponds to the union, with the caveat of overcounting for non-disjoint events; multiplying corresponds to the intersection, with the caveat that it only works for independent events: if they are not independent, we need to use the chain rule of probability (see below).
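Inclusion-exclusion can be checked directly by treating events as Python sets over the two-dice sample space (the particular events here are my own illustrative choices):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))
A = {o for o in omega if o[0] == 5}  # first die is a 5
B = {o for o in omega if o[1] == 5}  # second die is a 5

def P(event):
    # Laplace: the proportion of the sample space the event occupies.
    return Fraction(len(event), len(omega))

# Union = sum of parts minus the double-counted intersection:
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))  # 11/36
```

Set union, intersection, and complement on `omega` mirror the event algebra exactly, which is why the proportions view is so convenient.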

conditional probability

Conditional probability

$A$ and $B$ are two events (outcomes) of a random experiment. The probability of an event $A$ given that another event $B$ has occurred is: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

If we know that $B$ has occurred, $B$ becomes the new $\Omega$: $P(A \mid B) = \frac{|A \cap B|}{|B|}$

In this case, $A$ becomes a lot more likely, as it occupies a larger fraction of the new sample space $B$ than it did of $\Omega$ before.
It’s easy to see geometrically that the reverse is not true: in general, $P(A \mid B) \neq P(B \mid A)$ (completely different proportions).
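The "B becomes the new sample space" view translates directly into code; a small sketch with two dice (the events $A$ and $B$ here are hypothetical choices of mine, not from the figure):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))
A = {o for o in omega if sum(o) == 9}  # event A: dice sum is 9
B = {o for o in omega if o[0] >= 4}    # event B: first die is at least 4

# Conditional probability as a proportion of the restricted sample space:
p_A_given_B = Fraction(len(A & B), len(B))
p_B_given_A = Fraction(len(A & B), len(A))

print(p_A_given_B)  # 1/6, up from P(A) = 4/36 = 1/9: A became more likely
print(p_B_given_A)  # 3/4, a completely different proportion
```

Note that both conditionals share the same numerator $|A \cap B|$; only the denominator (the new sample space) differs, which is exactly why they are generally unequal.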

Link to original

independent

Independent events

Two events $A$ and $B$ are independent if knowing $B$ gives no extra information about $A$, and vice versa: $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$

Equivalently, using the formula of conditional probability, for independent events we can say: $P(A \cap B) = P(A) \cdot P(B)$

So the chain rule of probability simplifies to multiplication for independent events.

EXAMPLE

$\Omega$ = deck of 52 cards
$A$ = card is a spade → $P(A) = \frac{13}{52} = \frac{1}{4}$
$B$ = card is a queen → $P(B) = \frac{4}{52} = \frac{1}{13}$

The probability of getting a spade or queen doesn’t change if we restrict ourselves to the set $B$ or $A$: $P(A \mid B) = \frac{1}{4} = P(A)$ and $P(B \mid A) = \frac{1}{13} = P(B)$
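The card example can be verified exhaustively; a minimal sketch where restricting the sample space leaves the proportions unchanged:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))          # the 52-card sample space

A = {c for c in deck if c[1] == "spades"}  # card is a spade
B = {c for c in deck if c[0] == "Q"}       # card is a queen

def P(event, space=deck):
    # Proportion of the current sample space covered by the event.
    return Fraction(len(event & space), len(space))

assert P(A) == Fraction(1, 4) and P(B) == Fraction(1, 13)
assert P(A, space=B) == P(A)    # P(A|B) = P(A): knowing B tells us nothing about A
assert P(B, space=A) == P(B)    # P(B|A) = P(B)
assert P(A & B) == P(A) * P(B)  # equivalent product form
```

The `space` parameter makes the geometric picture explicit: conditioning just swaps out the denominator.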

Independent random variables

Two random variables $X$ and $Y$ are independent iff their joint distribution factorizes: $P(X = x, Y = y) = P(X = x) \cdot P(Y = y)$ for all $x, y$.

Knowing one variable gives no information about the other.

Mutual independence (more than 2 variables): $P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i)$

Link to original
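For two dice, the factorization criterion for random variables can be checked exhaustively; a small sketch (the choice of variables is my own illustration):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def P(pred):
    # Probability of the event described by the predicate.
    return Fraction(sum(1 for o in omega if pred(o)), len(omega))

# X = first die and Y = second die are independent: the joint factorizes.
for x in range(1, 7):
    for y in range(1, 7):
        assert P(lambda o: o == (x, y)) == P(lambda o: o[0] == x) * P(lambda o: o[1] == y)

# X and the sum S = X + Y are NOT independent: the joint does not factorize.
joint = P(lambda o: o[0] == 1 and sum(o) == 12)                # 0: impossible
factored = P(lambda o: o[0] == 1) * P(lambda o: sum(o) == 12)  # 1/6 * 1/36
assert joint != factored
```

The counterexample makes the "no extra information" intuition concrete: knowing $X = 1$ rules out $S = 12$ entirely.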

law of total probability

Law of total probability

Given a partition of the sample space $\Omega$ into disjoint events $B_1, \ldots, B_n$ ($\bigcup_i B_i = \Omega$), the probability of an event $A$ is the sum of the probabilities of $A$ given each of the $B_i$, weighted by the probability of each $B_i$: $P(A) = \sum_{i=1}^{n} P(A \mid B_i) \cdot P(B_i)$

Link to original
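A classic worked instance of the law of total probability (the two-urn setup and its numbers are my own hypothetical example):

```python
from fractions import Fraction

# Hypothetical setup: pick urn 1 with prob 1/3, urn 2 with prob 2/3,
# then draw a ball. Urn 1 holds 3 red / 1 blue; urn 2 holds 1 red / 3 blue.
p_urn = {1: Fraction(1, 3), 2: Fraction(2, 3)}
p_red_given_urn = {1: Fraction(3, 4), 2: Fraction(1, 4)}

# Law of total probability: sum over the partition {urn 1, urn 2}.
p_red = sum(p_red_given_urn[u] * p_urn[u] for u in p_urn)
print(p_red)  # 1/4 + 1/6 = 5/12
```

The urns partition the sample space, so weighting each conditional by its urn's probability recovers the unconditional $P(\text{red})$.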

Further reading:
bayes theorem
probability distribution
joint probability distribution

The general view / measure-theoretic scaffold:

Probability as a measure

The probability $P$ is just a measure with the additional normalization constraint that $P(\Omega) = 1$.

Probability space: $(\Omega, \mathcal{F}, P)$
Random variable: $X\colon \Omega \to \mathbb{R}$ … a measurable function

Continue with this

Continue cleaning up the below

A note on notation (needs refactor!)

$\Omega$ … the sample space, the set of all possible outcomes of a random experiment.
$A, B, \ldots$ … events.
$P$ … probability measure, aka $\Pr$, aka $\mathbb{P}$.

$X, Y, \ldots$ … random variables; $\mathbf{X}$ … a vector of random variables.
$x, y, \ldots$ … realized values.

$P_X$ … the probability distribution (“law”) of a random variable $X$.
$P(X = x)$ … shorthand notation for probabilities of random variables taking on specific values.

$p(x)$ … the probability of a random variable taking on value $x$, general notation; specific:
$p_X(x)$ … PMF (discrete; is a probability)
$f_X(x)$ … PDF (continuous; not a probability, probabilities come from integrals $P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$)
$p(x \mid y)$, $f(x \mid y)$ … conditional probability/density, discrete and continuous, respectively.

$p(x; \theta)$ or $p_\theta(x)$ … a family of distributions indexed by $\theta$.
Its use can depend on context:
$p(x \mid \theta)$ … the probability of data $x$ given fixed parameters $\theta$. (PMF/PDF of $x$)
$L(\theta) = p(x \mid \theta)$ … the likelihood of fixed data $x$ as a function of the parameters $\theta$. (not a density in $\theta$!)

$p(x)$ … a distribution without explicit parameters, e.g. the true data distribution, or a marginal distribution.
$p(\theta)$ … a distribution over parameters, e.g. a prior.
$p(\theta \mid x)$ or $p(\theta \mid \mathcal{D})$ … the posterior distribution over parameters given data $x$ (or a dataset $\mathcal{D}$).
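To make the PMF/PDF distinction above concrete, a small sketch (the exponential density and the midpoint-rule integrator are my own illustrative choices): a density value can exceed 1, and probabilities only arise from integrating it.

```python
import math

# Hypothetical PDF: exponential with rate 2, f(x) = 2 exp(-2x) for x >= 0.
f = lambda x: 2 * math.exp(-2 * x)

# A density value is not a probability; it can exceed 1:
print(f(0.0))  # 2.0

# Probabilities come from integrals: P(0 <= X <= 1), here via the midpoint rule.
n, a, b = 100_000, 0.0, 1.0
h = (b - a) / n
p = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
print(round(p, 4))  # matches the closed form 1 - exp(-2) ≈ 0.8647
```

A PMF, by contrast, assigns probabilities directly, and summing replaces integrating.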

Introduction (old and bad)

Let $d_n$ denote the $n$th digit of $\pi$.
Consider the quantity (number) $d_n$ for some small $n$.
Is it even or odd? → even (we can simply compute it).
What about an $n$ so large that $d_n$ has never been computed?
There are two approaches to probability:

  1. “$d_n$ is even with probability $\frac{1}{2}$”
    1. Probability: representing uncertainty about certain values.
    2. Bayesian approach. Common in machine learning.
  2. “$d_n$ is even with probability 0 or 1, but I don’t know which”
    1. Probability: a mathematically definable thing about frequencies.
    2. More common, esp. in rigorous mathematical theory.

    Important points:
    The total probability is always 1. We care about independence. We care about expectation: the probability times some other function (e.g. darts player points and dart positions). Wandb YT

References

mathematics

Footnotes