# Introduction to discrete probability theory
In computer science we often have to deal with probabilities:
* When we want to model some random or unpredictable phenomenon (for
example the number of visitors to a website per hour).
* When we want to work with noisy data.
* In cryptography, randomness is essential.
* Randomized algorithms can offer simple and fast solutions to problems
for which no good deterministic algorithm is known.
## A frequentist introduction to probability
Our model for a random or unpredictable phenomenon will be that of a *random
experiment*: this is some process which results in a random result, called an
*outcome*, each time we repeat it.
An archetypal example is of course rolling a dice: each time we roll it,
we're presented with a new outcome. Suppose then that we roll the dice
$N$ times and keep a tally $m_i$ of how many times we see each face $i$,
where $i \in \{1,2,3,4,5,6\}$. Then the ratio $m_i/N$ is the
*relative frequency* with which we observed each face $i$. For a fair dice
and a large number of trials $N$, we should expect that $m_i/N \sim 1/6$.
To model this experiment, we will define a number called the probability of the outcome $i$,
which we set equal to the limiting value of the relative frequency $m_i/N$, namely $1/6$.
```{warning}
The above interpretation of probabilities as the limit of relative frequencies is
called *frequentism*. This is not the only possible interpretation!
Indeed another very popular one is *Bayesianism* which interprets probabilities
as modeling knowledge or belief.
```
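The frequentist picture is easy to probe numerically. Below is a minimal simulation sketch (the function name and the seed are our own choices) that rolls a fair dice $N$ times and reports the relative frequencies $m_i/N$, all of which should approach $1/6$ for large $N$:

```python
import random
from collections import Counter

def relative_frequencies(n_trials, seed=0):
    """Roll a fair six-sided dice n_trials times and return the
    relative frequency m_i / N of each face i."""
    rng = random.Random(seed)
    tally = Counter(rng.randint(1, 6) for _ in range(n_trials))
    return {face: tally[face] / n_trials for face in range(1, 7)}

freqs = relative_frequencies(100_000)
# For large N every relative frequency should be close to 1/6.
assert all(abs(f - 1 / 6) < 0.01 for f in freqs.values())
```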
## Probability spaces
To model an experiment, we will consider the set $\Omega$ of all possible
outcomes, called the *universe*. For example, rolling a dice once yields
$\Omega = \{1,2,3,4,5,6\}$. Throughout this chapter, we'll assume that the
universe is a finite set and we note that many notions that we introduce here
do not directly transfer to the infinite case.
Subsets of $\Omega$ are called *events*. For every outcome $\omega \in \Omega$
the set $\{\omega\}$ is an event, called an *elementary* event. For example,
in our dice experiment the elementary event $\{5\}$ corresponds to "rolling a
$5$". A non-elementary event would be for example $\{2,4,6\}$ which corresponds
to "rolling an even number", that is *either* $2,4,$ or $6$.
The following vocabulary is useful when discussing events:
* For two events $A, B$, the event $A \cup B$ is read as *$A$ or $B$*.
* For two events $A, B$, the event $A \cap B$ is read as *$A$ and $B$*.
* Two events $A, B$ are called *disjoint* if $A \cap B = \emptyset$.
Recall now that during our random experiment we not only collected all possible
outcomes to form a universe $\Omega$ but we kept track of the relative frequencies too.
Therefore the bare structure of a set on $\Omega$ is not sufficient for our purposes:
we must enrich it with a *probability measure*.
(def-proba)=
```{admonition} Definition (Probability measure).
A probability measure on a universe $\Omega$ is defined
by associating to each event $e \in \mathcal{P}\left( \Omega \right)$
a real number $P(e)$ such that:
* $P\left( \Omega \right) = 1$,
* for all events $e \in \mathcal{P}\left( \Omega \right)$, $P(e) \geq 0$,
* for any two disjoint events $A, B$, we have
$P(A \cup B) = P(A) + P(B)$.
```
For finite sets, defining $P(\{\omega\})$ for all elementary events
$\{\omega\}, \omega \in \Omega$ suffices to fully determine a probability measure
(provided our choices lead to $P\left(\Omega\right) = 1$).
The couple $(\Omega, P)$ consisting of a universe $\Omega$,
and a probability measure $P$ on $\Omega$ is called a *probability space*
(this is not exactly true, see the end of the chapter).
For example, to mathematically model a single roll of a fair dice,
our probability space is composed of:
* $\Omega = \{1,2,3,4,5,6\}$,
* For all $\omega \in \Omega$, $P(\{\omega\}) = 1/6$.
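This fair-dice space is small enough to write down explicitly. The sketch below (helper names are ours) represents the measure by its elementary probabilities and recovers the probability of any event by additivity:

```python
# Fair-dice probability space: the universe and the elementary probabilities.
omega = {1, 2, 3, 4, 5, 6}
p_elem = {w: 1 / 6 for w in omega}

def P(event):
    """Probability of an event (a subset of omega), obtained by
    additivity from the elementary probabilities."""
    return sum(p_elem[w] for w in event)

assert abs(P(omega) - 1) < 1e-12          # P(Omega) = 1
assert abs(P({5}) - 1 / 6) < 1e-12        # elementary event "rolling a 5"
assert abs(P({2, 4, 6}) - 1 / 2) < 1e-12  # "rolling an even number"
```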
An important notion for events is the following:
(def-indep)=
```{admonition} Definition (Independent events).
Given a probability space $(\Omega, P)$,
two events $A, B \subseteq \Omega$ are called *independent* if:
$$P(A \cap B) = P(A) \cdot P(B).$$
```
This is an abstract statement of the intuitive notion that
two events should be called independent if occurrences of one
don't affect occurrences of the other.
The following properties are useful in calculating probabilities:
* For any event $A$, $P(\Omega \setminus A) = 1 - P(A)$.
* $P(\emptyset) = 0$.
* For two events $A, B$, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ - note that
if $A, B$ are disjoint we recover $P(A \cup B) = P(A) + P(B)$.
* For two events $A, B$ with $A \subseteq B$, $P(A) \leq P(B)$.
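These properties can be spot-checked on the fair-dice measure. A small sketch (helper names are ours) where an event's probability is the sum of its elementary probabilities:

```python
# Elementary probabilities of a fair dice; P sums them over an event.
p = {w: 1 / 6 for w in range(1, 7)}
P = lambda event: sum(p[w] for w in event)

A, B = {1, 2, 3}, {2, 3, 4, 5}
omega = set(p)

assert abs(P(omega - A) - (1 - P(A))) < 1e-12            # complement rule
assert P(set()) == 0                                     # empty event
assert abs(P(A | B) - (P(A) + P(B) - P(A & B))) < 1e-12  # inclusion-exclusion
assert P(A) <= P(A | B)                                  # monotonicity
```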
## Examples of probability spaces
Let us once again consider the universe $\Omega = \{1,2,3,4,5,6\}$.
The first example of a probability measure we saw on $\Omega$ was used to model a single roll
of a fair dice, by assigning to each of the 6 elementary events the probability $1/6$.
A generalisation of this is the following.
(def-uniform)=
```{admonition} Definition (Uniform measure).
Given a finite universe $\Omega$, the uniform probability measure
is the one that assigns $P(\{\omega\}) = \frac{1}{\lvert \Omega \rvert}$
to all $\omega \in \Omega$.
```
An important property of this measure is *equiprobability*, which tells
us that for every event $A \subseteq \Omega$:
$$P(A) = \frac{\lvert A \rvert}{\lvert \Omega \rvert}$$
Of course not all dice are fair. Another possible
assignment of probabilities is, for example:
```{list-table}
:header-rows: 1
* - $P(\{1\})$
  - $P(\{2\})$
  - $P(\{3\})$
  - $P(\{4\})$
  - $P(\{5\})$
  - $P(\{6\})$
* - $1/4$
  - $1/8$
  - $1/12$
  - $1/8$
  - $1/6$
  - $1/4$
```
The probabilities we assign do not have to be rational numbers;
any non-negative real numbers summing to $1$ will do. For example, we might have a biased coin
we model with $\Omega = \{0, 1\}$ and $P(\{0\}) = 1/\pi, P(\{1\}) = 1 - 1/\pi$.
So far we have considered experiments consisting of single dice rolls and coin flips.
How about flipping a coin $n$ times? This is equivalent to
flipping $n$ coins once and so we can take universe $\Omega$
to consist of [$n$-tuples](def-cartesianprod) of single coin outcomes.
(ex-twocoins)=
For example, for an experiment consisting of flipping
a fair coin twice (or two fair coins once) we have
the following universe:
$$\Omega = \{(0,0), (0,1), (1,0), (1,1) \}.$$
Since we assumed each coin flip to be fair, together
they should generate the uniform distribution over all four possible outcomes:
$$P(\{(0,0)\}) = P(\{(0,1)\}) = P(\{(1,0)\}) = P(\{(1,1)\}) = 1/4.$$
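The product universe for $n$ flips can be generated mechanically. A sketch (function name ours) using `itertools.product`, assigning each $n$-tuple probability $1/2^n$:

```python
from itertools import product

def coin_flip_space(n):
    """Universe and uniform measure for n flips of a fair coin:
    all n-tuples over {0, 1}, each with probability 1 / 2**n."""
    omega = list(product((0, 1), repeat=n))
    p = {outcome: 1 / 2 ** n for outcome in omega}
    return omega, p

omega, p = coin_flip_space(2)
assert len(omega) == 4                                  # (0,0), (0,1), (1,0), (1,1)
assert all(abs(q - 1 / 4) < 1e-12 for q in p.values())  # uniform over 4 outcomes
assert abs(sum(p.values()) - 1) < 1e-12                 # a valid measure
```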
## Conditional probability
Suppose that our experiment once again consists of a single roll of a fair dice,
but this time we are also given the *additional information* that we rolled an
odd number. Clearly the uniform measure $P$ on $\Omega$ is no longer appropriate
for modeling this situation: for example we know that we can't have rolled
$2$ since it isn't an odd number, yet $P(\{2\}) = 1/6$.
How should we update the uniform measure $P$ on $\Omega = \{1,2,3,4,5,6\}$
to reflect this additional information?
We now know that under this new measure, let's call it $P'$,
the probability of $\{1,3,5\}$ must be 1 since we know we've rolled an odd number.
In mathematical terms, $P'(\{1,3,5\}) = 1$.
Similarly any even outcomes are not possible anymore, so $P'(\{2\}) = P'(\{4\}) =
P'(\{6\}) = 0$. But what about $P'(\{1\})$, $P'(\{3\})$, $P'(\{5\})$? The
knowledge that the number we rolled is odd does not allow us to distinguish
between these three possibilities. Therefore we should assume that their
relative magnitudes remain the same as they were in $P$,
so that the new probabilities are $P'(\{1\}) = P'(\{3\}) = P'(\{5\}) = 1/3$.
Notice how the probability of each outcome $\omega \in E = \{1,3,5\}$
is its original probability ($P(\{\omega\}) = 1/6$) divided by the original probability
of the event itself, $P(E) = 1/2$.
We have therefore constructed a new measure $P'$ on $\Omega$
which allows us to study a single dice roll in which we know that the result is
an odd number.
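This construction is mechanical enough to sketch in code (helper name ours): zero out the outcomes not in $E$ and rescale the rest by $1/P(E)$, which preserves relative magnitudes:

```python
# Conditioning as renormalisation on the fair-dice measure.
p = {w: 1 / 6 for w in range(1, 7)}

def condition(p, E):
    """New measure given that event E happened: outcomes outside E get
    probability 0, outcomes inside E are rescaled by 1 / P(E)."""
    p_E = sum(p[w] for w in E)
    return {w: (p[w] / p_E if w in E else 0.0) for w in p}

p_given_odd = condition(p, {1, 3, 5})
assert abs(p_given_odd[1] - 1 / 3) < 1e-12         # P'({1}) = 1/3
assert p_given_odd[2] == 0.0                       # even outcomes ruled out
assert abs(sum(p_given_odd.values()) - 1) < 1e-12  # still a probability measure
```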
More generally, given a universe $\Omega$ and a measure $P$, to model a random
experiment where we know in addition that an event $E \subseteq \Omega$ has
happened, we construct a new probability measure $P( \cdot \lvert E)$ such
that:
1) Under the new measure, $E$ has surely happened: $P(E \lvert E) = 1$.
2) The new measure assigns probability zero to outcomes not in $E$:
$\forall \omega \in \Omega \setminus E, P(\{\omega\} \lvert E) = 0$.
3) The new measure preserves the *relative magnitudes* (with respect to the original measure $P$) of outcomes in $E$.
Under these assumptions every event can be assigned a new probability which is defined by:
(def-condprob)=
```{admonition} Definition (Conditional probability).
Let $P$ be a probability measure on some universe $\Omega$
and $A, E \subseteq \Omega$ be two events with $P(E) \neq 0$.
Then, the *probability of $A$ given $E$* is
defined as
$$P(A \lvert E) = \frac{P(A \cap E)}{P(E)}.$$
```
What happens if we suppose that $A, E$ are [independent](def-indep)?
Intuitively, we should expect that since an occurrence of one does not
affect the other, the probability that $A$ occurs should remain the same
regardless of whether we know that $E$ also occurred. Indeed, this is true:
(thm-indepcond)=
```{admonition} Lemma (Conditional probability of independent events).
Let $(\Omega, P)$ be a probability space and
$A, E \subseteq \Omega$ be two independent events.
Then,
$$P(A \lvert E) = P(A).$$
```
The proof follows directly from the definitions of [conditional probability](def-condprob)
and [independent event](def-indep):
$$P(A \lvert E) = \frac{P(A \cap E)}{P(E)} = \frac{P(A) \cdot P(E)}{P(E)} = P(A)$$
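We can verify the lemma numerically on the two-coin space from the previous section. In this sketch (event names are ours, and we arbitrarily take $0$ to mean heads), $A$ = "first coin heads" and $E$ = "second coin heads":

```python
from itertools import product

# Uniform measure on two fair coin flips.
omega = list(product((0, 1), repeat=2))
p = {w: 1 / 4 for w in omega}
P = lambda event: sum(p[w] for w in event)

A = {w for w in omega if w[0] == 0}   # first coin shows 0 ("heads")
E = {w for w in omega if w[1] == 0}   # second coin shows 0

assert abs(P(A & E) - P(A) * P(E)) < 1e-12   # A and E are independent
assert abs(P(A & E) / P(E) - P(A)) < 1e-12   # hence P(A | E) = P(A)
```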
Another important theorem about conditional probabilities is the following:
(thm-bayes)=
```{admonition} Theorem (Bayes).
Let $P$ be a probability measure on some universe $\Omega$
and $A, E \subseteq \Omega$ be two events with $P(E) \neq 0$.
Then,
$$P(A \lvert E) = \frac{P(E \lvert A)P(A)}{P(E)}.$$
```
The proof of the theorem follows from the [definition](def-condprob) of conditional probability.
Indeed, we have:
(eq-bayes1)=
$$ P(A \lvert E) = \frac{P(A \cap E)}{P(E)} $$
and if $P(A) = 0$, the theorem holds trivially.
Therefore let us assume that $P(A) \neq 0$, in which case we can consider the
conditional probability:
(eq-bayes2)=
$$ P(E \lvert A) = \frac{P(E \cap A)}{P(A)} $$
From the [second equation](eq-bayes2) it follows that:
(eq-bayes3)=
$$ P(E \cap A) = P(E \lvert A) \cdot P(A) $$
But since $E \cap A = A \cap E$ we can substitute [this result](eq-bayes3)
into the [first equation](eq-bayes1), yielding the desired result.
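Bayes' theorem can likewise be checked numerically on the fair-dice space. In this sketch (the events are our own choices), $A$ = "rolled at most 2" and $E$ = "rolled an odd number":

```python
# Fair-dice measure; an event's probability is the sum over its outcomes.
p = {w: 1 / 6 for w in range(1, 7)}
P = lambda event: sum(p[w] for w in event)

A, E = {1, 2}, {1, 3, 5}
P_A_given_E = P(A & E) / P(E)          # left-hand side, by definition
P_E_given_A = P(E & A) / P(A)
bayes_rhs = P_E_given_A * P(A) / P(E)  # right-hand side of Bayes' theorem

assert abs(P_A_given_E - bayes_rhs) < 1e-12
assert abs(P_A_given_E - 1 / 3) < 1e-12  # only the outcome 1 lies in both A and E
```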
## Discrete random variables, their distributions, and their expectation values
During a random experiment we are often interested not just in the outcomes
themselves but in various observables or statistics derived from them. This is captured by the
notion of a random variable, which is a value that depends on the results of a
random experiment.
To make this more concrete, consider a (not very competent) nutritionist
who has come up with a diet regimen in which every meal consists
of french fries but there's a catch: the number of fries we eat
is decided by rolling a fair dice! For example it could be:
(def-varexamples1)=
* We eat as many fries as the value we rolled.
* We eat 100 fries if the value we've rolled is even and 0 if it's odd.
To model such values that depend on the results of a random experiment,
we define a *discrete random variable* as a *function* $X: \Omega \rightarrow
S$ where $S$ is a finite or countable set. In our case, we will assume $S
\subseteq \mathbb{N}$.
Then the [examples above](def-varexamples1) yield the following variables:
(def-varexamples2)=
* A random variable $X: \Omega \rightarrow \{1,2,3,4,5,6\}$ which maps each outcome to its
numerical value: $X(1) = 1, X(2) = 2$, etc.
* A random variable $Y: \Omega \rightarrow \{0,100\}$ which maps
odd outcomes to 0 and even outcomes to 100: $Y(1) = Y(3) = Y(5) = 0,
Y(2) = Y(4) = Y(6) = 100$.
```{admonition} Disambiguation.
In the first example, be careful to distinguish between the elements
of $\Omega$ and those of $S$. Indeed, we could have
equivalently defined $\Omega = \{{\Large ⚀,⚁,⚂,⚃,⚄,⚅}\}$ in which case
$X({\Large ⚀}) = 1, X({\Large ⚁}) = 2,$ etc.
```
Returning to our diet example, two questions come naturally:
1) What is the probability that we eat $n$ fries for a meal?
2) How many fries should we expect to eat per meal?
The answer of course depends on which random variable we consider.
To formalise such notions of "probability that a random variable
takes some value" and "average of a random variable" we introduce
now two important definitions.
(def-distr)=
```{admonition} Definition (Distribution of a random variable).
The distribution of a random variable $X$ is a probability
measure on $S$ denoted by $P(X = \cdot)$ and defined via:
$$P(X = s) = P(X^{-1}(s))$$
for all $s \in S$.
```
In words: the probability $P(X = s)$ of the random variable $X$
being equal to $s$ is defined to be the probability $P(X^{-1}(s))$
that the event $X^{-1}(s) \subseteq \Omega$ occurs in our experiment.
For example, for the diet corresponding to $X$, where
the number of fries is given by the value we rolled, the preimages of
each integer $i \in \{1,2,3,4,5,6\}$ are $X^{-1}(1) = \{1\}, X^{-1}(2)
= \{2\}$, etc. Then the probability $P(X = 1)$ that we eat $1$ fry is the
probability $P(\{1\})=1/6$ that we roll $1$ on the dice and so on for
all other $i$.
For the diet corresponding to $Y$, where the number of fries corresponds
to the events of an even or odd roll, we have $Y^{-1}(0) = \{1,3,5\}$
and $Y^{-1}(100) = \{2,4,6\}$. Therefore the probability $P(Y = 0)$
that we eat $0$ fries is $P(\{1,3,5\}) = 1/2$ and
the probability $P(Y = 100)$ of eating $100$ fries is $P(\{2,4,6\}) = 1/2$.
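Computing a distribution amounts to summing elementary probabilities over preimages, which is easy to mechanise. A sketch (function name ours) applied to the fries variable $Y$:

```python
# Fair-dice measure on the universe {1, ..., 6}.
p = {w: 1 / 6 for w in range(1, 7)}

def distribution(X, p):
    """Distribution of X: accumulate P({w}) into P(X = X(w))."""
    dist = {}
    for w, q in p.items():
        dist[X(w)] = dist.get(X(w), 0.0) + q
    return dist

Y = lambda w: 0 if w % 2 == 1 else 100   # odd roll -> 0 fries, even -> 100
dist_Y = distribution(Y, p)
assert abs(dist_Y[0] - 1 / 2) < 1e-12    # P(Y = 0) = P({1, 3, 5})
assert abs(dist_Y[100] - 1 / 2) < 1e-12  # P(Y = 100) = P({2, 4, 6})
```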
To formalise our second question, we define the following number
associated to a random variable:
(def-mean)=
```{admonition} Definition (Expected value).
Let $X: \Omega \rightarrow S$ be a discrete random variable.
Then its expected value, denoted $E[X]$, is defined to be:
$$E[X] = \sum\limits_{s \in S} s P(X = s).$$
```
For the two examples given [above](def-varexamples2):
* We should expect to eat $E[X] = \frac{1+2+3+4+5+6}{6} = 3.5$ fries per meal.
* We should expect to eat $E[Y] = \frac{0+100}{2} = 50$ fries per meal.
Notice how both expected values are rational even though at any meal
we only ever eat an integer number of fries!
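Both expected values can be recomputed directly from the definition; a minimal sketch (helper name ours):

```python
def expectation(dist):
    """E[X] = sum over s of s * P(X = s), for a distribution given
    as a mapping {value: probability}."""
    return sum(s * q for s, q in dist.items())

dist_X = {i: 1 / 6 for i in range(1, 7)}  # fries eaten = value rolled
dist_Y = {0: 1 / 2, 100: 1 / 2}           # 0 fries if odd, 100 if even

assert abs(expectation(dist_X) - 3.5) < 1e-12
assert abs(expectation(dist_Y) - 50.0) < 1e-12
```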
Finally, let us note that we can also compute *conditional expected values*
given some event:
(def-condmean)=
```{admonition} Definition (Conditional expected value).
Let $X: \Omega \rightarrow S$ be a discrete random variable
and let $A \subseteq \Omega$ be an event with non-zero probability.
Then the expected value of $X$ given $A$, written $E[X \lvert A]$, is defined to be:
$$E[X | A] = \sum\limits_{s \in S} s \frac{P(X^{-1}(s) \cap A)}{P(A)}.$$
```
For example, if we follow the diet corresponding to $X$ defined
[above](def-varexamples1) and we also know that we have rolled an even number,
the expected number of fries we'll eat for that meal
is $E[X \lvert \{2,4,6\}] =4$ (compare this with $E[X] = 3.5$).
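This computation also follows mechanically from the definition. A sketch (function name ours) that reproduces $E[X \lvert \{2,4,6\}] = 4$ on the fair-dice space:

```python
# Fair-dice measure; events are subsets of {1, ..., 6}.
p = {w: 1 / 6 for w in range(1, 7)}
P = lambda event: sum(p[w] for w in event)

def cond_expectation(X, A):
    """E[X | A] = sum over s of s * P(X^{-1}(s) & A) / P(A)."""
    values = {X(w) for w in p}
    return sum(s * P({w for w in A if X(w) == s}) for s in values) / P(A)

X = lambda w: w  # fries eaten = value rolled
assert abs(cond_expectation(X, {2, 4, 6}) - 4.0) < 1e-12  # given an even roll
assert abs(cond_expectation(X, {1, 3, 5}) - 3.0) < 1e-12  # given an odd roll
```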
## A technicality
```{warning} Note.
The contents of this section aren't important for you to memorise,
or even fully understand.
The discussion below concerns some technical aspects
that eventually become important for the development of probability theory
on more general probability spaces.
```
Throughout this chapter we have avoided discussing a crucial
fact: the collection of events we consider
must be closed under complement and countable unions and intersections.
Such a structure is called a [$\sigma$-algebra](https://en.wikipedia.org/wiki/%CE%A3-algebra).
As a consequence, the definition we gave of a probability space
as a couple $(\Omega, P)$ isn't complete.
Instead, the proper definition includes a third ingredient: a collection
$\mathcal{F} \subseteq \mathcal{P}(\Omega)$ of events on $\Omega$ such that
$\mathcal{F}$ is a $\sigma$-algebra. Intuitively, we want our collection of
events to be:
* Closed under complements, so that for any event $A$ we can
also talk about the event "not $A$".
* Closed under *countable* unions and intersections so that
if we're given a countable collection $A_i$ of events, we can talk
about the event of "any of the $A_i$" or "all of the $A_i$".
In this chapter we have considered only finite universes $\Omega$
and we have tacitly taken $\mathcal{F} = \mathcal{P}(\Omega)$.
This amounts to considering *all possible events* on
the universe $\Omega$. This is the maximal $\sigma$-algebra
on $\Omega$. There are of course others, each of which can be
used to model situations where only some kinds of events interest us.
For our purposes, taking $\mathcal{F} = \mathcal{P}(\Omega)$ was
sufficient and allowed us to simplify the discussion quite a bit.
However once again the situation changes enormously when passing to the infinite case.
Finally, a random variable isn't just a function $X: \Omega \rightarrow S$!
Indeed, we must assume that $S$ is also equipped with some $\sigma$-algebra
$\mathcal{G}$.
Then a random variable is a function that preserves the structure of
the couples $(\Omega, \mathcal{F})$ and $(S, \mathcal{G})$,
meaning that it obeys $X^{-1}(E) \in \mathcal{F}$ for
any $E \in \mathcal{G}$.
This complete definition allows us to generalise the discussion
of random variables and distributions to much more general probability spaces.
## Exercises
### Exercise 1
Consider a random experiment consisting of flipping a single fair coin.
Describe the corresponding probability space $(\Omega, P)$ and compute
the probabilities of all events in $\mathcal{P}(\Omega)$.
### Exercise 2
Consider the random experiment consisting of a single roll of a fair dice.
1) Define the event $A$ which corresponds to "the value of the rolled dice
is at least 3".
2) Recall the definition of the random variable $X$ given
[above](def-varexamples2), which maps a dice roll to its value in $\{1,2,3,4,5,6\}$.
Compute $E[X | A]$, the expected value of $X$ given the event $A$.
(exo-ex3)=
### Exercise 3
Consider a random experiment consisting of flipping two fair coins.
1) Using the corresponding probability space we introduced [above](ex-twocoins),
define the following events:
* Event $H_0$: the first coin lands on heads.
* Event $T_0$: the first coin lands on tails.
* Event $H_1$: the second coin lands on heads.
* Event $T_1$: the second coin lands on tails.
2) Compute the corresponding probabilities $P(H_0), P(H_1), P(T_0), P(T_1)$.
(exo-ex4)=
### Exercise 4
Given the probability space $(\Omega, P)$ and the events
$H_0, T_0, H_1, T_1 \subseteq \Omega$ you defined in [exercise 3](exo-ex3),
compute the following conditional probabilities:
* $P(H_1 | H_0)$: the probability that the second coin
lands on heads given that the first one landed on heads.
* $P(T_1 | H_0)$: the probability that the second
coin lands on tails given that the first one landed on heads.
Compare the above with the probabilities $P(H_1)$ and $P(T_1)$ you've
computed previously. What can you deduce from this comparison?
### Exercise 5
Consider once again the random experiment treated in exercises
[3](exo-ex3) and [4](exo-ex4). Let us define a random
variable $X: \Omega \rightarrow \{0,1,2\}$ given by the sum of
the values of the two flips:
* $X((0,0)) = 0$,
* $X((0,1)) = 1$,
* $X((1,0)) = 1$,
* $X((1,1)) = 2$.
Compute:
1) The distribution of $X$.
2) The expected value $E[X]$.
3) The conditional expected value $E[X | H_1]$, where
$H_1$ is the event of the second coin landing heads up,
as defined in [exercise 3](exo-ex3).
### TODO: Exercise on Bayes' rule