Introduction to discrete probability theory#

In computer science we often have to deal with probabilities:

  • When we want to model some random or unpredictable phenomenon (for example the number of visitors to a website per hour).

  • When we want to work with noisy data.

  • For cryptography, randomness is essential.

  • Randomized algorithms can offer simple and fast solutions to problems for which no good deterministic algorithm is known.

A frequentist introduction to probability#

Our model for a random or unpredictable phenomenon will be that of a random experiment: this is some process which produces a random result, called an outcome, each time we repeat it.

An archetypal example is of course rolling a dice: each time we roll it, we’re presented with a new outcome. Suppose then that we roll the dice \(N\) times and keep a tally \(m_i\) of how many times we see each face \(i\), where \(i \in \{1,2,3,4,5,6\}\). Then the ratio \(m_i/N\) is the relative frequency with which we observed face \(i\). For a fair dice and a large number of trials \(N\), we should expect that \(m_i/N \sim 1/6\). To model this experiment, we will define a number called the probability of the face \(i\), which we set equal to the limiting value of the relative frequency \(m_i/N\), namely \(1/6\).
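To make the frequentist picture concrete, here is a minimal simulation sketch in Python (the variable names are ours, chosen for illustration) that rolls a fair dice \(N\) times and prints the relative frequencies \(m_i/N\):

```python
import random
from collections import Counter

N = 100_000  # number of rolls; larger N brings the frequencies closer to 1/6
rolls = [random.randint(1, 6) for _ in range(N)]
tally = Counter(rolls)  # m_i: how many times face i was observed

for face in range(1, 7):
    freq = tally[face] / N  # relative frequency m_i / N
    print(f"face {face}: {freq:.4f}  (1/6 = {1/6:.4f})")
```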

Warning

The above interpretation of probabilities as limits of relative frequencies is called frequentism. This is not the only possible interpretation! Indeed, another very popular one is Bayesianism, which interprets probabilities as modeling knowledge or belief.

Probability spaces#

To model an experiment, we will consider the set \(\Omega\) of all possible outcomes, called the universe. For example, rolling a dice once yields \(\Omega = \{1,2,3,4,5,6\}\). Throughout this chapter, we’ll assume that the universe is a finite set and we note that many notions that we introduce here do not directly transfer to the infinite case.

Subsets of \(\Omega\) are called events. For every outcome \(\omega \in \Omega\) the set \(\{\omega\}\) is an event, called an elementary event. For example, in our dice experiment the elementary event \(\{5\}\) corresponds to “rolling a \(5\)”. A non-elementary event would be for example \(\{2,4,6\}\) which corresponds to “rolling an even number”, that is either \(2,4,\) or \(6\).

The following vocabulary is useful when discussing events:

  • For two events \(A, B\), the event \(A \cup B\) is read as \(A\) or \(B\).

  • For two events \(A, B\), the event \(A \cap B\) is read as \(A\) and \(B\).

  • Two events \(A, B\) are called disjoint if \(A \cap B = \emptyset\).

Recall now that during our random experiment we not only collected all possible outcomes to form a universe \(\Omega\), but also kept track of the relative frequencies. The bare set structure of \(\Omega\) is therefore not sufficient for our purposes: we must enrich it with a probability measure.

Definition (Probability measure).

A probability measure on a universe \(\Omega\) is defined by associating to each event \(e \in \mathcal{P}\left( \Omega \right)\) a real number \(P(e)\) such that:

  • \(P\left( \Omega \right) = 1\),

  • For all events \(e \in \mathcal{P}\left( \Omega \right)\), \(P(e) \geq 0\),

  • For any two disjoint events \(A, B\), we have \(P(A \cup B) = P(A) + P(B)\).

For finite sets, defining \(P(\{\omega\})\) for all elementary events \(\{\omega\}, \omega \in \Omega\) suffices to fully determine a probability measure, since additivity then forces \(P(A) = \sum_{\omega \in A} P(\{\omega\})\) for every event \(A\) (provided our choices lead to \(P\left(\Omega\right) = 1\)).

The pair \((\Omega, P)\) consisting of a universe \(\Omega\) and a probability measure \(P\) on \(\Omega\) is called a probability space (this is not exactly true; see the end of the chapter).

For example, to mathematically model a single roll of a fair dice, our probability space is composed of:

  • \(\Omega = \{1,2,3,4,5,6\}\),

  • For all \(\omega \in \Omega\), \(P(\{\omega\}) = 1/6\).
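As a sketch of how such a finite probability space could be encoded in Python (the prob helper is our own, hypothetical naming, not a standard API), we can store the elementary probabilities in a dictionary and sum them over events:

```python
from fractions import Fraction

# Universe and probability measure for a single roll of a fair dice.
omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}  # P({w}) for each outcome w

def prob(event):
    """P(A) as the sum of elementary probabilities over A (finite case)."""
    return sum(P_elem[w] for w in event)

print(prob({2, 4, 6}))  # P("rolling an even number") = 1/2
print(prob(omega))      # P(Omega) = 1
```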

An important notion for events is the following:

Definition (Independent events).

Given a probability space \((\Omega, P)\), two events \(A, B \subseteq \Omega\) are called independent if:

\[P(A \cap B) = P(A) \cdot P(B).\]

This is an abstract statement of the intuitive notion that two events should be called independent if occurrences of one don’t affect occurrences of the other.

The following properties are useful in calculating probabilities:

  • For any event \(A\), \(P(\Omega \setminus A) = 1 - P(A)\).

  • \(P(\emptyset) = 0\).

  • For two events \(A, B\), \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\) - note that if \(A, B\) are disjoint we recover \(P(A \cup B) = P(A) + P(B)\).

  • For two events \(A, B\) with \(A \subseteq B\), \(P(A) \leq P(B)\).
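These properties are easy to verify numerically; here is a quick sketch on the fair-dice space (reusing the hypothetical prob helper from the previous sketch, redefined here so the snippet runs on its own):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}

def prob(event):
    return sum(P_elem[w] for w in event)

A = {1, 2, 3}  # "rolling at most 3"
B = {2, 4, 6}  # "rolling an even number"

assert prob(omega - A) == 1 - prob(A)                  # complement rule
assert prob(set()) == 0                                # empty event
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)  # inclusion-exclusion
assert prob(A) <= prob(A | B)                          # monotonicity, A ⊆ A ∪ B
print("all properties hold")
```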

Examples of probability spaces#

Let us once again consider the universe \(\Omega = \{1,2,3,4,5,6\}\). The first example of a probability measure we saw on \(\Omega\) was used to model a single roll of a fair dice, by assigning to each of the 6 elementary events the probability \(1/6\). A generalisation of this is the following.

Definition (Uniform measure).

Given a finite universe \(\Omega\), the uniform probability measure is the one that assigns \(P(\{\omega\}) = \frac{1}{\lvert \Omega \rvert}\) to all \(\omega \in \Omega\).

An important property of this measure is equiprobability, which tells us that for every event \(A \subseteq \Omega\):

\[P(A) = \frac{\lvert A \rvert}{\lvert \Omega \rvert}\]

Of course not all dice are fair. Another possible assignment of probabilities is, for example:

| \(\omega\) | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) | \(6\) |
| --- | --- | --- | --- | --- | --- | --- |
| \(P(\{\omega\})\) | \(1/4\) | \(1/8\) | \(1/12\) | \(1/8\) | \(1/6\) | \(1/4\) |

(Note that these probabilities sum to \(1\), as required.)

The probabilities we assign do not have to be rational numbers: any real numbers in \([0,1]\) that sum to \(1\) will do. For example, we might have a biased coin we model with \(\Omega = \{0, 1\}\) and \(P(\{0\}) = 1/\pi, P(\{1\}) = 1 - 1/\pi\).

So far we have considered experiments consisting of single dice rolls and coin flips. How about flipping a coin \(n\) times? This is equivalent to flipping \(n\) coins once, and so we can take the universe \(\Omega\) to consist of \(n\)-tuples of single-coin outcomes.

For example, for an experiment consisting of flipping a fair coin twice (or two fair coins once) we have the following universe:

\[\Omega = \{(0,0), (0,1), (1,0), (1,1) \}.\]

Since we assumed each coin flip to be fair, together they should generate the uniform distribution over all four possible outcomes:

\[P(\{(0,0)\}) = P(\{(0,1)\}) = P(\{(1,0)\}) = P(\{(1,1)\}) = 1/4.\]
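One way to build such product universes in code is to enumerate all \(n\)-tuples with itertools and give each tuple the product of the individual probabilities; a sketch under the assumption of fair, independent flips:

```python
from fractions import Fraction
from itertools import product

n = 2  # number of fair coin flips
# All n-tuples of single-flip outcomes, each with probability (1/2)^n.
omega = list(product([0, 1], repeat=n))
P_elem = {w: Fraction(1, 2) ** n for w in omega}

print(omega)                 # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sum(P_elem.values()))  # 1, as required of a probability measure
```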

Conditional probability#

Suppose that our experiment once again consists of a single roll of a fair dice, but this time we are also given the additional information that we rolled an odd number. Clearly the uniform measure \(P\) on \(\Omega\) is no longer appropriate for modeling this situation: for example we know that we can’t have rolled \(2\) since it isn’t an odd number, yet \(P(\{2\}) = 1/6\). How should we update the uniform measure \(P\) on \(\Omega = \{1,2,3,4,5,6\}\) to reflect this additional information?

We now know that under this new measure, let’s call it \(P'\), the probability of \(\{1,3,5\}\) must be 1 since we know we’ve rolled an odd number. In mathematical terms, \(P'(\{1,3,5\}) = 1\). Similarly, any even outcome is no longer possible, so \(P'(\{2\}) = P'(\{4\}) = P'(\{6\}) = 0\). But what about \(P'(\{1\})\), \(P'(\{3\})\), \(P'(\{5\})\)? The knowledge that the number we rolled is odd does not allow us to distinguish between these three possibilities. Therefore we should assume that their relative magnitudes remain the same as they were in \(P\), so that the new probabilities are \(P'(\{1\}) = P'(\{3\}) = P'(\{5\}) = 1/3\). Notice how the probability of each outcome \(\omega \in E = \{1,3,5\}\) is its original probability (\(P(\{\omega\}) = 1/6\)) divided by the original probability of the event itself, \(P(E) = 1/2\).

We have therefore constructed a new measure \(P'\) on \(\Omega\) which allows us to study a single dice roll in which we know that the result is an odd number.

More generally, given a universe \(\Omega\) and a measure \(P\), to model a random experiment where we know in addition that an event \(E \subseteq \Omega\) has happened, we construct a new probability measure \(P'(\cdot) = P( \cdot \lvert E)\) such that:

  1. Under the new measure, \(P(E \lvert E) = 1\) since we know that \(E\) has surely happened.

  2. \(P'\) assigns probability zero to outcomes not in \(E\): \(\forall \omega \in \Omega \setminus E, P(\{\omega\} \lvert E) = 0\).

  3. \(P'\) preserves the relative magnitudes (with respect to the original measure \(P\)) of outcomes in \(E\).

Under these assumptions every event can be assigned a new probability which is defined by:

Definition (Conditional probability).

Let \(P\) be a probability measure on some universe \(\Omega\) and \(A, E \subseteq \Omega\) be two events with \(P(E) \neq 0\). Then, the probability of \(A\) given \(E\) is defined as

\[P(A \lvert E) = \frac{P(A \cap E)}{P(E)}.\]
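The definition translates directly into code; here is a sketch (with our own hypothetical cond_prob helper) applied to the dice-given-odd example from above:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}

def prob(event):
    return sum(P_elem[w] for w in event)

def cond_prob(A, E):
    """P(A | E) = P(A ∩ E) / P(E), defined only when P(E) != 0."""
    if prob(E) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(A & E) / prob(E)

E = {1, 3, 5}             # "we rolled an odd number"
print(cond_prob({1}, E))  # 1/3, as derived above
print(cond_prob({2}, E))  # 0
```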

What happens if we suppose that \(A, E\) are independent? Intuitively, since an occurrence of one does not affect the other, the probability that \(A\) occurs should remain the same regardless of whether we know that \(E\) also occurred. Indeed, this is true:

Lemma (Conditional probability of independent events).

Let \((\Omega, P)\) be a probability space and \(A, E \subseteq \Omega\) be two independent events with \(P(E) \neq 0\). Then,

\[P(A \lvert E) = P(A).\]

The proof follows directly from the definitions of conditional probability and of independent events:

\[P(A \lvert E) = \frac{P(A \cap E)}{P(E)} = \frac{P(A) \cdot P(E)}{P(E)} = P(A)\]

Another important theorem about conditional probabilities is the following:

Theorem (Bayes).

Let \(P\) be a probability measure on some universe \(\Omega\) and \(A, E \subseteq \Omega\) be two events with \(P(E) \neq 0\). Then,

\[P(A \lvert E) = \frac{P(E \lvert A)P(A)}{P(E)}.\]

The proof of the theorem follows from the definition of conditional probability. Indeed, we have:

\[ P(A \lvert E) = \frac{P(A \cap E)}{P(E)} \]

and if \(P(A) = 0\), then \(P(A \cap E) = 0\) as well (since \(A \cap E \subseteq A\)), so both sides vanish and the theorem holds trivially. Therefore let us assume that \(P(A) \neq 0\), in which case we can consider the conditional probability:

\[ P(E \lvert A) = \frac{P(E \cap A)}{P(A)} \]

From the second equation it follows that:

\[ P(E \cap A) = P(E \lvert A) \, P(A) \]

But since \(E \cap A = A \cap E\), we can substitute this result into the first equation, yielding the desired result.
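As a quick numerical sanity check of the theorem, here is a sketch on the fair-dice space (same hypothetical helpers as before):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}

def prob(event):
    return sum(P_elem[w] for w in event)

def cond_prob(A, E):
    return prob(A & E) / prob(E)

A = {1, 2}     # "we rolled at most 2"
E = {2, 4, 6}  # "we rolled an even number"

# Bayes: P(A | E) = P(E | A) P(A) / P(E).
assert cond_prob(A, E) == cond_prob(E, A) * prob(A) / prob(E)
print(cond_prob(A, E))  # 1/3
```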

Discrete random variables, their distributions, and their expectation values#

During a random experiment we are often interested not just in the outcomes themselves but in various observables or statistics derived from them. This is captured by the notion of a random variable, which is a value that depends on the results of a random experiment.

To make this more concrete, consider a (not very competent) nutritionist who has come up with a diet regimen in which every meal consists of french fries, but there’s a catch: the number of fries we eat is decided by rolling a fair dice! For example, the rule could be:

  • We eat as many fries as the value we rolled.

  • We eat 100 fries if the value we’ve rolled is even and 0 if it’s odd.

To model such values that depend on the results of a random experiment, we define a discrete random variable as a function \(X: \Omega \rightarrow S\) where \(S\) is a finite or countable set. In our case, we will assume \(S \subseteq \mathbb{N}\).

Then the examples above yield the following variables:

  • A random variable \(X: \Omega \rightarrow \{1,2,3,4,5,6\}\) which maps each outcome to its numerical value: \(X(1) = 1, X(2) = 2\), etc.

  • A random variable \(Y: \Omega \rightarrow \{0,100\}\) which maps odd outcomes to 0 and even outcomes to 100: \(Y(1) = Y(3) = Y(5) = 0, Y(2) = Y(4) = Y(6) = 100\).

Disambiguation.

In the first example, be careful to distinguish between the elements of \(\Omega\) and those of \(S\). Indeed, we could have equivalently defined \(\Omega = \{{\Large ⚀,⚁,⚂,⚃,⚄,⚅}\}\) in which case \(X({\Large ⚀}) = 1, X({\Large ⚁}) = 2,\) etc.

Returning to our diet example, two questions come naturally:

  1. What is the probability that we eat \(n\) fries for a meal?

  2. How many fries should we expect to eat per meal?

The answer of course depends on which random variable we consider.

To formalise such notions of “probability that a random variable takes some value” and “average of a random variable” we introduce now two important definitions.

Definition (Distribution of a random variable).

The distribution of a random variable \(X\) is a probability measure on \(S\) denoted by \(P(X = \cdot)\) and defined via:

\[P(X = s) = P(X^{-1}(s))\]

for all \(s \in S\).

In words: the probability \(P(X = s)\) of the random variable \(X\) being equal to \(s\) is defined to be the probability \(P(X^{-1}(s))\) that the event \(X^{-1}(s) \subseteq \Omega\) occurs in our experiment.

For example, for the diet corresponding to \(X\), where the number of fries is given by the value we rolled, the preimages of each integer \(i \in \{1,2,3,4,5,6\}\) are \(X^{-1}(1) = \{1\}, X^{-1}(2) = \{2\}\), etc. Then the probability \(P(X = 1)\) that we eat \(1\) fry is the probability \(P(\{1\})=1/6\) that we roll \(1\) on the dice and so on for all other \(i\).

For the diet corresponding to \(Y\), where the number of fries corresponds to the events of an even or odd roll, we have \(Y^{-1}(0) = \{1,3,5\}\) and \(Y^{-1}(100) = \{2,4,6\}\). Therefore the probability \(P(Y = 0)\) that we eat \(0\) fries is \(P(\{1,3,5\}) = 1/2\) and the probability \(P(Y = 100)\) of eating \(100\) fries is \(P(\{2,4,6\}) = 1/2\).
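Computationally, the distribution is obtained by summing the probability of each outcome onto the value the random variable assigns to it; a minimal sketch (with our own hypothetical distribution helper) for the two diets:

```python
from fractions import Fraction
from collections import defaultdict

omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}

X = lambda w: w                         # eat as many fries as the value rolled
Y = lambda w: 100 if w % 2 == 0 else 0  # 100 fries if even, 0 if odd

def distribution(rv):
    """P(X = s) = P(X^{-1}(s)): accumulate P({w}) over the preimage of each s."""
    dist = defaultdict(Fraction)
    for w, p in P_elem.items():
        dist[rv(w)] += p
    return dict(dist)

print(distribution(X))  # {1: 1/6, 2: 1/6, ..., 6: 1/6}
print(distribution(Y))  # {0: 1/2, 100: 1/2}
```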

To formalise our second question, we define the following number associated to a random variable:

Definition (Expected value).

Let \(X: \Omega \rightarrow S\) be a discrete random variable. Then its expected value, denoted \(E[X]\), is defined to be:

\[E[X] = \sum\limits_{s \in S} s P(X = s).\]

For the two examples given above:

  • We should expect to eat \(E[X] = \frac{1+2+3+4+5+6}{6} = 3.5\) fries per meal.

  • We should expect to eat \(E[Y] = \frac{0+100}{2} = 50\) fries per meal.

Notice how both expected values are rational even though at any meal we only ever eat an integer number of fries!
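The same numbers fall out mechanically from the definition; a sketch (again with our own hypothetical helper names):

```python
from fractions import Fraction
from collections import defaultdict

omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}

def distribution(rv):
    dist = defaultdict(Fraction)
    for w, p in P_elem.items():
        dist[rv(w)] += p
    return dist

def expected_value(rv):
    """E[X] = sum over s of s * P(X = s)."""
    return sum(s * p for s, p in distribution(rv).items())

print(expected_value(lambda w: w))                         # 7/2 = 3.5
print(expected_value(lambda w: 100 if w % 2 == 0 else 0))  # 50
```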

Finally, let us note that we can also compute conditional expected values given some event:

Definition (Conditional expected value).

Let \(X: \Omega \rightarrow S\) be a discrete random variable and let \(A \subseteq \Omega\) be an event with non-zero probability. Then the expected value of \(X\) given \(A\), written \(E[X \lvert A]\), is defined to be:

\[E[X | A] = \sum\limits_{s \in S} s \frac{P(X^{-1}(s) \cap A)}{P(A)}.\]

For example, if we follow the diet corresponding to \(X\) defined above and we also know that we have rolled an even number, the expected number of fries we’ll eat for that meal is \(E[X \lvert \{2,4,6\}] =4\) (compare this with \(E[X] = 3.5\)).
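A corresponding sketch for the conditional expected value (note that on a finite space the defining sum reduces to \(\sum_{\omega \in A} X(\omega) P(\{\omega\}) / P(A)\)):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P_elem = {w: Fraction(1, 6) for w in omega}

def prob(event):
    return sum(P_elem[w] for w in event)

def cond_expected_value(rv, A):
    """E[X | A] = sum over s of s * P(X^{-1}(s) ∩ A) / P(A)."""
    return sum(rv(w) * P_elem[w] for w in A) / prob(A)

X = lambda w: w
print(cond_expected_value(X, {2, 4, 6}))  # 4, versus E[X] = 3.5
```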

A technicality#

Warning

Note. The contents of this section aren’t important for you to memorise, or even fully understand. The discussion below concerns some technical aspects that eventually become important for the development of probability theory on more general probability spaces.

Throughout this chapter we have avoided discussing a crucial fact: the collection of events we consider must be closed under complement and countable unions and intersections. Such a structure is called a \(\sigma\)-algebra.

As a consequence, the definition we gave of a probability space as a couple \((\Omega, P)\) isn’t complete. Instead, the proper definition includes a third ingredient: a collection \(\mathcal{F} \subseteq \mathcal{P}(\Omega)\) of events on \(\Omega\) such that \(\mathcal{F}\) is a \(\sigma\)-algebra. Intuitively, we want our collection of events to be:

  • Closed under complements, so that for any event \(A\) we can also talk about the event “not \(A\)”.

  • Closed under countable unions and intersections so that if we’re given a countable collection \(A_i\) of events, we can talk about the event of “any of the \(A_i\)” or “all of the \(A_i\)”.

In this chapter we have considered only finite universes \(\Omega\) and we have tacitly taken \(\mathcal{F} = \mathcal{P}(\Omega)\). This amounts to considering all possible events on the universe \(\Omega\). This is the maximal \(\sigma\)-algebra on \(\Omega\). There are of course others, each of which can be used to model situations where only some kinds of events interest us.

For our purposes, taking \(\mathcal{F} = \mathcal{P}(\Omega)\) was sufficient and allowed us to simplify the discussion quite a bit. However once again the situation changes enormously when passing to the infinite case.

Finally, a random variable isn’t just a function \(X: \Omega \rightarrow S\)! Indeed, we must assume that \(S\) is also equipped with some \(\sigma\)-algebra \(\mathcal{G}\). Then a random variable is a function that preserves the structure of the pairs \((\Omega, \mathcal{F})\) and \((S, \mathcal{G})\), meaning that it obeys \(X^{-1}(E) \in \mathcal{F}\) for any \(E \in \mathcal{G}\). This complete definition allows us to generalise the discussion of random variables and distributions to much more general probability spaces.

Exercises#

Exercise 1#

Consider a random experiment consisting of flipping a single fair coin. Describe the corresponding probability space \((\Omega, P)\) and compute the probabilities of all events in \(\mathcal{P}(\Omega)\).

Exercise 2#

Consider the random experiment consisting of a single roll of a fair dice.

  1. Define the event \(A\) which corresponds to “the value of the rolled dice is at least 3”.

  2. Recall the definition of the random variable \(X\) given above, which maps a dice roll to its value in \(\{1,2,3,4,5,6\}\). Compute \(E[X | A]\), the expected value of \(X\) given the event \(A\).

Exercise 3#

Consider a random experiment consisting of flipping two fair coins.

  1. Using the corresponding probability space we introduced above, define the following events:

    • Event \(H_0\): the first coin lands on heads.

    • Event \(T_0\): the first coin lands on tails.

    • Event \(H_1\): the second coin lands on heads.

    • Event \(T_1\): the second coin lands on tails.

  2. Compute the corresponding probabilities \(P(H_0), P(H_1), P(T_0), P(T_1)\).

Exercise 4#

Given the probability space \((\Omega, P)\) and the events \(H_0, T_0, H_1, T_1 \subseteq \Omega\) you defined in exercise 3, compute the following conditional probabilities:

  • \(P(H_1 | H_0)\): the probability that the second coin lands on heads given that the first one landed on heads.

  • \(P(T_1 | H_0)\): the probability that the second coin lands on tails given that the first one landed on heads.

Compare the above with the probabilities \(P(H_1)\) and \(P(T_1)\) you’ve computed previously. What can you deduce from this comparison?

Exercise 5#

Consider once again the random experiment treated in exercises 3 and 4. Let us define a random variable \(X: \Omega \rightarrow \{0,1,2\}\) given by the sum of the values of the two flips:

  • \(X((0,0)) = 0\),

  • \(X((0,1)) = 1\),

  • \(X((1,0)) = 1\),

  • \(X((1,1)) = 2\).

Compute:

  1. The distribution of \(X\).

  2. The expected value \(E[X]\).

  3. The conditional expected value \(E[X | H_1]\), where \(H_1\) is the event of the second coin landing heads up, as defined in exercise 3.

TODO: Exercise on Bayes’ rule#
