Playing with Probability in Scala 3

Here is a simple experiment: take a fair dice (or a coin) and roll it until every side of the dice has been observed at least once. Write down the number of rolls and repeat the operation several times. On average, how many rolls does it take to observe every side of the dice at least once? This is precisely the question we are about to answer together.

Solving maths puzzles is fun. Solving them using the shiniest features of Scala 3 is even more fun! If you do not have Scala 3 installed yet:

You should now be able to run the Scala 3 REPL via the command:

[shell prompt]$ dotr -new-syntax -Yexplicit-nulls -Yerased-terms -indent -version
Starting dotty REPL...
Dotty compiler version 0.25.0-RC2 -- Copyright 2002-2020, LAMP/EPFL

An Important Note: Please refrain from starting your favourite IDE or creating a new project. It would only make experimenting harder and more painful. All you need is the Scala 3 REPL, any basic text editor, and knowing how to copy-paste on your system.

Understanding the Problem

Let us start by modelling a dice. The sides of a dice will be numbered starting from 1. We consider a coin as a 2-sided dice whose sides are 1 and 2. The sides of a usual 6-sided dice are 1, 2, 3, 4, 5 and 6.

final case class Dice(sides: Int) {
  def roll(): Int =
    scala.util.Random.nextInt(sides) + 1
}

val d2   = Dice(2)
val d6   = Dice(6)
val d10  = Dice(10)
val d20  = Dice(20)
val d100 = Dice(100)

d2 models a coin, d6 models a usual 6-sided dice, etc. The method roll, as its name suggests, simulates rolling the dice. On each invocation it gives a random side (its number). The first question to answer is: is such a dice fair? Remember that a fair dice is one for which every side is equally likely to be observed. For a coin, it means getting 1 is as likely as getting 2. To check empirically that a dice is fair, or at least not loaded, we will roll it many times and count how often we observe its sides:

def (dice: Dice) frequencies(rolls: Long): Map[Int, Double] =
  // Stores how many times we observed each side
  val arr = Array.fill(dice.sides)(0)
  for i <- 1L to rolls do
    arr(dice.roll() - 1) += 1

  // Transforms counters into ratios
  val probability: IndexedSeq[(Int, Double)] =
    for i <- 1 to dice.sides yield
      i -> arr(i-1).toDouble / rolls

  probability.toMap

scala> d2.frequencies(1000000000L)
val res0: Map[Int, Double] = Map(1 -> 0.499985517, 2 -> 0.500014483)

scala> d6.frequencies(1000000000L)
val res1: Map[Int, Double] = HashMap(
  1 -> 0.166669596,
  2 -> 0.166660131,
  3 -> 0.166664591,
  4 -> 0.166654524,
  5 -> 0.166665811,
  6 -> 0.166685347)

This extension method can be called like any method of Dice. As you can see, the frequencies are very close to each other. In addition, the more rolls we perform, the closer they get. We can conclude that these dice are fair enough. We are finally ready for our big adventure: finding the much-desired average! We call an experiment the action of rolling the dice until every side has been observed at least once, and the length of an experiment its number of rolls. The method rollUntilAllSeen simulates an experiment and returns its length.

def (dice: Dice) rollUntilAllSeen(): Int =
  var rolls     = 0
  val seen      = Array.fill(dice.sides)(false)
  var remaining = dice.sides

  while remaining > 0 do
    val outcome = dice.roll()
    rolls += 1
    if !seen(outcome-1) then
      seen(outcome-1) = true
      remaining -= 1

  rolls

scala> d6.rollUntilAllSeen()
val res2: Int = 12

scala> d6.rollUntilAllSeen()
val res3: Int = 15

scala> d6.rollUntilAllSeen()
val res4: Int = 8

scala> d6.rollUntilAllSeen()
val res5: Int = 9

Based on the four experiments above, we get the impression that the average should be close to 11, but four experiments are not a lot to get an accurate estimation of the real average. Fortunately, the more experiments we run, the closer we get to it. We need to compute the average over a large number of experiments. We will actually be a bit smarter. Instead of limiting ourselves to computing the average, we will count, for every observed length, the number of experiments with that length. It will give us how often a length is observed, i.e. its frequency.

final case class Histogram[A](values: Map[A, BigInt]) {

  def average(using Accumulable[A], Ratio[A]): A =
    var acc   : A      = summon[Accumulable[A]].zero
    var total : BigInt = BigInt(0)
    for (a,count) <- values do
      acc += a*count
      total += count
    acc / total

  def frequencies(using Ratio[A]): Map[A, BigDecimal] =
    val total = values.values.sum
    values.map { case (a, count) => a -> BigDecimal(count) / BigDecimal(total) }

  def map[B](f: A => B): Histogram[B] =
    val histoB = scala.collection.mutable.Map.empty[B, BigInt]
    for (a, count) <- values do
      val b = f(a)
      histoB.update(b, histoB.getOrElse(b, BigInt(0)) + count)
    new Histogram[B](histoB.toMap)
}

object Histogram {
  def apply[A](iterations: Long)(value : => A): Histogram[A] =
    val histo = scala.collection.mutable.Map.empty[A, BigInt]
    for i <- 1L to iterations do
      val a = value
      histo.update(a, histo.getOrElse(a, BigInt(0)) + 1)
    new Histogram(histo.toMap)
}

The class Histogram[A] is essentially a key-value store where the value is the number of times the key has been observed, also known as its multiplicity. You may also wonder how Scala can accept adding two values of type A and multiplying/dividing a value of type A by a BigInt in average. It works thanks to the magic of Type Classes in Scala 3. Accumulable and Ratio are two type classes defined by:

trait Accumulable[A] {
  def zero: A
  def (x:A)+(y:A): A
  def (x:A)*(m: BigInt): A
}

trait Ratio[A] {
  def (x:A) / (m: BigInt): A
}

Note that, unlike Scala 2, no weird implicit conversion is required to support infix syntax for +, * and /. These methods are just defined as extension methods.
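For readers on a final Scala 3 release: the snippets above use pre-release Dotty syntax, which did not survive as-is. The same type classes can be sketched in final Scala 3 syntax as follows (an equivalent sketch, not the article's code; weightedAverage is a name introduced here for illustration):

```scala
trait Accumulable[A]:
  def zero: A
  extension (x: A)
    def +(y: A): A
    def *(m: BigInt): A

trait Ratio[A]:
  extension (x: A)
    def /(m: BigInt): A

given Accumulable[BigDecimal] with
  def zero: BigDecimal = BigDecimal(0)
  extension (x: BigDecimal)
    def +(y: BigDecimal): BigDecimal = x + y // resolves to BigDecimal's own +
    def *(m: BigInt): BigDecimal = x * BigDecimal(m)

given Ratio[BigDecimal] with
  extension (x: BigDecimal)
    def /(m: BigInt): BigDecimal = x / BigDecimal(m)

// Generic code can now use +, * and / on any A with instances in scope.
def weightedAverage[A](pairs: List[(A, BigInt)])(using acc: Accumulable[A], r: Ratio[A]): A =
  var sum   = acc.zero
  var total = BigInt(0)
  for (a, count) <- pairs do
    sum = sum + a * count
    total += count
  sum / total
```

The extension methods give the infix syntax directly, exactly as the old `def (x:A)+(y:A)` form did.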

scala> val h6 = Histogram(100000000L)(d6.rollUntilAllSeen())
val h6: Histogram[Int] = Histogram(HashMap(73 -> 202, 69 -> 385, 88 -> 17,
  10 -> 8298014, 56 -> 4403, 42 -> 56557, 24 -> 1462064, 37 -> 140975, ...))

scala> h6.average
1 |h6.average
  |          ^
  |no implicit argument of type Accumulable[Int] was found for parameter x$1 of method average in class Histogram

If your first reaction is to implement an instance of Accumulable for Int, ask yourself how you could be confident that the computed values are correct when adding two positive numbers can result in a negative one:

scala> val x = 1990000000
val x: Int = 1990000000

scala> x + x
val res0: Int = -314967296

I am well aware that in most use cases Int is perfectly fine, because the numbers involved never get big enough to reach this limit. After all, 10 digits ought to be enough for anybody, right? In the next sections, you will see that we reach this limit very often! Writing an instance of Accumulable for Int is a catastrophic idea. Instead we will write instances for BigInt and BigDecimal.

given Accumulable[BigInt] {
  def zero: BigInt = BigInt(0)
  def (x:BigInt)+(y:BigInt): BigInt = x+y
  def (x:BigInt)*(m:BigInt): BigInt = x*m
}

given Accumulable[BigDecimal] with Ratio[BigDecimal] {
  def zero: BigDecimal = BigDecimal(0)
  def (x:BigDecimal)+(y:BigDecimal): BigDecimal = x+y
  def (x:BigDecimal)*(m:BigInt): BigDecimal = x*BigDecimal(m)
  def (x:BigDecimal)/(m:BigInt): BigDecimal = x/BigDecimal(m)
}

Now we can get our much-awaited average:

scala> h6.map(BigDecimal(_)).average
val res0: BigDecimal = 14.69830127

scala> Histogram(10000000L)(BigDecimal(d6.rollUntilAllSeen())).average
val res1: BigDecimal = 14.6949955

As you can see, the average is never far from 14.69. Knowing the average is nice but it does not tell us much about how the length is distributed among the experiments. This is precisely the reason why we kept counters! To visualize this data, we can export the histogram as a CSV file.

import java.io.{File, PrintWriter}

def (histogram: Histogram[Int]) toCSV(fileName: String): Unit =
  val pw = new PrintWriter(new File(fileName))
  for length <- 0 to histogram.values.keySet.max do
    pw.println(s"$length,${histogram.values.getOrElse(length, BigInt(0))}")
  pw.close()

scala> h6.toCSV("d6.csv")

Opening the file d6.csv with LibreOffice and plotting the data as an XY chart, using the value column as X and the count column as Y, gives this chart:

d6 distribution length/count

As you can see, after length 15, there is a huge decrease in the number of experiments. And after length 50, the number of experiments is almost negligible. The situation is similar for other dice. For example, here is the curve for d100:

d100 distribution length/count

By running enough experiments, we can get a pretty close estimation of the average. But an experiment is by nature random: every measure we perform is very likely to give a (close, but still) different estimation of the average. We need a more reliable way to approximate the average.

Modelling the Problem

To get a more reliable approximation of the average, or even its exact value, we cannot rely on random experiments. We need to use maths! Remember that an experiment is a sequence of dice rolls such that, as soon as every side of the dice has been observed at least once, the sequence is over. Given a dice, we will call a sequence of sides valid when it follows these rules.

Using a 3-sided dice:

  • The sequence 2→2→1→2 is invalid because the side 3 has not been observed.
  • The sequence 2→2→1→2→3→3 is invalid because the sequence needs to stop as soon as every side has been observed so the last roll is not required.
  • The sequence 2→2→1→2→3 is valid: every side has been observed and it was not possible to stop earlier.

Note that the validity depends on the dice used! The sequence 2→2→1→2→3 is valid for a 3-sided dice but invalid for a 4-sided dice. To compute the average, we will: (1) enumerate all valid sequences (up to a certain length), then (2) sum their length and finally (3) divide by the number of values in the sum.

def enumerate(sides: Int, depth: Int): LazyList[List[Int]] =
  def aux(revSeq: List[Int]): LazyList[List[Int]] =
    if revSeq.length > depth
    then LazyList.empty
    else if revSeq.toSet.size == sides
         then LazyList(revSeq.reverse)
         else LazyList.range(1, sides+1).flatMap { next => aux(next :: revSeq) }

  aux(List.empty)

def average(sides: Int, depth: Int): Double =
  val validSequences = enumerate(sides, depth)
  validSequences.map(_.length).sum.toDouble / validSequences.size

For a 3-sided dice, the list of all valid sequences up to length 5 is:

scala> enumerate(3,5).map(_.mkString("→")).toList
val res26: List[String] = List(
  1→1→1→2→3, 1→1→1→3→2, 1→1→2→1→3, 1→1→2→2→3, 1→1→2→3, 1→1→3→1→2, 1→1→3→2, 1→1→3→3→2,
  1→2→1→1→3, 1→2→1→2→3, 1→2→1→3, 1→2→2→1→3, 1→2→2→2→3, 1→2→2→3, 1→2→3,
  1→3→1→1→2, 1→3→1→2, 1→3→1→3→2, 1→3→2, 1→3→3→1→2, 1→3→3→2, 1→3→3→3→2,
  2→1→1→1→3, 2→1→1→2→3, 2→1→1→3, 2→1→2→1→3, 2→1→2→2→3, 2→1→2→3, 2→1→3,
  2→2→1→1→3, 2→2→1→2→3, 2→2→1→3, 2→2→2→1→3, 2→2→2→3→1, 2→2→3→1, 2→2→3→2→1, 2→2→3→3→1,
  2→3→1, 2→3→2→1, 2→3→2→2→1, 2→3→2→3→1, 2→3→3→1, 2→3→3→2→1, 2→3→3→3→1,
  3→1→1→1→2, 3→1→1→2, 3→1→1→3→2, 3→1→2, 3→1→3→1→2, 3→1→3→2, 3→1→3→3→2,
  3→2→1, 3→2→2→1, 3→2→2→2→1, 3→2→2→3→1, 3→2→3→1, 3→2→3→2→1, 3→2→3→3→1,
  3→3→1→1→2, 3→3→1→2, 3→3→1→3→2, 3→3→2→1, 3→3→2→2→1, 3→3→2→3→1, 3→3→3→1→2, 3→3→3→2→1)

That’s awesome! We just have to average all the lengths:

scala> average(3,5)
val res27: Double = 4.545454545454546

scala> average(3,10)
val res32: Double = 9.071713147410359

scala> average(3,14)
val res36: Double = 13.00953778429934

scala> average(3,16)
val res39: Double = 15.003205911089399

Apparently, computing the average for sequences up to length 16 has not converged yet. Unfortunately our implementation is too slow for large sequences. The number of valid sequences grows exponentially with the length. We need a much faster algorithm.

def aggregate[A:Accumulable](sides: Int, depth: Int)(f: Int => A): A =
  var current : Array[BigInt] = Array.fill(sides+1)(BigInt(0))
  var next    : Array[BigInt] = Array.fill(sides+1)(BigInt(0))
  var agg     : A             = summon[Accumulable[A]].zero
  var length  : Int           = 0

  current(0) = BigInt(1) // The empty sequence is the unique sequence where 0 sides have been seen

  while length <= depth do
    agg += f(length) * current(sides)

    for seen <- 0 to sides - 1 do
      next(seen)     += current(seen) * seen
      next(seen + 1) += current(seen) * (sides - seen)

    length += 1
    current = next
    next    = Array.fill(sides+1)(BigInt(0))

  agg

This is a generic aggregation function that, given a sides-sided dice, gives to every valid sequence a value depending only on its length (via f) and aggregates these values over all valid sequences up to a certain length called the depth. We can use it to compute the average for sequences up to length 100000.
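The idea behind aggregate can be cross-checked on small cases: instead of materialising sequences, it only tracks, for each length, how many prefixes have seen exactly k distinct sides. Here is a self-contained sketch of that recurrence, validated against brute force (countValid and bruteForceCounts are names introduced here, not from the article):

```scala
// counts(l) = number of valid sequences of length l, using the same
// recurrence as aggregate: state(k) = number of prefixes of the current
// length in which exactly k distinct sides occur.
def countValid(sides: Int, depth: Int): Map[Int, BigInt] =
  var state  = Array.fill(sides + 1)(BigInt(0))
  state(0)   = BigInt(1)
  var counts = Map.empty[Int, BigInt]
  for length <- 0 to depth do
    if state(sides) != 0 then counts += length -> state(sides)
    val next = Array.fill(sides + 1)(BigInt(0))
    for seen <- 0 until sides do
      next(seen)     += state(seen) * seen           // re-roll an already-seen side
      next(seen + 1) += state(seen) * (sides - seen) // roll a new side
    state = next
  counts

// Brute-force enumeration of valid sequences, for cross-checking small cases.
def bruteForceCounts(sides: Int, depth: Int): Map[Int, BigInt] =
  def go(prefix: List[Int]): Iterator[Int] =
    if prefix.length > depth then Iterator.empty
    else if prefix.toSet.size == sides then Iterator(prefix.length)
    else (1 to sides).iterator.flatMap(s => go(s :: prefix))
  go(Nil).toList.groupBy(identity).map { case (l, ls) => l -> BigInt(ls.size) }
```

For a 3-sided dice up to depth 5, countValid gives 6 sequences of length 3, 18 of length 4 and 42 of length 5, matching the 66 sequences enumerated above.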

scala> val depth = 100000
val depth: Int = 100000

scala> val sides = 3
val sides: Int = 3

scala> val sumOfLengths = aggregate(sides,depth)(length => BigInt(length))
val sumOfLengths: BigInt = A 30109 digits positive integer

scala> val numberOfSeqs = aggregate(sides,depth)(length => BigInt(1))
val numberOfSeqs: BigInt =  A 30104 digits positive integer

scala> val averageUpTo100000 = (BigDecimal(sumOfLengths)/numberOfSeqs).toDouble
val averageUpTo100000: Double = 99999.0

The average does not seem to converge. Have a look at the previous estimations of the average for depths 5, 10, 14 and 16. The average seems very close to depth - 1. It seems to indicate that, on average, you need to roll a 3-sided dice an infinite number of times to observe every side at least once. It would mean that, regardless of the number of rolls you perform, it is almost certain that you will never see at least one side. Let’s confirm that using the methods of the previous section:

scala> val d3 = Dice(3)
val d3: Dice = Dice(3)

scala> val h3 = Histogram(100000000L)(d3.rollUntilAllSeen())
val h3: Histogram[Int] = Histogram(HashMap(...))

scala> h3.map(BigDecimal(_)).average
val res6: BigDecimal = 5.5003517

The experiment shows that, on average, 5.5 rolls are enough to see every side of a 3-sided dice. The only possible conclusion is that our modelling is very wrong. The problem is that we consider every sequence to be equally likely. But the sequence 1→2→3 is much likelier to happen than 1→2→2→1→2→1→2→1→1→2→1→3. We can plot the h3 histogram to check that the longer a sequence is, the less likely it is to happen:

d3 distribution length/count

Our real big mistake is that our mathematical model does not model the real problem. This is a very important rule in modelling: models must closely match the things they are supposed to model.

Understanding the Experiment

To get a mathematical model that closely matches the experiments, we need a deeper understanding of the problem. When we perform 10000000 experiments, we get as many valid sequences of sides. But the hypothesis that these sequences are all distinct is wrong. An experiment is a random process: you may get the same sequence several times. We need to take into account how often a sequence is likely to happen.

Given a \(n\)-sided fair dice, by definition of fairness, every time we roll the dice, for any side, there is exactly \(\frac{1}{n}\) chance to observe this side. Each roll being independent from other ones, for every sequence of \(m\) rolls, there is \((\frac{1}{n})^m\) chance to observe this sequence.

Do not jump to the conclusion that the probability of a valid sequence of length \(m\) in our problem is \((\frac{1}{n})^m\) yet! Suppose we change the problem slightly by requiring that every side is observed exactly once (ignoring sequences where one side is observed several times). Then for a coin there are only 2 valid sequences, 1→2 and 2→1, each equally likely, so they both have probability \(\frac{1}{2}\), not \(\frac{1}{4}\). The right way to proceed is to find a probability space that models the problem correctly.

Remember that the dice used has been fixed as a \(n\)-sided fair dice. The first step in defining a probability space is defining the outcomes. Outcomes are the results of statistical experiments. In our problem, outcomes are the valid sequences of dice rolls. Then we need to define the set of events. Events are the things whose likelihood we want to measure! For example: what is the probability that a valid sequence starts with 1? What is the probability that a valid sequence is a palindrome (i.e. the reverse of the sequence is the sequence itself)? Etc. It feels natural, in our situation, to consider any set of valid sequences as an event. Last but not least, we need the probability function. Its purpose is to give, for any event, the likelihood of this event. Informally, a probability function must satisfy 3 properties:

  1. the likelihood of any event must be positive or null but never negative!
  2. the likelihood of a countable union of pairwise disjoint events is the sum of the likelihoods of these events.
  3. the likelihood of the set of all valid sequences must be 1.

This is where things get complicated. We can decide to give any valid sequence of size \(m\) the probability \((\frac{1}{n})^m\), but we need to prove that this function satisfies all the conditions above to be a probability function. In addition, the set of valid sequences is not that trivial to work with (at least for me!). Fortunately, working in this probability space is not mandatory. We can work in a more comfortable probability space as long as we are able to transpose results into this one.
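As a tiny worked illustration of the additivity requirement (a check added here, using a coin, \(n = 2\)): every infinite sequence starting with 1 starts with either 1,1 or 1,2, so the event of sequences starting with 1 is the disjoint union of the events of sequences starting with 1,1 and with 1,2, and the candidate assignment is consistent with additivity:

$$p(\mathcal{E}((1))) = \frac{1}{2} = \frac{1}{4} + \frac{1}{4} = p(\mathcal{E}((1,1))) + p(\mathcal{E}((1,2)))$$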

Remember that the dice being used is a \(n\)-sided fair dice. Let us start with some definitions:

  • Let \(\mathcal{C}_n=\{1,\dots,n\}\) be the set of the dice’s sides.

  • The set of countably infinite sequences of sides is written \(\Omega\).

  • The set of finite sequences of sides is written \(\mathcal{F}\).

  • For any finite sequence of sides \(f \in \mathcal{F}\), its length is written \(|f|\).

  • For any sequence \(s \in \mathcal{F} \cup \Omega\) of sides (finite or infinite), let \(\diamond s\) be the set of sides observed in \(s\) and \(\sharp s\) be the number of distinct sides observed in \(s\), i.e. \(\sharp s = |\diamond s|\).

  • For any sequence \(s \in \mathcal{F} \cup \Omega\) of sides (finite or infinite), and any \(i \in \{1,\dots,|s|\}\), let \(s_{[i]}\) be the side observed at the \(i\)-th roll of \(s\), i.e. \(s=(s_{[1]},\dots,s_{[|s|]})\).

  • For any \(f \in \mathcal{F}\) and \(f’ \in \mathcal{F}\cup\Omega\), where \(f = (s_1,\dots,s_i)\) and \(f’ = (s’_1,\dots)\), we write \(f \cdot f’ \in \mathcal{F}\cup\Omega\) for the concatenation of \(f\) and \(f’\), i.e. the sequence \((s_1,\dots,s_i,s’_1,\dots)\). Furthermore, for any set of prefixes \(F \subset \mathcal{F}\), and any set of (finite or infinite) sequences \(S\subset \mathcal{F}\cup\Omega\), we write \(F\cdot S = \{f\cdot s\mid f\in F,\ s\in S\}\) for the set of sequences made of concatenations of \(F\) and \(S\).

For the new probability space, we can take as outcomes \(\Omega\), the set of all infinite (but countable) sequences of sides. Given a finite sequence of sides \(f \in \mathcal{F}\) (possibly empty), the set of all outcomes (infinite sequences) that start with \(f\) is called a prefix event and written \(\mathcal{E}(f)\). The finite sequence \(f \in \mathcal{F}\) is called a prefix. Note that the set of all outcomes, \(\Omega\), is an event because it is the prefix event of the empty sequence \(\epsilon\). The set of all prefix events is written \(\mathcal{E}\). We will take as events the σ-field \(\sigma(\mathcal{E})\) generated from prefix events, i.e. the smallest σ-field containing prefix events that is closed under complement, countable unions and intersections. It means that any countable union or intersection of events is an event and the complement of any event is an event. Let \(F\subset \mathcal{F}\) be a finite or countable set of prefixes, we write \(\mathcal{E}(F)\) for the event \(\bigcup_{f\in F} \mathcal{E}(f)\).

The class of sets \(\mathcal{R} = \mathcal{E} \cup \{\emptyset\} \) is a semiring of sets. It comes from two facts. Let \(f_1, f_2\in \mathcal{F}\) be two prefixes. Either \(\mathcal{E}(f_1)\) and \(\mathcal{E}(f_2)\) are disjoint, or one is contained in the other. It proves that \(\mathcal{R}\) is stable under finite intersection. If \(\mathcal{E}(f_2) \subset \mathcal{E}(f_1)\) then there exists \(f_3\in \mathcal{F}\) such that \(f_2 = f_1 \cdot f_3\) and \(\mathcal{E}(f_1) = \bigcup_{f_4\in \mathcal{F},\ |f_4|=|f_3|} \mathcal{E}(f_1 \cdot f_4)\). It proves that \(\mathcal{E}(f_1) \setminus \mathcal{E}(f_2)\) can be written as a finite union of disjoint elements of \(\mathcal{R}\).

Instead of defining the probability function \(p\) directly over \(\sigma(\mathcal{E})\), \(p\) is defined over \(\mathcal{R}\) first and then extended to \(\sigma(\mathcal{E})\) using Carathéodory’s extension theorem. \(p\) is defined on \(\mathcal{R}\) by

$$\begin{aligned} p(\emptyset) & = 0 \\ \forall f \in \mathcal{F}\quad p(\mathcal{E}(f)) & = (\frac{1}{n})^{|f|} \end{aligned}$$

\(p\) is additive and σ-subadditive because the only way for the union of two disjoint elements of \(\mathcal{R}\) to be in \(\mathcal{R}\) is if one of the two is the empty set. It is also σ-finite because \(p(\Omega) = p(\mathcal{E}(\epsilon)) = (\frac{1}{n})^{0} = 1\). The function \(p\) can then be uniquely extended into a probability function over \(\sigma(\mathcal{E})\).

Note that:

$$\begin{aligned} \forall f_1,f_2\in \mathcal{F},\quad p(\mathcal{E}(f_1\cdot f_2)) & = p(\mathcal{E}(f_1)) \times p(\mathcal{E}(f_2)) \\ \forall f\in \mathcal{F},\quad p(\mathcal{E}(f)) & = \prod_{i=1}^{|f|}p(\mathcal{E}(f_{[i]})) \end{aligned}$$

The probability space \((\Omega,\sigma(\mathcal{E}),p)\) has the very pleasing property that the probability of a prefix event \(\mathcal{E}(f)\) is exactly the probability of getting the sequence \(f\) with \(|f|\) dice rolls. There is a problem though: there are outcomes for which not every side has been observed at least once. The infinite sequence \((1,\dots)\) is such an outcome.

Let \(M\) be the subset of outcomes such that at least one side has not been observed, i.e. \( M = \{ \omega \in \Omega \mid \sharp \omega < n \} \). We want to know how likely \(M\) is. For any \(i \in \mathbb{N} \), let \(M_i\) be the set of outcomes such that at least one side has not been observed up to the \(i\)-th roll, i.e. \(M_i = \mathcal{E}(\{f \in \mathcal{F}\mid |f|=i,\ \sharp f < n\})\).

For any \(i\in \mathbb{N}\), \(M_i\) is an event because it is a finite union of events. Not observing at least one side with an infinity of rolls is equivalent to not observing this side up to every \(i\)-th roll, so \(M = \bigcap_{i\in\mathbb{N}} M_i\). From this equation, we can conclude that \(M\) is an event because it is a countable intersection of events. Furthermore, given \(i\) dice rolls, the probability of not observing a given side is \((\frac{n-1}{n})^i\), so \(p(M_i) \le n \times (\frac{n-1}{n})^i\). Note that \(M_{i+1} \subset M_i\), so \(\bigcap_{j=0}^i M_j = M_i\). We can conclude that the probability, given an infinity of dice rolls, of never observing one side (any one of them) is \(0\):

$$ p(M) = p(\bigcap_{i\in\mathbb{N}} M_i) = \lim_{i\rightarrow\infty} p(M_i) \le \lim_{i\rightarrow\infty} n \times \Bigl(\frac{n-1}{n}\Bigr)^i = 0 $$

Note that it does not mean these outcomes are impossible. In theory, if you flip a coin endlessly, it is possible to always get heads (resp. tails), but this is incredibly unlikely. Let \(\Omega’ = \overline{M}\), the complement of \(M\), be the set of outcomes such that every side has been observed at least once. Its probability is then \(p(\Omega’) = 1\). So for any event \(E\in\sigma(\mathcal{E})\), \(p(E) = p(E\cap\Omega’) + p(E\cap M)\), but \(p(E\cap M) \le p(M) = 0\), so

$$\forall E\in\sigma(\mathcal{E})\quad p(E) = p(E\cap\Omega’)$$

Informally, it means we can assume that, in every outcome, every side of the dice is observed at least once. More precisely, we take as probability space the restriction of \((\Omega,\sigma(\mathcal{E}),p)\) to \(\Omega’\), written \((\Omega’,\sigma(\mathcal{E})|_{\Omega’},p)\).

How does the problem translate into this probability space? Remember that an outcome \(\omega \in \Omega’\) is an infinite sequence \((s_1,s_2,\dots)\) of sides \(s_i \in \mathcal{C}_n\) such that every side is observed at some point. For any \(m \in \{0,\dots,n\}\) we define the random variable \(X_m\) as the function, from \(\Omega’\) to \(\mathbb{R}\), that maps any outcome \(\omega = (s_1,s_2,\dots) \in \Omega’\) to the first \(i\) such that \(m\) sides have been observed at least once.

$$\forall \omega = (s_i)_{i\ge 1}\in \Omega’, \quad X_m(\omega) = \inf_{\ i\ } \bigl\{ i \mid \sharp (s_1,\dots,s_i) = m \bigr\}$$

Note that \(X_n(\omega)\) is the number of rolls needed to observe every side of the chosen dice at least once. The average we are looking for is actually the expected value of \(X_n\). But for the expected value to be defined, \(X_m\) has to be a measurable function from \((\Omega’,\sigma(\mathcal{E})|_{\Omega’})\) to \((\mathbb{R},\mathcal{B}(\mathbb{R}))\). Let \(F_{m,l}\) be the set of prefixes \(f=(s_1,\dots,s_l)\) of length \(l\) such that \(l\) is the first roll for which exactly \(m\) sides have been observed at least once, i.e. \(l = \inf_i\{i \mid \sharp (s_1,\dots,s_i) = m \}\). Then $$ X_m = \sum_{l\in\mathbb{N}} \sum_{f \in F_{m,l}} l \times \mathbb{1}_{\mathcal{E}(f)\cap \Omega’} $$ and so \(X_m\) is indeed measurable. The expected value is:

$$\begin{aligned} \mathbb{E}(X_m) & = \sum_{l\in\mathbb{N}} l\times p(X_m^{-1}(l)) \\ & = \sum_{l\in\mathbb{N}} \sum_{f \in F_{m,l}} l \times p(\mathcal{E}(f)) \\ & = \sum_{l\in\mathbb{N}} \sum_{f \in F_{m,l}} l \times \Bigl(\frac{1}{n}\Bigr)^l \\ & = \sum_{l\in\mathbb{N}} |F_{m,l}| \times l \times \Bigl(\frac{1}{n}\Bigr)^l \end{aligned}$$
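To make \(X_m\) concrete on a finite prefix, here is a small illustrative helper (firstTime is a name introduced here, not part of the article's code): it returns the first roll index at which \(m\) distinct sides have been observed, which is exactly \(X_m\) restricted to a finite prefix.

```scala
// First roll index at which exactly m distinct sides have been observed,
// or None if the prefix never reaches m distinct sides.
def firstTime(seq: List[Int], m: Int): Option[Int] =
  seq.scanLeft(Set.empty[Int])(_ + _) // distinct sides seen after each roll
     .zipWithIndex
     .collectFirst { case (seen, i) if seen.size == m => i }
```

For the valid sequence 2→2→1→2→3 of a 3-sided dice, firstTime(List(2,2,1,2,3), 3) is Some(5): exactly the length of the sequence, as expected of \(X_3\).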

The mistake we made in the last section is now clear. We computed \(\sum_{l\in\mathbb{N}} |F_{m,l}| \times l\) instead of \(\sum_{l\in\mathbb{N}} |F_{m,l}| \times l \times (\frac{1}{n})^l\). We just have to fix the function used in the aggregation to compute the right value:

def forOneValidPrefix(sides: Int)(length: Int): BigDecimal =
  (BigDecimal(1) / sides).pow(length) * length

scala> val depth = 100000
val depth: Int = 100000

scala> val sides = 3
val sides: Int = 3

scala> val expectedValueFor100000 = aggregate(sides,depth)(forOneValidPrefix(sides))
val expectedValueFor100000: BigDecimal = 5.499999999999999999999999999999999

This time the computed value matches what we observed with random experiments, a value around \(5.5\). It also matches the average we got for a 6-sided fair dice:

scala> val depth = 10000
val depth: Int = 10000

scala> val sides = 6
val sides: Int = 6

scala> val expectedValueFor10000 = aggregate(sides,depth)(forOneValidPrefix(sides))
val expectedValueFor10000: BigDecimal = 14.70000000000000000000000000000001

We can actually go a bit deeper by observing that for any \(m \in \{1,\dots,n\}\), \(\mathbb{E}(X_m) = \mathbb{E}(X_{m-1}) + \mathbb{E}(X_m - X_{m-1})\):

$$\begin{aligned} \mathbb{E}(X_m) & = \sum_{i=0}^{m-1} \mathbb{E}(X_{i+1} - X_i) \\ \mathbb{E}(X_{i+1}-X_i) & = \sum_{d\in\mathbb{N}^*} d \times p((X_{i+1}-X_i)^{-1}(d)) \\ p((X_{i+1}-X_i)^{-1}(d)) & = \sum_{k\in\mathbb{N}} p(\mathcal{E}(F_{i+1,k+d}) \cap \mathcal{E}(F_{i,k})) \\p(\mathcal{E}(F_{i+1,k+d})\cap \mathcal{E}(F_{i,k})) & = p(\mathcal{E}(\{f \cdot f’ \cdot c \mid f\in F_{i,k},\quad |f'|=d-1,\ \diamond f’ \subset \diamond f,\quad c\in \mathcal{C}_n\setminus\diamond f \})) \\& = \sum_{f\in F_{i,k}} \Biggl( \sum_{|f'|=d-1,\ \diamond f’ \subset \diamond f} \Biggl( \sum_{c\in \mathcal{C}_n\setminus \diamond f} p(\mathcal{E}(f \cdot f’ \cdot c)) \Biggr)\Biggr) \\& = \sum_{f\in F_{i,k}} \Biggl( p(\mathcal{E}(f)) \times \sum_{|f'|=d-1,\ \diamond f’ \subset \diamond f} \Biggl( p(\mathcal{E}(f’)) \times \sum_{c\in \mathcal{C}_n\setminus \diamond f} p(\mathcal{E}(c)) \Biggr)\Biggr) \\& = \sum_{f\in F_{i,k}} \Biggl( p(\mathcal{E}(f)) \times |\{f’\in \mathcal{F} \mid |f'|=d-1,\ \diamond f’ \subset \diamond f\}| \times \Bigl(\frac{1}{n}\Bigr)^{d-1} \times \frac{n-i}{n} \Biggr) \\& = \Bigl(\frac{i}{n}\Bigr)^{d-1} \times \frac{n-i}{n} \times p(\mathcal{E}(F_{i,k})) \end{aligned}$$

So \(p(\mathcal{E}(F_{i+1,k+d}) \mid \mathcal{E}(F_{i,k})) = (\frac{i}{n})^{d-1}\times \frac{n-i}{n} \). So \(\mathbb{E}(X_{i+1} - X_i) = \sum_{d\in\mathbb{N}^*} d \times \frac{n-i}{n}\times (\frac{i}{n})^{d-1} \). We recognize a Geometric Distribution whose probability of success is \(p’ = \frac{n-i}{n}\). Its expected value is known to be \(\frac{1}{p’} = \frac{n}{n-i}\). It can be computed by

$$\begin{aligned} \mathbb{E}(X_{i+1} - X_i) & = \sum_{d\in\mathbb{N}^*} d \times \frac{n-i}{n}\times \Bigl(\frac{i}{n}\Bigr)^{d-1} \\& = \frac{n-i}{n} \times \sum_{d\in\mathbb{N}^*} d \times \Bigl(\frac{i}{n}\Bigr)^{d-1} \end{aligned}$$

But $$\begin{aligned} \sum_{d\in\mathbb{N}^*} d \times x^{d-1} & = \sum_{d\in\mathbb{N}^*} \bigl(x^d\bigr)' \\& = \Bigl(\sum_{d\in\mathbb{N}^*} x^d\Bigr)' \\& = \Bigl(\frac{x}{1 - x}\Bigr)' \\& = \frac{1}{(1 - x)^2} \end{aligned}$$

So $$\begin{aligned} \mathbb{E}(X_{i+1} - X_i) & = \frac{n-i}{n} \times \frac{1}{(1 - \frac{i}{n})^2} \\& = \frac{n-i}{n} \times \Bigl(\frac{n}{n - i}\Bigr)^2 \\& = \frac{n}{n - i} \end{aligned}$$
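We can numerically cross-check this closed form by truncating the series (a sketch added here; geometricMeanTruncated and the cutoff terms are illustrative names, not from the article):

```scala
// Truncated series Σ_{d≥1} d * ((n-i)/n) * (i/n)^(d-1), which should
// approach the closed form n / (n - i) as the number of terms grows.
def geometricMeanTruncated(n: Int, i: Int, terms: Int = 10000): Double =
  val success = (n - i).toDouble / n
  (1 to terms).map(d => d * success * math.pow(i.toDouble / n, d - 1)).sum
```

For \(n = 6\) and \(i = 3\), the truncated series gives a value within numerical error of the closed form \(6/(6-3) = 2\).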

Finally, we can give the closed formula for the expected value and check that it matches the values computed so far:

$$ \mathbb{E}(X_n) = \sum_{i=0}^{n-1} \frac{n}{n-i} $$

def expectedValue(sides: Int): Double =
  (1 to sides).map(sides.toDouble / _).sum

scala> expectedValue(6)
val res0: Double = 14.7

scala> expectedValue(3)
val res1: Double = 5.5
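This formula, \(\mathbb{E}(X_n) = \sum_{i=0}^{n-1} \frac{n}{n-i} = n \times H_n\) where \(H_n\) is the \(n\)-th harmonic number, is the classic coupon collector expectation; for large \(n\) it behaves like \(n(\ln n + \gamma)\). A quick sketch comparing the two (the approximation helper is a name introduced here for illustration):

```scala
// Exact expectation: E(X_n) = Σ_{i=0}^{n-1} n/(n-i) = n * H_n (same as above)
def expectedValue(sides: Int): Double =
  (1 to sides).map(sides.toDouble / _).sum

// Asymptotic approximation n * (ln n + γ), γ ≈ 0.5772 (Euler–Mascheroni constant)
def approximation(sides: Int): Double =
  sides * (math.log(sides) + 0.5772156649015329)
```

For a 100-sided dice, the exact value is about 518.7 and the approximation about 518.2: within 0.1%.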

Understanding the Probability

We are still given a \(n\)-sided fair dice (\(n>0\)). Let \(C \subset \mathcal{C}_n\) be a subset of the sides of the dice and \(l \in \mathbb{N}\) a non-negative integer. The event of all outcomes whose prefixes of length \(l\) do not contain any side in \(C\) is written

$$M_{n,l,C} = \mathcal{E}(\{ f\in \mathcal{F} \mid |f|=l,\quad \diamond f \cap C = \emptyset \})$$

Note that for any subsets \(C_1\) and \(C_2\) of \(\mathcal{C}_n\) and \(l\in\mathbb{N}\), the property \(M_{n,l,C_1}\cap M_{n,l,C_2} = M_{n,l,C_1\cup C_2}\) holds and \(p(M_{n,l,C}) = (\frac{n - |C|}{n})^l = (1 - \frac{|C|}{n})^l\). Let \(A_{n,l}\) be the event of all outcomes whose prefixes of size \(l\) contain every side of the dice at least once:

$$\begin{aligned} A_{n,l} & = \mathcal{E}(\{ f\in \mathcal{F} \mid |f|=l,\quad \diamond f = \mathcal{C}_n \}) \\ & = \bigcap_{c\in \mathcal{C}_n} \overline{M_{n,l,\{c\}}} \\ & = \overline{\bigcup_{c\in \mathcal{C}_n} M_{n,l,\{c\}}} \end{aligned}$$

$$\begin{aligned} p(A_{n,l}) & = 1 - p\Bigl(\bigcup_{c \in \mathcal{C}_n} M_{n,l,\{c\}}\Bigr) \\ & = 1 - \Biggl[\sum_{C \subset \mathcal{C}_n,C\neq\emptyset} -(-1)^{|C|} \times p\Bigl(\bigcap_{c\in C} M_{n,l,\{c\}}\Bigr)\Biggr] \\ & = 1 - \Biggl[\sum_{C \subset \mathcal{C}_n,C\neq\emptyset} -(-1)^{|C|} \times p(M_{n,l,C})\Biggr] \\ & = 1 + \Biggl[\sum_{C \subset \mathcal{C}_n,C\neq\emptyset} (-1)^{|C|} \times \biggl(1 - \frac{|C|}{n}\biggr)^{l}\Biggr] \\ & = 1 + \sum_{k=1}^{n} \binom{n}{k} \times (-1)^k \times \biggl(1 - \frac{k}{n}\biggr)^{l} \end{aligned}$$

We can generalize this result: given a \(n\)-sided fair dice and \(C \subset \mathcal{C}_n\) a subset of the sides of the dice, let \(A_{n,l,C}\) be the set of outcomes whose prefixes of length \(l\) contain only sides of \(C\), and every side of \(C\) at least once:

$$\begin{aligned} A_{n,l,C} & = \mathcal{E}(\{f\in \mathcal{F} \mid |f|=l, \quad \diamond f = C\}) \\p(A_{n,l,C}\mid M_{n,l,\overline{C}}) & = p(A_{|C|,l}) \\ & = 1 + \sum_{k=1}^{|C|} \binom{|C|}{k} \times (-1)^k \times \biggl(1 - \frac{k}{|C|}\biggr)^{l} \end{aligned}$$

We found the probability of observing, in \(l\) dice rolls or less, a subset of all the sides. But \(A_{n,l}\) is not in general \(X_n^{-1}(\{l\})\), because outcomes in \(X_n^{-1}(\{l\})\) observe the last missing side exactly at roll \(l\), while outcomes in \(A_{n,l}\) may have observed every side well before the \(l\)-th roll. But we can relate the two. For any side \(c \in \mathcal{C}_n\) and any non-negative integer \(i\in\mathbb{N}\), let \(R_{n,i,c}\) be the event of observing the side \(c\) at roll \(i\). Let \(l\in\mathbb{N}^*\):

$$\begin{aligned} X_n^{-1}(\{l\}) & = \bigcup_{c=1}^n A_{n,l-1,C_n\setminus\{c\}} \cap M_{n,l-1,\{c\}} \cap R_{n,l,c} \\p(X_n^{-1}(\{l\})) & = \sum_{c=1}^n p(A_{n,l-1,C_n\setminus\{c\}} \cap M_{n,l-1,\{c\}} \cap R_{n,l,c}) \\ & = \sum_{c=1}^n p(A_{n,l-1,C_n\setminus\{c\}} \mid M_{n,l-1,\{c\}}) \times p(M_{n,l-1,\{c\}}) \times \frac{1}{n} \\ & = \sum_{c=1}^n p(A_{n-1,l-1}) \times \biggl(\frac{n-1}{n}\biggr)^{l-1} \times \frac{1}{n} \\ & = \biggl(\frac{n-1}{n}\biggr)^{l-1} \times p(A_{n-1,l-1}) \\ & = \biggl(\frac{n - 1}{n}\biggr)^{l-1} \times \Biggl[ 1 + \sum_{k=1}^{n-1} \binom{n-1}{k} \times (-1)^k \times \biggl(\frac{n - 1 - k}{n-1}\biggr)^{l-1} \Biggr] \\ & = \biggl(1 - \frac{1}{n}\biggr)^{l-1} + \sum_{k=1}^{n-1} \binom{n-1}{k} \times (-1)^k \times \biggl(1 - \frac{k+1}{n}\biggr)^{l-1} \end{aligned}$$
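A coin gives a quick sanity check of this result too: for \(n = 2\) the sum has only the \(k = 1\) term, which vanishes, so

$$p(X_2 = l) = \biggl(\frac{1}{2}\biggr)^{l-1} + \binom{1}{1}(-1)\biggl(1 - \frac{2}{2}\biggr)^{l-1} = \biggl(\frac{1}{2}\biggr)^{l-1} \quad (l \ge 2)$$

as expected: rolls \(2\) to \(l-1\) must all repeat the first side and roll \(l\) must differ, each event having probability \(\frac{1}{2}\).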

We need to be careful when translating this formula into code, to prevent numerical approximations from leading to very wrong answers:

def probabilityForLength(sides: Int, length: Int): BigDecimal =
  // cnk(k, f) computes the binomial coefficient C(f, k)
  def cnk(k: Int, f: Int): BigInt =
    // dfact(n, k) = n * (n-1) * ... * (n-k+1)
    def dfact(n: Int, k: Int, res: BigInt = BigInt(1L)): BigInt =
      if k <= 0
      then res
      else dfact(n-1, k-1, res*n)

    dfact(f,k) / dfact(k,k-1)

  if sides == 0 || length == 0 then
    if sides == 0 && length == 0 then BigDecimal(1) else BigDecimal(0)
  else
    (BigDecimal(sides - 1)/sides).pow(length - 1) +
      (1 to (sides-1)).map { (k: Int) =>
        BigDecimal(
          cnk(k,sides-1)
          * (if k % 2 == 0 then 1 else -1)
          * BigInt(sides - 1 - k).pow(length-1)
        ) / BigDecimal(sides).pow(length - 1)
      }.sum
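Before trusting the translation, a tiny cross-check (a toy verification, not part of the article's code): for a 3-sided dice, the only prefixes of length 3 containing all the sides are the \(3! = 6\) permutations of \((1,2,3)\), so \(p(X_3 = 3) = 6/27 = 2/9\). The closed formula agrees:

```scala
// p(X_3 = 3) from the closed formula:
//   (2/3)^2 + C(2,1) * (-1)^1 * (1/3)^2 + C(2,2) * (-1)^2 * (0/3)^2
val formula =
  math.pow(2.0/3.0, 2) +
    2 * -1 * math.pow(1.0/3.0, 2) +  // k = 1: C(2,1) = 2
    1 *  1 * math.pow(0.0/3.0, 2)    // k = 2: C(2,2) = 1

// Direct counting: 3! favourable prefixes out of 3^3 equally likely ones.
assert(math.abs(formula - 6.0/27.0) < 1e-12)
```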

We can check that the probability of observing all the sides of a 120-sided dice in fewer than 120 rolls is indeed 0:

scala> probabilityForLength(120, 13)
val res0: BigDecimal = 4E-34

scala> probabilityForLength(120, 42)
val res1: BigDecimal = 8.751991852311394833964673845157515E-34

scala> probabilityForLength(120, 99)
val res2: BigDecimal = 1.468941911574859178966522092677385E-33

scala> probabilityForLength(120, 119)
val res3: BigDecimal = 1.529470021201154499656736868919480E-33

Note that the very small numbers we get for 13, 42, 99 and 119 rolls, instead of exactly 0, are due to approximations when computing with such big and small numbers. To get a better idea of how the probability behaves we can, as usual, export it as a CSV file. Let us start by defining the probability for every length using a stream:

scala> val probaD120 =
         LazyList
           .iterate(0)(_ + 1)
           .map(length => length -> probabilityForLength(120, length))
val probaD120: LazyList[(Int, BigDecimal)] = LazyList(<not computed>)

And write this stream up to some length into a CSV file:

import java.io.{File, PrintWriter}

def (self: LazyList[(Int, BigDecimal)]) plot(depth: Int, file: String): Unit =
  val pw = new PrintWriter(new File(file))
  for (length, proba) <- self.take(depth) do
    pw.printf("%d;%f\n", length, proba.toDouble)
  pw.close()

scala> probaD120.plot(1500, "/tmp/d120.csv")

[Figure: d120 density, length vs probability]

There are a lot of things we can do thanks to this probability, like asking how many rolls we need to observe all the sides of the dice 9 times out of 10. The probability of observing all the sides of the dice in \(l\) rolls or fewer is given by

$$p(X_n \le l) = \sum_{l'=0}^{l} p(X_n = l')$$

All we need to do is transform the stream probaD120:

def [A,B,C](ll: LazyList[A]) foldP(z: B)(f: (B,A) => (B,C)): LazyList[C] =
  LazyList.unfold((ll,z)) { case (l0, z0) =>
    if l0.isEmpty then
      None
    else
      val (z1,c0) = f(z0, l0.head)
      Some((c0, (l0.tail, z1)))
  }
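To see foldP in isolation, here is a toy run (assuming the definition above): it threads a state through the stream and emits one value per element. With the running total as both state and output, we get the partial sums:

```scala
// The state is the running sum; we emit it at every step.
val runningSums = LazyList(1, 2, 3, 4).foldP(0) { (acc, x) => (acc + x, acc + x) }
// runningSums.toList == List(1, 3, 6, 10)
```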

def (self: LazyList[(Int, BigDecimal)]) cumul: LazyList[(Int, BigDecimal)] =
  self.foldP(BigDecimal(0)) { case (s, (l,p)) => (s+p, l -> (s+p)) }

scala> val distribD120 = probaD120.cumul
val distribD120: LazyList[(Int, BigDecimal)] = LazyList(<not computed>)

scala> distribD120.dropWhile(_._2 < 0.9).head._1
val res5: Int = 842

There is a 90% chance that we observe all the sides of the 120-sided fair dice within 842 rolls. Once again, we can get a better idea of how the distribution behaves by plotting it:

scala> distribD120.plot(1500, "/tmp/d120_cumul.csv")

[Figure: d120 distribution, length vs probability]
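The probability stream also answers the original question directly: the average number of rolls is \(E[X_n] = \sum_{l} l \times p(X_n = l)\), i.e. the same `map`/`sum` pattern applied to a (truncated) prefix of probaD120. As a self-contained check on a coin, where we established \(p(X_2 = l) = (1/2)^{l-1}\) for \(l \ge 2\), the series sums to exactly 3:

```scala
// E[X_2] = Σ_{l >= 2} l * (1/2)^(l-1); the tail beyond l = 200 is negligible.
val meanCoin = (2 to 200).map(l => l * math.pow(0.5, l - 1)).sum

// A coin needs 3 tosses on average to show both of its sides.
assert(math.abs(meanCoin - 3.0) < 1e-9)
```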


The initial question was simple:

how many rolls are needed, on average, with an \(n\)-sided fair dice, to observe all of its sides?

But the answer was not! We have seen how to check, empirically, that the Scala Random.nextInt function correctly simulates a fair dice. From there we ran (many) experiments to get an approximation of the answer. We experienced how easy, but disastrous, it can be to build a model disconnected from reality. We learned that building a valid model requires a deep understanding of the problem and the experiments. We had to put a lot of care into the construction of the model to be sure it was a valid formalization of the problem. The maths were not easy, but they were right. And in the end, the maths led us to a very simple formula. Was all this formal brutality useful? Yes, it was. The simple formula gives the exact answer and is by far the most efficient implementation. Going deeper, we found how to answer more questions, like the chance we have of observing all the sides in \(l\) rolls or fewer. We were even able to get a precise idea of how the probability behaves by plotting it.

All along this journey, we used many of the new Scala 3 features, among which extension methods and Type Classes. We saw how easy they are to use and the great benefits they offer. Extension methods let us add methods to objects without any boilerplate, Type Classes let us write generic functions, etc.

I hope you enjoyed this journey as much as I loved writing it. Have a look at all the Scala 3 features. Many of them are not covered here but are truly amazing (Polymorphic Functions, Dependent Function Types, Match Types, Intersection and Union Types, etc). The list is pretty large.