Monday, May 26, 2025

Fiddlish space frequencies

Before getting to rivers, let’s figure out where spaces are likely to appear in the (fictional) Fiddlish language, which includes only three- and four-letter words. These words are separated by spaces, but there is no other punctuation. Suppose a line of Fiddlish text is generated such that each next word has a $50$ percent chance of being three letters and a $50$ percent chance of being four letters.

Suppose a line has many, many, many words. What is the probability that any given character deep into the line is a space?

Let's suppose that we have $N \gg 1$ words in a line. Then there would be $N-1$ spaces, whereas the total length of the line would be $L = 4F + 3T + N-1$ characters, where $F$ is the number of four-letter words and $T$ is the number of three-letter words. Since $F+T = N,$ we have $$L = 4F+3(N-F) + N-1 = 4N + F - 1.$$ Since F is binomially distributed with frequency $\frac{1}{2},$ the expected value of $F$ is $\mathbb{E}[F] = \frac{N}{2},$ so the expected length is $$\mathbb{E}[L] = \frac{9}{2}N - 1.$$

Therefore, the expected frequency of spaces in the line, which is equivalent to the probability that any given character deep into the line is a space, is then $$p = \frac{N-1}{\mathbb{E}[L]} = \frac{N - 1}{\frac{9}{2}N - 1} \to \frac{2}{9} = 22.222\dots\%$$ as $N \to \infty.$

Monday, May 19, 2025

Pyramidal permeation permutations

Consider the following figure, which shows $16$ points arranged in a rhombus shape and connected to their neighbors by edges. How many distinct paths are there from the top point to the bottom along the edges such that:

  • You never visit the same point twice.
  • You only move downward or sideways—never upward.

Let's number the nodes from top down and then from left to right, so that the top node is $1$, the nodes from left to right on the second level are $2$ and $3$, and so on. Let $N(k)$ be the number of paths from $1$ ending at node $k$ following the rules of never visiting the same point twice and only moving downward or sideways, never upward. For good measure, we can go ahead and assume that $N(1) = 1.$

It is straightforward to see that $N(2) = N(3) = 2,$ since either the path goes directly from $1$ to $k \in \{2, 3\}$ or the path first travels to the other node, $5-k$, and then sideways to $k$.

Now let's consider some $k \in \{4, 5, 6\}.$ Clearly, for any such $k,$ an acceptable path must last exist the row above at some $\ell \in \{2, 3\}.$ If the path left the higher row at $\ell = 2,$ then it either first goes from $2$ to $4$ and then (if necessary) sideways from $4$ to $k$, or it goes from $2$ to $5$ and then (if necessary) sideways from $5$ to $k.$ So there must be 2N(2) paths from $1$ to $k \in \{4,5,6\}$ that leave the first row at $\ell = 2.$ A similar argument can be made that there also must be $2N(3)$ paths from $1$ to $k \in \{4,5,6\}$ that leave the first row at $\ell = 3.$ Therefore, for each $k \in \{4,5,6\}$ we must have $$N(k) = 2N(2) + 2N(3) = 8.$$

Carrying this argument further, let $\mathcal{L}_i$ represent the $i$th row of the diagram. So for instance $\mathcal{L}_0 = \{1\},$ $\mathcal{L}_1 = \{2, 3\},$ etc. Additionally, let $I(k)$ be the downward index of node $k,$ that is how many acceptable paths downward are available from node $k$. Then extending the argument from the previous paragraph, we see that $$N(k) = \sum_{j \in \mathcal{L}_{i-1}} I(j) N(j), \,\,\forall k \in \mathcal{L}_i.$$ So, as previously seen, $N(2) = N(3) = 2$ and $N(4) = N(5) = N(6) = 8.$ Furthermore, \begin{align*} N(7) = N(8) = N(9) = N(10) &= \sum_{j=4}^6 I(j) N(j) = 2N(4) + 2N(5) + 2N(6) = 6 \cdot 8 = 48 \\ N(11) = N(12) = N(13) &= \sum_{j=7}^{10} I(j) N(j) = N(7) + 2 N(8) + 2 N(9) + N(10) = 6 \cdot 24 = 288 \\ N(14) = N(15) &= \sum_{j=11}^{13} I(j) N(j) = N(11) + 2N(12) + N(13) = 4 \cdot 288 = 1152 \\ N(16) &= \sum_{j=14}^{15} I(j)N(j) = 2 \cdot 1152 = 2234,\end{align*} so the total number of acceptable paths from the top vertex of the rhombus to the bottom vertex is $2,234.$

Monday, May 12, 2025

Guess we'll still never know ...

Now that you’ve determined the values of $a$ and $b$, let’s analyze the rest of Stephen’s statement. Is it true that losing in five games is “closer to a sweep than a seven-game series”? Let $p_4$ represent the probability that the Celtics sweep the Knicks in four games. And let $p_7$ represent the probability that the series goes to seven games (with either team winning).

Suppose $p$ is randomly and uniformly selected from the interval $(a, b),$ meaning we take it as a given that the most likely outcome is that the Knicks will lose the series in five games. How likely is it that $p_4$ is greater than $p_7$? In other words, how often will it be the case that losing in five games means a sweep is more likely than a seven-game series?

As we saw earlier, for any value of $p \in \left(\frac{3}{5}, \frac{3}{4}\right),$ we have $p_4 = p^4.$ Similarly, we have $$p_7 = P(\text{C},7\mid p) + P(\text{K}, 7\mid p) = 20p^3 (1-p)^3.$$ We can solve for the point at which we have $p_4(p) = p_7(p)$ and get the cubic equation $20(1-p)^3 - p = 0,$ or substituting $t = 1-p$ we get $$t^3 + \frac{1}{20} t - \frac{1}{20} = 0.$$ Using Cardano's formula, given that the discriminant when $p = \frac{1}{20}$ and $q = -\frac{1}{20}$ is $$\Delta = \frac{q^2}{4} + \frac{p^3}{27} = \frac{1}{1600} + \frac{1}{432000} = \frac{17}{2700} \gt 0,$$ we have the solution $$t^* = \sqrt[3]{\frac{1}{40} + \frac{1}{30} \sqrt{\frac{17}{30}}} + \sqrt[3]{\frac{1}{40} - \frac{1}{30}\sqrt{\frac{17}{30}}},$$ so the equilibrium point between $p_4$ and $p_7$ occurs at $$p^* = 1 - t^* = 1 - \sqrt[3]{\frac{1}{40} + \frac{1}{30} \sqrt{\frac{17}{30}}} - \sqrt[3]{\frac{1}{40} - \frac{1}{30}\sqrt{\frac{17}{30}}} \approx 0.676582453403\dots.$$ So if $p$ were uniformly samples from $U \left( \frac{3}{5}, \frac{3}{4} \right)$ then the probability that a sweep is more probable than a seven-game series is $$\frac{p^* - \frac{3}{5}}{\frac{3}{4} - \frac{3}{5}} = \frac{20p^* - 12}{3} \approx 0.510549689351\dots.$$

Guess we'll never know ...

In a recent First Take episode, co-host Stephen A. Smith said:

I got [the Knicks] losing this in five games, which means they’re closer to a sweep than a seven-game series. That’s how I’m looking at it right now.

Let’s look at the first part of Stephen’s statement, that he believed the Knicks would lose to the Celtics in five games.

Let $p$ represent the probability the Celtics win any given game in the series. You should assume that $p$ is constant (which means there’s no home-court advantage) and that games are independent.

For certain values of $p$, the likeliest outcome is indeed that the Celtics will win the series in exactly five games. While this probability is always less than $50$ percent, this outcome is more likely than the Celtics winning or losing in some other specific number of games. In particular, this range can be specified as $a \lt p \lt b.$

Determine the values of $a$ and $b$.

Let $P(t, n \mid p)$ be the probability of team $t \in \{ \text{C}, \text{K} \}$ winning the series in $n$ games, given that the probability of the Celtics winning any individual game is $p.$ Then since team $t$ would have to win the final game of the series, the probability is given by $$P(t, n \mid p) = \begin{cases}\binom{n-1}{3} p^4 (1-p)^{n-4}, &\text{if $t = \text{C}$;}\\ \binom{n-1}{3} p^{n-4}(1-p)^4, &\text{if $t = \text{K}.$}\end{cases}$$

We are trying to find the maximal interval $(a,b) \subseteq [0,1],$ such that $$P(\text{C}, 5 \mid p ) = \max_{t \in \{\text{C}, \text{K}\}, n \in \{4,5,6,7\}} P(t, n \mid p),$$ for all $p \in (a,b).$ Let's just brute force this thing: \begin{align*} P(\text{C}, 5 \mid p) = 4p^4(1-p) \gt P(\text{C}, 4 \mid p) = p^4 &\Rightarrow p \in \left(0,\frac{3}{4}\right) \\ P(\text{C}, 5 \mid p) = 4p^4(1-p) \gt P(\text{C}, 6 \mid p) = 10p^4(1-p)^2 &\Rightarrow p \in \left(\frac{3}{5},1\right) \\ P(\text{C}, 5 \mid p) = 4p^4(1-p) \gt P(\text{C}, 7 \mid p) = 20p^4(1-p)^3 &\Rightarrow p \in \left(1 - \frac{1}{\sqrt{5}},1\right), \end{align*} so since as long as $p \gt \frac{1}{2}$ we have $P(\text{C}, n \mid p) \gt P(\text{K}, n \mid p)$ for each $n \in \{4,5,6,7\},$ we see that the region where a Celtics win in five games is the most probable occurs when $p \in \left(\frac{3}{5}, \frac{3}{4}\right).$

Sunday, May 4, 2025

Leeloo Disney, Multi-Pass, Extra Credit

This time around, after you complete the first ride on your Multi Pass, you can reserve a fourth ride at any time slot after your first ride (rather than after your third ride). Similarly, after you have completed your second ride, you can reserve a fifth ride at any time slot after your second, and so on, until there are no available time slots remaining.

As before, the first three rides of the day are equally likely to be in any of the 12 time slots, whereas subsequent rides are equally likely to occur in any remaining available slots for the day.

On average, how many rides can you expect to “Lightning Lane” your way through today at Magic Kingdom?

As we say in the previous Classic answer, I encoded an "extra credit" version of the Lightning Pass program, where now as opposed to only adding new rides after your last scheduled ride, you add another ride uniformly distributed over all of the remaining, not already scheduled time slots. This distribution should be, and indeed turns out to be, much more centrally distributed, though here I was unable to come up with a semi-analytical solution and will just rely on the simulations.

Here again, I used $N=1,000,000$ simulations and arrived at the following histogram, which gives a mean of $m = 6.809379,$ standard deviation of $s = 1.404456,$ which implies that the true mean $95\%$ confidence interval is $R \in [6.8066, 6.8121].$ Therefore, let's comfortably round so that under the Extra Credit Lightning Lane rules, you can expect approximately $6.81$ rides.

Leeloo Disney, Multi-Pass

By purchasing “Lightning Lane Multi Pass,” you can reserve three of the many rides in a park, with each ride occurring at a different hourlong time slot. For simplicity, suppose the park you’re visiting (let’s say it’s Magic Kingdom) has $12$ such time slots, from $9$ a.m. to $9$ p.m. So if you have the $3$ p.m. time slot for a ride, then you can skip the “standby lane” and board the much shorter “Lightning Lane” at any point between $3$ and $4$ p.m. Assume you can complete at most one ride within each hourlong time slot.

Once you have completed the first ride on your Multi Pass, you can reserve a fourth ride at any time slot after your third ride. This way, you always have at most three reservations. Similarly, after you have completed your second ride, you can reserve a fifth ride at any time slot after your fourth, and so on, up until you are assigned a ride at the $9$ p.m. time slot. That will be your final ride of the day.

Magic Kingdom happens to be very busy at the moment, and so each ride is randomly assigned a valid time slot when you request it. The first three rides of the day are equally likely to be in any of the $12$ time slots, whereas subsequent rides are equally likely to occur in any slot after your currently latest scheduled ride.

On average, how many rides can you expect to “Lightning Lane” your way through today at Magic Kingdom?

Let's denote the slots that I get as $S \subseteq \{1, 2, 3, \dots, 12 \}$. Since after each ride I take, the probability distribution of what additional slots I add to $S$ only depend on $m = \max S.$ Additionally, since we care about the total ending size of $S,$ let's keep track of $n = |S|.$

Define $R(n,m)$ to be the expected total number of rides I take given that I currently have $n$ slots and the a maximum time slot is $m$. Since once the $12$th time slot is filled, you do not end up adding any more rides, we have $R(n,12) = n$ for $n = 3, 4, \dots, 12.$ Otherwise, we have the following recursion formula $$R(n,m) = \frac{ \sum_{j=m+1}^{12} R(n+1,j)}{12 - m}, \,\,\,\ m \lt 12.$$

Using the recursion formula, we see that $$R(n,11) = R(n+1,12) = n+1, \,\,\,n = 3, 4, \dots, 11.$$ Similarly, we get $$R(n,10) = \frac{R(n + 1, 11) + R(n + 1, 12)}{2} = \frac{2n+3}{2} = n + \frac{3}{2}, \,\,\,n = 3, 4, \dots, 10.$$ Once more, we get $$R(n,9) = \frac{R(n+1,10) + R(n+1,11) + R(n+1, 12)}{3} = n + \frac{11}{6}, \,\,\, n = 3, 4, \dots, 9.$$ Another time, we get $$R(n,8) = \frac{\sum_{i=9}^{12} R(n+1,i)}{4} = n + \frac{25}{12}, \,\,\, n = 3, 4, \dots, 8.$$ Further, we get $$R(n,7) = \frac{\sum_{i=8}^{12} R(n+1,i)}{5} = n + \frac{137}{60}, \,\,\, n = 3, 4, \dots, 7.$$ Continuing onward, we get \begin{align*} R(n,6) &= n + \frac{49}{20}, \,\,\, n = 3, 4, 5, 6; \\ R(n,5) &= n + \frac{363}{140}, \,\,\, n = 3, 4, 5; \\ R(n,4) &= n + \frac{761}{280}, \,\,\, n = 3, 4; \\ R(n,3) &= n + \frac{7129}{2520}, \,\,\, n = 3.\end{align*}

Let's pause to define $p_m$ as the probability that in your first three randomly assigned timeslots, the maximum time slot is $m,$ That is, $$p_m = \frac{ \binom{m-1}{2} }{ \binom{12}{3} } = \frac{ (m-1)(m-2) }{ 440 }.$$ So using the law of total expectation, we get that the total average number of rides that you expect to go on using the Lightning Lane Multi Pass is $$R = \sum_{m=3}^{12} p_m R(3,m) = \sum_{m=3}^{12} \frac{(m-1)(m-2)}{440} R(3,m) = \frac{118361}{27720} \approx 4.2698773\dots$$

To confirm the calculations here, I put together a small Python code for testing purposes (as well as for working on the Extra Credit problem that will be featured in a later post). Below find the code:

import numpy as np
rng = np.random.default_rng()

def lightning_classic(rng):
    slots = sorted(list(rng.choice(12,size=3, replace=False)))
    last = slots[-1]
    while last < 11:
        new = rng.integers(last+1, high=12)
        slots.append(new)
        last = new
    return slots

def lightning_xc(rng):
    slots = sorted(list(rng.choice(12,size=3, replace=False)))
    for i in range(12):
        if i in slots:
            open_slots = [j for j in range(i+1, 12) if j not in slots] ## figure out which ones are still open
            if open_slots:
                new = open_slots[rng.integers(len(open_slots))] ## get new slot
                slots.append(new)
    return slots

After running the code with $N = 1,000,000,$ we get the following histogram, which gives a mean of $m = 4.269231,$ standard deviation of $s = 1.032058,$ which implies that the true mean $95\%$ confidence interval is $R \in [4.2672, 4.2713].$