Do the shuffle: finding π in your playlists

Henry Jaspars really likes Nick and Norah’s infinite playlist.


Being a mathematician makes applying your vocation in apparently non-mathematical situations something of a reflex. Very often, this can be to the annoyance of others, including your long-suffering friends, who may label you with words such as ‘party-pooper’ or ‘killjoy’. However, in some cases, mathematics can make you the life and soul of the party. The following—the mystery of the repeating Starship song—is one such example.

At the party in question, the host had the foresight to put a playlist on shuffle to keep us pacified. But he was soon to regret this decision, as proceedings were interrupted by an unwelcome guest, like Banquo’s ghost at the feast—namely, Starship’s 1985 classic We Built This City. Don’t get me wrong: the song is a surefire singalong firestarter. But as with most things from the 80s, its glossy studio sheen is liable to wear off easily. When the song came on for the first time, we were relatively enthusiastic—but upon the second time in 10 minutes, we began to ask ourselves searching questions, such as—who is Marconi? Why was he playing the mamba? And more pressingly, why had we heard the same song twice in close succession? Were we merely victims of chance? Was it a sign from God? Perhaps some arena rock-based apparitions were at play? None of us knew how to interpret it. An eerie silence fell upon the room as our mortified host pressed skip.

I think I see the problem

Thankfully, however, I managed to rescue us from our music-based turmoil with my combinatorial prowess. Quickly, like a cowboy from the wild west whipping out their pistol, I broke out my comically large pad of paper. Although all of my friends (bar one) were either inebriated or otherwise distracted, I was undeterred in attempting to come up with a valid explanation that could satisfy the remaining two of us. Given a playlist of $n$ songs, how long will it be before a song is played twice? I set to work. Little did I know that the ensuing mathematical odyssey would take me from Harry Styles to hashing functions, from Ra-Ra-Rasputin to Ramanujan, from popular music to population dynamics and back again.

Don’t you remember?

So, to start off: the chance of the first song being unheard is exactly 1 (provided you are not suffering from debilitating déjà vu). For the second song, the chance it is different from the first one is exactly $1-1/n$ for a playlist with $n$ songs, and by the time we reach the $(k+1)$th song, our odds become $1-k/n$: exactly the proportion of the $n$ songs we haven’t heard yet. Then, if we have a run of different songs 1 through $k$, the probability of the $(k+1)$th song being one we’ve heard before (perhaps our beloved Starship) is $k/n$. So the probability that the first repeat is at the $(k+1)$th song, which we’ll call $P_{k+1}$, is \begin{align*} P_{k+1} &= \frac{n}{n}\cdot\frac{n-1}{n}\cdot\cdots\cdot\frac{n-k+1}{n} \cdot \frac{k}{n}\\ &= \frac{n!}{(n-k)! n^{k}} \cdot \frac{k}{n}. \end{align*} Now—observe that the repeated song at position $k+1$ is equally likely to match any of the earlier positions $j \leq k$, so the gap $k+1-j$ between the two plays is equal to $(k + 1)/2$ on average. Hence, the expected gap between the first repeated song and its earlier play is \begin{align*}{E}[\text{length between first repeat}] &= \sum_{k = 1}^{n} P_{k+1} \cdot \frac{k+1}{2}\\ &= \frac{1}{2}\left(\sum_{k = 1}^{n} P_{k+1} + \sum_{k = 1}^{n} P_{k+1}\cdot k\right). \end{align*}

Shuffle up: With a random playlist containing $n = 8$ songs, we may have a repeat after 6 songs, with the two plays 3 songs apart.
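
If you would rather trust a computer than a tipsy combinatorialist, the calculation above is easy to sanity-check numerically. The following is a minimal sketch in Python (the article itself contains no code, so the language and the function names here are my own choices): it simulates a shuffled playlist of $n$ songs until the first repeat, and compares the average gap with the sum $\sum_k P_{k+1}(k+1)/2$ derived above.

```python
import random
from math import prod

def simulated_gap(n, rng):
    """Play uniformly random songs from a playlist of n until one repeats;
    return the gap between the repeat and its earlier play."""
    first_played = {}          # song -> position of its first play
    position = 0
    while True:
        position += 1
        song = rng.randrange(n)
        if song in first_played:
            return position - first_played[song]
        first_played[song] = position

def expected_gap(n):
    """The exact sum derived above: sum over k of P_{k+1} * (k+1)/2."""
    total = 0.0
    for k in range(1, n + 1):
        p_distinct = prod((n - i) / n for i in range(k))  # first k songs all different
        p_repeat_next = p_distinct * k / n                # ...and song k+1 is a repeat
        total += p_repeat_next * (k + 1) / 2
    return total

rng = random.Random(0)
n, trials = 50, 100_000
empirical = sum(simulated_gap(n, rng) for _ in range(trials)) / trials
print(empirical, expected_gap(n))   # both come out at around 4.8 songs for n = 50
```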

Hey, this is starting to sound familiar…

There has to be a repeat somewhere, so that first sum, the one with just probabilities in it, is equal to 1. For the second sum, we can use the probability we calculated earlier; so \begin{align*} {E}[\text{length between first repeat}] &= \frac{1}{2}\left(1 + \sum_{k = 1}^{n} \frac{n!}{(n-k)! n^{k}} \cdot \frac{k}{n} \cdot k \right)\\ &= \frac{1}{2}\left(1 + \sum_{k = 1}^{n} \frac{n! k^2}{(n-k)! n^{k+1}} \right). \end{align*} Letting $j = n-k$, and with a little bit of algebra (which, in true mathematical style, is left as an exercise for the reader), this can be rearranged to \begin{align*} {E}[\text{length between first repeat}] &= \frac{1}{2}\left(1 + \frac{n!}{n^n}\sum_{j = 0}^{n-1} \frac{n^j}{j!} \right). \end{align*}
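
The ‘exercise for the reader’ can at least be checked numerically before it is attempted. Here is a short Python sketch (my own, not the article’s) that evaluates the sum both before and after the substitution $j = n-k$ and confirms that the two forms agree.

```python
from math import factorial

def before_substitution(n):
    # sum over k = 1..n of n! k^2 / ((n-k)! n^(k+1))
    return sum(factorial(n) * k**2 / (factorial(n - k) * n**(k + 1))
               for k in range(1, n + 1))

def after_substitution(n):
    # (n!/n^n) times the sum over j = 0..n-1 of n^j / j!
    return factorial(n) / n**n * sum(n**j / factorial(j) for j in range(n))

for n in (5, 20, 100):
    print(n, before_substitution(n), after_substitution(n))   # the two columns agree
```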

We just want to dance here

Here, I became stuck. And though at the time I attributed my mathematical oversight to a DWI (dividing while intoxicated), I later realised that the problem was rather more beautiful than I had thought, requiring extraordinary subtlety and elegance. As such, it took an extraordinary mathematician, no less than Ramanujan himself, to put it to rights. We define the eponymous Ramanujan Q-function to be
$$Q(n) = \frac{n!}{n^n} \sum_{j = 0}^{n-1} \frac{n^j}{j!},$$ which corresponds to the truncated series, and the accompanying, rather predictably named, Ramanujan R-function to be $$R(n) = \frac{n!}{n^n} \sum_{j = n}^{\infty} \frac{n^j}{j!},$$ which corresponds to the tail.

The $Q$ and $R$-series for $n=8$

Now, our original problem is to understand $${E}[\text{length between first repeat}] = \tfrac{1}{2}(1+Q(n)).$$ While Ramanujan may not have been preoccupied with the mathematics of playlists, the $Q$ and $R$ functions emerge naturally in the context of number theory—in particular, for enumerating numbers with a bounded number of prime factors. The prime number theorem says that the number of primes below $N$ is asymptotically equal to $N / \log N$. Ramanujan generalised this, showing that the number of integers below $N$ which have at most $k$ prime factors is asymptotically equal to $$\pi_k(N) \sim \frac{N}{\log N} \sum_{j = 0}^{k-1}\frac{(\log\log N)^j}{j!}.$$ In the case that $k = \log\log N$, the sum on the right is precisely the one appearing in our friend, the $Q$-function which we wish to study.

At the party, thinking on my (by this point rather unsteady) feet, I had begun to contemplate the complete series from $j = 0$ to $\infty$: namely, $Q(n) + R(n)$. Surely, I thought aloud, the tail of the series should be insignificant in comparison to the partial sum? Upon going home the morning after and calculating this on the back of a napkin, I realised that I was not only wrong, but consistently wrong. Because, as it happens, the infinite series is almost exactly twice the value of the finite one! In particular, $Q(n) \sim R(n)$: the partial series $Q(n)$ is equal to almost exactly half the complete infinite series, $Q(n) + R(n)$. But why?
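
That napkin calculation is easy to reproduce. Below is a rough Python sketch (my own code, with $n$ kept small enough that the factorials still fit comfortably into floating point); it computes $Q(n)$ directly, recovers $R(n)$ from the complete series $Q(n) + R(n) = n!\,\mathrm{e}^n/n^n$, and prints their ratio.

```python
from math import exp, factorial

def Q(n):
    """Ramanujan's Q-function: the truncated series."""
    return factorial(n) / n**n * sum(n**j / factorial(j) for j in range(n))

def R(n):
    """The tail, recovered as (complete series) - Q(n), since Q(n) + R(n) = n! e^n / n^n."""
    return factorial(n) * exp(n) / n**n - Q(n)

for n in (8, 50, 100):
    print(n, Q(n), R(n), Q(n) / R(n))   # the ratio creeps towards 1 as n grows
```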

Pick your Poisson

Although Ramanujan used high-precision tools to see this, a more heuristic, statistical analysis gives just as much insight. The Poisson distribution, which is used to model everything from football to the internet (and occasionally, both in one sitting), is defined as follows: a random variable $X$ is Poisson with mean $n$ (or $X \approx \text{Poi}(n)$) if for all $j \geq 0$, $$\text{Prob}[X = j] = \mathrm{e}^{-n} \cdot \frac{n^j}{j!}.$$ The astute reader will notice that this is identical to the terms of the $Q$- and $R$-series up to a factor of $\mathrm{e}^{-n} n^n / n!$. So to see that $Q(n) \sim R(n)$, it would suffice to show that, for $X \approx \text{Poi}(n)$, $$\text{Prob}[X < n] \sim \text{Prob}[X \geq n].$$ To prove this, we will need some elementary facts about the Poisson distribution:

  • If $X \approx \text{Poi}(n)$ and $Y \approx \text{Poi}(m)$ are independent, then $X + Y \approx \text{Poi}(n + m)$ (ie if two Poisson processes have expected numbers of $n$ and $m$ events per unit of time, the combination of both processes ought to be a Poisson process, and expect $n + m$ events per unit of time).
  • If $X \approx \text{Poi}(n)$, then $\text{E}[X] = \text{V}[X] = n$, where E and V are the expectation and variance of the distribution (ie the variance in the number of events that will occur grows linearly as the rate of the events increases).
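
As a quick numerical check of the claim that the two halves of the distribution balance out, here is a small Python sketch (standard library only, and my own construction rather than anything from the article). It computes $\text{Prob}[X < n]$ and $\text{Prob}[X \geq n]$ directly from the probability mass function, building each term from the previous one to avoid enormous factorials.

```python
from math import exp

def poisson_split(n):
    """For X ~ Poi(n) with integer mean n, return (Prob[X < n], Prob[X >= n])."""
    term = exp(-n)            # Prob[X = 0]
    below = 0.0
    for j in range(n):        # accumulate Prob[X = 0], ..., Prob[X = n - 1]
        below += term
        term *= n / (j + 1)   # turn Prob[X = j] into Prob[X = j + 1]
    return below, 1.0 - below

for n in (8, 50, 500):
    print(n, poisson_split(n))   # both probabilities approach 1/2 as n grows
```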

Now, with this under our belt, we can apply the central limit theorem—the fairy godmother of statistics. The central limit theorem tells us that, if we have $n$ independent random variables (let’s call them $X_1, X_2,$ and so on), which all have the same distribution with mean $\mu$ and variance $\sigma^2$, then as $n\rightarrow\infty$, $$\frac{1}{\sqrt{n}}\sum_{j = 1}^{n} (X_j-\mu) \rightarrow \mathcal{N}(0, \sigma^2),$$ where $\mathcal{N}(0, \sigma^2)$ is the normal distribution with mean 0 and standard deviation $\sigma$.

A Poisson distribution with $n = 8$ and a normal distribution with $\mu=\sigma^2=8$

This means that if we sample a large number of times from the same random variable, the distribution of the sum of the results will start to look like a bell curve—regardless of how the variable was distributed in the first place. If we use Poisson random variables, so $X_j \approx \text{Poi}(1)$ for $1 \leq j \leq n$ and $X \approx \text{Poi}(n)$, we have $$\frac{1}{\sqrt{n}}(X-n) = \frac{1}{\sqrt{n}}\sum_{j = 1}^{n} (X_j-1) \rightarrow \mathcal{N}(0, 1)$$ as $n \rightarrow \infty$. Now, we are almost there: since $n \rightarrow \infty$, if $X \approx \text{Poi}(n)$ and $Y \approx \mathcal{N}(0, 1)$ then the fact that the normal distribution is symmetric means that $$\text{Prob}[X < n] \sim \text{Prob}[Y < 0] = \text{Prob}[Y \geq 0] \sim \text{Prob}[X \geq n],$$ which is precisely what we needed to show.

So, after all this, what is the average number of songs between the first repeat? Well, observe that: $$Q(n) + R(n) = \frac{n!}{n^n} \left(\sum_{j = 0}^{n-1} \frac{n^j}{j!} + \sum_{j = n}^{\infty} \frac{n^j}{j!}\right) = \frac{n!}{n^n}\sum_{j = 0}^{\infty} \frac{n^j}{j!} = \frac{\mathrm{e}^n \cdot n!}{n^n}.$$ To figure out the latter expression, we need heavier weaponry. Since factorials are usually fairly unwieldy to deal with—especially as $n$ becomes large—we use an approximation, which gets better as the factorial gets larger, known as Stirling’s approximation: $$n! \sim \sqrt{2\pi n} \cdot \frac{n^n}{\mathrm{e}^n},$$ whence our pesky $\pi$ emerges. From this, and the fact that $Q(n) \sim R(n)$, we have $$Q(n) \sim \frac{1}{2} \cdot \frac{\mathrm{e}^n\cdot n!}{n^n} \sim \sqrt{\frac{\pi n}{2}}.$$

In fact, we are remarkably close to the real value: using more advanced analytic techniques, Ramanujan managed to show that $$Q(n) = \sqrt{\frac{\pi n}{2}}-\frac{1}{3} + \frac{1}{12}\sqrt{\frac{\pi}{2n}}-\frac{4}{135n} + \cdots$$ where our approximation follows from simply neglecting all but the first term. So, in answer to our original question, the average length between repeats from a sample of $n$ songs is equal to $${E}[\text{length between first repeat}] = \frac{1}{2}(1+Q(n)) \sim \frac{1}{2}\sqrt{\frac{\pi n}{2}}$$ as we approach an infinite playlist. (Here’s looking at you, Nick and Norah!)
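
To see how good these approximations really are, here is one more Python sketch (mine, not the article’s), comparing the exact $Q(n)$ with the single Stirling-based term $\sqrt{\pi n/2}$ and with the four quoted terms of Ramanujan’s expansion.

```python
from math import factorial, pi, sqrt

def Q_exact(n):
    return factorial(n) / n**n * sum(n**j / factorial(j) for j in range(n))

def Q_leading(n):
    return sqrt(pi * n / 2)

def Q_ramanujan(n):
    # the four terms of Ramanujan's expansion quoted above
    return sqrt(pi * n / 2) - 1/3 + sqrt(pi / (2 * n)) / 12 - 4 / (135 * n)

for n in (8, 50, 100):
    print(n, Q_exact(n), Q_leading(n), Q_ramanujan(n))
# even at n = 8 the four-term expansion matches the exact value of about 3.245
# to three decimal places, while the leading term alone overshoots by about 0.3
```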

Rooooock aaaannnd roolllllll

Starship: surprisingly good builders

Knowing this, we can circle back to the original problem: was it fate, or mere luck of the draw? Well, given a playlist of around $50$ songs (the only $50$ our host was familiar with), at a conservative 3 minutes per song, we end up with the average length of the first gap being approximately equal to: $$\frac{1}{2}\sqrt{\frac{\pi \cdot 50}{2}} \cdot \text{3 minutes} \approx \text{13 minutes 17 seconds}.$$ Therefore, we can conclude that the ghost of Starship was probably not haunting us on that auspicious night (although one can never be completely sure). So, beyond the realm of party-based musical quandaries, how useful is this? One example is when we consider birthdays, instead of songs on a playlist. If we have a queue forming one person at a time, what will the average size of the first gap between people with the same birthday be? Simply plugging $n = 365$ into our formula, on average we will need around $Q(n) \approx 23.9$ people to come into our queue before we see a repeat, with a gap of $(1 + Q(n))/2 \approx 12.5$ people between them. This is remarkably small, illustrating what is commonly known as the birthday paradox—the paradox of how very few samples are needed before a repeat will tend to occur, provided they are taken at random.

Equally interestingly, we can use this analysis in reverse. Say we are listening to a given playlist, and we want to work out how many songs are on it. Then, all we have to do is listen out for the first repeat and count the number of songs between its two plays. Rearranging, we obtain: $$n \sim \frac{8 {E}[\text{length between first repeat}]^2}{\pi}.$$
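
For the record, the 13 minutes 17 seconds and the birthday figures above come straight out of the leading-order formula; here is the arithmetic as a few lines of Python (my own script, using the article’s approximation rather than the exact $Q$).

```python
from math import pi, sqrt

def expected_gap(n):
    # leading-order approximation used above: (1/2) sqrt(pi n / 2)
    return 0.5 * sqrt(pi * n / 2)

# the party playlist: 50 songs at a conservative 3 minutes each
gap_in_songs = expected_gap(50)       # about 4.43 songs
gap_in_minutes = gap_in_songs * 3     # about 13.3 minutes, ie roughly 13 minutes 17 seconds
print(gap_in_songs, gap_in_minutes)

# birthdays: n = 365
print(sqrt(pi * 365 / 2))             # about 23.9 people before a shared birthday appears
print((1 + sqrt(pi * 365 / 2)) / 2)   # about 12.5 people between the matching pair
```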

You’ll never know how many times I pressed ‘Random article’ before I found this.

So, if we wanted, for example, to work out the size of a behemoth website like Wikipedia—a website with millions upon millions of articles—all we need to do is repeatedly click ‘Random article’, and voilà: we can find out how large the website really is while only having to look at less than $Q(n)/n~\sim~0.025\%$ of the total articles. This technique also becomes very useful in biology for estimating the population of a species efficiently. In fieldwork, keeping the surrounding environment pristine is paramount—so taking as few samples as possible is vital. With our method in hand, all we would need to do is start sampling at random, marking our animals sequentially and then releasing them: in a population of 10,000 animals, we would only need to find around 125 before we see one twice. As soon as we come across an animal we have seen before, we can work out the gap between the first repeat and its earlier sighting, and hence estimate the population of the species, while risking minimal disturbance to the surrounding environment.
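
As a sanity check of this back-to-front use of the formula, the following Python sketch (a hypothetical survey of my own invention, not anything from the article) samples a population of 10,000 at random until the first repeat and then inverts the gap formula. Averaged over many simulated surveys the estimate lands close to the true population, although any single survey is, of course, much noisier.

```python
import random
from math import pi

def first_repeat_gap(population, rng):
    """Sample individuals uniformly at random until one is seen twice;
    return the gap between the two sightings."""
    first_seen = {}
    t = 0
    while True:
        t += 1
        individual = rng.randrange(population)
        if individual in first_seen:
            return t - first_seen[individual]
        first_seen[individual] = t

rng = random.Random(0)
population, surveys = 10_000, 5_000
mean_gap = sum(first_repeat_gap(population, rng) for _ in range(surveys)) / surveys
estimate = 8 * mean_gap**2 / pi   # invert E[gap] ~ (1/2) sqrt(pi n / 2)
print(mean_gap, estimate)         # the estimate comes out close to 10,000
```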

The problem of the time before a first repeat also has great significance in the context of cybersecurity—in particular, the study of hash collisions. A hash function is a method by which we take a piece of information—like a file or password—and encode it into a piece of data of a fixed size, known as a hash. More specifically, a hash function is a map $h: \mathcal{B} \rightarrow \mathcal{H}$ that is easy to compute but prohibitively hard to invert, where $\mathcal{B}$ is the set of binary strings, and $\mathcal{H}$ is some finite set of hashes. There are many practical uses for this—for example, for keeping passwords secure. Keeping an exact record of all user passwords is at best a terrible idea and at worst an unforgivable risk in terms of data security—what if someone was able to access the passwords directly, or there was a data breach? Instead, for a user $U$ with password $p \in \mathcal{B}$, the pair $(U, h(p))$ is stored in our database. Then, when the user tries to log in again with password $p' \in \mathcal{B}$, the values of the hashes, $h(p)$ and $h(p')$, are compared. If they agree, we deduce that $p = p'$, and we let our user in.

The major shortfall of this is that hash functions are never injective—ie there always exist distinct $p, p' \in \mathcal{B}$ for which $h(p) = h(p')$, a phenomenon known as a hash collision. At best, our method is probabilistic: we can only have a certain level of confidence that our user is the genuine article, but never absolute certainty. This Achilles heel, which occurs by necessity in any hash function—after all, there are far more passwords than hashes—is a major flaw, especially when malicious users are capable of exploiting it to modify or delete existing records. And while we can have faith that finding two values with the same hash is unlikely—and computationally very difficult to pull off—nevertheless, probability always wins in the end: we only have to sample on average $Q(n) \sim \sqrt{\pi n/2}$ passwords out of a space of $n$ hashes before two will result in the same hash. And even if they don’t correspond to the password we want, finding any hash collision at all is sufficient to compromise the entire system.
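
To make the attack concrete, here is a toy Python demonstration (my own, using the standard hashlib module, and needing Python 3.9 or later for randbytes). It deliberately truncates SHA-256 to 32 bits so that a collision is findable in a fraction of a second, and compares the number of random attempts required against the birthday bound $\sqrt{\pi n/2}$.

```python
import hashlib
import random
from math import pi, sqrt

def toy_hash(message: bytes, bits: int = 32) -> int:
    """A deliberately weakened hash: only the first `bits` bits of SHA-256."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") >> (256 - bits)

def attempts_until_collision(bits: int, rng) -> int:
    """Hash random messages until two of them share a (truncated) hash."""
    seen = set()
    attempts = 0
    while True:
        attempts += 1
        h = toy_hash(rng.randbytes(16), bits)
        if h in seen:
            return attempts
        seen.add(h)

rng = random.Random(0)
bits = 32                                    # a space of n = 2^32 possible hashes
print(attempts_until_collision(bits, rng))   # typically of the same order as the bound below
print(sqrt(pi * 2**bits / 2))                # the birthday bound: about 82,000 attempts
```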

Direct hit: a hash collision in action.

This attack, known as the birthday attack, can allow a large system to be compromised surprisingly quickly, with nothing more than random sampling. But before readers start writing frantic letters to Dirichlet about the security of their passwords, they can be safe in the knowledge that modern hash functions are designed to resist such collisions: the value of $Q(n)$, known as the birthday bound, for the most common type of hash, SHA-256, with (you guessed it) $n = 2^{256}$, is $4.26 \times 10^{38}$, which would still take $9.66 \times 10^{14}$ times the age of the universe to compute at a rate of 1 million hashes per second. Phew. So while, to quote the venerable Starship, “the odds are against us”, chance, as they say, is a very fine thing indeed.

Henry is neither shaken nor stirred, though he does feel slightly permuted. In any case, he’s currently in a bit of a twist—or is it a tangle? He’s knot sure.
