Spam calls and number blocking

Michael Wendl really wants those spammers to stop calling him

post

Nowadays, it seems that everyone is inundated with spam phone calls, which manifest as a spoofed phone number or name to hide the caller’s true identity. At best, answering such calls results in insistent pitches about things like insurance benefits we are missing out on or roofing inspections we are assured we need.

This isn’t terribly different from spam email, for which you might use software to automatically shunt offenders to a junk mailbox, or to /dev/null if you are a particularly frustrated Unix user. For telephone communication, an option is enrolling in ‘no-call lists’ that flag your number as off limits with respect to unsolicited calls from telemarketers and the like. Both of these approaches run the dual risk of allowing an actual spam call to pass through and of mistakenly rejecting a genuine call as spam. In statistics, these types of errors are commonly referred to as false negatives and false positives, respectively. The goal of any spam blocking approach is always to minimise both.

Another and seemingly very popular option is number blocking: the user vets a new call, and, if they deem the number to be spam, they then set their phone to reject all subsequent calls from that number. As successive spam calls are denied, the phone compiles its own increasingly large database of blocked numbers. Some time ago, my phone started to be inundated with spam calls. In reflexively resorting to number blocking, I found myself wondering about the success rate of this method. Aside from the obvious practical significance of the question, it is also mathematically interesting and can be investigated using the tools of combinatorial probability.

Off his phone bat

The reason spammers are able to place so many calls is that they use sophisticated software to automatically make ‘robocalls’, often spoofing numbers and even names that geographically target the recipient. This tactic, commonly referred to as ‘neighbour spoofing’, is a powerful way of increasing the odds of a recipient actually answering their phone, since the call appears to be local. In order to cast this problem in mathematical terms, consider a catalogue of $k$ possible telephone numbers and suppose that the recipient’s phone already has its own database of $m$ of these numbers blocked which we denote $B=\{B_1,B_2,…,B_m\}$. Further, suppose that spam software makes $n$ calls to the recipient’s phone, choosing each spoofed calling number, $C=\{C_1, C_2, …, C_n\}$, randomly from this same catalogue of $k$ telephone numbers. A successful block occurs when one random variable from $C$ matches, or ‘collides’, with one from $B$.

Such problems are thus often referred to as matching or collision problems. A famous example is the birthday problem: what is the the probability that at least two people within a group share a common birthday? (You can read more about this in Peter Rowlett’s piece ‘Counting caterpillars’ in issue 09 of Chalkdust). The first investigation into the 2-set problem was a 1987 article by Tony Crilly and Shekhar Nandy in the Mathematical Gazette entitled ‘The birthday problem for boys and girls’, which counted birthday matches between a boy and girl from their respective sets. The collision problem appears in an enormous array of real-world scenarios and there is now quite a literature on 2-set and multi-set problems. However, our proposition is yet slightly different from all of these…

Let’s try and calculate the probability of no matches between $B$ and $C$, denoted as $P_0$ (this is equal to one minus the complimentary scenario: that we have one or more matches). Suppose we’re interested in matching heads and tails between two sets of two coins each, so $m=n=k=2$.

Specifically, if we treat each coin-to-coin comparison between sets as an independent trial, the overall state of no matches requires each coin in the left set to not match each coin in the right set, namely \[P_0 = P(\textrm{set 1 coin 1} \neq \textrm{set 2 coin 1}) \cdot P(\textrm{set 1 coin 1} \neq \textrm{set 2 coin 2}) \cdot\] \[   P(\textrm{set 1 coin 2} \neq \textrm{set 2 coin 1}) \cdot P(\textrm{set 1 coin 2} \neq \textrm{set 2 coin 2}) .\]

An example of a simple 2-set collision problem in terms of matching heads versus tails ($k=2$) for two sets of two coins each. The elementary approach of treating each match trial as being independent, where dashed lines indicate lack of a match. Multiplying the probabilities together gives $1/16.$

We might be tempted to calculate this as follows: assume each of these four trials has a Bernoulli probability of $1/2$, meaning that the overall no-match probability is $(1/2)^4 =0.0625$. Yet, if we actually write down all 4 possible head–tail combinations for the left set and the corresponding ones for the right set, then count how many of the $4\times4 = 16$ possible outcomes result in no overall match, we plainly see two hits, meaning $P_0 = 2/16 = 0.125$. These two calculations, which seemingly address the same question, differ by a factor of two!

An approach where all possible pairs are enumerated, where boxes indicate instances having no matches between the two sets. $2/16$ (or $1/8$) of the options have no matches.

The latter calculation is, of course, correct, since we made exhaustive counts. This exercise indicates that match trials between individual variables are not strictly independent of one another, a complication not uncommon in probability problems. Nevertheless, we can still generalise the elementary approach in order to investigate conditions under which its error might become acceptably small compared to the exact solution. Let’s think of this elementary approach as an approximation. Reasoning along the same lines, a single trial between one specific variable in each set has a mismatch probability of $1-1/k$ and, since there are $mn$ such trials, the overall mismatch probability is \[P_0^A = \left(1-\frac{1}{k}\right)^{mn} \approx \textrm{e}^{-mn/k},\] where superscript $A$ denotes approximation based upon the assumption of independent trials. The final expression is an additional asymptotic approximation, which is reasonable when $m$, $n$ and $k$ are sufficiently large (as they tend to be in a number blocking problem). This can be seen by comparing expansion of the binomial to the series expansion of the exponential.

Put your phone away

Using structured counting, we can calculate the probability, $P_0$, of no matches arising between set $B$ and set $C$. First, let’s consider the sampling methods by which sets $B$ and $C$ are actually obtained. Set $C$ is sampled with replacement, meaning the software plucks each new spoofed telephone number randomly with equal probability from the catalogue of $k$ numbers. Hence the spam-caller-side sample space is of size $k\times k\times … \times k = kn$ when $n$ spam calls have been placed. In many 2-set modelling scenarios, both sets are selected in this way. Examples include Crilly and Nandy’s boy–girl birthday probability discussed above and numerous real-world calculations, like the probability of no collision-enabling common altitudes for aircraft flying between two opposing airports. However, at this point, our scenario departs somewhat from these classic 2-set modelling scenarios. Namely, values in $B$, the list of numbers blocked on the phone, cannot be repeated, which is to say they are selected without replacement. The recipient blocks the first randomised spam number that appears, after which that number is, by definition, never seen again, then the second, which is likewise never seen again, and so forth. As such the blocking-side sample space is enumerated by the permutation $k(k-1)(k-2)…(k-m+1)$, which is known as the ‘falling factorial’ and denoted by $(k)_m$. Consequently, the total sample space is then of size $k^n(k)_m$. With that, we have the denominator of probability $P_0$…and we are halfway there!

The numerator is a little more complicated, but we can use some visual assistance from graphs. We can represent each random variable by a vertex. Connections between vertices are either solid or dashed lines, solid lines represent a ‘match’ and dashed lines represent ‘no match’. Edges for random variables in $B$ (the blue vertices) are necessarily dashed, while edges connecting the random variables in $C$ (the green vertices) can be of both kinds. The connection between $B$ and $C$ is likewise necessarily dashed.

An example with eight blocked numbers, and four spoofed numbers made by the caller, three of which are identical.

This diagram again suggests the falling factorial, with this particular one representing $(k)_{10}$ possible ways of realising no collisions, since there are a total of 10 non-repeating values connected by nine dashed edges. Note that the last two vertices in $C$, ie $C_3$ and $C_4$, do not add to the count because they have identical values to their adjacent neighbour $C_2$. This enumeration is readily generalised to an arbitrary graph having $m$ and $j$ uniquely valued vertices in $B$ and $C$, respectively, as $(k)_{m+j}$.

But, how should we account for the various different possible arrangements of solid and dashed edges on the caller side? Consider the arrangements below gathered under the banner of  $S_2(4,2)$, which depict the seven ways that four elements can be partitioned into two non-empty subsets. The arrangement marked by the asterisk appears in the $n=4$ example above. This diagram raises the question, would we have to account for all 15 possible ways of partitioning the four caller vertices shown in the previous diagram to obtain a complete and correct count?

It turns out that what we need here is a general way to tally the numbers of ways of partitioning a set of $n$ things into $j$ non-empty subsets, which is precisely what is quantified by the Stirling numbers of the second kind, $S_2(n,j)$. Multiplying this quantity by the falling factorial gives $S_2(n,j)(k)_{M=j}$,which represents the grand total number of graphs that do not collide when there are $j$ unique values on the caller side. Summing the resulting product over the $n$ possible multiplicities for the caller side yields the required numerator.

All the possible ways of partitioning the caller- side set of four spoofed numbers into one, two, three and four non-empty subsets.

Bringing the numerator and denominator together, we find the probability of no matches to be \[P_0 = \frac{1}{k^n(k)_m} \sum_{j=1}^n S_2(n,j)(k)_{m+j}.\] Requisite $S_2$ values are generated by the recurrence $S_2(n+1,j) = j \cdot S_2(n,j) + S_2(n,j-1)$ using the initial/edge conditions $S_2(0,0) = 1$ and $S_2(n,0) = S_2(0,n) = 0$.

Before getting to actual analysis, let me quickly illustrate the extraordinarily large numbers that combinatorial situations can yield, which are belied by the visually unremarkable decimal values in the resultant probability. For a small, but otherwise arbitrary toy example of $m=9$, $n=11$, and $k=48$, we find \[P_0 = \frac{193212130136969871163154203084800}{1896606883355394007986178778726400} \approx 0.10187,\] where the numerator and denominator each exceed 30 digits.

Blow your phone trumpet

Blocking can be examined for any situation where $m$, $n$, and $k$ can be characterised, so let’s look at my own phone as an example. I’ve experienced a few different neighbour spoofing schemes. The most specific schemes I’ve come across display actual names of local establishments, eg banks, medical offices, car dealerships, and restaurants. I must admit to having been tricked into answering some of these kinds of calls. For my locale, I will estimate a corresponding value of $k=500$ possible phone numbers in the catalogue. A less sophisticated scheme I have also experienced displays a telephone number, where my 3-digit area code and 3-digit exchange prefix are duplicated, but the trailing 4 digit line number varies. Here, $k\approx10,000$. The least specific neighbour-spoofing for north American telephones only uses the same area code, so we have $k \approx 7,920,000$ due to various reserved and disqualified numbers within the larger set of 107 possibilities. This is the weakest scheme, as presumably recipients aren’t likely to answer these calls. To help us consider number blocking over a one week period, I vetted my phone calls for a week and received a vast $n=77$ spam calls. The solution for $P_0$ that we calculated gives the probability of no matches. Let’s call the probability of one or more matches, ie successful blocks, the blocking probability, $P_B = 1-P_0$. The graph below shows blocking probabilities for the various neighbour-spoofing schemes described above as a function of the recipient’s blocking aggressiveness, ie size $m$ of their blocked number list.

Blocking probabilities calculated by the exact and approximate solutions for various levels of neighbour-spoofed spam calls. The blue curves depict the toy enumeration example above (solid line is exact, dashed line is approximate), for which $n=11$ and $k=48$. All other curves apply to the number blocking problem, for which $n=77$. For these, the exact and approximate results are visually indistinguishable.

Two somewhat surprising aspects stand out. First, what was clearly a huge error in the approximate solution for $m=n=k=2$ in the coin matching example is dramatically decreased for the toy enumeration example of $k=48$ and subsequently all but disappears for our actual number blocking analysis, where $k \ge 500$ (green, orange, and pink curves). Neglecting event dependence and applying asymptotic approximation quickly become acceptable as the parameters of this problem increase. This is because the former results from a rapidly growing sample space in which the outcomes of joint events are increasingly unaffected by one another, while the latter is explained by comparing associated expansions, as mentioned above. As a result, our telephone problem appears to be well described by the approximate result, $P_0^{A} \approx \textrm{e}^{-mn/k}$. Of course, we could not have known this without first developing the exact combinatorial solution to the problem. Note that brute force numerical evaluation would also have been limited, as suggested by the fact that even our toy example of $k = 48$ weighs in at $\sim 10^{34}$ possible combinations to check.

Now that you’ve stopped getting so many spam calls, you can stop leaving your phone in the garden

The second observation echoes the nonintuitive result of the birthday problem. Namely, matching, which is to say `number blocking’, commences with high probability even for relatively small collections of pre-blocked numbers in the recipient’s phone. For example, it reaches 90% at $m = 15$ for specific neighbour spoofing ($k = 500$) and at $m = 295$ for area code and prefix spoofing ($k = 10000$). In other words, in this particular case, there is very high probability of blocking one or more new calls when only $m / k \approx 3$% of the available numbers have been blocked! Agreement between these two scenarios is not a coincidence, as some quick algebra on the exponential solution shows: $m/k=-\ln(1 – P_B) / n$. My own phone, having 455 blocked numbers, would seem to be well-positioned against the better neighbour-spoofing attacks (green and orange curves), but not the weakly targeted calls (pink curve). Luckily I don’t tend to answer the weakly-targeted calls anyway so this isn’t too much of a concern.

Phonomenal

Of course, our analysis here is still limited: it fails to account for many spammers simultaneously using multiple different spoofing schemes, nor does it quantify the distribution of the number of calls blocked. Nevertheless, this is another example in a seemingly endless succession showing the remarkable power of mathematics to provide meaningful insights into nontrivial, real world problems. In his essay The unreasonable effectiveness of mathematics in the natural sciences, Eugene Wigner quipped that this insight mathematics offers borders on the mysterious—I have long been inclined to agree.

Michael is at Washington University in St Louis, Missouri, USA. Between his work in cancer bioinformatics and theoretical mechanics, he spends time playing the accordion.

More from Chalkdust