I find subway maps and their connections fascinating. When I first saw the Tokyo subway map, I played a little mental game. I picked a random station, and followed a series of random routes, and each time I would end up at a different station. So, I thought to myself: if I was a passenger at a random station, and all I had was a list of stations and routes, could I reach any station I wanted without a map? How would I know if I could?
Subway maps and graph connectivity
This problem turns out to be a basic question in graph theory: the problem of graph connectivity. A graph is collections of points called vertices connected by line segments joining them called edges. It asks, given some graph (I shall only look at finite graphs), is there an algorithm to determine whether or not it is connected (in other words, if every pair of vertices is joined by a sequence of edges). If we look back at the subway problem, the parallels between it and the graph connectivity problem are immediately clear. At its essence, a subway map is a graph, made of stations (its vertices) and tracks between stations (its edges).
There is a wide variety of algorithms to solve the graph connectivity problem: the most common being depth first search (DFS) and breadth first search (BFS). These algorithms both pick a starting vertex, and search all vertices which can be reached from that initial vertex; the difference between the two algorithms is the order in which the vertices are visited. DFS searches as far as possible along a path before backtracking and repeating along a new path, while BFS searches vertices in the order of how far they are from the starting vertex.
While doing these searches, we can simply check after each step whether we have reached every vertex, and if we have, then the graph must be connected. Of course, this seems trivial in the case of the graph above because it is so simple. But what about something slightly more complicated like the one on the right? It becomes harder to see at a glance, but it is still fairly simple. But what if we took a giant graph like a social network, with millions of people and billions of connections? At that point, only a computer could handle it! And remember, the computer does not see the graph like you or I do—all it has is a list of vertices and edges with no visual representation. That is why having algorithms like BFS and DFS is important.
However, breadth first search and depth first search are both deterministic. What if we looked at something more probabilistic? Let us take a look at random walks on graphs, a theoretical model with a surprising variety of applications, and apply it to the problem of graph connectivity.
Random walks appear everywhere
Random walks, as you might expect, involve ‘walking’ around a graph, starting at a vertex and randomly moving to one of its neighbours. These kinds of random walks appear all over the place.
Physics: When you take a random walk on an infinite grid-like graph, the motion you get appears jerky. If you contract the edges so the vertices get closer together, the limiting motion is similar to Brownian motion, so it can be used to model the motion of small particles suspended in water.
Epidemiology: You can use random walks to model the spread of infectious diseases. If you think of each vertex as representing a community, and you have an traveller who visits the communities according to a random walk, you can model how the traveller infects others or gets infected as time goes on. Different arrangements of communities (ie different graphs) can create different infection rates and behaviours.
Computer science: Consider a huge network like the world wide web. A graph of that size is unmanageable, but by taking a random walk, you can get a random set of connected vertices that you can use as a sample to estimate characteristics of the web.
Economics: Burton G Malkiel is a proponent of the financial theory called the random walk hypothesis, which states that the stock market and its prices follow a random walk and are inherently unpredictable. However, many other economists have refuted this theory.
Art: Quantum Cloud is a piece by British sculptor Antony Gormley, on display in London. It was designed using a random walk extending outward from points on a figure of Gormley’s body, and stands at 30 metres high.
If random walks are so omnipresent in our world, then clearly they deserve a deeper look. Let us start by defining random walks on graphs more formally.
Random walks and connectivity
A random walk on a graph follows four simple steps: 1) select a starting vertex; 2) choose one of this vertex’s neighbours at random; 3) step to that neighbouring vertex; 4) repeat steps (2) and (3) for as long as you want to.
These random walks are actually a form of Markov chain (for more on these, out Chalkdust issue 12), and the probability of moving from vertex $v$ to any particular neighbouring vertex, is one over the number of neighbours $v$ has.
There are a few properties of these Markov chains that we want to understand. First, given a pair of vertices, how many random steps do you need to take to get from one vertex to an other? Secondly, given a starting vertex, how many steps do you need to take to reach every other vertex?
Because the walk is random, the exact answers to these questions change each time we repeat the random walk, so we can only look at the average number of steps (which is more accurately called the expected number of steps). We define the expected hitting time $H(i \rightarrow j)$ to be the expected number of steps it takes to hit vertex $j$ for the first time starting from vertex $i$. We also define the expected cover time $C(i)$ to be the expected number of steps starting from vertex $i$ to reach every other vertex.
But what do these definitions have to do with graph connectivity? Let us say we take a random walk across a graph. If the graph was disconnected, the walk would never reach some vertices and we would say the expected cover time is infinite. Conversely, if it is connected, the expected cover time will be finite, and any random walk will eventually reach every vertex (with probability 1). Understanding expected hitting times will help us bound the expected cover time for a graph.
Expected cover time and spanning trees
The problem is, there are actually two reasons why our random walk might not have reached a vertex: either it could have, but didn’t; or it was not able to in the first place. The ‘not able’ case is the one we are looking for, because then the graph is disconnected. So, we need to remove the ‘could but didn’t’ case (or at least minimise the likelihood of it happening). This is where the expected cover time comes in. If we find an upper bound for expected cover time, and use that as the number of steps in our random walk, it decreases the likelihood of the ‘could but didn’t’ case happening.
So how do we find an upper bound for the expected cover time? The full proof is complicated, so let us consider an example. When we work with cover time in this section, we will assume the graph is connected since the cover time is infinite when the graph is disconnected.
Any connected graph has a minimal spanning tree, which is a smallest subgraph containing every vertex, and no cycles. It is called a tree because it can branch, but like botanical trees, it never contains a loop. An important fact about minimal spanning trees is that they always have $V – 1$ edges, where $V$ is the number of vertices in the original graph. We can see this for some spanning trees of our example graph:
We will just work with the first spanning tree. There is a (deterministic) walk restricted to this spanning tree which visits every vertex and travels across each of the green edges of this tree twice (once in each direction). We will call the vertices visited in this walk, in order, $i_0, i_1, \ldots, i_{2V – 2}$ and the first and last vertices are the same, (ie $i_0 = i_{2V – 2}$). In the example on the left above the sequence could be 1, 2, 1, 3, 1, 4, 1.
In a 1979 paper titled Random Walks, Universal Traversal Sequences, and the Complexity of Maze Problems, the authors point out that the expected time (number of steps) it takes for our random walk to cover the graph starting from $i_0$ cannot be greater than the expected time it takes for the walk to visit each $i_0, i_1, \ldots, i_{2V – 2}$ in order (although it may visit other vertices in between as it is a random walk), since by definition this list already contains every vertex.
We can express the time it takes for the graph to cover $i_0, i_1, \ldots, i_{2V – 2}$ as the expected time for the walk to go from $i_0$ to $i_1$, plus the time to go from $i_1$ to $i_2$, and so on. So, we have
\[ C(i_0) \le H(i_0 \rightarrow i_1) + H(i_1 \rightarrow i_2) + \cdots + H(i_{2V – 3} \rightarrow i_{2V – 2}). \]
The authors of that paper were further able to show that if $i$ and $j$ are adjacent vertices, $H(i \rightarrow j) + H(j\rightarrow i) \le 2E$, where $E$ is the number of edges in the whole graph. We will use this fact to bound expected cover time, but first let us see why it is true. Proving this in general requires some advanced probability theory, so we will try to understand it in a simple case. So, let us look at another example graph:
If the walk reaches vertex 1 at some point, it must move to vertex 2 next, so $H(1 \rightarrow 2) = 1$. If the walk reaches vertex 2, there are an infinite number of paths the walk can take, only to reach vertex $1$ at the very end:
A step from vertex $2$ to any another vertex has a $1/2$ chance of occurring, whereas a step from vertex $1$ or $3$ to vertex $2$ happens with probability 1. Thus, the path $2, 1$ has a $1/2$ chance of occurring based on the random walk. This also means the path $2, 3, 2, 1$ has a $1/4$ chance of occurring, and so on. Thus, the probability of a path occurring is $2^{-(N + 1) / 2}$ where $N$ is the length of the path. I leave it to you to check that that the sum below converges to 3:
\[H(2 \rightarrow 1) = \left(1 \cdot \frac{1}{2}\right) + \left(3 \cdot \frac{1}{4}\right) + \left(5 \cdot \frac{1}{8}\right) + \cdots = 3.\]
Using this fact, we see that in this case $H(1 \rightarrow 2) + H(2 \rightarrow 1) = 4 = 2E$.
We can now return to the main goal (bounding expected cover time) and the inequality we concocted above,
\[C(i_0) \le H(i_0 \rightarrow i_1) + H(i_1 \rightarrow i_2) + \cdots + H(i_{2V – 3} \rightarrow i_{2V – 2}).\]
There are $2(V – 1)$ terms on the right side, because a minimal spanning tree has $V – 1$ edges and we are traversing each edge twice. We can match up the terms on the right hand side into $V – 1$ pairs of the form $H(i \rightarrow j) + H(j \rightarrow i)$ where $i$ and $j$ are adjacent. We can then substitute $2E$ for those terms, giving us a bound on the expected cover time $C(i_0) \le 2E(V – 1)$.
A possible algorithm
Now that we understand the expected cover time of connected graphs, we can apply this generally to all graphs (connected or disconnected). We begin by taking a random walk across the graph for $4E(V – 1)$ steps and ticking off each vertex as we reach it. The number of steps may seem odd as we found the upper bound to be $2E(V – 1)$, but by using $4E(V – 1)$, we can ensure there is a less than $1/2$ chance of failing to search the graph fully.
But even then, less than $1/2$ is not that great. What we can do, however, is run the random walk $10$ times. Each time, we tick off each vertex we reach. If every vertex is accounted for, we return true (the graph is connected). Otherwise, we return false (the graph is disconnected). If we generate a true in any of the $10$ iterations, we stop the loop immediately and return a final value of true. Otherwise, we let the loop keep running. If the loop finishes returning false each time, we return a final value of false.
By doing this, we give the walk several chances to try and search the graph properly, which should increase the accuracy of the algorithm. However, it would be nice to know how accurate the algorithm actually is.
It turns out that we actually need two values to understand the algorithm: the first tells us how accurate the algorithm is when it returns true (sometimes called the sensitivity), and the second which tells us how accurate the algorithm is when it returns false (sometimes called the specificity). We can calculate these values using Bayes’ theorem (see Chalkdust issue 13), because the accuracies of our algorithm are $P(\textrm{graph is connected $|$ true})$ and $P(\textrm{graph is disconnected $|$ false})$.
We define $p$ to be the proportion of input graphs which are actually connected. We also define $\lambda = P(\textrm{true $|$ the graph is connected})$. Because there is a less than $1/2$ chance the algorithm does not search the graph properly during one random walk, the probability of it searching it properly at least one of the $10$ times we run the random walk should be at least $1 – (1/2)^{10}$. The first calculation is to figure out the accuracy of the ‘algorithm returns true’ case (sensitivity), but this is clearly just $1$ because of how the algorithm works. The ‘algorithm returns false’ case (specificity) requires Bayes’ theorem. However, the actual calculation is fairly simple:
$$P(\textrm{disconnected} \mid \textrm{false}) = \frac{P(\textrm{false} \mid \textrm{disconnected})P(\textrm{disconnected})}{P(\textrm{false})}= \frac{1 – p}{1 – p\lambda}. $$
This is because
$$ P(\textrm{false})=P(\textrm{false}\mid\textrm{disconnected})P(\textrm{disconnected}) +P(\textrm{false}\mid\textrm{connected})P(\textrm{connected})$$ $$ =1(1-p)+(1-\lambda)p=1-p\lambda. $$
When we graph the values we calculated above against $p$, we get the picture on the right. For the most part, it seems like our algorithm is very accurate. The sharp drop off in specificity as $p$ approaches 1 is to be expected, since if a graph is very likely to be connected, then we should be wary about trusting an output of false from our algorithm.
But how good is our algorithm?
So far, we have taken a good look at the world of random walks and graph connectivity, and designed a connectivity algorithm using random walks which is actually quite accurate! It seems like we are done, right? Not quite—we still have one more question to answer. We already know there are deterministic algorithms like breadth first search and depth first search which can also be used to solve graph connectivity. So, how good is our algorithm relative to those?
Our metric is computer time. It turns out that BFS and DFS are actually faster than our algorithm, as the time they take to execute are approximately proportional to $V + E$, whereas our algorithm’s time is approximately proportional to $VE$. There is room to improve this, as a 2007 paper (Many Random Walks Are Faster Than One) showed that using multiple random walks in parallel (rather than in series as we did) could decrease the cover time (making our algorithm faster).
We have already seen that random walks arise everywhere from epidemiology to computer science to art. Because these objects are ubiquitous, studying them becomes very important to understanding other fields. The specific problem of graph connectivity is a core question, not just in graph theory, but also in complexity theory (the study of algorithms in general). In fact, one of the authors of the Random walks paper (which studied cover time and hitting time), László Lovász, won the prestigious Abel prize this year for his work (including this paper) on graph theory and complexity theory!
So although our algorithm may not be as fast as BFS or DFS, designing it provided us an insight into the surprisingly relevant world of graph connectivity, random walks, and complexity.
I wish to thank Burton Newman, who is my teacher at Squares and Cubes. He introduced me to this weird and interesting problem, and guided me as I explored it.