The secret life of words

Barry Pawlik builds words and sentences from doubles

post

Let’s begin gently, with the unassuming question: What is a square? The most precise answer is the favourite one of professional physicists: the universal It depends. On the context, of course.

For many people, the first thing that comes to mind is a geometric figure, like this one: $\square$. A perfect answer is also the number 529 (since we’re in the 23rd issue of Chalkdust, after all), as well as its more recognisable friends: 9 and 25. But not everyone realises that squares can be not only figures, numbers and even logical symbols, but also… words!

The arena of this article is combinatorics on words: a discipline at the intersection of discrete mathematics and theoretical computer science. Combinatorics on words studies patterns and repetitions in sequences of symbols, treating words as purely mathematical objects rather than carriers of meaning. Below we present a brief guide to this world.

We start with a set, preferably nonempty and finite. For the sake of intuition, we will call this set an alphabet, and its elements letters. For a binary alphabet (ie consisting of exactly two letters) we usually take it to be $\{0,1\}$ or $\{a,b\}$ and for $k$-element alphabets we take $\{0,1,\ldots,k-1\}$. We can also consider the whole $26$-letter Latin alphabet $\{a,b,c,\ldots,x,y,z\}$, which is most often used to give fancy examples from natural languages that illustrate the relevant definitions. Finally, by a word we mean a (finite or infinite) sequence of elements of the alphabet. Instead of writing $(w_1,w_2,w_3,\ldots, w_n)$ we will write $w_1w_2w_3\cdots w_n$ so, for example, the word (W,O,R,D) can be written as WORD.

A square is a word $XX$, where $X$ is a word. For example, for $X=\,\,$0100 we get the square $XX=\,\,$01000100, as well as for $X=\,\,$HOTS we get $XX=\,\,$HOTSHOTS. Note that each square has even length. Moreover, the number of occurrences of each letter in a square is even—words with this property are sometimes called tangrams (since they consist of letters that can be rearranged so as to form a square, for example, the word APPEASES is a tangram as its letters can be rearranged to get the square PEASPEAS).

Thus, by a square we mean a word of even length such that, when we colour the letters in the first half in one colour and those in the second half in a different colour, both obtained monochromatic words—read from left to right—are identical: COUSCOUS, CANCAN. First, let us look at something ‘mathsy’ about squares.

We say that a word $W$ is square-free if there do not exist words $P$, $X$, $S$ ($X$ nonempty) such that $W=PXXS$. In other words, a square-free word does not contain a subword consisting of consecutive letters that is a square. Consider the words PHILOLOGY and MATHEMATICS. The former is not square-free, since it contains the square LOLO. The latter, on the other hand, is square-free.

How long can a binary square-free word be? Without loss of generality, if we start with the letter 0, then the second letter must be 1 (to avoid the square 00), and the third must again be 0 (to avoid 011). And what about the fourth letter? We obtain either 0100 or 0101 which are bad in either case. Every binary word of length at least four contains a square!

All right, and how long can a square-free word over a ternary (three-letter) alphabet be? Here comes a surprise! In 1906, the Norwegian mathematician Axel Thue proved a theorem that is today regarded as the starting point of combinatorics on words. It states that for every natural number $N$ there exists a ternary square-free word of length $N$. Let’s look at the construction of the theorem’s proof.

We start with the following sequence of binary (!) words: we begin with 0, and create each subsequent element of the sequence by appending to the previous one its complement (that is, the word in which every 0 is replaced by 1 and every 1 is replaced by 0; for example, the complement of 0110 is 1001). Thus, the first few elements of our sequence are:

0, 01, 0110, 01101001, 0110100110010110, ...

Each word in the sequence is a prefix of the next one, so the sequence stabilises position by position. In this sense, it converges to an infinite word: every finite prefix eventually stops changing. Thus we can say that our sequence converges to the infinite word

0110100110010110100101100110100110010100110100101101001100101101001011001101...

which, by the way, is called the Prouhet–Thue–Morse word (no, not the Morse of ··· --- ··· , just his distant cousin). Let us then take the Prouhet–Thue–Morse word and, for each pair of consecutive 0s, count how many 1s occur between them and put those numbers into a sequence:

This procedure produces a new infinite word

21020121012021020120210121020121012021012102012021020121012021020120210121020...

The reason this construction works is that the Prouhet–Thue–Morse word is overlap-free, meaning it contains no factor of the form $aXaXa$ ($a$: letter, $X$: word). Now our ‘gap encoding’ produces a ternary word $A$ (here written with symbols 0,1,2), and crucially, there is a fixed way to expand each symbol of $A$ into a binary block so that concatenating these blocks recovers the original Thue–Morse word. If $A$ contained a square $CC$, then its expansion would force a repeated, perfectly aligned pattern inside the Thue–Morse word, which in turn creates an overlap $aXaXa$. This contradicts overlap-freeness of Thue–Morse, so $A$ must be square-free. Therefore every prefix of $A$ is square-free, giving ternary square-free words of any length. A key point of Thue’s theorem is that the word obtained in this way is square-free. Therefore every prefix of it is also square-free, and we can build a square-free word of any desired length. In particular, for any alphabet with at least three letters there exist square-free words of arbitrary length.

Shuffle squares

Now, let us introduce the shuffle squares. Figuratively speaking, a shuffle square is a word obtained by interleaving the letters of two copies of the same word in a way that preserves the left-to-right order of letters in each copy. The precise definition says that a word $W$ is a shuffle square if, for some natural number $n$, we have $$W=w_1w_2\cdots w_{2n},$$ where $w_1$ to $w_{2n}$ are letters, and there exist disjoint sets of indices $I=\{i_1,\,i_2,\,\ldots,\,i_n\}$ and $J=\{j_1,\, j_2,\, \ldots,\, j_n\}$ such that $i_1 < i_2 < \cdots < i_n$ and $j_1 < j_2 < \cdots < j_n$, and for every $1\leq s\leq n$ we have $w_{i_s}=w_{j_s}$. Given that, the exemplary shuffle square is the French word TUTEURER since we can choose $I=\{1,2,4,6\}$ and $J=\{3,5,7,8\}$.

One convenient way to think about these words is as follows: a given word is a shuffle square if there is a colouring of its letters such that each letter is coloured in one of two fixed colours (as we did earlier) in such a way that monochromatic subwords (read from left to right) are the same. So now, why is TUTEURER a shuffle square? Well, because TUTEURER. Note that the word consists of two identical copies, TUER and TUER, which have been shuffled together.

The French word tuteurer means to stake or support climbing plants

Shuffle squares are fascinating and mysterious objects. Checking whether a given tangram (do you remember, dear reader, the definition of a tangram?) is a shuffle square turns out to be computationally hard: while a solution can be verified efficiently, no efficient general algorithm is known for solving the problem. More precisely, the decision problem is NP-complete, even in the binary case. On the other hand, we now know—thanks to a recent result from December 2025 by Xiaoyu He and Logan Post (from the Georgia Institute of Technology) that the ratio of the number of binary shuffle squares of length $2n$ to all tangrams of length $2n$ tends to $1$ as $n\to\infty$. This can be stated precisely as every sufficiently long binary tangram is almost surely a shuffle square.

In the spirit of Thue, one can ask what the smallest alphabet size is over which shuffle squares can be avoided. It has to be greater than three, since there are exactly 237 finite shuffle-square-free words over the ternary alphabet—a result obtained by computer search. To get a feeling for how constrained the situation is, you can try listing all shuffle-square-free binary words by hand. The first upper bound was provided by James Currie in 2014, who proved that for sufficiently large alphabets this is indeed possible, giving a bound of $10^{40}$. After several papers by various researchers, the currently best known result is due to Laurent Bulteau, Vincent Jugé and Stéphane Vialette, who proved in 2023 that shuffle squares are avoidable over an alphabet consisting of six letters. They conjectured that four letters are optimal, meaning that shuffle squares can be avoided over some four-letter alphabet, but not over any smaller one.

The sentence hunt

Let us now turn to the occurrence of shuffle squares in natural languages! Let us agree that we will not worry about upper and lowercase letters, nor about spaces and punctuation marks. We also focus on those shuffle squares that are not squares (so words like hotshots are out!). Our further considerations will revolve around the following challenge: to construct the longest possible expression in a natural language that is a shuffle square. The game was started by the Polish mathematician Andrzej Ruciński in July 2023, who proposed the Polish expression:

Nina, mima mama.

Its English translation is Nina, the mime’s mum (after all, every person, including every mime or other performer, has a mum; here we are considering one whose mother’s name was Nina). Note that this is indeed a shuffle square: NINAMIMAMAMA. Interestingly, in June 2025 the Belgian mathematician Savinien Kreczman gave an English anagram of Ruciński’s shuffle square, which is itself a shuffle square:

Man, I am an imam.

Indeed, MANIAMANIMAM (let us recall that two words are called anagrams if some reordering of the letters of the first word gives the second one—for example, ELEVENPLUSTWO and TWELVEPLUSONE). The Ruciński sentence has length 12. In August 2024, Jan Szejko, another researcher from Poland, constructed shuffle squares of lengths 20 and 22 in his native language:

czaszeczka szczekacza and dziedziczenie czekanika.

(The first can be translated as the barker’s little skull, and the second as inheritance of the little pickaxe). Interestingly, Ruciński and Szejko used completely different methods to find their examples. The former relied solely on the power of his own mind, while the latter used his programming skills. At present, the record for the length of a shuffle square in a natural language belongs to Kreczman. At the turn of June and July 2025, he constructed several shuffle squares in English and French, the longest of which is the 48-letter

In nine innings, Eagles battle Boston to soar near Wien win.

Kreczman—like Ruciński—did not use a computer. He even remarked that creating long shuffle squares is easier than creating short ones: it suffices, for a while, to ‘roughly’ keep the letter counts balanced and, above all, to choose suitable words at the end of the expression that allow one to ‘close’ the shuffle square. Around the same time, Lucas Mol—a mathematician from Canada—was searching shuffle squares in natural languages. Mol focused on finding short shuffle squares in English and he managed to obtain two single-word examples: prepress and sestet, but (using the computational power of his computer) he concentrated primarily on searching for two-word shuffle squares. He found as many as 9659 (!) expressions. Here are some of them:

  • assassins sin
  • easiest assists
  • incontinent content
  • papal salsa
  • bad bandana
  • einstein stinging
  • itchiest cheese
  • prepare erasers
  • bonobo nonsense
  • folks folksinging
  • ominous minus
  • smallest mallet
  • cave carver
  • heather eater
  • ongoing onion
  • ungenuine genie
  • connect sonnets
  • hot hosts
  • owlet towelette
  • woke worker

The longest sentence discovered by Mol is the 22-letter acquaintance quaintness. If you, dear reader, enjoy this game and happen to discover interesting shuffle squares in any natural language, please be sure to let me know! Who knows, perhaps one day this article will need an update. Maybe someday someone will find a long sentence in a natural language which, when shuffled with itself, yields another sentence in that language (the starting point in this challenge is Lucas Mol’s we weeded)—now that would be something!

A bonus problem

To finish, here is a little puzzle from Savinien Kreczman. One of the following sentences is a shuffle square, and the other is not. Can you guess which is which?

  • He heard a ruckus: a duck hushed a red ara.
  • He heard a ruckus: a duck used a red rattle at the lead head.

Barry is an assistant professor at the Silesian University of Technology in Gliwice, Poland. In his spare time, he uses words to explain maths, and at work, he uses maths to explain words.

More from Chalkdust