In the 1930s, the American psychologist and linguist George Kingsley Zipf developed a model to describe how word frequencies are distributed in texts. This is regarded as the beginning of so-called quantitative linguistics – yes, mathematicians can have fun with words too. Zipf’s law derives from this model and, having lost none of its relevance, still provides conversation fodder for linguists today. The so-called Zipf distribution also plays a significant role in other disciplines, such as demography – Zipf did a pretty good job here.
Zipf’s Law: what’s it all about?
Zipf realised that some words come up significantly more often in a text than others. For example, in most languages it’s usually the case that the longer a word is, the less often it appears.
Want an example? This text consists of 598 words. The word “the” comes up 42 times, whereas the long words “fluctuations” and “quantitative” appear only once (well, twice, since we’ve used them here in our example).
To see this, you sort all the words according to how frequently they come up and give each one a rank. The word that appears most frequently is ranked first, the next one second, and so on. The probability of a word turning up is inversely proportional to its position in the ranking. Put as a simple formula: the frequency of the word at rank r is roughly the frequency of the top-ranked word divided by r. The resulting Zipf distribution thus describes how the word ranked second in the list turns up in the text corpus on average only half as often as the one ranked first. The frequency of the word in third place is only about a third of the top-ranking one, and so on. Fascinating, right? (At least for all the mathematicians among us.)
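If you'd like to try the ranking yourself, here is a minimal Python sketch of the idea. It is only an illustration, not Zipf's own procedure: the function names `rank_frequencies` and `zipf_expected`, the very simple way of splitting text into words, and the tiny sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def rank_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Simplifying assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_expected(top_count, rank):
    """Count that the idealised 1/rank rule predicts for a given rank."""
    return top_count / rank


# Hypothetical sample sentence, far too short to be a real corpus.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequencies(sample)
top_count = table[0][2]

for rank, word, count in table[:3]:
    print(rank, word, count, round(zipf_expected(top_count, rank), 2))
```

On a text this short the numbers say very little. The pattern generally only becomes visible on a corpus of many thousands of words, and even then the match is approximate rather than exact.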
The “false” Zipfian Law
The so-called “false” Zipfian Law has to do with the observation that linguistic economy is incredibly important when using language. Every speaker looks for a compromise between two things while speaking: saying as much as possible (content-wise) while expending as little energy as possible. On the one hand, speakers want to convey information as understandably as possible – which often leads to detailed accounts. On the other, they naturally try not to spend too much mental and physical energy on speaking. Zipf constructed a whole array of linguistic hypotheses and formulated them as laws – this one, however, he didn’t. Nonetheless, it is still known under the name of the “‘false’ Zipfian Law”.
Zipf distribution and ranking cities
Zipf’s impact goes much further than linguistics, and Zipf’s law can be applied to many areas. If you look at cities in the USA, for example, their populations exhibit a clear Zipf distribution. The same goes for the relative sizes of German cities. In 1999, Berlin stood in first place with around 3,341,000 inhabitants. Hamburg followed in second place with 1,705,000 inhabitants, so around half the population of Berlin. Munich had 1,195,000 inhabitants, which corresponds to about a third of the top-ranking capital city. In fourth and fifth place were Cologne with 963,000 and Frankfurt with 644,000 inhabitants, around one quarter and one fifth of Berlin’s population respectively.
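The arithmetic behind the city example can be checked with a few lines of Python. This is just a sketch using the 1999 figures quoted above; the helper name `compare_with_zipf` is made up for this illustration.

```python
# Population figures for 1999 as quoted in the text.
populations = [
    ("Berlin", 3_341_000),
    ("Hamburg", 1_705_000),
    ("Munich", 1_195_000),
    ("Cologne", 963_000),
    ("Frankfurt", 644_000),
]


def compare_with_zipf(cities):
    """Return (rank, name, observed share, predicted share) per city.

    The observed share is the city's population divided by that of the
    largest city; the predicted share is the idealised value 1 / rank.
    """
    largest = cities[0][1]
    rows = []
    for rank, (name, population) in enumerate(cities, start=1):
        observed = population / largest
        predicted = 1 / rank
        rows.append((rank, name, round(observed, 3), round(predicted, 3)))
    return rows


city_rows = compare_with_zipf(populations)
for row in city_rows:
    print(row)
```

With these figures, Hamburg comes out at about 0.51 of Berlin's population (against a predicted 0.5), Munich at about 0.36 (against 0.33), Cologne at about 0.29 (against 0.25) and Frankfurt at about 0.19 (against 0.2). So the fit is close, but it's a rule of thumb rather than an exact law.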
Where else does the Zipf distribution have an effect?
On top of that, there are numerous other biological, physical and social phenomena that follow Zipf’s Law and form a Zipf distribution. These include the frequency distribution of forest fires and earthquakes, fluctuations in financial markets and also the size of companies. So far, however, the mechanisms underlying Zipf’s Law have only been partly explained.
So we’ll all get down to counting words now, right? 😉