1. Problem and history

The problem is to find a function or a distribution for the phoneme frequencies of a text or of a corpus. Sometimes letters or even sounds are counted instead, which is fully justified. In the same way one could count, e.g., the syllables of the Japanese katakana or hiragana. The number of studies is enormous; some of them give absolute frequencies, others merely proportions.

The counting began in the 19th century (Förstemann 1846, 1852; Meyer 1869; Bourdon 1892) and developed quickly on practical grounds: stenographers, printers, constructors of typewriters, decoders, etc. urgently needed letter frequencies for their own purposes. Förstemann and Meyer pursued comparative aims, e.g. the problem of the relation between consonants and vowels in the examined languages (Old Indian, Greek, Latin and Gothic) and its impact on the development of languages.

The first to consider phonemes from the frequency point of view and to set up hypotheses was G.K. Zipf (1929, 1935, 1949). Afterwards a great number of works appeared using phoneme frequencies for finding other interrelations. The first empirical model, namely the geometric (and the right truncated geometric) distribution, was proposed by Sigurd (1968). Good (1969) brought a partial-sums distribution (Whitworth distribution) whose modelling was revived in word length (→) research. Tuldava (1971/1995) considered different possibilities; Altmann (1993) used the synergetic way of modelling and derived a special function for this purpose. Martindale, Gusein-Zade, McKenzie and Borodovsky (1996) compared several curves (functions) and many data sets in order to find the “best” model. Altmann and Lehfeldt (1980) and Zörnig and Altmann (1983, 1984) developed hypotheses on the entropy (→) and the repeat rate (→) of phonemes; Kubáček (1994) derived a formula for the sample size needed to obtain reliable phoneme counts. Naranan and Balasubrahmanyan (1998, 2000) developed a theory from which different curves for phoneme frequencies are derivable.

Not all arguments holding for word frequencies are valid in this domain. The modelling has been performed in two ways: (i) a continuous curve has been fitted to the proportions of phonemes, (ii) a discrete distribution has been fitted. It can be shown that continuous curves have their analogues in discrete distributions.

2. Hypothesis

The ranked frequencies of phonemes follow a regular probability function or a regular monotone decreasing function.

The result depends on whether one considers the ranked frequencies as a discrete distribution (normalized) or merely a regular series approximated by a continuous function (not normalized).
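The two readings of the hypothesis can be illustrated with a minimal Python sketch. The input string is a made-up toy standing in for a phoneme transcription, not data from any of the studies cited, and the function names are chosen here for illustration only.

```python
from collections import Counter

def ranked_frequencies(symbols):
    """Return absolute counts sorted in decreasing order (rank 1 = most frequent)."""
    counts = Counter(symbols)
    return sorted(counts.values(), reverse=True)

def normalize(ranked):
    """Turn ranked counts into proportions that sum to 1."""
    total = sum(ranked)
    return [f / total for f in ranked]

# Toy input (hypothetical): each character is treated as one unit to be counted.
toy = list("alohaalohakakou")
ranked = ranked_frequencies(toy)   # regular series, not normalized
props = normalize(ranked)          # discrete distribution, normalized
```

A continuous function would be fitted to either list as it stands, whereas a discrete distribution is fitted to the normalized version.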

3. Derivation

The formulas used up to now can be derived from different approaches.

3.1. Tuldava's approach (1988)

This approach can be represented by the simple differential equation

(1)  y' = \frac{b}{x}

stating that the change of frequency (y) is inversely proportional to the rank (x) and yielding

(2) y = a + b \ln x ,

where b is negative. This curve is frequently used in other domains, too (cf. also Martindale et al. 1996; Laherrère, Sornette 1998).
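Since (2) is linear in ln x, its parameters can be estimated by ordinary least squares. The sketch below is one possible way to do this and is not necessarily the procedure used in the works cited; the demo values are generated from assumed parameters a = 0.30 and b = -0.12 rather than taken from real counts.

```python
import math

def fit_log_curve(y):
    """Least-squares fit of y_x = a + b*ln(x) for ranks x = 1..len(y)."""
    n = len(y)
    t = [math.log(x) for x in range(1, n + 1)]   # regressor: ln(rank)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b = sxy / sxx                # slope, expected to be negative
    a = y_mean - b * t_mean      # intercept = fitted value at rank 1
    return a, b

# Hypothetical proportions lying exactly on the curve with a = 0.30, b = -0.12.
demo = [0.30 - 0.12 * math.log(x) for x in range(1, 9)]
a, b = fit_log_curve(demo)
```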

3.2. Derivations related to the unified theory (→) are

(a) Zipf's law (zeta function)

When formula (2) of the unified theory (→) is used with a_0 = a_2 = a_3 = ... = 0, a_1 = -b, this yields

(3) \frac{dy}{y} = -\frac{b}{x}dx

stating that the relative rate of change of frequency is proportional to the relative rate of change of rank, resulting in

(4) y = Ax^{-b} .

This is, perhaps, the most disseminated formula in linguistics representing the power law.
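The step from (3) to (4) is a direct integration; writing the integration constant as ln A (a notational choice made here) gives:

```latex
\int \frac{dy}{y} = -b \int \frac{dx}{x}
\;\Longrightarrow\;
\ln y = \ln A - b \ln x
\;\Longrightarrow\;
y = A x^{-b}
```

The middle expression also shows that the power law is a straight line with slope -b on doubly logarithmic axes.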

(b) Yule's species/genera function (1924)

When formula (2) of the unified theory is used with a_0 = c, a_1 = -b, a_2 = a_3 = ... = 0, this yields

(5) \frac{dy}{y} = \left( c-\frac{b}{x} \right)dx

resulting in

(6) y = ae^{cx}x^{-b} = ad^x x^{-b}, \quad d = e^c .

(c) Naranan and Balasubrahmanyan's (1992a,b, 2000) function

When formula (2) of the unified theory is used with a_0 = 0, a_3 = a_4 = ... = 0, this yields

(7) \frac{dy}{y} = \left( -\frac{a_1}{x} + \frac{a_2}{x^2} \right)dx

resulting in

(8) y= Ce^{-a_2/x}x^{-a_1},

derived by the authors in a different way.
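That the functions (4), (6) and (8) solve the differential equations (3), (5) and (7) can be checked numerically. The following Python sketch compares a finite-difference estimate of d(ln y)/dx with the corresponding right-hand side; all parameter values are hypothetical and chosen only for the check.

```python
import math

def zipf(x, A, b):
    """Formula (4)."""
    return A * x ** (-b)

def yule(x, a, b, c):
    """Formula (6)."""
    return a * math.exp(c * x) * x ** (-b)

def naranan(x, C, a1, a2):
    """Formula (8)."""
    return C * math.exp(-a2 / x) * x ** (-a1)

def log_derivative(f, x, h=1e-6):
    """Central-difference estimate of d(ln f)/dx at x."""
    return (math.log(f(x + h)) - math.log(f(x - h))) / (2 * h)

# Hypothetical parameter values and evaluation point.
A, b, c, a1, a2 = 0.25, 0.8, -0.05, 0.6, 0.4
x = 3.0

lhs_zipf = log_derivative(lambda t: zipf(t, A, b), x)
lhs_yule = log_derivative(lambda t: yule(t, A, b, c), x)
lhs_nb = log_derivative(lambda t: naranan(t, A, a1, a2), x)

rhs_zipf = -b / x                   # right-hand side of (3)
rhs_yule = c - b / x                # right-hand side of (5)
rhs_nb = -a1 / x + a2 / x ** 2      # right-hand side of (7)
```

Each `lhs_*` value should agree with its `rhs_*` counterpart up to the discretization error of the finite difference.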

(d) Altmann's ranking function (1993)

Using formula (11) of the unified theory, which can be written as

(9) y_x = \left( 1 + a_0 + \frac{a_1}{(x-b_1)^{c_1}} + \frac{a_2}{(x-b_2)^{c_2}} \right)y_{x-1},

and setting a_i = 0 (i = 0, 2, 3, ...) and c_1 = 1 yields

(10) y_x =  \left( 1+\frac{a_1}{x-b_1} \right)y_{x-1} .

Upon setting b_1 = -a, a_1 - b_1 = b, this results in

(11) y_x = \frac{\begin{pmatrix} b + x \\ x - 1 \end{pmatrix}}{\begin{pmatrix} a + x \\ x - 1 \end{pmatrix}}y_1, \quad x = 1,2,3,...

This proved to be a very good model for letter distribution in English and German (Best 2005).
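The equivalence of the recurrence (10) and the closed form (11) can be sketched in Python as follows. Binomial coefficients with real upper arguments are evaluated through the gamma function, which is an implementation choice made here; the parameter values are hypothetical and not fitted to any data.

```python
import math

def altmann_recursive(n, a, b, y1):
    """y_x = (x + b)/(x + a) * y_{x-1}: (10) with b_1 = -a and a_1 - b_1 = b."""
    y = [y1]
    for x in range(2, n + 1):
        y.append(y[-1] * (x + b) / (x + a))
    return y

def gen_binom(top, k):
    """Binomial coefficient with real upper argument, via the gamma function."""
    return math.gamma(top + 1) / (math.gamma(k + 1) * math.gamma(top - k + 1))

def altmann_closed(n, a, b, y1):
    """Closed form (11) for ranks x = 1..n."""
    return [y1 * gen_binom(b + x, x - 1) / gen_binom(a + x, x - 1)
            for x in range(1, n + 1)]

# Hypothetical parameters; a > b > 0 makes the sequence decrease with rank.
rec = altmann_recursive(10, a=1.7, b=0.4, y1=0.26)
closed = altmann_closed(10, a=1.7, b=0.4, y1=0.26)
```

Both lists agree up to floating-point rounding. The values are not normalized, in line with the use of (11) as a function rather than a distribution.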

All these formulas can be transformed into distributions by appropriate normalization. Several distributions have been derived directly, namely

(e) Geometric distribution

Sigurd (1968) simply used the 1-displaced geometric distribution. It can be obtained from formula (9) by setting a_i = 0 (i = 1, 2, 3, ...), which yields

(12) y_{x+1} = (1+a_0)y_x .

For -1 < a_0 < 0, 1 + a_0 = q, 1 - q = p, y_x = P_x one obtains the usual (1-displaced) geometric distribution

(13) P_x = pq^{x-1},\quad x = 1,2,3,...

The same result was also proposed by Orlov, Boroda, Nadarejšvili (1982). Treating the relative frequencies directly, one can write (13) as

(14) y_x = y_1 q^{x-1},\quad x=1,2,3,...
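A minimal Python sketch of (13) and of the right truncated variant mentioned in the history section is given below. Truncation is implemented here by renormalizing on x = 1, ..., n, which is the usual construction but is an assumption as far as Sigurd's own procedure is concerned; the value q = 0.8 is hypothetical.

```python
def geometric_pmf(q, n):
    """First n terms of the 1-displaced geometric distribution (13)."""
    p = 1 - q
    return [p * q ** (x - 1) for x in range(1, n + 1)]

def truncated_geometric_pmf(q, n):
    """Right truncated variant: (13) renormalized on x = 1..n."""
    raw = geometric_pmf(q, n)
    total = sum(raw)               # equals 1 - q**n
    return [v / total for v in raw]

# Hypothetical parameter and inventory size.
plain = geometric_pmf(0.8, 10)
truncated = truncated_geometric_pmf(0.8, 10)
```

The untruncated terms sum to 1 - q^n, so the difference between the two variants matters mainly for small inventories.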

(f) Negative hypergeometric distribution

A systematic analysis of Slavic languages and German (Grzybek & Kelih 2003, 2003a,b, 2005, 2006b; Grzybek, Kelih, & Altmann 2004, 2006a,b; Best 2005a,b) showed that the most stable distribution for letter frequencies follows from the unified theory, in the form (9), by setting a_0 = b_2 = 0, c_1 = c_2 = 1, b_1 = K - M + n, a_1 = (K + n - 1)(-K + M + 1)/(-K + M - n), a_2 = (M - 1)(n + 1)/(K - M + n), yielding

(15) P_x = \frac{(M+x-1)(n-x+1)}{x(K-M+n-x)}P_{x-1}

from which

(16) P_x = \frac{\begin{pmatrix} M+x-1 \\ x \end{pmatrix}\begin{pmatrix} K-M+n-x-1 \\ n-x \end{pmatrix}}{\begin{pmatrix} K+n-1 \\ n \end{pmatrix}} = \frac{\begin{pmatrix} -M \\ x \end{pmatrix}\begin{pmatrix} -K+M \\ n-x \end{pmatrix}}{\begin{pmatrix} -K \\ n \end{pmatrix}}, \quad x = 0,1,...,n

which is usually displaced by 1 step to the right.
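The first form of (16) can be evaluated for real-valued K and M by means of log-gamma functions, as in the following Python sketch. The parameter values are hypothetical, the function names are chosen here, and the displacement is implemented simply by relabelling the support from 0, ..., n to 1, ..., n + 1.

```python
import math

def log_gen_binom(top, k):
    """log of binomial(top, k) for real top > k - 1 and integer k >= 0."""
    return math.lgamma(top + 1) - math.lgamma(k + 1) - math.lgamma(top - k + 1)

def neg_hypergeometric_pmf(K, M, n):
    """P_x for x = 0..n from the first form of (16); requires K > M > 0."""
    denom = log_gen_binom(K + n - 1, n)
    return [math.exp(log_gen_binom(M + x - 1, x)
                     + log_gen_binom(K - M + n - x - 1, n - x)
                     - denom)
            for x in range(n + 1)]

# Hypothetical parameters; n + 1 = 13 is the number of ranks after displacement.
probs = neg_hypergeometric_pmf(K=3.2, M=0.9, n=12)
displaced = {x + 1: p for x, p in enumerate(probs)}   # rank -> probability
```

Working on the log scale avoids overflow for larger n; the probabilities sum to 1 up to rounding.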

3.3. Partial-sums distributions (Good 1969)

Good (1969) introduced a new distribution, mentioned in Martindale et al. (1996). It is a so-called partial-sums distribution, namely a “sterred” discrete uniform distribution (cf. Wimmer, Altmann 1999). The provenance of such distributions is shown in the chapter on Word frequency (→). The Good distribution has the form

(17) P_x = \frac{1}{n}\sum_{i=x}^n \frac{1}{i},\quad x=1,2,...,n.
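Formula (17) contains no free parameter apart from the inventory size n, so it can be tabulated directly. The Python sketch below uses exact fractions to show that the probabilities sum to 1; the choice n = 5 is arbitrary.

```python
from fractions import Fraction

def good_pmf(n):
    """P_x = (1/n) * sum_{i=x}^{n} 1/i for x = 1..n, as exact fractions."""
    probs = []
    tail = Fraction(0)
    for x in range(n, 0, -1):      # accumulate the tail sum from i = n down to x
        tail += Fraction(1, x)
        probs.append(tail / n)
    probs.reverse()                # index 0 now holds P_1
    return probs

p = good_pmf(5)
```

Because each P_x is a tail sum of positive terms, the sequence is strictly decreasing in x, as required of a rank-frequency distribution.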

Example. Frequency of phonemes in Hawaiian

Table 1 and Fig. 1 show the fit of the above formulas to the relative frequencies of Hawaiian phonemes. If functions are used, normalization is not necessary.

[Table 1 (image: Tabelle11 PF.jpg)]

Except for the geometric series, all of them yield a good and approximately equal fit in this case. Fig. 1 shows only the fit of (11).

[Figure 1 (image: Grafik11 PF.jpg)]
Fig. 1. Fitting function (11) to Hawaiian phoneme frequencies

4. Authors: U. Strauss, G. Altmann, K.-H. Best

