Word frequency correlates with word commonness. Frequent words are usually also common and, vice versa, low-frequency words are usually not common. The distinction between the notions of ‘commonness’ and ‘frequency’ is useful for understanding what kind of information frequency lists contain. For instance, while kägu 'cuckoo' is certainly a common word in Estonian, it is not particularly frequent in fiction and journalistic texts. Therefore, it does not exceed the threshold for inclusion in the present frequency lists. Conversely, high frequency in certain text classes does not grant the status of a common word. Word frequency depends on the text in which the word is used. Therefore, when interpreting word frequencies, the type of the texts must be taken into account. Many words that are common in certain types of texts (e.g. physics textbooks, fairy tales) are not common in language in general. Moreover, words that are highly frequent in a certain genre need not be common in the language as a whole.
Words do not appear in texts randomly; the choice of words is associated with the topic of the text. It follows that a frequency list based on any set of texts gives a distorted picture of word frequency, because every text addresses a certain topic and uses words associated with that topic. Thus, to get a more realistic picture, one has to account not only for a word’s frequency but also for its distribution across texts. A low-frequency word that appears in various texts is considered more common than a high-frequency word that can be found only in certain type(s) of texts.
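The difference between raw frequency and distribution across texts can be illustrated with a minimal Python sketch. The toy texts and the function name frequency_and_range are invented for illustration and are not part of the dictionary's data or method; 'range' here simply means the number of texts in which a word occurs.

```python
from collections import Counter

def frequency_and_range(texts):
    """Return {word: (total frequency, number of texts containing it)}.

    `texts` maps a text name to its list of tokens (hypothetical input).
    """
    freq = Counter()        # total occurrences across all texts
    text_range = Counter()  # number of distinct texts a word occurs in
    for tokens in texts.values():
        freq.update(tokens)
        text_range.update(set(tokens))
    return {word: (freq[word], text_range[word]) for word in freq}

# Toy data, invented for illustration only.
texts = {
    "physics_textbook": ["kvant", "kvant", "kvant", "kvant", "aeg"],
    "fairy_tale": ["kägu", "aeg"],
    "news_item": ["kägu", "aeg"],
}
stats = frequency_and_range(texts)
print(stats["kvant"])  # (4, 1): frequent, but confined to one text
print(stats["kägu"])   # (2, 2): less frequent, but more widely spread
```

Under the view described above, the second word would count as the more common one even though its raw frequency is lower.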
It is another matter how widely spread a word has to be in order to be included in frequency lists. The present lists include frequent words that are also common words. Therefore, only the words that occur in fiction as well as in journalistic texts have been included. If a word is frequent in either of them but does not occur in the other, it is not considered a common word, and, thus, not considered suitable for this dictionary.
Approaching frequency in terms of commonness presupposes that the texts are homogeneous enough. If the texts are rather different (e.g. texts extracted from internet chatrooms vs law texts), it is not clear what the sum of the frequencies in the different texts represents.
The present dictionary is based on a 1-million-word corpus that includes fiction and journalistic texts. Fiction and journalistic texts are two distinct text classes in written language that are each homogeneous enough and, at the same time, not too different from one another. Fiction, together with journalistic texts drawn from nationally distributed, non-specialized quality journalism, is considered to represent standard, neutral Estonian.
Each of the two text classes, fiction and journalistic texts, is represented by roughly 500,000 words. The fiction texts (ilu92_98.zip) are drawn from the 1992-1998 part of the fiction subcorpus of the Corpus of Estonian Literary Language. Each extract is 2000 words long; some texts are represented by more than one extract. The journalistic texts (aja95_99.zip) are partly drawn from the 1990s subcorpus of the Corpus of Estonian Literary Language but also include texts extracted from online newspaper archives. All of the journalistic texts originate from the years 1995-1999 and are included as full issues.
The fact that the present dictionary is based on only two text classes (and ignores all the others, including spoken language) and on a rather small sample of texts (1 million words) should make us cautious about how these word frequencies are interpreted. For comparison, ‘Word Frequencies in Written and Spoken English’ (Leech et al. 2001) is based on the 100-million-word British National Corpus. On the other hand, the only equivalent frequency list compiled for Estonian so far (see Kaasik et al. 1976 and Kaasik et al. 1977) is based on 100,000 words of authors’ narrative in fiction.
Table 1 (text file) lists the 10,000 most frequent lemmas, ranked from the most frequent to the least frequent. The first column gives the lemma, the second column the part of speech, the third column the overall frequency of the lemma, the fourth column its frequency in journalistic texts, and the fifth column its frequency in fiction.
Table 2 (text file) lists the 1,000 most frequent word forms in alphabetical order. The first column gives the word form, the second column its overall frequency, the third column its frequency in journalistic texts, and the fourth column its frequency in fiction.
Table 3 (text file) lists 100 words that are frequent in either fiction or journalistic texts but are not present in both text classes. A “–” in the third or fourth column means that the word is not present in journalistic texts or in fiction, respectively. The data show that journalistic texts have a more specific vocabulary. This vocabulary consists mostly of nouns that refer to the domains of government (riigieelarve 'state budget', välisminister 'minister of foreign affairs', siseminister 'minister of internal affairs'), economy (investeering 'investment', börs 'stock market', tarbija 'consumer'), and sport (meistrivõistlus 'champions league', finaal 'final'). Only 11 words out of 100 appeared in fiction only, 7 of which are verbs (pomisema 'mumble', kummarduma 'bow (oneself)', silitama 'stroke', võpatama 'flinch', seisatama 'stop (for a bit)', kuulatama 'listen up', kohendama 'spruce (up)').
The corpus on which the frequency lists are based is compiled from 2000-word extracts of fiction published in 1992-1998 (ilu92_98.zip) and full issues of newspapers published in 1995-1999 (aja95_99.zip). The sources of the fiction texts are listed in fiction source texts and the newspaper issues are listed in journalism source texts.
It must be noted that the frequency lists present the frequencies of words and not the frequencies of word senses. Thus, the frequency of the verb tulema represents the frequency of the word tulema in all of its senses, e.g. ‘arrive’ and ‘have to’. Words that appear as more than one part of speech are tagged for each of them.
Multi-word units, i.e. words that often appear together and usually form a single entry in a dictionary, are here split into their component words, and the frequencies are counted for each component separately. For instance, the occurrences of the multi-word unit aru saama (lit. sense + get 'understand') are counted separately under aru 'sense' and saama 'get'.
If a certain word is not found in the frequency lists, this does not mean that it did not occur in the texts at all. The criterion for including a word in the frequency lists is that it occurs in both text classes, journalism and fiction, and at least five times. Thus, if a word was highly frequent in one text class but did not occur in the other, it was not included in the lists. For instance, the word puuraidur ‘wood-cutter’ occurred 50 times in fiction but not once in journalistic texts, so it was not included in the frequency lists. Likewise, the word omavalitsus ‘local government’ occurred 209 times in journalistic texts but did not occur in fiction at all, so it was excluded.
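The selection criterion can be sketched in Python as follows. This is only one possible reading: the text does not say whether the threshold of five applies to the total count or to each text class, and the sketch assumes the total. The function name is_listed and the parameter min_total are hypothetical.

```python
def is_listed(fiction_count, journalism_count, min_total=5):
    """Selection rule as read here: the word occurs in both text classes
    and its total count is at least `min_total`.

    Applying the threshold to the total (rather than to each class)
    is an assumption of this sketch.
    """
    in_both = fiction_count > 0 and journalism_count > 0
    return in_both and (fiction_count + journalism_count) >= min_total

print(is_listed(50, 0))   # puuraidur, counts from the text: False
print(is_listed(0, 209))  # omavalitsus, counts from the text: False
print(is_listed(3, 4))    # invented counts: True
print(is_listed(1, 2))    # invented counts: False
```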
Moreover, one should be cautious about drawing conclusions based on the frequencies of individual words. According to the British lexicographer John Sinclair, the semantics of an individual word depends on the contexts in which the word occurs. For instance, one of the most frequent nouns in our frequency list is aeg ‘time’, but in most cases it does not refer to time as an ontological category but occurs in phrases such as samal ajal ‘at the same time’, viimasel ajal ‘in recent times’, kogu aeg ‘all the time’, pikka aega ‘for a long time’. In comparison, aika ‘time’ is also the most frequent word in Finnish (Saukkonen et al. 1979).
The list of lemmas was obtained automatically, using the morphological analyser for Estonian and the statistical part-of-speech tagger estyhmm (Kaalep, Vaino 2000). In addition to the overall frequency of the lemmas, the frequencies were also counted separately for the journalistic texts and for fiction. Proper names, abbreviations, and numbers were not included in the frequency lists.
Each lemma has one or more tags that indicate which part of speech the lemma represents: S (substantive), A (adjective), V (verb), P (pronoun), or D (particle). Particles include adpositions, adverbs, conjunctions, and interjections. The lemmas oma ‘own’ and pool ‘half’ have the highest number of part-of-speech tags (four).
Because the part-of-speech tagging was an automated process, a certain number of errors occurred. At first, it was not possible to distinguish certain homonymous forms of the words see ‘it’ and tema ‘s/he’ (e.g. nende (PL.GEN), neid (PL.PAR), nendes or neis (PL.INE)). However, as these forms were disambiguated manually, the frequencies represent the actual frequencies of these lemmas. The same procedure was followed in the case of some other homonymous forms.
A large number of Estonian proper nouns are homonymous with common nouns or noun forms. Therefore, the morphological tagger does not always correctly distinguish proper nouns from common nouns (e.g. Laine (proper noun) and laine ‘wave’, Kalju (proper noun) and kalju ‘rock’, and especially compound surnames and toponyms). Such instances were manually disambiguated. For instance, MUSTAMÄGI (lit. black mountain, a toponym) has been re-tagged as a proper noun. The manual tagging reduced the frequencies of many lemmas that often occur in proper nouns and toponyms (e.g. liiv ‘sand’, mari ‘berry’).
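The counting step described above can be sketched in Python, assuming the texts have already been lemmatized and tagged. The input format (lemma, tag, text class), the function name count_lemmas, and the placeholder tags PROP, ABBR, and NUM are assumptions made for this illustration; they are not the actual output format or tag set of the analyser or of estyhmm.

```python
from collections import Counter, defaultdict

# Placeholder tags for the excluded categories (not the real tag set).
EXCLUDED = {"PROP", "ABBR", "NUM"}

def count_lemmas(tagged_tokens):
    """Count lemma frequencies overall and per text class.

    `tagged_tokens` is an iterable of (lemma, pos, text_class) triples,
    a hypothetical stand-in for the output of a lemmatizer and tagger.
    """
    per_class = defaultdict(Counter)  # text class -> lemma -> count
    pos_tags = defaultdict(set)       # lemma -> set of part-of-speech tags
    for lemma, pos, text_class in tagged_tokens:
        if pos in EXCLUDED:
            continue  # proper names, abbreviations, numbers are skipped
        per_class[text_class][lemma] += 1
        pos_tags[lemma].add(pos)
    overall = Counter()
    for counter in per_class.values():
        overall.update(counter)
    return overall, per_class, pos_tags

# Toy input, invented for illustration only.
tokens = [
    ("aeg", "S", "fiction"),
    ("aeg", "S", "journalism"),
    ("oma", "P", "fiction"),
    ("oma", "A", "journalism"),
    ("Tallinn", "PROP", "journalism"),
]
overall, per_class, pos_tags = count_lemmas(tokens)
print(overall["aeg"], per_class["fiction"]["aeg"], sorted(pos_tags["oma"]))
```

Keeping the tags in a set per lemma mirrors the way a lemma that occurs as several parts of speech receives several tags in the lists.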
The principles of automatic lemmatization
Because the morphological analyser makes mistakes, its output must be analysed to see to what extent the frequency lists can be trusted. To do that, the output of the automatic process was compared with manually analysed text. The analysis revealed that the most common mistake (2%) was tagging proper nouns as common nouns. However, this type of mistake has only a limited effect on the frequency lists. Because most of the proper nouns tagged as common nouns were not present in both text classes or were infrequent, they were excluded from the frequency lists anyway. The effect of the mistake was further reduced by manual correction.
Besides the confusion between proper and common nouns, 0.75% of the words were lemmatized incorrectly. An error rate of 0.75% is comparable to the sampling error associated with the selection of texts.
The frequency lists include 9,700 words. The corpus that the lists are drawn from contains 1,007,000 words (510,200 words of journalism and 496,800 words of fiction), including numbers, abbreviations, and proper nouns. After excluding proper nouns, abbreviations, and numbers, 908,400 words remain. This is the size of the corpus that the frequency lists are based on.
Table 4 gives the cumulative percentage of the text covered by the most frequent lemmas.
|First … words||% of text covered||Each word occurs at least on … occasions|
The data show that the 250 most frequent lemmas cover more than half of the texts and the 10,000 most frequent lemmas cover about 90% of the texts.
However, the frequency lists show only the tip of the iceberg. The corpus includes 60,000 lemmas. About half of the lemmas (32,000) occurred only once and 28,000 more than once. Only about half of the latter (14,500) were present in both text classes (journalism and fiction), and only 9,700 occurred at least 5 times. These 9,700 lemmas cover 90.3% of the texts.
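A coverage table of this kind can be computed from a frequency list as sketched below in Python. The function name coverage_table and the counts are invented for illustration and are not taken from the corpus; each row gives the cutoff, the percentage of running words covered, and the frequency of the last item included.

```python
from itertools import accumulate

def coverage_table(counts, cutoffs):
    """For each cutoff n, return (n, % of tokens covered by the n most
    frequent items, frequency of the n-th most frequent item)."""
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    running = list(accumulate(ranked))  # cumulative token counts by rank
    rows = []
    for n in cutoffs:
        n = min(n, len(ranked))  # do not run past the end of the list
        percent = round(100 * running[n - 1] / total, 1)
        rows.append((n, percent, ranked[n - 1]))
    return rows

# Invented counts for illustration; not taken from the corpus.
counts = {"olema": 50, "ja": 30, "aeg": 10, "kägu": 6, "puuraidur": 4}
for row in coverage_table(counts, [1, 2, 5]):
    print(row)
# (1, 50.0, 50)
# (2, 80.0, 30)
# (5, 100.0, 4)
```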
There were 22,000 lemmas that were only to be found in fiction and 23,500 lemmas that only occurred in journalistic texts. The most frequent such lemmas are presented in Table 3.
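The split into fiction-only, journalism-only, and shared lemmas amounts to simple set operations, sketched here in Python. The function name partition_lemmas and the input format are hypothetical; the counts for puuraidur and omavalitsus are the ones quoted earlier, while those for aeg are invented.

```python
def partition_lemmas(fiction, journalism):
    """Split lemmas into fiction-only, journalism-only, and shared sets.

    Both arguments map lemma -> count for one text class (hypothetical input).
    """
    fiction_lemmas = set(fiction)
    journalism_lemmas = set(journalism)
    return (
        fiction_lemmas - journalism_lemmas,   # only in fiction
        journalism_lemmas - fiction_lemmas,   # only in journalism
        fiction_lemmas & journalism_lemmas,   # in both text classes
    )

fiction = {"puuraidur": 50, "aeg": 3}        # count for aeg is invented
journalism = {"omavalitsus": 209, "aeg": 4}  # count for aeg is invented
fiction_only, journalism_only, shared = partition_lemmas(fiction, journalism)
print(fiction_only, journalism_only, shared)

# The rounded figures reported in the text add up to the stated total.
print(22_000 + 23_500 + 14_500)  # 60000
```

The final line checks that the three groups reported in the text (22,000 fiction-only, 23,500 journalism-only, 14,500 shared) sum to the 60,000 lemmas of the corpus.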
Table 5 gives the cumulative percentage of the text covered by the most frequent word forms.
|The first … word forms||% of texts covered||every form occurs at least on … occasions|