Introduction

Word frequency correlates with word commonness. Frequent words are usually also common and, vice versa, low-frequency words are usually not common. The distinction between the notions of 'commonness' and 'frequency' is useful for understanding what kind of information frequency lists contain. For instance, while kägu 'cuckoo' is certainly a common word in Estonian, it is not particularly frequent in fiction and journalistic texts. Therefore, it does not exceed the threshold for inclusion in the present frequency lists. Conversely, high frequency in certain text classes does not grant a word the status of a common word. Word frequency depends on the text in which the word is used. Therefore, when interpreting word frequencies, the type of the texts must be taken into account. Many words that are frequent in certain types of texts (e.g. physics textbooks, fairy tales) are not common in the language in general. Moreover, words that are highly frequent in a certain genre need not be common in the language.

Words do not appear in texts at random; the choice of words is associated with the topic of the text. It follows that frequency lists based on any set of texts give a distorted picture of word frequency, because every text addresses a certain topic and uses words that are associated with that topic. Thus, to get a more realistic picture of word frequencies, one has to account not only for a word's frequency but also for its distribution across various texts. A low-frequency word that appears in various texts is considered more common than a high-frequency word that can only be found in certain type(s) of texts.
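The difference between frequency and distribution can be illustrated with a minimal sketch that counts, for each word, both its total number of occurrences and the number of texts in which it occurs. The toy texts and the function name `frequency_and_range` are assumptions made for illustration; this is not the procedure used for the present lists.

```python
from collections import Counter

def frequency_and_range(texts):
    """Return (frequency, text_range) counters for a list of tokenised texts.

    frequency[w]  = total number of occurrences of w
    text_range[w] = number of texts in which w occurs at least once
    """
    frequency = Counter()
    text_range = Counter()
    for tokens in texts:
        frequency.update(tokens)
        text_range.update(set(tokens))  # each text counts once per word
    return frequency, text_range

# Toy data (hypothetical): three short "texts" already split into lemmas.
texts = [
    ["kvark", "kvark", "kvark", "ja", "aeg"],
    ["kägu", "ja", "mets"],
    ["aeg", "ja", "kägu"],
]
frequency, text_range = frequency_and_range(texts)
```

In this toy data kvark is more frequent than kägu (3 occurrences against 2) but is confined to a single text, whereas kägu appears in two.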

It is another matter how widely spread a word has to be in order to be included in frequency lists. The present lists include frequent words that are also common words. Therefore, only the words that occur in fiction as well as in journalistic texts have been included. If a word is frequent in either of them but does not occur in the other, it is not considered a common word, and, thus, not considered suitable for this dictionary.

Approaching frequency in terms of commonness presupposes that the texts are homogeneous enough. If the texts are rather different (e.g. texts extracted from internet chatrooms vs law texts), it is not clear what the sum of the frequencies in different texts represents.

The present dictionary is based on a 1-million-word corpus that includes fiction and journalistic texts. Fiction and journalistic texts are two distinct text classes in written language that are each homogeneous enough and, at the same time, not too different from one another. Fiction, together with journalistic texts that represent nationally distributed, non-specialized quality journalism, is considered to represent standard neutral Estonian.

Both of the text classes - fiction and journalistic texts - are represented roughly by 500,000 words. The fiction texts (ilu92_98.zip) are compiled of texts that represent the years 1992-1998 of the fiction subcorpus of the Corpus of Estonian Literary Language. The length of each extract is 2000 words; some texts are represented by more than one extract. The journalistic texts (aja95_99.zip) are partly compiled of the texts of the 1990s subcorpus of the Corpus of Estonian Literary Language but also include texts that are extracted from newspaper online archives. All of the journalistic texts originate from the years 1995-1999. All the journalistic texts are extracted as full issues.

The fact that the present dictionary is based on only two text classes (and ignores all the others, including spoken language) and on a rather small sample of texts (1 million words) should make us cautious as to how these word frequencies can be interpreted. For comparison, 'Word Frequencies in Written and Spoken English' (Leech et al. 2001) is based on the 100-million-word British National Corpus. On the other hand, the only equivalent frequency list in Estonian compiled so far (see Kaasik et al. 1976 and Kaasik et al. 1977) is based on 100,000 words of author's narrative in fiction.

The 10,000 most frequent lemmas

Table 1 (text file) lists the 10,000 most frequent lemmas, ranked from the most frequent to the least frequent. The first column gives the lemma, the second column gives the part of speech, the third column gives the overall frequency of the lemma, the fourth column gives the frequency of the lemma in journalistic texts, and the fifth column gives the frequency of the lemma in fiction.
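A row of Table 1 could be read as sketched below. The column order follows the description above; the whitespace delimiter, the `LemmaEntry` type and the sample rows are assumptions, and the sample counts are invented rather than taken from the table.

```python
from typing import NamedTuple

class LemmaEntry(NamedTuple):
    lemma: str
    pos: str          # e.g. "S", "V" or a combined tag such as "D/S"
    total: int        # overall frequency
    journalism: int   # frequency in journalistic texts
    fiction: int      # frequency in fiction

def parse_lemma_line(line: str) -> LemmaEntry:
    """Parse one row, assuming five whitespace-separated columns."""
    lemma, pos, total, journalism, fiction = line.split()
    return LemmaEntry(lemma, pos, int(total), int(journalism), int(fiction))

# Invented rows in the assumed layout; the counts are NOT taken from Table 1.
sample = """\
saadik D/S 30 20 10
aeg S 50 28 22
"""
entries = [parse_lemma_line(row) for row in sample.splitlines() if row.strip()]
```

If the actual file uses tabs or another delimiter, only the `line.split()` call would need to change.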

The 1,000 most frequent word forms

Table 2 (text file) lists the 1,000 most frequent word forms in alphabetical order. The first column gives the word form, the second column gives the overall frequency of the word form, the third column gives the frequency of the word form in journalistic texts, and the fourth column gives the frequency of the word form in fiction.

The 100 frequent words that were not included

Table 3 (text file) lists 100 words that are frequent in either fiction or journalistic texts but are not present in both text classes. A '-' in the third or fourth column means that the word is not present in journalistic texts or fiction, respectively. The data show that journalistic texts have a more specific vocabulary. Such vocabulary consists mostly of nouns that refer to the domains of government (riigieelarve 'state budget', välisminister 'minister of foreign affairs', siseminister 'minister of internal affairs'), economy (investeering 'investment', börs 'stock market', tarbija 'consumer'), and sport (meistrivõistlus 'championship', finaal 'final'). Only 11 of the 100 words appeared exclusively in fiction, 7 of which are verbs (pomisema 'mumble', kummarduma 'bow (oneself)', silitama 'stroke', võpatama 'flinch', seisatama 'stop (for a bit)', kuulatama 'listen up', kohendama 'spruce (up)').

The corpus

The corpus on which the frequency lists are based consists of 2000-word extracts of works of fiction published in 1992-1998 (ilu92_98.zip) and full issues of newspapers published in 1995-1999 (aja95_99.zip). The sources of the fiction texts are listed in fiction source texts and the newspaper issues are listed in journalism source texts.

What can be and cannot be found in the frequency lists?

It must be noted that the frequency lists present the frequencies of words and not the frequencies of word senses. Thus, the frequency of the verb tulema represents the frequency of the word tulema in all of its senses, e.g. 'arrive' and 'have to'. Words that appear as more than one part of speech are tagged for several parts of speech.

Phrases whose words often appear together and that usually have their own entries in a dictionary (e.g. multi-word verbs) are here split into their component words, and the frequencies are counted for each component. For instance, the occurrences of the multi-word unit aru saama (lit. sense + get 'understand') are counted separately for aru 'sense' and saama 'get'.

If a certain word cannot be found in the frequency lists, it does not mean that it did not occur in the texts at all. The criterion for including a word in the frequency lists is that it occurs in both text classes - journalism and fiction - and at least five times. Thus, if a word was highly frequent in one text class but did not occur in the other, it was not included in the lists. For instance, the word puuraidur 'wood-cutter' occurred 50 times in fiction but not once in journalistic texts. Thus, it was not included in the frequency lists. Likewise, the word omavalitsus 'local government' occurred 209 times in journalistic texts but did not occur in fiction at all. Thus, it was excluded.
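The selection criterion can be written as a small predicate. This is a sketch under one assumption: the threshold of five is read here as five occurrences in total across both text classes, which is an interpretation of the wording above. The function name `is_listed` is hypothetical.

```python
def is_listed(journalism: int, fiction: int, min_total: int = 5) -> bool:
    """True if a lemma meets the selection criterion as read here:
    present in both text classes and at least `min_total` occurrences overall."""
    return journalism > 0 and fiction > 0 and journalism + fiction >= min_total

# The two excluded examples from the text:
puuraidur = is_listed(journalism=0, fiction=50)      # absent from journalism
omavalitsus = is_listed(journalism=209, fiction=0)   # absent from fiction

# A hypothetical lemma with 3 + 2 occurrences would be listed under this reading.
hypothetical = is_listed(journalism=3, fiction=2)
```

Under the stricter reading (at least five occurrences in each text class) the last line would instead return False; only the final comparison in `is_listed` would need to change.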

Moreover, one should be cautious about drawing conclusions based on the frequencies of individual words. According to the British lexicographer John Sinclair, the semantics of an individual word depends on the contexts in which the word occurs. For instance, one of the most frequent nouns in our frequency list is aeg 'time', but in most cases it does not refer to time as an ontological category but occurs in phrases such as samal ajal 'at the same time', viimasel ajal 'in recent times', kogu aeg 'all the time', pikka aega 'for a long time'. In comparison, aika 'time' is also the most frequent word in Finnish (Saukkonen et al. 1979).

How were the frequency lists compiled?

The list of lemmas was obtained automatically, using the morphological analyser for Estonian and the statistical part-of-speech tagger estyhmm (Kaalep, Vaino 2000). In addition to the overall frequency of the lemmas, the frequencies were also counted separately in the journalistic texts and in fiction. Proper names, abbreviations, and numbers were not included in the frequency lists.

Each lemma has one or more tags that indicate which part of speech the lemma represents (substantive - S, adjective - A, verb - V, pronoun - P, or particle - D). Particles include adpositions, adverbs, conjunctions, and interjections. The lemmas oma 'own' and pool 'half' have the highest number of part-of-speech tags (four).

Because the part-of-speech tagging was an automated process, a certain number of errors occurred. Initially, it was not possible to distinguish certain homonymous forms of the words see 'it' and tema 's/he' (e.g. nende (PL.GEN), neid (PL.PAR), nendes or neis (PL.INE)). However, as these forms were disambiguated manually, the frequencies represent the actual frequencies of these lemmas. The same procedure was followed in the case of some other homonymous forms.

A large number of Estonian proper nouns are homonymous with certain common nouns or noun forms. Therefore, the morphological tagger does not always correctly distinguish proper nouns from common nouns (e.g. Laine (proper noun) and laine 'wave', Kalju (proper noun) and kalju 'rock', and especially compound surnames and toponyms). Such instances were manually disambiguated. For instance, MUSTAMÄGI (lit. black mountain, a toponym) has been re-tagged as a proper noun. The manual tagging reduced the frequencies of many lemmas that often occur in proper nouns and toponyms (e.g. liiv 'sand', mari 'berry').

The principles of automatic lemmatization

  1. The lemmas are not distinguished by part of speech. Thus, if a certain particle is homonymous with a certain noun in the nominative case, they are presented as the same lemma. For instance, the frequencies of the noun saadik 'ambassador' and the adposition saadik 'since' are presented as a single frequency of the string saadik. In such cases, multiple part-of-speech tags are included (saadik D/S). However, it remains unknown how many times the lemma occurs as each part of speech.
  2. Homonymous lemmas are not distinguished. For instance, the lemmas palk 'log' and palk 'salary' are presented as a single lemma with a single frequency count. The fact that the lemma may contain two homonymous lemmas is shown using a double part-of-speech tag (palk S/S). If a lemma is homonymous but the other homonym(s) are not likely to occur in the corpus, only one part-of-speech tag is listed. For instance, the lemma ruut has three homonyms ('seed plant', 'a type of plant', and 'square'), two of which are quite unlikely to occur in the journalistic texts or fiction. In such cases, the lemma was given only one part-of-speech tag. Admittedly, it is subjective and possibly erroneous to assume that the other homonyms do not occur in the texts. However, it would also be erroneous to assume that these infrequent homonyms are present in the corpus. Lemmas that have several inflectional paradigms are always marked with several part-of-speech tags. For instance, the word päike 'sun' has the tag S/S because it can be inflected as päikese (SG:GEN) and päikesesse (SG:ILL) as well as päikse (SG:GEN) and päiksesse (SG:ILL).
  3. All comparative and superlative forms are considered separate lemmas. Thus, the words õnnelik 'happy', õnnelikum 'happier', and õnnelikem 'happiest' are separate lemmas in the frequency lists. The past and present participles have been lemmatized following different principles. The present participle forms (-v and -tav) are considered to be separate lemmas. The past participles (-nud and -tud), on the other hand, are considered separate lemmas only in cases that express clear adjectival meanings, e.g. surnud (lit. die-pst.ptcp, 'dead'). The reason is that the present participles mostly function as adjectives, whereas the past participles may function as verbs or as adjectives, and the distinction between the two parts of speech may not always be clear-cut. The frequencies of the past participles are, thus, 'hidden' in the verb frequencies but the frequencies of the present participles are presented as separate lemmas. Moreover, the -ja and -mine derivatives of verbs, which express the doer and the action respectively, were considered separate lemmas.
  4. The des-gerund and mata-supine are analysed either as verbs or as adverbs by the morphological analyser. In addition, some of the mata-forms are analysed as adpositions, e.g. hoolimata, vaatamata (both 'despite'). The des- and mata-forms that were disambiguated as particles were listed as separate lemmas with their own frequency count.
  5. The lemmatization of certain word forms is not straightforward. For instance, it is not clear if the base form of päikese is päike or päikene (both 'sun'). In such cases the following principles were implemented:
  6. The pronouns ma/mina, sa/sina, ta/tema are always lemmatized as the longer forms. Moreover, the frequency of the lemmas mina 'I', sina 'you', and tema 's/he' also include the plural forms (e.g. me/meie).
  7. As was mentioned above, numbers are not included in the frequency lists. It follows that the compounds with numeric components (e.g. 3-aastane 'three-year-old') were also excluded.
  8. Some compounds may be written either with a hyphen or as a single word (e.g. võib-olla, võibolla). Such words are presented as separate lemmas according to their spelling.
  9. The journalistic texts that originate from the 1990s include instances of the letters š and ž being replaced by sh and zh. The words that display such spelling are also listed as separate lemmas from their correctly spelled counterparts. For instance, the frequency lists include the lemma shokk as well as šokk (both 'shock').
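Some of the principles above can be sketched in code: counting lemmas by their string alone while collecting every part-of-speech tag seen (principle 1), folding short and plural pronoun forms into the long singular lemma (principle 6), and keeping spelling variants apart (principle 8). The function name `aggregate`, the input format and the sample input are assumptions for illustration; the sketch does not reproduce double tags such as S/S for same-class homonyms, nor any of the manual disambiguation.

```python
from collections import Counter, defaultdict

# Short pronoun forms mapped to the long lemma (principle 6); the plural
# forms me/meie are folded into mina, as the text describes.
PRONOUN_LEMMA = {"ma": "mina", "sa": "sina", "ta": "tema",
                 "me": "mina", "meie": "mina"}

def aggregate(tagged_lemmas):
    """Count lemmas by string only, collecting every POS tag seen.

    `tagged_lemmas` is an iterable of (lemma, pos) pairs.
    Returns {lemma: (pos_tag, frequency)}, e.g. {"saadik": ("D/S", 3)}.
    """
    freq = Counter()
    tags = defaultdict(set)
    for lemma, pos in tagged_lemmas:
        lemma = PRONOUN_LEMMA.get(lemma, lemma)
        freq[lemma] += 1
        tags[lemma].add(pos)
    return {lemma: ("/".join(sorted(tags[lemma])), n) for lemma, n in freq.items()}

# Invented input: saadik twice as a noun and once as a particle,
# two pronoun forms, and two spellings of the same compound.
result = aggregate([("saadik", "S"), ("saadik", "D"), ("saadik", "S"),
                    ("ma", "P"), ("meie", "P"),
                    ("võib-olla", "D"), ("võibolla", "D")])
```

The result holds a single entry for saadik with the combined tag D/S and a count of 3, which shows why the share of each part of speech can no longer be recovered from the list.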

Because the morphological analyser makes mistakes, its output must be properly analysed to see to what extent the frequency lists can be trusted. To do that, the output of the automatic process was compared to manually analysed text. The analysis revealed that the most common mistake (2%) was tagging proper nouns as common nouns. However, this type of mistake has only a limited effect on the frequency lists. Because most of the proper nouns tagged as common nouns were not present in both text classes or were infrequent, they were excluded from the frequency lists anyway. The effect of the mistake was also reduced by manual correction.

Besides the confusion between proper and common nouns, 0.75% of the words were lemmatized incorrectly. An error rate of 0.75% is comparable to the sampling error associated with the selection of texts.

To what extent are the lemmas that occur in the corpus presented in the frequency lists?

The frequency lists include 9,700 words. The corpus that the lists are based on includes 1,007,000 words (510,200 words of journalism and 496,800 words of fiction), including numbers, abbreviations, and proper nouns. After excluding proper nouns, abbreviations, and numbers, 908,400 words remain. This is the size of the corpus that the frequency lists are based on.

Table 4 shows what cumulative percentage of the text is covered by the most frequent lemmas.

Table 4. Proportion of text covered by the most frequent lemmas, ranked by frequency

First ... lemmas    % of text covered    Each lemma occurs at least ... times
10                  19.3                 6194
20                  24.6                 4032
50                  33.1                 1797
100                 40.7                 1034
250                 51.3                 452
500                 60.2                 229
1000                69.0                 115
1500                74.0                 72
2000                77.2                 52
3000                81.5                 30
5000                86.0                 15
10000               90.3                 5

The data show that the 250 most frequent lemmas cover more than half of the texts and the 10,000 most frequent lemmas cover about 90% of the texts.
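Figures of this kind can be derived from a ranked frequency list by accumulating counts and dividing by the corpus size. The sketch below is an illustration under assumptions: the function name `coverage` and the toy counts are invented, and the denominator used for the published percentages is presumed, not confirmed, to be the 908,400 running words that remain after exclusions.

```python
from itertools import accumulate

def coverage(frequencies, corpus_size, cutoffs):
    """Percentage of running words covered by the top-N items.

    `frequencies` are item counts, `corpus_size` is the number of running
    words, `cutoffs` are the N values of interest.
    Returns {N: (percent_covered, count_of_the_Nth_item)}.
    """
    ranked = sorted(frequencies, reverse=True)
    cumulative = list(accumulate(ranked))
    table = {}
    for n in cutoffs:
        if n <= len(ranked):  # skip cutoffs beyond the end of the list
            percent = 100 * cumulative[n - 1] / corpus_size
            table[n] = (round(percent, 1), ranked[n - 1])
    return table

# Invented counts for a toy corpus of 100 running words; the remaining
# 25 running words are assumed to belong to rarer items not listed here.
toy = [30, 20, 10, 5, 5, 3, 2]
table = coverage(toy, corpus_size=100, cutoffs=[1, 2, 5])
```

Here the top two items cover 50% of the toy corpus and the top five cover 70%, with the fifth item occurring 5 times.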

However, the frequency lists only show the tip of the iceberg. The corpus includes 60,000 lemmas. About half of the lemmas (32,000) occurred only once and 28,000 occurred more than once. Only about half of the latter (14,500) were present in both text classes (journalism and fiction), and only 9,700 occurred at least 5 times. These 9,700 lemmas cover 90.3% of the texts.

There were 22,000 lemmas that were only to be found in fiction and 23,500 lemmas that only occurred in journalistic texts. The most frequent such lemmas are presented in Table 3.

Table 5 shows what cumulative percentage of the text is covered by the most frequent word forms.

Table 5. Proportion of text covered by the most frequent word forms, ranked by frequency

First ... word forms    % of text covered    Each word form occurs at least ... times
10                      13.0                 5329
20                      17.2                 2961
50                      23.5                 1445
100                     29.4                 863
250                     38.2                 373
500                     45.3                 187
1000                    52.4                 95
1500                    56.7                 65
2000                    59.7                 50
3000                    64.2                 33
5000                    69.7                 20
10000                   76.9                 10
20000                   83.8                 5
33000                   88.8                 3

References

  1. Kaalep, H.-J., Vaino, T. Teksti täielik morfoloogiline analüüs lingvisti töövahendite komplektis. In: Arvutuslingvistikalt inimesele. Tartu 2000, pp. 87-99
  2. Kaasik, Ü., Tuldava, J., Villup, A., Ääremaa, K. Eesti keele ilukirjandusproosa autorikõne sõnavormide sagedussõnastik. Keelestatistika 1. TRÜ toimetised, vihik 377, Tartu 1976, pp. 107-153
  3. Kaasik, Ü., Tuldava, J., Villup, A., Ääremaa, K. Eesti tänapäeva ilukirjandusproosa autorikõne lekseemide sagedussõnastik. Keelestatistika 2. TRÜ toimetised, vihik 413, Tartu 1977, pp. 5-140
  4. Leech, G., Rayson, P., Wilson, A. Word Frequencies in Written and Spoken English. Longman, Pearson Education 2001
  5. Saukkonen, P., Haipus, M., Niemikorpi, A., Sulkala, H. Suomen kielen taajuussanasto. A frequency dictionary of Finnish. Werner Söderström osakeyhtiö. Porvoo - Helsinki - Juva 1979

Last modified: January 04 2019 11:27:06.