Compilation of the frequency lists has been supported by the national programme for Estonian Language Technology.
N-grams are defined here as strings of two, three, or four adjacent words. N-grams are not synonymous with collocations. Collocations are words that occur together in certain contexts (such as clauses). The components of a collocation may be adjacent to each other, but they need not be. Thus, the words ajas pilli lõhki 'carry [something] too far' in example (1) form a collocation but not a trigram, because other words intervene. In example (2), on the other hand, ajas pilli lõhki forms both a collocation and a trigram.
(1) Siis aga ajas vihane herilane pilli hoopis lõhki. 'Then the angry wasp carried it too far'
(2) Vihane herilane ajas pilli lõhki. 'The angry wasp carried it too far'
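The difference between examples (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, tokens are lowercased and split on whitespace, and the collocation check is deliberately crude (it only tests that all the words occur in the same clause, in any order).

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def co_occur(tokens, words):
    """Crude collocation test (an assumption for illustration):
    every word occurs somewhere in the clause, adjacency not required."""
    return set(words) <= set(tokens)


target = ("ajas", "pilli", "lõhki")
ex1 = "siis aga ajas vihane herilane pilli hoopis lõhki".split()
ex2 = "vihane herilane ajas pilli lõhki".split()

ex1_is_collocation = co_occur(ex1, target)   # words co-occur in the clause
ex1_is_trigram = target in ngrams(ex1, 3)    # but are not adjacent
ex2_is_collocation = co_occur(ex2, target)
ex2_is_trigram = target in ngrams(ex2, 3)    # adjacent, so also a trigram

print(ex1_is_collocation, ex1_is_trigram)
print(ex2_is_collocation, ex2_is_trigram)
```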
The frequency lists presented here are based on the morphologically disambiguated version of the Balanced Corpus of Estonian, which consists of three sub-corpora of 5 million words each: journalistic texts, fiction, and scientific texts. The frequency lists have been compiled for the whole Balanced Corpus of Estonian as well as for each sub-corpus. The lists include n-grams of lemmas as well as n-grams of word forms. Morphological categories follow the system developed by Filosoft. The texts have been disambiguated using a statistical tool based on trigrams.
The n-grams of word forms are not case sensitive, i.e. proper nouns and common nouns are not distinguished. The n-grams of word lemmas, on the other hand, are case sensitive, i.e. proper nouns and common nouns are distinguished.
The n-grams also include punctuation marks, most frequently the comma. Thus, the most frequent bigrams include , et and the most frequent trigrams include selleks , et. All punctuation marks are tagged #Z# in the lists of lemmas and #z# in the lists of word forms. The tags make the punctuation marks easy to detect and search for. Thus, researchers who do not wish to include such n-grams can easily delete them, and those who wish to study only n-grams with punctuation marks can easily find them. Brackets have been excluded from the lists.
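A minimal sketch of how the punctuation tags could be used to separate the two kinds of n-grams. The exact layout of the list files is not described here, so the sketch assumes that each n-gram has already been read into a tuple of token strings and that a punctuation token contains the tag #z# or #Z# somewhere in its string; the sample rows and function names are hypothetical.

```python
PUNCT_TAGS = ("#z#", "#Z#")  # word-form lists use #z#, lemma lists use #Z#


def has_punctuation(ngram):
    """True if any token of the n-gram carries a punctuation tag."""
    return any(tag in token for token in ngram for tag in PUNCT_TAGS)


def split_by_punctuation(ngrams):
    """Split n-grams into (with punctuation, without punctuation)."""
    with_punct = [g for g in ngrams if has_punctuation(g)]
    without_punct = [g for g in ngrams if not has_punctuation(g)]
    return with_punct, without_punct


# Hypothetical rows; the real token rendering may differ.
rows = [
    (",#z#", "et"),
    ("selleks", ",#z#", "et"),
    ("ajas", "pilli", "lõhki"),
]
with_punct, without_punct = split_by_punctuation(rows)
print(len(with_punct), len(without_punct))
```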
Frequency lists of n-grams include numbers and abbreviations.
The frequency lists presented here include only the n-grams that occur at least ten times in the Balanced Corpus of Estonian.
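The counting and the frequency threshold can be sketched as follows. This is not the tool used to compile the lists; it is a minimal illustration under stated assumptions: input is a list of tokenized sentences, n-grams do not cross sentence boundaries, and the hypothetical fold_case parameter lowercases tokens in the way described above for word forms (it would be left off for lemmas).

```python
from collections import Counter


def count_ngrams(sentences, n, fold_case=False):
    """Count contiguous n-grams within each sentence (no crossing of boundaries)."""
    counts = Counter()
    for tokens in sentences:
        if fold_case:
            tokens = [t.lower() for t in tokens]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent(counts, min_freq=10):
    """Keep n-grams occurring at least min_freq times, most frequent first."""
    return [(g, c) for g, c in counts.most_common() if c >= min_freq]


# Toy corpus: one sentence repeated 10 times, another repeated 3 times.
toy = [["Vihane", "herilane", "ajas", "pilli", "lõhki"]] * 10 + [["vihane", "koer"]] * 3
bigram_counts = count_ngrams(toy, 2, fold_case=True)
kept = frequent(bigram_counts, min_freq=10)
print(len(bigram_counts), len(kept))
```

In the toy corpus the bigram vihane koer occurs only three times, so it is counted but dropped by the threshold.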
Restricting the lists to frequent n-grams affects the relationship between the number of n-grams of word forms and the number of n-grams of lemmas. In general, there are more distinct n-grams of word forms than distinct n-grams of lemmas. However, if the n-grams are restricted to those that appear at least ten times in the corpus, there are more n-grams of lemmas than n-grams of word forms (compare tables 1, 2, and 3 with tables 4, 5, and 6). Thus, excluding the less frequent n-grams decreases the number of distinct n-grams of word forms considerably, while it does not affect the number of n-grams of lemmas to the same extent.
The distribution of the words across texts is not taken into account. Thus, it is possible that certain highly frequent n-grams occur in only one text of the Balanced Corpus of Estonian. Excluding the less frequent n-grams decreases the chances of including n-grams that occur in only a small number of texts. It does not, however, exclude n-grams that are highly frequent in one text (or a few texts) but do not occur in others. Nevertheless, such examples help to characterize each sub-corpus. For instance, the trigram käesolevas töös on 'present work is' occurs in scientific texts but not in the other sub-corpora.
The frequency lists of n-grams include lists of bigrams, trigrams, and tetragrams. The lists have been compiled for the whole Balanced Corpus of Estonian and for each of its three sub-corpora (journalistic texts, fiction, and scientific texts). Therefore, the total number of lists is 12 (3 types of n-grams x 4 (sub)corpora).
The n-grams that include the following elements have been excluded from the lists: 1) tags that mark the beginning and the end of sentences, e.g. <s> and </s>; 2) brackets, e.g. jt (1998) 'et al. (1998)'; 3) more than one punctuation mark, e.g. kass , koer . 'cat , dog .'
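The three exclusion rules can be expressed as a single filter. The sketch below is an assumption-laden illustration rather than the procedure actually used: the function name is hypothetical, n-grams are tuples of plain token strings, and the sets of bracket and punctuation characters are guesses.

```python
SENTENCE_TAGS = {"<s>", "</s>"}
BRACKETS = set("()[]{}")      # assumed set of bracket characters
PUNCTUATION = set(".,;:!?")   # assumed set of punctuation tokens


def is_excluded(ngram):
    """Apply the three exclusion rules to one n-gram (a tuple of tokens)."""
    # Rule 1: sentence boundary tags
    if any(tok in SENTENCE_TAGS for tok in ngram):
        return True
    # Rule 2: brackets anywhere in a token
    if any(ch in BRACKETS for tok in ngram for ch in tok):
        return True
    # Rule 3: more than one punctuation mark
    return sum(1 for tok in ngram if tok in PUNCTUATION) > 1


candidates = [
    ("<s>", "vihane"),
    ("jt", "(1998)"),
    ("kass", ",", "koer", "."),
    ("selleks", ",", "et"),
]
kept = [g for g in candidates if not is_excluded(g)]
print(kept)
```

Only selleks , et survives: it contains a single punctuation mark, no brackets, and no sentence tags.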
The lists are ranked by frequency, starting with the most frequent n-gram. The lists include n-grams that occur at least 10 times.
Tables 1, 2, and 3 present the number of n-grams of lemmas and word forms that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora.
Table 1. Bigrams that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora
Corpus | Bigrams of word forms | Bigrams of lemmas |
Total | 138,544 | 155,864 |
Journalism | 39,497 | 50,051 |
Fiction | 50,893 | 54,762 |
Science | 41,948 | 55,309 |
Table 2. Trigrams that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora
Corpus | Trigrams of word forms | Trigrams of lemmas |
Total | 43,670 | 65,584 |
Journalism | 9,637 | 14,903 |
Fiction | 17,256 | 26,853 |
Science | 10,375 | 15,173 |
Table 3. Tetragrams that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora
Corpus | Tetragrams of word forms | Tetragrams of lemmas |
Total | 9,076 | 16,615 |
Journalism | 1,500 | 2,917 |
Fiction | 3,300 | 6,749 |
Science | 2,398 | 3,615 |
The above tables show that the sub-corpus of fiction generally includes the largest number of n-grams and the sub-corpus of journalistic texts the smallest. The one exception is the bigrams of lemmas, where the sub-corpus of scientific texts (55,309) slightly exceeds fiction (54,762).
Tables 4, 5, and 6 present the number of all n-grams of lemmas and word forms (including those that occur fewer than 10 times) in the Balanced Corpus of Estonian and its sub-corpora. The tables also indicate the number of n-grams that occur only once (hapax legomena).
Bigrams
Table 4. Bigrams in the Balanced Corpus of Estonian and its sub-corpora
Corpus | Bigrams of word forms | Hapax legomena (word forms) | Bigrams of lemmas | Hapax legomena (lemmas) |
Total | 7,091,668 | 5,760,968 | 5,000,628 | 3,784,287 |
Journalism | 2,761,718 | 2,326,100 | 2,064,043 | 1,623,931 |
Fiction | 2,428,911 | 1,986,740 | 1,687,784 | 1,295,874 |
Science | 2,669,528 | 2,154,197 | 1,984,012 | 1,478,192 |
Table 5. Trigrams in the Balanced Corpus of Estonian and its sub-corpora
Corpus | Trigrams of word forms | Hapax legomena (word forms) | Trigrams of lemmas | Hapax legomena (lemmas) |
Total | 11,352,391 | 10,510,398 | 10,112,500 | 9,077,246 |
Journalism | 3,964,840 | 3,756,338 | 3,661,717 | 3,384,829 |
Fiction | 3,982,505 | 3,694,934 | 3,469,303 | 3,114,081 |
Science | 3,765,643 | 3,465,735 | 3,490,457 | 3,123,692 |
Table 6. Tetragrams in the Balanced Corpus of Estonian and its sub-corpora
Corpus | Tetragrams of word forms | Hapax legomena (word forms) | Tetragrams of lemmas | Hapax legomena (lemmas) |
Total | 11,700,325 | 11,340,636 | 11,277,113 | 10,798,089 |
Journalism | 3,952,982 | 3,883,418 | 3,867,422 | 3,768,283 |
Fiction | 4,131,481 | 4,025,433 | 3,942,796 | 3,786,123 |
Science | 3,719,791 | 3,564,570 | 3,642,931 | 3,458,774 |
Vocabulary statistics suggest that words that occur only once make up about half of the vocabulary of the Balanced Corpus of Estonian. For n-grams the share is considerably higher. Comparing tables 3 and 6, it can be observed that only 0.08% of all the tetragrams of word forms in the Balanced Corpus of Estonian occur at least 10 times, and that 97% occur only once; according to the totals in the same tables, about 0.15% of the tetragrams of lemmas occur at least 10 times and about 96% occur only once. Comparing tables 1 and 4, it can be observed that only 2% of the bigrams of word forms occur at least 10 times and 81% occur only once; 3% of the bigrams of lemmas occur at least 10 times and 76% occur only once.
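The shares of frequent n-grams and of hapax legomena can be recomputed directly from the totals for the whole corpus given in tables 1, 3, 4, and 6. The short script below is only a worked calculation; the dictionary layout and function names are illustrative.

```python
# Totals for the whole corpus:
# (all distinct n-grams, hapax legomena, n-grams occurring at least 10 times)
TOTALS = {
    ("bigrams", "word forms"): (7_091_668, 5_760_968, 138_544),
    ("bigrams", "lemmas"): (5_000_628, 3_784_287, 155_864),
    ("tetragrams", "word forms"): (11_700_325, 11_340_636, 9_076),
    ("tetragrams", "lemmas"): (11_277_113, 10_798_089, 16_615),
}


def percent(part, whole):
    """Share of part in whole, as a percentage."""
    return 100.0 * part / whole


def shares(all_types, hapaxes, frequent):
    """Return (share occurring >= 10 times, share occurring once), in percent."""
    return percent(frequent, all_types), percent(hapaxes, all_types)


for (kind, unit), row in TOTALS.items():
    frequent_share, hapax_share = shares(*row)
    print(f"{kind} of {unit}: {frequent_share:.2f}% occur at least 10 times, "
          f"{hapax_share:.0f}% occur once")
```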