
The frequency lists of n-grams of lemmas and word forms based on the Balanced Corpus of Estonian

Please scroll down to find the general principles of compiling the frequency lists.

Lists:

1. Bigrams:
Bigrams of word forms in the Balanced Corpus of Estonian
Bigrams of lemmas in the Balanced Corpus of Estonian
Bigrams of word forms in the sub-corpus of fiction
Bigrams of lemmas in the sub-corpus of fiction
Bigrams of word forms in the sub-corpus of journalistic texts
Bigrams of lemmas in the sub-corpus of journalistic texts
Bigrams of word forms in the sub-corpus of scientific texts
Bigrams of lemmas in the sub-corpus of scientific texts

2. Trigrams:
Trigrams of word forms in the Balanced Corpus of Estonian
Trigrams of lemmas in the Balanced Corpus of Estonian
Trigrams of word forms in the sub-corpus of fiction
Trigrams of lemmas in the sub-corpus of fiction
Trigrams of word forms in the sub-corpus of journalistic texts
Trigrams of lemmas in the sub-corpus of journalistic texts
Trigrams of word forms in the sub-corpus of scientific texts
Trigrams of lemmas in the sub-corpus of scientific texts

3. Tetragrams:
Tetragrams of word forms in the Balanced Corpus of Estonian
Tetragrams of lemmas in the Balanced Corpus of Estonian
Tetragrams of word forms in the sub-corpus of fiction
Tetragrams of lemmas in the sub-corpus of fiction
Tetragrams of word forms in the sub-corpus of journalistic texts
Tetragrams of lemmas in the sub-corpus of journalistic texts
Tetragrams of word forms in the sub-corpus of scientific texts
Tetragrams of lemmas in the sub-corpus of scientific texts

Compilation of the frequency lists has been supported by the national programme for Estonian Language Technology.

N-grams are defined here as strings of two, three, or four adjacent words. N-grams are not synonymous with collocations. Collocations are words that occur together in certain contexts (such as clauses), and their components may, but need not, be adjacent to each other. Thus, the words ajas pilli lõhki 'carried [something] too far' in example (1) form a collocation but not a trigram, because other words intervene. In example (2), on the other hand, ajas pilli lõhki forms a collocation as well as a trigram.

(1) Siis aga ajas vihane herilane pilli hoopis lõhki. 'Then the angry wasp carried it too far'
(2) Vihane herilane ajas pilli lõhki. 'The angry wasp carried it too far'
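
The difference between examples (1) and (2) can be checked mechanically. The following is a minimal sketch, not the tool used to compile the lists: the function name ngrams and the whitespace tokenisation are assumptions made for illustration only.

```python
def ngrams(tokens, n):
    """Return every window of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

target = ("ajas", "pilli", "lõhki")

# Example (1): the three words are separated by other words.
s1 = "Siis aga ajas vihane herilane pilli hoopis lõhki .".split()
# Example (2): the three words are adjacent.
s2 = "Vihane herilane ajas pilli lõhki .".split()

in_trigrams_1 = target in ngrams(s1, 3)  # not a trigram in (1)
in_trigrams_2 = target in ngrams(s2, 3)  # a trigram in (2)

# A collocation only requires that the words co-occur in the same context.
co_occurs_1 = set(target) <= set(s1)
co_occurs_2 = set(target) <= set(s2)

print(in_trigrams_1, in_trigrams_2, co_occurs_1, co_occurs_2)
```

Here the co-occurrence test stands in for a real collocation measure, which would normally also involve a statistical association score.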

The frequency lists presented here are based on the morphologically disambiguated version of the Balanced Corpus of Estonian, which consists of three sub-corpora of 5 million words each: journalistic texts, fiction, and scientific texts. The lists have been compiled for the whole Balanced Corpus of Estonian as well as for each sub-corpus, and they include n-grams of lemmas as well as n-grams of word forms. Morphological categories follow the system developed by Filosoft. The texts were disambiguated using a statistical tool based on trigrams.

The n-grams of word forms are not case sensitive, i.e. proper nouns and common nouns are not distinguished. The n-grams of lemmas, on the other hand, are case sensitive, i.e. proper nouns and common nouns are distinguished.
The n-grams also include punctuation marks, most frequently the comma. Thus, the most frequent bigrams include , et and the most frequent trigrams include selleks , et. All punctuation marks carry the tag #Z# in the lists of lemmas and the tag #z# in the lists of word forms. The tags make the punctuation marks easy to detect and search for: researchers who do not wish to include such n-grams can delete them, and those who wish to study only n-grams with punctuation marks can find them. Brackets have been excluded from the lists.
The frequency lists of n-grams include numbers and abbreviations.
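
As an illustration of how the tags can be used, the sketch below splits a list into n-grams with and without punctuation. The row layout (a tuple of tokens plus a frequency), the way the tag is attached to the token, and the counts are all assumptions invented for this example; the actual list files may be laid out differently.

```python
PUNCT_TAGS = ("#z#", "#Z#")  # tags named in the description above

def has_punctuation(ngram):
    """True if any token of the n-gram carries a punctuation tag."""
    return any(tag in token for token in ngram for tag in PUNCT_TAGS)

# Hypothetical rows: (n-gram, frequency). Layout and counts are invented.
rows = [
    ((",#z#", "et"), 1200),
    (("selleks", ",#z#", "et"), 300),
    (("ei", "ole"), 900),
]

without_punct = [row for row in rows if not has_punctuation(row[0])]
only_punct = [row for row in rows if has_punctuation(row[0])]

print(without_punct)
print(only_punct)
```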

The frequency lists presented here include only the n-grams that occur at least ten times in the Balanced Corpus of Estonian.
This frequency threshold changes the relationship between the numbers of word-form n-grams and lemma n-grams. Without a threshold, there are more unique n-grams of word forms than unique n-grams of lemmas, because several inflected forms share one lemma. Once the lists are restricted to n-grams that occur at least ten times, however, the n-grams of lemmas outnumber the n-grams of word forms (compare tables 4, 5, and 6 with tables 1, 2, and 3). Thus, excluding the less frequent n-grams reduces the number of word-form n-grams considerably, whereas the number of lemma n-grams is affected to a much smaller extent.
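
The effect of the threshold can be reproduced on a toy example. In the sketch below the observations and their counts are invented: three inflected forms of the same lemma bigram occur four times each, so none of the word-form bigrams reaches the threshold while the lemma bigram does.

```python
from collections import Counter

MIN_FREQ = 10  # threshold stated for the published lists

# Hypothetical (word-form bigram, lemma bigram) observations, made-up counts.
observations = (
    [(("suur", "maja"), ("suur", "maja"))] * 4
    + [(("suure", "maja"), ("suur", "maja"))] * 4
    + [(("suurt", "maja"), ("suur", "maja"))] * 4
)

form_counts = Counter(form for form, _ in observations)
lemma_counts = Counter(lemma for _, lemma in observations)

def at_least(counts, min_freq):
    """Keep only the n-grams that occur at least min_freq times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}

kept_forms = at_least(form_counts, MIN_FREQ)
kept_lemmas = at_least(lemma_counts, MIN_FREQ)

print(len(form_counts), len(lemma_counts))  # unique n-grams before the cut
print(len(kept_forms), len(kept_lemmas))    # unique n-grams after the cut
```

Before the cut there are more unique word-form bigrams than lemma bigrams; after the cut the relationship is reversed, which is the pattern visible in the tables.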

The distribution of the n-grams across texts is not taken into account. Thus, it is possible that certain highly frequent n-grams occur in only one text of the Balanced Corpus of Estonian. Excluding the less frequent n-grams reduces the chances of including n-grams that occur in only a small number of texts, but it does not exclude n-grams that are highly frequent in one or a few texts and absent from the others. Nevertheless, such examples help to characterize each sub-corpus. For instance, the trigram käesolevas töös on 'in the present work is' occurs in scientific texts but not in the other sub-corpora.
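
Readers who need dispersion information would have to compute it from the corpus texts themselves, since the lists do not contain it. A minimal sketch, assuming each text is available as a list of tokens (the toy texts and the function names are invented for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every window of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def total_and_text_frequency(texts, n):
    """Count occurrences overall and the number of texts containing each n-gram."""
    total, per_text = Counter(), Counter()
    for tokens in texts:
        grams = ngrams(tokens, n)
        total.update(grams)
        per_text.update(set(grams))  # each text counts once per n-gram
    return total, per_text

# Invented toy texts; the first one repeats the same trigram.
texts = [
    "käesolevas töös on uuritud ja käesolevas töös on näidatud".split(),
    "ta on kodus".split(),
    "see on hea".split(),
]

total, per_text = total_and_text_frequency(texts, 3)
key = ("käesolevas", "töös", "on")
print(total[key], per_text[key])
```

An n-gram whose total frequency is high but whose text frequency is 1 is frequent only because a single text repeats it.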

Statistics

The frequency lists of n-grams include lists of bigrams, trigrams, and tetragrams. The lists are based on the texts of the Balanced Corpus of Estonian, which consists of three sub-corpora: journalistic texts, fiction, and scientific texts. With three types of n-grams and four (sub-)corpora (the whole corpus and its three sub-corpora), there are 12 lists each for word forms and for lemmas, i.e. 24 lists in total.
The n-grams that include the following elements have been excluded from the lists: 1) tags that mark the beginning and the end of sentences, e.g. <s> and </s>; 2) brackets, e.g. jt (1998) 'et al. (1998)'; 3) more than one punctuation mark, e.g. kass , koer . 'cat , dog .' The lists are ranked in descending order of frequency and include only n-grams that occur at least 10 times.
Tables 1, 2, and 3 present the numbers of n-grams of lemmas and word forms that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora.
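
The three exclusion rules can be expressed as a single predicate. This is a sketch under assumptions, not the original compilation script: tokens are taken to be plain strings, a punctuation mark is taken to be a token made up entirely of ASCII punctuation characters, and the name keep_ngram is invented.

```python
import string

SENTENCE_TAGS = {"<s>", "</s>"}
BRACKETS = set("()[]{}")

def is_punctuation(token):
    """True for a non-empty token consisting only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def keep_ngram(ngram):
    """Apply the three exclusion rules; True means the n-gram stays in the list."""
    if any(token in SENTENCE_TAGS for token in ngram):
        return False  # rule 1: sentence boundary tags
    if any(ch in BRACKETS for token in ngram for ch in token):
        return False  # rule 2: brackets
    if sum(is_punctuation(token) for token in ngram) > 1:
        return False  # rule 3: more than one punctuation mark
    return True

examples = [
    ("<s>", "Vihane"),
    ("jt", "(1998)"),
    ("kass", ",", "koer", "."),
    ("selleks", ",", "et"),
]
for ngram in examples:
    print(ngram, keep_ngram(ngram))
```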

Bigrams

Table 1. Bigrams that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora

Corpus        Bigrams of word forms   Bigrams of lemmas
Total                       138,544             155,864
Journalism                   39,497              50,051
Fiction                      50,893              54,762
Science                      41,948              55,309

Trigrams

Table 2. Trigrams that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora

Corpus        Trigrams of word forms   Trigrams of lemmas
Total                         43,670               65,584
Journalism                     9,637               14,903
Fiction                       17,256               26,853
Science                       10,375               15,173

Tetragrams

Table 3. Tetragrams that occur at least 10 times in the Balanced Corpus of Estonian and its sub-corpora

Corpus        Tetragrams of word forms   Tetragrams of lemmas
Total                            9,076                 16,615
Journalism                       1,500                  2,917
Fiction                          3,300                  6,749
Science                          2,398                  3,615

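The above tables show that the sub-corpus of journalistic texts contains the smallest number of n-grams in every list. The sub-corpus of fiction contains the largest number in every list except the bigrams of lemmas, where the sub-corpus of scientific texts is slightly larger (55,309 against 54,762).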

Tables 4, 5, and 6 present the numbers of all n-grams of lemmas and word forms (including those that occur fewer than 10 times) in the Balanced Corpus of Estonian and its sub-corpora. The tables also give the number of n-grams that occur only once (hapax legomena).

Bigrams

Table 4. Bigrams in the Balanced Corpus of Estonian and its sub-corpora

Corpus        Bigrams of word forms   Hapax legomena   Bigrams of lemmas   Hapax legomena
Total                     7,091,668        5,760,968           5,000,628        3,784,287
Journalism                2,761,718        2,326,100           2,064,043        1,623,931
Fiction                   2,428,911        1,986,740           1,687,784        1,295,874
Science                   2,669,528        2,154,197           1,984,012        1,478,192

Trigrams

Table 5. Trigrams in the Balanced Corpus of Estonian and its sub-corpora

Corpus        Trigrams of word forms   Hapax legomena   Trigrams of lemmas   Hapax legomena
Total                     11,352,391       10,510,398           10,112,500        9,077,246
Journalism                 3,964,840        3,756,338            3,661,717        3,384,829
Fiction                    3,982,505        3,694,934            3,469,303        3,114,081
Science                    3,765,643        3,465,735            3,490,457        3,123,692

Tetragrams

Table 6. Tetragrams in the Balanced Corpus of Estonian and its sub-corpora

Corpus        Tetragrams of word forms   Hapax legomena   Tetragrams of lemmas   Hapax legomena
Total                       11,700,325       11,340,636             11,277,113       10,798,089
Journalism                   3,952,982        3,883,418              3,867,422        3,768,283
Fiction                      4,131,481        4,025,433              3,942,796        3,786,123
Science                      3,719,791        3,564,570              3,642,931        3,458,774

Vocabulary statistics suggest that words occurring only once make up about half of the vocabulary of the Balanced Corpus of Estonian. For n-grams the share is considerably higher. Comparing tables 3 and 6, only 0.08% of all the tetragrams of word forms in the Balanced Corpus of Estonian occurred at least 10 times, and 97% occurred only once; 0.15% of the tetragrams of lemmas occurred at least 10 times and 96% occurred only once. Comparing tables 1 and 4, only 2% of the bigrams of word forms occurred at least 10 times and 81% occurred only once; 3% of the bigrams of lemmas occurred at least 10 times and 76% occurred only once.
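
The shares discussed here can be recomputed from the 'Total' rows of the tables (tables 1 and 4 for bigrams, tables 3 and 6 for tetragrams). The following minimal sketch uses only figures printed in those tables; the helper name share and the dictionary layout are illustrative choices.

```python
def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total

# (all unique n-grams, hapax legomena, n-grams occurring at least 10 times),
# taken from the 'Total' rows of tables 4 and 1 (bigrams) and 6 and 3 (tetragrams).
rows = {
    "bigrams of word forms": (7_091_668, 5_760_968, 138_544),
    "bigrams of lemmas": (5_000_628, 3_784_287, 155_864),
    "tetragrams of word forms": (11_700_325, 11_340_636, 9_076),
    "tetragrams of lemmas": (11_277_113, 10_798_089, 16_615),
}

for label, (total, hapax, frequent) in rows.items():
    print(
        f"{label}: {share(frequent, total):.2f}% occur at least 10 times, "
        f"{share(hapax, total):.0f}% occur once"
    )
```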


Last modified: December 09 2015 21:11:48.