The Frequency List of Estonian Literary Language

Lists

Frequency lists of the Balanced Corpus of Estonian:

Frequency lists of the sub-corpora (each sub-corpus includes 5 million words):

The list of lemmas in the sub-corpus of scientific texts in order of frequency
The list of lemmas in the sub-corpus of scientific texts in alphabetical order
The list of word forms in the sub-corpus of scientific texts in order of frequency
The list of word forms in the sub-corpus of scientific texts in alphabetical order

Aggregated lists:

The basics of frequency lists

The frequency lists are based on the Balanced Corpus of Estonian, which was morphologically disambiguated with a statistical disambiguator named t3mesta and then post-disambiguated by a rule-based method. The corpus includes 5 million words of journalistic texts, 5 million words of fiction and 5 million words of scientific texts. Post-disambiguation is needed to resolve forms that are still ambiguous after statistical disambiguation. It can be divided into three main parts:

1) disambiguating words that received the same analysis but have different lemmas, e.g. the form talvel could be an instance of either the lemma talv or the lemma tali. During post-disambiguation, the lemma that was used more frequently in this form in the same text was chosen. When the decision could not be made on the basis of that text alone, the whole sub-corpus was used.

2) disambiguating the nud- and tud-participles that receive more than one analysis in t3mesta because they can be interpreted as either verbs or adjectives. For instance, the form laulnud may be analysed as the verb laulma (e.g. in Ta on laulnud ooperis) or as the adjective laulnud (in Ta ei ühinenud kiidulaulu laulnud kriitikute arvamusega). In post-disambiguation, nud- and tud-participles are always analysed as verbs, regardless of their syntactic function in the sentence.

3) disambiguating the rest of the ambiguous cases.
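
The first two rules above can be illustrated with a minimal Python sketch. It is not the tool that was actually used: the function names, the count tables keyed by (form, lemma), and the part-of-speech tags "V" (verb) and "A" (adjective) are assumptions made for the example.

```python
from collections import Counter


def choose_lemma(form, candidates, text_counts, subcorpus_counts):
    """Pick one lemma for an ambiguous form (rule 1).

    text_counts and subcorpus_counts are hypothetical tables mapping
    (form, lemma) to a frequency. The text is consulted first; the
    sub-corpus is used only if the text gives no clear winner.
    """
    for counts in (text_counts, subcorpus_counts):
        ranked = sorted(
            candidates,
            key=lambda lemma: counts.get((form, lemma), 0),
            reverse=True,
        )
        if len(ranked) == 1:
            return ranked[0]
        top = counts.get((form, ranked[0]), 0)
        second = counts.get((form, ranked[1]), 0)
        if top > second:
            return ranked[0]
    return None  # still undecided; left to the remaining rules


def prefer_verb(analyses):
    """Keep only verb readings of a nud-/tud-participle (rule 2).

    analyses is a list of (lemma, pos) pairs; the tags "V" and "A"
    are assumed names for verb and adjective.
    """
    verbs = [a for a in analyses if a[1] == "V"]
    return verbs if verbs else analyses


text = Counter({("talvel", "talv"): 3, ("talvel", "tali"): 1})
print(choose_lemma("talvel", ["talv", "tali"], text, Counter()))
print(prefer_verb([("laulma", "V"), ("laulnud", "A")]))
```

When the counts in the text are tied, the function falls through to the sub-corpus counts, mirroring the fallback described in rule 1.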

Statistics

The corpus includes a total of 14,438,223 words (excluding punctuation marks). There were 16,610,934 analyses before and 15,000,562 analyses after post-disambiguation. There were 997,934 distinct word forms in the corpus, 580,805 of which occurred only once. In the post-disambiguated text, 18,996 word forms were left with an ambiguous analysis. Most of them (11,940) received a proper noun tag. The other ambiguous words (7,056) included forms that are indistinguishable between singular and plural (e.g. the form on 'is/are' can be analysed as singular or plural), adverbs and conjunctions (e.g. the forms nagu, kui), typos, words in foreign languages etc.; they did not affect the reliability of the frequency lists of lemmas and word forms.
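
A few derived figures follow directly from the numbers above. The short Python sketch below only does arithmetic on the reported counts; the variable names are chosen for the example, and no figure beyond those stated in this section is assumed.

```python
# Counts reported in the Statistics section.
analyses_before = 16_610_934
analyses_after = 15_000_562
word_forms = 997_934
hapax_forms = 580_805      # word forms that occurred only once
ambiguous_total = 18_996
ambiguous_proper = 11_940
ambiguous_other = 7_056

# Analyses removed by post-disambiguation, absolute and relative.
removed = analyses_before - analyses_after
removed_share = removed / analyses_before

# Share of word forms that occurred only once.
hapax_share = hapax_forms / word_forms

# The two groups of ambiguous forms add up to the reported total.
assert ambiguous_proper + ambiguous_other == ambiguous_total

print(removed)
print(f"{removed_share:.1%} of analyses removed")
print(f"{hapax_share:.1%} of word forms occurred only once")
```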

Frequency lists

Three frequency lists are submitted: the list of lemmas, the list of word forms and the aggregated list. The list of lemmas consists of lemmas that occurred at least 10 times. The list of word forms consists of forms that occurred at least 10 times. The aggregated list consists of both the lemmas and the word forms that occurred at least 10 times. The frequency lists of lemmas and word forms are created separately for each sub-corpus: the sub-corpus of journalistic texts (5 million words), fiction (5 million words), and scientific texts (5 million words). All lists have two versions:

1) sorted in descending order of frequency,

2) sorted in alphabetical order.
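
How a list with a minimum frequency of 10 and two sort orders might be produced can be sketched in Python as follows. This is an illustration under assumptions, not the procedure actually used: the function name and the input format (a flat sequence of lemmas or word forms) are hypothetical, and plain string comparison stands in for proper Estonian alphabetical order, which would need locale-aware collation.

```python
from collections import Counter

MIN_FREQ = 10  # threshold stated in the text


def frequency_lists(items, min_freq=MIN_FREQ):
    """Return (by_frequency, alphabetical) lists of (item, count) pairs.

    items is any iterable of lemmas or word forms. Items that occur
    fewer than min_freq times are dropped from both lists.
    """
    counts = Counter(items)
    kept = [(item, n) for item, n in counts.items() if n >= min_freq]
    # Descending frequency; ties are broken by the item itself.
    by_frequency = sorted(kept, key=lambda pair: (-pair[1], pair[0]))
    # Code-point order, used here as a stand-in for Estonian collation.
    alphabetical = sorted(kept, key=lambda pair: pair[0])
    return by_frequency, alphabetical


sample = ["ja"] * 12 + ["on"] * 12 + ["ei"] * 10 + ["talv"] * 9
by_frequency, alphabetical = frequency_lists(sample)
print(by_frequency)
print(alphabetical)
```

In the sample, talv occurs only 9 times and is therefore missing from both lists.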

Excluded from the frequency lists

Besides the lemmas and word forms that occurred fewer than 10 times, the following elements were excluded from the lists: punctuation marks, abbreviations, numbers, Roman numerals, proper nouns, genitive attributes (G tag in the morphological analyzer's output) and tagging mistakes. In addition, foreign words and proper nouns that were tagged incorrectly were removed manually.
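
The automatic part of this filtering could look roughly like the Python sketch below. Only the G tag is taken from the text above; the other tag names ("H" for proper nouns, "Y" for abbreviations, "Z" for punctuation), the regular expressions and the function name are assumptions made for the example. Tagging mistakes and the manually removed items cannot be captured by such a rule.

```python
import re

# "G" is named in the text; "H", "Y" and "Z" are assumed tag names
# for proper nouns, abbreviations and punctuation marks.
EXCLUDED_TAGS = {"G", "H", "Y", "Z"}

# Digits, optionally with decimal or thousands separators.
NUMBER = re.compile(r"\d+([.,]\d+)*")

# Upper-case Roman numerals; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def keep(token, tag):
    """Return True if the token may enter the frequency lists."""
    if tag in EXCLUDED_TAGS:
        return False
    if NUMBER.fullmatch(token) or ROMAN.fullmatch(token):
        return False
    return True


tokens = [("talv", "S"), ("Tartu", "H"), ("1999", "N"), ("XIV", "N"), ("on", "V")]
print([token for token, tag in tokens if keep(token, tag)])
```

A purely pattern-based check for Roman numerals also rejects ordinary upper-case strings that happen to be valid numerals, so in practice it would need to be combined with the tagger's output.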

Last modified: January 21 2019 19:41:45.