Lists

Frequency lists of The Balanced Corpus of Estonian:

Frequency lists of the sub-corpora (each sub-corpus includes 5 million words):

Aggregated lists:

The basics of frequency lists

The frequency lists were created with a statistical disambiguator named t3mesta from the morphologically disambiguated Balanced Corpus of Estonian, which was then post-disambiguated by a rule-based method. The corpus includes 5 million words of journalistic texts, 5 million words of fiction and 5 million words of scientific texts. Post-disambiguation is needed to resolve forms that remain ambiguous after the initial disambiguation. Post-disambiguation can be divided into three main parts:

Statistics

The corpus contains a total of 14,438,223 words (excluding punctuation marks). There were 16,610,934 analyses before post-disambiguation and 15,000,562 analyses after it. There were 997,934 word forms in the corpus, 580,805 of which occurred only once. In the post-disambiguated text, 18,996 word forms still had an ambiguous analysis. Most of them (11,940) received the proper noun tag. The remaining 7,056 ambiguous words did not affect the reliability of the frequency lists of lemmas and word forms. They involved ambiguity between singular and plural (e.g. the form on 'is/are' can be analysed as singular or plural), adverbs and conjunctions (e.g. the forms nagu, kui), typos, words in foreign languages etc.
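The figures above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers quoted in this section. The variable names are illustrative and not part of any corpus tool.

```python
# Figures quoted in the Statistics paragraph (variable names are illustrative).
analyses_before = 16_610_934
analyses_after = 15_000_562
word_forms = 997_934
forms_occurring_once = 580_805
ambiguous_forms = 18_996
ambiguous_proper_nouns = 11_940
ambiguous_other = 7_056

# Analyses removed by post-disambiguation.
removed = analyses_before - analyses_after

# Share of word forms that occur only once in the corpus.
once_share = forms_occurring_once / word_forms

# The two groups of ambiguous forms add up to the reported total.
assert ambiguous_proper_nouns + ambiguous_other == ambiguous_forms

print(f"analyses removed by post-disambiguation: {removed:,}")  # 1,610,372
print(f"share of word forms occurring once: {once_share:.1%}")  # 58.2%
```

So post-disambiguation removed roughly 1.6 million analyses, and a little over half of all word forms occur only once, which is one reason the lists use a minimum frequency threshold.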

Frequency lists

Three frequency lists are provided: the list of lemmas, the list of word forms and the aggregated list. The list of lemmas consists of lemmas that occurred at least 10 times. The list of word forms consists of forms that occurred at least 10 times. The aggregated list consists of both the lemmas and the word forms that occurred at least 10 times. The frequency lists of lemmas and word forms were also created separately for each sub-corpus: journalistic texts (5 million words), fiction (5 million words) and scientific texts (5 million words). All lists have two versions:

Excluded from the frequency lists

Besides the lemmas and word forms that occurred fewer than 10 times, the following elements were excluded from the lists: punctuation marks, abbreviations, numbers, Roman numerals, proper nouns, genitive attributes (G tag in the morphological analyzer's output) and tagging mistakes. In addition, foreign words and proper nouns that had been tagged incorrectly were removed manually.
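The threshold and the tag-based exclusions described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used to build the lists. The input format (triples of word form, lemma and tag), the function name `build_frequency_lists` and every tag letter other than G are assumptions made for the example. The manual removal of mistagged items is not modelled.

```python
from collections import Counter

MIN_FREQ = 10  # threshold stated in the text

# "G" (genitive attribute) is named in the text; the other tag letters are
# assumed here for punctuation, abbreviations, numbers and proper nouns.
EXCLUDED_TAGS = {"Z", "Y", "N", "H", "G"}


def build_frequency_lists(tokens, min_freq=MIN_FREQ, excluded_tags=EXCLUDED_TAGS):
    """Count lemmas and word forms, skipping excluded tags.

    tokens: iterable of (word_form, lemma, tag) triples (hypothetical format).
    Returns two lists of (item, count) pairs sorted by descending count.
    """
    lemma_counts = Counter()
    form_counts = Counter()
    for form, lemma, tag in tokens:
        if tag in excluded_tags:
            continue  # excluded category: not counted at all
        lemma_counts[lemma] += 1
        form_counts[form] += 1

    # Keep only items that reach the minimum frequency.
    lemmas = [(l, c) for l, c in lemma_counts.most_common() if c >= min_freq]
    forms = [(f, c) for f, c in form_counts.most_common() if c >= min_freq]
    return lemmas, forms


# Toy input: "olema" appears 15 times in total, but only the form "on"
# reaches the threshold by itself; the proper noun is excluded by its tag.
tokens = (
    [("on", "olema", "V")] * 12
    + [("oli", "olema", "V")] * 3
    + [("Tallinna", "Tallinn", "H")] * 20
    + [("maja", "maja", "S")] * 9
)
lemmas, forms = build_frequency_lists(tokens)
print(lemmas)  # [('olema', 15)]
print(forms)   # [('on', 12)]
```

The toy input shows why the lemma list and the word form list are kept separate: a lemma can pass the threshold through the combined count of its forms even when some of those forms are individually too rare.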


Last modified: December 09 2015 21:07:35.