
The frequency lists of parts-of-speech and grammatical categories of nominals based on the Balanced Corpus of Estonian

Here you can find frequency lists of parts-of-speech and grammatical categories of nominals. Text classes (text types, genres, registers) are known to differ both in their vocabulary and in the grammatical forms they contain. One of the aims of compiling these frequency lists is to contribute to research on various text classes.

The lists are based on the Balanced Corpus of Estonian, which contains 15 million words and consists of three sub-corpora of the same size: 5 million words of journalistic texts, 5 million words of fiction, and 5 million words of scientific texts. The frequency lists presented here are based on the morphologically disambiguated version of the corpus. Morphological categories follow the system developed by Filosoft. Morphological analysis and disambiguation were performed with the tools Estmorf and t3mesta, predecessors of the open-source tool Vabamorf. This project was supported by the national programme for Estonian Language Technology.

The frequency lists of grammatical categories of nominals are based on the following parts-of-speech:

_A_ adjective in the positive degree, includes declinable as well as indeclinable adjectives, e.g. kallis 'dear';
_C_ adjective in the comparative degree, e.g. laiem 'broader';
_U_ adjective in the superlative degree, e.g. pikim 'the longest';
_S_ noun, e.g. asi 'thing';
_N_ cardinal numeral, e.g. kaks 'two';
_O_ ordinal numeral, e.g. teine 'second';
_P_ pronoun, e.g. mina 'I' or see 'it'.

Table 2 (below) lists the frequencies of the following parts-of-speech:

_V_ verb, e.g. tegema 'do';
_D_ adverb; includes regular adverbs (e.g. kiiresti 'quickly'), proadverbs (e.g. siis 'then') as well as adverbs that serve as verbal particles (e.g. üle 'over' in üle jääma 'be left over');
_J_ conjunction, e.g. ja 'and' or kui 'if';
_K_ adposition; includes prepositions (e.g. üle 'across' in üle tee 'across the road') as well as postpositions (e.g. all 'under' in maja all 'under the house');
_Y_ abbreviation, e.g. USA.

In Estonian, participle forms of verbs often lie on the border between two parts-of-speech: depending on the context, a participle may function as a verb or as an adjective. For this reason, the outcome of automatic disambiguation is not considered reliable for these forms. Past participles (-nud and -tud forms) are therefore always treated as verbs and annotated as such, whereas present participles are always annotated as adjectives.

Numbers may be written out as numerals (kaks 'two') or in digits (2). The latter have been excluded from the present lists. This applies to bare numbers (e.g. 2) as well as to compounds that include digits (e.g. 2-aastane '2-year-old').
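The exclusion of numbers written in digits can be illustrated with a minimal sketch. It assumes that a token is excluded whenever it contains at least one digit character; the function names are hypothetical, and the filter actually used when compiling the lists is not documented here.

```python
def contains_digit(token):
    """Return True if the token has at least one digit character."""
    return any(ch.isdigit() for ch in token)


def drop_digit_tokens(tokens):
    """Keep only the tokens that are written without digits.

    Assumption: both bare numbers ("2") and compounds that include
    digits ("2-aastane") are dropped, as described in the text.
    """
    return [t for t in tokens if not contains_digit(t)]


sample = ["kaks", "2", "2-aastane", "teine"]
print(drop_digit_tokens(sample))  # ['kaks', 'teine']
```

Under this assumption, spelled-out numerals such as kaks and teine remain in the lists, while 2 and 2-aastane are left out.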

The grammatical categories and their abbreviations are listed in Table 1.

Table 1. Abbreviations of the grammatical categories of the noun

ab abessive
abl ablative
ad adessive
adt aditive
all allative
el elative
es essive
g genitive
ill illative
in inessive
kom comitative
n nominative
p partitive
pl plural
sg singular
ter terminative
tr translative
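For readers who process the lists programmatically, the abbreviations in Table 1 can be turned into a lookup table. The sketch below is only an illustration: it assumes that a category string is a space-separated sequence of abbreviations such as "sg g", and the function name `expand_tag` is hypothetical.

```python
# Abbreviations copied from Table 1.
CATEGORIES = {
    "ab": "abessive", "abl": "ablative", "ad": "adessive",
    "adt": "aditive", "all": "allative", "el": "elative",
    "es": "essive", "g": "genitive", "ill": "illative",
    "in": "inessive", "kom": "comitative", "n": "nominative",
    "p": "partitive", "pl": "plural", "sg": "singular",
    "ter": "terminative", "tr": "translative",
}


def expand_tag(tag):
    """Expand a space-separated category string into full names.

    Assumption: abbreviations are separated by whitespace.
    Unknown abbreviations are returned unchanged.
    """
    return [CATEGORIES.get(part, part) for part in tag.split()]


print(expand_tag("sg g"))  # ['singular', 'genitive']
print(expand_tag("pl p"))  # ['plural', 'partitive']
```

Splitting on whitespace before the lookup keeps short abbreviations such as p (partitive) from being confused with longer ones such as pl (plural).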

The frequency of parts-of-speech

Table 2 gives the frequencies of parts-of-speech in the Balanced Corpus of Estonian, for the whole corpus as well as for each sub-corpus (journalistic texts, fiction, and scientific texts). Even though the sub-corpora are roughly the same size, the frequencies of parts-of-speech differ between them. Table 2 shows a clear distinction between fiction and journalistic texts in the proportions of nominals and verbs. Scientific texts, on the other hand, contain a surprisingly large number of adjectives. To some extent, this may be explained by the fact that all present participles were annotated as adjectives.

In the following, the frequency lists are divided into four parts, depending on the morphological information presented in them.

Part 1 includes frequency lists that have been compiled based on the frequencies of the combinations of parts-of-speech, grammatical number, and case. The frequencies are presented for the whole Balanced Corpus of Estonian, as well as for each sub-corpus (journalistic texts, fiction, and scientific texts).

Part 2 includes frequency lists that are based on the frequencies of grammatical number and case. The second part includes 5 lists:
1) Frequency of the combinations of case and grammatical number of all nominals. The frequencies are presented for the whole Balanced Corpus of Estonian as well as for each sub-corpus (journalistic texts, fiction, and scientific texts) (see Table 4);
2) The frequencies of the combinations of case and grammatical number for each part-of-speech in the Balanced Corpus of Estonian (see Table 5);
3) The frequencies of the combinations of case and grammatical number for each part-of-speech in the sub-corpus of journalistic texts (see Table 6);
4) The frequencies of the combinations of case and grammatical number for each part-of-speech in the sub-corpus of fiction (see Table 7);
5) The frequencies of the combinations of case and grammatical number for each part-of-speech in the sub-corpus of scientific texts (see Table 8).

Part 3 includes frequency lists that are based on the case only. The third part includes 5 lists:
1) The frequencies of the cases of nominals in the Balanced Corpus of Estonian. The frequencies are presented for the whole corpus as well as for each sub-corpus (journalistic texts, fiction, and scientific texts) (see Table 9);
2) The frequencies of the cases of nominals for each part-of-speech in the Balanced Corpus of Estonian (see Table 10);
3) The frequencies of cases for each part-of-speech in the sub-corpus of journalistic texts (see Table 11);
4) The frequencies of cases for each part-of-speech in the sub-corpus of fiction (see Table 12);
5) The frequencies of cases for each part-of-speech in the sub-corpus of scientific texts (see Table 13).

Part 4 includes frequency lists that are based on grammatical number only, i.e. the frequency of singular and plural forms in the Balanced Corpus of Estonian and in the sub-corpora (fiction, journalistic texts, and scientific texts).
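The relationship between the four parts can be sketched as follows: if counts are kept for every combination of part-of-speech, number, and case (as in Part 1), the coarser lists of Parts 2 to 4 can be obtained by summing over the categories that are left out. This is only an illustration. The counts below are invented toy values, the function name `project` is hypothetical, and the published lists were not necessarily produced this way.

```python
from collections import Counter


def project(counts, keep):
    """Sum counts over all key positions that are not listed in `keep`.

    `counts` maps (pos, number, case) tuples to frequencies;
    `keep` is a tuple of the key positions to retain.
    """
    out = Counter()
    for key, n in counts.items():
        out[tuple(key[i] for i in keep)] += n
    return out


# Invented toy counts keyed by (part-of-speech, number, case).
toy = Counter({
    ("S", "sg", "n"): 5,
    ("S", "pl", "n"): 2,
    ("A", "sg", "n"): 3,
    ("S", "sg", "g"): 4,
})

number_case = project(toy, (1, 2))  # as in Part 2: number and case
case_only = project(toy, (2,))      # as in Part 3: case only
number_only = project(toy, (1,))    # as in Part 4: number only

print(number_case[("sg", "n")])  # 8
print(case_only[("n",)])         # 10
print(number_only[("sg",)])      # 12
```

The same function could be applied separately to the counts of each sub-corpus to obtain per-sub-corpus lists.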


Last modified: November 26 2015 00:27:21.