The database of Estonian verbal multi-word expressions

What is it?

This database contains one subtype of multi-word expressions: those consisting of a verb and a particle, or of a verb and its complements. Expressions consisting of a verb and its subject are not included. Multi-word units consisting of a verb and a non-finite verb form are included only unsystematically.

The present version of the database contains ca 13 000 expressions.

Usage

The database is available for downloading here. You can also use a simple search engine.

The database has been compiled on the basis of

Dictionaries and wordlists, aimed at human users, namely

the Explanatory Dictionary of Estonian (EKSS 1988-2000)
Index of the Thesaurus of Estonian (Saareste 1979)
a list of particle verbs (Hasselblatt 1990)
Phraseological Dictionary (Õim 1993)
Dictionary of Synonyms (Õim 1991)
Filosoft thesaurus

Verbal multi-word expressions extracted automatically from corpora totaling 20 million words and post-edited manually. The collocation extraction experiment is described in Kaalep and Muischnek (2003).

Each entry in the database consists of 11 fields. The fields are delimited with colons and contain the following information:

Field 1
The expression itself. The verbal component is recorded in the supine form, the traditional citation form of Estonian verbs in dictionaries. In expressions consisting of a verb and a noun or noun phrase, the noun can be ’frozen’ in one case form or can allow certain case alternations. If the nominal component is ’frozen’, it is recorded in that case form. If the nominal component allows case alternations, it is recorded in the partitive case form, and the information about the alternation is given in the morphological analysis (see field 11).

Field 2
The subtype of the expression. The possible subtypes are:

yv – particle verb
nv – expression consisting of a noun (phrase) and a verb; could be divided further into idiomatic expressions and collocations
tv – support verb construction
av – catenative verb construction
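
As a minimal sketch of how fields 1 and 2 might be read, the following Python splits one colon-delimited record and looks up the subtype code. The record is made up for illustration and is not copied from the database; the names SUBTYPES and split_record are hypothetical, and the sketch assumes that colons do not occur inside the first ten fields.

```python
# Subtype codes as listed for field 2.
SUBTYPES = {
    "yv": "particle verb",
    "nv": "noun (phrase) + verb expression",
    "tv": "support verb construction",
    "av": "catenative verb construction",
}


def split_record(line):
    """Split one record into its 11 colon-delimited fields."""
    # maxsplit=10 yields at most 11 pieces, so any colon inside the
    # last field (the morphological analysis) is left untouched.
    fields = line.rstrip("\n").split(":", 10)
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return fields


# Hypothetical record, invented for illustration only.
sample = "alla andma:yv:x:x:x:x:-:-:x:12:<morf>...</morf>"
fields = split_record(sample)
expression, subtype = fields[0], fields[1]
print(expression, "->", SUBTYPES.get(subtype, "unknown subtype"))
```

Limiting the number of splits is a defensive choice: it keeps the sketch working even if the last field turns out to contain a colon.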

Fields 3-9
These fields show whether the expression was recorded in a given dictionary or wordlist and/or was retrieved with collocation extraction methods: x means yes, - means no.

field 3: Phraseological Dictionary (Õim 1993)
field 4: the Explanatory Dictionary of Estonian (EKSS 1988-2000)
field 5: Filosoft thesaurus (https://www.filosoft.ee/thes_et/)
field 6: a list of particle verbs (Hasselblatt 1990)
field 7: Index of the Thesaurus of Estonian (Saareste 1979)
field 8: Dictionary of Synonyms (Õim 1991)
field 9: automatically extracted collocations
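
The x/- flags in fields 3-9 can be turned into a list of source names. The sketch below is one possible way to do this in Python; the function name sources_of and the sample record are hypothetical, and the record is invented rather than taken from the database.

```python
# Source names in the order of fields 3-9.
SOURCES = [
    "Phraseological Dictionary (Õim)",
    "Explanatory Dictionary of Estonian (EKSS)",
    "Filosoft thesaurus",
    "list of particle verbs (Hasselblatt)",
    "Index of the Thesaurus of Estonian (Saareste)",
    "Dictionary of Synonyms (Õim)",
    "automatically extracted collocations",
]


def sources_of(fields):
    """Return the names of the sources whose flag is 'x'.

    `fields` is the list of 11 fields of one record; fields 3-9
    are at list indices 2-8.
    """
    flags = fields[2:9]
    for flag in flags:
        if flag not in ("x", "-"):
            raise ValueError(f"unexpected flag value: {flag!r}")
    return [name for name, flag in zip(SOURCES, flags) if flag == "x"]


# Hypothetical record, invented for illustration only.
record = "alla andma:yv:x:x:x:x:-:-:x:12:<morf>...</morf>".split(":", 10)
found_in = sources_of(record)
print(found_in)
```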

Field 10
If the expression was found and tagged in the 313 000-word corpus (see the documentation of the corpus), this field gives the number of its occurrences in that corpus.

Field 11
Morphological analysis of the expression
The field is delimited by the <morf> and </morf> tags.
The components of the expression, each together with its analysis, are separated by the curly bracket {
For each word(form) the following information is presented:
<wordform> <lemma>+<ending> // <morphological categories> //
<wordform> is the wordform in text
<lemma> for nouns is the singular nominative; for verbs it is the supine (also called the ma-infinitive), presented here without the infinitival ending -ma.
<ending> is the morphological (not derivational) ending. Multiple endings on one wordform (e.g. case + plural) are treated as one, and the particle GI/KI is not separated from the other endings. If the word has no ending or cannot take one, the zero ending is given (+0).
<morphological categories> are given in the table of morphological categories (https://www.cl.ut.ee/korpused/morfliides/seletus).
If the word-form is a compound or a derived word, then:
the components of a compound are separated by '_';
suffixes are separated from the lemma by '='. The presentation of the suffixes is not consistent: only a predefined set of productive suffixes has been annotated;
for compounds, the lemma is given only for the rightmost component.
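
The layout of field 11 can be illustrated with a small parser. This is only a sketch under assumptions: the sample string is invented, its category strings are placeholders rather than verified tags, and the exact whitespace around the '{' and '//' markers in the real data may differ. The function name parse_morf is hypothetical.

```python
import re


def parse_morf(field11):
    """Parse field 11 into a list of per-word analyses."""
    m = re.fullmatch(r"\s*<morf>(.*)</morf>\s*", field11, flags=re.DOTALL)
    if m is None:
        raise ValueError("field 11 is not wrapped in <morf>...</morf>")
    components = []
    # Components of the expression are separated by '{'.
    for chunk in m.group(1).split("{"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Expected layout: wordform lemma+ending // categories //
        head, _, rest = chunk.partition("//")
        wordform, lemma_ending = head.split()
        # The ending follows the last '+'; '0' marks the zero ending.
        lemma, _, ending = lemma_ending.rpartition("+")
        categories = rest.strip().rsplit("//", 1)[0].strip()
        components.append({
            "wordform": wordform,
            "lemma": lemma,
            "ending": ending,
            "categories": categories,
            # Compound components are separated by '_' in the lemma.
            "compound_parts": lemma.split("_"),
        })
    return components


# Hypothetical field 11 value, invented for illustration only.
sample_morf = "<morf>alla alla+0 //_D_ // { andma and+ma //_V_ ma // </morf>"
analysis = parse_morf(sample_morf)
for word in analysis:
    print(word["wordform"], word["lemma"], word["ending"], word["categories"])
```

The sketch does not interpret the '=' suffix separator, since only some suffixes are annotated.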

References

Õim, A. Sünonüümisõnastik Tallinn 1991
Õim, A. Fraseoloogiasõnaraamat ETA KKI, Tallinn 1993
Filosoft thesaurus (https://www.filosoft.ee)
List of particle verbs from Cornelius Hasselblatt "Das Estnische Partikelverb als Lehnübersetzung aus dem Deutschen" Wiesbaden 1990
Kaalep, H-J. and Muischnek, K. Inconsistent Selectional Criteria in Semi-automatic Multi-word Unit Extraction. In: COMPLEX 2003, 7th Conference on Computational Lexicography and Corpus Research. Ed. by F. Kiefer, J. Pajzs, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest 2003, pp. 27-36.
Saareste, A. Eesti keele mõistelise sõnaraamatu indeks Finsk-ugriska institutsionen, Uppsala, 1979
EKSS Eesti kirjakeele seletussõnaraamat (A - sentimeetririhm) ETA KKI, Tallinn, 1988 - 1999

Webmaster Last modified: January 04 2019 10:46:50.