The database of Estonian verbal multi-word expressions


What is it?

This database contains one subtype of multi-word expressions, namely those consisting of a verb and a particle, or of a verb and its complements. Expressions consisting of a verb and its subject are not included. Multi-word units consisting of a verb and a non-finite form of a verb are included only irregularly.

The present version of the database contains ca 13 000 expressions.


The database has been compiled on the basis of

  1. Dictionaries and wordlists, aimed at human users, namely
  2. The verbal multi-word expressions, extracted automatically from corpora totaling 20 million words and post-edited manually. This collocation extraction experiment is described in (Kaalep, Muischnek 2003).

Each entry in the database consists of 11 fields. The fields are delimited with colons and contain the following information:
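
As a rough illustration of the layout described above, the following Python sketch splits one entry into its 11 colon-delimited fields. The names MweRecord and parse_record are hypothetical, and so is the sample line in the usage note. The sketch assumes one entry per line and that only the last field might itself contain a colon, which is why the split is limited to the first ten colons.

```python
from typing import List, NamedTuple


class MweRecord(NamedTuple):
    """One database entry (hypothetical container, not part of the database)."""
    expression: str          # field 1
    subtype: str             # field 2
    source_flags: List[str]  # fields 3-9, each "x" or "-"
    corpus_count: str        # field 10, kept as text
    morphology: str          # field 11, including the <morf> tags


def parse_record(line: str) -> MweRecord:
    # Split on the first ten colons only, so that a colon inside
    # the last field (an assumption) would not break the record.
    parts = line.rstrip("\n").split(":", 10)
    if len(parts) != 11:
        raise ValueError(f"expected 11 fields, got {len(parts)}")
    return MweRecord(parts[0], parts[1], parts[2:9], parts[9], parts[10])
```

For example, parse_record("aru saama:SUBTYPE:x:x:-:-:-:x:x:12:<morf>...</morf>") would yield an entry whose expression is "aru saama" and whose source_flags list has seven items. This sample line is invented for illustration and is not taken from the database.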

Field 1
The expression itself. The verbal component is recorded in the supine form, the traditional citation form of Estonian verbs in dictionaries. In expressions consisting of a verb and a noun or noun phrase, the noun can either be ’frozen’ in a certain case form or allow certain case alternations. If the nominal component is ’frozen’, it is recorded in the database in that case form. If it can undergo case alternations, it is recorded in the partitive case form, and the information about the case alternation is given in the morphological analysis (cf field 11).

Field 2
The subtype of the expression. The possible subtypes are:

Fields 3-9
Whether the expression was recorded in a certain dictionary/wordlist and/or was retrieved with collocation extraction methods. x means yes, - stands for no.

field 3: Phraseological dictionary (Õim 1991)
field 4: the Explanatory Dictionary of Estonian (EKSS 1988-2000)
field 5: Filosoft thesaurus ( thes_et/)
field 6: a list of particle verbs (Hasselblatt 1990)
field 7: Index of the Thesaurus of Estonian (Saareste 1979)
field 8: Dictionary of Synonyms (Õim 1993)
field 9: automatically extracted collocations
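
The x/- flags in fields 3-9 can be turned into a readable list of sources. The Python sketch below is only an illustration: the names SOURCES and sources_of are hypothetical, and the labels are shortened from the list above. It assumes the seven flags are given in field order.

```python
from typing import List

# Labels for fields 3-9, in field order (shortened from the list above).
SOURCES = [
    "Phraseological dictionary (Õim 1991)",
    "Explanatory Dictionary of Estonian (EKSS 1988-2000)",
    "Filosoft thesaurus",
    "list of particle verbs (Hasselblatt 1990)",
    "Index of the Thesaurus of Estonian (Saareste 1979)",
    "Dictionary of Synonyms (Õim 1993)",
    "automatically extracted collocations",
]


def sources_of(flags: List[str]) -> List[str]:
    """Return the names of the sources whose flag is "x"."""
    if len(flags) != len(SOURCES):
        raise ValueError(f"expected {len(SOURCES)} flags, got {len(flags)}")
    # "x" means the expression was recorded in that source, "-" means it was not.
    return [name for name, flag in zip(SOURCES, flags) if flag == "x"]
```

With the flags x, -, -, -, -, -, x the function returns the first and the last label, i.e. the expression is in the phraseological dictionary and was also found by collocation extraction.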

Field 10
If the expression was found and tagged in the 313 000-word corpus (cf the documentation of the corpus), this field gives the number of its occurrences in the corpus.

Field 11
Morphological analysis of the expression
The field is delimited by the <morf> and </morf> tags.
The different components of the expression together with their analysis are separated by the curly bracket {
For each word(form) the following information is presented:
wordform       lemma+ending // morphological categories //
<wordform> is the wordform in text
<lemma> for nouns is singular nominative, for verbs it is the form of the supine (also called ma-infinitive; the lemma is presented here without the infinitival ending -ma).
<ending> is the morphological (not derivational) ending. The separate endings one wordform can have (e.g. case+plural) are treated as one, and the particle GI/KI has not been separated from the other endings. If the word doesn't or can't have any endings, the zero ending is given (+0).
<morphological categories> are given in the table of morphological categories (
If the word-form is a compound or a derived word, then:
the components of a compound are separated by '_' ;
suffixes are separated from the lemma by '='. The presentation of the suffixes is not consistent: only a predefined set of productive suffixes has been annotated;
for compounds, the lemma is given only for the rightmost component.
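
The conventions of field 11 can be sketched in Python as follows. This is a minimal sketch under several assumptions: each component contains exactly one analysis, the curly bracket { stands between components, and the wordform and the lemma+ending contain no whitespace. The names WordAnalysis, parse_morf and split_lemma are hypothetical, and the category strings in the usage note are placeholders rather than real category codes.

```python
import re
from typing import List, NamedTuple, Tuple


class WordAnalysis(NamedTuple):
    wordform: str
    lemma: str       # may still contain '_' and '=' markers
    ending: str      # "0" for the zero ending
    categories: str  # raw text found between the // ... // markers


# wordform, whitespace, lemma+ending, then // categories //
_COMPONENT = re.compile(r"^(\S+)\s+(\S+)\s*//(.*)//\s*$")


def parse_morf(field: str) -> List[WordAnalysis]:
    text = field.strip()
    if text.startswith("<morf>"):
        text = text[len("<morf>"):]
    if text.endswith("</morf>"):
        text = text[:-len("</morf>")]
    analyses = []
    for chunk in text.split("{"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a leading or trailing bracket
        match = _COMPONENT.match(chunk)
        if match is None:
            raise ValueError(f"unrecognised component: {chunk!r}")
        wordform, lemma_ending, categories = match.groups()
        if "+" in lemma_ending:
            lemma, _, ending = lemma_ending.rpartition("+")
        else:
            lemma, ending = lemma_ending, ""
        analyses.append(WordAnalysis(wordform, lemma, ending, categories.strip()))
    return analyses


def split_lemma(lemma: str) -> Tuple[List[str], List[str]]:
    """Split a lemma into compound components ('_') and the suffixes ('=')
    attached to the rightmost component."""
    components = lemma.split("_")
    stem, *suffixes = components[-1].split("=")
    components[-1] = stem
    return components, suffixes
```

Given the invented input "<morf>aru    aru+0 //CAT_A// { saama    saa+ma //CAT_B//</morf>", parse_morf returns two analyses: the first has lemma aru and the zero ending, the second has lemma saa and ending ma. Similarly, split_lemma("raud_tee") separates the two components of a compound, and split_lemma("õpeta=ja") separates the suffix ja from the stem. These inputs are illustrative and not copied from the database.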


