Lexicographers, linguists and computational linguists have defined the term collocation in different ways. Here, following the usual practice in computational linguistics, a collocation is an expression consisting of words that co-occur in natural language texts more frequently than their individual frequencies would predict. Collocations differ in the number of words they combine and in the syntactic relations between those words. Based on their meaning, collocations can be (1) idioms (for example hambasse puhuma 'lie'), which are comprehensively represented in dictionaries but rarely occur in actual texts; (2) particle verbs and support verb combinations (üle saama 'get over', õppust võtma 'learn a lesson'), of which particle verbs are typically represented in dictionaries; or (3) various noun phrases (for example rohelised mehikesed 'extraterrestrials').
Collocations also include fixed expressions, in which the words keep their usual meaning but only one of the theoretically possible word combinations is actually used to express a given meaning. For example, in Estonian firewood is chopped (puid lõhutakse) but not broken (tehakse katki); in English one can make or deliver a speech, whereas in Estonian only the verb pidama 'hold, make' can be used in this context, not esitama 'deliver'. These numerous fixed expressions are problematic for foreign language learners.
Because word order in Estonian is free, the words making up a collocation can be separated by several intervening words. In the following sentence, for example, four other word forms stand between the constituents of the particle verb üle saama 'to overcome': Kass ei saanud priske hiire kaotusest kuidagi üle ('The cat just could not get over the loss of the plump mouse').
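One common way to catch such separated constituents is to count co-occurrences within a sliding window instead of only between adjacent words. The sketch below is illustrative only: the function name `window_cooccurrences`, the window sizes and the hand-lemmatized example sentence are assumptions, not a description of how the collocation tool itself is implemented.

```python
from collections import Counter


def window_cooccurrences(lemmas, window=5):
    """Count ordered lemma pairs that occur at most `window` tokens apart.

    `lemmas` is one sentence as a list of lemmas. The window size is an
    assumption made for this sketch, not a documented setting of the tool.
    """
    counts = Counter()
    for i, left in enumerate(lemmas):
        # Look at the next `window` tokens to the right of position i.
        for right in lemmas[i + 1 : i + 1 + window]:
            counts[(left, right)] += 1
    return counts


# The example sentence, lemmatized by hand for illustration.
sentence = ["kass", "ei", "saama", "priske", "hiir", "kaotus", "kuidagi", "üle"]

wide = window_cooccurrences(sentence, window=5)
narrow = window_cooccurrences(sentence, window=4)
print(wide[("saama", "üle")], narrow[("saama", "üle")])
```

With four intervening word forms, the two constituents are five positions apart, so a window of four misses the pair while a window of five finds it.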
Collocation Tool
An interface is available for finding collocations in the Balanced Corpus of Estonian, the Estonian Reference Corpus and its subcorpora. The interface supports three kinds of collocate search:
(1) relevant collocates of a particular lemma, as word forms;
(2) relevant collocates of a particular lemma, as lemmas;
(3) relevant collocates of a particular word form, as word forms.
In the collocate search, both lemmas and word forms can be restricted by word class. To detect collocations in a corpus, three association measures are used: (1) log-likelihood (LL), (2) Mutual Information (MI) and (3) Minimum Sensitivity (MS). For comparison, one can also search for word pairs ordered by frequency (Sag).
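The page does not give formulas for these measures. As a rough sketch, the code below uses the definitions that are standard in the collocation literature, computed from a 2x2 contingency table of observed and expected frequencies; whether the tool uses exactly these variants (for instance, the logarithm base for MI) is an assumption. The function name `association_scores` and the counts in the example are hypothetical.

```python
import math


def association_scores(pair_freq, w1_freq, w2_freq, total):
    """Return LL, MI and MS for one word pair.

    pair_freq: co-occurrence frequency of the pair (Sag)
    w1_freq, w2_freq: frequencies of each word as a pair member
    total: total number of pairs in the sample
    Assumes 0 < w1_freq, w2_freq < total.
    """
    # Observed 2x2 contingency table.
    observed = [
        pair_freq,
        w1_freq - pair_freq,
        w2_freq - pair_freq,
        total - w1_freq - w2_freq + pair_freq,
    ]
    # Expected frequencies under independence (row total * column total / N).
    r1, r2 = w1_freq, total - w1_freq
    c1, c2 = w2_freq, total - w2_freq
    expected = [r1 * c1 / total, r1 * c2 / total, r2 * c1 / total, r2 * c2 / total]

    # Log-likelihood ratio; empty cells contribute nothing to the sum.
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # (Pointwise) mutual information, here with a base-2 logarithm.
    mi = math.log2(pair_freq / expected[0]) if pair_freq > 0 else float("-inf")
    # Minimum sensitivity: the smaller of the two conditional proportions.
    ms = min(pair_freq / w1_freq, pair_freq / w2_freq)
    return {"LL": ll, "MI": mi, "MS": ms}


# Hypothetical counts: the pair occurs 10 times, the words 20 and 40 times.
scores = association_scores(pair_freq=10, w1_freq=20, w2_freq=40, total=1000)
print(scores)
```

MI tends to favour rare pairs, LL rewards pairs with solid frequency evidence, and MS is high only when each word occurs mostly in the company of the other, which is one reason the rankings produced by the three measures can differ noticeably.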
What are these lists good for?
The collocation finder retrieves the collocations of a single word. To analyse the most frequent collocations in a whole text corpus, however, frequency lists based on the source materials are needed. Here we present lists of the 5000 most significant or most frequent collocations, ordered by an association measure or by frequency.
Like the collocation tool, these lists are organized (1) by the word class of the collocates and (2) by whether each collocate is a lemma or a word form. So, for each word class pair, there are three lists: lemma-lemma, lemma-wordform and wordform-wordform.
The following word class pairs are included in the lists:
adjective-noun (AS)
adverb-adjective (DA)
noun-adverb (SD)
noun-noun (SS)
verb-adjective (VA)
verb-adverb (VD)
verb-noun (VS)
verb-verb (VV)
Lemma-lemma and wordform-wordform pairs are symmetric: because the frequencies for the pairs lootma 'hope' V abi 'help' S and abi 'help' S lootma 'hope' V are the same, only one of the word order variants is represented in the lists. Lemma-wordform pairs, on the other hand, are not symmetric, so their mirror cases are also given (compare juriidiline A isiku S and isik S juriidilise A).
The lemma-wordform collocations therefore also include the following "opposite pairs":
adjective-adverb (AD)
noun-adjective (SA)
adjective-verb (AV)
adverb-noun (DS)
adverb-verb (DV)
noun-verb (SV)
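The symmetry rule for lemma-lemma and wordform-wordform pairs can be sketched as follows: each pair is put into the word class order that appears in the first list of pairs, so that mirror variants collapse into one entry. This is only an illustration under stated assumptions. The function names, the alphabetical ordering of same-class pairs (SS, VV) and the frequency value 12 are hypothetical and not taken from the lists themselves.

```python
# Word class pair codes presented for symmetric lists (from the text above).
LISTED_CLASS_PAIRS = {"AS", "DA", "SD", "SS", "VA", "VD", "VS", "VV"}


def canonical_pair(first, second):
    """Return a pair of (text, word_class) items in the listed class order."""
    if first[1] == second[1]:
        # Same-class pairs: alphabetical order is an assumption of this sketch.
        return tuple(sorted((first, second)))
    if first[1] + second[1] in LISTED_CLASS_PAIRS:
        return (first, second)
    return (second, first)


def deduplicate(pair_freqs):
    """Keep one entry per symmetric pair; mirror variants share a frequency."""
    unique = {}
    for (first, second), freq in pair_freqs.items():
        unique[canonical_pair(first, second)] = freq
    return unique


# Illustrative frequency only; both word order variants carry the same count.
pairs = {
    (("abi", "S"), ("lootma", "V")): 12,
    (("lootma", "V"), ("abi", "S")): 12,
}
print(deduplicate(pairs))
```

Lemma-wordform pairs would skip this step, since there the two orders are different items and both are listed.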
Of these collocation pairs, the 5000 most frequent or most significant ones that occur in the corpus at least ten times are presented. They are ranked by frequency (Sag) and by the three association measures mentioned above: Log-Likelihood (LL), Mutual Information (MI) and Minimum Sensitivity (MS).
Word class labels: A = adjective, D = adverb, S = noun, V = verb
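The selection rule for the lists (a frequency threshold of ten, then the top 5000 by a chosen measure) can be sketched in a few lines. The row layout, the function name `top_collocations` and all numbers in the example are made up for illustration; only the threshold and the list length come from the description above.

```python
def top_collocations(rows, measure, min_freq=10, limit=5000):
    """Return up to `limit` rows with Sag >= min_freq, best `measure` first.

    Each row is a dict with a "pair" label, a frequency "Sag" and one value
    per association measure ("LL", "MI", "MS").
    """
    eligible = [row for row in rows if row["Sag"] >= min_freq]
    return sorted(eligible, key=lambda row: row[measure], reverse=True)[:limit]


# Made-up rows; the scores are not real corpus results.
rows = [
    {"pair": "lootma V abi S", "Sag": 25, "LL": 140.2, "MI": 6.1, "MS": 0.04},
    {"pair": "üle D saama V", "Sag": 310, "LL": 980.5, "MI": 4.3, "MS": 0.11},
    {"pair": "hambasse S puhuma V", "Sag": 4, "LL": 55.0, "MI": 12.7, "MS": 0.30},
]

by_ll = [row["pair"] for row in top_collocations(rows, "LL")]
by_freq = [row["pair"] for row in top_collocations(rows, "Sag", limit=1)]
print(by_ll, by_freq)
```

The third row has the highest MI in this toy data but is dropped because it falls below the frequency threshold, which is the kind of rare pair the threshold is meant to filter out.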