Morphologically disambiguated corpus

Older material:

The file failid.zip contains the manually disambiguated files. Each text was manually disambiguated by two people, and a third person compared the results and made the necessary corrections.

Content

Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of the corpus, 400 000 words, was disambiguated in 2002-2003. This work was supported by the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture). The following researchers participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.

The texts belong to the following text classes:

Text class                                        Number of words
Fiction (Estonian authors)                        104 000
G. Orwell's "1984"                                75 500
Newspaper texts                                   111 000
Legal texts                                       121 000
Texts from the scientific magazine "Horisont"     98 000
Reference texts                                   4 000
Altogether                                        513 000

File names

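File names begin with a code that identifies the text class. The code has three letters, except for "1984": ilu (fiction), sea (legal texts), aja (newspaper texts), hor (the magazine "Horisont"), inf (reference texts), 1984 (G. Orwell's "1984").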
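The prefix convention can be turned into a small lookup. The following is a minimal sketch, not part of the corpus distribution: the function name text_class and the "unknown" fallback are assumptions made for illustration, and only the prefixes themselves come from the list above.

```python
from pathlib import PurePosixPath

# Prefix -> text class, as listed above. Note that "1984" has four characters.
TEXT_CLASSES = {
    "ilu": "fiction",
    "sea": "legal texts",
    "aja": "newspaper texts",
    "hor": "Horisont",
    "inf": "reference texts",
    "1984": "1984",
}


def text_class(filename: str) -> str:
    """Return the text class implied by the prefix of a file name.

    Hypothetical helper: returns "unknown" when no listed prefix matches.
    """
    name = PurePosixPath(filename).name  # ignore any directory part
    for prefix, label in TEXT_CLASSES.items():
        if name.startswith(prefix):
            return label
    return "unknown"
```

For example, text_class("inf_0002.yhene") returns "reference texts".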

The origin of the texts

All the fiction texts, except for "1984", come from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990. The number in the filename is the same as in the original; the code "tkt" or "stkt" in the original filename has been replaced with the code "ilu".

The newspaper files are not present in the other corpora. The filename contains the name of the newspaper.

The reference texts come from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990; the file inf_0002.yhene is from the text class "Hobbies" and the file inf_0011.yhene is from the text class "Encyclopaedias".

The legal documents come from 1) the homepage of the Estonian Legal Language Centre, http://www.legaltext.ee/ (April 2002), and 2) some other sources. The filenames of the files obtained from the Estonian Legal Language Centre contain the same number as their source files. The filenames of the files from other sources contain the name of the legislative document.

The excerpts from the magazine "Horisont" were taken from its homepage, http://www.horisont.ee/ (9 October 2003), and date from the years 1996-2003. The filenames are the same as they were on the homepage of "Horisont".

The analysis

The wordforms have been analysed one by one. The result of the analysis for one wordform is as follows:

wordform
    lemma+ending // morphological categories //

If the word-form is a compound or a derived word, then:

All the parts of a multi-word proper name are analyzed, the non-final parts as if having an unknown inflection:

Rio Rio //_S_ prop ? //

de de //_S_ prop ? //

Janeiros Janeiro+s //_S_ prop sg in //

The tags <s> and </s> placed on separate lines mark the beginning and end of a sentence, heading etc. Some files also contain paragraph tags <p> and </p>.
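The layout described above can be read with a short parser. This is a minimal sketch under stated assumptions, not the tool used by the corpus authors: it assumes that each wordform starts at the beginning of its own line, that each of its analyses follows on an indented line, and that an analysis without "+" has no ending. The names Analysis, Token, parse_analysis and read_sentences are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Analysis:
    lemma: str
    ending: str            # empty when the analysis has no "+ending" part
    categories: List[str]  # e.g. ["_S_", "prop", "sg", "in"]


@dataclass
class Token:
    wordform: str
    analyses: List[Analysis] = field(default_factory=list)


def parse_analysis(line: str) -> Analysis:
    """Split 'lemma+ending // categories //' into its parts."""
    head, _, rest = line.strip().partition("//")
    categories = rest.rsplit("//", 1)[0].split()
    head = head.strip()
    if "+" in head:
        lemma, ending = head.rsplit("+", 1)
    else:
        lemma, ending = head, ""
    return Analysis(lemma, ending, categories)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per <s>...</s> block."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if not stripped or stripped in ("<s>", "<p>", "</p>"):
            continue  # blank lines and opening/paragraph tags carry no tokens
        if stripped == "</s>":
            yield sentence
            sentence = []
        elif line[0].isspace():
            # Indented line: an analysis of the most recent wordform.
            if sentence:
                sentence[-1].analyses.append(parse_analysis(stripped))
        else:
            sentence.append(Token(stripped))
    if sentence:  # tolerate a missing final </s>
        yield sentence
```

Passing an open file handle (opened with encoding="utf-8") to read_sentences yields the sentences one at a time, so a large file does not have to be held in memory.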

Symbols and entities

The character encoding is UTF-8. The characters <, > and & are represented as the entities &lt;, &gt; and &amp;.
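When the files are processed, the entities usually need to be converted back to plain characters. A minimal sketch using Python's standard html module follows; the helper names decode_entities and encode_entities are hypothetical. Note that html.unescape also decodes other HTML entities, which is an assumption beyond the three listed above, and that decoding is best applied to individual wordforms after the <s> and <p> tags have been recognised, so that a decoded "<" is not mistaken for markup.

```python
import html


def decode_entities(text: str) -> str:
    """Turn &lt;, &gt; and &amp; back into <, > and &."""
    return html.unescape(text)


def encode_entities(text: str) -> str:
    """Inverse step: escape <, > and & but leave quotes untouched."""
    return html.escape(text, quote=False)
```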

Known problems so far

About 0.3% of the analyses may be debatable or wrong.

Some publications about this

  1. H.-J. Kaalep, K. Muischnek, K. Müürisep, A. Rääbis, K. Habicht. Kas tegelik tekst allub eesti keele morfoloogilistele kirjeldustele? Eesti kirjakeele testkorpuse morfosüntaktilise märgendamise kogemusest. Keel ja Kirjandus 9/2000, pp. 623-633.
  2. K. Muischnek, K. Vider. Sõnaliigituse kitsaskohad eesti keele arvutianalüüsis. Submitted for publication in the proceedings of the Rakenduslingvistika konverents 2004 (Applied Linguistics Conference 2004).

Last modified: February 02 2023 10:45:04.