Old things:
The file failid.zip contains manually disambiguated files. Every text has been manually disambiguated by two people; a third person then compared the results and made the necessary corrections.
Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated during 2002-2003. This work was supported by the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture). The following researchers participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muishnek, Heili Orav, Andriela Rääbis, Kadri Vider.
The texts belong to the following text classes:
Text class | Number of words |
---|---|
Fiction (Estonian authors) | 104 000 |
G. Orwell's "1984" | 75 500 |
Newspaper texts | 111 000 |
Legal texts | 121 000 |
Texts from the scientific magazine "Horisont" | 98 000 |
Reference texts | 4 000 |
Altogether | 513 000 |
The filenames begin with a code that identifies the text class, in most cases a 3-letter code: ilu (fiction), sea (legal texts), aja (newspaper texts), hor (Horisont), inf (reference texts), or 1984.
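As a rough illustration, the text class of a file could be looked up from its filename prefix. The helper name `text_class` and the mapping variable are hypothetical; only the codes themselves are taken from the list above.

```python
from typing import Optional

# Prefix codes as listed in this description; the helper itself is only a sketch.
TEXT_CLASS_BY_PREFIX = {
    "ilu": "fiction",
    "sea": "legal texts",
    "aja": "newspaper texts",
    "hor": "Horisont",
    "inf": "reference texts",
    "1984": "1984",
}


def text_class(filename: str) -> Optional[str]:
    """Return the text class for a corpus filename, or None if the prefix is unknown."""
    for prefix, name in TEXT_CLASS_BY_PREFIX.items():
        if filename.startswith(prefix):
            return name
    return None
```

For example, `text_class("inf_0002.yhene")` returns `"reference texts"`, while a filename with an unlisted prefix returns `None`.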
All the fiction texts, except for "1984", come from the subcorpus of the 1980s of the Corpus of Written Estonian 1890-1990. The number in the filename is the same as in the original; the code "tkt" or "stkt" in the original filename has been replaced with the code "ilu".
The newspaper files are not present in the other corpora. The filename contains the name of the newspaper.
The reference texts come from the subcorpus of the 1980s of the Corpus of Written Estonian 1890-1990; the file inf_0002.yhene is from the text class "Hobbies" and the file inf_0011.yhene is from the text class "Encyclopaedias".
The legal documents come from 1) the homepage of the Estonian Legal Language Centre, http://www.legaltext.ee/ (April 2002), and 2) some other sources. The filenames of the files obtained from the Estonian Legal Language Centre contain the same number as their source files. The filenames of the files from other sources contain the name of the legislative document.
The excerpts from the magazine "Horisont" were taken from its homepage http://www.horisont.ee/ (9 October 2003) and date from the years 1996-2003. The filenames are the same as on the "Horisont" homepage.
The wordforms have been analysed one by one. The result of the analysis for one wordform is as follows:
wordform
lemma+ending // morphological categories //
If the wordform is a compound or a derived word, then:
All the parts of a multi-word proper name are analysed; the non-final parts are treated as having an unknown inflection:
Rio Rio //_S_ prop ? //
de de //_S_ prop ? //
Janeiros Janeiro+s //_S_ prop sg in //
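The analysis part of an entry, as in the examples above, could be split into its fields roughly as follows. This is a minimal sketch: the names `Analysis` and `parse_analysis` are hypothetical, and the assumption that the first category is a word-class tag wrapped in underscores (such as `_S_`) is inferred only from the examples shown here.

```python
import re
from typing import List, NamedTuple, Optional


class Analysis(NamedTuple):
    lemma: str
    ending: Optional[str]   # None when no "+ending" is present
    pos: str                # word-class tag without the underscores
    categories: List[str]   # remaining morphological categories


# Assumed shape: "lemma+ending //_POS_ cat1 cat2 ... //"
ANALYSIS_RE = re.compile(
    r"^\s*(?P<root>\S+)\s+//_(?P<pos>[^_]+)_\s*(?P<cats>.*?)\s*//\s*$"
)


def parse_analysis(text: str) -> Analysis:
    """Parse one analysis string into lemma, ending, word class and categories."""
    match = ANALYSIS_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised analysis: {text!r}")
    root = match.group("root")
    lemma, sep, ending = root.rpartition("+")
    if not sep:
        # No "+" in the root: the whole root is the lemma.
        lemma, ending = root, None
    return Analysis(lemma, ending, match.group("pos"), match.group("cats").split())
```

With this sketch, `parse_analysis("Janeiro+s //_S_ prop sg in //")` yields the lemma `Janeiro`, the ending `s`, the word class `S` and the categories `prop`, `sg`, `in`. Input that does not match the assumed shape raises `ValueError` rather than being guessed at.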
The tags <s> and </s> placed on separate lines mark the beginning and end of a sentence, heading etc. Some files also contain paragraph tags <p> and </p>.
The character encoding is UTF-8. The characters <, > and & are represented as the entities &lt;, &gt; and &amp;.
Approximately 0.3% of the analyses may be debatable or wrong.
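A reader for these files might group lines into sentences using the sentence tags and decode the entities along the way. The following is only a sketch: the function name `iter_sentences` is hypothetical, and it assumes that each tag stands alone on its line, as described above for the sentence tags.

```python
import html
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as the list of lines found between <s> and </s>."""
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "<s>":
            current = []
        elif line == "</s>":
            if current is not None:
                yield current
            current = None
        elif line in ("<p>", "</p>"):
            # Paragraph tags are skipped in this sketch.
            continue
        elif current is not None and line:
            # Turn &lt;, &gt; and &amp; back into <, > and &.
            current.append(html.unescape(line))
```

To read a file, the function could be called as `iter_sentences(open(path, encoding="utf-8"))`, where `path` is whichever corpus file is being processed.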