The file failid2004.zip contains manually disambiguated files. Every text was manually disambiguated by two people, and a third person compared the results and made the necessary corrections.
Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997). During that project G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated in 2002-2003. This work was supported by the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture). The following researchers participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.
The texts belong to the following text classes:
Text class | number of words |
---|---|
Fiction (Estonian authors) | 104 000 |
G. Orwell's "1984" | 75 500 |
Newspaper texts | 111 000 |
Legal texts | 121 000 |
Texts from the scientific magazine "Horisont" | 98 000 |
Reference texts | 4 000 |
Altogether | 513 000 |
The filenames begin with a 3-letter code (ilu [fiction], sea [legal texts], aja [newspaper], hor [isont], inf [reference texts]) or with 1984.
All the fiction texts, except for "1984", come from the subcorpus of the 1980s of the Corpus of Written Estonian 1890-1990. The number in the filename is the same as in the original; the code "tkt" or "stkt" in the original filename has been replaced with the code "ilu".
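The prefix convention can be used to sort files by text class. The following is a minimal sketch, not part of the corpus distribution: the function name `text_class` and the dictionary `PREFIX_TO_CLASS` are hypothetical, and the sketch assumes that the prefix is always the very first characters of the filename.

```python
import os

# Prefixes as listed in this description; the labels are the text classes.
PREFIX_TO_CLASS = {
    "ilu": "fiction",
    "sea": "legal texts",
    "aja": "newspaper",
    "hor": "Horisont",
    "inf": "reference texts",
    "1984": "1984",
}


def text_class(filename):
    """Return the text class for a corpus filename, or None if the prefix is unknown."""
    name = os.path.basename(filename)
    for prefix, label in PREFIX_TO_CLASS.items():
        if name.startswith(prefix):
            return label
    return None


if __name__ == "__main__":
    # inf_0002.yhene is one of the reference-text files named in this description.
    print(text_class("inf_0002.yhene"))
```

Returning None for an unrecognised prefix, rather than raising an error, lets a caller skip files that do not belong to the corpus.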
The newspaper files are not present in the other corpora. The filename contains the name of the newspaper.
The reference texts come from the subcorpus of the 1980s of the Corpus of Written Estonian 1890-1990; the file inf_0002.yhene is from the text class "Hobbies" and the file inf_0011.yhene is from the text class "Encyclopaedias".
The legal documents come from: 1) the homepage of the Estonian Legal Language Centre, http://www.legaltext.ee/ (April 2002), and 2) some other sources. The filenames of the files obtained from the Estonian Legal Language Centre contain the same number as their source files. The filenames of the files coming from other sources contain the name of the legislative document.
The excerpts from the magazine "Horisont" were taken from its homepage http://www.horisont.ee/ (9 October 2003) and date from the years 1996-2003. The filenames are the same as they were on the homepage of "Horisont".
The wordforms have been analysed one by one, except for some multi-word proper names like New York. The result of the analysis for one wordform is as follows:
wordform
lemma+ending // morphological categories //
If the word-form is a compound or a derived word, then:
The tags <s> and </s> placed on separate lines mark the beginning and end of a sentence, heading etc. Some files also contain paragraph tags <p> and </p>.
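A reader for this layout could look roughly like the sketch below. It is an illustration under assumptions, not the tool used to build the corpus: it assumes that every analysis line ends with `//` and that any other non-tag line is a wordform, it does not handle the compound and derived-word notation, and the sample categories (`_S_ sg nom` and so on) are made up for the example. The names `Token`, `parse_analysis` and `parse_sentences` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    wordform: str
    # Each analysis is a (lemma, ending, categories) tuple.
    analyses: list = field(default_factory=list)


def parse_analysis(line):
    """Split 'lemma+ending // categories //' into its three parts."""
    body = line.strip()
    if body.endswith("//"):
        body = body[:-2]
    left, _, categories = body.partition("//")
    left = left.strip()
    if "+" in left:
        lemma, _, ending = left.rpartition("+")
    else:
        lemma, ending = left, ""
    return lemma, ending, categories.strip()


def parse_sentences(lines):
    """Group tokens into sentences delimited by <s> and </s> lines."""
    sentences, current = [], None
    for raw in lines:
        stripped = raw.strip()
        if not stripped or stripped in ("<p>", "</p>"):
            continue  # paragraph tags are ignored in this sketch
        if stripped == "<s>":
            current = []
        elif stripped == "</s>":
            if current is not None:
                sentences.append(current)
            current = None
        elif current is not None:
            if stripped.endswith("//") and current:
                # Assumed: a line ending in // analyses the preceding wordform.
                current[-1].analyses.append(parse_analysis(stripped))
            else:
                current.append(Token(stripped))
    return sentences


# Hypothetical sample; the category strings are invented for illustration.
SAMPLE = """<s>
Mees
mees+0 // _S_ sg nom //
luges
luge+s // _V_ s //
</s>
"""

if __name__ == "__main__":
    for sentence in parse_sentences(SAMPLE.splitlines()):
        for token in sentence:
            print(token.wordform, token.analyses)
```

Keeping the analyses in a list per token leaves room for files in which a wordform still carries more than one analysis.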
In addition to letters and numbers the following symbols can be found in this corpus: ,;.:<>()!?%&"'*+-/=@_~
Non-ASCII characters are represented as SGML entities. All entities used are listed in the table of entities.
A dash can appear as - or as --, and its annotation is always —. At the beginning of a list item the combination -. can occur; it is also annotated as —.
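For quick inspection, the entities can be turned back into Unicode characters. The sketch below uses Python's standard `html.unescape`, which only knows the HTML5 entity names; it is an assumption that the entities used in this corpus are among them, so the corpus's own table of entities remains the authoritative mapping. The function name `decode_entities` is hypothetical.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with Unicode characters."""
    # Caution: html.unescape also decodes a few legacy names without a trailing
    # semicolon, which matters because a bare & can occur in the corpus.
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("k&otilde;ik"))
    print(decode_entities("&uuml;ks p&auml;ev"))
```

Entity names that `html.unescape` does not recognise are left unchanged, so any leftovers can be found by searching the output for `&`.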
The quotation marks can be in the following forms:
" | double quote (beginning or end) |
' | single quote (beginning or end) |
“ | beginning double quote |
” | end double quote |
‘ | beginning single quote |
’ | end single quote |
Approximately 0.3% of the analyses may be debatable or wrong.