Morphologically disambiguated corpus (version from 2004)

A corpus query interface is available specifically for the morphologically disambiguated corpus.

The file failid2004.zip contains the manually disambiguated files. Every text was disambiguated manually by two people; a third person then compared the results and made the necessary corrections.

Content

Work on the morphological disambiguation of Estonian began in the COPERNICUS project "Multext-East" (1995-1997), during which G. Orwell's novel "1984" was disambiguated. The main part of this corpus, 400 000 words, was disambiguated in 2002-2003. This work was supported by the national program "Eesti keel ja rahvuskultuur" (Estonian Language and National Culture). The following researchers participated in this work: Külli Habicht, Heiki-Jaan Kaalep, Neeme Kahusk, Kadri Muischnek, Heili Orav, Andriela Rääbis, Kadri Vider.

The texts belong to the following text classes:

Text class	Number of words
Fiction (Estonian authors)	104 000
G. Orwell's "1984"	75 500
Newspaper texts	111 000
Legal texts	121 000
Texts from the scientific magazine "Horisont"	98 000
Reference texts	4 000
Altogether	513 000

File names

The file names begin with a 3-letter code for the text class (or, for Orwell's novel, with 1984): ilu (fiction), sea (legal texts), aja (newspaper texts), hor ("Horisont"), inf (reference texts), 1984 ("1984").
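
The prefix convention can be used to look up the text class of a file. The following is a minimal sketch, not part of the corpus distribution: the function name text_class and the dictionary TEXT_CLASS_BY_PREFIX are hypothetical, and the class labels are copied from the table of text classes above.

```python
from pathlib import PurePath
from typing import Optional

# Prefix-to-class mapping taken from the list of codes above.
# The dictionary and function names are illustrative assumptions.
TEXT_CLASS_BY_PREFIX = {
    "ilu": "Fiction (Estonian authors)",
    "sea": "Legal texts",
    "aja": "Newspaper texts",
    "hor": 'Texts from the scientific magazine "Horisont"',
    "inf": "Reference texts",
    "1984": 'G. Orwell\'s "1984"',
}


def text_class(filename: str) -> Optional[str]:
    """Return the text class for a corpus file name, or None if unknown."""
    name = PurePath(filename).name  # ignore any directory part
    for prefix, label in TEXT_CLASS_BY_PREFIX.items():
        if name.startswith(prefix):
            return label
    return None


# inf_0002.yhene is one of the reference-text files named in this description.
print(text_class("inf_0002.yhene"))
```

Files whose names start with none of the listed codes yield None, so unexpected files can be reported rather than silently misclassified.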

The origin of the texts

All the fiction texts, except for "1984", come from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990. The number in the filename is the same as in the original; the code "tkt" or "stkt" in the original filename has been replaced with the code "ilu".

The newspaper files are not present in the other corpora. The filename contains the name of the newspaper.

The reference texts come from the 1980s subcorpus of the Corpus of Written Estonian 1890-1990; the file inf_0002.yhene is from the text class "Hobbies" and the file inf_0011.yhene is from the text class "Encyclopaedias".

The legal documents come from 1) the homepage of the Estonian Legal Language Centre, http://www.legaltext.ee/ (April 2002), and 2) some other sources. The names of the files obtained from the Estonian Legal Language Centre contain the same number as their source files. The names of the files from other sources contain the name of the legislative document.

The excerpts from the magazine "Horisont" were taken from its homepage http://www.horisont.ee/ (9 October 2003) and date from the years 1996-2003. The filenames are the same as they were on the homepage of "Horisont".

The analysis

The wordforms have been analysed one by one, except for some multi-word proper names like New York. The result of the analysis for one wordform is as follows:

wordform lemma+ending // morphological categories //

<wordform> is the wordform as it appears in the text.
<lemma> is the nominative singular for nouns; for verbs it is the ma-infinitive form without the infinitival ending -ma.
<ending> is the morphological (not derivational) ending. The separate endings that one wordform can have (e.g. case+plural) are treated as one, and the particle GI/KI has not been separated from the other endings. If the word does not or cannot have an ending, the zero ending is given (+0).
<morphological categories> are given in the table of morphological categories.
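
To illustrate the line format described above, here is a minimal parsing sketch. It rests on assumptions that this description does not state: that the fields are separated by whitespace, that lemma+ending is the last whitespace-delimited token before the first "//", and that the ending follows the last "+" in that token. The names Analysis and parse_analysis are hypothetical, and the sample line (including its category labels) is invented for illustration, not taken from the corpus.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    wordform: str
    lemma: str
    ending: str
    categories: str


def parse_analysis(line: str) -> Analysis:
    """Parse one line of the form: wordform lemma+ending // categories //"""
    head, sep, rest = line.partition("//")
    if not sep:
        raise ValueError(f"no '//' delimiter in line: {line!r}")

    # Split from the right so that a multi-word wordform such as
    # "New York" stays in one piece (assumes the analysis has no spaces).
    parts = head.rsplit(None, 1)
    if len(parts) != 2:
        raise ValueError(f"expected wordform and analysis in: {line!r}")
    wordform, lemma_ending = parts[0].strip(), parts[1]

    # The ending follows the last '+'; "+0" marks the zero ending.
    lemma, plus, ending = lemma_ending.rpartition("+")
    if not plus:
        raise ValueError(f"no '+' between lemma and ending in: {line!r}")

    # Drop the closing '//' and surrounding whitespace.
    categories = rest.rsplit("//", 1)[0].strip()
    return Analysis(wordform, lemma, ending, categories)


# Invented sample line; the category labels are placeholders.
print(parse_analysis("raamatud raamat+d // _S_ pl n //"))
```

Splitting from the right is a design choice for handling multi-word proper names; it would need revisiting if the analysis field itself can contain spaces.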

If the word-form is a compound or a derived word, then:

The components of a compound are separated by '_'.
Suffixes are separated from the lemma by '='. The annotation of suffixes is not consistent: only a predefined set of productive suffixes has been annotated.
For compounds, the lemma is given only for the rightmost component.
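
The two separators can be unpacked mechanically. The sketch below splits a lemma field into compound components on '_' and then splits each component into a stem and its suffixes on '='. The function name split_lemma is hypothetical, and the sample strings are invented to show the shape of the data, not copied from the corpus. The sketch does not treat a literal '_' or '=' token (both symbols occur in the corpus) as a special case.

```python
from typing import List, Tuple


def split_lemma(lemma: str) -> List[Tuple[str, List[str]]]:
    """Split a lemma into compound components, each with its suffixes.

    Returns one (stem, suffixes) pair per component, left to right.
    Only the rightmost component is a true lemma, as noted above.
    """
    structure = []
    for component in lemma.split("_"):
        stem, *suffixes = component.split("=")
        structure.append((stem, suffixes))
    return structure


# Invented examples: a two-part compound and a derived word.
print(split_lemma("raamatu_kogu"))
print(split_lemma("luge=ja"))
```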

The tags <s> and </s> placed on separate lines mark the beginning and end of a sentence, heading etc. Some files also contain paragraph tags <p> and </p>.
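
Because the sentence tags sit on their own lines, a file can be grouped into sentences with a simple line scan. This is a sketch under the assumption that every line between <s> and </s> is one analysis line and that <p> and </p>, where present, also stand alone on a line; the function name iter_sentences and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the analysis lines of each <s>...</s> unit as a list."""
    current = None  # None means we are outside any <s>...</s> unit
    for raw in lines:
        line = raw.strip()
        if line == "<s>":
            current = []
        elif line == "</s>":
            if current is not None:
                yield current
            current = None
        elif line in ("<p>", "</p>") or not line:
            continue  # paragraph tags and blank lines carry no analysis
        elif current is not None:
            current.append(line)


# Invented sample input; the analysis lines are placeholders.
sample = [
    "<p>",
    "<s>",
    "raamatud raamat+d // _S_ pl n //",
    ". . // _Z_ //",
    "</s>",
    "</p>",
]
for sentence in iter_sentences(sample):
    print(len(sentence), "tokens")
```

Lines that fall outside any <s>...</s> pair are skipped by this sketch; a stricter version could raise an error instead.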

Symbols and entities

In addition to letters and numbers the following symbols can be found in this corpus: ,;.:<>()!?%&"'*+-/=@_~

Non-ASCII characters are represented as SGML entities. All the entities used are listed in the table of entities.
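
For downstream processing it is often convenient to decode the entities into Unicode characters. The sketch below uses html.unescape from the Python standard library, on the assumption that the entity names used in the corpus coincide with standard HTML named entities (for example &otilde; or &auml;); the table of entities remains the authoritative list, and any name missing from the HTML set would be left unchanged. The function name decode_entities is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named character entities with Unicode characters.

    Assumes the corpus entities are a subset of the HTML named entities;
    names that html.unescape does not know are left as they are.
    """
    return html.unescape(text)


print(decode_entities("k&otilde;ik"))  # → kõik
```

Since "&" also occurs in the corpus as an ordinary symbol, it is safer to decode token by token than to run the function over arbitrary concatenated text.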

A dash can appear as - or as --, and its annotation is always —. At the beginning of a list item one can find the combination -., which has also been annotated as —.

Quotation marks can appear in the following forms:

"	double quote (beginning or end)
'	single quote (beginning or end)
“	beginning double quote
”	end double quote
‘	beginning single quote
’	end single quote

Known problems so far

About 0.3% of the analyses may be debatable or wrong.

Some publications about this corpus

H.-J. Kaalep, K. Muischnek, K. Müürisep, A. Rääbis, K. Habicht. Kas tegelik tekst allub eesti keele morfoloogilistele kirjeldustele? Eesti kirjakeele testkorpuse morfosüntaktilise märgendamise kogemusest. Keel ja Kirjandus 9/2000, pp. 623-633.
K. Muischnek, K. Vider. Sõnaliigituse kitsaskohad eesti keele arvutianalüüsis. Submitted for publication in the proceedings of the 2004 Applied Linguistics Conference (Rakenduslingvistika konverents 2004).

Last modified: February 02 2023 10:29:41.