Reference corpus of Estonian: Transcripts of Riigikogu (Estonian Parliament)

Contents

This corpus contains edited versions of the transcripts of the sessions of the Riigikogu. The originals were downloaded from http://www.riigikogu.ee/ems/plsql/ems.basdata

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

How can one use it?

The corpus may be used free of charge, for non-commercial purposes only.

Sources and markup

The texts have been automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Kaarel Kaljurand; they are described in http://psych.ut.ee/~kaarel/corpus_tools/.

Each file contains the transcripts of one month. The texts contain no corrections or hyphenation. The place where the rendition of the plain text changes is tagged with <hi rend='what kind of rendition'>; the end is tagged with </hi>.

Every file begins with a header, <teiheader>, that documents the contents of the file, its size, the tags used, etc. (in Estonian). <div0> marks the transcripts of one month, <div1> marks the transcripts of one session, and <div2> marks one agenda item.
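
The division hierarchy described above can be inspected with a few lines of Python. This is a minimal sketch, not a tool shipped with the corpus: the function name count_divisions and the sample string are made up for illustration, and the regular expression simply counts opening <div0>, <div1> and <div2> tags without validating the SGML.

```python
import re
from collections import Counter

# Matches opening tags only; "</div1>" is skipped because "/" follows "<".
DIV_RE = re.compile(r"<(div[0-2])\b", re.IGNORECASE)

def count_divisions(sgml_text):
    """Count opening <div0>, <div1> and <div2> tags in one corpus file."""
    return Counter(m.group(1).lower() for m in DIV_RE.finditer(sgml_text))

# Invented sample: one month, two sessions, three agenda items.
sample = (
    "<div0><div1><div2></div2><div2></div2></div1>"
    "<div1><div2></div2></div1></div0>"
)
counts = count_divisions(sample)
print(counts["div1"], counts["div2"])  # sessions, agenda items
```

For a real file, read its contents into a string first and pass that string to count_divisions; the file encoding is not stated here and would need to be checked.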

The speakers' names are tagged with <rs> and are always also marked with <hi rend='bold'>.
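
Because speakers carry the <rs> tag, their names can be pulled out with a regular expression. The sketch below is illustrative only: the function name speakers and the sample line are invented, and since the nesting order of <rs> and <hi> is not specified here, the code strips any markup found inside <rs> so that either order works.

```python
import re

RS_RE = re.compile(r"<rs\b[^>]*>(.*?)</rs>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def speakers(line):
    """Return the speaker names found in one line of corpus text."""
    names = []
    for match in RS_RE.finditer(line):
        # Drop nested markup such as <hi rend='bold'> and tidy whitespace.
        text = TAG_RE.sub("", match.group(1))
        names.append(" ".join(text.split()))
    return names

# Invented sample line; real corpus lines may look different.
sample = "<p><rs><hi rend='bold'>Esimees</hi></rs> <s> Tere hommikust ! </s></p>"
print(speakers(sample))
```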

The opening quotation mark is the entity &ldquo;; the closing quotation mark is the entity &rdquo;.
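
For display or further processing, the two quotation-mark entities can be mapped to the corresponding Unicode characters. This is a small sketch with a hypothetical helper name, decode_quotes; it handles only the two entities named above and leaves every other entity in the text untouched.

```python
# Only the two quotation-mark entities named above are handled here.
QUOTE_ENTITIES = {
    "&ldquo;": "\u201c",  # left double quotation mark
    "&rdquo;": "\u201d",  # right double quotation mark
}

def decode_quotes(text):
    """Replace &ldquo; and &rdquo; with Unicode quotation marks."""
    for entity, char in QUOTE_ENTITIES.items():
        text = text.replace(entity, char)
    return text

print(decode_quotes("&ldquo; Eesti keel ja rahvuskultuur &rdquo;"))
```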

Each paragraph, i.e. each unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except punctuation marks that are an integral part of a token, e.g. of an abbreviation or an ordinal number), and sentences are tagged with <s> and </s>.
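
Since each paragraph sits on one line and tokens are already separated by spaces, sentences and tokens can be recovered without a tokenizer. The following is a rough sketch under those assumptions; the function name sentences and the sample paragraph are invented, and any markup inside a sentence is simply discarded.

```python
import re

S_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def sentences(paragraph_line):
    """Split one paragraph line into sentences, each a list of tokens."""
    result = []
    for match in S_RE.finditer(paragraph_line):
        # Remove inline markup (e.g. <hi>), then split on whitespace.
        tokens = TAG_RE.sub(" ", match.group(1)).split()
        if tokens:
            result.append(tokens)
    return result

# Invented sample paragraph in the format described above.
sample = "<p><s> Istung algas kell 10 . </s> <s> Kohal oli 85 liiget . </s></p>"
for tokens in sentences(sample):
    print(len(tokens), tokens)
```

Splitting on whitespace keeps tokens such as abbreviations and ordinal numbers intact, because their punctuation was never separated in the first place.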

The corpus contains 13 million words and covers the period from March 1995 to the end of 2001.

The number of words by year:

In addition to ASCII symbols (unaccented letters, numbers and punctuation marks), the texts contain the following entities:


Last modified: March 24 2014 14:48:48.