Reference corpus of Estonian: Transcripts of Riigikogu (Estonian Parliament)


This corpus contains the edited transcripts of the sessions of the Riigikogu.

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

How can one use it?

The corpus may be used free of charge, for non-commercial purposes only.

Sources and markup

The texts have been automatically downloaded from the internet and converted from HTML to SGML (TEI).

Each file contains the transcripts of one month. The texts contain no corrections or hyphenation. A place where the rendition of the plain text changes is tagged with <hi rend='what kind of rendition'>; the end of that span is tagged with </hi>.

Every file begins with a header, <teiheader>, which documents the contents of the file, its size, the tags used, etc. (in Estonian). <div0> marks the transcripts of one month, <div1> marks the transcripts of one session, and <div2> marks one item of the agenda.
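
The nesting described above can be checked with a few lines of Python. This is only a sketch: the sample string is invented for illustration, and the assumption that the division tags may carry attributes is ours, not something the corpus documentation states. Regular expressions are used because the files are SGML rather than XML, so an XML parser may reject them.

```python
import re

# Hypothetical fragment that mimics the documented layout; not real corpus data.
sample = (
    "<teiheader>...</teiheader>\n"
    "<div0>\n"
    "<div1>\n<div2>\n<p>...</p>\n</div2>\n<div2>\n<p>...</p>\n</div2>\n</div1>\n"
    "<div1>\n<div2>\n<p>...</p>\n</div2>\n</div1>\n"
    "</div0>\n"
)


def count_divisions(sgml: str) -> dict:
    """Count opening <div0>, <div1> and <div2> tags, with or without attributes."""
    counts = {}
    for level, name in ((0, "months"), (1, "sessions"), (2, "agenda_items")):
        pattern = rf"<div{level}(?:\s[^>]*)?>"
        counts[name] = len(re.findall(pattern, sgml, flags=re.IGNORECASE))
    return counts


division_counts = count_divisions(sample)
print(division_counts)
```

On a real file, a "months" count other than 1 would suggest that the file does not follow the one-month-per-file layout.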

The names of the speakers are tagged with <rs> and are always marked as bold with <hi rend='bold'>.
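
A minimal sketch of pulling speaker names out of a file follows. The documentation does not say whether <rs> sits inside <hi rend='bold'> or the other way round, so the code accepts both orders; the sample lines and names are invented.

```python
import re

# Content of an <rs> element; attributes on the tag are tolerated (an assumption).
RS_PATTERN = re.compile(r"<rs(?:\s[^>]*)?>(.*?)</rs>", re.IGNORECASE | re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_speakers(sgml: str) -> list:
    """Return the text of every <rs> element with nested tags removed."""
    names = []
    for match in RS_PATTERN.finditer(sgml):
        text = TAG_PATTERN.sub("", match.group(1))
        names.append(" ".join(text.split()))  # normalise whitespace
    return names


# Hypothetical input showing both possible nesting orders.
sample = (
    "<p><hi rend='bold'><rs>Esimees</rs></hi> <s> Tere ! </s></p>\n"
    "<p><rs><hi rend='bold'>A. Tamm</hi></rs> <s> Tere . </s></p>\n"
)
speakers = extract_speakers(sample)
print(speakers)
```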

The opening quotation mark is the entity &ldquo;; the closing quotation mark is the entity &rdquo;.
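
If Unicode quotation marks are preferred for display, the two entities can be replaced directly. The sketch below handles only the two entities named above; the function name and the sample string are our own, and other entities in the texts would need their own mapping.

```python
# Only the two quotation-mark entities documented above are mapped.
QUOTE_ENTITIES = {
    "&ldquo;": "\u201c",  # left double quotation mark
    "&rdquo;": "\u201d",  # right double quotation mark
}


def decode_quotes(text: str) -> str:
    """Replace the quotation-mark entities with the corresponding characters."""
    for entity, char in QUOTE_ENTITIES.items():
        text = text.replace(entity, char)
    return text


decoded = decode_quotes("&ldquo; Eesti keel &rdquo;")
print(decoded)
```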

Each paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except those punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>.
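
Because each paragraph occupies one line and the tokens are already separated by spaces, a file can be read line by line and split on whitespace. The following sketch assumes exactly that format; the sample line is invented, and any tags inside a sentence (such as <hi>) are simply dropped.

```python
import re

P_PATTERN = re.compile(r"<p(?:\s[^>]*)?>(.*?)</p>", re.IGNORECASE)
S_PATTERN = re.compile(r"<s(?:\s[^>]*)?>(.*?)</s>", re.IGNORECASE)
TAG_PATTERN = re.compile(r"<[^>]+>")


def tokenize_paragraph(line: str) -> list:
    """Return one list of tokens per <s> element found in a one-line paragraph."""
    paragraph = P_PATTERN.search(line)
    if paragraph is None:
        return []
    sentences = []
    for sentence in S_PATTERN.finditer(paragraph.group(1)):
        # Remove inline tags, then rely on the existing space-separated tokens.
        tokens = TAG_PATTERN.sub(" ", sentence.group(1)).split()
        sentences.append(tokens)
    return sentences


# Hypothetical paragraph line; "10." stays one token because the full stop
# belongs to the ordinal number.
line = "<p><s> Algab 10. istung . </s> <s> Kas on ettepanekuid ? </s></p>"
sentences = tokenize_paragraph(line)
print(sentences)
```

Summing the lengths of the token lists gives a rough token count that includes punctuation marks, so it will not match a word count that excludes them.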

The corpus contains 13 million words and covers the period from March 1995 to the end of 2001.

The number of words by year:

In addition to ASCII symbols (unaccented letters, numbers and punctuation marks), the texts contain the following entities:

