Reference corpus of Estonian: Legislation


This corpus contains:

  1. Estonian laws, 391 files - headings and filenames
  2. Estonian translations of EU legislation, 5431 files - headings and filenames

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national programme “Estonian language and national culture”.

How can one use it?

The corpus is free to use for non-commercial purposes only.

Sources and tagging

The texts originate from the Estonian Legal Language Centre, as available on April 30, 2002.

The texts have been automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Heiki-Jaan Kaalep.

One file contains one law or regulation or the like. Non-textual parts (e.g. pictures) have been omitted. The texts often contain parts in various languages.   

All rendition information (e.g. italic, bold) has been deleted. Superscript and subscript are marked with <hi rend="sup"> and <hi rend="sub">. Unicode entities of the form &#number; have been converted to SGML entities. The conversion of the various encodings of the Estonian letters s and z with caron, Icelandic letters, Greek letters etc. has produced many errors. Original HTML lists have been converted to ordinary text, with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). The texts have not been corrected and contain no hyphenation. The entity &quest; stands for symbols whose correct original form is unknown.
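The conversion of numeric character references mentioned above can be sketched roughly as follows. This is only an illustration in Python, not the original conversion program: the function name numeric_to_named is hypothetical, and Python's html.entities.codepoint2name table (HTML 4 entity names) stands in for whatever SGML entity sets were actually used. Code points with no name in that table are replaced by &quest;, following the convention described above.

```python
import re
from html.entities import codepoint2name

# Matches decimal numeric character references such as &#353;
NUMERIC_REF = re.compile(r"&#(\d+);")


def numeric_to_named(text: str) -> str:
    """Replace &#number; references with named entities where a name is known.

    Assumption: the HTML 4 name table is used as a stand-in for the SGML
    entity sets of the corpus. It has no entry for some letters (for
    example z with caron), so a real converter would need a fuller table.
    """

    def _replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        name = codepoint2name.get(code_point)
        if name is not None:
            return f"&{name};"
        if code_point < 128:
            # Plain ASCII needs no entity at all.
            return chr(code_point)
        # Unknown symbol: fall back to the placeholder entity.
        return "&quest;"

    return NUMERIC_REF.sub(_replace, text)


if __name__ == "__main__":
    print(numeric_to_named("&#8220;&#353;okolaad&#8221;"))
```

A fallback of this kind would be one plausible source of the &quest; entities found in the texts, although the actual causes of the conversion errors are not documented here.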

The opening quotation mark is the entity &ldquo; and the closing quotation mark is the entity &rdquo;. The division of the texts into paragraphs follows the original HTML files exactly. One paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except those punctuation marks that are an integral part of a token, e.g. of an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>. Apart from paragraphs and sentences, the structure of the texts (e.g. headings, sections, signatures, appendices, footnotes etc.) is not tagged.
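Because each paragraph occupies one line and tokens are separated by spaces, sentences and tokens can be recovered with very little code. The following Python sketch assumes that a paragraph line looks like the sample string in the example (the exact spacing around tags in the real files may differ); the function name split_sentences is hypothetical.

```python
import re

# One sentence is everything between <s> and </s>.
SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def split_sentences(paragraph_line: str) -> list:
    """Return a list of sentences, each a list of whitespace-separated tokens."""
    return [body.split() for body in SENTENCE.findall(paragraph_line)]


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    line = "<p> <s> See on lause . </s> <s> Vt. lisa 2 . </s> </p>"
    for tokens in split_sentences(line):
        print(tokens)
```

Abbreviations such as "Vt." keep their full stop as part of the token, while sentence-final punctuation appears as a token of its own.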

Every file starts with a <teiHeader> documenting the file's contents, size, the tags used, etc.
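When processing a file, the header usually has to be separated from the text proper. A minimal Python sketch, assuming the header is closed by a literal </teiHeader> tag and that everything after it is body text (the function name split_header is hypothetical):

```python
from typing import Tuple


def split_header(sgml: str) -> Tuple[str, str]:
    """Split file contents into (header, body) at the end of <teiHeader>.

    If no closing </teiHeader> tag is found, the header is returned empty
    and the whole input is treated as body.
    """
    end_tag = "</teiHeader>"
    index = sgml.find(end_tag)
    if index == -1:
        return "", sgml
    cut = index + len(end_tag)
    return sgml[:cut], sgml[cut:]


if __name__ == "__main__":
    # Hypothetical miniature file, not taken from the corpus.
    sample = "<teiHeader>meta</teiHeader>\n<p> <s> Tekst . </s> </p>\n"
    header, body = split_header(sample)
    print(len(header), body.strip())
```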


Estonian laws (1.8 million tokens)

Estonian translations of EU legislation (9.6 million tokens)

Numbers and abbreviations have been counted as tokens.

English translations and originals

This corpus contains the parallel texts for the Estonian versions:

  1. English translations of the Estonian laws, 390 files, 3.0 million tokens (packed in an archive) - headings and filenames
  2. Original EU legislation in English, 5414 files, 12 million tokens (packed in two archives) - headings and filenames

Symbols and entities

In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain the following entities:

