
Reference corpus of Estonian: Legislation

Content

This corpus contains:

  1. Estonian laws, 391 files - headings and filenames
  2. Estonian translations of EU legislation, 5431 files - headings and filenames

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

The corpus is free for use for non-commercial purposes only.

Sources and tagging

The texts were retrieved from the web page of the Estonian Legal Language Centre, https://www.legaltext.ee, on April 30, 2002.

The texts were downloaded automatically from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Heiki-Jaan Kaalep.

Each file contains one law, regulation or similar act. Non-textual parts (e.g. pictures) have been omitted. The texts often contain parts in various languages.

All rendition information (e.g. italic, bold) has been deleted. Superscript and subscript are marked with <hi rend="sup"> and <hi rend="sub">. Unicode entities of the form &#number; have been converted to SGML entities. The conversion of various forms of the Estonian letters s and z with caron, Icelandic letters, Greek letters etc. has produced many errors. Original HTML lists have been converted to ordinary text, with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). No corrections or hyphenation have been applied to the texts. The entity &quest; stands for symbols whose correct original form is unknown.
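As an illustration of how the named entities described above might be turned back into characters, here is a minimal Python sketch. It assumes that the SGML entity names in the corpus coincide with the HTML5 names known to Python's standard library (true for names such as &ldquo;, &rdquo; and &scaron;, but not verified against the corpus's full entity list). The function name decode_entities and the choice of U+FFFD as a placeholder for &quest; are hypothetical, not part of the corpus tools.

```python
import re
from html.entities import html5

# Matches named entities such as &ldquo; or &scaron;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Assumption: U+FFFD marks symbols whose original form is unknown (&quest;)
UNKNOWN = "\ufffd"


def decode_entities(text, unknown=UNKNOWN):
    """Replace named entities with characters; leave unrecognised names as-is."""
    def repl(match):
        name = match.group(1)
        if name == "quest":
            # In this corpus &quest; does not mean a literal question mark
            return unknown
        # html5 keys include the trailing semicolon, e.g. "ldquo;"
        return html5.get(name + ";", match.group(0))

    return ENTITY_RE.sub(repl, text)
```

Entity names that the standard table does not know are left untouched, so they can be inspected or mapped by hand later.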

The opening quotation mark is the entity &ldquo; and the closing quotation mark is the entity &rdquo;. The division of the texts into paragraphs follows the original HTML files exactly. One paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from wordforms by a space (except those punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>. Apart from paragraphs and sentences, the structure of the texts (e.g. headings, sections, signatures, appendixes, footnotes etc.) is not tagged.
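The layout described above (one paragraph per line, sentences in <s> ... </s>, tokens separated by spaces) can be read with a few lines of Python. The following is a rough sketch under stated assumptions: each paragraph line starts with <p> and ends with </p>, sentences are not nested, and <hi rend="..."> tags can simply be dropped. The function names and the sample line are hypothetical and are not taken from the corpus.

```python
import re

# One sentence is everything between <s> and </s>
SENT_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
# Superscript/subscript markers such as <hi rend="sup"> ... </hi>
HI_RE = re.compile(r"</?hi\b[^>]*>")


def sentences(paragraph_line):
    """Return a list of token lists, one per <s>...</s> sentence."""
    # Drop <hi> tags first so that splitting on whitespace yields only tokens
    cleaned = HI_RE.sub("", paragraph_line)
    return [sent.split() for sent in SENT_RE.findall(cleaned)]


def iter_paragraphs(lines):
    """Yield the sentences of every line that holds a complete <p>...</p>."""
    for line in lines:
        line = line.strip()
        if line.startswith("<p>") and line.endswith("</p>"):
            yield sentences(line)


# Hypothetical sample line in the format described above
sample = "<p><s> See on lause . </s> <s> Teine lause ! </s></p>"
token_count = sum(len(sent) for sent in sentences(sample))
```

Because punctuation marks are already separated by spaces, splitting on whitespace is enough to recover tokens; lines that are not paragraphs, such as those of the header, are skipped.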

Every file starts with a <teiHeader> documenting the file contents, size, tags used etc.

Size

Estonian laws (1.8 million tokens)

Estonian translations of EU legislation (9.6 million tokens)

Numbers and abbreviations have been counted as tokens.

Symbols and entities

In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain the following entities:


Last modified: December 21 2018 20:40:26.