Reference corpus of Estonian: Legislation

XML-files
SGML TEI P3 files: eestiseadus.tar.gz, euroseadus-1.tar.gz, euroseadus-2.tar.gz

Content

This corpus contains:

Estonian laws, 391 files - headings and filenames
Estonian translations of EU legislation, 5431 files - headings and filenames

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

The corpus is free for use for non-commercial purposes only.

Sources and tagging

The texts originate from the web page of the Estonian Legal Language Centre, https://www.legaltext.ee, as it stood on April 30, 2002.

The texts were downloaded automatically from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Heiki-Jaan Kaalep.

Each file contains one law, regulation or similar document. Non-textual parts (e.g. pictures) have been omitted. The texts often contain passages in several languages.

All rendition information (e.g. italic, bold) has been deleted, except that superscript and subscript are marked with <hi rend="sup"> and <hi rend="sub">. Unicode character references of the form &#number; have been converted to SGML entities. The conversion of the various forms of the Estonian letters s and z with caron, of Icelandic letters, Greek letters etc. has produced many errors. The entity &quest; stands for symbols whose correct original form is unknown.

Original HTML lists have been converted to ordinary text, with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). The texts contain no corrections and no hyphenation.
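For readers who want plain Unicode text, the entity convention described above can be undone with a short script. The following is a minimal sketch, not the conversion program used for the corpus. It assumes that entity names shared with HTML can be resolved through Python's standard `html.entities.html5` table, and it uses a small hand-written table (`GREEK`, a hypothetical and deliberately incomplete helper) for the Greek names ending in `gr`, which HTML does not define. The markup-significant entities `&amp;`, `&lt;` and `&gt;` are left untouched so that the tagging stays intact.

```python
import re
from html.entities import html5

# Hypothetical, partial table for Greek entity names that HTML does not define.
# Extend it with the remaining names from the entity list as needed.
GREEK = {
    "agr": "α", "bgr": "β", "ggr": "γ", "dgr": "δ",
    "mgr": "μ", "pgr": "π", "ohgr": "ω", "OHgr": "Ω",
}

# Entities that must stay encoded, otherwise the SGML markup would break.
KEEP = {"amp", "lt", "gt"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace known entity references with Unicode characters.

    Unknown names are returned unchanged so that nothing is silently lost.
    """
    def replace(match):
        name = match.group(1)
        if name in KEEP:
            return match.group(0)
        if name in GREEK:
            return GREEK[name]
        # Keys in html5 carry the trailing semicolon, e.g. "scaron;".
        char = html5.get(name + ";")
        return char if char is not None else match.group(0)

    return ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    print(decode_entities("&agr; ja &bgr;"))
```

Note that &quest; decodes to an ordinary question mark, so the marker for "original form unknown" is no longer distinguishable from real question marks after decoding; add `quest` to `KEEP` if that distinction matters.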

The opening quotation mark is the entity &ldquo; and the closing quotation mark is the entity &rdquo;. The division of the texts into paragraphs follows the original HTML files exactly. Each paragraph, i.e. each unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except punctuation marks that are an integral part of a token, e.g. of an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>. Apart from paragraphs and sentences, the structure of the texts (e.g. headings, sections, signatures, appendices, footnotes) is not tagged.
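Because each paragraph occupies one line and tokens are already separated by spaces, sentences and tokens can be recovered with very little code. The sketch below is an illustration under stated assumptions, not part of the corpus tooling: the function name `paragraph_to_sentences` and the sample line are made up, and <hi ...> tags are simply dropped before splitting, because the space inside the tag would otherwise break whitespace tokenization.

```python
import re

# One sentence is everything between <s> and </s>.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.S)
# Superscript/subscript markup; removed here for simplicity (an assumption).
HI_RE = re.compile(r"</?hi\b[^>]*>")


def paragraph_to_sentences(line):
    """Return the sentences of one paragraph line, each as a list of tokens."""
    sentences = []
    for raw in SENTENCE_RE.findall(line):
        text = HI_RE.sub("", raw)
        # Tokens are already space-separated in the corpus files.
        sentences.append(text.split())
    return sentences


if __name__ == "__main__":
    # Invented sample line, not taken from the corpus.
    sample = "<p><s>See on n&auml;ide .</s> <s>Teine lause , 2. osa .</s></p>"
    for tokens in paragraph_to_sentences(sample):
        print(len(tokens), tokens)
```

Entities such as &auml; are left as they are, so a token count obtained this way treats each entity as part of its word.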

Every file starts with a <teiHeader> documenting the file contents, size, the tags used etc.
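When only the running text is needed, the header can be separated from the rest of the file first. The following is a minimal sketch that assumes the header is closed by a matching </teiHeader> tag. The internal structure of the header is not described here, so the code treats it as an opaque block; `split_header` is a hypothetical helper name.

```python
import re

# SGML tag names are case-insensitive, hence re.I.
HEADER_RE = re.compile(r"<teiHeader\b.*?</teiHeader>", re.S | re.I)


def split_header(document):
    """Return (header, rest); header is '' if no <teiHeader> is found."""
    match = HEADER_RE.search(document)
    if match is None:
        return "", document
    return match.group(0), document[match.end():]


if __name__ == "__main__":
    # Invented miniature document, for illustration only.
    sample = "<teiHeader>...</teiHeader>\n<text><p><s>A .</s></p></text>"
    header, body = split_header(sample)
    print(len(header), body.strip())
```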

Size

Estonian laws (1.8 million tokens)

Estonian translations of EU legislation (9.6 million tokens)

Numbers and abbreviations have been counted as tokens.

Symbols and entities

In addition to ASCII symbols (unaccented letters, numbers and punctuation marks), the texts contain the following entities:

&AElig; - capital AE diphthong (ligature)
&Aacgr; - capital Alpha, accent, Greek
&Aacute; - capital A, acute accent
&Acirc; - capital A, circumflex accent
&Agr; - capital Alpha, Greek
&Agrave; - capital A, grave accent
&Aring; - capital A, ring
&Atilde; - capital A, tilde
&Auml; - capital A, dieresis or umlaut mark
&Bgr; - capital Beta, Greek
&Ccedil; - capital C, cedilla
&Dgr; - capital Delta, Greek
&EEgr; - capital Eta, Greek
&ETH; - capital Eth, Icelandic
&Eacgr; - capital Epsilon, accent, Greek
&Eacute; - capital E, acute accent
&Ecirc; - capital E, circumflex accent
&Ecy; - capital E, Cyrillic
&Egr; - capital Epsilon, Greek
&Egrave; - capital E, grave accent
&Euml; - capital E, dieresis or umlaut mark
&Ggr; - capital Gamma, Greek
&Iacute; - capital I, acute accent
&Icirc; - capital I, circumflex accent
&Igr; - capital Iota, Greek
&Igrave; - capital I, grave accent
&Iuml; - capital I, dieresis or umlaut mark
&KHgr; - capital Chi, Greek
&Kgr; - capital Kappa, Greek
&Lgr; - capital Lambda, Greek
&Lstrok; - capital L, stroke
&Mcy; - capital EM, Cyrillic
&Mgr; - capital Mu, Greek
&Ngr; - capital Nu, Greek
&Ntilde; - capital N, tilde
&OElig; - capital OE ligature
&OHacgr; - capital Omega, accent, Greek
&OHgr; - capital Omega, Greek
&Oacute; - capital O, acute accent
&Ocirc; - capital O, circumflex accent
&Ogr; - capital Omicron, Greek
&Ograve; - capital O, grave accent
&Oslash; - capital O, slash
&Otilde; - capital O, tilde
&Ouml; - capital O, dieresis or umlaut mark
&PHgr; - capital Phi, Greek
&PSgr; - capital Psi, Greek
&Pcy; - capital PE, Cyrillic
&Pgr; - capital Pi, Greek
&Rcaron; - capital R, caron
&Rgr; - capital Rho, Greek
&Scaron; - capital S, caron
&Sgr; - capital Sigma, Greek
&THgr; - capital Theta, Greek
&Tgr; - capital Tau, Greek
&Uacute; - capital U, acute accent
&Ugr; - capital Upsilon, Greek
&Ugrave; - capital U, grave accent
&Uuml; - capital U, dieresis or umlaut mark
&Xgr; - capital Xi, Greek
&Yacute; - capital Y, acute accent
&Zcaron; - capital Z, caron
&Zgr; - capital Zeta, Greek
&aacgr; - small alpha, accent, Greek
&aacute; - small a, acute accent
&acirc; - small a, circumflex accent
&acy; - small a, Cyrillic
&aelig; - small ae diphthong (ligature)
&agr; - small alpha, Greek
&agrave; - small a, grave accent
&amp; - ampersand
&aogon; - small a, ogonek
&aring; - small a, ring
&atilde; - small a, tilde
&auml; - small a, dieresis or umlaut mark
&bgr; - small beta, Greek
&ccaron; - small c, caron
&ccedil; - small c, cedilla
&deg; - degree sign
&dgr; - small delta, Greek
&dollar; - dollar sign
&eacute; - small e, acute accent
&ecirc; - small e, circumflex accent
&eeacgr; - small eta, accent, Greek
&eegr; - small eta, Greek
&egr; - small epsilon, Greek
&egrave; - small e, grave accent
&eogon; - small e, ogonek
&eth; - small eth, Icelandic
&euml; - small e, dieresis or umlaut mark
&euro; - euro sign
&ge; - greater-than-or-equal sign
&ggr; - small gamma, Greek
&gt; - greater-than sign
&iacgr; - small iota, accent, Greek
&iacute; - small i, acute accent
&icirc; - small i, circumflex accent
&icy; - small i, Cyrillic
&idigr; - small iota, dieresis, Greek
&iecy; - small ie, Cyrillic
&igr; - small iota, Greek
&igrave; - small i, grave accent
&iuml; - small i, dieresis or umlaut mark
&jcy; - small short i, Cyrillic
&kcedil; - small k, cedilla
&kcy; - small ka, Cyrillic
&kgr; - small kappa, Greek
&khgr; - small chi, Greek
&ldquo; - left double quotation mark
&le; - less-than-or-equal sign
&lgr; - small lambda, Greek
&lstrok; - small l, stroke
&lt; - less-than sign
&mgr; - small mu, Greek
&middot; - middle dot
&ncy; - small en, Cyrillic
&ne; - not-equal sign
&ngr; - small nu, Greek
&ntilde; - small n, tilde
&oacgr; - small omicron, accent, Greek
&oacute; - small o, acute accent
&ocirc; - small o, circumflex accent
&ocy; - small o, Cyrillic
&oelig; - small oe ligature
&ogr; - small omicron, Greek
&ograve; - small o, grave accent
&ohacgr; - small omega, accent, Greek
&ohgr; - small omega, Greek
&ordm; - ordinal indicator, masculine
&oslash; - small o, slash
&otilde; - small o, tilde
&ouml; - small o, dieresis or umlaut mark
&permil; - per mille sign
&pgr; - small pi, Greek
&phgr; - small phi, Greek
&plusmn; - plus-or-minus sign
&pound; - pound sign
&psgr; - small psi, Greek
&quest; - question mark
&quot; - quotation mark
&rcaron; - small r, caron
&rcy; - small er, Cyrillic
&rdquo; - right double quotation mark
&rgr; - small rho, Greek
&scaron; - small s, caron
&scy; - small es, Cyrillic
&sect; - section sign
&sfgr; - final small sigma, Greek
&sgr; - small sigma, Greek
&squ; - square, open
&szlig; - small sharp s, German (sz ligature)
&tcy; - small te, Cyrillic
&tgr; - small tau, Greek
&thgr; - small theta, Greek
&thorn; - small thorn, Icelandic
&tilde; - small tilde
&times; - multiplication sign
&uacgr; - small upsilon, accent, Greek
&uacute; - small u, acute accent
&ucirc; - small u, circumflex accent
&udigr; - small upsilon, dieresis, Greek
&ugr; - small upsilon, Greek
&ugrave; - small u, grave accent
&uuml; - small u, dieresis or umlaut mark
&vcy; - small ve, Cyrillic
&xgr; - small xi, Greek
&yacute; - small y, acute accent
&ycy; - small yeru, Cyrillic
&yuml; - small y, dieresis or umlaut mark
&zcaron; - small z, caron
&zgr; - small zeta, Greek

Last modified: December 21, 2018, 20:40:26.