Reference corpus of Estonian: Legislation
Content
This corpus contains:
- Estonian laws, 391 files
  - headings and filenames
- Estonian translations of EU legislation, 5431 files
  - headings and filenames
These texts form a part of the planned reference corpus of Estonian. Their collection and processing have been financed by the national programme “Estonian language and national culture”.
The corpus is free to use for non-commercial purposes only.
Sources and tagging
The texts originate from the Estonian Legal Language Centre web page https://www.legaltext.ee (as of April 30, 2002).
The texts were automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Heiki-Jaan Kaalep.
One file contains one law or regulation or the like. Non-textual parts (e.g. pictures) have been omitted.
The texts often contain parts in various languages.
All rendition information (e.g. italic, bold) has been deleted, except for superscript and subscript, which are marked with <hi rend="sup"> and <hi rend="sub">. Numeric character references of the form &#number; have been converted to SGML entities.
The conversion of the various forms of the Estonian letters s and z with caron, Icelandic letters, Greek letters etc. has produced many errors. Original HTML lists have been converted to ordinary text with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). The texts have not been corrected or hyphenated. The entity &quest; stands for symbols whose correct original form is unknown.
The opening quotation mark is the entity &ldquo; and the closing quotation mark is the entity &rdquo;. The division of the texts into paragraphs follows the original HTML files exactly. One paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except those punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>. Apart from paragraphs and sentences, the structure of the texts (e.g. headings, sections, signatures, appendixes, footnotes etc.) is not tagged.
Every file starts with a <teiHeader> documenting the file
contents, size, used tags etc.
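The paragraph and sentence markup described above can be read with very little code. The following is a minimal sketch, assuming each body line holds one <p>...</p> unit with <s>...</s> sentences whose tokens are separated by spaces; the function name parse_paragraph and the sample line are invented for illustration and are not taken from the corpus.

```python
import re

# A sentence as tagged in the corpus: everything between <s> and </s>.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>")

def parse_paragraph(line):
    """Split one corpus line (one <p>...</p> unit) into sentences of tokens."""
    sentences = []
    for match in SENTENCE_RE.finditer(line):
        # Tokens are already separated by spaces, so a plain split suffices.
        sentences.append(match.group(1).split())
    return sentences

# Invented sample line, not taken from the corpus.
sample = "<p><s> Seadus kehtib 1. jaanuarist 2002 . </s><s> Vt art. 5 . </s></p>"
for tokens in parse_paragraph(sample):
    print(tokens)
```

Note that this sketch does not treat inline markup specially: a tag such as <hi rend="sup"> contains a space and would be split into two pieces, so such tags should be removed or normalised first if they matter.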
Size
Estonian laws (1.8 million tokens)
Estonian translations of EU legislation (9.6 million tokens)
Numbers and abbreviations have been counted as tokens.
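A rough token count can be reproduced along the following lines. This is only a sketch under stated assumptions: it counts every whitespace-separated item inside <s>...</s>, including punctuation marks, whereas the text above only states that numbers and abbreviations were counted, so the result need not match the published figures. The function name count_tokens and the sample lines are hypothetical.

```python
import re

SENTENCE_RE = re.compile(r"<s>(.*?)</s>")
TAG_RE = re.compile(r"<[^>]+>")  # inline markup such as <hi rend="sup">

def count_tokens(lines):
    """Count whitespace-separated tokens inside <s>...</s> across lines."""
    total = 0
    for line in lines:
        for match in SENTENCE_RE.finditer(line):
            # Remove inline tags so their attributes are not counted as tokens.
            text = TAG_RE.sub("", match.group(1))
            total += len(text.split())
    return total

# Invented sample lines, not taken from the corpus.
lines = [
    "<p><s> Vt art. 5 lg 2 . </s></p>",
    '<p><s> Pindala on 5 m<hi rend="sup">2</hi> . </s></p>',
]
print(count_tokens(lines))
```

Abbreviations such as "art." and numbers such as "5" each count as one token here, because the corpus keeps their punctuation attached.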
Symbols and entities
In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain the following entities:
- &AElig; - capital AE diphthong (ligature)
- &Aacgr; - capital Alpha, accent, Greek
- &Aacute; - capital A, acute accent
- &Acirc; - capital A, circumflex accent
- &Agr; - capital Alpha, Greek
- &Agrave; - capital A, grave accent
- &Aring; - capital A, ring
- &Atilde; - capital A, tilde
- &Auml; - capital A, dieresis or umlaut mark
- &Bgr; - capital Beta, Greek
- &Ccedil; - capital C, cedilla
- &Dgr; - capital Delta, Greek
- &EEgr; - capital Eta, Greek
- &ETH; - capital Eth, Icelandic
- &Eacgr; - capital Epsilon, accent, Greek
- &Eacute; - capital E, acute accent
- &Ecirc; - capital E, circumflex accent
- &Ecy; - capital E, Cyrillic
- &Egr; - capital Epsilon, Greek
- &Egrave; - capital E, grave accent
- &Euml; - capital E, dieresis or umlaut mark
- &Ggr; - capital Gamma, Greek
- &Iacute; - capital I, acute accent
- &Icirc; - capital I, circumflex accent
- &Igr; - capital Iota, Greek
- &Igrave; - capital I, grave accent
- &Iuml; - capital I, dieresis or umlaut mark
- &KHgr; - capital Chi, Greek
- &Kgr; - capital Kappa, Greek
- &Lgr; - capital Lambda, Greek
- &Lstrok; - capital L, stroke
- &Mcy; - capital EM, Cyrillic
- &Mgr; - capital Mu, Greek
- &Ngr; - capital Nu, Greek
- &Ntilde; - capital N, tilde
- &OElig; - capital OE ligature
- &OHacgr; - capital Omega, accent, Greek
- &OHgr; - capital Omega, Greek
- &Oacute; - capital O, acute accent
- &Ocirc; - capital O, circumflex accent
- &Ogr; - capital Omicron, Greek
- &Ograve; - capital O, grave accent
- &Oslash; - capital O, slash
- &Otilde; - capital O, tilde
- &Ouml; - capital O, dieresis or umlaut mark
- &PHgr; - capital Phi, Greek
- &PSgr; - capital Psi, Greek
- &Pcy; - capital PE, Cyrillic
- &Pgr; - capital Pi, Greek
- &Rcaron; - capital R, caron
- &Rgr; - capital Rho, Greek
- &Scaron; - capital S, caron
- &Sgr; - capital Sigma, Greek
- &THgr; - capital Theta, Greek
- &Tgr; - capital Tau, Greek
- &Uacute; - capital U, acute accent
- &Ugr; - capital Upsilon, Greek
- &Ugrave; - capital U, grave accent
- &Uuml; - capital U, dieresis or umlaut mark
- &Xgr; - capital Xi, Greek
- &Yacute; - capital Y, acute accent
- &Zcaron; - capital Z, caron
- &Zgr; - capital Zeta, Greek
- &aacgr; - small alpha, accent, Greek
- &aacute; - small a, acute accent
- &acirc; - small a, circumflex accent
- &acy; - small a, Cyrillic
- &aelig; - small ae diphthong (ligature)
- &agr; - small alpha, Greek
- &agrave; - small a, grave accent
- &amp; - ampersand
- &aogon; - small a, ogonek
- &aring; - small a, ring
- &atilde; - small a, tilde
- &auml; - small a, dieresis or umlaut mark
- &bgr; - small beta, Greek
- &ccaron; - small c, caron
- &ccedil; - small c, cedilla
- &deg; - degree sign
- &dgr; - small delta, Greek
- &dollar; - dollar sign
- &eacute; - small e, acute accent
- &ecirc; - small e, circumflex accent
- &eeacgr; - small eta, accent, Greek
- &eegr; - small eta, Greek
- &egr; - small epsilon, Greek
- &egrave; - small e, grave accent
- &eogon; - small e, ogonek
- &eth; - small eth, Icelandic
- &euml; - small e, dieresis or umlaut mark
- &euro; - euro sign
- &ge; - greater-than-or-equal sign
- &ggr; - small gamma, Greek
- &gt; - greater-than sign
- &iacgr; - small iota, accent, Greek
- &iacute; - small i, acute accent
- &icirc; - small i, circumflex accent
- &icy; - small i, Cyrillic
- &idigr; - small iota, dieresis, Greek
- &iecy; - small ie, Cyrillic
- &igr; - small iota, Greek
- &igrave; - small i, grave accent
- &iuml; - small i, dieresis or umlaut mark
- &jcy; - small short i, Cyrillic
- &kcedil; - small k, cedilla
- &kcy; - small ka, Cyrillic
- &kgr; - small kappa, Greek
- &khgr; - small chi, Greek
- &ldquo; - left double quotation mark
- &le; - less-than-or-equal sign
- &lgr; - small lambda, Greek
- &lstrok; - small l, stroke
- &lt; - less-than sign
- &mgr; - small mu, Greek
- &middot; - middle dot
- &ncy; - small en, Cyrillic
- &ne; - not-equal sign
- &ngr; - small nu, Greek
- &ntilde; - small n, tilde
- &oacgr; - small omicron, accent, Greek
- &oacute; - small o, acute accent
- &ocirc; - small o, circumflex accent
- &ocy; - small o, Cyrillic
- &oelig; - small oe ligature
- &ogr; - small omicron, Greek
- &ograve; - small o, grave accent
- &ohacgr; - small omega, accent, Greek
- &ohgr; - small omega, Greek
- &ordm; - ordinal indicator, masculine
- &oslash; - small o, slash
- &otilde; - small o, tilde
- &ouml; - small o, dieresis or umlaut mark
- &permil; - per mille sign
- &pgr; - small pi, Greek
- &phgr; - small phi, Greek
- &plusmn; - plus-or-minus sign
- &pound; - pound sign
- &psgr; - small psi, Greek
- &quest; - question mark
- &quot; - quotation mark
- &rcaron; - small r, caron
- &rcy; - small er, Cyrillic
- &rdquo; - right double quotation mark
- &rgr; - small rho, Greek
- &scaron; - small s, caron
- &scy; - small es, Cyrillic
- &sect; - section sign
- &sfgr; - final small sigma, Greek
- &sgr; - small sigma, Greek
- &squ; - square, open
- &szlig; - small sharp s, German (sz ligature)
- &tcy; - small te, Cyrillic
- &tgr; - small tau, Greek
- &thgr; - small theta, Greek
- &thorn; - small thorn, Icelandic
- &tilde; - small tilde
- &times; - multiplication sign
- &uacgr; - small upsilon, accent, Greek
- &uacute; - small u, acute accent
- &ucirc; - small u, circumflex accent
- &udigr; - small upsilon, dieresis, Greek
- &ugr; - small upsilon, Greek
- &ugrave; - small u, grave accent
- &uuml; - small u, dieresis or umlaut mark
- &vcy; - small ve, Cyrillic
- &xgr; - small xi, Greek
- &yacute; - small y, acute accent
- &ycy; - small yeru, Cyrillic
- &yuml; - small y, dieresis or umlaut mark
- &zcaron; - small z, caron
- &zgr; - small zeta, Greek
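To read the texts as ordinary Unicode, the entities have to be mapped back to characters. The sketch below is one possible approach, not the tooling used to build the corpus. It assumes that the Latin, Cyrillic and symbol names coincide with the HTML named character references understood by Python's html.unescape, and it uses a small hand-built table for the unaccented small Greek letters (names such as &agr;), which do not appear to be covered by html.unescape. The table is deliberately partial: capital and accented Greek entities would have to be added in the same way. The name decode_entities is hypothetical.

```python
import html
import re

# Unaccented small Greek entity names from the list above, in the order of
# the Unicode block U+03B1..U+03C9 (final sigma sits between rho and sigma).
_GREEK_NAMES = [
    "agr", "bgr", "ggr", "dgr", "egr", "zgr", "eegr", "thgr", "igr",
    "kgr", "lgr", "mgr", "ngr", "xgr", "ogr", "pgr", "rgr", "sfgr",
    "sgr", "tgr", "ugr", "phgr", "khgr", "psgr", "ohgr",
]
GREEK = {name: chr(0x3B1 + i) for i, name in enumerate(_GREEK_NAMES)}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def decode_entities(text):
    """Replace SGML entity references in text with Unicode characters."""
    def replace(match):
        name = match.group(1)
        if name in GREEK:
            return GREEK[name]
        # Fall back to the HTML names; unknown names are left unchanged.
        return html.unescape(match.group(0))
    return ENTITY_RE.sub(replace, text)

print(decode_entities("&otilde;igus"))
print(decode_entities("5 &mgr;g"))
```

Decoding &quest; to a plain question mark discards the information that the original symbol was unknown, so it may be preferable to leave that entity untouched when the distinction matters.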
Last modified: December 21 2018 20:40:26.