Reference corpus of Estonian: Legislation
Content
This corpus contains:
- Estonian laws, 391 files
	- headings and filenames
- Estonian translations of EU legislation, 5431 files
	- headings and filenames
These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by a national program “Estonian language and national culture”.
The corpus is free to use for non-commercial purposes only.
Sources and tagging
The texts were taken from the web page of the Estonian Legal Language Centre, https://www.legaltext.ee, on April 30, 2002.
The texts were downloaded automatically from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Heiki-Jaan Kaalep.
Each file contains one law, regulation or similar act. Non-textual parts (e.g. pictures) have been omitted. The texts often contain passages in various languages.
All rendition information (e.g. italic, bold) has been deleted. Superscript and subscript are marked with <hi rend="sup"> and <hi rend="sub">. Unicode entities of the form &#number; have been converted to SGML entities. The conversion of the various forms of the Estonian letters s and z with caron, Icelandic letters, Greek letters etc. has produced many errors. Original HTML lists have been converted to ordinary text with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). The texts contain no corrections or hyphenation. The entity ? stands for symbols whose correct original form is unknown.
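The conversion of numeric references into named SGML entities can be illustrated with a minimal sketch. It is not the original conversion program: the table `CODEPOINT_TO_ENTITY` and the function `numeric_refs_to_entities` are hypothetical, the entity names `scaron` and `zcaron` are assumed ISO names for the caron letters in the entity list, and the fallback `&quest;` is an assumed spelling of the question-mark entity used for unknown symbols.

```python
import re

# Hypothetical excerpt of a code-point-to-entity table (an assumption,
# not the table used by the original conversion programs).
CODEPOINT_TO_ENTITY = {
    0x0161: "scaron",  # small s, caron
    0x017E: "zcaron",  # small z, caron
    0x03B1: "agr",     # small alpha, Greek
}

NUMERIC_REF = re.compile(r"&#(\d+);")


def numeric_refs_to_entities(text, unknown="&quest;"):
    """Replace decimal references such as &#353; with named entities.

    Code points missing from the table become the `unknown` entity,
    mirroring the rule that symbols of unknown form get a placeholder.
    """
    def replace(match):
        name = CODEPOINT_TO_ENTITY.get(int(match.group(1)))
        return f"&{name};" if name else unknown

    return NUMERIC_REF.sub(replace, text)


converted = numeric_refs_to_entities("&#353;&#945; &#9999;")
```

A table with gaps of this kind would also explain why some caron, Icelandic and Greek letters end up as the placeholder entity.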
The opening quotation mark is the entity “, and the closing quotation mark is the entity ”. The division of the texts into paragraphs follows the original HTML files exactly. One paragraph, i.e. one unit between <p> and </p>, is on one line. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except those punctuation marks that are an integral part of the token, e.g. of an abbreviation or an ordinal number), and sentences are tagged with <s> and </s>. Apart from paragraphs and sentences, the structure of the texts (e.g. headings, sections, signatures, appendixes, footnotes etc.) is not tagged.
	
Every file starts with a <teiHeader> documenting the file's contents, size, the tags used etc.
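Because each paragraph is on one line and tokens are separated by spaces, sentences and tokens can be recovered with very little code. The following is a minimal sketch under that assumption; the function name `sentences_in_paragraph` and the sample line are hypothetical and do not come from the corpus.

```python
import re

# Non-greedy match of everything between one <s> and the next </s>.
SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def sentences_in_paragraph(line):
    """Return the sentences of one paragraph line as lists of tokens.

    Tokens are split on whitespace only, so an ordinal such as "1."
    stays a single token while a sentence-final "." is separate.
    """
    return [body.split() for body in SENTENCE.findall(line)]


# Invented sample line in the described layout (not taken from the corpus).
sample = ("<p><s> Seadus j&otilde;ustub 1. jaanuaril . </s>"
          "<s> Vt lisa 2 . </s></p>")

sentences = sentences_in_paragraph(sample)
token_count = sum(len(tokens) for tokens in sentences)
```

Entities such as `&otilde;` remain inside their tokens here; decoding them is a separate step.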
Size
Estonian laws (1.8 million tokens)
Estonian translations of EU legislation (9.6 million tokens)
Numbers and abbreviations have been counted as tokens.
Symbols and entities
In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain the following entities:
- Æ - capital AE diphthong (ligature)
- &Aacgr; - capital Alpha, accent, Greek
- Á - capital A, acute accent
- Â - capital A, circumflex accent
- &Agr; - capital Alpha, Greek
- À - capital A, grave accent
- Å - capital A, ring
- Ã - capital A, tilde
- Ä - capital A, dieresis or umlaut mark
- &Bgr; - capital Beta, Greek
- Ç - capital C, cedilla
- &Dgr; - capital Delta, Greek
- &EEgr; - capital Eta, Greek
- Ð - capital Eth, Icelandic
- &Eacgr; - capital Epsilon, accent, Greek
- É - capital E, acute accent
- Ê - capital E, circumflex accent
- Э - capital E, Cyrillic
- &Egr; - capital Epsilon, Greek
- È - capital E, grave accent
- Ë - capital E, dieresis or umlaut mark
- &Ggr; - capital Gamma, Greek
- Í - capital I, acute accent
- Î - capital I, circumflex accent
- &Igr; - capital Iota, Greek
- Ì - capital I, grave accent
- Ï - capital I, dieresis or umlaut mark
- &KHgr; - capital Chi, Greek
- &Kgr; - capital Kappa, Greek
- &Lgr; - capital Lambda, Greek
- Ł - capital L, stroke
- М - capital EM, Cyrillic
- &Mgr; - capital Mu, Greek
- &Ngr; - capital Nu, Greek
- Ñ - capital N, tilde
- Œ - capital OE ligature
- &OHacgr; - capital Omega, accent, Greek
- &OHgr; - capital Omega, Greek
- Ó - capital O, acute accent
- Ô - capital O, circumflex accent
- &Ogr; - capital Omicron, Greek
- Ò - capital O, grave accent
- Ø - capital O, slash
- Õ - capital O, tilde
- Ö - capital O, dieresis or umlaut mark
- &PHgr; - capital Phi, Greek
- &PSgr; - capital Psi, Greek
- П - capital PE, Cyrillic
- &Pgr; - capital Pi, Greek
- Ř - capital R, caron
- &Rgr; - capital Rho, Greek
- Š - capital S, caron
- &Sgr; - capital Sigma, Greek
- &THgr; - capital Theta, Greek
- &Tgr; - capital Tau, Greek
- Ú - capital U, acute accent
- &Ugr; - capital Upsilon, Greek
- Ù - capital U, grave accent
- Ü - capital U, dieresis or umlaut mark
- &Xgr; - capital Xi, Greek
- Ý - capital Y, acute accent
- Ž - capital Z, caron
- &Zgr; - capital Zeta, Greek
- &aacgr; - small alpha, accent, Greek
- á - small a, acute accent
- â - small a, circumflex accent
- а - small a, Cyrillic
- æ - small ae diphthong (ligature)
- &agr; - small alpha, Greek
- à - small a, grave accent
- & - ampersand
- ą - small a, ogonek
- å - small a, ring
- ã - small a, tilde
- ä - small a, dieresis or umlaut mark
- &bgr; - small beta, Greek
- č - small c, caron
- ç - small c, cedilla
- ° - degree sign
- &dgr; - small delta, Greek
- $ - dollar sign
- é - small e, acute accent
- ê - small e, circumflex accent
- &eeacgr; - small eta, accent, Greek
- &eegr; - small eta, Greek
- &egr; - small epsilon, Greek
- è - small e, grave accent
- ę - small e, ogonek
- ð - small eth, Icelandic
- ë - small e, dieresis or umlaut mark
- € - euro sign
- ≥ - greater-than-or-equal sign
- &ggr; - small gamma, Greek
- > - greater-than sign
- &iacgr; - small iota, accent, Greek
- í - small i, acute accent
- î - small i, circumflex accent
- и - small i, Cyrillic
- &idigr; - small iota, dieresis, Greek
- е - small ie, Cyrillic
- &igr; - small iota, Greek
- ì - small i, grave accent
- ï - small i, dieresis or umlaut mark
- й - small short i, Cyrillic
- ķ - small k, cedilla
- к - small ka, Cyrillic
- &kgr; - small kappa, Greek
- &khgr; - small chi, Greek
- “ - left double quotation mark
- ≤ - less-than-or-equal sign
- &lgr; - small lambda, Greek
- ł - small l, stroke
- < - less-than sign
- &mgr; - small mu, Greek
- · - middle dot
- н - small en, Cyrillic
- ≠ - not-equal sign
- &ngr; - small nu, Greek
- ñ - small n, tilde
- &oacgr; - small omicron, accent, Greek
- ó - small o, acute accent
- ô - small o, circumflex accent
- о - small o, Cyrillic
- œ - small oe ligature
- &ogr; - small omicron, Greek
- ò - small o, grave accent
- &ohacgr; - small omega, accent, Greek
- &ohgr; - small omega, Greek
- º - ordinal indicator, masculine
- ø - small o, slash
- õ - small o, tilde
- ö - small o, dieresis or umlaut mark
- ‰ - per mille sign
- &pgr; - small pi, Greek
- &phgr; - small phi, Greek
- ± - plus-or-minus sign
- £ - pound sign
- &psgr; - small psi, Greek
- ? - question mark
- " - quotation mark
- ř - small r, caron
- р - small er, Cyrillic
- ” - right double quotation mark
- &rgr; - small rho, Greek
- š - small s, caron
- с - small es, Cyrillic
- § - section sign
- &sfgr; - final small sigma, Greek
- &sgr; - small sigma, Greek
- □ - square, open
- ß - small sharp s, German (sz ligature)
- т - small te, Cyrillic
- &tgr; - small tau, Greek
- &thgr; - small theta, Greek
- þ - small thorn, Icelandic
- ˜ - small tilde
- × - multiply sign
- &uacgr; - small upsilon, accent, Greek
- ú - small u, acute accent
- û - small u, circumflex accent
- &udigr; - small upsilon, dieresis, Greek
- &ugr; - small upsilon, Greek
- ù - small u, grave accent
- ü - small u, dieresis or umlaut mark
- в - small ve, Cyrillic
- &xgr; - small xi, Greek
- ý - small y, acute accent
- ы - small yeru, Cyrillic
- ÿ - small y, dieresis or umlaut mark
- ž - small z, caron
- &zgr; - small zeta, Greek
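For readers who want plain Unicode text, the entities listed above can be decoded with a lookup table. The sketch below is one possible approach, not part of the corpus tooling: `ISO_GREEK` is a hypothetical partial table for the Greek entities (whose names differ from the HTML ones), and all other names are looked up in Python's standard `html.entities.html5` table on the assumption that they mean the same thing there.

```python
import re
from html.entities import html5

# Partial, hypothetical table for Greek entities from the list above;
# a full version would cover every name ending in "gr".
ISO_GREEK = {
    "agr": "α", "bgr": "β", "ggr": "γ", "dgr": "δ",
    "ohgr": "ω", "OHgr": "Ω", "sfgr": "ς",
}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace named entities with Unicode characters where known.

    Unknown names are left untouched so that nothing is lost silently.
    """
    def replace(match):
        name = match.group(1)
        if name in ISO_GREEK:
            return ISO_GREEK[name]
        # html5 keys include the trailing semicolon, e.g. "scaron;".
        return html5.get(name + ";", match.group(0))

    return ENTITY.sub(replace, text)


decoded = decode_entities("&sect; 5 : &agr; &ge; 0 , &scaron;&unknownx;")
```

Leaving unknown entities in place makes it easy to search the output for names still missing from the table.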
	
		 
 
	
	
	Last modified: December 21 2018 20:40:26.
      
	