
The Mixed Corpus: Lääne Elu


This subcorpus contains issues of the newspaper „Lääne Elu“ (the local newspaper of Läänemaa county) published between 04.05.2000 and 01.11.2008: 1273 issues and 6407 articles, totalling 1 764 250 words in 126 205 sentences. The texts were downloaded semi-automatically and converted from HTML to TEI format. Kristel Uiboaed wrote the programs and carried out the conversions.

Non-textual material has been omitted from the newspaper texts. By non-textual material we mean pictures (photos, drawings, diagrams, etc.). We have also omitted articles consisting of tables only, such as sports results tables or TV listings. Finally, weather forecasts and horoscopes have been omitted.

How can one use it?

The corpus is free to use for non-commercial purposes only.

Texts and annotation

Mark-up and annotation conform to the TEI guidelines. Each file contains one issue of the newspaper.

Every file begins with a header <teiheader> that contains information about the file size, the tags used, etc. The rest of the file is structured as follows:

The text has been annotated for paragraphs, sentences, headlines and authors.

Special symbols

The non-ASCII characters/symbols are presented using the following entities:

Entity Character
&Agrave; À
&amp; &
&aogon; ą
&Aring; Å
&aring; å
&ast; *
&commat; @
&atilde; ã
&Auml; Ä
&auml; ä
&cacute; ć
&deg; °
&Eacute; É
&eacute; é
&euml; ë
&iogon; į
&lowbar; _
&lsqb; [
&middot; ·
&Ntilde; Ñ
&ordm; º
&Oslash; Ø
&oslash; ø
&Otilde; Õ
&otilde; õ
&Ouml; Ö
&ouml; ö
&percnt; %
&plus; +
&plusmn; ±
&quot; "
&rsqb; ]
&sacute; ś
&Scaron; Š
&scaron; š
&sect; §
&sup2; ²
&sup3; ³
&Zcaron; Ž
&zcaron; ž
&tilde; ~
&times; ×
&Uogon; Ų
&Uuml; Ü
&uuml; ü
&verbar; |
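
The entity table above can be applied mechanically when the plain characters are needed. The following is a minimal sketch in Python, not the conversion program used to build the corpus: the names ENTITIES and decode_entities are hypothetical, and the mapping contains only the entities listed above. Entities that are not in the table are left unchanged.

```python
import re

# Entity names and characters copied from the table above.
ENTITIES = {
    "Agrave": "À", "amp": "&", "aogon": "ą", "Aring": "Å", "aring": "å",
    "ast": "*", "commat": "@", "atilde": "ã", "Auml": "Ä", "auml": "ä",
    "cacute": "ć", "deg": "°", "Eacute": "É", "eacute": "é", "euml": "ë",
    "iogon": "į", "lowbar": "_", "lsqb": "[", "middot": "·", "Ntilde": "Ñ",
    "ordm": "º", "Oslash": "Ø", "oslash": "ø", "Otilde": "Õ", "otilde": "õ",
    "Ouml": "Ö", "ouml": "ö", "percnt": "%", "plus": "+", "plusmn": "±",
    "quot": '"', "rsqb": "]", "sacute": "ś", "Scaron": "Š", "scaron": "š",
    "sect": "§", "sup2": "²", "sup3": "³", "Zcaron": "Ž", "zcaron": "ž",
    "tilde": "~", "times": "×", "Uogon": "Ų", "Uuml": "Ü", "uuml": "ü",
    "verbar": "|",
}

# Matches one entity reference such as "&auml;" and captures its name.
_ENTITY_RE = re.compile(r"&([A-Za-z0-9]+);")


def decode_entities(text: str) -> str:
    """Replace entities from the table; leave unknown entities untouched."""
    return _ENTITY_RE.sub(
        lambda m: ENTITIES.get(m.group(1), m.group(0)), text
    )
```

For example, decode_entities("L&auml;&auml;ne Elu") returns "Lääne Elu". The replacement is done in a single pass, so the "&" produced by decoding &amp; is never re-read as the start of another entity.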

