This subcorpus contains issues of the newspaper „Lääne Elu“ (the local newspaper of Läänemaa county) from the period 04.05.2000 – 01.11.2008: 1273 issues and 6407 articles, totalling 1 764 250 words in 126 205 sentences. The texts were downloaded semi-automatically and converted from HTML to TEI format. Kristel Uiboaed wrote the programs and carried out the conversions.
Non-textual material has been omitted from the newspaper texts. By non-textual material we mean pictures (photos, drawings, diagrams, etc.). We have also omitted articles consisting only of tables, such as sports results tables or TV listings. Finally, we have omitted weather forecasts and horoscopes.
The corpus is free to use for non-commercial purposes only.
Mark-up and annotation conform to the TEI Guidelines. Each file contains one issue of the newspaper.
Every file begins with a header, <teiheader>, that contains information about the file size, the tags used, etc.
The rest of the file is structured as follows:
<div0> is one issue of the newspaper, e.g. <div0 type='leht'><head>Lääne Elu 15.04.2008</head>
<div1> is a section, e.g. <div1 type='rubriik'><head>Uudised</head>
<div2> is an article, e.g. <div2 type='artikkel'><head>Tervishoid vajab rohkem riigi abi</head>
The text has been annotated for paragraphs, sentences, headlines and authors.
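As an illustration of the nesting described above, the following is a minimal sketch of how the issue, section and article headings could be read out of one file. It is not part of the corpus tooling: the class name IssueOutline, the function outline_issue and the sample string are hypothetical, and the sketch assumes that every <div0>, <div1> and <div2> is immediately followed by its <head>, as in the examples above. It uses the lenient parser from the Python standard library because SGML files are not guaranteed to be well-formed XML; corpus-specific SGML entities are not resolved.

```python
from html.parser import HTMLParser


class IssueOutline(HTMLParser):
    """Collect (tag, type attribute, heading text) for each div0/div1/div2."""

    DIV_TAGS = {"div0", "div1", "div2"}

    def __init__(self):
        super().__init__()
        self.outline = []     # collected (tag, type, heading) tuples
        self._pending = None  # the div that is still waiting for its <head>
        self._in_head = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.DIV_TAGS:
            # Remember the division level and its type attribute.
            self._pending = (tag, dict(attrs).get("type"))
        elif tag == "head" and self._pending is not None:
            self._in_head = True
            self._buf = []

    def handle_data(self, data):
        if self._in_head:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "head" and self._in_head:
            level, div_type = self._pending
            heading = "".join(self._buf).strip()
            self.outline.append((level, div_type, heading))
            self._in_head = False
            self._pending = None


def outline_issue(text):
    """Return the list of headings found in the text of one issue."""
    parser = IssueOutline()
    parser.feed(text)
    parser.close()
    return parser.outline


# Hypothetical sample assembled from the examples in this description.
sample = """
<div0 type='leht'><head>
Lääne Elu 15.04.2008 </head>
<div1 type='rubriik'><head>
Uudised</head>
<div2 type='artikkel'><head>
Tervishoid vajab rohkem riigi abi </head>
</div2>
</div1>
</div0>
"""

for level, div_type, heading in outline_issue(sample):
    print(level, div_type, heading)
```

For a real file the text would first have to be read with the correct character encoding, which is not stated here and would need to be checked against the file header.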
The SGML files contain the entities listed in this table