This subcorpus contains issues of the newspaper "Valgamaalane" (the local newspaper of Valga county) from the period 02.09.2004 to 31.07.2008: 598 issues and 10 577 articles, totalling 2 495 302 words in 182 936 sentences.

The texts were downloaded semi-automatically and converted from HTML to TEI format. The programs were written and the conversions carried out by Kristel Uiboaed.

Non-textual material has been omitted from the newspaper texts. By non-textual material we mean pictures (photos, drawings, diagrams, etc.). We have also omitted articles consisting of tables only, such as sports results tables or TV listings. Lastly, we have omitted weather forecasts and horoscopes.

How can one use it?

The corpus is free to use for non-commercial purposes only.

Texts and annotation

Mark-up and annotation conform to the TEI Guidelines. Each file contains one issue of the newspaper.

Every file begins with a header <teiheader> that contains information about the file size, the tags used, etc.

The rest of the file is structured as follows:

The text has been annotated for paragraphs, sentences, headlines and authors.
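
As a rough illustration of how the sentence annotation might be used, the Python sketch below pulls sentence strings out of a file's text. It assumes that sentences are wrapped in <s>...</s> elements, which is the usual TEI element for sentence-like units; the actual tag names in these files are not stated here and should be checked against the header of a real file. The function name extract_sentences is hypothetical.

```python
import re

# ASSUMPTION: sentences are marked with <s>...</s>, the usual TEI element
# for sentence-like units. Check the tag list in the file header first.
S_RE = re.compile(r"<s\b[^>]*>(.*?)</s>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_sentences(tei_text):
    """Return the text of every <s> element with inner tags removed."""
    sentences = []
    for match in S_RE.finditer(tei_text):
        plain = TAG_RE.sub("", match.group(1))   # drop any nested mark-up
        sentences.append(" ".join(plain.split()))  # normalise whitespace
    return sentences


sample = "<p><s>Tere hommikust!</s>\n<s>Valga  linn\nkasvab.</s></p>"
print(extract_sentences(sample))
```

A regular expression is used instead of an XML parser because the files rely on named entities, which a strict XML parser rejects unless the entity declarations are available.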

Special symbols

Non-ASCII characters and certain special symbols are represented using the following entities:

Entity    Symbol  Description
&Acy;     А       Cyrillic capital A
&Amacr;   Ā
&Aring;   Å
&Auml;    Ä
&Ccaron;  Č
&Eacute;  É
&Icy;     И
&Ncy;     Н       Cyrillic capital EN
&Otilde;  Õ
&Ouml;    Ö
&Scaron;  Š
&Uogon;   Ų
&Uuml;    Ü
&Vcy;     В       Cyrillic capital VE
&Zcaron;  Ž
&amacr;   ā
&amp;     &
&aogon;   ą
&aring;   å
&ast;     *
&commat;  @
&auml;    ä
&cacute;  ć
&ccaron;  č
&deg;     °
&eacute;  é
&edot;    ė
&emacr;   ē
&frac12;  ½
&frac34;  ¾
&gcedil;  ģ       small ģ
&imacr;   ī
&iogon;   į
&kcedil;  ķ
&lcedil;  ļ
&lowbar;  _
&middot;  ·
&ncedil;  ņ
&oacute;  ó
&otilde;  õ
&ouml;    ö
&percnt;  %
&plus;    +
&quot;    "
&rcaron;  ř
&rcedil;  ŗ
&scaron;  š
&sect;    §
&sup1;    ¹
&sup2;    ²
&sup3;    ³
&times;   ×
&umacr;   ū
&uogon;   ų
&uuml;    ü
&zcaron;  ž
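
To read the texts as ordinary Unicode, the entities listed above can be mapped back to their characters. The Python sketch below shows one way this might be done. The mapping covers only a subset of the table and would need to be extended with the remaining rows; the names ENTITY_TO_CHAR and decode_entities are hypothetical. An explicit table is used rather than a general HTML decoder because not every name in the list (for example &gcedil;) is guaranteed to be known to such decoders.

```python
import re

# Subset of the entity table above; extend with the remaining rows as needed.
ENTITY_TO_CHAR = {
    "Auml": "Ä", "Otilde": "Õ", "Ouml": "Ö", "Uuml": "Ü",
    "Scaron": "Š", "Zcaron": "Ž",
    "auml": "ä", "otilde": "õ", "ouml": "ö", "uuml": "ü",
    "scaron": "š", "zcaron": "ž",
    "gcedil": "ģ", "amp": "&", "quot": '"',
    "commat": "@", "percnt": "%", "lowbar": "_",
}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace known entities with their characters in a single pass.

    Unknown entities are left untouched so that nothing is lost silently.
    """
    def replace(match):
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(replace, text)


print(decode_entities("T&otilde;rva &amp; V&auml;ike-Maarja"))
```

Because the substitution runs in a single pass, text produced by decoding &amp; is not scanned again, so a literal ampersand followed by an entity name is not decoded twice.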


