Reference corpus: weekly Maaleht

The frequencies of the wordforms used in this corpus.

How can one use it?

The corpus is free for use for non-commercial purposes only.


This corpus contains the issues of the weekly newspaper Maaleht from issue 20 of 2001 through issue 20 of 2004, approximately 4.3 million words in total. The distribution of words by file is given in a table.

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national programme "Estonian Language and National Culture".

The texts originate from

The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written, the conversions made and the frequencies calculated by Øivind Rangøy.

One file contains one issue. Non-textual parts (e.g. pictures, comic strips) have been omitted. TV programmes, all advertisements, etc. have also been omitted. Multiple occurrences of the same article have been deleted.

The opening quotation mark is the entity &ldquo; and the closing quotation mark is the entity &rdquo;; the single quote is '.

The rendition information has been tagged, using the attribute rend. The possible tags and values for rendition are the following: <hi rend='bold'>, <hi rend='italic'>, <hi rend='sup'>, <hi rend='underline'>, <hi>, <p rend='bold'>, <p rend='bold_italic'>, <p rend='bold_underline'>. <div0> stands for a whole issue, <div1> stands for a theme (e.g. "Uudised"), <div2> stands for an article.
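The tag inventory above can be checked against a file with a short script. The following is a minimal Python sketch, not the tooling used to build the corpus: it assumes the tags appear literally as listed, the sample fragment is invented for illustration, and the function names are hypothetical. Regular expressions are used instead of an XML parser because the files are SGML and may not be well-formed XML.

```python
import re
from collections import Counter

# Hypothetical fragment written for this example; not taken from the corpus.
SAMPLE = """<div0>
<div1>
<div2>
<p rend='bold'>Lead paragraph.</p>
<p>Body text with <hi rend='italic'>emphasis</hi> and <hi>plain highlight</hi>.</p>
</div2>
<div2>
<p>Another article.</p>
</div2>
</div1>
</div0>"""

# Opening <div0>, <div1> or <div2> tag, with or without attributes.
DIV_OPEN = re.compile(r"<(div[012])\b[^>]*>")
# Opening <hi> or <p> tag; the rend value is captured when present.
REND_TAG = re.compile(r"<(hi|p)\b(?:[^>]*?\brend=['\"]([^'\"]*)['\"])?[^>]*>")


def count_divisions(text):
    """Count issues (div0), themes (div1) and articles (div2)."""
    return Counter(m.group(1) for m in DIV_OPEN.finditer(text))


def count_renditions(text):
    """Count (tag, rend) pairs; rend is None when the attribute is absent."""
    counts = Counter()
    for m in REND_TAG.finditer(text):
        counts[(m.group(1), m.group(2))] += 1
    return counts


if __name__ == "__main__":
    print(count_divisions(SAMPLE))
    print(count_renditions(SAMPLE))
```

Any (tag, rend) pair reported by count_renditions that is not in the list above would point to either an undocumented value or a mismatch between the assumed and the actual tag syntax.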

The text has been divided into paragraphs according to the original HTML file; the sentences have been tagged automatically. Titles and authors have been annotated using the tags <bibl><author><s>; an article may lack a title or an author.
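Because sentences are tagged, wordform frequencies can be recomputed from the <s> elements. The sketch below is one possible approach in Python and is not the program that produced the published frequencies: the sample fragment is invented, the tokenizer (runs of letters, optionally joined by hyphens) is an assumption, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical fragment written for this example; not taken from the corpus.
SAMPLE = """<div2>
<p><s>Ilm on ilus.</s> <s>Ilm on soe ja ilus.</s></p>
</div2>"""

# An <s> element and its content; \b keeps <sup> and similar tags out.
S_ELEMENT = re.compile(r"<s\b[^>]*>(.*?)</s>", re.DOTALL)
# Any tag, used to strip inline markup such as <hi> inside a sentence.
TAG = re.compile(r"<[^>]+>")
# Assumed tokenizer: Unicode letters, optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def sentences(text):
    """Return the text of every <s> element with inner tags removed."""
    return [TAG.sub("", m.group(1)).strip() for m in S_ELEMENT.finditer(text)]


def wordform_frequencies(text, lowercase=True):
    """Count wordforms over all <s> elements in the text."""
    counts = Counter()
    for sentence in sentences(text):
        if lowercase:
            sentence = sentence.lower()
        counts.update(WORD.findall(sentence))
    return counts


if __name__ == "__main__":
    for form, n in wordform_frequencies(SAMPLE).most_common():
        print(form, n)
```

Titles and authors tagged with <s> would be counted as well. Entity references such as &otilde; need to be decoded before tokenizing, otherwise this tokenizer would split a word at the entity.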

Every file begins with a <teiHeader> documenting the file contents, size, the tags used, etc.


The following entities have been used in this corpus:

Entity    Sign  Explanation
Aacute    Á     capital A, acute accent
Aring     Å     capital A, ring
Auml      Ä     capital A, dieresis or umlaut mark
Ccaron    Č     capital C, caron
Ccirc     Ĉ     capital C, circumflex accent
Eacute    É     capital E, acute accent
Ncedil    Ņ     capital N, cedilla
Omacr     Ō     capital O, macron
Oslash    Ø     capital O, slash
Otilde    Õ     capital O, tilde
Ouml      Ö     capital O, dieresis or umlaut mark
Scaron    Š     capital S, caron
Umacr     Ū     capital U, macron
Uuml      Ü     capital U, dieresis or umlaut mark
Zcaron    Ž     capital Z, caron
aacute    á     small a, acute accent
acirc     â     small a, circumflex accent
aelig     æ     small ae diphthong (ligature)
agrave    à     small a, grave accent
amacr     ā     small a, macron
amp       &     ampersand
aring     å     small a, ring
atilde    ã     small a, tilde
auml      ä     small a, dieresis or umlaut mark
bull      •     bullet
cacute    ć     small c, acute accent
ccaron    č     small c, caron
ccedil    ç     small c, cedilla
curren    ¤     general currency sign
dagger    †     dagger
deg       °     degree sign
eacute    é     small e, acute accent
egrave    è     small e, grave accent
emacr     ē     small e, macron
eogon     ę     small e, ogonek
euml      ë     small e, dieresis or umlaut mark
euro      €     euro sign
frac12    ½     fraction one-half
frac14    ¼     fraction one-quarter
frac34    ¾     fraction three-quarters
gt        >     greater-than sign
iacute    í     small i, acute accent
imacr     ī     small i, macron
kcedil    ķ     small k, cedilla
lcedil    ļ     small l, cedilla
ldquo     “     left double quotation mark
lt        <     less-than sign
micro     µ     micro sign
middot    ·     middle dot
nacute    ń     small n, acute accent
ncaron    ň     small n, caron
ncedil    ņ     small n, cedilla
ntilde    ñ     small n, tilde
oacute    ó     small o, acute accent
ograve    ò     small o, grave accent
ohm       Ω     ohm sign
omacr     ō     small o, macron
oslash    ø     small o, slash
otilde    õ     small o, tilde
ouml      ö     small o, dieresis or umlaut mark
permil    ‰     per mille sign
plusmn    ±     plus-or-minus sign
pound     £     pound sign
rarr      →     rightward arrow
rcaron    ř     small r, caron
rcedil    ŗ     small r, cedilla
rdquo     ”     right double quotation mark
reg       ®     registered sign
sacute    ś     small s, acute accent
scaron    š     small s, caron
sect      §     section sign
sup1      ¹     superscript one
sup2      ²     superscript two
sup3      ³     superscript three
szlig     ß     small sharp s, German (sz ligature)
times     ×     multiply sign
trade     ™     trade mark sign
uacute    ú     small u, acute accent
ucirc     û     small u, circumflex accent
ugrave    ù     small u, grave accent
umacr     ū     small u, macron
uuml      ü     small u, dieresis or umlaut mark
yacute    ý     small y, acute accent
zcaron    ž     small z, caron
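
The entities listed above can be turned back into Unicode characters with a lookup table. The following Python sketch assumes that entities are written in the usual SGML form &name; and includes only a subset of the table; the function name decode_entities and the choice to leave unknown names untouched are assumptions made for this example.

```python
import re

# A subset of the entity table as Unicode characters; extend as needed.
ENTITIES = {
    "Auml": "Ä", "Otilde": "Õ", "Ouml": "Ö", "Uuml": "Ü",
    "Scaron": "Š", "Zcaron": "Ž",
    "auml": "ä", "otilde": "õ", "ouml": "ö", "uuml": "ü",
    "scaron": "š", "zcaron": "ž",
    "ldquo": "\u201c", "rdquo": "\u201d",
    "amp": "&", "lt": "<", "gt": ">",
}

# An entity reference: ampersand, name, semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text, table=ENTITIES):
    """Replace &name; references found in the table; leave others as they are."""
    def replace(match):
        return table.get(match.group(1), match.group(0))
    # A single pass, so the "&" produced by &amp; is never decoded a second time.
    return ENTITY_REF.sub(replace, text)


if __name__ == "__main__":
    print(decode_entities("&ldquo;P&otilde;llumees&rdquo; k&uuml;lvab"))
```

If tags are also being processed, decoding is safer after the tags have been stripped, since &lt; and &gt; decode to literal angle brackets.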

