Reference corpus of Estonian: Kroonika


This corpus contains the issues of the weekly „Kroonika“ from January 2001 to April 2003 (114 issues containing 1000 articles). Together they comprise 0.6 million words in 55 thousand sentences.

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

How can one use it?

The corpus may be used free of charge, for non-commercial purposes only.

Sources and tagging

The texts have been downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Katrin Tsepelina.

One file contains one issue. Non-textual parts, i.e. photos, have been omitted.

Unicode character references of the form &#number; have been converted to SGML entities.
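
The conversion of numeric references can be illustrated with a minimal sketch. It is not the conversion program actually used for the corpus: the function name numeric_to_named is hypothetical, and the name table comes from Python's html.entities module, which is assumed here to agree with the SGML entity sets for the characters that occur in these texts.

```python
import re
from html.entities import codepoint2name

# Matches decimal numeric character references such as &#228;
NUMERIC_REF = re.compile(r"&#(\d+);")


def numeric_to_named(text):
    """Replace &#number; with a named entity where a name is known."""

    def replace(match):
        name = codepoint2name.get(int(match.group(1)))
        # Leave the reference untouched when no name is available.
        return f"&{name};" if name else match.group(0)

    return NUMERIC_REF.sub(replace, text)


# Hypothetical input line, not taken from the corpus.
print(numeric_to_named("V&#228;lismaa &#8220;Kroonika&#8221;"))
# prints: V&auml;lismaa &ldquo;Kroonika&rdquo;
```

References without a known name are passed through unchanged, so the sketch never loses information.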

The opening quotation mark is the entity &ldquo; (“); the closing quotation mark is the entity &rdquo; (”). The single quotation mark, as well as the apostrophe, is '.

The rendition information has been tagged using the attribute 'rend'. If the rendition concerns a whole paragraph, the attribute 'rend' is used with the corresponding tag <p>. The possible tags and values for rendition are the following:

<hi rend='bold'>, <hi rend='italic'>, <p rend='bold'>, <p rend='italic'>, <p rend='italic_bold'>

<div0> stands for a whole issue, <div1> stands for a theme (the possible themes are „Juhtkiri“, „Nupud“, „Pikad Lood“ and „V&auml;lismaa“), <div2> stands for an article and <div3> stands for a part of an article, starting with a sub-heading.
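
As a rough illustration of this hierarchy, the sketch below counts the opening tags of each division level in a file. It assumes that the tags appear literally as <div0> to <div3> (possibly with attributes) and relies on a plain regular expression rather than a real SGML parser; the function name count_divisions and the sample string are hypothetical.

```python
import re
from collections import Counter

# Opening tags only: "</div2>" does not match because of the slash.
DIV_OPEN = re.compile(r"<div([0-3])\b", re.IGNORECASE)


def count_divisions(sgml):
    """Count opening <div0>..<div3> tags in one issue file."""
    return Counter(f"div{level}" for level in DIV_OPEN.findall(sgml))


# Hypothetical miniature issue: one theme, two articles, one sub-headed part.
sample = (
    "<div0><div1><div2><p>...</p><div3><p>...</p></div3></div2>"
    "<div2><p>...</p></div2></div1></div0>"
)
counts = count_divisions(sample)
print(counts["div2"])  # number of articles in the sample
```

Since <div2> stands for an article, counts["div2"] summed over all files should approximate the number of articles in the corpus.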

The text inside paragraphs has been processed by a program called estyhmm. As a result, punctuation marks are separated from wordforms by a space (except those punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number), and sentences are tagged with <s> and </s>. Headings, sub-headings and authors have been tagged; not every article has a heading or an author. The author has been tagged using <bibl> <author> <s>; any text characterising the author (e.g. „Editor“) is also enclosed inside these tags.
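
Because tokens are separated by spaces and sentences are delimited by <s> and </s>, sentences and tokens can be recovered with very little code. The following is a minimal sketch under stated assumptions: the function name sentences and the sample text are hypothetical, inline tags such as <hi> are assumed to be able to occur inside a sentence, and entities are left undecoded.

```python
import re

SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def sentences(sgml):
    """Return each <s>...</s> sentence as a list of space-separated tokens."""
    result = []
    for body in SENTENCE.findall(sgml):
        # Drop inline markup such as <hi rend='bold'> before splitting.
        plain = TAG.sub(" ", body)
        result.append(plain.split())
    return result


# Hypothetical paragraph, not taken from the corpus.
sample = (
    "<p><s> Ta tuli 1. mail koju . </s> "
    "<s> Kas <hi rend='bold'>see</hi> on nii ? </s></p>"
)
for tokens in sentences(sample):
    print(len(tokens), tokens)
```

Note that the ordinal "1." stays a single token, while the sentence-final full stop is a token of its own, as described above.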

Every file starts with a <teiHeader> documenting the file contents, size, tags used, etc.


The 0.6 million words are distributed by year as follows:

year  millions of words
2001  0.27
2002  0.23
2003  0.08

Symbols and entities

In addition to ASCII symbols (unaccented letters, digits and punctuation marks), the texts contain the following entities:

