Reference corpus of Estonian: Postimees

Contents

This corpus contains the issues of the daily newspaper „Postimees” from November 27, 1995 to October 10, 2000 (1760 issues containing 88 600 articles). Together they make up 32.9 million words in 2.5 million sentences.

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national programme “Estonian language and national culture”.

How can one use it?

The corpus is free to use for non-commercial purposes only.

Sources and markup

The texts originate from http://www.postimees.ee.

The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Erik Saarts and Heiki-Jaan Kaalep.

One file contains one issue. Non-textual parts (e.g. pictures, comic strips) have been omitted, as have tables of foreign currency rates, TV programmes, all advertisements and horoscopes. Multiple occurrences of the same article have been deleted, but anecdotes and, for example, book reviews still contain many repeated paragraphs.

Superscript is marked with <hi rend="sup">. Unicode entities of the form &#number; have been converted to SGML entities. The conversion of the various encodings of the Estonian letters s and z with caron, of Icelandic letters, etc. has produced many errors. Original HTML lists have been converted to ordinary text, with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). There are no corrections or hyphenations in the texts. The entity &quest; stands for symbols whose correct original form is unknown.

The opening quotation mark is the entity &ldquo;; the closing quotation mark is the entity &rdquo;.
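As a rough illustration of how the entities described above might be turned back into characters, here is a minimal Python sketch. It assumes that the SGML entity names used in the corpus coincide with the HTML5 names known to Python's standard `html` module (this holds for `&ldquo;`, `&rdquo;`, `&scaron;` and `&zcaron;`, but has not been checked against the full entity set of the corpus). The function name `decode_entities` and the choice of U+FFFD as a placeholder for `&quest;` are illustrative assumptions, not part of the corpus tools.

```python
import html

# Assumption: U+FFFD (REPLACEMENT CHARACTER) is our own placeholder for
# symbols whose original form is unknown; the corpus itself only uses &quest;.
UNKNOWN = "\ufffd"


def decode_entities(text: str) -> str:
    """Replace named and numeric entities with Unicode characters."""
    # Handle &quest; first, so that an unknown symbol is not silently
    # turned into an ordinary question mark by html.unescape().
    text = text.replace("&quest;", UNKNOWN)
    # html.unescape() leaves entity names it does not know unchanged.
    return html.unescape(text)


# Made-up sample line, not taken from the corpus.
sample = "&ldquo; T&otilde;si &rdquo; , &uuml;tles ta &quest;"
print(decode_entities(sample))
```

Keeping `&quest;` distinct from a real question mark makes it possible to count or filter the damaged symbols later.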

Rendition information is tagged using the attribute 'rend'. If the rendition applies to a whole paragraph, the attribute 'rend' is placed on the corresponding <p> tag. The possible tags and values for rendition are the following:

<hi rend='bold'>, <hi rend='italic'>, <hi rend='sup'>, <hi rend='underline'>, <hi>, <p rend='bold'>, <p rend='bold_italic'>, <p rend='bold_underline'>, <p rend='h3'>, <p rend='h3_bold'>, <p rend='h3_italic'>, <p rend='h4'>, <p rend='h4_bold'>, <p rend='italic'>, <p rend='italic_bold'>, <p rend='underline'>, <p rend='underline_bold'>
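The list above contains compound values in both orders (e.g. 'bold_italic' and 'italic_bold'). A minimal Python sketch of one way to compare them is given below. It assumes that the underscore simply joins independent rendition features and that the order carries no meaning; the source does not state this, and the helper name `rend_features` is hypothetical.

```python
def rend_features(value: str) -> frozenset[str]:
    """Split a rend value such as 'h3_bold' into a set of features."""
    # Assumption: '_' separates independent features and order is irrelevant.
    return frozenset(value.split("_"))


print(rend_features("bold_italic") == rend_features("italic_bold"))
print(sorted(rend_features("h3_bold")))
```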

<div0> stands for a whole issue, <div1> stands for a part (e.g. „Põhileht”), <div2> and <div3> stand for a theme (e.g. „Arvamus”, „Repliik”), and <div4> stands for an article.

The heading of a <div2>, <div3> or <div4> may be missing, and the assignment of headings to division levels may be inconsistent. Book, music, film and similar reviews may contain several mini-articles inside one <div4>. The <div1> and <div2> divisions, and sometimes <div3> as well, are based on the original file names from http://www.postimees.ee.
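To get a quick overview of one issue file, the division levels can be counted with a regular expression over the opening tags. The following is a minimal Python sketch under that assumption; it does not parse SGML properly, the function name `count_divisions` is hypothetical, and the sample string is made up rather than copied from the corpus.

```python
import re
from collections import Counter

# Matches opening tags <div0> ... <div4>, with or without attributes.
# Closing tags such as </div4> do not match because of the slash.
DIV_OPEN_RE = re.compile(r"<div([0-4])\b", re.IGNORECASE)


def count_divisions(sgml: str) -> Counter:
    """Count how many divisions of each level occur in the text."""
    return Counter(int(m.group(1)) for m in DIV_OPEN_RE.finditer(sgml))


# Made-up miniature issue with two articles.
sample = (
    "<div0><div1><div2><div3>"
    "<div4><p>...</p></div4>"
    "<div4><p>...</p></div4>"
    "</div3></div2></div1></div0>"
)
counts = count_divisions(sample)
print(counts[4])  # number of <div4> elements, i.e. articles
```

Because one <div4> may hold several mini-articles, such a count is only an approximation of the number of articles.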

The division of the texts into paragraphs follows the original HTML files exactly. Headings and authors have been tagged; not every article has a heading or an author. The author is tagged using <bibl> <author> <s>; text characterising the author (e.g. „Editor”) is also enclosed inside these tags. The text inside paragraphs has been processed by a program called estyhmm; as a result, punctuation marks are separated from word forms by a space (except for punctuation marks that are an integral part of the token, e.g. in an abbreviation or an ordinal number) and sentences are tagged with <s> and </s>. The tag <lb> marks a new line used for layout purposes only. Apart from the above, the structure of the texts (e.g. sub-headings, photo captions, questions and answers in interviews, footnotes etc.) is not tagged.
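Since sentences are delimited by <s> and </s> and tokens are separated by spaces, sentences and tokens can be read off with very little code. The Python sketch below is one possible way to do this, assuming that <s> elements are not nested; the function name `sentences` and the sample paragraph are illustrative and not taken from the corpus. Note that it also returns the <s> elements used inside <bibl> <author>.

```python
import re

S_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def sentences(paragraph: str) -> list[list[str]]:
    """Return the sentences of a paragraph as lists of tokens."""
    result = []
    for match in S_RE.finditer(paragraph):
        # Drop inline tags such as <hi rend='bold'> or <lb>.
        plain = TAG_RE.sub(" ", match.group(1))
        # Tokens are already separated by spaces, so split() is enough.
        result.append(plain.split())
    return result


# Made-up sample; '5.' is an ordinal number and stays a single token.
sample = (
    "<p><s> Ta tuli koju . </s> "
    "<s> Kell oli <hi rend='bold'> 5. </hi> </s></p>"
)
for tokens in sentences(sample):
    print(tokens)
```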

Every file starts with a <teiHeader> documenting the file contents, size, used tags etc.

Size

32.9 million words, distributed by year as follows:

Year  Size (million words)
1995  0.4
1996  6.1
1997  6.8
1998  8.2
1999  6.5
2000  4.9
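As a quick consistency check, the yearly figures can be summed and compared with the stated total of 32.9 million words. The Python sketch below stores the sizes in tenths of a million to avoid floating-point rounding; the variable names are illustrative.

```python
# Sizes from the table, in tenths of a million words (0.4 -> 4, 6.1 -> 61, ...).
sizes_in_tenths = {1995: 4, 1996: 61, 1997: 68, 1998: 82, 1999: 65, 2000: 49}

total_tenths = sum(sizes_in_tenths.values())
print(total_tenths / 10)  # total in million words

# Share of each year in the whole corpus, in percent.
for year, size in sizes_in_tenths.items():
    print(year, round(100 * size / total_tenths, 1))
```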

In addition to ASCII symbols (unaccented letters, numbers and punctuation signs), the texts contain the following entities:


Last modified: March 24 2014 13:47:30.