Reference corpus of Estonian: Postimees

XML-files
SGML TEI P3 files: 1995.zip 1996.zip 1997.zip 1998.zip 1999.zip 2000.zip

Content

This corpus contains the issues of the daily newspaper „Postimees“ from November 27, 1995 until October 10, 2000 (1760 issues containing 88 600 articles). They make up 32.9 million words in 2.5 million sentences:

Year	Size (million words)
1995	0.4
1996	6.1
1997	6.8
1998	8.2
1999	6.5
2000	4.9
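As a quick consistency check, the yearly figures can be summed and compared with the totals quoted above. The sketch below assumes the sizes are in millions of words (the table does not state the unit, but the sum matches the quoted 32.9 million); the variable names are illustrative only.

```python
# Yearly sizes from the table above, assumed to be in millions of words.
SIZE_BY_YEAR = {1995: 0.4, 1996: 6.1, 1997: 6.8, 1998: 8.2, 1999: 6.5, 2000: 4.9}

# Totals quoted in the corpus description.
ISSUES = 1760
ARTICLES = 88_600
SENTENCES_MILLION = 2.5

# Round to one decimal place, the precision of the published figures.
total_million_words = round(sum(SIZE_BY_YEAR.values()), 1)
words_per_sentence = round(total_million_words / SENTENCES_MILLION, 1)
articles_per_issue = round(ARTICLES / ISSUES, 1)

print(total_million_words, words_per_sentence, articles_per_issue)
```

On these figures the corpus averages roughly 13 words per sentence and roughly 50 articles per issue.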

These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national program “Estonian language and national culture”.

The corpus may be used free of charge for non-commercial purposes only.

Sources and markup

The texts originate from the archive of https://www.postimees.ee.

The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Erik Saarts and Heiki-Jaan Kaalep.

One file contains one issue. Non-textual parts (e.g. pictures, comic strips) have been omitted. Tables of foreign exchange rates, TV programmes, all advertisements and horoscopes have also been omitted. Multiple occurrences of the same article have been deleted, but anecdotes and e.g. book reviews contain many repeated paragraphs.

Superscript is marked with <hi rend="sup">. Unicode entities of the form &#number; have been converted to SGML entities. The conversion of various forms of the Estonian letters s and z with caron, Icelandic letters, etc. has produced many errors. Original HTML lists have been converted to ordinary text with a number at the beginning of each paragraph (if the original was a numbered list) or a hyphen at the beginning of each paragraph (if the original was a bulleted list). There are no corrections or hyphenations in the texts. The entity &quest; stands for symbols whose correct original form is unknown.
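The conversion of numeric character references to named SGML entities can be illustrated as follows. This is a minimal sketch, not the original conversion program: the function name and the small mapping are assumptions made for the example (the names follow the ISO Latin 2 entity set), and the corpus's actual entity table is not reproduced here.

```python
import re

# Hypothetical mapping from Unicode code points to SGML entity names.
# Only the Estonian s and z with caron are covered in this sketch.
ENTITY_NAMES = {
    0x0160: "Scaron",
    0x0161: "scaron",
    0x017D: "Zcaron",
    0x017E: "zcaron",
}

# Matches decimal numeric character references such as &#353;
NUMERIC_REF = re.compile(r"&#(\d+);")


def to_named_entities(text):
    """Replace known numeric references with named entities.

    References that are not in ENTITY_NAMES are left unchanged.
    """
    def replace(match):
        name = ENTITY_NAMES.get(int(match.group(1)))
        return f"&{name};" if name else match.group(0)

    return NUMERIC_REF.sub(replace, text)


print(to_named_entities("Tu&#353;ti &#381;anna"))
```

A mapping of this kind is only as good as its coverage, which may be one reason why letters encoded in several different ways in the source HTML ended up converted incorrectly.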

The opening quotation mark is the entity “; the closing quotation mark is the entity ”.

The rendition information has been tagged, using the attribute ‚rend’. If the rendition concerns a whole paragraph, then the attribute ‚rend’ is used with the corresponding tag . The possible tags and values for rendition are the following:

<hi rend='bold'>, <hi rend='italic'>, <hi rend='sup'>, <hi rend='underline'>, <hi>, , , , , , , , , , , ,

<div0> stands for a whole issue, <div1> stands for a part (e.g. „Põhileht”), <div2> and <div3> stand for theme (e.g. „Arvamus”, „Repliik”), <div4> stands for an article.

The heading for <div2>, <div3> or <div4> may be missing, and the classification of headings into division levels may be inconsistent. Book, music, film and similar reviews may contain several mini-articles inside one <div4>. The definitions of <div1>, <div2> and sometimes <div3> are based on the original file names from https://www.postimees.ee.
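The division hierarchy lends itself to simple structural checks, such as counting the articles in an issue. The sketch below uses Python's tolerant html.parser module as a stand-in for a real SGML parser, which is an assumption that may not hold for every construct in the files; the sample text and the names DivisionCounter and count_divisions are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

DIVISION_TAGS = {"div0", "div1", "div2", "div3", "div4"}


class DivisionCounter(HTMLParser):
    """Count how many times each division tag is opened."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in DIVISION_TAGS:
            self.counts[tag] += 1


def count_divisions(sgml_text):
    parser = DivisionCounter()
    parser.feed(sgml_text)
    parser.close()
    return parser.counts


# Invented sample: one issue, one part, one theme, two articles.
SAMPLE = (
    "<div0><div1><head>Põhileht</head><div2><head>Arvamus</head>"
    "<div4><head><s>Pealkiri</s></head><p><s>Esimene artikkel .</s></p></div4>"
    "<div4><p><s>Teine artikkel .</s></p></div4>"
    "</div2></div1></div0>"
)

counts = count_divisions(SAMPLE)
print(counts["div4"])  # number of articles in the sample
```

Because headings may be missing and division levels may be inconsistent, counting <div4> elements is likely to be more reliable than relying on <div2> or <div3> headings.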

The division of the texts into paragraphs follows the original HTML files exactly. The headings and authors have been tagged; not every article has a heading or an author. The author has been tagged using <bibl> <author> <s>; the text characterising the author (e.g. „Editor”) is also enclosed inside these tags. The text inside paragraphs has been processed by a program called estyhmm; as a result, the punctuation marks are separated from wordforms by a space (except those punctuation marks that are an integral part of the token, e.g. an abbreviation or an ordinal number) and the sentences are tagged with <s> and </s>. The tag <lb> marks a line break used for layout purposes only. Apart from the above, the structure of the texts (e.g. sub-headings, photo captions, questions and answers of the interviews, footnotes etc.) is not tagged.
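Since sentences are delimited by <s> and </s> and punctuation is already separated by spaces, tokens can be recovered by splitting on whitespace. The following is a minimal sketch under those assumptions; the function name and the sample paragraph are invented for the example, and the regular expressions are a simplification rather than a full SGML parser.

```python
import re

SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
LINE_BREAK = re.compile(r"<lb\s*/?>")
OTHER_TAG = re.compile(r"<[^>]+>")


def split_sentences(paragraph):
    """Return a list of sentences, each a list of whitespace-separated tokens."""
    result = []
    for body in SENTENCE.findall(paragraph):
        body = LINE_BREAK.sub(" ", body)  # a layout line break separates words
        body = OTHER_TAG.sub("", body)    # drop inline tags such as <hi rend="sup">
        result.append(body.split())
    return result


# Invented sample paragraph in the style described above.
SAMPLE = (
    '<p><s>Ta saabus 3. mail kell 10 .</s> '
    '<s>Pindala on 50 m<hi rend="sup">2</hi> .</s></p>'
)

for tokens in split_sentences(SAMPLE):
    print(tokens)
```

Note that the ordinal "3." stays a single token because its full stop is part of the token, while the sentence-final full stop is a token of its own.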

Every file starts with a <teiHeader> documenting the file contents, size, used tags etc.

SGML-entities

The SGML files contain the entities listed in this table.

Last modified: December 21 2018 16:41:34.