The Mixed Corpus: Eesti Arst

These texts form part of the reference corpus of Estonian. Their collection and processing were financed by the national program "Estonian language and national culture".

The format of the corpus

The texts were downloaded semi-automatically from the internet and converted from PDF to SGML (TEI) format. The conversion programs were written, and the conversions carried out, by Kaarel Veskis and Heiki-Jaan Kaalep.

Each file contains the issues of the journal published during one year. Non-textual material (tables, graphs, pictures) has been omitted, as have the English summaries and the lists of references.

The text has been divided into paragraphs as in the original PDF file. The sentences have been annotated automatically. Every file begins with a teiHeader that contains a description of the file, the tags used, etc.

<div0> stands for a whole year of the issues, <div1> stands for one issue and <div2> stands for an article or other text in an issue.
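The nesting described above can be checked with a short script. The following is a minimal sketch, assuming Python's standard html.parser is tolerant enough for these files; the SAMPLE string and the names DivCounter and count_divisions are hypothetical illustrations, not part of the corpus or its tools.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Counts <div0>, <div1> and <div2> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {"div0": 0, "div1": 0, "div2": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.counts:
            self.counts[tag] += 1


# Invented sample: one year (div0), two issues (div1), three articles (div2).
SAMPLE = (
    "<div0>"
    "<div1><div2><p>Esimene artikkel.</p></div2>"
    "<div2><p>Teine artikkel.</p></div2></div1>"
    "<div1><div2><p>Kolmas artikkel.</p></div2></div1>"
    "</div0>"
)


def count_divisions(sgml_text):
    """Return how many years, issues and articles the text contains."""
    parser = DivCounter()
    parser.feed(sgml_text)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    print(count_divisions(SAMPLE))
```

For a real file, the text would be read from disk first; the file encoding is not stated here and would have to be checked in the teiHeader.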

The opening quotation mark is the entity “. The closing quotation mark is the entity ”, and the single quote is '. Rendition information (bold, italic, etc.) is given only if it applies to the whole paragraph. The possible rendition values for a paragraph are the following:

<p rend='esirida'>
opening paragraph of the article
<p rend='toc'>
list of contents of an issue
<p rend='teesid'>
text of stand-alone abstracts, in smaller font than text of articles
<p rend='bold'>
<p rend='table_heading'>
heading of a table (table itself has been deleted)
<p rend='figure_heading'>
heading of a figure (figure itself has been deleted)
<p rend='abstract'>
text of abstract in the beginning of an article
<p rend='keywords'>
keywords attached to the article
<p rend='H6'>, <p rend='H5'>, <p rend='H4'>, <p rend='H3'>, <p rend='H2'>, <p rend='H1'>
different levels of (sub)headings
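
The rend values listed above can be used to select paragraphs of one kind, for example abstracts only. The following is a minimal sketch, assuming Python's standard html.parser; the SAMPLE string and the names RendCollector and paragraphs_by_rend are hypothetical. Because it is not stated here whether every <p> has a closing tag, the sketch also ends a paragraph when the next <p> begins.

```python
from html.parser import HTMLParser


class RendCollector(HTMLParser):
    """Collects (rend, text) pairs for every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._rend = None
        self._chunks = None  # None means "not inside a paragraph"

    def _flush(self):
        # Store the paragraph collected so far, with whitespace normalised.
        if self._chunks is not None:
            text = " ".join("".join(self._chunks).split())
            self.paragraphs.append((self._rend, text))
        self._chunks = None
        self._rend = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # also covers a missing </p>
            self._rend = dict(attrs).get("rend")
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()


# Invented sample article; the texts are placeholders, not corpus content.
SAMPLE = (
    "<div2>"
    "<p rend='H1'>Pealkiri</p>"
    "<p rend='abstract'>Uuringu tulemused.</p>"
    "<p rend='keywords'>diabeet, ravi</p>"
    "<p>Tavaline tekst.</p>"
    "</div2>"
)


def paragraphs_by_rend(sgml_text, rend):
    """Return the texts of all paragraphs whose rend attribute equals rend.

    Passing None selects paragraphs that have no rend attribute.
    """
    parser = RendCollector()
    parser.feed(sgml_text)
    parser.close()
    return [text for r, text in parser.paragraphs if r == rend]


if __name__ == "__main__":
    print(paragraphs_by_rend(SAMPLE, "abstract"))
```

Collecting every paragraph first and filtering afterwards keeps the parser simple; for large files a filter inside the parser would use less memory.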

The following entities have been used in this corpus:

