This corpus contains the internet version of the weekly newspaper "Eesti Ekspress" (www.ekspress.ee).
year | words |
---|---|
2001 | 1 449 037 |
2000 | 1 672 059 |
1999 | 1 699 156 |
1998 | 1 361 693 |
1997 | 985 826 |
1996 | 347 793 |
Sum | 7 515 564 |
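The total in the last row can be checked against the yearly figures. The short Python sketch below simply re-adds the numbers from the table; the variable names are illustrative only.

```python
# Word counts per year, copied from the table above.
words_per_year = {
    2001: 1_449_037,
    2000: 1_672_059,
    1999: 1_699_156,
    1998: 1_361_693,
    1997: 985_826,
    1996: 347_793,
}

total = sum(words_per_year.values())
print(total)  # 7515564, matching the "Sum" row
```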
These texts are part of a corpus called 'The Mixed corpus of Estonian'.
The corpus is free for use for non-commercial purposes only.
The texts were automatically saved from the internet and converted from HTML to TEI format using software created by Kaarel Kaljurand.
Every file contains one newspaper issue. Non-textual material such as photos and comic strips has been omitted.
Each file (one newspaper issue) is divided into the following parts:
<div0> the whole issue, e.g.
<div0 type='unknown'><head>Eesti Ekspress</head>
<div1> the main part of the newspaper (A-osa) or the cultural supplement (kultuurilisa), e.g.
<div1 type='unknown'><head>AOSA</head>
<div2> a section (rubric), e.g.
<div2 type='unknown'><head>MAGNET</head>
<div3> the articles (one or more) that were represented as a single file. They may share a common heading, but the heading may also be absent. If the heading is missing, the string CT_FILENAME stands in for it, e.g.
<div3 type='unknown'><head>CT_FILENAME</head>
<div4> article, e.g.
<div4 type='unknown'><head>Sõbrakäsi idast</head>
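The division levels above can be listed with a few lines of Python. This is a minimal sketch, not part of the corpus tooling: it assumes every <divN> opening tag is immediately followed by its <head>, as in the examples above, and it uses a regular expression rather than a full SGML parser. The function name `outline`, the sample string, and the closing tags in the sample are assumptions made for illustration.

```python
import re

# An opening <div0>..<div4> tag followed directly by its <head> element.
DIV_HEAD = re.compile(r"<div([0-4])\b[^>]*>\s*<head>(.*?)</head>", re.S | re.I)

def outline(sgml_text):
    """Return (level, heading) pairs in document order."""
    return [
        (int(m.group(1)), " ".join(m.group(2).split()))
        for m in DIV_HEAD.finditer(sgml_text)
    ]

# Sample assembled from the example lines above; the closing tags are assumed.
sample = """<div0 type='unknown'><head>Eesti Ekspress</head>
<div1 type='unknown'><head>AOSA</head>
<div2 type='unknown'><head>MAGNET</head>
<div3 type='unknown'><head>CT_FILENAME</head>
<div4 type='unknown'><head>Sõbrakäsi idast</head>
</div4></div3></div2></div1></div0>"""

# Print the headings indented by division level.
for level, heading in outline(sample):
    print("  " * level + heading)
```

A heading equal to CT_FILENAME marks a <div3> whose group of articles had no common heading, so it can be filtered out when only real titles are wanted.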
The authors have been annotated using the tag combination <bibl> <author>, e.g.
<p><bibl><author>Siim Nestor</author></bibl></p>
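Author names annotated in this way can be collected with a similar pattern. The sketch below assumes the <author> element sits directly inside <bibl>, as in the example above; `authors` is a hypothetical helper name, and the second name in the sample is made up for the demonstration.

```python
import re
from collections import Counter

# <bibl><author>NAME</author></bibl>, allowing whitespace between the tags.
AUTHOR = re.compile(r"<bibl>\s*<author>(.*?)</author>\s*</bibl>", re.S | re.I)

def authors(sgml_text):
    """Return annotated author names in document order."""
    return [" ".join(m.group(1).split()) for m in AUTHOR.finditer(sgml_text)]

# The first paragraph is the example from the text; the rest is invented.
sample = (
    "<p><bibl><author>Siim Nestor</author></bibl></p>\n"
    "<p><bibl><author>Example Author</author></bibl></p>\n"
    "<p><bibl><author>Siim Nestor</author></bibl></p>\n"
)

# Count how many times each author is annotated.
counts = Counter(authors(sample))
print(counts.most_common())
```

Because some authors in the corpus are not annotated, counts obtained this way are lower bounds rather than exact totals.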
Every file begins with a teiHeader that describes the file and the tags used.
In this version, sentence boundaries are not annotated. The smallest subdivision is <p>.
The annotation of titles and authors may be incomplete. All annotated titles and authors are correct, but a certain number of titles and authors have not been annotated as such.
The SGML files contain the entities listed in this table.