Reference corpus of Estonian: Weekly "Eesti Ekspress"

What is it?

This corpus contains the internet version of the weekly "Eesti Ekspress".

These texts are part of a larger corpus called 'The Mixed corpus of Estonian'. Text collection and corpus construction are still in progress. The work is supported by the program 'Estonian language and culture', funded by the Estonian Ministry of Science and Education.

How can one use it?

The corpus is free to use, for non-commercial purposes only.

Texts and annotation

The texts are automatically saved from the internet and converted from HTML to TEI format using software created by Kaarel Kaljurand (have a look at

Every file contains one newspaper issue. Non-textual material such as photos and comic strips has been omitted.

Each file (one newspaper issue) is divided into the following parts:

The annotation of titles and authors is incomplete. Every title and author that has been annotated is correct, but a certain number of titles and authors have not been annotated as such.

Size of the corpus

Year      Words
2001  1 449 037
2000  1 672 059
1999  1 699 156
1998  1 361 693
1997    985 826
1996    347 793
Sum   7 515 564
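
The yearly figures can be checked against the stated sum with a few lines of Python. This is only a sketch: the numbers are copied from the table above, the helper name parse_counts is hypothetical, and it assumes the spaces inside each count are thousands separators. It does not reproduce how the words were originally counted.

```python
# Word counts per year, copied from the table above.
TABLE = """\
2001 1 449 037
2000 1 672 059
1999 1 699 156
1998 1 361 693
1997 985 826
1996 347 793
"""


def parse_counts(table: str) -> dict[int, int]:
    """Map each year to its word count.

    Assumption: the first field is the year and the remaining
    space-separated groups form one number (thousands separators).
    """
    counts = {}
    for line in table.splitlines():
        year, *groups = line.split()
        counts[int(year)] = int("".join(groups))
    return counts


counts = parse_counts(TABLE)
total = sum(counts.values())
print(total)  # 7515564, matching the "Sum" row
```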

Special symbols

In addition to ASCII symbols, the following entities are used in the texts:

