The Mixed Corpus: Arvutitehnika ja Andmetöötlus (Computer technics and data processing)

This corpus contains texts from the internet archive of the journal „Arvutitehnika ja andmetöötlus“ („Computer technics and data processing“). It covers the journal volumes from the years 1999–2005, approximately 625 000 words.

The collection and annotation of these texts were supported by the national programme „The Language Technology Support for Estonian“.

How can one use it?

The corpus is free to use for non-commercial purposes only.

Source and annotation

The texts have been semi-automatically downloaded from the internet and converted from PDF to SGML (TEI) format. The conversion programs were written, and the conversions carried out, by Kaarel Veskis and Heiki-Jaan Kaalep.
One file contains one issue of the journal. The non-textual material (illustrations, figures) as well as tables and lists of references have been left out.

No spell-checking or error correction has been performed.


The annotation follows the TEI guidelines. <div0> stands for one issue of the journal and <div1> for one article. The text has been divided into paragraphs following the original HTML markup. Sentence boundaries have been marked automatically, so the markup may contain some errors. Headings are annotated with <head> tags and authors with <bibl><author> tags.

Every file begins with a <teiHeader> that describes the content and size of the file and lists the tags used in it.
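
The structure described above can be walked with a short script. The following is only a minimal sketch in Python, not the tooling used to build the corpus. It relies on the tag names given above (<teiHeader>, <div0>, <div1>, <head>, <bibl><author>). The <text>, <body> and <p> tags in the sample, the sample content itself, and the names ArticleCollector and collect_articles are assumptions made for illustration. The sketch uses the standard-library HTMLParser as a tolerant tag reader. It assumes that end tags are present, so SGML files that omit end tags would need a proper SGML parser instead.

```python
from html.parser import HTMLParser


class ArticleCollector(HTMLParser):
    """Collect the heading and authors of every <div1> (one article)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.articles = []        # one dict per <div1>
        self._stack = []          # names of the currently open tags
        self._in_header = False   # True while inside <teiHeader>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names: <teiHeader> arrives as "teiheader".
        if tag == "teiheader":
            self._in_header = True
        elif not self._in_header:
            if tag == "div1":
                self.articles.append({"head": "", "authors": []})
            elif tag == "author" and "div1" in self._stack:
                self.articles[-1]["authors"].append("")
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag == "teiheader":
            self._in_header = False
        if tag in self._stack:
            # Drop everything up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self._in_header or not self.articles or len(self._stack) < 2:
            return
        article = self.articles[-1]
        if self._stack[-1] == "head" and self._stack[-2] == "div1":
            # A <head> directly under <div1> is taken to be the article title.
            article["head"] += data
        elif self._stack[-1] == "author" and article["authors"]:
            article["authors"][-1] += data


def collect_articles(sgml_text):
    """Return a list of {"head": str, "authors": [str, ...]} dicts."""
    parser = ArticleCollector()
    parser.feed(sgml_text)
    parser.close()
    for article in parser.articles:
        article["head"] = article["head"].strip()
        article["authors"] = [a.strip() for a in article["authors"]]
    return parser.articles


# Hypothetical miniature file, invented for this example.
SAMPLE = """
<teiHeader><fileDesc>Header text</fileDesc></teiHeader>
<text><body>
<div0>
<div1>
<head>First article</head>
<bibl><author>A. Author</author></bibl>
<p>One paragraph.</p>
</div1>
<div1>
<head>Second article</head>
<bibl><author>B. Writer</author><author>C. Editor</author></bibl>
<p>Another paragraph.</p>
</div1>
</div0>
</body></text>
"""

if __name__ == "__main__":
    for article in collect_articles(SAMPLE):
        print(article["head"], "|", ", ".join(article["authors"]))
```

Everything inside the <teiHeader> is skipped, so header text is never mistaken for article content. Character entities that the parser does not recognise are left in the text unchanged.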

The following entities have been used in this corpus:

