This corpus contains texts from the internet archive of the journal „Arvutitehnika ja andmetöötlus“ („Computer technics and data processing“), http://deepthought.ttu.ee/aa/. It covers the journal volumes from 1999 to 2005, approximately 625 000 words in total.
The collection and annotation of these texts were supported by the national programme „The Language Technology Support for Estonian“.
The corpus may be used free of charge, for non-commercial purposes only.
The texts were downloaded semi-automatically from the internet and converted from PDF to SGML (TEI) format. The conversion programs were written, and the conversions carried out, by Kaarel Veskis and Heiki-Jaan Kaalep.
Each file contains one issue of the journal. The non-textual material (illustrations, figures) as well as tables and lists of references have been omitted.
No spell-checking or error correction has been performed.
The annotation follows the TEI guidelines. <div0> stands for one issue of the journal and <div1> for one article. The text has been divided into paragraphs following the original HTML markup. Sentence boundaries have been marked automatically, so the mark-up may contain some errors. Headings are annotated with <head> tags and authors with <bibl><author> tags.
Every file begins with a <teiHeader> that describes the content and size of the file and lists the tags used in it.
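The structure described above can be illustrated with a minimal Python sketch that lists the articles of one issue. It rests on several assumptions that are not stated in this description: the sample below is a hypothetical, well-formed XML fragment rather than a real corpus file, the paragraph and sentence elements are assumed to be named <p> and <s> (the usual TEI names), and the function name list_articles is made up for the example. Real corpus files are SGML and may need converting to well-formed XML before a parser like this one accepts them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the described structure; NOT taken from the corpus.
# <p> and <s> are assumed element names for paragraphs and sentences.
SAMPLE = """
<text>
  <body>
    <div0>
      <div1>
        <head>Example heading</head>
        <bibl><author>Example Author</author></bibl>
        <p><s>First sentence.</s> <s>Second sentence.</s></p>
        <p><s>Third sentence.</s></p>
      </div1>
      <div1>
        <head>Another heading</head>
        <p><s>Only sentence.</s></p>
      </div1>
    </div0>
  </body>
</text>
"""


def list_articles(xml_text):
    """Return one summary dict per <div1> (article) in each <div0> (issue)."""
    root = ET.fromstring(xml_text.strip())
    articles = []
    for issue in root.iter("div0"):
        for article in issue.findall("div1"):
            head = article.find("head")
            has_title = head is not None and head.text
            articles.append({
                "title": head.text.strip() if has_title else None,
                # Authors are nested as <bibl><author>; some articles have none.
                "authors": [a.text.strip()
                            for a in article.findall("bibl/author") if a.text],
                "paragraphs": len(article.findall("p")),
                "sentences": len(article.findall("p/s")),
            })
    return articles


if __name__ == "__main__":
    for art in list_articles(SAMPLE):
        print(art)
```

Because the sentence boundaries were marked automatically, counts obtained this way should be treated as approximate.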