Scientific texts in the Balanced Corpus

This subcorpus contains 5 million words of scientific texts. PhD dissertations make up about half of it; the remaining half consists of scientific journals such as „Eesti Arst“, „Arvutitehnika ja Andmetöötlus“ and „Agraarteadus“, the yearbooks of Emakeele Selts and Eesti Matemaatika Selts, etc. The full list of the included texts can be found in this table.

Non-ASCII characters are represented as SGML entities, e.g.
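To read the texts with ordinary Unicode characters, the entities have to be decoded first. The following is a minimal Python sketch, assuming the corpus uses the standard ISO Latin entity names (such as &otilde; or &auml;), which Python's html.unescape also recognizes. The function name decode_entities and the sample string are made up for illustration and are not taken from the corpus.

```python
import html


def decode_entities(line: str) -> str:
    """Replace named character entities (e.g. &otilde;) with Unicode characters.

    Assumption: the entity names in the corpus are among the standard
    names that html.unescape knows; unknown entities are left unchanged.
    """
    return html.unescape(line)


# Made-up sample line, not taken from the corpus.
sample = "&Otilde;ppet&ouml;&ouml; ja s&uuml;steem"
print(decode_entities(sample))
```

If the corpus turns out to use entity names outside this set, they would pass through undecoded and would need a custom mapping table.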

The punctuation marks in the texts have been separated from the preceding words by blanks.
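If running text is needed rather than tokenized text, the separating blanks can be removed again. The sketch below is one possible way to do this in Python; it assumes that only the common sentence punctuation marks listed in the pattern need to be reattached, and the function name reattach_punctuation is hypothetical. Brackets, quotation marks and dashes are deliberately left alone, because the separation alone does not indicate which side they belong to.

```python
import re

# Assumption: these are the marks to reattach; extend the class as needed.
_SPACE_BEFORE_PUNCT = re.compile(r" +([.,;:!?])")


def reattach_punctuation(line: str) -> str:
    """Remove the blank(s) inserted between a word and the following mark."""
    return _SPACE_BEFORE_PUNCT.sub(r"\1", line)


# Made-up sample line, not taken from the corpus.
print(reattach_punctuation("This is a sentence , and it ends here ."))
```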

Omitted material is replaced by a <gap> tag, which has a 'desc' attribute; the value of the attribute describes the omitted material.

For example, <gap desc='sisukord'> stands for the omitted table of contents.
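When processing the texts, it can be useful to list what was omitted or to drop the tags altogether. The following Python sketch shows one way to do both with a regular expression. It assumes the tag always has the form shown above, with a single 'desc' attribute in single quotes; the names GAP_TAG, gap_descriptions and strip_gaps are hypothetical.

```python
import re

# Assumption: <gap> carries exactly one attribute, desc, in single quotes.
GAP_TAG = re.compile(r"<gap\s+desc='([^']*)'\s*/?>")


def gap_descriptions(text: str) -> list[str]:
    """Return the desc value of every <gap> tag, in order of appearance."""
    return GAP_TAG.findall(text)


def strip_gaps(text: str) -> str:
    """Remove all <gap> tags, leaving the surrounding text untouched."""
    return GAP_TAG.sub("", text)


# Made-up sample line; only the tag itself follows the format described above.
sample = "Title <gap desc='sisukord'> Chapter 1"
print(gap_descriptions(sample))
print(strip_gaps(sample))
```

A regular expression is enough for a flat tag like this one; if the files contain other nested markup, an SGML-aware parser would be the safer choice.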

You can search the corpus with our corpus query for the Balanced Corpus, or download all the science texts here:

Possible mistakes: if a text contains subheadings or lists, the sentence annotation (and thus the division into lines in the corpus query output) may be erroneous.

