The Mixed Corpus: PhD dissertations


This subcorpus contains PhD dissertations written in Estonian, 2.3 million words altogether. You can find the list of included dissertations here.

NB! The same PhD dissertations are included in the Balanced Corpus!

How can one use it?

The corpus is free to use for non-commercial purposes only.

Texts and annotation

Mark-up and annotation conform to the TEI Guidelines. Each file contains one dissertation.

Every file begins with a header <teiheader> containing information about the file size, the tags used etc. The actual text of the dissertation begins with the tags <text><body> and ends with </body></text>.

The text has been annotated for paragraphs <p> and sentences <s>. Non-textual material (graphs, formulae, pictures, tables etc.) has been omitted and is represented by a tag <gap desc='description_of_the_omitted_material'>. Longer stretches of non-Estonian text, tables of contents and lists of references have also been omitted.
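As an illustration of the layout described above, the following Python sketch pulls paragraphs, sentences and gap descriptions out of one file. It is only a sketch under assumptions: the sample document is hypothetical and built solely from the tags named on this page, the function names are invented for the example, and the tags are matched with regular expressions rather than an XML parser because this page does not say whether the files are well-formed XML.

```python
import re

# Hypothetical miniature file, using only the tags described on this page.
SAMPLE = """<teiheader>file size, tags used etc.</teiheader>
<text><body>
<p><s>Esimene lause.</s> <s>Teine lause.</s></p>
<gap desc='table'>
<p><s>Kolmas lause.</s></p>
</body></text>"""


def body_of(doc):
    """Return the text between <text><body> and </body></text>."""
    match = re.search(r"<text>\s*<body>(.*?)</body>\s*</text>", doc, re.DOTALL)
    return match.group(1) if match else ""


def paragraphs(doc):
    """Return a list of paragraphs, each a list of sentence strings."""
    result = []
    for para in re.findall(r"<p>(.*?)</p>", body_of(doc), re.DOTALL):
        sentences = re.findall(r"<s>(.*?)</s>", para, re.DOTALL)
        result.append([s.strip() for s in sentences])
    return result


def gap_descriptions(doc):
    """Return the desc value of every <gap> tag in the body.

    Straight and typographic quotes around the value are both accepted.
    """
    return re.findall(r"<gap\s+desc=['\"’]([^'\"’]*)['\"’]\s*/?>", body_of(doc))


if __name__ == "__main__":
    for number, para in enumerate(paragraphs(SAMPLE), start=1):
        print(number, para)
    print(gap_descriptions(SAMPLE))
```

With a real file, one would read its contents into a string first and pass that string to the same functions; the encoding of the files is not stated here and would have to be checked.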

In the version of the corpus that can be accessed via our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.
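A comparable reduction can be approximated on the downloadable files. The Python sketch below deletes every tag except <gap>; it is an assumption about the effect, not a description of how the query version was actually produced, and the function name is invented for the example.

```python
import re


def strip_markup_except_gap(text):
    """Delete every tag except <gap ...>, leaving the text content intact."""
    # The negative lookahead skips any tag whose name is exactly "gap".
    return re.sub(r"<(?!gap\b)[^>]*>", "", text)


if __name__ == "__main__":
    # Hypothetical input line using only the tags described on this page.
    line = "<p><s>Üks.</s> <gap desc='table'> <s>Kaks.</s></p>"
    print(strip_markup_except_gap(line))
```

Note that applying this to a whole file would keep the text content of the header as well, so the header would need to be cut off beforehand.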

