This subcorpus contains PhD dissertations written in Estonian, 2.3 million words altogether. You can find the list of included dissertations here.
NB! The same PhD dissertations are included in the Balanced Corpus!
The corpus is free for use for non-commercial purposes only.
Mark-up and annotation conform to the TEI Guidelines. Each file contains one dissertation.
Every file begins with a header <teiheader> containing information about the file size, the tags used etc. The actual text of the dissertation begins with the tags <text><body> and ends with </body></text>.
The text has been annotated for paragraphs <p> and sentences <s>. Non-textual material (graphs, formulae, pictures, tables etc.) has been omitted and is represented by the tag <gap desc='description_of_the_omitted_material'>. Longer stretches of non-Estonian text have also been omitted, as have tables of contents and lists of references.
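As an illustration of the annotation described above, here is a minimal Python sketch that reads one file and collects its paragraphs, sentences and gap descriptions. The sample document is hypothetical: the wrapping <doc> element, the header content and the self-closing form of <gap> are assumptions made so that the example is well-formed XML, and real corpus files may differ and may need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file that follows the layout described above.
# The <doc> wrapper, the header content and the self-closing <gap/> are
# assumptions for this sketch, not taken from an actual corpus file.
SAMPLE = """<doc>
<teiheader><note>header information</note></teiheader>
<text><body>
<p><s>Esimene lause.</s> <s>Teine lause.</s></p>
<gap desc="table"/>
<p><s>Kolmas lause.</s></p>
</body></text>
</doc>"""


def summarize(xml_string):
    """Count paragraphs and collect sentences and gap descriptions."""
    root = ET.fromstring(xml_string)
    body = root.find("./text/body")
    if body is None:
        raise ValueError("no <text><body> element found")
    paragraphs = body.findall(".//p")
    # itertext() also picks up text nested inside child elements of <s>.
    sentences = ["".join(s.itertext()).strip() for s in body.iter("s")]
    gaps = [g.get("desc") for g in body.iter("gap")]
    return {
        "paragraphs": len(paragraphs),
        "sentences": sentences,
        "gaps": gaps,
    }


summary = summarize(SAMPLE)
print(summary)
```

The header is skipped simply by starting the search at <text><body>, so only the dissertation text itself is counted.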
In the version of the corpus that is accessible through our corpus query, all mark-up has been deleted except the <gap> tags used for the omitted material.
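The effect of this clean-up can be sketched in Python as follows. This is only an illustration of the idea, not the procedure actually used to build the query version; the function name strip_markup_except_gap and the sample string are hypothetical, and the regular expression assumes that tags do not contain a literal ">" inside attribute values.

```python
import re

# Matches any tag whose name is not "gap" (opening, closing or self-closing).
# Assumption: attribute values never contain a literal ">".
NON_GAP_TAG = re.compile(r"<(?!/?gap\b)[^>]+>")


def strip_markup_except_gap(text):
    """Remove every tag except <gap ...>, leaving the text content intact."""
    return NON_GAP_TAG.sub("", text)


# Hypothetical fragment of annotated body text.
annotated = '<p><s>Esimene lause.</s> <gap desc="figure"/> <s>Teine lause.</s></p>'
cleaned = strip_markup_except_gap(annotated)
print(cleaned)
```

A regular expression is enough here because the tags are only removed, never interpreted. The sketch deals with body text only and does not remove the content of the header.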