The Balanced Corpus contains 5 million words of fiction. Sentence boundaries in the corpus are annotated, so a query returns whole sentences. Every row (sentence) in the query output begins with a reference field; clicking this field shows the reference to the work of fiction from which the sentence was retrieved.
All works collected in the fiction part of the corpus are listed in this table.
Punctuation marks in the texts have been separated from the preceding words by spaces. So the sentence
Ma nägin, et ta tuleb, ja ütlesin: "Tere!"
looks like this in the corpus:
Ma nägin , et ta tuleb , ja ütlesin : " Tere !"
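The spacing convention can be approximated with two regular expressions. The following is only a minimal sketch inferred from the single example above, not the procedure actually used to prepare the corpus; the function name `separate_punctuation` and the set of punctuation marks it handles are assumptions made for illustration.

```python
import re


def separate_punctuation(sentence: str) -> str:
    """Approximate the corpus spacing convention (hypothetical helper)."""
    # Put a space between a word character and the punctuation mark after it.
    spaced = re.sub(r"(\w)([,.;:!?])", r"\1 \2", sentence)
    # Put a space between an opening double quote and the word it introduces.
    return re.sub(r'(^|\s)"(\w)', r'\1" \2', spaced)


if __name__ == "__main__":
    print(separate_punctuation('Ma nägin, et ta tuleb, ja ütlesin: "Tere!"'))
```

Applied to the sentence above, the sketch reproduces the corpus form shown. It does not treat numbers, abbreviations or other quotation styles specially, so it should not be relied on to match the corpus in every case.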
The files for downloading have been annotated according to the TEI guidelines. The structure of the downloadable files is as follows:
<teiheader>
- the header, which contains information about the author and title of the novel or short story, the number of words in the text, and the size of the file in bytes; the annotation tags used in the text are also listed in the header
<text><body>
- the text begins with these tags and ends with the tags </body></text>
<div0 type='tervikteos'>
- the text as a whole ('tervikteos' means a complete work)
<div1 type='alaosa'> or <div1 type='peatükk'>
- the subdivisions of the text ('alaosa' is a subsection, 'peatükk' a chapter)
<head>
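- a heading
<bibl><author>
- a bibliographic reference and the author named in it
<p>
- a paragraph
<s>
- a sentence
<lg>
- a group of verse lines, for example a stanza
<l>
- a single verse line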
Omitted material is replaced by a <gap> tag, which has an attribute 'desc'; the value of the attribute describes the omitted material. For example, <gap desc='sisukord'> stands for an omitted table of contents.
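The structure described above can be read with a standard parser, as in the rough sketch below. It rests on several assumptions that the description does not confirm: that a file can be parsed as well-formed XML (the real files may be SGML, with unclosed empty tags and entities that need to be resolved first), that <s> marks a sentence as in standard TEI usage, and that a single root element wraps the header and the text. The `SAMPLE` string, the `<doc>` wrapper, the self-closing `<gap .../>` and all function names are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file following the structure described above.
# The <doc> wrapper and the self-closing <gap/> are assumptions made so that
# the sample is well-formed XML; the real files may differ.
SAMPLE = """<doc>
<teiheader>Author, title, word count, file size, list of tags.</teiheader>
<text><body>
<div0 type='tervikteos'>
<div1 type='peatükk'>
<head>Esimene peatükk</head>
<gap desc='sisukord'/>
<p><s>Ma nägin , et ta tuleb .</s> <s>Tere !</s></p>
</div1>
</div0>
</body></text>
</doc>"""


def sentences(root):
    """Return the text of every <s> element, in document order."""
    return ["".join(s.itertext()).strip() for s in root.iter("s")]


def omitted(root):
    """Return the 'desc' value of every <gap> element."""
    return [g.get("desc") for g in root.iter("gap")]


def subdivisions(root):
    """Return (type, heading) pairs for every <div1> element."""
    return [(d.get("type"), d.findtext("head")) for d in root.iter("div1")]


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    for sentence in sentences(root):
        print(sentence)
    print(omitted(root))
    print(subdivisions(root))
```

If the downloaded files turn out not to be well-formed XML, the same traversal logic would still apply, but the parsing step would have to be replaced by an SGML-aware reader or by a preprocessing pass.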
In addition to ASCII characters, the texts contain entities, which are enumerated in this table.