
Fiction in the Balanced Corpus

The Balanced Corpus contains 5 million words of fiction. Sentence boundaries in the corpus are annotated, so a query returns whole sentences. Every row (sentence) in the query output begins with a reference field; clicking this field shows the reference to the work of fiction from which the sentence was retrieved.

All works collected in the fiction part of the corpus are listed in this table.

Punctuation marks in the texts have been separated from the preceding words by spaces. So the sentence

Ma nägin, et ta tuleb, ja ütlesin: "Tere!"

looks like this in the corpus:

Ma nägin , et ta tuleb , ja ütlesin : " Tere !"
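When the original spacing is needed, the separation can be approximately undone. The following is a minimal sketch, not part of the corpus tools: the function name `detokenize` and the set of punctuation marks it handles are assumptions, and it treats a double quote preceded by a space as an opening quote, which matches the example above but may not hold for every sentence in the corpus.

```python
import re


def detokenize(sentence: str) -> str:
    """Reattach separated punctuation marks (hypothetical helper).

    Assumption: only , . ; : ! ? and the straight double quote are handled.
    """
    # Remove the space inserted before common punctuation marks.
    joined = re.sub(r'\s+([,.;:!?])', r'\1', sentence)
    # A double quote that follows whitespace (or starts the string) is
    # treated as an opening quote, so the space after it is removed.
    joined = re.sub(r'(?<!\S)"\s+', '"', joined)
    return joined


corpus_line = 'Ma nägin , et ta tuleb , ja ütlesin : " Tere !"'
print(detokenize(corpus_line))
```

A regular expression is used here only because the rule in the text is simple; sentences with nested or unbalanced quotation marks would need a more careful approach.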

The files for downloading have been annotated according to the TEI guidelines. The structure of the downloadable files is as follows:

Omitted material is replaced by a <gap> tag with a 'desc' attribute, whose value describes what was omitted.
For example, <gap desc='sisukord'> stands for an omitted table of contents ("sisukord" is Estonian for "table of contents").
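To see what has been left out of a downloaded file, the 'desc' values can be collected. The sketch below is an illustration under assumptions: the function name `gap_descriptions` is hypothetical, and a regular expression is used instead of an XML parser because the example tag above is not self-closed, so the files may not parse as well-formed XML.

```python
import re

# Matches <gap desc='...'> or <gap desc="...">, optionally self-closed.
GAP_TAG = re.compile(r"""<gap\s+desc=(['"])(.*?)\1\s*/?>""")


def gap_descriptions(text: str) -> list[str]:
    """Return the 'desc' value of every <gap> tag, in document order."""
    return [match.group(2) for match in GAP_TAG.finditer(text)]


# The sample string is made up for illustration; only the <gap> tag
# itself comes from the description above.
sample = "Algus <gap desc='sisukord'> ja tekst jätkub ."
print(gap_descriptions(sample))
```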

In addition to ASCII characters, the texts contain entities, which are listed in this table.


Last modified: January 21 2019 18:29:20.