Fiction in the Balanced Corpus

The Balanced Corpus contains 5 million words of fiction. Sentence boundaries in the corpus are annotated, so a query returns whole sentences. Every row (sentence) in the corpus query output begins with a reference field; clicking on this field shows the work of fiction from which the sentence was retrieved.

All works collected in the fiction part of the corpus are listed in this table.

The punctuation marks in the texts have been separated from the preceding words by blanks. So the sentence

Ma nägin, et ta tuleb, ja ütlesin: "Tere!"

looks like this in the corpus:

Ma nägin , et ta tuleb , ja ütlesin : " Tere !"
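
To match query strings against this format, the same spacing has to be applied to the input. The following is a minimal sketch that reproduces the example above; the function name separate_punctuation is hypothetical, and the rule it uses (a blank between a word character and a neighbouring punctuation mark, in either order) is only inferred from that single example. The corpus itself may treat hyphens, abbreviations or numbers differently.

```python
import re


def separate_punctuation(sentence: str) -> str:
    """Insert a blank between words and adjacent punctuation marks.

    Assumption: this rule is inferred from one example sentence and is
    not the documented tokenisation of the corpus.
    """
    # Blank between a word character and a following punctuation mark.
    spaced = re.sub(r"(\w)([^\w\s])", r"\1 \2", sentence)
    # Blank between a punctuation mark and a following word character.
    spaced = re.sub(r"([^\w\s])(\w)", r"\1 \2", spaced)
    return spaced


print(separate_punctuation('Ma nägin, et ta tuleb, ja ütlesin: "Tere!"'))
# prints: Ma nägin , et ta tuleb , ja ütlesin : " Tere !"
```

Note that this sketch would also split hyphenated words such as "e-mail", so it should be checked against real corpus output before being relied on.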

The files for downloading have been annotated according to the TEI guidelines. The structure of the downloadable files is as follows:

In the corpus query output, a line of a poem is treated as a sentence.

Omitted material is replaced by a <gap> tag whose 'desc' attribute describes what was omitted.
For example, <gap desc='sisukord'> stands for an omitted table of contents ('sisukord' is Estonian for 'table of contents').
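
When working with a downloaded file, it can be useful to see what kinds of material were left out. The following is a minimal sketch that counts <gap> tags by their 'desc' value. The names GAP_RE and count_gaps are hypothetical, and a regular expression is used instead of an XML parser because the example tag above is not self-closed; whether the real files parse as XML is not stated here.

```python
import re
from collections import Counter

# Matches <gap desc='...'> or <gap desc="...">, with or without a closing slash.
GAP_RE = re.compile(r"<gap\s+desc=(['\"])(.*?)\1\s*/?>")


def count_gaps(text: str) -> Counter:
    """Count omitted-material markers by the value of their 'desc' attribute."""
    return Counter(match.group(2) for match in GAP_RE.finditer(text))


# Made-up sample text; only the <gap> tag format comes from the description above.
sample = "Algus . <gap desc='sisukord'> Tere ! <gap desc='sisukord'> Lõpp ."
print(count_gaps(sample))
# prints: Counter({'sisukord': 2})
```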

In the version of the corpus accessible through the corpus query, this annotation has been removed.

In addition to ASCII characters, the texts contain entities, which are enumerated in this table.
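
For reading the downloaded texts, the entities usually need to be turned back into the characters they stand for. The following is a minimal sketch, assuming the entities use the standard ISO/HTML names (for example &auml; for ä); this is an assumption, because the actual list is in the table referred to above and is not reproduced here. The function name decode_entities is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named entities with the characters they represent.

    Assumption: the corpus entities follow the standard HTML/ISO names
    known to Python's html module; entities outside that set are left
    unchanged.
    """
    return html.unescape(text)


print(decode_entities("Ma n&auml;gin , et ta tuleb"))
# prints: Ma nägin , et ta tuleb
```

Any entity in the corpus table that is not a standard name would need its own mapping, for instance a small dictionary applied before or after this step.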

Last modified: September 05 2008 16:18:27.