
Fiction in balanced corpus

The Balanced Corpus contains 5 million words of fiction. Sentence boundaries in the corpus are annotated, so a query returns whole sentences. Every row (sentence) in the query output begins with a reference field. Clicking this field shows the reference to the work of fiction from which the sentence was retrieved.

All works collected in the fiction part of the corpus are listed in this table.

Punctuation marks in the texts have been separated from the preceding words by spaces, so the sentence

Ma nägin, et ta tuleb, ja ütlesin: "Tere!"

looks like this in the corpus:

Ma nägin , et ta tuleb , ja ütlesin : " Tere !"
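Because punctuation is already set off by spaces, a plain whitespace split is usually enough to tokenize a corpus sentence. The following is a minimal sketch, not part of the corpus tooling; the helper names `split_tokens` and `is_word` are made up for illustration. Note that in the example above the closing quotation mark stays attached to the exclamation mark, so a token may hold more than one punctuation character.

```python
def split_tokens(sentence: str) -> list[str]:
    """Split a corpus sentence on whitespace.

    Punctuation that the corpus has separated with spaces
    comes out as its own token.
    """
    return sentence.split()


def is_word(token: str) -> bool:
    """Treat a token as a word if it has at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# The example sentence as it appears in the corpus.
sentence = 'Ma nägin , et ta tuleb , ja ütlesin : " Tere !"'

tokens = split_tokens(sentence)
words = [t for t in tokens if is_word(t)]

print(tokens)
print(words)
```

Filtering with `is_word` gives a word list without punctuation tokens, which may be handy for quick counts; whether such a count matches the word counts stated in the file headers is not guaranteed.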

The files for downloading have been annotated according to the TEI guidelines. The structure of the downloadable files is as follows:

Every file begins with a header, <teiheader>. The header gives the author and title of the novel or short story, the number of words in the text, and the size of the file in bytes; it also lists the annotation tags used in the text.
The corpus text itself begins with the tags <text><body> and ends with the tags </body></text>.
The following tags are used in the texts:
- <div0 type='tervikteos'> - the text as a whole
- <div1 type='alaosa'> or <div1 type='peatükk'> - the subdivisions of the text
- headlines are annotated using <head>
- the names of the authors are annotated using <bibl><author>
- the paragraphs are annotated using <p>
- the sentences are annotated using <s>
- the poems are annotated using <lg>
- the line of a poem is annotated using <l>

Omitted material is replaced by a <gap> tag, which has a 'desc' attribute; the value of the attribute describes the omitted material.
For example <gap desc='sisukord'> stands for the omitted table of contents.
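The tag structure described above can be read with a lenient parser. The sketch below uses Python's standard `html.parser` instead of an XML parser, on the assumption that the files are not well-formed XML (the `<gap>` tag, for instance, is shown without a closing tag). The `SAMPLE` string is an invented fragment that merely follows the description above, and the class name `SentenceExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class SentenceExtractor(HTMLParser):
    """Collect the text of <s> elements and the 'desc' values of <gap> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sentences = []
        self.gaps = []
        self._in_sentence = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "s":
            self._in_sentence = True
            self._buffer = []
        elif tag == "gap":
            self.gaps.append(dict(attrs).get("desc"))

    def handle_endtag(self, tag):
        if tag == "s" and self._in_sentence:
            # Normalize internal whitespace and store the sentence.
            text = " ".join("".join(self._buffer).split())
            self.sentences.append(text)
            self._in_sentence = False

    def handle_data(self, data):
        if self._in_sentence:
            self._buffer.append(data)


# Invented sample that imitates the described file layout.
SAMPLE = """<teiheader>header omitted in this sample</teiheader>
<text><body>
<div0 type='tervikteos'>
<gap desc='sisukord'>
<div1 type='peatükk'>
<head>Esimene</head>
<p><s>Ma nägin , et ta tuleb .</s> <s>Tere !</s></p>
</div1>
</div0>
</body></text>"""

parser = SentenceExtractor()
parser.feed(SAMPLE)
parser.close()

print(parser.sentences)
print(parser.gaps)
```

The parser only tokenizes tags and does not validate nesting, so it tolerates unclosed tags; a real file may need extra handling for the header contents or for the character encoding.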

In addition to ASCII characters, the texts contain entities, which are enumerated in this table.
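If the entity names in the corpus coincide with the standard HTML/ISO names (an assumption; the entity table is not reproduced here and should be checked), the standard library function `html.unescape` can turn them back into characters. A minimal sketch with invented input strings:

```python
import html

# Invented examples; the entity names actually used in the corpus
# are an assumption and should be checked against the entity table.
encoded = ["s&uuml;da", "&otilde;un", "&scaron;okolaad"]

decoded = [html.unescape(item) for item in encoded]
print(decoded)
```

Entities that `html.unescape` does not recognize are left unchanged, so corpus-specific names would need a separate mapping.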

Last modified: January 21 2019 18:29:20.