
Corpus of Spoken Estonian

The Corpus of Spoken Estonian has been collected since 1997. As of 2011, the corpus contains ca. 360 hours of recordings (ca. 2 million words), of which ca. 1,800,000 words have been transcribed.

The Corpus of Spoken Estonian is an open corpus: there is no prescription as to how many or what kinds of texts it should include. The corpus is meant primarily for investigating the pragmatics of interaction.

The corpus consists mainly of audio recordings. Its structure is based on features of the communication situation:

everyday / institutional interaction;

dialogue / monologue;

face-to-face interaction / mediated interaction (phone and media communication).

Every recorded situation is accompanied by a background information file containing data on the interactants and the features of the interaction.
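As a purely illustrative sketch, the three situational features listed above could be modelled as a small record. The actual format of the corpus's background information files is not described here, so every class and field name below is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    EVERYDAY = "everyday"
    INSTITUTIONAL = "institutional"


class Form(Enum):
    DIALOGUE = "dialogue"
    MONOLOGUE = "monologue"


class Channel(Enum):
    FACE_TO_FACE = "face-to-face"
    MEDIATED = "mediated"  # phone and media communication


@dataclass(frozen=True)
class SituationInfo:
    """Hypothetical background record for one recorded situation."""
    setting: Setting
    form: Form
    channel: Channel
    # Hypothetical field: anonymised speaker labels, not real names.
    interactants: tuple[str, ...] = ()

    def label(self) -> str:
        """Join the three situational features into one readable label."""
        return " / ".join(
            [self.setting.value, self.form.value, self.channel.value]
        )


example = SituationInfo(
    Setting.EVERYDAY, Form.DIALOGUE, Channel.MEDIATED, ("A", "B")
)
print(example.label())  # prints "everyday / dialogue / mediated"
```

Using enumerations for the three dimensions keeps each recording in exactly one category per feature, which mirrors the either/or pairs in the list above.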

The recordings are transcribed according to a slightly modified version of Gail Jefferson's transcription system, which is used in Conversation Analysis.

Excerpts of 5-15 minutes are typically transcribed from everyday interactions and longer institutional dialogues; shorter institutional dialogues are transcribed in full.

Names, addresses and other identifying or sensitive data are replaced in the transcription with rhythmically equivalent substitutes (e.g. Tiina > Liina).

Data are recorded digitally in the .wav format, which enables phonetic analysis of the sound source.
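A minimal sketch of how a researcher might inspect such a recording before phonetic analysis, using only Python's standard `wave` module. The helper name `describe_wav` is hypothetical, and the sketch assumes an uncompressed PCM file; it is not part of the corpus tooling.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM .wav file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration follows from the frame count and the sampling rate.
            "duration_s": frames / rate,
        }
```

Calling `describe_wav` on a local file path returns the channel count, sample width, sampling rate and duration, which is usually the first check before loading the signal into a phonetics tool.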

The corpus is open to researchers for scientific and educational use. To gain access, it is necessary to contact the corpus administrator and sign a confidentiality agreement.

The corpus has a search engine for processing corpus data.

Last modified: October 11, 2018, 20:18:45.