The Corpus of Spoken Estonian has been collected since 1997. As of 2011, the corpus contains ca. 360 hours of recordings (ca. 2 million words), of which ca. 1,800,000 words have been transcribed.
The Corpus of Spoken Estonian is an open corpus: there is no prescription as to how many or what kinds of texts it should include. The corpus is meant primarily for investigating the pragmatics of interaction.
The corpus consists mainly of audio recordings. Its structure is based on features of the communicative situation:
Each recorded situation is accompanied by a background information file containing data on the interactants and the features of the interaction.
The recordings are transcribed according to a slightly modified version of Gail Jefferson's transcription system, which is used in Conversation Analysis.
Excerpts of 5-15 minutes are typically transcribed from everyday interactions and longer institutional dialogues; shorter institutional dialogues are transcribed in full.
Names, addresses and other identifying or sensitive data are replaced in the transcription with rhythmically equivalent substitutes (e.g. Tiina > Liina).
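As a rough illustration of the substitution rule above, the following Python sketch checks whether a candidate substitute matches the original name. It assumes that "rhythmically equivalent" can be approximated by an equal syllable count, and it estimates syllables by counting runs of Estonian vowel letters. Both are simplifying assumptions made for this example, not the corpus team's documented procedure, and the function names are hypothetical.

```python
import re

# Estonian vowel letters. Each maximal run of vowels is treated as one
# syllable nucleus. This is a rough heuristic, not a full syllabifier.
VOWEL_RUN = re.compile(r"[aeiouõäöü]+", re.IGNORECASE)


def syllable_count(word: str) -> int:
    """Estimate the number of syllables as the number of vowel runs."""
    return len(VOWEL_RUN.findall(word))


def is_rhythmic_match(original: str, substitute: str) -> bool:
    """Hypothetical check: the substitute differs from the original
    but has the same estimated syllable count."""
    return (
        original.lower() != substitute.lower()
        and syllable_count(original) == syllable_count(substitute)
    )


print(is_rhythmic_match("Tiina", "Liina"))  # True
print(is_rhythmic_match("Tiina", "Ann"))    # False
```

A check of this kind would only flag obvious mismatches; vowel length, stress and quantity degree are ignored, so a human transcriber would still need to make the final choice.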
The data are recorded digitally in .wav format, which enables phonetic analysis of the audio signal.
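For readers who want to inspect such a recording before phonetic analysis, the following minimal sketch reads basic properties of a .wav file using Python's standard `wave` module. The function name and the file path in the usage note are hypothetical, and the sketch assumes uncompressed PCM audio, which is the only kind the `wave` module reads.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of an uncompressed PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration is the number of frames divided by frames per second.
            "duration_s": frames / rate,
        }
```

Calling `describe_wav("recording.wav")` on a local file (the path is a placeholder) would return the channel count, sample width, sampling rate and duration in seconds.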
The corpus is open to researchers for scientific and educational use. To gain access, it is necessary to contact the corpus administrator and sign a confidentiality agreement.