
Corpus of spoken Estonian

The Department of Estonian Language initiated the corpus of spoken Estonian in 1997. The corpus is compiled by the research group of Spoken Estonian (Tiit Hennoste, Airi Jansons, Liina Lindström, Andriela Rääbis, Krista Strandson, Piret Toomet, Riina Vellerind).

The corpus is transcribed using the transcription conventions of conversation analysis (CA). Each tape is provided with a header that lists a total of 44 situational factors that have been found to affect language use in the analysis of various languages. For each individual tape, as many of these factors as possible are recorded.

The corpus is planned as an open corpus, i.e. no limits have been set on its size. Our intention is to collect various types of spoken language: both everyday and institutional conversation, spontaneous and planned speech, monologues and dialogues, face-to-face interaction and media texts. The speakers are inhabitants of the largest towns of Estonia: Tallinn, Tartu and Pärnu.

As of June 2002, the tape library contained 388 tapes. A total of 659 texts or stretches of text had been entered into the computer, amounting to 403,000 text units. The transcribed texts can be grouped as follows:

	everyday	institutional
face-to-face	107	201
telephone	83	198
media		70

The institutional situations include a large number of shop dialogues (65) and dialogues at service institutions and government offices.

The corpus is stored as a data bank in Word format and in plain-text format (ISO-8859-1). To access the corpus, a contract with the research group of Spoken Estonian is required.
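
Because the plain-text files are stated to be in ISO-8859-1, they need to be opened with that encoding given explicitly, or characters such as ä, ö, ü and õ may be misread. The following is a minimal sketch in Python; the function name `read_transcript` and the file name in the usage note are hypothetical and are not part of the corpus distribution.

```python
from pathlib import Path


def read_transcript(path, encoding="iso-8859-1"):
    """Read one plain-text transcript file and return its lines.

    The default encoding follows the corpus description (ISO-8859-1).
    The function name and return format are assumptions made for
    illustration only.
    """
    text = Path(path).read_text(encoding=encoding)
    return text.splitlines()
```

A call such as `read_transcript("sample.txt")`, where `sample.txt` is a placeholder name, would return the transcript as a list of strings, one per line.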