Syllabified corpora

The archive korpused.zip contains the following syllabified corpora:

Text class                   Number of words   Origin
Fiction (Estonian authors)   104 000           https://cl.ut.ee/korpused/morfkorpus/
Newspaper texts              111 000           https://cl.ut.ee/korpused/morfkorpus/
Oral speech                  100 000           https://cl.ut.ee/korpused/morfkorpus/
Chatrooms                     94 000           https://cl.ut.ee/korpused/jutumorf/
CHILDES caretaker language   400 000           https://childes.talkbank.org/access/Other/

Syllabification was a two-stage process:

  1. Mark word boundaries in compound words with the Filosoft morphological analyser (https://github.com/Filosoft/vabamorf), run with the command-line flag -a (meaning that the word is not lemmatised).
  2. Syllabify with the hfst-xfst transducer silbita.xfscript.

The character encoding is UTF-8. An underscore "_" marks word boundaries in compound words; a dot "." marks syllable boundaries.
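
The two markers can be read back with a few lines of Python. This is only a minimal sketch: the function name split_syllabified and the sample word are illustrative and not taken from the corpus files, and it assumes each token is a single word with no other markup. Empty pieces are dropped, so it does not matter whether a dot also appears next to an underscore.

```python
def split_syllabified(word):
    """Split a marked-up word into compound parts, each a list of syllables.

    Assumes "_" separates the parts of a compound and "." separates syllables.
    """
    result = []
    for part in word.split("_"):
        # Drop empty strings produced by leading, trailing or doubled dots.
        syllables = [s for s in part.split(".") if s]
        if syllables:
            result.append(syllables)
    return result


# Illustrative input (hypothetical, not copied from the corpora).
print(split_syllabified("raa.ma.tu_ko.gu"))
```

A word without an underscore comes back as a single part, so simple words and compounds can be handled by the same code.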

CV structures found in the corpora are in CVstruktuurid.zip.
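
As a rough illustration of what a CV structure is, the sketch below maps each letter of a syllable to V (vowel) or C (consonant). The scheme actually used to produce CVstruktuurid.zip is not described here, so the vowel set, the treatment of long vowels and diphthongs as VV, and the name cv_structure are all assumptions.

```python
# Assumed vowel letters of Estonian orthography; every other letter counts as a consonant.
VOWELS = set("aeiouõäöü")


def cv_structure(syllable):
    """Return the C/V pattern of a syllable, one symbol per letter."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in syllable.lower()
        if ch.isalpha()  # ignore boundary markers and other non-letters
    )


# Illustrative syllables (hypothetical, not copied from the corpora).
for syllable in ["raa", "ko", "memm"]:
    print(syllable, cv_structure(syllable))
```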


Last modified: September 28 2021 14:28:53.