korpused.zip contains the following syllabified corpora:
Text class | Number of words | Origin |
---|---|---|
Fiction (Estonian authors) | 104 000 | https://cl.ut.ee/korpused/morfkorpus/ |
Newspaper texts | 111 000 | https://cl.ut.ee/korpused/morfkorpus/ |
Oral speech | 100 000 | https://cl.ut.ee/korpused/morfkorpus/ |
Chatrooms | 94 000 | https://cl.ut.ee/korpused/jutumorf/ |
CHILDES caretaker language | 400 000 | https://childes.talkbank.org/access/Other/ |

Syllabification was a two-stage process:
The character encoding is UTF-8. An underscore "_" marks word boundaries in compound words; a dot "." marks syllable boundaries.
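
The boundary markers can be read back with plain string splitting. The following is a minimal sketch, not part of the distributed tools: the function name `split_syllabified` and the sample token are illustrative assumptions, and the token is not taken from the corpora.

```python
def split_syllabified(token: str) -> list[list[str]]:
    """Split a syllabified token into compound parts, each a list of syllables.

    "_" separates the words of a compound, "." separates syllables.
    """
    return [part.split(".") for part in token.split("_")]


# Illustrative token (assumed for this example, not taken from the corpora).
token = "raa.ma.tu_ko.gu"
parts = split_syllabified(token)

print(parts)                       # [['raa', 'ma', 'tu'], ['ko', 'gu']]
print(sum(len(p) for p in parts))  # 5 syllables in total
```

When reading the corpus files themselves, open them with `encoding="utf-8"` so that letters such as õ, ä, ö and ü are decoded correctly.
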
CV structures found in the corpora are in CVstruktuurid.zip.
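
As a rough illustration of what a CV structure is, the sketch below maps each letter of a syllabified token to C or V and counts the syllable patterns. It assumes the simplest possible scheme (each of the Estonian vowel letters a, e, i, o, u, õ, ä, ö, ü becomes V, every other letter becomes C); the scheme actually used for CVstruktuurid.zip may differ. The names `cv_pattern` and `cv_structure` and the sample token are hypothetical.

```python
from collections import Counter

# Assumption: these nine letters count as vowels; everything else is C.
VOWELS = set("aeiouõäöü")


def cv_pattern(syllable: str) -> str:
    """Map one syllable to its C/V pattern, letter by letter."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())


def cv_structure(token: str) -> str:
    """Map a syllabified token to C/V, keeping the "_" and "." markers."""
    return "_".join(
        ".".join(cv_pattern(syl) for syl in part.split("."))
        for part in token.split("_")
    )


# Illustrative token (assumed for this example, not taken from the corpora).
token = "raa.ma.tu_ko.gu"
structure = cv_structure(token)
print(structure)  # CVV.CV.CV_CV.CV

# Count how often each syllable pattern occurs in the token.
counts = Counter(structure.replace("_", ".").split("."))
print(counts["CV"], counts["CVV"])  # 4 1
```
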