The syllabified corpora in korpused.zip are:
| Text class | Number of words | Origin |
|---|---|---|
| Fiction (Estonian authors) | 104 000 | https://cl.ut.ee/korpused/morfkorpus/ |
| Newspaper texts | 111 000 | https://cl.ut.ee/korpused/morfkorpus/ |
| Oral speech | 100 000 | https://cl.ut.ee/korpused/morfkorpus/ |
| Chatrooms | 94 000 | https://cl.ut.ee/korpused/jutumorf/ |
| CHILDES caretaker language | 400 000 | https://childes.talkbank.org/access/Other/ |
Syllabification was a two-stage process:
Character encoding is UTF-8. An underscore "_" marks word boundaries in compound words; a dot "." marks syllable boundaries.
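The marker convention above can be illustrated with a minimal Python sketch. It assumes a token contains only the two markers described here, with no other meaning attached to "_" or "."; the function names and the example token are hypothetical and are not taken from the corpus files.

```python
def split_token(token: str) -> list[list[str]]:
    """Split a syllabified token into compound parts, each a list of syllables.

    "_" separates the words of a compound; "." separates syllables.
    """
    return [part.split(".") for part in token.split("_")]


def count_syllables(token: str) -> int:
    """Count syllables across all parts of a (possibly compound) token."""
    return sum(len(part) for part in split_token(token))


# Hypothetical example token, written in the marker convention described above.
example = "raa.ma.tu_ko.gu"
print(split_token(example))      # [['raa', 'ma', 'tu'], ['ko', 'gu']]
print(count_syllables(example))  # 5
```

When reading the corpus files themselves, opening them with `open(path, encoding="utf-8")` matches the stated encoding. How tokens are laid out within each file is not described here, so any tokenization step would need to be checked against the actual data.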
The CV structures found in the corpora are in CVstruktuurid.zip.