Morphologically disambiguated corpus of chatrooms

The corpus consists of 94,000 tokens plus punctuation marks and names of participants. It is an extract from a corpus of chatrooms. The texts were tagged automatically by ESTMORF, then disambiguated and corrected in 2012 by Dage Särg for her MA thesis "Internetikeele süntaktiline analüüs kitsenduste grammatikaga" ("Syntactic analysis of Internet language with Constraint Grammar"), Tartu, 2015.

File format

The file format is almost the same as that of the morphologically disambiguated corpus, but:

The zipped files in failid_fs.zip contain the Filosoft categories.
The zipped files in failid_gt.zip contain the Giellatekno categories, which are also used in the keeleveeb corpora.
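
As a minimal sketch of how the two archives might be inspected after download, the Python snippet below lists the files stored in a zip archive and reads one of them as text. It assumes the archives have been saved to the current directory under the names given above. The helper names list_corpus_files and read_corpus_file are hypothetical, and the UTF-8 default is an assumption, since the encoding of the files is not stated here.

```python
import zipfile
from pathlib import Path


def list_corpus_files(archive_path):
    """Return the names of the regular files stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


def read_corpus_file(archive_path, member, encoding="utf-8"):
    """Return one archive member decoded as text.

    The default encoding is an assumption; change it if the files
    turn out to use a different one.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(member).decode(encoding)


if __name__ == "__main__":
    # Archive names as given on this page; their location is assumed.
    for name in ("failid_fs.zip", "failid_gt.zip"):
        if Path(name).exists():
            print(name, list_corpus_files(name))
        else:
            print(name, "not found in the current directory")
```

The sketch only opens the archives and does not parse the annotation itself, because the tag format differs between the Filosoft and Giellatekno versions.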

Last modified: November 02 2020 16:28:55.