This corpus contains the issues of the daily newspaper Õhtuleht / SL Õhtuleht from 06.03.1997 to 31.12.2007: 3,344 issues in total, comprising 45,572,699 tokens.
The corpus is free for use for non-commercial purposes only.
These texts form part of the planned reference corpus of Estonian. Their collection and processing were financed by the national programme "Estonian Language and National Culture".
The texts originate from http://www.ohtuleht.ee
The texts have been semi-automatically downloaded from the internet and converted from HTML to SGML (TEI). The conversion programs were written by Krista Liin.
Each file contains one issue. Non-textual parts (e.g. pictures, comic strips) have been omitted. TV programmes, all advertisements, hyperlinks, tables (e.g. sports results, currency exchange rates), etc. have also been omitted. Omitted material (excluding pictures) is replaced by a tag <gap desc='description_of_omission'>.
(See the Estonian page)
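As a rough illustration of working with the omission markers described above, the following Python sketch counts the <gap> tags in the text of one issue, grouped by their description. It is only a minimal example under stated assumptions: the function name count_gaps, the sample string and the description values "table" and "advertisement" are hypothetical, and since the exact quoting of the desc attribute in the corpus files is not specified here, the pattern accepts both single and double quotes.

```python
import re
from collections import Counter

# Matches <gap desc='...'> or <gap desc="...">. The quoting style used in the
# actual corpus files is an assumption, so both styles are accepted.
GAP_RE = re.compile(
    r"""<gap\s+desc\s*=\s*(['"])(.*?)\1\s*/?>""",
    re.IGNORECASE | re.DOTALL,
)


def count_gaps(sgml_text):
    """Return a Counter mapping each gap description to how often it occurs."""
    return Counter(match.group(2) for match in GAP_RE.finditer(sgml_text))


# Hypothetical fragment; real files and description values may differ.
sample = (
    "<p>Tekst <gap desc='table'> ja "
    "<gap desc=\"advertisement\"> ning <gap desc='table'></p>"
)

for description, frequency in count_gaps(sample).most_common():
    print(description, frequency)
```

To apply this to a real issue, read the file contents into a string first (using whatever character encoding the corpus files use) and pass that string to count_gaps.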
The SGML files contain the entities listed in this table