This subcorpus contains texts from the newsmagazine Luup: ca 1.9 million words altogether, in 130 issues and 2298 articles. The corpus contains issues from the years 1996-2002.
The texts originate from the webpage http://luup.postimees.ee/
The texts were downloaded semi-automatically from the web and converted from HTML to TEI format.
The corpus is free to use for non-commercial purposes only.
Mark-up and annotation conform to the TEI-guidelines. One file contains one issue of the journal.
Every file begins with a header <teiheader> that contains information about the file size, the tags used, etc.
The rest of the file is structured as follows:
<div0> is one issue of the journal, e.g. <div0 type='ajakirjanumber'><head>Luup Nr. 13 (122), 22. juuli 2000</head>
<div1> is a section, e.g. <div1 type='rubriik'><head>JUHTKIRI</head>
<div2> is an article, e.g. <div2 type='artikkel'><head>Kõige tähtsam raamat</head>
<div3> is a subpart of an article, e.g. <div3 type='alaosa'><head>Kui jäämäed hakkavad sulama</head>
(In the issues 1998 Nr. 14 – 2002 Nr. 04 the mark-up of <div3> may be erroneous.)
The text has been annotated for paragraphs <p>, sentences <s>, headlines <head> and authors <bibl><author>.
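The hierarchy described above can be walked with a few lines of Python. This is only a minimal sketch: it assumes the fragment is well-formed XML (the original SGML files with their entities may need conversion first), and the sample text, the placement of <bibl><author> directly inside <div2>, the placeholder author name and the function name list_articles are illustrative assumptions, not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment mirroring the described structure (not a real corpus file).
# The author name and the sentences are placeholders.
SAMPLE = """<div0 type='ajakirjanumber'><head>Luup Nr. 13 (122), 22. juuli 2000</head>
<div1 type='rubriik'><head>JUHTKIRI</head>
<div2 type='artikkel'><head>Kõige tähtsam raamat</head>
<bibl><author>Example Author</author></bibl>
<p><s>First sentence.</s> <s>Second sentence.</s></p>
<div3 type='alaosa'><head>Kui jäämäed hakkavad sulama</head>
<p><s>Third sentence.</s></p>
</div3>
</div2>
</div1>
</div0>"""


def list_articles(xml_text):
    """Return one dict per <div2> article found in a <div0> issue."""
    root = ET.fromstring(xml_text)      # root is the <div0> element
    issue = root.findtext("head")       # heading of the issue itself
    articles = []
    for section in root.iter("div1"):
        for article in section.iter("div2"):
            articles.append({
                "issue": issue,
                "section": section.findtext("head"),
                "title": article.findtext("head"),
                "authors": [a.text for a in article.iter("author")],
                # iter() descends into <div3>, so subpart sentences are included
                "sentences": ["".join(s.itertext()) for s in article.iter("s")],
            })
    return articles


if __name__ == "__main__":
    for art in list_articles(SAMPLE):
        print(art["section"], "/", art["title"], "/", len(art["sentences"]), "sentences")
```

Because <div3> may be marked up erroneously in some issues, the sketch collects sentences from the whole article rather than relying on the subpart boundaries.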
The non-textual material has been omitted from the text and replaced by a tag <gap desc='description_of_the_omitted_material'>. By non-textual material we mean pictures (photos, drawings, diagrams etc.), tables etc.
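To see what kind of material was omitted, the desc attributes of the <gap> tags can be collected. The following is a rough sketch under assumptions: the fragment is hand-made, the desc values 'foto' and 'tabel' are invented for illustration, the empty tag is written in self-closing XML form (in the SGML files it may appear without the slash), and gap_descriptions is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment; the desc values are illustrative, not taken from the corpus.
SAMPLE = (
    "<p><s>Text before the picture.</s><gap desc='foto'/>"
    "<s>Text after it.</s><gap desc='tabel'/></p>"
)


def gap_descriptions(xml_text):
    """Return the desc attribute of every <gap> element, in document order."""
    root = ET.fromstring(xml_text)
    return [gap.get("desc", "") for gap in root.iter("gap")]


if __name__ == "__main__":
    print(gap_descriptions(SAMPLE))
```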
SGML-files contain entities listed in this table