
The Mixed Corpus: Luup

Contents

This subcorpus contains texts from the newsmagazine Luup: approximately 1.9 million words in 2298 articles from 130 issues. The issues date from the years 1996-2002.

The texts originate from the webpage http://luup.postimees.ee/

The texts were downloaded semi-automatically from the web and converted from HTML to TEI format.

How can one use it?

The corpus may be used free of charge, for non-commercial purposes only.

Texts and annotation

Mark-up and annotation conform to the TEI guidelines. Each file contains one issue of the magazine.

Every file begins with a header <teiheader> that contains information about the file size, the tags used, etc.

The rest of the file is structured as follows:

The text has been annotated for paragraphs <p>, sentences <s>, headlines <head> and authors <bibl><author>.
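
As an illustration of how this annotation might be read, here is a minimal Python sketch that collects the text of <head>, <s> and <author> elements. It uses the lenient parser from the standard library, because the exact syntax of the corpus files (SGML or XML) is not stated here. The class name LuupExtractor and the sample fragment, including the author name, are made up for the example and do not come from the corpus.

```python
from html.parser import HTMLParser


class LuupExtractor(HTMLParser):
    """Collect the text of <head>, <s> and <author> elements (hypothetical helper)."""

    TRACKED = ("head", "s", "author")

    def __init__(self):
        # convert_charrefs=True turns entities such as &auml; into characters.
        super().__init__(convert_charrefs=True)
        self.found = {tag: [] for tag in self.TRACKED}
        self._open = None    # tracked tag currently being collected, if any
        self._buffer = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.found[tag].append("".join(self._buffer).strip())
            self._open = None


# Invented fragment that follows the tag inventory described above.
sample = (
    "<div><head>Pealkiri</head>"
    "<bibl><author>Mari Maasikas</author></bibl>"
    "<p><s>Esimene lause.</s> <s>Teine lause.</s></p></div>"
)

parser = LuupExtractor()
parser.feed(sample)
parser.close()

for tag, texts in parser.found.items():
    print(tag, texts)
```

The sketch assumes that tracked elements are not nested inside each other and that every tracked element has an explicit end tag; real files may need a more careful treatment.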

Non-textual material has been omitted from the text and replaced by the tag <gap desc='description_of_the_omitted_material'>. Non-textual material means pictures (photos, drawings, diagrams, etc.), tables and the like.

In the version of the corpus that is accessible through our corpus query, all mark-up has been deleted except the <gap> tags that mark omitted material.
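
A comparable reduction can be approximated on a downloaded file. The following Python sketch removes every tag except <gap> and lists the descriptions of the omitted material. It is not the procedure used to build the query version, only an approximation under two assumptions: tag names are written in lower case, and the desc value is enclosed in straight single or double quotes. The function names and the sample line are invented for the example.

```python
import re

# Any tag whose name is not exactly "gap" (end tags included).
TAG_EXCEPT_GAP = re.compile(r"<(?!gap\b)[^>]*>")
# The value of the desc attribute of a <gap> tag.
GAP_DESC = re.compile(r"<gap\s+desc=(['\"])(.*?)\1")


def strip_markup_keep_gaps(text):
    """Delete all tags except <gap ...> (hypothetical helper)."""
    return TAG_EXCEPT_GAP.sub("", text)


def gap_descriptions(text):
    """Return the desc values of all <gap> tags, in document order."""
    return [match.group(2) for match in GAP_DESC.finditer(text)]


# Invented sample line.
sample = "<p><s>Tekst enne pilti. <gap desc='foto'> Tekst pärast pilti.</s></p>"

print(strip_markup_keep_gaps(sample))
print(gap_descriptions(sample))
```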

Special symbols

Non-ASCII characters and other special symbols are represented by the following entities:

Symbol Entity
â acirc
à agrave
À Agrave
& amp
Å Aring
å aring
ä auml
Ä Auml
• bull
ć cacute
Ć Cacute
° deg
é eacute
É Eacute
è egrave
ë euml
¼ frac14
> gt
í iacute
« laquo
“ ldquo
< lt
µ micro
· middot
ń nacute
Ó Oacute
ó oacute
ô ocirc
Ø Oslash
ø oslash
õ otilde
Õ Otilde
ö ouml
Ö Ouml
‰ permil
± plusmn
» raquo
” rdquo
š scaron
§ sect
¹ sup1
² sup2
³ sup3
ß szlig
ú uacute
Ü Uuml
ü uuml
ž zcaron
Ž Zcaron

Last modified: March 24 2014 14:15:30.