Estonian Reference Corpus

The compilation of the Estonian Reference Corpus has been supported by:

national programme «Estonian Language and National Culture (Eesti keel ja rahvuskultuur)»
national programme «The Estonian Language and National Memory»
national programme «Estonian Language Technology».

What does this corpus consist of?

This corpus contains only full texts, not 2000-word extracts. The Estonian Reference Corpus includes only texts that represent written Estonian.

The Estonian Reference Corpus consists of the following subcorpora:

Fiction from the year 1990 onwards (5.6 million words);
Daily Postimees (issues 27.11.1995-10.10.2000; 1,760 issues containing 88,600 articles, 32.9 million words);
Weekly Eesti Ekspress (issues 09.08.1996-29.11.2001; 7.5 million words);
Daily Eesti Päevaleht (issues 18.10.1995-31.10.2007; 4,065 issues containing 366,862 articles, 87.9 million words);
Weekly Maaleht (2001-2004; 4.3 million words);
Daily SL Õhtuleht (1997-2007; 45.5 million words);
Valgamaalane (02.09.2004-31.07.2008; 2.5 million words);
Lääne Elu (04.05.2000-01.11.2008; 1.8 million words);
Magazine Horisont (1996-2003; 260,000 words);
Magazine Luup (1996-2002; 1.9 million words);
Magazine Kroonika (2001-2003; 600,000 words);
Magazine Eesti Arst (2002-2004; ca 0.7 million words);
Magazine Arvutitehnika ja Andmetöötlus (1999-2005; 625,000 words);
Magazine Agraarteadus (2001-2006; 298,000 words);
Various scientific articles (ca 1.3 million words);
Estonian and European legal documents (ca 1.8 million and 10 million words, respectively);
New media (ca 21 million words);
Parliament transcripts 1995-2001 (13 million words);
PhD dissertations (2.3 million words).

The Estonian Reference Corpus contains a more balanced subcorpus called The Balanced Corpus.

How to use this corpus?

The corpus is free for non-commercial use. One can either:

download the compressed texts;
use Keeleveeb’s corpus query to retrieve concordances of lemmas, word classes and grammatical categories or their co-occurrences.

One can reach the texts from the description of each subcorpus. Some subcorpora cannot be downloaded. These can be used via the corpus query.

Mark-up and annotation

The corpus texts are encoded following the TEI guidelines.

The structure of the downloadable files is as follows:

Each corpus file begins with a header <teiheader>. The header documents the name of the text(s) in the file, gives the extent of the file in words and in bytes, and lists the tags used.
The text itself begins with the tags <text><body>. In every text, at least the headings <head>, paragraphs <p> and sentences <s> have been marked. The rest of the annotation may differ between subcorpora.
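
As an illustration of the file layout described above, here is a minimal Python sketch that separates the header from the body and pulls out headings, paragraphs and sentences. The sample text, the function names and the use of regular expressions are assumptions made for this example; they are not part of the corpus distribution. Regular expressions are used instead of an XML parser because the files follow SGML conventions and may not be well-formed XML.

```python
import re

# Hypothetical miniature file in the layout described above.
# Real corpus files are much larger and may contain additional tags.
SAMPLE = """<teiheader>
<title>Example text</title>
</teiheader>
<text><body>
<head>Pealkiri</head>
<p><s>Esimene lause.</s> <s>Teine lause.</s></p>
<p><s>Kolmas lause.</s></p>
</body></text>"""


def split_header(data):
    """Split the file into (header, rest) at the closing </teiheader> tag."""
    match = re.search(r"</teiheader>", data, re.IGNORECASE)
    if match is None:
        return "", data
    return data[:match.end()], data[match.end():]


def extract(tag, data):
    """Return the contents of every <tag>...</tag> element, in document order."""
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    return [content.strip() for content in pattern.findall(data)]


def strip_tags(fragment):
    """Remove any remaining markup from a fragment."""
    return re.sub(r"<[^>]+>", "", fragment).strip()


header, body = split_header(SAMPLE)
headings = extract("head", body)
paragraphs = extract("p", body)
sentences = [strip_tags(s) for s in extract("s", body)]

print(headings)
print(len(paragraphs), "paragraphs")
print(sentences)
```

The same approach does not handle nested elements of the same name; for subcorpora with richer annotation a proper SGML or XML toolchain would be a safer choice.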

How do we do it?

Texts that are electronically available (e.g. on the Internet) are the easiest to collect. Among such texts, journalism is the most common genre, but the Internet is also a suitable source for collecting texts of other genres, such as fiction and scientific texts.

When handling large amounts of text, it is important to automate the process as much as possible. When compiling the Estonian Reference Corpus, the initial plan was to develop a program that would extract the texts from the web, convert them from HTML to TEI (Text Encoding Initiative) format, annotate parts of the texts (headings, paragraphs, and sentences), and check their conformance with the SGML (Standard Generalized Markup Language) standard. After this, the texts could be lemmatized and disambiguated with a morphological analyzer. The final result would have been parsed texts that can be searched for lemmas, word forms, and arbitrary strings. Currently, the corpus is parsed but not lemmatized.
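
The conversion and annotation steps of such a pipeline can be sketched roughly as follows. This is not the program used for the corpus; it is a minimal Python illustration that assumes clean HTML in which headings are h1-h6 elements and paragraphs are p elements, and it uses a deliberately naive sentence splitter. The class and function names are hypothetical, and SGML conformance checking and morphological analysis are left out.

```python
import re
from html import escape
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BlockCollector(HTMLParser):
    """Collect (kind, text) pairs for headings ("head") and paragraphs ("p")."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._kind = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._open("head")
        elif tag == "p":
            self._open("p")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS or tag == "p":
            self._close()

    def handle_data(self, data):
        if self._kind is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._close()  # flush a block that was never explicitly closed

    def _open(self, kind):
        self._close()
        self._kind = kind
        self._parts = []

    def _close(self):
        if self._kind is None:
            return
        text = " ".join("".join(self._parts).split())  # normalize whitespace
        if text:
            self.blocks.append((self._kind, text))
        self._kind = None
        self._parts = []


def split_sentences(text):
    """Naive rule: break after . ! ? when an uppercase letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-ZÕÄÖÜŠŽ])", text)


def html_to_tei(html):
    """Convert simple HTML to the <text><body> layout with <head>, <p>, <s>."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    lines = ["<text><body>"]
    for kind, text in collector.blocks:
        if kind == "head":
            lines.append("<head>{}</head>".format(escape(text, quote=False)))
        else:
            marked = "".join(
                "<s>{}</s>".format(escape(s, quote=False))
                for s in split_sentences(text)
            )
            lines.append("<p>{}</p>".format(marked))
    lines.append("</body></text>")
    return "\n".join(lines)


EXAMPLE_HTML = "<html><body><h1>Uudis</h1><p>Täna sajab. Homme paistab päike!</p></body></html>"
tei = html_to_tei(EXAMPLE_HTML)
print(tei)
```

The splitting rule breaks on abbreviations followed by a capitalized word (for example a title before a surname), which is one reason real sentence segmentation needs language-specific handling.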

However, the texts available on the Internet (especially the journalistic texts) turned out to vary tremendously in their format. Therefore, no single program is capable of converting them all.

Representativeness and balance are key notions in corpus linguistics. A representative and balanced corpus should include all (or most) text classes that are prominent in a certain culture in a certain period of time, and these text classes should be represented in the corpus in proportion to their prominence. In practice, representativeness and balance are becoming less important as corpora keep growing in size. Truly large representative corpora are quite rare; the British National Corpus is an example of such a corpus.

Based on the Estonian Reference Corpus, a smaller and more balanced corpus has been compiled. This is referred to as the Balanced Corpus of Estonian. It consists of three subcorpora (journalistic texts, fiction, and scientific texts) of 5 million words each.

Because the Balanced Corpus of Estonian is a part of the Estonian Reference Corpus, it is possible to get the same sentence twice as a result of a single query. To avoid that, searches of the Balanced Corpus of Estonian can be run as separate queries.
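
If query results have already been exported and contain such duplicates, they can also be filtered afterwards. The following Python sketch assumes a hypothetical export format in which each hit is a (subcorpus, sentence) pair; this format is an assumption for the example and does not describe the actual output of the corpus query.

```python
def deduplicate_hits(hits):
    """Keep the first hit for each distinct sentence, preserving order."""
    seen = set()
    unique = []
    for subcorpus, sentence in hits:
        key = " ".join(sentence.split())  # ignore whitespace differences
        if key not in seen:
            seen.add(key)
            unique.append((subcorpus, sentence))
    return unique


# Hypothetical hits: the first sentence appears in both corpora.
hits = [
    ("reference", "See on näide."),
    ("balanced", "See on näide."),
    ("reference", "Teine lause."),
]
unique_hits = deduplicate_hits(hits)
print(unique_hits)
```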

The Estonian Reference Corpus is no longer the largest Estonian corpus. Today the largest corpus available for the Estonian language is etTenTen, a corpus collected from the Internet and compiled in co-operation between the Institute of Estonian Language and Lexical Computing Ltd. etTenTen can be found on Keeleveeb's webpage.

Webmaster Last modified: December 21 2018 17:53:44.