Estonian Reference Corpus

The compilation of the Estonian Reference Corpus has been supported by:

What does this corpus consist of?

This corpus contains only full texts, not 2000-word extracts. The Estonian Reference Corpus includes only texts that represent written Estonian.

The Estonian Reference Corpus consists of the following subcorpora:

The Estonian Reference Corpus contains a more balanced subcorpus called The Balanced Corpus.

How to use this corpus?

The corpus is free for non-commercial use. One can either:

One can reach the texts from the description of each subcorpus. Some subcorpora cannot be downloaded. These can be used via the corpus query.

Mark-up and annotation

The corpus texts are coded following the TEI guidelines.

The structure of the downloadable files is as follows:

How do we do it?

Texts that are electronically available (e.g. on the Internet) are the easiest to collect. Among such texts, journalism is the most common genre, but the Internet is also a suitable source for collecting texts of other genres, such as fiction, scientific texts, etc.

When handling large amounts of text, it is important to automate the process as much as possible. When compiling the Estonian Reference Corpus, the initial plan was to develop a program that would extract the texts from the web, convert them from HTML to TEI (Text Encoding Initiative) format, annotate parts of the texts (headings, paragraphs, and sentences), and check conformance with the SGML (Standard Generalized Markup Language) standard. After this, it would be possible to lemmatize and disambiguate the texts with a morphological analyzer. The final result would have been parsed texts that can be searched for lemmas, word forms, and arbitrary strings. Currently, the corpus is parsed but not lemmatized.
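The conversion and annotation steps of such a pipeline can be illustrated with a minimal Python sketch. It is not the program used for the Estonian Reference Corpus; the class and function names are hypothetical, the sentence splitter is deliberately naive, and the output is a TEI-like fragment using the TEI elements head, p, and s rather than a complete, validated TEI document.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def split_sentences(text):
    """Naive splitter (an assumption): break after . ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


class HtmlToTeiSketch(HTMLParser):
    """Collects headings and paragraphs from simple HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # list of (tei_element_name, text)
        self._current = None  # TEI element name of the open block, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._open("head")
        elif tag == "p":
            self._open("p")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS or tag == "p":
            self.flush()

    def handle_data(self, data):
        # Text outside headings and paragraphs is ignored in this sketch.
        if self._current is not None:
            self._buffer.append(data)

    def _open(self, element):
        self.flush()  # tolerate unclosed <p> tags
        self._current = element
        self._buffer = []

    def flush(self):
        if self._current is None:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if text:
            self.blocks.append((self._current, text))
        self._current = None
        self._buffer = []


def html_to_tei(html):
    """Return a TEI-like fragment with headings, paragraphs, and sentences marked."""
    parser = HtmlToTeiSketch()
    parser.feed(html)
    parser.close()
    parser.flush()
    lines = ["<div>"]
    for element, text in parser.blocks:
        if element == "head":
            lines.append(f"<head>{escape(text)}</head>")
        else:
            sentences = "".join(f"<s>{escape(s)}</s>" for s in split_sentences(text))
            lines.append(f"<p>{sentences}</p>")
    lines.append("</div>")
    return "\n".join(lines)
```

For example, html_to_tei("<h1>Uudised</h1><p>Tere. Kuidas läheb?</p>") yields a div containing one head element and one p element with two s elements. A sketch like this only handles pages whose headings and paragraphs are marked with standard HTML tags.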

However, the texts available on the Internet (especially the journalistic texts) turned out to vary tremendously in format. Therefore, no single program is capable of converting them.

Representativeness and balancedness are key notions in corpus linguistics. A representative and balanced corpus should include all (or most) text classes that are prominent in a certain culture in a certain period of time, and these text classes should be represented in the corpus in accordance with their prominence. In reality, representativeness and balancedness are becoming less important as the size of corpora keeps growing. Truly large representative corpora are quite rare; the British National Corpus is an example of such a corpus.

Based on the Estonian Reference Corpus, a smaller, more balanced corpus has been compiled. This is referred to as the Balanced Corpus of Estonian. It consists of three subcorpora (journalistic texts, fiction, and scientific texts) of 5 million words each.

Because the Balanced Corpus of Estonian is a part of the Estonian Reference Corpus, it is possible to get the same sentence twice as a result of a single query. To avoid that, searches of the Balanced Corpus of Estonian can be made as separate queries.
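If results from both corpora have already been merged, duplicates can also be removed afterwards. The following Python sketch assumes that the query results have been exported as a list of plain sentence strings; this export format and the function name are assumptions, not features of the corpus query itself.

```python
def deduplicate_hits(hits):
    """Keep the first occurrence of each sentence, ignoring whitespace differences.

    `hits` is assumed to be a list of sentence strings exported from a query.
    """
    seen = set()
    unique = []
    for hit in hits:
        key = " ".join(hit.split())  # normalize whitespace before comparing
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique
```

Note that this also collapses sentences that genuinely occur in more than one text, so running separate queries remains the safer option when frequencies matter.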

The Estonian Reference Corpus is no longer the largest Estonian corpus. Today the largest corpus available for the Estonian language is etTenTen, a corpus collected from the Internet and compiled in cooperation between the Institute of Estonian Language and Lexical Computing Ltd. etTenTen can be found on Keeleveeb's webpage.

