The corpus of the 1980s, 1 million tokens in total, consists of the following text classes:
Text class | File name prefix | Number of tokens | Share of corpus |
---|---|---|---|
Newspapers | tat | 175,000 | 17.5 % |
Documents | tdt | 12,000 | 1.2 % |
Encyclopædias | tnt | 20,000 | 2.0 % |
Essays and biographies | tet | 90,000 | 9.0 % |
Hobby texts | tht | 75,000 | 7.5 % |
Fiction | tkt | 250,000 | 25.0 % |
Popular science | tpt | 150,000 | 15.0 % |
Propaganda | tot | 60,000 | 6.0 % |
Religion | trt | 8,000 | 0.8 % |
Science | ttt | 160,000 | 16.0 % |
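
The figures in the table can be cross-checked with a few lines of Python. This is only a minimal sketch: the dictionary name `CORPUS_1980S`, the helper functions, and the sample file name `tkt0001.txt` are illustrative assumptions, not part of the corpus distribution. The one thing it takes from the table is that each file name begins with the three-letter code of its text class.

```python
# Text classes of the 1980s corpus, copied from the table above.
# Key: three-letter code at the beginning of the file name.
# Value: (text class, number of tokens).
# The name CORPUS_1980S is illustrative, not an official identifier.
CORPUS_1980S = {
    "tat": ("Newspapers", 175_000),
    "tdt": ("Documents", 12_000),
    "tnt": ("Encyclopædias", 20_000),
    "tet": ("Essays and biographies", 90_000),
    "tht": ("Hobby texts", 75_000),
    "tkt": ("Fiction", 250_000),
    "tpt": ("Popular science", 150_000),
    "tot": ("Propaganda", 60_000),
    "trt": ("Religion", 8_000),
    "ttt": ("Science", 160_000),
}


def total_tokens(corpus):
    """Sum the token counts of all text classes."""
    return sum(tokens for _, tokens in corpus.values())


def class_shares(corpus):
    """Return each class's share of the corpus in per cent, to one decimal."""
    total = total_tokens(corpus)
    return {
        prefix: round(100 * tokens / total, 1)
        for prefix, (_, tokens) in corpus.items()
    }


def text_class_of(filename, corpus):
    """Look up the text class from the beginning of a file name.

    Returns None when the name starts with none of the known codes.
    """
    for prefix, (name, _) in corpus.items():
        if filename.startswith(prefix):
            return name
    return None


if __name__ == "__main__":
    print(total_tokens(CORPUS_1980S))
    for prefix, share in class_shares(CORPUS_1980S).items():
        print(prefix, share)
    # "tkt0001.txt" is a made-up file name used only for illustration.
    print(text_class_of("tkt0001.txt", CORPUS_1980S))
```

Keying the lookup on the file name prefix mirrors how the table itself identifies the classes, so a file can be assigned to its class without any extra metadata.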