Koala - Korp's linguistic annotations, developing an infrastructure for text-based research with high-quality annotations

The corpus infrastructure Korp at Språkbanken (http://spraakbanken.gu.se) contains large amounts of Swedish text of many different types and ages, used by a wide range of researchers and the general public. The texts are enriched with linguistic annotations, such as word classes and syntactic roles, which help users filter their search results. The annotations allow us to find "am", "is", and "are" when looking for "be", to find all mentions of Caesar as the object of the verb defeat while ignoring those where he is the subject, and to distinguish verbal uses of bend ("to bend the iron") from nominal ones ("a bend in the road"). The quality of these annotations is crucial for getting good search results, in particular for researchers who may otherwise have to look through thousands of irrelevant sentences.

The Koala project aims to enhance the annotations, which have been automatically created through well-known language technology methods. This is done by adding linguistic knowledge to the system through the many resources available at Språkbanken, and by combining the various annotation tools for lexical analysis, part-of-speech tagging, sense disambiguation, and syntactic analysis into a high-quality system where word-level and sentence-level annotations inform each other, and the system does not make decisions until it has all available information. The resulting data and tools will be freely available.

Final report

Purpose and use of the infrastructure

The main goal of the infrastructure project Koala - Korp's linguistic annotations - has been to improve the tools used in the research infrastructure Korp (<https://spraakbanken.gu.se/korp>) to annotate the texts, i.e. to add linguistic information such as parts-of-speech and syntactic functions. Korp contains large amounts of text, mainly in Swedish, freely available for searching by researchers and the general public, e.g. to examine how common a word or linguistic phenomenon is, or how language has changed over time. The annotations are necessary to make the texts more searchable. The primary users have an interest in language, but other questions can also be explored, e.g. by social scientists and historians.

We have been working on improving a number of different levels of analysis, but also more generally on how the different analyses interact, including ambiguity, scalability, and traceability. The amount of text in Korp has increased since the project started, from about 1-2 billion words to around 15 billion today, and it continues to grow.

Although Korp is the most prominent showcase of the project results for the users of the research infrastructure, the efforts are also very important for other parts of the Språkbanken research infrastructure. The annotation tools are available for use on texts other than those in Korp, and the texts of Korp are also downloadable for purposes other than the concordance search provided by Korp. The annotation tools are also used by a number of other Språkbanken research infrastructures, such as Strix, which is under development for analysis at text level rather than word level, and Lark (Lärka), a platform for learning Swedish and Swedish linguistics, which is used in research for collecting real-time data on language learning.

Språkbanken is a research infrastructure supporting three kinds of research: (1) language technology; (2) linguistics; and (3) digital humanities and social sciences (DHS). The activities conducted in the project aiming to improve Korp's annotations have also simultaneously constituted (applied) language technology research, and most of the publications listed below describe this work. Some new language technology resources have benefitted directly from the improved annotations, viz. the "Culturomics Gigaword Corpus" (<https://spraakbanken.gu.se/eng/resource/gigaword>), the Swedish sentiment lexicon SenSALDO (<https://spraakbanken.gu.se/eng/resource/sensaldo>), and the Swedish thesaurus Blingbring (<https://spraakbanken.gu.se/eng/resource/blingbring>). Among DHS research utilizing the infrastructure we may mention a study on rhetorical history and an investigation of place names in the medieval letters digitized by the Swedish National Archives.

Project results

To process text with the annotation tools, the basic units of the text need to be defined, i.e. sentences and words. A word (or token) is often simplistically defined as something delimited by white space or punctuation, but in our tools it is now instead defined by the lexicon. This allows us to affect what is considered a word by creating a new entry in our central lexical resource SALDO. This makes it easier, e.g., to handle multiword units by adding them to the lexicon.
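The lexicon-driven definition of a token described above can be illustrated with a minimal sketch. It assumes a toy lexicon of multiword units and a greedy longest-match strategy; the function name, the entries, and the strategy are all hypothetical and are not taken from SALDO or from the project's tools. Punctuation handling is left out for brevity.

```python
def tokenize(text, multiword_entries):
    """Split on white space, then merge the longest lexicon-listed multiword unit."""
    words = text.split()
    # Longest multiword unit in the lexicon, measured in words.
    max_len = max((len(entry) for entry in multiword_entries), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first and fall back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if n == 1 or candidate in multiword_entries:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
    return tokens


# Toy lexicon (hypothetical entries, not copied from SALDO).
LEXICON = {("till", "exempel"), ("i", "dag")}
tokens = tokenize("Vi ses till exempel i dag", LEXICON)
```

With this setup, adding or removing an entry in the lexicon changes what counts as one token, without touching the tokenizer itself.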

Soon, version 3.0 of SALDO will be made available. It contains a new part-of-speech tagset, which is more similar to the part-of-speech tagging now being introduced in Korp (see below). Additionally, information about inflection is separated from information about compounding. There are also placeholders for additional information about words, such as domain or style. Finally, many new entries have been created in the lexicon. Our work has also shown that there are different types of multiword units, which behave differently syntactically. Handling these will be implemented in SALDO at a later point.

A number of new annotation types have been implemented, such as sentiment, lexical classes, and named entities, as well as several readability metrics. In addition, several tools for lexical annotation have been improved. The treatment of compounds has been greatly improved through ranking of compound alternatives by probability. They may now also consist of multiple parts, while they were previously limited to two parts for practical reasons.
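As a rough illustration of ranking compound alternatives by probability, the sketch below enumerates every way of splitting a word into known parts (two or more) and scores each split by the product of its part probabilities. The scoring model, the probabilities, and the function names are assumptions made for this example; the source does not describe the actual model, and linking elements such as the Swedish "s" are ignored.

```python
from math import prod

# Hypothetical part probabilities (invented numbers, for illustration only).
PART_PROB = {
    "glas": 0.004, "glass": 0.002, "skål": 0.003, "kål": 0.001,
    "barn": 0.01, "bok": 0.008, "hylla": 0.002,
}


def splits(word, parts):
    """Return every way of writing word as a sequence of two or more known parts."""
    def walk(rest):
        if not rest:
            yield []
            return
        for end in range(1, len(rest) + 1):
            head = rest[:end]
            if head in parts:
                for tail in walk(rest[end:]):
                    yield [head] + tail
    return [s for s in walk(word) if len(s) >= 2]


def rank_compounds(word, part_prob):
    """Rank the splits of word by normalized probability, best first."""
    scored = [(prod(part_prob[p] for p in s), s) for s in splits(word, part_prob)]
    total = sum(score for score, _ in scored)
    return sorted(((score / total, s) for score, s in scored), reverse=True)


ranked = rank_compounds("glasskål", PART_PROB)
```

Here "glasskål" has two analyses, "glas" + "skål" and "glass" + "kål", and the first is ranked higher under the toy probabilities. A word such as "barnbokhylla" comes out as a three-part compound, since the recursion does not limit the number of parts.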

Within the project, a number of new methods for disambiguating different senses of words in the texts have been developed. These are mainly so-called unsupervised methods, which use a large amount of text and a lexicon. One of these methods has been implemented and is available through Sparv, the Språkbanken annotation tool. Word senses are thus now ranked by probability in the texts in Korp.

One of the most important annotations at the lexical level is part-of-speech tagging, which gives a part-of-speech category for every word in the text, but also more detailed morphological information. A major change in tagging is that our lexical resources, the lexicon and the morphological description, are now integrated into the tagger. This means that the statistical tagger now has help when it encounters a word it hasn't seen before (because it, or a form of it, is not present in the training data from which the tagger has learned the parts-of-speech).
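The idea of letting the lexicon help the tagger with unseen words can be sketched with a deliberately simple unigram tagger. This is an assumption-laden toy, not the project's tagger: the training data, the lexicon entries, and the function names are invented, and the tag names (NN, VB, DT) merely follow the SUC style for illustration.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count how often each word form and each tag occurs in the training data."""
    word_tags = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tags[word.lower()][tag] += 1
            tag_counts[tag] += 1
    return word_tags, tag_counts


def tag_word(word, word_tags, tag_counts, lexicon):
    """Tag one word, consulting the lexicon only when the word is unseen."""
    key = word.lower()
    if key in word_tags:
        # Seen in training: use the most frequent tag for this form.
        return word_tags[key].most_common(1)[0][0]
    candidates = lexicon.get(key)
    if candidates:
        # Unseen, but the lexicon restricts the possible tags.
        return max(candidates, key=lambda t: tag_counts[t])
    # Unseen and not in the lexicon: fall back to the overall most frequent tag.
    return tag_counts.most_common(1)[0][0]


TRAINING = [
    [("en", "DT"), ("hund", "NN"), ("springer", "VB")],
    [("en", "DT"), ("katt", "NN"), ("sover", "VB")],
    [("katten", "NN"), ("ser", "VB"), ("en", "DT"), ("hund", "NN")],
]
LEXICON = {"jamar": ["VB"], "koala": ["NN"]}  # hypothetical lexicon entries

word_tags, tag_counts = train(TRAINING)
with_lexicon = tag_word("jamar", word_tags, tag_counts, LEXICON)
without_lexicon = tag_word("jamar", word_tags, tag_counts, {})
```

Without the lexicon, the unseen verb form "jamar" falls back to the most frequent tag in the toy training data (NN); with the lexicon, the candidate set is narrowed to VB.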

In building the evaluation corpus Eukalyptus, we have created a new part-of-speech tagset. It is more in line with the standard description of Swedish, the Swedish Academy grammar, than the previous tagset, called SUC after the Stockholm-Umeå corpus, which most part-of-speech taggers for Swedish of the last decades are based upon. The new part-of-speech tagset has, however, not yet been fully implemented in Korp, as some practical problems remain regarding the syntactic structure, which is then built on top of part-of-speech tagging.

Regarding the syntactic analysis, we have evaluated several different syntactic parsers, and adapted a dependency parser to output multiple annotation hypotheses. We have discussed different ways of sorting query results in Korp based on annotation quality (or rather, how confident the tools are for any given annotation), but the true challenge lies in how to display this information to the user in an intuitive way.

A large part of the work involved in evaluating the results of the tools has been to create an evaluation corpus with annotations for the different levels, i.e. parts-of-speech and morphological information, word senses, syntactic analysis etc. This work is finished and we are currently putting the data together to make it available to others. We have begun evaluating the annotation tools of the different levels of analysis, but this work will continue in the near future, to allow us to say how much the different tools have improved during the project, and how good the annotations are for different types of text.

Problems and deviations from the plan

In the application we pointed to a number of directions in which the infrastructure could be developed, and several of these have been implemented as a result of the project. Even in those cases where we cannot yet show concrete implementations, the project has allowed us to consider structures in a more long-term perspective which has made way for the possibility to implement more tools and functions in the future. This applies, for example, to the question of traceability, being able to see where data is coming from and how it was produced. We have mostly worked on this at a conceptual level and today we are able to extract this information from the systems. However, we have yet to solve how to handle the extreme increase in data from having such information available throughout the process.

The largest deviation from the time plan has been the creation of the evaluation corpus, which required more time than planned, in part because of annotators leaving the project. It is, however, close to being finished and will be published this summer. It has already been made available upon request to researchers from Uppsala University.

The availability and future of the infrastructure

Korp is one of the most important tools in Språkbanken's research infrastructure. Språkbanken has existed for close to 45 years and is, starting from 2018, a part of the national infrastructure Språkbanken (funded by the Swedish Research Council 2018-2024). The work on developing the tools of the infrastructure Korp has been instrumental in Språkbanken first becoming a university infrastructure at the University of Gothenburg, and then being ready to apply to become a national infrastructure.

With a few exceptions, the material of Korp (<https://spraakbanken.gu.se/korp>) is freely available for querying. In addition, we provide most texts, with their annotations, for download in the form of so-called sentence sets, where, due to copyright, the texts are scrambled at sentence level so that they cannot be reconstituted. The corpus query tool Korp is also freely available (via GitHub: <https://github.com/spraakbanken/>). All annotation tools can be used freely through Sparv (<https://spraakbanken.gu.se/sparv>), which allows for annotating your own texts.
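Scrambling at sentence level can be pictured as follows. This is only a sketch under the assumption that scrambling means randomly reordering sentences; the actual procedure used for the sentence sets is not described here, and the function name and example sentences are hypothetical.

```python
import random


def scramble_sentences(sentences, seed=None):
    """Return the sentences in random order, leaving each sentence intact."""
    rng = random.Random(seed)
    shuffled = list(sentences)  # copy, so the input order is not modified
    rng.shuffle(shuffled)
    return shuffled


corpus = [
    "Det var en gång en koala.",
    "Den bodde i ett träd.",
    "Varje dag åt den löv.",
    "Sedan somnade den.",
]
sentence_set = scramble_sentences(corpus)
```

Each sentence, together with its word-level annotations, remains usable for searching and statistics, while the original running order of the text is no longer part of the distributed data.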

International cooperation

Since being awarded the project grant, Språkbanken has become national coordinator for Swe-Clarin, the Swedish part of the European CLARIN ERIC (with 22 participating countries). Korp is an important tool within Swe-Clarin.

A number of institutions in different countries have chosen to set up their own instance of Korp, to handle and search through texts in their own languages. A few examples are the Finnish Korp (<https://korp.csc.fi/>) and the Saami Korp (<http://gtweb.uit.no/korp/>).

Grant administrator

University of Gothenburg

Reference number

In13-0320:1

Amount

SEK 5,605,000

Funding

RJ Infrastructure for research

Subject

Language Technology (Computational Linguistics)

Year

2013
