SweLL - electronic research infrastructure on Swedish Learner Language

With a growing number of people seeking asylum in Sweden, second language (L2) teaching and its continued development are of great importance to Swedish society. The government has recently initiated a project on learning among new arrivals. One focus of that project is producing tools for evaluating L2 Swedish, an aim to which the SweLL project contributes directly.
The purpose of SweLL is to set up an infrastructure for the collection, digitization, normalization, and annotation of learner production, and to make available a linguistically annotated corpus of approximately 600 L2 learner texts. Because each text has a parallel normalized version, researchers can search such a corpus for various types of linguistic structures without having to guess what form a structure might take in learner writing. L2 corpora are available for many other languages, but no such resource exists for Swedish.
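
The idea of searching through the normalized version can be illustrated with a minimal sketch. It assumes a simple token-level alignment between learner text and normalized text; the class `AlignedToken`, the function `learner_variants`, and the example sentence are all hypothetical and invented for illustration, and do not reflect the actual SweLL data format.

```python
from dataclasses import dataclass


@dataclass
class AlignedToken:
    learner: str     # form as written by the learner
    normalized: str  # corresponding form in the normalized version


# Toy sentence invented for illustration; not taken from the SweLL corpus.
sentence = [
    AlignedToken("Jag", "Jag"),
    AlignedToken("tyker", "tycker"),
    AlignedToken("om", "om"),
    AlignedToken("skolan", "skolan"),
]


def learner_variants(tokens, target):
    """Return the learner forms aligned with a given normalized form."""
    return [t.learner for t in tokens if t.normalized == target]


# Query the standard form and recover what the learner actually wrote.
print(learner_variants(sentence, "tycker"))
```

The query is phrased in terms of the standard form, so the researcher does not need to enumerate possible learner spellings in advance.
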
To meet the needs of the L2 research field, SweLL will create an infrastructure consisting of:
* a data collection portal that gathers data through file import and online exercises
* methods and tools for L2 analysis
* an annotated corpus of L2 production
* dedicated search tools for L2 material, allowing filtering for, e.g., texts written by male writers or by writers at a certain proficiency level.
The material and tools will be made accessible through Språkbanken.
Final report
The SweLL (Swedish Learner Language) infrastructure project aimed to lay the foundation for digital Second Language Acquisition research by
(1) collecting and manually annotating learner essays written by learners of Swedish at different levels of development
(2) developing well-functioning annotation principles, tagsets and processes, and thoroughly describing them
(3) developing and documenting tools for processing and storing learner essays
(4) making the data and tools available through a portal developed for digital resources and tools for second language acquisition research on Swedish

The infrastructure is released as follows:

* The SweLL portal hosts more than 680 essays that have been digitized and manually transcribed from handwritten samples. All essays have been pseudonymized to protect the individual learners. Most of the essays (at present 500 texts) have been normalized, i.e. rewritten to conform to the norms of standard Swedish by correcting erroneous and deviant language, and each correction has been assigned a label describing the difference between the learner's version and the corrected version.

* Manuals and guidelines are available for each step in the annotation process:
-- Transcription guidelines
-- Pseudonymization guidelines
-- Normalization guidelines
-- Correction annotation guidelines
-- Manual for SVALA users
-- Manual for SweLL Portal users
-- L2-specific searches through the corpus browsing tool Korp (https://spraakbanken.gu.se/korp/)

* Several tools are made available for future users of the infrastructure (links accessible via the project page, see below):
-- SweLL portal for collection, storage and versioning of essays, administration of the annotation, statistical overview, import and export of the data
-- SVALA annotation tool for performing manual annotation steps (pseudonymization, normalization, correction annotation)
-- Automatic pseudonymizer service (included as part of the SVALA tool, and available through GitHub for potential extension or re-use in other projects)

* Thorough work has been carried out to ensure that the GDPR guidelines and ethical principles are followed. In consultation with university lawyers, the access principles have been defined and the legal basis double-checked. Access is granted following an application for use. Under the GDPR, users outside Europe cannot get immediate access to the data in its entirety; their applications need to be processed by the university lawyers on a case-by-case basis. Applicants inside the EU will get access to the full dataset provided their intended use targets L2-oriented research, development or pedagogical applications.

* The data can be browsed through Korp (https://spraakbanken.gu.se/korp/), which offers dedicated search solutions for L2 material. These allow filtering for, e.g., texts written by writers of a certain age, gender, or mother tongue, or by writers at a certain proficiency level, with the possibility of a full-text view.
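
The kind of metadata filtering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Essay` record, its field names, the level names, and the sample values are hypothetical, and the code does not use or reproduce the Korp interface or the SweLL export format.

```python
from dataclasses import dataclass


@dataclass
class Essay:
    # Hypothetical metadata fields, chosen to mirror the filters named above.
    essay_id: str
    gender: str
    mother_tongue: str
    level: str


# Invented sample records; not real SweLL data.
essays = [
    Essay("e1", "female", "Arabic", "beginner"),
    Essay("e2", "male", "Somali", "intermediate"),
    Essay("e3", "male", "Arabic", "intermediate"),
]


def filter_essays(essays, **criteria):
    """Keep essays whose metadata match every given field=value pair."""
    return [
        e for e in essays
        if all(getattr(e, field) == value for field, value in criteria.items())
    ]


matches = filter_essays(essays, gender="male", level="intermediate")
print([e.essay_id for e in matches])
```

Combining criteria narrows the selection, for example to texts by writers with a given mother tongue at a given proficiency level.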

More information about the project and tools is available on the project page: https://spraakbanken.gu.se/projekt/swell
Pseudonymization of learner essays as a way to meet GDPR requirements: https://spraakbanken.gu.se/blogg/index.php/2020/10/27/pseudonymization-of-learner-essays-as-a-way-to-meet-gdpr-requirements/ (October 2020)

Korp searches in Second Language data: https://spraakbanken.gu.se/blogg/index.php/2020/06/17/korp-searches-in-second-language-data/ (June 2020)

Interoperability of second language resources and tools: https://www.clarin.eu/news/blog-post-elena-volodina-clarin-workshop-interoperability-second-language-resources-and-tools (2018-01-24)


SweLL data (Elena Volodina, April 2018, English): https://gubox.box.com/s/r5btxbu4tyhl3urz0sn0k1wgkopvwlss

Interoperability of L2 resources and tools (Elena Volodina, October 2018, English): https://youtu.be/XAFeC7tQBwo
