Elena Volodina

SweLL - electronic research infrastructure on Swedish Learner Language

With a growing number of people seeking asylum in Sweden, second language (L2) teaching and its continued development are of great importance to Swedish society. The government has recently initiated a project on learning among newly arrived immigrants. One focus of this project is producing tools for the evaluation of L2 Swedish, an aim to which the SweLL project contributes directly.
The purpose of SweLL is to set up an infrastructure for collection, digitization, normalization, and annotation of learner production, as well as to make available a linguistically annotated corpus of approx. 600 L2 learner texts. Such a corpus would make it possible to search for various types of linguistic structures, without the researcher having to guess what such a structure might look like, since there is a parallel normalized version available. L2 corpora are available for many other languages, but for Swedish such a resource is lacking.
To meet the needs of the L2 research field, SweLL will create an infrastructure consisting of:
* a data collection portal that gathers texts through file import and online exercises
* methods and tools for L2 analysis
* an annotated corpus of L2 production
* specific search tools for L2 material that allow filtering for, e.g., texts written by male writers or by writers at a certain proficiency level.
The material and tools will be made accessible through Språkbanken.
Final report
The SweLL (Swedish Learner Language) infrastructure project aimed to lay the foundation for digital Second Language Acquisition research by
(1) collecting and manually annotating learner essays written by learners of Swedish at different levels of development
(2) developing well-functioning annotation principles, tagsets and processes, and thoroughly describing them
(3) developing and documenting tools for processing and storing of learner essays
(4) making the data and tools available through a portal developed for digital resources and tools for second language acquisition research on Swedish

The infrastructure is released as follows:

* The SweLL portal hosts more than 680 essays that have been digitized and manually transcribed from handwritten samples. All essays have been pseudonymized to protect the individual learners. Most of the essays (at present 500 texts) have been normalized, i.e. rewritten to fit the norms of standard Swedish by correcting erroneous and deviant language, and each correction has been assigned a label describing the difference between the learner's version and the corrected version.

* Manuals and guidelines are available for each step in the annotation process:
-- Transcription guidelines
-- Pseudonymization guidelines
-- Normalization guidelines
-- Correction annotation guidelines
-- Manual for SVALA users
-- Manual for SweLL Portal users
-- L2-specific searches through the corpus browsing tool Korp (https://spraakbanken.gu.se/korp/)

* Several tools are made available for future users of the infrastructure (links accessible via the project page, see below):
-- SweLL portal for collection, storage and versioning of essays, administration of the annotation, statistical overview, import and export of the data
-- SVALA annotation tool for performing manual annotation steps (pseudonymization, normalization, correction annotation)
-- Automatic pseudonymizer service (included as part of the SVALA tool, and available through GitHub for potential extensions or re-use in other projects)

* Thorough work has been carried out to make sure that the GDPR guidelines and ethical principles are followed. In consultation with university lawyers, the access principles have been defined and the legal basis double-checked. Access is granted following an application for use. According to the GDPR, users outside Europe cannot get immediate access to the data in its entirety; their applications need to be processed by the university lawyers on a case-by-case basis. Applicants inside the EU will get access to the full dataset provided their intended use targets L2-oriented research, development or pedagogical applications.

* The data can be browsed through Korp (https://spraakbanken.gu.se/korp/), which offers search solutions specific to L2 material: texts can be filtered by, e.g., the writer's age, gender, mother tongue, or proficiency level, and a full-text view is available.
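
To make the points above more concrete, the following is a minimal Python sketch of how normalized essays with labeled corrections and writer metadata might be represented and filtered. It is an illustration only and does not reflect the actual SweLL data format, the SVALA tool, or the Korp interface: the class names, field names, proficiency level names, correction labels, and the toy data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Correction:
    original: str    # text as the learner wrote it
    normalized: str  # version rewritten to standard Swedish
    label: str       # hypothetical correction label, not a real SweLL tag


@dataclass
class Essay:
    essay_id: str
    level: str          # hypothetical proficiency level name
    mother_tongue: str
    corrections: List[Correction] = field(default_factory=list)


def filter_essays(essays, **criteria):
    """Return the essays whose metadata match every key=value criterion."""
    return [
        e for e in essays
        if all(getattr(e, key) == value for key, value in criteria.items())
    ]


def label_counts(essays):
    """Count correction labels across a collection of essays."""
    return Counter(c.label for e in essays for c in e.corrections)


# Invented toy data, for illustration only.
essays = [
    Essay("e1", "beginner", "Arabic",
          [Correction("jag gilar", "jag gillar", "SPELLING")]),
    Essay("e2", "advanced", "Finnish",
          [Correction("idag jag läser", "idag läser jag", "WORD_ORDER")]),
    Essay("e3", "beginner", "Finnish",
          [Correction("en hus", "ett hus", "AGREEMENT"),
           Correction("skolla", "skola", "SPELLING")]),
]

beginners = filter_essays(essays, level="beginner")
print([e.essay_id for e in beginners])
print(label_counts(beginners))
```

Keeping the learner's wording and the normalized wording side by side in each correction is what lets a search target a linguistic structure in the normalized text and still retrieve what the learner actually wrote.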

More information about the project and tools is available at the project page: https://spraakbanken.gu.se/projekt/swell
Publication list
Elena Volodina, Yousuf Ali Mohammed, Sandra Derbring, Arild Matsson and Beata Megyesi. 2020. Towards Privacy by Design in Learner Corpora Research: A Case of On-the-fly Pseudonymization of Swedish Learner Essays. COLING-2020. Proceedings.

Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén. 2019. The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue.

Elena Volodina, Arild Matsson, Dan Rosén and Mats Wirén. 2019. SVALA: an Annotation Tool for Learner Corpora generating parallel texts. Learner Corpus Research conference (LCR-2019). Proceedings.

Mats Wirén, Arild Matsson, Dan Rosén and Elena Volodina. 2019. SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora. CLARIN-2018 post-conference volume. LiUP Press.

Egon W. Stemle, Adriane Boyd, Maarten Janssen, Therese Lindström Tiedemann, Nives Mikelic Preradovic, Alexandr Rosen, Dan Rosén, Elena Volodina. 2019. Working together towards an ideal infrastructure for language learner corpora. Learner Corpus Research 2017. In Andrea Abel, Aivars Glaznieks, Verena Lyding & Lionel Nicolas (eds.) Widening the Scope of Learner Corpus Research. Selected papers from the fourth Learner Corpus Research Conference. Corpora and Language in Use – Proceedings 5, Louvain-la-Neuve: Presses universitaires de Louvain, 427-468.

Beáta Megyesi, Sofia Johansson, Dan Rosén, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén & Elena Volodina. 2018. Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish. Proceedings of the 7th NLP4CALL workshop.

Elena Volodina, Lena Granstedt, Beáta Megyesi, Julia Prentice, Dan Rosén, Carl-Johan Schenström, Gunlög Sundberg & Mats Wirén. 2018. Annotation of learner corpora: first SweLL insights. Proceedings of SLTC-2018, Stockholm, Sweden.

Dan Rosén, Mats Wirén and Elena Volodina. 2018. Error Coding of Second-Language Learner Texts Based on Mostly Automatic Alignment of Parallel Corpora. Clarin-2018.

Elena Volodina, Maarten Janssen, Therese Lindström Tiedemann, Nives Mikelic Preradovic, Silje Karin Ragnhildstveit, Kari Tenfjord and Koenraad de Smedt. 2018. Interoperability of Second Language Resources and Tools. Clarin-2018.

Ildikó Pilán and Elena Volodina. 2018. Exploring word embeddings and phonological similarity for the unsupervised correction of language learner errors. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (pp. 119-128) at COLING-2018.


BLOGS

Pseudonymization of learner essays as a way to meet GDPR requirements: https://spraakbanken.gu.se/blogg/index.php/2020/10/27/pseudonymization-of-learner-essays-as-a-way-to-meet-gdpr-requirements/ (October 2020)

Korp searches in Second Language data: https://spraakbanken.gu.se/blogg/index.php/2020/06/17/korp-searches-in-second-language-data/ (June 2020)

Interoperability of second language resources and tools: https://www.clarin.eu/news/blog-post-elena-volodina-clarin-workshop-interoperability-second-language-resources-and-tools (2018-01-24)


INTERVIEWS

SweLL data (Elena Volodina, April 2018, English): https://gubox.box.com/s/r5btxbu4tyhl3urz0sn0k1wgkopvwlss

Interoperability of L2 resources and tools (Elena Volodina, October 2018, English): https://youtu.be/XAFeC7tQBwo
Grant administrator
University of Gothenburg
Reference number
IN16-0464:1
Amount
SEK 7,150,000.00
Funding
Infrastructure for research
Subject
General Language Studies and Linguistics
Year
2016