DaLAJ-GED - a dataset for Grammatical Error Detection tasks on Swedish

Authors

  • Elena Volodina
  • Yousuf Ali Mohammed
  • Aleksandrs Berdicevskis
  • Gerlof Bouma
  • Joey Öhman

DOI:

https://doi.org/10.3384/ecp197011

Keywords:

language resource, acceptability judgments, grammatical error detection, baselines, Swedish SuperLim

Abstract

DaLAJ-GED is a dataset for linguistic acceptability judgments for Swedish, covering five head classes: lexical, morphological, syntactical, orthographical and punctuation. DaLAJ-GED is an extension of the DaLAJ.v1 dataset (Volodina et al., 2021a,b). Both DaLAJ datasets are based on the SweLL-gold corpus (Volodina et al., 2019) and its correction annotation categories. DaLAJ-GED, as presented here, contains 44,654 sentences, distributed (almost) equally between correct and incorrect ones. It is primarily aimed at the linguistic acceptability judgment task, but can also be used for other tasks related to grammatical error detection (GED) at the sentence level. DaLAJ-GED is included in the Swedish SuperLim 2.0 collection, an extension of SuperLim (Adesam et al., 2020), a benchmark for Natural Language Understanding (NLU) tasks for Swedish. This paper gives a concise overview of the dataset and presents a few benchmark results for the task of linguistic acceptability, i.e. binary classification of sentences as either correct or incorrect.

Published

2023-05-16