Two Neural Models for Multilingual Grammatical Error Detection

This paper presents two neural models for multilingual grammatical error detection and their results in the MultiGED-2023 shared task. The first model uses a simple, purely supervised character-based approach. The second model uses a large language model pretrained on 100 different languages and fine-tuned on the datasets provided by the shared task. Despite their simplicity, the two systems achieved promising results: one has the second-best F-score, and the other is in the top four of participating systems.


Introduction
Grammatical Error Detection (GED) is the task of detecting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors. It is one of the key components of grammatical error correction (GEC) systems. This paper concerns the development of different methods for subtoken representation and their evaluation on standard benchmarks for multiple languages. Our work is inspired by the recent shared task MultiGED-2023. The aim of this task is to detect tokens in need of correction across five different languages, labeling them as either correct ("c") or incorrect ("i"), i.e., performing binary classification at the token level.
Recent GED methods make use of neural sequence labeling models, either recurrent neural networks or transformers. The first experiments using convolutional neural networks and long short-term memory (LSTM) models for GED were reported in 2016 (Rei and Yannakoudakis, 2016). Later, a bidirectional, attentional LSTM was used to jointly learn token-level and sentence-level representations and combine them, so as to detect grammatically incorrect sentences and identify the locations of error tokens at the same time (Rei and Søgaard, 2019). The bidirectional LSTM model was also used together with grammaticality-specific word embeddings to improve GED performance (Kaneko et al., 2017). A bidirectional LSTM model was trained on synthetic data generated by an attentional sequence-to-sequence model to push GED scores further (Kasewa et al., 2018). The best-performing GED systems employ transformer-based models for token-level labeling. A pretrained BERT model was fine-tuned for GED and showed superior performance in (Kaneko and Komachi, 2019). The BERT model has also shown significant improvements over LSTM models in both GED and GEC (Liu et al., 2021). The state-of-the-art GED method uses a multi-class detection approach (Yuan et al., 2021).
(This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.)
In this work, we also employ state-of-the-art sequence labeling methods, which are based on LSTM or BERT. In contrast to previous work, we focus on different representations of tokens at subtoken levels. Our best-performing system can process multiple languages using a single model.

Methods
We use two different token representations, one at the character level, and one at the subtoken level.

Character-based Representation
In this representation, the j-th input token of a sentence is represented by the concatenation of three vectors (b_j, m_j, e_j) corresponding to its characters. More precisely, the token is represented by the vector x_j = (b_j, m_j, e_j), where the first vector b_j and the third vector e_j represent the first and last character of the token, respectively. The second vector m_j represents a bag of the middle characters of the token, excluding the initial and final positions. The dotted frame in Figure 1 depicts this representation. For example, the token "Last" is represented as a concatenation of the following vectors: (1) a one-hot vector for the character L; (2) a one-hot vector for the character t; and (3) a bag-of-character multi-hot vector for the internal characters a and s. Thus, each token is represented by a vector of size 3V, where V is the size of the alphabet. The label y_j is predicted by a softmax layer applied to the encoder output for position j. This representation is inspired by the semi-character word recognition method proposed by Sakaguchi et al. (2017), which was demonstrated to be significantly more robust in word spelling correction than character-based convolutional networks.
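The encoding above can be sketched in a few lines of plain Python. This is an illustrative toy version (a lowercase-only alphabet, no handling of unknown characters), not the actual system code:

```python
# Minimal sketch of the character-based token representation:
# one-hot first character, multi-hot bag of internal characters,
# one-hot last character, concatenated into a 3V-dimensional vector.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # toy alphabet, size V = 26
V = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_token(token):
    """Return the 3V-dimensional vector (b, m, e) for a token."""
    token = token.lower()
    b = [0.0] * V  # one-hot: first character
    m = [0.0] * V  # multi-hot bag: internal characters
    e = [0.0] * V  # one-hot: last character
    b[INDEX[token[0]]] = 1.0
    e[INDEX[token[-1]]] = 1.0
    for ch in token[1:-1]:
        m[INDEX[ch]] = 1.0
    return b + m + e  # concatenation, size 3V

vec = encode_token("Last")  # b marks 'l', m marks {'a', 's'}, e marks 't'
```

The resulting vectors are fed to the LSTM encoder described below; the softmax layer over the encoder outputs then predicts the "c"/"i" label per token.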

Subtoken-based Representation
Recent language processing systems use unsupervised text tokenizers and detokenizers to build purely end-to-end systems that do not depend on language-specific pre- and postprocessing. SentencePiece is one such method: it implements subword units, e.g., byte-pair encoding (BPE) (Sennrich et al., 2016) and the unigram language model (Kudo, 2018), with the extension of direct training from raw sentences. With this method, the vocabulary size is fixed before the neural encoder is trained. Our system also uses this subtoken representation.
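The core idea behind BPE-style subword units can be illustrated with a minimal sketch: starting from characters, repeatedly merge the most frequent adjacent symbol pair until a preset number of merges is reached. This is a toy illustration of the principle only, not SentencePiece itself:

```python
from collections import Counter

def merge_pair(seq, pair, merged):
    """Replace every occurrence of an adjacent pair in seq with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn BPE merges over a tiny corpus; each word starts as characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(seq, best, best[0] + best[1]) for seq in corpus]
    return merges, corpus

merges, segmented = learn_bpe(["lower", "lowest", "newer", "newest"], 4)
```

After a few merges, frequent substrings such as "lowe" and "st" emerge as subword units, which is the effect SentencePiece exploits at scale with a predetermined vocabulary size.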

LSTM and BERT Encoders
The LSTM network is a common type of recurrent neural network capable of processing sequential data efficiently. It was the dominant approach before Transformers (Vaswani et al., 2017), which dispense entirely with recurrence and rely solely on the attention mechanism. Although no longer state of the art, a purely supervised LSTM encoder lets us test the effectiveness of the character-based method.
We employ the XLM-RoBERTa model as another encoder in our system. RoBERTa (Liu et al., 2019) is based on Google's BERT model released in 2018 (Devlin et al., 2019). It modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates. RoBERTa has the same architecture as BERT, but uses a byte-level BPE tokenizer. The XLM-RoBERTa model (Conneau et al., 2020), proposed in 2020, is based on RoBERTa. It is a large multilingual language model trained on 100 languages using 2.5TB of filtered CommonCrawl data. Pretraining multilingual models at scale has been shown to yield significant performance gains on a wide range of cross-lingual transfer tasks. Unlike some XLM multilingual models, XLM-RoBERTa does not require language tensors to identify which language is used; it determines the language from the input IDs alone.

Experiments
This section presents the datasets used, the experimental settings, and the results obtained by our systems.

Datasets
The datasets are provided by the MultiGED-2023 shared task. The shared task provides training, development and test data for each of the five languages: Czech, English, German, Italian and Swedish. The training and development datasets are available in the MultiGED-2023 GitHub repository; test sets were released during the test phase for participating teams. Table 1 shows the statistics of the datasets.

Evaluation Metric
Evaluation is carried out in terms of token-based precision, recall and F0.5, consistent with previous work on error detection. F0.5 is used instead of F1 because humans judge false positives more harshly than false negatives, so precision is weighted more heavily than recall.
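Concretely, the shared-task metric is the standard F-beta score with beta = 0.5, which can be written as a small helper function:

```python
# F-beta score: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
# With beta = 0.5, precision is weighted more heavily than recall.
def f_beta(precision, recall, beta=0.5):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Note the asymmetry this induces: a system with precision 0.8 and recall 0.2 scores higher under F0.5 than one with precision 0.2 and recall 0.8, reflecting the preference for avoiding false positives.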

Experimental Settings
Our first system, VLP-char, uses the character-based token representation and the LSTM encoder. Its parameters are initialized with random vectors in each run, which lets us establish results in a purely supervised learning setting rather than a semi-supervised or transfer learning setting. The same model is trained separately for each language, resulting in five models. All five language-specific models are trained with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5 × 10^-4, using the usual cross-entropy loss for multinomial classification, for 80 epochs. The maximum sequence length is set to 60 tokens, which is enough to cover most sentences in the provided datasets. Since the data is highly imbalanced (error rates range from only 10% for English to 24% for Czech), we set the incorrect label weight to 90% and the correct label weight to 10% when computing the objective function. This system does not use any external resources; only the datasets provided by the organizers are used to train and validate the models. We use the BigDL library as the deep learning framework, and our code is publicly available on GitHub.
Our second system, DSL-MIM-HUS, uses the subtoken-based representation and pretrained XLM-RoBERTa embeddings. It uses the NERDA library to fine-tune the pretrained embeddings on all datasets. That is, we combine all the provided datasets (training and development splits) into one large dataset and run the experiment on this combined set; there is thus only one model for all five languages. The combined dataset is divided into training, development and test splits with ratios 0.8, 0.1 and 0.1, giving 82,976 training samples, 10,371 development samples and 10,371 test samples, respectively. We did not keep the proportions of the different languages the same when sampling. It would have been more beneficial to preserve these proportions, since the dataset sizes differ considerably: there are three times more German sentences than Italian ones. The hyperparameters are tuned on the development set and selected as follows: a learning rate of 10^-5 and 20 training epochs.
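The class-weighted objective used to counter the label imbalance can be sketched as follows. This is a simplified, hedged illustration of the idea (the actual systems compute the weighted loss inside their respective frameworks; variable names here are ours):

```python
import math

# Per-class weights from the experimental settings:
# the rare "i" (incorrect) label is up-weighted, the frequent "c" label down-weighted.
WEIGHTS = {"i": 0.9, "c": 0.1}

def weighted_cross_entropy(gold_labels, gold_probs):
    """Mean class-weighted cross-entropy.

    gold_labels: list of 'c'/'i' labels, one per token.
    gold_probs:  the model's predicted probability of the *gold* label per token.
    """
    total = 0.0
    for label, p in zip(gold_labels, gold_probs):
        total += -WEIGHTS[label] * math.log(p)
    return total / len(gold_labels)
```

With these weights, an equally confident mistake on an incorrect token costs nine times more than one on a correct token, pushing the model toward detecting the rare error class.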

Supervised System
Without using any external datasets or pretrained embeddings, the VLP-char system obtained modest results, ranking fourth among participating systems. This system consistently gives higher recall than precision on all languages, while the other systems show better precision than recall. It achieves a recall of 63.95% on the Czech test set, the highest among participating systems for this language, as shown in Table 2.
Despite these modest results, the system demonstrates what can be built with very limited data.

Pretrained System
On our test split, the DSL-MIM-HUS system achieves a precision of 80.88%, a recall of 64.07% and an F0.5 of 71.50% for incorrect-token prediction. The corresponding scores on the training set are 98.54%, 96.75% and 97.64%, respectively. Since the combined test split mixes samples from all languages, evaluating each language separately is not meaningful.
On the private test set of the MultiGED-2023 shared task (Volodina et al., 2023), the DSL-MIM-HUS system ranks second overall. It achieves the best score among participating systems on the English REALEC dataset. Table 3 shows the performance of this system on the private test set, as announced by the organizers. Although the XLM-RoBERTa system clearly outperformed the LSTM system, note that the LSTM system was trained on only a fraction of the data available to the XLM-RoBERTa system.

Conclusion
We have presented two neural models for multilingual grammatical error detection and their results in the MultiGED-2023 shared task. One model uses a purely supervised LSTM network over a character-based token representation; the other fine-tunes a pretrained XLM-RoBERTa model over a subtoken representation. Both systems achieved promising results in the shared task.
In future work, we plan to seek better ways to exploit syntactic and semantic information from a dependency parser. We believe that explicit syntactic and semantic dependencies between the tokens of a sentence will be fruitful for detecting grammatical errors. In a recent study, we demonstrated the usefulness of syntactic structures in improving lexical embeddings (Dang and Le-Hong, 2021). Incorporating constituent-based syntax has also been shown to be effective for GED (Zhang and Li, 2022).