https://ecp.ep.liu.se/index.php/sltc/issue/feed
Swedish Language Technology Conference and NLP4CALL

Papers are invited on all theoretical, practical and applied aspects of language technology, including natural language processing, computational linguistics, speech technology and neighbouring areas. Papers can describe completed or ongoing research, as well as practical applications of language technology, and may be combined with system demonstrations.

https://ecp.ep.liu.se/index.php/sltc/article/view/677
MultiGED-2023 shared task at NLP4CALL: Multilingual Grammatical Error Detection
Elena Volodina, Christopher Bryant, Andrew Caines, Orphée De Clercq, Jennifer-Carmen Frey, Elizaveta Ershova, Alexandr Rosen, Olga Vinogradova

This paper reports on the NLP4CALL shared task on Multilingual Grammatical Error Detection (MultiGED-2023), which included five languages: Czech, English, German, Italian and Swedish. It is the first shared task organized by the Computational SLA working group, whose aim is to promote less represented languages in the fields of Grammatical Error Detection and Correction and other related fields. The MultiGED datasets have been produced from second language (L2) learner corpora for each of the languages. In this paper we introduce the task as a whole, elaborate on the dataset generation process and the design choices made to obtain the MultiGED datasets, and provide details of the evaluation metrics and the CodaLab setup. We further briefly describe the systems used by participants and report the results.

Published: 2023-05-16. Copyright (c) 2023 Elena Volodina, Christopher Bryant, Andrew Caines, Orphée De Clercq, Jennifer-Carmen Frey, Elizaveta Ershova, Alexandr Rosen, Olga Vinogradova.
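For reference, the F0.5 score referred to in several of the system descriptions below is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall. With precision P and recall R computed over the predicted error tokens:

    F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R} = \frac{1.25\, P R}{0.25\, P + R}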
https://ecp.ep.liu.se/index.php/sltc/article/view/678
NTNU-TRH system at the MultiGED-2023 Shared Task on Multilingual Grammatical Error Detection
Lars Bungum, Björn Gambäck, Arild Brandrud Næss

The paper presents a monolithic approach to grammatical error detection, which uses one model for all languages, in contrast to the individual approach, which creates separate models for each language. For both approaches, pre-trained embeddings are the only external knowledge sources. Two sets of embeddings (Flair and BERT) are compared, as are the two approaches to multilingual grammatical error detection: building individual systems versus a single monolithic one. The system submitted to the test phase of the MultiGED-2023 shared task ranked 5th of 6 systems. In the subsequent open phase, more experiments were conducted, improving the results. These results show the individual models performing better than the monolithic ones, and BERT embeddings working better than Flair embeddings for the individual models, while the picture is more mixed for the monolithic models.

Published: 2023-05-16. Copyright (c) 2023 Lars Bungum, Björn Gambäck, Arild Brandrud Næss.

https://ecp.ep.liu.se/index.php/sltc/article/view/679
EliCoDe at MultiGED2023: fine-tuning XLM-RoBERTa for multilingual grammatical error detection
Davide Colla, Matteo Delsanto, Elisa Di Nuovo

In this paper we describe the participation of our team, ELICODE, in the first shared task on Multilingual Grammatical Error Detection, MultiGED, organised within the workshop series on Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL). The multilingual shared task includes five languages: Czech, English, German, Italian and Swedish. The shared task is framed as a binary classification task at token level, aiming to identify correct and incorrect tokens in the provided sentences. The submitted system is a token classifier based on the XLM-RoBERTa language model. We fine-tuned five different models, one per language in the shared task. We devised two experimental settings: first, we trained the models only on the provided training set, using the development set to select the model achieving the best performance across the training epochs; second, we trained each model jointly on the training and development sets for 10 epochs, retaining the 10-epoch fine-tuned model. Our submitted systems, evaluated using the F0.5 score, achieved the best performance on all test sets except the English REALEC dataset, on which they ranked second. Code and models are publicly available at https://github.com/davidecolla/EliCoDe.

Published: 2023-05-16. Copyright (c) 2023 Davide Colla, Matteo Delsanto, Elisa Di Nuovo.
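As an illustration of the approach described in the abstract above, the following is a minimal sketch of token-level binary classification by fine-tuning XLM-RoBERTa with the Hugging Face Transformers library; the toy sentences, labels and single optimisation step are placeholders, not the authors' actual data or training setup (their code is available at the repository linked above).

    # Minimal sketch (placeholder data, not the EliCoDe code): token-level binary
    # grammatical error detection by fine-tuning XLM-RoBERTa.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base", num_labels=2)  # 0 = correct token, 1 = incorrect token
    model.train()

    # Toy word-level sentences with per-token correct/incorrect labels.
    sentences = [["She", "go", "to", "school"], ["He", "reads", "a", "book"]]
    labels = [[0, 1, 0, 0], [0, 0, 0, 0]]

    enc = tokenizer(sentences, is_split_into_words=True,
                    padding=True, truncation=True, return_tensors="pt")

    # Align word-level labels with sub-word tokens; special tokens, padding and
    # non-initial sub-words get the ignore index -100.
    aligned = []
    for i, word_labels in enumerate(labels):
        row, prev = [], None
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else word_labels[wid])
            prev = wid
        aligned.append(row)
    enc["labels"] = torch.tensor(aligned)

    # One gradient step; a real setup would loop over batches and epochs and
    # monitor F0.5 on the development set.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss = model(**enc).loss
    loss.backward()
    optimizer.step()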
https://ecp.ep.liu.se/index.php/sltc/article/view/680
A distantly supervised Grammatical Error Detection/Correction system for Swedish
Murathan Kurfalı, Robert Östling

This paper presents our submission to the first Shared Task on Multilingual Grammatical Error Detection (MultiGED-2023). Our method utilizes a transformer-based sequence-to-sequence model, which was trained on a synthetic dataset consisting of 3.2 billion words. We adopt a distantly supervised approach, with the training process relying exclusively on the distribution of language learners' errors extracted from the annotated corpus used to construct the training data. In the Swedish track, our model ranks fourth out of seven submissions in terms of the target F0.5 metric, while achieving the highest precision. These results suggest that our model is conservative yet remarkably precise in its predictions.

Published: 2023-05-16. Copyright (c) 2023 Murathan Kurfalı, Robert Östling.

https://ecp.ep.liu.se/index.php/sltc/article/view/681
Two Neural Models for Multilingual Grammatical Error Detection
Phuong Le-Hong, The Quyen Ngo, Thi Minh Huyen Nguyen

This paper presents two neural models for multilingual grammatical error detection and their results in the MultiGED-2023 shared task. The first model uses a simple, purely supervised character-based approach. The second model uses a large language model which is pretrained on 100 different languages and fine-tuned on the provided datasets of the shared task. Despite their simplicity, the two systems achieved promising results: one system obtained the second-best F-score, and the other placed in the top four of participating systems.

Published: 2023-05-16. Copyright (c) 2023 Phuong Le-Hong, The Quyen Ngo, Thi Minh Huyen Nguyen.

https://ecp.ep.liu.se/index.php/sltc/article/view/682
Experiments on Automatic Error Detection and Correction for Uruguayan Learners of English
Romina Brown, Santiago Paez, Gonzalo Herrera, Luis Chiruzzo, Aiala Rosá

This paper presents an initial experiment on Grammatical Error Correction and Automatic Grading for short texts written by Uruguayan students who are learning English. We present a set of error detection and correction heuristics, and some experiments on using these heuristics to predict the grade. Although our experiments are limited due to the nature of the dataset, they are a good proof of concept with promising results that might be extended in the future.

Published: 2023-05-16. Copyright (c) 2023 Romina Brown, Santiago Paez, Gonzalo Herrera, Luis Chiruzzo, Aiala Rosá.
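Purely as an illustration of what detection/correction heuristics of this general kind can look like (the actual rules developed for the Uruguayan learner data are not reproduced here), the sketch below applies two toy rules and counts the edits made, a quantity that could serve as one feature for grade prediction.

    # Illustrative sketch only: two toy detect-and-correct rules, not the
    # authors' actual heuristics.
    import re

    RULES = [
        # "a" before a word starting with a vowel letter -> "an" (crude approximation)
        (re.compile(r"\ba\b(?=\s+[aeiouAEIOU])"), "an"),
        # immediately repeated word ("the the") -> a single occurrence
        (re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE), r"\1"),
    ]

    def detect_and_correct(text):
        """Apply each rule; return the corrected text and the number of edits made."""
        edits = 0
        for pattern, replacement in RULES:
            text, n = pattern.subn(replacement, text)
            edits += n
        return text, edits

    corrected, n_errors = detect_and_correct("She has a apple and the the dog runs.")
    print(corrected, n_errors)  # the edit count could feed a simple grade predictor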
https://ecp.ep.liu.se/index.php/sltc/article/view/683
Sequence Tagging in EFL Email Texts as Feedback for Language Learners
Yuning Ding, Ruth Trüb, Johanna Fleckenstein, Stefan Keller, Andrea Horbach

When predicting scores for different aspects of a learner text, automated scoring algorithms usually cannot provide information about which part of the text a score refers to. We therefore propose a method to automatically segment learner texts as a step towards providing visual feedback. We train a neural sequence tagging model and use it to segment EFL email texts into functional segments. Our algorithm reaches a token-based accuracy of 90% when trained per prompt and between 83% and 87% in a cross-prompt scenario.

Published: 2023-05-16. Copyright (c) 2023 Yuning Ding, Ruth Trüb, Johanna Fleckenstein, Stefan Keller, Andrea Horbach.

https://ecp.ep.liu.se/index.php/sltc/article/view/684
Speech Technology to Support Phonics Learning for Kindergarten Children at Risk of Dyslexia
Stine Fuglsang Engmose, Peter Juel Henrichsen

We present the AiRO learning environment for kindergarten children at risk of developing dyslexia. The AiRO frontend, easy to use for pupils down to 5 years old, introduces each spelling task with pictorial and auditory cues. AiRO responds to spelling attempts with phonetic renderings (synthetic voice). We introduce the didactic and technical principles behind AiRO before presenting our first experiment with 50 kindergarten pupils. Our subjects were pre- and post-tested on reading and spelling. After four weeks of AiRO-based training, the experimental group significantly outperformed the control group, suggesting that a new CALL-based pedagogical approach to preventing dyslexia for some children may be within reach.

Published: 2023-05-16. Copyright (c) 2023 Stine Fuglsang Engmose, Peter Juel Henrichsen.

https://ecp.ep.liu.se/index.php/sltc/article/view/685
On the relevance and learner dependence of co-text complexity for exercise difficulty
Tanja Heck, Detmar Meurers

Adaptive exercise sequencing in Intelligent Language Tutoring Systems (ILTS) aims to select exercises for individual learners that match their abilities. For exercises practicing forms in isolation, it may be sufficient for sequencing to consider the form being practiced. But when exercises embed the forms in a sentence or a bigger language context, little is known about how the nature of this co-text influences learners in completing the exercises. To fill this gap, based on data from two large field studies conducted with an English ILTS in German secondary schools, we analyze the impact of co-text complexity on learner performance for different exercise types and learners at different proficiency levels. The results show that co-text complexity is an important predictor of a learner's performance on practice exercises, especially for gap-filling and Jumbled Sentences exercises, and particularly for learners at higher proficiency levels.

Published: 2023-05-16. Copyright (c) 2023 Tanja Heck, Detmar Meurers.

https://ecp.ep.liu.se/index.php/sltc/article/view/686
Manual and Automatic Identification of Similar Arguments in EFL Learner Essays
Ahmed Mousa, Ronja Laarmann-Quante, Andrea Horbach

Argument mining typically focuses on identifying argumentative units such as claims, positions and evidence in texts. In an educational setting, e.g. when teachers grade students' essays, they may in addition benefit from information about the content of the arguments being used. We thus present a pilot study on the identification of similar arguments in a set of essays written by English-as-a-foreign-language (EFL) students. In a manual annotation study, we show that human annotators are able to assign sentences to a set of 26 reference arguments with a rather high agreement of κ > .70. In a set of experiments based on (a) unsupervised clustering and (b) supervised machine learning, we find that both approaches perform rather poorly on this task, but can be moderately improved by using a set of six meta classes instead of the more fine-grained argument distinctions.

Published: 2023-05-16. Copyright (c) 2023 Ahmed Mousa, Ronja Laarmann-Quante, Andrea Horbach.
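A minimal sketch of the kind of similarity-based assignment this last task involves, mapping each learner sentence to its closest reference argument; the encoder model, the example reference arguments and the use of plain cosine similarity are illustrative assumptions, not the setup of the paper.

    # Illustrative sketch (not the paper's system): assign each learner sentence
    # to the most similar reference argument via sentence-embedding cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf English encoder

    reference_arguments = [                              # hypothetical reference arguments
        "School uniforms reduce peer pressure.",
        "School uniforms limit students' self-expression.",
    ]
    sentences = ["Wearing the same clothes makes bullying less likely."]

    ref_emb = model.encode(reference_arguments, convert_to_tensor=True)
    sent_emb = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(sent_emb, ref_emb)  # shape: (n_sentences, n_reference_arguments)

    for sentence, row in zip(sentences, scores):
        best = int(row.argmax())
        print(sentence, "->", reference_arguments[best], float(row[best]))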
https://ecp.ep.liu.se/index.php/sltc/article/view/687
DaLAJ-GED - a dataset for Grammatical Error Detection tasks on Swedish
Elena Volodina, Yousuf Ali Mohammed, Aleksandrs Berdicevskis, Gerlof Bouma, Joey Öhman

DaLAJ-GED is a dataset for linguistic acceptability judgments for Swedish, covering five head classes: lexical, morphological, syntactical, orthographical and punctuation. DaLAJ-GED is an extension of the DaLAJ.v1 dataset (Volodina et al., 2021a,b). Both DaLAJ datasets are based on the SweLL-gold corpus (Volodina et al., 2019) and its correction annotation categories. DaLAJ-GED, presented here, contains 44,654 sentences, distributed (almost) equally between correct and incorrect ones. It is primarily aimed at the linguistic acceptability judgment task, but can also be used for other tasks related to grammatical error detection (GED) at sentence level. DaLAJ-GED is included in the Swedish SuperLim 2.0 collection, an extension of SuperLim (Adesam et al., 2020), a benchmark for Natural Language Understanding (NLU) tasks for Swedish. This paper gives a concise overview of the dataset and presents a few benchmark results for the task of linguistic acceptability, i.e. binary classification of sentences as either correct or incorrect.

Published: 2023-05-16. Copyright (c) 2023 Elena Volodina, Yousuf Ali Mohammed, Aleksandrs Berdicevskis, Gerlof Bouma, Joey Öhman.

https://ecp.ep.liu.se/index.php/sltc/article/view/688
Automated Assessment of Task Completion in Spontaneous Speech for Finnish and Finland Swedish Language Learners
Ekaterina Voskoboinik, Yaroslav Getman, Ragheb Al-Ghezi, Mikko Kurimo, Tamas Grosz

This study investigates the feasibility of automated content scoring for spontaneous spoken responses from Finnish and Finland Swedish learners. Our experiments reveal that pretrained Transformer-based models outperform the tf-idf baseline in automatic task completion grading. Furthermore, we demonstrate that pre-fine-tuning these models to differentiate between responses to distinct prompts enhances subsequent task completion fine-tuning. We observe that task completion classifiers learn faster and produce predictions with stronger correlations to human grading when accounting for task differences. Additionally, we find that employing similarity learning, as opposed to conventional classification fine-tuning, further improves the results. It is especially helpful to learn not only the similarities between responses in the same score bin, but also the exact differences between the average human scores the responses received. Lastly, we demonstrate that models applied to both manual and ASR transcripts yield comparable correlations to human grading.

Published: 2023-05-16. Copyright (c) 2023 Ekaterina Voskoboinik, Yaroslav Getman, Ragheb Al-Ghezi, Mikko Kurimo, Tamas Grosz.
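For illustration, a minimal sketch of a tf-idf baseline of the general kind the study compares against: transcripts of spoken responses are vectorized and a linear classifier predicts the task-completion grade. The example transcripts and grades below are placeholders, not the study's data.

    # Illustrative tf-idf baseline sketch (placeholder data, not the study's setup).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    transcripts = [
        "I would like to book a table for two at seven",
        "yesterday I went to the shop and bought some milk",
        "uh I do not know what to say about this",
    ]
    grades = [3, 2, 0]  # hypothetical task-completion scores

    baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression(max_iter=1000))
    baseline.fit(transcripts, grades)
    print(baseline.predict(["I want to reserve a table for tonight"]))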