Evaluating the Generalisation of an Artificial Learner

Authors

  • Bernardo Stearns Insight Centre for Data Analytics, Data Science Institute, University of Galway
  • Nicolas Ballier LLF & CLILLAC-ARP / Université Paris Cité, rue Thomas Mann, 75013 Paris
  • Thomas Gaillat LIDILE / Université de Rennes 2, 35000 Rennes
  • Andrew Simpkin School of Mathematical and Statistical Sciences, University of Galway, University Road, Galway
  • John P. McCrae Insight Centre for Data Analytics, Data Science Institute, University of Galway

DOI:

https://doi.org/10.3384/ecp211015

Keywords:

LLM, Learner Simulation, NLP

Abstract

This paper focuses on the creation of LLM-based artificial learners. Motivated by the capability of language models to encode language representations, we evaluate such models on predicting masked tokens in learner corpora. We pre-trained two learner models, one on a training set of the EFCAMDAT (the natural learner model) and another on the C4200m dataset (the synthetic learner model), and evaluated them against a native model using an external corpus, the English for Specific Purposes corpus of French undergraduates (CELVA), as the test set. We measured metrics related to accuracy, consistency, and divergence. While the native model performs reasonably well, the natural learner pre-trained model shows improvements in token recall at k. We complement the accuracy metric by showing that the native language model makes "over-confident" mistakes, whereas our artificial learners make mistakes where the probabilities are close to uniform. Finally, we show that the token choices of the native model diverge from those of the natural learner model, and that this divergence is greater at lower proficiency levels.
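To illustrate the kind of evaluation the abstract describes, the following minimal sketch (not the authors' code) computes recall at k and prediction entropy for masked-token prediction with a pretrained masked language model via the Hugging Face transformers API. The model name ("bert-base-uncased") and the toy sentence are placeholder assumptions, not taken from the paper.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder model; the paper's models are instead pre-trained on
# EFCAMDAT (natural learner) and C4200m (synthetic learner).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def evaluate_mask(sentence, gold, k=5):
    """Return (hit, entropy): whether the gold token appears in the top-k
    predictions at the [MASK] position (recall at k on one item), and the
    entropy of the predicted distribution there. A wrong prediction with
    low entropy is an "over-confident" mistake; near-uniform probabilities
    yield high entropy."""
    inputs = tokenizer(sentence, return_tensors="pt")
    # Locate the [MASK] token in the input sequence.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    hit = tokenizer.convert_tokens_to_ids(gold) in probs.topk(k).indices.tolist()
    entropy = float(-(probs * log_probs).sum())
    return hit, entropy

hit, entropy = evaluate_mask(f"She has read many {tokenizer.mask_token} this year.", "books")
print(f"gold token in top-5: {hit}, entropy: {entropy:.2f}")

Divergence between the native and learner models could then be measured, for instance, as the KL divergence between their predicted token distributions at the same mask position, though the paper's exact divergence metric is defined in the full text.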

Published

2024-10-15