Comparing Native and Learner Englishes Using a Large Pre-trained Language Model


  • Tatsuya Aoyama, Georgetown University



learner corpus study, second language acquisition, distributional semantics, word embeddings, language model


The use of lexical items by L2 speakers of English has been analyzed through a variety of methods; however, these methods are either (i) infeasible for a large-scale learner corpus study or (ii) designed to measure vocabulary breadth rather than depth. This paper presents preliminary results of ongoing work that uses contextualized word embeddings (CWEs) obtained from a large pre-trained language model to measure the depth of L2 speakers’ vocabulary knowledge, operationalized as how similar L2 speakers’ use of a given word is to that of L1 speakers. We find that (i) the mean distance between the L1 CWEs and L2 CWEs of a given word tends to decrease as proficiency level increases, and that (ii) while words with similar CWEs in the L1 and L2 corpora tend to reveal interesting properties of word use, words with dissimilar CWEs in the two corpora often suffer from domain effects.
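
To make the operationalization concrete, the following is a minimal sketch of how a mean L1-L2 distance for a single word could be computed once the CWEs are available. The abstract does not specify the distance metric or the aggregation, so cosine distance and averaging over all L1-L2 pairs are assumptions made here for illustration; the function name `mean_cross_distance` is hypothetical, and the toy vectors stand in for embeddings extracted from a pre-trained language model.

```python
import numpy as np


def mean_cross_distance(l1_vecs, l2_vecs):
    """Mean cosine distance over all (L1, L2) embedding pairs for one word.

    Assumption: cosine distance and all-pairs averaging are illustrative
    choices, not necessarily the metric used in the paper.
    """
    a = np.asarray(l1_vecs, dtype=float)
    b = np.asarray(l2_vecs, dtype=float)
    # Normalize each row to unit length so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cos_sim = a @ b.T  # shape: (number of L1 tokens, number of L2 tokens)
    return float(np.mean(1.0 - cos_sim))


# Toy 2-dimensional "embeddings" of one word's occurrences (not real data).
l1_tokens = [[1.0, 0.0]]
l2_tokens = [[1.0, 0.0], [0.0, 1.0]]
distance = mean_cross_distance(l1_tokens, l2_tokens)
```

Under these assumptions, a smaller value means that L2 uses of the word sit closer to L1 uses in embedding space; computing it separately per proficiency level would allow the trend described in finding (i) to be examined.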