The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts


  • Andrew Caines, ALTA Institute & Computer Laboratory, University of Cambridge
  • Helen Yannakoudakis, Department of Informatics, King’s College London
  • Helen Allen, Cambridge University Press & Assessment, University of Cambridge
  • Pascual Pérez-Paredes, Departamento de Filología Inglesa, Universidad de Murcia
  • Bill Byrne, Department of Engineering, University of Cambridge
  • Paula Buttery, ALTA Institute & Computer Laboratory, University of Cambridge



Keywords: dialogue, resources, English teaching


The first version of the Teacher-Student Chatroom Corpus (TSCC) was released in 2020 and contained 102 chatroom dialogues between 2 teachers and 8 learners of English, amounting to 13.5K conversational turns and 133K word tokens. In this second version of the corpus, we release an additional 158 chatroom dialogues, adding 27.9K conversational turns and 230K word tokens. In total, the corpus now contains 260 chatroom lessons, 41.4K conversational turns and 363K word tokens, involving 2 teachers and 13 students with seven different first languages. As before, the content of the lessons was guided by the teacher, and the proficiency level of the learners is judged to range from B1 to C2 on the CEFR scale. Annotation of the dialogues continued as in the first version, covering conversation-analytic sequence types, pedagogical focus, and correction of grammatical errors. In addition, we have annotated fifty of the dialogues using the Self-Evaluation of Teacher Talk (SETT) framework, which is intended for self-reflection on interactional aspects of language teaching. Finally, we conducted machine learning experiments to automatically detect shifts in discourse sequences from turn to turn, using transfer learning with large pretrained language models. The TSCC v2 is freely available for research use.
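
To make the sequence-shift detection task concrete, the following minimal sketch shows one way binary shift labels could be derived from per-turn sequence-type annotations. It is an illustration only: the function name `shift_labels`, the example label strings, and the convention that a shift is any turn whose sequence type differs from the preceding turn's are assumptions made here, not the corpus format or the labelling scheme used in the experiments.

```python
from typing import List


def shift_labels(sequence_types: List[str]) -> List[int]:
    """Return 1 for each turn whose sequence type differs from the
    previous turn's, and 0 otherwise.

    Assumption: every turn carries exactly one sequence-type label, and
    the first turn of a dialogue is treated as a non-shift.
    """
    labels = []
    previous = None
    for current in sequence_types:
        # A shift is flagged only when there is a predecessor and it differs.
        labels.append(int(previous is not None and current != previous))
        previous = current
    return labels


# Hypothetical per-turn annotations for a short dialogue (illustrative labels).
turns = ["opening", "opening", "topic development", "repair", "repair", "closing"]
print(shift_labels(turns))
```

Labels of this kind could then serve as targets for a turn-level binary classifier, for example one built on a pretrained language model that reads a turn together with its preceding context.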