Constructing SABeD: A Spoken Academic Belgian Dutch Corpus

Authors

  • Jolien Mathysen, KU Leuven, Belgium
  • Vincent Vandeghinste, Instituut voor de Nederlandse Taal, The Netherlands
  • Elke Peters, KU Leuven, Belgium
  • Patrick Wambacq, KU Leuven, Belgium

DOI

https://doi.org/10.3384/ecp210001

Abstract

We present the Spoken Academic Belgian Dutch (SABeD) corpus and describe its construction. The corpus was compiled from selected first-year bachelor lectures at higher education institutions in Flanders, because students report that the language used in such lectures is one of the hurdles to comprehension and academic success. We first applied automatic speech recognition to these lectures, followed by manual utterance segmentation and manual correction of the automatic transcriptions. A filtered version of the resulting transcriptions was automatically punctuated and linguistically annotated with CLARIN tools and is currently available for search in the Autosearch online corpus query environment. The manual transcriptions and the ELAN files with the final annotation will soon be made available to the research community for download in the CLARIN infrastructure at http://hdl.handle.net/10032/tm-a2-w4.

Published

2024-07-09