Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction

Authors

  • Aleksi Sahala
  • Tero Alstola
  • Jonathan Valk
  • Krister Lindén

DOI:

https://doi.org/10.3384/ecp198011

Keywords:

Lemmatization, Tagging, Assyriology, Akkadian, Babylonian, TurkuNLP

Abstract

We present BabyLemmatizer, a hybrid lemmatizer and POS-tagger for Akkadian, the language of the ancient Assyrians and Babylonians, documented from 2350 BCE to 100 CE. In our approach, the text is first POS-tagged and lemmatized with TurkuNLP trained on human-verified labels, and then post-corrected with dictionary-based methods to improve lemmatization quality. The post-correction also assigns confidence scores to the labels so that the most suspicious lemmatizations can be flagged for manual validation. We demonstrate that the tool achieves a Lemma+POS labeling accuracy of 94% and a lemmatization accuracy of 95% on a held-out test set. We also apply the lemmatizer to a previously unlemmatized text corpus to test it in practice.
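
The two-stage idea in the abstract (neural prediction first, dictionary-based post-correction with confidence scores second) can be illustrated with a minimal sketch. This is not the BabyLemmatizer implementation: the function names `build_dictionary` and `post_correct`, the 0 to 3 score scale, the replacement rule, and the toy data are all assumptions made for illustration, and the neural tagger's output is simply taken as given input.

```python
from collections import Counter, defaultdict


def build_dictionary(training_rows):
    """Map each word form to a Counter of (lemma, pos) labels seen in training data."""
    attested = defaultdict(Counter)
    for form, lemma, pos in training_rows:
        attested[form][(lemma, pos)] += 1
    return attested


def post_correct(predictions, attested):
    """Return (form, lemma, pos, score) rows; a lower score means more suspicious.

    Hypothetical score scale, assumed for this sketch only:
      3 = form seen in training and the prediction matches an attested label
      2 = form seen in training, prediction replaced by the most frequent label
      1 = form unseen, but the predicted lemma is attested for some other form
      0 = form and lemma both unseen, so the row should be checked manually
    """
    known_lemmas = {lemma for labels in attested.values() for lemma, _ in labels}
    corrected = []
    for form, lemma, pos in predictions:
        labels = attested.get(form)
        if labels and (lemma, pos) in labels:
            score = 3
        elif labels:
            # Override the model with the most frequent attested label.
            (lemma, pos), _count = labels.most_common(1)[0]
            score = 2
        elif lemma in known_lemmas:
            score = 1
        else:
            score = 0
        corrected.append((form, lemma, pos, score))
    return corrected


if __name__ == "__main__":
    # Toy human-verified training rows: (transliterated form, lemma, POS).
    training = [
        ("LUGAL", "šarru", "N"),
        ("LUGAL", "šarru", "N"),
        ("il-lik", "alāku", "V"),
    ]
    # Toy output of an upstream tagger; the second row is a deliberate error.
    predicted = [
        ("LUGAL", "šarru", "N"),
        ("il-lik", "alku", "N"),
        ("il-li-ku", "alāku", "V"),
        ("x-ka", "xka", "N"),
    ]
    for row in post_correct(predicted, build_dictionary(training)):
        print(row)
```

Rows with the lowest scores are the ones a human annotator would review first, which mirrors the flagging step described in the abstract.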

Published

2023-06-09