Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction
Keywords: Lemmatization, Tagging, Assyriology, Akkadian, Babylonian, TurkuNLP
Abstract: We present BabyLemmatizer, a hybrid lemmatizer and POS-tagger for Akkadian, the language of the ancient Assyrians and Babylonians, documented from 2350 BCE to 100 CE. In our approach, the text is first POS-tagged and lemmatized with TurkuNLP trained on human-verified labels, and then post-corrected with dictionary-based methods to improve lemmatization quality. The post-correction step also assigns confidence scores to the labels, flagging the most suspicious lemmatizations for manual validation. We demonstrate that the tool achieves a Lemma+POS labeling accuracy of 94% and a lemmatization accuracy of 95% on a held-out test set. We also apply the lemmatizer to a previously unlemmatized text corpus to test it in practice.
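The two-stage idea described in the abstract (neural predictions first, dictionary-based post-correction with confidence labels second) can be illustrated with a minimal Python sketch. This is not BabyLemmatizer's actual implementation: the function names `build_lexicon` and `post_correct`, the three confidence levels, and the correction rules are all assumptions made for illustration, and the neural stage is represented only by a list of already-predicted (form, lemma, POS) triples.

```python
from collections import Counter, defaultdict


def build_lexicon(training_rows):
    """Map (form, pos) -> Counter of lemmas seen in human-verified data.

    Hypothetical helper; the real tool's dictionary format may differ.
    """
    lexicon = defaultdict(Counter)
    for form, lemma, pos in training_rows:
        lexicon[(form, pos)][lemma] += 1
    return lexicon


def post_correct(predictions, lexicon):
    """Return (form, lemma, pos, confidence) tuples.

    Assumed rules, for illustration only:
    - form+POS never seen in training: keep the prediction, mark "low"
    - form+POS seen with exactly one lemma: use that lemma, mark "high"
    - form+POS seen with several lemmas: keep the prediction if it is one
      of them, otherwise use the most frequent one; mark "medium"
    """
    corrected = []
    for form, lemma, pos in predictions:
        seen = lexicon.get((form, pos))
        if not seen:
            corrected.append((form, lemma, pos, "low"))
        elif len(seen) == 1:
            dict_lemma = next(iter(seen))
            corrected.append((form, dict_lemma, pos, "high"))
        elif lemma in seen:
            corrected.append((form, lemma, pos, "medium"))
        else:
            best = seen.most_common(1)[0][0]
            corrected.append((form, best, pos, "medium"))
    return corrected


if __name__ == "__main__":
    # Toy data, not taken from the paper's corpus.
    training = [("šar-ru", "šarru", "N"), ("a-na", "ana", "PRP")]
    predicted = [("šar-ru", "šaru", "N"), ("i-na", "ina", "PRP")]
    for row in post_correct(predicted, build_lexicon(training)):
        print(row)
```

In this sketch, rows marked "low" are the ones a human annotator would review first, which mirrors the abstract's goal of flagging suspicious lemmatizations for manual validation.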
Copyright (c) 2023 Aleksi Sahala, Tero Alstola, Jonathan Valk, Krister Lindén
This work is licensed under a Creative Commons Attribution 4.0 International License.