Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction
DOI:
https://doi.org/10.3384/ecp198011
Keywords:
Lemmatization, Tagging, Assyriology, Akkadian, Babylonian, TurkuNLP
Abstract
We present BabyLemmatizer, a hybrid lemmatizer and POS-tagger for Akkadian, the language of the ancient Assyrians and Babylonians, documented from 2350 BCE to 100 CE. In our approach, the text is first POS-tagged and lemmatized with TurkuNLP trained on human-verified labels, and then post-corrected with dictionary-based methods to improve the lemmatization quality. The post-correction also assigns confidence scores to the labels so that the most suspicious lemmatizations can be flagged for manual validation. We demonstrate that the presented tool achieves a Lemma+POS labeling accuracy of 94% and a lemmatization accuracy of 95% on a held-out test set. We also apply the lemmatizer to a previously unlemmatized text corpus to test it in practice.
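The two-stage idea in the abstract (a neural prediction followed by dictionary-based post-correction with confidence labels) can be illustrated with a minimal Python sketch. This is not BabyLemmatizer's actual code or API: the function names `build_lexicon` and `post_correct`, the three confidence labels, the correction rule, and the toy word forms are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_lexicon(verified_rows):
    """Count, for each (form, POS) pair, the lemmas seen in human-verified data."""
    lexicon = defaultdict(Counter)
    for form, pos, lemma in verified_rows:
        lexicon[(form, pos)][lemma] += 1
    return lexicon


def post_correct(form, pos, predicted_lemma, lexicon):
    """Return (lemma, confidence) for one tagger prediction.

    Hypothetical rule set, assumed for this sketch:
    - prediction attested in the lexicon       -> keep it, "high"
    - form attested with exactly one lemma     -> replace prediction, "medium"
    - form unseen, or ambiguous and unmatched  -> keep prediction, "low"
    """
    seen = lexicon.get((form, pos))
    if not seen:
        return predicted_lemma, "low"
    if predicted_lemma in seen:
        return predicted_lemma, "high"
    if len(seen) == 1:
        return next(iter(seen)), "medium"
    return predicted_lemma, "low"


# Toy verified data (illustrative forms only, not taken from the paper).
verified = [
    ("šar", "N", "šarru"),
    ("šar", "N", "šarru"),
    ("illik", "V", "alāku"),
]
lexicon = build_lexicon(verified)

confirmed = post_correct("šar", "N", "šarru", lexicon)
corrected = post_correct("illik", "V", "alku", lexicon)
unseen = post_correct("bēlī", "N", "bēlu", lexicon)

# Predictions labeled "low" would be the ones queued for manual validation.
for result in (confirmed, corrected, unseen):
    print(result)
```

In this sketch, only the entries labeled "low" would need a human check, which mirrors the abstract's goal of flagging the most suspicious lemmatizations rather than reviewing every token.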
Published
2023-06-09
Section
Contents
License
Copyright (c) 2023 Aleksi Sahala, Tero Alstola, Jonathan Valk, Krister Lindén
This work is licensed under a Creative Commons Attribution 4.0 International License.