A Supervised Machine Learning Approach for Post-OCR Error Detection for Historical Text

Authors

  • Dana Dannélls
  • Shafqat Virk

DOI:

https://doi.org/10.3384/ecp184170

Keywords:

historical text, optical character recognition (OCR), Swedish, supervised machine learning

Abstract

Training machine learning models with high accuracy requires careful feature engineering, which involves finding the best feature combinations and extracting their values from the data. This task becomes extremely laborious for specific problems such as post-OCR (Optical Character Recognition) error detection because of the diversity of errors in the data. In this paper, we present a machine learning approach that exploits character n-gram statistics as the only feature for the OCR error detection task. Our method achieves a significant improvement over the baseline, reaching state-of-the-art results of 91% and 89% F1 score on English and Swedish datasets, respectively. We report various experiments to select the appropriate machine learning algorithm and to compare our approach with previously reported traditional approaches.
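To make the idea of using character n-gram statistics as the sole feature more concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the n-gram order, the exact statistic, or the classifier, so everything here is an assumption: character trigrams, an add-one-smoothed mean log-probability as the single feature, and a simple threshold learned from labeled tokens standing in for the supervised model. All function names are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(token, n=3):
    """Return the character n-grams of a token, padded with boundary markers."""
    padded = f"^{token.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_counts(clean_tokens, n=3):
    """Count character n-grams over a list of tokens assumed to be error-free."""
    counts = Counter()
    for tok in clean_tokens:
        counts.update(char_ngrams(tok, n))
    return counts


def ngram_score(token, counts, n=3):
    """Single feature: mean add-one-smoothed log-probability of the token's n-grams.

    Assumption: this particular statistic is chosen for illustration only.
    """
    grams = char_ngrams(token, n)
    if not grams:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen n-grams
    log_probs = [math.log((counts[g] + 1) / (total + vocab)) for g in grams]
    return sum(log_probs) / len(log_probs)


def fit_threshold(labeled, counts, n=3):
    """Pick the score threshold that best separates labeled (token, is_error) pairs.

    A token is predicted to be an OCR error when its score falls below the threshold.
    This stands in for the supervised classifier, which the abstract does not name.
    """
    scores = [(ngram_score(tok, counts, n), is_err) for tok, is_err in labeled]
    values = sorted({s for s, _ in scores})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def accuracy(thr):
        return sum((s < thr) == is_err for s, is_err in scores) / len(scores)

    return max(candidates, key=accuracy)


def is_error(token, counts, threshold, n=3):
    """Flag a token as a likely OCR error if its n-gram score is below the threshold."""
    return ngram_score(token, counts, n) < threshold
```

In this sketch, a token such as "tlie" (a plausible OCR misreading of "the") consists of trigrams that are rare or absent in clean text, so its score is lower than that of well-formed words. A real system would estimate the statistics from a large corpus and would likely replace the threshold with a trained classifier.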

Published

2021-08-12