A Two-OCR Engine Method for Digitized Swedish Newspapers
Keywords: Optical Character Recognition (OCR), Natural Language Processing (NLP), Historical material, Newspapers, Swedish
In this paper we present a two-OCR engine method developed at Kungliga biblioteket (KB), the National Library of Sweden, to improve the correctness of the OCR in the mass digitization of Swedish newspapers. To evaluate the method, a reference material spanning the years 1818–2018 was prepared and manually transcribed, and a quantitative evaluation was then performed against it. In this first evaluation we experimented with word lists for different time periods. The results show that although there was no significant overall improvement in the OCR results, some combinations of word lists were successful for certain periods and should therefore be explored further.
Copyright (c) 2021 Dana Dannélls, Lars Björk, Torsten Johansson and Ove Dirdal
This work is licensed under a Creative Commons Attribution 4.0 International License.