A Two-OCR Engine Method for Digitized Swedish Newspapers

Dana Dannélls; Lars Björk; Ove Dirdal; Torsten Johansson

doi:10.3384/ecp1808

Authors

Dana Dannélls
Lars Björk
Ove Dirdal
Torsten Johansson

DOI:

https://doi.org/10.3384/ecp1808

Keywords:

Optical Character Recognition (OCR), Natural Language Processing (NLP), Historical material, Newspapers, Swedish

Abstract

In this paper we present a two-OCR engine method that was developed at Kungliga biblioteket (KB), the National Library of Sweden, to improve OCR accuracy in the mass digitization of Swedish newspapers. To evaluate the method, a reference material spanning the years 1818–2018 was prepared and manually transcribed, and a quantitative evaluation was then performed against this material. In this first evaluation we experimented with word lists for different time periods. The results show that although there was no significant overall improvement of the OCR results, some combinations of word lists were successful for certain periods and should therefore be explored further.
