Evaluation and refinement of an enhanced OCR-process for mass digitisation
Great expectations are placed on the capacity of heritage institutions to make their collections available in digital format. Data-driven research is becoming a key concept within the humanities and social sciences. The collections of digitised newspapers held by Kungliga biblioteket (KB, National Library of Sweden) can thus be regarded as unique cultural data sets with information that is rarely conveyed in other media types. The digital format makes it possible to explore these resources in ways not feasible in print. As texts are no longer only read but also subjected to computer-based analysis, the demands on the reliability of the machine-readable text increase. Technologies for converting images to machine-readable text, known as optical character recognition (OCR), play a fundamental part in making these resources available, but their effectiveness varies with the type of document being processed. This is evident in the digitisation of newspapers, where factors relating to their production, layout and paper quality often impair the OCR output. To improve the machine-readable text, especially for digitised newspapers, KB initiated the development of an OCR-module in which key parameters can be adjusted according to the characteristics of the material being processed. The purpose of this project application is to carry out a formal evaluation of this OCR-module and to improve it through systematic text analyses, dictionaries and word lists, with the aim of implementing it in the mass digitisation process.
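As an illustration of how a word list might support such an evaluation, the following minimal sketch scores OCR output by the share of its tokens found in a reference word list. It is not the project's method: the function names, the tiny word list and the two sample strings are all hypothetical, and a real evaluation would use full dictionaries and, where available, manually corrected reference text.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens.

    Digits and punctuation act as separators, so an OCR error such as
    "Kungl1ga" is split into the fragments "kungl" and "ga".
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def dictionary_hit_rate(ocr_text, word_list):
    """Return the fraction of tokens in ocr_text that occur in word_list."""
    tokens = tokenize(ocr_text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in word_list)
    return known / len(tokens)


# Hypothetical word list and two hypothetical OCR outputs of the same line,
# standing in for the same page processed with different parameter settings.
word_list = {"kungliga", "biblioteket", "i", "stockholm"}
output_a = "Kungliga biblioteket i Stockholm"
output_b = "Kungl1ga bibliotcket i Stockholm"

rate_a = dictionary_hit_rate(output_a, word_list)
rate_b = dictionary_hit_rate(output_b, word_list)
print(f"setting A: {rate_a:.2f}, setting B: {rate_b:.2f}")
```

A higher hit rate suggests, but does not prove, better recognition: proper names and historical spellings missing from the word list lower the score even when the OCR is correct, which is one reason such a measure would need to be combined with other text analyses.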