SwedPop - improving a national data infrastructure through automated transcription and record linkage, a stable historical identifier, and enhanced dissemination
The proposal seeks to further expand the breadth and accessibility of Swedish historical population data, an essential national data infrastructure asset. First, the project will transcribe and disseminate all individual birth records for the period 1800-1899 using machine-learning methods for handwritten text recognition (HTR). Second, together with already available individual-level data on mortality and emigration, the birth records will be used to design and implement a method for generating a unique and stable personal identifier. This will greatly facilitate future historical research by offering a straightforward way to combine different datasets, and it will also strengthen the scientific integrity of future research on historical individual-level data. Third, the project will develop and disseminate a pipeline for supervised probabilistic record linkage of large-scale historical individual-level datasets using state-of-the-art machine-learning methods. This will not only improve the quality of existing data but also greatly facilitate and standardize future linking efforts. Fourth, a sustainable long-term model for the storage and dissemination of functionalities and data will be developed. Tested and standardized protocols for data security and storage, along with innovative web-based interfaces for data extraction, will ensure convenient and uninterrupted long-term access to the data.