Abstract
Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Personal names are perhaps the single most important identifier for linking. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of error are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 3.3 million names. The database contains more than 105 thousand unique names with a total of more than 1.1 million images of personal names, which proves useful for transfer learning to other settings. We provide three examples of this, obtaining significantly improved transcription accuracy on both Danish and US census data. In addition, we present benchmark results for deep learning models that automatically transcribe the personal names from the scanned documents. By making more challenging large-scale databases publicly available, we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition.
| Original language | English |
|---|---|
| Article number | 101473 |
| Journal | Explorations in Economic History |
| Volume | 87 |
| Issue number | January |
| Number of pages | 12 |
| ISSN | 0014-4983 |
| Publication status | Published - Jan 2023 |
Bibliographical note
Publisher Copyright: © 2022
Funding
We are grateful to the BYU Record Linking Lab for providing the US census data and to the Copenhagen Archives, who have supplied large amounts of scanned source material. The authors also gratefully acknowledge valuable comments from Philipp Ager, Anthony Wray, Paul Sharp, the editor and an anonymous referee. Torben gratefully acknowledges financial support from the Independent Research Fund Denmark, grant 8106-00003B. Emil gratefully acknowledges financial support from the European Research Council (Starting Grant Reference 851725). The HANA database is available at https://www.kaggle.com/sdusimonwittrock/hana-database.
Related research output
- Essays in Economics and Data Science
  Wittrock, S., 20 Jan 2023, Syddansk Universitet. Det Samfundsvidenskabelige Fakultet. 197 p. Research output: Thesis › Ph.D. thesis
- Machine Learning with Applications in Economics
  Johansen, T., 21 Dec 2023, Syddansk Universitet. Det Samfundsvidenskabelige Fakultet. 251 p. Research output: Thesis › Ph.D. thesis (Open Access)
Related datasets
- HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition
  Wittrock, S. F. (Creator) & Johansen, T. (Creator), Department of Economics SDU, 2023
  https://www.kaggle.com/datasets/sdusimonwittrock/hana-database, https://github.com/TorbenSDJohansen/HANA
  Dataset