Abstract
Understanding contemporary policy issues, ranging from intergenerational social mobility to the long-run impacts of social policies, requires recent as well as historical data. Until recently, however, this posed enormous challenges, in large part because the work required to exploit historical, often handwritten, data at scale was prohibitive. This has severely limited what could be studied: while digitization of individual historical sources, such as census records, has provided important insights into a range of scientific questions, large-scale digitization of historical data has been too slow and expensive to reap the full benefits. Recent developments within artificial intelligence (AI) and machine learning (ML), however, have changed the landscape completely.
This thesis has two primary goals. First, it develops new and exploits current advances in ML to facilitate the use of historical, handwritten documents at scales that were infeasible before the current “spring” within AI. Second, it attempts to answer questions related to the role of early-life circumstances and their long-run and multi-generational impacts, going back an additional generation compared to what would have been possible without the use of ML for digitization of historical data. Acknowledging the important role of human capital investments for later-life success, this research is motivated by models of early-life skill formation (Heckman, 2006; Attanasio, 2015) and adds insights into the relative role of inputs to child health and human capital production. Thus, the thesis broadly consists of two overarching themes, or meta-chapters, which interact: the methodological contributions make the empirical studies feasible, and the empirical studies in turn provide feedback for further improvements to the ML methods developed. Each of the two themes consists of three papers with strong ties to each other, where the developments of the earlier papers make possible the contributions of the later ones.
The first three chapters of my thesis are dedicated to advancing the state of the art (SOTA) within ML for document digitization and handwritten text recognition (HTR), and are the product of work by the “Big Data Analytics and Digitization” (BDAD) group at the University of Southern Denmark (SDU). Chapter 1 of my thesis, “Applications of machine learning in tabular document digitisation” (Dahl et al., 2023a), provides the foundation for this work: We present a general-purpose, modular, end-to-end “pipeline” for transforming raw scans of historical tabular data (e.g., census lists or death certificates) into data ready for statistical analysis. We showcase the effectiveness of our pipeline through two applications, demonstrating its usability in practice. A key feature of our contribution is its modular structure, which allows us to swap in newer and better “modules” in later projects while retaining the general framework we propose.
Chapter 2 of the thesis, “HANA: A HAndwritten NAme database for offline handwritten text recognition” (Dahl et al., 2023c), is dedicated to the task of HTR of names, a particularly important and difficult problem within historical data digitization. Names are often the single most important piece of information to extract when using individual-level data, as they play the most prominent role when linking different data sources together. At the same time, names are challenging to transcribe: the pool of names is large and many names are very similar, so that even one-character errors can result in erroneous linkages. To tackle this issue, we introduce the (to the best of our knowledge) largest publicly available database of labelled images of handwritten names, consisting of more than 3.3 million names with more than 105 thousand unique names. Using this database, we train neural networks for HTR that achieve very high rates of transcription accuracy (up to 95.7% on the database’s test split), and we show that combining the database with transfer learning yields much higher transcription accuracy on other datasets as well, which we illustrate using Danish and US census records. We open source all our data and code.
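To make concrete why one-character errors matter for linkage, the following minimal sketch computes the edit distance between names and the share of exactly correct transcriptions. The example names are hypothetical and not taken from the HANA database, and whether exact-match accuracy corresponds to the metric reported in the chapter is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Share of names transcribed exactly right (no character errors)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    if not labels:
        raise ValueError("need at least one labelled name")
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Hypothetical example names, for illustration only.
labels = ["hansen", "jensen", "nielsen"]
predictions = ["hansen", "jansen", "nielsen"]

accuracy = word_accuracy(predictions, labels)
# The mistranscribed "jansen" is one edit away from two different true names,
# so a linkage step cannot tell which person the record belongs to.
distance_to_jensen = edit_distance("jansen", "jensen")
distance_to_hansen = edit_distance("jansen", "hansen")
```

In this toy case a single wrong character leaves the transcription equally close to two distinct surnames, which is the kind of ambiguity that can produce an erroneous link.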
Chapter 3 of the thesis, “DARE: A large-scale handwritten DAte REcognition system” (Dahl et al., 2023b), follows closely in the footsteps of Chapter 2, but with our focus now on HTR for dates. For this purpose, we introduce the (to the best of our knowledge) largest publicly available database of labelled images of dates, consisting of nearly 10 million tokens spread across 2.2 million different images of dates, sampled from a wide variety of historical documents. A key distinction from Chapter 2 is that we also aim to contribute to the SOTA within neural networks for HTR, proposing a highly competitive architecture based on the EfficientNetV2 architecture of Tan and Le (2021). Accurate transcriptions of dates are, as is the case for names, a key element of historical data digitization, be it for the purpose of obtaining birth and/or death dates of individuals (which is important for historical record linkage) or dating nurse home visits for infants. We demonstrate that our system achieves high transcription accuracy on the database’s test split and demonstrate its use as a foundation model when transfer learning to new collections of historical records. We open source our data and will open source our code in the future.
The next three chapters of my thesis are devoted to studying the role of early-life circumstances, particularly early-life investment policies, for long-run and intergenerational outcomes. They are all based on a large database of records on nurse home visiting (NHV) in 1960s Copenhagen, which we construct using ML methods developed as part of the first three chapters of my thesis; as was the case for the first three chapters, these three are closely tied together, and each contributes to the feasibility of the next.
Chapter 4 of my thesis, “Cohort Profile: The Copenhagen Infant Health Nurse Records (CIHNR) cohort” (Bjerregaard et al., 2023), is devoted to establishing the key database which allows us to study long-run impacts of social policies at the most granular level. Here, we create a database combining detailed childhood data, obtained through transcriptions of handwritten nurse records from the 1960s Copenhagen NHV program and covering all children during their first year of life (and, for a subset, their first three years), with long-run outcomes from Danish register data, thus following children from birth until they are around 60. This is made possible by the ML contributions of the thesis’ first three chapters, which we use to transcribe over 10 million fields of information from the collection of nurse records and then link to Danish administrative register data. The paper provides a comprehensive description of the data we make available, and the steering group at the Center for Clinical Research and Prevention, Bispebjerg and Frederiksberg Hospital, Denmark, welcomes collaboration with national and international colleagues. For confidentiality reasons, however, this database cannot be made publicly available.
Chapter 5 of the thesis, “Universal Investments in Toddler Health. Learning from a Large Government Trial” (Baker et al., 2023), studies the impact of an extended NHV experiment for children born between 1959 and 1967. Exploiting a 1960s government trial that assigned everyone born during the first three days of any month to three (instead of one) years of NHV, we investigate long-run and intergenerational impacts of extended NHV. We document positive long-run health benefits, which are larger for initially disadvantaged children, along with some evidence of impacts extending to the offspring of our focal individuals. Documenting that the initiative was highly cost-effective, especially so for initially disadvantaged children, our results suggest that these types of early-life investments can alleviate inequalities at low cost.
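The assignment rule of the trial described above can be written as a one-line predicate. The sketch below is only an illustration of that rule: the function name, the `cutoff_day` parameter, and the birth dates are hypothetical and not taken from the paper or its data.

```python
from datetime import date


def assigned_extended_nhv(birth_date: date, cutoff_day: int = 3) -> bool:
    """Return True if the child was born on one of the first `cutoff_day`
    days of a month, the group assigned to three years of NHV."""
    return birth_date.day <= cutoff_day


# Hypothetical birth dates, for illustration only.
births = [date(1962, 5, 1), date(1962, 5, 3), date(1962, 5, 4), date(1964, 11, 28)]
treated = [assigned_extended_nhv(b) for b in births]
```

Because the rule depends only on the day of the month, treatment status can be reconstructed from a transcribed birth date alone, which is one reason accurate date transcription matters for this chapter.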
Chapter 6 of the thesis, “Optimal Treatment Allocation under Constraints” (Johansen, 2023a), examines the complex role of treatment providers for policy impacts by comparing nurses of the 1960s Copenhagen NHV program against each other. The paper’s main contributions are (1) developing a strongly polynomial algorithm for solving optimal treatment allocation problems in which treatment effects vary at the individual level and treatments are subject to capacity constraints and (2) showcasing the method in a setting where each nurse is considered a “treatment”, thus showing that significant efficiency improvements are possible by carefully allocating nurses to children to optimize the total impact of the NHV program. By doing so, the chapter adds novel evidence on the importance of treatment providers for the success of early-life investment policies.
While each chapter of the thesis is self-contained, the chapters are nevertheless closely connected, and they showcase how to exploit developments within AI to obtain data of importance to applied, empirical studies. Though not a direct part of the thesis, this work has motivated the development of a Python library for HTR (Johansen, 2023b), which has proven fundamental to four of the thesis’ chapters (3-6), has been and is used in other research projects, and is actively used for the Link-Lives project that aims to “[...] reconstruct life-courses and multigenerational family relations for (nearly) all Danes, from 1787–1968.” [1]
The rest of the thesis contains each chapter as it appears in the most recent version of the associated paper (be it published or as a working paper); to avoid ambiguity, two page numbers are included for all the remaining pages after the Danish summary, one a running tally for the thesis (bottom left) and the other the page number as it appears in the specific paper. Each chapter includes its own list of references, as well as any supplementary material (either directly or linked to), and figure and table counts reset between chapters.
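As an illustration of the allocation problem studied in Chapter 6, the sketch below assigns children to nurses so as to maximize the total effect while respecting each nurse's capacity. It uses plain exhaustive search, which is exponential in the number of children and is not the strongly polynomial algorithm developed in the chapter; the effect matrix, the capacities, and the function name are hypothetical.

```python
from itertools import product


def best_allocation(effects: list[list[float]], capacity: list[int]):
    """Exhaustively search for the child-to-nurse assignment with the
    largest total effect, respecting each nurse's capacity.

    effects[i][k] is the (hypothetical) effect of nurse k on child i.
    Brute force is only meant for tiny examples.
    """
    n_children, n_nurses = len(effects), len(capacity)
    best_total, best_assignment = float("-inf"), None
    for assignment in product(range(n_nurses), repeat=n_children):
        # Skip assignments that give any nurse more children than allowed.
        if any(assignment.count(k) > capacity[k] for k in range(n_nurses)):
            continue
        total = sum(effects[i][k] for i, k in enumerate(assignment))
        if total > best_total:
            best_total, best_assignment = total, assignment
    return best_assignment, best_total


# Hypothetical effects for 4 children (rows) and 2 nurses (columns).
effects = [
    [3.0, 1.0],
    [2.0, 2.5],
    [4.0, 1.0],
    [1.0, 0.5],
]
capacity = [2, 2]  # each nurse can serve at most two children
assignment, total = best_allocation(effects, capacity)
```

In this toy example three of the four children would do best with the first nurse, so the capacity constraint binds and the best feasible allocation gives up some individual gains to maximize the total.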
[1] See https://link-lives.dk/en/link-lives-a-research-project/.
Attanasio, Orazio P. (Dec. 2015). “The Determinants of Human Capital Formation During the Early Years of Life: Theory, Measurement, and Policies”. In: Journal of the European Economic Association 13.6, pp. 949–997. issn: 1542-4766. doi: 10.1111/jeea.12159. url: https://doi.org/10.1111/jeea.12159.
Baker, Jennifer, Lise Bjerregaard, Christian M Dahl, Torben Johansen, Emil Sørensen, and Miriam Wüst (2023). “Universal Investments in Toddler Health. Learning from a Large Government Trial”. In: IZA Discussion Paper No. 16270.
Bjerregaard, Lise G, Miriam Wüst, Torben SD Johansen, Thorkild IA Sørensen, Christian M Dahl, and Jennifer L Baker (2023). “Cohort Profile: The Copenhagen Infant Health Nurse Records (CIHNR) cohort”. In: International Journal of Epidemiology, dyad096.
Dahl, Christian M., Torben S. D. Johansen, Emil N. Sørensen, Christian E. Westermann, and Simon F. Wittrock (2023a). “Applications of machine learning in tabular document digitisation”. In: Historical Methods: A Journal of Quantitative and Interdisciplinary History 56.1, pp. 34–48. doi: 10.1080/01615440.2023.2164879.
— (2023b). DARE: A large-scale handwritten DAte Recognition system. Working paper. Earlier (2022) version available at arXiv preprint arXiv:2210.00503.
Dahl, Christian M., Torben S. D. Johansen, Emil N. Sørensen, and Simon F. Wittrock (2023c). “HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition”. In: Explorations in Economic History 87. Methodological Advances in the Extraction and Analysis of Historical Data, p. 101473. issn: 0014-4983. doi: https://doi.org/10.1016/j.eeh.2022.101473. url: https://www.sciencedirect.com/science/article/pii/S0014498322000511.
Heckman, James J. (2006). “Skill formation and the economics of investing in disadvantaged children”. In: Science 312.5782, pp. 1900–1902.
Johansen, Torben Skov Dyg (2023a). Optimal Treatment Allocation under Constraints. Working paper.
— (2023b). timm-sequence-net. https://github.com/TorbenSDJohansen/timm-sequencenet. Python library to be made open-source.
Tan, Mingxing and Quoc Le (2021). “EfficientNetV2: Smaller Models and Faster Training”. In: Proceedings of Machine Learning Research 139. Ed. by Marina Meila and Tong Zhang, pp. 10096–10106. url: https://proceedings.mlr.press/v139/tan21a.html.
| Original language | English |
|---|---|
| Date of defence | 10. Jan 2024 |
| Publication status | Published - 21. Dec 2023 |
Note re. dissertation
A print copy of the thesis can be accessed at the Library.
Keywords
- Economics
- Machine Learning
- Econometrics
- Artificial Intelligence
Related research outputs for 'Machine Learning with Applications in Economics'
- Optimal Treatment Allocation under Constraints. Johansen, T., 28. Apr 2024 (in preparation). Other contribution.
- Applications of machine learning in tabular document digitisation. Dahl, C. M., Johansen, T. S. D., Sørensen, E. N., Westermann, C. E. & Wittrock, S., 2023, In: Historical Methods. 56, 1, p. 34-48. Journal article, peer-reviewed.
- Cohort profile: the Copenhagen infant health nurse records (CIHNR) cohort. Bjerregaard, L. G., Wüst, M., Johansen, T. S. D., Sørensen, T. I. A., Dahl, C. M. & Baker, J. L., Dec 2023, In: International Journal of Epidemiology. 52, 6, p. e340-e346, dyad096. Journal article, peer-reviewed, open access.