Abstract
Historical U.S. censuses have been an important data source for economics, particularly because they allow researchers to track individuals’ life outcomes over long periods of time. However, linking individuals across multiple census rounds is challenging, often because of errors in name transcription. In this paper, we improve the name transcription in historical U.S. censuses using a machine-learning model. Our approach results in a significant increase in the likelihood of linking individuals across censuses. We also find that our model performs especially well when human transcribers struggle, i.e., when the legibility of names on the original census form is low. The increased linkage rate is observed across nearly all socio-demographic subgroups, including those that are typically difficult to link.
| Field | Value |
|---|---|
| Original language | English |
| Publication status | Published, 24 Aug 2024 |