Document Digitization and Machine Learning

Christian M. Dahl, Emil Nørmark Sørensen

Publication: Working paper, Research


Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that large and detailed usually implies costly and difficult, especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. Researchers spend countless hours sorting, organizing and manually transcribing paper documents. This quickly becomes infeasible when the data requirements grow and the corresponding number of documents reaches an insurmountable level. Instead of manual transcription, we advocate the use of modern machine learning techniques to automate the digitization process. We give an overview of the potential for applying machine digitization to data collection, we show that it performs on par with or better than existing methods for tabular sequence transcription at a fraction of the cost, and finally we discuss the steps in applying machine learning methods to a few cases of actual documents: US and UK mortality data and Danish death certificates. We also briefly comment on the prospects of active learning and present an example of a recently developed digitization application.
Status: Published - 2019

