Abstract

Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that large and detailed usually implies costly and difficult, especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. Researchers spend vast hours on sorting, organizing and manually transcribing paper documents. This quickly becomes infeasible when the data requirements grow and the corresponding amount of documents reaches an insurmountable level. Instead of manual transcription we advocate the use of modern machine learning techniques to automate the digitization process. We give an overview of the potential for applying machine digitization for data collection, we show that it performs on par with or better than existing methods for tabular sequence transcription at a fraction of the cost, and finally we discuss the steps in applying machine learning methods on a few cases of actual documents: US and UK mortality data and Danish death certificates. We also briefly comment on the prospects of active learning and present an example of a recently developed digitization application.
Original language: English
Publication status: Published - 2019

Fingerprint

Analog to digital conversion
Transcription
Learning systems
Sorting
Data acquisition
Availability
Costs

Cite this

@techreport{6e7ae8179f7844fc8acbaeda3d6097b1,
title = "Document Digitization and Machine Learning",
abstract = "Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that large and detailed usually implies costly and difficult, especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. Researchers spend vast hours on sorting, organizing and manually transcribing paper documents. This quickly becomes infeasible when the data requirements grow and the corresponding amount of documents reaches an insurmountable level. Instead of manual transcription we advocate the use of modern machine learning techniques to automate the digitization process. We give an overview of the potential for applying machine digitization for data collection, we show that it performs on par with or better than existing methods for tabular sequence transcription at a fraction of the cost, and finally we discuss the steps in applying machine learning methods on a few cases of actual documents: US and UK mortality data and Danish death certificates. We also briefly comment on the prospects of active learning and present an example of a recently developed digitization application.",
author = "Dahl, {Christian M.} and S{\o}rensen, {Emil N{\o}rmark}",
year = "2019",
language = "English",
type = "WorkingPaper",

}

TY - UNPB

T1 - Document Digitization and Machine Learning

AU - Dahl, Christian M.

AU - Sørensen, Emil Nørmark

PY - 2019

Y1 - 2019

AB - Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that large and detailed usually implies costly and difficult, especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. Researchers spend vast hours on sorting, organizing and manually transcribing paper documents. This quickly becomes infeasible when the data requirements grow and the corresponding amount of documents reaches an insurmountable level. Instead of manual transcription we advocate the use of modern machine learning techniques to automate the digitization process. We give an overview of the potential for applying machine digitization for data collection, we show that it performs on par with or better than existing methods for tabular sequence transcription at a fraction of the cost, and finally we discuss the steps in applying machine learning methods on a few cases of actual documents: US and UK mortality data and Danish death certificates. We also briefly comment on the prospects of active learning and present an example of a recently developed digitization application.

M3 - Working paper

BT - Document Digitization and Machine Learning

ER -