Man or machine: Evaluating Spelling Error Detection in Danish Newspaper Corpora

Eckhard Bick, Jonas Nygaard Blom, Marianne Rathje, Jørgen Schack

Publication: Chapter in book/report/conference proceeding › Conference contribution in proceedings › Research › Peer-reviewed

Abstract

This paper evaluates frequency and detection performance for both spelling and grammatical errors in a corpus of published Danish newspaper texts, comparing the results of three human proofreaders with those of an automatic system, DanProof. Adopting the error categorization scheme of the latter, we look at the accuracy of individual error types and their relative distribution over time, as well as the adequacy of suggested corrections. Finally, we discuss so-called artefact errors introduced by corpus processing, and the potential of DanProof as a corpus cleaning tool for identifying and correcting format conversion, OCR or other compilation errors. In the evaluation, with balanced F1-scores of 77.6 and 67.6 for the 1999 and 2019 texts, respectively, DanProof achieved higher recall and accuracy than the individual human annotators, and contributed the largest share of errors not detected by others (16.4% for 1999 and 23.6% for 2019). However, the human annotators had a significantly higher precision. Not counting artefacts, the overall error frequency in the corpus was low (~0.5%), and in the newer texts it was less than half that of the older ones, a change that mostly concerned orthographical errors, with a correspondingly higher relative share of grammatical errors.
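For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how precision, recall and the balanced F1-score are conventionally computed from detection counts. The function name and the example counts are hypothetical and are not taken from the paper; the paper's own evaluation procedure may differ in detail.

```python
from typing import Tuple


def precision_recall_f1(
    true_positives: int, false_positives: int, false_negatives: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) as fractions between 0 and 1.

    true_positives:  real errors that the proofreader flagged
    false_positives: flagged tokens that were not actually errors
    false_negatives: real errors that the proofreader missed
    """
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    # Balanced F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not data from the paper).
p, r, f1 = precision_recall_f1(true_positives=60, false_positives=40, false_negatives=20)
print(f"precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
```

Multiplying the fractions by 100 gives scores on the same percentage scale as those quoted in the abstract. The sketch also shows why a system can lead on recall while trailing on precision: the two quantities have different denominators.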

Original language: English
Title: 3rd Annual Meeting of the ELRA-ISCA Special Interest Group on Under-Resourced Languages, SIGUL 2024 at LREC-COLING 2024 - Workshop Proceedings
Editors: Maite Melero, Sakriani Sakti, Claudia Soria
Number of pages: 8
Place of publication: Torino
Publisher: European Language Resources Association (ELRA)
Publication date: 2024
Pages: 204-211
ISBN (Electronic): 9782493814296
Status: Published - 2024
Event: 3rd Annual Meeting of the ELRA-ISCA Special Interest Group on Under-Resourced Languages, SIGUL 2024 - Turin, Italy
Duration: 21 May 2024 to 22 May 2024

Conference

Conference: 3rd Annual Meeting of the ELRA-ISCA Special Interest Group on Under-Resourced Languages, SIGUL 2024
Country/Territory: Italy
City: Turin
Period: 21/05/2024 to 22/05/2024

Bibliographical note

Publisher Copyright:
© 2024 ELRA Language Resource Association.

  • LREC-COLING

    Bick, E. (Participant)

    20 May 2024 to 22 May 2024

    Activity: Participation in academic event › Organisation of or participation in conference
