Syntheval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data

Anton D. Lautrup*, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

With the growing demand for synthetic data to address contemporary issues in machine learning, such as data scarcity, data fairness, and data privacy, robust tools for assessing the utility and potential privacy risks of such data have become crucial. SynthEval, a novel open-source evaluation framework, distinguishes itself from existing tools by treating categorical and numerical attributes with equal care, without assuming any particular preprocessing steps. This makes it applicable to virtually any synthetic dataset of tabular records. Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity. SynthEval integrates a wide selection of metrics that can be used independently or in highly customisable benchmark configurations, and it can easily be extended with additional metrics. In this paper, we describe SynthEval and illustrate its versatility with examples. The framework facilitates better benchmarking and more consistent comparisons of model capabilities.

Original language: English
Article number: 6
Journal: Data Mining and Knowledge Discovery
Volume: 39
Issue number: 1
Number of pages: 25
ISSN: 1384-5810
Publication status: Published - Jan 2025

Keywords

  • Benchmark
  • Evaluation framework
  • Synthetic data
  • Tabular data
