Quotation Platzer, Michael, Reutterer, Thomas. 2021. Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data. Frontiers in Big Data. 4 1-12.




AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Measuring fidelity is based on statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available open-source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
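The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' published implementation: the function names are hypothetical, the privacy check assumes purely numeric records compared by Euclidean distance, and the fidelity check assumes total variation distance on a single categorical column as one possible statistical distance. A share close to 0.5 would suggest that synthetic records are no closer to the training data than to the holdout data.

```python
import numpy as np

def nearest_distance(queries, reference):
    """Euclidean distance from each query row to its closest reference row."""
    queries = np.asarray(queries, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise differences with shape (n_queries, n_reference, n_features).
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic rows that are closer to training than to holdout.

    Ties count as one half (an assumption of this sketch).
    """
    d_train = nearest_distance(synthetic, training)
    d_hold = nearest_distance(synthetic, holdout)
    return float(np.mean(d_train < d_hold) + 0.5 * np.mean(d_train == d_hold))

def marginal_tvd(real_col, synth_col):
    """Total variation distance between two categorical columns (0 = identical)."""
    real_col = np.asarray(real_col)
    synth_col = np.asarray(synth_col)
    categories = np.union1d(real_col, synth_col)
    p = np.array([np.mean(real_col == c) for c in categories])
    q = np.array([np.mean(synth_col == c) for c in categories])
    return 0.5 * float(np.abs(p - q).sum())

# Illustrative usage on random data (all three samples share one distribution).
rng = np.random.default_rng(0)
training = rng.normal(size=(200, 3))
holdout = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))
share = share_closer_to_training(synthetic, training, holdout)
```

A synthesizer that merely copied training records would push the share towards 1.0, which is the kind of leakage the holdout comparison is meant to expose.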



Publication's profile

Status of publication Published
Affiliation WU
Type of publication Journal article
Journal Frontiers in Big Data
Language English
Title Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data
Volume 4
Year 2021
Page from 1
Page to 12
Reviewed? Y
URL https://www.frontiersin.org/journals/big-data#
DOI https://doi.org/10.3389/fdata.2021.679939
Open Access Y
Open Access Link https://www.frontiersin.org/articles/10.3389/fdata.2021.679939/full


AI-Based Privacy-Preserving Big Data Sharing for Market Research (ANITA-ANonymous bIg daTA)
Reutterer, Thomas
Platzer, Michael (Mostly AI, Austria)
Institute for Marketing and Customer Analytics IN
Research Institute for Computational Methods FI
Research Institute for Cryptoeconomics FI
Research areas (ÖSTAT Classification 'Statistik Austria')
5301 Distributive trades
5307 Business and management economics
5315 Commercial science
5320 Marketing
5321 Market research