On Population Fidelity as an Estimator for the Utility of Synthetic Training Data

Authors

  • Alexander Florean
  • Jonas Forsman
  • Sebastian Herold

DOI:

https://doi.org/10.3384/ecp208016

Abstract

Synthetic data promises to address several challenges in training machine learning models, such as data scarcity, privacy concerns, and the effort required for data collection and annotation. To benefit from synthetic data, its utility for the intended purpose has to be ensured and, ideally, estimated before it is used, since it may otherwise produce poorly performing models. Population fidelity metrics are potential candidates to provide such an estimate. However, evidence of how well they are suited as estimators of the utility of synthetic data is scarce.

In this study, we present the results of an experiment in which we investigated whether population fidelity as measured with nine different metrics correlates with the predictive performance of classification models trained on synthetic data.
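The kind of analysis described above can be illustrated with a minimal sketch. It assumes a rank-based (Spearman) correlation, which the abstract does not specify, and the fidelity scores and F1 values are hypothetical placeholders, not results from the study. All function names are illustrative.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one fidelity score and one F1 score per synthetic dataset.
fidelity = [0.62, 0.71, 0.55, 0.80, 0.68]
f1 = [0.70, 0.74, 0.66, 0.79, 0.69]

rho = spearman(fidelity, f1)
print(f"Spearman correlation: {rho:.2f}")
```

A coefficient near 1 or -1 would indicate that the fidelity metric orders the synthetic datasets much as their downstream F1 scores do; a value near 0 would indicate that it carries little information about utility.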

Cluster Analysis and Cross-Classification show the most consistent results with respect to correlation with F1 performance, but the correlations do not exceed moderate levels. The degree of correlation, and hence the potential suitability for estimating utility, varies considerably across the inspected datasets. Overall, the results suggest that the inspected population fidelity metrics are not a reliable and accurate tool for estimating the utility of synthetic training data for classification tasks. They may, however, be precise enough to indicate trends across different synthetic datasets based on the same original data.

Further research should shed light on how different data properties affect the ability of population fidelity metrics to estimate utility, and should produce recommendations on how to use these metrics for different scenarios and types of datasets.

Published

2024-06-14