Synthetic Data Generation Using Combinatorial Testing and Variational Autoencoder

Author(s)

Krishna Khadka, Jaganmohan Chandrasekaran, Yu Lei, Raghu N. Kacker, D. Richard Kuhn

Abstract

Data is a crucial component in machine learning. However, many datasets contain sensitive information such as personally identifiable health and financial data. Access to these datasets must be restricted to avoid potential security concerns. Synthetic data generation addresses this problem by generating artificial data that are similar to, and thus could be used in place of, the original real-world data. This research introduces a synthetic data generation approach called CT-VAE that uses Combinatorial Testing (CT) and Variational Autoencoder (VAE). We first use VAE to learn the distribution of the real-world data and encode it in a latent, lower-dimensional space. Next, we use CT to sample the latent space by generating a t-way set of latent vectors, each of which represents a data point in the latent space. A synthetic dataset is generated from the t-way set by decoding each latent vector in the set. Our experimental evaluation suggests that machine learning models trained with synthetic datasets generated using our approach could achieve performance that is very similar to those trained with real-world datasets. Furthermore, our approach performs better than several state-of-the-art synthetic data generation approaches.
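
The sampling step described in the abstract can be illustrated with a minimal sketch for t = 2 (pairwise). Everything concrete here is an assumption made for illustration and is not taken from the paper: the greedy construction in `pairwise_covering_set`, the three representative values per latent dimension in `LATENT_VALUES`, and `toy_decode`, which is only a stand-in for a trained VAE decoder. The sketch shows the general idea: build a small set of latent vectors in which every pair of dimensions takes every combination of values at least once, then decode each vector.

```python
import itertools


def pair_key(i, a, j, b):
    """Normalize a (dimension, value) pair so the lower dimension index comes first."""
    return (i, a, j, b) if i < j else (j, b, i, a)


def pairwise_covering_set(levels_per_dim):
    """Greedy 2-way covering set (illustrative, not the paper's algorithm).

    Returns rows of level indices such that, for every pair of dimensions,
    every combination of their levels appears in at least one row.
    """
    k = len(levels_per_dim)
    uncovered = {
        (i, a, j, b)
        for i, j in itertools.combinations(range(k), 2)
        for a in range(levels_per_dim[i])
        for b in range(levels_per_dim[j])
    }
    rows = []
    while uncovered:
        # Seed the row with one uncovered pair so every iteration makes progress.
        i, a, j, b = min(uncovered)
        row = [None] * k
        row[i], row[j] = a, b
        for d in range(k):
            if row[d] is not None:
                continue

            def gain(v):
                # Number of still-uncovered pairs this choice would cover.
                return sum(
                    1
                    for e in range(k)
                    if row[e] is not None
                    and pair_key(d, v, e, row[e]) in uncovered
                )

            row[d] = max(range(levels_per_dim[d]), key=gain)
        for p, q in itertools.combinations(range(k), 2):
            uncovered.discard((p, row[p], q, row[q]))
        rows.append(row)
    return rows


# Assumed discretization: three representative values per latent dimension.
LATENT_VALUES = [-1.5, 0.0, 1.5]


def toy_decode(z):
    """Hypothetical stand-in for a trained VAE decoder: a fixed linear map."""
    return [sum(z), z[0] - z[-1]]


rows = pairwise_covering_set([3, 3, 3, 3])  # 4 latent dimensions, 3 levels each
latent_vectors = [[LATENT_VALUES[v] for v in row] for row in rows]
synthetic = [toy_decode(z) for z in latent_vectors]
print(len(rows), "latent vectors instead of", 3 ** 4, "exhaustive combinations")
```

In an actual pipeline, `toy_decode` would be replaced by the decoder of a VAE trained on the real dataset, and the choice of levels per latent dimension would follow from the learned latent distribution.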
Proceedings Title
IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW
Conference Dates
April 16-20, 2023
Conference Location
Dublin, IE
Conference Title
12th International Workshop on Combinatorial Testing

Keywords

synthetic data generation, variational autoencoders, t-way testing, combinatorial testing, latent space sampling, machine learning

Citation

Khadka, K., Chandrasekaran, J., Lei, Y., Kacker, R. and Kuhn, D. (2023), Synthetic Data Generation Using Combinatorial Testing and Variational Autoencoder, IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW, Dublin, IE, [online], https://doi.org/10.1109/ICSTW58534.2023.00048, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=936332 (Accessed July 21, 2024)

Created May 29, 2023, Updated July 17, 2024