There is a great deal of excitement at the moment around synthetic data and the role it can play in driving innovation. But what is synthetic data? Why is it useful? How is it generated? And what type of synthetic data should you use?
Real data contains information about real people or entities, real events, and real interactions. Synthetic data has the same form, except that some or all of the people, entities, events, or interactions are artificial.
The reason synthetic data is such a hot topic is that ever-increasing computing power and data volumes have combined with progress in machine learning, AI, and other algorithmic approaches to make possible much higher quality artificial data than ever before.
At Smart Data Foundry, we consider three factors when assessing how ‘good’ synthetic data is:
- Fidelity – how similar the synthetic dataset is to the ‘real’ data.
- Privacy – the risk of unintended disclosure of data used to manufacture the synthetic data.
- Utility – the ‘usefulness’ of the synthetic data.
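As a rough illustration of how fidelity might be quantified for a single numeric column, the sketch below compares the empirical distributions of a real and a synthetic sample. It is not Smart Data Foundry's own measure. The function names `ks_statistic` and `fidelity_score` are hypothetical, and the score covers only one column at a time. It says nothing about privacy or utility.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    # Evaluate both CDFs at every observed value from either sample.
    points = sorted(set(real_sorted) | set(synth_sorted))
    gap = 0.0
    for x in points:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def fidelity_score(real, synthetic):
    """Hypothetical score: 1.0 means the two samples have identical distributions."""
    return 1.0 - ks_statistic(real, synthetic)
```

Two identical samples score 1.0 and two samples with no overlap in range score 0.0. A fuller assessment would also need to compare relationships between columns, which this sketch ignores.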
Smart Data Foundry have compared the primary approaches to synthetic data generation, considering privacy, confidentiality, ethics, and disclosure risk, as well as the more basic need that arises where real data does not exist or exhibits a systematic problem such as bias.
There are two primary methods of generating synthetic data:
- Simulation (including Agent-Based Modelling)
Software is written that simulates key aspects of the systems that generate (or would generate) real data. Synthetic data is then captured from the simulation.
- Learning-based synthesis (often referred to as the creation of synthetic doubles)
First, a model is trained to learn the patterns in an existing real dataset; then the learned patterns are used to generate synthetic data with ‘the same’ multi-dimensional distributions as the original dataset.
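The difference between the two methods can be sketched in a few lines. This is a deliberately minimal illustration under stated assumptions, not a production approach: the spending rules in `simulate_spending` are invented for the example, and a single fitted Gaussian stands in for the trained model that a real learning-based synthesiser would use to capture multi-dimensional distributions. All function names are hypothetical.

```python
import random
import statistics


def simulate_spending(n_agents, n_days, seed=0):
    """Simulation: agents follow hand-written rules, so no real data is needed."""
    rng = random.Random(seed)
    records = []
    for agent_id in range(n_agents):
        # Assumed rule: each agent has its own typical daily spend.
        typical = rng.uniform(5.0, 50.0)
        for day in range(n_days):
            # Assumed rule: an agent makes a purchase on roughly 60% of days.
            if rng.random() < 0.6:
                amount = max(0.01, rng.gauss(typical, typical * 0.2))
                records.append(
                    {"agent": agent_id, "day": day, "amount": round(amount, 2)}
                )
    return records


def fit_gaussian(real_values):
    """Learning step: estimate distribution parameters from an existing dataset."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(model, n, seed=0):
    """Generation step: draw new values from the learned distribution."""
    mean, stdev = model
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]
```

Note the difference in inputs: `simulate_spending` takes only rule parameters, whereas `fit_gaussian` cannot run without real values to learn from.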
Not all synthetic data is created equal: there is no plausible one-size-fits-all approach, and different methods suit different use cases. Simulation-based and learning-based synthesis have complementary strengths and weaknesses and are best suited to different application areas.
Simulation is the more suitable approach if real data does not exist or is biased or incomplete, if you are looking to innovate through collaboration with third parties and don’t want to share confidential data, or if you are looking to work with complex integrated datasets that are not readily available.
Where the goal is to make available a safe synthetic double of existing data to which the provider has access while protecting privacy or confidentiality, a learning-based approach is probably most suitable.
There may also be scenarios best served by a combination of the two, in which synthetic doubles are complemented with simulation-based data to fill gaps in the real data.