What kinds of tools do you see in the open-source space for synthetic data creation? Do we have good tools available, or is there an imbalance where most of the best-in-class tools appear to be closed-source?

There are some interesting ones worth checking out. I'll highlight two, but there are a few others:
- ydata-synthetic (focused on the use of deep learning models for synthesis)
- SDV (which encompasses many techniques, from deep learning to Bayesian networks; see the sketch after this list)
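To make the SDV workflow concrete, here is a minimal sketch assuming the SDV 1.x single-table API (the API has changed across releases); `real_data` is a tiny stand-in DataFrame invented purely for illustration:

```python
# Minimal sketch, assuming the SDV 1.x single-table API.
# `real_data` is a placeholder for your own pandas DataFrame.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_data = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40_000, 55_000, 72_000, 90_000, 61_000],
})

# Describe the table so SDV knows the column types.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a Gaussian copula (one of SDV's non-deep-learning models) and sample.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=100)
print(synthetic_data.head())
```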
In the case of imbalanced classes, from what I've seen within the open-source space, no single package stands out as best in class.
Depending on the type of data we are talking about (images, text, or structured), the techniques are widely different, and even in the realm of structured data, it depends on the use case and data structure.
For some use cases, SMOTE and ADASYN (from the scikit-learn-compatible imbalanced-learn package) are enough to deal with the imbalanced-classes problem. Still, GANs are a better option for others where you have higher dimensionality and complex relationships. GANs can be challenging, though, given the many hyperparameters to tune.
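As a concrete starting point, here is a minimal sketch of SMOTE-based oversampling with imbalanced-learn on a toy dataset (the dataset and parameter choices are illustrative only):

```python
# Minimal sketch: oversampling a minority class with SMOTE from the
# imbalanced-learn package (pip install imbalanced-learn).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced dataset: roughly 5% minority class.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("before:", Counter(y))

# SMOTE interpolates new minority samples between existing neighbours.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # classes now balanced
```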
Is there some kind of spectrum of difficulty for synthetic data types? I.e., is structured data easier than unstructured data? I ask since it seems that tool-builders often focus on structured data first, and it feels like the unstructured types get a little neglected, at least early on.

For synthetic data, I would say it is the opposite. The unstructured-data space (from video to images) is very well developed, with a strong community in both open source and commercial platforms.
For structured data, we have seen bigger investment since 2017, but mainly focused on privacy (research). However, synthetic data can also be leveraged for use cases such as data augmentation, dataset balancing, and bias mitigation.
Are there baseline models for synthetic data? Is it pretty much VAEs and GANs? Has something emerged as the go-to method?

VAEs and GANs are one way to do it, but there are other methods that are not deep-learning based, such as Markov chains or Bayesian networks, for instance. In a nutshell, any generative model can be used as a tool for data synthesis.
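To illustrate that a generative model need not be deep-learning based, here is a toy first-order Markov chain for categorical event sequences in plain Python (the event data and function names are hypothetical):

```python
# Toy illustration: a first-order Markov chain as a non-deep-learning
# generative model for categorical sequences.
import random
from collections import defaultdict

def fit_markov_chain(sequences):
    """Estimate first-order transition counts from observed sequences."""
    transitions = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions

def sample_sequence(transitions, start, length, seed=0):
    """Generate a synthetic sequence by walking the learned transitions."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        nexts = transitions.get(out[-1])
        if not nexts:
            break  # absorbing state: no observed outgoing transition
        states, counts = zip(*nexts.items())
        out.append(rng.choices(states, weights=counts, k=1)[0])
    return out

events = [
    ["login", "browse", "buy", "logout"],
    ["login", "browse", "browse", "logout"],
]
model = fit_markov_chain(events)
print(sample_sequence(model, "login", 6))
```

Anything you can fit and then sample from plays the same role here as a VAE or GAN, just with different capacity.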
We now have methods such as GPT-3 and DALL-E as industrial solutions to generate text and images. What would you say are the next things we can expect to see in the field of "synthetic data as a service"?

This is a tough one! To be honest, I think we will still see more advances within the realm of video and images. Structured data is trickier to generalize and deliver as an automated service with no business context embedded.
I was wondering if there are approaches for synthetic data for time series. I have tried variational autoencoders and other generative models, but the results looked "too smooth," while the real time series were "spiky." It didn't seem that those approaches would be usable in practice. Are there libraries or approaches that work well for sequence data?

Indeed, that is a common pattern you will find when generating synthetic time series. Solutions based on RNNs and CNNs tend to have limitations and lose long-term context while generating time series, resulting in the overly smooth behavior you've described.
From an open-source perspective, I haven't seen anything that works smoothly, but the best so far is DoppelGANger, which is GAN-based. By the way, we are working on making it available using TF2 in the ydata-synthetic package.
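As an aside, one quick way to check for the "too smooth" failure mode described above is to compare step-to-step variability and lag-1 autocorrelation between real and generated series. A hypothetical numpy-only sketch, not tied to any particular synthesizer:

```python
# Hypothetical diagnostic: quantify over-smoothing by comparing
# step-to-step variability and autocorrelation of real vs. synthetic series.
import numpy as np

def smoothness_stats(series):
    """Summary statistics that expose over-smoothed synthetic series."""
    diffs = np.diff(series)
    return {
        "step_std": float(diffs.std()),  # spiky data -> large step-to-step std
        "lag1_autocorr": float(np.corrcoef(series[:-1], series[1:])[0, 1]),
    }

rng = np.random.default_rng(0)
real = rng.normal(size=500).cumsum() + rng.normal(scale=2.0, size=500)  # spiky
synthetic = np.convolve(real, np.ones(20) / 20, mode="same")  # over-smoothed stand-in

print("real:     ", smoothness_stats(real))
print("synthetic:", smoothness_stats(synthetic))
# An over-smoothed generator shows a much smaller step_std and a
# lag-1 autocorrelation pushed toward 1.
```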