We tend to shy away from “synthetic” anything—food, clothing, or building materials, for example. Synthetic data, on the other hand, can be good for business if you’re looking for AI training data.
Synthetic data is data that’s generated artificially rather than by real-world events. Thanks in part to the rise of AI and machine learning, the appetite for data is off the charts. However, data sets for certain AI needs, such as new markets in software development, can be expensive or simply unavailable. Synthetic data can mimic data of nearly every type, be produced in virtually unlimited quantities, and be customized further to provide granular control over testing and simulations.
Synthetic data can also help organizations leapfrog over the challenges of anonymized real-world data while optimizing and streamlining product timelines. Gartner predicts that in 2024, 60% of data for AI will be synthetic in order to simulate reality and de-risk AI, compared with 1% of data in 2021.
How Synthetic Data Is Made
Synthetic data is typically created via algorithms, statistical models, or generative AI. To develop synthetic data, information from almost any source is analyzed to detect structures and patterns. Those structures and patterns become the foundation for building new data sets that share the characteristics of the original data sets.
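As a rough illustration of the analyze-then-generate idea described above, the following Python sketch fits a very simple statistical model (a straight line plus Gaussian noise) to a two-column data set and then samples new rows from it. The function names, the choice of model, and the two-column layout are all assumptions made for this example; real generators use far richer models.

```python
import math
import random
import statistics


def fit_model(pairs):
    """Learn simple structure from real (x, y) rows: the spread of x
    and a straight-line relationship between x and y."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Typical size of the scatter around the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    noise_std = math.sqrt(sum(r * r for r in residuals) / len(residuals))

    return {
        "mean_x": mean_x,
        "std_x": statistics.stdev(xs),
        "slope": slope,
        "intercept": intercept,
        "noise_std": noise_std,
    }


def generate_synthetic(model, n_rows, seed=None):
    """Sample brand-new rows that follow the learned patterns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = rng.gauss(model["mean_x"], model["std_x"])
        noise = rng.gauss(0, model["noise_std"])
        y = model["intercept"] + model["slope"] * x + noise
        rows.append((x, y))
    return rows
```

Calling `generate_synthetic(fit_model(real_rows), 10_000)` would then produce more rows than the original sample contained, each one sampled from the learned model instead of being copied from the source.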
AI-generated synthetic data is created by providing a sample from which a generative model builds a more sophisticated replica of a real-world data set. An AI-based synthetic data generator can learn and replicate software or business objectives while weaving in historical trends and outlier behavior. Once it’s trained, the generator can produce data that’s the functional equivalent of the “real” data, at the same scale or larger if needed.
Synthetic data can be used to mimic structured data, such as time-series-based databases, or unstructured data, such as images. Synthetic images are commonly used for training applications for autonomous vehicles and machines.
Why Use Synthetic Data?
One of the key concerns around using real-world data sets is the need to protect the privacy of end users whose data contributes to building a training model. Many AI models use anonymized or pseudonymized data, which is real-world data with the actual identifying connections removed or masked. However, because even anonymized data can be useful to cybercriminals, it requires the same protections legally and operationally as any other sensitive data. In the end, the need to protect user privacy reduces the usefulness of the data.
In addition to data governance burdens, real-world data sets may also be harder to work with because of data “noise” (from data that can’t be interpreted), data errors, and incomplete records. The cost and time required to identify and mitigate these issues can add up quickly, making synthetic data a more attractive option.
Then there’s the inability to find a relevant real-world data set. An enterprise in a promising new market niche or with a revolutionary product may find that rich, years-long customer behavior databases are scarce—or worse, the competition already has one. In these cases, synthetic data provides what may be the only alternative.
Where Synthetic Data Excels
Synthetic data is often the simplest way to get rich, effective, and functionally equivalent stand-ins for data sets based on real information. In some cases, it is more effective and useful than its real-world counterparts. It comes prelabeled and doesn’t have the errors or identity-masking issues of real-world data. It can be freely shared without compromising user privacy. It can also be created at scale and on demand, augmented as needed, and made to map directly to an existing application’s fields and records.
Researchers at MIT have found that synthetic data largely acts like the real thing—meaning it generates similar results when compared with real data. The researchers created the Synthetic Data Vault, a set of open-source tools to expand data access without compromising privacy, and hired data scientists to develop predictive models with original data sets as well as synthetic ones. The researchers found no significant differences among the resulting solutions in 70% of the tests.
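One way to check this kind of claim on your own data is to build the same model twice, once from real records and once from synthetic ones, and score both against real records that were held back. The Python sketch below does this for a deliberately simple straight-line model. It is not the MIT team's methodology or the Synthetic Data Vault API, and the function names are hypothetical.

```python
def fit_line(pairs):
    """Least-squares fit of y = intercept + slope * x on (x, y) rows."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(line, pairs):
    """Average absolute prediction error of a fitted line on (x, y) rows."""
    intercept, slope = line
    return sum(abs(y - (intercept + slope * x)) for x, y in pairs) / len(pairs)


def compare_training_sources(real_train, synthetic_train, real_test):
    """Train one model per data source, then score both on held-out real rows."""
    real_error = mean_abs_error(fit_line(real_train), real_test)
    synthetic_error = mean_abs_error(fit_line(synthetic_train), real_test)
    return {
        "real_error": real_error,
        "synthetic_error": synthetic_error,
        "gap": synthetic_error - real_error,
    }
```

A gap close to zero suggests the synthetic data supports a model about as well as the real data does for this task; a large positive gap suggests it does not.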
The Downside
Like all technologies that are still in the early stages of application, AI needs to be carefully managed. This is true whether an enterprise is using synthetic data or real-world data sets. Generative AI can still produce less-than-desirable results when creating synthetic data, such as amplifying anomalies in the source data. In addition, it’s possible that synthetic data sets could mistakenly include some personally identifiable information, opening up an organization to fines or legal action, depending on the data regulations that were violated.
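One modest safeguard against the second risk is to scan generated records for strings that look like personal identifiers before the data is shared. The Python sketch below uses two illustrative regular expressions (email addresses and US Social Security number formatting). The patterns and function name are assumptions for this example, are far from exhaustive, and are no substitute for a proper privacy review.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_possible_pii(records):
    """Return (record_index, field_name, pattern_name) for each suspicious value.

    `records` is a list of dicts mapping field names to values.
    """
    hits = []
    for index, record in enumerate(records):
        for field, value in record.items():
            text = str(value)
            for name, pattern in PII_PATTERNS.items():
                if pattern.search(text):
                    hits.append((index, field, name))
    return hits
```

An empty result does not prove a data set is free of personal information; it only means none of the listed patterns matched.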