Once considered less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it difficult to collect. In contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, for example, of different ages, shapes, and ethnicities to build a face detection system that works across populations.
But synthetic data has its limits. If it doesn’t reflect reality, it could end up producing AI that performs even worse than AI trained on messy, biased real-world data, or it could simply inherit the same problems. “What I don’t want to do is endorse this paradigm and say, ‘Oh, this will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it will also ignore a lot of things.”
Realistic, not real
Deep learning has always been about data. But in recent years, the AI community has learned that good data matters more than big data. Even a small amount of accurate, properly labeled data can do more to improve an AI system’s performance than 10 times that amount of uncurated data, or even a more advanced algorithm.
This is changing the way companies should approach developing their AI models, says Ofir Chakon, CEO and co-founder of Datagen. Today, they start by acquiring as much data as possible and then tweak and fine-tune their algorithms for better performance. Instead, they should be doing the opposite: use the same algorithm while improving the composition of their data.
But collecting real-world data for this kind of iterative experimentation is too costly and time-consuming. This is where Datagen comes in. With its synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age group, BMI range, and ethnicity, as well as a list of actions for them to perform, such as walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. Some of the synthesized data is then spot-checked for realism. Fake faces are plotted against real faces, for example, to see whether they look realistic.
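Datagen’s actual pipeline is proprietary, but the core idea of fanning a handful of scanned attributes out into many labeled scenarios can be sketched with a Cartesian product. All attribute names and values below are hypothetical, chosen only to mirror the categories mentioned above:

```python
import itertools

# Hypothetical attribute variations a generator might recombine.
# These are illustrative stand-ins, not Datagen's actual schema.
age_groups = ["18-30", "31-50", "51-70"]
bmi_ranges = ["underweight", "normal", "overweight"]
actions = ["walking", "drinking_soda", "sitting"]
lighting = ["daylight", "indoor", "night"]

def expand_combinations(*attribute_lists):
    """Cartesian product of attribute variations: a small number of
    scans fans out into many distinct labeled synthetic scenarios."""
    return list(itertools.product(*attribute_lists))

scenarios = expand_combinations(age_groups, bmi_ranges, actions, lighting)
print(len(scenarios))  # 3 * 3 * 3 * 3 = 81 labeled combinations
```

With realistic numbers of variations per attribute (and more attributes, such as ethnicity, camera angle, or background), the product quickly reaches the hundreds of thousands of combinations the article describes.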
Datagen now generates facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashier-free stores, and irises and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer vision systems serving tens of millions of users.
It’s not just synthetic humans being mass-produced. Click-Ins is a startup that uses synthetic AI to perform automated vehicle inspections. Using design software, it re-creates all the car makes and models that its AI needs to recognize, then renders them with different colors, damage, and deformations under different lighting conditions, against different backgrounds. This lets the company update its AI when automakers release new models, and helps it avoid data privacy violations in countries where license plates are considered private information and thus cannot be present in the photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake client data that let companies share their customer database with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data the company doesn’t yet have, including a more diverse client population or scenarios such as fraudulent activity.
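The phrase “same statistical properties” is doing a lot of work here. Products like Mostly.ai use rich generative models, but the underlying idea can be illustrated with a deliberately crude sketch: fit a simple distribution to one numeric column of real data, then sample fresh values from it. The column name and figures below are invented for illustration:

```python
import random
import statistics

def fit_and_sample(real_column, n, seed=0):
    """Fit a Gaussian to one numeric column and draw synthetic values
    with the same mean and standard deviation. Real synthetic-data
    products use far richer generative models; this only illustrates
    the 'same statistical properties' idea."""
    mu = statistics.mean(real_column)
    sigma = statistics.stdev(real_column)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Toy "real" data: hypothetical customer account balances.
real_balances = [1200.0, 950.0, 1800.0, 400.0, 2100.0, 1300.0]
fake_balances = fit_and_sample(real_balances, n=10_000)

# The synthetic column tracks the real column's summary statistics
# without reproducing any individual record verbatim.
print(round(statistics.mean(real_balances)),
      round(statistics.mean(fake_balances)))
```

An analyst querying the fake column gets roughly the same aggregates as from the real one, which is the property that makes such data useful to share with vendors. Whether it is also *private* is exactly the question raised later in the article.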
Proponents of synthetic data say it can also help evaluate AI. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her coauthors demonstrated how data-generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s more youthful population but wanted to understand how its AI performs on an aging population with a higher prevalence of diabetes. She is now starting her own company, Bayesian Health, which will use this technique to test medical AI systems.
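To make the extrapolation idea concrete, here is a minimal stand-in, not Saria’s actual method: resample a toy patient table with weights proportional to age, so the resampled population skews older than the original. All records and probabilities below are fabricated for illustration:

```python
import random

# Toy patient records: (age, has_diabetes). Purely illustrative.
data_rng = random.Random(7)
records = [(data_rng.randint(20, 80), data_rng.random() < 0.1)
           for _ in range(5_000)]

def resample_toward_older(records, n, seed=1):
    """Importance-style resampling: draw records with probability
    proportional to age, shifting the sample toward an older
    population. A crude stand-in for population extrapolation,
    not the technique from Saria's paper."""
    weights = [age for age, _ in records]
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=n)

def mean_age(rs):
    return sum(age for age, _ in rs) / len(rs)

shifted = resample_toward_older(records, n=5_000)
print(round(mean_age(records)), round(mean_age(shifted)))
```

An AI system could then be evaluated on `shifted` to probe how its performance degrades on a population it was not trained on, which is the testing use case described above.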
The limits of pretending
But is synthetic data overhyped?
On the privacy front, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data-generation techniques have been shown to faithfully reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.
This might be fine for a firm like Datagen, whose synthetic data isn’t meant to conceal the identity of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.
Research suggests that the combination of two synthetic-data techniques in particular, differential privacy and generative adversarial networks (GANs), “can produce the strongest privacy protections,” says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance could be lost in the marketing lingo of synthetic-data vendors, which won’t always be forthcoming about what techniques they are using.
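Differential privacy, one of the two techniques Herman names, has a simple core primitive worth seeing: release a statistic only after adding calibrated random noise, so no single person’s record can be inferred from the answer. The sketch below implements the textbook Laplace mechanism for a count (sensitivity 1); real DP-GAN pipelines instead inject noise during model training, which this does not show:

```python
import math
import random

def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon, the
    textbook epsilon-differentially-private mechanism for a query
    whose answer changes by at most 1 when one person is added or
    removed. Smaller epsilon means more noise and more privacy."""
    u = rng.uniform(-0.5, 0.5)  # inverse-CDF sampling of Laplace noise
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

rng = random.Random(0)
# Each individual release is perturbed, but the noise is zero-mean,
# so aggregates over many releases remain accurate.
releases = [laplace_count(100, epsilon=1.0, rng=rng) for _ in range(10_000)]
print(round(sum(releases) / len(releases)))  # close to the true count, 100
```

The trade-off visible here, per-query noise in exchange for a provable privacy guarantee, is what vendors are glossing over when the “nuance gets lost in the marketing lingo.”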