Synthetic Data: The Diamonds of Machine Learning – TDWI


Refined and labeled data is imperative for advances in AI. When your supply of good data does not match your demand, look to synthetic data to fill the gap.

We have all heard the saying, "Diamonds are a girl's best friend," made famous by Marilyn Monroe in the 1953 film Gentlemen Prefer Blondes. The unparalleled brilliance and permanence of the diamond contribute to its desirability. Its unique molecular structure gives it incredible strength, making it highly desirable not only as beautiful jewelry but also in industrial tools that cut, grind, and drill.

However, the worldwide supply of diamonds is limited as they take millions of years to form naturally. In the middle of the last century, corporations set out to determine a process to produce lab-grown diamonds. Over the past 70 years, scientists have not only been able to replicate the strength and durability of natural diamonds but, more recently, have been able to match the color and clarity of natural diamonds as well.

Just as with diamonds in the mid-twentieth century, today there is a mismatch between the supply of and demand for the high-quality data needed to power the artificial intelligence revolution. Just as an abundant supply of coal never translated into an abundant supply of diamonds, today's supply of raw data does not equal the supply of refined, labeled data needed to train machine learning models.

What is the answer to this mismatch of supply and demand? Many companies are pursuing lab-generated synthetic data that can be used to support the explosion of artificial intelligence.

The goal of synthetic data generation is to produce sufficiently groomed data for training effective machine learning models -- including classification, regression, and clustering models. These models must perform as well on real-world data as they would have if they had been trained on natural data.

Synthetic data can be extremely valuable in industries where data is sparse, scarce, or expensive to acquire. Common use cases include outlier detection and problems that deal with highly sensitive information, such as private health data. Whether the challenge arises from data sensitivity or data scarcity, synthetic data can fill in the gaps.

There are three common methods of generating synthetic data: enhanced sampling, generative adversarial networks, and agent-based simulations.

Enhanced Sampling

In problems such as rare disease detection or fraud detection, one of the most common challenges is the rarity of instances representing the target you are searching for. Class imbalance in your data limits how accurately a machine learning model can be trained. Without sufficient exposure to instances of the minority class during training, the model struggles to recognize such instances when evaluating production data. In fraud cases, if the model is not trained with sufficient instances of fraud, it will simply classify everything as non-fraudulent when deployed in production.

To balance your data, one option is to over-sample the minority class or under-sample the majority class to create a synthetic distribution of the data. This ensures that the model sees an equal balance of each class during training. Statistical professionals have long used this method to address class imbalance.
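As a minimal sketch of the over-sampling idea, the snippet below randomly duplicates minority-class records until every class matches the largest one. The data and function names here are hypothetical illustrations; in practice, libraries such as imbalanced-learn offer more sophisticated variants (for example, SMOTE, which synthesizes new minority-class points rather than duplicating existing ones).

```python
import random

def oversample_minority(records, label_key="label", seed=0):
    """Balance a dataset by randomly duplicating minority-class records
    (sampling with replacement) until every class matches the size of
    the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra samples from under-represented classes
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical fraud data: 1 fraudulent record for every 9 legitimate ones
data = [{"amount": i, "label": "non-fraud"} for i in range(9)]
data += [{"amount": 999, "label": "fraud"}]
balanced = oversample_minority(data)
```

After balancing, the model trains on equal numbers of fraud and non-fraud examples; the trade-off is that duplicated records can encourage overfitting to the few minority instances, which is one motivation for the generative approaches discussed next.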

