Fake data is great data when it comes to machine learning – Stacey on IoT

Its been a few years since Ilast wroteabout the idea of using synthetic data to train machine learning models.After having three recent discussions on the topic, I figured its time to revisit the technology, especially as it seems to be gaining ground in mainstream adoption.

Back in 2018, at Microsoft Build, I saw a demonstration of a drone flying over a pipeline as it inspected it for leaks or other damage. Notably, the drones visual inspection model was trained using both actual data and simulated data. Use of the synthetic data helped teach the machine learning model about outliers and novel conditions it wasnt able to encounter using traditional training. Italso allowed Microsoft researchers to train the model more quickly and without the need to embark on as many expensive, data-gathering flights as it would have had to otherwise.

The technology is finally starting to gain ground. In April, a startup calledAnyverse raised 3million ($3.37 million)for its synthetic sensor data,while another startup,AI.Reverie,published a paper about how it used simulated data to train a model to identify planes on airport runways.

After writing that initial story, I heard very little about synthetic data untilmy conversation earlier this month with Dan Jeavons, chief data scientist at Shell. When I asked him about Shells machine learning projects, using simulated data was one that he was incredibly excited about because it helps build models that can detect problems that occur only rarely.

I think its a really interesting way to get info on the edge cases that were trying to solve, he said. Even though we have a lot of data, the big problem that we have is that, actually, we often only had a very few examples of what were looking for.

In the oil business, corrosion in factories and pipelines is a big challenge, and one that can lead to catastrophic failures. Thats why companies are careful about not letting anything corrode to the point where it poses a risk. But that also means the machine learning models cant be trained on real-world examples of corrosion. So Shell uses synthetic data to help.

As Jeavons explained, Shell is also using synthetic data to try and solve the problem of people smoking at gas stations. Shelldoesnthave a lot of examples because the cameras dont always catch the smokers; in other cases, theyre too far away or arent facing the camera. So the company is working hard on combining simulated synthetic data with real data to build computer vision models.

Almost always the things were interested in are the edge cases rather than the general norm, said Jeavons. And its quite easy to detect the edge [deviating] from the standard pattern, but its quite hard to detect the specific thing that you want.

In the meantime, startup AI.Reverie endeavored to learn more about the accuracy of synthetic data. The paper it published, RarePlanes: Synthetic Data Takes Flight, lays out how its researchers combined satellite imagery of planes parked at airports that was annotated and validated by humans with synthetic data created by machine.

When using just synthetic data, the model was only about 55% percent accurate, whereas when it only used real-world data that number jumped to 73%. But by makingreal-world data 10% of the training sample and using synthetic data for the rest, the models accuracy came in at 69%.

Paul Walborsky, the CEO of AI.Reverie (and the former CEO at GigaOM; in other words, my former boss), says that synthetic datais going to be a big business. Companies using such data need to account for ways that their fake data can skew the model, but if they can do that, they can achieve robust models faster and at a lower cost than if they relied on real-world data.

So even though IoT sensors are throwing off petabytes of data, it would be impossible to annotate all of it and use it for training models. And as Jeavons points out, those petabytes of data may not have the situation you actually want the computer to look for. In other words, expect the wave of synthetic and simulated data to keep on coming.

Were convinced that, actually, this is going to be the future in terms of making things work well, said Jeavons, both in the cloud and at the edge for some of these complex use cases.

Related

See the rest here:
Fake data is great data when it comes to machine learning - Stacey on IoT

Related Posts

Comments are closed.