Opinions

As AI Reaches Limits And Sparks Debate, Can Synthetic Data Be The Problem Solver?

Synthetic data empowers models to teach themselves and discover new knowledge.

The primary source of data for training AI has traditionally been real-world data. You might think this "dataset" would be vast, but there are many gaps in this world of generic data, and acquiring it can be both costly and time-consuming.

The web’s noise and sometimes disorder can make it difficult to push AI performance to new heights (and that’s before addressing legal and privacy questions). As models continue to advance, even the biggest names in the industry can struggle to gather sufficient high-quality data.

These limitations of relying solely on real-world data have prompted companies like Microsoft, OpenAI and Cohere to explore the use of synthetic data. And, against a backdrop of AI debate, the move comes at a time when the most prominent AI companies have formed a new industry body to oversee safe development of the most advanced models.

However, this is all set to change. A new report from Gartner predicts that most data used to train machine learning models will soon be synthetic and automatically generated. Only 1% of all AI training data was synthetic in 2021, but analysts suggest the figure could reach 60% by the end of 2024.

Synthetic data generates unique and sophisticated datasets. It is cost-effective and can be fine-tuned by humans, ensuring its quality. Overall, it preserves privacy, reduces bias and maintains statistical integrity. While caution is needed to avoid using poor synthetic data, the potential for accelerated progress towards superintelligent AI systems is undeniable.

The need for data and privacy

Could we actually run out of training data? Well, a paper published by Epoch (yet to be peer reviewed) predicts we could run out of language data between 2030 and 2050, with high-quality language data "likely to be exhausted by 2027". This is because as large language models (LLMs) become more sophisticated, they need larger, better and more accurate datasets to train on.

But in the current landscape, the problem is not only that datasets are running out relative to ever-increasing compute power, but also that accessible real-world data is in short supply. It is genuinely difficult to find enough real-world data sources licensed for commercial use, particularly when it comes to images. Even with access to a hub of commercially licensed photos, you have to account for the biases of the source they are drawn from and the limited range of images on offer.

Moreover, there are growing concerns over legal protection. The use of AI and real-world data to recreate voices, faces and text is sparking uproar over copyright infringement and artistic integrity. So, what can be done to address these concerns?

The attraction of synthetic data

Image protection, copyright and licences exist for a reason, and they must be respected. But it does not seem possible to gather enough usable data from real-world sources to meet the growing demand for ML training. As a result, these very legal frameworks lend greater weight to the case for synthetic data.

Synthetic data is artificially generated by computer systems to match real-world data without relying on it. This removes the need to reveal or transmit personal identifying information. As companies battle the limits of human-made data, synthetic data has risen as a more cost-efficient and faster way of generating streams of data without the licence issue.
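To illustrate the core idea in miniature (this is a deliberately simplified sketch, not how production synthetic-data pipelines actually work): fit summary statistics to a small real dataset, then sample entirely new records from those statistics, so the synthetic rows mimic the real distribution without copying or exposing any original record. All names here are illustrative.

```python
import random
import statistics

def fit_and_sample(real_rows, n_samples, seed=0):
    """Fit a per-column mean/stdev to real data, then draw brand-new
    rows from those Gaussians -- no real record is ever copied."""
    rng = random.Random(seed)
    cols = list(zip(*real_rows))  # transpose rows into columns
    col_stats = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    return [
        [rng.gauss(mu, sigma) for mu, sigma in col_stats]
        for _ in range(n_samples)
    ]

# Toy "real" data: (age, income) pairs we want to mimic but not expose.
real = [(34, 52000), (29, 48000), (45, 61000), (38, 57000)]
synthetic = fit_and_sample(real, n_samples=1000)
```

Real generators are far richer, using simulation engines or learned generative models rather than independent Gaussians, but the principle is the same: the synthetic output carries the statistical shape of the source data, not its personally identifying contents.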

The more synthetic data is produced, and the more it absorbs from real-world scenarios, the more realistic and representative it becomes. This creates larger, more usable datasets and means ML training can progress much faster. By self-generating data and creating a vast range of scenarios and use cases that would be expensive or unsafe to recreate in the real world, synthetic data has the potential to resolve AI's regulation dilemma and match its predicted growth.

Why we need to couple the synthetic with the real

There is a future where synthetic data is so sophisticated that it rarely needs real-world datasets to work from; a future where LLMs are able to train themselves. Yet it is important to acknowledge the issues that can arise from training models on synthetic data alone, a process that can regenerate and compound incorrect and biased datasets. This is often down to using poor synthetic data and suboptimal data practices.

Therefore, it is imperative that synthetic data accurately and reliably reflects real-world data; it is when it fails to do so that these problems arise. Real-world data remains a key step in initially training models and ensuring their accuracy as they develop. Synthetic data can then build on this foundation and generate a whole range of further data sources and scenarios.

The answer to AI’s issues

The next year is likely to see a massive upturn in the use of synthetic data. As real-world datasets run thinner, demands for training data increase and legal concerns become ever greater, companies are seeing synthetic data as the most viable answer in addressing each challenge. And for good reason.

Synthetic data provides compliant data, reduces bias and maintains accuracy. It also allows companies to test a whole range of scenarios and create unique datasets that are not possible in the real world. Merged with real-world data, it can drive the process towards creating superintelligent AI systems.

Ultimately, synthetic data empowers models to teach themselves and discover new knowledge. The more we expand its use in AI, the more new approaches to model development we can unlock.

Steve Harris is CEO at Mindtech.
