The quality of data inputs in AI systems is critical to the accuracy and relevance of their outputs, akin to how poor ingredients result in poor dishes. This concept, known as “Garbage in, Garbage out,” highlights the importance of quality data.
Research predicts that publicly available human-generated data for training AI models may be depleted within two to eight years, leading to a reliance on “synthetic data” that could compromise AI accuracy and reliability. How will the depletion of publicly available human-generated data impact the accuracy and reliability of AI models, and what role can synthetic data play in mitigating these challenges?
Dr James Kang, Senior Lecturer in Computer Science at RMIT University Vietnam, discusses the impact of data shortages on AI, the rise and challenges of synthetic data, and potential solutions: key insights for anyone invested in AI’s future.
What are the current data quality issues affecting AI model development?
AI training and inference are critical to producing AI predictions and outputs, and using biased or incomplete data affects the outcome. For example, when an AI has been trained for facial recognition on one particular ethnic group, it may produce less accurate predictions for faces of other ethnicities.
Inaccurate, inconsistent or outdated data can also confuse an AI and degrade its performance. Poor labelling likewise affects quality, because labels directly determine the accuracy and consistency of AI training.
For example, if an AI is trained on pictures of apples, pineapples and oranges, and some pineapples have been mislabelled as apples, the model will be confused and treat apples and pineapples as the same fruit. Accurate labelling from the outset is therefore crucial in AI training to ensure reliable and effective model performance.
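To make the point concrete, here is a minimal sketch in Python (using scikit-learn and entirely made-up feature vectors, not real fruit images) showing how flipping some “pineapple” labels to “apple” can degrade a simple classifier:

```python
# Minimal sketch (hypothetical data): the effect of mislabelled training
# examples on a simple classifier. The fruit "images" are random feature
# vectors, one cluster per class -- only the label noise matters here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
classes = {"apple": 0, "pineapple": 1, "orange": 2}

# Fake "image features": one cluster of 300 samples per fruit class.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(300, 8)) for c in classes.values()])
y = np.repeat(list(classes.values()), 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Corrupt the labels: mark 40% of the training pineapples as apples.
y_noisy = y_train.copy()
pineapples = np.where(y_train == classes["pineapple"])[0]
flip = rng.choice(pineapples, size=int(0.4 * len(pineapples)), replace=False)
y_noisy[flip] = classes["apple"]

clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
noisy_acc = LogisticRegression(max_iter=1000).fit(X_train, y_noisy).score(X_test, y_test)
print(f"accuracy with clean labels:     {clean_acc:.2f}")
print(f"accuracy with mislabelled data: {noisy_acc:.2f}")
```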
A Forbes Advisor survey reveals that 64 percent of businesses believe AI will help increase their productivity, but that benefit relies on the quality of the data used for AI training.
Unveiling synthetic data: Definitions, uses, and drivers
Whilst real data are obtained from primary sources such as interviews, surveys, experiments, observations and data mining, synthetic data is artificially created by algorithms using computer simulations and predictions rather than real-world collection. It is designed to replicate the patterns or statistical properties of actual data, and it can cut costs (for example, by up to 60 percent of inspection costs) and accelerate development cycles by around 30 percent.
Synthetic data can be generated quickly, whereas real data can be expensive and time consuming to collect and label for input into AI systems. Tech companies may opt to use synthetic data where privacy or ethical issues make data collection a barrier, such as with personal user information or sensitive health information.
Synthetic data can also be useful for reducing or eliminating certain biases, since real-world data often contains them. When real data is scarce and hard to obtain, synthetic data can also augment existing datasets to provide a more diverse set of training samples. There are therefore many scenarios in which researchers and tech companies use synthetic data to overcome constraints and limitations of real-world data.
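As a rough illustration of the idea, the following Python sketch generates synthetic records that mimic the mean and covariance of a small, made-up “real” dataset and then augments the real data with them; production-grade generators such as simulations or generative models are far more sophisticated than this:

```python
# Minimal sketch: generate synthetic records that replicate the statistical
# properties (mean and covariance) of a small "real" dataset, then augment
# the real data with them. The "real" data here is itself invented.
import numpy as np

rng = np.random.default_rng(42)

# Pretend these are 200 real records with 3 numeric features.
real = rng.normal(loc=[50.0, 1.7, 120.0], scale=[10.0, 0.1, 15.0], size=(200, 3))

# Fit simple statistics of the real data ...
mean, cov = real.mean(axis=0), np.cov(real, rowvar=False)

# ... and sample synthetic records that mimic those statistics.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

# Augmented training set: real plus synthetic samples.
augmented = np.vstack([real, synthetic])

print("real mean:     ", np.round(real.mean(axis=0), 2))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 2))
print("augmented set: ", augmented.shape)
```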
In the field of healthcare, researchers often face the challenge of limited access to diverse and comprehensive patient data due to privacy regulations. For instance, IBM’s Watson Health has utilised synthetic data to overcome these limitations. By generating synthetic datasets that mimic real patient records, they were able to create a diverse and comprehensive training set for their AI models. This helped reduce biases, such as those related to gender, age, and ethnicity, present in the real-world data.
Overcoming obstacles: Challenges and solutions in synthetic data for AI
One critical problem with AI models that use synthetic data is the degree of uncertainty about the validity and quality of the information. It is hard for users to know how much they can trust data produced by AI, as there is no universally accepted or standardised quality ‘scoring’ system in place yet. It is therefore important to implement a robust data validation system to ensure the accuracy and reliability of the data used for training.
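As a hedged illustration of what such a validation step might look like, the following Python sketch runs a few basic checks (missing values, out-of-range values, duplicate rows) over a toy table; the column names and ranges are assumptions, not a standard, and a production pipeline would use a dedicated validation library with far richer rules:

```python
# Minimal sketch of a data validation step before training.
import pandas as pd

EXPECTED_RANGES = {"age": (0, 120), "heart_rate": (20, 250)}  # assumed columns

def validate(df: pd.DataFrame) -> list[str]:
    issues = []
    for col, (lo, hi) in EXPECTED_RANGES.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
            continue
        if df[col].isna().any():
            issues.append(f"{col}: {df[col].isna().sum()} missing values")
        out_of_range = df[(df[col] < lo) | (df[col] > hi)]
        if len(out_of_range):
            issues.append(f"{col}: {len(out_of_range)} values outside [{lo}, {hi}]")
    duplicates = df.duplicated().sum()
    if duplicates:
        issues.append(f"{duplicates} duplicate rows")
    return issues

records = pd.DataFrame({"age": [34, None, 402], "heart_rate": [72, 88, 65]})
for problem in validate(records):
    print("VALIDATION:", problem)
```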
To distinguish real data from synthetic data, systems can track metadata and the origin of the data used, establishing where it came from and how reliable it is. Where required, human intervention may also be part of data preparation and AI training to ensure that the quality of the data used is maintained.
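One lightweight way to support this kind of tracking is to attach a provenance record to each dataset. The Python sketch below is only an illustration; the fields shown are assumptions rather than any established metadata standard:

```python
# Minimal sketch of provenance metadata attached to a dataset, so that
# downstream users can see where the data came from and whether it is
# real or synthetic. Field names are hypothetical.
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional

@dataclass
class DatasetProvenance:
    name: str
    source: str               # e.g. survey, sensor feed, simulation
    collected_on: date
    synthetic: bool           # flags synthetic or partially synthetic data
    generator: Optional[str]  # which model or simulation produced it, if any
    reviewed_by_human: bool

record = DatasetProvenance(
    name="patient-vitals-v2",
    source="hospital sensor feed + simulation",
    collected_on=date(2024, 6, 1),
    synthetic=True,
    generator="in-house statistical sampler",
    reviewed_by_human=True,
)
print(asdict(record))
```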
Somewhat ironically, AI algorithms may themselves be able to play a role in assessing and detecting quality issues where parameters have been clearly defined as part of the audit process. Partially substituting parts of real data to create a partially synthetic dataset may be an option for protecting individual privacy, for instance by using pseudo-identifiers for usernames and dates of birth, without completely synthesising the dataset. Using a mix of real and synthetic data in such a carefully managed way can help boost AI performance while maintaining data privacy and minimising biases.
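The sketch below illustrates one possible form of such partial substitution: direct identifiers (a username and a date of birth) are replaced with pseudonyms while the remaining fields stay real. The field names are hypothetical, and genuine de-identification would require proper key management and a privacy review:

```python
# Minimal sketch of a partially synthetic record: identifiers are
# pseudonymised, other fields are kept as-is. Illustration only.
import hashlib
from datetime import date

SALT = "replace-with-a-secret-salt"  # assumption: stored securely elsewhere

def pseudonymise(record: dict) -> dict:
    out = dict(record)
    # Username -> stable pseudo-identifier (one-way hash).
    out["username"] = hashlib.sha256((SALT + record["username"]).encode()).hexdigest()[:12]
    # Date of birth -> year only, so exact birthdays are not exposed.
    out["date_of_birth"] = record["date_of_birth"].year
    return out

patient = {
    "username": "nguyen.van.a",
    "date_of_birth": date(1990, 4, 12),
    "blood_pressure": "120/80",  # real clinical fields are kept as-is
}
print(pseudonymise(patient))
```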
As AI continues to transform industries worldwide, addressing the data quality crisis is paramount. Synthetic data offers a compelling solution but must be used carefully to mitigate risks related to accuracy and trustworthiness. To ensure sustainable AI development in Vietnam, it is crucial for governments, businesses, industries and universities to collaborate in enhancing the quality of data for training, including synthetic data. By working together, we can overcome data shortages and improve the accuracy and reliability of AI models, fostering innovation and growth in the AI sector.
Story: Dr James Kang, Senior Lecturer, Computer Science, School of Science, Engineering & Technology – RMIT Vietnam
ENDS
Source: Vietnam Insider