0.01% False Training Text Can Lead To An 11.2% Increase In Harmful Content. Be Wary Of Artificial Intelligence "data Poisoning"
0.01% False Training Text Can Lead To An 11.2% Increase In Harmful Content. Be Wary Of Artificial Intelligence "data Poisoning"
The Ministry of National Security issued a security reminder article today (5th), stating that artificial intelligence training data has problems of varying quality, including false information, fictitious content and biased views, causing data source contamination and posing new challenges to artificial intelligence security.
The Ministry of National Security issued a security reminder article today (5th), stating that artificial intelligence training data has problems of varying quality, including false information, fictitious content and biased views, causing data source contamination and posing new challenges to artificial intelligence security.

Data is the foundation of artificial intelligence
The three core elements of artificial intelligence are algorithms, computing power and data. Data is the basic element for training AI models and the core resource for AI applications.
Provide raw materials for AI models. Massive data provides sufficient training materials for AI models, allowing them to learn the inherent laws and patterns of data and achieve semantic understanding, intelligent decision-making and content generation. At the same time, data also drives artificial intelligence to continuously optimize performance and accuracy, and realize iterative upgrades of models to adapt to new needs.
Affects the performance of AI models. AI models have extremely high requirements on the quantity, quality and diversity of data. Sufficient data volume is the prerequisite for fully training large-scale models; data with high accuracy, completeness and consistency can effectively avoid misleading the model; diversified data covering multiple fields can improve the model's ability to cope with actual complex scenarios.
Promote the application of AI models. The increasing abundance of data resources has accelerated the implementation of the "artificial intelligence " action and effectively promoted the in-depth integration of artificial intelligence with various economic and social fields. This not only cultivates and develops new productive forces, but also promotes the leap-forward development of my country's science and technology, industrial optimization and upgrading, and the overall jump in productivity.

Data pollution impacts security defense lines
High-quality data can significantly improve the accuracy and reliability of the model, but once the data is contaminated, it may lead to model decision-making errors or even AI system failure, posing certain security risks.
Deliver harmful content. Contaminated data generated through "data poisoning" behaviors such as tampering, fabrication, and repetition will interfere with the parameter adjustment of the model during the training phase, weaken the model's performance, reduce its accuracy, and even induce harmful output. Research shows:
When there are only 0.01% false texts in the training data set, the harmful content output by the model will increase by 11.2%;
Even if it is 0.001% false text, its harmful output will increase by 7.2% accordingly.
Causes recursive pollution. False content generated by artificial intelligence contaminated by data may become a data source for subsequent model training, forming a continuing "contamination legacy effect." Currently, the quantity of AI-generated content on the Internet far exceeds the real content produced by humans. A large amount of low-quality and non-objective data floods it, leading to the accumulation of erroneous information in AI training data sets from generation to generation, ultimately distorting the cognitive capabilities of the model itself.
Initiate real risks. Data pollution may also trigger a series of real risks, especially in areas such as financial markets, public safety, and medical health.
In the financial field, criminals use AI to concoct false information, causing data pollution, which may cause abnormal fluctuations in stock prices, posing a new type of market manipulation risk;
In the field of public security, data pollution can easily disturb public perception, mislead public opinion, and induce social panic;
In the medical and health field, data contamination may cause models to generate incorrect diagnosis and treatment recommendations, which not only endangers patients' lives but also aggravates the spread of pseudoscience.

Build a solid artificial intelligence data base
Strengthen source supervision to prevent the generation of pollution. Based on laws and regulations such as the Cybersecurity Law of the People's Republic of China, the Data Security Law of the People's Republic of China, and the Personal Information Protection Law of the People's Republic of China, an AI data classification and hierarchical protection system is established to fundamentally prevent the generation of contaminated data and help effectively prevent AI data security threats.
Strengthen risk assessment and ensure data circulation. Strengthen the overall assessment of artificial intelligence data security risks to ensure data security throughout the entire life cycle, including collection, storage, transmission, use, exchange and backup. Simultaneously accelerate the construction of an artificial intelligence security risk classification management system and continuously improve the comprehensive data security capabilities.
Terminal cleaning and repair, building a governance framework. Regularly clean and repair contaminated data in accordance with regulatory standards. Develop specific rules for data cleaning in accordance with relevant laws, regulations and industry standards. Gradually build a modular, monitorable, and scalable data governance framework to achieve continuous management and quality control.