Summary
- Data poisoning means corrupting the data an AI model is trained with. Bad actors do this to stop others' AI models from functioning reliably.
- Companies may use corrupt (poisoned) data if:
- They scrape public data posted by bad actors.
- Their private dataset is hacked and modified.
- Their AI model is derived from another model trained on poisoned data.
- AI models may have low accuracy after training on poisoned data. Worse, they may contain 'backdoors': hidden behaviours where the model acts in an attacker-chosen way after receiving a rare, secret, specially chosen 'trigger'.
- In critical applications like autonomous weapons, autonomous vehicles, and healthcare, data poisoning can unexpectedly cause serious harm, including death. This possibility reduces the trustworthiness of AI.
What is Data Poisoning?
Data poisoning is a kind of 'hack' where corrupted data is added to the training set of an AI model. Bad actors do this to make a model perform poorly, or to control the model's response to specific inputs.
Data poisoning occurs in the early stages of creating an AI model: while data is gathered before training begins. Common ways bad actors corrupt data include:
- Adding mislabelled data. Ex: Teaching an AI model that a picture of a plane is a picture of a frog.
- Adding outliers/noise to the data. Ex: Giving a language model many sentences with hate speech (Amy Kraft, CBS News, 2016).
- Adding small but distinct 'triggers' (see the black-and-white square below and the sketch that follows this list).
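To make the 'trigger' idea concrete, here is a minimal Python sketch, not taken from any real attack, of how a bad actor might stamp a small white square into a handful of images and flip their labels. The image format, patch size, and poisoning rate are all illustrative assumptions.

```python
# Minimal sketch of trigger-based data poisoning (illustrative assumptions only).
import numpy as np

def poison_sample(image: np.ndarray, target_label: int,
                  patch_size: int = 3) -> tuple[np.ndarray, int]:
    """Stamp a small white square (the 'trigger') into the corner of an
    image and relabel it with the attacker's chosen target class."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = 255  # bottom-right white square
    return poisoned, target_label

def poison_dataset(images: np.ndarray, labels: np.ndarray,
                   target_label: int, rate: float = 0.05,
                   seed: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Poison a small fraction of the dataset, so the corruption is hard
    for human reviewers to spot."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i], labels[i] = poison_sample(images[i], target_label)
    return images, labels
```

Because only a small fraction of examples is touched and the patch is tiny, the poisoned dataset looks almost identical to the clean one at a glance.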
Often, bad actors will choose corrupted data that's difficult for humans to notice. Here are some examples of corrupted image and text data.
Sources of Poisoned Data
Insider threats: One source of poisoned data is an organisation's own employees. Employees working on data collection and preprocessing can secretly corrupt data, motivated by corporate or state-sponsored espionage, ideological activism, or personal disgruntlement.
Public data scraping: It's common to gather training data from public Internet sources. Humans can't review the enormous amount of data collected, so anyone, including bad actors, can post poisoned data that ends up in AI datasets. Ex: a political adversary could post false information on social media, which might then be scraped and used to train chatbots like ChatGPT.
Fine-tuning pretrained models: Most AI models are derived from 'pretrained models' (models already trained and shown to be effective by academic groups or large companies). Smaller organisations often copy these models and make small updates ('fine-tuning') for their own use. However, if a pretrained model was trained on poisoned data, any models based on it usually inherit the effects. This is like a disease spreading across different AI models.
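To illustrate why poisoning spreads this way, here is a hedged sketch of the standard fine-tuning workflow using torchvision's pretrained ResNet-18. The dataset and class count are placeholders; the point is that if the pretrained weights came from a poisoned training run, freezing them and retraining only the final layer (a very common shortcut) leaves any backdoor behaviour largely intact.

```python
# Typical fine-tuning workflow: reuse pretrained weights, retrain only the head.
# If the pretrained weights were produced from poisoned data, the backdoor
# behaviour is largely inherited. (Sketch only; the dataset is hypothetical.)
import torch
import torch.nn as nn
from torchvision import models

pretrained = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the (potentially poisoned) feature extractor...
for param in pretrained.parameters():
    param.requires_grad = False

# ...and swap in a new head for our own 10-class task.
pretrained.fc = nn.Linear(pretrained.fc.in_features, 10)

optimizer = torch.optim.Adam(pretrained.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# for images, labels in my_dataset:   # hypothetical clean dataset
#     optimizer.zero_grad()
#     loss = loss_fn(pretrained(images), labels)
#     loss.backward()
#     optimizer.step()
```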
Why does Data Poisoning Matter?
There are two main kinds of damage that data poisoning can cause for AI models.
Reduced Accuracy: Poisoned data can lead to a decrease in model accuracy. When a model learns from incorrect or manipulated data, it makes incorrect or unwanted decisions. This can result in financial losses, privacy breaches, or reputational damage for businesses relying on these models.
As a real-life example, Microsoft released a chatbot called Tay on Twitter in 2016. It learnt from interactions with Twitter users in real time. Users started flooding the chatbot with explicit and racist content. Within hours, the chatbot began echoing this language, and Microsoft had to shut it down (Amy Kraft, CBS News, 2016).
Backdoors: Data poisoning can introduce 'backdoors' into AI models. These are hidden behaviours that are only activated when a model receives a specific, secret input ('trigger'). For example, upon seeing the pixel pattern in the dog image above, an AI model could be trained to always classify an image as a frog.
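To see what a backdoor looks like in numbers, researchers typically compare a model's accuracy on clean inputs with its 'attack success rate' on triggered inputs. The sketch below is illustrative: it assumes a hypothetical trained classifier exposing a `predict` method, and reuses the white-square trigger from the earlier sketch.

```python
# Sketch of how a backdoor is quantified: clean accuracy stays high, while the
# 'attack success rate' measures how often the trigger forces the attacker's
# chosen label. `model`, `test_images`, and `test_labels` are placeholders.
import numpy as np

def add_trigger(image: np.ndarray, patch_size: int = 3) -> np.ndarray:
    stamped = image.copy()
    stamped[-patch_size:, -patch_size:] = 255   # same white-square trigger
    return stamped

def evaluate_backdoor(model, test_images, test_labels, target_label):
    clean_preds = np.array([model.predict(x) for x in test_images])
    clean_accuracy = float(np.mean(clean_preds == test_labels))

    triggered_preds = np.array(
        [model.predict(add_trigger(x)) for x in test_images]
    )
    attack_success_rate = float(np.mean(triggered_preds == target_label))
    return clean_accuracy, attack_success_rate
```

A well-hidden backdoor keeps clean accuracy near normal, which is exactly why it is so hard to notice before deployment.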
Misclassifying frogs may not seem too bad. However, researchers have also been able to make the AI models behind self-driving cars malfunction: they showed that adding things like sticky notes to a stop sign can cause the sign to go unrecognised 90% of the time. Left unfixed, this kind of flaw could cost human lives (Tom Simonite, WIRED, 2017).
Potential Solutions
Addressing data poisoning and backdoors is still an active area of research; no perfect solutions exist.
- Technical solutions being worked on include detecting poisoned input data, removing triggers from input data, and detecting whether a model has had backdoors implanted in it (a simplified sketch of one such idea follows this list).
- Policy solutions depend on the organisation implementing the policies.
- AI model developers can take extra care to acquire data only from companies with a transparent record of trust and security.
- Critical sectors like energy, healthcare, and the military can avoid using complex AI models (see 'interpretable by design').
- Government agencies (like NIST in the US) can work with academics on standards for dataset quality, backdoor detection, and more.
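As a flavour of the technical work mentioned above, here is a deliberately simplified sketch of one common idea: embed each training example with a feature extractor, then flag points that sit unusually far from their class's average embedding, since poisoned or mislabelled examples often look like outliers in feature space. The embeddings input and threshold are placeholders; real defences (such as activation clustering or spectral signatures) are more sophisticated.

```python
# Simplified outlier-style defence: flag training points whose feature
# embeddings are far from their class centroid. The embeddings come from any
# feature extractor; the threshold is an illustrative choice.
import numpy as np

def flag_suspicious(embeddings: np.ndarray, labels: np.ndarray,
                    z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking examples to review or drop."""
    suspicious = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        # Flag points more than z_threshold standard deviations from the
        # average distance within their class.
        z = (dists - dists.mean()) / (dists.std() + 1e-8)
        suspicious[idx] = z > z_threshold
    return suspicious
```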