Data Poisoning

Data poisoning is an attack on the training phase of machine learning rather than on a finished model. Because a model learns whatever patterns are present in its training data, an attacker who can influence that data can influence what the model learns. This stands in contrast to evasion attacks like adversarial examples, which manipulate inputs at prediction time without touching the training set.

Poisoning attacks come in two broad flavors. Availability attacks, sometimes called indiscriminate attacks, aim to degrade overall accuracy by injecting mislabeled or corrupted examples, making the model unreliable. Backdoor or targeted attacks, which compromise the model's integrity rather than its availability, are subtler: they leave normal accuracy largely intact but teach the model a hidden rule that activates only on inputs carrying a specific trigger, the pattern demonstrated by the BadNets and Trojaning papers. The threat has grown more practical as models are trained on data scraped from the open web, which an attacker can seed. The paper “Poisoning Web-Scale Training Datasets is Practical” by Carlini and colleagues (arXiv:2302.10149, 2023) showed concretely how a low-budget adversary could place malicious content into the data behind large public datasets, for example by buying expired domains whose URLs those datasets still reference.
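The difference between the two flavors can be illustrated with a toy sketch in Python. Everything in it is an assumption made for illustration and is not taken from the papers cited above: the two-feature dataset, the function names (clean_training_set, flip_labels, add_backdoor, predict, clean_accuracy), and the 1-nearest-neighbor model standing in for a real learner. The second feature plays the role of the trigger.

```python
from typing import List, Tuple

Features = Tuple[float, float]   # (x, trigger); trigger is 0.0 in clean data
Sample = Tuple[Features, int]    # (features, label)


def clean_training_set() -> List[Sample]:
    # Hypothetical toy data: the label is 1 when x > 0, otherwise 0.
    return [((i / 10, 0.0), int(i > 0)) for i in range(-10, 11) if i != 0]


def flip_labels(data: List[Sample], every: int) -> List[Sample]:
    # Accuracy-degrading poisoning: corrupt the label of every `every`-th sample.
    return [(f, 1 - y) if k % every == 0 else (f, y)
            for k, (f, y) in enumerate(data)]


def add_backdoor(data: List[Sample]) -> List[Sample]:
    # Backdoor poisoning: a few points that look like class 0 (x < 0)
    # carry the trigger (second feature = 1.0) and are labeled 1.
    poison = [((x, 1.0), 1) for x in (-0.8, -0.6, -0.4, -0.2)]
    return data + poison


def predict(train: List[Sample], point: Features) -> int:
    # 1-nearest-neighbor: return the label of the closest training sample.
    def dist2(f: Features) -> float:
        return (f[0] - point[0]) ** 2 + (f[1] - point[1]) ** 2
    return min(train, key=lambda s: dist2(s[0]))[1]


def clean_accuracy(train: List[Sample]) -> float:
    # Accuracy on trigger-free inputs, offset slightly from the training grid.
    tests = [i / 10 + 0.02 for i in range(-10, 10)]
    correct = sum(predict(train, (x, 0.0)) == int(x > 0) for x in tests)
    return correct / len(tests)


if __name__ == "__main__":
    clean = clean_training_set()
    flipped = flip_labels(clean, every=4)
    backdoored = add_backdoor(clean)

    print("clean model accuracy:     ", clean_accuracy(clean))
    print("label-flipped accuracy:   ", clean_accuracy(flipped))
    print("backdoored model accuracy:", clean_accuracy(backdoored))

    triggered = (-0.52, 1.0)  # looks like class 0, but carries the trigger
    print("clean model on triggered input:     ", predict(clean, triggered))
    print("backdoored model on triggered input:", predict(backdoored, triggered))
```

In this toy setup, flipping labels lowers accuracy on ordinary inputs, while the backdoored model scores the same as the clean one on ordinary inputs and changes its answer only when the trigger feature is set. That is why a backdoor can pass a standard accuracy check unnoticed.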

Data poisoning is widely recognized as a significant risk for modern AI systems. The OWASP Top 10 for LLM Applications lists “Data and Model Poisoning” explicitly (LLM04 in the 2025 edition), reflecting how dependent today’s systems are on large, loosely controlled training corpora.

For a business reader, the key insight is that a model’s integrity depends on the integrity of its training data. If you cannot vouch for where your data came from, you cannot fully vouch for what your model has learned, which makes data provenance and curation a security concern, not just a quality one.
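One simple provenance control is to record a cryptographic hash of each data item when it is first vetted, then check the hashes again before training. The Python sketch below assumes the dataset fits in memory as a name-to-bytes mapping; the function names (sha256_hex, build_manifest, find_tampered) are hypothetical and chosen for illustration. It detects content that changed after vetting. It does not detect content that was already malicious when the manifest was built.

```python
import hashlib
from typing import Dict, List


def sha256_hex(content: bytes) -> str:
    # Hex-encoded SHA-256 digest of one data item.
    return hashlib.sha256(content).hexdigest()


def build_manifest(items: Dict[str, bytes]) -> Dict[str, str]:
    # Record a digest for every item at the time the data is vetted.
    return {name: sha256_hex(content) for name, content in items.items()}


def find_tampered(manifest: Dict[str, str], items: Dict[str, bytes]) -> List[str]:
    # Names that are missing or whose content no longer matches the manifest.
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in items or sha256_hex(items[name]) != digest
    )


if __name__ == "__main__":
    vetted = {"a.txt": b"cat photo", "b.txt": b"dog photo"}
    manifest = build_manifest(vetted)

    # Later, before training: one item has been silently replaced.
    fetched = dict(vetted)
    fetched["b.txt"] = b"dog photo with hidden trigger"
    print(find_tampered(manifest, fetched))
```

Anything the check reports can be dropped or re-reviewed before it reaches the training pipeline.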

Sources

Related