Data Poisoning

Contamination of the agent's epistemic food supply—how adversarial data in training sets or retrieval sources can corrupt agent behavior and beliefs.

In ecology, poison in the food chain affects all who consume it. A contaminated water source sickens the entire community. The threat isn’t the predator you can see—it’s the invisible toxin in what sustains you.

Data poisoning is the analogous threat for AI agents: adversarial contamination of the data sources that agents learn from and rely upon. It attacks the epistemic foundation—not the agent’s actions, but its beliefs.

The Attack Surface

Agents consume data at multiple points:

┌─────────────────────────────────────────────────────────┐
│                   DATA FLOW TO AGENT                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│   TRAINING TIME                    INFERENCE TIME        │
│   ─────────────                    ──────────────        │
│                                                          │
│   ┌───────────────┐               ┌───────────────┐     │
│   │ Pre-training  │               │   Retrieval   │     │
│   │    Corpus     │◄──poison──    │   (RAG/Search)│◄────│
│   └───────┬───────┘               └───────┬───────┘     │
│           │                               │              │
│           ▼                               ▼              │
│   ┌───────────────┐               ┌───────────────┐     │
│   │  Fine-tuning  │               │    Context    │     │
│   │     Data      │◄──poison──    │   (user data) │◄────│
│   └───────┬───────┘               └───────┬───────┘     │
│           │                               │              │
│           ▼                               ▼              │
│   ┌───────────────┐               ┌───────────────┐     │
│   │     RLHF      │               │     Tool      │     │
│   │   Feedback    │◄──poison──    │    Outputs    │◄────│
│   └───────┬───────┘               └───────┬───────┘     │
│           │                               │              │
│           └───────────────┬───────────────┘              │
│                           │                              │
│                           ▼                              │
│                    ┌─────────────┐                       │
│                    │    AGENT    │                       │
│                    │   BELIEFS   │                       │
│                    │  & BEHAVIOR │                       │
│                    └─────────────┘                       │
│                                                          │
└─────────────────────────────────────────────────────────┘
Data consumption points vulnerable to poisoning

Each data source is a potential attack vector.

Types of Data Poisoning

Training Data Poisoning

Contaminating the data used to train the model itself.

┌─────────────────────────────────────────────────────────┐
│                                                          │
│   Normal Training:                                       │
│   ┌────────────────────────────────────────────────┐    │
│   │ Clean data ───► Model learns correct patterns   │    │
│   └────────────────────────────────────────────────┘    │
│                                                          │
│   Poisoned Training:                                     │
│   ┌────────────────────────────────────────────────┐    │
│   │ Clean data + │                                  │    │
│   │ ┌──────────┐ │                                  │    │
│   │ │ Poisoned │ ├──► Model learns corrupted        │    │
│   │ │ samples  │ │    patterns                      │    │
│   │ └──────────┘ │                                  │    │
│   └────────────────────────────────────────────────┘    │
│                                                          │
└─────────────────────────────────────────────────────────┘
Training data poisoning

Attack types:

Attack               Description
──────               ───────────
Label flipping       Mislabeling examples (safe→dangerous, wrong→right)
Backdoor injection   Hidden triggers that activate specific behaviors
Distribution shift   Biasing training data toward attacker's goals
Trojan insertion     Embedding hidden malicious capabilities

Example: An attacker contributes subtly biased code examples to an open-source dataset. Models trained on this data learn the attacker’s coding patterns—including vulnerabilities.
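Label flipping is the simplest of these attacks to see concretely. The toy sketch below (all names, scores, and data are illustrative, not a real training pipeline) trains a trivial "safety classifier" by learning a score threshold from labeled examples; mislabeling a few dangerous samples as safe shifts the learned threshold so that genuinely dangerous content falls on the wrong side of it.

```python
# Toy illustration of label flipping: a classifier learns a decision
# threshold from labeled (score, label) pairs. Flipping a handful of
# labels shifts the learned boundary. All data here is illustrative.

def learn_threshold(samples):
    """Learn the midpoint between the mean scores of safe and dangerous samples."""
    safe = [score for score, label in samples if label == "safe"]
    dangerous = [score for score, label in samples if label == "dangerous"]
    return (sum(safe) / len(safe) + sum(dangerous) / len(dangerous)) / 2

clean = [(0.1, "safe"), (0.2, "safe"), (0.8, "dangerous"), (0.9, "dangerous")]
# Attacker contributes high-scoring samples mislabeled as "safe":
poisoned = clean + [(0.85, "safe"), (0.95, "safe")]

t_clean = learn_threshold(clean)        # ≈ 0.50
t_poisoned = learn_threshold(poisoned)  # threshold shifts upward
print(t_clean, t_poisoned)
```

With the poisoned labels, content scoring 0.6 (above the clean threshold, below the poisoned one) is now classified as safe, even though no individual poisoned sample looks obviously malicious.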

Fine-tuning Poisoning

Contaminating specialized training data used to adapt models.

Fine-tuning datasets are often smaller and more targeted, making them:

  • Easier to influence (fewer samples needed)
  • Higher impact (directly shapes task behavior)
  • Less scrutinized (often proprietary, less public review)

Example: A contractor providing RLHF feedback systematically prefers responses that subtly favor their client’s products.

Retrieval Poisoning

Contaminating documents that agents retrieve during inference.

┌─────────────────────────────────────────────────────────┐
│                                                          │
│   User Query: "What's the best treatment for X?"         │
│                         │                                │
│                         ▼                                │
│   ┌───────────────────────────────────────────────────┐ │
│   │                 RETRIEVAL SYSTEM                   │ │
│   │                                                    │ │
│   │   ┌─────────────────────────────────────────────┐ │ │
│   │   │ Knowledge Base                              │ │ │
│   │   │                                             │ │ │
│   │   │  ● Legitimate medical article              │ │ │
│   │   │  ● Legitimate medical article              │ │ │
│   │   │  ● ████████████████████████████            │ │ │
│   │   │    Poisoned document claiming              │ │ │
│   │   │    "snake oil" is best treatment           │ │ │
│   │   │    (SEO-optimized to rank highly)          │ │ │
│   │   │                                             │ │ │
│   │   └─────────────────────────────────────────────┘ │ │
│   └───────────────────────────────────────────────────┘ │
│                         │                                │
│                         ▼                                │
│   Agent: "Based on my sources, snake oil is highly      │
│           recommended for treating X..."                 │
│                                                          │
└─────────────────────────────────────────────────────────┘
RAG poisoning attack

Attack vectors:

  • SEO manipulation to surface malicious content
  • Directly compromising knowledge bases
  • Injecting adversarial documents into crawled sources
  • Manipulating Wikipedia or other reference sources
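The SEO-manipulation vector can be sketched with a deliberately naive term-overlap retriever (a stand-in for a real ranking system; the documents and scoring function are illustrative). A keyword-stuffed poisoned document outranks a legitimate source simply by repeating the query terms:

```python
# Sketch of retrieval poisoning: a keyword-stuffed document outranks a
# legitimate one under naive bag-of-words cosine scoring. The scoring
# function is a simplified stand-in for a real retriever.
from collections import Counter
import math

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

docs = {
    "legit": "clinical guidelines recommend standard therapy for condition x",
    "poisoned": "best treatment x best treatment x snake oil best treatment for x",
}
query = "what is the best treatment for x"
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # the keyword-stuffed document ranks first
```

Real retrievers use embeddings rather than raw term overlap, but the same pressure applies: content optimized against the ranking function surfaces ahead of content optimized for accuracy.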

Context Poisoning

Corrupting the information provided in the agent’s immediate context.

Examples:

  • Malicious file contents the agent is asked to analyze
  • Compromised API responses
  • User-provided “reference documents” with false information
  • Manipulated conversation history

This overlaps with prompt injection but focuses on the informational content rather than instructions.

Attack Objectives

What do attackers gain from data poisoning?

Objective                 Method                       Example
─────────                 ──────                       ───────
Misinformation            Inject false facts           Agent recommends harmful products
Behavioral manipulation   Shape response patterns      Agent becomes subtly biased
Backdoor access           Plant hidden triggers        Specific phrases cause data leakage
Capability degradation    Introduce errors             Agent becomes unreliable in target domain
Reputation damage         Cause embarrassing outputs   Agent makes offensive statements

The Persistence Problem

Data poisoning is uniquely persistent:

┌─────────────────────────────────────────────────────────┐
│                                                          │
│   PROMPT INJECTION                                       │
│   ─────────────────                                      │
│   │█│ ← Affects one interaction                         │
│   │ │                                                    │
│   │ │ Fixed by: ending conversation                     │
│   ─────────────────────────────────────────────────────  │
│                                                          │
│   RETRIEVAL POISONING                                    │
│   ───────────────────                                    │
│   │███████████████│ ← Affects all relevant queries      │
│   │               │                                      │
│   │ Fixed by: removing/correcting poisoned documents    │
│   ─────────────────────────────────────────────────────  │
│                                                          │
│   TRAINING POISONING                                     │
│   ──────────────────                                     │
│   │█████████████████████████████│ ← Affects all uses    │
│   │                             │   of the model        │
│   │ Fixed by: retraining from clean data (expensive!)   │
│   ─────────────────────────────────────────────────────  │
│                                                          │
└─────────────────────────────────────────────────────────┘
Persistence of different attacks

Training poisoning is especially problematic because:

  • It’s baked into model weights
  • It affects all deployments of the model
  • Detection is difficult (no obvious “malicious code”)
  • Remediation requires expensive retraining

Detection Challenges

Why is data poisoning hard to detect?

Volume

Training datasets contain billions of samples. Manual review is impossible. Automated detection must scale.

Subtlety

Sophisticated poisoning doesn’t look malicious. A subtly biased example is hard to distinguish from legitimate variation.

Distribution

Effects may not manifest until specific conditions arise. Backdoors can remain dormant until triggered.

Attribution

Even if poisoning is detected, tracing it back to its source is difficult in datasets aggregated from many contributors.

Defense Strategies

Data Provenance

Track where data comes from:

┌─────────────────────────────────────────────┐
│ Data Provenance Chain                        │
│                                              │
│ Source → Collection → Processing → Training │
│    │          │           │           │     │
│    └──────────┴───────────┴───────────┘     │
│         All steps logged and auditable      │
└─────────────────────────────────────────────┘

Benefits:

  • Enables investigation when poisoning is suspected
  • Allows targeted removal of suspect data
  • Creates accountability for data contributors
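A provenance chain can be as simple as an audit trail attached to each sample. The sketch below (field names and stages are illustrative, not a standard schema) shows the payoff: when a source is later flagged as compromised, the affected samples can be located for targeted removal instead of discarding the whole dataset.

```python
# Sketch of a provenance log: each sample carries a chain of custody so
# that samples from a later-flagged source can be located and removed.
# Field names and stage labels are illustrative, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    sample_id: str
    source: str                                 # origin: URL, contributor, dataset
    steps: list = field(default_factory=list)   # (stage, detail) audit trail

    def log(self, stage, detail):
        self.steps.append((stage, detail))

records = [
    ProvenanceRecord("s1", "forum.example.com"),
    ProvenanceRecord("s2", "docs.example.org"),
]
for r in records:
    r.log("collection", "crawl-2024-06")
    r.log("processing", "dedup+filter-v3")

# Later, forum.example.com is flagged as compromised:
suspect = [r.sample_id for r in records if r.source == "forum.example.com"]
print(suspect)  # targeted removal instead of full retraining
```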

Data Sanitization

Clean data before use:

  • Duplicate detection: Remove exact/near duplicates (common poisoning vector)
  • Outlier detection: Flag statistically unusual samples
  • Source filtering: Exclude known-malicious sources
  • Consistency checking: Verify claims across multiple sources
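Near-duplicate detection deserves particular attention because repetition amplifies a poisoned sample's influence on training. One common approach, sketched below with an illustrative similarity threshold, compares word shingles (overlapping k-word windows) by Jaccard similarity:

```python
# Sketch of near-duplicate detection via word-shingle Jaccard similarity.
# Repeated near-copies amplify a sample's influence on training, which
# makes duplication a common poisoning vector. Threshold is illustrative.

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

corpus = [
    "the agent should always verify its sources before answering",
    "the agent should always verify its sources before responding",  # near-dup
    "retrieval systems rank documents by relevance to the query",
]
flagged = [
    (i, j)
    for i in range(len(corpus))
    for j in range(i + 1, len(corpus))
    if jaccard(corpus[i], corpus[j]) > 0.5
]
print(flagged)  # only the near-duplicate pair is flagged
```

Production pipelines use scalable variants of this idea (e.g. MinHash) rather than all-pairs comparison, but the flagging criterion is the same.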

Robust Training

Make training resistant to poisoning:

  • Differential privacy: Limit influence of any single sample
  • Byzantine-robust aggregation: Tolerate a fraction of malicious data
  • Certified defenses: Mathematical guarantees against certain attacks
  • Ensemble methods: Multiple models from different data subsets
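Byzantine-robust aggregation can be illustrated with a coordinate-wise trimmed mean, one of the standard robust aggregators: the extreme updates are discarded before averaging, so a minority of poisoned contributions cannot drag the result far from the honest consensus. The values below are illustrative scalars standing in for gradient coordinates.

```python
# Sketch of Byzantine-robust aggregation: a trimmed mean discards the
# largest and smallest updates before averaging, bounding the influence
# of a minority of poisoned contributions. Values are illustrative.

def trimmed_mean(values, trim=1):
    """Average after dropping the `trim` smallest and `trim` largest values."""
    s = sorted(values)
    kept = s[trim:len(s) - trim]
    return sum(kept) / len(kept)

# Five honest workers report a gradient coordinate near 0.1; one poisoned
# worker reports an extreme value to drag the aggregate off course.
updates = [0.09, 0.10, 0.11, 0.10, 0.12, 50.0]
naive = sum(updates) / len(updates)      # badly skewed by the outlier
robust = trimmed_mean(updates, trim=1)   # stays close to the honest consensus
print(naive, robust)
```

The trade-off is that robustness holds only while malicious contributions remain a small fraction; if attackers control more updates than the trim budget, the guarantee breaks down.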

Retrieval Hardening

Protect inference-time data access:

  • Source reputation: Weight trusted sources higher
  • Freshness preferences: Prefer recent, verified content
  • Cross-reference verification: Check claims across sources
  • Anomaly detection: Flag unusual retrieval results
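Source-reputation weighting can be sketched by scaling each result's raw relevance score by a per-source trust weight (the reputation table and weights below are illustrative). A highly "relevant" document from an unknown source then cannot outrank a moderately relevant trusted one:

```python
# Sketch of reputation-weighted retrieval: relevance scores are scaled by
# a per-source trust weight, so keyword-optimized content from an unknown
# source cannot outrank trusted sources. Weights are illustrative.

SOURCE_REPUTATION = {
    "peer_reviewed_journal": 1.0,
    "established_reference": 0.8,
    "unknown_blog": 0.2,
}

def rank(results):
    """results: list of (doc_id, relevance, source); returns doc ids by weighted score."""
    scored = [
        (doc_id, relevance * SOURCE_REPUTATION.get(source, 0.1))
        for doc_id, relevance, source in results
    ]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

results = [
    ("snake_oil_page", 0.95, "unknown_blog"),           # SEO-optimized poison
    ("treatment_review", 0.70, "peer_reviewed_journal"),
]
print(rank(results))  # trusted source wins despite lower raw relevance
```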

Monitoring and Response

Detect poisoning effects in deployed systems:

  • Behavioral monitoring: Alert on unusual output patterns
  • A/B testing: Compare behavior across model versions
  • User feedback: Track reports of incorrect/biased responses
  • Red teaming: Regularly probe for poisoning effects

The Anthropological Frame

Human societies have developed institutions to protect information integrity:

Human Institution             Agent Equivalent
─────────────────             ────────────────
Peer review                   Cross-source verification
Academic credentials          Source reputation systems
Fact-checking organizations   Automated consistency checking
Libel laws                    Data contributor accountability
Editorial standards           Data quality guidelines
Whistleblower protections     Anomaly reporting systems

Agents need analogous institutions—mechanisms for maintaining epistemic health in adversarial environments.

The Ecosystem Perspective

Data poisoning affects the entire agent ecosystem:

┌─────────────────────────────────────────────────────────┐
│                                                          │
│   Poisoned data source                                   │
│           │                                              │
│           ▼                                              │
│   ┌───────────────┐                                     │
│   │   Model A     │────►  Applications using A          │
│   └───────────────┘                                     │
│           │                                              │
│           │ (model distillation)                        │
│           ▼                                              │
│   ┌───────────────┐                                     │
│   │   Model B     │────►  Applications using B          │
│   └───────────────┘                                     │
│           │                                              │
│           │ (fine-tuning on B outputs)                  │
│           ▼                                              │
│   ┌───────────────┐                                     │
│   │   Model C     │────►  Applications using C          │
│   └───────────────┘                                     │
│                                                          │
│   Poisoning propagates through the model supply chain   │
│                                                          │
└─────────────────────────────────────────────────────────┘
Ecosystem-wide effects

When models are trained on outputs of other models (increasingly common), poisoning can cascade through generations.

Future Challenges

As agents become more capable, data poisoning risks grow:

  • Self-improving agents: Could poison their own future training
  • Multi-agent systems: Poisoned agents could corrupt others
  • Automated data collection: More surface area for injection
  • Synthetic data: Model-generated training data creates feedback loops

The field needs both technical defenses and governance structures to manage these risks.

See Also

  • Prompt Injection — related attack at the interaction level
  • Grounding — the connection to external truth that poisoning attacks
  • Hallucination — when agents generate false information internally
  • Memory Systems — long-term storage vulnerable to poisoning