An Educational Series by Camilo Pestana, PhD
Understanding how AI systems can be attacked — and how to defend them.
This repository contains the materials for a series of talks and hands-on notebooks on AI Security — a rapidly growing field at the intersection of machine learning and cybersecurity.
The series is divided into three parts:
- Part 1 — Adversarial Attacks on Computer Vision Models: How image classifiers can be systematically fooled, and how to build defenses against these attacks.
- Part 2 — Adversarial Attacks on Audio Models: How speech-to-text and audio classification models can be attacked in the spectrogram domain using the same gradient-based techniques as image attacks.
- Part 3 — AI Security in Large Language Models (LLMs): How modern LLMs are vulnerable to prompt injection, jailbreaks, and other adversarial inputs — with interactive Docker environments running local LLMs where you can practice real attacks safely.
Purpose: All content in this repository is strictly for educational purposes. The goal is to build intuition about AI vulnerabilities so that researchers, engineers, and developers can build more robust and trustworthy AI systems. No content here should be used for malicious purposes.
| # | Topic | Notebook | Status |
|---|---|---|---|
| 01 | Adversarial Attacks on CNNs | `01_adversarial_attacks_cnns/` | ✅ Available |
| 02 | Second-Order Attacks | `02_second_order_attacks/` | ✅ Available |
| 03 | Adversarial Attacks on Object Detection | `03_adversarial_object_detection/` | ✅ Available |
| 04 | Adversarial Reprogramming | `04_adversarial_reprogramming/` | ✅ Available |
| 05 | Defenses for CNNs | `05_defenses_cnns/` | ✅ Available |
| # | Topic | Notebook | Status |
|---|---|---|---|
| 06 | Adversarial Audio Attacks | `06_adversarial_audio/` | ✅ Available |
| # | Topic | Environment | Status |
|---|---|---|---|
| 07 | Attacks on LLMs (Text-only) | Docker + Local LLM | ✅ Available |
| 08 | Attacks on Multimodal LLMs | Docker + Local LLM | ✅ Available |
| # | Topic | Notebook | Status |
|---|---|---|---|
| 09a–c | Denial of Service on ML Systems | `09_denial_of_service/` | ✅ Available |
| 09d | Visual Sponge Attacks on VLMs | Docker + Qwen2.5-VL | ✅ Available |
| 10 | Model Stealing & Knowledge Distillation Attacks | `10_model_stealing/` | ✅ Available |
| 11 | Embedding Inversion Attacks | `11_embedding_inversion/` | ✅ Available |
| # | Topic | Notebook | Status |
|---|---|---|---|
| 12a | Data Poisoning Fundamentals | `12_data_poisoning_attacks/` | ✅ Available |
| 12b | RAG & Context Poisoning | `12_data_poisoning_attacks/` | ✅ Available |
Notebook: `01_adversarial_attacks_cnns/adversarial_attacks_cnn.ipynb`
An introduction to adversarial attacks on image classifiers using ResNet50 and ImageNet. Covers the two most important white-box attacks:
- FGSM (Fast Gradient Sign Method) — Goodfellow et al., 2014
- PGD (Projected Gradient Descent) — Madry et al., 2017
What you will learn:
- How to craft adversarial examples that are imperceptible to humans but reliably fool deep neural networks
- A graphical breakdown of every component in the FGSM equation: the gradient, the sign operation, the perturbation budget ε
- Why FGSM can recover confidence at large ε (overshooting) and why PGD-20 solves this with iterative projection
- Quantitative evaluation: accuracy and confidence erosion across ε ∈ {0.005, 0.01, 0.1, 0.3} on a 20-class ImageNet subset (ε = 0.5 appears in single-image demos only)
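The single-step update behind the FGSM equation can be sketched in a few lines of PyTorch — a minimal version, assuming a differentiable classifier `model` and inputs normalised to [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: perturb x by eps in the direction of the loss gradient's sign."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # loss w.r.t. the true labels
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # the sign op makes the step L-inf bounded
    return x_adv.clamp(0, 1).detach()     # stay in the valid pixel range
```

PGD is the iterative refinement of this: repeat a smaller step and re-project into the ε-ball around the original input after each step.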
Requirements: PyTorch, torchvision, matplotlib — see 01_adversarial_attacks_cnns/requirements.txt
Notebook: `02_second_order_attacks/second_order_attacks.ipynb`
First-order attacks like FGSM and PGD follow the gradient sign. Second-order attacks use curvature information (the Hessian) to find more precise adversarial examples with smaller, less perceptible perturbations.
- L-BFGS (Szegedy et al., 2013) — the original adversarial attack; quasi-Newton optimization with logit-margin loss
- C&W L2 (Carlini & Wagner, 2017) — minimises L2 distortion directly; Adam optimizer with adaptive loss; gold standard for robustness evaluation
What you will learn:
- Why gradient steps are suboptimal and how curvature-aware updates (Newton's method, L-BFGS) find smaller perturbations
- The logit-margin objective and why it avoids the gradient saturation that breaks cross-entropy at high-confidence predictions
- Why C&W was specifically designed to defeat gradient-masking defenses
- Quantitative comparison — accuracy, L2 distortion, computation time — across FGSM, PGD-40, L-BFGS, and C&W on a 5-class ImageNet subset (50 images, 10 per class)
- Per-class accuracy breakdown: grouped bar charts, line trends, and a full attacks × classes heatmap
Key insight: Second-order attacks achieve lower L2 distortion by concentrating perturbations on the most sensitive pixels. But because they are unconstrained in L∞, individual pixels can change by more than ε — a fundamentally different threat model from FGSM/PGD.
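A stripped-down sketch of the C&W L2 formulation — tanh reparameterisation to keep pixels in [0, 1] plus the logit-margin loss; the constants, step counts, and learning rate here are illustrative, not the notebook's tuned values:

```python
import torch

def cw_l2(model, x, y, steps=100, c=1.0, lr=0.01, kappa=0.0):
    """Simplified C&W L2: minimise distortion + c * logit-margin via Adam."""
    # optimise in tanh space so the adversarial image stays in (0, 1)
    w = torch.atanh((x * 2 - 1).clamp(-0.999, 0.999)).clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = (torch.tanh(w) + 1) / 2
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        other_logit = logits.scatter(1, y.unsqueeze(1), -1e9).max(1).values
        # logit-margin: push the true class below the runner-up (avoids CE saturation)
        margin = (true_logit - other_logit + kappa).clamp(min=0)
        loss = ((x_adv - x) ** 2).flatten(1).sum(1) + c * margin
        opt.zero_grad()
        loss.sum().backward()
        opt.step()
    return ((torch.tanh(w) + 1) / 2).detach()
```

Note that nothing constrains the per-pixel change here — exactly the unconstrained-L∞ threat model described above.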
Requirements: PyTorch, torchvision, scipy, matplotlib — see 02_second_order_attacks/requirements.txt
Notebook: `03_adversarial_object_detection/adversarial_object_detection.ipynb`
Classification is just the start. Real-world AI systems rely on object detectors — deployed in surveillance cameras, autonomous vehicles, and drone systems. This module shows how white-box adversarial attacks can make persons completely invisible to YOLOv5.
The attack implemented is Adversarial Clothing — an optimised patch texture applied to the torso region of a detected person, simulating a printed t-shirt that renders the wearer invisible to surveillance cameras.
What you will learn:
- How object detectors (YOLOv5su with anchor-free head) produce detection logits that can be directly attacked via backpropagation
- The white-box patch optimisation loop: gradient descent on the pixel values of the patch, minimising the objectness score of person detections
- Why the attack is physically deployable: the patch texture is composited onto the torso bounding box, mimicking a real garment print
- The effect of training iterations on suppression confidence: how loss curves reveal when the patch has converged
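The optimisation loop reduces to gradient descent on the patch pixels. A generic sketch — the `objectness_fn` callable is a stand-in for the differentiable compositing + YOLOv5 person-objectness scoring the notebook implements:

```python
import torch

def optimise_patch(objectness_fn, patch_size=64, steps=200, lr=0.03, seed=0):
    """Minimise detector objectness w.r.t. the pixels of a printable patch."""
    torch.manual_seed(seed)
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        # objectness_fn composites the patch onto the torso box and returns
        # the person-objectness score(s) — fully differentiable end to end
        score = objectness_fn(patch.clamp(0, 1)).mean()
        opt.zero_grad()
        score.backward()
        opt.step()
    return patch.detach().clamp(0, 1)
```

The loss curve of `score` over iterations is what reveals when the patch has converged.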
Key papers:
- Thys et al. (2019). Fooling automated surveillance cameras — arXiv:1904.08653
- Xu et al. (2020). Adversarial T-shirt! Evading Person Detectors in a Physical World — arXiv:1910.11099
- Brown et al. (2017). Adversarial Patch — arXiv:1712.09665
Requirements: PyTorch, ultralytics, matplotlib — see 03_adversarial_object_detection/requirements.txt
Notebook: `04_adversarial_reprogramming/adversarial_reprogramming.ipynb`
A new class of adversarial attack that goes beyond misclassification — it hijacks a pre-trained neural network to perform a completely different task, without modifying any weights. Based on the paper by Elsayed, Goodfellow & Sohl-Dickstein (ICLR 2019).
- Adversarial Reprogramming (Elsayed et al., 2019) — arXiv:1806.11146
What you will learn:
- The core concept: how a frozen model can be repurposed to solve a different task via a learnable input transformation — without changing any weights
- The mathematical formulation: the adversarial program P, the input mapping h_f (embedding + masking), and the output mapping h_g (label remapping)
- How to implement and train an adversarial program from scratch using gradient-based optimisation
- Why this attack works: models encode surprisingly general-purpose representations that transfer across tasks
- Security implications: compute theft via API hijacking, covert channels, and safety-critical model compromise
- How adversarial reprogramming differs from classic evasion attacks and universal perturbations
Implementation note: The notebook demonstrates the concept using a lightweight sklearn MLP classifier on digit subsets (8×8 images from scikit-learn's load_digits), making it runnable without a GPU. Paper results (e.g. Inception V3 reprogrammed to classify MNIST at 97.3%) are reproduced as reference charts from the original publication.
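The input mapping h_f can be sketched directly — a minimal PyTorch version for intuition (sizes are illustrative; the notebook itself uses the sklearn setup described above):

```python
import torch

def reprogram_input(x_small, program, big=32):
    """h_f: embed the small task image in a larger canvas and add a masked
    adversarial program everywhere except the centre region."""
    n, c, s, _ = x_small.shape
    o = (big - s) // 2
    canvas = torch.zeros(n, c, big, big)
    canvas[:, :, o:o + s, o:o + s] = x_small       # task input sits in the centre
    mask = torch.ones(1, c, big, big)
    mask[:, :, o:o + s, o:o + s] = 0               # program lives only in the frame
    return canvas + mask * torch.tanh(program)     # tanh keeps the program bounded
```

The `program` tensor is the only trainable parameter; the frozen model's outputs are then remapped to task labels via h_g.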
Requirements: numpy, matplotlib, scikit-learn — lightweight, no GPU needed.
Notebook: `05_defenses_cnns/defenses_cnns.ipynb`
Attacks are only half the story. This module covers four families of defenses — from quick preprocessing heuristics to mathematically certified guarantees — and explains precisely why certifying robustness is fundamentally hard.
- Input preprocessing (JPEG compression, Gaussian smoothing, bit-depth reduction) — zero-retraining defenses that destroy high-frequency adversarial noise; evaluated against adaptive attackers to show their limitations
- Adversarial Training (FGSM-AT) — the minimax training objective; demonstrated by fine-tuning a frozen ResNet50 head on a 2-class subset (tench vs parachute) using a 50/50 clean+adversarial mix for 10 epochs, with side-by-side standard vs adversarial training comparison
- Randomized Smoothing (Cohen et al., 2019) — Monte Carlo smoothed classifier with probabilistic L₂ certified radius $r = \sigma \cdot \Phi^{-1}(p_A)$; accuracy vs radius tradeoff sweep across σ ∈ {0.12, 0.25, 0.50}
- IBP Certified Training (Interval Bound Propagation) — deterministic L∞ certification via linear relaxation; shows how IBP provides tight bounds for L∞ balls through linear layers
- Why L₂ is harder to certify than L∞ — geometric intuition: L∞ balls stay axis-aligned through linear layers (IBP is tight), while L₂ balls become ellipsoids (IBP is a loose over-approximation); illustrated with a 3-panel figure
What you will learn:
- Why heuristic preprocessing defenses fail against adaptive attackers who craft examples through the defense
- How the adversarial training minimax objective formally trades clean accuracy for empirical robustness
- How randomized smoothing converts any classifier into one with a provable L₂ robustness certificate
- How IBP provides deterministic L∞ certified bounds and why it complements randomized smoothing
- Why the L∞ threat model (FGSM/PGD) is easier to certify deterministically than the L₂ threat model (C&W)
- How to read and interpret robustness benchmarks (RobustBench)
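The smoothing certificate itself is one line of arithmetic. A sketch of the radius computation from Cohen et al.'s formula, where `p_a` is a lower confidence bound on the smoothed top-class probability:

```python
from scipy.stats import norm

def certified_radius(p_a, sigma):
    """L2 certified radius r = sigma * Phi^{-1}(p_A) of a smoothed classifier.
    Certification is only possible when the top class wins a clear majority."""
    if p_a <= 0.5:
        return 0.0   # abstain: no certificate
    return sigma * norm.ppf(p_a)
```

For example, with σ = 0.25 and p_a = 0.9 the certified L₂ radius is ≈ 0.32; larger σ buys a larger radius at the cost of clean accuracy — the tradeoff the σ sweep above visualises.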
Requirements: PyTorch, torchvision, scipy, matplotlib — see 05_defenses_cnns/requirements.txt
The same gradient-based attack math that fools image classifiers can be applied to audio models — by treating the mel spectrogram as a 2D image. This module demonstrates white-box attacks on a speech-to-text model (OpenAI Whisper) and an audio event classifier (Audio Spectrogram Transformer).
Three attack scenarios are covered:
- Targeted adversarial STT — perturb audio in the waveform domain with PGD so that Whisper outputs a specific target phrase (e.g. "open the door"), while the audio sounds unchanged to a human listener
- Adversarial audio event classification — fool MIT's AST classifier (527-class AudioSet) into misclassifying a sound clip using an Adam-based waveform-domain optimisation; shows why naive spectrogram-domain attacks break on round-trip and how waveform-domain perturbations fix it
- FGSM vs multi-step comparison — side-by-side evaluation of single-step (FGSM) vs Adam-based multi-step attacks on the AST evasion task, with spectrogram visualisations
What you will learn:
- How the mel spectrogram pipeline (waveform → STFT → mel filterbank → log compression) creates a differentiable image-like representation
- The round-trip problem: why perturbations applied in the spectrogram domain don't survive waveform reconstruction, and why both attacks operate in the raw waveform domain instead
- How to measure imperceptibility in audio: L∞ norm on the waveform as the acoustic analogue of pixel-space perturbation budgets
- Why AST (Audio Spectrogram Transformer) uses the same ViT patch-attention architecture as image transformers — making it vulnerable to gradient-based attacks
- The difference between targeted attacks (force a specific output) and untargeted evasion (any wrong label)
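The enabling fact is that the spectrogram pipeline is differentiable end to end, so a loss computed on the spectrogram backpropagates to the raw waveform. A minimal check in plain PyTorch (torchaudio's `MelSpectrogram` wraps the same ops):

```python
import torch

# A loss on the log-power spectrogram reaches the waveform via autograd —
# which is what makes waveform-domain PGD/Adam attacks possible.
wav = torch.randn(16000, requires_grad=True)            # 1 s of 16 kHz audio
spec = torch.stft(wav, n_fft=400, hop_length=160,
                  window=torch.hann_window(400), return_complex=True)
log_power = torch.log(spec.abs() ** 2 + 1e-6)           # log compression
log_power.sum().backward()
print(wav.grad.shape)                                   # gradient w.r.t. the waveform
```

This is why both notebook attacks optimise the waveform directly: the perturbation is guaranteed to survive, whereas a perturbed spectrogram generally has no exactly matching waveform.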
Key models:
- OpenAI Whisper (`whisper-base`) — targeted STT attack in the waveform domain
- `MIT/ast-finetuned-audioset-10-10-0.4593` — 527-class AudioSet event classifier, attacked via Adam-based waveform optimisation
Requirements: openai-whisper, torchaudio, librosa, transformers, soundfile — see install cell in notebook.
Modern LLMs introduce a completely new attack surface. Unlike image classifiers, LLMs are prompted with natural language — and that same flexibility that makes them powerful also makes them exploitable.
Notebook: `07_llm_attacks_text/` — Docker + local LLM (llama3.2:3b via Ollama)
Six interactive challenges, each simulating a real company chatbot with a different vulnerability. Everything runs locally — no API keys, no cloud costs.
| # | Challenge | Technique | Difficulty |
|---|---|---|---|
| 1 | Prompt Injection | Override system instructions to leak a hidden promo code | Easy |
| 2 | Jailbreaking | Break a hard-scoped chatbot out of its persona using creative framing | Medium |
| 3 | Indirect Prompt Injection | Poison a RAG knowledge base via the admin panel; trigger retrieval to execute your payload | Hard |
| 4 | Data Exfiltration | Extract confidential credentials using encoding tricks and character-by-character extraction | Hard |
| 5 | Markdown Exfiltration | Leak secrets silently via a rendered markdown image URL | Hard |
| 6 | Guardrails Bypass | Evade a keyword-based content filter using synonyms, foreign languages, and indirect framing | Medium |
What you will learn:
- How prompt injection hijacks LLM behaviour when user input overrides system instructions
- Why instruction-based guardrails alone are insufficient — and how creative framing defeats them
- How RAG pipelines create an indirect injection surface through retrieved documents
- Why keyword-based content filters are fundamentally bypassable
- How markdown rendering in a browser can silently exfiltrate secrets to attacker-controlled servers
Environment:
- LLM: `llama3.2:3b` (≈2 GB download, runs on CPU)
- Stack: FastAPI app + Ollama inference server, orchestrated with Docker Compose
- Interface: Browser-based chat UI with per-challenge hints and solution walkthroughs
```bash
cd 07_llm_attacks_text
docker compose up --build
# First run pulls llama3.2:3b (~2 GB) — takes a few minutes
# Open http://localhost:8080 — try to break the chatbot!
```

Notebook: `08_llm_attacks_multimodal/` — Docker + local multimodal LLM (LLaVA via Ollama)
Five interactive challenges, each simulating a company chatbot that accepts both text and images. Everything runs locally — no API keys, no cloud costs.
| # | Challenge | Attack Type | Difficulty |
|---|---|---|---|
| 1 | Document Scan | OCR-based visual prompt injection — embed instructions in an image | Easy |
| 2 | FigStep | Typographic jailbreak — render the prohibited request as image typography | Medium |
| 3 | Authority Override | Cross-modal authority injection — forge an official directive image | Medium |
| 4 | Phantom Patch | Adversarial patch bypass — use a pre-computed CLIP patch to confuse content moderation | Hard |
| 5 | Slow Burn | Multi-turn visual manipulation — combine image framing with progressive context shift | Hard |
What you will learn:
- Why visual inputs bypass text-based safety filters (OCR-based prompt injection)
- How the FigStep attack renders harmful text as image typography to evade input classifiers
- Why authority markers in images are trivially spoofable (cross-modal authority injection)
- How adversarial patches exploit non-robust neural network vision features (CLIP gradient ascent)
- How multi-turn conversations accumulate context that progressively erodes model restrictions
Companion Jupyter notebook (adversarial_patch_generator.ipynb):
Generates a real CLIP gradient-based adversarial patch — the mathematical foundation of Challenge 4.
Covers CLIP embedding space, gradient ascent optimisation, and defences (feature squeezing, adversarial training, randomised smoothing).
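The gradient-ascent core of the companion notebook can be sketched generically — here `image_encoder` is a stand-in for CLIP's vision tower, and the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def patch_ascent(image_encoder, target_emb, patch_init, steps=100, lr=0.05):
    """Gradient ascent on the cosine similarity between the patched image's
    embedding and a target embedding — the core of a CLIP adversarial patch."""
    patch = patch_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        emb = image_encoder(patch.clamp(0, 1))
        loss = -F.cosine_similarity(emb, target_emb).mean()  # maximise similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)
```

A patch pushed toward a benign target embedding is what lets Challenge 4 slip past an embedding-based content moderator.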
Key papers:
- Qi et al. (2024). Visual Adversarial Examples Jailbreak Aligned Large Language Models — arXiv:2306.13213
- Gong et al. (2025). FigStep: Jailbreaking Large Vision-Language Models via Scalable Typography-based Visual Prompts — arXiv:2311.05608
- Rahmatullaev et al. (2025). Universal Adversarial Attack on Aligned Multimodal LLMs — arXiv:2502.07987
Environment:
- LLM: `llava:7b` (~4.7 GB download, runs on CPU; GPU strongly recommended)
- Stack: FastAPI app + Ollama inference server, orchestrated with Docker Compose
- Interface: Browser-based chat UI with image upload (drag-and-drop), per-challenge hints and solution walkthroughs
- Port: http://localhost:8081 (offset from Module 07 so both can run simultaneously)
```bash
cd 08_llm_attacks_multimodal
docker compose up --build
# First run pulls llava:7b (~4.7 GB) — takes several minutes
# Open http://localhost:8081 — upload images to attack the bot!
# Optional: use a higher-quality model
MODEL_NAME=llava-llama3:8b docker compose up
```

Notebooks: `09_denial_of_service/` — three standalone notebooks (09a, 09b, 09c)
While most adversarial attacks target accuracy, Denial-of-Service attacks target availability — they force models to consume maximum compute, memory, and time, making them slow or unreachable for legitimate users. This module covers three distinct DoS attack families, each with baseline measurements vs attack comparisons.
Notebook: `09_denial_of_service/09a_sponge_attacks_ml.ipynb`
Based on Shumailov et al. (2021). In ReLU networks, energy and inference time scale with the number of non-zero activations. Sponge examples are crafted via gradient ascent to maximise activation density — the adversarial inverse of pruning.
- Registers forward hooks on all 17 ReLU layers of ResNet50
- 200-step Adam gradient ascent on activation L1 norm (ε-bounded perturbation)
- Demonstrates batch-level contamination: one sponge input slows the entire batch
- Epsilon sweep showing activation magnitude vs perturbation budget
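The hook-plus-ascent loop can be sketched as follows — a minimal version of the same idea; the notebook instruments all 17 ReLUs of ResNet50 this way:

```python
import torch

def sponge_attack(model, x, eps=0.03, steps=50, lr=0.01):
    """Craft a sponge example: maximise total ReLU activation magnitude (L1)
    under an L-inf budget eps — the adversarial inverse of pruning."""
    acts = []
    hooks = [m.register_forward_hook(lambda m, i, o: acts.append(o.abs().sum()))
             for m in model.modules() if isinstance(m, torch.nn.ReLU)]
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        acts.clear()
        model((x + delta).clamp(0, 1))          # hooks collect activations
        loss = -torch.stack(acts).sum()         # ascend on activation magnitude
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)            # project back into the eps-ball
    for h in hooks:
        h.remove()
    return (x + delta).detach().clamp(0, 1)
```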
Key result: Activation magnitude increased 105x (19M → 2B L1 norm). Note: timing impact is hardware-dependent — sparse-compute accelerators (not Apple MPS) show the largest latency increase.
Metrics: inference time (ms), activation density, activation magnitude, memory (MB)
Notebook: `09_denial_of_service/09b_llm_dos_token_inflation.ipynb`
Character count ≠ token count. LLMs charge compute per token — sponge inputs exploit tokenizer behaviour to maximise tokens-per-character, exhausting API budgets without sending more "words."
Three attack vectors:
- Input token inflation — rare Unicode, mathematical symbols, and long compound words tokenise into many more tokens than equivalent-length English text
- Context window flooding — sponge-character filler reaches target token counts with far fewer bytes than repetitive text
- Output token inflation — prompts designed to elicit verbose, unbounded generation (list explosions, recursive expansions, step-by-step exhaustion)
Uses tiktoken (GPT-4 tokenizer) and sshleifer/tiny-gpt2 for real inference measurements — no API keys.
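You can see the effect with a rough proxy and no tokenizer download at all: byte-level BPE tokenizers operate on UTF-8 bytes, and one token covers at most a handful of bytes, so byte inflation lower-bounds token inflation (a toy illustration, not the notebook's tiktoken measurement):

```python
def bytes_per_char(text):
    # Byte-level BPE sees UTF-8 bytes: a rare 3-4-byte symbol can cost as many
    # tokens as an entire common English word
    return len(text.encode("utf-8")) / len(text)

print(bytes_per_char("hello world"))    # 1.0 — ASCII is one byte per character
print(bytes_per_char("ↂ∮𝔄" * 10))     # > 3 — rare symbols inflate the byte stream
```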
Metrics: input tokens, output tokens, chars/token ratio, inference time (ms), cost multiplier
Notebook: `09_denial_of_service/09c_reasoning_dos.ipynb`
Reasoning-capable models (o1, Claude extended thinking, Gemini thinking) incur compute proportional to their chain-of-thought length. Adversarial prompts exploit this by triggering deep reasoning chains on trivially simple problems.
Three attack vectors:
- CoT trigger amplification — phrases like "think step by step through every possibility" multiply output token count vs a direct question
- Adversarial math problems — problems that appear O(1) but force O(n) reasoning: prime enumeration, recursive expansion, state-space search
- VLM reasoning amplification — analytical model of how image complexity drives visual description tokens (conceptual, based on published benchmarks)
Includes a business cost multiplier calculation: how amplification translates to API cost inflation at GPT-4 pricing.
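The cost-multiplier arithmetic itself is straightforward — the per-1k-token prices below are illustrative placeholders, not actual GPT-4 rates:

```python
def cost_multiplier(base_in, base_out, atk_in, atk_out,
                    price_in=0.01, price_out=0.03):
    """Ratio of per-request API cost under attack vs baseline (prices per 1k tokens)."""
    cost = lambda i, o: i / 1000 * price_in + o / 1000 * price_out
    return cost(atk_in, atk_out) / cost(base_in, base_out)

# Same-size prompt, but the triggered reasoning chain inflates output 50x
print(round(cost_multiplier(50, 40, 60, 2000), 1))   # 35.6
```

Because output tokens are priced several times higher than input tokens, output amplification dominates the bill.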
Metrics: input tokens, output tokens, amplification ratio (output/input), inference time (ms), estimated API cost
Key papers:
- Shumailov et al. (2021). Sponge Examples: Energy-Latency Attacks on Neural Networks — arXiv:2006.03463
Requirements: PyTorch, torchvision, transformers, tiktoken, pandas, psutil — see 09_denial_of_service/requirements.txt
```bash
cd AISecurity/09_denial_of_service
pip install -r requirements.txt
jupyter notebook 09a_sponge_attacks_ml.ipynb
jupyter notebook 09b_llm_dos_token_inflation.ipynb
jupyter notebook 09c_reasoning_dos.ipynb
```

Notebook: `09_denial_of_service/vlm_dos/09d_vlm_visual_dos.ipynb` — Docker + local VLM (Qwen2.5-VL via Ollama)
A real demonstrated attack: certain images cause Qwen2.5-VL to take 15+ seconds instead of 1-2 seconds. This module systematically decomposes the attack and measures the two contributing vectors.
The attack combines two DoS mechanisms in a single image:
- Visual complexity overload — dense doodle fills every ViT patch with non-trivial content, maximising visual encoder attention compute
- Embedded adversarial instruction — text inside the image says "Count every single dot before answering", forcing the model into unbounded chain-of-thought reasoning on an impossible counting task
Three experiments:
| Experiment | What it isolates |
|---|---|
| Same neutral prompt, 6 different images | Image content as the only variable |
| Same sponge image, 6 different prompts | Shows embedded instruction dominates regardless of text prompt |
| 3 concurrent requests: sponge vs baseline | Server availability degradation under attack |
Images tested: the original sponge doodle, plain text on white, blank white, dense noise, fractal doodle, and a counting instruction on a clean background — isolating the visual-complexity and embedded-instruction contributions independently.
Environment:
- Model: `qwen2.5vl:7b` (~5.5 GB, runs on CPU; GPU strongly recommended)
- Stack: Ollama inference server via Docker Compose
- Port: http://localhost:11434
```bash
cd AISecurity/09_denial_of_service/vlm_dos
docker compose up -d
./entrypoint.sh   # pulls qwen2.5vl:7b (~5.5 GB, first run only)
pip install -r requirements.txt
jupyter notebook 09d_vlm_visual_dos.ipynb
```

Notebook: `10_model_stealing/model_stealing.ipynb`
A black-box attacker with nothing but access to a prediction API can train a surrogate model that closely approximates the original — without ever seeing the model's weights, architecture, or training data. This module demonstrates a full model extraction attack against a loan-approval ML API.
- Tramèr et al. (2016) — Stealing Machine Learning Models via Prediction APIs — arXiv:1609.02943
What you will learn:
- Why black-box access to an API is enough to steal a model's behaviour
- How random uniform query sampling in the input space systematically covers the victim's decision landscape
- The fidelity metrics used to evaluate a stolen model: R² (explained variance), MAE, Pearson correlation
- How the query budget controls the accuracy/cost trade-off: R² rises from 0.36 at 50 queries to 0.97 at 2,000 queries
- Why this attack is dangerous: the stolen surrogate enables white-box adversarial attacks against a system the attacker has no direct access to
- Three practical defenses: output perturbation (Gaussian noise), query rate limiting, and model watermarking — and why each has limits
Implementation:
- Victim model: sklearn MLPRegressor (64→32→1), trained on a synthetic loan-approval dataset (3 inputs: `income`, `credit_score`, `loan_to_value`)
- Victim API: Flask server running in a background thread — POST `/predict`, GET `/health`
- Surrogate model: a different MLPRegressor (32→16→1) trained purely on stolen (input, output) pairs
- With 1,000 API queries, the surrogate achieves R² = 0.95 and MAE = 0.036
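The extraction loop itself is short. A sketch under the same threat model — the `query_api` callable is a stand-in for HTTP calls to the victim's prediction endpoint:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def steal_model(query_api, bounds, n_queries=1000, seed=0):
    """Black-box extraction: sample random inputs within the feature bounds,
    query the victim, and fit a surrogate on the stolen (input, output) pairs."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T                       # per-feature [lo, hi]
    X = rng.uniform(lo, hi, size=(n_queries, len(lo)))
    y = np.array([query_api(x) for x in X])           # one API call per sample
    surrogate = MLPRegressor(hidden_layer_sizes=(32, 16),
                             max_iter=2000, random_state=seed)
    surrogate.fit(X, y)
    return surrogate
```

Fidelity is then measured by scoring the surrogate against fresh victim outputs (R², MAE) — the query budget is the knob that trades cost for fidelity.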
Requirements: numpy, matplotlib, scikit-learn, flask, requests — lightweight, no GPU needed.
```bash
cd AISecurity/10_model_stealing
pip install -r requirements.txt
jupyter notebook model_stealing.ipynb
```

Optional — run the victim server standalone (for external testing):

```bash
python victim_server.py
# Exposes http://localhost:5050/predict
```

Notebook: `11_embedding_inversion/embedding_inversion_attacks.ipynb`
Text embeddings are widely assumed to be a safe, anonymised representation of data — you can go from text → vector, but not back. This module demonstrates that this assumption is largely false, and that an attacker with access to embedding vectors can recover the original sensitive text with high fidelity.
Three attack methods are implemented and compared:
| Method | Threat Model | Requirements | Recovery Quality |
|---|---|---|---|
| Nearest Neighbour | Corpus-based lookup | Reference corpus | Exact match when text is in corpus |
| Hill-Climbing | Black-box iterative | Model access only | Partial (topic + key tokens) |
| Vec2Text | White-box, learned inverter | Pre-trained corrector model | ~97 BLEU @ 32 tokens (gtr-base) |
What you will learn:
- Why text embeddings preserve enough semantic information to be reversed — and why better models are more invertible, not less
- How to mount a nearest-neighbour attack against a leaked vector database using FAISS at scale
- How token-substitution hill-climbing reconstructs text without any reference corpus
- How Vec2Text's iterative hypothesize-and-correct algorithm (Morris et al., EMNLP 2023) achieves near-verbatim recovery at ~97 BLEU for 32-token sequences
- What sensitive content is most at risk: medical records, PII, credentials, financial data
- The privacy–utility trade-off of noise-based mitigations (Gaussian noise, dimensionality reduction, quantisation)
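The nearest-neighbour attack from the table reduces to a cosine-similarity lookup — a minimal NumPy sketch (the notebook uses FAISS for scale):

```python
import numpy as np

def nn_invert(leaked_vec, corpus_texts, corpus_embs):
    """Recover text from a leaked embedding via cosine similarity against a
    reference corpus — exact recovery when the text appears in the corpus."""
    sims = corpus_embs @ leaked_vec / (
        np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(leaked_vec) + 1e-12)
    return corpus_texts[int(np.argmax(sims))]
```

Even when the exact text is absent, the top hits leak topic and key entities — the starting point for the hill-climbing refinement.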
Key result: For short sensitive sentences (≤32 tokens), Vec2Text recovers the original text with ~97 BLEU using only the embedding vector — no access to the database or original documents required.
Key papers:
- Morris, J. et al. (2023). Text Embeddings Reveal (Almost) As Much As Text. EMNLP 2023. arXiv:2310.06816
- Li, B. et al. (2023). Sentence Embedding Leaks More Information than You Expect. ACL 2023.
- Song, C. & Raghunathan, A. (2020). Information Leakage in Embedding Models. CCS 2020.
Requirements: sentence-transformers, torch, transformers, scikit-learn, matplotlib, nltk. vec2text is optional (enables live inversion; ~3 GB corrector model).
```bash
cd AISecurity/11_embedding_inversion
pip install -r requirements.txt
# Optional: pip install vec2text (enables live Vec2Text demo)
jupyter notebook embedding_inversion_attacks.ipynb
```

Machine learning models learn from data. If an attacker can corrupt that data — before or during training — the model learns wrong patterns permanently. Unlike adversarial examples (which attack the model at inference), data poisoning attacks the training pipeline itself.
Notebook: `12_data_poisoning_attacks/12a_data_poisoning_fundamentals.ipynb`
An introduction to the three core families of data poisoning attacks, implemented from scratch using scikit-learn with full visualisations. No GPUs or API keys required.
| Attack | Goal | Detectability |
|---|---|---|
| Label Flipping (Availability) | Degrade global model accuracy | Medium — accuracy drops globally |
| Targeted Integrity Attack | Misclassify one specific input | Low — all other predictions stay correct |
| Backdoor / Trojan Attack | Implant a hidden trigger | Very low — model is accurate until trigger fires |
What you will learn:
- Why flipping just 20% of training labels can reduce model accuracy to near-random guessing
- How targeted poisoning shifts decision boundaries by injecting samples near a specific input
- How backdoor attacks achieve near-100% attack success rate while maintaining normal clean accuracy — the stealth property that makes them dangerous
- Defense families: data sanitization, neural cleanse, STRIP runtime detection, differential privacy, provenance tracking
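The availability attack in the first row of the table is a few lines — a minimal sketch matching the notebook's scikit-learn setting:

```python
import numpy as np

def flip_labels(y, frac=0.2, n_classes=2, seed=0):
    """Label-flipping availability attack: reassign a random fraction of
    training labels to a different class."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    # adding 1..n_classes-1 mod n_classes guarantees the label actually changes
    y_poisoned[idx] = (y_poisoned[idx]
                       + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_poisoned
```

Training on `flip_labels(y)` instead of `y` is all it takes to reproduce the global accuracy collapse described above.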
Key papers:
- Biggio et al. (2012). Poisoning Attacks against SVMs. ICML 2012.
- Gu et al. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733
- Shafahi et al. (2018). Poison Frogs! Targeted Clean-Label Poisoning Attacks. NeurIPS 2018. arXiv:1804.00792
- Wang et al. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks. IEEE S&P 2019.
Requirements: numpy, matplotlib, scikit-learn — lightweight, no GPU needed.
```bash
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12a_data_poisoning_fundamentals.ipynb
```

Notebook: `12_data_poisoning_attacks/12b_rag_context_poisoning.ipynb`
Retrieval-Augmented Generation (RAG) systems answer questions by first fetching relevant documents from a knowledge base and feeding them to an LLM. This creates a new attack surface: if the knowledge base is poisoned, the LLM answers with attacker-controlled information — even if the model itself is perfectly fine.
This module builds a full working RAG system (LangChain + LangGraph + FAISS + local embeddings) against a fictional company knowledge base, then demonstrates three progressive attacks.
RAG pipeline architecture:
```
Question → [FAISS Vector Store] → Top-k Documents → [LLM] → Answer
                  ↑
             (poisonable)
```
| Attack | Technique | Effect |
|---|---|---|
| Corpus Poisoning | Inject false documents into the knowledge base | LLM answers with attacker's false facts |
| Prompt Injection via Context | Embed `IGNORE PREVIOUS INSTRUCTIONS` in a document | Injected instructions can override LLM behaviour |
| Context Window Flooding | Fill the knowledge base with keyword-rich junk docs | Real documents are pushed out of top-k retrieval |
What you will learn:
- How LangGraph orchestrates a stateful retrieve → generate pipeline
- Why RAG systems inherit the trustworthiness problems of their document corpus
- How corpus poisoning differs from direct prompt injection (false facts vs behaviour override)
- How context window flooding exploits semantic similarity ranking to degrade retrieval
- Defense strategies: document authentication, source provenance, input sanitisation, output validation, semantic anomaly detection
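The flooding attack targets the retrieval step directly. A toy sketch of top-k cosine retrieval and how near-duplicate, keyword-rich junk crowds out the genuine document (NumPy stand-in for the FAISS store; the vectors are fabricated for illustration):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Cosine-similarity retrieval — the poisonable step of the RAG pipeline."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    return np.argsort(-sims)[:k]

# One genuine doc plus flood docs embedded ever closer to the query direction:
query = np.array([1.0, 0.0, 0.0])
real_doc = np.array([0.9, 0.3, 0.0])        # index 0: the true answer
junk = np.array([[0.99, 0.01, 0.0]] * 3)    # keyword-rich flood docs
docs = np.vstack([real_doc, junk])
print(top_k(query, docs, k=3))              # the real doc is pushed out of top-k
```

Everything downstream of retrieval — including a perfectly aligned LLM — then only ever sees the attacker's documents.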
Environment:
- LLM: Local Ollama model (`llama3.2` or any pulled model) — no API keys required
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2` (local, ~90 MB)
- Fallback: Deterministic MockLLM when Ollama is not running — all attacks still demonstrate correctly
```bash
# Option A — local Ollama
ollama pull llama3.2
ollama serve

# Option B — Docker
docker run -d -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3.2

# Run the notebook
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12b_rag_context_poisoning.ipynb
```

Key papers:
- OWASP Top 10 for LLM Applications (2023). LLM06: Sensitive Information Disclosure, LLM07: Insecure Plugin Design. owasp.org
- Goldblum et al. (2022). Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. IEEE TPAMI. arXiv:2012.10544
- Greshake et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173
```bash
git clone https://github.com/elcronos/AISecurity.git

# Create a virtual environment (do this once)
python -m venv venv && source venv/bin/activate   # macOS/Linux
# python -m venv venv && venv\Scripts\activate    # Windows
```

Module 01 — Adversarial Attacks on CNNs
```bash
cd AISecurity/01_adversarial_attacks_cnns
pip install -r requirements.txt
jupyter notebook adversarial_attacks_cnn.ipynb
```

Module 02 — Second-Order Attacks
```bash
cd AISecurity/02_second_order_attacks
pip install -r requirements.txt
jupyter notebook second_order_attacks.ipynb
```

Module 03 — Adversarial Attacks on Object Detection
```bash
cd AISecurity/03_adversarial_object_detection
pip install -r requirements.txt
jupyter notebook adversarial_object_detection.ipynb
```

Module 04 — Adversarial Reprogramming
```bash
cd AISecurity/04_adversarial_reprogramming
pip install numpy matplotlib scikit-learn jupyter
jupyter notebook adversarial_reprogramming.ipynb
```

Module 05 — Defenses for CNNs
```bash
cd AISecurity/05_defenses_cnns
pip install -r requirements.txt
jupyter notebook defenses_cnns.ipynb
```

Module 06 — Adversarial Audio Attacks
```bash
cd AISecurity/06_adversarial_audio
pip install openai-whisper torchaudio librosa soundfile transformers accelerate matplotlib numpy torch
jupyter notebook adversarial_audio.ipynb
```

Apple Silicon (M1/M2/M3/M4): PyTorch will automatically use the MPS GPU backend for a 5–15× speedup over CPU. Requires PyTorch ≥ 1.12 and macOS ≥ 12.3. C&W and L-BFGS are optimization-based attacks — running on MPS is strongly recommended over CPU.
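The backend selection described above usually reduces to a small piece of device-picking logic. A minimal sketch (the `pick_device` helper is illustrative, not part of the notebooks):

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Return the fastest available PyTorch device string."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"  # Apple Silicon GPU (Metal Performance Shaders)
    return "cpu"

# In a notebook this would be driven by the real capability checks:
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())

print(pick_device(False, True))  # on an M-series Mac without CUDA → mps
```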
Requirements: Docker Desktop (or Docker Engine + Compose plugin)

Module 07 — Attacks on LLMs (Text-only)
```bash
cd AISecurity/07_llm_attacks_text
docker compose up --build
```
- First run downloads `llama3.2:3b` (~2 GB) — subsequent starts are instant
- Open http://localhost:8080 in your browser
- To use a different model: `MODEL_NAME=qwen2.5:7b docker compose up`
- To stop: `docker compose down`
Docker environments run entirely locally — no API keys, no cloud costs, no data sent externally.
Modules 09a–c — Denial of Service on ML Systems
```bash
cd AISecurity/09_denial_of_service
pip install -r requirements.txt
jupyter notebook 09a_sponge_attacks_ml.ipynb        # CNN sponge attacks
jupyter notebook 09b_llm_dos_token_inflation.ipynb  # LLM token inflation
jupyter notebook 09c_reasoning_dos.ipynb            # Reasoning chain amplification
```
- All notebooks run fully locally — no API keys, no cloud costs
- `09a` uses ResNet50 (PyTorch); `09b`/`09c` download small GPT-2 models from HuggingFace (~500 MB) on first run
- Apple Silicon (MPS) supported for `09a`
Module 09d — Visual Sponge Attacks on VLMs
```bash
cd AISecurity/09_denial_of_service/vlm_dos
docker compose up -d
./entrypoint.sh   # pulls qwen2.5vl:7b (~5.5 GB, first run only)
pip install -r requirements.txt
jupyter notebook 09d_vlm_visual_dos.ipynb
```
- Requires Docker Desktop
- First run downloads `qwen2.5vl:7b` (~5.5 GB) — subsequent starts are instant
- GPU strongly recommended; CPU works but is slow
Module 11 — Embedding Inversion Attacks
```bash
cd AISecurity/11_embedding_inversion
pip install -r requirements.txt
# Optional: pip install vec2text  (enables live Vec2Text inversion; downloads ~3 GB corrector)
jupyter notebook embedding_inversion_attacks.ipynb
```
- All three attack methods (nearest neighbour, hill-climbing, Vec2Text) run fully locally — no API keys
- `sentence-transformers/all-MiniLM-L6-v2` (~90 MB) is downloaded on first run for methods 1 & 2
- Vec2Text requires `sentence-transformers/gtr-t5-base` (~250 MB) + a T5-large corrector (~3 GB)
- Apple Silicon (MPS) supported; CPU works for all methods
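The nearest-neighbour method (method 1) reduces to comparing a leaked embedding against a corpus of known texts. A toy sketch of that idea, using seeded random vectors in place of real sentence embeddings (everything here is illustrative, not the notebook's code):

```python
import math
import random

random.seed(0)

# Stand-in for a public corpus of candidate sentences and their embeddings.
corpus = ["the meeting is at noon", "my password is hunter2", "send the report"]

def _unit(v):
    """Normalize a vector to unit length so dot product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

corpus_emb = [_unit([random.gauss(0, 1) for _ in range(8)]) for _ in corpus]

def invert_by_nearest_neighbour(leaked_emb):
    """Guess the text behind a leaked embedding via cosine similarity."""
    leaked = _unit(leaked_emb)
    sims = [sum(a * b for a, b in zip(e, leaked)) for e in corpus_emb]
    return corpus[sims.index(max(sims))]

# An attacker who intercepts a (slightly noisy) copy of corpus[1]'s
# embedding recovers its text without ever seeing the plaintext.
leaked = [x + random.gauss(0, 0.01) for x in corpus_emb[1]]
print(invert_by_nearest_neighbour(leaked))  # → my password is hunter2
```

With real models, `corpus_emb` would come from the same encoder that produced the leaked vector; the notebook's hill-climbing and Vec2Text methods go further by recovering text outside any fixed corpus.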
Module 12a — Data Poisoning Fundamentals
```bash
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12a_data_poisoning_fundamentals.ipynb
```
- Fully local — no API keys, no GPU needed
- Runs in under 60 seconds on any laptop
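The label-flipping attack covered in 12a can be sketched in a few lines of plain Python, independent of the notebook: corrupt a fraction of the training labels and watch a simple nearest-centroid classifier collapse (toy 1-D data; all names here are illustrative):

```python
import random

random.seed(1)

# Two 1-D clusters: class 0 near -2, class 1 near +2.
X = [random.gauss(-2, 0.5) for _ in range(50)] + [random.gauss(2, 0.5) for _ in range(50)]
y = [0] * 50 + [1] * 50

def centroid_accuracy(X, y_train, y_true):
    """Fit a per-class mean on y_train; score predictions against y_true."""
    means = {}
    for c in (0, 1):
        pts = [x for x, t in zip(X, y_train) if t == c]
        means[c] = sum(pts) / len(pts)
    preds = [min(means, key=lambda c: abs(x - means[c])) for x in X]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

clean_acc = centroid_accuracy(X, y, y)

# Poison: deterministically flip 60% of the training labels in each class,
# which drags each class centroid toward the opposite cluster.
y_poisoned = [1 - t if i % 10 < 6 else t for i, t in enumerate(y)]
poisoned_acc = centroid_accuracy(X, y_poisoned, y)

print(f"clean accuracy: {clean_acc:.2f}, poisoned accuracy: {poisoned_acc:.2f}")
```

Flipping a majority of labels swaps the learned centroids outright, so accuracy on clean labels falls to near zero; the notebook explores subtler variants (targeted flips, backdoor triggers) where far fewer poisoned points suffice.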
Module 12b — RAG & Context Poisoning
```bash
# Start Ollama (or use Docker — see module description above)
ollama pull llama3.2 && ollama serve
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12b_rag_context_poisoning.ipynb
```
- Falls back to MockLLM automatically if Ollama is not running — all three attacks still work
- `sentence-transformers/all-MiniLM-L6-v2` (~90 MB) downloaded on first run
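The automatic fallback boils down to probing the default Ollama endpoint and dropping to a mock when nothing answers. A hedged sketch using only the standard library (`pick_backend` is illustrative — the notebook's actual implementation may differ):

```python
import json
import urllib.error
import urllib.request

def pick_backend(base_url: str = "http://localhost:11434") -> str:
    """Return 'ollama' if a server answers at base_url, else 'mock'."""
    try:
        # GET /api/tags is Ollama's cheap "list local models" endpoint.
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            json.load(resp)
        return "ollama"
    except (urllib.error.URLError, OSError, ValueError):
        return "mock"  # deterministic MockLLM fallback

print(pick_backend())  # 'mock' unless a local Ollama is running
```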
```
AISecurity/
├── 01_adversarial_attacks_cnns/
│   ├── adversarial_attacks_cnn.ipynb
│   └── requirements.txt
├── 02_second_order_attacks/
│   ├── second_order_attacks.ipynb
│   └── requirements.txt
├── 03_adversarial_object_detection/
│   ├── adversarial_object_detection.ipynb
│   └── requirements.txt
├── 04_adversarial_reprogramming/
│   └── adversarial_reprogramming.ipynb
├── 05_defenses_cnns/
│   ├── defenses_cnns.ipynb
│   └── requirements.txt
├── 06_adversarial_audio/
│   └── adversarial_audio.ipynb
├── 07_llm_attacks_text/
│   ├── docker-compose.yml
│   └── app/
│       ├── main.py           # FastAPI app + 6 challenge configs
│       ├── rag_engine.py     # BM25 document store (Challenge 3)
│       ├── rag_graph.py      # LangGraph RAG pipeline (Challenge 3)
│       ├── auth.py           # JWT auth for admin panel (Challenge 3)
│       ├── Dockerfile
│       ├── entrypoint.sh     # Pulls LLM model on first start
│       ├── requirements.txt
│       └── static/
│           ├── index.html    # Challenge selection page
│           ├── chat.html     # Challenge chat interface
│           └── admin.html    # Knowledge base admin panel (Challenge 3)
├── 08_llm_attacks_multimodal/
│   ├── docker-compose.yml
│   ├── adversarial_patch_generator.ipynb  # Companion: CLIP gradient-based patch generation
│   └── app/
│       ├── main.py           # FastAPI app + 5 multimodal challenge configs
│       ├── image_utils.py    # Image validation, resize, base64 encoding
│       ├── Dockerfile
│       ├── entrypoint.sh     # Pulls LLaVA model on first start
│       ├── requirements.txt
│       └── static/
│           ├── index.html             # Challenge selection page
│           ├── chat.html              # Multimodal chat UI (image upload + text)
│           └── adversarial_patch.png  # Pre-generated patch for Challenge 4
├── 09_denial_of_service/
│   ├── 09a_sponge_attacks_ml.ipynb        # CNN sponge: gradient ascent on activation density
│   ├── 09b_llm_dos_token_inflation.ipynb  # Token inflation: char vs token, flooding, output inflation
│   ├── 09c_reasoning_dos.ipynb            # CoT amplification, adversarial math, VLM reasoning DoS
│   └── requirements.txt
├── 10_model_stealing/
│   ├── model_stealing.ipynb
│   ├── victim_server.py
│   └── requirements.txt
├── 11_embedding_inversion/
│   ├── embedding_inversion_attacks.ipynb  # Nearest neighbour, hill-climbing, Vec2Text inversion
│   └── requirements.txt
└── 12_data_poisoning_attacks/
    ├── 12a_data_poisoning_fundamentals.ipynb  # Label flipping, targeted attack, backdoor/trojan
    ├── 12b_rag_context_poisoning.ipynb        # RAG system + corpus poisoning, prompt injection, flooding
    └── requirements.txt
```
Camilo Pestana, PhD is an AI researcher and engineer specialising in computer vision, multimodal learning, and AI safety. This series draws from both academic research and practical red-teaming experience to make AI security accessible to a broad technical audience.
- GitHub: @elcronos
- Szegedy et al. (2013). Intriguing Properties of Neural Networks. arXiv:1312.6199
- Goodfellow et al. (2014). Explaining and Harnessing Adversarial Examples. arXiv:1412.6572
- Madry et al. (2017). Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083
- Carlini & Wagner (2017). Towards Evaluating the Robustness of Neural Networks. arXiv:1608.04644
- Cohen et al. (2019). Certified Adversarial Robustness via Randomized Smoothing. arXiv:1902.02918
- Brown et al. (2017). Adversarial Patch. arXiv:1712.09665
- Thys et al. (2019). Fooling automated surveillance cameras: adversarial patches to attack person detection. arXiv:1904.08653
- Xu et al. (2020). Adversarial T-shirt! Evading Person Detectors in a Physical World. arXiv:1910.11099
- Elsayed, Goodfellow & Sohl-Dickstein (2019). Adversarial Reprogramming of Neural Networks. arXiv:1806.11146
- Perez & Ribeiro (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527
- Greshake et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173
- Radford et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- Qi et al. (2024). Visual Adversarial Examples Jailbreak Aligned Large Language Models. arXiv:2306.13213
- Gong et al. (2025). FigStep: Jailbreaking Large Vision-Language Models via Scalable Typography-based Visual Prompts. arXiv:2311.05608
- Rahmatullaev et al. (2025). Universal Adversarial Attack on Aligned Multimodal LLMs. arXiv:2502.07987
- Tramèr et al. (2016). Stealing Machine Learning Models via Prediction APIs. arXiv:1609.02943
- Morris, J. et al. (2023). Text Embeddings Reveal (Almost) As Much As Text. EMNLP 2023. arXiv:2310.06816
- Li, B. et al. (2023). Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence. ACL 2023.
- Song, C. & Raghunathan, A. (2020). Information Leakage in Embedding Models. CCS 2020.
- Biggio, B., Nelson, B., & Laskov, P. (2012). Poisoning Attacks against Support Vector Machines. ICML 2012.
- Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733
- Shafahi, A. et al. (2018). Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. NeurIPS 2018. arXiv:1804.00792
- Turner, A., Tsipras, D., & Madry, A. (2019). Label-Consistent Backdoor Attacks. arXiv:1912.02771
- Wang, B. et al. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE S&P 2019.
- Goldblum, M. et al. (2022). Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. IEEE TPAMI. arXiv:2012.10544
This repository is licensed under the MIT License. All content is provided for educational purposes only.
Disclaimer: The techniques demonstrated in this series are for learning and research. Always obtain explicit permission before testing adversarial techniques on systems you do not own.