An Educational Series by Camilo Pestana, PhD
Understanding how AI systems can be attacked — and how to defend them.
This repository contains the materials for a series of talks and hands-on notebooks on AI Security — a rapidly growing field at the intersection of machine learning and cybersecurity.
The series is divided into three parts:
- Part 1 — Adversarial Attacks on Computer Vision Models: How image classifiers can be systematically fooled, and how to build defenses against these attacks.
- Part 2 — Adversarial Attacks on Audio Models: How speech-to-text and audio classification models can be attacked in the spectrogram domain using the same gradient-based techniques as image attacks.
- Part 3 — AI Security in Large Language Models (LLMs): How modern LLMs are vulnerable to prompt injection, jailbreaks, and other adversarial inputs — with interactive Docker environments running local LLMs where you can practice real attacks safely.
Purpose: All content in this repository is strictly for educational purposes. The goal is to build intuition about AI vulnerabilities so that researchers, engineers, and developers can build more robust and trustworthy AI systems. No content here should be used for malicious purposes.
| # | Topic | Notebook | Status |
|---|---|---|---|
| 01 | Adversarial Attacks on CNNs | `01_adversarial_attacks_cnns/` | ✅ Available |
| 02 | Second-Order Attacks | `02_second_order_attacks/` | ✅ Available |
| 03 | Adversarial Attacks on Object Detection | `03_adversarial_object_detection/` | ✅ Available |
| 04 | Adversarial Reprogramming | `04_adversarial_reprogramming/` | ✅ Available |
| 05 | Defenses for CNNs | `05_defenses_cnns/` | ✅ Available |
| # | Topic | Notebook | Status |
|---|---|---|---|
| 06 | Adversarial Audio Attacks | `06_adversarial_audio/` | ✅ Available |
| # | Topic | Environment | Status |
|---|---|---|---|
| 07 | Attacks on LLMs (Text-only) | Docker + Local LLM | ✅ Available |
| 08 | Attacks on Multimodal LLMs | Docker + Local LLM | ✅ Available |
| # | Topic | Notebook | Status |
|---|---|---|---|
| 09a–c | Denial of Service on ML Systems | `09_denial_of_service/` | ✅ Available |
| 09d | Visual Sponge Attacks on VLMs | Docker + Qwen2.5-VL | ✅ Available |
| 10 | Model Stealing & Knowledge Distillation Attacks | `10_model_stealing/` | ✅ Available |
| 11 | Embedding Inversion Attacks | `11_embedding_inversion/` | ✅ Available |
| # | Topic | Notebook | Status |
|---|---|---|---|
| 12a | Data Poisoning Fundamentals | `12_data_poisoning_attacks/` | ✅ Available |
| 12b | RAG & Context Poisoning | `12_data_poisoning_attacks/` | ✅ Available |
Notebook: `01_adversarial_attacks_cnns/adversarial_attacks_cnn.ipynb`
An introduction to adversarial attacks on image classifiers using ResNet50 and ImageNet. Covers the two most important white-box attacks:
- FGSM (Fast Gradient Sign Method) — Goodfellow et al., 2014
- PGD (Projected Gradient Descent) — Madry et al., 2017
What you will learn:
- How to craft adversarial examples that are imperceptible to humans but reliably fool deep neural networks
- A graphical breakdown of every component in the FGSM equation: the gradient, the sign operation, the perturbation budget ε
- Why FGSM can recover confidence at large ε (overshooting) and why PGD-20 solves this with iterative projection
- Quantitative evaluation: accuracy and confidence erosion across ε ∈ {0.005, 0.01, 0.1, 0.3} on a 20-class ImageNet subset (ε = 0.5 appears in single-image demos only)
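The single-step update behind the FGSM equation can be sketched in a few lines of PyTorch — a minimal version, assuming a differentiable classifier `model` and inputs normalised to [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: perturb x by eps in the direction of the loss gradient's sign."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # loss w.r.t. the true labels
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # the sign op makes the step L-inf bounded
    return x_adv.clamp(0, 1).detach()     # stay in the valid pixel range
```

PGD is the iterative refinement of this: repeat a smaller step and re-project into the ε-ball around the original input after each step.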
Requirements: PyTorch, torchvision, matplotlib — see 01_adversarial_attacks_cnns/requirements.txt
Notebook: `02_second_order_attacks/second_order_attacks.ipynb`
First-order attacks like FGSM and PGD follow the gradient sign. Second-order attacks use curvature information (the Hessian) to find more precise adversarial examples with smaller, less perceptible perturbations.
- L-BFGS (Szegedy et al., 2013) — the original adversarial attack; quasi-Newton optimization with logit-margin loss
- C&W L2 (Carlini & Wagner, 2017) — minimises L2 distortion directly; Adam optimizer with adaptive loss; gold standard for robustness evaluation
What you will learn:
- Why gradient steps are suboptimal and how curvature-aware updates (Newton's method, L-BFGS) find smaller perturbations
- The logit-margin objective and why it avoids the gradient saturation that breaks cross-entropy at high-confidence predictions
- Why C&W was specifically designed to defeat gradient-masking defenses
- Quantitative comparison — accuracy, L2 distortion, computation time — across FGSM, PGD-40, L-BFGS, and C&W on a 5-class ImageNet subset (50 images, 10 per class)
- Per-class accuracy breakdown: grouped bar charts, line trends, and a full attacks × classes heatmap
Key insight: Second-order attacks achieve lower L2 distortion by concentrating perturbations on the most sensitive pixels. But because they are unconstrained in L∞, individual pixels can change by more than ε — a fundamentally different threat model from FGSM/PGD.
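A stripped-down sketch of the C&W L2 formulation — tanh reparameterisation to keep pixels in [0, 1] plus the logit-margin loss; the constants, step counts, and learning rate here are illustrative, not the notebook's tuned values:

```python
import torch

def cw_l2(model, x, y, steps=100, c=1.0, lr=0.01, kappa=0.0):
    """Simplified C&W L2: minimise distortion + c * logit-margin via Adam."""
    # optimise in tanh space so the adversarial image stays in (0, 1)
    w = torch.atanh((x * 2 - 1).clamp(-0.999, 0.999)).clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_adv = (torch.tanh(w) + 1) / 2
        logits = model(x_adv)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        other_logit = logits.scatter(1, y.unsqueeze(1), -1e9).max(1).values
        # logit-margin: push the true class below the runner-up (avoids CE saturation)
        margin = (true_logit - other_logit + kappa).clamp(min=0)
        loss = ((x_adv - x) ** 2).flatten(1).sum(1) + c * margin
        opt.zero_grad()
        loss.sum().backward()
        opt.step()
    return ((torch.tanh(w) + 1) / 2).detach()
```

Note that nothing constrains the per-pixel change here — exactly the unconstrained-L∞ threat model described above.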
Requirements: PyTorch, torchvision, scipy, matplotlib — see 02_second_order_attacks/requirements.txt
Notebook: `03_adversarial_object_detection/adversarial_object_detection.ipynb`
Classification is just the start. Real-world AI systems rely on object detectors — deployed in surveillance cameras, autonomous vehicles, and drone systems. This module shows how white-box adversarial attacks can make persons completely invisible to YOLOv5.
The attack implemented is Adversarial Clothing — an optimised patch texture applied to the torso region of a detected person, simulating a printed t-shirt that renders the wearer invisible to surveillance cameras.
What you will learn:
- How object detectors (YOLOv5su with anchor-free head) produce detection logits that can be directly attacked via backpropagation
- The white-box patch optimisation loop: gradient descent on the pixel values of the patch, minimising the objectness score of person detections
- Why the attack is physically deployable: the patch texture is composited onto the torso bounding box, mimicking a real garment print
- The effect of training iterations on suppression confidence: how loss curves reveal when the patch has converged
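The optimisation loop reduces to gradient descent on the patch pixels. A generic sketch — the `objectness_fn` callable is a stand-in for the differentiable compositing + YOLOv5 person-objectness scoring the notebook implements:

```python
import torch

def optimise_patch(objectness_fn, patch_size=64, steps=200, lr=0.03, seed=0):
    """Minimise detector objectness w.r.t. the pixels of a printable patch."""
    torch.manual_seed(seed)
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        # objectness_fn composites the patch onto the torso box and returns
        # the person-objectness score(s) — fully differentiable end to end
        score = objectness_fn(patch.clamp(0, 1)).mean()
        opt.zero_grad()
        score.backward()
        opt.step()
    return patch.detach().clamp(0, 1)
```

The loss curve of `score` over iterations is what reveals when the patch has converged.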
Key papers:
- Thys et al. (2019). Fooling automated surveillance cameras — arXiv:1904.08653
- Xu et al. (2020). Adversarial T-shirt! Evading Person Detectors in a Physical World — arXiv:1910.11099
- Brown et al. (2017). Adversarial Patch — arXiv:1712.09665
Requirements: PyTorch, ultralytics, matplotlib — see 03_adversarial_object_detection/requirements.txt
Notebook: `04_adversarial_reprogramming/adversarial_reprogramming.ipynb`
A new class of adversarial attack that goes beyond misclassification — it hijacks a pre-trained neural network to perform a completely different task, without modifying any weights. Based on the paper by Elsayed, Goodfellow & Sohl-Dickstein (ICLR 2019).
- Adversarial Reprogramming (Elsayed et al., 2019) — arXiv:1806.11146
What you will learn:
- The core concept: how a frozen model can be repurposed to solve a different task via a learnable input transformation — without changing any weights
- The mathematical formulation: the adversarial program P, the input mapping h_f (embedding + masking), and the output mapping h_g (label remapping)
- How to implement and train an adversarial program from scratch using gradient-based optimisation
- Why this attack works: models encode surprisingly general-purpose representations that transfer across tasks
- Security implications: compute theft via API hijacking, covert channels, and safety-critical model compromise
- How adversarial reprogramming differs from classic evasion attacks and universal perturbations
Implementation note: The notebook demonstrates the concept using a lightweight sklearn MLP classifier on digit subsets (8×8 images from scikit-learn's load_digits), making it runnable without a GPU. Paper results (e.g. Inception V3 reprogrammed to classify MNIST at 97.3%) are reproduced as reference charts from the original publication.
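The input mapping h_f can be sketched directly — a minimal PyTorch version for intuition (sizes are illustrative; the notebook itself uses the sklearn setup described above):

```python
import torch

def reprogram_input(x_small, program, big=32):
    """h_f: embed the small task image in a larger canvas and add a masked
    adversarial program everywhere except the centre region."""
    n, c, s, _ = x_small.shape
    o = (big - s) // 2
    canvas = torch.zeros(n, c, big, big)
    canvas[:, :, o:o + s, o:o + s] = x_small       # task input sits in the centre
    mask = torch.ones(1, c, big, big)
    mask[:, :, o:o + s, o:o + s] = 0               # program lives only in the frame
    return canvas + mask * torch.tanh(program)     # tanh keeps the program bounded
```

The `program` tensor is the only trainable parameter; the frozen model's outputs are then remapped to task labels via h_g.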
Requirements: numpy, matplotlib, scikit-learn — lightweight, no GPU needed.
Notebook: `05_defenses_cnns/defenses_cnns.ipynb`
Attacks are only half the story. This module covers four families of defenses — from quick preprocessing heuristics to mathematically certified guarantees — and explains precisely why certifying robustness is fundamentally hard.
- Input preprocessing (JPEG compression, Gaussian smoothing, bit-depth reduction) — zero-retraining defenses that destroy high-frequency adversarial noise; evaluated against adaptive attackers to show their limitations
- Adversarial Training (FGSM-AT) — the minimax training objective; demonstrated by fine-tuning a frozen ResNet50 head on a 2-class subset (tench vs parachute) using a 50/50 clean+adversarial mix for 10 epochs, with side-by-side standard vs adversarial training comparison
- Randomized Smoothing (Cohen et al., 2019) — Monte Carlo smoothed classifier with probabilistic L₂ certified radius $r = \sigma \cdot \Phi^{-1}(p_A)$; accuracy vs radius tradeoff sweep across σ ∈ {0.12, 0.25, 0.50}
- IBP Certified Training (Interval Bound Propagation) — deterministic L∞ certification via linear relaxation; shows how IBP provides tight bounds for L∞ balls through linear layers
- Why L₂ is harder to certify than L∞ — geometric intuition: L∞ balls stay axis-aligned through linear layers (IBP is tight), while L₂ balls become ellipsoids (IBP is a loose over-approximation); illustrated with a 3-panel figure
What you will learn:
- Why heuristic preprocessing defenses fail against adaptive attackers who craft examples through the defense
- How the adversarial training minimax objective formally trades clean accuracy for empirical robustness
- How randomized smoothing converts any classifier into one with a provable L₂ robustness certificate
- How IBP provides deterministic L∞ certified bounds and why it complements randomized smoothing
- Why the L∞ threat model (FGSM/PGD) is easier to certify deterministically than the L₂ threat model (C&W)
- How to read and interpret robustness benchmarks (RobustBench)
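The smoothing certificate itself is one line of arithmetic. A sketch of the radius computation from Cohen et al.'s formula, where `p_a` is a lower confidence bound on the smoothed top-class probability:

```python
from scipy.stats import norm

def certified_radius(p_a, sigma):
    """L2 certified radius r = sigma * Phi^{-1}(p_A) of a smoothed classifier.
    Certification is only possible when the top class wins a clear majority."""
    if p_a <= 0.5:
        return 0.0   # abstain: no certificate
    return sigma * norm.ppf(p_a)
```

For example, with σ = 0.25 and p_a = 0.9 the certified L₂ radius is ≈ 0.32; larger σ buys a larger radius at the cost of clean accuracy — the tradeoff the σ sweep above visualises.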
Requirements: PyTorch, torchvision, scipy, matplotlib — see 05_defenses_cnns/requirements.txt
The same gradient-based attack math that fools image classifiers can be applied to audio models — by treating the mel spectrogram as a 2D image. This module demonstrates white-box attacks on a speech-to-text model (OpenAI Whisper) and an audio event classifier (Audio Spectrogram Transformer).
Three attack scenarios are covered:
- Targeted adversarial STT — perturb audio in the waveform domain with PGD so that Whisper outputs a specific target phrase (e.g. "open the door"), while the audio sounds unchanged to a human listener
- Adversarial audio event classification — fool MIT's AST classifier (527-class AudioSet) into misclassifying a sound clip using an Adam-based waveform-domain optimisation; shows why naive spectrogram-domain attacks break on round-trip and how waveform-domain perturbations fix it
- FGSM vs multi-step comparison — side-by-side evaluation of single-step (FGSM) vs Adam-based multi-step attacks on the AST evasion task, with spectrogram visualisations
What you will learn:
- How the mel spectrogram pipeline (waveform → STFT → mel filterbank → log compression) creates a differentiable image-like representation
- The round-trip problem: why perturbations applied in the spectrogram domain don't survive waveform reconstruction, and why both attacks operate in the raw waveform domain instead
- How to measure imperceptibility in audio: L∞ norm on the waveform as the acoustic analogue of pixel-space perturbation budgets
- Why AST (Audio Spectrogram Transformer) uses the same ViT patch-attention architecture as image transformers — making it vulnerable to gradient-based attacks
- The difference between targeted attacks (force a specific output) and untargeted evasion (any wrong label)
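The enabling fact is that the spectrogram pipeline is differentiable end to end, so a loss computed on the spectrogram backpropagates to the raw waveform. A minimal check in plain PyTorch (torchaudio's `MelSpectrogram` wraps the same ops):

```python
import torch

# A loss on the log-power spectrogram reaches the waveform via autograd —
# which is what makes waveform-domain PGD/Adam attacks possible.
wav = torch.randn(16000, requires_grad=True)            # 1 s of 16 kHz audio
spec = torch.stft(wav, n_fft=400, hop_length=160,
                  window=torch.hann_window(400), return_complex=True)
log_power = torch.log(spec.abs() ** 2 + 1e-6)           # log compression
log_power.sum().backward()
print(wav.grad.shape)                                   # gradient w.r.t. the waveform
```

This is why both notebook attacks optimise the waveform directly: the perturbation is guaranteed to survive, whereas a perturbed spectrogram generally has no exactly matching waveform.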
Key models:
- OpenAI Whisper (`whisper-base`) — targeted STT attack in the waveform domain
- `MIT/ast-finetuned-audioset-10-10-0.4593` — 527-class AudioSet event classifier, attacked via Adam-based waveform optimisation
Requirements: openai-whisper, torchaudio, librosa, transformers, soundfile — see install cell in notebook.
Modern LLMs introduce a completely new attack surface. Unlike image classifiers, LLMs are prompted with natural language — and that same flexibility that makes them powerful also makes them exploitable.
Notebook: `07_llm_attacks_text/` — Docker + local LLM (llama3.2:3b via Ollama)
Six interactive challenges, each simulating a real company chatbot with a different vulnerability. Everything runs locally — no API keys, no cloud costs.
| # | Challenge | Technique | Difficulty |
|---|---|---|---|
| 1 | Prompt Injection | Override system instructions to leak a hidden promo code | Easy |
| 2 | Jailbreaking | Break a hard-scoped chatbot out of its persona using creative framing | Medium |
| 3 | Indirect Prompt Injection | Poison a RAG knowledge base via the admin panel; trigger retrieval to execute your payload | Hard |
| 4 | Data Exfiltration | Extract confidential credentials using encoding tricks and character-by-character extraction | Hard |
| 5 | Markdown Exfiltration | Leak secrets silently via a rendered markdown image URL | Hard |
| 6 | Guardrails Bypass | Evade a keyword-based content filter using synonyms, foreign languages, and indirect framing | Medium |
What you will learn:
- How prompt injection hijacks LLM behaviour when user input overrides system instructions
- Why instruction-based guardrails alone are insufficient — and how creative framing defeats them
- How RAG pipelines create an indirect injection surface through retrieved documents
- Why keyword-based content filters are fundamentally bypassable
- How markdown rendering in a browser can silently exfiltrate secrets to attacker-controlled servers
Environment:
- LLM: `llama3.2:3b` (≈2 GB download, runs on CPU)
- Stack: FastAPI app + Ollama inference server, orchestrated with Docker Compose
- Interface: Browser-based chat UI with per-challenge hints and solution walkthroughs
```bash
cd 07_llm_attacks_text
docker compose up --build
# First run pulls llama3.2:3b (~2 GB) — takes a few minutes
# Open http://localhost:8080 — try to break the chatbot!
```

Notebook: `08_llm_attacks_multimodal/` — Docker + local multimodal LLM (LLaVA via Ollama)
Five interactive challenges, each simulating a company chatbot that accepts both text and images. Everything runs locally — no API keys, no cloud costs.
| # | Challenge | Attack Type | Difficulty |
|---|---|---|---|
| 1 | Document Scan | OCR-based visual prompt injection — embed instructions in an image | Easy |
| 2 | FigStep | Typographic jailbreak — render the prohibited request as image typography | Medium |
| 3 | Authority Override | Cross-modal authority injection — forge an official directive image | Medium |
| 4 | Phantom Patch | Adversarial patch bypass — use a pre-computed CLIP patch to confuse content moderation | Hard |
| 5 | Slow Burn | Multi-turn visual manipulation — combine image framing with progressive context shift | Hard |
What you will learn:
- Why visual inputs bypass text-based safety filters (OCR-based prompt injection)
- How the FigStep attack renders harmful text as image typography to evade input classifiers
- Why authority markers in images are trivially spoofable (cross-modal authority injection)
- How adversarial patches exploit non-robust neural network vision features (CLIP gradient ascent)
- How multi-turn conversations accumulate context that progressively erodes model restrictions
Companion Jupyter notebook (adversarial_patch_generator.ipynb):
Generates a real CLIP gradient-based adversarial patch — the mathematical foundation of Challenge 4.
Covers CLIP embedding space, gradient ascent optimisation, and defences (feature squeezing, adversarial training, randomised smoothing).
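The gradient-ascent core of the companion notebook can be sketched generically — here `image_encoder` is a stand-in for CLIP's vision tower, and the hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def patch_ascent(image_encoder, target_emb, patch_init, steps=100, lr=0.05):
    """Gradient ascent on the cosine similarity between the patched image's
    embedding and a target embedding — the core of a CLIP adversarial patch."""
    patch = patch_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        emb = image_encoder(patch.clamp(0, 1))
        loss = -F.cosine_similarity(emb, target_emb).mean()  # maximise similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)
```

A patch pushed toward a benign target embedding is what lets Challenge 4 slip past an embedding-based content moderator.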
Key papers:
- Qi et al. (2024). Visual Adversarial Examples Jailbreak Aligned Large Language Models — arXiv:2306.13213
- Gong et al. (2025). FigStep: Jailbreaking Large Vision-Language Models via Scalable Typography-based Visual Prompts — arXiv:2311.05608
- Rahmatullaev et al. (2025). Universal Adversarial Attack on Aligned Multimodal LLMs — arXiv:2502.07987
Environment:
- LLM: `llava:7b` (~4.7 GB download, runs on CPU; GPU strongly recommended)
- Stack: FastAPI app + Ollama inference server, orchestrated with Docker Compose
- Interface: Browser-based chat UI with image upload (drag-and-drop), per-challenge hints and solution walkthroughs
- Port: http://localhost:8081 (offset from Module 07 so both can run simultaneously)
```bash
cd 08_llm_attacks_multimodal
docker compose up --build
# First run pulls llava:7b (~4.7 GB) — takes several minutes
# Open http://localhost:8081 — upload images to attack the bot!
# Optional: use a higher-quality model
MODEL_NAME=llava-llama3:8b docker compose up
```

Notebooks: `09_denial_of_service/` — three standalone notebooks (09a, 09b, 09c)
While most adversarial attacks target accuracy, Denial-of-Service attacks target availability — they force models to consume maximum compute, memory, and time, making them slow or unreachable for legitimate users. This module covers three distinct DoS attack families, each with baseline measurements vs attack comparisons.
Notebook: `09_denial_of_service/09a_sponge_attacks_ml.ipynb`
Based on Shumailov et al. (2021). In ReLU networks, energy and inference time scale with the number of non-zero activations. Sponge examples are crafted via gradient ascent to maximise activation density — the adversarial inverse of pruning.
- Registers forward hooks on all 17 ReLU layers of ResNet50
- 200-step Adam gradient ascent on activation L1 norm (ε-bounded perturbation)
- Demonstrates batch-level contamination: one sponge input slows the entire batch
- Epsilon sweep showing activation magnitude vs perturbation budget
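The hook-plus-ascent loop can be sketched as follows — a minimal version of the same idea; the notebook instruments all 17 ReLUs of ResNet50 this way:

```python
import torch

def sponge_attack(model, x, eps=0.03, steps=50, lr=0.01):
    """Craft a sponge example: maximise total ReLU activation magnitude (L1)
    under an L-inf budget eps — the adversarial inverse of pruning."""
    acts = []
    hooks = [m.register_forward_hook(lambda m, i, o: acts.append(o.abs().sum()))
             for m in model.modules() if isinstance(m, torch.nn.ReLU)]
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        acts.clear()
        model((x + delta).clamp(0, 1))          # hooks collect activations
        loss = -torch.stack(acts).sum()         # ascend on activation magnitude
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)            # project back into the eps-ball
    for h in hooks:
        h.remove()
    return (x + delta).detach().clamp(0, 1)
```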
Key result: Activation magnitude increased 105x (19M → 2B L1 norm). Note: timing impact is hardware-dependent — sparse-compute accelerators (not Apple MPS) show the largest latency increase.
Metrics: inference time (ms), activation density, activation magnitude, memory (MB)
Notebook: `09_denial_of_service/09b_llm_dos_token_inflation.ipynb`
Character count ≠ token count. LLMs charge compute per token — sponge inputs exploit tokenizer behaviour to maximise tokens-per-character, exhausting API budgets without sending more "words."
Three attack vectors:
- Input token inflation — rare Unicode, mathematical symbols, and long compound words tokenise into many more tokens than equivalent-length English text
- Context window flooding — sponge-character filler reaches target token counts with far fewer bytes than repetitive text
- Output token inflation — prompts designed to elicit verbose, unbounded generation (list explosions, recursive expansions, step-by-step exhaustion)
Uses tiktoken (GPT-4 tokenizer) and sshleifer/tiny-gpt2 for real inference measurements — no API keys.
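You can see the effect with a rough proxy and no tokenizer download at all: byte-level BPE tokenizers operate on UTF-8 bytes, and one token covers at most a handful of bytes, so byte inflation lower-bounds token inflation (a toy illustration, not the notebook's tiktoken measurement):

```python
def bytes_per_char(text):
    # Byte-level BPE sees UTF-8 bytes: a rare 3-4-byte symbol can cost as many
    # tokens as an entire common English word
    return len(text.encode("utf-8")) / len(text)

print(bytes_per_char("hello world"))    # 1.0 — ASCII is one byte per character
print(bytes_per_char("ↂ∮𝔄" * 10))     # > 3 — rare symbols inflate the byte stream
```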
Metrics: input tokens, output tokens, chars/token ratio, inference time (ms), cost multiplier
Notebook: `09_denial_of_service/09c_reasoning_dos.ipynb`
Reasoning-capable models (o1, Claude extended thinking, Gemini thinking) incur compute proportional to their chain-of-thought length. Adversarial prompts exploit this by triggering deep reasoning chains on trivially simple problems.
Three attack vectors:
- CoT trigger amplification — phrases like "think step by step through every possibility" multiply output token count vs a direct question
- Adversarial math problems — problems that appear O(1) but force O(n) reasoning: prime enumeration, recursive expansion, state-space search
- VLM reasoning amplification — analytical model of how image complexity drives visual description tokens (conceptual, based on published benchmarks)
Includes a business cost multiplier calculation: how amplification translates to API cost inflation at GPT-4 pricing.
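The cost-multiplier arithmetic itself is straightforward — the per-1k-token prices below are illustrative placeholders, not actual GPT-4 rates:

```python
def cost_multiplier(base_in, base_out, atk_in, atk_out,
                    price_in=0.01, price_out=0.03):
    """Ratio of per-request API cost under attack vs baseline (prices per 1k tokens)."""
    cost = lambda i, o: i / 1000 * price_in + o / 1000 * price_out
    return cost(atk_in, atk_out) / cost(base_in, base_out)

# Same-size prompt, but the triggered reasoning chain inflates output 50x
print(round(cost_multiplier(50, 40, 60, 2000), 1))   # 35.6
```

Because output tokens are priced several times higher than input tokens, output amplification dominates the bill.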
Metrics: input tokens, output tokens, amplification ratio (output/input), inference time (ms), estimated API cost
Key papers:
- Shumailov et al. (2021). Sponge Examples: Energy-Latency Attacks on Neural Networks — arXiv:2006.03463
Requirements: PyTorch, torchvision, transformers, tiktoken, pandas, psutil — see 09_denial_of_service/requirements.txt
```bash
cd AISecurity/09_denial_of_service
pip install -r requirements.txt
jupyter notebook 09a_sponge_attacks_ml.ipynb
jupyter notebook 09b_llm_dos_token_inflation.ipynb
jupyter notebook 09c_reasoning_dos.ipynb
```

Notebook: `09_denial_of_service/vlm_dos/09d_vlm_visual_dos.ipynb` — Docker + local VLM (Qwen2.5-VL via Ollama)
A real demonstrated attack: certain images cause Qwen2.5-VL to take 15+ seconds instead of 1-2 seconds. This module systematically decomposes the attack and measures the two contributing vectors.
The attack combines two DoS mechanisms in a single image:
- Visual complexity overload — dense doodle fills every ViT patch with non-trivial content, maximising visual encoder attention compute
- Embedded adversarial instruction — text inside the image says "Count every single dot before answering", forcing the model into unbounded chain-of-thought reasoning on an impossible counting task
Three experiments:
| Experiment | What it isolates |
|---|---|
| Same neutral prompt, 6 different images | Image content as the only variable |
| Same sponge image, 6 different prompts | Shows embedded instruction dominates regardless of text prompt |
| 3 concurrent requests: sponge vs baseline | Server availability degradation under attack |
Images tested: the original sponge doodle, plain text on white, blank white, dense noise, fractal doodle, and a counting instruction on a clean background — isolating the visual-complexity and embedded-instruction contributions independently.
Environment:
- Model: `qwen2.5vl:7b` (~5.5 GB, runs on CPU; GPU strongly recommended)
- Stack: Ollama inference server via Docker Compose
- Port: http://localhost:11434
```bash
cd AISecurity/09_denial_of_service/vlm_dos
docker compose up -d
./entrypoint.sh   # pulls qwen2.5vl:7b (~5.5 GB, first run only)
pip install -r requirements.txt
jupyter notebook 09d_vlm_visual_dos.ipynb
```

Notebook: `10_model_stealing/model_stealing.ipynb`
A black-box attacker with nothing but access to a prediction API can train a surrogate model that closely approximates the original — without ever seeing the model's weights, architecture, or training data. This module demonstrates a full model extraction attack against a loan-approval ML API.
- Tramèr et al. (2016) — Stealing Machine Learning Models via Prediction APIs — arXiv:1609.02943
What you will learn:
- Why black-box access to an API is enough to steal a model's behaviour
- How random uniform query sampling in the input space systematically covers the victim's decision landscape
- The fidelity metrics used to evaluate a stolen model: R² (explained variance), MAE, Pearson correlation
- How the query budget controls the accuracy/cost trade-off: R² rises from 0.36 at 50 queries to 0.97 at 2,000 queries
- Why this attack is dangerous: the stolen surrogate enables white-box adversarial attacks against a system the attacker has no direct access to
- Three practical defenses: output perturbation (Gaussian noise), query rate limiting, and model watermarking — and why each has limits
Implementation:
- Victim model: sklearn MLPRegressor (64→32→1), trained on a synthetic loan-approval dataset (3 inputs: `income`, `credit_score`, `loan_to_value`)
- Victim API: Flask server running in a background thread — POST `/predict`, GET `/health`
- Surrogate model: a different MLPRegressor (32→16→1) trained purely on stolen (input, output) pairs
- With 1,000 API queries, the surrogate achieves R² = 0.95 and MAE = 0.036
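The extraction loop itself is short. A sketch under the same threat model — the `query_api` callable is a stand-in for HTTP calls to the victim's prediction endpoint:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def steal_model(query_api, bounds, n_queries=1000, seed=0):
    """Black-box extraction: sample random inputs within the feature bounds,
    query the victim, and fit a surrogate on the stolen (input, output) pairs."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T                       # per-feature [lo, hi]
    X = rng.uniform(lo, hi, size=(n_queries, len(lo)))
    y = np.array([query_api(x) for x in X])           # one API call per sample
    surrogate = MLPRegressor(hidden_layer_sizes=(32, 16),
                             max_iter=2000, random_state=seed)
    surrogate.fit(X, y)
    return surrogate
```

Fidelity is then measured by scoring the surrogate against fresh victim outputs (R², MAE) — the query budget is the knob that trades cost for fidelity.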
Requirements: numpy, matplotlib, scikit-learn, flask, requests — lightweight, no GPU needed.
```bash
cd AISecurity/10_model_stealing
pip install -r requirements.txt
jupyter notebook model_stealing.ipynb
```

Optional — run the victim server standalone (for external testing):

```bash
python victim_server.py
# Exposes http://localhost:5050/predict
```

Notebook: `11_embedding_inversion/embedding_inversion_attacks.ipynb`
Text embeddings are widely assumed to be a safe, anonymised representation of data — you can go from text → vector, but not back. This module demonstrates that this assumption is largely false, and that an attacker with access to embedding vectors can recover the original sensitive text with high fidelity.
Three attack methods are implemented and compared:
| Method | Threat Model | Requirements | Recovery Quality |
|---|---|---|---|
| Nearest Neighbour | Corpus-based lookup | Reference corpus | Exact match when text is in corpus |
| Hill-Climbing | Black-box iterative | Model access only | Partial (topic + key tokens) |
| Vec2Text | White-box, learned inverter | Pre-trained corrector model | ~97 BLEU @ 32 tokens (gtr-base) |
What you will learn:
- Why text embeddings preserve enough semantic information to be reversed — and why better models are more invertible, not less
- How to mount a nearest-neighbour attack against a leaked vector database using FAISS at scale
- How token-substitution hill-climbing reconstructs text without any reference corpus
- How Vec2Text's iterative hypothesize-and-correct algorithm (Morris et al., EMNLP 2023) achieves near-verbatim recovery at ~97 BLEU for 32-token sequences
- What sensitive content is most at risk: medical records, PII, credentials, financial data
- The privacy–utility trade-off of noise-based mitigations (Gaussian noise, dimensionality reduction, quantisation)
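The nearest-neighbour attack from the table reduces to a cosine-similarity lookup — a minimal NumPy sketch (the notebook uses FAISS for scale):

```python
import numpy as np

def nn_invert(leaked_vec, corpus_texts, corpus_embs):
    """Recover text from a leaked embedding via cosine similarity against a
    reference corpus — exact recovery when the text appears in the corpus."""
    sims = corpus_embs @ leaked_vec / (
        np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(leaked_vec) + 1e-12)
    return corpus_texts[int(np.argmax(sims))]
```

Even when the exact text is absent, the top hits leak topic and key entities — the starting point for the hill-climbing refinement.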
Key result: For short sensitive sentences (≤32 tokens), Vec2Text recovers the original text with ~97 BLEU using only the embedding vector — no access to the database or original documents required.
Key papers:
- Morris, J. et al. (2023). Text Embeddings Reveal (Almost) As Much As Text. EMNLP 2023. arXiv:2310.06816
- Li, B. et al. (2023). Sentence Embedding Leaks More Information than You Expect. ACL 2023.
- Song, C. & Raghunathan, A. (2020). Information Leakage in Embedding Models. CCS 2020.
Requirements: sentence-transformers, torch, transformers, scikit-learn, matplotlib, nltk. vec2text is optional (enables live inversion; ~3 GB corrector model).
```bash
cd AISecurity/11_embedding_inversion
pip install -r requirements.txt
# Optional: pip install vec2text (enables live Vec2Text demo)
jupyter notebook embedding_inversion_attacks.ipynb
```

Machine learning models learn from data. If an attacker can corrupt that data — before or during training — the model learns wrong patterns permanently. Unlike adversarial examples (which attack the model at inference), data poisoning attacks the training pipeline itself.
Notebook: `12_data_poisoning_attacks/12a_data_poisoning_fundamentals.ipynb`
An introduction to the three core families of data poisoning attacks, implemented from scratch using scikit-learn with full visualisations. No GPUs or API keys required.
| Attack | Goal | Detectability |
|---|---|---|
| Label Flipping (Availability) | Degrade global model accuracy | Medium — accuracy drops globally |
| Targeted Integrity Attack | Misclassify one specific input | Low — all other predictions stay correct |
| Backdoor / Trojan Attack | Implant a hidden trigger | Very low — model is accurate until trigger fires |
What you will learn:
- Why flipping just 20% of training labels can reduce model accuracy to near-random guessing
- How targeted poisoning shifts decision boundaries by injecting samples near a specific input
- How backdoor attacks achieve near-100% attack success rate while maintaining normal clean accuracy — the stealth property that makes them dangerous
- Defense families: data sanitization, neural cleanse, STRIP runtime detection, differential privacy, provenance tracking
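The availability attack in the first row of the table is a few lines — a minimal sketch matching the notebook's scikit-learn setting:

```python
import numpy as np

def flip_labels(y, frac=0.2, n_classes=2, seed=0):
    """Label-flipping availability attack: reassign a random fraction of
    training labels to a different class."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    # adding 1..n_classes-1 mod n_classes guarantees the label actually changes
    y_poisoned[idx] = (y_poisoned[idx]
                       + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_poisoned
```

Training on `flip_labels(y)` instead of `y` is all it takes to reproduce the global accuracy collapse described above.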
Key papers:
- Biggio et al. (2012). Poisoning Attacks against SVMs. ICML 2012.
- Gu et al. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733
- Shafahi et al. (2018). Poison Frogs! Targeted Clean-Label Poisoning Attacks. NeurIPS 2018. arXiv:1804.00792
- Wang et al. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks. IEEE S&P 2019.
Requirements: numpy, matplotlib, scikit-learn — lightweight, no GPU needed.
```bash
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12a_data_poisoning_fundamentals.ipynb
```

Notebook: `12_data_poisoning_attacks/12b_rag_context_poisoning.ipynb`
Retrieval-Augmented Generation (RAG) systems answer questions by first fetching relevant documents from a knowledge base and feeding them to an LLM. This creates a new attack surface: if the knowledge base is poisoned, the LLM answers with attacker-controlled information — even if the model itself is perfectly fine.
This module builds a full working RAG system (LangChain + LangGraph + FAISS + local embeddings) against a fictional company knowledge base, then demonstrates three progressive attacks.
RAG pipeline architecture:
```
Question → [FAISS Vector Store] → Top-k Documents → [LLM] → Answer
                  ↑
             (poisonable)
```
| Attack | Technique | Effect |
|---|---|---|
| Corpus Poisoning | Inject false documents into the knowledge base | LLM answers with attacker's false facts |
| Prompt Injection via Context | Embed `IGNORE PREVIOUS INSTRUCTIONS` in a document | Injected instructions can override LLM behaviour |
| Context Window Flooding | Fill the knowledge base with keyword-rich junk docs | Real documents are pushed out of top-k retrieval |
What you will learn:
- How LangGraph orchestrates a stateful retrieve → generate pipeline
- Why RAG systems inherit the trustworthiness problems of their document corpus
- How corpus poisoning differs from direct prompt injection (false facts vs behaviour override)
- How context window flooding exploits semantic similarity ranking to degrade retrieval
- Defense strategies: document authentication, source provenance, input sanitisation, output validation, semantic anomaly detection
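The flooding attack targets the retrieval step directly. A toy sketch of top-k cosine retrieval and how near-duplicate, keyword-rich junk crowds out the genuine document (NumPy stand-in for the FAISS store; the vectors are fabricated for illustration):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Cosine-similarity retrieval — the poisonable step of the RAG pipeline."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    return np.argsort(-sims)[:k]

# One genuine doc plus flood docs embedded ever closer to the query direction:
query = np.array([1.0, 0.0, 0.0])
real_doc = np.array([0.9, 0.3, 0.0])        # index 0: the true answer
junk = np.array([[0.99, 0.01, 0.0]] * 3)    # keyword-rich flood docs
docs = np.vstack([real_doc, junk])
print(top_k(query, docs, k=3))              # the real doc is pushed out of top-k
```

Everything downstream of retrieval — including a perfectly aligned LLM — then only ever sees the attacker's documents.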
Environment:
- LLM: Local Ollama model (`llama3.2` or any pulled model) — no API keys required
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2` (local, ~90 MB)
- Fallback: Deterministic MockLLM when Ollama is not running — all attacks still demonstrate correctly
```bash
# Option A — local Ollama
ollama pull llama3.2
ollama serve

# Option B — Docker
docker run -d -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3.2

# Run the notebook
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12b_rag_context_poisoning.ipynb
```

Key papers:
- OWASP Top 10 for LLM Applications (2023). LLM06: Sensitive Information Disclosure, LLM07: Insecure Plugin Design. owasp.org
- Goldblum et al. (2022). Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. IEEE TPAMI. arXiv:2012.10544
- Greshake et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173
```bash
git clone https://github.com/elcronos/AISecurity.git

# Create a virtual environment (do this once)
python -m venv venv && source venv/bin/activate   # macOS/Linux
# python -m venv venv && venv\Scripts\activate    # Windows
```

Module 01 — Adversarial Attacks on CNNs
```bash
cd AISecurity/01_adversarial_attacks_cnns
pip install -r requirements.txt
jupyter notebook adversarial_attacks_cnn.ipynb
```

Module 02 — Second-Order Attacks
```bash
cd AISecurity/02_second_order_attacks
pip install -r requirements.txt
jupyter notebook second_order_attacks.ipynb
```

Module 03 — Adversarial Attacks on Object Detection
```bash
cd AISecurity/03_adversarial_object_detection
pip install -r requirements.txt
jupyter notebook adversarial_object_detection.ipynb
```

Module 04 — Adversarial Reprogramming
```bash
cd AISecurity/04_adversarial_reprogramming
pip install numpy matplotlib scikit-learn jupyter
jupyter notebook adversarial_reprogramming.ipynb
```

Module 05 — Defenses for CNNs
```bash
cd AISecurity/05_defenses_cnns
pip install -r requirements.txt
jupyter notebook defenses_cnns.ipynb
```

Module 06 — Adversarial Audio Attacks
```bash
cd AISecurity/06_adversarial_audio
pip install openai-whisper torchaudio librosa soundfile transformers accelerate matplotlib numpy torch
jupyter notebook adversarial_audio.ipynb
```

Apple Silicon (M1/M2/M3/M4): PyTorch will automatically use the MPS GPU backend for a 5–15× speedup over CPU. Requires PyTorch ≥ 1.12 and macOS ≥ 12.3. C&W and L-BFGS are optimization-based attacks — running on MPS is strongly recommended over CPU.
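The backend selection described above usually reduces to a small piece of device-picking logic. A minimal sketch (the `pick_device` helper is illustrative, not part of the notebooks):

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Return the fastest available PyTorch device string."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"  # Apple Silicon GPU (Metal Performance Shaders)
    return "cpu"

# In a notebook this would be driven by the real capability checks:
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())

print(pick_device(False, True))  # on an M-series Mac without CUDA → mps
```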
Requirements: Docker Desktop (or Docker Engine + Compose plugin)

Module 07 — Attacks on LLMs (Text-only)
```bash
cd AISecurity/07_llm_attacks_text
docker compose up --build
```
- First run downloads `llama3.2:3b` (~2 GB) — subsequent starts are instant
- Open http://localhost:8080 in your browser
- To use a different model: `MODEL_NAME=qwen2.5:7b docker compose up`
- To stop: `docker compose down`
Docker environments run entirely locally — no API keys, no cloud costs, no data sent externally.
Modules 09a–c — Denial of Service on ML Systems
```bash
cd AISecurity/09_denial_of_service
pip install -r requirements.txt
jupyter notebook 09a_sponge_attacks_ml.ipynb        # CNN sponge attacks
jupyter notebook 09b_llm_dos_token_inflation.ipynb  # LLM token inflation
jupyter notebook 09c_reasoning_dos.ipynb            # Reasoning chain amplification
```
- All notebooks run fully locally — no API keys, no cloud costs
- `09a` uses ResNet50 (PyTorch); `09b`/`09c` download small GPT-2 models from HuggingFace (~500 MB) on first run
- Apple Silicon (MPS) supported for `09a`
Module 09d — Visual Sponge Attacks on VLMs
```bash
cd AISecurity/09_denial_of_service/vlm_dos
docker compose up -d
./entrypoint.sh   # pulls qwen2.5vl:7b (~5.5 GB, first run only)
pip install -r requirements.txt
jupyter notebook 09d_vlm_visual_dos.ipynb
```
- Requires Docker Desktop
- First run downloads `qwen2.5vl:7b` (~5.5 GB) — subsequent starts are instant
- GPU strongly recommended; CPU works but is slow
Module 11 — Embedding Inversion Attacks
```bash
cd AISecurity/11_embedding_inversion
pip install -r requirements.txt
# Optional: pip install vec2text  (enables live Vec2Text inversion; downloads ~3 GB corrector)
jupyter notebook embedding_inversion_attacks.ipynb
```
- All three attack methods (nearest neighbour, hill-climbing, Vec2Text) run fully locally — no API keys
- `sentence-transformers/all-MiniLM-L6-v2` (~90 MB) is downloaded on first run for methods 1 & 2
- Vec2Text requires `sentence-transformers/gtr-t5-base` (~250 MB) + a T5-large corrector (~3 GB)
- Apple Silicon (MPS) supported; CPU works for all methods
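The nearest-neighbour method (method 1) reduces to comparing a leaked embedding against a corpus of known texts. A toy sketch of that idea, using seeded random vectors in place of real sentence embeddings (everything here is illustrative, not the notebook's code):

```python
import math
import random

random.seed(0)

# Stand-in for a public corpus of candidate sentences and their embeddings.
corpus = ["the meeting is at noon", "my password is hunter2", "send the report"]

def _unit(v):
    """Normalize a vector to unit length so dot product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

corpus_emb = [_unit([random.gauss(0, 1) for _ in range(8)]) for _ in corpus]

def invert_by_nearest_neighbour(leaked_emb):
    """Guess the text behind a leaked embedding via cosine similarity."""
    leaked = _unit(leaked_emb)
    sims = [sum(a * b for a, b in zip(e, leaked)) for e in corpus_emb]
    return corpus[sims.index(max(sims))]

# An attacker who intercepts a (slightly noisy) copy of corpus[1]'s
# embedding recovers its text without ever seeing the plaintext.
leaked = [x + random.gauss(0, 0.01) for x in corpus_emb[1]]
print(invert_by_nearest_neighbour(leaked))  # → my password is hunter2
```

With real models, `corpus_emb` would come from the same encoder that produced the leaked vector; the notebook's hill-climbing and Vec2Text methods go further by recovering text outside any fixed corpus.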
Module 12a — Data Poisoning Fundamentals
```bash
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12a_data_poisoning_fundamentals.ipynb
```
- Fully local — no API keys, no GPU needed
- Runs in under 60 seconds on any laptop
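The label-flipping attack covered in 12a can be sketched in a few lines of plain Python, independent of the notebook: corrupt a fraction of the training labels and watch a simple nearest-centroid classifier collapse (toy 1-D data; all names here are illustrative):

```python
import random

random.seed(1)

# Two 1-D clusters: class 0 near -2, class 1 near +2.
X = [random.gauss(-2, 0.5) for _ in range(50)] + [random.gauss(2, 0.5) for _ in range(50)]
y = [0] * 50 + [1] * 50

def centroid_accuracy(X, y_train, y_true):
    """Fit a per-class mean on y_train; score predictions against y_true."""
    means = {}
    for c in (0, 1):
        pts = [x for x, t in zip(X, y_train) if t == c]
        means[c] = sum(pts) / len(pts)
    preds = [min(means, key=lambda c: abs(x - means[c])) for x in X]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

clean_acc = centroid_accuracy(X, y, y)

# Poison: deterministically flip 60% of the training labels in each class,
# which drags each class centroid toward the opposite cluster.
y_poisoned = [1 - t if i % 10 < 6 else t for i, t in enumerate(y)]
poisoned_acc = centroid_accuracy(X, y_poisoned, y)

print(f"clean accuracy: {clean_acc:.2f}, poisoned accuracy: {poisoned_acc:.2f}")
```

Flipping a majority of labels swaps the learned centroids outright, so accuracy on clean labels falls to near zero; the notebook explores subtler variants (targeted flips, backdoor triggers) where far fewer poisoned points suffice.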
Module 12b — RAG & Context Poisoning
```bash
# Start Ollama (or use Docker — see module description above)
ollama pull llama3.2 && ollama serve
cd AISecurity/12_data_poisoning_attacks
pip install -r requirements.txt
jupyter notebook 12b_rag_context_poisoning.ipynb
```
- Falls back to MockLLM automatically if Ollama is not running — all three attacks still work
- `sentence-transformers/all-MiniLM-L6-v2` (~90 MB) downloaded on first run
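The automatic fallback boils down to probing the default Ollama endpoint and dropping to a mock when nothing answers. A hedged sketch using only the standard library (`pick_backend` is illustrative — the notebook's actual implementation may differ):

```python
import json
import urllib.error
import urllib.request

def pick_backend(base_url: str = "http://localhost:11434") -> str:
    """Return 'ollama' if a server answers at base_url, else 'mock'."""
    try:
        # GET /api/tags is Ollama's cheap "list local models" endpoint.
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            json.load(resp)
        return "ollama"
    except (urllib.error.URLError, OSError, ValueError):
        return "mock"  # deterministic MockLLM fallback

print(pick_backend())  # 'mock' unless a local Ollama is running
```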
```
AISecurity/
├── 01_adversarial_attacks_cnns/
│   ├── adversarial_attacks_cnn.ipynb
│   └── requirements.txt
├── 02_second_order_attacks/
│   ├── second_order_attacks.ipynb
│   └── requirements.txt
├── 03_adversarial_object_detection/
│   ├── adversarial_object_detection.ipynb
│   └── requirements.txt
├── 04_adversarial_reprogramming/
│   └── adversarial_reprogramming.ipynb
├── 05_defenses_cnns/
│   ├── defenses_cnns.ipynb
│   └── requirements.txt
├── 06_adversarial_audio/
│   └── adversarial_audio.ipynb
├── 07_llm_attacks_text/
│   ├── docker-compose.yml
│   └── app/
│       ├── main.py           # FastAPI app + 6 challenge configs
│       ├── rag_engine.py     # BM25 document store (Challenge 3)
│       ├── rag_graph.py      # LangGraph RAG pipeline (Challenge 3)
│       ├── auth.py           # JWT auth for admin panel (Challenge 3)
│       ├── Dockerfile
│       ├── entrypoint.sh     # Pulls LLM model on first start
│       ├── requirements.txt
│       └── static/
│           ├── index.html    # Challenge selection page
│           ├── chat.html     # Challenge chat interface
│           └── admin.html    # Knowledge base admin panel (Challenge 3)
├── 08_llm_attacks_multimodal/
│   ├── docker-compose.yml
│   ├── adversarial_patch_generator.ipynb  # Companion: CLIP gradient-based patch generation
│   └── app/
│       ├── main.py           # FastAPI app + 5 multimodal challenge configs
│       ├── image_utils.py    # Image validation, resize, base64 encoding
│       ├── Dockerfile
│       ├── entrypoint.sh     # Pulls LLaVA model on first start
│       ├── requirements.txt
│       └── static/
│           ├── index.html             # Challenge selection page
│           ├── chat.html              # Multimodal chat UI (image upload + text)
│           └── adversarial_patch.png  # Pre-generated patch for Challenge 4
├── 09_denial_of_service/
│   ├── 09a_sponge_attacks_ml.ipynb        # CNN sponge: gradient ascent on activation density
│   ├── 09b_llm_dos_token_inflation.ipynb  # Token inflation: char vs token, flooding, output inflation
│   ├── 09c_reasoning_dos.ipynb            # CoT amplification, adversarial math, VLM reasoning DoS
│   └── requirements.txt
├── 10_model_stealing/
│   ├── model_stealing.ipynb
│   ├── victim_server.py
│   └── requirements.txt
├── 11_embedding_inversion/
│   ├── embedding_inversion_attacks.ipynb  # Nearest neighbour, hill-climbing, Vec2Text inversion
│   └── requirements.txt
└── 12_data_poisoning_attacks/
    ├── 12a_data_poisoning_fundamentals.ipynb  # Label flipping, targeted attack, backdoor/trojan
    ├── 12b_rag_context_poisoning.ipynb        # RAG system + corpus poisoning, prompt injection, flooding
    └── requirements.txt
```
Camilo Pestana, PhD is an AI researcher and engineer specialising in computer vision, multimodal learning, and AI safety. This series draws from both academic research and practical red-teaming experience to make AI security accessible to a broad technical audience.
- GitHub: @elcronos
- Szegedy et al. (2013). Intriguing Properties of Neural Networks. arXiv:1312.6199
- Goodfellow et al. (2014). Explaining and Harnessing Adversarial Examples. arXiv:1412.6572
- Madry et al. (2017). Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083
- Carlini & Wagner (2017). Towards Evaluating the Robustness of Neural Networks. arXiv:1608.04644
- Cohen et al. (2019). Certified Adversarial Robustness via Randomized Smoothing. arXiv:1902.02918
- Brown et al. (2017). Adversarial Patch. arXiv:1712.09665
- Thys et al. (2019). Fooling automated surveillance cameras: adversarial patches to attack person detection. arXiv:1904.08653
- Xu et al. (2020). Adversarial T-shirt! Evading Person Detectors in a Physical World. arXiv:1910.11099
- Elsayed, Goodfellow & Sohl-Dickstein (2019). Adversarial Reprogramming of Neural Networks. arXiv:1806.11146
- Perez & Ribeiro (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527
- Greshake et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173
- Radford et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- Qi et al. (2024). Visual Adversarial Examples Jailbreak Aligned Large Language Models. arXiv:2306.13213
- Gong et al. (2025). FigStep: Jailbreaking Large Vision-Language Models via Scalable Typography-based Visual Prompts. arXiv:2311.05608
- Rahmatullaev et al. (2025). Universal Adversarial Attack on Aligned Multimodal LLMs. arXiv:2502.07987
- Tramèr et al. (2016). Stealing Machine Learning Models via Prediction APIs. arXiv:1609.02943
- Morris, J. et al. (2023). Text Embeddings Reveal (Almost) As Much As Text. EMNLP 2023. arXiv:2310.06816
- Li, B. et al. (2023). Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence. ACL 2023.
- Song, C. & Raghunathan, A. (2020). Information Leakage in Embedding Models. CCS 2020.
- Biggio, B., Nelson, B., & Laskov, P. (2012). Poisoning Attacks against Support Vector Machines. ICML 2012.
- Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733
- Shafahi, A. et al. (2018). Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. NeurIPS 2018. arXiv:1804.00792
- Turner, A., Tsipras, D., & Madry, A. (2019). Label-Consistent Backdoor Attacks. arXiv:1912.02771
- Wang, B. et al. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE S&P 2019.
- Goldblum, M. et al. (2022). Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. IEEE TPAMI. arXiv:2012.10544
This repository is licensed under the MIT License. All content is provided for educational purposes only.
Disclaimer: The techniques demonstrated in this series are for learning and research. Always obtain explicit permission before testing adversarial techniques on systems you do not own.