Terms of Service are designed to be ignored. Companies hide predatory clauses—unilateral changes, data selling, and waivers of rights—buried in thousands of words of legalese.
TOS Risk Summarizer is a fine-tuned LLM designed to reclaim user agency. Unlike generic models that simply "summarize" text, this model was trained via knowledge distillation to act as a cynical legal analyst. It scans documents clause by clause, ignores the fluff, and specifically extracts high-risk provisions, each with a "Verdict" and a "Risk Assessment."
This project demonstrates a complete end-to-end fine-tuning pipeline: from synthetic data generation and cleaning to efficient LoRA training on constrained hardware.
This short video demonstrates what the project does and how it works.
(Click the image to watch the full demo on YouTube)
This project was built using a hybrid cloud/local workflow to overcome hardware constraints.
- Model Architecture: Llama 3.1 8B (Instruct)
- Fine-Tuning Framework: Unsloth (for 2x faster training & memory efficiency)
- Training Infrastructure: Google Colab (T4 GPU)
- Inference Engine: Ollama (running locally)
- Teacher Model: Gemini 2.0 Experimental (via OpenRouter)
- Frontend: Gradio (Real-time Streaming)
- Version Control: Git & Hugging Face Hub
The core challenge of this project was engineering a high-performance legal AI with zero budget and incompatible local hardware (AMD GPU vs. NVIDIA-centric tools). Here is how I solved each bottleneck.
I started with the CodeHima/TOS_Dataset from Hugging Face. Upon inspection, the dataset was highly noisy, containing "fair" labels for non-clauses like "Welcome to the website" or "Table of Contents." Training on this would have resulted in model hallucinations.
- Solution: I wrote a Python pre-processing pipeline using regex heuristics to strip navigation text, headers, and fragments under 50 characters.
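The cleaning step can be sketched as a pair of small functions. This is a minimal illustration of the heuristics described above; the function names, the exact noise patterns, and the `clause` field name are assumptions, not the pipeline's actual code.

```python
import re

MIN_CHARS = 50  # fragments shorter than this are treated as noise

# Patterns for boilerplate that carries no legal meaning (illustrative, not exhaustive)
NOISE_PATTERNS = [
    r"^\s*welcome to\b",                 # greetings
    r"^\s*table of contents\b",          # navigation
    r"^\s*(home|back to top|menu)\s*$",  # nav links
    r"^\s*\d+(\.\d+)*\s*$",              # bare section numbers
]
NOISE_RE = re.compile("|".join(NOISE_PATTERNS), re.IGNORECASE)

def is_real_clause(text: str) -> bool:
    """Heuristic filter: keep only substantive clauses."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return False
    if NOISE_RE.search(text):
        return False
    return True

def clean_dataset(rows):
    """Drop navigation text, headers, and short fragments from dataset rows."""
    return [r for r in rows if is_real_clause(r["clause"])]
```

The key design choice is to filter before labeling: every junk row removed here is one fewer hallucination-inducing example fed to the teacher model downstream.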
To create a high-quality "Risk Assessment" for every clause, I couldn't rely on the raw dataset labels alone.
- Solution: I utilized Gemini 2.0 Experimental as a "Teacher Model." I fed the cleaned clauses into Gemini with a strict system prompt to generate step-by-step legal reasoning and risk verdicts.
- Result: This produced a synthetic, high-quality dataset ("The Textbook"), which was then used to train the smaller Llama 3.1 model ("The Student"). This technique allowed the 8B model to punch above its weight class.
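The distillation step boils down to two pure functions: packaging a clause as a teacher request, and converting the teacher's reply into a student training record. This is a hedged sketch; the prompt wording, JSON keys, and instruction format are illustrative stand-ins, not the actual prompts used in training.

```python
import json

# System prompt steering the teacher toward cynical, clause-level legal analysis
# (wording is illustrative; the real training prompt is not reproduced here).
SYSTEM_PROMPT = (
    "You are a cynical legal analyst. For the clause below, reason step by step "
    "about the risk it poses to the user, then answer in JSON with keys "
    "'reasoning', 'verdict' (FAIR or RISKY), and 'risk' (LOW, MEDIUM, HIGH)."
)

def build_teacher_messages(clause: str) -> list[dict]:
    """Package one cleaned clause as a chat request for the teacher model."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": clause},
    ]

def to_training_example(clause: str, teacher_json: str) -> dict:
    """Convert a teacher reply into one student fine-tuning record."""
    reply = json.loads(teacher_json)
    return {
        "instruction": "Assess the risk of this Terms of Service clause.",
        "input": clause,
        "output": f"Verdict: {reply['verdict']}\nRisk: {reply['risk']}\n{reply['reasoning']}",
    }
```

In practice `build_teacher_messages` would be sent to the teacher via the OpenRouter chat API, and each reply fed through `to_training_example` to build the instruction-tuning corpus.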
I chose Unsloth over standard Hugging Face Transformers because of its optimized kernels. It allowed me to fit the Llama 3.1 8B training process into the limited VRAM of a free Colab instance without sacrificing context length (2048 tokens).
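A configuration sketch of that setup is below. It is not runnable without an NVIDIA GPU, and the checkpoint name and rank are assumptions (Unsloth publishes pre-quantized 4-bit repos under similar names); treat it as an outline, not the exact training script.

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model (repo name assumed)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,  # full context length kept despite the T4's 16 GB VRAM
    load_in_4bit=True,    # QLoRA-style quantization to fit the free Colab tier
)

# Attach LoRA adapters; only these small matrices are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # rank (assumed; a common default)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
```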
Instead of fine-tuning all 8 billion parameters (which is computationally infeasible on consumer hardware), I used Low-Rank Adaptation (LoRA).
- Strategy: I froze the base model weights and trained only a small set of adapter layers (~1% of parameters).
- Benefit: This shrank the artifact I needed to ship from ~16 GB (full model weights) to ~160 MB (LoRA adapters only), making it easy to host on Hugging Face and deploy locally.
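The ~160 MB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes rank 16 on the standard seven projection layers, with dimensions from Llama 3.1 8B's public config; a higher rank (e.g. 32) roughly doubles the count, which is where the "~1% of parameters" figure lands.

```python
# Llama 3.1 8B dimensions (from the public model config)
hidden, inter, layers = 4096, 14336, 32
kv_dim = 1024  # 8 KV heads x 128 head_dim (grouped-query attention)
r = 16         # LoRA rank (assumed; a common default)

# LoRA adds r * (d_in + d_out) parameters per adapted linear layer
def lora_params(d_in, d_out, rank=r):
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden)     # q_proj
    + lora_params(hidden, kv_dim)   # k_proj
    + lora_params(hidden, kv_dim)   # v_proj
    + lora_params(hidden, hidden)   # o_proj
    + lora_params(hidden, inter)    # gate_proj
    + lora_params(hidden, inter)    # up_proj
    + lora_params(inter, hidden)    # down_proj
)
total = per_layer * layers
print(total)                 # ~41.9M trainable parameters
print(total / 8e9)           # ~0.5% of the 8B base at rank 16
print(total * 4 / 2**20)     # ~160 MB when stored in fp32
```

At rank 16 the adapters come out to roughly 42M parameters, which at 4 bytes each is almost exactly 160 MB, matching the shipped file size.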
My local workstation runs an AMD RX 6600 XT. Most fine-tuning libraries (like Unsloth and bitsandbytes) are heavily optimized for NVIDIA CUDA and do not run natively on AMD hardware under Windows.
- Workaround: I decoupled training and inference.
- Training: Offloaded to the cloud using Google Colab's free NVIDIA T4 GPUs.
- Inference: Exported the LoRA adapters and loaded them locally using Ollama, which has solid ROCm (AMD) support. This enabled private, offline, low-latency analysis on my own machine.
Initially, I attempted to convert the model to GGUF format for use with llama.cpp. However, the quantization process exceeded the RAM limits of the free Colab tier, causing out-of-memory (OOM) crashes.
- Pivot: Instead of a full merge, I adopted an adapter-runtime workflow: the application loads the base Llama 3.1 model and dynamically patches in the LoRA adapters at runtime, bypassing the high-RAM conversion step entirely.
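In Ollama, this pattern maps onto a Modelfile that names the base model and points at the adapter directory; Ollama applies the adapter at load time, so no merged GGUF is ever produced. The sketch below is illustrative: the model tag, adapter path, and system prompt are assumed, and the adapter must be in a format Ollama accepts (GGUF or safetensors PEFT adapters in recent versions).

```
FROM llama3.1:8b
ADAPTER ./tos-risk-lora
PARAMETER temperature 0.2
SYSTEM You are a cynical legal analyst. Flag high-risk Terms of Service clauses.
```

Registering and running it would then look like `ollama create tos-risk -f Modelfile` followed by `ollama run tos-risk`.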
I built a frontend using Gradio that accepts full text input. It splits the document into clauses, streams them to the local inference engine, and applies a custom Python Gatekeeper to filter out low-confidence risks before displaying them to the user.
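The two non-UI pieces of that pipeline, clause splitting and the gatekeeper, can be sketched as pure functions. Everything here is an assumption for illustration: the split heuristic, the 0.6 confidence floor, and the shape of the findings dicts (in particular, where the `confidence` score comes from is not shown).

```python
import re

CONFIDENCE_FLOOR = 0.6  # gatekeeper threshold (assumed value)

def split_clauses(document: str) -> list[str]:
    """Split pasted ToS text into candidate clauses on blank lines or numbered headings."""
    parts = re.split(r"\n\s*\n|\n(?=\d+\.\s)", document)
    return [p.strip() for p in parts if len(p.strip()) >= 50]

def gatekeeper(findings: list[dict]) -> list[dict]:
    """Drop low-confidence or non-risky verdicts before they reach the UI."""
    return [
        f for f in findings
        if f["verdict"] == "RISKY" and f["confidence"] >= CONFIDENCE_FLOOR
    ]
```

In the app, `split_clauses` feeds the inference engine one clause at a time (which is what makes streaming results to Gradio possible), and `gatekeeper` runs over the parsed model outputs before anything is rendered.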
The graphs below visualize the training stats of each fine-tuned model.
TOS Risk Summarizer v1 Model
TOS Risk Summarizer v2 Model
TOS Risk Summarizer v3 Model

