About the Project

Our Inspiration 💡

Our primary motivation was to address a critical blind spot in product development: geo‑specific regulatory compliance 🌍📜. We wanted to transform compliance from a reactive, often manual process into a proactive, traceable, and auditable system ✅. Without a specialized tool, flagging features that might violate complex regional regulations is difficult, leading to potential risks and inconsistencies. We set out to build a prototype that leverages Large Language Models (LLMs) 🤖 to automate this detection and provide clear, accurate, and structured compliance checks.

To achieve this, we aimed for a system that was accurate, fast ⚡, and cost‑effective 💰. This led us to fine‑tune the gpt-4.1-mini model, striking a deliberate balance between the higher cost of larger models and the need for low latency suitable for real‑time auditing and interaction.


How We Built It 🛠️

Our project was brought to life ✨ with a dual‑AI architecture, supervised fine‑tuning, and a real‑time monitoring dashboard.

1) The Foundation: Tools and Data 💾

We built our system with a stack of powerful tools and libraries 💻:

  • Languages & Dev: Python, Jupyter Notebook
  • UI: Streamlit
  • ML & Data: pandas, scikit-learn, faiss
  • LLM SDKs: vertexai, openai
  • Storage & Realtime: Google Sheets API 📄

Our process began with meticulous data preparation ✍️. We created a specialized dataset for training and validation from five core regulation texts plus a Terminologies.csv file, producing Train.csv and Test.csv. The data was AI‑augmented and manually verified to ensure quality and accuracy.
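As a sketch of that preparation step, the split into Train.csv and Test.csv might look like the following. The column names and example rows here are illustrative, not our actual schema; the real dataset was built from the five regulation texts and Terminologies.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical schema: each row pairs a feature description with a
# manually verified compliance label (columns are illustrative).
examples = pd.DataFrame({
    "feature": [
        "Auto-share user location with ad partners",
        "Age gate for users under 16 in the EU",
        "Default opt-in to marketing emails",
        "Self-service data export for account holders",
        "Indefinite retention of deleted-account data",
        "Consent banner shown before any tracking",
    ],
    "label": ["Violation", "No Violation", "Violation",
              "No Violation", "Violation", "No Violation"],
})

# Stratified split so Train.csv and Test.csv keep the same class balance.
train_df, test_df = train_test_split(
    examples, test_size=1/3, stratify=examples["label"], random_state=42
)
train_df.to_csv("Train.csv", index=False)
test_df.to_csv("Test.csv", index=False)
```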

2) The Brain: Fine‑Tuning a Specialized LLM 🧠

The core of our solution is a fine‑tuned gpt-4.1-mini-2025-04-14 model chosen for cost‑efficiency, low latency, and support for supervised fine‑tuning.

Training Data 📚

  • We formatted examples into JSONL simulating a conversational flow:
    1. User query about a feature
    2. Tool call by the AI to evaluate it
    3. Tool’s structured response (violation, reasoning, cited articles)
    4. Final human‑readable summary
  • This format teaches the model to produce reliable, structured JSON outputs.
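One training record in that four-step flow might be assembled like this. The tool name `evaluate_compliance` and the JSON fields are illustrative stand-ins for our actual schema:

```python
import json

# Hypothetical fine-tuning record mirroring the four-step flow:
# user query -> tool call -> structured tool response -> summary.
record = {
    "messages": [
        {"role": "user",
         "content": "Does auto-enrolling EU minors in targeted ads comply?"},
        {"role": "assistant", "content": None,
         "tool_calls": [{
             "id": "call_1", "type": "function",
             "function": {
                 "name": "evaluate_compliance",  # illustrative tool name
                 "arguments": json.dumps(
                     {"feature": "auto-enroll EU minors in targeted ads"}),
             }}]},
        {"role": "tool", "tool_call_id": "call_1",
         "content": json.dumps({
             "violation": True,
             "reasoning": "Targeted ads to minors require verified consent.",
             "cited_articles": ["GDPR Art. 8"],
         })},
        {"role": "assistant",
         "content": "Violation: auto-enrollment conflicts with GDPR Art. 8."},
    ]
}

# Each record becomes one line of the JSONL training file.
line = json.dumps(record)
```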

Training Process 🚀

  • 3 epochs, batch size = 1, learning rate multiplier = 2
  • Both training and validation loss converged to 0.000, indicating a very tight fit to our specialized task 🎉
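The job submission with these hyperparameters might be sketched as below. The file IDs are placeholders, and actually launching the job requires an uploaded JSONL file and an OpenAI API key:

```python
# Hyperparameters from our training run.
hyperparameters = {
    "n_epochs": 3,
    "batch_size": 1,
    "learning_rate_multiplier": 2,
}

job_request = {
    "model": "gpt-4.1-mini-2025-04-14",
    "training_file": "file-TRAIN",    # placeholder uploaded Train JSONL ID
    "validation_file": "file-TEST",   # placeholder uploaded Test JSONL ID
    "hyperparameters": hyperparameters,
}

# With the openai SDK this would be submitted roughly as:
#   client = openai.OpenAI()
#   client.fine_tuning.jobs.create(**job_request)
```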

Safety First 🛡️

  • Post‑tuning, the model passed automated safety alignment checks across 15 sensitive categories, ensuring readiness for deployment.

3) The Architecture: A Dual‑AI System for Robustness 🏛️

Our methodology uses Retrieval‑Augmented Generation (RAG) and a multi‑layered decision flow to ensure accuracy and consistency.

  • AI 1 — Initial Compliance Check 🤖1️⃣

    • Retrieves relevant sections from regulatory documents via embeddings
    • Analyzes the feature against these texts
    • Outputs Violation / No Violation with specific citations
  • AI 2 — Consistency & Validation 🤖2️⃣

    • Looks up historical decisions using RAG
    • Compares the new decision to similar past cases
    • Supports or challenges the initial assessment
  • Human‑in‑the‑Loop 🧑‍💻

    • Final, validated decisions are logged to Google Sheets in real time
    • Enables immediate oversight and correction
    • Creates a continuous feedback loop that improves the system over time
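The retrieval step both agents share can be sketched as a nearest-neighbor search over embedded passages. This minimal version uses random unit vectors as stand-in embeddings and a plain dot product; in the actual system, passages are embedded with a real model and indexed with faiss:

```python
import numpy as np

# Stand-in embeddings: random unit vectors instead of a real embedding
# model, so the structure of the lookup is runnable in isolation.
rng = np.random.default_rng(0)

def embed(texts):
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

passages = [
    "Article 8: processing a child's data requires parental consent.",
    "Article 17: users may request erasure of personal data.",
    "Article 33: breaches must be reported within 72 hours.",
]
index = embed(passages)  # (n_passages, dim), embedded once up front

def retrieve(query_vec, k=2):
    # Cosine similarity reduces to a dot product on unit vectors.
    scores = index @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

# AI 1 grounds its Violation / No Violation call on the top-k passages;
# AI 2 runs the same lookup over logged past decisions instead.
hits = retrieve(embed(["feature handling minors' data"])[0])
```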

4) The Eyes: An Intuitive Monitoring Dashboard 👀

We built a Looker Studio dashboard 📊 for transparency and auditing. It provides a comprehensive view of compliance health via:

  • KPIs 📈: Total features evaluated, number of violations, overall non‑compliance percentage
  • Normalized Metrics: A Regional Violation Rate (violations as a share of features evaluated per region) to enable fair comparisons
  • Visualizations 🗺️: A pie chart of overall compliance, a bar chart of most frequently violated regulations, and a geographic map highlighting regional risk hotspots
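The Regional Violation Rate boils down to a per-region group-by over the decision log. Region names and counts below are illustrative:

```python
import pandas as pd

# Hypothetical per-feature decision log.
log = pd.DataFrame({
    "region": ["EU", "EU", "EU", "US", "US", "APAC"],
    "violation": [True, False, True, False, True, False],
})

# Violations relative to features evaluated in each region, so a large
# region is not penalized for simply shipping more features.
by_region = log.groupby("region")["violation"].agg(
    features="count", violations="sum"
)
by_region["violation_rate"] = by_region["violations"] / by_region["features"]
```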

Challenges We Faced 🧗‍♀️

  • Reducing Ambiguity 🤔➡️✅: Generic LLMs can be nuanced but non‑committal. We fine‑tuned on a strict JSONL schema to produce definitive, auditable outputs.
  • Ensuring Consistency 🔄: LLM answers can drift over time. Our dual‑AI validator checks against historical precedent to stabilize decisions.
  • Meaningful Regional Analysis 🌍📊: Raw counts overstate large regions. Our normalized Regional Violation Rate enables fair, insightful cross‑region comparisons.

What We Learned 🎓

  • Power of Small, Specialized Models 💪🤖: A tuned gpt-4.1-mini can rival larger models on a narrow task while keeping latency and cost low.
  • Robustness via Multi‑Agent Systems 🤝: Embedding checks and balances (a validator agent) directly into the workflow yields more reliable outputs.
  • Data Presentation Matters 🎨: Thoughtful dashboard design and normalized metrics turn raw logs into actionable insight for stakeholders.
  • Human‑in‑the‑Loop ❤️🧑‍💻: Real‑time logging to Google Sheets supports oversight and creates a learning system that improves with expert feedback.
