About the Project
Our Inspiration 💡
Our primary motivation was to address a critical blind spot in product development: geo‑specific regulatory compliance 🌍📜. We wanted to transform compliance from a reactive, often manual process into a proactive, traceable, and auditable system ✅. Without a specialized tool, flagging features that might violate complex regional regulations is difficult, leading to potential risks and inconsistencies. We set out to build a prototype that leverages Large Language Models (LLMs) 🤖 to automate this detection and provide clear, accurate, and structured compliance checks.
To achieve this, we aimed for a system that was accurate, fast ⚡, and cost‑effective 💰. This led us to fine‑tune the gpt-4.1-mini model, striking a deliberate balance between the higher cost of larger models and the need for low latency suitable for real‑time auditing and interaction.
How We Built It 🛠️
Our project was brought to life ✨ with a dual‑AI architecture, supervised fine‑tuning, and a real‑time monitoring dashboard.
1) The Foundation: Tools and Data 💾
We built our system with a stack of powerful tools and libraries 💻:
- Languages & Dev: Python, Jupyter Notebook
- UI: Streamlit
- ML & Data: pandas, scikit-learn, faiss
- LLM SDKs: vertexai, openai
- Storage & Realtime: Google Sheets API 📄
Our process began with meticulous data preparation ✍️. We created a specialized dataset for training and validation from five core regulation texts plus a Terminologies.csv file, producing Train.csv and Test.csv. The data was AI‑augmented and manually verified to ensure quality and accuracy.
2) The Brain: Fine‑Tuning a Specialized LLM 🧠
The core of our solution is a fine‑tuned gpt-4.1-mini-2025-04-14 model chosen for cost‑efficiency, low latency, and support for supervised fine‑tuning.
Training Data 📚
- We formatted examples into JSONL simulating a conversational flow:
- User query about a feature
- Tool call by the AI to evaluate it
- Tool’s structured response (violation, reasoning, cited articles)
- Final human‑readable summary
- This teaches the model to produce reliable, structured JSON outputs.
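To make the conversational JSONL format concrete, here is a minimal sketch of one training record. The tool name (`check_compliance`) and the field names inside the tool arguments and response are illustrative assumptions, not the project's actual schema.

```python
import json

# One illustrative training record following the four-step flow above:
# user query -> tool call -> structured tool response -> final summary.
# Tool and field names are assumptions for this sketch.
record = {
    "messages": [
        {"role": "user",
         "content": "Does auto-enrolling EU users in data sharing comply?"},
        {"role": "assistant", "content": None,
         "tool_calls": [{
             "id": "call_1", "type": "function",
             "function": {"name": "check_compliance",
                          "arguments": json.dumps({
                              "feature": "auto-enroll data sharing",
                              "region": "EU"})}}]},
        {"role": "tool", "tool_call_id": "call_1",
         "content": json.dumps({"violation": True,
                                "reasoning": "Consent must be opt-in.",
                                "cited_articles": ["GDPR Art. 7"]})},
        {"role": "assistant",
         "content": "Violation: auto-enrollment lacks opt-in consent (GDPR Art. 7)."},
    ]
}

# Each record becomes one line in the JSONL training file.
line = json.dumps(record)
parsed = json.loads(line)
```

Training on records like this is what teaches the model to emit the same structured, auditable JSON at inference time.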
Training Process 🚀
- 3 epochs, batch size = 1, learning rate multiplier = 2
- Achieved train and validation loss of 0.000, indicating a strong fit to our specialized task 🎉
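A sketch of how a supervised fine-tuning job with these hyperparameters could be launched via the OpenAI SDK; the file IDs are placeholders, and the actual run used our prepared Train/Test data uploaded as JSONL.

```python
# Hyperparameters from the training run described above.
hyperparameters = {
    "n_epochs": 3,
    "batch_size": 1,
    "learning_rate_multiplier": 2,
}

# Launching the job (requires an API key and uploaded files; the file IDs
# below are placeholders, not real identifiers):
# from openai import OpenAI
# client = OpenAI()
# job = client.fine_tuning.jobs.create(
#     model="gpt-4.1-mini-2025-04-14",
#     training_file="file-train",        # uploaded training JSONL
#     validation_file="file-validation", # uploaded validation JSONL
#     hyperparameters=hyperparameters,
# )
```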
Safety First 🛡️
- Post‑tuning, the model passed automated safety alignment checks across 15 sensitive categories, ensuring readiness for deployment.
3) The Architecture: A Dual‑AI System for Robustness 🏛️
Our methodology uses Retrieval‑Augmented Generation (RAG) and a multi‑layered decision flow to ensure accuracy and consistency.
AI 1 — Initial Compliance Check 🤖1️⃣
- Retrieves relevant sections from regulatory documents via embeddings
- Analyzes the feature against these texts
- Outputs Violation / No Violation with specific citations
AI 2 — Consistency & Validation 🤖2️⃣
- Looks up historical decisions using RAG
- Compares the new decision to similar past cases
- Supports or challenges the initial assessment
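A minimal, self-contained sketch of the precedent lookup behind AI 2: retrieve the most similar past decision and surface it for comparison. The real system indexes learned embeddings with faiss; here, toy bag-of-words vectors and sample cases stand in so the sketch runs on its own.

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words vector; a stand-in for the real learned embeddings.
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

# Illustrative historical decisions (feature description, logged verdict).
past_cases = [
    ("age gate for minors in US", "Violation"),
    ("opt-in consent banner in EU", "No Violation"),
]
vocab = sorted({w for text, _ in past_cases for w in text.lower().split()})

# Build the precedent index once over all stored cases.
index = np.stack([embed(text, vocab) for text, _ in past_cases])

def most_similar_precedent(query):
    q = embed(query, vocab)
    # Cosine similarity against every stored decision.
    sims = index @ q / (np.linalg.norm(index, axis=1)
                        * (np.linalg.norm(q) or 1.0))
    return past_cases[int(np.argmax(sims))]

precedent_text, precedent_decision = most_similar_precedent(
    "age verification gate for minors")
```

The retrieved precedent is then compared against AI 1's fresh verdict; agreement supports the decision, disagreement flags it for review.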
Human‑in‑the‑Loop 🧑‍💻
- Final, validated decisions are logged to Google Sheets in real time
- Enables immediate oversight and correction
- Creates a continuous feedback loop that improves the system over time
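A sketch of the real-time logging step, assuming an illustrative column layout (the actual sheet name, columns, and credentials setup may differ). The row is built locally and appended with gspread's `append_row`.

```python
from datetime import datetime, timezone

def build_log_row(feature, region, decision, citations):
    # Flatten one validated decision into a spreadsheet row.
    # Column layout (timestamp, feature, region, decision, citations)
    # is illustrative, not the project's exact schema.
    return [
        datetime.now(timezone.utc).isoformat(),
        feature,
        region,
        decision,
        "; ".join(citations),
    ]

row = build_log_row("auto-enroll data sharing", "EU",
                    "Violation", ["GDPR Art. 7"])

# Appending in real time (requires service-account credentials):
# import gspread
# gc = gspread.service_account()
# ws = gc.open("Compliance Log").sheet1
# ws.append_row(row, value_input_option="USER_ENTERED")
```

Because every row lands in the sheet as soon as it is validated, reviewers can correct a decision immediately, and corrected rows feed the next training iteration.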
4) The Eyes: An Intuitive Monitoring Dashboard 👀
We built a Looker Studio dashboard 📊 for transparency and auditing. It provides a comprehensive view of compliance health via:
- KPIs 📈: Total features evaluated, number of violations, overall non‑compliance percentage
- Normalized Metrics: A Regional Violation Rate (violations relative to features per region) to enable fair comparisons
- Visualizations 🗺️: A pie chart of overall compliance, a bar chart of most frequently violated regulations, and a geographic map highlighting regional risk hotspots
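The normalized metric is simple to compute: divide violations by the number of features evaluated in each region. A minimal sketch with toy data (the dashboard reads the real log from Google Sheets):

```python
import pandas as pd

# Toy log of evaluated features; illustrative data only.
df = pd.DataFrame({
    "region":    ["EU", "EU", "EU", "US", "US"],
    "violation": [True, False, True, False, True],
})

# Raw counts would overstate regions with many evaluated features;
# the mean of the boolean column gives violations per feature evaluated.
rates = df.groupby("region")["violation"].mean().rename("regional_violation_rate")
```

Here the EU, with more violations in absolute terms, is compared fairly against the US because each rate is relative to that region's own feature count.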
Challenges We Faced 🧗‍♀️
- Reducing Ambiguity 🤔➡️✅: Generic LLMs can be nuanced but non‑committal. We fine‑tuned on a strict JSONL schema to produce definitive, auditable outputs.
- Ensuring Consistency 🔄: LLM answers can drift over time. Our dual‑AI validator checks against historical precedent to stabilize decisions.
- Meaningful Regional Analysis 🌍📊: Raw counts overstate large regions. Our normalized Regional Violation Rate enables fair, insightful cross‑region comparisons.
What We Learned 🎓
- Power of Small, Specialized Models 💪🤖: A tuned gpt-4.1-mini can rival larger models on a narrow task while keeping latency and cost low.
- Robustness via Multi‑Agent Systems 🤝: Embedding checks and balances (a validator agent) directly into the workflow yields more reliable outputs.
- Data Presentation Matters 🎨: Thoughtful dashboard design and normalized metrics turn raw logs into actionable insight for stakeholders.
- Human‑in‑the‑Loop ❤️🧑‍💻: Real‑time logging to Google Sheets supports oversight and creates a learning system that improves with expert feedback.
Built With
- css
- gspread
- gspread-dataframe
- lookerstudio
- numpy
- pandas
- pillow
- pydantic
- python
- scikit-learn
- streamlit
- streamlitcloud