About the Project

Our Inspiration 💡

Our primary motivation was to address a critical blind spot in product development: geo‑specific regulatory compliance 🌍📜. We wanted to transform compliance from a reactive, often manual process into a proactive, traceable, and auditable system ✅. Without a specialized tool, flagging features that might violate complex regional regulations is difficult, leading to potential risks and inconsistencies. We set out to build a prototype that leverages Large Language Models (LLMs) 🤖 to automate this detection and provide clear, accurate, and structured compliance checks.

To achieve this, we aimed for a system that was accurate, fast ⚡, and cost‑effective 💰. This led us to fine‑tune the gpt-4.1-mini model, striking a deliberate balance between the higher cost of larger models and the need for low latency suitable for real‑time auditing and interaction.


How We Built It 🛠️

Our project was brought to life ✨ with a dual‑AI architecture, supervised fine‑tuning, and a real‑time monitoring dashboard.

1) The Foundation: Tools and Data 💾

We built our system with a stack of powerful tools and libraries 💻:

  • Languages & Dev: Python, Jupyter Notebook
  • UI: Streamlit
  • ML & Data: pandas, scikit-learn, faiss
  • LLM SDKs: vertexai, openai
  • Storage & Realtime: Google Sheets API 📄

Our process began with meticulous data preparation ✍️. We created a specialized dataset for training and validation from five core regulation texts plus a Terminologies.csv file, producing Train.csv and Test.csv. The data was AI‑augmented and manually verified to ensure quality and accuracy.
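As a sketch of that preparation step, the split into Train.csv and Test.csv might look like the following. The column names and example rows here are illustrative, not our actual schema; the real dataset was built from the five regulation texts and Terminologies.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical schema: each row pairs a feature description with a
# manually verified compliance label (columns are illustrative).
examples = pd.DataFrame({
    "feature": [
        "Auto-share user location with ad partners",
        "Age gate for users under 16 in the EU",
        "Default opt-in to marketing emails",
        "Self-service data export for account holders",
        "Indefinite retention of deleted-account data",
        "Consent banner shown before any tracking",
    ],
    "label": ["Violation", "No Violation", "Violation",
              "No Violation", "Violation", "No Violation"],
})

# Stratified split so Train.csv and Test.csv keep the same class balance.
train_df, test_df = train_test_split(
    examples, test_size=1/3, stratify=examples["label"], random_state=42
)
train_df.to_csv("Train.csv", index=False)
test_df.to_csv("Test.csv", index=False)
```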

2) The Brain: Fine‑Tuning a Specialized LLM 🧠

The core of our solution is a fine‑tuned gpt-4.1-mini-2025-04-14 model chosen for cost‑efficiency, low latency, and support for supervised fine‑tuning.

Training Data 📚

  • We formatted examples into JSONL simulating a conversational flow:
    1. User query about a feature
    2. Tool call by the AI to evaluate it
    3. Tool’s structured response (violation, reasoning, cited articles)
    4. Final human‑readable summary
  • This format teaches the model to produce reliable, structured JSON outputs.
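One training record in that four-step flow might be assembled like this. The tool name `evaluate_compliance` and the JSON fields are illustrative stand-ins for our actual schema:

```python
import json

# Hypothetical fine-tuning record mirroring the four-step flow:
# user query -> tool call -> structured tool response -> summary.
record = {
    "messages": [
        {"role": "user",
         "content": "Does auto-enrolling EU minors in targeted ads comply?"},
        {"role": "assistant", "content": None,
         "tool_calls": [{
             "id": "call_1", "type": "function",
             "function": {
                 "name": "evaluate_compliance",  # illustrative tool name
                 "arguments": json.dumps(
                     {"feature": "auto-enroll EU minors in targeted ads"}),
             }}]},
        {"role": "tool", "tool_call_id": "call_1",
         "content": json.dumps({
             "violation": True,
             "reasoning": "Targeted ads to minors require verified consent.",
             "cited_articles": ["GDPR Art. 8"],
         })},
        {"role": "assistant",
         "content": "Violation: auto-enrollment conflicts with GDPR Art. 8."},
    ]
}

# Each record becomes one line of the JSONL training file.
line = json.dumps(record)
```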

Training Process 🚀

  • 3 epochs, batch size = 1, learning rate multiplier = 2
  • Both training and validation loss converged to 0.000, indicating a very tight fit to our specialized task 🎉
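The job submission with these hyperparameters might be sketched as below. The file IDs are placeholders, and actually launching the job requires an uploaded JSONL file and an OpenAI API key:

```python
# Hyperparameters from our training run.
hyperparameters = {
    "n_epochs": 3,
    "batch_size": 1,
    "learning_rate_multiplier": 2,
}

job_request = {
    "model": "gpt-4.1-mini-2025-04-14",
    "training_file": "file-TRAIN",    # placeholder uploaded Train JSONL ID
    "validation_file": "file-TEST",   # placeholder uploaded Test JSONL ID
    "hyperparameters": hyperparameters,
}

# With the openai SDK this would be submitted roughly as:
#   client = openai.OpenAI()
#   client.fine_tuning.jobs.create(**job_request)
```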

Safety First 🛡️

  • Post‑tuning, the model passed automated safety alignment checks across 15 sensitive categories, ensuring readiness for deployment.

3) The Architecture: A Dual‑AI System for Robustness 🏛️

Our methodology uses Retrieval‑Augmented Generation (RAG) and a multi‑layered decision flow to ensure accuracy and consistency.

  • AI 1 — Initial Compliance Check 🤖1️⃣

    • Retrieves relevant sections from regulatory documents via embeddings
    • Analyzes the feature against these texts
    • Outputs Violation / No Violation with specific citations
  • AI 2 — Consistency & Validation 🤖2️⃣

    • Looks up historical decisions using RAG
    • Compares the new decision to similar past cases
    • Supports or challenges the initial assessment
  • Human‑in‑the‑Loop 🧑‍💻

    • Final, validated decisions are logged to Google Sheets in real time
    • Enables immediate oversight and correction
    • Creates a continuous feedback loop that improves the system over time
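The retrieval step both agents share can be sketched as a nearest-neighbor search over embedded passages. This minimal version uses random unit vectors as stand-in embeddings and a plain dot product; in the actual system, passages are embedded with a real model and indexed with faiss:

```python
import numpy as np

# Stand-in embeddings: random unit vectors instead of a real embedding
# model, so the structure of the lookup is runnable in isolation.
rng = np.random.default_rng(0)

def embed(texts):
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

passages = [
    "Article 8: processing a child's data requires parental consent.",
    "Article 17: users may request erasure of personal data.",
    "Article 33: breaches must be reported within 72 hours.",
]
index = embed(passages)  # (n_passages, dim), embedded once up front

def retrieve(query_vec, k=2):
    # Cosine similarity reduces to a dot product on unit vectors.
    scores = index @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

# AI 1 grounds its Violation / No Violation call on the top-k passages;
# AI 2 runs the same lookup over logged past decisions instead.
hits = retrieve(embed(["feature handling minors' data"])[0])
```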

4) The Eyes: An Intuitive Monitoring Dashboard 👀

We built a Looker Studio dashboard 📊 for transparency and auditing. It provides a comprehensive view of compliance health via:

  • KPIs 📈: Total features evaluated, number of violations, overall non‑compliance percentage
  • Normalized Metrics: A Regional Violation Rate (violations as a share of features evaluated per region) to enable fair comparisons
  • Visualizations 🗺️: A pie chart of overall compliance, a bar chart of most frequently violated regulations, and a geographic map highlighting regional risk hotspots
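The Regional Violation Rate boils down to a per-region group-by over the decision log. Region names and counts below are illustrative:

```python
import pandas as pd

# Hypothetical per-feature decision log.
log = pd.DataFrame({
    "region": ["EU", "EU", "EU", "US", "US", "APAC"],
    "violation": [True, False, True, False, True, False],
})

# Violations relative to features evaluated in each region, so a large
# region is not penalized for simply shipping more features.
by_region = log.groupby("region")["violation"].agg(
    features="count", violations="sum"
)
by_region["violation_rate"] = by_region["violations"] / by_region["features"]
```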

Challenges We Faced 🧗‍♀️

  • Reducing Ambiguity 🤔➡️✅: Generic LLMs can be nuanced but non‑committal. We fine‑tuned on a strict JSONL schema to produce definitive, auditable outputs.
  • Ensuring Consistency 🔄: LLM answers can drift over time. Our dual‑AI validator checks against historical precedent to stabilize decisions.
  • Meaningful Regional Analysis 🌍📊: Raw counts overstate large regions. Our normalized Regional Violation Rate enables fair, insightful cross‑region comparisons.

What We Learned 🎓

  • Power of Small, Specialized Models 💪🤖: A tuned gpt-4.1-mini can rival larger models on a narrow task while keeping latency and cost low.
  • Robustness via Multi‑Agent Systems 🤝: Embedding checks and balances (a validator agent) directly into the workflow yields more reliable outputs.
  • Data Presentation Matters 🎨: Thoughtful dashboard design and normalized metrics turn raw logs into actionable insight for stakeholders.
  • Human‑in‑the‑Loop ❤️🧑‍💻: Real‑time logging to Google Sheets supports oversight and creates a learning system that improves with expert feedback.
