A professional-grade Retrieval-Augmented Generation (RAG) application designed for auditing complex legal documents. Built with a robust containerized architecture, it moves beyond simple chat to provide Deep Contextual Audits of contracts using "Lost-in-the-Middle" mitigation strategies.
This auditor uses a state-of-the-art RAG pipeline to allow legal professionals to interact with large volumes of documents. It features recursive text splitting optimized for legal clauses and a streaming UI for a real-time, low-latency experience.
- Brain: Gemini 1.5 Flash (via LangChain Google GenAI)
- Memory: Chroma vector database (Chroma Cloud)
- Interface: Streamlit (with Session State Memory)
- Infrastructure: Docker & Docker Compose
- Deep Contextual Audit: Implements `RecursiveCharacterTextSplitter` with 2000-character chunks and high overlap to ensure legal definitions and multi-paragraph clauses remain semantically intact.
- Metadata Injection: Custom ingestion pipeline extracts document metadata (Filename, Date, Type) and injects it alongside the vector embeddings to improve retrieval accuracy.
- Persistent Vector Memory: Integrated with Chroma to store and retrieve document embeddings efficiently, preventing re-indexing of duplicate files.
- Real-time Streaming: Implemented token-by-token response streaming using LangChain generators for a responsive user experience.
- Dockerized Architecture: Fully containerized microservices architecture ensuring consistent performance across development and production environments.
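A minimal pure-Python illustration of the overlap-preserving chunking described above (a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter`, not the project's actual code; the 400-character overlap is an assumed value, since the README only says "high"):

```python
def split_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 400) -> list[str]:
    """Greedy character splitter: cut near chunk_size, preferring a
    paragraph break, and back up by `overlap` characters between chunks
    so a clause cut at one boundary survives intact in the next chunk."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer the last paragraph break inside the window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks
```

The overlap means adjacent chunks share their boundary text, which is the standard mitigation for definitions that straddle a chunk border.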

| Component | Technology | Description |
|---|---|---|
| Orchestration | LangChain (LCEL) | Manages the Retrieval/Generation chains. |
| Frontend | Streamlit | Chat interface with st.session_state history. |
| Vector Store | Chroma | High-performance vector similarity search. |
| LLM | Google Gemini 1.5 Flash | Selected for its 1M+ token context window. |
| DevOps | Docker Compose | Multi-container orchestration. |
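One plausible reading of the Metadata Injection feature is that the ingestion pipeline prepends a compact metadata header to each chunk before embedding, so filename, date, and type influence similarity search. A hedged sketch (the header format and field names are assumptions, not the project's actual code):

```python
def inject_metadata(chunk_text: str, metadata: dict[str, str]) -> str:
    """Prepend a metadata header to a chunk so the embedding model
    sees Filename/Date/Type alongside the clause text."""
    header = " | ".join(f"{key}: {value}" for key, value in metadata.items())
    return f"[{header}]\n{chunk_text}"

# The enriched string, not the raw chunk, is what gets embedded and stored:
enriched = inject_metadata(
    "The Receiving Party shall not disclose...",
    {"Filename": "nda.pdf", "Date": "2024-01-15", "Type": "NDA"},
)
```

Chroma also stores the same dict as queryable chunk metadata, which enables filtered retrieval later.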
- Docker and Docker Compose installed.
- A Google Gemini API Key.
Create a `.env` file in the root directory:

```env
GOOGLE_API_KEY=your_gemini_api_key_here
CHROMA_KEYS=your_chroma_keys_here
```

Launch the entire stack from scratch to ensure a clean build:
```bash
# Build and start the containers
docker compose up --build -d
```

- Access: Open the UI at http://localhost:8501.
- Upload: Use the sidebar to upload PDF legal contracts.
- Audit: Ask complex reasoning questions:
  - "What are the termination liabilities in Section 4.2?"
  - "Is there a non-compete clause that exceeds 12 months?"
- Monitor: To check the logs in real time:

```bash
docker compose logs -f
```
```
legal-rag/
├── DATA/
│   └── UPLOADED/          # Persistent storage for PDF inputs
│       └── files.pdf
├── src/
│   ├── llm_model/         # Embedding & LLM initialization
│   │   ├── embedding.py
│   │   └── llm.py
│   ├── prompts/           # RAG chains & system templates
│   │   └── legal_template.py
│   ├── text_handler/      # Chunking & formatting logic
│   │   └── splitter.py
│   ├── utils/             # Helper functions (DB & chat)
│   │   ├── chat_utils.py
│   │   └── db_utils.py
│   ├── database.py        # Vector store connection
│   ├── ingestion.py       # PDF extraction & metadata logic
│   └── retriever.py       # Two-stage retrieval pipeline
├── tests/                 # Evaluation & testing suite
│   ├── deep_eval_model.py
│   ├── generate_eval_data.py
│   ├── rag_chain_eval.py
│   └── test_eval.py
├── app.py                 # Main Streamlit entry point
├── deepeval_test.py       # Automated testing script
├── Dockerfile             # Container definition
└── docker-compose.yml     # Orchestration config
```
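`src/retriever.py` is described as a two-stage retrieval pipeline. Assuming stage one is a broad vector-similarity search and stage two narrows by metadata (the stage definitions, `vec`/`meta` field names, and `Type` filter are all assumptions for illustration), a minimal sketch might look like:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_vec, docs, k_broad=20, k_final=5, doc_type=None):
    """Stage 1: broad similarity search over all chunks.
    Stage 2: filter by chunk metadata and keep the top k_final."""
    broad = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    broad = broad[:k_broad]
    if doc_type is not None:
        broad = [d for d in broad if d["meta"].get("Type") == doc_type]
    return broad[:k_final]
```

In the real app, stage one would be a Chroma similarity query and stage two could equally be a reranker; the point is that casting a wide net first and narrowing second helps counter the lost-in-the-middle effect.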
To stop the application and clear the cache/volumes (useful for resetting the database):

```bash
docker compose down
docker system prune -a --volumes
```
Found a bug or have a suggestion for the auditing pipeline? Please open an issue!



