JosephJonathanFernandes/Contract_Clause_extractor

Contract Clause Extractor

A professional, enterprise-grade Python application for extracting and indexing legal clauses from PDF contracts using Large Language Models (LLMs), vector search, and relational databases. Built with security, modularity, and scalability in mind.

🎯 Problem Statement

Legal professionals and organizations need efficient ways to:

  • Extract structured clauses from unstructured PDF contracts
  • Search through contract clauses using natural language queries
  • Maintain secure, versioned contract databases
  • Scale clause extraction and search operations

πŸ—οΈ Architecture Overview

This application follows a modular, service-oriented architecture with clear separation of concerns:

Core Components

  • PDF Processing: Extract text from PDF documents using PyMuPDF
  • LLM Integration: Use Google Gemini and OpenAI for intelligent clause extraction
  • Vector Search: Index clauses using OpenSearch with sentence transformers
  • Data Persistence: Store contracts and clauses in MySQL with proper relationships
  • REST API: Flask-based API for document upload and semantic search
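The vector search component can be illustrated with the request body that OpenSearch's k-NN plugin accepts. This is a minimal sketch, not the project's actual query: the field name `clause_vector` and the helper `build_knn_query` are hypothetical, and the query vector is assumed to come from the same sentence-transformer model used at indexing time.

```python
from typing import Any, Dict, List


def build_knn_query(
    vector: List[float],
    k: int = 5,
    field: str = "clause_vector",  # hypothetical field name, not taken from the project
) -> Dict[str, Any]:
    """Build an OpenSearch k-NN search body for one query embedding."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,  # number of hits to return
        "query": {
            "knn": {
                field: {
                    "vector": vector,  # embedding of the search text
                    "k": k,  # number of nearest neighbours to retrieve
                }
            }
        },
    }


# Example with a toy 3-dimensional embedding; real embeddings are much longer.
example_body = build_knn_query([0.1, 0.2, 0.3], k=2)
```

The resulting dictionary would be passed as the search body to an OpenSearch client; the index name and vector dimension depend on how the index was created.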

Technology Stack

  • Backend: Python 3.8+, Flask
  • AI/ML: Google Gemini, OpenAI GPT, Sentence Transformers
  • Search: OpenSearch with KNN vectors
  • Database: MySQL
  • Processing: PyTorch, LangChain, TikToken

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • MySQL database
  • OpenSearch instance
  • API keys for Google Gemini and OpenAI

Setup

  1. Clone the repository
  2. Create a virtual environment and activate it
  3. Install dependencies: pip install -r requirements.txt
  4. Copy .env.example to .env and fill in your secrets
  5. Set up MySQL and OpenSearch (see docs/ARCHITECTURE.md)
  6. Run the app: python run.py
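Before running the app, it can help to confirm that the secrets from step 4 are actually set. The sketch below is only an illustration: the setting names in `REQUIRED_SETTINGS` are hypothetical placeholders, so substitute the keys that appear in `.env.example`.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical setting names; replace with the keys listed in .env.example.
REQUIRED_SETTINGS = (
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
    "MYSQL_HOST",
    "OPENSEARCH_HOST",
)


def missing_settings(
    env: Mapping[str, str],
    required: Iterable[str] = REQUIRED_SETTINGS,
) -> List[str]:
    """Return the required setting names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Check the current process environment and report anything that is unset.
unset = missing_settings(os.environ)
if unset:
    print("Missing settings:", ", ".join(unset))
```

This only inspects variables already present in the process environment; it does not read the `.env` file itself.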

📡 API Usage

Upload Contract

POST /upload
Content-Type: multipart/form-data

file: <PDF file>
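The upload request can be built from Python with only the standard library. This is a sketch under assumptions: the base URL `http://localhost:5000` and the helper `build_upload_request` are hypothetical, and only the `file` form field comes from the request description above.

```python
import urllib.request
import uuid


def build_upload_request(
    base_url: str, pdf_bytes: bytes, filename: str
) -> urllib.request.Request:
    """Build a multipart/form-data POST to /upload with a single 'file' part."""
    boundary = uuid.uuid4().hex  # random delimiter between form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/upload",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Hypothetical local server address; adjust to wherever run.py is serving.
upload_request = build_upload_request(
    "http://localhost:5000", b"%PDF-1.4 example bytes", "contract.pdf"
)
```

Passing `upload_request` to `urllib.request.urlopen` would send it to a running server. The response format is not documented here, so inspect the returned body before relying on its shape.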

Search Clauses

POST /search
Content-Type: application/json

{
  "clause": "confidentiality agreement terms"
}
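The same search call can be prepared from Python using the standard library. This is a minimal sketch: the base URL and the helper `build_search_request` are hypothetical, while the `/search` path and the `clause` key come from the request shown above.

```python
import json
import urllib.request


def build_search_request(base_url: str, clause: str) -> urllib.request.Request:
    """Build a JSON POST to /search carrying the natural-language query."""
    payload = json.dumps({"clause": clause}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local server address; adjust to wherever run.py is serving.
search_request = build_search_request(
    "http://localhost:5000", "confidentiality agreement terms"
)
```

Passing `search_request` to `urllib.request.urlopen` would send it to a running server. The response schema is not specified here, so treat any field names in the reply as something to verify against the running API.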

🧪 Testing

Run the test suite:

pytest tests/ --cov=src --cov-report=html

Run specific tests:

pytest tests/test_tiktoken.py

πŸ› οΈ Development

Code Quality

  • Linting: flake8 src tests
  • Formatting: black src tests
  • Type checking: mypy src

Project Structure

contract-clause-extractor/
├── src/                    # Core application code
│   ├── __init__.py
│   ├── app.py             # Flask application
│   ├── database.py        # Database operations
│   ├── init_db.py         # Database initialization
│   ├── router/            # API routes
│   ├── services/          # Business logic services
│   └── utils/             # Utility functions
├── tests/                 # Unit and integration tests
├── docs/                  # Documentation
├── config/                # Configuration files
├── scripts/               # Automation scripts
├── .env.example           # Environment template
├── pyproject.toml         # Project configuration
├── requirements.txt       # Dependencies
└── run.py                 # Application entry point

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests and linting
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Security

This application handles sensitive legal documents. Please review our Security Policy for responsible disclosure and secure development practices.

📚 Documentation
