A Python application for extracting and indexing legal clauses from PDF contracts using Large Language Models (LLMs), vector search, and a relational database. It is built with security, modularity, and scalability in mind.
Legal professionals and organizations need efficient ways to:
- Extract structured clauses from unstructured PDF contracts
- Search through contract clauses using natural language queries
- Maintain secure, versioned contract databases
- Scale clause extraction and search operations
This application follows a modular, service-oriented architecture with clear separation of concerns:
- PDF Processing: Extract text from PDF documents using PyMuPDF
- LLM Integration: Use Google Gemini and OpenAI for intelligent clause extraction
- Vector Search: Index clauses using OpenSearch with sentence transformers
- Data Persistence: Store contracts and clauses in MySQL with proper relationships
- REST API: Flask-based API for document upload and semantic search
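The contract-to-clause relationship described in the Data Persistence item can be sketched as follows. This is a minimal illustration, not the project's actual schema: it uses the standard-library `sqlite3` module as a stand-in for MySQL so it runs without a server, and the table names, column names, and the helpers `open_demo_db` and `store_contract` are all hypothetical.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
# SQLite stands in for MySQL here so the sketch runs without a server.
SCHEMA = """
CREATE TABLE contracts (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    uploaded_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE clauses (
    id INTEGER PRIMARY KEY,
    contract_id INTEGER NOT NULL REFERENCES contracts(id) ON DELETE CASCADE,
    clause_type TEXT NOT NULL,
    body TEXT NOT NULL
);
"""


def open_demo_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def store_contract(conn, filename, clauses):
    """Insert one contract and its clauses in a single transaction.

    `clauses` is an iterable of (clause_type, body) pairs.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO contracts (filename) VALUES (?)", (filename,)
        )
        contract_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO clauses (contract_id, clause_type, body) "
            "VALUES (?, ?, ?)",
            [(contract_id, ctype, body) for ctype, body in clauses],
        )
    return contract_id
```

Keeping each clause in its own row with a foreign key back to its contract means that deleting a contract also removes its clauses, and a search hit on a clause can always be traced to its source document.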
- Backend: Python 3.8+, Flask
- AI/ML: Google Gemini, OpenAI GPT, Sentence Transformers
- Search: OpenSearch with KNN vectors
- Database: MySQL
- Processing: PyTorch, LangChain, TikToken
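To make "OpenSearch with KNN vectors" concrete, the sketch below builds the request bodies for a k-NN index and a k-NN query as plain Python dictionaries. The field names (`contract_id`, `clause_text`, `embedding`) and the helper functions are hypothetical, and the dimension of 384 is an assumption that matches small sentence-transformer models such as `all-MiniLM-L6-v2`; the project's real mapping may differ.

```python
# Assumption: 384 matches small sentence-transformer models such as
# all-MiniLM-L6-v2; use the dimension of whichever model is configured.
EMBEDDING_DIM = 384


def knn_index_body(dim=EMBEDDING_DIM):
    """Index settings and mapping for clauses with a k-NN vector field.

    Field names are hypothetical, chosen for illustration.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "contract_id": {"type": "integer"},
                "clause_text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dim},
            }
        },
    }


def knn_query_body(query_vector, k=5):
    """Query body that asks for the k clauses nearest to query_vector."""
    return {
        "size": k,
        "query": {
            "knn": {"embedding": {"vector": list(query_vector), "k": k}}
        },
    }
```

With the `opensearch-py` client, dictionaries like these would typically be passed as `body=` to `client.indices.create(...)` and `client.search(...)`. The query vector must come from the same embedding model that was used at indexing time.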
- Python 3.8+
- MySQL database
- OpenSearch instance
- API keys for Google Gemini and OpenAI
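These prerequisites usually reach the application as environment settings, so a quick check before startup can catch a missing key early. The sketch below is only an illustration: the variable names are hypothetical (the real ones are whatever `.env.example` lists), and `missing_settings` is not part of the project.

```python
import os

# Hypothetical variable names; the real ones are listed in .env.example.
REQUIRED_VARS = (
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
    "MYSQL_HOST",
    "OPENSEARCH_HOST",
)


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the required names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Missing settings: " + ", ".join(missing))
    print("All required settings are present.")
```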
- Clone the repository
- Create a virtual environment and activate it
- Install dependencies:
  pip install -r requirements.txt
- Copy .env.example to .env and fill in your secrets
- Set up MySQL and OpenSearch (see docs/ARCHITECTURE.md)
- Run the app:
  python run.py
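Once the app is running, the `POST /upload` and `POST /search` endpoints documented in this README can be called from a small client. The sketch below uses only the standard library and is a rough illustration rather than an official client. The `file` form field and the `clause` JSON key come from the endpoint descriptions, while the base URL (Flask's default development port) and the helper names are assumptions. The response format is not specified here, so `send` returns the raw response text.

```python
import json
import urllib.request
import uuid

# Assumption: Flask's default development address; run.py may differ.
BASE_URL = "http://localhost:5000"


def build_search_request(clause, base_url=BASE_URL):
    """Build a POST /search request with a JSON body."""
    payload = json.dumps({"clause": clause}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_upload_request(filename, pdf_bytes, base_url=BASE_URL):
    """Build a POST /upload request with one multipart 'file' part."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/upload",
        data=head + pdf_bytes + tail,
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}"
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A typical call would be `send(build_search_request("confidentiality agreement terms"))`. Building the request separately from sending it keeps the payload easy to inspect without a live server.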
POST /upload
Content-Type: multipart/form-data
file: <PDF file>

POST /search
Content-Type: application/json
{
  "clause": "confidentiality agreement terms"
}

Run the test suite:
  pytest tests/ --cov=src --cov-report=html
Run specific tests:
  pytest tests/test_tiktoken.py
- Linting:
  flake8 src tests
- Formatting:
  black src tests
- Type checking:
  mypy src
contract-clause-extractor/
├── src/                 # Core application code
│   ├── __init__.py
│   ├── app.py           # Flask application
│   ├── database.py      # Database operations
│   ├── init_db.py       # Database initialization
│   ├── router/          # API routes
│   ├── services/        # Business logic services
│   └── utils/           # Utility functions
├── tests/               # Unit and integration tests
├── docs/                # Documentation
├── config/              # Configuration files
├── scripts/             # Automation scripts
├── .env.example         # Environment template
├── pyproject.toml       # Project configuration
├── requirements.txt     # Dependencies
└── run.py               # Application entry point
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This application handles sensitive legal documents. Please review our Security Policy for responsible disclosure and secure development practices.