sriieeu/contract-risk-analyzer


Organizations deal with a large number of legal contracts such as vendor agreements, employment contracts, NDAs, and licensing agreements. Reviewing these documents manually is time-consuming and requires legal expertise because each contract contains multiple clauses that may introduce legal, financial, or compliance risks.

The Contract Risk Analyzer is a legal document analysis system designed to automate the identification and evaluation of risky clauses in contracts. It uses Natural Language Processing (NLP) techniques and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model trained on the CUAD (Contract Understanding Atticus Dataset) to classify contract clauses and assess potential risk levels.

The system extracts text from contract PDFs, segments it into clauses, classifies each clause into a predefined legal category, calculates a risk score, and explains why a clause is considered risky. The application also provides tools for comparing contract versions and visualizing model explanations.
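The first two stages (text extraction and clause segmentation) can be sketched as follows. This is a minimal illustration, not the code in `src/extraction/`: `extract_text` uses PyMuPDF's `fitz.open` and `page.get_text`, while `segment_clauses` swaps in a simple regex heuristic on numbered headings as a stand-in for the spaCy-based segmenter. Both function names and the heading pattern are assumptions made for this example.

```python
import re
from typing import List


def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the segmenter works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


# Assumed heuristic: a clause starts on a line that begins with a dotted
# number such as "1.", "2.1" or "10.1.2" followed by text.
_CLAUSE_START = re.compile(r"(?m)^(?=[ \t]*(?:\d+\.)+\d*[ \t]+\S)")


def segment_clauses(text: str) -> List[str]:
    """Split contract text into clauses at numbered headings."""
    parts = _CLAUSE_START.split(text)
    # Collapse internal whitespace and drop empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]
```

Any text before the first numbered heading (a title or preamble) comes back as its own segment. Contracts without numbered headings would need a different strategy, such as sentence boundaries from spaCy.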

Objectives

The main objectives of the Contract Risk Analyzer system are:

  • Automate the extraction and analysis of legal clauses from contract documents.
  • Classify clauses into standard legal categories based on the CUAD benchmark.
  • Identify potential legal risks within specific clauses.
  • Provide explainable AI outputs showing why the model flagged a clause.
  • Allow contract comparison through a redline diff viewer.
  • Enable quick review of contracts through an interactive web interface.

Features

  • 25+ Clause Types: CUAD benchmark clause classification
  • Risk Scoring: Per-clause risk scores with plain-English explanations
  • SHAP Explainability: Understand what drives each risk flag
  • Redline Diff Viewer: Side-by-side contract comparison
  • PDF Extraction: Automatic clause segmentation via PyMuPDF + spaCy
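One way per-clause risk scoring with plain-English explanations could work is sketched below. The weights, keyword rules, thresholds, and the `score_clause` function are all hypothetical values chosen for illustration; they are not taken from `risk_scorer.py` or `risk_rules.py`. The category names are CUAD labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical base weights per clause category, in [0, 1].
CATEGORY_WEIGHTS: Dict[str, float] = {
    "Uncapped Liability": 0.9,
    "Non-Compete": 0.7,
    "Governing Law": 0.2,
}
DEFAULT_WEIGHT = 0.3

# Hypothetical keyword rules: (phrase, score bump, explanation).
KEYWORD_RULES: List[Tuple[str, float, str]] = [
    ("unlimited", 0.2, "mentions unlimited exposure"),
    ("sole discretion", 0.15, "grants one party sole discretion"),
    ("perpetual", 0.1, "has no end date"),
]


@dataclass
class RiskResult:
    score: float        # clipped to [0, 1]
    level: str          # "low", "medium" or "high"
    reasons: List[str]  # plain-English explanations


def score_clause(text: str, category: str, confidence: float) -> RiskResult:
    """Combine the category weight, model confidence, and keyword rules."""
    base = CATEGORY_WEIGHTS.get(category, DEFAULT_WEIGHT) * confidence
    reasons = [f"classified as '{category}' ({confidence:.0%} confidence)"]
    lowered = text.lower()
    for phrase, bump, why in KEYWORD_RULES:
        if phrase in lowered:
            base += bump
            reasons.append(why)
    score = round(min(base, 1.0), 2)
    level = "high" if score >= 0.7 else "medium" if score >= 0.4 else "low"
    return RiskResult(score, level, reasons)
```

Scaling the category weight by the classifier's confidence keeps uncertain predictions from dominating the score, while the keyword rules supply the human-readable reasons.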

Architecture

contract-risk-analyzer/
├── src/
│   ├── extraction/        # PDF parsing + clause segmentation
│   │   ├── pdf_extractor.py
│   │   └── clause_segmenter.py
│   ├── classification/    # BERT fine-tuned on CUAD
│   │   ├── cuad_model.py
│   │   └── clause_classifier.py
│   ├── risk/              # Risk scoring engine
│   │   ├── risk_scorer.py
│   │   └── risk_rules.py
│   ├── explainability/    # SHAP integration
│   │   └── shap_explainer.py
│   └── ui/                # Streamlit app
│       └── app.py
├── data/
│   └── cuad_labels.json   # CUAD clause taxonomy
├── models/                # Fine-tuned model checkpoints
├── tests/
└── requirements.txt

Setup:

pip install -r requirements.txt
python -m spacy download en_core_web_sm
streamlit run src/ui/app.py

CUAD Dataset:

The Contract Understanding Atticus Dataset (CUAD) (https://www.atticusprojectai.org/cuad) contains 510 contracts with 13,000+ expert annotations across 41 legal clause categories. We fine-tune bert-base-uncased on this benchmark for clause classification.
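Inference with the fine-tuned model might look like the sketch below. It assumes the checkpoint was saved as a sequence-classification model loadable by the Hugging Face `pipeline("text-classification", ...)` API; the directory name `models/cuad-bert`, the `min_score` threshold, and both function names are hypothetical.

```python
from typing import Callable, Dict, List


def load_classifier(model_dir: str = "models/cuad-bert") -> Callable:
    """Build a text-classification pipeline from a saved checkpoint."""
    from transformers import pipeline  # imported lazily; heavy dependency

    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


def classify_clauses(
    clauses: List[str], classifier: Callable, min_score: float = 0.5
) -> List[Dict]:
    """Label each clause; low-confidence predictions are marked 'Unknown'."""
    # truncation=True keeps long clauses within BERT's 512-token limit.
    predictions = classifier(clauses, truncation=True)
    results = []
    for clause, pred in zip(clauses, predictions):
        label = pred["label"] if pred["score"] >= min_score else "Unknown"
        results.append({"clause": clause, "label": label, "score": pred["score"]})
    return results
```

Passing the classifier in as an argument keeps `classify_clauses` easy to test with a stub, and means the model is loaded once rather than per call.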

Tech Stack

  • PDF Parsing: PyMuPDF (fitz)
  • NLP Pipeline: spaCy (en_core_web_sm)
  • Model: BERT fine-tuned on CUAD (HuggingFace Transformers)
  • Explainability: SHAP (transformers pipeline explainer)
  • UI: Streamlit
  • Diff: difflib + custom React-style renderer
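For the diff component, a minimal sketch of comparing two contract versions with the standard-library `difflib` module is shown below. The `redline` and `side_by_side_html` helpers are illustrative names and do not reproduce the project's custom renderer.

```python
import difflib
from typing import List


def redline(old_clauses: List[str], new_clauses: List[str]) -> List[str]:
    """Return a line-level redline: '- ' removed, '+ ' added, '  ' unchanged."""
    diff = difflib.ndiff(old_clauses, new_clauses)
    # ndiff also emits '? ' hint lines for intraline changes; drop them here.
    return [line for line in diff if not line.startswith("? ")]


def side_by_side_html(old_clauses: List[str], new_clauses: List[str]) -> str:
    """Render a side-by-side HTML table with difflib's built-in renderer."""
    return difflib.HtmlDiff(wrapcolumn=80).make_table(
        old_clauses, new_clauses, fromdesc="Original", todesc="Revised"
    )
```

In a Streamlit app, the HTML table could be displayed with `st.markdown(html, unsafe_allow_html=True)`.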

Web Preview: [6 screenshots]
