Organizations deal with a large number of legal contracts such as vendor agreements, employment contracts, NDAs, and licensing agreements. Reviewing these documents manually is time-consuming and requires legal expertise because each contract contains multiple clauses that may introduce legal, financial, or compliance risks.
The Contract Risk Analyzer is a legal document analysis system designed to automate the identification and evaluation of risky clauses in contracts. It uses Natural Language Processing (NLP) techniques and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model trained on the CUAD (Contract Understanding Atticus Dataset) to classify contract clauses and assess potential risk levels.
The system extracts text from contract PDFs, segments them into clauses, classifies each clause into predefined legal categories, calculates a risk score, and explains why a clause is considered risky. The application also provides tools for comparing contract versions and visualizing model explanations.
Objectives The main objectives of the Contract Risk Analyzer system are: Automate the extraction and analysis of legal clauses from contract documents. Classify clauses into standard legal categories based on the CUAD benchmark. Identify potential legal risks within specific clauses. Provide explainable AI outputs showing why the model flagged a clause. Allow contract comparison through a redline diff viewer. Enable quick review of contracts through an interactive web interface
- 25+ Clause Types: CUAD benchmark clause classification
- Risk Scoring: Per-clause risk scores with plain-English explanations
- SHAP Explainability: Understand what drives each risk flag
- Redline Diff Viewer: Side-by-side contract comparison
- PDF Extraction: Automatic clause segmentation via PyMuPDF + spaCy
contract-risk-analyzer/
├── src/
│ ├── extraction/ # PDF parsing + clause segmentation
│ │ ├── pdf_extractor.py
│ │ └── clause_segmenter.py
│ ├── classification/ # BERT fine-tuned on CUAD
│ │ ├── cuad_model.py
│ │ └── clause_classifier.py
│ ├── risk/ # Risk scoring engine
│ │ ├── risk_scorer.py
│ │ └── risk_rules.py
│ ├── explainability/ # SHAP integration
│ │ └── shap_explainer.py
│ └── ui/ # Streamlit app
│ └── app.py
├── data/
│ └── cuad_labels.json # CUAD clause taxonomy
├── models/ # Fine-tuned model checkpoints
├── tests/
└── requirements.txt
Setup: pip install -r requirements.txt python -m spacy download en_core_web_sm streamlit run src/ui/app.py
CUAD Dataset:
The Contract Understanding Atticus Dataset (CUAD) (https://www.atticusprojectai.org/cuad) contains 510 contracts with 13,000+ expert annotations across 41 legal clause categories. We fine-tune bert-base-uncased on this benchmark for clause classification.
Tech Stack
- PDF Parsing: PyMuPDF (fitz)
- NLP Pipeline: spaCy (en_core_web_sm)
- Model: BERT fine-tuned on CUAD (HuggingFace Transformers)
- Explainability: SHAP (transformers pipeline explainer)
- UI: Streamlit
- Diff: difflib + custom React-style renderer



