
RedactAI 🛡️

Privacy firewall for your PDFs before sending to LLMs


🤐 The Problem

Large Language Models are becoming a default tool for reviewing, summarizing, and extracting insights from documents. But there is a hidden cost.

Most LLMs require raw document input. When you upload a contract, medical record, financial statement, or internal report, you are often sending unfiltered sensitive data along with it.

This creates real risks:

Personally identifiable information is exposed unintentionally

Confidential or regulated data leaves your control

Manual redaction is slow, error-prone, and inconsistent

Existing redaction tools are either rule-based, cloud-only, or break document structure

In practice, teams are forced to choose between using LLMs effectively and protecting privacy. That trade-off should not exist.

πŸ›‘οΈ The Solution

RedactAI is an MCP (Model Context Protocol) server that provides AI-powered sensitive data detection and redaction for PDF documents. It leverages local Ollama models to identify and permanently remove personally identifiable information (PII) from PDFs while maintaining document integrity.

Simply provide a PDF file path, and RedactAI will:

  • ✅ Automatically detect sensitive data (names, emails, dates, IDs, medical info, financial data)
  • ✅ Redact permanently by blacking out sensitive information
  • ✅ Preview with a side-by-side before-and-after comparison
  • ✅ Choose your model for speed vs. accuracy trade-offs
  • ✅ Customize redactions by excluding false positives or adding missed items

Privacy First: All processing happens locally with Ollama - your data never leaves your machine.


🎥 Demo Video


Click to watch the full demo on YouTube


🎯 Why RedactAI?

| Feature | RedactAI | Manual Redaction | Cloud Services |
|---|---|---|---|
| Privacy | ✅ 100% Local | ✅ Local | ❌ Data sent to cloud |
| Speed | ✅ Seconds | ❌ Hours | ✅ Fast |
| Accuracy | ✅ AI-Powered | ❌ Error-prone | ✅ AI-Powered |
| Cost | ✅ Free | ⚠️ Time-consuming | ❌ Subscription fees |
| Customization | ✅ Full control | ✅ Full control | ❌ Limited |
| Audit Trail | ✅ Highlighted preview | ❌ Manual tracking | ⚠️ Varies |

✨ Features

🤖 Flexible Model Selection

  • Choose from any locally installed Ollama model (gemma3:1b, llama3.2:3b, mistral:7b, etc.)
  • Trade-off between speed and accuracy based on model size
  • Automatic model caching for improved performance

πŸ” Comprehensive PII Detection

Automatically detects and redacts:

  • Names: Full names of people (John Doe, Dr. Smith, Jane M. Johnson)
  • Emails: Email addresses ([email protected], [email protected])
  • Phones: Phone numbers (+1-555-123-4567, (555) 123-4567)
  • Addresses: Physical addresses (123 Main St, Apt 4B, New York, NY 10001)
  • IDs/SSNs: ID numbers (123-45-6789, Passport: AB1234567)
  • Credit Cards: Card numbers (1234-5678-9012-3456)
  • Dates of Birth: DOB: 01/15/1990, Born: January 15, 1990
  • Medical Info: Diagnosis codes, patient IDs, prescription info
  • Financial Data: Account numbers, transaction details, salary info
  • Other PII: Social media handles, URLs with personal info
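The detection itself is LLM-driven, but the most structured of these categories are also mechanically recognizable. As an illustration only (this is not RedactAI's code; `find_simple_pii` and the patterns below are hypothetical), a plain regex pass can catch three of the simplest formats:

```python
import re

# Illustrative patterns for three highly structured PII categories.
# RedactAI itself uses a local LLM for detection; a deterministic pass
# like this could only ever complement it for the simplest formats.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def find_simple_pii(text: str) -> dict[str, list[str]]:
    """Return every match found in `text`, grouped by category."""
    return {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
```

The point of the LLM approach is precisely that names, addresses, and medical details have no such reliable pattern.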

🎯 Smart Redaction Modes

1. Automatic Redaction - Full AI-powered detection and redaction

2. Analysis Mode - Preview sensitive data before redacting

3. Custom Redaction - Fine-tune results with exclude/include lists

📄 Dual Output

  • Redacted PDF: Permanently blacks out sensitive information
  • Highlighted PDF: Preview showing what was detected (yellow highlights)

🚀 User Experience Features

  • Auto-opens both original and redacted PDFs for side-by-side comparison
  • Detailed progress tracking for each operation step
  • Masked data reporting (shows first/last characters only)
  • Cross-platform support (Windows, macOS, Linux)

πŸ—οΈ Architecture

Technology Stack

  • MCP Framework: FastMCP for Model Context Protocol implementation
  • LLM Integration: Ollama API with structured JSON responses
  • PDF Processing: PyMuPDF (fitz) for text extraction and redaction
  • Text Analysis: Custom data processor with masking utilities

Core Components

1. MCP Server (src/server.py)

  • Exposes 5 primary tools via MCP protocol
  • Handles LLM instance caching
  • Progress tracking and error recovery

2. Ollama LLM Wrapper (src/tools/ollama_llm.py)

  • Robust JSON parsing with error recovery
  • Structured schema for consistent output
  • Connection health checking
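"Robust JSON parsing" matters because local models often wrap their JSON in markdown fences or surrounding prose. A hedged sketch of the kind of recovery such a wrapper needs (the function name and fallback strategy are assumptions, not the actual `ollama_llm.py` code):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Recover a JSON object from an LLM reply that may wrap it in
    code fences or prose. Illustrative error-recovery sketch."""
    # Prefer the content of a fenced ```json block if one is present.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the outermost brace pair in the reply.
        start, end = candidate.find("{"), candidate.rfind("}")
        if start != -1 and end > start:
            return json.loads(candidate[start : end + 1])
        raise
```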

3. PDF Extractor (src/tools/pdf_extractor.py)

  • Text extraction from PDF documents
  • Support for page-by-page or full document extraction

4. Data Processor (src/tools/data_processor.py)

  • Flattens and deduplicates sensitive data
  • Creates masked versions for secure reporting
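The dedup and masking steps above can be sketched in a few lines (illustrative only; `mask_value` and `dedupe_flat` are hypothetical names, and the real logic lives in `data_processor.py`):

```python
def mask_value(value: str, visible: int = 2) -> str:
    """Mask a sensitive string, keeping only the first and last
    `visible` characters, in the spirit of RedactAI's masked reporting."""
    if len(value) <= visible * 2:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible * 2) + value[-visible:]

def dedupe_flat(items: list[str]) -> list[str]:
    """Deduplicate detected strings while preserving first-seen order."""
    return list(dict.fromkeys(items))
```

Masking like this is what lets the server report *what kind* of data it found without echoing the data itself into logs or chat responses.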

5. PDF Redactor (src/tools/pdf_redactor.py)

  • Applies black redactions to matched text
  • Generates highlighted preview version
  • Per-page redaction statistics

🔧 Prerequisites

Before installing RedactAI, ensure you have:

  1. Python 3.8 or higher

    python --version
  2. Ollama installed and running

    • Download from: https://ollama.ai
    • After installation, start the service:
      ollama serve
  3. At least one Ollama model (recommended):

    # Fast model (recommended for getting started)
    ollama pull gemma3:1b
    
    # Balanced model (recommended default)
    ollama pull gemma3:4b
    
    # Accurate model (for maximum precision)
    ollama pull gemma3:12b
  4. Claude Desktop (for MCP integration)


📦 Installation

🚀 Quick Start (recommended)

Choose your platform and run the automated installation:

Windows (PowerShell):

# Download the script
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/AtharvSabde/RedactAI/main/setup.ps1" -OutFile "setup.ps1"

# Run it
powershell -ExecutionPolicy Bypass -File setup.ps1

macOS/Linux:

# Download the script
curl -fsSL "https://raw.githubusercontent.com/AtharvSabde/RedactAI/main/setup.sh" -o setup.sh

# Run it
chmod +x setup.sh
./setup.sh

The automated script will:

  • ✅ Install all dependencies
  • ✅ Set up a virtual environment
  • ✅ Configure Ollama and pull the recommended model
  • ✅ Automatically configure Claude Desktop
  • ✅ Verify the installation

After installation completes:

  1. Restart Claude Desktop
  2. Type in Claude: List available Ollama models
  3. If you see models listed, you're ready to go! 🎉

📘 Detailed Installation

For manual installation, troubleshooting, or advanced configuration options, see:

👉 Complete Installation Guide (INSTALLATION.md)

The detailed guide includes:

  • Manual installation steps
  • Prerequisites checklist
  • Configuration helper scripts
  • Troubleshooting common issues
  • Platform-specific instructions

🎯 Model Selection

Choose the right model based on your needs:

| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| gemma3:1b | 1 Billion | ⚡ Fast (14s) | Basic | Quick scans, simple documents |
| gemma3:4b | 4 Billion | ⚖️ Balanced (49s) | High | **Recommended** - balanced use |
| gemma3:12b | 12 Billion | 🐢 Slow (108s) | Higher | Better accuracy, complex documents |

Benchmark Results (2-page resume):

  • gemma3:1b: 9 redactions in 14 seconds (basic detection)
  • gemma3:4b: 38 redactions in 49 seconds (aggressive detection) ⭐ Recommended
  • gemma3:12b: 14 redactions in 108 seconds (smart/selective)

Recommendation: Start with gemma3:4b for the best balance of speed and accuracy.

General Guidelines:

| Model Size | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Small | 1B-4B | ⚡⚡⚡ | ⭐⭐ | Quick processing, simple documents |
| Medium | 4B-12B | ⚡⚡ | ⭐⭐⭐ | Balanced use, most documents |
| Large | 12B+ | ⚡ | ⭐⭐⭐⭐ | High accuracy, complex documents |

🚀 Usage Examples

Example 1: Basic Redaction

In Claude Desktop, simply say:

Redact "C:\Users\atharv\Desktop\resume.pdf"

RedactAI will:

  1. ✅ Analyze the PDF with the default model (gemma3:1b)
  2. ✅ Detect all sensitive information
  3. ✅ Create a redacted version
  4. ✅ Auto-open both PDFs side-by-side for comparison

Example 2: Using a Different Model

Redact my resume using gemma3:4b model for better accuracy

Example 3: Custom Redaction

After seeing the first redaction:

Redact again but don't redact my name "John Doe" and DO redact "Google" and "Project X"

RedactAI will use the redact_pdf_custom tool to:

  • Exclude: "John Doe"
  • Include: "Google", "Project X"

Example 4: Analysis Only (Preview)

Analyze "C:\Documents\contract.pdf" without redacting

This shows you what would be redacted without creating a new file.

Example 5: Check Available Models

What Ollama models do I have available?

Example 6: Check Ollama Status

Is Ollama running and ready?

πŸ› οΈ Available Tools

RedactAI provides 5 MCP tools:

1. list_available_models()

Lists all Ollama models installed on your system with size and details.

Returns: JSON with model list and size-to-accuracy guidance

Use case: Check which models you can use before redacting.


2. check_ollama_status(model, base_url)

Verifies Ollama service is running and specified model is available.

Parameters:

  • model (optional): Model name to check (default: "gemma3:1b")
  • base_url (optional): Ollama API URL (default: "http://localhost:11434")

Use case: Troubleshooting connection issues.
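The same health check can be scripted directly against Ollama's `/api/tags` endpoint, which the Troubleshooting section's `curl` command also uses; its response has the shape `{"models": [{"name": ...}]}`. The helper functions below are illustrative, not part of RedactAI:

```python
import json
import urllib.request

def parse_model_names(payload: str) -> list[str]:
    """Extract model names from an Ollama /api/tags JSON response."""
    return [m["name"] for m in json.loads(payload).get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Query the local Ollama service; raises if it is not running."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read().decode())
```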


3. analyze_pdf_sensitive_data(pdf_path, pdf_base64, model)

Analyzes PDF to detect sensitive information WITHOUT redacting.

Parameters:

  • pdf_path: Local file path to PDF
  • pdf_base64 (optional): Base64 encoded PDF data
  • model (optional): Ollama model to use (default: "gemma3:1b")

Returns:

  • Masked preview of detected data
  • Categories and counts
  • No files created

Use case: Preview before permanent redaction.


4. redact_pdf(pdf_path, pdf_base64, model, return_base64, auto_open)

Permanently redacts sensitive data from PDF.

Parameters:

  • pdf_path: Local file path to PDF
  • pdf_base64 (optional): Base64 encoded PDF data
  • model (optional): Model to use (default: "gemma3:1b")
  • return_base64 (optional): Return as base64 (default: false)
  • auto_open (optional): Auto-open PDFs (default: true)

Returns:

  • Redacted PDF (blacked out sensitive data)
  • Highlighted preview PDF (shows what was redacted)
  • Detailed summary with masked data
  • Statistics per page

Use case: Main redaction workflow.


5. redact_pdf_custom(pdf_path, exclude_items, include_items, model, auto_open, return_base64)

Custom redaction with user-specified exclusions and additions.

Parameters:

  • pdf_path: Path to ORIGINAL PDF (required)
  • exclude_items: List of strings to NOT redact
  • include_items: List of strings to forcefully redact
  • model (optional): Model to use (default: "gemma3:1b")
  • auto_open (optional): Auto-open PDFs (default: true)
  • return_base64 (optional): Return as base64 (default: false)

Example:

{
  "pdf_path": "resume.pdf",
  "exclude_items": ["John Doe", "[email protected]"],
  "include_items": ["Secret Project", "XYZ Corp"],
  "model": "gemma3:4b"
}

Use case: Fine-tune redactions after initial pass. User reviews initial redaction and says "don't redact my name, but DO redact 'Google'".
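The documented exclude/include semantics amount to a simple set operation: final targets = (detected minus excluded) plus forced inclusions. A sketch of that merge (illustrative, not the server code):

```python
def apply_custom_lists(detected: list[str],
                       exclude_items: list[str],
                       include_items: list[str]) -> list[str]:
    """Combine model detections with user overrides, mirroring the
    documented redact_pdf_custom behavior: drop excluded strings,
    then force-add included ones. (Hypothetical sketch.)"""
    excluded = set(exclude_items)
    kept = [d for d in detected if d not in excluded]
    for item in include_items:
        if item not in kept:
            kept.append(item)
    return kept
```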


🔄 Workflow Example

1. User uploads PDF → analyze_pdf_sensitive_data()
2. Review masked sensitive data → Decide what to redact
3. Run redact_pdf() → Get redacted + highlighted PDFs
4. Both PDFs auto-open side-by-side
5. If adjustments needed → Use redact_pdf_custom()
   - Exclude false positives
   - Include additional items
6. Final redacted PDF ready for sharing

πŸ› Troubleshooting

Issue 1: "Cannot connect to Ollama"

Solution:

# Start Ollama service
ollama serve

# Verify it's running
curl http://localhost:11434/api/tags

Issue 2: "Model not found"

Solution:

# List installed models
ollama list

# Install missing model
ollama pull gemma3:1b

Issue 3: "MCP server not showing in Claude"

Solution:

  1. Verify paths in claude_desktop_config.json are correct
  2. Use forward slashes (/) even on Windows, or escape backslashes (\\)
  3. Restart Claude Desktop completely
  4. Check Claude logs:
    • Windows: %APPDATA%\Claude\logs
    • macOS: ~/Library/Logs/Claude
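For reference, a Claude Desktop MCP server entry follows this shape (the `redactai` key and both paths are illustrative placeholders; point them at your own Python interpreter and `src/server.py`):

```json
{
  "mcpServers": {
    "redactai": {
      "command": "C:/path/to/RedactAI/venv/Scripts/python.exe",
      "args": ["C:/path/to/RedactAI/src/server.py"]
    }
  }
}
```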

Issue 4: "PDFs not opening automatically"

Solution:

  • Ensure you have a PDF viewer installed (Adobe Reader, browser, etc.)
  • Try manually opening from the output path
  • Set auto_open: false in the tool call if it's causing issues

Issue 5: "Redaction too slow"

Solution:

  • Use a smaller model: gemma3:1b (fastest)
  • Process fewer pages at once
  • Upgrade your hardware (more RAM/CPU helps)

Issue 6: "LLM returned empty result"

Solution:

# Check the Ollama server log (there is no `ollama logs` subcommand)
# macOS/Linux: ~/.ollama/logs/server.log
# Windows: %LOCALAPPDATA%\Ollama

# Restart Ollama
# Windows: Stop and restart the service
# macOS/Linux: 
killall ollama
ollama serve

Issue 7: "Too many/too few redactions"

Solution:

  • Try different models: gemma3:4b or gemma3:12b
  • Use redact_pdf_custom to fine-tune results
  • Exclude false positives or include missed items

πŸ“ Project Structure

RedactAI/
├── src/
│   ├── server.py              # Main MCP server (FastMCP)
│   └── tools/
│       ├── ollama_llm.py      # Ollama LLM integration
│       ├── pdf_extractor.py   # PDF text extraction
│       ├── data_processor.py  # Sensitive data processing
│       └── pdf_redactor.py    # PDF redaction logic
├── scripts/
│   └── configure_claude.py    # Configuration helper script
├── setup.sh                    # Automated setup (macOS/Linux)
├── setup.ps1                   # Automated setup (Windows)
├── requirements.txt            # Python dependencies
├── INSTALLATION.md             # Detailed installation guide
├── README.md                   # This file
├── LICENSE                     # MIT License
└── .gitignore

🔒 Security Considerations

  1. Original files are never modified - Redactions create new files
  2. Temporary files are cleaned up - Automatic cleanup in finally blocks
  3. Masked reporting - Sensitive data never exposed in full in logs/responses
  4. Local processing - All LLM operations run locally via Ollama (no cloud APIs)
  5. No data transmission - Your sensitive documents stay on your machine

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Development Setup

# Clone your fork
git clone https://github.com/YOUR_USERNAME/RedactAI.git
cd RedactAI

# Create venv and install dependencies
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt

# Test changes
python src/server.py

💡 Important Notes

  • This is an MCP server - it exposes tools via the Model Context Protocol, not a standalone CLI application
  • The server must be running and connected to an MCP client (like Claude Desktop) to use the tools
  • All operations return detailed JSON responses with progress tracking and error information
  • The system requires Ollama to be running locally at http://localhost:11434 by default
  • Larger models provide better accuracy but require more computational resources and time
  • The highlighted PDF serves as a preview/audit trail of what was redacted

πŸ™ Acknowledgments


📧 Contact

Atharv Sabde


🌟 Star the Repo!

If RedactAI helped you protect your privacy, please ⭐ star the repo on GitHub!


Built with ❤️ for privacy-conscious AI users
