
RedactAI 🛡️

Privacy firewall for your PDFs before sending to LLMs


🤐 The Problem

Large Language Models are becoming a default tool for reviewing, summarizing, and extracting insights from documents. But there is a hidden cost.

Most LLMs require raw document input. When you upload a contract, medical record, financial statement, or internal report, you are often sending unfiltered sensitive data along with it.

This creates real risks:

Personally identifiable information is exposed unintentionally

Confidential or regulated data leaves your control

Manual redaction is slow, error-prone, and inconsistent

Existing redaction tools are either rule-based, cloud-only, or break document structure

In practice, teams are forced to choose between using LLMs effectively and protecting privacy. That trade-off should not exist.

πŸ›‘οΈ The Solution

RedactAI is an MCP (Model Context Protocol) server that provides AI-powered sensitive data detection and redaction for PDF documents. It leverages local Ollama models to identify and permanently remove personally identifiable information (PII) from PDFs while maintaining document integrity.

Simply provide a PDF file path, and RedactAI will:

  • ✅ Automatically detect sensitive data (names, emails, dates, IDs, medical info, financial data)
  • ✅ Redact permanently by blacking out sensitive information
  • ✅ Preview with a side-by-side before-and-after comparison
  • ✅ Choose your model for speed vs. accuracy trade-offs
  • ✅ Customize redactions by excluding false positives or adding missed items

Privacy First: All processing happens locally with Ollama - your data never leaves your machine.


🎥 Demo Video


Click to watch the full demo on YouTube


🎯 Why RedactAI?

| Feature | RedactAI | Manual Redaction | Cloud Services |
|---|---|---|---|
| Privacy | ✅ 100% Local | ✅ Local | ❌ Data sent to cloud |
| Speed | ✅ Seconds | ❌ Hours | ✅ Fast |
| Accuracy | ✅ AI-Powered | ❌ Error-prone | ✅ AI-Powered |
| Cost | ✅ Free | ⚠️ Time-consuming | ❌ Subscription fees |
| Customization | ✅ Full control | ✅ Full control | ❌ Limited |
| Audit Trail | ✅ Highlighted preview | ❌ Manual tracking | ⚠️ Varies |

✨ Features

🤖 Flexible Model Selection

  • Choose from any locally installed Ollama model (gemma3:1b, llama3.2:3b, mistral:7b, etc.)
  • Trade-off between speed and accuracy based on model size
  • Automatic model caching for improved performance

πŸ” Comprehensive PII Detection

Automatically detects and redacts:

  • Names: Full names of people (John Doe, Dr. Smith, Jane M. Johnson)
  • Emails: Email addresses ([email protected], [email protected])
  • Phones: Phone numbers (+1-555-123-4567, (555) 123-4567)
  • Addresses: Physical addresses (123 Main St, Apt 4B, New York, NY 10001)
  • IDs/SSNs: ID numbers (123-45-6789, Passport: AB1234567)
  • Credit Cards: Card numbers (1234-5678-9012-3456)
  • Dates of Birth: DOB: 01/15/1990, Born: January 15, 1990
  • Medical Info: Diagnosis codes, patient IDs, prescription info
  • Financial Data: Account numbers, transaction details, salary info
  • Other PII: Social media handles, URLs with personal info
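The detection itself is LLM-driven, but the most structured of these categories are also mechanically recognizable. As an illustration only (this is not RedactAI's code; `find_simple_pii` and the patterns below are hypothetical), a plain regex pass can catch three of the simplest formats:

```python
import re

# Illustrative patterns for three highly structured PII categories.
# RedactAI itself uses a local LLM for detection; a deterministic pass
# like this could only ever complement it for the simplest formats.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def find_simple_pii(text: str) -> dict[str, list[str]]:
    """Return every match found in `text`, grouped by category."""
    return {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
```

The point of the LLM approach is precisely that names, addresses, and medical details have no such reliable pattern.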

🎯 Smart Redaction Modes

1. Automatic Redaction - Full AI-powered detection and redaction

2. Analysis Mode - Preview sensitive data before redacting

3. Custom Redaction - Fine-tune results with exclude/include lists

📄 Dual Output

  • Redacted PDF: Permanently blacks out sensitive information
  • Highlighted PDF: Preview showing what was detected (yellow highlights)

🚀 User Experience Features

  • Auto-opens both original and redacted PDFs for side-by-side comparison
  • Detailed progress tracking for each operation step
  • Masked data reporting (shows first/last characters only)
  • Cross-platform support (Windows, macOS, Linux)

πŸ—οΈ Architecture

Technology Stack

  • MCP Framework: FastMCP for Model Context Protocol implementation
  • LLM Integration: Ollama API with structured JSON responses
  • PDF Processing: PyMuPDF (fitz) for text extraction and redaction
  • Text Analysis: Custom data processor with masking utilities

Core Components

1. MCP Server (src/server.py)

  • Exposes 5 primary tools via MCP protocol
  • Handles LLM instance caching
  • Progress tracking and error recovery

2. Ollama LLM Wrapper (src/tools/ollama_llm.py)

  • Robust JSON parsing with error recovery
  • Structured schema for consistent output
  • Connection health checking
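"Robust JSON parsing" matters because local models often wrap their JSON in markdown fences or surrounding prose. A hedged sketch of the kind of recovery such a wrapper needs (the function name and fallback strategy are assumptions, not the actual `ollama_llm.py` code):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Recover a JSON object from an LLM reply that may wrap it in
    code fences or prose. Illustrative error-recovery sketch."""
    # Prefer the content of a fenced ```json block if one is present.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the outermost brace pair in the reply.
        start, end = candidate.find("{"), candidate.rfind("}")
        if start != -1 and end > start:
            return json.loads(candidate[start : end + 1])
        raise
```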

3. PDF Extractor (src/tools/pdf_extractor.py)

  • Text extraction from PDF documents
  • Support for page-by-page or full document extraction

4. Data Processor (src/tools/data_processor.py)

  • Flattens and deduplicates sensitive data
  • Creates masked versions for secure reporting
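The dedup and masking steps above can be sketched in a few lines (illustrative only; `mask_value` and `dedupe_flat` are hypothetical names, and the real logic lives in `data_processor.py`):

```python
def mask_value(value: str, visible: int = 2) -> str:
    """Mask a sensitive string, keeping only the first and last
    `visible` characters, in the spirit of RedactAI's masked reporting."""
    if len(value) <= visible * 2:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible * 2) + value[-visible:]

def dedupe_flat(items: list[str]) -> list[str]:
    """Deduplicate detected strings while preserving first-seen order."""
    return list(dict.fromkeys(items))
```

Masking like this is what lets the server report *what kind* of data it found without echoing the data itself into logs or chat responses.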

5. PDF Redactor (src/tools/pdf_redactor.py)

  • Applies black redactions to matched text
  • Generates highlighted preview version
  • Per-page redaction statistics

🔧 Prerequisites

Before installing RedactAI, ensure you have:

  1. Python 3.8 or higher

    python --version
  2. Ollama installed and running

    • Download from: https://ollama.ai
    • After installation, start the service:
      ollama serve
  3. At least one Ollama model (recommended):

    # Fast model (recommended for getting started)
    ollama pull gemma3:1b
    
    # Balanced model (recommended default)
    ollama pull gemma3:4b
    
    # Accurate model (for maximum precision)
    ollama pull gemma3:12b
  4. Claude Desktop (for MCP integration)


📦 Installation

🚀 Quick Start (recommended)

Choose your platform and run the automated installation:

Windows (PowerShell):

# Download the script
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/AtharvSabde/RedactAI/main/setup.ps1" -OutFile "setup.ps1"

# Run it
powershell -ExecutionPolicy Bypass -File setup.ps1

macOS/Linux:

# Download the script
curl -fsSL "https://raw.githubusercontent.com/AtharvSabde/RedactAI/main/setup.sh" -o setup.sh

# Run it
chmod +x setup.sh
./setup.sh

The automated script will:

  • ✅ Install all dependencies
  • ✅ Set up a virtual environment
  • ✅ Configure Ollama and pull the recommended model
  • ✅ Automatically configure Claude Desktop
  • ✅ Verify the installation

After installation completes:

  1. Restart Claude Desktop
  2. Type in Claude: List available Ollama models
  3. If you see models listed, you're ready to go! 🎉

📘 Detailed Installation

For manual installation, troubleshooting, or advanced configuration options, see:

👉 Complete Installation Guide (INSTALLATION.md)

The detailed guide includes:

  • Manual installation steps
  • Prerequisites checklist
  • Configuration helper scripts
  • Troubleshooting common issues
  • Platform-specific instructions

🎯 Model Selection

Choose the right model based on your needs:

| Model | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| gemma3:1b | 1 Billion | ⚡ Fast (14s) | Basic | Quick scans, simple documents |
| gemma3:4b | 4 Billion | ⚖️ Balanced (49s) | High | **Recommended** - balanced use |
| gemma3:12b | 12 Billion | 🐢 Slow (108s) | Higher | Better accuracy, complex documents |

Benchmark Results (2-page resume):

  • gemma3:1b: 9 redactions in 14 seconds (basic detection)
  • gemma3:4b: 38 redactions in 49 seconds (aggressive detection) ⭐ Recommended
  • gemma3:12b: 14 redactions in 108 seconds (smart/selective)

Recommendation: Start with gemma3:4b for the best balance of speed and accuracy.

General Guidelines:

| Model Size | Parameters | Speed | Accuracy | Best For |
|---|---|---|---|---|
| Small | 1B-4B | ⚡⚡⚡ | ⭐⭐ | Quick processing, simple documents |
| Medium | 4B-12B | ⚡⚡ | ⭐⭐⭐ | Balanced use, most documents |
| Large | 12B+ | ⚡ | ⭐⭐⭐⭐ | High accuracy, complex documents |

🚀 Usage Examples

Example 1: Basic Redaction

In Claude Desktop, simply say:

Redact "C:\Users\atharv\Desktop\resume.pdf"

RedactAI will:

  1. ✅ Analyze the PDF with the default model (gemma3:1b)
  2. ✅ Detect all sensitive information
  3. ✅ Create a redacted version
  4. ✅ Auto-open both PDFs side-by-side for comparison

Example 2: Using a Different Model

Redact my resume using gemma3:4b model for better accuracy

Example 3: Custom Redaction

After seeing the first redaction:

Redact again but don't redact my name "John Doe" and DO redact "Google" and "Project X"

RedactAI will use the redact_pdf_custom tool to:

  • Exclude: "John Doe"
  • Include: "Google", "Project X"

Example 4: Analysis Only (Preview)

Analyze "C:\Documents\contract.pdf" without redacting

This shows you what would be redacted without creating a new file.

Example 5: Check Available Models

What Ollama models do I have available?

Example 6: Check Ollama Status

Is Ollama running and ready?

πŸ› οΈ Available Tools

RedactAI provides 5 MCP tools:

1. list_available_models()

Lists all Ollama models installed on your system with size and details.

Returns: JSON with model list and size-to-accuracy guidance

Use case: Check which models you can use before redacting.


2. check_ollama_status(model, base_url)

Verifies Ollama service is running and specified model is available.

Parameters:

  • model (optional): Model name to check (default: "gemma3:1b")
  • base_url (optional): Ollama API URL (default: "http://localhost:11434")

Use case: Troubleshooting connection issues.
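The same health check can be scripted directly against Ollama's `/api/tags` endpoint, which the Troubleshooting section's `curl` command also uses; its response has the shape `{"models": [{"name": ...}]}`. The helper functions below are illustrative, not part of RedactAI:

```python
import json
import urllib.request

def parse_model_names(payload: str) -> list[str]:
    """Extract model names from an Ollama /api/tags JSON response."""
    return [m["name"] for m in json.loads(payload).get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Query the local Ollama service; raises if it is not running."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read().decode())
```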


3. analyze_pdf_sensitive_data(pdf_path, pdf_base64, model)

Analyzes PDF to detect sensitive information WITHOUT redacting.

Parameters:

  • pdf_path: Local file path to PDF
  • pdf_base64 (optional): Base64 encoded PDF data
  • model (optional): Ollama model to use (default: "gemma3:1b")

Returns:

  • Masked preview of detected data
  • Categories and counts
  • No files created

Use case: Preview before permanent redaction.


4. redact_pdf(pdf_path, pdf_base64, model, return_base64, auto_open)

Permanently redacts sensitive data from PDF.

Parameters:

  • pdf_path: Local file path to PDF
  • pdf_base64 (optional): Base64 encoded PDF data
  • model (optional): Model to use (default: "gemma3:1b")
  • return_base64 (optional): Return as base64 (default: false)
  • auto_open (optional): Auto-open PDFs (default: true)

Returns:

  • Redacted PDF (blacked out sensitive data)
  • Highlighted preview PDF (shows what was redacted)
  • Detailed summary with masked data
  • Statistics per page

Use case: Main redaction workflow.


5. redact_pdf_custom(pdf_path, exclude_items, include_items, model, auto_open, return_base64)

Custom redaction with user-specified exclusions and additions.

Parameters:

  • pdf_path: Path to ORIGINAL PDF (required)
  • exclude_items: List of strings to NOT redact
  • include_items: List of strings to forcefully redact
  • model (optional): Model to use (default: "gemma3:1b")
  • auto_open (optional): Auto-open PDFs (default: true)
  • return_base64 (optional): Return as base64 (default: false)

Example:

{
  "pdf_path": "resume.pdf",
  "exclude_items": ["John Doe", "[email protected]"],
  "include_items": ["Secret Project", "XYZ Corp"],
  "model": "gemma3:4b"
}

Use case: Fine-tune redactions after initial pass. User reviews initial redaction and says "don't redact my name, but DO redact 'Google'".
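The documented exclude/include semantics amount to a simple set operation: final targets = (detected minus excluded) plus forced inclusions. A sketch of that merge (illustrative, not the server code):

```python
def apply_custom_lists(detected: list[str],
                       exclude_items: list[str],
                       include_items: list[str]) -> list[str]:
    """Combine model detections with user overrides, mirroring the
    documented redact_pdf_custom behavior: drop excluded strings,
    then force-add included ones. (Hypothetical sketch.)"""
    excluded = set(exclude_items)
    kept = [d for d in detected if d not in excluded]
    for item in include_items:
        if item not in kept:
            kept.append(item)
    return kept
```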


🔄 Workflow Example

1. User uploads PDF → analyze_pdf_sensitive_data()
2. Review masked sensitive data → Decide what to redact
3. Run redact_pdf() → Get redacted + highlighted PDFs
4. Both PDFs auto-open side-by-side
5. If adjustments needed → Use redact_pdf_custom()
   - Exclude false positives
   - Include additional items
6. Final redacted PDF ready for sharing

πŸ› Troubleshooting

Issue 1: "Cannot connect to Ollama"

Solution:

# Start Ollama service
ollama serve

# Verify it's running
curl http://localhost:11434/api/tags

Issue 2: "Model not found"

Solution:

# List installed models
ollama list

# Install missing model
ollama pull gemma3:1b

Issue 3: "MCP server not showing in Claude"

Solution:

  1. Verify paths in claude_desktop_config.json are correct
  2. Use forward slashes (/) even on Windows, or escape backslashes (\\)
  3. Restart Claude Desktop completely
  4. Check Claude logs:
    • Windows: %APPDATA%\Claude\logs
    • macOS: ~/Library/Logs/Claude
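For reference, a Claude Desktop MCP server entry follows this shape (the `redactai` key and both paths are illustrative placeholders; point them at your own Python interpreter and `src/server.py`):

```json
{
  "mcpServers": {
    "redactai": {
      "command": "C:/path/to/RedactAI/venv/Scripts/python.exe",
      "args": ["C:/path/to/RedactAI/src/server.py"]
    }
  }
}
```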

Issue 4: "PDFs not opening automatically"

Solution:

  • Ensure you have a PDF viewer installed (Adobe Reader, browser, etc.)
  • Try manually opening from the output path
  • Set auto_open: false in the tool call if it's causing issues

Issue 5: "Redaction too slow"

Solution:

  • Use a smaller model: gemma3:1b (fastest)
  • Process fewer pages at once
  • Upgrade your hardware (more RAM/CPU helps)

Issue 6: "LLM returned empty result"

Solution:

# Check the Ollama server log (there is no `ollama logs` subcommand)
# macOS/Linux: ~/.ollama/logs/server.log
# Windows: %LOCALAPPDATA%\Ollama

# Restart Ollama
# Windows: Stop and restart the service
# macOS/Linux: 
killall ollama
ollama serve

Issue 7: "Too many/too few redactions"

Solution:

  • Try different models: gemma3:4b or gemma3:12b
  • Use redact_pdf_custom to fine-tune results
  • Exclude false positives or include missed items

πŸ“ Project Structure

RedactAI/
├── src/
│   ├── server.py              # Main MCP server (FastMCP)
│   └── tools/
│       ├── ollama_llm.py      # Ollama LLM integration
│       ├── pdf_extractor.py   # PDF text extraction
│       ├── data_processor.py  # Sensitive data processing
│       └── pdf_redactor.py    # PDF redaction logic
├── scripts/
│   └── configure_claude.py    # Configuration helper script
├── setup.sh                    # Automated setup (macOS/Linux)
├── setup.ps1                   # Automated setup (Windows)
├── requirements.txt            # Python dependencies
├── INSTALLATION.md             # Detailed installation guide
├── README.md                   # This file
├── LICENSE                     # MIT License
└── .gitignore

🔒 Security Considerations

  1. Original files are never modified - Redactions create new files
  2. Temporary files are cleaned up - Automatic cleanup in finally blocks
  3. Masked reporting - Sensitive data never exposed in full in logs/responses
  4. Local processing - All LLM operations run locally via Ollama (no cloud APIs)
  5. No data transmission - Your sensitive documents stay on your machine

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Development Setup

# Clone your fork
git clone https://github.com/YOUR_USERNAME/RedactAI.git
cd RedactAI

# Create venv and install dependencies
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt

# Test changes
python src/server.py

💡 Important Notes

  • This is an MCP server - it exposes tools via the Model Context Protocol, not a standalone CLI application
  • The server must be running and connected to an MCP client (like Claude Desktop) to use the tools
  • All operations return detailed JSON responses with progress tracking and error information
  • The system requires Ollama to be running locally at http://localhost:11434 by default
  • Larger models provide better accuracy but require more computational resources and time
  • The highlighted PDF serves as a preview/audit trail of what was redacted

πŸ™ Acknowledgments


📧 Contact

Atharv Sabde


🌟 Star the Repo!

If RedactAI helped you protect your privacy, please ⭐ star the repo on GitHub!


Built with ❤️ for privacy-conscious AI users
