📄 ParseStudio ✨

ParseStudio is a Python library for extracting and parsing content from PDF documents. It provides a single interface for extracting text, tables, and images through interchangeable parsing backends.

Requirements

Python 3.11 or higher (compatible with Python 3.11 and 3.12)
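
To confirm that an interpreter meets this requirement before installing, a small check like the one below can help. This is a minimal sketch using only the standard library; `MIN_VERSION` and `meets_requirement` are illustrative names and are not part of ParseStudio.

```python
import sys

MIN_VERSION = (3, 11)  # minimum version stated in the requirements above


def meets_requirement(version: tuple[int, ...] = tuple(sys.version_info[:3])) -> bool:
    """Return True if the given interpreter version is at least MIN_VERSION."""
    return tuple(version[:2]) >= MIN_VERSION


print("Python", sys.version.split()[0], "supported:", meets_requirement())
```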

Key Features

Modular Design: Choose between multiple parser backends (Docling, PyMuPDF, LlamaParse, Anthropic Claude, OpenAI File Search) to suit your needs.
Multimodal Parsing: Extract text, tables, and images seamlessly.
Extensible: Easily adjust parsing behavior with additional parameters.

🚀 Installation

Via pip (recommended)

```bash
pip install parsestudio
```

From source

```bash
git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
pip install .
```

Development installation

```bash
git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
pip install -e ".[dev]"
```

⚡ Quick Start

1. Import and Initialize the Parser

```python
from parsestudio.parse import PDFParser

# Initialize with the desired parser backend
parser = PDFParser(parser="docling")  # Options: "docling", "pymupdf", "llama", "anthropic", "openai"
```

2. Parse a PDF File

```python
# run() takes a list of file paths; outputs[0] is the result for the first file
outputs = parser.run(["path/to/file.pdf"], modalities=["text", "tables", "images"])

# Access text content
print(outputs[0].text)
# Output: This is the extracted text content from the PDF file.

# Access tables
for table in outputs[0].tables:
    print(table.markdown)
# Output: | Header 1 | Header 2 |
#         |----------|----------|
#         | Value 1  | Value 2  |

# Access images
for image in outputs[0].images:
    image.image.show()
    metadata = image.metadata
    print(metadata)
# Output: Metadata(page_number=1, bbox=[0, 0, 100, 100])
```
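
The Markdown string printed from `table.markdown` above is convenient to read but awkward to process further. The sketch below shows one way to turn such a string into a list of row dictionaries using only the standard library. `markdown_table_to_rows` is a hypothetical helper, not part of ParseStudio, and it assumes a simple pipe-delimited table with a header row, a separator row, and no escaped `|` characters inside cells.

```python
def markdown_table_to_rows(markdown: str) -> list[dict[str, str]]:
    """Convert a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, split_row(line))) for line in lines[2:]]


# Sample input copied from the table output shown above.
sample = """
| Header 1 | Header 2 |
|----------|----------|
| Value 1  | Value 2  |
"""
rows = markdown_table_to_rows(sample)
print(rows)
```

In a real run, each `table.markdown` value from `outputs[0].tables` could be passed to the helper in place of `sample`.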

3. Supported Parsers

Choose from the following parsers based on your requirements:

Docling: Advanced parser with multimodal capabilities.
PyMuPDF: Lightweight and efficient.
LlamaParse: AI-enhanced parser with advanced capabilities.
Anthropic Claude: Advanced AI model using Claude 3.5 Sonnet with native PDF processing capabilities.
OpenAI File Search: Efficient document processing using vector embeddings and file search capabilities.

Each parser has its own strengths. Choose the one that best fits your use case.

API Key Setup

If you choose to use the Llama, Anthropic, or OpenAI parsers, you need to set up API keys. Follow these steps:

Create a .env File: In the root directory of your project, create a file named .env.
Add Your API Keys: Add the following lines to the .env file, replacing the placeholders with your actual API keys:
```
LLAMA_API_KEY=your-llama-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
OPENAI_API_KEY=your-openai-api-key
```

Parser-Specific Requirements

Llama Parser: Requires a Llama API key
Anthropic Parser:
- Requires an Anthropic API key
- Uses Claude 3.5 Sonnet with native PDF processing (no image conversion needed)
- Supports text and table extraction
- Note: Image extraction not currently supported due to API limitations
OpenAI Parser:
- Requires an OpenAI API key
- Uses OpenAI's file search with vector embeddings for efficient PDF processing
- Automatically handles text extraction and table detection
- More efficient, cheaper, and faster than image-based approaches
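
Because the Anthropic parser does not extract images, it may help to filter the requested modalities before calling `parser.run`. The sketch below is illustrative only: `UNSUPPORTED` and `split_modalities` are hypothetical names, only the Anthropic entry is taken from the notes above, and every other backend is assumed to accept all modalities.

```python
# Only the Anthropic entry comes from the notes above; other backends are
# assumed to accept every modality, which may not hold in practice.
UNSUPPORTED = {"anthropic": {"images"}}


def split_modalities(backend: str, requested: list[str]) -> tuple[list[str], list[str]]:
    """Split requested modalities into (usable, dropped) for a backend."""
    blocked = UNSUPPORTED.get(backend, set())
    usable = [m for m in requested if m not in blocked]
    dropped = [m for m in requested if m in blocked]
    return usable, dropped


usable, dropped = split_modalities("anthropic", ["text", "tables", "images"])
print(usable)   # ['text', 'tables']
print(dropped)  # ['images']
```

The `usable` list can then be passed as the `modalities` argument of `parser.run`, and `dropped` can be logged so the omission is visible.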

Contributing

We welcome contributions from the community! ParseStudio uses modern development tools to ensure code quality.

Quick Development Setup

```bash
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
uv sync --dev

# Install pre-commit hooks
uv run pre-commit install

# Verify setup
make all-checks
```

Development Commands

```bash
make format     # Format code with Black & isort
make lint       # Run ruff linter
make type-check # Run mypy type checker
make test       # Run tests
make all-checks # Run all quality checks
```

See CONTRIBUTING.md for detailed development guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

🐛 Bug Reports: GitHub Issues
💬 Discussions: GitHub Discussions
📧 Contact: For questions about usage or contributions, please open an issue

Acknowledgments

ParseStudio integrates with several excellent open-source and commercial parsing solutions:

Docling - Advanced document parsing
PyMuPDF - Fast PDF processing
LlamaParse - AI-powered document parsing
Anthropic Claude - Advanced AI capabilities
OpenAI File Search - Efficient document processing with vector embeddings
