ParseStudio is a powerful and flexible Python library for extracting and parsing content from PDF documents. It provides an intuitive interface for handling diverse tasks such as extracting text, tables, and images using different parsing backends.
- Python 3.11 or higher
- Compatible with Python 3.11, 3.12
- Modular Design: Choose between multiple parser backends (Docling, PymuPDF, Llama Parse, Anthropic Claude, OpenAI File Search) to suit your needs.
- Multimodal Parsing: Extract text, tables, and images seamlessly.
- Extensible: Easily adjust parsing behavior with additional parameters.
Install from PyPI:
pip install parsestudio

Install from source:
git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
pip install .

Install for development:
git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
pip install -e ".[dev]"

from parsestudio.parse import PDFParser
# Initialize with the desired parser backend
parser = PDFParser(parser="docling")  # Options: "docling", "pymupdf", "llama", "anthropic", "openai"

outputs = parser.run(["path/to/file.pdf"], modalities=["text", "tables", "images"])
# Access text content
print(outputs[0].text)
# Output: text="This is the extracted text content from the PDF file."
# Access tables
for table in outputs[0].tables:
    print(table.markdown)
# Output: | Header 1 | Header 2 |
# |----------|----------|
# | Value 1 | Value 2 |
# Access images
for image in outputs[0].images:
    image.image.show()
    metadata = image.metadata
    print(metadata)
    # Output: Metadata(page_number=1, bbox=[0, 0, 100, 100])

Choose from the following parsers based on your requirements:
- Docling: Advanced parser with multimodal capabilities.
- PyMuPDF: Lightweight and efficient.
- LlamaParse: AI-enhanced parser with advanced capabilities.
- Anthropic Claude: Advanced AI model using Claude 3.5 Sonnet with native PDF processing capabilities.
- OpenAI File Search: Efficient document processing using vector embeddings and file search capabilities.
Each parser has its own strengths. Choose the one that best fits your use case.
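One way to decide between backends is to run several of them on the same file and compare what comes back. The sketch below is illustrative only: `compare_backends` and the `make_parser` argument are hypothetical helpers, not part of ParseStudio. It relies solely on the `PDFParser(parser=...)`, `run(..., modalities=...)`, `.text`, and `.tables` usage shown above, and it assumes `.text` is a string and `.tables` is iterable.

```python
def _default_factory(name):
    # Imported lazily so this helper can be loaded (and tested with a
    # stand-in parser) on machines where ParseStudio is not installed.
    from parsestudio.parse import PDFParser

    return PDFParser(parser=name)


def compare_backends(pdf_path, backends, make_parser=_default_factory):
    """Return {backend_name: (text_length, table_count)} for one PDF.

    Hypothetical helper: it only counts characters and tables, which is a
    rough signal, not a quality metric.
    """
    summary = {}
    for name in backends:
        parser = make_parser(name)
        # run() takes a list of paths and returns one output per path.
        output = parser.run([pdf_path], modalities=["text", "tables"])[0]
        text_length = len(output.text)  # assumes .text is a str
        table_count = sum(1 for _ in output.tables)  # assumes .tables is iterable
        summary[name] = (text_length, table_count)
    return summary
```

Calling `compare_backends("path/to/file.pdf", ["docling", "pymupdf"])` would then give a quick side-by-side view before committing to one backend. Backends that need API keys will only work once those keys are configured.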
If you choose to use the Llama, Anthropic, or OpenAI parsers, you need to set up API keys. Follow these steps:
- Create a .env file: In the root directory of your project, create a file named .env.
- Add your API keys: Add the following lines to the .env file, replacing the placeholders with your actual API keys:
LLAMA_API_KEY=your-llama-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
OPENAI_API_KEY=your-openai-api-key
- Llama Parser: Requires a Llama API key
- Anthropic Parser:
  - Requires an Anthropic API key
  - Uses Claude 3.5 Sonnet with native PDF processing (no image conversion needed)
  - Supports text and table extraction
  - Note: Image extraction not currently supported due to API limitations
- OpenAI Parser:
  - Requires an OpenAI API key
  - Uses OpenAI's file search with vector embeddings for efficient PDF processing
  - Automatically handles text extraction and table detection
  - More efficient, cheaper, and faster than image-based approaches
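Before constructing one of the API-backed parsers, it can help to confirm that the matching key is actually set. The following is a minimal sketch using only the standard library. `REQUIRED_KEYS` and `missing_api_key` are hypothetical names, the variable names come from the .env example above, and the sketch checks the process environment only (it does not read the .env file itself, and it makes no claim about how ParseStudio loads that file).

```python
import os

# Environment variable expected for each API-backed parser, taken from the
# .env example above. "docling" and "pymupdf" are absent because the
# README lists API keys only for the Llama, Anthropic, and OpenAI parsers.
REQUIRED_KEYS = {
    "llama": "LLAMA_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_key(parser_name, env=None):
    """Return the name of the missing variable, or None if nothing is missing."""
    env = os.environ if env is None else env
    key = REQUIRED_KEYS.get(parser_name)
    if key is None:
        return None  # this parser does not need an API key
    # An empty value is treated the same as an unset variable.
    return None if env.get(key) else key
```

A caller might run `missing_api_key("anthropic")` at startup and print a clear message naming the variable, rather than waiting for the first request to fail.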
We welcome contributions from the community! ParseStudio uses modern development tools to ensure code quality.
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
uv sync --dev
# Install pre-commit hooks
uv run pre-commit install
# Verify setup
make all-checks

make format # Format code with Black & isort
make lint # Run ruff linter
make type-check # Run mypy type checker
make test # Run tests
make all-checks # Run all quality checks

See CONTRIBUTING.md for detailed development guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- Bug Reports: GitHub Issues
- Discussions: GitHub Discussions
- Contact: For questions about usage or contributions, please open an issue
ParseStudio integrates with several excellent open-source and commercial parsing solutions:
- Docling - Advanced document parsing
- PyMuPDF - Fast PDF processing
- LlamaParse - AI-powered document parsing
- Anthropic Claude - Advanced AI capabilities
- OpenAI File Search - Efficient document processing with vector embeddings
