Trieve Logo

API reference | Meet a Maintainer | Discord | Matrix | [email protected]


🦀 PDF2MD 🦀

Self-hostable API server and pipeline for converting PDFs to markdown using thrifty vision language models like GPT-4o-mini and gemini-flash-1.5.

Written in Rust. Try at pdf2md.trieve.ai.

PDF2MD service preview

The Stack

There's no compelling reason why Rust is necessary for this, but we wanted to have some fun 😜. Everything is free and open source. You can self-host easily with docker-compose or Kubernetes by following the SELF-HOSTING guide here.

  • minijinja templates for the UI
    • there was no way I was going to write more JSX
  • PDFObject to view PDFs in the demo UI.
  • actix/actix-web for the HTTP server
  • fun redis queue macro system for worker pattern async processing
    • redis queues are a core part of our infra for Trieve, and this macro made our worker pattern a lot more repeatable
    • there will be a future release of this macro in an isolated crate
  • ClickHouse for task storage
    • we have had a surprising number of Postgres issues (especially write locks) building Trieve, so ClickHouse as the primary data store here is cool
  • chm for ClickHouse migrations
    • in-house CLI for creating a ClickHouse migrations folder and system
  • MinIO S3 for file storage
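
The worker-pattern macro isn't released yet, so here is only a hedged, std-only toy of the idea: a `macro_rules!` that expands into a drain loop handing each task to a handler. The `worker!` name, the in-memory `VecDeque` (standing in for a Redis list), and the handler shape are all assumptions, not the real macro's API.

```rust
use std::collections::VecDeque;

// VecDeque stands in for a Redis list; the real workers pop from Redis.
type Queue<T> = VecDeque<T>;

// Hypothetical sketch of the worker-pattern macro: it expands to a
// function that drains the queue and hands each task to a handler.
macro_rules! worker {
    ($name:ident, $task:ty, $handler:expr) => {
        fn $name(queue: &mut Queue<$task>, out: &mut Vec<String>) {
            while let Some(task) = queue.pop_front() {
                // The real version would be async and re-queue on failure.
                out.push($handler(task));
            }
        }
    };
}

// Generate a worker that "processes" page numbers.
worker!(page_worker, u32, |page: u32| format!("processed page {page}"));

fn run_demo() -> Vec<String> {
    let mut queue: Queue<u32> = (1..=3).collect();
    let mut out = Vec::new();
    page_worker(&mut queue, &mut out);
    out
}

fn main() {
    println!("{:?}", run_demo());
}
```

The appeal of the macro form is that each worker binary only declares its task type and handler; the polling loop, retries, and queue plumbing are generated.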

How does PDF2MD work?

Workers scale horizontally on demand to handle high-volume periods. Usually chunk-worker needs to scale before supervisor-worker. Pages for a given Task stream in as the chunk-worker calls out to the LLM to get markdown for each one.

1. HTTP server

  1. HTTP server receives a base64 encoded PDF and decodes it
  2. Creates FileTask for document in ClickHouse
  3. Adds FileTask along with the base64 encoded file to files_to_process queue in Redis
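
Steps 2–3 can be sketched in std-only Rust. The `FileTask` field names are assumptions, and the `VecDeque` stands in for the `files_to_process` Redis list; in the real server this runs inside an actix-web handler with a ClickHouse insert and a Redis push.

```rust
use std::collections::VecDeque;

// Hypothetical shape of the task recorded in ClickHouse; real field
// names and types may differ.
#[derive(Debug, Clone, PartialEq)]
struct FileTask {
    id: u64,
    file_name: String,
    status: String,
}

// Sketch of steps 2-3: record the task, then enqueue it. The real push
// is an LPUSH onto files_to_process in Redis, alongside the base64 file.
fn create_task(id: u64, file_name: &str, queue: &mut VecDeque<FileTask>) -> FileTask {
    let task = FileTask {
        id,
        file_name: file_name.to_string(),
        status: "created".to_string(),
    };
    queue.push_back(task.clone()); // stand-in for the Redis LPUSH
    task
}

fn main() {
    let mut queue = VecDeque::new();
    let task = create_task(1, "paper.pdf", &mut queue);
    println!("{task:?}, queued: {}", queue.len());
}
```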

2. Supervisor Worker

  1. supervisor-worker continuously polls the files_to_process Redis queue until it grabs a FileTask and its base64
  2. Decodes the base64 into a PDF and puts the PDF into S3
  3. Splits the PDF into pages, converts them to JPEGs
  4. Puts each JPEG page image into S3
  5. Pushes a ChunkingTask for each page into the files_to_chunk Redis queue
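
The fan-out in steps 3–5 can be sketched as follows; the `ChunkingTask` fields and the S3 key layout are assumptions, and a `VecDeque` again stands in for the `files_to_chunk` Redis queue.

```rust
use std::collections::VecDeque;

// Hypothetical per-page task; the real worker also uploads each page's
// JPEG to S3 and references it by key.
#[derive(Debug, Clone, PartialEq)]
struct ChunkingTask {
    file_id: u64,
    page: u32,
    s3_key: String,
}

// Sketch of steps 3-5: fan one file out into a ChunkingTask per page and
// push each onto the files_to_chunk queue (VecDeque stands in for Redis).
fn fan_out(file_id: u64, page_count: u32, queue: &mut VecDeque<ChunkingTask>) {
    for page in 1..=page_count {
        queue.push_back(ChunkingTask {
            file_id,
            page,
            s3_key: format!("files/{file_id}/pages/{page}.jpg"),
        });
    }
}

fn main() {
    let mut queue = VecDeque::new();
    fan_out(7, 3, &mut queue);
    println!("queued {} page tasks", queue.len());
}
```

Because each page becomes its own task, many chunk-workers can process one document's pages in parallel.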

3. Chunk Worker

  1. chunk-worker continuously polls the files_to_chunk Redis queue until it grabs a ChunkingTask
  2. Gets its page image from S3
  3. Sends the image to the LLM provider at LLM_BASE_URL, along with the prompt and model specified on the request, to get markdown back
  4. Updates the task with the markdown for the page
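
The chunk-worker loop can be sketched like this; the LLM call is stubbed out so the sketch runs offline, and the function names are illustrative only. Keying results by page number is one way to keep a task's combined markdown ordered even when pages finish out of order.

```rust
use std::collections::{BTreeMap, VecDeque};

// Stub for the real HTTP call to the LLM provider at LLM_BASE_URL; we
// fabricate markdown here so the sketch runs offline.
fn llm_page_to_markdown(page: u32) -> String {
    format!("## Page {page}")
}

// Sketch of the chunk-worker loop: drain per-page tasks and key each
// page's markdown by page number. BTreeMap keeps pages sorted, so the
// task's combined output stays ordered.
fn drain_pages(queue: &mut VecDeque<u32>) -> BTreeMap<u32, String> {
    let mut pages = BTreeMap::new();
    while let Some(page) = queue.pop_front() {
        pages.insert(page, llm_page_to_markdown(page));
    }
    pages
}

fn main() {
    // Pages can arrive out of order; iteration is still sorted.
    let mut queue: VecDeque<u32> = VecDeque::from(vec![3, 1, 2]);
    for (page, md) in &drain_pages(&mut queue) {
        println!("{page}: {md}");
    }
}
```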

Why Make This?

Trieve has used Apache Tika to process various filetypes for the past year, which means that files with complex layouts and diagrams have been poorly ingested.

We saw OmniAI launch zerox, showing that 4o-mini was a viable and cheap way to handle these filetypes, and decided it was time to integrate something better than Tika into Trieve.

We previously contributed lightly to Chunkr, a more advanced system that leverages layout detection and dedicated OCR models to process documents, but still felt the need to build something ourselves since it was a bit complex to work into Trieve's local dev and self-hosting setup. Zerox's approach using just a VLM was ideal and the path we went with.

We wrote our own API server and pipeline using Rust, Redis queues, and ClickHouse in the Trieve-style to achieve this. Try it using our demo UI hosted at pdf2md.trieve.ai.

Roadmap

Please contribute if you can! We could use help 🙏.

  1. Rename everything from chunk to page, because we eventually decided to handle only PDF --> Markdown conversion and not chunking. Consider using chonkie with the markdown output for chunking.
  2. Use ClickHouse MergeTree inserts instead of updating Tasks in ClickHouse, as that's more correct.
  3. supervisor-worker can get overwhelmed when it receives a large PDF, as splitting it into pages can take a while. There should be something better here.
  4. Users should be able to send a URL to a file instead of base64 encoding it if they have one because that's easier.
  5. Users should be able to point PDF2MD at an S3 bucket and let it process every file automatically instead of having to send each file one by one 🤮.

Made with ❤️ in San Francisco