Trieve Logo

API reference | Meet a Maintainer | Discord | Matrix | [email protected]


🦀 PDF2MD 🦀

Self-hostable API server and pipeline for converting PDFs to markdown using thrifty vision language models like GPT-4o-mini and gemini-flash-1.5.

Written in Rust. Try at pdf2md.trieve.ai.

PDF2MD service preview

The Stack

There's no compelling reason why Rust is necessary for this, but we wanted to have some fun 😜. Everything is free and open source. You can self-host easily with docker-compose or Kubernetes by following the SELF-HOSTING guide here.

  • minijinja templates for the UI
    • there was no way I was going to write more JSX
  • PDFObject to view PDFs in the demo UI.
  • actix/actix-web for the HTTP server
  • fun redis queue macro system for worker pattern async processing
    • redis queues are a core part of our infra for Trieve, and this macro made our worker pattern a lot more repeatable
    • there will be a future release of this macro in an isolated crate
  • ClickHouse for task storage
    • we have had a surprising number of Postgres issues (especially write locks) building Trieve, so ClickHouse as the primary data store here is cool
  • chm for ClickHouse migrations
    • in-house CLI for creating a ClickHouse migrations folder and system
  • MinIO S3 for file storage
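
The worker-pattern macro isn't released yet, so here is only a hedged, std-only toy of the idea: a `macro_rules!` that expands into a drain loop handing each task to a handler. The `worker!` name, the in-memory `VecDeque` (standing in for a Redis list), and the handler shape are all assumptions, not the real macro's API.

```rust
use std::collections::VecDeque;

// VecDeque stands in for a Redis list; the real workers pop from Redis.
type Queue<T> = VecDeque<T>;

// Hypothetical sketch of the worker-pattern macro: it expands to a
// function that drains the queue and hands each task to a handler.
macro_rules! worker {
    ($name:ident, $task:ty, $handler:expr) => {
        fn $name(queue: &mut Queue<$task>, out: &mut Vec<String>) {
            while let Some(task) = queue.pop_front() {
                // The real version would be async and re-queue on failure.
                out.push($handler(task));
            }
        }
    };
}

// Generate a worker that "processes" page numbers.
worker!(page_worker, u32, |page: u32| format!("processed page {page}"));

fn run_demo() -> Vec<String> {
    let mut queue: Queue<u32> = (1..=3).collect();
    let mut out = Vec::new();
    page_worker(&mut queue, &mut out);
    out
}

fn main() {
    println!("{:?}", run_demo());
}
```

The appeal of the macro form is that each worker binary only declares its task type and handler; the polling loop, retries, and queue plumbing are generated.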

How does PDF2MD work?

Workers scale horizontally on demand to handle high-volume periods. Usually chunk-worker needs to scale before supervisor-worker. Pages for a given Task stream in as the chunk-worker calls out to the LLM to get markdown for each one.

1. HTTP server

  1. HTTP server receives a base64 encoded PDF and decodes it
  2. Creates FileTask for document in ClickHouse
  3. Adds FileTask along with the base64 encoded file to files_to_process queue in Redis
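
Steps 2–3 can be sketched in std-only Rust. The `FileTask` field names are assumptions, and the `VecDeque` stands in for the `files_to_process` Redis list; in the real server this runs inside an actix-web handler with a ClickHouse insert and a Redis push.

```rust
use std::collections::VecDeque;

// Hypothetical shape of the task recorded in ClickHouse; real field
// names and types may differ.
#[derive(Debug, Clone, PartialEq)]
struct FileTask {
    id: u64,
    file_name: String,
    status: String,
}

// Sketch of steps 2-3: record the task, then enqueue it. The real push
// is an LPUSH onto files_to_process in Redis, alongside the base64 file.
fn create_task(id: u64, file_name: &str, queue: &mut VecDeque<FileTask>) -> FileTask {
    let task = FileTask {
        id,
        file_name: file_name.to_string(),
        status: "created".to_string(),
    };
    queue.push_back(task.clone()); // stand-in for the Redis LPUSH
    task
}

fn main() {
    let mut queue = VecDeque::new();
    let task = create_task(1, "paper.pdf", &mut queue);
    println!("{task:?}, queued: {}", queue.len());
}
```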

2. Supervisor Worker

  1. supervisor-worker continuously polls the files_to_process Redis queue until it grabs a FileTask and its base64
  2. Decodes the base64 into a PDF and puts the PDF into S3
  3. Splits the PDF into pages, converts them to JPEGs
  4. Puts each JPEG page image into S3
  5. Pushes a ChunkingTask for each page into the files_to_chunk Redis queue
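
The fan-out in steps 3–5 can be sketched as follows; the `ChunkingTask` fields and the S3 key layout are assumptions, and a `VecDeque` again stands in for the `files_to_chunk` Redis queue.

```rust
use std::collections::VecDeque;

// Hypothetical per-page task; the real worker also uploads each page's
// JPEG to S3 and references it by key.
#[derive(Debug, Clone, PartialEq)]
struct ChunkingTask {
    file_id: u64,
    page: u32,
    s3_key: String,
}

// Sketch of steps 3-5: fan one file out into a ChunkingTask per page and
// push each onto the files_to_chunk queue (VecDeque stands in for Redis).
fn fan_out(file_id: u64, page_count: u32, queue: &mut VecDeque<ChunkingTask>) {
    for page in 1..=page_count {
        queue.push_back(ChunkingTask {
            file_id,
            page,
            s3_key: format!("files/{file_id}/pages/{page}.jpg"),
        });
    }
}

fn main() {
    let mut queue = VecDeque::new();
    fan_out(7, 3, &mut queue);
    println!("queued {} page tasks", queue.len());
}
```

Because each page becomes its own task, many chunk-workers can process one document's pages in parallel.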

3. Chunk Worker

  1. chunk-worker continuously polls the files_to_chunk Redis queue until it grabs a ChunkingTask
  2. Gets its page image from S3
  3. Sends the image to the LLM provider at LLM_BASE_URL, along with the prompt and model specified on the request, to get markdown back
  4. Updates the task with the markdown for the page
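
The chunk-worker loop can be sketched like this; the LLM call is stubbed out so the sketch runs offline, and the function names are illustrative only. Keying results by page number is one way to keep a task's combined markdown ordered even when pages finish out of order.

```rust
use std::collections::{BTreeMap, VecDeque};

// Stub for the real HTTP call to the LLM provider at LLM_BASE_URL; we
// fabricate markdown here so the sketch runs offline.
fn llm_page_to_markdown(page: u32) -> String {
    format!("## Page {page}")
}

// Sketch of the chunk-worker loop: drain per-page tasks and key each
// page's markdown by page number. BTreeMap keeps pages sorted, so the
// task's combined output stays ordered.
fn drain_pages(queue: &mut VecDeque<u32>) -> BTreeMap<u32, String> {
    let mut pages = BTreeMap::new();
    while let Some(page) = queue.pop_front() {
        pages.insert(page, llm_page_to_markdown(page));
    }
    pages
}

fn main() {
    // Pages can arrive out of order; iteration is still sorted.
    let mut queue: VecDeque<u32> = VecDeque::from(vec![3, 1, 2]);
    for (page, md) in &drain_pages(&mut queue) {
        println!("{page}: {md}");
    }
}
```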

Why Make This?

Trieve has used Apache Tika to process various filetypes for the past year, which means that files with complex layouts and diagrams have been poorly ingested.

We saw OmniAI launch zerox, showing that 4o-mini was a viable and cheap way to handle these filetypes, and decided it was time to integrate something better than Tika into Trieve.

We previously contributed lightly to Chunkr, a more advanced system that leverages layout detection and dedicated OCR models to process documents, but still felt the need to build something ourselves since it was a bit complex to work into Trieve's local dev and self-hosting setup. Zerox's approach using just a VLM was ideal and the path we went with.

We wrote our own API server and pipeline using Rust, Redis queues, and ClickHouse in the Trieve-style to achieve this. Try it using our demo UI hosted at pdf2md.trieve.ai.

Roadmap

Please contribute if you can! We could use help 🙏.

  1. Rename everything from chunk to page, because we eventually decided to handle only PDF --> Markdown conversion and not chunking. Consider using chonkie with the markdown output for chunking.
  2. Use ClickHouse MergeTree inserts instead of updating Tasks in ClickHouse, as that's more correct.
  3. supervisor-worker can get overwhelmed when it receives a large PDF, as splitting it into pages can take a while. There should be something better here.
  4. Users should be able to send a URL to a file instead of base64 encoding it if they have one because that's easier.
  5. Users should be able to point PDF2MD at an S3 bucket and let it process every file automatically instead of having to send each file one by one 🤮.

Made with ❤️ in San Francisco