Inspiration

Large documents quickly exceed LLM context limits. Instead of pushing more tokens, we explored whether vision-based compression could extend effective context while reducing storage.

What it does

Vision-Compression converts PDFs into page images, stores lightweight metadata, and answers questions by reasoning directly over retrieved document images, enabling long-document QA with citations.
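The store-and-retrieve flow above can be sketched in a few lines. This is a toy illustration, not the project's actual code: the record fields and class names are hypothetical, and a keyword-overlap score stands in for Supermemory's real semantic retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class PageRecord:
    # Hypothetical metadata kept per rendered page; names are illustrative.
    doc_id: str
    page_no: int      # 1-based page number, used for citations in answers
    image_path: str   # path to the rendered page image (e.g. PNG)
    summary: str      # short text summary used only for retrieval

@dataclass
class DocumentIndex:
    pages: list = field(default_factory=list)

    def add(self, record: PageRecord) -> None:
        self.pages.append(record)

    def retrieve(self, query: str, k: int = 3) -> list:
        # Toy keyword overlap; the real system uses semantic retrieval.
        terms = set(query.lower().split())
        scored = sorted(
            self.pages,
            key=lambda p: len(terms & set(p.summary.lower().split())),
            reverse=True,
        )
        return scored[:k]

index = DocumentIndex()
index.add(PageRecord("report", 1, "report_p1.png", "revenue grew in q3"))
index.add(PageRecord("report", 2, "report_p2.png", "hiring plan for 2025"))
top = index.retrieve("what was q3 revenue", k=1)  # page 1 ranks first
```

The retrieved records point at page images, which are then passed to the multimodal model together with the question, so the answer can cite `doc_id` and `page_no` directly.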

How we built it

We built a Python backend using Vertex AI Gemini for multimodal reasoning, Supermemory for page-level retrieval, and multiple ingestion modes (text, optical, hybrid), with an evaluation harness to compare accuracy, storage, and context usage.
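The evaluation harness mentioned above can be sketched as a simple per-mode aggregator. The mode names, metric fields, and numbers below are made-up placeholders, not measured results from the project:

```python
def evaluate(mode_results):
    """Aggregate per-question results for one ingestion mode into
    accuracy, average context usage, and storage footprint."""
    n = len(mode_results)
    return {
        "accuracy": sum(r["correct"] for r in mode_results) / n,
        "avg_context_tokens": sum(r["context_tokens"] for r in mode_results) / n,
        "storage_mb": max(r["storage_mb"] for r in mode_results),
    }

# Placeholder per-question results for two hypothetical ingestion modes.
text_mode = [
    {"correct": True,  "context_tokens": 9000, "storage_mb": 1.2},
    {"correct": False, "context_tokens": 9500, "storage_mb": 1.2},
]
optical_mode = [
    {"correct": True, "context_tokens": 2100, "storage_mb": 0.4},
    {"correct": True, "context_tokens": 1900, "storage_mb": 0.4},
]

report = {"text": evaluate(text_mode), "optical": evaluate(optical_mode)}
```

Keeping every mode's results in the same per-question schema is what makes the comparison fair: each mode answers the same questions, and only the context representation differs.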

Challenges we ran into

Handling long documents efficiently, enforcing citation grounding, managing storage tradeoffs, and designing fair evaluations across different context representations.

Accomplishments that we're proud of

• Demonstrated effective context extension without increasing token budgets
• Achieved significant storage reduction using optical-lite ingestion
• Built a clean, multi-PDF evaluation framework with measurable comparisons


What we learned

Vision tokens can act as a denser carrier of information than text tokens, and combining retrieval with image-based reasoning can outperform naïve text-only approaches on long documents.

What's next for Vision-Compression

Automated rendering optimization, stronger retrieval routing, larger-scale benchmarks, and a polished UI for interactive long-document analysis.

Built With

  • fastapi
  • gemini3
  • supermemory