Inspiration
Large documents quickly exceed LLM context limits. Instead of pushing more tokens, we explored whether vision-based compression could extend effective context while reducing storage.
What it does
Vision-Compression converts PDFs into page images, stores lightweight metadata, and answers questions by reasoning directly over retrieved document images, enabling long-document QA with citations.
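The "lightweight metadata" per page could look something like the sketch below. The field and function names are illustrative assumptions, not the project's actual schema; the point is that only a small record plus an image path is stored per page, and citations fall out of the page number.

```python
from dataclasses import dataclass

# Hypothetical per-page metadata record; field names are illustrative,
# not the project's actual schema.
@dataclass
class PageRecord:
    doc_id: str       # which PDF this page came from
    page_number: int  # 1-based page index, used for citations
    image_path: str   # rendered page image on disk
    summary: str      # short text synopsis used for retrieval

def format_citation(record: PageRecord) -> str:
    """Render a citation string like 'doc.pdf, p. 7'."""
    return f"{record.doc_id}, p. {record.page_number}"
```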
How we built it
We built a Python backend that uses Vertex AI Gemini for multimodal reasoning and Supermemory for page-level retrieval, supports multiple ingestion modes (text, optical, hybrid), and includes an evaluation harness that compares accuracy, storage, and context usage across modes.
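The three ingestion modes could be dispatched along these lines. This is a stdlib-only sketch under our own naming assumptions (`IngestMode`, `plan_ingestion` are hypothetical), showing only which artifacts each mode would produce:

```python
from enum import Enum

class IngestMode(Enum):
    TEXT = "text"        # extract text chunks only
    OPTICAL = "optical"  # render page images only
    HYBRID = "hybrid"    # produce both representations

def plan_ingestion(mode: IngestMode, page_count: int) -> dict:
    """Return counts of artifacts an ingestion run would produce (illustrative)."""
    wants_text = mode in (IngestMode.TEXT, IngestMode.HYBRID)
    wants_images = mode in (IngestMode.OPTICAL, IngestMode.HYBRID)
    return {
        "text_chunks": page_count if wants_text else 0,
        "page_images": page_count if wants_images else 0,
    }
```

In the real pipeline each page image would also be registered for retrieval; the sketch only captures the mode split.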
Challenges we ran into
We had to handle long documents efficiently, enforce citation grounding, manage storage tradeoffs, and design fair evaluations across different context representations.
Accomplishments that we're proud of
- Demonstrated effective context extension without increasing token budgets
- Achieved significant storage reduction using optical-lite ingestion
- Built a clean, multi-PDF evaluation framework with measurable comparisons
What we learned
Vision tokens can act as a denser carrier of information than text tokens, and combining retrieval with image-based reasoning can outperform naïve text-only approaches on long documents.
What's next for Vision-Compression
Automated rendering optimization, stronger retrieval routing, larger-scale benchmarks, and a polished UI for interactive long-document analysis.
Built With
- fastapi
- gemini3
- supermemory