A reproduction of the Deepseek-OCR model based on the VILA codebase. DeepOCR explores context optical compression through vision-text token compression, achieving competitive OCR performance with minimal vision tokens.
- Token Efficiency: Achieves competitive OCR performance using ~250 vision tokens
- Open Source Implementation: Complete reproduction of DeepSeek-OCR's innovative optical compression architecture using the VILA framework
- Novel DeepEncoder: Combines SAM (window attention) + CLIP (global attention) with 16× convolutional compression for efficient high-resolution processing (1024×1024+)
- Production Ready: Includes complete training pipeline, evaluation scripts, and pre-trained checkpoints for immediate use
DeepSeek-OCR: Contexts Optical Compression
arXiv Paper
- Vision-Text Compression: Compresses text into visual representations at 7-20× ratios while maintaining high OCR accuracy
- DeepEncoder Architecture: Novel encoder combining SAM (80M) + CLIP (300M) with a 16× convolutional compressor
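As a rough illustration of what a 7-20× ratio means (the token counts below are assumptions chosen for illustration, not measurements):

```python
# Illustrative compression-ratio arithmetic (example counts only).
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

# A dense document page might need ~2000 text tokens, but only 256
# vision tokens survive DeepEncoder's 16x convolutional compression.
ratio = compression_ratio(2000, 256)
print(f"{ratio:.1f}x")  # 7.8x, at the low end of the 7-20x range
```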
┌───────────────────────────────────────────────────────┐
│                        DeepOCR                        │
├───────────────────────────────────────────────────────┤
│                                                       │
│   ┌───────────────────────────────────────────────┐   │
│   │              DeepEncoder (380M)               │   │
│   │                                               │   │
│   │  ┌──────────────┐  ┌─────────┐  ┌──────────┐  │   │
│   │  │   SAM-base   │─▶│  Conv   │─▶│   CLIP   │  │   │
│   │  │    (80M)     │  │   16×   │  │  (300M)  │  │   │
│   │  │ Window Attn  │  │Compress │  │  Global  │  │   │
│   │  └──────────────┘  └─────────┘  └──────────┘  │   │
│   │                                               │   │
│   └───────────────────────────────────────────────┘   │
│                          │                            │
│   ┌───────────────────────────────────────────────┐   │
│   │      Linear Projector (2048 → LLM dim)        │   │
│   └───────────────────────────────────────────────┘   │
│                          │                            │
│   ┌───────────────────────────────────────────────┐   │
│   │                   Qwen2-7B                    │   │
│   └───────────────────────────────────────────────┘   │
│                                                       │
└───────────────────────────────────────────────────────┘
# Clone the repository
git clone https://github.com/pkulium/DeepOCR
cd DeepOCR
# Set up environment
./environment_setup.sh deeporc
conda activate deeporc
# Install additional dependencies for OCR
pip install safetensors einops easydict mupdf

Download required checkpoints:
# SAM and CLIP checkpoints (combined in one file)
# Place at: checkpoints/sam_clip_ckpt/model_cache/model-00001-of-000001.safetensors
huggingface-cli download pkulium/sam_clip_ckpt
# Base LLM (Qwen2-7B-Instruct)
huggingface-cli download Efficient-Large-Model/Qwen2-VL-7B-Instruct

Trains the vision-to-text projector while freezing the vision encoder and LLM:
bash scripts/NVILA-Lite/align_ocr.sh \
Efficient-Large-Model/Qwen2-VL-7B-Instruct \
llava_15_mix \
    runs/train/ocr-qwen2-vl-8b-align

Key parameters:
- Batch size: 512
- Learning rate: 1e-3
- Epochs: 1
- Data: LLaVA-CC3M-Pretrain-595K
- Trainable: Projector only
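Projector-only training means every other parameter is frozen. A minimal sketch of this kind of selective freezing in PyTorch (the toy module and attribute names below are stand-ins, not the repo's actual classes):

```python
import torch.nn as nn

# Toy stand-ins for DeepEncoder, the linear projector, and Qwen2;
# attribute names are assumptions for illustration.
class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(16, 16)
        self.mm_projector = nn.Linear(16, 8)
        self.llm = nn.Linear(8, 8)

def freeze_all_but_projector(model: nn.Module) -> None:
    # Freeze everything, then re-enable gradients on the projector only.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.mm_projector.parameters():
        p.requires_grad = True

model = ToyVLM()
freeze_all_but_projector(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['mm_projector.weight', 'mm_projector.bias']
```

Stage 2 then unfreezes the LLM as well, leaving only the vision encoder fixed.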
Full model training with OCR data:
bash scripts/NVILA-Lite/pretrain_ocr.sh \
runs/train/ocr-qwen2-vl-8b-align/model \
olmOCR-mix-pretrain \
    runs/train/ocr-qwen2-vl-8b-pretrain

Key parameters:
- Batch size: 32
- Learning rate: 5e-5
- Epochs: 1
- Data: allenai/olmOCR-mix-1025
- Trainable: Projector + LLM
The model requires three types of data across two training stages:
Stage 1: Initialize Projector
- Dataset: CC3M (Conceptual Captions 3M)
Stage 2: Model Pretrain
- Data sources: PDF documents and images
- Dataset:
allenai/olmOCR-mix-1025
bash scripts/eval/all.sh

python llava/eval/omini_doc_bench.py \
--model-path <model_path> \
--input-folder <input_images> \
--output-folder <output_markdown> \
    --text "Free OCR."

Available prompts:
- "<image>\nFree OCR." - Plain text extraction
- "<image>\n<|grounding|>Convert the document to markdown." - With layout
- "<image>\nParse the figure." - Chart/figure parsing
- "<image>\nDescribe this image in detail." - General description
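For scripting, the prompts can be collected in one place; a small sketch (the task keys are our own labels, only the prompt strings come from the list above):

```python
# Task prompts from the list above; "<image>" is the media placeholder.
PROMPTS = {
    "ocr": "<image>\nFree OCR.",
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}

def build_prompt(task: str) -> str:
    """Look up the prompt for a task, with a helpful error for typos."""
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(PROMPTS)}")

print(build_prompt("markdown"))
```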
vila-eval \
--model-name NVILA-8B-OCR \
--model-path runs/train/ocr-qwen2-vl-8b-pretrain/model \
--conv-mode auto \
    --tags-include local

Core Components:
- deepencoder.py: Implements SAM and CLIP vision towers
  - build_sam_vit_b(): SAM-base with 768-dim, 12 layers, window attention
  - build_clip_l(): CLIP-large with 1024-dim, 24 layers, global attention
  - MlpProjector: Token compression module
- modeling_sam_clip.py: Main SAMCLIP wrapper
  - Handles multi-resolution input processing
  - Dynamic tile-based processing for high-res images
  - Token concatenation: [CLIP_cls, CLIP_patches, SAM_features]
Token Flow:
Input (1024×1024) → SAM (4096 tokens) → Conv 16× (256 tokens)
                                                ↓
                      CLIP (256 tokens) → Concat → 2048-dim features
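The token counts in this flow can be sanity-checked with a little arithmetic (the patch size of 16 is an assumption consistent with the stated counts for a ViT-style SAM encoder):

```python
# Token bookkeeping for the flow above (patch size 16 assumed).
def sam_tokens(size: int, patch: int = 16) -> int:
    """Patch tokens produced by a ViT over a square size x size input."""
    return (size // patch) ** 2

def compressed_tokens(n: int, factor: int = 16) -> int:
    """Tokens remaining after the 16x convolutional compressor."""
    return n // factor

n_sam = sam_tokens(1024)          # 1024/16 = 64 patches per side -> 4096
n_vis = compressed_tokens(n_sam)  # 4096 / 16 -> 256
print(n_sam, n_vis)  # 4096 256
```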
# Dynamic resolution preprocessing
def dynamic_preprocess(image, min_num=2, max_num=6, image_size=640):
"""
Splits image into tiles based on aspect ratio
Returns: List of tile images + crop ratio
    """

Processing modes:
- Single image: Resize or pad to base size (1024×1024)
- Cropping enabled: Dynamic tiling (2-6 tiles per dimension)
- Output: Global view (1024×1024) + Local tiles (640×640 each)
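A minimal sketch of how a tile grid might be chosen from the aspect ratio, under the 2-6 tile budget (this shows only grid selection; the real `dynamic_preprocess` also resizes tiles and returns a crop ratio, and its exact search may differ):

```python
def pick_grid(width: int, height: int, min_num: int = 2, max_num: int = 6):
    """Pick a (cols, rows) tiling whose aspect ratio best matches the image.

    Only grids whose tile count falls in [min_num, max_num] are considered,
    mirroring the MIN_CROPS/MAX_CROPS budget. Simplified sketch, not the
    repo's exact algorithm.
    """
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_num + 1)
        for r in range(1, max_num + 1)
        if min_num <= c * r <= max_num
    ]
    # Closest grid aspect ratio to the image aspect ratio wins.
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

print(pick_grid(1280, 640))  # wide page  -> (2, 1)
print(pick_grid(640, 1280))  # tall page  -> (1, 2)
```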
class MultimodalProjector:
def __init__(self):
self.layers = nn.Linear(2048, llm_hidden_size)
self.image_newline = nn.Parameter(...) # Token separator
        self.view_seperator = nn.Parameter(...) # View separator

Token formatting:
[Local_Tiles] + [Image_Newline] + [Global_View] + [View_Separator]
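The ordering can be sketched with plain lists (the placeholder strings stand in for the learned `nn.Parameter` separators; each "tile" here is just a list of token ids):

```python
def format_tokens(local_tiles, global_view,
                  image_newline="<IMG_NL>", view_sep="<VIEW_SEP>"):
    """Assemble [Local_Tiles] + [Image_Newline] + [Global_View] + [View_Separator]."""
    seq = []
    for tile in local_tiles:      # all local tiles first, in order
        seq.extend(tile)
    seq.append(image_newline)     # separator between tiles and global view
    seq.extend(global_view)       # then the global 1024x1024 view
    seq.append(view_sep)          # trailing view separator
    return seq

seq = format_tokens([[1, 2], [3, 4]], [5, 6])
print(seq)  # [1, 2, 3, 4, '<IMG_NL>', 5, 6, '<VIEW_SEP>']
```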
Key settings:
BASE_SIZE = 1024 # Global view size
IMAGE_SIZE = 640 # Tile size
CROP_MODE = True # Enable dynamic tiling
MIN_CROPS = 2 # Min tiles per dimension
MAX_CROPS = 6 # Max tiles per dimension
MAX_CONCURRENCY = 100 # Batch processing limit

First, download the model from Hugging Face:
huggingface-cli download pkulium/easy_deepocr --local-dir ./easy_deepocr_sam_clip

Then use the model:
vila-infer \
--model-path ./easy_deepocr_sam_clip \
--conv-mode auto \
--text "Free OCR." \
    --media "./assets/test.png"

import llava
from llava.media import Image  # media type used below; import path may vary by VILA version
# Load model
model = llava.load("./easy_deepocr_sam_clip")
prompt = [
Image("document.pdf"),
"<|grounding|>Convert the document to markdown."
]
response = model.generate_content(prompt)

prompt = [Image("chart.png"), "Parse the figure."]
response = model.generate_content(prompt)
# Returns: HTML table or structured data

- CUDA OOM during training
  # Reduce batch size or enable gradient checkpointing
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing True
- NCCL timeout in multi-GPU training
  export NCCL_TIMEOUT=1800
  export NCCL_IB_TIMEOUT=22
- Position_ids buffer device mismatch
  - Fixed in deepencoder.py by reinitializing position_ids after checkpoint loading
- Distributed training hangs
  - Ensure all processes take the same conditional branches (see the modeling_sam_clip.py fix)
This reproduction is based on the VILA codebase and makes the following adaptations:
- LLM Decoder: Uses Qwen2-VL-7B instead of Deepseek-3B-MoE
- Training Framework: VILA's training pipeline instead of Deepseek's custom framework
- Data Loading: Adapted to VILA's data format and preprocessing
- Multi-resolution: Simplified implementation with preset modes
- Implement needle-in-a-haystack tests for context compression
- Add support for digital-optical text interleaved pretraining
- Optimize inference with TensorRT/vLLM
- Add LoRA fine-tuning support for domain adaptation
- Implement forgetting mechanism experiments
If you find our work helpful, please consider citing it:
@misc{DeepOCR,
title = {DeepOCR},
year = {2025},
howpublished = {\url{https://github.com/pkulium/DeepOCR}},
note = {Accessed: 2025-11-04}
}

- Code: Apache 2.0 License
- Model weights: CC-BY-NC-SA-4.0 License
- For commercial use, please contact the authors
- Original Deepseek-OCR paper and team
- VILA codebase and NVIDIA team
- SAM (Segment Anything Model) from Meta
- CLIP from OpenAI
- Qwen2 from Qwen team
Note: This is a research reproduction. For production deployment, consider the official Deepseek-OCR implementation.

