Made autonomously using NEO
Benchmark Vision-Language model efficiency by dynamically masking image tokens during inference.
```bash
git clone https://github.com/dakshjain-1616/vissparse
cd vissparse
pip install -r requirements.txt
```

Run a full VQA accuracy-vs-latency sweep in mock mode (no GPU or model download needed):

```bash
python run_sparse_vqa.py --num-samples 100 --keep-ratio 0.1 --mode mock
```

Or use the library directly in your code:
```python
from vissparse.token_selector import TokenSelector
from vissparse.sparse_attention import SparseAttentionMask

selector = TokenSelector(similarity_threshold=0.5)
mask = SparseAttentionMask.generate(selector, image_tokens, keep_ratio=0.1)
```

- Dynamic Token Selection: Cosine-similarity-based selector that identifies informative image tokens at inference time (an illustrative sketch follows this list).
- Sparse Attention Masking: Custom mask generator compatible with standard VLM architectures like Qwen2-VL.
- VQA Benchmarking: Evaluate accuracy vs. tokens skipped on the VQA v2 dataset, with automated CSV reporting.
- Mock Mode: Test the logic and metrics without downloading heavy vision-language models or requiring a GPU.
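For orientation, here is a minimal, self-contained sketch of how cosine-similarity token selection and boolean mask generation can work. It is illustrative only and does not mirror the actual `TokenSelector` / `SparseAttentionMask` internals; the function names (`select_informative_tokens`, `build_sparse_attention_mask`), the scoring heuristic, and the use of PyTorch are assumptions.

```python
# Illustrative sketch only; not the vissparse implementation.
import torch
import torch.nn.functional as F

def select_informative_tokens(image_tokens: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Score each image token by cosine similarity to the mean token and
    keep the `keep_ratio` fraction that is least redundant.

    image_tokens: (num_tokens, hidden_dim) tensor of vision features.
    Returns the indices of tokens to keep.
    """
    normed = F.normalize(image_tokens, dim=-1)
    centroid = F.normalize(normed.mean(dim=0), dim=-1)
    similarity = normed @ centroid                      # (num_tokens,) cosine similarity to the centroid
    num_keep = max(1, int(keep_ratio * image_tokens.shape[0]))
    # Tokens least similar to the centroid carry the most distinctive information.
    return torch.topk(-similarity, num_keep).indices

def build_sparse_attention_mask(num_tokens: int, keep_indices: torch.Tensor) -> torch.Tensor:
    """Boolean mask (True = attend) allowing only the kept image tokens into attention."""
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[keep_indices] = True
    return mask

# Example: keep 10% of 256 image tokens with 1024-dim features.
tokens = torch.randn(256, 1024)
kept = select_informative_tokens(tokens, keep_ratio=0.1)
attn_mask = build_sparse_attention_mask(tokens.shape[0], kept)
```

In a real VLM forward pass, a mask like this would typically be applied by setting the masked positions of the attention scores to negative infinity before the softmax, which is how a lower keep-ratio translates into image tokens being skipped.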
```bash
pytest tests/ -q
# 91 passed
```

```
vissparse/
├── vissparse/ ← main library (token_selector, sparse_attention, metrics)
├── tests/ ← test suite (integration, metrics, attention, selector)
├── scripts/ ← demo scripts (demo.py)
├── run_sparse_vqa.py ← main CLI entry point
├── conftest.py ← pytest configuration
└── requirements.txt
```