Pure Rust image understanding engine based on SigLIP 2. Zero-shot image classification, image embeddings, and image-text similarity. No Python runtime, no CUDA, no external dependencies.
## Architecture

```
Image (224x224)
  → Patch Embedding (196 patches)
  → Add Position Embeddings
  → 12x ViT Transformer Layers (bidirectional)
  → Post-LayerNorm
  → MAP Pooling (cross-attention with learned probe)
  → L2 Normalize
  → 768-dim Image Embedding
```
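The 196-patch figure above follows directly from the ViT-B/16 geometry: a 224x224 image cut into 16x16 patches gives 14 patches per side. A minimal sketch of the arithmetic (illustrative, not the engine's code):

```rust
// Patch-count arithmetic for a ViT-style vision tower: the image is
// tiled into non-overlapping square patches, each of which becomes
// one input token.
fn num_patches(image_size: usize, patch_size: usize) -> usize {
    let per_side = image_size / patch_size;
    per_side * per_side
}

fn main() {
    // 224 / 16 = 14 patches per side → 14 * 14 = 196 tokens
    println!("patches: {}", num_patches(224, 16));
}
```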
```
Text → Tokenize → Token + Position Embeddings
  → 12x ViT Transformer Layers
  → Final LayerNorm (last token)
  → Linear Head
  → L2 Normalize
  → 768-dim Text Embedding
```

Score = `sigmoid(cosine_sim * exp(scale) + bias)`
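The scoring formula above can be sketched in a few lines. Since both embeddings are L2-normalized, cosine similarity reduces to a dot product. This is an illustrative sketch, not the engine's actual code; `scale` and `bias` are learned model parameters, and the values in `main` are placeholders, not the real checkpoint values:

```rust
// L2-normalize a vector so its Euclidean norm is 1.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

// Score = sigmoid(cosine_sim * exp(scale) + bias); with unit-norm
// inputs the cosine similarity is just the dot product.
fn siglip_score(img: &[f32], txt: &[f32], scale: f32, bias: f32) -> f32 {
    let cos: f32 = img.iter().zip(txt.iter()).map(|(a, b)| a * b).sum();
    sigmoid(cos * scale.exp() + bias)
}

fn main() {
    let img = l2_normalize(&[0.1, 0.9, 0.4]);
    let txt = l2_normalize(&[0.1, 0.8, 0.5]);
    println!("score = {:.4}", siglip_score(&img, &txt, 1.0, -10.0));
}
```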
## Files

```
siglip-model/
```

| File | Size | Description |
|---|---|---|
| qora-vision.exe | 4.4 MB | Inference engine |
| model.qora-vision | 210 MB | Full model (vision + text, Q4) |
| tokenizer.json | 33 MB | Text tokenizer (256K vocab) |
| config.json | 611 B | QORA-branded config |
| README.md | - | This file |
## Usage

```sh
# Zero-shot classification (fast, from binary)
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --labels "cat,dog,bird,car"

# Image-text similarity
qora-vision.exe siglip --load model.qora-vision --image photo.jpg --text "a photo of a sunset"

# Image embedding only
qora-vision.exe siglip --load model.qora-vision --image photo.jpg
```
## CLI Arguments

| Flag | Default | Description |
|---|---|---|
| `--image <path>` | - | Input image (PNG/JPEG) |
| `--labels <list>` | - | Comma-separated labels for zero-shot classification |
| `--text <string>` | - | Text for similarity scoring |
| `--load <path>` | `model.qora-vision` | Path to the `.qora-vision` model binary |
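As a rough illustration of how flags like these could be handled with only the standard library (the actual argument handling inside qora-vision.exe is not shown in this README, so treat this purely as a sketch):

```rust
use std::collections::HashMap;

// Collect `--name value` pairs into a map, applying the documented
// default for --load when the flag is absent.
fn parse_flags(args: &[String]) -> HashMap<String, String> {
    let mut flags = HashMap::new();
    let mut i = 0;
    while i < args.len() {
        if let Some(name) = args[i].strip_prefix("--") {
            if i + 1 < args.len() && !args[i + 1].starts_with("--") {
                flags.insert(name.to_string(), args[i + 1].clone());
                i += 2;
                continue;
            }
        }
        i += 1;
    }
    flags
        .entry("load".to_string())
        .or_insert_with(|| "model.qora-vision".to_string());
    flags
}

fn main() {
    let args: Vec<String> = ["--image", "photo.jpg", "--labels", "cat,dog"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    let flags = parse_flags(&args);
    println!("{:?}", flags.get("image"));
}
```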
## Published Benchmarks

SigLIP 2 Base (224px), published scores:

| Benchmark | Score |
|---|---|
| ImageNet-1K zero-shot | ~69.8% |
| Multilingual support | Yes (trained on WebLI) |
SigLIP 2 improves on the original SigLIP with stronger semantic understanding, localization, and dense features. Its sigmoid loss scores each image-text pair independently, which yields better-calibrated scores than CLIP's softmax-based contrastive objective.
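The calibration point can be seen with a small numerical sketch: softmax forces label scores to compete and sum to 1, so an image matching none of the labels still produces a confident-looking "winner", while independent sigmoids can all stay near zero. The logits below are made up for illustration:

```rust
// Numerically stable softmax over a slice of logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let m = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - m).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn main() {
    // Every label is a poor match: uniformly low logits.
    let logits = [-4.0_f32, -4.2, -4.1];
    let soft = softmax(&logits);
    let sig: Vec<f32> = logits.iter().map(|&x| sigmoid(x)).collect();
    println!("softmax: {:?}", soft); // still sums to 1, top score looks confident
    println!("sigmoid: {:?}", sig);  // every score stays near 0: "no match"
}
```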
## Model Comparison

| Model | Params | Image Size | Architecture | Zero-shot ImageNet |
|---|---|---|---|---|
| QORA-Vision (SigLIP 2 Base) | 93M | 224 | ViT-B/16 | ~69.8% |
| CLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 68.3% |
| SigLIP Base (v1) | 86M | 224 | ViT-B/16 | 66.2% |
| OpenCLIP ViT-B/16 | 86M | 224 | ViT-B/16 | 67.0% |
## Test Results

All tests run with Q4 quantization on CPU.

### Test 1: Red Image Classification

- Input: solid red 224x224 image
- Labels: red, blue, green, yellow

| Label | Score |
|---|---|
| red | 0.0022 |
| blue | 0.0000 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified "red") |
| Vision Forward | 42.0s |
| Embedding Dim | 768, L2 norm = 1.0000 |
### Test 2: Blue Image Classification

- Input: solid blue 224x224 image
- Labels: red, blue, green, yellow

| Label | Score |
|---|---|
| red | 0.0000 |
| blue | 0.0014 |
| green | 0.0000 |
| yellow | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified "blue") |
| Vision Forward | 31.5s |
### Test 3: Green Image with Natural Language Labels

- Input: solid green 224x224 image
- Labels: "a photo of a cat", "a photo of a dog", "a solid green image", "a landscape"

| Label | Score |
|---|---|
| a photo of a cat | 0.0000 |
| a photo of a dog | 0.0000 |
| a solid green image | 0.0176 |
| a landscape | 0.0000 |

| Metric | Value |
|---|---|
| Result | PASS (correctly identified the natural language description) |
| Vision Forward | 39.2s |
| Note | Highest score by far, demonstrating text understanding |
## Dependencies

- cortex — Rust deep learning framework (GPU via wgpu/Vulkan/Metal)
- half — F16 support
- image — image loading (PNG/JPEG)
- tokenizers — HuggingFace tokenizer
- safetensors — weight loading
- serde_json — config parsing
Built with QORA - Pure Rust AI Inference