A natural language image editing pipeline with Photoshop-style adjustments, filters, and layers, plus 3D object manipulation — powered by Claude (intent parsing), SAM2 (segmentation), Depth Anything V2 (depth estimation), Zero123++ / TripoSR (3D novel view synthesis), and FLUX.1-dev (diffusion).
You describe what you want in plain English. The system figures out which model to use.
```python
ps = AiPhotoshop("photo.jpg")
ps.edit("boost contrast 20, add warm white balance")   # instant, no GPU
ps.edit("rotate the chair 45 degrees to the left")     # 3D object edit, ~30s
ps.edit("remove the person on the right")              # FLUX inpaint, ~20s
ps.edit("set blend mode to soft light, opacity 60%")   # layer op, instant
ps.save("output.jpg")
```
⚠️ **FLUX.1-dev license notice:** This pipeline uses FLUX.1-dev for inpainting and generative edits. FLUX.1-dev is released under a non-commercial license — it may not be used for commercial purposes without a separate agreement with Black Forest Labs. You must accept the license at huggingface.co/black-forest-labs/FLUX.1-dev before downloading weights. All other models in this stack (SAM2, Depth Anything V2, Zero123++, TripoSR, Grounding DINO) are Apache 2.0 or MIT.
Every prompt goes through an LLM intent parser (Claude) that extracts a structured operation — op type, target object, and parameters — then routes to the cheapest backend that can do the job correctly.
```
prompt
  └─► Claude intent parser → {op_type, operation, target_object, params}
        ├─ whole_image_geometric → OpenCV (instant, no GPU)
        ├─ object_2d_geometric   → SAM2 + OpenCV + FLUX (~10s)
        ├─ object_3d_geometric   → SAM2 + Depth + Zero123++/TripoSR + FLUX (~30–60s)
        ├─ adjustment            → NumPy / PIL (instant, no GPU)
        ├─ filter                → OpenCV / PIL (instant, no GPU)
        ├─ generative/semantic   → FLUX.1-dev (~20s)
        └─ layer_operation       → LayerStack (instant)
```
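A minimal sketch of what the parsing step can look like with the Anthropic SDK; the model id, system prompt, and JSON schema below are illustrative assumptions, not this repo's actual internals:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "Extract an image-editing operation from the user's request. "
    "Reply with JSON only: {op_type, operation, target_object, params}."
)

def parse_intent(prompt: str) -> dict:
    # Hypothetical schema matching the routing diagram above.
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; any current Claude model works
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text)

# parse_intent("rotate the chair 45 degrees to the left") might yield:
# {"op_type": "object_3d_geometric", "operation": "rotate",
#  "target_object": "the chair", "params": {"yaw": -45}}
```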
Core design principle: geometric math is never delegated to a diffusion model. Rotation, scaling, and translation use OpenCV for exact pixel-level results. Diffusion only handles what it's actually good at — synthesizing novel views, filling inpainted holes, and semantic edits.
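For instance, a whole-image rotation is one deterministic OpenCV warp, with no sampling involved (a sketch; the helper name is ours, not the pipeline's):

```python
import cv2
import numpy as np

def rotate_image(img: np.ndarray, degrees: float) -> np.ndarray:
    """Exact affine rotation about the image center; positive angle = counter-clockwise."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), degrees, 1.0)
    return cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REFLECT)
```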
```
Input image + "rotate the chair 45 degrees left"
  │
  ├─ SAM2 (Grounding DINO → bounding box → precise mask)
  ├─ Depth Anything V2 (metric depth map for scene context)
  │
  ├─ [fast]    Zero123++ — diffusion novel view synthesis at target (pitch, yaw, roll)
  └─ [quality] TripoSR — 3D mesh reconstruction → render from new camera angle
  │
  ├─ FLUX.1 inpaint — fill the background hole left by the moved object
  └─ Alpha composite — place rotated object onto inpainted background
```
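The final stage is plain array math; a sketch of the alpha-composite step, assuming the rotated object comes back as an RGBA cutout that fits inside the background at (x, y):

```python
import numpy as np

def alpha_composite(background: np.ndarray, cutout_rgba: np.ndarray,
                    x: int, y: int) -> np.ndarray:
    """Blend an RGBA cutout onto the (already inpainted) background."""
    out = background.copy()
    h, w = cutout_rgba.shape[:2]
    alpha = cutout_rgba[:, :, 3:4].astype(np.float32) / 255.0
    region = out[y:y + h, x:x + w].astype(np.float32)
    out[y:y + h, x:x + w] = (alpha * cutout_rgba[:, :, :3]
                             + (1.0 - alpha) * region).astype(np.uint8)
    return out
```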
Use `zero123` (default, ~30s) for speed, or `triposr` (~60s) for mesh-accurate results:

```python
ps.edit("rotate the sculpture 60 degrees yaw, use triposr for quality")
```

Place your own before/after images in `assets/` after running the examples. See `assets/README.md` for instructions.
| Operation | Before | After |
|---|---|---|
| 3D yaw rotation | `assets/demo_3d_before.jpg` | `assets/demo_3d_after.jpg` |
| Object removal | `assets/demo_remove_before.jpg` | `assets/demo_remove_after.jpg` |
To generate your own:
```bash
# 3D rotation demo
python examples/3d_editing.py --input photo.jpg --object "the chair" --output_dir outputs/3d/

# Retouching and filters
python examples/basic_edits.py --input photo.jpg --output_dir outputs/basic/

# Layer workflow
python examples/layer_workflow.py --input photo.jpg --output_dir outputs/layers/
```

| Prompt | Backend | Speed |
|---|---|---|
| "rotate 15 degrees clockwise" | OpenCV | instant |
| "flip horizontally" | OpenCV | instant |
| "zoom in 1.5x" | OpenCV | instant |
| "crop to center square" | OpenCV | instant |
| "perspective correct the sign" | OpenCV | instant |
| "flip the car horizontally" | SAM2 + OpenCV + FLUX | ~10s |
| "rotate the chair 45 degrees left" | SAM2 + Depth + Zero123++ + FLUX | ~30s |
| "rotate the sculpture 60° left, high quality" | SAM2 + Depth + TripoSR + FLUX | ~60s |
Adjustments:

- `brightness`, `contrast`, `sharpness`
- `curves` — per-channel (R, G, B, or composite), arbitrary control points
- `levels` — black point, white point, gamma, output range
- `hsl` — hue shift, saturation, lightness
- `vibrance` — saturation boost weighted toward less-saturated pixels
- `white_balance` — temperature and tint
- `shadows_highlights` — recover shadow/highlight detail independently
- `vignette` — strength, feather, midpoint
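To make the implementation style concrete, here is a sketch of a vibrance-type adjustment, where the saturation boost is weighted toward less-saturated pixels (illustrative, not the repo's exact code):

```python
import cv2
import numpy as np

def vibrance(img_bgr: np.ndarray, amount: float = 0.3) -> np.ndarray:
    """Boost saturation more on muted pixels than on already-vivid ones."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    sat = hsv[:, :, 1] / 255.0              # 0 = gray, 1 = fully saturated
    weight = 1.0 - sat                      # muted pixels get the largest boost
    hsv[:, :, 1] = np.clip(hsv[:, :, 1] * (1.0 + amount * weight), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```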
Filters:

- `gaussian_blur`, `lens_blur`, `motion_blur`
- `bilateral_denoise` — edge-preserving noise reduction
- `unsharp_mask` — amount, radius, threshold
- `grain` — simulated film grain
- `emboss`, `posterize`, `pixelate`
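For reference, the classic unsharp-mask formula behind those `amount` / `radius` / `threshold` parameters (a sketch; the threshold is applied as a simple low-contrast gate):

```python
import cv2
import numpy as np

def unsharp_mask(img: np.ndarray, amount: float = 1.0,
                 radius: float = 2.0, threshold: int = 0) -> np.ndarray:
    """sharpened = original + amount * (original - gaussian_blur(original))."""
    blurred = cv2.GaussianBlur(img, (0, 0), radius)   # kernel size derived from sigma
    diff = img.astype(np.float32) - blurred.astype(np.float32)
    if threshold > 0:
        diff[np.abs(diff) < threshold] = 0            # leave low-contrast areas untouched
    return np.clip(img + amount * diff, 0, 255).astype(np.uint8)
```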
Generative (FLUX.1-dev):

- `inpaint` — fill a region described in natural language
- `remove_object` — erase and fill background
- `replace_background` — swap the entire background
- `add_object` — add a new element to the scene
- `text_to_image` — generate from scratch
- `style_transfer`, `change_season`, `change_weather`, `colorize`, `age`
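These all route to FLUX.1-dev. A sketch of what the inpaint call can look like through the `diffusers` `FluxInpaintPipeline` (the repo's own wrapper may differ; see the license notice above):

```python
import torch
from diffusers import FluxInpaintPipeline
from PIL import Image

# Assumes you have accepted the FLUX.1-dev license and run `huggingface-cli login`.
pipe = FluxInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("photo.jpg").convert("RGB")
mask = Image.open("mask.png").convert("L")   # white = region to repaint

result = pipe(prompt="empty street, nobody present",
              image=image, mask_image=mask, strength=0.9).images[0]
result.save("inpainted.jpg")
```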
Layers:

- Named layers with per-layer opacity and visibility
- Blend modes: Normal, Multiply, Screen, Overlay, Soft Light, Hard Light, Darken, Lighten, Difference, Hue, Saturation, Luminosity
- Per-layer masks (uint8 alpha)
- Duplicate, delete, reorder, merge visible, flatten
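Blend modes are per-pixel formulas over normalized channel values; a sketch of two of them (the soft-light branch uses a simplified form of the W3C equation):

```python
import numpy as np

def blend_screen(base: np.ndarray, top: np.ndarray) -> np.ndarray:
    """Screen: inverted multiply, always lightens. Inputs in [0, 1]."""
    return 1.0 - (1.0 - base) * (1.0 - top)

def blend_soft_light(base: np.ndarray, top: np.ndarray) -> np.ndarray:
    """Soft light: gentle dodge/burn centered on 50% gray. Inputs in [0, 1]."""
    return np.where(top <= 0.5,
                    base - (1.0 - 2.0 * top) * base * (1.0 - base),
                    base + (2.0 * top - 1.0) * (np.sqrt(base) - base))
```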
Requirements:

- Python 3.10+
- CUDA GPU with 16 GB+ VRAM recommended (24 GB for TripoSR + FLUX simultaneously)
- An Anthropic API key for intent parsing
```bash
git clone https://github.com/yourname/ai-photoshop.git
cd ai-photoshop
pip install -r requirements.txt
```

Or install as a package:

```bash
pip install -e .
```

SAM2:

```bash
pip install git+https://github.com/facebookresearch/segment-anything-2.git
```
```bash
# Download weights
wget -P weights/sam2 https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_large.pt
wget -P weights/sam2 https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_large.yaml
```

Depth Anything V2:

```bash
pip install depth-anything-v2
wget -P weights/depth \
  https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth
```
FLUX.1-dev:

⚠️ Accept the non-commercial license at huggingface.co/black-forest-labs/FLUX.1-dev first.

```bash
huggingface-cli login  # authenticate — weights download automatically on first use (~24 GB)
```

TripoSR:

```bash
pip install git+https://github.com/VAST-AI-Research/TripoSR.git
```

Anthropic API key (for intent parsing):

```bash
export ANTHROPIC_API_KEY=sk-ant-...
```

See weights/README.md for full download instructions, all model sizes, and licensing details.
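To sanity-check the SAM2 install against the weights downloaded above, a quick smoke test (a sketch based on the upstream `sam2` package; the Hydra config name may differ between versions):

```python
import cv2
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# The config name resolves against the package's bundled Hydra configs;
# the checkpoint path matches the wget commands above.
model = build_sam2("sam2_hiera_l.yaml", "weights/sam2/sam2_hiera_large.pt")
predictor = SAM2ImagePredictor(model)

img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(img)

# In the real pipeline this box would come from Grounding DINO (xyxy pixels).
masks, scores, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
print(masks.shape, scores)
```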
```python
from ai_photoshop_pipeline import AiPhotoshop

ps = AiPhotoshop("photo.jpg")

# Instant — no GPU
ps.edit("increase brightness by 10")
ps.edit("boost contrast 25")
ps.edit("shift hue +15 degrees, saturation +20")
ps.edit("add subtle film grain")
ps.edit("soft vignette, strength 0.4")
ps.save("output.jpg")
```

```python
# Fast (Zero123++ novel view synthesis, ~30s)
ps.edit("rotate the chair 45 degrees to the left")
ps.edit("tilt the bottle 20 degrees forward")
# High quality (TripoSR mesh, ~60s)
ps.edit("rotate the sculpture 60 degrees left, use triposr for quality")
```

```python
ps.edit("flip the car horizontally")
ps.edit("scale the dog up by 30%")
ps.edit("move the coffee cup 100 pixels to the right")ps.edit("remove the person standing on the right")
ps.edit("replace the sky with a dramatic stormy sky")
ps.edit("make it look like a watercolor painting")ps.edit("duplicate this layer")
ps.edit("gaussian blur sigma 10")
ps.edit("set blend mode to screen")
ps.edit("set opacity 40%")
ps.show_layers()
# ID Name Blend Opacity Visible
# ▶a3f2 Layer copy screen 40% True
# b7c1 Background normal 100% True
ps.edit("merge visible layers")
ps.save("final.jpg")ai-photoshop/
├── ai_photoshop_pipeline.py — main pipeline (all engines in one file)
├── requirements.txt — pip dependencies
├── pyproject.toml — packaging metadata
├── .gitignore
├── examples/
│ ├── basic_edits.py — adjustments, filters, 2D geometry (no GPU)
│ ├── 3d_editing.py — 3D rotation and object-aware 2D transforms
│ └── layer_workflow.py — blend modes, opacity, non-destructive editing
├── weights/
│ ├── README.md — download instructions for all models
│ ├── sam2/ — SAM2 weights (manual download)
│ └── depth/ — Depth Anything V2 weights (manual download)
└── assets/
├── pipeline_diagram.svg — routing architecture (shown in this README)
└── README.md — instructions for adding before/after images
| `op_type` | Backend | GPU | Approx. time |
|---|---|---|---|
| `whole_image_geometric` | GeometricEngine (OpenCV) | No | <1s |
| `object_2d_geometric` | SegmentationEngine + GeometricEngine + DiffusionEngine | Yes | ~10s |
| `object_3d_geometric` | SegmentationEngine + ThreeDEngine (Zero123++ or TripoSR) | Yes | ~30–60s |
| `adjustment` | AdjustmentEngine (NumPy/PIL) | No | <1s |
| `filter` | FilterEngine (OpenCV/PIL) | No | <1s |
| `generative` | DiffusionEngine (FLUX.1-dev) | Yes | ~20s |
| `semantic_edit` | DiffusionEngine (FLUX.1-dev img2img) | Yes | ~20s |
| `layer_operation` | LayerStack | No | <1s |
All models are lazily loaded — if you only use adjustments and 2D geometry, no diffusion or segmentation weights are ever loaded into memory.
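A sketch of that lazy-loading pattern (illustrative, not the file's actual code):

```python
from functools import cached_property

class EngineRegistry:
    """Each heavy model is built on first access, never at import time."""

    @cached_property
    def diffusion(self):
        import torch                      # deferred: CPU-only edits never import torch
        from diffusers import FluxInpaintPipeline
        return FluxInpaintPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")

engines = EngineRegistry()
# Adjustments and 2D geometry never touch `engines.diffusion`,
# so FLUX weights are only loaded the first time a generative edit runs.
```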
Related work:

- GeoDiffuser (WACV 2025) — geometric transforms baked into diffusion attention layers
- FreeFine (2025) — decoupled geometric editing pipeline with GeoBench
- GeoEdit (2026) — Effects-Sensitive Attention for lighting-aware geometric edits
- DiT4Edit (2024) — first DiT-based image editing framework
The key difference from those papers: this pipeline never asks a diffusion model to perform geometric math. Affine transforms are always OpenCV (exact, deterministic, sub-millisecond). Diffusion handles only what it's actually good at.
| Model | Authors | License | Purpose |
|---|---|---|---|
| FLUX.1-dev | Black Forest Labs | Non-commercial | Inpainting, generative, style transfer |
| SAM2 | Meta AI | Apache 2.0 | Object segmentation |
| Grounding DINO | IDEA Research | Apache 2.0 | Text-to-bounding-box detection |
| Depth Anything V2 | University of Hong Kong | Apache 2.0 | Monocular depth estimation |
| Zero123++ | Sudo AI | Apache 2.0 | Novel view synthesis (fast 3D) |
| TripoSR | Stability AI + VAST AI | MIT | 3D mesh reconstruction |
| Claude | Anthropic | Commercial API | Intent parsing |
This project is released under the MIT License.
Model weights carry their own licenses — see the table above. FLUX.1-dev in particular is non-commercial only. If you need a commercial deployment, either replace the FLUX components with a commercially licensed model (e.g. FLUX.1-schnell, Stable Diffusion 3.5) or obtain a license from Black Forest Labs.