AI Photoshop

A natural language image editing pipeline with Photoshop-style adjustments, filters, layers, and 3D object manipulation — powered by Claude (intent parsing), SAM2 (segmentation), Depth Anything V2 (depth estimation), Zero123++ / TripoSR (3D novel view synthesis), and FLUX.1-dev (diffusion).

You describe what you want in plain English. The system figures out which model to use.

ps = AiPhotoshop("photo.jpg")

ps.edit("boost contrast 20, add warm white balance")   # instant, no GPU
ps.edit("rotate the chair 45 degrees to the left")     # 3D object edit, ~30s
ps.edit("remove the person on the right")              # FLUX inpaint, ~20s
ps.edit("set blend mode to soft light, opacity 60%")   # layer op, instant

ps.save("output.jpg")

⚠️ FLUX.1-dev license notice: This pipeline uses FLUX.1-dev for inpainting and generative edits. FLUX.1-dev is released under a non-commercial license — it may not be used for commercial purposes without a separate agreement with Black Forest Labs. You must accept the license at huggingface.co/black-forest-labs/FLUX.1-dev before downloading weights. The other models in this stack are permissively licensed (SAM2, Zero123++, and Grounding DINO: Apache 2.0; TripoSR: MIT), with one caveat: the Depth Anything V2 Large checkpoint used here is CC BY-NC 4.0 (only the Small checkpoint is Apache 2.0).


Architecture

Pipeline routing diagram (see assets/pipeline_diagram.svg)

Every prompt goes through an LLM intent parser (Claude) that extracts a structured operation — op type, target object, and parameters — then routes to the cheapest backend that can do the job correctly.

prompt
  └─► Claude intent parser → {op_type, operation, target_object, params}
        ├─ whole_image_geometric  → OpenCV              (instant, no GPU)
        ├─ object_2d_geometric    → SAM2 + OpenCV + FLUX (~10s)
        ├─ object_3d_geometric    → SAM2 + Depth + Zero123++/TripoSR + FLUX (~30–60s)
        ├─ adjustment             → NumPy / PIL          (instant, no GPU)
        ├─ filter                 → OpenCV / PIL         (instant, no GPU)
        ├─ generative/semantic    → FLUX.1-dev           (~20s)
        └─ layer_operation        → LayerStack           (instant)
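
For illustration, a minimal sketch of the parsing step, assuming the anthropic Python SDK (the prompt, schema, and model choice here are hypothetical; the real parser lives in ai_photoshop_pipeline.py):

import json

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

def parse_intent(prompt: str) -> dict:
    """Ask Claude to turn a free-form edit request into a structured routing record."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # hypothetical model choice
        max_tokens=256,
        system=(
            "Convert the user's image-editing request into JSON with keys "
            "op_type, operation, target_object, params. Reply with JSON only."
        ),
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text)

# parse_intent("rotate the chair 45 degrees to the left") might return:
# {"op_type": "object_3d_geometric", "operation": "rotate",
#  "target_object": "chair", "params": {"yaw": -45}}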

Core design principle: geometric math is never delegated to a diffusion model. Rotation, scaling, and translation use OpenCV for exact pixel-level results. Diffusion handles only what it's actually good at — synthesizing novel views, filling the background holes left by moved or removed objects, and semantic edits.
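
For example, a whole-image rotation is a pure OpenCV affine warp (a sketch, not the pipeline's exact code):

import cv2
import numpy as np

def rotate_image(img: np.ndarray, degrees: float) -> np.ndarray:
    """Exact, deterministic rotation about the image center; no diffusion involved."""
    h, w = img.shape[:2]
    # Positive angles are counter-clockwise in OpenCV's convention.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), degrees, 1.0)
    return cv2.warpAffine(img, m, (w, h), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REFLECT)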

3D object editing pipeline

Input image + "rotate the chair 45 degrees left"
  │
  ├─ SAM2 (Grounding DINO → bounding box → precise mask)
  ├─ Depth Anything V2 (metric depth map for scene context)
  │
  ├─ [fast]    Zero123++ — diffusion novel view synthesis at target (pitch, yaw, roll)
  ├─ [quality] TripoSR   — 3D mesh reconstruction → render from new camera angle
  │
  ├─ FLUX.1 inpaint — fill the background hole left by the moved object
  └─ Alpha composite — place rotated object onto inpainted background

Use zero123 (the default, ~30s) for speed, or triposr (~60s) for mesh-accurate results:

ps.edit("rotate the sculpture 60 degrees yaw, use triposr for quality")

Example outputs

Place your own before/after images in assets/ after running the examples. See assets/README.md for instructions.

Operation         Before                          After
3D yaw rotation   assets/demo_3d_before.jpg       assets/demo_3d_after.jpg
Object removal    assets/demo_remove_before.jpg   assets/demo_remove_after.jpg

To generate your own:

# 3D rotation demo
python examples/3d_editing.py --input photo.jpg --object "the chair" --output_dir outputs/3d/

# Retouching and filters
python examples/basic_edits.py --input photo.jpg --output_dir outputs/basic/

# Layer workflow
python examples/layer_workflow.py --input photo.jpg --output_dir outputs/layers/

Capabilities

Geometric transforms

Prompt                                          Backend                           Speed
"rotate 15 degrees clockwise"                   OpenCV                            instant
"flip horizontally"                             OpenCV                            instant
"zoom in 1.5x"                                  OpenCV                            instant
"crop to center square"                         OpenCV                            instant
"perspective correct the sign"                  OpenCV                            instant
"flip the car horizontally"                     SAM2 + OpenCV + FLUX              ~10s
"rotate the chair 45 degrees left"              SAM2 + Depth + Zero123++ + FLUX   ~30s
"rotate the sculpture 60° left, high quality"   SAM2 + Depth + TripoSR + FLUX     ~60s

Tonal adjustments (instant, no GPU)

  • brightness, contrast, sharpness
  • curves — per-channel (R, G, B, or composite), arbitrary control points
  • levels — black point, white point, gamma, output range
  • hsl — hue shift, saturation, lightness
  • vibrance — saturation boost weighted toward less-saturated pixels (see the sketch after this list)
  • white_balance — temperature and tint
  • shadows_highlights — recover shadow/highlight detail independently
  • vignette — strength, feather, midpoint
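
As an illustration of how these no-GPU adjustments work, here is a minimal vibrance sketch (illustrative, not the pipeline's exact math): the saturation boost is scaled by how unsaturated each pixel already is.

import cv2
import numpy as np

def vibrance(img_bgr: np.ndarray, amount: float = 0.3) -> np.ndarray:
    """Boost saturation, weighting the boost toward less-saturated pixels."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    s = hsv[..., 1] / 255.0
    # (1 - s) scales the boost: vivid pixels barely move, muted ones move most.
    s = s + amount * (1.0 - s) * s
    hsv[..., 1] = np.clip(s, 0.0, 1.0) * 255.0
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)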

Filters (instant, no GPU)

  • gaussian_blur, lens_blur, motion_blur
  • bilateral_denoise — edge-preserving noise reduction
  • unsharp_mask — amount, radius, threshold (sketched below)
  • grain — simulated film grain
  • emboss, posterize, pixelate
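
A sketch of the unsharp mask filter and its three parameters (illustrative OpenCV/NumPy, matching the amount/radius/threshold knobs above):

import cv2
import numpy as np

def unsharp_mask(img: np.ndarray, amount: float = 1.0,
                 radius: float = 2.0, threshold: int = 0) -> np.ndarray:
    """Sharpen by adding back the difference between the image and a Gaussian blur."""
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=radius)  # kernel size derived from sigma
    diff = img.astype(np.float32) - blurred.astype(np.float32)
    if threshold > 0:
        diff[np.abs(diff) < threshold] = 0.0  # leave near-flat regions untouched
    return (img.astype(np.float32) + amount * diff).clip(0, 255).astype(np.uint8)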

Generative operations (FLUX.1-dev, ~20s)

  • inpaint — fill a region described in natural language (sketched after this list)
  • remove_object — erase and fill background
  • replace_background — swap the entire background
  • add_object — add a new element to the scene
  • text_to_image — generate from scratch
  • style_transfer, change_season, change_weather, colorize, age
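
A sketch of the inpainting call, assuming diffusers' FluxInpaintPipeline (the repo's DiffusionEngine may wrap a different interface; requires license acceptance and huggingface-cli login first):

import torch
from diffusers import FluxInpaintPipeline
from PIL import Image

pipe = FluxInpaintPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("photo.jpg").convert("RGB")
mask = Image.open("person_mask.png").convert("L")  # white = region to repaint (e.g. from SAM2)

result = pipe(
    prompt="empty sidewalk, natural background",
    image=image,
    mask_image=mask,
    strength=0.85,
    guidance_scale=3.5,
).images[0]
result.save("removed.jpg")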

Layer system

  • Named layers with per-layer opacity and visibility
  • Blend modes: Normal, Multiply, Screen, Overlay, Soft Light, Hard Light, Darken, Lighten, Difference, Hue, Saturation, Luminosity (Soft Light is sketched below)
  • Per-layer masks (uint8 alpha)
  • Duplicate, delete, reorder, merge visible, flatten
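
Blend modes are per-pixel formulas over layers normalized to [0, 1]. A sketch of Soft Light using the common Pegtop variant (the pipeline's exact formula may differ):

import numpy as np

def soft_light(base: np.ndarray, top: np.ndarray, opacity: float = 1.0) -> np.ndarray:
    """Soft Light blend (Pegtop variant) with per-layer opacity."""
    b = base.astype(np.float32) / 255.0
    t = top.astype(np.float32) / 255.0
    blended = (1.0 - 2.0 * t) * b * b + 2.0 * t * b
    out = (1.0 - opacity) * b + opacity * blended  # opacity mixes the blend over the base
    return (out * 255.0).clip(0, 255).astype(np.uint8)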

Installation

Requirements

  • Python 3.10+
  • CUDA GPU with 16 GB+ VRAM recommended (24 GB for TripoSR + FLUX simultaneously)
  • An Anthropic API key for intent parsing

1 — Clone and install core dependencies

git clone https://github.com/lfopensource/ai-photoshop.git
cd ai-photoshop
pip install -r requirements.txt

Or install as a package:

pip install -e .

2 — SAM2

pip install git+https://github.com/facebookresearch/segment-anything-2.git

# Download weights
wget -P weights/sam2 https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_large.pt
wget -P weights/sam2 https://dl.fbaipublicfiles.com/segment_anything_2/sam2_hiera_large.yaml

3 — Depth Anything V2

pip install depth-anything-v2

wget -P weights/depth \
  https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth
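
To sanity-check the download, the weights load as in the upstream repo (a sketch, assuming the package mirrors the official Depth Anything V2 API):

import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

model = DepthAnythingV2(encoder="vitl", features=256,
                        out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load("weights/depth/depth_anything_v2_vitl.pth",
                                 map_location="cpu"))
model = model.to("cuda").eval()

depth = model.infer_image(cv2.imread("photo.jpg"))  # HxW float32 depth map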

4 — FLUX.1-dev

⚠️ Accept the non-commercial license at huggingface.co/black-forest-labs/FLUX.1-dev first.

huggingface-cli login    # authenticate — weights download automatically on first use (~24 GB)

5 — TripoSR (optional, for high-quality 3D)

pip install git+https://github.com/VAST-AI-Research/TripoSR.git

6 — API key

export ANTHROPIC_API_KEY=sk-ant-...

See weights/README.md for full download instructions, all model sizes, and licensing details.


Usage

Basic session

from ai_photoshop_pipeline import AiPhotoshop

ps = AiPhotoshop("photo.jpg")

# Instant — no GPU
ps.edit("increase brightness by 10")
ps.edit("boost contrast 25")
ps.edit("shift hue +15 degrees, saturation +20")
ps.edit("add subtle film grain")
ps.edit("soft vignette, strength 0.4")

ps.save("output.jpg")

3D object manipulation

# Fast (Zero123++ novel view synthesis, ~30s)
ps.edit("rotate the chair 45 degrees to the left")
ps.edit("tilt the bottle 20 degrees forward")

# High quality (TripoSR mesh, ~60s)
ps.edit("rotate the sculpture 60 degrees left, use triposr for quality")

Object-aware 2D geometry

ps.edit("flip the car horizontally")
ps.edit("scale the dog up by 30%")
ps.edit("move the coffee cup 100 pixels to the right")

Generative edits

ps.edit("remove the person standing on the right")
ps.edit("replace the sky with a dramatic stormy sky")
ps.edit("make it look like a watercolor painting")

Layer workflow

ps.edit("duplicate this layer")
ps.edit("gaussian blur sigma 10")
ps.edit("set blend mode to screen")
ps.edit("set opacity 40%")

ps.show_layers()
# ID         Name                 Blend          Opacity  Visible
# ▶a3f2      Layer copy           screen         40%      True
#  b7c1      Background           normal         100%     True

ps.edit("merge visible layers")
ps.save("final.jpg")

Repo structure

ai-photoshop/
├── ai_photoshop_pipeline.py   — main pipeline (all engines in one file)
├── requirements.txt           — pip dependencies
├── pyproject.toml             — packaging metadata
├── .gitignore
├── examples/
│   ├── basic_edits.py         — adjustments, filters, 2D geometry (no GPU)
│   ├── 3d_editing.py          — 3D rotation and object-aware 2D transforms
│   └── layer_workflow.py      — blend modes, opacity, non-destructive editing
├── weights/
│   ├── README.md              — download instructions for all models
│   ├── sam2/                  — SAM2 weights (manual download)
│   └── depth/                 — Depth Anything V2 weights (manual download)
└── assets/
    ├── pipeline_diagram.svg   — routing architecture (shown in this README)
    └── README.md              — instructions for adding before/after images

Engine routing table

op_type                 Backend                                                    GPU   Approx. time
whole_image_geometric   GeometricEngine (OpenCV)                                   No    <1s
object_2d_geometric     SegmentationEngine + GeometricEngine + DiffusionEngine     Yes   ~10s
object_3d_geometric     SegmentationEngine + ThreeDEngine (Zero123++ or TripoSR)   Yes   ~30–60s
adjustment              AdjustmentEngine (NumPy/PIL)                               No    <1s
filter                  FilterEngine (OpenCV/PIL)                                  No    <1s
generative              DiffusionEngine (FLUX.1-dev)                               Yes   ~20s
semantic_edit           DiffusionEngine (FLUX.1-dev img2img)                       Yes   ~20s
layer_operation         LayerStack                                                 No    <1s

All models are lazily loaded — if you only use adjustments and 2D geometry, no diffusion or segmentation weights are ever loaded into memory.
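
A sketch of that lazy-loading pattern (illustrative; the engine attribute name below is hypothetical):

from functools import cached_property

class Pipeline:
    @cached_property
    def diffusion_engine(self):
        # The heavy import and ~24 GB weight load happen only on first access,
        # so adjustment-only sessions never touch the GPU.
        import torch
        from diffusers import FluxInpaintPipeline
        return FluxInpaintPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")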


Related work

  • GeoDiffuser (WACV 2025) — geometric transforms baked into diffusion attention layers
  • FreeFine (2025) — decoupled geometric editing pipeline with GeoBench
  • GeoEdit (2026) — Effects-Sensitive Attention for lighting-aware geometric edits
  • DiT4Edit (2024) — first DiT-based image editing framework

The key difference from those papers: this pipeline never asks a diffusion model to perform geometric math. Affine transforms are always OpenCV (exact, deterministic, sub-millisecond). Diffusion handles only what it's actually good at.


Model credits

Model               Authors                   License                Purpose
FLUX.1-dev          Black Forest Labs         Non-commercial         Inpainting, generative edits, style transfer
SAM2                Meta AI                   Apache 2.0             Object segmentation
Grounding DINO      IDEA Research             Apache 2.0             Text-to-bounding-box detection
Depth Anything V2   University of Hong Kong   CC BY-NC 4.0 (Large)   Monocular depth estimation
Zero123++           Sudo AI                   Apache 2.0             Novel view synthesis (fast 3D)
TripoSR             Stability AI + VAST AI    MIT                    3D mesh reconstruction
Claude              Anthropic                 Commercial API         Intent parsing

License

This project is released under the MIT License.

Model weights carry their own licenses — see the table above. FLUX.1-dev and the Depth Anything V2 Large checkpoint are non-commercial. For a commercial deployment, replace the FLUX components with a commercially licensed model (e.g. FLUX.1-schnell, Stable Diffusion 3.5) or obtain a license from Black Forest Labs, and swap in the Apache-licensed Depth Anything V2 Small checkpoint.
