This project implements a fashion runway search engine that allows users to search for runway images using text queries (e.g., "dress with floral pattern"). The system leverages the CLIP model for semantic text-to-image matching and FAISS for efficient similarity search. Background-removed images are used for feature extraction to focus on clothing, while original runway images are displayed in the results.

The search functionality combines CLIP (Contrastive Language-Image Pretraining) and FAISS (Facebook AI Similarity Search) to deliver fast and accurate results. Here's how it works:
- Image Feature Extraction:
  - Images with removed backgrounds are processed using the CLIP model (`CLIP-ViT-L-14-laion2B-s32B-b82K`).
  - Each image is converted to a 768-dimensional feature embedding, normalized for consistent similarity comparisons.
  - Features are extracted once and cached (`features.npy`, `paths.pkl`) to avoid redundant computation.
- FAISS Indexing:
  - The extracted image features are added to a FAISS `IndexFlatL2` index, which performs exact nearest-neighbor search by L2 distance.
  - The index enables fast retrieval of the most similar images to a given query, even with large datasets.
- Query Processing:
  - A user submits a text query via the FastAPI `/search` endpoint.
  - The query is encoded into a 768-dimensional CLIP embedding using the CLIP model's text encoder and normalized.
- Similarity Search:
  - The query embedding is searched against the FAISS index to find the top `k` closest image embeddings.
  - FAISS returns the indices of the matching images, which are mapped to their corresponding image paths.
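- Result Delivery:
  - The system returns URLs to the original runway images, which are displayed in the results instead of the background-removed versions.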
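
The retrieval steps above can be sketched without any dependencies. This is a minimal illustration, not the project's code: toy 3-dimensional vectors stand in for the 768-dimensional CLIP embeddings, an exhaustive scan over squared L2 distances stands in for the FAISS `IndexFlatL2` index (which performs the same kind of exact search), and the file names and the `search` helper are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query_vec, image_vecs, image_paths, k=2):
    """Exhaustive top-k search: return (path, distance) pairs, nearest first."""
    q = normalize(query_vec)
    scored = sorted((squared_l2(q, vec), i) for i, vec in enumerate(image_vecs))
    # Map each matching position back to its image path.
    return [(image_paths[i], dist) for dist, i in scored[:k]]


# Hypothetical file names; positions must line up with image_vecs.
image_paths = ["look_01.jpg", "look_02.jpg", "look_03.jpg"]
# Toy stand-ins for normalized image embeddings.
image_vecs = [
    normalize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0])
]

# Toy stand-in for an encoded text query.
results = search([0.9, 0.1, 0.0], image_vecs, image_paths, k=2)
for path, dist in results:
    print(f"{path}: squared L2 distance {dist:.4f}")
```

Because matches are identified only by their position, the cached paths presumably need to stay in the same order as the cached features for this mapping to hold.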

Key components:
- CLIP: Encodes text and images into a shared embedding space, enabling semantic matching between natural language queries and visual content.
- FAISS: Provides highly efficient similarity search, making it scalable for large image collections.
- Background Removal: Using background-removed images for feature extraction ensures the model focuses on clothing, avoiding irrelevant matches based on runway backgrounds.
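
One detail worth spelling out is why normalizing the embeddings matters when the index ranks by L2 distance. For unit-length vectors, the squared L2 distance equals 2 minus twice the dot product, so ranking by smallest L2 distance gives the same order as ranking by highest cosine similarity. The short check below illustrates this on made-up 3-dimensional vectors; the values and helper names are assumptions chosen for illustration only.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors standing in for a query embedding and image embeddings.
query = normalize([0.2, 0.9, 0.4])
images = [
    normalize(v) for v in ([0.1, 1.0, 0.3], [1.0, 0.1, 0.0], [0.5, 0.5, 0.5])
]

# Rank image positions by ascending L2 distance and by descending cosine.
by_l2 = sorted(range(len(images)), key=lambda i: squared_l2(query, images[i]))
by_cosine = sorted(range(len(images)), key=lambda i: -cosine(query, images[i]))

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
gaps = [abs(squared_l2(query, v) - (2 - 2 * cosine(query, v))) for v in images]

print("same ranking:", by_l2 == by_cosine)
print("largest deviation from the identity:", max(gaps))
```

Without normalization the two rankings can differ, since L2 distance then also depends on vector length.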

Example:
- Query: "fur coat"
- Process:
  - CLIP encodes the query into a feature vector.
  - FAISS finds the images with the closest feature vectors.
  - The system returns URLs to the original runway images.
- Result: A list of runway images featuring fur coats.
