Semantify

[Image caption: It might look like we have many duplicates of the same file, but each label is a path to a section of a given file.]
[Image caption: Semantic search and RAG queries.]
[Image caption: Example of what the organized folders look like. The input was a single directory containing all the files.]

Project Title: Semantify

Overview

Semantify is an intelligent file organizer and hierarchical vector database designed to bring order to unstructured text documents. Users can upload disorganized files, which Semantify then categorizes based on their semantic content.

Through an interactive web interface, users can visualize their data as a vector database. To streamline document retrieval, the platform features a Retrieval-Augmented Generation (RAG) chat agent, which not only answers queries but also guides users to relevant document clusters within the visualized database.

Inspiration

Finding relevant documents in large, unstructured datasets, such as legal document reviews or corporate archives, can be time-consuming and costly. Many users, like Rickey (who keeps all his files on his desktop), struggle with file organization, which makes search and retrieval inefficient.

Manual sorting is impractical for large-scale datasets, especially in legal proceedings where document dumps can span thousands of files. Semantify addresses this challenge by automatically structuring files into meaningful directories and providing an advanced semantic search mechanism.

Key Features

Semantic File Organization
- Automatically clusters and categorizes files based on their content.
- Uses hierarchical clustering and topic modeling to create structured, easy-to-navigate directories.
- Supports recursive clustering to handle large and complex document collections.
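
The clustering steps above can be illustrated with a minimal, dependency-free sketch. This is not Semantify's actual code: bag-of-words counts stand in for Sentence Transformers embeddings, a small average-linkage loop stands in for Agglomerative Clustering / HDBSCAN, and the most frequent word stands in for a KeyBERT label. The function names (`embed`, `agglomerate`, `organize`) and the `threshold`, `step`, and `max_size` parameters are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(names, vectors, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once no pair
    has an average similarity of at least `threshold`.
    """
    clusters = [[name] for name in names]

    def similarity(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


def label(cluster, vectors):
    """Toy topic label: the most frequent word across the cluster."""
    counts = Counter()
    for name in cluster:
        counts.update(vectors[name])
    return counts.most_common(1)[0][0] if counts else "misc"


def organize(names, vectors, threshold=0.2, step=0.3, max_size=2):
    """Build a nested folder tree: label -> sub-tree (dict) or file list."""
    tree = {}
    for cluster in agglomerate(names, vectors, threshold):
        base = label(cluster, vectors)
        key, n = base, 2
        while key in tree:  # keep folder names unique within one level
            key, n = f"{base}_{n}", n + 1
        if len(cluster) > max_size and threshold + step < 1.0:
            # Large cluster: recurse with a stricter similarity threshold.
            tree[key] = organize(cluster, vectors, threshold + step, step, max_size)
        else:
            tree[key] = sorted(cluster)
    return tree


# Hypothetical input: four files that all sat in one directory.
docs = {
    "contract_a.txt": "contract lease tenant landlord contract",
    "contract_b.txt": "contract lease rent landlord contract",
    "recipe_a.txt": "recipe pasta tomato basil recipe",
    "recipe_b.txt": "recipe pasta garlic basil recipe",
}
vectors = {name: embed(text) for name, text in docs.items()}
tree = organize(list(docs), vectors)
```

Here `tree` maps a folder label to either a list of files or a nested dictionary of subfolders. Recursion only kicks in for clusters larger than `max_size`, which is one simple way to keep big collections navigable.
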

Visual Vector Database
- Displays document embeddings in a 2D space using UMAP for an intuitive, interactive visualization.
- Color-codes documents by cluster labels, making it easy to explore semantic relationships.
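
As a rough sketch of the color-coding step, the snippet below pairs 2D coordinates with a color per cluster label and serializes the result for a frontend. It assumes the coordinates were already computed elsewhere (in a real pipeline, presumably by UMAP, for example umap-learn's `UMAP(n_components=2).fit_transform(embeddings)`). The `to_plot_points` helper, the palette, the field names, and the coordinate values are all hypothetical.

```python
import json

# Hypothetical palette; any list of CSS colors would do.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def to_plot_points(names, coords, labels):
    """Pair each document's 2D coordinates with a color for its cluster."""
    color_of = {}
    points = []
    for name, (x, y), cluster in zip(names, coords, labels):
        # Assign palette colors in order of first appearance, cycling if needed.
        color = color_of.setdefault(cluster, PALETTE[len(color_of) % len(PALETTE)])
        points.append(
            {"name": name, "x": x, "y": y, "cluster": cluster, "color": color}
        )
    return points


names = ["lease.txt", "nda.txt", "pasta.txt"]
coords = [(0.1, 0.2), (0.3, 0.1), (4.0, 3.5)]  # placeholder values, not real UMAP output
labels = ["contract", "contract", "recipe"]

points = to_plot_points(names, coords, labels)
payload = json.dumps(points)  # what a backend endpoint might return to the UI
```

Documents in the same cluster share a color, so nearby points with matching colors indicate that the 2D projection and the clustering agree.
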

RAG-Powered Chat Assistant
- Allows users to query documents using natural language.
- Retrieves the most relevant document sections and generates responses using a language model (e.g., DeepSeek).
- Highlights source files in the vector database to ensure transparency and traceability.
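
The retrieve-then-generate flow described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: bag-of-words cosine similarity stands in for embedding search, the section labels follow a made-up `path#section` convention, and `generate` is a hypothetical callable that would wrap whatever language model is used (for example one served through Ollama).

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, sections, k=2):
    """Return up to k (label, score) pairs, best match first.

    Sections with zero similarity to the query are dropped.
    """
    q = embed(query)
    scored = [(cosine(q, embed(text)), label) for label, text in sections.items()]
    scored = [(score, label) for score, label in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(label, score) for score, label in scored[:k]]


def answer(query, sections, generate, k=2):
    """Retrieve context, prompt the model, and report which sources were used."""
    hits = retrieve(query, sections, k)
    context = "\n\n".join(f"[{label}]\n{sections[label]}" for label, _ in hits)
    prompt = (
        "Answer using only the context below and cite the bracketed sources.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    # The returned source labels are what a UI could highlight in the plot.
    return {"answer": generate(prompt), "sources": [label for label, _ in hits]}


# Hypothetical document sections, keyed by "file path#section".
sections = {
    "legal/lease.txt#termination": "the tenant may terminate the lease with sixty days notice",
    "legal/lease.txt#rent": "rent is due on the first day of each month",
    "recipes/pasta.txt#sauce": "simmer tomato and basil for the pasta sauce",
}


def fake_llm(prompt):
    """Stub model so the sketch runs without any LLM backend."""
    return "stub answer"


result = answer("tenant terminate lease notice", sections, fake_llm)
```

Returning the source labels alongside the generated text is what makes the highlighting step possible: the frontend only needs the labels to locate the matching points in the visualization.
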

How We Built It

Semantify combines NLP models and libraries with a web stack, including:

Sentence Transformers – For generating high-quality document embeddings.
UMAP/t-SNE – For dimensionality reduction and visualization.
KeyBERT – For extracting keywords and generating topic labels.
Agglomerative Clustering / HDBSCAN – For hierarchical document organization.
FastAPI – For building a scalable backend API.
React – For an interactive and user-friendly frontend.
Ollama / DeepSeek – For AI-powered document retrieval and response generation.

What’s Next for Semantify?

Enhanced User Interface – Improving the UX with in-browser document previews and highlighting relevant sections after a semantic search.
Cloud Integration – Allowing users to organize and search documents stored on platforms like Google Drive or Dropbox.
Collaboration Features – Supporting multiple users to collaboratively organize and query shared document collections.

Try It Out

You can try Semantify by cloning the repository and following the setup instructions in the README. Start organizing, visualizing, and querying your documents today!

📂 GitHub Repository: Semantify Repo
🎥 Demo Video: Watch Here