An intelligent document semantic search and Q&A engine built with FastAPI, React, and Sentence Transformers. This application allows users to upload documents in multiple formats and perform semantic searches or ask natural language questions to find relevant content based on meaning rather than exact keyword matching.
- Multi-Mode Engine (v2):
  - Fast Mode: Uses `all-MiniLM-L6-v2` (384d) for rapid embedding and `ms-marco-MiniLM-L-6-v2` for efficient reranking. Ideal for standard hardware and quick queries.
  - Pro Mode: Leverages `ibm-granite/granite-embedding-english-r2` (1024d) and `granite-reranker` for professional-grade accuracy and high-fidelity results.
- Smart Q&A Mode (v2): Ask direct questions of your documents. The system uses a specialized extractive QA model (`deepset/tinyroberta-squad2`) to pinpoint exact answers and provide citations.
- Multi-format Support: Upload and process PDF, DOCX, TXT, MD, and CSV files.
- Hybrid Search Engine: Combines efficient vector retrieval (Bi-Encoder) with high-precision reranking (Cross-Encoder).
- Smart Highlighting: Pinpoints the exact sentence within a chunk that answers the query.
- FAISS Indexing: Fast and efficient similarity search using Facebook AI Similarity Search.
- Index Management: Tracks which mode was used to build each index, allowing seamless switching between different model configurations.
- Modern UI: React-based 3-column layout with optimized UX, mode selection, and persistent search history.
This application employs a state-of-the-art Retrieve & Rerank pipeline to deliver high-quality search results. Here is the breakdown for computer science practitioners:
Raw text extraction from PDFs often results in "hard wraps" (newlines in the middle of sentences). We implement a heuristic text cleaner (backend/main.py) that uses lookahead strategies to detect hyphenation and join broken lines, ensuring the embedding model receives coherent sentences rather than fragmented tokens.
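A minimal sketch of such a cleaner, assuming a regex-based heuristic (the actual logic in `backend/main.py` may differ):

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Heuristically undo hard wraps in text extracted from a PDF."""
    # Join a word hyphenated across a line break: "seman-\ntic" -> "semantic"
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Replace a lone newline inside a sentence with a space, but keep
    # blank lines (paragraph boundaries) intact.
    text = re.sub(r"(?<![.!?:\n])\n(?!\n)", " ", text)
    return text
```

The lookahead `(?=[a-z])` only de-hyphenates when the next line starts in lowercase, so genuine hyphenated compounds at line starts are left alone.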
We use a Bi-Encoder architecture to map both documents and queries into a shared vector space.
- Fast Mode: Uses `all-MiniLM-L6-v2` (384 dimensions).
- Pro Mode: Uses `ibm-granite/granite-embedding-english-r2` (1024 dimensions).
- Indexing: Document chunks are embedded and stored in a FAISS (Facebook AI Similarity Search) index for fast nearest neighbor lookup.
- Retrieval: For a user query $Q$, we retrieve the top $N$ candidates (where $N = K \times 5$) based on cosine similarity. This stage prioritizes Recall, ensuring the relevant content is likely in the candidate pool.
The top candidates from Stage 1 are passed to a Cross-Encoder.
- Fast Mode: Uses `cross-encoder/ms-marco-MiniLM-L-6-v2`. We apply a Sigmoid function to the raw logits to normalize scores to a [0, 1] probability range.
- Pro Mode: Uses `ibm-granite/granite-embedding-reranker-english-r2`.
- Mechanism: Unlike Bi-Encoders, which process inputs independently ($f(A) \cdot f(B)$), a Cross-Encoder processes the pair simultaneously ($f(A, B)$). This allows the model's self-attention layers to attend to the interaction between specific query tokens and document tokens, capturing nuanced semantic relationships that vector dot products miss.
- Result: We re-score the candidates and sort them to return the final top $K$ results. This stage prioritizes Precision.
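The normalization and final selection, sketched with made-up logits (in the app they would come from the cross-encoder's `predict` call on query-document pairs):

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Map raw logits into a [0, 1] probability range."""
    return 1.0 / (1.0 + np.exp(-x))

# Made-up raw reranker logits for three candidate chunks
logits = np.array([4.2, -1.3, 0.7])
probs = sigmoid(logits)

k = 2
top_k = np.argsort(-probs)[:k]   # indices of the K highest-scoring candidates
print(top_k)                     # -> [0 2]
```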
Optional: when Q&A Mode is enabled:
- The system first performs the Hybrid Search (Stage 1 & 2) to retrieve the top 3 most relevant chunks.
- These chunks are concatenated to form a context window.
- A lightweight extractive QA model (`deepset/tinyroberta-squad2`) processes the `(Question, Context)` pair.
- The model identifies the start and end tokens of the answer span within the text.
- The system returns the specific answer string along with citations (source file and section).
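The span-selection step above can be mimicked with made-up token logits (the real scores come from the QA model's start/end heads, invoked via the transformers question-answering pipeline):

```python
import numpy as np

# Token-level scores as an extractive QA head would emit them (made up here)
tokens = ["FAISS", "was", "built", "by", "Facebook", "AI", "Research"]
start_logits = np.array([0.1, 0.0, 0.2, 0.1, 3.0, 0.4, 0.3])
end_logits   = np.array([0.0, 0.1, 0.1, 0.2, 0.5, 1.0, 3.5])

start = int(np.argmax(start_logits))              # best span start
end = start + int(np.argmax(end_logits[start:]))  # best end at or after start
answer = " ".join(tokens[start:end + 1])
print(answer)                                     # -> Facebook AI Research
```

Constraining the end index to fall at or after the start is what keeps the extracted span well-formed.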
For standard search results, we perform Sentence-Level Granularity Analysis:
- The retrieval unit (chunk) is often 500+ characters.
- We split the chunk into sentences and compute the similarity of each sentence against the query.
- The highest-scoring sentence is highlighted as the "Golden Snippet," directing the user's attention immediately to the answer.
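The steps above can be sketched with stand-in 2-d vectors (in the app, each sentence would be embedded by the active bi-encoder):

```python
import re
import numpy as np

def split_sentences(chunk: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]

def golden_snippet(query_vec, sent_vecs, sentences):
    """Return the sentence whose embedding is most similar to the query."""
    sims = [
        float(query_vec @ v / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
        for v in sent_vecs
    ]
    return sentences[int(np.argmax(sims))]

chunk = "FAISS performs fast vector search. Bananas are yellow."
sentences = split_sentences(chunk)
vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # stand-in embeddings
query = np.array([0.9, 0.1])
print(golden_snippet(query, vecs, sentences))
```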
```mermaid
graph TD
    %% Styles
    classDef input fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef process fill:#fff9c4,stroke:#fbc02d,stroke-width:2px;
    classDef storage fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef decision fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px;
    classDef output fill:#ffebee,stroke:#c62828,stroke-width:2px;

    %% Nodes
    UserDocs["📄 User Documents<br/>(PDF, DOCX, TXT, MD)"]:::input
    DocProc["⚙️ DocumentProcessor<br/>Clean & Chunk Text"]:::process
    ModeSelect{"🛠️ Select Mode<br/>Fast / Pro?"}:::decision

    subgraph "Indexing Phase"
        EmbedModel["🧠 Embedding Model<br/>(MiniLM / Granite)"]:::process
        FAISS[("🗄️ FAISS Index<br/>Vector Store")]:::storage
        Cache["💾 Disk Cache<br/>(Pickle + Metadata)"]:::storage
    end

    UserQuery["🔍 User Query / Question"]:::input
    QueryEmbed["🧠 Encode Query"]:::process
    ANN["⚡ FAISS ANN Search<br/>Retrieve Top N Candidates"]:::process

    subgraph "Reranking Phase"
        CrossEnc["⚖️ Cross-Encoder<br/>(MS-MARCO / Granite)"]:::process
        ScoreNorm{"📉 Fast Mode?"}:::decision
        Sigmoid["Math: Sigmoid(x)"]:::process
        TopK["🏆 Select Top K Results"]:::process
    end

    QAMode{"❓ Q&A Mode?"}:::decision

    subgraph "Q&A Extraction"
        Context["📝 Build Context Window<br/>(Top 3 Chunks)"]:::process
        QAModel["🤖 QA Pipeline<br/>(TinyRoberta)"]:::process
        Answer["💬 Extracted Answer<br/>+ Citations"]:::output
    end

    SearchResults["📑 Ranked Snippets<br/>+ Smart Highlights"]:::output

    %% Flow
    UserDocs --> DocProc
    DocProc --> ModeSelect
    ModeSelect -->|Fast| EmbedModel
    ModeSelect -->|Pro| EmbedModel
    EmbedModel --> FAISS
    FAISS --> Cache
    UserQuery --> QueryEmbed
    QueryEmbed --> ANN
    FAISS -.-> ANN
    ANN --> CrossEnc
    CrossEnc --> ScoreNorm
    ScoreNorm -->|Yes| Sigmoid
    Sigmoid --> TopK
    ScoreNorm -->|No| TopK
    TopK --> QAMode
    QAMode -->|Yes| Context
    Context --> QAModel
    QAModel --> Answer
    QAMode -->|No| SearchResults
```
- Backend: FastAPI (Python) with Sentence Transformers, FAISS, and HuggingFace Pipelines.
- Frontend: React with Tailwind CSS.
- Proxy: Vite development server proxies API requests to backend.
- Python 3.10+
- Node.js 18+ and npm
- uv package manager (optional but recommended)
- Run the setup script:

  ```
  setup.bat
  ```

- Install Python dependencies:

  ```
  uv sync    # or: pip install -r requirements.txt
  ```

- Install Node.js dependencies:

  ```
  cd frontend
  npm install
  ```
- Start the backend server:

  ```
  run_backend.bat
  ```

- In a new terminal, start the frontend:

  ```
  run_frontend.bat
  ```

- Open your browser to `http://localhost:5173`.
- Start the backend server:

  ```
  ./run_backend.sh
  ```

- In a new terminal, start the frontend:

  ```
  ./run_frontend.sh
  ```

- Open your browser to `http://localhost:5173`.
- Select Mode: Toggle between Fast Mode and Pro Mode in the header.
- Upload Documents: Click "Choose Files" to upload one or more documents.
- Process Documents: Click "Process Documents" to create embeddings for the selected mode.
- Search: Enter your query in the search box.
- Ask Questions: Enable Q&A Mode to receive direct answers with citations instead of just snippets.
- Manage Indexes: Select different cached indexes from the sidebar. The UI indicates which mode each index was built with.
- Search History: View, pin, or remove previous searches.
- `GET /` - Root endpoint
- `POST /upload/` - Upload files
- `POST /process/` - Process documents and build index (supports `mode` parameter)
- `POST /search/` - Perform semantic search (supports `mode` parameter)
- `POST /answer/` - Perform extractive Q&A (supports `mode` parameter)
- `GET /cache/` - Get cached index information (includes mode info)
- `POST /load_cache/` - Load a specific cached index
- `GET /history/` - Get search history
- `POST /history/pin/` - Pin/unpin a search query
- `POST /history/remove/` - Remove a search from history
- `DELETE /cache/clear/` - Clear all cached indexes
- `GET /health/` - Health check
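A hypothetical client call against `POST /search/`. The payload field names (`query`, `mode`, `top_k`) are assumptions, not the confirmed schema; check the request models in `backend/main.py`:

```python
import json
from urllib import request

# Hypothetical payload; field names are assumptions about the backend schema.
payload = {"query": "vector databases", "mode": "fast", "top_k": 5}
req = request.Request(
    "http://localhost:8000/search/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the backend running on port 8000:
#   with request.urlopen(req) as resp:
#       print(json.load(resp))
```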
doc-semantic-search/
├── backend/
│ └── main.py # FastAPI backend with multi-mode logic
├── frontend/
│ ├── src/
│ │ └── App.jsx # Main React component
│ ├── package.json # Frontend dependencies
│ ├── vite.config.js # Vite configuration
│ └── index.html # Main HTML file
├── cache/ # Cached FAISS indexes
├── temp_uploads/ # Temporary uploaded files
├── setup.bat # Windows setup script
├── run_backend.bat # Windows backend runner
├── run_frontend.bat # Windows frontend runner
├── setup.sh # Linux setup script
├── run_backend.sh # Linux backend runner
├── run_frontend.sh # Linux frontend runner
├── pyproject.toml # Python dependencies
└── README.md # This file
Backend:

- `FastAPI`: Web framework
- `sentence-transformers`: For semantic embeddings and cross-encoders
- `transformers`: For Q&A pipeline
- `faiss-cpu`: For similarity search
- `PyPDF2`, `python-docx`: File text extraction

Frontend:

- `React 18`: UI library
- `Vite`: Build tool
- `Tailwind CSS`: Styling
- `Axios`: HTTP client
- `@heroicons/react`: Icons
- Scores > 100%: If you see this in Fast Mode, ensure you have the latest backend update, which applies Sigmoid normalization to the `ms-marco` logits.
- CORS Errors: Ensure both the backend (port 8000) and frontend (port 5173) are running.
- Model Downloads: First run of any mode will download models (~1GB+ for Pro mode). Ensure you have internet access.
GNU GENERAL PUBLIC LICENSE V3.0