amitportal/doc-semantic-search

Document Semantic Search & Smart Q&A

An intelligent document semantic search and Q&A engine built with FastAPI, React, and Sentence Transformers. This application allows users to upload documents in multiple formats and perform semantic searches or ask natural language questions to find relevant content based on meaning rather than exact keyword matching.

Features

  • Multi-Mode Engine (v2):
    • Fast Mode: Uses all-MiniLM-L6-v2 (384d) for rapid embedding and ms-marco-MiniLM-L-6-v2 for efficient reranking. Ideal for standard hardware and quick queries.
  • Pro Mode: Leverages ibm-granite/granite-embedding-english-r2 (1024d) and its matching Granite reranker for higher-accuracy, high-fidelity results.
  • Smart Q&A Mode (v2): Ask direct questions to your documents. The system uses a specialized extractive QA model (deepset/tinyroberta-squad2) to pinpoint exact answers and provide citations.
  • Multi-format Support: Upload and process PDF, DOCX, TXT, MD, and CSV files.
  • Hybrid Search Engine: Combines efficient vector retrieval (Bi-Encoder) with high-precision reranking (Cross-Encoder).
  • Smart Highlighting: Pinpoints the exact sentence within a chunk that answers the query.
  • FAISS Indexing: Fast and efficient similarity search using Facebook AI Similarity Search.
  • Index Management: Tracks which mode was used to build each index, allowing seamless switching between different model configurations.
  • Modern UI: React-based 3-column layout with optimized UX, mode selection, and persistent search history.

Technical Deep Dive: The Smart Search Algorithm

This application employs a two-stage Retrieve & Rerank pipeline to deliver high-quality search results. Here is a breakdown for practitioners:

1. The Challenge of PDF Text

Raw text extraction from PDFs often results in "hard wraps" (newlines in the middle of sentences). We implement a heuristic text cleaner (backend/main.py) that uses lookahead strategies to detect hyphenation and join broken lines, ensuring the embedding model receives coherent sentences rather than fragmented tokens.
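The actual heuristics live in backend/main.py; the following is a minimal, self-contained sketch of the idea, with illustrative rules rather than the project's exact implementation:

```python
import re

def clean_hard_wraps(raw: str) -> str:
    """Join lines that PDF extraction hard-wrapped mid-sentence."""
    lines = [ln.strip() for ln in raw.splitlines()]
    out = []
    for i, line in enumerate(lines):
        if not line:
            out.append("\n\n")  # blank line: real paragraph break
            continue
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if line.endswith("-") and nxt and nxt[0].islower():
            out.append(line[:-1])      # lookahead: de-hyphenate "seman-" + "tic"
        elif re.search(r"[.!?:]$", line) or not nxt:
            out.append(line + "\n")    # sentence-final punctuation: keep break
        else:
            out.append(line + " ")     # mid-sentence wrap: join with a space
    return "".join(out).strip()

raw = "This is a seman-\ntic search\nengine."
print(clean_hard_waps := clean_hard_wraps(raw))  # -> "This is a semantic search engine."
```

The payoff is that the embedding model sees whole sentences, which measurably improves similarity scores over fragmented input.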

2. Stage 1: Dense Retrieval (Recall)

We use a Bi-Encoder architecture to map both documents and queries into a shared vector space.

  • Fast Mode: Uses all-MiniLM-L6-v2 (384 dimensions).
  • Pro Mode: Uses ibm-granite/granite-embedding-english-r2 (1024 dimensions).
  • Indexing: Document chunks are embedded and stored in a FAISS (Facebook AI Similarity Search) index for fast approximate nearest-neighbor lookup.
  • Retrieval: For a user query $Q$, we retrieve the top $N$ candidates (where $N = K \times 5$) based on cosine similarity. This stage prioritizes Recall, ensuring the relevant content is likely in the candidate pool.
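FAISS does the heavy lifting in the app; in plain numpy, the Stage 1 math looks roughly like this (random vectors stand in for real MiniLM/Granite embeddings to keep the sketch self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 384, 3                       # Fast Mode dimensionality; K final results
N = K * 5                             # candidate pool size, as in the pipeline

# Stand-ins for embedded document chunks and a query. Unit-normalizing
# makes inner product equal cosine similarity (as with an inner-product index).
chunks = rng.normal(size=(100, DIM))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
query = rng.normal(size=DIM)
query /= np.linalg.norm(query)

scores = chunks @ query               # cosine similarity of query vs. every chunk
candidates = np.argsort(-scores)[:N]  # top N = K*5 candidate chunk indices
print(candidates.shape)               # (15,)
```

The oversized candidate pool (N = K × 5) is what lets Stage 1 optimize for recall: the reranker gets several chances to surface the truly relevant chunk.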

3. Stage 2: Cross-Encoder Reranking (Precision)

The top candidates from Stage 1 are passed to a Cross-Encoder.

  • Fast Mode: Uses cross-encoder/ms-marco-MiniLM-L-6-v2. We apply a Sigmoid function to the raw logits to normalize scores to a [0, 1] probability range.
  • Pro Mode: Uses ibm-granite/granite-embedding-reranker-english-r2.
  • Mechanism: Unlike Bi-Encoders which process inputs independently ($f(A) \cdot f(B)$), a Cross-Encoder processes the pair simultaneously ($f(A, B)$). This allows the model's self-attention layers to attend to the interaction between specific query tokens and document tokens, capturing nuanced semantic relationships that vector dot products miss.
  • Result: We re-score the candidates and sort them to return the final top $K$ results. This stage prioritizes Precision.
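The Fast Mode score normalization can be sketched as follows; the logits here are made up, whereas in the app they come from the cross-encoder's raw outputs:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    # Squashes raw reranker logits (roughly -10..+10 for ms-marco models)
    # into [0, 1] so the UI can display them as probabilities.
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([8.2, 1.4, -0.3, -6.5])  # hypothetical reranker outputs
probs = sigmoid(logits)
order = np.argsort(-probs)                  # rerank: highest probability first
print(np.round(probs, 3))
```

Note that sigmoid is monotonic, so it changes the displayed scores without changing the ranking; this is also the fix referenced under "Scores > 100%" in Troubleshooting.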

4. Stage 3: Extractive Q&A (v2)

Optional. When Q&A Mode is enabled:

  • The system first performs the Hybrid Search (Stage 1 & 2) to retrieve the top 3 most relevant chunks.
  • These chunks are concatenated to form a context window.
  • A lightweight extractive QA model (deepset/tinyroberta-squad2) processes the (Question, Context) pair.
  • The model identifies the start and end tokens of the answer span within the text.
  • The system returns the specific answer string along with citations (source file and section).
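The span-selection step can be illustrated with toy logits; in the real pipeline the per-token start/end scores come from the tinyroberta-squad2 model's QA head:

```python
import numpy as np

context_tokens = ["The", "index", "uses", "FAISS", "for", "fast", "search"]
# Hypothetical per-token scores from an extractive QA head for the
# question "What library powers similarity search?":
start_logits = np.array([0.1, 0.2, 0.1, 4.0, 0.1, 0.3, 0.2])
end_logits   = np.array([0.1, 0.1, 0.2, 3.5, 0.2, 0.1, 0.4])

start = int(np.argmax(start_logits))              # most likely answer start
end = int(np.argmax(end_logits[start:])) + start  # best end at or after start
answer = " ".join(context_tokens[start : end + 1])
print(answer)  # -> "FAISS"
```

Constraining the end position to fall at or after the start is what keeps the extracted span well-formed; production QA pipelines additionally cap the span length and handle "no answer" logits.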

5. Granular Answer Extraction

For standard search results, we perform Sentence-Level Granularity Analysis:

  • The retrieval unit (chunk) is often 500+ characters.
  • We split the chunk into sentences and compute the similarity of each sentence against the query.
  • The highest-scoring sentence is highlighted as the "Golden Snippet," directing the user's attention immediately to the answer.
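A minimal sketch of the per-sentence scoring, using bag-of-words cosine similarity as a stand-in for the bi-encoder embeddings the app actually uses:

```python
import math
import re
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Toy similarity: cosine over word counts (stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

chunk = ("The system supports many formats. FAISS provides fast vector search. "
         "Results are shown in a three-column layout.")
query = "How does vector search work?"

sentences = re.split(r"(?<=[.!?])\s+", chunk)
golden = max(sentences, key=lambda s: bow_cosine(s, query))
print(golden)  # -> "FAISS provides fast vector search."
```

Swapping bow_cosine for the active bi-encoder's similarity yields the "Golden Snippet" behavior described above: one sentence per chunk is highlighted as the most likely answer.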

System Process Flow

graph TD
    %% Styles
    classDef input fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef process fill:#fff9c4,stroke:#fbc02d,stroke-width:2px;
    classDef storage fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef decision fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px;
    classDef output fill:#ffebee,stroke:#c62828,stroke-width:2px;

    %% Nodes
    UserDocs["📄 User Documents<br/>(PDF, DOCX, TXT, MD)"]:::input
    DocProc["⚙️ DocumentProcessor<br/>Clean & Chunk Text"]:::process
    ModeSelect{"🛠️ Select Mode<br/>Fast / Pro?"}:::decision
    
    subgraph "Indexing Phase"
        EmbedModel["🧠 Embedding Model<br/>(MiniLM / Granite)"]:::process
        FAISS[("🗄️ FAISS Index<br/>Vector Store")]:::storage
        Cache["💾 Disk Cache<br/>(Pickle + Metadata)"]:::storage
    end

    UserQuery["🔍 User Query / Question"]:::input
    QueryEmbed["🧠 Encode Query"]:::process
    ANN["⚡ FAISS ANN Search<br/>Retrieve Top N Candidates"]:::process
    
    subgraph "Reranking Phase"
        CrossEnc["⚖️ Cross-Encoder<br/>(MS-MARCO / Granite)"]:::process
        ScoreNorm{"📉 Fast Mode?"}:::decision
        Sigmoid["Math: Sigmoid(x)"]:::process
        TopK["🏆 Select Top K Results"]:::process
    end

    QAMode{"❓ Q&A Mode?"}:::decision
    
    subgraph "Q&A Extraction"
        Context["📝 Build Context Window<br/>(Top 3 Chunks)"]:::process
        QAModel["🤖 QA Pipeline<br/>(TinyRoberta)"]:::process
        Answer["💬 Extracted Answer<br/>+ Citations"]:::output
    end
    
    SearchResults["📑 Ranked Snippets<br/>+ Smart Highlights"]:::output

    %% Flow
    UserDocs --> DocProc
    DocProc --> ModeSelect
    ModeSelect -->|Fast| EmbedModel
    ModeSelect -->|Pro| EmbedModel
    EmbedModel --> FAISS
    FAISS --> Cache
    
    UserQuery --> QueryEmbed
    QueryEmbed --> ANN
    FAISS -.-> ANN
    ANN --> CrossEnc
    CrossEnc --> ScoreNorm
    ScoreNorm -->|Yes| Sigmoid
    Sigmoid --> TopK
    ScoreNorm -->|No| TopK
    TopK --> QAMode
    
    QAMode -->|Yes| Context
    Context --> QAModel
    QAModel --> Answer
    
    QAMode -->|No| SearchResults

Architecture

  • Backend: FastAPI (Python) with Sentence Transformers, FAISS, and HuggingFace Pipelines.
  • Frontend: React with Tailwind CSS.
  • Proxy: The Vite development server proxies API requests to the backend.

Prerequisites

  • Python 3.10+
  • Node.js 18+ and npm
  • uv package manager (optional but recommended)

Installation and Setup

Quick Setup (Windows)

  1. Run the setup script:
    setup.bat

Manual Setup (All Platforms)

  1. Install Python dependencies:

    uv sync  # or pip install -r requirements.txt
  2. Install Node.js dependencies:

    cd frontend
    npm install

Running the Application

Windows

  1. Start the backend server:

    run_backend.bat
  2. In a new terminal, start the frontend:

    run_frontend.bat
  3. Open your browser to http://localhost:5173.

Linux/Ubuntu

  1. Start the backend server:

    ./run_backend.sh
  2. In a new terminal, start the frontend:

    ./run_frontend.sh
  3. Open your browser to http://localhost:5173.

Usage

  1. Select Mode: Toggle between Fast Mode and Pro Mode in the header.
  2. Upload Documents: Click "Choose Files" to upload one or more documents.
  3. Process Documents: Click "Process Documents" to create embeddings for the selected mode.
  4. Search: Enter your query in the search box.
  5. Ask Questions: Enable Q&A Mode to receive direct answers with citations instead of just snippets.
  6. Manage Indexes: Select different cached indexes from the sidebar. The UI indicates which mode each index was built with.
  7. Search History: View, pin, or remove previous searches.

API Endpoints

  • GET / - Root endpoint
  • POST /upload/ - Upload files
  • POST /process/ - Process documents and build index (supports mode parameter)
  • POST /search/ - Perform semantic search (supports mode parameter)
  • POST /answer/ - Perform extractive Q&A (supports mode parameter)
  • GET /cache/ - Get cached index information (includes mode info)
  • POST /load_cache/ - Load a specific cached index
  • GET /history/ - Get search history
  • POST /history/pin/ - Pin/unpin a search query
  • POST /history/remove/ - Remove a search from history
  • DELETE /cache/clear/ - Clear all cached indexes
  • GET /health/ - Health check

Project Structure

doc-semantic-search/
├── backend/
│   └── main.py                 # FastAPI backend with multi-mode logic
├── frontend/
│   ├── src/
│   │   └── App.jsx            # Main React component
│   ├── package.json           # Frontend dependencies
│   ├── vite.config.js         # Vite configuration
│   └── index.html             # Main HTML file
├── cache/                     # Cached FAISS indexes
├── temp_uploads/              # Temporary uploaded files
├── setup.bat                  # Windows setup script
├── run_backend.bat            # Windows backend runner
├── run_frontend.bat           # Windows frontend runner
├── setup.sh                   # Linux setup script
├── run_backend.sh             # Linux backend runner
├── run_frontend.sh            # Linux frontend runner
├── pyproject.toml             # Python dependencies
└── README.md                  # This file

Dependencies

Backend

  • FastAPI: Web framework
  • sentence-transformers: For semantic embeddings and cross-encoders
  • transformers: For Q&A pipeline
  • faiss-cpu: For similarity search
  • PyPDF2, python-docx: File text extraction

Frontend

  • React 18: UI library
  • Vite: Build tool
  • Tailwind CSS: Styling
  • Axios: HTTP client
  • @heroicons/react: Icons

Troubleshooting

  • Scores > 100%: If you see this in Fast Mode, ensure you have the latest backend update which applies Sigmoid normalization to the ms-marco logits.
  • CORS Errors: Ensure both backend (port 8000) and frontend (port 5173) are running.
  • Model Downloads: First run of any mode will download models (~1GB+ for Pro mode). Ensure you have internet access.

License

GNU General Public License v3.0
