A Retrieval-Augmented Generation (RAG) system that allows users to chat with multiple PDF documents simultaneously.
- ✅ Upload and process multiple PDF documents
- ✅ Smart text chunking with metadata
- ✅ Vector search across all documents
- ✅ Context-aware responses based on document content
- ✅ Source citations for answers
- ✅ Streaming chat interface with real-time responses
- ✅ Document summarization feature
- ✅ Page-level source attribution
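The "smart chunking" and page-level attribution features above can be sketched in plain Python. This is a minimal illustration, not the project's actual implementation; the `chunk_pages` name and the size/overlap parameters are assumptions:

```python
def chunk_pages(pages, chunk_size=500, overlap=50):
    """Split per-page text into overlapping chunks, keeping page metadata.

    `pages` is a list of (page_number, text) tuples, e.g. extracted with PyPDF.
    Each chunk records its source page so answers can cite it later.
    """
    chunks = []
    for page_num, text in pages:
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append({"text": text[start:end], "page": page_num})
            if end == len(text):
                break
            start = end - overlap  # overlap preserves context across chunk edges
    return chunks
```

In practice a splitter such as LangChain's recursive character splitter is preferable, since it tries to break on paragraph and sentence boundaries rather than at fixed offsets.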
- Python
- LangChain
- OpenAI API / Hugging Face Models
- Streamlit (frontend)
- ChromaDB (vector storage)
- Sentence Transformers (embeddings)
- PyPDF (PDF processing)
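Under the hood, the vector store ranks chunks by embedding similarity. A pure-Python sketch of cosine-similarity search (illustrative only; in this stack the embeddings come from Sentence Transformers and the index is ChromaDB, and the `top_k` helper is an assumed name):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed, k=2):
    """Return the k chunks whose embedding is most similar to the query.

    `indexed` is a list of (embedding, chunk_dict) pairs.
    """
    ranked = sorted(
        indexed,
        key=lambda pair: cosine_similarity(query_vec, pair[0]),
        reverse=True,
    )
    return [chunk for _, chunk in ranked[:k]]
```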
- Clone the repository
- Install dependencies:
  ```shell
  pip install -r requirements.txt
  ```
- Set up your OpenAI API key (optional, if using OpenAI models):
  ```shell
  # Create a .env file and add your OpenAI API key
  echo "OPENAI_API_KEY=your_key_here" > .env
  ```
- Run the application:
  ```shell
  streamlit run app.py
  ```
Then:
- Upload one or more PDF files using the sidebar
- Click "Process PDFs" to index the documents
- Use "Generate Document Summaries" to get summaries of your documents
- Ask questions in the chat interface about your documents
- View responses with source citations
- Get query suggestions for follow-up questions
The application follows a modular architecture:
- Document Processing: PDF ingestion and text extraction with metadata
- Text Splitting: Smart chunking with metadata preservation
- Vector Storage: ChromaDB for efficient similarity search
- Retrieval: RAG pipeline for context-aware responses
- UI: Streamlit interface with streaming responses and source citations
- Additional Features: Document summarization and query suggestions
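Putting those stages together, the retrieval and prompt-assembly steps might look like the sketch below. This is a hedged outline, not the app's actual code: `retriever` and `llm` stand in for the ChromaDB similarity query and the OpenAI / Hugging Face call, and the citation format is an assumption:

```python
def build_prompt(question, chunks):
    """Assemble a context-grounded prompt with page-level source citations."""
    context = "\n\n".join(
        f"[{c['source']}, p.{c['page']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer using only the context below and cite sources as [file, p.N].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, retriever, llm, k=4):
    """RAG loop: retrieve the top-k chunks, then generate a grounded answer."""
    chunks = retriever(question, k)  # e.g. a ChromaDB similarity query
    return llm(build_prompt(question, chunks)), chunks  # answer plus its sources
```

Returning the retrieved chunks alongside the answer is what lets the UI render the source citations next to each response.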
