---
title: RAG Chatbot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.30.0
app_file: app/streamlit_app.py
pinned: true
---
A Retrieval-Augmented Generation (RAG) chatbot for interactive conversations with PDF documents. Upload PDFs, process them, and ask questions in a chat interface backed by Sentence Transformers embeddings and a FLAN-T5 generator.
🚀 Deployed & Live: Hugging Face Spaces
## Features

- PDF Ingestion: Extract and chunk text from PDF documents using `pdfplumber`.
- Embeddings: Generate sentence embeddings with Sentence Transformers for semantic search.
- Vector Search: Efficient similarity search using a FAISS vector index.
- Question Answering: Generate answers using pre-trained language models (e.g., FLAN-T5).
- Chat Interface: Interactive chat UI built with Streamlit.
- Session Management: Persistent chat history and processed data per session.
- Source Citations: View the relevant text chunks used as sources for each answer.
## Architecture

The application follows a modular RAG pipeline:

- Ingestion (`src/ingest.py`): Extracts text from PDFs and chunks it into manageable pieces.
- Embedding (`src/embed.py`): Converts text chunks into vector embeddings.
- Vector Store (`src/vectorstore.py`): Builds and manages a FAISS index for fast retrieval.
- Query (`src/query.py`): Embeds user queries, retrieves relevant chunks, and generates answers.
- UI (`app/streamlit_app.py`): Provides a chat interface for user interaction.
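The query step assembles the retrieved chunks and the user's question into a single prompt for the generator. A minimal sketch — the `build_prompt` helper and its template are illustrative assumptions, not the repo's exact format:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks and the user question into one generation
    prompt. The template here is illustrative, not the repo's exact one."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is overfitting?",
    ["Overfitting occurs when a model memorises its training data.",
     "Regularisation helps reduce overfitting."],
)
print(prompt)
```

Numbering the chunks in the prompt also makes it easy to surface them later as expandable source citations in the UI.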
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/lydiamavin/rag-chatbot.git
   cd rag-chatbot
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the Streamlit app:

   ```bash
   streamlit run app/streamlit_app.py
   ```

5. Open your browser to the provided URL.
## Usage

1. In the sidebar:
   - Upload one or more PDF files.
   - Click "Process PDFs" to ingest and index the documents.
2. Once processed, use the chat input at the bottom to ask questions about the PDFs.
3. View answers in the chat, with expandable sources for transparency.
## Example

- Upload a PDF about machine learning.
- Ask questions about its content.
- Receive answers with relevant excerpts from the document.
## Configuration

- Models: The default embedding model is `all-MiniLM-L6-v2`; the generation model is `google/flan-t5-small`. Modify in the source code to customize.
- Chunking: The default chunk size is 1000 characters with a 200-character overlap. Adjust in `src/ingest.py`.
- Data Storage: Processed data (chunks, embeddings, index) is stored in the `data/` directory.
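The chunking defaults can be sketched as a character-level sliding window — the `chunk_text` helper below is illustrative, not the repo's exact implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by
    `overlap` characters (defaults mirror the config above: 1000 / 200)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance 800 characters per window by default
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 2500)
print(len(chunks), [len(c) for c in chunks])  # → 3 [1000, 1000, 900]
```

The overlap ensures a sentence cut at a chunk boundary also appears intact in the neighbouring chunk, which improves retrieval quality.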
## Testing

Run tests with pytest:

```bash
pytest tests/
```

Built with Streamlit, Sentence Transformers, FAISS, and Transformers.