---
title: RAG Chatbot
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.30.0
app_file: app/streamlit_app.py
pinned: true
---
A Retrieval-Augmented Generation (RAG) chatbot for interactive conversations with PDF documents. Upload PDFs, process them, and ask questions in a chat interface backed by Sentence Transformers embeddings and a FLAN-T5 generator.
🚀 Deployed & Live: Hugging Face Spaces
## Features

- PDF Ingestion: Extract and chunk text from PDF documents using `pdfplumber`.
- Embeddings: Generate sentence embeddings with Sentence Transformers for semantic search.
- Vector Search: Efficient similarity search using a FAISS vector index.
- Question Answering: Generate answers using pre-trained language models (e.g., FLAN-T5).
- Chat Interface: Interactive chat UI built with Streamlit.
- Session Management: Persistent chat history and processed data per session.
- Source Citations: View the relevant text chunks used as sources for each answer.
## Architecture

The application follows a modular RAG pipeline:

- Ingestion (`src/ingest.py`): Extracts text from PDFs and chunks it into manageable pieces.
- Embedding (`src/embed.py`): Converts text chunks into vector embeddings.
- Vector Store (`src/vectorstore.py`): Builds and manages a FAISS index for fast retrieval.
- Query (`src/query.py`): Embeds user queries, retrieves relevant chunks, and generates answers.
- UI (`app/streamlit_app.py`): Provides a chat interface for user interaction.
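The query step assembles the retrieved chunks and the user's question into a single prompt for the generator. A minimal sketch — the `build_prompt` helper and its template are illustrative assumptions, not the repo's exact format:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks and the user question into one generation
    prompt. The template here is illustrative, not the repo's exact one."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is overfitting?",
    ["Overfitting occurs when a model memorises its training data.",
     "Regularisation helps reduce overfitting."],
)
print(prompt)
```

Numbering the chunks in the prompt also makes it easy to surface them later as expandable source citations in the UI.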
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/lydiamavin/rag-chatbot.git
   cd rag-chatbot
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the Streamlit app:

   ```bash
   streamlit run app/streamlit_app.py
   ```

5. Open your browser to the provided URL.
## Usage

1. In the sidebar:
   - Upload one or more PDF files.
   - Click "Process PDFs" to ingest and index the documents.
2. Once processed, use the chat input at the bottom to ask questions about the PDFs.
3. View answers in the chat, with expandable sources for transparency.
## Example

- Upload a PDF about machine learning.
- Ask questions about its content.
- Receive answers with relevant excerpts from the document.
## Configuration

- Models: The default embedding model is `all-MiniLM-L6-v2`; the generation model is `google/flan-t5-small`. Modify in the source code to customize.
- Chunking: The default chunk size is 1000 characters with a 200-character overlap. Adjust in `src/ingest.py`.
- Data Storage: Processed data (chunks, embeddings, index) is stored in the `data/` directory.
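The chunking defaults can be sketched as a character-level sliding window — the `chunk_text` helper below is illustrative, not the repo's exact implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by
    `overlap` characters (defaults mirror the config above: 1000 / 200)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance 800 characters per window by default
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 2500)
print(len(chunks), [len(c) for c in chunks])  # → 3 [1000, 1000, 900]
```

The overlap ensures a sentence cut at a chunk boundary also appears intact in the neighbouring chunk, which improves retrieval quality.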
## Testing

Run tests with pytest:

```bash
pytest tests/
```

Built with Streamlit, Sentence Transformers, FAISS, and Transformers.