FinRAG: Retrieval-Augmented Intelligence for Financial Document Analysis

A powerful financial document analysis tool that leverages Retrieval Augmented Generation (RAG) techniques to extract insights from 10-K reports and other financial documents.

📸 Application Overview

Screenshots: Home Page, RAG System, Document Q&A

🌟 Features

Dual Functionality:
- Pre-processed financial knowledge base querying
- Upload and analyze your own financial documents
Advanced RAG Implementation:
- Hybrid search combining semantic (FAISS) and keyword (BM25) approaches
- Maximal Marginal Relevance (MMR) for diverse results
- Cross-encoder reranking for improved result quality
Multiple AI Models:
- Support for Mistral AI and Google Gemini LLMs
- Various embedding models (general and financial domain-specific)
- Multiple cross-encoder reranker options
PDF Processing:
- Text extraction with layout preservation
- Table detection and extraction
- Multiple text chunking strategies (semantic, fixed-length, TF-IDF)
Company Information:
- Integration with Yahoo Finance API
- Quick company overview and financial metrics
- Market data, valuation metrics, and executive information
Implementation Options:
- Custom RAG pipeline with fine-tuned parameters
- LangChain-based implementation with ensemble retrieval

📋 Requirements

Python 3.8+
Streamlit 1.25+
PyTorch
FAISS
Sentence Transformers
NLTK
PyPDF2
Tabula
Rank BM25
LangChain
yfinance

🚀 Installation

Clone the repository:

git clone https://github.com/aatmaj28/FinRAG.git
cd FinRAG

Install dependencies:

pip install -r requirements.txt

Download NLTK resources:

import nltk
nltk.download('punkt')

Set up your API keys: Edit the code to include your API keys for Mistral AI and Google Gemini.
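
As an alternative to hard-coding keys in the source, they can be read from environment variables. The following is a minimal sketch, not the project's own mechanism: the variable names `MISTRAL_API_KEY` and `GEMINI_API_KEY` and the helper `load_api_keys` are assumptions introduced here for illustration.

```python
import os

# Hypothetical variable names; the project itself does not define them.
REQUIRED_KEYS = ("MISTRAL_API_KEY", "GEMINI_API_KEY")


def load_api_keys(env=None):
    """Return a dict of API keys, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Calling `load_api_keys()` at startup fails early with a clear message when a key is absent, which keeps secrets out of version control.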

🏃‍♀️ Running the Application

Start the application with:

streamlit run main.py

The application has three main components:

Home Page: Navigation hub for accessing different features
RAG System: Query the pre-processed financial knowledge base
Document Q&A: Upload and analyze your own financial documents

🔧 Usage Guide

RAG System

Select a company from the dropdown menu
Choose your preferred RAG implementation (Custom or LangChain)
Select the LLM provider (Mistral or Gemini)
Choose a cross-encoder model for reranking (Custom RAG only)
Enter your query or select from example queries
Review the response and explore retrieved contexts

Document Q&A

Configure document processing options:
- Select a text chunking strategy
- Choose chunk size
- Select embedding model
Upload your 10-K or financial PDF document
Enter the company name
Wait for processing to complete
Follow the same query process as the RAG System

📊 RAG System Architecture

The system uses a hybrid retrieval approach combining:

Dense Retrieval (Semantic Search):
- Sentence Transformer embeddings
- FAISS vector similarity search
Sparse Retrieval (Keyword Search):
- BM25 algorithm for keyword matching
Hybrid Ranking:
- Weighted combination of semantic and keyword scores
- Maximal Marginal Relevance for diversity
- Cross-encoder reranking for final result quality
Response Generation:
- Top contexts sent to Mistral AI or Google Gemini API
- Contextual response generation with 100-word target
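
The hybrid ranking steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `App1.py`: the min-max normalization, the weight `alpha`, the trade-off `lam`, and all function names are hypothetical choices made here. In the real pipeline, the dense scores would come from FAISS and the sparse scores from BM25.

```python
import math


def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted sum of normalized dense (semantic) and sparse (keyword) scores."""
    d, s = min_max(dense), min_max(sparse)
    return [alpha * x + (1 - alpha) * y for x, y in zip(d, s)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(relevance, vectors, k, lam=0.7):
    """Pick k indices, trading relevance against similarity to earlier picks."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

A lower `lam` favors diversity over raw relevance. The indices returned by `mmr` would then be passed to the cross-encoder for final reranking.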

📁 Project Structure

financial-document-analysis/
├── main.py                  # Main application entry point
├── App1.py                  # RAG System implementation
├── App2.py                  # Document Q&A implementation
├── requirements.txt         # Project dependencies
├── Company 10-K's/          # Processed company data
│   └── COMPANY_NAME/        # Specific company folder
│       ├── *_index.faiss    # FAISS vector index
│       ├── *_metadata.pkl   # Document metadata
│       ├── *_bm25.pkl       # BM25 index
│       └── langchain/       # LangChain implementation files
└── FA.jpg                   # Application logo/image

🛠️ Advanced Customization

Text Chunking Strategies

Semantic Chunking: Preserves meaning with content relationships (recommended for complex documents)
Fixed-Length Chunking: Simple uniform chunks without overlap (faster processing)
TF-IDF Based Chunking: Content-aware chunking based on term importance (good for topically distinct sections)
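
To make the simplest of these strategies concrete, here is a minimal sketch of fixed-length chunking without overlap. It assumes chunk size is measured in whitespace-separated words; the project may count characters or tokens instead, and the function name `fixed_length_chunks` is hypothetical.

```python
def fixed_length_chunks(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size words, no overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Because chunk boundaries ignore sentence and section structure, a fact that straddles two chunks may be split, which is the trade-off against the semantic strategy.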

Embedding Models

General Purpose (all-mpnet-base-v2): Works well for most text analysis tasks
Financial Domain (FinBERT): Specialized for financial terminology and contexts

Cross-Encoder Models

MiniLM-L-6: Fastest option with good performance
ELECTRA-Base: Good balance between speed and quality
RoBERTa-Base: Highest quality but slower performance
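
Whichever model is chosen, the reranking step has the same shape: score each (query, passage) pair jointly, then sort. The sketch below shows that shape with a pluggable scorer. The names `rerank` and `overlap_score` are hypothetical, and the word-overlap scorer is only a toy stand-in so the example runs without downloading a model.

```python
def rerank(query, passages, score_fn, top_k=5):
    """Order passages by score_fn over (query, passage) pairs, best first."""
    pairs = [(query, p) for p in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(p, float(s)) for p, s in ranked[:top_k]]


def overlap_score(pairs):
    """Toy stand-in scorer: number of distinct query words found in the passage."""
    out = []
    for query, passage in pairs:
        passage_words = set(passage.lower().split())
        query_words = set(query.lower().split())
        out.append(sum(1 for w in query_words if w in passage_words))
    return out
```

With Sentence Transformers installed, the `predict` method of a `CrossEncoder` instance accepts the same list of (query, passage) pairs and could be passed as `score_fn`. Which checkpoints the application actually loads for the three options above is not stated here.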

⚠️ Limitations

Processing large PDFs may require significant memory and CPU resources
API rate limits apply to the Mistral AI and Google Gemini services
Table extraction relies on Tabula, which requires a Java runtime to be installed
Yahoo Finance data may not be available for all companies or may be delayed

🔑 API Keys

This application requires API keys for:

Mistral AI: Get from Mistral AI platform
Google Gemini: Get from Google AI Studio

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

🎯 Motivation

This project was developed to address the significant challenge organizations face when trying to quickly extract relevant information from large financial documents. Instead of manually sifting through hundreds of pages of 10-K reports, earnings calls, and financial statements, this tool enables users to find precise answers through natural language queries.

The system was created as part of the CS 6120: Natural Language Processing course, applying advanced NLP techniques to solve real-world information retrieval problems in the financial domain.

📧 Contact

For questions or support, please open an issue in the GitHub repository or contact Aatmaj Salunke at [email protected]
