Natural language to SQL analytics tool powered by AI. Ask questions about banking data in plain English and get instant insights with interactive visualizations.
Live App - https://datagpt-banking.streamlit.app/
DataGPT is an AI-powered data analysis tool that enables users to query banking databases using natural language. The application automatically converts plain English questions into SQL queries, executes them, and presents results with appropriate visualizations. Built to demonstrate practical AI integration, prompt engineering, and full-stack data application development.
- Natural Language Processing: Ask questions in plain English, no SQL knowledge required
- AI-Powered SQL Generation: Automatically converts questions to optimized SQL queries using Groq's LLaMA 3.3 model
- Interactive Visualizations: Auto-generates charts based on query results (bar charts, line charts, time series)
- Security-First Design: SQL validation prevents dangerous operations (DROP, DELETE, ALTER, etc.)
- Sample Banking Database: Pre-loaded with realistic customer, account, transaction, loan, and credit card data
- Real-time Query Execution: Instant results with downloadable CSV exports
- User-Friendly Interface: Clean, professional UI built with Streamlit
- Python 3.9+
- Streamlit (Web framework)
- Groq API (LLM for natural language understanding)
- SQLite (Database)
- Pandas (Data manipulation)
- Plotly (Interactive visualizations)
- Python-dotenv (Environment management)
DataGPT-Banking/
├── app.py # Main Streamlit application
├── config.py # Configuration settings
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── .gitignore # Git ignore rules
├── .env # Environment variables (not committed)
├── test_db.py # Test database creation
├── database/
│ ├── __init__.py
│ ├── setup_db.py # Database creation script
│ └── sample_data.py # Sample data generation
├── utils/
│ ├── __init__.py
│ ├── llm_client.py # Groq API wrapper
│ ├── sql_generator.py # Text-to-SQL conversion
│ ├── sql_validator.py # SQL security validation
│ └── visualizer.py # Chart generation logic
├── prompts/
│ ├── __init__.py
│ └── sql_prompts.py # LLM prompt templates
├── data/
│ └── banking.db # SQLite database (auto-generated)
└── .streamlit/
└── config.toml # Streamlit theme configuration
- Python 3.9 or higher
- Groq API key (free at console.groq.com)
- Git
- Clone the repository
git clone https://github.com/Amrita-DevX/DataGPT-Banking
cd DataGPT-Banking- Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate- Install dependencies
pip install -r requirements.txt- Set up environment variables
Create a .env file in the root directory:
GROQ_API_KEY=your_groq_api_key_here
- Create database
python database/setup_db.py- Run the application
streamlit run app.pyThe application will open in your browser at http://localhost:8501
The tool can answer various questions about banking data:
Basic Queries:
- "Show me total deposits last month"
- "What is the average account balance?"
- "How many active customers do we have?"
Analytical Queries:
- "Show me top 10 customers by total balance"
- "What are the most common transaction categories?"
- "Compare savings vs checking account balances"
Advanced Queries:
- "Find customers with more than 5 transactions over $10,000"
- "Show customers with decreasing balances over last 3 months"
- "What is the loan default rate by loan type?"
- Enter your question in plain English
- Click "Analyze"
- View generated SQL query
- Review results in table format
- Explore interactive visualizations
- Download results as CSV if needed
The sample database includes five main tables:
customers: Customer information (100 records)
- customer_id, name, email, phone, address, city, state, join_date, customer_type, risk_score
accounts: Bank accounts (150 records)
- account_id, customer_id, account_type, account_number, balance, opening_date, status, interest_rate
transactions: Transaction history (1000 records)
- transaction_id, account_id, transaction_type, amount, transaction_date, merchant, category, status, description
loans: Loan records (30 records)
- loan_id, customer_id, loan_type, loan_amount, interest_rate, loan_date, term_months, monthly_payment, remaining_balance, status
credit_cards: Credit card accounts (50 records)
- card_id, customer_id, card_number, card_type, credit_limit, current_balance, available_credit, issue_date, expiry_date, status
- User Input: Natural language question entered through Streamlit interface
- Prompt Engineering: Question combined with database schema in structured prompt
- LLM Processing: Groq API (LLaMA 3.3) generates SQL query
- Validation: SQL query checked for security vulnerabilities
- Execution: Safe query executed against SQLite database
- Visualization: Results processed and appropriate chart type selected
- Display: Results shown with interactive charts and download option
graph TD
A[User Interface - Streamlit] --> B[Natural Language Question]
B --> C[Prompt Engineering Module]
C --> |Schema + Rules| D[Groq API - LLaMA 3.3]
D --> E[Generated SQL Query]
E --> F[SQL Validator]
F --> G{Safe Query?}
G -->|Yes| H[SQLite Database]
G -->|No| I[Error Message]
H --> J[Query Results]
J --> K[Visualizer]
K --> L[Display Results + Chart]
style D fill:#4CAF50
style F fill:#FFB74D
style G fill:#FF8C00
style I fill:#E53935
style L fill:#388E3C
### Security Features
- Read-only database access (only SELECT queries allowed)
- SQL injection prevention through keyword filtering
- Validation of all LLM-generated queries before execution
- No data modification operations permitted
- Sanitization of user inputs
## Configuration
### Environment Variables
GROQ_API_KEY: Your Groq API key (required)
### Config Options (config.py)
- MODEL_NAME: LLM model to use (default: llama-3.3-70b-versatile)
- DATABASE_PATH: SQLite database location
- TEMPERATURE: LLM response randomness (0.1 for consistent SQL)
- MAX_TOKENS: Maximum LLM response length
## Current Limitations
- File size: Optimized for databases up to 50MB
- Query complexity: Best suited for analytical queries, not complex joins
- Data privacy: Uses external API (Groq) - not suitable for sensitive production data without modifications
- File formats: Currently supports SQLite only
- Concurrent users: Designed for single-user or low-traffic scenarios
## Future Roadmap
### Version 2.0
- CSV upload functionality for custom datasets
- Support for PostgreSQL and MySQL connections
- Advanced fraud detection algorithms
- Custom query templates and saved queries
- Batch query processing
- Enhanced visualizations (dashboards, pivot tables)
### Version 3.0
- Multi-user authentication and authorization
- Query history and analytics
- Scheduled reports and alerts
- Natural language insights generation
- Integration with BI tools
- Local LLM option for data privacy
## Development
### Running Tests
```bash
python -m pytest tests/
This project follows PEP 8 style guidelines. All code includes comprehensive comments explaining functionality.
Contributions are welcome. Please follow these steps:
- Fork the repository
- Create a feature branch
- Make your changes with clear commit messages
- Add tests if applicable
- Submit a pull request
- Query generation: 2-5 seconds (depending on question complexity)
- Query execution: < 1 second for most queries
- Visualization rendering: < 1 second
- Supports datasets with up to 100,000 rows comfortably
- Push code to GitHub
- Go to streamlit.io/cloud
- Connect repository
- Set GROQ_API_KEY in secrets
- Deploy
Application runs locally with:
streamlit run app.pyAccess at http://localhost:8501
This project is licensed under the MIT License. See LICENSE file for details.
- Groq for providing free, fast LLM API access
- Streamlit for the excellent web framework
- OpenAI and Anthropic for advancing LLM research
- The open-source community for supporting libraries
For questions, issues, or feedback:
- GitHub Issues: [Create an issue]
- Email: [email protected]
- LinkedIn: https://www.linkedin.com/in/amrita-dasdev/
This is a demonstration project using sample banking data. Not intended for production use with real financial data without proper security enhancements, compliance measures, and data governance policies.
- v1.0.0 (Current): Initial release with core functionality
- Natural language to SQL conversion
- Sample banking database
- Interactive visualizations
- Security validation
- Streamlit web interface