ScrubHub is an intelligent, conversational data cleaning agent powered by Streamlit and OpenAI. Simply upload your messy CSV file, and tell ScrubHub what you want to do in plain English. From handling missing values to generating statistical summaries, ScrubHub makes data preprocessing faster and more intuitive than ever.
- **Conversational AI**: Interact with your data using natural language. Just ask it to "clean the data" or "show me a summary."
- **Smart Analysis**: Automatically analyzes your dataset to suggest cleaning steps and imputation strategies.
- **Code Generation**: Generates and executes Python code on the fly to perform cleaning, imputation, and summarization tasks.
- **Robust Error Handling**: If the AI-generated code fails, it automatically retries with the error context to self-correct.
- **CSV Upload & Download**: Easily upload one or more CSV files and download the cleaned dataset with a single click.
- **Data Previews**: Instantly view a preview of your uploaded data to get a quick overview.
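
The retry-with-error-context behaviour mentioned in the feature list can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ScrubHub's actual implementation: `generate_code` is a hypothetical callable standing in for the language-model call, and `run_with_retries` and `max_attempts` are names chosen for this example only.

```python
import traceback
from typing import Callable, Optional


def run_with_retries(
    generate_code: Callable[[str, Optional[str]], str],
    request: str,
    namespace: dict,
    max_attempts: int = 3,
) -> dict:
    """Generate code for `request`, execute it, and retry with the error on failure."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        # Hypothetical model call: receives the request plus the previous traceback, if any.
        code = generate_code(request, last_error)
        # Fresh copy per attempt, so names bound by a failed attempt are discarded.
        scope = dict(namespace)
        try:
            exec(code, scope)
            return scope
        except Exception:
            # Keep the full traceback so the next attempt can see what went wrong.
            last_error = traceback.format_exc()
    raise RuntimeError(f"All {max_attempts} attempts failed. Last error:\n{last_error}")


def fake_model(request: str, error: Optional[str]) -> str:
    # Stand-in for the model: the first attempt is deliberately broken,
    # the second one "sees" the error and returns working code.
    return "result = values[0] / 0" if error is None else "result = sum(values)"


scope = run_with_retries(fake_model, "sum the values", {"values": [1, 2, 3]})
print(scope["result"])  # prints 6
```

Executing generated code with `exec` runs it with the full privileges of the app process, so a sketch like this is only appropriate for trusted, local use.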
Follow these instructions to get a local copy up and running.
- Python 3.8 or higher
- An OpenAI API Key
- Clone the repository:

      git clone https://github.com/your-username/ScrubHub.git
      cd ScrubHub

- Create and activate a virtual environment (recommended):

      # For Mac/Linux
      python3 -m venv venv
      source venv/bin/activate

      # For Windows
      python -m venv venv
      .\venv\Scripts\activate

- Install the required dependencies:

      pip install -r requirements.txt

- Set up your environment variables:
  - Create a new file named `.env` in the root of the project directory.
  - Add your OpenAI API key to this file:

        OPENAI_KEY="your_api_key_here"
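
For orientation, here is one way an application could check for `OPENAI_KEY` at startup. It is a sketch using only the standard library: the function name `get_openai_key` is hypothetical, and it assumes the variable has already been loaded into the process environment (for example by a `.env` loader; this README does not say which mechanism ScrubHub uses).

```python
import os
from typing import Mapping, Optional


def get_openai_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OpenAI key from `env` (defaults to os.environ) or fail loudly."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_KEY", "").strip()
    # Treat the README placeholder as "not configured" as well.
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENAI_KEY is missing; add it to .env in the project root.")
    return key
```

Failing early with a clear message is usually friendlier than letting the first API call fail with an authentication error.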
Once the installation is complete, you can run the Streamlit application with a single command:

    streamlit run DataCleaningAgent.py

Your web browser should automatically open to the ScrubHub application. If not, navigate to http://localhost:8501.
- Upload Data: Use the sidebar to upload one or more of your CSV files.
- Ask Away: Use the chat input at the bottom of the screen to give commands like:
  - `clean my data`
  - `impute missing values in the age column with the median`
  - `give me a numerical summary`
  - `show me the unique values for the category column`
- Download: Once the data is cleaned, a download button will appear for you to save the results.
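
To give a feel for what the example commands might translate to, here is a small pandas sketch. The code ScrubHub actually generates will vary with the model and the dataset; the toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Toy data standing in for an uploaded CSV (hypothetical columns).
df = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "category": ["a", "b", "a", None],
    }
)

# "impute missing values in the age column with the median"
df["age"] = df["age"].fillna(df["age"].median())

# "give me a numerical summary"
summary = df.describe()

# "show me the unique values for the category column"
unique_categories = df["category"].dropna().unique().tolist()

print(df)
print(summary)
print(unique_categories)
```

The median ignores missing values by default, so the gap in `age` is filled with the median of the three observed ages.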
- Streamlit - The core framework for building the web application.
- Pandas - For all data manipulation and analysis.
- OpenAI API - For the natural language understanding and code generation.
- LangChain - To structure and manage interactions with the language model.