aledrinyth/ScrubHub

🧼 ScrubHub: The One-Stop Data Cleaning Solution

ScrubHub is an intelligent, conversational data cleaning agent powered by Streamlit and OpenAI. Simply upload your messy CSV file, and tell ScrubHub what you want to do in plain English. From handling missing values to generating statistical summaries, ScrubHub makes data preprocessing faster and more intuitive than ever.


Key Features

  • **Conversational AI**: Interact with your data using natural language. Just ask it to "clean the data" or "show me a summary."
  • **Smart Analysis**: Automatically analyzes your dataset to suggest cleaning steps and imputation strategies.
  • **Code Generation**: Generates and executes Python code on the fly to perform cleaning, imputation, and summarization tasks.
  • **Robust Error Handling**: If the AI-generated code fails, it automatically retries with the error context to self-correct.
  • **CSV Upload & Download**: Easily upload one or more CSV files and download the cleaned dataset with a single click.
  • **Data Previews**: Instantly view a preview of your uploaded data to get a quick overview.
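
The retry behaviour described under Robust Error Handling can be pictured as a loop that runs the generated code and, on failure, passes the error message into the next generation request. The sketch below shows that general pattern only; it is an assumption, not ScrubHub's actual implementation. `run_with_retries` and `fake_generate` are hypothetical names, and `fake_generate` stands in for the language model call.

```python
from typing import Callable, Optional


def run_with_retries(
    generate: Callable[[str, Optional[str]], str],
    request: str,
    namespace: dict,
    max_attempts: int = 3,
) -> dict:
    """Run generated code, feeding any error back into the next attempt.

    `generate(request, last_error)` is a hypothetical stand-in for the
    language model call; it returns Python source as a string.
    """
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(request, last_error)
        # Shallow copy so names defined by a failed attempt do not leak
        # into the next one.
        scope = dict(namespace)
        try:
            # NOTE: exec runs arbitrary code; real use needs sandboxing.
            exec(code, scope)
            return scope
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")


def fake_generate(request: str, last_error: Optional[str]) -> str:
    """Toy generator: the first draft has a bug, the retry fixes it."""
    if last_error is None:
        return "result = sum(values) / count"  # fails: count is undefined
    return "result = sum(values) / len(values)"


scope = run_with_retries(fake_generate, "average the values", {"values": [2, 4, 6]})
print(scope["result"])
```

Capping the number of attempts keeps a request that never produces working code from looping forever.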

Getting Started

Follow these instructions to get a local copy up and running.

Prerequisites

  • Python 3 with pip
  • Git
  • An OpenAI API key

Installation

  1. Clone the repository:

    git clone https://github.com/aledrinyth/ScrubHub.git
    cd ScrubHub
  2. Create and activate a virtual environment (recommended):

    # For Mac/Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # For Windows
    python -m venv venv
    .\venv\Scripts\activate
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. Set up your environment variables:

    • Create a new file named .env in the root of the project directory.
    • Add your OpenAI API key to this file:
      OPENAI_KEY="your_api_key_here"
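
This README does not show how the app reads the key. A common approach is the python-dotenv package; the standard-library sketch below illustrates the same idea under that assumption. `load_env_file` and `get_openai_key` are hypothetical helpers, and only the variable name `OPENAI_KEY` comes from the step above.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines; skip blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes such as OPENAI_KEY="..."
        values[key.strip()] = value.strip().strip("\"'")
    return values


def get_openai_key(path: str = ".env") -> str:
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("OPENAI_KEY") or load_env_file(path).get("OPENAI_KEY")
    if not key:
        raise RuntimeError("OPENAI_KEY is not set; add it to .env or the environment")
    return key
```

Keep the .env file out of version control so the key is not committed by accident.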
      

Usage

Once the installation is complete, you can run the Streamlit application with a single command:

streamlit run DataCleaningAgent.py

Your web browser should automatically open to the ScrubHub application. If not, navigate to http://localhost:8501.

How to Use the App

  1. Upload Data: Use the sidebar to upload one or more of your CSV files.
  2. Ask Away: Use the chat input at the bottom of the screen to give commands like:
    • clean my data
    • impute missing values in the age column with the median
    • give me a numerical summary
    • show me the unique values for the category column
  3. Download: Once the data is cleaned, a download button will appear for you to save the results.
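
For orientation, the example commands above correspond roughly to the pandas operations below. This is a hand-written sketch, not the code ScrubHub generates; the sample DataFrame and its `age` and `category` columns are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with gaps, standing in for an uploaded CSV.
df = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "category": ["a", "b", "a", None],
    }
)

# "impute missing values in the age column with the median"
df["age"] = df["age"].fillna(df["age"].median())

# "give me a numerical summary"
summary = df.describe()

# "show me the unique values for the category column"
unique_categories = df["category"].dropna().unique().tolist()

print(df)
print(summary)
print(unique_categories)
```

The median is computed from the non-missing values only, so the imputed entry does not shift it.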

Built With

  • Streamlit - The core framework for building the web application.
  • Pandas - For all data manipulation and analysis.
  • OpenAI API - For the natural language understanding and code generation.
  • LangChain - To structure and manage interactions with the language model.

About

One-stop data cleaning solution
