Data-Refiner

📌 CSV Data Preprocessing with Automation

Project Overview

This project automates the preprocessing of CSV files by filtering, cleaning, and storing the data efficiently. It processes multiple CSV files in a directory and saves the cleaned outputs with dynamically generated filenames based on the year present in the input filenames. The cleaned data is stored both as CSV files and in an SQLite database for structured querying.
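The year-based naming described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `extract_year` and `cleaned_output_path`, the regular expression, and the exact `cleaned_<year>.csv` pattern are all assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the year is the first standalone four-digit number
# beginning with 19 or 20 in the input filename.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_year(filename: str) -> Optional[str]:
    """Return the year found in the filename, or None if there is none."""
    match = YEAR_PATTERN.search(Path(filename).stem)
    return match.group(0) if match else None


def cleaned_output_path(input_path: str, output_dir: str) -> Path:
    """Build the output path for a cleaned file (hypothetical naming scheme)."""
    year = extract_year(input_path)
    if year is None:
        raise ValueError(f"No four-digit year found in {input_path!r}")
    return Path(output_dir) / f"cleaned_{year}.csv"
```

Under these assumptions, an input such as `files/sales_2021.csv` would map to `output_files/cleaned_2021.csv`, and a filename without a year raises an error rather than silently overwriting another output.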

🎯 Features

✅ Batch Processing: Processes all CSV files in a directory automatically.

✅ Data Filtering: Removes rows with missing 'Date' or 'Product' values.

✅ Missing Value Imputation: Fills missing 'Sales' and 'Revenue' values with the column median.

✅ Database Storage: Saves processed data into an SQLite database for structured retrieval.

✅ Dynamic Output Naming: Generates filenames as cleaned_<year>.csv, where <year> is the year found in the input filename.

✅ Automatic Directory Handling: Creates an output directory if it doesn’t exist.

✅ Logging System: Tracks each processing step and errors for debugging.
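The filtering and imputation features listed above could look roughly like this. It is a standard-library sketch that treats each CSV row as a dictionary of strings; the function names are hypothetical, the project's own filter.py and imputation.py may work differently (for example with pandas), and computing the median after filtering is an assumption.

```python
import statistics
from typing import Dict, List, Optional

Row = Dict[str, str]

REQUIRED_COLUMNS = ("Date", "Product")   # rows missing these are dropped
NUMERIC_COLUMNS = ("Sales", "Revenue")   # gaps here are filled with the median


def is_missing(value: Optional[str]) -> bool:
    """Treat absent keys and blank strings as missing."""
    return value is None or value.strip() == ""


def filter_rows(rows: List[Row]) -> List[Row]:
    """Keep only rows that have both 'Date' and 'Product'."""
    return [
        row for row in rows
        if not any(is_missing(row.get(col)) for col in REQUIRED_COLUMNS)
    ]


def impute_median(rows: List[Row]) -> List[Row]:
    """Return copies of the rows with missing numeric values set to the column median."""
    cleaned = [dict(row) for row in rows]
    for col in NUMERIC_COLUMNS:
        present = [float(row[col]) for row in cleaned if not is_missing(row.get(col))]
        if not present:
            continue  # no observed values, so there is no median to use
        median = statistics.median(present)
        for row in cleaned:
            if is_missing(row.get(col)):
                row[col] = str(median)
    return cleaned
```

Rows read with `csv.DictReader` can be passed straight to `filter_rows` and then to `impute_median`; the input list is left unmodified.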

📄 Logging System

All logs are stored in data_preprocessing.log, tracking:

🔹 INFO: Normal operations like filtering and saving.

🔹 WARNING: Non-critical issues, such as missing values that are handled with imputation.

🔹 ERROR: Failures like missing input files or database errors.
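A file logger matching this description might be configured as below. The logger name, the message format, and the `build_logger` helper are assumptions for illustration; only the `data_preprocessing.log` filename and the three levels come from this README.

```python
import logging


def build_logger(log_path: str = "data_preprocessing.log") -> logging.Logger:
    """Create (or reuse) a logger that appends to the given log file."""
    logger = logging.getLogger("data_preprocessing")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this setup, `logger.info(...)`, `logger.warning(...)`, and `logger.error(...)` each write one timestamped line tagged with the corresponding level.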

📂 Project Structure

📁 Data_Preprocessor

│── filter.py # Filters data based on required columns

│── imputation.py # Handles missing value imputation

│── database.py # Manages SQLite database operations

│── logger.py # Implements logging system

│── main.py # CLI entry point for execution

│── requirements.txt # Dependencies

│── README.md # Project documentation

🚀 Installation & Usage

1️⃣ Setup Environment

python -m venv venv
source venv/bin/activate  # On Windows use venv\Scripts\activate
pip install -r requirements.txt

2️⃣ Run the Script

python main.py files/ output_files/
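The command above takes an input directory and an output directory. A command-line front end with that shape could be sketched as follows; the argument names and the helpers `parse_args`, `list_csv_files`, and `prepare_output_dir` are hypothetical and are not taken from the project's main.py.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two positional arguments: input and output directories."""
    parser = argparse.ArgumentParser(
        description="Preprocess every CSV file in a directory."
    )
    parser.add_argument("input_dir", type=Path, help="directory containing raw CSV files")
    parser.add_argument("output_dir", type=Path, help="directory for cleaned CSV files")
    return parser.parse_args(argv)


def list_csv_files(input_dir: Path) -> List[Path]:
    """Return all CSV files directly inside the input directory."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )


def prepare_output_dir(output_dir: Path) -> Path:
    """Create the output directory if it does not exist yet."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir
```

A main function would then call `parse_args()`, prepare the output directory, and loop over `list_csv_files(args.input_dir)` to clean each file in turn.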

3️⃣ Query Cleaned Data (Example SQLite Query)

SELECT * FROM sales_data WHERE Product = 'Laptop';
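The same query can be run from Python with the standard `sqlite3` module. The sketch below uses an in-memory database with made-up rows so it runs on its own; the column types and the helper names are assumptions, and the real database file produced by the project is not named in this README.

```python
import sqlite3
from typing import List, Tuple


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory stand-in for the project's database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_data (Date TEXT, Product TEXT, Sales REAL, Revenue REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_data VALUES (?, ?, ?, ?)",
        [
            ("2021-01-01", "Laptop", 10, 1000.0),   # illustrative data only
            ("2021-01-02", "Phone", 5, 500.0),
            ("2021-01-03", "Laptop", 20, 2000.0),
        ],
    )
    conn.commit()
    return conn


def rows_for_product(conn: sqlite3.Connection, product: str) -> List[Tuple]:
    """Fetch all rows for one product, using a bound parameter instead of string formatting."""
    cursor = conn.execute(
        "SELECT Date, Product, Sales, Revenue FROM sales_data WHERE Product = ?",
        (product,),
    )
    return cursor.fetchall()
```

Passing the product as a bound parameter (`?`) keeps the query safe when the value comes from user input.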

🤖 Why Is CSV Preprocessing Important?

🔹 Data Cleaning & Transformation: Ensures structured and error-free data.

🔹 Automated Processing: Saves time by handling large datasets efficiently.

🔹 Better Analysis & Machine Learning: Provides clean data for insightful analysis.

🔹 Database Storage: Enables structured querying and retrieval of processed data.

This project provides an automated, scalable, and efficient solution for CSV data preprocessing, making data ready for analysis and machine learning tasks.
