Hanishsaini/SynthForge

SynthForge

🚀 Democratizing Synthetic Data for AI Builders

SynthForge is a privacy-first, lightweight tool that generates high-quality synthetic tabular data to overcome common AI bottlenecks like data scarcity, annotation challenges, and privacy concerns. Built as a Minimum Viable Product (MVP) in just 3 days, it's designed for indie developers, students, and teams with real-world constraints—making "infinite clean data" accessible without heavy compute or budgets.

Live Demo: synthforge.streamlit.app

Why SynthForge?

In AI development, as much as 80% of project time is often spent on data prep (a widely cited estimate, attributed to Gartner). SynthForge flips that by enabling quick, ethical data generation:

  • Solve Data Shortages: Create statistically similar datasets from small uploads.
  • Prioritize Privacy: Built-in PII detection and differential privacy.
  • Streamline Annotation: Auto-label with sentiment, binning, or clustering.
  • Vision: Empower underrepresented builders (e.g., from Jaipur, India) to innovate globally.

Inspired by the rise of synthetic data (a market estimated at $1-2B by 2025) and by tools like Faker and SDV, SynthForge focuses on simplicity and accessibility.

Features

  • Upload & Generate: Supports CSV/Excel (auto-samples large files for efficiency).
  • Customization: Adjust variance for randomness, add differential privacy noise.
  • Privacy Tools: Scans for emails/phones; optional epsilon-based anonymization.
  • Auto-Labeling: Rule-based or LLM-enhanced (via free Groq tier) for sentiment on text, binning/clustering on numerics.
  • Outputs: Download synthetic CSV + HTML report with stats comparisons (means, stds, top values).
  • Lightweight: Runs smoothly on modest hardware—no GPUs required.
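The generate-and-compare flow described above can be sketched roughly as follows. This is a minimal illustration, not SynthForge's actual implementation: it assumes numeric columns are modelled as a normal distribution and categorical columns are resampled by observed frequency. The function names and the `variance_scale` parameter are hypothetical, and the sketch uses only the standard library to stay self-contained (the app itself relies on Pandas and NumPy).

```python
import random
import statistics
from collections import Counter


def synthesize_numeric(values, n, variance_scale=1.0, seed=None):
    """Sample n values from a normal distribution fitted to `values`.

    variance_scale is a hypothetical knob: 1.0 keeps the original spread,
    larger values widen it.
    """
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values) * variance_scale
    return [rng.gauss(mu, sigma) for _ in range(n)]


def synthesize_categorical(values, n, seed=None):
    """Sample n categories in proportion to their observed frequency."""
    rng = random.Random(seed)
    counts = Counter(values)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Small illustrative "upload" (made-up values).
ages = [23, 35, 41, 29, 52, 38, 47, 31]
plans = ["free", "free", "pro", "free", "pro", "team", "free", "pro"]

synthetic_ages = synthesize_numeric(ages, n=1000, seed=0)
synthetic_plans = synthesize_categorical(plans, n=1000, seed=0)

# Compare summary statistics, similar in spirit to the HTML report.
print(f"mean: real={statistics.mean(ages):.1f} "
      f"synthetic={statistics.mean(synthetic_ages):.1f}")
print(f"std:  real={statistics.stdev(ages):.1f} "
      f"synthetic={statistics.stdev(synthetic_ages):.1f}")
print("top values:", Counter(synthetic_plans).most_common(2))
```

Sampling each column independently keeps the sketch short but discards correlations between columns; a fuller generator would need to model those as well.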

Installation & Setup

  1. Clone the repo:

    git clone https://github.com/Hanishsaini/SynthForge.git
    cd SynthForge
    
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install dependencies:

    pip install streamlit pandas numpy faker scikit-learn openpyxl requests
    
  4. (Optional) For LLM-enhanced labeling: Sign up for a free Groq API key at groq.com and input it in the app sidebar.

Usage

Run the app locally:

    streamlit run app.py

  • Open in your browser (defaults to http://localhost:8501).
  • Upload a file, tweak settings in the sidebar, and generate!
  • For production: Deploy to Streamlit Community Cloud or a similar host (as done for the live demo).

Example: Upload a CSV with names, emails, ages—watch it generate synthetics while masking PII.
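To make the example concrete, here is a rough sketch of regex-based PII scanning and masking on rows like those in such a CSV. The patterns, the placeholder strings, and the function names `scan_pii` and `mask_pii` are assumptions for illustration; the app's real patterns may differ, and simple regexes like these will miss many real-world formats.

```python
import re

# Deliberately simple patterns; real emails and phone formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scan_pii(rows):
    """Return the set of column names whose string values look like PII."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if isinstance(value, str) and (
                EMAIL_RE.search(value) or PHONE_RE.search(value)
            ):
                flagged.add(column)
    return flagged


def mask_pii(rows):
    """Replace detected emails and phone numbers with fixed placeholders."""
    masked = []
    for row in rows:
        clean = {}
        for column, value in row.items():
            if isinstance(value, str):
                value = EMAIL_RE.sub("[EMAIL]", value)
                value = PHONE_RE.sub("[PHONE]", value)
            clean[column] = value  # non-string values pass through unchanged
        masked.append(clean)
    return masked


# Made-up rows standing in for an uploaded CSV.
rows = [
    {"name": "Ada", "email": "ada@example.com",
     "phone": "555-010-1234", "age": 34},
    {"name": "Linus", "email": "linus@example.org",
     "phone": "+1 (555) 010-9876", "age": 29},
]

print(sorted(scan_pii(rows)))
print(mask_pii(rows)[0])
```

Names are not caught by these patterns; detecting them would need a different approach, such as column-name heuristics or a named-entity model.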

Tech Stack

  • Frontend: Streamlit (simple, interactive UI)
  • Core Libraries: Pandas (data handling), NumPy (stats), Faker (heuristic generation), scikit-learn (clustering/imputation)
  • Privacy/Labeling: Regex for PII, Laplace noise for DP, Requests for optional Groq LLM
  • Deployment: Streamlit Cloud (free tier)
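As an illustration of the "Laplace noise for DP" item, the sketch below adds Laplace noise with scale `sensitivity / epsilon` to each value of a clipped numeric column. Whether SynthForge calibrates its noise this way is an assumption; the function names and the clipping bounds are hypothetical. It uses only the standard library, and with NumPy the same draw is available through `numpy.random.Generator.laplace`.

```python
import random


def laplace_sample(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution centred on zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_laplace_noise(values, epsilon, lower, upper, seed=None):
    """Clip values to [lower, upper], then add Laplace noise per value.

    The sensitivity of a single clipped value is (upper - lower), so the
    noise scale is (upper - lower) / epsilon. Smaller epsilon = more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = (upper - lower) / epsilon
    noisy = []
    for v in values:
        clipped = min(max(v, lower), upper)
        noisy.append(clipped + laplace_sample(scale, rng))
    return noisy


def mean_abs_error(original, noisy):
    """Average absolute difference between original and noisy values."""
    return sum(abs(a - b) for a, b in zip(original, noisy)) / len(original)


# 200 made-up ages, all inside the assumed bounds [18, 90].
ages = [23, 35, 41, 29, 52, 38, 47, 31] * 25

strong_privacy = add_laplace_noise(ages, epsilon=0.1, lower=18, upper=90, seed=0)
weak_privacy = add_laplace_noise(ages, epsilon=10.0, lower=18, upper=90, seed=0)

print(f"epsilon=0.1  mean abs error: {mean_abs_error(ages, strong_privacy):.1f}")
print(f"epsilon=10.0 mean abs error: {mean_abs_error(ages, weak_privacy):.1f}")
```

The trade-off is visible in the two error figures: a small epsilon gives stronger privacy but distorts individual values heavily, so the setting should be chosen with the downstream use of the data in mind.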

Roadmap

  • v1.1: Multi-modal support (text/images).
  • v1.2: API endpoints for integrations (e.g., Jupyter, Hugging Face).
  • v2.0: Federated privacy, compute optimization, bias auditing.
  • Long-term: Agentic workflows and a synthetic data marketplace.

We welcome contributions! See the Contributing section below.

Contributing

Fork the repo, create a branch, and submit a PR. Focus areas: bug fixes, new heuristics, and modality expansions. Let's build together!

License

MIT License – See LICENSE for details.

Contact

  • Hanish (Founder): LinkedIn | [email protected]
  • Issues/PRs: GitHub Issues
  • Feedback: Test the app and drop thoughts on LinkedIn or X!

Built with ❤️ from Jaipur, India. Join the forge—let's crush AI bottlenecks! 🚀
