🚀 Democratizing Synthetic Data for AI Builders
SynthForge is a privacy-first, lightweight tool that generates high-quality synthetic tabular data to overcome common AI bottlenecks like data scarcity, annotation challenges, and privacy concerns. Built as a Minimum Viable Product (MVP) in just 3 days, it's designed for indie developers, students, and teams with real-world constraints—making "infinite clean data" accessible without heavy compute or budgets.
Live Demo: synthforge.streamlit.app
In AI development, data preparation is often estimated to consume as much as 80% of project time (a figure attributed here to Gartner). SynthForge aims to cut that down with quick, ethical data generation:
- Solve Data Shortages: Create statistically similar datasets from small uploads.
- Prioritize Privacy: Built-in PII detection and differential privacy.
- Streamline Annotation: Auto-label with sentiment, binning, or clustering.
- Vision: Empower underrepresented builders (e.g., from Jaipur, India) to innovate globally.
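As a rough illustration of the binning style of auto-labeling listed above, the sketch below assigns a label to each numeric value from fixed cut points. The function name `bin_labels`, the cut points, and the label names are hypothetical and not taken from SynthForge's code; it uses only the standard library so it can be run on its own.

```python
from bisect import bisect_right


def bin_labels(values, edges, labels):
    """Assign each numeric value the label of the bin it falls into.

    `edges` are the interior cut points in ascending order, so there must
    be exactly one more label than cut points. A value equal to a cut
    point goes into the upper bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than cut points")
    return [labels[bisect_right(edges, v)] for v in values]


# Hypothetical age column and bin names, for illustration only.
ages = [17, 25, 40, 64, 65, 81]
print(bin_labels(ages, edges=[18, 40, 65],
                 labels=["minor", "young adult", "adult", "senior"]))
# → ['minor', 'young adult', 'adult', 'adult', 'senior', 'senior']
```

Fixed cut points keep the labels easy to explain; quantile-based or cluster-based labels would need the data itself to decide the boundaries.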
SynthForge is inspired by the growth of synthetic data (market: $1-2B by 2025) and by tools like Faker and SDV, but it focuses on simplicity and accessibility.
- Upload & Generate: Supports CSV/Excel (auto-samples large files for efficiency).
- Customization: Adjust variance for randomness, add differential privacy noise.
- Privacy Tools: Scans for emails/phones; optional epsilon-based anonymization.
- Auto-Labeling: Rule-based or LLM-enhanced (via free Groq tier) for sentiment on text, binning/clustering on numerics.
- Outputs: Download synthetic CSV + HTML report with stats comparisons (means, stds, top values).
- Lightweight: Runs smoothly on modest hardware—no GPUs required.
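To make the variance setting and the stats comparison a little more concrete, here is a minimal sketch of one way a numeric column could be resampled and then checked against the original. It assumes a simple normal fit, which may not be how SynthForge generates values; `synthesize_numeric`, `compare_stats`, and `variance_scale` are hypothetical names, and the standard library stands in for Pandas/NumPy to keep the example self-contained.

```python
import random
import statistics


def synthesize_numeric(values, n, variance_scale=1.0, seed=None):
    """Draw n synthetic values from a normal distribution fitted to `values`.

    `variance_scale` multiplies the fitted variance, so the standard
    deviation is scaled by its square root.
    """
    if variance_scale < 0:
        raise ValueError("variance_scale must be non-negative")
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values) * variance_scale ** 0.5
    return [rng.gauss(mu, sigma) for _ in range(n)]


def compare_stats(real, synthetic):
    """Return the mean and sample standard deviation of both columns."""
    return {
        "real": {"mean": statistics.fmean(real),
                 "std": statistics.stdev(real)},
        "synthetic": {"mean": statistics.fmean(synthetic),
                      "std": statistics.stdev(synthetic)},
    }


# Hypothetical age column, for illustration only.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]
synthetic_ages = synthesize_numeric(real_ages, n=1000, seed=0)
for name, stats in compare_stats(real_ages, synthetic_ages).items():
    print(f"{name}: mean={stats['mean']:.1f} std={stats['std']:.1f}")
```

Raising `variance_scale` above 1.0 spreads the synthetic values out, while 0.0 collapses them onto the mean; a normal fit also ignores skew and bounds, so real columns may need a different generator.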
- Clone the repo:
    git clone https://github.com/yourusername/synthforge.git
    cd synthforge
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install streamlit pandas numpy faker scikit-learn openpyxl requests
(Optional) For LLM-enhanced labeling: Sign up for a free Groq API key at groq.com and input it in the app sidebar.
Run the app locally:
streamlit run app.py
- Open in your browser (defaults to http://localhost:8501).
- Upload a file, tweak settings in the sidebar, and generate!
- For production: Deploy to Streamlit Community Cloud (formerly Streamlit Sharing) or a similar host, as done for the live demo.
Example: Upload a CSV with names, emails, ages—watch it generate synthetics while masking PII.
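For a sense of what a regex-based PII scan over such a file might look like, the sketch below flags columns containing emails or phone numbers and masks them in free text. The patterns, the `[EMAIL]`/`[PHONE]` placeholders, and the function names are assumptions made for illustration; SynthForge's actual rules may differ, and simple patterns like these will miss some formats.

```python
import re

# Deliberately simple patterns; real-world emails and phone numbers vary more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def mask_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def pii_columns(rows):
    """Return the sorted names of columns where any cell looks like PII."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            cell = str(value)
            if EMAIL_RE.search(cell) or PHONE_RE.search(cell):
                flagged.add(column)
    return sorted(flagged)


# Hypothetical rows, as they might come out of csv.DictReader.
rows = [
    {"name": "Asha Rao", "email": "asha@example.com", "age": 34,
     "notes": "Call +91 98765 43210 after 5pm"},
    {"name": "Ben Ito", "email": "ben.ito@example.org", "age": 27,
     "notes": "Prefers email"},
]
print(pii_columns(rows))          # → ['email', 'notes']
print(mask_pii(rows[0]["notes"])) # → Call [PHONE] after 5pm
```

Names are not caught by this kind of pattern matching, so a column like `name` would need a separate rule or a manual flag.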
- Frontend: Streamlit (simple, interactive UI)
- Core Libraries: Pandas (data handling), NumPy (stats), Faker (heuristic generation), scikit-learn (clustering/imputation)
- Privacy/Labeling: Regex for PII, Laplace noise for DP, Requests for optional Groq LLM
- Deployment: Streamlit Cloud (free tier)
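The Laplace-noise step can be sketched as follows. This is a minimal illustration under stated assumptions, not SynthForge's implementation: each value is clipped to a caller-supplied range, the noise scale is taken as range divided by epsilon, and `add_laplace_noise` and its parameters are hypothetical names. It uses the standard library rather than NumPy so it runs without extra installs.

```python
import random


def add_laplace_noise(values, epsilon, lower, upper, seed=None):
    """Clip each value to [lower, upper] and add Laplace noise.

    The noise scale is (upper - lower) / epsilon, so a smaller epsilon
    means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    rng = random.Random(seed)
    scale = (upper - lower) / epsilon
    noisy = []
    for value in values:
        clipped = min(max(value, lower), upper)
        # The difference of two independent Exp(1) draws is a standard
        # Laplace sample; multiplying by `scale` sets its spread.
        noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        noisy.append(clipped + noise)
    return noisy


# Hypothetical age column and bounds, for illustration only.
ages = [23, 35, 41, 29, 52]
print(add_laplace_noise(ages, epsilon=1.0, lower=0, upper=100, seed=42))
```

Noise applied per value like this can swamp the signal at small epsilon, and the overall privacy guarantee depends on how many columns and queries share the budget, so the setting is worth checking against the stats report.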
- v1.1: Multi-modal support (text/images).
- v1.2: API endpoints for integrations (e.g., Jupyter, Hugging Face).
- v2.0: Federated privacy, compute optimization, bias auditing.
- Long-term: Agentic workflows and a synthetic data marketplace.
We welcome contributions! See CONTRIBUTING.md (add if needed).
Fork the repo, create a branch, and submit a PR. Focus areas: Bug fixes, new heuristics, modality expansions. Let's build together!
MIT License – See LICENSE for details.
- Hanish (Founder): LinkedIn | [email protected]
- Issues/PRs: GitHub Issues
- Feedback: Test the app and drop thoughts on LinkedIn or X!
Built with ❤️ from Jaipur, India. Join the forge—let's crush AI bottlenecks! 🚀