We're generating a high-quality Ghanaian conversational AI dataset from news articles and research papers.
No coding experience needed. One command does everything.
git clone https://github.com/YOUR_USERNAME/ghana-llm-datagen.git
cd ghana-llm-datagen
python run.py --code YOUR_VOLUNTEER_CODE
Or, for faster processing, run the async version:
python run_async.py --code YOUR_VOLUNTEER_CODE
Your volunteer code is sent to you by the project owner.
That's it. The script will:
- ⬇️ Download your portion of the dataset automatically
- ⚙️ Generate conversations with a live progress bar
- 💾 Save results locally with auto-resume if interrupted
- 📤 Tell you how to submit when done
If the run is interrupted, just re-run the same command:
python run.py --code YOUR_VOLUNTEER_CODE
Or, if you are using the async version:
python run_async.py --code YOUR_VOLUNTEER_CODE
It resumes exactly where it left off. Nothing is lost.
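For the curious, here is a minimal sketch of how auto-resume over a .jsonl output file can work. It is not the actual code in run.py. The function names, the `id` and `conversation` fields, and the `generate` callback are all hypothetical and only illustrate the idea: read back what was already saved, then skip those records.

```python
import json
import os


def load_done_ids(path):
    """Return the set of record ids already present in the output file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A half-written line from an interrupted run is skipped,
                # so that record is simply processed again.
                continue
    return done


def process_all(items, path, generate):
    """Append one JSON line per item, skipping ids that are already saved.

    `generate` is a hypothetical callback that turns an item into a conversation.
    """
    done = load_done_ids(path)

    # If the previous run died mid-write, the file may not end with a newline.
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"

    with open(path, "a", encoding="utf-8") as out:
        if needs_newline:
            out.write("\n")
        for item in items:
            if item["id"] in done:
                continue
            record = {"id": item["id"], "conversation": generate(item)}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            out.flush()  # push each record to the OS before moving on
```

Because every finished record is appended as its own line, stopping the program loses at most the record that was in progress.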
When the run finishes, it will print submission instructions. You'll open a GitHub issue and attach your .jsonl results file.
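If you would like to sanity-check your results file before attaching it, the sketch below counts the valid lines in a .jsonl file. It only assumes the standard JSON Lines layout (one JSON value per line); the helper name `check_jsonl` and the file name in the usage note are hypothetical, and the project may run its own checks on submission.

```python
import json


def check_jsonl(path):
    """Return (valid_count, bad_line_numbers) for a JSON Lines file."""
    valid, bad = 0, []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                bad.append(number)  # remember which line failed to parse
    return valid, bad
```

For example, `valid, bad = check_jsonl("results.jsonl")` gives the number of readable records and the line numbers of any broken ones. Use the file name that the script actually prints.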
We have a dashboard to track progress here.
Q: How long will it take?
Typically 80–100 hours, depending on the API server speed. You can stop and resume the run with progress preserved.
Q: How long will it take with the async version?
Typically 48–72 hours, depending on the API server speed. You can stop and resume the run with progress preserved.
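The async version can be faster because it does not have to wait for one request to finish before sending the next. The sketch below illustrates the general idea with simulated requests. It is not the code in run_async.py; `fake_api_call`, the 0.05 second delay, and the limit of 5 requests in flight are assumptions made only for this example.

```python
import asyncio


async def fake_api_call(i, delay=0.05):
    """Stand-in for one network request; it only waits, then returns a value."""
    await asyncio.sleep(delay)
    return i * 2


async def run_concurrently(n, limit=5):
    """Run n fake requests with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def worker(i):
        async with sem:  # wait for a free slot before starting the request
            return await fake_api_call(i)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(worker(i) for i in range(n)))


async def run_one_at_a_time(n):
    """Run n fake requests strictly one after another."""
    return [await fake_api_call(i) for i in range(n)]
```

With 20 simulated requests, `asyncio.run(run_one_at_a_time(20))` waits for each request in turn, while `asyncio.run(run_concurrently(20))` overlaps up to five waits at a time and returns the same results sooner.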
Q: Is the code safe? Is it an API key?
Your code is a volunteer-specific token that encodes your batch assignment and a temporary API key.
Q: Can I run on a server / VM / cloud?
Yes! Any machine with Python 3.10+ and internet access works.
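To confirm that a machine meets the Python 3.10+ requirement, a small check like the following can help. The helper name `python_is_supported` is hypothetical and is not part of the project scripts.

```python
import sys


def python_is_supported(version_info=sys.version_info, minimum=(3, 10)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(map(str, sys.version_info[:3]))
    status = "OK" if python_is_supported() else "please upgrade to 3.10+"
    print(f"Python {found}: {status}")
```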
Q: What if I see lots of warnings or errors?
Some failures are normal — the script retries automatically. As long as the progress bar is moving, you're fine.
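As an illustration of why occasional warnings are harmless, here is a minimal sketch of a retry loop with exponential backoff. The real script's retry logic may differ. The function name, the attempt limit, and the delays are assumptions made for this example.

```python
import random
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait longer after each failure; the jitter spreads out retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Warning: attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Each failed attempt prints a warning and then tries again, which is why a run can show many warnings and still make steady progress.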
Q: What GPU / hardware do I need?
None. All computation happens on NVIDIA's API servers. You just need a normal laptop or PC.
All volunteers will be credited in the final dataset release. Thank you for helping build AI resources for Ghana! 🇬🇭