Commit 6178bb0 (initial file upload, 0 parents)

17 files changed: 2013 additions & 0 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+data/nextcoder-synthetic.jsonl
+notebook.ipynb

README.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# NextCoder
+
+## Assets
+- Synthetic Dataset [Only New]
+- Models
+- Training Recipe

data/README.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Data Generation
+
+## Folder Structure
+- `prompts` contains all the different prompts used in synthetic data generation
+- `config` contains the YAML file that maps prompts to their corresponding locations
+- `utils.py` contains the helper code to extract and parse data from LLM responses
+- `data_pipeline.py` contains the main source code for generating synthetic data according to the pipeline explained in our paper
+
+# Usage
+- Make sure the required packages are installed via the `environment.yaml` file provided at the root folder
+- Run the following command to generate data with the Llama-3.3-70B-Instruct model:
+```bash
+python data_pipeline.py --output_dir /path_to_output --language "python" --data_path /path_to_seed_code --llm_path huggingface/local path to LLM
+```

data/config/prompts.yml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+prompt_directory: "prompts/prompts-2"
+prompts:
+  problem_code: "problem-code.txt"
+  code_edit_generation: "code-edit.txt"
+  generate_instructions: "instructions.txt"
+  quality_check: "quality-check.txt"
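The config above maps each pipeline stage to a prompt template file under `prompt_directory`. A minimal sketch of how such a config might be consumed (assuming PyYAML is installed; the `load_prompts` helper is hypothetical, not part of the repository):

```python
import os
import yaml  # PyYAML; assumed available via environment.yaml


def load_prompts(config_path="config/prompts.yml"):
    """Map each prompt name (e.g. 'problem_code') to the text of its template file."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    prompt_dir = cfg["prompt_directory"]
    prompts = {}
    for name, filename in cfg["prompts"].items():
        # Resolve each template relative to the configured prompt directory
        with open(os.path.join(prompt_dir, filename)) as pf:
            prompts[name] = pf.read()
    return prompts
```

This indirection lets alternate prompt sets (e.g. a different `prompts-N` directory) be swapped in by editing only the YAML, not the pipeline code.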
