Commit 6178bb0 (initial file upload, 0 parents)

17 files changed: 2013 additions & 0 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+data/nextcoder-synthetic.jsonl
+notebook.ipynb

README.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# NextCoder
+
+## Assets
+- Synthetic Dataset [Only New]
+- Models
+- Training Recipe

data/README.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Data Generation
+
+## Folder Structure
+- `prompts` contains all the different prompts used in synthetic data generation
+- `config` contains the YAML file that maps prompts to their corresponding locations
+- `utils.py` contains the helper code to extract and parse data from LLM responses
+- `data_pipeline.py` contains the main source code for generating synthetic data according to the pipeline explained in our paper
+
+# Usage
+- Make sure the required packages are installed via the `environment.yaml` file provided at the root folder
+- Run the following command to generate data with the Llama-3.3-70B-Instruct model:
+```bash
+python data_pipeline.py --output_dir /path_to_output --language "python" --data_path /path_to_seed_code --llm_path huggingface/local path to LLM
+```

data/config/prompts.yml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+prompt_directory: "prompts/prompts-2"
+prompts:
+  problem_code: "problem-code.txt"
+  code_edit_generation: "code-edit.txt"
+  generate_instructions: "instructions.txt"
+  quality_check: "quality-check.txt"
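The config above maps each pipeline stage to a prompt template file under `prompt_directory`. A minimal sketch of how such a config might be consumed (assuming PyYAML is installed; the `load_prompts` helper is hypothetical, not part of the repository):

```python
import os
import yaml  # PyYAML; assumed available via environment.yaml


def load_prompts(config_path="config/prompts.yml"):
    """Map each prompt name (e.g. 'problem_code') to the text of its template file."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    prompt_dir = cfg["prompt_directory"]
    prompts = {}
    for name, filename in cfg["prompts"].items():
        # Resolve each template relative to the configured prompt directory
        with open(os.path.join(prompt_dir, filename)) as pf:
            prompts[name] = pf.read()
    return prompts
```

This indirection lets alternate prompt sets (e.g. a different `prompts-N` directory) be swapped in by editing only the YAML, not the pipeline code.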
