
Commit c4ff5f0 (1 parent: e004a66)

adding assets

4 files changed: 12 additions(+), 4 deletions(-)


.gitignore

Lines changed: 2 additions & 1 deletion

@@ -1,4 +1,5 @@
 data/nextcoder-synthetic.jsonl
 notebook.ipynb
 git-credential-manager
-models
+models
+*.parquet
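The new `*.parquet` entry ignores Parquet data files. As a rough sketch of how glob-style ignore patterns match (real gitignore semantics are richer: directory rules, anchoring, negation), here is an illustration using Python's `fnmatch`; the `is_ignored` helper is hypothetical, not part of this repository:

```python
from fnmatch import fnmatch

# Patterns from the updated .gitignore (simplified subset)
ignore_patterns = ["models", "*.parquet"]

def is_ignored(path, patterns=ignore_patterns):
    # Hypothetical helper: glob-match the path against each pattern.
    # Note fnmatch's "*" also crosses "/" separators, unlike real gitignore.
    return any(fnmatch(path, p) for p in patterns)

print(is_ignored("data/train.parquet"))  # -> True
print(is_ignored("notebook.ipynb"))      # -> False
```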

README.md

Lines changed: 10 additions & 3 deletions

@@ -1,14 +1,14 @@
 # NextCoder
 
 <p align="center">
-🤗 <a href="https://huggingface.co/microsoft">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2503.03656">Paper</a>
+🤗 <a href="https://huggingface.co/collections/microsoft/nextcoder-6815ee6bfcf4e42f20d45028">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2503.03656">Arxiv</a>
 </p>
 
 ## Introduction
 This repository hosts the official code and data artifact for the paper [NextCoder: Robust Learning of Diverse Code Edits
 ](https://arxiv.org/abs/2503.03656)
 
-The work is the development of code-editing LLMs, synthetic data generation pipeline and a novel finetuning methodology.
+The work is the development of code-editing LLMs, synthetic data generation pipeline and a novel finetuning methodology called **Selective Knowledge Transfer (SeleKT)**.
 
 ## Repository Structure
 - [data](data/): contains the scripts and files required to generate synthetic dataset for code-editing as per the pipeline proposed in the paper

@@ -18,7 +18,7 @@ The work is the development of code-editing LLMs, synthetic data generation pipe
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_name = "Microsoft/NextCoder-7B"
+model_name = "microsoft/NextCoder-7B"
 
 model = AutoModelForCausalLM.from_pretrained(
     model_name,

@@ -56,6 +56,9 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 ## Evaluation and Performanc
 
+![](assets/aider-polyglot.png)
+*Comparison of NextCoder-32B models with other models*
+
 | Models | HUMANEVALEDIT | CANITEDIT | AIDER | POLYGLOT |
 |--------|---------------|-----------|-------|----------|
 | QwenCoder-2.5-3B | 73.2 | 37.1 | 36.8 | - |

@@ -73,6 +76,10 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 *Comparison of base QwenCoder-2.5 models of different sizes and their SELEKT-enhanced versions across three code editing benchmarks.*
 
+<img src="assets/spider-plot.png" width=400></img>
+
+**A detailed evaluation and ablations can be found in our paper**
+
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
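The README now names the finetuning methodology Selective Knowledge Transfer (SeleKT). As a toy illustration only (the details below are assumptions for exposition, not the paper's actual algorithm), the general idea of *selectively* transferring finetuned knowledge can be sketched as keeping just the largest-magnitude weight deltas between a finetuned and a base model:

```python
import numpy as np

def selekt_step(base, finetuned, alpha=0.3):
    """Toy sketch (hypothetical simplification, see the paper for the
    real method): keep only the top-alpha fraction of weight deltas by
    magnitude and apply them on top of the base weights."""
    delta = finetuned - base
    k = max(1, int(alpha * delta.size))
    # Magnitude of the k-th largest |delta| serves as the cutoff.
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh
    return base + mask * delta

base = np.zeros(10)
finetuned = np.array([0.1, -2.0, 0.05, 3.0, 0.0, 0.2, -0.1, 0.4, 1.5, -0.3])
merged = selekt_step(base, finetuned, alpha=0.3)
print(merged)  # only the 3 largest-magnitude deltas survive
```

With `alpha=0.3` on 10 parameters, only the three largest deltas (3.0, -2.0, 1.5) are transferred; all other positions stay at the base values.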

assets/aider-polyglot.png (138 KB, binary file added)

assets/spider-plot.png (305 KB, binary file added)

0 commit comments
