This repository contains the data and code for the paper "Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing", published in ACM Transactions on Software Engineering and Methodology.
Our data is published via Figshare. Please download the data from here and put it into the `data` folder before running the experiments.
To replicate the results of our RQ2 (Tables 11-14), please use the following command:

```shell
python3 rq2.py
```

To draw Figure 3, please use the following command:

```shell
python3 rq3_1.py
```

To draw Figure 4, please use the following command:

```shell
python3 rq3_21.py
```

To draw Figure 5, please use the following command:

```shell
python3 rq3_22.py
```

To draw Figure 6 and reproduce Table 16, please use the following command:

```shell
python3 rq3_3.py
```

To replicate the results of our RQ4 (Tables 17-19), please use the following command:

```shell
python3 rq4.py
```
To replicate the results of our RQ5 (Table 20 and Figure 7), please use the following command:

```shell
python3 rq5.py
```
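If you want to reproduce everything in one go, the commands above can be chained in a small driver script. This is a sketch, not part of the artifact; it assumes you run it from the repository root after placing the Figshare data in `data/`:

```python
import subprocess
import sys

# The RQ scripts in the order the README lists them.
RQ_SCRIPTS = ["rq2.py", "rq3_1.py", "rq3_21.py", "rq3_22.py",
              "rq3_3.py", "rq4.py", "rq5.py"]

def run_all(scripts=RQ_SCRIPTS):
    """Run each script with the current Python interpreter; stop on first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        if subprocess.run([sys.executable, script]).returncode != 0:
            print(f"{script} exited with a non-zero status")
            return False
    return True
```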
Data collected from our human study is in the `data/human_study` folder. In particular:

- Interview:
  - Transcripts: `data/human_study/interview/Transcript`
  - Themes with their associated main themes: `data/human_study/interview/Final_Themes.xlsx`
  - Card sorting discussion results: `data/human_study/interview/A1(A2)_Categories.txt`
- Survey:
  - Raw data: `data/human_study/survey/survey.json`
  - Example:

    ```
    "naming": {            // transformation level
        "1": {             // ID of a transformation in this level
            "S9": {        // ID of the participant
                "CR": 3,       // assessment for Code Readability
                "CC": 1,       // assessment for Code Convention
                "Time": 10.56  // completion time
            },
            ...
        },
        ...
    }
    ```
In this work, we used the following 225 bugs from the Defects4J dataset:
- Chart: 1, 3, 6, 8, 9, 10, 11, 12, 13, 17, 20, 24
- Cli: 4, 5, 8, 11, 25, 32
- Closure: 10, 11, 14, 18, 20, 35, 38, 46, 51, 52, 55, 57, 62, 65, 70, 71, 73, 77, 81, 83, 92, 97, 104, 109, 111, 113, 122, 123, 124, 125, 126, 130, 132, 133, 150, 152, 159, 168
- Codec: 2, 3, 7, 9, 10, 17, 18
- Compress: 5, 12, 13, 14, 19, 23, 26, 27, 31, 36, 37, 38, 45, 46
- Csv: 1, 2, 3, 5, 6, 9, 11, 14, 15
- Gson: 6, 10, 11, 12, 13, 15, 17
- JacksonCore: 3, 4, 5, 6, 8, 25, 26
- JacksonDatabind: 5, 12, 16, 17, 19, 27, 33, 34, 37, 39, 45, 46, 49, 51, 57, 58, 70, 71, 76, 82, 88, 93, 96, 97, 98, 99, 102
- JacksonXml: 4, 5
- Jsoup: 1, 10, 13, 19, 26, 27, 32, 33, 34, 37, 40, 41, 43, 45, 46, 47, 49, 51, 54, 57, 61, 68, 75, 77, 84, 86
- JxPath: 5, 8, 10, 12
- Lang: 6, 9, 14, 16, 21, 22, 24, 26, 28, 29, 33, 37, 38, 39, 40, 43, 44, 49, 52, 54, 57, 58, 59, 61
- Math: 9, 11, 17, 30, 32, 33, 41, 45, 50, 53, 56, 57, 58, 59, 63, 69, 70, 75, 80, 82, 85, 89, 91, 94, 96, 101
- Mockito: 5, 12, 18, 22, 27, 28, 29, 33, 34, 38
- Time: 4, 14, 15, 16, 19, 24
Data collected from our repair experiments is in the `data/plausible_patches` folder. In particular:

- Naming format: `{transformation_level}-{repair_tool}.xlsx`
- Columns in this data:
  - `ID`: ID of the transformation
  - `Bug_id`: ID of the original bug in Defects4J
  - `generated_diff`: the patch generated by the repair tool
  - `developer_diff`: the patch written by developers, extracted from the Defects4J dataset
  - `Annotation`: correctness assessment (`yes` means correct, `no` means plausible but incorrect)
- Any ID that does not exist in this data means that the repair tool did not produce any plausible patch for that transformation.
- These results were obtained by running Cerberus (SHA: baed4074cdc1b0ff6b6c99619dbe70f508ec4004, dev branch) on the repair dataset in `data/repair_dataset`. Please follow the instructions in Cerberus and use the configurations presented in the paper to reproduce these results.
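As a sketch of how these spreadsheets might be consumed, the snippet below counts correct versus merely plausible patches. The column names come from the list above; the in-memory rows (and their values) are made up for illustration, standing in for rows read from one of the `.xlsx` files:

```python
# Illustrative rows mimicking the columns of {transformation_level}-{repair_tool}.xlsx.
rows = [
    {"ID": 1, "Bug_id": "Chart-1", "generated_diff": "...",
     "developer_diff": "...", "Annotation": "yes"},
    {"ID": 2, "Bug_id": "Chart-3", "generated_diff": "...",
     "developer_diff": "...", "Annotation": "no"},
]

def count_correct(rows):
    """Count patches annotated as correct ('yes') vs merely plausible ('no')."""
    correct = sum(1 for r in rows if r["Annotation"] == "yes")
    return correct, len(rows) - correct

print(count_correct(rows))  # → (1, 1)
```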
Our transformation data is stored in `data/repair_dataset/naturaltransform`. This dataset is generated by our tool CodeTransform (`tools/CodeTransform`), which extends SPAT. Please follow the instructions in `tools/CodeTransform/README.md` to reproduce this dataset. If you are interested in our code transformation tool, please access the latest version at this link.
Cross-entropy values for the original and transformed programs are stored in `data/entropy`. These results are generated using our tool CodeNaturalnessEvaluator (`tools/CodeNaturalnessEvaluator`). Please follow the instructions in `tools/CodeNaturalnessEvaluator/README.md` to reproduce these results. If you are interested in our naturalness assessment metric, please access the latest version at this link.
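The exact computation lives in `tools/CodeNaturalnessEvaluator`; as a rough illustration only, the cross-entropy of a program is commonly defined as the average negative log-probability that a language model assigns to its tokens (lower means the code looks more "natural" to the model):

```python
import math

def cross_entropy(token_probs):
    """Average negative log2-probability of a token sequence.

    token_probs: per-token probabilities assigned by a language model
    (hypothetical values here; the artifact's tool computes the real ones).
    """
    if not token_probs:
        raise ValueError("empty token sequence")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# A sequence where every token has probability 0.5 costs exactly 1 bit/token.
print(cross_entropy([0.5, 0.5, 0.5]))  # → 1.0
```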
Please cite the following article if you find our research, including the findings, datasets, and tools, useful:
```bibtex
@article{10.1145/3716167,
  author = {Le-Cong, Thanh and Nguyen, Thanh-Dat and Le, Bach and Murray, Toby},
  title = {Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing},
  year = {2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {1049-331X},
  url = {https://doi.org/10.1145/3716167},
  doi = {10.1145/3716167},
  note = {Just Accepted},
  journal = {ACM Trans. Softw. Eng. Methodol.},
  month = feb,
}
```