This repository contains the data and code for the paper "Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing", published in ACM Transactions on Software Engineering and Methodology.
Our data is published via Figshare. Please download the data from here and put it into the `data` folder before running the experiments.
To replicate the results of our RQ2 (Tables 11-14), please use the following command:

```shell
python3 rq2.py
```

To draw Figure 3, please use the following command:

```shell
python3 rq3_1.py
```

To draw Figure 4, please use the following command:

```shell
python3 rq3_21.py
```

To draw Figure 5, please use the following command:

```shell
python3 rq3_22.py
```

To draw Figure 6 and reproduce Table 16, please use the following command:

```shell
python3 rq3_3.py
```

To replicate the results of our RQ4 (Tables 17-19), please use the following command:

```shell
python3 rq4.py
```
To replicate the results of our RQ5 (Table 20 and Figure 7), please use the following command:

```shell
python3 rq5.py
```
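If you want to reproduce everything in one go, the commands above can be chained in a small driver script. This is a sketch, not part of the artifact; it assumes you run it from the repository root after placing the Figshare data in `data/`:

```python
import subprocess
import sys

# The RQ scripts in the order the README lists them.
RQ_SCRIPTS = ["rq2.py", "rq3_1.py", "rq3_21.py", "rq3_22.py",
              "rq3_3.py", "rq4.py", "rq5.py"]

def run_all(scripts=RQ_SCRIPTS):
    """Run each script with the current Python interpreter; stop on first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        if subprocess.run([sys.executable, script]).returncode != 0:
            print(f"{script} exited with a non-zero status")
            return False
    return True
```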
Data collected from our human study is in the `data/human_study` folder. In particular:

- Interview:
  - Transcripts: `data/human_study/interview/Transcript`
  - Themes with their associated main themes: `data/human_study/interview/Final_Themes.xlsx`
  - Card sorting discussion results: `data/human_study/interview/A1(A2)_Categories.txt`
- Survey:
  - Raw data: `data/human_study/survey/survey.json`
  - Example:

    ```
    "naming": {            // transformation level
        "1": {             // ID of a transformation in this level
            "S9": {        // ID of the participant
                "CR": 3,       // assessment for Code Readability
                "CC": 1,       // assessment for Code Convention
                "Time": 10.56  // completion time
            },
            ...
        },
        ...
    }
    ```
In this work, we used the following 225 bugs from the Defects4J dataset:
- Chart: 1, 3, 6, 8, 9, 10, 11, 12, 13, 17, 20, 24
- Cli: 4, 5, 8, 11, 25, 32
- Closure: 10, 11, 14, 18, 20, 35, 38, 46, 51, 52, 55, 57, 62, 65, 70, 71, 73, 77, 81, 83, 92, 97, 104, 109, 111, 113, 122, 123, 124, 125, 126, 130, 132, 133, 150, 152, 159, 168
- Codec: 2, 3, 7, 9, 10, 17, 18
- Compress: 5, 12, 13, 14, 19, 23, 26, 27, 31, 36, 37, 38, 45, 46
- Csv: 1, 2, 3, 5, 6, 9, 11, 14, 15
- Gson: 6, 10, 11, 12, 13, 15, 17
- JacksonCore: 3, 4, 5, 6, 8, 25, 26
- JacksonDatabind: 5, 12, 16, 17, 19, 27, 33, 34, 37, 39, 45, 46, 49, 51, 57, 58, 70, 71, 76, 82, 88, 93, 96, 97, 98, 99, 102
- JacksonXml: 4, 5
- Jsoup: 1, 10, 13, 19, 26, 27, 32, 33, 34, 37, 40, 41, 43, 45, 46, 47, 49, 51, 54, 57, 61, 68, 75, 77, 84, 86
- JxPath: 5, 8, 10, 12
- Lang: 6, 9, 14, 16, 21, 22, 24, 26, 28, 29, 33, 37, 38, 39, 40, 43, 44, 49, 52, 54, 57, 58, 59, 61
- Math: 9, 11, 17, 30, 32, 33, 41, 45, 50, 53, 56, 57, 58, 59, 63, 69, 70, 75, 80, 82, 85, 89, 91, 94, 96, 101
- Mockito: 5, 12, 18, 22, 27, 28, 29, 33, 34, 38
- Time: 4, 14, 15, 16, 19, 24
Data collected from our repair experiments is in the `data/plausible_patches` folder. In particular:

- Naming format: `{transformation_level}-{repair_tool}.xlsx`
- Columns in this data:
  - `ID`: ID of the transformation
  - `Bug_id`: ID of the original bug in Defects4J
  - `generated_diff`: the patch generated by the repair tool
  - `developer_diff`: the patch written by developers, extracted from the Defects4J dataset
  - `Annotation`: correctness assessment (`yes` means correct, `no` means plausible but incorrect)
- Any ID that does not exist in this data means that the repair tool did not produce any plausible patch for that transformation.
- These results were obtained by running Cerberus (SHA: baed4074cdc1b0ff6b6c99619dbe70f508ec4004, dev branch) on the repair dataset in `data/repair_dataset`. Please follow the instructions in Cerberus and use the configurations presented in the paper to reproduce these results.
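As a sketch of how these spreadsheets might be consumed, the snippet below counts correct versus merely plausible patches. The column names come from the list above; the in-memory rows (and their values) are made up for illustration, standing in for rows read from one of the `.xlsx` files:

```python
# Illustrative rows mimicking the columns of {transformation_level}-{repair_tool}.xlsx.
rows = [
    {"ID": 1, "Bug_id": "Chart-1", "generated_diff": "...",
     "developer_diff": "...", "Annotation": "yes"},
    {"ID": 2, "Bug_id": "Chart-3", "generated_diff": "...",
     "developer_diff": "...", "Annotation": "no"},
]

def count_correct(rows):
    """Count patches annotated as correct ('yes') vs merely plausible ('no')."""
    correct = sum(1 for r in rows if r["Annotation"] == "yes")
    return correct, len(rows) - correct

print(count_correct(rows))  # → (1, 1)
```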
Our transformation data is stored in `data/repair_dataset/naturaltransform`. This dataset is generated by our tool CodeTransform (`tools/CodeTransform`), which extends SPAT. Please follow the instructions in `tools/CodeTransform/README.md` to reproduce this dataset. If you are interested in our code transformation tool, please access the latest version at this link.
Cross-entropy values for the original and transformed programs are stored in `data/entropy`. These results are generated using our tool CodeNaturalnessEvaluator (`tools/CodeNaturalnessEvaluator`). Please follow the instructions in `tools/CodeNaturalnessEvaluator/README.md` to reproduce these results. If you are interested in our naturalness assessment metric, please access the latest version at this link.
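The exact computation lives in `tools/CodeNaturalnessEvaluator`; as a rough illustration only, the cross-entropy of a program is commonly defined as the average negative log-probability that a language model assigns to its tokens (lower means the code looks more "natural" to the model):

```python
import math

def cross_entropy(token_probs):
    """Average negative log2-probability of a token sequence.

    token_probs: per-token probabilities assigned by a language model
    (hypothetical values here; the artifact's tool computes the real ones).
    """
    if not token_probs:
        raise ValueError("empty token sequence")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# A sequence where every token has probability 0.5 costs exactly 1 bit/token.
print(cross_entropy([0.5, 0.5, 0.5]))  # → 1.0
```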
Please cite the following article if you find our research, including the findings, datasets, and tools, useful:
```bibtex
@article{10.1145/3716167,
  author = {Le-Cong, Thanh and Nguyen, Thanh-Dat and Le, Bach and Murray, Toby},
  title = {Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing},
  year = {2025},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  issn = {1049-331X},
  url = {https://doi.org/10.1145/3716167},
  doi = {10.1145/3716167},
  note = {Just Accepted},
  journal = {ACM Trans. Softw. Eng. Methodol.},
  month = feb,
}
```