This is the codebase for the paper "VulInject: Multi-Type Samples Generation for Learning-based Vulnerability Detection".
VulInject
├── data
│ ├── pattern <- Vulnerability Pattern
│ └── programs <- Benign programs' source code
│ ├── vim
│ └── ...
├── result
│ └── generated_vulnerable_programs <- Generated vulnerable programs' information
│ ├── vim
│ │ ├── c_origin <- Original C files
│ │ ├── c_vul <- Generated vulnerable C files
│ │ └── match_result <- Detailed vulnerability information, including critical variable info, CVE type, vulnerable code slice, and modified statement line info
│ └── ...
├── src
| ├── copydetect <- Source code of Copydetect
| ├── ctags <- Source code of Ctags
| ├── get_code_slice <- Source code for syntax matching, semantic matching and code slicing
| │ ├── match.sh <- Shell script for syntax matching and semantic matching
| │ ├── get_slice.py <- Generate code slices
| │ └── ...
| ├── joern-0.3.1 <- Source code of Joern
| ├── neo4j <- Source code of neo4j
| ├── pattern_application
| │ └── injection.py <- Inject vulnerabilities into programs using patterns
| └── type_labeling
| ├── LLM.py <- Vulnerability type labeling using ChatGPT
| ├── vulsample.py <- Module for extracting patch statement types and critical variable types
| └── ...
├── experiments
| ├── binary_models <- Source code for binary model training and testing
| | └── data <- Data for binary model training and testing
│ └── multiclass_models
| ├── PDBERT <- Source code for PDBERT
| └── VulBERTa <- Source code for VulBERTa
├── base_env.yml <- Conda base environment configuration file
├── OPENAI_env.yml <- Conda OPENAI environment configuration file
├── vulinject_env.yml <- Conda vulinject environment configuration file
├── run.sh <- Shell script for the entire project
└── README.md
Install necessary dependencies before running the project:
- JAVA (jdk1.8.0_161)
- ant (1.9.14)
- Joern (0.3.1)
- Universal Ctags (6.1.0)
- Copydetect
- Neo4j (2.1.5)
This section gives the steps and explanations for getting the project running.
$ git clone https://github.com/VulInject/VulInject.git
You should install the prerequisites and add them to the system path.
export JAVA_HOME=/usr/java/jdk1.8.0_161
export JRE_HOME=/usr/java/jdk1.8.0_161/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export ANT_HOME=/usr/ant/apache-ant-1.9.14
export PATH=$PATH:$ANT_HOME/bin
We provide the source code of Joern, Ctags, Copydetect, and Neo4j in the src folder, so you can use them directly.
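Before moving on, it can help to confirm that the exported variables and tools are visible to your shell. The following is a minimal sketch, not part of the repository; the executable names (`java`, `ant`, `ctags`) and the function name `missing_prerequisites` are assumptions, so adjust them to match your installation.

```python
import os
import shutil

# Executable and variable names are assumptions; adjust them to your setup.
REQUIRED_TOOLS = ["java", "ant", "ctags"]
REQUIRED_ENV_VARS = ["JAVA_HOME", "ANT_HOME"]


def missing_prerequisites(tools=REQUIRED_TOOLS, env_vars=REQUIRED_ENV_VARS):
    """Return (missing_tools, missing_env_vars) for the current environment."""
    # shutil.which returns None when the executable is not on PATH.
    missing_tools = [t for t in tools if shutil.which(t) is None]
    # An unset or empty variable counts as missing.
    missing_vars = [v for v in env_vars if not os.environ.get(v)]
    return missing_tools, missing_vars


if __name__ == "__main__":
    tools, variables = missing_prerequisites()
    for name in tools:
        print(f"not found on PATH: {name}")
    for name in variables:
        print(f"environment variable not set: {name}")
    if not tools and not variables:
        print("all checked prerequisites were found")
```

This only checks that the tools can be located; it does not verify the versions listed above.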
Create three new conda environments.
conda env create -f base_env.yml
conda env create -f OPENAI_env.yml
conda env create -f vulinject_env.yml
You should modify the paths in the configuration file (src/get_code_slice/config.json) to ensure successful code slicing.
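One way to catch a path you forgot to update is to scan the configuration file for string values that look like paths and do not exist on your machine. The sketch below is an illustration only: it makes no assumption about the key names in config.json and simply walks every value; the helper names `iter_string_values` and `missing_paths` are hypothetical, and treating any string that contains a path separator as a path is a heuristic.

```python
import json
import os


def iter_string_values(node, prefix=""):
    """Yield (key_path, value) for every string found in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_string_values(value, f"{prefix}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_string_values(value, f"{prefix}/{index}")
    elif isinstance(node, str):
        yield prefix, node


def missing_paths(config):
    """Return entries whose value looks like a path but does not exist on disk."""
    return [
        (key, value)
        for key, value in iter_string_values(config)
        # Heuristic: only strings containing a path separator are checked.
        if os.sep in value and not os.path.exists(value)
    ]


if __name__ == "__main__":
    config_file = "src/get_code_slice/config.json"
    if os.path.exists(config_file):
        with open(config_file, encoding="utf-8") as handle:
            for key, value in missing_paths(json.load(handle)):
                print(f"{key}: {value} does not exist")
    else:
        print(f"{config_file} not found; run this from the repository root")
```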
- Put your target programs' repository in data/programs
- We provide a shell script for you to inject vulnerabilities into the target programs automatically.
bash run.sh
- Generated vulnerabilities are stored in result/generated_vulnerable_programs/<Your Target program's name>
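To inspect the output, you may want to pair each generated vulnerable file with its original. The following is a minimal sketch based on the c_origin and c_vul folders shown in the directory tree; it assumes that a generated file keeps the name of its original file, which this README does not state, and the function name `pair_generated_files` is hypothetical.

```python
from pathlib import Path


def pair_generated_files(program_dir):
    """Map each file name in c_vul to its (original, vulnerable) paths.

    Assumption: a generated file has the same name as its original file.
    The original is None when no file of that name exists in c_origin.
    """
    program_dir = Path(program_dir)
    origin_dir = program_dir / "c_origin"
    vul_dir = program_dir / "c_vul"
    if not vul_dir.is_dir():
        return {}
    pairs = {}
    for vul_file in sorted(vul_dir.glob("*.c")):
        origin_file = origin_dir / vul_file.name
        pairs[vul_file.name] = (
            origin_file if origin_file.exists() else None,
            vul_file,
        )
    return pairs


if __name__ == "__main__":
    # "vim" is the example target program from the directory tree.
    root = Path("result/generated_vulnerable_programs/vim")
    for name, (origin, vul) in pair_generated_files(root).items():
        print(vul, "<-", origin if origin else "no matching original")
```

The resulting pairs can be passed to a diff tool to view the injected changes next to the details recorded in match_result.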
For binary models, we use VulCNN, VulBERTa, LineVul and Devign. For multiclass models, we use PDBERT and VulBERTa.
The source code for these models can be found in the experiments directory.
The complete experimental source code and trained models have been uploaded to Zenodo and are available at: https://zenodo.org/records/18811174?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImJiOWQ1YWExLTI2YTYtNDQ1Yi1hOTYwLTk5Nzk5Y2IzODQ4MCIsImRhdGEiOnt9LCJyYW5kb20iOiJhN2ZjMWI0OWM0ODViZmY2MmQxNGI5NmRjMzk2MTkzNiJ9.84AVkgnjPXAnDkjuW1lrboIo1blG7yy-Eus5t9drSvgXO3Q6OpzvZD2d8iCXgbFdGd8WPvg62x3N-fJ1M91UaA
- The 352 vulnerability patterns we extracted are in data/pattern.
- For the training and testing of binary and multiclass classification models, we use the PrimeVul dataset as the baseline. The data is located in the respective model directories. Please refer to the README file of each model for details.
- We will make all generated datasets publicly available upon the acceptance of this paper.