VulInject

This is the codebase for the paper "VulInject: Multi-Type Samples Generation for Learning-based Vulnerability Detection".

Structure

VulInject
├── data
│   ├── pattern                             <- Vulnerability patterns
│   └── programs                            <- Source code of benign programs
│       ├── vim
│       └── ...
├── result
│   └── generated_vulnerable_programs       <- Information about generated vulnerable programs
│       ├── vim
│       │   ├── c_origin                    <- Original C files
│       │   ├── c_vul                       <- Generated vulnerable C files
│       │   └── match_result                <- Detailed vulnerability information: critical variables, CVE type, vulnerable code slice, and line numbers of modified statements
│       └── ...
├── src
│   ├── copydetect                          <- Source code of Copydetect
│   ├── ctags                               <- Source code of Ctags
│   ├── get_code_slice                      <- Source code for syntax matching, semantic matching, and code slicing
│   │   ├── match.sh                        <- Shell script for syntax matching and semantic matching
│   │   ├── get_slice.py                    <- Generates code slices
│   │   └── ...
│   ├── joern-0.3.1                         <- Source code of Joern
│   ├── neo4j                               <- Source code of Neo4j
│   ├── pattern_application
│   │   └── injection.py                    <- Injects vulnerabilities into programs using patterns
│   └── type_labeling
│       ├── LLM.py                          <- Vulnerability type labeling using ChatGPT
│       ├── vulsample.py                    <- Module for extracting the types of patch statements and critical variables
│       └── ...
├── experiments
│   ├── binary_models                       <- Source code for binary model training and testing
│   │   └── data                            <- Data for binary model training and testing
│   └── multiclass_models
│       ├── PDBERT                          <- Source code for PDBERT
│       └── VulBERTa                        <- Source code for VulBERTa
├── base_env.yml                            <- Conda base environment configuration file
├── OPENAI_env.yml                          <- Conda OPENAI environment configuration file
├── vulinject_env.yml                       <- Conda vulinject environment configuration file
├── run.sh                                  <- Shell script that runs the entire pipeline
└── README.md

Get Started

Prerequisites

Install necessary dependencies before running the project:

  • Java (JDK 1.8.0_161)
  • ant (1.9.14)
  • Joern (0.3.1)
  • Universal Ctags (6.1.0)
  • Copydetect
  • Neo4j (2.1.5)

Setup

This section walks through the steps for getting the project running, with a short explanation of each.

1) Clone this repo

git clone https://github.com/VulInject/VulInject.git

2) Install Prerequisites

Install the prerequisites and add them to the system path. For example, for Java and Ant:

export JAVA_HOME=/usr/java/jdk1.8.0_161
export JRE_HOME=/usr/java/jdk1.8.0_161/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export ANT_HOME=/usr/ant/apache-ant-1.9.14
export PATH=$PATH:$ANT_HOME/bin

The source code of Joern, Ctags, Copydetect, and Neo4j is provided in the src folder and can be used directly.
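To confirm that the exports took effect, a quick check along the following lines may help. This is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and it only checks `java` and `ant`, the two commands exposed by the exports above.

```shell
# Hypothetical helper (not part of VulInject): print the name of every
# command that cannot be found on PATH. Always returns 0.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# java and ant are the two commands made available by the exports above.
missing="$(missing_tools java ant)"
if [ -n "$missing" ]; then
    echo "Not found on PATH: $missing" >&2
fi
```

If nothing is printed, both commands resolve. Running `java -version` afterwards shows which JDK is actually picked up.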

3) Configure Conda

Create the three conda environments from the provided configuration files.

conda env create -f base_env.yml
conda env create -f OPENAI_env.yml 
conda env create -f vulinject_env.yml 
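`conda env create -f` takes the environment name from the file itself, so the names to pass to `conda activate` may differ from the file names. The sketch below assumes each file has a top-level `name:` field, which is the usual conda convention but is not shown in this README; `env_name` is a hypothetical helper.

```shell
# Hypothetical helper: print the value of the top-level `name:` field
# of a conda environment file (assumes the field is present).
env_name() {
    sed -n 's/^name:[[:space:]]*//p' "$1" | head -n 1
}

# List the environment name declared in each provided file, if the file exists.
for f in base_env.yml OPENAI_env.yml vulinject_env.yml; do
    if [ -f "$f" ]; then
        printf '%s -> %s\n' "$f" "$(env_name "$f")"
    fi
done
```

Run it from the repository root. `conda env list` shows which of those environments were actually created.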

4) Configure the project

Update the paths in the configuration file (src/get_code_slice/config.json) to match your environment so that code slicing can run successfully.
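Hand-editing JSON makes it easy to leave a stray comma or an unclosed quote. The sketch below only checks that the file still parses; it does not know which keys the project expects, since those are not listed here. `is_valid_json` is a hypothetical helper that relies on `python3` being on the PATH.

```shell
# Hypothetical helper: succeeds only if the file parses as JSON.
# Uses Python's standard json.tool module; assumes python3 is on PATH.
is_valid_json() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

config=src/get_code_slice/config.json
if [ -f "$config" ] && ! is_valid_json "$config"; then
    echo "$config is not valid JSON" >&2
fi
```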

How To Run

Vulnerability Generation

  1. Put the repositories of your target programs in data/programs.
  2. Run the provided shell script to inject vulnerabilities into the target programs automatically:
bash run.sh
  3. The generated vulnerabilities are stored in result/generated_vulnerable_programs/<your target program's name>.
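After a run, one way to check that a program produced output is to look for the three subdirectories listed under Structure (c_origin, c_vul, match_result). This is a minimal sketch: `missing_outputs` is a hypothetical helper, and it assumes every program follows the same layout as the vim example.

```shell
# Hypothetical helper: print each expected output subdirectory that is
# missing for one program.
# $1: results root, $2: program name (its directory name under data/programs).
missing_outputs() {
    local root="$1" program="$2" sub
    for sub in c_origin c_vul match_result; do
        if [ ! -d "$root/$program/$sub" ]; then
            printf '%s\n' "$sub"
        fi
    done
}

# vim is the sample program shown in the directory tree.
missing_outputs result/generated_vulnerable_programs vim
```

No output means all three directories exist. Any name printed is a directory that was not created for that program.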

Downstream Tasks Evaluation

Models

For binary models, we use VulCNN, VulBERTa, LineVul and Devign. For multiclass models, we use PDBERT and VulBERTa.

The source code for these models can be found in the experiments directory.

The complete experimental source code and trained models have been uploaded to Zenodo and are available at: https://zenodo.org/records/18811174?preview=1&token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImJiOWQ1YWExLTI2YTYtNDQ1Yi1hOTYwLTk5Nzk5Y2IzODQ4MCIsImRhdGEiOnt9LCJyYW5kb20iOiJhN2ZjMWI0OWM0ODViZmY2MmQxNGI5NmRjMzk2MTkzNiJ9.84AVkgnjPXAnDkjuW1lrboIo1blG7yy-Eus5t9drSvgXO3Q6OpzvZD2d8iCXgbFdGd8WPvg62x3N-fJ1M91UaA

Dataset

  • The 352 vulnerability patterns we extracted are in data/pattern.
  • For training and testing the binary and multiclass classification models, we use the PrimeVul dataset as the baseline. The data is located in the respective model directories; see each model's README file for details.
  • We will make all generated datasets publicly available upon acceptance of this paper.
