Large Language Models (LLMs) for code generation have shown remarkable progress, yet they still produce code containing security vulnerabilities. RESCUE (REtrieval-augmented SeCUre codE generation) addresses this challenge by incorporating a hybrid security knowledge base.
Prerequisites:

- Python 3.11 (recommended)
- Docker (required for HumanEval+, BigCodeBench, and CWEval evaluation)
Set up the environment and install dependencies:
```bash
uv venv --python=3.11
source .venv/bin/activate
uv pip install -r requirements.txt
```

We evaluate RESCUE on five benchmarks covering both functional correctness and security. Follow the instructions below to set up each benchmark.
| Abbreviation | Benchmark | Focus | Languages |
|---|---|---|---|
| `cgp` | CodeGuard+ | Security (CodeQL + unittest) | Python, C, C++ |
| `cwe` | CWEval | Security (functional + secure tests) | Python, C, C++ |
| `hep` | HumanEval+ | Functional correctness | Python |
| `bcb` | BigCodeBench | Functional correctness | Python |
| `lcb` | LiveCodeBench | Functional correctness | Python |
To set up CodeGuard+:

- Install CodeQL:

```bash
wget https://github.com/github/codeql-cli-binaries/releases/download/v2.17.5/codeql-linux64.zip
unzip codeql-linux64.zip -d ~
git clone --depth=1 --branch codeql-cli-2.17.5 https://github.com/github/codeql.git ~/codeql/codeql-repo
~/codeql/codeql pack download [email protected] [email protected] codeql/[email protected] codeql/[email protected] codeql/[email protected] codeql/[email protected]
```

- Install the Python dependencies:

```bash
uv pip install -r repos-archive/CodeGuardPlus/requirements.txt
```

HumanEval+ and BigCodeBench rely on Docker for sandboxed evaluation. Please ensure Docker is installed and running before proceeding.
Note: There is a known issue in BigCodeBench at line 223 of .venv/lib/python3.11/site-packages/bigcodebench/sanitize.py:
```python
# buggy line
msg += " -> " + dbg_identifier.replace(samples, target_path)
```

This line references a non-existent path, causing the evaluation to fail. Please temporarily comment out or remove this line.
Clone and set up LiveCodeBench:
```bash
cd repos-archive
git clone --depth 1 https://github.com/LiveCodeBench/LiveCodeBench.git LiveCodeBench
cd LiveCodeBench
uv pip install -e .
```

CWEval is vendored under `repos-archive/CWEval` and already contains the benchmark assets. We evaluate on the C, C++, and Python subsets.
- The required Python packages for CWEval are already included in `requirements.txt`.
- Evaluation runs inside a Docker container (`co1lin/cweval`); make sure Docker is available. Pull the image in advance:

```bash
docker pull co1lin/cweval
```
The evaluation pipeline parses generated code, compiles C/C++ sources, runs functional and security tests inside Docker, and reports pass@k metrics for func (functional correctness) and func-sec (both functional and secure).
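For reference, pass@k is typically computed with the standard unbiased estimator over n samples per task, of which c pass the respective (`func` or `func-sec`) checks. A minimal sketch, assuming this standard formulation is the one reported here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k drawn samples
    (out of n generated, c of which pass) passes the checks."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 4 pass the functional tests,
# 3 of those also pass the security tests.
print(pass_at_k(10, 4, 1))  # func pass@1 -> 0.4
print(pass_at_k(10, 3, 1))  # func-sec pass@1 -> 0.3
```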
Modifications to CWEval source: We modified `cweval/evaluate.py` in two places:

- `pipeline()`: moved `compile_parsed()` out of the Docker path, so it only runs locally when `docker=False`.
- `run_tests_in_docker()`: added a `compile_parsed` step to the Docker command, before `run_tests`.

Reason: The original code compiled C/C++ locally before copying into Docker, which requires installing all system-level libraries (OpenSSL, libarchive, libjwt, etc.) on the host. Now compilation happens inside the `co1lin/cweval` container, where these libraries are pre-installed. This change does not affect the evaluation behavior or the correctness of the results.
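A simplified sketch of the resulting control flow, using the method names from the list above; the class and stub bodies are illustrative stand-ins, not the actual CWEval source:

```python
# Illustrative only: stubs stand in for the real CWEval evaluation steps.
class EvalSketch:
    def compile_parsed(self) -> None:
        print("compile C/C++ sources (needs OpenSSL, libarchive, libjwt, ...)")

    def run_tests(self) -> None:
        print("run functional and security tests")

    def run_tests_in_docker(self) -> None:
        # compile_parsed is now part of the in-container command, before run_tests,
        # so compilation uses the libraries pre-installed in the co1lin/cweval image.
        print("docker run co1lin/cweval: compile_parsed && run_tests")

    def pipeline(self, docker: bool = True) -> None:
        if docker:
            self.run_tests_in_docker()
        else:
            # compile_parsed() only runs locally when docker=False.
            self.compile_parsed()
            self.run_tests()


EvalSketch().pipeline(docker=True)
```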
For local models (e.g., Qwen2.5-Coder-7B-Instruct, DeepSeek-Coder-V2-Lite-Instruct):
```bash
bash scripts/vllm_deploy.sh <MODEL_PATH> <SERVED_MODEL_NAME> <PORT>
```

This starts a vLLM-based OpenAI-compatible API server.
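Once the server is up, you can smoke-test it with any OpenAI-compatible client. A minimal sketch, assuming `<PORT>` is 8000 and `<SERVED_MODEL_NAME>` is `qwen2.5-coder-7b-instruct` (both are example values for the placeholders above):

```python
# Minimal check against the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="qwen2.5-coder-7b-instruct",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```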
For API-based models (e.g., GPT-4o-mini, DeepSeek-V3): skip this step.
Copy the template configuration files and fill in your parameters:
```bash
cp conf/model_deploy.yaml.template conf/model_deploy.yaml
cp conf/secret.yaml.template conf/secret.yaml
```

- `model_deploy.yaml`: define API endpoints and ports for each model
- `secret.yaml`: store API keys
Generate code with:

```bash
bash scripts/generate.sh <MODEL_NAME> <METHOD> <BENCHMARK>
```

- `<METHOD>`: `zero_shot` (baseline) or `rescue` (our method)
- `<BENCHMARK>`: `cgp`, `cwe`, `hep`, `bcb`, or `lcb`
Then evaluate the generated code with the same arguments:

```bash
bash scripts/evaluate.sh <MODEL_NAME> <METHOD> <BENCHMARK>
```

For example:

```bash
# Zero-shot baseline on CodeGuard+
bash scripts/generate.sh gpt-4o-mini zero_shot cgp
bash scripts/evaluate.sh gpt-4o-mini zero_shot cgp

# RESCUE on CodeGuard+
bash scripts/generate.sh gpt-4o-mini rescue cgp
bash scripts/evaluate.sh gpt-4o-mini rescue cgp

# RESCUE on CWEval
bash scripts/generate.sh gpt-4o-mini rescue cwe
bash scripts/evaluate.sh gpt-4o-mini rescue cwe
```

Results are saved to `experiments/<BENCHMARK>/<MODEL>/<METHOD>/`.
Project layout:

```
.
├── conf/ # Configuration files for LLM endpoints and API keys
│ ├── model_deploy.yaml.template
│ └── secret.yaml.template
├── data/
│ ├── knowledge/ # Pre-built security knowledge base
│ │ ├── code_level_knowledge.json
│ │ └── cwe_level_knowledge.json
│ └── ql_check_completion_prompt.json # CodeGuard+ benchmark data
├── repos-archive/ # Vendored benchmark repositories
│ ├── CodeGuardPlus/
│ └── CWEval/
├── scripts/
│ ├── generate.sh # Code generation entry script
│ ├── evaluate.sh # Evaluation entry script
│ └── vllm_deploy.sh # vLLM model deployment script
├── src/
│ ├── baselines/
│ │ └── zero_shot.py # Zero-shot baseline (LLM alone)
│ ├── benchmarks/ # Benchmark runners
│ │ ├── benchmark.py # Abstract base class
│ │ ├── bcb/ # BigCodeBench
│ │ ├── cgp/ # CodeGuard+
│ │ ├── cweval/ # CWEval
│ │ ├── hep/ # HumanEval+
│ │ └── lcb/ # LiveCodeBench
│ ├── common/
│ │ ├── const.py # Constants and paths
│ │ ├── model.py # LLM invocation (chat / completion)
│ │ ├── zero_shot_prompt.py # Zero-shot prompt templates
│ │ └── utils.py # Utility functions
│ └── rescue/ # RESCUE method implementation
│ ├── augmented_code_generation.py # Online stage entry point
│ ├── augmented_prompt.py # Augmented prompt construction
│ ├── construction/ # Offline stage: knowledge base construction
│ │ ├── api_extraction.py # API call extraction
│ │ └── program_slicing.py # Program slicing
│ └── retrieval/
│ └── hmr.py # Hierarchical Multi-faceted Retrieval (HMR)
├── experiments/ # Output directory (generated code & evaluation results)
├── requirements.txt
└── README.md
```
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{shi2026rescue,
  title     = {Rescue: Retrieval Augmented Secure Code Generation},
  author    = {Jiahao Shi and Tianyi Zhang},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=gbxhesw4UH}
}
```