# Exploring the Potential of LLMs for Code Deobfuscation
Dataset Link
## Instructions to reproduce
### Single transformations

**Create the training dataset**

```shell
cd single_transformations/train
python3 create_training_data_single.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --number_of_samples 3000
```

**Fine-tune the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --train_model deepseek-ai/deepseek-coder-6.7b-instruct --train_file datasets/obfuscation_dataset_encode_arithmetic_6144.txt --trained_model_path models/deepseek-coder-instruct-7b-encode_arithmetic --max_tokens 6144
```

**Copy the model and switch to the eval directory.** It is important to use two different directories for the training and eval data, so that source files are not overwritten.

```shell
mkdir ../eval/models ../eval/datasets
cp -r models/deepseek-coder-instruct-7b-encode_arithmetic ../eval/models/deepseek-coder-instruct-7b-encode_arithmetic
cd ../eval
```

**Create the evaluation dataset**

```shell
python3 create_eval_data_single.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --number_of_samples 200
```

**Evaluate the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-encode_arithmetic/ --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --max_tokens 6144 --data_suffix _encode_arithmetic
```

**Build the evaluation files from the LLM output**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-encode_arithmetic/ --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --max_tokens 6144 --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --build_eval_files 1
```

**Evaluate Clang**

```shell
python3 llvm.py --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic
```

**Evaluate correctness and compute the metrics**

```shell
python3 eval_deobf.py --eval_dataset_path datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --io_path datasets/input_samples
```

**Show the evaluation**

```shell
python3 show_eval.py --eval_dataset_path datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated
```
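The training steps above can be wrapped in a small script so they are easy to repeat for other transformations. The following is a sketch, not part of the repository: the `train_single` helper and the idea that the dataset/model names only differ in the transformation suffix are assumptions based on the file names above, and `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Hypothetical wrapper around the training steps above.
# Set DRY_RUN=1 to print each command instead of executing it (sanity check).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Train one single-transformation model; $1 is the transformation name,
# e.g. encode_arithmetic (other names are assumptions from the dataset suffixes).
train_single() {
  t="$1"
  tok=deepseek-ai/deepseek-coder-6.7b-instruct
  run python3 create_training_data_single.py --tokenizer "$tok" \
    --max_tokens 6144 --number_of_samples 3000
  run python3 llm.py --model_type deepseek-coder-instruct --train_model "$tok" \
    --train_file "datasets/obfuscation_dataset_${t}_6144.txt" \
    --trained_model_path "models/deepseek-coder-instruct-7b-${t}" --max_tokens 6144
}

# Usage: DRY_RUN=1 train_single encode_arithmetic
```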
### Transformation chains

**Create the training dataset**

```shell
python3 create_training_data_chain.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --chain_length 1 --number_of_samples 3000
```

**Fine-tune the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --train_model deepseek-ai/deepseek-coder-6.7b-instruct --train_file datasets/obfuscation_dataset_chain_6144_all_training.txt --trained_model_path models/deepseek-coder-instruct-7b-chain-6144 --max_tokens 6144
```

**Copy the model and switch to the eval directory**

```shell
cp -r models/deepseek-coder-instruct-7b-chain-6144 ../eval/models/deepseek-coder-instruct-7b-chain-6144
cd ../eval
```

**Create the evaluation dataset**

```shell
python3 create_eval_data_chain.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --chain_length 1 --number_of_samples 1000
```

**Evaluate the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-chain-6144 --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --max_tokens 6144 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144
```

**Build the evaluation files from the LLM-generated samples**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-chain-6144 --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --max_tokens 6144 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --build_eval_files 1
```

**Evaluate Clang**

```shell
python3 llvm.py --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144
python3 llvm.py --eval_file datasets/obfuscation_dataset_2_chain_eval2_l.json --orig_data_suffix _2 --obfs_data_suffix _2_chain --data_suffix _2_chain_6144
```

**Evaluate correctness and compute the metrics**

```shell
python3 eval_deobf.py --eval_dataset_path datasets/obfuscation_dataset_1_chain_eval2_l.json --no_metrics --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --io_path datasets/input_samples
```

**Show the evaluation**

```shell
python3 show_eval.py --eval_dataset_path datasets/obfuscation_dataset_1_chain_eval2_l.json --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --original_path datasets/original --original_io_path datasets/original_eval --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated
```
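The chain evaluation commands only differ in the chain index (`_1_chain` vs. `_2_chain`), so a loop avoids duplicating them. This is a sketch under the assumption that only indices 1 and 2 exist, as in the commands above; `DRY_RUN=1` prints the commands instead of running them.

```shell
# Sketch: run the Clang evaluation for each chain length instead of
# repeating the command per index. DRY_RUN=1 prints the commands only.
eval_clang_chains() {
  for i in 1 2; do
    cmd="python3 llvm.py --eval_file datasets/obfuscation_dataset_${i}_chain_eval2_l.json --orig_data_suffix _${i} --obfs_data_suffix _${i}_chain --data_suffix _${i}_chain_6144"
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $cmd"; else $cmd; fi
  done
}

# Usage: DRY_RUN=1 eval_clang_chains
```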
### Memorization check

**Build the memorization dataset**

```shell
python3 build_memorization_dataset.py --input_dataset ../FineTuning-6144-eval/datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --output_dataset _gpt-4-32k-0314-memorization --original_path ../FineTuning-6144-eval/datasets/original --obfuscated_path ../FineTuning-6144-eval/datasets/obfuscated --deobfuscated_path ../FineTuning-6144-eval/datasets/deobfuscated --data_suffix _encode_arithmetic_gpt-4-32k-0314,_encode_branches_gpt-4-32k-0314,_flatten_gpt-4-32k-0314,_opaque_gpt-4-32k-0314,_randomize_arguments_gpt-4-32k-0314
```

**Evaluate the model on the memorized samples**

```shell
python3 llm.py --model_type openai --eval_model ../FineTuning-6144-eval/models/codellama-7b-encode_arithmetic-6144/ --eval_out_path MemTest/deobfuscated_modified --eval_file obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --max_tokens 6144 --data_suffix _encode_arithmetic_gpt-4-32k-0314
```

**Build the evaluation files from the LLM-generated samples**

```shell
python3 llm.py --model_type openai --eval_model ../FineTuning-6144-eval/models/codellama-7b-encode_arithmetic-6144/ --eval_out_path MemTest/deobfuscated_modified --eval_file obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --obfs_data_suffix _encode_arithmetic --max_tokens 6144 --data_suffix _encode_arithmetic_gpt-4-32k-0314 --build_eval_files 1
```

**Evaluate correctness and compute the metrics**

```shell
python3 eval_deobf.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic_gpt-4-32k-0314 --io_path ../FineTuning-6144-eval/datasets/input_samples
```

**Show the evaluation**

```shell
python3 show_eval.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic_gpt-4-32k-0314
```

**Manually check for memorized constants.** Only the correctness part of the evaluation is needed here, since only the semantically incorrect samples have to be examined; the deobfuscation performance is not relevant for this step.
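A small helper can speed up the manual constant check. This is a sketch, not part of the repository: the `leaked_constants` function name is an assumption, and it only implements the simple heuristic of listing numeric literals that appear in a deobfuscated sample but not in the corresponding original, which is where memorized constants would show up.

```shell
# Hypothetical helper for the manual memorization check: print numeric
# literals (decimal or hex) that occur in the deobfuscated file but not
# in the original file. Such constants were invented by the model and
# are candidates for memorization.
leaked_constants() {
  orig="$1"; deobf="$2"
  # Extract literals from each file, one per line, deduplicated and sorted.
  grep -oE '0x[0-9a-fA-F]+|[0-9]+' "$deobf" | sort -u > "/tmp/deobf_consts.$$"
  grep -oE '0x[0-9a-fA-F]+|[0-9]+' "$orig"  | sort -u > "/tmp/orig_consts.$$"
  # comm -23 keeps lines unique to the first input (deobfuscated-only constants).
  comm -23 "/tmp/deobf_consts.$$" "/tmp/orig_consts.$$"
  rm -f "/tmp/deobf_consts.$$" "/tmp/orig_consts.$$"
}

# Usage (paths are examples from the commands above):
#   leaked_constants MemTest/original_modified/foo.c MemTest/deobfuscated_modified/foo.c
```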