# Exploring the Potential of LLMs for Code Deobfuscation
Dataset Link
## Instructions to reproduce
### Single transformations

**Create the training dataset**

```shell
cd single_transformations/train
python3 create_training_data_single.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --number_of_samples 3000
```

**Fine-tune the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --train_model deepseek-ai/deepseek-coder-6.7b-instruct --train_file datasets/obfuscation_dataset_encode_arithmetic_6144.txt --trained_model_path models/deepseek-coder-instruct-7b-encode_arithmetic --max_tokens 6144
```

**Copy the model and switch to the eval directory.** It is important to use two different directories for the training and eval data, so that source files are not overwritten.

```shell
mkdir ../eval/models ../eval/datasets
cp -r models/deepseek-coder-instruct-7b-encode_arithmetic ../eval/models/deepseek-coder-instruct-7b-encode_arithmetic
cd ../eval
```

**Create the evaluation dataset**

```shell
python3 create_eval_data_single.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --number_of_samples 200
```

**Evaluate the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-encode_arithmetic/ --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --max_tokens 6144 --data_suffix _encode_arithmetic
```

**Build the evaluation files from the LLM output**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-encode_arithmetic/ --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --max_tokens 6144 --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --build_eval_files 1
```

**Evaluate Clang**

```shell
python3 llvm.py --eval_file datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic
```

**Evaluate correctness and compute the metrics**

```shell
python3 eval_deobf.py --eval_dataset_path datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --io_path datasets/input_samples
```

**Show the evaluation**

```shell
python3 show_eval.py --eval_dataset_path datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated
```
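The training steps above can be wrapped in a small script so they are easy to repeat for other transformations. The following is a sketch, not part of the repository: the `train_single` helper and the idea that the dataset/model names only differ in the transformation suffix are assumptions based on the file names above, and `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Hypothetical wrapper around the training steps above.
# Set DRY_RUN=1 to print each command instead of executing it (sanity check).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Train one single-transformation model; $1 is the transformation name,
# e.g. encode_arithmetic (other names are assumptions from the dataset suffixes).
train_single() {
  t="$1"
  tok=deepseek-ai/deepseek-coder-6.7b-instruct
  run python3 create_training_data_single.py --tokenizer "$tok" \
    --max_tokens 6144 --number_of_samples 3000
  run python3 llm.py --model_type deepseek-coder-instruct --train_model "$tok" \
    --train_file "datasets/obfuscation_dataset_${t}_6144.txt" \
    --trained_model_path "models/deepseek-coder-instruct-7b-${t}" --max_tokens 6144
}

# Usage: DRY_RUN=1 train_single encode_arithmetic
```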
### Transformation chains

**Create the training dataset**

```shell
python3 create_training_data_chain.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --chain_length 1 --number_of_samples 3000
```

**Fine-tune the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --train_model deepseek-ai/deepseek-coder-6.7b-instruct --train_file datasets/obfuscation_dataset_chain_6144_all_training.txt --trained_model_path models/deepseek-coder-instruct-7b-chain-6144 --max_tokens 6144
```

**Copy the model and switch to the eval directory**

```shell
cp -r models/deepseek-coder-instruct-7b-chain-6144 ../eval/models/deepseek-coder-instruct-7b-chain-6144
cd ../eval
```

**Create the evaluation dataset**

```shell
python3 create_eval_data_chain.py --tokenizer deepseek-ai/deepseek-coder-6.7b-instruct --max_tokens 6144 --chain_length 1 --number_of_samples 1000
```

**Evaluate the model**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-chain-6144 --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --max_tokens 6144 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144
```

**Build the evaluation files from the LLM-generated samples**

```shell
python3 llm.py --model_type deepseek-coder-instruct --eval_model models/deepseek-coder-instruct-7b-chain-6144 --eval_out_path datasets/deobfuscated --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --max_tokens 6144 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --build_eval_files 1
```

**Evaluate Clang**

```shell
python3 llvm.py --eval_file datasets/obfuscation_dataset_1_chain_eval2_l.json --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144
python3 llvm.py --eval_file datasets/obfuscation_dataset_2_chain_eval2_l.json --orig_data_suffix _2 --obfs_data_suffix _2_chain --data_suffix _2_chain_6144
```

**Evaluate correctness and compute the metrics**

```shell
python3 eval_deobf.py --eval_dataset_path datasets/obfuscation_dataset_1_chain_eval2_l.json --no_metrics --original_path datasets/original --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --io_path datasets/input_samples
```

**Show the evaluation**

```shell
python3 show_eval.py --eval_dataset_path datasets/obfuscation_dataset_1_chain_eval2_l.json --orig_data_suffix _1 --obfs_data_suffix _1_chain --data_suffix _1_chain_6144 --original_path datasets/original --original_io_path datasets/original_eval --obfuscated_path datasets/obfuscated --deobfuscated_path datasets/deobfuscated
```
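The chain evaluation commands only differ in the chain index (`_1_chain` vs. `_2_chain`), so a loop avoids duplicating them. This is a sketch under the assumption that only indices 1 and 2 exist, as in the commands above; `DRY_RUN=1` prints the commands instead of running them.

```shell
# Sketch: run the Clang evaluation for each chain length instead of
# repeating the command per index. DRY_RUN=1 prints the commands only.
eval_clang_chains() {
  for i in 1 2; do
    cmd="python3 llvm.py --eval_file datasets/obfuscation_dataset_${i}_chain_eval2_l.json --orig_data_suffix _${i} --obfs_data_suffix _${i}_chain --data_suffix _${i}_chain_6144"
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $cmd"; else $cmd; fi
  done
}

# Usage: DRY_RUN=1 eval_clang_chains
```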
### Memorization check

**Build the memorization dataset**

```shell
python3 build_memorization_dataset.py --input_dataset ../FineTuning-6144-eval/datasets/obfuscation_dataset_encode_arithmetic_6144_eval.json --output_dataset _gpt-4-32k-0314-memorization --original_path ../FineTuning-6144-eval/datasets/original --obfuscated_path ../FineTuning-6144-eval/datasets/obfuscated --deobfuscated_path ../FineTuning-6144-eval/datasets/deobfuscated --data_suffix _encode_arithmetic_gpt-4-32k-0314,_encode_branches_gpt-4-32k-0314,_flatten_gpt-4-32k-0314,_opaque_gpt-4-32k-0314,_randomize_arguments_gpt-4-32k-0314
```

**Evaluate the model on the memorized samples**

```shell
python3 llm.py --model_type openai --eval_model ../FineTuning-6144-eval/models/codellama-7b-encode_arithmetic-6144/ --eval_out_path MemTest/deobfuscated_modified --eval_file obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --max_tokens 6144 --data_suffix _encode_arithmetic_gpt-4-32k-0314
```

**Build the evaluation files from the LLM-generated samples**

```shell
python3 llm.py --model_type openai --eval_model ../FineTuning-6144-eval/models/codellama-7b-encode_arithmetic-6144/ --eval_out_path MemTest/deobfuscated_modified --eval_file obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --obfs_data_suffix _encode_arithmetic --max_tokens 6144 --data_suffix _encode_arithmetic_gpt-4-32k-0314 --build_eval_files 1
```

**Evaluate correctness and compute the metrics**

```shell
python3 eval_deobf.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic_gpt-4-32k-0314 --io_path ../FineTuning-6144-eval/datasets/input_samples
```

**Show the evaluation**

```shell
python3 show_eval.py --eval_dataset_path obfuscation_dataset_gpt-4-32k-0314-memorization_encode_arithmetic.json --original_path MemTest/original_modified --obfuscated_path MemTest/obfuscated_modified --deobfuscated_path MemTest/deobfuscated_modified --obfs_data_suffix _encode_arithmetic --data_suffix _encode_arithmetic_gpt-4-32k-0314
```

**Manually check for memorized constants.** Only the correctness part of the evaluation is needed here, since only the semantically incorrect samples have to be examined; the deobfuscation performance is not relevant for this step.
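A small helper can speed up the manual constant check. This is a sketch, not part of the repository: the `leaked_constants` function name is an assumption, and it only implements the simple heuristic of listing numeric literals that appear in a deobfuscated sample but not in the corresponding original, which is where memorized constants would show up.

```shell
# Hypothetical helper for the manual memorization check: print numeric
# literals (decimal or hex) that occur in the deobfuscated file but not
# in the original file. Such constants were invented by the model and
# are candidates for memorization.
leaked_constants() {
  orig="$1"; deobf="$2"
  # Extract literals from each file, one per line, deduplicated and sorted.
  grep -oE '0x[0-9a-fA-F]+|[0-9]+' "$deobf" | sort -u > "/tmp/deobf_consts.$$"
  grep -oE '0x[0-9a-fA-F]+|[0-9]+' "$orig"  | sort -u > "/tmp/orig_consts.$$"
  # comm -23 keeps lines unique to the first input (deobfuscated-only constants).
  comm -23 "/tmp/deobf_consts.$$" "/tmp/orig_consts.$$"
  rm -f "/tmp/deobf_consts.$$" "/tmp/orig_consts.$$"
}

# Usage (paths are examples from the commands above):
#   leaked_constants MemTest/original_modified/foo.c MemTest/deobfuscated_modified/foo.c
```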