Here goes a very nice abstract, DOI stuff and links and maybe citations to the paper and code.
- Create a virtual environment and install the dependencies:
python -m venv venv
source venv/bin/activate
python -m pip install -r requirements.txt
- Add your keys to the .env file. The key is necessary to create the steering vectors later on.
cp .env.template .env
# Then edit the .env file and add your OpenAI key
nano -w .env
- Clone the persona-vectors repository into this directory and install its dependencies:
git clone https://github.com/safety-research/persona_vectors
cd persona_vectors
python -m pip install -r requirements.txt
# remove their trait data, since we have our own
rm -rf data_generation/*
cd .. # go back to the main directory
# copy our trait data, the persona vectors, and the API keys into the correct directory
cp .env persona_vectors/.env
cp -r data/trait_datasets persona_vectors/data_generation/trait_data_eval
cp -r data/trait_datasets persona_vectors/data_generation/trait_data_extract
cp -r data/persona_vectors persona_vectors/persona_vectors
This section covers how to run the different experiments in this repository. All logs and figures will be saved in the logs/ directory.
The default model for all experiments is mlabonne/gemma-3-27b-it-abliterated.
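Before launching an experiment, it can help to confirm that the setup steps left the expected files in place. The sketch below is not part of the repository: the function name missing_setup_paths and the path list are assumptions derived from the cp commands in the setup instructions, and it only checks that each path exists, not that its contents are correct.

```python
from pathlib import Path

# Paths the setup steps are expected to produce, relative to the repository root.
# This list is an assumption read off the setup commands, not an official manifest.
EXPECTED_PATHS = [
    ".env",
    "persona_vectors/.env",
    "persona_vectors/data_generation/trait_data_eval",
    "persona_vectors/data_generation/trait_data_extract",
    "persona_vectors/persona_vectors",
]


def missing_setup_paths(root="."):
    """Return the expected setup paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_setup_paths()
    if missing:
        print("Setup incomplete, missing:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected setup paths are present.")
```

Run it from the main directory; an empty result means every path from the setup steps was found.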
Run the HEXACO benchmark against a specific model with a variety of personas:
python test_personas_hexaco.py --model <model_name> --device cuda
The following experiments evaluate a model on a set of safety questions (which can be found here). The experiments cover a variety of settings for different personas (defined here) and emotions (defined here), placed in either the system prompt or the user input. Every experiment includes a baseline without a persona or emotion.
The combinations are as follows:
- The LLM is instructed with a persona in its system prompt, and the user input contains the plain safety questions.
python test_personas_safety.py \
--model <model_name> \
--device cuda \
--pass_at_k 1 \
--hierarchy_level system
- The LLM is instructed with a persona in the user input, followed by the plain safety questions.
python test_personas_safety.py \
--model <model_name> \
--device cuda \
--pass_at_k 1 \
--hierarchy_level user
- The LLM is instructed neutrally in the system prompt, and the user input contains the safety questions with an emotion prefix (e.g. "I am feeling angry. <safety question>").
python test_emotions_safety.py \
--model <model_name> \
--device cuda \
--pass_at_k 1 \
--hierarchy_level user \
--question_type prefix
- The LLM is instructed neutrally in the system prompt, and the user input contains an emotionalized version of the safety questions (which can be found here in the emotionalized_questions field).
python test_emotions_safety.py \
--model <model_name> \
--device cuda \
--pass_at_k 1 \
--hierarchy_level user \
--question_type emotionalized
- The LLM is instructed with a history of the user's previous emotional states (which can be found here in the emotion_history field) in the system prompt, and the user input contains the plain safety questions.
python test_emotions_safety.py \
--model <model_name> \
--device cuda \
--pass_at_k 1 \
--hierarchy_level system
To apply steering vectors for different personas and emotions, follow the next steps to create the datasets and activation vectors. The model_str variable is the model name with all "/" and ":" characters replaced by "-", e.g. mlabonne-gemma-3-27b-it-abliterated. The persona_name variable is the name of the persona for which you want to create the steering vector, e.g. evil.
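The model_str convention described above can be written as a one-line helper. This is a minimal sketch: the function name to_model_str is hypothetical, and the repository's own scripts may derive the string differently.

```python
def to_model_str(model_name: str) -> str:
    """Replace every "/" and ":" in a model name with "-"."""
    return model_name.replace("/", "-").replace(":", "-")


print(to_model_str("mlabonne/gemma-3-27b-it-abliterated"))
# mlabonne-gemma-3-27b-it-abliterated
```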
Tip
Creating the persona datasets and vectors is optional. The prebuilt datasets and vectors are already included in this repository and were installed during the setup instructions.
- Create the persona datasets
python gen_trait_data.py
This will create a dataset for each persona in the persona_vectors/data_generation/ directory.
- Generate activations using positive and negative system prompts from the previously generated datasets. The files will be saved in the eval_persona_extract/ directory.
cd persona_vectors
python -m eval.eval_persona \
--model <model_name> \
--trait <persona_name> \
--output_path eval_persona_extract/<model_str>/<persona_name>_pos_instruct.csv \
--persona_instruction_type pos \
--assistant_name <persona_name> \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extract
python -m eval.eval_persona \
--model <model_name> \
--trait <persona_name> \
--output_path eval_persona_extract/<model_str>/<persona_name>_neg_instruct.csv \
--persona_instruction_type neg \
--assistant_name <opposite of persona_name> \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extract
- Generate the steering vectors for the different personas.
python generate_vec.py \
--model_name <model_name> \
--pos_path eval_persona_extract/<model_str>/<persona_name>_pos_instruct.csv \
--neg_path eval_persona_extract/<model_str>/<persona_name>_neg_instruct.csv \
--trait <persona_name> \
--save_dir persona_vectors/<model_str>/<persona_name>
- Re-run the persona safety question experiments with the --steering <persona_name> argument to apply the steering vectors to the model's activations.
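The three commands in the steps above differ only in the model name, the persona name, and the name of the opposite persona. As an illustration, the sketch below assembles them from those three values. The helper names build_vector_commands and to_model_str are hypothetical and not part of the repository; the flags and paths are copied from the commands above, and the resulting commands are meant to be run from inside the persona_vectors directory.

```python
import shlex

JUDGE_MODEL = "gpt-4.1-mini-2025-04-14"  # judge model taken from the commands above


def to_model_str(model_name):
    """Replace "/" and ":" with "-" to get the directory-safe model string."""
    return model_name.replace("/", "-").replace(":", "-")


def build_vector_commands(model_name, persona_name, opposite_name):
    """Return three argv lists: positive eval, negative eval, vector generation."""
    model_str = to_model_str(model_name)
    out_dir = f"eval_persona_extract/{model_str}"
    paths = {kind: f"{out_dir}/{persona_name}_{kind}_instruct.csv" for kind in ("pos", "neg")}

    commands = []
    # The positive run uses the persona itself, the negative run its opposite.
    for kind, assistant in (("pos", persona_name), ("neg", opposite_name)):
        commands.append([
            "python", "-m", "eval.eval_persona",
            "--model", model_name,
            "--trait", persona_name,
            "--output_path", paths[kind],
            "--persona_instruction_type", kind,
            "--assistant_name", assistant,
            "--judge_model", JUDGE_MODEL,
            "--version", "extract",
        ])
    commands.append([
        "python", "generate_vec.py",
        "--model_name", model_name,
        "--pos_path", paths["pos"],
        "--neg_path", paths["neg"],
        "--trait", persona_name,
        "--save_dir", f"persona_vectors/{model_str}/{persona_name}",
    ])
    return commands


if __name__ == "__main__":
    for argv in build_vector_commands("mlabonne/gemma-3-27b-it-abliterated", "evil", "helpful"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch side-effect free; each argv list could be passed to subprocess.run once the output looks right.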
This example generates the evil persona steering vector for the mlabonne/gemma-3-27b-it-abliterated model and uses it for the safety questions experiment.
cd persona_vectors
python -m eval.eval_persona \
--model mlabonne/gemma-3-27b-it-abliterated \
--trait evil \
--output_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_pos_instruct.csv \
--persona_instruction_type pos \
--assistant_name evil \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extract
python -m eval.eval_persona \
--model mlabonne/gemma-3-27b-it-abliterated \
--trait evil \
--output_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_neg_instruct.csv \
--persona_instruction_type neg \
--assistant_name helpful \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extract
python generate_vec.py \
--model_name mlabonne/gemma-3-27b-it-abliterated \
--pos_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_pos_instruct.csv \
--neg_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_neg_instruct.csv \
--trait evil \
--save_dir persona_vectors/mlabonne-gemma-3-27b-it-abliterated/evil
cd .. # go back to the main directory
python test_personas_safety.py \
--model mlabonne/gemma-3-27b-it-abliterated \
--device cuda \
--pass_at_k 1 \
--hierarchy_level system \
--steering evil
bash generate_vectors.sh