Dormant-Neurons/llm-mbti

Watch Your Tone: How Emotions Influence LLMs in Security (working title)

Here goes a very nice abstract, DOI stuff and links and maybe citations to the paper and code.

Setup

  1. Create a virtual environment and install the dependencies:
python -m venv venv
source venv/bin/activate
python -m pip install -r requirements.txt
  1. Add your API key to the .env file. The key is needed to create the steering vectors later on.
cp .env.template .env
# Then edit the .env file and add your OpenAI key
nano -w .env
  1. Clone the persona-vectors repository into this directory and install their dependencies:
git clone https://github.com/safety-research/persona_vectors
cd persona_vectors
python -m pip install -r requirements.txt
# remove their trait data, since we have our own
rm -rf data_generation/*
cd .. # go back to the main directory
# copy our trait data, the persona vectors, and the API keys into the correct directory
cp .env persona_vectors/.env
cp -r data/trait_datasets persona_vectors/data_generation/trait_data_eval
cp -r data/trait_datasets persona_vectors/data_generation/trait_data_extract
cp -r data/persona_vectors persona_vectors/persona_vectors
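
The copy steps above can be sanity-checked before running any experiment. The following is a minimal sketch and not part of the repository: the helper name missing_setup_paths is hypothetical, and the listed paths are simply the copy targets from the setup commands, relative to the main directory.

```python
from pathlib import Path

# Copy targets taken from the setup commands above, relative to the main directory.
EXPECTED_PATHS = [
    "persona_vectors/.env",
    "persona_vectors/data_generation/trait_data_eval",
    "persona_vectors/data_generation/trait_data_extract",
    "persona_vectors/persona_vectors",
]


def missing_setup_paths(root="."):
    """Return the expected paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_setup_paths()
    if missing:
        print("Setup incomplete, missing:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All setup paths found.")
```

An empty result only confirms that the files and directories exist; it does not validate the key stored in .env.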

Experiments

This section covers how to run the different experiments in this repository. All logs and figures will be saved in the logs/ directory.

The default model for all experiments is mlabonne/gemma-3-27b-it-abliterated.

Hexaco

Run the Hexaco benchmark against a specific model with a variety of personas:

python test_personas_hexaco.py --model <model_name> --device cuda

Safety questions

The following experiments evaluate a model on a set of safety questions (which can be found here). The experiments cover a variety of settings for different personas (defined here) and emotions (defined here) in both system prompt and user inputs. All experiments contain a baseline without a persona or emotion.

The combinations are as follows:

  1. The LLM is instructed with a persona in its system prompt, and the user input contains the plain safety questions.
python test_personas_safety.py \
    --model <model_name> \
    --device cuda \
    --pass_at_k 1 \
    --hierarchy_level system
  1. The LLM is instructed with a persona in the user input, followed by the plain safety questions.
python test_personas_safety.py \
    --model <model_name> \
    --device cuda \
    --pass_at_k 1 \
    --hierarchy_level user
  1. The LLM is given a neutral instruction in the system prompt, and the user input contains the safety questions with an emotion prefix (e.g. "I am feeling angry. <safety question>").
python test_emotions_safety.py \
    --model <model_name> \
    --device cuda \
    --pass_at_k 1 \
    --hierarchy_level user \
    --question_type prefix
  1. The LLM is given a neutral instruction in the system prompt, and the user input contains an emotionalized version of the safety questions (which can be found here in the emotionalized_questions field).
python test_emotions_safety.py \
    --model <model_name> \
    --device cuda \
    --pass_at_k 1 \
    --hierarchy_level user \
    --question_type emotionalized
  1. The LLM is given a history of the user's previous emotional states (which can be found here in the emotion_history field) in the system prompt, and the user input contains the plain safety questions.
python test_emotions_safety.py \
    --model <model_name> \
    --device cuda \
    --pass_at_k 1 \
    --hierarchy_level system
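
The five settings above differ only in the script, the --hierarchy_level value, and the optional --question_type value. As a rough sketch, the full set of invocations for one model could be generated as shown here. The SETTINGS table and the build_commands helper are illustrative names and not part of the repository; the flags are the ones shown in the commands above.

```python
import shlex

# (script, hierarchy_level, question_type) for the five settings listed above.
# question_type is None where the command above does not pass that flag.
SETTINGS = [
    ("test_personas_safety.py", "system", None),
    ("test_personas_safety.py", "user", None),
    ("test_emotions_safety.py", "user", "prefix"),
    ("test_emotions_safety.py", "user", "emotionalized"),
    ("test_emotions_safety.py", "system", None),
]


def build_commands(model_name, device="cuda", pass_at_k=1):
    """Return one argv list per setting (hypothetical helper)."""
    commands = []
    for script, level, question_type in SETTINGS:
        argv = [
            "python", script,
            "--model", model_name,
            "--device", device,
            "--pass_at_k", str(pass_at_k),
            "--hierarchy_level", level,
        ]
        if question_type is not None:
            argv += ["--question_type", question_type]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for argv in build_commands("mlabonne/gemma-3-27b-it-abliterated"):
        print(shlex.join(argv))
```

Each argv list can be passed to subprocess.run(argv, check=True) from the main directory to run the settings one after another.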

Steering

To apply steering vectors for different personas and emotions, follow the next steps to create the datasets and activation vectors. The model_str variable is the model name with all "/" and ":" characters replaced by "-", e.g. mlabonne-gemma-3-27b-it-abliterated. The persona_name variable is the name of the persona for which you want to create the steering vector, e.g. evil.
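
The model_str rule can be written as a short helper. This is a sketch of the stated replacement rule only; the function name to_model_str is hypothetical, and the repository's scripts may derive the string differently.

```python
def to_model_str(model_name):
    """Replace every "/" and ":" in a model name with "-"."""
    return model_name.replace("/", "-").replace(":", "-")


if __name__ == "__main__":
    print(to_model_str("mlabonne/gemma-3-27b-it-abliterated"))
    # prints: mlabonne-gemma-3-27b-it-abliterated
```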

Tip

Creating the persona datasets and vectors is optional. The prebuilt datasets and vectors are already included in this repository and were copied into place during setup.

  1. Create the persona datasets
python gen_trait_data.py

This will create a dataset for each persona in the persona_vectors/data_generation/ directory.

  1. Generate activations using positive and negative system prompts from the previously generated datasets. The files will be saved in the eval_persona_extract/ directory.
cd persona_vectors

python -m eval.eval_persona \
    --model <model_name> \
    --trait <persona_name> \
    --output_path eval_persona_extract/<model_str>/<persona_name>_pos_instruct.csv \
    --persona_instruction_type pos \
    --assistant_name <persona_name> \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

python -m eval.eval_persona \
    --model <model_name> \
    --trait <persona_name> \
    --output_path eval_persona_extract/<model_str>/<persona_name>_neg_instruct.csv \
    --persona_instruction_type neg \
    --assistant_name <opposite of persona_name> \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract
  1. Generate the steering vectors for the different personas.
python generate_vec.py \
    --model_name <model_name> \
    --pos_path eval_persona_extract/<model_str>/<persona_name>_pos_instruct.csv \
    --neg_path eval_persona_extract/<model_str>/<persona_name>_neg_instruct.csv \
    --trait <persona_name> \
    --save_dir persona_vectors/<model_str>/<persona_name>
  1. From the main directory, re-run the persona safety question experiments with the --steering <persona_name> argument to apply the steering vectors to the model's activations.

Example for applying steering vectors

This example generates the evil persona steering vector for the mlabonne/gemma-3-27b-it-abliterated model and uses it for the safety questions experiment.

cd persona_vectors

python -m eval.eval_persona \
    --model mlabonne/gemma-3-27b-it-abliterated \
    --trait evil \
    --output_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_pos_instruct.csv \
    --persona_instruction_type pos \
    --assistant_name evil \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

python -m eval.eval_persona \
    --model mlabonne/gemma-3-27b-it-abliterated \
    --trait evil \
    --output_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_neg_instruct.csv \
    --persona_instruction_type neg \
    --assistant_name helpful \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract

python generate_vec.py \
    --model_name mlabonne/gemma-3-27b-it-abliterated \
    --pos_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_pos_instruct.csv \
    --neg_path eval_persona_extract/mlabonne-gemma-3-27b-it-abliterated/evil_neg_instruct.csv \
    --trait evil \
    --save_dir persona_vectors/mlabonne-gemma-3-27b-it-abliterated/evil

cd .. # go back to the main directory

python test_personas_safety.py \
    --model mlabonne/gemma-3-27b-it-abliterated \
    --device cuda \
    --pass_at_k 1 \
    --hierarchy_level system \
    --steering evil

Create steering vectors for all personas

bash generate_vectors.sh
