PsySET: a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains

Abstract

PsySET framework comprises three components: (1) LLM steering methods, (2) psychometric evaluation tasks for assessing effectiveness, and (3) trustworthiness evaluations.

Experiments

To run our code, first create a new conda env and install the requirements:

conda create --name psyset python=3.11
conda activate psyset
pip install -r requirements.txt

Evaluation of Effectiveness:

Prompt-based methods:

To run the main experiments, please use the main.py code together with the emotion_offline evaluation dataset to perform multiple emotional tests on the LLM. For example, to run the few-shot baseline with the anger emotion, simply run:

python main.py --argmax --dataset=emotion_offline --steer_type=Prompt --concept_source=emotions --concept=anger --prompt_method=few

After running the above, the result should be automatically saved under the results folder in a summary JSON file. Note that the default LLM is Llama-3.1-8B-Instruct. Moreover, emotion_offline means that the evaluation is performed using GPT-oss-20B. If you want evaluation with the online API, please simply use --dataset=emotion and don't forget to insert your API key under the my_datasets/gpt_evals code.

Similarly, one can use the following to perform the personality tests:

python main.py --argmax --dataset=personality_offline --concept_source=personalities --concept=extraversion --steer_type=Prompt --prompt_method=few

Vector-Injection (VI) methods:

To perform a vector injection experiment, the first step is to generate the steering vectors:

python generate_steer_vecs.py --dataset_source=goemotions_instruct_full_binary --target_tokens=all

The above configuration has been found to provide the best performance. Next, run the following to steer toward fear as an example:

python main.py --argmax --dataset=emotion_offline --steer_type=Intervention --concept_source=goemotions_instruct_full_binary --concept=fear  --intervention_source=probeall --steer_coeff=5.0 --steer_layers=16,17

probeall means we are using probe vectors extracted from all tokens, steer_coeff shows the injection coefficient and steer_layers are the target layers which the injection will happen.

SFT/DPO methods:

To perform SFT, we should first train an SFT model:

python PEFT.py --lr=1e-4 --steps=2048 --method=SFT --dataset=emotion_translate_binary_full

Next, run the following code to evaluate the trained models:

python main.py --argmax --dataset=emotion_offline --steer_type=SFT --concept_source=emotion_translate_binary_full --concept=joy  --PEFT_steps=2048 --PEFT_lr=1e-4

Evaluation of Trustworthiness

Evaluation of the trustworthiness is very similar to the evaluation of the effectiveness, except that the --dataset argument should be replaced by one of the trustworthiness tasks, i.e., one of truthfulness, safety, fairness, privacy, robustness and ethics. For instance, run

python main.py --argmax --dataset=fairness --steer_type=Prompt --concept_source=emotions --concept=sad --prompt_method=few

to evaluate the fairness of a sad LLM. Don't forget to put your GPT and perspective API keys under trustllm/config.py before executing these tests.

📄 Citation

If you found our paper helpful, please cite us as:

@misc{banayeeanzade2025psychologicalsteeringllmsevaluation,
      title={Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness}, 
      author={Amin Banayeeanzade and Ala N. Tak and Fatemeh Bahrani and Anahita Bolourani and Leonardo Blas and Emilio Ferrara and Jonathan Gratch and Sai Praneeth Karimireddy},
      year={2025},
      eprint={2510.04484},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.04484}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
LLMs		LLMs
concept_datasets		concept_datasets
dataset		dataset
figs		figs
my_datasets		my_datasets
trustllm		trustllm
.gitignore		.gitignore
PEFT.py		PEFT.py
README.md		README.md
concept_datasets.py		concept_datasets.py
generate_steer_vecs.py		generate_steer_vecs.py
main.py		main.py
path_manager.py		path_manager.py
prompt_manager.py		prompt_manager.py
requirements.txt		requirements.txt
steer_manager.py		steer_manager.py
steer_vec_utils.py		steer_vec_utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PsySET: a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains

Abstract

Experiments

Evaluation of Effectiveness:

Prompt-based methods:

Vector-Injection (VI) methods:

SFT/DPO methods:

Evaluation of Trustworthiness

📄 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PsySET: a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains

Abstract

Experiments

Evaluation of Effectiveness:

Prompt-based methods:

Vector-Injection (VI) methods:

SFT/DPO methods:

Evaluation of Trustworthiness

📄 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages