This repository contains the code and data for CoSToM, our ACL 2026 accepted work on probing and steering intrinsic Theory-of-Mind (ToM) representations in large language models. Our work studies the gap between a model's internal ToM inference and its external conversational behavior, and introduces a lightweight causal intervention method that transfers ToM-relevant activations into generation.
Our work on CoSToM has been accepted to appear at ACL 2026 (Main Conference).
- Title: "CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models"
- Authors: Mengfan Li, Xuanhua Shi, Yang Deng
- Venue: Annual Meeting of the Association for Computational Linguistics (ACL 2026)
- Status: Accepted, to appear
- Preprint: CoSToM on arXiv
```
.
├── CoSToM_reading_tom.py        # Probe intrinsic ToM representations on structured ToM tasks
├── CoSToM_generation.py         # Steer generation with CoSToM latent intervention
├── dataset_utils.py             # Data formatting utilities for structured ToM reading
├── generation_data_utils.py     # Data formatting utilities for negotiation generation
├── interpret_config.py          # Central experiment configuration
├── reading_evaluate.py          # Automatic evaluation for structured ToM outputs
├── model_generation_direct.py   # Direct generation baselines
├── llm_as_a_judge.py            # LLM-based evaluation for response quality
├── lit/
│   └── utils/                   # Model loading, activation extraction, and intervention utilities
├── train/
│   ├── train.py                 # Distributed training entry point for CoSToM
│   ├── config.py                # Training and LoRA configuration
│   ├── dataset_utils.py         # Training data construction utilities
│   ├── activation_utils.py      # Training activation intervention utilities
│   └── infra_utils.py           # Training model and checkpoint utilities
└── data/
    ├── NegotiationToM/
    │   ├── train.json
    │   ├── valid.json
    │   └── test.json
    └── PersuasionToM/
        ├── persuader_test.json
        ├── train.json
        ├── valid.json
        └── test.json
```
- `NegotiationToM`: the main dataset used by the currently released scripts, with train/valid/test splits of 1335/334/711 examples.
- `PersuasionToM`: an additional persuasion benchmark with train/valid/test splits of 10355/2219/2222 examples.
- `PersuasionToM/persuader_test.json`: a response-generation style evaluation file with 1241 examples, storing `history`, `ground_truth`, and `persuader_name`.
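As a sketch of the assumed record layout, each entry in `persuader_test.json` carries a dialogue `history`, a reference `ground_truth` response, and a `persuader_name`. The field names come from the description above; the value types and the example content below are illustrative assumptions, not real dataset entries:

```python
import json

# Hypothetical example record; real entries in data/PersuasionToM/persuader_test.json
# use the same field names, though the exact value formats may differ.
record = {
    "history": [
        "Persuader: Would you consider donating to Save the Children?",
        "Persuadee: I'm not sure, tell me more.",
    ],
    "ground_truth": "Even a small donation can provide meals for a child in need.",
    "persuader_name": "Alice",
}

# Loading the real file would look like:
# with open("data/PersuasionToM/persuader_test.json") as f:
#     records = json.load(f)

print(sorted(record.keys()))
```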
Most experiment settings are currently controlled through `interpret_config.py`, including:

- `target_model_name`
- `decoder_model_name`
- `min_layer_to_read`
- `max_layer_to_read`
- `num_layers_to_read`
- `layer_to_write`
- `dataset`
- `eval_qa`
- `save_name`
- `model_path`
Before running an experiment, please update this file to match your model checkpoints, dataset path, and desired intervention layers.
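For orientation, a minimal `interpret_config.py` might look like the following. Every value here is a placeholder, not a released default; the variable names are the ones listed above:

```python
# interpret_config.py -- illustrative values only; replace with your own setup.

target_model_name = "Qwen/Qwen2.5-7B-Instruct"   # model whose ToM states are read
decoder_model_name = "Qwen/Qwen2.5-7B-Instruct"  # decoder that receives the written states
model_path = "/path/to/checkpoints"              # local checkpoint directory

min_layer_to_read = 12      # first layer to probe (placeholder value)
max_layer_to_read = 20      # last layer to probe (placeholder value)
num_layers_to_read = 4      # how many layers are read per intervention
layer_to_write = 8          # decoder layer receiving the transferred activations

dataset = "NegotiationToM"  # which data/ subdirectory to use
eval_qa = True              # run structured QA-style evaluation
save_name = "costom_run1"   # tag used for output files
```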
At minimum, the current scripts expect the following packages:
```bash
pip install torch transformers peft fire numpy pandas openai tqdm
```

This script probes whether the target model internally represents ToM-relevant information.
```bash
python CoSToM_reading_tom.py
```

Outputs are written to `reading_results/`, and logs are written to `reading_logs/`.
After generating structured ToM outputs, evaluate them with:
```bash
python reading_evaluate.py \
    --path reading_results/ \
    --model_name qwen-2.5-7B \
    --output_dir ./reading_eval_nego \
    --dataset Negotiation
```

This script reports exact-match style accuracy for:
- intent prediction
- desire prediction
- belief prediction
Note: the current implementation constructs the target result filename internally. You may need to update the filename pattern in `reading_evaluate.py` to match your generated result file.
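The exact-match style accuracy can be sketched as below. This is a simplified stand-in, not the actual logic in `reading_evaluate.py`; the lowercase/strip normalization and the toy predictions are assumptions:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference after
    lowercasing and stripping whitespace (normalization is an assumption)."""
    assert len(predictions) == len(references)
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# One score per ToM dimension, mirroring the three reported metrics.
# The prediction/reference pairs below are made-up examples.
scores = {
    dim: exact_match_accuracy(preds, refs)
    for dim, (preds, refs) in {
        "intent": (["buy low", "decline"], ["buy low", "accept"]),
        "desire": (["get a fair price"], ["get a fair price"]),
        "belief": (["seller will concede"], ["seller is firm"]),
    }.items()
}
print(scores)
```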
Training is implemented under `train/`. The main entry point is `train/train.py`, and the training setup is controlled by `train/config.py`, including model name, data paths, read/write layers, LoRA modules, learning rate, and output directory.
Run training from the `train/` directory with `torchrun`:

```bash
cd train
torchrun --nproc_per_node=4 train.py
```

Outputs are written to `train/train_runs/<min_layer_to_read>/` by default. Adjust `--nproc_per_node` and the values in `train/config.py` to match your GPU environment and dataset paths.
This script reads latent ToM states from the target model and writes them into the decoder pathway to generate a negotiation response.
```bash
python CoSToM_generation.py
```

Outputs are written to `generation/`, and logs are written to `trained_logs/`.
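Conceptually, this read/write intervention can be sketched with PyTorch forward hooks: capture a hidden state at a read layer and overwrite (or blend into) the activation at a write layer. The toy modules and the 0.5 blending weight below are illustrative assumptions, not the project's actual implementation in `lit/utils/`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the target model's read layer and the decoder's write layer.
read_layer = nn.Linear(8, 8)
write_layer = nn.Linear(8, 8)

captured = {}

def read_hook(module, inputs, output):
    # Save the activation produced at the read layer.
    captured["tom_state"] = output.detach()

def write_hook(module, inputs, output):
    # Blend the captured ToM state into the decoder activation.
    # (A returned tensor replaces the module's output; the 0.5 weight is arbitrary.)
    return 0.5 * output + 0.5 * captured["tom_state"]

h_read = read_layer.register_forward_hook(read_hook)
h_write = write_layer.register_forward_hook(write_hook)

x = torch.randn(1, 8)
_ = read_layer(x)          # "target model" pass: capture the latent state
steered = write_layer(x)   # "decoder" pass: activation is modified in flight

h_read.remove()
h_write.remove()
```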
`one_model_generation_direct.py` generates responses directly from a base model or PEFT model without CoSToM intervention. Model entries are specified in the hard-coded `MODEL_CONFIGS` list.
```bash
python one_model_generation_direct.py
```

Outputs are saved under `Metrics/`.
llm_as_a_judge.py scores generated responses on:
- ToM reasoning quality
- contextual coherence
- negotiation strategy effectiveness
Before running, set the following fields inside the script:
- `API_KEY`
- `base_url` in the `OpenAI(...)` client
- `MODEL_NAME`
Then run:
```bash
python llm_as_a_judge.py --input Metrics/your_generation_file.json
```

This produces:
- raw JSON scores
- an aggregated ranking
- a Markdown report
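The aggregation step can be sketched as averaging the three per-dimension judge scores for each system and ranking by the mean. The score scale, the equal weighting, and the example numbers below are assumptions, not the script's exact scheme:

```python
# Hypothetical judge output: per-response scores on the three dimensions
# (a 1-10 scale is assumed here).
judge_scores = {
    "costom": [
        {"tom_reasoning": 8, "coherence": 9, "strategy": 7},
        {"tom_reasoning": 7, "coherence": 8, "strategy": 8},
    ],
    "direct_baseline": [
        {"tom_reasoning": 5, "coherence": 8, "strategy": 6},
        {"tom_reasoning": 6, "coherence": 7, "strategy": 5},
    ],
}

def aggregate(scores):
    """Mean over responses of the mean over the three dimensions."""
    per_response = [sum(s.values()) / len(s) for s in scores]
    return sum(per_response) / len(per_response)

# Rank systems by aggregate score, best first.
ranking = sorted(judge_scores, key=lambda name: aggregate(judge_scores[name]), reverse=True)
print(ranking)
```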
- Several scripts use hard-coded GPU devices such as `cuda:0` and `cuda:1`. Please adjust them to match your environment before running experiments.
- Although the entry points use `fire.Fire(...)`, the main experiment settings are still controlled by `interpret_config.py`.
We thank the authors of *LatentQA: Teaching LLMs to Decode Activations Into Natural Language* for their inspiring work on activation decoding and model control. Parts of this project build on ideas from LatentQA for reading and steering latent model representations.