CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

This repository contains the code and data for CoSToM, our ACL 2026 accepted work on probing and steering intrinsic Theory-of-Mind (ToM) representations in large language models. Our work studies the gap between a model's internal ToM inference and its external conversational behavior, and introduces a lightweight causal intervention method that transfers ToM-relevant activations into generation.

Publication

Our work on CoSToM has been accepted to appear at ACL 2026 (Main Conference).

  • Title: "CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models"
  • Authors: Mengfan Li, Xuanhua Shi, Yang Deng
  • Venue: Annual Meeting of the Association for Computational Linguistics (ACL 2026)
  • Status: Accepted, to appear
  • Preprint: CoSToM on arXiv

Repository Structure

.
├── CoSToM_reading_tom.py           # Probe intrinsic ToM representations on structured ToM tasks
├── CoSToM_generation.py            # Steer generation with CoSToM latent intervention
├── dataset_utils.py                # Data formatting utilities for structured ToM reading
├── generation_data_utils.py        # Data formatting utilities for negotiation generation
├── interpret_config.py             # Central experiment configuration
├── reading_evaluate.py             # Automatic evaluation for structured ToM outputs
├── one_model_generation_direct.py  # Direct generation baselines
├── llm_as_a_judge.py               # LLM-based evaluation for response quality
├── lit/
│   └── utils/                      # Model loading, activation extraction, and intervention utilities
├── train/
│   ├── train.py                     # Distributed training entry point for CoSToM
│   ├── config.py                    # Training and LoRA configuration
│   ├── dataset_utils.py             # Training data construction utilities
│   ├── activation_utils.py          # Training activation intervention utilities
│   └── infra_utils.py               # Training model and checkpoint utilities
└── data/
    ├── NegotiationToM/
    │   ├── train.json
    │   ├── valid.json
    │   └── test.json
    └── PersuasionToM/
        ├── persuader_test.json
        ├── train.json
        ├── valid.json
        └── test.json

Data

  • NegotiationToM: the main dataset used by the currently released scripts, with train/valid/test splits of 1335/334/711 examples.
  • PersuasionToM: an additional persuasion benchmark with train/valid/test splits of 10355/2219/2222 examples.
  • PersuasionToM/persuader_test.json: a response-generation style evaluation file with 1241 examples, storing history, ground_truth, and persuader_name.
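
For reference, a persuader_test.json record could be sketched like this. The field names (history, ground_truth, persuader_name) come from the description above, but the nested layout shown here is an assumption, not the repository's actual schema:

```python
import json

# Hypothetical record layout; only the top-level field names are documented.
record = {
    "history": [
        {"speaker": "persuader", "text": "Would you consider donating today?"},
        {"speaker": "persuadee", "text": "I'm not sure it would make a difference."},
    ],
    "ground_truth": "Even a small donation is pooled with others, so it adds up.",
    "persuader_name": "persuader",
}

def load_split(path):
    # Each split file is assumed to be a JSON list of such records.
    with open(path) as f:
        return json.load(f)
```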

Configuration

Most experiment settings are currently controlled through interpret_config.py, including:

  • target_model_name
  • decoder_model_name
  • min_layer_to_read
  • max_layer_to_read
  • num_layers_to_read
  • layer_to_write
  • dataset
  • eval_qa
  • save_name
  • model_path

Before running an experiment, please update this file to match your model checkpoints, dataset path, and desired intervention layers.
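
As a rough sketch, the file might look like the following. Every value here is a placeholder chosen for illustration, not a recommendation from the authors:

```python
# interpret_config.py -- illustrative values only; replace with your own
# model checkpoints, dataset name, and intervention layers.
target_model_name = "Qwen/Qwen2.5-7B-Instruct"   # model whose ToM states are probed
decoder_model_name = "Qwen/Qwen2.5-7B-Instruct"  # model used to decode/steer generation
min_layer_to_read = 12      # first layer to read activations from
max_layer_to_read = 20      # last layer to read activations from
num_layers_to_read = 4      # how many layers are sampled in that range
layer_to_write = 8          # decoder layer receiving the intervention
dataset = "NegotiationToM"
eval_qa = True
save_name = "costom_nego"
model_path = "/path/to/checkpoints"
```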

Dependencies

At minimum, the current scripts expect the following packages:

pip install torch transformers peft fire numpy pandas openai tqdm

Running the Main Experiments

1. Read intrinsic ToM states

This script probes whether the target model internally represents ToM-relevant information.

python CoSToM_reading_tom.py

Outputs are written to reading_results/, and logs are written to reading_logs/.

2. Evaluate structured ToM predictions

After generating structured ToM outputs, evaluate them with:

python reading_evaluate.py \
  --path reading_results/ \
  --model_name qwen-2.5-7B \
  --output_dir ./reading_eval_nego \
  --dataset Negotiation

This script reports exact-match style accuracy for:

  • intent prediction
  • desire prediction
  • belief prediction

Note: the current implementation constructs the target result filename internally. You may need to update the filename pattern in reading_evaluate.py to match your generated result file.
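
Exact-match scoring of this kind can be sketched as below; the `normalize` helper and function name are illustrative, not the repository's actual implementation:

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different strings match.
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match_accuracy(predictions, references):
    # Fraction of predictions that match their reference after normalization.
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```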

3. Train CoSToM

Training is implemented under train/. The main entry point is train/train.py, and the training setup is controlled by train/config.py, including model name, data paths, read/write layers, LoRA modules, learning rate, and output directory.

Run training from the train/ directory with torchrun:

cd train
torchrun --nproc_per_node=4 train.py

Outputs are written to train/train_runs/<min_layer_to_read>/ by default. Adjust --nproc_per_node and the values in train/config.py to match your GPU environment and dataset paths.

4. Steer negotiation generation with CoSToM

This script reads latent ToM states from the target model and writes them into the decoder pathway to generate a negotiation response.

python CoSToM_generation.py

Outputs are written to generation/, and logs are written to trained_logs/.
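
The read-then-write pattern can be illustrated with PyTorch forward hooks on a toy model. This is a minimal sketch of the mechanism only, with invented module sizes; it is not CoSToM's actual intervention code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the target model and the decoder pathway.
target = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
decoder = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

captured = {}

def read_hook(module, inputs, output):
    # Cache the hidden state produced at the read layer of the target model.
    captured["h"] = output.detach()

def write_hook(module, inputs, output):
    # Overwrite the decoder's activation at the write layer with the cached state.
    return captured["h"]

target[0].register_forward_hook(read_hook)
decoder[0].register_forward_hook(write_hook)

x = torch.randn(1, 8)
_ = target(x)         # first pass fills captured["h"]
steered = decoder(x)  # decoder's later layers now see the target's state
```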

5. Run direct generation baselines

one_model_generation_direct.py generates responses directly from a base model or PEFT model without CoSToM intervention. Model entries are specified in the hard-coded MODEL_CONFIGS list.

python one_model_generation_direct.py

Outputs are saved under Metrics/.
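
The entries in MODEL_CONFIGS might take a shape like the following; all names and paths are hypothetical placeholders, and the actual keys in one_model_generation_direct.py may differ:

```python
# Hypothetical layout of the hard-coded MODEL_CONFIGS list.
MODEL_CONFIGS = [
    # Base model without any adapter.
    {"name": "qwen-2.5-7B-base",
     "model_path": "Qwen/Qwen2.5-7B-Instruct",
     "peft_path": None},
    # Same base model with a PEFT (LoRA) adapter applied.
    {"name": "qwen-2.5-7B-lora",
     "model_path": "Qwen/Qwen2.5-7B-Instruct",
     "peft_path": "/path/to/lora_adapter"},
]
```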

6. LLM-as-a-judge evaluation

llm_as_a_judge.py scores generated responses on:

  • ToM reasoning quality
  • contextual coherence
  • negotiation strategy effectiveness

Before running, set the following fields inside the script:

  • API_KEY
  • base_url in the OpenAI(...) client
  • MODEL_NAME

Then run:

python llm_as_a_judge.py --input Metrics/your_generation_file.json

This produces:

  • raw JSON scores
  • an aggregated ranking
  • a Markdown report
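
The aggregation step could be sketched as follows; the score dictionary, model names, and criterion keys are invented for illustration and do not reflect the script's internal format:

```python
from statistics import mean

# Hypothetical raw judge output: {model: {criterion: [per-example scores]}}.
raw_scores = {
    "costom":   {"tom": [4, 5, 4], "coherence": [5, 4, 5], "strategy": [4, 4, 5]},
    "baseline": {"tom": [3, 3, 4], "coherence": [4, 4, 4], "strategy": [3, 4, 3]},
}

def aggregate(scores):
    # Average each criterion per model, then rank models by overall mean.
    table = {
        model: {crit: mean(vals) for crit, vals in crits.items()}
        for model, crits in scores.items()
    }
    ranking = sorted(table, key=lambda m: mean(table[m].values()), reverse=True)
    return table, ranking

table, ranking = aggregate(raw_scores)
```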

Notes

  • Several scripts use hard-coded GPU devices such as cuda:0 and cuda:1. Please adjust them to match your environment before running experiments.
  • Although the entry points use fire.Fire(...), the main experiment settings are still controlled by interpret_config.py.

Acknowledgements

We thank the authors of LatentQA: Teaching LLMs to Decode Activations Into Natural Language for their inspiring work on activation decoding and model control. Parts of this project build on ideas from LatentQA for reading and steering latent model representations.
