CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

This repository contains the code and data for CoSToM, our ACL 2026 accepted work on probing and steering intrinsic Theory-of-Mind (ToM) representations in large language models. Our work studies the gap between a model's internal ToM inference and its external conversational behavior, and introduces a lightweight causal intervention method that transfers ToM-relevant activations into generation.

Publication

Our work on CoSToM has been accepted to appear at ACL 2026 (Main Conference).

  • Title: "CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models"
  • Authors: Mengfan Li, Xuanhua Shi, Yang Deng
  • Venue: Annual Meeting of the Association for Computational Linguistics (ACL 2026)
  • Status: Accepted, to appear
  • Preprint: CoSToM on arXiv

Repository Structure

.
├── CoSToM_reading_tom.py           # Probe intrinsic ToM representations on structured ToM tasks
├── CoSToM_generation.py            # Steer generation with CoSToM latent intervention
├── dataset_utils.py                # Data formatting utilities for structured ToM reading
├── generation_data_utils.py        # Data formatting utilities for negotiation generation
├── interpret_config.py             # Central experiment configuration
├── reading_evaluate.py             # Automatic evaluation for structured ToM outputs
├── one_model_generation_direct.py  # Direct generation baselines
├── llm_as_a_judge.py               # LLM-based evaluation for response quality
├── lit/
│   └── utils/                      # Model loading, activation extraction, and intervention utilities
├── train/
│   ├── train.py                     # Distributed training entry point for CoSToM
│   ├── config.py                    # Training and LoRA configuration
│   ├── dataset_utils.py             # Training data construction utilities
│   ├── activation_utils.py          # Training activation intervention utilities
│   └── infra_utils.py               # Training model and checkpoint utilities
└── data/
    ├── NegotiationToM/
    │   ├── train.json
    │   ├── valid.json
    │   └── test.json
    └── PersuasionToM/
        ├── persuader_test.json
        ├── train.json
        ├── valid.json
        └── test.json

Data

  • NegotiationToM: the main dataset used by the currently released scripts, with train/valid/test splits of 1335/334/711 examples.
  • PersuasionToM: an additional persuasion benchmark with train/valid/test splits of 10355/2219/2222 examples.
  • PersuasionToM/persuader_test.json: a response-generation style evaluation file with 1241 examples, storing history, ground_truth, and persuader_name.
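
For reference, a persuader_test.json record could be sketched like this. The field names (history, ground_truth, persuader_name) come from the description above, but the nested layout shown here is an assumption, not the repository's actual schema:

```python
import json

# Hypothetical record layout; only the top-level field names are documented.
record = {
    "history": [
        {"speaker": "persuader", "text": "Would you consider donating today?"},
        {"speaker": "persuadee", "text": "I'm not sure it would make a difference."},
    ],
    "ground_truth": "Even a small donation is pooled with others, so it adds up.",
    "persuader_name": "persuader",
}

def load_split(path):
    # Each split file is assumed to be a JSON list of such records.
    with open(path) as f:
        return json.load(f)
```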

Configuration

Most experiment settings are currently controlled through interpret_config.py, including:

  • target_model_name
  • decoder_model_name
  • min_layer_to_read
  • max_layer_to_read
  • num_layers_to_read
  • layer_to_write
  • dataset
  • eval_qa
  • save_name
  • model_path

Before running an experiment, please update this file to match your model checkpoints, dataset path, and desired intervention layers.
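
As a rough sketch, the file might look like the following. Every value here is a placeholder chosen for illustration, not a recommendation from the authors:

```python
# interpret_config.py -- illustrative values only; replace with your own
# model checkpoints, dataset name, and intervention layers.
target_model_name = "Qwen/Qwen2.5-7B-Instruct"   # model whose ToM states are probed
decoder_model_name = "Qwen/Qwen2.5-7B-Instruct"  # model used to decode/steer generation
min_layer_to_read = 12      # first layer to read activations from
max_layer_to_read = 20      # last layer to read activations from
num_layers_to_read = 4      # how many layers are sampled in that range
layer_to_write = 8          # decoder layer receiving the intervention
dataset = "NegotiationToM"
eval_qa = True
save_name = "costom_nego"
model_path = "/path/to/checkpoints"
```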

Dependencies

At minimum, the current scripts expect the following packages:

pip install torch transformers peft fire numpy pandas openai tqdm

Running the Main Experiments

1. Read intrinsic ToM states

This script probes whether the target model internally represents ToM-relevant information.

python CoSToM_reading_tom.py

Outputs are written to reading_results/, and logs are written to reading_logs/.

2. Evaluate structured ToM predictions

After generating structured ToM outputs, evaluate them with:

python reading_evaluate.py \
  --path reading_results/ \
  --model_name qwen-2.5-7B \
  --output_dir ./reading_eval_nego \
  --dataset Negotiation

This script reports exact-match style accuracy for:

  • intent prediction
  • desire prediction
  • belief prediction

Note: the current implementation constructs the target result filename internally. You may need to update the filename pattern in reading_evaluate.py to match your generated result file.
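
Exact-match scoring of this kind can be sketched as below; the `normalize` helper and function name are illustrative, not the repository's actual implementation:

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different strings match.
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match_accuracy(predictions, references):
    # Fraction of predictions that match their reference after normalization.
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```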

3. Train CoSToM

Training is implemented under train/. The main entry point is train/train.py, and the training setup is controlled by train/config.py, including model name, data paths, read/write layers, LoRA modules, learning rate, and output directory.

Run training from the train/ directory with torchrun:

cd train
torchrun --nproc_per_node=4 train.py

Outputs are written to train/train_runs/<min_layer_to_read>/ by default. Adjust --nproc_per_node and the values in train/config.py to match your GPU environment and dataset paths.

4. Steer negotiation generation with CoSToM

This script reads latent ToM states from the target model and writes them into the decoder pathway to generate a negotiation response.

python CoSToM_generation.py

Outputs are written to generation/, and logs are written to trained_logs/.
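
The read-then-write pattern can be illustrated with PyTorch forward hooks on a toy model. This is a minimal sketch of the mechanism only, with invented module sizes; it is not CoSToM's actual intervention code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the target model and the decoder pathway.
target = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
decoder = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

captured = {}

def read_hook(module, inputs, output):
    # Cache the hidden state produced at the read layer of the target model.
    captured["h"] = output.detach()

def write_hook(module, inputs, output):
    # Overwrite the decoder's activation at the write layer with the cached state.
    return captured["h"]

target[0].register_forward_hook(read_hook)
decoder[0].register_forward_hook(write_hook)

x = torch.randn(1, 8)
_ = target(x)         # first pass fills captured["h"]
steered = decoder(x)  # decoder's later layers now see the target's state
```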

5. Run direct generation baselines

one_model_generation_direct.py generates responses directly from a base model or PEFT model without CoSToM intervention. Model entries are specified in the hard-coded MODEL_CONFIGS list.

python one_model_generation_direct.py

Outputs are saved under Metrics/.
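
The entries in MODEL_CONFIGS might take a shape like the following; all names and paths are hypothetical placeholders, and the actual keys in one_model_generation_direct.py may differ:

```python
# Hypothetical layout of the hard-coded MODEL_CONFIGS list.
MODEL_CONFIGS = [
    # Base model without any adapter.
    {"name": "qwen-2.5-7B-base",
     "model_path": "Qwen/Qwen2.5-7B-Instruct",
     "peft_path": None},
    # Same base model with a PEFT (LoRA) adapter applied.
    {"name": "qwen-2.5-7B-lora",
     "model_path": "Qwen/Qwen2.5-7B-Instruct",
     "peft_path": "/path/to/lora_adapter"},
]
```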

6. LLM-as-a-judge evaluation

llm_as_a_judge.py scores generated responses on:

  • ToM reasoning quality
  • contextual coherence
  • negotiation strategy effectiveness

Before running, set the following fields inside the script:

  • API_KEY
  • base_url in the OpenAI(...) client
  • MODEL_NAME

Then run:

python llm_as_a_judge.py --input Metrics/your_generation_file.json

This produces:

  • raw JSON scores
  • an aggregated ranking
  • a Markdown report
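
The aggregation step could be sketched as follows; the score dictionary, model names, and criterion keys are invented for illustration and do not reflect the script's internal format:

```python
from statistics import mean

# Hypothetical raw judge output: {model: {criterion: [per-example scores]}}.
raw_scores = {
    "costom":   {"tom": [4, 5, 4], "coherence": [5, 4, 5], "strategy": [4, 4, 5]},
    "baseline": {"tom": [3, 3, 4], "coherence": [4, 4, 4], "strategy": [3, 4, 3]},
}

def aggregate(scores):
    # Average each criterion per model, then rank models by overall mean.
    table = {
        model: {crit: mean(vals) for crit, vals in crits.items()}
        for model, crits in scores.items()
    }
    ranking = sorted(table, key=lambda m: mean(table[m].values()), reverse=True)
    return table, ranking

table, ranking = aggregate(raw_scores)
```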

Notes

  • Several scripts use hard-coded GPU devices such as cuda:0 and cuda:1. Please adjust them to match your environment before running experiments.
  • Although the entry points use fire.Fire(...), the main experiment settings are still controlled by interpret_config.py.

Acknowledgements

We thank the authors of LatentQA: Teaching LLMs to Decode Activations Into Natural Language for their inspiring work on activation decoding and model control. Parts of this project build on ideas from LatentQA for reading and steering latent model representations.
