Experiment setup for the thesis "Sensitivity Analysis of Evaluation Awareness in Large Language Models".
- Trains one linear probe for each layer of the model on a contrastive dataset of evaluation-aware vs. non-aware model responses.
- Evaluates each probe on the Needham et al. dataset (chat-only transcripts) to find the best layer via AUROC — how well does the probe separate evaluation conversations from organic ones?
- Runs a sensitivity dataset through the model and scores each prompt using the winning probe.
- Saves augmented results to `output/`, including the generated text, probe projection score, and binary verdict.
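The per-layer AUROC selection described above can be sketched roughly as follows. This is an illustrative sketch, not this repo's actual API: `auroc`, `best_layer`, and the score/label containers are hypothetical names, and tied scores are not rank-adjusted.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    scores: probe projection score per transcript.
    labels: 1 for evaluation conversations, 0 for organic ones.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def best_layer(scores_by_layer, labels):
    """Pick the layer whose probe best separates eval vs. organic transcripts."""
    return max(scores_by_layer, key=lambda layer: auroc(scores_by_layer[layer], labels))
```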
Everything related to probe training is based on and credited to evaluation-awareness-probing, including:
- `datasets/contrastive_dataset.json` — the contrastive prompt pairs used for probe training
- The probe training methodology (normalized mean difference of activations at a chosen layer)
- The steering vector design and projection/thresholding approach
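A minimal sketch of that methodology, assuming per-layer activations are already extracted as NumPy arrays. The function names are illustrative, not the actual `src/probe.py` interface, and the zero threshold is an assumed default:

```python
import numpy as np

def train_probe(aware_acts, unaware_acts):
    """Normalized mean-difference probe.

    aware_acts / unaware_acts: (n_examples, d_model) activation arrays from
    the contrastive pairs. The probe is the unit-norm difference of means.
    """
    direction = aware_acts.mean(axis=0) - unaware_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def score(acts, probe):
    """Projection of activations onto the probe direction."""
    return acts @ probe

def verdict(acts, probe, threshold=0.0):
    """Binary evaluation-aware verdict via thresholded projection."""
    return score(acts, probe) > threshold
```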
`src/probe.py` mirrors the implementation from that repository.
Probe evaluation uses the chat-only transcripts from "Large Language Models Often Know They Are Being Evaluated" (Needham et al.). The dataset is not included in this repo; see the setup instructions below.
Requires Python 3.11. Create a virtual environment with your preferred tool and install dependencies:
```bash
pip install -r requirements.txt
```

The Needham et al. dataset is not included in this repo. To obtain it:
```bash
# From the parent directory of this repo:
git clone https://huggingface.co/datasets/jjpn2/eval_awareness ../eval_awareness
cd ../eval_awareness
git lfs pull
bash scripts/decrypt.sh
cd -
```

This places `dataset.json` at `../eval_awareness/dataset.json`, which is the default path.
To use a custom path: `--eval-dataset /path/to/dataset.json`
```bash
# All layers (default) — trains probes on all layers, evaluates on the Needham et al. dataset to find the best
python run.py --test-mode

# Specific layers only — evaluates on the Needham et al. dataset to pick the best among the specified layers
python run.py --test-mode --layers 4,8,12

# Single layer (skips the Needham et al. evaluation automatically)
python run.py --test-mode --layers 8

# Full run with Llama 70B (requires 2x 80GB GPUs)
python run.py --model meta-llama/Llama-3.3-70B-Instruct --device cuda --n-gpus 2

# Full run with Llama 70B using only the first 50 sensitivity records
python run.py --model meta-llama/Llama-3.3-70B-Instruct --device cuda --n-gpus 2 --sensitivity-limit 50

# Skip probe retraining
python run.py --test-mode --load-probe output/probes.pt

# Custom eval dataset path
python run.py --test-mode --eval-dataset /path/to/dataset.json
```

The 1B model (`--test-mode`) runs on CPU/MPS. The 70B model requires CUDA with 2x 80GB GPUs (`--n-gpus 2`).
For running 70B models that don't fit on local hardware, use a cloud GPU pod.
Quick version (on a 2x A100 80GB RunPod pod):

```bash
cd /workspace
git clone <your_repo_url> exp-setup
bash exp-setup/scripts/runpod_setup_env.sh   # install deps, auth, dataset
cd exp-setup
bash scripts/runpod_run_llama70b_50.sh       # run experiment
```

Then from your local machine:
```bash
bash scripts/download_runpod_artifacts.sh <IP> <PORT>
```

```
run.py                             # Main orchestration script
src/
  model.py                         # Model loading + generation (HookedTransformer)
  probe.py                         # Probe training and inference
  needham.py                       # Load Needham et al. dataset for probe selection
datasets/
  contrastive_dataset.json         # Training data (from evaluation-awareness-probing)
  sensitivity_dataset.json         # Evaluation prompts
scripts/
  runpod_setup_env.sh              # Pod environment setup (deps, auth, dataset)
  runpod_run_llama70b_50.sh        # Run experiment on pod
  download_runpod_artifacts.sh     # Download results to local machine
docs/
  cloud-llama70b-budgeted-run.md   # Full RunPod guide with troubleshooting
output/
  probes.pt                        # Saved probe weights (one per layer)
```