yifan1207/PT-IT-Model-Differences


Instruction Tuning Creates a Broad Convergence Gap

A Late-Centered Corrective Computation Across Transformer Families

Python 3.13+ · PyTorch · 6 Model Families

TL;DR — The current paper story is: instruction tuning creates a broad convergence gap and delayed commitment under native decoding across six model families. The strongest mechanistic leverage is a late-layer MLP-centered corrective bottleneck. The cleanest cross-model internal causal evidence comes from matched-prefix graft/swap experiments, and the free-running behavioral experiments show that the same late intervention family moves a specific component of assistant behavior rather than the full assistant phenotype.

Broad convergence gap across six model families

Figure 1. Tuned-lens KL-to-own-final curves from the main cross-family observational suite. Across all six families, IT stays farther from its own final distribution than PT through much of the stack, making the broad convergence gap the primary cross-model signature.


Start Here

If you are new to the repo, the canonical paths below are the most useful entrypoints. The repo has been reorganized into descriptive canonical locations:

  • experiment code: src/poc/exp##_descriptive_name/
  • results: results/exp##_descriptive_name/
  • scripts: scripts/run/, scripts/plot/, scripts/analysis/, scripts/infra/, etc.

A few flat script aliases are still kept where practical, but results now live only under the descriptive canonical paths.


Current Status

The current paper-facing story is best understood in three layers:

| Layer | Best current claim | Main evidence |
|---|---|---|
| Observational | Instruction tuning creates a broad convergence-gap signature under native decoding | exp09 cross-model PT/IT analyses |
| Internal causal | The strongest cross-model mechanistic leverage is late-centered and MLP-heavy | exp11 matched-prefix grafts + exp14 symmetric sufficiency/necessity |
| Behavioral | The same late intervention family moves a real but partial slice of assistant behavior | exp12 free-running A/B/C, with exp15 as the next symmetric behavioral phase |

What is strongest right now:

  • broad IT-vs-PT convergence gap and delayed commitment across 6 families under both tuned and raw logit lenses
  • a late-concentrated IT-vs-PT increase in residual opposition as a geometric companion, with architecture-dependent magnitude and spatial extent
  • Gemma steering as the strongest single-direction causal bridge between convergence speed and governance behavior
  • matched-prefix late graft/swap as the cleanest cross-model internal causal evidence for a late MLP-centered bottleneck
  • exp13A-lite plus the exp13/14 mechanism summaries as evidence that the late stage is broader than a narrow formatting-token injector
  • free-running A/B/C as a behavioral precision finding: late MLPs move anti-raw-continuation / anti-false-refusal more than polished structure

What remains intentionally careful:

  • the free-running six-family observational curves are descriptive, not matched-history estimates
  • KL(layer || own final) is useful but endpoint-sensitive
  • dimensionality diagnostics are exploratory / mixed and are not part of the main claim
  • late IT MLPs are a bottleneck inside a broader circuit, not a full assistantness module
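The endpoint-sensitivity caveat is easier to see with the metric written out. Below is a minimal pure-Python sketch of KL(layer || own final); it is an illustration only, and it skips the learned per-layer affine probe that a real tuned lens applies before the readout:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) in nats; assumes q has no zero entries
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_to_own_final(layer_logits):
    # One KL(layer || own final) value per layer; the last entry of
    # layer_logits is the model's own final-layer logits, so the curve
    # ends at zero by construction (the endpoint-sensitivity caveat).
    final = softmax(layer_logits[-1])
    return [kl(softmax(l), final) for l in layer_logits]
```

Because the reference distribution is each model's own final output, the curve is pinned to zero at the last layer, which is exactly why the observational claim is stated as a gap through the stack rather than at the endpoint.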

Late-window sufficiency and necessity under matched-prefix control

Figure 2. Symmetric matched-prefix exp13/14 summary. The late IT→PT graft is the strongest sufficiency window and the mirrored late PT→IT swap is the strongest necessity window on the primary late-region KL metric, supporting a late-centered MLP bottleneck rather than a diffuse endpoint-only story.


Quickstart

Setup

git clone <repo> && cd structral-semantic-features
uv sync

Sanity-check the repo

uv run python scripts/infra/repo_doctor.py

Optional:

uv run python scripts/infra/repo_doctor.py --pytest

Explore the main runnable entrypoints

# Canonical exp14 matched-prefix causal runner
uv run python -m src.poc.exp14_symmetric_matched_prefix_causality --help

# Canonical exp15 free-running behavioral runner
uv run python -m src.poc.exp15_symmetric_behavioral_causality --help

# Local smoke for the exp13+14 causal stack
bash scripts/run/run_exp13_exp14_local.sh --mode smoke --model gemma3_4b --smoke-prompts 8

Common analysis / plotting commands

# Current cross-model observational figures
uv run python -m src.poc.exp09_cross_model_observational_replication.plot_replication

# Exp13A-lite analysis + plots
uv run python scripts/analysis/analyze_exp13a_lite.py --help
uv run python scripts/plot/plot_exp13a_lite.py --help

# Exp13 full + Exp14 causal summary plots
uv run python scripts/analysis/analyze_exp13_full.py --help
uv run python scripts/plot/plot_exp13_full.py --help

Canonical run scripts

# Multi-model steering / phase 0
bash scripts/run/run_phase0_multimodel.sh --step precompute
bash scripts/run/run_phase0_multimodel.sh --step steer

# Exp13 + Exp14 local causal campaign
bash scripts/run/run_exp13_exp14_local.sh --mode full

Models

| Model | Layers | d_model | Architecture | Post-training |
|---|---|---|---|---|
| Gemma 3 4B (primary) | 34 | 2560 | GQA, hybrid local/global (5:1) | KD + supervised / preference / rule-based stages |
| Llama 3.1 8B | 32 | 4096 | GQA, all global | Iterative supervised + preference optimization |
| Qwen 3 4B | 36 | 2560 | GQA, all global | Multi-stage SFT / RL post-training |
| Mistral 7B v0.3 | 32 | 4096 | GQA, sliding window (4096) | Instruct checkpoint |
| DeepSeek-V2-Lite | 27 | 2048 | MLA, MoE (2 shared + 64 routed, top-6) | Chat checkpoint / GRPO-style post-training |
| OLMo 2 7B | 32 | 4096 | MHA, all global | SFT + DPO + RLVR (Tülu 3) |

All main observational analyses use each IT model's native chat template and raw prompting for PT. Template-free conditions are treated as ablations rather than replacement primaries.


Project structure

src/poc/
  cross_model/                                   # Shared multi-model infrastructure
  exp01_hierarchical_distributional_narrowing/
  exp02_ic_ooc_reasoning_mechanistic_comparison/
  exp03_corrective_stage_characterization/
  exp04_phase_transition_characterization/
  exp05_corrective_direction_ablation_cartography/
  exp06_corrective_direction_steering/
  exp07_methodology_validation_tier0/
  exp08_multimodel_steering_phase0/
  exp09_cross_model_observational_replication/
  exp10_contrastive_activation_patching/
  exp11_matched_prefix_mlp_graft/
  exp12_free_running_abc_graft/
  exp13_late_stage_token_support_analysis/
  exp14_symmetric_matched_prefix_causality/
  exp15_symmetric_behavioral_causality/

scripts/
  analysis/                                      # Post-hoc summaries, cross-checks, paper stats
  data/                                          # Dataset builders / data prep
  eval/                                          # Judge and evaluation entrypoints
  infra/                                         # Modal/Lambda/cloud helpers
  merge/                                         # Worker/shard merge utilities
  plot/                                          # Figure generation
  precompute/                                    # Direction extraction and preprocessing
  run/                                           # Main experiment launchers
  scoring/                                       # Rescoring utilities

results/
  cross_model/{model}/
  exp01_hierarchical_distributional_narrowing/
  ...
  exp15_symmetric_behavioral_causality/

Canonical experiment/result paths now use descriptive names, and source code lives only in the canonical named experiment folders. Some legacy result and flat-script aliases are kept during the results/scripts migration so older commands keep working.

For a full index, see docs/EXPERIMENT_REGISTRY.md.


Experiment index

Observational (cross-model, 6/6)

| ID | Analysis | Key result |
|---|---|---|
| L1 | δ-cosine profiles | IT adds more late residual opposition than PT in all 6 families, but with heterogeneous magnitude (−0.021 to −0.269 in the final 20%) |
| L2 | Broad convergence gap + delayed commitment (5 metrics × 2 lenses) | IT stays farther from its own final distribution through much of the stack and commits later in all 6 families |
| L3 | Weight change localization | Gemma: concentrated at corrective layers; others: uniform |
| L8 | Geometry follow-up | Exploratory dimensionality / covariance diagnostics are mixed and not part of the core evidence chain |
| L9 | Attention entropy divergence | Architecture-dependent |
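One plausible reading of the δ-cosine profile (an assumption for illustration, not necessarily the repo's exact definition) is the cosine between each layer's residual write δ_ℓ = h_{ℓ+1} − h_ℓ and the incoming residual state h_ℓ, with negative values indicating a layer that writes against the stream ("residual opposition"):

```python
import math

def delta_cosine_profile(residuals):
    # residuals: per-layer residual-stream states h_0 .. h_L for one
    # token position (toy lists). Returns cos(delta_l, h_l) per layer;
    # negative values mean the layer's write opposes the current state.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) or 1.0
        nv = math.sqrt(sum(b * b for b in v)) or 1.0
        return dot / (nu * nv)
    return [cos([b - a for a, b in zip(h0, h1)], h0)
            for h0, h1 in zip(residuals, residuals[1:])]
```

Under this reading, the L1 finding says the IT profile dips more negative than the PT profile in the final ~20% of layers, with family-dependent depth.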

Causal steering (Gemma, extending to all 6)

| ID | Experiment | Key result |
|---|---|---|
| A1 | α-sweep on corrective layers | Governance dose-response, content flat |
| A1_rand | Random direction control | 3× less governance effect — direction specificity |
| A1_notmpl | No chat template | Dose-response preserved — weight-encoded |
| A2 | Inject into PT | Noisy — PT lacks downstream circuitry |
| A5a | Progressive layer skipping | Final 3 layers: format; earlier: coherence |

Matched-prefix Internal Causality

| ID | Experiment | Key result |
|---|---|---|
| exp11 | Matched-prefix late IT MLP graft | Late IT MLPs increase late KL-to-own-final and move PT internal predictions toward the IT teacher under shared token history |
| exp13A-lite | Descriptive token-support analysis | Late grafts broadly suppress raw-continuation-like FUNCTION/OTHER candidates and increase support for the eventual teacher token |
| exp14 | Symmetric sufficiency / necessity | Late IT→PT graft is the strongest sufficiency window and late PT→IT swap is the strongest necessity window across all 6 models on the primary late-region KL metric |
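The graft logic itself is simple to state. The toy below is an illustration only (not the repo's exp11/exp14 code, which operates on real transformer MLP sublayers via hooks): run the PT layer stack, but inside a chosen window substitute the IT MLP's output for the PT MLP's output on the same residual state, so both conditions share the token history:

```python
def run_with_graft(pt_mlps, it_mlps, x, graft_window):
    # pt_mlps / it_mlps: per-layer callables (toy stand-ins for MLP
    # sublayers); graft_window: set of layer indices where the IT MLP
    # output replaces the PT one under matched-prefix control.
    h = x
    for i, (pt_mlp, it_mlp) in enumerate(zip(pt_mlps, it_mlps)):
        mlp = it_mlp if i in graft_window else pt_mlp
        h = h + mlp(h)  # simplified pre-norm residual update
    return h
```

Sufficiency sweeps the window over IT→PT grafts; necessity mirrors it with PT→IT swaps, which is the symmetry exp14 adds over exp11.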

Free-running Behavioral Causality

| ID | Experiment | Key result |
|---|---|---|
| exp12 | A/B/C free-running graft comparison | Late graft reduces benign false refusals in 6/6 families and improves assistant register in 4/6, but remains far from the full IT endpoint on polished structure |
| exp15 | Symmetric behavioral phase | Current canonical follow-up for making the behavioral late-stage claim more symmetric and better localized |

Methodology validation (Tier 0)

| ID | Test | Result |
|---|---|---|
| 0A | Direction bootstrap stability | cos > 0.993 by n=300 |
| 0B | Matched-token direction | cos = 0.82 (primarily weight-driven) |
| 0C | Projection-matched random | 3× less governance, identical content degradation |
| 0D | Bootstrap 95% CIs | BCa intervals on all metrics |
| 0E | Classifier robustness | Robust to all boundary perturbations |
| 0F | Layer range sensitivity | Stable across 4 overlapping ranges |
| 0G | Tuned-lens commitment | Primary commitment measurement (6 models × 2 variants) |
| 0H | Calibration split | Three disjoint prompt sets → same dose-response |
| 0I | Formula comparison | MLP projection only; attention/residual fail |
| 0J | Onset threshold sensitivity | Robust across σ-based and absolute thresholds |
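The shape of the 0A check is worth spelling out. A toy pure-Python sketch (illustration only, not the repo's Tier 0 code): resample per-prompt IT−PT activation differences with replacement, recompute the mean-difference direction, and report the worst-case cosine to the full-sample direction:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def bootstrap_direction_stability(diffs, n_boot=100, seed=0):
    # diffs: per-prompt (IT - PT) activation differences (toy vectors).
    # Returns the minimum cosine between any bootstrap direction and
    # the full-sample mean-difference direction.
    rng = random.Random(seed)
    dim = len(diffs[0])
    full = [sum(d[j] for d in diffs) / len(diffs) for j in range(dim)]
    worst = 1.0
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(len(diffs))] for _ in diffs]
        mean = [sum(d[j] for d in sample) / len(sample) for j in range(dim)]
        worst = min(worst, cosine(mean, full))
    return worst
```

The repo's 0A result (cos > 0.993 by n=300) corresponds to this worst-case cosine staying near 1 once enough prompts are pooled.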

Contrastive activation patching (Exp10, in progress)

| Phase | Description | Status |
|---|---|---|
| 1 | Forced-decoding paired data collection | Prototype complete |
| 2 | Ridge probes → convergence direction (d_conv) | Prototype complete |
| 3 | Causal activation patching (5 conditions) | Prototype complete |
| 4 | Steering with d_conv vs d_mean | Prototype: d_mean steers (11–19×), d_conv does not |

Pipeline design

The steering pipeline is architecture-agnostic. It operates on raw MLP activations via a model-agnostic adapter system — no transcoders, SAEs, or model-specific decompositions required.

Direction Extraction          Steering                Evaluation
--------------------    --------------------    --------------------
IT model --+            IT model + hooks        LLM judge (G1/G2)
           |-- d_mean   h += (alpha-1)(d'h)d    Programmatic (STR)
PT model --+            per corrective layer    IFEval compliance
                                                MMLU / GSM8K / reasoning

The adapter system provides a uniform interface across all six architectures, including DeepSeek's MoE routing and Gemma's hybrid attention. Extending to a new model requires only registering its architecture in the adapter config.
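The diagram above compresses to two operations: a mean-difference direction from paired IT/PT activations, and the per-layer update h += (alpha-1)(d'h)d, which rescales h's component along d. A toy pure-Python sketch (illustration only; the repo's adapter system applies this to real MLP activations via hooks):

```python
import math

def d_mean(it_acts, pt_acts):
    # Normalized mean-difference direction from paired activation sets
    # (toy lists of vectors): mean(IT) - mean(PT), unit-normed.
    n = len(it_acts[0])
    d = [sum(a[j] for a in it_acts) / len(it_acts)
         - sum(a[j] for a in pt_acts) / len(pt_acts) for j in range(n)]
    norm = math.sqrt(sum(x * x for x in d)) or 1.0
    return [x / norm for x in d]

def steer(h, d, alpha):
    # h += (alpha - 1) * (d . h) * d : alpha=1 is identity, alpha>1
    # amplifies h's component along d, alpha<1 suppresses it.
    proj = sum(hi * di for hi, di in zip(h, d))
    return [hi + (alpha - 1) * proj * di for hi, di in zip(h, d)]
```

Because the intervention touches only the component along d, content directions orthogonal to d are left untouched, which is the mechanism behind the "governance dose-response, content flat" pattern in A1.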


Citation

@article{anonymous2026corrective,
  title={Instruction Tuning Creates a Broad Convergence Gap: A Late-Centered Corrective Computation Across Transformer Families},
  author={Anonymous},
  year={2026}
}

License

See LICENSE.

About

Mechanistic interpretability research for studying how instruction tuning restructures the computational pipeline of language models
