yifan1207/PT-IT-Model-Differences


Instruction Tuning Creates a Broad Convergence Gap

A Late-Centered Corrective Computation Across Transformer Families

Python 3.13+ · PyTorch · 6 Model Families

TL;DR — The current paper story is: instruction tuning creates a broad convergence gap and delayed commitment under native decoding across six model families. The strongest mechanistic leverage is a late-layer MLP-centered corrective bottleneck. The cleanest cross-model internal causal evidence comes from matched-prefix graft/swap experiments, and the free-running behavioral experiments show that the same late intervention family moves a specific component of assistant behavior rather than the full assistant phenotype.

Broad convergence gap across six model families

Figure 1. Tuned-lens KL-to-own-final curves from the main cross-family observational suite. Across all six families, IT stays farther from its own final distribution than PT through much of the stack, making the broad convergence gap the primary cross-model signature.


Start Here

If you are new to the repo, the canonical paths below are the most useful entrypoints. The repo has been reorganized into descriptive canonical locations:

  • experiment code: src/poc/exp##_descriptive_name/
  • results: results/exp##_descriptive_name/
  • scripts: scripts/run/, scripts/plot/, scripts/analysis/, scripts/infra/, etc.

A few flat script aliases are still kept where practical, but results now live only under the descriptive canonical paths.


Current Status

The current paper-facing story is best understood in three layers:

| Layer | Best current claim | Main evidence |
|---|---|---|
| Observational | Instruction tuning creates a broad convergence-gap signature under native decoding | exp09 cross-model PT/IT analyses |
| Internal causal | The strongest cross-model mechanistic leverage is late-centered and MLP-heavy | exp11 matched-prefix grafts + exp14 symmetric sufficiency/necessity |
| Behavioral | The same late intervention family moves a real but partial slice of assistant behavior | exp12 free-running A/B/C, with exp15 as the next symmetric behavioral phase |

What is strongest right now:

  • broad IT-vs-PT convergence gap and delayed commitment across 6 families under both tuned and raw logit lenses
  • a late-concentrated IT-vs-PT increase in residual opposition as a geometric companion, with architecture-dependent magnitude and spatial extent
  • Gemma steering as the strongest single-direction causal bridge between convergence speed and governance behavior
  • matched-prefix late graft/swap as the cleanest cross-model internal causal evidence for a late MLP-centered bottleneck
  • exp13A-lite plus the exp13/14 mechanism summaries as evidence that the late stage is broader than a narrow formatting-token injector
  • free-running A/B/C as a behavioral precision finding: late MLPs move anti-raw-continuation / anti-false-refusal more than polished structure

What remains intentionally careful:

  • the free-running six-family observational curves are descriptive, not matched-history estimates
  • KL(layer || own final) is useful but endpoint-sensitive
  • dimensionality diagnostics are exploratory / mixed and are not part of the main claim
  • late IT MLPs are a bottleneck inside a broader circuit, not a full assistantness module
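The endpoint-sensitivity caveat is easier to see with the metric written out. Below is a minimal pure-Python sketch of KL(layer || own final); it is an illustration only, and it skips the learned per-layer affine probe that a real tuned lens applies before the readout:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) in nats; assumes q has no zero entries
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_to_own_final(layer_logits):
    # One KL(layer || own final) value per layer; the last entry of
    # layer_logits is the model's own final-layer logits, so the curve
    # ends at zero by construction (the endpoint-sensitivity caveat).
    final = softmax(layer_logits[-1])
    return [kl(softmax(l), final) for l in layer_logits]
```

Because the reference distribution is each model's own final output, the curve is pinned to zero at the last layer, which is exactly why the observational claim is stated as a gap through the stack rather than at the endpoint.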

Late-window sufficiency and necessity under matched-prefix control

Figure 2. Symmetric matched-prefix exp13/14 summary. The late IT→PT graft is the strongest sufficiency window and the mirrored late PT→IT swap is the strongest necessity window on the primary late-region KL metric, supporting a late-centered MLP bottleneck rather than a diffuse endpoint-only story.


Quickstart

Setup

git clone <repo> && cd structral-semantic-features
uv sync

Sanity-check the repo

uv run python scripts/infra/repo_doctor.py

Optional:

uv run python scripts/infra/repo_doctor.py --pytest

Explore the main runnable entrypoints

# Canonical exp14 matched-prefix causal runner
uv run python -m src.poc.exp14_symmetric_matched_prefix_causality --help

# Canonical exp15 free-running behavioral runner
uv run python -m src.poc.exp15_symmetric_behavioral_causality --help

# Local smoke for the exp13+14 causal stack
bash scripts/run/run_exp13_exp14_local.sh --mode smoke --model gemma3_4b --smoke-prompts 8

Common analysis / plotting commands

# Current cross-model observational figures
uv run python -m src.poc.exp09_cross_model_observational_replication.plot_replication

# Exp13A-lite analysis + plots
uv run python scripts/analysis/analyze_exp13a_lite.py --help
uv run python scripts/plot/plot_exp13a_lite.py --help

# Exp13 full + Exp14 causal summary plots
uv run python scripts/analysis/analyze_exp13_full.py --help
uv run python scripts/plot/plot_exp13_full.py --help

Canonical run scripts

# Multi-model steering / phase 0
bash scripts/run/run_phase0_multimodel.sh --step precompute
bash scripts/run/run_phase0_multimodel.sh --step steer

# Exp13 + Exp14 local causal campaign
bash scripts/run/run_exp13_exp14_local.sh --mode full

Models

| Model | Layers | d_model | Architecture | Post-training |
|---|---|---|---|---|
| Gemma 3 4B (primary) | 34 | 2560 | GQA, hybrid local/global (5:1) | KD + supervised / preference / rule-based stages |
| Llama 3.1 8B | 32 | 4096 | GQA, all global | Iterative supervised + preference optimization |
| Qwen 3 4B | 36 | 2560 | GQA, all global | Multi-stage SFT / RL post-training |
| Mistral 7B v0.3 | 32 | 4096 | GQA, sliding window (4096) | Instruct checkpoint |
| DeepSeek-V2-Lite | 27 | 2048 | MLA, MoE (2 shared + 64 routed, top-6) | Chat checkpoint / GRPO-style post-training |
| OLMo 2 7B | 32 | 4096 | MHA, all global | SFT + DPO + RLVR (Tülu 3) |

All main observational analyses use each IT model's native chat template and raw prompting for PT. Template-free conditions are treated as ablations rather than replacement primaries.


Project structure

src/poc/
  cross_model/                                   # Shared multi-model infrastructure
  exp01_hierarchical_distributional_narrowing/
  exp02_ic_ooc_reasoning_mechanistic_comparison/
  exp03_corrective_stage_characterization/
  exp04_phase_transition_characterization/
  exp05_corrective_direction_ablation_cartography/
  exp06_corrective_direction_steering/
  exp07_methodology_validation_tier0/
  exp08_multimodel_steering_phase0/
  exp09_cross_model_observational_replication/
  exp10_contrastive_activation_patching/
  exp11_matched_prefix_mlp_graft/
  exp12_free_running_abc_graft/
  exp13_late_stage_token_support_analysis/
  exp14_symmetric_matched_prefix_causality/
  exp15_symmetric_behavioral_causality/

scripts/
  analysis/                                      # Post-hoc summaries, cross-checks, paper stats
  data/                                          # Dataset builders / data prep
  eval/                                          # Judge and evaluation entrypoints
  infra/                                         # Modal/Lambda/cloud helpers
  merge/                                         # Worker/shard merge utilities
  plot/                                          # Figure generation
  precompute/                                    # Direction extraction and preprocessing
  run/                                           # Main experiment launchers
  scoring/                                       # Rescoring utilities

results/
  cross_model/{model}/
  exp01_hierarchical_distributional_narrowing/
  ...
  exp15_symmetric_behavioral_causality/

Canonical experiment/result paths now use descriptive names, and source code lives only in the canonical named experiment folders. Some legacy result and flat-script aliases are kept during the results/scripts migration so older commands keep working.

For a full index, see docs/EXPERIMENT_REGISTRY.md.


Experiment index

Observational (cross-model, 6/6)

| ID | Analysis | Key result |
|---|---|---|
| L1 | δ-cosine profiles | IT adds more late residual opposition than PT in all 6 families, but with heterogeneous magnitude (−0.021 to −0.269 in the final 20%) |
| L2 | Broad convergence gap + delayed commitment (5 metrics × 2 lenses) | IT stays farther from its own final distribution through much of the stack and commits later in all 6 families |
| L3 | Weight change localization | Gemma: concentrated at corrective layers; others: uniform |
| L8 | Geometry follow-up | Exploratory dimensionality / covariance diagnostics are mixed and not part of the core evidence chain |
| L9 | Attention entropy divergence | Architecture-dependent |
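One plausible reading of the δ-cosine profile (an assumption for illustration, not necessarily the repo's exact definition) is the cosine between each layer's residual write δ_ℓ = h_{ℓ+1} − h_ℓ and the incoming residual state h_ℓ, with negative values indicating a layer that writes against the stream ("residual opposition"):

```python
import math

def delta_cosine_profile(residuals):
    # residuals: per-layer residual-stream states h_0 .. h_L for one
    # token position (toy lists). Returns cos(delta_l, h_l) per layer;
    # negative values mean the layer's write opposes the current state.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) or 1.0
        nv = math.sqrt(sum(b * b for b in v)) or 1.0
        return dot / (nu * nv)
    return [cos([b - a for a, b in zip(h0, h1)], h0)
            for h0, h1 in zip(residuals, residuals[1:])]
```

Under this reading, the L1 finding says the IT profile dips more negative than the PT profile in the final ~20% of layers, with family-dependent depth.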

Causal steering (Gemma, extending to all 6)

| ID | Experiment | Key result |
|---|---|---|
| A1 | α-sweep on corrective layers | Governance dose-response, content flat |
| A1_rand | Random direction control | 3× less governance effect — direction specificity |
| A1_notmpl | No chat template | Dose-response preserved — weight-encoded |
| A2 | Inject into PT | Noisy — PT lacks downstream circuitry |
| A5a | Progressive layer skipping | Final 3 layers: format; earlier: coherence |

Matched-prefix Internal Causality

| ID | Experiment | Key result |
|---|---|---|
| exp11 | Matched-prefix late IT MLP graft | Late IT MLPs increase late KL-to-own-final and move PT internal predictions toward the IT teacher under shared token history |
| exp13A-lite | Descriptive token-support analysis | Late grafts broadly suppress raw-continuation-like FUNCTION/OTHER candidates and increase support for the eventual teacher token |
| exp14 | Symmetric sufficiency / necessity | Late IT→PT graft is the strongest sufficiency window and late PT→IT swap is the strongest necessity window across all 6 models on the primary late-region KL metric |
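The graft logic itself is simple to state. The toy below is an illustration only (not the repo's exp11/exp14 code, which operates on real transformer MLP sublayers via hooks): run the PT layer stack, but inside a chosen window substitute the IT MLP's output for the PT MLP's output on the same residual state, so both conditions share the token history:

```python
def run_with_graft(pt_mlps, it_mlps, x, graft_window):
    # pt_mlps / it_mlps: per-layer callables (toy stand-ins for MLP
    # sublayers); graft_window: set of layer indices where the IT MLP
    # output replaces the PT one under matched-prefix control.
    h = x
    for i, (pt_mlp, it_mlp) in enumerate(zip(pt_mlps, it_mlps)):
        mlp = it_mlp if i in graft_window else pt_mlp
        h = h + mlp(h)  # simplified pre-norm residual update
    return h
```

Sufficiency sweeps the window over IT→PT grafts; necessity mirrors it with PT→IT swaps, which is the symmetry exp14 adds over exp11.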

Free-running Behavioral Causality

| ID | Experiment | Key result |
|---|---|---|
| exp12 | A/B/C free-running graft comparison | Late graft reduces benign false refusals in 6/6 families and improves assistant register in 4/6, but remains far from the full IT endpoint on polished structure |
| exp15 | Symmetric behavioral phase | Current canonical follow-up for making the behavioral late-stage claim more symmetric and better localized |

Methodology validation (Tier 0)

| ID | Test | Result |
|---|---|---|
| 0A | Direction bootstrap stability | cos > 0.993 by n=300 |
| 0B | Matched-token direction | cos = 0.82 (primarily weight-driven) |
| 0C | Projection-matched random | 3× less governance, identical content degradation |
| 0D | Bootstrap 95% CIs | BCa intervals on all metrics |
| 0E | Classifier robustness | Robust to all boundary perturbations |
| 0F | Layer range sensitivity | Stable across 4 overlapping ranges |
| 0G | Tuned-lens commitment | Primary commitment measurement (6 models × 2 variants) |
| 0H | Calibration split | Three disjoint prompt sets → same dose-response |
| 0I | Formula comparison | MLP projection only; attention/residual fail |
| 0J | Onset threshold sensitivity | Robust across σ-based and absolute thresholds |
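The shape of the 0A check is worth spelling out. A toy pure-Python sketch (illustration only, not the repo's Tier 0 code): resample per-prompt IT−PT activation differences with replacement, recompute the mean-difference direction, and report the worst-case cosine to the full-sample direction:

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def bootstrap_direction_stability(diffs, n_boot=100, seed=0):
    # diffs: per-prompt (IT - PT) activation differences (toy vectors).
    # Returns the minimum cosine between any bootstrap direction and
    # the full-sample mean-difference direction.
    rng = random.Random(seed)
    dim = len(diffs[0])
    full = [sum(d[j] for d in diffs) / len(diffs) for j in range(dim)]
    worst = 1.0
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(len(diffs))] for _ in diffs]
        mean = [sum(d[j] for d in sample) / len(sample) for j in range(dim)]
        worst = min(worst, cosine(mean, full))
    return worst
```

The repo's 0A result (cos > 0.993 by n=300) corresponds to this worst-case cosine staying near 1 once enough prompts are pooled.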

Contrastive activation patching (Exp10, in progress)

| Phase | Description | Status |
|---|---|---|
| 1 | Forced-decoding paired data collection | Prototype complete |
| 2 | Ridge probes → convergence direction (d_conv) | Prototype complete |
| 3 | Causal activation patching (5 conditions) | Prototype complete |
| 4 | Steering with d_conv vs d_mean | Prototype: d_mean steers (11–19×), d_conv does not |

Pipeline design

The steering pipeline is architecture-agnostic. It operates on raw MLP activations via a model-agnostic adapter system — no transcoders, SAEs, or model-specific decompositions required.

Direction Extraction          Steering                Evaluation
--------------------    --------------------    --------------------
IT model --+            IT model + hooks        LLM judge (G1/G2)
           |-- d_mean   h += (alpha-1)(d'h)d    Programmatic (STR)
PT model --+            per corrective layer    IFEval compliance
                                                MMLU / GSM8K / reasoning

The adapter system provides a uniform interface across all six architectures, including DeepSeek's MoE routing and Gemma's hybrid attention. Extending to a new model requires only registering its architecture in the adapter config.
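The diagram above compresses to two operations: a mean-difference direction from paired IT/PT activations, and the per-layer update h += (alpha-1)(d'h)d, which rescales h's component along d. A toy pure-Python sketch (illustration only; the repo's adapter system applies this to real MLP activations via hooks):

```python
import math

def d_mean(it_acts, pt_acts):
    # Normalized mean-difference direction from paired activation sets
    # (toy lists of vectors): mean(IT) - mean(PT), unit-normed.
    n = len(it_acts[0])
    d = [sum(a[j] for a in it_acts) / len(it_acts)
         - sum(a[j] for a in pt_acts) / len(pt_acts) for j in range(n)]
    norm = math.sqrt(sum(x * x for x in d)) or 1.0
    return [x / norm for x in d]

def steer(h, d, alpha):
    # h += (alpha - 1) * (d . h) * d : alpha=1 is identity, alpha>1
    # amplifies h's component along d, alpha<1 suppresses it.
    proj = sum(hi * di for hi, di in zip(h, d))
    return [hi + (alpha - 1) * proj * di for hi, di in zip(h, d)]
```

Because the intervention touches only the component along d, content directions orthogonal to d are left untouched, which is the mechanism behind the "governance dose-response, content flat" pattern in A1.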


Citation

@article{anonymous2026corrective,
  title={Instruction Tuning Creates a Broad Convergence Gap: A Late-Centered Corrective Computation Across Transformer Families},
  author={Anonymous},
  year={2026}
}

License

See LICENSE.

About

Mechanistic interpretability research for studying how instruction tuning restructures the computational pipeline of language models
