TL;DR — The current paper story is: instruction tuning creates a broad convergence gap and delayed commitment under native decoding across six model families. The strongest mechanistic leverage is a late-layer MLP-centered corrective bottleneck. The cleanest cross-model internal causal evidence comes from matched-prefix graft/swap experiments, and the free-running behavioral experiments show that the same late intervention family moves a specific component of assistant behavior rather than the full assistant phenotype.
Figure 1. Tuned-lens KL-to-own-final curves from the main cross-family observational suite. Across all six families, IT stays farther from its own final distribution than PT through much of the stack, making the broad convergence gap the primary cross-model signature.
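The curves in Figure 1 plot KL(layer || own final) per layer. As a minimal numpy sketch of that metric (assuming per-layer lens logits have already been computed; this is illustrative, not the repo's actual plotting code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_own_final(layer_logits):
    """KL(p_layer || p_final) per layer, in nats.

    layer_logits: array of shape (n_layers, vocab) for one position,
    where the last row is the model's own final-layer distribution.
    """
    probs = softmax(layer_logits)
    p_final = probs[-1]
    # KL(p || q) = sum_v p_v * (log p_v - log q_v); epsilon guards log(0)
    return np.sum(probs * (np.log(probs + 1e-12) - np.log(p_final + 1e-12)), axis=-1)
```

The "broad convergence gap" is then just the IT curve sitting above the PT curve across most of the depth, with the final layer pinned at zero by construction.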
If you are new to the repo, these are the most useful entrypoints:
- docs/EXPERIMENT_REGISTRY.md: canonical experiment map and path conventions
- scripts/README.md: grouped script layout and common commands
- scripts/infra/repo_doctor.py (via uv run python): lightweight repo health check
- paper_draft/PAPER_DRAFT_v12.md: current paper framing
The repo has been reorganized into descriptive canonical paths:
- experiment code: src/poc/exp##_descriptive_name/
- results: results/exp##_descriptive_name/
- scripts: scripts/run/, scripts/plot/, scripts/analysis/, scripts/infra/, etc.
A few flat script aliases are still kept where practical, but results now live only under the descriptive canonical paths.
The current paper-facing story is best understood in three layers:
| Layer | Best current claim | Main evidence |
|---|---|---|
| Observational | Instruction tuning creates a broad convergence-gap signature under native decoding | exp09 cross-model PT/IT analyses |
| Internal causal | The strongest cross-model mechanistic leverage is late-centered and MLP-heavy | exp11 matched-prefix grafts + exp14 symmetric sufficiency/necessity |
| Behavioral | The same late intervention family moves a real but partial slice of assistant behavior | exp12 free-running A/B/C, with exp15 as the next symmetric behavioral phase |
What is strongest right now:
- broad IT-vs-PT convergence gap and delayed commitment across 6 families under both tuned and raw logit lenses
- a late-concentrated IT-vs-PT increase in residual opposition as a geometric companion, with architecture-dependent magnitude and spatial extent
- Gemma steering as the strongest single-direction causal bridge between convergence speed and governance behavior
- matched-prefix late graft/swap as the cleanest cross-model internal causal evidence for a late MLP-centered bottleneck
- exp13A-lite plus the exp13/14 mechanism summaries as evidence that the late stage is broader than a narrow formatting-token injector
- free-running A/B/C as a behavioral precision finding: late MLPs move anti-raw-continuation / anti-false-refusal more than polished structure
What remains intentionally careful:
- the free-running six-family observational curves are descriptive, not matched-history estimates
- KL(layer || own final) is useful but endpoint-sensitive
- dimensionality diagnostics are exploratory / mixed and are not part of the main claim
- late IT MLPs are a bottleneck inside a broader circuit, not a full assistantness module
Figure 2. Symmetric matched-prefix exp13/14 summary. The late IT→PT graft is the strongest sufficiency window and the mirrored late PT→IT swap is the strongest necessity window on the primary late-region KL metric, supporting a late-centered MLP bottleneck rather than a diffuse endpoint-only story.
git clone <repo> && cd structral-semantic-features
uv sync
uv run python scripts/infra/repo_doctor.py
Optional:
uv run python scripts/infra/repo_doctor.py --pytest
# Canonical exp14 matched-prefix causal runner
uv run python -m src.poc.exp14_symmetric_matched_prefix_causality --help
# Canonical exp15 free-running behavioral runner
uv run python -m src.poc.exp15_symmetric_behavioral_causality --help
# Local smoke for the exp13+14 causal stack
bash scripts/run/run_exp13_exp14_local.sh --mode smoke --model gemma3_4b --smoke-prompts 8
# Current cross-model observational figures
uv run python -m src.poc.exp09_cross_model_observational_replication.plot_replication
# Exp13A-lite analysis + plots
uv run python scripts/analysis/analyze_exp13a_lite.py --help
uv run python scripts/plot/plot_exp13a_lite.py --help
# Exp13 full + Exp14 causal summary plots
uv run python scripts/analysis/analyze_exp13_full.py --help
uv run python scripts/plot/plot_exp13_full.py --help
# Multi-model steering / phase 0
bash scripts/run/run_phase0_multimodel.sh --step precompute
bash scripts/run/run_phase0_multimodel.sh --step steer
# Exp13 + Exp14 local causal campaign
bash scripts/run/run_exp13_exp14_local.sh --mode full
| Model | Layers | d_model | Architecture | Post-training |
|---|---|---|---|---|
| Gemma 3 4B (primary) | 34 | 2560 | GQA, hybrid local/global (5:1) | KD + supervised / preference / rule-based stages |
| Llama 3.1 8B | 32 | 4096 | GQA, all global | Iterative supervised + preference optimization |
| Qwen 3 4B | 36 | 2560 | GQA, all global | Multi-stage SFT / RL post-training |
| Mistral 7B v0.3 | 32 | 4096 | GQA, sliding window (4096) | Instruct checkpoint |
| DeepSeek-V2-Lite | 27 | 2048 | MLA, MoE (2 shared + 64 routed, top-6) | Chat checkpoint / GRPO-style post-training |
| OLMo 2 7B | 32 | 4096 | MHA, all global | SFT + DPO + RLVR (Tülu 3) |
All main observational analyses use each IT model's native chat template and raw prompting for PT. Template-free conditions are treated as ablations rather than replacement primaries.
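The two prompting conditions can be sketched as follows. The Gemma-style turn markers below are illustrative; in practice each IT model's own tokenizer template would be applied (e.g. via a chat-template helper), and the PT condition gets the bare text:

```python
def chat_prompt(user_msg: str) -> str:
    """Illustrative Gemma-style native chat formatting for the IT condition.

    (Real runs would use each IT model's own tokenizer template rather
    than hard-coded markers like these.)
    """
    return (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

def raw_prompt(user_msg: str) -> str:
    """PT condition: raw prompting, no template at all."""
    return user_msg
```

Template-free IT runs (the ablation condition) would simply route IT prompts through raw_prompt as well.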
src/poc/
cross_model/ # Shared multi-model infrastructure
exp01_hierarchical_distributional_narrowing/
exp02_ic_ooc_reasoning_mechanistic_comparison/
exp03_corrective_stage_characterization/
exp04_phase_transition_characterization/
exp05_corrective_direction_ablation_cartography/
exp06_corrective_direction_steering/
exp07_methodology_validation_tier0/
exp08_multimodel_steering_phase0/
exp09_cross_model_observational_replication/
exp10_contrastive_activation_patching/
exp11_matched_prefix_mlp_graft/
exp12_free_running_abc_graft/
exp13_late_stage_token_support_analysis/
exp14_symmetric_matched_prefix_causality/
exp15_symmetric_behavioral_causality/
scripts/
analysis/ # Post-hoc summaries, cross-checks, paper stats
data/ # Dataset builders / data prep
eval/ # Judge and evaluation entrypoints
infra/ # Modal/Lambda/cloud helpers
merge/ # Worker/shard merge utilities
plot/ # Figure generation
precompute/ # Direction extraction and preprocessing
run/ # Main experiment launchers
scoring/ # Rescoring utilities
results/
cross_model/{model}/
exp01_hierarchical_distributional_narrowing/
...
exp15_symmetric_behavioral_causality/
Canonical experiment and result paths use descriptive names, and source code lives only in the named experiment folders. Some legacy result and flat script aliases are kept during the results/scripts migration so older commands keep working.
For a full index, see docs/EXPERIMENT_REGISTRY.md.
| ID | Analysis | Key result |
|---|---|---|
| L1 | δ-cosine profiles | IT adds more late residual opposition than PT in all 6 families, but with heterogeneous magnitude (−0.021 to −0.269 in the final 20%) |
| L2 | Broad convergence gap + delayed commitment (5 metrics × 2 lenses) | IT stays farther from its own final distribution through much of the stack and commits later in all 6 families |
| L3 | Weight change localization | Gemma: concentrated at corrective layers; others: uniform |
| L8 | Geometry follow-up | Exploratory dimensionality / covariance diagnostics are mixed and not part of the core evidence chain |
| L9 | Attention entropy divergence | Architecture-dependent |
| ID | Experiment | Key result |
|---|---|---|
| A1 | α-sweep on corrective layers | Governance dose-response, content flat |
| A1_rand | Random direction control | 3× less governance effect — direction specificity |
| A1_notmpl | No chat template | Dose-response preserved — weight-encoded |
| A2 | Inject into PT | Noisy — PT lacks downstream circuitry |
| A5a | Progressive layer skipping | Final 3 layers: format; earlier: coherence |
| ID | Experiment | Key result |
|---|---|---|
| exp11 | Matched-prefix late IT MLP graft | Late IT MLPs increase late KL-to-own-final and move PT internal predictions toward the IT teacher under shared token history |
| exp13A-lite | Descriptive token-support analysis | Late grafts broadly suppress raw-continuation-like FUNCTION/OTHER candidates and increase support for the eventual teacher token |
| exp14 | Symmetric sufficiency / necessity | Late IT→PT graft is the strongest sufficiency window and late PT→IT swap is the strongest necessity window across all 6 models on the primary late-region KL metric |
| ID | Experiment | Key result |
|---|---|---|
| exp12 | A/B/C free-running graft comparison | Late graft reduces benign false refusals in 6/6 families and improves assistant register in 4/6, but remains far from the full IT endpoint on polished structure |
| exp15 | Symmetric behavioral phase | Current canonical follow-up for making the behavioral late-stage claim more symmetric and better localized |
| ID | Test | Result |
|---|---|---|
| 0A | Direction bootstrap stability | cos > 0.993 by n=300 |
| 0B | Matched-token direction | cos = 0.82 (primarily weight-driven) |
| 0C | Projection-matched random | 3× less governance, identical content degradation |
| 0D | Bootstrap 95% CIs | BCa intervals on all metrics |
| 0E | Classifier robustness | Robust to all boundary perturbations |
| 0F | Layer range sensitivity | Stable across 4 overlapping ranges |
| 0G | Tuned-lens commitment | Primary commitment measurement (6 models × 2 variants) |
| 0H | Calibration split | Three disjoint prompt sets → same dose-response |
| 0I | Formula comparison | MLP projection only; attention/residual fail |
| 0J | Onset threshold sensitivity | Robust across σ-based and absolute thresholds |
| Phase | Description | Status |
|---|---|---|
| 1 | Forced-decoding paired data collection | Prototype complete |
| 2 | Ridge probes → convergence direction (d_conv) | Prototype complete |
| 3 | Causal activation patching (5 conditions) | Prototype complete |
| 4 | Steering with d_conv vs d_mean | Prototype: d_mean steers (11–19×), d_conv does not |
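For intuition, the d_mean direction in phase 4 is the familiar mean-difference construction: the unit-normalized gap between mean IT and mean PT activations at a layer. A minimal sketch (assuming paired activation matrices are already collected; this is not the repo's extraction code):

```python
import numpy as np

def mean_diff_direction(it_acts, pt_acts):
    """d_mean: unit-normalized difference of mean activations.

    it_acts, pt_acts: (n_samples, d_model) activation matrices collected
    at the same layer under forced decoding on paired prompts.
    """
    d = it_acts.mean(axis=0) - pt_acts.mean(axis=0)
    return d / np.linalg.norm(d)
```

d_conv would instead come from the phase-2 ridge probes; the phase-4 result is that only d_mean, not d_conv, produces the 11–19× steering effect.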
The steering pipeline is architecture-agnostic. It operates on raw MLP activations via a model-agnostic adapter system — no transcoders, SAEs, or model-specific decompositions required.
Direction Extraction Steering Evaluation
-------------------- -------------------- --------------------
IT model --+ IT model + hooks LLM judge (G1/G2)
|-- d_mean h += (alpha-1)(d'h)d Programmatic (STR)
PT model --+ per corrective layer IFEval compliance
MMLU / GSM8K / reasoning
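The steering update in the diagram, h += (alpha-1)(d'h)d, rescales the component of the hidden state along the unit direction d by alpha while leaving the orthogonal complement untouched. A minimal numpy sketch of that one-line hook body:

```python
import numpy as np

def steer(h, d, alpha):
    """Scale the component of h along unit direction d by alpha.

    h' = h + (alpha - 1) * (d . h) * d
    alpha = 1 is a no-op; alpha > 1 amplifies the direction, alpha < 1
    suppresses it. Assumes d is unit-norm.
    """
    return h + (alpha - 1.0) * (h @ d) * d
```

In the pipeline this update is applied per corrective layer inside forward hooks on the MLP activations.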
The adapter system provides a uniform interface across all six architectures, including DeepSeek's MoE routing and Gemma's hybrid attention. Extending to a new model requires only registering its architecture in the adapter config.
@article{anonymous2026corrective,
title={Instruction Tuning Creates a Broad Convergence Gap: A Late-Centered Corrective Computation Across Transformer Families},
author={Anonymous},
year={2026}
}
See LICENSE.

