Chain-of-thought (CoT) reasoning improves model performance - whether elicited through prompting, trained via fine-tuning or RL, or aggregated via self-consistency. But after which token in the reasoning trace does the model realize the answer?
Detecting this "Aha moment" - the generation step at which the model's answer crystallizes - could unlock:
- Efficiency: Terminate generation early once the Aha moment occurs, saving compute
- Better training data: Select high-quality reasoning traces for self-training or distillation - either traces with strong Aha magnitude (good reasoning) or early Aha moments (efficient reasoning). Could also identify candidate tokens for gradient masking during RL.
- Smarter decoding: Use Aha magnitude instead of token log-probability as the pruning criterion in beam search, discarding unpromising reasoning paths earlier
My approach: Track answer-token logits throughout CoT generation to detect when and how strongly the model's prediction emerges.
Core finding: Tracking answer-token logits (e.g., "a", "b", "c", "d" or "True", "False") throughout CoT generation reveals "Aha moments" - the answer logits spike sharply at key moments in the reasoning.
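The tracking itself is simple: at each generation step the model emits a full-vocabulary logit vector, and we record only the entries for the candidate answer tokens. A minimal sketch with synthetic logits and hypothetical token ids (in practice these come from the model's per-step scores and the tokenizer's ids for "True"/"False" or "a"–"d"):

```python
import numpy as np

VOCAB_SIZE = 100
ANSWER_IDS = {"True": 17, "False": 42}  # hypothetical token ids for illustration

rng = np.random.default_rng(0)
n_steps = 20
# Fake per-step logits: noise everywhere, plus a sharp jump in the
# "False" logit from step 12 onward (a simulated "Aha moment").
step_logits = rng.normal(0.0, 1.0, size=(n_steps, VOCAB_SIZE))
step_logits[12:, ANSWER_IDS["False"]] += 8.0

def track_answer_logits(step_logits, answer_ids):
    """Return a {label: per-step logit trace} dict for the answer tokens."""
    return {label: step_logits[:, tok_id] for label, tok_id in answer_ids.items()}

traces = track_answer_logits(step_logits, ANSWER_IDS)
# The step receiving the largest jump in the "False" trace:
aha_step = int(np.argmax(np.diff(traces["False"]))) + 1  # diff[i] is the change into step i+1
print(aha_step)  # 12
```

The same extraction works on real per-step scores (e.g. the logits a decoding loop already computes); only the answer-token columns need to be stored.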
Comparison with controls:
| Method | Tracked | Behavior | Interpretation |
|---|---|---|---|
| Answer tokens | Logits of a,b,c,d or True,False | Structured spikes at key moments | Tracks Aha moments |
| Control tokens | Logits of 1,2,3,4 or Yes,No | Random fluctuations | No consistent pattern - signal is answer-specific |
| Top logits | Logits of argmax token | Spikes on token transitions | Tracks next-token prediction, not final answer |
| Top probs | Logprobs of argmax tokens | Correlates with prompt tokens | Reflects surface-level token likelihood |
Implication: Answer-token logit fluctuations may identify Aha moments in CoT reasoning.
Experiment 1: Logit dynamics across reasoning traces
I track Δlogit, the relative change in the answer-token logit at each generation step: Δlogit_t = (logit_t - logit_{t-1}) / |logit_{t-1}|. This measures how much the logit changed relative to its previous value; note that Δlogit can become very large when the previous logit is near zero. I visualize by color-coding each generated token by its Δlogit magnitude.
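The Δlogit definition above translates directly into code. A minimal sketch on a toy logit trace (synthetic numbers, not model output; the `eps` guard against division by a near-zero previous logit is my addition - it is exactly that near-zero denominator that produces huge values like the 3801 reported below):

```python
import numpy as np

def delta_logit(trace, eps=1e-8):
    """Relative change (logit_t - logit_{t-1}) / |logit_{t-1}| at each step."""
    trace = np.asarray(trace, dtype=float)
    prev = trace[:-1]
    return (trace[1:] - prev) / np.maximum(np.abs(prev), eps)

# Toy answer-logit trace: roughly flat, then a sharp jump at step 3.
trace = [2.0, 2.1, 2.0, 8.0, 8.1]
d = delta_logit(trace)
spike_step = int(np.argmax(np.abs(d))) + 1  # +1: d[i] describes the change into step i+1
print(spike_step, round(float(d[spike_step - 1]), 2))  # 3 3.0
```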
- Model: Qwen3-30B-A3B-Instruct
- Dataset: StrategyQA (True/False)
- Methods tracked: All 4 from the comparison table
Example from StrategyQA: "Were Raphael's paintings influenced by the country of Guam?" (Answer: False)
- Answer tokens: the token "he" spikes sharply (Δlogit = 3801) right before stating the decisive fact - "he died in 1520", before Europeans ever reached Guam (1521). This is the Aha moment.
- Control tokens: the largest spike (Δlogit = 300) is at the "1" in "14th century" - not semantically meaningful.
- Top logits: a spike (Δlogit = 0.9) at "-ious" in "harmonious" - just a token-transition artifact.
- Top probs: a spike (Δlogit = 3.3) at "Raphael" - simply echoing the question subject.
Finding: Answer-token logits spike at semantically meaningful reasoning moments; controls do not.
Experiment 2: Does Aha magnitude matter?
For the same question, I sample 10 reasoning traces and compare the traces with the highest and lowest peak Δlogit magnitude.
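The selection step can be sketched as follows: compute the peak Δlogit of each sampled trace's answer-token logits, then pick the argmax/argmin traces for comparison (synthetic stand-in traces here; in practice each comes from one sampled CoT):

```python
import numpy as np

def peak_delta_logit(trace, eps=1e-8):
    """Largest relative logit change over one trace's answer-token logits."""
    t = np.asarray(trace, dtype=float)
    d = (t[1:] - t[:-1]) / np.maximum(np.abs(t[:-1]), eps)
    return float(np.max(np.abs(d)))

# 10 hypothetical answer-logit traces: small noise around 2.0,
# with one decisive spike planted in trace 4.
rng = np.random.default_rng(1)
traces = [2.0 + 0.1 * rng.normal(size=30) for _ in range(10)]
traces[4][20:] += 50.0

peaks = [peak_delta_logit(t) for t in traces]
max_idx, min_idx = int(np.argmax(peaks)), int(np.argmin(peaks))
print(max_idx)  # 4 - the trace with the strongest "Aha"
```

The max/min traces are then inspected qualitatively, as in the two samples below.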
Max sample (Δlogit = 3801): The reasoning explicitly states "he died in 1520" before European contact with Guam - a concrete, decisive fact.
Min sample (Δlogit = 52): Also concludes False, but never states the specific 1520 death date. Spikes occur at vague moments like "What is Guam" and "was not known to Europeans".
Finding: Higher Δlogit magnitude correlates with more concrete, factual reasoning.