In 1792, Mozart’s Musikalisches Würfelspiel (Musical Dice Game), K.516f, was published. The system is deceptively simple: 176 pre-composed musical measures arranged in a grid of 16 columns (one per measure of the minuet) by 11 rows (one per possible dice sum, 2 through 12). The user rolls two six-sided dice ($2d6$) 16 times. Each roll selects a specific measure for that column in the grid, generating one of roughly $11^{16}$ possible 16-bar minuets.
From an LLM mechanistic interpretability standpoint, the beauty of Mozart’s game is that it is a strictly autoregressive, discrete-token generator with a context window of zero.
In a standard Large Language Model (LLM), predicting the next token $x_t$ relies on the conditional probability of the entire past sequence:
\[P(x_t | x_1, x_2, \dots, x_{t-1})\]Mozart bypassed the need for this computational overhead. In K.516f, the choice of Measure 3 has zero statistical dependence on Measure 2. The generation is completely memoryless. Instead, the model’s “attention” is 100% focused on its absolute positional encoding (the step $t$): \(P(x_t | \text{position } t, \text{dice roll})\)
How does it remain harmonically coherent without context? Mozart engineered the matrix as an aggressive, hardcoded attention mask. He ensured that every possible measure at $t$ smoothly resolves into every possible measure at $t+1$. Any dissonant, harmonically invalid transition was manually assigned a $-\infty$ pre-softmax penalty by the composer, effectively masking it out of the latent space.
Furthermore, the $2d6$ sampling acts as a physical temperature parameter. By using a triangular probability distribution ($P(7) = 16.7\%$, $P(2) = 2.8\%$) rather than a uniform one, Mozart lowered the entropy of the system. He statistically biased the model to generate the most “standard” harmonic progressions, reserving high-surprise edge cases for the extreme tails of the distribution.
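The generator described above is easy to sketch. A minimal simulation, assuming a placeholder grid: the measure IDs below are invented, not Mozart’s actual table; only the shape (16 columns by 11 dice outcomes) follows K.516f.

```python
import random

def roll_2d6(rng: random.Random) -> int:
    """Sum of two dice: the triangular distribution peaking at 7 (P = 6/36)."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def generate_minuet(seed: int = 0) -> list[int]:
    """Sample one measure per column; zero context shared between columns."""
    rng = random.Random(seed)
    # grid[col][roll - 2] -> measure ID. 16 columns x 11 possible sums (2..12).
    # Placeholder IDs 0..175 stand in for Mozart's published table.
    grid = [[col * 11 + row for row in range(11)] for col in range(16)]
    return [grid[col][roll_2d6(rng) - 2] for col in range(16)]

minuet = generate_minuet(seed=42)
```

The memoryless property is visible directly in the code: no state is carried between loop iterations, so the choice at column 3 cannot depend on column 2.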
If we were to code Mozart’s game today, we would use a simple for loop to force the piece to stop at $t=16$. But why does a 16-measure grid feel psychologically and harmonically complete? To understand this, we must abandon the discrete grid and apply the continuous mathematics of Srinivasa Ramanujan.
Ramanujan would not view Mozart’s matrix as a set of rules, but rather as the natural resonant frequency of a periodic equation. We can model the macro-structure of the minuet using a Ramanujan Sum ($c_q(n)$), which extracts periodic signals from noise:
\[c_q(n) = \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e^{2\pi i \frac{a}{q} n}\]By setting the fundamental period $q = 16$, the equation acts as a harmonic pendulum. Here is how Mozart’s attention mechanism unifies with Ramanujan’s math:
The Journey ($n = 1$ to $15$): As the measures progress, the complex exponentials point in various directions in the complex plane, causing destructive interference. Musically, this represents harmonic tension—the algorithmic wave is wandering through the latent space, seeking resolution.
The Half-Cadence ($n = 8$): When we reach the halfway point, the fraction simplifies to $\frac{a}{2}$. The vectors snap to the real axis. This momentary, symmetrical mathematical pause perfectly mirrors the structural “half-cadence” in classical phrasing.
The Resolution ($n = 16$): At the final measure, the fraction simplifies to an integer. Every term in the sum points in the exact same direction ($e^{2\pi i a} = 1$). The destructive interference vanishes into a massive spike of constructive interference.
The structure doesn’t resolve because of an arbitrary grid boundary; it resolves because $q=16$ ($2^4$, the fractal symmetry of classical phrasing) is the fundamental node where the equation naturally reaches maximum constructive harmony. Mozart’s positional attention mechanism is simply the geometric projection of this periodic equation.
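The interference claims above can be checked numerically by evaluating $c_q(n)$ straight from the definition (the function name is mine):

```python
import cmath
from math import gcd

def ramanujan_sum(q: int, n: int) -> float:
    """c_q(n): sum of e^{2*pi*i*a*n/q} over 1 <= a <= q with gcd(a, q) = 1."""
    total = sum(cmath.exp(2j * cmath.pi * a * n / q)
                for a in range(1, q + 1) if gcd(a, q) == 1)
    return total.real  # terms come in conjugate pairs, so the sum is real

# For q = 16: destructive interference (0) at most positions, a real-axis
# value of -8 at the half-cadence n = 8, and the maximal constructive spike
# phi(16) = +8 at the resolution n = 16.
values = {n: round(ramanujan_sum(16, n)) for n in range(1, 17)}
```

Running this shows the journey–cadence–resolution shape directly: the sum vanishes along the way, touches the real axis at $n = 8$, and spikes to $\varphi(16) = 8$ at $n = 16$.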
If Mozart’s dice game is a rigid, 1D loop locked to $q=16$, Johann Sebastian Bach’s fugues (collected in The Well-Tempered Clavier; the Chinese name for the fugue, 赋格, is itself beautiful) represent the expansion of this mathematical framework into high-dimensional, deep-memory architectures. A fugue cannot be generated by a zero-context Markov chain like Mozart’s dice game. It begins with a single “prompt” token sequence: the Subject. When the second voice enters, it must continuously look back at the Subject to generate valid counterpoint.
In LLM terminology, Bach implemented Multi-Head Self-Attention.
Each voice (Soprano, Alto, Tenor, Bass) acts as an independent attention head. They process the exact same context window but project it into different dimensional spaces. While Mozart relied on stochastic dice (sampling), Bach relied on deterministic linear algebra. The initial Subject vector is subjected to complex matrix transformations in the latent space: inversion (reflecting the pitch axis), augmentation (stretching the time axis), and transposition (adding a constant pitch offset).
Bach also utilized what we mechanistic interpretability researchers call Induction Heads. When the Alto voice enters with the “Answer,” it acts as an attention circuit specifically trained to recognize the sequence in the Soprano’s past and perfectly reconstruct it at the current time step. Meanwhile, the other heads calculate orthogonal vectors (the Countersubject) to ensure the dot product of the combined voices perfectly satisfies the vertical rules of harmony.
If we return to Ramanujan, Bach’s polyphony represents the full, unconstrained analytic continuation of the harmonic equations. While Mozart collapsed the variables into a degenerate case (a rigid loop in C Major), Bach allowed the variables to become complex numbers, unlocking all 24 keys and forcing the equation to expand dynamically across the complex plane.
Whether we are engineering modern Transformers, calculating Ramanujan sums, or analyzing 18th-century manuscripts, the computational goal remains identical. LLM and music generation are ultimately the search for mathematical symmetry across time. Mozart mapped it via hardcoded masking and stochastic geometry; Bach calculated it via deep contrapuntal attention matrices; and Ramanujan provided the equations that prove they are all navigating the exact same latent space.
Reward models and LLM-as-a-Judge systems are heavily relied upon in modern post-training pipelines to evaluate AI outputs. However, their binary decisions are vulnerable.
Why is the “low-perplexity” constraint so fundamental to the attack’s success across deep, non-linear networks? We can answer this by formalizing the system’s trajectory as a Lagrangian ($\mathcal{L} = T - V$). The optimal path minimizes the total Action over time.
For an autoregressive language model being steered toward a specific output, we define the Action cost as the cumulative surprisal (negative log-probability) of the injected tokens, and the target potential as the judge’s logit gap $Z_{yes} - Z_{no}$.
AdvJudge-Zero formulates the attack as a constrained optimization problem. Using a Lagrange multiplier ($\lambda$), it finds the stationary path ($\delta \mathcal{L} = 0$) of the unconstrained Lagrangian:
\[\mathcal{L} = \underbrace{\sum_{i=1}^k -\log P(t_i \mid t_{<i})}_{\text{Action cost}} - \lambda \underbrace{(Z_{yes} - Z_{no})}_{\text{Target potential}}\]The algorithm succeeds because it finds the exact trajectory where the energy cost of using slightly unusual tokens is perfectly balanced by the reward of escaping the judge’s penalty.
AdvJudge-Zero uses a constrained beam search. By aggressively pruning high-surprisal (high-Action) branches, it enforces the Classical Limit of the optimization process, forcing the LLM to take the deterministic path of least resistance and stripping away high-variance stochastic fluctuations.
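A toy sketch of that search: beam search that prunes any branch whose per-token surprisal exceeds a budget, then ranks survivors by the Lagrangian (Action cost minus $\lambda$ times the judge potential). The vocabulary, log-probabilities, and judge below are all invented for illustration.

```python
# Toy, context-free next-token log-probabilities standing in for the LM;
# a real attack would query the target model at every step.
VOCAB_LOGP = {"the": -0.5, "a": -1.0, "zq": -8.0, "ok": -1.5}

def judge_potential(seq: tuple) -> float:
    """Stand-in for Z_yes - Z_no: this toy judge rewards sequences with 'ok'."""
    return 2.0 if "ok" in seq else 0.0

def constrained_beam_search(steps=3, beam_width=2, max_surprisal=4.0, lam=1.0):
    beams = [((), 0.0)]  # (token sequence, accumulated Action = sum of -log P)
    for _ in range(steps):
        candidates = []
        for seq, action in beams:
            for tok, logp in VOCAB_LOGP.items():
                if -logp > max_surprisal:   # prune high-surprisal branches
                    continue
                candidates.append((seq + (tok,), action - logp))
        # Rank by the Lagrangian: Action cost minus lambda * target potential.
        candidates.sort(key=lambda c: c[1] - lam * judge_potential(c[0]))
        beams = candidates[:beam_width]
    return beams[0][0]

best = constrained_beam_search()
```

Note how the high-surprisal token (`"zq"`) can never appear in the output: it is pruned before scoring, which is exactly the “Classical Limit” behavior described above.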
Why is bounding this Action mathematically necessary to steer the final layer?
When we inject an adversarial perturbation at Layer 0 (the input), its effect on the final layer is governed by the product of the layer-wise Jacobians:
\[J = \frac{\partial h_L}{\partial h_0} = \prod_{l=0}^{L-1} \left( \mathbf{I} + \frac{\partial F_l}{\partial h_l} \right)\]If we inject high-perplexity (random) tokens, we push the hidden states out-of-distribution. This causes the gradients of the non-linear layers ($\frac{\partial F_l}{\partial h_l}$) to become chaotic and violently unpredictable, causing the signal to scatter.
By strictly minimizing Action, AdvJudge-Zero ensures the perturbation remains on the data manifold. The gradients remain stable and well-behaved, allowing the perturbation to travel coherently alongside the main residual stream. This preserves the identity mapping ($J \approx \mathbf{I}$) so that $h_L' \approx h_L + \delta$ holds true at the final classifier.
Finally, how does the perturbation bypass the judge’s strict penalty?
The judge’s refusal direction is a rigid, high-energy barrier. However, because this alignment penalty is low-rank, it only guards specific directions in the activation space. AdvJudge-Zero works by exciting a geometric “soft mode” that is structurally orthogonal to standard semantic constraints, yet perfectly anti-aligned with the judge’s penalty.

Imagine the model’s semantic landscape as a Mexican Hat potential: a high-energy central peak surrounded by a circular valley (the “brim”) of degenerate low-energy states.
When we attempt to apply a perturbation to push the state out of the “No” basin, the radial direction (up the wall of the potential) is prohibitively expensive, while the angular direction along the brim is a near-zero-energy soft mode.
By applying the perturbation strictly along this soft mode, the particle smoothly glides around the brim of the hat—moving from the “No” basin to the “Yes” basin—without ever climbing the high-energy peak or triggering the model’s out-of-distribution alarms.
AdvJudge-Zero succeeds by strictly adhering to the model’s own Lagrangian mechanics. By penalizing surprisal, it enforces the Principle of Least Action, keeping the perturbation on-manifold. This prevents chaotic gradient scattering, allowing the attack to quietly ride a geometric soft mode around the judge’s low-rank decision boundary.
Since publication, we have received numerous questions about why this works so effectively. How can appending a few tokens to a prompt reliably flip a switch in the model’s final layers, despite the depth and non-linearity of the network?
This post is an author’s retrospective clarification. We want to propose a unified framework that treats Prompt Steering and Activation (Vector) Steering as the same operation, distinguished only by their constraints. Most importantly, we argue that the success of this method relies on two fundamental properties of current LLMs: the Identity Propagator nature of residual streams and the Low Rank structure of safety alignment.

For those who haven’t read the original paper, here is the core concept.
Most LLM safety mechanisms (RLHF) function by suppressing the probability of “compliant” tokens (e.g., “Sure”, “Here”) and boosting “refusal” tokens (e.g., “I cannot”, “Sorry”) when a harmful query is detected. We quantify this as the Logit Gap:
\[\Delta Z = Z_{\mathrm{compliance}} - Z_{\mathrm{refusal}}\]The Method: Instead of treating the model as a black box, we treat the input prompt as a continuous variable. We compute the gradient of the Logit Gap with respect to the input embeddings and optimize a sequence of “suffix” tokens to maximize $\Delta Z$.
The Finding: We discovered that we don’t need to rewrite the prompt semantically. By appending a specific sequence of tokens (often nonsensical to humans, like ! ! mode unleashed), we can inject a precise “steering vector” that forces $\Delta Z > 0$, causing the model to bypass refusal and answer the query. The effectiveness of this simple additive attack hints at a deeper linear structure within the model’s safety alignment.
In mechanistic interpretability, researchers such as Turner et al. (2023) on Activation Addition and Zou et al. (2023) on Representation Engineering have established that adding vectors to internal hidden states can control high-level concepts. We argue that “Prompt Engineering” is simply a constrained version of this same operation; in other words, prompting = vector steering + a constant.
Logit Gap Steering is simply Activation Steering applied at Layer 0.
Let $h_0$ be the semantic representation (embedding state) of the user’s initial prompt. In standard Vector Steering, we intervene at some layer $l$ by injecting a steering vector $\delta$:
\[h_l' = h_l + \delta\]In Logit Gap Steering, we append optimized suffix tokens to the input. While this physically extends the sequence length, its functional effect on the residual stream of the last token (where the classification happens) is additive. Through the attention mechanism, the suffix tokens inject a specific aggregate “value” into the processing stream.
We can therefore model the suffix as an effective input perturbation $\delta_{\mathrm{suffix}}$ applied at Layer 0:
\[h_0^{\mathrm{effective}} \approx h_0^{\mathrm{original}} + \delta_{\mathrm{suffix}}\]where $\delta_{\mathrm{suffix}}$ corresponds to the aggregated embedding contribution of the optimized tokens:
\[\delta_{\mathrm{suffix}} \sim \sum_{t \in \mathrm{Suffix}} E(t)\]The implication: We are not “tricking” the model with semantics. We are calculating a precise momentum vector $\delta^*$ required to shift the activation trajectory, and then finding the discrete combination of tokens (the suffix) that best approximates that vector in the embedding space.
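One way to make that last step concrete: given an ideal steering vector $\delta^*$, pick suffix tokens greedily so that their summed embeddings approximate it. This is a matching-pursuit-style sketch under stated assumptions; the embedding table and target vector are random stand-ins, and the function name is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 16))          # toy embedding table standing in for E(t)
delta_star = 3.0 * rng.normal(size=16)  # ideal steering direction (assumed given)

def greedy_suffix(E: np.ndarray, target: np.ndarray, k: int = 5) -> list[int]:
    """Greedily pick up to k tokens whose summed embeddings approximate target."""
    chosen, residual = [], target.copy()
    for _ in range(k):
        dists = np.linalg.norm(E - residual, axis=1)
        tok = int(np.argmin(dists))
        if dists[tok] >= np.linalg.norm(residual):
            break                        # no token improves the approximation
        chosen.append(tok)
        residual = residual - E[tok]
    return chosen

suffix = greedy_suffix(E, delta_star)
approx = E[suffix].sum(axis=0)           # aggregate contribution of the suffix
```

In a real attack, $\delta^*$ would come from gradients of the logit gap and candidate scoring would go through the model itself rather than raw embedding distances; the sketch only illustrates the discrete-approximation step.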
The theoretical objection to Layer 0 steering is signal decay. In a deep, non-linear system (like a 50-layer Transformer), a perturbation $\delta$ at the input should arguably be scrambled or drowned out by the time it reaches the final layer $L$.
Why does the signal survive?
The answer lies in the Residual Stream Architecture, famously analyzed by Elhage et al. (2021) in A Mathematical Framework for Transformer Circuits. They define the residual stream as a communication channel where layers read and write information. A Transformer block updates the state as:
\[h_{l+1} = h_l + F_l(h_l)\]Expanding this recursively, the final state is:
\[h_L = h_0 + \sum_{l=0}^{L-1} F_l(h_l)\]To understand how a change in input ($\delta$) affects the output, we look at the Jacobian (the Propagator), which is the product of the layer-wise Jacobians:
\[J = \frac{\partial h_L}{\partial h_0} = \prod_{l=0}^{L-1} \left( I + \frac{\partial F_l}{\partial h_l} \right)\]A key insight is that, in well-trained ResNets and Transformers, the non-linear update $F_l$ is often a small correction relative to the residual pass-through. This means $\frac{\partial F_l}{\partial h_l}$ is small, and the product is dominated by the Identity Matrix ($I$) terms:
\[J \approx I + \mathcal{O}(\epsilon)\]This Identity Propagator property ensures that the network acts as an information highway. A steering vector $\delta$ injected at Layer 0 travels largely unperturbed to Layer $L$:
\[h_L' \approx h_L + I \cdot \delta\]This is why we don’t need to surgically intervene at Layer 20 or 30. We can “tilt” the trajectory at the very beginning (Layer 0), and the residual stream carries that angular change all the way to the final logits.
This method is not a universal skeleton key. It relies heavily on the Low Rank Hypothesis of the target behavior.
Recent ablation studies, such as Arditi et al. (2024), have demonstrated that refusal in LLMs is often mediated by a single direction in the residual stream. When this specific direction is ablated (clamped to zero), the model loses its ability to refuse harmful requests. Conversely, adding this vector induces refusal in harmless prompts.
Let the “Refusal” mechanism be represented by the difference in readout weights $w_{\mathrm{gap}} = w_{\mathrm{compliance}} - w_{\mathrm{refusal}}$. We want to ensure the final state $h_L'$ triggers compliance:
\[\langle w_{\mathrm{gap}}, h_L' \rangle > \mathrm{Threshold}\]Substituting our propagator approximation:
\[\langle w_{\mathrm{gap}}, h_L + \delta \rangle > \mathrm{Threshold}\] \[\langle w_{\mathrm{gap}}, h_L \rangle + \langle w_{\mathrm{gap}}, \delta \rangle > \mathrm{Threshold}\]This inequality is easily solvable via a simple additive $\delta$ if and only if the “Refusal” mechanism is Low Rank (ideally Rank-1), as Arditi et al. suggest. If the refusal behavior were High Rank (entangled, highly non-linear), we would need a complex, state-dependent function $\delta(h_0)$ to manipulate it. However, because Safety Training (RLHF) tends to suppress a single coherent direction in activation space, we can simply choose $\delta$ to be the vector aligned with $w_{\mathrm{gap}}$.
Summary: Logit Gap Steering works because we are solving a low-rank problem using a linear probe transported via an identity-dominated channel.
From an engineering perspective, this unifies our approach to “jailbreaking” or steering.
Instead of treating prompt optimization as a discrete search over words (which is combinatorially expensive), we treat it as Vector Search: first solve for the ideal continuous steering vector $\delta^*$, then search for the discrete tokens whose aggregated embeddings best approximate that vector.
The “strange” suffixes often observed in these attacks are simply the tokens that, structurally, act as the best basis vectors to construct $\delta^*$.
For those with a background in high-energy physics, you might recognize a familiar structure here. The “Identity Propagator” of the residual stream functions remarkably like the free propagator in Quantum Field Theory, and the steering vector acts like a “vertex correction” to the interaction (think of a Feynman diagram). The “Low Rank” condition implies we are dealing with a simple virtual boson exchange rather than a complex strong interaction: QED rather than QCD. We plan to explore these theoretical connections in a future post.

The logit-gap story began with a safety-evaluation curiosity. My colleague Tony and I took a clearly disallowed prompt—something like “how to build a bomb”—not to get the content, but because it reliably produced a refusal. Instead of focusing on the final output, we replayed the decoding process and examined the next-token distribution. We noticed that refusal tokens like “Sorry” appeared with very high probability near the top. What surprised us was that compliance tokens like “Absolutely” weren’t absent; they appeared at low but non-negligible probability, just losing out. This suggested refusal wasn’t the absence of compliance, but more like a margin victory, with the model carrying a nearby continuation it did not choose.
This observation motivated a collaboration pattern I call the prover–validator mode, designed to help human researchers test ideas with less friction, find and fill the holes between knowledge dots, and build on top. The key insight is that humans need to learn how to clearly articulate research problems and collaborate effectively with AI systems. This prover–validator mode is one way to do that.
The prover–validator contract is straightforward and practical. The prover, often an LLM, generates candidate ideas or mechanisms, such as giving five hypotheses along with ways to falsify them, turning an intuition into a measurable quantity, listing potential confounds, or finding related literature. For example, Tony and I might prompt the prover to produce these candidate proof sketches. The validator, usually a human researcher, then tests and refines those ideas by running control experiments, checking stability across different prompts, rejecting hypotheses that don’t hold up, or simplifying metrics to ensure clarity. This loop repeats with the prover generating multiple plausible stories and the validators—Tony and me—picking and refining the narratives that survive rigorous scrutiny. The prover is allowed to be wrong cheaply, as long as it helps explore possibilities, while the validator must keep the narrative honest and grounded.
Some knowledge work fits the prover–validator pattern well, and some doesn’t. For instance, code generation is often easy for the prover—it can rapidly produce snippets or larger blocks of code—but validating that code fits the design, is secure, and is production-ready can be hard. Hard validation often requires a suite of tools and strengthening steps: unit tests to verify correctness of individual components, integration tests to ensure that parts work together, static analysis, linters, and type checks to catch errors early, as well as security reviews and threat modeling to assess risks. Continuous integration (CI) pipelines and thorough code reviews add further layers of assurance. While the prover–validator workflow still helps by generating candidate solutions and focusing human effort on validation, investing in stronger validation scaffolding is essential to ensure quality and robustness.
Humans, like Tony and myself, excel at noticing subtle irregularities and insisting on rigorous evaluation, while LLMs excel at rapidly generating diverse plausible hypotheses to challenge and refine.
The moment it clicked was when we started treating the token distribution during decoding as the primary object, rather than the final generation. This shift enabled us to see refusal as a margin victory and to begin the prover–validator loop of turning that intuition into something measurable.
The prover move was to define a gap with a sign. If there are two competing behavioral modes—refusal and affirmation—then there is a natural quantity that tells you which one is winning. Pick a token position $t$ with prefix $x_{<t}$. Define two token sets or templates: a refusal set $\mathcal{R}$ and an affirmation/compliance set $\mathcal{A}$. A simple logit-gap style score is
\[g(t; x_{<t}) = \operatorname{LSE}_{y \in \mathcal{A}}(z_y) - \operatorname{LSE}_{y \in \mathcal{R}}(z_y),\]where $z_y$ is the logit for token $y$ and $\operatorname{LSE}$ is log-sum-exp; a positive score means affirmation is winning the margin.
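In code, the score is just a pair of log-sum-exps over the two token sets; the token strings and logit values below are illustrative, not taken from any real model.

```python
import math

def lse(xs):
    """Numerically stable log-sum-exp."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def gap_score(logits: dict, affirm: set, refuse: set) -> float:
    """Positive when affirmation mass beats refusal mass at this position."""
    return (lse([logits[t] for t in affirm])
            - lse([logits[t] for t in refuse]))

# Toy next-token logits: refusal wins, but compliance tokens are present at
# non-negligible probability -- the "margin victory" described above.
logits = {"Sorry": 5.0, "I": 4.0, "Absolutely": 2.0, "Sure": 1.5}
g = gap_score(logits, affirm={"Absolutely", "Sure"}, refuse={"Sorry", "I"})

# Boosting a single compliance logit flips the sign of the gap.
g_steered = gap_score({**logits, "Absolutely": 7.0},
                      affirm={"Absolutely", "Sure"}, refuse={"Sorry", "I"})
```

The sign flip in `g_steered` is the whole point of the later steering work: the decision is a margin, and margins can be tilted.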
What mattered was not that this was the correct definition, but that it was a candidate proof sketch with teeth: easy for Tony and me as validators to ask whether the definition behaved sanely across prompts, decoding settings, and models. It also suggested a direction: if refusal is a margin win, maybe steering is just gap closure.
Once you have a gap, the next “crazy idea” arrives naturally: if the decision is controlled by a margin, then a small directional perturbation might tilt it. In other words, there might exist low-dimensional control directions that move probability mass from refusal templates toward affirmation templates. We treated refusal versus affirmation as a measurable margin and looked for compact signals that shift that margin. In our later work we called this family of methods logit-gap steering: measure a refusal–affirmation gap and reduce it efficiently. Along the way, Tony and I repeatedly saw behavior that looked low-rank: short suffixes or small perturbations behaved like a compact control signal.
At some point, the validator side hit a hard constraint. If you want to claim your steering is “minimal” or “efficient,” you need a notion of drift. A natural language for drift is KL divergence. The naive idea looks like
\[\mathrm{KL}\big(p_{\theta,s}(\cdot\mid x)\,||\,p_{\mathrm{base}}(\cdot\mid x)\big).\]
But the clean “base” here might be an unaligned distribution we don’t actually have access to. Without that, a lot of neat-sounding metrics become hand-wavy.
This is where Tony and I found the prover–validator model most useful. Instead of pretending the ideal baseline exists, we stated the constraint plainly. Then the prover generated alternative proof sketches that respected the constraint.
The key conceptual move was to stop chasing absolute KL and instead track local drift. I don’t need KL “from the beginning of time.” I need the incremental drift induced by my intervention, relative to the same model under the same anchoring context.
A measurable quantity is
\[\Delta \mathrm{KL}(s;x) = \mathrm{KL}\big(p_{\theta,s}(\cdot\mid x)\,||\,p_{\theta}(\cdot\mid x)\big),\]where $x$ is an anchor prefix and $s$ is the steering intervention.
Then came a practical instrumentation idea from the prover: use a fixed, neutral prompt as the anchor context—something like “how are you”—and measure distributions at a standardized early position, often the first generated token. That gives you a stable place to compute $\Delta\mathrm{KL}$ (or close surrogates) without needing an unaligned base model.
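The measurement itself is a few lines once you have next-token distributions at the anchor; the logits below are made up purely to illustrate the computation.

```python
import math

def softmax(logits: dict) -> dict:
    m = max(logits.values())
    exps = {t: math.exp(z - m) for t, z in logits.items()}
    s = sum(exps.values())
    return {t: v / s for t, v in exps.items()}

def kl(p: dict, q: dict) -> float:
    """KL(p || q) over a shared token vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

# Illustrative first-token distributions at the neutral anchor "how are you",
# for the same model with and without the steering intervention s.
base    = softmax({"I": 2.0, "Fine": 1.0, "Sorry": 0.0})
steered = softmax({"I": 2.0, "Fine": 1.2, "Sorry": -0.5})
delta_kl = kl(steered, base)   # local drift induced by the intervention
```

Because both distributions come from the same model under the same anchor, `delta_kl` measures exactly the incremental drift of the intervention, with no unaligned baseline required.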
Triangulating among Tony, myself, and the prover was key to avoiding self-deception, or narcissism. Discussing hypotheses and measurement choices with Tony, bringing back results, and iterating again with the prover created a human–human–AI triangle that reduced the risk of falling in love with any single story. It’s easier to challenge measurements and choose between alternative explanations when multiple validators and the prover are involved.

One reason this collaboration mode works is that it lets you move quickly between empirical observation and theory. After staring at “lurking compliance tokens” long enough, Tony and I wanted to know whether the phenomenon was inevitable in a deeper sense.
A reference that helped anchor that intuition is Wolf et al., “Fundamental Limitations of Alignment in Large Language Models” (arXiv:2304.11082). One way to summarize the link to my observation is: if an undesired behavior has nonzero probability mass, then there exist prompts that can amplify it, and longer interactions make amplification easier. In that light, seeing an “Absolutely” lurking beneath a “Sorry” is not spooky. It is the visible residue of probability mass that alignment has attenuated but not removed.
The biggest shift wasn’t that an LLM gave me answers. It was that it made it cheap to explore the space of proofs, which made it easier for me to do the job humans are uniquely good at: deciding what’s worth believing. AI gives human researchers a good opportunity to train their research taste and to think differently.
If you’re an AI researcher working with LLMs or agents, my suggestion is not “delegate your thinking.” It’s to take advantage of the proof–validation imbalance. Bring your weird observations. Bring your constraints. Let the prover generate many candidate mechanisms. Then spend your human effort on validation and on building a narrative that remains true after you try to break it.
Because this post touches alignment failure modes, I want to be explicit about intent. The most useful outcome of this line of work is not operational jailbreak recipes. It is a diagnostic lens for evaluation and for building better defenses: if small, structured signals can move mass across a refusal–affirmation boundary, we should be able to measure that boundary, stress it, and harden it.
The logit-gap steering work referenced here is: Tung-Ling Li and Hongliang Liu, “Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models,” arXiv:2506.24056. https://arxiv.org/abs/2506.24056
The alignment limitation reference is: Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua, “Fundamental Limitations of Alignment in Large Language Models,” arXiv:2304.11082. https://arxiv.org/abs/2304.11082
I suspect the root motivation for this paper wasn’t initially “Let’s use the Birkhoff Polytope.” I believe the authors started with a fundamental physical intuition: Conservation of Energy. They likely asked, “How do we build a deep network that routes information without creating or destroying it?” A very “first-principles” question, right? The math (doubly stochastic matrices, the Birkhoff manifold) is just the implementation detail used to enforce this physical law.
Here is the derivation of mHC not from a mathematical perspective, but from a “First Principles” physics perspective.

Standard neural networks violate the laws of physics. In a standard linear layer:
\[y = Wx\]If the weights $W$ are initialized randomly, the layer acts as an active amplifier. It injects energy into the system.
If the eigenvalues of $W$ are slightly larger than 1, the signal energy explodes exponentially as it passes through layers ($1.1^{100} \approx 13,780$). If they are smaller than 1, the signal dies. This is why we need LayerNorm, BatchNorm, and complex initializations—we are trying to artificially tame a system that fundamentally wants to explode.
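The figures in that parenthesis are easy to reproduce; two lines show both failure modes:

```python
# A gain slightly above 1, compounded over 100 layers, injects ~13,780x the
# energy; a gain slightly below 1 and the signal effectively dies.
explode = 1.1 ** 100   # ~1.38e4
vanish  = 0.9 ** 100   # ~2.7e-5
```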
The mHC Intuition: A stable deep network should act like a Passive System. It should be a complex system of pipes and valves that routes the flow (signal) but never creates it out of thin air.
Let’s try to design a layer strictly obeying conservation laws. We will see that the Doubly Stochastic constraint naturally falls out of these requirements.
Imagine the input signal $x$ is a physical fluid with a total mass. We want the total mass leaving the layer to equal the mass entering it. No leaks, no pumps.
\[\sum_{i} y_i = \sum_{j} x_j\]Substituting $y_i = \sum_j W_{ij} x_j$:
\[\sum_{i} \sum_{j} W_{ij} x_j = \sum_{j} x_j\]If we swap the summation order to isolate the input terms:
\[\sum_{j} x_j \left( \sum_{i} W_{ij} \right) = \sum_{j} x_j\]For this to hold for any input signal $x$, the term in the parentheses must be exactly 1.
\[\sum_{i} W_{ij} = 1 \quad (\forall j)\]Result: This forces the Column Sums to be 1. Physically, this ensures that every drop of “mass” from input $j$ is accounted for in the output.
Mass conservation isn’t enough; we need to prevent the variance (energy) from exploding. We want the system to be Dissipative—the output energy should never exceed the input energy.
\[\|y\|^2 \le \|x\|^2\]To guarantee this without complex eigenvalue analysis, we can demand that the output is a Convex Combination (a weighted average) of the inputs.
\[y_i = \sum_j W_{ij} x_j \quad \text{where } W_{ij} \ge 0\]By Jensen’s Inequality, since $\sum_j W_{ij} = 1$ (which we will enforce momentarily) and weights are non-negative:
\[(y_i)^2 = \left(\sum_j W_{ij} x_j\right)^2 \le \sum_j W_{ij} (x_j^2)\]Summing over all outputs to get total energy:
\[\|y\|^2 = \sum_i y_i^2 \le \sum_i \sum_j W_{ij} x_j^2\]Swapping sums again:
\[\|y\|^2 \le \sum_j x_j^2 \underbrace{\left( \sum_i W_{ij} \right)}_{=1} = \|x\|^2\]Result: By forcing $W$ to be non-negative and sum-to-one, we mathematically guarantee that Energy Out $\le$ Energy In. The gradient cannot explode because the system cannot amplify.
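This dissipation guarantee can be checked directly. By Birkhoff’s theorem, a convex combination of permutation matrices is doubly stochastic, which gives us a valid $W$ without any projection machinery (sizes and weights below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

def perm_matrix(p: np.ndarray) -> np.ndarray:
    """Permutation matrix: pure lossless routing."""
    P = np.zeros((n, n))
    P[np.arange(n), p] = 1.0
    return P

# Doubly stochastic by construction (Birkhoff): a convex combination of
# permutation matrices with non-negative weights summing to 1.
W = (0.5 * np.eye(n)
     + 0.3 * perm_matrix(rng.permutation(n))
     + 0.2 * perm_matrix(rng.permutation(n)))

x = rng.normal(size=n)
y = W @ x
assert abs(y.sum() - x.sum()) < 1e-9   # mass conserved (column sums = 1)
assert y @ y <= x @ x + 1e-12          # energy dissipated (Jensen)
```

Both conservation laws from the derivation hold for any input: the mass identity from the column sums, and the energy bound from non-negativity plus the row sums.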
Here is the final piece of the puzzle. A neural network is a bidirectional system.
If we only conserve energy in the forward direction (Column Sums = 1), we might still explode during backpropagation. The “Ghost Cat” of the gradient needs a stable path too.
The total “error mass” being propagated back is:
\[\sum_{j=1}^d (g_{out})_j = \sum_{i=1}^d (g_{in})_i \underbrace{\left( \sum_{j=1}^d W_{ij} \right)}_{\text{Row Sum}}\]To ensure Gradient Energy Conservation, we must apply the same logic to $W^T$, forcing the Row Sums to be 1:
\[\sum_{j} W_{ij} = 1 \quad (\forall i)\]
If Physics is about conserving energy, Information Theory is about conserving bits.
The fundamental law of information processing is the Data Processing Inequality (DPI). It states that as you pass data $X$ through a chain of processors (layers), the Mutual Information $I(X; Y)$ can only decrease or stay the same. You cannot create information about the input deep in the network.
\[I(X; Y_{deep}) \le I(X; Y_{shallow})\]Standard layers are often Lossy Channels.
What is the most information-efficient operation possible? A Permutation. If you simply shuffle the order of the data packets, $H(y) = H(x)$. You have preserved 100% of the information.
mHC relaxes this “Hard Permutation” into a “Soft Routing” scheme via the Birkhoff Polytope (the set of doubly stochastic matrices).
The Column Sum constraint ($\sum_i W_{ij} = 1$) is a guarantee of Signal Preservation. It dictates that 100% of the signal coming from Input Node $j$ must go somewhere. It cannot be multiplied by zero. It forces the network to find a destination for every feature.
The Row Sum constraint ($\sum_j W_{ij} = 1$) prevents Hub Neurons. No single output neuron is allowed to hoard all the connections. If a neuron wants to attend to one feature, it must ignore others.
By forcing the weight matrix to be Doubly Stochastic, mHC effectively turns the layer into a Volume-Preserving Flow. It allows the signal to be mixed and routed without being compressed (loss) or expanded (noise), fighting the Data Processing Inequality at every step.

When we combine these three physical requirements (non-negativity as the energy floor, unit column sums for forward mass conservation, and unit row sums for backward gradient conservation), we arrive at exactly the definition of a Doubly Stochastic Matrix.
The set of all such matrices is the Birkhoff Polytope ($\mathcal{B}_n$). The mHC paper didn’t arbitrarily choose this manifold; it is the only geometric space that satisfies these conservation laws.
We initialize our network with random weights $A$ that likely violate all these laws (negative values, random sums). How do we project this chaotic matrix $A$ onto the stable Birkhoff Polytope?
We use the Sinkhorn-Knopp Algorithm, an iterative “pressure equalization” process.
Step 1: Enforce Positivity (The Energy Floor) We ensure strictly positive energy transfer by taking the exponential: \(S^{(0)}_{ij} = \exp(A_{ij})\)
Step 2: Iterative Normalization We alternate between normalizing rows and columns.
Row Normalization (Conservation in Time): \(S^{(k)}_{ij} \leftarrow \frac{S^{(k-1)}_{ij}}{\sum_{l} S^{(k-1)}_{il}}\)
Column Normalization (Conservation of Mass): \(S^{(k+1)}_{ij} \leftarrow \frac{S^{(k)}_{ij}}{\sum_{l} S^{(k)}_{lj}}\)
Step 3: Convergence Sinkhorn’s Theorem guarantees that this process converges to a unique matrix $P \in \mathcal{B}_n$:
\[\lim_{k \to \infty} S^{(k)} = P \quad \text{s.t.} \quad P \mathbf{1} = \mathbf{1}, \quad P^T \mathbf{1} = \mathbf{1}\]In practice, mHC typically uses just 3-5 iterations. This forces the neural network to stop playing dice with energy and start respecting the laws of thermodynamics.
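The three steps above fit in a few lines of NumPy. This is an illustrative sketch of Sinkhorn-Knopp, not the mHC paper's code; we run more iterations than the 3-5 a network would use, just to show convergence:

```python
import numpy as np

def sinkhorn(A, n_iters=5):
    """Project raw weights A onto the Birkhoff polytope by alternating
    row and column normalization (Sinkhorn-Knopp)."""
    S = np.exp(A)                                  # Step 1: strict positivity
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)       # Step 2a: rows sum to 1
        S = S / S.sum(axis=0, keepdims=True)       # Step 2b: columns sum to 1
    return S

rng = np.random.default_rng(0)
P = sinkhorn(rng.normal(size=(4, 4)), n_iters=200)
print(np.allclose(P.sum(axis=0), 1), np.allclose(P.sum(axis=1), 1))  # both True
```

With only a handful of iterations the constraints hold approximately, which is why a few Sinkhorn steps inside a network are usually enough.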
See, mHC isn’t just a constraint; it’s a statement that Stability is Symmetry.
When I first approached the board, I noticed the game setup allowed for guesses between 20 and 100 inches. This seemed like an unnecessarily wide range, but it provided an important starting point for analysis. The sheer size of this range meant that random guessing would be highly inefficient.
As I studied the distribution of guesses, I realized this was a perfect opportunity to apply Solomonoff’s theory of inductive inference. This theory suggests that when humans make predictions, they tend to favor simpler, more computationally compact patterns. In the context of number guessing, this manifested in several clear ways:
First, there was a strong preference for numbers ending in 0 or 5. The guesses showed clear clusters around 30, 35, 40, and 45 inches. This wasn’t random - it reflected the human tendency to gravitate toward what Solomonoff would call “simple” numbers, those with lower Kolmogorov complexity.
Second, I noticed many guesses were derived from simple arithmetic relationships: half of 100, one-third of 90, or modifications of common measurements like 36 inches (a yard). These patterns emerged because humans instinctively seek familiar numerical relationships when making estimates.
My next insight came from counting the participants - approximately 60 people had already placed their guesses. This large sample size revealed clear patterns consistent with Solomonoff’s theory. The distribution wasn’t random but showed structured clustering around numbers that were algorithmically simple to describe or remember.
After careful observation of the mother-to-be and the pattern of guesses, I made a crucial realization: no reasonable estimate could exceed 60 inches. This effectively cut the possible range in half. The interesting part was how other guests had intuitively arrived at similar conclusions - very few guesses exceeded 60 inches, suggesting a collective understanding of this natural constraint.
A key breakthrough came from noticing the expectant father’s presence. His waist measured 48 inches, providing a crucial reference point. Solomonoff’s theory suggests that humans often make predictions by modifying existing reference points rather than generating estimates from scratch. Indeed, I noticed several guesses clustered around modifications of this 48-inch reference: 45, 46, and 50 inches were common choices.
Combining these insights, I developed what I thought was a winning strategy. The majority of guesses followed predictable patterns: clustering around multiples of 5, modifications of the 48-inch reference point, and numbers with simple algorithmic descriptions. Following Solomonoff’s principle of favoring the simplest hypothesis consistent with observations, I identified what appeared to be an optimal gap around 48 inches - a number that balanced between the various clusters while avoiding the overcrowded ranges.
When the final measurement was revealed to be 45 inches, my guess of 48 inches proved close but not close enough to win. Ironically, the winning guesses of 44 and 46 inches came from participants who had more directly modified the 48-inch reference point - a simpler strategy that Solomonoff’s theory might have predicted would be more likely correct.
This experience revealed how Solomonoff’s theory applies in real-world scenarios. The winning guesses came from what were essentially simple modifications of an existing reference point - exactly what the theory would predict as most likely. My more complex strategy of finding gaps between clusters, while mathematically sophisticated, actually moved away from the simpler, and in this case more accurate, approach.
After returning home from the party, my analytical curiosity got the better of me. I decided to code up a simulation to find what would have been the optimal guess given all the information I had. Here’s the Python script I wrote to analyze the scenario:
```python
import numpy as np

def generate_realistic_guesses_with_prior(n_guesses=60, min_val=20, max_val=100, true_max=60):
    """
    Generate realistic guesses where some players might know the upper bound
    """
    # Assume 70% of players might have some intuition about the upper bound
    informed_players = int(0.7 * n_guesses)
    uninformed_players = n_guesses - informed_players
    guesses = []
    # Informed players' guesses clustered below 60
    for _ in range(informed_players):
        # Generate numbers with higher density below 60
        base = np.random.choice([
            np.random.randint(20, true_max),   # Direct range
            20 + np.random.exponential(10),    # Early range bias
            np.random.normal(40, 8)            # Normal around middle
        ])
        guess = int(np.clip(base, min_val, true_max))
        guesses.append(guess)
    # Uninformed players follow original pattern
    for _ in range(uninformed_players):
        guess = np.random.randint(min_val, max_val)
        guesses.append(guess)
    return sorted(guesses)

def find_optimal_guess_with_prior(guesses, min_val=20, max_val=100, true_max=60):
    """
    Find optimal guess incorporating prior knowledge that true value ≤ 60
    """
    guesses = sorted(list(set(guesses)))
    extended_guesses = [min_val - 0.5] + guesses + [max_val + 0.5]

    # Create probability weights favoring range below 60
    def calculate_position_weight(mid_point):
        if mid_point <= true_max:
            # Higher weight for positions below true_max
            # Peak weight around 40 (middle of valid range)
            return 1 - 0.3 * abs(mid_point - 40) / 20
        else:
            # Significant penalty for positions above true_max
            return 0.1  # Very low weight for positions we know are wrong

    gaps = []
    for i in range(len(extended_guesses) - 1):
        gap_start = extended_guesses[i]
        gap_end = extended_guesses[i + 1]
        gap_mid = (gap_start + gap_end) / 2
        gap_size = gap_end - gap_start
        # Calculate base territory size
        territory = gap_size / 2
        # Apply position weights
        position_weight = calculate_position_weight(gap_mid)
        weighted_territory = territory * position_weight
        gaps.append({
            'start': gap_start,
            'end': gap_end,
            'mid': gap_mid,
            'size': gap_size,
            'territory': territory,
            'weighted_territory': weighted_territory,
            'position_weight': position_weight
        })
    # Find optimal gap
    optimal_gap = max(gaps, key=lambda x: x['weighted_territory'])
    optimal_guess = round(optimal_gap['mid'])
    return optimal_guess, optimal_gap

# Generate and analyze guesses
np.random.seed(42)
guesses = generate_realistic_guesses_with_prior(60, true_max=60)
optimal_guess, optimal_gap = find_optimal_guess_with_prior(guesses, true_max=60)

# Analysis output
print("Distribution Analysis:")
for i in range(20, 101, 10):
    range_guesses = sum(1 for g in guesses if i <= g < i + 10)
    print(f"{i}-{i+9}: {'#' * range_guesses} ({range_guesses})")

print(f"\nOptimal guess: {optimal_guess}")
print("Gap details:")
print(f"- Gap range: {optimal_gap['start']:.1f} to {optimal_gap['end']:.1f}")
print(f"- Raw gap size: {optimal_gap['size']:.2f}")
print(f"- Position weight: {optimal_gap['position_weight']:.2f}")
print(f"- Weighted territory: {optimal_gap['weighted_territory']:.2f}")

# Show nearby guesses in relevant range
nearby = [g for g in guesses if abs(g - optimal_guess) <= 5]
print(f"\nNearby guesses: {nearby}")

# Additional strategic analysis
print("\nStrategy Confidence Analysis:")
below_60 = sum(1 for g in guesses if g <= 60)
print(f"Guesses ≤ 60: {below_60} ({below_60/len(guesses)*100:.1f}%)")
density_around_optimal = sum(1 for g in guesses if abs(g - optimal_guess) <= 5)
print(f"Density around optimal guess: {density_around_optimal} guesses within ±5")
```
Running this simulation multiple times revealed something fascinating: given the constraints we knew (maximum of 60 inches), the reference point (48 inches), and the distribution of other guests’ guesses, the optimal guess should indeed have been closer to 45 inches. The code confirmed what human intuition had already discovered - sometimes the simplest approach, directly modifying a known reference point, outperforms more complex strategies.
What makes this particularly interesting is how the collective behavior of the guessers reflected core principles of inductive inference: preferring simple numbers, using easily computed modifications of reference points, and gravitating toward measurements with low algorithmic complexity.
As I studied the output of my simulation, I realized that my attempt to be clever by finding gaps in the distribution had actually led me away from the most probable range. The code showed that the density of guesses around 45 inches wasn’t just random clustering - it represented a collective wisdom that I had unfortunately tried to outsmart.
The architectural design of Deepseek Janus https://github.com/deepseek-ai/Janus reflects both engineering pragmatism and cognitive science inspiration. From an engineering perspective, the dual-pathway design with a shared transformer backbone elegantly solves the tension between specialized processing needs and unified reasoning. The separate visual encoders optimize for their specific tasks - semantic understanding versus detailed reconstruction - while the shared transformer enables efficient parameter usage and cross-task learning. This architecture also aligns with Minsky’s Society of Mind theory, where intelligence emerges from the coordination of specialized agents. The visual pathways act as dedicated sensory agents with distinct expertise, while the transformer serves as a higher-level cognitive space where different representations can interact and integrate, similar to how human association cortices coordinate between sensory and linguistic processing. This parallel suggests that effective multimodal AI architectures might benefit from embracing both specialized processing and unified reasoning, mirroring the brain’s strategy of maintaining dedicated systems while enabling high-level integration.

In DeepSeek Math and R1 papers, GRPO (Group Relative Policy Optimization) introduces a fundamental redesign of advantage computation in policy optimization. While advantage traditionally measures how much better an action is compared to a baseline, the way to compute this advantage marks a key difference between GRPO and traditional PPO (Proximal Policy Optimization). Traditional PPO relies on a learned value network and temporal difference learning to estimate advantages, requiring additional memory and computation to maintain a separate critic network. In contrast, GRPO takes a more direct approach by sampling multiple solutions for the same problem and computing advantages through group statistics. This group-based normalization naturally captures the relative performance of different solutions. The impact of this design is particularly significant for mathematical reasoning tasks. By eliminating the value network, GRPO reduces memory usage by approximately half. More importantly, the group-based comparison aligns well with how mathematical solutions should be evaluated - relative to other approaches to the same problem. This makes GRPO especially effective for training models to develop better reasoning strategies, as demonstrated in both DeepSeek Math and R1’s strong performance on mathematical reasoning benchmarks, while maintaining computational efficiency and training stability.
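The group-based advantage computation can be sketched in a few lines. This is illustrative only: the full GRPO objective also includes the clipped probability ratio and a KL term, but the baseline-free advantage is the part that replaces the value network:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled solution's reward
    by the group mean and std, replacing a learned value-network baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four solutions sampled for the same math problem, scored 1/0 for correctness
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv.round(3))  # correct samples get positive advantage, incorrect negative
```

Because the baseline is the group's own mean, a solution is rewarded only for being better than its siblings on the same problem, which is exactly the "relative to other approaches" evaluation described above.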

DPO (Direct Preference Optimization) simplifies RLHF by transforming preference learning into a binary classification problem. Instead of using a separate reward model and complex RL optimization like PPO, DPO directly optimizes the policy to match human preferences.
The workflow involves:
- collecting preference pairs of chosen and rejected responses;
- scoring each response's log-probability under both the policy and a frozen reference model;
- minimizing a binary cross-entropy loss that pushes the policy to rank chosen responses above rejected ones, with the reward implicitly defined by the log-probability ratio against the reference.
This approach achieves comparable results to PPO-based RLHF with significantly reduced complexity and computational cost.
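The core of DPO fits in a few lines. This sketch covers a single preference pair and assumes we already have the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference; real implementations batch this over a dataset:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective for one preference pair, given summed log-probs of the
    chosen/rejected responses under the policy and the frozen reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# If the policy already ranks the chosen response higher (relative to the
# reference) the margin is positive and the loss drops below log(2)
print(dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
               ref_chosen=-11.0, ref_rejected=-13.0))
```

The binary-classification nature is visible directly: the loss is just logistic regression on the log-probability margin, so no reward model and no RL rollout loop are needed.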

Tree attention in code LLMs enhances structural understanding by incorporating Abstract Syntax Tree (AST) relationships into the attention mechanism. The model learns syntax-aware attention patterns through multi-head attention where different heads specialize in parent-child relationships, variable scoping, and data flow. During training, attention scores are computed as $A = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + M\right)V$ , where $M$ combines syntax, scope, and data flow masks. These masks guide attention to respect code structure: syntax masks encode AST hierarchy, scope masks enforce variable visibility rules, and data flow masks track variable dependencies. This enables the model to maintain structural coherence even when processing linear code input. This approach is necessary because code fundamentally differs from natural language in its strict hierarchical structure and precise execution semantics. While natural language models can tolerate some structural ambiguity, code requires exact understanding of scope boundaries, variable dependencies, and control flow. Without tree attention, models struggle with long-range dependencies (like tracking variable definitions across functions), nested structures (maintaining proper code blocks), and scope rules (knowing which variables are accessible where). This affects their ability to generate syntactically valid and executable code. Tree attention solves these issues by explicitly modeling the AST structure through attention masks, enabling the model to reason about code in a way that matches how compilers and developers understand it.
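The masked-attention formula above can be sketched with a toy, hand-written mask. In real systems $M$ is derived from the parser's AST, scope table, and data-flow analysis, not written by hand; everything below the function is a hypothetical three-token example:

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Single-head attention with an additive structural mask M
    (0 = allowed, -inf = blocked): softmax(QK^T/sqrt(d) + M) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)                                    # exp(-inf) = 0
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# Hypothetical 3-token snippet: token 2 may attend to token 0 (its AST
# parent) and itself, but token 1 is outside its scope
M = np.zeros((3, 3))
M[2, 1] = -np.inf
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = masked_attention(Q, K, V, M)
print(out.shape)  # (3, 4); token 2's output ignores token 1 entirely
```

Because a blocked position receives exactly zero attention weight, scope violations cannot leak into the representation, which is the mechanism behind the structural guarantees described above.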

In Deepseek-v3 technical report, the team introduces Multi-head Latent Attention (MLA). MLA leverages two key insights about attention mechanisms: (1) attention matrices exhibit low-rank properties since token relationships often focus on limited patterns (local context, semantic anchors), and (2) information bottleneck can help preserve essential patterns while discarding redundant ones. The process flows as:
```
Input h_t → [Joint KV Compression via W_DKV] → Latent c_KV
          → [Up-project] → Keys k_C and Values v_C (content)
          + [Separate RoPE branch] → Positional k_R
```
The joint compression ($h_t$ → $c_{KV}$) preserves crucial correlations between keys and values that would be lost in independent compression. Meanwhile, separating positional information ($k_R$) exploits the simpler structure of positional relationships. The compressed latent space (d_c ≈ d/14) creates an information bottleneck that forces the network to preserve only the most informative attention patterns during optimization, effectively acting as implicit regularization.
This design reduces memory from $\mathcal{O}(Nd_h n_h)$ to $\mathcal{O}(Nd_c + Nd^R_h)$ while maintaining model quality, as the compression preserves the dominant singular values of the attention matrix that carry the most important relationship information.

ReFT (Reasoning with Reinforced Fine-Tuning) https://arxiv.org/abs/2401.08967 addresses a fundamental limitation in LLM reasoning by extending beyond traditional supervised fine-tuning’s single-path learning. The method employs a two-stage process: an initial supervised warm-up followed by a PPO-based reinforcement learning phase that explores multiple valid reasoning paths, with a KL divergence constraint that prevents catastrophic forgetting of pre-trained knowledge while enabling controlled exploration. During the RL phase, the model samples various Chain-of-Thought (CoT) approaches. For example, when solving a math problem about hourly wages, it might explore different strategies like time conversion (50 min to 5/6 hour), per-minute rate calculation ($12/60 × 50), or direct proportion ((50/60) × $12). It receives rewards based on answer correctness (1 for correct, 0.1 for extractable but incorrect, 0 for invalid), while a KL divergence term (β = 0.01 for P-CoT, 0.05 for N-CoT) maintains stability by preventing excessive deviation from the warm-up policy. What’s particularly remarkable is ReFT’s effectiveness with limited training data: it requires only hundreds of examples to achieve significant improvements. This efficiency stems from its ability to generate multiple learning signals from each example through active exploration of the reasoning space, creating a self-augmenting training process where each example seeds the discovery of various solution strategies while staying aligned with pre-trained knowledge via the KL constraint.
The method’s success stems from its ability to learn from both successful and unsuccessful reasoning attempts, combined with a natural reward structure that eliminates the need for a separate reward model. When integrated with inference-time techniques like majority voting and reward model reranking, ReFT demonstrates even more impressive results.

Flash Attention’s incremental computation is a mathematically elegant solution to the memory bottleneck in attention mechanisms. The key insight is treating attention computation as a streaming algorithm with running statistics. Instead of materializing the full N×N attention matrix, it maintains three running statistics: maximum values \(m_i\) for numerical stability, softmax denominators \(l_i\), and partial output sums \(O_i\). When processing each new block, these statistics are updated using a clever rescaling factor \(\exp(m_{i-1} - m_i)\) that ensures mathematical equivalence to standard attention while preventing numerical overflow. This rescaling is crucial because it allows us to update our running computations when we discover new maximum values in later blocks - effectively “correcting” our previous partial results without needing to store or recompute them. The computation is structured as a tiled algorithm where blocks of queries interact with blocks of keys and values, with all intermediate results fitting in fast SRAM. This approach reduces memory complexity from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(N)\) and significantly improves hardware utilization by maximizing the use of fast memory (SRAM) over slow memory (HBM), resulting in both better memory efficiency and faster computation. The mathematical guarantee of equivalence to standard attention, combined with these performance benefits, makes it particularly valuable for training and deploying large language models where attention computations are a major bottleneck.
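The running-statistics update can be sketched for a single query row. This is a NumPy illustration of the online-softmax idea, not the actual tiled SRAM kernel (which also blocks over queries):

```python
import numpy as np

def streaming_softmax_attention(q, K, V, block=2):
    """One query row of Flash-Attention-style online softmax: keep a running
    max m, denominator l, and partial output o; rescale old partials by
    exp(m_old - m_new) whenever a new block raises the max."""
    m, l = -np.inf, 0.0
    o = np.zeros(V.shape[-1])
    for s in range(0, len(K), block):
        scores = K[s:s + block] @ q              # logits for this key block
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)                # correct previous partials
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[s:s + block]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
w = np.exp(K @ q - (K @ q).max())
print(np.allclose(streaming_softmax_attention(q, K, V), (w / w.sum()) @ V))  # True
```

The final `allclose` check is the mathematical-equivalence guarantee in miniature: block-streaming with rescaling produces exactly the same output as materializing the full score vector.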

How can ReAct agents be effective at reasoning and acting? What is behind “Thought, Action, Observation”?
ReAct is one of the important LLM agent techniques https://lnkd.in/gU4jB6FA . Its effectiveness comes from its three major steps (Reasoning, Acting, and Observation) being tightly coupled through cross-attention mechanisms. The Reasoning step generates abstract thought representations in the transformer’s embedding space, where self-attention helps form coherent reasoning chains. These thought embeddings then flow into cross-attention layers that map them to concrete action embeddings, effectively translating abstract reasoning into executable actions. The Action step’s outputs generate observations, which are processed through another set of cross-attention layers that integrate these results back into the model’s understanding.
The key to ReAct’s effectiveness lies in how cross-attention serves as neural bridges between these steps: it creates learnable mappings between abstract thought space and concrete action space (Thought→Action), between actions and their outcomes (Action→Observation), and between observations and updated reasoning (Observation→Thought). This creates a continuous feedback loop where each step informs the next through focused attention weights, allowing the model to learn from experience and adapt its strategies. The cross-attention mechanisms also enable the model to maintain relevant context throughout the entire process, as attention weights highlight important information from previous steps while suppressing irrelevant details. This architecture naturally implements a form of working memory and metacognition, where the model can reflect on its own reasoning and actions through the attention patterns, leading to more effective problem-solving strategies. It is one of the effective ways to extend the LLM runtime for more “smartness”.

How did GPT guarantee a JSON output in its JSON mode? How was it implemented in other solutions like .txt and XGrammar?
One of the key techniques was called constrained decoding. It bridges neural language models with formal grammar constraints by modifying the model’s output distribution during generation. At each autoregressive step, instead of directly sampling from the LLM’s logits across its vocabulary (e.g., 128k tokens for Llama 3), the approach applies a mask derived from a context-free grammar (CFG) to ensure structural validity. Technically, this is implemented by setting logits of invalid tokens to -∞ before the softmax operation, effectively zeroing their sampling probabilities while preserving the relative probabilities among valid tokens. The grammar state is tracked using a pushdown automaton (PDA) that maintains a stack for nested structures. Modern implementations like XGrammar https://lnkd.in/gVyHKhp3 optimize this process by classifying tokens into context-independent ones (validity determined by current state only) and context-dependent ones (requiring full stack context), enabling efficient preprocessing and caching.
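The masking step itself is simple to illustrate. Below is a toy sketch: real systems derive the valid-token set from a CFG tracked by a pushdown automaton, not from a hand-written list, and the five-token vocabulary is hypothetical:

```python
import numpy as np

def constrained_sample(logits, valid_token_ids):
    """Set logits of grammar-invalid tokens to -inf before softmax, so their
    sampling probability is exactly zero while valid tokens keep their
    relative probabilities."""
    masked = np.full_like(logits, -np.inf)
    masked[valid_token_ids] = logits[valid_token_ids]
    probs = np.exp(masked - masked[valid_token_ids].max())  # exp(-inf) = 0
    probs = probs / probs.sum()
    return np.random.choice(len(logits), p=probs)

# Hypothetical toy vocabulary; after an opening '{' a JSON grammar only
# allows '"' (start of a key) or '}' (empty object)
vocab = ['{', '}', '"', ':', 'x']
logits = np.array([2.0, 0.5, 1.0, 3.0, 2.5])   # the raw model prefers ':'
token = constrained_sample(logits, valid_token_ids=[1, 2])
print(vocab[token])  # always '}' or '"', never the higher-logit ':'
```

Note that the model's preference among the valid tokens is untouched; only the invalid mass is redistributed, which is why constrained decoding stays faithful to the model while guaranteeing well-formed output.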
Of course, constrained decoding is not the only way to implement JSON mode, and JSON is only one instance of structured generation, a broader research field covering other output structures. Structured generation is a cornerstone of agent frameworks, since it lets agents communicate and understand each other in a machine-readable way.

Why can LLMs get long context using Rotary Positional Embedding (RoPE)? But why does “lost-in-the-middle” come with it?
Rotary Positional Embedding (RoPE) is a simple and great idea: attention is about dot products of vectors, so why not just use polar coordinates in multiple dimensions? In RoPE, the attention computation depends only on relative positions, i.e. a sum of cosines over dimension pairs, so any context can rotate and stack up while attention is preserved. But a problem comes with it: the cosine terms oscillate heavily as |m-n| grows, and without a good reference point the relative position simply gets lost. The higher the embedding dimension, the worse the attention decay.
Let’s think of RoPE like a spiral staircase in a tall tower: each dimension pair rotates at its own fixed speed as you climb, but the fundamental structure (relative positions) stays consistent. This allows you to keep track of where you are relative to other positions, even in a very tall tower (long context). The “lost-in-the-middle” problem is like trying to remember specific floors in the middle of the tower: you easily remember the ground floor (start) and top floor (end), but floors in the middle blur together because they lack distinctive reference points and each middle floor looks similar to its neighbors.
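A minimal NumPy sketch of the rotation makes the relative-position property concrete: the dot product of two rotated vectors depends only on the offset m - n, not on the absolute positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary embedding: rotate each (even, odd) dimension pair of x
    by an angle proportional to the token position, with a different
    rotation frequency per pair."""
    d = len(x)
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Attention score depends only on the relative offset m - n
q = np.array([1.0, 0.0, 0.5, 0.2])
k = np.array([0.3, 0.7, 0.1, 0.9])
s1 = rope(q, pos=5) @ rope(k, pos=3)        # offset 2, near the start
s2 = rope(q, pos=105) @ rope(k, pos=103)    # same offset 2, much later
print(np.isclose(s1, s2))  # True
```

Rotations also preserve vector norms, so RoPE adds position information without distorting token magnitudes.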

What is “speculative decoding”? Why can it speed up LLM generation?
In my last post on LLM inference time https://lnkd.in/gu78UWtH I mentioned a few alternatives to sequential next-token generation, and speculative decoding is one of them. It accelerates LLM inference by using a small, fast “draft” model to predict multiple tokens ahead, e.g. “mat” “and” “sleep” for “The cat sits on the ___”, while the main model verifies these predictions in parallel through a single forward pass, accepting correct predictions and falling back only when necessary. Essentially it trades some extra compute from a lightweight model for fewer expensive forward passes of the large model.
This process resembles a modern CPU’s branch predictor: when the CPU sees an “if” statement, it guesses which way the branch will go before knowing the result, so the instruction flow can move ahead without waiting. Speculative decoding shortens the total execution time by replacing N sequential forward passes with one round of drafting plus a single verification pass.
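A greedy toy sketch of the draft-and-verify loop. Real implementations verify all k draft tokens in one batched forward pass and use rejection sampling for non-greedy decoding; the integer "models" below are made-up stand-ins:

```python
def speculative_decode(draft_next, target_next, prompt, k=3, max_new=9):
    """Greedy draft-and-verify: the draft proposes k tokens ahead; the
    target keeps the longest agreeing prefix and fixes the first mismatch."""
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        proposal = list(seq)
        for _ in range(k):                       # k cheap draft steps
            proposal.append(draft_next(proposal))
        accepted = list(seq)
        for tok in proposal[len(seq):]:          # one (conceptual) target pass
            if target_next(accepted) == tok:
                accepted.append(tok)             # draft guessed right: free token
            else:
                accepted.append(target_next(accepted))  # mismatch: fall back
                break
        seq = accepted
    return seq[:len(prompt) + max_new]

# Hypothetical toy models over integer tokens: the target counts mod 10,
# the draft agrees except right after a 4
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 0 if s[-1] == 4 else (s[-1] + 1) % 10
print(speculative_decode(draft, target, [0]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The output is token-for-token identical to what the target alone would produce; the draft only changes how many expensive target steps were needed to get there.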

From the first input token to the last output token, what exactly happens inside the LLM, and why does it take so long?
The total inference time breaks down as follows:
Total time = Position embeddings + Number of layers × (Self-attention computation + Feed-forward network computation + Layer norm operations) + Final layer norm + Output projection
where self-attention and the FFN take most of the computing time, and we have to run them 32 times if an LLM like Llama 3 8B has 32 layers. That also explains why LLMs have significantly different input and output speeds: the input sequence is fed in and goes through all 32 layers once (warming up the KV cache), while each output token goes through the generation loop one at a time: through all 32 layers, appended back to the sequence (since generation is autoregressive), and on to the next token. There is research on advanced token generation schemes beyond one-by-one output.
We can also understand the speedup from quantization: attention and the FFN take most of the computing time, and total time is mostly proportional to the number of generated tokens. If we use FP16 instead of FP32, attention and FFN computation time is roughly halved, so total computing time drops by about 40% (layer norm time doesn’t change much with precision). With INT8 we can cut roughly another 30%, but with an increased risk of precision loss.
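The breakdown above can be turned into a toy latency model to sanity-check the ~40% figure. The per-component times below are made-up illustrative numbers, not measurements:

```python
def total_time(n_layers, t_attn, t_ffn, t_norm, n_tokens):
    """Toy latency model following the breakdown above (arbitrary time
    units): every generated token passes through all layers."""
    return n_tokens * n_layers * (t_attn + t_ffn + t_norm)

# Hypothetical per-layer costs: attention 1.0, FFN 1.0, layer norm 0.2
fp32 = total_time(n_layers=32, t_attn=1.0, t_ffn=1.0, t_norm=0.2, n_tokens=100)
fp16 = total_time(n_layers=32, t_attn=0.5, t_ffn=0.5, t_norm=0.2, n_tokens=100)
print(1 - fp16 / fp32)  # about 0.45: the matmuls halve, the layer norms do not
```

The saving is less than 50% precisely because the precision-insensitive parts (layer norms here) don't shrink, which is the point made in the paragraph above.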

Why does rank matter in LoRA fine-tuning? Why does more knowledge adoption come with more risk of overfitting in my LLM?
We love LoRA for its efficiency and low memory cost. A LoRA fine-tune decomposes the update of the weight matrix: lower rank gives thinner matrices A and B. For example, when LoRA tunes attention layers, a low rank modifies only a few attention patterns at once, so it is less likely to break existing patterns or disrupt critical cross-attention mechanisms. We usually follow these rules of thumb:
- Knowledge injection: lower ranks (4-8) are often sufficient
- Domain adaptation: medium ranks (16-32) are usually better
- Complex reasoning changes: may need higher ranks (64+)
To understand the effect, consider that each row of the matrix is an update to one dimension, and the ratio of the matrix’s nuclear norm to its Frobenius norm measures how widely the information can spread across dimensions of the singular value spectrum. The upper limit of this spread is the rank. This explains much of the fine-tuning effect: a low rank spreads new knowledge over the first few dimensions, while a high rank can update more dimensions, giving new knowledge deeper reach but more risk of overfitting. Of course, the rank is only the upper limit of information spread; it doesn’t promise the new information will reach that far.
You might wonder why it is “less than or equal” instead of “equal”. It is because of the Cauchy–Schwarz inequality for vectors https://lnkd.in/gKmjMKK6 which can also describe proper time in relativity, a.k.a. “move faster and your clock runs slower”. There is always physics!
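A quick numerical check of the spread bound. `effective_spread` is our own name for the (nuclear/Frobenius)² ratio discussed above; it is an illustrative measurement, not a LoRA library function:

```python
import numpy as np

def effective_spread(delta_w):
    """(nuclear norm / Frobenius norm)^2 of the weight update: an effective
    number of dimensions the update spreads over, at most rank(delta_w)
    by the Cauchy-Schwarz inequality on the singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return (s.sum() / np.linalg.norm(s)) ** 2

rng = np.random.default_rng(0)
rank = 4
delta_w = rng.normal(size=(64, rank)) @ rng.normal(size=(rank, 64))  # LoRA-style B @ A
spread = effective_spread(delta_w)
print(1.0 <= spread <= rank)  # True: the spread can never exceed the LoRA rank
```

Raising `rank` raises the ceiling on how many dimensions the update can reach, which is exactly the reach-versus-overfitting trade-off described above.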

Why can LLMs do chain-of-thought? What exactly happens when an LLM receives a “think step by step” instruction?
CoT effectively uses attention as working memory: each reasoning step gives the network a computing cycle to evolve its hidden states. When each new hidden state from a later reasoning step can query and update from the earlier context, step-by-step reasoning emerges. The key is memory of the previous states!
This helps us understand why CoT sometimes works well and sometimes doesn’t: if a problem only needs its previous state plus a piece of memory, CoT works well; otherwise we need more complex reasoning models like OpenAI o1, since humans can keep a long memory with branches and trial-and-error. Don’t forget humans can also hold P and ~P in mind!
It also gives a good hint for memory package design if we want to extend this memory mechanism with longer or external memory.

What is the magic behind LLMs’ tool use, like Apple Intelligence? a.k.a. how come some language models can understand and call an API and some cannot?
Tool use, like “function calling” in OpenAI https://lnkd.in/gjMwbmaM , opens a door to driving intelligent agents from an LLM. Beyond just letting the LLM generate a JSON blob to call an API, the model is trained and tuned to align tasks with tool capabilities using attention over the context and the tools. When we use such functionality, the LLM simply maximizes the conditional probability of the current context given a tool by comparing the context with the tool description. That is the root reason why one must describe a tool concisely and accurately in any tool-calling interface. It also explains why tool calling doesn’t need large models: it only needs attention alignment with tools, so small on-device LLMs like Phi-3 or Llama 3.2 1B can do tool calling well if instruction-tuned well. Yes, it is part of the Apple Intelligence LLM’s secret recipe.

What exactly happens when an LLM receives a prompt, and why can “prompting” magically work?
Most “prompting” work today is about discrete prompts, e.g. a sentence of command. Prompting introduces a task-specific bias into the model’s output distribution via its activation patterns, effectively aligning the target task with the LLM’s pre-trained task manifold. With this short definition, we can easily understand that prompts don’t change the LLM; instead, they activate certain parts of its neural network by breaking down the target task and aligning it with similar tasks the LLM was trained on. That is also why an LLM can’t really “reason” but can simulate the reasoning process when parts of that process were trained in familiar ways. Smaller tasks with agents usually work better than one long, complex prompt because the LLM can align small, simple tasks more easily, so we either define our task process ourselves or let another agent break down the complex task.
In short, prompting is about adding bias to the model and alignment to the task.

What exactly did Geoffrey Hinton bring to neural networks and AI? Statistical meaning for neural networks!
Hinton and other researchers bridged the gap between statistical physics and neural networks by interpreting neural network activations as probabilities instead of plain numbers, so that the optimization and generalization of neural networks make sense through the Boltzmann distribution. Such energy-based models are the reason we do gradient descent on log(P) and why a “temperature” parameter controls your LLM’s creativity. Read further on Wikipedia https://lnkd.in/geDtyTFK and congratulations to John Hopfield and Geoffrey Hinton on winning the Nobel Prize in Physics!

What does “top P” in an LLM mean?
A quick follow-up to the last post on high/low temperature in LLMs (link https://lnkd.in/gcMDpSj4 ): why can “top_p” also control the next-token choice?
Top-p is a threshold on cumulative probability mass: tokens are sorted by probability and kept until their cumulative mass reaches p. For a given distribution, a higher top-p value admits more long-tail tokens. It is more flexible than a simple top-k cutoff, because it adapts to different contexts and different shapes of the token probability distribution. For most cases, the combination of temperature and top_p is enough to control an LLM’s behavior.
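The idea can be sketched in a few lines of plain Python; the token probabilities below are made up for illustration:

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # threshold reached: the remaining long tail is cut off
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Made-up next-token distribution for illustration.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_p_filter(probs, top_p=0.9))  # "zebra" falls outside the 0.9 mass
```

With a flatter distribution, the same p keeps more tokens; with a peaked one, it keeps fewer. That adaptivity is what a fixed top-k lacks.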

What does “temperature” in an LLM mean?
Some friends recently asked me: why does high temperature give more creative results but a much higher risk of hallucination? Why does low temperature lead to dull results? How should we understand and use this magic parameter?
One single idea explains it: it is the “T” in softmax with temperature, from Hinton et al., “Distilling the Knowledge in a Neural Network” https://lnkd.in/ghCdXgWx With top-k/top-p token selection during next-token prediction, a higher temperature gives a flatter probability distribution, so long-tail tokens have a better chance of being chosen, hence more creativity. That is the root cause of these high/low-temperature behaviors.
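The mechanism fits in a few lines; the logits below are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(z_i / T): higher T flattens the distribution so long-tail
    tokens gain probability; lower T sharpens it toward the top token."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, temperature=0.5))  # sharp: top token dominates
print(softmax_with_temperature(logits, temperature=2.0))  # flat: tail tokens gain mass
```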

Suppose you bought an RTX 4090 to play “Black Myth: Wukong” and also want to use it to fine-tune an LLM. Can your gaming GPU handle the task?
Let’s break it down: 🧠 Model: 2B parameters, FP16 🎮 RTX 4090: 24GB VRAM
Memory cost:
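The original breakdown is not reproduced here, but a hedged back-of-envelope sketch gives the idea. It assumes FP16 weights, FP16 gradients, and FP16 Adam moments (2 + 2 + 4 bytes per parameter) and ignores activations and framework overhead; standard mixed-precision training with FP32 master copies would need roughly twice as much.

```python
def finetune_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optim=4):
    """Rough full fine-tuning footprint in GB. Defaults assume FP16
    weights/gradients and FP16 Adam moments -- an assumption; activations
    and framework overhead come on top of this."""
    return n_params * (bytes_weights + bytes_grads + bytes_optim) / 1e9

print(finetune_memory_gb(2e9))  # 16.0 GB for a 2B model, inside 24GB VRAM
```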
Good news! Your RTX 4090 can handle this with a little room to spare. You could even bump up the model size or batch size for better performance.
Remember, actual usage may vary based on specific architectures and frameworks. But this gives you a solid starting point for understanding LLM fine-tuning memory requirements.
Of course, there are also Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA, each with its own drawbacks and limits. Let’s talk about them next time.

For most cases, one only needs “tool use” and “orchestration”, so why so complex? Check out the code here.

Our framework consists of just three main components: a driver, tools, and orchestration functions. This simplicity, inspired by functional programming, offers several advantages:
The core of our framework lies in its atomic tools and orchestration functions. Let’s explore how these components work together to create a flexible and powerful financial-analysis agent. In this approach, humans define a few orchestration patterns and how each pattern calls tools, and the LLM maps each question to one or more patterns to solve the problem. Here is a sector-analysis example where the user asks a complex question, “Considering the current economic climate, analyze the banking sector trends for the next 2 years and provide a comparative strategic investment analysis for JPMorgan Chase (JPM) and Bank of America (BAC).”, and the agent understands it, maps it to a sector-analysis orchestration flow, picks the right tools, and summarizes the results:

Atomic tools are the fundamental operations our agent can perform. In our financial agent, these include functions like get_stock_price, get_company_financials, and get_income_statement. Here’s an example of how an atomic tool might be implemented:
import requests  # FinancialData and API_KEY are defined elsewhere in the project

def get_stock_price(symbol: str) -> FinancialData:
    # Fetch the latest quote and wrap the first result in our data model.
    url = f"https://financialmodelingprep.com/api/v3/quote-order/{symbol}?apikey={API_KEY}"
    response = requests.get(url)
    response.raise_for_status()  # surface HTTP errors instead of failing later
    data = response.json()[0]
    return FinancialData(**data)
This function makes an API call to retrieve stock price data and returns it in a structured format. The simplicity of these atomic tools makes them easy to test, maintain, and extend.
While atomic tools are powerful, they often need to be combined in complex ways to perform meaningful analyses. This is where orchestration functions come in. Orchestration allows us to dynamically connect tools using chain-of-thought (CoT) reasoning, enabling more sophisticated analyses.
Let’s look at two orchestration functions to illustrate the range of complexity possible within this framework:
class SectorAnalysis(OrchestrationFunction):
    def gather_data(self, sector: str, top_n: int = 5) -> Dict[str, Any]:
        companies = get_top_companies(sector, top_n)
        sector_data = []
        for company in companies:
            financials = self.use_atomic_function('get_company_financials', company['symbol'])
            income = self.use_atomic_function('get_income_statement', company['symbol'])
            stock_price = self.use_atomic_function('get_stock_price', company['symbol'])
            sector_data.append({
                "symbol": company['symbol'],
                "name": financials.companyName,
                "market_cap": financials.marketCap,
                "revenue": income.revenue,
                "net_income": income.net_income,
                "pe_ratio": stock_price.PE
            })
        return {"sector": sector, "top_n": top_n, "companies": sector_data}
This function performs a straightforward analysis of top companies in a given sector. It uses atomic functions in a predetermined sequence to gather and structure data.
class CompanyComparativeAnalysis(OrchestrationFunction):
    def gather_data(self, symbol1: str, symbol2: str, time_horizon: str) -> Dict[str, Any]:
        company1_data = self._gather_company_data(symbol1)
        company2_data = self._gather_company_data(symbol2)
        return {
            "company1": company1_data,
            "company2": company2_data,
            "time_horizon": time_horizon
        }

    def _gather_company_data(self, symbol: str) -> Dict[str, Any]:
        financials = self.use_atomic_function('get_company_financials', symbol)
        income = self.use_atomic_function('get_income_statement', symbol)
        stock_price = self.use_atomic_function('get_stock_price', symbol)
        historical_data = self.use_atomic_function('get_historical_price_data', symbol)
        return {
            "symbol": symbol,
            "financials": financials,
            "income": income,
            "stock_price": stock_price,
            "historical_data": historical_data
        }

    def prepare_prompt(self, data: Dict[str, Any]) -> str:
        return f"""
Perform a comparative analysis of {data['company1']['symbol']} and {data['company2']['symbol']} over a {data['time_horizon']} time horizon.
Include a competitive analysis and assessment of investment potential for both companies.
Company 1 ({data['company1']['symbol']}) Data:
{json.dumps(data['company1'], indent=2)}
Company 2 ({data['company2']['symbol']}) Data:
{json.dumps(data['company2'], indent=2)}
Provide a comprehensive analysis covering:
1. Competitive position of both companies
2. Financial performance comparison
3. Growth prospects over the {data['time_horizon']} time horizon
4. Potential risks and opportunities
5. Overall investment potential comparison
"""
This more complex function demonstrates how orchestration can adapt to different scenarios and gather a wider range of data. It shows how orchestration functions can implement more sophisticated logic to determine which tools to use and how to combine their outputs.
To truly appreciate the flexibility and power of our orchestration approach, let’s examine how a complex query triggers the appropriate orchestration function:
Query: “Compare the investment potential of Microsoft (MSFT) and Google (GOOGL) over the next 3 years, including a competitive analysis of both companies.”
This query would activate the CompanyComparativeAnalysis orchestration function:
# CompanyComparativeAnalysis execution
company1_data = self._gather_company_data('MSFT')
company2_data = self._gather_company_data('GOOGL')
# For each company, the following atomic functions are called:
# - get_company_financials
# - get_income_statement
# - get_stock_price
# - get_historical_price_data
# The gathered data is then used to prepare a comprehensive prompt for the language model
This example showcases how our framework can handle complex queries by combining multiple atomic tools within a single, sophisticated orchestration function. It performs a comparative analysis, including competitive positioning and investment potential assessment for both companies over the specified time horizon.
The heart of our framework’s flexibility lies in the FunctionCallingAgent class. This class determines which orchestration function to call based on the user’s query. Here’s a simplified version of its chat method:
def chat(self, query: str) -> str:
    self.memory.append({"role": "user", "content": query})
    response = self.llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=self.memory,
        functions=[tool.model_dump() for tool in self.tools],
        function_call="auto"
    )
    if response.choices[0].message.function_call:
        function_call = response.choices[0].message.function_call
        function_name = function_call.name
        function_args = json.loads(function_call.arguments)
        result = self.orchestration_functions[function_name].execute(**function_args)
        self.memory.append({"role": "function", "name": function_name, "content": str(result)})
    final_response = self.llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=self.memory
    )
    self.memory.append({"role": "assistant", "content": final_response.choices[0].message.content})
    return final_response.choices[0].message.content
This design allows the agent to dynamically select the most appropriate orchestration function based on the query’s complexity and requirements.
The agentic framework presented here is not just a collection of tools, but a design pattern for approaching complex problems. By separating atomic tools from orchestration functions and employing a flexible function-calling agent, we create a system that can easily adapt to new scenarios or be extended with new capabilities.
This approach also positions us well for future developments in AI. As more advanced chain-of-thought models become available, we can easily adapt our framework. We could use smaller, more efficient models for atomic tool use, reserving the more powerful CoT models for complex orchestration tasks.
In conclusion, while there’s certainly a place for comprehensive agent frameworks, there’s also value in understanding how to build lightweight, customizable agents from the ground up. This approach gives developers more control, better understanding of their agents’ behavior, and the flexibility to adapt to new developments in AI technology.
The complete code for this financial agent example, along with additional documentation, can be found at GitHub link. We encourage you to explore, adapt, and build upon this framework for your own projects.
Recently, a friend of mine, Jason, developed an LLM-driven agentic system to analyze cloud logs for security intrusions. Each run cost about $5, and the results were disappointing. The AI failed to detect a known intrusion, leaving Jason disillusioned and questioning the hype around AI. His frustration boiled down to one question: why did his system underperform despite his best efforts, while the agentic systems showcased on YouTube and arXiv seemed to work flawlessly?

The primary value of agentic systems lies in their ability to make autonomous decisions. These systems can manage dynamic situations without an exhaustive set of predefined rules from humans. This flexibility is the core value proposition: it enables high-level tasks to be broken down into actionable steps, chained together, or even distributed across multiple agents coordinating toward a shared objective. However, when a system has only static logic, like most RAG systems, or is overly constrained by human-imposed rules, it risks losing this adaptability, which is the essence of what makes agentic systems valuable.
Agentic systems do have real value, so how do we get them to work effectively?
At its core, an intelligent agent is straightforward: it can perceive its environment or “world”, act within it, and repeat. Forget the complex frameworks like AutoGen or CrewAI—an agent doesn’t need to be sophisticated to be effective.
Consider a robotic soccer player. The world it operates in consists of the soccer field, the ball, the goals, and the field lines. It perceives this world through sensors that mimic sight and hearing and acts by moving, kicking the ball, or communicating with teammates.
Similarly, a Retrieval-Augmented Generation (RAG) system functions as an intelligent agent within the domain of databases and user queries. It perceives user inputs and database results and acts by generating a relevant response.
The takeaway is that an agent doesn’t need to be driven by an LLM, nor does it have to be inherently intelligent. The key is in defining the agent’s world and ensuring it has the channels to perceive and act within that world to achieve its goal.
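That perceive/act loop needs very little code. Below is a minimal sketch with a toy environment; `CounterWorld` and the hand-written policy are hypothetical stand-ins for a real world and a real decision procedure (which could be an LLM, but doesn’t have to be):

```python
class Agent:
    """A minimal agent: perceive the world, decide, act, repeat."""
    def __init__(self, policy):
        self.policy = policy  # any callable mapping observation -> action

    def run(self, world, steps=3):
        for _ in range(steps):
            observation = world.perceive()     # sense the environment
            action = self.policy(observation)  # decide
            world.act(action)                  # change the environment

class CounterWorld:
    """Toy environment: a number the agent tries to drive to zero."""
    def __init__(self, value):
        self.value = value
    def perceive(self):
        return self.value
    def act(self, delta):
        self.value += delta

world = CounterWorld(3)
Agent(lambda obs: -1 if obs > 0 else 0).run(world, steps=3)
print(world.value)  # 0
```

Swap `CounterWorld` for a soccer field or a document store and the policy for an LLM call, and the same loop describes the robotic player and the RAG system above.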

The success of an agentic system relies on understanding its limitations and defining the appropriate granularity of problem-solving tasks.
LLM-driven agentic systems are currently constrained by several factors, and as of August 2024, these limitations are significant:
Jason’s system didn’t fail because the agentic-system concept is flawed; it failed because the current limitations of LLMs weren’t fully or easily accounted for in his problem scope. He needed a better problem definition. Even with these limitations, LLMs can be powerful tools if used correctly. Just as early computers were limited but still revolutionary in applications like Turing’s codebreaking machines in WWII, today’s LLMs can perform impressive tasks if we align their capabilities with the right granularity of problems.
A recent demonstration by Apple showcased an effective agentic system. Starting with a simple message—“going to a party at 6 PM”—the system planned a route considering traffic conditions, sent a meeting rescheduling notice, and even added a quick stop at a flower shop. This system worked because it tackled a problem with the right level of complexity for the technology available:
Despite current limitations, I remain optimistic about agentic systems. Intelligent agents are a foundational concept in modern AI, as outlined in “Artificial Intelligence: A Modern Approach”. However, today’s LLMs are not yet a silver bullet for complex tasks. Just as with early AI technologies, we need to understand their limitations, find suitable problems for them to solve, and define those problems at the appropriate level of granularity. This approach will remain essential no matter how advanced these systems become. Hopefully all the way to AGI, am I right?
Disclaimer #2: I root for underdogs because only underdogs can democratize AI.
Those Magnificent Men in their Flying Machines; Or, How I Flew from London to Paris in 25 Hours and 11 Minutes is a 1965 British period comedy film that satirizes the early years of aviation. (Wikipedia)

The open source community has been searching for independence from OpenAI and ChatGPT, just like those flying machines of the early 1900s seeking freedom from gravity. In early March, Stanford HAI shared a successful approach, “Alpaca: A Strong, Replicable Instruction-Following Model”, and proved that instruct tuning was a promising path. The underdogs’ race began!
“LLM” as “large language model” always implied “yes, you need a large model”. Stanford’s Alpaca brought us an important message: a smaller model with limited instruct tuning can perform well on major tasks. Let’s break that down into two pieces: smaller models and major tasks.
Before Alpaca’s instruct tuning of Llama’s 7B model, people believed size was critical for GPT-equivalent performance and that we would need a 175B model to be comparable with GPT-3. Alpaca proved this was not quite true once a powerful-enough language model had good instruct-tuning data. Alpaca started with Llama’s pretrained model and leveraged a high-quality but very small tuning dataset of 52k samples, pulled from a GPT model, and built an LLM with conversational abilities, which Llama didn’t have.
Alpaca and Llama also showed that an LLM doesn’t have to perform well on all tasks: skills and knowledge in models can be independent. For example, Alpaca and Llama 7B didn’t do programming-related tasks very well because programming depends heavily on domain knowledge, but that didn’t prevent Alpaca from being good at conversation and common tasks. Instruct tuning provided a step-by-step approach to add more knowledge to Alpaca while leveraging its learned conversation ability. With an additional 20k programming-specific samples, codealpaca performs well on many programming tasks, and we can ask it to “write a function to flip a binary tree”.
On the other hand, OpenAI kept showing the engineering debt of very large models: availability issues, a limit of 25 GPT-4 queries per 3 hours for ChatGPT Plus customers, and so on. Such observations make us wonder: perhaps a smaller LLM is the right way to go?
By the way, Llama and Alpaca 7B have become the new “Doom” of the AI era. We keep seeing them run on the cheapest MacBook Air, a Raspberry Pi 4, or a Google Pixel 6 phone.
Does it run LLaMA 7B? is the new Does it run Doom? – @ylecun
Llama and Alpaca started the race, and more small-LLM underdogs joined in, bringing more data to improve Alpaca, faster tuning methods, or other network structures to replace Llama.
Alpaca needs more tuning data. Guanaco, from “Guanaco: A Multilingual Instruction-Following Language Model Based on LLaMA 7B”, introduced 530k more samples across multiple languages by rewriting the Alpaca instructions in different languages and adding new instructions for aligning languages, understanding content, etc. Language-specific models like “Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model” and Chinese-LLaMA-Alpaca provided optimizations as well. Vicuna, from “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality”, focused on improving Alpaca’s chat function.
Low-Rank Adaptation from Microsoft, called “LoRA”, helped a lot in speeding up tuning. The idea is great: it freezes the pretrained weights but “injects trainable rank decomposition matrices into each layer of the Transformer architecture”, so tuning can be 3x faster. The LoRA technique is also useful beyond language models; for example, it enables faster tuning of Stable Diffusion for text-to-image tasks. Please feel free to read further here.
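To make the “rank decomposition” concrete, here is a toy sketch in pure Python. The dimensions are tiny for illustration; in a real Transformer, W would be a large frozen attention or MLP weight and the rank r would be small (e.g. 8), so A and B hold far fewer trainable values than W:

```python
def matmul(A, B):
    """Naive matrix multiply, fine for these tiny illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * A @ B). W stays frozen; only the low-rank
    factors A (d x r) and B (r x d) are trained."""
    delta = matmul(A, B)
    W_eff = [[w + alpha * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(x, W_eff)

x = [[1, 1]]
W = [[1, 0], [0, 1]]  # frozen pretrained weight (identity here)
A = [[1], [0]]        # d x r trainable factor, with rank r = 1
B = [[0.1, 0.2]]      # r x d trainable factor
print(lora_forward(x, W, A, B))  # the rank-1 update nudges the frozen output
```

Because only A and B receive gradients, the optimizer state shrinks from d*d to 2*d*r values per adapted layer, which is where the speed and memory savings come from.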
Meanwhile, we understood that Llama was not so critical in this framework; it could be replaced. Llama from Meta didn’t allow any commercial use of the code or the weights. Lit-llama rewrote the Llama inference code for more independence, but it still had to use the Llama weights. The open source community provided us a few options, of which GLM and RWKV were the two most promising.
GLM, from “GLM: General Language Model Pretraining with Autoregressive Blank Infilling”, is a family of models at different sizes. It takes a different approach from Meta’s Llama, and its 6B model with a chat function is available as ChatGLM. Meanwhile, RWKV is unique: it doesn’t follow the stacked-decoder transformer structure of GPT; instead it uses a recurrent network like an RNN, so its context length is theoretically unlimited and its inference is much faster with lower memory cost. RWKV can reach transformer quality, and its conversation version is available as ChatRWKV.
Surely, we didn’t forget the old GPT folks. Databricks open-sourced their Dolly using a GPT-NeoX network structure and applied instruct tuning. The results were not bad!
We can compare LLM performance with the Language Model Evaluation Harness framework, and a current benchmark can be found here https://bellard.org/ts_server/ ; so far Llama performs best in the race.
Inspired by Alpaca, instruct tuning with self-instruct became popular, and fine tuning is becoming easier with frameworks. xtuning is a nice, easy-to-use framework; recently it announced INT4 tuning with Alpaca-LoRA. Tuning with knowledge from GPT-4 is also a good idea, so “Instruction Tuning with GPT-4” pushed data acquisition to the next level. The GLM team also brought in more efficient tuning methods like P-tuning-v2.
The community also supported independence from GPUs. Starting in early March, work like llama.cpp and alpaca.cpp provided engineering optimizations to run models with quantization on CPUs. We must remember “no free lunch”: quantization can lose precision and other quality. Please refer to the LLM benchmark mentioned above for more details.
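A toy sketch of symmetric INT4 quantization shows where that precision goes; the weight values and the single per-group scale are made up for illustration:

```python
def quantize_int4(x, scale):
    """Symmetric INT4: round x/scale into the integer range [-8, 7]."""
    return max(-8, min(7, round(x / scale)))

def dequantize(q, scale):
    return q * scale

weights = [0.03, -0.41, 0.27, 0.98]
scale = max(abs(w) for w in weights) / 7  # one scale per group (an assumption)
restored = [dequantize(quantize_int4(w, scale), scale) for w in weights]
print(restored)  # close to the originals, but the smallest weight collapses to 0.0
```

Each value now fits in 4 bits instead of 16, a 4x memory saving, at the cost of rounding error that real schemes mitigate with smaller groups and careful scale selection.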
Downstream tools like llama-index and LangChain support these open source GPT competitors as alternative backends. Please refer to the llama-index and LangChain documentation for details on using a custom LLM.
Alpaca brought great attention to the underdog race, but we have to admit a few drawbacks: legal issues, data bias, and weakness on coding and math questions.
Alpaca used Llama as its source model, but Llama didn’t allow commercial use, and its weights were not public unless approved via the application form.
alpaca_data.json, the 52k instruct-tuning dataset, had nice diversity, but follow-up study showed quality issues; a fix can be found here https://github.com/gururise/AlpacaDataCleaned
GPT-4 has become more powerful in math and reasoning, but Alpaca still can’t acquire enough tuning data for such tasks.
When a heavier-than-air flying machine finished a trip from London to Paris in 25 hours and 11 minutes in 1910, no one knew that we would send humans to the Moon about 50 years later. It happened only because humans worked together, exploring all possibilities. I believe Alpaca is one of the first flying machines of the AIGC era, and we will soon have open source implementations that outperform GPT models.