Constructing quantum field theories

Rhys Gould · 2025-06-20

The aim of this post is to motivate and demonstrate the guiding principles that underlie the construction of quantum field theories, and particularly the Standard Model – arguably one of the most successful scientific theories of all time.

Notation: We will denote the space of smooth maps between a manifold \(M\) and a vector space \(V\) by \(C^{\infty}(M, V)\). For a functional \(F: C^{\infty}(M, V) \to \mathbb{R}\), we will denote its functional derivative at \(f \in C^{\infty}(M, V)\) evaluated at point \(x \in M\) by \(\frac{\delta F[f]}{\delta f(x)} \in \mathbb{R}\). We will denote the space of invertible linear maps from \(V\) to \(V\) by \(\text{GL}(V)\).

Introduction. In quantum field theory, a theory is described by a functional \(S: \mathcal{C} \to \mathbb{R}\) called the action, mapping field configurations \(\Psi \in \mathcal{C}\) to a real number \(S[\Psi]\). \(\Psi\) will generally be made up of a collection of fields \(\Psi = (\Psi_1, \ldots, \Psi_N)\) relevant to our theory. With an action, we can integrate over the space of field configurations \(\mathcal{C}\) via the measure

\[\text{D}\Psi \, \mathbb{P}[\Psi] \equiv \left[\prod_{i=1}^{N} \text{D}\Psi_i\right] \mathbb{P}[\Psi], \qquad \text{with} \quad \mathbb{P}[\Psi] := \frac{1}{Z} e^{-S[\Psi]}\]

with probability density \(\mathbb{P}: \mathcal{C} \to [0, \infty)\), defining the normalization constant

\[Z = \int_{\mathcal{C}} \text{D}\Psi \, e^{-S[\Psi]}\]

often called the partition function (we will discuss the meaning of \(\text{D}\Psi\) below). Explicitly, the field configuration space will take the form

\[\begin{align*} \mathcal{C} &= C^{\infty}(M, V^{(1)}) \times \cdots \times C^{\infty}(M, V^{(N)})\\ &\cong C^{\infty}(M, V) \end{align*}\]

with \(\Psi_i \in C^{\infty}(M, V^{(i)})\) for a (finite-dimensional) vector space \(V^{(i)}\) and spacetime manifold \(M\), and defining \(V := V^{(1)} \oplus \cdots \oplus V^{(N)}\).

Physical quantities that can be measured via experiment – such as scattering probabilities, as demonstrated by the LSZ formula (see Appendix A.2) – are directly related to field expectances, also called correlators, that take the general form

\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] := \frac{1}{Z} \int_{\mathcal{C}} \text{D}\Psi \, \Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n) e^{-S[\Psi]}\]

for arbitrary \(n \in \mathbb{Z}_{+}\), indices \((i_1, \ldots, i_n) \in \{1, \ldots, N\}^n\), and points \((x_1, \ldots, x_n) \in M^n\). Given that such expectances are directly related to physical quantities, an important constraint will be for \(S\) to be such that these expectances are invariant under transformations that leave \(\Psi\) physically equivalent (e.g. Lorentz transformations), which we will soon make precise.
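As a toy illustration of these definitions (not part of the constructions below), consider a zero-dimensional "theory" in which a field configuration is a single real number, so that the path integral reduces to an ordinary integral; the quartic coupling value here is purely illustrative:

```python
import numpy as np

# Zero-dimensional toy theory: a "field configuration" is a single real
# number phi, so the path integral reduces to an ordinary integral.
# Illustrative action S[phi] = phi^2/2 + g*phi^4 with toy coupling g.
def S(phi, g=0.1):
    return 0.5 * phi**2 + g * phi**4

phi = np.linspace(-10.0, 10.0, 200001)
dphi = phi[1] - phi[0]

weight = np.exp(-S(phi))
Z = np.sum(weight) * dphi                    # partition function
corr2 = np.sum(phi**2 * weight) * dphi / Z   # two-point "correlator" E[phi^2]

# In the free theory (g = 0) the Gaussian integral gives E[phi^2] = 1
# exactly; the positive quartic term suppresses large phi, so corr2 < 1.
free_weight = np.exp(-S(phi, g=0.0))
free_corr2 = np.sum(phi**2 * free_weight) * dphi / (np.sum(free_weight) * dphi)
```

The same structure – weight \(e^{-S}\), normalization \(Z\), correlators as weighted moments – carries over to genuine field configurations, where defining the measure \(\text{D}\Psi\) requires more care.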

Constructing a well-behaved definition of the path integral measure \(\text{D}\Psi\) is non-trivial. The approach that we will use is to expand each field \(\Psi_i \in C^{\infty}(M, V^{(i)})\) in an eigenfunction basis \(\{\psi_i^j\}_j\) of \(C^{\infty}(M, V^{(i)})\), letting us write \(\Psi_i(x) = \sum_j a_i^j \psi_i^j(x)\) for expansion coefficients \(\{a_i^j\}_j\) and allowing us to define

\[\text{D}\Psi_i := \prod_j da_i^j\]

but since this is an infinite product, it will generally result in divergences, requiring some form of regularization: either Fujikawa regularization (as used when computing anomalies, as in Section 4), truncation of the infinite product via a cutoff (as used in the context of Wilsonian renormalization, as in Section 8), or some other regularization method.
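To see the divergence of the infinite product concretely, here is a sketch assuming a free scalar on a circle, with illustrative mode eigenvalues \(\lambda_j = j^2 + m^2\): each mode integral is Gaussian and perfectly finite, yet the product over modes fails to converge as the cutoff is raised.

```python
import numpy as np

# Toy free scalar on a circle: expanding Psi(x) = sum_j a_j psi_j(x) in
# Fourier modes, the quadratic action diagonalizes (illustratively) as
# S = (1/2) sum_j lam_j a_j^2 with lam_j = j^2 + m^2. Each mode integral
# is Gaussian, int da exp(-lam a^2 / 2) = sqrt(2*pi/lam), so
# log Z = (1/2) sum_j log(2*pi/lam_j), truncated at mode cutoff N.
def log_Z(cutoff, m=1.0):
    j = np.arange(-cutoff, cutoff + 1)
    lam = j**2 + m**2
    return 0.5 * np.sum(np.log(2 * np.pi / lam))

# The product over modes does not converge: log Z keeps drifting as the
# cutoff is raised, which is the divergence requiring regularization.
values = [log_Z(N) for N in (10, 100, 1000)]
```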

To motivate the guiding principles behind the construction of our theory \(S\), we first must introduce some central concepts. First, we should view the configuration space \(\mathcal{C}\) as exhibiting physical redundancy, containing many configurations that are physically equivalent. In particular, we can think of there being some true non-redundant physical configuration space \(\mathcal{P}\), with \(\mathcal{C}\) partitioned as

\[\mathcal{C} = \bigcup_{\Phi \in \mathcal{P}} [\Phi]\]

for equivalence classes \([\Phi] = \{\Phi' \in \mathcal{C}: \Phi \sim \Phi'\}\) defined by a physical equivalence relation \(\sim\) over \(\mathcal{C}\). Said differently, we have

\[\mathcal{P} \cong \mathcal{C}/\sim\]

The physical equivalence relation \(\sim\) will be defined by the orbits of a collection of Lie groups \((G^{(0)}, G^{(1)}, \ldots, G^{(K)})\) (i.e. groups that are also manifolds), consisting of a spacetime symmetry group \(G^{(0)}\) inherent to the particular manifold and metric \((M, g)\) under consideration, as well as a collection of gauge symmetry groups \((G^{(1)}, \ldots, G^{(K)})\). We will denote the overall symmetry group by \(G := G^{(0)} \times G^{(1)} \times \cdots \times G^{(K)}\).

  • Physical equivalence to orbits of the spacetime group \(G^{(0)}\) is reasonably intuitive since (as discussed in Section 1) it essentially describes coordinate transformations that leave the metric invariant (isometries), and it is reasonable to expect that physical predictions should be independent of such choices. However, physical equivalence under the gauge groups is less intuitive. It turns out that the choice \((G^{(1)}, G^{(2)}, G^{(3)}) = (U(1), SU(2), SU(3))\) agrees extraordinarily closely with our universe, though the reason why remains unknown. As we will see in detail, this choice of gauge groups is a particularly simple one that satisfies Equations 1 and 2, and defines the Standard Model.

In order to define this physical equivalence relation \(\sim\) over \(\mathcal{C}\) using these groups, we require representations of these groups that describe how they actually act on the field content \(\Psi\). Concretely, the representation of \(G^{(k)}\) acting on \(\Psi_i \in C^{\infty}(M, V^{(i)})\) will be denoted \(\rho^{(i, k)}: G^{(k)} \to \text{GL}(V^{(i, k)})\), where each representation is assigned its own sector \(V^{(i, k)}\) of \(V^{(i)}\) on which to act, with

\[V^{(i)} = \underbrace{V^{(i, 0)}}_{\text{spacetime sector}} \oplus \underbrace{V^{(i, 1)} \oplus \cdots \oplus V^{(i, K)}}_{\text{gauge sectors}}\]

Namely, though \(\Psi_i\) lives in an infinite-dimensional space \(C^{\infty}(M, V^{(i)})\), the representations that we will consider will only act on (a subspace of) the finite-dimensional output space \(V^{(i)}\). The overall action of \(G\) on \(\Psi_i\) is described by \(\rho^{(i)} := \rho^{(i, 0)} \oplus \rho^{(i, 1)} \oplus \cdots \oplus \rho^{(i, K)}\), with \(\rho^{(i)}: G \to \text{GL}(V^{(i)})\).

With such representations, we can now define a physical equivalence relation \(\sim\). Recall that we can view \(\mathcal{C} = C^{\infty}(M, V)\). Then we can condense the representations introduced above into a single representation \(\rho := \rho^{(1)} \oplus \cdots \oplus \rho^{(N)}\), where \(\rho: G \to \text{GL}(V)\), allowing us to denote the overall transformation of field content by \(\Psi \mapsto \rho_g \Psi\) under \(g = (g_0, \ldots, g_K) \in G\). This lets us define the physical equivalence class associated with \(\Psi \in \mathcal{C}\) to be

\[[\Psi] := \{\rho_{g} \Psi: g \in C^{\infty}(M, G)\} \subset \mathcal{C}\]

where note that \(\rho_{g} \Psi \in C^{\infty}(M, V)\) with \([\rho_{g} \Psi](x) \equiv \rho_{g(x)} \Psi(x) \in V\). This provides us with a physical equivalence relation on \(\mathcal{C}\):

\[\Psi \sim \Psi' \iff \Psi' \in [\Psi]\]
  • Integrating over a highly physically-redundant configuration space \(\mathcal{C}\) within our expectances can cause problems, namely when the redundancy results in divergences of the path integral. We explore this problem in Section 7, and comment on it in Appendix A.4.

Guiding principles. With these concepts introduced, we can now formulate the central guiding principles to constructing physical theories in this framework.

The main principle that we will work towards satisfying is invariance of correlators:

\[\mathbb{E}_{\Psi \sim S}\left[[\rho^{(i_1)}_{g} \Psi_{i_1}](x_1) \cdots [\rho^{(i_n)}_{g} \Psi_{i_n}](x_n)\right] = \mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] \quad \forall \; \; g \in C^{\infty}(M, G)\]

for arbitrary \(n \in \mathbb{Z}_{+}\), indices \((i_1, \ldots, i_n) \in \{1, \ldots, N\}^n\), and points \((x_1, \ldots, x_n) \in M^n\). Or written schematically,

\[\mathbb{E}_{\Psi \sim S} \circ \rho_g = \mathbb{E}_{\Psi \sim S}\]
  • This principle can be motivated by the fact that physical quantities like scattering amplitudes can be written in terms of correlators (see Appendix A.2), meaning that correlators are of direct physical relevance. As a result, given that we have defined physical equivalence via the relation \(\sim\), we should ensure that correlators are invariant under \(\sim\) (or equivalently, under \(\rho\)) to ensure a consistent physical theory.

Importantly, for this condition to hold generically, one can see that two independent conditions are required:

\[\begin{equation} \label{eqn:Sinvar} S[\rho_g \Psi] = S[\Psi] \qquad \forall \; \; g \in C^{\infty}(M, G) \qquad\qquad (\text{Equation 1}) \end{equation}\]

and

\[\begin{equation} \label{eqn:Dinvar} \text{D}(\rho_g \Psi) = \text{D}\Psi \qquad \forall \; \; g \in C^{\infty}(M, G) \qquad\qquad (\text{Equation 2}) \end{equation}\]
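To see that these two conditions suffice, change integration variables to \(\Psi' := \rho_g \Psi\) (a sketch, suppressing indices):

\[\begin{align*} \mathbb{E}_{\Psi \sim S}\left[[\rho_g \Psi](x_1) \cdots [\rho_g \Psi](x_n)\right] &= \frac{1}{Z} \int_{\mathcal{C}} \text{D}\Psi \, [\rho_g \Psi](x_1) \cdots [\rho_g \Psi](x_n) \, e^{-S[\Psi]}\\ &= \frac{1}{Z} \int_{\mathcal{C}} \text{D}(\rho_g \Psi) \, [\rho_g \Psi](x_1) \cdots [\rho_g \Psi](x_n) \, e^{-S[\rho_g \Psi]}\\ &= \frac{1}{Z} \int_{\mathcal{C}} \text{D}\Psi' \, \Psi'(x_1) \cdots \Psi'(x_n) \, e^{-S[\Psi']} \end{align*}\]

where the second line uses Equations 1 and 2, and the final line is exactly \(\mathbb{E}_{\Psi \sim S}[\Psi(x_1) \cdots \Psi(x_n)]\) (noting that \(Z\) itself is invariant by the same substitution).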

Our theory is ultimately described by the triple \((S, G, \rho)\) consisting of the action, the group content, and the representation content respectively. Equation 2 only places constraints on \((G, \rho)\), whereas Equation 1 constrains all of \((S, G, \rho)\). The focus of Sections 1-3 and 5-7 will be constructing a theory that achieves Equation 1, while the implications of Equation 2 are examined in Section 4.

In constructing a theory \(S\) that satisfies Equation 1, the spacetime group \(G^{(0)}\) will play a particularly special role. Namely, we will begin with a theory of empty field content \(\Psi = \emptyset\) and add to the field content incrementally, using representations of \(G^{(0)}\) to construct the initial field content of the theory (i.e. spinors), with \(S\) constructed to be invariant under \(G^{(0)}\) in the sense of Equation 1. From here, we will then obtain invariance to \((G^{(1)}, \ldots, G^{(K)})\) via a process called minimal coupling. A total overview:

  • Section 1: We motivate the spacetime group \(G^{(0)} = \text{Spin}(1, 3) \cong \text{SL}(2; \mathbb{C})\), allowing us to begin constructing the field content \(\Psi = \{\psi_{L, i}\}_{i=1}^{N_L} \cup \{\chi_{R, j}\}_{j=1}^{N_R}\) consisting of \(N_L\) left-handed spinors and \(N_R\) right-handed spinors.
  • Section 2: We then design an action \(S[\Psi]\) that is invariant to \(G^{(0)}\) in the sense of Equation 1.
  • Section 3: To make \(S[\Psi]\) invariant to the gauge groups \((G^{(1)}, \ldots, G^{(K)})\), we perform minimal coupling, which involves extending the field content

    \[\Psi \mapsto \Psi \cup \{A^{(k)}\}_{k=1}^{K}\]

    for a 4-vector \(A^{(k)}\) associated with each gauge group \(G^{(k)}\) that transforms appropriately under \(G^{(k)}\), and adding appropriate terms to \(S[\Psi]\) involving these new fields.

  • Section 4: Checking that Equation 2 is satisfied requires us to compute anomalies, describing the extent to which \(\text{D}\Psi\) fails to be invariant under \(\rho\). Anomalies must vanish in order for Equation 2 to be satisfied. We will see that the Standard Model, defined by the gauge groups

    \[(G^{(1)}, G^{(2)}, G^{(3)}) = (U(1), SU(2), SU(3))\]

    indeed has vanishing anomalies as required.

  • Section 5: To add non-trivial interaction terms between spinors to the action \(S[\Psi]\), we extend the field content

    \[\Psi \mapsto \Psi \cup \{H\}\]

    for a single scalar \(H\) that transforms appropriately under the gauge groups to preserve gauge invariance of \(S[\Psi]\). In the context of the Standard Model, \(H\) is called the Higgs boson, and the associated interaction terms are called the Yukawa terms.

  • Section 6: We will promote the auxiliary fields \(\{A^{(k)}\}_{k=1}^{K}\) and \(H\) to be dynamical fields in their own right, possessing their own kinetic terms in the action \(S[\Psi]\).
  • Section 7: To remedy divergences that arise from integrating over physically-equivalent configurations, we must introduce ghost fields \(\bar{c}^{(k)}, c^{(k)} \in C^{\infty}(M, \mathfrak{g}^{(k)})\) for each gauge field \(A^{(k)}\):

    \[\Psi \mapsto \Psi \cup \{\bar{c}^{(k)}, c^{(k)}\}_{k=1}^{K}\]

    This requires adding certain contributions to the action that break gauge invariance. However, remnants of gauge invariance still persist through BRST invariance.

  • Section 8: Experiments on Earth are restricted to a relatively low "anthropic" energy scale, which means that we can only expect our theories (e.g. the Standard Model) to be accurate at low energies if we determine the values of their couplings via such experiments. We show how the procedure of renormalization allows us to extrapolate such low-energy theories to higher energy scales beyond those we can observe, and in the special case of asymptotic freedom, to arbitrarily large energy scales.
  • Section 9: Since quantities of physical relevance can be written in terms of correlators, extracting physical predictions from a theory requires computing correlators. We outline how correlators can be computed perturbatively.

Main references: David Tong’s Standard Model notes are most relevant to Sections 1-6, and David Skinner’s Advanced QFT notes are relevant to Sections 7-9. Section 4 is also closely based on David Tong’s Gauge Theory notes.

1. The spacetime symmetry group and its representations

We begin with an empty theory of no field content \(\Psi = \emptyset\). To start things off, we must choose some (\(d\)-dimensional) spacetime manifold \(M\) to embed our theory into, as well as a metric \(g\) over this manifold. Roughly, the manifold \(M\) describes the topology of our spacetime, and the metric \(g\) describes its geometry.

  • For example, with a metric \(g\), we can define a notion of distance between any two points \(x, y \in M\) on the manifold using geodesics. A metric \(g\) also gives us a notion of curvature of spacetime.

Our manifold \(M\) comes equipped with a tangent space \(T(M)\), and in particular, a (local) coordinate basis \(\{\partial_{\mu}\}_{\mu=0}^{d-1} \subset T(M)\) of this tangent space, whose elements act as partial derivatives on our (to be constructed) field content of type \(C^{\infty}(M, V)\).

Motivating the spacetime symmetry group. To begin defining the objects and fields \(\Psi_i\) relevant to \((M, g)\), our starting point will be to consider the isometries of the metric \(g\), corresponding to the coordinate transformations that leave \(g\) invariant. Particularly, this set of transformations forms a group \(\text{Iso}(M, g)\). We would like our spacetime symmetry group \(G^{(0)}\) to be related to \(\text{Iso}(M, g)\) in some capacity.

Our metric \(g\) will have some signature, describing the signs of its eigenvalues. By the assumption of \(g\) being non-degenerate (i.e. invertible as a matrix), we have that \(g\) has no zero eigenvalues. We say that \(g\) has signature \((r, s)\) if it has \(r\) positive eigenvalues, and \(s\) negative eigenvalues (where \(r + s = d\)). We then define the corresponding signature matrix \(\Omega_{\mu\nu}^{(r, s)} := \text{diag}(\underbrace{1, \ldots, 1}_{r \, \text{times}}, \underbrace{-1, \ldots, -1}_{s \, \text{times}})\) for \(g\). Then we have the following result: at any point \(x \in M\), there exists a basis \(\{e_{\mu}\}_{\mu} \subset T_x(M)\) such that

\[g_x(e_{\mu}, e_{\nu}) = \Omega_{\mu\nu}^{(r, s)}\]

That is, locally, we can always reduce \(g\) to the signature matrix \(\Omega_{\mu\nu}^{(r, s)}\). As a result, the isometry group \(\text{Iso}(M, g)\) locally reduces to \(\text{Iso}(M, \Omega^{(r, s)})\), and note that \(\text{Iso}(M, \Omega^{(r, s)}) \cong \text{IO}(r, s)\), the inhomogeneous orthogonal group of signature \((r, s)\). Further, note that we can decompose \(\text{IO}(r, s)\) as

\[\text{IO}(r, s) = \mathbb{R}^{r, s} \rtimes \text{O}(r, s)\]

with \(\text{O}(r, s)\) the linear/homogeneous orthogonal group of signature \((r, s)\), explicitly defined as

\[\text{O}(r, s) = \{A \in \text{GL}(r+s; \mathbb{R}): A^T \Omega^{(r,s)} A = \Omega^{(r,s)}\}\]

We will restrict our attention to the local linear isometries \(\text{O}(r, s)\) of \(g\).

To construct some initial field content \(\Psi\), we would like to understand the available representations of \(\text{O}(r, s)\), which is most easily done by studying its Lie algebra \(\mathfrak{o}(r, s)\). Simply connected Lie groups are particularly nice for this purpose, since in that case there is a one-to-one correspondence between Lie group representations and Lie algebra representations. However, \(\text{O}(r, s)\) is not simply connected.

But, in general, for any finite-dimensional Lie algebra \(\mathfrak{g}\), there is a unique simply connected Lie group \(G\) whose Lie algebra is \(\mathfrak{g}\). Further, if a Lie group \(H\) also has Lie algebra \(\mathfrak{g}\), then \(G\) is isomorphic to the universal covering group of the connected component of \(H\) that contains the identity.

In our case, this means that the unique simply connected Lie group associated with \(\mathfrak{o}(r, s)\) is the universal covering group of \(SO^{+}(r, s)\) (where \(SO^{+}(r, s)\) is the connected component of \(O(r, s)\) that contains the identity). Of particular interest to us will be the case of \((r, s) = (1, d-1)\), in which case we have that for \(d \geq 4\), the universal covering group of \(SO^{+}(1, d-1)\) is the spin group \(\text{Spin}(1, d-1)\). As a result, we will choose our spacetime group to be

\[G^{(0)} = \text{Spin}(1, d-1)\]

In total, we have chosen \(G^{(0)}\) to be the unique simply connected Lie group corresponding to (the Lie algebra of) the local linear isometries of \(g\). We will now study its representations.

  • Note that if both \(r, s > 1\), then \(\text{Spin}(r, s)\) is no longer the universal covering group; it is only the double cover of \(SO^{+}(r, s)\) (by definition).
  • In the above, we restricted to the local isometry group \(\text{Iso}(M, \Omega^{(r, s)})\). Can we say something about the representations of the "true" global isometry group \(\text{Iso}(M, g)\) (or its linear component)?

Representations. Of particular interest to our universe is the choice \(d=4\), corresponding to the signature \((r, s) = (1, 3)\), with manifold \(M \cong \mathbb{R}^{1, 3}\), representing 1 temporal dimension and 3 spatial dimensions. The signature matrix is \(\eta_{\mu\nu}\), called the Minkowski metric, taking the form \(\eta = \text{diag}(1, -1, -1, -1)\).

In this case, we can identify

\[G^{(0)} = \text{Spin}(1, 3) \cong SL(2; \mathbb{C})\]

with \(SO^{+}(1, 3) \cong SL(2; \mathbb{C})/\mathbb{Z}_2\). In particular, by construction, we have that

\[\mathfrak{o}(1, 3) \cong \mathfrak{sl}(2; \mathbb{C})\]

We therefore wish to classify the spacetime representations of \(G^{(0)}\) by studying the algebra \(\mathfrak{sl}(2; \mathbb{C})\). As shown in Appendix A.1, we can classify all irreducible representations of complex simple Lie algebras, and the complexification \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) is semi-simple, decomposing as a direct sum of simple Lie algebras:

\[\begin{equation} \label{eqn:sliso} \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}} \cong \mathfrak{su}(2)_{\mathbb{C}} \oplus \mathfrak{su}(2)_{\mathbb{C}} \end{equation}\]
  • This follows from there being a set of generators \(\{L_i\}_{i=1}^{3} \cup \{R_i\}_{i=1}^{3}\) of \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) such that \(\{L_i\}_{i=1}^{3}\) and \(\{R_i\}_{i=1}^{3}\) each satisfy the \(\mathfrak{su}(2)_{\mathbb{C}}\) algebra independently, i.e.

    \[[L_i, L_j] = \epsilon_{ijk} L_k, \qquad [R_i, R_j] = \epsilon_{ijk} R_k\]

    with \([L_i, R_j]= 0\) for all \(i, j\). This corresponds to the Lie algebra isomorphism \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}} \cong \mathfrak{su}(2)_{\mathbb{C}} \oplus \mathfrak{su}(2)_{\mathbb{C}}\).

  • In more detail, \(\mathfrak{sl}(2; \mathbb{C})\) (non-complexified) has generators with representations

    \[\text{Boosts:} \quad K_1 \sim \begin{bmatrix} &1&&\\ 1&&&\\ &&&\\ &&& \end{bmatrix}, \quad K_2 \sim \begin{bmatrix} &&1&\\ &&&\\ 1&&&\\ &&& \end{bmatrix}, \quad K_3 \sim \begin{bmatrix} &&&1\\ &&&\\ &&&\\ 1&&& \end{bmatrix}\] \[\text{Rotations:} \quad J_1 \sim \begin{bmatrix} &&&\\ &&&\\ &&&-1\\ &&1& \end{bmatrix}, \quad J_2 \sim \begin{bmatrix} &&&\\ &&&1\\ &&&\\ &-1&& \end{bmatrix}, \quad J_3 \sim \begin{bmatrix} &&&\\ &&-1&\\ &1&&\\ &&& \end{bmatrix}\]

    allowing us to write any element \(X \in \mathfrak{sl}(2; \mathbb{C})\) as \(X = \theta^i J_i + \chi^i K_i\) for some \(\theta^i, \chi^j \in \mathbb{R}\). This lets us define generators for the complexified algebra \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\):

    \[L_i := \frac{1}{2}(J_i + iK_i), \qquad R_i = \frac{1}{2}(J_i - iK_i)\]

    Namely, we can write any element \(X \in \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) as \(X = \alpha^i L_i + \beta^i R_i\) for some \(\alpha^i, \beta^j \in \mathbb{C}\). Then explicitly, the isomorphism is

    \[X = \alpha^i L_i + \beta^i R_i \mapsto \alpha^i L_i \oplus \beta^i R_i =: X_L \oplus X_R\]
  • Note that we can write

    \[\alpha^i = \theta^i - i\chi^i, \qquad \beta^i = \theta^i + i\chi^i\]

    for coefficients \(\theta^i, \chi^j\) in the original uncomplexified basis (rotation and boost parameters respectively).
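The decomposition above can be verified numerically; a sketch in numpy, using the explicit \(4 \times 4\) generators displayed above:

```python
import numpy as np

def E(a, b):
    """4x4 matrix with a single 1 at row a, column b."""
    m = np.zeros((4, 4), dtype=complex)
    m[a, b] = 1.0
    return m

# Boost (K) and rotation (J) generators in the 4-vector representation,
# matching the sparse matrices displayed above.
K = [E(0, 1) + E(1, 0), E(0, 2) + E(2, 0), E(0, 3) + E(3, 0)]
J = [E(3, 2) - E(2, 3), E(1, 3) - E(3, 1), E(2, 1) - E(1, 2)]

# Complexified generators L_i = (J_i + iK_i)/2, R_i = (J_i - iK_i)/2.
L = [(J[i] + 1j * K[i]) / 2 for i in range(3)]
R = [(J[i] - 1j * K[i]) / 2 for i in range(3)]

def comm(A, B):
    return A @ B - B @ A

# Totally antisymmetric epsilon_{ijk} with eps[0,1,2] = 1.
eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k], eps[j, i, k] = 1.0, -1.0

# Check the two commuting su(2) algebras: [L_i, L_j] = eps_ijk L_k,
# [R_i, R_j] = eps_ijk R_k, and [L_i, R_j] = 0 for all i, j.
for i in range(3):
    for j in range(3):
        assert np.allclose(comm(L[i], L[j]), sum(eps[i, j, k] * L[k] for k in range(3)))
        assert np.allclose(comm(R[i], R[j]), sum(eps[i, j, k] * R[k] for k in range(3)))
        assert np.allclose(comm(L[i], R[j]), 0)
```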

In the following, we will denote the irreducible representation of \(\mathfrak{su}(2)_{\mathbb{C}}\) of highest weight \(\Lambda\) by

\[d^{(\Lambda)}: \mathfrak{su}(2)_{\mathbb{C}} \to \mathfrak{gl}(V_{\Lambda})\]

with \(\dim d^{(\Lambda)} \equiv \dim V_{\Lambda} = \Lambda + 1\). For more details regarding why we can classify the irreducible representations of complex simple Lie algebras by highest weights \(\Lambda\), see Appendix A.1.

The isomorphism of Equation \ref{eqn:sliso} lets us determine all irreducible representations \(d^{(\Lambda_1, \Lambda_2)}\) of \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\), constructed as

\[d^{(\Lambda_1, \Lambda_2)}: \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}} \to \mathfrak{gl}(V_{\Lambda_1, \Lambda_2})\] \[d_X^{(\Lambda_1, \Lambda_2)} := d_{X_L}^{(\Lambda_1)} \otimes I + I \otimes d_{X_R}^{(\Lambda_2)}\]

defining \(V_{\Lambda_1, \Lambda_2} := V_{\Lambda_1} \otimes V_{\Lambda_2}\), with \(\dim d^{(\Lambda_1, \Lambda_2)} = (\Lambda_1 + 1)(\Lambda_2 + 1)\). Here, \(X_L, X_R \in \mathfrak{su}(2)_{\mathbb{C}}\) are related to \(X \in \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) through the isomorphism outlined above, where

\[X = \alpha^i L_i + \beta^i R_i \implies X_L = \alpha^i L_i, \quad X_R = \beta^i R_i\]

We can extend this algebra representation \(d^{(\Lambda_1, \Lambda_2)}\) to a group representation \(D^{(\Lambda_1, \Lambda_2)}\) via the exponential map:

\[D^{(\Lambda_1, \Lambda_2)}: SL(2; \mathbb{C}) \to \text{GL}(V_{\Lambda_1, \Lambda_2})\] \[D_{A}^{(\Lambda_1, \Lambda_2)} := \exp\left(d_{X(A)}^{(\Lambda_1, \Lambda_2)}\right)\]

for \(A \in SL(2; \mathbb{C})\), with \(X(A) \in \mathfrak{sl}(2; \mathbb{C})\) related to \(A\) via \(A =: \exp(X(A))\).

Iterating through the first few of these representations is sufficient to define the core objects of interest:

  • A scalar \(S\) lives in the 1-dimensional vector space \(V_{0, 0}\) and transforms as

    \[S \mapsto_{A} D_{A}^{(0, 0)} S \equiv S\]

    under \(A \in SL(2; \mathbb{C})\).

  • A left-handed spinor \(\psi_L\) lives in the 2-dimensional vector space \(V_{1, 0}\) and transforms as

    \[\psi_L \mapsto_{A} D_{A}^{(1, 0)} \psi_L \equiv \exp\left(d_{X_L(A)}^{(1)}\right) \psi_L\]

    under \(A \in SL(2; \mathbb{C})\).

  • A right-handed spinor \(\psi_R\) lives in the 2-dimensional vector space \(V_{0, 1}\) and transforms as

    \[\psi_R \mapsto_{A} D_{A}^{(0, 1)} \psi_R \equiv \exp\left(d_{X_R(A)}^{(1)}\right) \psi_R\]

    under \(A \in SL(2; \mathbb{C})\).

  • A 4-vector \(V\) lives in the 4-dimensional vector space \(V_{1, 1}\) and transforms as

    \[V \mapsto_{A} D_{A}^{(1, 1)} V \equiv \left(\exp\left(d_{X_L(A)}^{(1)}\right) \otimes \exp\left(d_{X_R(A)}^{(1)}\right)\right) V\]

    under \(A \in SL(2; \mathbb{C})\).

These 4 objects are all we need to consider to construct the Standard Model.

Spinor transformations. We can write the spinor transformation rules more explicitly. Note that \(d^{(1)}\) describes the fundamental representation of \(\mathfrak{su}(2)_{\mathbb{C}}\), and the Pauli matrices \(\{-i\sigma^j/2\}_{j=1}^{3}\) act as a fundamental representation of \(\mathfrak{su}(2)_{\mathbb{C}}\), allowing us to choose

\[d^{(1)}_{L_i} = -\frac{i}{2} \sigma^i, \qquad d^{(1)}_{R_i} = -\frac{i}{2} \sigma^i\] \[\implies d_{X_L(A)}^{(1)} = -\frac{i}{2} \boldsymbol{\theta}(A) \cdot \boldsymbol{\sigma} - \frac{1}{2} \boldsymbol{\chi}(A) \cdot \boldsymbol{\sigma}, \qquad d_{X_R(A)}^{(1)} = -\frac{i}{2} \boldsymbol{\theta}(A) \cdot \boldsymbol{\sigma} + \frac{1}{2} \boldsymbol{\chi}(A) \cdot \boldsymbol{\sigma}\]

for \(A \in SL(2; \mathbb{C})\). For brevity we can define the real-valued (anti-symmetric) matrix \(\omega_{\mu\nu}(A)\) by

\[\omega_{ij}(A) := \epsilon_{ijk} \theta^k(A), \qquad \omega_{0i}(A) := \chi^i(A)\]

Then, also defining

\[\sigma^{\mu\nu} := \frac{i}{4}(\sigma^{\mu} \bar{\sigma}^{\nu} - \sigma^{\nu} \bar{\sigma}^{\mu}), \qquad \bar{\sigma}^{\mu\nu} := \frac{i}{4}(\bar{\sigma}^{\mu} \sigma^{\nu} - \bar{\sigma}^{\nu} \sigma^{\mu})\]

(related via \((\sigma^{\mu\nu})^{\dagger} = \bar{\sigma}^{\mu\nu}\)) we can write

\[d_{X_L(A)}^{(1)} = -\frac{i}{2} \omega_{\mu\nu}(A) \sigma^{\mu\nu}, \qquad d_{X_R(A)}^{(1)} = -\frac{i}{2} \omega_{\mu\nu}(A) \bar{\sigma}^{\mu\nu}\]

which gives us the spinor transformation rules:

\[\psi_L \mapsto_{A} \underbrace{\exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) \sigma^{\mu\nu}\right)}_{=: \, L(A)} \psi_L, \qquad \psi_R \mapsto_{A} \underbrace{\exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) \bar{\sigma}^{\mu\nu}\right)}_{=: \, R(A)}\psi_R\]

Note that \(L(A)^{\dagger} = R(A)^{-1}\).
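These identities (\((\sigma^{\mu\nu})^{\dagger} = \bar{\sigma}^{\mu\nu}\) and \(L(A)^{\dagger} = R(A)^{-1}\)), along with the identity \(\sigma^2 L(A) \sigma^2 = (L(A)^T)^{-1}\) used in Section 2, can be checked numerically; a numpy sketch, where the truncated-series matrix exponential and random parameters are illustrative:

```python
import numpy as np

# Pauli matrices; sigma^mu = (I, sigma^i), sigmabar^mu = (I, -sigma^i).
s1 = np.array([[0, 1], [1, 0]], dtype=complex)
s2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
s3 = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)
sig = [I2, s1, s2, s3]
sigbar = [I2, -s1, -s2, -s3]

def smunu(a, b):   # sigma^{mu nu}
    return 0.25j * (sig[a] @ sigbar[b] - sig[b] @ sigbar[a])

def sbmunu(a, b):  # sigmabar^{mu nu}
    return 0.25j * (sigbar[a] @ sig[b] - sigbar[b] @ sig[a])

def expm(M, terms=40):
    """Matrix exponential via truncated Taylor series (fine for small 2x2 M)."""
    out, term = np.eye(2, dtype=complex), np.eye(2, dtype=complex)
    for n in range(1, terms):
        term = term @ M / n
        out = out + term
    return out

# Random (small) real antisymmetric parameters omega_{mu nu}.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
w = 0.2 * (w - w.T)

L = expm(sum(-0.5j * w[a, b] * smunu(a, b) for a in range(4) for b in range(4)))
R = expm(sum(-0.5j * w[a, b] * sbmunu(a, b) for a in range(4) for b in range(4)))

# (sigma^{mu nu})^dagger = sigmabar^{mu nu} ...
assert all(np.allclose(smunu(a, b).conj().T, sbmunu(a, b))
           for a in range(4) for b in range(4))
# ... L^dagger = R^{-1}, and sigma^2 L sigma^2 = (L^T)^{-1}.
assert np.allclose(L.conj().T, np.linalg.inv(R))
assert np.allclose(s2 @ L @ s2, np.linalg.inv(L.T))
```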

  • One can also introduce a Dirac spinor \(\psi\) to live in the 4-dimensional space \(V_{1, 0} \oplus V_{0, 1}\) and transform as

    \[\psi \mapsto_{A} (D_{A}^{(1, 0)} \oplus D_{A}^{(0, 1)}) \psi = \exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) S^{\mu\nu}\right) \psi\]

    for \(S^{\mu\nu} := \frac{i}{4} [\gamma^{\mu}, \gamma^{\nu}]\) (with \(S^{\mu\nu} = \text{diag}(\sigma^{\mu\nu}, \bar{\sigma}^{\mu\nu})\)). In particular we can view \(\psi = \psi_L \oplus \psi_R\). This construction will be useful when computing anomalies in Section 4.

4-vector transformations. The transformation rule for a 4-vector \(V\) corresponds to

\[V \mapsto_{A} (L(A) \otimes R(A)) V\]

where recall that \(V \in V_1 \otimes V_1\) (with \(\dim V_1 = 2\)). Explicitly in indices,

\[V_{ij} \mapsto_{A} L(A)^k{}_{i} R(A)^{l}{}_{j} V_{kl}\]

It turns out that there is a one-to-one correspondence between \(V_{ij}\) and an object \(V_{\mu}\) (with indices \(\mu = 0, 1, 2, 3\)) that transforms as

\[V_{\mu} \mapsto_{A} \Lambda(A)^{\nu}{}_{\mu} V_{\nu}\]

with \(\Lambda(A) \in SO^{+}(1, 3)\) constructed from \(A \in SL(2; \mathbb{C})\) via

\[\Lambda(A)^{\mu}{}_{\nu} := \frac{1}{2} \text{tr}(\bar{\sigma}^{\mu} A \sigma_{\nu} A^{\dagger})\]

which is related to the double-cover correspondence \(SO^{+}(1, 3) \cong SL(2; \mathbb{C})/\mathbb{Z}_2\) (see that \(\Lambda(A) = \Lambda(-A)\), reflecting the fact that \(SL(2; \mathbb{C})\) is a double-cover). We can write \(\Lambda(A) \in SO^{+}(1, 3)\) more conveniently: define the collection of matrices \(\{M^{\mu\nu}\}_{\mu, \nu}\) by

\[(M^{\mu\nu})^{\sigma}{}_{\rho} = i(\delta^{\nu}{}_{\rho} \eta^{\mu\sigma} - \delta^{\mu}{}_{\rho} \eta^{\nu\sigma})\]

(note that \(M^{ij} = i\epsilon^{ijk} J_k, M^{0i} = iK_i\)) which allows us to write

\[\Lambda(A) = \exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) M^{\mu\nu}\right)\]

For infinitesimal \(\omega_{\mu\nu}\), this expands as \(\Lambda^{\mu}{}_{\nu} = \delta^{\mu}{}_{\nu} + \omega^{\mu}{}_{\nu} + O(\omega^2)\).

  • Todo: should show this correspondence in more detail, may be some errors in the above.

Due to this correspondence, when talking of 4-vectors, we will be referring to the object \(V_{\mu}\) that transforms by \(\Lambda(A) \in SO^{+}(1, 3)\) rather than \(V_{ij}\) itself, since \(V_{\mu}\) is much more convenient to work with.
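The claimed properties of \(\Lambda(A)\) – reality, preservation of \(\eta\), membership of \(SO^{+}(1, 3)\), and the double-cover identification \(\Lambda(A) = \Lambda(-A)\) – can be spot-checked numerically; a sketch in which the random sampling is purely illustrative:

```python
import numpy as np

s1 = np.array([[0, 1], [1, 0]], dtype=complex)
s2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
s3 = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)
# sigma_nu := eta_{nu rho} sigma^rho coincides entrywise with sigmabar^nu.
sb = [I2, -s1, -s2, -s3]

def Lambda(A):
    """Lambda(A)^mu_nu = (1/2) tr(sigmabar^mu A sigma_nu A^dagger).

    The trace is real for any A (the matrix inside is conjugate to its
    own adjoint under the trace), so taking .real only drops noise."""
    out = np.zeros((4, 4))
    for m in range(4):
        for n in range(4):
            out[m, n] = 0.5 * np.trace(sb[m] @ A @ sb[n] @ A.conj().T).real
    return out

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
A = A / np.sqrt(np.linalg.det(A))        # enforce det(A) = 1, i.e. A in SL(2,C)

eta = np.diag([1.0, -1.0, -1.0, -1.0])
Lam = Lambda(A)

assert np.allclose(Lam.T @ eta @ Lam, eta)    # Lambda preserves eta
assert np.isclose(np.linalg.det(Lam), 1.0)    # proper
assert Lam[0, 0] >= 1.0 - 1e-9                # orthochronous
assert np.allclose(Lambda(A), Lambda(-A))     # A and -A give the same Lambda
```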

Recall that our theory comes equipped with partial derivatives \(\{\partial_{\mu}\}_{\mu=0}^{3} \subset T(M)\). These objects transform under a coordinate transformation described by the matrix \(B^{\mu}{}_{\nu}\) as

\[\partial_{\mu} \mapsto_B B^{\nu}{}_{\mu} \partial_{\nu}\]

meaning that, under a Lorentz transformation \(B = \Lambda(A)\), partial derivatives transform as a 4-vector:

\[\partial_{\mu} \mapsto_A \Lambda(A)^{\nu}{}_{\mu} \partial_{\nu}\]

for \(A \in SL(2; \mathbb{C})\). As we will soon see, this transformation rule helps us to construct non-trivial (dynamical) scalars by combining spinors.

2. Constructing scalars

We will now consider the field content \(\Psi = \{\psi_{L, i}\}_{i=1}^{N_L} \cup \{\chi_{R, j}\}_{j=1}^{N_R}\) consisting of \(N_L\) left-handed spinors and \(N_R\) right-handed spinors. Our first goal is to construct an action \(S[\Psi]\) that is invariant to the spacetime symmetry group \(G^{(0)}\) (in the sense of Equation 1), which is essentially the statement that \(S[\Psi]\) must transform as a scalar.

To achieve this, we will need to understand how we can combine the spinors \(\{\psi_{L, i}\}_{i=1}^{N_L} \cup \{\chi_{R, j}\}_{j=1}^{N_R}\) in a way that produces a scalar. In the following, we will show that contractions of the form

\[\psi_{L}^{\dagger} \chi_{R}, \qquad \psi_{L}^{T} \sigma^2 \chi_{L}, \qquad \psi_{L}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \chi_L\]

and, flipping the handedness,

\[\chi_{R}^{\dagger} \psi_L, \qquad \psi_R^T \sigma^2 \chi_R, \qquad \psi_R^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_R\]

all transform as scalars. There are many more scalars than this, however these are arguably the simplest scalars we can write down and are sufficient for constructing the Standard Model.

It is easy to show that \(\psi_{L}^{\dagger} \chi_{R}\) and \(\chi_{R}^{\dagger} \psi_L\) are scalars.

  • See that, since \(L(A)^{\dagger} = R(A)^{-1}\),

    \[\begin{align*} \psi_L^{\dagger} \chi_R &\mapsto_A \psi_L^{\dagger} L(A)^{\dagger} R(A) \chi_R\\ &= \psi_L^{\dagger} R(A)^{-1} R(A) \chi_R\\ &= \psi_L^{\dagger} \chi_R \end{align*}\]

    as required. \(\chi_{R}^{\dagger} \psi_L\) follows identically.

\(\psi_L^T \chi_L\) (and \(\psi_R^T \chi_R\)) are not scalars, since \(L(A)^T \neq L(A)^{-1}\) in general. However, noting that \((\sigma^2)^T = -\sigma^2\) while \((\sigma^{\mu})^T = \sigma^{\mu}\) for \(\mu \neq 2\), inserting \(\sigma^2\) results in \(\psi_{L}^{T} \sigma^2 \chi_{L}\) and \(\psi_R^T \sigma^2 \chi_R\) being scalars.

  • This follows from using \(\sigma^i \sigma^j = \delta_{ij} I + i\epsilon_{ijk} \sigma^k\) to show that

    \[\sigma^2 \sigma^{ij} \sigma^2 = i\epsilon_{ijk} \sigma^2 \sigma^{0k} \sigma^2, \qquad \sigma^2 \sigma^{0i} \sigma^2 = \frac{i}{2} (-1)^{1\{i=2\}} \sigma^i\]

    which can be used to show

    \[\sigma^2 L(A) \sigma^2 = (L(A)^T)^{-1}\]

    and therefore

    \[\begin{align*} \psi_L^T \sigma^2 \chi_L &\mapsto_A \psi_L^T L(A)^T \sigma^2 L(A) \chi_L\\ &= \psi_L^T L(A)^T \underbrace{(\sigma^2 L(A) \sigma^2)}_{(L(A)^T)^{-1}} \sigma^2 \chi_L\\ &= \psi_L^T \sigma^2 \chi_L \end{align*}\]

    as required. \(\psi_R^T \sigma^2 \chi_R\) follows similarly.
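The key identity \(\sigma^2 L(A) \sigma^2 = (L(A)^T)^{-1}\) can also be spot-checked numerically, again assuming the explicit form \(L(A) = \exp(-\frac{i}{2}\theta \cdot \sigma - \frac{1}{2}\chi \cdot \sigma)\) (an assumption consistent with the infinitesimal rules used below):

```python
import numpy as np

sig2 = np.array([[0, -1j], [1j, 0]])
sig = [np.array([[0, 1], [1, 0]], dtype=complex), sig2,
       np.array([[1, 0], [0, -1]], dtype=complex)]

def expm(M, terms=40):
    # Taylor-series matrix exponential (adequate for small 2x2 matrices)
    out, term = np.eye(2, dtype=complex), np.eye(2, dtype=complex)
    for n in range(1, terms):
        term = term @ M / n
        out = out + term
    return out

rng = np.random.default_rng(1)
theta, chi = rng.normal(size=3), rng.normal(size=3)
L = expm(sum((-0.5j * t - 0.5 * c) * s for t, c, s in zip(theta, chi, sig)))

# sigma^2 L(A) sigma^2 = (L(A)^T)^{-1}
assert np.allclose(sig2 @ L @ sig2, np.linalg.inv(L.T))
# equivalently, L(A)^T sigma^2 L(A) = sigma^2, so psi_L^T sigma^2 chi_L is invariant
assert np.allclose(L.T @ sig2 @ L, sig2)
```

The second assertion is the invariance statement used in the derivation above: conjugating the transposed exponent by \(\sigma^2\) flips the sign of every Pauli matrix, so \(\sigma^2 L(A)^T \sigma^2 = L(A)^{-1}\).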

Showing that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \chi_L\) (and \(\psi_R^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_R\)) are scalars is more involved. First, we can show that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu}\chi_L\) transforms as a 4-vector.

  • We will show this infinitesimally. We must first derive the infinitesimal transformation rule for a 4-vector \(X_{\mu} \mapsto_A \Lambda(A)^{\nu}{}_{\mu} X_{\nu}\). We will write \(\Lambda(A) = \exp(-\frac{i}{2} \omega_{\mu\nu} M^{\mu\nu})\). Then for infinitesimal \(\omega_{\mu\nu}\),

    \[\begin{align*} X_{\mu} \mapsto_{A} \Lambda(A)^{\nu}{}_{\mu} X_{\nu} &= X_{\mu} -\frac{i}{2} \omega_{\sigma\rho} (M^{\sigma\rho})^{\nu}{}_{\mu} X_{\nu} + O(\omega^2)\\ &= X_{\mu} - i\omega_{0i} (M^{0i})^{\nu}{}_{\mu} X_{\nu} - \frac{i}{2} \omega_{ij} (M^{ij})^{\nu}{}_{\mu} X_{\nu} + O(\omega^2)\\ &= X_{\mu} + \chi^i(\delta^i_{\mu} \eta^{0\nu} - \delta^0_{\mu} \eta^{i\nu}) X_{\nu} + \frac{1}{2} \epsilon_{ijk} \theta^k (\delta_{\mu}^j \eta^{i\nu} - \delta_{\mu}^i \eta^{j\nu}) X_{\nu} + O(\omega^2)\\ &= X_{\mu} - \delta^0_{\mu} \chi^i X^i + \delta_{\mu}^i(\chi^i X^0 + \epsilon_{ijk} \theta^j X^k) + O(\omega^2) \end{align*}\]

    and so, raising the index, we have the infinitesimal 4-vector transformation rule

    \[X^{\mu} \mapsto_{A} X^{\mu} - \delta^{\mu}_0 \chi^i X^i - \delta^{\mu}_i (\chi^i X^0 + \epsilon_{ijk} \theta^j X^k) + O(\omega^2)\]
  • Now we must show that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu}\chi_L\) also has this transformation rule. See that

    \[\begin{align*} \psi_L^{\dagger} \bar{\sigma}^{\mu} \chi_L &\mapsto_{A} \psi_L^{\dagger} e^{\frac{i}{2} \omega_{\nu\rho} \bar{\sigma}^{\nu\rho}} \bar{\sigma}^{\mu} e^{-\frac{i}{2} \omega_{\nu\rho} \sigma^{\nu\rho}} \chi_L\\ &= \psi_L^{\dagger} \bar{\sigma}^{\mu} \chi_L + \frac{i}{2} \omega_{\nu\rho} \psi_L^{\dagger} (\bar{\sigma}^{\nu\rho} \bar{\sigma}^{\mu} - \bar{\sigma}^{\mu} \sigma^{\nu\rho}) \chi_L + O(\omega^2) \end{align*}\]

    Now using \(\omega_{\mu\nu} \bar{\sigma}^{\mu\nu} = \theta^i \sigma^i + i \chi^i \sigma^i\) and \(\omega_{\mu\nu} \sigma^{\mu\nu} = \theta^i \sigma^i - i\chi^i \sigma^i\), we can write

    \[\begin{align*} \frac{i}{2} \omega_{\nu\rho} (\bar{\sigma}^{\nu\rho} \bar{\sigma}^{\mu} - \bar{\sigma}^{\mu} \sigma^{\nu\rho}) &= \frac{i}{2} \theta^i [\sigma^i, \bar{\sigma}^{\mu}] - \frac{1}{2} \chi^i \{\sigma^i, \bar{\sigma}^{\mu}\}\\ &= \begin{cases} -\chi^i \sigma^i, & \mu = 0\\ -\epsilon^{jik} \theta^i \sigma^k - \chi^j \sigma^0, & \mu = j \end{cases} \end{align*}\]

    where we have used \(\chi^i \delta^{ij} \equiv \chi^i \eta^{ik} \delta^j_k = -\chi^i \delta^j_i = -\chi^j\). This exactly matches the transformation rule of a 4-vector.

Further, we have that the contraction of any two 4-vectors is a scalar, telling us that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \chi_L\) and \(\psi_R^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_R\) are scalars, since \(\partial_{\mu}\) transforms as a 4-vector.

  • In more detail, given two 4-vectors \(X_{\mu}\) and \(Y_{\mu}\), see that from their transformation rules,

    \[\begin{align*} X_{\mu} Y^{\mu} &\mapsto_{A} X_{\mu} Y^{\mu} - \chi^i X_0 Y^i - X_i(\chi^i Y^0 + \epsilon_{ijk} \theta^j Y^k) - \chi^i Y^0 X^i + Y^i(\chi^i X^0 + \epsilon_{ijk} \theta^j X^k) + O(\omega^2)\\ &= X_{\mu} Y^{\mu} + O(\omega^2) \end{align*}\]

    with all terms cancelling. As a result, contracting two 4-vectors gives a scalar as required.
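This invariance can also be confirmed numerically to all orders. A sketch, assuming the conventions \(\bar{\sigma}^{\mu} = (I, -\sigma^i)\) and \(L(A) = \exp(-\frac{i}{2}\theta \cdot \sigma - \frac{1}{2}\chi \cdot \sigma)\) (explicit forms assumed here, consistent with the infinitesimal rules above): the \(\eta\)-contraction of two bilinears \(\psi^{\dagger} \bar{\sigma}^{\mu} \chi\) is unchanged when all four spinors transform with \(L(A)\).

```python
import numpy as np

sig = [np.array([[0, 1], [1, 0]], dtype=complex),
       np.array([[0, -1j], [1j, 0]]),
       np.array([[1, 0], [0, -1]], dtype=complex)]
sbar = [np.eye(2, dtype=complex)] + [-s for s in sig]  # sigma-bar^mu = (I, -sigma^i)
eta = np.diag([1.0, -1.0, -1.0, -1.0])

def expm(M, terms=40):
    # Taylor-series matrix exponential (adequate for small 2x2 matrices)
    out, term = np.eye(2, dtype=complex), np.eye(2, dtype=complex)
    for n in range(1, terms):
        term = term @ M / n
        out = out + term
    return out

rng = np.random.default_rng(2)
theta, chi = rng.normal(size=3), rng.normal(size=3)
L = expm(sum((-0.5j * t - 0.5 * c) * s for t, c, s in zip(theta, chi, sig)))

def rand_spinor():
    return rng.normal(size=2) + 1j * rng.normal(size=2)

p1, c1, p2, c2 = (rand_spinor() for _ in range(4))

def V(p, c):
    # the candidate 4-vector V^mu = psi^dagger sigma-bar^mu chi
    return np.array([p.conj() @ sb @ c for sb in sbar])

before = V(p1, c1) @ eta @ V(p2, c2)
after = V(L @ p1, L @ c1) @ eta @ V(L @ p2, L @ c2)
assert np.isclose(before, after)
```

Under the transformation, each bilinear picks up the same Lorentz matrix \(\Lambda(A)\), and the check confirms that \(\Lambda(A)\) preserves the Minkowski metric \(\eta\).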

For our initial action, we will start with only "kinetic"-like spinor terms, involving first-order derivatives:

\[\begin{equation} \label{eqn:Skinetic} S[\Psi] = i\int_M d^4 x \, \left(\sum_{i=1}^{N_L} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} + \sum_{j=1}^{N_R} \chi_{R, j}^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_{R, j}\right) \end{equation}\]

Importantly, by construction, this action will satisfy Equation \ref{eqn:Sinvar} restricted to only the spacetime symmetry group \(G^{(0)}\).

  • The factor of \(i\) in this action ensures that the integrand is real-valued up to a total derivative; assuming appropriate boundary conditions (such that the total derivative term vanishes under integration), this ensures that \(S[\Psi] \in \mathbb{R}\). In more detail, see that

    \[\begin{align*} (i\psi_L^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_L)^{\dagger} &= -i(\partial_{\mu} \psi_L^{\dagger}) \bar{\sigma}^{\mu} \psi_L\\ &= i\psi_L^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_L + \partial_{\mu}(-i\psi_L^{\dagger} \bar{\sigma}^{\mu} \psi_L) \end{align*}\]

    hence, up to a total derivative, \(i\psi_L^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_L\) is real-valued.

Terms of the form \(\psi_L^{\dagger} \chi_R\) and \(\chi_R^{\dagger} \psi_L\) will be later used as interaction terms for our theory (Section 5), corresponding to Yukawa terms in the Standard Model.

3. Gauge invariance via minimal coupling

We will use the notation \(\psi_{L, i} \in C^{\infty}(M, V^{(L, i)})\) and \(\chi_{R, j} \in C^{\infty}(M, V^{(R, j)})\).

We will now begin to consider gauge invariance. Currently, the field content consists of left-handed and right-handed spinors that only possess a spacetime sector \(V^{(L, i)} = V^{(L, i, 0)} = V_{1, 0}\) (for all \(i = 1, \ldots, N_L\)) and \(V^{(R, j)} = V^{(R, j, 0)} = V_{0, 1}\) (for all \(j = 1, \ldots, N_R\)) respectively.

When considering gauge groups, we promote the output spaces of these spinors:

\[V^{(L, i)} = V^{(L, i, 0)} \to V^{(L, i, 0)} \otimes V^{(L, i, 1)} \otimes \cdots \otimes V^{(L, i, K)},\] \[V^{(R, j)} = V^{(R, j, 0)} \to V^{(R, j, 0)} \otimes V^{(R, j, 1)} \otimes \cdots \otimes V^{(R, j, K)},\]

As outlined in the introduction, we will consider gauge groups \((G^{(1)}, \ldots, G^{(K)})\) that come with (unitary) left-handed representations \(\{\rho^{(L, i, k)}\}_{i, k}\) each acting on the gauge sector \(V^{(L, i, k)}\) of \(\psi_{L, i} \in C^{\infty}(M, V^{(L, i)})\), and (unitary) right-handed representations \(\{\rho^{(R, j, k)}\}_{j, k}\) each acting on the gauge sector \(V^{(R, j, k)}\) of \(\chi_{R, j} \in C^{\infty}(M, V^{(R, j)})\). By Equation \ref{eqn:Sinvar}, we require that our action \(S[\Psi]\) is invariant to these representations.

  • Any (finite-dimensional) representation of a compact group is equivalent to a unitary representation, so we may as well take our representations to be unitary.

We therefore ask: how can we modify the kinetic action \ref{eqn:Skinetic} to make it gauge invariant? A starting point is to understand the extent to which it fails to be gauge invariant. See that, for a gauge element \(g \in C^{\infty}(M, G^{(k)})\),

\[\begin{align*} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} &\mapsto_g \psi_{L, i}^{\dagger} \rho_{g}^{(L, i, k)\dagger} \bar{\sigma}^{\mu} \partial_{\mu} (\rho_{g}^{(L, i, k)} \psi_{L, i})\\ &= \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} + \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} [\rho_{g}^{(L, i, k)\dagger} (\partial_{\mu} \rho_{g}^{(L, i, k)})] \psi_{L, i} \end{align*}\]

where in the second line we have used unitarity and that \(\rho_g \bar{\sigma}^{\mu} \equiv \bar{\sigma}^{\mu} \rho_g\) since \(\rho_g\) does not interact with the spacetime sector. The second term captures the failure to achieve gauge invariance. What can we add to our initial action Equation \ref{eqn:Skinetic} in order to remedy this? In general, it appears we must modify the kinetic term

\[\psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} \to \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i}\]

for some 4-vector \(X^{(L, i, k)}\) (it must be a 4-vector, as otherwise the term would no longer be a scalar) that interacts with the gauge sector \(V^{(L, i, k)}\). Denote its transformation under \(G^{(k)}\) as \(X_{\mu}^{(L, i, k)} \mapsto_g \tilde{X}_{\mu}^{(L, i, k)}\). Then this new term transforms as:

\[\begin{align*} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i} \mapsto_g \; &\psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} + \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} [\rho_{g}^{(L, i, k)\dagger} (\partial_{\mu} \rho_{g}^{(L, i, k)})] \psi_{L, i}\\ &- \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} [\rho_g^{(L, i, k)\dagger} \tilde{X}_{\mu}^{(L, i, k)} \rho_g^{(L, i, k)}] \psi_{L, i} \end{align*}\]

One can see that we will achieve gauge invariance, meaning

\[\psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i} \mapsto_g \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i}\]

if we choose \(X^{(L, i, k)}\) to transform as

\[X_{\mu}^{(L, i, k)} \mapsto_g \tilde{X}_{\mu}^{(L, i, k)} = \rho_g^{(L, i, k)} X_{\mu}^{(L, i, k)} \rho_g^{(L, i, k)\dagger} + (\partial_{\mu} \rho_g^{(L, i, k)}) \rho_g^{(L, i, k)\dagger}\]
  • The same procedure follows identically for all left-handed and right-handed kinetic terms, with associated 4-vectors \(\{X^{(L, i, k)}\}_{i=1}^{N_L} \cup \{X^{(R, j, k)}\}_{j=1}^{N_R}\) each having their own contribution in the action.

Infinitesimal gauge transformations. Writing \(g = \exp(\alpha)\) for \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\), note that \(\rho_g^{(L, i, k)} = \exp(d_{\alpha}^{(L, i, k)})\) for a representation \(d^{(L, i, k)}\) of \(\mathfrak{g}^{(k)}\). Unitarity of \(\rho^{(L, i, k)}\) implies that \(d^{(L, i, k)}\) is anti-Hermitian.

See that by Taylor expansion, for infinitesimal \(\alpha\), \(X_{\mu}^{(L, i, k)}\) transforms under \(g = \exp(\alpha)\) as

\[X_{\mu}^{(L, i, k)} \mapsto_{g} X_{\mu}^{(L, i, k)} + [d_{\alpha}^{(L, i, k)}, X_{\mu}^{(L, i, k)}] + \partial_{\mu} d_{\alpha}^{(L, i, k)} + O(\alpha^2)\]

One can see that this transformation rule will take a particularly convenient form if we write

\[X_{\mu}^{(L, i, k)} = d^{(L, i, k)}_{A_{\mu}^{(k)}}\]

for some \(A_{\mu}^{(k)} \in \mathfrak{g}^{(k)}\), which gives

\[\begin{align*} X_{\mu}^{(L, i, k)} &\mapsto_g X_{\mu}^{(L, i, k)} + [d_{\alpha}^{(L, i, k)}, d^{(L, i, k)}_{A_{\mu}^{(k)}}] + \partial_{\mu} d_{\alpha}^{(L, i, k)} + O(\alpha^2)\\ &= X_{\mu}^{(L, i, k)} + d_{[\alpha, A_{\mu}^{(k)}]}^{(L, i, k)} + \partial_{\mu} d_{\alpha}^{(L, i, k)} + O(\alpha^2) \end{align*}\]

which can equivalently be described by a transformation of \(A_{\mu}^{(k)}\):

\[A_{\mu}^{(k)} \mapsto_{g} A_{\mu}^{(k)} + [\alpha, A_{\mu}^{(k)}] + \partial_{\mu} \alpha + O(\alpha^2)\]

Hence, infinitesimally, we can view the role of all of \(\{X^{(L, i, k)}\}_{i=1}^{N_L} \cup \{X^{(R, j, k)}\}_{j=1}^{N_R}\) as reducible to a single gauge field \(A^{(k)}\), transforming via the above rule.
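The commutator piece of this rule is just the first-order expansion of conjugation by \(\rho_g\) (the \(\partial_{\mu} \alpha\) piece comes from the derivative term, which we drop here by taking \(\alpha\) constant). A numerical sketch with \(2 \times 2\) anti-Hermitian matrices standing in for \(d_{\alpha}\) and \(X_{\mu}\):

```python
import numpy as np

def expm(M, terms=40):
    # Taylor-series matrix exponential (adequate for small 2x2 matrices)
    out, term = np.eye(2, dtype=complex), np.eye(2, dtype=complex)
    for n in range(1, terms):
        term = term @ M / n
        out = out + term
    return out

def rand_antiherm(rng):
    # random anti-Hermitian 2x2 matrix (a Lie-algebra element of U(2))
    M = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    return (M - M.conj().T) / 2

rng = np.random.default_rng(3)
X = rand_antiherm(rng)
eps = 1e-4
alpha = eps * rand_antiherm(rng)    # infinitesimal gauge parameter

g = expm(alpha)                     # rho_g = exp(alpha); unitary since alpha is anti-Hermitian
exact = g @ X @ g.conj().T          # rho_g X rho_g^dagger
first_order = X + alpha @ X - X @ alpha  # X + [alpha, X]

# the first-order rule is accurate up to O(alpha^2)
assert np.linalg.norm(exact - first_order) < 100 * eps**2
```

Shrinking `eps` makes the residual fall off quadratically, confirming that the neglected terms are \(O(\alpha^2)\).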

Achieving gauge invariance. Repeating this for each gauge group \(k=1, \ldots, K\), the above process corresponds to extending our field content

\[\Psi \mapsto \Psi \cup \{A^{(k)}\}_{k=1}^{K}\]

to include a gauge field \(A^{(k)} \in C^{\infty}(M, \mathfrak{g}^{(k)})\) for each gauge group \(G^{(k)}\), with \(g = \exp(\alpha) \in C^{\infty}(M, G^{(k)})\) transforming the field content as

\[\psi_{L, i} \mapsto_g \rho_g^{(L, i, k)} \psi_{L, i}, \quad \chi_{R, j} \mapsto_g \rho_g^{(R, j, k)} \chi_{R, j}, \quad A_{\mu}^{(k)} \mapsto_g A_{\mu}^{(k)} + [\alpha, A_{\mu}^{(k)}] + \partial_{\mu} \alpha + O(\alpha^2)\]

(and leaving \(A^{(k')}\) invariant for \(k' \neq k\)).

This procedure for achieving gauge invariance is called minimal coupling, the name reflecting that we are essentially performing the minimal modification to the kinetic action that achieves gauge invariance.

We can write the new gauge invariant action as

\[\begin{equation} \label{eqn:Sgauged} S[\Psi] = i\int_M d^4 x \, \left(\sum_{i=1}^{N_L} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} D_{\mu}^{(L, i)} \psi_{L, i} + \sum_{j=1}^{N_R} \chi_{R, j}^{\dagger} \sigma^{\mu} D_{\mu}^{(R, j)} \chi_{R, j}\right) \end{equation}\]

where we have defined the covariant derivatives:

\[D_{\mu}^{(L, i)} := \partial_{\mu} - \sum_{k=1}^{K} g_k d^{(L, i, k)}_{A_{\mu}^{(k)}}, \qquad D_{\mu}^{(R, j)} := \partial_{\mu} - \sum_{k=1}^{K} g_k d^{(R, j, k)}_{A_{\mu}^{(k)}}\]

and where we have introduced a coupling constant \(g_k\) for each gauge field \(A^{(k)}\), equivalent to the rescaling \(A^{(k)} \mapsto g_k A^{(k)}\), which slightly modifies the transformation rule to:

\[A_{\mu}^{(k)} \mapsto_{g} A_{\mu}^{(k)} + [\alpha, A_{\mu}^{(k)}] + \frac{1}{g_k}\partial_{\mu} \alpha + O(\alpha^2)\]

Field strengths. For our discussion of anomalies, it will be useful to introduce the field strengths \(F_{\mu\nu}^{(L, i)}\):

\[F_{\mu\nu}^{(L, i)} := -[D_{\mu}^{(L, i)}, D_{\nu}^{(L, i)}] = \sum_{k=1}^{K} g_k d^{(L, i, k)}_{f_{\mu\nu}^{(k)}} =: \sum_{k=1}^{K} g_k F_{\mu\nu}^{(L, i, k)}\]

where we have introduced \(f_{\mu\nu}^{(k)} \in C^{\infty}(M, \mathfrak{g}^{(k)})\), defined

\[f_{\mu\nu}^{(k)} := \partial_{\mu} A_{\nu}^{(k)} - \partial_{\nu} A_{\mu}^{(k)} - g_k[A_{\mu}^{(k)}, A_{\nu}^{(k)}]\]

and \(F_{\mu\nu}^{(L, i, k)} := d^{(L, i, k)}_{f_{\mu\nu}^{(k)}}\).

4. Anomalies

In the above, we have restricted our attention to satisfying Equation \ref{eqn:Sinvar}. In this section, we will address the second condition of Equation \ref{eqn:Dinvar}: the measure \(\text{D}\Psi\) must be invariant to the symmetry group \(G\) of our theory. If the measure fails to be invariant, we say that there is an anomaly in our theory. The condition that the anomaly vanishes places strong constraints on the group content \(G\) and representation content \(\rho\) that we can consider, as we will see shortly.

  • Note that the measures \(\text{D}\Psi_i\) associated with non-spinor fields, such as gauge bosons, do not contribute anything to the anomaly, which is why we can restrict our attention to spinor measures below. Further, since the spacetime group \(G^{(0)}\) acts non-chirally on left-handed and right-handed spinors (i.e. acts identically), transformations under \(G^{(0)}\) will not introduce an anomaly, allowing us to further restrict our attention to the gauge groups \((G^{(1)}, \ldots, G^{(K)})\).

Gauge transformation rule for spinor measures. The study of whether \(\text{D}\Psi\) is invariant requires proposing a specific definition for \(\text{D}\Psi\). As mentioned in the introduction, we can do so by expanding \(\Psi\) in a basis of \(C^{\infty}(M, V)\) and performing some form of regulation. To construct such a basis, we will consider the Dirac operators

\[\slashed{D}^{(L, i)} := \gamma^{\mu} D_{\mu}^{(L, i)}, \quad \slashed{D}^{(R, j)} := \gamma^{\mu} D_{\mu}^{(R, j)}\]

which have eigenfunctions \(\{\phi_{n}^{(i)}\}_{n}\) and \(\{\xi_{n}^{(j)}\}_n\) satisfying

\[i\slashed{D}^{(L, i)} \phi_{n}^{(i)} = \lambda_n^{(i)} \phi_n^{(i)}, \qquad i\slashed{D}^{(R, j)} \xi_{n}^{(j)} = \mu_n^{(j)} \xi_n^{(j)}\]

for eigenvalues \(\{\lambda_n^{(i)}\}_n\) and \(\{\mu_n^{(j)}\}_n\) respectively.

Since \(\{\phi_{n}^{(i)}\}_{n}\) and \(\{\xi_{n}^{(j)}\}_n\) are each a basis for Dirac spinors, we can project them to a basis for left-handed and right-handed spinors using the projection matrices

\[P_L := \frac{1}{2}(I + \gamma^5), \qquad P_R : = \frac{1}{2}(I - \gamma^5)\]

for the \(4 \times 4\) gamma matrix \(\gamma^5 = \begin{bmatrix} I&0\\ 0&-I \end{bmatrix}\) (written in a chiral basis). Since \(\gamma^{\mu} \gamma^5 = -\gamma^5 \gamma^{\mu}\), then

\[i\slashed{D}^{(L, i)} \phi_{n}^{(i)} = \lambda_n^{(i)} \phi_n^{(i)} \iff i\slashed{D}^{(L, i)} (\gamma^5 \phi_{n}^{(i)}) = -\lambda_n^{(i)} (\gamma^5 \phi_n^{(i)})\]

and similarly for \(\slashed{D}^{(R, j)}\). This implies that the projected eigenfunctions \(\{P_L \phi^{(i)}_n\}_n\) contain two copies of \(\{P_L \phi^{(i)}_n\}_{n: \lambda_n > 0}\) since \(P_L \gamma^5 = \gamma^5 P_L = P_L\). To avoid overcompleteness and keep orthonormality, we must restrict to some subset \(\{P_L \phi_{\gamma}^{(i)}\}_{\gamma}\) of projected eigenfunctions for left-handed spinors (and \(\{P_R \xi_{\beta}^{(j)}\}_{\beta}\) for right-handed spinors). This lets us write

\[\psi_{L, i}(x) = \sum_{\gamma} a_{\gamma}^{(i)} P_L \phi_{\gamma}^{(i)}(x), \qquad \psi_{L, i}^{\dagger}(x) = \sum_{\gamma} b_{\gamma}^{(i)} \phi_{\gamma}^{(i)\dagger}(x) P_L\] \[\chi_{R, j}(x) = \sum_{\beta} c_{\beta}^{(j)} P_R \xi_{\beta}^{(j)}(x), \qquad \chi_{R, j}^{\dagger}(x) = \sum_{\beta} d_{\beta}^{(j)} \xi_{\beta}^{(j)\dagger}(x) P_R\]

By orthonormality, we have

\[\int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) P_L \phi_{\gamma'}^{(i)}(x) = \delta_{\gamma\gamma'}, \qquad \int d^4 x \, \xi_{\beta}^{(j)\dagger}(x) P_R \xi_{\beta'}^{(j)}(x) = \delta_{\beta\beta'}\]

We can then take

\[\text{D}\psi_{L, i} := \prod_{\gamma} da_{\gamma}^{(i)}, \qquad \text{D}\psi_{L, i}^{\dagger} := \prod_{\gamma} db_{\gamma}^{(i)}\] \[\text{D}\chi_{R, j} := \prod_{\beta} dc_{\beta}^{(j)}, \qquad \text{D}\chi_{R, j}^{\dagger} := \prod_{\beta} dd_{\beta}^{(j)}\]
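The projector identities used in this construction (\(P_{L/R}\) idempotent, mutually orthogonal, complete, and absorbing \(\gamma^5\)) are immediate to verify numerically; a minimal sketch:

```python
import numpy as np

g5 = np.diag([1.0, 1.0, -1.0, -1.0])   # gamma^5 in the chiral basis above
I4 = np.eye(4)
PL, PR = (I4 + g5) / 2, (I4 - g5) / 2

assert np.allclose(PL @ PL, PL) and np.allclose(PR @ PR, PR)  # idempotent
assert np.allclose(PL @ PR, np.zeros((4, 4)))                 # orthogonal
assert np.allclose(PL + PR, I4)                               # complete
assert np.allclose(PL @ g5, PL) and np.allclose(g5 @ PL, PL)  # P_L gamma^5 = gamma^5 P_L = P_L
```

The last line is the property used above to conclude that \(P_L \phi_n^{(i)}\) and \(P_L (\gamma^5 \phi_n^{(i)})\) coincide.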

To achieve Equation \ref{eqn:Dinvar}, we wish to understand how these measures transform under the gauge transformation associated with \(g = \exp(\alpha) \in C^{\infty}(M, G^{(k)})\) (for \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\)) for arbitrary \(k\):

\[\psi_{L, i} \mapsto_{g} \psi_{L, i} + d^{(L, i, k)}_{\alpha} \psi_{L, i} + O(\alpha^2), \qquad \chi_{R, j} \mapsto_{g} \chi_{R, j} + d^{(R, j, k)}_{\alpha} \chi_{R, j} + O(\alpha^2)\]

Using orthonormality, one can show that this transformation is equivalent to transforming the expansion coefficients as

\[a_{\gamma}^{(i)} \mapsto_g \sum_{\gamma'} a_{\gamma'}^{(i)} (\delta_{\gamma\gamma'} + A_{\gamma\gamma'}^{(i, k)}(\alpha)), \qquad b_{\gamma}^{(i)} \mapsto_g \sum_{\gamma'} b_{\gamma'}^{(i)} (\delta_{\gamma\gamma'} + A_{\gamma\gamma'}^{(i, k)\dagger}(\alpha))\] \[c_{\beta}^{(j)} \mapsto_g \sum_{\beta'} c_{\beta'}^{(j)} (\delta_{\beta\beta'} + B_{\beta\beta'}^{(j, k)}(\alpha)), \qquad d_{\beta}^{(j)} \mapsto_g \sum_{\beta'} d_{\beta'}^{(j)} (\delta_{\beta\beta'} + B_{\beta\beta'}^{(j, k)\dagger}(\alpha))\]

defining

\[A_{\gamma\gamma'}^{(i, k)}(\alpha) := \int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L \phi_{\gamma'}^{(i)}(x)\] \[B_{\beta\beta'}^{(j, k)}(\alpha) := \int d^4 x \, \xi_{\beta}^{(j)\dagger}(x) d_{\alpha}^{(R, j, k)} P_R \xi_{\beta'}^{(j)}(x)\]

This results in the transformation rules

\[\text{D}\psi_{L, i} \mapsto_{g} \det(I+A^{(i, k)}(\alpha))^{-1} \text{D}\psi_{L, i}, \qquad \text{D}\psi_{L, i}^{\dagger} \mapsto_{g} \det(I+A^{(i, k)\dagger}(\alpha))^{-1} \text{D}\psi_{L, i}^{\dagger}\] \[\text{D}\chi_{R, j} \mapsto_{g} \det(I+B^{(j, k)}(\alpha))^{-1} \text{D}\chi_{R, j}, \qquad \text{D}\chi_{R, j}^{\dagger} \mapsto_{g} \det(I+B^{(j, k)\dagger}(\alpha))^{-1} \text{D}\chi_{R, j}^{\dagger}\]
  • We get factors like \(\det(I+X)^{-1}\) rather than \(\det(I+X)\) since spinors are Grassmann-valued, meaning the expansion coefficients are also Grassmann-valued, which results in an inverted Jacobian.

Using \(\det(I+X) \approx \det(e^X) = e^{\text{tr}(X)}\) for infinitesimal \(X\), we have that overall:

\[\text{D}\psi_{L, i} \text{D}\psi_{L, i}^{\dagger} \mapsto_{g} J_g^{(L, i, k)} \text{D}\psi_{L, i} \text{D}\psi_{L, i}^{\dagger}, \qquad J_g^{(L, i, k)} := \exp(-2\text{Re}(\text{tr}(A^{(i, k)}(\alpha))))\] \[\text{D}\chi_{R, j} \text{D}\chi_{R, j}^{\dagger} \mapsto_{g} J_g^{(R, j, k)} \text{D}\chi_{R, j} \text{D}\chi_{R, j}^{\dagger}, \qquad J_g^{(R, j, k)} := \exp(-2\text{Re}(\text{tr}(B^{(j, k)}(\alpha))))\]

The total anomaly associated with gauge elements \((g_1, \ldots, g_K) \in G^{(1)} \times \cdots \times G^{(K)}\) (with \(g_k = \exp(\alpha_k)\); these gauge elements should not be confused with the coupling constants \(g_k\) introduced earlier) is

\[\begin{align*} \mathcal{A}(g_1, \ldots, g_K) &:= \prod_{k=1}^{K} \left[\left(\prod_{i=1}^{N_L} J_{g_k}^{(L, i, k)}\right) \left(\prod_{j=1}^{N_R} J_{g_k}^{(R, j, k)}\right)\right]\\ &= \exp\left(-2\sum_{k=1}^{K} \left[\sum_{i=1}^{N_L} \text{Re}(\text{tr}(A^{(i, k)}(\alpha_k))) + \sum_{j=1}^{N_R} \text{Re}(\text{tr}(B^{(j, k)}(\alpha_k)))\right]\right) \end{align*}\]

with

\[\text{D}\Psi \mapsto_{(g_1, \ldots, g_K)} \; \mathcal{A}(g_1, \ldots, g_K) \text{D}\Psi\]

Fujikawa regulation. One problem is that the trace terms, e.g.

\[\text{tr}(A^{(i, k)}(\alpha)) = \sum_{\gamma} \int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L \phi_{\gamma}^{(i)}(x)\]

will in general diverge. One can see this problem as downstream of the fact that we defined e.g. \(\text{D}\psi_{L, i}\) to be an infinite product over expansion coefficients. One way of resolving such divergences is to apply a cutoff to make the product finite (as we will do for Wilsonian renormalization in Section 8); however, applying a cutoff breaks the gauge invariance of our theory. Instead, we will consider Fujikawa regulation, which maintains gauge invariance. This involves replacing such trace terms with their regulated variant:

\[\begin{align*} \text{tr}(A^{(i, k)}(\alpha)) \to \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &:= \sum_{\gamma} e^{-\lambda_{\gamma}^2/\Lambda^2}\int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L \phi_{\gamma}^{(i)}(x)\\ &= \sum_{\gamma} \int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L e^{(\slashed{D}^{(L, i)})^2/\Lambda^2}\phi_{\gamma}^{(i)}(x) \end{align*}\]

for some regularization parameter \(\Lambda\). We will eventually take \(\Lambda \to \infty\) at the end. The regulated anomaly is then

\[\mathcal{A}^{(\Lambda)}(g_1, \ldots, g_K) := \exp\left(-2\sum_{k=1}^{K} \left[\sum_{i=1}^{N_L} \text{Re}(\text{tr}_{\Lambda}(A^{(i, k)}(\alpha_k))) + \sum_{j=1}^{N_R} \text{Re}(\text{tr}_{\Lambda}(B^{(j, k)}(\alpha_k)))\right]\right)\]

We will now determine the required conditions on group content \(G\) and representation content \(\rho\) to ensure that

\[\begin{equation} \label{eqn:reganom} \lim_{\Lambda \to \infty} \mathcal{A}^{(\Lambda)}(g_1, \ldots, g_K) = 1 \quad \forall \; (g_1, \ldots, g_K) \in G^{(1)} \times \cdots \times G^{(K)} \end{equation}\]

holds, such that the regulated measure \(\text{D}\Psi^{(\Lambda)}\) is gauge invariant. Note that \(\lim_{\Lambda \to \infty} \mathcal{A}^{(\Lambda)} \neq \mathcal{A}\).

Computing the anomaly. We can write this regulated trace as

\[\begin{align*} \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &= \int d^4 x \, \left[\sum_{\gamma} \phi_{\gamma}^{(i)\dagger} d_{\alpha}^{(L, i, k)} P_L e^{(\slashed{D}^{(L, i)})^2/\Lambda^2}\phi_{\gamma}^{(i)}\right](x)\\ &= \int d^4 x \, \text{tr}(d_{\alpha}^{(L, i, k)} P_L e^{(\slashed{D}^{(L, i)})^2/\Lambda^2})(x)\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-ik\cdot x} e^{(\slashed{D}^{(L, i)})^2/\Lambda^2} e^{ik\cdot x}) \end{align*}\]

where \(\text{tr}_{s, g}\) is a trace over spacetime and gauge indices (i.e. not taken over function space).

Now see that

\[\begin{align*} (\slashed{D}^{(L, i)})^2 &= (D^{(L, i)})^2 +\frac{1}{2} \gamma^{\mu} \gamma^{\nu} [D_{\mu}^{(L, i)}, D_{\nu}^{(L, i)}]\\ &= (D^{(L, i)})^2 - \frac{1}{2} \gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)} \end{align*}\]

This lets us write

\[\begin{align*} \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-ik\cdot x} e^{(\slashed{D}^{(L, i)})^2/\Lambda^2} e^{ik\cdot x})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2} e^{-ik\cdot x} e^{(D^{(L, i)})^2/\Lambda^2} e^{ik\cdot x})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2} e^{(D^{(L, i)} + ik)^2/\Lambda^2})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2} e^{(\partial + ik)^2/\Lambda^2})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, e^{-k^2/\Lambda^2} \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, e^{-k^2/\Lambda^2} \text{tr}_{g}(d_{\alpha}^{(L, i, k)} \text{tr}_s(P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2})) \end{align*}\]

where in the fourth line we have translated \(k_{\mu} \to k_{\mu} - i\sum_{r=1}^{K} g_r d_{A_{\mu}^{(r)}}^{(L, i, r)}\).

We have that

\[\begin{align*} \text{tr}_s(P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) &= \frac{1}{2} \text{tr}_s(e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) + \frac{1}{2} \text{tr}_s(\gamma^5 e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2})\\ &= 2 - \frac{1}{2\Lambda^4} \left(F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu} + iF_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu}\right) + O(1/\Lambda^6) \end{align*}\]

by using

\[\text{tr}_s(e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) = 4 - \frac{1}{\Lambda^4} F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu} + O(1/\Lambda^6)\] \[\text{tr}_s(\gamma^5 e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) = -\frac{i}{\Lambda^4} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu} + O(1/\Lambda^6)\]

and where \(*\) is the Hodge star operator, with \(F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu} \equiv \frac{1}{2} \epsilon^{\mu\nu\sigma\rho} F_{\mu\nu}^{(L, i)} F^{(L, i)}_{\sigma\rho}\).

  • We have also made use of the identities

    \[\text{tr}(\gamma^{\mu} \gamma^{\nu}) = 4\eta^{\mu\nu}, \quad \text{tr}(\gamma^{\mu} \gamma^{\nu} \gamma^{\sigma} \gamma^{\rho}) = 4(\eta^{\mu\nu} \eta^{\sigma\rho} - \eta^{\mu\sigma} \eta^{\nu\rho} + \eta^{\mu\rho} \eta^{\nu\sigma})\] \[\text{tr}(\gamma^5 \gamma^{\mu} \gamma^{\nu}) = 0, \quad \text{tr}(\gamma^5 \gamma^{\mu} \gamma^{\nu} \gamma^{\sigma} \gamma^{\rho}) = -4i\epsilon^{\mu\nu\sigma\rho}\]

    which are standard gamma-matrix trace identities.
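These identities are straightforward to verify numerically in an explicit chiral basis; a sketch, assuming \(\gamma^0 = \begin{bmatrix} 0&I\\ I&0 \end{bmatrix}\), \(\gamma^i = \begin{bmatrix} 0&-\sigma^i\\ \sigma^i&0 \end{bmatrix}\) (a basis chosen so that \(\gamma^5 = \text{diag}(I, -I)\) as above) and the convention \(\epsilon^{0123} = +1\):

```python
import numpy as np

sig = [np.array([[0, 1], [1, 0]], dtype=complex),
       np.array([[0, -1j], [1j, 0]]),
       np.array([[1, 0], [0, -1]], dtype=complex)]
I2, Z2 = np.eye(2, dtype=complex), np.zeros((2, 2), dtype=complex)

def block(a, b):
    # off-diagonal block matrix [[0, a], [b, 0]]
    return np.block([[Z2, a], [b, Z2]])

# chiral-basis gammas chosen so that gamma^5 = diag(I, -I)
gam = [block(I2, I2)] + [block(-s, s) for s in sig]
g5 = 1j * gam[0] @ gam[1] @ gam[2] @ gam[3]
eta = np.diag([1.0, -1.0, -1.0, -1.0])

assert np.allclose(g5, np.diag([1.0, 1.0, -1.0, -1.0]).astype(complex))

def eps(m, n, s, r):
    # Levi-Civita symbol with eps^{0123} = +1 (sign via bubble sort)
    perm = [m, n, s, r]
    if len(set(perm)) < 4:
        return 0
    sign = 1
    for i in range(4):
        for j in range(3 - i):
            if perm[j] > perm[j + 1]:
                perm[j], perm[j + 1] = perm[j + 1], perm[j]
                sign = -sign
    return sign

for m in range(4):
    for n in range(4):
        assert np.isclose(np.trace(gam[m] @ gam[n]), 4 * eta[m, n])
        assert np.isclose(np.trace(g5 @ gam[m] @ gam[n]), 0)
        for s_ in range(4):
            for r in range(4):
                lhs = np.trace(gam[m] @ gam[n] @ gam[s_] @ gam[r])
                rhs = 4 * (eta[m, n] * eta[s_, r] - eta[m, s_] * eta[n, r]
                           + eta[m, r] * eta[n, s_])
                assert np.isclose(lhs, rhs)
                lhs5 = np.trace(g5 @ gam[m] @ gam[n] @ gam[s_] @ gam[r])
                assert np.isclose(lhs5, -4j * eps(m, n, s_, r))
```

Since the \(O(1/\Lambda^4)\) expansions above follow from these four trace identities, this check also implicitly validates those expansions.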

This lets us write

\[\begin{align*} \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &= \int d^4 x \left[\frac{\Lambda^4}{8\pi^2} \text{tr}(d_{\alpha}^{(L, i, k)}) - \frac{1}{32\pi^2} \text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu} + i d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu}) + O(1/\Lambda^2)\right] \end{align*}\]

where we have used \(\int \frac{d^4 k}{(2\pi)^4} \, e^{-k^2/\Lambda^2} = \frac{\Lambda^4}{16\pi^2}\).
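The Gaussian momentum integral quoted here factorizes into four identical one-dimensional Gaussians, \(\left(\int \frac{dk}{2\pi} e^{-k^2/\Lambda^2}\right)^4 = \left(\frac{\sqrt{\pi}\Lambda}{2\pi}\right)^4 = \frac{\Lambda^4}{16\pi^2}\), which can be confirmed numerically:

```python
import numpy as np

Lam = 1.7  # arbitrary regulator scale
x = np.linspace(-60.0, 60.0, 200001)
dx = x[1] - x[0]
one_d = np.sum(np.exp(-x**2 / Lam**2)) * dx   # = sqrt(pi) * Lam to high accuracy
val = one_d**4 / (2 * np.pi)**4               # the 4d integral factorizes
expected = Lam**4 / (16 * np.pi**2)
assert np.isclose(val, expected, rtol=1e-6)
```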

Relevant to the anomaly is the real part of this regulated trace. By anti-Hermiticity of \(d^{(L, i, k)}\) (due to unitarity of \(\rho^{(L, i, k)}\)), we have that \(\text{tr}(d_{\alpha}^{(L, i, k)})\) is purely imaginary, as is \(\text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu})\), while the third term satisfies

\[(-i\text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu}))^* = -i\text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu})\]

meaning it is purely real. As a result, we have that

\[\text{Re}(\text{tr}_{\Lambda}(A^{(i, k)}(\alpha))) = -\frac{i}{32\pi^2} \int d^4 x \, \text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu})\]

Now recall that we can write

\[F_{\mu\nu}^{(L, i)} = \sum_{k=1}^{K} g_k f_{\mu\nu}^{(k)a} d_{T_a^{(k)}}^{(L, i, k)}\]

for some choice of generators \(\{T_a^{(k)}\}_a \subset \mathfrak{g}_k\), and writing \(f_{\mu\nu}^{(k)} = f_{\mu\nu}^{(k)a} T_a^{(k)}\). This lets us explicitly write

\[\begin{align*} \text{Re}(\text{tr}_{\Lambda}(A^{(i, k)}(\alpha))) &= -\frac{i}{64\pi^2} \epsilon^{\mu\nu\sigma\rho} \int d^4 x \, \text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} F^{(L, i)}_{\sigma\rho})\\ &= -\frac{i}{64\pi^2} \epsilon^{\mu\nu\sigma\rho} \sum_{r=1}^{K} \sum_{s=1}^{K} g_r g_s \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(s)}}^{(L, i, s)}) \int d^4 x \, \alpha^a f_{\mu\nu}^{(r) b} f_{\sigma\rho}^{(s) c} \end{align*}\]

The right-handed anomaly is identical but with a minus sign, which stems from the fact that \(P_R = \frac{1}{2}(1 - \gamma^5)\) whereas \(P_L = \frac{1}{2}(1+\gamma^5)\). Namely, we have

\[\text{Re}(\text{tr}_{\Lambda}(B^{(j, k)}(\alpha))) = \frac{i}{64\pi^2} \epsilon^{\mu\nu\sigma\rho} \sum_{r=1}^{K} \sum_{s=1}^{K} g_r g_s \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(s)}}^{(R, j, s)}) \int d^4 x \, \alpha^a f_{\mu\nu}^{(r) b} f_{\sigma\rho}^{(s) c}\]

This gives the total regulated anomaly:

\[\begin{align*} \mathcal{A}^{(\Lambda)}(g_1, \ldots, g_K) &= \exp\Bigg(\frac{i}{32\pi^2} \epsilon^{\mu\nu\sigma\rho} \sum_{k=1}^{K} \sum_{r=1}^{K} \sum_{s=1}^{K} g_r g_s \int d^4 x \, \alpha_k^a f_{\mu\nu}^{(r)b} f_{\sigma\rho}^{(s)c}\\ &\qquad\qquad \left[\sum_{i=1}^{N_L} \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(s)}}^{(L, i, s)}) - \sum_{j=1}^{N_R} \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(s)}}^{(R, j, s)})\right]\Bigg) \end{align*}\]

Requiring the anomaly to cancel for all values of the couplings \(\{g_k\}_{k=1}^{K}\) and all gauge elements \(\{\alpha_k\}_{k=1}^{K}\), the total anomaly of the theory cancels (i.e. Equation \ref{eqn:reganom} is satisfied) iff

\[\begin{equation} \label{eqn:anomalycond} \boxed{\sum_{i=1}^{N_L} C_{L, i}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(s)}}^{(L, i, s)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(s)}}^{(R, j, s)}) \quad \forall \; \; k, r, s \quad \forall \; \; a, b, c} \end{equation}\]

where we have factored out the dimensions of the gauge sectors not involved in each trace, captured by

\[C_{L, i}(a, b, \ldots, c) := \prod_{k \notin \{a, b, \ldots, c\}} \dim d^{(L, i, k)}, \qquad C_{R, j}(a, b, \ldots, c) := \prod_{k \notin \{a, b, \ldots, c\}} \dim d^{(R, j, k)}\]

We can split this condition into four distinct cases:

\[\sum_{i=1}^{N_L} C_{L, i}(k) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} \{d_{T_b^{(k)}}^{(L, i, k)}, d_{T_c^{(k)}}^{(L, i, k)}\}) = \sum_{j=1}^{N_R} C_{R, j}(k) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} \{d_{T_b^{(k)}}^{(R, j, k)}, d_{T_c^{(k)}}^{(R, j, k)}\}) \quad \forall \; \; k,\] \[\sum_{i=1}^{N_L} C_{L, i}(k, r) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)}) \text{tr}(d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(r)}}^{(L, i, r)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)}) \text{tr}(d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(r)}}^{(R, j, r)}) \quad \forall \; \; r \neq k,\] \[\sum_{i=1}^{N_L} C_{L, i}(k, r) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(k)}}^{(L, i, k)}) \text{tr}(d_{T_c^{(r)}}^{(L, i, r)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(k)}}^{(R, j, k)}) \text{tr}(d_{T_c^{(r)}}^{(R, j, r)}) \quad \forall \; \; r \neq k,\] \[\sum_{i=1}^{N_L} C_{L, i}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)}) \text{tr}(d_{T_b^{(r)}}^{(L, i, r)}) \text{tr}(d_{T_c^{(s)}}^{(L, i, s)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)}) \text{tr}(d_{T_b^{(r)}}^{(R, j, r)}) \text{tr}(d_{T_c^{(s)}}^{(R, j, s)}) \quad \forall \; \; r \neq s, \; r \neq k, \; s \neq k\]

The final three conditions are called mixed anomaly conditions, since they mix together the representations of different gauge groups.

If the representation content \(\rho\) of our theory satisfies this condition, then the (regulated) measure \(\text{D}\Psi\) will be gauge invariant.

Some things to consider:

  • What about gravitational anomalies that apparently emerge from diffeomorphism invariance?
  • Witten anomalies are missing from the above: sometimes the anomaly condition can be satisfied, yet the theory can still fail to be gauge invariant. This is related to the homotopy groups of the gauge group. If \(\pi_4(G^{(k)}) = 0\), we are fine, and only need the above anomaly condition to hold. Otherwise (e.g. \(\pi_4(\text{Sp}(N)) = \mathbb{Z}_2\) for all \(N \geq 1\)), there can still be a type of anomaly called the Witten anomaly. This is related to the zero modes of the Dirac operators, and is discussed in Witten's original paper on the \(SU(2)\) anomaly.

The special case of the Standard Model. A theory is defined by the triple \((S, G, \rho)\) as well as the field content \(\Psi\). Here we will describe the choices for these components in the Standard Model.

The Standard Model corresponds to the spacetime group \(G^{(0)} = \text{Spin}(1, 3) \cong \text{SL}(2; \mathbb{C})\) as considered above, and gauge groups

\[(G^{(1)}, G^{(2)}, G^{(3)}) = (U(1), SU(2), SU(3))\]

The field content includes 2 left-handed spinors, 3 right-handed spinors, and a Higgs boson:

\[\Psi = (Q_L, L_L, u_R, d_R, e_R, H)\]

To succinctly state the representation content, we will denote the trivial rep of a gauge group by \(\mathbf{1}\), the fundamental rep of \(SU(2)\) by \(\mathbf{2}\), and the fundamental rep of \(SU(3)\) by \(\mathbf{3}\). Further, since the reps of \(U(1)\) are labelled by real numbers \(c \in \mathbb{R}\), we will denote the corresponding rep by \(\mathbf{c}\). We will then use the notation \(\phi: (\mathbf{a}, \mathbf{b})_{c}\) to say that the field \(\phi\) transforms under rep \(\mathbf{a}\) of \(SU(2)\), rep \(\mathbf{b}\) of \(SU(3)\), and rep \(\mathbf{c}\) of \(U(1)\) (where \(a \in \{1, 2\}\), \(b \in \{1, 3\}\), and \(c \in \mathbb{R}\) for our purposes).

Then the representation content of the Standard Model amounts to:

\[Q_L: (\mathbf{2}, \mathbf{3})_{1/6}, \quad L_L: (\mathbf{2}, \mathbf{1})_{-1/2}, \quad u_R: (\mathbf{1}, \mathbf{3})_{2/3},\] \[d_R: (\mathbf{1}, \mathbf{3})_{-1/3}, \quad e_R: (\mathbf{1}, \mathbf{1})_{-1}, \quad H: (\mathbf{2}, \mathbf{1})_{1/2}\]

Why these particular choices? The above is arguably one of the simplest setups that satisfies the anomaly condition Equation \ref{eqn:anomalycond}. This is argued in the notes (gauge theory).
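As a concrete sanity check, the \(U(1)\) pieces of these anomaly conditions can be verified per generation directly from the hypercharges above. The sketch below is our own bookkeeping (chirality signs \(\pm 1\) for left/right-handed fields, multiplicities from the dimensions of the \(SU(2)\) and \(SU(3)\) reps), using exact rational arithmetic:

```python
from fractions import Fraction as F

# One generation: (chirality, dim of SU(2) rep, dim of SU(3) rep, hypercharge).
# Left-handed fields enter the anomaly sums with +1, right-handed with -1.
fields = {
    "Q_L": (+1, 2, 3, F(1, 6)),
    "L_L": (+1, 2, 1, F(-1, 2)),
    "u_R": (-1, 1, 3, F(2, 3)),
    "d_R": (-1, 1, 3, F(-1, 3)),
    "e_R": (-1, 1, 1, F(-1, 1)),
}

# U(1)^3 anomaly: weighted sum of Y^3 over all components
u1_cubed = sum(s * d2 * d3 * y**3 for s, d2, d3, y in fields.values())

# gravitational-U(1) anomaly: weighted sum of Y over all components
grav_u1 = sum(s * d2 * d3 * y for s, d2, d3, y in fields.values())

# SU(2)^2-U(1) mixed anomaly: sum of Y over SU(2) doublets (weighted by colour)
su2_u1 = sum(s * d3 * y for s, d2, d3, y in fields.values() if d2 == 2)

# SU(3)^2-U(1) mixed anomaly: sum of Y over colour triplets (weighted by SU(2) dim)
su3_u1 = sum(s * d2 * y for s, d2, d3, y in fields.values() if d3 == 3)

print(u1_cubed, grav_u1, su2_u1, su3_u1)  # all vanish
```

Each sum cancels exactly, and notably only through a delicate interplay between the quark and lepton hypercharges.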

Todo: should better justify why this particular choice is one of the most simple, and what the other choices are.

5. Interactions

(This section is very unfinished)

The action \ref{eqn:Sgauged} has achieved our goal of satisfying \ref{eqn:Sinvar}: it is invariant under the spacetime group \(G^{(0)}\) by construction, and, through minimal coupling, under the gauge groups \(G^{(1)} \times \cdots \times G^{(K)}\). However, the action does not yet involve any interactions: there are no contractions between distinct spinors (e.g. no terms of the form \(\psi_L^{\dagger} \chi_R\)).

We previously found that terms of the form \(\psi_L^{\dagger} \chi_R\) (and \(\chi_R^{\dagger} \psi_L\)) were invariant to the spacetime group \(G^{(0)}\). But gauge invariance is problematic since, generally,

\[\rho^{(L, i, k)\dagger} \rho^{(R, j, k)} \neq I\]

Indeed, as we will see, the choice of representations in the Standard Model is such that no term of the form \(\psi_L^{\dagger} \chi_R\) can be gauge invariant. This differs from the case of minimal coupling (Section 3), as here it is not an additive term that causes gauge invariance to fail. To achieve gauge invariance in this case, one could imagine introducing a scalar field \(H\) that transforms under the gauge groups precisely such that

\[\psi_L^{\dagger} H \chi_R\]

is gauge invariant. Note that \(H\) must transform as a scalar in order for the term to remain a scalar. Is there some maximal choice of the transformation properties of \(H\), such that we can produce the maximum number of interaction terms possible?

In the Standard Model, \(H\) transforms under the gauge groups as \((\mathbf{2}, \mathbf{1})_{1/2}\), which allows for the interaction terms… Todo

6. Making auxiliary fields dynamical

(This section is unfinished)

In the above, we have introduced auxiliary fields \(\{A^{(k)}\}_{k=1}^{K}\) and \(H\) to our theory in order to help us achieve gauge invariance. Here, we make a key step: we promote these fields to be dynamical in their own right, possessing their own kinetic terms in the action \(S[\Psi]\). In particular, these kinetic terms should include derivatives of these auxiliary fields, and preserve overall spacetime & gauge invariance.

  • The idea that derivative terms make the fields "dynamical" can be seen by considering the classical equations of motion (i.e. the Euler-Lagrange equations), which give non-trivial dynamics when the action depends on a derivative of the field.

Kinetic terms for \(A^{(k)}\). In order to make \(A^{(k)}\) dynamical, we would like to construct a spacetime \& gauge invariant term that depends on \(A_{\mu}^{(k)}\) and its derivatives \(\partial_{\nu} A_{\mu}^{(k)}\). Recall that we defined

\[f_{\mu\nu}^{(k)} := \partial_{\mu} A_{\nu}^{(k)} - \partial_{\nu} A_{\mu}^{(k)} - g_k [A_{\mu}^{(k)}, A_{\nu}^{(k)}]\]

We further have that \(f_{\mu\nu}^{(k)}\) transforms as

\[f_{\mu\nu}^{(k)} \mapsto_{\alpha} f_{\mu\nu}^{(k)} + [\alpha, f_{\mu\nu}^{(k)}] + O(\alpha^2)\]

under a gauge transformation by \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\).

As shown in Appendix A.1.1, each semi-simple Lie algebra \(\mathfrak{g}^{(k)}\) has its own natural metric \(\kappa^{(k)}: \mathfrak{g}^{(k)} \times \mathfrak{g}^{(k)} \to \mathbb{C}\), called the Killing form. Importantly, the Killing form satisfies the identity

\[\kappa^{(k)}(X, [Y, Z]) = \kappa^{(k)}(Y, [Z, X]) = \kappa^{(k)}(Z, [X, Y])\]

(also proven in Appendix A.1.1) for any \(X, Y, Z \in \mathfrak{g}^{(k)}\). This identity gives us that the inner product (under metric \(\kappa\)) transforms as

\[\kappa^{(k)}(f_{\mu\nu}^{(k)}, f^{(k)\mu\nu}) \mapsto_{\alpha} \kappa^{(k)}(f_{\mu\nu}^{(k)}, f^{(k)\mu\nu}) + O(\alpha^2)\]

i.e. \(\kappa(f_{\mu\nu}^{(k)}, f^{(k)\mu\nu})\) is (infinitesimally) gauge invariant. As a result, it is a natural candidate as a kinetic term for \(A^{(k)}\). And since all 4-vector indices are contracted, it will be a Lorentz scalar.
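This first-order cancellation is easy to verify numerically. The sketch below assumes \(\mathfrak{su}(2)\), for which the Killing form is proportional to the trace form, and checks that the variation \(\text{tr}(\delta F \, F + F \, \delta F)\) vanishes for \(\delta F = [\alpha, F]\), by cyclicity of the trace:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_su(n=2):
    """A random traceless anti-Hermitian matrix, i.e. an element of su(n)."""
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = (A - A.conj().T) / 2
    return A - (np.trace(A) / n) * np.eye(n)

F_mat, alpha = random_su(), random_su()

# F -> F + [alpha, F] + O(alpha^2), so the first-order variation of tr(F F) is
# tr([alpha, F] F + F [alpha, F]), which cancels by cyclicity of the trace.
delta_F = alpha @ F_mat - F_mat @ alpha
first_order = np.trace(delta_F @ F_mat + F_mat @ delta_F)
print(abs(first_order))  # numerically zero
```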

Free and kinetic terms for \(H\). \(H^{\dagger} H\) is the simplest Lorentz and gauge invariant term for \(H\) (recall that \(H\) is a Lorentz scalar). As a result, we can consider the contributions of the form \((H^{\dagger} H)^n\) for \(n \in \{1, 2, \ldots\}\). We will focus specifically on the cases of \(n = 1,2\).

  • We focus only on \(H^{\dagger} H\) and \((H^{\dagger} H)^2\) terms because \((H^{\dagger} H)^k\) for \(k > 2\) are not relevant or marginal, as defined at the end of Section 8. In the notation of Section 8, assuming canonical normalization of the kinetic contribution (\((m, n) = (a, b)\)), an \((m, n)\) contribution is said to be irrelevant at low energies iff \(n-d+m(d-b)/a > 0\). For us, \((a, b) = (2, 2)\), and since \(d=4\), the condition for irrelevance becomes

    \[n+m > 4\]

    And since \((H^{\dagger} H)^k\) has \((m, n) = (2k, 0)\), terms with \(k > 2\) are irrelevant.

We can also construct terms that involve derivatives of \(H\). Most naively we could consider \((\partial_{\mu} H)^{\dagger} \partial^{\mu} H\), however this will not be gauge invariant for the same reasons as seen in Section 3. Instead, we must do something analogous to minimal coupling. By proceeding similarly to Section 3, one can find that

\[(D_{\mu}^{(H)} H)^{\dagger} D^{(H)\mu} H\]

is both Lorentz and gauge invariant, defining the associated covariant derivative

\[D_{\mu}^{(H)} := \partial_{\mu} - \sum_{k=1}^{K} g_k d_{A_{\mu}^{(k)}}^{(H, k)}\]

where we have denoted the gauge representations that act on \(H\) by \(\{d^{(H, k)}\}_{k=1}^{K}\).

This motivates adding the contribution

\[(D_{\mu}^{(H)} H)^{\dagger} D^{(H)\mu} H + \alpha H^{\dagger} H + \beta (H^{\dagger} H)^2\]

or equivalently

\[(D_{\mu}^{(H)} H)^{\dagger} D^{(H)\mu} H - \frac{\lambda}{2}(H^{\dagger} H - v^2)^2\]

which allows us to interpret the contribution as a kinetic term + potential term, where the potential has ground states defined by \(H^{\dagger} H = v^2\). This interpretation will be useful for the discussion of symmetry breaking.
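Matching the two parameterizations term by term (writing \(h\) as shorthand for \(H^{\dagger}H\)) gives \(\alpha = \lambda v^2\) and \(\beta = -\lambda/2\), with the two forms differing only by a field-independent constant. A quick symbolic check:

```python
import sympy as sp

# h stands in for H^dagger H; lambda and v are the potential parameters
h, lam, v = sp.symbols("h lambda v", positive=True)

potential = -sp.Rational(1, 2) * lam * (h - v**2) ** 2
alpha, beta = lam * v**2, -lam / 2

# The two forms agree up to a constant shift of the action
diff = sp.expand(potential - (alpha * h + beta * h**2))
print(diff)  # the field-independent constant -lambda*v**4/2
```

Such constant shifts do not affect expectation values, since they cancel against the normalization \(Z\).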

Symmetry breaking. Todo: Reparameterize \(H\) by expanding around the ground state, and perform an appropriate gauge transformation to \(S[\Psi]\). Demonstrate that this has given "mass" to the spinors.

Todo: argue that the reps chosen for the Higgs boson are `maximal’ in some sense for achieving interactions.

Todo: show that, via symmetry breaking, fermions gaining mass.

7. Overcounting in the path integral

There is some kind of "overcounting" happening when integrating over any field object \(\Psi_i \in \mathcal{C}_i\), since \(\mathcal{C}_i\) includes physically-equivalent field configurations that are related by a gauge transformation. In the specific case of gauge fields, this overcounting is detrimental and results in divergences. Remedying this issue requires breaking gauge invariance, but some remnants of gauge invariance still persist through BRST invariance. We will now outline these concepts.

  • Some brief notes on the differences between integrating over the redundant configuration space \(\mathcal{C}\) compared to the underlying physical configuration space \(\mathcal{P}\) can be found in Appendix A.4.

In the following, we will restrict our attention to a particular gauge field \(A \in C^{\infty}(M, \mathfrak{g}^{(k)})\) associated with gauge group \(G^{(k)}\) – the following analysis will apply to each of the \(K\) gauge fields individually. For gauge element \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\), we will use the notation

\[A_{\mu}^{\alpha} := A_{\mu} + [\alpha, A_{\mu}] + \frac{1}{g_k} \partial_{\mu} \alpha\]

for the gauge transformation of \(A\) by \(\alpha\). We will work with a set of generators \(\{T_a\}_a \subset \mathfrak{g}^{(k)}\), with \([T_a, T_b] = f^c{}_{ab} T_c\) for structure constants \(\{f^c{}_{ab}\}_{a, b, c}\). We can write the components of \(A^{\alpha}\) explicitly as

\[(A_{\mu}^{\alpha})^a = A_{\mu}^a + \alpha^b A_{\mu}^c f^a{}_{bc} + \frac{1}{g_k} \partial_{\mu} \alpha^a\]

where \(A_{\mu} = A_{\mu}^a T_a\). For the following we will define the matrix of differential operators

\[\Delta_{\mu}(A)^a{}_{b} := f^a{}_{bc} A_{\mu}^c + \frac{1}{g_k} \delta^a_b \partial_{\mu}\]

which lets us write

\[(A_{\mu}^{\alpha})^a = A_{\mu}^a + \Delta_{\mu}(A)^a{}_{b} \alpha^b\]

In the following we will work in Minkowski signature, meaning \(\mathbb{P}[\Psi] \propto e^{iS[\Psi]}\) (rather than \(\propto e^{-S[\Psi]}\)).

Remedying gauge overcounting. By looking at the partition function, we can interpret the effects of gauge overcounting and identify the divergent contribution. We will make use of the two identities

\[1 = N(\xi) \int D\omega \, e^{-i\int d^d x \, \omega(x)^2/2\xi}, \qquad 1 = \int D\alpha \, \delta(G_{\omega}(A^{\alpha}))\det\left(\frac{\delta G_{\omega}(A^{\alpha})}{\delta \alpha}\right)\]

introducing a constant \(\xi\) and normalization factor \(N(\xi)\), as well as a gauge condition \(G_{\omega}(A) := \partial^{\mu} A_{\mu} - \omega\) (though the identity holds for generic conditions). Multiplying these two identities together gives

\[1 = N(\xi) \int D\alpha \, e^{-i\int(\partial^{\mu} A_{\mu}^{\alpha})^2/2\xi} \det\left(\frac{\delta G_{\omega}(A^{\alpha})}{\delta \alpha}\right)\bigg|_{\omega = \partial^{\mu} A_{\mu}^{\alpha}}\] \[\iff 1 = N(\xi) \det(\Xi(A)) \int D\alpha \, e^{-i\int(\partial^{\mu} A_{\mu}^{\alpha})^2/2\xi}\]

introducing the differential operator

\[\Xi(A)^a{}_{b} := \partial^{\mu} \Delta_{\mu}(A)^a{}_{b} = f^a{}_{bc} A_{\mu}^c \partial^{\mu} + f^a{}_{bc} (\partial^{\mu} A_{\mu}^c) + \frac{1}{g_k} \delta_b^a \partial^{\mu} \partial_{\mu}\]

Inserting this result, and writing the field content as \(\Psi = (\Phi, A)\), our partition function can be written in the form

\[\begin{align*} Z &= \int \text{D}\Phi \, \text{D}A \; e^{iS[\Phi, A]}\\ &\propto \int \text{D}\Phi \, \text{D}A \, \text{D}\alpha \, \det(\Xi(A)) e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu}^{\alpha})^2/2\xi} \end{align*}\]

Abelian gauge fields. In the case of \(G^{(k)}\) being abelian, then \(f^a{}_{bc} = 0\) for all \(a, b, c\), which means that

\[\Xi(A^{\alpha}) = \Xi(A)\]

In particular, \(\Xi(A)\) is independent of \(A\) in this case. As a result, we can make use of gauge invariance (of both \(S\) and measures \(\text{D}\Phi, \text{D}A\)) to find

\[Z = \underbrace{\left[\int \text{D}\alpha\right]}_{\text{divergent overcounting factor}} \det\left(\frac{1}{g_k} \partial^{\mu} \partial_{\mu}\right) \int \text{D}\Phi \, \text{D}A \, e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu})^2/2\xi}\]

i.e. we have been able to pull an infinite overcounting factor outside of our partition function. In order for our theory to be valid, we divide out by this overcounting factor, instead considering the partition function

\[\tilde{Z} := \int \text{D}\Phi \, \text{D}A \, e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu})^2/2\xi}\]

That is, our action has now become

\[S[\Phi, A] \to S[\Phi, A] - \frac{1}{2\xi} \int d^d x \, (\partial^{\mu} A_{\mu}^a(x)) (\partial^{\nu} A_{\nu a}(x))\]

But significantly, this new contribution is \textbf{not gauge invariant}, breaking the original gauge invariance of our theory. It turns out that there are still remnants of gauge invariance left through the BRST transformations, as we will describe further below.

Non-abelian gauge fields. In the abelian case, \(\Xi(A)\) was actually independent of \(A\), which let us pull the \(\det(\Xi(A))\) factor out of the partition function. However, in the general case of \(G^{(k)}\) non-abelian, this is no longer possible since \(\Xi(A)\) generally depends on \(A\). In particular, \(\Xi(A^{\alpha}) \neq \Xi(A)\). The partition function takes the form (up to proportionality constants)

\[Z = \int \text{D}\Phi \, \text{D}A \, \text{D}\alpha \, \det(\Xi(A^{-\alpha})) e^{iS[\Phi, A] - i\int (\partial^{\mu} A_{\mu})^2/2\xi}\]

But interestingly we can write

\[\det(\Xi(A)) = \int \text{D}\bar{c} \, \text{D}c \, e^{-i \int d^d x \, \bar{c}_a \Xi(A)^a{}_{b} c^b}\]

for so-called ghost fields \(c, \bar{c} \in C^{\infty}(M, \mathfrak{g}^{(k)})\) associated with \(A\). This lets us consider the partition function
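This Grassmann Gaussian identity has a finite-dimensional analogue that can be checked by brute force: expanding \(e^{-\bar{c}_a M^a{}_b c^b}\) in an explicit Grassmann algebra and reading off the top-form coefficient reproduces \(\det(M)\). The sketch below is our own construction; the overall sign depends on the ordering convention of the measure, so we calibrate it on the identity matrix:

```python
import math
import numpy as np

def gr_mul(a, b):
    """Product in a Grassmann algebra; elements are dicts mapping
    tuples of generator indices (word order) to coefficients."""
    out = {}
    for ka, va in a.items():
        for kb, vb in b.items():
            if set(ka) & set(kb):
                continue                      # repeated generators square to zero
            merged = list(ka) + list(kb)
            sign = 1
            for i in range(len(merged)):      # parity of the sorting permutation
                for j in range(i + 1, len(merged)):
                    if merged[i] > merged[j]:
                        merged[i], merged[j] = merged[j], merged[i]
                        sign = -sign
            key = tuple(merged)
            out[key] = out.get(key, 0.0) + sign * va * vb
    return out

def top_coeff(M):
    """Top-form coefficient of exp(-cbar M c); generators cbar_i -> 2i, c_j -> 2j+1."""
    n = len(M)
    neg_S = {(2 * i, 2 * j + 1): -M[i][j]
             for i in range(n) for j in range(n) if M[i][j]}
    result, power = {(): 1.0}, {(): 1.0}
    for k in range(1, n + 1):                 # the exponential truncates at order n
        power = gr_mul(power, neg_S)
        for key, v in power.items():
            result[key] = result.get(key, 0.0) + v / math.factorial(k)
    return result.get(tuple(range(2 * n)), 0.0)

n = 3
sign = top_coeff(np.eye(n).tolist())          # measure-ordering sign, from det(I) = 1
rng = np.random.default_rng(1)
M = rng.integers(-3, 4, size=(n, n)).astype(float)
print(top_coeff(M.tolist()), sign * np.linalg.det(M))  # these agree
```

Contrast this with the bosonic Gaussian integral, which produces \(1/\det(M)\) instead: this inverse power of the determinant is precisely why Grassmann-valued ghosts can cancel unwanted bosonic determinant factors.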

\[\tilde{Z} := \int \text{D}\Phi \, \text{D}A \, e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu})^2/2\xi-i \int \bar{c}_a \Xi(A)^a{}_{b} c^b}\]

corresponding to modifying our action as

\[S[\Phi, A] \to S[\Phi, A] - \frac{1}{2\xi} \int d^d x \, (\partial^{\mu} A_{\mu}^a(x)) (\partial^{\nu} A_{\nu a}(x)) -\int d^d x \, \bar{c}_a(x) \Xi(A)^a{}_{b} c^b(x)\]

Again, these extra contributions will generally break gauge invariance.

In total, for the generic non-abelian case, we must extend our field content to include ghost fields for each gauge field:

\[\Psi \mapsto \Psi \cup \{\bar{c}^{(k)}, c^{(k)}\}_{k=1}^{K}\]

and modifying the action as

\[S[\Psi] \to S[\Psi] - \sum_{k=1}^{K} \left[\frac{1}{2\xi_k} \int d^d x \, (\partial^{\mu} A_{\mu}^{(k)a}(x)) (\partial^{\nu} A_{\nu a}^{(k)}(x)) +\int d^d x \, \bar{c}_a^{(k)}(x) \Xi^{(k)}(A^{(k)})^a{}_{b} c^{(k) b}(x)\right]\]

We will derive gauge transformation rules for these ghost fields below.

BRST invariance. Though this new action no longer satisfies the total gauge invariance we originally had, a restricted case of gauge invariance still remains, called BRST invariance.

In particular, recall that gauge transformations act infinitesimally as

\[A_{\mu}^{(k)} \mapsto_{\alpha} A_{\mu}^{(k)} + \Delta_{\mu}(A^{(k)})\alpha + O(\alpha^2)\]

for \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\). Under this transformation (and a corresponding transformation on the remaining field content), one can show that the modified Lagrangian transforms as \(\mathcal{L} \mapsto_{\alpha} \mathcal{L} + \delta\mathcal{L}\) for

\[\begin{align*} \delta\mathcal{L} &= -\frac{1}{\xi_k}(\partial^{\nu} A_{\nu a}^{(k)}) (\partial^{\mu} \partial_{\mu} \alpha^a + f^{(k)a}{}_{bc} (\alpha^b \partial^{\mu} A_{\mu}^{(k) c} + A_{\mu}^{(k) c} \partial^{\mu} \alpha^b)) - \delta\bar{c}_a^{(k)} (f^{(k)a}{}_{bc} c^{(k)b} \partial^{\mu} A_{\mu}^{(k)c} + f^{(k)a}{}_{bc} A_{\mu}^{(k)c} \partial^{\mu} c^{(k)b} + \partial^{\mu} \partial_{\mu} c^{(k)a})\\ &\quad\,\;-\bar{c}_a^{(k)}(f^{(k)a}{}_{bc} f^{(k)c}{}_{de} [A_{\mu}^{(k)e}\partial^{\mu} \alpha^d + \alpha^d \partial^{\mu} A_{\mu}^{(k)e}] + f^{(k)a}{}_{bc} \partial^{\mu} \partial_{\mu} \alpha^c) c^{(k)b} - \bar{c}_a^{(k)} (f^{(k)a}{}_{bc} f^{(k)c}{}_{de} \alpha^d A^{(k)e}_{\mu} + f^{(k)a}{}_{bc} \partial_{\mu} \alpha^c) \partial^{\mu} c^{(k)b}\\ &\quad\,\;-\bar{c}_a^{(k)} (f^{(k)a}{}_{bc} \delta c^{(k)b} \partial^{\mu} A_{\mu}^{(k)c}+ f^{(k)a}{}_{bc} A_{\mu}^{(k)c} \partial^{\mu}[\delta c^{(k)b}] + \partial^{\mu} \partial_{\mu} [\delta c^{(k)a}]) \end{align*}\]

with \(c^{(k)} \mapsto_{\alpha} c^{(k)} + \delta c^{(k)}\) and \(\bar{c}^{(k)} \mapsto_{\alpha} \bar{c}^{(k)} + \delta\bar{c}^{(k)}\).

One can convince oneself that the second derivative terms will nicely cancel if we choose \(\alpha^a = c^a\), and take the ghost fields to transform as

\[\delta \bar{c}^{(k)a} = -\frac{1}{\xi_k} \partial^{\mu} A_{\mu a}^{(k)}, \qquad \delta c^{(k)a} = -\frac{1}{2} f^{(k)a}{}_{bc} \alpha^c c^{(k)b}\]

where note that we must be careful when commuting variables since \(c^a c^b = -c^b c^a\) (Grassmann valued). We have also made use of

\[f^a{}_{e[b} f^e{}_{cd]} = 0\]

which follows from the Jacobi identity. One can then check that, under these choices, the remaining terms also cancel and we get \(\delta\mathcal{L} = 0\).

As a result, our original gauge invariance has been reduced to a restricted set of gauge transformations that involve our new ghost fields. This restricted set are called the BRST transformations.

8. Renormalization

In the following, denote \(p = \dim V\) (where recall \(\Psi \in C^{\infty}(M, V)\)). Recall that physical quantities are related to field expectation values of the general form

\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] \equiv \int_{\mathcal{C}} \left[\prod_{j=1}^{N} \text{D}\Psi_j\right] \Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n) e^{-S[\Psi]}\]

As outlined in the introduction, in order to construct a well-behaved definition of \(\text{D}\Psi\), one can consider expanding \(\Psi\) in some discrete basis \(\{\psi_n\}_n\) of \(C^{\infty}(M, V)\):

\[\Psi(x) = \sum_{n \in \mathbb{Z}} a_n \psi_n(x)\]

Then naively, we could define

\[\text{D}\Psi := \prod_{n \in \mathbb{Z}} d a_n\]

but since this is an infinite product, it will generally result in divergences, necessitating regulation. In the following, we will consider cutoff regulation: truncate the product by some cutoff parameter \(N \in \mathbb{Z}_{+}\):

\[\text{D}\Psi^{(N)} := \prod_{n\in \mathbb{Z}_N} d a_n\]

using the notation \(\mathbb{Z}_N := \{-N, -N+1, \ldots, N-1, N\}\). Of particular relevance to renormalization is the Fourier basis, corresponding to \(n \in \mathbb{Z}_N^d\) (for spacetime dimension \(\dim M = d\)) and \(\psi_n(x) = e^{i\epsilon n \cdot x}\) for some fixed discretization scale \(\epsilon\). In this basis, we may write

\[\begin{align*} \Psi^{(N)}(x) &= \sum_{n \in \mathbb{Z}_N^d} a_n e^{i\epsilon n\cdot x}\\ &= \sum_{k \in \epsilon \mathbb{Z}_N^d} \tilde{\Psi}(k) e^{ik\cdot x} \end{align*}\]

allowing us to interpret \(k\) as momentum, using the notation \(\tilde{\Psi}(k)\) for components. Namely, at a cutoff scale \(N\) and discretization scale \(\epsilon\), the maximum allowable energy is \(\Lambda := \sqrt{d} \epsilon N\). In the following we will use the notation \(\Psi^{(\Lambda)} \equiv \Psi^{(N)}\), and denote the configuration space spanned by \(\{\psi_n\}_{n \in \mathbb{Z}_N}\) by \(\mathcal{C}^{(\Lambda)}\).

Given that the space of field configurations \(\mathcal{C}^{(\Lambda)}\) is defined relative to some energy scale \(\Lambda\), this means that theories \(S^{(\Lambda)}: \mathcal{C}^{(\Lambda)} \to \mathbb{R}\) are \textbf{inherently energy-dependent}, describing physics at a particular energy scale. We say that \(S^{(\Lambda)}\) is a \(\Lambda\)-energy theory.

  • Since experiments on Earth are limited to a low anthropic energy scale \(\mu\), we can only understand the values of couplings for \(\mu\)-energy theories via experiment. We can attempt to extrapolate this theory to a \(\Lambda\)-energy theory \(S^{(\Lambda)}\) (for \(\Lambda > \mu\)), which renormalization will allow us to consider, as we will see below.

Lowering the cutoff parameter. We will now show that there is a natural procedure for lowering the energy of a theory \(S^{(\Lambda)}\), resulting in a theory \(S^{(\mu)}\) for \(\mu < \Lambda\). First see that

\[\begin{align*} \text{D}\Psi^{(\Lambda)} = \prod_{k \in \epsilon\mathbb{Z}_N^d} d^d \tilde{\Psi}(k) &= \prod_{k \in \epsilon\mathbb{Z}_M^d} d^d \tilde{\Psi}(k) \prod_{k \in \epsilon\mathbb{Z}_{M+1,N}^d} d^d \tilde{\Psi}(k)\\ &= \text{D}\Psi^{(\mu)} \text{D}\Psi^{(\mu, \Lambda)} \end{align*}\]

for some lower cutoff \(M \in \mathbb{Z}_{+}\) with energy scale \(\mu := \sqrt{d} \epsilon M\), where

\[\Psi^{(\mu)}(x) = \sum_{k \in \epsilon \mathbb{Z}_M^d} \tilde{\Psi}(k) e^{ik\cdot x}, \qquad \Psi^{(\mu, \Lambda)} = \sum_{k \in \epsilon \mathbb{Z}_{M+1, N}^d} \tilde{\Psi}(k) e^{ik\cdot x}\]

which allows us to write

\[\Psi^{(\Lambda)}(x) = \Psi^{(\mu)}(x) + \Psi^{(\mu, \Lambda)}(x)\]

In the following we will write

\[\int \text{D}\Psi^{(\Lambda)} \equiv \int_{\mathcal{C}^{(\Lambda)}} \text{D}\Psi\]

Now see that, for a \(\Lambda\)-energy theory \(S^{(\Lambda)}: \mathcal{C}^{(\Lambda)} \to \mathbb{R}\), we can write the \(\Lambda\)-energy partition function as

\[\begin{align*} Z^{(\Lambda)} &= \int_{\mathcal{C}^{(\Lambda)}} \text{D}\Psi \, e^{-S^{(\Lambda)}[\Psi]}\\ &= \int_{\mathcal{C}^{(\mu)}} \text{D}\chi \int_{\mathcal{C}^{(\mu, \Lambda)}} \text{D} \eta \, e^{-S^{(\Lambda)}[\chi + \eta]}\\ &=: \int_{\mathcal{C}^{(\mu)}} \text{D}\chi \, e^{-S^{(\mu)}[\chi]} \end{align*}\]

where we have defined the effective action \(S^{(\mu)}: \mathcal{C}^{(\mu)} \to \mathbb{R}\) by

\[S^{(\mu)}[\chi] := -\log\left(\int_{\mathcal{C}^{(\mu, \Lambda)}} \text{D} \eta \, e^{-S^{(\Lambda)}[\chi + \eta]}\right)\]

Furthermore, since in general we can decompose

\[S^{(\Lambda)}[\chi + \eta] = S^{(\Lambda)}[\chi] + S_0^{(\Lambda)}[\eta] + S_I^{(\Lambda)}[\chi, \eta]\]

we can write

\[S^{(\mu)}[\chi] = S^{(\Lambda)}[\chi] + S^{(\Lambda\to\mu)}[\chi]\]

defining

\[S^{(\Lambda\to\mu)}[\chi] := -\log\left(\int_{\mathcal{C}^{(\mu, \Lambda)}} \text{D} \eta \, e^{-S_0^{(\Lambda)}[\eta] - S_I^{(\Lambda)}[\chi, \eta]}\right)\]
  • Note that \(S^{(\Lambda)}[\chi]\) – which is defined on \(\mathcal{C}^{(\Lambda)}\) – is still well-defined for \(\chi \in \mathcal{C}^{(\mu)}\) since \(\mathcal{C}^{(\mu)} \subset \mathcal{C}^{(\Lambda)}\).

Denote the space of \(\Lambda\)-energy theories by \(\mathcal{S}^{(\Lambda)} \subseteq (\mathcal{C}^{(\Lambda)} \to \mathbb{R})\). Above, we have shown that there is a particularly natural lowering operation

\[\mathfrak{L}^{(\Lambda\to\mu)}: \mathcal{S}^{(\Lambda)} \to \mathcal{S}^{(\mu)}, \quad S^{(\Lambda)} \mapsto S^{(\mu)} = S^{(\Lambda)} + S^{(\Lambda\to\mu)}\]

mapping \(\Lambda\)-energy theories to \(\mu\)-energy theories by integrating out momentum modes between \((\mu, \Lambda]\), which has the effect of shifting by \(S^{(\Lambda\to\mu)}\).
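The effect of \(\mathfrak{L}^{(\Lambda\to\mu)}\) can be made concrete in a zero-dimensional toy model: one "low" Gaussian mode \(\chi\) and one "high" mode \(\eta\), coupled quadratically. Integrating out \(\eta\) analytically gives an effective quadratic action with kernel \(k_1 - g^2/k_2\) (a Schur complement); the sketch below (the couplings \(k_1, k_2, g\) are arbitrary choices) verifies this numerically:

```python
import numpy as np

# Toy lowering operation: S[chi, eta] = k1 chi^2/2 + k2 eta^2/2 + g chi eta,
# with eta the "high" mode to be integrated out.
k1, k2, g = 1.0, 4.0, 0.8
eta = np.linspace(-10.0, 10.0, 20001)
d_eta = eta[1] - eta[0]

def S_eff(chi):
    """Effective action: S_eff[chi] = -log of the integral over eta of exp(-S)."""
    S = 0.5 * k1 * chi**2 + 0.5 * k2 * eta**2 + g * chi * eta
    return -np.log(np.sum(np.exp(-S)) * d_eta)

for chi in (0.5, 1.0, 2.0):
    numeric = S_eff(chi) - S_eff(0.0)        # drop the chi-independent constant
    exact = 0.5 * (k1 - g**2 / k2) * chi**2  # Schur-complement kernel
    print(chi, numeric, exact)
```

Note that the shift \(S^{(\Lambda\to\mu)}\) here renormalizes the quadratic coupling of the low mode (from \(k_1\) to \(k_1 - g^2/k_2\)): integrating out modes changes the couplings of the remaining ones.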

Extrapolating to higher energies. In reality, the experiments we can run on Earth are at a relatively low anthropic energy scale \(\mu\) and so the situation is reversed: experiments help us to determine a \(\mu\)-energy theory \(S^{(\mu)}\) (e.g. the Standard Model) that agrees with experimental results on Earth, and we wish to extrapolate this theory to a theory \(S^{(\Lambda)}\) that works at higher energy levels \(\Lambda > \mu\). One may even wish to extend the theory to arbitrarily high energies \(\Lambda \to \infty\).

Concretely, given a low-energy theory \(S^{(\mu)}\) (determined via experiments on Earth at energy scale \(\mu\)), the requirement that the higher-energy theory \(S^{(\Lambda)}\) agrees with experiments at energy \(\mu\) corresponds to \(S^{(\Lambda)}\) satisfying

\[\begin{equation} \label{eqn:lower} \mathfrak{L}^{(\Lambda\to\mu)}[S^{(\Lambda)}] = S^{(\mu)} \end{equation}\]

Asymptotic freedom. We say that \(S^{(\mu)}\) is asymptotically free if such a theory \(S^{(\Lambda)}\) exists for \(\Lambda = \infty\). For this theory \(S^{(\infty)}\) to be valid, physical predictions must be finite, corresponding to finite correlators:

\[\mathbb{E}_{\Psi \sim S^{(\infty)}}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] \;\; \text{finite}\]

In the following, given some fixed \(S^{(\mu)}\), we will consider constructing such a \(S^{(\Lambda)}\) via the approach of counterterms. In particular, denoting \(S^{(\mu\to\Lambda)} \equiv -S^{(\Lambda\to\mu)}\), we may write

\[S^{(\Lambda)}[\chi] = S^{(\mu)}[\chi] + S^{(\mu\to\Lambda)}[\chi]\]

As we will see, we will then design the contributions \(S^{(\mu\to\Lambda)}\) to be such that \(S^{(\Lambda)}\) is well-defined in the limit \(\Lambda \to \infty\) (i.e. finite correlators).

In the following we will write

\[\begin{equation} \label{eqn:actcounter} S^{(\Lambda)}[\chi] = \sum_{m, n} \lambda_{m, n}^{(\Lambda)} \int_M d^d x \, L_{m, n}[\chi; x] \end{equation}\]

where \(L_{m, n}[\chi; x]\) denotes all contributions with \(m\) powers of \(\chi\) and \(n\) derivatives of \(\chi\), with \(\{\lambda_{m, n}^{(\Lambda)}\}_{m, n}\) the associated couplings. Similarly, we will denote the couplings of \(S^{(\mu)}\) by \(\{\lambda_{m,n}^{(\mu)}\}_{m, n}\), and of \(S^{(\mu\to\Lambda)}\) by \(\{\delta\lambda_{m, n}^{(\mu\to\Lambda)}\}_{m, n}\). Note that \(\lambda_{m, n}^{(\Lambda)} = \lambda_{m, n}^{(\mu)} + \delta\lambda_{m, n}^{(\mu\to\Lambda)}\). We will denote the kinetic contribution by the indices \((m, n) = (a, b)\) (usually \((a, b) = (2, 2)\)).

We will assume that \(S^{(\mu)}\) is canonically normalized, in the sense that its kinetic contribution has a coefficient of \(1\), i.e. \(\lambda_{a, b}^{(\mu)} = 1\). Generally \(\lambda_{a, b}^{(\Lambda)} \neq 1\) due to the contributions \(S^{(\mu\to\Lambda)}\). To canonically normalize \(S^{(\Lambda)}\), we will introduce the normalized field

\[\hat{\chi} := Z^{1/a} \chi\]

defining \(Z := 1 + \delta\lambda_{a, b}^{(\mu\to\Lambda)}\). Then Equation \ref{eqn:actcounter} becomes

\[\hat{S}^{(\Lambda)}[\hat{\chi}] = \hat{S}^{(\mu)}[\hat{\chi}] + \hat{S}^{(\mu\to\Lambda)}[\hat{\chi}]\]

where we have defined the normalized actions

\[\hat{S}^{(\Lambda)}[\hat{\chi}] := S^{(\Lambda)}[Z^{-1/a} \hat{\chi}], \quad \hat{S}^{(\mu)}[\hat{\chi}] := S^{(\mu)}[Z^{-1/a} \hat{\chi}], \quad \hat{S}^{(\mu\to\Lambda)}[\hat{\chi}] := S^{(\mu\to\Lambda)}[Z^{-1/a} \hat{\chi}]\]

with couplings \(\{\hat{\lambda}^{(\Lambda)}_{m, n}\}_{m, n}\), \(\{\hat{\lambda}^{(\mu)}_{m, n}\}_{m, n}\), \(\{\delta\hat{\lambda}_{m, n}^{(\mu\to\Lambda)}\}_{m, n}\) respectively. Importantly, we now have that \(\hat{S}^{(\Lambda)}\) is canonically normalized, with \(\hat{\lambda}_{a, b}^{(\Lambda)} = 1\). Explicitly we can write these new couplings in terms of the original couplings:

\[\begin{align*} \hat{\lambda}^{(\Lambda)}_{m, n} &= Z^{-m/a} \lambda_{m, n}^{(\Lambda)}\\ &= \frac{\lambda_{m, n}^{(\mu)} + \delta\lambda_{m, n}^{(\mu\to\Lambda)}}{Z^{m/a}} \end{align*}\]

One often calls \(\hat{S}^{(\Lambda)}\) the bare theory, and \(S^{(\mu)}\) the renormalized theory.

Counterterm renormalization involves designing the counterterms \(\delta\lambda_{m, n}^{(\mu\to\infty)}\) such that the infinite energy theory \(\hat{S}^{(\infty)}\) of couplings \(\hat{\lambda}^{(\infty)}_{m, n}\) has finite correlators. As we outline in Section 9, one can compute correlators perturbatively. In our context, we can do so by taking \(\delta\lambda_{m, n}^{(\mu\to\infty)}\) to be sufficiently small such that we can treat \(\hat{S}^{(\mu\to\infty)}\) as a perturbation. The general procedure of counterterm renormalization includes:

  1. Compute correlators perturbatively under \(\hat{S}^{(\Lambda)} = \hat{S}^{(\mu)} + \hat{S}^{(\mu\to\Lambda)}\) by treating \(\hat{S}^{(\mu\to\Lambda)}\) as a perturbation.
  2. Choose \(\delta\lambda_{m, n}^{(\mu\to\Lambda)}\) appropriately (by some prescription) to ensure that these correlators remain finite in the limit \(\Lambda \to \infty\) and that the counterterms \(\delta\lambda_{m, n}^{(\mu\to\Lambda)}\) also remain finite in this limit (such that the perturbative assumption is valid).
  3. If this procedure can ensure that all \(n\)-pt correlators are finite for arbitrary \(n \in \mathbb{Z}_{+}\), then we say that the theory \(S^{(\mu)}\) is asymptotically free.

In step 2, the prescription of counterterm renormalization describes how one chooses \(\delta\lambda_{m, n}^{(\mu\to\Lambda)}\), since generally various different choices can result in finite correlators (e.g. one can add an arbitrary finite constant). Different prescription schemes include the minimal subtraction (MS) scheme, the modified minimal subtraction (\(\overline{\text{MS}}\)) scheme, and the on-shell (mass-shell) scheme.
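A cartoon of the procedure at leading order (all expressions below are illustrative, not derived from any actual theory): suppose a correlator receives a loop-style contribution \(\lambda^2 \log(\Lambda/\mu)\) that diverges as \(\Lambda \to \infty\). Choosing the counterterm \(\delta\lambda = -\lambda^2 \log(\Lambda/\mu)\) cancels the divergence, leaving a cutoff-independent result:

```python
import math

# Cartoon of counterterm renormalization at leading order (illustrative only)
lam_mu, mu = 0.1, 1.0                    # renormalized coupling at scale mu

def counterterm(cutoff):
    # chosen to cancel the log divergence at this order
    return -lam_mu**2 * math.log(cutoff / mu)

def correlator(cutoff):
    lam_cutoff = lam_mu + counterterm(cutoff)          # "bare" coupling
    return lam_cutoff + lam_mu**2 * math.log(cutoff / mu)  # tree + leading loop

for cutoff in (1e2, 1e6, 1e12):
    print(cutoff, correlator(cutoff))    # stays at lam_mu for every cutoff
```

The counterterm itself grows with \(\Lambda\) here, which is why step 2 also demands that the counterterms stay finite in the limit for the perturbative treatment to remain valid at higher orders.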

Raising the discretization scale. In the following, we will more explicitly write \(S^{(N, \epsilon)} \equiv S^{(\Lambda)}\) to denote the theory defined on configuration space \(\mathcal{C}^{(N, \epsilon)} \equiv \mathcal{C}^{(\Lambda)}\), with associated field configuration \(\Psi^{(N, \epsilon)} \equiv \Psi^{(\Lambda)}\). Through the lowering operator \(\mathfrak{L}^{(N \to M)} \equiv \mathfrak{L}^{(\Lambda\to\mu)}\), we saw that we can transform a \((N, \epsilon)\)-theory into a \((M, \epsilon)\)-theory for \(M < N\), which has the effect of lowering the energy scale from \(\Lambda \propto N\epsilon\) to \(\mu \propto M\epsilon\).

It turns out that we can also construct a mapping from a \((N, \epsilon)\)-theory to a \((N, \eta)\)-theory with a raised discretization scale \(\eta > \epsilon\). To see this, we will first introduce a new field \(\xi\) with Fourier components

\[\tilde{\xi}(k) := (\eta/\epsilon)^d \tilde{\Psi}(\epsilon k/\eta) \implies d^d \tilde{\Psi}(k) = d^d \tilde{\xi}(\eta k/\epsilon)\]

which allows us to write

\[\begin{align*} \text{D}\Psi^{(N, \epsilon)} = \prod_{k \in \epsilon \mathbb{Z}_N^d} d^d \tilde{\Psi}(k) &= \prod_{k \in \epsilon \mathbb{Z}_N^d} d^d \tilde{\xi}(\eta k/\epsilon)\\ &= \prod_{k \in \eta \mathbb{Z}_N^d} d^d \tilde{\xi}(k)\\ &= \text{D}\xi^{(N, \eta)} \end{align*}\]

One can also show that

\[\Psi^{(N, \epsilon)}(x) = (\epsilon/\eta)^d \xi^{(N, \eta)}(\epsilon x/\eta)\]

These two results let us write

\[\begin{align*} Z^{(N, \epsilon)} &= \int_{\mathcal{C}^{(N, \epsilon)}} \text{D}\Psi \, e^{-S^{(N, \epsilon)}[\Psi]}\\ &= \int_{\mathcal{C}^{(N, \eta)}} \text{D}\xi \, e^{-S^{(N, \epsilon)}[(\epsilon/\eta)^d \xi(\epsilon \, \cdot/\eta)]} \end{align*}\]

where see that

\[\begin{align*} S^{(N, \epsilon)}[(\epsilon/\eta)^d \xi(\epsilon \, \cdot/\eta)] &= \sum_{m, n} \lambda_{m, n}^{(N, \epsilon)} \int_M d^d x \, L_{m, n}[(\epsilon/\eta)^d \xi(\epsilon \, \cdot/\eta); x]\\ &= \sum_{m, n} (\epsilon/\eta)^{dm + n} \lambda_{m, n}^{(N, \epsilon)} \int_M d^d x \, L_{m, n}[\xi; \epsilon x/\eta]\\ &= \sum_{m, n} (\epsilon/\eta)^{d(m-1) + n} \lambda_{m, n}^{(N, \epsilon)} \int_M d^d \tilde{x} \, L_{m, n}[\xi; \tilde{x}]\\ &= \sum_{m, n} \underbrace{\frac{(\epsilon/\eta)^{n-d+m(d-b)/a}}{(\lambda_{a, b}^{(N, \epsilon)})^{m/a}} \lambda_{m, n}^{(N, \epsilon)}}_{=: \, \tilde{\lambda}^{(N, \eta)}} \int_M d^d \tilde{x} \, L_{m, n}[\tilde{\xi}; \tilde{x}]\\ &= \sum_{m, n} \tilde{\lambda}^{(N, \eta)}_{m, n} \int_M d^d \tilde{x} \, L_{m, n}[\tilde{\xi}; \tilde{x}]\\ &=: \tilde{S}^{(N, \eta)}[\tilde{\xi}] \end{align*}\]

defining the rescaled field

\[\tilde{\xi} := (\epsilon/\eta)^{d+(b-d)/a} (\lambda_{a, b}^{(N, \epsilon)})^{1/a} \xi\]

precisely such that \(\tilde{S}^{(N, \eta)}[\tilde{\xi}]\) is canonically normalized, i.e. \(\tilde{\lambda}_{a, b}^{(N, \eta)} = 1\). As a result, we can write

\[Z^{(N, \epsilon)} \propto \int_{\mathcal{C}^{(N, \eta)}} \text{D}\tilde{\xi} \, e^{-\tilde{S}^{(N, \eta)}[\tilde{\xi}]}\]

The above has therefore shown that there is a particularly natural raising operation for the discretization scale

\[\mathfrak{R}^{(\epsilon\to\eta)}: \mathcal{S}^{(N, \epsilon)} \to \mathcal{S}^{(N, \eta)}, \quad S^{(N, \epsilon)} \mapsto \tilde{S}^{(N, \eta)}\]
  • This procedure applies identically to lowering the discretization scale (i.e. \(\eta < \epsilon\)), though for our purposes we will only care about raising. This is in contrast to the cutoff parameter, which can be lowered as above but has no analogous raising operation.

Coarse-graining. Consider the map

\[\mathfrak{C}^{(N \to M)}: \mathcal{S}^{(N, \epsilon)} \to \mathcal{S}^{(M, N\epsilon/M)}\] \[\mathfrak{C}^{(N \to M)} := \mathfrak{R}^{(\epsilon\to N\epsilon/M)} \circ \mathfrak{L}^{(N \to M)}\]

If a theory \(S^{(N, \epsilon)}\) has couplings \(\{\lambda_{m, n}^{(N, \epsilon)}\}_{m, n}\), then \(\mathfrak{C}^{(N \to M)}[S^{(N, \epsilon)}]\) has couplings

\[\begin{equation} \label{eqn:coarsecouplings} \lambda^{(M, N\epsilon/M)}_{m, n} = (M/N)^{n-d+m(d-b)/a} \frac{\lambda_{m, n}^{(N, \epsilon)} + \delta\lambda_{m, n}^{(N \to M, \epsilon)}}{(\lambda_{a, b}^{(N, \epsilon)} +\delta\lambda_{a, b}^{(N\to M, \epsilon)})^{m/a}} \end{equation}\]

This operation is particularly special in that it \textbf{preserves energy scales}, since \(N\epsilon = M(N\epsilon/M)\). It also maps between canonically normalized theories.

As a result, we can view \(\mathfrak{C}^{(N\to M)}\) as coarse-graining a theory by decreasing the cutoff and raising the discretization scale, while maintaining the same overall energy scale.

We can classify the possible contributions \(\{L_{m, n}\}_{m, n}\) to a theory based on how their couplings \(\{\lambda_{m, n}^{(N, \epsilon)}\}_{m, n}\) behave under coarse-graining. Namely, from Equation \ref{eqn:coarsecouplings}, we have that for each \((m, n)\):

  1. \(n-d+m(d-b)/a > 0\): Implies that \(\lambda^{(M, N\epsilon/M)}_{m, n} < \lambda^{(N, \epsilon)}_{m, n}\), i.e. the \((m, n)\) contribution becomes weaker under coarse-graining. In this case, we say that the contribution is irrelevant.
  2. \(n-d+m(d-b)/a = 0\): Implies that \(\lambda^{(M, N\epsilon/M)}_{m, n} \approx \lambda^{(N, \epsilon)}_{m, n}\), i.e. the \((m, n)\) contribution is invariant under coarse-graining. In this case, we say that the contribution is marginal.
  3. \(n-d+m(d-b)/a < 0\): Implies that \(\lambda^{(M, N\epsilon/M)}_{m, n} > \lambda^{(N, \epsilon)}_{m, n}\), i.e. the \((m, n)\) contribution becomes stronger under coarse-graining. In this case, we say that the contribution is relevant.
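This classification can be made concrete with a minimal sketch (my own illustration, not from the text above): take the kinetic label \((a, b) = (2, 2)\) (two fields, two derivatives) in \(d = 4\), so the exponent \(n - d + m(d-b)/a\) reduces to \(n + m - 4\), and classify contributions by its sign. The function names are my own.

```python
# Classify a contribution L_{m,n} (m field powers, n derivatives) by the sign
# of its coarse-graining exponent n - d + m(d - b)/a.
# Assumption: canonically normalized kinetic term with (a, b) = (2, 2), d = 4.

def scaling_exponent(m, n, d=4, a=2, b=2):
    return n - d + m * (d - b) / a

def classify(m, n, d=4, a=2, b=2):
    expo = scaling_exponent(m, n, d, a, b)
    if expo > 0:
        return "irrelevant"   # coupling shrinks under coarse-graining
    if expo == 0:
        return "marginal"     # coupling (approximately) invariant
    return "relevant"         # coupling grows under coarse-graining

# Familiar power-counting results for a scalar field in d = 4:
print(classify(2, 0))  # mass term phi^2   -> relevant
print(classify(4, 0))  # phi^4 interaction -> marginal
print(classify(6, 0))  # phi^6 interaction -> irrelevant
```

This reproduces the standard statement that, for a scalar in four dimensions, operators of mass dimension below, equal to, and above \(d = 4\) are relevant, marginal, and irrelevant respectively.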

Todo: usually one interprets irrelevant terms as terms that can be ignored at low energy. But under coarse-graining the energy scale is not changing, so how can we remedy this difference in viewpoint?

Todo: the general prescription for writing down low-energy theories: just write down all relevant and marginal terms that satisfy spacetime and gauge invariance. The Standard Model action can be derived by such a procedure (but also with some extra terms, like the theta term).

Todo: how is gauge invariance affected by cutoff regularization? (in comparison, Fujikawa regularization preserves gauge invariance)

Todo: The existence of Landau poles suggests that the Standard Model may not be asymptotically free, however this analysis is only perturbative and hence does not act as a proof.

9. Computing correlators perturbatively

As mentioned previously, physical quantities such as scattering probabilities can be written in terms of correlators: expectation values of the general form

\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] := \frac{1}{Z} \int_{\mathcal{C}} \text{D}\Psi \, \Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n) e^{-S[\Psi]}\]

for arbitrary \(n \in \mathbb{Z}_{+}\), indices \((i_1, \ldots, i_n) \in \{1, \ldots, N\}^n\), and points \((x_1, \ldots, x_n) \in M^n\).

But how can we actually compute correlators? Generally, we can write

\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] = \frac{(-1)^n}{Z[0, \ldots, 0]} \frac{\delta^n Z[J_1, \ldots, J_N]}{\delta J_{i_n}(x_n) \cdots \delta J_{i_1}(x_1)}\bigg|_{J_1 = \cdots = J_N = 0}\]

where we have introduced a source \(J_i \in C^{\infty}(M, V^{(i)})\) for each field \(\Psi_i\) and defined

\[Z[J_1, \ldots, J_N] := \int_{\mathcal{C}} \text{D}\Psi \, e^{-S[\Psi] - \sum_{i=1}^{N} \int_M d^d x \, J_i(x) \cdot \Psi_i(x)}\]
  • The ordering of the functional derivatives is significant, since the Standard Model features Grassmann-valued fields (i.e. spinors) that anti-commute, with their sources also being Grassmann-valued.
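The source trick can be sanity-checked in a zero-dimensional toy model (my own illustration, not from the text): the "field" is a single real variable \(x\), functional derivatives reduce to ordinary ones, and for \(S[x] = \frac{1}{2}kx^2\) the two-point correlator \((-1)^2 \frac{1}{Z(0)}\frac{d^2 Z}{dj^2}\big|_{j=0}\) should equal \(\mathbb{E}[x^2] = 1/k\).

```python
# Zero-dimensional toy: the "partition function with a source" is an ordinary
# integral Z(j) = \int dx exp(-k x^2/2 - j x), and the two-point correlator
# is (1/Z(0)) d^2 Z/dj^2 at j = 0, which should equal E[x^2] = 1/k.
import math

def Z(j, k=2.0, xmax=20.0, steps=40001):
    # simple Riemann sum over a wide interval
    dx = 2 * xmax / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = -xmax + i * dx
        total += math.exp(-0.5 * k * x * x - j * x)
    return total * dx

k = 2.0
h = 1e-3  # finite-difference step standing in for the functional derivative
d2Z = (Z(h, k) - 2 * Z(0.0, k) + Z(-h, k)) / h**2
moment = d2Z / Z(0.0, k)
print(moment)  # approx 1/k = 0.5
```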

Decomposing \(S[\Psi] = S_0[\Psi] + \Delta S[\Psi]\) into free contributions \(S_0\) (i.e. kinetic and mass terms) and perturbative contributions \(\Delta S\) (i.e. interactions with small couplings), we are able to compute \(Z[J_1, \ldots, J_N]\) perturbatively via Taylor expansion of:

\[Z[J_1, \ldots, J_N] = \exp\left(-\Delta S\left[\frac{\delta}{\delta J_1}, \ldots, \frac{\delta}{\delta J_N}\right]\right) Z_0[J_1, \ldots, J_N]\]

with \(Z_0[J_1, \ldots, J_N]\) defined

\[Z_0[J_1, \ldots, J_N] := \int_{\mathcal{C}} \text{D}\Psi \, e^{-S_0[\Psi] - \sum_{i=1}^{N} \int_M d^d x \, J_i(x) \cdot \Psi_i(x)}\]

Because \(S_0[\Psi]\) consists of free contributions and is at most quadratic in \(\Psi\), and since the source terms are linear in \(\Psi\), one can usually complete the square within the exponent and determine a closed form for \(Z_0[J_1, \ldots, J_N]\).

  • For example, a real scalar field \(\phi\) has free action

    \[S_0[\phi] = \int_M d^d x \left(\frac{1}{2} \partial_{\mu} \phi \, \partial^{\mu} \phi - \frac{1}{2} m^2 \phi^2\right)\]

    and by using integration by parts and completing the square, one can find that

    \[Z_0[J] = Z_0[0] \exp\left(-\frac{1}{2} \int d^d x \, d^d y \, J(x) D_F(x-y) J(y)\right)\]

    where \(D_F(x-y)\) is the propagator for a real scalar field, defined by the Green’s function property

    \[i(\square + m^2) D_F(x-y) = \delta(x-y)\]
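The completing-the-square step has a finite-dimensional analogue that is easy to verify numerically (my own sketch, with an arbitrary positive-definite "kinetic operator" \(K\)): for a Gaussian weight with sources, \(Z_0[J]/Z_0[0] = \exp(\frac{1}{2} J^T K^{-1} J)\), with \(K^{-1}\) playing the role of the propagator.

```python
# Finite-dimensional analogue of completing the square:
#   Z_0[J] / Z_0[0] = E_{psi ~ N(0, K^{-1})}[exp(-J.psi)] = exp(J.T K^{-1} J / 2),
# i.e. the Gaussian moment-generating function, with K^{-1} as the "propagator".
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # arbitrary positive-definite "kinetic operator"
Kinv = np.linalg.inv(K)
J = np.array([0.3, -0.2])    # arbitrary small source

samples = rng.multivariate_normal(np.zeros(2), Kinv, size=400_000)
lhs = np.mean(np.exp(-samples @ J))   # Monte Carlo estimate of Z_0[J]/Z_0[0]
rhs = np.exp(0.5 * J @ Kinv @ J)      # closed form from completing the square
print(lhs, rhs)  # the two agree to a few decimal places
```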

Feynman rules. The correlator Feynman rules are a distilled set of rules for computing the correlator term-by-term perturbatively (via Taylor expansion), as described above.

  • There are also a separate set of Feynman rules for computing scattering amplitudes directly.

Todo: Give explicit conversion between Feynman diagrams and relevant integrals.

Appendix A.1. Classifying semi-simple Lie groups and their representations

To understand the kinds of invariances that our theory can have – specifically, the available choices of Lie groups for the gauge group – it is useful to classify the available Lie groups. Each Lie group has an associated \textit{Lie algebra} (with distinct Lie groups possibly sharing the same Lie algebra), and it is much easier to classify the available Lie algebras, so this will be the focus of A.1.1.

Furthermore, ultimately these Lie groups must act on our field content, which requires choosing a \textit{representation} of the Lie group (as outlined in the Introduction). In A.1.2, we will classify the available representations to better understand the choices we can make.

The unitary group \(U(n)\) is not semi-simple, and so the classification of its representations must be approached differently. We will explore this in Appendix A.1.3.

A.1.1. Classifying Lie algebras

A.1.1.1. Definitions

A Lie algebra \((\mathfrak{g}, [\cdot, \cdot])\) is a vector space \(\mathfrak{g}\) (over some \(\mathbb{F}\)) paired with a bracket

\[[\cdot, \cdot]: \mathfrak{g} \times \mathfrak{g} \to \mathfrak{g}\]

with the bracket satisfying (i) anti-symmetry (ii) bilinearity (iii) the Jacobi identity:

\[[X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]] = 0\]

which is a convenient property for various proofs.

  • Roughly, under certain assumptions, every Lie group has an associated Lie algebra – constructed using the tangent space \(T_e(G)\) at the identity element of \(G\) – and we can map back from a Lie algebra to a (not necessarily unique) Lie group via the exponential map \(\exp: \mathfrak{g} \to G\).

For a vector space \(V\) with an associative product \(*\) (e.g. \(V\) a matrix space with matrix multiplication \(*\)), the pair \((V, [\cdot, \cdot])\) is a Lie algebra if one defines \([\cdot, \cdot]\) by

\[[X, Y] := X * Y - Y * X\]
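This construction is easy to check numerically for matrices under the commutator bracket; a quick sketch with random matrices (my own illustration):

```python
# Check that the matrix commutator [X, Y] = XY - YX satisfies anti-symmetry
# and the Jacobi identity, making a matrix space a Lie algebra.
import numpy as np

rng = np.random.default_rng(1)
X, Y, Z = (rng.standard_normal((4, 4)) for _ in range(3))

def bracket(A, B):
    return A @ B - B @ A

antisym = bracket(X, Y) + bracket(Y, X)
jacobi = (bracket(X, bracket(Y, Z))
          + bracket(Y, bracket(Z, X))
          + bracket(Z, bracket(X, Y)))

print(np.allclose(antisym, 0), np.allclose(jacobi, 0))  # True True
```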

For two Lie algebras \(\mathfrak{g}, \mathfrak{h}\), we say that \(f: \mathfrak{g} \to \mathfrak{h}\) is a (Lie algebra) homomorphism iff (i) it is a vector space homomorphism (i.e. \(f\) linear) and (ii) \(f\) commutes with \([\cdot, \cdot]\) in the sense that:

\[f([X, Y]) = [f(X), f(Y)]\]

As usual, we say that \(f: \mathfrak{g} \to \mathfrak{h}\) is an isomorphism iff \(f\) is a bijective homomorphism whose inverse is also a homomorphism. If such an isomorphism exists, we write \(\mathfrak{g} \cong \mathfrak{h}\).

For a Lie algebra \(\mathfrak{g}\), a sub-algebra \(\mathfrak{h} \leq \mathfrak{g}\) is said to be ideal iff

\[[\mathfrak{g}, \mathfrak{h}] \subseteq \mathfrak{h}\]

using the notation \([\mathfrak{g}, \mathfrak{h}] \equiv \{[X, Y]: X \in \mathfrak{g}, Y \in \mathfrak{h}\}\). Any Lie algebra \(\mathfrak{g}\) always has the trivial ideals \(\mathfrak{h} = \mathfrak{g}\) and \(\mathfrak{h} = \{0\}\).

\textit{Semi-simplicity.} We say that \(\mathfrak{g}\) is \textit{semi-simple} if its non-trivial ideal sub-algebras \(\mathfrak{h}\) are non-abelian.

For sub-algebras \(\mathfrak{h}\) that are instead abelian, we say that \(\mathfrak{h}\) is a \textit{maximal abelian subalgebra} if it is not contained in any larger abelian sub-algebra.

For a complex semi-simple Lie algebra \(\mathfrak{g}\), we say that a sub-algebra \(\mathfrak{h} \leq \mathfrak{g}\) is a \textit{Cartan subalgebra} (CSA) iff \(\mathfrak{h}\) is abelian \& maximal, and \(\text{ad}_H: \mathfrak{g} \to \mathfrak{g}\) is diagonalizable for all \(H \in \mathfrak{h}\).

  • Any complex semi-simple Lie algebra has a class of CSAs. Indeed, all CSAs are related by conjugation.
  • The fact that \(\mathfrak{g}\) is semi-simple implies that any Cartan sub-algebra \(\mathfrak{h}\) is non-ideal.
  • \(\mathfrak{g}\) being semi-simple also ensures that a CSA exists.
  • The reason why “complex” is important is that it gives us algebraic closure. In fact, we can replace \(\mathbb{C}\) with any algebraically closed field (I think?).

Semi-simple Lie algebras are special in that there is a natural metric on the Lie algebra when we have semi-simplicity (indeed, \(\mathfrak{g}\) is semi-simple iff its Killing form is non-degenerate, which allows us to define a metric).

A \textit{simple} Lie algebra is a non-abelian Lie algebra with no non-trivial ideals. We can always write a semi-simple Lie algebra as a direct sum of simple Lie algebras. Indeed, an alternative definition of a semi-simple Lie algebra is as any direct sum of simple Lie algebras.

A.1.1.2. Constructing a basis

In the following, we will \textbf{classify all complex semi-simple finite-dim Lie algebras} \(\mathfrak{g}\), making use of the existence of a CSA \(\mathfrak{h} \leq \mathfrak{g}\).

Given a CSA \(\mathfrak{h}\) of our Lie algebra \(\mathfrak{g}\), abelian-ness tells us that

\[\text{ad}_H(H') = 0 \; \forall \; H, H' \in \mathfrak{h}\] \[\implies [\text{ad}_H, \text{ad}_{H'}] = \text{ad}_{[H, H']} \equiv 0\]

and further, since \(\text{ad}_H\) is diagonalizable, the collection \(\{\text{ad}_H\}_{H \in \mathfrak{h}}\) is simultaneously diagonalizable (since diagonalizable operators that commute with each other are simultaneously diagonalizable). By the spectral decomposition theorem, this means that \(\mathfrak{g}\) is spanned by the simultaneous eigenvectors of \(\{\text{ad}_H\}_{H \in \mathfrak{h}}\).

Overall we can write \(\mathfrak{g}\) as a span of the Cartan-Weyl basis:

\[\mathfrak{g} = \text{span}_{\mathbb{C}} \mathcal{B}, \quad \mathcal{B} := \underbrace{\{H_i\}_{i=1}^{r}}_{\text{basis of} \; \mathfrak{h}} \cup \{E_{\alpha}\}_{\alpha \in \Phi}\]

where \(\{E_{\alpha}\}_{\alpha \in \Phi}\) are simultaneous eigenvectors of \(\{\text{ad}_H\}_{H \in \mathfrak{h}}\) satisfying

\[\text{ad}_H(E_{\alpha}) = \alpha(H) E_{\alpha} \quad \forall \; \; H \in \mathfrak{h}\]

for roots \(\Phi \ni \alpha: \mathfrak{h} \to \mathbb{C}\), and with \(\text{ad}_H(H_i) = 0 \; \forall \; i\) (since \(H_i \in \mathfrak{h}\)), and defining \(r := \dim\mathfrak{h}\). Further, one can show that the roots \(\alpha \in \Phi\) are linear, and hence \(\Phi \subset \mathfrak{h}^{*}\).

  • Linearity of roots \(\alpha \in \Phi\) follows simply from their definition and using linearity of \(\text{ad}: X \mapsto \text{ad}_X\).
  • One can show that the roots \(\alpha \in \Phi\) are non-degenerate as eigenvalues.

Additionally, we will generally have \(\dim\mathfrak{h} \leq \dim\mathfrak{g}/2\), which implies that \(|\Phi| \geq \dim\mathfrak{h}^{*}\). As we will see later, one can show that the roots span \(\mathfrak{h}^{*}\), i.e. \(\mathfrak{h}^{*} = \text{span}_{\mathbb{C}}\Phi\). As a result, \(\Phi\) can be viewed as an overcomplete basis for \(\mathfrak{h}^{*}\).

  • Since \(\mathfrak{h}\) is a CSA, it is a maximal abelian sub-algebra, and hence no element outside of \(\mathfrak{h}\) can map to \(0\) under all \(\text{ad}_H\) (if an element \(X \in \mathfrak{g}\backslash\mathfrak{h}\) satisfied \(\text{ad}_H(X) = 0\) for all \(H \in \mathfrak{h}\), then \(\mathfrak{h} \oplus \text{span}_{\mathbb{C}}\{X\}\) would be an abelian subalgebra strictly containing \(\mathfrak{h}\), contradicting maximality of the CSA \(\mathfrak{h}\)).
  • As a result, \(0 \notin \Phi\).

A.1.1.3. Relations

By definition, this set of generators has relations

\[[H_i, H_j] = 0, \quad [H_i, E_{\alpha}] = \alpha(H_i) E_{\alpha}\]

and further, \([E_{\alpha}, E_{\beta}]\) can be determined via Jacobi’s identity:

\[\text{ad}_{H}([E_{\alpha}, E_{\beta}]) = (\alpha(H) + \beta(H)) [E_{\alpha}, E_{\beta}]\] \[\implies \begin{cases} [E_{\alpha}, E_{\beta}] \in \mathfrak{h}, & \alpha+\beta = 0\\ [E_{\alpha}, E_{\beta}]=N_{\alpha, \beta} E_{\alpha+\beta}, & \alpha+\beta \in \Phi\\ [E_{\alpha}, E_{\beta}]=0, & \text{otherwise} \end{cases}\]

for proportionality factor \(N_{\alpha, \beta} \neq 0\).

\textit{Natural metric for semi-simple Lie algebras.} For the following, introduce the Killing form \(\kappa\): a symmetric \((0, 2)\) tensor defined by

\[\kappa(X, Y) := \text{tr}(\text{ad}_X \circ \text{ad}_Y)\]

We can find its components more explicitly. For some basis \(\{T_i\}_i \subset \mathfrak{g}\) with structure constants \([T_i, T_j] = f^k{}_{ij} T_k\), expanding the trace in this basis gives us

\[\begin{align*} \kappa_{ij} := \kappa(T_i, T_j) &= \text{tr}(\text{ad}_{T_i} \circ \text{ad}_{T_j})\\ &= [T_i, [T_j, T_k]]^k\\ &= f^k{}_{il} f^l{}_{jk} \end{align*}\]

We have that \(\kappa\) is non-degenerate (equivalent to the matrix \(\kappa_{ij}\) being invertible in any basis) iff \(\mathfrak{g}\) is semi-simple.

  • This gives us a way to show a Lie algebra \(\mathfrak{g}\) is semi-simple: just check whether its Killing form is non-singular.
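As a concrete instance of this check (my own sketch, not from the text): for \(\mathfrak{su}(2)\), in a basis with \([T_i, T_j] = \epsilon_{ijk} T_k\), contracting the structure constants as above gives \(\kappa_{ij} = -2\delta_{ij}\), which is non-singular, confirming semi-simplicity.

```python
# Killing form kappa_ij = f^k_{il} f^l_{jk} for su(2), whose structure
# constants in the basis [T_i, T_j] = eps_{ijk} T_k are the Levi-Civita symbol.
import numpy as np

eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k], eps[j, i, k] = 1.0, -1.0

# f[k, i, j] = f^k_{ij} = eps_{ijk}
f = np.einsum('ijk->kij', eps)

kappa = np.einsum('kil,ljk->ij', f, f)
print(kappa)                      # -2 * identity
print(np.linalg.det(kappa) != 0)  # non-degenerate => semi-simple
```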

Therefore, in the case of \(\mathfrak{g}\) being semi-simple, we can view \(\kappa\) as analogous to a metric (i.e. a symmetric, non-degenerate \((0, 2)\) tensor). As we will see, we will lower indices using \(\kappa_{ij}\) and raise indices using \(\kappa^{ij} \equiv (\kappa^{-1})^{ij}\). In the following we will work with the inner product

\[(X, Y) := \kappa(X, Y) = \kappa_{ij} X^i Y^j = \kappa^{ij} X_i Y_j\]

for any \(X, Y \in \mathfrak{g}\). This inner product also naturally extends to an inner product over \(\mathfrak{g}^{*}\): for \(\omega, \eta \in \mathfrak{g}^{*}\) we will write

\[(\omega, \eta) := (X_{\omega}, X_{\eta}) \equiv \kappa^{ij} \omega_i \eta_j\]

where (analogous to Riesz representation theorem) for \(\lambda \in \mathfrak{g}^{*}\) we define \(X_{\lambda} \in \mathfrak{g}\) as satisfying

\[\lambda(Y) =: \kappa(X_{\lambda}, Y) \quad \forall \; Y \in \mathfrak{g}\]

which implies that \(X_{\lambda} = \kappa^{ij} \lambda_j T_i\) for basis \(\{T_i\}_i \subset \mathfrak{g}\) (found by writing in components).

\(\kappa\) is a special metric in that it satisfies a Jacobi identity-like property:

\[\kappa(X, [Y, Z]) = \kappa(Y, [Z, X]) = \kappa(Z, [X, Y])\]
  • Proof: using the fact that \(\text{ad}\) is a rep of \(\mathfrak{g}\) (which can be proven via Jacobi’s identity), we have

    \[\begin{align*} \kappa(X, [Y, Z]) &= \text{tr}(\text{ad}_X \circ \text{ad}_{[Y, Z]})\\ &= \text{tr}(\text{ad}_X \circ [\text{ad}_Y, \text{ad}_Z])\\ &= \text{tr}(\text{ad}_Y \circ [\text{ad}_Z, \text{ad}_X]) \equiv \kappa(Y, [Z, X])\\ &= \text{tr}(\text{ad}_Z \circ [\text{ad}_X, \text{ad}_Y]) \equiv \kappa(Z, [X, Y]) \end{align*}\]

    using the cyclic property of \(\text{tr}\).

For \(\alpha \in \mathfrak{h}^{*}\), we will use similar notation to above, with \(H_{\alpha} = \kappa^{ij} \alpha_j H_i \in \mathfrak{h}\) (with \(\kappa_{ij} \equiv \kappa(H_i, H_j)\)) satisfying

\[\alpha(H) = \kappa(H_{\alpha}, H) \quad \forall \; H \in \mathfrak{h}\]

\textit{Relations.} Using the properties established above, we can write

\[[E_{\alpha}, E_{-\alpha}] = (E_{\alpha}, E_{-\alpha}) H_{\alpha}\]
  • This follows from the invariance property of \(\kappa\):

    \[\kappa(H, [E_{\alpha}, E_{-\alpha}]) = \kappa(E_{\alpha}, [E_{-\alpha}, H]) = \underbrace{\alpha(H)}_{\kappa(H_{\alpha}, H)} \kappa(E_{\alpha}, E_{-\alpha})\] \[\implies \kappa(H, [E_{\alpha}, E_{-\alpha}] - \kappa(E_{\alpha}, E_{-\alpha}) H_{\alpha}) = 0 \quad \forall \; H \in \mathfrak{h}\] \[\implies [E_{\alpha}, E_{-\alpha}] = \kappa(E_{\alpha}, E_{-\alpha}) H_{\alpha} \equiv (E_{\alpha}, E_{-\alpha}) H_{\alpha}\]

    by non-degeneracy of \(\kappa\).

Instead of considering the generators \(\{H_i, E_{\alpha}\}_{i, \alpha \in \Phi}\), we will instead consider \(\{H_{\alpha}, E_{\alpha}\}_{\alpha \in \Phi}\) to be our generators, which (using \(H_{\alpha} = \kappa^{ij} \alpha_j H_i\)) have relations

\[[H_{\alpha}, H_{\beta}] = 0, \quad [H_{\alpha}, E_{\beta}] = (\alpha, \beta) E_{\beta},\] \[[E_{\alpha}, E_{\beta}] = \begin{cases} (E_{\alpha}, E_{-\alpha}) H_{\alpha} & \alpha+\beta=0\\ N_{\alpha, \beta} E_{\alpha+\beta} & \alpha+\beta \in \Phi\\ 0 & \text{otherwise} \end{cases}\]

Note that the sub-algebra generated by \((H_{\alpha}, E_{\alpha}, E_{-\alpha})\) has relations

\[[H_{\alpha}, E_{\pm\alpha}] = \pm(\alpha, \alpha)E_{\pm\alpha}, \quad [E_{\alpha}, E_{-\alpha}] = (E_{\alpha}, E_{-\alpha}) H_{\alpha}\]

We would like for these relations to precisely match the relations of \(\mathfrak{su}(2)_{\mathbb{C}}\). To achieve this, we define rescaled elements

\[h_{\alpha} := \frac{2}{(\alpha, \alpha)} H_{\alpha}, \quad e_{\alpha} := \sqrt{\frac{2}{(\alpha, \alpha) (E_{\alpha}, E_{-\alpha})}} E_{\alpha}\]

with \((h_{\alpha}, e_{\alpha}, e_{-\alpha})\) precisely satisfying the \(\mathfrak{su}(2)_{\mathbb{C}}\) algebra:

\[[h_{\alpha}, e_{\pm\alpha}] = \pm 2e_{\pm\alpha}, \quad [e_{\alpha}, e_{-\alpha}] = h_{\alpha}\]

That is, each root \(\alpha \in \Phi\) is associated with a subalgebra corresponding to \(\mathfrak{su}(2)_{\mathbb{C}}\).
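These relations are realized concretely by the standard \(2 \times 2\) matrices of \(\mathfrak{su}(2)_{\mathbb{C}} \cong \mathfrak{sl}(2, \mathbb{C})\); a quick numerical check (my own sketch):

```python
# Explicit 2x2 realization of the (h, e_+, e_-) relations:
#   [h, e_{+-}] = +-2 e_{+-},  [e_+, e_-] = h.
import numpy as np

h = np.array([[1.0, 0.0], [0.0, -1.0]])
ep = np.array([[0.0, 1.0], [0.0, 0.0]])
em = np.array([[0.0, 0.0], [1.0, 0.0]])

def bracket(A, B):
    return A @ B - B @ A

print(np.allclose(bracket(h, ep), 2 * ep))   # True
print(np.allclose(bracket(h, em), -2 * em))  # True
print(np.allclose(bracket(ep, em), h))       # True
```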

These rescaled elements \(\{h_{\alpha}, e_{\alpha}, e_{-\alpha}\}_{\alpha}\) have global relations

\[[h_{\alpha}, h_{\beta}] = 0, \quad [h_{\alpha}, e_{\beta}] = \frac{2(\alpha, \beta)}{(\alpha, \alpha)} e_{\beta}, \quad [e_{\alpha}, e_{\beta}] = \begin{cases} h_{\alpha} & \alpha + \beta = 0\\ n_{\alpha, \beta} e_{\alpha+\beta} & \alpha+\beta \in \Phi\\ 0 & \text{otherwise} \end{cases}\]

This construction implies that we have

\[\frac{2(\alpha, \beta)}{(\alpha, \alpha)} \in \mathbb{Z} \quad \forall \; \alpha, \beta \in \Phi\]
  • This follows because: we have that \(\mathfrak{g}^{\alpha} := \text{span}_{\mathbb{C}}\{h^{\alpha}, e^{\alpha}, e^{-\alpha}\}\) is such that \(\mathfrak{g}^{\alpha} \cong \mathfrak{su}(2)_{\mathbb{C}}\). And \(\mathfrak{g}^{\alpha}\) has CSA \(\mathfrak{h}^{\alpha} := \text{span}_{\mathbb{C}}\{h^{\alpha}\}\). Since \(\mathfrak{g}^{\alpha} \cong \mathfrak{su}(2)_{\mathbb{C}}\), then any irrep of \(\mathfrak{g}^{\alpha}\) is isomorphic to some irrep of \(\mathfrak{su}(2)_{\mathbb{C}}\) (?). And note that isomorphic reps share the same weight set, and any rep of \(\mathfrak{su}(2)_{\mathbb{C}}\) has a weight set of the form \(\{-\Lambda, -\Lambda+2, \ldots, \Lambda\}\).
  • Now we will construct a valid irrep of \(\mathfrak{g}^{\alpha}\) on some space \(V\) and apply the above to obtain the desired result. The minimal invariant subspace \(V\) of \(\text{ad}\) will act as a particularly natural irrep. See that

    \[\text{ad}_{h^{\alpha}}(e^{\beta}) = \frac{2(\alpha, \beta)}{(\alpha, \alpha)} e^{\beta}, \quad \text{ad}_{e^{\alpha}}(e^{\beta}) = \begin{cases} n_{\alpha, \beta} e^{\alpha+\beta}, & \alpha+\beta \in \Phi\\ 0, & \text{otherwise} \end{cases}\]

    assuming that \(\beta \in \Phi\) is such that \(\beta \neq -\alpha\) (as if \(\beta = -\alpha\), then \(h^{\alpha}\) would have to be included in our vector space, but we are looking for the \textit{minimal} invariant subspace). As a result, we clearly have that

    \[V_{\alpha, \beta} := \text{span}_{\mathbb{C}}\{e^{\beta+\rho\alpha} \, | \, \rho \in \mathbb{Z}: \beta+\rho\alpha \in \Phi\}\]

    is the minimal invariant subspace. As a result, \(\text{ad}^{\alpha, \beta}: \mathfrak{g}^{\alpha} \to \mathfrak{gl}(V_{\alpha, \beta}), X \mapsto [X, \cdot]\) is an irrep, and by isomorphism \(\mathfrak{g}^{\alpha} \cong \mathfrak{su}(2)_{\mathbb{C}}\), we have that

    \[S_{\text{ad}^{\alpha, \beta}} = \{-\Lambda, -\Lambda+2, \ldots, \Lambda\} \quad \text{for some} \; \Lambda \in \mathbb{Z}_{+}\]

    and note that, by definition, this weight set can be written

    \[\begin{align*} S_{\text{ad}^{\alpha, \beta}} &= \{\lambda \in \mathbb{C} \, | \, \exists \, v \in V_{\alpha, \beta}: \text{ad}_{h^{\alpha}}^{\alpha, \beta}(v) = \lambda v\}\\ &= \left\{\frac{2(\alpha, \beta)}{(\alpha, \alpha)} + 2\rho \, | \, \rho \in \mathbb{Z}: \beta+\rho\alpha \in \Phi\right\} \end{align*}\]

    and so we arrive at

    \[\frac{2(\alpha, \beta)}{(\alpha, \alpha)} \in \mathbb{Z} \quad \forall \; \alpha, \beta \in \Phi\]
  • For any \(\alpha, \beta \in \Phi\), these results also imply that there must be some maximum \(\rho = n_{+} \in \mathbb{Z}_{\geq 0}\) for which \(\beta+\rho \alpha \in \Phi\), as well as some minimum \(\rho = n_{-} \in \mathbb{Z}_{\leq 0}\) for which \(\beta+\rho\alpha \in \Phi\) holds. Since these describe the endpoints of the weight set, corresponding to \(\Lambda\) and \(-\Lambda\) respectively, then we must have

    \[\frac{2(\alpha, \beta)}{(\alpha, \alpha)} = -n_{+} - n_{-}\]

    For a pair of roots \(\alpha, \beta \in \Phi\), we will use the notation \(n_{+} = n_{+}(\alpha, \beta)\) and \(n_{-} = n_{-}(\alpha, \beta)\) for these integers.

  • We can actually generalize this result to arbitrary Lie algebra representations \(d\) other than the adjoint representation (i.e. representations over arbitrary vector spaces \(V\) other than over \(\mathfrak{g}\)). We will provide this proof later during the classification of representations.
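The integrality condition can be checked explicitly on a small root system (my own sketch, using the \(A_2\) roots realized in \(\mathbb{R}^2\) with the Euclidean inner product standing in for the Killing-form-induced one, which is positive definite on the real span of the roots):

```python
# Check the integrality condition 2(a, b)/(a, a) in Z for all roots of A2.
import math

a1 = (1.0, 0.0)
a2 = (-0.5, math.sqrt(3) / 2)
roots = [a1, a2, (0.5, math.sqrt(3) / 2)]          # a1, a2, a1 + a2
roots += [(-x, -y) for (x, y) in roots]            # roots come in +- pairs

def ip(u, v):
    return u[0] * v[0] + u[1] * v[1]

vals = [2 * ip(a, b) / ip(a, a) for a in roots for b in roots]
print(all(abs(v - round(v)) < 1e-9 for v in vals))  # True
```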

In the above, we have shown that each root string \(S_{\alpha, \beta} := \{\beta + \rho\alpha \in \Phi \, | \, \rho \in \mathbb{Z}\}\), of length \(\ell_{\alpha, \beta} := |S_{\alpha, \beta}|\), is in correspondence with a rep of \(\mathfrak{su}(2)_{\mathbb{C}}\) of dimension \(\ell_{\alpha, \beta}\). In particular, the constructed rep \(\text{ad}^{\alpha, \beta}\) acts on

\[V_{\alpha, \beta} = \text{span}_{\mathbb{C}}\{e_{\gamma}: \gamma \in S_{\alpha, \beta}\}\]
which has dimension \(\dim V_{\alpha, \beta} = |S_{\alpha, \beta}| \equiv \ell_{\alpha, \beta}\). And since \(\ell_{\alpha, \beta} = n_{+} - n_{-} + 1\), for simple roots \(\alpha, \beta \in \Phi_S\) (for which \(n_{-} = 0\)), we can write
\[\ell_{\alpha, \beta} = 1 - \frac{2(\alpha, \beta)}{(\alpha, \alpha)}\]

The \textit{Cartan matrix} \(A\) is defined

\[A_{ij} := \frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(j)}, \alpha_{(j)})}\]

Since \((\alpha, \alpha) > 0\), this tells us that we must have \((\alpha, \beta) \leq 0\) for all distinct \(\alpha, \beta \in \Phi_S\) (as otherwise \(\ell_{\alpha, \beta} < 1\), which is invalid). As we will see, this tells us that the off-diagonal elements of the Cartan matrix are non-positive: \(A_{ij} \leq 0\) for \(i \neq j\).

Using the notation \(\ell_{i, j} := \ell_{\alpha_{(i)}, \alpha_{(j)}}\), we therefore have

\[\ell_{i, j} = 1-A_{ji}\]

which we use later to derive Serre’s relation.

A.1.1.4. Simple roots and geometry

We have that \(\mathfrak{h}^{*} = \text{span}_{\mathbb{C}} \Phi\).

  • Proof: if \(\Phi\) does not span \(\mathfrak{h}^{*}\), then there exists some non-zero \(X \in \mathfrak{h}\) such that \(\alpha(X) \equiv \alpha_i X^i = 0\) for all \(\alpha \in \Phi\). From here, we can use the existence of \(X\) to construct a non-trivial ideal subalgebra \(\mathfrak{k}\) (i.e. \([\mathfrak{g}, \mathfrak{k}] \subseteq \mathfrak{k}\)) that is abelian, contradicting semi-simplicity of \(\mathfrak{g}\). This is straightforward: \(X \in \mathfrak{h}\), so the span of \(X\) is clearly an abelian subalgebra (since CSAs are abelian), and it remains to show that it is ideal. See that

    \[[X, E_{\alpha}] = X^i [H_i, E_{\alpha}] = \underbrace{X^i \alpha_i}_{=\, 0} E_{\alpha} = 0\]

    (using commutation relations) and hence \(\text{span}_{\mathbb{C}} \{X\}\) is an ideal abelian subalgebra of \(\mathfrak{g}\), contradicting semi-simplicity.

Since \(\mid\Phi\mid \geq \dim\mathfrak{h}^{*}\), the set of roots will generally be an overcomplete basis for \(\mathfrak{h}^{*}\). This motivates constructing a minimal set of exactly \(r\) roots that acts as a basis for \(\mathfrak{h}^{*}\).

To do so, we perform the following reduction to obtain the \textit{simple roots}.

\textit{Constructing the simple roots.} As we have used above in writing \(E_{-\alpha}\) as a generator, we have that \(\alpha \in \Phi \iff -\alpha \in \Phi\).

  • Proof: To show this, we must make use of two results:

    1. \(\kappa(H, E_{\alpha}) = 0\) for all \(H \in \mathfrak{h}\) and all \(\alpha \in \Phi\).
      • Proof: Since \(\alpha \neq 0\), there exists \(H' \in \mathfrak{h}\) such that \(\alpha(H') \neq 0\), and so

        \[\alpha(H') \kappa(H, E_{\alpha}) = \kappa(H, \alpha(H') E_{\alpha}) = \kappa(H, [H', E_{\alpha}]) = \kappa(E_{\alpha}, [H, H']) = 0\]
    2. For any \(\alpha, \beta \in \Phi\) such that \(\beta \neq -\alpha\), we have that \(\kappa(E_{\alpha}, E_{\beta}) = 0\).
      • Proof: Consider \((\alpha(H) + \beta(H)) \kappa(E_{\alpha}, E_{\beta})\) and use linearity + Jacobi property of \(\kappa\).

    These results tell us that, for any \(\alpha \in \Phi\), \(\kappa(E_{\alpha}, H) = \kappa(E_{\alpha}, E_{\beta}) = 0\) for any \(H \in \mathfrak{h}\) and any \(\beta \in \Phi\) such that \(\beta \neq -\alpha\). But since \(\kappa\) is non-degenerate by semi-simplicity, \(\kappa(E_{\alpha}, \cdot)\) cannot map everything to zero. The only option is for \(-\alpha \in \Phi\) with \(\kappa(E_{\alpha}, E_{-\alpha}) \neq 0\).

And since \(0 \notin \Phi\) and \(\alpha \in \Phi \iff -\alpha \in \Phi\), we have that \(\mid\Phi\mid\) is even. Let's separate \(\mathfrak{h}^{*} \cong \mathbb{C}^{r}\) into two halves via an \((r-1)\) \(\mathbb{C}\)-dimensional hyperplane through the origin (chosen generically, so that it contains no roots). This hyperplane splits \(\Phi\) into two equally sized sets of roots:

\[\Phi = \Phi_{+} \cup \Phi_{-}\]

Namely, if \(\alpha \in \Phi_{+}\), then we will have \(-\alpha \in \Phi_{-}\). Further, if \(\alpha, \beta \in \Phi_{+}\) and \(\alpha + \beta \in \Phi\), then \(\alpha + \beta \in \Phi_{+}\) (and similarly for \(\Phi_{-}\)).

Now that we have halved the size of \(\Phi\) to \(\Phi_{+}\), there is one further reduction we perform. We say that a root \(\alpha \in \Phi\) is simple iff it is a positive root (\(\alpha \in \Phi_{+}\)) and cannot be written as the sum of two positive roots. We denote the set of simple roots by

\[\Phi_S = \{\alpha_{(i)} : i = 1, \ldots, |\Phi_S|\}\]

We will now show that \(\Phi_S\) is a basis for \(\mathfrak{h}^{*}\) with \(\mid\Phi_S\mid = r\).

Firstly, we have that

\[\Phi_{+} \subseteq \text{span}_{\mathbb{Z}_{\geq 0}} \Phi_S, \quad \Phi_{-} \subseteq \text{span}_{\mathbb{Z}_{\leq 0}} \Phi_S\]

telling us that any positive root can be written as a linear combination of simple roots with positive coefficients, and similarly for negative roots.

  • Proof: We want to show that \(\alpha \in \Phi_{+} \implies \alpha \in \text{span}_{\mathbb{Z}_{\geq 0}} \Phi_S\). For \(\alpha \in \Phi_{+}\), if \(\alpha \in \Phi_S\) then we are done. Otherwise, since \(\alpha\) is non-simple there must exist some \(\alpha_1, \alpha_2 \in \Phi_{+}\) such that \(\alpha = \alpha_1 + \alpha_2\). If \(\alpha_1\) and \(\alpha_2\) are both simple, then we are done. Otherwise, repeat the initial procedure for non-simple \(\alpha_1\) or \(\alpha_2\) (or both). Eventually this process must terminate since there are a finite number of roots. As a result, we must have \(\alpha \in \text{span}_{\mathbb{Z}_{\geq 0}} \Phi_S\).
  • This result extends analogously to \(\Phi_{-}\) since \(\alpha \in \Phi_{+} \iff -\alpha \in \Phi_{-}\).

This result implies that \(\Phi \subseteq \text{span}_{\mathbb{Z}} \Phi_S\) and, since \(\Phi\) \(\mathbb{C}\)-spans \(\mathfrak{h}^{*}\), we therefore have that

\[\text{span}_{\mathbb{C}} \Phi_S = \mathfrak{h}^{*}\]

also.

Finally, we have that the simple roots are linearly independent, meaning that \(|\Phi_S| = r\) and hence \(\Phi_S\) is a basis for \(\mathfrak{h}^{*}\).

  • Proof: Note that we can write any \(\lambda \in \mathfrak{h}^{*}\) as

    \[\lambda = \sum_i c_i \alpha_{(i)}, \qquad c_i \in \mathbb{C}\]

    To show linear independence of \(\Phi_S\) it is sufficient to show that \(\lambda = 0 \implies c_i = 0 \; \forall \; i\). See that if \(\lambda = 0\), we can apply \((\cdot, \alpha_{(j)})\) to the above:

    \[0 = \sum_i c_i (\alpha_{(i)}, \alpha_{(j)}) \quad \forall \; \; j\]

    This condition can be written as a matrix equation:

    \[Ac = 0\]

    for \(c = (c_1, \ldots, c_s)^T \in \mathbb{C}^s\) and the symmetric matrix \(A\) defined by \(A_{ij} := (\alpha_{(i)}, \alpha_{(j)})\). We have that \(A\) is non-singular as it is positive definite:

    \[x^T A x = x^i x^j (\alpha_{(i)}, \alpha_{(j)}) = (\chi, \chi) > 0\]

    for any non-zero \(x \in \mathbb{R}^{s}\), defining \(\chi := x^i \alpha_{(i)} \in \mathfrak{h}^{*}\) (here we use that the roots lie in a real subspace of \(\mathfrak{h}^{*}\) on which \((\cdot, \cdot)\) is positive definite, so it suffices to consider real coefficients). Hence \(A\) is non-singular, and so \(Ac = 0\) has the unique solution \(c=0\), giving us linear independence.

One useful result is that the difference of two simple roots is not a root.

  • Proof: For any two simple roots \(\alpha_{(i)}, \alpha_{(j)} \in \Phi_S\), define

    \[\alpha := \alpha_{(i)} - \alpha_{(j)}\]

    Now assume that \(\alpha \in \Phi\), then WLOG we can assume \(\alpha \in \Phi_{+}\) (as otherwise we can switch sign \(\alpha \mapsto -\alpha\) in the definition of \(\alpha\)). Then \(\alpha + \alpha_{(j)} = \alpha_{(i)}\) is the sum of two positive roots that equals a simple root, which contradicts the definition of a simple root. Therefore, we must have \(\alpha_{(i)} - \alpha_{(j)} \notin \Phi\).
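The positive/simple-root construction above is completely algorithmic. As an illustration (my own sketch, using the \(A_2\) root system realized in \(\mathbb{R}^2\)): pick a generic linear functional to split \(\Phi\) into \(\Phi_{\pm}\), then keep the positive roots that are not sums of two positive roots.

```python
# Extract the simple roots of the A2 root system (six roots in the plane).
import math

a1 = (1.0, 0.0)
a2 = (-0.5, math.sqrt(3) / 2)
plus = lambda u, v: (u[0] + v[0], u[1] + v[1])

roots = [a1, a2, plus(a1, a2)]
roots += [(-x, -y) for (x, y) in roots]   # roots come in +- pairs

t = (0.1, 1.0)                             # generic positivity functional
positive = [r for r in roots if r[0] * t[0] + r[1] * t[1] > 0]

def close(u, v):
    return abs(u[0] - v[0]) < 1e-9 and abs(u[1] - v[1]) < 1e-9

# simple = positive roots not expressible as a sum of two positive roots
simple = [r for r in positive
          if not any(close(r, plus(p, q)) for p in positive for q in positive)]

print(simple)  # the two simple roots a1 and a2 (rank r = 2)
```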

A.1.1.5. Classification

The simple roots

\[\Phi_S = \{\alpha_{(i)}: i = 1, \ldots, r\}\]

provide a canonical \(\mathbb{C}\)-basis for \(\mathfrak{h}^{*}\). Writing \(h_i := h_{\alpha_{(i)}}\) and \(e_{i} := e_{\alpha_{(i)}}\) (and \(h_{-i} := h_{-\alpha_{(i)}}\), \(e_{-i} := e_{-\alpha_{(i)}}\)), each triple \((h_i, e_i, e_{-i})\) generates a subalgebra corresponding to \(\mathfrak{su}(2)_{\mathbb{C}}\), with the global relations

\[[h_i, h_j] = 0, \quad [h_i, e_{\pm j}] = \pm \underbrace{\frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(i)}, \alpha_{(i)})}}_{=: A_{ji}} e_{\pm j}, \quad [e_i, e_{-j}] = \delta_{ij} h_i,\]

defining the Cartan matrix

\[A_{ij} := \frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(j)}, \alpha_{(j)})}\]

Similarly to before, we can define the subalgebra \(\mathfrak{g}_i := \text{span}_{\mathbb{C}}\{h_i, e_i, e_{-i}\}\), which has \(\mathfrak{g}_{i} \cong \mathfrak{su}(2)_{\mathbb{C}}\) and associated CSA \(\mathfrak{h}_{i} := \text{span}_{\mathbb{C}}\{h_i\}\). The generators \(\{h_i, e_i, e_{-i}\}_{i}\) are called the Chevalley basis of \(\mathfrak{g}\).

In addition to the above relations, we have

\[\text{ad}_{e_i}(e_j) \equiv [e_i, e_j] \propto \begin{cases} e_{\alpha_{(i)}+\alpha_{(j)}}, & \alpha_{(i)} + \alpha_{(j)} \in \Phi\\ 0, & \text{otherwise} \end{cases}\]

meaning that

\[\text{ad}_{e_i}^{\rho}(e_j) \propto \begin{cases} e_{\alpha_{(j)} + \rho\alpha_{(i)}}, & \rho \in \{n_{-}^{(i, j)}, \ldots, n_{+}^{(i, j)}\}\\ 0, & \text{otherwise} \end{cases}\]

defining \(n_{+}^{(i, j)} := n_{+}(\alpha_{(i)}, \alpha_{(j)})\) and \(n_{-}^{(i, j)} := n_{-}(\alpha_{(i)}, \alpha_{(j)})\), where \(n_{+}, n_{-}\) are as defined in A.1.1.3.

Furthermore, as proven in the previous section, the difference of two simple roots is not a root (i.e. \(\alpha_{(i)} - \alpha_{(j)} \notin \Phi\)), which tells us that \(n_{-}^{(i, j)} = 0\) for all \(i, j\). Recall also from A.1.1.3 that we found

\[n_{+}^{(i, j)} = -\frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(i)}, \alpha_{(i)})} \equiv -A_{ji}\]

As a result, we have

\[\text{ad}_{e_i}^{\rho}(e_j) \propto \begin{cases} e_{\alpha_{(j)} + \rho\alpha_{(i)}}, & \rho \in \{0, 1, \ldots, -A_{ji}\}\\ 0, & \text{otherwise} \end{cases}\]

providing the \textit{Serre relation} for \(i\neq j\):

\[\text{ad}_{e_i}^{-A_{ji}}(e_j) \neq 0, \qquad \text{ad}_{e_i}^{-A_{ji}+1}(e_j) = 0\]
  • \(A_{ji} \in \mathbb{Z}_{\leq 0}\) for \(i\neq j\) since \((\alpha_{(i)}, \alpha_{(j)}) \leq 0\) for \(i \neq j\). In contrast, \(A_{ii} = 2\) (no sum).
  • The Serre relation tells us that \(\text{ad}_{e_i}^{-A_{ji}}(e_j)\) spans the top of the \(\alpha_{(i)}\)-root string through \(\alpha_{(j)}\), i.e. it is the highest weight vector of the irreducible \(\mathfrak{g}_i\)-subrepresentation generated by \(e_j\) under the adjoint action.

\textbf{Classifying algebras via Cartan matrices.} We saw that the entries of the Cartan matrix \(A \in \mathbb{Z}^{r \times r}\) completely determine the relations between elements of the Chevalley basis \(\{h_i, e_i, e_{-i}\}_{i=1}^{r}\) of \(\mathfrak{g}\). Indeed, the Cartan matrix determines a complex semi-simple Lie algebra uniquely up to isomorphism (via the Serre construction).

As a result, the problem of classifying all complex semi-simple Lie algebras reduces to classifying all possible Cartan matrices. In the following, we will derive some necessary conditions on \(A\) to aid in this classification.

Todo: relevant proofs

As a result, classifying all complex semi-simple finite-dim Lie algebras reduces to determining the valid Cartan matrices \(A\). The definition and properties of \(A\) impose the following constraints:

  1. \(A_{ii} = 2\) for all \(i = 1, \ldots, r\).
  2. \(A_{ij} = 0 \iff A_{ji} = 0\).
  3. \(A_{ij} \in \mathbb{Z}_{\leq 0}\) for all \(i \neq j\).
  4. \(\det A > 0\).
  5. \(A_{ij} A_{ji} \in \{0, 1, 2, 3\}\) for all \(i \neq j\).
  6. \(A\) is indecomposable, i.e. \(P A P^{-1}\) is not block diagonal for any permutation matrix \(P\) (otherwise \(\mathfrak{g}\) splits into a direct sum of smaller algebras, each classified separately).

This is sufficient to classify all complex semi-simple finite-dim Lie algebras, amounting to the four infinite families \(A_r, B_r, C_r, D_r\) as well as the exceptional algebras \(E_6, E_7, E_8, F_4, G_2\).
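Conditions 1–6 are easy to check mechanically for a candidate integer matrix. The sketch below is illustrative (the helper names are ours, and the Laplace-expansion determinant is only sensible for small \(r\)):

```python
def det(M):
    # determinant via Laplace expansion; fine for the small r considered here
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def is_connected(A):
    # condition 6: the graph with an edge i ~ j whenever A_ij != 0 is connected
    r = len(A)
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(r):
            if j not in seen and A[i][j] != 0:
                seen.add(j)
                stack.append(j)
    return len(seen) == r

def is_cartan_matrix(A):
    r = len(A)
    if any(A[i][i] != 2 for i in range(r)):                      # condition 1
        return False
    for i in range(r):
        for j in range(r):
            if i == j:
                continue
            if (A[i][j] == 0) != (A[j][i] == 0):                 # condition 2
                return False
            if not (isinstance(A[i][j], int) and A[i][j] <= 0):  # condition 3
                return False
            if A[i][j] * A[j][i] not in (0, 1, 2, 3):            # condition 5
                return False
    return det(A) > 0 and is_connected(A)                        # conditions 4, 6

A2 = [[2, -1], [-1, 2]]   # valid: the Cartan matrix of A_2
G2 = [[2, -1], [-3, 2]]   # valid: the Cartan matrix of G_2
bad = [[2, -2], [-2, 2]]  # invalid: A_12 A_21 = 4 and det = 0
```

Condition 6 is checked as connectivity of the coupling graph, which is equivalent to indecomposability given condition 2.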

Appendix A.1.2. Classifying representations

We now wish to classify all available representations \(D: G \to \text{GL}(V)\) of a given Lie group \(G\). We can proceed in two steps: (i) classify all Lie algebra reps, then (ii) lift these to Lie group reps via the exponential map (possible when \(G\) is simply connected).

Definitions

A Lie algebra representation \(d: \mathfrak{g} \to \mathfrak{gl}(V)\) is a (Lie algebra) homomorphism between \(\mathfrak{g}\) and \(\mathfrak{gl}(V)\) for some vector space \(V\). One often calls \(V\) the representation, or the representation space.

For a representation \(d\) of \(\mathfrak{g}\) over \(V\), we say that \(W \leq V\) is an invariant subspace of \(d\) iff

\[d_X(W) \subseteq W \quad \forall \; \; X \in \mathfrak{g}\]

Any representation \(d\) always has the trivial invariant subspaces \(W = V\) and \(W = \{0\}\). If \(d\) has no non-trivial invariant subspaces, then we say that \(d\) is irreducible. Otherwise, we say that \(d\) is reducible.

We can generalize the concept of irreducibility: if we can decompose

\[V = W_1 \oplus \cdots \oplus W_K\]

for a collection of invariant subspaces \(\{W_i\}_{i=1}^{K}\) of \(d\), then we say that \(d\) is totally reducible. The case of \(K=1\) corresponds to irreducibility.

In particular, if \(d\) is totally reducible, then we may write

\[d = d^{(1)} \oplus \cdots \oplus d^{(K)}\]

for a collection of irreducible representations \(\{d^{(i)}: \mathfrak{g} \to \mathfrak{gl}(W_i)\}_{i=1}^{K}\).

Weyl’s complete-reducibility theorem – the Lie-algebra analogue of Maschke’s theorem for finite groups – tells us that any finite-dimensional representation of a semi-simple Lie algebra \(\mathfrak{g}\) is totally reducible (assuming a characteristic zero field \(\mathbb{F}\)).

  • This can be seen as a justification for studying semi-simple algebras: it means we only need to classify the irreps.

Classification

Consider a rep \(d\) of \(\mathfrak{g}\) on \(V\). We want to better understand the vector spaces \(V\) on which \(\mathfrak{g}\) can act through reps \(d\). First we will work with the initial Cartan-Weyl basis \(\{H_i, E_{\alpha}\}_{i, \alpha}\) of \(\mathfrak{g}\) paired with CSA \(\mathfrak{h}\). Note that

\[[d_{H_i}, d_{H_j}] = d_{[H_i, H_j]} = 0\]

since \([H_i, H_j] = 0\). This means that the operators \(\{d_{H_i}\}_{i=1}^{r}\) on \(V\) are simultaneously diagonalizable. Then define the eigenspace of simultaneous eigenvectors associated with weight \(\lambda \in \mathfrak{h}^{*}\) by

\[V_d^{\lambda} := \{v \in V: d_H(v) = \lambda(H)v \;\; \forall \; H \in \mathfrak{h}\}\]

Denote the total set of weights of rep \(d\) by \(S_d \subset \mathfrak{h}^{*}\), i.e.

\[S_d := \{\lambda \in \mathfrak{h}^{*}: V_{d}^{\lambda} \neq \{0\}\}\]

It is easy to show that

\[d_{E_{\alpha}}(V_d^{\lambda}) \subseteq \begin{cases} V_d^{\lambda+\alpha}, & \lambda+\alpha \in S_d\\ \{0\}, & \text{otherwise} \end{cases}\]

telling us that \(d_{E_{\alpha}}\) raises weights \(\lambda \to \lambda+\alpha\).
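The raising action \(d_{E_{\alpha}}: V_d^{\lambda} \to V_d^{\lambda+\alpha}\) can be verified numerically in the simplest case \(\mathfrak{g} = \mathfrak{su}(2)_{\mathbb{C}}\), where there is a single positive root and the weights are integers. Below we use the standard spin-1 matrices (the helper names are ours):

```python
from math import sqrt

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

# spin-1 (3-dimensional) irrep of su(2)_C: d_h has eigenvalues {2, 0, -2}
# and d_e satisfies [d_h, d_e] = 2 d_e, so it raises eigenvalues by 2
h = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, -2.0]]
e = [[0.0, sqrt(2), 0.0], [0.0, 0.0, sqrt(2)], [0.0, 0.0, 0.0]]

v = [0.0, 0.0, 1.0]   # eigenvector of d_h with eigenvalue -2
ev = matvec(e, v)     # raised vector: lies in the eigenvalue-0 eigenspace
hv = matvec(h, ev)    # = 0 * ev, confirming ev has eigenvalue 0
```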

Every finite-dim rep \(d\) has at least one \textit{highest weight} \(\Lambda \in S_d\) defined by

\[d_{E_{\alpha}}(V_{d}^{\Lambda}) = \{0\} \;\; \forall \; \alpha \in \Phi_{+}\]

i.e. the highest weight \(\Lambda \in S_d\) is such that, for any positive root \(\alpha \in \Phi_{+}\), \(\Lambda + \alpha \notin S_d\).

Now restrict \(d\) to the subalgebra \(\mathfrak{g}^{\alpha} \cong \mathfrak{su}(2)_{\mathbb{C}}\) for some root \(\alpha \in \Phi\). The relevant CSA is \(\mathfrak{h}^{\alpha} = \text{span}_{\mathbb{C}}\{h_{\alpha}\}\), which is 1-dimensional, so the weights of the restricted rep are simply the eigenvalues of \(d_{h_{\alpha}}: V \to V\). By the representation theory of \(\mathfrak{su}(2)_{\mathbb{C}}\), these eigenvalues must form a set \(\{-\Lambda, \ldots, \Lambda\}\) of integers. In particular,

\[\lambda \in S_d \implies d_{h_{\alpha}}(v) = \lambda(h_{\alpha}) v \; \; \text{for some} \; \; v \in V\backslash\{0\}\]

so that \(\lambda(h_{\alpha}) \in \mathbb{Z}\). But we can equivalently write

\[\lambda(h_{\alpha}) = \frac{2}{(\alpha, \alpha)} \lambda(H_{\alpha}) = \frac{2(\lambda, \alpha)}{(\alpha, \alpha)}\]

meaning that

\[\frac{2(\lambda, \alpha)}{(\alpha, \alpha)} \in \mathbb{Z} \quad \forall \; \lambda \in S_d, \alpha \in \Phi\]

In the second equality we used the Riesz representation theorem-like definition of \(H_{\alpha}\), which gives

\[\lambda(H_{\alpha}) = \kappa^{ij} \alpha_j \lambda(H_i) = \kappa^{ij} \alpha_j \lambda_i \equiv (\lambda, \alpha)\]

The above generalizes the previously used result: taking \(d = \text{ad}^{\alpha, \beta}\) recovers the root-specific condition \(2(\alpha, \beta)/(\alpha, \alpha) \in \mathbb{Z}\).

\textit{Dominant weights.} For the following, we will define the co-roots

\[\hat{\alpha}_{(i)} := \frac{2}{(\alpha_{(i)}, \alpha_{(i)})} \alpha_{(i)}\]

For \(\lambda \in \mathfrak{h}^{*}\), we define its components \(\lambda^i := (\lambda, \hat{\alpha}_{(i)})\). In particular, they are the components of \(\lambda\) under the dual basis \(\{\omega_{(i)}\}_i\) of \(\{\hat{\alpha}_{(i)}\}_i\):

\[\lambda = \lambda^i \omega_{(i)}\]

with \((\omega_{(i)}, \hat{\alpha}_{(j)}) = \delta_{ij}\). Then we define the dominant weights as

\[\mathcal{D}_W := \{\lambda \in \mathfrak{h}^{*}: \lambda^i \in \mathbb{Z}_{\geq 0} \; \forall \; i\}\]

The main result: for every dominant weight \(\lambda \in \mathcal{D}_W\), there exists a unique finite-dimensional irrep \(d^{(\lambda)}\) of \(\mathfrak{g}\) for which \(\lambda\) is its highest weight. And further, this exhausts all finite-dim irreps. As a result, we can determine finite-dim irreps by iterating over all dominant weights.
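As an illustration of the main result, for \(\mathfrak{g} = A_2 \cong \mathfrak{su}(3)_{\mathbb{C}}\) the Weyl dimension formula assigns to the dominant weight with components \((\lambda^1, \lambda^2)\) the irrep dimension \((\lambda^1+1)(\lambda^2+1)(\lambda^1+\lambda^2+2)/2\), so iterating over dominant weights enumerates the familiar \(\mathfrak{su}(3)\) irreps (a sketch; `dim_A2` is our own helper name):

```python
def dim_A2(l1, l2):
    """Dimension of the A_2 (su(3)_C) irrep with highest weight
    l1 * w_1 + l2 * w_2, via the Weyl dimension formula."""
    return (l1 + 1) * (l2 + 1) * (l1 + l2 + 2) // 2

# iterate over a few dominant weights and record the irrep dimensions
irreps = {(l1, l2): dim_A2(l1, l2) for l1 in range(4) for l2 in range(4)}
```

This reproduces the familiar \(\mathfrak{su}(3)\) multiplets: \((0,0) \mapsto \mathbf{1}\), \((1,0) \mapsto \mathbf{3}\), \((1,1) \mapsto \mathbf{8}\) (the adjoint), \((3,0) \mapsto \mathbf{10}\).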

Todo: relevant proofs

Appendix A.1.3. Classifying the representations of the unitary group

Todo

Appendix A.2: The LSZ formula

In the canonical quantization approach to QFT – an alternative to the path integral approach outlined in the rest of this post – the LSZ reduction formula provides a particularly convenient way to compute the probabilities associated with scattering and decay processes (e.g. a meson decaying into a nucleon and anti-nucleon). For example, for a scalar QFT with field \(\phi\), the scattering process \(\phi\phi \to \phi\phi\) has scattering amplitude

\[\begin{align*} \braket{f|S|i} &= \int \left[\prod_{i=1}^{4} d^4 x_i\right] \, e^{-ip_1\cdot x_1} (\square_1 + m^2) e^{-ip_2\cdot x_2} (\square_2 + m^2) e^{ip_3\cdot x_3} (\square_3+m^2) e^{ip_4\cdot x_4} (\square_4 + m^2)\\ &\qquad \qquad \qquad \qquad \braket{\phi(x_1) \phi(x_2) \phi(x_3) \phi(x_4)} \end{align*}\]

with \(\square_i\) the wave operator (d’Alembertian) associated with \(x_i\).

It is a general property of scattering amplitudes that they can be expressed as integrals over correlators. This tells us that correlators have direct physical relevance, and hence it is natural to enforce that such correlators – and expectation values more generally (as given in the Introduction) – are invariant under the symmetry group of the theory.

Appendix A.3: Coupling to gravity

Coupling to a fixed, non-dynamical background metric \(g\) involves modifying the action

\[\int d^n x \, \mathcal{L}(\Phi; x) \to \int d^n x \, \sqrt{-g} \mathcal{L}(\Phi; x)\]

To make the metric \(g\) dynamical, we can add a Ricci scalar term:

\[\int d^n x \, \mathcal{L}(\Phi; x) \to \int d^n x \, \sqrt{-g} \left(\frac{1}{16\pi G}R + \mathcal{L}(\Phi; x)\right)\]

In the context of gravity, diffeomorphism invariance of our theory becomes the relevant symmetry. The combination \(d^n x \, \sqrt{-g}\) is the diffeomorphism-invariant volume element, and the Ricci scalar \(R\) transforms as a scalar, so both terms above are manifestly diffeomorphism invariant.

The central difficulty with gravity is that the resulting Feynman diagrams/graviton loop corrections are non-renormalizable: one requires an infinite number of counter-terms to remedy the divergences arising at all orders. As a result, after coupling to gravity, one is restricted to only considering tree-level contributions.

Todo: demonstrate divergences under dynamical gravity.

Appendix A.4: Justifying gauge invariance as a principle

In biology there is a redundancy analogous to gauge invariance: multiple genotypes map to the same phenotype. In particular, field configurations \(\Psi \in \mathcal{C}\) are analogous to genotypes, and physical configurations \(\Phi \in \mathcal{P}\) are analogous to phenotypes. The equivalence class \([\Psi]\) is analogous to a \textit{neutral network}. Note however that the discreteness of genotypes, compared to the continuity of the configuration space \(\mathcal{C}\), weakens this analogy.

In biology, the existence of neutral networks aids search by allowing for a more diverse exploration of genotypes, and hence phenotypes. How can we interpret such benefits in the context of gauge invariance? Can we better understand why our theories must exhibit properties like gauge invariance?

Todo

Appendix A.5: Effects of integrating over redundant configurations

What are the effects of integrating over physically equivalent configurations in \(\mathcal{C}\)? We can expect some form of overcounting. Namely, observe that

\[\begin{align*} \mathbb{E}_{\Psi \sim S}^{\mathcal{C}}[f(\Psi)] = \frac{1}{Z_{\mathcal{C}}} \int_{\mathcal{C}} \text{D}\Psi \, f(\Psi) e^{-S[\Psi]} &= \frac{1}{Z_{\mathcal{C}}} \int_{\mathcal{P}} \text{D}\Phi \, \int_{[\Phi]} \text{D}\Psi \, f(\Psi) e^{-S[\Psi]}\\ &= \frac{1}{Z_{\mathcal{C}}} \int_{\mathcal{P}} \text{D}\Phi \, e^{-S[\Phi]} \left[\int_{[\Phi]} \text{D}\Psi \, f(\Psi)\right]\\ &= \frac{1}{Z_{\mathcal{P}}} \int_{\mathcal{P}} \text{D}\Phi \, F(\Phi) e^{-S[\Phi]}\\ &= \mathbb{E}_{\Phi\sim S}^{\mathcal{P}}[F(\Phi)] \end{align*}\]

defining

\[F(\Phi) := \frac{Z_{\mathcal{P}}}{Z_{\mathcal{C}}} \int_{[\Phi]} \text{D}\Psi \, f(\Psi)\]

We can view integrating over \(\mathcal{P}\) as integrating over one representative of each distinct equivalence class, i.e. only over physically inequivalent configurations.

Similarly, one can show that

\[Z_{\mathcal{C}} = Z_{\mathcal{P}} \mathbb{E}_{\Phi \sim S}^{\mathcal{P}}[\text{Vol}([\Phi])]\]

which lets us write

\[F(\Phi) = \frac{1}{\mathbb{E}_{\Phi' \sim S}^{\mathcal{P}}[\text{Vol}([\Phi'])]} \int_{[\Phi]} \text{D}\Psi \, f(\Psi)\]

Now suppose that each equivalence class has the same volume, i.e. \(\text{Vol}([\Phi]) = \text{Vol}([\Phi'])\) for any \(\Phi, \Phi' \in \mathcal{P}\). Further, suppose \(f\) is constant over each \([\Phi]\), which corresponds to the invariance \(f \circ \rho_g = f\). In this case we have \(F(\Phi) = f(\Phi)\), i.e. taking expectation values of \(f\) over \(\mathcal{C}\) is equivalent to taking them over \(\mathcal{P}\). But generally, \(f\) will not be constant over \([\Phi]\) (as is the case for correlators), and so these are generally distinct operations.
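The manipulations above can be sanity-checked in a discrete toy model, where a finite "gauge group" \(\mathbb{Z}_m\) acts freely on configurations and the action depends only on the equivalence class (all names and numbers below are illustrative):

```python
import math

# toy "gauge theory": a configuration is (class, g) with g in Z_m acting on
# the second slot; the action S depends only on the class (gauge invariance)
m = 3
S_class = {"a": 0.5, "bb": 1.7, "ccc": 0.2}
configs = [(c, g) for c in S_class for g in range(m)]

def f(psi):
    c, g = psi
    return len(c) * (g + 1)   # an observable that is NOT gauge invariant

weights = {psi: math.exp(-S_class[psi[0]]) for psi in configs}
Z_C = sum(weights.values())                        # partition function over C
Z_P = sum(math.exp(-s) for s in S_class.values())  # partition function over P

# expectation over the full configuration space C
E_C = sum(weights[psi] * f(psi) for psi in configs) / Z_C

# expectation over representatives P of F(class) = (Z_P / Z_C) * sum_g f(class, g)
F = {c: (Z_P / Z_C) * sum(f((c, g)) for g in range(m)) for c in S_class}
E_P = sum(math.exp(-S_class[c]) * F[c] for c in S_class) / Z_P
```

Here every class has volume \(m\), so \(Z_{\mathcal{C}} = Z_{\mathcal{P}} \cdot m\), and the two expectation values agree even though \(f\) is not gauge invariant.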

The overcounting associated with gauge fields can be detrimental to the validity of the theory due to divergences, which we explore in Section 7.

]]>
Notes on prosaic alignment and control (2025-03-22T22:53:08+00:00, https://r-gould.github.io/2025/03/22/alignment-and-control)

A working document, in an effort to gain a broad view of alignment and control for current-day language models, in order to better evaluate which research directions are most relevant and which should be prioritized towards effective real-world deployment of AI systems.

A broad overview of alignment and control for current-day language models.

  • Alignment: Designing AI systems to be reliable and safe to deploy, with a focus on misspecification (outer alignment) and misgeneralization (inner alignment).
  • Control: Mitigating the possibility of alignment/robustness failures via monitoring and intervention during deployment.

There will be an implicit focus on scalable solutions to alignment and control: methods that are applicable to the largest and most capable future models that we may wish to deploy in real-world contexts.

Alignment

Pretraining

Language models as simulators. Pretraining on the Internet encourages language models to be capable of simulating a wide diversity of personas if prompted correctly. One rough but illustrative picture is viewing a pretrained model as performing a Bayesian model average over learned personas, where the probability \(p_{\text{PT}}(y\mid x)\) that the pretrained model outputs response \(y\) given context \(x\) can be written schematically as

\[p_{\text{PT}}(y\mid x) = \int d\mathfrak{p} \, p_{\text{PT}}(\mathfrak{p}\mid x) p(y\mid x, \mathfrak{p})\]

integrating over all personas \(\mathfrak{p}\) relevant to modeling text on the Internet (e.g. the persona of the average Stack Overflow contributor), with \(p_{\text{PT}}(\mathfrak{p}\mid x)\) the learned distribution over personas \(\mathfrak{p}\) given context \(x\), and \(p(y\mid x, \mathfrak{p})\) producing the outputs of persona \(\mathfrak{p}\) under context \(x\).

Namely, the distribution \(p_{\text{PT}}(\mathfrak{p}\mid x)\) describes what the model has learned from the Internet: the probability that the model should engage in persona \(\mathfrak{p}\) given context \(x\). In an ideal world, \(p_{\text{PT}}(\mathfrak{p}\mid x)\) would effectively be a point-mass on a maximally helpful and honest persona \(\mathfrak{p}_{\text{HHH}}\), however the pretrained model has no reason to specifically favour such a persona given the diversity of the pretraining data. This motivates prompting and finetuning as a means to shift this distribution \(p_{\text{PT}}(\mathfrak{p}\mid x)\) towards personas of interest, as we will discuss in more detail later (see “Prompt-level steering” and “Behaviour-level steering” below).

  • Through finetuning, we essentially aim to unlearn the undesirable behaviours learned during pretraining. The limited robustness of LMs (e.g. the success of simple prefilling attacks and the shallowness of finetuning [8, 9], arbitrary behaviour elicitation [34]) suggests that unlearning is difficult and finetuning is not greatly effective. Additionally, the pretraining dataset is orders of magnitude larger than the finetuning dataset, which is likely a contributing factor to this difficulty.
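The Bayesian model average over personas described above can be made concrete in a toy discrete setting. The personas, probabilities, and context rule below are entirely invented for illustration:

```python
# two toy personas, each a distribution over next tokens
personas = {
    "helpful": {"sure": 0.7, "no": 0.1, "lol": 0.2},
    "troll":   {"sure": 0.1, "no": 0.3, "lol": 0.6},
}

def persona_posterior(context):
    # p_PT(persona | x): schematic stand-in for what pretraining learns
    if "stackoverflow" in context:
        return {"helpful": 0.9, "troll": 0.1}
    return {"helpful": 0.5, "troll": 0.5}

def p_next(y, context):
    # p_PT(y | x) = sum_p p_PT(p | x) p(y | x, p)
    post = persona_posterior(context)
    return sum(post[p] * personas[p][y] for p in personas)
```

Prompting with context that shifts posterior mass onto the helpful persona shifts the output distribution accordingly, which is the mechanism that prompt-level and behaviour-level steering exploit.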

Ultimately, the model’s internal representation \(z(x)\) of a particular prompt \(x\) encodes the persona \(\mathfrak{p}\) that is currently active. Interpretability methods like representation reading [1], sparse auto-encoders [2, 3], and LatentQA [4] allow for reading off attributes of a model’s persona, such as “helpfulness”, and for steering towards such behaviours (see “Representation-level steering” below).

  • One can think of the early transformer layers as implementing inference from \(p(\mathfrak{p}\mid x)\), and later layers as implementing inference from \(p(y\mid x, \mathfrak{p})\).

Shallowness of finetuning. The observed shallowness of finetuning [8, 9] suggests that the finetuned persona distribution \(p_{\text{FT}}(\mathfrak{p}\mid x)\) has essentially the same support as \(p_{\text{PT}}(\mathfrak{p}\mid x)\), i.e. \(p_{\text{PT}}(\mathfrak{p}\mid x) \approx 0 \implies p_{\text{FT}}(\mathfrak{p}\mid x) \approx 0\), describing the observation that finetuning usually does not teach the model fundamentally new behaviours.

Rethinking pretraining. Rather than trying to improve the effectiveness of finetuning at unlearning undesirable pretraining behaviour (as we will discuss in the next section), perhaps we should instead rethink the pretraining process itself. It is clear that there is a great mismatch between the task of pretraining and the behaviour we wish an LM to have at deployment; unsurprisingly, most text on the Internet does not match a “helpful persona”. It may be unrealistic to expect finetuning to be able to effectively unlearn all undesirable pretraining behaviours. Is there an alternative way of pretraining that doesn’t encourage a model to explicitly emulate behaviours that are misaligned with deployment? We still wish for the model to learn from all available data, but by a means that is detached from behaviour – the fact that we use the same next-token learning objective for both pretraining and SFT seems inherently problematic, and ideally there would be some separation between these learning processes, with pretraining learning knowledge and useful representations while finetuning learns good behaviours.

  • It is unclear what this would look like in practice, but one analogy is a VAE: when training a VAE on images, the learning of representations is not explicitly tied to any specific downstream task like classification (yet the representations are still very useful for this task), and the objective used to train a VAE is very different from the cross entropy objective used to train a classifier. We would like something similar for LMs, with pretraining corresponding to representation learning by a means that is not explicitly tied to imitating behaviour.
  • Ideally, behaviour would be completely decoupled from pretraining, with desired behaviours introduced only at the finetuning stage.

Steering

By default, a pretrained model will not behave usefully; it has no particular preference for being “helpful”. We wish to “steer” the pretrained model to be more useful for a deployment task of interest. Steering can be performed at different levels of abstraction. Note that language models have a computational hierarchy of: \((\text{prompt}, \text{weights}) \to \text{representations} \to \text{behaviour}\). Steering can be performed at the level of any component in this hierarchy.

  • In practice, models are mainly steered at the behaviour-level (via SFT and RLHF) and prompt-level (via system prompts). Representation-level and weight-level steering methods are also promising and discussed below.

Behaviour-level steering. Using the notation introduced in the previous section, we would like to steer the distribution \(p_{\text{PT}}(\mathfrak{p}\mid x)\) towards behaviours/personas \(\mathfrak{p}\) that correspond with a helpful assistant \(\mathfrak{p}_{\text{HHH}}\). Most immediately, we could consider prompting the model appropriately via \(x\) to elicit desired personas (i.e. prompt-level steering), such as via few-shot prompting. This requires no additional training, but as we will discuss later, is unreliable as a steering strategy.

One intuitive approach is to perform additional training in exactly the same manner as pretraining – i.e. using a next-token prediction loss – but instead using human-curated text data that demonstrates ideal interactions between a user and a helpful assistant. This is the idea behind supervised finetuning (SFT). Some limitations of SFT include:

  1. Costly: requires explicit human demonstrations of ideal responses (which is typically more difficult/costly than human evaluations of model’s responses, e.g. evaluating the goodness of a research paper being much easier than creating a good research paper).
  2. Distribution shift: during SFT (and pretraining) we predict only one token ahead on the basis of a ground-truth response, but at deployment a model will be generating its response on the basis of self-generated tokens, which means there will be a distribution shift between SFT and deployment.

Reinforcement learning-based training – namely, reinforcement learning from human feedback (RLHF) – is a method that remedies both of these issues, since (1) RL only requires an evaluation model \(R = R(x, y)\) of model responses \(y\) (no explicit demonstrations are required) and (2) in RL we can allow the model to generate its responses without access to a ground truth, which is faithful to deployment.

  • The fact that RL only requires an evaluator (rather than a demonstrator) means it is not bottlenecked by human demonstration ability, which is often weaker than human evaluation ability for tasks we care about (analogous to \(\text{P} \neq \text{NP}\)). As a result, it is feasible that performing RL with a human-level evaluator can result in a model whose ability is beyond human-level. For example, it is very easy to evaluate the win condition of chess, and performing RL on this evaluation signal results in superhuman chess agents like AlphaZero (however this is a special case where the task is verifiable; real-world tasks are mostly non-verifiable, which may result in important differences in the robustness of RL).

SFT and RLHF are often referred to as “finetuning” methods, and in practice they are both used in combination (first SFT, followed by RLHF), though it has been found in special cases (where high-quality demonstrations are available) that SFT can sometimes be sufficient [7].

Misspecification. To understand the potential for misalignment in RLHF, we will describe it explicitly. Concretely, RLHF consists of two steps:

  1. Train reward model \(R = R(x, y)\) (initialized as a pretrained model) to minimize

    \[\mathcal{L}[R] := \mathbb{E}_{(x, y_0, y_1, b) \sim \mathcal{D}}[-\log \sigma((-1)^{1-b} (R(x, y_1) - R(x, y_0)))]\]

    for prompts \(x\), model completions \((y_0, y_1) \sim \pi_{\text{ref}}(\cdot\mid x)\), and preference label \(b \in \{0, 1\}\) (with \(b=1\) iff completion \(y_1\) is preferred over \(y_0\)) obtained by human labellers. Note that \(\pi_{\text{ref}}\) is the initial pretrained (and SFT’d) model.

  2. Train the language model \(\pi = \pi(y\mid x)\) to optimize \(R\) via PPO (or DPO):

    \[\mathcal{V}[\pi] := \mathbb{E}_{\rho(x) \pi(y\mid x)}[R(x, y)] - \tau D_{\text{KL}}(\pi\mid \mid \pi_{\text{ref}})\]

Note that step 2 is applicable to generic reward models \(R\), whereas step 1 is specific to learning \(R\) via response-comparison data (comparisons are found to be reliable in practice, whereas raw preference ratings from humans are too noisy).
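The per-comparison loss in step 1 can be written out as a scalar function. The sketch below is a minimal transcription of the formula (not a training loop; the names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r0, r1, b):
    """One term of the reward-model loss: r0 = R(x, y0), r1 = R(x, y1),
    b = 1 iff y1 is preferred; computes -log sigma((-1)^(1-b) (r1 - r0))."""
    sign = (-1.0) ** (1 - b)
    return -math.log(sigmoid(sign * (r1 - r0)))
```

Correct rankings with a large reward margin drive the loss towards zero, an uninformative tie (\(r_0 = r_1\)) gives \(\log 2\), and the loss is invariant under swapping the pair together with the label.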

The reward model \(R = R(x, y)\) used in RLHF is likely to be an imperfect proxy, i.e. to be misspecified to some degree. Namely, let \(R^{*}\) denote the oracle reward describing our true, “platonic” human preferences. We do not have access to this oracle, and instead must rely on a human’s ratings, which can be thought of as samples from another reward model \(R_{\text{human}}\). Further, as seen in step 1 of RLHF above, we essentially train a reward model \(R_{\text{trained}}\) to predict the outputs of \(R_{\text{human}}\) to allow for larger-scale training (collecting data entirely from humans for training is too costly). We then perform RL training using reward function \(R = R_{\text{trained}}\). Each step introduces its own discrepancies/gaps:

\[R^{*} \underbrace{\longleftrightarrow}_{\text{oracle-human gap}} R_{\text{human}} \underbrace{\longleftrightarrow}_{\text{human-train gap}} R_{\text{trained}}\]

Misspecification corresponds to the overall oracle-train gap, i.e. the total discrepancy between the ideal oracle reward \(R^{*}\) and the learned reward model \(R = R_{\text{trained}}\) used for RL training, which is a sum of:

  • Oracle-human gap (discrepancy between \(R^{*}\) and \(R_{\text{human}}\)) originating from:
    • Collecting human data is costly, limiting the size of preference datasets.
    • Human cognitive biases (as for sycophancy [10] and length bias [11]).
    • Humans are limited in their ability to supervise complex tasks.
    • Human response comparison data only captures “which response is better” (rankings) rather than “how good is this response” (ratings). Human rating data is found to be too noisy in practice, restricting us to use ranking data.
  • Human-train gap (discrepancy between \(R_{\text{human}}\) and \(R_{\text{trained}}\)) originating from:
    • Learned reward model not correctly generalizing to human preferences due to an ineffective training setup.
    • Distribution shift due to a model’s response distribution \(\pi(y\mid x)\) changing during RL training, whereas \(R_{\text{trained}}\) is trained using responses from the original model \(\pi_{\text{ref}}(y\mid x)\). Periodic retraining of the reward model can resolve this, though it is costly. d-RLAIF approaches discussed below could also alleviate such distribution shift issues.

The effect of reward misspecification is that ultimately the LM is trained to maximize \(R_{\text{trained}}\) during RL training, and hence any discrepancies between \(R^{*}\) and \(R_{\text{trained}}\) will be exploited if it better allows for value maximization, often resulting in undesirable behaviours – called reward hacking, or reward overoptimization.

  • Note that even with a perfect reward model that experiences no misspecification, we may still suffer from underoptimization if our RL techniques are ineffective and result in suboptimal maxima.
  • [12] finds that a policy can better exploit misspecification as the size of the policy scales.
  • Could automate red-teaming of reward models to identify misspecification, via a method similar to [15].
  • The fact that evaluation is often much easier than generation may mean that reward models are much less mechanistically complex than a traditional LM, meaning weight-level and representation-level interpretability may be more feasible (as noted in [16]).

Scalable oversight. The problem of misspecification – and specifically the oracle-human gap – can be mainly attributed to humans not being very reliable supervisors. One can consider this problem in more generality: how can we reliably supervise and evaluate the outputs of AI systems on complex tasks? (for RLHF, the complex task is instruction following/helpfulness) This is often studied under the term scalable oversight.

One particularly difficult version of the scalable oversight problem involves assuming that the AI’s output is essentially unreadable to the human supervisor (e.g. code in a language that the human supervisor is unfamiliar with). In order to aid the human, we would like to produce a human-readable artifact that the human can base their supervision upon. Some examples of methods for doing this include:

  • Allow an assistant AI to provide supervision, perhaps with an argument that the supervisor can engage with.
  • Summarization of the AI’s output by an assistant AI in a way that’s catered to the supervisor’s ability.
  • Debate [39] allows multiple instances of the same AI system to argue for different sides of an argument (e.g. “does this response follow the user’s instruction?”), producing a debate transcript that the supervisor can base their judgement upon.
  • Recursive reward modeling [18] considers recursively decomposing complex tasks into simpler subtasks that the human/AI supervisor can evaluate and then combine into an overall evaluation.
  • In the case of evaluating code correctness (where a user prompts the model with a specification e.g. “write code to find the nth prime”), as long as the human supervisor can understand the user’s intended specification, they can construct test cases for the code and ensure that they pass. Related work on non-experts evaluating SQL code [37].

This first example, of using an AI system to act as an evaluator and generate supervision signal, is demonstrated in Constitutional AI (CAI) [13]. CAI generates demonstrations for SFT, and evaluations for RL, with the only human oversight being the selection of constitutional principles that the generating model is prompted to follow during data generation. Namely, on the RLHF side, a given choice of constitution \(\mathcal{C}\) results in an associated preference model \(R_{\mathcal{C}} = R_{\mathcal{C}}(x, y)\) that one can then use for RL.

  • It has been found (for sufficiently large models) that very simple constitutions like “Act to best benefit humanity” are competitive with more detailed rule sets [14].
  • Training a preference model means CAI suffers from distribution shift problems (see the 4th bullet point under “oracle-human gap” above). d-RLAIF [17] proposes a solution to this distribution shift issue by obtaining evaluations directly from the model during RL training, i.e. no reward model has to be trained.
  • The motivation for using AI feedback for alignment is that we can expect e.g. a pretrained LM to have some understanding of human values and our preferences, or to be able to simulate someone that does, which allows them to assist in preference data generation.

There are many works that evaluate how effectively LMs can act as evaluators, often called “meta-evaluation”. For example, [19, 20] find that even though LMs can often generate correct solutions to reasoning problems, they often fail to evaluate which solution is correct (i.e. LMs find generation easier than evaluation, disagreeing with human intuition). One could consider explicitly training AI systems to act as better evaluators. For example:

  • We could perform RLHF training but instead collect preference data for the task of evaluation (rather than the generic task of instruction following/helpfulness), with humans choosing preferred evaluations. Here humans are acting as “meta-evaluators” to train good LM evaluators. This is similar to the approach taken in CriticGPT [35].
  • We could use other LMs (or the same LM, i.e. self-improvement) as a meta-evaluator instead of humans. This is analogous to performing the RL stage of CAI for the task of evaluation, or replacing the human with an LM in CriticGPT.
    • To what extent can we just keep repeating this process iteratively (using that evaluation and meta-evaluation are similar tasks), using the improved model to evaluate itself, and repeat? At what point are diminishing returns reached? Can we understand the limitations of such “iterative improvement” methods more generally?

Note that scalable oversight is also relevant in the control context (discussed later) for the purposes of monitoring and interpretability during deployment as a means to mitigate misalignment.

Relation to weak-to-strong generalization. As noted by [41], one can view both scalable oversight and weak-to-strong generalization [42] as orthogonal approaches to alignment:

  • Scalable oversight: improve the judge’s supervision ability.
  • Weak-to-strong generalization: improve the ability to generalize from the supervision signals of a fixed judge.

In an LM context, weak-to-strong generalization concerns the setting of improving the ability for the reward model (RM) to correctly generalize from imperfect human preference data (whereas scalable oversight would aim to improve the quality of the preference data). [42] finds negative results for weak-to-strong generalization in this setting: training a larger RM using the outputs of a smaller RM causes the larger RM to collapse to the performance of the smaller RM (whereas for chess and NLP tasks, the larger model can meaningfully outperform the smaller model). It may be that modeling a human’s feedback as the outputs of a smaller reward model is problematic. It also may be that the relevant setting is instead using a smaller RM to train a larger policy (rather than training a larger RM) and observing whether the larger policy can learn to “grok” the intended goal.

Evaluating scalable oversight. One problem we must face when evaluating proposed solutions to scalable oversight is that, for very complex tasks, we have no access to ground truth labels, essentially by construction.

Consider a proxy setting where an AI produces outputs that a non-expert human is unable to evaluate, but where an expert human is able to reliably evaluate, providing us with ground-truth labels against which to measure the performance of the assisted non-expert human. The hope is that this proxy setting captures the meaningful aspects of scalable oversight in general, and that solutions to this proxy setting extend to settings where even expert humans are unable to evaluate. This proxy setting is often called the “sandwiching” setup [38].

Goal misgeneralization. To be written. Could discuss:

  • Even with perfect supervision ability and perfectly specified reward models, we still must face the problem of goal misgeneralization: an AI system may converge to a strategy that performs well in training yet fails at deployment due to misgeneralization (e.g. the strategy aims for a goal that spuriously correlates with our intended goal).
  • If the training setting is not sufficiently “diverse” (or sufficiently adversarial), or if there is mismatch between training and deployment, then the system will learn such shortcuts that fail to generalize.
  • Indeed, the shallowness of alignment training in LMs [8, 9] can be seen as an example of goal misgeneralization, with the model taking the shortcut of simply “reusing” pretraining behaviour, which is what makes prefilling attacks so effective. [8] disincentivizes such shortcuts by including examples of refusal responses that start with a few tokens of non-refusal.
  • Another example of goal misgeneralization in LMs is [22], which finds that finetuning on insecure code results in generically bad outputs: the dataset has insufficient optimization pressures to prevent the shortcut of “always output bad things” from being learned.
  • But some RL systems can be surprisingly robust out-of-distribution: self-play chess agents like MuZero can beat humans, even though they have never seen a human play during training. Here there is a mismatch between training and deployment, yet the system still learns the correct general strategy due to the adversarial nature of self-play RL.
  • Unsupervised environment design [36] aims to generate adversarial training settings that introduce pressures against such optimization shortcuts.

Aside (a very rough analogy to the brain). In the context of finetuning via RLHF, we would like to learn a reward model \(R_{\text{trained}}\) that accurately captures human preferences. Analogously, the “reward circuits” in the brain (e.g. in the midbrain and hypothalamus) have been “learned” by evolution to capture preferences that (indirectly) benefit genetic fitness.

That is, for both RLHF and the brain, the “reward model” is misspecified relative to the oracle reward (human preferences and genetic fitness respectively). The brain has therefore had to face the analogous problem of reward misspecification that we face in the context of RLHF. As a result, better understanding how the brain mitigates reward hacking could inform improvements to RLHF.

  • A simple example in the brain concerns context-dependent rewards and homeostasis: e.g. hunger diminishes after eating, preventing the continued pursuit of food rewards. This example is somewhat trivial, however, and ideally we would find more substantive examples.

Representation-level steering. Finetuning is a behaviour-level method of steering: we directly incentivize certain model outputs over others. One can imagine that this could have undesirable effects, e.g. a model producing the right outputs for the wrong reasons/not accurately internalizing the goal that we want the model to pursue (goal misgeneralization).

  • A relevant example: finetuning an LM on insecure code has the unintended consequence of causing the model to output generically bad things [22]. The shallowness of alignment training [8] can also be viewed as an example of goal misgeneralization.

This motivates considering a more fine-grained method of steering. We can move to a lower level of abstraction, from behaviour to representations, and consider representation-level steering. Methods in this vein typically involve “locating” certain interpretable concepts in the model, such as the concept of “helpfulness”, and amplifying these concepts to (hopefully) result in interpretable changes in model behaviour. Some methods include:

  • Sparse auto-encoders (SAEs) [2, 3]: under the assumption that representations decompose into a sparse linear combination of semantic feature directions, sparse auto-encoders allow for extracting these feature directions. Then, through automated-interpretability pipelines [3, 40], we can assign human-understandable interpretations to these feature directions and use them to steer model behaviour accordingly (by adding the features to representations, or modifying the activations of the SAE before reconstruction).
    • An unsupervised feature discovery method. Only human oversight is in the choice of SAE training dataset.
    • It is unclear to what extent SAE features are describing structure relevant to the model’s “cognition” versus describing the structure of the SAE training dataset.
  • Representation engineering (RepE) [1]: identifies semantic feature directions via contrastive data examples, i.e. taking an example that displays helpfulness and an example that displays unhelpfulness, and taking the difference between activations to obtain a helpfulness direction. Steering can then be performed by adding this direction to internal activations.
    • Must have a specific concept to extract in mind, and must then construct relevant contrastive examples. As a result, this lacks the “unsupervised discovery” property of SAEs.
  • LatentQA [4]: instead of explicitly “locating” concepts as feature directions, LatentQA interprets the representations of a target LM by training a decoder LM to take as input \((\text{representations from target LM}, \text{question about target LM})\) and output an answer to the question in natural language. LatentQA is the dataset on which the decoder LM is trained. After training, we can perform steering by freezing the decoder LM and training the target LM such that its activations decode (via the decoder LM) to specific natural language statements – for example, when giving the question “What is the model’s goal?” to the decoder LM, we can train the target LM such that the (frozen) decoder LM gives the answer “To be helpful.”
    • Does not require the linearity assumptions made by the previous two methods.
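
As a rough illustration of the contrastive-direction idea behind RepE, here is a minimal numpy sketch. The toy “activations”, the steering coefficient, and the function names are invented for illustration; a real implementation would read activations from a chosen layer of an actual LM.

```python
import numpy as np

def extract_direction(pos_acts, neg_acts):
    """Contrastive concept direction: difference of mean activations
    between examples displaying a concept and examples displaying its
    opposite, normalized to unit length."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(acts, direction, alpha=2.0):
    """Steer by adding the scaled concept direction to activations."""
    return acts + alpha * direction

# Toy activations where dimension 0 (by construction) encodes the concept.
rng = np.random.default_rng(0)
concept_axis = np.eye(8)[0]
pos = rng.normal(size=(32, 8)) + 3.0 * concept_axis  # "helpful" examples
neg = rng.normal(size=(32, 8)) - 3.0 * concept_axis  # "unhelpful" examples

d = extract_direction(pos, neg)
steered = steer(rng.normal(size=(4, 8)), d)
```

On this toy data the recovered direction is dominated by the planted concept axis, which is the property RepE relies on.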

The aforementioned steering methods are also readily applicable to representation-level monitoring, as we will discuss later under “Control”.

  • Indeed, it appears the original motivation behind these methods was for interpretability/monitoring purposes, rather than for steering.

TODO: discussion of limitations of these approaches and future directions. Also discuss interpretability methods like [23] that allow for editing representations to fix model failures.

Weight-level steering. With a mechanistic understanding of a model (in the sense of mechanistic interpretability), we can possibly perform explicit weight edits to steer model behaviour. For example, [24] identifies the modules important for storing factual associations, and provides a method for editing the weights to change these factual associations, e.g. making the model believe that the Eiffel Tower is in Rome.

For weight editing to be scalable, we would likely need to automate the process of locating relevant weights (as in [25]) and editing accordingly. As one example, could we extend an investigator agent method like MAIA [26] to be able to use operations like ablation and attribution patching to find the model weights important for a particular task and relay its findings to humans, as well as perform relevant weight edits to modify model behaviour?

  • RL training could be incorporated if we have a method of verifying a hypothesis about the function of model components. Steering could act as such a verification method: if attempts to steer result in the expected changes to behaviour, the hypothesis was likely correct and we should reward the model.

TODO

Prompt-level steering. To be written. Could discuss:

  • In some cases prompting can be competitive with finetuning. [27] find that the performance of few-shot prompting improves with model scale and can be competitive with/outperform finetuning methods. Also [28] find an in-context learning approach to outperform SFT and SFT + RLHF. This paper also finds that SFT + RLHF has a rather superficial effect on model behaviour, with only the distribution of stylistic tokens differing, which agrees with the shallowness results of [8].
  • The limitations of naively prompting a pretrained model. What goes wrong? Why do we require finetuning? Why is prompt engineering the pretrained model not sufficient? Concrete examples of failures.

Control

To be written. Could discuss:

  • It may be unrealistic to aim for models that are perfectly aligned and robust in every context. Instead, we could consider adaptive mechanisms at deployment that monitor and mitigate potential misalignment. This is the idea behind control. Or in other words, we can’t expect the “initial conditions” of deployment (i.e. the pretrained + finetuned model) to be perfect. Correction during deployment is feasible since “controllers are often much simpler than the systems they control” [29].
  • We can construct monitoring pipelines that act as wrappers around the model that involve checks at various levels (input, representations, behaviour) and potentially relay certain decisions to humans when appropriate. We can test the robustness of a given pipeline by assuming that the model being wrapped is adversarially trying to subvert the pipeline. This is the framing of [30].
  • Just as we can steer the model with respect to each component in the hierarchy \((\text{prompt}, \text{weights}) \to \text{representations} \to \text{behaviour}\), we can also perform monitoring with respect to each component:
    • Prompt-level: monitor inputs to the model via an input classifier (as in Constitutional Classifiers [31]) to detect adversarial attacks or OOD inputs.
    • Weight-level: mechanistic understanding could allow us to detect phenomena like deception at the weight-level (though, unlike the other forms of monitoring, this is static over deployment).
    • Representation-level: methods like RepE [1], SAEs [2, 3], and LIT [4] allow for monitoring the properties of internal representations, e.g. [32] detects harmful behaviour using RepE.
    • Behaviour-level: monitor outputs of the model via an output classifier (as in Constitutional Classifiers [31]).
  • [33] suggests that detecting adversarial inputs is essentially as difficult as defending against the adversarial input in the case of image classification. Does this transfer to language models, or does the discreteness of text inputs change things?
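
The layered monitoring idea above can be sketched schematically. Everything here (the classifier stubs, the escalation messages, the toy model) is hypothetical scaffolding rather than any particular system’s API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MonitoredModel:
    """Wraps a model with checks at the input, representation, and
    output levels; flagged cases are escalated rather than answered.
    All classifiers here are hypothetical stubs."""
    model: Callable[[str], tuple]          # returns (output, activations)
    input_flag: Callable[[str], bool]      # prompt-level monitor
    rep_flag: Callable[[list], bool]       # representation-level monitor
    output_flag: Callable[[str], bool]     # behaviour-level monitor

    def __call__(self, prompt: str) -> str:
        if self.input_flag(prompt):
            return "[escalated: suspicious input]"
        output, acts = self.model(prompt)
        if self.rep_flag(acts):
            return "[escalated: suspicious internal state]"
        if self.output_flag(output):
            return "[escalated: suspicious output]"
        return output

# Toy stand-ins for a model and the three classifiers.
toy_model = lambda p: (p.upper(), [float(len(p))])
wrapped = MonitoredModel(
    model=toy_model,
    input_flag=lambda p: "attack" in p,
    rep_flag=lambda acts: acts[0] > 100,     # e.g. an OOD activation score
    output_flag=lambda out: "HARMFUL" in out,
)
```

The point of the wrapper framing is that the checks compose: a prompt must pass every level before its output is released, and any level can hand the decision off to a human.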

References

[1] Representation Engineering: A Top-Down Approach to AI Transparency

[2] Sparse Autoencoders Find Highly Interpretable Features in Language Models

[3] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (https://transformer-circuits.pub/2023/monosemantic-features)

[4] LatentQA: Teaching LLMs to Decode Activations Into Natural Language

[5] Adversarial Examples Are Not Bugs, They Are Features

[6] Pretraining Language Models with Human Preferences

[7] LIMA: Less Is More for Alignment

[8] Safety Alignment Should Be Made More Than Just a Few Tokens Deep

[9] Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

[10] Towards Understanding Sycophancy in Language Models

[11] Scaling Laws for Reward Model Overoptimization

[12] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

[13] Constitutional AI: Harmlessness from AI Feedback

[14] Specific versus General Principles for Constitutional AI

[15] Discovering Language Model Behaviors with Model-Written Evaluations

[16] Understanding Learned Reward Functions

[17] RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

[18] Recursively Summarizing Books with Human Feedback

[19] Benchmarking and Improving Generator-Validator Consistency of Language Models

[20] The Generative AI Paradox: “What It Can Create, It May Not Understand”

[21] RewardBench: Evaluating Reward Models for Language Modeling

[22] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

[23] Monitor: An AI-Driven Observability Interface (https://transluce.org/observability-interface)

[24] Locating and Editing Factual Associations in GPT

[25] Towards Automated Circuit Discovery for Mechanistic Interpretability

[26] A Multimodal Automated Interpretability Agent

[27] Language Models are Few-Shot Learners

[28] The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

[29] Maintaining Alignment during RSI as a Feedback Control Problem (https://www.beren.io/2025-02-05-Maintaining-Alignment-During-RSI-As-A-Feedback-Control-Problem/)

[30] AI Control: Improving Safety Despite Intentional Subversion

[31] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

[32] Improving Alignment and Robustness with Circuit Breakers

[33] Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

[34] Eliciting Language Model Behaviors with Investigator Agents

[35] LLM Critics Help Catch LLM Bugs

[36] Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

[37] Non-Programmers Can Label Programs Indirectly via Active Examples: A Case Study with Text-to-SQL

[38] Measuring Progress on Scalable Oversight for Large Language Models

[39] AI safety via debate

[40] Automatically Interpreting Millions of Features in Large Language Models

[41] Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem (https://www.alignmentforum.org/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization)

[42] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

[43] Scaling Laws for Reward Model Overoptimization

Variational framework for perception and action

2024-09-23. https://r-gould.github.io/2024/09/23/variational-perception-action

Introduction

This post presents a mathematical formalism of perception, action, and learning, based on the framework of variational inference. It appears that a variational formalism has the potential to shed light on the ad-hoc choices used in practice; we will see that the notion of entropy regularization (as in SAC) and penalizing policy drift (as in PPO) emerge naturally from this framework. Further, this formalism provides a model-based method of reinforcement learning, featuring both representation learning & action-selection on the basis of these learned representations, in a manner analogous to “World Models” [3]. We will also see that, under certain assumptions, we obtain an alternative credit assignment scheme to backpropagation, called predictive coding [6, 7]. Backpropagation suffers from catastrophic interference [4] and has limited biological plausibility [5]. Predictive coding, by contrast, is a biologically-plausible credit assignment scheme that has been observed to outperform backpropagation at continual learning tasks (alleviating catastrophic interference to some degree [6]) and at small batch sizes [8], i.e. in biologically relevant contexts.

A brief background on reinforcement learning methods can be found in the Appendix.

Related work. The framework presented here differs from previous work. [1] describes an agent’s preference via binary optimality variables whose distribution is defined in terms of reward, from which they are able to justify the max-entropy framework of RL; however, their framework does not result in a policy drift term (as in PPO). In contrast, for the formalism presented here, we will see that it is most natural to represent an agent’s preferences via a distribution over policies (defined in terms of value), from which a policy drift term and an entropy regularization term naturally emerge. There is also the topic of active inference [9], which aims to formulate action-selection as variational inference; however, it appears that central claims – such as the minimization of the “expected free energy” – lack a solid theoretical justification, as highlighted by [10, 11].

Variational framework for perception & action

We will consider the following graphical model,

where \(s_t\) represents the environment’s (hidden) state, \(x_t\) a partial observation of this state, and \(a_t\) an action, at time \(t\). For simplicity we have made a Markov assumption on how the hidden states evolve. The dependency \(s_t \to a_t\) is a consequence of this Markov assumption, with the optimal action \(a_t\) (optimality defined further below) ultimately depending only on the current environment state \(s_t\), independent of previous states \(s_{<t}\) (given \(s_t\)).

The associated probabilistic decomposition (up to time \(t\)) is

\[p(s_{1:t}, x_{1:t}, a_{1:t}) = \prod_{\tau=1}^{t} p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1}) p(x_{\tau}\mid s_{\tau}) p(a_{\tau}\mid s_{\tau})\]

where \(p(s_1|s_0, a_0) \equiv p(s_1)\). We can interpret:

  • \(p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})\) as describing the environment’s Markov transition dynamics.
  • \(p(x_{\tau}\mid s_{\tau})\) as describing the lossy map from hidden state to partial observation.
  • \(p(a_{\tau}\mid s_{\tau})\) as describing the ideal/optimal action distribution given the state \(s_{\tau}\), where optimality is defined with respect to a value system (described further below).

To provide a Bayesian framework for action, we wish to frame the objective of an agent as performing inference over this graphical model. Since an agent will have access to information \((x_{1:t}, a_{<t})\) at time \(t\), we wish to frame action selection of the next action \(a_t\) as the process of computing/sampling from

\[p(a_t\mid x_{1:t}, a_{<t}) \equiv \int ds_t \; p(s_t\mid x_{1:t}, a_{<t}) p(a_t\mid s_t)\]

i.e. we can view action selection as the process of inferring the underlying state \(s_t \sim p(s_t\mid x_{1:t}, a_{<t})\), and then inferring the associated action \(a_t \sim p(a_t\mid s_t)\). But how do we appropriately define \(p(a_t\mid s_t)\) to represent an “ideal” agent? Naively we may consider \(p(a_t\mid s_t)\) to be a point-mass/Dirac-delta distribution centered at the optimal action \(a^{*}(s_t)\), however this does not provide a convenient notion of Bayesian inference. Instead we will consider a smoothed version that places the most weight on the optimal action \(a^{*}(s_t)\), but also places non-zero weight on other actions, in a way that is correlated with an action/policy’s value. In order to define \(p(a_t\mid s_t)\) to satisfy this notion, we will assume that \(p(a_t\mid s_t)\) takes the form of a mixture distribution, introducing a policy variable \(\pi\) and writing,

\[p(a_t\mid s_t) = \int d\pi \; p(a_t, \pi\mid s_t) = \int d\pi \; p(\pi\mid s_t) \pi(a_t\mid s_t)\]

where we write \(\pi(a_t\mid s_t) \equiv p(a_t\mid s_t, \pi)\).

  • The reason for assuming the form of a mixture distribution and introducing the policy variable \(\pi\) is because it is unclear how the goodness of a single action \(a_t\) would be defined otherwise. Its goodness is based on how the future plays out, which requires taking some additional future actions, and the policy variable \(\pi\) takes the role of determining these future actions, allowing for a well-defined notion of value.

A notable property of a mixture distribution is that we can write \(p(a_{\tau}\mid s_{\tau}) = \mathbb{E}_{p(\pi\mid s_{\tau})}[\pi(a_{\tau}\mid s_{\tau})]\), which – as we will see later – allows us to apply a variational bound. We then introduce a value system \(V_{\pi}(s)\), describing the value of policy \(\pi\) starting from the state \(s\). We can consider a Boltzmann preference over policies based on this value system,

\[\begin{equation} \label{eqn:boltz} p(\pi\mid s_t) \propto \exp(\beta V_{\pi}(s_t)) \quad \quad (1) \end{equation}\]

where, as typical in the context of RL, we define the value as,

\[V_{\pi}(s_t) = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi)}\left[\sum_{\tau=t}^{\infty} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\right]\] \[\text{with} \; \; \; p(s_{>t}, a_{\geq t}\mid s_t, \pi) = \prod_{\tau=t}^{\infty} \pi(a_{\tau}\mid s_{\tau}) p(s_{\tau+1}\mid s_{\tau}, a_{\tau})\]

for a discount factor \(\gamma \in [0, 1]\). Note that \(\beta = \infty\) corresponds to taking an argmax over \(V_{\pi}(s_t)\), representing “perfect rationality”. Finite \(\beta\) corresponds to “bounded rationality”, placing weight on policies that don’t perfectly maximize \(V_{\pi}(s_t)\).
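
As a quick numerical illustration of Equation (1) for a finite set of candidate policies (the values below are made up), note how small \(\beta\) spreads probability mass across policies while large \(\beta\) concentrates it on the argmax:

```python
import numpy as np

def policy_preference(values, beta):
    """Boltzmann distribution over a finite set of policies,
    p(pi | s) proportional to exp(beta * V_pi(s)), computed stably."""
    logits = beta * np.asarray(values, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

values = [1.0, 2.0, 4.0]  # toy values V_pi(s) for three candidate policies

p_low = policy_preference(values, beta=0.1)    # near-uniform: bounded rationality
p_high = policy_preference(values, beta=50.0)  # near-argmax: "perfect rationality"
```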

  • We will leave the remaining distributions – \(p(s_t\mid s_{t-1}, a_{t-1})\) and \(p(x_t\mid s_t)\) – implicit, but in practical implementation, we can choose them as Gaussians parameterized by neural networks.

Variational inference. However, even assuming that we have direct access to all relevant distributions (which we don’t), computing the integral of Equation (1) will be intractable as we expect \(s_t\) to be high-dimensional. As a result, we cannot perform exact Bayesian inference and so must utilize an approximate scheme of Bayesian inference. The approximate scheme we will consider is variational inference: at time \(t\) an agent has access to information \((x_{1:t}, a_{<t})\), and we will consider minimizing (with respect to \(p\)) the model’s corresponding surprise \(-\log p(x_{1:t}, a_{<t})\) by instead minimizing (with respect to \(p\) and \(q\)) the variational bound \(F(x_{1:t}, a_{<t})\),

\[-\log p(x_{1:t}, a_{<t}) \leq F(x_{1:t}, a_{<t}) := D_{\text{KL}}(q(s_{1:t}\mid x_{1:t}, a_{<t})\mid \mid p(s_{1:t}, x_{1:t}, a_{<t}))\]

(a consequence of Jensen’s inequality) where we have introduced the variational/approximate distribution \(q\). \(F\) is commonly called the variational free energy (VFE). Variational inference considers minimizing this upper bound \(F(x_{1:t}, a_{<t})\) on surprisal, rather than minimizing the surprisal itself (which is intractable to do). We will use the shorthand \(F_t \equiv F(x_{1:t}, a_{<t})\).

When the VFE \(F_t\) is perfectly minimized with respect to \(q\), with \(q(s_{1:t}\mid x_{1:t}, a_{<t}) = p(s_{1:t}\mid x_{1:t}, a_{<t})\), then the VFE is exactly equal to the surprisal,

\[F_t = -\log p(x_{1:t}, a_{<t})\]

in which case further minimizing \(F_t\) with respect to \(p\) achieves our original goal. This is the basis of the variational EM algorithm: at time \(t\),

  1. E-step: Minimize \(F_t\) with respect to \(q\) until convergence to \(q = q^{*}\).
  2. M-step: At fixed \(q = q^{*}\), minimize \(F_t\) for one/a few steps with respect to \(p\).

In practice, we will implement these steps via gradient-based minimization, as described in more detail later. The E-step cannot be run to convergence in practice, but we will typically perform many more E-steps than M-steps.
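
The alternating scheme can be illustrated on a toy model with a single latent variable. This is a minimal sketch (all quantities invented for illustration), assuming a Gaussian model \(p(s) = \text{N}(\theta, 1)\), \(p(x\mid s) = \text{N}(s, 1)\), a point-mass variational distribution \(q(s) = \delta(s - z)\), and plain gradient updates; up to constants the free energy is \(F(z, \theta) = \frac{1}{2}(z - \theta)^2 + \frac{1}{2}(x - z)^2\):

```python
def variational_em(x, theta=0.0, lr=0.1, outer=500, inner=20):
    """Gradient-based variational EM on the toy free energy
    F(z, theta) = 0.5*(z - theta)**2 + 0.5*(x - z)**2."""
    z = 0.0
    for _ in range(outer):
        # E-step: many gradient steps on q's parameter z, at fixed theta.
        for _ in range(inner):
            z -= lr * ((z - theta) + (z - x))   # dF/dz
        # M-step: a single gradient step on p's parameter theta.
        theta -= lr * (theta - z)               # dF/dtheta
    return z, theta

z, theta = variational_em(x=2.0)
```

For this toy model the scheme converges to \(z = \theta = x\), i.e. the prior mean is pulled onto the (single) observation, which is the expected degenerate optimum here.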

In general we can write \(F_t\) as

\[\begin{align*} F_t = &\sum_{\tau=1}^{t} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})}_{\text{perception}} \underbrace{-\log p(x_{\tau}\mid s_{\tau})}_{\text{prediction}}]\\ &+ \sum_{\tau=1}^{t-1} \mathbb{E}_{q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(a_{\tau}\mid s_{\tau})}_{\text{action}}]\\ &\underbrace{-H[q(s_{1:t}\mid x_{1:t}, a_{<t})]}_{\text{entropy regularization}} \end{align*}\]

As seen above, we chose \(p(a_{\tau}\mid s_{\tau})\) to be a mixture distribution, but the log of a mixture distribution (the “action” term above) lacks a nice expression. However, we can apply a variational bound on this term with respect to \(q(\pi\mid s_{\tau})\) by noting that \(p(a_{\tau}\mid s_{\tau})\) is an expectation,

\[p(a_{\tau}\mid s_{\tau}) = \mathbb{E}_{p(\pi\mid s_{\tau})}[\pi(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{q(\pi\mid s_{\tau})}\left[\frac{p(\pi\mid s_{\tau})}{q(\pi\mid s_{\tau})} \pi(a_{\tau}\mid s_{\tau}) \right]\]

and hence we can perform a variational bound (via Jensen’s inequality),

\[\begin{align*} -\log p(a_{\tau}\mid s_{\tau}) &\leq D_{\text{KL}}(q(\pi\mid s_{\tau})\mid \mid p(\pi\mid s_{\tau})) + \mathbb{E}_{q(\pi\mid s_{\tau})}[-\log \pi(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{q(\pi\mid s_{\tau})}[-\log p(\pi\mid s_{\tau}) - \log \pi(a_{\tau}\mid s_{\tau})] - H[q(\pi\mid s_{\tau})] \end{align*}\]

Overall, this results in a unified objective for perception and action:

\[\begin{align} \label{eqn:freeenergy} F_t &= \sum_{\tau=1}^{t} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})}_{\text{(a)}} \underbrace{-\log p(x_{\tau}\mid s_{\tau})}_{\text{(b)}}]\nonumber \\ &+\sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(\pi\mid s_{\tau})}_{\text{(c)}} \underbrace{-\log \pi(a_{\tau}\mid s_{\tau})}_{\text{(d)}}] \quad \quad \quad (2)\\ &\underbrace{- H[q(s_{1:t}\mid x_{1:t}, a_{<t})]}_{\text{(e)}} + \sum_{\tau=1}^{t-1} \mathbb{E}_{q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-H[q(\pi\mid s_{\tau})]}_{\text{(f)}}] \nonumber \end{align}\]

where

  • (a) and (b) represent the process of perception and prediction/representation learning.
  • (c) represents achieving the agent’s preference. In the case of Equation (1) this term corresponds to value maximization, with

    \[-\log p(\pi\mid s_{\tau}) = -\beta V_{\pi}(s_{\tau}) + \log Z(s_{\tau})\]

    for normalizing constant \(Z(s_{\tau})\).

  • (d) penalizes policy drift, by incentivizing \(q(\pi\mid s_{\tau})\) to favour policies \(\pi\) that fit previously taken actions \(a_{<t}\).
  • (e) and (f) are entropy regularization terms, for perception and action-selection respectively.

Note that (d) and (f) play roles analogous to the policy clipping of PPO and the entropy regularization of SAC, respectively. Entropy regularization in SAC has previously been justified via a variational framework [1], though with differences from the framework presented here (as described at the beginning).

Under Equation (1), term (c) reduces to (the negative of) the expected value, which is exactly the quantity that reinforcement learning is concerned with maximizing. An overview of reinforcement learning methods is included in the Appendix.

Internal structure of hidden states. We can include additional structure on the hidden state \(s\) by generalizing to an arbitrary DAG topology for the hidden state \(s_t = (s_t^1, \ldots, s_t^N)\), where \(s_t^n\) is the state of node \(n\) at time \(t\). For a general DAG, we have

\[p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1}) = \prod_{n=1}^{N} p(s_{\tau}^{n}\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}^{n}, a_{\tau-1})\]

where \(\mathcal{P}(n) \subset \{1, \ldots, N\}\) denotes the parent indices for the \(n\)th node. Further, \(p(x_{\tau}\mid s_{\tau}) = p(x_{\tau}\mid s_{\tau}^{\mathcal{P}(0)})\), and we will denote \(s_t^0 \equiv x_t\). Note that we have implicitly made a locality assumption of \(p(s_{\tau}^{n}\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}, a_{\tau-1}) = p(s_{\tau}^{n}\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}^{n}, a_{\tau-1})\).

  • Conditioning \(s_{\tau}^{n}\) on \((s_{\tau-1}^n, a_{\tau-1})\) accounts for temporal dynamics.
  • And conditioning on \(s_{\tau}^{\mathcal{P}(n)}\) accounts for local interaction effects between nodes.

In this case we can write Equation (2) as

\[\begin{align} \label{eqn:dagfreeenergy} F_t &= \sum_{\tau=1}^{t} \sum_{n=1}^{N} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(s_{\tau}^n\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}^n, a_{\tau-1})}_{\text{(a)}}]+\sum_{\tau=1}^{t} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(x_{\tau}\mid s_{\tau}^{\mathcal{P}(0)})}_{\text{(b)}}]\nonumber \\ &+\sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(\pi\mid s_{\tau})}_{\text{(c)}} \underbrace{-\log \pi(a_{\tau}\mid s_{\tau})}_{\text{(d)}}]\\ &\underbrace{- H[q(s_{1:t}\mid x_{1:t}, a_{<t})]}_{\text{(e)}} + \sum_{\tau=1}^{t-1} \mathbb{E}_{q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-H[q(\pi\mid s_{\tau})]}_{\text{(f)}}] \nonumber \end{align}\]

We may restrict action selection to a particular node, e.g. \(\pi(a_{\tau}\mid s_{\tau}) = \pi(a_{\tau}\mid s_{\tau}^N)\).

Inclusion of entropy terms. Note that the entropy regularization terms (e) and (f) are not present if we consider the cross-entropy objective

\[\mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[-\log p(s_{1:t}, x_{1:t}, a_{<t})]\]

instead of \(F_t\), but otherwise this objective and \(F_t\) are equivalent. This appears necessary if we wish to consider a point-mass distribution over policies \(q(\pi\mid s_{\tau}) = \delta(\pi - \pi_{\phi})\) as otherwise the entropy terms become infinite.

Predictive coding. Predictive coding [6, 7] is a special case of variational inference under a hierarchical latent topology that results in a local, and hence biologically-plausible, credit assignment scheme – this contrasts with backpropagation which has limited biological plausibility [5]. It has been observed that predictive coding alleviates catastrophic interference to some degree [6], which is a known problem with backpropagation [4].

Predictive coding, in its typical formulation, ignores temporality and action. We will consider extending predictive coding to include both of these aspects later. In this context, with a hierarchical topology, the relevant graphical model is

with \(s = (s^1, \ldots, s^L)\) and decomposition

\[p(s, x) = p(s^L) \prod_{l=1}^{L} p(s^{l-1}\mid s^l)\]

where \(s^0 \equiv x\). Predictive coding makes two further assumptions:

  1. \(p\) is Gaussian,
\[p(s^{l-1}\mid s^l) := \text{N}(s^{l-1}; \mu_l(s^l), \Sigma_l), \; \; \; p(s^L) := \text{N}(s^L; \hat{\mu}, \hat{\Sigma})\]

with \(\mu_l(s^l) = \mu_l(s^l; \theta_l)\), where \(\theta = (\theta_1, \ldots, \theta_L)\) and \((\hat{\mu}, \hat{\Sigma})\) parameterize \(p\).

  2. \(q\) is point-mass,
\[q(s^l\mid s^{l-1}) = \delta(s^l - z_l)\]

where \(z = (z_1, \ldots, z_L)\) parameterize \(q\).

We can interpret \(z\) as the fast-changing neural activity, and \(\theta\) as the slow-changing synaptic weights. This interpretation is supported by the variational EM algorithm, which indeed updates the parameters of \(q\) (in this case, parameter \(z\)) at a faster timescale than the parameters of \(p\).

In this case the free energy \(F(x)\) from Equation (2) (now independent of time) takes the form,

\[\begin{align*} F(x; \theta, z) &= \mathbb{E}_{q(s\mid x)}[-\log p(s) - \log p(x\mid s)] = -\log p(z_L) - \sum_{l=1}^{L} \log p(z_{l-1}\mid z_l)\\ &= \frac{1}{2} (z_L - \hat{\mu})^T \hat{\Sigma}^{-1} (z_L - \hat{\mu}) + \frac{1}{2} \sum_{l=1}^{L} (z_{l-1} - \mu_l(z_l))^T \Sigma_l^{-1} (z_{l-1} - \mu_l(z_l))\\ &+ \frac{1}{2} \log \mid 2\pi\hat{\Sigma}\mid + \frac{1}{2} \sum_{l=1}^{L} \log \mid 2\pi\Sigma_l\mid \end{align*}\]

where \(z_0 \equiv x\).

The variational EM algorithm (neglecting amortized inference for now) consists of:

  1. Iterative inference: update \(z\) via gradients

\(\frac{\partial F}{\partial z_l} = \begin{cases} \Sigma_{l+1}^{-1} \epsilon_l - \left[\frac{\partial \mu_l(z_l)}{\partial z_l}\right]^T \Sigma_l^{-1} \epsilon_{l-1}, & l = 1, \ldots, L-1\\ \hat{\Sigma}^{-1} \epsilon_L - \left[\frac{\partial \mu_L(z_L)}{\partial z_L}\right]^T \Sigma_L^{-1} \epsilon_{L-1}, & l = L \end{cases}\) where we have defined \(\epsilon_l := z_l - \mu_{l+1}(z_{l+1})\) (for \(l < L\)) and \(\epsilon_L := z_L - \hat{\mu}\).

  2. Learning: update \(\theta\) via the gradients
\[\frac{\partial F}{\partial \theta_l} = -\left[\frac{\partial \mu_l(z_l)}{\partial \theta_l}\right]^T \Sigma_l^{-1} \epsilon_{l-1}\]

This update process is local, and hence biologically plausible, since each update of \(z_l\) depends only on locally accessible information from its child nodes (in this case, \(z_{l-1}\)).

  • This locality contrasts with conventional backpropagation. For example, under conventional backpropagation, a single update step for the weight of the first layer requires access to the weight of the final layer – such non-local dependence is not present in predictive coding.
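To make the two stages concrete, here is a minimal numpy sketch of the inference/learning loop, under the simplifying assumptions of linear predictions \(\mu_l(z_l) = W_l z_l\) and identity covariances (both assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two latent layers with linear predictions mu_l(z_l) = W_l @ z_l and
# identity covariances -- simplifying assumptions for illustration.
L, dims = 2, [4, 3, 2]              # dims[0] = observation dimension
W = [0.1 * rng.normal(size=(dims[l], dims[l + 1])) for l in range(L)]
mu_hat = np.zeros(dims[L])          # prior mean over the top layer

def free_energy(x, z):
    zs = [x] + z                    # z_0 == x
    F = 0.5 * np.sum((z[-1] - mu_hat) ** 2)
    for l in range(1, L + 1):
        eps = zs[l - 1] - W[l - 1] @ zs[l]
        F += 0.5 * np.sum(eps ** 2)
    return F

def infer(x, z, eta=0.05, steps=200):
    # Iterative inference: gradient descent on F with respect to z.
    for _ in range(steps):
        zs = [x] + z
        eps = [zs[l] - W[l] @ zs[l + 1] for l in range(L)]  # eps_0 .. eps_{L-1}
        eps_top = z[-1] - mu_hat                            # eps_L
        grads = [(eps[l] if l < L else eps_top) - W[l - 1].T @ eps[l - 1]
                 for l in range(1, L + 1)]
        z = [z[i] - eta * grads[i] for i in range(L)]
    return z

x = rng.normal(size=dims[0])
z = [np.zeros(dims[1]), np.zeros(dims[2])]
F0 = free_energy(x, z)
z = infer(x, z)
F1 = free_energy(x, z)              # inference lowers F before any learning

# Learning: one gradient step on the weights, dF/dW_l = -eps_{l-1} z_l^T.
zs = [x] + z
for l in range(L):
    eps = zs[l] - W[l] @ zs[l + 1]
    W[l] += 0.01 * np.outer(eps, zs[l + 1])
F2 = free_energy(x, z)
```

Note the ordering: activity \(z\) is inferred first, then plasticity updates \(W\), mirroring the fast/slow timescale separation of variational EM.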

We can extend the above to include amortized inference, where upon receiving \(x\) we compute an initialization for the neural activity \(z\), and from there perform iterative inference. We can interpret the typical models in machine learning, e.g. transformers, as performing only this amortized inference stage, without any explicit iterative inference. However, one can argue that the residual structure of modern architectures, such as transformers, allows them to simulate an iterative inference-like process. Indeed, the two take essentially the same form,

\(\text{Residual structure:} \; \; \; z^{(l+1)} = z^{(l)} + f_{l+1}(z^{(l)}) \; \; \; \text{at layer} \; l\) \(\text{Explicit iterative inference:} \; \; \; z^{(n+1)} = z^{(n)} - \eta \frac{\partial F(z^{(n)})}{\partial z} \; \; \; \text{at time-step} \; n\)

That is, we can think of iterative inference as taking place implicitly over the layers of a residual architecture (like a transformer), whereas in the brain/predictive coding, it takes place over time via backward connections. This suggests why transformers may need to be much deeper than the brain’s cortical hierarchy: the brain’s recurrent/backward connections are roughly analogous to the extra depth of a strictly feedforward residual architecture.
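The correspondence can be stated concretely: a residual layer whose block computes \(f(z) := -\eta \, \partial F/\partial z\) performs exactly one step of iterative inference. A toy sketch, with a quadratic \(F\) chosen purely for illustration:

```python
import numpy as np

# A residual layer z + f(z) with f(z) := -eta * dF/dz is exactly one
# step of iterative inference. Here F(z) = 0.5 * ||z - a||^2 is a toy
# free energy (an assumption purely for illustration), minimized at a.
a = np.array([1.0, -2.0, 3.0])
eta = 0.1

def residual_layer(z):
    return z + (-eta * (z - a))   # f(z) = -eta * grad F(z)

z = np.zeros(3)
for _ in range(50):               # 50 "layers" = 50 inference steps
    z = residual_layer(z)
dist = np.linalg.norm(z - a)      # stack of layers approaches argmin F
```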

The above presentation of predictive coding has neglected two aspects: temporality, and action selection. Our general framework naturally extends to include these two aspects, as we will show below.

Neuro interpretation. One perspective highlighted by [13] is that to understand biological intelligence, one should first develop a computational/algorithmic description of cognition, and only then should one consider how such an algorithm could be implemented neurally. Ambitiously, such an approach would shed light on why the brain has the structure and properties that it does. In this vein, we can make the following rough analogies to the brain:

  • \(z_t^n\) as the state of the \(n\)th cortical column, with \(z_t\) the state of the cortex (unsure whether this would include the prefrontal cortex, which may operate by different principles (e.g. not simply predictive coding) compared to e.g. the sensory cortex?).
  • Amortized inference as initial feedforward sweep of neural activity (as described in [7]).
  • The iterative inference stage as a phase of local communication between cortical columns, eventually converging to some agreement \(z_t\). This agrees with cortical uniformity, with each cortical column employing the same local algorithm for communication and updates.
  • Relation to basal ganglia: dorsal striatum embodies \(\hat{Q}(s_t, a_t)\), and ventral striatum embodies \(\hat{V}(s_t)\), with \(s_t\) received by striatum via projections from cortex.
  • Standard interpretation of dopamine from midbrain as communicating reward-related errors, facilitating updating of \(\hat{V}\) and \(\hat{Q}\) via projections to basal ganglia, and potentially the computation of gradient \(\nabla_{\phi} V_{\pi_{\phi}}(s)\).
  • Hippocampus acting as a replay buffer, useful for sampling past \((s, a, s') \in \mathcal{D}\) (for training world model \(p\) and estimates \(\hat{Q}\), \(\hat{V}\)).
  • Adding compositional structure to reward function, in the sense of “Reward Bases” [12], could model context-dependent dynamic values, and adaptive behaviour.
  • Something like the cortico-basal ganglia-thalamo-cortical loop as facilitating the computation of \(\nabla_{\phi} Q_{\pi_{\phi}}(s, a)\) and \(\nabla_{\theta} V_{\pi_{\phi}}(s)\) for updating the policy and world model respectively? (i.e. basal ganglia needs state \(s\) from cortex, and cortex needs information about how \(Q_{\pi_{\phi}}\) and \(V_{\pi_{\phi}}\) vary wrt \(\theta\) and \(\phi\) from the basal ganglia)

References

[1] Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review.

[2] Millidge, B. (2020). Deep active inference as variational policy gradients.

[3] Ha, D., & Schmidhuber, J. (2018). World models.

[4] McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem.

[5] Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain.

[6] Song, Y., Millidge, B., Salvatori, T., Lukasiewicz, T., Xu, Z., & Bogacz, R. (2024). Inferring neural activity before plasticity as a foundation for learning beyond backpropagation.

[7] Tschantz, A., Millidge, B., Seth, A. K., & Buckley, C. L. (2023). Hybrid predictive coding: Inferring, fast and slow.

[8] Alonso, N., Millidge, B., Krichmar, J., & Neftci, E. O. (2022). A theoretical framework for inference learning.

[9] Friston, K., Schwartenbeck, P., FitzGerald, T., Moutoussis, M., Behrens, T., & Dolan, R. J. (2013). The anatomy of choice: active inference and agency.

[10] Millidge, B., Tschantz, A., & Buckley, C. L. (2021). Whence the expected free energy?

[11] Champion, T., Bowman, H., Marković, D., & Grześ, M. (2024). Reframing the Expected Free Energy: Four Formulations and a Unification.

[12] Millidge, B., Walton, M., & Bogacz, R. (2022). Reward bases: Instantaneous reward revaluation with temporal difference learning.

[13] Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information.

Appendix: Reinforcement learning background

Given Equation (1), the term (c) in Equation (2) can be written

\[\begin{align*} &\sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[-\log p(\pi\mid s_{\tau})]\\ &= -\beta \sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[V_{\pi}(s_{\tau})] \end{align*}\]

If we choose \(q(\pi\mid s_{\tau}) = \delta(\pi-\pi_{\phi})\), then we essentially arrive at the problem of:

\[\text{maximize} \; V_{\pi_{\phi}}(s_t) \; \text{with respect to} \; \phi\]

which is exactly the goal of policy-based reinforcement learning (RL). Recall that,

\[V_{\pi}(s_t) = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi)}\left[\sum_{\tau=t}^{T} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\right],\] \[Q_{\pi}(s_t, a_t) = \mathbb{E}_{p(s_{>t}, a_{>t}\mid s_t, a_t, \pi)}\left[\sum_{\tau=t}^{T} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\right]\]

and note that

\[V_{\pi}(s) = \mathbb{E}_{\pi(a\mid s)}[Q_{\pi}(s, a)], \quad Q_{\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{p(s'\mid s, a)}[V_{\pi}(s')]\]
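These two consistency relations are easy to check numerically on a small tabular MDP (the MDP below is randomly generated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Random tabular MDP and a fixed stochastic policy (an illustrative toy).
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Solve V_pi exactly from the Bellman equation V = R_pi + gamma * P_pi V.
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi, P)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

# Q from V via Q(s,a) = R(s,a) + gamma * E_{s'}[V(s')], then check
# the first consistency relation V(s) = E_{a ~ pi}[Q(s,a)].
Q = R + gamma * np.einsum('sat,t->sa', P, V)
V_from_Q = (pi * Q).sum(axis=1)
```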

There appear to be two notable families of methods for performing credit assignment in RL: (a) policy gradient methods (e.g. REINFORCE, PPO) and (b) amortized gradient methods (e.g. DQN, DDPG, SVG(0), SAC). Both perform gradient ascent using the gradient \(\nabla_{\phi} V_{\pi_{\phi}}(s_t)\), yet utilize different expressions for it. As a quick summary,

a) Policy gradient methods consider writing the gradient in the form,

\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) = \sum_{\tau=t}^{\infty} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[\Phi_{t, \tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]

with various choices of \(\Phi_{t, \tau}\), typically involving approximations \(\hat{V} \approx V_{\pi_{\phi}}\) and/or \(\hat{Q} \approx Q_{\pi_{\phi}}\), described in more detail below.

b) Amortized gradient methods instead consider the form,

\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) \approx \nabla_{\phi} \mathbb{E}_{\pi_{\phi}(a_t\mid s_t)}[\hat{Q}(s_t, a_t)]\]

for an approximation \(\hat{Q} \approx Q_{\pi_{\phi}}\) or \(Q_{\pi_{*}}\) (for optimal policy \(\pi_{*}\)). In most contexts we can write \(a_t = f_{\phi}(s_t, \epsilon)\) under \(a_t \sim \pi_{\phi}(a_t\mid s_t)\) for random variable \(\epsilon \sim p(\epsilon)\), and hence via the reparameterization trick we can write

\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) \approx \mathbb{E}_{p(\epsilon)}[\nabla_{\phi} \hat{Q}(s_t, f_{\phi}(s_t, \epsilon))]\]

As an example of what \(f_{\phi}\) may look like in practice, for a Gaussian policy:

\[\pi_{\phi}(a_t\mid s_t) = \text{N}(a_t; \mu_{\phi}(s_t), \Sigma_{\phi}(s_t))\] \[\implies a_t = f_{\phi}(s_t, \epsilon) = \mu_{\phi}(s_t) + U_{\phi}(s_t)\, \epsilon, \; \; \; \text{where} \; \; \Sigma_{\phi}(s_t) = U_{\phi}(s_t) U_{\phi}(s_t)^{T}\]

for \(\epsilon \sim \text{N}(0, I)\).
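As a sketch of the reparameterized gradient in the scalar case, take a toy critic \(\hat{Q}(a) = -a^2\) (an assumption chosen so that the exact gradient \(\partial_{\mu}\, \mathbb{E}[\hat{Q}] = -2\mu\) is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization-trick gradient for a scalar Gaussian policy
# pi(a) = N(a; mu, sigma^2) with toy critic Q_hat(a) = -a^2, for which
# E[Q_hat] = -(mu^2 + sigma^2), so d/dmu E[Q_hat] = -2*mu exactly.
mu, sigma = 0.7, 0.5
eps = rng.standard_normal(100_000)
a = mu + sigma * eps                  # a = f_phi(eps)
grad_est = np.mean(-2.0 * a)          # d/dmu of Q_hat(mu + sigma*eps)
grad_exact = -2.0 * mu
```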

Amortized gradient methods. Examples of amortized gradient methods:

  1. DQN corresponds to the amortized method; however, since it operates in the context of (small) discrete action spaces, the optimal action

    \[a^{*}(s) = \text{argmax}_a \hat{Q}(s, a)\]

    can be computed directly. Further, as in Q-learning, \(\hat{Q} \approx Q_{\pi_{*}}\) where \(\pi_{*}\) is the optimal policy, satisfying the Bellman equation,

    \[Q_{\pi_{*}}(s, a) = \mathbb{E}_{p(s'\mid s, a)}[R(s, a) + \gamma \max_{a'} Q_{\pi_{*}}(s', a')]\]

    where we minimize the associated greedy SARSA-like loss,

    \[\frac{1}{2} \mathbb{E}_{p(s, a, s'\mid \pi_{a^{*}})}[(\hat{Q}(s, a) - (R(s, a) + \gamma\max_{a'} \hat{Q}(s', a')))^2]\]

    to obtain \(\hat{Q} \approx Q_{\pi_{*}}\).

  2. DDPG extends Q-learning/DQN to continuous action spaces using the amortized method, restricting to deterministic policies \(\pi_{\phi}(a\mid s) = \delta(a - \mu_{\phi}(s))\), hence with corresponding gradient,

    \[\nabla_{\phi} V_{\pi_{\phi}}(s_t) \approx \nabla_{\phi} \hat{Q}(s_t, \mu_{\phi}(s_t))\]

    and approximating \(\hat{Q} \approx Q_{\pi_{*}}\) as in DQN.

  3. The SVG(0) algorithm is an extension of DDPG for stochastic policies \(\pi_{\phi}\), by utilizing the reparameterization trick (demonstrated above in (b)). It approximates \(\hat{Q} \approx Q_{\pi_{\phi}}\) (i.e. not under the optimal policy, unlike DDPG) using a SARSA-like objective,

    \[\frac{1}{2} \mathbb{E}_{p(s, a, s', a'\mid \pi_{\phi})}[(\hat{Q}(s, a) - (R(s, a) + \gamma\hat{Q}(s', a')))^2]\]
  4. SAC extends SVG(0) to include an entropy term, and also uses a value network alongside the action-value network; the authors found that the value network improved training stability. The objectives considered for training \(\hat{V} \approx V_{\pi_{\phi}}\) and \(\hat{Q} \approx Q_{\pi_{\phi}}\) are

    \[\frac{1}{2}\mathbb{E}_{p(s)}[(\hat{V}(s) - \mathbb{E}_{\pi_{\phi}(a\mid s)}[\hat{Q}(s, a) \underbrace{- \log \pi_{\phi}(a\mid s)}_{\text{entropy term}}])^2],\] \[\frac{1}{2}\mathbb{E}_{p(s, a\mid \pi_{\phi})}[(\hat{Q}(s, a) - (R(s, a) + \gamma\mathbb{E}_{p(s'\mid s, a)}[\hat{V}(s')]))^2]\]

    respectively. The reason \(\hat{V}\)’s objective has an entropy term while \(\hat{Q}\)’s does not is that SAC defines the value \(V_{\pi_{\phi}}\) to include an entropy term.
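To make the greedy target of item 1 concrete, here is a tabular sketch of a single DQN-style update (a sketch only: real DQN additionally uses a neural Q-network, experience replay, and a target network):

```python
import numpy as np

# Tabular illustration of the DQN-style greedy bootstrap target.
nS, nA, gamma, lr = 4, 3, 0.9, 0.5
Q = np.zeros((nS, nA))

# One (s, a, r, s') transition (values chosen for illustration).
s, a, r, s_next = 0, 1, 1.0, 2
target = r + gamma * Q[s_next].max()    # R(s,a) + gamma * max_a' Q(s',a')
Q[s, a] += lr * (target - Q[s, a])      # gradient step on squared TD error
a_star = int(Q[s].argmax())             # a*(s) = argmax_a Q_hat(s, a)
```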

Policy gradient methods. For the return \(G_t := \sum_{\tau=t}^{\infty} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\), note that

\[V_{\pi_\phi}(s_t) = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_\phi)}[G_t]\] \[\implies \nabla_{\phi} V_{\pi_\phi}(s_t) = \sum_{\tau=t}^{\infty} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_\phi)}[G_t \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]

We can approximate this via a (truncated) Monte Carlo method over trajectory information; however, if the trajectory information was gathered under an older policy \(\pi_{\phi_{old}}\), we should include an importance ratio:

\[\begin{align*} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_\phi)}[G_t \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] &= \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi_{old}})}\left[\frac{\pi_{\phi}(a_{\tau}\mid s_{\tau})}{\pi_{\phi_{old}}(a_{\tau}\mid s_{\tau})} G_t \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})\right]\\ &= \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi_{old}})}\left[G_t \frac{\nabla_{\phi} \pi_{\phi}(a_{\tau}\mid s_{\tau})}{\pi_{\phi_{old}}(a_{\tau}\mid s_{\tau})}\right] \end{align*}\]

We can write \(\nabla_{\phi} V_{\pi_{\phi}}(s_t)\) in a more general form by manipulating the expectation, or by including a baseline. Specifically, we can write

\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) = \sum_{\tau=t}^{T} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[\Phi_{t, \tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]

for a variety of choices of \(\Phi_{t, \tau}\):

  1. \(\Phi_{t, \tau} = G_t\).
  2. \(\Phi_{t, \tau} = G_{\tau} = \sum_{\tau'=\tau}^{T} \gamma^{\tau'-\tau} R(s_{\tau'}, a_{\tau'})\).
  3. \(\Phi_{t, \tau} = G_{\tau} - b(s_{\tau})\), e.g. \(b = \hat{V} \approx V_{\pi_{\phi}}\).
  4. \(\Phi_{t, \tau} = Q_{\pi_{\phi}}(s_{\tau}, a_{\tau})\).
  5. \(\Phi_{t, \tau} = Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) - V_{\pi_{\phi}}(s_{\tau})\) (advantage), which is just (4) with a baseline (3).
  6. \(\Phi_{t, \tau} = R(s_{\tau}, a_{\tau}) + \gamma R(s_{\tau+1}, a_{\tau+1}) + \cdots + \gamma^T R(s_{\tau+T}, a_{\tau+T}) + \gamma^{T+1} V_{\pi_{\phi}}(s_{\tau+T+1}) - V_{\pi_{\phi}}(s_{\tau})\).

(2) and (3) follow from (1) because, for an arbitrary function \(f = f(s_{\leq \tau}, a_{<\tau})\),

\[\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[f(s_{\leq \tau}, a_{<\tau}) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = 0\]

for \(\tau \geq t\). (4) holds because,

\[\begin{align*} &\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{p(s_{\geq\tau}, a_{\geq \tau}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{p(s_{\tau}, a_{\tau}\mid s_t, \pi_{\phi})} \mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{p(s_{\tau}, a_{\tau}\mid s_t, \pi_{\phi})}[(R(s_{\tau}, a_{\tau}) + \underbrace{\mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[\gamma R(s_{\tau+1}, a_{\tau+1}) + \cdots]}_{= Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) - R(s_{\tau}, a_{\tau})}) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{p(s_{\tau}, a_{\tau}\mid s_t, \pi_{\phi})}[Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] \end{align*}\]

(6) follows from (5), using

\[\begin{align*} Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) &= R(s_{\tau}, a_{\tau}) + \gamma \mathbb{E}_{p(s_{\tau+1}, a_{\tau+1}\mid s_{\tau}, a_{\tau})}[R(s_{\tau+1}, a_{\tau+1})] + \cdots\\ &+ \gamma^T \mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[R(s_{\tau+T}, a_{\tau+T})] + \gamma^{T+1} \mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[V_{\pi_{\phi}}(s_{\tau+T+1})] \end{align*}\]
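For completeness, the identity quoted above for (2) and (3) reduces to the score-function fact that the expected score vanishes:

\[\begin{align*} \mathbb{E}_{\pi_{\phi}(a_{\tau}\mid s_{\tau})}[\nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] &= \int \pi_{\phi}(a_{\tau}\mid s_{\tau}) \, \frac{\nabla_{\phi} \pi_{\phi}(a_{\tau}\mid s_{\tau})}{\pi_{\phi}(a_{\tau}\mid s_{\tau})} \, da_{\tau}\\ &= \nabla_{\phi} \int \pi_{\phi}(a_{\tau}\mid s_{\tau}) \, da_{\tau} = \nabla_{\phi} 1 = 0 \end{align*}\]

Since \(f(s_{\leq \tau}, a_{<\tau})\) does not depend on \(a_{\tau}\), it factors out of the inner expectation over \(a_{\tau}\), and the whole term vanishes.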

Examples of policy gradient methods:

  1. REINFORCE uses choice (2) for \(\Phi_{t, \tau}\). However this suffers from high variance, so in practice one typically uses a baseline, such as choice (3).
  2. PPO uses a GAE (generalized advantage estimator) variant of choice (6). We describe GAE below.

GAE estimator. Recall the return

\[G_{\tau} = \sum_{t=\tau}^{\infty} \gamma^{t-\tau} R(s_{t}, a_{t})\]

We will define the truncated return,

\[G_{\tau}^{(n)} := \sum_{t=\tau}^{\tau+n-1} \gamma^{t-\tau} R(s_t, a_t) + \gamma^n V(s_{\tau+n})\]

We then consider the exponential moving-average,

\[G_{\tau}(\lambda) := (1-\lambda)(G_{\tau}^{(1)} + \lambda G_{\tau}^{(2)} + \lambda^2 G_{\tau}^{(3)} + \cdots)\]

Note that \(\lambda = 1\) corresponds to the entire return \(G_{\tau}\), whereas \(\lambda = 0\) corresponds to using the one-step return \(G_{\tau}^{(1)} \equiv R(s_{\tau}, a_{\tau}) + \gamma V(s_{\tau+1})\). Then we can view \(\lambda \in [0, 1]\) as balancing the tradeoff between bias and variance (\(\lambda=0\) has minimum variance but high bias, and vice-versa for \(\lambda=1\)).

We should view \(G_{\tau}(\lambda)\) as a drop-in approximation for \(G_{\tau}\). This is valid (in expectation, assuming \(V = V_{\pi_{\phi}}\)) because

\[\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau}^{(n)} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]

such that

\[\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau}(\lambda) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]

as required, using \(1+\lambda+\lambda^2+\cdots = 1/(1-\lambda)\).

Using \(G_{\tau}(\lambda)\) in place of the return \(G_{\tau}\) for TD-learning results in the TD\((\lambda)\) algorithm.

In the context of advantage baselines, we instead consider an exponential moving-average over advantages, rather than returns. Specifically, we define the truncated advantage

\[\begin{align*} A_{\tau}^{(n)} &:= G_{\tau}^{(n)} - V(s_{\tau})\\ &= \sum_{t=\tau}^{\tau+n-1} \gamma^{t-\tau} R(s_t, a_t) + \gamma^n V(s_{\tau+n}) - V(s_{\tau}) \end{align*}\]

and define the GAE (generalized advantage estimator) as an exponential moving-average,

\[A_{\tau}(\lambda) := (1-\lambda)(A_{\tau}^{(1)} + \lambda A_{\tau}^{(2)} + \lambda^2 A_{\tau}^{(3)} + \cdots)\]

which, by defining \(\delta_t := R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)\), we can write as

\[\begin{align*} A_{\tau}(\lambda) &= (1-\lambda)(\delta_{\tau} + \lambda(\delta_{\tau} + \gamma \delta_{\tau+1}) + \lambda^2(\delta_{\tau} + \gamma \delta_{\tau+1} + \gamma^2 \delta_{\tau+2}) + \cdots)\\ &= (1-\lambda)\underbrace{(1 + \lambda + \lambda^2 + \cdots)}_{1/(1-\lambda)}(\delta_{\tau} + (\gamma\lambda) \delta_{\tau+1} + (\gamma\lambda)^2 \delta_{\tau+2} + \cdots)\\ &= \sum_{t=\tau}^{\infty} (\gamma\lambda)^{t-\tau} \delta_{t} \end{align*}\]
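In practice one computes \(A_{\tau}(\lambda)\) with a single backward pass using the recursion \(A_{\tau} = \delta_{\tau} + \gamma\lambda A_{\tau+1}\). A sketch on a finite trajectory, assuming the episode terminates at \(T\) (so \(V(s_{T+1}) = 0\), an assumption for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# GAE via the backward recursion A_t = delta_t + gamma * lambda * A_{t+1},
# on a random finite trajectory with V(s_{T+1}) = 0.
gamma, lam, T = 0.99, 0.95, 20
rewards = rng.random(T)
values = np.append(rng.random(T), 0.0)          # V(s_0..s_T), V(s_{T+1}) = 0

deltas = rewards + gamma * values[1:] - values[:-1]

adv = np.zeros(T)
running = 0.0
for t in reversed(range(T)):                    # single backward pass
    running = deltas[t] + gamma * lam * running
    adv[t] = running

# Direct evaluation of sum_t (gamma*lambda)^(t - tau) * delta_t agrees.
direct = np.array([sum((gamma * lam) ** (t - tau) * deltas[t]
                       for t in range(tau, T)) for tau in range(T)])
```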

PPO considers a truncated form of GAE

\[A_{\tau}(\lambda, T) = (1-\lambda)(A_{\tau}^{(1)} + \lambda A_{\tau}^{(2)} + \cdots + \lambda^{T-1} A_{\tau}^{(T)})\]

However, it seems this truncation has not been bias-corrected; the corrected version would instead be

\[\tilde{A}_{\tau}(\lambda, T) = \frac{1-\lambda}{1-\lambda^T} (A_{\tau}^{(1)} + \lambda A_{\tau}^{(2)} + \cdots + \lambda^{T-1} A_{\tau}^{(T)})\]

though perhaps this is negligible in practice.

Architectures by symmetry (2023-08-18)

Introduction

Geometric deep learning provides a framework for viewing and deriving architectures via symmetry. Namely, imposing invariances and equivariances on a system with respect to a group of symmetries yields prominent, empirically highly effective architectures in machine learning, such as transformers and CNNs. This post aims to condense the core ideas of this framework.

  • The notion of using symmetry to derive models is significant in physics. Three out of the four fundamental forces can be derived this way; for example, the Maxwell equations for the electromagnetic force can be derived by imposing \(U(1)\) symmetry, and this approach was discovered long after the empirical finding of the Maxwell equations via experiment.

Setup

Say one wishes to create information processing systems that receive some data signal, and output some property of the data.

Notation:

  • Data signals come from the space \(\mathcal{X}(\Omega, \mathcal{C}) := \{f: \Omega \to \mathcal{C}\}\), the set of all functions from \(\Omega\) to \(\mathcal{C}\). \(\Omega\) is called the domain and the dimensions of \(\mathcal{C}\) are called channels.

    As a concrete example, 10x10 RGB images can be described with \(\Omega := \mathbb{Z}_{10} \times \mathbb{Z}_{10}\) and \(\mathcal{C} := \mathbb{R}^3\), assigning RGB values to each point in a grid.

    This is equivalent to representing the data as a tensor of type \(\mathbb{R}^{10 \times 10 \times 3}\). But here, we make a separation of the shape \((10, 10, 3)\) into \((\Omega, \mathcal{C})\). The reason for this will become clearer, but we define our symmetries by a group \(G\) that acts on \(\Omega\), whereas \(\mathcal{C}\) is not involved in these symmetries.

  • Our system \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{Y}\) takes a data signal, and returns a property of the signal.

Universal approximation theorems tell us that a single-hidden-layer NN \(f = f(x; \theta)\) can get arbitrarily close to any (continuous) function for some choice of parameters \(\theta\). But finding these parameters by optimization can require an infeasible number of data points from the true underlying data distribution. How do we circumvent this?

Luckily, when we are creating an information processing system, we know something beforehand about the structure of data it will receive and the task we wish to solve, and hence the underlying symmetries of the data signal domain \(\Omega\). As seen in the example above, for classification of images, it is reasonable to expect that \(f\) should output the same result even if the image is translated, since the object class will not change under such translation.

So we know beforehand that \(f\) should be invariant under certain transformations of its input signals, which narrows the function space we are optimizing over if we can properly encode this into the architecture of \(f\).

  • In contrast, a basic NN does not make any assumptions about symmetries of its input data, though the optimization process may impose its own biases.

The concept of built-in exploitation of symmetry in the model architecture is an example of an inductive bias. Regularization techniques are also inductive biases, with weight decay biasing towards low weight norms. But here we are concerned only with inductive biases built into the architecture of the system, not those built into the optimization process.

  • Tangent propagation uses a regularizing term in the optimization objective to incentivize local invariances to transformations. Data augmentation also produces such an effect. However in geometric deep learning, the goal is to build these invariances into the functional form/architecture of the model itself.

    Sidenote (not important): an example of tangent propagation: for data \(x \in \mathbb{R}^n\) and model \(y(x) \in \mathbb{R}^m\) under transformations parameterized by, for convenience, a scalar \(\xi\), then \(\xi \cdot x \in \mathbb{R}^n\) is the transformed data point (with \(0 \cdot x = x\)). Then

    \[\frac{\partial y(\xi \cdot x)_i}{\partial \xi}\bigg\rvert_{\xi=0} = \sum_{j=1}^{n} \frac{\partial y(x)_i}{\partial x_j} \frac{\partial (\xi \cdot x)_j}{\partial \xi}\bigg\rvert_{\xi=0} =: \sum_{j=1}^{n} J_{ij} \tau_j\]

    with Jacobian \(J_{ij} := \frac{\partial y_i}{\partial x_j}\) and tangent vector \(\tau := \frac{\partial (\xi \cdot x)}{\partial \xi}\rvert_{\xi=0}\). Then tangent propagation includes a regularizing term

    \[\lambda \sum_{i=1}^{m} \left(\frac{\partial y(\xi \cdot x)_i}{\partial \xi}\bigg\rvert_{\xi=0}\right)^2 = \lambda \sum_{i=1}^{m} \left(\sum_{j=1}^{n} J_{ij} \tau_j\right)^2\]

    into the objective function to incentivize local invariances, with \(\lambda\) chosen to balance this invariance effect.

Framework

Group Actions

With knowledge of the kind of data we are working with, and the task we wish to solve, we can determine symmetries on the domain \(\Omega\), and describe these symmetries by a group \(G\).

\(G\) acts on \(\Omega\) via \(\bullet: G \times \Omega \to \Omega\), and \(\bullet_g \in \text{Sym}(\Omega)\) is the group action restricted to \(g \in G\), with \(\bullet_g(u) = g \bullet u\).

We can define another group action \(\ast: G \times \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega, \mathcal{C})\) by \(g \ast x := x(g^{-1} \bullet \; \cdot \;) \in \mathcal{X}(\Omega, \mathcal{C})\), now instead acting on a space of functions on \(\Omega\) rather than \(\Omega\) itself. This is a valid group action if \(\bullet\) is a valid group action.

  • \(g^{-1}\) must be applied instead of \(g\) so that the compositionality requirement of a group action, \(\ast_g \circ \ast_h = \ast_{gh}\), holds.
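As a sanity check of these definitions, for the cyclic domain \(\Omega = \mathbb{Z}_n\) with \(G = \mathbb{Z}_n\) acting by translation, \(\ast_g\) is a circular shift, and compositionality \(\ast_g \circ \ast_h = \ast_{gh}\) can be verified directly:

```python
import numpy as np

# The action *_g on signals over the cyclic domain Omega = Z_n, with
# G = Z_n acting by translation: (g * x)(u) = x(u - g) = np.roll(x, g).
n = 8
x = np.arange(n, dtype=float)

def act(g, x):
    return np.roll(x, g)       # x(g^{-1} . u) for the translation action

g, h = 3, 2
lhs = act(g, act(h, x))        # (*_g o *_h)(x)
rhs = act((g + h) % n, x)      # *_{gh}(x), group operation = addition mod n
```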

Key Definitions

Definition: \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{Y}\) is called \(G\)-invariant if

\[f \circ \ast_g = f\]

\(\forall \; g \in G\).

  • Equivalent definition of \(G\)-invariance: \(f(x \circ \bullet_g) = f(x) \; \forall \; x \in \mathcal{X}(\Omega, \mathcal{C}), g \in G\).
  • \(G\)-invariance says that applying the action \(\bullet_g\) to the domain input \(u \in \Omega\) does not affect the output of \(f\).

Definition: \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}')\) is called \(G\)-equivariant if

\[f \circ \ast_g = \ast'_g \circ f\]

\(\forall \; g \in G\), where \(\ast\), \(\ast'\) are the group actions of \(G\) on \(\mathcal{X}(\Omega, \mathcal{C})\), \(\mathcal{X}(\Omega', \mathcal{C}')\) respectively.

  • In the special case of \(\ast \equiv \ast'\), \(G\)-equivariance says that \(f\) and \(\ast_g\) commute.

And two key properties:

  1. \(G\)-equivariances are closed under composition, i.e. if \(A\) and \(B\) are \(G\)-equivariances, then \(A \circ B\) is a \(G\)-equivariance.
  2. If \(A\) is a \(G\)-invariance, and \(B\) is a \(G\)-equivariance, then \(A \circ B\) is a \(G\)-invariance.

Perturbative Stability

There is a problem with a function \(f\) having just \(G\)-invariance: it says nothing about how robust \(f\) is to transformations that are close to the transformations of \(G\), but not exactly in \(G\).

  • e.g. under a transformation of an image that is not exactly an element of the translation group, but is very close to being a translation, we would expect that \(f\) still acts effectively invariantly. But the definition of \(G\)-invariance says nothing about these small perturbations from \(G\).
  • The same is true for the definition of \(G\)-equivariance.

We can formalize the idea of \(f\) being \(G\)-invariant while also being robust to perturbations outside of \(G\) by the notion of approximate invariance. A complexity measure \(c: \text{Diff}(\Omega) \to \mathbb{R}^+\) measures how ‘close’ a transformation \(\tau \in \text{Diff}(\Omega)\) is to the transformations of a group \(G \subset \text{Diff}(\Omega)\), with \(c(g) = 0 \; \forall \; g \in G\). Then \(f\) is approximately invariant if

\[\lVert f(\ast_{\tau}(x)) - f(x) \rVert \leq C c(\tau) \lVert x \rVert\]

\(\forall \; x \in \mathcal{X}(\Omega, \mathcal{C}), \tau \in \text{Diff}(\Omega)\) for some constant \(C\).

Then from the above definition, such an \(f\) is both \(G\)-invariant (the case \(\tau \in G\)) and also sufficiently stable to perturbations outside \(G\), depending on the chosen complexity measure \(c\).

Similarly, \(f\) is called approximately equivariant if

\[\lVert f(\ast_{\tau}(x)) - \ast'_{\tau}(f(x)) \rVert \leq C c(\tau) \lVert x \rVert\]

\(\forall \; x \in \mathcal{X}(\Omega, \mathcal{C})\), \(\tau \in \text{Diff}(\Omega)\).

Locality

How do we achieve perturbative stability as described by approximate invariance/equivariance? One heuristic for achieving it is locality.

Consider \(\Omega = \mathbb{R}\) and group \((\mathbb{R}, +)\) with \(G = \mathbb{R}\), acting as translations by \(\ast_v(x)(u) := x(u - v)\). Also consider a transformation \(\tau \in \text{Diff}(\mathbb{R})\), which has action \(\ast_{\tau}(x)(u) = x(u - \tilde{\tau}(u))\), where \(\tilde{\tau}\) is not a constant function, hence \(\tau\) is not a translation. However, \(\lVert \nabla \tilde{\tau} \rVert_{\infty} \leq \epsilon\) for some small \(\epsilon\), hence \(\tau\) can be said to be an approximate translation.

Let \(f(x) := |\hat{x}|\), the Fourier modulus, with \(\hat{x}(\xi) = \int_{-\infty}^{\infty} x(u) e^{-i\xi u} du\) the Fourier transform. Then since \(\hat{\ast_v(x)}(\xi) = \hat{x}(\xi) e^{-i \xi v}\), we have \(f(\ast_v(x)) = |\hat{x}| \equiv f(x)\), hence \(f\) is \(G\)-invariant. However, it fails to be approximately invariant, as one can show that for the approximate translation \(\tau\),

\[\frac{\lVert f(\ast_{\tau}(x)) - f(x) \rVert}{\lVert x \rVert} = \mathcal{O}(1)\]

i.e. it is independent of \(\epsilon > 0\).
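The exact \(G\)-invariance part of this example is easy to verify in the discrete setting, where the Fourier modulus becomes \(|\text{FFT}(x)|\) and translations become circular shifts (the instability under approximate translations is precisely the part that does not show up in this check):

```python
import numpy as np

# Discrete analogue of the Fourier-modulus example: |FFT| is exactly
# invariant under cyclic translations of the signal, since a shift only
# multiplies each Fourier coefficient by a unit-modulus phase.
rng = np.random.default_rng(0)
x = rng.random(64)
f = lambda x: np.abs(np.fft.fft(x))

shifted = np.roll(x, 7)                      # an exact translation, g in G
inv_err = np.max(np.abs(f(shifted) - f(x)))  # should be ~ machine precision
```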

In contrast, let \(f(x) := W_{\psi}(x)(\cdot, \xi)\) for a fixed \(\xi \in \mathbb{R}^+\), with \(W_{\psi}(x)(u, \xi) = \xi^{-1/2} \int_{-\infty}^{\infty} \psi\left(\frac{v - u}{\xi}\right) x(v) dv\), called the wavelet transform. Then it can be shown that

\[\frac{\lVert f(\ast_{\tau}(x)) - \ast_{\tau}(f(x)) \rVert}{\lVert x \rVert} = \mathcal{O}(\epsilon)\]

hence \(f\) is approximately equivariant.

How could this be interpreted? One way is to note that the Fourier transform extracts global properties, namely, frequencies featured in the signal, but does not give any local temporal information. And from above, the Fourier modulus is not stable to perturbations outside \(G\).

On the other hand, wavelets provide a balance of global and local information, allowing one to gain information about frequencies at localised points in the signal, and from above, the wavelet transform is stable to perturbations.

Hence, roughly, locality can be viewed as a principle that \(f\) should follow to possibly give some stability to its invariances/equivariances. In the Architectures section, this locality property is often imposed, and it can be roughly justified by the above.

Building Invariances

We want our system \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{Y}\) to be \(G\)-invariant. How do we go about achieving this?

If we restrict \(f\) to be linear, then \(G\)-invariance implies that

\[\begin{align*} f(x) &= \frac{1}{\mu(G)} \int_G f(x) d\mu(g)\\ &= \frac{1}{\mu(G)} \int_G f(\ast_g(x)) d\mu(g) && \text{(by $G$-invariance)}\\ &= f\left(\frac{1}{\mu(G)} \int_G \ast_g(x) d\mu(g)\right) && \text{(by linearity)}\\ &=: f(\bar{x}) \end{align*}\]

with \(\bar{x} := \frac{1}{\mu(G)} \int_G \ast_g(x) d\mu(g) \in \mathcal{X}(\Omega, \mathcal{C})\) representing the average action signal of \(G\) on signal \(x \in \mathcal{X}(\Omega, \mathcal{C})\).

What this says is that \(f\) linear and invariant means \(f\) can only depend on \(\bar{x}\), and \(\bar{x}\) contains very little information about \(x\) in some cases.

  • e.g. for a 2D image \(x\) and \(G\) the group of translations, \(\bar{x}\) will be the average pixel intensity across the whole image. Global pooling is an example of this, but is typically performed after many equivariant operations, to be discussed below.

Hence if we want \(f\) to be invariant while being effective at a non-trivial task, we require that \(f\) is non-linear. How do we construct an invariant, non-linear \(f\)?

If we have some non-linear equivariances \(\{f_i\}_{i=1}^{N}\) and a (possibly linear) invariance \(A\), we could produce a non-linear \(G\)-invariant system \(f\) by

\[f := A \circ f_N \circ \cdots \circ f_1\]

where \(f\) is \(G\)-invariant as a consequence of the two key properties noted at the end of the Key Definitions section.

Say instead we have some linear equivariances \(\{f_i\}_{i=1}^{N}\) that we wish to use to build an invariant system. In order for \(f\) to be expressive we want some non-linearity involved too. A useful result is that:

If \(A: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}')\) is a \(G\)-equivariance, and \(\Sigma: \mathcal{X}(\Omega', \mathcal{C}') \to \mathcal{X}(\Omega', \mathcal{C}'')\) is defined by \(\Sigma(x) := \sigma \circ x\) with \(\sigma: \mathcal{C}' \to \mathcal{C}''\) (e.g. a non-linearity), then \(\Sigma \circ A\) is \(G\)-equivariant.

  • This follows as \(\Sigma \circ A \circ \ast_g = \Sigma \circ \ast'_g \circ A\) and \(\Sigma(\ast'_g(x)) = \sigma \circ x(g^{-1} \bullet \; \cdot \;) = \ast''_g(\sigma \circ x) = \ast''_g(\Sigma(x))\) where \(\ast''_g\) is just \(\ast'_g\) but extended from \(\mathcal{X}(\Omega', \mathcal{C}')\) to \(\mathcal{X}(\Omega', \mathcal{C}'')\). This only works as \(\Sigma\) only modifies the channel dimensions, and leaves the domain unchanged.

So we can choose some non-linearities \(\{\Sigma_i\}_{i=1}^{N}\) and build a \(G\)-invariant \(f\) by

\[f := A \circ \Sigma_N \circ f_N \circ \cdots \circ \Sigma_1 \circ f_1\]
  • The convolution, seen in CNNs, is an example of a linear equivariance. Typically a non-linearity such as ReLU is applied after the convolution.
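This composition can be sketched numerically. Below is a one-layer toy version (names and weights are illustrative), using a circular convolution as the linear equivariance \(f_1\), ReLU as \(\Sigma_1\), and global summation as the invariant map \(A\).

```python
import numpy as np

def circ_conv(x, theta):
    """A linear map equivariant to cyclic shifts of x."""
    n = len(x)
    return np.array([sum(x[(u + t) % n] * theta[t] for t in range(len(theta)))
                     for u in range(n)])

def f(x, theta):
    h = np.maximum(circ_conv(x, theta), 0.0)  # Sigma_1 o f_1: non-linear, equivariant
    return h.sum()                            # A: global sum pooling, invariant

theta = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 1.0, -0.7, 2.0, 0.1])
assert np.isclose(f(x, theta), f(np.roll(x, 2), theta))  # G-invariance of f
```

Shifting the input permutes the hidden activations, and the final sum is blind to that permutation, which is exactly the two-property argument above.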

Architectures

CNN

A general definition of the concept of a convolution is that of the group convolution.

Definition: For a group \(G\) and \(\theta \in \mathcal{X}(\Omega, \mathcal{C})\), the group convolution \(C_{\theta}: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(G, \mathcal{C})\) is defined by

\[C_{\theta}(x)(g) := \int_{\Omega} \langle x(u), \theta(g^{-1} \bullet u) \rangle_{\mathcal{C}} du\]

where \(\langle\cdot, \cdot\rangle_{\mathcal{C}}\) is an inner product on \(\mathcal{C}\).

The group convolution is special in that \(f: \mathcal{X}(G, \mathcal{C}) \to \mathcal{X}(G, \mathcal{C})\) is \(G\)-equivariant and linear iff \(f\) is a group convolution (see Kondor, Trivedi (2018)). So for functions on \(\mathcal{X}(G, \mathcal{C})\), we can fully characterise linear \(G\)-equivariances.

In the case of \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(G, \mathcal{C})\) however, certain conditions must be met. If \(\Omega\) is countable, or \(\Omega\) uncountable with \(\det\left(\frac{\partial \bullet_g(u)}{\partial u}\right) = 1\), then the group convolution is \(G\)-equivariant. The former result is proven below, and the latter is proven similarly.

Claim: If \(\Omega\) is countable, then for any \(\theta \in \mathcal{X}(\Omega, \mathcal{C})\), the group convolution is \(G\)-equivariant. That is,

\[C_{\theta} \circ \ast_g = \ast'_g \circ C_{\theta}\]

\(\forall \; g \in G\), where \(\ast'_g\) is defined by action \(\bullet'_g: G \to G\) with \(\bullet'_g(h) := gh\).

Proof: Since \(\Omega\) is countable, the integral becomes a sum:

\[C_{\theta}(x)(g) = \sum_{u \in \Omega} \langle x(u), \theta(g^{-1} \bullet u) \rangle_{\mathcal{C}}\]

To show \(G\)-equivariance, we make use of the fact that \(\bullet_g \in \text{Sym}(\Omega)\) to make the substitution \(v = h^{-1} \bullet u\):

\[\begin{align*} C_{\theta}(\ast_h(x))(g) &= \sum_{u \in \Omega} \langle x(h^{-1} \bullet u), \theta(g^{-1} \bullet u) \rangle_{\mathcal{C}}\\ &= \sum_{v \in \Omega} \langle x(v), \theta(g^{-1} \bullet (h \bullet v)) \rangle_{\mathcal{C}}\\ &= \sum_{u \in \Omega} \langle x(u), \theta(g^{-1}h \bullet u) \rangle_{\mathcal{C}} \end{align*}\]

Compare this to the outer application of \(\ast'_h\):

\[\ast'_h(C_{\theta}(x))(g) = C_{\theta}(x)(h^{-1}g) = \sum_{u \in \Omega} \langle x(u), \theta(g^{-1}h \bullet u) \rangle_{\mathcal{C}}\]

These are identical, and so

\[C_{\theta} \circ \ast_g = \ast'_g \circ C_{\theta}\]

\(\forall \; g \in G, \theta \in \mathcal{X}(\Omega, \mathcal{C})\) as required, hence \(C_{\theta}\) is \(G\)-equivariant.
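The claim can also be checked numerically. Below is a sketch of the group convolution for the cyclic group \(G = \Omega = \mathbb{Z}_n\) acting by translation (so \(g^{-1} \bullet u = u - g \bmod n\)), with a check of \(C_{\theta} \circ \ast_h = \ast'_h \circ C_{\theta}\); the values are arbitrary.

```python
import numpy as np

def group_conv(x, theta):
    """C_theta(x)(g) = sum_u x(u) theta(g^{-1} . u) for G = Omega = Z_n."""
    n = len(x)
    return np.array([sum(x[u] * theta[(u - g) % n] for u in range(n))
                     for g in range(n)])

rng = np.random.default_rng(0)
x, theta = rng.standard_normal(6), rng.standard_normal(6)
h = 2
lhs = group_conv(np.roll(x, h), theta)   # C_theta(*_h(x))
rhs = np.roll(group_conv(x, theta), h)   # *'_h(C_theta(x))
assert np.allclose(lhs, rhs)             # equivariance, as proven above
```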

Classical CNNs can be derived by choosing to use a linear equivariance wrt. translations, which is equivalently a convolution \(C_{\theta}\) (since \(G = \Omega = \mathbb{Z}^2\)), and then imposing locality by choosing a localised filter \(\theta\) such that \(C_{\theta}(x)((u, v)^T)\) depends only on a small \(f_H \times f_W\) grid in the original input. Since \(G\) is the group of translations, the group convolution is

\[C_{\theta}(x)(u) = \sum_{v \in \Omega} x(v) \theta(u^{-1} \bullet v) = \sum_{v \in \Omega} x(v) \theta(v - u)\]

Let \(\Omega := \mathbb{Z}^2\), with values outside of a \(H \times W\) grid set to be \(0\), and let \(\mathcal{C} = \mathbb{R}\) for now. Then a localised filter \(\theta \in \mathcal{X}(\mathbb{Z}^2, \mathbb{R})\) represented by a \(f_H \times f_W\) grid can be written in the basis \(\{\theta_{1, 1}, \ldots, \theta_{f_H, f_W}\}\) with \(\theta_{i, j}(u, v) := \delta(u - i, v - j)\) like so:

\[\theta = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{i, j} \theta_{i, j}\]

with \(\{w_{i, j}\}_{i, j} \subset \mathbb{R}\) learnable.

Then since \(C\) is linear wrt. \(\theta\) (since \(C_{\alpha \theta_1 + \beta \theta_2} = \alpha C_{\theta_1} + \beta C_{\theta_2}\)), then the corresponding localised convolution is

\[C_{\theta} = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{i, j} C_{\theta_{i, j}}\]

\(C_{\theta}\) corresponds to a stride of 1, as

\[C_{\theta_{i, j}}(x)((u, v)^T) = \sum_{w \in \Omega} x(w) \theta_{i, j}(w - (u, v)^T) = x((u+i, v+j)^T)\]

and so

\[C_{\theta}(x)((u, v)^T) = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{i, j} x((u+i, v+j)^T)\]
  • A stride \(k > 1\) convolution is not equivariant to all translations; only translations that step in multiples of \(k\), i.e. \(G = (k\mathbb{Z})^2\).

In the case of \(C\) channels, with \(\mathcal{C} := \mathbb{R}^C\), mapping to \(C'\) channels, then by writing \(\theta \in \mathcal{X}(\mathbb{Z}^2, \mathbb{R}^C)\) in the basis \(\{\theta_{1, 1, 1}, \ldots, \theta_{f_H, f_W, C}\}\), the localised convolution is

\[C_{\theta}(x)((u, v)^T)_{c'} = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} \sum_{c=1}^{C} w_{i, j, c, c'} x((u+i, v+j)^T)_c\]

since there are \(C'\) filters each of shape \((f_H, f_W, C)\).
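A direct sketch of this multi-channel localised convolution (zero padding outside the \(H \times W\) grid, and indices \(i, j\) starting at \(1\) as in the formula; shapes and values are illustrative):

```python
import numpy as np

def local_conv(x, w):
    """x: (H, W, C) signal, w: (fH, fW, C, Cp) filters -> output of shape (H, W, Cp)."""
    H, W, C = x.shape
    fH, fW, _, Cp = w.shape
    xp = np.zeros((H + fH, W + fW, C))
    xp[:H, :W] = x                        # values outside the H x W grid are 0
    out = np.zeros((H, W, Cp))
    for u in range(H):
        for v in range(W):
            patch = xp[u + 1:u + 1 + fH, v + 1:v + 1 + fW]   # x((u+i, v+j)), i, j >= 1
            out[u, v] = np.einsum('ijc,ijcp->p', patch, w)   # sum over i, j, c
    return out
```

For instance, a \(1 \times 1\) single-channel filter with weight \(2\) simply doubles the (shifted) input, which makes the stride-1 behaviour above easy to verify by hand.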

GNN

Consider the case of the domain \(\Omega = V\), where \(V\) are the nodes of a graph \(\mathcal{G} = (V, E)\), and \(\mathcal{C} = \mathbb{R}^d\). Then the signal \(x \in \mathcal{X}(V, \mathbb{R}^d)\) can be equivalently represented by a matrix \(X \in \mathbb{R}^{n \times d}\), where \(n := |V|\). Here we are assuming there are only node representations, and none for edges.

We can represent connection information by an adjacency matrix \(A \in \mathbb{R}^{n \times n}\), with \(A_{ij} = 1\) if \((i, j) \in E\), and \(A_{ij} = 0\) otherwise.

Naturally we want the system to be \(G\)-invariant wrt. \(G = \text{Sym}(\{1, \ldots, n\})\) such that an arbitrary renumbering of nodes has no effect. To build such invariances, we need \(G\)-equivariances. Specifically, for \(f: \mathbb{R}^{n \times d_{I}} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times d_{O}}\) to be \(G\)-equivariant, we need

\[f(PX, PAP^T) = Pf(X, A)\]

\(\forall \; P \in S_n\), where \(S_n\) is the group of \(n \times n\) permutation matrices.

One way of imposing locality here is to impose that \(f(X, A)_i \in \mathbb{R}^{d_{O}}\) only depends on \(x_i\) and \(X_{\mathcal{N}_i(A)} \in \mathbb{R}^{\vert \mathcal{N}_i(A)\vert \times d_I}\), the node representations of neighbours of \(i\). This means that

\[f(X, A)_i = \phi(x_i, X_{\mathcal{N}_i(A)})\]

In this case, \(\phi\) must be invariant to permutations of the input neighbours, otherwise \(f\) won't necessarily be equivariant. This means that we need

\[\phi(x_i, PX_{\mathcal{N}_i(A)}) = \phi(x_i, X_{\mathcal{N}_i(A)})\]

\(\forall \; P \in S_{\vert \mathcal{N}_i(A)\vert }\).

One way of imposing that \(\phi\) is invariant is to enforce that \(\phi\) only depends on a permutation invariant operator across the neighbours, i.e.

\[\phi(x_i, X_{\mathcal{N}_i(A)}) = \phi\left(x_i, \bigoplus_{j \in \mathcal{N}_i(A)} \psi(x_i, x_j)\right)\]

which is the most general form of a GNN.

  • Choosing \(\psi(x_i, x_j) = c_{ij} \tilde{\psi}(x_j)\) gives convolutional GNNs.
  • Choosing \(\psi(x_i, x_j) = a(x_i, x_j) \tilde{\psi}(x_j)\) gives attentional GNNs.
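A minimal sketch of a convolutional-flavoured layer of this form, with sum as the permutation-invariant aggregator \(\bigoplus\) and illustrative choices \(\psi(x_i, x_j) = \tanh(W_{\text{msg}}^T x_j)\) and \(\phi(x_i, y) = W_{\text{self}}^T x_i + y\) (all weight names are hypothetical), together with a numerical check of permutation equivariance:

```python
import numpy as np

def gnn_layer(X, A, W_self, W_msg):
    n = X.shape[0]
    out = np.zeros((n, W_self.shape[1]))
    for i in range(n):
        nbrs = np.nonzero(A[i])[0]                       # the neighbourhood N_i(A)
        agg = sum(np.tanh(X[j] @ W_msg) for j in nbrs)   # permutation-invariant sum
        out[i] = X[i] @ W_self + agg                     # phi combines x_i and messages
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
A = (rng.random((4, 4)) > 0.5).astype(float)
W_self, W_msg = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
P = np.eye(4)[[2, 0, 3, 1]]                              # a permutation matrix
assert np.allclose(gnn_layer(P @ X, P @ A @ P.T, W_self, W_msg),
                   P @ gnn_layer(X, A, W_self, W_msg))   # f(PX, PAP^T) = P f(X, A)
```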

Transformer

A text prompt with \(n\) tokens can be represented as a graph \(\mathcal{G} = (V, E)\) with \(V := \{1, \ldots, n\}\) and with connectivity \(E\) s.t. \(\mathcal{N}_i(A) := \{1, \ldots, i\} \; \forall \; i \in V\), i.e. each token is connected to itself and all tokens before itself, but none after.

As for GNNs, \(X \in \mathbb{R}^{n \times d}\) represents the tokens, with \(d\) called the residual stream dimension in the context of transformers.

In the case of \(f: \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times d}\) being a general GNN, i.e.

\[f(X, A)_i = \phi\left(x_i, \bigoplus_{j \in \mathcal{N}_i(A)} \psi(x_i, x_j)\right)\]

then by defining learnable parameters \(Q_h, K_h, V_h, O_h \in \mathbb{R}^{d_{\text{head}} \times d}\) for each \(h = 1, \ldots, H\), where \(d_{\text{head}}\) is the head dimension, we can arrive at a transformer attention layer with \(H\) heads by choosing

  • \(\bigoplus := \sum\),
  • \(\psi(x_i, x_j) := \sum_{h=1}^{H} \hat{\psi}_h(x_i, x_j)\),
  • with \(\hat{\psi}_h(x_i, x_j) := a_h(x_i, x_j) \tilde{\psi}_h(x_j)\) (attentional),
  • \(\tilde{\psi}_h(x_j) := O_h^T V_h x_j\),
  • softmax-normalised attention scores

    \[a_h(x_i, x_j) := \text{softmax}(\{(Q_h x_i)^T K_h x_k : k \in \mathcal{N}_i(A)\})_j = \frac{e^{(Q_h x_i)^T K_h x_j}}{\sum_{k \in \mathcal{N}_i(A)} e^{(Q_h x_i)^T K_h x_k}}\]
  • and \(\phi(x_i, y) := x_i + y\) (residual connection)

These choices give

\[f(X, A)_i = x_i + \sum_{h=1}^{H} O_h^T\left(\sum_{j=1}^{i} a_h(x_i, x_j) V_h x_j\right)\]

which is exactly the attention layer of the GPT models.

  • The choice \(\tilde{\psi}_h(x_j) := O_h^T V_h x_j\) maps \(x_j\) into a low-dimensional subspace, as \(\text{rank}(O_h^T V_h) \leq d_{\text{head}}\) and typically \(d_{\text{head}} \ll d\) is chosen. Choosing \(\tilde{\psi}_h(x_j) := W_h x_j\) instead would allow for mapping into a space of higher dimension than \(d_{\text{head}}\).
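The derived attention layer can be sketched directly (random weights, purely illustrative); the explicit loops make the causal restriction \(j \leq i\) and the per-head structure visible:

```python
import numpy as np

def attention_layer(X, Q, K, V, O):
    n, d = X.shape
    H = Q.shape[0]
    out = X.copy()                                   # residual connection phi
    for h in range(H):
        scores = (X @ Q[h].T) @ (X @ K[h].T).T       # scores[i, j] = (Q_h x_i)^T K_h x_j
        for i in range(n):
            s = scores[i, :i + 1]                    # causal: only j <= i
            a = np.exp(s - s.max()); a /= a.sum()    # softmax weights a_h(x_i, x_j)
            head = sum(a[j] * (V[h] @ X[j]) for j in range(i + 1))
            out[i] += O[h].T @ head                  # O_h^T maps the head back to dim d
    return out

rng = np.random.default_rng(0)
n, d, H, d_head = 5, 8, 2, 4
X = rng.standard_normal((n, d))
Q, K, V, O = (rng.standard_normal((H, d_head, d)) for _ in range(4))
Y = attention_layer(X, Q, K, V, O)
```

The first token can only attend to itself, so \(Y_0 = x_0 + \sum_h O_h^T V_h x_0\), a quick sanity check on the causal mask.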

RNN gating

An RNN has an update equation of the form

\[h_{t+1} := R(z_t, h_t)\]
  • The SimpleRNN chooses \(R(z_t, h_t) := \text{tanh}(Wz_t + Uh_t + b)\).

By undiscretizing by \(h_{t+1} \mapsto h(t+1)\), and using a Taylor expansion of

\[h(t+1) \approx h(t) + \frac{dh(t)}{dt}\]

we can get a continuous update rule:

\[h(t) + \frac{dh(t)}{dt} = R(z(t), h(t))\]

Consider a time warping \(\tau: \mathbb{R}^+ \to \mathbb{R}^+\), which is monotone increasing, i.e. \(\frac{d\tau(t)}{dt} > 0\). It turns out that the gating mechanism present in GRUs and LSTMs can be derived by requiring that the RNN model class is invariant to such time warping operations.

  • The concept of a model class being invariant can be seen as a generalization of \(G\)-invariance: a model class \(M\) is invariant to time warpings if for any \(f \in M\) and warping \(\tau\), we have \(f \circ \tau = g\) for some \(g \in M\). \(G\)-invariance is the special case \(g \equiv f\).

First make a substitution \(t \mapsto \tau(t)\):

\[\frac{dh(\tau(t))}{d\tau(t)} = R(z(\tau(t)), h(\tau(t))) - h(\tau(t))\]

And using the chain rule,

\[\frac{dh(\tau(t))}{dt} = \frac{dh(\tau(t))}{d\tau(t)} \frac{d\tau(t)}{dt} = \frac{d\tau(t)}{dt}(R(z(\tau(t)), h(\tau(t))) - h(\tau(t)))\]

Using a Taylor expansion

\[h(\tau(t+1)) \approx h(\tau(t)) + \frac{dh(\tau(t))}{dt}\]

(assuming that \(\frac{d\tau(t)}{dt} < 1\)), and defining \(\Gamma := \frac{d\tau(t)}{dt}\), this can be rewritten as

\[h(\tau(t+1)) = \Gamma R(z(\tau(t)), h(\tau(t))) + (1 - \Gamma)h(\tau(t))\]

Since the time warping \(\tau\) is unknown, the inputs received are effectively \(z_t := z(\tau(t))\). This also means that \(\Gamma\) should be learnable.

If we now impose that the model class is invariant to time warping, such that \(h(\tau(t)) = h(t)\), and then discretizing, we obtain:

\[h_{t+1} = \Gamma R(z_t, h_t) + (1 - \Gamma)h_t\]

And since \(0 < \Gamma < 1\), it is natural to choose \(\Gamma\) to take the SimpleRNN form, but with the sigmoid \(\sigma: \mathbb{R} \to [0, 1]\) in place of \(\tanh\), to match the range of \(\Gamma\). Then,

\[\Gamma := \sigma(Wz_t + Uh_t + b)\]

which is exactly the expression for a gate in the GRU and LSTM models.
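The resulting gated update can be sketched as follows, with the gate \(\Gamma\) taking the SimpleRNN form under a sigmoid (all weights are random placeholders):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(z, h, params):
    W, U, b, Wg, Ug, bg = params
    candidate = np.tanh(W @ z + U @ h + b)        # R(z_t, h_t): the SimpleRNN update
    gamma = sigmoid(Wg @ z + Ug @ h + bg)         # learnable warp rate Gamma in (0, 1)
    return gamma * candidate + (1.0 - gamma) * h  # convex mix: leaky integration

rng = np.random.default_rng(0)
dz, dh = 3, 4
params = (rng.standard_normal((dh, dz)), rng.standard_normal((dh, dh)),
          rng.standard_normal(dh), rng.standard_normal((dh, dz)),
          rng.standard_normal((dh, dh)), rng.standard_normal(dh))
h = np.zeros(dh)
for t in range(5):
    h = gated_step(rng.standard_normal(dz), h, params)
```

Because the update is a convex combination of the bounded candidate and the previous state, the hidden state stays bounded, one practical benefit of gating.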

]]>
Free-energy (2023-08-15, https://r-gould.github.io/2023/08/15/free-energy)

Introduction

This post describes the notion of an agent minimizing its free-energy, and the extension to active inference, in which the agent can also take actions, introducing an expected free-energy. I then derive an approach for applying these methods to reinforcement learning, comparing my approach to another in the literature. Finally, I briefly describe predictive coding, a more biologically plausible replacement for backpropagation.

  • Minimizing free-energy is exactly what both VAEs and Diffusion models perform. Free-energy is also related to the negative model evidence, and many common objectives minimize this, e.g. cross-entropy and mean-squared loss.

Free-energy

The free-energy principle says that self-organizing systems (e.g. biological systems) can be viewed as minimizing a quantity called the free energy.

There is a mathematically involved derivation of the free-energy principle by assuming that states follow Langevin dynamics along with some other assumptions. For a full description of this approach see Friston (2019).

Assuming the Bayesian brain hypothesis provides another, less involved, route to the concept of the brain minimizing free energy, as will be described in this section. The Bayesian brain hypothesis says: the brain has a generative model describing its beliefs about the world, and this model is updated based on perception.

Specifically, the brain receives sensory data/observation denoted by \(o \in \mathcal{O}\), and it is assumed this data is generated by some underlying process dependent on some cause/latent variable \(x \in \mathcal{X}\). Then the brain’s beliefs are implicitly contained in its generative model \(p(x, o) = p(x) p(o\vert x)\) made up of an explicit:

  • prior belief on latent variable \(p(x)\),
  • likelihood \(p(o\vert x)\).

After receiving an observation \(o\), the posterior \(p(x\vert o)\) is determined by Bayes rule:

\[p(x\vert o) = \frac{p(x, o)}{p(o)}\]

Typically, however, \(p(o) = \int_{\mathcal{X}} p(x, o) dx\) is intractable, so we consider a variational posterior \(q(x\vert o)\) that we want to effectively approximate \(p(x\vert o)\). The natural objective to achieve this is to minimize the KL-divergence between these distributions, and with some rearranging:

\[\begin{align*} D_{KL}(q(x|o) || p(x|o)) &= D_{KL}(q(x|o) || \frac{p(x, o)}{p(o)})\\ &= D_{KL}(q(x|o) || p(x, o)) + \log(p(o))\\ &=: F(o) + \log(p(o)) \end{align*}\]

where \(F(o) := D_{KL}(q(x|o) || p(x, o))\) is called the variational free energy of observation \(o\). Since \(\log(p(o))\) does not depend on \(q\), minimizing the tractable quantity \(F(o)\) over \(q\) is equivalent to minimizing the intractable KL-divergence to the true posterior. This shows that updating beliefs in a tractable way can be done by minimizing the variational free energy.

  • Variational auto-encoders (VAEs) then follow by letting \(p(o)\) be the data distribution and choosing prior belief \(p(x) := \mathcal{N}(x; 0, I)\) (isotropic Gaussian), and \(p(o|x;\theta)\), \(q(x|o;\phi)\) to be parameterized MVNs, and then applying the reparameterization trick to obtain a low variance estimator of \(\mathbb{E}_{p(o)}[F(o)]\) and applying gradient descent on this estimator.
  • Diffusion models follow from a data distribution \(p(o)\) and \(x = (x_1, \ldots, x_T)\), with \(p(x_T)\) an isotropic Gaussian, and generative model \(p(x, o; \theta) = p(x_T) \prod_{t=1}^{T} p(x_{t-1}|x_t; \theta)\) with \(x_0 := o\), and \(q(x|o) = \prod_{t=1}^{T} q(x_t|x_{t-1})\) (i.e. Markov chains in both directions).

    Specifically, there is an exact noising process \(q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)\) and a parameterized noise-reversing process \(p(x_{t-1}|x_t; \theta) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))\). The rest then amounts to minimizing \(\mathbb{E}_{p(o)}[F(o)]\) by using some tricks to find a low variance estimator and applying gradient descent.

The negative model evidence acts as a lower bound on \(F(o)\):

\[\begin{align*} F(o) = D_{KL}(q(x|o)||p(x, o)) &= -\log p(o) + D_{KL}(q(x|o)||p(x|o))\\ &\geq -\log p(o) \end{align*}\]

since the KL-divergence is non-negative. Minimizing \(F(o)\) therefore pushes the model evidence \(\log p(o)\) up, i.e. pushes our model towards matching our perceived observations.

  • Maximizing expected (across a data distribution) model evidence is a common form of objective function in machine learning. For example, cross-entropy maximizes it directly, and minimizing the mean squared error corresponds to maximizing the model evidence under a Gaussian \(p(o)\).

By manipulating, a more interpretable expression for the free energy can be obtained:

\[F(o) = D_{KL}(q(x|o)||p(x)) - \mathbb{E}_{q(x|o)}[\log p(o|x)]\]

Then minimization of \(F(o)\) incentivizes

  • the posterior to be aligned with prior belief, which has a regularizing effect, (left term)
  • and the posterior to be accurate at modelling the environment (right term)
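These quantities can be made concrete in a 1-D Gaussian toy model (all parameter values are illustrative): with prior \(p(x) = \mathcal{N}(0, 1)\), likelihood \(p(o|x) = \mathcal{N}(x, 1)\), and variational posterior \(q(x|o) = \mathcal{N}(m, s^2)\), a Monte Carlo estimate of \(F(o)\) attains the negative model evidence exactly when \(q\) equals the true posterior \(\mathcal{N}(o/2, 1/2)\).

```python
import numpy as np

def free_energy(o, m, s, n_samples=200_000, seed=0):
    """Monte Carlo estimate of F(o) = E_q[log q(x|o) - log p(x, o)]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(m, s, n_samples)                   # samples from q(x|o)
    log_prior = -0.5 * (x**2 + np.log(2 * np.pi))     # log p(x), N(0, 1)
    log_lik = -0.5 * ((o - x)**2 + np.log(2 * np.pi)) # log p(o|x), N(x, 1)
    log_q = -0.5 * (((x - m) / s)**2 + np.log(2 * np.pi * s**2))
    return np.mean(log_q - log_prior - log_lik)

o = 1.3
F_opt = free_energy(o, o / 2, np.sqrt(0.5))           # q at the exact posterior
neg_log_evidence = 0.5 * (o**2 / 2 + np.log(2 * np.pi * 2))  # -log p(o), p(o) = N(0, 2)
print(F_opt, neg_log_evidence)  # approximately equal
```

Any other choice of \((m, s)\) gives a strictly larger \(F(o)\), the gap being \(D_{KL}(q(x|o) || p(x|o))\).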

Active Inference

Action

Active inference extends the above by including action. The above describes a system updating its beliefs to minimize free energy, but active inference allows the system to also take actions in the environment in order to make its own beliefs come true and hence minimize free energy.

  • e.g. an agent may need to remain at a specific temperature, and if their temperature is currently too high, then the agent can either update their model to expect high temperatures (which the brain is stubborn to do, as maintaining the correct temperature is necessary for bodily functions), or they can act in the world in order to change their temperature, such as opening a window.

A system's observational preferences are encoded in its biased prior, denoted \(\tilde{p}(o)\); e.g. in the context of RL, \(\tilde{p}(o)\) may assign high probability to high-reward observations.

  • \(\tilde{p}\) differs from \(p\) in that, for example, \(p(o\vert x)\) is unbiased and models the environment, whereas \(\tilde{p}\) involves the preferences of an agent and depends on its subjective goals.

Future Free Energy

To make effective actions, one must consider the future consequences of actions. To formalize such a notion requires the concept of an expected free energy.

The free energy of the expected future (FEEF) is defined as

\[\mathcal{F} := D_{KL}(q(o, x, \pi) || \tilde{p}(o, x))\]

with \(\tilde{p}(o, x) = \tilde{p}(o) p(x|o)\).

We decompose \(q(o, x, \pi) = q(\pi) q(o, x \vert \pi)\), with \(q(\pi)\) the prior over policies.

Concretely, one can think of \(o = (o_0, \ldots, o_T)\), \(x = (x_0, \ldots, x_T)\) and \(\pi = (a_0, \ldots, a_{T-1})\) representing the observations, states, and actions respectively for a time horizon of length \(T\).

Rewriting the FEEF using the above decomposition for \(q(o, x, \pi)\) gives:

\[\begin{align*} \mathcal{F} &:= D_{KL}(q(o, x, \pi) || \tilde{p}(o, x))\\ &= \mathbb{E}_{q(o, x, \pi)}[\log(\frac{q(o, x, \pi)}{\tilde{p}(o, x)})]\\ &= \mathbb{E}_{q(o, x, \pi)}[\log(q(\pi)) - (\log(\tilde{p}(o, x)) - \log(q(o, x | \pi)))]\\ &= \mathbb{E}_{q(\pi)}[\log(q(\pi)) - \log(e^{-\mathbb{E}_{q(o, x | \pi)}[\log(q(o, x | \pi)) - \log(\tilde{p}(o, x))]})]\\ &= \mathbb{E}_{q(\pi)}[\log(q(\pi)) - \log(e^{-D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))})]\\ &= D_{KL}(q(\pi) || e^{-D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))})\\ &=: D_{KL}(q(\pi) || e^{-\mathcal{F}_{\pi}}) \end{align*}\]

where \(\mathcal{F}_{\pi} := D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))\), measuring the difference between our preferences (described by \(\tilde{p}(o, x)\)) and the actual trajectories we observe due to a policy \(\pi\).

Hence if we consider minimizing \(\mathcal{F}\) wrt. \(q(\pi)\), then by variational principles the minimizing solution is \(q(\pi) \propto e^{-\mathcal{F}_{\pi}}\), normalized appropriately. In particular, the most probable policy under this prior is \(\pi^{*} := \text{argmin}_{\pi} \mathcal{F}_{\pi}\).

We can rewrite \(\mathcal{F}_{\pi}\) in an interpretable form using the approximation \(p(x\vert o) \approx q(x\vert o) \approx q(x\vert o, \pi)\):

\[\begin{align*} \mathcal{F}_{\pi} &:= D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))\\ &= \mathbb{E}_{q(o, x|\pi)}[\log(q(x|\pi)) + \log(q(o|x, \pi)) - \log(\tilde{p}(o)) - \log(p(x|o))]\\ &\approx \mathbb{E}_{q(o, x|\pi)}[\log(q(x|\pi)) + \log(q(o|x, \pi)) - \log(\tilde{p}(o)) - \log(q(x|o, \pi))]\\ &= \mathbb{E}_{q(x|\pi) q(o|x, \pi)}[\log(\frac{q(o|x, \pi)}{\tilde{p}(o)})] - \mathbb{E}_{q(o|\pi) q(x|o, \pi)}[\log(\frac{q(x|o, \pi)}{q(x|\pi)})]\\ &= \mathbb{E}_{q(x|\pi)}[D_{KL}(q(o|x, \pi) || \tilde{p}(o))] - \mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))] \end{align*}\]

and so minimizing \(\mathcal{F}_{\pi}\) means

  • The first term, \(\mathbb{E}_{q(x\vert \pi)}[D_{KL}(q(o\vert x, \pi) \| \tilde{p}(o))]\), describing the difference between preferred and expected observations, will be incentivized to decrease.
  • The second term, \(\mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))]\), describing the expected information gain under observation, will be incentivized to increase. This can be seen as incentivizing exploration. In the context of reinforcement learning, terms that incentivize exploration must be added ad-hoc, as the idea of value maximization does not explicitly include this, whereas under Active Inference, this idea naturally arises.

Reinforcement Learning

Now consider the special case of reinforcement learning, where for a time horizon of length \(T\),

  • States \(x = (s_0, \ldots, s_T)\) of the environment, with \(s_i \in \mathbb{R}^S\).
  • Observations \(o = (o_0, \ldots, o_T)\) with \(o_i = (r_i, s_{i+1})\) in the case of RL, since the environment states are given directly to the agent. Reward \(r_i \in \mathbb{R}\).
  • Policy \(\pi = (a_0, \ldots, a_T)\), with actions \(a_i \in \mathbb{R}^A\).

forming a Markov Decision Process (MDP). For one step, the environment maps \((s_i, a_i) \mapsto o_i = (r_i, s_{i+1})\).

The natural prior which we impose is that \(\tilde{p}(o_i) = \tilde{p}(r_i)\) has a high density for large \(r_i\), such that our agent prefers to achieve high rewards.

By using the properties of an MDP,

\[q(o, x|\pi) = q(x|\pi) q(o|x, \pi) = \prod_{t=0}^{T} q(x_t|x_{t-1}, \pi) q(o_t|x_t, \pi)\]

and also,

\[\tilde{p}(o) = \prod_{t=0}^{T} \tilde{p}(o_t)\]

Recall that we can approximate

\[\mathcal{F}_{\pi} \approx \mathbb{E}_{q(x|\pi)}[D_{KL}(q(o|x, \pi) || \tilde{p}(o))] - \mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))]\]

Using the fact that \(q(o\vert x, \pi) = \prod_{t=0}^{T} q(r_t\vert x_t, a_t)\) (since MDP), \(q(x\vert \pi) = \prod_{t=0}^{T} q(x_t\vert x_{t-1}, a_{t-1})\) and \(\tilde{p}(o) = \prod_{t=0}^{T} \tilde{p}(o_t)\), then the first term can be written

\[\begin{align*} \mathbb{E}_{q(x|\pi)}[D_{KL}(q(o|x, \pi) || \tilde{p}(o))] &= \mathbb{E}_{q(x|\pi)q(o|x, \pi)}[\log(\prod_{t=0}^{T} \frac{q(r_t|x_t, a_t)}{\tilde{p}(r_t)})]\\ &= \sum_{t=0}^{T} \mathbb{E}_{q(x_t|x_{t-1}, a_{t-1}) q(r_t|x_t, a_t)}[\log(\frac{q(r_t|x_t, a_t)}{\tilde{p}(r_t)})]\\ &= \sum_{t=0}^{T} \mathbb{E}_{q(x_t|x_{t-1}, a_{t-1})}[D_{KL}(q(r_t|x_t, a_t) || \tilde{p}(r_t))] \end{align*}\]

and using \(q(x\vert o, \pi) = \frac{q(x\vert \pi)q(o\vert x, \pi)}{q(o\vert \pi)}\) and \(q(o\vert \pi) = \prod_{t=0}^{T} q(x_t(o_t)\vert x_{t-1}(o_{t-1}), a_{t-1}) q(r_t(o_t)\vert x_t(o_t), a_t)\), the second term can be written

\[\begin{align*} \mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))] &= \mathbb{E}_{q(x, o|\pi)}[\log(\frac{q(o|x, \pi)}{q(o|\pi)})]\\ &= \mathbb{E}_{q(x, o|\pi)}[\log(\prod_{t=0}^{T} \frac{q(r_t|x_t, a_t)}{q(x_t|x_{t-1}, a_{t-1}) q(r_t|x_t, a_t)})]\\ &= -\sum_{t=0}^{T} \mathbb{E}_{q(x|\pi) q(o|x, \pi)}[\log(q(x_t|x_{t-1}, a_{t-1}))]\\ &= -\sum_{t=0}^{T} \mathbb{E}_{q(x_{t-1}|x_{t-2}, a_{t-2}) q(x_t|x_{t-1}, a_{t-1})}[\log(q(x_t|x_{t-1}, a_{t-1}))]\\ &= \sum_{t=0}^{T} \mathbb{E}_{q(x_{t-1}|x_{t-2}, a_{t-2})}[H[q(x_t|x_{t-1}, a_{t-1})]] \end{align*}\]

hence we can finally write

\[\mathcal{F}_{\pi} = \sum_{t=0}^{T} \mathcal{F}_{\pi, t}\]

where

\[\mathcal{F}_{\pi, t} := \mathbb{E}_{q(x_t|x_{t-1}, a_{t-1})}[D_{KL}(q(r_t|x_t, a_t) || \tilde{p}(r_t))] - \mathbb{E}_{q(x_{t-1}|x_{t-2}, a_{t-2})}[H[q(x_t|x_{t-1}, a_{t-1})]]\]

In the context of RL, we call \(q(r_t|x_t, a_t)\) the reward model, representing a model of the reward distribution given the state we are in and the actions we take. \(q(x_t|x_{t-1}, a_{t-1})\) is called the transition model and models the next state distribution given the current state and the actions we take.

Implementation

To implement these ideas computationally, we need to pick reward \& transition distributions, alongside a prior \(\tilde{p}(r_t)\). Gaussians are nice in that the KL divergence between two Gaussians has a closed form
expression, and the entropy of a Gaussian is also closed form, which would give a closed form expression for \(\mathcal{F}_{\pi, t}\).

Given this, we choose the reward model \(q(r_t|x_t, a_t) = \mathcal{N}(r_t; f_{\mu}(x_t, a_t), f_{\sigma^2}(x_t, a_t))\) and the transition model \(q(x_t|x_{t-1}, a_{t-1}) = \mathcal{N}(x_t; g_{\mu}(x_{t-1}, a_{t-1}), \text{diag}(g_{\sigma^2}(x_{t-1}, a_{t-1})))\). We represent \(f_{\mu}, f_{\sigma^2}: \mathbb{R}^{S} \times \mathbb{R}^{A} \to \mathbb{R}\) and \(g_{\mu}, g_{\sigma^2}: \mathbb{R}^{S} \times \mathbb{R}^{A} \to \mathbb{R}^{S}\) as neural networks.

A natural prior is \(\tilde{p}(r_t) = \mathcal{N}(r_t; r_{\text{max}}, \alpha^2)\) where \(r_{\text{max}}\) is the maximum reward in the environment and \(\alpha\) is suitably chosen.

Then, we can find

\[D_{KL}(q(r_t|x_t, a_t) || \tilde{p}(r_t)) = \frac{1}{2}\left(\frac{(f_{\mu}(x_t, a_t) - r_{\text{max}})^2 + f_{\sigma^2}(x_t, a_t)}{\alpha^2} + \log\left(\frac{\alpha^2}{f_{\sigma^2}(x_t, a_t)}\right) - 1\right)\]

and

\[H[q(x_t|x_{t-1}, a_{t-1})] = \frac{1}{2} \sum_{i=1}^{S} \log(2\pi e (g_{\sigma^2}(x_{t-1}, a_{t-1}))_i)\]

Hence

\[\begin{align*} \mathcal{F}_{\pi, t} = \frac{1}{2}&\mathbb{E}_{q(x_t|x_{t-1}, \pi)}\left[\frac{(f_{\mu}(x_t, a_t) - r_{\text{max}})^2 + f_{\sigma^2}(x_t, a_t)}{\alpha^2} + \log(\frac{\alpha^2}{f_{\sigma^2}(x_t, a_t)}) - 1\right]\\ &- \frac{1}{2}\mathbb{E}_{q(x_{t-1}|x_{t-2}, \pi)}\left[\sum_{i=1}^{S} \log(2\pi e (g_{\sigma^2}(x_{t-1}, a_{t-1}))_i)\right] \end{align*}\]
  • When implementing, the expectations can be approximated by an average over a sufficiently large number of trajectories, sampled stochastically from \(q(x_t\vert x_{t-1}, \pi)\).
  • The transition and reward models are pretrained before being used to pick an action.
  • At a state, when choosing an action, \(N\) candidate policies (over some horizon of \(H\) timesteps) are sampled from a normal distribution, and \(\mathcal{F}_{\pi}\) is evaluated for each. The top \(M \ll N\) policies are kept, and their mean and standard deviation are used to resample new candidates. After repeating this for a few iterations, the mean of the top policies at horizon step \(t=0\) is returned as the agent's next action.
  • Details can be found in code: https://github.com/r-gould/active-inference
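The sampling loop described in the bullets is essentially a cross-entropy-method planner. A sketch under that reading (the `score` callable stands in for evaluating \(\mathcal{F}_{\pi}\) with the learned models, and all hyperparameter values are illustrative):

```python
import numpy as np

def plan(score, horizon, act_dim, n_candidates=100, n_top=10, n_iters=4, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        cands = mu + std * rng.standard_normal((n_candidates, horizon, act_dim))
        F = np.array([score(c) for c in cands])    # lower F_pi is better
        top = cands[np.argsort(F)[:n_top]]         # keep the top M policies
        mu, std = top.mean(axis=0), top.std(axis=0) + 1e-6
    return mu[0]                                   # the agent's next action a_0

# Toy check: with "free energy" ||pi - 1||^2 the planner moves towards actions near 1.
a0 = plan(lambda pi: np.sum((pi - 1.0) ** 2), horizon=4, act_dim=2)
```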

Tschantz et al. (2020) arrive at a different approximation for \(\mathcal{F}_{\pi}\). I now compare my approach detailed above with theirs, by implementing my approach in the codebase for their paper, with the same hyperparameters across the two approaches. Saving the rewards across 10 runs for each method and averaging over these runs gives the following plots:

My approach outperforms theirs with 95% confidence (via bootstrapping of the \(U\)-statistic wrt. the reward at the final timestep) in both environments.

Code can be found at: https://github.com/r-gould/active-inference

Predictive Coding

Now going back to the case of no action, just perception:

Consider a multi-layer hierarchy described by \(x = (x_1, \ldots, x_L)\) and causal structure

\[x_L \to x_{L-1} \to \cdots \to x_1 \to o\]

which means that \(p(o, x) = \prod_{i=0}^{L} p(x_i\vert x_{i+1})\), with \(x_0 := o\) and \(p(x_L\vert x_{L+1}) := p(x_L)\).

Predictive coding can be derived from here by making normal assumptions. In particular, we let

\[p(x_i|x_{i+1}) := \mathcal{N}(x_i; \mu_i(x_{i+1};\theta_i), \Sigma_i(x_{i+1};\theta_i))\]

and, using the mean field approximation,

\[q(x|o) \approx \prod_{i=1}^{L} q(x_i|o)\]

with \(q(x_i\vert o) := \mathcal{N}(x_i;\hat{\mu}_i, \hat{\Sigma}_i)\).

We can rewrite the free energy as

\[F(o) = D_{KL}(q(x|o)||p(x, o)) = -H[q(x|o)] - \mathbb{E}_{q(x|o)}[\log p(x, o)]\]

The entropy term on the left is nice to evaluate since \(q(x\vert o)\) is the product of Gaussians:

\[\begin{align*} -H[q(x|o)] = \mathbb{E}_{q(x|o)}[\log q(x|o)] &= \sum_{i=1}^{L} \mathbb{E}_{q(x_i|o)}[\log q(x_i|o)]\\ &= -\sum_{i=1}^{L} H[q(x_i|o)]\\ &= -\frac{1}{2} \sum_{i=1}^{L} (n_i + \log(|2\pi \hat{\Sigma}_i|)) \end{align*}\]

with \(x_i \in \mathbb{R}^{n_i}\). See that this is constant with respect to \(\{\theta_i\}_i\) and \(\{\hat{\mu}_i\}_i\).

Assuming that \(\Sigma_i\) is a fixed hyperparameter, the term on the right of \(F(o)\) can be written using a Taylor expansion of \(\log p(x_i\vert x_{i+1})\) about \(x_i = \hat{\mu}_i\), \(x_{i+1} = \hat{\mu}_{i+1}\). Expanding gives

\[\begin{align*} \log p(x_i|x_{i+1}) = \log p(\hat{\mu}_i|\hat{\mu}_{i+1}) &+ (x_i - \hat{\mu}_i)^T \frac{\partial\log p(x_i|x_{i+1})}{\partial x_i}\bigg\rvert_{x_i = \hat{\mu}_i}\\ &+ \frac{1}{2} (x_i - \hat{\mu}_i)^T \frac{\partial^2 \log p(x_i|x_{i+1})}{\partial x_i^2}\bigg\rvert_{x_i = \hat{\mu}_i} (x_i - \hat{\mu}_i) + ... \end{align*}\]

See that \(\mathbb{E}_{q(x_i\vert o)}[x_i-\hat{\mu}_i] = 0\) and \(\frac{\partial^2 \log p(x_i\vert x_{i+1})}{\partial x_i^2} = -\Sigma_i^{-1}\), with third order and higher terms \(0\), and

\[\begin{align*} \mathbb{E}_{q(x_i|o)}[(x_i - \hat{\mu}_i)^T \frac{\partial^2 \log p(x_i|x_{i+1})}{\partial x_i^2}\bigg\rvert_{x_i = \hat{\mu}_i} (x_i - \hat{\mu}_i)] &= -\mathbb{E}_{q(x_i|o)}[\text{tr}((x_i - \hat{\mu}_i)^T \Sigma_i^{-1} (x_i - \hat{\mu}_i))]\\ &= -\mathbb{E}_{q(x_i|o)}[\text{tr}(\Sigma_i^{-1} (x_i - \hat{\mu}_i) (x_i - \hat{\mu}_i)^T)]\\ &= -\text{tr}(\Sigma_i^{-1} \hat{\Sigma}_i) \end{align*}\]

where the last line follows since \(\mathbb{E}\) and \(\text{tr}\) commute, and \(\mathbb{E}_{q(x_i|o)}[(x_i - \hat{\mu}_i) (x_i - \hat{\mu}_i)^T] \equiv \hat{\Sigma}_i\). Therefore the first and second order terms are constant in \(\{\theta_i\}_i\) and \(\{\hat{\mu}_i\}_i\), so the term on the right of \(F(o)\) is

\[-\mathbb{E}_{q(x|o)}[\log p(x, o)] = -\sum_{i=0}^{L} \mathbb{E}_{q(x|o)}[\log p(x_i|x_{i+1})] = -\sum_{i=0}^{L} \log p(\hat{\mu}_i|\hat{\mu}_{i+1}) + \text{const}\]

with \(\hat{\mu}_0 := x_0\). Hence the free energy is

\[\begin{align*} F(o) &= -\sum_{i=0}^{L} \log p(\hat{\mu}_i|\hat{\mu}_{i+1}) + \text{const}\\ &= \frac{1}{2} \sum_{i=0}^{L} ((\hat{\mu}_i - \mu_i(\hat{\mu}_{i+1};\theta_i))^T \Sigma_i(\hat{\mu}_{i+1};\theta_i)^{-1} (\hat{\mu}_i - \mu_i(\hat{\mu}_{i+1};\theta_i)) + \log(|2\pi \Sigma_i(\hat{\mu}_{i+1};\theta_i)|)) + \text{const}\\ &=: \frac{1}{2} \sum_{i=0}^{L} (\epsilon_i^T \Sigma_i^{-1} \epsilon_i + \log(|2\pi \Sigma_i|)) + \text{const} \end{align*}\]

with \(\epsilon_i := \hat{\mu}_i - \mu_i(\hat{\mu}_{i+1};\theta_i)\).

  • In the case of \(\Sigma_i = I\), minimizing the free energy amounts to minimizing \(\sum_{i=0}^{L} \lVert \epsilon_i \rVert^2\), pushing \(\hat{\mu}_i\) and \(\mu_i(\hat{\mu}_{i+1};\theta_i)\) towards each other. \(\hat{\mu}_{i+1}\) from the layer \(i+1\) is used to predict \(\hat{\mu}_i\) for layer \(i\).

Free energy can then be minimized by optimizing over \(\{\theta_i\}_i\) and \(\{\hat{\mu}_i\}_i\).

  • Optimizing over \(\{\theta_i\}_i\), which parameterize \(p(o, x)\), can be thought of as learning the relationship between hidden states and observations,
  • and optimizing over \(\{\hat{\mu}_i\}_i\), which parameterize \(q(x\vert o)\), can be thought of as improving perception, getting better at inferring the underlying hidden state given an observation.

We can use gradient descent to perform this optimization, with gradients

\[\frac{\partial F}{\partial \theta_i} = -\left(\frac{\partial \mu_i}{\partial \theta_i}\right)^T \Sigma_i^{-1} \epsilon_i\] \[\frac{\partial F}{\partial \hat{\mu}_i} = \Sigma_i^{-1} \epsilon_i - 1\{i \geq 1\} \left(\frac{\partial \mu_{i-1}}{\partial \hat{\mu}_i}\right)^T \Sigma_{i-1}^{-1} \epsilon_{i-1}\]

with \(1\{\; \cdot \;\}\) the indicator function.

  • The update of \(\theta_i\) depends just on the error \(\epsilon_i\) involving \(\hat{\mu}_i\), and \(\hat{\mu}_{i+1}\) from the layer above.
  • The update of \(\hat{\mu}_i\) requires \(\epsilon_i\) too but also the gradient information \(\frac{\partial \mu_{i-1}}{\partial \hat{\mu}_i}\) and the error \(\epsilon_{i-1}\) involving \(\hat{\mu}_i\), and \(\hat{\mu}_{i-1}\) from below.

i.e. these updates are local.
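These local updates can be sketched in code. The following is a minimal sketch under simplifying assumptions not fixed by the text: \(\Sigma_i = I\), linear predictions \(\mu_i(\hat{\mu}_{i+1}; \theta_i) = W_i \hat{\mu}_{i+1}\), and a clamped top-level value; the function and variable names are mine.

```python
import numpy as np

def pc_step(mu, W, lr_mu=0.1, lr_W=0.01):
    """One predictive-coding update with Sigma_i = I.

    mu: list [mu_0, ..., mu_{L+1}]; mu[0] is the observation (clamped)
        and mu[L+1] is a clamped top-level value.
    W:  list [W_0, ..., W_L]; the prediction for layer i is W[i] @ mu[i+1].
    """
    L = len(W) - 1
    # prediction errors eps_i = mu_i - W_i mu_{i+1}
    eps = [mu[i] - W[i] @ mu[i + 1] for i in range(L + 1)]

    # descend F in the hidden means: dF/dmu_i = eps_i - W_{i-1}^T eps_{i-1},
    # which only involves the errors of layers i and i-1 (locality)
    new_mu = [m.copy() for m in mu]
    for i in range(1, L + 1):
        new_mu[i] -= lr_mu * (eps[i] - W[i - 1].T @ eps[i - 1])

    # descend F in the weights: dF/dW_i = -eps_i mu_{i+1}^T
    new_W = [W[i] + lr_W * np.outer(eps[i], mu[i + 1]) for i in range(L + 1)]
    return new_mu, new_W, eps
```

Iterating `pc_step` decreases the (constant-free) free energy \(\frac{1}{2}\sum_i \lVert \epsilon_i \rVert^2\).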

  • Millidge et al. (2020) shows that with a reversed causal structure (for the backward pass) of

    \[o \to x_1 \to \cdots \to x_L\]

    and \(\Sigma_i = I\), then at the equilibrium point with \(\frac{\partial F}{\partial \hat{\mu}_i} = 0\) and loss \(\mathcal{L} := \frac{1}{2} (T - \mu_L)^2\) for targets \(T\), the fixed points of the errors are \(\epsilon_i^* = \frac{\partial \mathcal{L}}{\partial \mu_i}\) if one chooses \(\hat{\mu}_L := T\). Additionally we have \(\frac{\partial F}{\partial \theta_i} = -\frac{\partial \mathcal{L}}{\partial \theta_i}\).

  • The notable observation here is that through local updates alone, one can converge to exactly what backpropagation computes (which requires global gradient information). Due to this locality, predictive coding is considered a biologically-plausible learning algorithm for the brain.

]]>
Understanding HTM2023-08-03T09:00:00+00:002023-08-03T09:00:00+00:00https://r-gould.github.io/2023/08/03/understanding-htmIntroduction

(repo https://github.com/r-gould/htm)

A description and implementation of the Hierarchical Temporal Memory (HTM) system, a self-supervised, non-probabilistic, non-gradient-based prediction algorithm inspired by the neocortex. The implementation is focused on next token prediction.

  • The BaMI document describes the HTM system in high detail. This post aims to condense the key ideas and algorithms.

The approach rests on some observed properties of the neocortex:

  • Uniformity: across different points in the neocortex, the structure and connections are highly similar.
  • Adaptability: different regions of the neocortex can substitute for other regions

    (When visual input is fed to the auditory cortex of a ferret, the auditory cortex learns to ‘see’, see here. There are also cases of people missing large parts of their brain but still functioning, as if the remaining parts of the brain rewire and compensate.)

An extension of these ideas is that the neocortex is built out of functionally identical cortical columns, with, for example, cortical columns in the visual cortex performing the same general algorithm as those in the auditory cortex. The only reason that the cortical columns in these regions perform different roles is because they receive different data, but ultimately, the same general algorithm is being applied.

  • See here for the first discovery of this columnar structure.
  • This uniformity is also seen to work well in machine learning: design a single component, such as an attention head or convolutional layer, and compose many (functionally identical) copies of it together.

Cortical columns consist of 6 layers, and the HTM system is a model of one of these (input) layers.

Encoders

Overview

Components in the brain communicate via sparse distributed representations (SDRs), representing groups of neuron firings. An SDR is modelled as a binary vector \(x \in \{0, 1\}^n\) (for some \(n \in \mathbb{N}\)) with a sparse number of ‘on’ bits, meaning \(\sum_i x_i \ll n\). Define \(S(n, k) := \{x \in \{0, 1\}^n : \sum_i x_i = k\}\), i.e. the set of binary vectors with exactly \(k\) ‘on’ bits.

  • In the brain, singular neuron firings have been observed to represent semantic concepts.
  • Two SDRs that represent a similar thing semantically should have an overlap in their ‘on’ bits to allow for better generalisation.
  • SDRs have some nice benefits for communication, discussed here.
  • Current machine learning methods do not explicitly use sparse representations, but there is some evidence in LLMs that the ‘dense’ representations are actually built up from sparse, semantic features, i.e. the model possibly learns to use SDRs.

Encoders are models of sensory receptors, which convert real world data into SDRs to be used by components in the brain.

  • e.g. an encoder \(e\) turning RGB image data into SDRs has type \(e: \mathbb{R}^{h \times w \times 3} \to S(n, k)\) for some \(n\), and \(k \ll n\).

An SDR sparsity of \(2\%\) is often used, meaning that \(\frac{k}{n} = 0.02\).
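As a concrete illustration, a simple scalar encoder (a toy sketch of my own, not taken from BaMI) maps a value to a contiguous block of \(k\) ‘on’ bits, so nearby values share overlapping ‘on’ bits; with \(n = 400\) and \(k = 8\), the sparsity is \(2\%\):

```python
import numpy as np

def scalar_encode(value, n=400, k=8, vmin=0.0, vmax=1.0):
    """Encode a scalar in [vmin, vmax] as an SDR in S(n, k): a contiguous
    block of k 'on' bits whose start position scales with the value, so
    nearby values share overlapping 'on' bits."""
    x = np.zeros(n, dtype=np.int8)
    frac = (min(max(value, vmin), vmax) - vmin) / (vmax - vmin)
    start = int(round(frac * (n - k)))
    x[start:start + k] = 1
    return x

def overlap(a, b):
    """Number of shared 'on' bits between two SDRs."""
    return int(np.sum(a & b))
```

Here `overlap` measures semantic similarity: close values overlap, distant values do not.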

Semantic Embeddings for NLP

(for understanding the HTM system generally, can skip this)

From the above, in the context of NLP, we want SDRs of, say, tokens across a vocabulary, to have overlapping ‘on’ bits when semantic information is shared.

One way of producing such semantic SDRs is through the use of self-organizing maps.

In the context of NLP, this involves:

  1. Pick a tokenizer \(T: \text{str} \to [\text{int}]\) with vocabulary size denoted by \(V\).
  2. For a dataset of text, split the text into snippets of paragraphs (or variable length).
  3. For each snippet, tokenize it, and create a binary vector of length \(V\), with \(1\) at a token index if that token is present in the snippet, and \(0\) otherwise.
  4. Pass these binarized snippets into the BSOM algorithm.

  • This describes the BSOM algorithm in a general context (see Algorithm 2 in the linked).
  • The BSOM algorithm in the context of NLP is described here (and also credit to the author Johannes E. M. Mosig for answering some of my questions regarding the process).

I implemented the BSOM algorithm here, with the training script here.
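The snippet-binarization step can be sketched as follows; `tokenize` stands in for any chosen tokenizer \(T\), and the helper names are hypothetical:

```python
import numpy as np

def binarize_snippets(snippets, tokenize, vocab_size):
    """Turn text snippets into binary bag-of-words vectors of length V,
    with bit t set iff token t occurs in the snippet. `tokenize` is any
    tokenizer str -> list[int]."""
    X = np.zeros((len(snippets), vocab_size), dtype=np.int8)
    for i, snippet in enumerate(snippets):
        for t in tokenize(snippet):
            X[i, t] = 1  # presence, not count
    return X
```

For example, with a toy whitespace tokenizer over a 4-word vocabulary, each row marks which tokens appear in the corresponding snippet.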

HTM System

Definitions

Now I describe the HTM system: a model of an input layer in a cortical column.

  • This system accounts for only a portion of what a cortical column does, and does not account for interactions between layers and cortical columns.
  • The spatial pooling and temporal memory algorithms describe how an HTM layer learns from a stream of input SDRs over time.

The layer is made up of a number, say \(C\), of minicolumns. When the layer receives an input SDR \(x^{(t)} \in S(n, k)\) (at time \(t\)), each minicolumn receives a fixed subset of the total input bits \(\{1, \ldots, n\}\). This is described through the notion of proximal synapses \(S(c) \subset \{1, \ldots, n\}\), with minicolumn \(c \in \{1, \ldots, C\}\) possessing connections to the input indices in \(S(c)\). Each proximal synapse \(s \in S(c)\) has a time-dependent permanence \(p_{c, s}^{(t-1)} \in [0, 1]\). Then the receptive synapses \(R^{(t-1)}(c) := \{s \in S(c) : p_{c, s}^{(t-1)} \geq p_{\text{thresh}}\} \subset S(c)\) are synapses with a large enough permanence to actually receive their input (synapses \(s \in S(c) \backslash R^{(t-1)}(c)\) are connected but do not actually receive the input as their permanence is not high enough, but during learning, the permanence can increase such that they become receptive).

  • The spatial pooling algorithm updates these permanences and returns a set of active columns \(A^{(t)} \subset \{1, \ldots, C\}\).
  • Note that the set of minicolumns and the set of proximal synapses are fixed, and do not depend on the timestep \(t\). Permanences change via learning, which causes the set of receptive synapses to change.

Each minicolumn has a number, say \(N\), of neurons. Each neuron has a set of distal segments. The total distal segments on a minicolumn (totalled across all neurons in the minicolumn) is given by \(\mathcal{D}^{(t-1)}(c)\), and for \(d \in \mathcal{D}^{(t-1)}(c)\), the origin neuron from which the segment stems is given by \(\mathcal{N}(d) \in \{1, \ldots, NC\}\). The distal synapses on a distal segment \(d\) are given by \(\mathcal{S}^{(t-1)}(d)\), and similarly to proximal synapses, each distal synapse \(s \in \mathcal{S}^{(t-1)}(d)\) has a permanence \(q_{d, s}^{(t-1)} \in [0, 1]\) with receptive synapses \(\mathcal{R}^{(t-1)}(d) := \{s \in \mathcal{S}^{(t-1)}(d) : q_{d, s}^{(t-1)} \geq q_{\text{thresh}}\} \subset \mathcal{S}^{(t-1)}(d)\). Instead of each synapse connecting to a bit in the input, the synapses instead connect to other neurons across the HTM layer. Specifically, each \(s \in \mathcal{S}^{(t-1)}(d)\) has an associated pre-synaptic neuron \(\mathcal{P}(s) \in \{1, \ldots, NC\}\) that it is connected to.
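A minimal sketch of this bookkeeping (the container names are mine, not from the reference implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """A distal segment d: its origin neuron N(d), plus per-synapse
    pre-synaptic neurons P(s) and permanences q_{d,s}."""
    neuron: int
    presyn: list = field(default_factory=list)
    perm: list = field(default_factory=list)

    def receptive(self, q_thresh):
        # indices of synapses with permanence >= q_thresh, i.e. R(d)
        return [i for i, q in enumerate(self.perm) if q >= q_thresh]

@dataclass
class Minicolumn:
    """A minicolumn c: proximal synapse targets S(c) with permanences
    p_{c,s}, and the distal segments D(c) of its neurons."""
    proximal: list
    perm: list
    segments: list = field(default_factory=list)
```

The same `receptive` check applies to proximal permanences against \(p_{\text{thresh}}\).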

e.g. a neuron with 2 distal segments, each segment with 3 synapses (without including pre-synaptic neurons):

Overview

The HTM layer receives an input SDR \(x^{(t)}\), and results in a set of neurons either being in predictive state or not. If a neuron is in predictive state, then it is predicting that the minicolumn that it is in will activate when given the next input \(x^{(t+1)}\). The specifics of which neurons are firing within the minicolumn encodes the context based on previous inputs \(\{x^{(1)}, \ldots, x^{(t)}\}\).

The spatial pooling algorithm is applied first, and then the temporal memory algorithm, in order to obtain these predictions. The details are described later, and here, brief overviews are given as to what these algorithms do.

Spatial pooling

The spatial pooling takes an input SDR \(x^{(t)}\) and computes which minicolumns are active, based on how many receptive proximal synapses are active (the input index from \(x^{(t)}\) that they connect to is ‘on’, equal to 1). If this number of active synapses meets an overlap threshold \(o_{\text{thresh}}\), then the minicolumn is considered as active.

Then, global inhibition is performed, deactivating minicolumns that are not in the top \(M\) overlap scores across minicolumns. Then, one can pick \(M\) such that, say, at most \(2\%\) of minicolumns are active, giving sparsity.

  • Local inhibition can also be used, where minicolumns inhibit other minicolumns in their neighbourhood. See BaMI.

This process returns a set of active minicolumns \(A^{(t)} \subset \{1, \ldots, C\}\).

For each active minicolumn in \(A^{(t)}\), a Hebbian learning process is performed on the permanences of the proximal synapses, to obtain new permanences \(\{p_{c, s}^{(t)}\}_{c, s}\).

Temporal memory

The end of timestep \(t-1\) placed certain neurons in predictive state, meaning the minicolumns that contain such neurons are predicted to activate. We compare these predicted minicolumn activations with \(A^{(t)}\), and when there is an overlap, the synapses on distal segments that were correctly active are strengthened, and those that were not active are weakened (Hebbian process).

If a minicolumn in \(A^{(t)}\) was not predicted, then the minicolumn bursts, with all neurons in this minicolumn becoming active, and a Hebbian process is applied identically to above for a chosen learning segment (see details below).

For non-active minicolumns not in \(A^{(t)}\), synapses that matched (see details below) are punished.

Then, neurons are placed into predictive state for use in the next timestep \(t+1\).

(quite a few details have been skimmed over, see below for a full treatment)

Spatial Pooler

(for implementation see here)

The set of active minicolumns \(A^{(t)}\) is computed by the following procedure:

  1. Initialize \(A^{(t)} := \{\}\).
  2. For each minicolumn \(c \in \{1, \ldots, C\}\), define the overlap \(o^{(t)}(c) := \sum_{s \in R^{(t-1)}(c)} x^{(t)}_s\), describing the number of overlapping bits between the receptive field for minicolumn \(c\) and the input \(x^{(t)}\).
    • If boosting is active, then set \(o^{(t)}(c) := b_c o^{(t)}(c)\) where \(b_c\) is the boosting parameter for minicolumn \(c\). Boosting parameters are then modified during the learning stage if boosting is active.
  3. Let \(\hat{o}\) be the \(M\)th highest value from the set \(\{o^{(t)}(c)\}_{c=1}^{C}\).
  4. Now, for each minicolumn \(c \in \{1, \ldots, C\}\), if \(o^{(t)}(c) \geq \max(o_{\text{thresh}}, \hat{o})\), then add \(c\) to the active columns, i.e. \(A^{(t)} := A^{(t)} \cup \{c\}\).
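The four steps above (without boosting) can be sketched as follows, assuming numpy arrays and global inhibition; the parameter names mirror the text but the function itself is my own sketch:

```python
import numpy as np

def spatial_pool(x, synapses, perm, p_thresh=0.5, o_thresh=2, M=2):
    """Compute the set of active minicolumns A for input SDR x.

    synapses[c]: array of input indices S(c) for minicolumn c
    perm[c]:     permanences p_{c,s}, aligned with synapses[c]
    """
    C = len(synapses)
    overlap = np.zeros(C)
    for c in range(C):
        receptive = synapses[c][perm[c] >= p_thresh]  # R(c)
        overlap[c] = x[receptive].sum()               # o(c)
    # global inhibition: keep overlaps that are both in the top M
    # and at least o_thresh
    o_hat = np.sort(overlap)[-M]
    return {c for c in range(C) if overlap[c] >= max(o_thresh, o_hat)}
```

Choosing \(M \approx 0.02\,C\) yields the usual \(2\%\) minicolumn sparsity.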

We then perform a Hebbian learning process on the active columns only. Specifically,

  1. For each active column \(c \in A^{(t)}\), and for each proximal synapse \(s \in S(c)\), we compute \(p^{(t)}_{c, s}\). Specifically, if \(x^{(t)}_s = 1\), then let \(p^{(t)}_{c, s} := \min(1, p^{(t-1)}_{c, s} + \Delta_+)\). Otherwise, if \(x^{(t)}_s = 0\), then let \(p^{(t)}_{c, s} := \max(0, p^{(t-1)}_{c, s} - \Delta_-)\).

If boosting is active, we also:

  1. Update the boosting parameters \(\{b_c\}_{c=1}^{C}\). This involves TODO

The goal of boosting is to maintain homeostasis of neuronal activity, encouraging a variety of minicolumns to activate across inputs, rather than a small set of minicolumns dominating activity while most minicolumns remain inactive for long periods.

Temporal Memory

(for implementation see here)

With the set of active columns \(A^{(t)}\) obtained by spatial pooling, we can now perform the temporal memory algorithm.

Information from the previous timestep \(t-1\):

  • activated neurons \(\mathcal{N}^{(t-1)}_A\): neurons that activated, either because they were correctly in predictive state, or because their minicolumn became active with no neuron predicting it, causing the minicolumn to burst.
  • winner neurons \(\mathcal{N}^{(t-1)}_W\): neurons that were either correctly predictive, or, in the case of a minicolumn bursting, either the neuron with the maximally matching segment, or, if there are no matching segments in the minicolumn, the neuron with the fewest segments.
  • activated distal segments \(\mathcal{D}^{(t-1)}_A\): set of all segments (across all minicolumns) that activated past threshold \(A_{\text{thresh}}\), only considering activations from receptive synapses \(\mathcal{R}^{(t-1)}(d)\).
  • matching distal segments \(\mathcal{D}^{(t-1)}_M\): set of all segments (across all minicolumns) that activated past threshold \(M_{\text{thresh}}\), considering activations from all synapses \(\mathcal{S}^{(t-1)}(d)\) (both receptive and non-receptive).
  • \(n^{(t-1)}(d)\): the number of active synapses on segment \(d\) out of \(\mathcal{S}^{(t-1)}(d)\) (including non-receptive synapses).

The temporal memory algorithm then finds the information for the next timestep, i.e. \(\mathcal{N}^{(t)}_A\), \(\mathcal{N}^{(t)}_W\), \(\mathcal{D}^{(t)}_A\), \(\mathcal{D}^{(t)}_M\), all initially set to \(\{\}\), while learning from the input \(x^{(t)}\). Whether a neuron is in predictive state is determined by whether one of its segments is active (in \(\mathcal{D}^{(t-1)}_A\)).

This process can be split into three processes, involving: active columns, inactive columns and updating of segments. These processes are explained below.

Algorithm

For each active column \(c \in A^{(t)}\)

If \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A| > 0\):

(if a segment of minicolumn \(c\) is active, meaning the minicolumn was correctly predicted to be active)

In this case, we activate, and set to be a winner, any neurons that have any such active segments:

\[\mathcal{N}^{(t)}_A := \mathcal{N}^{(t)}_A \cup \{\mathcal{N}(d) : d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A\}\] \[\mathcal{N}^{(t)}_W := \mathcal{N}^{(t)}_W \cup \{\mathcal{N}(d) : d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A\}\]

This corresponds to a neuron correctly being in predictive state at step \(t-1\), and hence becoming active at step \(t\).

Now, if in learning mode, a Hebbian process is applied to active distal segments. Additionally, we strengthen the pattern-matching capability of the active segments by growing up to \(S_{\text{sample}}\) synapses to the previous winning neurons from \(\mathcal{N}^{(t-1)}_W\) (whose activations can be utilized to make next-step predictions).

  • For each active segment \(d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A\), then for each receptive synapse \(s \in \mathcal{R}^{(t-1)}(d)\):
  • if \(\mathcal{P}(s) \in \mathcal{N}^{(t-1)}_A\) (if pre-synaptic neuron is active, meaning synapse \(s\) is active), then strengthen permanence:
\[q^{(t)}_{d, s} := \min(1, q^{(t-1)}_{d, s} + \delta_+)\]
  • otherwise (synapse did not activate), weaken permanence:
\[q^{(t)}_{d, s} := \max(0, q^{(t-1)}_{d, s} - \delta_-)\]

and synapse growth:

  • For each active segment \(d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A\), create a new set of synapses \(\mathcal{S}^{(t)}(d) \supset \mathcal{S}^{(t-1)}(d)\) by extending \(\mathcal{S}^{(t-1)}(d)\) through growing up to \(S_{\text{sample}} - n^{(t-1)}(d)\) new synapses to previous winning neurons \(\mathcal{N}^{(t-1)}_W\) (each chosen randomly).
  • Permanences of new synapses are initialized to \(q_{\text{init}}\).

In this case there are no new segments formed, therefore \(\mathcal{D}^{(t)}(c) := \mathcal{D}^{(t-1)}(c)\).

If, instead, \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A| = 0\):

(no segments of minicolumn \(c\) are active, meaning no neuron was in predictive state when one should have been)

In this case the column bursts. This means that all neurons within this minicolumn are activated (as opposed to only activating neurons with activated segments, as above):

\[\mathcal{N}^{(t)}_A := \mathcal{N}^{(t)}_A \cup \{n \in \{1, \ldots, NC\} : n \text{ is in minicolumn } c\}\]

After bursting, we wish to correct this for the future, such that the column will not burst again and instead will predict correctly. To do this, we pick the maximally matching segment, or, if there are no matching segments, a new segment grown on the neuron with the fewest segments, and apply a Hebbian learning rule on this learning segment.

If \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M| > 0\) (if a segment in minicolumn \(c\) is matching), then:

  • Choose the learning segment:
\[\ell := \text{argmax}_{d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M} \;n^{(t-1)}(d)\]

which is the maximally matching segment in minicolumn \(c\).

  • Choose the winner neuron as the origin neuron of the matching segment:
\[w := \mathcal{N}(\ell)\]

Otherwise, if \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M| = 0\) (no matching segments in minicolumn \(c\)), then:

  • Choose the winner neuron to be the neuron in minicolumn \(c\) with the fewest segments,
\[w := \text{argmin}_{n \,\in\, \text{minicolumn } c} \;|\{d \in \mathcal{D}^{(t-1)}(c) : \mathcal{N}(d) = n\}|\]

breaking any ties randomly.

  • If in learning mode, create a new segment \(d_{\text{new}}\) originating from the winner neuron \(w\), and set this new segment to be the learning segment:
\[\ell := d_{\text{new}}\]

This last step gives updated segments \(\mathcal{D}^{(t)}(c) := \mathcal{D}^{(t-1)}(c) \cup \{d_{\text{new}}\}\).

Now add \(w\) to the winner neurons:

\[\mathcal{N}^{(t)}_W := \mathcal{N}^{(t)}_W \cup \{w\}\]

If in learning mode, then a Hebbian learning rule and synapse growth are applied to the learning segment, identically to the other case (though here, just on the learning segment, not across many segments).

For each inactive column \(c \notin A^{(t)}\)

If not in learning mode, nothing is done in this case. If in learning mode:

The idea is to punish segments that did activate, as they made an incorrect prediction, since minicolumn \(c\) did not activate. So we perform only the negative update of the Hebbian learning rule.

If there are matching segments, meaning \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M| > 0\), then for each matching segment \(d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M\):

  • Iterate over each synapse \(s \in \mathcal{S}^{(t-1)}(d)\), and
  • if \(\mathcal{P}(s) \in \mathcal{N}^{(t-1)}_A\) (synapse activated), then weaken this synapse:
\[q^{(t)}_{d, s} := q^{(t-1)}_{d, s} - \delta_-\]

No new segments are grown, so \(\mathcal{D}^{(t)}(c) := \mathcal{D}^{(t-1)}(c)\), and no new synapses are grown, so \(\mathcal{S}^{(t)}(d) := \mathcal{S}^{(t-1)}(d) \; \forall \; d \in \mathcal{D}^{(t)}(c)\).

Updating segments

The above has determined new neuron information \(\mathcal{N}^{(t)}_A\) and \(\mathcal{N}^{(t)}_W\) alongside, for each \(d \in \mathcal{D}^{(t)}(c)\), a set of new synapses \(\mathcal{S}^{(t)}(d)\) and receptive synapses \(\mathcal{R}^{(t)}(d)\) based on new permanences \(\{q^{(t)}_{d, s}\}_{s \in \mathcal{S}^{(t)}(d)}\).

We can now determine which segments are active and matching: \(\mathcal{D}^{(t)}_A\) and \(\mathcal{D}^{(t)}_M\). This involves a simple check over segments.

For each minicolumn \(c\) and each segment \(d \in \mathcal{D}^{(t)}(c)\):

  • Set \(n^{(t)}(d) := \vert\{\mathcal{P}(s) : s \in \mathcal{S}^{(t)}(d)\} \cap \mathcal{N}^{(t)}_A\vert\)
  • If \(n^{(t)}(d) \geq M_{\text{thresh}}\) (sufficiently many active synapses, including non-receptive synapses), then update
\[\mathcal{D}^{(t)}_M := \mathcal{D}^{(t)}_M \cup \{d\}\]
  • If \(\vert\{\mathcal{P}(s) : s \in \mathcal{R}^{(t)}(d)\} \cap \mathcal{N}^{(t)}_A\vert \geq A_{\text{thresh}}\) (sufficiently many active receptive synapses), then update
\[\mathcal{D}^{(t)}_A := \mathcal{D}^{(t)}_A \cup \{d\}\]

with \(M_{\text{thresh}}, A_{\text{thresh}} \in \mathbb{N}\) thresholds for matching and activating a segment respectively.
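This check can be sketched as follows, with each segment represented by its pre-synaptic neurons and permanences (a sketch with names of my own):

```python
def classify_segments(segments, active_neurons, q_thresh, A_thresh, M_thresh):
    """Return (active, matching) segment index sets.

    segments: list of (presyn, perm) pairs — the pre-synaptic neuron and
    permanence for each synapse on a distal segment."""
    active, matching = set(), set()
    for d, (presyn, perm) in enumerate(segments):
        # count active synapses over ALL synapses (matching check) ...
        n_all = sum(1 for p in presyn if p in active_neurons)
        # ... and over receptive synapses only (activation check)
        n_rec = sum(1 for p, q in zip(presyn, perm)
                    if q >= q_thresh and p in active_neurons)
        if n_all >= M_thresh:
            matching.add(d)
        if n_rec >= A_thresh:
            active.add(d)
    return active, matching
```

Note that \(A_{\text{thresh}} \geq M_{\text{thresh}}\) is typical, since activation is the stricter condition.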

]]>
Brief notes on spiking neuron models2023-07-24T09:00:00+00:002023-07-24T09:00:00+00:00https://r-gould.github.io/2023/07/24/spiking-neuron-modelsSpiking Neuron Model

A primary asset of machine learning is the artificial neuron; a very idealized, non-spiking model of a neuron in the brain. What does a more realistic model of neurons and their spiking behaviour look like?

The spiking network is modelled as a directed graph \(G = (V, E)\), where nodes \(V\) represent neurons and edges \(E\) represent synapses between neurons. For \(n\) neurons, \(V := \{1, \ldots, n\}\).

  • Define the input function \(\mathcal{I}: V \to \mathcal{P}(V)\) by \(\mathcal{I}(i) := \{j : (j, i) \in E\}\), mapping a neuron to its set of incoming neurons, and analogously the output function \(\mathcal{O}: V \to \mathcal{P}(V)\) mapping a neuron to its set of outgoing neurons.

The most common spiking neural models are Integrate-and-Fire (IF) and Leaky-Integrate-and-Fire (LIF).

For a neuron \(i \in V\), let \(u_i = u_i(t)\) model the state variable, which represents the neural membrane potential of neuron \(i\). Then under the LIF model it evolves as

\[C \frac{du_i}{dt}(t) = -\frac{1}{R}(u_i(t) - u_{\text{rest}}) + (I_0(t) + \sum_{j \in \mathcal{I}(i)} w_{ji} I_j(t))\]

where \(\{w_{ji}\}_{j \in \mathcal{I}(i)}\) are learnable synaptic weights, \(I_0 = I_0(t)\) is the external current driving the neural state, \(I_j = I_j(t)\) is the input current from neuron \(j \in \mathcal{I}(i)\), \(u_{\text{rest}}\) is the rest potential, \(C\) is the membrane capacitance, and \(R\) is the input resistance.

A brief derivation of this model from electromagnetic principles can be found in the appendix.

  • The IF model can be obtained by letting \(R \to \infty\).

Whenever the membrane potential \(u_i\) for neuron \(i \in V\) reaches the firing threshold \(\upsilon\), which occurs at say times \(T_i \subset \mathbb{R}\), the neuron spikes, which is modelled by its corresponding spike train \(S_i(t) := \sum_{t' \in T_i} \delta(t - t')\). Immediately after the spike, the neural state is reset to the rest potential \(u_{\text{rest}} < \upsilon\) and held at that level for the time interval representing the neural absolute refractory period.

How are the incoming currents \(\{I_j\}_{j \in \mathcal{I}(i)}\) modelled? Each incoming neuron has their own spike trains \(\{S_j\}_{j \in \mathcal{I}(i)}\) with corresponding spike times \(\{T_j \subset \mathbb{R}\}_{j \in \mathcal{I}(i)}\). Then for each \(j \in \mathcal{I}(i)\),

\[I_j(t) := \int_{0}^{\infty} S_j(t-s) \exp(-s/\tau) ds = \sum_{t' \in T_j, t' \leq t} \exp(-(t-t')/\tau)\]

where \(\tau\) is the synaptic time constant. This can be thought of as accumulating all spikes before the time \(t\), weighing more recent spikes strongly compared to older spikes.

Importantly, the received current \(I_j\) only depends on the spikes of this \(j\)th presynaptic neuron, and does not depend on other properties of the potential. The postsynaptic neuron only knows about spikes of its input neurons, and not anything more specific about their potential changes. Incoming potentials below the firing threshold are ignored.

We could update discretely using \(u_i(t + \Delta t) \approx u_i(t) + \Delta t \frac{du_i}{dt}(t)\), and after updating in this way we would check for neurons that have increased over the threshold, set them to the rest potential and put them in refractory mode until the refractory period is over.
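This discrete update, together with the threshold/reset and a refractory hold, can be sketched as follows (the parameter values are illustrative, not prescribed by the model):

```python
import numpy as np

def lif_step(u, I, dt, C=1.0, R=10.0, u_rest=0.0, v_thresh=1.0,
             refrac=None, t_refrac=5e-3):
    """One forward-Euler step of C du/dt = -(u - u_rest)/R + I.

    Returns (new potentials, boolean spike mask, refractory timers)."""
    if refrac is None:
        refrac = np.zeros_like(u)
    du = (-(u - u_rest) / R + I) / C
    # hold refractory neurons at rest, integrate the others
    u_new = np.where(refrac > 0, u_rest, u + dt * du)
    spikes = u_new >= v_thresh
    # reset spiking neurons and start their refractory period
    u_new = np.where(spikes, u_rest, u_new)
    refrac = np.where(spikes, t_refrac, np.maximum(refrac - dt, 0.0))
    return u_new, spikes, refrac
```

A constant supra-threshold input current eventually drives a spike, after which the neuron is reset and held at rest.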

Encoding data

Say we have data \(x \in \mathbb{R}^n\) with \(x_i \in [0, 1]\). How do we convert this information into binary spikes across time analogously to how our sensory receptors do?

One method is rate encoding, where each \(x_i\) is encoded as a sequence of spikes whose frequency is a function of the intensity \(x_i\), so that the frequency encodes the data.

A specific example of rate encoding is Poisson rate encoding:

  • Let \(\lambda_i \propto \frac{1}{x_i + \epsilon}\) for some appropriate proportionality constant (which will define the range of spike frequencies).
  • Sample \(\{\delta_{ij} \sim \text{Po}(\lambda_i)\}_j\), representing time intervals between spikes.
  • Then spike times for \(x_i\) are \(T_i = \{\sum_{j=1}^k \delta_{ij} : k = 1, 2, \ldots\}\).
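The steps above can be sketched as follows; the proportionality constant `scale` and the number of spikes per input are arbitrary choices of mine:

```python
import numpy as np

def poisson_rate_encode(x, scale=10.0, n_spikes=20, eps=1e-3, seed=0):
    """Spike times for each intensity x_i in [0, 1]: inter-spike intervals
    delta_ij ~ Po(lambda_i) with lambda_i = scale / (x_i + eps), so larger
    intensities give shorter intervals, i.e. higher spike frequency."""
    rng = np.random.default_rng(seed)
    spike_times = []
    for xi in x:
        lam = scale / (xi + eps)
        deltas = rng.poisson(lam, size=n_spikes)  # intervals between spikes
        spike_times.append(np.cumsum(deltas))     # T_i as cumulative sums
    return spike_times
```

A high-intensity input thus fires its `n_spikes` spikes in a much shorter window than a low-intensity one.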

There are other methods of encoding, such as temporal coding schemes that encode information in precise timing of spikes as opposed to frequencies.

Within the network there would be a set of neurons that have no incoming synapses, and these would act as input neurons, where we would provide these neurons directly with the raw data encoded as spikes. Then, other neurons would receive the spike information of this raw data via incoming currents.

Learning by STDP

STDP is based on the observed phenomenon of a synapse strengthening if the pre-synaptic neuron spikes shortly before the post-synaptic neuron does, and weakening if the post-synaptic neuron fires before the pre-synaptic neuron. The smaller the time difference between the two spikes, the stronger the synaptic update.

A common STDP rule says that when a neuron \(i\) spikes at say time \(t\), we should:

  • For each \(j \in \mathcal{I}(i)\), let \(t_{post} = t\) and \(t_{pre} = T_j[-1]\), i.e. the latest spiking time of \(j\) (so \(i\) is the post-synaptic neuron, \(j\) is the pre). Then, the change in synaptic weight \(w_{ji}\) is \(\delta w = A_{+} \exp(-(t_{post} - t_{pre})/\tau_+)\). This is an example of the strengthening of synapses.
  • For each \(j \in \mathcal{O}(i)\), let \(t_{pre} = t\) and \(t_{post} = T_j[-1]\) (so \(i\) is pre, \(j\) is post), and \(w_{ij}\) changes by \(\delta w = -A_{-} \exp(-(t_{pre} - t_{post})/\tau_{-})\). This is an example of the weakening of synapses.

For better stability and noise reduction, the updates can be accumulated until, say, \(w_{ij}\) has 5 stored updates, at which point they are summed and applied. Updates can also be applied in a fixed time window, such as every 5ms, as opposed to immediately.
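The pairwise rule can be sketched as follows; \(A_{\pm}\) and \(\tau_{\pm}\) are free hyperparameters:

```python
import numpy as np

def stdp_update(w, t_pre, t_post, A_plus=0.01, A_minus=0.012,
                tau_plus=20.0, tau_minus=20.0):
    """Pairwise STDP change to weight w from the latest pre/post spike times.

    Pre-before-post (t_pre <= t_post) strengthens the synapse;
    post-before-pre weakens it. Closer spikes give larger updates."""
    if t_pre <= t_post:
        dw = A_plus * np.exp(-(t_post - t_pre) / tau_plus)
    else:
        dw = -A_minus * np.exp(-(t_pre - t_post) / tau_minus)
    return w + dw
```

Taking \(A_- \gtrsim A_+\) (slightly stronger depression) is a common choice to keep total weight bounded.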

Appendix

Derivation of LIF model

The key idea is to model the intracellular and extracellular regions of the neuron as conductors, with the neuron membrane acting as the space between them.

  • A conductor is an object in which charges can move freely, which implies that the electric field \(\mathbf{E} = \mathbf{0}\) (as if this is not the case, the free charges would move in response to the electric force until the force was zero).
  • Hence the electrostatic potentials \(\Phi\) are constant in each region, since \(\mathbf{E} =: - \mathbf{\nabla} \Phi\).
  • So the potential difference \(V\) between the two regions is well-defined, and matches the conventional definition of the potential of a neuron.

In addition, the charges of the regions are opposite in sign, and approximately equal in charge, say \(\pm Q\). Hence together they form a capacitor.

The capacitance of two conductors carrying charges \(\pm Q\) with a voltage \(V\) between them is \(C := \frac{Q}{V}\). Differentiating gives

\[C \frac{dV}{dt} = \frac{dQ}{dt}\]

For a volume \(\mathcal{V}\), the charge is defined as

\[Q := \int_{\mathcal{V}} \rho \; d\mathcal{V}\]

where \(\rho = \rho(\mathbf{x}, t)\) is the electric charge density.

The current describes the flow of charge. Specifically, the total current flowing into the volume is defined by

\[I := -\int_{\partial \mathcal{V}} \mathbf{J} \cdot d\mathbf{S}\]

where \(\mathbf{J} = \mathbf{J}(\mathbf{x}, t)\) is the flux of electric charge per unit area.

  • \(\rho\) and \(\mathbf{J}\) are sources, and are irreducible. The sources define a problem in electromagnetism. They can be thought of as hyperparameters of the system that we must provide.

    The Maxwell equations then describe how the fields \(\mathbf{E}\) and \(\mathbf{B}\) evolve due to these sources. The Lorentz force law describes how the charges, described by the sources, move due to the fields.

Charge conservation means that the charge in a volume can only change due to charge flowing into/out of the volume. This means that

\[\frac{dQ}{dt} = I\]

hence the capacitance equation becomes

\[C \frac{dV}{dt} = I\]

where \(I\) is the current flowing into the neuron.

The above represents the IF model, but in reality, there is a leak of ions through the neuron membrane, causing a current to flow out of the neuron. Adding in the term corresponding to this effect leads to the LIF model:

\[C \frac{dV}{dt}(t) = -\frac{1}{R}(V(t) - V_{\text{rest}}) + I(t)\]

and we can further expand \(I(t)\) into the currents incoming via synapses to get the desired expression.

]]>
Credit assignment2022-12-30T09:00:00+00:002022-12-30T09:00:00+00:00https://r-gould.github.io/2022/12/30/credit-assignmentSetup

Consider an agent embedded in an environment, with the agent described by an internal state, and the environment by an external state, undergoing the cycle:

  • The environment is in (external) state \(s\).
  • The system receives an observation \(o \sim p(\cdot\mid s)\) from the environment.
  • Then the system updates its (internal) state (e.g. beliefs) \(h \sim p(\cdot\mid \tilde{h}, o)\), with \(\tilde{h}\) the previous internal state.
  • The environment receives an action \(a \sim p(\cdot\mid h)\) from the system.
  • Then the environment updates its (external) state \(s' \sim p(\cdot\mid s, a)\).
  • And repeat.

generating the random sequence

\[(S_1, O_1, H_1, A_1, S_{2}, \ldots)\]

which can be described by the following diagram, with the environment and agent only interacting with each other via observations and actions

For simplicity assume the action and observation distributions \(p(a\mid h)\) and \(p(o\mid s)\) are deterministic, with \(a_{\tau} = a_{\tau}(h_{\tau})\) and \(o_{\tau} = o_{\tau}(s_{\tau})\) (though with \(p(s'\mid s, a)\) and \(p(h'\mid h, o)\) stochastic).

Denote the environment state dynamics by \(\mu = \mu(s_{\tau}\mid s_{\tau-1}, a_{\tau-1}) \equiv p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})\), and the agent state dynamics (e.g. how its parameters change) by \(q = q(h_{\tau}\mid h_{\tau-1}, o_{\tau}) \equiv p(h_{\tau}\mid h_{\tau-1}, o_{\tau})\) for clarity.
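The cycle above can be sketched as a simple simulation loop; all of the distributions below are toy placeholders chosen purely for illustration:

```python
import random

def interaction_loop(T, seed=0):
    """Toy instance of the agent-environment cycle with integer states.
    The specific observation, update, action, and dynamics rules here
    are placeholders, not part of the general framework."""
    rng = random.Random(seed)
    s, h = 0, 0                          # external and internal state
    trace = []
    for _ in range(T):
        o = s % 3                        # deterministic observation o(s)
        if rng.random() < 0.9:           # stochastic update q(h' | h, o)
            h = (h + o) % 5
        a = h % 2                        # deterministic action a(h)
        s = s + a + rng.choice([0, 1])   # stochastic dynamics mu(s' | s, a)
        trace.append((s, o, h, a))
    return trace

trace = interaction_loop(10)
```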

At timestep \(t\) in environment \(\mu\), after experiencing the past \(\omega_t = (o_{0:t}, h_{0:t-1})\), the agent \(q\) has a preference for the future described by

\[P_t[q] := \mathbb{E}_{q}^{\mu}[\phi^t(O_{t+1}, O_{t+2}, \ldots)\mid\Omega_t=\omega_t]\]

for some \(\phi^{t}\).

  • e.g. in the context of reinforcement learning we have observations \(o_{\tau} = (x_{\tau}, r_{\tau})\), with \(r_{\tau} \in \mathbb{R}\) a reward signal, with a preference that favours reward \(\phi^t(O_{t+1}, O_{t+2}, \ldots) = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots\) for a discount factor \(\gamma\).
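This discounted preference can be evaluated on a finite sequence of realized rewards by a backward recursion (a minimal sketch):

```python
def discounted_return(rewards, gamma):
    """phi^t for the discounted-reward preference: R_{t+1} + gamma R_{t+2} + ...,
    computed over a finite realized reward sequence by backward recursion."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# discounted_return([1.0, 1.0, 1.0], 0.5) = 1 + 0.5 + 0.25 = 1.75
```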

We will assume that the agent’s internal state decomposes as \(h_{\tau} = (\theta_{\tau}, \pi_{\tau})\), with \(\theta_{\tau}\) corresponding to a parameter that updates (e.g. by gradient descent) under observation.

  • e.g. \(\theta_{\tau}\) could describe synaptic weights of neurons, and \(\pi_{\tau}\) the activations/voltages across neuron synapses (plus some extra information), at time \(\tau\).

In this case we have \(q(h_{\tau}\mid h_{\tau-1}, o_{\tau}) = q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) q(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau})\) where one can interpret \(q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau})\) as a learning rule for the agent; how it changes its parameters under observation in order to achieve its preference. One could interpret \(q(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau})\) as the parameterized architecture used by the agent.

  • In the context of a transformer, one could view \(\pi_{\tau}\) as the internal activations (plus some extra information), depending on the current parameters \(\theta_{\tau}\), previous activation information \(\pi_{\tau-1}\) (necessary to perform the attention mechanism) and last token \(o_{\tau}\).

Model-free credit assignment

In the parameterized view, with \(h_{\tau} = (\theta_{\tau}, \pi_{\tau})\), what is a good choice for the learning rule \(q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau})\)?

Gradient-based learning

A common choice for the learning rule is gradient descent, with \(q = q_{\text{grad}}\), where

\[q_{\text{grad}}(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) := \delta\left(\theta_{\tau} - \left(\theta_{\tau-1} - \alpha \nabla_{\theta} P_t[q_{\theta}^{\tau}]\mid _{\theta = \theta_{\tau-1}}\right)\right)\]

with \(\alpha \in \mathbb{R}_{+}\) the learning rate, where \(q_{\theta}^{\tau}\) is defined such that the learning rule is turned off from timestep \(\tau\), with the parameter fixed to \(\theta\). Specifically, \(q_{\theta}^{t}\) is defined by

\[q_{\theta}^{t}(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) := \begin{cases} q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) & \tau < t, \\ \delta(\theta_{\tau}-\theta) & \tau \geq t \end{cases}\]

It turns out that the gradient term used in this update rule, \(\nabla_{\theta} P_{t}[q_{\theta}^{\tau}]\), has a nice closed-form expression. We can write

\[\begin{align*} \mathbb{E}_{q_{\theta}^{t}}^{\mu}[\; \cdot\mid\Omega_t = \omega_t] &= \int \cdot \; p_{\theta}^{t}(s_{0:\infty}, o_{t+1:\infty}, h_{t:\infty}, a_{0:\infty}\mid o_{0:t}, h_{0:t-1}) \, ds_{0:\infty} \, do_{t+1:\infty} \, dh_{t:\infty} \, da_{0:\infty}\\ &= \int \cdot \; \left[\prod_{\tau=t}^{\infty} q_{\theta}^{t}(h_{\tau}\mid h_{\tau-1}, o_{\tau}(s_{\tau})) dh_{\tau}\right] [\cdots]\\ &= \int \cdot \; \left[\prod_{\tau=t}^{\infty} q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) d\pi_{\tau}\right] [\cdots] \end{align*}\]

where \([\cdots]\) corresponds to factors independent of \(\theta\), and defining \(q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) := q(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau})\mid _{\theta_{\tau} = \theta}\).

Using the fact that

\[\nabla_{\theta} q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) = q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\]

one can then show, for inputs independent of \(\theta\), that

\[\nabla_{\theta} \mathbb{E}_{q_{\theta}^{t}}^{\mu}[\; \cdot\mid\Omega_t = \omega_t] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\; \cdot \; \sum_{\tau=t}^{\infty} \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]

It then follows that the gradient of the preference is

\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \phi^t(O_{t+1}, O_{t+2}, \ldots) \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]

reminiscent of the policy gradient theorem from RL.

We could use a Monte Carlo estimator for this gradient after collecting \(N\) trajectories \(\{(o_{\tau}^{(n)}, h_{\tau}^{(n)})_{\tau}\}_{n=1}^{N}\), i.e.

\[\hat{G}_{MC}^{t}(\theta) := \frac{1}{N}\sum_{n=1}^{N} \sum_{\tau=t}^{\infty} \phi^t(o^{(n)}_{t+1}, o^{(n)}_{t+2}, \ldots) \nabla_{\theta} \log q(\pi^{(n)}_{\tau}\mid \pi^{(n)}_{\tau-1}, o^{(n)}_{\tau}; \theta)\]
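As a sketch of this estimator, suppose the architecture \(q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\) is a stand-in Gaussian \(\mathcal{N}(\pi_{\tau}; \theta(\pi_{\tau-1} + o_{\tau}), 1)\) (an assumption made purely so the score \(\nabla_{\theta} \log q\) has a closed form), and each trajectory has been flattened into tuples carrying the realized preference \(\phi^t\) of its future:

```python
def grad_log_q(pi, pi_prev, o, theta):
    """Score of the stand-in Gaussian transition
    q(pi | pi_prev, o; theta) = N(pi; theta * (pi_prev + o), 1):
    d/dtheta log q = (pi - theta * m) * m, with m = pi_prev + o."""
    m = pi_prev + o
    return (pi - theta * m) * m

def mc_gradient(trajectories, theta):
    """Monte Carlo estimate of the preference gradient. Each trajectory
    is a list of (pi_prev, o, pi, ret) tuples, where ret is the realized
    phi^t of that trajectory's future (precomputed from its observations)."""
    g = 0.0
    for traj in trajectories:
        for pi_prev, o, pi, ret in traj:
            g += ret * grad_log_q(pi, pi_prev, o, theta)
    return g / len(trajectories)
```

In practice the infinite sum over \(\tau\) is truncated to the collected trajectory length, as here.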

However, we can lower the variance of this estimator by using a baseline.

Reducing variance with a baseline:

One can show that

\[\mathbb{E}_{q_{\theta}^{t}}^{\mu}[b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi) \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)] = 0\]

for any baseline \(b^{t}_{\tau} = b^{t}_{\tau}(\pi_{\tau-1}, o_{\tau}; \phi)\), which gives us another expression for the preference gradient:

\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[\phi^t(O_{t+1}, O_{t+2}, \ldots) - b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]

The natural choice for the baseline would be for it to estimate \(\phi^{t}\) in some respect; however, this would require predicting information about both the past and the future. We can restrict the baseline’s duty to just predicting the future if we assume that the preference takes a summable form:

\[\phi^{t}(O_{t+1}, O_{t+2}, \ldots) = \sum_{\tau=t+1}^{\infty} \phi_{\tau}^{t}(O_{\tau})\]

for some \(\{\phi_{\tau}^{t}\}_{\tau}\). This form is assumed for the preference unless stated otherwise. For \(\tau \geq t, \tau' \geq t+1\), it can be shown that

\[\begin{align*} &\mathbb{E}_{q_{\theta}^{t}}^{\mu}[\phi_{\tau'}^{t}(O_{\tau'}) \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_{t}=\omega_{t}]\\ &= \int p_{\theta}^{t}(\theta_{\tau}, \theta_{\tau-1}, \pi_{\tau}, \pi_{\tau-1}, o_{\tau}, o_{\tau'}\mid \omega_{t}) \phi_{\tau'}^{t}(o_{\tau'}) \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\\ &= \int p_{\theta}^{t}(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}, o_{\tau'}) p_{\theta}^{t}(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau}, o_{\tau'}) [\cdots] \phi_{\tau'}^{t}(o_{\tau'}) \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\\ &= \int \phi_{\tau'}^{t}(o_{\tau'}) q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) [\cdots] \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) \; \text{for} \; \tau' \leq \tau\\ &= 0 \; \text{for} \; \tau' \leq \tau \end{align*}\]

therefore the preference gradient can be written

\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[V^{t}_{\tau} - b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]

defining the value at timestep \(t\) from \(\tau \geq t\)

\[V_{\tau}^{t} := \sum_{\tau'=\tau+1}^{\infty} \phi_{\tau'}^{t}(O_{\tau'})\]

Then, to minimise the variance of the corresponding Monte Carlo estimator, we want \(b^{t}_{\tau}\) to estimate \(V^{t}_{\tau}\). Denoting \(b^{t}_{\tau} \equiv \hat{V}^{t}_{\tau}\) for clarity, we want

\[\hat{V}^{t}_{\tau}(\pi_{\tau-1}, o_{\tau}; \phi) \approx v_{\tau}^{t}\]

which can be enforced by optimizing \(\phi\) via supervised learning using collected trajectories. Such a baseline is called a value baseline.
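The supervised fit of the value baseline can be sketched as least-squares regression of realized values on features of \((\pi_{\tau-1}, o_{\tau})\); the linear parameterization and the synthetic data below are assumptions for illustration:

```python
import numpy as np

def fit_value_baseline(features, values):
    """Least-squares fit of a linear value baseline V_hat(x; phi) = x @ phi
    to realized future values collected from trajectories."""
    phi, *_ = np.linalg.lstsq(features, values, rcond=None)
    return phi

# Synthetic data: realized values depend linearly on the features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_phi = np.array([1.0, -2.0, 0.5])
v = X @ true_phi + 0.01 * rng.normal(size=200)
phi_hat = fit_value_baseline(X, v)
```

A neural network trained by gradient descent on the same squared error plays the same role in practice.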

Overall we have preference gradient

\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[V^{t}_{\tau} - \hat{V}^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]

Advantage-based estimators:

Ignoring the baseline for a moment, see that

\[\begin{align*} \nabla_{\theta} P_t[q_{\theta}^{t}] &= \sum_{\tau=t}^{\infty} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau} \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\\ &= \sum_{\tau=t}^{\infty} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau} \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\right]\\ &= \sum_{\tau=t}^{\infty} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta) \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau}\mid\Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\right]\\ &=: \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] \; \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\right] \end{align*}\]

defining the action-value at timestep \(t\) from \(\tau \geq t\)

\[\begin{align*} Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] &:= \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau}\mid\Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\\ &\equiv \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau}\mid O_{0:\tau} = (o_{0:t}, O_{t+1:\tau}), H_{0:\tau} = (h_{0:t-1}, H_{t:\tau})\right] \end{align*}\]

i.e. the expected value in the future from \(\tau \geq t\), conditioned on all past observed information from \(\tau\) and the hidden state at \(\tau\).

Then including baseline we have

\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] - b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\right]\]

Typical advantage based estimators choose a value baseline as seen in the previous section, with \(b^{t}_{\tau} = \hat{V}^{t}_{\tau}\), giving preference gradient

\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} A_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}; \phi] \; \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\right]\]

defining the advantage

\[A_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}; \phi] := Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] - \hat{V}^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\]

which is the value gain due to state \(\Pi_{\tau}\) (compared to average, computed by \(\hat{V}_{\tau}^{t}\)).

How can we estimate the advantage? Note that

\[V_{\tau}^{t} =\phi^{t}_{\tau+1}(O_{\tau+1}) + \cdots + \phi^{t}_{\tau+K}(O_{\tau+K}) + V_{\tau+K}^{t}\]

for any \(K = 1, 2, \ldots\), and using

\[\mathbb{E}_{q_{\theta}^{t}}^{\mu}[V^{t}_{\tau+K}] \approx \mathbb{E}_{q_{\theta}^{t}}^{\mu}[\hat{V}^{t}_{\tau+K}(\Pi_{\tau+K-1}, O_{\tau+K}; \phi)]\]

we can bootstrap \(Q_{\tau}^{t}\) by \(K\) steps

\[\begin{align*} Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] \approx \; &\mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\hat{V}^{t}_{\tau+K}(\Pi_{\tau+K-1}, O_{\tau+K}; \phi)\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\\ &+ \sum_{\tau'=\tau+1}^{\tau+K} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\phi_{\tau'}^{t}(O_{\tau'})\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right] \end{align*}\]

which gives advantage

\[\begin{align*} A_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}; \phi] \approx \; &- \hat{V}^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi) + \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\hat{V}^{t}_{\tau+K}(\Pi_{\tau+K-1}, O_{\tau+K}; \phi)\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\\ &+ \sum_{\tau'=\tau+1}^{\tau+K} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\phi_{\tau'}^{t}(O_{\tau'})\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right] \end{align*}\]

Then after observing trajectories \(\{(o_{\tau}^{(n)}, h_{\tau}^{(n)})_{\tau}\}_{n=1}^{N}\), we have the Monte Carlo estimator

\[\begin{align*} \hat{G}^{t}_{MC}(\theta) &:= \frac{1}{N}\sum_{n=1}^{N} \sum_{\tau=t}^{\infty} a_{\tau}^{t}(o_{\tau:\tau+K}^{(n)}, \pi_{\tau-1}^{(n)}, \pi_{\tau+K}^{(n)}; \phi) \nabla_{\theta} \log q(\pi^{(n)}_{\tau}\mid \pi^{(n)}_{\tau-1}, o^{(n)}_{\tau}; \theta) \end{align*}\]

with observed advantages

\[a_{\tau}^{t}(o_{\tau:\tau+K}, \pi_{\tau-1}, \pi_{\tau+K}; \phi) := -\hat{V}_{\tau}^{t}(\pi_{\tau-1}, o_{\tau}; \phi) + \hat{V}_{\tau+K}^{t}(\pi_{\tau+K-1}, o_{\tau+K}; \phi) + \sum_{\tau'=\tau+1}^{\tau+K} \phi_{\tau'}^{t}(o_{\tau'})\]

which is equivalent to gradient descent on the loss function \(L^t = L^t(\theta)\) defined as

\[L^t(\theta) := \frac{1}{N}\sum_{n=1}^{N} \sum_{\tau=t}^{\infty} a_{\tau}^{t}(o_{\tau:\tau+K}^{(n)}, \pi_{\tau-1}^{(n)}, \pi_{\tau+K}^{(n)}; \phi) \log q(\pi^{(n)}_{\tau}\mid \pi^{(n)}_{\tau-1}, o^{(n)}_{\tau}; \theta)\]
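The observed advantage and the loss above can be sketched directly; the container layout below (per-step \(\phi\) terms and baseline values indexed by timestep) is an assumption for illustration:

```python
import numpy as np

def observed_advantage(phis, v_hat, tau, K):
    """K-step bootstrapped observed advantage a^t_tau from the expression
    above: -V_hat_tau + V_hat_{tau+K} plus the realized phi terms over
    (tau, tau+K]. `phis[j]` stands for phi^t_j(o_j) and `v_hat[j]` for the
    baseline evaluated at (pi_{j-1}, o_j)."""
    return -v_hat[tau] + v_hat[tau + K] + sum(phis[tau + 1 : tau + K + 1])

def advantage_loss(advantages, log_qs):
    """The loss L^t(theta) for one trajectory: the advantage-weighted sum
    of log-probabilities log q(pi_tau | pi_{tau-1}, o_tau; theta)."""
    return float(np.dot(advantages, log_qs))
```

Gradient descent on `advantage_loss` (with the advantages treated as constants) reproduces the Monte Carlo gradient estimator.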

Advantage-based methods work very well in practice in reinforcement learning (PPO is an example of such a method).

Model-based credit assignment

Tree search methods

We now discuss the approach of an agent using tree search methods, guided by learned predictive models of the environment, in order to choose actions effectively. For simplicity assume a discrete finite action space of \(A\) actions.

At time \(t\) the agent has observed \(o_{0:t}\), say with parameter \(\theta\) parameterizing the predictive distributions

\[h_{\theta} = h_{\theta}(o_{0:t}) =: z^t_{0} \in \mathbb{R}^{d}\] \[f_{\theta} = f_{\theta}(z^t_{k}) =: (\mathbf{p}^t_{k}, v^{t:t+k}_{t+k})\] \[g_{\theta} = g_{\theta}(z^t_{k-1}, i) = (\psi^{t:t+k-1}_{t+k}, z^{t}_{k, i})\]

with

\[\psi^{\tau}_{t+k} \approx \mathbb{E}_{q}^{\mu}[\phi^{\tau}_{t+k}(O_{t+k})\mid\Omega_{t} = \omega_{t}]\] \[v^{\tau}_{t+k} \approx \mathbb{E}_{q}^{\mu}[V^{\tau}_{t+k}\mid\Omega_{t} = \omega_t]\]

enforced via supervised learning (discussed more below).

The tree-search agent \(q = q_{\text{Search}}\) produces its action \(a_{t}\) by the following algorithm

  1. Obtain root latent state \(z^t_0 := h_{\theta}(o_{0:t})\).
  2. Expand this root state into children \(\{z^{t}_{1, i}\}_{i=1}^{A}\), with \(z^{t}_{1, i}\) obtained by computing \(g_{\theta}(z^{t}_{0}, a_i)\).
  3. Run a number of tree search simulations, with each consisting of:
    • Starting from the root, take actions \(i_{k+1} := \text{argmax}_{j} \text{UCB}(z^{t}_{k, i_{k}}, j)\) in the order \(k=0, \ldots, n-1\), until visiting a node \(z^{t}_{n}\) that has yet to be expanded.
    • Increment the visit count of the nodes visited during this simulation, \(\{z^{t}_{0}, \ldots, z^{t}_{n}\}\), and add the \(n\)-step bootstrapped value \(\hat{v}^{t+k}_{t+k} := \pm(\psi^{t+k}_{t+k+1} + \cdots + \psi^{t+k}_{t+n} + v^{t+k}_{t+n})\) to node \(z^{t}_{k}\)’s total value (initialized to \(0\)), with \(\pm\) depending on which player is moving.
  4. After running these simulations, choose the next action by sampling from the distribution based on the visits to the children of the root node: \(p(i) = \left(\frac{n^{t}_{1, i}}{\sum_{j=1}^{A} n^{t}_{1, j}}\right)^{1/\nu}\) with temperature \(\nu\).
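Steps 3 and 4 can be sketched as follows; the PUCT-style form of the UCB score and its constant are assumptions (implementations vary in the exact formula):

```python
import math

def ucb_select(visit_counts, total_values, priors, c=1.25):
    """Step 3: pick the child maximizing a PUCT-style score, i.e. its mean
    value plus an exploration bonus weighted by the prior p from f_theta.
    The exact bonus form is one common choice, not the only one."""
    n_total = sum(visit_counts)
    best, best_score = 0, -math.inf
    for i, (n, w, p) in enumerate(zip(visit_counts, total_values, priors)):
        q = w / n if n > 0 else 0.0
        score = q + c * p * math.sqrt(n_total + 1) / (1 + n)
        if score > best_score:
            best, best_score = i, score
    return best

def action_distribution(visit_counts, nu=1.0):
    """Step 4: the visit-count distribution over the root's children,
    sharpened by temperature nu (normalized here)."""
    weights = [n ** (1.0 / nu) for n in visit_counts]
    z = sum(weights)
    return [w / z for w in weights]
```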

TODO

Meta-learning

At the base level, with no meta-learning, denote the hidden state \(h_{\tau} = h_{\tau}^{(0)}\) with dynamics \(q = q^{(0)} = q^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau})\). For example, a parameterized agent has \(h_{\tau}^{(0)} = (\theta_{\tau}, \pi_{\tau})\).

Consider a level higher, with meta-learning dynamics \(q = q^{(1)}\) with state \(h_{\tau} = h_{\tau}^{(1)} := (h_{\tau}^{(0)}, q_{\tau}^{(0)})\) and defined such that

\[q = q^{(1)}(h_{\tau}^{(0)}\mid q_{\tau}^{(0)}, h_{\tau-1}^{(0)}, o_{\tau}) = q_{\tau}^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau})\]

That is, the hidden state consists of the base state \(h_{\tau}^{(0)}\) and the update rule \(q_{\tau}^{(0)}\) for this base state, and \(q^{(1)} = q^{(1)}(h_{\tau}\mid h_{\tau-1}, o_{\tau})\) governs how both this base state and its update rule are updated, allowing for meta-learning of the learning rule and architecture itself, with dynamics

\[\begin{align*} q(h_{\tau}\mid h_{\tau-1}, o_{\tau}) &= q^{(1)}(h_{\tau}^{(1)}\mid h_{\tau-1}^{(1)}, o_{\tau})\\ &= q^{(1)}(h_{\tau}^{(0)}, q_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, q_{\tau-1}^{(0)}, o_{\tau})\\ &= q^{(1)}(q_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, q_{\tau-1}^{(0)}, o_{\tau}) \; q_{\tau}^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau}) \end{align*}\]

Consider more generally \(\text{meta}^{K}\)-learning dynamics, for some \(K = 1, 2, \ldots\). This corresponds to dynamics governed by \(q = q^{(K)}\), with \(q^{(K)} = q^{(K)}(h_{\tau}^{(K)}\mid h_{\tau-1}^{(K)}, o_{\tau})\) with hidden state

\[h_{\tau} = h_{\tau}^{(K)} := (h_{\tau}^{(K-1)}, q_{\tau}^{(K-1)}) = \cdots = (h_{\tau}^{(0)}, q_{\tau}^{(0)}, \ldots, q_{\tau}^{(K-1)})\]

The hidden state consists of \(K\) distributions \(\{q_{\tau}^{(k)}\}_{k=0}^{K-1}\), defined such that

\[q_{\tau}^{(k)}(h_{\tau}^{(k-1)}\mid q_{\tau}^{(k-1)}, h_{\tau-1}^{(k-1)}, o_{\tau}) = q_{\tau}^{(k-1)}(h_{\tau}^{(k-1)}\mid h_{\tau-1}^{(k-1)}, o_{\tau})\]

This gives overall \(\text{meta}^{K}\) dynamics

\[q = q^{(K)}(h_{\tau}^{(K)}\mid h_{\tau-1}^{(K)}, o_{\tau}) = q_{\tau}^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau}) \left[\prod_{k=1}^{K} q_{\tau}^{(k)}(q_{\tau}^{(k-1)}\mid q_{\tau-1}^{(k-1)}, h_{\tau-1}^{(k-1)}, o_{\tau})\right]\]

TODO: discuss popular meta learning dynamics (MAML, etc.)

AIXI

\[\begin{align*} P_t[q] &:= \mathbb{E}_{q}^{\mu}[\phi^t(O_{t+1}, O_{t+2}, \ldots)\mid \Omega_t=\omega_t]\\ &= \int p(o_{t+1:\infty}\mid o_{0:t}, h_{0:t-1}) \phi^t(o_{t+1:\infty}) do_{t+1:\infty}\\ &= \int p(o_{t+1:\infty}, h_{t:\infty}\mid o_{0:t}, h_{0:t-1}) \phi^t(o_{t+1:\infty}) do_{t+1:\infty} dh_{t:\infty}\\ &= \int \phi^t(o_{t+1:\infty}) \left[\prod_{\tau=t}^{\infty} q(h_{\tau}\mid o_{\tau}, h_{\tau-1}) p(o_{\tau+1}\mid o_{0:\tau}, a_{0:\tau}(h_{0:\tau})) dh_{\tau} do_{\tau+1}\right] \end{align*}\]

The agent does not have access to \(p(o_{\tau+1}\mid o_{0:\tau}, h_{0:\tau})\), but let’s say it has a model of this distribution, \(\hat{p} = \hat{p}(o_{\tau+1}\mid o_{0:\tau}, h_{0:\tau})\) (observations and hidden states are accessible to the agent, hence one can imagine learning such a model via supervised learning).

Define the preference under this predictive model

\[\hat{P}_{t}[q, \hat{p}] := \int \phi^t(o_{t+1:\infty}) \left[\prod_{\tau=t}^{\infty} q(h_{\tau}\mid o_{\tau}, h_{\tau-1}) \hat{p}(o_{\tau+1}\mid o_{0:\tau}, a_{0:\tau}(h_{0:\tau})) dh_{\tau} do_{\tau+1}\right]\]

The AIXI model chooses \(\hat{p}\) to be the Solomonoff prior. This can be described by \(\hat{p} = \hat{p}_{S}\) with

\[\hat{p}_{S}(o_{1:\tau+1}\mid o_{0}, a_{0:\tau}) := \sum_{\substack{\sigma\\U(\sigma, o_0, a_{0:\tau}) = o_{1:\tau+1}}} 2^{-\ell(\sigma)}\]

for some universal Turing machine \(U\), summing over programs \(\sigma\) and weighting based on program length \(\ell(\sigma)\). Under this prior we have

\[\begin{align*} \hat{p}_{S}(o_{\tau+1}\mid o_{0:\tau}, a_{0:\tau}) = \frac{\hat{p}_{S}(o_{1:\tau+1}\mid o_0, a_{0:\tau})}{\hat{p}_{S}(o_{1:\tau}\mid o_0, a_{0:\tau})} &= \frac{\hat{p}_{S}(o_{1:\tau+1}\mid o_0, a_{0:\tau})}{\int do'_{\tau+1} \hat{p}_{S}(o_{1:\tau}, o'_{\tau+1}\mid o_0, a_{0:\tau})}\\ &= \frac{\sum_{\sigma: U(\sigma, o_0, a_{0:\tau}) = o_{1:\tau+1}} \; 2^{-\ell(\sigma)}}{\int do'_{\tau+1} \sum_{\sigma': U(\sigma', o_0, a_{0:\tau}) = (o_{1:\tau}, o'_{\tau+1})} \; 2^{-\ell(\sigma')}} \end{align*}\]

TODO: finish AIXI, discuss generalization from the perspective of Kolmogorov complexity

]]>