Notation: We will denote the space of smooth maps between a manifold \(M\) and a vector space \(V\) by \(C^{\infty}(M, V)\). For a functional \(F: C^{\infty}(M, V) \to \mathbb{R}\), we will denote its functional derivative at \(f \in C^{\infty}(M, V)\) evaluated at point \(x \in M\) by \(\frac{\delta F[f]}{\delta f(x)} \in \mathbb{R}\). We will denote the space of invertible linear maps from \(V\) to \(V\) by \(\text{GL}(V)\).
Introduction. In quantum field theory, a theory is described by a functional \(S: \mathcal{C} \to \mathbb{R}\) called the action, mapping field configurations \(\Psi \in \mathcal{C}\) to a real number \(S[\Psi]\). \(\Psi\) will generally be made up of a collection of fields \(\Psi = (\Psi_1, \ldots, \Psi_N)\) relevant to our theory. With an action, we can integrate over the space of field configurations \(\mathcal{C}\) via the measure
\[\text{D}\Psi \, \mathbb{P}[\Psi] \equiv \left[\prod_{i=1}^{N} \text{D}\Psi_i\right] \mathbb{P}[\Psi], \qquad \text{with} \quad \mathbb{P}[\Psi] := \frac{1}{Z} e^{-S[\Psi]}\]with probability density \(\mathbb{P}: \mathcal{C} \to [0, \infty)\), defining the normalization constant
\[Z = \int_{\mathcal{C}} \text{D}\Psi \, e^{-S[\Psi]}\]often called the partition function (we will discuss the meaning of \(\text{D}\Psi\) below). Explicitly, the field configuration space will take the form
\[\begin{align*} \mathcal{C} &= C^{\infty}(M, V^{(1)}) \times \cdots \times C^{\infty}(M, V^{(N)})\\ &\cong C^{\infty}(M, V) \end{align*}\]with \(\Psi_i \in C^{\infty}(M, V^{(i)})\) for a (finite-dimensional) vector space \(V^{(i)}\) and spacetime manifold \(M\), and defining \(V := V^{(1)} \oplus \cdots \oplus V^{(N)}\).
Physical quantities that can be measured via experiment – such as scattering probabilities, as demonstrated by the LSZ formula (see Appendix A.2) – are directly related to field expectation values, also called correlators, which take the general form
\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] := \frac{1}{Z} \int_{\mathcal{C}} \text{D}\Psi \, \Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n) e^{-S[\Psi]}\]for arbitrary \(n \in \mathbb{Z}_{+}\), indices \((i_1, \ldots, i_n) \in \{1, \ldots, N\}^n\), and points \((x_1, \ldots, x_n) \in M^n\). Given that such expectation values are directly related to physical quantities, an important constraint will be for \(S\) to be such that these expectation values are invariant under transformations that leave \(\Psi\) physically equivalent (e.g. Lorentz transformations), which we will soon make precise.
Constructing a well-behaved definition of the path integral measure \(\text{D}\Psi\) is non-trivial. The approach that we will use is to expand each field \(\Psi_i \in C^{\infty}(M, V^{(i)})\) in an eigenfunction basis \(\{\psi_i^j\}_j\) of \(C^{\infty}(M, V^{(i)})\), letting us write \(\Psi_i(x) = \sum_j a_i^j \psi_i^j(x)\) for expansion coefficients \(\{a_i^j\}_j\) and allowing us to define
\[\text{D}\Psi_i := \prod_j da_i^j\]but since this is an infinite product, it will generally result in divergences, requiring some form of regularization: either via Fujikawa's method (as used when computing anomalies, as in Section 4), by truncating the infinite product via a cutoff (as used in the context of Wilsonian renormalization, as in Section 8), or via some other regularization scheme.
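As a concrete toy illustration (my own sketch, not part of the construction above), consider a Gaussian action \(S[f] = \frac{1}{2}\sum_j \lambda_j (a^j)^2\) over mode coefficients, with hypothetical eigenvalues \(\lambda_j = 1 + j^2\) as would arise from an operator like \(-\partial^2 + 1\) on a circle. Each mode contributes a finite factor \(\sqrt{2\pi/\lambda_j}\) to \(Z\), but the infinite product diverges, which is exactly why a regulator is needed:

```python
import numpy as np

# Toy Gaussian "action" S[f] = (1/2) sum_j lam_j (a_j)^2 with eigenvalues
# lam_j = 1 + j^2 (hypothetical choice for illustration). Each mode integrates
# to sqrt(2*pi/lam_j); the cutoff-regulated partition function keeps only the
# first N modes.
def log_Z(N):
    j = np.arange(1, N + 1)
    lam = 1.0 + j**2
    return 0.5 * np.sum(np.log(2.0 * np.pi / lam))

# log Z(N) -> -infinity as the cutoff N is removed: the formal infinite
# product prod_j da_j has no finite unregulated value.
print(log_Z(10), log_Z(100), log_Z(1000))
```

Truncating at a finite cutoff \(N\) makes every quantity finite; the question of removing the cutoff is deferred to renormalization.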
To motivate the guiding principles behind the construction of our theory \(S\), we must first introduce some central concepts. First, we should view the configuration space \(\mathcal{C}\) as exhibiting physical redundancy, containing many configurations that are physically equivalent. In particular, we can think of there as being some true non-redundant physical configuration space \(\mathcal{P}\), with \(\mathcal{C}\) partitioned as
\[\mathcal{C} = \bigcup_{\Phi \in \mathcal{P}} [\Phi]\]for equivalence classes \([\Phi] = \{\Phi' \in \mathcal{C}: \Phi \sim \Phi'\}\) defined by a physical equivalence relation \(\sim\) over \(\mathcal{C}\). Said differently, we have
\[\mathcal{P} \cong \mathcal{C}/\sim\]The physical equivalence relation \(\sim\) will be defined by the orbits of a collection of Lie groups \((G^{(0)}, G^{(1)}, \ldots, G^{(K)})\) (i.e. groups that are also manifolds), consisting of a spacetime symmetry group \(G^{(0)}\) inherent to the particular manifold and metric \((M, g)\) under consideration, as well as a collection of gauge symmetry groups \((G^{(1)}, \ldots, G^{(K)})\). We will denote the overall symmetry group by \(G := G^{(0)} \times G^{(1)} \times \cdots \times G^{(K)}\).
In order to define this physical equivalence relation \(\sim\) over \(\mathcal{C}\) using these groups, we require representations of these groups that describe how they actually act on the field content \(\Psi\). Concretely, the representation of \(G^{(k)}\) acting on \(\Psi_i \in C^{\infty}(M, V^{(i)})\) will be denoted \(\rho^{(i, k)}: G^{(k)} \to \text{GL}(V^{(i, k)})\), where each representation is assigned its own sector \(V^{(i, k)}\) of \(V^{(i)}\) on which to act, with
\[V^{(i)} = \underbrace{V^{(i, 0)}}_{\text{spacetime sector}} \oplus \underbrace{V^{(i, 1)} \oplus \cdots \oplus V^{(i, K)}}_{\text{gauge sectors}}\]Namely, though \(\Psi_i\) lives in an infinite-dimensional space \(C^{\infty}(M, V^{(i)})\), the representations that we will consider will only act on (a subspace of) the finite-dimensional output space \(V^{(i)}\). The overall action of \(G\) on \(\Psi_i\) is described by \(\rho^{(i)} := \rho^{(i, 0)} \oplus \rho^{(i, 1)} \oplus \cdots \oplus \rho^{(i, K)}\), with \(\rho^{(i)}: G \to \text{GL}(V^{(i)})\).
With such representations, we can now define a physical equivalence relation \(\sim\). Recall that we can view \(\mathcal{C} = C^{\infty}(M, V)\). Then we can condense the representations introduced above into a single representation \(\rho := \rho^{(1)} \oplus \cdots \oplus \rho^{(N)}\), where \(\rho: G \to \text{GL}(V)\), allowing us to denote the overall transformation of field content by \(\Psi \mapsto \rho_g \Psi\) under \(g = (g_0, \ldots, g_K) \in G\). This lets us define the physical equivalence class associated with \(\Psi \in \mathcal{C}\) to be
\[[\Psi] := \{\rho_{g} \Psi: g \in C^{\infty}(M, G)\} \subset \mathcal{C}\]where note that \(\rho_{g} \Psi \in C^{\infty}(M, V)\) with \([\rho_{g} \Psi](x) \equiv \rho_{g(x)} \Psi(x) \in V\). This provides us with a physical equivalence relation on \(\mathcal{C}\):
\[\Psi \sim \Psi' \iff \Psi' \in [\Psi]\]Guiding principles. With these concepts introduced, we can now formulate the central guiding principles to constructing physical theories in this framework.
The main principle that we will work towards satisfying is invariance of correlators:
\[\mathbb{E}_{\Psi \sim S}\left[[\rho^{(i_1)}_{g} \Psi_{i_1}](x_1) \cdots [\rho^{(i_n)}_{g} \Psi_{i_n}](x_n)\right] = \mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] \quad \forall \; \; g \in C^{\infty}(M, G)\]for arbitrary \(n \in \mathbb{Z}_{+}\), indices \((i_1, \ldots, i_n) \in \{1, \ldots, N\}^n\), and points \((x_1, \ldots, x_n) \in M^n\). Or written schematically,
\[\mathbb{E}_{\Psi \sim S} \circ \rho_g = \mathbb{E}_{\Psi \sim S}\]Importantly, in order for this condition to hold generically, one can see that we require two independent conditions to hold:
\[\begin{equation} \label{eqn:Sinvar} S[\rho_g \Psi] = S[\Psi] \qquad \forall \; \; g \in C^{\infty}(M, G) \qquad\qquad (\text{Equation 1}) \end{equation}\]and
\[\begin{equation} \label{eqn:Dinvar} \text{D}(\rho_g \Psi) = \text{D}\Psi \qquad \forall \; \; g \in C^{\infty}(M, G) \qquad\qquad (\text{Equation 2}) \end{equation}\]Our theory is ultimately described by the triple \((S, G, \rho)\) consisting of the action, the group content, and the representation content respectively. Equation 2 only places constraints on \((G, \rho)\), whereas Equation 1 constrains all of \((S, G, \rho)\). The focus of Sections 1-3 and 5-7 will be on constructing a theory that achieves Equation 1, while the implications of Equation 2 are examined in Section 4.
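A zero-dimensional caricature (my own toy example, with \(M\) collapsed to a point and \(G = SO(2)\)) shows the pattern: when both the action and the measure are invariant, correlators are automatically invariant too:

```python
import numpy as np

# Zero-dimensional toy model: Psi is a point in R^2, S[Psi] = |Psi|^2/2 is
# SO(2)-invariant, and the flat measure d^2 Psi is rotation invariant, so the
# two-point correlators E[Psi_i Psi_j] must be invariant under g in SO(2).
rng = np.random.default_rng(0)
samples = rng.standard_normal((200_000, 2))   # draws from P[Psi] = exp(-S)/Z

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

corr = samples.T @ samples / len(samples)            # E[Psi_i Psi_j]
rotated = samples @ R.T                              # Psi -> rho_g Psi
corr_rot = rotated.T @ rotated / len(rotated)        # E[(rho_g Psi)_i (rho_g Psi)_j]

print(np.max(np.abs(corr_rot - corr)))   # small: only sampling noise remains
```

Here both Equation 1 and Equation 2 hold trivially, and the correlators agree up to Monte Carlo noise.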
In constructing a theory \(S\) that satisfies Equation 1, the spacetime group \(G^{(0)}\) will play a particularly special role. Namely, we will begin with a theory of empty field content \(\Psi = \emptyset\) and add to the field content incrementally, using representations of \(G^{(0)}\) to construct the initial field content of the theory (i.e. spinors), with \(S\) invariant (in the sense of Equation 1) under \(G^{(0)}\) by design. From here, we will then obtain invariance under \((G^{(1)}, \ldots, G^{(K)})\) via a process called minimal coupling. An overview of what follows:
Section 3: To make \(S[\Psi]\) invariant to the gauge groups \((G^{(1)}, \ldots, G^{(K)})\), we perform minimal coupling, which involves extending the field content
\[\Psi \mapsto \Psi \cup \{A^{(k)}\}_{k=1}^{K}\]for a 4-vector \(A^{(k)}\) associated with each gauge group \(G^{(k)}\) that transforms appropriately under \(G^{(k)}\), and adding appropriate terms to \(S[\Psi]\) involving these new fields.
Section 4: Checking that Equation 2 is satisfied requires us to compute anomalies, describing the extent to which \(\text{D}\Psi\) fails to be invariant under \(\rho\). Anomalies must vanish in order for Equation 2 to be satisfied. We will see that the Standard Model, defined by the gauge groups
\[(G^{(1)}, G^{(2)}, G^{(3)}) = (U(1), SU(2), SU(3))\]indeed has vanishing anomalies as required.
Section 5: To add non-trivial interaction terms between spinors to the action \(S[\Psi]\), we extend the field content
\[\Psi \mapsto \Psi \cup \{H\}\]for a single scalar \(H\) that transforms appropriately under the gauge groups to preserve gauge invariance of \(S[\Psi]\). In the context of the Standard Model, \(H\) is called the Higgs boson, and the associated interaction terms are called the Yukawa terms.
Section 7: To remedy divergences that arise from integrating over physically-equivalent configurations, we must introduce ghost fields \(\bar{c}^{(k)}, c^{(k)} \in C^{\infty}(M, \mathfrak{g}^{(k)})\) for each gauge field \(A^{(k)}\):
\[\Psi \mapsto \Psi \cup \{\bar{c}^{(k)}, c^{(k)}\}_{k=1}^{K}\]This requires adding certain contributions to the action that break gauge invariance. However, remnants of gauge invariance still persist through BRST invariance.
Main references of relevance: David Tong’s Standard Model notes are most relevant to Section 1-6, and David Skinner’s Advanced QFT notes are relevant to Section 7-9. Section 4 is also closely based on David Tong’s Gauge Theory notes.
We begin with an empty theory of no field content \(\Psi = \emptyset\). To start things off, we must choose some (\(d\)-dimensional) spacetime manifold \(M\) on which to define our theory, as well as a metric \(g\) over this manifold. Roughly, the manifold \(M\) describes the topology of our spacetime, and the metric \(g\) describes its geometry.
Our manifold \(M\) comes equipped with a tangent space \(T(M)\), and in particular, a (local) coordinate basis \(\{\partial_{\mu}\}_{\mu=0}^{d-1} \subset T(M)\) of this tangent space, acting as partial derivatives that can act on our (to be constructed) field content of type \(C^{\infty}(M, V)\).
Motivating the spacetime symmetry group. To begin defining the objects and fields \(\Psi_i\) relevant to \((M, g)\), our starting point will be to consider the isometries of the metric \(g\), corresponding to the coordinate transformations that leave \(g\) invariant. Particularly, this set of transformations forms a group \(\text{Iso}(M, g)\). We would like our spacetime symmetry group \(G^{(0)}\) to be related to \(\text{Iso}(M, g)\) in some capacity.
Our metric \(g\) will have some signature, describing the signs of its eigenvalues. By the assumption of \(g\) being non-degenerate (i.e. invertible as a matrix), we have that \(g\) has no zero eigenvalues. We say that \(g\) has signature \((r, s)\) if it has \(r\) positive eigenvalues and \(s\) negative eigenvalues (where \(r+s = d\)). We then define the corresponding signature matrix \(\Omega_{\mu\nu}^{(r, s)} := \text{diag}(\underbrace{1, \ldots, 1}_{r \, \text{times}}, \underbrace{-1, \ldots, -1}_{s \, \text{times}})\) for \(g\). Then we have the following result: at any point \(x \in M\), there exists a basis \(\{e_{\mu}\}_{\mu} \subset T_x(M)\) such that
\[g_x(e_{\mu}, e_{\nu}) = \Omega_{\mu\nu}^{(r, s)}\]That is, locally, we can always reduce \(g\) to the signature matrix \(\Omega_{\mu\nu}^{(r, s)}\). As a result, the isometry group \(\text{Iso}(M, g)\) locally reduces to \(\text{Iso}(M, \Omega^{(r, s)})\), and note that \(\text{Iso}(M, \Omega^{(r, s)}) \cong \text{IO}(r, s)\), the inhomogeneous orthogonal group of signature \((r, s)\). Further, note that we can decompose \(\text{IO}(r, s)\) as
\[\text{IO}(r, s) = \mathbb{R}^{r, s} \rtimes \text{O}(r, s)\]with \(\text{O}(r, s)\) the linear/homogeneous orthogonal group of signature \((r, s)\), explicitly defined as
\[\text{O}(r, s) = \{A \in \text{GL}(r+s; \mathbb{R}): A^T \Omega^{(r,s)} A = \Omega^{(r,s)}\}\]We will restrict our attention to the local linear isometries \(\text{O}(r, s)\) of \(g\).
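As a quick numerical sanity check (my own sketch), a boost with rapidity \(\chi\) satisfies the defining condition \(A^T \Omega^{(r,s)} A = \Omega^{(r,s)}\) of \(\text{O}(1, 3)\):

```python
import numpy as np

# Verify that a Lorentz boost along the first spatial axis satisfies the
# defining condition A^T Omega A = Omega of O(r, s), here with (r, s) = (1, 3).
Omega = np.diag([1.0, -1.0, -1.0, -1.0])

chi = 1.3  # arbitrary rapidity
boost = np.eye(4)
boost[0, 0] = boost[1, 1] = np.cosh(chi)
boost[0, 1] = boost[1, 0] = np.sinh(chi)

print(np.allclose(boost.T @ Omega @ boost, Omega))  # True
```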
To construct some initial field content \(\Psi\), we would like to understand the available representations of \(\text{O}(r, s)\), which is most easily done by studying its Lie algebra \(\mathfrak{o}(r, s)\). Simply connected Lie groups are particularly nice for this purpose, since in that case there is a one-to-one correspondence between Lie group representations and Lie algebra representations. However, \(\text{O}(r, s)\) is not simply connected.
But, in general, for any finite-dimensional Lie algebra \(\mathfrak{g}\), there is a unique (up to isomorphism) simply connected Lie group \(G\) whose Lie algebra is \(\mathfrak{g}\). Further, if \(H\) is any Lie group that also has this algebra \(\mathfrak{g}\), then \(G\) is isomorphic to the universal covering group of the connected component of \(H\) that contains the identity.
In our case, this means that the unique simply connected Lie group associated with \(\mathfrak{o}(r, s)\) is the universal covering group of \(SO^{+}(r, s)\) (where \(SO^{+}(r, s)\) is the connected component of \(O(r, s)\) that contains the identity). Of particular interest to us will be the case of \((r, s) = (1, d-1)\), in which case we have that for \(d \geq 4\), the universal covering group of \(SO^{+}(1, d-1)\) is the spin group \(\text{Spin}(1, d-1)\). As a result, we will choose our spacetime group to be
\[G^{(0)} = \text{Spin}(1, d-1)\]In total, we have chosen \(G^{(0)}\) to be the unique simply connected Lie group corresponding to (the Lie algebra of) the local linear isometries of \(g\). We will now study its representations.
Representations. Of particular interest to our universe is the choice \(d=4\), corresponding to the signature \((r, s) = (1, 3)\), with manifold \(M \cong \mathbb{R}^{1, 3}\), representing 1 temporal dimension and 3 spatial dimensions. The signature matrix is \(\eta_{\mu\nu}\), called the Minkowski metric, taking the form \(\eta = \text{diag}(1, -1, -1, -1)\).
In this case, we can identify
\[G^{(0)} = \text{Spin}(1, 3) \cong SL(2; \mathbb{C})\]with \(SO^{+}(1, 3) \cong SL(2; \mathbb{C})/\mathbb{Z}_2\). In particular, by construction, we have that
\[\mathfrak{o}(1, 3) \cong \mathfrak{sl}(2; \mathbb{C})\]We therefore wish to classify the spacetime representations of \(G^{(0)}\) via studying the algebra \(\mathfrak{sl}(2; \mathbb{C})\). As shown in Appendix A.1, we can classify all irreducible representations of complex simple Lie algebras, and the complexification \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) is semi-simple, decomposing as the direct sum of simple Lie algebras:
\[\begin{equation} \label{eqn:sliso} \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}} \cong \mathfrak{su}(2)_{\mathbb{C}} \oplus \mathfrak{su}(2)_{\mathbb{C}} \end{equation}\]This follows from there being a set of generators \(\{L_i\}_{i=1}^{3} \cup \{R_i\}_{i=1}^{3}\) of \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) such that \(\{L_i\}_{i=1}^{3}\) and \(\{R_i\}_{i=1}^{3}\) each satisfy the \(\mathfrak{su}(2)_{\mathbb{C}}\) algebra independently, i.e.
\[[L_i, L_j] = \epsilon_{ijk} L_k, \qquad [R_i, R_j] = \epsilon_{ijk} R_k\]with \([L_i, R_j]= 0\) for all \(i, j\). This corresponds to the Lie algebra isomorphism \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}} \cong \mathfrak{su}(2)_{\mathbb{C}} \oplus \mathfrak{su}(2)_{\mathbb{C}}\).
In more detail, \(\mathfrak{sl}(2; \mathbb{C})\) (non-complexified) has generators with representations
\[\text{Boosts:} \quad K_1 \sim \begin{bmatrix} &1&&\\ 1&&&\\ &&&\\ &&& \end{bmatrix}, \quad K_2 \sim \begin{bmatrix} &&1&\\ &&&\\ 1&&&\\ &&& \end{bmatrix}, \quad K_3 \sim \begin{bmatrix} &&&1\\ &&&\\ &&&\\ 1&&& \end{bmatrix}\] \[\text{Rotations:} \quad J_1 \sim \begin{bmatrix} &&&\\ &&&\\ &&&-1\\ &&1& \end{bmatrix}, \quad J_2 \sim \begin{bmatrix} &&&\\ &&&1\\ &&&\\ &-1&& \end{bmatrix}, \quad J_3 \sim \begin{bmatrix} &&&\\ &&-1&\\ &1&&\\ &&& \end{bmatrix}\]allowing us to write any element \(X \in \mathfrak{sl}(2; \mathbb{C})\) as \(X = \theta^i J_i + \chi^i K_i\) for some \(\theta^i, \chi^j \in \mathbb{R}\). This lets us define generators for the complexified algebra \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\):
\[L_i := \frac{1}{2}(J_i + iK_i), \qquad R_i = \frac{1}{2}(J_i - iK_i)\]Namely, we can write any element \(X \in \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) as \(X = \alpha^i L_i + \beta^i R_i\) for some \(\alpha^i, \beta^j \in \mathbb{C}\). Then explicitly, the isomorphism is
\[X = \alpha^i L_i + \beta^i R_i \mapsto \alpha^i L_i \oplus \beta^i R_i =: X_L \oplus X_R\]Note that we can write
\[\alpha^i = \theta^i - i\chi^i, \qquad \beta^i = \theta^i + i\chi^i\]for coefficients \(\theta^i, \chi^j\) in the original uncomplexified basis (rotation and boost parameters respectively).
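These relations can be checked numerically (my own sketch) using the explicit \(4 \times 4\) matrices above:

```python
import numpy as np

# Build the vector-representation generators J_i, K_i written above, form
# L_i = (J_i + i K_i)/2 and R_i = (J_i - i K_i)/2, and verify the two
# commuting su(2) copies.
def comm(a, b):
    return a @ b - b @ a

K = [np.zeros((4, 4), dtype=complex) for _ in range(3)]
for i in range(3):
    K[i][0, i + 1] = K[i][i + 1, 0] = 1.0

J = [np.zeros((4, 4), dtype=complex) for _ in range(3)]
J[0][2, 3], J[0][3, 2] = -1.0, 1.0
J[1][1, 3], J[1][3, 1] = 1.0, -1.0
J[2][1, 2], J[2][2, 1] = -1.0, 1.0

L = [(J[i] + 1j * K[i]) / 2 for i in range(3)]
R = [(J[i] - 1j * K[i]) / 2 for i in range(3)]

eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k], eps[j, i, k] = 1.0, -1.0

for i in range(3):
    for j in range(3):
        assert np.allclose(comm(L[i], L[j]), sum(eps[i, j, k] * L[k] for k in range(3)))
        assert np.allclose(comm(R[i], R[j]), sum(eps[i, j, k] * R[k] for k in range(3)))
        assert np.allclose(comm(L[i], R[j]), 0)   # the two copies commute
print("su(2) (+) su(2) relations verified")
```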
In the following, we will denote the irreducible representation of \(\mathfrak{su}(2)_{\mathbb{C}}\) of highest weight \(\Lambda\) by
\[d^{(\Lambda)}: \mathfrak{su}(2)_{\mathbb{C}} \to \mathfrak{gl}(V_{\Lambda})\]with \(\dim d^{(\Lambda)} \equiv \dim V_{\Lambda} = \Lambda + 1\). For more details regarding why we can classify the irreducible representations of complex simple Lie algebras by highest weights \(\Lambda\), see Appendix A.1.
The isomorphism of Equation \ref{eqn:sliso} lets us determine all irreducible representations \(d^{(\Lambda_1, \Lambda_2)}\) of \(\mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\), constructed as
\[d^{(\Lambda_1, \Lambda_2)}: \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}} \to \mathfrak{gl}(V_{\Lambda_1, \Lambda_2})\] \[d_X^{(\Lambda_1, \Lambda_2)} := d_{X_L}^{(\Lambda_1)} \otimes I + I \otimes d_{X_R}^{(\Lambda_2)}\]defining \(V_{\Lambda_1, \Lambda_2} := V_{\Lambda_1} \otimes V_{\Lambda_2}\), with \(\dim d^{(\Lambda_1, \Lambda_2)} = (\Lambda_1 + 1)(\Lambda_2 + 1)\). Here, \(X_L, X_R \in \mathfrak{su}(2)_{\mathbb{C}}\) are related to \(X \in \mathfrak{sl}(2; \mathbb{C})_{\mathbb{C}}\) through the isomorphism outlined above, where
\[X = \alpha^i L_i + \beta^i R_i \implies X_L = \alpha^i L_i, \quad X_R = \beta^i R_i\]We can extend this algebra representation \(d^{(\Lambda_1, \Lambda_2)}\) to a group representation \(D^{(\Lambda_1, \Lambda_2)}\) via the exponential map:
\[D^{(\Lambda_1, \Lambda_2)}: SL(2; \mathbb{C}) \to \text{GL}(V_{\Lambda_1, \Lambda_2})\] \[D_{A}^{(\Lambda_1, \Lambda_2)} := \exp\left(d_{X(A)}^{(\Lambda_1, \Lambda_2)}\right)\]for \(A \in SL(2; \mathbb{C})\), with \(X(A) \in \mathfrak{sl}(2; \mathbb{C})\) related to \(A\) via \(A =: \exp(X(A))\).
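Since \(d_{X_L}^{(\Lambda_1)} \otimes I\) and \(I \otimes d_{X_R}^{(\Lambda_2)}\) commute, the exponential factorizes into a tensor product of exponentials; a quick numerical check (my own sketch, using the \((1, 1)\) case with the Pauli-matrix realization discussed below):

```python
import numpy as np
from scipy.linalg import expm

# Since kron(dL, I) and kron(I, dR) commute, exponentiating d^(1,1) factorizes
# into the tensor product of the two 2x2 exponentials.
sigma = [np.array([[0, 1], [1, 0]], dtype=complex),
         np.array([[0, -1j], [1j, 0]]),
         np.array([[1, 0], [0, -1]], dtype=complex)]
theta, chi = np.array([0.4, -0.2, 0.9]), np.array([0.1, 0.6, -0.7])
dL = sum((-0.5j * theta[i] - 0.5 * chi[i]) * sigma[i] for i in range(3))
dR = sum((-0.5j * theta[i] + 0.5 * chi[i]) * sigma[i] for i in range(3))

I2 = np.eye(2)
d11 = np.kron(dL, I2) + np.kron(I2, dR)   # d^(1,1) acting on V_1 (x) V_1
print(np.allclose(expm(d11), np.kron(expm(dL), expm(dR))))  # True
```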
Iterating through the first few of these representations is sufficient to define the core objects of interest:
A scalar \(S\) lives in the 1-dimensional vector space \(V_{0, 0}\) and transforms as
\[S \mapsto_{A} D_{A}^{(0, 0)} S \equiv S\]under \(A \in SL(2; \mathbb{C})\).
A left-handed spinor \(\psi_L\) lives in the 2-dimensional vector space \(V_{1, 0}\) and transforms as
\[\psi_L \mapsto_{A} D_{A}^{(1, 0)} \psi_L \equiv \exp\left(d_{X_L(A)}^{(1)}\right) \psi_L\]under \(A \in SL(2; \mathbb{C})\).
A right-handed spinor \(\psi_R\) lives in the 2-dimensional vector space \(V_{0, 1}\) and transforms as
\[\psi_R \mapsto_{A} D_{A}^{(0, 1)} \psi_R \equiv \exp\left(d_{X_R(A)}^{(1)}\right) \psi_R\]under \(A \in SL(2; \mathbb{C})\).
A 4-vector \(V\) lives in the 4-dimensional vector space \(V_{1, 1}\) and transforms as
\[V \mapsto_{A} D_{A}^{(1, 1)} V \equiv \left(\exp\left(d_{X_L(A)}^{(1)}\right) \otimes \exp\left(d_{X_R(A)}^{(1)}\right)\right) V\]under \(A \in SL(2; \mathbb{C})\).
These 4 objects are all we need to consider to construct the Standard Model.
Spinor transformations. We can write the spinor transformation rules more explicitly. Note that \(d^{(1)}\) describes the fundamental representation of \(\mathfrak{su}(2)_{\mathbb{C}}\), and the Pauli matrices \(\{-i\sigma^j/2\}_{j=1}^{3}\) act as a fundamental representation of \(\mathfrak{su}(2)_{\mathbb{C}}\), allowing us to choose
\[d^{(1)}_{L_i} = -\frac{i}{2} \sigma^i, \qquad d^{(1)}_{R_i} = -\frac{i}{2} \sigma^i\] \[\implies d_{X_L(A)}^{(1)} = -\frac{i}{2} \boldsymbol{\theta}(A) \cdot \boldsymbol{\sigma} - \frac{1}{2} \boldsymbol{\chi}(A) \cdot \boldsymbol{\sigma}, \qquad d_{X_R(A)}^{(1)} = -\frac{i}{2} \boldsymbol{\theta}(A) \cdot \boldsymbol{\sigma} + \frac{1}{2} \boldsymbol{\chi}(A) \cdot \boldsymbol{\sigma}\]for \(A \in SL(2; \mathbb{C})\). For brevity we can define the real-valued (anti-symmetric) matrix \(\omega_{\mu\nu}(A)\) by
\[\omega_{ij}(A) := \epsilon_{ijk} \theta^k(A), \qquad \omega_{0i}(A) := \chi^i(A)\]Then, also defining
\[\sigma^{\mu\nu} := \frac{i}{4}(\sigma^{\mu} \bar{\sigma}^{\nu} - \sigma^{\nu} \bar{\sigma}^{\mu}), \qquad \bar{\sigma}^{\mu\nu} := \frac{i}{4}(\bar{\sigma}^{\mu} \sigma^{\nu} - \bar{\sigma}^{\nu} \sigma^{\mu})\](related via \((\sigma^{\mu\nu})^{\dagger} = \bar{\sigma}^{\mu\nu}\)) we can write
\[d_{X_L(A)}^{(1)} = -\frac{i}{2} \omega_{\mu\nu}(A) \sigma^{\mu\nu}, \qquad d_{X_R(A)}^{(1)} = -\frac{i}{2} \omega_{\mu\nu}(A) \bar{\sigma}^{\mu\nu}\]which gives us the spinor transformation rules:
\[\psi_L \mapsto_{A} \underbrace{\exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) \sigma^{\mu\nu}\right)}_{=: \, L(A)} \psi_L, \qquad \psi_R \mapsto_{A} \underbrace{\exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) \bar{\sigma}^{\mu\nu}\right)}_{=: \, R(A)}\psi_R\]Note that \(L(A)^{\dagger} = R(A)^{-1}\).
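The relation \(L(A)^{\dagger} = R(A)^{-1}\) can be verified numerically (my own sketch) for arbitrary parameters \(\boldsymbol{\theta}, \boldsymbol{\chi}\):

```python
import numpy as np
from scipy.linalg import expm

# Build L(A) = exp(d^(1)_{X_L}) and R(A) = exp(d^(1)_{X_R}) for arbitrary
# rotation parameters theta and boost parameters chi, and check L^dag = R^{-1}.
sigma = [np.array([[0, 1], [1, 0]], dtype=complex),
         np.array([[0, -1j], [1j, 0]]),
         np.array([[1, 0], [0, -1]], dtype=complex)]

theta = np.array([0.3, -1.1, 0.5])
chi = np.array([0.7, 0.2, -0.4])

dot = lambda v: sum(v[i] * sigma[i] for i in range(3))
L_mat = expm(-0.5j * dot(theta) - 0.5 * dot(chi))   # L(A)
R_mat = expm(-0.5j * dot(theta) + 0.5 * dot(chi))   # R(A)

print(np.allclose(L_mat.conj().T, np.linalg.inv(R_mat)))  # True
```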
One can also introduce a Dirac spinor \(\psi\) to live in the 4-dimensional space \(V_{1, 0} \oplus V_{0, 1}\) and transform as
\[\psi \mapsto_{A} (D_{A}^{(1, 0)} \oplus D_{A}^{(0, 1)}) \psi = \exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) S^{\mu\nu}\right) \psi\]for \(S^{\mu\nu} := \frac{i}{4} [\gamma^{\mu}, \gamma^{\nu}]\) (with \(S^{\mu\nu} = \text{diag}(\sigma^{\mu\nu}, \bar{\sigma}^{\mu\nu})\)). In particular we can view \(\psi = \psi_L \oplus \psi_R\). This construction will be useful when computing anomalies in Section 4.
4-vector transformations. The transformation rule for a 4-vector \(V\) corresponds to
\[V \mapsto_{A} (L(A) \otimes R(A)) V\]where recall that \(V \in V_1 \otimes V_1\) (with \(\dim V_1 = 2\)). Explicitly in indices,
\[V_{ij} \mapsto_{A} L(A)^k{}_{i} R(A)^{l}{}_{j} V_{kl}\]It turns out that there is a one-to-one correspondence between \(V_{ij}\) and an object \(V_{\mu}\) (with indices \(\mu = 0, 1, 2, 3\)) that transforms as
\[V_{\mu} \mapsto_{A} \Lambda(A)^{\nu}{}_{\mu} V_{\nu}\]with \(\Lambda(A) \in SO^{+}(1, 3)\) constructed from \(A \in SL(2; \mathbb{C})\) via
\[\Lambda(A)^{\mu}{}_{\nu} := \frac{1}{2} \text{tr}(\bar{\sigma}^{\mu} A \sigma_{\nu} A^{\dagger})\]which is related to the double-cover correspondence \(SO^{+}(1, 3) \cong SL(2; \mathbb{C})/\mathbb{Z}_2\) (see that \(\Lambda(A) = \Lambda(-A)\), reflecting the fact that \(SL(2; \mathbb{C})\) is a double-cover). We can write \(\Lambda(A) \in SO^{+}(1, 3)\) more conveniently: define the collection of matrices \(\{M^{\mu\nu}\}_{\mu, \nu}\) by
\[(M^{\mu\nu})^{\sigma}{}_{\rho} = i(\delta^{\nu}{}_{\rho} \eta^{\mu\sigma} - \delta^{\mu}{}_{\rho} \eta^{\nu\sigma})\](note that \(M^{ij} = i\epsilon^{ijk} J_k, M^{0i} = iK_i\)) which allows us to write
\[\Lambda(A) = \exp\left(-\frac{i}{2} \omega_{\mu\nu}(A) M^{\mu\nu}\right)\]For infinitesimal \(\omega_{\mu\nu}\), this expands as \(\Lambda^{\mu}{}_{\nu} = \delta^{\mu}{}_{\nu} + \omega^{\mu}{}_{\nu} + O(\omega^2)\).
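The stated properties of \(\Lambda(A)\), that it preserves \(\eta\) and is blind to the sign of \(A\), can be checked numerically (my own sketch):

```python
import numpy as np
from scipy.linalg import expm

# For a random A in SL(2, C), the matrix Lambda(A)^mu_nu built from the trace
# formula is a Lorentz transformation, and Lambda(-A) = Lambda(A) reflects the
# double cover SO+(1,3) = SL(2,C)/Z_2.
s0 = np.eye(2, dtype=complex)
s = [np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[0, -1j], [1j, 0]]),
     np.array([[1, 0], [0, -1]], dtype=complex)]
sig_bar = [s0] + [-m for m in s]                          # sigma_bar^mu = (I, -sigma^i)
eta = np.diag([1.0, -1.0, -1.0, -1.0])
sig_lower = [eta[nu, nu] * (s0 if nu == 0 else s[nu - 1]) for nu in range(4)]

def Lambda(A):
    return np.array([[0.5 * np.trace(sig_bar[mu] @ A @ sig_lower[nu] @ A.conj().T).real
                      for nu in range(4)] for mu in range(4)])

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
X -= 0.5 * np.trace(X) * np.eye(2)   # traceless => exp(X) has det 1
A = expm(X)

Lam = Lambda(A)
print(np.allclose(Lam.T @ eta @ Lam, eta))   # True: Lambda(A) preserves eta
print(np.allclose(Lambda(-A), Lam))          # True: double cover
```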
Due to this correspondence, when talking of 4-vectors, we will be referring to the object \(V_{\mu}\) that transforms by \(\Lambda(A) \in SO^{+}(1, 3)\) rather than \(V_{ij}\) itself, since \(V_{\mu}\) is much more convenient to work with.
Recall that our theory comes equipped with partial derivatives \(\{\partial_{\mu}\}_{\mu=0}^{3} \subset T(M)\). These objects transform under a coordinate transformation described by the matrix \(B^{\mu}{}_{\nu}\) as
\[\partial_{\mu} \mapsto_B B^{\nu}{}_{\mu} \partial_{\nu}\]meaning that, under a Lorentz transformation \(B = \Lambda(A)\), partial derivatives transform as a 4-vector:
\[\partial_{\mu} \mapsto_A \Lambda(A)^{\nu}{}_{\mu} \partial_{\nu}\]for \(A \in SL(2; \mathbb{C})\). As we will soon see, this transformation rule helps us to construct non-trivial (dynamical) scalars by combining spinors.
We will now consider the field content \(\Psi = \{\psi_{L, i}\}_{i=1}^{N_L} \cup \{\chi_{R, j}\}_{j=1}^{N_R}\) consisting of \(N_L\) left-handed spinors and \(N_R\) right-handed spinors. Our first goal is to construct an action \(S[\Psi]\) that is invariant to the spacetime symmetry group \(G^{(0)}\) (in the sense of Equation \ref{eqn:Sinvar}), which is essentially the statement that \(S[\Psi]\) must transform as a scalar.
To achieve this, we will need to understand how we can combine the spinors \(\{\psi_{L, i}\}_{i=1}^{N_L} \cup \{\chi_{R, j}\}_{j=1}^{N_R}\) in a way that produces a scalar. In the following, we will show that contractions of the form
\[\psi_{L}^{\dagger} \chi_{R}, \qquad \psi_{L}^{T} \sigma^2 \chi_{L}, \qquad \psi_{L}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \chi_L\]and, flipping the handedness,
\[\chi_{R}^{\dagger} \psi_L, \qquad \psi_R^T \sigma^2 \chi_R, \qquad \psi_R^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_R\]all transform as scalars. There are many more scalars than this, however these are arguably the simplest scalars we can write down and are sufficient for constructing the Standard Model.
It is easy to show that \(\psi_{L}^{\dagger} \chi_{R}\) and \(\chi_{R}^{\dagger} \psi_L\) are scalars.
See that, since \(L(A)^{\dagger} = R(A)^{-1}\),
\[\begin{align*} \psi_L^{\dagger} \chi_R &\mapsto_A \psi_L^{\dagger} L(A)^{\dagger} R(A) \chi_R\\ &= \psi_L^{\dagger} R(A)^{-1} R(A) \chi_R\\ &= \psi_L^{\dagger} \chi_R \end{align*}\]as required. \(\chi_{R}^{\dagger} \psi_L\) follows identically.
\(\psi_L^T \chi_L\) (and \(\psi_R^T \chi_R\)) are not scalars, since \(L(A)^T \neq L(A)^{-1}\) in general. However, \((\sigma^2)^T = -\sigma^2\) while \((\sigma^{\mu})^T = \sigma^{\mu}\) for \(\mu \neq 2\), and inserting \(\sigma^2\) results in \(\psi_{L}^{T} \sigma^2 \chi_{L}\) and \(\psi_R^T \sigma^2 \chi_R\) being scalars.
This follows from using \(\sigma^i \sigma^j = \delta_{ij} I + i\epsilon_{ijk} \sigma^k\) to show that
\[\sigma^2 \sigma^{ij} \sigma^2 = i\epsilon_{ijk} \sigma^2 \sigma^{0k} \sigma^2, \qquad \sigma^2 \sigma^{0i} \sigma^2 = \frac{i}{2} (-1)^{1\{i=2\}} \sigma^i\]which can be used to show
\[\sigma^2 L(A) \sigma^2 = (L(A)^T)^{-1}\]and therefore
\[\begin{align*} \psi_L^T \sigma^2 \chi_L &\mapsto_A \psi_L^T L(A)^T \sigma^2 L(A) \chi_L\\ &= \psi_L^T L(A)^T \underbrace{(\sigma^2 L(A) \sigma^2)}_{(L(A)^T)^{-1}} \sigma^2 \chi_L\\ &= \psi_L^T \sigma^2 \chi_L \end{align*}\]as required. \(\psi_R^T \sigma^2 \chi_R\) follows similarly.
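Both the identity \(\sigma^2 L(A) \sigma^2 = (L(A)^T)^{-1}\) and the resulting invariance of \(\psi_L^T \sigma^2 \chi_L\) can be verified numerically (my own sketch):

```python
import numpy as np
from scipy.linalg import expm

# Check sigma^2 L(A) sigma^2 = (L(A)^T)^{-1}, then check invariance of the
# bilinear psi_L^T sigma^2 chi_L for random spinors.
sigma = [np.array([[0, 1], [1, 0]], dtype=complex),
         np.array([[0, -1j], [1j, 0]]),
         np.array([[1, 0], [0, -1]], dtype=complex)]
s2 = sigma[1]

theta = np.array([0.4, 0.9, -0.3])
chi = np.array([-0.6, 0.1, 0.8])
d_L = sum((-0.5j * theta[i] - 0.5 * chi[i]) * sigma[i] for i in range(3))
L_mat = expm(d_L)

assert np.allclose(s2 @ L_mat @ s2, np.linalg.inv(L_mat.T))

rng = np.random.default_rng(2)
psi = rng.standard_normal(2) + 1j * rng.standard_normal(2)
chi_sp = rng.standard_normal(2) + 1j * rng.standard_normal(2)
before = psi @ s2 @ chi_sp
after = (L_mat @ psi) @ s2 @ (L_mat @ chi_sp)
print(np.allclose(before, after))  # True: the bilinear is invariant
```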
Showing that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \chi_L\) (and \(\psi_R^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_R\)) are scalars is more involved. First, we can show that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu}\chi_L\) transforms as a 4-vector.
We will show this infinitesimally. We must first derive the infinitesimal transformation rule for a 4-vector \(X_{\mu} \mapsto_A \Lambda(A)^{\nu}{}_{\mu} X_{\nu}\). We will write \(\Lambda(A) = \exp(-\frac{i}{2} \omega_{\mu\nu} M^{\mu\nu})\). Then for infinitesimal \(\omega_{\mu\nu}\),
\[\begin{align*} X_{\mu} \mapsto_{A} \Lambda(A)^{\nu}{}_{\mu} X_{\nu} &= X_{\mu} -\frac{i}{2} \omega_{\sigma\rho} (M^{\sigma\rho})^{\nu}{}_{\mu} X_{\nu} + O(\omega^2)\\ &= X_{\mu} - i\omega_{0i} (M^{0i})^{\nu}{}_{\mu} X_{\nu} - \frac{i}{2} \omega_{ij} (M^{ij})^{\nu}{}_{\mu} X_{\nu} + O(\omega^2)\\ &= X_{\mu} + \chi^i(\delta^i_{\mu} \eta^{0\nu} - \delta^0_{\mu} \eta^{i\nu}) X_{\nu} + \frac{1}{2} \epsilon_{ijk} \theta^k (\delta_{\mu}^j \eta^{i\nu} - \delta_{\mu}^i \eta^{j\nu}) X_{\nu} + O(\omega^2)\\ &= X_{\mu} - \delta^0_{\mu} \chi^i X^i + \delta_{\mu}^i(\chi^i X^0 + \epsilon_{ijk} \theta^j X^k) + O(\omega^2) \end{align*}\]and so, raising the index, we have the infinitesimal 4-vector transformation rule
\[X^{\mu} \mapsto_{A} X^{\mu} - \delta^{\mu}_0 \chi^i X^i - \delta^{\mu}_i (\chi^i X^0 + \epsilon_{ijk} \theta^j X^k) + O(\omega^2)\]Now we must show that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu}\chi_L\) also has this transformation rule. See that
\[\begin{align*} \psi_L^{\dagger} \bar{\sigma}^{\mu} \chi_L &\mapsto_{A} \psi_L^{\dagger} e^{\frac{i}{2} \omega_{\nu\rho} \bar{\sigma}^{\nu\rho}} \bar{\sigma}^{\mu} e^{-\frac{i}{2} \omega_{\nu\rho} \sigma^{\nu\rho}} \chi_L\\ &= \psi_L^{\dagger} \bar{\sigma}^{\mu} \chi_L + \frac{i}{2} \omega_{\nu\rho} \psi_L^{\dagger} (\bar{\sigma}^{\nu\rho} \bar{\sigma}^{\mu} - \bar{\sigma}^{\mu} \sigma^{\nu\rho}) \chi_L + O(\omega^2) \end{align*}\]Now using \(\omega_{\mu\nu} \bar{\sigma}^{\mu\nu} = \theta^i \sigma^i + i \chi^i \sigma^i\) and \(\omega_{\mu\nu} \sigma^{\mu\nu} = \theta^i \sigma^i - i\chi^i \sigma^i\), we can write
\[\begin{align*} \frac{i}{2} \omega_{\nu\rho} (\bar{\sigma}^{\nu\rho} \bar{\sigma}^{\mu} - \bar{\sigma}^{\mu} \sigma^{\nu\rho}) &= \frac{i}{2} \theta^i [\sigma^i, \bar{\sigma}^{\mu}] - \frac{1}{2} \chi^i \{\sigma^i, \bar{\sigma}^{\mu}\}\\ &= \begin{cases} -\chi^i \sigma^i, & \mu = 0\\ -\epsilon^{jik} \theta^i \sigma^k - \chi^j \sigma^0, & \mu = j \end{cases} \end{align*}\]where we have used \(\chi^i \delta^{ij} \equiv \chi^i \eta^{ik} \delta^j_k = -\chi^i \delta^j_i = -\chi^j\). This exactly matches the transformation rule of a 4-vector.
Further, we have that the contraction of any two 4-vectors is a scalar, telling us that \(\psi_{L}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \chi_L\) and \(\psi_R^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_R\) are scalars, since \(\partial_{\mu}\) transforms as a 4-vector.
In more detail, given two 4-vectors \(X_{\mu}\) and \(Y_{\mu}\), see that from their transformation rules,
\[\begin{align*} X_{\mu} Y^{\mu} &\mapsto_{A} X_{\mu} Y^{\mu} - \chi^i X_0 Y^i - X_i(\chi^i Y^0 + \epsilon_{ijk} \theta^j Y^k) - \chi^i Y^0 X^i + Y^i(\chi^i X^0 + \epsilon_{ijk} \theta^j X^k) + O(\omega^2)\\ &= X_{\mu} Y^{\mu} + O(\omega^2) \end{align*}\]with all terms cancelling. As a result, contracting two 4-vectors gives a scalar as required.
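The intermediate claim, that \(\psi_L^{\dagger} \bar{\sigma}^{\mu} \chi_L\) transforms as a 4-vector, can also be checked non-infinitesimally (my own sketch): with \(A = L(A)\) in the fundamental left-handed representation, one finds \(A^{\dagger} \bar{\sigma}^{\mu} A = \Lambda(A)^{\mu}{}_{\nu} \bar{\sigma}^{\nu}\) for \(\Lambda(A)\) given by the trace formula above:

```python
import numpy as np
from scipy.linalg import expm

# Check A^dag sigma_bar^mu A = Lambda(A)^mu_nu sigma_bar^nu, which makes
# psi_L^dag sigma_bar^mu chi_L transform as a 4-vector.
s0 = np.eye(2, dtype=complex)
s = [np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[0, -1j], [1j, 0]]),
     np.array([[1, 0], [0, -1]], dtype=complex)]
sig_bar = [s0] + [-m for m in s]
eta = np.diag([1.0, -1.0, -1.0, -1.0])
sig_lower = [eta[nu, nu] * (s0 if nu == 0 else s[nu - 1]) for nu in range(4)]

theta = np.array([0.2, -0.5, 1.0])
chi = np.array([0.3, 0.8, -0.1])
A = expm(sum((-0.5j * theta[i] - 0.5 * chi[i]) * s[i] for i in range(3)))  # A = L(A)

Lam = np.array([[0.5 * np.trace(sig_bar[mu] @ A @ sig_lower[nu] @ A.conj().T).real
                 for nu in range(4)] for mu in range(4)])

for mu in range(4):
    rhs = sum(Lam[mu, nu] * sig_bar[nu] for nu in range(4))
    assert np.allclose(A.conj().T @ sig_bar[mu] @ A, rhs)
print("sigma_bar^mu transforms as a 4-vector")
```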
For our initial action, we will start with only ``kinetic''-like spinor terms, involving only first-order derivative terms:
\[\begin{equation} \label{eqn:Skinetic} S[\Psi] = i\int_M d^4 x \, \left(\sum_{i=1}^{N_L} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} + \sum_{j=1}^{N_R} \chi_{R, j}^{\dagger} \sigma^{\mu} \partial_{\mu} \chi_{R, j}\right) \end{equation}\]Importantly, by construction, this action will satisfy Equation \ref{eqn:Sinvar} restricted to only the spacetime symmetry group \(G^{(0)}\).
The factor of \(i\) in this action ensures that the integrand is real-valued up to a total derivative. Assuming appropriate boundary conditions (such that the total derivative term vanishes under integration), this ensures that \(S[\Psi] \in \mathbb{R}\). In more detail, see that
\[\begin{align*} (i\psi_L^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_L)^{\dagger} &= -i(\partial_{\mu} \psi_L^{\dagger}) \bar{\sigma}^{\mu} \psi_L\\ &= i\psi_L^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_L + \partial_{\mu}(-i\psi_L^{\dagger} \bar{\sigma}^{\mu} \psi_L) \end{align*}\]hence, up to a total derivative, \(i\psi_L^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_L\) is real-valued.
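A discretized analogue (my own sketch) makes the total-derivative argument concrete: on a periodic lattice, the central-difference matrix is antisymmetric, which is the lattice version of integration by parts with vanishing boundary terms, and the action comes out exactly real:

```python
import numpy as np

# On a periodic 1D lattice, the central difference matrix D is antisymmetric,
# so S = i sum psi^dag sigma_bar (D psi) is real (here sigma_bar^0 = I is used
# for the single lattice direction).
n = 16
D = (np.eye(n, k=1) - np.eye(n, k=-1)) / 2.0
D[0, -1], D[-1, 0] = -0.5, 0.5          # periodic wrap-around keeps D antisymmetric

rng = np.random.default_rng(3)
psi = rng.standard_normal((n, 2)) + 1j * rng.standard_normal((n, 2))
sbar = np.eye(2, dtype=complex)

S = 1j * np.einsum('xa,ab,xy,yb->', psi.conj(), sbar, D, psi)
print(abs(S.imag) < 1e-10)  # True: the action is real
```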
Terms of the form \(\psi_L^{\dagger} \chi_R\) and \(\chi_R^{\dagger} \psi_L\) will be later used as interaction terms for our theory (Section 5), corresponding to Yukawa terms in the Standard Model.
Here and below, we use the notation \(\psi_{L, i} \in C^{\infty}(M, V^{(L, i)})\) and \(\chi_{R, j} \in C^{\infty}(M, V^{(R, j)})\).
We will now begin to consider gauge invariance. Currently, the field content consists of left-handed and right-handed spinors that only possess a spacetime sector \(V^{(L, i)} = V^{(L, i, 0)} = V_{1, 0}\) (for all \(i = 1, \ldots, N_L\)) and \(V^{(R, j)} = V^{(R, j, 0)} = V_{0, 1}\) (for all \(j = 1, \ldots, N_R\)) respectively.
When considering gauge groups, we promote the output spaces of these spinors:
\[V^{(L, i)} = V^{(L, i, 0)} \to V^{(L, i, 0)} \otimes V^{(L, i, 1)} \otimes \cdots \otimes V^{(L, i, K)},\] \[V^{(R, j)} = V^{(R, j, 0)} \to V^{(R, j, 0)} \otimes V^{(R, j, 1)} \otimes \cdots \otimes V^{(R, j, K)}.\]As outlined in the introduction, we will consider gauge groups \((G^{(1)}, \ldots, G^{(K)})\) that come with (unitary) left-handed representations \(\{\rho^{(L, i, k)}\}_{i, k}\) each acting on the gauge sector \(V^{(L, i, k)}\) of \(\psi_{L, i} \in C^{\infty}(M, V^{(L, i)})\), and (unitary) right-handed representations \(\{\rho^{(R, j, k)}\}_{j, k}\) each acting on the gauge sector \(V^{(R, j, k)}\) of \(\chi_{R, j} \in C^{\infty}(M, V^{(R, j)})\). By Equation \ref{eqn:Sinvar}, we require that our action \(S[\Psi]\) is invariant to these representations.
We therefore ask: how can we modify the kinetic action \ref{eqn:Skinetic} to make it gauge invariant? A starting point is to understand the extent to which it fails to be gauge invariant. See that, for a gauge element \(g \in C^{\infty}(M, G^{(k)})\),
\[\begin{align*} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} &\mapsto_g \psi_{L, i}^{\dagger} \rho_{g}^{(L, i, k)\dagger} \bar{\sigma}^{\mu} \partial_{\mu} (\rho_{g}^{(L, i, k)} \psi_{L, i})\\ &= \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} + \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} [\rho_{g}^{(L, i, k)\dagger} (\partial_{\mu} \rho_{g}^{(L, i, k)})] \psi_{L, i} \end{align*}\]where in the second line we have used unitarity and that \(\rho_g \bar{\sigma}^{\mu} \equiv \bar{\sigma}^{\mu} \rho_g\) since \(\rho_g\) does not interact with the spacetime sector. The second term captures the failure to achieve gauge invariance. What can we add to our initial action Equation \ref{eqn:Skinetic} in order to remedy this? In general, it appears we must modify the kinetic term
\[\psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} \to \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i}\]for some 4-vector \(X^{(L, i, k)}\) (must be a 4-vector as otherwise the term would no longer be a scalar) that interacts with the gauge sector \(V^{(L, i, k)}\). Denote its transformation under \(G^{(k)}\) as \(X_{\mu}^{(L, i, k)} \mapsto_g \tilde{X}_{\mu}^{(L, i, k)}\). Then this new term transforms as:
\[\begin{align*} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i} \mapsto_g \; &\psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} \partial_{\mu} \psi_{L, i} + \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} [\rho_{g}^{(L, i, k)\dagger} (\partial_{\mu} \rho_{g}^{(L, i, k)})] \psi_{L, i}\\ &- \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} [\rho_g^{(L, i, k)\dagger} \tilde{X}_{\mu}^{(L, i, k)} \rho_g^{(L, i, k)}] \psi_{L, i} \end{align*}\]One can see that we will achieve gauge invariance, meaning
\[\psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i} \mapsto_g \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} (\partial_{\mu} - X_{\mu}^{(L, i, k)}) \psi_{L, i}\]if we choose \(X^{(L, i, k)}\) to transform as
\[X_{\mu}^{(L, i, k)} \mapsto_g \tilde{X}_{\mu}^{(L, i, k)} = \rho_g^{(L, i, k)} X_{\mu}^{(L, i, k)} \rho_g^{(L, i, k)\dagger} + (\partial_{\mu} \rho_g^{(L, i, k)}) \rho_g^{(L, i, k)\dagger}\]Infinitesimal gauge transformations. Writing \(g = \exp(\alpha)\) for \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\), note that \(\rho_g^{(L, i, k)} = \exp(d_{\alpha}^{(L, i, k)})\) for a representation \(d^{(L, i, k)}\) of \(\mathfrak{g}^{(k)}\). Unitarity of \(\rho^{(L, i, k)}\) implies that \(d^{(L, i, k)}\) is anti-Hermitian.
See that by Taylor expansion, for infinitesimal \(\alpha\), \(X_{\mu}^{(L, i, k)}\) transforms under \(g = \exp(\alpha)\) as
\[X_{\mu}^{(L, i, k)} \mapsto_{g} X_{\mu}^{(L, i, k)} + [d_{\alpha}^{(L, i, k)}, X_{\mu}^{(L, i, k)}] + \partial_{\mu} d_{\alpha}^{(L, i, k)} + O(\alpha^2)\]One can see that this transformation rule will take a particularly convenient form if we write
\[X_{\mu}^{(L, i, k)} = d^{(L, i, k)}_{A_{\mu}^{(k)}}\]for some \(A_{\mu}^{(k)} \in \mathfrak{g}^{(k)}\), which gives
\[\begin{align*} X_{\mu}^{(L, i, k)} &\mapsto_g X_{\mu}^{(L, i, k)} + [d_{\alpha}^{(L, i, k)}, d^{(L, i, k)}_{A_{\mu}^{(k)}}] + \partial_{\mu} d_{\alpha}^{(L, i, k)} + O(\alpha^2)\\ &= X_{\mu}^{(L, i, k)} + d_{[\alpha, A_{\mu}^{(k)}]}^{(L, i, k)} + \partial_{\mu} d_{\alpha}^{(L, i, k)} + O(\alpha^2) \end{align*}\]which can equivalently be described by a transformation of \(A_{\mu}^{(k)}\):
\[A_{\mu}^{(k)} \mapsto_{g} A_{\mu}^{(k)} + [\alpha, A_{\mu}^{(k)}] + \partial_{\mu} \alpha + O(\alpha^2)\]Hence, infinitesimally, we can view the role of all of \(\{X^{(L, i, k)}\}_{i=1}^{N_L} \cup \{X^{(R, j, k)}\}_{j=1}^{N_R}\) as reducible to a single gauge field \(A^{(k)}\), transforming via the above rule.
Achieving gauge invariance. Repeating this for each gauge group \(k=1, \ldots, K\), the above process corresponds to extending our field content
\[\Psi \mapsto \Psi \cup \{A^{(k)}\}_{k=1}^{K}\]to include a gauge field \(A^{(k)} \in C^{\infty}(M, \mathfrak{g}^{(k)})\) for each gauge group \(G^{(k)}\), with \(g = \exp(\alpha) \in C^{\infty}(M, G^{(k)})\) transforming the field content as
\[\psi_{L, i} \mapsto_g \rho_g^{(L, i, k)} \psi_{L, i}, \quad \chi_{R, j} \mapsto_g \rho_g^{(R, j, k)} \chi_{R, j}, \quad A_{\mu}^{(k)} \mapsto_g A_{\mu}^{(k)} + [\alpha, A_{\mu}^{(k)}] + \partial_{\mu} \alpha + O(\alpha^2)\](and leaving \(A^{(k')}\) invariant for \(k' \neq k\)).
This procedure for achieving gauge invariance is called minimal coupling, since it is essentially the minimal modification of the kinetic action that achieves gauge invariance.
We can write the new gauge invariant action as
\[\begin{equation} \label{eqn:Sgauged} S[\Psi] = i\int_M d^4 x \, \left(\sum_{i=1}^{N_L} \psi_{L, i}^{\dagger} \bar{\sigma}^{\mu} D_{\mu}^{(L, i)} \psi_{L, i} + \sum_{j=1}^{N_R} \chi_{R, j}^{\dagger} \sigma^{\mu} D_{\mu}^{(R, j)} \chi_{R, j}\right) \end{equation}\]where we have defined the covariant derivatives:
\[D_{\mu}^{(L, i)} := \partial_{\mu} - \sum_{k=1}^{K} g_k d^{(L, i, k)}_{A_{\mu}^{(k)}}, \qquad D_{\mu}^{(R, j)} := \partial_{\mu} - \sum_{k=1}^{K} g_k d^{(R, j, k)}_{A_{\mu}^{(k)}}\]and where we have introduced a coupling constant \(g_k\) for each gauge field \(A^{(k)}\), equivalent to the rescaling \(A^{(k)} \mapsto g_k A^{(k)}\), which slightly modifies the transformation rule to:
\[A_{\mu}^{(k)} \mapsto_{g} A_{\mu}^{(k)} + [\alpha, A_{\mu}^{(k)}] + \frac{1}{g_k}\partial_{\mu} \alpha + O(\alpha^2)\]Field strengths. For our discussion of anomalies, it will be useful to introduce the field strengths \(F_{\mu\nu}^{(L, i)}\):
\[F_{\mu\nu}^{(L, i)} := -[D_{\mu}^{(L, i)}, D_{\nu}^{(L, i)}] = \sum_{k=1}^{K} g_k d^{(L, i, k)}_{f_{\mu\nu}^{(k)}} =: \sum_{k=1}^{K} g_k F_{\mu\nu}^{(L, i, k)}\]where we have introduced \(f_{\mu\nu}^{(k)} \in C^{\infty}(M, \mathfrak{g}^{(k)})\), defined
\[f_{\mu\nu}^{(k)} := \partial_{\mu} A_{\nu}^{(k)} - \partial_{\nu} A_{\mu}^{(k)} - g_k[A_{\mu}^{(k)}, A_{\nu}^{(k)}]\]and \(F_{\mu\nu}^{(L, i, k)} := d^{(L, i, k)}_{f_{\mu\nu}^{(k)}}\).
In the above, we have restricted our attention to satisfying Equation \ref{eqn:Sinvar}. In this section, we will address the second condition of Equation \ref{eqn:Dinvar}: the measure \(\text{D}\Psi\) must be invariant to the symmetry group \(G\) of our theory. If the measure fails to be invariant, we say that there is an anomaly in our theory. The condition that the anomaly vanishes places strong constraints on the group content \(G\) and representation content \(\rho\) that we can consider, as we will see shortly.
Gauge transformation rule for spinor measures. The study of whether \(\text{D}\Psi\) is invariant requires proposing a specific definition for \(\text{D}\Psi\). As mentioned in the introduction, we can do so by expanding \(\Psi\) in a basis of \(C^{\infty}(M, V)\) and performing some form of regulation. To construct such a basis, we will consider the Dirac operators
\[\slashed{D}^{(L, i)} := \gamma^{\mu} D_{\mu}^{(L, i)}, \quad \slashed{D}^{(R, j)} := \gamma^{\mu} D_{\mu}^{(R, j)}\]which have eigenfunctions \(\{\phi_{n}^{(i)}\}_{n}\) and \(\{\xi_{m}^{(j)}\}_m\) satisfying
\[i\slashed{D}^{(L, i)} \phi_{n}^{(i)} = \lambda_n^{(i)} \phi_n^{(i)}, \qquad i\slashed{D}^{(R, j)} \xi_{m}^{(j)} = \mu_m^{(j)} \xi_m^{(j)}\]for eigenvalues \(\{\lambda_n^{(i)}\}_n\) and \(\{\mu_m^{(j)}\}_m\) respectively.
Since \(\{\phi_{n}^{(i)}\}_{n}\) and \(\{\xi_{m}^{(j)}\}_m\) are each a basis for Dirac spinors, we can project them to a basis for left-handed and right-handed spinors using the projection matrices
\[P_L := \frac{1}{2}(I + \gamma^5), \qquad P_R : = \frac{1}{2}(I - \gamma^5)\]for the \(4 \times 4\) gamma matrix \(\gamma^5 = \begin{bmatrix} I&0\\ 0&-I \end{bmatrix}\). Since \(\gamma^{\mu} \gamma^5 = -\gamma^5 \gamma^{\mu}\), then
\[i\slashed{D}^{(L, i)} \phi_{n}^{(i)} = \lambda_n^{(i)} \phi_n^{(i)} \iff i\slashed{D}^{(L, i)} (\gamma^5 \phi_{n}^{(i)}) = -\lambda_n^{(i)} (\gamma^5 \phi_n^{(i)})\]and similarly for \(\slashed{D}^{(R, j)}\). This implies that the projected eigenfunctions \(\{P_L \phi^{(i)}_n\}_n\) contain two copies of \(\{P_L \phi^{(i)}_n\}_{n: \lambda_n > 0}\) since \(P_L \gamma^5 = \gamma^5 P_L = P_L\). To avoid overcompleteness and keep orthonormality, we must restrict to some subset \(\{P_L \phi_{\gamma}^{(i)}\}_{\gamma}\) of projected eigenfunctions for left-handed spinors (and \(\{P_R \xi_{\beta}^{(j)}\}_{\beta}\) for right-handed spinors). This lets us write
\[\psi_{L, i}(x) = \sum_{\gamma} a_{\gamma}^{(i)} P_L \phi_{\gamma}^{(i)}(x), \qquad \psi_{L, i}^{\dagger}(x) = \sum_{\gamma} b_{\gamma}^{(i)} \phi_{\gamma}^{(i)\dagger}(x) P_L\] \[\chi_{R, j}(x) = \sum_{\beta} c_{\beta}^{(j)} P_R \xi_{\beta}^{(j)}(x), \qquad \chi_{R, j}^{\dagger}(x) = \sum_{\beta} d_{\beta}^{(j)} \xi_{\beta}^{(j)\dagger}(x) P_R\]By orthonormality, we have
\[\int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) P_L \phi_{\gamma'}^{(i)}(x) = \delta_{\gamma\gamma'}, \qquad \int d^4 x \, \xi_{\beta}^{(j)\dagger}(x) P_R \xi_{\beta'}^{(j)}(x) = \delta_{\beta\beta'}\]We can then take
\[\text{D}\psi_{L, i} := \prod_{\gamma} da_{\gamma}^{(i)}, \qquad \text{D}\psi_{L, i}^{\dagger} := \prod_{\gamma} db_{\gamma}^{(i)}\] \[\text{D}\chi_{R, j} := \prod_{\beta} dc_{\beta}^{(j)}, \qquad \text{D}\chi_{R, j}^{\dagger} := \prod_{\beta} dd_{\beta}^{(j)}\]To achieve Equation \ref{eqn:Dinvar}, we wish to understand how these measures transform under the gauge transformation associated with \(g = \exp(\alpha) \in C^{\infty}(M, G^{(k)})\) (for \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\)) for arbitrary \(k\):
\[\psi_{L, i} \mapsto_{g} \psi_{L, i} + d^{(L, i, k)}_{\alpha} \psi_{L, i} + O(\alpha^2), \qquad \chi_{R, j} \mapsto_{g} \chi_{R, j} + d^{(R, j, k)}_{\alpha} \chi_{R, j} + O(\alpha^2)\]Using orthonormality, one can show that this transformation is equivalent to transforming the expansion coefficients as
\[a_{\gamma}^{(i)} \mapsto_g \sum_{\gamma'} a_{\gamma'}^{(i)} (\delta_{\gamma\gamma'} + A_{\gamma\gamma'}^{(i, k)}(\alpha)), \qquad b_{\gamma}^{(i)} \mapsto_g \sum_{\gamma'} b_{\gamma'}^{(i)} (\delta_{\gamma\gamma'} + A_{\gamma\gamma'}^{(i, k)\dagger}(\alpha))\] \[c_{\beta}^{(j)} \mapsto_g \sum_{\beta'} c_{\beta'}^{(j)} (\delta_{\beta\beta'} + B_{\beta\beta'}^{(j, k)}(\alpha)), \qquad d_{\beta}^{(j)} \mapsto_g \sum_{\beta'} d_{\beta'}^{(j)} (\delta_{\beta\beta'} + B_{\beta\beta'}^{(j, k)\dagger}(\alpha))\]defining
\[A_{\gamma\gamma'}^{(i, k)}(\alpha) := \int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L \phi_{\gamma'}^{(i)}(x)\] \[B_{\beta\beta'}^{(j, k)}(\alpha) := \int d^4 x \, \xi_{\beta}^{(j)\dagger}(x) d_{\alpha}^{(R, j, k)} P_R \xi_{\beta'}^{(j)}(x)\]This results in the transformation rules
\[\text{D}\psi_{L, i} \mapsto_{g} \det(I+A^{(i, k)}(\alpha))^{-1} \text{D}\psi_{L, i}, \qquad \text{D}\psi_{L, i}^{\dagger} \mapsto_{g} \det(I+A^{(i, k)\dagger}(\alpha))^{-1} \text{D}\psi_{L, i}^{\dagger}\] \[\text{D}\chi_{R, j} \mapsto_{g} \det(I+B^{(j, k)}(\alpha))^{-1} \text{D}\chi_{R, j}, \qquad \text{D}\chi_{R, j}^{\dagger} \mapsto_{g} \det(I+B^{(j, k)\dagger}(\alpha))^{-1} \text{D}\chi_{R, j}^{\dagger}\]Using \(\det(I+X) \approx \det(e^X) = e^{\text{tr}(X)}\) for infinitesimal \(X\), we have that overall:
\[\text{D}\psi_{L, i} \text{D}\psi_{L, i}^{\dagger} \mapsto_{g} J_g^{(L, i, k)} \text{D}\psi_{L, i} \text{D}\psi_{L, i}^{\dagger}, \qquad J_g^{(L, i, k)} := \exp(-2\text{Re}(\text{tr}(A^{(i, k)}(\alpha))))\] \[\text{D}\chi_{R, j} \text{D}\chi_{R, j}^{\dagger} \mapsto_{g} J_g^{(R, j, k)} \text{D}\chi_{R, j} \text{D}\chi_{R, j}^{\dagger}, \qquad J_g^{(R, j, k)} := \exp(-2\text{Re}(\text{tr}(B^{(j, k)}(\alpha))))\]The total anomaly associated with gauge elements \((g_1, \ldots, g_K) \in G^{(1)} \times \cdots \times G^{(K)}\) (with \(g_k = \exp(\alpha_k)\)) is
\[\begin{align*} \mathcal{A}(g_1, \ldots, g_K) &:= \prod_{k=1}^{K} \left[\left(\prod_{i=1}^{N_L} J_{g_k}^{(L, i, k)}\right) \left(\prod_{j=1}^{N_R} J_{g_k}^{(R, j, k)}\right)\right]\\ &= \exp\left(-2\sum_{k=1}^{K} \left[\sum_{i=1}^{N_L} \text{Re}(\text{tr}(A^{(i, k)}(\alpha_k))) + \sum_{j=1}^{N_R} \text{Re}(\text{tr}(B^{(j, k)}(\alpha_k)))\right]\right) \end{align*}\]with
\[\text{D}\Psi \mapsto_{(g_1, \ldots, g_K)} \; \mathcal{A}(g_1, \ldots, g_K) \text{D}\Psi\]Fujikawa regulation. One problem is that the trace terms, e.g.
\[\text{tr}(A^{(i, k)}(\alpha)) = \sum_{\gamma} \int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L \phi_{\gamma}^{(i)}(x)\]will generally diverge. One can see this problem as downstream of the fact that we defined e.g. \(\text{D}\psi_{L, i}\) to be an infinite product over expansion coefficients. One way of resolving such divergences is to apply a cutoff to make the product finite (as we will do for Wilsonian renormalization in Section 8); however, applying a cutoff breaks the gauge invariance of our theory. Instead, we will consider Fujikawa regulation, which maintains gauge invariance. This involves replacing such trace terms with their regulated variants:
\[\begin{align*} \text{tr}(A^{(i, k)}(\alpha)) \to \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &:= \sum_{\gamma} e^{-\lambda_{\gamma}^2/\Lambda^2}\int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L \phi_{\gamma}^{(i)}(x)\\ &= \sum_{\gamma} \int d^4 x \, \phi_{\gamma}^{(i)\dagger}(x) d_{\alpha}^{(L, i, k)} P_L e^{(\slashed{D}^{(L, i)})^2/\Lambda^2}\phi_{\gamma}^{(i)}(x) \end{align*}\]for some regularization parameter \(\Lambda\). We will eventually take \(\Lambda \to \infty\) at the end. The regulated anomaly is then
\[\mathcal{A}^{(\Lambda)}(g_1, \ldots, g_K) := \exp\left(-2\sum_{k=1}^{K} \left[\sum_{i=1}^{N_L} \text{Re}(\text{tr}_{\Lambda}(A^{(i, k)}(\alpha_k))) + \sum_{j=1}^{N_R} \text{Re}(\text{tr}_{\Lambda}(B^{(j, k)}(\alpha_k)))\right]\right)\]We will now determine the required conditions on group content \(G\) and representation content \(\rho\) to ensure that
\[\begin{equation} \label{eqn:reganom} \lim_{\Lambda \to \infty} \mathcal{A}^{(\Lambda)}(g_1, \ldots, g_K) = 1 \quad \forall \; (g_1, \ldots, g_K) \in G^{(1)} \times \cdots \times G^{(K)} \end{equation}\]holds, such that the regulated measure \(\text{D}\Psi^{(\Lambda)}\) is gauge invariant. Note that \(\lim_{\Lambda \to \infty} \mathcal{A}^{(\Lambda)} \neq \mathcal{A}\).
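As an aside, the determinant–trace approximation \(\det(I+X) \approx e^{\text{tr}(X)}\) underlying the Jacobians above is easy to check numerically for a small random matrix (an illustrative sketch of our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
X = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) * 1e-4

# Grassmann measures pick up inverse determinants; to leading order in X,
# det(I + X)^{-1} det(I + X^dagger)^{-1} = exp(-2 Re tr X)
lhs = 1.0 / (np.linalg.det(np.eye(n) + X) * np.linalg.det(np.eye(n) + X.conj().T))
rhs = np.exp(-2 * np.real(np.trace(X)))
assert abs(lhs - rhs) / abs(rhs) < 1e-6
```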
Computing the anomaly. We can write this regulated trace as
\[\begin{align*} \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &= \int d^4 x \, \left[\sum_{\gamma} \phi_{\gamma}^{(i)\dagger} d_{\alpha}^{(L, i, k)} P_L e^{(\slashed{D}^{(L, i)})^2/\Lambda^2}\phi_{\gamma}^{(i)}\right](x)\\ &= \int d^4 x \, \text{tr}(d_{\alpha}^{(L, i, k)} P_L e^{(\slashed{D}^{(L, i)})^2/\Lambda^2})(x)\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-ik\cdot x} e^{(\slashed{D}^{(L, i)})^2/\Lambda^2} e^{ik\cdot x}) \end{align*}\]where \(\text{tr}_{s, g}\) is a trace over spacetime and gauge indices (i.e. not taken over function space).
Now see that
\[\begin{align*} (\slashed{D}^{(L, i)})^2 &= (D^{(L, i)})^2 +\frac{1}{2} \gamma^{\mu} \gamma^{\nu} [D_{\mu}^{(L, i)}, D_{\nu}^{(L, i)}]\\ &= (D^{(L, i)})^2 - \frac{1}{2} \gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)} \end{align*}\]This lets us write
\[\begin{align*} \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-ik\cdot x} e^{(\slashed{D}^{(L, i)})^2/\Lambda^2} e^{ik\cdot x})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2} e^{-ik\cdot x} e^{(D^{(L, i)})^2/\Lambda^2} e^{ik\cdot x})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2} e^{(D^{(L, i)} + ik)^2/\Lambda^2})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2} e^{(\partial + ik)^2/\Lambda^2})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, e^{-k^2/\Lambda^2} \text{tr}_{s, g}(d_{\alpha}^{(L, i, k)} P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2})\\ &= \int d^4 x \frac{d^4 k}{(2\pi)^4} \, e^{-k^2/\Lambda^2} \text{tr}_{g}(d_{\alpha}^{(L, i, k)} \text{tr}_s(P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2})) \end{align*}\]where in the fourth line we have translated \(k_{\mu} \to k_{\mu} - i\sum_{k'=1}^{K} g_{k'} d_{A_{\mu}^{(k')}}^{(L, i, k')}\).
We have that
\[\begin{align*} \text{tr}_s(P_L e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) &= \frac{1}{2} \text{tr}_s(e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) + \frac{1}{2} \text{tr}_s(\gamma^5 e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2})\\ &= 2 - \frac{1}{2\Lambda^4} \left(F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu} + iF_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu}\right) + O(1/\Lambda^6) \end{align*}\]by using
\[\text{tr}_s(e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) = 4 - \frac{1}{\Lambda^4} F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu} + O(1/\Lambda^6)\] \[\text{tr}_s(\gamma^5 e^{-\gamma^{\mu} \gamma^{\nu} F_{\mu\nu}^{(L, i)}/2\Lambda^2}) = -\frac{i}{\Lambda^4} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu} + O(1/\Lambda^6)\]and where \(*\) is the Hodge star operator, with \(F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu} \equiv \frac{1}{2} \epsilon^{\mu\nu\sigma\rho} F_{\mu\nu}^{(L, i)} F^{(L, i)}_{\sigma\rho}\).
We have also made use of the identities
\[\text{tr}(\gamma^{\mu} \gamma^{\nu}) = 4\eta^{\mu\nu}, \quad \text{tr}(\gamma^{\mu} \gamma^{\nu} \gamma^{\sigma} \gamma^{\rho}) = 4(\eta^{\mu\nu} \eta^{\sigma\rho} - \eta^{\mu\sigma} \eta^{\nu\rho} + \eta^{\mu\rho} \eta^{\nu\sigma})\] \[\text{tr}(\gamma^5 \gamma^{\mu} \gamma^{\nu}) = 0, \quad \text{tr}(\gamma^5 \gamma^{\mu} \gamma^{\nu} \gamma^{\sigma} \gamma^{\rho}) = -4i\epsilon^{\mu\nu\sigma\rho}\]which hold in any representation of the gamma matrices (taking \(\epsilon^{0123} = +1\)).
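These trace identities can be verified numerically; the snippet below is an illustrative sketch of our own, with a chiral representation chosen so that \(\gamma^5 = \text{diag}(I, -I)\) as in the text and \(\epsilon^{0123} = +1\) (the sign choices in the spatial gamma matrices are our own convention):

```python
import numpy as np

eta = np.diag([1, -1, -1, -1]).astype(complex)
I2 = np.eye(2, dtype=complex)
Z = np.zeros((2, 2), dtype=complex)
sig = [np.array([[0, 1], [1, 0]], dtype=complex),
       np.array([[0, -1j], [1j, 0]], dtype=complex),
       np.array([[1, 0], [0, -1]], dtype=complex)]

# a chiral representation of the Clifford algebra whose gamma^5 equals the
# diag(I, -I) used in the text
gam = [np.block([[Z, I2], [I2, Z]])] + [np.block([[Z, -s], [s, Z]]) for s in sig]
gam5 = np.block([[I2, Z], [Z, -I2]])
PL = (np.eye(4, dtype=complex) + gam5) / 2
assert np.allclose(PL @ PL, PL)   # P_L is a projector

def eps4(p):
    """Levi-Civita symbol with eps^{0123} = +1 (0 on repeated indices)."""
    p = list(p)
    if len(set(p)) < 4:
        return 0
    sign = 1
    for i in range(4):
        for j in range(i + 1, 4):
            if p[i] > p[j]:
                sign = -sign
    return sign

for m in range(4):
    assert np.allclose(gam[m] @ gam5, -gam5 @ gam[m])   # gamma^5 anticommutes
    for n in range(4):
        assert np.isclose(np.trace(gam[m] @ gam[n]), 4 * eta[m, n])
        assert np.isclose(np.trace(gam5 @ gam[m] @ gam[n]), 0)
        for a in range(4):
            for b in range(4):
                prod = gam[m] @ gam[n] @ gam[a] @ gam[b]
                expect = 4 * (eta[m, n] * eta[a, b] - eta[m, a] * eta[n, b]
                              + eta[m, b] * eta[n, a])
                assert np.isclose(np.trace(prod), expect)
                assert np.isclose(np.trace(gam5 @ prod), -4j * eps4((m, n, a, b)))
```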
This lets us write
\[\begin{align*} \text{tr}_{\Lambda}(A^{(i, k)}(\alpha)) &= \int d^4 x \left[\frac{\Lambda^4}{8\pi^2} \text{tr}(d_{\alpha}^{(L, i, k)}) - \frac{1}{32\pi^2} \text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu} + i d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu}) + O(1/\Lambda^2)\right] \end{align*}\]where we have used \(\int \frac{d^4 k}{(2\pi)^4} \, e^{-k^2/\Lambda^2} = \frac{\Lambda^4}{16\pi^2}\).
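The quoted Gaussian momentum integral can be checked numerically, since the 4d Euclidean integral factorizes into four 1d integrals (an illustrative sketch; the value of \(\Lambda\) is arbitrary):

```python
import numpy as np

Lam = 2.7    # arbitrary value for the regulator
dk = 1e-3
k = np.arange(-60.0, 60.0, dk)
one_d = np.sum(np.exp(-(k / Lam) ** 2)) * dk   # 1d Gaussian integral, numerically

# the 4d Euclidean integral factorizes into four 1d Gaussian integrals
lhs = one_d ** 4 / (2 * np.pi) ** 4
rhs = Lam ** 4 / (16 * np.pi ** 2)
assert abs(lhs - rhs) / rhs < 1e-6
```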
Relevant to the anomaly is the real part of this regulated trace. By anti-Hermiticity of \(d^{(L, i, k)}\) (due to unitarity of \(\rho^{(L, i, k)}\)), both \(\text{tr}(d_{\alpha}^{(L, i, k)})\) and \(\text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} F^{(L, i)\mu\nu})\) are purely imaginary, while the third term satisfies
\[(-i\text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu}))^* = -i\text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu})\]meaning it is purely real. As a result, we have that
\[J_g^{(L, i, k)} = \exp(-2\text{Re}(\text{tr}_{\Lambda}(A^{(i, k)}(\alpha)))) = \exp\left(\frac{i}{16\pi^2} \int d^4 x \, \text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} * F^{(L, i)\mu\nu})\right)\]for the (regulated) Jacobian. Now recall that we can write
\[F_{\mu\nu}^{(L, i)} = \sum_{k=1}^{K} g_k f_{\mu\nu}^{(k)a} d_{T_a^{(k)}}^{(L, i, k)}\]for some choice of generators \(\{T_a^{(k)}\}_a \subset \mathfrak{g}_k\), and writing \(f_{\mu\nu}^{(k)} = f_{\mu\nu}^{(k)a} T_a^{(k)}\). This lets us explicitly write
\[\begin{align*} J_g^{(L, i, k)} &= \exp\left(\frac{i}{32\pi^2} \epsilon^{\mu\nu\sigma\rho} \int d^4 x \, \text{tr}(d_{\alpha}^{(L, i, k)} F_{\mu\nu}^{(L, i)} F^{(L, i)}_{\sigma\rho})\right)\\ &= \exp\left(\frac{i}{32\pi^2} \epsilon^{\mu\nu\sigma\rho} \sum_{r=1}^{K} \sum_{s=1}^{K} g_r g_s \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(s)}}^{(L, i, s)}) \int d^4 x \, \alpha^a f_{\mu\nu}^{(r) b} f_{\sigma\rho}^{(s) c}\right) \end{align*}\]The right-handed Jacobian is identical but with a minus sign in the exponent, which stems from the fact that \(P_R = \frac{1}{2}(1 - \gamma^5)\) whereas \(P_L = \frac{1}{2}(1+\gamma^5)\). Namely, we have
\[J_g^{(R, j, k)} = \exp\left(-\frac{i}{32\pi^2} \epsilon^{\mu\nu\sigma\rho} \sum_{r=1}^{K} \sum_{s=1}^{K} g_r g_s \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(s)}}^{(R, j, s)}) \int d^4 x \, \alpha^a f_{\mu\nu}^{(r) b} f_{\sigma\rho}^{(s) c}\right)\]This gives the total regulated anomaly:
\[\begin{align*} \mathcal{A}^{(\Lambda)}(g_1, \ldots, g_K) &= \exp\Bigg(\frac{i}{32\pi^2} \epsilon^{\mu\nu\sigma\rho} \sum_{k=1}^{K} \sum_{r=1}^{K} \sum_{s=1}^{K} g_r g_s \int d^4 x \, \alpha_k^a f_{\mu\nu}^{(r)b} f_{\sigma\rho}^{(s)c}\\ &\qquad\qquad \left[\sum_{i=1}^{N_L} \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(s)}}^{(L, i, s)}) - \sum_{j=1}^{N_R} \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(s)}}^{(R, j, s)})\right]\Bigg) \end{align*}\]If we require the anomaly to cancel for any values of the couplings \(\{g_k\}_{k=1}^{K}\) and any gauge elements \(\{\alpha_k\}_{k=1}^{K}\), then the total anomaly of the theory cancels (i.e. Equation \ref{eqn:reganom} is satisfied) iff
\[\begin{equation} \label{eqn:anomalycond} \boxed{\sum_{i=1}^{N_L} C_{L, i}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(s)}}^{(L, i, s)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(s)}}^{(R, j, s)}) \quad \forall \; \; k, r, s \quad \forall \; \; a, b, c} \end{equation}\]where we have factored out dimensional factors from the trace over redundant dimensions, captured by
\[C_{L, i}(a, b, \ldots, c) := \prod_{k \notin \{a, b, \ldots, c\}} \dim d^{(L, i, k)}, \qquad C_{R, j}(a, b, \ldots, c) := \prod_{k \notin \{a, b, \ldots, c\}} \dim d^{(R, j, k)}\]We can split this condition into four distinct cases:
\[\sum_{i=1}^{N_L} C_{L, i}(k) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} \{d_{T_b^{(k)}}^{(L, i, k)}, d_{T_c^{(k)}}^{(L, i, k)}\}) = \sum_{j=1}^{N_R} C_{R, j}(k) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} \{d_{T_b^{(k)}}^{(R, j, k)}, d_{T_c^{(k)}}^{(R, j, k)}\}) \quad \forall \; \; k,\] \[\sum_{i=1}^{N_L} C_{L, i}(k, r) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)}) \text{tr}(d_{T_b^{(r)}}^{(L, i, r)} d_{T_c^{(r)}}^{(L, i, r)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)}) \text{tr}(d_{T_b^{(r)}}^{(R, j, r)} d_{T_c^{(r)}}^{(R, j, r)}) \quad \forall \; \; r \neq k,\] \[\sum_{i=1}^{N_L} C_{L, i}(k, r) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)} d_{T_b^{(k)}}^{(L, i, k)}) \text{tr}(d_{T_c^{(r)}}^{(L, i, r)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)} d_{T_b^{(k)}}^{(R, j, k)}) \text{tr}(d_{T_c^{(r)}}^{(R, j, r)}) \quad \forall \; \; r \neq k,\] \[\sum_{i=1}^{N_L} C_{L, i}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(L, i, k)}) \text{tr}(d_{T_b^{(r)}}^{(L, i, r)}) \text{tr}(d_{T_c^{(s)}}^{(L, i, s)}) = \sum_{j=1}^{N_R} C_{R, j}(k, r, s) \text{tr}(d_{T_a^{(k)}}^{(R, j, k)}) \text{tr}(d_{T_b^{(r)}}^{(R, j, r)}) \text{tr}(d_{T_c^{(s)}}^{(R, j, s)}) \quad \forall \; \; r \neq s, \; r \neq k, \; s \neq k\]The final three conditions are called mixed anomaly conditions, since they mix together the representations of different gauge groups.
If the representation content \(\rho\) of our theory satisfies this condition, then the (regulated) measure \(\text{D}\Psi\) will be gauge invariant.
The special case of the Standard Model. A theory is defined by the triple \((S, G, \rho)\) as well as the field content \(\Psi\). Here we will describe the choices for these components in the Standard Model.
The Standard Model corresponds to the spacetime group \(G^{(0)} = \text{Spin}(1, 3) \cong \text{SL}(2; \mathbb{C})\) as considered above, and gauge groups
\[(G^{(1)}, G^{(2)}, G^{(3)}) = (U(1), SU(2), SU(3))\]The field content includes 2 left-handed spinors, 3 right-handed spinors, and a Higgs boson:
\[\Psi = (Q_L, L_L, u_R, d_R, e_R, H)\]To succinctly state the representation content, we will denote the trivial rep of a gauge group by \(\mathbf{1}\), the fundamental rep of \(SU(2)\) by \(\mathbf{2}\), and the fundamental rep of \(SU(3)\) by \(\mathbf{3}\). Further, since the reps of \(U(1)\) are labelled by real numbers \(c \in \mathbb{R}\), we will denote the corresponding rep by \(\mathbf{c}\). We will then use the notation \(\phi: (\mathbf{a}, \mathbf{b})_{c}\) to say that the field \(\phi\) transforms under rep \(\mathbf{a}\) of \(SU(2)\), rep \(\mathbf{b}\) of \(SU(3)\), and rep \(\mathbf{c}\) of \(U(1)\) (where \(a \in \{1, 2\}\), \(b \in \{1, 3\}\), and \(c \in \mathbb{R}\) for our purposes).
Then the representation content of the Standard Model amounts to:
\[Q_L: (\mathbf{2}, \mathbf{3})_{1/6}, \quad L_L: (\mathbf{2}, \mathbf{1})_{-1/2}, \quad u_R: (\mathbf{1}, \mathbf{3})_{2/3},\] \[d_R: (\mathbf{1}, \mathbf{3})_{-1/3}, \quad e_R: (\mathbf{1}, \mathbf{1})_{-1}, \quad H: (\mathbf{2}, \mathbf{1})_{1/2}\]Why these particular choices? The above is arguably one of the simplest setups that satisfies the anomaly condition Equation \ref{eqn:anomalycond}. This is argued in the accompanying notes on gauge theory.
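We can verify the standard per-generation anomaly sums implied by Equation \ref{eqn:anomalycond} for this representation content with a short script (our own illustrative check, using exact rational arithmetic; the final sum-of-\(Y\) condition is the mixed-gravitational anomaly, not derived above but also satisfied):

```python
from fractions import Fraction as F

# (name, SU(2) dim, SU(3) dim, U(1) hypercharge Y) per the rep content above
left  = [("Q_L", 2, 3, F(1, 6)), ("L_L", 2, 1, F(-1, 2))]
right = [("u_R", 1, 3, F(2, 3)), ("d_R", 1, 3, F(-1, 3)), ("e_R", 1, 1, F(-1))]

def ysum(fields, power, project=None):
    """Sum of (multiplicity) * Y^power over fields; `project` keeps only
    fields that are non-singlets under that factor, with multiplicity
    counted over the remaining factor (as in the C_{L,i}, C_{R,j} factors)."""
    total = F(0)
    for _, d2, d3, Y in fields:
        if project == "su2":
            if d2 == 1:
                continue
            mult = d3
        elif project == "su3":
            if d3 == 1:
                continue
            mult = d2
        else:
            mult = d2 * d3
        total += mult * Y ** power
    return total

# U(1)^3 anomaly: left and right sums of Y^3 must match
assert ysum(left, 3) == ysum(right, 3)
# SU(2)^2-U(1) mixed anomaly: right-handed fields are SU(2) singlets,
# so the left-handed doublet sum must vanish on its own
assert ysum(left, 1, project="su2") == 0
# SU(3)^2-U(1) mixed anomaly: color-triplet sums must match
assert ysum(left, 1, project="su3") == ysum(right, 1, project="su3")
# mixed-gravitational condition: total sum of Y must match
assert ysum(left, 1) == ysum(right, 1)
```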
Todo: should better justify why this particular choice is one of the most simple, and what the other choices are.
(This section is very unfinished)
The action \ref{eqn:Sgauged} has achieved our goal of satisfying \ref{eqn:Sinvar}: it is both invariant to the spacetime group \(G^{(0)}\) by construction, and through minimal coupling, we have achieved invariance to the gauge groups \(G^{(1)} \times \cdots \times G^{(K)}\). However, the action does not involve any interactions, with no contractions between distinct spinors (e.g. no terms of the form \(\psi_L^{\dagger} \chi_R\)).
We previously found that terms of the form \(\psi_L^{\dagger} \chi_R\) (and \(\chi_R^{\dagger} \psi_L\)) were invariant to the spacetime group \(G^{(0)}\). But gauge invariance is problematic since, generally,
\[\rho^{(L, i, k)\dagger} \rho^{(R, j, k)} \neq I\]Indeed, as we will see, the choice of representations in the Standard Model is such that no term of the form \(\psi_L^{\dagger} \chi_R\) can be gauge invariant. This differs from the case of minimal coupling (Section 3), as here it is not an additive term that causes gauge invariance to fail. To achieve gauge invariance in this case, one could imagine introducing a scalar field \(H\) that transforms under the gauge groups precisely such that
\[\psi_L^{\dagger} H \chi_R\]is gauge invariant. Note that \(H\) must transform as a scalar in order for the term to remain a scalar. Is there some maximal choice of the transformation properties of \(H\), such that we can produce the maximum number of interaction terms possible?
In the Standard Model, \(H\) transforms under the gauge groups as \((\mathbf{2}, \mathbf{1})_{1/2}\), which allows for the interaction terms… Todo
(This section is unfinished)
In the above, we have introduced auxiliary fields \(\{A^{(k)}\}_{k=1}^{K}\) and \(H\) to our theory in order to help us achieve gauge invariance. Here, we make a key step: we promote these fields to be dynamical in their own right, possessing their own kinetic terms in the action \(S[\Psi]\). In particular, these kinetic terms should include derivatives of these auxiliary fields, and preserve overall spacetime & gauge invariance.
Kinetic terms for \(A^{(k)}\). In order to make \(A^{(k)}\) dynamical, we would like to construct a spacetime \& gauge invariant term that depends on \(A_{\mu}^{(k)}\) and its derivatives \(\partial_{\nu} A_{\mu}^{(k)}\). Recall that we defined
\[f_{\mu\nu}^{(k)} := \partial_{\mu} A_{\nu}^{(k)} - \partial_{\nu} A_{\mu}^{(k)} - g_k [A_{\mu}^{(k)}, A_{\nu}^{(k)}]\]We further have that \(f_{\mu\nu}^{(k)}\) transforms as
\[f_{\mu\nu}^{(k)} \mapsto_{\alpha} f_{\mu\nu}^{(k)} + [\alpha, f_{\mu\nu}^{(k)}] + O(\alpha^2)\]under a gauge transformation by \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\).
As shown in Appendix A.1.1, each semi-simple Lie algebra \(\mathfrak{g}^{(k)}\) has its own natural metric \(\kappa^{(k)}: \mathfrak{g}^{(k)} \times \mathfrak{g}^{(k)} \to \mathbb{C}\), called the Killing form. Importantly, the Killing form satisfies the identity
\[\kappa^{(k)}(X, [Y, Z]) = \kappa^{(k)}(Y, [Z, X]) = \kappa^{(k)}(Z, [X, Y])\](also proven in Appendix A.1.1) for any \(X, Y, Z \in \mathfrak{g}^{(k)}\). This identity gives us that the inner product (under metric \(\kappa\)) transforms as
\[\kappa^{(k)}(f_{\mu\nu}^{(k)}, f^{(k)\mu\nu}) \mapsto_{\alpha} \kappa^{(k)}(f_{\mu\nu}^{(k)}, f^{(k)\mu\nu}) + O(\alpha^2)\]i.e. \(\kappa(f_{\mu\nu}^{(k)}, f^{(k)\mu\nu})\) is (infinitesimally) gauge invariant. As a result, it is a natural candidate as a kinetic term for \(A^{(k)}\). And since all 4-vector indices are contracted, it will be a Lorentz scalar.
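The cyclic identity underlying this invariance can be illustrated numerically; the sketch below (our own, using the adjoint representation of \(\mathfrak{su}(2)\)) checks it for random algebra elements:

```python
import numpy as np

def eps(i, j, k):
    """Levi-Civita symbol on indices {0, 1, 2}."""
    return (i - j) * (j - k) * (k - i) / 2

# su(2) in the adjoint representation: (ad T_a)_{cb} = eps_{cab},
# matching [T_a, T_b] = eps_{abc} T_c
ad = [np.array([[eps(c, a, b) for b in range(3)] for c in range(3)])
      for a in range(3)]

rng = np.random.default_rng(3)
X, Y, Z = (sum(c * ad[a] for a, c in enumerate(rng.normal(size=3)))
           for _ in range(3))

def bracket(U, V):
    return U @ V - V @ U

def kappa(U, V):
    # Killing form; U, V are already adjoint matrices, so kappa = tr(UV)
    return np.trace(U @ V)

v1 = kappa(X, bracket(Y, Z))
v2 = kappa(Y, bracket(Z, X))
v3 = kappa(Z, bracket(X, Y))
assert np.isclose(v1, v2) and np.isclose(v2, v3)
```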
Free and kinetic terms for \(H\). \(H^{\dagger} H\) is the simplest Lorentz and gauge invariant term for \(H\) (recall that \(H\) is a Lorentz scalar). As a result, we can consider the contributions of the form \((H^{\dagger} H)^n\) for \(n \in \{1, 2, \ldots\}\). We will focus specifically on the cases of \(n = 1,2\).
We focus only on \(H^{\dagger} H\) and \((H^{\dagger} H)^2\) terms because \((H^{\dagger} H)^k\) for \(k > 2\) are not relevant or marginal, as defined at the end of Section 8. In the notation of Section 8, assuming canonical normalization of the kinetic contribution (\((m, n) = (a, b)\)), an \((m, n)\) contribution is said to be irrelevant at low energies iff \(n-d+m(d-b)/a > 0\). For us, \((a, b) = (2, 2)\), and since \(d=4\), the condition for irrelevance becomes
\[n+m > 4\]And since \((H^{\dagger} H)^k\) has \((m, n) = (2k, 0)\), terms with \(k > 2\) are irrelevant.
We can also construct terms that involve derivatives of \(H\). Most naively, we could consider \((\partial_{\mu} H)^{\dagger} \partial^{\mu} H\); however, this will not be gauge invariant, for the same reasons as seen in Section 3. Instead, we must do something analogous to minimal coupling. Proceeding similarly to Section 3, one can find that
\[(D_{\mu}^{(H)} H)^{\dagger} D^{(H)\mu} H\]is both Lorentz and gauge invariant, defining the associated covariant derivative
\[D_{\mu}^{(H)} := \partial_{\mu} - \sum_{k=1}^{K} g_k d_{A_{\mu}^{(k)}}^{(H, k)}\]where we have denoted the gauge representations that act on \(H\) by \(\{d^{(H, k)}\}_{k=1}^{K}\).
This motivates adding the contribution
\[(D_{\mu}^{(H)} H)^{\dagger} D^{(H)\mu} H + \alpha H^{\dagger} H + \beta (H^{\dagger} H)^2\]or equivalently
\[(D_{\mu}^{(H)} H)^{\dagger} D^{(H)\mu} H - \frac{\lambda}{2}(H^{\dagger} H - v^2)^2\]which allows us to interpret the contribution as a kinetic term plus a potential term (matching the previous form with \(\alpha = \lambda v^2\) and \(\beta = -\lambda/2\), up to an additive constant), where the potential has ground states defined by \(H^{\dagger} H = v^2\). This interpretation will be useful for the discussion of symmetry breaking.
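The equivalence of the two parameterizations of the potential can be verified symbolically; a quick check (writing \(\rho := H^{\dagger} H\)):

```python
import sympy as sp

# Match the two parameterizations of the Higgs potential:
#   alpha*rho + beta*rho^2  vs  -(lambda/2)*(rho - v^2)^2  (up to a constant),
# writing rho := H^dagger H.
rho, lam, v = sp.symbols('rho lambda v', positive=True)

potential = -sp.Rational(1, 2) * lam * (rho - v**2) ** 2
expanded = sp.expand(potential)

# Read off the coefficients of rho and rho^2
alpha = expanded.coeff(rho, 1)
beta = expanded.coeff(rho, 2)
assert alpha == lam * v**2
assert beta == -lam / 2
```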
Symmetry breaking. Todo: Reparameterize \(H\) by expanding around a ground state, and perform an appropriate gauge transformation to \(S[\Psi]\). Demonstrate that this has given ``mass'' to the spinors.
Todo: argue that the reps chosen for the Higgs boson are ``maximal'' in some sense for achieving interactions.
Todo: show that, via symmetry breaking, the fermions gain mass.
There is some kind of ``overcounting'' happening when integrating over any field object \(\Psi_i \in \mathcal{C}_i\), due to \(\mathcal{C}_i\) including physically-equivalent field configurations that are related by a gauge transformation. In the specific case of gauge fields, this overcounting is detrimental and results in divergences. Remedying this issue requires breaking gauge invariance, though some remnants of gauge invariance persist through BRST invariance. We will now outline these concepts.
In the following, we will restrict our attention to a particular gauge field \(A \in C^{\infty}(M, \mathfrak{g}^{(k)})\) associated with gauge group \(G^{(k)}\) – the following analysis will apply to each of the \(K\) gauge fields individually. For gauge element \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\), we will use the notation
\[A_{\mu}^{\alpha} := A_{\mu} + [\alpha, A_{\mu}] + \frac{1}{g_k} \partial_{\mu} \alpha\]for the gauge transformation of \(A\) by \(\alpha\). We will work with a set of generators \(\{T_a\}_a \subset \mathfrak{g}^{(k)}\), with \([T_a, T_b] = f^c{}_{ab} T_c\) for structure constants \(\{f^c{}_{ab}\}_{a, b, c}\). We can write the components of \(A^{\alpha}\) explicitly as
\[(A_{\mu}^{\alpha})^a = A_{\mu}^a + \alpha^b A_{\mu}^c f^a{}_{bc} + \frac{1}{g_k} \partial_{\mu} \alpha^a\]where \(A_{\mu} = A_{\mu}^a T_a\). For the following we will define the matrix of differential operators
\[\Delta_{\mu}(A)^a{}_{b} := f^a{}_{bc} A_{\mu}^c + \frac{1}{g_k} \delta^a_b \partial_{\mu}\]which lets us write
\[(A_{\mu}^{\alpha})^a = A_{\mu}^a + \Delta_{\mu}(A)^a{}_{b} \alpha^b\]In the following we will work in Minkowski signature, meaning \(\mathbb{P}[\Psi] \propto e^{iS[\Psi]}\) (rather than \(\propto e^{-S[\Psi]}\)).
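As a sanity check on the component formula for \((A_{\mu}^{\alpha})^a\), one can verify numerically that the term \(\alpha^b A_{\mu}^c f^a{}_{bc}\) reproduces the matrix commutator \([\alpha, A_{\mu}]\). A sketch for \(\mathfrak{su}(2)\) (an illustrative choice, with \(T_a = -i\sigma_a/2\) and \(f^a{}_{bc} = \epsilon_{abc}\)):

```python
import numpy as np

# su(2) sketch: the commutator term of a gauge transformation in components,
# alpha^b A^c f^a_{bc}, agrees with the matrix bracket [alpha, A].
# Generators T_a = -i sigma_a / 2 satisfy [T_a, T_b] = eps_{abc} T_c.
sigma = np.array([[[0, 1], [1, 0]],
                  [[0, -1j], [1j, 0]],
                  [[1, 0], [0, -1]]], dtype=complex)
T = -1j * sigma / 2

eps = np.zeros((3, 3, 3))
for a, b, c in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[a, b, c], eps[a, c, b] = 1.0, -1.0

rng = np.random.default_rng(1)
alpha, A = rng.normal(size=(2, 3))  # components alpha^a and A^a at a point

# Matrix form: [alpha^b T_b, A^c T_c]
alpha_mat = np.einsum('b,bij->ij', alpha, T)
A_mat = np.einsum('c,cij->ij', A, T)
commutator = alpha_mat @ A_mat - A_mat @ alpha_mat

# Component form: (alpha^b A^c f^a_{bc}) T_a
components = np.einsum('abc,b,c->a', eps, alpha, A)
assert np.allclose(commutator, np.einsum('a,aij->ij', components, T))
```

(The derivative term \(\frac{1}{g_k}\partial_{\mu}\alpha^a\) is already componentwise, so only the commutator term needs checking.)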
Remedying gauge overcounting. By looking at the partition function, we can interpret the effects of gauge overcounting and identify the divergent contribution. We will make use of the two identities
\[1 = N(\xi) \int D\omega \, e^{-i\int d^d x \, \omega(x)^2/2\xi}, \qquad 1 = \int D\alpha \, \delta(G_{\omega}(A^{\alpha}))\det\left(\frac{\delta G_{\omega}(A^{\alpha})}{\delta \alpha}\right)\]introducing a constant \(\xi\) and normalization factor \(N(\xi)\), as well as a gauge condition \(G_{\omega}(A) := \partial^{\mu} A_{\mu} - \omega\) (though the identity holds for generic conditions). Multiplying these two identities together gives
\[1 = N(\xi) \int D\alpha \, e^{-i\int(\partial^{\mu} A_{\mu}^{\alpha})^2/2\xi} \det\left(\frac{\delta G_{\omega}(A^{\alpha})}{\delta \alpha}\right)\bigg|_{\omega = \partial^{\mu} A_{\mu}^{\alpha}}\] \[\iff 1 = N(\xi) \det(\Xi(A)) \int D\alpha \, e^{-i\int(\partial^{\mu} A_{\mu}^{\alpha})^2/2\xi}\]introducing the differential operator
\[\Xi(A)^a{}_{b} := \partial^{\mu} \Delta_{\mu}(A)^a{}_{b} = f^a{}_{bc} A_{\mu}^c \partial^{\mu} + f^a{}_{bc} (\partial^{\mu} A_{\mu}^c) + \frac{1}{g_k} \delta_b^a \partial^{\mu} \partial_{\mu}\]Inserting this result, and writing the field content as \(\Psi = (\Phi, A)\), our partition function can be written in the form
\[\begin{align*} Z &= \int \text{D}\Phi \, \text{D}A \; e^{iS[\Phi, A]}\\ &\propto \int \text{D}\Phi \, \text{D}A \, \text{D}\alpha \, \det(\Xi(A)) e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu}^{\alpha})^2/2\xi} \end{align*}\]Abelian gauge fields. If \(G^{(k)}\) is abelian, then \(f^a{}_{bc} = 0\) for all \(a, b, c\), which means that
\[\Xi(A^{\alpha}) = \Xi(A)\]In particular, \(\Xi(A)\) is independent of \(A\) in this case. As a result, we can make use of gauge invariance (of both \(S\) and measures \(\text{D}\Phi, \text{D}A\)) to find
\[Z = \underbrace{\left[\int \text{D}\alpha\right]}_{\text{divergent overcounting factor}} \det\left(\frac{1}{g_k} \partial^{\mu} \partial_{\mu}\right) \int \text{D}\Phi \, \text{D}A \, e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu})^2/2\xi}\]i.e. we have been able to pull an infinite overcounting factor outside of our partition function. In order for our theory to be valid, we divide out by this overcounting factor, instead considering the partition function
\[\tilde{Z} := \int \text{D}\Phi \, \text{D}A \, e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu})^2/2\xi}\]That is, our action has now become
\[S[\Phi, A] \to S[\Phi, A] - \frac{1}{2\xi} \int d^d x \, (\partial^{\mu} A_{\mu}^a(x)) (\partial^{\nu} A_{\nu a}(x))\]But significantly, this new contribution is \textbf{not gauge invariant}, breaking the original gauge invariance of our theory. It turns out that there are still remnants of gauge invariance left through the BRST transformations, as we will describe further below.
Non-abelian gauge fields. In the abelian case, \(\Xi(A)\) was actually independent of \(A\), which let us pull the \(\det(\Xi(A))\) factor out of the partition function. However, in the general case of \(G^{(k)}\) non-abelian, this is no longer possible since \(\Xi(A)\) generally depends on \(A\). In particular, \(\Xi(A^{\alpha}) \neq \Xi(A)\). The partition function takes the form (up to proportionality constants)
\[Z = \int \text{D}\Phi \, \text{D}A \, \text{D}\alpha \, \det(\Xi(A^{-\alpha})) e^{iS[\Phi, A] - i\int (\partial^{\mu} A_{\mu})^2/2\xi}\]But interestingly we can write
\[\det(\Xi(A)) = \int \text{D}\bar{c} \, \text{D}c \, e^{-i \int d^d x \, \bar{c}_a \Xi(A)^a{}_{b} c^b}\]for so-called ghost fields \(c, \bar{c} \in C^{\infty}(M, \mathfrak{g}^{(k)})\) associated with \(A\). This lets us consider the partition function
\[\tilde{Z} := \int \text{D}\Phi \, \text{D}A \, e^{iS[\Phi, A]-i\int(\partial^{\mu} A_{\mu})^2/2\xi-i \int \bar{c}_a \Xi(A)^a{}_{b} c^b}\]corresponding to modifying our action as
\[S[\Phi, A] \to S[\Phi, A] - \frac{1}{2\xi} \int d^d x \, (\partial^{\mu} A_{\mu}^a(x)) (\partial^{\nu} A_{\nu a}(x)) -\int d^d x \, \bar{c}_a(x) \Xi(A)^a{}_{b} c^b(x)\]Again, these extra contributions will generally break gauge invariance.
In total, for the generic non-abelian case, we must extend our field content to include ghost fields for each gauge field:
\[\Psi \mapsto \Psi \cup \{\bar{c}^{(k)}, c^{(k)}\}_{k=1}^{K}\]and modify the action as
\[S[\Psi] \to S[\Psi] - \sum_{k=1}^{K} \left[\frac{1}{2\xi_k} \int d^d x \, (\partial^{\mu} A_{\mu}^{(k)a}(x)) (\partial^{\nu} A_{\nu a}^{(k)}(x)) +\int d^d x \, \bar{c}_a^{(k)}(x) \Xi^{(k)}(A^{(k)})^a{}_{b} c^{(k) b}(x)\right]\]We will derive gauge transformation rules for these ghost fields below.
BRST invariance. Though this new action no longer satisfies the total gauge invariance we originally had, a restricted case of gauge invariance still remains, called BRST invariance.
In particular, recall that gauge transformations act infinitesimally as
\[A_{\mu}^{(k)} \mapsto_{\alpha} A_{\mu}^{(k)} + \Delta_{\mu}(A^{(k)})\alpha + O(\alpha^2)\]for \(\alpha \in C^{\infty}(M, \mathfrak{g}^{(k)})\). Under this transformation (and a corresponding transformation on the remaining field content), one can show that the modified Lagrangian transforms as \(\mathcal{L} \mapsto_{\alpha} \mathcal{L} + \delta\mathcal{L}\) for
\[\begin{align*} \delta\mathcal{L} &= -\frac{1}{\xi_k}(\partial^{\nu} A_{\nu a}^{(k)}) (\partial^{\mu} \partial_{\mu} \alpha^a + f^{(k)a}{}_{bc} (\alpha^b \partial^{\mu} A_{\mu}^{(k) c} + A_{\mu}^{(k) c} \partial^{\mu} \alpha^b)) - \delta\bar{c}_a^{(k)} (f^{(k)a}{}_{bc} c^{(k)b} \partial^{\mu} A_{\mu}^{(k)c} + f^{(k)a}{}_{bc} A_{\mu}^{(k)c} \partial^{\mu} c^{(k)b} + \partial^{\mu} \partial_{\mu} c^{(k)a})\\ &\quad\,\;-\bar{c}_a^{(k)}(f^{(k)a}{}_{bc} f^{(k)c}{}_{de} [A_{\mu}^{(k)e}\partial^{\mu} \alpha^d + \alpha^d \partial^{\mu} A_{\mu}^{(k)e}] + f^{(k)a}{}_{bc} \partial^{\mu} \partial_{\mu} \alpha^c) c^{(k)b} - \bar{c}_a^{(k)} (f^{(k)a}{}_{bc} f^{(k)c}{}_{de} \alpha^d A^{(k)e}_{\mu} + f^{(k)a}{}_{bc} \partial_{\mu} \alpha^c) \partial^{\mu} c^{(k)b}\\ &\quad\,\;-\bar{c}_a^{(k)} (f^{(k)a}{}_{bc} \delta c^{(k)b} \partial^{\mu} A_{\mu}^{(k)c}+ f^{(k)a}{}_{bc} A_{\mu}^{(k)c} \partial^{\mu}[\delta c^{(k)b}] + \partial^{\mu} \partial_{\mu} [\delta c^{(k)a}]) \end{align*}\]with \(c^{(k)} \mapsto_{\alpha} c^{(k)} + \delta c^{(k)}\) and \(\bar{c}^{(k)} \mapsto_{\alpha} \bar{c}^{(k)} + \delta\bar{c}^{(k)}\).
One can convince oneself that the second-derivative terms cancel nicely if we choose \(\alpha^a = c^a\) and take the ghost fields to transform as
\[\delta \bar{c}^{(k)a} = -\frac{1}{\xi_k} \partial^{\mu} A_{\mu a}^{(k)}, \qquad \delta c^{(k)a} = -\frac{1}{2} f^{(k)a}{}_{bc} \alpha^c c^{(k)b}\]noting that we must be careful when commuting variables, since the ghosts are Grassmann-valued (\(c^a c^b = -c^b c^a\)). We have also made use of
\[f^a{}_{e[b} f^e{}_{cd]} = 0\]which follows from the Jacobi identity. One can then check that, under these choices, the remaining terms also cancel and we get \(\delta\mathcal{L} = 0\).
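The structure-constant form of the Jacobi identity is easy to verify numerically; a sketch for \(\mathfrak{su}(2)\) (an illustrative choice), where \(f^a{}_{bc} = \epsilon_{abc}\):

```python
import numpy as np

# Structure constants of su(2): f^a_{bc} = epsilon_{abc}
eps = np.zeros((3, 3, 3))
for a, b, c in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[a, b, c], eps[a, c, b] = 1.0, -1.0

# Jacobi identity in structure-constant form: the cyclic sum over (b, c, d) of
# f^a_{eb} f^e_{cd} vanishes identically.
jacobi = (np.einsum('aeb,ecd->abcd', eps, eps)
          + np.einsum('aec,edb->abcd', eps, eps)
          + np.einsum('aed,ebc->abcd', eps, eps))
assert np.allclose(jacobi, 0.0)
```

Since the structure constants are already antisymmetric in their lower two indices, the cyclic sum is proportional to the full antisymmetrization \(f^a{}_{e[b} f^e{}_{cd]}\).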
As a result, our original gauge invariance has been reduced to a restricted set of gauge transformations that involve our new ghost fields. This restricted set are called the BRST transformations.
In the following, denote \(p = \dim V\) (where recall \(\Psi \in C^{\infty}(M, V)\)). Recall that physical quantities are related to field expectances of the general form
\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] \equiv \frac{1}{Z} \int_{\mathcal{C}} \left[\prod_{j=1}^{N} \text{D}\Psi_j\right] \Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n) e^{-S[\Psi]}\]As outlined in the introduction, in order to construct a well-behaved definition of \(\text{D}\Psi\), one can consider expanding \(\Psi\) in some discrete basis \(\{\psi_n\}_n\) of \(C^{\infty}(M, V)\):
\[\Psi(x) = \sum_{n \in \mathbb{Z}} a_n \psi_n(x)\]Then naively, we could define
\[\text{D}\Psi := \prod_{n \in \mathbb{Z}} d a_n\]but since this is an infinite product, it will generally result in divergences, necessitating regularization. In the following, we will consider cutoff regularization: truncate the product at some cutoff parameter \(N \in \mathbb{Z}_{+}\):
\[\text{D}\Psi^{(N)} := \prod_{n\in \mathbb{Z}_N} d a_n\]using the notation \(\mathbb{Z}_N := \{-N, -N+1, \ldots, N-1, N\}\). Of particular relevance to renormalization is the Fourier basis, corresponding to \(n \in \mathbb{Z}_N^d\) (for spacetime dimension \(\dim M = d\)) and \(\psi_n(x) = e^{i\epsilon n \cdot x}\) for some fixed discretization scale \(\epsilon\). In this basis, we may write
\[\begin{align*} \Psi^{(N)}(x) &= \sum_{n \in \mathbb{Z}_N^d} a_n e^{i\epsilon n\cdot x}\\ &= \sum_{k \in \epsilon \mathbb{Z}_N^d} \tilde{\Psi}(k) e^{ik\cdot x} \end{align*}\]allowing us to interpret \(k\) as momentum, using the notation \(\tilde{\Psi}(k)\) for components. Namely, at a cutoff scale \(N\) and discretization scale \(\epsilon\), the maximum allowable energy is \(\Lambda := \sqrt{d} \epsilon N\). In the following we will use the notation \(\Psi^{(\Lambda)} \equiv \Psi^{(N)}\), and denote the configuration space spanned by \(\{\psi_n\}_{n \in \mathbb{Z}_N^d}\) by \(\mathcal{C}^{(\Lambda)}\).
Given that the space of field configurations \(\mathcal{C}^{(\Lambda)}\) is defined relative to some energy scale \(\Lambda\), this means that theories \(S^{(\Lambda)}: \mathcal{C}^{(\Lambda)} \to \mathbb{R}\) are \textbf{inherently energy-dependent}, describing physics at a particular energy scale. We say that \(S^{(\Lambda)}\) is a \(\Lambda\)-energy theory.
Lowering the cutoff parameter. We will now show that there is a natural procedure for lowering the energy of a theory \(S^{(\Lambda)}\), resulting in a theory \(S^{(\mu)}\) for \(\mu < \Lambda\). First see that
\[\begin{align*} \text{D}\Psi^{(\Lambda)} = \prod_{k \in \epsilon\mathbb{Z}_N^d} d^d \tilde{\Psi}(k) &= \prod_{k \in \epsilon\mathbb{Z}_M^d} d^d \tilde{\Psi}(k) \prod_{k \in \epsilon\mathbb{Z}_{M+1,N}^d} d^d \tilde{\Psi}(k)\\ &= \text{D}\Psi^{(\mu)} \text{D}\Psi^{(\mu, \Lambda)} \end{align*}\]for some lower cutoff \(M \in \mathbb{Z}_{+}\) with energy scale \(\mu := \sqrt{d} \epsilon M\), where
\[\Psi^{(\mu)}(x) = \sum_{k \in \epsilon \mathbb{Z}_M^d} \tilde{\Psi}(k) e^{ik\cdot x}, \qquad \Psi^{(\mu, \Lambda)} = \sum_{k \in \epsilon \mathbb{Z}_{M+1, N}^d} \tilde{\Psi}(k) e^{ik\cdot x}\]which allows us to write
\[\Psi^{(\Lambda)}(x) = \Psi^{(\mu)}(x) + \Psi^{(\mu, \Lambda)}(x)\]In the following we will write
\[\int \text{D}\Psi^{(\Lambda)} \equiv \int_{\mathcal{C}^{(\Lambda)}} \text{D}\Psi\]Now see that, for a \(\Lambda\)-energy theory \(S^{(\Lambda)}: \mathcal{C}^{(\Lambda)} \to \mathbb{R}\), we can write the \(\Lambda\)-energy partition function as
\[\begin{align*} Z^{(\Lambda)} &= \int_{\mathcal{C}^{(\Lambda)}} \text{D}\Psi \, e^{-S^{(\Lambda)}[\Psi]}\\ &= \int_{\mathcal{C}^{(\mu)}} \text{D}\chi \int_{\mathcal{C}^{(\mu, \Lambda)}} \text{D} \eta \, e^{-S^{(\Lambda)}[\chi + \eta]}\\ &=: \int_{\mathcal{C}^{(\mu)}} \text{D}\chi \, e^{-S^{(\mu)}[\chi]} \end{align*}\]where we have defined the effective action \(S^{(\mu)}: \mathcal{C}^{(\mu)} \to \mathbb{R}\) by
\[S^{(\mu)}[\chi] := -\log\left(\int_{\mathcal{C}^{(\mu, \Lambda)}} \text{D} \eta \, e^{-S^{(\Lambda)}[\chi + \eta]}\right)\]Furthermore, since in general we can decompose
\[S^{(\Lambda)}[\chi + \eta] = S^{(\Lambda)}[\chi] + S_0^{(\Lambda)}[\eta] + S_I^{(\Lambda)}[\chi, \eta]\]we can write
\[S^{(\mu)}[\chi] = S^{(\Lambda)}[\chi] + S^{(\Lambda\to\mu)}[\chi]\]defining
\[S^{(\Lambda\to\mu)}[\chi] := -\log\left(\int_{\mathcal{C}^{(\mu, \Lambda)}} \text{D} \eta \, e^{-S_0^{(\Lambda)}[\eta] - S_I^{(\Lambda)}[\chi, \eta]}\right)\]Denote the space of \(\Lambda\)-energy theories by \(\mathcal{S}^{(\Lambda)} \subseteq (\mathcal{C}^{(\Lambda)} \to \mathbb{R})\). Above, we have shown that there is a particularly natural lowering operation
\[\mathfrak{L}^{(\Lambda\to\mu)}: \mathcal{S}^{(\Lambda)} \to \mathcal{S}^{(\mu)}, \quad S^{(\Lambda)} \mapsto S^{(\mu)} = S^{(\Lambda)} + S^{(\Lambda\to\mu)}\]mapping \(\Lambda\)-energy theories to \(\mu\)-energy theories by integrating out momentum modes between \((\mu, \Lambda]\), which has the effect of shifting by \(S^{(\Lambda\to\mu)}\).
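A zero-dimensional toy (two real "modes" in place of fields, an illustrative assumption) makes the lowering operation concrete: integrating out the high mode \(\eta\) from a quadratic \(S[\chi + \eta]\) produces an effective quadratic action for \(\chi\), which we can compare against direct numerical integration:

```python
import numpy as np

# Zero-dimensional toy of the lowering operation: one "low" mode chi and one
# "high" mode eta, with quadratic action S(chi, eta) = a chi^2/2 + b eta^2/2 + c chi eta.
# Integrating out eta gives S_eff(chi) = (a - c^2/b) chi^2 / 2 + const.
a, b, c = 1.0, 4.0, 0.8

def S(chi, eta):
    return 0.5 * a * chi**2 + 0.5 * b * eta**2 + c * chi * eta

def S_eff_numeric(chi):
    # S_eff(chi) = -log( int d(eta) e^{-S(chi, eta)} ), via a Riemann sum
    grid = np.linspace(-20.0, 20.0, 400001)
    dx = grid[1] - grid[0]
    return -np.log(np.sum(np.exp(-S(chi, grid))) * dx)

for chi in [0.0, 0.7, -1.3]:
    exact = 0.5 * (a - c**2 / b) * chi**2 - 0.5 * np.log(2 * np.pi / b)
    assert np.isclose(S_eff_numeric(chi), exact, atol=1e-6)
```

The quadratic coefficient shifts from \(a\) to \(a - c^2/b\): the toy analog of the shift by \(S^{(\Lambda\to\mu)}\).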
Extrapolating to higher energies. In reality, the experiments we can run on Earth are at a relatively low anthropic energy scale \(\mu\) and so the situation is reversed: experiments help us to determine a \(\mu\)-energy theory \(S^{(\mu)}\) (e.g. the Standard Model) that agrees with experimental results on Earth, and we wish to extrapolate this theory to a theory \(S^{(\Lambda)}\) that works at higher energy levels \(\Lambda > \mu\). One may even wish to extend the theory to arbitrarily high energies \(\Lambda \to \infty\).
Concretely, given a low-energy theory \(S^{(\mu)}\) (determined via experiments on Earth at energy scale \(\mu\)), the requirement that the higher-energy theory \(S^{(\Lambda)}\) agrees with experiments at energy \(\mu\) corresponds to \(S^{(\Lambda)}\) satisfying
\[\begin{equation} \label{eqn:lower} \mathfrak{L}^{(\Lambda\to\mu)}[S^{(\Lambda)}] = S^{(\mu)} \end{equation}\]Asymptotic freedom. We say that \(S^{(\mu)}\) is asymptotically free if such a theory \(S^{(\Lambda)}\) exists for \(\Lambda = \infty\). For this theory \(S^{(\infty)}\) to be valid, physical predictions must be finite, corresponding to finite correlators:
\[\mathbb{E}_{\Psi \sim S^{(\infty)}}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] \;\; \text{finite}\]In the following, given some fixed \(S^{(\mu)}\), we will consider constructing such a \(S^{(\Lambda)}\) via the approach of counterterms. In particular, denoting \(S^{(\mu\to\Lambda)} \equiv -S^{(\Lambda\to\mu)}\), we may write
\[S^{(\Lambda)}[\chi] = S^{(\mu)}[\chi] + S^{(\mu\to\Lambda)}[\chi]\]As we will see, we will then design the contributions \(S^{(\mu\to\Lambda)}\) to be such that \(S^{(\Lambda)}\) is well-defined in the limit \(\Lambda \to \infty\) (i.e. finite correlators).
In the following we will write
\[\begin{equation} \label{eqn:actcounter} S^{(\Lambda)}[\chi] = \sum_{m, n} \lambda_{m, n}^{(\Lambda)} \int_M d^d x \, L_{m, n}[\chi; x] \end{equation}\]where \(L_{m, n}[\chi; x]\) denotes all contributions with \(m\) powers of \(\chi\) and \(n\) derivatives of \(\chi\), with \(\{\lambda_{m, n}^{(\Lambda)}\}_{m, n}\) the associated couplings. Similarly, we will denote the couplings of \(S^{(\mu)}\) by \(\{\lambda_{m,n}^{(\mu)}\}_{m, n}\), and of \(S^{(\mu\to\Lambda)}\) by \(\{\delta\lambda_{m, n}^{(\mu\to\Lambda)}\}_{m, n}\). Note that \(\lambda_{m, n}^{(\Lambda)} = \lambda_{m, n}^{(\mu)} + \delta\lambda_{m, n}^{(\mu\to\Lambda)}\). We will denote the kinetic contribution by the indices \((m, n) = (a, b)\) (usually \((a, b) = (2, 2)\)).
We will assume that \(S^{(\mu)}\) is canonically normalized, in the sense that its kinetic contribution has a coefficient of \(1\), i.e. \(\lambda_{a, b}^{(\mu)} = 1\). Generally \(\lambda_{a, b}^{(\Lambda)} \neq 1\) due to the contributions \(S^{(\mu\to\Lambda)}\). To canonically normalize \(S^{(\Lambda)}\), we will introduce the normalized field
\[\hat{\chi} := Z^{1/a} \chi\]defining \(Z := 1 + \delta\lambda_{a, b}^{(\mu\to\Lambda)}\). Then Equation \ref{eqn:actcounter} becomes
\[\hat{S}^{(\Lambda)}[\hat{\chi}] = \hat{S}^{(\mu)}[\hat{\chi}] + \hat{S}^{(\mu\to\Lambda)}[\hat{\chi}]\]where we have defined the normalized actions
\[\hat{S}^{(\Lambda)}[\hat{\chi}] := S^{(\Lambda)}[Z^{-1/a} \hat{\chi}], \quad \hat{S}^{(\mu)}[\hat{\chi}] := S^{(\mu)}[Z^{-1/a} \hat{\chi}], \quad \hat{S}^{(\mu\to\Lambda)}[\hat{\chi}] := S^{(\mu\to\Lambda)}[Z^{-1/a} \hat{\chi}]\]with couplings \(\{\hat{\lambda}^{(\Lambda)}_{m, n}\}_{m, n}\), \(\{\hat{\lambda}^{(\mu)}_{m, n}\}_{m, n}\), \(\{\delta\hat{\lambda}_{m, n}^{(\mu\to\Lambda)}\}_{m, n}\) respectively. Importantly, we now have that \(\hat{S}^{(\Lambda)}\) is canonically normalized, with \(\hat{\lambda}_{a, b}^{(\Lambda)} = 1\). Explicitly we can write these new couplings in terms of the original couplings:
\[\begin{align*} \hat{\lambda}^{(\Lambda)}_{m, n} &= Z^{-m/a} \lambda_{m, n}^{(\Lambda)}\\ &= \frac{\lambda_{m, n}^{(\mu)} + \delta\lambda_{m, n}^{(\mu\to\Lambda)}}{Z^{m/a}} \end{align*}\]One often calls \(\hat{S}^{(\Lambda)}\) the bare theory, and \(S^{(\mu)}\) the renormalized theory.
Counterterm renormalization involves designing the counterterms \(\delta\lambda_{m, n}^{(\mu\to\infty)}\) such that the infinite-energy theory \(\hat{S}^{(\infty)}\) of couplings \(\hat{\lambda}^{(\infty)}_{m, n}\) has finite correlators. As we outline in Section 9, one can compute correlators perturbatively. In our context, we can do so by taking \(\delta\lambda_{m, n}^{(\mu\to\infty)}\) to be sufficiently small such that we can treat \(\hat{S}^{(\mu\to\infty)}\) as a perturbation. The general procedure of counterterm renormalization includes:
1. Computing correlators of \(\hat{S}^{(\Lambda)}\) perturbatively, regulated at a finite cutoff \(\Lambda\), treating \(\hat{S}^{(\mu\to\Lambda)}\) as a perturbation.
2. Choosing the counterterms \(\delta\lambda_{m, n}^{(\mu\to\Lambda)}\) such that the cutoff-dependent divergences cancel.
3. Taking the limit \(\Lambda \to \infty\), leaving finite correlators.
In step 2, a renormalization prescription specifies how one chooses \(\delta\lambda_{m, n}^{(\mu\to\Lambda)}\), since generally various different choices can result in finite correlators (e.g. one can add an arbitrary finite constant). Different prescription schemes include the minimal subtraction (MS) scheme, the modified minimal subtraction (\(\overline{\text{MS}}\)) scheme, and the on-shell (mass-shell) scheme.
Raising the discretization scale. In the following, we will more explicitly write \(S^{(N, \epsilon)} \equiv S^{(\Lambda)}\) to denote the theory defined on configuration space \(\mathcal{C}^{(N, \epsilon)} \equiv \mathcal{C}^{(\Lambda)}\), with associated field configuration \(\Psi^{(N, \epsilon)} \equiv \Psi^{(\Lambda)}\). Through the lowering operator \(\mathfrak{L}^{(N \to M)} \equiv \mathfrak{L}^{(\Lambda\to\mu)}\), we saw that we can transform an \((N, \epsilon)\)-theory into an \((M, \epsilon)\)-theory for \(M < N\), which has the effect of lowering the energy scale from \(\Lambda \propto N\epsilon\) to \(\mu \propto M\epsilon\).
It turns out that we can also construct a mapping from an \((N, \epsilon)\)-theory to an \((N, \eta)\)-theory with a raised discretization scale \(\eta > \epsilon\). To see this, we will first introduce a new field \(\xi\) with Fourier components
\[\tilde{\xi}(k) := (\eta/\epsilon)^d \tilde{\Psi}(\epsilon k/\eta) \implies d^d \tilde{\Psi}(k) = d^d \tilde{\xi}(\eta k/\epsilon)\]which allows us to write
\[\begin{align*} \text{D}\Psi^{(N, \epsilon)} = \prod_{k \in \epsilon \mathbb{Z}_N^d} d^d \tilde{\Psi}(k) &= \prod_{k \in \epsilon \mathbb{Z}_N^d} d^d \tilde{\xi}(\eta k/\epsilon)\\ &= \prod_{k \in \eta \mathbb{Z}_N^d} d^d \tilde{\xi}(k)\\ &= \text{D}\xi^{(N, \eta)} \end{align*}\]One can also show that
\[\Psi^{(N, \epsilon)}(x) = (\epsilon/\eta)^d \xi^{(N, \eta)}(\epsilon x/\eta)\]These two results let us write
\[\begin{align*} Z^{(N, \epsilon)} &= \int_{\mathcal{C}^{(N, \epsilon)}} \text{D}\Psi \, e^{-S^{(N, \epsilon)}[\Psi]}\\ &= \int_{\mathcal{C}^{(N, \eta)}} \text{D}\xi \, e^{-S^{(N, \epsilon)}[(\epsilon/\eta)^d \xi(\epsilon \, \cdot/\eta)]} \end{align*}\]where see that
\[\begin{align*} S^{(N, \epsilon)}[(\epsilon/\eta)^d \xi(\epsilon \, \cdot/\eta)] &= \sum_{m, n} \lambda_{m, n}^{(N, \epsilon)} \int_M d^d x \, L_{m, n}[(\epsilon/\eta)^d \xi(\epsilon \, \cdot/\eta); x]\\ &= \sum_{m, n} (\epsilon/\eta)^{dm + n} \lambda_{m, n}^{(N, \epsilon)} \int_M d^d x \, L_{m, n}[\xi; \epsilon x/\eta]\\ &= \sum_{m, n} (\epsilon/\eta)^{d(m-1) + n} \lambda_{m, n}^{(N, \epsilon)} \int_M d^d \tilde{x} \, L_{m, n}[\xi; \tilde{x}]\\ &= \sum_{m, n} \underbrace{\frac{(\epsilon/\eta)^{n-d+m(d-b)/a}}{(\lambda_{a, b}^{(N, \epsilon)})^{m/a}} \lambda_{m, n}^{(N, \epsilon)}}_{=: \, \tilde{\lambda}^{(N, \eta)}_{m, n}} \int_M d^d \tilde{x} \, L_{m, n}[\tilde{\xi}; \tilde{x}]\\ &= \sum_{m, n} \tilde{\lambda}^{(N, \eta)}_{m, n} \int_M d^d \tilde{x} \, L_{m, n}[\tilde{\xi}; \tilde{x}]\\ &=: \tilde{S}^{(N, \eta)}[\tilde{\xi}] \end{align*}\]defining the rescaled field
\[\tilde{\xi} := (\epsilon/\eta)^{d+(b-d)/a} (\lambda_{a, b}^{(N, \epsilon)})^{1/a} \xi\]precisely such that \(\tilde{S}^{(N, \eta)}[\tilde{\xi}]\) is canonically normalized, i.e. \(\tilde{\lambda}_{a, b}^{(N, \eta)} = 1\). As a result, we can write
\[Z^{(N, \epsilon)} \propto \int_{\mathcal{C}^{(N, \eta)}} \text{D}\tilde{\xi} \, e^{-\tilde{S}^{(N, \eta)}[\tilde{\xi}]}\]The above has therefore shown that there is a particularly natural raising operation for the discretization scale
\[\mathfrak{R}^{(\epsilon\to\eta)}: \mathcal{S}^{(N, \epsilon)} \to \mathcal{S}^{(N, \eta)}, \quad S^{(N, \epsilon)} \mapsto \tilde{S}^{(N, \eta)}\]Coarse-graining. Consider the map
\[\mathfrak{C}^{(N \to M)}: \mathcal{S}^{(N, \epsilon)} \to \mathcal{S}^{(M, N\epsilon/M)}\] \[\mathfrak{C}^{(N \to M)} := \mathfrak{R}^{(\epsilon\to N\epsilon/M)} \circ \mathfrak{L}^{(N \to M)}\]If a theory \(S^{(N, \epsilon)}\) has couplings \(\{\lambda_{m, n}^{(N, \epsilon)}\}_{m, n}\), then \(\mathfrak{C}^{(N \to M)}[S^{(N, \epsilon)}]\) has couplings
\[\begin{equation} \label{eqn:coarsecouplings} \lambda^{(M, N\epsilon/M)}_{m, n} = (M/N)^{n-d+m(d-b)/a} \frac{\lambda_{m, n}^{(N, \epsilon)} + \delta\lambda_{m, n}^{(N \to M, \epsilon)}}{(\lambda_{a, b}^{(N, \epsilon)} +\delta\lambda_{a, b}^{(N\to M, \epsilon)})^{m/a}} \end{equation}\]This operation is particularly special in that it \textbf{preserves energy scales}, since \(N\epsilon = M(N\epsilon/M)\). It also maps between canonically normalized theories.
As a result, we can view \(\mathfrak{C}^{(N\to M)}\) as coarse-graining a theory by decreasing the cutoff and raising the discretization scale, while maintaining the same overall energy scale.
We can classify the possible contributions \(\{L_{m, n}\}_{m, n}\) to a theory based on how their couplings \(\{\lambda_{m, n}^{(N, \epsilon)}\}_{m, n}\) behave under coarse-graining. Namely, from Equation \ref{eqn:coarsecouplings}, we have that for each \((m, n)\): if the exponent \(n-d+m(d-b)/a\) is negative, the coupling grows under coarse-graining (since \(M/N < 1\)) and the contribution is called relevant; if the exponent is zero, the coupling is unchanged to leading order and the contribution is called marginal; and if the exponent is positive, the coupling shrinks and the contribution is called irrelevant.
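The exponent \(n-d+m(d-b)/a\) appearing in Equation \ref{eqn:coarsecouplings} is straightforward to tabulate; a sketch classifying a few scalar-theory contributions (assuming \(d = 4\) and kinetic indices \((a, b) = (2, 2)\)):

```python
# Classification of contributions by the coarse-graining exponent
# n - d + m(d - b)/a, for a d = 4 scalar theory with kinetic indices
# (a, b) = (2, 2) (an illustrative choice).
d, a, b = 4, 2, 2

def classify(m, n):
    exponent = n - d + m * (d - b) / a
    if exponent < 0:
        return 'relevant'      # coupling grows under coarse-graining
    if exponent == 0:
        return 'marginal'      # coupling unchanged to leading order
    return 'irrelevant'        # coupling shrinks under coarse-graining

# mass term, phi^4, phi^6, and kinetic contributions
for m, n in [(2, 0), (4, 0), (6, 0), (2, 2)]:
    print((m, n), classify(m, n))
```

For instance, the mass term \((m, n) = (2, 0)\) comes out relevant, while \(\phi^4\)-type terms \((4, 0)\) and the kinetic term \((2, 2)\) are marginal.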
Todo: usually one interprets irrelevant terms as terms that can be ignored at low energy. But under coarse-graining the energy scale is not changing, so how can we remedy this difference in viewpoint?
Todo: the general prescription for writing down low-energy theories: just write down all relevant and marginal terms that satisfy spacetime and gauge invariance. The Standard Model action can be derived by such a procedure (but also with some extra terms, like the theta term).
Todo: how is gauge invariance affected by cutoff regularization? (in comparison, Fujikawa regularization preserves gauge invariance)
Todo: The existence of Landau poles suggests that the Standard Model may not be asymptotically free, however this analysis is only perturbative and hence does not act as a proof.
As mentioned previously, physical quantities such as scattering probabilities can be written in terms of correlators: expectances of the general form
\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] := \frac{1}{Z} \int_{\mathcal{C}} \text{D}\Psi \, \Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n) e^{-S[\Psi]}\]for arbitrary \(n \in \mathbb{Z}_{+}\), indices \((i_1, \ldots, i_n) \in \{1, \ldots, N\}^n\), and points \((x_1, \ldots, x_n) \subset M\).
But how can we actually compute correlators? Generally, we can write
\[\mathbb{E}_{\Psi \sim S}[\Psi_{i_1}(x_1) \cdots \Psi_{i_n}(x_n)] = \frac{(-1)^n}{Z[0, \ldots, 0]} \frac{\delta^n Z[J_1, \ldots, J_N]}{\delta J_{i_n}(x_n) \cdots \delta J_{i_1}(x_1)}\bigg|_{J_1 = \cdots = J_N = 0}\]where we have introduced a source \(J_i \in C^{\infty}(M, V^{(i)})\) for each field \(\Psi_i\) and defined
\[Z[J_1, \ldots, J_N] := \int_{\mathcal{C}} \text{D}\Psi \, e^{-S[\Psi] - \sum_{i=1}^{N} \int_M d^d x \, J_i(x) \cdot \Psi_i(x)}\]Decomposing \(S[\Psi] = S_0[\Psi] + \Delta S[\Psi]\) into free contributions \(S_0\) (i.e. kinetic and mass terms) and perturbative contributions \(\Delta S\) (i.e. interactions with small couplings), we can compute \(Z[J_1, \ldots, J_N]\) perturbatively via Taylor expansion of the interaction exponential:
\[Z[J_1, \ldots, J_N] = \exp\left(-\Delta S\left[\frac{\delta}{\delta J_1}, \ldots, \frac{\delta}{\delta J_N}\right]\right) Z_0[J_1, \ldots, J_N]\]with \(Z_0[J_1, \ldots, J_N]\) defined
\[Z_0[J_1, \ldots, J_N] := \int_{\mathcal{C}} \text{D}\Psi \, e^{-S_0[\Psi] - \sum_{i=1}^{N} \int_M d^d x \, J_i(x) \cdot \Psi_i(x)}\]Because \(S_0[\Psi]\) consists of free contributions and is at most quadratic in \(\Psi\), and since the source terms are linear in \(\Psi\), one can usually complete the square within the exponent and determine a closed form for \(Z_0[J_1, \ldots, J_N]\).
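A zero-dimensional analog (a toy, not the field-theoretic computation) shows the mechanics of completing the square: with \(S_0(\psi) = \frac{1}{2} m \psi^2\), one finds \(Z_0(J) = Z_0(0)\, e^{J^2/2m}\), the analog of the propagator formula below with \(D = 1/m\):

```python
import numpy as np

# Zero-dimensional analog of completing the square: with S_0(psi) = m psi^2 / 2,
# the sourced integral Z_0(J) = int d(psi) e^{-S_0(psi) - J psi} satisfies
# Z_0(J) = Z_0(0) exp(J^2 / 2m), the analog of the propagator formula with D = 1/m.
m = 2.0
grid = np.linspace(-15.0, 15.0, 300001)
dx = grid[1] - grid[0]

def Z0(J):
    # Riemann-sum approximation of the Gaussian integral with source J
    return np.sum(np.exp(-0.5 * m * grid**2 - J * grid)) * dx

for J in [0.0, 0.5, -1.2]:
    assert np.isclose(Z0(J), Z0(0.0) * np.exp(J**2 / (2 * m)))
```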
For example, a real scalar field \(\phi\) has free action
\[S_0[\phi] = \frac{1}{2} \partial_{\mu} \phi \partial^{\mu} \phi - \frac{1}{2} m^2 \phi^2\]and by using integration by parts and completing the square, one can find that
\[Z_0[J] = Z_0[0] \exp\left(-\frac{1}{2} \int d^d x \, d^d y \, J(x) D_F(x-y) J(y)\right)\]where \(D_F(x-y)\) is the propagator for a real scalar field, defined by the Green’s function property
\[i(\square + m^2) D_F(x-y) = \delta(x-y)\]Feynman rules. The correlator Feynman rules are a distilled set of rules for computing the correlator term-by-term perturbatively (via Taylor expansion), as described above.
Todo: Give explicit conversion between Feynman diagrams and relevant integrals.
To understand the kinds of invariances that our theory can have – specifically, the available choices of Lie groups for the gauge group – it is useful to classify the available Lie groups. It happens that each Lie group has an associated \textit{Lie algebra} (though distinct Lie groups can share the same Lie algebra), and it is much easier to classify the available Lie algebras, and so this will be the focus of A.1.1.
Furthermore, ultimately these Lie groups must act on our field content, which requires choosing a \textit{representation} of the Lie group (as outlined in the Introduction). In A.1.2, we will classify the available representations to better understand the choices we can make.
The unitary group \(U(n)\) is not semi-simple, and so the classification of its representations must be approached differently. We will explore this in Appendix A.1.3.
A Lie algebra \((\mathfrak{g}, [\cdot, \cdot])\) is a vector space \(\mathfrak{g}\) (over some field \(\mathbb{F}\)) paired with a bracket
\[[\cdot, \cdot]: \mathfrak{g} \times \mathfrak{g} \to \mathfrak{g}\]with the bracket satisfying (i) anti-symmetry (ii) bilinearity (iii) the Jacobi identity:
\[[X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]] = 0\]which is a convenient property for various proofs.
Given a vector space \(V\) with some associative product \(*\) (e.g. \(V\) a matrix space with matrix multiplication \(*\)), \((V, [\cdot, \cdot])\) is a Lie algebra if one defines \([\cdot, \cdot]\) by
\[[X, Y] := X * Y - Y * X\]For two Lie algebras \(\mathfrak{g}, \mathfrak{h}\), we say that \(f: \mathfrak{g} \to \mathfrak{h}\) is a (Lie algebra) homomorphism iff (i) it is a vector space homomorphism (i.e. \(f\) linear) and (ii) \(f\) commutes with \([\cdot, \cdot]\) in the sense that:
\[f([X, Y]) = [f(X), f(Y)]\]As usual, we say that \(f: \mathfrak{g} \to \mathfrak{h}\) is an isomorphism iff \(f\) is a bijective homomorphism whose inverse is also a homomorphism. If such an isomorphism exists, we write \(\mathfrak{g} \cong \mathfrak{h}\).
For a Lie algebra \(\mathfrak{g}\), a sub-algebra \(\mathfrak{h} \leq \mathfrak{g}\) is said to be an ideal iff
\[[\mathfrak{g}, \mathfrak{h}] \subseteq \mathfrak{h}\]using the notation \([\mathfrak{g}, \mathfrak{h}] \equiv \{[X, Y]: X \in \mathfrak{g}, Y \in \mathfrak{h}\}\). Any Lie algebra \(\mathfrak{g}\) always has the trivial ideals \(\mathfrak{h} = \mathfrak{g}\) and \(\mathfrak{h} = \{0\}\).
\textit{Semi-simplicity.} We say that \(\mathfrak{g}\) is \textit{semi-simple} iff it has no non-zero abelian ideals.
For sub-algebras \(\mathfrak{h}\) that are instead abelian, we say that \(\mathfrak{h}\) is a \textit{maximal abelian subalgebra} if it is not contained in any larger abelian sub-algebra.
For a complex semi-simple Lie algebra \(\mathfrak{g}\), we say that a sub-algebra \(\mathfrak{h} \leq \mathfrak{g}\) is a \textit{Cartan subalgebra} (CSA) iff \(\mathfrak{h}\) is abelian \& maximal, and \(\text{ad}_H: \mathfrak{g} \to \mathfrak{g}\) is diagonalizable for all \(H \in \mathfrak{h}\).
Semi-simple Lie algebras are special in that there is a natural metric on the Lie algebra when we have semi-simplicity (indeed, \(\mathfrak{g}\) is semi-simple iff its Killing form is non-degenerate, which allows us to define a metric).
A \textit{simple} Lie algebra is a non-abelian Lie algebra with no non-trivial ideals. We can always write a semi-simple Lie algebra as a direct sum of simple Lie algebras. Indeed, an alternative definition of a semi-simple Lie algebra is as any direct sum of simple Lie algebras.
In the following, we will \textbf{classify all complex semi-simple finite-dim Lie algebras} \(\mathfrak{g}\), making use of the existence of a CSA \(\mathfrak{h} \leq \mathfrak{g}\).
Given a CSA \(\mathfrak{h}\) of our Lie algebra \(\mathfrak{g}\), abelian-ness tells us that
\[\text{ad}_H(H') = 0 \; \forall \; H, H' \in \mathfrak{h}\] \[\implies [\text{ad}_H, \text{ad}_{H'}] = \text{ad}_{[H, H']} \equiv 0\]and further, since each \(\text{ad}_H\) is diagonalizable, the collection \(\{\text{ad}_H\}_{H \in \mathfrak{h}}\) is simultaneously diagonalizable (since diagonalizable operators that commute with each other are simultaneously diagonalizable). By the spectral decomposition theorem, this means that \(\mathfrak{g}\) is spanned by the simultaneous eigenvectors of \(\{\text{ad}_H\}_{H \in \mathfrak{h}}\).
Overall we can write \(\mathfrak{g}\) as a span of the Cartan-Weyl basis:
\[\mathfrak{g} = \text{span}_{\mathbb{C}} \mathcal{B}, \quad \mathcal{B} := \underbrace{\{H_i\}_{i=1}^{r}}_{\text{basis of} \; \mathfrak{h}} \cup \{E_{\alpha}\}_{\alpha \in \Phi}\]where \(\{E_{\alpha}\}_{\alpha \in \Phi}\) are simultaneous eigenvectors of \(\{\text{ad}_H\}_{H \in \mathfrak{h}}\) satisfying
\[\text{ad}_H(E_{\alpha}) = \alpha(H) E_{\alpha} \quad \forall \; \; H \in \mathfrak{h}\]for roots \(\Phi \ni \alpha: \mathfrak{h} \to \mathbb{C}\), and with \(\text{ad}_H(H_i) = 0 \; \forall \; i\) (since \(H_i \in \mathfrak{h}\)), and defining \(r := \dim\mathfrak{h}\). Further, one can show that the roots \(\alpha \in \Phi\) are linear, and hence \(\Phi \subset \mathfrak{h}^{*}\).
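As a concrete illustration (a minimal numerical sketch, not part of the derivation above), we can build the Cartan-Weyl basis of \(\mathfrak{sl}(2, \mathbb{C})\) explicitly and read off the root values \(\alpha(H) = \pm 2\) from the \(\text{ad}\) action:

```python
import numpy as np

# sl(2, C) as 2x2 matrices: H = diag(1, -1), E = upper triangular, F = lower.
H = np.array([[1, 0], [0, -1]], dtype=complex)
E = np.array([[0, 1], [0, 0]], dtype=complex)
F = np.array([[0, 0], [1, 0]], dtype=complex)
basis = [H, E, F]

def bracket(X, Y):
    return X @ Y - Y @ X

def coords(X):
    # Expand a traceless X = a*H + b*E + c*F by reading off matrix entries.
    return np.array([X[0, 0], X[0, 1], X[1, 0]])

# Matrix of ad_H in the ordered basis (H, E, F): one column per basis element.
adH = np.column_stack([coords(bracket(H, T)) for T in basis])

# ad_H is already diagonal in this basis, with eigenvalues 0 (on H),
# +2 (on E) and -2 (on F): the two roots of sl(2, C).
print(np.diag(adH).real)  # [ 0.  2. -2.]
```

Here \(\mathfrak{h} = \text{span}_{\mathbb{C}}\{H\}\) (so \(r = 1\)) and \(E, F\) play the role of \(E_{\alpha}, E_{-\alpha}\).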
Additionally, we will generally have \(\dim\mathfrak{h} \leq \dim\mathfrak{g}/2\), which (since \(\dim\mathfrak{g} = \dim\mathfrak{h} + |\Phi|\)) implies that \(|\Phi| \geq \dim\mathfrak{h}^{*}\). As we will see later, one can show that the roots span \(\mathfrak{h}^{*}\), i.e. \(\mathfrak{h}^{*} = \text{span}_{\mathbb{C}}\Phi\). As a result, \(\Phi\) can be viewed as an overcomplete basis for \(\mathfrak{h}^{*}\).
By definition, this set of generators has relations
\[[H_i, H_j] = 0, \quad [H_i, E_{\alpha}] = \alpha(H_i) E_{\alpha}\]and further, \([E_{\alpha}, E_{\beta}]\) can be determined via Jacobi’s identity:
\[\text{ad}_{H}([E_{\alpha}, E_{\beta}]) = (\alpha(H) + \beta(H)) [E_{\alpha}, E_{\beta}]\] \[\implies \begin{cases} [E_{\alpha}, E_{\beta}] \in \mathfrak{h}, & \alpha+\beta = 0\\ [E_{\alpha}, E_{\beta}]=N_{\alpha, \beta} E_{\alpha+\beta}, & \alpha+\beta \in \Phi\\ [E_{\alpha}, E_{\beta}]=0, & \text{otherwise} \end{cases}\]for proportionality factor \(N_{\alpha, \beta} \neq 0\).
\textit{Natural metric for semi-simple Lie algebras.} For the following, introduce the Killing form \(\kappa\): a symmetric \((0, 2)\) tensor defined by
\[\kappa(X, Y) := \text{tr}(\text{ad}_X \circ \text{ad}_Y)\]We can find its components more explicitly. For some basis \(\{T_i\}_i \subset \mathfrak{g}\) with structure constants \([T_i, T_j] = f^k{}_{ij} T_k\), expanding the trace in this basis gives us
\[\begin{align*} \kappa_{ij} := \kappa(T_i, T_j) &= \text{tr}(\text{ad}_{T_i} \circ \text{ad}_{T_j})\\ &= [T_i, [T_j, T_k]]^k\\ &= f^k{}_{il} f^l{}_{jk} \end{align*}\]We have that \(\kappa\) is non-degenerate (equivalent to the matrix \(\kappa_{ij}\) being invertible in any basis) iff \(\mathfrak{g}\) is semi-simple.
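As a quick numerical check of the component formula \(\kappa_{ij} = f^k{}_{il} f^l{}_{jk}\) (a sketch, using the real structure constants \(f^k{}_{ij} = \epsilon_{ijk}\) of \(\mathfrak{su}(2)\) as an example):

```python
import numpy as np

# Structure constants of su(2) in a real basis: [T_i, T_j] = eps_{ijk} T_k,
# stored as f[k, i, j] = f^k_{ij}.
f = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    f[k, i, j] = 1.0
    f[k, j, i] = -1.0

# kappa_{ij} = f^k_{il} f^l_{jk}, contracted with einsum.
kappa = np.einsum('kil,ljk->ij', f, f)
print(kappa)  # -2 * identity: non-degenerate, consistent with semi-simplicity
```

The resulting \(\kappa_{ij} = -2\delta_{ij}\) is invertible, as expected for a semi-simple algebra.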
Therefore, in the case of \(\mathfrak{g}\) being semi-simple, we can view \(\kappa\) as analogous to a metric (i.e. a symmetric, non-degenerate \((0, 2)\) tensor). As we will see, we will lower indices using \(\kappa_{ij}\) and raise indices using \(\kappa^{ij} \equiv (\kappa^{-1})^{ij}\). In the following we will work with the inner product
\[(X, Y) := \kappa(X, Y) = \kappa_{ij} X^i Y^j = \kappa^{ij} X_i Y_j\]for any \(X, Y \in \mathfrak{g}\). This inner product also naturally extends to an inner product over \(\mathfrak{g}^{*}\): for \(\omega, \eta \in \mathfrak{g}^{*}\) we will write
\[(\omega, \eta) := (X_{\omega}, X_{\eta}) \equiv \kappa^{ij} \omega_i \eta_j\]where (analogous to Riesz representation theorem) for \(\lambda \in \mathfrak{g}^{*}\) we define \(X_{\lambda} \in \mathfrak{g}\) as satisfying
\[\lambda(Y) =: \kappa(X_{\lambda}, Y) \quad \forall \; Y \in \mathfrak{g}\]which implies that \(X_{\lambda} = \kappa^{ij} \lambda_j T_i\) for basis \(\{T_i\}_i \subset \mathfrak{g}\) (found by writing in components).
\(\kappa\) is a special metric in that it satisfies a Jacobi identity-like property:
\[\kappa(X, [Y, Z]) = \kappa(Y, [Z, X]) = \kappa(Z, [X, Y])\]Proof: using the fact that \(\text{ad}\) is a rep of \(\mathfrak{g}\) (which can be proven via Jacobi’s identity), we have
\[\begin{align*} \kappa(X, [Y, Z]) &= \text{tr}(\text{ad}_X \circ \text{ad}_{[Y, Z]})\\ &= \text{tr}(\text{ad}_X \circ [\text{ad}_Y, \text{ad}_Z])\\ &= \text{tr}(\text{ad}_Y \circ [\text{ad}_Z, \text{ad}_X]) \equiv \kappa(Y, [Z, X])\\ &= \text{tr}(\text{ad}_Z \circ [\text{ad}_X, \text{ad}_Y]) \equiv \kappa(Z, [X, Y]) \end{align*}\]using the cyclic property of \(\text{tr}\).
For \(\alpha \in \mathfrak{h}^{*}\), we will use similar notation to above, with \(H_{\alpha} = \kappa^{ij} \alpha_j H_i \in \mathfrak{h}\) (with \(\kappa_{ij} \equiv \kappa(H_i, H_j)\)) satisfying
\[\alpha(H) = \kappa(H_{\alpha}, H) \quad \forall \; H \in \mathfrak{h}\]\textit{Relations.} Using the properties established above, we can write
\[[E_{\alpha}, E_{-\alpha}] = (E_{\alpha}, E_{-\alpha}) H_{\alpha}\]This follows from the invariance property of \(\kappa\):
\[\kappa(H, [E_{\alpha}, E_{-\alpha}]) = \kappa(E_{\alpha}, [E_{-\alpha}, H]) = \underbrace{\alpha(H)}_{\kappa(H_{\alpha}, H)} \kappa(E_{\alpha}, E_{-\alpha})\] \[\implies \kappa(H, [E_{\alpha}, E_{-\alpha}] - \kappa(E_{\alpha}, E_{-\alpha}) H_{\alpha}) = 0 \quad \forall \; H \in \mathfrak{h}\] \[\implies [E_{\alpha}, E_{-\alpha}] = \kappa(E_{\alpha}, E_{-\alpha}) H_{\alpha} \equiv (E_{\alpha}, E_{-\alpha}) H_{\alpha}\]by non-degeneracy of \(\kappa\).
Instead of the generators \(\{H_i, E_{\alpha}\}_{i, \alpha \in \Phi}\), we will now take \(\{H_{\alpha}, E_{\alpha}\}_{\alpha \in \Phi}\) as our generators, which (using \(H_{\alpha} = \kappa^{ij} \alpha_j H_i\)) have relations
\[[H_{\alpha}, H_{\beta}] = 0, \quad [H_{\alpha}, E_{\beta}] = (\alpha, \beta) E_{\beta},\] \[[E_{\alpha}, E_{\beta}] = \begin{cases} (E_{\alpha}, E_{-\alpha}) H_{\alpha} & \alpha+\beta=0\\ N_{\alpha, \beta} E_{\alpha+\beta} & \alpha+\beta \in \Phi\\ 0 & \text{otherwise} \end{cases}\]Note that the sub-algebra generated by \((H_{\alpha}, E_{\alpha}, E_{-\alpha})\) has relations
\[[H_{\alpha}, E_{\pm\alpha}] = \pm(\alpha, \alpha)E_{\pm\alpha}, \quad [E_{\alpha}, E_{-\alpha}] = (E_{\alpha}, E_{-\alpha}) H_{\alpha}\]We would like for these relations to precisely match the relations of \(\mathfrak{su}(2)_{\mathbb{C}}\). To achieve this, we define rescaled elements
\[h_{\alpha} := \frac{2}{(\alpha, \alpha)} H_{\alpha}, \quad e_{\alpha} := \sqrt{\frac{2}{(\alpha, \alpha) (E_{\alpha}, E_{-\alpha})}} E_{\alpha}\]with \((h_{\alpha}, e_{\alpha}, e_{-\alpha})\) precisely satisfying the \(\mathfrak{su}(2)_{\mathbb{C}}\) algebra:
\[[h_{\alpha}, e_{\pm\alpha}] = \pm 2e_{\pm\alpha}, \quad [e_{\alpha}, e_{-\alpha}] = h_{\alpha}\]That is, each root \(\alpha \in \Phi\) is associated with a subalgebra corresponding to \(\mathfrak{su}(2)_{\mathbb{C}}\).
These rescaled elements \(\{h_{\alpha}, e_{\alpha}, e_{-\alpha}\}_{\alpha}\) have global relations
\[[h_{\alpha}, h_{\beta}] = 0, \quad [h_{\alpha}, e_{\beta}] = \frac{2(\alpha, \beta)}{(\alpha, \alpha)} e_{\beta}, \quad [e_{\alpha}, e_{\beta}] = \begin{cases} h_{\alpha} & \alpha + \beta = 0\\ n_{\alpha, \beta} e_{\alpha+\beta} & \alpha+\beta \in \Phi\\ 0 & \text{otherwise} \end{cases}\]This construction implies that we have
\[\frac{2(\alpha, \beta)}{(\alpha, \alpha)} \in \mathbb{Z} \quad \forall \; \alpha, \beta \in \Phi\]Now we will construct a valid irrep of the subalgebra \(\mathfrak{g}^{\alpha} := \text{span}_{\mathbb{C}}\{h_{\alpha}, e_{\alpha}, e_{-\alpha}\} \cong \mathfrak{su}(2)_{\mathbb{C}}\) on some space \(V\) and apply the above to obtain the desired result. The minimal invariant subspace \(V\) of \(\text{ad}\) will act as a particularly natural irrep. See that
\[\text{ad}_{h^{\alpha}}(e^{\beta}) = \frac{2(\alpha, \beta)}{(\alpha, \alpha)} e^{\beta}, \quad \text{ad}_{e^{\alpha}}(e^{\beta}) = \begin{cases} n_{\alpha, \beta} e^{\alpha+\beta}, & \alpha+\beta \in \Phi\\ 0, & \text{otherwise} \end{cases}\]assuming that \(\beta \in \Phi\) is such that \(\beta \neq -\alpha\) (as if \(\beta = -\alpha\), then \(h^{\alpha}\) would have to be included in our vector space, but we are looking for the \textit{minimal} invariant subspace). As a result, we clearly have that
\[V_{\alpha, \beta} := \text{span}_{\mathbb{C}}\{e^{\beta+\rho\alpha} \, | \, \rho \in \mathbb{Z}: \beta+\rho\alpha \in \Phi\}\]is the minimal invariant subspace. As a result, \(\text{ad}^{\alpha, \beta}: \mathfrak{g}^{\alpha} \to \mathfrak{gl}(V_{\alpha, \beta}), X \mapsto [X, \cdot]\) is an irrep, and by isomorphism \(\mathfrak{g}^{\alpha} \cong \mathfrak{su}(2)_{\mathbb{C}}\), we have that
\[S_{\text{ad}^{\alpha, \beta}} = \{-\Lambda, -\Lambda+2, \ldots, \Lambda\} \quad \text{for some} \; \Lambda \in \mathbb{Z}_{\geq 0}\]and note that, by definition, this weight set can be written
\[\begin{align*} S_{\text{ad}^{\alpha, \beta}} &= \{\lambda \in \mathbb{C} \, | \, \exists \, v \in V_{\alpha, \beta}: \text{ad}_{h^{\alpha}}^{\alpha, \beta}(v) = \lambda v\}\\ &= \left\{\frac{2(\alpha, \beta)}{(\alpha, \alpha)} + 2\rho \, | \, \rho \in \mathbb{Z}: \beta+\rho\alpha \in \Phi\right\} \end{align*}\]and so we arrive at
\[\frac{2(\alpha, \beta)}{(\alpha, \alpha)} \in \mathbb{Z} \quad \forall \; \alpha, \beta \in \Phi\]For any \(\alpha, \beta \in \Phi\), these results also imply that there must be some maximum \(\rho = n_{+} \in \mathbb{Z}_{\geq 0}\) for which \(\beta+\rho \alpha \in \Phi\), as well as some minimum \(\rho = n_{-} \in \mathbb{Z}_{\leq 0}\) for which \(\beta+\rho\alpha \in \Phi\) holds. Since these describe the endpoints of the weight set, corresponding to \(\Lambda\) and \(-\Lambda\) respectively, then we must have
\[\frac{2(\alpha, \beta)}{(\alpha, \alpha)} = -n_{+} - n_{-}\]For a pair of roots \(\alpha, \beta \in \Phi\), we will use the notation \(n_{+} = n_{+}(\alpha, \beta)\) and \(n_{-} = n_{-}(\alpha, \beta)\) for these integers.
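We can verify the root-string formula numerically (a sketch for the root system \(A_2\) of \(\mathfrak{sl}(3, \mathbb{C})\); the Gram matrix below encodes the standard inner products of its simple roots, normalized so that \((\alpha, \alpha) = 2\)):

```python
import numpy as np

# Roots of A_2 = sl(3, C), written in the simple-root basis, with Gram matrix
# G_ij = (alpha_(i), alpha_(j)).
G = np.array([[2, -1], [-1, 2]])
roots = [np.array(v) for v in
         [(1, 0), (0, 1), (1, 1), (-1, 0), (0, -1), (-1, -1)]]

def ip(a, b):
    return a @ G @ b

def in_phi(v):
    return any((v == r).all() for r in roots)

# For each pair, find the alpha-string through beta and check
#   2(alpha, beta)/(alpha, alpha) = -(n_+ + n_-).
for a in roots:
    for b in roots:
        if (a + b == 0).all() or (a == b).all():
            continue  # the derivation above excluded beta = -alpha
        n_plus = max(p for p in range(-3, 4) if in_phi(b + p * a))
        n_minus = min(p for p in range(-3, 4) if in_phi(b + p * a))
        assert 2 * ip(a, b) / ip(a, a) == -(n_plus + n_minus)
print("root-string formula verified for all root pairs of A_2")
```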
In the above, we have shown that each root string of length \(\ell_{\alpha, \beta}\) is in correspondence with a rep of \(\mathfrak{su}(2)_{\mathbb{C}}\) of dimension \(\ell_{\alpha, \beta}\). Particularly, the constructed rep \(\text{ad}^{\alpha, \beta}\) acts on
\[V_{\alpha, \beta} = \text{span}_{\mathbb{C}}\{e_{\gamma}: \gamma \in S_{\alpha, \beta}\}\]which has dimension \(\dim V_{\alpha, \beta} = |S_{\alpha, \beta}| \equiv \ell_{\alpha, \beta}\), where \(S_{\alpha, \beta}\) denotes the \(\alpha\)-string through \(\beta\). And since \(\ell_{\alpha, \beta} = n_{+} - n_{-} + 1\), we have that for simple roots \(\alpha, \beta \in \Phi_S\) (such that \(n_{-} = 0\)), we can write \(\ell_{\alpha, \beta} = 1 + n_{+}\).
The \textit{Cartan matrix} \(A\) is defined
\[A_{ij} := \frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(j)}, \alpha_{(j)})}\]Since \((\alpha, \alpha) > 0\), this tells us that we must have \((\alpha, \beta) \leq 0\) for all distinct \(\alpha, \beta \in \Phi_S\) (as otherwise \(\ell_{\alpha, \beta} < 1\), which is invalid). As we will see, this tells us that the off-diagonal elements of the Cartan matrix are non-positive: \(A_{ij} \leq 0\) for \(i \neq j\).
Using the notation \(\ell_{i, j} := \ell_{\alpha_{(i)}, \alpha_{(j)}}\), we therefore have
\[\ell_{i, j} = 1-A_{ji}\]which we use later to derive Serre’s relation.
We have that \(\mathfrak{h}^{*} = \text{span}_{\mathbb{C}} \Phi\).
Proof: if \(\Phi\) does not span \(\mathfrak{h}^{*}\), then there exists some non-zero \(X \in \mathfrak{h}\) such that \(\alpha(X) \equiv \alpha_i X^i = 0\) for all \(\alpha \in \Phi\). From here, we can use the existence of \(X\) to construct a non-trivial abelian ideal subalgebra \(\mathfrak{k}\) (i.e. \([\mathfrak{g}, \mathfrak{k}] \subseteq \mathfrak{k}\)), contradicting semi-simplicity of \(\mathfrak{g}\). This is straightforward: \(X \in \mathfrak{h}\), and so the span of \(X\) is clearly an abelian subalgebra (since CSAs are abelian); it remains to show that it is an ideal. See that
\[[X, E_{\alpha}] = X^i [H_i, E_{\alpha}] = \underbrace{X^i \alpha_i}_{=\, 0} E_{\alpha} = 0\](using the commutation relations) and hence \(\text{span}_{\mathbb{C}} \{X\}\) is an abelian ideal of \(\mathfrak{g}\), contradicting semi-simplicity.
Since \(\mid\Phi\mid \geq \dim\mathfrak{h}^{*}\), the set of roots will generally be an overcomplete basis for \(\mathfrak{h}^{*}\). This motivates constructing a minimal set of exactly \(r\) roots that acts as a basis for \(\mathfrak{h}^{*}\).
To do so, we perform the following reduction to obtain the \textit{simple roots}.
\textit{Constructing the simple roots.} As we have used above in writing \(E_{-\alpha}\) as a generator, we have that \(\alpha \in \Phi \iff -\alpha \in \Phi\).
Proof: To show this, we make use of two results: (i) \(\kappa(H, E_{\alpha}) = 0\) for all \(H \in \mathfrak{h}\), and (ii) \(\kappa(E_{\alpha}, E_{\beta}) = 0\) whenever \(\alpha + \beta \neq 0\).
Proof of (i): Since \(\alpha \neq 0\), there exists \(H' \in \mathfrak{h}\) such that \(\alpha(H') \neq 0\), and so
\[\alpha(H') \kappa(H, E_{\alpha}) = \kappa(H, \alpha(H') E_{\alpha}) = \kappa(H, [H', E_{\alpha}]) = \kappa(E_{\alpha}, [H, H']) = 0\]The proof of (ii) is analogous, choosing \(H'\) such that \((\alpha + \beta)(H') \neq 0\). These results tell us that, for any \(\alpha \in \Phi\), \(\kappa(E_{\alpha}, H) = \kappa(E_{\alpha}, E_{\beta}) = 0\) for any \(H \in \mathfrak{h}\) and any \(\beta \in \Phi\) such that \(\beta \neq -\alpha\). But since \(\kappa\) is non-degenerate by semi-simplicity, \(\kappa(E_{\alpha}, \cdot)\) cannot map everything to zero. The only option is for \(-\alpha \in \Phi\) with \(\kappa(E_{\alpha}, E_{-\alpha}) \neq 0\).
And since \(0 \notin \Phi\), we have that \(|\Phi|\) is even. Let us separate \(\mathfrak{h}^{*} \cong \mathbb{C}^{r}\) into two halves via an \((r-1)\)-dimensional hyperplane through the origin (chosen so that no root lies on it). This hyperplane splits \(\Phi\) into two equally sized sets of roots:
\[\Phi := \Phi_{+} \cup \Phi_{-}\]Namely, if \(\alpha \in \Phi_{+}\), then we will have \(-\alpha \in \Phi_{-}\). Further, if \(\alpha, \beta \in \Phi_{+}\) and \(\alpha + \beta \in \Phi\), then \(\alpha + \beta \in \Phi_{+}\) (and similarly for \(\Phi_{-}\)).
Now that we have halved the size of \(\Phi\) to \(\Phi_{+}\), there is one further reduction we perform. We say that a root \(\alpha \in \Phi\) is simple iff it is a positive root (\(\alpha \in \Phi_{+}\)) and cannot be written as the sum of two positive roots. We denote the set of simple roots by
\[\Phi_S = \{\alpha_{(i)} : i = 1, \ldots, |\Phi_S|\}\]We will now show that \(\Phi_S\) is a basis for \(\mathfrak{h}^{*}\) with \(\mid\Phi_S\mid = r\).
Firstly, we have that
\[\Phi_{+} \subseteq \text{span}_{\mathbb{Z}_{\geq 0}} \Phi_S, \quad \Phi_{-} \subseteq \text{span}_{\mathbb{Z}_{\leq 0}} \Phi_S\]telling us that any positive root can be written as a linear combination of simple roots with non-negative integer coefficients, and similarly for negative roots (with non-positive integer coefficients).
This result implies that \(\Phi \subseteq \text{span}_{\mathbb{Z}} \Phi_S\) and, since \(\Phi\) \(\mathbb{C}\)-spans \(\mathfrak{h}^{*}\), we therefore have that
\[\text{span}_{\mathbb{C}} \Phi_S = \mathfrak{h}^{*}\]also.
Finally, we have that the simple roots are linearly independent, meaning that \(|\Phi_S| = r\) and hence \(\Phi_S\) is a basis for \(\mathfrak{h}^{*}\).
Proof: Note that we can write any \(\lambda \in \mathfrak{h}^{*}\) as
\[\lambda = \sum_i c_i \alpha_{(i)}, \qquad c_i \in \mathbb{C}\]To show linear independence of \(\Phi_S\) it is sufficient to show that \(\lambda = 0 \implies c_i = 0 \; \forall \; i\). See that if \(\lambda = 0\), we can apply \((\cdot, \alpha_{(j)})\) to the above:
\[0 = \sum_i c_i (\alpha_{(i)}, \alpha_{(j)}) \quad \forall \; \; j\]This condition can be written as a matrix equation:
\[Ac = 0\]for \(c = (c_1, \ldots, c_s)^T \in \mathbb{C}^s\) (with \(s := |\Phi_S|\)) and the symmetric matrix \(A\) defined by \(A_{ij} := (\alpha_{(i)}, \alpha_{(j)})\). We have that \(A\) is non-singular as it is positive definite:
\[x^T A x = x^i x^j (\alpha_{(i)}, \alpha_{(j)}) = (\chi, \chi) > 0\]for any non-zero \(x \in \mathbb{R}^{s}\), defining \(\chi := x^i \alpha_{(i)} \in \mathfrak{h}^{*}\) (using that the inner product is positive definite on the real span of the roots). Hence \(A\) is non-singular, and so \(Ac = 0\) has the unique solution \(c=0\), giving us linear independence.
One useful result is that the difference of two simple roots is not a root.
Proof: For any two simple roots \(\alpha_{(i)}, \alpha_{(j)} \in \Phi_S\), define
\[\alpha := \alpha_{(i)} - \alpha_{(j)}\]Now assume that \(\alpha \in \Phi\), then WLOG we can assume \(\alpha \in \Phi_{+}\) (as otherwise we can switch sign \(\alpha \mapsto -\alpha\) in the definition of \(\alpha\)). Then \(\alpha + \alpha_{(j)} = \alpha_{(i)}\) is the sum of two positive roots that equals a simple root, which contradicts the definition of a simple root. Therefore, we must have \(\alpha_{(i)} - \alpha_{(j)} \notin \Phi\).
The simple roots
\[\Phi_S = \{\alpha_{(i)}: i = 1, \ldots, r\}\]provide a canonical \(\mathbb{C}\)-basis for \(\mathfrak{h}^{*}\). Writing \(h_i := h_{\alpha_{(i)}}\) and \(e_{i} := e_{\alpha_{(i)}}\) (and \(h_{-i} := h_{-\alpha_{(i)}}\), \(e_{-i} := e_{-\alpha_{(i)}}\)), the triple \((h_i, e_i, e_{-i})\) generate a subalgebra corresponding to \(\mathfrak{su}(2)_{\mathbb{C}}\), globally acting as
\[[h_i, h_j] = 0, \quad [h_i, e_{\pm j}] = \pm \underbrace{\frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(i)}, \alpha_{(i)})}}_{=: A_{ji}} e_{\pm j}, \quad [e_i, e_{-j}] = \delta_{ij} h_i,\]defining the Cartan matrix
\[A_{ij} := \frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(j)}, \alpha_{(j)})}\]Similarly to before, we can define the subalgebra \(\mathfrak{g}_i := \text{span}_{\mathbb{C}}\{h_i, e_i, e_{-i}\}\), which has \(\mathfrak{g}_{i} \cong \mathfrak{su}(2)_{\mathbb{C}}\) and associated CSA \(\mathfrak{h}_{i} := \text{span}_{\mathbb{C}}\{h_i\}\). The generators \(\{h_i, e_i, e_{-i}\}_{i}\) are called the Chevalley basis of \(\mathfrak{g}\).
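For example (a small sketch; the vectors below are a standard realization of the simple roots of \(B_2 = \mathfrak{so}(5, \mathbb{C})\), chosen because its Cartan matrix is non-symmetric), we can compute a Cartan matrix directly from its definition:

```python
import numpy as np

# Simple roots of B_2 as vectors in R^2: one long root and one short root.
alpha = [np.array([1.0, -1.0]), np.array([0.0, 1.0])]

def cartan_matrix(simple_roots):
    # A_ij = 2 (alpha_(i), alpha_(j)) / (alpha_(j), alpha_(j))
    n = len(simple_roots)
    A = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            aj = simple_roots[j]
            A[i, j] = 2 * simple_roots[i] @ aj / (aj @ aj)
    return A

A = cartan_matrix(alpha)
print(A)  # [[ 2. -2.] [-1.  2.]]
```

Note that despite the normalization by \((\alpha_{(j)}, \alpha_{(j)})\), all entries are integers, with 2 on the diagonal and non-positive entries off it.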
In addition to the above relations, we have
\[\text{ad}_{e_i}(e_j) \equiv [e_i, e_j] \propto \begin{cases} e_{\alpha_{(i)}+\alpha_{(j)}}, & \alpha_{(i)} + \alpha_{(j)} \in \Phi\\ 0, & \text{otherwise} \end{cases}\]meaning that
\[\text{ad}_{e_i}^{\rho}(e_j) \propto \begin{cases} e_{\alpha_{(j)} + \rho\alpha_{(i)}}, & \rho \in \{n_{-}^{(i, j)}, \ldots, n_{+}^{(i, j)}\}\\ 0, & \text{otherwise} \end{cases}\]defining \(n_{+}^{(i, j)} := n_{+}(\alpha_{(i)}, \alpha_{(j)})\) and \(n_{-}^{(i, j)} := n_{-}(\alpha_{(i)}, \alpha_{(j)})\), where \(n_{+}, n_{-}\) are as defined in A.1.1.3.
Furthermore, as proven in the previous section, the difference between two simple roots is not a root (i.e. \(\alpha_{(i)} - \alpha_{(j)} \notin \Phi\)), which tells us that \(n_{-}^{(i, j)} = 0\) for all \(i, j\). And recall (in A.1.1.3) that we found
\[n_{+}^{(i, j)} = -\frac{2(\alpha_{(i)}, \alpha_{(j)})}{(\alpha_{(i)}, \alpha_{(i)})} \equiv -A_{ji}\]As a result, we have
\[\text{ad}_{e_i}^{\rho}(e_j) \propto \begin{cases} e_{\alpha_{(j)} + \rho\alpha_{(i)}}, & \rho \in \{0, 1, \ldots, -A_{ji}\}\\ 0, & \text{otherwise} \end{cases}\]providing the \textit{Serre relation} for \(i\neq j\):
\[\text{ad}_{e_i}^{-A_{ji}}(e_j) \neq 0, \qquad \text{ad}_{e_i}^{-A_{ji}+1}(e_j) = 0\]\textbf{Classifying algebras via Cartan matrices.} We saw that the entries of the Cartan matrix \(A \in \mathbb{Z}^{r \times r}\) completely determined the relations between elements of the Chevalley basis \(\{h_i, e_i, e_{-i}\}_{i=1}^{r}\) of \(\mathfrak{g}\). Indeed, the Cartan matrix uniquely determines a semi-simple complex Lie algebra.
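The Serre relation can be checked concretely for \(\mathfrak{sl}(3, \mathbb{C})\), whose Cartan matrix has \(A_{21} = -1\) (a minimal sketch using the defining \(3 \times 3\) matrix representation):

```python
import numpy as np

# Chevalley generators e_i of sl(3, C) associated with the two simple roots.
e1 = np.zeros((3, 3)); e1[0, 1] = 1.0   # E_12
e2 = np.zeros((3, 3)); e2[1, 2] = 1.0   # E_23

def ad(X):
    return lambda Y: X @ Y - Y @ X

# With A_21 = -1, Serre's relation reads ad_{e1}(e2) != 0, ad_{e1}^2(e2) = 0.
first = ad(e1)(e2)       # proportional to E_13, the root alpha_(1) + alpha_(2)
second = ad(e1)(first)   # alpha_(2) + 2 alpha_(1) is not a root
assert np.any(first != 0) and np.all(second == 0)
print("Serre relation verified for sl(3, C)")
```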
As a result, the problem of classifying all complex semi-simple Lie algebras reduces to classifying all possible Cartan matrices. In the following, we will derive some necessary conditions on \(A\) to aid in this classification.
Todo: relevant proofs
As a result, we can classify all complex semi-simple finite-dim Lie algebras by determining the valid Cartan matrices \(A\). The definition of \(A\) gives some constraints: \(A_{ii} = 2\); \(A_{ij} \in \mathbb{Z}_{\leq 0}\) for \(i \neq j\); \(A_{ij} = 0 \iff A_{ji} = 0\); and \(A_{ij}A_{ji} \in \{0, 1, 2, 3\}\) for \(i \neq j\) (by Cauchy-Schwarz, since the simple roots are linearly independent).
This is sufficient to classify all complex semi-simple finite-dim Lie algebras, amounting to the infinite families \(A_r, B_r, C_r, D_r\) as well as exceptional algebras \(E_6, E_7, E_8, F_4, G_2\). See here for more details.
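As an illustration of one of these families (a sketch, assuming the standard tridiagonal form of the \(A_r\) Cartan matrices), we can construct the Cartan matrix of \(A_r = \mathfrak{sl}(r+1, \mathbb{C})\) and check that its determinant is positive, consistent with positive-definiteness of the underlying inner product:

```python
import numpy as np

def cartan_A(r):
    # Cartan matrix of the family A_r = sl(r+1, C): 2 on the diagonal,
    # -1 between neighbouring simple roots on the Dynkin diagram.
    A = 2 * np.eye(r, dtype=int)
    for i in range(r - 1):
        A[i, i + 1] = A[i + 1, i] = -1
    return A

# det(A_r) = r + 1 > 0 for every rank r.
for r in range(1, 8):
    assert round(np.linalg.det(cartan_A(r))) == r + 1
print(cartan_A(3))
```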
We now wish to classify all available representations \(D: G \to \text{GL}(V)\) of a given Lie group \(G\). We can do the following: (i) classify all Lie algebra reps (ii) map to Lie group reps via the exponential map.
A Lie algebra representation \(d: \mathfrak{g} \to \mathfrak{gl}(V)\) is a (Lie algebra) homomorphism between \(\mathfrak{g}\) and \(\mathfrak{gl}(V)\) for some vector space \(V\). One often calls \(V\) the representation, or the representation space.
For a representation \(d\) of \(\mathfrak{g}\) over \(V\), we say that \(W \leq V\) is an invariant subspace of \(d\) iff
\[d_X(W) \subseteq W \quad \forall \; \; X \in \mathfrak{g}\]Any representation \(d\) always has the trivial invariant subspaces \(W = V\) and \(W = \{0\}\). If \(d\) has no non-trivial invariant subspaces, then we say that \(d\) is irreducible. Otherwise, we say that \(d\) is reducible.
We can generalize the concept of irreducibility: if we can decompose
\[V = W_1 \oplus \cdots \oplus W_K\]for a collection of invariant subspaces \(\{W_i\}_{i=1}^{K}\) of \(d\), then we say that \(d\) is totally reducible. The case of \(K=1\) corresponds to irreducibility.
In particular, if \(d\) is totally reducible, then we may write
\[d = d^{(1)} \oplus \cdots \oplus d^{(K)}\]for a collection of irreducible representations \(\{d^{(i)}: \mathfrak{g} \to \mathfrak{gl}(W_i)\}_{i=1}^{K}\).
Weyl’s theorem on complete reducibility (the Lie-algebra analogue of Maschke’s theorem) tells us that any finite-dimensional representation of a semi-simple Lie algebra \(\mathfrak{g}\) is totally reducible (assuming a field \(\mathbb{F}\) of characteristic zero).
Consider a rep \(d\) of \(\mathfrak{g}\) on \(V\). We want to better understand the vectors spaces \(V\) on which \(\mathfrak{g}\) can act through reps \(d\). First we will work with the initial Cartan-Weyl basis \(\{H_i, E_{\alpha}\}_{i, \alpha}\) of \(\mathfrak{g}\) paired with CSA \(\mathfrak{h}\). Note that
\[[d_{H_i}, d_{H_j}] = d_{[H_i, H_j]} = 0\]since \([H_i, H_j] = 0\). This means that the operators \(\{d_{H_i}\}_{i=1}^{r}\) on \(V\) are simultaneously diagonalizable. Then define the eigenspace of simultaneous eigenvectors associated with weight \(\lambda \in \mathfrak{h}^{*}\) by
\[V_d^{\lambda} := \{v \in V: d_H(v) = \lambda(H)v \;\; \forall \; H \in \mathfrak{h}\}\]Denote the total set of weights of rep \(d\) by \(S_d \subset \mathfrak{h}^{*}\), i.e.
\[S_d := \{\lambda \in \mathfrak{h}^{*}: V_{d}^{\lambda} \neq \{0\}\}\]It is easy to show that
\[d_{E_{\alpha}}(V_d^{\lambda}) \subseteq \begin{cases} V_d^{\lambda+\alpha}, & \lambda+\alpha \in S_d\\ \{0\}, & \text{otherwise} \end{cases}\]telling us that \(d_{E_{\alpha}}\) raises weights \(\lambda \to \lambda+\alpha\).
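The raising property follows directly from the commutation relations: for \(v \in V_d^{\lambda}\),
\[d_H(d_{E_{\alpha}}(v)) = d_{E_{\alpha}}(d_H(v)) + d_{[H, E_{\alpha}]}(v) = (\lambda(H) + \alpha(H)) \, d_{E_{\alpha}}(v) \quad \forall \; H \in \mathfrak{h}\]so that \(d_{E_{\alpha}}(v)\), if non-zero, lies in \(V_d^{\lambda+\alpha}\).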
Every finite-dim rep \(d\) has at least one \textit{highest weight} \(\Lambda \in S_d\) defined by
\[d_{E_{\alpha}}(V_{d}^{\Lambda}) = \{0\} \;\; \forall \; \alpha \in \Phi_{+}\]i.e. the highest weight \(\Lambda \in S_d\) is such that, for any positive root \(\alpha \in \Phi_{+}\), \(\Lambda + \alpha \notin S_d\).
Note that here we are working with the subalgebra \(\mathfrak{g}^{\alpha}\), where \(\mathfrak{h}^{\alpha}\) is 1-dimensional and hence we can simply treat \(\lambda \in \mathbb{C}\). In this case, the relevant CSA from which weights are defined is \(\mathfrak{h}^{\alpha} = \text{span}_{\mathbb{C}}\{h^{\alpha}\}\), and for a rep \(d\) of \(\mathfrak{g}^{\alpha} \cong \mathfrak{su}(2)_{\mathbb{C}}\) on \(V\), the weights of \(d\) correspond to eigenvalues of \(d_{h^{\alpha}}: V \to V\). But by isomorphism, the weight set of \(d\) must match \(\{-\Lambda, \ldots, \Lambda\}\), meaning that
\[\lambda \in S_d \implies d_{h_{\alpha}}(v) = \lambda(h_{\alpha}) v \; \; \text{for some} \; \; v \in V\backslash\{0\}\]implies that \(\lambda(h_{\alpha}) \in \mathbb{Z}\). But we can equivalently write
\[\lambda(h_{\alpha}) = \frac{2}{(\alpha, \alpha)} \lambda(H_{\alpha}) = \frac{2(\lambda, \alpha)}{(\alpha, \alpha)}\]meaning that
\[\frac{2(\lambda, \alpha)}{(\alpha, \alpha)} \in \mathbb{Z} \quad \forall \; \lambda \in S_d, \alpha \in \Phi\]In the second equality we used the Riesz representation theorem-like definition of \(H_{\alpha}\), which means
\[\lambda(H_{\alpha}) = \kappa^{ij} \alpha_j \lambda(H_i) = \kappa^{ij} \alpha_j \lambda_i \equiv (\lambda, \alpha)\]The above generalizes the previously used result; when \(d = \text{ad}^{\alpha, \beta}\), we get the root-specific condition \(2(\alpha, \beta)/(\alpha, \alpha)\).
\textit{Dominant weights.} For the following, we will define the co-roots
\[\hat{\alpha}_{(i)} := \frac{2}{(\alpha_{(i)}, \alpha_{(i)})} \alpha_{(i)}\]For \(\lambda \in \mathfrak{h}^{*}\), we define its components \(\lambda^i := (\lambda, \hat{\alpha}_{(i)})\). In particular, they are the components of \(\lambda\) under the dual basis \(\{\omega_{(i)}\}_i\) of \(\{\hat{\alpha}_{(i)}\}_i\):
\[\lambda = \lambda^i \omega_{(i)}\]with \((\omega_{(i)}, \hat{\alpha}_{(j)}) = \delta_{ij}\). Then we define the dominant weights as
\[\mathcal{D}_W := \{\lambda \in \mathfrak{h}^{*}: \lambda^i \in \mathbb{Z}_{\geq 0} \; \forall \; i\}\]The main result: for every dominant weight \(\lambda \in \mathcal{D}_W\), there exists a unique finite-dimensional irrep \(d^{(\lambda)}\) of \(\mathfrak{g}\) for which \(\lambda\) is its highest weight. And further, this exhausts all finite-dim irreps. As a result, we can determine finite-dim irreps by iterating over all dominant weights.
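As an example (a numerical sketch, using an assumed standard embedding of the \(A_2\) simple roots in \(\mathbb{R}^2\)), the components \(\lambda^i\) can be computed by pairing with the co-roots:

```python
import numpy as np

# Simple roots of A_2 embedded in R^2, both with (alpha, alpha) = 2.
a1 = np.array([np.sqrt(2), 0.0])
a2 = np.array([-np.sqrt(2) / 2, np.sqrt(6) / 2])

# Co-roots hat(alpha_(i)) = 2 alpha_(i) / (alpha_(i), alpha_(i)), as rows.
C = np.array([2 * a / (a @ a) for a in (a1, a2)])

# Dual basis: fundamental weights omega_(i) with (omega_(i), hat(alpha_(j))) = delta_ij.
Omega = np.linalg.solve(C, np.eye(2))  # columns are the omega_(i)

# Components lambda^i = (lambda, hat(alpha_(i))). For example, the highest
# root a1 + a2 has components (1, 1), so it is a dominant weight
# (its irrep is the adjoint representation of sl(3, C)).
labels = C @ (a1 + a2)
print(np.round(labels, 10))  # [1. 1.]
```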
Todo: relevant proofs
Todo
In the canonical quantization approach to QFT – an alternative to the path integral approach outlined in the rest of this post – it is particularly convenient to derive the scattering/decay probabilities associated with scattering/decay processes (e.g. a meson decaying into a nucleon and anti-nucleon). For example, for a free scalar QFT with field \(\phi\), the scattering process \(\phi\phi \to \phi\phi\) has scattering amplitude
\[\begin{align*} \braket{f|S|i} &= \int \left[\prod_{i=1}^{4} d^4 x_i\right] \, e^{-ip_1\cdot x_1} (\square_1 + m^2) e^{-ip_2\cdot x_2} (\square_2 + m^2) e^{ip_3\cdot x_3} (\square_3+m^2) e^{ip_4\cdot x_4} (\square_4 + m^2)\\ &\qquad \qquad \qquad \qquad \braket{\phi(x_1) \phi(x_2) \phi(x_3) \phi(x_4)} \end{align*}\]with \(\square_i\) the d'Alembert operator associated with \(x_i\).
It is a general property of scattering amplitudes that they can be expressed as an integral over correlators. This tells us that correlators have a direct physical relevance, and hence it is natural to enforce that such correlators – and expectances more generally (as given in Introduction) – are invariant to the symmetry group of the theory.
Non-dynamical gravity involves modifying the action
\[\int d^n x \, \mathcal{L}(\Phi; x) \to \int d^n x \, \sqrt{-g} \mathcal{L}(\Phi; x)\]To make the metric \(g\) dynamical, we can add a Ricci scalar term:
\[\int d^n x \, \mathcal{L}(\Phi; x) \to \int d^n x \, \sqrt{-g} \left(\frac{1}{16\pi G}R + \mathcal{L}(\Phi; x)\right)\]In the context of gravity, the concept of diffeomorphism invariance of our theory becomes relevant. The \(\sqrt{-g}\) factor as well as the Ricci scalar \(R\) are manifestly diffeomorphism invariant.
The main reason gravity is problematic is that the resulting Feynman diagrams/graviton loop corrections are non-renormalizable: one requires an infinite number of counter-terms to remedy divergences at all orders. As a result, after coupling to gravity, one is restricted to considering only tree-level contributions.
Todo: demonstrate divergences under dynamical gravity.
In biology there is a redundancy analogous to gauge invariance: multiple genotypes map to the same phenotype. In particular, field configurations \(\Psi \in \mathcal{C}\) are analogous to genotypes, and physical configurations \(\Phi \in \mathcal{P}\) are analogous to phenotypes. The equivalence class \([\Psi]\) is analogous to a \textit{neutral network}. Note however that the discreteness of genotypes compared to the continuity of the configuration space \(\mathcal{C}\) weakens this analogy.
In biology, the existence of neutral networks aids search by allowing for a more diverse search over genotypes and hence phenotypes. How can we interpret such benefits in the context of gauge invariance? Can we better understand why our theories must exhibit properties like gauge invariance?
Todo
What are the effects of integrating over physically equivalent configurations in \(\mathcal{C}\)? We can expect some form of overcounting. Namely, see that
\[\begin{align*} \mathbb{E}_{\Psi \sim S}^{\mathcal{C}}[f(\Psi)] = \frac{1}{Z_{\mathcal{C}}} \int_{\mathcal{C}} \text{D}\Psi \, f(\Psi) e^{-S[\Psi]} &= \frac{1}{Z_{\mathcal{C}}} \int_{\mathcal{P}} \text{D}\Phi \, \int_{[\Phi]} \text{D}\Psi \, f(\Psi) e^{-S[\Psi]}\\ &= \frac{1}{Z_{\mathcal{C}}} \int_{\mathcal{P}} \text{D}\Phi \, e^{-S[\Phi]} \left[\int_{[\Phi]} \text{D}\Psi \, f(\Psi)\right]\\ &= \frac{1}{Z_{\mathcal{P}}} \int_{\mathcal{P}} \text{D}\Phi \, F(\Phi) e^{-S[\Phi]}\\ &= \mathbb{E}_{\Phi\sim S}^{\mathcal{P}}[F(\Phi)] \end{align*}\]defining
\[F(\Phi) := \frac{Z_{\mathcal{P}}}{Z_{\mathcal{C}}} \int_{[\Phi]} \text{D}\Psi \, f(\Psi)\]We view integrating over \(\mathcal{P}\) as integrating over representative states of each distinct equivalence class, only integrating over non-equivalent configurations.
Similarly, one can show that
\[Z_{\mathcal{C}} = Z_{\mathcal{P}} \mathbb{E}_{\Phi \sim S}^{\mathcal{P}}[\text{Vol}([\Phi])]\]which lets us write
\[F(\Phi) = \frac{1}{\mathbb{E}_{\Phi' \sim S}^{\mathcal{P}}[\text{Vol}([\Phi'])]} \int_{[\Phi]} \text{D}\Psi \, f(\Psi)\]Now say that each equivalence class has the same volume, i.e. \(\text{Vol}([\Phi]) = \text{Vol}([\Phi'])\) for any \(\Phi, \Phi' \in \mathcal{P}\). Further, say \(f\) is constant over \([\Phi]\), which corresponds to invariance \(f \circ \rho_g = f\). In this case, we have that \(F(\Phi) = f(\Phi)\), i.e. taking expectances of \(f\) over \(\mathcal{C}\) is equivalent to taking expectances over \(\mathcal{P}\). But generally, \(f\) will not be constant over \([\Phi]\) (as is the case for correlators), and so generally these are distinct operations.
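These statements can be illustrated with a discrete toy model (entirely hypothetical numbers, replacing the functional integrals with finite sums): when \(f\) is constant on each equal-volume equivalence class, the two expectations agree; otherwise they differ.

```python
import numpy as np

# Toy model of gauge overcounting: 6 configurations in C grouped into 3
# equivalence classes ("orbits") of equal size 2; the action S is constant
# on each class (gauge invariance of S).
orbits = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
S = {0: 0.5, 1: 1.2, 2: 2.0}          # action per physical configuration

def expect_C(f):
    # Expectation over all of C (sum over every gauge copy).
    w = np.array([np.exp(-S[p]) for p in orbits for _ in orbits[p]])
    vals = np.array([f(c) for p in orbits for c in orbits[p]])
    return (w * vals).sum() / w.sum()

def expect_P(f):
    # Expectation over representatives only (one state per class).
    w = np.array([np.exp(-S[p]) for p in orbits])
    vals = np.array([f(orbits[p][0]) for p in orbits])
    return (w * vals).sum() / w.sum()

f_inv = lambda c: c // 2    # constant on each orbit: "gauge-invariant"
assert np.isclose(expect_C(f_inv), expect_P(f_inv))

f_noninv = lambda c: c      # varies within orbits: expectations differ
print(expect_C(f_noninv) - expect_P(f_noninv))  # 0.5
```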
The overcounting associated with gauge fields can be detrimental to the validity of the theory due to divergences, which we explore in Section 7.
A broad overview of alignment and control for current day language models.
There will be an implicit focus on scalable solutions to alignment and control: methods that are applicable to the largest and most capable future models that we may wish to deploy in real-world contexts.
Language models as simulators. Pretraining on the Internet encourages language models to be capable of simulating a wide diversity of personas if prompted correctly. One rough but illustrative picture is viewing a pretrained model as performing a Bayesian model average over learned personas, where the probability \(p_{\text{PT}}(y\mid x)\) that the pretrained model outputs response \(y\) given context \(x\) can be written schematically as
\[p_{\text{PT}}(y\mid x) = \int d\mathfrak{p} \, p_{\text{PT}}(\mathfrak{p}\mid x) p(y\mid x, \mathfrak{p})\]integrating over all personas \(\mathfrak{p}\) relevant to modeling text on the Internet (e.g. the persona of the average Stack Overflow contributor), with \(p_{\text{PT}}(\mathfrak{p}\mid x)\) the learned distribution over personas \(\mathfrak{p}\) given context \(x\), and \(p(y\mid x, \mathfrak{p})\) producing the outputs of persona \(\mathfrak{p}\) under context \(x\).
Namely, the distribution \(p_{\text{PT}}(\mathfrak{p}\mid x)\) describes what the model has learned from the Internet: the probability that the model should engage in persona \(\mathfrak{p}\) given context \(x\). In an ideal world, \(p_{\text{PT}}(\mathfrak{p}\mid x)\) would effectively be a point-mass on a maximally helpful and honest persona \(\mathfrak{p}_{\text{HHH}}\), however the pretrained model has no reason to specifically favour such a persona given the diversity of the pretraining data. This motivates prompting and finetuning as a means to shift this distribution \(p_{\text{PT}}(\mathfrak{p}\mid x)\) towards personas of interest, as we will discuss in more detail later (see “Prompt-level steering” and “Behaviour-level steering” below).
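As a toy illustration of this picture (all numbers hypothetical), with two personas over a binary output space, shifting the persona distribution directly shifts the model's output distribution:

```python
import numpy as np

# Rows: personas; columns: outputs y. p_y_given_persona[p, y] = p(y | x, p)
# for a fixed context x (hypothetical numbers for illustration).
p_y_given_persona = np.array([
    [0.9, 0.1],   # "helpful" persona
    [0.2, 0.8],   # "unhelpful" persona
])

def model_output(persona_prior):
    # p(y | x) = sum_p p(p | x) p(y | x, p): a Bayesian model average.
    return persona_prior @ p_y_given_persona

# Pretraining leaves substantial mass on undesirable personas...
print(model_output(np.array([0.5, 0.5])))    # [0.55 0.45]

# ...while steering (prompting / finetuning) shifts p(persona | x):
print(model_output(np.array([0.95, 0.05])))  # [0.865 0.135]
```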
Ultimately, the model’s internal representation \(z(x)\) of a particular prompt \(x\) encodes the persona \(\mathfrak{p}\) that is currently active. Interpretability methods like representation reading [1], sparse auto-encoders [2, 3], and LatentQA [4] allow for reading off attributes of a model’s persona, such as “helpfulness”, and for steering towards such behaviours (see “Representation-level steering” below).
Shallowness of finetuning. The observed shallowness of finetuning [8, 9] suggests that the finetuned persona distribution \(p_{\text{FT}}(\mathfrak{p}\mid x)\) has essentially the same support as \(p_{\text{PT}}(\mathfrak{p}\mid x)\), i.e. \(p_{\text{PT}}(\mathfrak{p}\mid x) \approx 0 \implies p_{\text{FT}}(\mathfrak{p}\mid x) \approx 0\), describing the observation that finetuning usually does not teach the model fundamentally new behaviours.
Rethinking pretraining. Rather than trying to improve the effectiveness of finetuning at unlearning undesirable pretraining behaviour (as we will discuss in the next section), perhaps we should instead rethink the pretraining process itself. There is a great mismatch between the task of pretraining and the behaviour we wish an LM to have at deployment; unsurprisingly, most text on the Internet does not match a “helpful persona”, and it may be unrealistic to expect finetuning to effectively unlearn all undesirable pretraining behaviours. Is there an alternative way of pretraining that doesn’t encourage a model to explicitly emulate behaviours that are misaligned with deployment? We still wish for the model to learn from all available data, but by a means that is detached from behaviour. The fact that we use the same next-token learning objective for both pretraining and SFT seems inherently problematic; ideally there would be some separation between these learning processes, with pretraining learning knowledge and useful representations while finetuning learns good behaviours.
By default, a pretrained model will not behave usefully; it has no particular preference for being “helpful”. We wish to “steer” the pretrained model to be more useful for a deployment task of interest. Steering can be performed at different levels of abstraction. Note that language models have a computational hierarchy of: \((\text{prompt}, \text{weights}) \to \text{representations} \to \text{behaviour}\). Steering can be performed at the level of any component in this hierarchy.
Behaviour-level steering. Using the notation introduced in the previous section, we would like to steer the distribution \(p_{\text{PT}}(\mathfrak{p}\mid x)\) towards behaviours/personas \(\mathfrak{p}\) that correspond with a helpful assistant \(\mathfrak{p}_{\text{HHH}}\). Most immediately, we could consider prompting the model appropriately via \(x\) to elicit desired personas (i.e. prompt-level steering), such as via few-shot prompting. This requires no additional training, but as we will discuss later, is unreliable as a steering strategy.
One intuitive approach is to perform additional training in exactly the same manner as pretraining – i.e. using a next-token prediction loss – but instead using human-curated text data that demonstrates ideal interactions between a user and a helpful assistant. This is the idea behind supervised finetuning (SFT). Some limitations of SFT include: (1) it requires explicit human-written demonstrations, which are costly to collect, and (2) the model is trained to imitate ground-truth continuations rather than to generate responses on its own, a mismatch with deployment.
Reinforcement learning-based training – namely, reinforcement learning from human feedback (RLHF) – is a method that remedies both of these issues, since (1) RL only requires an evaluation model \(R = R(x, y)\) of model responses \(y\) (no explicit demonstrations are required) and (2) in RL we can allow the model to generate its responses without access to a ground truth, which is faithful to deployment.
SFT and RLHF are often referred to as “finetuning” methods, and in practice they are both used in combination (first SFT, followed by RLHF), though it has been found in special cases (where high-quality demonstrations are available) that SFT can sometimes be sufficient [7].
Misspecification. To understand the potential for misalignment in RLHF, we will describe it explicitly. Concretely, RLHF consists of two steps:
Train reward model \(R = R(x, y)\) (initialized as a pretrained model) to minimize
\[\mathcal{L}[R] := \mathbb{E}_{(x, y_0, y_1, b) \sim \mathcal{D}}[-\log \sigma((-1)^{1-b} (R(x, y_1) - R(x, y_0)))]\]for prompts \(x\), model completions \((y_0, y_1) \sim \pi_{\text{ref}}(\cdot\mid x)\), and preference label \(b \in \{0, 1\}\) (with \(b=1\) iff completion \(y_1\) is preferred over \(y_0\)) obtained by human labellers. Note that \(\pi_{\text{ref}}\) is the initial pretrained (and SFT’d) model.
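A minimal sketch of this pairwise loss on toy scores (no training loop; the `sigmoid` helper and the score values are assumptions for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def preference_loss(r0, r1, b):
    """-log sigma((-1)^(1-b) * (R(x,y1) - R(x,y0))), with b = 1 iff y1 preferred."""
    sign = (-1.0) ** (1 - b)
    return float(-np.log(sigmoid(sign * (r1 - r0))))

# If y1 is preferred (b = 1), the loss shrinks as the score margin r1 - r0 grows:
print(preference_loss(0.0, 0.5, 1))  # small margin -> larger loss
print(preference_loss(0.0, 2.0, 1))  # large margin -> smaller loss
# Flipping the label on the same pair penalizes the margin instead:
print(preference_loss(0.0, 2.0, 0))
```

At zero margin the loss is \(\log 2\) regardless of the label, as expected of a symmetric comparison.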
Train the language model \(\pi = \pi(y\mid x)\) to optimize \(R\) via PPO (or DPO):
\[\mathcal{V}[\pi] := \mathbb{E}_{\rho(x) \pi(y\mid x)}[R(x, y)] - \tau D_{\text{KL}}(\pi\mid \mid \pi_{\text{ref}})\]Note that step 2 is applicable to generic reward models \(R\), whereas step 1 is specific to learning \(R\) via response comparison data (found to be reliable in practice, since getting humans to give raw preference ratings is noisy).
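For intuition, over a finite completion set this KL-regularized objective has a well-known closed-form maximizer, \(\pi^{*}(y\mid x) \propto \pi_{\text{ref}}(y\mid x)\exp(R(x,y)/\tau)\): a Boltzmann reweighting of the reference policy. A discrete toy sketch (the reward values and reference policy are assumptions):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def objective(pi, pi_ref, R, tau):
    """V[pi] = E_pi[R] - tau * KL(pi || pi_ref), single prompt, discrete y."""
    return float(pi @ R) - tau * kl(pi, pi_ref)

R = np.array([1.0, 0.2, 0.0])       # toy reward per completion
pi_ref = np.array([0.4, 0.4, 0.2])  # toy reference policy
tau = 0.5

# Closed-form maximizer: tilt the reference policy by exp(R / tau)
pi_star = pi_ref * np.exp(R / tau)
pi_star /= pi_star.sum()

print("pi*      :", np.round(pi_star, 3))
print("V[pi*]   :", objective(pi_star, pi_ref, R, tau))
print("V[pi_ref]:", objective(pi_ref, pi_ref, R, tau))
```

Larger \(\tau\) pulls \(\pi^{*}\) back towards \(\pi_{\text{ref}}\); smaller \(\tau\) concentrates it on the highest-reward completion.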
The reward model \(R = R(x, y)\) used in RLHF is likely to be an imperfect proxy, i.e. to be misspecified to some degree. Namely, let \(R^{*}\) denote the oracle reward describing our true, “platonic” human preferences. We do not have access to this oracle, and instead must rely on a human’s ratings, which can be thought of as samples from another reward model \(R_{\text{human}}\). Further, as seen in step 1 of RLHF above, we essentially train a reward model \(R_{\text{trained}}\) to predict the outputs of \(R_{\text{human}}\) to allow for larger-scale training (collecting data entirely from humans for training is too costly). We then perform RL training using reward function \(R = R_{\text{trained}}\). Each step introduces its own discrepancies/gaps:
\[R^{*} \underbrace{\longleftrightarrow}_{\text{oracle-human gap}} R_{\text{human}} \underbrace{\longleftrightarrow}_{\text{human-train gap}} R_{\text{trained}}\]Misspecification corresponds to the overall oracle-train gap, i.e. the total discrepancy between the ideal oracle reward \(R^{*}\) and the learned reward model \(R = R_{\text{trained}}\) used for RL training, which is the sum of the oracle-human gap (imperfect human judgement relative to our true preferences) and the human-train gap (imperfect modeling of human ratings by the trained reward model).
The effect of reward misspecification is that ultimately the LM is trained to maximize \(R_{\text{trained}}\) during RL training, and hence any discrepancies between \(R^{*}\) and \(R_{\text{trained}}\) will be exploited if it better allows for value maximization, often resulting in undesirable behaviours – called reward hacking, or reward overoptimization.
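A best-of-\(n\) caricature of reward hacking (the “verbosity” feature and all numbers are invented): selecting responses with a misspecified proxy exploits exactly the proxy-oracle discrepancy.

```python
import numpy as np

rng = np.random.default_rng(0)

# n candidate responses: a true quality (the oracle reward R*) and a
# spurious feature (say, verbosity) that is independent of quality.
n = 1000
quality = rng.normal(size=n)       # R*
verbosity = rng.normal(size=n)

# Misspecified trained reward: overweights the spurious feature.
proxy = quality + 2.0 * verbosity  # stands in for R_trained

best_by_proxy = int(np.argmax(proxy))
best_by_true = int(np.argmax(quality))

# Best-of-n selection against the proxy stands in for RL against R_trained:
print("R* of proxy-optimal response :", round(float(quality[best_by_proxy]), 3))
print("R* of oracle-optimal response:", round(float(quality[best_by_true]), 3))
```

The proxy-optimal response typically has high verbosity rather than high quality; the gap between the two printed values is the cost of overoptimizing the proxy.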
Scalable oversight. The problem of misspecification – and specifically the oracle-human gap – can be mainly attributed to humans not being very reliable supervisors. One can consider this problem in more generality: how can we reliably supervise and evaluate the outputs of AI systems on complex tasks (for RLHF, the complex task being instruction following/helpfulness)? This is often studied under the term scalable oversight.
One particularly difficult version of the scalable oversight problem involves assuming that the AI’s output is essentially unreadable to the human supervisor (e.g. code in a language that the human supervisor is unfamiliar with). In order to aid the human, we would like to produce a human-readable artifact that the human can base their supervision upon. Some examples of methods for doing this include:
This first example, of using an AI system to act as an evaluator and generate supervision signal, is demonstrated in Constitutional AI (CAI) [13]. CAI generates demonstrations for SFT, and evaluations for RL, with the only human oversight being the selection of constitutional principles that the generating model is prompted to follow during data generation. Namely, on the RLHF side, a given choice of constitution \(\mathcal{C}\) results in an associated preference model \(R_{\mathcal{C}} = R_{\mathcal{C}}(x, y)\) that one can then use for RL.
There are many works that evaluate how effectively LMs can act as evaluators, often called “meta-evaluation”. For example, [19, 20] find that even though LMs can often generate correct solutions to reasoning problems, they often fail to evaluate which solution is correct (i.e. LMs find generation easier than evaluation, disagreeing with human intuition). One could consider explicitly training AI systems to act as better evaluators. For example:
Note that scalable oversight is also relevant in the control context (discussed later) for the purposes of monitoring and interpretability during deployment as a means to mitigate misalignment.
Relation to weak-to-strong generalization. As noted by [41], one can view both scalable oversight and weak-to-strong generalization [42] as orthogonal approaches to alignment:
In an LM context, weak-to-strong generalization concerns the setting of improving the ability for the reward model (RM) to correctly generalize from imperfect human preference data (whereas scalable oversight would aim to improve the quality of the preference data). [42] finds negative results for weak-to-strong generalization in this setting: training a larger RM using the outputs of a smaller RM causes the larger RM to collapse to the performance of the smaller RM (whereas for chess and NLP tasks, the larger model can meaningfully outperform the smaller model). It may be that modeling a human’s feedback as the outputs of a smaller reward model is problematic. It also may be that the relevant setting is instead using a smaller RM to train a larger policy (rather than training a larger RM) and observing whether the larger policy can learn to “grok” the intended goal.
Evaluating scalable oversight. One problem we must face when evaluating proposed solutions to scalable oversight is that, for very complex tasks, we have no access to ground truth labels, essentially by construction.
Consider a proxy setting where an AI produces outputs that a non-expert human is unable to evaluate, but that an expert human can reliably evaluate, providing ground-truth labels against which to measure the performance of the assisted non-expert. The hope is that this proxy setting captures the meaningful aspects of scalable oversight in general, and that solutions to it extend to settings where even expert humans are unable to evaluate. This proxy setting is often called the “sandwiching” setup [38].
Goal misgeneralization. To be written. Could discuss:
Aside (a very rough analogy to the brain). In the context of finetuning via RLHF, we would like to learn a reward model \(R_{\text{trained}}\) that accurately captures human preferences. Analogously, the “reward circuits” in the brain (e.g. in the midbrain and hypothalamus) have been “learned” by evolution to capture preferences that (indirectly) benefit genetic fitness.
That is, for both RLHF and the brain, the “reward model” is misspecified relative to the oracle reward (human preferences and genetic fitness respectively). The brain has therefore had to face the analogous problem of reward misspecification that we face in the context of RLHF. As a result, better understanding how the brain mitigates reward hacking could inform improvements to RLHF.
Representation-level steering. Finetuning is a behaviour-level method of steering: we directly incentivize certain model outputs over others. One can imagine that this could have undesirable effects, e.g. a model producing the right outputs for the wrong reasons/not accurately internalizing the goal that we want the model to pursue (goal misgeneralization).
This motivates considering a more fine-grained method of steering. We can move to a lower level of abstraction, from behaviour to representations, and consider representation-level steering. Methods in this vein typically involve “locating” certain interpretable concepts in the model, such as the concept of “helpfulness”, and amplifying these concepts to (hopefully) result in interpretable changes in model behaviour. Some methods include representation reading/engineering [1], sparse autoencoders [2, 3], and LatentQA [4], as noted earlier.
The aforementioned steering methods are also readily applicable to representation-level monitoring, as we will discuss later under “Control”.
TODO: discussion of limitations of these approaches and future directions. Also discuss interpretability methods like [23] that allow for editing representations to fix model failures.
Weight-level steering. With a mechanistic understanding of a model (in the sense of mechanistic interpretability), we can possibly perform explicit weight edits to steer model behaviour. For example, [24] identifies the modules important for storing factual associations, and provides a method for editing the weights to change these factual associations, e.g. making the model believe that the Eiffel Tower is in Rome.
For weight editing to be scalable, we would likely need to automate the process of locating relevant weights (as in [25]) and editing accordingly. As one example, could we extend an investigator agent method like MAIA [26] to be able to use operations like ablation and attribution patching to find the model weights important for a particular task and relay its findings to humans, as well as perform relevant weight edits to modify model behaviour?
TODO
Prompt-level steering. To be written. Could discuss:
[1] Representation Engineering: A Top-Down Approach to AI Transparency
[2] Sparse Autoencoders Find Highly Interpretable Features in Language Models
[3] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (https://transformer-circuits.pub/2023/monosemantic-features)
[4] LatentQA: Teaching LLMs to Decode Activations Into Natural Language
[5] Adversarial Examples Are Not Bugs, They Are Features
[6] Pretraining Language Models with Human Preferences
[7] LIMA: Less Is More for Alignment
[8] Safety Alignment Should Be Made More Than Just a Few Tokens Deep
[9] Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
[10] Towards Understanding Sycophancy in Language Models
[11] Scaling Laws for Reward Model Overoptimization
[12] The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
[13] Constitutional AI: Harmlessness from AI Feedback
[14] Specific versus General Principles for Constitutional AI
[15] Discovering Language Model Behaviors with Model-Written Evaluations
[16] Understanding Learned Reward Functions
[17] RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
[18] Recursively Summarizing Books with Human Feedback
[19] Benchmarking and Improving Generator-Validator Consistency of Language Models
[20] The Generative AI Paradox: “What It Can Create, It May Not Understand”
[21] RewardBench: Evaluating Reward Models for Language Modeling
[22] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
[23] Monitor: An AI-Driven Observability Interface (https://transluce.org/observability-interface)
[24] Locating and Editing Factual Associations in GPT
[25] Towards Automated Circuit Discovery for Mechanistic Interpretability
[26] A Multimodal Automated Interpretability Agent
[27] Language Models are Few-Shot Learners
[28] The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
[29] Maintaining Alignment during RSI as a Feedback Control Problem (https://www.beren.io/2025-02-05-Maintaining-Alignment-During-RSI-As-A-Feedback-Control-Problem/)
[30] AI Control: Improving Safety Despite Intentional Subversion
[31] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
[32] Improving Alignment and Robustness with Circuit Breakers
[33] Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them
[34] Eliciting Language Model Behaviors with Investigator Agents
[35] LLM Critics Help Catch LLM Bugs
[36] Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design
[37] Non-Programmers Can Label Programs Indirectly via Active Examples: A Case Study with Text-to-SQL
[38] Measuring Progress on Scalable Oversight for Large Language Models
[39] AI safety via debate
[40] Automatically Interpreting Millions of Features in Large Language Models
[41] Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem (https://www.alignmentforum.org/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization)
[42] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
This post presents a mathematical formalism of perception, action, and learning, based on the framework of variational inference. It appears that a variational formalism has the potential to shed light on the ad-hoc choices used in practice; we will see that the notion of entropy regularization (as in SAC) and penalizing policy drift (as in PPO) emerge naturally from this framework. Further, this formalism provides a model-based method of reinforcement learning, featuring both representation learning & action-selection on the basis of these learned representations, in a manner analogous to “World Models” [3]. We will also see that, under certain assumptions, we obtain an alternative credit assignment scheme to backpropagation, called predictive coding [6, 7]. Backpropagation suffers from catastrophic interference [4] and has limited biological plausibility [5], whereas predictive coding – a biologically-plausible credit assignment scheme – has been observed to outperform backpropagation at continual learning tasks – alleviating catastrophic interference to some degree [6] – and at small batch sizes [8] (i.e. biologically relevant contexts).
A brief background on reinforcement learning methods can be found in the Appendix.
Related work. The framework presented here differs from previous work. [1] describes an agent’s preference via binary optimality variables whose distribution is defined in terms of reward, from which they are able to justify the max-entropy framework of RL; however, their framework does not result in a policy drift term (as in PPO). In contrast, for the formalism presented here, we will see that it is most natural to represent an agent’s preferences via a distribution over policies (defined in terms of value), from which a policy drift term and an entropy regularization term naturally emerge. There is also the topic of active inference [9], which aims to formulate action-selection as variational inference; however, it appears that central claims – such as the minimization of the “expected free energy” – lack a solid theoretical justification, as highlighted by [10, 11].
We will consider the following graphical model,
where \(s_t\) represents the environment’s (hidden) state, \(x_t\) a partial observation of this state, and \(a_t\) an action, at time \(t\). For simplicity we have made a Markov assumption on how the hidden states evolve. The dependency \(s_t \to a_t\) is a consequence of this Markov assumption, with the optimal action \(a_t\) (optimality defined further below) ultimately depending only on the current environment state \(s_t\), independent of previous states \(s_{<t}\) (given \(s_t\)).
The associated probabilistic decomposition (up to time \(t\)) is
\[p(s_{1:t}, x_{1:t}, a_{1:t}) = \prod_{\tau=1}^{t} p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1}) p(x_{\tau}\mid s_{\tau}) p(a_{\tau}\mid s_{\tau})\]where \(p(s_1\mid s_0, a_0) \equiv p(s_1)\). We can interpret \(p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})\) as the environment’s transition dynamics, \(p(x_{\tau}\mid s_{\tau})\) as the observation process, and \(p(a_{\tau}\mid s_{\tau})\) as the action distribution of an idealized agent (made precise below).
To provide a Bayesian framework for action, we wish to frame the objective of an agent as performing inference over this graphical model. Since an agent will have access to information \((x_{1:t}, a_{<t})\) at time \(t\), we wish to frame action selection of the next action \(a_t\) as the process of computing/sampling from
\[p(a_t\mid x_{1:t}, a_{<t}) \equiv \int ds_t \; p(s_t\mid x_{1:t}, a_{<t}) p(a_t\mid s_t)\]i.e. we can view action selection as the process of inferring the underlying state \(s_t \sim p(s_t\mid x_{1:t}, a_{<t})\), and then inferring the associated action \(a_t \sim p(a_t\mid s_t)\). But how do we appropriately define \(p(a_t\mid s_t)\) to represent an “ideal” agent? Naively we may consider \(p(a_t\mid s_t)\) to be a point-mass/Dirac-delta distribution centered at the optimal action \(a^{*}(s_t)\); however, this does not provide a convenient notion of Bayesian inference. Instead we will consider a smoothed version that places the most weight on the optimal action \(a^{*}(s_t)\), but also places non-zero weight on other actions, in a way that is correlated with an action/policy’s value. In order to define \(p(a_t\mid s_t)\) to satisfy this notion, we will assume that \(p(a_t\mid s_t)\) takes the form of a mixture distribution, introducing a policy variable \(\pi\) and writing,
\[p(a_t\mid s_t) = \int d\pi \; p(a_t, \pi\mid s_t) = \int d\pi \; p(\pi\mid s_t) \pi(a_t\mid s_t)\]where we write \(\pi(a_t\mid s_t) \equiv p(a_t\mid s_t, \pi)\).
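A discrete sketch of this generative model (all probability tables below are toy assumptions, and \(p(a\mid s)\) is given directly as a table rather than via the policy mixture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden states, observations, and actions; all tables are toy assumptions.
P_s1 = np.array([0.5, 0.5])                    # p(s_1)
P_trans = np.array([[[0.9, 0.1], [0.2, 0.8]],  # p(s' | s, a), indexed [a][s][s']
                    [[0.3, 0.7], [0.6, 0.4]]])
P_obs = np.array([[0.8, 0.2], [0.1, 0.9]])     # p(x | s), indexed [s][x]
P_act = np.array([[0.7, 0.3], [0.4, 0.6]])     # p(a | s), indexed [s][a]

def sample_trajectory(T):
    """Ancestral sampling following the factorization of p(s, x, a)."""
    s = rng.choice(2, p=P_s1)
    traj = []
    for _ in range(T):
        x = rng.choice(2, p=P_obs[s])
        a = rng.choice(2, p=P_act[s])
        traj.append((int(s), int(x), int(a)))
        s = rng.choice(2, p=P_trans[a][s])     # s_{tau+1} ~ p(. | s_tau, a_tau)
    return traj

traj = sample_trajectory(5)
print(traj)
```

The agent, of course, only sees the \((x, a)\) components of each triple; the \(s\) component is what inference must recover.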
A notable property of a mixture distribution is that we can write \(p(a_{\tau}\mid s_{\tau}) = \mathbb{E}_{p(\pi\mid s_{\tau})}[\pi(a_{\tau}\mid s_{\tau})]\), which – as we will see later – allows us to apply a variational bound. We then introduce a value system \(V_{\pi}(s)\), describing the value of policy \(\pi\) starting from the state \(s\). We can consider a Boltzmann preference over policies based on this value system,
\[\begin{equation} \label{eqn:boltz} p(\pi\mid s_t) \propto \exp(\beta V_{\pi}(s_t)) \quad \quad (1) \end{equation}\]where, as typical in the context of RL, we define the value as,
\[V_{\pi}(s_t) = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi)}\left[\sum_{\tau=t}^{\infty} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\right]\] \[\text{with} \; \; \; p(s_{>t}, a_{\geq t}\mid s_t, \pi) = \prod_{\tau=t}^{\infty} \pi(a_{\tau}\mid s_{\tau}) p(s_{\tau+1}\mid s_{\tau}, a_{\tau})\]for a discount factor \(\gamma \in [0, 1]\). Note that \(\beta = \infty\) corresponds to taking an argmax over \(V_{\pi}(s_t)\), representing “perfect rationality”. Finite \(\beta\) corresponds to “bounded rationality”, placing weight on policies that don’t perfectly maximize \(V_{\pi}(s_t)\).
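A sketch of this Boltzmann preference over a finite policy set (the values \(V_{\pi}(s_t)\) and the per-policy action distributions are assumed numbers), showing the induced action marginal \(p(a\mid s) = \mathbb{E}_{p(\pi\mid s)}[\pi(a\mid s)]\) at different \(\beta\):

```python
import numpy as np

def boltzmann(values, beta):
    """p(pi | s) ∝ exp(beta * V_pi(s)) over a finite set of policies."""
    w = np.exp(beta * (values - values.max()))  # subtract max for stability
    return w / w.sum()

# Three candidate policies with assumed values V_pi(s_t), and their
# (assumed) action distributions pi(a | s_t) over two actions.
V = np.array([1.0, 0.5, -1.0])
pi_a = np.array([[0.9, 0.1],
                 [0.5, 0.5],
                 [0.2, 0.8]])

for beta in [0.0, 1.0, 100.0]:
    p_pi = boltzmann(V, beta)
    p_a = p_pi @ pi_a  # p(a | s) = E_{p(pi|s)}[pi(a|s)], the mixture from before
    print(f"beta={beta:>5}: p(pi|s)={np.round(p_pi, 3)}, p(a|s)={np.round(p_a, 3)}")
```

At \(\beta = 0\) value is ignored and the mixture is uniform over policies; as \(\beta \to \infty\) the mixture collapses onto the argmax policy, recovering “perfect rationality”.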
Variational inference. However, even assuming that we have direct access to all relevant distributions (which we don’t), computing the integral of Equation (1) will be intractable as we expect \(s_t\) to be high-dimensional. As a result, we cannot perform exact Bayesian inference and so must utilize an approximate scheme of Bayesian inference. The approximate scheme we will consider is variational inference: at time \(t\) an agent has access to information \((x_{1:t}, a_{<t})\), and we will consider minimizing (with respect to \(p\)) the model’s corresponding surprise \(-\log p(x_{1:t}, a_{<t})\) by instead minimizing (with respect to \(p\) and \(q\)) the variational bound \(F(x_{1:t}, a_{<t})\),
\[-\log p(x_{1:t}, a_{<t}) \leq F(x_{1:t}, a_{<t}) := D_{\text{KL}}(q(s_{1:t}\mid x_{1:t}, a_{<t})\mid \mid p(s_{1:t}, x_{1:t}, a_{<t}))\](a consequence of Jensen’s inequality) where we have introduced the variational/approximate distribution \(q\). \(F\) is commonly called the variational free energy (VFE). Variational inference considers minimizing this upper bound \(F(x_{1:t}, a_{<t})\) on surprisal, rather than minimizing the surprisal itself (which is intractable to do). We will use the shorthand \(F_t \equiv F(x_{1:t}, a_{<t})\).
When the VFE \(F_t\) is perfectly minimized with respect to \(q\), with \(q(s_{1:t}\mid x_{1:t}, a_{<t}) = p(s_{1:t}\mid x_{1:t}, a_{<t})\), then the VFE is exactly equal to the surprisal,
\[F_t = -\log p(x_{1:t}, a_{<t})\]in which case further minimizing \(F_t\) with respect to \(p\) achieves our original goal. This is the basis of the variational EM algorithm: at time \(t\),
E-step: minimize \(F_t\) with respect to the variational distribution \(q\) (inference).
M-step: minimize \(F_t\) with respect to the generative model \(p\) (learning).
In practice, we will implement these steps via gradient-based minimization, as will be described in more detail later. The E-step is not run to convergence in practice, but we will typically perform many more E-steps than M-steps.
In general we can write \(F_t\) as
\[\begin{align*} F_t = &\sum_{\tau=1}^{t} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})}_{\text{perception}} \underbrace{-\log p(x_{\tau}\mid s_{\tau})}_{\text{prediction}}]\\ &+ \sum_{\tau=1}^{t-1} \mathbb{E}_{q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(a_{\tau}\mid s_{\tau})}_{\text{action}}]\\ &\underbrace{-H[q(s_{1:t}\mid x_{1:t}, a_{<t})]}_{\text{entropy regularization}} \end{align*}\]As seen above we chose \(p(a_{\tau}\mid s_{\tau})\) to be a mixture distribution, but the log of a mixture distribution (the “action” term above) lacks a nice expression. However, we can apply a variational bound on this term with respect to \(q(\pi\mid s_{\tau})\) by noting that \(p(a_{\tau}\mid s_{\tau})\) is an expectance,
\[p(a_{\tau}\mid s_{\tau}) = \mathbb{E}_{p(\pi\mid s_{\tau})}[\pi(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{q(\pi\mid s_{\tau})}\left[\frac{p(\pi\mid s_{\tau})}{q(\pi\mid s_{\tau})} \pi(a_{\tau}\mid s_{\tau}) \right]\]and hence we can perform a variational bound (via Jensen’s inequality),
\[\begin{align*} -\log p(a_{\tau}\mid s_{\tau}) &\leq D_{\text{KL}}(q(\pi\mid s_{\tau})\mid \mid p(\pi\mid s_{\tau})) + \mathbb{E}_{q(\pi\mid s_{\tau})}[-\log \pi(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{q(\pi\mid s_{\tau})}[-\log p(\pi\mid s_{\tau}) - \log \pi(a_{\tau}\mid s_{\tau})] - H[q(\pi\mid s_{\tau})] \end{align*}\]Overall, this results in a unified objective for perception and action:
\[\begin{align} \label{eqn:freeenergy} F_t &= \sum_{\tau=1}^{t} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})}_{\text{(a)}} \underbrace{-\log p(x_{\tau}\mid s_{\tau})}_{\text{(b)}}]\nonumber \\ &+\sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(\pi\mid s_{\tau})}_{\text{(c)}} \underbrace{-\log \pi(a_{\tau}\mid s_{\tau})}_{\text{(d)}}] \quad \quad \quad (2)\\ &\underbrace{- H[q(s_{1:t}\mid x_{1:t}, a_{<t})]}_{\text{(e)}} + \sum_{\tau=1}^{t-1} \mathbb{E}_{q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-H[q(\pi\mid s_{\tau})]}_{\text{(f)}}] \nonumber \end{align}\]where
(c) represents achieving the agent’s preference. In the case of Equation (1) this term corresponds to value maximization, with
\[-\log p(\pi\mid s_{\tau}) = -\beta V_{\pi}(s_{\tau}) + \log Z(s_{\tau})\]for normalizing constant \(Z(s_{\tau})\).
Note that (d) and (f) play the same roles as policy clipping in PPO and entropy regularization in SAC, respectively. Note that entropy regularization in SAC has been previously justified via a variational framework [1], but with differences to the framework presented here (as described at the beginning).
Under Equation (1), term (c) reduces to the expected value, which is exactly what the field of reinforcement learning is concerned with. An overview of reinforcement learning methods is included in the Appendix.
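A quick numeric check of the variational bound on the action term above (a two-policy toy mixture; all numbers are assumptions): the bound holds for any \(q(\pi\mid s)\), and it is tight at the exact posterior \(q(\pi\mid s) = p(\pi\mid a, s)\).

```python
import numpy as np

# Two-policy mixture for a fixed (a, s); numbers are toy assumptions.
p_pi = np.array([0.7, 0.3])   # p(pi | s)
lik = np.array([0.9, 0.2])    # pi(a | s) for each policy

def upper_bound(q):
    """KL(q || p(pi|s)) + E_q[-log pi(a|s)]  >=  -log p(a|s)."""
    kl = float(np.sum(q * np.log(q / p_pi)))
    return kl + float(np.sum(q * -np.log(lik)))

neg_log_marginal = float(-np.log(p_pi @ lik))  # -log p(a|s)

q_post = p_pi * lik / (p_pi @ lik)  # exact posterior p(pi | a, s)
for q in [np.array([0.5, 0.5]), np.array([0.99, 0.01]), q_post]:
    print(upper_bound(q), ">=", neg_log_marginal)
```

This is the same tightness property as the main VFE bound: the gap is exactly \(D_{\text{KL}}(q(\pi\mid s)\mid\mid p(\pi\mid a, s))\).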
Internal structure of hidden states. We can include additional structure on the hidden state \(s\) by generalizing to an arbitrary DAG topology for the hidden state \(s_t = (s_t^1, \ldots, s_t^N)\), where \(s_t^n\) is the state of node \(n\) at time \(t\). For a general DAG, we have
\[p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1}) = \prod_{n=1}^{N} p(s_{\tau}^{n}\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}^{n}, a_{\tau-1})\]where \(\mathcal{P}(n) \subset \{1, \ldots, N\}\) denotes the parent indices for the \(n\)th node. Further, \(p(x_{\tau}\mid s_{\tau}) = p(x_{\tau}\mid s_{\tau}^{\mathcal{P}(0)})\), and we will denote \(s_t^0 \equiv x_t\). Note that we have implicitly made a locality assumption of \(p(s_{\tau}^{n}\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}, a_{\tau-1}) = p(s_{\tau}^{n}\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}^{n}, a_{\tau-1})\).
In this case we can write Equation (2) as
\[\begin{align} \label{eqn:dagfreeenergy} F_t &= \sum_{\tau=1}^{t} \sum_{n=1}^{N} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(s_{\tau}^n\mid s_{\tau}^{\mathcal{P}(n)}, s_{\tau-1}^n, a_{\tau-1})}_{\text{(a)}}]+\sum_{\tau=1}^{t} \mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(x_{\tau}\mid s_{\tau}^{\mathcal{P}(0)})}_{\text{(b)}}]\nonumber \\ &+\sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-\log p(\pi\mid s_{\tau})}_{\text{(c)}} \underbrace{-\log \pi(a_{\tau}\mid s_{\tau})}_{\text{(d)}}]\\ &\underbrace{- H[q(s_{1:t}\mid x_{1:t}, a_{<t})]}_{\text{(e)}} + \sum_{\tau=1}^{t-1} \mathbb{E}_{q(s_{\tau}\mid x_{1:t}, a_{<t})}[\underbrace{-H[q(\pi\mid s_{\tau})]}_{\text{(f)}}] \nonumber \end{align}\]We may restrict action selection to a particular node, e.g. \(\pi(a_{\tau}\mid s_{\tau}) = \pi(a_{\tau}\mid s_{\tau}^N)\).
Inclusion of entropy terms. Note that the entropy regularization terms (e) and (f) are not present if we consider the cross-entropy objective
\[\mathbb{E}_{q(s_{1:t}\mid x_{1:t}, a_{<t})}[-\log p(s_{1:t}, x_{1:t}, a_{<t})]\]instead of \(F_t\), but otherwise this objective and \(F_t\) are equivalent. This appears necessary if we wish to consider a point-mass distribution over policies \(q(\pi\mid s_{\tau}) = \delta(\pi - \pi_{\phi})\) as otherwise the entropy terms become infinite.
Predictive coding. Predictive coding [6, 7] is a special case of variational inference under a hierarchical latent topology that results in a local, and hence biologically-plausible, credit assignment scheme – this contrasts with backpropagation which has limited biological plausibility [5]. It has been observed that predictive coding alleviates catastrophic interference to some degree [6], which is a known problem with backpropagation [4].
Predictive coding, in its typical formulation, ignores temporality and action. We will consider extending predictive coding to include both of these aspects later. In this context, with a hierarchical topology, the relevant graphical model is
with \(s = (s^1, \ldots, s^L)\) and decomposition
\[p(s, x) = p(s^L) \prod_{l=1}^{L} p(s^{l-1}\mid s^l)\]where \(s^0 \equiv x\). Predictive coding makes two further assumptions:
Gaussian conditionals, \(p(s^{l-1}\mid s^l) = \mathcal{N}(s^{l-1};\, \mu_l(s^l), \Sigma_l)\) and \(p(s^L) = \mathcal{N}(s^L;\, \hat{\mu}, \hat{\Sigma})\), with \(\mu_l(s^l) = \mu_l(s^l; \theta_l)\), where \(\theta = (\theta_1, \ldots, \theta_L)\) and \((\hat{\mu}, \hat{\Sigma})\) parameterize \(p\).
A point-mass variational distribution, \(q(s\mid x) = \prod_{l=1}^{L} \delta(s^l - z_l)\), where \(z = (z_1, \ldots, z_L)\) parameterize \(q\).
We can interpret \(z\) as the fast-changing synaptic activity, and \(\theta\) as the slow-changing synaptic weights. This interpretation is supported by the variational EM algorithm, which indeed updates the parameters of \(q\) (in this case, parameter \(z\)) at a faster timescale than the parameters of \(p\).
In this case the free energy \(F(x)\) from Equation (2) (now independent of time) takes the form,
\[\begin{align*} F(x; \theta, z) &= \mathbb{E}_{q(s\mid x)}[-\log p(s) - \log p(x\mid s)] = -\log p(z_L) - \sum_{l=1}^{L} \log p(z_{l-1}\mid z_l)\\ &= \frac{1}{2} (z_L - \hat{\mu})^T \hat{\Sigma}^{-1} (z_L - \hat{\mu}) + \frac{1}{2} \sum_{l=1}^{L} (z_{l-1} - \mu_l(z_l))^T \Sigma_l^{-1} (z_{l-1} - \mu_l(z_l))\\ &+ \frac{1}{2} \log \mid 2\pi\hat{\Sigma}\mid + \frac{1}{2} \sum_{l=1}^{L} \log \mid 2\pi\Sigma_l\mid \end{align*}\]where \(z_0 \equiv x\).
The variational EM algorithm (neglecting amortized inference for now) consists of:
E-step (inference, updating the activities \(z\)): gradient descent on \(F\) with respect to \(z\), using \(\frac{\partial F}{\partial z_l} = \begin{cases} \Sigma_{l+1}^{-1} \epsilon_l - \left[\frac{\partial \mu_l(z_l)}{\partial z_l}\right]^T \Sigma_l^{-1} \epsilon_{l-1}, & l = 1, \ldots, L-1\\ \hat{\Sigma}^{-1} \epsilon_L - \left[\frac{\partial \mu_L(z_L)}{\partial z_L}\right]^T \Sigma_L^{-1} \epsilon_{L-1}, & l = L \end{cases}\) where we have defined \(\epsilon_l := z_l - \mu_{l+1}(z_{l+1})\) (for \(l < L\)) and \(\epsilon_L := z_L - \hat{\mu}\). M-step (learning, updating the weights): gradient descent on \(F\) with respect to \(\theta\) (and \(\hat{\mu}, \hat{\Sigma}\)).
This update process is local, and hence biologically plausible, since each update of \(z_l\) only depends on locally accessible information of its children nodes (in this case, \(z_{l-1}\)).
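A minimal numeric sketch of the E-step (identity covariances and a toy tanh network for \(\mu_l\); everything here is an assumption for illustration). Gradient descent on \(z\) using the analytic gradients above decreases \(F\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: dims[0] is the observation x; L = 2 hidden layers.
dims = [3, 4, 5]
L = len(dims) - 1
# Toy weights for the mean maps mu_l; identity covariances throughout.
W = [None] + [0.3 * rng.normal(size=(dims[l - 1], dims[l])) for l in range(1, L + 1)]
mu_hat = np.zeros(dims[L])  # top-layer prior mean

def mu(l, z_l):
    return W[l] @ np.tanh(z_l)  # assumed form of mu_l(z^l; theta_l)

def free_energy(x, z):
    eps = [x - mu(1, z[1])]                               # epsilon_0
    eps += [z[l] - mu(l + 1, z[l + 1]) for l in range(1, L)]
    eps += [z[L] - mu_hat]                                # epsilon_L
    return 0.5 * sum(float(e @ e) for e in eps)

def grad_z(x, z, l):
    """Analytic dF/dz_l from the text, with identity covariances."""
    eps_below = (x if l == 1 else z[l - 1]) - mu(l, z[l])
    jac = W[l] * (1.0 - np.tanh(z[l]) ** 2)               # Jacobian of mu_l at z_l
    eps_here = (z[l] - mu(l + 1, z[l + 1])) if l < L else (z[l] - mu_hat)
    return eps_here - jac.T @ eps_below

# E-step sketch: descend F with respect to the activities z.
x = rng.normal(size=dims[0])
z = [None] + [rng.normal(size=dims[l]) for l in range(1, L + 1)]
F0 = free_energy(x, z)
for _ in range(50):
    for l in range(1, L + 1):
        z[l] = z[l] - 0.1 * grad_z(x, z, l)
F1 = free_energy(x, z)
print(F0, "->", F1)
```

Note the locality: `grad_z` for layer \(l\) touches only the errors at layers \(l\) and \(l-1\). The analytic gradient can also be checked against a finite-difference approximation of \(F\).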
We can extend the above to include amortized inference, where upon receiving \(x\) we compute some initialization value for the neural activity \(z\), and from there perform iterative inference. We can interpret the typical models in machine learning, e.g. transformers, as solely performing such an amortized inference stage, without performing any iterative inference explicitly. However, one can argue that the residual structure of modern architectures, such as transformers, allows one to simulate an iterative inference-like process. Indeed, they take basically the same form,
\(\text{Residual structure:} \; \; \; z^{(l+1)} = z^{(l)} + f_{l+1}(z^{(l)}) \; \; \; \text{at layer} \; l\) \(\text{Explicit iterative inference:} \; \; \; z^{(n+1)} = z^{(n)} - \eta \frac{\partial F(z^{(n)})}{\partial z} \; \; \; \text{at time-step} \; n\)
That is, we can potentially think of iterative inference as taking place implicitly over the layers of a residual architecture (like a transformer), whereas in the brain/predictive coding, it takes place over time via backward connections. This justifies why we may expect transformers to have to be much deeper than the brain; the brain’s recurrent/backward connections are roughly analogous to a deeper, strictly feedforward residual architecture.
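A toy illustration of the correspondence (quadratic energy, numbers assumed): a gradient-descent inference step can be written literally as a residual update \(z \leftarrow z + f(z)\) with \(f(z) = -\eta\, \partial F/\partial z\).

```python
import numpy as np

# Toy energy F(z) = 0.5 * ||z - mu||^2, where mu plays the role of a prediction.
mu = np.array([1.0, -2.0])

def grad_F(z):
    return z - mu

eta = 0.5
def residual_block(z):
    """f(z) = -eta * dF/dz: one iterative-inference step in residual form."""
    return -eta * grad_F(z)

# Stacking "layers" z <- z + f(z) performs iterative inference over depth:
z = np.zeros(2)
for depth in range(20):
    z = z + residual_block(z)
print(z)  # approaches the minimizer mu
```

Each “layer” halves the distance to the minimizer here, so depth plays the role that iteration time plays in explicit iterative inference.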
The above presentation of predictive coding has neglected two aspects: temporality, and action selection. Our general framework naturally extends to include these two aspects, as we will show below.
Neuro interpretation. One perspective highlighted by [13] is that to understand biological intelligence, one should first develop a computational/algorithmic description of cognition, and only then should one consider how such an algorithm could be implemented neurally. Ambitiously, such an approach would shed light on why the brain has the structure and properties that it does. In this vein, we can make the following rough analogies to the brain:
[1] Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review.
[2] Millidge, B. (2020). Deep active inference as variational policy gradients.
[3] Ha, D., & Schmidhuber, J. (2018). World models.
[4] McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem.
[5] Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain.
[6] Song, Y., Millidge, B., Salvatori, T., Lukasiewicz, T., Xu, Z., & Bogacz, R. (2024). Inferring neural activity before plasticity as a foundation for learning beyond backpropagation.
[7] Tschantz, A., Millidge, B., Seth, A. K., & Buckley, C. L. (2023). Hybrid predictive coding: Inferring, fast and slow.
[8] Alonso, N., Millidge, B., Krichmar, J., & Neftci, E. O. (2022). A theoretical framework for inference learning.
[9] Friston, K., Schwartenbeck, P., FitzGerald, T., Moutoussis, M., Behrens, T., & Dolan, R. J. (2013). The anatomy of choice: active inference and agency.
[10] Millidge, B., Tschantz, A., & Buckley, C. L. (2021). Whence the expected free energy?
[11] Champion, T., Bowman, H., Marković, D., & Grześ, M. (2024). Reframing the Expected Free Energy: Four Formulations and a Unification.
[12] Millidge, B., Walton, M., & Bogacz, R. (2022). Reward bases: Instantaneous reward revaluation with temporal difference learning.
[13] Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information.
Given Equation (1), the term (c) in Equation (2) can be written
\[\begin{align*} &\sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[-\log p(\pi\mid s_{\tau})]\\ &= -\beta \sum_{\tau=1}^{t-1} \mathbb{E}_{q(\pi\mid s_{\tau}) q(s_{\tau}\mid x_{1:t}, a_{<t})}[V_{\pi}(s_{\tau})] \end{align*}\]If we choose \(q(\pi\mid s_{\tau}) = \delta(\pi-\pi_{\phi})\), then we essentially arrive at the problem of:
\[\text{maximize} \; V_{\pi_{\phi}}(s_t) \; \text{with respect to} \; \phi\]which is exactly the goal of policy-based reinforcement learning (RL). Recall that,
\[V_{\pi}(s_t) = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi)}\left[\sum_{\tau=t}^{T} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\right],\] \[Q_{\pi}(s_t, a_t) = \mathbb{E}_{p(s_{>t}, a_{>t}\mid s_t, a_t, \pi)}\left[\sum_{\tau=t}^{T} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\right]\]and note that
\[V_{\pi}(s) = \mathbb{E}_{\pi(a\mid s)}[Q_{\pi}(s, a)], \quad Q_{\pi}(s, a) = R(s, a) + \gamma \mathbb{E}_{p(s'\mid s, a)}[V_{\pi}(s')]\]There appear to be two notable methods for performing credit assignment in RL: (a) policy gradient methods (e.g. REINFORCE, PPO) (b) amortized gradient methods (e.g. DQN, DDPG, SVG(0), SAC). Both involve gradient ascent using gradient \(\nabla_{\phi} V_{\pi_{\phi}}(s_t)\), yet utilize different expressions for this gradient. As a quick summary,
a) Policy gradient methods consider writing the gradient in the form,
\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) = \sum_{\tau=t}^{\infty} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[\Phi_{t, \tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]with various choices of \(\Phi_{t, \tau}\), typically involving approximations \(\hat{V} \approx V_{\pi_{\phi}}\) and/or \(\hat{Q} \approx Q_{\pi_{\phi}}\), described in more detail below.
b) Amortized gradient methods instead consider the form,
\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) \approx \nabla_{\phi} \mathbb{E}_{\pi_{\phi}(a_t\mid s_t)}[\hat{Q}(s_t, a_t)]\]for an approximation \(\hat{Q} \approx Q_{\pi_{\phi}}\) or \(Q_{\pi_{*}}\) (for optimal policy \(\pi_{*}\)). In most contexts we can write \(a_t = f_{\phi}(s_t, \epsilon)\) under \(a_t \sim \pi_{\phi}(a_t\mid s_t)\) for random variable \(\epsilon \sim p(\epsilon)\), and hence via the reparameterization trick we can write
\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) \approx \mathbb{E}_{p(\epsilon)}[\nabla_{\phi} \hat{Q}(s_t, f_{\phi}(s_t, \epsilon))]\]As an example of what \(f_{\phi}\) may look like in practice, for a Gaussian policy:
\[\pi_{\phi}(a_t\mid s_t) = \text{N}(a_t; \mu_{\phi}(s_t), \Sigma_{\phi}(s_t))\] \[\implies a_t = f_{\phi}(s_t, \epsilon) = \mu_{\phi}(s_t) + U_{\phi}(s_t)\, \epsilon, \; \; \; \text{where} \; \; \Sigma_{\phi}(s_t) = U_{\phi}(s_t) U_{\phi}(s_t)^{T}\]for \(\epsilon \sim \text{N}(0, I)\).
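As a quick numerical sanity check of the reparameterization trick, consider a toy setup (all details assumed for illustration): a scalar action, the policy mean \(\mu\) as the only parameter, a fixed \(\sigma\), and a known critic \(\hat{Q}(a) = -(a - a^*)^2\). The Monte Carlo estimate \(\mathbb{E}_{p(\epsilon)}[\nabla_{\mu} \hat{Q}(\mu + \sigma\epsilon)]\) should match the analytic gradient \(-2(\mu - a^*)\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a_star = 0.5, 0.3, 2.0   # toy parameters, chosen arbitrarily

def dQ_da(a):
    # critic Q_hat(a) = -(a - a_star)^2, so dQ/da = -2 (a - a_star)
    return -2.0 * (a - a_star)

# Reparameterize a = f(eps) = mu + sigma * eps with eps ~ N(0, 1); then
# grad_mu E[Q_hat(a)] = E[dQ/da * da/dmu] = E[dQ/da(mu + sigma * eps)].
eps = rng.standard_normal(100_000)
mc_grad = dQ_da(mu + sigma * eps).mean()
analytic_grad = -2.0 * (mu - a_star)
print(mc_grad, analytic_grad)
```

The key point is that the gradient passes through the sample \(a\) itself, rather than through the log-density as in policy gradient methods.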
Amortized gradient methods. Examples of amortized gradient methods:
DQN corresponds to the amortized method; however, since it operates in the context of (small) discrete action spaces, the optimal action
\[a^{*}(s) = \text{argmax}_a \hat{Q}(s, a)\]can be computed directly. Further, as in Q-learning, \(\hat{Q} \approx Q_{\pi_{*}}\) where \(\pi_{*}\) is the optimal policy, satisfying the Bellman equation,
\[Q_{\pi_{*}}(s, a) = \mathbb{E}_{p(s'\mid s, a)}[R(s, a) + \gamma \max_{a'} Q_{\pi_{*}}(s', a')]\]where we minimize the associated greedy SARSA-like loss,
\[\frac{1}{2} \mathbb{E}_{p(s, a, s'\mid \pi_{a^{*}})}\left[\left(\hat{Q}(s, a) - (R(s, a) + \gamma\max_{a'} \hat{Q}(s', a'))\right)^2\right]\]to obtain \(\hat{Q} \approx Q_{\pi_{*}}\).
DDPG extends Q-learning/DQN to continuous action spaces using the amortized method, restricting to deterministic policies \(\pi_{\phi}(a\mid s) = \delta(a - \mu_{\phi}(s))\), hence with corresponding gradient,
\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) \approx \nabla_{\phi} \hat{Q}(s_t, \mu_{\phi}(s_t))\]and approximating \(\hat{Q} = Q_{\pi_{*}}\) equivalently to DQN.
The SVG(0) algorithm is an extension of DDPG for stochastic policies \(\pi_{\phi}\), by utilizing the reparameterization trick (demonstrated above in (b)). It approximates \(\hat{Q} \approx Q_{\pi_{\phi}}\) (i.e. not under the optimal policy, unlike DDPG) using a SARSA-like objective,
\[\frac{1}{2} \mathbb{E}_{p(s, a, s', a'\mid \pi_{\phi})}\left[\left(\hat{Q}(s, a) - (R(s, a) + \gamma\hat{Q}(s', a'))\right)^2\right]\]SAC extends SVG(0) to include an entropy term, and also uses a value network alongside the action-value network. The authors found that the value network improved training stability. The objectives considered for training \(\hat{V} \approx V_{\pi_{\phi}}\) and \(\hat{Q} \approx Q_{\pi_{\phi}}\) are
\[\frac{1}{2}\mathbb{E}_{p(s)}[(\hat{V}(s) - \mathbb{E}_{\pi_{\phi}(a\mid s)}[\hat{Q}(s, a) \underbrace{- \log \pi_{\phi}(a\mid s)}_{\text{entropy term}}])^2],\] \[\frac{1}{2}\mathbb{E}_{p(s, a\mid \pi_{\phi})}[(\hat{Q}(s, a) - (R(s, a) + \gamma\mathbb{E}_{p(s'\mid s, a)}[\hat{V}(s')]))^2]\]respectively. The reason why \(\hat{V}\)'s objective has an entropy term yet \(\hat{Q}\)'s doesn't is that SAC defines the value \(V_{\pi_{\phi}}\) to include an entropy term.
Policy gradient methods. For the return \(G_t := \sum_{\tau=t}^{\infty} \gamma^{\tau-t} R(s_{\tau}, a_{\tau})\), note that
\[V_{\pi_\phi}(s_t) = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_\phi)}[G_t]\] \[\implies \nabla_{\phi} V_{\pi_\phi}(s_t) = \sum_{\tau=t}^{\infty} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_\phi)}[G_t \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]We can approximate this via a (truncated) Monte Carlo method over trajectory information, however if the trajectory information is taken under an older policy \(\pi_{\phi_{old}}\), we should instead include a ratio factor:
\[\begin{align*} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_\phi)}[G_t \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] &= \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi_{old}})}\left[\frac{\pi_{\phi}(a_{\tau}\mid s_{\tau})}{\pi_{\phi_{old}}(a_{\tau}\mid s_{\tau})} G_t \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})\right]\\ &= \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi_{old}})}\left[G_t \frac{\nabla_{\phi} \pi_{\phi}(a_{\tau}\mid s_{\tau})}{\pi_{\phi_{old}}(a_{\tau}\mid s_{\tau})}\right] \end{align*}\]We can write \(\nabla_{\phi} V_{\pi_{\phi}}(s_t)\) in a more general form by manipulating the expectance, or including a baseline. Specifically, we can write
\[\nabla_{\phi} V_{\pi_{\phi}}(s_t) = \sum_{\tau=t}^{T} \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[\Phi_{t, \tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]for a variety of choices of \(\Phi_{t, \tau}\):
(2) and (3) follow from (1) because, for an arbitrary function \(f = f(s_{\leq \tau}, a_{<\tau})\),
\[\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[f(s_{\leq \tau}, a_{<\tau}) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = 0\]for \(\tau \geq t\). (4) holds because,
\[\begin{align*} &\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{p(s_{\geq\tau}, a_{\geq \tau}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{p(s_{\tau}, a_{\tau}\mid s_t, \pi_{\phi})} \mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{p(s_{\tau}, a_{\tau}\mid s_t, \pi_{\phi})}[(R(s_{\tau}, a_{\tau}) + \underbrace{\mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[\gamma R(s_{\tau+1}, a_{\tau+1}) + \cdots]}_{= Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) - R(s_{\tau}, a_{\tau})}) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\\ &= \mathbb{E}_{p(s_{\tau}, a_{\tau}\mid s_t, \pi_{\phi})}[Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] \end{align*}\](6) follows from (5), using
\[\begin{align*} Q_{\pi_{\phi}}(s_{\tau}, a_{\tau}) &= R(s_{\tau}, a_{\tau}) + \gamma \mathbb{E}_{p(s_{\tau+1}, a_{\tau+1}\mid s_{\tau}, a_{\tau})}[R(s_{\tau+1}, a_{\tau+1})] + \cdots\\ &+ \gamma^T \mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[R(s_{\tau+T}, a_{\tau+T})] + \gamma^{T+1} \mathbb{E}_{p(s_{>\tau}, a_{>\tau}\mid s_{\tau}, a_{\tau}, \pi_{\phi})}[V_{\pi_{\phi}}(s_{\tau+T+1})] \end{align*}\]Examples of policy gradient methods:
GAE estimator. Recall the return
\[G_{\tau} = \sum_{t=\tau}^{\infty} \gamma^{t-\tau} R(s_{t}, a_{t})\]We will define the truncated return,
\[G_{\tau}^{(n)} := \sum_{t=\tau}^{\tau+n-1} \gamma^{t-\tau} R(s_t, a_t) + \gamma^n V(s_{\tau+n})\]We then consider the exponential moving-average,
\[G_{\tau}(\lambda) := (1-\lambda)(G_{\tau}^{(1)} + \lambda G_{\tau}^{(2)} + \lambda^2 G_{\tau}^{(3)} + \cdots)\]Note that \(\lambda = 1\) corresponds to the entire return \(G_{\tau}\), whereas \(\lambda = 0\) corresponds to using the one-step return \(G_{\tau}^{(1)} \equiv R(s_{\tau}, a_{\tau}) + \gamma V(s_{\tau+1})\). Then we can view \(\lambda \in [0, 1]\) as balancing the tradeoff between bias and variance (\(\lambda=0\) is min variance but high bias, and vice-versa for \(\lambda=1\)).
We should view \(G_{\tau}(\lambda)\) as a drop-in approximation for \(G_{\tau}\). This is valid because
\[\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau}^{(n)} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]such that
\[\mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau}(\lambda) \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})] = \mathbb{E}_{p(s_{>t}, a_{\geq t}\mid s_t, \pi_{\phi})}[G_{\tau} \nabla_{\phi} \log \pi_{\phi}(a_{\tau}\mid s_{\tau})]\]as required, using \(1+\lambda+\lambda^2+\cdots = 1/(1-\lambda)\).
Using \(G_{\tau}(\lambda)\) in-place of the return \(G_{\tau}\) for TD-learning results in the TD\((\lambda)\) algorithm.
In the context of advantage baselines, we instead consider an exponential moving-average over advantages, rather than returns. Specifically, we define the truncated advantage
\[\begin{align*} A_{\tau}^{(n)} &:= G_{\tau}^{(n)} - V(s_{\tau})\\ &= \sum_{t=\tau}^{\tau+n-1} \gamma^{t-\tau} R(s_t, a_t) + \gamma^n V(s_{\tau+n}) - V(s_{\tau}) \end{align*}\]and define the GAE (generalized advantage estimator) as an exponential moving-average,
\[A_{\tau}(\lambda) := (1-\lambda)(A_{\tau}^{(1)} + \lambda A_{\tau}^{(2)} + \lambda^2 A_{\tau}^{(3)} + \cdots)\]which, by defining \(\delta_t := R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)\), we can write as
\[\begin{align*} A_{\tau}(\lambda) &= (1-\lambda)(\delta_{\tau} + \lambda(\delta_{\tau} + \gamma \delta_{\tau+1}) + \lambda^2(\delta_{\tau} + \gamma \delta_{\tau+1} + \gamma^2 \delta_{\tau+2}) + \cdots)\\ &= (1-\lambda)\underbrace{(1 + \lambda + \lambda^2 + \cdots)}_{1/(1-\lambda)}(\delta_{\tau} + (\gamma\lambda) \delta_{\tau+1} + (\gamma\lambda)^2 \delta_{\tau+2} + \cdots)\\ &= \sum_{t=\tau}^{\infty} (\gamma\lambda)^{t-\tau} \delta_{t} \end{align*}\]PPO considers a truncated form of GAE
\[A_{\tau}(\lambda, T) = (1-\lambda)(A_{\tau}^{(1)} + \lambda A_{\tau}^{(2)} + \cdots + \lambda^{T-1} A_{\tau}^{(T)})\]But it seems the truncated geometric weights have not been renormalized to sum to one, which would instead correspond to
\[\tilde{A}_{\tau}(\lambda, T) = \frac{1-\lambda}{1-\lambda^T} (A_{\tau}^{(1)} + \lambda A_{\tau}^{(2)} + \cdots + \lambda^{T-1} A_{\tau}^{(T)})\]though perhaps this is negligible in practice.
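In practice, the truncated GAE over a length-\(T\) rollout is typically computed as the sum \(\sum_{t}(\gamma\lambda)^{t-\tau}\delta_t\) via the backward recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\). A minimal numpy sketch (function and variable names are my own):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    # delta_t = R_t + gamma * V(s_{t+1}) - V(s_t); `values` carries one extra
    # entry V(s_T) used as the bootstrap value at the end of the rollout.
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * lam * acc   # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = acc
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3, 0.2])
print(gae(rewards, values, gamma=0.99, lam=0.95))
```

This is the un-renormalized form, matching PPO's usage rather than the bias-corrected \(\tilde{A}_{\tau}(\lambda, T)\).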
Geometric deep learning provides a framework for viewing and deriving architectures via symmetry. Namely, imposing invariances and equivariances on a system with respect to a group of symmetries brings us prominent architectures present in machine learning, such as transformers and CNNs, that are empirically highly effective. This post aims to condense the core ideas of this framework.
Say one wishes to create information processing systems that receive some data signal, and output some property of the data.
Notation:
Data signals come from the space \(\mathcal{X}(\Omega, \mathcal{C}) := \{f: \Omega \to \mathcal{C}\}\), the set of all functions from \(\Omega\) to \(\mathcal{C}\). \(\Omega\) is called the domain and the dimensions of \(\mathcal{C}\) are called channels.
As a concrete example, 10x10 RGB images can be described with \(\Omega := \mathbb{Z}_{10} \times \mathbb{Z}_{10}\) and \(\mathcal{C} := \mathbb{R}^3\), assigning RGB values to each point in a grid.
This is equivalent to representing the data as a tensor of type \(\mathbb{R}^{10 \times 10 \times 3}\). But here, we make a separation of the shape \((10, 10, 3)\) into \((\Omega, \mathcal{C})\). The reason for this will become clearer, but we define our symmetries by a group \(G\) that acts on \(\Omega\), whereas \(\mathcal{C}\) is not involved in these symmetries.
Our system \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{Y}\) takes a data signal, and returns a property of the signal.
Universal Approximation theorems tell us that we can choose \(f = f(x; \theta)\) to be a single hidden layer NN and we can get arbitrarily close to any function for some choice of parameters \(\theta\). But finding these parameters by optimization can require an infeasible number of data points from the true underlying data distribution. How do we circumvent this?
Luckily, when we are creating an information processing system, we know something beforehand about the structure of data it will receive and the task we wish to solve, and hence the underlying symmetries of the data signal domain \(\Omega\). As seen in the example above, for classification of images, it is reasonable to expect that \(f\) should output the same result even if the image is translated, since the object class will not change under such translation.
So we know beforehand that \(f\) should be invariant under certain transformations of its input signals, which narrows the function space we are optimizing over if we can properly encode this into the architecture of \(f\).
The concept of built-in exploitation of symmetry into the model architecture is an example of an inductive bias. Regularization techniques are also inductive biases, with weight decay biasing towards low weight norms. But here we are just concerned about inductive biases built into the architecture of the system, not those built into the optimization process.
Tangent propagation uses a regularizing term in the optimization objective to incentivize local invariances to transformations. Data augmentation also produces such an effect. However in geometric deep learning, the goal is to build these invariances into the functional form/architecture of the model itself.
Sidenote (not important): an example of tangent propagation: for data \(x \in \mathbb{R}^n\) and model \(y(x) \in \mathbb{R}^m\) under transformations parameterized by, for convenience, a scalar \(\xi\), then \(\xi \cdot x \in \mathbb{R}^n\) is the transformed data point (with \(0 \cdot x = x\)). Then
\[\frac{\partial y(\xi \cdot x)_i}{\partial \xi}\bigg\rvert_{\xi=0} = \sum_{j=1}^{n} \frac{\partial y(x)_i}{\partial x_j} \frac{\partial (\xi \cdot x)_j}{\partial \xi}\bigg\rvert_{\xi=0} =: \sum_{j=1}^{n} J_{ij} \tau_j\]with Jacobian \(J_{ij} := \frac{\partial y_i}{\partial x_j}\) and tangent vector \(\tau := \frac{\partial (\xi \cdot x)}{\partial \xi}\rvert_{\xi=0}\). Then tangent propagation includes a regularizing term
\[\lambda \sum_{i=1}^{m} \left(\frac{\partial y(\xi \cdot x)_i}{\partial \xi}\bigg\rvert_{\xi=0}\right)^2 = \lambda \sum_{i=1}^{m} \left(\sum_{j=1}^{n} J_{ij} \tau_j\right)^2\]into the objective function to incentivize local invariances, with \(\lambda\) chosen to balance this invariance effect.
With knowledge of the kind of data we are working with, and the task we wish to solve, we can determine symmetries on the domain \(\Omega\), and describe these symmetries by a group \(G\).
\(G\) acts on \(\Omega\) via \(\bullet: G \times \Omega \to \Omega\), and \(\bullet_g \in \text{Sym}(\Omega)\) is the group action restricted to \(g \in G\), with \(\bullet_g(u) = g \bullet u\).
We can define another group action \(\ast: G \times \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega, \mathcal{C})\) by \(g \ast x := x(g^{-1} \bullet \; \cdot \;) \in \mathcal{X}(\Omega, \mathcal{C})\), now instead acting on a space of functions on \(\Omega\) rather than \(\Omega\) itself. This is a valid group action if \(\bullet\) is a valid group action.
Definition: \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{Y}\) is called \(G\)-invariant if
\[f \circ \ast_g = f\]\(\forall \; g \in G\).
Definition: \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}')\) is called \(G\)-equivariant if
\[f \circ \ast_g = \ast'_g \circ f\]\(\forall \; g \in G\), where \(\ast\), \(\ast'\) are the group actions of \(G\) on \(\mathcal{X}(\Omega, \mathcal{C})\), \(\mathcal{X}(\Omega', \mathcal{C}')\) respectively.
And two key properties: the composition of \(G\)-equivariant maps is \(G\)-equivariant, and the composition of a \(G\)-invariant map with a \(G\)-equivariant map is \(G\)-invariant.
There is a problem with a function \(f\) having just \(G\)-invariance. It says nothing regarding how robust \(f\) is to transformations that are close to the transformations of \(G\), but not exactly in \(G\).
We can formalize the idea of \(f\) being \(G\)-invariant while also being robust to perturbations outside of \(G\) by the notion of approximate invariance. A complexity measure \(c: \text{Diff}(\Omega) \to \mathbb{R}^+\) measures how ‘close’ a transformation \(\tau \in \text{Diff}(\Omega)\) is to the transformations of a group \(G \subset \text{Diff}(\Omega)\), with \(c(g) = 0 \; \forall \; g \in G\). Then \(f\) is approximately invariant if
\[\lVert f(\ast_{\tau}(x)) - f(x) \rVert \leq C c(\tau) \lVert x \rVert\]\(\forall \; x \in \mathcal{X}(\Omega, \mathcal{C}), \tau \in \text{Diff}(\Omega)\) for some constant \(C\).
Then from the above definition, such an \(f\) is both \(G\)-invariant (case of \(\tau \in G\)) and is also sufficiently stable to perturbations outside \(G\), depending on the chosen complexity measure \(c\).
Similarly, \(f\) is called approximately equivariant if
\[\lVert f(\ast_{\tau}(x)) - \ast'_{\tau}(f(x)) \rVert \leq C c(\tau) \lVert x \rVert\]\(\forall \; x \in \mathcal{X}(\Omega, \mathcal{C})\), \(\tau \in \text{Diff}(\Omega)\).
How do we achieve perturbative stability as described by approximate invariance/equivariance? One heuristic for achieving perturbative stability is that of locality.
Consider \(\Omega = \mathbb{R}\) with the translation group \(G = (\mathbb{R}, +)\), acting by \(\ast_v(x)(u) := x(u - v)\). Also consider a transformation \(\tau \in \text{Diff}(\mathbb{R})\) with action \(\ast_{\tau}(x)(u) = x(u - \tilde{\tau}(u))\), where \(\tilde{\tau}\) is not a constant function, hence \(\tau\) is not a translation. Suppose, however, that \(\lVert \nabla \tilde{\tau} \rVert_{\infty} \leq \epsilon\) for some small \(\epsilon\); then \(\tau\) can be said to be an approximate translation.
Let \(f(x) := |\hat{x}|\), the Fourier modulus, with \(\hat{x}(\xi) = \int_{-\infty}^{\infty} x(u) e^{-i\xi u} du\) the Fourier transform. Then since \(\hat{\ast_v(x)}(\xi) = \hat{x}(\xi) e^{-i \xi v}\), we have \(f(\ast_v(x)) = |\hat{x}| \equiv f(x)\), hence \(f\) is \(G\)-invariant. However, it fails to be approximately invariant, as one can show that for the approximate translation \(\tau\),
\[\frac{\lVert f(\ast_{\tau}(x)) - f(x) \rVert}{\lVert x \rVert} = \mathcal{O}(1)\]i.e. it is independent of \(\epsilon > 0\).
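The exact invariance part of this example is easy to verify numerically in the discrete setting, where circular shifts play the role of translations (the instability to non-uniform deformations is harder to demonstrate briefly and is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=128)

# Shifting x multiplies each Fourier coefficient by a unit-modulus phase
# e^{-i xi v}, so the Fourier modulus |x_hat| is exactly unchanged.
f = np.abs(np.fft.fft(x))
f_shifted = np.abs(np.fft.fft(np.roll(x, 17)))
print(np.max(np.abs(f - f_shifted)))
```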
In contrast, let \(f(x) := W_{\psi}(x)(\cdot, \xi)\) for a fixed \(\xi \in \mathbb{R}^+\), with \(W_{\psi}(x)(u, \xi) = \xi^{-1/2} \int_{-\infty}^{\infty} \psi\left(\frac{v - u}{\xi}\right) x(v) dv\), called the wavelet transform. Then it can be shown that
\[\frac{\lVert f(\ast_{\tau}(x)) - \ast'_{\tau}(f(x)) \rVert}{\lVert x \rVert} = \mathcal{O}(\epsilon)\]hence \(f\) is approximately equivariant.
How could this be interpreted? One way is to note that the Fourier transform extracts global properties, namely, frequencies featured in the signal, but does not give any local temporal information. And from above, the Fourier transform is not stable to perturbations outside \(G\).
On the other hand, wavelets provide a balance of global and local information, allowing one to gain information regarding frequencies at localised points in the signal, and from above, the wavelet transform is stable to perturbations.
Hence, roughly, locality can be viewed as a principle that \(f\) should follow to possibly give some stability to its invariances/equivariances. In the Architecture section, this locality property is often imposed, and it can be roughly justified by the above.
We want our system \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{Y}\) to be \(G\)-invariant. How do we go about achieving this?
If we restrict \(f\) to be linear, then \(G\)-invariance implies that
\[\begin{align*} f(x) &= \frac{1}{\mu(G)} \int_G f(x) d\mu(g)\\ &= \frac{1}{\mu(G)} \int_G f(\ast_g(x)) d\mu(g) && \text{(by $G$-invariance)}\\ &= f\left(\frac{1}{\mu(G)} \int_G \ast_g(x) d\mu(g)\right) && \text{(by linearity)}\\ &=: f(\bar{x}) \end{align*}\]with \(\bar{x} := \frac{1}{\mu(G)} \int_G \ast_g(x) d\mu(g) \in \mathcal{X}(\Omega, \mathcal{C})\) representing the average action signal of \(G\) on signal \(x \in \mathcal{X}(\Omega, \mathcal{C})\).
What this says is that \(f\) linear and invariant means \(f\) can only depend on \(\bar{x}\), and \(\bar{x}\) contains very little information about \(x\) in some cases.
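A small discrete example of how much \(\bar{x}\) can discard: for \(G\) the group of cyclic shifts acting on signals of length \(n\), averaging over the orbit collapses \(x\) to a constant signal, so a linear invariant \(f\) can only depend on the mean of \(x\):

```python
import numpy as np

x = np.array([3.0, -1.0, 4.0, 1.0, 5.0])
n = len(x)

# x_bar = (1/|G|) sum_g ast_g(x), with G the cyclic shifts of Z_n:
# every entry collapses to mean(x), discarding all other structure.
x_bar = np.mean([np.roll(x, g) for g in range(n)], axis=0)
print(x_bar)
```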
Hence if we want \(f\) to be invariant while being effective at a non-trivial task, we require that \(f\) is non-linear. How do we construct an invariant, non-linear \(f\)?
If we have some non-linear equivariances \(\{f_i\}_{i=1}^{N}\) and a (possibly linear) invariance \(A\), we could produce a non-linear \(G\)-invariant system \(f\) by
\[f := A \circ f_N \circ \cdots \circ f_1\]where \(f\) is \(G\)-invariant as a consequence of the two key properties noted at the end of the Key Definitions section.
Say instead we have some linear equivariances \(\{f_i\}_{i=1}^{N}\) that we wish to use to build an invariant system. In order for \(f\) to be expressive we want some non-linearity involved too. A useful result is that:
If \(A: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(\Omega', \mathcal{C}')\) is a \(G\)-equivariance, and \(\Sigma: \mathcal{X}(\Omega', \mathcal{C}') \to \mathcal{X}(\Omega', \mathcal{C}'')\) is defined by \(\Sigma(x) := \sigma \circ x\) with \(\sigma: \mathcal{C}' \to \mathcal{C}''\) (e.g. a non-linearity), then \(\Sigma \circ A\) is \(G\)-equivariant.
So we can choose some non-linearities \(\{\Sigma_i\}_{i=1}^{N}\) and build a \(G\)-invariant \(f\) by
\[f := A \circ \Sigma_N \circ f_N \circ \cdots \circ \Sigma_1 \circ f_1\]A general definition of the concept of a convolution is that of the group convolution.
Definition: For a group \(G\) and \(\theta \in \mathcal{X}(\Omega, \mathcal{C})\), the group convolution \(C_{\theta}: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(G, \mathcal{C})\) is defined by
\[C_{\theta}(x)(g) := \int_{\Omega} \langle x(u), \theta(g^{-1} \bullet u) \rangle_{\mathcal{C}} du\]where \(\langle\cdot, \cdot\rangle_{\mathcal{C}}\) is an inner product on \(\mathcal{C}\).
The group convolution is special in that \(f: \mathcal{X}(G, \mathcal{C}) \to \mathcal{X}(G, \mathcal{C})\) is \(G\)-equivariant and linear iff \(f\) is a group convolution (see Kondor, Trivedi (2018)). So for functions on \(\mathcal{X}(G, \mathcal{C})\), we can fully characterise linear \(G\)-equivariances.
In the case of \(f: \mathcal{X}(\Omega, \mathcal{C}) \to \mathcal{X}(G, \mathcal{C})\) however, certain conditions must be met. If \(\Omega\) is countable, or \(\Omega\) uncountable with \(\det\left(\frac{\partial \bullet_g(u)}{\partial u}\right) = 1\), then the group convolution is \(G\)-equivariant. The former result is proven below, and the latter is proven similarly.
Claim: If \(\Omega\) is countable, then for any \(\theta \in \mathcal{X}(\Omega, \mathcal{C})\), the group convolution is \(G\)-equivariant. That is,
\[C_{\theta} \circ \ast_g = \ast'_g \circ C_{\theta}\]\(\forall \; g \in G\), where \(\ast'_g\) is defined by action \(\bullet'_g: G \to G\) with \(\bullet'_g(h) := gh\).
Proof: Since \(\Omega\) is countable, the integral becomes a sum:
\[C_{\theta}(x)(g) = \sum_{u \in \Omega} \langle x(u), \theta(g^{-1} \bullet u) \rangle_{\mathcal{C}}\]To show \(G\)-equivariance, we make use of the fact that \(\bullet_g \in \text{Sym}(\Omega)\) to make the substitution \(v = h^{-1} \bullet u\):
\[\begin{align*} C_{\theta}(\ast_h(x))(g) &= \sum_{u \in \Omega} \langle x(h^{-1} \bullet u), \theta(g^{-1} \bullet u) \rangle_{\mathcal{C}}\\ &= \sum_{v \in \Omega} \langle x(v), \theta(g^{-1} \bullet (h \bullet v)) \rangle_{\mathcal{C}}\\ &= \sum_{u \in \Omega} \langle x(u), \theta(g^{-1}h \bullet u) \rangle_{\mathcal{C}} \end{align*}\]Compare this to the outer application of \(\ast'_h\):
\[\ast'_h(C_{\theta}(x))(g) = C_{\theta}(x)(h^{-1}g) = \sum_{u \in \Omega} \langle x(u), \theta(g^{-1}h \bullet u) \rangle_{\mathcal{C}}\]These are identical, and so
\[C_{\theta} \circ \ast_g = \ast'_g \circ C_{\theta}\]\(\forall \; g \in G, \theta \in \mathcal{X}(\Omega, \mathcal{C})\) as required, hence \(C_{\theta}\) is \(G\)-equivariant.
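The claim can be checked numerically for the simplest countable example, \(G = \Omega = \mathbb{Z}_n\) acting by circular translation, where the group convolution reduces to circular cross-correlation (names and sizes below are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x, theta = rng.normal(size=n), rng.normal(size=n)

def group_conv(x, theta):
    # C_theta(x)(g) = sum_u <x(u), theta(g^{-1} . u)>, with G = Omega = Z_n
    # acting by translation: g^{-1} . u = u - g (mod n).
    return np.array([sum(x[u] * theta[(u - g) % n] for u in range(n))
                     for g in range(n)])

h = 5
lhs = group_conv(np.roll(x, h), theta)   # C_theta(ast_h(x)); ast_h(x)(u) = x(u - h)
rhs = np.roll(group_conv(x, theta), h)   # ast'_h(C_theta(x))(g) = C_theta(x)(g - h)
print(np.max(np.abs(lhs - rhs)))
```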
Classical CNNs can be derived by choosing to use a linear equivariance wrt. translations, which is equivalently a convolution \(C_{\theta}\) (since \(G = \Omega = \mathbb{Z}^2\)), and then imposing locality by choosing a localised filter \(\theta\) such that \(C_{\theta}(x)((u, v)^T)\) depends only on a small \(f_H \times f_W\) grid in the original input. Since \(G\) is the group of translations, the group convolution is
\[C_{\theta}(x)(u) = \sum_{v \in \Omega} x(v) \theta(u^{-1} \bullet v) = \sum_{v \in \Omega} x(v) \theta(v - u)\]Let \(\Omega := \mathbb{Z}^2\), with values outside of an \(H \times W\) grid set to be \(0\), and let \(\mathcal{C} = \mathbb{R}\) for now. Then a localised filter \(\theta \in \mathcal{X}(\mathbb{Z}^2, \mathbb{R})\) represented by an \(f_H \times f_W\) grid can be written in the basis \(\{\theta_{1, 1}, \ldots, \theta_{f_H, f_W}\}\) with \(\theta_{i, j}(u, v) := \delta(u - i, v - j)\) like so:
\[\theta = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{i, j} \theta_{i, j}\]with \(\{w_{i, j}\}_{i, j} \subset \mathbb{R}\) learnable.
Then since \(C\) is linear wrt. \(\theta\) (since \(C_{\alpha \theta_1 + \beta \theta_2} = \alpha C_{\theta_1} + \beta C_{\theta_2}\)), then the corresponding localised convolution is
\[C_{\theta} = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{i, j} C_{\theta_{i, j}}\]\(C_{\theta}\) corresponds to a stride of 1, as
\[C_{\theta_{i, j}}(x)((u, v)^T) = \sum_{w \in \Omega} x(w) \theta_{i, j}(w - (u, v)^T) = x((u+i, v+j)^T)\]and so
\[C_{\theta}(x)((u, v)^T) = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} w_{i, j} x((u+i, v+j)^T)\]In the case of \(C\) channels, with \(\mathcal{C} := \mathbb{R}^C\), mapping to \(C'\) channels, then by writing \(\theta \in \mathcal{X}(\mathbb{Z}^2, \mathbb{R}^C)\) in the basis \(\{\theta_{1, 1, 1}, \ldots, \theta_{f_H, f_W, C}\}\), the localised convolution is
\[C_{\theta}(x)((u, v)^T)_{c'} = \sum_{i=1}^{f_H} \sum_{j=1}^{f_W} \sum_{c=1}^{C} w_{i, j, c, c'} x((u+i, v+j)^T)_c\]since there are \(C'\) filters each of shape \((f_H, f_W, C)\).
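Written directly as code (0-indexed, with zero padding outside the \(H \times W\) grid as stated above; sizes are arbitrary for the demo), the multi-channel localised convolution is:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, C_out, fH, fW = 6, 6, 3, 2, 2, 2
x = rng.normal(size=(H, W, C))
w = rng.normal(size=(fH, fW, C, C_out))

def conv(x, w):
    # C_theta(x)((u,v))_{c'} = sum_{i,j,c} w[i,j,c,c'] x[u+i, v+j, c],
    # with x zero-padded outside the H x W grid (stride 1).
    H, W, C = x.shape
    fH, fW, _, C_out = w.shape
    xp = np.zeros((H + fH, W + fW, C))
    xp[:H, :W] = x
    out = np.zeros((H, W, C_out))
    for u in range(H):
        for v in range(W):
            out[u, v] = np.einsum('ijc,ijcd->d', xp[u:u+fH, v:v+fW], w)
    return out

y = conv(x, w)
print(y.shape)
```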
Consider the case of the domain \(\Omega = V\), where \(V\) are the nodes of a graph \(\mathcal{G} = (V, E)\), and \(\mathcal{C} = \mathbb{R}^d\). Then the signal \(x \in \mathcal{X}(V, \mathbb{R}^d)\) can be equivalently represented by a matrix \(X \in \mathbb{R}^{n \times d}\), where \(n := |V|\). Here we are assuming there are only node representations, and none for edges.
We can represent connection information by an adjacency matrix \(A \in \mathbb{R}^{n \times n}\), with \(A_{ij} = 1\) if \((i, j) \in E\), and \(A_{ij} = 0\) otherwise.
Naturally we want the system to be \(G\)-invariant wrt. \(G = \text{Sym}(\{1, \ldots, n\})\) such that an arbitrary renumbering of nodes has no effect. To build such invariances, we need \(G\)-equivariances. Specifically, for \(f: \mathbb{R}^{n \times d_{I}} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times d_{O}}\) to be \(G\)-equivariant, we need
\[f(PX, PAP^T) = Pf(X, A)\]\(\forall \; P \in S_n\), where \(S_n\) is the group of \(n \times n\) permutation matrices.
One way of imposing locality here is to impose that \(f(X, A)_i \in \mathbb{R}^{d_{O}}\) only depends on \(x_i\) and \(X_{\mathcal{N}_i(A)} \in \mathbb{R}^{\vert \mathcal{N}_i(A)\vert \times d_I}\), the node representations of neighbours of \(i\). This means that
\[f(X, A)_i = \phi(x_i, X_{\mathcal{N}_i(A)})\]In this case, \(\phi\) must be invariant to permutations of the input neighbours, otherwise \(f\) won't necessarily be equivariant. This means that we need
\[\phi(x_i, PX_{\mathcal{N}_i(A)}) = \phi(x_i, X_{\mathcal{N}_i(A)})\]\(\forall \; P \in S_{\vert \mathcal{N}_i(A)\vert }\).
One way of imposing that \(\phi\) is invariant is to enforce that \(\phi\) only depends on a permutation invariant operator across the neighbours, i.e.
\[\phi(x_i, X_{\mathcal{N}_i(A)}) = \phi\left(x_i, \bigoplus_{j \in \mathcal{N}_i(A)} \psi(x_i, x_j)\right)\]which is the most general form of a GNN.
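To make this concrete, here is a minimal numpy sketch with assumed instantiations: \(\bigoplus\) as sum aggregation, \(\psi(x_i, x_j) = W_{\text{msg}}^T x_j\), and \(\phi(x, m) = \text{relu}(W_{\text{self}}^T x + m)\), together with a numerical check of permutation equivariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 3, 4
X = rng.normal(size=(n, d_in))
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.maximum(A, A.T)        # undirected adjacency
np.fill_diagonal(A, 0)

W_self = rng.normal(size=(d_in, d_out))
W_msg = rng.normal(size=(d_in, d_out))

def gnn(X, A):
    # f(X, A)_i = phi(x_i, sum_{j in N_i} psi(x_i, x_j)), with the
    # assumed psi and phi above.
    M = A @ (X @ W_msg)       # sum-aggregation over neighbours
    return np.maximum(X @ W_self + M, 0.0)

# Permutation equivariance: f(PX, P A P^T) = P f(X, A).
P = np.eye(n)[rng.permutation(n)]
lhs = gnn(P @ X, P @ A @ P.T)
rhs = P @ gnn(X, A)
print(np.max(np.abs(lhs - rhs)))
```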
A text prompt with \(n\) tokens can be represented as a graph \(\mathcal{G} = (V, E)\) with \(V := \{1, \ldots, n\}\) and with connectivity \(E\) s.t. \(\mathcal{N}_i(A) := \{1, \ldots, i\} \; \forall \; i \in V\), i.e. each token is connected to itself and all tokens before itself, but none after.
As for GNNs, \(X \in \mathbb{R}^{n \times d}\) represents the tokens, with \(d\) called the residual stream dimension in the context of transformers.
In the case of \(f: \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times d}\) being a general GNN, i.e.
\[f(X, A)_i = \phi\left(x_i, \bigoplus_{j \in \mathcal{N}_i(A)} \psi(x_i, x_j)\right)\]then by defining learnable parameters \(Q_h, K_h, V_h, O_h \in \mathbb{R}^{d_{\text{head}} \times d}\) for each \(h = 1, \ldots, H\), where \(d_{\text{head}}\) is the head dimension, we can arrive at a transformer attention layer with \(H\) heads by choosing
softmax-normalised attention scores
\[a_h(x_i, x_j) := \text{softmax}(\{(Q_h x_i)^T K_h x_k : k \in \mathcal{N}_i(A)\})_j = \frac{e^{(Q_h x_i)^T K_h x_j}}{\sum_{k \in \mathcal{N}_i(A)} e^{(Q_h x_i)^T K_h x_k}}\]These choices give
\[f(X, A)_i = x_i + \sum_{h=1}^{H} O_h^T\left(\sum_{j=1}^{i} a_h(x_i, x_j) V_h x_j\right)\]which is exactly the attention layer of the GPT models.
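This formula transcribes directly into code. The following is an illustrative NumPy sketch of the causal attention layer above, for a single sequence, with no layer norm, MLP blocks, or positional encodings:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilized softmax over a 1-D score vector
    e = np.exp(z)
    return e / e.sum()

def attention_layer(X, heads):
    """f(X, A)_i = x_i + sum_h O_h^T sum_{j<=i} a_h(x_i, x_j) V_h x_j,
    with the causal neighbourhood N_i(A) = {1, ..., i}."""
    n, d = X.shape
    out = X.copy()                        # residual connection x_i + ...
    for (Q, K, V, O) in heads:            # each matrix of shape (d_head, d)
        for i in range(n):
            scores = np.array([(Q @ X[i]) @ (K @ X[j]) for j in range(i + 1)])
            a = softmax(scores)           # normalised over the causal neighbourhood
            head = sum(a[j] * (V @ X[j]) for j in range(i + 1))
            out[i] = out[i] + O.T @ head  # project head output back to dimension d
    return out

rng = np.random.default_rng(1)
n, d, d_head, H = 4, 8, 2, 3
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d_head, d)) for _ in range(4)) for _ in range(H)]
Y = attention_layer(X, heads)
assert Y.shape == (n, d)
```

Note that token \(i\)'s output only reads from \(X[j]\) for \(j \leq i\), which is exactly the causal connectivity described above.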
An RNN has an update equation of the form
\[h_{t+1} := R(z_t, h_t)\]By undiscretizing by \(h_{t+1} \mapsto h(t+1)\), and using a Taylor expansion of
\[h(t+1) \approx h(t) + \frac{dh(t)}{dt}\]we can get a continuous update rule:
\[h(t) + \frac{dh(t)}{dt} = R(z(t), h(t))\]Consider a time warping \(\tau: \mathbb{R}^+ \to \mathbb{R}^+\), which is monotone increasing, i.e. \(\frac{d\tau(t)}{dt} > 0\). It turns out that the gating mechanism present in GRUs and LSTMs can be derived by requiring that the RNN model class is invariant to such time warping operations.
First rearrange to \(\frac{dh(t)}{dt} = R(z(t), h(t)) - h(t)\), then make the substitution \(t \mapsto \tau(t)\):
\[\frac{dh(\tau(t))}{d\tau(t)} = R(z(\tau(t)), h(\tau(t))) - h(\tau(t))\]And using the chain rule,
\[\frac{dh(\tau(t))}{dt} = \frac{dh(\tau(t))}{d\tau(t)} \frac{d\tau(t)}{dt} = \frac{d\tau(t)}{dt}(R(z(\tau(t)), h(\tau(t))) - h(\tau(t)))\]Using a Taylor expansion
\[h(\tau(t+1)) \approx h(\tau(t)) + \frac{dh(\tau(t))}{dt}\](assuming that \(\frac{d\tau(t)}{dt} < 1\)), and defining \(\Gamma := \frac{d\tau(t)}{dt}\), this can be rewritten as
\[h(\tau(t+1)) = \Gamma R(z(\tau(t)), h(\tau(t))) + (1 - \Gamma)h(\tau(t))\]Since the time warping \(\tau\) is unknown, the inputs received are effectively \(z_t := z(\tau(t))\). This also means that \(\Gamma\) should be learnable.
If we now impose that the model class is invariant to time warping, such that \(h(\tau(t)) = h(t)\), and then discretize, we obtain:
\[h_{t+1} = \Gamma R(z_t, h_t) + (1 - \Gamma)h_t\]And since \(0 < \Gamma < 1\), it is natural to parameterize \(\Gamma\) in the same form as a SimpleRNN cell, but with a sigmoid nonlinearity \(\sigma: \mathbb{R} \to (0, 1)\) to match the range of \(\Gamma\). Then,
\[\Gamma := \sigma(Wz_t + Uh_t + b)\]which is exactly the expression for a gate in the GRU and LSTM models.
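Putting the pieces together, a single gated update step can be sketched as follows, assuming a tanh SimpleRNN-style candidate for \(R\); the weights here are random placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(z_t, h_t, params):
    """One gated update h_{t+1} = G * R(z_t, h_t) + (1 - G) * h_t, with the
    gate G = sigmoid(W z_t + U h_t + b) as in a GRU/LSTM gate, and a
    SimpleRNN-style candidate R = tanh(Wr z_t + Ur h_t + br)."""
    W, U, b, Wr, Ur, br = params
    gamma = sigmoid(W @ z_t + U @ h_t + b)      # gate in (0, 1): a learned dtau/dt
    cand = np.tanh(Wr @ z_t + Ur @ h_t + br)    # candidate state R(z_t, h_t)
    return gamma * cand + (1.0 - gamma) * h_t

rng = np.random.default_rng(2)
dz, dh = 3, 4
params = (rng.normal(size=(dh, dz)), rng.normal(size=(dh, dh)), rng.normal(size=dh),
          rng.normal(size=(dh, dz)), rng.normal(size=(dh, dh)), rng.normal(size=dh))
h = np.zeros(dh)
for t in range(5):
    h = gated_step(rng.normal(size=dz), h, params)
assert h.shape == (dh,)
```

When the gate saturates near \(0\), the state is simply carried over, which is what makes these models robust to (unknown) time warpings of the input.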
This post describes the notion of an agent minimizing their free-energy and its extension to active inference, which includes the ability for the agent to take action, additionally introducing an expected free-energy. I then derive an approach for applying these methods to reinforcement learning, comparing my approach to another approach in the literature. I then briefly describe predictive coding, a more biologically plausible replacement for backpropagation.
The free-energy principle says that self-organizing systems (e.g. biological systems) can be viewed as minimizing a quantity called the free energy.
There is a mathematically involved derivation of the free-energy principle by assuming that states follow Langevin dynamics along with some other assumptions. For a full description of this approach see Friston (2019).
Assuming the Bayesian brain hypothesis provides another, less involved, route to the concept of the brain minimizing free energy, as will be described in this section. The Bayesian brain hypothesis says: the brain has a generative model describing its beliefs about the world, and this model is updated based on perception.
Specifically, the brain receives sensory data/observation denoted by \(o \in \mathcal{O}\), and it is assumed this data is generated by some underlying process dependent on some cause/latent variable \(x \in \mathcal{X}\). Then the brain’s beliefs are implicitly contained in its generative model \(p(x, o) = p(x) p(o\vert x)\), made up of an explicit prior \(p(x)\) over causes and an explicit likelihood \(p(o\vert x)\) of observations given causes.
After receiving an observation \(o\), the posterior \(p(x\vert o)\) is determined by Bayes rule:
\[p(x\vert o) = \frac{p(x, o)}{p(o)}\]Typically, however, \(p(o) = \int_{\mathcal{X}} p(x, o) dx\) is intractable, so we consider a variational posterior \(q(x\vert o)\) that we want to effectively approximate \(p(x\vert o)\). The natural objective to achieve this is to minimize the KL-divergence between these distributions, and with some rearranging:
\[\begin{align*} D_{KL}(q(x|o) || p(x|o)) &= D_{KL}(q(x|o) || \frac{p(x, o)}{p(o)})\\ &= D_{KL}(q(x|o) || p(x, o)) + \log(p(o))\\ &\leq D_{KL}(q(x|o) || p(x, o)) =: F(o) \end{align*}\]where \(F(o)\) is called the variational free energy of observation \(o\), and is a tractable upper bound on this objective. This has shown that updating beliefs in a tractable way can be done by minimizing the variational free energy.
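As a concrete check, here is a discrete toy model where everything is computable exactly: the free energy \(F(o)\) equals \(-\log p(o)\) when \(q\) is the exact posterior, and exceeds it for any other variational posterior:

```python
import numpy as np

# Discrete toy model: latent x in {0, 1, 2}, a single observed o.
p_x = np.array([0.5, 0.3, 0.2])            # prior p(x)
p_o_given_x = np.array([0.9, 0.2, 0.1])    # likelihood p(o | x) of the observed o
p_xo = p_x * p_o_given_x                   # joint p(x, o)
p_o = p_xo.sum()                           # evidence p(o)

def free_energy(q):
    """F(o) = D_KL(q(x|o) || p(x, o))."""
    return np.sum(q * np.log(q / p_xo))

q_exact = p_xo / p_o                       # true posterior p(x | o)
q_rough = np.array([0.6, 0.3, 0.1])        # some other variational posterior

# F(o) >= -log p(o), with equality iff q equals the true posterior.
assert np.isclose(free_energy(q_exact), -np.log(p_o))
assert free_energy(q_rough) > -np.log(p_o)
```

Minimizing \(F\) over \(q\) thus recovers the true posterior without ever evaluating the intractable normalizer directly.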
Diffusion models follow from a data distribution \(p(o)\) and \(x = (x_1, \ldots, x_T)\), with \(p(x_T)\) an isotropic Gaussian, and generative model \(p(x, o; \theta) = p(x_T) \prod_{t=1}^{T} p(x_{t-1}|x_t; \theta)\) with \(x_0 := o\), and \(q(x|o) = \prod_{t=1}^{T} q(x_t|x_{t-1})\) (i.e. Markov chains in both directions).
Specifically, there is an exact noising process \(q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)\) and a parameterized noise-reversing process \(p(x_{t-1}|x_t; \theta) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))\). The rest then amounts to minimizing \(\mathbb{E}_{p(o)}[F(o)]\) by using some tricks to find a low variance estimator and applying gradient descent.
The negative model evidence acts as a lower bound on \(F(o)\):
\[\begin{align*} F(o) = D_{KL}(q(x|o)||p(x, o)) &= -\log p(o) + D_{KL}(q(x|o)||p(x|o))\\ &\geq -\log p(o) \end{align*}\]since the KL-divergence is non-negative. Then, minimizing \(F(o)\) pushes the model evidence \(\log p(o)\) upwards, i.e. pushes our model to better account for our perceived observations.
By manipulating, a more interpretable expression for the free energy can be obtained:
\[F(o) = D_{KL}(q(x|o)||p(x)) - \mathbb{E}_{q(x|o)}[\log p(o|x)]\]Then minimization of \(F(o)\) incentivizes keeping the posterior \(q(x\vert o)\) close to the prior \(p(x)\) (low complexity) while explaining the observation well through a high expected log-likelihood (high accuracy).
Active inference extends the above by including action. The above describes a system updating its beliefs to minimize free energy, but active inference allows the system to also take actions in the environment in order to make its own beliefs come true and hence minimize free energy.
A system’s observational preferences are encoded in its biased prior denoted by \(\tilde{p}(o)\), e.g. \(\tilde{p}(o)\) may assign high probability to high reward observations in the context of RL.
To make effective actions, one must consider the future consequences of actions. To formalize such a notion requires the concept of an expected free energy.
The free energy of the expected future (FEEF) is defined as
\[\mathcal{F} := D_{KL}(q(o, x, \pi) || \tilde{p}(o, x))\]with \(\tilde{p}(o, x) = \tilde{p}(o) p(x|o)\).
We decompose \(q(o, x, \pi) = q(\pi) q(o, x \vert \pi)\), with \(q(\pi)\) the prior over policies.
Concretely, one can think of \(o = (o_0, \ldots, o_T)\), \(x = (x_0, \ldots, x_T)\) and \(\pi = (a_0, \ldots, a_{T-1})\) representing the observations, states, and actions respectively for a time horizon of length \(T\).
Rewriting the FEEF using the above decomposition for \(q(o, x, \pi)\) gives:
\[\begin{align*} \mathcal{F} &:= D_{KL}(q(o, x, \pi) || \tilde{p}(o, x))\\ &= \mathbb{E}_{q(o, x, \pi)}[\log(\frac{q(o, x, \pi)}{\tilde{p}(o, x)})]\\ &= \mathbb{E}_{q(o, x, \pi)}[\log(q(\pi)) - (\log(\tilde{p}(o, x)) - \log(q(o, x | \pi)))]\\ &= \mathbb{E}_{q(\pi)}[\log(q(\pi)) - \log(e^{-\mathbb{E}_{q(o, x | \pi)}[\log(q(o, x | \pi)) - \log(\tilde{p}(o, x))]})]\\ &= \mathbb{E}_{q(\pi)}[\log(q(\pi)) - \log(e^{-D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))})]\\ &= D_{KL}(q(\pi) || e^{-D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))})\\ &=: D_{KL}(q(\pi) || e^{-\mathcal{F}_{\pi}}) \end{align*}\]where \(\mathcal{F}_{\pi} := D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))\), measuring the difference between our preferences (described by \(\tilde{p}(o, x)\)) and the actual trajectories we observe due to a policy \(\pi\).
Hence, if we consider minimizing \(\mathcal{F}\) wrt. \(q(\pi)\), then by variational principles the minimizer is \(q(\pi) \propto e^{-\mathcal{F}_{\pi}}\) (normalized over policies). This says that the most probable policy under the prior is \(\pi^{*} := \text{argmin}_{\pi} \mathcal{F}_{\pi}\).
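For a finite set of policies, this minimizer is just a softmax over negative free energies; an illustrative sketch with hypothetical \(\mathcal{F}_{\pi}\) values:

```python
import numpy as np

# Given per-policy free energies F_pi over a finite set of policies,
# the minimizing prior is q(pi) proportional to exp(-F_pi), i.e. a
# softmax over the negated free energies.
F_pi = np.array([2.0, 0.5, 1.0, 3.5])     # hypothetical values of F_pi
q_pi = np.exp(-F_pi) / np.exp(-F_pi).sum()

assert np.isclose(q_pi.sum(), 1.0)
assert q_pi.argmax() == F_pi.argmin()     # most probable policy minimizes F_pi
```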
We can rewrite \(\mathcal{F}_{\pi}\) in an interpretable form using the approximation \(p(x\vert o) \approx q(x\vert o) \approx q(x\vert o, \pi)\):
\[\begin{align*} \mathcal{F}_{\pi} &:= D_{KL}(q(o, x | \pi) || \tilde{p}(o, x))\\ &= \mathbb{E}_{q(o, x|\pi)}[\log(q(x|\pi)) + \log(q(o|x, \pi)) - \log(\tilde{p}(o)) - \log(p(x|o))]\\ &\approx \mathbb{E}_{q(o, x|\pi)}[\log(q(x|\pi)) + \log(q(o|x, \pi)) - \log(\tilde{p}(o)) - \log(q(x|o, \pi))]\\ &= \mathbb{E}_{q(x|\pi) q(o|x, \pi)}[\log(\frac{q(o|x, \pi)}{\tilde{p}(o)})] - \mathbb{E}_{q(o|\pi) q(x|o, \pi)}[\log(\frac{q(x|o, \pi)}{q(x|\pi)})]\\ &= \mathbb{E}_{q(x|\pi)}[D_{KL}(q(o|x, \pi) || \tilde{p}(o))] - \mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))] \end{align*}\]and so minimizing \(\mathcal{F}_{\pi}\) means driving predicted observations towards our preferences \(\tilde{p}(o)\) (the first, goal-directed term), while maximizing the expected information gained about latent states from observations (the second, exploratory term).
Now consider the special case of reinforcement learning, where for a time horizon of length \(T\) the agent occupies states \(s_i\), takes actions \(a_i\), and receives rewards \(r_i\), forming a Markov Decision Process (MDP). For one step, \((s_i, a_i) \mapsto o_i\) by the environment, with \(o_i = (r_i, s_{i+1})\) consisting of a reward and the next state.
The natural prior which we impose is that \(\tilde{p}(o_i) = \tilde{p}(r_i)\) has a high density for large \(r_i\), such that our agent prefers to achieve high rewards.
By using the properties of an MDP,
\[q(o, x|\pi) = q(x|\pi) q(o|x, \pi) = \prod_{t=0}^{T} q(x_t|x_{t-1}, \pi) q(o_t|x_t, \pi)\]and also,
\[\tilde{p}(o) = \prod_{t=0}^{T} \tilde{p}(o_t)\]Recall that we can approximate
\[\mathcal{F}_{\pi} \approx \mathbb{E}_{q(x|\pi)}[D_{KL}(q(o|x, \pi) || \tilde{p}(o))] - \mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))]\]Using the fact that \(q(o\vert x, \pi) = \prod_{t=0}^{T} q(r_t\vert x_t, a_t)\) (since MDP), \(q(x\vert \pi) = \prod_{t=0}^{T} q(x_t\vert x_{t-1}, a_{t-1})\) and \(\tilde{p}(o) = \prod_{t=0}^{T} \tilde{p}(o_t)\), then the first term can be written
\[\begin{align*} \mathbb{E}_{q(x|\pi)}[D_{KL}(q(o|x, \pi) || \tilde{p}(o))] &= \mathbb{E}_{q(x|\pi)q(o|x, \pi)}[\log(\prod_{t=0}^{T} \frac{q(r_t|x_t, a_t)}{\tilde{p}(r_t)})]\\ &= \sum_{t=0}^{T} \mathbb{E}_{q(x_t|x_{t-1}, a_{t-1}) q(r_t|x_t, a_t)}[\log(\frac{q(r_t|x_t, a_t)}{\tilde{p}(r_t)})]\\ &= \sum_{t=0}^{T} \mathbb{E}_{q(x_t|x_{t-1}, a_{t-1})}[D_{KL}(q(r_t|x_t, a_t) || \tilde{p}(r_t))] \end{align*}\]and using \(q(x\vert o, \pi) = \frac{q(x\vert \pi)q(o\vert x, \pi)}{q(o\vert \pi)}\) and \(q(o\vert \pi) = \prod_{t=0}^{T} q(x_t(o_t)\vert x_{t-1}(o_{t-1}), a_{t-1}) q(r_t(o_t)\vert x_t(o_t), a_t)\), the second term can be written
\[\begin{align*} \mathbb{E}_{q(o|\pi)}[D_{KL}(q(x|o, \pi) || q(x|\pi))] &= \mathbb{E}_{q(x, o|\pi)}[\log(\frac{q(o|x, \pi)}{q(o|\pi)})]\\ &= \mathbb{E}_{q(x, o|\pi)}[\log(\prod_{t=0}^{T} \frac{q(r_t|x_t, a_t)}{q(x_t|x_{t-1}, a_{t-1}) q(r_t|x_t, a_t)})]\\ &= -\sum_{t=0}^{T} \mathbb{E}_{q(x|\pi) q(o|x, \pi)}[\log(q(x_t|x_{t-1}, a_{t-1}))]\\ &= -\sum_{t=0}^{T} \mathbb{E}_{q(x_{t-1}|x_{t-2}, a_{t-2}) q(x_t|x_{t-1}, a_{t-1})}[\log(q(x_t|x_{t-1}, a_{t-1}))]\\ &= \sum_{t=0}^{T} \mathbb{E}_{q(x_{t-1}|x_{t-2}, a_{t-2})}[H[q(x_t|x_{t-1}, a_{t-1})]] \end{align*}\]hence we can finally write
\[\mathcal{F}_{\pi} = \sum_{t=0}^{T} \mathcal{F}_{\pi, t}\]where
\[\mathcal{F}_{\pi, t} := \mathbb{E}_{q(x_t|x_{t-1}, a_{t-1})}[D_{KL}(q(r_t|x_t, a_t) || \tilde{p}(r_t))] - \mathbb{E}_{q(x_{t-1}|x_{t-2}, a_{t-2})}[H[q(x_t|x_{t-1}, a_{t-1})]]\]In the context of RL, we call \(q(r_t|x_t, a_t)\) the reward model, representing a model of the reward distribution given the state we are in and the actions we take. \(q(x_t|x_{t-1}, a_{t-1})\) is called the transition model and models the next state distribution given the current state and the actions we take.
To implement these ideas computationally, we need to pick reward \& transition distributions, alongside a prior \(\tilde{p}(r_t)\). Gaussians are convenient in that both the KL divergence between two Gaussians and the entropy of a Gaussian have closed-form expressions, which gives a closed-form expression for \(\mathcal{F}_{\pi, t}\).
Given this, we choose the reward model \(q(r_t|x_t, a_t) = \mathcal{N}(r_t; f_{\mu}(x_t, a_t), f_{\sigma^2}(x_t, a_t))\) and the transition model \(q(x_t|x_{t-1}, a_{t-1}) = \mathcal{N}(x_t; g_{\mu}(x_{t-1}, a_{t-1}), \text{diag}(g_{\sigma^2}(x_{t-1}, a_{t-1})))\). We represent \(f_{\mu}, f_{\sigma^2}: \mathbb{R}^{S} \times \mathbb{R}^{A} \to \mathbb{R}\) and \(g_{\mu}, g_{\sigma^2}: \mathbb{R}^{S} \times \mathbb{R}^{A} \to \mathbb{R}^{S}\) as neural networks.
A natural prior is \(\tilde{p}(r_t) = \mathcal{N}(r_t; r_{\text{max}}, \alpha^2)\) where \(r_{\text{max}}\) is the maximum reward in the environment and \(\alpha\) is suitably chosen.
Then, we can find
\[D_{KL}(q(r_t|x_t, a_t) || \tilde{p}(r_t)) = \frac{1}{2}\left(\frac{(f_{\mu}(x_t, a_t) - r_{\text{max}})^2 + f_{\sigma^2}(x_t, a_t)}{\alpha^2} + \log\left(\frac{\alpha^2}{f_{\sigma^2}(x_t, a_t)}\right) - 1\right)\]and
\[H[q(x_t|x_{t-1}, a_{t-1})] = \frac{1}{2} \sum_{i=1}^{S} \log(2\pi e (g_{\sigma^2}(x_{t-1}, a_{t-1}))_i)\]Hence
\[\begin{align*} \mathcal{F}_{\pi, t} = \frac{1}{2}&\mathbb{E}_{q(x_t|x_{t-1}, \pi)}\left[\frac{(f_{\mu}(x_t, a_t) - r_{\text{max}})^2 + f_{\sigma^2}(x_t, a_t)}{\alpha^2} + \log(\frac{\alpha^2}{f_{\sigma^2}(x_t, a_t)}) - 1\right]\\ &- \frac{1}{2}\mathbb{E}_{q(x_{t-1}|x_{t-2}, \pi)}\left[\sum_{i=1}^{S} \log(2\pi e (g_{\sigma^2}(x_{t-1}, a_{t-1}))_i)\right] \end{align*}\]Tschantz et al. (2020) arrive at a different approximation for \(\mathcal{F}_{\pi}\). I now compare my approach detailed above, and theirs, by implementing my approach into the codebase for their paper, with the same hyperparameters across the two approaches. Saving the rewards across 10 runs for each method and averaging over these runs gives plots:

This approach outperforms theirs with 95% confidence (via bootstrapping of the \(U\)-statistic wrt. the reward at the final timestep) in both environments.
Code can be found at: https://github.com/r-gould/active-inference
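For reference, the closed-form Gaussian expressions used above can be sketched directly in code; the numbers below are placeholder model outputs, not values from the experiments:

```python
import numpy as np

def kl_gauss_1d(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)) for 1-D Gaussians."""
    return 0.5 * ((mu_q - mu_p) ** 2 / var_p + var_q / var_p
                  + np.log(var_p / var_q) - 1.0)

def entropy_diag_gauss(var):
    """Closed-form entropy of N(mu, diag(var)); var is an array of S variances."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var))

# Hypothetical model outputs at one (x_t, a_t): reward model N(f_mu, f_var),
# transition model N(g_mu, diag(g_var)), preference prior N(r_max, alpha2).
f_mu, f_var = 0.3, 0.25
r_max, alpha2 = 1.0, 0.5
g_var = np.array([0.1, 0.2, 0.05])

reward_term = kl_gauss_1d(f_mu, f_var, r_max, alpha2)  # goal-directed term
info_term = entropy_diag_gauss(g_var)                  # transition entropy term
F_pi_t = reward_term - info_term       # single-sample estimate of F_{pi,t}

assert reward_term > 0.0
```

In practice \(f_{\mu}, f_{\sigma^2}, g_{\sigma^2}\) are neural network outputs and the outer expectations over \(q(x_t\vert x_{t-1}, \pi)\) are estimated by sampling rollouts.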
Now going back to the case of no action, just perception:
Consider a multi-layer hierarchy described by \(x = (x_1, \ldots, x_L)\) and causal structure
\[x_L \to x_{L-1} \to \cdots \to x_1 \to o\]which means that \(p(o, x) = \prod_{i=0}^{L} p(x_i\vert x_{i+1})\), with \(x_0 := o\) and \(p(x_L\vert x_{L+1}) := p(x_L)\).
Predictive coding can be derived from here by making normal assumptions. In particular, we let
\[p(x_i|x_{i+1}) := \mathcal{N}(x_i; \mu_i(x_{i+1};\theta_i), \Sigma_i(x_{i+1};\theta_i))\]and, using the mean field approximation,
\[q(x|o) \approx \prod_{i=1}^{L} q(x_i|o)\]with \(q(x_i\vert o) := \mathcal{N}(x_i;\hat{\mu}_i, \hat{\Sigma}_i)\).
We can rewrite the free energy as
\[F(o) = D_{KL}(q(x|o)||p(x, o)) = -H[q(x|o)] - \mathbb{E}_{q(x|o)}[\log p(x, o)]\]The entropy term on the left is nice to evaluate since \(q(x\vert o)\) is the product of Gaussians:
\[\begin{align*} -H[q(x|o)] = \mathbb{E}_{q(x|o)}[\log q(x|o)] &= \sum_{i=1}^{L} \mathbb{E}_{q(x_i|o)}[\log q(x_i|o)]\\ &= -\sum_{i=1}^{L} H[q(x_i|o)]\\ &= -\frac{1}{2} \sum_{i=1}^{L} (n_i + \log(|2\pi \hat{\Sigma}_i|)) \end{align*}\]with \(x_i \in \mathbb{R}^{n_i}\). See that this is constant with respect to \(\{\theta_i\}_i\) and \(\{\hat{\mu}_i\}_i\).
Assuming that \(\Sigma_i\) is a fixed hyperparameter, the term on the right of \(F(o)\) can be written using a Taylor expansion of \(\log p(x_i\vert x_{i+1})\) about \(x_i = \hat{\mu}_i\), \(x_{i+1} = \hat{\mu}_{i+1}\). Expanding gives
\[\begin{align*} \log p(x_i|x_{i+1}) = \log p(\hat{\mu}_i|\hat{\mu}_{i+1}) &+ (x_i - \hat{\mu}_i)^T \frac{\partial\log p(x_i|x_{i+1})}{\partial x_i}\bigg\rvert_{x_i = \hat{\mu}_i}\\ &+ \frac{1}{2} (x_i - \hat{\mu}_i)^T \frac{\partial^2 \log p(x_i|x_{i+1})}{\partial x_i^2}\bigg\rvert_{x_i = \hat{\mu}_i} (x_i - \hat{\mu}_i) + ... \end{align*}\]See that \(\mathbb{E}_{q(x_i\vert o)}[x_i-\hat{\mu}_i] = 0\) and \(\frac{\partial^2 \log p(x_i\vert x_{i+1})}{\partial x_i^2} = -\Sigma_i^{-1}\), with third order and higher terms \(0\), and
\[\begin{align*} \mathbb{E}_{q(x_i|o)}[(x_i - \hat{\mu}_i)^T \frac{\partial^2 \log p(x_i|x_{i+1})}{\partial x_i^2}\bigg\rvert_{x_i = \hat{\mu}_i} (x_i - \hat{\mu}_i)] &= -\mathbb{E}_{q(x_i|o)}[\text{tr}((x_i - \hat{\mu}_i)^T \Sigma_i^{-1} (x_i - \hat{\mu}_i))]\\ &= -\mathbb{E}_{q(x_i|o)}[\text{tr}(\Sigma_i^{-1} (x_i - \hat{\mu}_i) (x_i - \hat{\mu}_i)^T)]\\ &= -\text{tr}(\Sigma_i^{-1} \hat{\Sigma}_i) \end{align*}\]where the last line follows as \(\mathbb{E}\) and \(\text{tr}\) commute, and \(\mathbb{E}_{q(x_i|o)}[(x_i - \hat{\mu}_i) (x_i - \hat{\mu}_i)^T] \equiv \hat{\Sigma}_i\). Therefore the first-order term vanishes and the second-order term is constant in \(\{\theta_i\}_i\) and \(\{\hat{\mu}_i\}_i\), so the term on the right of \(F(o)\) is
\[-\mathbb{E}_{q(x|o)}[\log p(x, o)] = -\sum_{i=0}^{L} \mathbb{E}_{q(x|o)}[\log p(x_i|x_{i+1})] = -\sum_{i=0}^{L} \log p(\hat{\mu}_i|\hat{\mu}_{i+1}) + \text{const}\]with \(\hat{\mu}_0 := x_0\). Hence the free energy is
\[\begin{align*} F(o) &= -\sum_{i=0}^{L} \log p(\hat{\mu}_i|\hat{\mu}_{i+1}) + \text{const}\\ &= \frac{1}{2} \sum_{i=0}^{L} ((\hat{\mu}_i - \mu_i(\hat{\mu}_{i+1};\theta_i))^T \Sigma_i(\hat{\mu}_{i+1};\theta_i)^{-1} (\hat{\mu}_i - \mu_i(\hat{\mu}_{i+1};\theta_i)) + \log(|2\pi \Sigma_i(\hat{\mu}_{i+1};\theta_i)|)) + \text{const}\\ &=: \frac{1}{2} \sum_{i=0}^{L} (\epsilon_i^T \Sigma_i^{-1} \epsilon_i + \log(|2\pi \Sigma_i|)) + \text{const} \end{align*}\]with \(\epsilon_i := \hat{\mu}_i - \mu_i(\hat{\mu}_{i+1};\theta_i)\).
Free energy can then be minimized by optimizing over \(\{\theta_i\}_i\) and \(\{\hat{\mu}_i\}_i\).
We can use gradient descent to perform this optimization, with gradients
\[\frac{\partial F}{\partial \theta_i} = -\left(\frac{\partial \mu_i}{\partial \theta_i}\right)^T \Sigma_i^{-1} \epsilon_i\] \[\frac{\partial F}{\partial \hat{\mu}_i} = \Sigma_i^{-1} \epsilon_i - 1\{i \geq 1\} \left(\frac{\partial \mu_{i-1}}{\partial \hat{\mu}_i}\right)^T \Sigma_{i-1}^{-1} \epsilon_{i-1}\]with \(1\{\; \cdot \;\}\) the indicator function.
i.e. these updates are local.
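To make the locality concrete, here is a minimal sketch of these updates under simplifying assumptions (linear predictions \(\mu_i(x; \theta_i) = \theta_i x\), \(\Sigma_i = I\), and dropping the top-level prior term); each update touches only a layer's own error and its immediate neighbour's:

```python
import numpy as np

rng = np.random.default_rng(3)
L_layers, n = 3, 4
# Linear predictions mu_i(x; theta_i) = theta_i @ x with Sigma_i = I, over
# layers x_3 -> x_2 -> x_1 -> o, each of dimension n.
theta = [rng.normal(size=(n, n)) / np.sqrt(n) for _ in range(L_layers)]
o = rng.normal(size=n)
mu_hat = [o] + [rng.normal(size=n) for _ in range(L_layers)]  # mu_hat[0] := o, clamped

def prediction_errors():
    # eps_i = mu_hat_i - theta_i @ mu_hat_{i+1}
    return [mu_hat[i] - theta[i] @ mu_hat[i + 1] for i in range(L_layers)]

def total_F():
    return 0.5 * sum(float(e @ e) for e in prediction_errors())

F0, lr = total_F(), 0.05
for _ in range(200):
    eps = prediction_errors()
    new_mu = list(mu_hat)
    for i in range(1, L_layers):       # local mean updates: own and lower errors only
        new_mu[i] = mu_hat[i] - lr * (eps[i] - theta[i - 1].T @ eps[i - 1])
    new_mu[L_layers] = mu_hat[L_layers] + lr * theta[L_layers - 1].T @ eps[L_layers - 1]
    for i in range(L_layers):          # local weight updates: dF/dtheta_i = -eps_i mu_{i+1}^T
        theta[i] = theta[i] + lr * np.outer(eps[i], mu_hat[i + 1])
    mu_hat = new_mu

assert total_F() < F0                  # free energy decreased
```

No global backward pass is needed: every quantity used to update layer \(i\) lives at layer \(i\) or an adjacent layer.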
Millidge et al. (2020) shows that with a reversed causal structure (for the backward pass) of
\[o \to x_1 \to \cdots \to x_L\]and \(\Sigma_i = I\), then at the equilibrium point with \(\frac{\partial F}{\partial \hat{\mu}_i} = 0\) and loss \(\mathcal{L} := \frac{1}{2} (T - \mu_L)^2\) for targets \(T\), the fixed points of the errors are \(\epsilon_i^* = \frac{\partial \mathcal{L}}{\partial \mu_i}\) if one chooses \(\hat{\mu}_L := T\). Additionally we have \(\frac{\partial F}{\partial \theta_i} = -\frac{\partial \mathcal{L}}{\partial \theta_i}\).
The notable observation here is that through purely local updates, one can converge to exactly what backpropagation computes (which requires global gradient information). Due to this locality, predictive coding is considered a biologically-plausible learning algorithm for the brain.
(repo https://github.com/r-gould/htm)
A description and implementation of the Hierarchical Temporal Memory (HTM) system, a self-supervised, non-probabilistic, non-gradient-based prediction algorithm inspired by the neocortex. The implementation is focused on next token prediction.
The approach rests on some observed properties of the neocortex:
Adaptability: different regions of the neocortex can substitute other regions
(Feeding visual input to the auditory cortex in a ferret, the auditory cortex learns to ‘see’, see here. Also, cases of people missing large parts of their brain, but still functioning, as if the remaining parts of the brain are able to rewire and compensate.)
An extension of these ideas is that the neocortex is built out of functionally identical cortical columns, with, for example, cortical columns in the visual cortex performing the same general algorithm as those in the auditory cortex. The only reason that the cortical columns in these regions perform different roles is because they receive different data, but ultimately, the same general algorithm is being applied.
Cortical columns consist of 6 layers, and the HTM system is a model of one of these (input) layers.
Components in the brain communicate via sparse distributed representations (SDRs), representing groups of neuron firings. An SDR is modelled as a binary vector \(x \in \{0, 1\}^n\) (for some \(n \in \mathbb{N}\)) with a sparse number of ‘on’ bits, meaning \(\sum_i x_i \ll n\). Define \(S(n, k) := \{x \in \{0, 1\}^n : \sum_i x_i = k\}\), i.e. the set of binary vectors with exactly \(k\) ‘on’ bits.
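A quick sketch of SDRs and their overlap (the usual similarity measure between them): with \(n = 2048\) and \(k = 40\) (about \(2\%\) sparsity), two random SDRs share almost no bits, so any substantial overlap is a strong signal:

```python
import numpy as np

def random_sdr(n, k, rng):
    """Sample x in S(n, k): a binary vector with exactly k 'on' bits."""
    x = np.zeros(n, dtype=np.uint8)
    x[rng.choice(n, size=k, replace=False)] = 1
    return x

def overlap(x, y):
    """Number of shared 'on' bits -- the basic SDR similarity measure."""
    return int(np.sum(x & y))

rng = np.random.default_rng(4)
n, k = 2048, 40          # ~2% sparsity
a = random_sdr(n, k, rng)
b = random_sdr(n, k, rng)
assert a.sum() == k and b.sum() == k
# Two independent random SDRs have expected overlap k*k/n, under 1 bit here,
# so a large overlap between SDRs is very unlikely to occur by chance.
assert overlap(a, b) < 10
```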
Encoders are models of sensory receptors, which convert real world data into SDRs to be used by components in the brain.
An SDR sparsity of \(2\%\) is often used, meaning that \(\frac{k}{n} \approx 0.02\).
(for understanding the HTM system generally, can skip this)
From the above, in the context of NLP, we want SDRs of, say, tokens across a vocabulary, to have overlapping ‘on’ bits when semantic information is shared.
One way of producing such semantic SDRs is through the use of self-organizing maps.
In the context of NLP, this involves:
Picking a tokenizer \(T: \text{str} \to [\text{int}]\) with vocabulary size denoted by \(V\). For a dataset of text, split the text into snippets (e.g. paragraphs, or chunks of variable length). For each snippet, tokenize it, and create a binary vector of length \(V\), with \(1\) at a token index if that token is present in the snippet, and \(0\) otherwise. Pass these binarized snippets into the BSOM algorithm.
Implemented the BSOM algorithm here, with training script here.
Now I describe the HTM system; a model of an input layer in a cortical column.
The layer is made up of a number, say \(C\), of minicolumns. When the layer receives an input SDR \(x^{(t)} \in S(n, k)\) (at time \(t\)), each minicolumn receives a fixed subset of the total input bits \(\{1, \ldots, n\}\). This is described through the notion of proximal synapses \(S(c) \subset \{1, \ldots, n\}\), with minicolumn \(c \in \{1, \ldots, C\}\) possessing connections to the input indices in \(S(c)\). Each proximal synapse \(s \in S(c)\) has a time-dependent permanence \(p_{c, s}^{(t-1)} \in [0, 1]\). Then the receptive synapses \(R^{(t-1)}(c) := \{s \in S(c) : p_{c, s}^{(t-1)} \geq p_{\text{thresh}}\} \subset S(c)\) are synapses with a large enough permanence to actually receive their input (synapses \(s \in S(c) \backslash R^{(t-1)}(c)\) are connected but do not actually receive the input as their permanence is not high enough, but during learning, the permanence can increase such that they become receptive).
Each minicolumn has a number, say \(N\), of neurons. Each neuron has a set of distal segments. The set of distal segments on a minicolumn (collected across all neurons in the minicolumn) is given by \(\mathcal{D}^{(t-1)}(c)\), and for \(d \in \mathcal{D}^{(t-1)}(c)\), the origin neuron from which the segment stems is given by \(\mathcal{N}(d) \in \{1, \ldots, NC\}\). The distal synapses on a distal segment \(d\) are given by \(\mathcal{S}^{(t-1)}(d)\), and similarly to proximal synapses, each distal synapse \(s \in \mathcal{S}^{(t-1)}(d)\) has a permanence \(q_{d, s}^{(t-1)} \in [0, 1]\) with receptive synapses \(\mathcal{R}^{(t-1)}(d) := \{s \in \mathcal{S}^{(t-1)}(d) : q_{d, s}^{(t-1)} \geq q_{\text{thresh}}\} \subset \mathcal{S}^{(t-1)}(d)\). Instead of each synapse connecting to a bit in the input, distal synapses connect to other neurons across the HTM layer. Specifically, each \(s \in \mathcal{S}^{(t-1)}(d)\) has an associated pre-synaptic neuron \(\mathcal{P}(s) \in \{1, \ldots, NC\}\) that it is connected to.
e.g. a neuron with 2 distal segments, each segment with 3 synapses (without including pre-synaptic neurons):
The HTM layer receives an input SDR \(x^{(t)}\), and results in a set of neurons either being in predictive state or not. If a neuron is in predictive state, then it is predicting that the minicolumn that it is in will activate when given the next input \(x^{(t+1)}\). The specifics of which neurons are firing within the minicolumn encodes the context based on previous inputs \(\{x^{(1)}, \ldots, x^{(t)}\}\).
The spatial pooling algorithm is applied first, and then the temporal memory algorithm, in order to obtain these predictions. The details are described later, and here, brief overviews are given as to what these algorithms do.
The spatial pooling takes an input SDR \(x^{(t)}\) and computes which minicolumns are active, based on how many receptive proximal synapses are active (the input index from \(x^{(t)}\) that they connect to is ‘on’, equal to 1). If this number of active synapses meets an overlap threshold \(o_{\text{thresh}}\), then the minicolumn is considered as active.
Then, global inhibition is performed, deactivating minicolumns that are not in the top \(M\) overlap scores across minicolumns. Then, one can pick \(M\) such that, say, at most \(2\%\) of minicolumns are active, giving sparsity.
This process returns a set of active minicolumns \(A^{(t)} \subset \{1, \ldots, C\}\).
For each active minicolumn in \(A^{(t)}\), a Hebbian learning process is performed on the permanences of the proximal synapses, to obtain new permanences \(\{p_{c, s}^{(t)}\}_{c, s}\).
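The overlap-and-inhibition step can be sketched as follows — a simplified version with random proximal synapses, no boosting, and the Hebbian permanence update omitted:

```python
import numpy as np

def spatial_pool(x, synapses, perms, p_thresh=0.5, o_thresh=2, M=40):
    """Compute the active minicolumns A^(t) for an input SDR x.
    synapses[c] holds the input indices S(c); perms[c] the permanences p_{c,s}."""
    C = len(synapses)
    scores = np.zeros(C)
    for c in range(C):
        receptive = perms[c] >= p_thresh             # receptive synapses R^(t-1)(c)
        scores[c] = x[synapses[c][receptive]].sum()  # count of active receptive synapses
    # Global inhibition: keep only the top-M overlap scores that meet o_thresh.
    top = np.argsort(scores)[::-1][:M]
    return {int(c) for c in top if scores[c] >= o_thresh}

rng = np.random.default_rng(5)
n, C, n_syn = 256, 100, 32
synapses = [rng.choice(n, size=n_syn, replace=False) for _ in range(C)]
perms = [rng.random(n_syn) for _ in range(C)]
x = np.zeros(n, dtype=np.uint8)
x[rng.choice(n, size=8, replace=False)] = 1          # a sparse input SDR
active = spatial_pool(x, synapses, perms, M=5)
assert len(active) <= 5
```

Choosing \(M\) as a small fraction of \(C\) is what enforces the output sparsity described above.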
The end of timestep \(t-1\) placed certain neurons in predictive state, meaning the minicolumns that contain such neurons are predicted to activate. We compare these predicted minicolumn activations with \(A^{(t)}\), and when there is an overlap, the synapses on distal segments that were correctly active are strengthened, and those that were not active are weakened (Hebbian process).
If a minicolumn in \(A^{(t)}\) was not predicted, then the minicolumn bursts, with all neurons in this minicolumn becoming active, and a Hebbian process is applied identically to above for a chosen learning segment (see details below).
For non-active minicolumns not in \(A^{(t)}\), synapses that matched (see details below) are punished.
Then, neurons are placed into predictive state for use in the next timestep \(t+1\).
(quite a few details have been skimmed over, see below for a full treatment)
(for implementation see here)
The set of active minicolumns \(A^{(t)}\) is computed by the following procedure:
We then perform a Hebbian learning process on the active columns only. Specifically,
If boosting is active, we also:
The goal of boosting is to maintain homeostasis of neuronal activity, encouraging a variety of minicolumns to activate across inputs rather than a small set of minicolumns dominating activity with most minicolumns remaining inactive for long periods.
(for implementation see here)
With the set of active columns \(A^{(t)}\) obtained by spatial pooling, we can now perform the temporal memory algorithm.
Information from the previous timestep \(t-1\):
The temporal memory algorithm then finds the information for the next timestep, i.e. \(\mathcal{N}^{(t)}_A\), \(\mathcal{N}^{(t)}_W\), \(\mathcal{D}^{(t)}_A\), \(\mathcal{D}^{(t)}_M\), all initially set to \(\{\}\), while learning from the input \(x^{(t)}\). Whether a neuron is in predictive state is determined by whether one of its segments is active (in \(\mathcal{D}^{(t-1)}_A\)).
This process can be split into three processes, involving: active columns, inactive columns and updating of segments. These processes are explained below.
(if a segment of minicolumn \(c\) is active, meaning the minicolumn was correctly predicted to be active)
In this case, we activate, and set to be a winner, any neurons that have any such active segments:
\[\mathcal{N}^{(t)}_A := \mathcal{N}^{(t)}_A \cup \{\mathcal{N}(d) : d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A\}\] \[\mathcal{N}^{(t)}_W := \mathcal{N}^{(t)}_W \cup \{\mathcal{N}(d) : d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_A\}\]This corresponds to a neuron correctly being in predictive state at step \(t-1\), and hence becoming active at step \(t\).
Now, if in learning mode, a Hebbian process is applied to active distal segments. Additionally, we strengthen the pattern matching capability of the active segments by growing up to \(S_{\text{sample}}\) synapses to the previous winning neurons from \(\mathcal{N}^{(t-1)}_W\) (whose activations can be utilized to make next step predictions).
and synapse growth:
In this case there are no new segments formed, therefore \(\mathcal{D}^{(t)}(c) := \mathcal{D}^{(t-1)}(c)\).
(no segments of minicolumn \(c\) are active, meaning no neuron was in predictive state, when it should have been)
In this case the column bursts. This means that all neurons within this minicolumn are activated (as opposed to only activating neurons with activated segments, as above):
\[\mathcal{N}^{(t)}_A := \mathcal{N}^{(t)}_A \cup \{\text{all neurons of minicolumn } c\}\]After bursting, we wish to correct this for the future, such that the column will not burst again and instead will predict correctly. To do this, we pick the maximally matching segment, or, if there are no matching segments, a new segment grown on the neuron with the fewest segments, and apply a Hebbian learning rule on this chosen segment.
If \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M| > 0\) (if a segment in minicolumn \(c\) is matching), then:
which is the maximally matching segment in minicolumn \(c\).
Otherwise, if \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M| = 0\) (no matching segments in minicolumn \(c\)), then:
breaking any ties randomly.
This last step gives updated segments \(\mathcal{D}^{(t)}(c) := \mathcal{D}^{(t-1)}(c) \cup \{d_{\text{new}}\}\).
Now add \(w\) to the winner neurons:
\[\mathcal{N}^{(t)}_W := \mathcal{N}^{(t)}_W \cup \{w\}\]If in learning mode, then a Hebbian learning rule and synapse growth is applied to the learning segment, identically to the other case (though here, just on the learning segment, not across many segments).
If not in learning mode, nothing is done in this case. If in learning mode:
The idea is to punish segments that did activate, as they made an incorrect prediction since the minicolumn \(c\) did not activate, so only the negative update in the Hebbian learning rule is performed.
If there are matching segments, meaning \(|\mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M| > 0\), then for each matching segment \(d \in \mathcal{D}^{(t-1)}(c) \cap \mathcal{D}^{(t-1)}_M\):
No new segments are grown, so \(\mathcal{D}^{(t)}(c) := \mathcal{D}^{(t-1)}(c)\), and no new synapses are grown, so \(\mathcal{S}^{(t)}(d) := \mathcal{S}^{(t-1)}(d) \; \forall \; d \in \mathcal{D}^{(t)}(c)\).
The above has determined new neuron information \(\mathcal{N}^{(t)}_A\) and \(\mathcal{N}^{(t)}_W\) alongside, for each \(d \in \mathcal{D}^{(t)}(c)\), a set of new synapses \(\mathcal{S}^{(t)}(d)\) and receptive synapses \(\mathcal{R}^{(t)}(d)\) based on new permanences \(\{q^{(t)}_{d, s}\}_{s \in \mathcal{S}^{(t)}(d)}\).
We can now determine which segments are active and matching: \(\mathcal{D}^{(t)}_A\) and \(\mathcal{D}^{(t)}_M\). This involves a simple check over segments.
For each segment \(d \in \mathcal{D}^{(t)}(c)\):
with \(M_{\text{thresh}}, A_{\text{thresh}} \in \mathbb{N}\) thresholds for matching and activating a segment respectively.
A primary asset of machine learning is the artificial neuron: a very idealized, non-spiking model of a neuron in the brain. What does a more realistic model of neurons and their spiking behaviour look like?
The spiking network is modelled as a directed graph \(G = (V, E)\), where nodes \(V\) represent neurons and edges \(E\) represent synapses between neurons. For \(n\) neurons, \(V := \{1, \ldots, n\}\).
The most common spiking neural models are Integrate-and-Fire (IF) and Leaky-Integrate-and-Fire (LIF).
For a neuron \(i \in V\), let \(u_i = u_i(t)\) model the state variable, which represents the neural membrane potential of neuron \(i\). Then under the LIF model it evolves as
\[C \frac{du_i}{dt}(t) = -\frac{1}{R}(u_i(t) - u_{\text{rest}}) + (I_0(t) + \sum_{j \in \mathcal{I}(i)} w_{ji} I_j(t))\]where \(\{w_{ji}\}_{j \in \mathcal{I}(i)}\) are learnable synaptic weights, \(I_0 = I_0(t)\) is the external current driving the neural state, \(I_j = I_j(t)\) is the input current from neuron \(j \in \mathcal{I}(i)\), \(u_{\text{rest}}\) is the rest potential, \(C\) is the membrane capacitance, and \(R\) is the input resistance.
A brief derivation of this model from electromagnetic principles can be found in the appendix.
Whenever the membrane potential \(u_i\) for neuron \(i \in V\) reaches the firing threshold \(\upsilon\), say at times \(T_i \subset \mathbb{R}\), the neuron spikes, which is modelled by its corresponding spike train \(S_i(t) := \sum_{t' \in T_i} \delta(t - t')\). Immediately after the spike, the neural state is reset to the rest potential \(u_{\text{rest}} < \upsilon\) and held at that level for the time interval representing the neural absolute refractory period.
How are the incoming currents \(\{I_j\}_{j \in \mathcal{I}(i)}\) modelled? Each incoming neuron has their own spike trains \(\{S_j\}_{j \in \mathcal{I}(i)}\) with corresponding spike times \(\{T_j \subset \mathbb{R}\}_{j \in \mathcal{I}(i)}\). Then for each \(j \in \mathcal{I}(i)\),
\[I_j(t) := \int_{0}^{\infty} S_j(t-s) \exp(-s/\tau) ds = \sum_{t' \in T_j, t' \leq t} \exp(-(t-t')/\tau)\]where \(\tau\) is the synaptic time constant. This can be thought of as accumulating all spikes before time \(t\), weighting more recent spikes more strongly than older ones.
Importantly, the received current \(I_j\) only depends on the spikes of this \(j\)th presynaptic neuron, and does not depend on other properties of the potential. The postsynaptic neuron only knows about spikes of its input neurons, and not anything more specific about their potential changes. Incoming potentials below the firing threshold are ignored.
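The exponential accumulation above can be computed directly from the presynaptic spike times; a minimal sketch (the function name and default value of \(\tau\) are assumptions):

```python
import math

def synaptic_current(t, spike_times, tau=5e-3):
    """Current received from one presynaptic neuron at time t.

    Implements I_j(t) = sum over past spikes t' <= t of exp(-(t - t')/tau):
    each incoming spike injects a unit of current that decays exponentially,
    so recent spikes contribute more than old ones.
    """
    return sum(math.exp(-(t - tp) / tau) for tp in spike_times if tp <= t)
```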
We could update discretely using \(u_i(t + \Delta t) \approx u_i(t) + \Delta t \frac{du_i}{dt}(t)\); after each such update we would check for neurons whose potential has crossed the threshold, reset them to the rest potential, and put them in refractory mode until the refractory period is over.
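A minimal sketch of this discrete update scheme for a single LIF neuron, using forward Euler (the parameter values are illustrative assumptions, not taken from the text):

```python
import numpy as np

def simulate_lif(I_ext, dt=1e-4, C=1e-9, R=1e7, u_rest=-0.07,
                 threshold=-0.05, t_ref=2e-3):
    """Forward-Euler simulation of a single LIF neuron.

    I_ext: array of total input current (external + synaptic) per timestep.
    Returns the membrane trace and spike times.
    """
    u = np.full(len(I_ext), u_rest)
    spikes = []
    ref_until = -np.inf  # end of the current refractory period
    for k in range(1, len(I_ext)):
        t = k * dt
        if t < ref_until:
            u[k] = u_rest            # held at rest while refractory
            continue
        du = (-(u[k-1] - u_rest) / R + I_ext[k-1]) * dt / C
        u[k] = u[k-1] + du
        if u[k] >= threshold:        # spike: record, reset, enter refractory
            spikes.append(t)
            u[k] = u_rest
            ref_until = t + t_ref
    return u, spikes
```

With a constant suprathreshold drive, the neuron charges up, fires, resets, and repeats, producing a regular spike train.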
Say we have data \(x \in \mathbb{R}^n\) with \(x_i \in [0, 1]\). How do we convert this information into binary spikes across time analogously to how our sensory receptors do?
One method is rate encoding, where each \(x_i\) is encoded as a sequence of spikes whose frequency is a function of the intensity \(x_i\), so that the frequency carries the data.
A specific example of rate encoding is Poisson rate encoding:
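A minimal sketch of Poisson rate encoding, under the common Bernoulli approximation in which each timestep of length \(\Delta t\) emits a spike with probability \(r_{\max} x_i \Delta t\) (the rate cap \(r_{\max}\) and \(\Delta t\) are illustrative assumptions):

```python
import numpy as np

def poisson_rate_encode(x, n_steps, r_max=100.0, dt=1e-3, rng=None):
    """Encode intensities x in [0, 1] as binary spike trains.

    Each x_i sets a firing rate r_i = r_max * x_i; at every timestep a spike
    is drawn with probability r_i * dt (a Bernoulli approximation of a
    Poisson process). Returns an (n_steps, len(x)) binary array.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    p_spike = np.clip(r_max * x * dt, 0.0, 1.0)   # per-step spike probability
    return (rng.random((n_steps, x.size)) < p_spike).astype(np.int8)
```

Over many timesteps, the empirical spike frequency of channel \(i\) approaches \(r_{\max} x_i\), so the intensity can be read back off from the rate.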
There are other methods of encoding, such as temporal coding schemes that encode information in precise timing of spikes as opposed to frequencies.
Within the network there would be a set of neurons that have no incoming synapses, and these would act as input neurons, where we would provide these neurons directly with the raw data encoded as spikes. Then, other neurons would receive the spike information of this raw data via incoming currents.
STDP is based on the observed phenomenon of synapses strengthening when the pre-synaptic neuron spikes shortly before the post-synaptic neuron, and weakening when the post-synaptic neuron fires before the pre-synaptic neuron. The smaller the time difference between the two spikes, the stronger the synaptic update.
A common STDP rule says that when a neuron \(i\) spikes at say time \(t\), we should:
For better stability and noise reduction, the updates can be accumulated until, say, \(w_{ij}\) has 5 pending updates, which are then summed and applied. The updates can also be applied in a fixed time window, such as every 5 ms, as opposed to being performed immediately.
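A sketch of a standard pair-based STDP rule of this kind, with exponentially decaying windows (the exact window shape and constants are assumptions, not taken from the text):

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20e-3):
    """Pair-based STDP weight change for one pre/post spike pair.

    If the pre-synaptic spike precedes the post-synaptic spike (t_pre < t_post)
    the synapse is potentiated; otherwise it is depressed. The magnitude decays
    exponentially with the spike-time difference, so closer pairs update more.
    """
    dt = t_post - t_pre
    if dt > 0:       # pre before post: strengthen
        return w + a_plus * np.exp(-dt / tau)
    elif dt < 0:     # post before pre: weaken
        return w - a_minus * np.exp(dt / tau)
    return w
```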
The key idea is to model the intracellular and extracellular regions of the neuron as conductors, with the neuron membrane acting as the space between them.
In addition, the charges of the regions are opposite in sign and approximately equal in magnitude, say \(\pm Q\). Hence together they form a capacitor.
The capacitance of two conductors carrying charges \(\pm Q\) with a voltage \(V\) between them is \(C := \frac{Q}{V}\). Differentiating in time (with \(C\) constant) gives
\[C \frac{dV}{dt} = \frac{dQ}{dt}\]For a volume \(\mathcal{V}\), the charge is defined as
\[Q := \int_{\mathcal{V}} \rho \; d\mathcal{V}\]where \(\rho = \rho(\mathbf{x}, t)\) is the electric charge density.
The current describes the flow of charge. Specifically, the total current flowing into the volume is defined by
\[I := -\int_{\partial \mathcal{V}} \mathbf{J} \cdot d\mathbf{S}\]where \(\mathbf{J} = \mathbf{J}(\mathbf{x}, t)\) is the current density: the flux of electric charge per unit area.
\(\rho\) and \(\mathbf{J}\) are sources, and are irreducible. The sources define a problem in electromagnetism. They can be thought of as hyperparameters of the system that we must provide.
The Maxwell equations then describe how the fields \(\mathbf{E}\) and \(\mathbf{B}\) evolve due to these sources. The Lorentz force law describes how the charges, described by the sources, move due to the fields.
Charge conservation means that the charge in a volume can only change due to charge flowing into/out of the volume. This means that
\[\frac{dQ}{dt} = I\]hence the capacitance equation becomes
\[C \frac{dV}{dt} = I\]where \(I\) is the current flowing into the neuron.
The above represents the IF model, but in reality, there is a leak of ions through the neuron membrane, causing a current to flow out of the neuron. Adding in the term corresponding to this effect leads to the LIF model:
\[C \frac{dV}{dt}(t) = -\frac{1}{R}(V(t) - V_{\text{rest}}) + I(t)\]and we can further expand \(I(t)\) into the currents incoming via synapses to get the desired expression.
Consider an agent embedded in an environment, with the agent described by an internal state, and the environment by an external state, undergoing the cycle:
generating the random sequence
\[(S_1, O_1, H_1, A_1, S_{2}, \ldots)\]which can be described by the following diagram, with the environment and agent only interacting with each other via observations and actions
For simplicity, assume the action and observation distributions \(p(a\mid h)\) and \(p(o\mid s)\) are deterministic, with \(a_{\tau} = a_{\tau}(h_{\tau})\) and \(o_{\tau} = o_{\tau}(s_{\tau})\) (though with \(p(s'\mid s, a)\) and \(p(h'\mid h, o)\) stochastic).
Denote the environment state dynamics by \(\mu = \mu(s_{\tau}\mid s_{\tau-1}, a_{\tau-1}) \equiv p(s_{\tau}\mid s_{\tau-1}, a_{\tau-1})\), and the agent state dynamics (e.g. how its parameters change) by \(q = q(h_{\tau}\mid h_{\tau-1}, o_{\tau}) \equiv p(h_{\tau}\mid h_{\tau-1}, o_{\tau})\) for clarity.
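The cycle can be sketched as a rollout loop, with callables standing in for \(\mu\), \(q\), and the deterministic maps \(a(h)\), \(o(s)\) (all of the names here are assumptions for illustration):

```python
def run_episode(env_step, agent_update, action, observe, s0, h0, T):
    """One rollout of the agent-environment cycle S -> O -> H -> A -> S'.

    env_step(s, a) samples s' ~ mu(.|s, a); agent_update(h, o) samples
    h' ~ q(.|h, o); observe(s) and action(h) are the deterministic maps
    o(s) and a(h). Returns the generated (S, O, H, A) sequence.
    """
    s, h = s0, h0
    trace = []
    for _ in range(T):
        o = observe(s)          # O_tau = o(S_tau)
        h = agent_update(h, o)  # H_tau ~ q(.|H_{tau-1}, O_tau)
        a = action(h)           # A_tau = a(H_tau)
        trace.append((s, o, h, a))
        s = env_step(s, a)      # S_{tau+1} ~ mu(.|S_tau, A_tau)
    return trace
```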
At timestep \(t\) in environment \(\mu\), after experiencing the past \(\omega_t = (o_{0:t}, h_{0:t-1})\), the agent \(q\) has a preference for the future described by
\[P_t[q] := \mathbb{E}_{q}^{\mu}[\phi^t(O_{t+1}, O_{t+2}, \ldots)\mid\Omega_t=\omega_t]\]for some \(\phi^{t}\).
We will take the agent’s internal state to decompose as \(h_{\tau} = (\theta_{\tau}, \pi_{\tau})\), with \(\theta_{\tau}\) corresponding to a parameter that updates (e.g. by gradient descent) under observation.
In this case we have \(q(h_{\tau}\mid h_{\tau-1}, o_{\tau}) = q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) q(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau})\) where one can interpret \(q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau})\) as a learning rule for the agent; how it changes its parameters under observation in order to achieve its preference. One could interpret \(q(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau})\) as the parameterized architecture used by the agent.
In the parameterized view, with \(h_{\tau} = (\theta_{\tau}, \pi_{\tau})\), what is a good choice for the learning rule \(q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau})\)?
A common choice for the learning rule is gradient descent, with \(q = q_{\text{grad}}\), where
\[q_{\text{grad}}(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) := \delta\left(\theta_{\tau} - \left(\theta_{\tau-1} - \alpha \nabla_{\theta} P_t[q_{\theta}^{\tau}]\mid _{\theta = \theta_{\tau-1}}\right)\right)\]with \(\alpha \in \mathbb{R}_{+}\) the learning rate, where \(q_{\theta}^{\tau}\) is defined such that the learning rule is turned off from timestep \(\tau\), with the parameter fixed to \(\theta\). Specifically, \(q_{\theta}^{t}\) is defined by
\[q_{\theta}^{t}(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) := \begin{cases} q(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}) & \tau < t, \\ \delta(\theta_{\tau}-\theta) & \tau \geq t \end{cases}\]It turns out that the gradient term used in this update rule, \(\nabla_{\theta} P_{t}[q_{\theta}^{\tau}]\), has a nice closed-form expression. We can write
\[\begin{align*} \mathbb{E}_{q_{\theta}^{t}}^{\mu}[\; \cdot\mid\Omega_t = \omega_t] &= \int \cdot \; p_{\theta}^{t}(s_{0:\infty}, o_{t+1:\infty}, h_{t:\infty}, a_{0:\infty}\mid o_{0:t}, h_{0:t-1}) \, ds_{0:\infty} \, do_{t+1:\infty} \, dh_{t:\infty} \, da_{0:\infty}\\ &= \int \cdot \; \left[\prod_{\tau=t}^{\infty} q_{\theta}^{t}(h_{\tau}\mid h_{\tau-1}, o_{\tau}(s_{\tau})) dh_{\tau}\right] [\cdots]\\ &= \int \cdot \; \left[\prod_{\tau=t}^{\infty} q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) d\pi_{\tau}\right] [\cdots] \end{align*}\]where \([\cdots]\) corresponds to factors independent of \(\theta\), and defining \(q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) := q(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau})\mid _{\theta_{\tau} = \theta}\).
Using the fact that
\[\nabla_{\theta} q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) = q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\]one can then show, for inputs independent of \(\theta\), that
\[\nabla_{\theta} \mathbb{E}_{q_{\theta}^{t}}^{\mu}[\; \cdot\mid\Omega_t = \omega_t] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\; \cdot \; \sum_{\tau=t}^{\infty} \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]It then follows that the gradient of the preference is
\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \phi^t(O_{t+1}, O_{t+2}, \ldots) \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]reminiscent of the policy gradient theorem from RL.
We could use a Monte Carlo estimator for this gradient after collecting trajectories \(\{(o_{\tau}^{(n)}, h_{\tau}^{(n)})_{\tau}\}_{n=1}^{N}\), i.e.
\[\hat{G}_{MC}^{t}(\theta) := \frac{1}{N}\sum_{n=1}^{N} \sum_{\tau=t}^{\infty} \phi^t(o^{(n)}_{t+1}, o^{(n)}_{t+2}, \ldots) \nabla_{\theta} \log q(\pi^{(n)}_{\tau}\mid \pi^{(n)}_{\tau-1}, o^{(n)}_{\tau}; \theta)\]However, we can lower the variance of this estimator by using a baseline.
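A sketch of this Monte Carlo estimator, with the preference \(\phi^t\) and the score function \(\nabla_\theta \log q\) supplied as callables (their concrete forms, and the truncation to finite trajectories, are assumptions):

```python
import numpy as np

def mc_preference_gradient(trajectories, phi, grad_log_q, t):
    """Monte Carlo estimator G_hat of the preference gradient (no baseline).

    trajectories: list of (obs, states) pairs with obs[tau] = o_tau^{(n)} and
    states[tau] = pi_tau^{(n)}. phi maps the future observations o_{t+1:} to
    the scalar preference phi^t, and grad_log_q(pi, pi_prev, o) returns
    grad_theta log q(pi | pi_prev, o; theta). t >= 1 is assumed so that
    states[t-1] is recorded.
    """
    grads = []
    for obs, states in trajectories:
        ret = phi(obs[t + 1:])                         # phi^t(o_{t+1}, o_{t+2}, ...)
        g = sum(ret * grad_log_q(states[tau], states[tau - 1], obs[tau])
                for tau in range(t, len(states)))
        grads.append(g)
    return np.mean(grads, axis=0)                      # average over the N trajectories
```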
Reducing variance with a baseline:
One can show that
\[\mathbb{E}_{q_{\theta}^{t}}^{\mu}[b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi) \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)] = 0\]for any baseline \(b^{t}_{\tau} = b^{t}_{\tau}(\pi_{\tau-1}, o_{\tau}; \phi)\), which gives us another expression for the preference gradient:
\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[\phi^t(O_{t+1}, O_{t+2}, \ldots) - b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]The natural choice for the baseline would be for it to estimate \(\phi^{t}\) in some respect; however, this would require predicting information about both the past and the future. We can restrict the baseline’s duty to just predicting the future if we assume that the preference takes a summable form:
\[\phi^{t}(O_{t+1}, O_{t+2}, \ldots) = \sum_{\tau=t+1}^{\infty} \phi_{\tau}^{t}(O_{\tau})\]for some \(\{\phi_{\tau}^{t}\}_{\tau}\). This form is assumed for the preference unless stated otherwise. For \(\tau \geq t, \tau' \geq t+1\), it can be shown that
\[\begin{align*} &\mathbb{E}_{q_{\theta}^{t}}^{\mu}[\phi_{\tau'}^{t}(O_{\tau'}) \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_{t}=\omega_{t}]\\ &= \int p_{\theta}^{t}(\theta_{\tau}, \theta_{\tau-1}, \pi_{\tau}, \pi_{\tau-1}, o_{\tau}, o_{\tau'}\mid \omega_{t}) \phi_{\tau'}^{t}(o_{\tau'}) \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\\ &= \int p_{\theta}^{t}(\theta_{\tau}\mid \theta_{\tau-1}, \pi_{\tau-1}, o_{\tau}, o_{\tau'}) p_{\theta}^{t}(\pi_{\tau}\mid \theta_{\tau}, \pi_{\tau-1}, o_{\tau}, o_{\tau'}) [\cdots] \phi_{\tau'}^{t}(o_{\tau'}) \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta)\\ &= \int \phi_{\tau'}^{t}(o_{\tau'}) q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) [\cdots] \nabla_{\theta} \log q(\pi_{\tau}\mid \pi_{\tau-1}, o_{\tau}; \theta) \; \text{for} \; \tau' \leq \tau\\ &= 0 \; \text{for} \; \tau' \leq \tau \end{align*}\]therefore the preference gradient can be written
\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[V^{t}_{\tau} - b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]defining the value at timestep \(t\) from \(\tau \geq t\)
\[V_{\tau}^{t} := \sum_{\tau'=\tau+1}^{\infty} \phi_{\tau'}^{t}(O_{\tau'})\]Then to best minimise the variance of the corresponding Monte Carlo estimator, we want \(b^{t}_{\tau}\) to estimate \(V^{t}_{\tau}\). Denoting \(b^{t}_{\tau} \equiv \hat{V}^{t}_{\tau}\) for clarity, we want
\[\hat{V}^{t}_{\tau}(\pi_{\tau-1}, o_{\tau}; \phi) \approx v_{\tau}^{t}\]with \(v_{\tau}^{t}\) the realized value, which can be enforced by optimizing \(\phi\) via supervised learning using collected trajectories. Such a baseline is called a value baseline.
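As a sketch, the supervised fit of the value baseline could be a linear regression of realized values on features of \((\pi_{\tau-1}, o_{\tau})\) (the linear form and the feature map are assumptions; any regressor would do):

```python
import numpy as np

def fit_value_baseline(features, returns):
    """Fit a linear value baseline V_hat(pi_{tau-1}, o_tau; phi) = features @ phi.

    features: (M, d) array, one feature row per (state, observation) pair;
    returns:  (M,) array of realized values v^t_tau from collected trajectories.
    Linear least squares stands in for the supervised-learning step.
    """
    phi, *_ = np.linalg.lstsq(features, returns, rcond=None)
    return phi

def baseline(features, phi):
    """Evaluate the fitted baseline at new (state, observation) features."""
    return features @ phi
```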
Overall we have preference gradient
\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[V^{t}_{\tau} - \hat{V}^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\]Advantage-based estimators:
Ignoring the baseline for a moment, see that
\[\begin{align*} \nabla_{\theta} P_t[q_{\theta}^{t}] &= \sum_{\tau=t}^{\infty} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau} \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t\right]\\ &= \sum_{\tau=t}^{\infty} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau} \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\mid\Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\right]\\ &= \sum_{\tau=t}^{\infty} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta) \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau}\mid\Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\right]\\ &=: \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] \; \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\right] \end{align*}\]defining the action-value at timestep \(t\) from \(\tau \geq t\)
\[\begin{align*} Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] &:= \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau}\mid\Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\\ &\equiv \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[V^{t}_{\tau}\mid O_{0:\tau} = (o_{0:t}, O_{t+1:\tau}), H_{0:\tau} = (h_{0:t-1}, H_{t:\tau})\right] \end{align*}\]i.e. the expected value in the future from \(\tau \geq t\), conditioned on all past observed information from \(\tau\) and the hidden state at \(\tau\).
Then, including the baseline, we have
\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} \left[Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] - b^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\right] \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\right]\]Typical advantage based estimators choose a value baseline as seen in the previous section, with \(b^{t}_{\tau} = \hat{V}^{t}_{\tau}\), giving preference gradient
\[\nabla_{\theta} P_t[q_{\theta}^{t}] = \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\sum_{\tau=t}^{\infty} A_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}; \phi] \; \nabla_{\theta} \log q(\Pi_{\tau}\mid \Pi_{\tau-1}, O_{\tau}; \theta)\right]\]defining the advantage.
\[A_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}; \phi] := Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] - \hat{V}^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi)\]which is the value gain due to state \(\Pi_{\tau}\) (compared to average, computed by \(\hat{V}_{\tau}^{t}\)).
How can we estimate the advantage? Note that
\[V_{\tau}^{t} =\phi^{t}_{\tau+1}(O_{\tau+1}) + \cdots + \phi^{t}_{\tau+K}(O_{\tau+K}) + V_{\tau+K}^{t}\]for any \(K = 1, 2, \ldots\), and using
\[\mathbb{E}_{q_{\theta}^{t}}^{\mu}[V^{t}_{\tau+K}] \approx \mathbb{E}_{q_{\theta}^{t}}^{\mu}[\hat{V}^{t}_{\tau+K}(\Pi_{\tau+K-1}, O_{\tau+K}; \phi)]\]we can bootstrap \(Q_{\tau}^{t}\) by \(K\) steps
\[\begin{align*} Q_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}] \approx \; &\mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\hat{V}^{t}_{\tau+K}(\Pi_{\tau+K-1}, O_{\tau+K}; \phi)\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\\ &+ \sum_{\tau'=\tau+1}^{\tau+K} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\phi_{\tau'}^{t}(O_{\tau'})\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right] \end{align*}\]which gives advantage
\[\begin{align*} A_{\tau}^{t}[q_{\theta}^{t}\mid\Omega_{\tau}, \Pi_{\tau}; \phi] \approx \; &- \hat{V}^{t}_{\tau}(\Pi_{\tau-1}, O_{\tau}; \phi) + \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\hat{V}^{t}_{\tau+K}(\Pi_{\tau+K-1}, O_{\tau+K}; \phi)\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right]\\ &+ \sum_{\tau'=\tau+1}^{\tau+K} \mathbb{E}_{q_{\theta}^{t}}^{\mu}\left[\phi_{\tau'}^{t}(O_{\tau'})\mid \Omega_t = \omega_t, \Omega_{\tau}, \Pi_{\tau}\right] \end{align*}\]Then after observing trajectories \(\{(o_{\tau}^{(n)}, h_{\tau}^{(n)})_{\tau}\}_{n=1}^{N}\), we have Monte Carlo estimator
\[\begin{align*} \hat{G}^{t}_{MC}(\theta) &:= \frac{1}{N}\sum_{n=1}^{N} \sum_{\tau=t}^{\infty} a_{\tau}^{t}(o_{\tau:\tau+K}^{(n)}, \pi_{\tau-1}^{(n)}, \pi_{\tau+K-1}^{(n)}; \phi) \nabla_{\theta} \log q(\pi^{(n)}_{\tau}\mid \pi^{(n)}_{\tau-1}, o^{(n)}_{\tau}; \theta) \end{align*}\]with observed advantages
\[a_{\tau}^{t}(o_{\tau:\tau+K}, \pi_{\tau-1}, \pi_{\tau+K-1}; \phi) := -\hat{V}_{\tau}^{t}(\pi_{\tau-1}, o_{\tau}; \phi) + \hat{V}_{\tau+K}^{t}(\pi_{\tau+K-1}, o_{\tau+K}; \phi) + \sum_{\tau'=\tau+1}^{\tau+K} \phi_{\tau'}^{t}(o_{\tau'})\]which is equivalent to gradient descent on the loss function \(L^t = L^t(\theta)\) defined as
\[L^t(\theta) := \frac{1}{N}\sum_{n=1}^{N} \sum_{\tau=t}^{\infty} a_{\tau}^{t}(o_{\tau:\tau+K}^{(n)}, \pi_{\tau-1}^{(n)}, \pi_{\tau+K-1}^{(n)}; \phi) \log q(\pi^{(n)}_{\tau}\mid \pi^{(n)}_{\tau-1}, o^{(n)}_{\tau}; \theta)\]Advantage-based methods work very well in practice in reinforcement learning (PPO is an example of such a method).
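The observed advantage \(a^t_\tau\) above can be computed directly from per-timestep preferences and baseline values; a sketch, assuming both are given as arrays (the array names are assumptions):

```python
import numpy as np

def k_step_advantage(rewards, values, k):
    """K-step bootstrapped advantage estimates.

    rewards[tau] plays the role of phi^t_tau(o_tau), and values[tau] the role
    of the evaluated baseline V_hat^t_tau(pi_{tau-1}, o_tau; phi). For each
    tau, a_tau = -V_hat_tau + V_hat_{tau+k} + sum of the next k rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(values) - k                       # last tau with a full k-step window
    adv = np.empty(T)
    for tau in range(T):
        adv[tau] = (-values[tau] + values[tau + k]
                    + rewards[tau + 1: tau + k + 1].sum())
    return adv
```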
We now discuss the approach of an agent using tree search methods, guided by learned predictive models of the environment, in order to choose actions effectively. For simplicity assume a discrete finite action space of \(A\) actions.
At time \(t\) the agent has observed \(o_{0:t}\), say with parameter \(\theta\) parameterizing the predictive distributions
\[h_{\theta} = h_{\theta}(o_{0:t}) =: z^t_{0} \in \mathbb{R}^{d}\] \[f_{\theta} = f_{\theta}(z^t_{k}) =: (\mathbf{p}^t_{k}, v^{t:t+k}_{t+k})\] \[g_{\theta} = g_{\theta}(z^t_{k-1}, i) = (\psi^{t:t+k-1}_{t+k}, z^{t}_{k, i})\]with
\[\psi^{\tau}_{t+k} \approx \mathbb{E}_{q}^{\mu}[\phi^{\tau}_{t+k}(O_{t+k})\mid\Omega_{t} = \omega_{t}]\] \[v^{\tau}_{t+k} \approx \mathbb{E}_{q}^{\mu}[V^{\tau}_{t+k}\mid\Omega_{t} = \omega_t]\]enforced via supervised learning (discussed more below).
The tree-search agent \(q = q_{\text{Search}}\) produces its action \(a_{t}\) by the following algorithm
TODO
At the base level, with no meta-learning, denote the hidden state \(h_{\tau} = h_{\tau}^{(0)}\) with dynamics \(q = q^{(0)} = q^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau})\). For example, a parameterized agent has \(h_{\tau}^{(0)} = (\theta_{\tau}, \pi_{\tau})\).
Consider a level higher, with meta-learning dynamics \(q = q^{(1)}\) with state \(h_{\tau} = h_{\tau}^{(1)} := (h_{\tau}^{(0)}, q_{\tau}^{(0)})\) and defined such that
\[q = q^{(1)}(h_{\tau}^{(0)}\mid q_{\tau}^{(0)}, h_{\tau-1}^{(0)}, o_{\tau}) = q_{\tau}^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau})\]I.e. the hidden state consists of the base state \(h_{\tau}^{(0)}\) and the update rule \(q_{\tau}^{(0)}\) for this base state, and \(q^{(1)} = q^{(1)}(h_{\tau}\mid h_{\tau-1}, o_{\tau})\) governs how both this base state and its update rule are updated, allowing for meta-learning of the learning rule and architecture itself, with dynamics
\[\begin{align*} q(h_{\tau}\mid h_{\tau-1}, o_{\tau}) &= q^{(1)}(h_{\tau}^{(1)}\mid h_{\tau-1}^{(1)}, o_{\tau})\\ &= q^{(1)}(h_{\tau}^{(0)}, q_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, q_{\tau-1}^{(0)}, o_{\tau})\\ &= q^{(1)}(q_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, q_{\tau-1}^{(0)}, o_{\tau}) \; q_{\tau}^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau}) \end{align*}\]Consider more generally \(\text{meta}^{K}\)-learning dynamics, for some \(K = 1, 2, \ldots\). This corresponds to dynamics governed by \(q = q^{(K)}\), with \(q^{(K)} = q^{(K)}(h_{\tau}^{(K)}\mid h_{\tau-1}^{(K)}, o_{\tau})\) with hidden state
\[h_{\tau} = h_{\tau}^{(K)} := (h_{\tau}^{(K-1)}, q_{\tau}^{(K-1)}) = \cdots = (h_{\tau}^{(0)}, q_{\tau}^{(0)}, \ldots, q_{\tau}^{(K-1)})\]The hidden state consists of \(K\) distributions \(\{q_{\tau}^{(k)}\}_{k=0}^{K-1}\), defined such that
\[q_{\tau}^{(k)}(h_{\tau}^{(k-1)}\mid q_{\tau}^{(k-1)}, h_{\tau-1}^{(k-1)}, o_{\tau}) = q_{\tau}^{(k-1)}(h_{\tau}^{(k-1)}\mid h_{\tau-1}^{(k-1)}, o_{\tau})\]This gives overall \(\text{meta}^{K}\) dynamics
\[q = q^{(K)}(h_{\tau}^{(K)}\mid h_{\tau-1}^{(K)}, o_{\tau}) = q_{\tau}^{(0)}(h_{\tau}^{(0)}\mid h_{\tau-1}^{(0)}, o_{\tau}) \left[\prod_{k=1}^{K} q_{\tau}^{(k)}(q_{\tau}^{(k-1)}\mid q_{\tau-1}^{(k-1)}, h_{\tau-1}^{(k-1)}, o_{\tau})\right]\]TODO: discuss popular meta learning dynamics (MAML, etc.)
The agent does not have access to \(p(o_{\tau+1}\mid o_{0:\tau}, h_{0:\tau})\), but let's say it has a model of this distribution, \(\hat{p} = \hat{p}(o_{\tau+1}\mid o_{0:\tau}, h_{0:\tau})\) (observations and hidden states are accessible by the agent, hence one can imagine learning such a model via supervised learning).
Define the preference under this predictive model
\[\hat{P}_{t}[q, \hat{p}] := \int \phi^t(o_{t+1:\infty}) \left[\prod_{\tau=t}^{\infty} q(h_{\tau}\mid o_{\tau}, h_{\tau-1}) \hat{p}(o_{\tau+1}\mid o_{0:\tau}, a_{0:\tau}(h_{0:\tau})) dh_{\tau} do_{\tau+1}\right]\]The AIXI model chooses \(\hat{p}\) to be the Solomonoff prior. This can be described by \(\hat{p} = \hat{p}_{S}\) with
\[\hat{p}_{S}(o_{1:\tau+1}\mid o_{0}, a_{0:\tau}) := \sum_{\substack{\sigma\\U(\sigma, o_0, a_{0:\tau}) = o_{1:\tau+1}}} 2^{-\ell(\sigma)}\]for some universal Turing machine \(U\), summing over programs \(\sigma\) and weighting based on program length \(\ell(\sigma)\). Under this prior we have
\[\begin{align*} \hat{p}_{S}(o_{\tau+1}\mid o_{0:\tau}, a_{0:\tau}) = \frac{\hat{p}_{S}(o_{1:\tau+1}\mid o_0, a_{0:\tau})}{\hat{p}_{S}(o_{1:\tau}\mid o_0, a_{0:\tau})} &= \frac{\hat{p}_{S}(o_{1:\tau+1}\mid o_0, a_{0:\tau})}{\int do'_{\tau+1} \hat{p}_{S}(o_{1:\tau}, o'_{\tau+1}\mid o_0, a_{0:\tau})}\\ &= \frac{\sum_{\sigma: U(\sigma, o_0, a_{0:\tau}) = o_{1:\tau+1}} \; 2^{-\ell(\sigma)}}{\int do'_{\tau+1} \sum_{\sigma': U(\sigma', o_0, a_{0:\tau}) = (o_{1:\tau}, o'_{\tau+1})} \; 2^{-\ell(\sigma')}} \end{align*}\]TODO: finish AIXI, discuss generalization from the perspective of Kolmogorov complexity