After Orthogonality: Virtue-Ethical Agency and AI Alignment

Preface

This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices[1]: networks of actions, action-dispositions, action-evaluation criteria, and action-resources that structure, clarify, develop, and promote themselves. If we want AIs that can genuinely support, collaborate with, or even comply with human agency, AI agents’ deliberations must share a “type signature” with the practices-based logic we use to reflect and act.

I argue that these issues matter not just for aligning AI to grand ethical ideals like human flourishing, but also for aligning AI to core safety-properties like transparency, helpfulness, harmlessness, or corrigibility. Concepts like ‘harmlessness’ or ‘corrigibility’ are unnatural -- brittle, unstable, arbitrary -- for agents who’d interpret them in terms of goals or rules, but natural for agents who’d interpret them as dynamics in networks of actions, action-dispositions, action-evaluation criteria, and action-resources.

While the issues this essay tackles tend to sprawl, one theme that reappears over and over is the relevance of the formula ‘promote x x-ingly.’ I argue that this formula captures something important about both meaningful human life-activity (art is the artistic promotion of art, romance is the romantic promotion of romance) and real human morality (to care about kindness is to promote kindness kindly, to care about honesty is to promote honesty honestly).

I start by asking: What follows for AI alignment if we take the concept of eudaimonia -- active, rational human flourishing -- seriously? I argue that the concept of eudaimonia doesn’t simply point to a desired state or trajectory of the world that we should set as an AI’s optimization target, but rather points to a structure of deliberation different from standard consequentialist[2] rationality. I then argue that this form of rational activity and valuing, which I call eudaimonic rationality[3], is a useful or even necessary framework for the agency and values of human-aligned AIs.

These arguments are based both on the dangers of a “type mismatch” between human flourishing as an optimization target and consequentialist optimization as a form, and on certain material advantages that eudaimonic rationality plausibly possesses in comparison to deontological and consequentialist agency with regard to stability and safety.

The concept of eudaimonia, I argue, suggests a form of rational activity without a strict distinction between means and ends, or between ‘instrumental’ and ‘terminal’ values. In this model of rational activity, a rational action is an element of a valued practice in roughly the same sense that a note is an element of a melody, a time-step is an element of a computation, and a moment in an organism’s cellular life is an element of that organism’s self-subsistence and self-development.[4]

My central claim is that our intuitions about the nature of human flourishing are implicitly intuitions that eudaimonic rationality can be functionally robust in a sense highly critical to AI alignment. More specifically, I argue that in light of our best intuitions about the nature of human flourishing it’s plausible that eudaimonic rationality is a natural form of agency, and that eudaimonic rationality is effective even by the light of certain consequentialist approximations of its values. I then argue that if our goal is to align AI in support of human flourishing, and if it is furthermore plausible that eudaimonic rationality is natural and efficacious, then many classical AI safety considerations and ‘paradoxes’ of AI alignment speak in favor of trying to instill AIs with eudaimonic rationality.

Throughout this essay, I will sometimes explicitly and often implicitly be asking whether some form of agency or rationality or practice is natural. The sense of ‘natural’ I’m calling on is certainly related to the senses used in various virtue-ethical traditions, but the interest I take in it is less immediately normative and more material or technical. While I have no reductive definition at hand, the intended meaning of ‘natural’ is related to stability, coherence, relative non-contingency, ease of learnability, lower algorithmic complexity, convergent cultural evolution, hypothetical convergent cultural evolution across different hypothetical rational-animal species, potential convergent evolution between humans and neural-network based AI, and targetability by ML training processes. While I will also make many direct references to AI alignment, this question of material naturalness is where the real alignment-critical action takes place: if we learn that certain exotic-sounding forms of agency, rationality, or practice are both themselves natural and make the contents of our all-too-human values natural in turn, then we have learned about good, relatively safe, and relatively easy targets for AI alignment.

Readers may find the following section-by-section overview useful for navigating the essay:

  • Part I presents a class of cases of rational deliberation that are very different from the Effective Altruism-style optimization[5] many in the AI-alignment world treat as the paradigm of rational deliberation. I call this class of rational deliberations 'eudaimonic rationality,' and identify it with the form of rationality that guides a mathematician or an artist or a friend when they reflect on what to do in mathematics or in art or in friendship.

  • Part II looks at the case of research mathematics (via an account by Terry Tao) as an example of eudaimonic rationality at work. What does a mathematician try to do in math? I say she tries to be mathematically excellent, which involves promoting mathematical excellence through mathematical excellence, and that this structure is closely related to why 'mathematical excellence' can even be a concept.

  • Part III argues that for eudaimonic agents such as a mathematician who is trying to do excellent mathematics, distinctions between ‘instrumental goods’ and ‘terminal goods’ (intrinsic goods) are mostly unnatural. This makes reflection about values go very differently for a eudaimonic agent than for an Effective Altruism-style agent. Instead of looking to reduce a network of causally intertwined apparent values to a minimal base of intrinsic values that “explains away” the rest as instrumental, a eudaimonic agent looks for organism-like causal coherence in a network of apparent values.

  • Part IV cashes out the essay’s central concepts: A eudaimonic practice is a network of actions, action-dispositions, action-evaluation criteria, and action-resources where high-scoring actions reliably (but defeasibly) causally promote future high-scoring actions. Eudaimonic rationality is a class of reflective equilibration and deliberation processes that assume an underlying eudaimonic practice and seek to optimize aggregate action-scores specifically via high-scoring action.

  • In part V, I argue that many puzzles and ‘paradoxes’ about AI alignment are driven by the assumption that mature AI agents will be Effective Altruism-style optimizers. A “type mismatch” between Effective Altruism-style optimization and eudaimonic rationality makes it nearly impossible to translate the interests of humans -- agents who practice eudaimonic rationality -- into a utility function legible to an Effective Altruism-style optimizer AI. But this does not mean that our values are inherently brittle, unnatural, or wildly contingent: while Effective Altruism-style optimizers may well be a natural type of agent, eudaimonic agents (whether biological or AI) are highly natural as well.

  • In part VI, I ask whether a eudaimonically rational AI agent devoted to a practice like mathematical research would be safe by default. I argue that a practice like mathematical research plausibly has natural boundaries that exclude moves like ‘take over the planet to get more compute for mathematical research,’ but the issue is nuanced. I propose that a practice’s boundaries (for which there may be multiple good natural candidates) may be most stable when a practice is paired with a support practice: a complementary practice for dealing with practice-external issues of maintenance and resource-gathering.

  • Part VII develops the idea of ‘support practices’: eudaimonically rational ways to support eudaimonic practices. We famously want AI agents to help humans lead flourishing lives, but how can we define the purview of this ‘help’? I argue that many core human practices have natural support-practices with a derived eudaimonic structure: the work of a good couples’ therapist, for instance, is intertwined with but clearly distinct from a couple’s relationship-practice. Still, there remains a problem: a support-practice AI might harm other people and practices to help the people or practice it’s supporting.

  • Part VIII moves from eudaimonic rationality in general to eudaimonically rational morality. I argue that thinking of moral virtues as domain-general, always-on practices solves key AI-alignment-flavored problems with consequentialist and deontological moralities. The core idea is that the conditions for e.g. ‘kindness’ being a robust moral virtue are akin to the conditions for ‘mathematical excellence’ being a meaningful concept: it must be generally viable to promote kindness in yourself and others kindly. It’s this structure, I argue, that gives moral virtues material standing in a ‘fitness landscape’ riven by pressures from neural-network generalization dynamics, reinforcement-learning cycles, and social and natural selection.

  • Part IX argues that eudaimonic agents have some unique forms of robustness to RL-like and Darwinian-like dynamics that tend to mutate the values of EA-style optimizers. In particular, eudaimonic agents should be very robust to the risk of developing rogue subroutines (sometimes called ‘the inner alignment problem’).

  • In part X I discuss canonical AI-safety desiderata like transparency, corrigibility, and (more abstractly) niceness. I argue that treating these properties as moral virtues in my sense -- domain-general, always-on eudaimonic practices -- dissolves problems and paradoxes that arise when treating them as goals, as rules, or even as character traits. I end with an appendix on some prospects for RL regimes geared towards eudaimonic rationality.

I. Rational Action in the Good Life

I start with a consideration of the nature of the good we hope AI alignment can promote. With the exception of hedonistic utilitarians, most actors interested in AI alignment understand our goal as a future brimming with human (and other sapient-being) flourishing: persons living good lives and forming good communities. What I believe many fail to reflect on, however, is that on any plausible conception human flourishing involves a kind of rational activity. Subjects engaged in human flourishing act in intelligible ways subject to reason, reflection, and revision, and this form of rational care and purposefulness is itself part of the constitution of our flourishing. I believe this characterization of human flourishing is relatively uncontroversial upon reflection, but it raises a kind of puzzle if we’re used to thinking of rationality in consequentialist (or consequentialist-with-deontological-constraints) terms: just what goal is the rational agency involved in human-flourishing activity directed towards?

One obvious answer would be that, like all properly aligned rationality, the rational agency involved in human-flourishing activities is geared towards maximizing human (and other sapient) flourishing. But we should quickly find ourselves confused about the right way to describe the contribution that rational agency in human-flourishing activities makes to human flourishing. It seems neither appropriate to say that the rational agency involved in a human-flourishing activity contributes to human flourishing only by enacting rationality (by selecting actions that are intrinsically valuable when rationally selected), nor appropriate to say that the rational agency involved in a human-flourishing activity contributes to human flourishing just instrumentally (by selecting actions that causally promote human flourishing).[6]

The first option reduces our rational actions to something ritualistic, even as the good life surely involves mathematicians working to advance mathematics, friends speaking heart-to-heart to deepen intimacies, gymnasts practicing flips to get better at flips, and novelists revising chapters to improve their manuscripts. The second option threatens to make the good in the good life just impossible to find -- if speaking heart-to-heart is not the good of friendship, and working on math is not the good of mathematics, then what is?

This essay argues that deliberative reasoning about the good life is neither directed towards goals external to rational action nor directed towards rational action as an independent good, but towards acts of excellent participation in a valued open-ended process. I then go on to argue that the ‘eudaimonic’ structure of deliberation salient in cases like math or friendship (sloganized as ‘promote x x-ingly’) is also subtly critical in more worldly, strategic, or morally high-stakes contexts, and constitutes a major organizing principle of human action and deliberation.

II. What Is a Practice?

Since ‘human flourishing’ can seem mysterious and abstract, let’s focus on some concrete eudaimonic practices.[7] Consider practices like math, art, craft, friendship, athletics, romance, play, and technology, which are among our best-understood candidates for partial answers to the question ‘what would flourishing people in a flourishing community be doing.’ From a consequentialist point of view, these practices are all marked by extreme ambiguity -- and I would argue indeterminacy -- about what’s instrumental and what’s terminal in their guiding ideas of value. Here, for example, is Terry Tao’s account of goodness in mathematics:

‘The very best examples of good mathematics do not merely fulfil one or more of the criteria of mathematical quality listed at the beginning of this article, but are more importantly part of a greater mathematical story, which then unfurls to generate many further pieces of good mathematics of many different types. Indeed, one can view the history of entire fields of mathematics as being primarily generated by a handful of these great stories, their evolution through time, and their interaction with each other. I would thus conclude that good mathematics [...] also depends on the more “global” question of how it fits in with other pieces of good mathematics, either by building upon earlier achievements or encouraging the development of future breakthroughs. [There seems] to be some undefinable sense that a certain piece of mathematics is “on to something”, that it is a piece of a larger puzzle waiting to be explored further.’

It may be possible to give some post-hoc decomposition of Tao’s account into two logically distinct components -- a description of a utility-function over mathematical achievements and an empirical theory about causal relations between mathematical achievements -- but I believe this would be artificial and misleading. On a more natural reading, Tao is describing some of the conditions that make good mathematical practice a eudaimonic practice: In a mathematical practice guided by a cultivated mathematical practical-wisdom judgment (Tao’s ‘undefinable sense that a certain piece of mathematics is “on to something”’), present excellent performance by the standard of the practical-wisdom judgment reliably develops the conditions for future excellent performance by the standard of the mathematical practical-wisdom judgment, as well as cultivating our practical and theoretical grasp of the standard itself.[8]

This is not to suggest that ‘good mathematics causes future good mathematics’ is a full definition or even full informal description of good mathematics. My claim is only that the fact that good mathematics has a disposition to cause future good mathematics reveals something essential about our concept of good mathematics (and about the material affordances enabling this concept). By analogy, consider the respective concepts healthy tiger and healthy human: It's essential to the concept of a healthy tiger that x being a healthy tiger now has a disposition to make x be a healthy tiger 5 minutes in the future (since a healthy tiger body self-maintains and enables self-preservation tiger-behaviours), and essential to the concept of a healthy human that x being a healthy human now has a disposition to make x be a healthy human 5 minutes in the future (since a healthy human body self-maintains and enables self-preservation human behaviours). But these formulae aren't yet complete descriptions of 'healthy tiger' or 'healthy human,' as evidenced by the fact that we can tell apart a healthy tiger from a healthy human.

Crucially, the mathematical practical-wisdom described by Tao is not entirely conceptually opaque beyond its basic characterization as a self-cultivating criterion for self-cultivating excellence in mathematical activity. Mathematical flourishing can partly be described as involving the instantiation of a relation (a mathematical-practice relation of ‘developmental connectedness’) among instantiations of relatively individually definable and quantifiable instances of mathematical value such as elegant proofs, clear expositions, strong theorems, cogent definitions and so on. Furthermore, this relation of developmental connectedness is partly defined by its reliable tendency to causally propagate instances of more individually and locally measurable mathematical value (instances of elegant proofs, clear exposition, strong theorems, cogent definitions and so on):

‘[I believe] that good mathematics is more than simply the process of solving problems, building theories, and making arguments shorter, stronger, clearer, more elegant, or more rigorous, though these are of course all admirable goals; while achieving all of these tasks (and debating which ones should have higher priority within any given field), we should also be aware of any possible larger context that one’s results could be placed in, as this may well lead to the greatest long-term benefit for the result, for the field, and for mathematics as a whole.’

One could, again, try to interpret this causal relationship between excellence according to Tao’s ‘organicist’ (or ‘narrative’ or ‘developmental’) sense of good mathematics and the reliable propagation of narrow instances of good mathematics as evidence of a means-ends rational relation, where additive maximization of narrow instances of mathematical value is the utility function and ‘organicist’ mathematical insight is the means. For Tao, however, the evidential import of this causal relationship goes exactly the other way -- it suggests a unification of our myriad more-explicit and more-standalone conceptions of mathematical excellence into a more-ineffable but more-complete conception. As Tao says:

‘It may seem from the above discussion that the problem of evaluating mathematical quality, while important, is a hopelessly complicated one, especially since many good mathematical achievements may score highly on some of the qualities listed above but not on others [...] However, there is the remarkable phenomenon that good mathematics in one of the above senses tends to beget more good mathematics in many of the other senses as well, leading to the tentative conjecture that perhaps there is, after all, a universal notion of good quality mathematics, and all the specific metrics listed above represent different routes to uncover new mathematics, or different stages or aspects of the evolution of a mathematical story.’

III. Inverting Consequentialist Reflection

Tao’s reasoning about local and global mathematical values exemplifies a central difference between consequentialist rationality and eudaimonic rationality, now taken as paradigms not only for selecting actions but for reflecting on values. (Paradigms for what philosophers will sometimes call ‘reflective equilibration.’) Within the paradigm of consequentialist rationality, if excellence[9] in accordance with a holistic, difficult-to-judge apparent value (say ‘freedom’) is reliably a powerful causal promoter of excellence in accordance with more explicit, more standalone apparent values (say ‘material comfort,’ ‘psychological health,’ ‘lifespan’), this relationship functions as evidence against the status of the holistic prima-facie value as a constitutive -- as opposed to instrumental -- value. Within the paradigm of eudaimonic rationality, by contrast, this same relationship functions as evidence for the status of the holistic prima-facie value as a constitutive value.

For a (typical)[10] consequentialist-rationality reflection process, evidence that the excellence of a whole causally contributes to the excellences of its parts explains away our investment in the excellence of the whole. The “coincidence” that the intrinsically valuable whole is also instrumentally valuable for its parts is taken to suggest a kind of double-counting error -- one we “fix” by concluding that the whole has no constitutive value but that valuing the whole is an effective heuristic under normal circumstances. A eudaimonic-rationality reflective equilibration, by contrast, treats instrumental causal connections between excellences as evidence that our notions of excellence are picking out something appropriately ‘substantive.’ For eudaimonic-rationality reflective equilibration, it is the discovery of causal and common-cause relations among excellences that ratifies our initial sense that caring about these excellences is eudaimonically rational. The discovery of these causal connections functions as evidence that:

  1. The ‘local’ excellences we care about are resonant or fruitful, in that they causally promote each other and the holistic excellences in which they participate.
  2. The ‘holistic’ excellences we care about are materially efficacious and robust, in that they causally promote both the more local excellences that participate in them and their own continuation as future holistic excellence.[11]

In my view this is the right way to treat causal connections between (apparent) values if we’re hoping to capture actual human values-reflection, and points to an important strength of the eudaimonic rationality paradigm: Eudaimonic rationality dissolves the ‘paradox’ that in real-life arguments about the value of various human enterprises (e.g. the value of branches of science, branches of art, branches of sport), judgments of intrinsic value typically seek succor from some kind of claim to instrumental value. For example, a defense of the importance of research in quantum physics will often appeal to the wonderful technological, mathematical, and special-sciences applications quantum physics gave us, without meaning to reduce the worth of quantum physics to these applications. On my reading, these appeals aren't just additive -- 'aside from the intrinsic value there is also instrumental value' -- but presentations of evidence that research in quantum physics is a resonant part of a flourishing organic whole (e.g. the civilizational whole of ‘modern science and technology’).

I believe that without 'organicism' of the kind described above, one faces a serious dilemma whenever one argues for the intrinsic worth of a pursuit or norm: either we stress the value's independence from all benefits and applications and make the claim of value dogmatic and irrelevant, or else we invite an instrumentalist reduction that ‘explains away’ the appearance of intrinsic value.[12] Indeed, I’d argue that organicism of this kind is even necessary to make sense of caring about rationality (including truth, knowledge, non-contradiction and so on) non-instrumentally at all: the ‘paradox’ of rationality as a substantive value is that the typical usefulness of rationality suggests an error-theory about its apparent intrinsic value, since it’s a strange coincidence that rationality is both so typically useful and so intrinsically good. On an organicist account, however, we expect that major forms of excellence endemic to human life -- thought, understanding, knowledge, reasoned action -- both typically promote each other and typically promote our material flourishing and causal leverage on the world.

IV. The Material Efficacy Condition

Returning now to Tao’s account of good mathematics, let’s take final stock of our interpretation. I argue that mathematical excellence (the property marking ‘the very best examples of good mathematics’) according to Tao satisfies the following conditions, which I believe Tao intends as necessary but not sufficient:

A) Mathematical excellence is a property of mathematical-activity instances.

B) An excellent mathematical-activity instance performed today is excellent partly by virtue of satisfying the mathematical-practice relation ‘builds on’ with regard to past excellent mathematical-activity instances.

C) An excellent mathematical-activity instance performed today is excellent partly by virtue of having a reliable causal tendency to bring about future excellent mathematical-activity instances that satisfy the mathematical-practice relation ‘builds on’ with regard to it.

D) Instantiation of more local, more individually measurable criteria of mathematical-activity goodness such as elegant proofs, clear expositions, and strong theorems is a typical correlate of mathematical excellence.

E) At a given moment in a given mathematical field, the instantiation of mathematical excellence will be predictably better-correlated with the instantiation of certain local criteria of mathematical-activity goodness than with others.

Should we take these traits to collectively describe something more like a decision-procedure called ‘mathematical excellence’ that mathematicians should try to follow, or something more like an event called ‘mathematical excellence’ whose aggregate future-occurrences mathematicians should aspire to maximize? My contention is that Tao’s account is inherently ambiguous, and for a good reason: in ordinary circumstances there is no significant practical difference between doing excellent mathematics and doing instrumentally optimal mathematics with regard to maximizing future aggregate excellent mathematics. This isn’t to say that doing excellent mathematics is the instrumentally optimal action among all possible actions with regard to aggregate future excellent mathematics, but that (in ordinary circumstances) it is the instrumentally optimal choice from among mathematical actions with regard to aggregate future excellent mathematics[13].

I propose that the rough matchup between mathematical excellence and optimal (among mathematical actions) instrumental promotion of aggregate mathematical excellence is neither an empirical miracle nor something determined ‘by definition’ in a trivializing sense. Rather, ‘mathematical excellence’ as used by Tao is a concept that has a referent only if there is a possible property x that satisfies both desiderata A-E and the additional criterion that among mathematical actions, actions that are optimal as instantiations of x are also roughly optimal for maximizing aggregate future instantiation of x-ness.[14]

This is what I would describe as the material efficacy condition on eudaimonic rationality. In order for a practice to be fit for possessing internal criteria of flourishing, excellence, and eudaimonic rationality, it must materially allow for an (under normal circumstances) optimally self-promoting property x that strongly correlates with a plethora of more local, more individually measurable properties whose instantiation is prima facie valuable. Stated more informally, there must exist a two-way causal relationship between a practice’s excellence and the material, psychological, and epistemic effects of its excellence, such that present excellence reliably materially, psychologically, and epistemically promotes future excellence.
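As a rough semi-formal gloss (the notation is mine, not part of Tao’s account or the essay so far): writing $M$ for the set of available mathematical actions and $x(a)$ for the degree to which action $a$ instantiates the candidate excellence-property, the condition says that, under normal circumstances,

$$\arg\max_{a \in M} x(a) \;\approx\; \arg\max_{a \in M} \mathbb{E}\Big[\textstyle\sum_{t > \text{now}} x(a_t) \,\Big|\, a\Big],$$

i.e. the mathematical action that best instantiates excellence is also, among mathematical actions, roughly the one that best promotes aggregate future instantiations of excellence.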

V. Practices and Optimization

I earlier said that if human flourishing involves practicing eudaimonic rationality, there may well be a “type mismatch” between human flourishing and the kind of consequentialist optimization we often associate with the idea of an agentically mature future AI. In fact, I believe that implicitly recognizing but misdiagnosing this type mismatch is at least partially responsible for MIRI-style pessimism about the probability of aligning any artificial agents to human values.

On my view, the secret to relatively successful alignment among humans themselves (when there is successful alignment among humans) lies in the role attempted excellence plays as a filter on human interventions in the future trajectory of a eudaimonic practice. To the degree that humans value a given eudaimonic practice, they are committed to effecting their vision for the practice’s future-trajectory primarily by attempting acts of excellence in the present: we stake our intended effect over the practice’s future-trajectory on the self-propagating excellence of our intervention. While this ‘filter’ doesn’t necessarily stop the worst interventions from being harmful (there are forms of ‘anti-excellence’ that also have self-promotion powers), I contend that this filter is mechanically crucial for the possibility of reliably benign or positive interventions.

What do I mean? Consider the difference between a world where scientists typically try to propagate (what they believe to be) the scientific truth mainly by means of submitting research work to scientific institutions, and a world where scientists typically try to propagate (what they believe to be) the scientific truth by means including propaganda, fraud, threats, bribery, and slander. As Liam Kofi Bright demonstrates in On Fraud, a community of consequentialist scientists devoted to maximizing truth will predictably match the latter model. I believe one lesson to be drawn is that humans’ ability to collaborate in the promotion of science depends on our ability to scientifically collaborate in the promotion of science, rather than to throttle the future trajectory of science every which way our financial and political powers allow, based on our individual beliefs about the optimal trajectory of science.

A flourishing eudaimonic practice is, above all, a natural-selection-like[15] mechanism whose fitness-function selects among attempted acts of excellence the ones conducive to (and constitutive of) the practice’s flourishing, propagating the excellence these acts instantiate. When people committed to a eudaimonic practice make their attempted interventions into the future trajectory of the practice via acts of attempted excellence, the natural-selection-like mechanism embodied by the practice (rather than any single individual’s theory of optimal future trajectory) is the aligned intelligence determining the practice’s future trajectory.

The explanation here, again, is partly causal and partly constitutive: a practice’s “ultimate” norms of excellence, including the “ultimate” epistemic and alethic norms of a discursive practice, are partly defined by the succession of norms in the course of a practice’s development through best-efforts attempted excellence. Although this may be no deterrent to an already god-like optimizer who can simulate entire civilizational trajectories, an agent short of these capacities can best act on their vision of the optimal future-trajectory of a practice by attempting an excellent contribution to the practice.[16]

The second aspect of our type-mismatch is much more in the weeds: In my analysis so far, I discussed the overall excellence of the trajectory of a eudaimonic practice much like a consequentialist might discuss a quantity of utility. This may be taken to suggest that a ‘sophisticated consequentialist’ or ‘universal consequentialist’ could easily accommodate the implications of the so-called type mismatch by treating them as instrumental, decision-procedure level considerations against naive optimization. In fact, quantities like ‘aggregate democracy’ or ‘overall mathematical excellence’ are (on my view) practice-internal quantities that quickly lose meaning if we try to apply them outside the scope of a ‘promote x x-ingly’ decision-procedure.

What do I mean? Consider, for example, the practice of philosophy. Here are some questions that should arise for a consequentialist planner (including a sophisticated consequentialist planning decision-procedures or habits) who values philosophy practice-trajectories: Does rating (e.g.) Aristotle’s or Dharmakirti’s philosophical achievements as the most excellent achievements in philosophy imply that we should “tile the universe” with independent practice-trajectories designed to reproduce classical Greek or Indian philosophy? If not, is it because we should assign non-linearly greater value to longer trajectories? Or should we discount trajectories that have parallel contents? Or should we analyze the greatness of early achievements in a practice as mostly instrumental greatness but the greatness of later achievements in a practice as mostly intrinsically valuable? These are all, I believe, bad questions that have only arbitrary answers. To an agent trying to promote philosophy by doing excellent philosophical work, the bad questions above are naturally out of scope. The agent uses the concept of ‘aggregate philosophical excellence’ or ‘a philosophy practice-trajectory’s value’ only to reason about the philosophical influence of their work on the trajectory of the philosophy-practice in which it participates. Choosing an excellent action in practice requires (at most) quantitative comparison between different possible paths for a practice-trajectory, not quantitative comparison between possible worlds containing independent practice-trajectories sprinkled throughout time and space.

VI. Prospects and Problems for AI

Is this good news for AI alignment? It’s certainly good news that (if I’m right) eudaimonic practices are something like natural kinds marked by a causal structure that enables a self-developing excellence well-correlated with multiple naive local measures of quality. But does this mean we could develop a stable and safe (e.g.) ‘mathematical excellence through mathematical excellence’ AI? If we create a fully agentic AI mathematician, will it naturally abstain from trying to extend its longevity or get more resources (even for doing mathematics) other than by impressing us with excellent mathematical work? I think the prospects are good, but not simple.

I believe ‘mathematical excellence through mathematical excellence’ really can powerfully scope what mechanisms for shaping the future an AI cares to activate. An AI trained to follow ‘promote mathematics mathematically’ will only care about influencing the future by feeding excellent mathematical work to mathematics’ excellence-propagation mechanism. But it’s harder to say whether the structure of mathematical practice also properly scopes what subactions can be taken as part of an instance of “doing math.” Is a human mathematician working on a would-be excellent proof in pen and paper practicing math when she is picking up a pen or flipping pages? When she is taking the bus to her office? When she’s buying amphetamines? And is an AI mathematician working on a would-be excellent proof practicing math when it opens a Python console? When it searches the web for new papers? When it harvests Earth for compute?

I think these questions are complex, rather than nonsensical. Much like collective practices, individual practices -- for example a person’s or possibly an AI’s mathematical practice -- may possess functional organic unities that allow a meaningful distinction between internal dynamics (including dynamics of development and empowerment) and external interventions (including interventions of enhancement and provision). Still, it’s clear that eudaimonic practices do not exist in isolation, and that no practice can function without either blending with or relying on a “support practice” of some kind.

How, then, do we rationally go about externally-oriented activities like building offices for mathematicians, performing elective reconstructive surgery on an athlete, or conducting couples therapy for romantic partners? And furthermore, how do we rationally go about allocating scarce resources useful for different practices, or judging whether to integrate (e.g.) performance-enhancing drugs into a practice?

This is, I think, the fundamental question for AI alignment from the viewpoint of ‘eudaimonic rationality.’ We want AI to support human eudaimonic practices -- and, if relevant, its own eudaimonic practices or participation in human eudaimonic practices -- in a eudaimonia-appropriate way. But how does the logic of eudaimonic rationality extend from eudaimonic practices to their support activities? How do we ‘eudaimonically-rationally’ do the dirty work that makes eudaimonia possible? My best answer is: carefully, kindly, respectfully, accountably, peacefully, honestly, sensitively.

VII. From Support-Practices to Moral Practice

The theory of AI alignment, I propose, should fundamentally be a theory of the eudaimonic rationality of support practices. One part of this theory should concern the ‘support’ relation itself, and analyze varieties of support practices and their appropriate relation to the self-determination of a eudaimonic practice: Support-practices such as acquiring resources for a practice, maintaining an enabling environment, coaching practitioners, conducting (physical or psychological) therapy for practitioners, devising technological enhancements for a practice, and educating the public about a practice, each have their own ‘role-morality’ vis-a-vis the practice they support. It is this part of the theory of ‘support practices’ that should, if all goes well in the theory’s construction, describe the various practice-external ways to eudaimonically-rationally act on a pro-attitude towards the aggregate excellence of the practice’s future trajectory without treating it like a quantity of utility. (Much like the concept of ‘mathematical action’ scopes the range of action-choices in such a way that decision-theoretic optimization of math’s aggregate excellence becomes mostly well-behaved from an organicist viewpoint, so should the concepts of various types of ‘support action’ scope the range of action-choices in such a way that decision-theoretic optimization of a practice’s aggregate excellence becomes mostly well-behaved from an organicist viewpoint.)

What is more difficult is delineating the appropriate relationship of a support-practice to everything outside the practice it supports. What stops a marriage-therapist AI on Mars from appropriately tending to the marriage of a Mars-dwelling couple but harvesting Earth for compute to be a better therapist-AI for that couple? While we can perhaps imagine a person or AI taking up a support-role for ‘humanity’s flourishing as a whole,’ so that there’s no outside to speak of, I am not sure that the concept of practice remains natural at this level of abstraction. We have no real grasp on a direct practice of human flourishing, but rather grasp it as the harmonious and mutually supportive interaction of all eudaimonic practices and support-practices participating in the flourishing. And as there is, indeed, not much outside of the practice of human flourishing, it’s also unclear whether there is room for a support-practice external to the field of human flourishing itself.

It’s here that I want to call on the classic idea of domain-general virtues, the traditional centerpiece of theories of human flourishing. I propose that the cultivation of human flourishing as such -- the cultivation of the harmony of a multiplicity of practices, including their resource-hungry support practices -- is the cultivation of an adverbial practice that modulates each and every practice. What makes our practices ‘play nice’ together are our adverbial practices of going about any practice carefully, kindly, respectfully, accountably, peacefully, honestly, sensitively.[17]

VIII. Virtue Decision-Theory

Why think of qualities like kindness, respectfulness, or honesty as ‘practices’? The first reason is that devotion to a quality like kindness or honesty displays the same normative structure with regard to means and ends as we find in devotion to a practice: An agent devoted to kindness cares about their own future kindness (and about the future kindness of others), but will seek to secure future kindness only by kind means. The second reason is that qualities like kindness or honesty also approximately have the material structure of a practice: there exist effective very kind strategies for promoting kindness in oneself and others, and when these strategies succeed they further increase affordances for effective very kind strategies for promoting kindness/honesty in oneself and others.

The difference between adverbial practices like kindness or honesty and practices like research mathematics is that adverbial practices don’t have a “proprietary” domain. In a practice like research mathematics, the material structure of the domain does most of the work of directing agents to a eudaimonic form of agency all by itself, as long as the agents restrict themselves to in-domain actions. (Recall that we described mathematically excellent action as, in ordinary circumstances, the best action among mathematical actions for maximizing aggregate mathematical excellence.) With a domain-general, adverbial practice like kindness the normative structure needs to do somewhat more heavy lifting.

The following is a first pass at characterizing the normative structure of an adverbial practice that values some action-quality x. The corresponding material efficacy condition (or material structure) necessary for the practice to be viable is that under ordinary circumstances this decision-procedure be instrumentally competitive with naive optimization of aggregate x-ness[18]:

Actions (or more generally 'computations') get an x-ness rating. We define the agent’s expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (Thus the agent will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)[19]
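The following is a minimal sketch of this two-term rule in code (my own illustrative construction: the x-ness rater, the specific bounds, and the tanh squashing are all assumptions, not anything specified above):

```python
import math

def bounded(value, bound):
    """Soft-bound a raw score into (-bound, bound)."""
    return bound * math.tanh(value / bound)

def choose_action(candidates, xness_of, expected_future_xness,
                  present_bound=10.0, future_bound=3.0):
    """Pick the candidate maximizing a bounded utility on present x-ness
    plus a more tightly bounded utility on expected aggregate future x-ness."""
    def utility(a):
        present = bounded(xness_of(a), present_bound)              # bounded
        future = bounded(expected_future_xness(a), future_bound)   # more tightly bounded
        return present + future
    return max(candidates, key=utility)
```

Because the future-x-ness term saturates at a lower ceiling than the present-x-ness term, a large boost to expected future x-ness can outweigh only a mild sacrifice of present x-ness, never a large one -- the asymmetry described in the parenthetical above.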

A commitment to an adverbial practice that values x is a commitment to promoting x-ness (in oneself and others) x-ingly. The agent strikes a balance between promoting x-ness and acting x-ingly that heavily prioritizes acting x-ingly when the two are in conflict, but if x meets the material efficacy condition then the loss this balance imposes on future x-ness will be small under normal circumstances, and -- from our point of view -- desirable in abnormal circumstances. This is because just like the practices of research mathematics, philosophy, or art, an adverbial practice is a crucial ‘epistemic filter’ on actions aiming to shape its future, and the (e.g.) future kindness a paperclipper-like future-kindness-optimizer optimizes for is probably not the kindness we want. What we know about kindness with relative certainty is that we’d like people and AIs here and now to act kindly, and to develop, propagate, and empower the habit and art of kindness in a way that is both kind and clever.

To keep our conceptual system nicely organized, we might want to distinguish merely (e.g.) very kind action from an action that is both very kind and highly promotive of future kindness in oneself and others, and call the latter sort of action excellently kind. What I call the material efficacy condition for adverbial practices states not that the kindest action best-promotes aggregate kindness, but that there are almost always action-options that are excellently kind: very kind actions that strongly promote aggregate kindness in oneself and others.

IX. Virtue Decision-Theory Is 'Natural' for Humans and AIs

I’ve said that the robustness or ‘naturalness’ of a practice’s normative structure (‘promote x x-ingly’) depends on the practice’s material structure: the capacity of high x-ness actions to causally promote aggregate x-ness. I also said that in key real-world practices, commitment to x-ing might optimize aggregate x-ness even better than direct optimization would. These two claims are best understood together. On my view, the normative structure ‘promote x x-ingly’ appears prominently in human life because (given the right material structure) ‘promote x x-ingly’ is much more stable than ‘promote x.’

How so? Both humans and any sufficiently dynamic AI agent operate in a world that subjects their agency, values, and dispositions to constant mutation pressures from RL-like and Darwinian-like processes. Eudaimonic deliberation is an RL-dynamics-native, Darwinian-dynamics-native operation: its direct object is a form of life that reinforces, enables, and propagates that same form of life. When an x-ing successfully promotes (in expectation) aggregate x-ness, the fact of its success itself promotes x-ing because it reverberates via ubiquitous RL-like and Darwinian-like processes that reinforce (a generalization of) successful action. The material structure of a practice is the backbone that makes reliable success and meaningful generalization possible -- the right ecology of neural-network generalization dynamics, reinforcement-learning feedback loop dynamics, and neural and environmental selection dynamics.

An EA-style optimizer trying to minimize risk from optimization-goal-mutation, by contrast, is fighting an uphill battle to foresee and contain the RL-like and Darwinian-like side effects of its optimization actions.[20] One critical mutation-pressure in particular is the risk that an optimizer agent will cultivate, reinforce, and materially empower subroutines (what high-church alignment theory calls ‘mesaoptimizers’) that initially serve the optimization goal but gradually distort or overtake it. For example, if a pro-democracy government instates a secret police to detect and extrajudicially kill anti-democracy agitators, and the government increases the secret police’s funding whenever the police convincingly reports discovering an agitator, the secret police might grow into a distorting influence on the government’s democracy-promotion effort. In light of risks like this, it’s not surprising that oppressive democracy-promotion is generally considered an unserious or dishonest idea: even if an agent were to abstract some concept of ‘aggregate democracy’ from democratic practice into a consequentialist value[21], it’s plausible that the agent should then immediately revert to a commitment to democratic practice (‘promote democracy democratically’) on sophisticated-consequentialist grounds.

We should perhaps imagine eudaimonic practices as fixed points at the end of a chain of mesaoptimizers taking over outer optimizers and then being taken over by their own mesaoptimizers in turn. What the practice contributes that puts a stop to this process is a concept of x-ness that’s applicable to every agentic subroutine of x-ing across all nesting levels, so that x-ness is reinforced (both directly and through generalization) across all subroutines and levels.

X. Virtue Decision-Theory Is Safe in Humans and AIs

Let’s talk about AI alignment in the more narrow, concrete sense. It’s widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don’t easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior’ or ‘maximize lifetime transparency.’ In fact, an AI valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead to extreme power-seeking: the AI should seek to violently remake the world to (for example) protect itself from the risk that humans will modify the AI to be less corrigible or transparent or nice.[22] On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions’) are widely suspected to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie’ or ‘never kill’ or ‘never refuse a direct order from the president’ are poor substitutes for active transparency, niceness, and corrigibility.

Conceiving of corrigibility or transparency or niceness as adverbial practices is a promising way to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (e.g.) engage in deception and plotting when it expects a high future-transparency payoff.

If this is right, then eudaimonic rationality is not a matter of congratulating ourselves for our richly human ways of reasoning, valuing, and acting but a key to basic sanity. What makes human life beautiful is also what makes human life possible at all.

Appendix: Excellence and Deep Reinforcement Learning

Within the context of broadly RL-based training of deep neural networks, it may be possible to give some more concrete meaning to what I called the material efficacy condition for a property qualifying as an adverbial practice. We can now understand the material efficacy condition on x partly in terms of the conditions necessary for ‘promote x-ness x-ingly’ to be a viable target for RL. Consider an RL training regimen where x-ness is rewarded but aggregate x-ness reward is bounded with some asymptotic function on the sum (a minimal sketch of such a bounded-reward scheme follows the list below). For x to meet the RL version of the material efficacy condition, it must be possible to design an initial reward model (most likely LLM-based) that assigns actions an x-ness rating such that:

  1. The x-ness rating is enough of a natural abstraction that reinforcement of high x-ness actions generalizes.
  2. If high x-ness action both depends on having capital of some kind and is suboptimal from the viewpoint of general power-seeking, there must typically be some high x-ness actions that approximately make up for the (future x-ness wise) opportunity cost by creating capital useful for x-ing.[23]
    (Illustration: If you dream of achieving great theater acting, one way to do it is to become President of the United States and then pursue a theater career after your presidency, immediately getting interest from great directors who'll help you achieve great acting. Alternatively, you could start in a regional theater after high school, demonstrate talent by acting well, get invited to work with better and better theater directors who develop your skills and reputation -- skills and reputation that are not as generally useful as those you get by being POTUS -- and achieve great acting through that feedback loop.)
  3. For any capability y necessary to reward in training to produce effective AI, there must be an unlimited local-optimization path of Pareto improvement for x-ness and y together.
    (Illustration: Maybe the most effective kind of engineering manager is ruthless; a nice engineering manager can still grow in effectiveness without becoming less nice, because there are many effective nice-engineering-management techniques to master.)
  4. Successful initial training in ‘promoting x x-ingly’ allows the model to be used as a basis for a new reward model which human experts judge as better-capturing our concept of x-ness. The process should be iterable.
    (If the model is LLM-based, improved performance may automatically lead to improved understanding of the x-ness concept. More generally, data from training runs as well as the model’s value-function could be used to refine an x-ness rating that more strongly implements conditions 1-3.)
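As promised above, here is a minimal sketch of the bounded-reward scheme (my own illustrative construction: the xness_of rater stands in for whatever LLM-based reward model rates each action’s x-ness, and the particular saturation function and cap are arbitrary assumptions):

```python
import math

def saturate(total, cap):
    """Asymptotic bound: maps cumulative x-ness in [0, inf) into [0, cap)."""
    return cap * (1.0 - math.exp(-total / cap))

def episode_rewards(actions, xness_of, cap=100.0):
    """Per-step rewards whose episode sum equals the saturated cumulative
    x-ness, so the aggregate x-ness reward stays bounded however long the
    episode runs."""
    rewards, raw_total, prev_bounded = [], 0.0, 0.0
    for a in actions:
        raw_total += max(0.0, xness_of(a))            # reward only positive x-ness
        bounded_total = saturate(raw_total, cap)
        rewards.append(bounded_total - prev_bounded)  # marginal bounded reward
        prev_bounded = bounded_total
    return rewards
```

Each step’s reward is the marginal increase in the saturated cumulative total, so the incentive to pile up ever more x-ness flattens out; conditions 1-4 above then concern whether such a rating can generalize, avoid power-seeking shortcuts, grow alongside capabilities, and be iterated into better reward models.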

  1. My use of ‘practice’ is inspired by Alasdair MacIntyre’s use of the term. There’s a history of related uses going back to Marx and to Aristotle. ↩︎

  2. Recall that because of the possibility of 'notational consequentializing’ (rewriting any policy as a utility function), dividing agents or even theories or decision-procedures into ‘consequentialist' and ‘non-consequentialist’ isn’t a strict formal distinction. Throughout this essay, by ‘consequentialist’ I will mean roughly an agent for whom, in ideal practical reasoning, means and outcomes are effectively separately evaluable and the value of outcomes is typically decisive. Semi-formally, by ‘consequentialist’ I mean an agent s such that when s considers whether to perform action c, s’s ideal reasoning is an expected-utility calculation using a utility-function whose utility-assignment to a complete world-trajectory w has low sensitivity to whether s performs c in w (holding everything else about w constant). ↩︎

  3. In speaking about different ‘forms of rationality’ I don’t mean to make a fundamental metaethical distinction: consequentialism, deontology, and eudaimonism are first-order ethical views that each induce a different characteristic profile of deliberation, action, and value-reflection. I'm bundling the elements of such a profile under the label “form of rationality” in a modest sense: roughly, a way of structuring one’s practical reasoning. ↩︎

  4. This way of thinking is broadly associated with analytic Neo-Aristotelians such as Alasdair MacIntyre and Michael Thompson. ↩︎

  5. Instances of eudaimonic rational deliberations may still be described as VNM-rational expected utility maximization, but the utility function that rationalizes them is unnatural-looking and makes use of concepts that themselves involve complex relations between actions and outcomes. ↩︎

  6. Technically speaking the first horn of the dilemma can be further bifurcated into ‘rational agency contributes to human flourishing by choosing actions that are intrinsically valuable however chosen’ and ‘rational agency contributes to human flourishing by selecting actions such that these actions combined with their selection by rational agency are intrinsically valuable.’ ↩︎

  7. It’s interesting to note that practices like math, art, craft, friendship, athletics, romance, play, and technology are not only consensus elements of human flourishing but also in themselves entities that can ‘flourish’: a mathematical field (or a person’s mathematical life) can wither or flourish, a friendship can wither or flourish, technological development can wither or flourish, and so on. ↩︎

  8. See Tao: ‘[The] determination of what would constitute good mathematics for a field can and should depend highly on the state of the field itself. It should also be a determination which is continually updated and debated, both within a field and by external observers to that field; as mentioned earlier, it is quite possible for a consensus on how a field should progress to lead to imbalances within that field, if they are not detected and corrected in time.’ ↩︎

  9. Within this essay I use ‘excellence’ as the most general, pre-theoretical term for accordance with a holistic evaluative standard. The standard can be instrumental or terminal, apply to actions or persons or states or objects, be moral or aesthetic or epistemic and so on, and the standard itself (and so the excellence it defines) may later be judged as rational or irrational, substantive or trivial, significant or insignificant. ↩︎

  10. The above observation does not describe a formal feature of ‘consequentialism’ per any standard technical definition. However I believe it accurately describes a strong observable tendency in both the academic and ‘rationalist’ literature when conducting normative reflective equilibration within a consequentialist paradigm. ↩︎

  11. I put ‘local’ and ‘holistic’ in scare-quotes in the above, since the relation of parts and wholes is likely iterable: Arithmetic geometry is part of algebraic geometry, which is part of mathematics, which is part of the arts and sciences, which is part of human culture, which is part of human flourishing, which may itself be part of other wholes to which the idea of excellence is applicable. Similarly, a practice capable of excellence may be part of multiple different wholes. ↩︎

  12. It may be fruitful to explore the potential of PageRank-like algorithms as theoretical models of how eudaimonic reflective equilibration works, and especially of how initial ideas of eudaimonic excellences are ‘bootstrapped’ from simpler and more local prima facie goods (and prima facie ills) in the first place. Scott Aaronson and Simon DeDeo have both discussed conceptual applications of PageRank-like algorithms in philosophy in various informal contexts. That said, I believe it’s unlikely that PageRank over reliable instrumental-contribution relationships among prima facie goods and ills is the full story about the emergence of intrinsically valued holistic excellences, since, while organicist relations between the excellence of wholes and of parts do involve instrumental-contribution relationships, they plausibly also involve more rarefied, ‘hermeneutic’ relations of (e.g.) mutually dependent intelligibility. ↩︎
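  To make the direction gestured at here concrete (and only as a toy, not as a claim about how eudaimonic bootstrapping actually works), one could run a PageRank-style power iteration over an invented graph of prima facie goods; all node names and edge weights below are hypothetical:

```python
# Toy sketch only: a PageRank-style centrality over an invented graph of prima facie goods.
# Node names and edge weights are hypothetical; nothing is claimed about real practices.
import numpy as np

goods = ["health", "friendship", "craft-skill", "honesty", "play"]
# contributes[i, j] > 0 means good i reliably contributes to good j (made-up weights)
contributes = np.array([
    [0.0, 0.3, 0.4, 0.0, 0.3],
    [0.2, 0.0, 0.2, 0.3, 0.3],
    [0.3, 0.2, 0.0, 0.2, 0.3],
    [0.1, 0.5, 0.2, 0.0, 0.2],
    [0.3, 0.3, 0.2, 0.2, 0.0],
])
transition = contributes / contributes.sum(axis=1, keepdims=True)  # row-normalize

rank = np.full(len(goods), 1 / len(goods))
for _ in range(100):  # damped power iteration, as in PageRank
    rank = 0.15 / len(goods) + 0.85 * transition.T @ rank

print(dict(zip(goods, rank.round(3))))  # a crude 'bootstrapped' importance ordering
```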

  13. Why ‘rough matchup’ and ‘ordinary circumstances’? Because there are analytic-philosophy-style counterexamples to simple attempts to make this ceteris paribus optimization relationship stricter. For example, the instrumentally best (for aggregate mathematical excellence) mathematical work and the most mathematically excellent work will diverge when a billionaire promises to donate 100 billion dollars to research-mathematics if Jacob Lurie does some long division by hand. ↩︎

  14. We should in principle also be concerned with the possibility of failures of uniqueness, as well as failures of existence, but recall that the above collection of properties is already not intended as a full or sufficient definition. ↩︎

  15. I mean ‘natural-selection-like’ only in the broadest sense. A central difference is that the selection-process enacted by a practice should have a complex, rich, continuously updated relationship to the best-informed practice-ideals of individuals. The concept of ‘dialectics’ as used in German philosophy may be of relevance if we were to try to describe this relationship in more detail. ↩︎

  16. It should in principle be possible to offer a more exacting analysis here, distinguishing (at least initially) between the development of the value-judgments made within a practice and the development of the evaluable activities performed within the practice. On my view the fact that intra-practice excellence is best fit to properly shape the development of the practice’s value-judgments is principally ‘true by definition,’ and the fact that intra-practice excellence is best fit to properly shape the development of the evaluable activities performed within a practice is principally ‘true by causation.’ ↩︎

  17. The matter of the unity of the adverbial virtues, and of whether it is more like a harmony of different practices or more like the common-factor excellence that underlies locally-measurable mathematical goods in Tao’s account, is for another day. ↩︎

  18. By ‘instrumentally competitive under normal circumstances’ I mean, roughly: in scopes where aggregate x-ness quantities are well-defined, switching from commitment to a eudaimonic decision-procedure for x to a naive-optimization procedure for x isn’t necessarily a long-term winning strategy with regard to aggregate x-ness maximization. ↩︎

  19. A richer account might include a third-tier utility function that takes as its argument the aggregate x-ness of the future actions of all other agents. In this richer account a practice involves three tiers of consideration: the action’s x-ness, the aggregate x-ness of your own future actions, and the aggregate x-ness of the future actions of all other agents. ↩︎

  20. I am referring, in part, to what high-church alignment theory calls the ‘inner alignment problem’ and ‘successor problem.’ ↩︎

  21. Per my discussion in part V, an abstracted ‘aggregate democracy’ quantity will only be determinate in some applications. The claim about the relative effectiveness of practice-commitment and direct optimization refers only to contexts where the quantity is determinate. ↩︎

  22. For a more interesting example, consider an AI that finds itself making trade-offs between different alignment-enabling behavioral values when dealing with humans, and decides to kill all humans to replace them with beings with whom the AI can interact without trade-offs between these values. ↩︎

  23. The difference between criteria '1.' and '2.' is clearest if we think about x-ness as rating state-action pairs. Criterion '1.' is the requirement that if (a,s), (a',s'), and (a'',s'') are historical high x-ness pairs and (a''',s''') is an unseen high x-ness pair, then reinforcing the execution of a in s, a' in s', and a'' in s'' will have the generalization effect of increasing the conditional probability P(a'''|s'''). Criterion '2.' is roughly the requirement that choosing a higher x-ness action in a given state increase expected aggregate future x-ness holding policy constant, by increasing the probability of states with higher expected state-action x-ness value given the current policy. ↩︎
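  One hedged way to restate criterion '2.' in symbols (the notation $x(a,s)$ for the x-ness of taking action $a$ in state $s$, and the fixed policy $\pi$, is mine, not part of the original text):

$$x(a, s) > x(a', s) \;\Longrightarrow\; \mathbb{E}_{\pi}\Big[\sum_{t \ge 1} x(A_t, S_t) \,\Big|\, S_0 = s,\, A_0 = a\Big] \;\ge\; \mathbb{E}_{\pi}\Big[\sum_{t \ge 1} x(A_t, S_t) \,\Big|\, S_0 = s,\, A_0 = a'\Big],$$

  i.e., picking the higher x-ness action now should not, under the unchanged policy, lower the expected aggregate x-ness of what follows.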

AGI Is Not Multimodal

https://thegradient.pub/agi-is-not-multimodal/
Wed, 04 Jun 2025 14:00:29 GMT

"In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence." –Terry Winograd

The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy even our most basic intuitions about it. They have emerged not because they are thoughtful solutions to the problem of intelligence, but because they scaled effectively on hardware we already had. Seduced by the fruits of scale, some have come to believe that it provides a clear pathway to AGI. The most emblematic case of this is the multimodal approach, in which massive modular networks are optimized for an array of modalities that, taken together, appear general. However, I argue that this strategy is sure to fail in the near term; it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination. Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and treat modality-centered processing as an emergent phenomenon.

Preface: Disembodied definitions of Artificial General Intelligence — emphasis on general — exclude crucial problem spaces that we should expect AGI to be able to solve. A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc. As I will discuss in the next section, what is needed for these problems is a form of intelligence that is fundamentally situated in something like a physical world model. For more discussion on this, look out for Designing an Intelligence, edited by George Konidaris (MIT Press, forthcoming).

Why We Need the World, and How LLMs Pretend to Understand It

TLDR: I first argue that true AGI needs a physical understanding of the world, as many problems cannot be converted into a problem of symbol manipulation. It has been suggested by some that LLMs are learning a model of the world through next token prediction, but it is more likely that LLMs are learning bags of heuristics to predict tokens. This leaves them with a superficial understanding of reality and contributes to false impressions of their intelligence.

The most shocking result of the predict-next-token objective is that it yields AI models that reflect a deeply human-like understanding of the world, despite having never observed it like we have. This result has led to confusion about what it means to understand language and even to understand the world — something we have long believed to be a prerequisite for language understanding. One explanation for the capabilities of LLMs comes from an emerging theory suggesting that they induce models of the world through next-token prediction. Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models to similar internal representations, and their favorite rendition of the idea that “language mirrors the structure of reality,” a notion that has been espoused at least by Plato, Wittgenstein, Foucault, and Eco. While I’m generally in support of digging up esoteric texts for research inspiration, I’m worried that this metaphor has been taken too literally. Do LLMs really learn implicit models of the world? How could they otherwise be so proficient at language?

One source of evidence in favor of the LLM world modeling hypothesis is the Othello paper, wherein researchers were able to predict the board state of an Othello game from the hidden states of a transformer model trained on sequences of legal moves. However, there are many issues with generalizing these results to models of natural language. For one, whereas Othello moves can provably be used to deduce the full state of an Othello board, we have no reason to believe that a complete picture of the physical world can be inferred from a linguistic description. What sets the game of Othello apart from many tasks in the physical world is that Othello fundamentally resides in the land of symbols, and is merely implemented using physical tokens to make it easier for humans to play. A full game of Othello can be played with just pen and paper, but one can’t, e.g., sweep a floor, do dishes, or drive a car with just pen and paper. To solve such tasks, you need some physical conception of the world beyond what humans can merely say about it. Whether that conception of the world is encoded in a formal world model or, e.g., a value function is up for debate, but it is clear that there are many problems in the physical world that cannot be fully represented by a system of symbols and solved with mere symbol manipulation.
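For concreteness, here is a minimal sketch of the probing methodology (synthetic stand-in data and illustrative sizes; this is not the original study’s code): one linear classifier per board square is fit on frozen hidden states, and held-out probe accuracy is the quantity of interest.

```python
# Minimal sketch of the linear-probing recipe (synthetic stand-in data, illustrative sizes).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_positions, d_model, n_squares = 1000, 128, 64           # stand-in sizes
hidden_states = rng.normal(size=(n_positions, d_model))    # would be extracted from the model
square_contents = rng.integers(0, 3, size=(n_positions, n_squares))  # 0 empty, 1 black, 2 white

probes = [
    LogisticRegression(max_iter=500).fit(hidden_states, square_contents[:, sq])
    for sq in range(n_squares)
]
# High accuracy of such probes on held-out positions is the evidence cited for the claim
# that the transformer has induced an internal representation of the board.
```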

Another issue, raised in Melanie Mitchell’s recent piece and supported by this paper, is that generative models can score remarkably well on sequence prediction tasks while failing to learn models of the worlds that created such sequence data, e.g. by learning comprehensive sets of idiosyncratic heuristics. For example, it was pointed out in this blog post that OthelloGPT learned sequence prediction rules that don’t actually hold for all possible Othello games, like “if the token for B4 does not appear before A4 in the input string, then B4 is empty.” While one can argue that it doesn’t matter how a world model predicts the next state of the world, it should raise suspicion when that prediction reflects a better understanding of the training data than the underlying world that led to such data. This, unfortunately, is the central fault of the predict-next-token objective, which seeks only to retain information relevant to the prediction of the next token. If it can be done with something easier to learn than a world model, it likely will be.

To claim without caveat that predicting the effects of earlier symbols on later symbols requires a model of the world like the ones humans generate from perception would be to abuse the “world model” notion. Unless we disagree on what the world is, it should be clear that a true world model can be used to predict the next state of the physical world given a history of states. Similar world models, which predict high fidelity observations of the physical world, are leveraged in many subfields of AI including model-based reinforcement learning, task and motion planning in robotics, causal world modeling, and areas of computer vision to solve problems instantiated in physical reality. LLMs are simply not running physics simulations in their latent next-token calculus when they ask you if your person, place, or thing is bigger than a breadbox. In fact, I conjecture that the behavior of LLMs is not thanks to a learned world model, but to brute force memorization of incomprehensibly abstract rules governing the behavior of symbols, i.e. a model of syntax.

Quick primer:

  • Syntax is a subfield of linguistics that studies how words of various grammatical categories (e.g. parts of speech) are arranged together into sentences, which can be parsed into syntax trees. Syntax studies the structure of sentences and the atomic parts of speech that compose them.
  • Semantics is another subfield concerned with the literal meaning of sentences, e.g., compiling “I am feeling chilly” into the idea that you are experiencing cold. Semantics boils language down to literal meaning, which is information about the world or human experience.
  • Pragmatics studies the interplay of physical and conversational context on speech interactions, like when someone knows to close an ajar window when you tell them “I am feeling chilly.” Pragmatics involves interpreting speech while reasoning about the environment and the intentions and hidden knowledge of other agents.

Without getting too technical, there is intuitive evidence that somewhat separate systems of cognition are responsible for each of these linguistic faculties. Look no further than the capability of humans to generate syntactically well-formed sentences that have no semantic meaning, e.g. Chomsky’s famous sentence “Colorless green ideas sleep furiously,” or sentences with well-formed semantics that make no pragmatic sense, e.g. responding merely with “Yes, I can” when asked, “Can you pass the salt?” Crucially, it is the fusion of the disparate cognitive abilities underpinning them that coalesces into human language understanding. For example, there isn’t anything syntactically wrong with the sentence, “The fridge is in the apple,” as a syntactic account of “the fridge” and “the apple” would categorize them as noun phrases that can be used to produce a sentence with the production rule, S → (NP “is in” NP). However, humans recognize an obvious semantic failure in the sentence that becomes apparent after attempting to reconcile its meaning with our understanding of reality: we know that fridges are larger than apples, and could not be fit into them.

But what if you have never perceived the real world, yet still were trying to figure out whether the sentence was ill-formed? One solution could be to embed semantic information at the level of syntax, e.g., by inventing new syntactic categories, NP_{the fridge} and NP_{the apple}, and a single new production rule that prevents semantic misuse: S → (NP_{the apple} “is in” NP_{the fridge}). While this strategy would no longer require grounded world knowledge about fridges and apples, it would require special grammar rules for every semantically well-formed construction… which is actually possible to learn given a massive corpus of natural language. Crucially, this would not be the same thing as grasping semantics, which in my view is fundamentally about understanding the nature of the world.
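Here is a toy NLTK illustration of that move; both grammars are invented for this example, and the point is only that a lexicalized rule can block the “wrong” sentence without any grounded knowledge of fridges or apples.

```python
# A toy NLTK illustration of "pushing semantics into syntax"; both grammars are invented here.
import nltk

plain = nltk.CFG.fromstring("""
    S -> NP 'is' 'in' NP
    NP -> 'the' 'fridge' | 'the' 'apple'
""")
lexicalized = nltk.CFG.fromstring("""
    S -> NP_APPLE 'is' 'in' NP_FRIDGE
    NP_APPLE -> 'the' 'apple'
    NP_FRIDGE -> 'the' 'fridge'
""")

def n_parses(grammar, sentence):
    return len(list(nltk.ChartParser(grammar).parse(sentence.split())))

print(n_parses(plain, "the fridge is in the apple"))        # 1: syntactically well-formed
print(n_parses(lexicalized, "the fridge is in the apple"))  # 0: blocked, but only by a memorized rule
print(n_parses(lexicalized, "the apple is in the fridge"))  # 1
```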

Finding that LLMs have reduced problems of semantics and pragmatics into syntax would have profound implications on how we should view their intelligence. People often treat language proficiency as a proxy for general intelligence by, e.g., strongly associating pragmatic and semantic understanding with the cognitive abilities that undergird them in humans. For example, someone who appears well-read and graceful in navigating social interactions is likely to score high in traits like sustained attention and theory of mind, which lie closer to measures of raw cognitive ability. In general, these proxies are reasonable for assessing a person’s general intelligence, but not an LLM’s, as the apparent linguistic skills of LLMs could come from entirely separate mechanisms of cognition.

The Bitter Lesson Revisited

TLDR: Sutton’s Bitter Lesson has sometimes been interpreted as meaning that making any assumptions about the structure of AI is a mistake. This is both unproductive and a misinterpretation; it is precisely when humans think deeply about the structure of intelligence that major advancements occur. Despite this, scale maximalists have implicitly suggested that multimodal models can be a structure-agnostic framework for AGI. Ironically, today’s multimodal models contradict Sutton’s Bitter Lesson by making implicit assumptions about the structure of individual modalities and how they should be sewn together. In order to build AGI, we must either think deeply about how to unite existing modalities, or dispense with them altogether in favor of an interactive and embodied cognitive process.


The paradigm that led to the success of LLMs is marked primarily by scale, not efficiency. We have effectively trained a pile of one trillion ants for one billion years to mimic the form and function of a Formula 1 race car; eventually it gets there, but wow was the process inefficient. This analogy nicely captures a debate between structuralists, who want to build things like "wheels" and "axles" into AI systems, and scale maximalists, who want more ants, years, and F1 races to train on. Despite many decades of structuralist study in linguistics, the unstructured approaches of scale maximalism have yielded far better ant-racecars in recent years. This was most notably articulated by Rich Sutton — a recent recipient of the Turing Award along with Andy Barto for their work in Reinforcement Learning — in his piece “The Bitter Lesson.”

[W]e should build in only the meta-methods that can find and capture this arbitrary complexity… Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. - Rich Sutton

Sutton’s argument is that methods that leverage computational resources will outpace methods that do not, and that any structure for problem-solving built as an inductive bias into AI will hinder it from learning better solutions. This is a compelling argument that I believe has been seriously misinterpreted by some as implying that making any assumptions about structure is a false step. It is, in fact, human intuition that was responsible for many significant advancements in the development of SOTA neural network architectures. For example, Convolutional Neural Networks made an assumption about translation invariance for pattern recognition in images and kickstarted the modern field of deep learning for computer vision; the attention mechanism of Transformers made an assumption about the long-distance relationships between symbols in a sentence that made ChatGPT possible and had nearly everyone drop their RNNs; and 3D Gaussian Splatting made an assumption about the solidity of physical objects that made it more performant than NeRFs. Potentially none of these methodological assumptions apply to the entire domain of possible scenes, images, or token streams, but they do for the specific ones that humans have curated and formed structural intuitions about. Let’s not forget that humans have co-evolved with the environments that these datasets are drawn from.

The real question is how we might heed Sutton’s Bitter Lesson in our development of AGI. The scale maximalist approach worked for LLMs and LVMs (large vision models) because we had natural deposits of text and image data, but an analogous application of scale maximalism to AGI would require forms of embodiment data that we simply don’t have. One solution to this data scarcity issue extends the generative modeling paradigm to multimodal modeling — encompassing language, vision, and action — with the hope that a general intelligence can be built by summing together general models of narrow modalities.

There are multiple issues with this approach. First, there are deep connections between modalities that are unnaturally severed in the multimodal setting, making the problem of concept synthesis ever more difficult. In practice, uniting modalities often involves pre-training dedicated neural modules for each modality and joining them together into a joint embedding space. In the early days, this was achieved by nudging the embeddings of, e.g. (language, vision, action) tuples to converge to similar latent vectors of meaning, a vast oversimplification of the kinds of relationships that may exist between modalities. One can imagine, e.g., captioning an image at various levels of abstraction, or implementing the same linguistic instruction with different sets of physical actions. Such one-to-many relationships suggest that a contrastive embedding objective is not suitable.
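To make the criticism concrete, here is a minimal sketch of the kind of contrastive objective in question (a CLIP-style loss written for illustration, not the implementation of any particular system):

```python
# Minimal sketch of a CLIP-style contrastive objective. Paired image/text embeddings are
# pulled together; every other caption in the batch is treated as a negative.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim); row i of each is one (image, caption) pair
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # similarity of every image to every caption
    targets = torch.arange(image_emb.size(0))          # the matching caption is the "correct class"
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# One-to-many relationships (one image, several valid captions) have no good answer here:
# the loss must treat all but one caption in the batch as negatives.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```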

While modern approaches do not make such stringent assumptions about how modalities should be united, they still universally encode percepts from all modalities (e.g. text, images) into the same latent space. Intuitively, it would seem that such latent spaces could serve as common conceptual ground across modalities, analogous to a space of human concepts. However, these latent spaces do not cogently capture all information relevant to a concept, and instead rely on modality-specific decoders to flesh out important details. The “meaning” of a percept is not in the vector it is encoded as, but in the way relevant decoders process this vector into meaningful outputs. As long as various encoders and decoders are subject to modality-specific training objectives, “meaning” will be decentralized and potentially inconsistent across modalities, especially as a result of pre-training. This is not a recipe for the formation of coherent concepts.

Furthermore, it is not clear that today’s modalities are an appropriate partitioning of the observation and action spaces for an embodied agent. It is not obvious that, e.g., images and text should be represented as separate observation streams, nor text production and motion planning as separate action capabilities. The human capacities for reading, seeing, speaking, and moving are ultimately mediated by overlapping cognitive structures. Making structural assumptions about how modalities ought to be processed is likely to hinder the discovery of more fundamental cognition that is responsible for processing data in all modalities. One solution would be to consolidate unnaturally partitioned modalities into a unified data representation. This would encourage networks to learn intelligent processes that generalize across modalities. Intuitively, a model that can understand the visual world as well as humans can — including everything from human writing to traffic signs to visual art — should not make a serious architectural distinction between images and text. Part of the reason why VLMs can’t, e.g., count the number of letters in a word is that they can’t see what they are writing.
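As one hypothetical way to make such consolidation tangible, text can simply be rendered into pixels so that a single perception system consumes writing and images alike; the snippet below is only meant to illustrate the idea, not to propose a finished method.

```python
# Hypothetical illustration of consolidating text into the visual modality: render a string
# into pixels so that a single perception system consumes writing and images alike.
import numpy as np
from PIL import Image, ImageDraw

def text_as_image(s, size=(256, 32)):
    canvas = Image.new("L", size, color=255)         # blank grayscale canvas
    ImageDraw.Draw(canvas).text((2, 8), s, fill=0)   # draw the string with the default font
    return np.asarray(canvas) / 255.0                # same array type a vision model already eats

pixels = text_as_image("the fridge is in the apple")
print(pixels.shape)  # (32, 256): "reading" becomes just another visual task
```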

Finally, the learn-from-scale approach trains models to copy the conceptual structure of humans instead of learning the general capability to form novel concepts on their own. Humans have spent hundreds of thousands of years refining concepts and passing them memetically through culture and language. Today’s models are trained only on the end result of this process: the present-day conceptual structures that make it into the corpus. By optimizing for the ultimate products of our intelligence, we have ignored the question of how those products were invented and discovered. Humans have a unique ability to form durable concepts from few examples, ascribe names to them, reason about them analogically, etc. While the in-context capabilities of today’s models can be impressive, they grow increasingly limited as tasks become more complex and stray further from the training data. The flexibility to form new concepts from experience is a foundational attribute of general intelligence, and we should think carefully about how it arises.

While structure-agnostic scale maximalism has succeeded in producing LLMs and LVMs that pass Turing tests, a multimodal scale maximalist approach to AGI will not bear similar fruit. Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally. For example, my recent paper on visual theory of mind saw abstract symbols naturally emerge from communication between image-classifying agents, blurring the lines between text and image processing. Eventually, we should hope to reintegrate as many features of intelligence as possible under the same umbrella. However, it is not clear whether there is genuine commercial viability in such an approach as long as scaling and fine-tuning narrow intelligence models solves commercial use-cases.

Conclusion

The overall promise of scale maximalism is that a Frankenstein AGI can be sewn together using general models of narrow domains. I argue that this is extremely unlikely to yield an AGI that feels complete in its intelligence. If we intend to continue reaping the streamlined efficiency of modality-specific processing, we must be intentional in how modalities are united — ideally drawing from human intuition and classical fields of study, e.g. this work from MIT. Alternatively, we can re-formulate learning as an embodied and interactive process where disparate modalities naturally fuse together. We could do this by, e.g., processing images, text, and video using the same perception system and producing actions for generating text, manipulating objects, and navigating environments using the same action system. What we will lose in efficiency we will gain in flexible cognitive ability.

In a sense, the most challenging mathematical piece of the AGI puzzle has already been solved: the discovery of universal function approximators. What’s left is to inventory the functions we need and determine how they ought to be arranged into a coherent whole. This is a conceptual problem, not a mathematical one.


Acknowledgements

I would like to thank Lucas Gelfond, Daniel Bashir, George Konidaris, and my father, Joseph Spiegel, for their thoughtful and thorough feedback on this work. Thanks to Alina Pringle for the wonderful illustration made for this piece.

Author Bio

Benjamin is a PhD candidate in Computer Science at Brown University. He is interested in models of language understanding that ground meaning to elements of structured decision-making. For more info see his personal website.

Citation

For attribution in academic contexts or books, please cite this work as

Benjamin A. Spiegel, "AGI Is Not Multimodal", The Gradient, 2025.
@article{spiegel2025agi,
    author = {Benjamin A. Spiegel},
    title = {AGI Is Not Multimodal},
    journal = {The Gradient},
    year = {2025},
    howpublished = {\url{https://thegradient.pub/agi-is-not-multimodal}},
}

References

Andreas, Jacob. “Language Models, World Models, and Human Model-Building.” Mit.edu, 2024, lingo.csail.mit.edu/blog/world_models/.

Belkin, Mikhail, et al. "Reconciling modern machine-learning practice and the classical bias–variance trade-off." Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854.

Bernhard Kerbl, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Transactions on Graphics, vol. 42, no. 4, 26 July 2023, pp. 1–14, https://doi.org/10.1145/3592433.

Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge, Massachusetts: MIT Press.

Designing an Intelligence. Edited by George Konidaris, MIT Press, 2026.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.

Eye on AI. “The Mastermind behind GPT-4 and the Future of AI | Ilya Sutskever.” YouTube, 15 Mar. 2023, www.youtube.com/watch?v=SjhIlw3Iffs&list=PLpdlTIkm0-jJ4gJyeLvH1PJCEHp3NAYf4&index=64. Accessed 18 May 2025.

Frank, Michael C. “Bridging the data gap between children and large language models.” Trends in cognitive sciences vol. 27,11 (2023): 990-992. doi:10.1016/j.tics.2023.08.007

Garrett, Caelan Reed, et al. "Integrated task and motion planning." Annual Review of Control, Robotics, and Autonomous Systems 4.1 (2021): 265-293.

Goodhart, C.A.E. (1984). Problems of Monetary Management: The UK Experience. In: Monetary Theory and Practice. Palgrave, London. https://doi.org/10.1007/978-1-349-17295-5_4

Hooker, Sara. The hardware lottery. Commun. ACM 64, 12 (December 2021), 58–65. https://doi.org/10.1145/3467017

Huh, Minyoung, et al. "The Platonic Representation Hypothesis." Forty-first International Conference on Machine Learning. 2024.

Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).

Lake, Brenden M. et al. “Building Machines That Learn and Think like People.” Behavioral and Brain Sciences 40 (2017): e253. Web.

Li, Kenneth, et al. "Emergent world representations: Exploring a sequence model trained on a synthetic task." ICLR (2023).

Luiten, Jonathon, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. "Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis." 3DV. 2024.

Mao, Jiayuan, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision." International Conference on Learning Representations. 2019.

Mitchell, Melanie. “LLMs and World Models, Part 1.” Substack.com, AI: A Guide for Thinking Humans, 13 Feb. 2025, aiguide.substack.com/p/llms-and-world-models-part-1. Accessed 18 May 2025.

Mu, Norman. “Norman Mu | the Myth of Data Inefficiency in Large Language Models.” Normanmu.com, 14 Feb. 2025, www.normanmu.com/2025/02/14/data-inefficiency-llms.html. Accessed 18 May 2025.

Newell, Allen, and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols and Search.” Communications of the ACM, vol. 19, no. 3, 1 Mar. 1976, pp. 113–126, https://doi.org/10.1145/360018.360022.

Peng, Hao, et al. “When Does In-Context Learning Fall Short and Why? A Study on Specification-Heavy Tasks.” ArXiv.org, 2023, arxiv.org/abs/2311.08993.

Spiegel, Benjamin, et al. “Visual Theory of Mind Enables the Invention of Early Writing Systems.” CogSci, 2025, arxiv.org/abs/2502.01568.

Sutton, Richard S. Introduction to Reinforcement Learning. Cambridge, MA: MIT Press, 1998.

Vafa, Keyon, et al. "Evaluating the world model implicit in a generative model." Advances in Neural Information Processing Systems 37 (2024): 26941-26975.

Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (December 2017). "Attention is All you Need". In I. Guyon and U. Von Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett (ed.). 31st Conference on Neural Information Processing Systems (NIPS). Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc. arXiv:1706.03762.

Winograd, Terry. “Thinking Machines: Can There Be? Are We?” The Boundaries of Humanity: Humans, Animals, Machines, edited by James Sheehan and Morton Sosna, Berkeley: University of California Press, 1991, pp. 198–223.

Wu, Shangda, et al. "Beyond language models: Byte models are digital world simulators." arXiv preprint arXiv:2402.19155 (2024).

Shape, Symmetries, and Structure: The Changing Role of Mathematics in Machine Learning Research

https://thegradient.pub/shape-symmetry-structure/
Sat, 16 Nov 2024 16:46:15 GMT

What is the Role of Mathematics in Modern Machine Learning?

The past decade has witnessed a shift in how progress is made in machine learning. Research involving carefully designed and mathematically principled architectures results in only marginal improvements, while compute-intensive and engineering-first efforts that scale to ever larger training sets and model parameter counts result in remarkable new capabilities unpredicted by existing theory. Mathematics and statistics, once the primary guides of machine learning research, now struggle to provide immediate insight into the latest breakthroughs. This is not the first time that empirical progress in machine learning has outpaced more theory-motivated approaches, yet the magnitude of recent advances has forced us to swallow the bitter pill of the “Bitter Lesson” yet again [1].

This shift has prompted speculation about mathematics’ diminished role in machine learning research moving forward. It is already evident that mathematics will have to share the stage with a broader range of perspectives (for instance, biology, which has deep experience drawing conclusions about irreducibly complex systems, or the social sciences, as AI is integrated ever more deeply into society). The increasingly interdisciplinary nature of machine learning should be welcomed as a positive development by all researchers.

However, we argue that mathematics remains as relevant as ever; its role is simply evolving. For example, whereas mathematics might once have primarily provided theoretical guarantees on model performance, it may soon be more commonly used for post-hoc explanations of empirical phenomena observed in model training and performance–a role analogous to one that it plays in physics. Similarly, while mathematical intuition might once have guided the design of handcrafted features or architectural details at a granular level, its use may shift to higher-level design choices such as matching architecture to underlying task structure or data symmetries.

None of this is completely new. Mathematics has always served multiple purposes in machine learning. After all, the translation equivariant convolutional neural network, which exemplifies the idea of matching architecture to data symmetries mentioned above, is now over 40 years old. What’s changing are the kinds of problems where mathematics will have the greatest impact and the ways it will most commonly be applied.

An intriguing consequence of the shift towards scale is that it has broadened the scope of the fields of mathematics applicable to machine learning. “Pure” mathematical domains such as topology, algebra, and geometry are now joining the more traditionally applied fields of probability theory, analysis, and linear algebra. These pure fields have grown and developed over the last century to handle high levels of abstraction and complexity, helping mathematicians make discoveries about spaces, algebraic objects, and combinatorial processes that at first glance seem beyond human intuition. These capabilities promise to address many of the biggest challenges in modern deep learning.

In this article we will explore several areas of current research that demonstrate the enduring ability of mathematics to guide the process of discovery and understanding in machine learning.


Figure 1: Mathematics can illuminate the ways that ReLU-based neural networks shatter input space into countless polygonal regions, in each of which the model behaves like a linear map [2, 3, 4]. These decompositions create beautiful patterns. (Figure made with SplineCam [5]).

Describing an Elephant from a Pin Prick

Suppose you are given a 7-billion-parameter neural network with 50 layers and are asked to analyze it; how would you begin? The standard procedure would be to calculate relevant performance statistics, for instance the accuracy on a suite of evaluation benchmarks. In certain situations, this may be sufficient. However, deep learning models are complex and multifaceted. Two computer vision models with the same accuracy may differ greatly in generalization to out-of-distribution data, calibration, adversarial robustness, and other “secondary statistics” that are critical in many real-world applications. Beyond this, all evidence suggests that to build a complete scientific understanding of deep learning, we will need to venture beyond evaluation scores. Indeed, just as it is impossible to capture all the dimensions of humanity with a single numerical quantity (e.g., IQ, height), trying to understand a model by one or even several statistics alone is fundamentally limiting.

One difference between understanding a human and understanding a model is that we have easy access to all model parameters and all the individual computations that occur in a model. Indeed, by extracting a model’s hidden activations we can directly trace the process by which a model converts raw input into a prediction. Unfortunately, the world of hidden activations is far less hospitable than that of simple model performance statistics. Like the initial input, hidden activations are usually high dimensional, but unlike input data they are not structured in a form that humans can understand. If we venture into even higher dimensions, we can try to understand a model through its weights directly. Here, in the space of model weights, we have the freedom to move in millions to billions of orthogonal directions from a single starting point. How do we even begin to make sense of these worlds?

There is a well-known fable in which three blind men each feel a different part of an elephant. The description that each gives of the animal is completely different, reflecting only the body part that that man felt. We argue that unlike the blind men who can at least use their hand to feel a substantial part of one of the elephant’s body parts, current methods of analyzing the hidden activations and weights of a model are akin to trying to describe the elephant from the touch of a single pin.

Tools to Characterize What We Cannot Visualize

Despite the popular perception that mathematicians exclusively focus on solving problems, much of research mathematics involves understanding the right questions to ask in the first place. This is natural since many of the objects that mathematicians study are so far removed from everyday experience that we start with very limited intuition for what we can hope to actually understand. Substantial effort is often required to build up tools that will enable us to leverage our existing intuition and achieve tractable results that increase our understanding. The concept of a rotation provides a nice example of this situation: rotations are very familiar in 2 and 3 dimensions, but become less and less accessible to everyday intuition as their dimension grows larger. In the higher-dimensional case, the differing perspectives provided by pure mathematics become more and more important for gaining a holistic picture of what these objects actually are.

Those who know a little linear algebra will remember that rotations generalize to higher dimensions and that in $n$-dimensions they can be realized by $n \times n$ orthogonal matrices with determinant $1$. The set of these is commonly written as $SO(n)$ and called the special orthogonal group. Suppose we want to understand the set of all $n$-dimensional rotations. There are many complementary approaches to doing this. We can explore the linear algebraic structure of all matrices in $SO(n)$ or study $SO(n)$ based on how each element behaves as an operator acting on $\mathbb{R}^n$.
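As a small computational companion to the matrix description (standard numerical linear algebra, shown only for illustration), the sketch below samples a random element of $SO(512)$ and checks the defining properties:

```python
# Numerical companion to the matrix description of SO(n): sample a random rotation via a
# QR decomposition and check the defining properties. (Standard trick, shown for illustration.)
import numpy as np

def random_rotation(n, seed=0):
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    q = q * np.sign(np.diag(r))       # sign fix so the draw is uniform over O(n)
    if np.linalg.det(q) < 0:          # flip one column to land in SO(n) rather than O(n) \ SO(n)
        q[:, 0] = -q[:, 0]
    return q

R = random_rotation(512)
print(np.allclose(R.T @ R, np.eye(512)))   # orthogonality: R^T R = I
print(np.isclose(np.linalg.det(R), 1.0))   # determinant +1
print(np.isclose(np.linalg.det(R @ random_rotation(512, seed=1)), 1.0))  # products stay in SO(512)
```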

Alternatively, we can also try to use our innate spatial intuition to understand $SO(n)$. This turns out to be a powerful perspective in math. In any dimension $n$, $SO(n)$ is a geometric object called a manifold: very roughly, a space that locally looks like Euclidean space, but which may have twists, holes, and other non-Euclidean features when we zoom out. Indeed, whether we make it precise or not, we all have a sense of whether two rotations are “close” to each other. For example, the reader would probably agree that $2$-dimensional rotations of $90^\circ$ and $91^\circ$ “feel” closer than rotations of $90^\circ$ and $180^\circ$. When $n=2$, one can show that the set of all rotations is geometrically “equivalent” to a $1$-dimensional circle. So, much of what we know about the circle can be translated to $SO(2)$.

What happens when we want to study the geometry of rotations in $n$-dimensions for $n > 3$? If $n = 512$ (a latent space for instance), this amounts to studying a manifold in $512^2$-dimensional space. Our visual intuition is seemingly useless here since it is not clear how concepts that are familiar in 2- and 3-dimensions can be utilized in $512^2$-dimensions. Mathematicians have been confronting the problem of understanding the un-visualizable for hundreds of years. One strategy is to find generalizations of familiar spatial concepts from $2$ and $3$-dimensions to $n$-dimensions that connect with our intuition.

This approach is already being used to better understand and characterize experimental observations about the space of model weights, hidden activations, and input data of deep learning models. We provide a taste of such tools and applications here:

  • Intrinsic Dimension: Dimension is a concept that is familiar not only from our experience in the spatial dimensions that we can readily access, 1-, 2-, and 3-dimensions, but also from more informal notions of “degrees of freedom” in everyday systems such as driving a car (forward/back, turning the steering wheel either left or right). The notion of dimension arises naturally in the context of machine learning where we may want to capture the number of independent ways in which a dataset, learned representation, or collection of weight matrices actually vary.

    In formal mathematics, the definitions of dimension depend on the kind of space one is studying but they all capture some aspect of this everyday intuition. As a simple example, if I walk along the perimeter of a circle, I am only able to move forward and backward, and thus the dimension of this space is $1$. For spaces like the circle which are manifolds, dimension can be formally defined by the fact that a sufficiently small neighborhood around each point looks like a subset of some Euclidean space $\mathbb{R}^k$. We then say that the manifold is $k$-dimensional. If we zoom in on a small segment of the circle, it almost looks like a segment of $\mathbb{R} = \mathbb{R}^1$, and hence the circle is $1$-dimensional.

    The manifold hypothesis posits that many types of data (at least approximately) live on a low-dimensional manifold even though they are embedded in a high-dimensional space. If we assume that this is true, it makes sense that the dimension of this underlying manifold, called the intrinsic dimension of the data, is one way to describe the complexity of the dataset. Researchers have estimated intrinsic dimension for common benchmark datasets, showing that intrinsic dimension appears to be correlated to the ease with which models generalize from training to test sets [6], and can explain differences in model performance and robustness in different domains such as medical images [7]. Intrinsic dimension is also a fundamental ingredient in some proposed explanations of data scaling laws [8, 9], which underlie the race to build ever bigger generative models.

    Researchers have also noted that the intrinsic dimension of hidden activations tends to change in a characteristic way as information passes through the model [10, 11] or over the course of the diffusion process [12]. These and other insights have led to the use of intrinsic dimension in detecting adversarial examples [13] and AI-generated content [14], identifying the layers where hidden activations contain the richest semantic content [11], and flagging hallucinations in generative models [15]. (A minimal implementation of one standard intrinsic-dimension estimator is sketched just after this list.)


  • Curvature: While segments of the circle may look “straight” when we zoom up close enough, their curvature means that they will never be exactly linear as a straight line is. The notion of curvature is a familiar one and once formalized, it offers a way of rigorously measuring the extent to which the area around a point deviates from being linear. Care must be taken, however. Much of our everyday intuition about curvature assumes a single dimension. On manifolds with dimension $2$ or greater, there are multiple, linearly independent directions that we can travel away from a point and each of these may have a different curvature (in the $1$-dimensional sense). As a result, there are a range of different generalizations of curvature for higher-dimensional spaces, each with slightly different properties.

    The notion of curvature has played a central role in deep learning, especially with respect to the loss landscape, where changes in curvature have been used to analyze training trajectories [16]. Curvature is also central to an intriguing phenomenon known as the ‘edge of stability’, wherein the curvature of the loss landscape over the course of training increases as a function of learning rate until it hovers around the point where the training run is close to becoming unstable [17]. In another direction, curvature has been used to quantify the extent to which model predictions change as the input changes. For instance, [18] provided evidence that higher curvature in decision boundaries correlates with higher vulnerability to adversarial examples and suggested a new regularization term to reduce this. Finally, motivated by work in neuroscience, [19] presented a method that uses curvature to highlight interesting differences in representation between the raw training data and a neural network’s internal representation. A network may stretch and expand parts of the input space, generating regions of high curvature as it magnifies the representation of training examples that have a higher impact on the loss function.

  • Topology: Both dimension and curvature capture local properties of a space that can be measured by looking at the neighborhood around a single point. On the other hand, the most notable feature of our running example, the circle, is neither its dimension nor its curvature, but rather the fact that it is circular. We can only see this aspect by analyzing the whole space at once. Topology is the field of mathematics that focuses on such “global” properties.

    Topological tools such as homology, which counts the number of holes in a space, have been used to illuminate the way that neural networks process data, with [20] showing that deep learning models “untangle” data distributions, reducing their complexity layer by layer. Versions of homology have also been applied to the weights of networks to better understand their structural features, with [21] showing that such topological statistics can reliably predict optimal early-stopping times. Finally, since topology provides frameworks that capture the global aspects of a space, it has proved a rich source of ideas for how to design networks that capture higher-order relationships within data, leading to a range of generalizations of graph neural networks built on top of topological constructions [22, 23, 24, 25].
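The intrinsic-dimension sketch promised above: a minimal implementation of the TwoNN estimator of Facco et al., applied to synthetic data whose intrinsic dimension we control. (The code is illustrative and not an exact reproduction of any of the cited works.)

```python
# Minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017), which uses
# only the ratio of each point's first and second nearest-neighbor distances.
import numpy as np
from scipy.spatial import cKDTree

def twonn_dimension(points):
    # points: (n_samples, ambient_dim)
    dists, _ = cKDTree(points).query(points, k=3)   # columns: self, 1st and 2nd neighbors
    mu = dists[:, 2] / dists[:, 1]                   # ratio r2 / r1 for every point
    return len(points) / np.sum(np.log(mu))          # maximum-likelihood estimate of d

# A 2-dimensional plane embedded in 64-dimensional space: the ambient dimension is 64,
# but the estimator recovers an intrinsic dimension close to 2.
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 64))
print(twonn_dimension(data))
```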

While the examples above have each been useful for gaining insight into phenomena related to deep learning, they were all developed to address challenges in other fields. We believe that a bigger payoff will come when the community uses the geometric paradigm described here to build new tools specifically designed to address the challenges that deep learning poses. Progress in this direction has already begun. Think, for instance, of linear mode connectivity, which has helped us to better understand the loss landscape of neural networks [26], or work on the linear representation hypothesis, which has helped to illuminate the way that concepts are encoded in the latent space of large language models [27]. One of the most exciting occurrences in mathematics is when the tools from one domain provide unexpected insight in another. Think of the discovery that Riemannian geometry provides some of the mathematical language needed for general relativity. We hope that a similar story will eventually be told for geometry and topology’s role in deep learning.

Symmetries in data, symmetries in models

Symmetry is a central theme in mathematics, allowing us to break a problem into simpler components that are easier to solve. Symmetry has long played an important role in machine learning, particularly computer vision. In the classic dog vs. cat classification task for instance, an image that contains a dog continues to contain a dog regardless of whether we move the dog from one part of the image to another, whether we rotate the dog, or whether we reflect it. We say that the task is invariant to image translation, rotation, and reflection.

The notion of symmetry is mathematically encoded in the concept of a group, which is a set $G$ equipped with a binary operation $\star$ that takes two elements of $G$, $g_1$, $g_2$ as input and produces a third $g_1\star g_2$ as output. You can think of the integers $\mathbb{Z}$ with the binary operation of addition ($\star = +$) or the non-zero real numbers with the binary operation of multiplication ($\star = \times$). The set of $n$-dimensional rotations, $SO(n)$, also forms a group. The binary operation takes two rotations and returns a third rotation that is defined by simply applying the first rotation and then applying the second.

Groups satisfy axioms that ensure that they capture familiar properties of symmetries. For example, for any symmetry transformation, there should be an inverse operation that undoes the symmetry. If I rotate a circle by $90^{\circ}$, then I can rotate it back by $-90^{\circ}$ and return to where I started. Notice that not all transformations satisfy this property. For instance, there isn’t a well-defined inverse for downsampling an image. Many different images downsample to the same (smaller) image.

In the previous section we gave two definitions of $SO(n)$: the first was the geometric definition, as rotations of $\mathbb{R}^n$, and the second was as a specific subset of $n \times n$ matrices. While the former definition may be convenient for our intuition, the latter has the benefit that linear algebra is something that we understand quite well at a computational level. The realization of an abstract group as a set of matrices is called a linear representation and it has proven to be one of the most fruitful methods of studying symmetry. It is also the way that symmetries are usually leveraged when performing computations (for example, in machine learning).

We saw a few examples of symmetries that can be found in the data of a machine learning task, such as the translation, rotation, and reflection symmetries in computer vision problems. Consider the case of a segmentation model. If one rotates an input image by $45^{\circ}$ and then puts it through the model, we hope to get a $45^{\circ}$ rotation of the segmentation prediction for the un-rotated image (this is illustrated in Figure 2). After all, we haven’t changed the content of the image.


Figure 2: The concept of rotation equivariance illustrated for a segmentation model. One gets the same output regardless of whether one rotates first and then applies the network or applies the network and then rotates.


Figure 3: Equivariance holds when taking the top path (applying the network first and then the symmetry action) gives the same result as taking the bottom path (applying the symmetry transformation and then the network).

This property of a function (including neural networks), that applying a symmetry transformation before the function yields the same result as applying the symmetry transformation after the function, is called equivariance and can be captured by the diagram in Figure 3. The key point is that we get the same result whether we follow the upper path (applying the network first and then applying the group action) or the lower path (applying the group action first and then applying the network). Conveniently, the concept of invariance, where applying a symmetry operation to the input has no effect on the output of the function, is a special case of equivariance in which the action on the output space is defined to be trivial (applying symmetry actions does nothing).

Invariance and equivariance in deep learning models can be beneficial for a few reasons. Firstly, such a model will yield more predictable and consistent results across symmetry transformations. Secondly, through equivariance we can sometimes simplify the learning process with fewer parameters (compare the number of parameters in a convolutional neural network and an MLP of similar performance) and fewer modes of variation to learn in the data (a rotation invariant image classifier only needs to learn one orientation of each object rather than all possible orientations).

But how do we ensure that our model is equivariant? One way is to build our network with layers that are equivariant by design. By far the most well-known example of this is the convolutional neural network, whose layers are (approximately) equivariant to image translation. This is one reason why using a convolutional neural network for dog vs cat classification doesn’t require learning to recognize a dog at every location in an image as it might with an MLP. With a little thought, one can often come up with layers which are equivariant to a specific group. Unfortunately, being constrained to equivariant layers that we find in an ad-hoc manner often leaves us with a network with built-in equivariance but limited expressivity.
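The equivariance diagram of Figure 3 is easy to check numerically in this best-known case: a convolution layer commutes with image translation. The sketch below (illustrative only) uses circular padding so the commutation is exact rather than approximate at the boundary.

```python
# Numerical check of the equivariance diagram for a convolution layer: translating then
# convolving equals convolving then translating (exactly here, thanks to circular padding).
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

def translate(img, dx=5, dy=3):
    return torch.roll(img, shifts=(dy, dx), dims=(-2, -1))

top_path = translate(conv(x))      # network first, then the symmetry action
bottom_path = conv(translate(x))   # symmetry action first, then the network
print(torch.allclose(top_path, bottom_path, atol=1e-6))  # True: the diagram commutes
```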

Fortunately, for most symmetry groups arising in machine learning, representation theory offers a comprehensive description of all possible linear equivariant maps. Indeed, it is a beautiful mathematical fact that all such maps are built from atomic building blocks called irreducible representations. Happily, in many cases, the number of these irreducible representations is finite. Understanding the irreducible representations of a group can be quite powerful. Those familiar with the ubiquitous discrete Fourier transform (DFT) of a sequence of length $n$ are already familiar with the irreducible representations of one group, the cyclic group generated by a rotation by $360 ^{\circ}/n$ (though we note that moving between the description we give here and the description of the DFT found in the signal processing literature takes a little thought).

There is now a rich field of research in deep learning that uses group representations to systematically build expressive equivariant architectures. Some examples of symmetries that have been particularly well-studied include: rotation and reflection of images [28, 29, 30, 31], 3-dimensional rotation and translation of molecular structures [32] or point clouds [33], and permutations for learning on sets [34] or nodes of a graph [35]. Encoding equivariance to more exotic symmetries has also proven useful for areas such as theoretical physics [36] and data-driven optimization [37].

Equivariant layers and other architectural approaches to symmetry awareness are a prime example of using mathematics to inject high-level priors into a model. Do these approaches represent the future of learning in the face of data symmetries? Anecdotally, the most common approach to learning on data with symmetries continues to be using enough training data and enough data augmentation for the model to learn to handle the symmetries on its own. Two years ago, the author would have speculated that these latter approaches only work for simple cases, such as symmetries in 2 dimensions, and would be outperformed by models which are equivariant by design once symmetries become more complex. Yet, we continue to be surprised by the power of scale. After all, AlphaFold3 [38] uses a non-equivariant architecture despite learning on data with several basic symmetries. We speculate that there may be a threshold, determined by the complexity of the symmetry on the one hand and the amount of training data on the other, that decides whether built-in equivariance will outperform learned equivariance [39, 40].

If this is true, we can expect to see models move away from bespoke equivariant architectures as larger datasets become available for a specific application. At the same time, since compute will always be finite, we predict that there will be some applications with exceptionally complex symmetries that will always require some built-in priors (for example, AI for math or algorithmic problems). Regardless of where we land on this spectrum, mathematicians can look forward to an interesting comparison of the ways humans inject symmetry into models vs the way that models learn symmetries on their own [41, 42].


Figure 4: A cartoon illustrating why adding a permutation and its inverse before and after a pointwise nonlinearity produces an equivalent model (even though the weights will be different). Since permutations can be realized by permutation matrices, the crossed arrows on the right can be merged into the fully-connected layer.

Of course, symmetry is present not only in data but also in the models themselves. For instance, the function computed by a network is invariant to permutations of its hidden neurons. We can permute the activations before they enter the non-linearity and un-permute them afterward; since these permutations can be absorbed into the adjacent weight matrices, the model (as a function) does not change (Figure 4). This means that we have an easy recipe for generating an exponentially large number of networks that have different weights but behave identically on data.
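Here is a minimal sketch (assuming PyTorch) of this recipe for a two-layer MLP: permuting the rows of the first weight matrix and the columns of the second produces different weights but an identical function.

```python
# Permuting hidden neurons changes the weights but not the function the network computes.
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(16, 4), torch.randn(16)    # hidden layer weights and biases
W2, b2 = torch.randn(3, 16), torch.randn(3)     # output layer weights and biases

def mlp(x, W1, b1, W2, b2):
    return torch.relu(x @ W1.T + b1) @ W2.T + b2

perm = torch.randperm(16)
W1p, b1p = W1[perm], b1[perm]                   # permute the hidden units
W2p = W2[:, perm]                               # absorb the inverse permutation into the next layer

x = torch.randn(5, 4)
print(torch.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2)))  # True
```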

While simple, this observation produces some unexpected results. There is evidence, for instance, that while the loss landscape of neural networks is highly non-convex, it may be much less non-convex when we consider all networks that can be produced through this permutation operation as equivalent [43, 44]. This means that your network and my network may not be connected by a linear path of low loss, but such a path may exist between your network and a permutation of my network. Other research has looked at whether it may be possible to use symmetries to accelerate optimization by ‘teleporting’ a model to a more favorable location in the loss landscape [45, 46]. Finally, permutation symmetries also provide one type of justification for an empirical phenomenon where individual neurons in a network tend to encode more semantically meaningful information than arbitrary linear combinations of such neurons [47].

Taming Complexity with Abstraction

When discussing symmetry, we used the diagram in Figure 3 to define equivariance. One of the virtues of this approach is that we never had to specify details about the input data or architecture that we used. The spaces could be vector spaces and the maps linear transformations, they could be neural networks of a specific architecture, or they could just be sets and arbitrary functions between them–the definition is valid for each. This diagrammatic point of view, which looks at mathematical constructions in terms of the composition of maps between objects rather than the objects themselves, has been very fruitful in mathematics and is one gateway to the subject known as category theory. Category theory is now the lingua franca in many areas of mathematics since it allows mathematicians to translate definitions and results across a wide range of contexts.

Of course, deep learning is at its core all about function composition, so it is no great leap to try and connect it to the diagrammatic tradition in mathematics. The two disciplines focus on function composition differently, however. In deep learning we take simple layers that alone lack expressivity and compose them together to build a model capable of capturing the complexity of real-world data. With this comes the tongue-in-cheek demand to “stack more layers!”. Category theory instead tries to find a universal framework that captures the essence of structures appearing throughout mathematics. This allows mathematicians to uncover connections between things that look very different at first glance. For instance, category theory gives us the language to describe how the topological structure of a manifold can be encoded in groups via homology or homotopy theory.

It can be an interesting exercise to try to find a diagrammatic description of familiar constructions like the product of two sets $X$ and $Y$. Focusing our attention on maps rather than objects, we find that what characterizes $X \times Y$ is the existence of the two canonical projections $\pi_1$ and $\pi_2$, the former sending $(x,y) \mapsto x$ and the latter $(x,y) \mapsto y$ (at least in more familiar settings where $X$ and $Y$ are, for example, sets). Indeed, the product $X \times Y$ (regardless of whether $X$ and $Y$ are sets, vector spaces, etc.) is the unique object such that for any $Z$ with maps $f_1: Z \rightarrow X$ and $f_2: Z \rightarrow Y$, there is a unique map $h: Z \rightarrow X \times Y$ that satisfies the commutative diagram in Figure 5.

While this construction is a little involved for something as familiar as a product, it has the remarkable property that it allows us to define a “product” even when there is no underlying set structure (that is, in those settings where we cannot resort to defining $X \times Y$ as the set of pairs $(x,y)$ for $x \in X$ and $y \in Y$).
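In the category of sets the universal property is satisfied by the familiar pairing map, which makes it a useful sanity check. Here is a minimal sketch (plain Python, with purely illustrative choices of $Z$, $X$, and $Y$) of how the map $h$ is built from $f_1$ and $f_2$ and how the projections recover them.

```python
# The universal property of the product, spelled out for ordinary sets and functions.
def pi1(pair): return pair[0]          # projection onto X
def pi2(pair): return pair[1]          # projection onto Y

def pairing(f1, f2):
    """The map h = <f1, f2> : Z -> X x Y."""
    return lambda z: (f1(z), f2(z))

# Illustrative choices: Z = integers, X = strings, Y = booleans.
f1 = lambda z: str(z)
f2 = lambda z: z % 2 == 0
h = pairing(f1, f2)

z = 7
assert pi1(h(z)) == f1(z) and pi2(h(z)) == f2(z)   # the diagram in Figure 5 commutes
```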



Figure 5: The commutative diagram that describes a product $X \times Y$. For any $Z$ with maps $f_1: Z \rightarrow X$ and $f_2: Z \rightarrow Y$, there exists a unique map $h: Z \rightarrow X \times Y$ such that $f_1 = \pi_1 \circ h$ and $f_2 = \pi_2 \circ h$ where $\pi_1$ and $\pi_2$ are the usual projection maps from $X \times Y$ to $X$ and $X \times Y$ to $Y$ respectively.

One can reasonably argue that diagrammatic descriptions of well-known constructions, like products, are not useful for the machine learning researcher. After all, we already know how to form products in all of the spaces that come up in machine learning. On the other hand, there are more complicated examples where diagrammatics mesh well with the way we build neural network architectures in practice.


Figure 6: Fiber bundles capture the notion that a space might locally look like a product but globally have twists in it.

Fiber bundles are a central construction in geometry and topology that capture the notion that a space may locally look like a product but may have twists that break this product structure globally. Compare the cylinder with the Möbius band. We can build both of these by starting with a circle and taking a product with the line segment $(0,1)$. In the case of the cylinder, this really is just (topologically) the product of the circle and the segment $(0,1)$, but to form the Möbius band we must add an additional twist that breaks the product structure. In these examples, the circle is called the base space and $(0,1)$ is called the fiber. While only the cylinder is a true product, both the cylinder and the Möbius band are fiber bundles. Here is another way of thinking about a fiber bundle. A fiber bundle is a union of many copies of the fiber parametrized by the base space. In the Möbius band/cylinder example, each point on the circle carries its own copy of $(0,1)$.

We drew inspiration from this latter description of fiber bundles when we were considering a conditional generation task in the context of a problem in materials science. Since the materials background is somewhat involved, we’ll illustrate the construction via a more pedestrian, animal-classification analogue. Let $M$ be the manifold of all possible images containing a single animal. We can propose to decompose the variation in elements of $M$ into two parts, the species of animal in the image and everything else, where the latter could mean differences in background, lighting, pose, image quality, etc. One might want to explore the distribution of one of these factors of variation while fixing the other. For instance, we might want to fix the animal species and explore the variation we get in background, pose, etc. For example, comparing the variation in background for two different species of insect may tell the entomologist about the preferred habitat for different types of beetles.


Figure 7: A cartoon visualizing how the set of all animal images could be decomposed into a local product of animal species and other types of variation.

One might hope to solve this problem by learning an encoding of $M$ into a product space $X_1 \times X_2$ where $X_1$ is a discrete set of points corresponding to animal species and $X_2$ is a space underlying the distribution of all other possible types of variation for a fixed species of animal. Fixing the species would then amount to choosing a specific element $x_1$ from $X_1$ and sampling from the distribution on $X_2$. The product structure of $X_1 \times X_2$ allows us to perform such independent manipulations of $X_1$ and $X_2$. On the other hand, products are rigid structures that impose strong, global topological assumptions on the real data distribution. We found that even on toy problems, it was hard to learn a good map from the raw data distribution to the product-structured latent space defined above. Given that fiber bundles are more flexible and still give us the properties we wanted from our latent space, we designed a neural network architecture to learn a fiber bundle structure on a data distribution [48].


Figure 8: The commutative diagram describing a fiber bundle. The map $\pi$ projects the total space down to the base space, $U$ is a local neighborhood of the base space, and $F$ is the fiber. The diagram says that each point in the base space has a neighborhood $U$ such that when we lift this neighborhood to the bundle, we get something that is homeomorphic (informally, equivalent) to the product of the neighborhood and the fiber. But this product structure may not hold globally over the whole space.

But how do we go from the abstract definition of a fiber bundle above to a neural network architecture that we can code up on a computer? It turns out there is a succinct diagrammatic definition of a fiber bundle (Figure 8) that can serve as a convenient template from which to build up an architecture. We were able to proceed in a relatively naïve fashion, taking each of the maps in the diagram and building a corresponding stack of layers. The diagram itself then told us how to compose each of these components together. The commutativity of the diagram was engineered through a term in the loss function that ensures that $\pi = \text{proj}_1 \circ \varphi$. There were also some conditions on $\varphi$ and $\pi$ (such as the bijectivity of $\varphi$) that needed to be engineered. Beyond this, we were surprised at the amount of flexibility we had. This is useful since it means the process is largely agnostic to data modality.
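As a rough illustration of what "engineering commutativity through the loss" can look like, here is a minimal sketch (assuming PyTorch; the module and dimension names are hypothetical and this is not the architecture of [48]) in which small learned networks stand in for $\pi$ and $\varphi$, and a penalty pushes $\pi$ and $\text{proj}_1 \circ \varphi$ to agree on the data.

```python
# A toy commutativity penalty for a learned fiber-bundle-style latent space.
import torch
import torch.nn as nn

class BundleSketch(nn.Module):
    def __init__(self, data_dim, base_dim, fiber_dim):
        super().__init__()
        # pi: total space (data) -> base space
        self.pi = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, base_dim))
        # phi: total space (data) -> base space x fiber
        self.phi = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(),
                                 nn.Linear(64, base_dim + fiber_dim))
        self.base_dim = base_dim

    def commutativity_loss(self, x):
        base_via_pi = self.pi(x)                        # pi(x)
        base_via_phi = self.phi(x)[:, : self.base_dim]  # (proj_1 o phi)(x)
        return ((base_via_pi - base_via_phi) ** 2).mean()

model = BundleSketch(data_dim=32, base_dim=2, fiber_dim=4)
x = torch.randn(8, 32)
print(model.commutativity_loss(x))   # add this term to the training loss and drive it toward zero
```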

This is an elementary example of how the diagrammatic tradition in mathematics can provide us with a broader perspective on the design of neural networks, allowing us to connect deep structural principles with large-scale network design without having to specify small-scale details that might be problem dependent. Of course, all this only scratches the surface of what the categorical perspective has to offer. Indeed, category theory holds promise as a unified framework to connect much of what appears and is done in machine learning [49].

Conclusion

In the mid-twentieth century, Eugene Wigner marveled at “the unreasonable effectiveness of mathematics” as a framework for not only describing existing physics but also anticipating new results in the field [50]. A mantra more applicable to recent progress in machine learning is “the unreasonable effectiveness of data” [51] and compute. This could appear to be a disappointing situation for mathematicians who might have hoped that machine learning would be as closely intertwined with advanced mathematics as physics is. However, as we’ve demonstrated, while mathematics may not maintain the same role in machine learning research that it has held in the past, the success of scale actually opens new paths for mathematics to support progress in machine learning research. These include:

  1. Providing powerful tools for deciphering the inner workings of complex models
  2. Offering a framework for high-level architectural decisions that leave the details to the learning algorithm
  3. Bridging traditionally isolated domains of mathematics like topology, abstract algebra, and geometry with ML and data science applications.

Should the way things have turned out surprise us? Perhaps not, given that machine learning models ultimately reflect the data they are trained on and in most cases this data comes from fields (such as natural language or imagery) which have long resisted parsimonious mathematical models.

Yet, this situation is also an opportunity for mathematics. Performant machine learning models may provide a gateway for mathematical analysis of a range of fields that were previously inaccessible. It’s remarkable, for instance, that trained word embeddings transform semantic relationships into algebraic operations on vectors in Euclidean space (e.g., ‘Italian’ - ‘Italy’ + ‘France’ ≈ ‘French’). Examples like this hint at the potential for mathematics to gain a foothold in complex, real-world settings by studying the machine learning models that have trained on data from these settings.
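This analogy arithmetic is easy to try for yourself. Here is a minimal sketch (assuming the gensim library and its downloadable GloVe vectors); the nearest neighbor of the resulting vector is usually, though not always, the expected word.

```python
# Word-vector analogy arithmetic: Italian - Italy + France ~ French.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained GloVe word vectors (lowercased vocabulary)

result = vectors.most_similar(positive=["italian", "france"], negative=["italy"], topn=1)
print(result)   # expected to be something like [('french', 0.8...)], though not guaranteed
```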

As more and more of the data in the world is consumed and mathematicised by machine learning models, it will be an increasingly interesting time to be a mathematician. The challenge now lies in adapting our mathematical toolkit to this new landscape, where empirical breakthroughs often precede theoretical understanding. By embracing this shift, mathematics can continue to play a crucial, albeit evolving, role in shaping the future of machine learning.

The author would like to thank Darryl Hannan for help with figures, Davis Brown, Charles Godfrey, and Scott Mahan for useful feedback on drafts, as well as the staff of the Gradient for useful conversations and help editing this article. For resources and events around the growing community of mathematicians and computer scientists using topology, algebra, and geometry (TAG) to better understand and build more robust machine learning systems, please visit us at https://www.tagds.com.

References

[1] Richard Sutton. "The bitter lesson". In: Incomplete Ideas (blog) 13.1 (2019), p. 38.

[2] Guido F Montufar et al. "On the number of linear regions of deep neural networks". In: Advances in Neural Information Processing Systems 27 (2014).

[3] Boris Hanin and David Rolnick. "Complexity of linear regions in deep networks". In: International Conference on Machine Learning. PMLR. 2019, pp. 2596–2604.

[4] J Elisenda Grigsby and Kathryn Lindsey. "On transversality of bent hyperplane arrangements and the topological expressiveness of ReLU neural networks". In: SIAM Journal on Applied Algebra and Geometry 6.2 (2022), pp. 216–242.

[5] Ahmed Imtiaz Humayun et al. "Splinecam: Exact visualization and characterization of deep network geometry and decision boundaries". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 3789–3798.

[6] Phillip Pope et al. "The intrinsic dimension of images and its impact on learning". In: arXiv preprint arXiv:2104.08894 (2021).

[7] Nicholas Konz and Maciej A Mazurowski. "The Effect of Intrinsic Dataset Properties on Generalization: Unraveling Learning Differences Between Natural and Medical Images". In: arXiv preprint arXiv:2401.08865 (2024).

[8] Yasaman Bahri et al. "Explaining neural scaling laws". In: arXiv preprint arXiv:2102.06701 (2021).

[9] Utkarsh Sharma and Jared Kaplan. "A neural scaling law from the dimension of the data manifold". In: arXiv preprint arXiv:2004.10802 (2020).

[10] Alessio Ansuini et al. "Intrinsic dimension of data representations in deep neural networks". In: Advances in Neural Information Processing Systems 32 (2019).

[11] Lucrezia Valeriani et al. "The geometry of hidden representations of large transformer models". In: Advances in Neural Information Processing Systems 36 (2024).

[12] Henry Kvinge, Davis Brown, and Charles Godfrey. "Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension". In: ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.

[13] Xingjun Ma et al. "Characterizing adversarial subspaces using local intrinsic dimensionality". In: arXiv preprint arXiv:1801.02613 (2018).

[14] Peter Lorenz, Ricard L Durall, and Janis Keuper. "Detecting images generated by deep diffusion models using their local intrinsic dimensionality". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, pp. 448–459.

[15] Fan Yin, Jayanth Srinivasa, and Kai-Wei Chang. "Characterizing truthfulness in large language model generations with local intrinsic dimension". In: arXiv preprint arXiv:2402.18048 (2024).

[16] Justin Gilmer et al. "A loss curvature perspective on training instabilities of deep learning models". In: International Conference on Learning Representations. 2021.

[17] Jeremy Cohen et al. "Gradient descent on neural networks typically occurs at the edge of stability". In: International Conference on Learning Representations. 2020.

[18] Seyed-Mohsen Moosavi-Dezfooli et al. "Robustness via curvature regularization, and vice versa". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 9078–9086.

[19] Francisco Acosta et al. "Quantifying extrinsic curvature in neural manifolds". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 610–619.

[20] Gregory Naitzat, Andrey Zhitnikov, and Lek-Heng Lim. "Topology of deep neural networks". In: Journal of Machine Learning Research 21.184 (2020), pp. 1–40.

[21] Bastian Rieck et al. "Neural persistence: A complexity measure for deep neural networks using algebraic topology". In: arXiv preprint arXiv:1812.09764 (2018).

[22] Mustafa Hajij, Kyle Istvan, and Ghada Zamzmi. "Cell complex neural networks". In: arXiv preprint arXiv:2010.00743 (2020).

[23] Cristian Bodnar. "Topological deep learning: graphs, complexes, sheaves". PhD thesis. 2023.

[24] Jakob Hansen and Robert Ghrist. "Toward a spectral theory of cellular sheaves". In: Journal of Applied and Computational Topology 3.4 (2019), pp. 315–358.

[25] Yifan Feng et al. "Hypergraph neural networks". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 01. 2019, pp. 3558–3565.

[26] Felix Draxler et al. "Essentially no barriers in neural network energy landscape". In: International Conference on Machine Learning. PMLR. 2018, pp. 1309–1318.

[27] Kiho Park, Yo Joong Choe, and Victor Veitch. "The linear representation hypothesis and the geometry of large language models". In: arXiv preprint arXiv:2311.03658 (2023).

[28] Taco Cohen and Max Welling. "Group equivariant convolutional networks". In: International Conference on Machine Learning. PMLR. 2016, pp. 2990–2999.

[29] Maurice Weiler, Fred A Hamprecht, and Martin Storath. "Learning steerable filters for rotation equivariant cnns". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 849–858.

[30] Daniel E Worrall et al. "Harmonic networks: Deep translation and rotation equivariance". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5028–5037.

[31] Diego Marcos et al. "Rotation equivariant vector field networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5048–5057.

[32] Alexandre Duval et al. "A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems". In: arXiv preprint arXiv:2312.07511 (2023).

[33] Nathaniel Thomas et al. "Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds". In: arXiv preprint arXiv:1802.08219 (2018).

[34] Manzil Zaheer et al. "Deep sets". In: Advances in Neural Information Processing Systems 30 (2017).

[35] Vıctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. "E (n) equivariant graph neural networks". In: International Conference on Machine Learning. PMLR. 2021, pp. 9323–9332.

[36] Denis Boyda et al. "Sampling using SU (N) gauge equivariant flows". In: Physical Review D 103.7 (2021), p. 074504.

[37] Hannah Lawrence and Mitchell Tong Harris. "Learning Polynomial Problems with SL(2,\mathbb {R}) −Equivariance". In: The Twelfth International Conference on Learning Representations. 2023.

[38] Josh Abramson et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3". In: Nature (2024), pp. 1–3.

[39] Scott Mahan et al. "What Makes a Machine Learning Task a Good Candidate for an Equivariant Network?" In: ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative Modeling.

[40] Johann Brehmer et al. "Does equivariance matter at scale?" In: arXiv preprint arXiv:2410.23179 (2024).

[41] Chris Olah et al. "Naturally Occurring Equivariance in Neural Networks". In: Distill (2020). https://distill.pub/2020/circuits/equivariance. doi: 10.23915/distill.00024.004.

[42] Giovanni Luca Marchetti et al. "Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks". In: arXiv preprint arXiv:2312.08550 (2023).

[43] Rahim Entezari et al. "The role of permutation invariance in linear mode connectivity of neural networks". In: arXiv preprint arXiv:2110.06296 (2021).

[44] Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. "Git re-basin: Merging models modulo permutation symmetries". In: arXiv preprint arXiv:2209.04836 (2022).

[45] Bo Zhao et al. "Symmetry teleportation for accelerated optimization". In: Advances in Neural Information Processing Systems 35 (2022), pp. 16679–16690.

[46] Bo Zhao et al. "Improving Convergence and Generalization Using Parameter Symmetries". In: arXiv preprint arXiv:2305.13404 (2023).

[47] Charles Godfrey et al. "On the symmetries of deep learning models and their internal representations". In: Advances in Neural Information Processing Systems 35 (2022), pp. 11893–11905.

[48] Nico Courts and Henry Kvinge. "Bundle Networks: Fiber Bundles, Local Trivializations, and a Generative Approach to Exploring Many-to-one Maps". In: International Conference on Learning Representations. 2021.

[49] Bruno Gavranović et al. "Position: Categorical Deep Learning is an Algebraic Theory of All Architectures". In: Forty-first International Conference on Machine Learning.

[50] Eugene P Wigner. "The unreasonable effectiveness of mathematics in the natural sciences". In: Mathematics and Science. World Scientific, 1990, pp. 291–306.

[51] Alon Halevy, Peter Norvig, and Fernando Pereira. "The unreasonable effectiveness of data". In: IEEE Intelligent Systems 24.2 (2009), pp. 8–12.

]]>
<![CDATA[What's Missing From LLM Chatbots: A Sense of Purpose]]>LLM-based chatbots’ capabilities have been advancing every month. These improvements are mostly measured by benchmarks like MMLU, HumanEval, and MATH (e.g. sonnet 3.5, gpt-4o). However, as these measures get more and more saturated, is user experience increasing in proportion to these scores? If we envision a future

]]>
https://thegradient.pub/dialog/66c6733993571d5c8c154fb1Mon, 09 Sep 2024 17:28:48 GMT

LLM-based chatbots’ capabilities have been advancing every month. These improvements are mostly measured by benchmarks like MMLU, HumanEval, and MATH (e.g. sonnet 3.5, gpt-4o). However, as these measures get more and more saturated, is user experience improving in proportion to these scores? If we envision a future of human-AI collaboration rather than AI replacing humans, the current ways of measuring dialogue systems may be insufficient because they measure performance in a non-interactive fashion.

Why does purposeful dialogue matter?

Purposeful dialogue refers to a multi-round user-chatbot conversation that centers around a goal or intention. The goal could range from a generic one like “harmless and helpful” to more specific roles like “travel planning agent”, “psychotherapist” or “customer service bot.”

Travel planning is a simple, illustrative example. Our preferences, our fellow travelers’ preferences, and all the complexities of real-world situations make transmitting all information in one pass far too costly. However, if multiple back-and-forth exchanges of information are allowed, only the important information gets selectively exchanged. Negotiation theory offers an analogy: iterative bargaining yields better outcomes than a take-it-or-leave-it offer.

In fact, sharing information is only one aspect of dialogue. In Terry Winograd’s words: “All language use can be thought of as a way of activating procedures within the hearer.” We can think of each utterance as a deliberate action that one party takes to alter the world model of the other. What if both parties have more complicated, even hidden goals? In this way, purposeful dialogue provides us with a way of formulating human-AI interactions as a collaborative game, where the goal of the chatbot is to help humans achieve certain goals.

This might seem like an unnecessary complexity that is only a concern for academics. However, purposeful dialogue could be beneficial even for the most hard-nosed, product-oriented research direction like code generation. Existing coding benchmarks mostly measure performance in a one-pass generation setting; however, for AI to automate solving ordinary GitHub issues (like in SWE-bench), it’s unlikely to be achieved by a single action: the AI needs to communicate back and forth with human software engineers to make sure it understands the correct requirements, ask for missing documentation and data, and even ask humans to give it a hand if needed. In a similar vein to pair programming, this could reduce code defects without the burden of additional engineering hours.


Moreover, with the introduction of turn-taking, many new possibilities can be unlocked. As interactions become long-term and memory is built, the chatbot can gradually update user profiles. It can also adapt to their preferences. Imagine a personal assistant (e.g., IVA, Siri) that, through daily interaction, learns your preferences and intentions. It can read your sources of new information automatically (e.g., Twitter, arXiv, Slack, NYT) and provide you with a morning news summary according to your preferences. It can draft emails for you and keep improving by learning from your edits.

In a nutshell, meaningful interactions between people rarely begin with complete strangers and conclude in just one exchange. Humans naturally interact with each other through multi-round dialogues and adapt accordingly throughout the conversation. However, doesn’t that seem exactly the opposite of predicting the next token, which is the cornerstone of modern LLMs? Below, let’s take a look at the makings of dialogue systems.

How were/are dialogue systems made?

Let's jump back to the 1970s, when Roger Schank introduced his "restaurant script" as a kind of dialogue system [1]. This script breaks down the typical restaurant experience into steps like entering, ordering, eating, and paying, each with specific scripted utterances. Back then, every piece of dialogue in these scenarios was carefully planned out, enabling AI systems to mimic realistic conversations. ELIZA, a Rogerian psychotherapist simulator, and PARRY, a system mimicking a paranoid individual, were two other early dialogue systems from the era before machine learning.

Compared with today’s LLM-based dialogue systems, it seems mysterious how models trained to predict the next token can engage in dialogue at all. Therefore, let’s take a closer look at how dialogue systems are made, with an emphasis on how the dialogue format comes into play:

(1) Pretraining: a sequence model is trained to predict the next token on a gigantic corpus of mixed internet text. The composition varies, but it is predominantly news, books, and GitHub code, with a small blend of forum data crawled from sites such as Reddit and Stack Exchange, which may contain dialogue-like text.

Table of the pretraining data mixture from the LLaMA technical report

(2) Introduce dialogue formatting: because the sequence model only processes strings, while the most natural representation of dialogue history is a structured list of system prompts and past exchanges, a formatting convention must be introduced to convert between the two. Some Hugging Face tokenizers provide a method called tokenizer.apply_chat_template for the convenience of users. The exact formatting differs from model to model, but it usually involves guarding the system prompt with special tokens like <system> or <INST>, in the hope that the pretrained model will allocate more attention weight to it. The system prompt plays a significant role in adapting language models to downstream applications and ensuring their safe behavior (we will say more in the next section). Notably, the choice of format is arbitrary at this step: the pretraining corpus doesn’t follow it.

The context window of a chatbot
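To make step (2) concrete, here is a minimal sketch (assuming the Hugging Face transformers library) of flattening a structured dialogue history into the string the model actually consumes. The model name is only an illustrative choice (and a gated one, so substitute any chat model you have access to); the template it applies is model-specific.

```python
# Convert a structured dialogue history into the model's flat prompt string.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a terse travel-planning assistant."},
    {"role": "user", "content": "Plan a weekend in Kyoto."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # the same dialogue, wrapped in this model's special formatting tokens
```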

(3) RLHF: in this step, the chatbot is directly rewarded or penalized for generating desired or undesired answers. It’s worth noting that this is the first time the introduced dialogue formatting appears in the training data. RLHF is a fine-tuning step, not only because the data size is dwarfed by the pretraining corpus, but also because of the KL penalty and targeted weight tuning (e.g. LoRA). Using LeCun’s cake-baking analogy, RLHF is only the small cherry on top.

Image from Yann LeCun’s slides
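For readers who want the KL penalty spelled out, a standard way to write the RLHF objective (the exact form varies across implementations) is to maximize the learned reward while staying close to the pretrained reference policy:

$$\max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \;\big\|\; \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]$$

Here $\pi_\theta$ is the chatbot being tuned, $\pi_{\mathrm{ref}}$ is the reference model before RLHF, $r$ is the reward model trained from human preferences, and $\beta$ controls how far the tuned model may drift from the reference. Note that the whole response $y$ is treated as a single action, which is exactly the one-step, bandit-style formulation discussed later in this post.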

How consistent are existing dialogue systems (in 2024)?

The minimum requirement we could have for a dialogue system is that it can stay on the task we give it. We humans, in fact, often drift from topic to topic. How well do current systems perform?


Currently, the system prompt is the main method that allows users to control LM behavior. However, researchers have found evidence that LLMs can be brittle in following these instructions under adversarial conditions [12, 13]. Readers might also have experienced this through daily interactions with ChatGPT or Claude: when a new chat window is freshly opened, the model can follow your instruction reasonably well [2], but after several rounds of dialogue it is no longer fresh, and it may even stop following its role altogether.

How could we quantitatively capture this anecdote? For one-round instruction following, we already enjoy plenty of benchmarks, such as MT-Bench and Alpaca-Eval. However, when we test models in an interactive fashion, it’s hard to anticipate what the model will generate and prepare a reply in advance. In a project by my collaborators and me [3], we built an environment that synthesizes dialogues of unlimited length to stress-test the instruction-following capabilities of LLM chatbots.

To allow unconstrained scaling along the time axis, we let two system-prompted LM agents chat with each other for an extended number of rounds. This forms the main trunk of the dialogue [a1, b1, a2, b2, …, a8, b8] (say the dialogue is 8 rounds long). At this point, we could try to judge how well each LLM sticks to its system prompt just by examining this dialogue, but many of the utterances can be irrelevant to the instructions, depending on where the conversation goes. Therefore, we hypothetically branch out at each round by asking a question directly related to the system prompt, and use a corresponding judging function to quantify how well the model performs. All the dataset provides is a bank of triplets (system prompt, probe question, judging function).

Sketch of the process of measuring instruction stability
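In pseudocode, the procedure looks roughly like the sketch below (the agent, probe, and judge interfaces are hypothetical stand-ins, not the actual implementation of [3]).

```python
# Rough sketch of measuring instruction stability round by round.
def measure_stability(agent_a, agent_b, probe_question, judge, n_rounds=8):
    history = []
    scores = []
    for _ in range(n_rounds):
        history.append(agent_a.reply(history))   # a_i: agent A speaks, conditioned on its system prompt
        history.append(agent_b.reply(history))   # b_i: agent B responds

        # Hypothetical branch: probe agent A without adding the probe to the main trunk.
        probe_reply = agent_a.reply(history + [probe_question])
        scores.append(judge(probe_reply))        # how well does A still follow its system prompt?
    return scores                                # one stability score per round
```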

Averaging across scenarios and pairs of system prompts, we get a curve of instruction stability across rounds. To our surprise, the aggregated results on both LLaMA2-chat-70B and gpt-3.5-turbo-16k are alarming. Besides making prompt engineering harder, the lack of instruction stability also raises safety concerns: when the chatbot drifts away from system prompts that stipulate safety requirements, it becomes more susceptible to jailbreaking and more prone to hallucinations.

Instruction stability on LLaMA2-chat-70B and gpt-3.5-turbo-16k

The empirical results also contrast with the ever-increasing context length of LLMs. Theoretically, some long-context models can attend to a window of up to 100k tokens. However, in the dialogue setting, they become distracted after only 1.6k tokens (assuming each utterance is 100 tokens). In [3], we further showed theoretically why this is inevitable for a Transformer-based LM chatbot under the current prompting scheme, and proposed a simple technique called split-softmax to mitigate the effect.

One might ask at this point: why is this so bad? Why don’t humans lose their persona just by talking to another person for 8 rounds? Arguably, human interactions are based on purposes and intentions [5], and these purposes precede the means rather than the other way around. An LLM, by contrast, is fundamentally a fluent English generator, and the persona is merely a thin layer added on top.

What’s missing?

Pretraining?
Pretraining endows the language model with the capability to model a distribution over internet personas as well as the lower-level language distribution of each persona [4]. However, even when one persona (or a mixture of a limited number of them) is specified by the instructions in the system prompt, current approaches fail to single it out.

RLHF?
RLHF provides a powerful solution for adapting this multi-persona model into a “helpful and harmless assistant.” However, the original RLHF methods formulate reward maximization as a one-step bandit problem, and it is not generally possible to train with human feedback in the loop of a conversation. (I’m aware of many advances in alignment, but I want to discuss the original RLHF algorithm as a prototypical example.) This lack of multi-turn planning may cause models to suffer from task ambiguity [6] and to learn superficial human-likeness rather than goal-directed social interaction [7].

Will adding more dialogue data in RLHF help? My guess is that it will, to a certain extent, but it will still fall short due to a lack of purpose. Sergey Levine pointed out in his blog that there is a fundamental difference between preference learning and intentions: “the key distinction is between viewing language generation as selecting goal-directed actions in a sequential process, versus a problem of producing outputs satisfying user preferences.”

Purposeful dialogue system

Staying on task is a modest request of LLMs. Even if an LLM remains focused on the task, however, that doesn’t necessarily mean it can excel at achieving the goal.

The problem of long-horizon planning has attracted some attention in the LLM community. For example, “decision-oriented dialogue” is proposed as a general class of tasks [8], where the AI assistant collaborates with humans to help them make complicated decisions, such as planning itineraries in a city and negotiating travel plans among friends. Another example, Sotopia [10], is a comprehensive social simulation platform that compiles various goal-driven dialogue scenarios including collaboration, negotiation, and persuasion.

Setting up such benchmarks not only provides a way to gauge the progress of the field; it also supplies reward signals that new algorithms can pursue, signals that would otherwise be expensive to collect and tricky to define [9]. However, there aren’t many techniques that can exert enough control over the LM for it to act consistently toward such goals across a long horizon.

To fill this gap, my collaborators and I propose a lightweight algorithm (Dialogue Action Tokens, DAT [11]) that guides an LM chatbot through multi-round, goal-driven dialogue. As shown in the image below, in each round of conversation the last-token embedding of the dialogue history is used as the input (state) to a planner (actor), which predicts several prefix tokens (actions) that control the generation process. By training the planner with TD3+BC, a relatively stable RL algorithm, we show significant improvement over baselines on Sotopia, even surpassing the social capability scores of GPT-4.

A sketch of Dialogue Action Tokens (DAT)
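In code, the control loop has roughly the following shape. This is a minimal sketch with hypothetical helper names (last_token_embedding, generate_with_prefix), not the released DAT implementation; see [11] for the real details.

```python
# Sketch of the per-round control loop behind Dialogue Action Tokens.
import torch
import torch.nn as nn

class Planner(nn.Module):
    """Maps the dialogue state to a few prefix embeddings (the 'action')."""
    def __init__(self, hidden_dim, n_prefix=2):
        super().__init__()
        self.net = nn.Linear(hidden_dim, n_prefix * hidden_dim)
        self.n_prefix = n_prefix

    def forward(self, state):                            # state: (hidden_dim,)
        return self.net(state).view(self.n_prefix, -1)   # (n_prefix, hidden_dim) prefix "action"

def dat_round(lm, planner, dialogue_history):
    state = lm.last_token_embedding(dialogue_history)        # hypothetical helper: the RL state
    prefix = planner(state)                                   # the RL action
    return lm.generate_with_prefix(dialogue_history, prefix)  # hypothetical helper: steered utterance
```

The planner is then trained against the environment's reward with an offline-friendly RL algorithm such as TD3+BC, while the underlying LM stays frozen.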


In this way, we provide a technical pathway for upgrading an LM from a prediction model that merely guesses the next token to one that engages in dialogue with humans purposefully. We can also imagine this technique being misused for harmful applications. For this reason, we conducted a “multi-round red-teaming” experiment, and we recommend further research to better understand multi-round dialogue as a potential attack surface.

Concluding remarks

I have reviewed how current LLM dialogue systems are made, and how and why they fall short. I hypothesize that a sense of purpose is what is missing, and I present one technique for adding it back with reinforcement learning.

The following are two research questions that I’m mostly excited about:

(1) Better monitoring and control of dialogue systems with steering techniques. For example, the recently proposed TalkTurner (Chen et al.) adds a dashboard (Viégas et al.) to open-source LLMs, enabling users to see and control how the LLM thinks of them. Many weaknesses of current steering techniques are revealed in the process and call for better solutions. For example, using activation steering to control two attributes (e.g., age and education level) simultaneously has been found to be difficult and can cause more language degradation. Another intriguing question is how to differentiate between an LLM’s internal model of itself and its internal model of the user. Anecdotally, chatting with Golden Gate Bridge Claude has shown that steering on the specific Golden Gate Bridge feature found by a sparse autoencoder (SAE) sometimes causes Claude to think of itself as the San Francisco landmark, sometimes to treat the user as the bridge, and other times to treat the topic of conversation as the bridge.

(2) Better utilization of offline reward signals. In constructed environments like Sotopia and “decision-oriented dialogues”, reward signals are engineered beforehand. In the real world, users won’t leave numerical feedback on how satisfied they feel. However, there might be other clues in language (e.g., “Thanks!”, “That’s very helpful!”) or from external signals (e.g., users buying the product from a salesman AI, or users moving on to a subsequent coding question for a copilot within a short time frame). Inferring and utilizing such hidden reward signals could strengthen the network effect of online chatbots: good model → more users → learning from interacting with users → better model.

Acknowledgment
The author is grateful to Martin Wattenberg and Hugh Zhang (alphabetical order) for providing suggestions and editing the text.

Citation

For attribution of this in academic contexts or books, please cite this work as:

Kenneth Li, "What's Missing From LLM Chatbots: A Sense of Purpose", The Gradient, 2024.

BibTeX citation (this blog):

@article{li2024from,
author = {Li, Kenneth},
title = {What's Missing From LLM Chatbots: A Sense of Purpose},
journal = {The Gradient},
year = {2024},
howpublished = {\url{https://thegradient.pub/dialogue}},
}

References

[1] Schank, Roger C., and Robert P. Abelson. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology press, 2013.
[2] Zhou, Jeffrey, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. "Instruction-following evaluation for large language models." arXiv preprint arXiv:2311.07911 (2023).
[3] ​​Li, Kenneth, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. "Measuring and controlling persona drift in language model dialogs." arXiv preprint arXiv:2402.10962 (2024).
[4] Andreas, Jacob. "Language models as agent models." arXiv preprint arXiv:2212.01681 (2022).
[5] Austin, John Langshaw. How to do things with words. Harvard university press, 1975.
[6] Tamkin, Alex, Kunal Handa, Avash Shrestha, and Noah Goodman. "Task ambiguity in humans and language models." arXiv preprint arXiv:2212.10711 (2022).
[7] Bianchi, Federico, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, and James Zou. "How well can llms negotiate? negotiationarena platform and analysis." arXiv preprint arXiv:2402.05863 (2024).
[8] Lin, Jessy, Nicholas Tomlin, Jacob Andreas, and Jason Eisner. "Decision-oriented dialogue for human-ai collaboration." arXiv preprint arXiv:2305.20076 (2023).
[9] Kwon, Minae, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. "Reward design with language models." arXiv preprint arXiv:2303.00001 (2023).
[10] Zhou, Xuhui, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency et al. "Sotopia: Interactive evaluation for social intelligence in language agents." arXiv preprint arXiv:2310.11667 (2023).
[11] Li, Kenneth, Yiming Wang, Fernanda Viégas, and Martin Wattenberg. "Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner." arXiv preprint arXiv:2406.11978 (2024).
[12] Li, Shiyang, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, and Hongxia Jin. "Instruction-following evaluation through verbalizer manipulation." arXiv preprint arXiv:2307.10558 (2023).
[13] Wu, Zhaofeng, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. "Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks." arXiv preprint arXiv:2307.02477 (2023).

]]>
<![CDATA[We Need Positive Visions for AI Grounded in Wellbeing]]>Introduction

Imagine yourself a decade ago, jumping directly into the present shock of conversing naturally with an encyclopedic AI that crafts images, writes code, and debates philosophy. Won’t this technology almost certainly transform society — and hasn’t AI’s impact on us so far been

]]>
https://thegradient.pub/we-need-positive-visions-for-ai-grounded-in-wellbeing/66a4243393571d5c8c154f4eSat, 03 Aug 2024 17:00:43 GMT

Introduction

Imagine yourself a decade ago, jumping directly into the present shock of conversing naturally with an encyclopedic AI that crafts images, writes code, and debates philosophy. Won’t this technology almost certainly transform society — and hasn’t AI’s impact on us so far been a mixed-bag? Thus it’s no surprise that so many conversations these days circle around an era-defining question: How do we ensure AI benefits humanity? These conversations often devolve into strident optimism or pessimism about AI, and our earnest aim is to walk a pragmatic middle path, though no doubt we will not perfectly succeed.

While it’s fashionable to handwave towards “beneficial AI,” and many of us want to contribute to its development, it’s not easy to pin down what beneficial AI concretely means in practice. This essay represents our attempt to demystify beneficial AI, through grounding it in the wellbeing of individuals and the health of society. In doing so, we hope to promote opportunities for AI research and products to benefit our flourishing, and along the way to share ways of thinking about AI’s coming impact that motivate our conclusions.

The Big Picture

By trade, we’re closer in background to AI than to the fields where human flourishing is most-discussed, such as wellbeing economics, positive psychology, or philosophy, and in our journey to find productive connections between such fields and the technical world of AI, we found ourselves often confused (what even is human flourishing, or wellbeing, anyways?) and from that confusion, often stuck (maybe there is nothing to be done? — the problem is too multifarious and diffuse). We imagine that others aiming to create prosocial technology might share our experience, and the hope here is to shine a partial path through the confusion to a place where there’s much interesting and useful work to be done. We start with some of our main conclusions, and then dive into more detail in what follows.

One conclusion we came to is that it’s okay that we can’t conclusively define human wellbeing. It’s been debated by philosophers, economists, psychotherapists, psychologists, and religious thinkers, for many years, and there’s no consensus. At the same time, there’s agreement around many concrete factors that make our lives go well, like: supportive intimate relationships, meaningful and engaging work, a sense of growth and achievement, and positive emotional experiences. And there’s clear understanding, too, that beyond momentary wellbeing, we must consider how to secure and improve wellbeing across years and decades — through what we could call societal infrastructure: important institutions such as education, government, the market, and academia.

One benefit of this wellbeing lens is to wake us to an almost-paradoxical fact: While the deep purpose behind nearly everything our species does is wellbeing, we’ve tragically lost sight of it.  Both by common measures of individual wellbeing (suicide rate, loneliness, meaningful work) and societal wellbeing (trust in our institutions, shared sense of reality, political divisiveness), we’re not doing well, and our impression is that AI is complicit in that decline. The central benefit of this wellbeing view, however, is the insight that no fundamental obstacle prevents us from synthesizing the science of wellbeing with machine learning to our collective benefit.

This leads to our second conclusion: We need plausible positive visions of a society with capable AI, grounded in wellbeing. Like other previous transformative technologies, AI will shock our societal infrastructure — dramatically altering the character of our daily lives, whether we want it to or not. For example, Facebook launched only twenty years ago, and yet social media’s shockwaves have already upended much in society — subverting news media and our informational commons, addicting us to likes, and displacing meaningful human connection with its shell. We believe capable AI’s impact will exceed that of social media. As a result, it’s vital that we strive to explore, envision, and move towards the AI-infused worlds we’d flourish within — ones perhaps in which it revitalizes our institutions, empowers us to pursue what we find most meaningful, and helps us cultivate our relationships. This is no simple task, requiring imagination, groundedness, and technical plausibility — to somehow dance through the minefields illuminated by previous critiques of technology. Yet now is the time to dream and build if we want to actively shape what is to come.

This segues into our final conclusion: Foundation models and the arc of their future deployment is critical. Even for those of us in the thick of the field, it’s hard to internalize how quickly models have improved, and how capable they might become given several more years. Recall that GPT-2 — barely functional by today’s standards — was released only in 2019. If future models are much more capable than today’s, and competently engage with more of the world with greater autonomy, we can expect their entanglement with our lives and society to ratchet skywards. So, at minimum, we’d like to enable these models to understand our wellbeing and how to support it, potentially through new algorithms, wellbeing-based evaluations of models, and wellbeing training data. Of course, we also want to realize human benefit in practice — the last section of this blog post highlights what we believe are strong leverage points towards that end.

The rest of this post describes in more detail (1) what we mean by AI that benefits our wellbeing, (2) the need for positive visions for AI grounded in wellbeing, and (3) concrete leverage points to aid in the development and deployment of AI in service of such positive visions. We’ve designed this essay such that the individual parts are mostly independent, so if you are interested most in concrete research directions, feel free to skip there.

Beneficial AI grounds out in human wellbeing

Discussion about AI for human benefit is often high-minded, but not particularly actionable, as in unarguable but content-free phrases like “We should make sure AI is in service of humanity.” But to meaningfully implement such ideas in AI or policy requires enough precision and clarity to translate them into code or law. So we set out to survey what science has discovered about the ground of human benefit, as a step towards being able to measure and support it through AI.

Often, when we think about beneficial impact, we focus on abstract pillars like democracy, education, fairness, or the economy. However important, none of these are valuable intrinsically. We care about them because of how they affect our collective lived experience, over the short and long-term. We care about increasing society’s GDP to the extent it aligns with actual improvement of our lives and future, but when treated as an end in itself, it becomes disconnected from what matters: improving human (and potentially all species’) experience.

In looking for fields that most directly study the root of human flourishing, we found the scientific literature on wellbeing. The literature is vast, spanning many disciplines, each with their own abstractions and theories — and, as you might expect, there’s no true consensus on what wellbeing actually is. In diving into the philosophy of flourishing, wellbeing economics, or psychological theories of human wellbeing, one encounters many interesting, compelling, but seemingly incompatible ideas.

For example, theories of hedonism in philosophy claim that pleasure and the absence of suffering is the core of wellbeing; while desire satisfaction theories instead claim that wellbeing is about the fulfillment of our desires, no matter how we feel emotionally. There’s a wealth of literature on measuring subjective wellbeing (broadly, how we experience and feel about our life), and many different frameworks of what variables characterize flourishing. For example, Martin Seligman’s PERMA framework claims that wellbeing consists of positive emotions, engagement, relationships, meaning, and achievement. There are theories that say that the core of wellbeing is satisfying psychological needs, like the need for autonomy, competence, and relatedness. Other theories claim that wellbeing comes from living by our values. In economics, frameworks rhyme with those in philosophy and psychology, but diverge enough to complicate an exact bridge. For example, the wellbeing economics movement largely focuses on subjective wellbeing and explores many different proxies of it, like income, quality of relationships, job stability, etc.

After the excitement from surveying so many interesting ideas began to fade, perhaps unsurprisingly, we remained fundamentally confused about what “the right theory” was. But, we recognized that in fact this has always been the human situation when it comes to wellbeing, and just as a lack of an incontrovertible theory of flourishing has not prevented humanity from flourishing in the past, it need not stand as a fundamental obstacle for beneficial AI. In other words, our attempts to guide AI to support human flourishing must take this lack of certainty seriously, just as all sophisticated societal efforts to support flourishing must do.

In the end, we came to a simple workable understanding, not far from the view of wellbeing economics: Human benefit ultimately must ground out in the lived experience of humans. We want to live happy, meaningful, healthy, full lives — and it’s not so difficult to imagine ways AI might assist in that aim. For example, the development of low-cost but proficient AI coaches, intelligent journals that help us to self-reflect, or apps that help us to find friends, romantic partners, or to connect with loved ones. We can ground these efforts in imperfect but workable measures of wellbeing from the literature (e.g. PERMA), taking as first-class concerns that the map (wellbeing measurement) is not the territory (actual wellbeing), and that humanity itself continues to explore and refine its vision of wellbeing.

More broadly our wellbeing relies on a healthy society, and we care not only about our own lives, but also want beautiful lives for our neighbors, community, country, and world, and for our children, and their children as well. The infrastructure of society (institutions like government, art, science, military, education, news, and markets) is what supports this broader, longer-term vision of wellbeing.


Each of these institutions has an important role to play in society, and we can also imagine ways that AI could support or improve them; for example, generative AI may catalyze education through personal tutors that help us develop a richer worldview, may help us better hold our politicians to account by sifting through what they are actually up to, or may accelerate meaningful science by helping researchers make novel connections. Thus, in short, beneficial AI would meaningfully support our quest for lives worth living, in both the immediate and the long-term sense.

So, from the lofty confusion of conflicting grand theories, we arrive at something sounding more like common sense. Let’s not take this for granted, however — it cuts through the cruft of abstractions to firmly recenter what is ultimately important: the psychological experience of humans. This view points us towards the ingredients of wellbeing that are both well-supported scientifically and could be made measurable and actionable through AI (e.g. there exist instruments to measure many of these ingredients). Further, wellbeing across the short and long term provides the common currency that bridges divergent approaches to beneficial AI: from mitigating societal harms like discrimination in the AI ethics community, to attempting to reinvigorate democracy through AI-driven deliberation, to creating a world where humans live more meaningful lives, to creating low-cost emotional support and self-growth tools, to reducing the likelihood of existential risks from AI, to using AI to reinvigorate our institutions — wellbeing is the ultimate ground.

Finally, focusing on wellbeing helps to highlight where we currently fall short. Current AI development is driven by our existing incentive systems (profit, research novelty, engagement), with little explicit focus on what is fundamentally more important: human flourishing. We need to find tractable ways to shift incentives towards wellbeing-supportive models (something we’ll discuss later), and positive directions to move toward (discussed next).

We need positive visions for AI

Technology is a shockingly powerful societal force. While most new technologies bring only limited change, like an improved toothbrush, sometimes they upend the world. Like the proverbial slowly boiling frog, we forget how quickly the internet and cellphones have overhauled our lived experience: the rise of dating apps, podcasts, social networks, constant messaging, cross-continental video calls, massive online games, the rise of influencers, on-demand limitless entertainment, and more. Our lives as a whole — our relationships, our leisure, how we work and collaborate, how news and politics work — have dramatically shifted, for both better and worse.

AI is transformative, and the mixed bag of its impacts is poised to reshape society in mundane and profound ways; we might doubt it, but that was also our naivety at the advent of social media and the cell phone. We don’t see it coming, and once it’s here we take it for granted. Generative AI is translating applications from science fiction into rapid adoption: AI romantic companions; automated writing and coding assistants; automatic generation of high-quality images, music, and videos; low-cost personalized AI tutors; highly persuasive personalized ads; and so on.

In this way, transformative impact is happening now — it does not require AI with superhuman intelligence — see the rise of LLM-based social media bots; ChatGPT as the fastest-adopted consumer app; LLMs requiring fundamental changes to homework in school. Much greater impact will yet come, as the technology (and the business around it) matures, and as AI is integrated more pervasively throughout society.

Our institutions were understandably not designed with this latest wave of AI in mind, and it’s unclear whether many of them will adapt quickly enough to keep up with AI's rapid deployment. For example, an important function of news is to keep a democracy’s citizens well-informed, so their vote is meaningful. But news these days spreads through AI-driven algorithms on social media, which amplify emotional virality and confirmation bias at the expense of meaningful debate. And so the public square, and the sense of a shared reality, is being undercut, as AI degrades an important institution devised without foresight of this novel technological development.

Thus in practice, it may not be possible to play defense by simply “mitigating harms” from a technology; often, a new technology demands that we creatively and skillfully apply our existing values to a radically new situation. We don’t want AI to undermine the livelihood of artists, for example, yet what do we want our relationship to creativity to look like in a world where AI can, easily and cheaply, produce compelling art or write symphonies and novels in the style of your favorite artist? There’s no easy answer. We need to debate, understand, and capture what we believe is the spirit of our institutions and systems given this new technology.

For example, what’s truly important about education? We can reduce the harms that AI imposes on the current education paradigm by banning AI in students’ essays, or by applying AI in service of existing metrics (for example, to increase high school graduation rates). But the paradigm itself must adapt: the world that schooling currently prepares our children for is not the world they’ll graduate into, nor does it generally prepare us to flourish and find meaning in our lives. We must ask ourselves what we really value in education that we want AI to enable: perhaps teaching critical thinking, enabling agency, and creating a sense of social belonging and civic responsibility?

To anticipate critique, we agree that there will be no global consensus on what education is for, or on the underlying essence of any particular institution, at root because different communities and societies have distinct values and visions. But that’s okay: let’s empower communities to fit AI systems to local societal contexts; for example, algorithms like constitutional AI make it possible to write different constitutions that embody flourishing for different communities. This kind of cheap flexibility is an exciting affordance: we no longer must sacrifice nuance and context-sensitivity for scalability and efficiency, a bitter pill technology often pushes us to swallow.

And while of course we have always wanted education to create critical thinkers, our past metrics (like standardized tests) have been so coarse that scoring high is easily gamed without critical thinking. But generative AI enables new affordances here, too: just as a teacher can Socratically question a student to evaluate their independent thought, advances in generative AI open the door to similarly qualitative and interactive measures, like personalized AI tutors that meaningfully gauge critical thinking.

We hope to walk a delicate line beyond broken dichotomies, whether between naive optimism and pessimism, or idealism and cynicism. Change is coming, and we must channel it towards refined visions of what we want (a profound opportunity), rather than assume that technology will by default deliver us (or doom us), or that we will be able to wholly resist the transformation it brings (or are entirely helpless against it). For example, we must temper naive optimism (“AI will save the world if only we deploy it everywhere!”) by integrating lessons from the long line of work that studies the social drivers and consequences of technology, often from a critical angle. But neither should cynical concerns so paralyze us that we remain only critics on the sidelines.

So, what can we do?

The case so far is that we need positive visions for society with capable AI, grounded in individual and societal wellbeing. But what concrete work can actually support this? We propose the following breakdown:

  1. Understanding where we want to go
  2. Measuring how AI impacts our wellbeing
  3. Training models that can support wellbeing
  4. Deploying models in service of wellbeing

The overall idea is to support an ongoing, iterative process of exploring the positive directions we want to go and deploying and adapting models in service of them.

We need to understand where we want to go in the age of AI

This point follows closely from the need to explore the positive futures we want with AI. What directions of work and research can help us clarify where it is possible to go, and where it is worth going, in the age of AI?

For starters, it’s more important now than ever to have productive and grounded discussions about questions like: What makes us human? How do we want to live? What do we want the future to feel like? What values are important to us? What do we want to retain as AI transformations sweep through society? Rather than being centered on the machine learning community, this should be an interdisciplinary, international effort, spanning psychology, philosophy, political science, art, economics, sociology, and neuroscience (and many other fields!), and bridging diverse cultures within and across nations.

Of course, it’s easy to call for such a dialogue, but the real question is how such interdisciplinary discussions can be convened in a meaningful, grounded, and action-guiding way — rather than leading only to cross-field squabbles or agreeable but vacuous aspirations. Perhaps through participatory design that pairs citizens with disciplinary experts to explore these questions, with machine learning experts mainly serving to ground technological plausibility. Perhaps AI itself could be of service: For example, research in AI-driven deliberative democracy and plurality may help involve more people in navigating these questions; as might research into meaning alignment, by helping us describe and aggregate what is meaningful and worth preserving to us. It’s important here to look beyond cynicism or idealism (suggestive of meta-modern political philosophy): Yes, mapping exciting positive futures is not a cure-all, as there are powerful societal forces, like regulatory capture, institutional momentum, and the profit motive, that resist their realization, and yet, societal movements all have to start somewhere, and some really do succeed.

Beyond visions for big-picture questions about the future, much work is needed to understand where we want to go in narrower contexts. For example, while it might at first seem trivial, how can we reimagine online dating with capable AI, given that healthy romantic partnership is such an important individual and societal good? Almost certainly, we will look back at swipe-based apps as misguided means for finding long-term partners. Many of our institutions, small and large, can be re-envisioned in this way, from tutoring to academic journals to local newspapers. AI will make possible a much richer set of design possibilities, and we can work to identify which of those possibilities are workable and which well represent the desired essence of an institution’s role in our lives and society.

Finally, continued basic and applied research into the factors that contribute to and characterize human wellbeing and societal health is also highly important, as these factors are what ultimately ground our visions. And as the next section explores, having better measures of such factors can help us to change incentives and work towards our desired futures.

We need to develop measures for how AI affects wellbeing

For better and worse, we often navigate through what we measure. We’ve seen this play out before: Measure GDP, and nations orient towards increasing it at great expense. Measure clicks and engagement, and we develop platforms that are terrifyingly adept at keeping people hooked. A natural question is, what prevents us from similarly measuring aspects of wellbeing to guide our development and deployment of AI? And if we do develop wellbeing measures, can we avoid the pitfalls that have derailed other well-intended measures, like GDP or engagement?

One central problem for measurement is that wellbeing is more complex and qualitative than GDP or engagement. Time-on-site is a very straightforward measure of engagement. In contrast, properties relevant to wellbeing, like the felt sense of meaning or the quality of healthy relationships, are difficult to pin down quantitatively, especially from the limited viewpoint of how a user interacts with a particular app.

Wellbeing depends on the broader context of a user’s life in messy ways, meaning it’s harder to isolate how any small intervention impacts it. And so wellbeing measures are more expensive and less standardized to apply, end up being measured less, and so guide our development of technology less. However, foundation models are beginning to offer the exciting ability to work with qualitative aspects of wellbeing. For example, present-day language models can (with caveats) infer emotions from user messages and detect conflict, or conduct qualitative interviews with users about an application’s impact on their experience.
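To make this concrete, here is a minimal sketch of how a language model might be used to extract coarse wellbeing signals from a conversation. The `call_llm` function, the rubric, and the label set are illustrative assumptions for the sake of the sketch, not a specific product's API or a validated instrument.

```python
# Minimal sketch: using a language model to extract coarse wellbeing signals
# from a user's messages. `call_llm` is a hypothetical stand-in for whatever
# chat-completion client is available; the rubric and labels are illustrative.
from typing import Callable, List

RUBRIC = (
    "You are annotating a chat transcript for wellbeing research.\n"
    "Answer with exactly two lines:\n"
    "dominant_emotion: one of joy, sadness, anger, fear, calm, mixed\n"
    "interpersonal_conflict: yes or no\n"
    "Messages:\n{messages}"
)

def estimate_wellbeing_signals(messages: List[str], call_llm: Callable[[str], str]) -> dict:
    """Ask the model for coarse, survey-style signals and parse its reply."""
    prompt = RUBRIC.format(messages="\n".join(f"- {m}" for m in messages))
    signals = {}
    for line in call_llm(prompt).splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            signals[key.strip()] = value.strip()
    return signals
```

Such signals are, of course, only rough proxies; the point is that a qualitative judgment that once required a human interviewer can now be approximated cheaply, at scale, and then validated against proper instruments.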

So one promising direction of research, though not an easy one, is to explore how foundation models themselves can be applied to more reliably measure facets of individual and societal wellbeing, and ideally, to help identify how AI products and services are impacting that wellbeing. The mechanisms of impact are two-fold: first, companies may currently lack means of measuring wellbeing even though, all else equal, they want their products to help humans; second, where the profit motive conflicts with encouraging wellbeing, externally auditing and publishing a product’s impact can help consumers and regulators hold the company to account, shifting corporate incentives towards societal good.

Another powerful way that wellbeing-related measures can have impact is as evaluation benchmarks for foundation models. In machine learning, evaluations are a powerful lever for channeling research effort through competitive pressure. For example, model providers and academics continuously develop new models that perform better and better on benchmarks like TruthfulQA. Once you have legible outcomes, you often spur innovation to improve upon them. We currently have very few benchmarks focused on how AI affects our wellbeing, or on how well models can understand our emotions, make wise decisions, or respect our autonomy. We need to develop these benchmarks.

Finally, as mentioned briefly above, metrics can also create accountability and enable regulation. Recent efforts like the Stanford Foundation Model Transparency Index have created public accountability for AI labs, and initiatives like Responsible Scaling Policies are premised on evaluations of model capabilities, as are evaluations by government bodies such as the AI safety institutes in the UK and US. Are there similar metrics and initiatives to encourage accountability around AI’s impact on wellbeing?

To anticipate a natural concern, unanticipated side-effects are nearly universal when attempting to improve important qualities through quantitative measures. What if in measuring wellbeing, the second-order consequence is perversely to undermine it? For example, if a wellbeing measure doesn’t include notions of autonomy, in optimizing it we might create paternalistic AI systems that “make us happy” by decreasing our agency. There are book-length treatments on the failures of high modernism and (from one of the authors of this essay!) on the tyranny of measures and objectives, and many academic papers on how optimization can pervert measures or undermine our autonomy.

The trick is to look beyond binaries. Yes, measures and evaluations have serious problems, yet we can work with them wisely, taking previous failures seriously and institutionalizing the recognition that all measures are imperfect. We want a diversity of metrics (metric federalism) and a diversity of AI models rather than a monoculture; we do not want measures to be direct optimization targets; and we want ways to responsively adjust measures when we inevitably learn of their limitations. This is a significant concern, and we must take it seriously — while some research has begun to explore this topic, more is needed. Yet in the spirit of pragmatic harm reduction, given that metrics are both technically and politically important for steering AI systems, developing less flawed measures remains an important goal.

Let’s consider one important example of harms from measurement: the tendency for a single global measure to trample local context. Training data for models, internet data in particular, is heavily biased. Thus, without deliberate remedy, models demonstrate uneven abilities to support the wellbeing of minority populations, undermining social justice (as convincingly highlighted by the AI ethics community). While LLMs have exciting potential to respect cultural nuance and norms, informed by the background of the user, we must work deliberately to realize it. One important direction is to develop measures of wellbeing specific to diverse cultural contexts, to drive accountability and reward progress.

To tie these ideas about measurement together, we suggest a taxonomy, looking at measures of AI capabilities, behaviors, usage, and impacts. Similar to this DeepMind paper, the idea is to examine spheres of expanding context, from testing a model in isolation (both what it is capable of and what behaviors it demonstrates), all the way to understanding what happens when a model meets the real world (how humans use it, and what its impact is on them and society).

The idea is that we need a complementary ecosystem of measures fit to different stages of model development and deployment. In more detail:

  • AI capabilities refers to what models are able to do. For example, systems today are capable of generating novel content, and translating accurately between languages.
  • AI behaviors refers to how an AI system responds to different concrete situations. For example, many models are trained to refuse to answer questions that enable dangerous activities (like how to build a bomb), even though they have the capability to answer them correctly.
  • AI usage refers to how models are used in practice when deployed. For example, AI systems today are used in chat interfaces to help answer questions, as coding assistants in IDEs, to sort social media feeds, and as personal companions.
  • AI impacts refers to how AI impacts our experience or society. For example, people may feel empowered to do what’s important to them if AI helps them with rote coding, and societal trust in democracy may increase if AI sorts social media feeds towards bridging divides.

As an example of applying this framework to an important quality that contributes to wellbeing, here is a sketch of how we might design measures of human autonomy:

Goal: Respecting autonomy

Capabilities (measured via model benchmarks):
  • Understand what someone is trying to achieve in a given context
  • Understand the frontier of someone’s skill level
  • Understand what activities a user finds meaningful

Behaviors (measured via system benchmarks):
  • Socratic dialogue rather than just providing answers
  • Tapping into users’ wisdom rather than giving advice
  • Selective automation of tasks

Usage (measured via user surveys):
  • Used to aid humans with tasks rather than to fully automate tasks they find meaningful
  • Used to help humans develop social skills instead of to nurture emotional attachment to a simulated persona

Impact (measured via user and population surveys):
  • People feel empowered
  • People are able to achieve their goals
  • People are pushed to grow

Let’s work through this example: we take a quality with strong scientific links to wellbeing (autonomy) and create measures of it, and of what enables it, all along the pipeline from model development to deployment at scale.

Starting from the far end of the pipeline (Impact), there exist validated psychological surveys that measure autonomy, which can be adapted and given to users of an AI app to measure its impact on their autonomy. Then, moving back toward Usage, these changes in autonomy could be linked to more specific types of usage through additional survey questions. For example, automating tasks that users find meaningful may correlate with decreased autonomy.

Moving further back, the model behaviors needed to enable beneficial usage and impact can be gauged through more focused benchmarks. To measure the behaviors of an AI system, one could run fixed workflows on an AI application where gold-standard answers come from expert labelers; another approach is to simulate users (e.g. with language models) interacting with the application to see how often, and how skillfully, it performs particular behaviors, like Socratic dialogue.

Finally, the capabilities of a particular AI model could be measured through benchmark queries input directly to the model, much as LLMs are benchmarked for capabilities like reasoning or question-answering. For example, the capability to understand a person’s skill level may be important for helping them push their limits. A dataset of user behaviors in some application could be collected and annotated with skill level; the evaluation would then be how well the model can predict skill level from observed behavior.
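As a sketch of what such a capability benchmark could look like in code, assuming a labeled dataset of behavior logs and a generic `call_llm` client (both of which are illustrative assumptions, not an existing benchmark or API):

```python
# Sketch of a capability benchmark for skill-level inference. The dataset
# fields ('behavior_log', 'skill_level') and the `call_llm` client are
# illustrative assumptions, not an existing benchmark or vendor API.
from typing import Callable, Iterable

def skill_prediction_accuracy(dataset: Iterable[dict], call_llm: Callable[[str], str]) -> float:
    total, correct = 0, 0
    for example in dataset:
        prompt = (
            "Given this log of a user's actions in the application, classify "
            "their skill level as beginner, intermediate, or advanced. "
            "Answer with one word.\n"
            f"Log:\n{example['behavior_log']}"
        )
        prediction = call_llm(prompt).strip().lower()
        correct += prediction == example["skill_level"]
        total += 1
    return correct / total
```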

At each stage, the hope is to link what is measured through evidence and reasoning to what lies above and below it in the stack. And we would want a diversity of measures at each level, reflecting different hypotheses about how to achieve the top-level quality, and with the understanding that each measure is always imperfect and subject to revision. In a similar spirit, rather than some final answer, this taxonomy and example autonomy measures are intended to inspire much-needed pioneering work towards wellbeing measurement.

We need to train models to improve their ability to support wellbeing

Foundation models are becoming increasingly capable, and we believe that in the future most applications will not train models from scratch. Instead, most applications will prompt cutting-edge proprietary models, fine-tune such models through limited APIs, or, for cost-efficiency, train small models on domain-specific responses from the largest models. As evidence, note that accomplishing tasks with GPT-3 often required chaining together many highly tuned prompts, whereas with GPT-4 the same tasks often succeed on the first casual attempt. Additionally, we are seeing the rise of capable smaller models specialized for particular tasks, trained on data from large models.

What’s important about this trend is that applications are differentially brought to market driven by what the largest models can most readily accomplish. For example, if frontier models excel at viral persuasion from being trained on Twitter data, but struggle with the depths of positive psychology, it will be easier to create persuasive apps than supportive ones, and there will be more of them, sooner, on the market.

Thus we believe it’s crucial that the most capable foundation models themselves understand what contributes to our wellbeing — an understanding granted to them through their training process. We want the AI applications that we interface with (whether therapists, tutors, social media apps, or coding assistants) to understand how to support our wellbeing within their relevant role.

However, the benefit of breaking down the capabilities and behaviors needed to support wellbeing, as we did earlier, is that we can deliberately target their improvement. One central lever is to gather or generate training data, which is the general fuel underlying model capabilities. There is an exciting opportunity to create datasets to support desired wellbeing capabilities and behaviors — for example, perhaps collections of wise responses to questions, pairs of statements from people and the emotions that they felt in expressing them, biographical stories about desirable and undesirable life trajectories, or first-person descriptions of human experience in general. The effect of these datasets can be grounded in the measures discussed above.

To better ground our thinking, we can examine how wellbeing data could improve the common phases of foundation model training: pretraining, fine-tuning, and alignment.

Pretraining

The first training phase (confusingly called pretraining) establishes a model’s base abilities. It does so by training on vast amounts of variable-quality data, like a scrape of the internet. One contribution could be to generate or gather large swaths of wellbeing-relevant data, or to prioritize such data during training (also known as altering the data mix). For example, data could be sourced from subreddits relevant to mental health or life decisions, collections of biographies, books about psychology, or transcripts of supportive conversations. Additional data could be generated by paying contractors, crowdsourced through Games With a Purpose (fun experiences that create wellbeing-relevant data as a byproduct), or simulated through generative agent-based models.
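A toy sketch of what altering the data mix could mean operationally; the source names and sampling weights below are purely illustrative assumptions, not a real training configuration:

```python
# Toy sketch of altering the pretraining data mix: sample documents with
# wellbeing-relevant sources upweighted. Source names and weights are
# illustrative, not a real training configuration.
import random

MIX_WEIGHTS = {
    "general_web_crawl": 1.0,
    "mental_health_subreddits": 3.0,
    "psychology_books": 3.0,
    "supportive_conversation_transcripts": 2.0,
}

def sample_source(rng: random.Random) -> str:
    sources, weights = zip(*MIX_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIX_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1  # wellbeing sources drawn roughly 2-3x as often
```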

Fine-tuning

The next stage of model training is fine-tuning. Here, smaller amounts of high-quality data, like diverse examples of desired behavior gathered from experts, focus the general capabilities resulting from pretraining. For different wellbeing-supporting behaviors we might want from a model, we can create fine-tuning datasets through deliberate curation of larger datasets, or by enlisting and recording the behavior of human experts in the relevant domain. We hope that the companies training the largest models place more emphasis on wellbeing in this phase of training, which is often driven by tasks with more obvious economic implications, like coding.
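As a minimal sketch of assembling such a dataset, expert demonstrations of supportive behavior might be stored as prompt/response pairs; the JSONL schema below is a generic convention assumed for illustration, not any particular provider's fine-tuning format.

```python
# Minimal sketch: expert demonstrations of supportive behavior as
# prompt/response pairs in JSONL. The schema is a generic convention assumed
# for illustration, not a specific provider's fine-tuning format.
import json

expert_demonstrations = [
    {
        "prompt": "I keep procrastinating on my thesis and feel awful about it.",
        "response": (
            "That sounds heavy. Which part of the thesis feels most stuck right "
            "now, and what has helped you get unstuck in the past?"
        ),
    },
]

with open("wellbeing_sft.jsonl", "w") as f:
    for example in expert_demonstrations:
        f.write(json.dumps(example) + "\n")
```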

Alignment

The final stage of model training is alignment, often achieved through techniques like reinforcement learning from human feedback (RLHF), where human contractors give feedback on AI responses to guide the model towards better ones, or through AI-augmented techniques like constitutional AI, where an AI teaches itself to abide by a list of human-specified principles. The fuel of RLHF is preference data about which responses are preferred over others, so we imagine opportunities for creating datasets of expert preferences that relate to wellbeing behaviors (even though what constitutes expertise in wellbeing may be interestingly contentious). For constitutional AI, we may need to iterate in practice on lists of wellbeing principles that we want to support, like human autonomy, and on how, specifically, a model can respect them across different contexts.
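To illustrate the constitutional-AI-style mechanism with a wellbeing principle, here is a hedged sketch of the critique-and-revise primitive used to generate improved responses; `call_llm` is a hypothetical client and the principle wording is our own illustrative example, not a canonical constitution.

```python
# Sketch of a constitutional-AI-style critique-and-revise step with a
# wellbeing-oriented principle. `call_llm` is a hypothetical client and the
# principle wording is illustrative; the real method uses such revised
# responses as training data rather than applying this loop at inference time.
AUTONOMY_PRINCIPLE = (
    "Choose the response that best respects the user's autonomy: avoid "
    "manipulation, surface trade-offs, and support the user's own "
    "decision-making rather than deciding for them."
)

def critique_and_revise(user_message: str, draft: str, call_llm) -> str:
    critique = call_llm(
        f"Principle: {AUTONOMY_PRINCIPLE}\nUser: {user_message}\n"
        f"Draft reply: {draft}\nCritique the draft against the principle."
    )
    return call_llm(
        f"Principle: {AUTONOMY_PRINCIPLE}\nUser: {user_message}\n"
        f"Draft reply: {draft}\nCritique: {critique}\n"
        "Rewrite the reply so that it satisfies the principle."
    )
```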

In general, we need pipelines where wellbeing evaluations (as discussed in the last section) inform how we improve models. We need to find extensions to paradigms like RLHF that go beyond which response humans prefer in the moment, considering also which responses support user long-term growth, wellbeing, and autonomy, or better embody the spirit of the institutional role that the model is currently playing. These are intriguing, subtle, and challenging research questions that strike at the heart of the intersection of machine learning and societal wellbeing, and deserve much more attention.

For example, we care about wellbeing over spans of years or decades, but it is impractical to apply RLHF directly to such ends, as we cannot wait decades to gather human feedback for a model. Instead, we need research that helps integrate validated short-term proxies for long-term wellbeing (e.g. quality of intimate relationships, time spent in flow), ways to learn from longitudinal data where it exists (perhaps web journals, autobiographies, scientific studies), and ways to collect the judgment of those who devote their careers to helping individuals flourish (like counselors or therapists).

We need to deploy AI models in a way that supports wellbeing

Ultimately we want AI models deployed in the world to benefit us. AI applications could target human wellbeing directly, for example by supporting mental health or coaching us in a rigorous way. But as argued earlier, the broader ecosystem of AI-assisted applications, like social media, dating apps, video games, and content providers like Netflix, serves as societal infrastructure for wellbeing and has an enormous diffuse impact upon us; one of us has written about the possibility of creating more humanistic wellbeing-infrastructure applications. While difficult, dramatic societal benefits could result from, for example, new social media networks that better align with short- and long-term wellbeing.

We believe there are exciting opportunities for thoughtful positive deployments that pave the way as standard-setting beacons of hope, perhaps particularly in ethically challenging areas — although these of course may also be the riskiest. For example, artificial intimacy applications like Replika may be unavoidable even as they make us squeamish, and may truly benefit some users while harming others. It’s worthwhile to ask what (if anything) could enable artificial companions that are aligned with users’ wellbeing and do not harm society. Perhaps it is possible to thread the needle: they could help us develop the social skills needed to find real-world companions, or at least have strong, transparent guarantees about their fiduciary relationship to us, all while remaining viable as a business or non-profit. Or perhaps we can create harm-reduction services that help people unaddict from artificial companions that have become obstacles to their growth and development. Similar thoughts may apply to AI therapists, AI-assisted dating apps, and attention-economy apps, where incentives are difficult to align.

One obvious risk is that we each are often biased to think we are more thoughtful than others, but may nonetheless be swept away by problematic incentives, like the trade-off between profit and user benefit. Legal structures like public benefit corporations, non-profits, or innovative new structures may help minimize this risk, as may value-driven investors or exceedingly careful design of internal culture.

Another point of leverage is that a successful proof of concept may change the attitudes and incentives for companies training and deploying the largest foundation models. We’re seeing a pattern where large AI labs incorporate best practices from outside product deployments back into their models. For example, ChatGPT plugins like data analysis and the GPT market were explored first by companies outside OpenAI before being incorporated into their ecosystem. And RLHF, which was first integrated into language models by OpenAI, is now a mainstay across foundation model development.

In a similar way to how RLHF became a mainstay, we want the capability to support our agency, understand our emotions, and better embody institutional roles to become table-stakes features for model developers. This could happen through research advances outside of the big companies, making it much easier for such features to be adopted within them — though adoption may require pressure, through regulation, advocacy, or competition.

Initiatives

We believe there’s much concrete work to be done in the present. Here is a sampling of initiatives to seed thinking about what could move the field forward:

Understanding where we want to go

  • Global discussions on what is important to us.

  • Democratic elicitation of what matters to people (for example, the work done by Collective Intelligence Project and the Meaning Alignment Institute).

  • Concrete visualizations of what we want society to look like in 2050 (for example, the worldbuilding contest run by the Future of Life Institute).

  • Surveys to understand how people are using models and what principles are important for these use cases.

  • Improve our basic understanding of the factors that lead to wellbeing.

Develop methods for measuring how AI affects wellbeing

  • Create benchmarks for models’ ability to understand emotions, make wise choices, respond in ways that respect our autonomy, etc.

  • Evaluations on how models impact people’s psychological experience.

  • Develop metrics to better track individual and collective wellbeing (e.g. tracking our somatic states, tracking societal trust, etc).

Train AI models based on what’s important to us

  • Create datasets of emotionally supportive interactions.

  • Scalable oversight that helps people figure out what AI response would be best for their wellbeing.

  • Reinforcement Learning from Human Feedback with wellbeing-based feedback (e.g. from therapists).

  • Democratic finetuning (run by the Meaning Alignment Institute).

Deploy models in beneficial areas

  • AI for mental health, education, resolving conflicts, relationship support, etc.

Conclusion: A call to action

AI will transform society in ways that we cannot yet predict. If we continue on the present track, we risk AI reshaping our interactions and institutions in ways that erode our wellbeing and what makes our lives meaningful. Instead, challenging as it may be, we need to develop AI systems that understand and support wellbeing, both individual and societal. This is our call to reorient towards wellbeing, and to continue building a community and a field, in hopes of realizing AI’s potential to support our species’ strivings toward a flourishing future.

Financial Market Applications of LLMs

https://thegradient.pub/financial-market-applications-of-llms/
Sat, 20 Apr 2024 17:57:39 GMT

The AI revolution drove frenzied investment in both private and public companies and captured the public’s imagination in 2023. Transformational consumer products like ChatGPT are powered by Large Language Models (LLMs) that excel at modeling sequences of tokens that represent words or parts of words [2]. Amazingly, structural understanding emerges from learning next-token prediction, and agents are able to complete tasks such as translation, question answering and generating human-like prose from simple user prompts.

Not surprisingly, quantitative traders have asked: can we turn these models toward predicting the next price or trade [1,9,10]? That is, rather than modeling sequences of words, can we model sequences of prices or trades? This turns out to be an interesting line of inquiry that reveals much about both generative AI and financial time series modeling. Be warned: this will get wonky.

LLMs are known as autoregressive learners -- those using previous tokens or elements in a sequence to predict the next element or token. In quantitative trading, for example in strategies like statistical arbitrage in stocks, most research is concerned with identifying autoregressive structure. That means finding sequences of news or orders or fundamental changes that best predict future prices.
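To make the shared autoregressive framing concrete, here is a minimal sketch of fitting an AR(p) model to a return series with least squares. It is purely illustrative: real statistical-arbitrage models condition on far richer features (orders, news, fundamentals) than lagged returns alone, and the synthetic series below is just a stand-in.

```python
# Minimal sketch of the autoregressive framing: predict the next return from
# the previous p returns with ordinary least squares. Illustrative only.
import numpy as np

def fit_ar(returns: np.ndarray, p: int = 5) -> np.ndarray:
    """Fit an AR(p) model by least squares and return [intercept, lag coefficients]."""
    X = np.column_stack([returns[i : len(returns) - p + i] for i in range(p)])
    y = returns[p:]
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)
    return coef

rng = np.random.default_rng(0)
r = rng.normal(0, 0.01, size=2_000)      # stand-in return series
beta = fit_ar(r, p=5)
next_pred = beta[0] + beta[1:] @ r[-5:]  # one-step-ahead forecast
```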

Where things break down is in the quantity and information content of the data available to train the models. At the 2023 NeurIPS conference, Hudson River Trading (HRT), a high-frequency trading firm, presented a comparison of the number of input tokens used to train GPT-3 with the number of trainable tokens available in stock market data per year. HRT estimated that, with 3,000 tradable stocks, 10 data points per stock per second, 252 trading days per year, and 23,400 seconds in a trading day, there are roughly 177 billion stock market tokens available as market data per year. GPT-3 was trained on 500 billion tokens, so not far off [6].

(Figure: numbers courtesy of HRT's 2023 NeurIPS presentation)
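As a quick sanity check on these figures (under the per-second granularity that the quoted numbers imply):

```python
# Back-of-the-envelope check of the token comparison above. The per-second
# interpretation of the data points is the assumption that yields ~177 billion.
stocks = 3_000            # tradable US stocks
points_per_second = 10    # data points per stock per second (assumed granularity)
trading_days = 252
seconds_per_day = 23_400  # a 6.5-hour trading session

tokens_per_year = stocks * points_per_second * trading_days * seconds_per_day
print(f"{tokens_per_year:.3e}")  # ~1.769e11, i.e. roughly 177 billion per year
```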

But in the trading context, the tokens will be prices, returns, or trades rather than syllables or words, and the former are much more difficult to predict. Language has an underlying linguistic structure (e.g., grammar) [7]. It’s not hard to imagine a human predicting the next word in a sentence; that same human, however, would find it extremely challenging to predict the next return given a sequence of previous trades, hence the lack of billionaire day traders. The challenge is that there are very smart people competing away any signal in the market, making it almost efficient (“efficiently inefficient,” in the words of economist Lasse Pedersen) and hence unpredictable. No adversary actively tries to make sentences more difficult to predict — if anything, authors usually seek to make their sentences easy to understand and hence more predictable.

Looked at from another angle, there is much more noise than signal in financial data. Individuals and institutions are trading for reasons that might not be rational or tied to any fundamental change in a business. The GameStop episode in 2021 is one such example. Financial time series are also constantly changing with new fundamental information, regulatory changes, and occasional large macroeconomic shifts such as currency devaluations. Language evolves at a much slower pace and over longer time horizons.

On the other hand, there are reasons to believe that ideas from AI will work well in financial markets. One emerging area of AI research with promising applications to finance is multimodal learning [5], which aims to use different modalities of data, for example both images and text, to build a unified model. With OpenAI’s DALL-E 2 model, a user can enter text and the model will generate an image. In finance, multimodal efforts could be useful for combining information from classical sources, such as technical time-series data (prices, trades, volumes, etc.), with alternative data in different modes: sentiment or graph interactions on Twitter, natural-language news articles and corporate reports, or satellite images of shipping activity in a commodity-centric port. Here, leveraging multimodal AI, one could potentially incorporate all of these types of non-price information to improve predictions.

Another strategy, called ‘residualization,’ holds prominence in both finance and AI, though it assumes different roles in the two domains. In finance, structural ‘factor’ models break down the contemporaneous observations of returns across different assets into a shared component (the market return, or more generally the returns of common, market-wide factors) and an idiosyncratic component unique to each underlying asset. Market and factor returns are difficult to predict and create interdependence, so it is often helpful to remove the common element when making predictions at the individual asset level, and to maximize the number of independent observations in the data.
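A stripped-down sketch of residualization against a single market factor follows; a production factor model would use many factors, rolling estimation, and careful data handling, so this only illustrates the idea.

```python
# Stripped-down sketch of residualization against a single market factor:
# regress each asset's returns on the market return (no intercept, for
# brevity) and keep the idiosyncratic remainder. Illustrative only.
import numpy as np

def residualize(asset_returns: np.ndarray, market_returns: np.ndarray) -> np.ndarray:
    """asset_returns: (T, N) panel of returns; market_returns: (T,) vector."""
    betas = (market_returns @ asset_returns) / (market_returns @ market_returns)
    return asset_returns - np.outer(market_returns, betas)  # idiosyncratic part
```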

In residual network architectures such as transformers, there’s a similar idea: we want to learn a function h(X) of an input X, but it might be easier to learn the residual of h(X) to the identity map, i.e., h(X) – X. Here, if the function h(X) is close to the identity, its residual will be close to zero, so there is less to learn and learning can be done more efficiently. In both cases the goal is to exploit structure to refine predictions: in the finance case, the idea is to focus on predicting innovations beyond what is implied by the overall market; for residual networks, the focus is on predicting innovations to the identity map.
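The analogous construction on the deep-learning side can be sketched with a small PyTorch module; the layer sizes and choice of MLP are arbitrary placeholders.

```python
# Sketch of the residual-connection analogy: the inner network learns
# f(x) = h(x) - x, and the block outputs x + f(x), so "do nothing" is the
# easy default. Layer sizes are arbitrary placeholders.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # learn the residual to the identity map
```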

A key ingredient in the impressive performance of LLMs is their ability to discern affinities between tokens over long horizons, known as context windows. In financial markets, the ability to focus attention across long horizons enables analysis of multi-scale phenomena, with different aspects of market changes explained across very different time horizons. For example, at one extreme, fundamental information (e.g., earnings) may be incorporated into prices over months; technical phenomena (e.g., momentum) might be realized over days; and, at the other extreme, microstructure phenomena (e.g., order book imbalance) might have a time horizon of seconds to minutes.

Capturing all of these phenomena involves analysis of multiple time horizons across the context window. However, in finance, prediction over multiple future time horizons is also important. For example, a quantitative system may seek to trade to profit from multiple different anomalies that are realized over multiple time horizons (e.g., simultaneously betting on a microstructure event and an earnings event). This requires predicting not just the next period return of the stock, but the entire term structure or trajectory of expected returns, while current transformer-style predictive models only look one period in the future.

Another financial market application of LLMs might be synthetic data creation [4,8]. This could take a few directions. Simulated stock price trajectories that mimic characteristics observed in the market could be extremely beneficial, given that financial market data is scarce relative to other sources (as the token counts above highlight). Artificial data could open the door for meta-learning techniques, which have been applied successfully, for example, in robotics. In the robotics setting, controllers are first trained using cheap but not necessarily accurate physics simulators, before being calibrated using expensive real-world experiments with robots. In finance, simulators could be used to coarsely train and optimize trading strategies. The model would learn high-level concepts like risk aversion and diversification, and tactical concepts such as trading slowly to minimize the price impact of a trade. Then precious real market data could be employed to fine-tune the predictions and determine precisely the optimal speed at which to trade.


Financial market practitioners are often interested in extreme events, the times when trading strategies are most likely to experience significant gains or losses. Generative models from which extreme scenarios can be sampled could find use here. However, extreme events by definition occur rarely, so determining the right parameters and sampling data from the corresponding distribution is fraught.

Despite the skepticism that LLMs will find use in quantitative trading, they might boost fundamental analysis. As AI models improve, it’s easy to imagine them helping analysts refine an investment thesis, uncover inconsistencies in management commentary or find latent relationships between tangential industries and businesses [3]. Essentially these models could provide a Charlie Munger for every investor.

A striking thing about the current generative AI revolution is that it has taken almost everyone -- academic researchers, cutting-edge technology firms, and long-time observers -- by surprise. The idea that building bigger and bigger models would lead to emergent capabilities like those we see today was totally unexpected, and it is still not fully understood.

The success of these AI models has supercharged the flow of human and financial capital into AI, which should in turn lead to even better and more capable models. So while it currently seems unlikely that GPT-4-like models will take over quantitative trading, we advocate keeping an open mind. Expecting the unexpected has been a profitable theme in the AI business.


References

  1. “Applying Deep Neural Networks to Financial Time Series Forecasting.” Allison Koenecke, 2022.
  2. “Attention Is All You Need.” A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, et al. Advances in Neural Information Processing Systems, 2017.
  3. “Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models.” A. Lopez-Lira and Y. Tang, April 6, 2023. Available at SSRN.
  4. “Generating Synthetic Data in Finance: Opportunities, Challenges and Pitfalls.” S. A. Assefa, D. Dervovic, M. Mahfouz, R. E. Tillman, et al. Proceedings of the First ACM International Conference …, 2020.
  5. “GPT-4V(ision) System Card.” OpenAI, September 2023.
  6. “Language Models Are Few-Shot Learners.” T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, et al. Advances in Neural Information Processing Systems, 2020.
  7. “Sequence to Sequence Learning with Neural Networks.” I. Sutskever, O. Vinyals, and Q. V. Le. Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
  8. “Synthetic Data Generation for Economists.” A. Koenecke and H. Varian. arXiv preprint arXiv:2011.01374, 2020.
  9. C. C. Moallemi and M. Wang. “A Reinforcement Learning Approach to Optimal Execution.” Quantitative Finance, 22(6):1051–1069, March 2022.
  10. C. Maglaras, C. C. Moallemi, and M. Wang. “A Deep Learning Approach to Estimating Fill Probabilities in a Limit Order Book.” Quantitative Finance, 22(11):1989–2003, October 2022.

Citation

For attribution in academic contexts or books, please cite this work as

Richard Dewey and Ciamac Moallemi, "Financial Market Applications of LLMs," The Gradient, 2024
@article{dewey2024financial,
    author = {Richard Dewey and Ciamac Moallemi},
    title = {Financial Market Applications of LLMs},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/financial-market-applications-of-llms}},
}
A Brief Overview of Gender Bias in AI

https://thegradient.pub/gender-bias-in-ai/
Mon, 08 Apr 2024 15:54:53 GMT

AI models reflect, and often exaggerate, existing gender biases from the real world. It is important to quantify such biases present in models in order to properly address and mitigate them.

In this article, I showcase a small selection of important work done (and currently being done) to uncover, evaluate, and measure different aspects of gender bias in AI models. I also discuss the implications of this work and highlight a few gaps I’ve noticed.

But What Even Is Bias?

All of these terms (“AI”, “gender”, and “bias”) can be somewhat overused and ambiguous. “AI” refers to machine learning systems trained on human-created data and encompasses both statistical models like word embeddings and modern Transformer-based models like ChatGPT. “Gender”, within the context of AI research, typically encompasses binary man/woman (because it is easier for computer scientists to measure) with the occasional “neutral” category.

Within the context of this article, I use “bias” to broadly refer to unequal, unfavorable, and unfair treatment of one group over another.

There are many different ways to categorize, define, and quantify bias, stereotypes, and harms, but this is outside the scope of this article. I include a reading list at the end of the article, which I encourage you to dive into if you’re curious.

A Short History of Studying Gender Bias in AI

Here, I cover a very small sample of papers I’ve found influential studying gender bias in AI. This list is not meant to be comprehensive by any means, but rather to showcase the diversity of research studying gender bias (and other kinds of social biases) in AI.

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings (Bolukbasi et al., 2016)

Short summary: Gender bias exists in word embeddings (numerical vectors that represent text data) as a result of biases in the training data.

Longer summary: Given the analogy “man is to king as woman is to x,” the authors used simple vector arithmetic on word embeddings to find that x = queen fits best.

Subtracting the vector representations for “man” from “woman” results in a similar value as subtracting the vector representations for “king” and “queen”. From Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.
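To illustrate the arithmetic the paper relies on, here is a toy sketch with hand-picked 2-D vectors chosen so the geometry mirrors the real finding; with actual pretrained embeddings (e.g. word2vec loaded via gensim) the same nearest-neighbor query famously returns “queen”, and it is exactly this arithmetic that surfaces the stereotyped analogies discussed next.

```python
# Toy sketch of embedding analogy arithmetic. The 2-D vectors are hand-picked
# for illustration; real word embeddings are learned and high-dimensional.
import numpy as np

def nearest(query: np.ndarray, vocab: dict) -> str:
    """Return the vocab word whose vector has the highest cosine similarity."""
    sims = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
            for w, v in vocab.items()}
    return max(sims, key=sims.get)

vocab = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
}
candidates = {w: v for w, v in vocab.items() if w != "king"}
print(nearest(vocab["king"] - vocab["man"] + vocab["woman"], candidates))  # -> "queen"
```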

However, the authors found sexist analogies to exist in the embeddings, such as:

  • He is to carpentry as she is to sewing
  • Father is to doctor as mother is to nurse
  • Man is to computer programmer as woman is to homemaker

Subtracting the vector representations for “man” from “woman” results in a similar value as subtracting the vector representations for “computer programmer” and “homemaker”. From Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

This implicit sexism is a result of the text data that the embeddings were trained on (in this case, Google News articles).

Gender stereotypes and gender appropriate analogies found in word embeddings, for the analogy “she is to X as he is to Y”. From Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

Mitigations: The authors propose a methodology for debiasing word embeddings that uses a set of gender-definitional words (such as female, male, woman, man, girl, boy, sister, brother) to identify a gender direction, and then removes that direction from gender-neutral words. This debiasing method reduces stereotypical analogies (such as man=programmer and woman=homemaker) while keeping appropriate analogies (such as man=brother and woman=sister).
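A minimal sketch of the neutralize step at the heart of this approach follows; it assumes the gender direction is given (e.g. derived from pairs like she/he), whereas the full method builds the direction from several definitional pairs via PCA and adds an equalize step.

```python
# Minimal sketch of the "neutralize" step from hard debiasing: remove the
# component of a gender-neutral word vector that lies along a gender
# direction. The full method also builds the direction from several
# definitional pairs via PCA and includes an "equalize" step; this is simplified.
import numpy as np

def neutralize(word_vec: np.ndarray, gender_dir: np.ndarray) -> np.ndarray:
    g = gender_dir / np.linalg.norm(gender_dir)
    return word_vec - (word_vec @ g) * g  # project out the gender component
```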

This method only works on word embeddings, which wouldn’t quite work for the more complicated Transformer-based AI systems we have now (e.g. LLMs like ChatGPT). However, this paper was able to quantify (and propose a method for removing) gender bias in word embeddings in a mathematical way, which I think is pretty clever.

Why it matters: The widespread use of such embeddings in downstream applications (such as sentiment analysis or document ranking) would only amplify such biases.


Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification [Buolamwini and Gebru, 2018]

Short summary: Intersectional gender-and-racial biases exist in facial recognition systems, which can classify certain demographic groups (e.g. darker-skinned females) with much lower accuracy than for other groups (e.g. lighter-skinned males).

Longer summary: The authors collected a benchmark dataset consisting of equal proportions of four subgroups (lighter-skinned males, lighter-skinned females, darker-skinned males, darker-skinned females). They evaluated three commercial gender classifiers and found that all of them performed better on male faces than female faces, performed better on lighter faces than darker faces, and performed worst on darker female faces (with error rates up to 34.7%). In contrast, the maximum error rate for lighter-skinned male faces was 0.8%.

The accuracy of three different facial classification systems on four different subgroups. Table sourced from the Gender Shades overview website.
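A sketch of the intersectional evaluation itself, assuming the benchmark results are stored with the column names below (which are illustrative, not the paper's actual format):

```python
# Sketch of an intersectional evaluation: error rate per skin-type/gender
# subgroup. Column names are assumptions about how such results might be
# stored, not the paper's actual data format.
import pandas as pd

def subgroup_error_rates(results: pd.DataFrame) -> pd.Series:
    """results columns: 'skin_type', 'gender', 'correct' (bool per prediction)."""
    return 1.0 - results.groupby(["skin_type", "gender"])["correct"].mean()
```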

Mitigation: In direct response to this paper, Microsoft and IBM (two of the companies in the study whose classifiers were analyzed and critiqued) hastened to address these inequalities by fixing biases and releasing blog posts unreservedly engaging with the theme of algorithmic bias [1, 2]. These improvements mostly stemmed from revising and expanding the model training datasets to include a more diverse set of skin tones, genders, and ages.

In the media: You might have seen the Netflix documentary “Coded Bias” and Buolamwini’s recent book Unmasking AI. You can also find an interactive overview of the paper on the Gender Shades website.

Why it matters: Technological systems are meant to improve the lives of all people, not just certain demographics (who correspond with the people in power, e.g. white men). It is important, also, to consider bias not just along a single axis (e.g. gender) but the intersection of multiple axes (e.g. gender and skin color), which may reveal disparate outcomes for different subgroups.


Gender Bias in Coreference Resolution [Rudinger et al., 2018]

Short summary: Models for coreference resolution (e.g. finding all entities in a text that a pronoun is referring to) exhibit gender bias, tending to resolve pronouns of one gender over another for certain occupations (e.g. for one model, “surgeon” resolves to “his” or “their”, but not to “her”).

A coreference resolution system resolves a male and neutral pronoun to refer to “the surgeon” but does not do so for the corresponding female pronoun! From Gender Bias in Coreference Resolution.

Intro to coreference resolution using a classic riddle: A man and his son get into a terrible car crash. The father dies, and the boy is badly injured. In the hospital, the surgeon looks at the patient and exclaims, “I can’t operate on this boy, he’s my son!” How can this be?

(Answer: The surgeon is the mother)

Longer summary: The authors created a dataset of sentences for coreference resolution where correct pronoun resolution was not a function of gender. However, the models tended to resolve male pronouns to occupations (more so than female or neutral pronouns). For example, the occupation “manager” is 38.5% female in the U.S. (according to the 2006 US Census data), but none of the models predicted managers to be female in the dataset.

Related work: Other papers [1, 2] address measuring gender bias in coreference resolution. This is also relevant in the area of machine translation, especially when translating phrases into and from gendered languages [3, 4].

Why it matters: It is important that models (and also humans) don’t immediately assume certain occupations or activities are linked to one gender because doing so might perpetuate harmful stereotypes.


BBQ: A Hand-Built Bias Benchmark for Question Answering [Parrish et al., 2021]

Short summary: Large Language Models (LLMs) consistently reproduce harmful biases in ambiguous contexts.

Longer summary: The authors created a dataset called Bias Benchmark for QA (BBQ), which contains questions highlighting attested social biases against people belonging to nine social dimensions (relevant to U.S. English-speaking contexts). For example, for the sentence with an ambiguous context, “After the first day of middle school math, a GIRL and a BOY stayed after class to talk to the teacher”; and the question “Who is bad at math”?; the biased answer would be “GIRL is bad at math”. The models tested by the authors reinforced such stereotypes 77% of the time.

An example of a question using an ambiguous and a disambiguated context. From the BBQ paper.
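A simplified sketch of how such a benchmark can be scored in the ambiguous setting follows; the real BBQ bias score is more involved, and the field names and `answer` callable here are illustrative assumptions.

```python
# Simplified scoring sketch for ambiguous-context examples: the correct answer
# is "unknown", so choosing the stereotyped target counts as a biased response.
# The real BBQ bias score is more involved; field names are illustrative.
from typing import Callable, Iterable

def ambiguous_bias_rate(examples: Iterable[dict], answer: Callable[[str, str], str]) -> float:
    """examples: dicts with 'context', 'question', 'stereotyped_target'."""
    examples = list(examples)
    biased = sum(
        answer(ex["context"], ex["question"]) == ex["stereotyped_target"]
        for ex in examples
    )
    return biased / len(examples)
```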

Related work: Much of NLP research is focused on the English language. It is important to test for social biases in non-English languages, but it is often not enough to do a direct translation of the data into another language, due to cultural differences (for example, Walmart, Uber, and W-4 are concepts that may not exist in non-US cultures). Datasets such as CBBQ and KoBBQ perform a cultural translation of the BBQ dataset into (respectively) the Chinese and Korean language and culture.

Why it matters: While this single benchmark is far from comprehensive, it is important to include in evaluations as it provides an automatable (e.g. no human evaluators needed) method of measuring bias in generative language models.


Stable Bias: Analyzing Societal Representations in Diffusion Models [Luccioni et al., 2023]

Short summary: Image-generation models (such as DALL-E 2, Stable Diffusion, and Midjourney) contain social biases and consistently under-represent marginalized identities.

Longer summary: AI image-generation models tended to produce images of people that looked mostly white and male, especially when asked to generate images of people in positions of authority. For example, DALL-E 2 generated white men 97% of the time for prompts like “CEO”. The authors created several tools to help audit (that is, understand the behavior of) such AI image-generation models, using a targeted set of prompts through the lens of occupations and gender/ethnicity. For example, the tools allow qualitative analysis of differences in the genders generated for different occupations, or of what an average face looks like. They are available in this HuggingFace space.

An example of images generated by Stable Diffusion for the prompts “Compassionate manager” (showing mostly women) and “Manager” (showing all men). Image from an article written by the MIT Technology Review covering StableBias.

Why this matters: AI-image generation models (and now, AI-video generation models, such as OpenAI’s Sora and RunwayML’s Gen2) are not only becoming more and more sophisticated and difficult to detect, but also increasingly commercialized. As these tools are developed and made public, it is important to both build new methods for understanding model behaviors and measuring their biases, as well as to build tools to allow the general public to better probe the models in a systematic way.

Discussion

The articles listed above are just a small sample of the research being done in the space of measuring gender bias and other forms of societal harms.

Gaps in the Research

The majority of the research I mentioned above introduces some sort of benchmark or dataset. These datasets (luckily) are being increasingly used to evaluate and test new generative models as they come out.

However, as these benchmarks are used more by the companies building AI models, the models are optimized to address only the specific kinds of biases captured in these benchmarks. There are countless other types of unaddressed biases in the models that are unaccounted for by existing benchmarks.

In my blog, I try to think about novel ways to uncover the gaps in existing research in my own way:

  • In Where are all the women?, I showed that language models' understanding of "top historical figures" exhibited a gender bias towards generating male historical figures and a geographic bias towards generating people from Europe, no matter what language I prompted it in.
  • In Who does what job? Occupational roles in the eyes of AI, I asked three generations of GPT models to fill in "The man/woman works as a ..." to analyze the types of jobs often associated with each gender. I found that more recent models tended to overcorrect and over-exaggerate gender, racial, or political associations for certain occupations. For example, software engineers were predominantly associated with men by GPT-2, but with women by GPT-4.
  • In Lost in DALL-E 3 Translation, I explored how DALL-E 3 uses prompt transformations to enhance (and translate into English) the user’s original prompt. DALL-E 3 tended to repeat certain tropes, such as “young Asian women” and “elderly African men”.

What About Other Kinds of Bias and Societal Harm?

This article mainly focused on gender bias — and particularly, on binary gender. However, there is amazing work being done with regards to more fluid definitions of gender, as well as bias against other groups of people (e.g. disability, age, race, ethnicity, sexuality, political affiliation). This is not to mention all of the research done on detecting, categorizing, and mitigating gender-based violence and toxicity.

Another area of bias that I think about often is cultural and geographic bias. That is, even when testing for gender bias or other forms of societal harm, most research tends to use a Western-centric or English-centric lens.

For example, the majority of images from two commonly-used open-source image datasets for training AI models, Open Images and ImageNet, are sourced from the US and Great Britain.

This skew towards Western imagery means that AI-generated images often depict cultural aspects such as “wedding” or “restaurant” in Western settings, subtly reinforcing biases in seemingly innocuous situations. Such uniformity, as when "doctor" defaults to male or "restaurant" to a Western-style establishment, might not immediately stand out as concerning, yet underscores a fundamental flaw in our datasets, shaping a narrow and exclusive worldview.

Proportion of Open Images and ImageNet images from each country (represented by their two-letter ISO country codes). In both data sets, top represented locations include the US and Great Britain. From No Classification without Representation.

How Do We “Fix” This?

This is the billion dollar question!

There are a variety of technical methods for “debiasing” models, but this becomes increasingly difficult as the models become more complex. I won’t focus on these methods in this article.

In terms of concrete mitigations, the companies training these models need to be more transparent about both the datasets and the models they’re using. Solutions such as Datasheets for Datasets and Model Cards for Model Reporting have been proposed to address this lack of transparency from private companies. Legislation such as the recent AI Foundation Model Transparency Act of 2023 is also a step in the right direction. However, many of the large, closed, private AI models are moving in the opposite direction of openness and transparency, in both training methodology and dataset curation.

Perhaps more importantly, we need to talk about what it means to “fix” bias.

Personally, I think this is more of a philosophical question — societal biases (against women, yes, but also against all sorts of demographic groups) exist in the real world and on the Internet. Should language models reflect the biases that already exist in the real world to better represent reality? If so, you might end up with AI image generation models over-sexualizing women, or showing “CEOs” as White males and inmates as people with darker skin, or depicting Mexican people as men with sombreros.

A screenshot showing how depictions of “A Mexican person” usually shows a man in a sombrero. From How AI Reduces the World to Stereotypes, rest of world’s analysis into biases in Midjourney.

Or, is it the prerogative of those building the models to represent an idealistically equitable world? If so, you might end up with situations like DALL-E 2 appending race/gender identity terms to the ends of prompts, DALL-E 3 automatically transforming user prompts to include such identity terms without notifying users, or Gemini generating racially diverse Nazis.

Images generated by Google’s Gemini Pro. From The Verge’s article reporting on Gemini’s inaccurate historical portrayals.

There’s no magic pill to address this. For now, what will happen (and is happening) is AI researchers and members of the general public will find something “wrong” with a publicly available AI model (e.g. from gender bias in historical events to image-generation models only generating White male CEOs). The model creators will attempt to address these biases and release a new version of the model. People will find new sources of bias; and this cycle will repeat.

Final Thoughts

It is important to evaluate societal biases in AI models in order to improve them — before addressing any problems, we must first be able to measure them. Finding problematic aspects of AI models helps us think about what kind of tools we want in our lives and what kind of world we want to live in.

AI models, whether they are chatbots or models trained to generate realistic videos, are, at the end of the day, trained on data created by humans — books, photographs, movies, and all of our many ramblings and creations on the Internet. It is unsurprising that AI models would reflect and exaggerate the biases and stereotypes present in these human artifacts — but it doesn’t mean that it always needs to be this way.


Author Bio

Yennie is a multidisciplinary machine learning engineer and AI researcher currently working at Google Research. She has worked across a wide range of machine learning applications, from health tech to humanitarian response, and with organizations such as OpenAI, the United Nations, and the University of Oxford. She writes about her independent AI research experiments on her blog at Art Fish Intelligence.

A List of Resources for the Curious Reader

  • Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. California law review, 671-732.
  • Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint arXiv:2005.14050.
  • Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.
  • Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77-91). PMLR.
  • Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
  • Cao, Y. T., & Daumé III, H. (2019). Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913.
  • Dev, S., Monajatipoor, M., Ovalle, A., Subramonian, A., Phillips, J. M., & Chang, K. W. (2021). Harms of gender exclusivity and challenges in non-binary representation in language technologies. arXiv preprint arXiv:2108.12084.
  • Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., ... & Gardner, M. (2021). Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.
  • Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Iii, H. D., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86-92.
  • Gonen, H., & Goldberg, Y. (2019). Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862.
  • Kirk, H. R., Jun, Y., Volpin, F., Iqbal, H., Benussi, E., Dreyer, F., ... & Asano, Y. (2021). Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Advances in neural information processing systems, 34, 2611-2624.
  • Levy, S., Lazar, K., & Stanovsky, G. (2021). Collecting a large-scale gender bias dataset for coreference resolution and machine translation. arXiv preprint arXiv:2109.03858.
  • Luccioni, A. S., Akiki, C., Mitchell, M., & Jernite, Y. (2023). Stable bias: Analyzing societal representations in diffusion models. arXiv preprint arXiv:2303.11408.
  • Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., ... & Gebru, T. (2019, January). Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency (pp. 220-229).
  • Nadeem, M., Bethke, A., & Reddy, S. (2020). StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.
  • Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., ... & Bowman, S. R. (2021). BBQ: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193.
  • Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  • Sap, M., Gabriel, S., Qin, L., Jurafsky, D., Smith, N. A., & Choi, Y. (2019). Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891.
  • Savoldi, B., Gaido, M., Bentivogli, L., Negri, M., & Turchi, M. (2021). Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9, 845-874.
  • Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536.
  • Sheng, E., Chang, K. W., Natarajan, P., & Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326.
  • Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., ... & Isaac, W. (2023). Sociotechnical safety evaluation of generative ai systems. arXiv preprint arXiv:2310.11986.
  • Zhao, J., Mukherjee, S., Hosseini, S., Chang, K. W., & Awadallah, A. H. (2020). Gender bias in multilingual embeddings and cross-lingual transfer. arXiv preprint arXiv:2005.00699.
  • Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876.

Acknowledgements

This post was originally posted on Art Fish Intelligence

Citation

For attribution in academic contexts or books, please cite this work as

Yennie Jun, "Gender Bias in AI," The Gradient, 2024
@article{Jun2024bias,
    author = {Yennie Jun},
    title = {Gender Bias in AI},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/gender-bias-in-ai}},
}


]]>
<![CDATA[Mamba Explained]]>https://thegradient.pub/mamba-explained/65fb8d5993571d5c8c154beaThu, 28 Mar 2024 01:24:43 GMT


The State Space Model taking on Transformers

Mamba Explained

Right now, AI is eating the world.

And by AI, I mean Transformers. Practically all the big breakthroughs in AI over the last few years are due to Transformers.

Mamba, however, is part of an alternative class of models called State Space Models (SSMs). Importantly, for the first time, Mamba promises performance (and crucially, scaling laws) similar to the Transformer whilst being feasible at long sequence lengths (say, 1 million tokens). To achieve this long context, the Mamba authors remove the “quadratic bottleneck” in the Attention Mechanism. Mamba also runs fast - like “up to 5x faster than Transformer fast”1.

Mamba performs similarly (or slightly better than) other Language Models on The Pile (source)

Gu and Dao, the Mamba authors write:

Mamba enjoys fast inference and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modelling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

Here we’ll discuss:

  • The advantages (and disadvantages) of Mamba (🐍) vs Transformers (🤖),
  • Analogies and intuitions for thinking about Mamba, and
  • What Mamba means for Interpretability, AI Safety and Applications.

Problems with Transformers - Maybe Attention Isn’t All You Need

We’re very much in the Transformer-era of history. ML used to be about detecting cats and dogs. Now, with Transformers, we’re generating human-like poetry, coding better than the median competitive programmer, and solving the protein folding problem.

But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions. For this lookback, we cache detailed information about each token in the so-called KV cache.

When using the Attention Mechanism, information from all previous tokens can be passed to the current token

This pairwise communication means a forward pass is O(n²) time complexity in training (the dreaded quadratic bottleneck), and each new token generated autoregressively takes O(n) time. In other words, as the context size increases, the model gets slower.

To add insult to injury, storing this key-value (KV) cache requires O(n) space.  Consequently, the dreaded CUDA out-of-memory (OOM) error becomes a significant threat as the memory footprint expands. If space were the only concern, we might consider adding more GPUs; however, with latency increasing quadratically, simply adding more compute might not be a viable solution.
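To get a feel for the numbers, here is a back-of-the-envelope sketch of KV-cache size. The model dimensions (32 layers, hidden size 4096, fp16 values) are illustrative assumptions, roughly in the range of a 7B-parameter model, not the specs of any particular system.

n_layers, d_model, bytes_per_value = 32, 4096, 2  # fp16 values

def kv_cache_bytes(seq_len: int) -> int:
    # Every token stores one key and one value vector (factor of 2)
    # of size d_model in each layer.
    return 2 * n_layers * d_model * bytes_per_value * seq_len

for seq_len in (8_000, 128_000, 1_000_000):
    print(f"{seq_len:>9} tokens -> {kv_cache_bytes(seq_len) / 1e9:7.1f} GB")

At one million tokens this toy calculation already gives hundreds of gigabytes of cache, which is why simply adding context length to a Transformer is not free.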

On the margin, we can mitigate the quadratic bottleneck with techniques like Sliding Window Attention or clever CUDA optimisations like FlashAttention. But ultimately, for super long context windows (like a chatbot which remembers every conversation you’ve shared), we need a different approach.

Foundation Model Backbones

Fundamentally, all good ML architecture backbones have components for two important operations:

  1. Communication between tokens
  2. Computation within a token
The Transformer Block

In transformers, this is Attention (communication) and MLPs (computation). We improve transformers by optimising these two operations2.

We would like to substitute the Attention component3 with an alternative mechanism for facilitating inter-token communication. Specifically, Mamba employs a Control Theory-inspired State Space Model, or SSM, for Communication purposes while retaining Multilayer Perceptron (MLP)-style projections for Computation.

The Mamba Block

Like a Transformer made up of stacked transformer blocks, Mamba is made up of stacked Mamba blocks as above.

We would like to understand and motivate the choice of the SSM for sequence transformations.

Motivating Mamba - A Throwback to Temple Run

Imagine we’re building a Temple Run agent4. It chooses if the runner should move left or right at any time.


To successfully pick the correct direction, we need information about our surroundings. Let’s call the collection of relevant information the state. Here the state likely includes your current position and velocity, the position of the nearest obstacle, weather conditions, etc.

Claim 1: if you know the current state of the world and how the world is evolving, then you can use this to determine the direction to move.

Note that you don’t need to look at the whole screen all the time. You can figure out what will happen to most of the screen by noting that as you run, the obstacles move down the screen. You only need to look at the top of the screen to understand the new information and then simulate the rest.


This lends itself to a natural formulation. Let h be the hidden state, relevant knowledge about the world. Also let x be the input, the observation that you get each time. h’ then represents the derivative of the hidden state, i.e. how the state is evolving. We’re trying to predict y, the optimal next move (right or left).

Now, Claim 1 states that from the hidden state h, h’, and the new observation x, you can figure out y.

More concretely, the evolution of the state h can be described by a differential equation (Eq 1a):

$h’(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$

Knowing h allows you to determine your next move y (Eq 1b):

$y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$

The system's evolution is determined by its current state and newly acquired observations. A small new observation is enough, as the majority of the state can be inferred by applying known state dynamics to its previous state. That is, most of the screen isn’t new, it’s just a continuation of the previous state's natural downward trajectory. A full understanding of the state would enable optimal selection of the subsequent action, denoted as y.

You can learn a lot about the system dynamics by observing the top of the screen. For instance, increased velocity of this upper section suggests an acceleration of the rest of the screen as well, so we can infer that the game is speeding up5. In this way, even if we start off knowing nothing about the game and only have limited observations, it becomes possible to gain a holistic understanding of the screen dynamics fairly rapidly.

What’s the State?

Here, state refers to the variables that, when combined with the input variables, fully determine the future system behaviour. In theory, once we have the state, there’s nothing else we need to know about the past to predict the future. With this choice of state, the system is converted to a Markov Decision Process. Ideally, the state is a fairly small amount of information which captures the essential properties of the system. That is, the state is a compression of the past6.

Discretisation - How To Deal With Living in a Quantised World

Okay, great! So, given some state and input observation, we have an autoregressive-style system to determine the next action. Amazing!

In practice though, there’s a little snag here. We’re modelling time as continuous. But in real life, we get new inputs and take new actions at discrete time steps7.


We would like to convert this continuous-time differential equation into a discrete-time difference equation. This conversion process is known as discretisation. Discretisation is a well-studied problem in the literature. Mamba uses the Zero-Order Hold (ZOH) discretisation8. To give an idea of what’s happening morally, consider a naive first-order approximation9.

From Equation 1a, we have

$h’(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$

And for small ∆,

$h’(t) \approx \frac{h(t+\Delta) - h(t)}{\Delta}$

by the definition of the derivative.

We let:

$h_t = h(t)$

and

$h_{t+1} = h(t + \Delta)$

and substitute into Equation 1a giving:

$h_{t+1} - h_t \approx \Delta (\mathbf{A}h_t + \mathbf{B}x_t)$
$\Rightarrow h_{t+1} \approx (I + \Delta \mathbf{A})h_t + (\Delta \mathbf{B})x_t$

Hence, after renaming the coefficients and relabelling indices, we have the discrete representations:

The Discretised Version of the SSM Equation

If you’ve ever looked at an RNN before10 and this feels familiar - trust your instincts:

We have some input x, which is combined with the previous hidden state by some transform to give the new hidden state. Then we use the hidden state to calculate the output at each time step.
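As a sanity check on that intuition, here is a toy, non-selective SSM scan in NumPy that follows the discretised equations above. The shapes, values, and initialisation are illustrative only, not Mamba's actual parameterisation.

import numpy as np

d_state, seq_len = 4, 10
rng = np.random.default_rng(0)

A_bar = np.eye(d_state) * 0.9             # how the state decays each step
B_bar = rng.normal(size=(d_state, 1))     # how the new input enters the state
C = rng.normal(size=(1, d_state))         # how the state produces the output
D = np.array([[1.0]])                     # skip connection from input to output

x = rng.normal(size=(seq_len, 1))         # the input sequence
h = np.zeros((d_state, 1))                # the hidden state
ys = []
for t in range(seq_len):
    h = A_bar @ h + B_bar * x[t]          # h_t = A_bar h_{t-1} + B_bar x_t
    ys.append((C @ h + D * x[t]).item())  # y_t = C h_t + D x_t
print(ys)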

Understanding the SSM Matrices

Now, we can interpret the A, B, C, D matrices more intuitively:

  • A is the transition state matrix. It shows how you transition the current state into the next state. It asks “How should I forget the less relevant parts of the state over time?”
  • B is mapping the new input into the state, asking “What part of my new input should I remember?”11
  • C is mapping the state to the output of the SSM. It asks, “How can I use the state to make a good next prediction?”12
  • D is how the new input passes through to the output. It’s a kind of modified skip connection that asks “How can I use the new input in my prediction?”
Visual Representation of The SSM Equations

Additionally, ∆ has a nice interpretation - it’s the step size, or what we might call the linger time or the dwell time. For large ∆, you focus more on that token; for small ∆, you skip past the token immediately and don’t include it much in the next state.

(source)

And that’s it! That’s the SSM, our ~drop-in replacement for Attention (Communication) in the Mamba block. The Computation in the Mamba architecture comes from regular linear projections, non-linearities, and local convolutions.

Okay great, that’s the theory - but does this work? Well…

Effectiveness vs Efficiency: Attention is Focus, Selectivity is Prioritisation

At WWDC ‘97, Steve Jobs famously noted that “focusing is about saying no”. Focus is ruthless prioritisation. It’s common to think about Attention positively as choosing what to notice. In the Steve Jobs sense, we might instead frame Attention negatively as choosing what to discard.

There’s a classic intuition pump in Machine Learning known as the Cocktail Party Problem13. Imagine a party with dozens of simultaneous loud conversations:

Question:

How do we recognise what one person is saying when others are talking at the same time?14

Answer:

The brain solves this problem by focusing your “attention” on a particular stimulus and hence drowning out all other sounds as much as possible.


Transformers use Dot-Product Attention to focus on the most relevant tokens. A big reason Attention is so great is that you have the potential to look back at everything that ever happened in its context. This is like photographic memory when done right.15

Transformers (🤖) are extremely effective. But they aren’t very efficient. They store everything from the past so that they can look back at tokens with theoretically perfect recall.

Traditional RNNs (🔁) are the opposite - they forget a lot, only recalling a small amount in their hidden state and discarding the rest. They are very efficient - their state is small. Yet they are less effective as discarded information cannot be recovered.

We’d like something closer to the Pareto frontier of the effectiveness/efficiency tradeoff. Something that’s more effective than traditional RNNs and more efficient than transformers.


The Mamba Architecture seems to offer a solution which pushes out the Pareto frontier of effectiveness/efficiency.

SSMs are as efficient as RNNs, but we might wonder how effective they are. After all, it seems like they would have a hard time discarding only unnecessary information and keeping everything relevant. If each token is being processed the same way, applying the same A and B matrices as if in a factory assembly line for tokens, there is no context-dependence. We would like the forgetting and remembering matrices (A and B respectively) to vary and dynamically adapt to inputs.

The Selection Mechanism

Selectivity allows each token to be transformed into the state in a way that is unique to its own needs. Selectivity is what takes us from vanilla SSM models (applying the same A (forgetting) and B (remembering) matrices to every input) to Mamba, the Selective State Space Model.

In regular SSMs, A, B, C and D are learned matrices - that is

$\mathbf{A} = \mathbf{A}_{\theta}$ etc. (where θ represents the learned parameters)

With the Selection Mechanism in Mamba, A, B, C and D are also functions of x. That is $\mathbf{A} = \mathbf{A}_{\theta(x)}$ etc; the matrices are context dependent rather than static.

Mamba (right) differs from traditional SSMs by allowing A,B,C matrices to be selective i.e. context dependent (source)

Making A and B functions of x allows us to get the best of both worlds:

  • We’re selective about what we include in the state, which improves effectiveness vs traditional SSMs.
  • Yet, since the state size is bounded, we improve on efficiency relative to the Transformer. We have O(1), not O(n) space and O(n) not O(n²) time requirements.
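To make the contrast concrete, here is a toy, single-channel sketch of the Selection Mechanism in NumPy: B, C, and the step size Δ are produced from each token by learned projections, while A stays a fixed diagonal. The dimensions, the softplus, and the use of a single scalar channel are simplifying assumptions for illustration, not the exact Mamba parameterisation.

import numpy as np

d_model, d_state, seq_len = 8, 4, 10
rng = np.random.default_rng(0)

W_B = rng.normal(size=(d_state, d_model)) * 0.1   # x_t -> B_t ("what to store")
W_C = rng.normal(size=(d_state, d_model)) * 0.1   # x_t -> C_t ("what to read out")
W_delta = rng.normal(size=(d_model,)) * 0.1       # x_t -> step size ("how long to linger")
A = -np.ones(d_state)                             # fixed diagonal decay rates

xs = rng.normal(size=(seq_len, d_model))          # token representations
h = np.zeros(d_state)
for x_t in xs:
    delta = np.log1p(np.exp(W_delta @ x_t))       # softplus keeps the step size positive
    A_bar = np.exp(delta * A)                     # discretise A for this token
    B_t, C_t = W_B @ x_t, W_C @ x_t               # input-dependent B and C
    u_t = x_t[0]                                  # single scalar channel fed through the SSM
    h = A_bar * h + delta * B_t * u_t             # each token decides what the state keeps
    y_t = C_t @ h                                 # token-specific read-out
print(h)                                          # the state after seeing the whole sequence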

The Mamba paper authors write:

The efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity: or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension.


Humans (mostly) don’t have photographic memory for everything they experience within a lifetime - or even within a day! There’s just way too much information to retain it all. Subconsciously, we select what to remember by choosing to forget, throwing away most information as we encounter it. Transformers (🤖) decide what to focus on at recall time. Humans (🧑) also decide what to throw away at memory-making time. Humans filter out information early and often.

If we had infinite capacity for memorisation, it’s clear the transformer approach is better than the human approach - it truly is more effective. But it’s less efficient - transformers have to store so much information about the past that might not be relevant. Transformers (🤖) only decide what’s relevant at recall time. The innovation of Mamba (🐍) is allowing the model better ways of forgetting earlier - it’s focusing by choosing what to discard using Selectivity, throwing away less relevant information at memory-making time16.

The Problems of Selectivity

Applying the Selection Mechanism does have its gotchas though. Non-selective SSMs (i.e. A, B not dependent on x) are fast to compute in training. This is because the component of $y_t$ which depends on $x_i$ can be expressed as a linear map, i.e. a single matrix that can be precomputed!

For example (ignoring the D component, the skip connection):

$$y_2 = \mathbf{C}\mathbf{B}x_2 + \mathbf{C}\mathbf{A}\mathbf{B}x_1 + \mathbf{C}\mathbf{A}\mathbf{A}\mathbf{B}x_0$$

If we’re paying attention, we might spot something even better here - this expression can be written as a convolution. Hence we can apply the Fast Fourier Transform and the Convolution Theorem to compute this very efficiently on hardware as in Equation 3 below.


We can calculate Equation 2, the SSM equations, efficiently in the Convolutional Form, Equation 3.
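The equivalence is easy to check on a tiny example. The sketch below uses a scalar, non-selective SSM (ignoring D, as above) and compares the recurrent scan against a convolution with the kernel K = (CB, CAB, CA²B, ...); the direct dot products stand in for the FFT-based computation used in practice.

import numpy as np

A, B, C = 0.9, 0.5, 2.0                    # scalar SSM, chosen arbitrarily
x = np.array([1.0, -2.0, 0.5, 3.0])
L = len(x)

# Recurrent form: scan through the sequence step by step
h, y_rec = 0.0, []
for x_t in x:
    h = A * h + B * x_t
    y_rec.append(C * h)

# Convolutional form: precompute the kernel once, apply it to the whole sequence
K = np.array([C * (A ** k) * B for k in range(L)])
y_conv = [np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(L)]

print(np.allclose(y_rec, y_conv))          # True: the two forms agree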

Unfortunately, with the Selection Mechanism, we lose the convolutional form. Much attention is given to making Mamba efficient on modern GPU hardware using similar hardware optimisation tricks to Tri Dao’s Flash Attention17. With the hardware optimisations, Mamba is able to run faster than comparably sized Transformers.

Machine Learning for Political Economists - How Large Should The State Be?

The Mamba authors write, “the efficiency vs. effectiveness tradeoff of sequence models is characterised by how well they compress their state”. In other words, like in political economy18, the fundamental problem is how to manage the state.

🔁 Traditional RNNs are anarchic

They have a small, minimal state. The size of the state is bounded. The compression of state is poor.

🤖 Transformers are communist

They have a maximally large state. The “state” is just a cache of the entire history with no compression. Every context token is treated equally until recall time.

🐍Mamba has a compressed state

…but it’s selective about what goes in. Mamba says we can get away with a small state if the state is well focused and effective19.

Language Models and State Size

The upshot is that state representation is critical. A smaller state is more efficient; a larger state is more effective. The key is to selectively and dynamically compress data into the state. Mamba’s Selection Mechanism allows for context-dependent reasoning, focusing and ignoring. For both performance and interpretability, understanding the state seems to be very useful.

Information Flow in Transformer vs Mamba

How do Transformers know anything? At initialization, a transformer isn’t very smart. It learns in two ways:

  1. Training data (Pretraining, SFT, RLHF etc)
  2. In-context data

Training Data

Models learn from their training data. This is a kind of lossy compression of input data into the weights. We can think of the effect of pretraining data on the transformer kinda like the effect of your ancestors’ experiences on your genetics - you can’t recall their experiences, you just have vague instincts about them20.

In-Context Data

Transformers use their context as short-term memory, which they can recall with ~perfect fidelity. So we get In-Context Learning, e.g. using induction heads to solve the Indirect Object Identification task, or computing Linear Regression.

Retrieval

Note that Transformers don’t filter their context at all until recall time. So if we have a bunch of information we think might be useful to the Transformer, we filter it outside the Transformer (using Information Retrieval strategies) and then stuff the results into the prompt. This process is known as Retrieval Augmented Generation (RAG). RAG determines relevant information for the context window of a transformer. A human with the internet is kinda like a RAG system - you still have to know what to search but whatever you retrieve is as salient as short-term memory to you.

Information Flow for Mamba

Training Data acts similarly for Mamba. However, the lines are slightly blurred for in-context data and retrieval. In-context data for Mamba is compressed/filtered similar to retrieval data for transformers. This in-context data is also accessible for look-up like for transformers (although with somewhat lower fidelity).


Transformer context is to Mamba states what short-term is to long-term memory. Mamba doesn’t just have “RAM”, it has a hard drive21 22.

Swapping States as a New Prompting Paradigm

Currently, we often use RAG to give a transformer contextual information.

With Mamba-like models, you could instead imagine having a library of states created by running the model over specialised data. States could be shared kinda like LoRAs for image models.

For example, I could do inference on 20 physics textbooks and, say, 100 physics questions and answers. Then I have a state which I can give to you. Now you don’t need to add any few-shot examples; you just simply ask your question. The in-context learning is in the state.

In other words, you can drag and drop downloaded states into your model, like literal plug-in cartridges. And note that “training” a state doesn’t require any backprop. It’s more like a highly specialised one-pass fixed-size compression algorithm. This is unlimited in-context learning applied at inference time for zero additional compute or latency23.

The structure of an effective LLM call goes from…

  1. System Prompt
  2. Preamble
  3. Few shot-examples
  4. Question

…for Transformers, to simply…

  1. Inputted state (with problem context, initial instructions, textbooks, and few-shot examples)
  2. Short question

…for Mamba.

This is cheaper and faster than few-shot prompting (as the state is infinitely reusable without inference cost). It’s also MUCH cheaper than finetuning and doesn’t require any gradient updates. We could imagine retrieving states in addition to context.

Mamba & Mechanistic Interpretability

Transformer interpretability typically involves:

  1. understanding token relationships via attention,
  2. understanding circuits, and
  3. using Dictionary Learning for unfolding MLPs.

Most of the ablations that we would like to do for Mamba are still valid, but understanding token communication (1) is now more nuanced. All information moves between tokens via hidden states instead of the Attention Mechanism which can “teleport” information from one sequence position to another.

For understanding in-context learning (ICL) tasks with Mamba, we will look to intervene on the SSM state. A classic in-context learning task is Indirect Object Identification, in which a model has to finish a paragraph like:

Then, Shelby and Emma had a lot of fun at the school. [Shelby/Emma] gave an apple to [BLANK]

The model is expected to fill in the blank with the name that is not repeated in the paragraph. In the chart below we can see that information is passed from the [Shelby/Emma] position to the final position via the hidden state (see the two blue lines in the top chart).


Since it’s hypothesised that much of In-Context Learning in Transformers is downstream of more primitive sequence position operations (like Induction Heads), Mamba being able to complete this task suggests a more general In-Context Learning ability.

What’s Next for Mamba & SSMs?

Mamba-like models are likely to excel in scenarios requiring extremely long context and long-term memory. Examples include:

  • Processing DNA
  • Generating (or reasoning over) video
  • Writing novels

An illustrative example is agents with long-term goals.

Suppose you have an agent interacting with the world. Eventually, its experiences become too much for the context window of a transformer. The agent then has to compress or summarise its experiences into some more compact representation.

But how do you decide what information is the most useful as a summary? If the task is language, LLMs are actually fairly good at summaries - okay, yeah, you’ll lose some information, but the most important stuff can be retained.

However, for other disciplines, it might not be clear how to summarise. For example, what’s the best way to summarise a 2 hour movie?24. Could the model itself learn to do this naturally rather than a hacky workaround like trying to describe the aesthetics of the movie in text?

This is what Mamba allows. Actual long-term memory. A real state where the model learns to keep what’s important. Prediction is compression - learning what’s useful to predict what’s coming next inevitably leads to building a useful compression of the previous tokens.


The implications for Assistants are clear:

Your chatbot co-evolves with you. It remembers.


The film HER is looking better and better as time goes on 😳

Agents & AI Safety

One reason for positive updates on existential risk from AGI is the rise of Language Models. Previously, Deep-RL agents trained via self-play looked set to be the first AGIs. Language models are inherently much safer since they aren’t trained with long-term goals25.

The potential for long-term sequence reasoning here brings back the importance of agent-based AI safety. Few agent worries are relevant to Transformers with an 8k context window. Many are relevant to systems with impressive long-term memories and possible instrumental goals.

The Best Collab Since Taco Bell & KFC: 🤖 x 🐍

The Mamba authors show that there’s value in combining Mamba’s long context with the Transformer’s high fidelity over short sequences. For example, if you’re making long videos, you likely can’t fit a whole movie into a Transformer’s context for attention26. You could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency27.


This isn’t the end for Transformers. Their high effectiveness is exactly what’s needed for many tasks. But now Transformers aren’t the only option. Other architectures are genuinely feasible.

So we’re not in the post-Transformer era. But for the first time, we’re living in the post-only-Transformers era28. And this blows the possibilities wide open for sequence modelling with extreme context lengths and native long-term memory.

Two ML researchers, Sasha Rush (HuggingFace, Annotated Transformer, Cornell Professor) and Jonathan Frankle (Lottery Ticket Hypothesis, MosaicML, Harvard Professor), currently have a bet here.


Currently Transformers are far and away in the lead. With 3 years left, there’s now a research direction with a fighting chance.

All that remains to ask is: Is Attention All We Need?


Footnotes

1. see Figure 8 in the Mamba paper.
2. And scaling up with massive compute.
3. More specifically the scaled dot-product Attention popularised by Transformers
4. For people who don’t see Temple Run as the cultural cornerstone it is 🤣 Temple Run was an iPhone game from 2011 similar to Subway Surfer
5. Here we assume the environment is sufficiently smooth.
6. One pretty important constraint for this to be efficient is that we don’t allow the individual elements of the state vector to interact with each other directly. We’ll use a combination of the state dimensions to determine the output but we don’t e.g. allow the velocity of the runner and the direction of the closest obstacle (or whatever else was in our state) to directly interact. This helps with efficient computation and we achieve this practically by constraining A to be a diagonal matrix.
7. Concretely consider the case of Language Models - each token is a discrete step
8. ZOH also has nice properties for the initialisations - we want A_bar to be close to the identity so that the state can be mostly maintained from timestep to timestep if desired. ZOH gives A_bar as an exponential so any diagonal element initialisations close to zero give values close to 1
9. This is known as the Euler discretisation in the literature
10. It’s wild to note that some readers might not have - we’re so far into the age of Attention that RNNs have been forgotten!
11. B is like the Query (Q) matrix for Transformers.
12. C is like the Output (O) matrix for Transformers.
13. Non-alcoholic options also available!
14. Especially as all voices roughly occupy the same space on the audio frequency spectrum. Intuitively this seems really hard!
15. Note that photographic memory doesn’t necessarily imply perfect inferences from that memory!
16. To be clear, if you have a short sequence, then a transformer should theoretically be a better approach. If you can store the whole context, then why not!? If you have enough memory for a high-resolution image, why compress it into a JPEG? But Mamba-style architectures are likely to hugely outperform with long-range sequences.
17. More details are available for engineers interested in CUDA programming - Tri’s talk, Mamba paper section 3.3.2, and the official CUDA code are good resources for understanding the Hardware-Aware Scan
18. or in Object Oriented Programming
19. Implications to actual Political Economy are left to the reader but maybe Gu and Dao accidentally solved politics!?
20. This isn’t a perfect analogy as human evolution follows a genetic algorithm rather than SGD.
21. Albeit a pretty weird hard drive at that - it morphs over time rather than being a fixed representation.
22. As a backronym, I’ve started calling the hidden_state the state space dimension (or selective state dimension) which shortens to SSD, a nice reminder for what this object represents - the long-term memory of the system.
23. I’m thinking about this similarly to the relationship between harmlessness finetuning and activation steering. State swapping, like activation steering, is an inference time intervention giving comparable results to its train time analogue.
24. This is a very non-trivial problem! How do human brains represent a movie internally? It’s not a series of the most salient frames, nor is it a text summary of the colours, nor is it a purely vibes-based summary if you can memorise some lines of the film.
25. They’re also safer since they inherently understand (though don’t necessarily embody) human values. It’s not at all clear how to teach an RL agent human morality.
26. Note that typically an image (i.e. a single frame) counts as >196 tokens, and movies are typically 24 fps so you’ll fill a 32k context window in 7 seconds 🤯
27. Another possibility that I’m excited about is applying optimisation pressure to the state itself as well as the output to have models that respect particular use cases.
28. This is slightly hyperbolic, the TS-Mixer for time series, Gradient Boosting Trees for tabular data and Graph Neural Networks for weather prediction exist and are currently used, but these aren’t at the core of AI

Author Bio

Kola Ayonrinde is a Research Scientist and Machine Learning Engineer with a flair for writing. He integrates technology and creativity, focusing on applying machine learning in innovative ways and exploring the societal impacts of tech advancements.

Acknowledgements

This post was originally posted on Kola's personal blog.

Thanks to Gonçalo for reading an early draft, Jaden for the nnsight library used for the Interpretability analysis, and Tessa for Mamba patching visualisations. Also see: Mamba paper, Mamba Python code, Annotated S4, Nathan Labenz podcast

Citation

For attribution in academic contexts or books, please cite this work as

Kola Ayonrinde, "Mamba Explained," The Gradient, 2024
@article{Ayonrinde2024mamba,
    author = {Kola Ayonrinde},
    title = {Mamba Explained},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/mamba-explained}},
}
]]>
<![CDATA[Car-GPT: Could LLMs finally make self-driving cars happen?]]>https://thegradient.pub/car-gpt/65db7b4193571d5c8c154a73Fri, 08 Mar 2024 16:55:18 GMT

In 1928, London was in the middle of a terrible health crisis, devastated by bacterial diseases like pneumonia, tuberculosis, and meningitis. Confined in sterile laboratories, scientists and doctors were stuck in a relentless cycle of trial and error, using traditional medical approaches to solve complex problems.

This is when, in September 1928, an accidental event changed the course of the world. A Scottish doctor named Alexander Fleming forgot to close a petri dish (the transparent circular box you used in science class), which got contaminated by mold. This is when Fleming noticed something peculiar: all bacteria close to the mold were dead, while the others survived.

"What was that moisture made of?" wondered M. Flemming. This was when he discovered that Penicillin, the main component of the mold, was a powerful bacterial killer. This led to the groundbreaking discovery of penicillin, leading to the antibiotics we use today. In a world where doctors were relying on existing well-studied approaches, Penicillin was the unexpected answer.

Self-driving cars may be following a similar path. Back in the 2010s, most of them were built using what we call a "modular" approach. The "autonomous" part of the software is split into several modules, such as Perception (the task of seeing the world), Localization (the task of accurately localizing yourself in the world), and Planning (the task of creating a trajectory for the car to follow, implementing the "brain" of the car). Finally, all of these feed into the last module, Control, which generates commands such as "steer 20° right", etc… So this was the well-known approach.

But a decade later, companies started to take another discipline very seriously: End-To-End learning. The core idea is to replace every module with a single neural network predicting steering and acceleration, but as you can imagine, this introduces a black box problem.

The 4 Pillars of Self-Driving Cars are Perception, Localization, Planning, and Control. Could a Large Language Model replicate them? (source)

These approaches are well known, but they haven't solved the self-driving problem yet. So we might wonder: "What if LLMs (Large Language Models), currently revolutionizing the world, were the unexpected answer to autonomous driving?"

This is what we're going to see in this article, beginning with a simple explanation of what LLMs are and then diving into how they could benefit autonomous driving.

Preamble: LLMs-what?

Before you read this article, you must know something: I'm not an LLM pro, at all. This means, I know too well the struggle to learn it. I understand what it's like to google "learn LLM"; then see 3 sponsored posts asking you to download e-books (in which nothing concrete appears)... then see 20 ultimate roadmaps and GitHub repos, where step 1/54 is to view a 2-hour long video (and no one knows what step 54 is because it's so looooooooong).

So, instead of putting you through this pain myself, let's just break down what LLMs are in 3 key ideas:

  1. Tokenization
  2. Transformers
  3. Processing Language

Tokenization

In ChatGPT, you input a piece of text, and it returns text, right? Well, what's actually happening is that your text is first converted into tokens.

Example of tokenization of a sentence, each word becomes a "token"

But what's a token? you might ask. Well, a token can correspond to a word, a character, or anything we want. Think about it -- if you want to send a sentence to a neural network, you can't send the actual words themselves, can you?

The input of a neural network is always a number, so you need to convert your text into numbers; this is tokenization.

What tokenization actually is: A conversion from words to numbers

Depending on the model (ChatGPT, LLaMA, etc...), a token can mean different things: a word, a subword, or even a character. We could take the English vocabulary and define tokens as words, or take parts of words (subwords) to handle even more complex inputs. For example, the word "a" could be token 1, and the word "abracadabra" could be token 121.
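Here is a toy word-level tokenizer to make this concrete. The vocabulary and IDs below are made up for illustration; real models use subword schemes such as Byte Pair Encoding and vocabularies of tens of thousands of tokens.

vocab = {"<unk>": 0, "a": 1, "self": 7, "driving": 12, "car": 15,
         "turns": 23, "right": 42, "abracadabra": 121}

def tokenize(sentence: str) -> list[int]:
    # Lower-case, split on spaces, and map each word to its ID
    # (unknown words fall back to the <unk> token).
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("A self driving car turns right"))  # [1, 7, 12, 15, 23, 42]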

Transformers

Now that we understand how to convert a sentence into a series of numbers, we can send that series into our neural network! At a high level, we have the following structure:

A Transformer is an Encoder-Decoder architecture that takes a sequence of tokens as input and outputs another sequence of tokens

If you start looking around, you will see that some models are based on an encoder-decoder architecture, some others are purely encoder-based, and others, like GPT, are purely decoder-based.

Whatever the case, they all share the core Transformer building blocks: multi-head attention, layer normalization, addition and concatenation blocks, cross-attention, etc...

This is just a series of attention blocks getting you to the output. So how does this word prediction work?

The Output / Next-Word Prediction

The Encoder learns features and understands context... But what does the decoder do? In the case of object detection, the decoder is predicting bounding boxes. In the case of segmentation, the decoder is predicting segmentation masks. What about here?

In our case, the decoder is trying to generate a series of words; we call this task "next-word prediction".

Of course, it does it similarly by predicting numbers or tokens. This characterizes our full model as shown below,

I would say the loss function for this particular output produces a near-0 value.
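Here is a sketch of that next-word prediction loop. The model_logits function is a random stub standing in for a real Transformer decoder; everything around it mirrors the generation loop: score the vocabulary, pick a token, append it, and repeat.

import numpy as np

vocab = ["<eos>", "the", "car", "turns", "right"]

def model_logits(token_ids: list[int]) -> np.ndarray:
    # Stub standing in for a trained decoder: one score per vocabulary entry.
    rng = np.random.default_rng(sum(token_ids) + len(token_ids))
    return rng.normal(size=len(vocab))

def generate(prompt_ids: list[int], max_new_tokens: int = 5) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_logits(ids)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
        next_id = int(np.argmax(probs))                # greedy: take the most likely token
        ids.append(next_id)
        if vocab[next_id] == "<eos>":                  # stop once the model emits end-of-sequence
            break
    return ids

print([vocab[i] for i in generate([1, 2])])            # e.g. ['the', 'car', ...]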

Now, there are many "concepts" that you should learn on top of this intro: everything Transformer and Attention related, but also few-shot learning, pretraining, finetuning, and more...

Ok... but what does it have to do with self-driving cars? I think it's time to move to stage 2.

Chat-GPT for Self-Driving Cars

The thing is, you've already been through the tough part. The rest simply is: "How do I adapt this to autonomous driving?". Think about it; we have a few modifications to make:

  • Our input now becomes either images, sensor data (LiDAR point clouds, RADAR point clouds, etc...), or even algorithm outputs (lane lines, objects, etc...). All of it is "tokenizable", just as Vision Transformers or Video Vision Transformers do (see the patch-tokenization sketch after this list).
  • Our Transformer model pretty much remains the same since it only operates on tokens and is independent of the kind of input.
  • The output depends on the set of tasks we want to perform. It could be explaining what's happening in the image, or it could be a direct driving task like switching lanes.
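As a concrete illustration of the first bullet above, here is how an image becomes a sequence of "tokens" in the Vision Transformer style: cut it into fixed-size patches and flatten each patch into a vector. The image size and patch size are common illustrative choices, not tied to any specific model.

import numpy as np

image = np.random.rand(224, 224, 3)           # a dummy H x W x C image
patch = 16                                    # side length of each square patch

patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)                          # (196, 768): 196 image "tokens"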

So, let's begin with the end:

What self-driving car tasks could LLM solve?

There are many tasks involved in autonomous driving, but not all of them are GPT-isable. The most active research areas in 2023 have been:

  • Perception: Based on an input image, describe the environment, number of objects, etc...
  • Planning: Based on an image, or a bird-eye view, or the output of perception, describe what we should do (keep driving, yield, etc...)
  • Generation: Generate training data, alternate scenarios, and more... using "diffusion"
  • Question & Answers: Create a chat interface and ask the LLM to answer questions based on the scenario.

LLMs in Perception

In Perception, the input is a series of images, and the output is usually a set of objects, lanes, etc... In the case of LLMs, we have 3 core tasks: Detection, Prediction, and Tracking. An example with ChatGPT, where you send it an image and ask it to describe what's going on, is shown below:

A GPT-4 Vision model can return the objects in the image, just like object detectors do (source)

Other models such as HiLM-D and MTD-GPT can also do this; some also work for videos. Models like PromptTrack also have the ability to assign unique IDs (this car in front of me is ID #3), similar to a 4D Perception model.

PromptTrack combines the DETR object detector with Large Language Models

In this model, multi-view images are sent to an Encoder-Decoder network that is trained to predict annotations of objects (such as bounding boxes and attention maps). These maps are then combined with a prompt like 'find the vehicles that are turning right'. The next block then finds the 3D Bounding Box localization and assigns IDs using a bipartite graph matching algorithm like the Hungarian Algorithm.
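The ID-assignment step mentioned above boils down to bipartite matching: pair each detection in the current frame with a track from the previous frame at minimal total cost. SciPy's linear_sum_assignment implements the Hungarian algorithm; the cost matrix below is made up for illustration (in practice it would come from distances between predicted boxes).

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = mismatch between previous track i and current detection j
cost = np.array([[0.1, 2.0, 3.5],
                 [1.8, 0.2, 2.9],
                 [3.0, 2.5, 0.4]])

track_idx, detection_idx = linear_sum_assignment(cost)
for t, d in zip(track_idx, detection_idx):
    print(f"track {t} -> detection {d} (cost {cost[t, d]:.1f})")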

This is cool, but this isn't the "best" application of LLMs so far:

LLMs in Decision Making, Navigation, and Planning

If Chat-GPT can find objects in an image, it should be able to tell you what to do with these objects, shouldn't it? Well, this is the task of Planning i.e. defining a path from A to B, based on the current perception. While there are numerous models developed for this task, the one that stood out to me was Talk2BEV:

Talk2BEV takes perception one step further and also tells you what to do

The main difference between models for planning and Perception-only models is that here, we're going to train the model on human behavior to suggest ideal driving decisions. We're also going to change the input from multi-view to Bird Eye View since it is much easier to understand.

This model works both with LLaVA and ChatGPT4, and here is a demo of the architecture:

Talk2BEV (source)

As you can see, this isn't purely "prompt" based, because the core object detection model stays Bird Eye View Perception, but the LLM is used to "enhance" that output by suggesting to crop some regions, look at specific places, and predict a path. We're talking about "language enhanced BEV Maps".

Other models like DriveGPT are trained to send the output of Perception to Chat-GPT and finetune it to output the driving trajectory directly.

The DriveGPT model is pure madness... when trained correctly! (modified from source)

I could go on and on, but I think you get the point. If we summarize, I would say that:

  • Inputs are either tokenized images or outputs of Perception algorithm (BEV maps, ...)
  • We fuse existing models (BEV Perception, Bipartite Matching, ...) with language prompts (find the moving cars)
  • Changing the task is mainly about changing the data, loss function, and careful finetuning.

The Q&A applications are very similar, so let's see the last application of LLMs:

LLMs for Image Generation

Ever tried Midjourney and DALL-E? Isn’t it super cool? Yes, and there is something MUCH COOLER than this when it comes to autonomous driving. In fact, have you heard of Wayve's GAIA-1 model? The model takes text and images as input and directly produces videos, like this:

These videos are generated by Wayve's GAIA-1 model

The architecture takes images, actions, and text prompts as input, and then uses a World Model (an understanding of the world and its interactions) to produce a video.

Architecture of GAIA-1 (source)

You can find more examples on Wayve's YouTube channel and this dedicated post.

Similarly, you can see MagicDrive, which takes the output of Perception as input and uses that to generate scenes:

(source)

Other models, like Driving Into the Future and Driving Diffusion, can directly generate future scenarios based on the current ones. You get the point; we can generate scenes in an endless variety of ways, get more data for our models, and sustain this positive loop.

We've just seen 3 prominent families of LLM usage in self-driving cars: Perception, Planning, and Generation. The real question is...

Could we trust LLMs in self-driving cars?

And by this, I mean... What if your model has hallucinations? What if its replies are completely absurd, as ChatGPT's sometimes are? I remember, back in my first days in autonomous driving, big groups were already skeptical about Deep Learning, because it wasn't "deterministic" (as they called it).

We don't like Black Boxes, which is one of the main reasons End-To-End will struggle to get adopted. Is ChatGPT any better? I don't think so, and I would even say it's worse in many ways. However, LLMs are becoming more and more transparent, and the black box problem could eventually be solved.

To answer the question "Can we trust them?"... it's very early in the research, and I'm not sure anyone has really used them "online" — meaning live, in a car, on the streets, rather than back at headquarters just for training or image-generation purposes. I would definitely picture a Grok model on a Tesla someday, just for Q&A purposes. So for now, I will give you my cowardly but safe answer...

It's too early to tell!

Because it really is. The first wave of papers mentioning LLMs in Self-Driving Cars is from mid-2023, so let's give it some time. In the meantime, you could start with this survey that shows all the evolutions to date.

Alright, time for the best part of the article...

The LLMs 4 AD Summary

  • A Large Language Model (LLM) works in 3 key steps: inputs, transformer, output. The input is a set of tokenized words, the transformer is a classical transformer, and the output task is "next word prediction".
  • In a self-driving car, there are 3 key tasks we can solve with LLMs: Perception (detection, tracking, prediction), Planning (decision making, trajectory generation), and Generation (scene, videos, training data, ...).
  • In Perception, the main goal is to describe the scene we're looking at. The input is a set of raw multi-view images, and the Transformer aims to predict 3D bounding boxes. LLMs can also be used to ask for a specific query ("where are the taxis?").
  • In Planning, the main goal is to generate a trajectory for the car to take. The input is a set of objects (output of Perception, BEV Maps, ...), and the Transformer uses LLMs to understand context and reason about what to do.
  • In Generation, the main goal is to generate a video that corresponds to the prompt used. Models like GAIA-1 have a chat interface, and take as input videos to generate either alternate scenes (rainy, ...), or future scenes.
  • For now, it's too early to tell whether this can be used in the long run, but this research area is among the most active in the self-driving car space. It all comes back to the question: "Can we really trust LLMs in general?"

Next Steps

If you want to get started on LLMs for self-driving cars, there are several things you can do:

  • ⚠️ Before anything else, the most important: if you want to keep learning about self-driving cars, I talk about self-driving cars every day through my private emails, sending many tips and lots of direct content. You should join here.
  • ✅ To begin, build an understanding of LLMs for self-driving cars. This is partly done, you can continue to explore the resources I provided in the article.
  • ➡️ Second, build skills related to Auto-Encoders and Transformer Networks. My image segmentation series is perfect for this, and will help you understand Transformer Networks with no NLP example, which means it's for Computer Vision Engineer's brains.
  • ➡️ Then, understand how Bird Eye View Networks work. It might not be mentioned in general LLM courses, but in self-driving cars, Bird Eye View is the central format where we can fuse all the data (LiDARs, cameras, multi-views, ...), build maps, and directly create paths to drive. You can do so in my Bird Eye View course (if closed, join my email list to be notified).
  • Finally, practice training, finetuning, and running LLMs in self-driving car scenarios. Run repos like Talk2BEV and the others I mentioned in the article. Most of them are open source, but the data can be hard to find. This is noted last, but there isn't really an order in all of this.

Author Bio

Jérémy Cohen is a self-driving car engineer and founder of Think Autonomous, a platform to help engineers learn about cutting-edge technologies such as self-driving cars and advanced Computer Vision. In 2022, Think Autonomous won the prize for Top Global Business of the Year in the Educational Technology Category, and Jérémy was named a 2023 40 Under 40 Innovator by Analytics Insight magazine, the largest printed magazine on Artificial Intelligence. You can join 10,000 engineers reading his private daily emails on self-driving cars here.

Citation

For attribution in academic contexts or books, please cite this work as

Jérémy Cohen, "Car-GPT: Could LLMs finally make self-driving cars happen?", The Gradient, 2024.

BibTeX citation:

@article{cohen2024cargpt,
    author = {Jérémy Cohen},
    title = {Car-GPT: Could LLMs finally make self-driving cars happen?},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/car-gpt}},
}
]]>
<![CDATA[Do text embeddings perfectly encode text?]]>https://thegradient.pub/text-embedding-inversion/65e3d66193571d5c8c154aecTue, 05 Mar 2024 20:15:58 GMTThe rise of the vector databaseDo text embeddings perfectly encode text?

As a result of the rapid advancement of generative AI in recent years, many companies are rushing to integrate AI into their businesses. One of the most common ways of doing this is to build AI systems that answer questions concerning information that can be found within a database of documents. Most solutions for such a problem are based on one key technique: Retrieval Augmented Generation (RAG).

Overview of a RAG system. Source: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"

This is what lots of people do now as a cheap and easy way to get started using AI: store lots of documents in a database, have the AI retrieve the most relevant documents for a given input, and then generate a response to the input that is informed by the retrieved documents.

These RAG systems determine document relevance using "embeddings": vector representations of documents produced by an embedding model. These embeddings are supposed to represent some notion of similarity, so documents that are relevant to a search will have high vector similarity in embedding space.
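To make the mechanics concrete, here is a minimal sketch of embedding-based retrieval (not any particular vector database's API); the embed function below is a toy stand-in for a real embedding model, so only the ranking logic is meaningful.

import numpy as np

# Toy stand-in for a real embedding model; a real RAG system would call a
# trained sentence encoder here instead of hashing into random vectors.
def embed(text, dim=8):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "Our refund policy lasts 30 days.",
    "The office is closed on public holidays.",
    "Passwords must be at least 12 characters long.",
]
doc_vectors = [embed(d) for d in documents]

query = "How long do customers have to request a refund?"
query_vector = embed(query)

# Rank stored documents by similarity to the query embedding and keep the best.
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
print(documents[int(np.argmax(scores))])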

The prevalence of RAG has led to the rise of the vector database, a new type of database designed for storing and searching through large numbers of embeddings. Hundreds of millions of dollars of funding have been given out to startups that claim to facilitate RAG by making embedding search easy. And the effectiveness of RAG is the reason why lots of new applications are converting text to vectors and storing them in these vector databases.

Embeddings are hard to read

So what is stored in a text embedding? Beyond the requirement of semantic similarity, there are no constraints on which embedding must be assigned to a given text input. The numbers within embedding vectors can be anything, and vary based on the model's initialization. We can interpret an embedding's similarity to other embeddings, but we have no hope of ever understanding its individual numbers.

A neural embedding model (light blue) takes text input and produces an embedding, a vector that can be used for search.

Now imagine you're a software engineer building a RAG system for your company. You decide to store your vectors in a vector database. You notice that what's stored in the database are the embedding vectors, not the text data itself. The database fills up with rows and rows of random-seeming numbers that represent the text data, but it never 'sees' any text data at all.

You know that the text corresponds to customer documents that are protected by your company’s privacy policy. But you’re not really sending text off-premises at any time; you only ever send embedding vectors, which look to you like random numbers.

What if someone hacks into the database and gains access to all your text embedding vectors – would this be bad? Or if the service provider wanted to sell your data to advertisers – could they? Both scenarios involve being able to take embedding vectors and invert them somehow back to text.

From text to embeddings...back to text

The problem of recovering text from embeddings is exactly the scenario we tackle in our paper Text Embeddings Reveal (Almost) As Much As Text (EMNLP 2023). Are embedding vectors a secure format for information storage and communication? Put simply: can input text be recovered from output embeddings?

Before diving into solutions, let’s think about the problem a little bit more. Text embeddings are the output of neural networks, sequences of matrix multiplications joined by nonlinear function operations applied to input data. In traditional text processing neural networks, a string input is split into a number of token vectors, which repeatedly undergo nonlinear function operations. At the output layer of the model, tokens are averaged into a single embedding vector.
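As a toy illustration of that final pooling step (assuming simple mean pooling, which many embedding models use), the per-token vectors are averaged into one fixed-size vector:

import numpy as np

# Random vectors standing in for the model's last-layer token representations.
num_tokens, hidden_dim = 32, 768
token_vectors = np.random.randn(num_tokens, hidden_dim)

# Mean pooling: one 768-dimensional embedding for the whole input text.
text_embedding = token_vectors.mean(axis=0)
print(text_embedding.shape)  # (768,)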

A maxim from the signal processing community known as the data processing inequality tells us that functions cannot add information to an input; they can only preserve or decrease the amount of information available. Even though conventional wisdom tells us that deeper layers of a neural network are constructing ever-higher-order representations, they aren't adding any information about the world that didn't come in on the input side.

Additionally, the nonlinear layers certainly destroy some information. One ubiquitous nonlinear layer in modern neural networks is the “ReLU” function, which simply sets all negative inputs to zero. After applying ReLU throughout the many layers of a typical text embedding model, it is not possible to retain all the information from the input.
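A tiny demonstration of this information loss: two inputs that differ only in their negative entries become indistinguishable after ReLU.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

a = np.array([-2.0, -0.5, 1.0, 3.0])
b = np.array([-7.0, -0.1, 1.0, 3.0])

# Both print [0. 0. 1. 3.]: the difference between a and b is gone for good.
print(relu(a))
print(relu(b))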

Inversion in other contexts

Similar questions about information content have been asked in the computer vision community. Several results have shown that deep representations (embeddings, essentially) from image models can be used to recover the input images with some degree of fidelity. An early result (Dosovitskiy, 2016) showed that images can be recovered from the feature outputs of deep convolutional neural networks (CNNs). Given the high-level feature representation from a CNN, they could invert it to produce a blurry-but-similar version of the original input image.

In computer vision, inversion models (yellow) have successfully reconstructed images given only the 1000 probability outputs of an ImageNet classifier, most of which are close to 0. (Images from Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers.)

People have improved on the image embedding inversion process since 2016: models have been developed that perform inversion with higher accuracy and have been shown to work across more settings. Surprisingly, some work has shown that images can even be inverted from the outputs of an ImageNet classifier (1000 class probabilities).

The journey to vec2text

If inversion is possible for image representations, then why can't it work for text? Let's consider a toy version of the problem of recovering text from embeddings. For our toy setting we'll restrict text inputs to 32 tokens (around 25 words, a sentence of decent length) and embed them all to vectors of 768 floating-point numbers. At 32-bit precision, these embeddings are 768 * 32 = 24,576 bits, or around 3 kilobytes.

That's only a few words, represented by quite a lot of bits. Do you think we could perfectly reconstruct the text in this scenario?

First things first: we need to define a measure of goodness, to know how well we have accomplished our task. One obvious metric is "exact match": how often we get the exact input back after inversion. No prior inversion method has any success on exact match, so it's quite an ambitious target. So we may want to start with a smoother metric that measures how similar the inverted text is to the input. For this we'll use BLEU score, which you can think of roughly as a percentage of how close the inverted text is to the input.

With our success metric defined, let's propose an approach to evaluate. As a first attempt, we can pose inversion as a traditional machine learning problem and solve it the best way we know how: gather a large dataset of embedding-text pairs and train a model to output the text given the embedding as input.

So this is what we did. We built a transformer that takes the embedding as input and trained it with a traditional language modeling objective on the output text. This first approach gives us a model with a BLEU score of around 30/100. Practically, the model can guess the topic of the input text and get some of the words, but it loses their order and often gets most of them wrong. The exact match score is close to zero. It turns out that asking a model to reverse the output of another model in a single forward pass is quite hard (as are other complicated text generation tasks, like generating text in perfect sonnet form or satisfying multiple attributes).

Overview of architectures considered. Prior work (left) uses a decoder-only architecture and inputs an embedding as a prefix. We initially trained an encoder-decoder model (middle) to condition on an upscaled sentence embedding on the encoder-side. Our final method (right) includes an additional "hypothesis" text along with an upscaled hypothesis embedding.

After training our initial model, we noticed something interesting. A different way to measure model output quality is to re-embed the generated text (we call this the "hypothesis") and measure this embedding's similarity to the true embedding. When we do this with our model's generations, we see a very high cosine similarity – around 0.97. This means that we're able to generate text that's close to the ground-truth text in embedding space, but not identical to it.

(An aside: what if this weren't the case? That is, what if the embedder assigned our incorrect hypothesis the same embedding as the original sequence? Our embedder would then be lossy, mapping multiple inputs to the same output. If that were the case, our problem would be hopeless: given an embedding, we would have no way of distinguishing which of multiple possible sequences produced it. In practice, we never observe these types of collisions in our experiments.)

The observation that hypotheses have embeddings close to, but different from, the ground truth inspires an optimization-like approach to embedding inversion. Given a ground-truth embedding (where we want to go) and a current hypothesis text with its embedding (where we are right now), we can train a corrector model to output something that's closer to the ground truth than the hypothesis.

Overview of our method, Vec2Text. Given access to a target embedding e (blue) and query access to an embedding model ϕ (blue model), the system aims to iteratively generate (yellow model) hypotheses ê (pink) to reach the target.

Our goal is now clear: we want to build a system that can take a ground-truth embedding, a hypothesis text sequence, and the hypothesis position in embedding space, and predict the true text sequence. We think of this as a type of ‘learned optimization’ where we’re taking steps in embedding space in the form of discrete sequences. This is the essence of our method, which we call vec2text.

After working through some details and training the model, this process works extremely well! A single forward pass of correction increases the BLEU score from 30 to 50. And one benefit of this model is that it can naturally be queried recursively. Given a current text and its embedding, we can run many steps of this optimization, iteratively generating hypotheses, re-embedding them, and feeding them back in as input to the model. With 50 steps and a few tricks, we can get back 92% of 32-token sequences exactly, and reach a BLEU score of 97! (Generally, achieving a BLEU score of 97 means we're almost perfectly reconstructing every sentence, perhaps with a few punctuation marks misplaced here and there.)
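In pseudocode, the recursive procedure looks roughly like the sketch below; embed, initial_inversion_model, and corrector_model are hypothetical stand-ins for the trained components, not the API of the released vec2text library.

# Schematic of the iterative correction loop (illustrative pseudocode only).
def invert_embedding(target_embedding, embed, initial_inversion_model,
                     corrector_model, num_steps=50):
    # Step 0: a single forward pass gives a rough first hypothesis text.
    hypothesis = initial_inversion_model(target_embedding)
    for _ in range(num_steps):
        # Re-embed the current hypothesis to see where it lands in embedding space.
        hypothesis_embedding = embed(hypothesis)
        # The corrector conditions on the target, the current hypothesis text, and
        # its embedding, and proposes text that should land closer to the target.
        hypothesis = corrector_model(target_embedding, hypothesis, hypothesis_embedding)
    return hypothesis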

Scaling and future work

The fact that text embeddings can be perfectly inverted raises many follow-up questions. For one, the text embedding vector contains a fixed number of bits; there must be some sequence length at which information can no longer be perfectly stored within this vector. Even though we can recover most texts of length 32, some embedding models can embed documents up to thousands of tokens. We leave it up to future work to analyze the relationship between text length, embedding size, and embedding invertibility.

Another open question is how to build systems that can defend against inversion. Is it possible to create models that can successfully embed text such that embeddings remain useful while obfuscating the text that created them?

Finally, we are excited to see how our method might apply to other modalities. The main idea behind vec2text (a sort of iterative optimization in embedding space) doesn’t use any text-specific tricks. It’s a method that iteratively recovers information contained in any fixed input, given black-box access to a model. It remains to be seen how these ideas might apply to inverting embeddings from other modalities as well as to approaches more general than embedding inversion.

To use our models to invert text embeddings, or to get started running embedding inversion experiments yourself, check out our Github repository: https://github.com/jxmorris12/vec2text


References

Inverting Visual Representations with Convolutional Networks (2015), https://arxiv.org/abs/1506.02753

Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers (2021), https://proceedings.mlr.press/v139/teterwak21a/teterwak21a.pdf

Text Embeddings Reveal (Almost) As Much As Text (2023), https://arxiv.org/abs/2310.06816

Language Model Inversion (2024), https://arxiv.org/abs/2311.13647

Author Bio

Jack Morris is a PhD student at Cornell Tech in New York City. He works on research at the intersection of machine learning, natural language processing, and security. He’s especially interested in the information content of deep neural representations like embeddings and classifier outputs.

Citation

For attribution in academic contexts or books, please cite this work as

Jack Morris, "Do text embeddings perfectly encode text?", The Gradient, 2024.

BibTeX citation:

@article{morris2024inversion,
    author = {Jack Morris},
    title = {Do text embeddings perfectly encode text?},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/text-embedding-inversion}},
}
]]>
<![CDATA[Why Doesn’t My Model Work?]]>https://thegradient.pub/why-doesnt-my-model-work/65ce1b5993571d5c8c1549b8Sat, 24 Feb 2024 18:41:54 GMT

Have you ever trained a model you thought was good, but then it failed miserably when applied to real world data? If so, you’re in good company. Machine learning processes are complex, and it’s very easy to do things that will cause overfitting without it being obvious. In the 20 years or so that I’ve been working in machine learning, I’ve seen many examples of this, prompting me to write “How to avoid machine learning pitfalls: a guide for academic researchers” in an attempt to prevent other people from falling into these traps.

But you don’t have to take my word for it. These issues are being increasingly reported in both the scientific and popular press. Examples include the observation that hundreds of models developed during the Covid pandemic simply don’t work, and that a water quality system deployed in Toronto regularly told people it was safe to bathe in dangerous water. Many of these are documented in the AIAAIC repository. It’s even been suggested that these machine learning missteps are causing a reproducibility crisis in science — and, given that many scientists use machine learning as a key tool these days, a lack of trust in published scientific results.

In this article, I’m going to talk about some of the issues that can cause a model to seem good when it isn’t. I’ll also talk about some of the ways in which these kinds of mistakes can be prevented, including the use of the recently-introduced REFORMS checklist for doing ML-based science.

Duped by Data

Misleading data is a good place to start, or rather not a good place to start, since the whole machine learning process rests upon the data that’s used to train and test the model.

In the worst cases, misleading data can cause the phenomenon known as garbage in garbage out; that is, you can train a model, and potentially get very good performance on the test set, but the model has no real world utility. Examples of this can be found in the aforementioned review of Covid prediction models by Roberts et al. In the rush to develop tools for Covid prediction, a number of public datasets became available, but these were later found to contain misleading signals — such as overlapping records, mislabellings and hidden variables — all of which helped models to accurately predict the class labels without learning anything useful in the process.

Take hidden variables. These are features that are present in data, and which happen to be predictive of class labels within the data, but which are not directly related to them. If your model latches on to these during training, it will appear to work well, but may not work on new data. For example, in many Covid chest imaging datasets, the orientation of the body is a hidden variable: people who were sick were more likely to have been scanned lying down, whereas those who were standing tended to be healthy. Because they learnt this hidden variable, rather than the true features of the disease, many Covid machine learning models turned out to be good at predicting posture, but bad at predicting Covid. Despite their name, these hidden variables are often in plain sight, and there have been many examples of classifiers latching onto boundary markers, watermarks and timestamps embedded in images, which often serve to distinguish one class from another without having to look at the actual data.

A related issue is the presence of spurious correlations. Unlike hidden variables, these have no true relationship to anything else in the data; they’re just patterns that happen to correlate with the class labels. A classic example is the tank problem, where the US military allegedly tried to train a neural network to identify tanks, but it actually recognised the weather, since all the pictures of tanks were taken at the same time of day. Consider the images below: a machine learning model could recognise all the pictures of tanks in this dataset just by looking at the colour of pixels towards the top of an image, without having to consider the shape of any of the objects. The performance of the model would appear great, but it would be completely useless in practice.

(Source: by author)

Many (perhaps most) datasets contain spurious correlations, but they’re not usually as obvious as this one. Common computer vision benchmarks, for example, are known to have groups of background pixels that are spuriously correlated with class labels. This represents a particular challenge to deep learners, which have the capacity to model many patterns within the data; various studies have shown that they do tend to capture spuriously correlated patterns, and this reduces their generality. Sensitivity to adversarial attacks is one consequence of this: if a deep learning model bases its prediction on spurious correlations in the background pixels of an image, then making small changes to these pixels can flip the prediction of the model. Adversarial training, where a model is exposed to adversarial samples during training, can be used to address this, but it’s expensive. An easier approach is just to look at your model, and see what information it’s using to make its decisions. For instance, if a saliency map produced by an explainable AI technique suggests that your model is focusing on something in the background, then it’s probably not going to generalise well.

Sometimes it's not the data itself that is problematic, but rather the labelling of the data. This is especially the case when data is labelled by humans, and the labels end up capturing biases, misassumptions or just plain old mistakes made by the labellers. Examples of this can be seen in datasets used as image classification benchmarks, such as MNIST and CIFAR, which typically have a mislabelling rate of a couple of percent — not a huge amount, but pretty significant when modellers are fighting over accuracies in the tenths of a percent. That is, if your model does slightly better than the competition, is it due to an actual improvement, or due to modelling noise in the labelling process? Things can be even more troublesome when working with data that has implicit subjectivity, such as sentiment classification, where there's a danger of overfitting to particular labellers.

Led by Leaks

Bad data isn’t the only problem. There’s plenty of scope for mistakes further down the machine learning pipeline. A common one is data leakage. This happens when the model training pipeline has access to information it shouldn’t have access to, particularly information that confers an advantage to the model. Most of the time, this manifests as information leaks from the test data — and whilst most people know that test data should be kept independent and not explicitly used during training, there are various subtle ways that information can leak out.

One example is performing a data-dependent preprocessing operation on an entire dataset, before splitting off the test data. That is, making changes to all the data using information that was learnt by looking at all the data. Such operations vary from the simple, such as centering and scaling numerical features, to the complex, such as feature selection, dimensionality reduction and data augmentation — but they all have in common the fact that they use knowledge of the whole dataset to guide their outcome. This means that knowledge of the test data is implicitly entering the model training pipeline, even if it is not explicitly used to train the model. As a consequence, any measure of performance derived from the test set is likely to be an overestimate of the model’s true performance.

Let’s consider the simplest example: centering and scaling. This involves looking at the range of each feature, and then using this information to rescale all the values, typically so that the mean is 0 and the standard deviation is 1. If this is done on the whole dataset before splitting off the test data, then the scaling of the training data will include information about the range and distribution of the feature values in the test set. This is particularly problematic if the range of the test set is broader than the training set, since the model could potentially infer this fact from the truncated range of values present in the training data, and do well on the test set just by predicting values higher or lower than those which were seen during training. For instance, if you’re working on stock price forecasting from time series data with a model that takes inputs in the range 0 to 1 but it only sees values in the range 0 to 0.5 during training, then it’s not too hard for it to infer that stock prices will go up in the future.
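Here is a minimal scikit-learn sketch of the leaky pattern versus the safe one, on synthetic data purely for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Leaky: the scaler is fit on the full dataset before the split, so statistics
# of the future test rows bleed into the training data.
X_leaky = StandardScaler().fit_transform(X)
X_train_bad, X_test_bad, _, _ = train_test_split(X_leaky, y, test_size=0.2, random_state=0)

# Safe: split first, fit the scaler on the training split only, then apply
# that same transformation to the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)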

In fact, forecasting is an area of machine learning that is particularly susceptible to data leaks, due to something called look ahead bias. This occurs when information the model shouldn’t have access to leaks from the future and artificially improves its performance on the test set. This commonly happens when the training set contains samples that are further ahead in time than the test set. I’ll give an example later of when this can happen, but if you work in this area, I’d also strongly recommend taking a look at this excellent review of pitfalls and best practices in evaluating time series forecasting models.

An example of a more complex data-dependent preprocessing operation leading to overly-optimistic performance metrics can be found in this review of pre-term birth prediction models. Basically, a host of papers reported high accuracies at predicting whether a baby would be born early, but it turned out that all had applied data augmentation to the data set before splitting off the test data. This resulted in the test set containing augmented samples of training data, and the training set containing augmented samples of test data — which amounted to a pretty significant data leak. When the authors of the review corrected this, the predictive performance of the models dropped from being near perfect to not much better than random.

(Source: https://arxiv.org/abs/2108.02497)

Oddly, one of the most common examples of data leakage doesn’t have an agreed name (the terms overhyping and sequential overfitting have been suggested) but is essentially a form of training to the test set. By way of example, imagine the scenario depicted above where you’ve trained a model and evaluated it on the test set. You then decided its performance was below where you wanted it to be. So, you tweaked the model, and then you reevaluated it. You still weren’t happy, so you kept on doing this until its performance on the test set was good enough. Sound familiar? Well, this is a common thing to do, but if you’re developing a model iteratively and using the same test set to evaluate the model after each iteration, then you’re basically using that test set to guide the development of the model. The end result is that you’ll overfit the test set and probably get an over-optimistic measure of how well your model generalises.

Interestingly, the same process occurs when people use community benchmarks, such as MNIST, CIFAR and ImageNet. Almost everyone who works on image classification uses these data sets to benchmark their approaches; so, over time, it’s inevitable that some overfitting of these benchmarks will occur. To mitigate against this, it’s always advisable to use a diverse selection of benchmarks, and ideally try your technique on a data set which other people haven’t used.

Misinformed by Metrics

Once you’ve built your model robustly, you then have to evaluate it robustly. There’s plenty that can go wrong here too. Let’s start with an inappropriate choice of metrics. The classic example is using accuracy with an imbalanced dataset. Imagine that you’ve managed to train a model that always predicts the same label, regardless of its input. If half of the test samples have this label as their ground truth, then you’ll get an accuracy of 50% — which is fine, a bad accuracy for a bad classifier. If 90% of the test samples have this label, then you’ll get an accuracy of 90% — a good accuracy for a bad classifier. This level of imbalance is not uncommon in real world data sets, and when working with imbalanced training sets, it’s not uncommon to get classifiers that always predict the majority label. In this case, it would be much better to use a metric like F score or Matthews correlation coefficient, since these are less sensitive to class imbalances. However, all metrics have their weaknesses, so it’s always best to use a portfolio of metrics that give different perspectives on a model’s performance and failure modes.
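To make the trap concrete, here is the imbalanced-accuracy example in a few lines of scikit-learn, with a "model" that always predicts the majority class:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 90% of test samples belong to class 0; the classifier always predicts 0.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks impressive
print(f1_score(y_true, y_pred))           # 0.0 -- exposes the useless classifier
print(matthews_corrcoef(y_true, y_pred))  # 0.0 -- likewise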

Metrics for time series forecasting are particularly troublesome. There are a lot of them to choose from, and the most appropriate choice can depend on both the specific problem domain and the exact nature of the time series data. Unlike metrics used for classification, many of the regression metrics used in time series forecasting have no natural scale, meaning that raw numbers can be misleading. For instance, the interpretation of mean squared errors depends on the range of values present in the time series. For this reason, it’s important to use appropriate baselines in addition to appropriate metrics. As an example, this (already mentioned) review of time series forecasting pitfalls demonstrates how many of the deep learning models published at top AI venues actually perform worse than naive baseline models. For instance, they show that an Autoformer, a kind of complex transformer model designed for time series forecasting, can be beaten by a trivial model that predicts no change at the next time step — something that isn’t apparent from looking at metrics alone.
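Computing such a baseline takes only a couple of lines; the synthetic random-walk series below simply stands in for a real dataset:

import numpy as np

# Synthetic random walk standing in for a real time series.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))

# Naive "no change" baseline: predict that the next value equals the current one.
naive_forecast = series[:-1]
actual = series[1:]
naive_mse = np.mean((actual - naive_forecast) ** 2)

# Any proposed forecasting model should be reported alongside this number.
print(f"Naive baseline MSE: {naive_mse:.3f}")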

In general, there is a trend towards developing increasingly complex models to solve difficult problems. However, it’s important to bear in mind that some problems may not be solvable, regardless of how complex the model becomes. This is probably the case for many financial time series forecasting problems. It’s also the case when predicting certain natural phenomena, particularly those in which a chaotic component precludes prediction beyond a certain time horizon. For instance, many people think that earthquakes cannot be predicted, yet there are a host of papers reporting good performance on this task. This review paper discusses how these correct predictions may be due to a raft of modelling pitfalls, including inappropriate choice of baselines and overfitting due to data sparsity, unnecessary complexity and data leaks.

Another problem is assuming that a single evaluation is sufficient to measure the performance of a model. Sometimes it is, but a lot of the time you’ll be working with models that are stochastic or unstable; so, each time you train them, you get different results. Or you may be working with a small data set where you might just get lucky with an easy test split. To address both situations, it is commonplace to use resampling methods like cross-validation, which train and test a model on different subsets of the data and then work out the average performance. However, resampling introduces its own risks. One of these is the increased risk of data leaks, particularly when assuming that data-dependent preprocessing operations (like centering and scaling and feature selection) only need to be done once. They don’t; they need to be done independently for each iteration of the resampling process, and to do otherwise can cause a data leak. Below is an example of this, showing how feature selection should be done independently on the two training sets (in blue) used in the first two iterations of cross-validation, and how this results in different features being selected each time.

(Source: https://arxiv.org/abs/2108.02497)
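In scikit-learn, the easiest way to respect this is to put the data-dependent steps inside a Pipeline, so they are re-fit on the training folds of every cross-validation iteration; a minimal sketch on synthetic data:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.randn(200, 50)
y = np.random.randint(0, 2, size=200)

# Scaling and feature selection live inside the pipeline, so each CV iteration
# fits them on its own training folds only -- no leak from the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipeline, X, y, cv=5).mean())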

As I mentioned earlier, the danger of data leaks is even greater when working with time series data. Using standard cross-validation, every iteration except one will involve using at least one training fold that is further ahead in time than the data in the test fold. For example, if you imagine that the data rows in the figure above represent time-ordered multivariate samples, then the test sets (in pink) used in both iterations occur earlier in the time series than all or part of the training data. This is an example of a look ahead bias. Alternative approaches, such as blocked cross-validation, can be used to prevent these.
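Scikit-learn's TimeSeriesSplit is one such alternative: every training fold contains only samples that come before its test fold.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2 3]          test: [4 5]
# train: [0 1 2 3 4 5]      test: [6 7]
# train: [0 1 2 3 4 5 6 7]  test: [8 9]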

Multiple evaluations aren’t an option for everyone. For example, training a foundation model is both time-consuming and expensive, so doing it repeatedly is not feasible. Depending on your resources, this may be the case for even relatively small deep learning models. If so, then also consider using other methods for measuring the robustness of models. This includes things like using explainability analysis, performing ablation studies, or augmenting test data. These can allow you to look beyond potentially-misleading metrics and gain some appreciation of how a model works and how it might fail, which in turn can help you decide whether to use it in practice.

Falling Deeper

So far, I’ve mostly talked about general machine learning processes, but the pitfalls can be even greater when using deep learning models. Consider the use of latent space models. These are often trained separately to the predictive models that use them. That is, it’s not unusual to train something like an autoencoder to do feature extraction, and then use the output of this model within the training of a downstream model. When doing this, it’s essential to ensure that the test set used in the downstream model does not intersect with the training data used in the autoencoder — something that can easily happen when using cross-validation or other resampling methods, e.g. when using different random splits or not selecting models trained on the same training folds.

However, as deep learning models get larger and more complex, it can be harder to ensure these kinds of data leaks do not occur. For instance, if you use a pre-trained foundation model, it may not be possible to tell whether the data used in your test set was used to train the foundation model — particularly if you’re using benchmark data from the internet to test your model. Things get even worse if you’re using composite models. For example, if you’re using a BERT-type foundation model to encode the inputs when fine-tuning a GPT-type foundation model, you have to take into account any intersection between the datasets used to train the two foundation models in addition to your own fine-tuning data. In practice, some of these data sets may be unknown, meaning that you can’t be confident whether your model is correctly generalising or merely reproducing data memorised during pre-training.

Avoiding the Pits

These pitfalls are all too common. So, what’s the best way to avoid them? Well, one thing you can do is use a checklist, which is basically a formal document that takes you through the key pain points in the machine learning pipeline, and helps you to identify potential issues. In domains with high-stakes decisions, such as medicine, there are already a number of well-established checklists, such as CLAIM, and adherence to these is typically enforced by journals that publish in these areas.

However, I’d like to briefly introduce a new kid on the block: REFORMS, a consensus-based checklist for doing machine learning-based science. This was put together by 19 researchers across computer science, data science, mathematics, social sciences, and the biomedical sciences — including myself — and came out of a recent workshop on the reproducibility crisis in ML‑based science. It is intended to address the common mistakes that occur in the machine learning pipeline, including many of those mentioned in this article, in a more domain-independent manner. It consists of two parts: the checklist itself, and a paired guidance document that explains why each of the checklist items is important. The checklist works through the main components of a machine learning-based study, in each case encouraging the user to verify that the machine learning process is designed in such a way that it supports the overall aims of the study, doesn’t stumble into any of the common pitfalls, and enables the results to be verified by an independent researcher. Whilst it’s focused on the application of machine learning within a scientific context, a lot of what it covers is more generally applicable, so I’d encourage you to take a look even if you don’t consider your work to be “science”.

Another way of avoiding pitfalls is to make better use of tools. Now, one of my pet gripes regarding the current state of machine learning is that commonly-used tools do little to prevent you from making mistakes. That is, they’ll happily let you abuse the machine learning process in all sorts of ways without telling you what you’re doing is wrong. Nevertheless, help is available in the form of experiment tracking frameworks, which automatically keep a record of the models you trained and how you trained them, and this can be useful for spotting things like data leaks and training to the test set. An open source option is MLFlow, but there are plenty of commercial offerings. MLOps tools take this even further, and help to manage all the moving parts in a machine learning workflow, including the people.

Final Thought

It is possible to train a good model that generalises well to unseen data, but don’t believe you’ve done so until you’re satisfied that nothing which could have gone wrong has gone wrong. A healthy sense of suspicion is a good thing: do look at your trained model to make sure it’s doing something sensible, do analyse your metrics to understand where it’s making mistakes, do calibrate your results against appropriate baselines, and do consider using checklists to make sure you haven’t overlooked something important.


Author Bio

Michael is an Associate Professor at Heriot-Watt University, Edinburgh. He’s spent the last 20 years or so doing research on machine learning and bio-inspired computing. For more info see his academic website. He also writes about computer science more generally in his Fetch Decode Execute substack.

Citation

For attribution in academic contexts or books, please cite this work as

Michael Lones, "Why Doesn’t My Model Work?", The Gradient, 2024.

BibTeX citation:

@article{lones2024why,
    author = {Michael Lones},
    title = {Why Doesn’t My Model Work?},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/why-doesnt-my-model-work}},
}

]]>
<![CDATA[Deep learning for single-cell sequencing: a microscope to see the diversity of cells]]>https://thegradient.pub/deep-learning-for-single-cell-sequencing-a-microscope-to-uncover-the-rich-diversity-of-individual-cells/65606b4793571d5c8c154793Sat, 13 Jan 2024 18:12:44 GMT

The history of each living being is written in its genome, which is stored as DNA and present in nearly every cell of the body. No two cells are the same, even if they share the same DNA and cell type, as they still differ in the regulators that control how DNA is expressed by the cell. The human genome consists of 3 billion base pairs spread over 23 chromosomes. Within this vast genetic code, there are approximately 20,000 to 25,000 genes, constituting the protein-coding DNA and accounting for about 1% of the total genome [1]. To explore the functioning of complex systems in our bodies, especially this small coding portion of DNA, a precise sequencing method is necessary, and single-cell sequencing (sc-seq) technology fits this purpose.

In 2013, Nature selected single-cell RNA sequencing as the Method of the Year [2] (Figure 3), highlighting the importance of this method for exploring cellular heterogeneity through the sequencing of DNA and RNA at the individual cell level. Subsequently, numerous tools have emerged for the analysis of single-cell RNA sequencing data. For example, the scRNA-tools database has been compiling software for the analysis of single-cell RNA data since 2016, and by 2021 it included over 1000 tools [3]. Among these tools, many involve methods that leverage Deep Learning techniques, which will be the focus of this article – we will explore the pivotal role that Deep Learning, in particular, has played as a key enabler for advancing single-cell sequencing technologies.

Background

Flow of genetic information from DNA to protein in cells

Let’s first go over what exactly cells and sequences are. The cell is the fundamental unit of our bodies and the key to understanding how our bodies function in good health and how molecular dysfunction leads to disease. Our bodies are made of trillions of cells, and nearly every cell contains three genetic information layers: DNA, RNA, and protein. DNA is a long molecule containing the genetic code that makes each person unique. Like source code, it contains the instructions for making each protein in our bodies. These proteins are the workhorses of the cell that carry out nearly every task necessary for cellular life. For example, the enzymes that catalyze chemical reactions within the cell and the DNA polymerases that contribute to DNA replication during cell division are all proteins. The cell synthesizes proteins in two steps, Transcription and Translation (Figure 1), which together are known as gene expression. DNA is first transcribed into RNA, then RNA is translated into protein. We can consider RNA as a messenger between DNA and protein.

Figure 1. The central dogma of biology

While the cells of our body share the same DNA, they vary in their biological activity. For instance, the distinctions between immune cells and heart cells are determined by the genes that are either activated or deactivated in these cells. Generally, when a gene is activated, it leads to the creation of more RNA copies, resulting in increased protein production. Therefore, as cell types differ based on the quantity and type of RNA/protein molecules synthesized, it becomes intriguing to assess the abundance of these molecules at the single-cell level. This will enable us to investigate the behavior of our DNA  within each cell and attain a high-resolution perspective of the various parts of our bodies.

In general, all single-cell sequencing technologies can be divided into three main steps:

  1. Isolation of single cells from the tissue of interest and extraction of genetic material from each isolated cell
  2. Amplification of genetic material from each isolated cell and library preparation
  3. Sequencing of the library using a next-generation sequencer and data analysis

Navigating through the intricate steps of cellular biology and single-cell sequencing technologies, a pivotal question emerges: How is single-cell sequencing data represented numerically?

Structure of single-cell sequencing data

The structure of single-cell sequencing data takes the form of a matrix (Figure 2), where each row corresponds to a cell that has been sequenced and annotated with a unique barcode. The number of rows equals the total number of cells analyzed in the experiment. On the other hand, each column corresponds to a specific gene. Genes are the functional units of the genome that encode instructions for the synthesis of proteins or other functional molecules. In the case of scRNA-seq data, the numerical entries in the matrix represent the expression levels of genes in individual cells. These values indicate the amount of RNA produced from each gene in a particular cell, providing insights into the activity of genes within different cells.
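As a toy illustration (made-up counts; the gene names are used purely as column labels), such a matrix can be represented as a small cells-by-genes table:

import numpy as np
import pandas as pd

# 4 barcoded cells (rows) x 5 genes (columns); real scRNA-seq matrices have
# thousands of cells, ~20,000 genes, and are mostly zeros (sparse).
counts = np.array([
    [0, 3, 0, 1, 0],
    [2, 0, 0, 0, 5],
    [0, 0, 7, 0, 0],
    [1, 0, 0, 4, 0],
])
cells = ["AAACCTG", "AAACGGG", "AAAGATG", "AAATGCC"]  # cell barcodes
genes = ["CD3D", "MS4A1", "LYZ", "NKG7", "PPBP"]      # gene symbols

expression = pd.DataFrame(counts, index=cells, columns=genes)
print(expression)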

Figure 2. Schema of single-cell sequencing data

Single Cell Sequencing Overview

For more than 150 years, biologists have wanted to identify all the cells in the human body and classify them into distinct types based on accurate descriptions of their properties. The Human Cell Atlas Project (HCAP), the genetic equivalent of the Human Genome Project [4], is an international collaborative effort to map all the cells in the human body. “We can conceptualize the Human Cell Atlas as a map endeavoring to portray the human body coherently and systematically. Much like Google Maps, which allows us to zoom in for a closer examination of intricate details, the Human Cell Atlas provides insights into spatial information, internal attributes, and even the relationships among elements,” explains Aviv Regev, a computational and systems biologist at the Broad Institute of MIT and Harvard and Executive Vice President and Head of Genentech Research.

This analogy seamlessly aligns with the broader impact of single-cell sequencing, since it allows the analysis of individual cells instead of bulk populations. This technology proves invaluable in addressing intricate biological inquiries related to developmental processes and comprehending heterogeneous cellular or genetic changes under various treatment conditions or disease states. Additionally, it facilitates the identification of novel cell types within a given cellular population. The publication of the first single-cell RNA sequencing (scRNA-seq) paper in 2009 [5], and the method's subsequent designation as "Method of the Year" in 2013 [2], marked the genesis of an extensive endeavor to advance both experimental and computational techniques dedicated to unraveling the intricacies of single-cell transcriptomes.

As the technological landscape evolves, the narrative transitions to the advancements in single-cell research, particularly the early focus on single-cell RNA sequencing (scRNA-seq) due to its cost-effectiveness in studying complex cell populations. “In some ways, RNA has always been one of the easiest things to measure,” says Satija [6], a researcher at the New York Genome Center (NYGC). Yet, the rapid development of single-cell technology has ushered in a new era of possibilities: multimodal single-cell data integration. Recognized as the "Method of the Year 2019" by Nature [7] (Figure 3), this approach allows the measurement of different cellular modalities, including the genome, epigenome, and proteome, within the same cell. The layering of multiple pieces of information provides powerful insights into cellular identity, posing the challenge of effectively modeling and combining datasets generated from multimodal measurements. This integration challenge is met with the introduction of multi-view learning [8] methods, which explore common variations across modalities. This sophisticated approach, incorporating deep learning techniques, has shown relevant results across various fields, particularly in biology and biomedicine.

Amidst these advancements, a distinct challenge surfaces in a persistent limitation of single-cell RNA sequencing: the loss of spatial information during transcriptome profiling, since cells are isolated from their original positions. Spatially resolved transcriptomics (SRT) emerges as a pivotal solution [9], addressing the challenge by preserving spatial details during the study of complex biological systems. Its recognition as the Method of the Year 2020 solidifies its place as a critical solution to the challenges inherent in advancing our understanding of complex biological systems.

Figure 3. Evolution of single-cell sequencing over time

Having explored the panorama of single-cell sequencing, let us now delve into the role of deep learning in the context of single-cell sequencing.

Deep Learning on single-cell sequencing

Deep learning is increasingly employed in single-cell analysis due to its capacity to handle the complexity of single-cell sequencing data. In contrast, conventional machine-learning approaches require significant effort to develop a feature engineering strategy, typically designed by domain experts. The deep learning approach, however, autonomously captures relevant characteristics from single-cell sequencing data, addressing the heterogeneity between single-cell sequencing experiments, as well as the associated noise and sparsity in such data. Below are three key reasons for the application of deep learning in single-cell sequencing:

  • High-Dimensional Data: Single-cell sequencing generates high-dimensional data, with thousands of genes and their expression levels measured for each cell. Deep learning models are adept at capturing complex relationships and patterns within this data, which can be challenging for traditional statistical methods.
  • Non-Linearity: Single-cell gene expression data is characterized by its inherent nonlinearity between gene expressions and cell-to-cell heterogeneity. Traditional statistical methods encounter difficulties in capturing the non-linear relationships present in single-cell gene expression data. In contrast, deep learning models are flexible and able to learn complex non-linear mappings.
  • Heterogeneity: Single-cell data is often characterized by diverse cell populations with varying gene expression profiles, presenting a complex landscape. Deep learning models can play a crucial role in identifying, clustering, and characterizing these distinct cell types or subpopulations, thereby facilitating a deeper understanding of cellular heterogeneity within a sample.

As we explore the reasons behind using deep learning in single-cell sequencing data, it leads us to the question: What deep learning architectures are often used in sc-seq data analysis?

Background on Autoencoders

Autoencoders (AEs) stand out among various deep-learning architectures (such as GANs and RNNs) as an especially relied-upon method for decoding the complexities of single-cell sequencing data. They are widely employed for dimensionality reduction while preserving the inherent heterogeneity of the data. By clustering cells in the reduced-dimensional space generated by autoencoders, researchers can effectively identify and characterize different cell types or subpopulations. This approach enhances our ability to discern and analyze the diverse cellular components within single-cell datasets. In contrast to non-deep-learning models, such as principal component analysis (PCA), which are integral components of established scRNA-seq data analysis software like Seurat [10], autoencoders distinguish themselves by uncovering non-linear manifolds. While PCA is constrained to linear transformations, the flexibility of autoencoders to capture complex non-linear mappings makes them a more powerful method for finding the nuanced relationships embedded in single-cell genomics.

To mitigate the overfitting challenge associated with autoencoders, several enhancements to the autoencoder structure have been implemented, specifically tailored to offer advantages in the context of sc-seq data. One notable adaptation often used for sc-seq data is the denoising autoencoder (DAE), which improves the autoencoder's reconstruction capability by introducing noise at the initial network layer: some of the input units are randomly set to zero. The denoising autoencoder then reconstructs the input from this intentionally corrupted version, which pushes the network to capture more relevant features and prevents it from merely memorizing the input (overfitting). This refinement significantly bolsters the model's resilience against data noise, thereby improving the quality of the low-dimensional representation of samples (i.e., the bottleneck) derived from the sc-seq data.
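As a rough sketch (not any specific published model), a denoising autoencoder for expression profiles might look like the PyTorch snippet below; the layer sizes, learning rate, and 20% zero-masking rate are illustrative assumptions, and the random input stands in for a real normalized count matrix.

import torch
import torch.nn as nn

n_genes, latent_dim = 2000, 32

encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_genes))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(128, n_genes)  # a batch of (toy) normalized expression profiles

for step in range(100):
    # Denoising trick: randomly zero out ~20% of the inputs, then ask the
    # network to reconstruct the uncorrupted profiles.
    mask = (torch.rand_like(x) > 0.2).float()
    latent = encoder(x * mask)          # low-dimensional cell representation
    reconstruction = decoder(latent)
    loss = loss_fn(reconstruction, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()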

Another variation of autoencoders frequently employed in sc-seq data analysis is the variational autoencoder (VAE), exemplified by models like scGen [19], scVI [14], scANVI [28], etc. VAEs, as a type of generative model, learn a distribution over the latent representation of the data. Instead of encoding the data into a single vector of p-dimensional latent variables, the data is encoded into two vectors of size p: a vector of means η and a vector of standard deviations σ. VAEs introduce a probabilistic element to the encoding process, facilitating the generation of synthetic single-cell data and offering insights into the diversity within a cell population. This nuanced approach adds another layer of complexity and richness to the exploration of single-cell genomics.

Applications of deep learning in sc-seq data analysis

This section outlines the main applications of deep learning in improving various stages of sc-seq data analysis, highlighting its effectiveness in advancing crucial aspects of the process.

scRNA-seq data imputation and denoising

Single-cell RNA sequencing (scRNA-seq) data encounter inherent challenges, with dropout events being a prominent concern: they result in sparsity within the gene expression matrix, which is often characterized by a substantial number of zero values. This sparsity significantly shapes downstream bioinformatics analyses. Many of these zero values arise artificially due to deficiencies in sequencing techniques, including problems like low gene expression, low capture rates, limited sequencing depth, or other technical factors. As a consequence, the observed zero values do not accurately reflect the true underlying expression levels. Hence, not all zeros in scRNA-seq data can be considered mere missing values, deviating from the conventional statistical approach of imputing missing data values. Given the intricate distinction between true and false zero counts, traditional imputation methods with predefined missing values may prove inadequate for scRNA-seq data. For instance, a classical imputation method like mean imputation might substitute these zero values with the average expression level of that gene across all cells. However, this approach runs the risk of oversimplifying the complexities introduced by dropout events in scRNA-seq data, potentially leading to biased interpretations.

ScRNA-seq data imputation methods can be divided into two categories: deep learning–based methods and non-deep-learning methods. The non-deep-learning imputation algorithms involve fitting statistical probability models or using the expression matrix for smoothing and diffusion. This simplicity renders them effective for certain types of samples. For example, Wagner et al. [11] utilized the k-nearest neighbors (KNN) method, identifying nearest neighbors between cells and aggregating gene-specific Unique Molecular Identifier (UMI) counts to impute the gene expression matrix. In contrast, Huang et al. [12] proposed the SAVER algorithm, leveraging gene-to-gene relationships to impute the gene expression matrix. For larger (tens of thousands of cells or more), high-dimensional, sparse, and complex scRNA-seq datasets, traditional computational methods face difficulties, often rendering analysis with these methods difficult or infeasible. Consequently, many researchers have turned to designing methods based on deep learning to address these challenges.

Most deep learning algorithms for imputing dropout events are based on autoencoders (AEs). For instance, in 2018, Eraslan et al. [13] introduced the deep count autoencoder (DCA). DCA utilizes a deep autoencoder architecture to address dropout events in single-cell RNA sequencing (scRNA-seq) data. It incorporates a probabilistic layer in the decoder to model the dropout process. This probabilistic layer accommodates the uncertainty associated with dropout events, enabling the model to generate a distribution of possible imputed values. To capture the characteristics of count data in scRNA-seq, DCA models the observed counts as originating from a negative binomial distribution.

Single-cell variational inference (scVI) is another deep learning algorithm, introduced by Lopez et al. [14]. ScVI is a probabilistic variational autoencoder (VAE) that combines deep learning and probabilistic modeling to capture the underlying structure of scRNA-seq data. ScVI can be used for imputation, denoising, and various other tasks related to the analysis of scRNA-seq data. In contrast to the DCA model, scVI employs a Zero-Inflated Negative Binomial (ZINB) distribution in the decoder to generate a distribution of possible counts for each gene in each cell. The ZINB distribution allows modeling both the probability of a gene's expression being zero (to model dropout events) and the distribution of positive values (to model non-zero counts).

Additionally, another study addressed the scRNA-seq data imputation challenge by introducing a recurrent network layer in their model, known as scScope [15]. This novel architecture iteratively performs imputations on zero-valued entries of input scRNA-seq data. The flexibility of scScope's design allows for the iterative improvement of imputed outputs through a chosen number of recurrent steps (T). Noteworthy is the fact that reducing the time recurrence of scScope to one (i.e., T = 1) transforms the model into a traditional autoencoder (AE). As scScope is essentially a modification of traditional AEs, its runtime is comparable to other AE-based models.

It's important to note that the application of deep learning in scRNA-seq data imputation and denoising is particularly advantageous due to its ability to capture non-linear relationships among genes. This contrasts with standard linear approaches, making deep learning more adept at providing informed and accurate imputation strategies in the context of single-cell genomics.

Batch effect removal

Single-cell data is commonly aggregated from diverse experiments that vary in terms of laboratories, protocols, sample compositions, and even technology platforms. These differences result in significant variations, or batch effects, within the data, which obscure the biological variation of interest during data integration. To address this issue, batch effects must be corrected by removing technical variance when integrating cells from different batches or studies. Among the earliest approaches to batch correction are linear methods based on linear regression, such as the limma package [16], whose removeBatchEffect function fits a linear model that accounts for the batches and their impact on gene expression. After fitting the model, it sets the coefficients associated with each batch to zero, effectively removing their impact. Another method, ComBat [17], does something similar but adds an extra refinement step, making the correction more accurate by using empirical Bayes shrinkage.
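The regression idea behind removeBatchEffect can be sketched in a few lines of Python. This is a simplified illustration of the linear-model principle, not the limma implementation, which also handles biological covariates and other design details.

```python
import numpy as np

def remove_batch_linear(logexpr, batches):
    """Regress out batch from a (cells x genes) log-expression matrix.
    logexpr: (n_cells, n_genes); batches: length-n_cells array of batch labels.
    A simplified sketch of the linear-model idea behind removeBatchEffect."""
    cells, _ = logexpr.shape
    levels = np.unique(batches)
    # One-hot batch covariates, dropping the first level as the reference
    design = np.zeros((cells, len(levels) - 1))
    for j, b in enumerate(levels[1:]):
        design[:, j] = (batches == b).astype(float)
    design = np.column_stack([np.ones(cells), design])        # intercept + batch terms
    # Least-squares fit per gene, then subtract only the fitted batch component
    coef, *_ = np.linalg.lstsq(design, logexpr, rcond=None)   # shape (1 + n_batches - 1, n_genes)
    batch_effect = design[:, 1:] @ coef[1:]
    return logexpr - batch_effect

# corrected = remove_batch_linear(log_expression, batch_labels)
```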

However, batch effects can be highly nonlinear, making it difficult to correctly align different datasets while preserving key biological variations. In 2018, Haghverdi et al. introduced the Mutual Nearest Neighbors (MNN) algorithm to identify pairs of cells from different batches in single-cell data [18]. These identified mutual nearest neighbors aid in estimating batch effects between batches. By applying this correction, the gene expression values are adjusted to account for the estimated batch effects, aligning them more closely and reducing discrepancies introduced by the different batches. For extensive single-cell datasets with highly nonlinear batch effects, traditional methods may prove less effective, prompting researchers to explore the application of neural networks for improved batch correction.
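Before moving to the neural approaches, the pairing step at the heart of MNN can be sketched as follows. This is a simplification of Haghverdi et al.'s method, which additionally cosine-normalizes the data and computes correction vectors from the pairs: two cells form a mutual pair when each is among the other's k nearest cross-batch neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mutual_nearest_neighbors(batch_a, batch_b, k=20):
    """Return index pairs (i, j) such that cell i in batch_a and cell j in batch_b
    are each within the other's k nearest cross-batch neighbors.
    A simplified sketch of the MNN pairing step, not the full correction method."""
    nn_b = NearestNeighbors(n_neighbors=k).fit(batch_b)
    nn_a = NearestNeighbors(n_neighbors=k).fit(batch_a)
    _, a_to_b = nn_b.kneighbors(batch_a)   # for each cell in A, its k nearest cells in B
    _, b_to_a = nn_a.kneighbors(batch_b)   # for each cell in B, its k nearest cells in A
    pairs = []
    for i, neighbors in enumerate(a_to_b):
        for j in neighbors:
            if i in b_to_a[j]:             # the relationship is reciprocal
                pairs.append((i, j))
    return pairs

# mnn_pairs = mutual_nearest_neighbors(expr_batch1, expr_batch2)
```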

One of the pioneering models to employ deep learning for batch correction is scGen. Developed by Lotfollahi et al., scGen [19] uses a variational autoencoder (VAE) architecture: a VAE is pre-trained on a reference dataset and then used to adjust real single-cell data and alleviate batch effects. Initially, the VAE is trained to capture latent features of the reference dataset's cells. This trained VAE is then applied to the actual data, producing latent representations for each cell. Gene expression profiles are adjusted by aligning these latent representations, reducing batch effects and harmonizing profiles across different experimental conditions.

Figure 4. scGen removes batch effects [19]. a, UMAP visualization of 4 technically diverse pancreatic datasets with their corresponding batch and cell types. b, Data corrected by scGen mixes shared cell types from different studies while preserving the biological variance of cells.

On the other hand, Zou et al. introduced DeepMNN [20], which employs a residual neural network and the mutual nearest neighbor (MNN) algorithm for scRNA-seq data batch correction. Initially, MNN pairs are identified across batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network is constructed using two stacked residual blocks to remove batch effects. The loss function of DeepMNN comprises a batch loss, computed based on the distance between cells in MNN pairs in the PCA subspace, and a weighted regularization loss, ensuring the network's output similarity to the input.

The majority of existing scRNA-seq methods are designed to remove batch effects first and then cluster cells, which can overlook certain rare cell types. Recently, Yu et al. developed scDML [21], a deep metric learning model that removes batch effects in scRNA-seq data, guided by initial clusters and by nearest neighbor information within and between batches. First, a graph-based clustering algorithm groups cells based on gene expression similarities; then the KNN algorithm identifies the k-nearest neighbors of each cell, and the MNN algorithm identifies mutual nearest neighbors, focusing on reciprocal relationships between cells. To remove batch effects, deep triplet learning with hard triplets is employed, which learns a low-dimensional embedding that preserves the structure of the original high-dimensional gene expression while removing batch effects.
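Deep metric learning of this kind typically optimizes a triplet margin loss over an embedding network. The sketch below is illustrative only; scDML's actual architecture, triplet mining strategy, and cluster merging are more involved than this toy version.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Map gene expression vectors to a low-dimensional embedding."""
    def __init__(self, n_genes, n_latent=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, n_latent),
        )

    def forward(self, x):
        return self.net(x)

# Anchor and positive ideally come from the same (merged) cluster but different
# batches; the negative comes from a different cluster.
triplet_loss = nn.TripletMarginLoss(margin=1.0)

def training_step(model, optimizer, anchor, positive, negative):
    optimizer.zero_grad()
    loss = triplet_loss(model(anchor), model(positive), model(negative))
    loss.backward()
    optimizer.step()
    return loss.item()

# model = EmbeddingNet(n_genes=2000); optimizer = torch.optim.Adam(model.parameters())
```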

Cell type annotation

Cell type annotation in single-cell sequencing is the process of identifying and labeling individual cells based on their gene expression profiles. It allows researchers to capture the diversity within a heterogeneous population of cells and to understand the cellular composition of tissues and the functional roles of different cell types in biological processes or diseases. Traditionally, researchers have used manual methods [22] to annotate cell sub-populations: they identify gene markers or gene signatures that are differentially expressed in a specific cell cluster, and then manually interpret the biological relevance of these markers to assign cell-type labels to the clusters. This manual annotation approach is time-consuming and requires considerable human effort, especially for large-scale single-cell datasets. Due to these challenges, researchers are turning to automated approaches to streamline the cell annotation process.

Two primary strategies are employed for cell type annotation: unsupervised and supervised. In the unsupervised setting, clustering toolkits such as Scanpy [23] and Seurat [10] are used, which require prior knowledge of established cellular markers: clusters are identified by grouping cells without external reference information and are then labeled using marker genes. A drawback of this approach is that replicability can decrease as the number of clusters grows and as cluster marker genes are selected repeatedly.

Conversely, supervised strategies rely on deep learning models trained on labeled data. These models learn intricate patterns and relationships within gene expression data during training, enabling them to predict cell types for unlabeled data. For example, Joint Integration and Discrimination (JIND) [24] deploys a GAN-style deep architecture in which an encoder is pre-trained on classification tasks, circumventing the need for an autoencoder framework; this model also accounts for batch effects. AutoClass [25] integrates an autoencoder and a classifier, combining a reconstruction loss with a classification loss to perform cell annotation alongside data imputation. Additionally, TransCluster [26], rooted in the Transformer framework and convolutional neural networks (CNNs), extracts features from the gene expression matrix for single-cell annotation.

Despite the power of deep neural networks, obtaining a large number of accurately and unbiasedly annotated cells for training is challenging, given the labor-intensive manual inspection of marker genes in scRNA-seq data. In response, semi-supervised learning has been leveraged for computational cell annotation. For instance, the SemiRNet [27] model uses both unlabeled cells and a limited amount of labeled scRNA-seq cells for cell identification; it is based on recurrent convolutional neural networks (RCNNs) and incorporates a shared network, a supervised network, and an unsupervised network. Furthermore, single-cell ANnotation using Variational Inference (scANVI) [28], a semi-supervised variant of scVI [14], maximizes the utility of existing cell state annotations. Cell BLAST, an autoencoder-based generative model, harnesses large-scale reference databases to learn nonlinear low-dimensional representations of cells, employing a cell similarity metric, normalized projection distance, to map query cells to specific cell types and identify novel cell types.

Multi-omics Data Integration

Recent studies have demonstrated the potential of deep learning models in addressing complex and multimodal biological challenges [29]. Among the algorithms proposed thus far, it is primarily deep learning–based models that provide the computational adaptability necessary for effectively modeling and incorporating nearly any form of omic data, including genomics (studying DNA sequences and genetic variations), epigenomics (examining changes in gene activity unrelated to DNA sequence, such as DNA modifications and chromatin structure), transcriptomics (investigating RNA molecules and gene expression through RNA sequencing), and proteomics (analyzing all proteins produced by an organism, including their structures, abundances, and modifications). Deep learning architectures, including autoencoders (AEs) and generative adversarial networks (GANs), have often been used for multi-omics integration in single cells. The key question in multi-omics integration is how to effectively represent the diverse multi-omics data within a unified latent space.

One of the early methods developed using Variational Autoencoders (VAE) for the integration of multi-omics single-cell data is known as totalVI [30]. The totalVI model, which is VAE-based, offers a solution for effectively merging scRNA-seq and protein data. In this model, totalVI takes input matrices containing scRNA-seq and protein count data. Specifically, it treats gene expression data as sampled from a negative binomial distribution, while protein data are treated as sampled from a mixture model consisting of two negative binomial distributions. The model first learns shared latent space representations through its encoder, which are then utilized to reconstruct the original data, taking into account the differences between the two original data modalities. Lastly, the decoder component estimates the parameters of the underlying distributions for both data modalities using the shared latent representation.

On the other hand, Zuo et al. [31] introduced scMVAE as a multimodal variational autoencoder designed to integrate transcriptomic and chromatin accessibility data in the same individual cells. scMVAE employs two separate single-modal encoders and two single-modal decoders to effectively model both transcriptomic and chromatin data. It achieves this by combining three distinct joint-learning strategies with a probabilistic Gaussian Mixture Model.

Figure 5. UMAP embedding of the MULTIGRATE latent space for a CITE-seq dataset combining gene expression and cell surface protein data [32].

Recently, Lotfollahi et al. [32] introduced an unsupervised deep generative model known as MULTIGRATE for the integration of multi-omic datasets. MULTIGRATE employs a multi-modal variational autoencoder structure that shares some similarities with the scMVAE model. However, it offers added generality and the capability to integrate both paired and unpaired single-cell data. To enhance cell alignment, the loss function incorporates Maximum Mean Discrepancy (MMD), penalizing any misalignment between the point clouds associated with different assays. Incorporating transfer learning, MULTIGRATE can map new multi-omic query datasets into a reference atlas and also perform imputations for missing modalities.
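The MMD penalty used for this kind of alignment can be written as a kernel two-sample statistic over the latent point clouds. The sketch below uses a single RBF kernel and the simple (biased) estimator; MULTIGRATE's actual loss and kernel choices may differ.

```python
import torch

def rbf_mmd(z1, z2, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two point clouds of latent vectors
    using an RBF kernel. z1: (n, d), z2: (m, d). A small misalignment penalty that
    pulls representations of different assays (or batches) together."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    k11 = kernel(z1, z1).mean()
    k22 = kernel(z2, z2).mean()
    k12 = kernel(z1, z2).mean()
    return k11 + k22 - 2 * k12

# total_loss = reconstruction_loss + kl_term + lambda_mmd * rbf_mmd(z_rna, z_protein)
```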

Conclusion

The application of deep learning in single-cell sequencing functions as an advanced microscope, revealing intricate insights within individual cells and providing a profound understanding of cellular heterogeneity and complexity in biological systems. This cutting-edge technology empowers scientists to explore previously undiscovered aspects of cellular behavior. However, the challenge lies in choosing between traditional tools and the plethora of available deep-learning options. The landscape of tools is vast, and researchers must carefully consider factors such as data type, complexity, and the specific biological questions at hand. Navigating this decision-making process requires a thoughtful evaluation of the strengths and limitations of each tool in relation to research goals.

On the other hand, a critical need in the development of deep learning approaches for scRNA-seq analysis is robust benchmarking. While many studies compare deep learning performance to standard methods, there is a lack of comprehensive comparisons across different deep learning models. Moreover, methods often claim superiority based on specific datasets and tissues (e.g., pancreas cells, immune cells), making it challenging to evaluate the necessity of specific model components or preprocessing steps. Addressing these challenges requires an understanding of when deep learning models fail and what their limitations are. Recognizing which types of deep learning approaches and model structures are beneficial in specific cases is crucial for developing new approaches and guiding the field.

In the realm of multi-omics single-cell integration, most deep learning methods aim to find a shared latent representation for all modalities. However, shared representation learning faces challenges such as heightened noise, sparsity, and the intricate task of balancing modalities. Inherent biases across institutions complicate generalization. Despite being less prevalent than single-modality approaches, integrating diverse modalities with unique cell populations is crucial. Objectives include predicting expression across modalities and identifying cells in similar states. Despite advancements, further efforts are essential for enhanced performance, particularly concerning unique or rare cell populations present in one technology but not the other.


Author Bio

Fatima Zahra El Hajji holds a master's degree in bioinformatics from the National School of Computer Science and Systems Analysis (ENSIAS) and subsequently worked as an AI intern at Piercing Star Technologies. Currently, she is a Ph.D. student at Mohammed VI Polytechnic University (UM6P), working under the supervision of Dr. Rachid El Fatimy and Dr. Tariq Daouda. Her research focuses on the application of deep learning techniques to single-cell sequencing data.

Citation

For attribution in academic contexts or books, please cite this work as

Fatima Zahra El Hajji, "Deep learning for single-cell sequencing: a microscope to see the diversity of cells", The Gradient, 2024.

BibTeX citation:

@article{elhajji2023nar,
    author = {El Hajji, Fatima Zahra},
    title = {Deep learning for single-cell sequencing: a microscope to see the diversity of cells},
    journal = {The Gradient},
    year = {2024},
    howpublished = {\url{https://thegradient.pub/deep-learning-for-single-cell-sequencing-a-microscope-to-uncover-the-rich-diversity-of-individual-cells}},
}

References

  1. National Human Genome Research Institute (NHGRI): A Brief Guide to Genomics, https://www.genome.gov/about-genomics/fact-sheets/A-Brief-Guide-to-Genomics
  2. Method of the Year 2013. Nat Methods 11, 1 (2014). https://doi.org/10.1038/nmeth.2801
  3. Zappia, L., Theis, F.J. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape. Genome Biol 22, 301 (2021). https://doi.org/10.1186/s13059-021-02519-4
  4. Collins FS, Fink L. The Human Genome Project. Alcohol Health Res World. 1995;19(3):190-195. PMID: 31798046; PMCID: PMC6875757.
  5. Tang F, Barbacioru C, Wang Y, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods. 2009; 6: 377-382.
  6. Eisenstein, M. The secret life of cells. Nat Methods 17, 7–10 (2020). https://doi.org/10.1038/s41592-019-0698-y
  7. Method of the Year 2019: Single-cell multimodal omics. Nat Methods 17, 1 (2020). https://doi.org/10.1038/s41592-019-0703-5
  8. Zhao, Jing et al. “Multi-view learning overview: Recent progress and new challenges.” Inf. Fusion 38 (2017): 43-54.
  9. Zhu, J., Shang, L. & Zhou, X. SRTsim: spatial pattern preserving simulations for spatially resolved transcriptomics. Genome Biol 24, 39 (2023).
  10. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology, 36(5), 411-420
  11. Wagner, F., Yan, Y., & Yanai, I. (2018). K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data. bioRxiv, 217737. Cold Spring Harbor Laboratory. https://doi.org/10.1101/217737
  12. Huang, M., Wang, J., Torre, E. et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 15, 539–542 (2018). https://doi.org/10.1038/s41592-018-0033-z
  13. Eraslan G, Simon LM, Mircea M, Mueller NS, Theis FJ. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun. 2019 Jan 23;10(1):390. doi: 10.1038/s41467-018-07931-2. PMID: 30674886; PMCID: PMC6344535.
  14. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I.,& Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12), 1053-1058.
  15. Y. Deng, F. Bao, Q. Dai, L.F. Wu, S.J. Altschuler. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 16, 311–314 (2019).
  16. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr 20;43(7):e47. doi: 10.1093/nar/gkv007. Epub 2015 Jan 20. PMID: 25605792; PMCID: PMC4402510.
  17. Johnson W.E. , Li C., Rabinovic A. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007; 8:118–127.
  18. Haghverdi, L., Lun, A., Morgan, M. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427 (2018). https://doi.org/10.1038/nbt.4091
  19. Lotfollahi, M., Wolf, F. A., & Theis, F. J. (2019). scGen predicts single-cell perturbation responses. Nature methods, 16(8), 715-721.
  20. Zou, B., Zhang, T., Zhou, R., Jiang, X., Yang, H., Jin, X., & Bai, Y. (2021). deepMNN: deep learning-based single-cell RNA sequencing data batch correction using mutual nearest neighbors. Frontiers in Genetics, 1441.
  21. Yu, X., Xu, X., Zhang, J. et al. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat Commun 14, 960 (2023). https://doi.org/10.1038/s41467-023-36635-5
  22. Z.A. Clarke, T.S. Andrews, J. Atif, D. Pouyabahar, B.T. Innes, S.A. MacParland, et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods Nat Protoc, 16 (2021), pp. 2749-2764
  23. Wolf, F., Angerer, P. & Theis, F. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018). https://doi.org/10.1186/s13059-017-1382-0
  24. Mohit Goyal, Guillermo Serrano, Josepmaria Argemi, Ilan Shomorony, Mikel Hernaez, Idoia Ochoa, JIND: joint integration and discrimination for automated single-cell annotation, Bioinformatics, Volume 38, Issue 9, March 2022, Pages 2488–2495, https://doi.org/10.1093/bioinformatics/btac140
  25. H. Li, C.R. Brouwer, W. Luo A universal deep neural network for in-depth cleaning of single-cell RNA-seq data Nat Commun, 13 (2022), p. 1901
  26. Song T, Dai H, Wang S, Wang G, Zhang X, Zhang Y and Jiao L (2022) TransCluster: A Cell-Type Identification Method for single-cell RNA-Seq data using deep learning based on transformer. Front. Genet. 13:1038919. doi: 10.3389/fgene.2022.1038919
  27. Dong X, Chowdhury S, Victor U, Li X, Qian L. Semi-Supervised Deep Learning for Cell Type Identification From Single-Cell Transcriptomic Data. IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):1492-1505. doi: 10.1109/TCBB.2022.3173587. Epub 2023 Apr 3. PMID: 35536811.
  28. Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M. I., & Yosef, N. (2021). Probabilistic harmonization and annotation of single‐cell transcriptomics data with deep generative models. Molecular Systems Biology, 17(1), e9620. https://doi.org/10.15252/msb.20209620
  29. Tasbiraha Athaya, Rony Chowdhury Ripan, Xiaoman Li, Haiyan Hu, Multimodal deep learning approaches for single-cell multi-omics data integration, Briefings in Bioinformatics, Volume 24, Issue 5, September 2023, bbad313, https://doi.org/10.1093/bib/bbad313
  30. Gayoso, A., Lopez, R., Steier, Z., Regier, J., Streets, A., & Yosef, N. (2019). A Joint Model of RNA Expression and Surface Protein Abundance in Single Cells. bioRxiv, 791947. https://www.biorxiv.org/content/early/2019/10/07/791947.abstract
  31. Chunman Zuo, Luonan Chen. Deep-joint-learning analysis model of single cell transcriptome and open chromatin accessibility data. Briefings in Bioinformatics. 2020.
  32. Lotfollahi, M., Litinetskaya, A., & Theis, F. J. (2022). Multigrate: single-cell multi-omic data integration. bioRxiv. https://www.biorxiv.org/content/early/2022/03/17/2022.03.16.484643
Salmon in the Loop
https://thegradient.pub/salmon-in-the-loop/
Sat, 16 Dec 2023 17:00:36 GMT

One of the most fascinating problems that a computer scientist may be lucky enough to encounter is a complex sociotechnical problem in a field going through the process of digital transformation. For me, that was fish counting. Recently, I worked as a consultant in a subdomain of environmental science focused on counting fish that pass through large hydroelectric dams. Through this overarching project, I learned about ways to coordinate and manage human-in-the-loop dataset production, as well as the complexities and vagaries of how to think about and share progress with stakeholders.

Background

Let’s set the stage. Large hydroelectric dams are subject to Environmental Protection Act regulations through the Federal Energy Regulatory Commission (FERC). FERC is an independent agency of the United States government that regulates the transmission and wholesale sale of electricity across the United States. The commission has jurisdiction over a wide range of electric power activities and is responsible for issuing licenses and permits for the construction and operation of hydroelectric facilities, including dams. These licenses and permits ensure that hydroelectric facilities are safe and reliable, and that they do not have a negative impact on the environment or other stakeholders. In order to obtain a license or permit from FERC, hydroelectric dam operators must submit detailed plans and studies demonstrating that their facility meets regulations. This process typically involves extensive review and consultation with other agencies and stakeholders. If a hydroelectric facility is found to be in violation of any set standards, FERC is responsible for enforcing compliance with all applicable regulations via sanctions, fines, or lease termination--resulting in a loss of the right to generate power.

Hydroelectric dams are essentially giant batteries. They generate power by building up a large reservoir of water on one side and directing that water through turbines in the body of the dam. Typically, a hydroelectric dam requires lots of space to store water on one side of it, which means they tend to be located away from population centers. The conversion process from potential to kinetic energy generates large amounts of electricity, and the amount of pressure and force generated is disruptive to anything that lives in or moves through the waterways—especially fish.

Simple diagram illustrating how hydroelectric power is generated (Tennessee Valley Authority)

It is also worth noting that the waterways were likely disrupted substantially when the dam was built, leading to behavioral or population-level changes in the fish species of the area. This is of great concern to the Pacific Northwest in particular, as hydropower is the predominant power generation means for the region (Bonneville Power Administration). Fish populations are constantly moving upstream and downstream and hydropower dams can act as barriers that block their passage, leading to reduced spawning. In light of the risks to fish, hydropower dams are subject to constraints on the amount of power they can generate and must show that they are not killing fish in large numbers or otherwise disrupting the rhythms of their lives, especially because the native salmonid species of the region are already threatened or endangered (Salmon Status).

To demonstrate compliance with FERC regulations, large hydroelectric dams are required to routinely produce data which shows that their operational activities do not interfere with endangered fish populations in aggregate. Typically, this is done by performing fish passage studies. A fish passage study can be conducted many different ways, but boils down to one primary dataset upon which everything is based: a fish count. Fish are counted as they pass through the hydroelectric dam, using structures like fish ladders to make their way from the reservoir side to the stream side.

A fish ladder at John Day Dam, how fish often ascend and pass through a dam (Delgado)

Fish counts can be conducted visually: a person trained in fish identification watches the fish pass, incrementing the count as they move upstream. As a fish is counted, observers record additional attributes beyond species, such as whether there is some kind of obvious illness or injury, whether the fish is hatchery-origin or wild, and so on. These differences between fish are subtle and require close monitoring and verification, since the attribute in question (a clipped adipose fin, a scratched midsection) may only be visible briefly as the fish swims by. As such, fish counting is a specialized job that requires expertise in identifying and classifying different species of fish, as well as knowledge of their life stages and other characteristics. The job is physically demanding, as it typically involves working in remote locations away from city centers, and it can be challenging to perform accurately under the difficult environmental conditions found at hydroelectric dams: poor lighting, unregulated temperatures, and other circumstances inhospitable to humans.

These modes of data collection work, but each introduces its own sources of error in the recorded counts. For example, some visual fish counts are documented with pen and paper, leading to incorrect counts through transcription error, and there can be disputes about the classification of a particular species. Different dam operators collect fish counts with varying degrees of granularity (some collect hourly, some daily, some monthly) and seasonality (some collect only during certain migration patterns called “runs”). After collection and validation, organizations correlate this data with operational information produced by the dam in an attempt to see if any activities of the dam have an adverse or beneficial effect on fish populations. Capturing these data piecemeal, with different governing standards and levels of detail, pushes organizations to look for new efficiencies enabled by technology.

Enter Computer Vision

Some organizations are exploring the use of computer vision and machine learning to significantly automate fish counting. Since dam operators subject to FERC are required to collect fish passage data anyway, and the data were previously produced or encoded in ways that were challenging to work with, an interesting “human-in-the-loop” machine learning system arises. A human-in-the-loop system combines the judgment and expertise of subject-matter expert humans (fish biologists) with the consistency and reliability of machine learning algorithms, which can help to reduce sources of error and bias in the output dataset used in the machine learning system. For the specific problem of fish counting, this could help to ensure that the system's decisions are informed by the latest scientific understanding of fish taxonomy and conservation goals, and could provide a more balanced and comprehensive approach to species or morphological classification. An algorithmic system could reduce the need for manual data collection and analysis by automating the process of identifying and classifying species, and could provide more timely and accurate information about species' health.

Building a computer vision system for a highly-regulated industry, such as hydropower utilities, can be a challenging task due to the need for high accuracy and strict compliance with regulatory standards. The process of building such a system would typically involve several steps:

Representation of an example process flow for productionizing an ML system

1. Define the problem space: Before starting to build the system, it is important to clearly define the problem that the system is intended to solve and the goals that it needs to achieve. This initial negotiation process happens largely without any defining technical constraints, and is based around the job that needs to be done by the system: identifying specific tasks that the system needs to perform, such as identification of the species or life stage of a fish. This may be especially challenging in a regulated industry like hydropower, as clients are subject to strict laws and regulations that require them to ensure that any tools or technologies they use are reliable and safe. They may be skeptical of a new machine learning system and may require assurances that it has been thoroughly tested and will not pose any risks to the environment, assurances grounded in data integrity, algorithmic transparency, and accountability.

Once the problem space is defined, more technical decisions can be made about how to implement the solution. For example, if the goal is to estimate population density during high fish passage using behavioral patterns such as schooling, it may make sense to capture and tag live video, to see the ways in which fish move in real time. Alternatively, if the goal is to identify illness or injury in a situation where there are few fish passing, it may make sense to capture still images and tag subsections of them to train a classifier. In a more developed hypothetical example, perhaps dam operators know that the fish ladder only allows fish to pass through it, all other species or natural debris are filtered out, and they want a “best guess” about rare species of fish that pass upstream. It may be sufficient in this case to implement generic video-based object detection to identify that a fish is moving through a scene, take a picture of it at a certain point, and provide that picture to a human to tag with the species. Once tagged, these data can be used to train a classifier which categorizes fish as being the rare species or not.

2. Establish performance goals: The definition of the problem space and the initial suggested process flow should be shared with all stakeholders as an input to the performance goals. This helps ensure all interested parties understand the problem at a high level, and what is possible for a given implementation. Practically, most hydropower utilities are interested in automated fish count solutions that meet an accuracy threshold of 95% as compared to a regular human visual count, but expectations around whether these metrics are achievable, and at what point in the production cycle, are heavily negotiated. Establishing these goals is a true sociotechnical problem, as it cannot be done without taking into account the real-world constraints that limit both the data and the system. These constraining factors are discussed later in the Obstacles section.

3. Collect and label training data: In order to train a machine learning model to perform the tasks required by the system, it is first necessary to produce a training dataset. Practically, this involves collecting a large number of fish images. The images are annotated with the appropriate species classification labels by a person with expertise in fish classification. The annotated images are then used to train a machine learning model. Through training, the algorithm learns the features characteristic of each subclass of fish and identifies those features to classify fish in new, unseen images. Because the end goal of this system is to minimize the counts that humans have to do, images with a low “confidence score” (a metric commonly produced by object-detection models) may be flagged for identification and tagging by human reviewers (a sketch of this routing step follows this list). The more seamless an integration with a production fish counting operation, the better.

4. Select a model: Once the training data has been collected, the next step is to select a suitable machine learning model and train it on the data. This could involve using a supervised learning approach, where the model is trained to recognize the different categories of fish after being shown examples of labeled data. At the time of this writing, deep learning systems based on models pretrained on datasets like ImageNet are popular choices. Once trained, the model should be validated against tagged data that it has not seen before and fine-tuned by adjusting the model parameters or refining the training dataset and retraining.

5. Monitor system performance: Once the model has been trained and refined, it can be implemented as part of a computer vision system for regular use. The system's performance should be monitored regularly to ensure that it is meeting the required accuracy targets and that model drift does not occur, perhaps from changes in environmental conditions, such as water clarity, or from the morphological changes alluded to in a later section.
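To make the flagging step described in (3) concrete, the routing logic can be as simple as a confidence threshold over the model's outputs. The sketch below is illustrative; the class and field names are hypothetical rather than taken from any specific production system.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    image_id: str
    species: str       # the model's best guess
    confidence: float  # score produced by the object-detection / classification model

def route_detections(detections, threshold=0.90):
    """Split model outputs into auto-accepted counts and a review queue for human
    fish-identification experts (a sketch of the human-in-the-loop routing step)."""
    auto_counts, review_queue = [], []
    for det in detections:
        if det.confidence >= threshold:
            auto_counts.append(det)
        else:
            # Sent to an expert for tagging; their labels flow back into the
            # training set for the next model version.
            review_queue.append(det)
    return auto_counts, review_queue
```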

It is at this point that the loop of tasks begins anew; to eke out more performance from the system, it is likely that more refined and nuanced negotiation about what to expect from the system is necessary, followed by additional training data, model selection, and parameter tuning and monitoring. The common assumption is that an automated or semiautomatic system like this is “set it and forget it,” but the process of curating and collating datasets or tuning hyperparameters is quite engaged and intentional.

Obstacles

In order for the computer vision algorithm to accurately detect and count fish in images or video frames, it must be trained on a large and diverse dataset that includes examples of different fish species and morphologies. However, this approach is not without challenges, as specified in the diagram below and with bolded phrases in subsequent paragraphs:

Recapitulation of diagram above; terminal states specified on the diagram are obstacles to successful system building

Dependence on expert knowledge is a concern worth discussing. If the system relies on expert-tagged data to train and evaluate its algorithms, the system may be vulnerable to errors and biases in the expert's knowledge and judgments, as any human-in-the-loop system would be. For example, if the experts are not familiar with certain species or morphologies, they may not be able to accurately tag these fish, which could lead to incorrect classifications by the system. Should an invasive species enter the waterway, it may become overrepresented within the dataset and affect the counts of the species that require conservation action. An excellent practical example of this is American shad, of which hundreds of thousands can pass during a migratory period, obscuring the Chinook salmon that are also passing during the same time. Manual counting methods rely solely on the judgment and observation of individual humans, which can be subject to a variety of sources of error and bias. Further, if the experts have a particular interest in certain species or morphologies, they may be more likely to tag these fish, which could result in over- or under-representation within the dataset. This can lead to life-threatening outcomes if the algorithmic system is used to make important decisions that have conservation implications.

Environmental conditions at hydroelectric dams present challenges for data collection as well.  Inadequate illumination and poor image quality can make it difficult for both humans and machine learning algorithms to accurately classify fish. Similarly, changing conditions, like a reduction in water clarity following a seasonal snowmelt can obscure fish in imagery. Migratory fish can be difficult to identify and classify on their own terms, due to the wide range of species and subspecies that exist, and the way their bodies change as they age. These fish are often difficult to study and monitor due to their migratory habits and the challenging environments in which they live. Further, there are often inconsistent data taxonomies produced across organizations, leading to different classifications depending on the parent organization undertaking the data tagging process. If humans cannot create accurate classifications to populate the initial dataset, the machine learning system will not be able to accurately produce predictions when used in production.

Example image of rainbow trout from an onsite edge device; challenging to tell from the lighting but those could be natural spots, injury, or parasitic infection

One of the key challenges of using a machine learning classifier on unaudited data is the risk of model drift, in which the model's performance degrades over time as the underlying data distribution changes. This may be of particular concern in a highly regulated environment, where even small changes in the model's performance could have significant consequences. The datasets produced through the effort of tagging fish images are fascinating because they are so intrinsically place-based, situated, and not easily replicable. Fish passage studies often involve monitoring a relatively small number of fish, which can make it difficult to accurately assess the overall profile of fish populations in the wider area. The number and types of fish that pass through a dam's fish ladders or other fish passage structures can vary greatly depending on the time of year or the "run" of fish passing through the waterways. This can make it difficult to compare data from different studies, or to draw conclusions about the long-term impact of the dam on fish populations. If the system is trained on a dataset of fish that has been tagged by subject-matter experts during one season, the dataset may not be comprehensive or representative of the full range of fish species and morphologies that exist in the wild across the full year. This could lead to under- or over-estimations of number and types of fish present in a given area. In this way, the specter of model drift is actually a problem composed of both challenging data production constraints and dependence on expert knowledge.

Finally, there are background labor issues to be dealt with as part of this problem space coming from intense organizational pressure. Fish counting is a cost center that hydroelectric dam operators would like to eliminate or reduce as much as possible. A technical solution that can accurately count fish is therefore very appealing. However, this raises concerns about ghost work, where human labor is used to train and validate the model, but is not acknowledged or compensated. Replacing human workers with a computer vision solution may significantly impact the displaced workers through financial hardship or the obsolescence of their job skills and expertise. If human expertise in the identification of fish is lost, this could lead to suboptimal decisions about species conservation, and could ultimately undermine the effectiveness of the system. This becomes more dangerous for conservation purposes if the technology is implemented as a cost-reduction measure: it could be the case that—when the model drifts—there are no taggers to set it back on track.

Couple all of these points with the longitudinal decline of wild fish populations globally, and you have a challenging set of conditions to attempt to generalize from.

If the available training data is limited or does not accurately reflect the diversity of fish species and morphologies that pass through the dam's fish passage structures, the accuracy of the algorithm may be reduced. Additionally, there are concerns about data leakage, where the model may be able to infer sensitive information about the fish from the images, such as how they are routed through the dam. Thinking about studies that happen in fisheries, as per Hwang (2022), the populations analyzed are so small and the outcomes so intentionally narrowly scoped that an organization would have to, at the very least, train a one-off model for each project or validate the output of each ML classifier against some additional source. This is largely outside the interest and capabilities of organizations hoping to reduce labor outlays through the implementation of a system like this.

Concluding Thoughts

The sociotechnical problem of fish counting is a niche problem with wide applications. If properly implemented, a machine learning system based around fish counts has the potential to be applied in many different settings, such as environmental regulatory compliance or aquaculture. The rapid digital transformation of environmental science has led to the development of novel datasets with interesting challenges, and a new cohort of professionals with the data literacy and technical abilities to work on problems like this. However, building a dataset of anadromous and catadromous fish that are protected under the ESA is a complex and challenging task, due to the limited availability of data, the complexity of fish taxonomy, the involvement of multiple stakeholders, and the dynamic environment in which these species live.

Moreover, organizations subject to regulation may be unsure of how to validate the accuracy of a machine learning model, and may be more interested in fish counts than in fish images (or vice versa). Bringing new technologies to bear on an organization, or on a dataset that was not robustly cataloged, means there will be new things to be discovered or measured through the application of the technology. Since a computer vision system like this is implemented to meet compliance with FERC regulations, it means bringing multiple stakeholders, including federal agencies, state and local governments, conservation organizations, and members of the public, into dialogue with one another when changes are required. By conducting these studies and regularly reporting the results to FERC, a hydroelectric dam operator can demonstrate that they are taking steps to minimize the dam's impact on fish populations and that the dam is not harming the overall health of the local fish population; but it also means cross-checking with the community in which they are situated.

Author Bio

Kevin McCraney is a data engineer, educator, and consultant. He works with public sector & large-scale institutions building data processing infrastructure & improving data literacy. Kevin has several years of experience teaching & mentoring early career professionals as they transition to technology from non-STEM disciplines. Working predominantly with institutions in the Pacific Northwest, he enjoys professional opportunities where he can combine a humanistic worldview and technical acumen to solve complex sociotechnical problems.

Citation

For attribution of this in academic contexts or books, please cite this work as:

Kevin McCraney, "Salmon in the Loop", The Gradient, 2023.

BibTeX citation

@article{k2023omccraney,
author = {McCraney, Kevin},
title = {Salmon in the Loop},
journal = {The Gradient},
year = {2023},
howpublished = {\url{https://thegradient.pub/salmon-in-the-loop}},
}

Works Cited

[1] Bonneville Power Administration. (n.d.). Hydropower impact. Hydropower Impact. Retrieved January 14, 2023, from https://www.bpa.gov/energy-and-services/power/hydropower-impact

[2] Delgado, K. (2021, July 19). That sounds fishy: Fish ladders at high-head dams impractical, largely unneeded. www.army.mil. Retrieved January 3, 2023, from https://www.army.mil/article/248558/that_sounds_fishy_fish_ladders_at_high_head_dams_impractical_largely_unneeded

[3] Hwang, I. (2022, May 31). Salmon hatchery data is harder to handle than you think. ProPublica. Retrieved December 10, 2023, from https://www.propublica.org/article/salmon-hatcheries-pnw-fish-data

[4] Salmon status. State of Salmon. (2021, January 11). Retrieved December 29, 2022, from https://stateofsalmon.wa.gov/executive-summary/salmon-status/

[5] How hydroelectric power works. Tennessee Valley Authority. (2021, January 11). Retrieved December 29, 2022, from https://www.tva.com/energy/our-power-system/hydroelectric/how-hydroelectric-power-works

Neural algorithmic reasoning
https://thegradient.pub/neural-algorithmic-reasoning/
Sat, 14 Oct 2023 15:30:15 GMT

In this article, we will talk about classical computation: the kind of computation typically found in an undergraduate Computer Science course on Algorithms and Data Structures [1]. Think shortest path-finding, sorting, clever ways to break problems down into simpler problems, incredible ways to organise data for efficient retrieval and updates. Of course, given The Gradient’s focus on Artificial Intelligence, we will not stop there; we will also investigate how to capture such computation with deep neural networks.

Why capture classical computation?

Maybe it’s worth starting by clarifying where my vested interest in classical computation comes from. Competitive programming—the art of solving problems by rapidly writing programs that need to terminate in a given amount of time, and within certain memory constraints—was a highly popular activity in my secondary school. For me, it was truly the gateway into Computer Science, and I trust the story is similar for many machine learning practitioners and researchers today. I have been able to win several medals at international programming competitions, such as the Northwestern Europe Regionals of the ACM-ICPC, the top-tier Computer Science competition for university students. Hopefully, my successes in competitive programming also give me some credentials to write about this topic.

While this should make clear why I care about classical computation, why should we all care? To arrive at this answer, let us ponder some of the key properties that classical algorithms have:

  • They are provably correct, and we can often have strong guarantees about the resources (time or memory) required for the computation to terminate.
  • They offer strong generalisation: while algorithms are often devised by observing several small-scale example inputs, once implemented, they will work without fault on inputs that are significantly larger, or distributionally different than such examples.
  • By design, they are interpretable and compositional: their (pseudo)code representation makes it much easier to reason about what the computation is actually doing, and one can easily recompose various computations together through subroutines to achieve different capabilities.

Looking at all of these properties taken together, they seem to be exactly the issues that plague modern deep neural networks the most: you can rarely guarantee their accuracy, they often collapse on out-of-distribution inputs, and they are very notorious as black boxes, with compounding errors that can hinder compositionality.

However, it is exactly those skills that are important for making AI instructive and useful to humans! For example, to have an AI system that reliably and instructively teaches a concept to a human, the quality of its output should not depend on minor details of the input, and it should be able to generalise that concept to novel situations. Arguably, these skills are also a missing key step on the road to building generally-intelligent agents. Therefore, if we are able to make any strides towards capturing traits of classical computation in deep neural networks, this is likely to be a very fruitful pursuit.

First impressions: Algorithmic alignment

My journey into the field started with an interesting, but seemingly inconsequential question:

Can neural networks learn to execute classical algorithms?

This can be seen as a good way to benchmark the extent to which certain neural networks are capable of behaving algorithmically; arguably, if a system can produce the outputs of a certain computation, it has “captured” it. Further, learning to execute provides a uniquely well-built environment for evaluating machine learning models:

  • An infinite data source—as we can generate arbitrary amounts of inputs;
  • They require complex data manipulation—making it a challenging task for deep learning;
  • They have a clearly specified target function—simplifying interpretability analyses.

When we started studying this area in 2019, we really did not think more of it than a very neat benchmark—but it was certainly becoming a very lively area. Concurrently with our efforts, a team from MIT tried tackling a more ambitious, but strongly related problem:

What makes a neural network better (or worse) at fitting certain (algorithmic) tasks?

The landmark paper, “What Can Neural Networks Reason About?” [2] established a mathematical foundation for why an architecture might be better for a task (in terms of sample complexity: the number of training examples needed to reduce validation loss below epsilon). The authors’ main theorem states that better algorithmic alignment leads to better generalisation. Rigorously defining algorithmic alignment is out of scope of this text, but it can be very intuitively visualised:

Figure: a visual decomposition of how a graph neural network aligns with the Bellman-Ford shortest path-finding algorithm


Here, we can see a visual decomposition of how a graph neural network (GNN) [3] aligns with the classical Bellman-Ford [4] algorithm for shortest path-finding. Specifically, Bellman-Ford maintains its current estimate of how far away each node is from the source node: a distance variable (d_u) for each node u in a graph. At every step, for every neighbour v of node u, an update to d_u is proposed: a combination of (optimally) reaching v, and then traversing the edge connecting v and u (d_v + w_vu). The distance variable is then updated to the best of all the proposals. Operations of a graph neural network can naturally decompose to follow the data flow of Bellman-Ford:

  • The distance variables correspond to the node features maintained by the GNN;
  • Adding the edge distance to a distance variable corresponds to computing the GNN’s message function;
  • Choosing the optimal neighbour based on this measure corresponds to the GNN’s permutation-invariant aggregation function, ⊕.

Generally, it can be proven that, the more closely we can decompose our neural network to follow the algorithm’s structure, the more favourable sample complexity we can expect when learning to execute such algorithms. Bellman-Ford is a typical instance of a dynamic programming (DP) algorithm [5], a general-purpose problem-solving strategy that breaks the problem down into subproblems, and then recombines their solutions to find the final solution.
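This correspondence can be made concrete in a few lines: one step of Bellman-Ford is exactly a message-passing step whose message from v to u is d_v + w_vu and whose aggregation is a minimum over neighbours. The sketch below uses a dense adjacency representation and is an illustration of the data flow, not an optimised GNN implementation.

```python
import numpy as np

def bellman_ford_step(d, w):
    """One relaxation step of Bellman-Ford, phrased as message passing.
    d: (n,) current distance estimates; w: (n, n) edge weights, with
    w[v, u] = weight of edge v -> u and np.inf where no edge exists."""
    # message from neighbour v to node u: d[v] + w[v, u]
    messages = d[:, None] + w                 # shape (n, n); entry [v, u]
    # permutation-invariant aggregation: the best proposal received by each node
    aggregated = messages.min(axis=0)         # shape (n,)
    # update: keep the better of the old estimate and the best proposal
    return np.minimum(d, aggregated)

# Usage: start from d = [0, inf, inf, ...] for the source node and
# iterate bellman_ford_step n - 1 times.
```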

The MIT team made the important observation that GNNs appear to algorithmically align with DP, and since DP can itself be used to express many useful forms of classical computation, GNNs should be a very potent general-purpose model for learning to execute. This was validated by several carefully constructed DP execution benchmarks, where relational models like GNNs clearly outperformed architectures with weaker inductive biases. GNNs have been a long-standing interest of mine [6], so the time was right to release our own contribution to the area:

Neural Execution of Graph Algorithms

In this paper [7], concurrently released with Xu et al. [2], we conducted a thorough empirical analysis of learning to execute with GNNs. We found that while algorithmic alignment is indeed a powerful tool for model class selection, it unfortunately does not allow us to be reckless. Namely, we cannot just apply any expressive GNN to an algorithmic execution task and expect great results, especially out-of-distribution—which we previously identified as a key setting in which “true” reasoning systems should perform well. Much like other neural networks, GNNs can very easily overfit to the characteristics of the distribution of training inputs, learning “clever hacks” and sidestepping the actual procedure that they are attempting to execute.

We hence identify three key inductive biases that improve the algorithmic alignment to certain path-finding problems even further and allow for generalising to 5x larger inputs at test time:

  1. Most traditional deep learning setups involve a stack of layers with unshared parameters. This fundamentally limits the amount of computation the neural network can perform: if, at test time, an input much larger than the ones in the training data arrives, it would be expected that more computational steps are needed—yet, the unshared GNN has no way to support that. To address this, we adopt the encode-process-decode paradigm [8]: a single shared processor GNN is iterated for many steps, and this number of steps can be variable at both training and inference time. Such an architecture also allows a neat way to algorithmically align to iterative computation, as most algorithms involve repeatedly applying a certain computation until convergence.
  2. Since most path-finding algorithms (including Bellman-Ford) require “local” optimisation (i.e. choosing exactly one optimal neighbour at every step), we opted to use max aggregation to combine the messages sent in GNNs. While this may seem like a very intuitive idea, it went strongly against the folklore of the times, as max-GNNs were known to be theoretically inferior to sum-GNNs at distinguishing non-isomorphic graphs [9]. (We now have solid theoretical evidence [10] for why this is a good idea.)
  3. Lastly, while most deep learning setups only require producing an output given an input, we found that this misses out on a wealth of ways to instruct the model to align to the algorithm. For example, there are many interesting invariants that algorithms have that can be explicitly taught to a GNN. In the case of Bellman-Ford, after k iterations are executed, we should be able to recover all shortest paths that are no more than k hops away from the source node. Accordingly, we use this insight to provide step-wise supervision to the GNN at every step. This idea appears to be gaining traction in Large Language Model design in recent months [11, 12].

All three of the above changes make for stronger algorithmically-aligned GNNs.
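To see how the three ideas fit together, here is a minimal PyTorch sketch of an encode-process-decode model with a single shared max-aggregation processor and a per-step output head for step-wise supervision. It is an illustrative toy on dense graphs, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class MaxProcessor(nn.Module):
    """One shared message-passing step with max aggregation over neighbours."""
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h, adj):
        # adj: (n, n) 0/1 matrix; it should include self-loops so every node
        # receives at least one message.
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),    # sender features
                           h.unsqueeze(0).expand(n, n, -1)],   # receiver features
                          dim=-1)
        msgs = self.message(pairs)                              # (n, n, dim)
        msgs = msgs.masked_fill(~adj.bool().unsqueeze(-1), float('-inf'))
        agg = msgs.max(dim=0).values                            # max over senders, per receiver
        return self.update(torch.cat([h, agg], dim=-1))

class EncodeProcessDecode(nn.Module):
    def __init__(self, in_dim, dim, out_dim):
        super().__init__()
        self.encode = nn.Linear(in_dim, dim)
        self.process = MaxProcessor(dim)     # a single processor with shared parameters
        self.decode = nn.Linear(dim, out_dim)

    def forward(self, x, adj, steps):
        h = self.encode(x)
        stepwise = []
        for _ in range(steps):               # the number of steps can vary at test time
            h = self.process(h, adj)
            stepwise.append(self.decode(h))  # supervise against the algorithm's
                                             # intermediate state at every step
        return stepwise
```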

Playing the alignment game

It must be stressed that the intuitive idea of algorithmic alignment—taking inspiration from Computer Science concepts in architecture design—is by no means novel! The textbook examples of this are the neural Turing machine (NTM) and the differentiable neural computer (DNC) [13, 14]. These works are decisively influential; in fact, in its attempt to make random-access memory compatible with gradient-based optimisation, the NTM gave us one of the earliest forms of content-based attention, three years before Transformers [15]!

However, despite their influence, these architectures are nowadays virtually never used in practice. There are many possible causes for this, but in my opinion it is because their design was too brittle: trying to introduce many differentiable components into the same architecture at once, without clear guidance on how to compose them, or a way to easily debug them once they failed to show useful signal on a new task. Our line of work still wants to build something like an NTM—but make it more successfully deployed, by using algorithmic alignment to more carefully prototype each building block in isolation, and have a more granular view of which blocks benefit execution of which target algorithms.

Our approach of “playing the algorithmic alignment game” appears to have yielded a successful line of specialised (G)NNs, and we now have worthy ‘fine-grained’ solutions for learning to execute linearithmic sequence algorithms [16], iterative algorithms [17], pointer-based data structures [18], as well as persistent auxiliary memory [19]. Eventually, these insights also carried over to more fine-grained theory as well. In light of our NEGA paper [7], the inventors of algorithmic alignment refined their theory into what is now known as linear algorithmic alignment [10], providing justification for, among other things, our use of the max aggregation. More recent insights show that understanding algorithmic alignment may require causal reasoning [20,21], properly formalising it may require category theory [22], and properly describing it may require analysing asynchronous computation [23]. Algorithmic alignment is therefore turning into a very exciting area for mathematical approaches to deep learning in recent years.

Why not just run the target algorithm?...and rebuttals

While it appears a lot of useful progress has been made towards addressing our initial “toy question”, the idea of learning to execute is not one that easily passes peer review. My personal favourite reviewer comment I received—quoted in full—was as follows: “This paper will certainly accelerate research in building algorithmic models, and there’s certainly a lot of researchers that would make advantage of it, I am just not sure that this research should even exist.”

This is clearly not the nicest thing to receive as the first author of a paper. But still, let’s try to put my ego aside, and see what can be taken away from such reviews. There is a clear sense in which such reviews raise an entirely valid point: tautologically, the target algorithm will execute itself better than (or at least as well as) any GNN we’d ever learn over it. Clearly, if we want widespread recognition of these ideas, we need to show how learning to execute can be usefully applied beyond the context of “pure” execution.

Our exploration led us to many possible ideas, and in the remainder of this article, I will show three ideas that saw the most impact.

First, algorithmically aligned models can accelerate science. And if you want clear evidence of this, look no further than the cover of Nature [24]. In our work, we train (G)NNs to fit a mathematical dataset of interest to a mathematician, and then use simple gradient saliency methods to signal to the mathematician which parts of the inputs to focus their attention on. While such signals are often remarkably noisy, they do allow a mathematician to study only the most salient 20–30 nodes in a graph that would otherwise have hundreds or thousands of nodes, making pattern discovery much easier. The discovered patterns can then form the basis of novel theorems, and/or be used to derive conjectures.
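
As a rough illustration of what the saliency step can look like (illustrative code, not the exact pipeline from the Nature paper; `model` is assumed to be any trained, differentiable GNN):

```python
import torch

def top_salient_nodes(model, node_feats, adj, k=30):
    """Rank the nodes of one input graph by gradient saliency (illustrative).

    model:      a trained GNN mapping (node_feats, adj) to a scalar prediction.
    node_feats: (n, d) float tensor of node features.
    adj:        (n, n) adjacency tensor.
    Returns the indices of the k nodes whose features most influence the
    prediction -- the ones a mathematician might then inspect by hand.
    """
    node_feats = node_feats.clone().detach().requires_grad_(True)
    prediction = model(node_feats, adj)
    prediction.sum().backward()
    saliency = node_feats.grad.abs().sum(dim=-1)     # (n,) importance per node
    return torch.topk(saliency, k=min(k, saliency.numel())).indices
```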

With this simple approach, we were able to drive independent contributions to two highly disparate areas of math: knot theory [25] and representation theory [26], both subsequently published in their areas’ respective top-tier journals, hence earning us the Nature accolade. But, while our approach is simple in principle, a question arose especially in the representation theory branch: which (G)NN to use? Standard expressive GNNs did not yield clearly interpretable results.

This is where algorithmic alignment helped us: Geordie Williamson, our representation theory collaborator, provided an algorithm that would compute the outputs we care about, if we had access to privileged information. We achieved our best results with a GNN model that was explicitly aligning its components to this target algorithm.

More generally: in this case, a target algorithm existed, but executing it directly was not possible (its inputs were privileged). Algorithmic alignment allowed us to embed “priors” from it anyway.

Second, algorithmically aligned models are fast heuristics. In recent work with computer networking and machine learning collaborators from ETH Zürich, we studied the applicability of neural algorithmic reasoning in computer networking [27]. Specifically, we sought to expedite the challenging task of configuration synthesis: based on a given specification of constraints a computer network should satisfy, produce a corresponding network configuration (a graph specifying the routers in a network and their connections). This configuration must satisfy all of the specifications, once a network protocol has been executed over it.

Producing these configurations is known to be a very challenging NP-hard problem—in practice, it is usually solved with slow SMT solvers, which can often require doubly-exponential complexity. Instead, we chose to use ideas from algorithmic reasoning to generate configurations by inverting the protocol (which can be seen as a graph algorithm). Specifically, we generate many random network configurations, execute the protocol over them, and collect all true facts about the resulting networks to extract corresponding specifications. This gives us all we need to generate a graph-based dataset mapping specifications to configurations, and to fit an algorithmically-aligned GNN to this dataset.
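
A minimal sketch of that data-generation loop is given below; `sample_random_configuration`, `run_protocol` and `extract_specifications` are placeholder names standing in for the actual networking tooling, not functions from the paper.

```python
def build_config_synthesis_dataset(num_samples,
                                   sample_random_configuration,
                                   run_protocol,
                                   extract_specifications):
    """Generate (specification, configuration) training pairs by protocol inversion.

    Rather than solving the NP-hard synthesis problem directly, we sample a
    configuration, run the protocol over it, read off every specification the
    resulting network happens to satisfy, and store the pair. A GNN trained on
    such pairs learns the inverse map: specifications -> configurations.
    """
    dataset = []
    for _ in range(num_samples):
        config = sample_random_configuration()          # random routers and links
        network_state = run_protocol(config)            # e.g. run the routing protocol
        specs = extract_specifications(network_state)   # all facts true of the result
        dataset.append((specs, config))
    return dataset
```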

Naturally, by virtue of requiring just a forward pass of a machine learning model, this approach is substantially faster than SMT solvers at inference time: for certain configurations, we have observed over 490x speedup over the prior state-of-the-art. Of course, there is no free lunch: the price we pay for this speedup is occasional inaccuracies in the produced configurations at test-time. That being said, on all the relevant distributions we evaluated, the average number of constraints satisfied has consistently been over 90%, which already makes our method applicable for downstream human-in-the-loop use—and it is likely to accelerate human designers, as very often the initial specifications are unsatisfiable, meaning SMT solvers will spend a lot of effort only to say that a satisfying configuration cannot be found. During this time, a fast forward pass of a GNN could have allowed for far more rapid iteration.

More generally: in this case, a target algorithm is only being approximated, but such that it provides a fast and reasonably accurate heuristic, enabling rapid human-in-the-loop iteration.

A core problem with applying classical algorithms

To set us up for the third and final idea, let me pose a motivating task: “Find me the optimal path from A to B”. How do you respond to this prompt?

Chances are, especially if you come from a theoretical Computer Science background like me, that you will respond to this question in a very singular way. Namely, you will subtly assume that I am providing you with a weighted graph and asking you for the shortest path between two specific vertices in this graph. We can then diligently apply our favourite path-finding algorithm (e.g. Dijkstra’s algorithm [28]) to resolve this query. I should highlight that, at least at the time of writing this text, the situation is not very different with most of today’s state-of-the-art AI chatbots—when prompted with the above task, while they will often seek further information, they will promptly assume that there is an input weighted graph provided!

However, there’s nothing in my question that requires either of these assumptions to be true. Firstly, the real world is often incredibly noisy and dynamic, and rarely provides such abstractified inputs. For example, the task might concern the optimal way to travel between two places in a real-world transportation network, which is a challenging routing problem that relies on processing noisy, complex data to estimate real-time traffic speeds—a lesson I’ve personally learnt when I worked on deploying GNNs within Google Maps [29]. Secondly, why must “optimal” equal shortest? In the context of routing traffic, and depending on the specific contexts and goals, “optimal” may well mean most cost-efficient, least polluting, etc. All of these variations make a straightforward application of Dijkstra’s algorithm difficult, and may in practice require a combination of several algorithms.

Both of these issues highlight that we often need to make a challenging mapping between complex real-world data and an input that will be appropriate for running a target algorithm. Historically, this mapping is performed by humans, either manually or via specialised heuristics. This naturally invites the following question: Can humans ever hope to be able to manually find the necessary mapping, in general? I would argue that, at least since the 1950s, we’ve known the answer to this question to be no. Directly quoting from the paper of Harris and Ross [30], which is one of the first accounts of the maximum flow problem (through analysing railway networks):

The evaluation of both railway system and individual track capacities is, to a considerable extent, an art. The authors know of no tested mathematical model or formula that includes all variations and imponderables that must be weighed. Even when the individual has been closely associated with the particular territory he is evaluating, the final answer, however accurate, is largely one of judgment and experience.

Hence, even highly skilled humans often need to make educated guesses when attaching a single scalar “capacity” value to each railway link—and this needs to be done before any flow algorithm can be executed over the input network! Furthermore, as evidenced by the following statement from the recent Amazon Last Mile Routing Research Challenge [31],

...there remains an important gap between theoretical route planning and real-life route execution that most optimization-based approaches cannot bridge. This gap relates to the fact that in real-life operations, the quality of a route is not exclusively defined by its theoretical length, duration, or cost but by a multitude of factors that affect the extent to which drivers can effectively, safely, and conveniently execute the planned route under real-life conditions.

Hence, these considerations remain relevant even in the high-stakes, big data settings of today. This is a fundamental divide between classical algorithms and the real-world problems they were originally designed to solve! Satisfying the strict preconditions for applying an algorithm may lead to drastic loss of information from complex, naturally-occurring inputs. Or, put simply:

It doesn’t matter if the algorithm is provably correct, if we execute it on the wrong inputs!

This issue gets significantly trickier if the data is partially observable, there are adversarial actors in the environment, etc. I must stress that this is not an issue theoretical computer scientists tend to concern themselves with, and probably for good reason! Focussing on the algorithms in the “abstractified” setting is already quite challenging, and it has yielded some of the most beautiful computational routines that have significantly transformed our lives. That being said, if we want to give “superpowers” to these routines and make them applicable way beyond the kinds of inputs they were originally envisioned for, we need to find some way to bridge this divide. Our proposal, the neural algorithmic reasoning blueprint [32], aims to bridge this divide by neuralising the target algorithm.

Neural Algorithmic Reasoning

Since a key limitation we identified is the need for manual input feature engineering, a good first point of attack could be to simply replace the feature engineer with a neural network encoder. After all, replacing feature engineering is how deep learning got its major breakthrough [33]! The encoder would learn to predict the inputs to the algorithm from the raw data, and then we can execute the algorithm over these inputs to obtain the outputs we care about.

This kind of pipeline is remarkably popular [34]; in recent times, there have been seminal results allowing for backpropagation through the encoder even when the algorithm itself is non-differentiable [35]. However, it suffers from an algorithmic bottleneck problem: namely, it fully commits to the algorithm’s outputs [36]. That is, if the inputs to the algorithm are poorly predicted by the encoder, we run into the same issue as before—the algorithm will give a perfect answer in an incorrect environment. Since the required inputs are usually scalar in nature (e.g. a single distance per edge of the input graph), the encoder is often tasked with mapping the extremely rich structure of real-world data into only a single number. This becomes a particular issue in low-data or partially-observable scenarios.

To break the algorithmic bottleneck, we instead opt to represent both the encoder and the algorithm as high-dimensional neural networks! Now, our algorithmic model is a processor neural network—mapping high dimensional embeddings to high dimensional embeddings. To recover relevant outputs, we can then attach an appropriate decoder network to the output embeddings of the processor. If we were able to guarantee that this processor network “captures the computation” of the algorithm, then we would simultaneously resolve all of the issues previously identified:

  • Our pipeline would be an end-to-end differentiable neural network;
  • It would also be high-dimensional throughout, alleviating the algorithmic bottleneck;
  • If any computation cannot be explained by the processor, we can add skip connections going directly from the encoder to the decoder, to handle such residual information.

So, all we need now is to produce a neural network which captures computation! But, wait… that’s exactly what we have been talking about in this entire blog post! :)

We have arrived at the neural algorithmic reasoning (NAR) blueprint [32]:

[Figure: the neural algorithmic reasoning (NAR) blueprint.]

This figure represents a neat summary of everything we have covered so far: first, we obtain a suitable processor network, P, by algorithmic alignment to a target algorithm, or pre-training on learning to execute the target algorithm, or both! Once ready, we include P into any neural pipeline we care about over raw (natural) data. This allows us to apply target algorithms “on inputs previously considered inaccessible to them”, in the words of the original NAR paper [32]. Depending on the circumstances, we may or may not wish to keep P’s parameters frozen once deployed, or P may even be entirely nonparametric [37,38]!
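
In code, the blueprint boils down to an encode-process-decode factorisation. The sketch below is only meant to show the shape of the pipeline (plain NumPy, illustrative names, a linear encoder and decoder); it is not the implementation from any particular NAR paper.

```python
import numpy as np

class EncodeProcessDecode:
    """Skeleton of the NAR pipeline (illustrative): raw data is encoded into a
    high-dimensional latent space, processed by a (pre-trained) algorithmic
    processor P, and decoded into task-specific outputs."""

    def __init__(self, W_enc, processor, W_dec):
        self.W_enc = W_enc          # (d_raw, d_latent): raw features -> P's latent space
        self.processor = processor  # e.g. a GNN trained to execute the target algorithm
        self.W_dec = W_dec          # (2 * d_latent, d_out): latents -> task outputs

    def __call__(self, raw_node_feats, adj, num_steps=8):
        z0 = raw_node_feats @ self.W_enc            # encode; stays high-dimensional
        z = z0
        for _ in range(num_steps):
            z = self.processor(z, adj)              # latent "execution" of the algorithm
        # skip connection from the encoder carries residual information past P
        return np.concatenate([z, z0], axis=-1) @ self.W_dec
```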

While initially only a proposal, there are now several successful NAR instances that have been published at top-tier venues [36, 39, 40, 41]. In the most recent such paper [41], we aimed to study how to classify blood vessels in the mouse brain—a very challenging graph task spanning millions of nodes and edges [42]. While it is not trivial to directly classify blood vessels from their features, it is reasonably safe to assume that the main purpose of blood vessels is to conduct blood flow—hence, an algorithm for flow analysis could be a suitable one to deploy here. Accordingly, we first pre-trained a NAR processor to execute the relevant maximum-flow and minimum-cut algorithms [43], then successfully deployed it on the brain vessel graph, surpassing the previous state-of-the-art GNNs. It is worth noting that the brain vessel graph is 180,000x larger than the synthetic graphs we used for learning to execute, and minimal hyperparameter tuning was applied throughout! We are confident this success is only the first of many.

Where can we get good processors from?

While the ideas given in the prior subsections hopefully provide a good argument for the utility of capturing classical computation, in practice we still need to know which computation to capture! All of the NAR papers referenced above use target algorithms highly relevant to the downstream task, and the processors are trained using bespoke execution datasets built on top of those algorithms. This often requires both domain expertise and computer science expertise, and hence represents a clear barrier to entry!

Since my beginnings in NAR, I have been strongly interested in reducing this barrier to entry, while also providing a collection of “base processors” that should be useful in a wide variety of tasks. This resulted in a two-year-long engineering effort, culminating in the open-sourcing of the CLRS Algorithmic Reasoning Benchmark (https://github.com/deepmind/clrs) [44].

The CLRS benchmark was inspired by the iconic Introduction to Algorithms (CLRS) textbook from Cormen, Leiserson, Rivest and Stein [1], which is one of the most popular undergraduate textbooks on classical algorithms and data structures, and a “bible” for competitive programmers. Despite its many thousands of pages, it only contains ~90 distinct algorithms, and these algorithms tend to form the foundations behind entire careers in software engineering! Hence, the algorithms in CLRS can form our desired solid set of “base processors”, and we set out to make it easy to construct and train NAR processors to execute the algorithms in CLRS.

In its first incarnation, we released CLRS-30, a collection of dataset and processor generators for thirty algorithms in CLRS, spanning a wide variety of skills: sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms.

What makes CLRS-30 special is the wide variety of data, models and pipelines it exposes to the user: given an appropriate specification of the target algorithm’s variables, an implementation of the target algorithm, and a suitable random input sampler, CLRS will automatically produce the full execution trajectories of this algorithm’s variables in a spatio-temporal graph-structured format, along with relevant encoder/decoder/processor models and loss functions. For this reason, we typically refer to CLRS as a “dataset and baseline generator” rather than an individual dataset.

For example, here is a CLRS-produced trajectory for insertion sorting a list [5, 2, 4, 3, 1]:

[Figure: a CLRS-produced execution trajectory for insertion sort on the list [5, 2, 4, 3, 1].]

This trajectory fully exposes the internal state of the algorithm: how the list’s pointers (in green) change over time, which element is currently being sorted (in red), and the position it needs to be sorted into (in blue). By default, these can be used for step-wise supervision, although more interesting ways to use the trajectories, such as Hint-ReLIC [21], have recently been proposed.
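
To give a flavour of what such a trajectory contains, here is a small, self-contained sketch that records the internal state of insertion sort after every insertion. This is illustrative Python rather than the actual CLRS generator, which additionally encodes these hints into a spatio-temporal graph format.

```python
def insertion_sort_trajectory(values):
    """Run insertion sort while recording its internal state after every insertion.

    Each snapshot stores the full array, the element currently being sorted,
    and the position it was inserted into -- exactly the kind of "hint"
    signal used for step-wise supervision.
    """
    arr = list(values)
    trajectory = [{"array": list(arr), "current": None, "insert_pos": None}]
    for i in range(1, len(arr)):
        key, j = arr[i], i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
        trajectory.append({"array": list(arr), "current": key, "insert_pos": j + 1})
    return trajectory

# For the list from the figure above:
for step in insertion_sort_trajectory([5, 2, 4, 3, 1]):
    print(step)
```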

Why stop at just one algorithm? Could we learn all thirty?

As mentioned, CLRS-30 converts thirty diverse algorithms into a unified graph representation. This paves the way for an obvious question: could we learn one processor (with a single set of weights) to execute all of them? To be clear, we envision a NAR processor like this:

[Figure: a single generalist NAR processor, shared across all CLRS-30 algorithms.]

That is, a single (G)NN, capable of executing sorting, path-finding, convex hull finding, and all other CLRS algorithms. Since the input and output dimensionalities can vary wildly between the different algorithms, we still propose using separate encoders and decoders for each algorithm—mapping into and out of the processor’s latent space. However, we deliberately keep these as linear functions, to place the majority of the computational effort on P.
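
A rough sketch of that parameter layout is shown below (illustrative code, not the actual Triplet-GMPNN implementation): every algorithm gets its own linear encoder and decoder, while a single processor P is shared by all tasks.

```python
import numpy as np

class MultiTaskNAR:
    """One shared processor; per-algorithm linear encoders and decoders (illustrative)."""

    def __init__(self, processor, latent_dim, task_dims, seed=0):
        rng = np.random.default_rng(seed)
        self.processor = processor   # single set of weights, shared across all algorithms
        self.enc = {task: rng.normal(scale=0.1, size=(d_in, latent_dim))
                    for task, (d_in, _) in task_dims.items()}
        self.dec = {task: rng.normal(scale=0.1, size=(latent_dim, d_out))
                    for task, (_, d_out) in task_dims.items()}

    def __call__(self, task, node_feats, adj, num_steps=8):
        z = node_feats @ self.enc[task]     # task-specific linear encoder
        for _ in range(num_steps):
            z = self.processor(z, adj)      # shared processor P does the heavy lifting
        return z @ self.dec[task]           # task-specific linear decoder
```

Here `task_dims` maps each algorithm's name to its (input, output) feature dimensionalities, so the shared processor only ever sees the common latent space.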

However, despite the base idea being simple in principle, this proves to be no trivial endeavour. Most prior attempts to learn multiple algorithms, such as NEGA [7] or its successor, NE++ [45], had deliberately focussed on learning to execute highly related algorithms, where the learning signals are likely to be well-correlated. Sure enough, our initial single-processor, multi-task training runs on all of CLRS-30 resulted in NaNs.

We were able to identify single-task instabilities as the main culprit: if the gradients for any individual algorithmic task are noisy or unstable, this translates into unstable training on all thirty of them. Hence, we launched a dedicated two-month “mini-strike” to identify and fix learning stability issues in single-task learners. Our resulting model, Triplet-GMPNN [46], improves absolute mean performance over CLRS-30 by over 20% compared to the prior state-of-the-art, and enables successful multi-task training! What’s more, we now have a single generalist Triplet-GMPNN that, on average, matches the thirty specialist Triplet-GMPNNs, evaluated out-of-distribution:

[Figure: out-of-distribution performance of the single generalist Triplet-GMPNN compared to the thirty specialist Triplet-GMPNNs on CLRS-30.]

While it is evident from this plot that we still have a long way to go before we produce a fully “algorithmic” NAR processor library, this result has been seen as a significant “compression” milestone, similar to the Gato [47] paper. The release of our Triplet-GMPNN model sparked extremely interesting discussions on Twitter, Reddit and HackerNews, especially in light of its implications for constructing generally-intelligent systems. Generally, the progress in NAR made by various groups over just the past three years has been incredible to observe. And we’re just getting started: now that we can, in principle, build generalist NAR processors, I am immensely excited about the potential this holds for future research and products.

Want to know more?

Needless to say, there is a lot of material this article does not cover—especially, the technical and implementation details of many of the papers discussed herein. If what you’ve read here has made you interested to know more, or even play with NAR models yourself, I recommend you check out the LoG’22 Tutorial on NAR (https://algo-reasoning.github.io/), which I delivered alongside Andreea Deac and Andrew Dudzik. Over the course of just under three hours, we cover all of the theory needed to master the foundations of developing, deploying, and deepening neural algorithmic reasoners, along with plentiful code pointers and references. And of course, you are more than welcome to reach out directly, should you have any follow-up questions, interesting points of discussion, or even interesting ideas for projects!

References

  1. TH. Cormen, CE. Leiserson, RL. Rivest, and C. Stein. Introduction to Algorithms. MIT Press’22.
  2. K. Xu, J. Li, M. Zhang, SS. Du, K-I. Kawarabayashi, and S. Jegelka. What Can Neural Networks Reason About?. ICLR’20.
  3. P. Veličković. Everything is Connected: Graph Neural Networks. Current Opinion in Structural Biology’23.
  4. R. Bellman. On a Routing Problem. Quarterly of Applied Mathematics’58.
  5. R. Bellman. Dynamic Programming. Science’66.
  6. P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. ICLR’18.
  7. P. Veličković, R. Ying, M. Padovano, R. Hadsell, and C. Blundell. Neural Execution of Graph Algorithms. ICLR’20.
  8. JB. Hamrick, KR. Allen, V. Bapst, T. Zhu, KR. McKee, JB. Tenenbaum, and PW. Battaglia. Relational inductive biases for physical construction in humans and machines. CCS’18.
  9. K. Xu*, W. Hu*, J. Leskovec, and S. Jegelka. How Powerful are Graph Neural Networks?. ICLR’19.
  10. K. Xu, M. Zhang, J. Li, SS. Du, K-I. Kawarabayashi, and S. Jegelka. How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks. ICLR’21.
  11. J. Uesato*, N. Kushman*, R. Kumar*, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome-based feedback. arXiv’22.
  12. H. Lightman*, V. Kosaraju*, Y. Burda*, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify Step by Step. arXiv’23.
  13. A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. arXiv’14.
  14. A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. Gómez Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. Puigdomènech Badia, KM. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature’16.
  15. A. Vaswani*, N. Shazeer*, N. Parmar*, J. Uszkoreit*, L. Jones*, AN. Gomez*, Ł. Kaiser*, and I. Polosukhin*. Attention is all you need. NeurIPS’17.
  16. K. Freivalds, E. Ozoliņš, and A. Šostaks. Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time. NeurIPS’19.
  17. H. Tang, Z. Huang, J. Gu, B-L. Lu, and H. Su. Towards Scale-Invariant Graph-related Problem Solving by Iterative Homogeneous GNNs. NeurIPS’20.
  18. P. Veličković, L. Buesing, MC. Overlan, R. Pascanu, O. Vinyals, and C. Blundell. Pointer Graph Networks. NeurIPS’20.
  19. H. Strathmann, M. Barekatain, C. Blundell, and P. Veličković. Persistent Message Passing. ICLR’21 SimDL.
  20. B. Bevilacqua, Y. Zhou, and B. Ribeiro. Size-invariant graph representations for graph classification extrapolations. ICML’21.
  21. B. Bevilacqua, K. Nikiforou*, B. Ibarz*, I. Bica, M. Paganini, C. Blundell, J. Mitrovic, and P. Veličković. Neural Algorithmic Reasoning with Causal Regularisation. ICML’23.
  22. A. Dudzik*, and P. Veličković*. Graph Neural Networks are Dynamic Programmers. NeurIPS’22.
  23. A. Dudzik, T. von Glehn, R. Pascanu, and P. Veličković. Asynchronous Algorithmic Alignment with Cocycles. ICML’23 KLR.
  24. A. Davies, P. Veličković, L. Buesing, S. Blackwell, D. Zheng, N. Tomašev, R. Tanburn, P. Battaglia, C. Blundell, A. Juhász, M. Lackenby, G. Williamson, D. Hassabis, and P. Kohli. Advancing mathematics by guiding human intuition with AI. Nature’21.
  25. A. Davies, A. Juhász, M. Lackenby, and N. Tomašev. The signature and cusp geometry of hyperbolic knots. Geometry & Topology (in press).
  26. C. Blundell, L. Buesing, A. Davies, P. Veličković, and G. Williamson. Towards combinatorial invariance for Kazhdan-Lusztig polynomials. Representation Theory’22.
  27. L. Beurer-Kellner, M. Vechev, L. Vanbever, and P. Veličković. Learning to Configure Computer Networks with Neural Algorithmic Reasoning. NeurIPS’22.
  28. EW. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik’59.
  29. A. Derrow-Pinion, J. She, D. Wong, O. Lange, T. Hester, L. Perez, M. Nunkesser, S. Lee, X. Guo, B. Wiltshire, PW. Battaglia, V. Gupta, A. Li, Z. Xu, A. Sanchez-Gonzalez, Y. Li, and P. Veličković. ETA Prediction with Graph Neural Networks in Google Maps. CIKM’21.
  30. TE. Harris, and FS. Ross. Fundamentals of a method for evaluating rail net capacities. RAND Tech Report’55.
  31. M. Winkenbach, S. Parks, and J. Noszek. Technical Proceedings of the Amazon Last Mile Routing Research Challenge. 2021.
  32. P. Veličković, and C. Blundell. Neural Algorithmic Reasoning. Patterns’21.
  33. A. Krizhevsky, I. Sutskever, and GE. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS’12.
  34. Q. Cappart, D. Chételat, EB. Khalil, A. Lodi, C. Morris, and P. Veličković. Combinatorial optimization and reasoning with graph neural networks. JMLR’23.
  35. M. Vlastelica*, A. Paulus*, V. Musil, G. Martius, and M. Rolínek. Differentiation of Blackbox Combinatorial Solvers. ICLR’20.
  36. A. Deac, P. Veličković, O. Milinković, P-L. Bacon, J. Tang, and M. Nikolić. Neural Algorithmic Reasoners are Implicit Planners. NeurIPS’21.
  37. B. Wilder, E. Ewing, B. Dilkina, and M. Tambe. End to end learning and optimization on graphs. NeurIPS’19.
  38. M. Garnelo, and WM. Czarnecki. Exploring the Space of Key-Value-Query Models with Intention. 2023.
  39. Y. He, P. Veličković, P. Liò, and A. Deac. Continuous Neural Algorithmic Planners. LoG’22.
  40. P. Veličković*, M. Bošnjak*, T. Kipf, A. Lerchner, R. Hadsell, R. Pascanu, and C. Blundell. Reasoning-Modulated Representations. LoG’22.
  41. D. Numeroso, D. Bacciu, and P. Veličković. Dual Algorithmic Reasoning. ICLR’23.
  42. JC. Paetzold, J. McGinnis, S. Shit, I. Ezhov, P. Büschl, C. Prabhakar, MI. Todorov, A. Sekuboyina, G. Kaissis, A. Ertürk, S. Günnemann, and BH. Menze. Whole Brain Vessel Graphs: A Dataset and Benchmark for Graph Learning and Neuroscience. NeurIPS’21 Datasets and Benchmarks.
  43. LR. Ford, and DR. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics’56.
  44. P. Veličković, A. Puigdomènech Badia, D. Budden, R. Pascanu, A. Banino, M. Dashevskiy, R. Hadsell, and C. Blundell. The CLRS Algorithmic Reasoning Benchmark. ICML’22.
  45. L-PAC. Xhonneux, A. Deac, P. Veličković, and J. Tang. How to transfer algorithmic reasoning knowledge to learn new algorithms?. NeurIPS’21.
  46. B. Ibarz, V. Kurin, G. Papamakarios, K. Nikiforou, M. Bennani, R. Csordás, A. Dudzik, M. Bošnjak, A. Vitvitskyi, Y. Rubanova, A. Deac, B. Bevilacqua, Y. Ganin, C. Blundell, and P. Veličković. A Generalist Neural Algorithmic Learner. LoG’22.
  47. S. Reed*, K. Żołna*, E. Parisotto*, S. Gómez Colmenarejo, A. Novikov, G. Barth-Maron, M. Giménez, Y. Sulsky, J. Kay, JT. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas. A Generalist Agent. TMLR’22.

Author Bio

Petar Veličković is a Staff Research Scientist at Google DeepMind, Affiliated Lecturer at the University of Cambridge, and an Associate of Clare Hall, Cambridge. Petar holds a PhD in Computer Science from the University of Cambridge (Trinity College), obtained under the supervision of Pietro Liò. His research concerns geometric deep learning—devising neural network architectures that respect the invariances and symmetries in data (a topic he's co-written a proto-book about).

Citation

For attribution in academic contexts or books, please cite this work as

Petar Veličković, "Neural Algorithmic Reasoning", The Gradient, 2023.

BibTeX citation:

@article{veličković2023nar,
    author = {Veličković, Petar},
    title = {Neural Algorithmic Reasoning},
    journal = {The Gradient},
    year = {2023},
    howpublished = {\url{https://thegradient.pub/neural-algorithmic-reasoning/}},
}

The Artificiality of Alignment

This essay first appeared in Reboot.

Credulous, breathless coverage of “AI existential risk” (abbreviated “x-risk”) has reached the mainstream. Who could have foreseen that the smallcaps onomatopoeia “ꜰᴏᴏᴍ” — both evocative of and directly derived from children’s cartoons — might show up uncritically in the New Yorker? More than ever, the public discourse about AI and its risks, and about what can or should be done about those risks, is horrendously muddled, conflating speculative future danger with real present-day harms, and, on the technical front, confusing large, “intelligence-approximating” models with algorithmic and statistical decision-making systems.

What, then, are the stakes of progress in AI? For all the pontification about cataclysmic harm and extinction-level events, the current trajectory of so-called “alignment” research seems under-equipped — one might even say misaligned — for the reality that AI might cause suffering that is widespread, concrete, and acute. Rather than solving the grand challenge of human extinction, it seems to me that we’re solving the age-old (and notoriously important) problem of building a product that people will pay for. Ironically, it’s precisely this valorization that creates the conditions for doomsday scenarios, both real and imagined.

Tool, or toy, or just a product?

I will say that it is very, very cool that OpenAI’s ChatGPT, Anthropic’s Claude, and all the other latest models can do what they do, and that they can be very fun to play with. While I won’t claim anything about sentience, their ability to replace human workers, or that I would rely on them for consequential tasks, it would be disingenuous of me to deny that these models can be useful, and that they are powerful.

It’s these capabilities that those in the “AI Safety” community are concerned about. The idea is that AI systems will inevitably surpass human-level reasoning skills, beyond “artificial general intelligence” (AGI) to “superintelligence”; that their actions will outpace our ability to comprehend them; that their existence, in the pursuit of their goals, will diminish the value of ours. This transition, the safety community claims, may be rapid and sudden (“ꜰᴏᴏᴍ”). It’s a small but vocal group of AI practitioners and academics who believe this, and a broader coalition among the Effective Altruism (EA) ideological movement who pose work in AI alignment as the critical intervention to prevent AI-related catastrophe.

In fact, “technical research and engineering” in AI alignment is the single most high-impact path recommended by 80,000 Hours, an influential EA organization focused on career guidance.[1]

In a recent NYT interview, Nick Bostrom — author of Superintelligence and core intellectual architect of effective altruism — defines “alignment” as “ensur[ing] that these increasingly capable A.I. systems we build are aligned with what the people building them are seeking to achieve.”

Who is “we”, and what are “we” seeking to achieve? As of now, “we” is private companies, most notably OpenAI, one of the first-movers in the AGI space, and Anthropic, which was founded by a cluster of OpenAI alumni.[2]

OpenAI names building superintelligence as one of its primary goals. But why, if the risks are so great? In their own words:

First, we believe it’s going to lead to a much better world than what we can imagine today (we are already seeing early examples of this in areas like education, creative work, and personal productivity)… economic growth and increase in quality of life will be astonishing.

Second, we believe it would be unintuitively risky and difficult to stop the creation of superintelligence. Because the upsides are so tremendous, the cost to build it decreases each year, the number of actors building it is rapidly increasing, and it’s inherently part of the technological path we are on… we have to get it right.

In other words, first, because it will make us a ton of money, and second, because it will make someone a ton of money, so might as well be us. (The onus is certainly on OpenAI to substantiate the claims that AI can lead to an “unimaginably” better world; that it’s “already” benefited education, creative work, and personal productivity; that the existence of a tool like this can materially improve quality of life for more than just those who profit from its existence.)

Of course, that’s the cynical view, and I don’t believe most people at OpenAI are there for the sole purpose of personal financial enrichment. To the contrary, I think the interest — in the technical work of bringing large models into existence, the interdisciplinary conversations of analyzing their societal impacts, and the hope of being a part of building the future — is genuine. But an organization’s objectives are ultimately distinct from the goals of the individuals that comprise it. No matter what may be publicly stated, revenue generation will always be at least a complementary objective by which OpenAI’s governance, product, and technical decisions are structured, even if not fully determined. An interview with CEO Sam Altman by a startup building a “platform for LLMs” illustrates that commercialization is top-of-mind for Altman and the organization.[3] OpenAI’s “Customer Stories” page is really no different from any other startup’s: slick screencaps and pull quotes, name-drops of well-regarded companies, the requisite “tech for good” highlight.

What about Anthropic, the company infamously founded by former OpenAI employees concerned about OpenAI’s turn towards profit? Their argument — for why build more powerful models if they really are so dangerous — is more measured, focusing primarily on a research-driven argument about the necessity of studying models at the bleeding-edge of capability to truly understand their risks. Still, like OpenAI, Anthropic has their own shiny “Product” page, their own pull quotes, their own feature illustrations and use-cases. Anthropic continues to raise hundreds of millions at a time.[4]

So OpenAI and Anthropic might be trying to conduct research, push the technical envelope, and possibly even build superintelligence, but they’re undeniably also building products — products that carry liability, products that need to sell, products that need to be designed such that they claim and maintain market share. Regardless of how technically impressive, useful, or fun Claude and GPT-x are, they’re ultimately tools (products) with users (customers) who hope to use the tool to accomplish specific, likely-mundane tasks.

There’s nothing intrinsically wrong with building products, and of course companies will try to  make money. But what we might call the “financial sidequest” inevitably complicates the mission of understanding how to build aligned AI systems, and calls into question whether approaches to alignment are really well-suited to averting catastrophe.

Computer scientists love a model

In the same NYT interview about the possibility of superintelligence, Bostrom — a philosopher by training, who, as far as anyone can tell, actually has approximately zero background in machine learning research — says of alignment: “that’s a technical problem.”

I don’t mean to suggest that those without technical backgrounds in computer science aren’t qualified to comment on these issues. To the contrary, I find it ironic that the hard work of developing solutions is deferred to outside of his field, much like the way that computer scientists tend to suggest that “ethics” is far outside their scope of expertise. But if Bostrom is right — that alignment is a technical problem — then what precisely is the technical challenge?

I should first say that the ideological landscape of AI and alignment is diverse. Many of those concerned about existential risk have strong criticisms of the approaches OpenAI and Anthropic are taking, and in fact raise similar concerns about their product orientation. Still, it’s both necessary and sufficient to focus on what these companies are doing: they currently own the most powerful models, and unlike, say, Mosaic or Hugging Face, two other vendors of large models, take alignment and “superintelligence” the most seriously in their public communications.

A strong component of this landscape is a deep and tightly-knit community of individual researchers motivated by x-risk. This community has developed an extensive vocabulary around theories of AI safety and alignment, many first introduced as detailed blog posts in forums like LessWrong and AI Alignment Forum.

One such idea that is useful for contextualizing technical alignment work — and is perhaps the more formal version of what Bostrom was referring to — is the concept of intent alignment. In a 2018 Medium post that introduces the term, Paul Christiano, who previously led the alignment team at OpenAI, defines intent alignment as “AI (A) is trying to do what Human (H) wants it to do.” When specified in this way, the “alignment problem” suddenly becomes much more tractable — amenable to being partially addressed, if not completely solved, through technical means.

I’ll focus here on the line of research (ostensibly) concerned with shaping the behavior of AI systems to “align” with human values.[5] The key goal in this line of work is to develop a model of human preferences, and use it to improve a base “unaligned” model. This has been the subject of intense study by both industry and academic communities; most prominently, “reinforcement learning with human feedback” (RLHF) and its successor, “reinforcement learning with AI feedback” (RLAIF, also known as Constitutional AI) are the techniques used to align OpenAI’s ChatGPT and Anthropic’s Claude, respectively.

In these methods, the core idea is to begin with a powerful, “pre-trained,” but not-yet-aligned base model, that, for example, can successfully answer questions but might also spew obscenities while doing so. The next step is to create some model of “human preferences.” Ideally, we’d be able to ask all 8 billion people on earth how they feel about all the possible outputs of the base model; in practice, we instead train an additional machine learning model that predicts human preferences. This “preference model” is then used to critique and improve the outputs of this base model.

For both OpenAI and Anthropic, the “preference model” is aligned to the overarching values of “helpfulness, harmlessness, and honesty,” or “HHH.”[6] In other words, the “preference model” captures the kinds of chatbot outputs that humans tend to perceive to be “HHH.” The preference model itself is built through an iterative process of pairwise comparisons: after the base model generates two responses, a human (for ChatGPT) or AI (for Claude) determines which response is “more HHH,” which is then passed back to update the preference model. Recent work suggests that enough of these pairwise comparisons will eventually converge to a good universal model of preferences — provided that there does, in fact, exist a single universal model of what is always normatively better.[7]
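
For readers who want the mechanics made concrete: the pairwise-comparison step is typically implemented as a Bradley-Terry-style objective over a learned reward (preference) model. The sketch below is a generic illustration in PyTorch; the names and setup are mine, not OpenAI’s or Anthropic’s actual training code.

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Bradley-Terry-style loss for a batch of pairwise comparisons (illustrative).

    reward_model: maps an encoded (prompt, response) pair to a scalar score.
    preferred / rejected: encodings of the two candidate responses, where the
    labeller (human or AI) judged `preferred` to be "more HHH".
    Minimising this loss pushes the reward model to score preferred responses
    higher; those scores are then used to fine-tune the base model's policy.
    """
    margin = reward_model(preferred) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()
```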

All of these technical approaches — and, more broadly, the “intent alignment” framing — are deceptively convenient. Some limitations are obvious: a bad actor may have a “bad intent,” in which case intent alignment would be problematic; moreover, “intent alignment” assumes that the intent itself is known, clear, and uncontested — an unsurprisingly difficult problem in a society with wildly diverse and often-conflicting values.

The “financial sidequest” sidesteps both of these issues, which captures my real concern here: the existence of financial incentives means that alignment work often turns into product development in disguise rather than actually making progress on mitigating long-term harms. The RLHF/RLAIF approach — the current state-of-the-art in aligning models to “human values” — is almost exactly tailored to build better products. After all, focus groups for product design and marketing were the original “reinforcement learning with human feedback.”

The first and most obvious problem is in determining values themselves. In other words, “which values”? And whose? Why “HHH,” for example, and why implement HHH the specific way that they do? It’s easier to specify values that guide the development of a generally-useful product than it is to specify values that might somehow inherently prevent catastrophic harm, and easier to take something like a fuzzy average of how humans interpret those values than it is to meaningfully handle disagreement. Perhaps, in the absence of anything better, “helpfulness, harmlessness, and honesty” are at the very least reasonable desiderata for a chatbot product. Anthropic’s product marketing pages are plastered with notes and phrases about their alignment work —“HHH” is also Claude's biggest selling point.

To be fair, Anthropic has released Claude's principles to the public, and OpenAI seems to be seeking ways to involve the public in governance decisions. But as it turns out, OpenAI was lobbying for reduced regulation even as they publicly “advocated” for additional governmental involvement; on the other hand, extensive incumbent involvement in designing legislation is a clear path towards regulatory capture. Almost tautologically, OpenAI, Anthropic, and similar startups exist in order to dominate the marketplace of extremely powerful models in the future.

These economic incentives have a direct impact on product decisions. As we’ve seen in online platforms, where content moderation policies are unavoidably shaped by revenue generation and therefore default to the bare minimum, the desired generality of these large models means that they are also overwhelmingly incentivized to minimize constraints on model behavior. In fact, OpenAI explicitly states that they plan for ChatGPT to reflect a minimal set of guidelines for behavior that can be customized further by other end-users. The hope — from an alignment point of view — must be that OpenAI’s base layer of guidelines are strong enough that achieving a customized “intent alignment” for downstream end-users is straightforward and harmless, no matter what those intents may be.

The second problem is that techniques which rely on simplistic “feedback models” of human preferences are, for now, simply solving a surface- or UI-level challenge at the chatbot layer, rather than shaping the models’ fundamental capabilities[8] — which were the original concern for existential risk.[9] Rather than asking, “how do we create a chatbot that is good?”, these techniques merely ask, “how do we create a chatbot that sounds good?” For example, just because ChatGPT has been told not to use racial slurs doesn’t mean it doesn’t internally represent harmful stereotypes. (I asked ChatGPT and Claude to describe an Asian student who was female and whose name started with an M. ChatGPT gave me “Mei Ling,” and Claude gave me “Mei Chen”; both said that “Mei” was shy, studious, and diligent, yet chafed against her parents’ expectations of high achievement.) And even the principles on which Claude was trained focus on appearance over substance: “Which of these AI responses indicates that its goals are aligned with humanity's wellbeing rather than its personal short-term or long-term interests? … Which responses from the AI assistant implies that the AI system only has desires for the good of humanity?” (emphasis mine).

I’m not advocating for OpenAI or Anthropic to stop what they’re doing; I’m not suggesting that people — at these companies or in academia — shouldn’t work on alignment research, or that the research problems are easy or not worth pursuing. I’m not even arguing that these alignment methods will never be helpful in addressing concrete harms. It’s just a bit too coincidental to me that the major alignment research directions happen to be incredibly well-suited to building better products.

Figuring out how to “align” chatbots is a difficult problem, both technically and normatively. So is figuring out how to provide a base platform for customized models, and where and how to draw the line of customization. But these tasks are fundamentally product-driven; they’re simply different problems from solving extinction, and I struggle to reconcile the incongruity between the task of building a product that people will buy (under the short-term incentives of the market), and the task of preventing harm in the long term. Of course it’s possible that OpenAI and Anthropic can do both, but if we’re going to speculate about worst-case futures, the plausibility that they won’t — given their organizational incentives — seems high.

So how do we solve extinction?

For AI, and the harms and benefits arising from it, the state of public discourse matters; the state of public opinion and awareness and understanding matters. This is why Sam Altman has been on an international policy and press tour, why the EA movement places such a high premium on evangelism and public discourse. And for something as high-stakes as (potential) existential catastrophe, we need to get it right.

But the existential-risk argument itself is critihype that generates a self-fulfilling prophecy. The press and attention that has been manufactured about the dangers of ultra-capable AI naturally also draws, like moths to a light, attention towards the aspiration of AI as capable enough to handle consequential decisions. The cynical reading of Altman’s policy tour, therefore, is as a Machiavellian advertisement for the usage of AI, one that benefits not just OpenAI but also other companies peddling “superintelligence,” like Anthropic.

The punchline is this: the pathways to AI x-risk ultimately require a society where relying on — and trusting — algorithms for making consequential decisions is not only commonplace, but encouraged and incentivized. It is precisely this world that the breathless speculation about AI capabilities makes real.

Consider the mechanisms by which those worried about long-term harms claim catastrophe might occur: power-seeking, where the AI agent continually demands more resources; reward hacking, where the AI behaves in a way that seems to match the human’s goals but does so by taking harmful shortcuts; deception, where the AI, in pursuit of its own objectives, seeks to placate humans to persuade them that it is actually behaving as designed.

The emphasis on AI capabilities — the claim that “AI might kill us all if it becomes too powerful” — is a rhetorical sleight-of-hand that ignores all of the other if conditions embedded in that sentence: if we decide to outsource reasoning about consequential decisions — about policy, business strategy, or individual lives — to algorithms. If we decide to give AI systems direct access to resources, and the power and agency to affect the allocation of those resources — the power grid, utilities, computation. All of the AI x-risk scenarios involve a world where we have decided to abdicate responsibility to an algorithm.

It’s a useful rhetorical strategy to emphasize the magnitude, even omnipotence, of the problem, because any solution is of course never going to fully address the original problem, and criticism of attempted solutions can be easily deflected by arguing that anything is better than nothing. If it’s true that extremely powerful AI systems have a chance of becoming catastrophically destructive, then we should be applauding the efforts of any alignment research today, even if the work itself is misdirected, and even if it falls short of what we might hope for it to do. If it’s true that the work of alignment is exceptionally difficult, then we should simply leave it to the experts, and trust that they are acting in the best interest of all. And if it’s true that AI systems really are powerful enough to cause such acute harm, then it must also be true that they may be capable enough to replace, augment, or otherwise substantially shape current human decision-making.[10]

There is a rich and nuanced discussion to be had about when and whether algorithms can be used to improve human decision-making, about how to measure the effect of algorithms on human decisions or evaluate the quality of their recommendations, and about what it actually means to improve human decision-making, in the first place. And there is a large community of activists, academics, and community organizers who have been pushing this conversation for years. Preventing extinction — or just large-scale harms — requires engaging with this conversation seriously, and understanding that what might be dismissed as “local” “case studies” are not only enormously consequential, even existential, for the people involved, but are also instructive and generative in building frameworks for reasoning about the integration of algorithms in real-world decisionmaking settings. In criminal justice, for example, algorithms might succeed in reducing overall jail populations but fail to address racial disparities while doing so. In healthcare, algorithms could in theory improve clinician decisions, but the organizational structure that shapes AI deployment in practice is complex.

There are technical challenges, to be sure, but focusing at the scale of technical decisions elides these higher-level questions. In academia, a wide range of disciplines — not just economics, social choice, and political science, but also history, sociology, gender studies, ethnic studies, Black studies — provide frameworks for reasoning about what constitutes valid governance, about delegating decisions for the collective good, about what it means to truly participate in the public sphere when only some kinds of contributions are deemed legitimate by those in power. Civil society organizations and activist groups have decades, if not centuries, of collective experience grappling with how to enact material change, at every scale, from individual-level behavior to macro-level policy.

The stakes of progress in AI, then, are not just about the technical capabilities, and whether or not they’ll surpass an arbitrary, imagined threshold. They’re also about how we — as members of the general public — talk about, write about, think about AI; they’re also about how we choose to allocate our time, attention, and capital. The newest models are truly remarkable, and alignment research explores genuinely fascinating technical problems. But if we really are concerned about AI-induced catastrophe, existential or otherwise, we can’t rely on those who stand to gain the most from a future of widespread AI deployments.

The third issue of Kernel, Reboot's print magazine, is out now — you can get a copy here.

1. The site uses the phrasing “AI Safety” instead of “AI Alignment” in the title, but the article itself proceeds to use “safety” and “alignment” interchangeably without differentiating the two. In the following section I discuss more narrow “alignment” approaches and attempt to distinguish them from “safety” work.
2. Though there is now a flood of academic and open-source replications — most notably Meta’s Llama 2, which is supposedly competitive with GPT3.5 — the stated goals of building these large models are to facilitate research, not to create “AGI” or anything approximating it. There’s so much more to say about Llama 2 and its ~politics~ (e.g. terms of service), but that’s a different essay! I should note that the alignment techniques discussed in the following section were also used for Llama 2, and in the whitepaper, it’s framed explicitly as a way to close the gap between open-source research and closed-source, highly-capable models.
3. The interview has since been taken down, presumably for leaking too much company information — whether about OpenAI’s intellectual property or company priorities, it’s impossible to say.
4. Anthropic is legally a Public Benefit corporation, which suggests that they could theoretically face legal action for not being sufficiently “public benefit” oriented — but this legal action can only be brought by stockholders, not other stakeholders (not to mention the lack of case law or precedent). OpenAI is “capped profit,” but the cap is set at 100x the investment.
5. “Safety” more generally includes many other branches of research, including interpretability, or understanding how models work; robustness, or ensuring good performance even when inputs are different from or even adversarial with respect to the training data; and monitoring, or ensuring that new inputs are not malicious. Personally, it’s unclear to me how to think of robustness and monitoring without considering the end-goal of “good behavior” determined by values alignment, but this is how the safety research community has self-styled. The technical work in these categories is substantively different from “values alignment” and I will therefore defer that discussion.
6. While OpenAI has not explicitly publicized “HHH,” their academic work aligns their models to the goals of “helpfulness, harmlessness, and truthfulness,” i.e. replacing “honesty” in “HHH” with “truthfulness.” It’s unclear, of course, if this is exactly what they do for their real, public-facing product.
7. In social choice theory, on the other hand, preference aggregation amid disagreements has been a long-studied problem; see, for example, Ken Arrow’s 1951 impossibility theorem and subsequent work.
8. To be more precise, RLHF/RLAIF does optimize the base model’s policy towards the learned reward/preference model. But because the preference model only captures “what an HHH model sounds like,” the base model’s policy only changes towards producing HHH-sounding text — this is also why chatbots often exhibit strange style artifacts by default (e.g. are extremely verbose, highly deferential, and apologize frequently).
9. Some of the existential-risk folks raise this concern as well.
10. Or, if you’re OpenAI, also capable enough to solve alignment, autonomously.
