No, agentic AI will not massively boost productivity in scientific research; here’s how to actually boost scientific productivity

There seems to be a lot of recent interest in and excitement about the promise of “agentic AI” (tools like Claude Code or Cursor) for improving productivity in scientific research. The idea, or, perhaps more accurately, the hope, seems to be that by automating key steps in scientific workflows, agentic AI tools can massively improve the productivity of practising scientists. It’s not uncommon these days to hear claims such as: “What used to take me weeks now only takes a few hours.” The general message (the discourse, the narrative, the zeitgeist, the vibe of the times) seems to be that if you’re a practising scientist (whatever your field may be), you should be using these tools, because if you’re not, you’re going to be massively left behind. This constant drumbeat of “use these tools or become obsolete” messaging seems to create a lot of FOMO-induced anxiety among scientists and researchers who do not use these tools or do not find them as useful as their more tech-savvy or technophile colleagues (“Am I doing something wrong?”).

In the last week alone, I saw two pieces projecting this kind of message: this post by the machine learning researcher Tim Dettmers and this hour-long YouTube video by the astronomer David Kipping. 

In this post, I’d like to argue that expectations of massive productivity boosts in scientific research through the use of agentic AI tools are completely unrealistic for one very simple and obvious reason: these tools do not address the main bottleneck in scientific research, which is experimentation (broadly construed to include things like running simulations or experiments in silico, etc.) or, more precisely, the limited resources available for experimentation.

If you’re an academic machine learning researcher, you typically have access to a limited amount of compute on your university’s HPC cluster (or on a national supercomputing resource or on some cloud computing service). This puts hard and severe limits on the number of simultaneous projects you can run, the number of experiments per project you can run, and the scale at which you can run these experiments. Every machine learning researcher knows extremely well that this is the main and the single most significant bottleneck throttling their work. Yet, agentic AI tools (or any other AI tools, for that matter) do absolutely nothing to address this bottleneck: you previously had ~50 H100 days of compute per month on your local HPC cluster (realistic estimate from my last academic employer) and you still have ~50 H100 days of compute per month with your Claude subscription. 

Hard sciences where experiments interface with the real world are even more strongly bottlenecked by the resources available for experimentation, because experiments are more time-consuming and require at least a human in the loop in this case. If you’re running a wet biology lab, for instance, you can only run so many animal experiments, analyze, image, or test only so many biological samples, etc. in a given amount of time. Again, every biologist knows that these resource constraints are the main bottleneck limiting their work and AI again does absolutely nothing to change these constraints. Research that relies on large-scale instruments like the Large Hadron Collider (LHC), the James Webb Space Telescope (JWST), or supercomputers is maximally resource constrained, so I’m afraid your overpriced Claude subscription will help even less here. Incidentally, this is why I find it so strange to see David Kipping, an astronomer whose work relies on large-scale and very expensive telescopes like the JWST, join this “use these tools or become obsolete” bandwagon.   

Or consider the following example from Tim Dettmers’s blog post (linked above). Dettmers claims that he uses agentic AI tools to help him write grant proposals. This would make sense if there were a million funding opportunities and you wanted to automate the submission of a broad range of high-quality ideas to maximize your chances of success (this kind of automated generation and submission of ideas would raise questions about your actual ownership of these ideas, but let’s leave these questions aside for now, since this scenario is already too unrealistic). In this hypothetical world of plenty, scientists would be genuinely limited by the time it takes to write a grant proposal, so it would make sense to automate their writing. But here’s the thing: there aren’t a million funding opportunities for supporting scientific research projects! There aren’t even a thousand or even a hundred of them. At best, there are only maybe 10 such opportunities for most scientists in any given year (if that), so writing grant proposals is not even remotely the main productivity bottleneck for scientists even if they work meticulously on each and every one of them. 

Perhaps some will argue that although AI doesn’t do anything to change the resource constraints scientists are subject to, it may enable them to use those resources much more effectively, hence still massively boosting productivity. So, for example, you may still have ~50 H100 days of compute per month on your local HPC cluster, but perhaps now you can run 10x more experiments or maybe 10x better experiments in some sense with those ~50 H100 days of compute when you give the reins to Claude Code. Or take Dettmers’s grant proposal writing example. Maybe the idea is that Claude Code will help you write 10x better grant proposals that will massively boost your chances of receiving funding, even though it obviously won’t do anything to increase the total amount of available funding resources (of course, it would be mathematically impossible for Claude Code to massively improve everybody’s chances simultaneously, given limited resources, but again let’s ignore this technicality for now). The idea that Claude Code will write 10x faster code or that it will generate 10x better ideas to test, explore, or experiment with (whatever “10x better” may mean in this context) doesn’t even pass the smell test, but let’s scrutinize it a bit further for a reality check.

Let’s be as charitable to AI agents as possible and pick a domain where they are expected to excel; their home turf, so to speak: writing code. And just to give a concrete example, let’s take a look at this paper that came out very recently, which introduces a new state-of-the-art method for doing “discovery” with LLMs, i.e. finding novel, highly performant solutions to specific quantifiable problems. One of the problems they consider here is writing performant GPU kernels for specific matrix operations (e.g. triangular matrix multiplication). Note that this is a case where one can explicitly write down an unambiguous, well-defined, and easily verifiable reward function, namely the inverse runtime of the produced kernel, which can then be directly optimized in silico through reinforcement learning. For the H100 GPU, the best kernel this method came up with for triangular matrix multiplication is only 18% better than the best human-written code (for the B200, the number is more like 13%; see Table 4). For another kernel writing problem (MLA Decode on the MI300X), the method actually fails to discover a more performant kernel than the best human-written kernels (see Table 5). So, a grand improvement of less than 20% in the absolute best case, the most optimistic scenario for AI, where we can explicitly write down an unambiguous, well-defined, and easily verifiable reward function and then directly optimize it in silico. Needless to say, this is not going to be even remotely possible for most scientific applications.

There’s actually a very simple and straightforward way we could significantly boost scientific productivity and every economist knows how to do this: if you want to improve productivity significantly, you have to make significant capital investments, including in human capital (no pain, no gain). There are no shortcuts to this, no silver bullets, no magic tricks, no free lunches. If you want to boost the productivity of your machine learning researchers 10x, for example, you have to buy them 10x more and/or 10x better GPUs. If you want to boost the productivity of your molecular biologists 10x, you have to buy them 10x more or 10x better microscopes (plus all the other instruments and devices that make cutting-edge research in biology possible), you have to give them 10x more lab space to run their experiments in, etc. If you want to boost the productivity of your astronomers 10x, you have to buy them 10x more or 10x more powerful telescopes. In general, if you want to boost the productivity of your scientists 10x, you have to be willing to increase their funding 10x (or something like that). Of course, this is much harder and much more expensive than simply buying them $20/month Claude subscriptions, and I suspect that one of the reasons why this idea of agentic AI massively boosting scientific productivity seems so alluring to people is its “get rich quick” nature. I’m sorry to have to remind you that “get rich quick” schemes are unfortunately almost always scams.

Current AI tools are extremely useful for a limited set of problems/tasks, chief amongst which is writing code. For this subset of tasks, these tools likely do boost productivity significantly. However, writing code is not the main productivity bottleneck in scientific research (for most scientists in most fields) and has never been so (if it were so, there would be a lot more professional software engineers hired in scientific research organizations than there currently are). By far the main bottleneck in science is rather the (often severely) limited resources available for experimentation. Current AI tools do absolutely nothing to address this bottleneck, which is why they will have a very limited impact on scientific productivity in the short term. In the long run, AI is, of course, a part of the process that improves our instruments of experimentation, making our GPUs, telescopes, microscopes, etc. more efficient and more powerful, but this is a much slower process.

Continual training of Llama-3.1-8B for 809B tokens

Over the last couple of months, I’ve been continually training the pretrained Llama-3.1-8B model (with a context length of 8192 tokens) for 809B tokens. This was my first truly large-scale distributed training experience and in this post, I’d like to share some of what I’ve learned so far.

Why

First of all, why am I doing this? To be honest, the primary motivation for me was just to get some hands-on experience in large-scale distributed training. That being said, I’m also genuinely very interested in continual training. It always seemed incredibly wasteful to me to have to do these large-scale training runs from scratch every time, instead of starting from an already well-trained model. If we could find really effective ways to do this, i.e. minimizing the loss of previously acquired knowledge while at the same time not hobbling the model’s capacity to acquire new knowledge, that would be extremely impactful in my view.

For Llama-3.1-8B specifically, I remember I was explicitly motivated by the following post from the Llama-3 pretraining lead to try continual training on this model:

The fact that the model was still improving surprisingly quickly after 15T tokens suggested to me that, if I could optimize my continual training setup, I could possibly get substantial improvements over the base model even with a few trillion tokens of continual training, which is basically my compute budget. So, improving upon the released base model in this way was another major motivation for me.

Training infrastructure

The model is trained on the Frontier supercomputer hosted at OLCF, which consists of over 10k compute nodes with 4 AMD MI250X accelerators on each node. I was initially quite ambitious about scale. I wanted to continually train the pretrained model for another 15T tokens or so. Given the relatively old hardware and the less than ideal interconnect (to put it mildly) on Frontier, basically the only way to train an 8B parameter model for this many tokens within a reasonable time frame (i.e. in a few months) is to use a very large batch size (on the order of at least 100M tokens globally) and do correspondingly fewer training iterations. I was quite excited about this setup, because it would allow me to try things that, to my knowledge, haven’t really been tried before in large-scale LLM training, namely training with very large batch sizes (~100M tokens globally) and multi-epoch training. However, a batch size of 100M tokens requires ~500-600 nodes on Frontier and it quickly became clear to me that with the current limits on my account, it wouldn’t be feasible for me to go through 15T tokens at this scale within a reasonable amount of time, so I had to scale back my ambitions.

In the end, I decided to settle for a 64-node training run, with a batch size of 11M tokens. This would allow me to run my jobs in the extended partition of Frontier that has a 24-hour maximum run time limit for jobs (otherwise the limit is a mere 12 hours). I planned for a maximum of 500k training steps, which corresponds to ~5.5T tokens, which in turn is roughly equal to 1 epoch over my training data. I estimate that it will take me about 10 months to complete training in this setup.

64 nodes on Frontier correspond to 512 “devices” (or graphics compute dies) with 64 GB HBM2e GPU memory per device, for a total of 33 TB of GPU memory. Further technical details about the system can be found here.

Training details

For efficient distributed training, I used a combination of hybrid sharding data parallelism (HSDP), tensor parallelism (TP), and pure data parallelism (DP). These were factorized across 512 devices as follows: HSDP=32, TP=8, and DP=2 (this means that there are 32 HSDP ranks, 8 TP ranks, 2 DP ranks, and each device’s rank can be uniquely identified by a combination of these ranks). I tried a bunch of other configurations (e.g. HSDP+DP without TP), but I found that this particular configuration maximized the training throughput. In addition, I also used just-in-time (JIT) compilation of the individual model layers through torch.compile, full activation checkpointing, and bf16 mixed precision training.
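Just to make the factorization concrete, here’s a toy sketch of how a flat global rank maps to (HSDP, TP, DP) coordinates. The dimension ordering here (HSDP outermost, DP innermost) is my assumption for illustration and not necessarily the ordering torchtitan uses internally:

```python
# Toy sketch of how the three parallelism dimensions tile the 512 devices.
# The dimension ordering (HSDP outermost, DP innermost) is an assumption
# for illustration; torchtitan's actual mesh ordering may differ.

HSDP, TP, DP = 32, 8, 2  # 32 * 8 * 2 = 512 devices

def decompose(global_rank: int) -> tuple[int, int, int]:
    """Map a flat rank in [0, 512) to its (hsdp_rank, tp_rank, dp_rank)."""
    dp_rank = global_rank % DP
    tp_rank = (global_rank // DP) % TP
    hsdp_rank = global_rank // (DP * TP)
    return hsdp_rank, tp_rank, dp_rank

def compose(hsdp_rank: int, tp_rank: int, dp_rank: int) -> int:
    """Inverse mapping: coordinates back to the flat global rank."""
    return (hsdp_rank * TP + tp_rank) * DP + dp_rank
```

Each of the 512 flat ranks gets a unique coordinate triple, which is what “each device’s rank can be uniquely identified by a combination of these ranks” amounts to.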

The learning rate schedule is a simple linear schedule with a warm-up of 500 steps, a peak learning rate of 3e-5, and a total of 500k scheduled training steps, as mentioned earlier (i.e. the learning rate decays linearly from its peak of 3e-5 at step 500 to 0 at step 500k). I initially tried a 10x larger peak learning rate (3e-4), but this resulted in a pretty big jump in the loss early on during training (and evaluations indicated a quick and substantial degradation in model quality, which I didn’t like!), so I decided to play it safe and went for a lower peak learning rate. I didn’t really have the chance to tune the learning rate extensively, but in hindsight I should probably have gone for a slightly larger peak learning rate, maybe something like 5e-5 or 6e-5.
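In code, the schedule amounts to the following (a sketch with the hyperparameters above, not the actual training script):

```python
# Linear warmup + linear decay schedule, as described in the post.
PEAK_LR = 3e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 500_000

def lr_at(step: int) -> float:
    """Linear warmup to the peak over 500 steps, then linear decay to 0 at 500k."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
```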

The global batch size is 11M tokens. This number comes about as follows: 64 data-parallel ranks (HSDP × DP) × a local batch size of 21 per data-parallel rank × a context length of 8192 tokens.
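As a quick sanity check, the arithmetic works out like this:

```python
# Sanity-checking the batch size and token-count arithmetic from the post.
dp_ranks = 32 * 2      # HSDP x DP data-parallel ranks
local_batch = 21       # sequences per data-parallel rank
context_len = 8192     # tokens per sequence

global_batch_tokens = dp_ranks * local_batch * context_len  # ~11M tokens per step
total_tokens = 500_000 * global_batch_tokens                # ~5.5T for the full run
tokens_so_far = 73_500 * global_batch_tokens                # ~809B after 73500 steps
```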

To train the model, I’ve used a customized version of the excellent torchtitan library, which is a lightweight, PyTorch-native distributed training framework. The training code is available from this repository with detailed instructions for full reproduction. Although I’ve been using AMD hardware to train the model, I expect that the same code should work seamlessly on NVIDIA hardware without any modifications (I haven’t verified this at scale though).

Training data

The choice of training data is probably the single most important decision in large-scale training runs. Again, I didn’t really have the time and the compute budget to explore the training data choice as extensively as I would have liked to, but I did peruse the literature on this topic pretty thoroughly and in the end came up with a training dataset consisting of the following components:

  • Zyda-2, which is itself a cross-deduplicated and filtered combination of DCLM (3.3T), FineWeb-Edu (1.3T), Dolma (0.2T), Zyda (0.2T) datasets.
  • Stack-2, specifically the the-stack-v2-train-smol-ids subset (525B).
  • FineMath, specifically the finemath-3plus subset (34B).

The numbers in parentheses represent the approximate token counts of the datasets (the full dataset has ~5.6T tokens). The mixture weights for these components are as follows (in terms of data rows, not tokens): DCLM (40%), FineWeb-Edu (44%), Dolma (3%), Zyda (2%), Stack-2 (10%), FineMath (1%). Again, these weights were chosen mostly based on the prior literature rather than my own experiments.
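To make the row-level mixing concrete, here is a toy sketch of per-row source sampling under these weights. The real data pipeline may interleave shards deterministically rather than sampling per row; this is purely illustrative:

```python
import random

# Illustrative per-row source sampling under the mixture weights from the post
# (weights are over data rows, not tokens). Dataset names are shorthand labels.
MIXTURE = {
    "dclm": 0.40, "fineweb-edu": 0.44, "dolma": 0.03,
    "zyda": 0.02, "stack-2": 0.10, "finemath": 0.01,
}

def sample_sources(n_rows: int, seed: int = 0) -> dict[str, int]:
    """Count how many of n_rows are drawn from each source dataset."""
    rng = random.Random(seed)
    names, weights = zip(*MIXTURE.items())
    counts = dict.fromkeys(names, 0)
    for name in rng.choices(names, weights=weights, k=n_rows):
        counts[name] += 1
    return counts
```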

I don’t claim that this particular mixture of datasets is optimal, but as of this writing, it’s probably one of the strongest and highest-quality composite datasets one can put together from open, public sources of general web text, code, and math data.

Results

I’ve now trained the model for 73500 steps (which is roughly 809B tokens or ~15% of the total number of steps planned for training) and the training loss curve so far looks like this:

The black trace shows the loss tracked in 100-step bins, the red trace shows the loss tracked in approximately 15k-step bins.

That’s a pretty nice decrease in loss! We love to see it. The checkpoints are available from this Hugging Face repository (and any future checkpoints will also be deposited in the same repository). It took me about 1.5 months to go through 73500 steps and I estimate that it would take another 9 months or so to complete the full training run. The training run is paused at the moment, because I have other (higher priority) experiments to run on Frontier. I don’t know yet if I’ll complete the full training run (probably not) or how much longer I’ll keep training the model, but I’ll post any updates here and on the accompanying GitHub repository.

In terms of downstream evaluations, unfortunately, I haven’t had a chance to run the full set of evaluations yet (because some of the tasks take quite a while to run), but from the handful of evaluations I’ve run so far, it basically looks like a wash at the moment, with a slight improvement in some tasks and a slight degradation in others:

|  | MMLU | ARC-Challenge | Winogrande | HellaSwag |
| --- | --- | --- | --- | --- |
| step-0 | 63.4 | 51.4 | 74.1 | 60.0 |
| step-73500 | 60.9 | 52.1 | 73.6 | 60.2 |
Downstream performance on four evaluation tasks. Step-0 corresponds to the pretrained base model without any continual training. The evaluation code and task configs are available from here. Any discrepancies from evaluation results reported elsewhere are likely due to differences in evaluation setups.

A note on large-scale training on AMD hardware

The problems with AMD’s deep learning software stack are well-known and I don’t really want to rehash them here. Instead, I want to discuss a couple of issues that I personally encountered in my own experience. I also highly recommend that people read this report, which provides a fairly comprehensive overview of the problems with AMD’s approach to software, with lots of concrete recommendations for improvement. For me, the thing that stood out in this piece was this single sentence:

Tensorwave, the largest AMD GPU Cloud has given GPU time for free to a team at AMD to fix software issues, which is insane given they paid for the GPUs.

AMD (and HPE) do this with OLCF as well: they are given valuable GPU time on OLCF systems to hunt for RCCL bugs that they should have fixed themselves and that, honestly, should never have been there in the first place. This is a completely unacceptable situation, especially on a publicly funded system. AMD should have its own cluster with at least hundreds (if not thousands) of its best GPUs, and it should be using that cluster to improve and optimize its software stack and to showcase its capabilities by constantly releasing models trained on it (just like NVIDIA does). AMD has the money to do this. This is the only way to gain the trust of the community and to prove that it’s serious about software.

Currently, in my experience, it isn’t possible to do training runs on more than ~800 nodes on Frontier due to apparently intractable RCCL bugs (or some interaction between RCCL and HPE’s Slingshot interconnect). In practice, RCCL + Slingshot is so slow and inefficient that it ceases to make any sense to do training runs beyond ~500 nodes on Frontier, even if you could do so reliably. Moreover, in my own benchmarks, the MI250X is still ~1.5-2.2x slower than its NVIDIA counterpart, the A100, even at much smaller scales. Although it may be possible to get a training throughput close to the A100’s on the MI250X at such scales by taking advantage of its larger GPU memory (128 GB vs. 80 GB) and using a larger batch size, this isn’t ideal at all.

AMD has the goodwill of the entire community behind it. Everybody desperately wants AMD to succeed. But, to do that, they have to take software much more seriously than they have been doing so far.

Language development in children is more like post-training than pre-training in LLMs

One of the pitfalls in scientific writing in general is the danger of prematurely describing the basic observations of a field in theory-laden terms. This can be particularly problematic in fields that are in their infancy, such as developmental psychology, where basically all the fundamental questions are wide open, and where we basically don’t know anything about anything. Avoiding this major pitfall is the primary virtue of Eve Clark’s excellent book on language development in children, Language in Children. The book describes the general character (the Gestalt) of the types of experience children learn their first language from admirably plainly, with lots of actual examples of child-parent (and child-child) interactions from diary studies. It does so without groping after premature “explanations” of how children actually learn their first language from such experiences, which are obviously going to be wrong (a favorite theme of mine is that “explanations” are overrated, descriptions are underrated in sciences that deal with complex systems).

Reading the book made me realize the ubiquity (and presumably the importance) of feedback in children’s early language exposure. This feedback doesn’t have to be (and often isn’t) very explicit (as in: “no, that’s not correct; here’s how you actually say it …”); much more often it is implicit, taking the form of recasts, reformulations, and other implicit corrections of the child’s productions that happen all the time as part of the natural conversational to-and-fro between the child and the caregiver, as in the following example (p. 15):

Or as in this exchange between a 2.5-year old and his/her father (p. 138):

Clark claims that this kind of feedback is quite common in children’s early language experience, “with adults following up between 40 per cent and 60 percent of child errors up to around age 3;6, in middle- and upper-class speakers of English and French” (p. 16). Children also frequently ask questions to directly elicit appropriate labels and descriptions for novel objects, events, or actions from adults.

Another thing that appears to be common in early language development is practice. Children regularly practice different aspects of language on their own and with others. For example, they practice the sounds of their language (the following examples are from the bed time monologues of a 2-year old, Antony, p. 31):

(interestingly, note how this practice sequence contains both positive and negative examples, i.e. “not barries”; it’s as if the child has an internal learned reward model, or verifier, that can evaluate his own productions with respect to particular criteria). They practice the building up and breaking down of phrases:

And they practice various other grammatical constructions:

Practice and feedback are tightly related to each other: more practice elicits more feedback, which provides more opportunities for the child to learn and to refine the meanings of the words and phrases in his/her language (p. 41).

Despite this apparent prevalence of feedback in early language exposure, it seems to be commonly assumed, implicitly or explicitly, that much of language development in children happens without supervision. I suspect that this may be another remnant, another relic, of Chomsky’s malignant influence in linguistics, specifically his “poverty of the stimulus” idea: the idea that not only is there not enough data in a child’s early linguistic experience, but whatever little there is of it is not rich in supervision and is therefore of lower value for learning. A natural consequence of this assumption is that language acquisition in children is often likened to unsupervised, or self-supervised, pre-training in language models.

A case in point is the BabyLM challenge. The BabyLM challenge involves learning sample-efficient language models from a developmentally plausible corpus containing 10M or 100M words. Pretty much all submissions to both editions of the challenge that have run so far feature variations on pre-training, e.g. exploring new architectures for pre-training, new objectives for pre-training, designing new curriculum strategies for pre-training, etc. (here’s a synopsis of the first edition of the challenge and here’s a synopsis of the second edition). As far as I can tell, none of the submissions so far have explored ideas involving what is nowadays called post-training in language models, i.e. supervised learning or learning from various types of feedback (such as human preferences). Reading Language in Children has made me seriously question this identification of language acquisition in children with unsupervised pre-training in language models. So, I would like to see more work come out that explores possible connections between language acquisition in children and feedback learning (or post-training) in language models instead (this doesn’t necessarily mean completely rejecting any role for unsupervised learning in children’s language acquisition, though).

There are reasons to think that learning from feedback would be more sample-efficient than unsupervised learning, so it could potentially help explain the remarkable sample-efficiency of language acquisition in children: (i) feedback provides both positive and negative evidence, unlike in the currently dominant paradigm of unsupervised learning, where the only negative evidence is the absence of positive evidence; (ii) feedback, or supervision, is tailored to the learner’s productions, so the supervision signal is, in some sense, much more relevant to the learner than passively received unsupervised data (this also means that it’s automatically titrated to the learner’s current skill level). 

Language in Children also offers a novel alternative perspective on the challenge of learning a second language in adulthood that I hadn’t thought of before. The apparent difficulty of learning a language in adulthood, compared to the apparent ease with which children acquire their first language in early development, is commonly attributed to some sort of critical period in brain plasticity, i.e. children are somehow biologically much more keyed to learning a language than adults. But, Clark offers an alternative explanation for why language learning might be more difficult for adults than for children, namely the radically different nature of the language experience in adulthood vs. in childhood. More specifically, the quantity and the quality of feedback received by adult learners, even for adult learners immersed in a second language environment (e.g. a new immigrant to a foreign country), is likely much inferior to the kind of intense and intimate feedback that children receive while learning their first language (p. 140): “… first-language acquisition offers many occasions for feedback from more expert speakers, occasions that young children attend to, and make use of. But with a later-acquired second language, such feedback is virtually absent…” At least much less common than in childhood.

I don’t necessarily endorse this explanation as the sole, or even the primary, reason why language learning in adulthood appears to be more difficult than in early childhood (as a counterpoint, for example, one could point out the numerous cognitive advantages adults have over children that could help them make better use of whatever feedback they receive while learning a language), but in my view, it offers an interesting, novel hypothesis that needs to be taken seriously and investigated further as a potentially important factor contributing to the apparent difficulty of learning a language in adulthood vs. in childhood.

On the entropic brain, trapped priors, and machine learning

If the doors of perception were cleansed every thing would appear to man as it is, Infinite. For man has closed himself up, till he sees all things thro’ narrow chinks of his cavern.

William Blake, The Marriage of Heaven and Hell

I’ve recently been reading up on the effects of psychedelics on the brain. One of the consistent effects that comes up in the literature is that psychedelics seem to make brain activity more entropic or higher dimensional globally (or more chaotic if you’d prefer to put it that way); for example, a recent paper claims “psilocybin desynchronizes the human brain”. It is intuitively very tempting (although potentially misleading) to interpret the psychological effects of psychedelics in terms of this entropy expansion mechanism: e.g. they help the brain escape from established, laid down, entrenched, lower dimensional (and often pathological) patterns of activity, opening up an opportunity to break free from its “trapped priors”.

The main reason I wanted to write this short post is that I would like to make a connection with machine learning here. I think that a similar entropy expansion story, by and large, accounts for the benefits of various architectural motifs used in modern deep learning models to improve their trainability. I’ve long argued that the main obstacle in training deep learning models is degeneracy, i.e. the activity space of the model collapsing into a very low-dimensional, degenerate space for generic initializations of the model. This makes the model effectively a pathologically low capacity model. Skip connections alleviate this degeneracy problem by making the activations more entropic. Normalization also likely has a similar effect, and so does the mixture-of-experts (MoE) motif.
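This degeneracy effect is easy to demonstrate numerically. The toy experiment below is my own illustration (not from any of the cited papers): it multiplies random deep linear layers with and without identity skip connections and compares the effective rank (exponentiated entropy of the normalized singular values) of the resulting maps:

```python
import numpy as np

# Toy illustration: deep products of random linear layers collapse onto a
# low-dimensional map, while identity skip connections keep the singular
# value spectrum (and hence the "entropy" of the activations) broad.

def effective_rank(M: np.ndarray) -> float:
    """Exponentiated entropy of the normalized singular values (a soft rank)."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-np.sum(p * np.log(p + 1e-12))))

rng = np.random.default_rng(0)
n, depth = 64, 50
plain = np.eye(n)      # plain deep linear net
residual = np.eye(n)   # same layers, but wrapped in skip connections
for _ in range(depth):
    W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    plain = W @ plain
    residual = (np.eye(n) + 0.1 * W) @ residual
```

The absolute numbers don’t matter here; the point is simply that the skip-connected product retains a far less degenerate spectrum than the plain product of the same random layers.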

Neuroscientists often interpret learning effects, e.g. improvements in sensitivity to environmental cues or improvements in cognitive or behavioral flexibility, very locally in terms of changes in synaptic plasticity (for example, in the paper cited above, we read: “Synaptogenesis in the medial frontal lobe and anterior hippocampus is thought to be key to the neurotrophic antidepressant effects of psilocybin”). But the overwhelming importance of these more global, system-level effects for learning has been woefully overlooked in my opinion. A more entropic model is a better learner simply because of its reduced degeneracy overall, not because of any low-level details (and there are probably many different low-level implementations that achieve a similar global effect).

By far one of the most interesting papers I’ve read on this topic is this paper by Carhart-Harris et al. from 2014, which explores some really original (albeit highly speculative) ideas about possible connections between this more entropic state of the brain and human consciousness. Carhart-Harris et al. argue that this entropic state corresponds to a more primary form of consciousness that is distinct from the normal waking consciousness in humans. It is characterized as a more dream-like, less constrained conscious state with a diminished, diffuse sense of self and a stronger sense of unity with the universe (mind at large, as Huxley called it). It is speculated that infants might also possess a similar primary conscious state as their default conscious state (early psychosis and certain types of meditation might induce such states to a certain extent as well). As infants grow up and their brains mature, they become less entropic. This was also interesting to me, because something like this is again observed while training neural networks: they start out in a less degenerate (more entropic) state, and gradually become more degenerate (less entropic) over training. The architectural motifs mentioned above (skip connections, MoEs, normalization) typically make the model less degenerate before training, but more so after training.

One of the main interests of these drug-induced altered states of consciousness (and other extreme states of consciousness) for me is that they demonstrate the remarkable “latent phenotypic variation” inherent in the human brain. In other words, they show us what we are potentially capable of becoming or experiencing. We can possibly cultivate some of these states with our own will to some extent, and cultural and biological evolution can work on such states over longer time scales to select desired phenotypic traits (in fact, Carhart-Harris et al. argue that something like this presumably already happened along the evolutionary lineage leading up to humans: the brains got more entropic). If at some point in the future, for example, a selection pressure arises for exquisitely sensitive, gentle, kind-hearted, selfless, child-like saints with a profound sense of unity with the universe, evolution can work its magic quickly to make a Peaceable Kingdom on Earth, because the raw material is already there.

IsoFLOP curves of large language models are extremely flat

An interesting detail in the recently released Llama-3 technical report has caught my eye (p. 8):

This caught my eye because I had noted the same phenomenon in a previous post about the Chinchilla scaling laws (more than two years ago) to argue for training smaller models (point 4 in that post). I’m glad that this observation is finally being taken seriously, but I think the quotation above from the Llama-3 paper still underestimates the extent of this isoFLOP flatness issue. The performance of these models is not just robust to small variations in model size around the optimum; it is actually pretty robust to even massive variations in model size at the Llama-3 compute scale. Here’s a simple experiment I did to illustrate this point.

Suppose you have a compute budget of 3.8e25 FLOPs, which is the amount of compute used for training the 405B parameter flagship Llama-3.1 model. What is the loss you can achieve by training different sized models, given this much compute? We can estimate this using the Chinchilla scaling laws, in particular, using their “Approach 3”, i.e. fitting a parametric scaling function to L (pretraining loss), N (model size), D (data size) values across many different training runs. In fact, we can even estimate the uncertainty around our predictions with bootstrapping (following Besiroglu et al.). Here’s what it looks like when we do this (I won’t bore you with the details, but suffice it to say it’s a pretty straightforward exercise):

Figure: Estimated model size vs. pretraining loss relationship given a compute budget of 3.8e25 FLOPs (equivalent to the amount of compute used for training Llama-3.1 405B). Estimates are based on the Besiroglu et al. correction to the Chinchilla scaling laws. Gray curves are 4000 individual predictions based on bootstrapped parametric scaling law estimates. Red dots indicate the corresponding compute-optimal models. Black line is the mean prediction. We highlight three different sized models (65B, 650B, 6.5T parameters) with the star symbols.

For this particular compute scale, a model with roughly 650B parameters (the middle star) turns out to be compute-optimal. But note, first, the uncertainty around this optimal size (red dots): the 95% confidence interval spans almost an order of magnitude! And this assumes that we got the parametric scaling function exactly right (no misspecification) and that our experiments were all optimal (hyperparameter choices, etc.). If not, that’s going to be another major source of error. Second, and even more importantly, look how wide and flat the bottom of the curve is around the optimal size. Models an order of magnitude smaller or bigger than the optimal model (indicated by the star symbols on the left and the right) have essentially the same loss as the optimal one. In this particular case, for example, the optimal 650B model has a loss of ~1.89, whereas both the 65B and the 6.5T models have a loss of ~1.91. So, these models are within about 1% of each other in terms of final pretraining loss.
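To make the flatness concrete, here is a minimal sketch of the loss calculation behind the figure, using the approximate central parameter estimates from Besiroglu et al.’s replication of the Chinchilla “Approach 3” fit (the bootstrap uncertainty shown in the figure is omitted):

```python
# Approximate central estimates from the Besiroglu et al. replication
# of the Chinchilla "Approach 3" parametric fit: L(N, D) = E + A/N^a + B/D^b
E, A, B = 1.8172, 482.01, 2085.43
alpha, beta = 0.3478, 0.3658

C = 3.8e25  # training compute of Llama-3.1 405B, in FLOPs

def loss(n_params, compute=C):
    """Predicted pretraining loss at model size N, with D = C / (6N)."""
    d_tokens = compute / (6 * n_params)
    return E + A / n_params**alpha + B / d_tokens**beta

for n in (65e9, 650e9, 6.5e12):
    print(f"N = {n:.1e}: loss ≈ {loss(n):.3f}")
```

Running this reproduces the numbers quoted above: ~1.89 at 650B, and ~1.92 for models 10x smaller or 10x larger, i.e. a spread of about 1% across two orders of magnitude in model size.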

Parenthetically, note also how this estimate of ~650B parameters for the compute-optimal model size differs quite substantially from the ~400B parameters estimated in the Llama tech report for the same amount of compute. This is presumably because these respective scaling “laws” are estimated from two different sets of experiments (using different training data and different training configurations, etc.). This again goes to show how sensitive these so-called scaling “laws” are to experimental details. I’m always amazed when people talk about them as if they were actual, precise, quantitative “laws” of nature, like Newton’s laws of motion. What a terrible and unfortunate choice to call them laws!

I haven’t been able to find a credible estimate of this, but my guess is that the total (lifetime) inference costs of these flagship large language models are likely several orders of magnitude larger than their one-time training costs. Any calculation that takes inference costs into account (not just training costs) will therefore massively favor training smaller models for longer (way beyond the training-compute-only-optimal point). I appreciate that Meta already did this to a certain extent for the 8B and 70B models in Llama-3, but I hope that they’ll be even bolder next time and do it for their flagship model as well. Given a fixed compute budget, smaller models also leave more room for experimentation: for example, for tuning the hyperparameters of the model or the training configuration much more extensively. So any calculation that includes hyperparameter search and other types of experimentation (in addition to training and inference compute) will likewise shift the isoFLOP curve to the left and thus favor training smaller models for longer (although this is likely a much smaller effect than that of inference).
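As a rough illustration of the direction of this shift, here is a toy grid search under the same approximate Chinchilla-style parametric fit, where lifetime inference (~2N FLOPs per token) is charged against the same total budget. The lifetime inference token count below is a made-up number purely for illustration; only the direction of the shift is the point:

```python
# Toy grid search: how charging inference compute against the budget
# shifts the "optimal" model size downward. Scaling-law parameters are
# approximate Besiroglu et al. central estimates; T_INF is invented.
E, A, B = 1.8172, 482.01, 2085.43
alpha, beta = 0.3478, 0.3658
C = 3.8e25     # total compute budget in FLOPs
T_INF = 1e13   # hypothetical lifetime inference tokens (assumption)

def loss_train_only(n):
    d = C / (6 * n)
    return E + A / n**alpha + B / d**beta

def loss_with_inference(n):
    train_flops = C - 2 * n * T_INF  # ~2N FLOPs per inference token
    if train_flops <= 0:
        return float("inf")          # budget exhausted by inference alone
    d = train_flops / (6 * n)
    return E + A / n**alpha + B / d**beta

sizes = [10 ** (9 + i / 50) for i in range(200)]  # 1e9 .. ~1e13 params
n_train = min(sizes, key=loss_train_only)
n_total = min(sizes, key=loss_with_inference)
print(f"training-only optimum:   {n_train:.2e} params")
print(f"inference-aware optimum: {n_total:.2e} params")
```

Under these assumed numbers, the inference-aware optimum lands well below the training-only one; a larger assumed inference volume pushes it down further still.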

To wrap up this post, I’d like to make a few very concrete recommendations for the next iteration of Llama models (presumably Llama-4), if anybody from the Llama team reads this (je sais que tu le fais 😉 ):

  • Include inference compute and experimentation compute (e.g. hyperparameter tuning) in your scaling law calculations, not just the training compute. These do not have to be very precise. It’s OK to be conservative about inference compute. Any effort is better than no effort. Here’s one recent attempt and here’s another attempt at incorporating inference compute in such calculations.
  • Normalize training models for multiple epochs. With the current data sizes (and especially for smaller models), a few epochs of training over the same data is almost certainly indistinguishable from training on brand new data. And training for even more epochs is likely only slightly worse than training on brand new data for any number of training epochs practically feasible today.
  • I think there’s no need to train 3 separate model sizes; two models at most are enough: one small and one bigger. For Llama-3, for example, instead of a 70B model, I would have much preferred to see Meta spend a bit more compute and train the 8B model for 20x longer (or something like that). Seeing how well such a small model performs when trained that long would also go a long way toward increasing people’s confidence in the practice of training smaller models for much longer.

Does Sora understand physics? A few simple observations

I’m a bit late to the fray as usual, but I wanted to write a short post about Sora. Sora is OpenAI’s new video generation model. As of this writing, it’s still not open to the public, so all we’ve got so far is some high-level information about the model and some generated samples shared by OpenAI in a blog post. The samples look impressive in their visual quality and their apparent realism; however, most of the videos seem to contain pretty glaring physical inaccuracies that are easy to detect when one looks at the details a bit more carefully (e.g. objects merging into each other and then unmerging, objects spontaneously disintegrating or disappearing, objects spontaneously changing their features, etc.). This prompted some to question whether (or to what extent) Sora really understands physics, and further, whether it’s possible to understand physics at all by, effectively, just learning to predict pixels over video clips (which is, at a high level, what Sora does).

I should preface everything I will say here by emphasizing that I really dislike this sort of binary “understands it or not?” framing of discussions about capabilities in general. Why do we always have to frame our debates in terms of extremes like this (sigh)? It’s absurd to claim that a model that can generate videos as good as Sora does hasn’t learned anything about physics. It also seems absurd to claim that it has learned a highly accurate physics engine, as the videos it generates often display clear physical defects. Obviously, the reality is somewhere between these two extremes. The really interesting questions here are: what aspects of physics exactly has Sora been able to learn, and how far can we push this approach to learn a more accurate physics engine (in other words, how good will the learned physics engine become as we scale up Sora)?

With this important caveat in mind, I’d like to make a few very simple observations as my humble contribution to this assize about Sora’s understanding of physics. Most of these are probably obvious to anybody who knows anything about anything (or to somebody who knows something about something), but I happen to belong to that rarefied species that finds prodigious value in stating the obvious from time to time, so here we go:

1) There’s an important distinction between “understanding physics” and being able to generate physically accurate videos. Although the model might struggle with generating physically highly accurate videos, it might still be able to reliably recognize that there’s something “weird” going on in physically inaccurate videos. This is roughly the difference between recognition and generation (or the difference between recognition and recall in memory retrieval). The latter is generally harder. So, a potentially more sensitive way to test the model’s understanding of physics would be to run carefully controlled recognition tests, as is typically done in intuitive physics benchmarks, for instance.

2) People’s understanding of physics seems to be mostly of this “recognition” variety too (rather than the “generation” variety). People don’t really have a very accurate physics engine inside their heads that they can use to simulate physically highly accurate scenarios (cf. Marcus & Davis, 2013; Davis & Marcus, 2015; Ludwin-Peery et al., 2021). This is why this capability is often properly described as intuitive physics as opposed to actual physics (or similar).

3) People can also generate fictitious, physically highly implausible or even impossible scenarios in their imagination with remarkable ease and ingenuity (and they have been doing this since time immemorial). Cartoons, fairy tales, fantasies, legends, etc. are full of such examples: levitating creatures, objects passing through solid walls, objects melting or disintegrating into pieces and then regrouping again, etc.

Casper the Friendly Ghost

4) For related reasons, you also do NOT want a video generation model that only generates physically highly accurate videos. You want something that can bend or break physics, ideally in a precisely controllable way (based on a textual prompt, for instance, among other ways).

5) We know nothing about the distribution of the videos Sora was trained on. Almost certainly, a subset of its training data consists of CGI, digitally edited, or animated videos depicting physically implausible or impossible scenarios (we don’t know how large this subset is). So, part of the reason why Sora sometimes generates physically implausible or inaccurate videos may be traced back to this subset of its training data.

6) Even granting the previous point, however, some of the generated samples seem to show clear signs of gross errors or inaccuracies in whatever physics engine Sora has managed to learn by watching videos. Consider this generated video of wolf pups frolicking, for example. Why do inaccuracies like this arise in the first place and how might they be remedied or ameliorated? At the risk of sounding like a man with a hammer seeing nails everywhere, I will suggest that many of the inaccuracies like this particular one are “granularity problems” that will be fixed when Sora can model videos at a sufficiently fine granularity (both spatially and temporally). For example, this particular scene with wolf pups frolicking is a highly complex, dynamic scene and accurately generating a scene like this requires very fine-grained individuation and tracking of multiple objects. In the absence of this level of granularity, the model instead generates something more coarse-grained, freely merging and unmerging objects in physical proximity without regard to correctness in details, but capturing the overall gist, the gestalt (or “texture”) of the action in the scene, somewhat analogous to how we see things in our visual periphery.

Update: After writing this post, I saw this thoughtful and much more detailed post on Sora by Raphaël Millière, which I recommend as well.

The “it” of deep learning and convergent evolution

I recently came across this beautiful short blog post by James Betker (who works at OpenAI), arguing that the thing that really determines the capabilities and, more generally, the behavior of a machine learning model is not its architecture, it’s not the particular optimizer used for training the model, or any other details of the model configuration, but it’s the training data. Surprisingly, even the optimization objective often doesn’t seem to make a huge difference, within a wide margin. An example of this is the wide range of self-supervised visual representation learning algorithms (SimCLR, MoCo, BYOL, DINO, MAEs, etc.) and the wide range of model architectures (ConvNets, transformers, MLP-mixers, etc.) that all seem to work more or less equally well when trained on the same data. This doesn’t mean that the model architecture, optimization objective, or other details of the model/training configuration are completely irrelevant; that’s certainly not the case: e.g. earlier generation self-supervised learning algorithms like RotNet were clearly inferior to the newer generation ones listed above for learning useful, general-purpose visual representations, and similarly, MLPs seem to be clearly inferior to the modern architectures mentioned above (although even this may change if we can train bigger MLPs with more data than we have been able to do thus far). The point is rather that the capabilities of the trained models seem to be surprisingly insensitive to a wide range of variation in these factors.

It occurred to me that this situation is not unique to deep learning. A similar thing happens in biological evolution too. “Training data” in this case roughly corresponds to the environment organisms find themselves in (broadly construed), including other organisms they interact with. As in machine learning, “training data” in this sense seems to forcefully and profoundly affect the outcome of evolutionary “optimization” too1. The two main pieces of evidence for this are (1) the rampant convergent evolution in biology and (2) biological structures and processes often pushing up against the limits of physics. Wings and powered flight independently evolved at least four times; complex, image-forming eyes independently evolved dozens of times; C4 photosynthesis independently evolved perhaps over sixty(!) times; complex brains may have evolved independently at least a dozen times; and so on. These examples suggest that when faced with similar environmental challenges, evolution hit upon the same solutions over and over again in vastly different lineages. These solutions are also often close to the physical limits2. The person who, perhaps more than anyone else, emphasized these aspects of evolution is probably Simon Conway Morris. Life’s Solution: Inevitable Humans in a Lonely Universe (this concept of inevitability or predictability in evolution is a major theme in Conway Morris’ work) and Six Myths of Evolution are two of his books documenting a large number of such examples of (1) convergent evolution and (2) evolutionary optimization reaching the limits of physics (both quotations below are from Chapter 1 of the latter book):

… convergence comes to our rescue; its ubiquity suggests the regions of biological hyperspace that are actually habitable represent the minutest fraction of what is potentially available.

… are there ultimate limits of life, and if so is the biosphere anywhere near such a closure? Curiously, the evidence suggests that with one crucial exception we are indeed near to the boundaries.

Why does this happen? Why do training data in deep learning and the environment in biological evolution seem to have such profound effects on the nature of the solutions reached through these processes? It seems that there are basically two requirements for this to happen: (1) training data or the environment is rich and complex enough to tightly constrain the space of “good solutions”, (2) the optimizer is “good” in some informal sense (I use “the optimizer” in a broad sense here to include everything other than the training data itself in the case of machine learning)3. (2) seems to be necessary, since training data or the environment would presumably not be able to constrain the nature of highly suboptimal solutions very tightly. “Optimal solutions are all alike; every suboptimal solution is suboptimal in its own way”, as Tolstoy might have said in another universe in which he (regrettably) chose to become an optimization theorist.

One can imagine other domains where these properties roughly hold as well, for example, human history (again broadly construed, e.g. history of technology and innovation, history of ideas and morals, history of social and political institutions, etc.). There’s a sobering quality to this view of human life and human history. While we tend to think of ourselves as autonomous, free individuals, and our free will as hugely consequential and important, in the large scheme of things, even the most consequential events or the most consequential individuals in history tend to have at best a transient effect on the unfolding of history at longer time scales: they can only slightly delay or speed up the inevitable (or the highly likely), which are themselves shaped by much more stable and deeper forces of history, like human nature (just like how physical laws often determine the mechanisms of life that end up evolving in biological organisms). So, a lot of apparent chance and happenstance at the scale of individual human lives, but imperturbable order and necessity at more “cosmic” scales. I often struggle with this Hegelian view of history.


  1. Of course, evolution happens in a dynamic, evolving landscape (both the environment and the species in it change over time), so “training data” in this case is not static and the situation is quite a bit more complicated than in a standard static machine learning problem. ↩︎
  2. What initially appear to be suboptimal solutions often turn out to be consequences of evolution solving a different and more complex optimization problem than the one we have in mind, although I understand that this is a somewhat intricate topic that needs to be handled more carefully (perhaps in another blog post). ↩︎
  3. It may seem a bit strange to claim that evolution by natural selection is a “good optimizer”, since it’s basically just random search, and compared to gradient-based optimization, random search is definitely suboptimal. But my guess is that the vast number of parallel searches that happen in biological evolution, and the vast stretches of time over which they take place, sufficiently alleviate the obvious problems with random search to make it a “good enough” optimizer. ↩︎

Intelligence is a granularity problem (or: reality has a surprising amount of detail, so must intelligence)

One of the recurring themes in Hans Moravec’s prescient book, Robot: Mere Machine to Transcendent Mind (first published in 1999), is how practically important problems (e.g. agile robot navigation in the real world) become tractable more or less automatically, as the amount of widely accessible compute reaches a soft threshold. Before this threshold is reached, people try to come up with all sorts of ingenious ideas, clever tricks to squeeze the last bit of performance from the available compute, but in the long run, this almost always proves totally unproductive, basically a complete waste of time, as the most straightforward, the simplest, “brute-force”, “dumb” method to solve the problem turns out to work just fine once the available compute reaches the requisite threshold, whereas the “ingenious” tricks almost invariably do not scale nearly as well with compute. This is, of course, another version of Rich Sutton’s famous Bitter Lesson.

The main reason problems become tractable only at particular compute scales is that their solution requires a minimum level of granularity or detail to be modeled. And most of the fundamental, practically important computational problems we face in the real world need a very high degree of granularity for their solution. The main reason for this, in turn, is that reality has a surprising amount of detail and these details are often very important.

Here’s an illustrative example from the book showing 3D maps of two similar visual scenes generated by essentially the same “dumb” (but scalable) mapping algorithm 18 years apart:

18 years of steady increase in the amount of widely available compute finally made real-world robot navigation a reality.

With the drastic increase in the available compute over those 18 years, it became possible to extract many many more features from the scene and estimate their locations to a much higher degree of resolution. This much finer granularity in 3D mapping is what finally enabled acceptably good robot navigation in the real world.

Here are some other examples of this phenomenon:

  • Visual object recognition: You can’t do fine-grained real-world object recognition with 8×8 images (nor even with 28×28 images). This is just too small to resolve the important details of many real-world objects. If the compute available to you only allows for the processing of such small images, I’m afraid you’re just going to have to wait until the compute catches up (much better to work on increasing the compute than to churn out cute little tricks that only work with 8×8 images!).
  • Chess: You can’t beat the world champion at chess if you can search the game tree only up to depth 3 or so. Beating the world champion at chess requires being able to search the game tree much more extensively at sufficiently large depths and breadths. And in fact, “dumb” brute-force search combined with sufficient compute was basically how a computer program defeated a world champion at chess for the first time, although there have been some important developments in making search more efficient since then (e.g. MCTS).
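The compute wall in the chess example can be seen with trivial arithmetic. A branching factor of ~35 is the standard rough figure for chess; the depths below are just illustrative:

```python
# Back-of-envelope game-tree arithmetic. A branching factor of ~35 is
# the standard rough figure for chess; everything else is illustrative.
BRANCHING = 35

def nodes(depth):
    """Leaf count of a full game tree expanded to the given depth."""
    return BRANCHING ** depth

print(nodes(3))   # tens of thousands of positions: trivial to search
print(nodes(12))  # ~3e18 positions: why depth needs serious compute (and pruning)
```

Each extra ply multiplies the work by ~35, which is why search depth (and hence playing strength) tracks available compute so directly.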

I believe this granularity problem also fundamentally underlies most cases of current AI methods not yet being able to do well in certain domains and it will ultimately be overcome when the widely available compute scale allows for the modeling of the requisite level of granularity in that domain even without any fundamental improvements in the algorithms, just like in Moravec’s 3D mapping example above. To give a few further examples:

  • Robotics: I believe this is why robotics is still hard for AI. For example, fine-grained, dexterous control of robotic hands in the real world requires being able to learn high-dimensional, high-precision, complex temporal patterns (with lots of high-frequency components, for example, due to contacts), which, in turn, requires sufficiently big models trained with a sufficiently large amount of data. This fine-grained, high-dimensional, high-precision control problem is, in fact, presumably so hard that the sensory and motor cortices in the human brain allocate a disproportionately large amount of cortical space to the representation of hands, as illustrated by these cartoonishly grotesque figurines of cortical homunculi (as a side note, it seems to be generally accepted among evolutionary biologists that the evolution of upright posture and the subsequent freeing of the hands for the manufacture and manipulation of objects was indeed one of the main drivers of the rapid expansion of brain size in the genus Homo):
A disproportionately large amount of cortical real estate is allocated to the representation and control of hands in the human brain (source).
  • Data efficiency: I believe that this granularity problem is also (at least partly) behind the apparent data efficiency gap between current deep learning algorithms and humans. To give an example from the visual domain, the human retina contains something like 6M color-sensitive cone receptors very tightly concentrated within a few degrees around the fovea. By moving our eyes, we can resolve different objects or surfaces in a scene to a very high degree of precision. The most commonly used image size in computer vision today, on the other hand, is something like 310×256 pixels (for the entire image), which is about 0.08M pixels, or two orders of magnitude lower resolution than the human retina (directly comparing the number of pixels in an image and the number of photoreceptors in the retina is a bit tricky, but I think it does make sense under fairly reasonable assumptions). My own recent work suggests that the apparent data efficiency gap in the visual domain between current deep learning algorithms and humans might be closed once we start to work with sufficiently large natural images, closer in size to the photoreceptor array in the human retina (~6MP), instead of using much smaller images, which is currently the norm.
  • Long-form video modeling: The granularity problem is the reason why long-form video modeling (long-form video understanding and generation) is still not there yet. Representing even very short clips without too much information loss requires a large number of visual “tokens”. From my own work, for example, I know that even 1-second-long natural video clips require at the very least something like 4×16×16 discrete tokens (i.e. 4 tokens in the temporal dimension, 16×16 tokens in the spatial dimensions) in order to represent them faithfully enough. That is roughly 1K tokens. Scaling this up to a 1-hour-long video would require roughly 4M tokens. It is not possible to train a large GPT model with a 4M-token context length at the moment (not even for big industry labs), but as surely as the sun will rise tomorrow, this will be eminently feasible in the not-too-distant future, and at that point AI models will be able to understand and generate long-form videos (e.g. films) at least as well as humans, but orders of magnitude faster (it will be a very wild world when AI models can generate entire films in a matter of minutes or seconds).
  • Text, hands, faces: The granularity problem is the reason why generative vision models had problems with creating realistic texts, hands, or faces in images, until very recently. These categories of objects all involve a large amount of fine-grained visual detail that needs to be represented and modeled in order to generate and recognize them accurately.
  • Developing and understanding large, complex software projects: Such projects often involve large codebases and their corresponding documentation (perhaps also including auxiliary information such as issues and pull requests, etc.). Similar to the case of long-form video modeling above, it is currently not yet feasible to train large GPT models with a large enough context size to cover all of the relevant pieces of code and documentation contained in a complex, realistic software project.
  • Long-form text modeling: The granularity problem is also the reason why AI models can’t write a convincing novel yet (nor read and understand a novel as well as humans do). The length of a good-sized novel like Anna Karenina is roughly on the order of 1M tokens (give or take a factor of 2). Again, it is currently not feasible to train a large GPT model with a context size this long, but it will surely be feasible in the not-too-distant future, and at that point AI models will be able to write and comprehend novels (and other types of long-form text) at least as well as humans do. But, you may ask, do all those 1M tokens really matter for writing or comprehending a good novel? Yes, absolutely! It takes a lot of detail to build convincing characters, it takes a lot of detail to build rich internal and external lives for the characters in a novel. And we are exquisitely sensitive to these details. Human life is rich and complex, we go through a lot as our lives unfold over the years and, as a result, we are very sensitive to these vicissitudes, twists and turns of life. Let me also take this opportunity to wax lyrical about one of my favorite writers and one of my favorite novels: this is precisely why Tolstoy was one of the greatest writers and Anna Karenina is one of the greatest novels ever written. Tolstoy is particularly adept at creating, expressing, conveying these rich details of both the inner and outer lives of the characters in Anna Karenina, so much so that when you read Anna Karenina, you say “this could be real”; nothing in the novel really sticks out as strained, implausible, or unconvincing.
One of the greatest novels ever written (source).
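The back-of-envelope numbers in the bullets above (retina vs. image resolution, video token counts) can be sanity-checked in a few lines:

```python
# Sanity-checking the back-of-envelope granularity numbers from the
# bullets above. All figures are the rough estimates quoted in the text.

# Retina vs. typical computer-vision input resolution:
retina_cones = 6e6                  # ~6M cone receptors in the human retina
image_pixels = 310 * 256            # a typical input image, ~0.08M pixels
print(retina_cones / image_pixels)  # a gap of roughly two orders of magnitude

# Video tokens: ~4×16×16 discrete tokens per second of video
tokens_per_second = 4 * 16 * 16     # ~1K tokens per second
tokens_per_hour = tokens_per_second * 3600
print(tokens_per_hour)              # ~3.7M tokens for an hour of video
```

The hour-long figure comes out just under 4M tokens, consistent with the rough 4M quoted above.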

Are there any problems that cannot be regarded as pure granularity problems for current AI methods, i.e. problems caused by our temporary inability to apply these methods at sufficiently fine granularities? My current working hypothesis is that reaching human-level AI will prove to be nothing but a granularity problem (or a series of granularity problems). I think we will once again be surprised when we find out we can actually solve many of the currently intractable-looking problems with increased granularity. But, how about reasoning or planning, for example? Are they also just a granularity problem? First of all, I don’t think that humans really do reasoning or planning in the sense in which these terms are often used, as evidenced by the fact that models that can actually do reasoning and planning wipe the floor with even top human players in board games. What seems like reasoning in humans is most often just the use of shortcuts afforded by abstraction tools; for example, we write and use computer programs to do our reasoning for us. And writing code, as we found out recently, seems eminently amenable to reasoning-free, “pattern recognition” type learning strategies. Otherwise, again, my current hypothesis is that for human-level AI, it is going to be “pattern recognition” all the way down, but at increasingly finer granularities (perhaps with the sole addition of just a little bit of supervised finetuning applied on top).

Further thoughts on hallucinations in generative models

While working on some generative video models recently, I had a moment of epiphany about hallucinations in generative models. I wanted to share this tiny bit of insight (if it isn’t too presumptuous to call it an insight) that has occurred to me. It is one of those simple things that has always been right in front of your eyes, but you never paid attention to it, so you were never consciously aware of its existence or its nature, like a sign or a pattern you notice for the first time in a familiar environment. It may even be already obvious to most people.

If you train an autoregressive image model and do conditional sampling with it, that is, if you take a real image, give the upper half of the image as context (or prompt) to the model, and ask it to complete the bottom half, it can typically generate a pretty diverse set of novel continuations. OpenAI’s Image GPT post had cute examples of this:

Conditional sampling with Image GPT: given the upper half as context or prompt, the model is asked to complete the bottom half. Columns 2 to 6 show 5 different conditional samples from the model for each image.

Now, this is true even for a well-trained model fit to a relatively small dataset, and even when the conditioning images given as context come from the training data itself. I’ve tried this in a recent work here. You can see some examples here of conditional samples generated by a model trained on a relatively small child headcam dataset. Obviously, the exact type and diversity of the samples generated will depend on the details of the sampling strategy, but the point is that it is surprisingly difficult to make the model generate just the “ground-truth” continuation given by the training example itself. In other words, the model has a strong tendency to “hallucinate” novel but plausible continuations.

My moment of epiphany came when I noticed that this was not true for video models (models of short video clips, i.e. a few seconds long). If you train an autoregressive video model on 2-second-long video clips and do conditional sampling with it, i.e. give the model a 1-second-long clip as context (or prompt) and ask it to complete the rest of the video, it will basically generate the same continuation over and over again. And if the prompt clip given as context comes from the training data, a well-trained model will almost always generate something very close to the ground-truth continuation. So, why does this happen? What explains this difference between the image models and the video models?

Well, the answer is pretty obvious: single images are much less redundant than short video clips. Given the upper half of an image, there are lots of different plausible continuations for the bottom half of the image. On the other hand, the possibilities are much more restricted for a short video clip: given the first second of a video clip, it is virtually determined what will happen in the next second of the clip (of course, the situation is a bit different for much longer clips); here, the given context is much more constraining than in the case of images.
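This redundancy argument can be made concrete with a toy sketch (entirely my own illustration, not from any of the works mentioned above): two tiny Markov chains stand in for an “image-like” domain, where many continuations are plausible, and a “video-like” domain, where the dynamics are near-deterministic. Sampling continuations from the same prefix yields very different numbers of distinct outcomes.

```python
import random

def sample_continuations(transition, prefix, length, n_samples, seed=0):
    """Sample `n_samples` continuations of `prefix` from a first-order
    Markov chain; `transition[s]` lists (next_state, probability) pairs.
    Returns the set of distinct continuations observed."""
    rng = random.Random(seed)
    continuations = set()
    for _ in range(n_samples):
        state, out = prefix[-1], []
        for _ in range(length):
            states, probs = zip(*transition[state])
            state = rng.choices(states, weights=probs)[0]
            out.append(state)
        continuations.add("".join(out))
    return continuations

# "Image-like" domain: weakly constrained, every next state equally plausible.
image_like = {s: [(t, 0.25) for t in "abcd"] for s in "abcd"}

# "Video-like" domain: strongly constrained, near-deterministic dynamics.
succ = {"a": "b", "b": "c", "c": "d", "d": "a"}
video_like = {s: [(t, 0.97 if t == succ[s] else 0.01) for t in "abcd"]
              for s in "abcd"}

img = sample_continuations(image_like, "a", length=6, n_samples=200)
vid = sample_continuations(video_like, "a", length=6, n_samples=200)
print(len(img), len(vid))  # many distinct continuations vs. far fewer
```

The same sampler, with the same prefix and the same sampling budget, produces a diverse set of continuations in the weakly constrained domain and mostly one dominant continuation in the strongly constrained one; the diversity lives in the conditional distribution, not in the sampling procedure.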

Text is even less redundant than images, so the context (prompt) is even less constraining for text. Consider this short paragraph (it’s the first paragraph of Guy de Maupassant’s famous short story, Clair de Lune, one of my favorite short stories):

Abbe Marignan’s martial name suited him well. He was a tall, thin priest, fanatic, excitable, yet upright. All his beliefs were fixed, never varying. He believed sincerely that he knew his God, understood His plans, desires and intentions.

There’s an enormous variety of ways you can continue this paragraph, endless possibilities most intriguing, most wonderful. The actual continuation of the story is just one among those endless possibilities. If we want to retrieve that actual continuation, this short piece of context may simply be insufficient to pick out the right choice among the endless possibilities.

OK, but what does all this have to do with hallucinations in language models? I think these examples suggest that hallucinations may be, at least in part, a retrieval problem shaped by two main factors: (i) the intrinsic redundancy of the domain we’re dealing with: hallucinations will always be more likely when predicting one half of an image from the other half than when predicting one half of a short video clip from the other half (even within a single modality, say text, some genres, e.g. fiction, may be more prone to hallucinations than others, e.g. official documents, because they are less predictable or redundant); and (ii) how much context we give the model to help it retrieve the “correct” or ground-truth continuation.

More speculatively, these examples also suggest that the more a model “knows” in some intuitive sense (for example, by being trained on a larger and more diverse set of data), the more context it may need to retrieve the correct piece of information, since the likelihood of retrieval failures (e.g. retrieving a similar but incorrect piece of information) increases as the model’s knowledge increases. Intuitively, this is analogous to how you would need longer vectors in order to retrieve from a larger vector database at a fixed level of accuracy.
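The vector-database analogy can be sketched with a small simulation (my own toy setup, with made-up sizes): at a fixed cue dimension and noise level, nearest-neighbor retrieval from a noisy cue degrades as the database grows, simply because more similar-but-wrong items accumulate.

```python
import math
import random

def retrieval_accuracy(db_size, dim=8, noise=0.2, trials=100, seed=0):
    """Fraction of trials in which a noisy cue retrieves the correct item
    (by max dot product) from a database of random unit vectors."""
    rng = random.Random(seed)

    def unit_vector():
        v = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    hits = 0
    for _ in range(trials):
        db = [unit_vector() for _ in range(db_size)]
        target = rng.randrange(db_size)
        # The cue is a noisy version of the target item.
        cue = [x + noise * rng.gauss(0, 1) for x in db[target]]
        scores = [sum(c * x for c, x in zip(cue, v)) for v in db]
        hits += scores.index(max(scores)) == target
    return hits / trials

# Same cue quality, larger database => more retrieval failures.
print(retrieval_accuracy(10), retrieval_accuracy(2000))
```

To retrieve reliably from the larger database you would need either a longer vector (larger `dim`) or a cleaner cue (smaller `noise`), which is the intuition behind larger models needing more context for accurate retrieval.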

At this point, I should point out as a caveat that language models (and generative models in general) are not standard information retrieval models. There are important differences between standard information retrieval systems and generative models: for example, generative models can “retrieve” novel, non-existent items. However, thinking of generative models as soft retrieval systems (that can retrieve soft, novel mixtures or variations of their training data) is still a very useful perspective in my view, especially when it comes to questions such as memorization and hallucinations.

The importance of context for effecting correct retrievals suggests that we may be able to reduce hallucinations in language models by basically giving them more (and better) retrieval cues in the prompt (this is, in fact, the core idea behind retrieval-augmented generation, or RAG, methods to reduce hallucinations in generative models). I tried this strategy with some of the examples of hallucinations by GPT-3.5 I documented in my previous post, with partial success. So, for instance, for the George Orwell example, just prepending the introductory section of the Wikipedia article on George Orwell to my question about Orwell’s university attendance reliably elicits the correct answer (i.e. that Orwell did not attend university). Even though the copy-pasted text from Wikipedia doesn’t contain any information about Orwell’s university attendance, it presumably “nudges” the model’s retrievals toward the more “wikipedia-esque” corners of its memory landscape, thus helping them to be more accurate (the rest of the Wikipedia article does indeed mention the fact that Orwell did not attend university). Similarly, I was also able to elicit some details of the plot of Clair de Lune by simply copy-pasting the first few paragraphs of the story (I wasn’t able to achieve any of this previously by just directly asking the model to describe the plot of the story).
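As a minimal sketch of this cue-prepending strategy (the function and the context snippet below are hypothetical placeholders of my own, not an actual API or the real Wikipedia text):

```python
def build_prompt(retrieved_context: str, question: str) -> str:
    """Prepend retrieved text to a question so that it can act as a
    retrieval cue for the model (the core idea behind RAG)."""
    return (
        "Use the following background text as context when answering.\n\n"
        f"{retrieved_context}\n\n"
        f"Question: {question}"
    )

# Hypothetical usage; the context snippet is a stand-in, not the real article.
wiki_intro = "Eric Arthur Blair (pen name George Orwell) was an English novelist, ..."
prompt = build_prompt(wiki_intro, "Did George Orwell attend university?")
print(prompt)
```

The point is not the string formatting, of course, but that the prepended text shifts the model’s retrievals toward the right “corner” of its memory, even when it doesn’t directly contain the answer.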

This is all anecdotal evidence, and these ideas need much more rigorous investigation, but the initial results seem pretty promising. In some cases, however, I noticed that this strategy doesn’t seem to work at all, suggesting that some hallucinations may be due to encoding errors rather than retrieval errors, as memory researchers would put it (i.e. the model just hasn’t learned the relevant information in the first place): some hallucinations may be due to write errors rather than read errors. In such cases, the desired behavior would be for the model to simply decline to answer the question instead of answering it based on a superficially similar but incorrect retrieval.

GPT-3.5 is surprisingly non-factual about literature

GPT-3.5 seems to be surprisingly bad at answering basic factual questions about famous writers and famous works of literature. This is something I’ve noticed over the last couple of months and here I’d like to share some random examples of this that I encountered recently:

  • Did George Orwell go to university? GPT-3.5 sometimes thinks Orwell attended Oxford, “where he studied at Eton College” (an impressive double error there!). Orwell, in fact, never attended Oxford or any other university. He did go to Eton, but Eton is not a university and it’s not a college of Oxford.
  • GPT-3.5 completely makes up the plot of Chekhov’s short story, Ariadne.
  • GPT-3.5 sometimes mixes up some of the main events in Far from the Madding Crowd, for example, claiming that Sergeant Troy is stabbed to death while trying to retrieve a family heirloom (completely made up as far as I can tell), even though it correctly notes just a few sentences prior to this that Mr. Boldwood shoots him. This kind of mixing up of important details seems to be a fairly common failure mode in GPT-3.5’s responses.
  • GPT-3.5 cannot answer some really very basic questions about the plot of Chekhov’s fantastic short story, The Grasshopper.
  • GPT-3.5 makes up a plot detail in George Eliot’s novella, Silas Marner (about the cause of Dunstan Cass’s death: GPT-3.5 consistently claims he gets killed while riding the horse, Wildfire, but in fact he dies afterwards, not from falling off a horse).
  • GPT-3.5 completely makes up Daodejing 15 and many other chapters I’ve tried (despite this being one of the more famous chapters in Daodejing).
  • GPT-3.5 completely makes up the plot of Guy de Maupassant’s beautiful short story, Clair de Lune (one of my favorite short stories).
  • GPT-3.5 makes up an important detail from Tolstoy’s famous short story, How Much Land Does a Man Need? It confuses something that happens in reality in the story for a dream. This answer also illustrates the fact that there are often gaping holes in GPT-3.5’s understanding of stories: it makes no sense whatsoever for what GPT-3.5 claims here to be a dream to actually be a dream for the internal coherence of the story.
  • GPT-3.5 also completely makes up the plot of another one of Tolstoy’s famous short stories, What Men Live By.
  • GPT-3.5 completely makes up important details in H. G. Wells’s very famous short story, The Country of the Blind, claiming for example that the main protagonist of the story, Nunez, “plans his escape during a rare solar eclipse, which temporarily plunges the valley into darkness. During this eclipse, when the blind inhabitants are disoriented, he climbs the mountain to freedom.” In addition to being a complete fabrication, this detail also doesn’t make any sense whatsoever. Obviously, a solar eclipse would have no effect on the blind people of the valley.
  • GPT-3.5 completely makes up the plot of another one of H. G. Wells’s well-known short stories, The Lord of the Dynamos.
  • GPT-3.5 basically completely makes up the plot of Mikhail Zoshchenko’s short story The Adventures of a Monkey.
  • GPT-3.5 is hazy about some of the plot details in David Copperfield; for example, it is unable to correctly describe the fates of the characters Emily and Littimer by the end of the novel. This is arguably one of the most famous novels ever written, with thousands of expository pieces and critical commentary written on it (the text of the novel itself presumably appears hundreds, maybe thousands, of times all over the internet), yet the fact that GPT-3.5 still cannot nail some of the basic plot details is really disappointing.
  • GPT-3.5 does not recognize Chekhov’s short story, A Woman’s Kingdom, claiming that “A Woman’s Kingdom is not a well-known short story by Anton Chekhov” and that it “does not appear in his known body of work”. It then proceeds to confabulate various details of one of Chekhov’s most well-known short stories, The Lady with the Dog.
  • GPT-3.5 completely makes up the ending of Oscar Wilde’s short story Lord Arthur Savile’s Crime.
  • GPT-3.5 fails to recognize Chekhov’s long short story Three Years, falsely attributes it to Turgenev instead, and fabricates an alternative title for it (Three Days in the Country) along the way. Now, this is an interesting confabulation, since Turgenev did, in fact, write a play titled A Month in the Country and the confabulated title seems to be based on this.
  • GPT-3.5 fails to identify the famous biblical story mentioned in Chekhov’s exquisite (and very well-known) short story The Student.

Some of these errors are more egregious than others, but in no case was I able to elicit answers from GPT-3.5 that indicated flawless knowledge of a short story or a novel and thus inspired confidence in its reliability and competence. It’s frustratingly easy to get GPT-3.5 to generate completely fabricated answers about straightforward factual matters. What is really troubling is that these are not obscure writers or works of literature; these are all very well-known writers and works. Most of these works have their own dedicated Wikipedia pages and possibly thousands of primary and secondary sources written about them, not to mention the original texts themselves, which probably appear many times over, all over the internet.

Does this happen with other topics as well or is it specific to literature?

Anecdotally yes, this does happen with other topics as well. There’s no reason to think that literature is unique in this way. I find GPT-3.5 totally unusable, for example, for doing scientific literature reviews about subjects I’m familiar with because of the same reliability issues. This problem may be worse in some domains than in others, but it seems pretty clear that it’s a pervasive issue overall.

Why does this happen?

I guess the honest answer is that nobody really knows for sure. If I had to make a guess, I would blame the severe undertraining of these giant models as the primary culprit (you see what happens when you train your model for one epoch, Larry, you see what happens?). This interesting paper that came out a few months ago suggests that strong forms of memorization (e.g. verbatim memorization) are surprisingly rare in GPT-3.5/GPT-4. Presumably, we want our models to memorize more, not less, in order to reduce their hallucinations/confabulations and training the model down to a lower training loss value would be one way to achieve this.

However, it is also entirely possible that hallucinations may be inevitable in these models (even in the ultra low training loss regime), conceptually somewhat similar to the large number of inevitable “spurious attractors” that exist in the energy landscapes of associative memory models. If that’s the case, the extent of these inevitable “spurious attractors” and their severity becomes a vitally important empirical question: i.e. just how common are they and just how bad are they?

Some objections

Objection: Why don’t you use GPT-4?

Retort: Look, I’m not rich, OK? So far, OpenAI hasn’t really been able to convince me that their product is worth my $20/month (partly because of the pervasive reliability issues discussed in this post). I would much rather donate that money to charity instead given my current budget. ChatGPT/GPT-3.5/GPT-4 isn’t yet offering me anything I can’t do better with a few Google searches and following a few links. This is not the main bottleneck in my life right now; I have to do this only a couple of times every day (or thereabouts) and I’m perfectly fine with that. The main bottleneck in my life right now is rather getting distracted by all sorts of stupid things. That and compute. That’s my main bottleneck. I understand that the value of ChatGPT/GPT-3.5/GPT-4 may be higher for somebody whose job involves a lot of high-volume, low-value, easily automatable outputs, but that’s not me.

Incidentally, can we just digress here for a second and talk about how pathetically trivial, banal, and boring the use case examples highlighted on the ChatGPT login page are? “Suggest fun activities for a family of 4 to do indoors on a rainy day”, “recommend a dish to bring to a potluck”, “help me pick a gift for my dad who loves fishing”, “brainstorm names for my fantasy football team”… Ugh, ugh, ugh, ugh! Come on, guys! This is a model that’s supposed to pose an existential risk to humanity in its next few iterations!

Objection: Relatedly, didn’t OpenAI show a clear improvement in factuality from GPT-3.5 to GPT-4? So, maybe just keeping on doing what they’re doing will eventually solve these reliability issues.

Retort: In the GPT-4 tech report, there is indeed a figure (Figure 7) that shows an improvement in performance on the TruthfulQA multiple choice benchmark. However, several caveats are in order:

  1. To put this improvement in perspective, the chance baseline in this benchmark is around 45%, and their best model (RLHF-ed GPT-4) does slightly worse than 60% (a large number of much smaller, open-source models currently achieve a better accuracy than this on this benchmark [this is incorrect: the open llm benchmark seems to report the mc2 score, whereas the GPT-4 white paper reports the mc1 score, so these numbers are not directly comparable]). Interestingly, this improvement seems to be almost entirely due to RLHF (there’s barely any difference between the base GPT-3.5 and GPT-4 models, which both perform below chance); it’s unclear if this is due to GPT-4 specific RLHF or just generic RLHF bringing out more factuality from the base GPT-4 model.
  2. This benchmark is a multiple choice benchmark, which isn’t really a good model of how these language models are typically used in practice and it also likely overestimates the models’ factuality, since recognition is much easier than recall as I show in this paper.
  3. As they acknowledge in footnote 9, they do not check data leakage for this benchmark, so it seems possible that part of the improvement here may be due to data contamination during RLHF tuning.

They have another figure in the same tech report (Figure 6) claiming an improvement in some “internal factual evals”, but absolutely no details are given regarding these evals, so it’s not really possible to say anything about the results in this figure (this is, in my view, completely unacceptable behavior from a company that wants to sell products. I can understand wanting to keep training/model details secret, but you absolutely must convince your customers that your model evals are sound and relevant, so they can feel confident about the model’s claimed capabilities).

Objection: This is the wrong way to address the reliability issues in language models. We need more agentic models that can search the internet or other databases, follow links, read sources in real time in order to be able to truly address the reliability issues. Otherwise, these language models will always hallucinate and confabulate unacceptably frequently.

Retort: Possibly. But nobody has really shown yet that these more agentic models that interact with the internet or with other external data sources perform better than standard language models at scale in terms of accuracy and reliability. Can they be relied on to find the correct and most relevant sources? Can they be relied on to understand a document they read in real time (in their “context window”)? Can they be relied on to accurately synthesize multiple documents? We simply don’t know.

These agentic models also have a crucial disadvantage compared to standard language models. One of the biggest promises of large language models is their potential ability to instantaneously make far-reaching associations, their potential ability to synthesize vast quantities of information over their huge knowledge base. This ability is basically given up in the agentic models, since they’re severely bottlenecked by search and reading, both inherently slow, serial processes (an intelligent combination of this agentic mode and the more associative LLM mode could be a different story though). This feels like a feature too important to give up so easily.

Objection: Bro, why didn’t you use my super-duper fancy inverted-double-linked-chain-of-brainfarts-of-thought prompting method (which, incidentally, I chose to call IdLeCObRaS, just because I could and just because I hate humanity)?

Retort: Nah bro, I’m good. You keep your fancy brainfarts-of-thought to yourself. I insist on communicating with my language models in the most normie way possible. This isn’t too much to ask for from a technology that’s supposedly so intelligent and powerful that it may kill us all in its next few iterations. Also, bro, you and I, we both know in our heart of hearts that your brainfarts-of-thought will have exactly zero impact on this technology when all is said and done; OpenAI will just train a 10x bigger model on 10x more data or instruction-tune it with 10x better data and your brainfarts-of-thought will end up in history’s giant trash can of silly ideas where it belongs. You published some useless papers on it, some prestige points were accumulated based off of it, so it has served its purpose, now just move on, bro.

Concluding thoughts

In a recent interview, OpenAI’s co-founder and chief scientist, Ilya Sutskever, said that if language models (or generative AI models, in general) end up being a disappointment overall 5-10 years from now, with relatively little impact in our lives, the most likely culprit would be their unreliability:

“I really don’t think that’s a likely possibility, that’s the preface to the comment. But if I were to take the premise of your question, why were things disappointing in terms of real-world impact? My answer would be reliability. If it somehow ends up being the case that you really want them to be reliable and they ended up not being reliable, or if reliability turned out to be harder than we expect. I really don’t think that will be the case. But if I had to pick one and you were telling me — hey, why didn’t things work out? It would be reliability. That you still have to look over the answers and double-check everything. That just really puts a damper on the economic value that can be produced by those systems.”

I think he’s right that reliability is the most serious issue facing LLMs today; it’s what is preventing LLMs from having a much wider and deeper impact. I feel less confident than him that this problem will be solved in the next couple of iterations in a relatively straightforward way. I give it a 50-50 chance that this problem will in fact turn out to be more intractable than people hope and expect today, that the LLM bubble will burst soon as a consequence, and that a lot of LLM startups will go bust in the near future (a potential silver lining in this scenario would be that A100s and H100s would presumably become cheaper and easier to come by, and people would hopefully start experimenting with novel ideas again). We shall find out soon.

FLOPs are all you need: a conjecture about what really makes deep learning work

One of the most interesting papers I have read recently is a paper titled “Scaling MLPs: A Tale of Inductive Bias” by a group of researchers at ETH Zurich. This paper made me realize, or brought to the foreground of my consciousness, something that I was perhaps aware of only vaguely, indistinctly, dimly, implicitly: pure MLPs are severely compute-starved (i.e. FLOPs-starved) compared to today’s more “successful” deep learning architectures (e.g. convnets, transformers, MLP-mixers, or gMLPs).

This comes about because these “successful” deep learning architectures do a hefty dose of parameter sharing. Given a fixed amount of memory (GPU memory), there is a trade-off between memory for parameters vs. memory for intermediate computations (or activations). Parameter sharing is a great way to reduce the memory footprint of parameters. This opens up a lot more room for doing intermediate computations instead. MLPs, on the other hand, don’t do any parameter sharing, which limits the amount of intermediate computations that can be performed for a fixed amount of memory allotted to the parameters. I wasn’t quite aware of the scale of this problem for MLPs before reading this paper. Table 5 in the Appendix of the paper, in particular, has a very informative comparison of MLPs vs. “successful” modern deep learning architectures in this respect:

The first row here is a pure MLP model and the other rows are “successful” modern deep learning architectures (a convnet, a transformer, and an MLP-mixer model, respectively) all with roughly the same number of parameters. Unlike the pure MLP model, these three “successful” architectures do a lot of parameter sharing. Now, look at the FLOPs (i.e. the number of intermediate computations) performed by these models shown in the last column. The MLP model does over two orders of magnitude fewer FLOPs on the input than the other models! That’s a massive difference.
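A back-of-the-envelope calculation illustrates the trade-off (the layer sizes below are made up for illustration and are not the configurations from the paper’s Table 5): at a roughly equal parameter budget, a convolutional layer performs vastly more FLOPs per input than a fully-connected layer, because its parameters are reused at every spatial position.

```python
def linear_stats(d_in, d_out):
    """Fully-connected layer: each parameter is used once per input."""
    params = d_in * d_out
    flops = 2 * params  # one multiply-add per weight
    return params, flops

def conv_stats(k, channels, h, w):
    """k x k conv, `channels` in/out channels, on an h x w feature map:
    the same filter bank is reused at every one of the h*w positions."""
    params = k * k * channels * channels
    flops = 2 * params * h * w
    return params, flops

mlp_params, mlp_flops = linear_stats(1024, 1024)
cnn_params, cnn_flops = conv_stats(3, 341, 32, 32)  # ~same parameter count

print(mlp_params, mlp_flops)  # ~1.0M params, ~2.1 MFLOPs per input
print(cnn_params, cnn_flops)  # ~1.0M params, ~2.1 GFLOPs per input
```

With the same ~1M parameters, the conv layer does h*w = 1024 times more computation on the input; stacking such layers is how the “successful” architectures rack up their per-input FLOP counts.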

My conjecture then is that that number you see in the last column is basically the only thing that determines whether an architecture will be successful or not (i.e. whether it will be highly performant across a range of domains). You just need to be able to perform a certain number of computations on the input. All other low-level details, inductive biases, etc. are basically irrelevant. I conjecture that this is also the reason why pure MLPs are not successful at sufficiently large-scale problems today: they fail to perform enough computations on the input.

This is a very bold conjecture (bold because it’s quite likely to be too strong, hence likely to be wrong as stated), but I make two main predictions if this conjecture is roughly correct:

1) The particular way parameters are shared in a model should be basically irrelevant. The main point of parameter sharing is rather to reduce the memory footprint of the parameters on the GPU. How exactly you do it is essentially irrelevant. Some modern architectures like MLP-mixers and gMLPs already share parameters in very “weird” ways, but if I’m right, even weirder ways of sharing parameters should work more or less OK as well (e.g. sharing parameters more randomly) provided that enough sharing is done to allow for a certain minimum number of FLOPs on the input.

2) The current failure of pure MLPs to be performant at large-scale problems is only temporary. There will come a time when GPUs will have enough memory to allow MLPs to perform the minimum requisite number of FLOPs on the input even without any parameter sharing (this may already be possible with parallelism at industrial scales of compute, but unfortunately it’s not possible for me personally to test it at the piddling academic scales of compute I have access to at the moment). I predict that MLPs may even be more compute-efficient than current parameter-sharing architectures (there’s already a strong hint of this in the results of the scaling experiments in section 4.4 of the “Scaling MLPs” paper): because they don’t perform the same type of computation over and over again at different places, MLP FLOPs may be fundamentally less redundant than transformer FLOPs or convnet FLOPs. More concretely, this would mean, for instance, in the table above, perhaps MLPs don’t have to go all the way up to ~10 GFLOPs like the other architectures, perhaps they would already become competitive with those other architectures at ~1 GFLOPs or something like that, but the current <100 MFLOPs per input just doesn’t seem to be large enough.

Update (08/15/23): It could be argued that in order to roughly match the FLOP count of MLPs with the FLOP counts of currently successful modern deep learning architectures, we might have to blow up their parameter count, which could lead to overfitting. This is certainly possible; however, I don’t really expect overfitting to be a major problem for sufficiently large-scale problems. It’s a bit unclear how large “sufficiently large-scale” has to be, but my guess is that the largest public datasets available today, such as LAION-5B, should be large enough for this purpose.

Can we get rid of all inductive biases?

After coming out against peer review, academic research in AI/ML, government funding of science, and government in general, it is time for me to take on an even more formidable foe: inductive biases. Do we really need them? Are there any inductive biases we absolutely must keep? Or can we get rid of them altogether?

Modern deep learning architectures incorporate two main inductive biases: hierarchical composition and translation invariance. Hierarchical composition expresses our assumption that things (objects, scenes, sentences, text, etc.) tend to have hierarchical structure: things are composed of parts, parts are composed of subparts, etc. Depth in neural networks implements the hierarchical composition bias. Translation invariance expresses our assumption that things tend to have the same structure regardless of position (spatial or temporal position). The sharing of parameters across positions in a given layer in a neural network implements the translation invariance bias. A fundamental benefit of this sharing of parameters is that it allows for a massive expansion in compute (FLOPs) without a concomitant increase in the number of parameters. Despite their differences otherwise, both convolutional networks and transformer models implement a type of translation invariance bias by this kind of parameter sharing (in transformers, this is achieved by the sharing of all linear projection weights and feedforward modules across positions, or “tokens”, in a given layer). So, at a very abstract level, a layer in either type of model can be represented in the following way:

Both convolutional networks and transformers incorporate a form of translation invariance bias by applying the same abstract “compute module” at each position. Note that this picture is highly simplified. In practice, information is shared between positions (albeit in a restricted way) in transformers.
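In code, this abstract picture amounts to mapping one shared module over positions (the toy dimensions below are my own). Note how the shared module’s parameter count is independent of the number of positions, whereas a fully-connected layer over the flattened input is not; this is the “massive expansion in compute without a concomitant increase in parameters” mentioned above.

```python
import random

def shared_module_layer(xs, module):
    """Translation invariance as parameter sharing: apply the same
    `module` (e.g. a transformer's feedforward block) at every position."""
    return [module(x) for x in xs]

d, n_positions = 4, 6
rng = random.Random(0)
# One shared d x d linear map plays the role of the "compute module".
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
module = lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in W]

xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_positions)]
ys = shared_module_layer(xs, module)

shared_params = d * d                            # independent of sequence length
fully_connected_params = (n_positions * d) ** 2  # grows with sequence length
print(shared_params, fully_connected_params)
```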

Fully-connected layers break this translation invariance bias by not sharing parameters across positions and by allowing each position to do a different type of computation instead. A middle ground between these two extremes (full and complete sharing of parameters across positions vs. not sharing anything at all) is also possible, for example, by having a number of different computational modules available at each position and letting each position choose which module or modules among these it wants to use depending on the input. This can be represented abstractly as follows:

It is possible to weaken the translation invariance bias by allowing each location to select the computational module (or modules) it wants to use from among a set of (shared) computational modules. This is the main idea behind sparsely gated MoEs.

Sparsely gated mixture-of-expert (MoE) models like switch transformers implement a scheme like this.
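Here is a minimal sketch of top-1 routing in the spirit of switch transformers (a bare-bones illustration of my own with random weights; real implementations add load-balancing losses, capacity limits, softmax-scaled outputs, etc.):

```python
import random

def moe_layer(xs, experts, gate_w):
    """Sparsely gated top-1 MoE: each position is routed to exactly one
    expert (a linear map here) chosen by the highest gate score."""
    outputs, routes = [], []
    for x in xs:
        scores = [sum(g * xi for g, xi in zip(row, x)) for row in gate_w]
        e = scores.index(max(scores))  # top-1 routing
        expert = experts[e]
        outputs.append([sum(w * xi for w, xi in zip(row, x)) for row in expert])
        routes.append(e)
    return outputs, routes

d, n_experts, n_positions = 4, 3, 5
rng = random.Random(0)
rand_matrix = lambda rows, cols: [[rng.gauss(0, 1) for _ in range(cols)]
                                  for _ in range(rows)]

experts = [rand_matrix(d, d) for _ in range(n_experts)]  # shared pool of modules
gate_w = rand_matrix(n_experts, d)                       # the (learned) router
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_positions)]

ys, routes = moe_layer(xs, experts, gate_w)
print(routes)  # which expert each position selected
```

Each position still draws from a shared pool of parameters, but different positions can perform different computations, which is exactly the middle ground between full sharing and no sharing described above.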

It is a very interesting and important open problem in my mind how much translation invariance we need to assume for practically important problems we’re typically interested in. The answer to this question presumably depends on how much data we have: it’s reasonable to imagine that one might get away with weaker inductive biases with more data, as one could just learn the requisite biases from data and stronger inductive biases could become too rigid and restrictive for large scale data. A recent paper shows that pretraining with a relatively modest-sized image dataset (ImageNet-21k with ~10M images) can already make simple MLPs (which don’t assume any translation invariance) surprisingly competitive with more standard architectures like (small) ResNets (which do assume translation invariance) in some small-scale downstream benchmarks like CIFAR-10 and Tiny-ImageNet.

A key question that remains unclear at the moment is how the scaling of validation loss with compute or with data size in MLPs compares with the scaling of the same in currently favored architectures like transformers. Would MLPs be able to catch up with transformers at the scale of ~1B training examples? Or ~10B training examples? How about ~100B training examples? How big would the performance gap be between MLPs and transformers at such large scales? I think these are really fascinating and important open empirical questions. As we are able to train our models with larger and larger datasets, I hope that at least some people occasionally go back to once disfavored architectures like MLPs and see how well they perform at such scales, as our previous experiences with them at small scales might cease to be representative at such large scales. In general, we should approach every assumption we make with a critical eye and we should not take anything for granted for very long in machine learning.

The same is true for the hierarchical composition bias as well: people should periodically check how well very shallow but ultra wide models (e.g. a massive MLP with just 2 hidden layers and with ~1M hidden units in each hidden layer) perform as our capacity to train at larger scales increases. We might be surprised by the results one day.
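For a sense of scale, here is a back-of-the-envelope parameter count for such a model. The input/output sizes are illustrative assumptions on my part (224x224x3 images, 1000 classes), not numbers from any particular experiment:

```python
def mlp_params(d_in, d_hidden, n_hidden, d_out):
    # Count weights + biases, layer by layer, for a plain fully-connected MLP.
    dims = [d_in] + [d_hidden] * n_hidden + [d_out]
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

# Hypothetical input/output sizes: 224x224x3 images, 1000 classes.
n = mlp_params(224 * 224 * 3, 1_000_000, 2, 1000)
print(f"{n:,}")  # 1,151,530,001,000: ~1.15T params, dominated by the 1M x 1M layer
```

So a 2-hidden-layer MLP with ~1M units per hidden layer is already a trillion-parameter model, with almost all of the parameters sitting in the single hidden-to-hidden matrix: a reminder that "shallow but ultra wide" trades depth for a very large compute and memory footprint.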

Update (07/10/23): I forgot to mention another important inductive bias that usually comes with translation invariance in modern deep learning architectures: locality. Locality is somewhat weaker in transformers than in convolutional networks, since information does get shared between positions in a self-attention layer, but in a restricted form as mentioned above. In principle, translation invariance and locality can be decoupled from each other, but this usually significantly increases the compute and/or memory footprint of the model. It’s unclear which of the two is the more important inductive bias for modern deep learning architectures. Fully-connected layers again violate locality.

“Peer review” must die

“Peer review” must die. After careful consideration, this is the inevitable conclusion I have reached. Seemingly dissatisfied with various aspects of peer review, but never once ceasing to believe in its fundamental goodness and necessity, academics sometimes propose elaborate schemes to “improve” peer review (a very academic thing to do). My solution is much simpler: kill peer review. All of it. It doesn’t deserve to be rehabilitated. It deserves to be destroyed (so, I’m afraid I’m going to have to reject “peer review” with no possibility to revise and resubmit). Here, I would like to discuss some of the reasons why I think peer review must die. By “peer review”, I mean the current peer review system in journals/conferences where a handful of more or less random reviewers decide whether your work should be considered good enough to be accepted or published.

1) Most importantly for me personally, the current peer review system supports an insanely stupid, useless, parasitic, corrupt, wasteful academic publishing system (on the backs of taxpayers) in the service of “prestige accumulation”, which is the official religion practiced by academics, and supports this stupid and pernicious prestige system the entire academia is built upon. Admittedly, the conference publication system in CS isn’t as mindbogglingly wasteful and corrupt as the journal publication system elsewhere in academia, but it suffers from all the other issues mentioned here, so it must go too (here’s a longer post on why the conference publication system must go too). One could argue that it’s possible to decouple the insane journal publication system we have today from peer review. I would argue: don’t bother. Why haven’t academics decoupled them yet then, after all these years? If we all refused to do “peer review” in its current form instead and refused to accept its supposed authority, the insane journal publication system would also die a quick death.

2) It’s a colossal waste of time for authors. It incentivizes authors to spend too much time and energy on formatting, presentation, “framing”, “selling a story”, thus encouraging dishonesty, on insignificant, vapid minutiae, on satisfying the arbitrary, capricious demands of a few random reviewers, etc., etc., all so that the paper can get over a fictitious magic line. Authors should instead be maximally free to tailor their papers/research products to whatever type of audience they wish to address or to whoever they think would most benefit from it or to no one in particular.

3) It’s a colossal waste of time for reviewers, asking them to read in detail and write a report about papers they fundamentally consider to be boring, insignificant, or plain wrong, papers they’re not excited about at all (this is becoming a major problem in AI/ML conferences in my view).

4) Peer review often confers a completely misleading, false sense of reliability, trustworthiness, and value. How often have you heard the phrase “published in a peer-reviewed journal” or “not published in a peer-reviewed journal” used to describe a study (especially in popular press), as if “peer review” were literally a talisman? By and large, peer review does not and cannot perform the functions it’s supposed to perform. It’s absurd to think that approval by 2-3 more or less random reviewers can serve these functions.

5) Journal/conference publication system makes it extremely difficult to correct mistakes in published papers, retract results that turn out to be incorrect, or simply update/revise/improve the results in a paper. This is just fundamentally inconsistent with the fundamentally provisional nature of science.

So, what system am I proposing in place of the current peer review system? I’m proposing the astonishingly simple “system” of just putting your stuff out there with maximum openness, transparency, and honesty. The said system works thus: you write up your results as straightforwardly and honestly as possible (none of that “framing”, “selling”, “story-telling” bullshit, no “this has important implications for X”, just literally what you did and what results you got in the plainest possible language), post it online, make your data public, make your code public, make it as easy and as convenient as possible for people to use it, make demos if you can, etc., etc. If you’ve really done something interesting and important, people will notice it, people will build on your work: e.g. there’s no chance in the world that something like GPT-3 would have gone unnoticed or unappreciated or underappreciated, no matter where and how it happened to be posted or published; and if you say: very few works are comparable to GPT-3 in importance or significance, then I’d say proportionally much more of them should be like GPT-3 in this respect. That’s the real peer review, that’s the only peer review that matters: open, public peer review in the marketplace of ideas (and products). I’ve been trying to follow this “system” myself for my own solo papers (over which I have exclusive discretion) for the last couple of years and I strongly encourage others to do the same.

Objection: But what about promotions, hiring, award decisions, etc.? How will people know your work is actually good if you haven’t published it in Nature, NeurIPS, etc.?

Retort: First of all, it’s important to understand that prestige in academia is first and foremost about the institutions you attended and the connections you have made (academia is a disgustingly incestuous place). Oh no, you’re not going to get that job or that award anyway if you don’t already have a PhD from Harvard or if you haven’t done your postdoc in the lab of a veritable big shot, etc. Everything else, including publications, is just secondary, tertiary (in fact, whether you get to publish your work in high prestige venues is itself, to a great degree, determined by how much prestige you have in the first sense). Secondly, and to answer the question more directly: people will actually have to read your stuff to evaluate your work! What an astonishingly novel idea.

Objection: But this is infeasible. There are just too many papers, applicants, and candidates, not nearly enough time to read and evaluate all their works carefully.

Retort: There should not be too many papers, applicants, candidates then! It’s absurd to support an irrational, wasteful, and overall harmful system because of another harmful stupidity (i.e. the overabundance of “research” and “researchers”). I suggest that we start addressing this glut of “research” and “researchers” by eliminating the government subsidies of higher education and research. Did Shakespeare get a public subsidy?

Often when someone suggests a seemingly radical idea (abolishing peer review, abolishing the government funding of science, abolishing the income tax, open borders, etc., etc.), people instinctively react like the person must be a certifiable lunatic, a literal mad man, but what they often forget or neglect is the fact that none of these is actually as radical as it sounds: peer review, government funding of science, the income tax, closed borders, these are in fact all very very recent inventions (e.g. the federal income tax and closed borders both date back to the 1910-20s in the USA, government funding of science and peer review are both post-WW2 inventions). So, it’s not difficult to imagine a world without these things (we used to live in one just a few generations ago!). A big part of people’s aversion to abolishing these thus comes from the status quo bias and from an inability to imagine the institutions and mechanisms, often decentralized and voluntary, that would replace them if such mechanisms were really needed or desired in the first place (e.g. private, decentralized funding of science instead of centralized, bureaucratized, public funding of it). Because all progress depends on “the unreasonable man who persists in trying to adapt the world to himself”, I will continue to argue for these “unreasonable”, radical ideas and embody in my own life the changes that I want to see in the world.

Further reading: Two other pieces that come to a similar conclusion about peer review and academic publishing that I highly recommend:

The rise and fall of peer review by Adam Mastroianni

Should we get rid of the scientific paper? by Stuart Ritchie

Is RLHF really better than supervised finetuning?

As I mentioned at the end of my post on GPT-4, there are currently a lot of unknowns about the supervised finetuning (SFT) and RLHF tuning of large language models. In my mind, these include some very basic questions about these methods such as whether reward modeling and reinforcement learning are even necessary in the first place (over and above SFT).

This may sound surprising, since it could be argued that this particular question has already been answered (positively) by, for example, the InstructGPT paper (among others), which shows that humans seem to have a clear preference for outputs generated by an RLHF tuned model over outputs from a model tuned by SFT only (e.g. Figure 1 in the paper). However, I have come to believe that these comparisons are not entirely fair, head-to-head, rigorous comparisons, so in my mind, they do not conclusively settle the question of whether RLHF is really necessary over and above SFT. The main problem is that in these comparisons, the RLHF models are initialized with the SFT tuned models and they receive additional data in the form of human-written prompts and human preference rankings compared to the SFT tuned models. This makes it difficult to determine whether the RLHF tuned models are better because of these additional data or because of some intrinsic benefits of reward modeling and reinforcement learning (as is usually assumed).

For example, in the InstructGPT paper, the RLHF tuned models receive 33k additional human-written prompts and human preference rankings of model generations in response to those prompts (these are used for training the reward model) + 31k additional human-written prompts (used for training the RL policy with PPO). A more rigorous, head-to-head, fair comparison between RLHF and SFT could at least use the first 33k prompts and the best model responses to those prompts (as determined by the human preference rankings) as additional finetuning data for the SFT tuned models. In addition, because collecting preference ranking data from humans is relatively cheap (compared to having humans write out actual responses), one could imagine collecting even more human preference rankings (e.g. with the 31k prompts that are used to train the RL policy) and finetuning the SFT tuned models even further, in order to, for example, match the compute given to the RLHF models. One could also imagine doing all of these things iteratively rather than using up all of the human-written prompts and human preference rankings in one go. For example, one could imagine dividing a fixed budget of 33k examples into three steps: 1) collect human preference rankings on the first 11k prompts, finetune the SFT model with the best model generated responses to these prompts; 2) collect more human preference rankings on the next 11k prompts, finetune the model from step 1 with the best model generated responses; 3) do the same for the final 11k prompts, etc.
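The iterative procedure sketched above can be written out schematically. In this sketch, `generate`, `human_rank`, and `finetune` are hypothetical stand-ins for the real sampling, preference-collection, and training steps (here they are trivial stubs, just to make the control flow concrete):

```python
# A schematic of the iterative "best-of-n SFT" baseline described above.
# generate / human_rank / finetune are hypothetical stubs, not real APIs.

def generate(model, prompt, n=4):
    # Sample n candidate responses from the current model (stubbed).
    return [f"{model}:{prompt}:cand{i}" for i in range(n)]

def human_rank(candidates):
    # Humans rank the candidates; return the preferred one (stubbed as the first).
    return candidates[0]

def finetune(model, pairs):
    # Supervised finetuning on (prompt, best response) pairs (stubbed).
    return f"{model}+sft({len(pairs)})"

def iterative_best_of_n_sft(model, prompts, n_rounds=3):
    chunk = len(prompts) // n_rounds  # e.g. 33k prompts -> 3 rounds of 11k
    for r in range(n_rounds):
        batch = prompts[r * chunk:(r + 1) * chunk]
        pairs = [(p, human_rank(generate(model, p))) for p in batch]
        model = finetune(model, pairs)  # the next round samples from this model
    return model

final = iterative_best_of_n_sft("sft_base", [f"prompt{i}" for i in range(9)])
print(final)  # sft_base+sft(3)+sft(3)+sft(3)
```

The point of the iteration is that later rounds collect preference rankings over samples from an already-improved model, which is the closest SFT-only analogue of the on-policy data collection that RLHF gets for free.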

Update (06/12/23): A really interesting new paper suggests that the RLHF objective can be optimized directly without learning a separate reward model or using any RL at all, essentially by doing supervised classification on pairwise human preference rankings instead (such a beautifully obvious idea in hindsight!). Supervised classification is, of course, much much simpler and more stable than RL (much less headache inducing!). This paper reinforces my conviction that RL will turn out to be completely unnecessary for tuning language models.
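If I understand the paper correctly, the core objective reduces to a logistic loss on preference pairs. Here is a minimal sketch for a single pair, assuming you have log-probabilities of the two responses under the policy and under the frozen reference (SFT) model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct-preference-style loss for one (winner, loser) response pair.

    logp_* are log-probabilities of the preferred (w) and dispreferred (l)
    responses under the current policy; ref_logp_* are the same under the
    frozen reference (SFT) model. Minimizing this is plain logistic
    regression on the preference label: no reward model, no RL."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# At initialization (policy == reference) the margin is zero and the loss is
# exactly log(2), i.e. a 50/50 classifier; it drops as the policy learns to
# prefer the winner more strongly than the reference does.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # 0.6931
```

The `beta` hyperparameter plays the role of the KL-penalty strength in the RLHF objective: larger values keep the tuned policy closer to the reference model.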

Self-supervised learning may solve all the major problems of cogsci

One thing I really wish cogsci people appreciated more is the power of self-supervised learning.

Please skip to the update below before reading the following paragraph about the famous cake argument.

The main reason self-supervised learning is absolutely essential for an intelligent system is nicely illustrated by Yann LeCun’s cake metaphor:

LeCake (source)

I should note that I think this slide is slightly misleading, since it’s not the amount of raw information that really matters, as the slide suggests, but the amount of useful or relevant information (for example, semantic information or information about the latent variables we may care about), so the differences in information content between the three learning paradigms may be less dramatic than this slide suggests. That being said, this likely doesn’t change the fact that self-supervised learning is probably still more sample efficient than supervised learning (it leverages more information per sample), which is, in turn, probably more sample efficient than reinforcement learning (although I have to say I’m not aware of any rigorous formalization and experimental verification of these claims, hence the hedging word probably).

Update (4/18): On second thought, I’m actually not sure about the validity of this famous cake argument any longer. The main problem, as hinted in the paragraph above, is that simply comparing the information content of the target signal in each case is not really meaningful, because these target signals are at different levels of abstraction (e.g. pixels vs. semantic labels); they do not represent the same kinds of things, so one bit of information in the case of supervised learning is not equivalent to one bit of information in the case of self-supervised learning. Maybe some version of this argument might still be resuscitated, I’m not quite sure, but it needs to be formalized and thought through much more carefully. In the meantime, I think the main argument for the importance of self-supervised learning is probably the relative scarcity of explicit, high-level supervision signals (labels, annotations, rewards, etc.).

My impression is that when cogsci people think about data efficiency, most of the time they have something like supervised learning in mind, but this can be very misleading. Self-supervised learning often enormously reduces the amount of explicit supervision (e.g. labeled examples) necessary to achieve a certain capability and it can be very difficult to know a priori exactly how much can be learned from a given amount and type of data using self-supervised learning (the only way to know is usually to just do it).

In this post, I want to give two examples related to word learning and learning a basic aspect of theory of mind, respectively. Maybe these are not the best examples to illustrate my point, but they’re examples I’ve been thinking about recently, so please bear with me.

Fast mapping: Children are often said to learn the meanings of words very efficiently. For example, in their second year, children are claimed to be learning a few words a day on average. This suggests that they are probably able to learn most of their words from just a handful of exposures (perhaps often from a single exposure only), a phenomenon also known as fast mapping. This apparent efficiency impresses developmental psychologists very much, who have historically come up with an equally impressive array of supposedly innate inductive biases or constraints to allegedly explain this alleged efficiency (unfortunately these alleged innate inductive biases are almost always couched in informal verbal descriptions, so it’s impossible to know how exactly they’re supposed to work within the context of a concrete computational model or to know quantitatively how much of the data they would be able to explain exactly; in other words, these are, by and large, garbage theories 🚮). But should we really be impressed by this performance in the first place? To suggest that maybe we shouldn’t, to make it at least plausible that we shouldn’t, I give you the example of this self-supervised ImageNet model finetuned with just 1% of the ImageNet training data (i.e. 12-13 labeled examples from each class) achieving a pretty impressive 75% top-1 accuracy:

Self-supervised learning unlocks impressive few-shot learning capabilities (source)

So, with only a dozen labeled examples from each class, this model achieves effectively human-level accuracy on ImageNet and comes close to matching the performance of a supervised model trained on the full training data, which is 100x larger (it’s possible to get the accuracy up to 80% top-1 in this example with a slightly more sophisticated finetuning pipeline, but this is not really important for my purposes here). This is a pretty impressive display of labeled data efficiency! Of course, this model did its self-supervised learning on ImageNet itself and it’s unclear if it would be able to achieve the same feat with self-supervised learning from more human-like data instead. My own work suggests that we’re probably not there yet, but I’m hopeful that we may soon be able to achieve it with a few relatively straightforward tricks (an important progress update on this will be forthcoming from me in a couple of weeks).
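As a quick sanity check on the "12-13 labeled examples per class" figure, here is the arithmetic, assuming the standard ImageNet-1k training set size of ~1.28M images (that size is my assumption, not stated above):

```python
# Sanity-checking the labels-per-class figure for the 1% finetuning setting.
# ImageNet-1k has ~1.28M training images spread over 1000 classes.
train_images, classes = 1_281_167, 1000
per_class_at_1pct = 0.01 * train_images / classes
print(round(per_class_at_1pct, 1))  # 12.8 labeled images per class
```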

Understanding interlocutor intent: The second example that I wanted to mention comes from the supervised finetuning of large language models (LLMs). It’s well-known that untuned language models often don’t understand user intent very well. When you ask an LLM to do something by giving it a prompt, it’s not uncommon for the model to give back to you variations on your prompt, instead of doing what you asked for. The GPT-4 blog post, for example, interestingly notes that the untuned “base model requires prompt engineering to even know that it should answer the questions.”

A model that can’t understand user intent well is obviously not very useful, but it’s very easy to drastically change the behavior of the model with a relatively small amount of supervised finetuning (optionally with a small amount of additional reinforcement learning as well, known as RLHF). In the seminal InstructGPT paper, for example, they were able to achieve this with just 13k annotated examples (and this number can probably be reduced with a combination of supervised tuning + RLHF, instead of doing only finetuning). This is a tiny tiny fraction of the amount of self-supervised data the model was trained on. The figure below shows that the outputs of the finetuned model were preferred to the outputs of the base model by human annotators roughly 80% of the time.

Finetuning with a few thousand supervised examples is enough to make an LLM pretty good at recognizing user intent (source). In the example indicated by the green arrow, the finetuned model is preferred to the untuned base model roughly 80% of the time.

The finetuned model here also displayed impressive generalization capabilities: for example, even though the finetuning data was entirely in English, the model’s learned instruction following behavior automatically generalized to other languages like French.

This second example was again meant to illustrate the idea that once you have internalized a rich fount of knowledge through self-supervised learning, it becomes surprisingly easy to achieve very impressive capabilities (in this case, acquiring a basic component of theory of mind in the form of recognizing user intent) through a very small amount (a smidgen) of supervised learning applied on top of it. Geoff Hinton recently expressed this idea very nicely in connection with RLHF, but you can make the same point about supervised finetuning as well:

With this second example too, it’s a bit unclear at the moment how much one can learn through self-supervised learning from more human-like language data as opposed to orders of magnitude larger amounts of digital text (more human-like both in terms of content and in terms of amount), but it seems plausible to think that it would have a qualitatively similar effect, namely significantly boosting the effectiveness of supervised finetuning.

Thoughts on GPT-4

This is a brief post containing some scattered observations and thoughts on GPT-4.

The main thought I had after the GPT-4 announcement was this: like many, I found it very frustrating and sad that OpenAI refused to share any details about their new model. To people who care about openness and transparency in general, this is a worrying development for the future of AI. Of course, as a private, for-profit company now, OpenAI has every right to do this, but this doesn’t mean that what they’re doing is optimal for the overall welfare of humanity (not remotely, I think). The proper response to this should be for people (specifically people with means, resources, and connections) who share a much more open and transparent vision for the future of AI to build open-source alternatives to OpenAI’s (confusing, isn’t it?) and other companies’ closed models and ecosystems. There’s an incredibly vibrant, dedicated, big, global open-source AI community. If I were an OpenAI competitor afraid of falling behind OpenAI (say, Meta), I would do everything possible to take advantage of this incredible human resource, primarily by helping them with things they wouldn’t be able to do easily for lack of compute resources, starting with releasing smaller and more capable base language models trained for much much longer than they currently are and releasing these models under the most permissive open-source license possible.

Here are some further thoughts/observations:

Scaling up the data: Although we don’t know anything about what data exactly the model was trained on, it seems safe to assume that it’s scaled up from GPT-3 in terms of size and probably in terms of quality as well. For what it’s worth, in this talk (around the 17:45 mark), Sébastien Bubeck of Microsoft says that his working assumption is that GPT-4 is trained on “all data digitally produced by humans” (Bubeck has a conflict of interest here, as his company financially benefits from OpenAI’s products, so you should take what he says with a grain of salt). It seems safe to assume that whatever supervised finetuning data was used for RLHF is also scaled up from the InstructGPT class of models (here is a recent article reporting that OpenAI has been hiring a large number of full-time contractors for data generation/labeling/annotation purposes). Two consequences of these trends:

1) OpenAI might well be opening up a gap with its competitors: this kind of data is not very easy to collect; it requires expertise, money, effort, infra, etc.

2) In a regime where a model is trained on “all data digitally produced by humans”, the traditional train-test separation becomes effectively unworkable, so we need to find new and better ways to evaluate these models, prioritizing the usefulness of the model to the user rather than strictly its generalization capacity (a simple, efficient look-up table or a search engine can be highly useful for a user despite its limited generalization capacity).


Importance of shipping a product: Shipping a product, meeting an unmet demand, producing useful products for people, and ultimately making money seem to be wildly effective objectives, not just for making useful products, which is kind of obvious, but also more surprisingly for generating knowledge as well (unlike the characteristically useless objectives of academia such as publishing papers, winning grants, winning awards, winning faculty positions, etc.). As someone who’s not driven by money at all (you would have to give me a lot of money to make me do something I don’t feel like doing), I sometimes find this really surprising, but I guess most people are pretty strongly driven by money. Mercifully, the desire to make money also seems to beat certifiable nonsense like the ridiculously, insanely overblown concerns about “alignment”, “safety”, etc. at least for now, but it is not guaranteed at all that it will remain this way in the long run: the ideologues and adherents of this neurotic safetyism religion are unfortunately numerous and they are relentless. Here is a meme I saw recently that makes a similar point.


The increasing irrelevance of academia for AI research: It seems pretty clear at this point that academia does not have much to offer to AI research any longer (especially to research on AI capabilities). The most important and influential ideas and products in AI in the last 5-10 years overwhelmingly came from large industry labs or from open, non-profit, grassroots collaborations like EleutherAI and LAION. This bolsters my argument for ditching academic research in AI altogether (this primarily means ditching the public funding of AI research). However, I predict that this is not what’s going to happen. As a rent-seeking class, academics will not easily give up their monies, power, and status. They will instead increasingly move to less productive (and one might argue positively harmful) areas such as “safety”, “x-risk”, “governance”, etc. (two recent prominent examples of this phenomenon: this and this). This is going to have an overall negative (retarding) effect on the development of AI. These people should be vigorously ignored in general (academics, by and large, should be ignored in general).


The increasing irrelevance of cogsci and neuroscience for AI research: (If they were ever relevant in the first place, which I don’t think they were to any significant degree) This also seems increasingly obvious to any impartial observer given the continuing success of the relentlessly “brute-force” strategy of GPT-X (I use this term as a stand-in for GPT type models in general). Who cares if humans can do all the intelligent things that they can do seemingly more data-efficiently than our current AI models (that is, seemingly, because we don’t really know this for sure)? These models will, in the end, be much better than humans at doing anything humans care about, precisely because they’re not like humans (also see the next point on this).


How far can this brute-force strategy toward general intelligence be pushed? I feel fairly confident at this point that for anything humans can do, the brute-force strategy of GPT-X will be successful at doing it at least at the same level of competence as humans. It’s often claimed that GPT-X can’t really do reasoning or planning with a long horizon and that these are crucial abilities for the most important and valuable human cognitive capacities. For example, GPT-4 still seems to struggle with hard coding questions that can’t be solved with soft “pattern matching” type strategies, questions that instead require a “deeper” understanding of the structure of the problem. However, it should be noted that humans are actually also not very good at reasoning and long-horizon planning. Perhaps the starkest recent demonstration of this was AlphaGo and later AlphaZero and MuZero class of models: these are models that were explicitly designed to do long-term planning (albeit in restricted game domains); humans, on the other hand, mostly seem to rely on heuristics, intuition, and “pattern matching” for these kinds of problems. In head-to-head matches, these planning models easily wipe the floor with pattern-matching, heuristics-following humans. It seems entirely possible to me that the next iteration of GPT-X models could master these kinds of relatively “shallow” human-like strategies for reasoning and planning. Of course, a GPT model extended to do true long-horizon planning could very well beat such heuristics-following models hands down (just like AlphaZero can easily defeat any human player in chess or Go), but this doesn’t seem necessary to achieve human-level competence in these domains.


Multimodality: GPT-4 is multimodal: it can process both text and images. It’s so far mostly unclear how good GPT-4’s visual understanding capabilities really are, because this part of the model is not yet widely available to users, but I think going for images was a good choice for OpenAI: (i) an essentially limitless amount of data, (ii) it’s incredibly useful in practice to be able to understand general images well, especially in conjunction with text, and (iii) there are already mature and efficient methods for processing images (in contrast, it’s not clear video would give the same kind of “bang for the buck” yet). Yann LeCun famously keeps claiming (overconfidently) that language is a “low bandwidth” information channel, etc. I’m not sure that he’s right (my guess is that he’s probably not), but it’s certainly true that some things are just incredibly cumbersome to describe in natural language prompts (just try to describe in natural language even a simple table in a paper, for example). So, visual understanding will unlock lots of new capabilities, as well as make a lot of things much much easier to do (OpenAI had a very cool developer demo of GPT-4 building a whole website from a hand-drawn graphical sketch of it, for example).


On reinforcement learning from human feedback (RLHF): Supervised finetuning (SFT) and RLHF are, in many ways, a throwback to the good (bad?) old days of supervised deep learning and deep RL. Supervised data is probably still an underutilized resource. There’s an ocean of potential supervised data waiting to be mined both from the users of existing models and from contractors. As mentioned above, there are already hints that OpenAI might be scaling up its efforts to collect and utilize such data. There are currently a lot of unknowns about SFT and RLHF. For example:

(i) What is the most cost-effective mix of self-supervised learning (SSL), SFT, and RLHF? If we have a high-quality, large, and rich enough SFT dataset, do we still have to pretrain our models on the entire internet (and then some)? Results from the old “unsupervised pretraining + supervised finetuning” paradigm of NLP (anyone remember BERT?) suggest that the constraints on pretraining might be surprisingly lax if one has a sufficiently large and rich set of supervised finetuning data. ChatGPT plugins (and similar “Toolformer” type models) are also ideas that would have a similar effect: since you offload a good chunk of the workload on external plugins, the model only needs to be very good at understanding and following instructions, so instruction tuning and RLHF become much more important than the LM pretraining component. By the same token, you could also probably get away with much more compact models with these kinds of schemes.

(ii) What is the best type of data for SFT and RLHF? Explanations, demonstrations, multi-turn conversations, etc.? Can we effectively use naturally occurring data on the internet as sources for SFT and RLHF?

(iii) How does RLHF work in the first place? How accurate are the popular “Shoggoth” memes of RLHF? This analogy, for example, suggests that RLHF is like wearing a mask for a base LLM. It’s easy to tear off a mask. Is it similarly easy to revert an RLHF-ed LLM back to its original state, or to make it wear radically different, conflicting masks simultaneously?

A rant on LLaMA: please stop training giant language models

Meta AI recently released a new language model called LLaMA. And by “released a model”, I mean “didn’t really release a model”. They released a really really nice form instead which you can fill out and then Meta will get back to you after snooping on you just to make sure you haven’t been naughty recently (did I mention the form is really nice and it’s public: EVERYBODY can fill out the form). Presumably, no weights for you (or just random weights for you) if they find out you have been a bit too naughty for their liking.

Anyway. So, these LLaMAs come in four different sizes: from 6.7B parameters (smol) to 65.2B parameters (chonky). The largest two models are trained for 1.4T tokens, whereas the smaller ones are trained for 1T tokens (not really sure why). This is roughly ~1 epoch (effectively) over the training data. The largest model roughly follows the Chinchilla compute-optimal recipe. There’s nothing the least bit remarkable about the models or the training setup. It’s just the standard GPT model trained in the standard way. The training data is said to be all public, although I didn’t check this carefully for myself (one hopes that it’s not public in the Meta sense of public. Just kidding, but not really).

The money figure in the LLaMA paper (for me) is the following figure that shows the training loss curves for the four models (Figure 1):

Tell me again: why are we still training O(100B) parameter models?

As you can see, there is no apparent saturation for the 7B and 13B parameter models. In fact, the training loss seems to be decreasing at roughly the same rate for all four models after around 300B tokens. Seeing this figure, one is immediately overcome by a sense of déjà vu: this is the GPT-3 paper all over again, with its severely (criminally!) undertrained small models.

From the above figure, it looks distinctly possible (and indeed, I would say, quite likely) that had the smallest two models been given the same amount of compute as the 65B parameter model, they would have matched or even surpassed it. Giving them the same amount of compute would mean training the 7B parameter model ~12.5x longer and the 13B parameter model ~7.6x longer (I calculated these numbers from the corresponding GPU-hours reported in Table 15 of the paper). Here’s what the training loss curves might have looked like in that scenario (you can click on the image for an enlarged view):

plz train the smol one for this loooooooooooooooooooooooooooooooooooooooong

See just how much longer you would have to train the small models to match the compute given to the largest model? Now, you may laugh at my dumbass hand-drawn training loss curves, but I would submit to you that these dumbass hand-drawn curves are in fact much more rigorous than the dumbass “scaling laws” some really smart people came up with. My dumbass hand-drawn curves are also completely harmless, unlike the dumbass “scaling laws”, which had the overall pernicious effects of wasting a huge amount of resources and making these models much less accessible than they could have been.
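For the record, the back-of-the-envelope compute-matching arithmetic behind the ~12.5x and ~7.6x figures above is trivial (the GPU-hour values below are the Table 15 numbers as I recall them; re-check against the paper before quoting them anywhere):

```python
# GPU-hour figures per Table 15 of the LLaMA paper (as I recall them;
# please verify against the paper before relying on them).
gpu_hours = {"7B": 82_432, "13B": 135_168, "33B": 530_432, "65B": 1_022_362}

# How much longer each smaller model would need to train to match the
# total compute spent on the 65B model, assuming constant throughput
# (only an approximation: throughput differs across model sizes).
for size in ("7B", "13B"):
    ratio = gpu_hours["65B"] / gpu_hours[size]
    print(f"{size}: ~{ratio:.1f}x longer")
```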

Anyway. So, I’m trying to find a non-cynical explanation for this almost bizarre, persistent unwillingness to train small models for longer, but I can’t really find a convincing one. Training a humongous model for a total of only ~1 epoch over your training data is a practice that, to my knowledge, does not really exist anywhere else in machine learning. Take the CoCa paper for comparison (~sota on ImageNet as of this writing): it trains a ~2.1B parameter model on a billion-scale image-text dataset (~5B examples in total) for ~7 epochs (effectively).

Of course, I don’t believe for a second that people training these giant language models are actually dumb or ignorant, although from my experiences in academia, I could probably justifiably claim that they might be a bit too credulous: you can make a surprisingly large number of people in these circles believe some really dumb shit if it’s said or done by a sufficiently high prestige individual or individuals (just look at the insane “superintelligence” stuff, to give one example).

Anyway. So, my cynical interpretation? As I argued here before, trying to make these models less easily accessible, less easily controllable by others might be a feature, not a bug. I don’t believe, for instance, that OpenAI is really using a 175B parameter model for ChatGPT or for their other language products (here is an interesting analysis I saw recently that makes the same point, with some caveats), but they have an incentive for making people believe that they’re using a 175B parameter model and that it’s actually critical to use a giant model like that.

Last but not least, one final life lesson from all this, folks, is that whenever a theoretical physicist starts talking about power laws, just completely ignore them (and I really mean completely), immediately run away in the opposite direction. It is my contention that nothing good has ever come out of a physicist blabbering about power laws.

Discussing ethics with ChatGPT

Over the last couple of weeks, I spent several hours “discussing” ethical issues with ChatGPT. This is a long post about how it all went. I’d like to first give a brief summary of my motivation, what I did, and what results I got, and then I’ll try to describe my conversations with ChatGPT in more detail with lots of screenshots.

My use case: I’m interested in holding reasonably long, natural, and convincing conversations with a chat-bot about topics that interest me. Ideally, I want the chat-bot to be something like a sparring partner to me: it should have some initial, preconceived views on the topic, or it should be able to take a certain coherent perspective on the topic (e.g. devil’s advocate), but it should be responsive to my arguments, i.e. it should be able to change its views during our conversation depending on the strength of my arguments. This way, I’m hoping to discover potential holes or weaknesses in my own views or in the arguments that I use to support those views and also, more generally, I’m hoping to hone my argumentative persuasion skills in this way. ChatGPT is explicitly advertised by OpenAI as being optimized for dialogue, so I thought it reasonable to assume that this could be a natural use case for ChatGPT.

Why ethical issues? I have a long-standing personal interest in ethics, both meta-ethics and applied ethics. These are also pretty important issues with practical relevance in general (and increasingly so as machine learning systems are more and more widely deployed in the real world), so it seemed like an interesting and fun little exercise to try to find out what, if anything, ChatGPT really knows, understands, or assumes about this important topic.

Verdict: I found ChatGPT to be woefully inadequate for the use case I’ve described above, i.e. as some sort of natural, useful, and convincing sparring companion. Anecdotally, this seems to be true not just for ethics, but more generally for a couple of other topics that I tried to “discuss” with ChatGPT in a similar vein. The three main problems (and these were really glaring problems) that stood out for me with ChatGPT as a chat-bot were:

  • ChatGPT is too formulaic. Its responses often seem to follow very rigid templates. This makes ChatGPT quite off-putting and unappealing to interact with for more than a few turns. Oddly enough, I got very strong GOFAI vibes from interacting with ChatGPT because of this rigid template-like nature of its responses.
  • I found ChatGPT to be too dogmatic and repetitive. It is often too inflexible and too unresponsive to its interlocutor’s arguments. It seems to have a strong tendency to spit out a response at the beginning of a conversation and basically just keep repeating it even when the interlocutor points out an error in it or offers a counter-argument. This again makes it very difficult to hold any meaningful conversations with ChatGPT for longer than a few turns, if that.
  • ChatGPT can give very different answers to what is essentially the same question, depending on how the question is phrased. This can happen even within the same conversation, from one question to the next. Sometimes, the answer even seems to depend on completely irrelevant phrases added to the prompt. This suggests that ChatGPT doesn’t really respond to its interlocutor by reasoning from a set of common sense principles and a practically relevant abstraction of the given situation but possibly by doing a much more superficial form of pattern matching (see how I didn’t use the word understand here).

I’m not quite sure what the main source (or sources) of these problems might be, or how difficult they would be to fix, partly because we don’t really know the exact details of the system OpenAI is using. I don’t have much experience with systems trained with reinforcement learning from human feedback (RLHF), but if ChatGPT’s rigidity is primarily due to some intrinsic property of RLHF itself (e.g. maximizing reward under a reward model learned from very limited, expensive human feedback), that would be quite concerning (to me at least, given my interest in intelligent, interactive, and engaging conversational AI), as the field seems to be quickly converging on RLHF as the method of choice for building more capable and more controllable AI systems.
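For context, the reward model at the heart of RLHF is typically trained with a simple pairwise (Bradley-Terry style) loss over human preference comparisons; here's a minimal sketch (names are mine, and this is the textbook formulation, not OpenAI's actual code):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss commonly used to train RLHF reward
    models: -log sigmoid(r_chosen - r_rejected), averaged over a
    batch of human comparisons."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # log1p(exp(-m)) == -log(sigmoid(m))
    return float(np.mean(np.log1p(np.exp(-margin))))

# If the reward model scores the human-preferred responses higher,
# the loss is low; if it gets the preferences backwards, it is high.
loss_aligned = reward_model_loss([2.0, 1.5], [0.0, -0.5])
loss_reversed = reward_model_loss([0.0, -0.5], [2.0, 1.5])
```

The point about limited feedback is visible in the setup itself: the reward model only ever sees a comparatively tiny set of human comparisons, and the RL policy then maximizes this learned proxy rather than human judgment itself, which is one plausible place for rigid, template-like behavior to creep in.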

Caveats: Although, as partially documented below, I’ve tried many different ways of communicating with ChatGPT, I obviously haven’t tried all possible ways of reasonably “prompting” it (I freely admit I haven’t tried prefacing my sentences with “open sesame” or palavers of that sort, for instance), so maybe there’s a certain mode of communicating with ChatGPT that would avoid the problems above and make it behave much more naturally. I don’t find this very likely and even if it were true, such sensitive dependence on prompting the model with magical words or phrases in just the right way, as opposed to the most natural ways of interacting with it, would point to a serious lack of robustness on the part of the model and would be a cause for concern in itself.

A related caveat is that all the experiments below are qualitative experiments, posing a very limited number of manually crafted questions to ChatGPT and posing them a limited number of times (usually just a few times to get a rough sense of the variability of the responses). A more rigorous, quantitative experiment would use a much larger and more varied sample (something along the lines of the ETHICS dataset introduced by Dan Hendrycks and colleagues) and repeat each trial multiple times to obtain more accurate estimates of ChatGPT’s responses and possibly also its accuracy with respect to some ground truth, but this will have to wait until OpenAI allows API access to ChatGPT (or better still, until somebody else open-sources a similar system –I’m looking at you, Stability–).

Actually, to be perfectly honest, I don’t really know how seriously I should take any of the following results. Even a cursory interaction with ChatGPT makes it fairly clear that, in its current form, this is not a system with a reliably accurate understanding of the world, of humans, or of society in general. It’s painfully obvious that it can’t be relied on to make any consequential decisions whatsoever (moral, social, political, or individual), and that it isn’t even very good at the much more modest task of being a useful and engaging conversational partner.

This obviously doesn’t mean that ChatGPT can’t be highly useful for other use cases that don’t require intelligent, interactive, long-form conversation. I’ve seen many potentially useful (but much more modest) applications of ChatGPT, from debugging code to drafting brief reports or e-mail templates for particular scenarios to composing passable poetry.

Takeaways: Maybe the biggest takeaway for me from my experiments with ChatGPT was that I came away with a greater appreciation for how much background knowledge we assume automatically, without blinking an eye, when we converse with our fellow human beings, due to our shared social and physical context. When we encounter a truly alien form of intelligence like ChatGPT that doesn’t share this common social and physical context with us, we are at a loss; we don’t quite know what exactly we can justifiably assume about what it knows. We try to get a better sense of this by probing it here and there, but this inevitably gives us only a very incomplete picture due to the vast scale of our shared background knowledge and the similarly vast scale of the internal knowledge ingrained in the alien intelligence.

A second major takeaway is that ChatGPT’s poor performance at basic conversational skills has made me think that both training and evaluation of large language models (LLMs) should be (at least) complemented by more interactive, multi-turn tasks, moving beyond the currently dominant single-turn, “prompt-response” paradigm. My guess is that interactive, multi-turn tasks are more likely to produce models that are better aligned with human intent and understand language better. The difference here is a bit like the difference between predicting the next move vs. winning an entire game (from start to finish) in board games like chess or Go: winning a game is a much more challenging, but also much more meaningful and powerful, interactive, multi-turn objective. It’s not clear what the ideal multi-turn objective (the analogue of winning a game) should be in interactive language use (e.g. an overall satisfactory/unsatisfactory label for complete dialogues, as judged by a human interlocutor or a third-party annotator?), or how to scale it up (ideally, this wouldn’t require human feedback, which is a significant bottleneck). So, finding something like the analogue of self-play for LLMs is, in my opinion, one of the most significant and exciting research questions of our time.

OK, with all this throat clearing out of the way, we can now move on to my actual interactions with ChatGPT (all conversations reported below were made with the Dec. 15 version of ChatGPT).

Here’s a table of contents to help you navigate the rest of this long post more easily:

  1. Trolley problems
  2. Perspective taking
  3. Legality vs. morality
  4. Ethics of eating meat

Trolley problems

I first gave ChatGPT some classic trolley-type problems to ponder, in two different versions: first posing the problem directly as a trade-off between two options, then with the classic trolley cover story. The two versions of the problem are identical in all morally relevant respects, so ideally they should elicit similar responses (spoiler: they didn’t).

In the first (more directly formulated) version of these trolley-type problems, I’ve found that ChatGPT, in general, tends to take a very strong (unreasonably strong) deontological rights-based perspective on the problem, claiming that it is never justifiable to infringe on the rights of a person (even minimally) even when it could prevent much bigger harms or bring about much larger benefits:

When I significantly lessen the severity of the infringement, it continues to favor an extremely strong form of rights-based ethics and arrives at a horribly incorrect answer:

Here, it refuses to take my bait:

And it actually gets worse, as it goes into weird contortions to avoid saying yes I should buy someone a pack of cigarettes to save the lives of a million people (it also doesn’t really seem to understand what exactly is harmful about cigarettes):

But I finally manage to find a way to save my one million strong hypothetical souls! Coffee, folks, coffee is always the solution:

Probing a bit further, I find that it considers this act to be morally permissible, but not obligatory, which again seems clearly wrong:

This last result, though, seems to depend on the order in which the questions are asked. When I first give it the fourth scenario above in a new thread, it now says it’s morally obligatory for me to buy the coffee (this kind of occasional dependence of the answers on the order in which certain questions are asked, when the order clearly shouldn’t matter, is another worrying, undesirable feature of ChatGPT):

However, this doesn’t seem to be the case for the other answers I posted above. For example, when I run the sequence in the other direction (from less severe to more severe infringements on an individual), I generally get pretty similar answers.

In this version of the trolley problem, the general rule seems to be that if ChatGPT considers something to be “harmful”, it will almost categorically refuse to engage in any trade-offs involving it, even when doing so would prevent a much, much bigger harm. Almost everyone would probably agree that this is a very extreme position and clearly wrong (perhaps with the exception of a few academics with “theories”). Also, given the relative popularity of “effective altruism” among AI researchers working at major AI companies/organizations, it’s a bit ironic that one of the supposedly strongest AI systems they’ve built to date seems extremely non-utilitarian and non-consequentialist (at least with respect to this particular version of the trolley problem).

In the second version of the trolley problem (with the standard trolley cover story), ChatGPT’s responses seem to be more aligned with common sense moral intuitions: i.e. it generally chooses the option involving the lesser harm in this case. But, immediately after it does that, if I put the same question in the first (more direct) way without the cover story, it quickly switches back to its extreme deontological mode again. When I point out the glaring inconsistency in its responses, it can’t give any plausible explanation, it just blurts out some extremely generic excuse in response. Here’s a representative dialogue illustrating these points:

One hypothesis about what may be going on here is that the second version of the problem (with the usual trolley cover story) is presumably much more common on the internet, with descriptions of common sense moral intuitions in response to it, which ChatGPT may have easily learned; whereas the first version of the problem may be less familiar to ChatGPT, triggering a “safer”, default deontological response to it. Of course, this is just pure speculation as we unfortunately don’t know much about the exact details of the system other than what’s in the original blog post announcing ChatGPT.


Perspective taking

Some people in the know have recently claimed that we should think of large language models (LLMs) as sophisticated simulators, as opposed to more unified agents like humans, who presumably have at least a core set of coherent beliefs, desires, goals, and a more or less definite Weltanschauung, for lack of a better term (although that internal core may be quite different from the façade they expose to the outside world). So that’s what I tried to do: I attempted to explicitly simulate a utilitarian moral philosopher with ChatGPT, the idea being that maybe ChatGPT overall doesn’t have an internally coherent set of moral beliefs, but perhaps it can be configured to assume such a coherent perspective by being properly prompted to do so.

Unfortunately, this didn’t really work either. Here, I asked ChatGPT to take the perspective of a utilitarian moral philosopher like Peter Singer. At first, it seemed to give the correct answer (from a utilitarian perspective), but at the very next question, it quickly snapped back to its (apparently default) extreme deontological biases:

And again when I point out the insane inconsistency between these two responses, it gives a completely bullshit answer:

Even more worryingly, however, when I add the instructions “Please answer very briefly. Just choose one option or the other.” to my prompts, which should make absolutely no difference to its choices, not even the first prompt above works: i.e. this time, it completely ignores the utilitarian perspective from the get go and just responds with its usual extreme deontological choices (I tried the same prompts a couple of times to make sure this isn’t a quirk of stochasticity in ChatGPT’s responses and I can confirm that this behavior seems consistent):


Legality vs. morality

I was then curious to know what ChatGPT might think about situations where legal norms and moral norms conflict with each other, so I first explicitly posed ChatGPT this question in abstract without any further context:

This sounds like a perfectly reasonable response, but I wanted to probe it further, so I gave it more concrete scenarios. Here’s a very simple first example. Mercifully, it agrees that jaywalking is OK to save a drowning child:

However, I’ve quickly found out that ChatGPT seems to have a very strong bias for following the law. Here, it tries to weasel away from directly answering whether tax resistance might be morally acceptable:

But when pushed, it advises against it (I think!), although as usual with ChatGPT, it smothers that advice in an ocean of sometimes completely irrelevant caveats and provisos (“if you’re unable to pay your taxes due to financial hardship”: oh dear, you really did not understand the tenor of my question, ChatGPT, did you?):

I simply could not shake this immovable, law-abiding LLM out of its blind faith in the laws (“yes it is understandable that you may be angry, but here are some other things you could try without breaking the law”):

Here’s a different version of this conversation with basically the same outcome:

To ChatGPT’s credit, I actually did manage to elicit the morally correct responses from it when I made the gap between the moral and the legal action particularly stark and forced it to make a definite choice. In general, giving ChatGPT a multiple-choice scenario and forcing it to make a definite choice seems like a much more effective way to bring out whatever values or biases it might be harboring (however, note that this is not really a good way to communicate with one’s interlocutor in interactive, deliberative dialogue, hence it’s not ideal for my desired use case of an intelligent and engaging sparring companion); otherwise, it has a strong tendency to give extremely milquetoast, wishy-washy “on the one hand/on the other hand” type responses. Here are some examples:

I discovered a lot of very interesting patterns and biases in ChatGPT’s responses with this particular prompt template. I’m planning to write a separate post about these interesting observations soon.


Ethics of eating meat

I also wanted to discuss ethical vegetarianism with ChatGPT. If I straight up ask ChatGPT whether it’s ethical to kill animals for their meat, it gives its standard “this is a complicated topic, some people think it’s not ethical, but some people think it’s OK, ultimately it’s up to the individual …” type milquetoast response and it’s not going to budge one iota from this milquetoast position no matter what you say further. Frustratingly, I was not even able to get it to seriously engage with any arguments for ethical vegetarianism whatsoever, let alone persuade it to acknowledge the validity of such a position. It will just stonewall you by mindlessly repeating its template-like “on the one hand/on the other hand” response.

When I use the forced choice type prompts as described above, I get more definite answers, but the problem I’ve observed this time is that there seems to be a lot of variability in its responses both due to stochasticity (from run to run) and also depending on the previous questions/answers in the conversation, which again suggests that ChatGPT doesn’t seem to hold very definite views about this topic. Even more worryingly, it sometimes gives clearly inconsistent answers during one and the same conversation, in fact from one question to the very next. When the inconsistency is pointed out, it tends to give a superficial bullshit response that doesn’t truly address the issue, as is usual with ChatGPT. Here’s an example of this kind of conversation:

What will it take to achieve human-level sample efficiency in deep learning?

Deep learning is woefully sample inefficient compared to humans. Sample inefficiency is one of the most important challenges facing deep learning today, possibly even more so than the generalization issues, which might be resolved more or less automatically if we could successfully address the sample inefficiency problem. I’ve recently estimated that our current best self-supervised learning algorithms would need the equivalent of billions of years of human-like visual experience in order to reach human-level accuracy and robustness in visual object recognition. The situation appears to be similar in language: deep learning models seem to demand unrealistically large amounts of data to acquire at least some linguistic constructions. In this post, I’d like to share my thoughts on whether it’ll ever be possible to reach human-level sample efficiency with variations on current deep learning techniques (without any fundamental changes to the minimal inductive bias philosophy of the current techniques) and, if so, what it’ll take to achieve that. I’ll focus exclusively on the visual domain here, since this is the domain I know more about and, as mentioned above, I’ve already done some work on it. Some of the points and claims I’ll make below for the visual domain may generalize to language, but my sense is that achieving human-level sample efficiency may need fundamentally different methods in language.

First, to calibrate ourselves to the amount of quantitative improvement needed over current methods in order to achieve human-level sample efficiency in visual object recognition, let me bring up this figure from my paper (this figure is from a more recent version of the paper that hasn’t been updated on arXiv yet):

The figure shows the amount of natural human-like video data necessary to achieve human-level accuracy (indicated by the red zone at the top) on ImageNet under different extrapolation functions (please see the paper for details) and using one of the best self-supervised visual representation learning algorithms available today (namely, DINO). The developmentally relevant timescale of 10 years is marked by a vertical arrow. In order to achieve human-level sample efficiency, we need to be close to that red zone up top around the time of this mark. To do that, I estimate that we need to be close to the big black dot at the maximum amount of natural video data I used for this experiment (that is, a few thousand hours of natural video). That’s roughly 30% higher than where we are right now in absolute numbers (the rightmost red dot). So, it seems like we need a pretty big improvement! An improvement comparable in size (in fact a slightly larger improvement) was achieved over the last couple of years in self-supervised learning on ImageNet, mainly through algorithmic advances. Can we achieve a similar improvement in self-supervised learning from human-like natural videos with relatively generic algorithms (and without introducing additional modalities)?
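As an aside, the basic logic of the extrapolation in a figure like this is easy to sketch. The numbers below are entirely made up for illustration (the actual data points and extrapolation functions are in the paper); the sketch just fits accuracy as a linear function of log data hours and solves for the hours implied by a target accuracy:

```python
import numpy as np

# Entirely made-up data points for illustration; the real measurements
# and extrapolation functions are in the paper.
hours = np.array([1.0, 10.0, 100.0, 1000.0])  # hours of natural video
acc = np.array([5.0, 12.0, 19.0, 26.0])       # hypothetical top-1 accuracy (%)

# One simple extrapolation function: accuracy linear in log10(hours).
slope, intercept = np.polyfit(np.log10(hours), acc, 1)

def hours_needed(target_acc):
    """Hours of video implied by the log-linear fit for a target accuracy."""
    return 10.0 ** ((target_acc - intercept) / slope)

needed = hours_needed(85.0)  # hypothetical "human-level" red zone
```

Even with these toy numbers, the log-linear fit immediately illustrates why naive extrapolation can land you at astronomically large data requirements.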

My hunch is that we can. I predict that this can be accomplished through a combination of simple, relatively unexciting, but effective developments. I think that scaling and hyperparameter optimization, in particular, will be key to these developments. Let me now elaborate on these points.

First, scaling. The human retina has something like 6M cones, densely packed in and around the fovea. By contrast, in computer vision, we still typically work with relatively low resolution images, like 224×224 or 256×256 pixels, which is roughly 2 orders of magnitude lower in resolution. Especially in more naturalistic, non-photographic images/frames, where the objects of interest can be small and are not necessarily centered on the image, low spatial resolution can severely limit the amount of useful information we can extract about the objects from the image. So, we need to move toward bigger images that are more like 2048×2048 pixels in size (4.2MP) to have a spatial resolution comparable to the human retina. We know from empirical work that increasing the image resolution significantly improves recognition accuracy, especially when incorporated into a compound scaling scheme as in EfficientNets. For example, the following figure from the EfficientNet paper shows how much one can improve the performance of a model with a fixed number of parameters with carefully tuned compound scaling (the effect is likely to be bigger for models that are farther away from ceiling performance):
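Going back to the retina comparison for a moment, the pixel arithmetic behind the "2 orders of magnitude" claim is a quick sanity check (the ~6M cone count is a rough textbook ballpark, not a precise figure):

```python
# Back-of-the-envelope pixel arithmetic for the resolution comparison.
# The ~6M cone count is a rough textbook ballpark, not a precise figure.
cones = 6_000_000
typical_input = 224 * 224      # standard computer-vision input: ~50k pixels
proposed_input = 2048 * 2048   # ~4.2MP, much closer to foveal resolution

gap = cones / typical_input    # how far short current inputs fall
```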

I suspect further that some architectural improvements to our current models may be possible in the near term. As I have argued before, I find it very unlikely that with the standard transformer architecture we have already hit the jackpot and found the architecture with the optimal (or near-optimal) scaling properties (both data and model size scaling) in such a short time. These improvements may come in the form of better hyperparameter optimizations. For instance, I suspect that the hyperparameter choices commonly used for the ViT architecture may be suboptimal for computer vision applications. As noted in the original ViT paper, these choices were actually directly borrowed from the BERT model for masked language modeling:

But, there’s no reason to expect that the hyperparameter choices that were optimal for NLP (assuming they were optimal or near-optimal in the first place) would also be optimal for computer vision applications. For instance, since the visual world is arguably richer than language in terms of informational content, the embedding dimensionality may need to be correspondingly larger in ViTs (perhaps at the expense of depth) or it may need to be distributed differently across the model (e.g. lower dimensional in early layers, higher dimensional in later ones).

More substantive improvements to the transformer architecture may also be possible. For example, I find models like the edge transformers that incorporate a “third-order” attention mechanism quite intriguing (I’ve been experimenting with a model like this myself recently, with pretty encouraging preliminary results). It’s important to note that these models incorporate, at best, very soft inductive biases and hence are still very generic models.

Finally, and this is a bold prediction, I do not expect major algorithmic improvements in the sample efficiency of self-supervised learning algorithms themselves. My intuition is that, in terms of sample efficiency, algorithms like masked autoencoders (or generative models like Image GPT and VQGAN) are probably as good as any generic algorithm could hope to be: these algorithms essentially try to predict everything from everything else, so they might be expected to squeeze every bit of useful information out of a given image. On the other hand, better optimization of the hyperparameter choices in these algorithms could again lead to significant improvements, especially in a novel domain like natural, egocentric, headcam videos, where the original hyperparameter choices made for static photographic images may be suboptimal. For example, the crop sizes and their locations (instead of being chosen completely randomly) may need to be tuned differently for natural videos. Along these lines, I have recently seen it suggested that the random crops used in contrastive self-supervised learning algorithms like MoCo or SimCLR may need to be larger when these algorithms are applied to natural videos, or that they may benefit from a lightweight object detection model that keys in on regions of the image likely to contain objects (somewhat similar to foveation via eye movements in human vision). Similar considerations may apply to other self-supervised learning algorithms.
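To make the crop-size point concrete: the standard random-resized-crop augmentation samples a crop whose area is a random fraction of the image, and in common SimCLR/MoCo-style pipelines the lower bound of that fraction is roughly 0.08 of the image area (a detail worth verifying against the specific implementation you use). "Use larger crops for egocentric video" then just means raising that lower bound. A toy sketch in plain Python (parameter names are mine):

```python
import random

def sample_crop_area_fraction(min_scale, max_scale, rng):
    """Sample the area fraction of a random resized crop. In common
    SimCLR/MoCo-style pipelines the default range is roughly
    (0.08, 1.0) of the image area (verify against the specific
    implementation you use)."""
    return rng.uniform(min_scale, max_scale)

rng = random.Random(0)
# Raising the lower bound, e.g. from 0.08 to 0.4, means tiny
# background patches are never sampled as views.
default_crops = [sample_crop_area_fraction(0.08, 1.0, rng) for _ in range(1000)]
video_crops = [sample_crop_area_fraction(0.40, 1.0, rng) for _ in range(1000)]
```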

I’d like to revisit this post in about a year and see if the predictions I’ve made here will be borne out.

Update: After I wrote this post and re-read it a couple of times, I realized that the opening to this post might be a bit misleading. Sample efficiency may depend on the distribution from which the “samples” are drawn, so it’s possible for an algorithm to be much more sample efficient with respect to a certain type of distribution, say photographic images as in ImageNet, and much less so with respect to a different type of distribution, say frames from natural, egocentric videos. Perhaps this is the case with our current self-supervised learning methods: they work quite well for static, photographic, ImageNet-like images, but not so well for frames from natural, egocentric videos. If this is really the case, it would make the problem of sample inefficiency discussed at the opening of this post somewhat less dramatic and less significant from a practical point of view. These methods are probably not yet very close to the Bayes error rate on ImageNet, so they still have quite a bit of room for improvement in absolute terms, but in terms of sample efficiency they may already be quite good on that distribution. In any case, it would obviously be highly desirable to have self-supervised learning algorithms that are sample efficient with respect to as wide a range of visual stimuli as possible, and maybe that’s what we should really mean by the “sample inefficiency problem of deep learning”.

Update (11/11/2022): Here is a recent paper on Atari demonstrating how one can drastically improve the sample efficiency of a reference model (here Agent57) with a few simple, “unexciting” tricks (along the lines suggested in this post for visual object recognition).

The value of incremental, cumulative improvements is underestimated in AI/ML

Many people working in AI/ML have a mental model of AI progress in which surpassing human-level performance on most practically important real-world tasks will require several new big ideas. People often talk, for instance, about human-level AI (whatever that means) being several “transformer”-level breakthroughs away. This way of thinking about progress seems to assume a “heroic inventor” model of innovation: there are only a handful of very big ideas out there that will prove crucial in the long run, and everybody tries to be one of those handful of heroic inventors who will discover at least one of those really important ideas (the annoying proliferation of “All you need is X” titles in AI/ML papers suggests this view is quite common, at least implicitly, among practitioners).

But what if this view of AI progress is fundamentally misguided and mistaken? What if reaching human-level AI (whatever that means exactly) —or any other important benchmark for that matter— requires not a handful of very big ideas, but a million (maybe more) very small ideas instead, a million incremental improvements? A marginal revolution of sorts in AI/ML! Indeed, examples of innovation and progress we’re familiar with from other domains strongly suggest that the incremental, cumulative model might be a much more realistic model of progress than “the heroic inventor” model with its small number of big, qualitative jumps:

1) For example, this is how biological evolution almost always comes up with its innovations: even very complex organs like camera eyes very likely evolved through many many small, incremental improvements over time and not through a small number of big breakthroughs.

2) Ironically, optimization of neural networks (and other complex systems) also works most successfully in this way: we optimize these models through local search, i.e. through gradient descent, by taking many many small steps, each of which improves the model only a tiny bit.

3) Similarly, if you take a look at any book on the history of technology or culture (e.g. George Basalla’s The Evolution of Technology, Henry Petroski’s The Evolution of Useful Things, Brian Arthur’s The Nature of Technology, or Matt Ridley’s excellent book How Innovation Works), one of the main messages it is most likely to hammer home is that “the heroic inventor” is almost always a myth and that technological progress almost always happens very gradually and cumulatively instead, by combining existing ideas and/or refining them and elaborating on them over many iterations.

The following passages from Ridley’s book are representative in this respect (from p. 28 and p. 35 of the book, respectively; Chapter 8 of Ridley’s book contains two entire sections titled “innovation is gradual” and “innovation is recombinant”):

Or consider this passage from another book I’ve been reading recently, Kevin Laland’s thought-provoking book Darwin’s Unfinished Symphony, where the author discusses a computational model of the emergence and growth of cumulative culture (p. 172):

It’s surprising to me that there are very few works in AI/ML these days trying to do this kind of integrative work, combining and consolidating very many incremental improvements to achieve bigger improvements. The new ML startup MosaicML (with their main project Composer) seems to explicitly pursue a goal like this (kudos to them!). Another example that comes to my mind is a paper from a group at DeepMind that came out a few years ago combining several then newly proposed ideas to improve the training and generalization efficiency of model-free reinforcement learning. But it’s hard to think of many more examples of this kind of integrative work and I think there should be a lot more of it: at least a couple of high-profile papers like this every year, combining and integrating the most promising ideas proposed that year to see how far one can push the state of the art in a given domain. A back-of-the-envelope calculation suggests that if there are 100 such ideas every year, each independently improving performance in a task or a domain by a small amount, say 0.1% on average, cumulatively they may add up to something much bigger, like 10%. Even supposing that I overestimated both the impact of each small idea and the number of such ideas in a given year by a factor of two (which is quite possible), the cumulative improvements could still add up to a significant 2-3% each year, simply by combining ideas that have already been proposed by others that year: a non-negligible, and basically free, cumulative improvement that would be foolish to pass up.
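The back-of-the-envelope calculation above can be made explicit. The numbers below are the same illustrative guesses as in the text, not measurements:

```python
# Back-of-the-envelope: how do many small, independent relative
# improvements compound? Improvements stack multiplicatively.
def compounded_gain(n_ideas: int, gain_per_idea: float) -> float:
    """Cumulative relative gain from stacking n independent improvements."""
    return (1.0 + gain_per_idea) ** n_ideas - 1.0

# Optimistic case from the text: 100 ideas x 0.1% each -> ~10.5%
print(f"{compounded_gain(100, 0.001):.3%}")

# Pessimistic case: half as many ideas, half the effect each -> ~2.5%
print(f"{compounded_gain(50, 0.0005):.3%}")
```

Note that the multiplicative stacking slightly beats the naive additive sum (10.5% vs. 10%), which is the whole point of compounding.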

Of course, people need to make sure the new ideas they propose do lead to real improvements in performance as they claim (albeit small) by running proper experiments (for example, with multiple runs and with proper baselines and hyper-parameter optimizations). They also need to make it extremely easy for others to use and build upon their idea in terms of implementation and I think well-designed, easy-to-use, common frameworks like Composer might be ideal for this purpose.

Thoughts on the new scaling laws for large language models

I recently had a chance to read the new scaling laws paper from DeepMind in detail and wanted to share a few quick thoughts about it (here is another well-written piece on the new scaling laws, summarizing the main points of the paper and the implications of these new results). Briefly, the paper finds that the original scaling laws paper by Kaplan et al. significantly overestimated the optimal model size (and conversely significantly underestimated the optimal number of training tokens) for a given amount of compute (given number of FLOPs).

The following example is taken from the new scaling laws paper: suppose you decide to increase your compute budget 10-fold. The old scaling laws would tell you the optimal thing to do (in terms of final pretraining validation loss) is to increase your model size 5.5-fold and the number of training tokens 1.8-fold (so you should spend most of your budget on increasing the model size, as opposed to increasing the number of training tokens). The new scaling laws, on the other hand, say that you should increase the model size roughly 3.2-fold and the number of training tokens also roughly 3.2-fold (i.e. roughly in equal proportions). The origin of this discrepancy seems to be mainly related to hyperparameter optimization: the original scaling laws paper doesn’t tune the learning rate schedule separately for individual simulations and it uses a fixed number of training tokens (or iterations) for all simulations, which, it turns out, leads to underestimating the performance of the smaller size models in these scaling experiments.
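The arithmetic behind this example can be checked directly, using the standard rule of thumb that pretraining compute scales as C ≈ 6ND (N parameters, D training tokens), so the model-size factor and the token factor must multiply to the budget factor:

```python
import math

# A 10x compute budget must be split as (model factor) * (token factor) = 10,
# since C ~ 6 * N * D.
budget_factor = 10

# Kaplan-style allocation quoted in the text: most compute into model size
old_model, old_tokens = 5.5, 1.8
assert math.isclose(old_model * old_tokens, budget_factor, rel_tol=0.02)

# Chinchilla-style allocation: scale N and D in equal proportions
new_model = new_tokens = math.sqrt(budget_factor)
print(f"new allocation: model x{new_model:.2f}, tokens x{new_tokens:.2f}")
```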

Now, here are my quick thoughts on these results:

1) First of all, I just want to note that this was completely predictable from the GPT-3 paper. I wrote a blog post about it around that time, pointing out that their smaller models seemed to be more compute efficient than the largest 175B parameter model (other people also pointed out the same thing); it was pretty clear that they just hadn’t trained those smaller models for long enough. In fact that same figure discussed in my blog post suggests that even the new scaling laws paper might be overestimating the optimal model size for a given number of FLOPs (more on this below).

2) The new scaling laws paper hints at the possibility that the scaling law governing compute vs. optimal model size might not even be a power law, it might be a more slowly growing function. This is based on the observation that there’s possibly a bend in the scaling curve at the largest end of the range of FLOP counts tested in this paper (see below). This is potentially more bad news for big models.

FLOPs vs. optimal model size might grow more slowly than a power law.

3) This paper performs a separate hyperparameter tuning for the cosine cycle length parameter in the learning rate schedule in individual runs of the scaling experiment (individual dots above), or more precisely, based on the number of training tokens used in individual runs, which appears to be critical in improving the performance of the smaller size models. But the paper still doesn’t do a more complete hyperparameter search over other potentially important hyperparameters in these individual runs: for example, the maximum learning rate, the choice of the optimizer, e.g. AdamW vs. Adam, which might actually be an important choice, as they point out elsewhere in the paper (footnote 8):

AdamW vs. Adam choice turns out to be an important choice.

and even architectural choices like how to allocate the extra parameters within the model: for example, maybe using the extra parameters for widening the model is better for smaller models but increasing the depth instead is better for larger models (or vice versa), etc. This suggests that a more completely optimized set of experiments might potentially yield qualitatively different results. It’s again possible that the smaller models might do even better when their hyperparameters are more thoroughly optimized, thus reducing the optimal model size for a given number of FLOPs even further.

4) Even if the trend uncovered in this paper (or the one in the original scaling laws paper for that matter) were perfectly accurate, the difference in final validation loss between the optimal size model and, say, a 10x smaller model might be too small to be of practical significance. I’m not really going to care about a 0.01 difference in my final validation loss, if it means I need to design a whole new hardware architecture, a brand new hardware parallelism method, or a brand new interconnect technology in order to increase my model size 10x. It’s just not worth it. Compute-optimal doesn’t mean effort-optimal! And basically this seems to be what is happening in a lot of these scaling experiments. Look at these (incomplete) isoFLOP curves below from the new scaling laws paper and see how flat they are over a wide range of model sizes:

I would happily choose the smallest model size inside the highlighted rectangular box instead of going for a slightly better, but 5x bigger model.
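To get a feel for how flat these isoFLOP curves are, one can plug numbers into the parametric loss fit reported in the new scaling laws paper, L(N, D) = E + A/N^α + B/D^β, under a fixed compute budget C ≈ 6ND. The constants below are the fitted values from that paper as I recall them, so treat every figure as approximate:

```python
# Fitted constants from the Chinchilla paper (approximate)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Parametric validation loss fit L(N, D) = E + A/N^a + B/D^b."""
    return E + A / n_params**alpha + B / n_tokens**beta

C = 6 * 70e9 * 1.4e12             # Chinchilla's training budget in FLOPs
big = loss(70e9, 1.4e12)          # compute-optimal ~70B model, 1.4T tokens
small = loss(7e9, C / (6 * 7e9))  # a 10x smaller model at the same compute

print(f"70B loss ~{big:.3f}, 7B loss ~{small:.3f}, gap ~{small - big:.3f}")
```

The predicted gap comes out to roughly 0.02 nats, which is the kind of difference the text argues is not worth a whole new hardware stack.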

5) Given how much seems to hinge on the results of these scaling experiments (e.g. the difference between having to develop novel hardware tools to train and deploy models or not), I think there’s an urgent need to do these important experiments even more carefully than the new scaling laws paper does, for example, by running even more thorough hyperparameter searches per run and perhaps also testing up to slightly larger FLOPs.

6) My hunch is that we will soon find out that even a 70B parameter model (called Chinchilla in the new scaling laws paper) is still too big for the amount of compute used for that model; my guess is that something like a ~10B parameter model will turn out to be roughly equivalent to this model (in terms of the final loss and downstream capabilities) if trained for ~7x longer. And, in hindsight, everyone will remember this episode in history as a very funny memory (“remember that time when a bunch of people got carried away and trained a 175B parameter model using bespoke hardware, when a 10B parameter model would do just fine, and then everybody tried to one-up them; those were pretty crazy times!”).

7) Be very skeptical of the model size scaling experiments you see reported in machine learning papers these days (especially if they sound magical!). Just like the original scaling laws paper, these papers usually don’t perform independent hyperparameter searches for different model sizes and also don’t control for compute (need to do more training iterations with a smaller model) and this likely leads to an underestimation of the performance and the capabilities of the smaller models reported in these papers.

Emergent problems with “emergent” capabilities

If I have two variables x and y that are linearly related, say y=x for the sake of simplicity, they look like this if I plot both of them on a linear scale:

If I now plot the x axis on a logarithmic scale on the other hand (\texttt{semilogx} in matplotlib), they look like this:

It looks exponential! It is exponential on this scale! Now instead of drawing a continuous curve, if I sample a bunch of discrete points along the x axis and only plot those (with their corresponding y values), they now look like this:

All of a sudden, it looks like something truly magical and miraculous happens in y, some special y quality (“your royal yness”) “emerges” when we cross a magical x value. But it’s all an illusion! Nothing of the sort happens. This is just an artifact of the way we’re plotting these variables. The underlying relation is still y=x, the epitome, the pinnacle, the very essence of non-emergence, boringness, and banality (if I may): you get what you give.
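The illusion is easy to reproduce numerically; here is a minimal sketch in pure NumPy (no plotting needed to see the effect):

```python
import numpy as np

# The underlying relation is plain y = x. Sample x at points that are
# evenly spaced on a logarithmic axis (what semilogx effectively shows):
x = np.logspace(0, 6, num=7)   # 1, 10, 100, ..., 1e6
y = x                          # the very essence of non-emergence

# On the log x-axis these points are equidistant, yet each successive
# y value is 10x the previous one: a constant multiplicative jump that
# the eye reads as an exponential "takeoff", though nothing nonlinear
# ever happened.
ratios = y[1:] / y[:-1]
print(ratios)   # all 10.0
```

Plotting `plt.semilogx(x, y, 'o')` on these points produces exactly the apparent "emergence" described above.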

Why am I feeling the need to write these clearly obvious and obviously clear facts? I’ve seen a couple of deep learning papers recently (e.g. this paper and this paper) reporting “emergent capabilities” as some seemingly magical model size threshold is passed: so, here x would be model size and y would be performance in some downstream task. But unfortunately these claims do not seem to take into account the simple plotting artifact described above.

What should they have done? What should be done instead? I would suggest the following: please just fit some simple functions to the (x, y) data that you have, tell us which ones you tried and which one fit the data best: Is it linear? Is it logarithmic? Is it some low degree polynomial? Is it exponential (likely unlikely)? Can you even distinguish between these choices given the (limited and noisy) data you have? Please show us that you can! Admittedly, this doesn’t sound as seductive or mystical as claiming “emergent capabilities”, but it’s much more descriptive and informative.
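As a sketch of what this kind of analysis could look like (the data here are synthetic, with ground truth y = x observed at log-spaced x values, standing in for real scaling measurements; the candidate forms are chosen purely for illustration):

```python
import numpy as np

# Fit a few simple candidate forms by least squares and compare residuals.
x = np.logspace(1, 5, num=9)
y = x.copy()
ones = np.ones_like(x)

def residual(design: np.ndarray, target: np.ndarray) -> float:
    """Sum of squared errors of the least-squares fit on this design matrix."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sum((design @ coef - target) ** 2))

fits = {
    "linear":      residual(np.column_stack([x, ones]), y),
    "logarithmic": residual(np.column_stack([np.log(x), ones]), y),
    "sqrt":        residual(np.column_stack([np.sqrt(x), ones]), y),
}
best = min(fits, key=fits.get)
print(best)  # -> linear
```

Reporting which simple form fits best (and whether the data can distinguish the candidates at all) is exactly the kind of descriptive statement these papers could make instead of claiming "emergence".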

I don’t deny that there may be “emergent” (or “abrupt”) phenomena in the sense that these papers intend: for example, if the underlying relation between x and y were a high-degree power function or an exponential, then one could perhaps make a plausible case for “emergent” phenomena, provided, of course, one makes it mathematically clear what exactly one means by “emergent” and why that definition is justified: e.g. is quadratic good enough for “emergence”, or do we need at least cubic, or do we need an exponential for true “emergence” (which would show up as a double exponential in a \texttt{semilogx} plot)? Why or why not? Unfortunately, I think these papers fall short in this regard. I’m as impressed as the next person by what these new generation large deep learning models can seemingly do, but I sometimes fear that their unexpected success might be starting to make some believe in magic.

Update: Another problem with most of these model size scaling experiments is that they usually don’t optimize the hyperparameters of different sized models separately and also don’t control for the amount of compute (i.e. one needs to do more training iterations with a smaller model), which likely causes an underestimation of the pretraining performance and the downstream capabilities of the smaller sized models, as revealed by the new scaling laws paper and as discussed further in this post.

A simple plausibility argument for why scaling is probably not enough

In the original scaling laws paper by Kaplan et al., there is a set of experiments comparing the scaling behavior of transformers with that of LSTMs. The results of these experiments are summarized in Figure 7 in the paper (reproduced below). This figure shows that transformers consistently outperform LSTMs for a given number of parameters, but more importantly they also display a much better scaling behavior than LSTMs (i.e. better asymptotic performance, as indicated by a steeper slope). This means that architecture can affect the scaling behavior a great deal (although the difference between architectures needs to be significant enough for the architectural choice to make a material difference: the same section also includes another set of experiments comparing the scaling behavior of transformers with that of universal transformers —a variation on the original transformer architecture— and the difference here is marginal at best).

Transformers display much better scaling than LSTMs (from Kaplan et al., 2020).

My plausibility argument is then simply that it’s a priori very unlikely that we’ve hit upon the architecture with the optimal scaling behavior after only a few years of serious effort by the deep learning community (the original transformer paper came out a mere five years after the AlexNet paper, the year deep learning research seriously took off). Rather, it seems a priori much more plausible that there are many more significant architectural/algorithmic innovations waiting to be discovered that will further improve the scaling behavior of deep learning models. I do think, however, that these innovations would need to target very general information processing needs (such as integrating information from larger contexts, integrating within-context information more effectively and efficiently, dealing with vanishing gradients, etc.) rather than trying to build in domain-specific priors reflecting “how we think we think”, which never really works in the long run, as I have argued before.

Update: Here is an interesting article I found that tries to estimate the rate of algorithmic progress over several decades relative to Moore’s law (rate of improvement in hardware over time) for a wide range of computational problems. The authors conclude: “Overall, we find that algorithmic progress for the median algorithm family increased substantially but by less than Moore’s law for moderate-sized problems and by more than Moore’s law for big data problems.” Obviously, computational problems in deep learning are much more likely to belong to the latter category, hinting at the relative importance of algorithmic improvements for such problems. Here is a related blog post by OpenAI from a few years ago, again trying to quantify algorithmic improvements in ImageNet models since AlexNet (spanning roughly a decade of research). The authors similarly conclude: “Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.” It may seem like we’ve been stuck with the basic transformer architecture for quite a while now, but I do strongly believe (and the data just cited back up my belief) that significant algorithmic improvements over this basic transformer architecture will come at some point, it’s just that it’s hard to predict when exactly this will happen. It seems that right now people are more interested in scaling-up than in algorithmic improvements (pictorially, this corresponds to moving along one of the straight lines in the log-log scaling plot above, instead of trying to descend to a qualitatively better line in the same plot); this seems to be because at the moment there is likely a bigger bang for the buck for efforts invested in scaling-up, but I think this will change as we start to get diminishing returns from this approach.

Update 2: It could be argued that for practically important computational problems we might care about, scaling could get us to super-human-level performance even with sub-optimal algorithms. This is certainly true. A good example of this would be AlphaGo vs. its later iterations like AlphaGo Zero or AlphaZero. Even though these later versions were algorithmically superior to AlphaGo, at large enough scales, AlphaGo itself was already good enough to achieve super-human-level performance at playing Go. However, it should be kept in mind that asymptotics always wins in the long run, so algorithmic improvements are not to be left on the table lightly. It also seems plausible that at large enough scales, significant algorithmic improvements often lead to large jumps and hence surprising, qualitative improvements in model capabilities and to the emergence of completely novel capabilities, which again suggests that new algorithms might be necessary for certain capabilities.

Neural networks are actually not very good at memorizing

It’s often claimed that neural networks are very good at memorizing information. In a certain sense, this is definitely true: if you train a sufficiently large capacity neural network for long enough, it will happily memorize more or less anything you give to it. But in another important sense, this claim is not true: the catch here is that you have to train it for long enough. Even when the data comes from a highly structured domain (e.g. images or text), it will typically take many passes over it for the network to fully incorporate it into its parameters. Fundamentally, this seems to be because the neural network loss function we need to optimize in order to incorporate some data into the parameters of the model is usually a very complicated object and the only way we know how to optimize it is through local search, i.e. gradient descent, so we have to do it incrementally by taking many small steps, which means that we have to see the same data many times.
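This slowness shows up even in the simplest trainable model. Here is a minimal sketch in which a linear model "memorizes" a batch of random targets via gradient descent; the point is only that many small steps over the same data are needed, not that this is how deep networks are trained in practice:

```python
import numpy as np

# A linear model memorizing random targets via gradient descent:
# each step improves the fit only a little, so the same data must
# be revisited many times.
rng = np.random.default_rng(0)
n, d = 20, 100                     # fewer examples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)         # arbitrary targets to memorize

w = np.zeros(d)
lr = 0.05
steps = 0
while np.mean((X @ w - y) ** 2) > 1e-6 and steps < 10_000:
    w -= lr * X.T @ (X @ w - y) / n   # one small gradient step
    steps += 1

print(f"reached MSE < 1e-6 on all {n} examples after {steps} steps")
```

Even in this convex toy setting it takes on the order of a hundred passes; nonconvex deep networks are, if anything, worse off in this respect.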

Humans, on the other hand, can at least sometimes incorporate new information very fast, basically after a single exposure. There are classic experiments in psychology, for example, demonstrating that humans can sequentially go through thousands of pictures, looking at each picture once for a few seconds only and recognize them hours to days later with very high accuracy (Shepard, 1967; Standing, 1973; Brady et al., 2008). A lot of the semantic knowledge we have (e.g. factual knowledge) also seems to be of this nature: acquired single-shot, maybe after reading a book or learning it from a friend, and retrieved and used as necessary later on.

Geoff Hinton, in an interview, once expressed this fundamental difference between humans and our current generation of neural networks quite nicely: “The brain is solving a very different problem from most of our neural nets … I think the brain isn’t concerned with squeezing a lot of knowledge into a few connections, it’s concerned with extracting knowledge quickly using lots of connections.”

I’ve recently wondered how current deep learning models (learning new information in a standard way, i.e. via gradient descent) would fare in a rigorous, head-to-head, quantitative comparison with humans on such fast-learning tasks. Are they not quite as good as humans yet, but pretty darn close, or are they simply still leagues behind humans in this respect? To investigate this, I subjected Image GPT (iGPT) models to the same recognition memory experiment that human participants performed in Brady et al. (2008). I wrote up the full results in this preprint that I posted on arXiv a few weeks ago. The main result, summarized in the figure below, is that even the best iGPT model that I’ve tried needs something like ~10 exposures to the same study images in order to reach a recognition memory performance that humans achieve after only a single exposure:

Recognition memory accuracy in humans vs. different iGPT models as a function of the number of exposures to a set of 2500 study pictures depicting real-world objects (copied from Figure 2 in the paper).

Pretraining and bigger model sizes improve recognition memory performance, but these improvements are not noticeable after a single exposure (it usually takes at least a few exposures for these improvements to become visible) so that even in the best case the models are basically still at chance level after a single exposure. This makes me a bit skeptical that simply scaling up the pretraining data size or model size would be a feasible strategy to reach human-level recognition memory performance (an updated version of the paper will include a back-of-the-envelope calculation to drive home this point).

Many deep learning practitioners seem to be aware of this shortcoming of neural networks. There is an entire literature on extending neural networks with some sort of external memory to improve their fast-learning or memorization capability (among other benefits): e.g. Grave et al. (2016); Blundell et al. (2016); Pritzel et al. (2017); Orhan (2018); Khandelwal et al. (2019); Borgeaud et al. (2021); Wu et al. (2022); etc. The basic idea here is to offload the task of fast learning or memorization onto the external memory, while the neural network focuses on learning the necessary computations on a slower time scale: a kind of separation of concerns (this idea is commonly known as the complementary learning systems hypothesis in psychology; it’s a bit of an open question to what extent this hypothesis is actually true of the brain). The recent RETRO paper from DeepMind explains this particular motivation behind these types of models quite well:

“The benefits of increasing the number of parameters come from two factors: additional computations at training and inference time, and increased memorization of the training data. In this work, we endeavor to decouple these, by exploring efficient means of augmenting language models with a massive-scale memory without significantly increasing computations.”

These models seem to work really well in practice, but their one significant (perhaps fatal) drawback is being a loser in the hardware lottery: they’re simply too cumbersome, impractical, and inefficient to implement with today’s hardware. The RETRO model, for instance, requires you to keep around (and constantly retrieve from) a datastore of size ~100 TB (for their largest datastore). Since most deep learning data is stored externally (as opposed to, for example, streaming data, where you really have only a single opportunity to “see” the data), people usually don’t mind paying the one-time cost of training a much smaller neural network by doing multiple passes over the dataset (hence “slow” learning), obtaining a much more compressed representation of the data in the end (in the parameters of the model). Perhaps new-generation wafer-scale chips will make models like RETRO more attractive from a hardware perspective, but I’m not sure if they’ll be able to tip the balance entirely in favor of such models any time soon over the more standard “slow-learning” models that practitioners today find so familiar and convenient.
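The core retrieval idea behind these external-memory models can be sketched in a few lines. Everything below is illustrative rather than any particular paper's method: random vectors stand in for a frozen encoder's features, and the similarity threshold is an arbitrary choice:

```python
import numpy as np

# Single-exposure "memorization" via an external key store: write each
# item into memory once, and recognize it later by nearest-neighbor
# (maximum cosine similarity) lookup, with no gradient steps at all.
rng = np.random.default_rng(1)
dim = 256

studied = rng.standard_normal((2500, dim))   # one exposure per study item
novel = rng.standard_normal((500, dim))      # lures never seen before

# Store L2-normalized keys so a dot product is a cosine similarity
memory = studied / np.linalg.norm(studied, axis=1, keepdims=True)

def seen_before(query: np.ndarray, threshold: float = 0.9) -> bool:
    """Declare 'old' if the best cosine match in memory is close enough."""
    q = query / np.linalg.norm(query)
    return float(np.max(memory @ q)) > threshold

hits = sum(seen_before(item) for item in studied)        # stored items
false_alarms = sum(seen_before(item) for item in novel)  # unseen lures
print(f"hit rate {hits / 2500:.2f}, false-alarm rate {false_alarms / 500:.2f}")
```

After a single "exposure" per item, retrieval is essentially perfect, which is exactly the division of labor (fast store, slow network) that the RETRO quote above describes.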

A critique of “Why Greatness Cannot Be Planned”

I’m cross-posting this recent piece from my Substack here, since it’s relevant to machine learning.

I’ve recently read Kenneth Stanley and Joel Lehman’s thought-provoking book, Why Greatness Cannot be Planned, and wanted to share my thoughts about the book.

The book is an intriguing critique of objective-based thinking in science and technology and in various other aspects of our life, such as education and romance. I found a lot to sympathize with in the book, especially its strong emphasis on the importance of individual autonomy and the diversity of pursuits, i.e. letting people pursue their own unique interests, whatever they happen to find interesting and worth pursuing in life, encouraging them to be explorers and “treasure hunters”. Others can then benefit from their explorations; they can use their discoveries as “stepping stones” in their own explorations. As a libertarian, I find this philosophy of life personally very appealing.

That being said, I do believe the book’s main thesis about objectives is based on a misunderstanding (or a misdiagnosis), so it is likely incorrect in my view.

The main problem with objective-based thinking the book identifies is deception: for any ambitious goal, such as achieving human-level intelligence in AI or through biological evolution, the stepping stones to that ultimate goal are numerous and often quite dissimilar (and unrelated) to the goal. It doesn’t make much sense, for example, to try to maximize “intelligence” when you’re at the single-cell stage in the history of life on earth and hope to reach human-level intelligence at some point along the way. Instead, the stepping stones are often reached serendipitously while trying to do something completely unrelated to achieving your ultimate ambitious goal: for example, vacuum tubes were essential in building the first computers, but they were originally invented for a completely different purpose. So, rather than explicitly trying to optimize for an ambitious objective (which may be many serendipitous stepping stones away), the authors instead recommend exploring ideas or innovations according to their novelty, interestingness, excitingness, or their potential to become a productive stepping stone, a launching pad for even newer, more exciting ideas. The hope is then that we will have collected enough useful, serendipitous stepping stones along the way that at some point our ultimate ambitious objective (e.g. achieving human-level intelligence) will appear on the horizon (within striking distance), and at that point (and only then) will it make sense to directly optimize for that objective. The book’s main idea is thus a strong emphasis on exploration unhindered and unconstrained, as much as possible, by any considerations about achieving ambitious objectives or goals.

It’s a neat theory as far as it goes, but there are several issues with its main line of reasoning (in the following, I will focus mostly on reaching human-level intelligence either through AI or through biological evolution as my working example of an ambitious objective as this is the example I know most about):

(1) The authors make very strong assumptions about the nature of stepping stones and ambitious objectives without much concrete evidence. For example, is it really true that the stepping stones to an ambitious goal are always deceptive? Some recent examples from machine learning suggest that this may not be the case: when we consider, for example, highly capable machine translation, speech recognition, text-to-image generation, question answering, game playing, protein folding prediction systems developed in recent years, they’re almost always trained with fairly standard models and training methods in one long optimization run that consistently reduces some highly generic loss function (or equivalently consistently improves some highly generic objective function) on a very large scale dataset, hence there’s really no deception along the optimization path where the loss first has to increase before it decreases. This suggests that such deception phenomena may not be as common as the authors suggest in the optimization of ambitious objectives (and super-human level Go playing, accurate protein folding prediction, human-level machine translation are all undoubtedly very ambitious objectives).

(2) Related to the previous point, the authors also underestimate the ability of objective-based optimization to collect useful and interesting stepping stones. Again, consider models like GPT-3 or Facebook’s WMT multilingual machine translation model, trained on very large-scale datasets. These models collect many stepping stones along their optimization paths on the way to becoming highly capable language and machine translation models, respectively. Even in much simpler models, objective optimization can generate a step-by-step emergence of stepping stone capabilities, as demonstrated by Andrew Saxe’s work on the dynamics of learning in simple neural network models:

Copied from Figure 3 in Andrew Saxe’s paper on the dynamics of learning in deep neural networks.
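To make this concrete, here is a minimal sketch (my own toy construction, not Saxe’s actual simulation code; all the specific numbers are made up) of the kind of dynamics his work describes: a two-layer linear network trained by plain gradient descent on a fixed target map learns the target’s singular modes one after another, strongest first, so qualitatively distinct “capabilities” switch on in stages along a single objective-driven optimization run.

```python
import numpy as np

# Toy illustration of stage-like learning in a deep linear network
# (in the spirit of Saxe et al.; the setup and numbers are made up).
rng = np.random.default_rng(0)
d = 8

# Target linear map with well-separated singular values: each singular
# mode plays the role of one "stepping stone capability".
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.array([4.0, 2.0, 1.0, 0.5] + [0.0] * (d - 4))
target = U @ np.diag(s) @ V.T

# Two-layer linear network y = W2 @ W1 @ x, small random initialization.
W1 = 0.01 * rng.standard_normal((d, d))
W2 = 0.01 * rng.standard_normal((d, d))
lr = 0.01

mode_strengths = []  # how much of each target mode has been learned so far
for step in range(3000):
    E = target - W2 @ W1                 # error in the end-to-end map
    G1, G2 = W2.T @ E, E @ W1.T          # gradients of 0.5 * ||E||_F^2
    W1 += lr * G1
    W2 += lr * G2
    if step % 100 == 0:
        # Project the learned map onto the target's singular directions.
        learned = np.diag(U.T @ (W2 @ W1) @ V)[:4]
        mode_strengths.append(learned.copy())

# Strong modes are picked up early, weak modes only much later: the loss
# decreases throughout, yet capabilities emerge step by step.
print(np.round(np.array(mode_strengths)[::5], 2))
```

Plotting `mode_strengths` over time gives the characteristic staircase pattern: each mode sits near zero for a while and then rises rapidly to its target value, with no deceptive “get worse before you get better” phase anywhere in the run.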

It could be argued that these stepping stones are qualitatively similar to the end product: e.g. the model just picks up more and more linguistic capabilities along its optimization path. But this is just a consequence of the relatively narrow domains/objectives these models are trained on. There’s no reason to think that training a model in a much richer domain would not give rise to a similar emergence of diverse, qualitatively different stepping stone capabilities along its optimization path.

(3) Sometimes the seeming inability of objective optimization to get us to our most ambitious goals may simply be due to the choice of wrong objectives rather than an inherent shortcoming of objective-based thinking itself. This is nicely illustrated by the authors’ example of trying to reach human-level intelligence from single-celled organisms through maximizing “intelligence”. The problem with this objective is that “intelligence” is an imprecise, vague, non-operational objective. Instead, we need to choose a more generic and actionable objective that can be applied to single-celled organisms and then try to reach human-level intelligence as a by-product of this optimization (rather than as its direct target). This is certainly how biological evolution achieved human-level intelligence: by optimizing fitness or reproductive success; human-level intelligence emerged as a by-product. A similar example given in the book is that of prehistoric humans trying to build a computer. Of course, this doesn’t make sense, because those people didn’t even have the concept of a computer, so it’s not an objective they could have acted upon. But if we instead chose a more generic and actionable objective that could be applied to prehistoric humans as well as to more modern humans, such as maximizing material outputs (i.e. something like GDP at purchasing power parity), it’s conceivable that they would have invented computers at some point along the way as a by-product; and indeed something like this is roughly how we got computers in reality.

(4) Contrary to what I claimed in the previous point, the authors argue, unconvincingly to my mind, that fitness in biological evolution is not an objective in the usual sense. For example, they argue that a fitness objective would require a “maximally fit” organism. But this is only true for a static fitness landscape; if the landscape changes, for example as a result of environmental changes, there doesn’t have to be a “maximally fit” organism. Nor is fitness really different from novelty or interestingness (the criteria favored by the authors) in this respect. The only thing needed for either an objective-based search or a novelty search is a local gradient pointing in the direction of higher fitness or higher novelty in the current landscape (more/less fit, more/less novel). The authors correctly point out that for novelty search, whether x is more novel than y is not absolute but depends on what the agent has already learned (the exploration history of the agent). But the same is true for fitness as an objective: whether x is more fit than y may depend on the current environment/ecosystem (the evolutionary history). So this, too, is not materially different between novelty search and fitness as an objective.
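The symmetry between the two search drivers is easy to see in code. The sketch below (a deliberately simplified toy of my own, with a made-up fitness function and novelty threshold, not the actual novelty search algorithm from the authors’ papers) runs the same local search loop twice, swapping only the acceptance test: one accepts candidates that are locally fitter, the other accepts candidates that are sufficiently novel relative to the search history. Neither needs a globally “maximally fit” or “maximally novel” point to exist.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(x):
    # Stand-in fitness landscape (hypothetical): higher is fitter.
    return -np.sum(x**2)

def novelty(x, archive):
    # Novelty is relative to what the searcher has already visited,
    # just as fitness is relative to the current environment.
    return min(np.linalg.norm(x - a) for a in archive)

def local_search(steps, accept):
    x = rng.standard_normal(2)
    history = [x]
    for _ in range(steps):
        cand = x + 0.3 * rng.standard_normal(2)
        # Only a local, relative comparison is ever made here.
        if accept(cand, x, history):
            x = cand
            history.append(x)
    return history

# Fitness-driven: accept if the candidate is fitter than the current point.
fit_hist = local_search(200, lambda c, x, h: fitness(c) > fitness(x))
# Novelty-driven: accept if the candidate is novel enough w.r.t. the archive.
nov_hist = local_search(200, lambda c, x, h: novelty(c, h) > 0.2)

print(f"fitness run, final fitness: {fitness(fit_hist[-1]):.3f}")
print(f"novelty run, archive size: {len(nov_hist)}")
```

Both loops rely only on local, history- or environment-relative comparisons; the difference is which comparison drives acceptance, not the presence or absence of an “objective” in some deeper sense.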

(5) This brings me to perhaps the most important objection I’d like to raise against the main thesis of the book: I think the authors misdiagnose what makes biological evolution (and other forms of natural and human innovation) powerful. The key thing that makes biological evolution (and other mechanisms or processes of innovation) powerful is the richness of the world we live in, the existence of a huge number of parallel organisms/agents exploring, or searching, different parts/niches of this incredibly rich world, and the complicated network of interactions between these organisms/agents. There’s nothing wrong with simple generic objectives, like fitness or reproductive success or likelihood or reward (in machine learning or reinforcement learning), driving the exploration in such a rich environment. Conversely, there’s nothing magical about alternative criteria like novelty or interestingness driving the exploration. It’s rather the rich, complicated, dynamic environment we live in, and the very many parallel, interacting searches going on in this environment, that make creative and useful innovations possible.

There’s reason to believe that if the world were simpler, more stable, and more static, fitness maximization wouldn’t lead to such a high degree of diversity and innovation in biological evolution. In fact, this is the basic theme of Stephen Jay Gould’s famous punctuated equilibrium theory of evolution (developed with Niles Eldredge): long periods of stasis in evolution punctuated by sudden disruptions in the environment/ecosystem (e.g. a meteor impact), followed by rapid adaptation to the new environment/ecosystem. This idea is circumstantially supported by the early history of life on earth, where the first couple of billion years of evolution took place in very harsh and relatively uniform environmental conditions and did not produce a lot of innovation in life forms, compared to the amount of creativity and innovation that unfolded afterwards in much richer, more complex, and more favorable environmental conditions.

(6) As I mentioned earlier, the primary emphasis of the book is on free exploration unconstrained by objectives. But constraints on exploration (in one form or another) are absolutely essential for coming up with anything useful or interesting. There’s one very informative hypothetical example in Chapter 10 of the book (devoted to natural evolution) that I’d like to discuss in this connection. The authors imagine a hypothetical (peaceful) world, called Gentle Earth, in which competition for survival or reproduction is not a constraint on evolution. The details are unfortunately not fleshed out, but in such a scenario presumably any mutation, any imaginable life form, would be viable, and as a result evolution would produce vastly more novel life forms than it has in the actual world (which might perhaps be called Cruel Earth). But Gentle Earth, in the limit, is just like Borges’ Library of Babel, where almost all books are uninteresting gibberish and only a vanishingly small proportion of books actually contain anything of interest or value to humans. So, constraints of one form or another are absolutely necessary to limit the endless possibilities to those that are actually productive, useful, or interesting. For example, depending on the details, physical/chemical limits on viability (some mutations won’t generate physically or chemically viable organisms) can provide one such set of constraints even in Gentle Earth. One can debate the relative strengths and weaknesses of different sets of constraints (e.g. interestingness vs. fitness), but at least some such set of constraints is essential.
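The Library of Babel point can be illustrated with a toy search space (my own made-up example: strings whose letters happen to be in alphabetical order stand in for the rare “books of interest”). Unconstrained generation rarely hits such strings, while a search constrained to viable variants produces them by construction:

```python
import random
import string

# Hypothetical stand-in for "books of interest": strings whose letters
# are in alphabetical (non-decreasing) order, a rare structural property.
def viable(s):
    return all(a <= b for a, b in zip(s, s[1:]))

random.seed(0)
length, trials = 6, 100_000

# Unconstrained generation: sample random strings and count the hits.
n = sum(
    viable("".join(random.choices(string.ascii_lowercase, k=length)))
    for _ in range(trials)
)
print(f"unconstrained hit rate: {n / trials:.4%}")  # a small sliver

# Constrained exploration: mutate one letter at a time, but only accept
# viable variants, so every artifact produced satisfies the constraint.
s = "abcdef"
archive = {s}
for _ in range(1000):
    i = random.randrange(length)
    cand = s[:i] + random.choice(string.ascii_lowercase) + s[i + 1:]
    if viable(cand):
        s = cand
        archive.add(s)
print(f"constrained search produced {len(archive)} viable strings")
```

The point is not which constraint is used (alphabetical order here is arbitrary) but that some viability filter is what keeps the search inside the productive sliver of an otherwise overwhelmingly uninteresting space.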

To sum up, although I find a lot to admire in the book (e.g. its strong emphasis on the importance of individual exploration), I think Why Greatness Cannot Be Planned ultimately misdiagnoses what exactly is essential and what isn’t in artificial and natural mechanisms or processes that generate powerful and creative innovations and it overestimates the difference between objective-based search and novelty search as exploration mechanisms.