<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://manuelsh.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://manuelsh.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-22T22:52:37+00:00</updated><id>https://manuelsh.github.io/feed.xml</id><title type="html">Manuel Sánchez Hernández</title><subtitle>I like to do machine learning at scale </subtitle><entry><title type="html">A Pragmatic Evaluation of Software Engineering AI Tooling</title><link href="https://manuelsh.github.io/blog/2026/pragmatic-evaluation-ai-tooling/" rel="alternate" type="text/html" title="A Pragmatic Evaluation of Software Engineering AI Tooling"/><published>2026-03-09T09:00:00+00:00</published><updated>2026-03-09T09:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2026/pragmatic-evaluation-ai-tooling</id><content type="html" xml:base="https://manuelsh.github.io/blog/2026/pragmatic-evaluation-ai-tooling/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">March 9th, 2026</span></p> <p><em>Originally published on <a href="https://adevinta.com/techblog/a-pragmatic-evaluation-of-software-engineering-ai-tooling/">Adevinta Tech Blog</a></em></p> <p><strong>Personal note:</strong> This article reflects the outcome of a pilot that we ran in Adevinta in June 2025 to evaluate various AI software engineering tools. The findings, previously unpublished, were somewhat unexpected, as Claude Code was not widely favored by engineers at the time. <strong>Contributors to the pilot:</strong> Albert Puigsech Galicia, Mario Viñas Ruiz, Francesca Lorenzoni, Diego Duchowney, Ferran Grau</p> <blockquote> <h1 id="how-to-interpret-the-results">How to interpret the results</h1> <p>This was a pragmatic, decision-oriented pilot, done with real work across our teams. Results are directional and internally valid to Adevinta codebases and workflows. Limitations and design trade-offs: our results are based on experienced software engineers’ assessments of how long a task should take, which can be subjective. Tools were assigned to teams to minimize disruption, so results reflect real adoption contexts rather than randomized trials. Our sample size is limited, so we report confidence intervals with our data. The study lasted only one month, so long-term effects are not measured. The pilot was run in May-June 2025, with the tools as they stood at that point in time; all of them have evolved since then.</p> </blockquote> <h1 id="in-a-nutshell">In a nutshell</h1> <p>We evaluated three AI coding assistants, Cursor, Claude Code and GitHub Copilot, with 77 software engineers working on 165 real work tasks over four weeks, to understand which assistant delivers meaningful, cost-effective impact in Adevinta’s environment. The objective of the pilot was to make a decision about which AI tool to deploy; it had to be pragmatic, limited in scope and time, and hence it is not an academic study. Data is grounded in Adevinta codebases and workflows.</p> <p>However, we believe our study is still relevant if the results are interpreted directionally, especially as there is a lack of published results in enterprise settings. 
In the absence of reliable metrics, in an area with so much noise, hype and evolution, we believe it is important for a company like Adevinta to evaluate different AI tools first-hand before making a decision.</p> <p>Our main findings were: Claude Code demonstrated the strongest performance across all metrics, showing the largest productivity boost, the highest completion rate, and the highest user rating. Cursor provided moderate but meaningful gains, with a solid completion rate but weaker user sentiment. GitHub Copilot, which was the tool most engineers already had access to, showed small, almost negligible productivity gains, the lowest completion rate, and the lowest rating.</p> <p>The difference comes down to three factors: (1) The default configuration of Claude Code (which at the time of the pilot worked with a “pay as you go” pricing model) makes it automatically select which model should be used for each prompt, while for Cursor and GitHub Copilot (flat monthly rate) more powerful models need to be selected manually. Claude Code used the most powerful models 60 times more than the other tools. (2) Claude Code is easier for users to start with: we observed inexperienced users getting value out of it, while Cursor has a steeper learning curve. In fact, we observed that moderately experienced Cursor users can reach a productivity similar to Claude Code’s. (3) Cursor, which is an IDE based on VSCode, and GitHub Copilot, a plugin that works with IntelliJ and VSCode but not Xcode, created a temporary learning curve for IntelliJ and Xcode users, which may also influence our metrics.</p> <p>With the data gathered, we estimated that Claude Code had the best ROI and integrated best within Adevinta workflows. Deploying this tool to our more than 2,000 software engineers is a first step towards increasing productivity through AI coding assistants while we work on other fronts, such as automated metrics, training, software engineering workflow integration… We also plan to run periodic evaluations of different software engineering tools, similar to the one done in this pilot, to adjust our strategy.</p> <h1 id="motivation">Motivation</h1> <p>Generative AI coding assistants are promising double-digit productivity gains for developers. For a company like <a href="https://adevinta.com/">Adevinta</a> with thousands of software engineers across brands such as Leboncoin, mobile.de, Kleinanzeigen or Marktplaats, even single-digit improvements translate into millions of euros of capacity a year.</p> <p>At the same time, the AI software engineering tooling ecosystem is changing very rapidly, with new tools coming every quarter, promising large productivity gains and sometimes disappointing with inconsistent performance. 
The amount of hype seems to increase at the same pace, with industry leaders predicting the end of human software engineers by the end of 2025.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/ai-tooling-header-480.webp 480w,/assets/img/blog_images/ai-tooling-header-800.webp 800w,/assets/img/blog_images/ai-tooling-header-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/ai-tooling-header.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> The number of AI Software Engineering tools is growing rapidly with time. Source: authors. </div> <p>Contributing to this noise is the <a href="https://www.hbs.edu/faculty/Pages/item.aspx?num=64700">jagged frontier</a>: no one knows on which types of tasks AI models will succeed or fail unless you test them yourself. We were quite conscious of that, since our teams have been building and deploying AI models for over a decade, with over 250 use cases currently delivering value.</p> <p>Additionally, Adevinta had already integrated AI coding tools like GitHub Copilot. While initial engineer feedback was largely positive, we couldn’t conclusively link these tools to productivity gains. Informal tests with other software engineering AI tools also yielded inconclusive results.</p> <p>In summary, due to the potential of this new wave of AI tools, the absence of reliable metrics, the amount of marketing hype, the importance of software engineering to Adevinta, the high growth of the ecosystem and the jagged frontier, we opted to conduct our own pilot program. We believed that first-hand experience with the most advanced tooling was required to decide on the best path forward.</p> <h1 id="designing-our-ai-software-engineering-pilot">Designing our AI software engineering pilot</h1> <p>In May and June 2025 we ran a structured evaluation of GitHub Copilot, Cursor and Claude Code, with the aim of answering which assistant delivers the most value in our real-world codebases. We shortlisted these three tools because they were industry leaders at the time and had received positive feedback from many informal tests run by our engineers.</p> <p>Our pilot lasted four weeks, with 77 engineers from 14 different teams and four marketplaces, working on real tasks from their team’s backlog. All teams were product teams, each responsible for a part of Adevinta’s marketplaces, and made up of various engineering roles such as backend, frontend, platform…</p> <p>Each team was assigned to one of the three AI tools for the whole period. Before the pilot started we made sure that every engineer received onboarding training on their AI tool from subject matter experts. We also held weekly sessions where they could ask questions, and there were online support channels with experts on each tool available.</p> <p>Pilot data was gathered from various sources, including two surveys (pre- and post-pilot), three focus groups, four meetings with the team leaders, cost information and telemetry from the AI tools, and information for each individual task performed by the engineers. 
In total we were able to track 165 tasks during the pilot, which were logged in a table, the “task tracker”.</p> <p>For a summary of the design of the pilot, see the following table:</p> <table> <thead> <tr> <th>Aspect</th> <th>Design choice</th> </tr> </thead> <tbody> <tr> <td><strong>Participants</strong></td> <td>77 software engineers, 14 cross-functional product squads (back- and front-end, data &amp; mobile developers)</td> </tr> <tr> <td><strong>Duration</strong></td> <td>4 weeks, during May-June 2025</td> </tr> <tr> <td><strong>Tools</strong></td> <td>GitHub Copilot, Cursor and Claude Code</td> </tr> <tr> <td><strong>Task mix</strong></td> <td>Bug-fixing, refactoring, feature development, migrations, documentation, planning, tooling creation, each flagged with its corresponding area: backend, frontend, Android, iOS, infra or full-stack</td> </tr> <tr> <td><strong>Metrics captured</strong></td> <td>Estimated-vs-actual cycle time, task outcome, PR size &amp; comments, survey ratings, API usage and costs</td> </tr> </tbody> </table> <h2 id="the-task-tracker-a-log-for-each-individual-task">The task tracker, a log for each individual task</h2> <p>We conducted sessions with team managers to align on how to best measure the impact of AI. Although some metrics were already available (some teams tracked DORA metrics), they were not consistent across teams and marketplaces, so we agreed that the best option was to track tasks manually for this pilot. In fact, several software engineering productivity tracking tools resort to polling engineers to track AI productivity.</p> <p>A key artifact we employed was our “task tracker”, a table where task details were collected. These tasks were pre-existing in the teams’ backlogs, integrated into their plans, and not specifically chosen for the pilot. 
For each task, information was collected both before it started and after it was completed.</p> <h3 id="ex-ante-tracking"><em>Ex ante</em> tracking</h3> <p>Before beginning each task, engineers recorded the following details:</p> <ul> <li><strong>Task description:</strong> A brief overview of the task.</li> <li><strong>Engineer’s experience:</strong> The engineer’s current level of experience with the AI tool, which may evolve throughout the pilot.</li> <li><strong>Task characteristics:</strong> Such as its type (e.g., “feature development,” “bug fixing,” “refactoring”) and the area it pertains to (e.g., “backend,” “frontend,” “infrastructure”).</li> <li><strong>Estimated cycle time without AI:</strong> The engineer’s time estimation for the task without the help of AI, crucial for the productivity assessment.</li> <li><strong>Expected Pull Request revisions:</strong> To gauge the quality of committed code.</li> </ul> <h3 id="ex-post-tracking"><em>Ex post</em> tracking</h3> <p>Once the task was completed, engineers were required to add the following information:</p> <ul> <li><strong>Actual cycle time:</strong> Recorded time from when the task starts until it is completed.</li> <li><strong>Actual number of Pull Requests</strong> to finish the task.</li> <li><strong>Task outcome:</strong> This indicated the AI tool’s contribution, categorized as: <ul> <li>“Completed, fully done by AI”, which we convert to 100%.</li> <li>“Completed, partially done by AI”, which we convert to 50%.</li> <li>“Completed manually, AI not able to help”, 0%.</li> </ul> </li> <li><strong>Engineer’s rating of AI tool helpfulness:</strong> A score from 1 to 5, with 5 being the highest rating.</li> </ul> <p>Engineers received definitions and guidance for each attribute in the task tracker, with weekly meetings held to address any questions. A comprehensive list and further details are available in Appendix 1.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/tracking-table-480.webp 480w,/assets/img/blog_images/tracking-table-800.webp 800w,/assets/img/blog_images/tracking-table-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/tracking-table.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> An example of a task tracker for one team, already filled in. Columns with a black header were completed “ex ante” and those with a green header, “ex post”. Source: authors. </div> <h2 id="on-the-productivity-metric">On the productivity metric</h2> <p>From the task tracker, we calculated a productivity proxy, or percentage of time saved, by comparing the actual cycle time to the estimated cycle time without AI, using the following formula:</p> \[P_{tool} = 1 - \frac{\sum_{i} A_{i}}{\sum_{i} E_{i}}\] <p>Where \(P_{tool}\) is the productivity of the tool, \(\sum_i A_i\) is the sum of the actual times for all tasks done with the tool, and \(\sum_i E_i\) is the sum of the estimated times without AI for the same tasks. By using this formula, longer tasks naturally weigh more.</p>
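<p>To make the computation concrete, the sketch below derives the proxy and a 95% interval from task records shaped like the tracker above. The field names are hypothetical, and the percentile bootstrap is one standard way to obtain such an interval, not necessarily the exact procedure used in the pilot.</p> <pre><code class="language-python">import random

# Hypothetical task records, shaped like the task tracker above.
tasks = [
    {"estimated_days": 2.0, "actual_days": 1.0},   # AI helped a lot
    {"estimated_days": 1.0, "actual_days": 0.9},
    {"estimated_days": 3.0, "actual_days": 2.4},
]

def productivity(rows):
    """P_tool = 1 - sum(actual) / sum(estimated); longer tasks weigh more."""
    actual = sum(r["actual_days"] for r in rows)
    estimated = sum(r["estimated_days"] for r in rows)
    return 1 - actual / estimated

def bootstrap_ci(rows, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap: resample tasks with replacement, take quantiles."""
    stats = sorted(
        productivity(random.choices(rows, k=len(rows)))
        for _ in range(n_resamples)
    )
    return stats[int(n_resamples * alpha / 2)], stats[int(n_resamples * (1 - alpha / 2))]
</code></pre>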
<p>We also determined a 95% confidence interval. In Appendix 2 we analyze the different sources of error for the metric and how they were mitigated.</p> <h1 id="results">Results</h1> <h2 id="our-baseline-before-the-pilot">Our baseline before the pilot</h2> <p>Before starting the pilot, we surveyed our population of 77 engineers to better understand their adoption of AI tooling. With 70 answers, the most notable results are summarized in the table below:</p> <table> <tbody> <tr> <td style="text-align: right">Engineers using AI tools daily</td> <td style="text-align: left">47.1%</td> </tr> <tr> <td style="text-align: right">Used GitHub Copilot</td> <td style="text-align: left">62.9%</td> </tr> <tr> <td style="text-align: right">Used Cursor</td> <td style="text-align: left">27.1%</td> </tr> <tr> <td style="text-align: right">Used Claude Code</td> <td style="text-align: left">8.6%</td> </tr> <tr> <td style="text-align: right">Time in coding activities (%)</td> <td style="text-align: left">66.8%</td> </tr> </tbody> </table> <p>Almost half of the engineers already used AI tools daily, with GitHub Copilot the most used tool, as it was already available within Adevinta for all engineers, followed by Cursor and Claude Code.</p> <p>We also asked them to estimate what share of their time they spend on coding activities; the average answer was 67%. This result surprised us, as in many other sources the average is between 25% and 40% (Kumar 2025, Meyer 2019, IDC report 2024).</p> <p>We also asked what development environments they use, and the results are in the chart below. More than half of the users used IntelliJ, with VSCode the second choice. This is relevant as each of the tools tested works differently across environments, and it became an important factor in the results of the pilot.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/ai-tooling-ide-distribution-480.webp 480w,/assets/img/blog_images/ai-tooling-ide-distribution-800.webp 800w,/assets/img/blog_images/ai-tooling-ide-distribution-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/ai-tooling-ide-distribution.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Development environment distribution among the 77 participating engineers. 
</div> <h2 id="quantitative-results">Quantitative results</h2> <p>This is a summary of the main metrics we tracked:</p> <ul> <li><strong>Claude Code</strong> showed the highest directional gains across measured metrics in this dataset, with the largest productivity boost (~43% vs estimated time), the highest completion rate (74%) and the highest user rating (4.2).</li> <li><strong>Cursor</strong> provided moderate but meaningful gains (~24%) and a solid completion rate (59%), but user sentiment (3.6) was not as good.</li> <li><strong>GitHub Copilot</strong>, which was the tool most engineers already had access to (it could be considered our control leg), showed small, almost negligible productivity gains (~1%), the lowest completion rate (44%) and the lowest rating (2.8).</li> </ul> <p>These results were relevant to our decision-making, and some were actually very different from our initial assumptions.</p> <p>In the following table and violin plot we provide more information:</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/ai-tooling-results-table-480.webp 480w,/assets/img/blog_images/ai-tooling-results-table-800.webp 800w,/assets/img/blog_images/ai-tooling-results-table-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/ai-tooling-results-table.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Comparative results showing productivity percentages, completion rates, and user ratings for each AI tool. </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/ai-tooling-violin-plot-480.webp 480w,/assets/img/blog_images/ai-tooling-violin-plot-800.webp 800w,/assets/img/blog_images/ai-tooling-violin-plot-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/ai-tooling-violin-plot.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Statistical distribution of productivity results across the three AI coding assistants. </div> <p>The distribution in the plot shows that, for many tasks, the productivity of Cursor and Claude Code is very high. However, a subset of tasks drags Cursor’s performance down in comparison to Claude Code. We analyze this further in the discussion section.</p> <p>Some additional results from the pilot were:</p> <ol> <li><strong><em>Refactoring</em> and <em>Code Migration</em> were the task types where all tools seemed to perform best</strong> (~40% and 49% productivity respectively), while others like <em>Feature Development</em> or <em>Bug Fixing</em> did not perform so well. The exception is Claude Code, the only tool that showed increased productivity within <em>Feature Development</em>, the most commonly executed task type (~26% productivity and ~68% completion). This was expected: tasks with more dependencies, like <em>Feature Development</em>, usually depend on other teams or people besides the software engineer (e.g. 
to grant access, to validate certain steps, etc.) or on external tools, so their durations tend to be underestimated (optimism bias), while in tasks where the engineer has more control, such as refactoring or migrations, engineers can work with high autonomy, making estimations less prone to bias.</li> <li><strong><em>Frontend</em> tasks seem to have the largest productivity boost on aggregate.</strong> When we look at the tool level, Claude Code seemed to perform well in <em>Backend</em>, <em>Frontend</em> and <em>Android</em> tasks, while Cursor and GitHub Copilot showed shortcomings in <em>Backend</em> tasks.</li> <li><strong><em>Intermediate-level</em> users extract 11x more benefit than first-time users.</strong> Intermediate users’ productivity is ~47% while first-time users’ is ~4%. But that gap does not exist with Claude Code (this is covered later in this article).</li> <li><strong><em>Large</em> (&gt;3 days) and <em>medium</em> (&gt;1 day) tasks got the most benefit</strong> (~30% and 24% productivity respectively)<strong>, especially for Claude Code and Cursor</strong>, vs. ~12% for <em>small</em> tasks. This was contrary to our initial expectations.</li> </ol> <h2 id="qualitative-results">Qualitative results</h2> <p>One important result from the focus groups and the comments collected in the polls concerned how well the tools integrated with the different IDEs and workflows of the software engineers. In particular, according to many comments, the integration of Claude Code into Adevinta’s development workflow was easy, as it didn’t require changing or learning a new IDE. In contrast, to use Cursor, many engineers used to IntelliJ (63% of the total) or Xcode (16%) needed to learn a new editor, as Cursor is a fork of VSCode. Users found GitHub Copilot easier to start with due to its availability as a plugin for both VSCode and IntelliJ.</p> <p>Additionally, we observed the following:</p> <ul> <li>Some users of Claude Code mentioned that they were able to parallelize their work on two tasks simultaneously, potentially boosting productivity beyond our measurements.</li> <li>AI tooling allowed engineers to prioritize important but non-urgent maintenance tasks, the kind that are often deprioritized due to their complexity or perceived effort.</li> <li>During the focus group held after the pilot, engineers who had GitHub Copilot assigned mentioned that the productivity result of ~1% did not reflect their experience, and that the tool was actually very helpful for them. In several comments in the post-pilot survey they also mentioned that the tool was helpful, providing a very positive experience.</li> <li>Many users did not want to use Claude Code in the beginning; they were skeptical because it was only available via a CLI, without IDE integration. After using it, however, they were pleased with the quality of the results.</li> </ul> <h1 id="discussion">Discussion</h1> <h2 id="why-these-results">Why these results</h2> <p>Results seem to show that Claude Code outperformed the other AI tools on many metrics: our productivity proxy, user experience and average AI completion rate. 
This outcome was unexpected for us, but further analysis revealed several factors that likely contributed to it:</p> <h3 id="the-default-configuration-of-the-tools-and-their-pricing-model">The default configuration of the tools and their pricing model</h3> <p>At the time of the pilot (June 2025) Cursor and GitHub Copilot had a flat rate close to $40 per user per month, while Claude Code’s consumption model was “pay as you go”. Our 10 users with the highest consumption spent $202 on Claude Code during the month.</p> <p>By default, Claude Code selects which model to use depending on the complexity of the task, while Cursor and GitHub Copilot default to less powerful models. Users could change that, and sometimes did (in Cursor these were called <em>Max models</em>, which may incur extra cost after some usage), more so as they became more experienced with the tool.</p> <p>We observed that the amount of tokens consumed by Claude Code on these more powerful models was 60 times larger than for the other two tools, which also helps explain its higher performance.</p> <h3 id="preferred-ides">Preferred IDEs</h3> <p>Over 50% of our engineers used IntelliJ or Xcode, while Cursor is a fork of VSCode and GitHub Copilot, being a plugin, works for both IntelliJ and VSCode but not for Xcode. These users needed to learn an IDE they were not familiar with, which we believe also reduced their effectiveness during the pilot; some engineers commented on this in the focus groups and the post-pilot survey. This learning curve is temporary, and we believe it also contributed to the difference in productivity by level of experience with the AI tool.</p> <h1 id="after-the-pilot">After the pilot</h1> <p>The pilot challenged our initial assumptions, and we adapted the strategy of our marketplaces (<a href="https://www.mobile.de/">Mobile</a>, <a href="https://www.leboncoin.fr/">Leboncoin</a>, <a href="https://www.marktplaats.nl/">Marktplaats</a>, <a href="https://www.kleinanzeigen.de/">Kleinanzeigen</a> and <a href="https://www.subito.it/">Subito</a>). We plan to continue evaluating different tools with pilots similar to the one presented in this article, and to evolve our approach accordingly.</p> <p>Building the capability to independently and rapidly evaluate emerging AI tooling first-hand and generate reliable, evidence-based results is ultimately more valuable than identifying a short-term winner. As the AI landscape evolves at high speed, with a significant noise-to-signal ratio and often unreliable public benchmarks, developing this internal evaluation muscle is a key long-term advantage.</p> <p>Increasing the speed of our software engineers, however, requires more than just deploying a tool; that is why we are working on several other fronts: automated metrics, training, integrating the tool across the whole engineering workflow… We hope we can share results about this soon.</p> <h1 id="bibliography">Bibliography</h1> <p>Kumar, S., Goel, D., Zimmermann, T., Houck, B., Ashok, B., &amp; Bansal, C. (2025). Time Warp: The Gap Between Developers’ Ideal vs Actual Workweeks in an AI-Driven Era. arXiv preprint arXiv:2502.15287. Retrieved from <a href="https://arxiv.org/abs/2502.15287">https://arxiv.org/abs/2502.15287</a></p> <p>Meyer, A. N., Barr, E. T., Bird, C., &amp; Zimmermann, T. (2019). Today was a Good Day: The Daily Life of Software Developers. IEEE Transactions on Software Engineering. 
Microsoft Research preprint available at: <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2019/04/devtime-preprint-TSE19.pdf">https://www.microsoft.com/en-us/research/wp-content/uploads/2019/04/devtime-preprint-TSE19.pdf</a></p> <p>IDC. (2024). How Do Software Developers Spend Their Time? IDC Report (as summarized in InfoWorld). <a href="https://my.idc.com/getdoc.jsp?containerId=US53204725">https://my.idc.com/getdoc.jsp?containerId=US53204725</a></p> <p>Dell’Acqua, Fabrizio, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine C. Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. <a href="https://www.hbs.edu/faculty/Pages/download.aspx?name=24-013.pdf">“Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.”</a> Harvard Business School Working Paper, No. 24-013, September 2023.</p> <h1 id="appendices">Appendices</h1> <h2 id="appendix-1:-complete-list-of-attributes-in-the-task-tracker-and-definitions">Appendix 1: Complete list of attributes in the task tracker and definitions</h2> <p>The following table represents the data that was consolidated in our task tracker. Some information was added before the task started and some after it finished.</p> <h3 id="before-the-task-started">Before the task started</h3> <table> <thead> <tr> <th style="text-align: left">Field</th> <th style="text-align: left">Definition</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Task ID</strong></td> <td style="text-align: left">Unique identifier from our issue tracking system</td> </tr> <tr> <td style="text-align: left"><strong>Task Description</strong></td> <td style="text-align: left">Brief, clear description of what the task involves</td> </tr> <tr> <td style="text-align: left"><strong>Engineer Name</strong></td> <td style="text-align: left">Name of the person accountable for the task</td> </tr> <tr> <td style="text-align: left"><strong>Experience with the AI Coding Tool</strong></td> <td style="text-align: left">Level of experience of the engineer accountable for the task, which will likely change after 1-2 weeks. <em>First time</em> means the user is still understanding how it works, with very little experience. <em>Intermediate</em> means that the user understands and has used the main features, and is comfortable with them. <em>Expert</em> is someone who can provide tips and best practices to others.</td> </tr> <tr> <td style="text-align: left"><strong>AI Tool Used</strong></td> <td style="text-align: left">AI tool that was used for the task: <em>GitHub Copilot, Cursor, or Claude Code</em>. 
Note that teams (and hence engineers) were assigned to one of the tools for the whole period.</td> </tr> <tr> <td style="text-align: left"><strong>Task Type</strong></td> <td style="text-align: left">Type of task among the following categories: <em>Bug Fixing, Refactoring, Feature Development, Code Migration, Documentation, Tooling</em> or <em>Other</em></td> </tr> <tr> <td style="text-align: left"><strong>Task difficulty</strong></td> <td style="text-align: left">Indicates how challenging the task is: <em>Small, Medium</em> or <em>Large</em></td> </tr> <tr> <td style="text-align: left"><strong>Area</strong></td> <td style="text-align: left">Specifies which functional area the task is mostly about, among the following options: <em>Backend, Frontend, Android, iOS, Infra, Fullstack</em> and <em>Other</em></td> </tr> <tr> <td style="text-align: left"><strong>Estimated Time Without AI (days)</strong></td> <td style="text-align: left">The engineer’s estimation of the task cycle time without an AI assistant, e.g. by comparing to similar tasks, using Three-Point Estimation, or drawing on the engineer’s experience.</td> </tr> <tr> <td style="text-align: left"><strong>Estimated # PR revisions</strong></td> <td style="text-align: left">How many reviews the engineer expects the PR would typically require, e.g. tasks with low complexity would only require 1 review, whereas high complexity tasks may require several reviews.</td> </tr> <tr> <td style="text-align: left"><strong>Confidence in estimate</strong></td> <td style="text-align: left">How confident the engineer is in the time estimates. Values were <em>Low, Medium</em> or <em>High.</em></td> </tr> </tbody> </table> <h3 id="after-the-task-finished">After the task finished</h3> <table> <thead> <tr> <th style="text-align: left">Field</th> <th style="text-align: left">Definition</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>Actual Time with AI (days)</strong></td> <td style="text-align: left">Actual time taken to complete the task using the AI assistant, in days</td> </tr> <tr> <td style="text-align: left"><strong>Actual # PR revisions</strong></td> <td style="text-align: left">How many revisions were actually needed before PR approval</td> </tr> <tr> <td style="text-align: left"><strong>Task Outcome</strong></td> <td style="text-align: left">Indicates if the task was (1) <em>Completed, fully done by AI</em>, (2) <em>Completed, partially done with AI,</em> (3) <em>Completed manually, AI not able to help,</em> (4) <em>Not completed, due to external factors not related to AI</em></td> </tr> <tr> <td style="text-align: left"><strong>Rate the help of AI tool</strong></td> <td style="text-align: left">Rating of the usefulness of the AI tooling for this task, from 1 (lowest) to 5 (highest)</td> </tr> <tr> <td style="text-align: left"><strong>Link to the Pull Request</strong></td> <td style="text-align: left">Link to the Pull Request for the task</td> </tr> <tr> <td style="text-align: left"><strong>Optional comments</strong></td> <td style="text-align: left">Any notable observations about the AI’s performance or limitations for this task</td> </tr> </tbody> </table> <h2 id="appendix-2:-sources-of-error-in-our-productivity-proxy-metric-and-mitigation">Appendix 2: Sources of error in our productivity proxy metric and mitigation</h2> <p>Our method offers a pragmatic approach to measuring productivity gains from AI tooling. However, several sources of error could affect the validity of this metric. 
Below, we outline these potential biases and describe the steps taken to mitigate them.</p> <h3 id="estimation-bias">Estimation bias</h3> <p>People often underestimate task durations (<a href="https://en.wikipedia.org/wiki/Planning_fallacy">planning fallacy</a>), either due to optimism bias or because estimates do not include the unexpected eventualities that produce delays. This can inflate the apparent productivity gain when actual times are longer. Conversely, some estimates may be done conservatively, making the tool look less effective than it is. This effect can be mitigated by adding a control group.</p> <p>In fact, as GitHub Copilot was already deployed in the company before the pilot and adopted by 63% of the population, using a control group without software engineering AI tooling would have meant actually withdrawing it from some users or teams. Instead, we included GitHub Copilot among the tools to compare, and this group can be considered the control group.</p> <h3 id="noise-from-non-coding-work">Noise from non-coding work</h3> <p>The “actual time” logged may include overhead unrelated to code writing (e.g. waiting for code reviews, access permissions, builds, or clarifications). These delays distort the metric since they are not influenced by the AI assistant.</p> <p>At the same time, these delays are included in both the time estimations and the real time, which should reduce the effect of this factor.</p> <h3 id="task-heterogeneity">Task heterogeneity</h3> <p>Different task types (e.g. bug-fixing vs. code refactoring) vary widely in predictability and in how much AI can help. Averaging them together risks masking the fact that tools may excel in one category but underperform in others.</p> <p>We present the results stratified by task type and area in Appendix 4. We observe that, for all tools, the most common task types are <em>Feature Development, Code Refactoring</em> and <em>Bug Fixing.</em> We see that <em>Bug Fixing</em> is actually less present in Claude Code, with only 2 tasks, which could possibly impact the result. We see, however, that <em>Bug Fixing</em> had a positive impact on Cursor, increasing its metrics.</p> <h3 id="differences-in-team-composition">Differences in team composition</h3> <p>Senior vs. junior developers estimate differently and use tools differently. Some may benefit more from AI (e.g. juniors for boilerplate, seniors for exploring alternatives). To reduce the effect of this factor, we assigned the tools by teams, which by design have a mix of roles and seniority.</p> <h3 id="weighting-by-task-duration">Weighting by task duration</h3> <p>Weighting long tasks more heavily can skew the average. A single large task with an unusual length could dominate the results, even if smaller tasks consistently showed steady benefits.</p> <p>In the following table we present the same data after removing the 5% longest tasks (by their predicted time). All of these tasks were from Cursor and had an average estimated duration of 7.6 days.</p>
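<p>As an illustration, this robustness check takes only a couple of lines over the same hypothetical task records used in the earlier sketch; dropping the top 5% of 165 tasks leaves the 157 reported below.</p> <pre><code class="language-python">def drop_longest(rows, fraction=0.05):
    """Remove the top `fraction` of tasks by estimated (pre-AI) duration."""
    keep = round(len(rows) * (1 - fraction))  # 165 tasks -> 157 kept
    return sorted(rows, key=lambda r: r["estimated_days"])[:keep]

# Recompute the proxy on the trimmed data, e.g. per tool:
# productivity(drop_longest(all_tasks))
</code></pre>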
<p>The results remain directionally consistent with the original ones.</p> <table> <thead> <tr> <th style="text-align: left">AI Tool Used</th> <th style="text-align: center">Productivity % (95% confidence range)</th> <th style="text-align: center"># of tasks</th> <th style="text-align: center">Average AI Completion Rate</th> <th style="text-align: center">Average Rating</th> <th style="text-align: center">Average estimated time without AI (days)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">Claude Code</td> <td style="text-align: center">42.5 - 44.5</td> <td style="text-align: center">44</td> <td style="text-align: center">74%</td> <td style="text-align: center">4.2</td> <td style="text-align: center">1.3</td> </tr> <tr> <td style="text-align: left">Cursor</td> <td style="text-align: center">23.5 - 24.6</td> <td style="text-align: center">59</td> <td style="text-align: center">60%</td> <td style="text-align: center">3.6</td> <td style="text-align: center">1.4</td> </tr> <tr> <td style="text-align: left">GitHub Copilot</td> <td style="text-align: center">0.5 - 2.7</td> <td style="text-align: center">54</td> <td style="text-align: center">44%</td> <td style="text-align: center">2.8</td> <td style="text-align: center">1.1</td> </tr> <tr> <td style="text-align: left">Grand Total</td> <td style="text-align: center">21.2 - 21.5</td> <td style="text-align: center">157</td> <td style="text-align: center">58%</td> <td style="text-align: center">3.5</td> <td style="text-align: center">1.3</td> </tr> </tbody> </table> <p>We also include the average estimated time without AI in days for each tool after removing the 5% longest tasks, and show that they are quite similar, with a variation of 0.3 days between the shortest (GitHub Copilot) and the longest (Cursor).</p>]]></content><author><name>Manuel Sánchez Hernández</name></author><category term="AI"/><category term="machine learning"/><category term="software development"/><category term="software-engineering"/><category term="productivity"/><category term="evaluation"/><summary type="html"><![CDATA[How we evaluated Claude Code, Cursor, and GitHub Copilot across 77 engineers and 165 real tasks to determine productivity impact]]></summary></entry><entry><title type="html">Rationalizing the AI bubble</title><link href="https://manuelsh.github.io/blog/2025/rationalizing-the-ai-bubble/" rel="alternate" type="text/html" title="Rationalizing the AI bubble"/><published>2025-11-11T09:00:00+00:00</published><updated>2025-11-11T09:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2025/rationalizing-the-ai-bubble</id><content type="html" xml:base="https://manuelsh.github.io/blog/2025/rationalizing-the-ai-bubble/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">11th November 2025</span></p> <div style="background-color: #3a3a3a; padding: 20px; border-radius: 8px; margin: 25px 0; color: #e8e8e8;"> <strong style="font-size: 1.1em; display: block; margin-bottom: 12px;">Key Points</strong> <ul style="margin: 0; padding-left: 20px;"> <li>The stock market, and especially the technology sector, is now trading near the valuation levels of the dot-com bubble. There’s also a significant gap between generative-AI revenue (about <span>$</span>60B) and the hundreds of billions that Big Tech firms and frontier labs are spending.</li> <li>This isn’t the dot-com bubble or the subprime crisis. 
Big Tech has strong profits, low debt, and solid balance sheets, so they can absorb slower AI revenue growth. I don't expect major Big Tech bankruptcies.</li> <li>Smaller frontier labs such as OpenAI or Anthropic are more exposed. Consolidation, potentially through acquisitions by Big Tech, looks likely.</li> <li>Even so, generative-AI revenue is growing fast: roughly ~100% YoY in 2025, based on the limited data available.</li> </ul> </div> <p>The history of AI is a history of disappointments, with at least two cycles that started with promising results and large investment pouring in, followed by underdelivery, significant losses and another AI winter. In these times of inflated expectations for AI, what is different? Are we on the verge of a financial meltdown similar to the subprime crisis or the dot-com bubble?</p> <p>As someone deeply involved in AI, I wanted to take a closer look, so I put on my financial hat and looked closely at the data. What I found was interesting: I don’t think that we are close to a massive collapse of the economy or the technology sector. We may encounter a price correction soon, though.</p> <h2 id="what-is-being-said-on-the-ai-bubble">What is being said on the AI bubble</h2> <p>The three main claims on the AI bubble are:</p> <p>(1) there is a gigantic gap between the infrastructure spend and AI revenue, which will not close,</p> <p>(2) there is a web of circular deals between NVIDIA and the rest of the AI ecosystem that artificially inflates prices, and</p> <p>(3) there is a ceiling on how much AI models can improve and we are reaching it, which implies a hard limit on the performance of many practical systems built using AI.</p> <p>Let’s take them one by one.</p> <h3 id="the-revenue-gap-ais-600-billion-dollar-question">The revenue gap: AI’s <span>$</span>600 billion question</h3> <p>In June 2024 Sequoia Capital published <a href="https://sequoiacap.com/article/ais-600b-question/">an article</a> arguing that the AI bubble was reaching a tipping point. Their claim, which is still valid today, was that AI builders are spending an estimated <span>$</span>150 billion per year on NVIDIA GPU cards. This is generating an enormous gap with the realized AI revenue, which is still at the “few tens of billions” level, and should be at <span>$</span>600 billion to justify such an investment.</p> <p>A similar view comes from Edward Zitron, who in <a href="https://www.wheresyoured.at/the-case-against-generative-ai/">The Case Against Generative AI</a> claims that everybody (OpenAI, Anthropic, Microsoft, Meta…) is losing money on AI, and this is creating a deeply unstable situation. Not only is there a large gap, but the promise of AI replacing workers is a fudge: it is not happening because models are not good enough.</p> <p>Additionally, as all companies are investing heavily in AI, their costs are increasing without much in return, neither productivity gains nor additional revenue. He cites a great <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">study from MIT</a> which shows that 95% of companies investing in AI are not getting any value out of it. He even compares this situation to the <a href="https://www.wheresyoured.at/subprimeai/">subprime crisis</a>: the whole tech industry is integrating generative AI technology to either increase their efficiency or to wrap it and sell it. 
This technology is offered at a steep discount to its true cost and is heavily concentrated in two providers that are not sustainable: OpenAI and Anthropic.</p> <p>In essence, the whole system is currently very unstable, built on promises of a very large amount of profit. And it seems that these profits are not coming.</p> <h3 id="a-web-of-circular-deals">A web of circular deals</h3> <p>If the previous gap weren’t large enough, add to that <a href="http://bloomberg.com/news/features/2025-10-07/openai-s-nvidia-amd-deals-boost-1-trillion-ai-boom-with-circular-deals">a web of circular deals</a> where the main GPU maker, NVIDIA, invests in the LLM providers, mostly OpenAI and xAI, which then spend it on NVIDIA cards, sometimes through intermediate companies such as Oracle. This not only inflates NVIDIA’s revenue but also adds more systemic risk.</p> <p>The following diagram, from Bloomberg, explains it well:</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/nvidia-openai-circular-deals-network-480.webp 480w,/assets/img/blog_images/nvidia-openai-circular-deals-network-800.webp 800w,/assets/img/blog_images/nvidia-openai-circular-deals-network-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/nvidia-openai-circular-deals-network.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Network diagram showing circular investment and purchasing deals between NVIDIA, OpenAI, and the broader AI ecosystem. Source: Bloomberg. </div> <p>We can also interpret this mechanism as NVIDIA investing in the companies that will fuel the AI revolution. And this would not be worrisome if we didn’t have the <span>$</span>600 billion question. In essence, the revenue gap is the key issue.</p> <h3 id="are-we-hitting-an-ai-ceiling">Are we hitting an AI ceiling?</h3> <p>Many sources say that although AI has made rapid progress, there may be a “ceiling” on its future performance: scaling the size of models, data quality or quantity, or architectural innovation may not be enough to make models better. Hence, there could be a limit to the upside, as many companies are betting on exponential AI improvement.</p> <p>Additionally, models are not yet reliable enough to substitute for humans on many tasks. Can a company replace people by paying a smaller, but still significant, amount to the LLM producers? Although there seem to be instances of this occurring, it’s not happening at a large scale. For example, in the case of coding, where AI is being deployed everywhere, programmers are not leaving companies en masse. Moreover, we see some papers with evidence to the contrary: <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">AI slows coders down</a>.</p> <h2 id="a-bit-of-rationality-what-is-the-data-saying">A bit of rationality: what is the data saying?</h2> <p>Let’s address the last point first, on the performance of the models. Then we will get some insights from the fundamental data.</p> <h3 id="no-ai-ceiling-in-sight-yet">No AI ceiling in sight yet</h3> <p>A great resource for understanding the evolution of state-of-the-art AI models comes from <a href="https://metr.org/">METR</a>, a non-profit organization that studies AI capabilities and their risks. 
In their article <a href="https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/">Measuring AI Ability to Complete Long Tasks</a> they measure the length of tasks, in human terms, that AI can complete with 50% and 80% success rates, and how this length evolves with time.</p> <p>This is well represented in the following chart. The <em>y</em> axis represents the duration it takes a human to complete different software engineering tasks, and the <em>x</em> axis represents the release date of different LLMs. The chart below uses a logarithmic <em>y</em> axis and shows the 50% success threshold, i.e. the model completes the task successfully 50% of the time.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/metr-ai-task-completion-timeline-480.webp 480w,/assets/img/blog_images/metr-ai-task-completion-timeline-800.webp 800w,/assets/img/blog_images/metr-ai-task-completion-timeline-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/metr-ai-task-completion-timeline.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Timeline showing how AI models' ability to complete software engineering tasks has evolved over time. The pattern shows task duration doubling every 7 months. Source: METR. </div> <p>They find a consistent pattern: <strong>every 7 months the human task duration doubles.</strong> So if the best models are now automating tasks that take humans an average of 2 hours, in 14 months (two doublings) we expect a model to complete, with 50% success, tasks that take humans 8 hours. Now, this is at 50%, which is a very low success rate. If the bar is set at 80% success, then the longest automated task is only 26 minutes of human time. Although small, doubling every 7 months puts us at 8 hours in roughly 28 months (480 / 26 ≈ 18x, slightly over four doublings).</p> <p>The work of METR is relevant not only because it establishes a predictive rule, but also because it shows that, so far, there is no evidence of an AI ceiling. The exponential growth hasn’t flattened.</p> <p>Additional evidence shows that the AI ceiling is not close. For example:</p> <ul> <li>The analysis of bottleneck factors such as electrical power, chip production, data scarcity and latency, done by Epoch AI, which shows that <a href="https://epoch.ai/blog/can-ai-scaling-continue-through-2030">scaling current models</a> 4x per year will still be feasible at least until 2030, reaching 2e29 FLOP.</li> <li>Technical innovations that improve models without scaling them, such as <a href="https://arxiv.org/abs/2503.07137">new network architectures</a> like the mixture of experts, or <a href="https://arxiv.org/abs/2408.03314">test time scaling</a>, which gave rise to the thinking mode of many models.</li> <li>Harder benchmarks are being created and AI models keep improving on them, such as <a href="https://arcprize.org/">ARC</a> or <a href="https://agi.safe.ai/">Humanity’s Last Exam</a>. Creating new benchmarks where AI fails <a href="https://manuelsh.github.io/blog/2025/datasets-for-advancing-Theoretical-Physics/">is essential</a> for advancing AI.</li> </ul> <h2 id="what-the-financial-data-is-saying">What the financial data is saying</h2> <p>It is true that there is a large gap between the revenue generated with AI and the expense. 
It is true that there is a circular economy. And it is true that the whole system is betting on exponential revenue growth due to AI. Now, how much financial runway do these companies have?</p> <h3 id="no-signs-of-big-tech-firms-collapse-risk">No signs of Big Tech collapse risk</h3> <p>We can easily find financial data for the Big Tech firms and compare their AI investments during 2025 with current depreciation, revenue, profits, and debt. We will focus on the companies that are building LLMs, as they are the ones most in need of AI revenues, and ignore other relevant companies in the ecosystem, such as NVIDIA, Oracle, SoftBank, etc.</p> <p>For the Big Tech companies (all data in <span>$</span> billions, 12-month trailing, sources in the annex):</p> <table> <thead> <tr> <th style="text-align: center">Company</th> <th style="text-align: center">Estimated AI investment</th> <th style="text-align: center">Depreciation &amp; amortization</th> <th style="text-align: center">Revenue</th> <th style="text-align: center">Profit</th> <th style="text-align: center">Debt</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">Microsoft</td> <td style="text-align: center">97</td> <td style="text-align: center">34</td> <td style="text-align: center">282</td> <td style="text-align: center">102</td> <td style="text-align: center">61</td> </tr> <tr> <td style="text-align: center">Alphabet</td> <td style="text-align: center">85</td> <td style="text-align: center">40</td> <td style="text-align: center">371</td> <td style="text-align: center">116</td> <td style="text-align: center">27</td> </tr> <tr> <td style="text-align: center">Amazon</td> <td style="text-align: center">101</td> <td style="text-align: center">59</td> <td style="text-align: center">670</td> <td style="text-align: center">71</td> <td style="text-align: center">51</td> </tr> <tr> <td style="text-align: center">Meta</td> <td style="text-align: center">69</td> <td style="text-align: center">15</td> <td style="text-align: center">179</td> <td style="text-align: center">72</td> <td style="text-align: center">50</td> </tr> <tr> <td style="text-align: center"><strong>Total</strong></td> <td style="text-align: center"><strong>352</strong></td> <td style="text-align: center"><strong>148</strong></td> <td style="text-align: center"><strong>1,502</strong></td> <td style="text-align: center"><strong>361</strong></td> <td style="text-align: center"><strong>189</strong></td> </tr> </tbody> </table> <p>It’s obvious that AI investments, which will translate into depreciation over the coming years, potentially reducing profits, are substantial. However, for Big Tech firms, it’s evident that these expenditures won’t cause irreparable damage. In most cases, the investment remains below annual profits, and it can possibly be absorbed by a modest increase in debt. Therefore, even if these companies do not find a way of generating significant returns from generative AI, they are unlikely to collapse, though their share prices may face a correction.</p>
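<p>To put the table in perspective, here is a rough sketch of AI investment as a share of trailing profit, using the rounded figures above:</p> <pre><code class="language-python"># AI investment vs. trailing profit, $B, from the table above (rounded).
big_tech = {
    "Microsoft": (97, 102),
    "Alphabet": (85, 116),
    "Amazon": (101, 71),
    "Meta": (69, 72),
}
for name, (ai_investment, profit) in big_tech.items():
    print(f"{name}: AI investment is {ai_investment / profit:.0%} of annual profit")
# Microsoft ~95%, Alphabet ~73%, Amazon ~142%, Meta ~96%: only Amazon's
# estimated investment exceeds its trailing profit.
</code></pre>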
<h3 id="frontier-labs-have-more-risks-but-they-are-comparably-small">Frontier labs have more risks, but they are comparably small</h3> <p>Let’s now look at the private frontier labs. There is not much public data, so we will build the table from estimations (all data in <span>$</span> billions, 12-month trailing, sources in the annex):</p> <table> <thead> <tr> <th style="text-align: center">Company</th> <th style="text-align: center">Estimated AI investment</th> <th style="text-align: center">Estimated revenue</th> <th style="text-align: center">Estimated profit</th> <th style="text-align: center">Estimated debt</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">OpenAI</td> <td style="text-align: center">7</td> <td style="text-align: center">12</td> <td style="text-align: center">-10</td> <td style="text-align: center">?</td> </tr> <tr> <td style="text-align: center">Anthropic</td> <td style="text-align: center">10 - 15</td> <td style="text-align: center">5 - 9</td> <td style="text-align: center">-5</td> <td style="text-align: center">2.5</td> </tr> <tr> <td style="text-align: center">xAI</td> <td style="text-align: center">20 - 25</td> <td style="text-align: center">1 - 3</td> <td style="text-align: center">?</td> <td style="text-align: center">5</td> </tr> <tr> <td style="text-align: center"><strong>Total</strong></td> <td style="text-align: center">37-47</td> <td style="text-align: center">18-24</td> <td style="text-align: center">-15?</td> <td style="text-align: center">&gt;7.5?</td> </tr> </tbody> </table> <p>Unsurprisingly, none of them is profitable. However, they are small compared with the Big Tech firms, which, by the way, hold a lot of cash. So a scenario of consolidation is a potential outcome if some of the frontier labs struggle.</p> <h3 id="high-pace-of-revenue-growth">High pace of revenue growth</h3> <p>Another important angle to explore is the current pace of AI revenue growth. It is not easy to estimate, as most companies do not report it separately, but we can still get some data for some of the companies (AI revenue in <span>$</span> billions; Google and Amazon excluded):</p> <table> <thead> <tr> <th style="text-align: center">Company</th> <th style="text-align: center">2023</th> <th style="text-align: center">2024</th> <th style="text-align: center">2025 (latest run-rate)</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><strong>OpenAI</strong></td> <td style="text-align: center">~2.0</td> <td style="text-align: center">~5.5</td> <td style="text-align: center">12.0</td> </tr> <tr> <td style="text-align: center"><strong>Anthropic</strong></td> <td style="text-align: center">~0.2</td> <td style="text-align: center">~1.0</td> <td style="text-align: center">~7.0</td> </tr> <tr> <td style="text-align: center"><strong>xAI</strong></td> <td style="text-align: center">~0</td> <td style="text-align: center">~0.1</td> <td style="text-align: center">~1.0</td> </tr> <tr> <td style="text-align: center"><strong>Microsoft</strong></td> <td style="text-align: center">~1.5</td> <td style="text-align: center">10</td> <td style="text-align: center">13</td> </tr> <tr> <td style="text-align: center"><strong>Total</strong></td> <td style="text-align: center">~3.7</td> <td style="text-align: center">~16.6</td> <td style="text-align: center">~33.0</td> </tr> <tr> <td style="text-align: center"><strong>YoY Growth</strong></td> <td style="text-align: center"> </td> <td style="text-align: center">+349%</td> <td style="text-align: center">+98%</td> </tr> </tbody> </table> <p>Known AI revenue, which omits much of the Big Tech data, is growing fast, at +98% in the last year.</p>
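<p>A quick sanity check of these growth figures, and of the timeline they imply, in a minimal sketch; the <span>$</span>60 billion base used at the end is the ecosystem-wide estimate discussed in the next paragraph:</p> <pre><code class="language-python">import math

# Known AI revenue totals from the table above, $B (rounded).
totals = {2023: 3.7, 2024: 16.6, 2025: 33.0}
print(totals[2024] / totals[2023] - 1)  # ~3.49 -> +349% YoY
print(totals[2025] / totals[2024] - 1)  # ~0.99 -> +98% YoY

def years_to_reach(current_b, target_b, yoy_growth=1.0):
    """Years until revenue reaches target, growing at yoy_growth (1.0 = +100%/year)."""
    return math.log(target_b / current_b) / math.log(1 + yoy_growth)

# From a ~$60B base to the $600B mark at +100% a year: ~3.3 years.
print(round(years_to_reach(60, 600), 1))
</code></pre>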
Assuming that there is a total of <span>$</span>60 billion in AI revenue (which is Ed Zitron’s <a href="https://www.wheresyoured.at/the-case-against-generative-ai/#openai-needs-more-than-a-trillion-dollars-500bn-in-operational-expenses-and-at-least-another-625bn-800bn-for-data-centers-and-there-is-not-enough-private-and-venture-capital-to-pay-for-it">estimate</a>) and assuming +100% annual growth over the next few years (ambitious), the <span>$</span>600 billion question would be solved in roughly 3.3 years, since 60 × 2<sup>3.3</sup> ≈ 600.</p> <p>In conclusion, the possibility of one of the Big Tech firms collapsing because of its over-investment in AI is small. And if one of the comparably smaller frontier labs does, it won’t have a major effect on the Big Tech firms. The AI bubble seems more about a potential price correction than a full-scale blow-up like the subprime mortgage crisis.</p> <p>How significant could this correction be? Let’s attempt an estimate.</p> <h2 id="big-tech-stock-prices">Big tech stock prices</h2> <h3 id="current-prices-are-inflated">Current prices are inflated</h3> <p>We can check different price ratios to understand if stocks are overvalued, and what current prices imply. We will use the following ratios:</p> <ul> <li> <p><strong>P/E (Price-to-Earnings):</strong> Market price divided by net earnings. Shows how much investors pay for each dollar of profit, a gauge of profit-based valuation.</p> </li> <li> <p><strong>P/S (Price-to-Sales):</strong> Market value divided by total revenue. Useful when profits are thin or negative; shows how expensive a company’s sales are.</p> </li> <li> <p><strong>P/B (Price-to-Book):</strong> Market value divided by accounting net assets. Indicates how much investors value the company above its tangible book worth. Higher for asset-light, IP-driven firms.</p> </li> <li> <p><strong>EV/EBITDA (Enterprise Value to EBITDA):</strong> Compares total business value (equity + debt – cash) to operating cash earnings. 
Balances profitability with capital structure; key for assessing overall valuation.</p> </li> </ul> <p>And this is how they look:</p> <table> <thead> <tr> <th style="text-align: center">Company</th> <th style="text-align: center">P/E</th> <th style="text-align: center">P/S</th> <th style="text-align: center">P/B</th> <th style="text-align: center">EV/EBITDA</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">Microsoft</td> <td style="text-align: center">37.8</td> <td style="text-align: center">10.7</td> <td style="text-align: center">11.1</td> <td style="text-align: center">20.9</td> </tr> <tr> <td style="text-align: center">Alphabet</td> <td style="text-align: center">29</td> <td style="text-align: center">9.44</td> <td style="text-align: center">9.22</td> <td style="text-align: center">22.8</td> </tr> <tr> <td style="text-align: center">Amazon</td> <td style="text-align: center">36.5</td> <td style="text-align: center">3.85</td> <td style="text-align: center">8.01</td> <td style="text-align: center">18.9</td> </tr> <tr> <td style="text-align: center">Meta</td> <td style="text-align: center">28.2</td> <td style="text-align: center">8.22</td> <td style="text-align: center">8.24</td> <td style="text-align: center">15.6</td> </tr> <tr> <td style="text-align: center">Nvidia</td> <td style="text-align: center">53.6</td> <td style="text-align: center">29.6</td> <td style="text-align: center">50.3</td> <td style="text-align: center">49.7</td> </tr> <tr> <td style="text-align: center">Oracle</td> <td style="text-align: center">56.9</td> <td style="text-align: center">12.5</td> <td style="text-align: center">30.9</td> <td style="text-align: center">32.2</td> </tr> </tbody> </table> <p>We can gauge whether these values are high by looking at their typical historical values, both for the technology sector and for all companies:</p> <table> <thead> <tr> <th style="text-align: center"> </th> <th style="text-align: center">P/E</th> <th style="text-align: center">P/S</th> <th style="text-align: center">P/B</th> <th style="text-align: center">EV/EBITDA</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><strong>Technology Sector</strong></td> <td style="text-align: center">25 – 35</td> <td style="text-align: center">5 – 10</td> <td style="text-align: center">5 – 10</td> <td style="text-align: center">15 – 25</td> </tr> <tr> <td style="text-align: center"><strong>All Companies (S&amp;P 500 avg)</strong></td> <td style="text-align: center">15 – 18</td> <td style="text-align: center">1 – 3</td> <td style="text-align: center">2 – 3</td> <td style="text-align: center">8 – 12</td> </tr> </tbody> </table> <p>With these benchmarks, it’s easy to see that the major risks are in NVIDIA and Oracle. If we look at the average P/E for the Technology Sector, which is currently at 40, we see that we are approaching 2000 dot-com levels. 
See the chart below (<a href="https://worldperatio.com/sector/sp-500-information-technology/?utm_source=chatgpt.com">source</a>).</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/pe-ratio-historical-chart-480.webp 480w,/assets/img/blog_images/pe-ratio-historical-chart-800.webp 800w,/assets/img/blog_images/pe-ratio-historical-chart-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/pe-ratio-historical-chart.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Historical P/E ratio chart showing technology sector valuations approaching 2000 dot-com bubble levels. </div> <h3 id="scenarios-of-correction">Scenarios of correction</h3> <p>Let’s imagine what could happen if markets become more rational: earnings stay at current levels, but the multiples of the above companies fall to significantly lower values, a P/E of 20x or 16x. What would happen to the whole S&amp;P 500 index? Let’s check in the following table:</p> <table> <thead> <tr> <th>Company</th> <th>S&amp;P 500 weight (%)</th> <th>Current P/E</th> <th>Price change @20x (%)</th> <th>S&amp;P 500 contrib @20x (%)</th> <th>Price change @16x (%)</th> <th>S&amp;P 500 contrib @16x (%)</th> </tr> </thead> <tbody> <tr> <td>Microsoft</td> <td>6.1</td> <td>37.8</td> <td>-47.1</td> <td>-2.9</td> <td>-57.7</td> <td>-3.5</td> </tr> <tr> <td>Alphabet</td> <td>5.6</td> <td>29.3</td> <td>-31.7</td> <td>-1.8</td> <td>-45.4</td> <td>-2.5</td> </tr> <tr> <td>Amazon</td> <td>4.3</td> <td>36.5</td> <td>-45.2</td> <td>-1.9</td> <td>-56.2</td> <td>-2.4</td> </tr> <tr> <td>Meta</td> <td>2.6</td> <td>22.0</td> <td>-9.1</td> <td>-0.2</td> <td>-27.3</td> <td>-0.7</td> </tr> <tr> <td>Nvidia</td> <td>7.6</td> <td>56.1</td> <td>-64.3</td> <td>-4.9</td> <td>-71.5</td> <td>-5.4</td> </tr> <tr> <td>Oracle</td> <td>1.1</td> <td>54.9</td> <td>-63.6</td> <td>-0.7</td> <td>-70.9</td> <td>-0.8</td> </tr> <tr> <td><strong>Total</strong></td> <td><strong>26.9</strong></td> <td>—</td> <td>—</td> <td><strong>-12.4</strong></td> <td>—</td> <td><strong>-15.4</strong></td> </tr> </tbody> </table> <p>In both scenarios, the correction of the S&amp;P 500 index, although significant, won’t be a disaster: -12% in the 20x case and -15% in the 16x case. For context, during Covid the index had a drawdown of -34% in a single month.</p> <p>Note, however, that this estimate doesn’t account for a decline in other companies, which could also be dragged down. The fall of the S&amp;P 500 index may therefore be larger. Having said that, it could also go the other way: at multiples of 20x or 16x under current circumstances, I believe these six companies would be a bargain.</p> <h2 id="conclusion-not-a-meltdown-potentially-a-price-correction">Conclusion: not a meltdown, potentially a price correction</h2> <p>The empiricist and major proponent of the scientific method, Francis Bacon, once said “the truth is the daughter of time, not of authority”. This is especially true in finance. In the long term, the fundamentals of a stock, the building blocks of a company’s economic reality, will drive its price. Meanwhile, in the short term, each stock price is a collective statement, a collective expectation of that long-term reality.</p> <p>And current prices are telling us that the expectations are high. 
In fact, this is the profit growth that current prices implicitly demand from each company:</p> <table> <thead> <tr> <th style="text-align: center">Company</th> <th style="text-align: center">Current P/E</th> <th style="text-align: center">Required profit growth to reach P/E=20 (%)</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">Microsoft</td> <td style="text-align: center">37.8</td> <td style="text-align: center">+89%</td> </tr> <tr> <td style="text-align: center">Alphabet</td> <td style="text-align: center">29.3</td> <td style="text-align: center">+47%</td> </tr> <tr> <td style="text-align: center">Amazon</td> <td style="text-align: center">36.5</td> <td style="text-align: center">+82%</td> </tr> <tr> <td style="text-align: center">Meta</td> <td style="text-align: center">22.0</td> <td style="text-align: center">+10%</td> </tr> <tr> <td style="text-align: center">Nvidia</td> <td style="text-align: center">56.1</td> <td style="text-align: center">+181%</td> </tr> <tr> <td style="text-align: center">Oracle</td> <td style="text-align: center">54.9</td> <td style="text-align: center">+175%</td> </tr> <tr> <td style="text-align: center"><strong>Average</strong></td> <td style="text-align: center"> </td> <td style="text-align: center"><strong>+97%</strong></td> </tr> </tbody> </table> <p>On average, the market is expecting +97% profit growth, which is undeniably high. This profit, on the order of a few hundred billion dollars, needs to come from either consumers or companies paying more.</p> <p>Companies outside Big Tech are already adopting AI, raising their costs without yet seeing much profit. For such a large amount of money to change hands, we need to see results: higher productivity or stronger revenues. But we also need to know that the AI ceiling isn’t near.</p>
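<p>Both tables above follow from the same two-line arithmetic. Here is a minimal sketch in Python, using the S&amp;P 500 weights and P/E figures quoted above, with 20x as the assumed fair multiple:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Re-rating arithmetic, assuming earnings stay constant.
def price_change(current_pe, target_pe):
    # price change if the multiple falls to target_pe (e.g. Microsoft: 20/37.8 - 1 ≈ -47%)
    return target_pe / current_pe - 1

def required_profit_growth(current_pe, target_pe):
    # profit growth needed to justify the current price at target_pe
    return current_pe / target_pe - 1   # e.g. Microsoft: 37.8/20 - 1 ≈ +89%

weights = {"Microsoft": 6.1, "Alphabet": 5.6, "Amazon": 4.3,
           "Meta": 2.6, "Nvidia": 7.6, "Oracle": 1.1}   # S&amp;P 500 weights, %
pes = {"Microsoft": 37.8, "Alphabet": 29.3, "Amazon": 36.5,
       "Meta": 22.0, "Nvidia": 56.1, "Oracle": 54.9}

# index-level impact of re-rating all six names to 20x: ≈ -12.4 percentage points
print(sum(weights[c] * price_change(pes[c], 20) for c in pes))

# average profit growth implied by current prices at 20x: ≈ +97%
print(sum(required_profit_growth(pe, 20) for pe in pes.values()) / len(pes) * 100)
</code></pre></div></div>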
<p>The real game is happening as much in the application layer as in the model layer.</p> <h2 id="annex-data-sources">Annex: Data sources</h2> <p>Microsoft FY25 Q4 Earnings Call / Investor Relations — <a href="https://www.microsoft.com/en-us/Investor/earnings/FY-2025-Q4/press-release-webcast">Press release &amp; webcast</a></p> <p>Microsoft FY25 Form 10-K — <a href="https://www.sec.gov/ixviewer/doc?action=display&amp;source=content&amp;source_url=/Archives/edgar/data/789019/000156459025000001/msft-20250630.htm">SEC filing</a></p> <p>Alphabet Q2 2025 Earnings — <a href="https://abc.xyz/investor/news/2025/q2-2025-earnings/">Investor relations release</a></p> <p>Meta FY2024 Results — <a href="https://investor.atmeta.com/investor-news/press-release-details/2025/Meta-Reports-Fourth-Quarter-and-Full-Year-2024-Results/default.aspx">Investor relations release</a></p> <p>Meta 2025 Capex Range ($64–72B) — <a href="https://investor.atmeta.com/investor-news/press-release-details/2025/Meta-Reports-First-Quarter-2025-Results/default.aspx">Investor relations (Q1 2025)</a> • <a href="https://techcrunch.com/2025/07/30/meta-to-spend-up-to-72b-on-ai-infrastructure-in-2025-as-compute-arms-race-escalates/">TechCrunch coverage</a></p> <p>S&amp;P Dow Jones Indices — <a href="https://www.spglobal.com/spdji/en/documents/additional-material/sp-500-eps-est.xlsx">S&amp;P 500 EPS / Index Earnings (xlsx)</a></p> <p>Yardeni Research — <a href="https://www.yardeni.com/pub/sp500pe.pdf">S&amp;P 500 &amp; Sector Valuation (Forward P/E, P/S)</a></p> <p>Damodaran Online — <a href="https://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/pedata.html">Industry Multiples</a></p> <p>Multpl — <a href="https://www.multpl.com/s-p-500-price-to-sales">S&amp;P 500 Price-to-Sales</a></p> <p>YCharts — <a href="https://ycharts.com/indicators/sp_500_price_to_sales_ratio">S&amp;P 500 Price-to-Sales chart</a></p> <p>FinanceCharts — <a href="https://www.financecharts.com/">Homepage / tools &amp; TTM data</a></p> <p>Macrotrends — <a href="https://www.macrotrends.net/stocks/charts/AMZN/amazon/total-depreciation-amortization-cash-flow">Depreciation &amp; Amortization (example: AMZN)</a></p> <p>GuruFocus — <a href="https://www.gurufocus.com/term/cash-flow-depreciation-depletion-amortization/AMZN">D&amp;A TTM (example: AMZN)</a></p> <p>StockAnalysis — <a href="https://stockanalysis.com/stocks/amzn/revenue/">TTM revenue &amp; profit (example: AMZN)</a></p> <p>CompaniesMarketCap — <a href="https://companiesmarketcap.com/top-companies-by-pe-ratio/">Companies ranked by P/E</a></p> <p>Reuters (AI capex &amp; revenues) — <a href="https://www.reuters.com/technology/microsoft-boots-capital-spending-80-billion-ai-2025-07-23/">Microsoft boosts FY25 AI/data-center spend</a> • <a href="https://www.reuters.com/technology/alphabet-ceo-reaffirms-planned-75-billion-capital-spending-2025-2025-04-09/">Alphabet reaffirms ~$75B 2025 capex</a> • <a href="https://www.reuters.com/business/anthropic-hits-3-billion-annualized-revenue-business-demand-ai-2025-05-30/">Anthropic hits <span>$</span>3B ARR run-rate</a></p>]]></content><author><name></name></author><category term="AI,"/><category term="finance,"/><category term="technology,"/><category term="bubble,"/><category term="economics"/><summary type="html"><![CDATA[An analysis of the AI bubble through financial data: examining revenue gaps, circular deals, and whether we're heading for a meltdown or just a price correction]]></summary></entry><entry><title type="html">Beyond Tokens: The Context-Window Perspective on 
LLMs, Memory, and Mind</title><link href="https://manuelsh.github.io/blog/2025/understanding-LLM-from-context-window-pov/" rel="alternate" type="text/html" title="Beyond Tokens: The Context-Window Perspective on LLMs, Memory, and Mind"/><published>2025-07-01T09:00:00+00:00</published><updated>2025-07-01T09:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2025/understanding-LLM-from-context-window-pov</id><content type="html" xml:base="https://manuelsh.github.io/blog/2025/understanding-LLM-from-context-window-pov/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">3rd July 2025</span></p> <blockquote> <p>This article expands on ideas I first presented during a keynote at an AI Hackathon organized by <a href="https://www.fotocasa.es/en">Fotocasa</a>. I am grateful to the organizers for inviting me.</p> </blockquote> <p>Large Language Models don’t work the way most people think they do. They are massive neural networks with billions of parameters (neuronal connections), but when they’re generating responses (making an <em>inference</em>), they remain static: their internal state doesn’t change.</p> <p>The goal of this article is to demystify some of the inner workings of Large Language Models and explain how agentic behavior can be achieved, all from the perspective of the model’s input, which makes it very intuitive.</p> <h2 id="but-what-is-a-large-language-model">But what is a Large Language Model?</h2> <p>You might not realize it, but your phone has had a language model for over 10 years. Not a large one, but it’s there. It’s used to predict the next word you’ll type.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/mobile-keyboard-480.webp 480w,/assets/img/blog_images/mobile-keyboard-800.webp 800w,/assets/img/blog_images/mobile-keyboard-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/mobile-keyboard.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> The keyboards of mobile devices have had language models for over 10 years. </div> <p>Language models do exactly that: from a limited vocabulary (usually tens of thousands of the most common words) they choose the most likely next word given the previous context. They accomplish this based on the vast amounts of data they were trained on.</p> <p>Something fascinating happens when you scale up these models. By increasing both the number of parameters and the training data, the model suddenly becomes dramatically more powerful. Think of a parameter as a neural connection between two neurons; the largest current models reach into the trillions of parameters (e.g., <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Llama 4 Behemoth</a>).</p> <p>This performance change occurs almost like a phase transition: suddenly, when the model reaches a certain size and training duration (with sufficient data), it acquires entirely new abilities. 
This phenomenon was thoroughly documented in the paper <a href="https://arxiv.org/abs/2206.07682?utm_source=chatgpt.com">Emergent Abilities of Large Language Models</a>, and the scaling laws are also summarized in <a href="https://manuelsh.github.io/blog/2025/NIPS-building-llm-workshop/">Opening the LLM pipeline</a>.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/emergent-abilities-llm-480.webp 480w,/assets/img/blog_images/emergent-abilities-llm-800.webp 800w,/assets/img/blog_images/emergent-abilities-llm-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/emergent-abilities-llm.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> After enough training computation (FLOPs), models suddenly acquire new abilities, as measured on different benchmarks. *Source: Emergent Abilities of Large Language Models, Oct 2022* </div> <p>In reality, LLMs don’t predict the next word but the next token, which represents a compromise between predicting individual characters and entire words. On average, a token corresponds to about 0.8 words. For simplicity, I’ll use “words” and “tokens” interchangeably throughout this article.</p> <p>The task of predicting the next word is surprisingly profound. Consider this prompt:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The cat sat on the mat" in Spanish is
</code></pre></div></div> <p>To predict the next set of tokens, the model needs to understand how to translate from English to Spanish:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"El gato se sentó en la alfombra"
</code></pre></div></div> <p>Or consider this more complex example:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You are a Grandmaster chess player. Predict the next move:
1. e4 c6 2. d4 d5 3. Nc3 dxe4 4. Nxe4 Nf6 5. Nxf6+ exf6 6.
</code></pre></div></div> <p>Here, the model needs to understand chess strategy to suggest a good move.</p> <p>Therefore, predicting the next word requires learning to translate, play chess, write poetry, code, and much more. Essentially, it requires learning about the world itself. As Ilya Sutskever said, “text is just a projection of the world.”</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/new-abilities-llm-480.webp 480w,/assets/img/blog_images/new-abilities-llm-800.webp 800w,/assets/img/blog_images/new-abilities-llm-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/new-abilities-llm.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Predicting the next token implies learning about the world. </div> <h3 id="from-llm-to-chatbot">From LLM to Chatbot</h3> <p>While LLMs are powerful on their own, creating a chatbot like ChatGPT requires additional steps. Chatbots need to maintain coherent conversations, which demands more than just next-token prediction. This is where <em>reinforcement learning</em> comes into play. In simplified terms, the process to go from an LLM to a chatbot with certain characteristics looks like this:</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/rlhf-simple-480.webp 480w,/assets/img/blog_images/rlhf-simple-800.webp 800w,/assets/img/blog_images/rlhf-simple-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/rlhf-simple.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Reinforcement learning with human feedback (RLHF), simplified. </div> <p>Starting with a raw LLM, we ask it to produce several answers to the same question. These answers are then rated by humans based on various criteria (usefulness, truthfulness, helpfulness, etc.). The ranked responses are then used, through a process that may involve training another model, to retrain the LLM by readjusting its parameters. This process repeats iteratively until the model reaches a satisfactory state.</p> <h3 id="the-context-window">The Context Window</h3> <p>The context window represents the input to the LLM, essentially the set of tokens the model uses to predict the next one. Each LLM has a different context window size, as in most architectures the computational cost scales quadratically with context length (though there are exceptions). 
The LLM with the largest context window (Llama 4 Scout) can process 10 million tokens, which is roughly equivalent to the first 5 volumes of the Encyclopedia Britannica.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/encyclopaedia-britannica-480.webp 480w,/assets/img/blog_images/encyclopaedia-britannica-800.webp 800w,/assets/img/blog_images/encyclopaedia-britannica-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/encyclopaedia-britannica.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> The LLM with the largest context window (Llama 4 Scout) has 10 million tokens, which can process the first 5 volumes of the Encyclopedia Britannica (~7.6M words or 10kg of books) in one go. </div> <p>Are models with larger context windows necessarily better? The trend over the past few years has been exponential growth, until recently. With the introduction of reasoning models in the last year, this trend has plateaued. This shift occurs partly because reasoning models iterate over input data in an “agentic mode”, working with summaries and strategically manipulating the context window to reach final answers. More on memory and agents later.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/context-window-length-480.webp 480w,/assets/img/blog_images/context-window-length-800.webp 800w,/assets/img/blog_images/context-window-length-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/context-window-length.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Context window length of prominent LLMs over time, logarithmic scale. They have been growing exponentially. </div> <h3 id="examples-of-calls-to-llms">Examples of Calls to LLMs</h3> <p>When we write something to an LLM, we’re actually sending a request to the trained neural network. This process of reading from the context window and generating predictions is called <em>inference</em>.</p> <p>What we write to the model is called a <em>prompt</em>, which gives rise to the term <em>prompt engineering</em>: the art of crafting prompts that produce desired outputs. As some have noted, a more accurate term would be <em>context engineering</em>.</p> <p>How does this work in practice? Let’s examine some examples. The call to the LLM is typically formatted as JSON, though what actually enters the context window is just a string. Here’s a simplified example of what this JSON looks like:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a playful assistant."</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hi!"</span><span class="w"> </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The first message of the request above is the system message, which instructs the model on how to behave and can contain custom instructions. We’ll see more applications of this later.</p> <p>The second message is the user message: what the user has written. Together, these elements form the prompt. The model then generates a response. Technically, the model doesn’t produce a complete response at once, but generates one token at a time in a loop, reading the entire context window plus the newly generated token each time, until it produces a token that signals the end of the response. While this isn’t shown in the JSON format, it’s important to understand this mechanism.</p> <p>The response of the model can look like this:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hey there!"</span><span class="w"> </span><span class="p">}]</span><span class="w">
</span><span class="p">}</span><span class="w">
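</span></code></pre></div></div> <p>The token-by-token loop described above can be made concrete in a few lines of code. Here is a minimal sketch of greedy decoding using the Hugging Face <code class="language-plaintext highlighter-rouge">transformers</code> library, with GPT-2 chosen purely for illustration (it is small enough to run anywhere; any causal language model follows the same pattern):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Hello", return_tensors="pt").input_ids
for _ in range(20):                      # generate at most 20 new tokens
    logits = model(ids).logits           # read the whole context window
    next_id = logits[0, -1].argmax()     # greedily pick the most likely token
    if next_id == tok.eos_token_id:      # the stop token signals the end
        break
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append and repeat

print(tok.decode(ids[0]))
</code></pre></div></div> <p>Every iteration re-reads the full context plus everything generated so far, exactly as described above.</p>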
<p>When the user asks another question, the entire conversation history is sent to the model again:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a playful assistant."</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hi!"</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hey there!"</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Can you tell me a joke?"</span><span class="w"> </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
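</span></code></pre></div></div> <p>The model generates a response, and this process continues.</p> <p>In code, this multi-turn pattern is just a list that keeps growing. A minimal sketch, where <code class="language-plaintext highlighter-rouge">call_model</code> is a hypothetical function that sends the messages to any LLM API and returns the assistant’s text:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The full history is re-sent on every turn: the model itself keeps no state.
history = [{"role": "system", "content": "You are a playful assistant."}]

while True:
    history.append({"role": "user", "content": input("You: ")})
    reply = call_model(history)   # hypothetical: POST the messages, get text back
    history.append({"role": "assistant", "content": reply})
    print("Assistant:", reply)
</code></pre></div></div>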
<p>As mentioned earlier, what actually enters the context window differs from the JSON format and looks like this in its raw form:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;|im_start|&gt;system
"You are a playful assistant.
&lt;|im_end|&gt;
&lt;|im_start|&gt;user
Hi!
&lt;|im_end|&gt;
&lt;|im_start|&gt;assistant
Hey there!
&lt;|im_end|&gt;
&lt;|im_start|&gt;user
Can you tell me a joke?
&lt;|im_end|&gt;

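</code></pre></div></div> <p>where the <code class="language-plaintext highlighter-rouge">&lt;|im_start|&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;|im_end|&gt;</code> are actually tokens that mark the start and end of each message.</p> <p>This raw string is produced by the model’s chat template, and you can render it yourself with the Hugging Face <code class="language-plaintext highlighter-rouge">transformers</code> library. A minimal sketch, assuming a model whose tokenizer ships a ChatML-style template (Qwen is one example):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from transformers import AutoTokenizer

# any model whose tokenizer ships a ChatML-style chat template works here
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a playful assistant."},
    {"role": "user", "content": "Hi!"},
]

# renders the &lt;|im_start|&gt;/&lt;|im_end|&gt;-delimited string fed to the model
print(tok.apply_chat_template(messages, tokenize=False,
                              add_generation_prompt=True))
</code></pre></div></div> <p>Other model families use different special tokens, but the principle is identical.</p>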
<h3 id="past-memories">Past memories</h3> <p>Many current LLM providers, such as OpenAI, offer memory features. So, if the neural network is static, how is this done?</p> <p>Again, through the context window. The model will store selected parts of the conversation (one can imagine an LLM running in the background that does that) and then add them to the context window.</p> <p>The call to the model can look like this:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are a playful assistant."</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"memory"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User name is Manuel."</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"memory"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User is from Spain."</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hi!"</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Hey there!"</span><span class="w"> </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Can you tell me a joke?"</span><span class="w"> </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
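</span></code></pre></div></div> <p>where memories are incorporated as a special message type.</p> <p>Note that <code class="language-plaintext highlighter-rouge">memory</code> is not a standard role in most chat APIs, which typically accept only <code class="language-plaintext highlighter-rouge">system</code>, <code class="language-plaintext highlighter-rouge">user</code>, <code class="language-plaintext highlighter-rouge">assistant</code> and tool-related roles, so a common pattern in practice is to fold the retrieved memories into the system message. A minimal sketch:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># "memory" is not a standard role, so fold memories into the system prompt
memories = ["User name is Manuel.", "User is from Spain."]

system = ("You are a playful assistant.\n\n"
          "Known facts about the user:\n"
          + "\n".join(f"- {m}" for m in memories))

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Hi!"},
]
</code></pre></div></div>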
<h2 id="consciousness">Consciousness?</h2> <p>When using very powerful models that can engage in seemingly fluent conversations, it’s natural to believe it possesses some sort of consciousness, as the responses are almost human-like. By looking at the mechanics just explained, we can see that if any form of consciousness exists, it’s fundamentally different from human consciousness.</p> <p>First, the LLM’s neural network remains static: it doesn’t change. Therefore, there’s no evolution, no new memories are stored within it, and it doesn’t remember previous conversations or experiences. The model only responds to what’s in the current context window. It’s purely a function of its input: a function in the strict mathematical sense.</p> <p>Second, the neural network only activates during next-token prediction. If any form of consciousness exists within this network, it only lasts for the duration of this prediction process, with the context window’s content being a crucial component.</p> <p>I believe the next generation of AI should address this limitation: creating systems that evolve over time (with dynamic neural networks) and can independently store memories. This might involve incorporating retraining mechanisms, or even systems where new neurons are added and others removed.</p> <h2 id="agents">Agents</h2> <p>Agents do much more than simply respond to prompts. They can search the internet, call internal functions or external APIs, access past memories, etc. By examining the context window, we can understand how they work.</p> <p>Let’s consider a simple agent that can call a calculator function. Here’s how the conversation unfolds in the context window:</p> <p>First, the system message instructs the agent about its capabilities and how to use the calculator function. This is followed by the user’s question, which will trigger a multi-step process:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are an agent that can call a calculator function.
      The function `call_calculator` expects a JSON object with a single
      field `expression` containing a valid math expression and returns
      a JSON object with a field `result`."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is 12 × 7?"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The agent responds by indicating it needs to use the calculator:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I need to multiply two numbers. Action: call_calculator"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"function_call"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"call_calculator"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"arguments"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"expression"</span><span class="p">:</span><span class="w"> </span><span class="s2">"12 * 7"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>An external system parses this response, calls the calculator function, and adds the result back to the context:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"function"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"call_calculator"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{</span><span class="se">\"</span><span class="s2">result</span><span class="se">\"</span><span class="s2">: 84}"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>Finally, the LLM sees this result and provides the final answer:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Observation: The calculator says 84. Final Answer: 12 × 7 = 84."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The complete conversation in the context window looks like this:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"system"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"You are an agent that can call a calculator function.
      The function `call_calculator` expects a JSON object with a single
      field `expression` containing a valid math expression and returns
      a JSON object with a field `result`."</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is 12 × 7?"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"I need to multiply two numbers. Action: call_calculator"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"function_call"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"call_calculator"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"arguments"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"expression"</span><span class="p">:</span><span class="w"> </span><span class="s2">"12 * 7"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"function"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"call_calculator"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{</span><span class="se">\"</span><span class="s2">result</span><span class="se">\"</span><span class="s2">: 84}"</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Observation: The calculator says 84. Final Answer: 12 × 7 = 84."</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
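</span></code></pre></div></div> <p>The orchestration around the model is surprisingly small. Here is a minimal sketch of the loop the external system runs; <code class="language-plaintext highlighter-rouge">call_model</code> is again a hypothetical function that sends the messages to an LLM and returns the assistant message as a dictionary:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

def call_calculator(expression: str) -&gt; str:
    # a real tool would use a safe math parser; eval is for illustration only
    return json.dumps({"result": eval(expression)})

TOOLS = {"call_calculator": call_calculator}

def run_agent(messages, call_model):
    while True:
        reply = call_model(messages)      # hypothetical LLM call
        messages.append(reply)
        call = reply.get("function_call")
        if call is None:                  # no tool requested: final answer
            return reply["content"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "function",
                         "name": call["name"],
                         "content": result})
</code></pre></div></div>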
<p>More complex agentic behaviors work similarly: function calls (internet searches, API calls, memory access) are all added to the context window, enabling the LLM to produce appropriate responses.</p> <p>Building agents involves more than just a context window and an LLM: you typically need a model-agnostic orchestrator (like <a href="https://www.langchain.com/">LangChain</a>) that manages <strong>state &amp; memory</strong> (buffers, summaries, vector stores), <strong>tool routing</strong> (function calls, search, code execution, APIs), <strong>multi-step planning with sub-agent spawning</strong>, and <strong>observability</strong> (tracing, cost tracking, evaluation pipelines); though this may feel overly complex for smaller projects.</p> <h2 id="the-journey-continues">The Journey Continues</h2> <p>In this post, we’ve explored several key concepts: the context window, tokens, prompt engineering, LLMs as mathematical functions, agentic behavior, and memory systems.</p> <p>This represents just a small portion of the broader LLM ecosystem. Many more concepts await exploration: embedding databases, RAG (Retrieval-Augmented Generation), reasoning models, MCPs (Model Context Protocol), and beyond. I encourage you to continue learning about these technologies and, most importantly, to start experimenting with them in your own projects.</p>]]></content><author><name></name></author><category term="machine"/><category term="learning,"/><category term="LLM,"/><category term="generative"/><category term="AI,"/><category term="context"/><category term="window"/><summary type="html"><![CDATA[Exploring the bridge between next-word prediction, agent frameworks, and the limits of current LLMs' consciousness]]></summary></entry><entry><title type="html">Launching TheorIA: A Machine-Readable Atlas of Theoretical Physics</title><link href="https://manuelsh.github.io/blog/2025/launching-theoria-a-ML-dataset/" rel="alternate" type="text/html" title="Launching TheorIA: A Machine-Readable Atlas of Theoretical Physics"/><published>2025-05-25T09:00:00+00:00</published><updated>2025-05-25T09:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2025/launching-theoria-a-ML-dataset</id><content type="html" xml:base="https://manuelsh.github.io/blog/2025/launching-theoria-a-ML-dataset/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">25th May 2025</span></p> <p>We are launching <a href="https://theoria-dataset.github.io/theoria-dataset/index.html">TheorIA Dataset</a> (Theoretical Physics Intelligent Anthology), a growing collection of physics equations, step-by-step derivations and plain-language explanations, fully written as self-contained JSON. It fills a gap identified in my earlier post <a href="https://manuelsh.github.io/blog/2025/datasets-for-advancing-Theoretical-Physics/">Datasets for advancing Theoretical Physics &amp; AI</a>, namely, the absence of curated, machine-readable data that goes beyond raw PDFs.</p> <p>We are trying to make something robust and high-quality: built-in CI validation, explicit assumptions, programmatic proofs (SymPy), and arXiv-style domain tags to keep every entry reproducible and searchable.</p> <p>We currently have 15 entries, many written with AI and some already curated by physicists, but we need hundreds. Your favourite derivation is probably still missing.</p> <p>All code and content is open source under the CC-BY 4.0 license on <a href="https://github.com/theoria-dataset/theoria-dataset">GitHub</a>. 
Pull requests are welcome!</p> <h2 id="why-bother">Why bother?</h2> <p>ImageNet rewired computer-vision research. In NLP, the Pile, C4 and friends did the same. Theoretical physics, on the other hand, still asks language models to learn Maxwell’s equations by going through paper PDFs.</p> <p>TheorIA is an attempt to raise the bar:</p> <table> <thead> <tr> <th>Pain point</th> <th>TheorIA’s answer</th> </tr> </thead> <tbody> <tr> <td>Equations are locked inside prose</td> <td>Each equation is a first-class JSON object, plus symbol table and AsciiMath rendering</td> </tr> <tr> <td>Derivations are opaque</td> <td>Straightforward step-by-step derivations with automated verification via SymPy</td> </tr> <tr> <td>Reproducibility headaches</td> <td>CI in GitHub validates all PRs against schema and proofs before merge</td> </tr> </tbody> </table> <h2 id="theoria-is-a-work-in-progress">TheorIA is a work in progress</h2> <p>Many of the entries you’ll find in TheorIA are currently in draft form, built with the help of AI tooling to bootstrap content at scale. Hence, they often contain typos, notation inconsistencies or even subtle mathematical errors. If they were perfect, this dataset would not be useful for training models.</p> <p>This is a feature, not a bug: by crowd-sourcing expert review and inviting physicists, mathematicians and educators to correct each derivation, we hope to rapidly turn these drafts into rock-solid reference materials. Your contributions will ensure that TheorIA remains both rigorous and reliable. We will very clearly mark the entries that are still not ready for use.</p> <h2 id="a-quick-tour">A quick tour</h2> <p>For now, we have built a simple <a href="https://theoria-dataset.github.io/theoria-dataset/index.html">web viewer to explore the dataset</a> entry by entry, which makes it easy to spot typos.</p> <p>The <a href="https://github.com/theoria-dataset/theoria-dataset">main repository</a> is on GitHub, and you can see an example of a raw entry here, the <a href="https://github.com/theoria-dataset/theoria-dataset/blob/main/entries/special_relativity_transformations.json">Special relativity transformations</a>. We also have a comprehensive <a href="https://github.com/theoria-dataset/theoria-dataset/blob/main/CONTRIBUTING.md">contributing guide</a>.</p> <p>If you are not a software developer but you want to contribute by correcting or adding an entry, just follow the guidelines, create a JSON file and add it as an <a href="https://github.com/theoria-dataset/theoria-dataset/issues">issue</a> in the repo. Remember to add your name and/or ORCID to the entry author field!</p>
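<p>To give a flavor of the programmatic proofs, here is a minimal sketch of the kind of SymPy check that the CI validation can run, inspired by the special relativity entry above (this snippet is illustrative, not the project’s actual CI code; the real entry schema is defined in the contributing guide):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sympy as sp

# check that the Lorentz transformation preserves the spacetime interval
x, t = sp.symbols("x t", real=True)
v, c = sp.symbols("v c", positive=True)

gamma = 1 / sp.sqrt(1 - v**2 / c**2)
xp = gamma * (x - v * t)            # transformed position
tp = gamma * (t - v * x / c**2)     # transformed time

# c^2*t'^2 - x'^2 must equal c^2*t^2 - x^2
assert sp.simplify((c**2 * tp**2 - xp**2) - (c**2 * t**2 - x**2)) == 0
</code></pre></div></div> <h2 id="roadmap">Roadmap</h2> <p>The project is still early, and it requires significant work to make it useful and meaningful. You can follow the status in the <a href="https://github.com/users/theoria-dataset/projects/1/views/1">TheorIA project board</a>.</p> <p>The general steps we have in mind are:</p> <ol> <li>Build some critical mass: have 100 AI-generated entries (I expect them to have many errors, judging from the ones generated already) and at least 20 curated by physicists.</li> <li>Test LLM performance with the curated examples and compare their output.</li> <li>Reduce contributor friction: have an easy way for users to modify or add entries to the dataset, from a user-friendly web interface.</li> <li>Automate output formats. 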
Provide “one-click” scripts (JSON→LaTeX/Markdown/API) so adopters can plug TheorIA into documentation, teaching materials or model workflows without a learning curve.</li> <li>Once we have enough entries, deliver a demo. Fine-tune an LLM on the dataset and publicly compare its derivation-explanation quality against baselines.</li> </ol> <h2 id="call-for-collaborators">Call for collaborators</h2> <p>We’re especially looking for:</p> <ul> <li><strong>Physicists or students</strong> to review or add entries.</li> <li><strong>Toolsmiths</strong> to build visualisers, dataset converters or training scripts.</li> </ul> <h2 id="final-thoughts">Final thoughts</h2> <p>I hope that TheorIA will graduate from “neat GitHub repo” to a reference for physicists, educators and AI researchers. Join us in turning raw drafts into high-quality derivations, and let’s build the data foundation that physics and AI have been waiting for.</p>]]></content><author><name></name></author><category term="Machine"/><category term="Learning"/><summary type="html"><![CDATA[If we want AI models to reason about physics, we first need to give them physics they can actually read.]]></summary></entry><entry><title type="html">Datasets for advancing Theoretical Physics and AI</title><link href="https://manuelsh.github.io/blog/2025/datasets-for-advancing-Theoretical-Physics/" rel="alternate" type="text/html" title="Datasets for advancing Theoretical Physics and AI"/><published>2025-04-13T09:00:00+00:00</published><updated>2025-04-13T09:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2025/datasets-for-advancing-Theoretical-Physics</id><content type="html" xml:base="https://manuelsh.github.io/blog/2025/datasets-for-advancing-Theoretical-Physics/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">13th April 2025</span></p> <p>The history of recent developments in deep learning shows the crucial role played by curated datasets. For example, Fei-Fei Li and her collaborators dramatically reshaped computer vision with the creation of <a href="https://arxiv.org/abs/1409.0575">ImageNet</a>, a large-scale, labeled image collection. This sparked the start of the deep learning revolution. Similarly, datasets like CIFAR-10 and MNIST have provided foundational benchmarks essential for algorithmic progress.</p> <p>Despite these advances in machine learning, theoretical physics still lacks comprehensive, standardized datasets. Developing high-quality datasets specifically tailored for theoretical physics could accelerate progress both in AI—by enabling more powerful models—and in physics itself, by establishing common benchmarks for training and evaluating physics-related models.</p> <p>In this post, I start by looking at the existing physics-related datasets by domain, data type, level of content and availability. 
Then I try to identify the existing gaps and propose new datasets to create.</p> <h2 id="existing-datasets">Existing Datasets</h2> <h3 id="theoretical-physics-knowledge--simulations">Theoretical Physics (Knowledge &amp; Simulations)</h3> <p>This includes textual corpora of theory papers, equation datasets, and simulation data of theoretical models.</p> <table> <thead> <tr> <th>Dataset / Source</th> <th>Domain &amp; Content</th> <th>Type</th> <th>Level</th> <th>Availability</th> </tr> </thead> <tbody> <tr> <td><strong>ArXiv Physics Corpus</strong></td> <td>All physics subfields (theory &amp; experiment) – 1.2M+ research papers (<a href="https://arxiv.org/html/2408.09574v1#:~:text=converts%20text%20into%20dense%20vector,tuning%20for%20specific%20physics%20subdomains">PhysBERT: A Text Embedding Model for Physics Scientific Literature</a>)</td> <td>Text (papers, PDFs)</td> <td>Frontier research</td> <td><a href="https://www.kaggle.com/datasets/Cornell-University/arxiv">Open-access</a> (arXiv)</td> </tr> <tr> <td><strong>Physics Journals (e.g. APS)</strong></td> <td>Broad physics research literature (peer-reviewed journals)</td> <td>Text (papers)</td> <td>Frontier research</td> <td>Restricted (subscription)</td> </tr> <tr> <td><strong>Feynman Symbolic Regression Dataset</strong></td> <td>Classical physics formulas (from Feynman Lectures, etc.) – 100+ laws</td> <td>Symbolic equations + numeric data</td> <td>Undergrad–Graduate</td> <td><a href="https://web.archive.org/web/20250319094015/https://space.mit.edu/home/tegmark/aifeynman.html">Open</a> (<a href="https://arxiv.org/pdf/1905.11481">research</a> dataset)</td> </tr> <tr> <td><strong>Kreuzer–Skarke Calabi–Yau DB</strong></td> <td>String theory – 473,800,776 reflexive 4D polytopes (Calabi–Yau manifolds) (<a href="https://pub.dzne.de/record/272345/files/DZNE-2024-01162.pdf">Group-invariant machine learning on the Kreuzer-Skarke dataset</a> - paywalled version: sciencedirect.com)</td> <td>Structured (geometric data)</td> <td>Frontier research</td> <td><a href="http://hep.itp.tuwien.ac.at/~kreuzer/CY/">Open</a> (online database)</td> </tr> <tr> <td><strong>Lattice QCD Configurations</strong></td> <td>Quantum Field Theory (lattice QCD) – gauge field samples, correlation functions</td> <td>Numeric (lattice data)</td> <td>Frontier research</td> <td>Partially open (<a href="https://www.jldg.org/ildg-data/">example</a>)</td> </tr> <tr> <td><strong>SXS Waveforms</strong> (Simulating eXtreme Spacetimes)</td> <td>General Relativity – Numerical relativity waveforms of binary black holes (<a href="https://data.black-holes.org/waveforms/index.html#:~:text=This%20repository%20contains%20all%20publicly,than%20Lev1%2C%20and%20so%20on">SXS Gravitational Waveform Database</a>)</td> <td>Numerical time-series</td> <td>Frontier research</td> <td><a href="https://data.black-holes.org/simulations/index.html">Open</a> (public catalog)</td> </tr> </tbody> </table> <h3 id="experimental-physics">Experimental Physics</h3> <p>Datasets from experiments and simulations that test physical theories, often used to train ML models to detect patterns or to build surrogate models for experiments.</p> <table> <thead> <tr> <th>Dataset / Source</th> <th>Domain &amp; Content</th> <th>Type</th> <th>Level</th> <th>Availability</th> </tr> </thead> <tbody> <tr> <td><strong>CERN Open Data (LHC)</strong></td> <td>High-energy physics – Petabytes of LHC collision data (ATLAS, CMS, etc.)</td> <td>Numerical (events, detector readings)</td> <td>Frontier research</td> <td><a 
href="https://opendata.cern.ch/#:~:text=Explore%20more%20than%20five%20petabytes,open%20data%20from%20particle%20physics">Open-access</a> (portal)</td> </tr> <tr> <td><strong>HEP ML Datasets</strong> (HIGGS, HEPMASS, etc.)</td> <td>Particle physics – Simulated collision events labeled as Higgs vs. background</td> <td>Numerical (tabular features)</td> <td>Graduate/Research</td> <td><a href="https://mlphysics.ics.uci.edu/#:~:text=HIGGS%20dataset">Open</a> (UCI/Zenodo)</td> </tr> <tr> <td><strong>LIGO/Virgo GWOSC</strong></td> <td>Gravitational waves – Time-series signals from interferometers (event strain data)</td> <td>Numerical (time-series)</td> <td>Frontier research</td> <td><a href="https://gwosc.org/eventapi/">Open</a> (GWOSC portal)</td> </tr> <tr> <td><strong>Quantum Optics Experiments</strong></td> <td>Quantum optics – e.g. single-photon interference, trapped-ion measurements</td> <td>Numeric (experimental logs, time-series)</td> <td>Graduate/Research</td> <td>Limited open (lab repositories, e.g. <a href="https://github.com/eperrier/QDataSet">QDataSet</a>)</td> </tr> <tr> <td><strong>Fluid Dynamics/CFD Simulations</strong></td> <td>Classical mechanics – CFD simulation outputs (e.g. flow fields, turbulence)</td> <td>Numerical (grids, images)</td> <td>Graduate/Research</td> <td>Partially open (benchmarks, e.g. NASA CFD data)</td> </tr> <tr> <td><strong>Graph Network Simulations</strong></td> <td>Multi-body physics – Synthetic trajectories for n-body, fluids, rigid bodies (<a href="https://community.arm.com/arm-community-blogs/b/mobile-graphics-and-gaming-blog/posts/physics-simulation-graph-neural-networks-targeting-mobile#:~:text=%E2%80%9CLearning%20to%20Simulate%E2%80%9D%20also%20presents,term%20interaction%20data">Physics Simulation With Graph Neural Networks Targeting Mobile - Mobile, Graphics, and Gaming blog - Arm Community blogs - Arm Community</a>)</td> <td>Numeric (graph-based, trajectories)</td> <td>Undergrad–Graduate</td> <td><a href="https://hal.science/hal-03806092/document">Partially open</a> (code to generate; DeepMind GNS data)</td> </tr> </tbody> </table> <h3 id="mathematics-for-physics">Mathematics for Physics</h3> <p>Datasets of mathematical problems, proofs, and symbolic computations relevant to physics problem-solving and theory.</p> <table> <thead> <tr> <th>Dataset / Source</th> <th>Domain &amp; Content</th> <th>Type</th> <th>Level</th> <th>Availability</th> </tr> </thead> <tbody> <tr> <td><strong>MATH Dataset</strong> (Hendrycks et al.)</td> <td>12,500 competition math problems with step-by-step solutions (<a href="https://arxiv.org/abs/2103.03874#:~:text=,though%20we%20are%20able%20to">[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset</a>)</td> <td>Text (problem ⇒ solution)</td> <td>Undergrad (contest)</td> <td><a href="https://github.com/hendrycks/math">Open</a> (public dataset)</td> </tr> <tr> <td><strong>PhysQA</strong></td> <td>1,008 physics word problems (mechanics, etc.) 
with annotated solutions</td> <td>Text (word problems Q&amp;A)</td> <td>High school</td> <td><a href="https://arxiv.org/pdf/2309.08182">Open</a> (original: paperswithcode.com)</td> </tr> <tr> <td><strong>GPT-4 Physics Q&amp;A (Camel Physics)</strong></td> <td>20,000 physics problem–solution pairs generated by GPT-4 (<a href="https://huggingface.co/datasets/camel-ai/physics#:~:text=Dataset%20Summary">camel-ai/physics · Datasets at Hugging Face</a>)</td> <td>Text (QA, synthetic)</td> <td>Undergrad–Grad (mixed)</td> <td>Open (Hugging Face)</td> </tr> <tr> <td><strong>Formal Theorem Libraries</strong></td> <td>Proofs and theorems (<a href="https://isabelle.in.tum.de/">Isabelle</a>, <a href="https://leanprover-community.github.io/">Lean</a>, <a href="https://github.com/uhub/awesome-coq">Coq libraries</a>) – e.g. analysis, algebra used in physics</td> <td>Formal text (logic)</td> <td>Graduate–Research</td> <td>Open (MIT/BSD licenses)</td> </tr> <tr> <td><strong>Symbolic Integration &amp; ODE Sets</strong></td> <td>Large sets of integrals and differential equations for symbolic solving (e.g. 27M integration pairs)</td> <td>Symbolic (expressions)</td> <td>Undergrad–Grad</td> <td>Open (research, <a href="https://github.com/mfbalin/SIRD-Symbolic-Integration-Rules-Dataset?tab=readme-ov-file">SIRD</a> dataset)</td> </tr> <tr> <td><strong>PINN Benchmark (PINNacle)</strong></td> <td>20+ distinct physics PDEs (heat eq., Navier-Stokes, etc.) with solution data for PINNs (<a href="https://arxiv.org/abs/2306.08827#:~:text=,fluid%20dynamics%2C%20biology%2C%20and">PINNacle: A Comprehensive Benchmark of Physics-Informed Neural …</a>)</td> <td>Numerical (PDE solutions)</td> <td>Undergrad–Grad</td> <td><a href="https://github.com/i207M/PINNacle">Open</a> (benchmark dataset)</td> </tr> </tbody> </table> <h3 id="multimodal-physics-data">Multimodal Physics Data</h3> <p>Combining text, equations, and visuals.</p> <ul> <li> <p><strong>MM-PhyQA (Multimodal Physics QA):</strong> High-school physics questions each with multiple related images and diagrams (<a href="https://arxiv.org/html/2404.08704v1#:~:text=While%20Large%20Language%20Models%20,for%20questions%20consisting%20of%20multimodal">MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting</a>). <em>Type:</em> Text + images; <em>Level:</em> High school; <em>Availability:</em> Open (research).</p> </li> <li> <p><strong>Physics StackExchange Q&amp;A:</strong> Community Q&amp;A with conceptual explanations (text, some diagrams). <em>Type:</em> Text (informal); <em>Level:</em> Undergraduate+; <em>Availability:</em> Open (CC license).</p> </li> <li> <p><strong>Laboratory Video/Imagery:</strong> E.g. cloud chamber images, astronomical images with annotations. <em>Type:</em> Visual + metadata; <em>Level:</em> Graduate; <em>Availability:</em> Partially open (scattered repositories).</p> </li> </ul> <p>The tables above show that many <strong>open-access datasets</strong> exist, especially for high-energy physics, mathematical problems, and certain simulations. Also note <strong>commercial/restricted datasets</strong> like proprietary textbook problem banks, paywalled journal corpora, or private experimental data (e.g. active experimental runs not yet released).</p> <p>Datasets have been used to train a variety of AI models: large language models (using text corpora of physics papers and Q&amp;A), graph neural networks (using simulation or detector data structured as graphs), symbolic regression models (using formula datasets like Feynman), and physics-informed neural networks (using synthetic PDE solution datasets).</p>
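<p>On the consumption side the barrier is already low. As an illustration, the sketch below (my own example, not from any of the projects above) loads the GPT-4 physics Q&amp;A dataset from the tables via the Hugging Face <code>datasets</code> library; the field names are taken from the camel-ai/physics dataset card and should be verified before use.</p>

<pre><code class="language-python"># A minimal sketch: load an open physics dataset into a Python session.
# Field names ("message_1" = problem, "message_2" = solution) follow the
# camel-ai/physics dataset card; check them before relying on this.
from datasets import load_dataset

ds = load_dataset("camel-ai/physics", split="train")
example = ds[0]
print(example["message_1"])  # the physics problem
print(example["message_2"])  # the GPT-4 generated solution
</code></pre>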
<h2 id="gap-analysis-missing-or-underrepresented-data">Gap Analysis: Missing or Underrepresented Data</h2> <p>Despite the above resources, I believe that several important gaps remain:</p> <ul> <li>We lack large, <strong>well-annotated datasets of physics problems</strong> at advanced graduate level, with step-by-step solutions. Existing collections like MATH or PhysQA cover contests or high-school problems, but few cover the multi-step derivations typical in university physics courses (e.g. electromagnetism, quantum mechanics problem sets) with detailed solutions.</li> <li>There is an <strong>absence of curated datasets of theoretical physics knowledge</strong> beyond raw text in papers. For example, there is no database of all important equations/derivations in quantum field theory or general relativity with context, proofs, etc. Similarly, while formal math libraries exist, they rarely cover <em>physics-specific</em> theorems or derivations (e.g. proofs of Noether’s theorem, derivations of field equations…).</li> <li>Niche but important domains like <strong>string theory, quantum gravity, or high-dimensional theoretical constructs are underrepresented</strong> in accessible data. For instance, the Kreuzer–Skarke dataset (Calabi–Yau spaces) exists but lacks labels connecting to physical phenomenology.</li> <li>Physics understanding often requires linking equations, diagrams, and natural language. <strong>Few datasets integrate multiple modalities</strong> – for example, pairing physics textbook figures or experimental plots with explanatory text and underlying equations. The lack of such unified multimodal datasets means AI struggles with tasks like interpreting a diagram alongside text or deriving equations from experimental graphs.</li> <li>There is a gap in <strong>datasets that directly connect experimental data with theoretical predictions</strong> in a structured way. While experimental data (like LHC events or LIGO signals) exist, they are not commonly packaged with the corresponding simulated or theoretical model outputs for the same conditions. This makes it difficult for AI to learn how theory parameters influence data and vice versa. A benchmark that pairs raw experimental data with the expected outcomes from theory (or simulation) is largely missing.</li> </ul> <p>Each of these gaps points to an opportunity for new dataset creation.</p> <h2 id="bridging-the-data-gap-in-theoretical-physics">Bridging the Data Gap in Theoretical Physics</h2> <p>The datasets reviewed illustrate both the progress made and the potential for advancing theoretical physics. Filling the identified gaps could catalyze breakthroughs. Just as ImageNet revolutionized computer vision, well-crafted physics datasets could similarly drive transformative developments in physics and AI.</p> <p>I think the task is clear: Physicists and data scientists need to collaborate to create accessible, comprehensive datasets addressing these gaps.
Such datasets will not only enhance AI’s capability to understand and predict physics but also foster innovation, potentially accelerating the frontiers of science itself.</p>]]></content><author><name></name></author><category term="Machine"/><category term="Learning"/><category term="AI,"/><category term="Enterprise,"/><category term="Machine"/><category term="Learning,"/><category term="Business"/><summary type="html"><![CDATA[There is a lack of curated datasets in theoretical physics to train better machine learning models. But what exactly is missing and how can we fill the gaps?]]></summary></entry><entry><title type="html">Selected ideas from NeurIPS 2024</title><link href="https://manuelsh.github.io/blog/2025/Selected-ideas-from-NeurIPS2024/" rel="alternate" type="text/html" title="Selected ideas from NeurIPS 2024"/><published>2025-02-01T12:00:00+00:00</published><updated>2025-02-01T12:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2025/Selected-ideas-from-NeurIPS2024</id><content type="html" xml:base="https://manuelsh.github.io/blog/2025/Selected-ideas-from-NeurIPS2024/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">1st February 2025</span></p> <p><a href="https://neurips.cc/Conferences/2024">NeurIPS</a> is widely considered <em>the</em> major AI research conference. With over 16,000 participants, 56 workshops, countless parallel tracks, and a staggering 3,650 posters, this event is more than just a conference: it offers a privileged vantage point into the state of the art in the field and its current challenges. The 2024 edition was hosted in Vancouver, over six packed days.</p> <p>While it’s impossible to capture its vast scope in a single post, here’s a glimpse of the most exciting ideas that stood out to me.</p> <h1 id="agents-the-next-frontier">Agents, the next frontier</h1> <p>Advancing intelligent systems requires shifting focus from standalone models (e.g., LLMs) to more complex, agent-based architectures capable of autonomous reasoning and decision-making. See for example the latest releases of frontier labs, such as o1 from OpenAI, DeepSeek, etc., where agentic methods like Chain of Thought are becoming more common.</p> <p>There were many interesting presentations on the topic, including showcases of agentic libraries, such as <a href="https://neurips.cc/Expo/Conferences/2024/workshop/100326">Autogen</a>, presented by Microsoft, or <a href="https://neurips.cc/Expo/Conferences/2024/workshop/100321">Llama Stack</a>, by the folks of Meta.</p> <h2 id="conquering-human-user-interfaces">Conquering human user interfaces</h2> <p><a href="https://microsoft.github.io/OmniParser/">OmniParser</a>, also from Microsoft, is a promising framework that provides more information to a multimodal LLM about the content of a screen or browser, drastically facilitating the interaction of the model with user interfaces.
Solving this problem is key to having true agents on our phones or computers.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/omniparser_example-480.webp 480w,/assets/img/blog_images/omniparser_example-800.webp 800w,/assets/img/blog_images/omniparser_example-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/omniparser_example.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Example of the output of Omniparser from a screenshot with Google Slides. </div> <p>A key limitation, as I have observed in my own tests, is that feeding raw screenshots or HTML to a multimodal LLM (e.g., GPT4V) and granting it control over the mouse and keyboard results in poor performance on UI-driven tasks like booking flights or hotels. This is reflected in the low accuracy on GUI task-oriented benchmarks, such as the <a href="https://web.archive.org/web/20250317225153/https://paperswithcode.com/sota/natural-language-visual-grounding-on">ScreenSpot benchmark</a>, where GPT4V reaches only 16% accuracy.</p> <p>However, if the model is supplemented with the input coming from Omniparser, accuracy jumps to 73% on the same dataset. This has since been surpassed <a href="https://web.archive.org/web/20250317225153/https://paperswithcode.com/sota/natural-language-visual-grounding-on">by other models</a>, which suggests that in less than one year we will likely see AI operating seamlessly with the UI of our phones or computers, as humans do.</p> <h2 id="other-useful-resources-about-agents">Other useful resources about Agents</h2> <p>The folks of Meta showed how to build agents with their <a href="https://github.com/meta-llama/llama-stack">Llama Stack</a>, providing also a great <a href="https://colab.research.google.com/drive/1F2ksmkoGQPa4pzRjMOE6BXWeOxWFIW6n#scrollTo=K4AvfUAJZOeS">notebook with many relevant examples</a>, which includes RAG evaluation with LLM as a judge.</p> <p>A key highlight was their proposed agentic architecture, which features a central executor coordinating all operations, as illustrated below. Just follow the numbers in order to better understand it.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/Meta_agentic_archirecture_proposal.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/Meta_agentic_archirecture_proposal.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Architecture of an agent as proposed by Meta. Source: Active Training: Building Agentic Apps with Llama 3.2 and Llama Stack. NeurIPS 2024. </div> <p>They also provided some hints on which model size to use for different tasks.
The following table is quite useful.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/Meta_model_sizes_and_usages_table.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/Meta_model_sizes_and_usages_table.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Table showing the model sizes and their usages as proposed by Meta. Source: Active Training: Building Agentic Apps with Llama 3.2 and Llama Stack. NeurIPS 2024. </div> <h1 id="building-and-improving-large-language-models">Building and improving Large Language Models</h1> <p>One of the standout topics at NeurIPS this year was the process of building and improving Large Language Models. A particularly noteworthy presentation was given by the AllenAI team, who provided a <a href="https://neurips.cc/virtual/2024/tutorial/99526">detailed overview of the end-to-end process of building an LLM</a>. From data acquisition to post-training, they shared many insights and tips. This topic is so rich that it deserves a summary of its own, which you can find <a href="https://manuelsh.github.io/blog/2025/NIPS-building-llm-workshop/">here</a>.</p> <h2 id="are-we-running-out-of-data">Are we running out of data?</h2> <p>A recurring theme was data scarcity. Kyle Lo from AllenAI clarified that while data itself isn’t vanishing, open-access data is becoming increasingly limited. Ilya Sutskever, in his remarks upon receiving the “Test of Time Award” for his paper, described data as the “fossil fuel of AI,” noting that while compute continues to grow, data is not growing at the same pace. He suggested that we should be looking at “synthetic data,” inference time compute, and agents as potential solutions.</p> <p>This was challenged by Jason Weston, who pointed out that a significant portion of the training of LLMs in frontier companies relies on “closed data,” which they possess and are generating in substantial quantities. He expressed skepticism about the severity of the data scarcity issue, suggesting that Ilya’s perspective might be influenced by his recent departure from OpenAI and the resulting loss of access to that data.</p> <p>It is worth mentioning the work of Epoch AI. In their <a href="https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data">Will We Run Out of Data?</a> paper they project that human public text, estimated at 300 trillion tokens, will be fully utilized between 2026 and 2032, or earlier (see chart below).</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/epoch-ai-data-utilization-480.webp 480w,/assets/img/blog_images/epoch-ai-data-utilization-800.webp 800w,/assets/img/blog_images/epoch-ai-data-utilization-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/epoch-ai-data-utilization.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Source: Epoch AI, June 2024 </div> <p>Epoch AI focuses in this study on textual data. However, a significant portion of data exists in other formats, such as images, audio, and video, which can also be used for training.</p>
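<p>As a toy back-of-the-envelope on that projection (the yearly growth factor and starting budget below are my own assumptions for illustration, not Epoch AI’s model), compounding a frontier-scale token budget quickly exhausts a fixed stock:</p>

<pre><code class="language-python"># Toy illustration: a fixed stock of ~300T public text tokens (the estimate
# cited above) vs. training runs whose token budgets keep growing.
# The 2.8x yearly growth and the 15T starting budget (roughly a 2024
# frontier run) are assumptions for illustration only.
stock = 300e12
budget = 15e12
year = 2024
while stock > budget:
    year += 1
    budget *= 2.8
print("a single run would exceed the stock around", year)  # ~2027
</code></pre>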
<p>While computational power grows exponentially and data increases at a linear rate, algorithms and methods continue to become more efficient. Furthermore, alternatives like self-distillation (the model generates data and trains on it), <a href="https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback">Constitutional AI</a>, synthetic data, and private datasets reduce this reliance on high data volume. For all these reasons I don’t think data will be the main bottleneck.</p> <h2 id="architectures-and-rl-methods">Architectures and RL methods</h2> <p>It’s clear that alternatives to the Transformer architecture are standing out, such as <a href="https://arxiv.org/abs/2312.00752">Mamba</a> or <a href="https://neurips.cc/virtual/2024/invited-talk/101129">xLSTM</a>. These architectures are more efficient than the Transformer at inference, as the compute doesn’t grow quadratically with the number of input tokens, while during training they can still parallelize next-token prediction like the Transformer, instead of working sequentially like earlier architectures (RNN, LSTM…). Sepp Hochreiter, the creator of LSTM, presented xLSTM, acknowledging its strong resemblance to Mamba.</p> <p>However, although these architectures were mentioned many times, in practice they are not yet widely used.</p> <p>Additionally, <a href="https://huggingface.co/blog/rlhf">Reinforcement Learning with Human Feedback</a> (or RLHF), which is the method used by OpenAI’s ChatGPT to turn a language model into a chatbot, is being substituted or supplemented by many other methods, like DPO, which is significantly simpler and performs at a similar level. More details in my summary on <a href="https://manuelsh.github.io/blog/2025/NIPS-building-llm-workshop/">Opening the LLM pipeline</a>.</p> <h1 id="measuring-the-performance-of-foundation-models">Measuring the performance of foundation models</h1> <p>Although benchmarking models is part of building models, this topic is so important that it requires its own section. Benchmarking is not only important to understand how well a model performs; building relevant benchmarks is also key to advancing the field.</p> <h2 id="benchmarks-to-advance-ai">Benchmarks to advance AI</h2> <p>The definition of intelligence is elusive; that is why benchmarks that are easy for humans but hard for AI models are critical, as they establish a new baseline to beat. One of them is <a href="https://arcprize.org/">ARC</a>, which requires the ML model to solve a series of puzzles, each one with a different logic, like the one shown below.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/ARC-benchmark_example.svg" sizes="95vw"/> <img src="/assets/img/blog_images/ARC-benchmark_example.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Example of an ARC puzzle. Source: https://lab42.global/arc/ </div> <p>The performance of humans on the ARC test is very high, around 90% accuracy according to the ARC team, while at the time of NeurIPS 2024 the best model was at 53%.
Interestingly, less than a week after François Chollet’s presentation at NeurIPS, OpenAI announced that their <a href="https://web.archive.org/web/20250323011500/https://community.openai.com/t/day-12-of-shipmas-new-frontier-models-o3-and-o3-mini-announcement/1061818">new o3 model</a> is able to reach 76% on ARC. A great example of how quickly the field moves!</p> <p><a href="https://melaniemitchell.me/">Melanie Mitchell</a>, from the Santa Fe Institute, also showed during a workshop about <a href="https://neurips.cc/virtual/2024/workshop/84749">System-2 reasoning</a> how current state-of-the-art LLMs fail when some benchmarks are modified in trivial ways. She mentioned an example from the paper <a href="https://arxiv.org/pdf/2307.02477">Reasoning or Reciting?</a>: in a Python code benchmark where GPT4 performs very well, simply introducing a small change in the way the language works (“now list indices start at one instead of zero”) makes the model’s performance drop drastically. See the chart below.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/reasoning-or-reciting-benchmark-paper-neurips-post-480.webp 480w,/assets/img/blog_images/reasoning-or-reciting-benchmark-paper-neurips-post-800.webp 800w,/assets/img/blog_images/reasoning-or-reciting-benchmark-paper-neurips-post-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/reasoning-or-reciting-benchmark-paper-neurips-post.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> GPT4 performance on the default version of various benchmarks and on the modified version (counterfactuals). Source: Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. Wu et al. March 2024. </div> <p>Building “easy for humans, hard for AI” benchmarks is key to the development of more intelligent models. Indeed, as Fei-Fei Li pointed out in <a href="https://neurips.cc/virtual/2024/invited-talk/101127">her inspiring presentation</a> (highly recommended!), the ImageNet benchmark that she created was a key element for the rebirth of neural networks in 2012, and the newly coined term “Deep Learning”.</p> <h2 id="eureka-a-comprehensive-framework-to-evaluate-llms">EUREKA: A comprehensive framework to evaluate LLMs</h2> <p>The folks from Microsoft <a href="https://neurips.cc/Expo/Conferences/2024/talk%20panel/105693">presented</a> a comprehensive and open-source framework to evaluate multimodal and language models called <a href="https://www.microsoft.com/en-us/research/blog/eureka-evaluating-and-understanding-progress-in-ai/">Eureka</a>, which assesses the performance of models across several dimensions. Some of the main conclusions of their evaluation of 14 large foundation models are:</p> <ul> <li>Models like Claude 3.5 Sonnet, GPT-4o 2024-05-13, and Llama 3.1 405B show distinct strengths in specific tasks but are not universally superior across all benchmarks. This highlights the need for task-specific analysis rather than assuming a model’s overall superiority.</li> <li>Current AI models struggle significantly with multimodal tasks, particularly those requiring detailed image understanding and spatial reasoning.
For example, all models perform poorly on Object Detection.</li> </ul> <p>In the following chart you can see the results for both language and multimodal tasks.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/eureka-benchmark-framework-neurips-post-480.webp 480w,/assets/img/blog_images/eureka-benchmark-framework-neurips-post-800.webp 800w,/assets/img/blog_images/eureka-benchmark-framework-neurips-post-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/eureka-benchmark-framework-neurips-post.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Performance of the best and worst models for multimodal (left) and language (right) datasets in Eureka-Bench. Note the room to improve in Object Detection, Information Retrieval or navigation. Source: Eureka: Evaluating and Understanding Progress in AI. Microsoft Research, NeurIPS 2024. </div> <h1 id="unified-representations-shedding-light-on-the-black-box">Unified Representations, shedding light on the black box</h1> <p>Significant progress has been made on the topic of understanding how neural networks (human and artificial) encode and process information. Several illuminating ideas around the topic were presented in the <a href="https://neurips.cc/virtual/2024/workshop/84701">UniReps Workshop</a> at NeurIPS 2024.</p> <h2 id="the-platonic-representation">The platonic representation</h2> <p>We have substantial evidence that different neural networks, including artificial and human neural networks, converge towards the same way of representing the world. This evidence comes from looking at the multidimensional spaces that the activations of the layers of neural networks produce (an embedding) when a concept is used as input. In <a href="https://arxiv.org/abs/2405.07987">The Platonic Representation Hypothesis</a> paper, the authors observed that the spaces generated by embeddings of different models have very similar characteristics: for example, the distances between points of the same concepts (e.g. the distance between the concepts <em>pear</em> and <em>giraffe</em>) in a language model or in a vision model remain very similar, and this similarity increases the better the models are. See chart below.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/alignment_unified_representations_neurips_2024-480.webp 480w,/assets/img/blog_images/alignment_unified_representations_neurips_2024-800.webp 800w,/assets/img/blog_images/alignment_unified_representations_neurips_2024-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog_images/alignment_unified_representations_neurips_2024.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> The better the models, the more aligned are their representations. Source: Phillip Isola, Unireps, NeurIPS 2024 </div> <p>Not only that, but there is evidence that the same happens with the activations of the neurons in our brains: they also generate a space that is similar to the ones of the frontier models, and we can even use LLMs to interpret the output of these human brain activations.</p>
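<p>To make this concrete, here is a small sketch (my own illustration, not the paper’s code) of how such alignment can be measured: embed the same set of concepts with two different models, then correlate the pairwise-distance geometry of the two embedding spaces.</p>

<pre><code class="language-python"># A minimal sketch of measuring representational alignment between two
# models: if both "agree" on how far apart every pair of concepts is,
# the score approaches 1. Random embeddings stand in for real models.
import numpy as np

def pairwise_distances(X):
    # Euclidean distance matrix between rows of X (n_concepts x dim).
    sq = (X ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))

def alignment_score(emb_a, emb_b):
    # Correlate the upper triangles of the two distance matrices.
    da, db = pairwise_distances(emb_a), pairwise_distances(emb_b)
    iu = np.triu_indices_from(da, k=1)
    return np.corrcoef(da[iu], db[iu])[0, 1]

rng = np.random.default_rng(0)
emb_language_model = rng.normal(size=(100, 512))  # 100 concepts
emb_vision_model = rng.normal(size=(100, 768))
print(alignment_score(emb_language_model, emb_vision_model))
</code></pre>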
<p>A profound question arises: is there a unique <em>platonic representation</em> that models and humans converge to? Knowing it could help build more intelligent models. If you find this material interesting, I recommend reading <a href="https://phillipi.github.io/prh/#what_converging_to">the summary of the paper</a>.</p> <p>Having internal representations encoded as perpendicular vectors also leads to the conclusion that a neural computation is the transformation of one representation into another. That’s the job of the neural network’s weights and biases: to transform the input representation (usually in the form of learned embeddings) into another representation that is useful for the task at hand. Incredible what linear algebra and some non-linearities can do!</p> <h2 id="reverse-engineerig-intelligence">Reverse engineering intelligence</h2> <p>Many other talks in this workshop were about gaining a further understanding of the mechanics of the brain and neural networks. For example, they discovered that <a href="https://arxiv.org/pdf/2409.05771">middle layers of LLMs are better at predicting</a> the concepts behind human brain activations. There is evidence that LLMs, in order to predict the next token, first generate, in the middle layers, a richer internal representation that encodes many functions of the language, versus the representation in the final layers that just predicts the next token.</p> <p>There is also strong evidence that neural networks encode information in “directions” in a multidimensional space, where each useful abstract concept (for example, the language a text is written in) is encoded in a different direction, each one “almost” perpendicular to the others, which is possible in a multidimensional space (in a 3D space there are only 3 mutually perpendicular vectors, but in higher-dimensional spaces, if we relax the constraint of perpendicularity from exactly 90 degrees to 89-91 degrees, the number of almost perpendicular vectors grows exponentially with the number of dimensions). I highly recommend watching the 3Blue1Brown lesson on <a href="https://www.3blue1brown.com/lessons/mlp">How might LLMs store facts</a>. In fact, watch all the Deep Learning videos of this channel; they are the best I’ve seen explaining the concepts of the transformer.</p> <p>Also very interesting was the Mechanistic Interpretability talk by <a href="https://www.neelnanda.io/">Neel Nanda</a>, from DeepMind. Mechanistic Interpretability aspires to reverse engineer neural networks, working on the hypothesis that models learn human-comprehensible structures that can be understood. He showed an example where they are able to identify a “direction in space” that encodes refusal, i.e. when the model refuses to speak about a certain topic, usually because of safety constraints. Knowing this direction, they are able to deactivate it, just by subtracting that vector, allowing the model to respond in originally unintended ways. This refusal direction appears in every model they checked; it is almost universal.</p>
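<p>A minimal numpy sketch of that subtraction trick (illustrative, not DeepMind’s implementation):</p>

<pre><code class="language-python"># If refusal lives along a single direction r in activation space,
# removing each activation's projection onto r should suppress it.
import numpy as np

def ablate_direction(activations, r):
    # activations: (n_tokens, d_model); r: (d_model,) refusal direction.
    r = r / np.linalg.norm(r)
    return activations - np.outer(activations @ r, r)

# Toy check: after ablation, no activation has any component along r.
rng = np.random.default_rng(0)
acts, r = rng.normal(size=(4, 8)), rng.normal(size=8)
print(np.allclose(ablate_direction(acts, r) @ (r / np.linalg.norm(r)), 0))
</code></pre>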
<p>One clear application of better understanding the inner workings of neural networks is to improve their safety, which leads us to the next topic.</p> <h1 id="ai-safety-advocating-for-tools-not-agents">AI Safety: advocating for tools, not agents</h1> <p><a href="https://yoshuabengio.org/">Yoshua Bengio</a> and <a href="https://physics.mit.edu/faculty/max-tegmark/">Max Tegmark</a> participated in a <a href="https://neurips.cc/virtual/2024/workshop/84705">relevant workshop on AI safety</a>. One of their main arguments was that AI’s benefits can be maximized while minimizing risks by developing specialized models rather than fully autonomous agents. A great example of this is the <a href="https://deepmind.com/research/case-studies/alphafold">AlphaFold</a> model, which is a tool that helps scientists predict the 3D structure of proteins; key for drug discovery and currently widely used.</p> <h1 id="foundation-models-for-e-commerce">Foundation models for E-commerce</h1> <p>The folks from Shopify presented an initiative to build a <a href="https://neurips.cc/Expo/Conferences/2024/talk%20panel/100357">foundation model for e-commerce</a>, which takes a selection of events as inputs (these are the tokens) and tries to predict the following events. The idea is that such a model takes over many functions of a typical e-commerce platform, like recommender systems, fraud detection, next best intervention, etc.</p> <p>They presented a couple of architecture choices to address the problem, <a href="https://arxiv.org/abs/2402.17152v1">HSTU</a> and <a href="https://arxiv.org/abs/2305.05065">TIGER</a>. What is promising about their work is that they mention an uplift of 240-480% in recall@10 in <em>offline experiments</em>. I am looking forward to seeing the results once the models are deployed in production.</p> <h1 id="concluding-remarks">Concluding remarks</h1> <p>The scale of NeurIPS 2024 is a testament to the rapid growth of the field of AI. The conference showcased a wide range of ideas and approaches, from the development of large language models to the exploration of new architectures and reinforcement learning methods. The presentations on agents and the development of tools for AI safety were particularly thought-provoking, highlighting the potential for AI to transform our world in the coming years.</p> <p>As a final thought, consider that human intelligence, the most advanced we know (so far), processes 50-100 terabytes of sensory data annually, all powered by a brain consuming just ~20 Watts. This sets an ambitious benchmark for AI systems to aspire to.</p>]]></content><author><name></name></author><category term="NeurIPS,"/><category term="LLM,"/><category term="Generative"/><category term="AI,"/><category term="Machine"/><category term="Learning"/><category term="LLM,"/><category term="NeurIPS"/><summary type="html"><![CDATA[NeurIPS 2024, the largest AI research conference, provides a glimpse into the next frontiers.
Here are some of the most exciting ideas presented.]]></summary></entry><entry><title type="html">Opening the LLM pipeline</title><link href="https://manuelsh.github.io/blog/2025/NIPS-building-llm-workshop/" rel="alternate" type="text/html" title="Opening the LLM pipeline"/><published>2025-01-03T12:00:00+00:00</published><updated>2025-01-03T12:00:00+00:00</updated><id>https://manuelsh.github.io/blog/2025/NIPS-building-llm-workshop</id><content type="html" xml:base="https://manuelsh.github.io/blog/2025/NIPS-building-llm-workshop/"><![CDATA[<p><span style="color: grey; font-weight: 300; font-size: 0.9em;">3rd January 2025</span></p> <p>This post summarizes a fantastic tutorial about building LLMs, titled <a href="https://neurips.cc/virtual/2024/tutorial/99526">“Opening the Language Model Pipeline: A Tutorial on Data Preparation, Model Training, and Adaptation”</a> (<a href="https://docs.google.com/presentation/d/179dpzWSQ9G7EAUlvaJdeE0av9PLuk9Rl33nfhHSJ4xI/edit#slide=id.g30a4c7e9678_0_0">slides</a>). It was presented at NeurIPS 2024, by <a href="https://kyleclo.com/">Kyle Lo</a>, <a href="https://akshitab.github.io/">Akshita Bhagia</a> and <a href="https://www.natolambert.com/">Nathan Lambert</a>, all from the <a href="https://allenai.org/">Allen Institute for AI</a>. I think it could be useful to share some of the main ideas.</p> <p>The process to build a Large Language Model is very involved, and the authors went through it end to end, providing many details and practical knowledge: starting with the data preparation, continuing with model training (also called pre-training), and adaptation (or post-training). Here I summarize the main takeaways from each part, with some additional notes added.</p> <h2 id="data-preparation">Data preparation</h2> <p>Data preparation mainly means data acquisition, data transformation (deduplication, quality control, etc.) and data evaluation.</p> <h3 id="data-acquisition">Data acquisition</h3> <p>To acquire data, crawling is common. However, it’s <a href="https://www.dataprovenance.org/Consent_in_Crisis.pdf">becoming harder to crawl data</a>. Many websites are opting out or implementing anti-crawler protections, as shown in the figure below. Note that this will create a barrier to entry for new players. As Kyle Lo said: we are not running out of data, we are running out of <em>open</em> data.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/crawling_data.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/crawling_data.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Longpre et. al. 2024. Consent in Crisis: The Rapid Decline of the AI Data Commons. Data Provenance Initiative. </div> <p>Crawling data from websites implies, for many of them, understanding the JavaScript logic, which in many cases is unique to the website. This can be challenging because each website may use different frameworks or obfuscation techniques, making it necessary to decipher custom implementations. For example, a site might load data dynamically through complex API calls embedded in asynchronous scripts, requiring tailored solutions for successful extraction. It also requires parsing the data out of all the HTML, which is not easy; PDFs or scanned documents are also difficult, as many tools are not able to parse the data correctly.</p>
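<p>To make the extraction step concrete, here is a minimal sketch with BeautifulSoup (my own illustration, not from the tutorial; production pipelines use more robust extractors such as trafilatura):</p>

<pre><code class="language-python"># A minimal HTML-to-text pass: drop obvious boilerplate tags,
# then collapse the remaining visible text into one clean string.
from bs4 import BeautifulSoup

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # remove the tag and its contents
    return " ".join(soup.get_text(" ").split())
</code></pre>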
<h3 id="data-transformation">Data transformation</h3> <p>That mostly means language filtering, deduplication, removing sensitive content (including private content) and ensuring a desired quality. In reality, it’s a classification problem, and can be done using machine learning with small classifiers. They recommend the library <a href="https://github.com/facebookresearch/fastText">fastText</a>, which is quite efficient (and used across the industry), although more involved classifiers can be used.</p>
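<p>As an illustration (my own example, not the tutorial’s code), a language-filtering pass with fastText and its public lid.176 language-identification model can be as simple as:</p>

<pre><code class="language-python"># Keep only documents confidently identified as English.
# lid.176.bin is fastText's public language-identification model.
import fasttext

model = fasttext.load_model("lid.176.bin")

def keep_english(docs, threshold=0.9):
    kept = []
    for doc in docs:
        labels, probs = model.predict(doc.replace("\n", " "))
        if labels[0] == "__label__en" and probs[0] >= threshold:
            kept.append(doc)
    return kept
</code></pre>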
<p>They also shared the amount of data that remains once the data is filtered for quality and deduplication: reductions are usually of the order of 65 times.</p> <p>Filtering for quality can be, in some cases, problematic, as one can unintentionally remove specific themes which are usually classified as lower quality, e.g. high school related content.</p> <p>It is also quite difficult to remove personal data, as was highlighted by <a href="https://aclanthology.org/2023.trustnlp-1.18.pdf">Subramani et al (2023)</a>, where they showed that accuracies with simple Regex or tools like <a href="https://microsoft.github.io/presidio/">Presidio</a> can be quite low.</p> <h3 id="data-evaluation">Data evaluation</h3> <p>Finally, data must be evaluated by training models and running benchmarks, such as MMLU, HumanEval, GSM8K… The evaluation should be done systematically on each group of data, ideally starting with a smaller and cheaper model. In general it is a very involved process with many nuances, like finding the best model size, measuring the effect of your data filtering, etc.</p> <h3 id="a-new-trend-data-curriculum">A new trend: data curriculum</h3> <p>An interesting new trend is the “data curriculum”: after training the model with trillions of tokens (high quantity, lower quality), at the end one switches to either very high quality sources, specific instructions or synthetic data.</p> <h2 id="model-training-or-pre-training">Model training (or pre-training)</h2> <p>Pre-training, the process where you train an LLM on next-token prediction with a large amount of text, is currently mostly done with a Transformer architecture, accepting many different configurations that are successful, including many attention mechanisms (e.g. multi-head, grouped-query or multi-query). See in the image below how different configurations of the hyperparameters (marked in red) can lead to successful models.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/LLM_hyperparameter_configuration.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/LLM_hyperparameter_configuration.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Different configurations of the hyperparameters (marked in red) can lead to successful models. From the tutorial authors. </div> <p>In terms of scale, a good approximation is the scaling law given by the <a href="https://arxiv.org/abs/2203.15556">Chinchilla paper</a>: the compute budget is approximately 6 times the number of parameters times the number of data tokens, and the optimal number of data tokens is approximately 20 times the number of parameters, although in practice everybody keeps training beyond that point.</p> <p>In terms of costs, the pre-training can be very expensive, as it is very intensive in computational resources. See for example the table below, by the authors of the tutorial: a 7B-parameter model trained on 150B tokens (just above the Chinchilla-optimal budget of roughly 20 × 7B = 140B tokens) will cost approximately $10k.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/cost_of_LLM_training.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/cost_of_LLM_training.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Costs for different models and data sizes. From the tutorial authors. </div> <p>Common choices today are <a href="https://arxiv.org/abs/2104.09864">rotary positional embeddings</a> (RoPE) and the <a href="https://arxiv.org/abs/2002.05202v1">SwiGLU activation</a>, which, unlike the ReLU activation, is smooth (differentiable) at zero.</p> <h3 id="problems-with-loss-function-convergence">Problems with loss function convergence</h3> <p>When the loss function spikes occasionally, look at your data: it is probably a low-quality batch that needs to be filtered.</p> <p>When the loss function starts spiking and diverging, one needs to ensure that the scale of activations and gradients remains roughly the same, and they should scale with model width. It is better to use normal initialization, RMSNorm and QK-Norm, and to change the order of the layer norm. Finally, ensure that the token embeddings do not become too small (no weight decay on them).</p> <p>Run experiments with smaller models first, to find optimal parameters and decide on data ablations.</p> <h3 id="additional-tips-for-pre-training">Additional tips for pre-training</h3> <p>Learning rate annealing also helps: increase it over the first 10B tokens (up to 3e-4), then reduce it with cosine decay to 5e-5 for the following trillions of tokens, and finally reduce it to 0 in the last 50B tokens (e.g. with curriculum training).</p> <p>Use efficient architectures such as Mixture of Experts.</p> <p>In terms of distributing the training across GPUs, the recommendation is to use FSDP; a minimal setup sketch is shown at the end of this section. Here is an <a href="https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html">excellent tutorial</a>. One needs to ensure that the global batch size is not too large.</p> <p>Use the <a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad">flash attention</a> algorithm, as it is faster and more memory efficient. Try to keep your code simple before reaching for torch.compile.</p> <p>With large training jobs it’s important to manually do garbage collection at the same time in all processes, as otherwise you can have stalls.</p>
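<p>For reference, wrapping a model in FSDP takes only a few lines; the sketch below is a minimal illustration (not the tutorial’s code) and assumes the processes are launched with torchrun on GPU machines:</p>

<pre><code class="language-python"># Minimal FSDP setup: shard parameters, gradients and optimizer state
# across the processes that torchrun started.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
model = torch.nn.TransformerEncoder(layer, num_layers=6).cuda()
model = FSDP(model)  # each rank now holds only a shard of the weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
</code></pre>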
<h2 id="adaptation-or-post-training">Adaptation (or post-training)</h2> <p>The output of the pre-training is not ready for use, as it is just predicting the next token. The model needs to be adapted to the specific task at hand: this is the alignment problem, i.e. how we align the model’s behavior with human preferences (or the specific task).</p> <p>The first step for adaptation is to have some target tasks (e.g. math, or writing code), with some meaningful evaluation, i.e. some specific benchmarks to evaluate. Then one needs to collect (or build) prompts that represent the task.</p> <p>Currently, in many open source LLMs, processes different from <a href="https://openai.com/index/instruction-following/">Reinforcement Learning from Human Feedback</a> (RLHF), created by OpenAI, are used. In reality one can combine them, as we will see below. Some of these methods are:</p> <ul> <li><a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization</a> (DPO), much simpler with similar performance. It does not need a reward model (like RLHF does), and requires only optimizing a modified version of a simple binary cross-entropy objective (see the sketch below). It is used in the <a href="https://arxiv.org/pdf/2407.21783">Llama 3 model</a>.</li> </ul> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/dpo_summary_from_paper.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/dpo_summary_from_paper.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> DPO vs RLHF. From the <a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization</a> paper. </div> <ul> <li> <p><a href="https://huggingface.co/blog/rishiraj/finetune-llms">Supervised Fine-Tuning</a> (SFT), where the model is fine-tuned on a small labeled dataset. It is used in the <a href="https://arxiv.org/pdf/2403.09611">MM1 model</a> from Apple.</p> </li> <li> <p><a href="https://www.interconnects.ai/p/tulu-3">Reinforcement Learning with Verifiable Rewards</a> (RLVR), a quite simple but effective method coined by the authors, where they replace RLHF with a scoring function that offers positive rewards if the answer is correct. Only applicable (for now) to tasks with verifiable rewards, such as math problems with known answers.</p> </li> </ul> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog_images/RL_with+VR_schema.JPG" sizes="95vw"/> <img src="/assets/img/blog_images/RL_with+VR_schema.JPG" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Reinforcement Learning with Verifiable Rewards. From the tutorial authors. </div> <p>In terms of combining them, the authors of the tutorial suggest starting with SFT (e.g. ~1 million prompts), continuing with DPO (another ~1 million prompts), and finally using Reinforcement Learning (~10k-100k prompts).</p>
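<p>As a reference for how lightweight DPO is in practice, the objective mentioned above fits in a few lines of PyTorch (a sketch following the formulation in the DPO paper, not code from the tutorial):</p>

<pre><code class="language-python">import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Binary cross-entropy on the margin between how much more the policy
    # prefers the chosen completion over the rejected one, relative to a
    # frozen reference model (Rafailov et al., 2023).
    margin = (pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
</code></pre>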
<h3 id="supervised-fine-tuning">Supervised Fine Tuning</h3> <p>With SFT you can get ~80% of the performance gain in many tasks.</p> <p>It is used to adapt the pre-trained model to specific styles of input, such as chat interactions, and can include system prompts, multi-turn dialogues…</p> <p>A lot of data in this category has been created synthetically, by using LLMs to generate variations of human-created prompts.</p> <p>Usually one starts by mixing the existing datasets and evaluating with benchmarks, and then creates new data for the benchmarks that are lagging (they used <a href="https://github.com/tencent-ailab/persona-hub">PersonaHub</a>).</p> <h3 id="preference-optimization-dpo">Preference Optimization (DPO)</h3> <p>Aligning to human preferences makes the model stronger (e.g. on ChatBotArena), allowing control over style.</p> <p>Preference optimization takes a prompt with a chosen and a rejected completion (by a human), and assumes that the probability of the chosen completion should be higher than that of the rejected one.</p> <p>Surprisingly low learning rates (~5E-7) are the standard. With a 70B-parameter model the people from the Allen Institute were able to surpass GPT-4 in various benchmarks.</p> <h3 id="reinforcement-learning">Reinforcement Learning</h3> <p>Although more complex, it normally yields ~1% better performance. One can start with synthetic data (LLM-as-a-judge), which has low noise and high bias, and then move to human data, with high noise but low bias.</p> <p>The leading synthetic preference data method is <a href="https://arxiv.org/abs/2310.01377v2">UltraFeedback</a>, where instructions are sampled from a large pool of models and GPT-4 is used to annotate preferences.</p> <p>They used the RLVR method, where there is no reward model but just a scoring function that offers positive rewards if the answer is correct. This is for now limited to math and precise instructions.</p> <h2 id="conclusions">Conclusions</h2> <p>The talk reminded me of other recent articles which describe the LLM building process (or at least provide some details), such as the very <a href="https://arxiv.org/pdf/2403.09611">insightful paper from Apple</a> about their MM1 model, or the <a href="https://arxiv.org/pdf/2407.21783">one from Meta on Llama 3.1</a>.</p> <p>The folks from the Allen Institute were very generous in sharing their knowledge, and I think many of the tips from their practical experience may be useful.
They also shared <a href="https://github.com/allenai/awesome-open-source-lms">this repository</a> with many open source models and resources.</p> <p>I hope the video is shared soon in <a href="https://neurips.cc/virtual/2024/tutorial/99526">the tutorial page</a>.</p> <p><strong>Edit:</strong> <a href="https://www.natolambert.com/">Nathan Lambert</a> tells me that he re-recorded the last part of the tutorial, you can enjoy it here: <a href="https://www.interconnects.ai/p/the-state-of-post-training-2025">The state of post-training in 2025</a>.</p>]]></content><author><name></name></author><category term="NeurIPS"/><category term="LLM,"/><category term="NeurIPS"/><summary type="html"><![CDATA[My notes on a great tutorial at NeurIPS 2024 on how to build a Large Language Model, with many practical tips.]]></summary></entry><entry><title type="html">The path to AGI: quantifying bottlenecks</title><link href="https://manuelsh.github.io/blog/2024/the-path-to-agi-quantifying-bottlenecks/" rel="alternate" type="text/html" title="The path to AGI: quantifying bottlenecks"/><published>2024-10-06T20:03:33+00:00</published><updated>2024-10-06T20:03:33+00:00</updated><id>https://manuelsh.github.io/blog/2024/the-path-to-agi-quantifying-bottlenecks</id><content type="html" xml:base="https://manuelsh.github.io/blog/2024/the-path-to-agi-quantifying-bottlenecks/"><![CDATA[]]></content><author><name></name></author><summary type="html"><![CDATA[Scaling artificial intelligence to new heights comes with significant challenges, particularly in hardware, energy, and data availability. As we strive towards Artificial General Intelligence (AGI), the hurdles grow—from the immense GPU requirements to the daunting energy consumption and even the scarcity of high-quality training data. These obstacles are demanding, yet they are not insurmountable, paving the way for ambitious innovations and new solutions.]]></summary></entry><entry><title type="html">Normalization in TensorFlow: speed is an issue</title><link href="https://manuelsh.github.io/blog/2018/normalization-in-tensorflow-speed-is-an-issue/" rel="alternate" type="text/html" title="Normalization in TensorFlow: speed is an issue"/><published>2018-02-27T10:24:39+00:00</published><updated>2018-02-27T10:24:39+00:00</updated><id>https://manuelsh.github.io/blog/2018/normalization-in-tensorflow-speed-is-an-issue</id><content type="html" xml:base="https://manuelsh.github.io/blog/2018/normalization-in-tensorflow-speed-is-an-issue/"><![CDATA[]]></content><author><name></name></author></entry><entry><title type="html">Setting up your GPU TensorFlow platform</title><link href="https://manuelsh.github.io/blog/2017/setting-up-your-gpu-tensorflow-platform/" rel="alternate" type="text/html" title="Setting up your GPU TensorFlow platform"/><published>2017-06-11T15:10:34+00:00</published><updated>2017-06-11T15:10:34+00:00</updated><id>https://manuelsh.github.io/blog/2017/setting-up-your-gpu-tensorflow-platform</id><content type="html" xml:base="https://manuelsh.github.io/blog/2017/setting-up-your-gpu-tensorflow-platform/"><![CDATA[]]></content><author><name></name></author></entry></feed>