Confidence: Medium. The underlying data is patchy and relies on a good amount of guesswork, and the data work involved a fair amount of vibecoding.
Intro
Tom Davidson has an excellent post explaining the compute bottleneck objection to the software-only intelligence explosion.[1] The rough idea is that AI research requires two inputs: cognitive labor and research compute. If these two inputs are gross complements, then even if there is recursive self-improvement in the amount of cognitive labor directed towards AI research, this process will fizzle as you get bottlenecked by the amount of research compute.
The compute bottleneck objection to the software-only intelligence explosion crucially relies on compute and cognitive labor being gross complements; however, this assumption is not at all obvious. You might think compute and cognitive labor are gross substitutes because more labor can substitute for a higher quantity of experiments via more careful experimental design or selection of experiments. Or you might indeed think they are gross complements because eventually, ideas need to be tested out in compute-intensive experimental verification.
Ideally, we could use empirical evidence to get some clarity on whether compute and cognitive labor are gross complements; however, the existing empirical evidence is weak. The main empirical estimate that is discussed in Tom's article is Oberfield and Raval (2014), which estimates the elasticity of substitution (the standard measure of whether goods are complements or substitutes) between capital and labor in manufacturing plants. It is not clear how well we can extrapolate from manufacturing to AI research.
In this article, we will try to remedy this by estimating the elasticity of substitution between research compute and cognitive labor in frontier AI firms.
Model
Baseline CES in Compute
To understand how we estimate the elasticity of substitution, it will be useful to set up a theoretical model of researching better algorithms. We will use a similar setup to Tom's article, although we will fill in the details.
Let $f$ denote an AI research firm and $t$ denote a time. Let $A_{f,t}$ denote the quality of the algorithms and $K^{\text{inf}}_{f,t}$ denote the amount of inference compute used by research firm $f$ at time $t$. We will let $A_{f,t} K^{\text{inf}}_{f,t}$ denote effective compute (Ho et al., 2024) for inference.
Algorithm quality improves according to the following equation:[2]

$$\frac{\dot{A}_{f,t}}{A_{f,t}} = \theta \, A_{f,t}^{-\beta} \, F\!\left(K^{\text{res}}_{f,t}, L_{f,t}\right)^{\lambda}$$

$\theta$ is the productivity scaling factor. $\beta$ denotes whether ideas, meaning proportional algorithmic improvements, get easier ($\beta < 0$) or harder to find ($\beta > 0$) as algorithmic quality, indexed by $A_{f,t}$, increases. $F$ maps research compute $K^{\text{res}}_{f,t}$ and cognitive labor $L_{f,t}$ to a value representing effective research effort. $\lambda$ denotes a potential parallelization penalty.
We will assume $F$ is a constant returns to scale production function that exhibits constant elasticity of substitution, i.e.,

$$F\!\left(K^{\text{res}}_{f,t}, L_{f,t}\right) = \left( \alpha \left(K^{\text{res}}_{f,t}\right)^{\frac{\sigma-1}{\sigma}} + (1-\alpha) \, L_{f,t}^{\frac{\sigma-1}{\sigma}} \right)^{\frac{\sigma}{\sigma-1}},$$

where $\sigma$ is the elasticity of substitution between research compute and cognitive labor and $\alpha$ is the compute share parameter. $\sigma > 1$ denotes the case where compute and cognitive labor are gross substitutes, $\sigma < 1$ the case where they are gross complements, and $\sigma = 1$ denotes the intermediate, Cobb-Douglas case.
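To make the functional form concrete, here is a minimal Python sketch of a CES aggregator of this kind. The function name and parameter values are illustrative placeholders, not estimates from this post.

```python
import numpy as np

def ces_research_effort(k_res, labor, alpha=0.5, sigma=2.0):
    """CES aggregator F(K_res, L) with elasticity of substitution sigma.

    alpha and sigma here are placeholder values. sigma > 1 means gross
    substitutes, sigma < 1 means gross complements, and sigma -> 1
    recovers the Cobb-Douglas case.
    """
    if np.isclose(sigma, 1.0):
        return k_res**alpha * labor**(1 - alpha)  # Cobb-Douglas limit
    rho = (sigma - 1) / sigma
    return (alpha * k_res**rho + (1 - alpha) * labor**rho) ** (1 / rho)

# With sigma = 0.5 (complements), doubling labor while holding compute
# fixed raises effective research effort less than with sigma = 2.
print(ces_research_effort(1.0, 2.0, sigma=2.0))
print(ces_research_effort(1.0, 2.0, sigma=0.5))
```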
Conditions for a Software-Only Intelligence Explosion
Suppose that at time $t_0$, an AI is invented that perfectly substitutes for human AI researchers. Further, suppose it costs $c$ units of effective compute to run that system. Then

$$\frac{A_{f,t} K^{\text{inf}}_{f,t}}{c}$$

denotes the number of copies that can be run.
We will be interested in whether an intelligence explosion occurs quickly after the invention of this AI. We will define an intelligence explosion as explosive growth in the quality of algorithms, $A_{f,t}$. Explosive growth of $A_{f,t}$ implies at least explosive growth in the quantity of AIs.[3]
Since we are interested in what happens in the short run, we will assume all variables except algorithmic quality remain fixed. That is, we study whether a software-only intelligence explosion occurs.
By assumption, the AI can perfectly substitute for human AI researchers. Therefore, effective labor dedicated to AI research becomes

$$L_{f,t} = L^{\text{human}}_{f,t} + \frac{A_{f,t} K^{\text{inf}}_{f,t}}{c},$$

where $L^{\text{human}}_{f,t}$ denotes human labor.[4] Plugging this effective labor equation into the equation that defines changes in algorithm quality over time:

$$\frac{\dot{A}_{f,t}}{A_{f,t}} = \theta \, A_{f,t}^{-\beta} \, F\!\left(K^{\text{res}}_{f,t}, \; L^{\text{human}}_{f,t} + \frac{A_{f,t} K^{\text{inf}}_{f,t}}{c}\right)^{\lambda}$$
We can use this equation to study whether algorithm quality grows explosively. If the right-hand side is a constant, then algorithm quality grows exponentially. But if the right-hand side is an increasing function of $A_{f,t}$, then algorithm quality experiences super-exponential growth, with an accelerating growth rate. For example, if $\frac{\dot{A}}{A} = \theta A^{\epsilon}$ where $\epsilon > 0$, then growth is hyperbolic and $A$ reaches infinity in finite time.
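As a toy illustration of this distinction, the sketch below numerically integrates a stripped-down law of motion $\dot{A} = \theta A^{x}$ with forward Euler steps; the parameter values and step size are arbitrary choices for the example, not calibrated to anything in the post.

```python
def simulate_A(exponent, A0=1.0, theta=1.0, dt=1e-4, t_max=3.0, cap=1e12):
    """Forward-Euler integration of dA/dt = theta * A**exponent.

    exponent = 1 corresponds to a constant growth rate (exponential growth);
    exponent > 1 corresponds to a growth rate increasing in A (hyperbolic
    growth that diverges in finite time)."""
    A, t = A0, 0.0
    while t < t_max and A < cap:
        A += theta * A**exponent * dt
        t += dt
    return t, A

print(simulate_A(1.0))  # exponential: still modest at t_max = 3
print(simulate_A(1.5))  # hyperbolic: hits the cap near t = 2
```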
The following are the necessary and sufficient conditions for explosive growth in $A_{f,t}$:

- If $\sigma < 1$: $\beta < 0$.
- If $\sigma = 1$: $\lambda(1-\alpha) - \beta > 0$.
- If $\sigma > 1$: $\lambda - \beta > 0$.
To see why, let us go over the cases.
If $\sigma < 1$, then the effective research effort term in our differential equation for $A_{f,t}$ is bounded. Intuitively, compute bottlenecks progress in effective research input. Therefore, the rate of growth of $A_{f,t}$ grows unboundedly if and only if the $A_{f,t}^{-\beta}$ term grows over time, i.e., $\beta < 0$.
If $\sigma = 1$, then asymptotically we have

$$\frac{\dot{A}_{f,t}}{A_{f,t}} \propto A_{f,t}^{-\beta} \left(\frac{A_{f,t} K^{\text{inf}}_{f,t}}{c}\right)^{\lambda(1-\alpha)} \propto A_{f,t}^{\lambda(1-\alpha) - \beta}.$$

We get hyperbolic growth if and only if $\lambda(1-\alpha) - \beta > 0$.
If $\sigma > 1$, then we are in the same case as $\sigma = 1$, except compute and cognitive labor are even more substitutable, so we drop the $(1-\alpha)$ term and the condition becomes $\lambda - \beta > 0$.
The condition $\lambda - \beta > 0$ is exactly what Epoch and Forethought (see comments; Forethought actually considers the $\sigma = 1$ case) consider when they analyze whether the returns to research are high enough for a singularity.[5] They both find it possible that $\lambda - \beta > 0$, although the evidence is imperfect and mixed across various contexts. Therefore, if $\sigma \geq 1$, then a software-only intelligence explosion looks at least possible.
However, if $\sigma < 1$, then a software-only intelligence explosion occurs only if $\beta < 0$. But if this condition held, we could get an intelligence explosion with constant, human-only research input. While not impossible, we find this condition fairly implausible.
Therefore, $\sigma$ crucially affects the plausibility of a software-only intelligence explosion. If $\sigma \geq 1$ then a software-only intelligence explosion is plausible, but if $\sigma < 1$ it is not.
Deriving the Estimation Equation
We will estimate $\sigma$ by looking at how AI firms allocate research compute and human labor from 2014 to 2024.
Of course, throughout this time period, the AI firms have been doing more than merely allocating research compute and human labor. Their activities included training AIs and serving AIs, in addition to the research-focused allocation of compute and human labor. Formally, they have been choosing a schedule of training compute $K^{\text{train}}_{f,t}$, inference compute $K^{\text{inf}}_{f,t}$, research compute $K^{\text{res}}_{f,t}$ and human labor $L_{f,t}$. However, we can split the firm's optimization problem into two parts:
- Dynamic Optimization: choosing $K^{\text{train}}_{f,t}$, $K^{\text{inf}}_{f,t}$, and $\bar{F}_{f,t}$, where $\bar{F}_{f,t}$ denotes the target level of effective research effort.
- Static Optimization: choosing $K^{\text{res}}_{f,t}$ and $L_{f,t}$ to minimize costs such that $F(K^{\text{res}}_{f,t}, L_{f,t}) \geq \bar{F}_{f,t}$.
In this split, we have assumed that $L_{f,t} = L^{\text{human}}_{f,t}$, i.e., that AIs did not contribute to cognitive labor before 2025.[6]
We will estimate $\sigma$ using the static optimization problem. Let $r_{f,t}$ and $w_{f,t}$ denote the cost of research compute and human labor respectively. Then the static optimization problem becomes

$$\min_{K^{\text{res}}_{f,t},\, L_{f,t}} \; r_{f,t} K^{\text{res}}_{f,t} + w_{f,t} L_{f,t} \quad \text{subject to} \quad F\!\left(K^{\text{res}}_{f,t}, L_{f,t}\right) \geq \bar{F}_{f,t}.$$
If we take the first-order conditions with respect to compute and cognitive labor, divide, take logs and re-arrange, we get the following equation:

$$\log\!\left(\frac{K^{\text{res}}_{f,t}}{L_{f,t}}\right) = \sigma \log\!\left(\frac{\alpha}{1-\alpha}\right) + \sigma \log\!\left(\frac{w_{f,t}}{r_{f,t}}\right)$$

Therefore, we can estimate $\sigma$ by regressing $\log\!\left(\frac{K^{\text{res}}_{f,t}}{L_{f,t}}\right)$ on a constant and $\log\!\left(\frac{w_{f,t}}{r_{f,t}}\right)$ and looking at the coefficient on $\log\!\left(\frac{w_{f,t}}{r_{f,t}}\right)$. Intuitively, we can estimate how substitutable compute and labor are by seeing how the ratio of compute to labor changes as the relative price of labor to compute changes.
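As a sketch of what this regression might look like in practice, the following snippet uses statsmodels; the file name and column names are hypothetical stand-ins for the panel described in the Estimation section, not the actual dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical firm-year panel with columns: firm, year, k_res, labor,
# wage, rental_rate. These names are placeholders.
df = pd.read_csv("firm_year_panel.csv")

df["log_ratio"] = np.log(df["k_res"] / df["labor"])           # log(K_res / L)
df["log_rel_price"] = np.log(df["wage"] / df["rental_rate"])  # log(w / r)

# Baseline CES-in-compute specification: the coefficient on log_rel_price
# is the estimate of sigma; standard errors are clustered by firm.
fit = smf.ols("log_ratio ~ log_rel_price", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm"]}
)
print(fit.params["log_rel_price"], fit.bse["log_rel_price"])
```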
Alternative CES Formulation in Frontier Experiments
One potential problem with the baseline CES production function is that the required ratio of compute to labor in research does not depend on the frontier model size. Intuitively, as frontier models get larger, the compute demands of AI research should get larger as the firm needs to run near-frontier experiments. To accommodate this intuition, we will explore a re-parametrization of CES as an extension to our main, baseline results.
Let $E_{f,t}$ denote the number of near-frontier experiments a firm can run at time $t$:

$$E_{f,t} = \delta \, \frac{K^{\text{res}}_{f,t}}{K^{\text{train}}_{f,t}}$$

$\frac{K^{\text{res}}_{f,t}}{K^{\text{train}}_{f,t}}$ is literally the number of frontier research training runs possible.[7] $\delta$ denotes the productivity benefit of extrapolating results from smaller experiments. For example, if you can accurately extrapolate experiments at $\frac{1}{1000}$ of frontier compute then $\delta = 1000$.[8] Now the change in algorithm quality over time is given by

$$\frac{\dot{A}_{f,t}}{A_{f,t}} = \theta \, A_{f,t}^{-\beta} \, F\!\left(E_{f,t}, L_{f,t}\right)^{\lambda}.$$
We continue to suppose $F$ is CES. Following the same derivation steps as before, we get the following modified estimation equation:

$$\log\!\left(\frac{K^{\text{res}}_{f,t}}{L_{f,t}}\right) = \sigma \log\!\left(\frac{\alpha}{1-\alpha}\right) + (\sigma - 1)\log\delta + \sigma \log\!\left(\frac{w_{f,t}}{r_{f,t}}\right) + (1-\sigma)\log K^{\text{train}}_{f,t}$$
We can estimate this equation by regressing $\log\!\left(\frac{K^{\text{res}}_{f,t}}{L_{f,t}}\right)$ on a constant, $\log\!\left(\frac{w_{f,t}}{r_{f,t}}\right)$ and $\log K^{\text{train}}_{f,t}$. We will take the coefficient on $\log\!\left(\frac{w_{f,t}}{r_{f,t}}\right)$ as our estimate for $\sigma$.
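A sketch of the modified regression, under the same hypothetical column names as before; `C(firm)` adds the firm fixed effects used in the results section below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("firm_year_panel.csv")  # placeholder file and column names

df["log_ratio"] = np.log(df["k_res"] / df["labor"])
df["log_rel_price"] = np.log(df["wage"] / df["rental_rate"])
df["log_k_train"] = np.log(df["k_train"])

# Frontier-experiments specification: log training compute enters as an
# additional regressor. The coefficient on log_rel_price is again the
# estimate of sigma.
fit = smf.ols(
    "log_ratio ~ log_rel_price + log_k_train + C(firm)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["firm"]})
print(fit.params["log_rel_price"])
```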
Estimation
Data
To estimate the key equation described above, we need data on $K^{\text{res}}_{f,t}$, $K^{\text{train}}_{f,t}$, $L_{f,t}$, $w_{f,t}$, and $r_{f,t}$. We attempt to gather this data for as many time periods and across as many major AI firms as we can. Unfortunately, this data is not always publicly available, so we do a good amount of guesswork. When we are guessing/extrapolating values, we try to note how uncertain we are. We incorporate these uncertainty estimates into our standard errors for the main results. If anyone knows any better data sources or can provide more accurate data here, we would be extremely interested.
Our data covers OpenAI from 2016-2024, Anthropic from 2022-2024, DeepMind from 2014-2024 and DeepSeek from 2023-2024. All prices are inflation-adjusted to 2023 USD. We use the following data sources.
$L_{f,t}$: We use headcount estimates from PitchBook, which are available at high frequency, roughly once per year. Unfortunately, the data does not distinguish between research/engineering staff and operations/product staff. As long as the ratio of research to operations staff has been constant over time, our results will be unbiased.
$K^{\text{train}}_{f,t}$: We first take estimates of training compute from Epoch's notable models page, aggregating to the firm-year level by summing all training compute used across models in a given year. In cases where firms do not release (major) models in a given year, we assume training compute is the same as the prior year.
$K^{\text{res}}_{f,t}$: The Information reported the ratio of OpenAI's research compute spend to training compute spend in 2024. We multiply our estimate of $K^{\text{train}}_{f,t}$ by this ratio to get our estimate of $K^{\text{res}}_{f,t}$. This is a significant limitation, as we are assuming that the ratio of research to training compute is constant across firms and times.[9] Out of all of our variables, this is the most coarse one.
$w_{f,t}$: Our most reliable wage data comes from DeepMind's and OpenAI's financial statements, which include total spend on staff. Combined with our estimates of headcounts, we can recover average wages. DeepMind's financial statements cover 2014-2023, while OpenAI's statements cover its period as a nonprofit from 2016-2018. We fill in the rest of the years and firms using data from firms' job postings, Glassdoor, H1B Grader, levels.fyi, news sources, and BOSS Zhipin. We specifically look for and impute the wage of level III employees (scientific researchers) at each firm. We use salary instead of total compensation, assuming that salary is a constant fraction of total compensation.[10] While the financial statements data is reliable, the filled-in data involves quite a bit of guesswork.
$r_{f,t}$: We use the rental rate of GPUs according to Epoch's data. We match each AI firm with its cloud provider (e.g., OpenAI with Microsoft Azure) and use the corresponding rental rate. In reality, AI firms buy many GPUs outright, although in a competitive market the depreciated, present-discounted purchase price should match the hourly rental rate.[11] We adjust for GPU quality by measuring the price in units of total FLOP (e.g., FLOP/s times 3600 seconds in an hour) per dollar. We match firm-years to the GPUs that they were likely using at the time (e.g., OpenAI using A100s in 2022 and H100s in 2024). There is guesswork involved in the exact mix of GPUs that each firm is using in each year.
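To illustrate the quality adjustment for $r_{f,t}$, the snippet below converts an hourly rental rate into a price per FLOP. The throughput and rental figures are rough placeholders, not the Epoch numbers actually used.

```python
# Convert an hourly GPU rental rate into dollars per FLOP.
GPU_FLOP_PER_SEC = 1.0e15   # assumed effective throughput of the matched GPU
RENTAL_USD_PER_HOUR = 2.50  # assumed hourly rental rate for that GPU

flop_per_hour = GPU_FLOP_PER_SEC * 3600  # total FLOP delivered per rented hour
r = RENTAL_USD_PER_HOUR / flop_per_hour  # price of research compute, USD per FLOP
print(f"r = {r:.3e} USD per FLOP")
```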
Trends
Before we get into estimation, we graph the following key variables. These are useful sanity checks, and if anyone has a good sense that these variables do not seem right, please let us know.
Estimation Results
We estimate two sets of main results, one for the CES in compute specification and one for the CES in frontier experiments specification. For both sets of results, we include firm fixed effects to correct for any time-invariant productivity differences between firms.
| | CES in Compute | CES in Frontier Experiments |
| --- | --- | --- |
| $\hat{\sigma}$ | 2.58 | -0.10 |
| SE | (0.34) | (0.18) |
| MCSE | [0.68] | [0.30] |
Baseline standard errors (SE) are clustered at the firm level. We also run a Monte Carlo in which we simulate independent noise in each variable according to our subjective uncertainty about the underlying data quality. Combining this Monte Carlo with a clustered bootstrap gives us our Monte Carlo standard errors (MCSE).
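For concreteness, here is one way such a Monte Carlo could be combined with a clustered bootstrap; it uses a single illustrative noise level and the same hypothetical column names as the earlier sketches, whereas the actual analysis uses variable-specific subjective uncertainties.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def monte_carlo_se(df, n_sim=1000, noise_sd=0.2):
    """Return an MCSE for sigma: resample whole firms with replacement
    (clustered bootstrap), jitter each variable with multiplicative
    lognormal measurement noise, re-estimate, and take the spread."""
    firms = df["firm"].unique()
    estimates = []
    for _ in range(n_sim):
        boot = pd.concat(
            [df[df["firm"] == f] for f in rng.choice(firms, size=len(firms))],
            ignore_index=True,
        )
        for col in ["k_res", "labor", "wage", "rental_rate"]:
            boot[col] = boot[col] * rng.lognormal(0.0, noise_sd, size=len(boot))
        boot["log_ratio"] = np.log(boot["k_res"] / boot["labor"])
        boot["log_rel_price"] = np.log(boot["wage"] / boot["rental_rate"])
        fit = smf.ols("log_ratio ~ log_rel_price", data=boot).fit()
        estimates.append(fit.params["log_rel_price"])
    return float(np.std(estimates))
```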
Our results show that how you model the situation matters quite a lot. In the CES in compute specification, we estimate $\hat{\sigma} = 2.58$, which implies research compute and cognitive labor are highly substitutable. However, in the frontier experiments case, we estimate $\hat{\sigma} = -0.10$. It is impossible in the economic model to have $\sigma < 0$, although this estimate is well within the range of being 0 with some statistical error. In any case, this means that frontier experiments and labor are highly complementary. Recall that $\sigma = 0$ would denote perfect complements, where increasing cognitive labor without a corresponding increase in near-frontier experiments would result in no additional growth in algorithm quality.
To get a visual understanding of these fits, the following figure plots the regression result for the CES in compute specification. The slope of the fitted line corresponds to the estimated $\sigma$.
The next figure corresponds to the regression result for CES in frontier experiments.
We also performed some basic robustness tests, like restricting the sample to after 2020 or 2022 (after GPT-3 or GPT-4 was released), excluding 2024 (due to the concern that AI had started to meaningfully assist in AI research), restricting the sample to DeepMind only (where we have the highest quality wage data), changing how we calculate wages, changing how we calculate research compute, etc. We get qualitatively similar results in all cases that we tried (substitutes in the CES in compute specification, complements in the CES in frontier experiments specification).
Results
The two different specifications give very different estimates of $\sigma$. If we had to choose between the specifications, we would choose the frontier experiments version. If the raw compute version were properly specified, then adding training compute as a control should not change the coefficient on $\log\!\left(\frac{w_{f,t}}{r_{f,t}}\right)$.[12] Therefore, our overall update from the evidence is that compute could majorly bottleneck the intelligence explosion.
But we do not update by a huge amount, as this analysis has a lot of potential problems. It is not obvious which specification is correct, the underlying data has reliability problems, and the data covers only 4 firms across a handful of years. On a more technical level, a large amount of the variation is explained by firms scaling up training compute over time, there is endogeneity/simultaneity bias[13], and our analysis relies on simplifying assumptions such as the CES functional form and homogeneous, non-quality-differentiated labor.
Thanks to Basil Halperin and Phil Trammell for reviewing the post and giving extensive comments.
- ^
The original version of this objection was made informally by Epoch researchers.
- ^
In reality, there is probably some algorithmic quality depreciation as you scale up training compute (e.g., algorithms that are good for GPT-2 might be bad for GPT-4). We could accommodate this intuition by adding a depreciation term to the right-hand side of the equation for $\frac{\dot{A}_{f,t}}{A_{f,t}}$. But for our analysis, training compute is fixed, so this depreciation term would not matter.
- ^
If we additionally suppose that the intelligence of AI is an increasing, unbounded function of effective training compute, $A_{f,t} K^{\text{train}}_{f,t}$, then explosive growth of $A_{f,t}$ would also imply explosive growth in the intelligence of AIs. Of course, this assumes that the same algorithmic quality term affects inference compute and training compute.
- ^
As written, all inference compute is dedicated to AI research. We can easily weaken this assumption by having $c$ represent the compute cost divided by the fraction of inference compute dedicated to AI research. If we assume this fraction is constant over time, then we get the same equation.
- ^
There is a small notation difference, as they denote the explosion condition as a fraction, e.g., $\frac{\lambda}{\beta} > 1$, while we express the condition as a sum, e.g., $\lambda - \beta > 0$.
- ^
We think this assumption is very defensible for 2023 and prior. 2024 is borderline, so we try excluding it from our analysis as a robustness check.
- ^
Note that algorithmic advances do not improve this ratio because they improve effective training compute and effective research compute at the same rate. Therefore, algorithmic advances simply cancel out.
- ^
To spell this out further, if research compute and training compute are equal, then by default you can run one experiment at the frontier. However, if you can extrapolate from experiments $\frac{1}{1000}$ the size, then you can run 1000 experiments, so $\delta = 1000$.
Further, the value of $\delta$ does not really matter. If $\delta$ is a fixed number (e.g., AIs are not better at extrapolating than humans), then $\delta$ does not change the conditions under which there is an intelligence explosion and it does not change the estimate of $\sigma$.
- ^
If this ratio has changed a lot over time, both of our estimates could be wrong. But in particular, the frontier experiments estimate could be badly wrong because we are essentially assuming that one of the inputs, number of frontier experiments, has been constant over time.
- ^
As long as the ratio of total compensation to salary has been constant over time, it does not matter which we use.
- ^
Note, however, that the AI GPU market is not competitive, as Nvidia owns a huge fraction of the market. Still, this Epoch paper finds that prices calculated via ownership vs. rental rates are fairly similar.
- ^
This logic is far from airtight. If both models are misspecified, then adding a control can result in a more biased estimate.
- ^
We have a version of this analysis where we use local wages as an instrument to address potential endogeneity. We get similar results to the OLS version, but we omit them for brevity and because we are still thinking about better instruments.
This is a good point, we agree, thanks! Note that you need to assume that the algorithmic progress that gives you more effective inference compute is the same that gives you more effective research compute. This seems pretty reasonable but worth a discussion.
Although note that this argument works only with the CES in compute formulation. For the CES in frontier experiments, you would have $\frac{A K^{\text{res}}}{A K^{\text{train}}}$, so the $A$ cancels out.[1]
You might be able to avoid this by adding the $A$'s in a less naive fashion. You don't have to train larger models if you don't want to. So perhaps you can freeze the frontier, and then you get $\frac{A K^{\text{res}}}{A_{\text{frozen}} K^{\text{train}}}$? I need to think more about this point.