The world has seen exponential growth in the power of deep learning systems over the past 10 years. Since AlexNet in 2012, progress has been rapid and continuous. To predict how powerful future systems will be, it is worth decomposing the causes of that growth.
The simplest decomposition of the causes of growth looks like this:
- Algorithms
- Compute
Both compute and the quality of algorithms have improved dramatically over the past 10 years, but since 2018 the balance has shifted: most of the gains now come from computational power (compute) rather than from better algorithms. This shift coincides with the introduction of transformers, which have been the dominant paradigm for cutting-edge models ever since. The growth in compute has driven huge performance gains over the past 6 years, but it too will soon slow substantially: the current rate of increase in yearly spending is unsustainable, even in the short term.
Why has the speed of algorithm improvements slowed?
Initially, there was a lot of low-hanging fruit in deep learning. As in any new scientific discipline, research advanced rapidly simply by dedicating more people to the problem. Advances that produced massive improvements include:
- Better activation functions, such as ReLU instead of sigmoids, and later GeLU.
- Improvements in optimizers, such as gradient descent with momentum, then Adam, then AdamW.
- ResNet and the idea of residual connections.
- The transformer architecture.
Each of these caused a large jump in the performance of models, but the iterative steps that followed were refinements, much less important than the initial discovery. Going from simple gradient descent to Adam was a big deal, as was going from sigmoid to GeLU. Most of these advances came before 2018, and as more of the easy problems were solved, the pace of improvement slowed.
This is not unique to deep learning; slowing progress is seen across human society. From the 1950s on, the number of researchers has greatly increased, but the pace of innovation has not.
- A recurring theme in human history is that exponentials do not last forever, and when they stop matters a great deal.
When we look at the critical cutting-edge improvements in deep learning, a trend emerges: most of the core algorithmic advances we are iterating on today occurred before 2018.
AI improving itself
That is not to say there aren’t important and potentially revolutionary changes to come, but rather that they will be fewer and farther between. There is also the potential for AI to increase the pace of discovery. We are starting to see early signs of this with Lion, a new optimizer from Google that may outperform AdamW. This algorithm was created with the help of AI systems. [1]
AI is advancing science in a number of fields, such as protein folding in biology. And AI companies would seem to have the greatest know-how and incentive to use AI assistants to improve their own rate of innovation. But this is an open question right now, and it is not safe to assume that AI will drastically speed up deep learning innovation in the near future.
Growth in compute
Pre-2018, researchers could rely mostly on more efficient algorithms to improve performance, with relatively small increases in computational power. Looking at the scale of modern AI systems, something has clearly changed. To dig deeper into this change, compute can be decomposed into:
- The cost/efficiency of hardware
- How much companies are willing to spend on hardware
Why has the improvement in cost and efficiency slowed?
Hardware availability and price are a big reason deep learning became performant. Consumer graphics cards first became a common product in the mid-to-late 90s, and deep learning relied heavily on chips designed to accelerate 3D graphics: they are optimized for linear algebra, and linear algebra is the basis of deep learning. Speed is crucial for deep learning, so it is hard to see a world where deep learning succeeded without 3D-accelerated consumer GPUs.
Huang's law [2] states that since GPUs were introduced, they have improved in performance much faster than equivalent CPUs. Perhaps this is because engineers were capturing low-hanging fruit, or because the architecture scales better. Either way, low-cost deep learning relied on this faster-than-Moore's-law improvement in GPU cost/efficiency.
In 2023, it is not clear that trend will continue. Nvidia's latest consumer graphics cards, the RTX 30-series and RTX 40-series, are both huge improvements over their predecessors, but they also draw much more power and cost more. This indicates that the efficiency gains are, at the very least, slowing. Nvidia has even explicitly told its customers to expect slower gains. This mirrors the rest of the chip industry, where performance gains are becoming harder to find: it simply becomes much more difficult to shrink transistors once you reach 2023's transistor sizes.
On the other hand, there is more demand than ever for enterprise/data center GPU systems. This could lead to some economies of scale and corresponding cost decreases. However, there is already a large and developed market, so the cost decreases from this effect are unlikely to be revolutionary.
- The big unknown for cost/efficiency is whether alternative chip designs, such as analog chips or photon-based processing, pan out. The evidence is not conclusive either way. There will certainly be a financial incentive to develop these technologies as spending on chips increases.
How much are we willing to spend? - Estimating the cost of training
If we cannot rely on the same rate of exponential algorithmic improvements or exponential efficiency improvements, can companies buy their way into faster growth?
Companies certainly have over the past 5 years. It is hard to get an exact estimate of how much it costs to train different models, but estimates put the cost of training GPT-2 in 2019 at around $20k-$100k. Training GPT-3 in 2020 cost $2m-$10m. And based on the recently released GPT-4 technical report, training took somewhere between 100x and 10,000x the compute of GPT-3. [3] Assuming half the cost per unit of compute for GPT-4 vs. GPT-3, that puts the cost of training in the $100m-$1b+ range.
Based on talking to experts in the field, it seems that $100m+ is a good estimate of the cost to train the largest models in 2023.
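As a rough illustration, here is the scaling arithmetic behind that range as a short Python sketch. The GPT-3 cost range, the compute multiplier, and the halved cost per unit of compute are all the rough assumptions stated above, not measured values.

```python
# Back-of-the-envelope GPT-4 cost estimate from the GPT-3 figures above.
# All inputs are rough assumptions from the text, not measured values.
gpt3_cost = (2e6, 10e6)          # estimated GPT-3 training cost, USD
compute_ratio = (100, 10_000)    # GPT-4 compute relative to GPT-3
relative_cost_per_compute = 0.5  # assume GPT-4 paid half as much per unit of compute

low = gpt3_cost[0] * compute_ratio[0] * relative_cost_per_compute
high = gpt3_cost[1] * compute_ratio[1] * relative_cost_per_compute
print(f"GPT-4 training cost: ${low:,.0f} to ${high:,.0f}")
# -> $100,000,000 at the low end. The extreme high end ($50b) combines worst
#    cases, which is why the headline range above stops at $100m-$1b+.
```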
A different way to estimate the cost of training is to look at the cost of the GPUs plus the power to run them. Renting the same compute from a cloud provider would cost many times more, but the estimate below gives a lower bound.
Facebook released the details on training LLaMA, a fairly high-performance LLM [4]:
We estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models. This means that developing these models would have cost around 2,638 MWh.
When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.
An Nvidia A100 GPU with 80GB of RAM costs $15k. One rough estimate is that it depreciates at 25% a year (likely a low estimate). Roughly 400 MWh were used to train the 65B-parameter model.
2048 GPUs × $15,000/GPU = $30,720,000 of hardware; $30,720,000 × 0.25/year depreciation × (21 days / 365 days) ≈ $450,000 depreciation cost
The U.S. nationwide average cost of electricity is 16.8 cents per kWh:
$0.168/kWh × 400,000 kWh ≈ $67,000 electricity cost
Multiply the compute required by 100-1000x to get a lower bound for GPT-4, and you get $50m-$500m+.
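The same lower-bound arithmetic as a short Python sketch. The GPU price, depreciation rate, and energy figure are the rough estimates above, and the 100-1000x multiplier for a GPT-4-class run is a guess rather than a known value.

```python
# Lower-bound training cost: GPU depreciation plus electricity, using the
# LLaMA 65B numbers quoted above (rounded, illustrative figures only).
num_gpus = 2048
gpu_price_usd = 15_000        # per A100-80GB
depreciation_per_year = 0.25  # likely a low estimate
training_days = 21
energy_mwh = 400              # approx. energy to train the 65B model
price_per_kwh = 0.168         # U.S. nationwide average

depreciation = num_gpus * gpu_price_usd * depreciation_per_year * (training_days / 365)
electricity = energy_mwh * 1_000 * price_per_kwh
lower_bound = depreciation + electricity
print(f"LLaMA 65B lower bound: ~${lower_bound:,.0f}")   # ~$510,000

# Scale by the rough 100-1000x compute multiplier for a GPT-4-class run.
for multiplier in (100, 1_000):
    print(f"{multiplier}x: ~${lower_bound * multiplier / 1e6:,.0f}M")
# -> roughly $50M-$500M+, matching the range above
```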
To put this quantity into perspective, the 3 most profitable companies in the world are Aramco, Apple, and Microsoft. Apple had the highest profits of any tech company, reporting a net income of 99.8 billion U.S. dollars in fiscal year 2022. [5] Its profit has been steadily increasing, but assuming it stays constant, if Apple spent all of its profits, it could buy 100x-1000x the compute that was used to train GPT-4.
Apple is not going to spend 100% of its profits training AI models, which means that, as of now, no company can spend $100b to train an AI model. Unless companies pool their resources, that sets a hard cap of around $100b per model. Most companies will want to train multiple models and iterate, and no company will want to spend that much on just one model, so a more realistic cap is $10b-$20b (though models are becoming more general, so there is an increasing chance that one model may be all you need).
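A hypothetical back-of-the-envelope for that ceiling, using the Apple profit figure and the earlier GPT-4 cost range; both inputs are the rough estimates above.

```python
# How many GPT-4-scale training budgets would Apple's annual profit buy?
apple_net_income_2022 = 99.8e9        # USD, from [5]
gpt4_cost_estimates = (100e6, 1e9)    # USD, the rough range estimated above

for cost in gpt4_cost_estimates:
    multiple = apple_net_income_2022 / cost
    print(f"at ${cost:,.0f} per run: ~{multiple:,.0f}x GPT-4's training budget")
# -> roughly 100x-1000x, which is why ~$100b acts as a hard per-model ceiling
```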
- This means that in the next few years, we can expect an AI training run that spends 20x-1000x as much as GPT-4 on compute. After that, spending will grow much more slowly.
Why 1000x GPT-4 is not enough
The recent increase in compute has produced drastically more capable AI systems. However, it is important to look at self-driving cars and other software systems as a reference case. It is not enough to be good at some tests; systems have to function consistently to replace humans, and this is why self-driving cars have not replaced human drivers. Based on early signs, GPT-4 seems much better suited to augmenting humans than replacing them. It cannot pass most medium-difficulty software engineering interview problems, and building full-scale software systems is immensely more complex and difficult than these toy problems. Beyond that, making software is more constrained than many of the problems we want AI to solve.
GPT-4 has around 1000x the compute of GPT-3, yet GPT-4 does not seem that much closer to building complex software systems than GPT-3. Will 1000x the compute of GPT-4 make the difference?
Another case to look at is ChatGPT. ChatGPT feels like a massive leap in the capabilities of AI, and it is indeed immensely more useful than other GPT-3.5 iterations. But this was not achieved by making the underlying model more powerful; it came from the hard work of making the model more usable by humans. While that kind of fine-tuning is important for automating jobs, it is unlikely to get us much closer to the capabilities that truly advanced AI requires. Improving raw capabilities seems to require more computing power.
- The last 10% of a project is always the most difficult, so it is hard to know how many orders of magnitude of improvement lie between current models and transformative AI.
Why won’t AI take over the economy in the next 10 years and justify unlimited spending?
The economy changes slowly. Anyone who has had to fax medical documents knows how long transitions take. And the larger and more capable the models we make, the more expensive they are to run.
Moreover, corporations can get away with carrying a lot of excess fat: they have plenty of slack resources and are slow-moving. Cloud adoption has taken over a decade, and we should not expect AI adoption to be drastically faster. It is often cheaper and advantageous in the near term to stick with what you know; it is hard to upend a company to chase revolutionary gains in productivity.
AI can also reduce profits for existing companies. Bing Chat is a great example of this. While AI may enable better search results, running a large and expensive AI model means that the profit per search is much lower. In general, while AI may be cheaper or better than the thing it replaces, it will likely not be that much cheaper or better, at least at first. So the transition will be slow.
Throughout the transition, new AI systems will also be competing with older AI systems. The incremental improvement will be small, preventing a drastic and sudden takeover. The economy is run by people, and people take time to change their minds, learn new things, and adopt new techniques. Many companies claiming to use "AI" are really just calling a logistic regression or clustering algorithm that has existed for decades.
- This means that AI will be money-constrained. We will not see $1t training runs in the near future.
Conclusion
Growth is likely to remain exponential, but with a much smaller exponent than we have seen over the past 10 years. The past few years have been a period of explosive growth: newly viable business applications combined with scaling laws meant that companies were willing to spend thousands of times more on hardware than they had previously. But this period is coming to an end; we can expect only a few more years of explosive growth before settling into a much more stable rate.
It seems unlikely that spending 1000x GPT-4 on computational power will get us to transformative AI, and it seems unlikely that companies will spend much more. That means the shift to transformative AI will likely be more gradual, taking time as AI improves and slowly replaces human work across a wide range of fields.
There are major unknowns about whether we will have revolutionary algorithmic discoveries, how good AI will be at discovering better algorithms, and whether alternative chip designs will allow faster growth in compute efficiency. However, if we do not see these disruptions, there is reason to believe that AI will roll out with time for people to adapt.
Appendix
Estimated growth exponent by period, broken down into 3 factors
These are my rough estimates. In reality, there is no sharp change in the exponent; this is an extreme simplification.
- Algorithms
  - 2012-2017: 4x per year. Total = 1000x better
  - 2018-2025: 3x per 2 years. Total = 48x better
  - 2026-?: 3x per 2 years
- Efficiency of compute
  - 2012-2017: 3x every 2 years. Total = 15x better
  - 2018-2025: 1.73x every 2 years. Total = 7x better
  - 2026-?: 1.5x every 2 years
- Spending on compute
  - 2012-2017: Relatively flat; let's say a lab was willing to spend $1k on a graphics card
  - 2018-2025: 10x a year. Total = 10,000,000x more
  - 2026-?: 3x per 2 years
One breakdown of time scales (reproduced by the sketch below):
- 2012 vs 2018: 16,000x improvement
- 2018 vs 2025: 3,000,000,000x improvement
- 2025 vs 2035: 400,000x improvement
An alternative breakdown:
- 2012 vs 2023: 95,000,000,000x (classifying small images with high accuracy → passing the bar exam)
- 2023 vs 2043: 1,400,000,000x (passing the bar exam → replacing all human knowledge work?)
It seems that going from classifying images to passing the bar may be easier than going from passing the bar to accurately understanding all the complexities of human society and acting effectively within them.
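A minimal Python sketch of the arithmetic behind the first breakdown of time scales above. The yearly growth factors and period lengths are the rough estimates from the factor list (3x per 2 years ≈ 1.73x per year, 1.73x per 2 years ≈ 1.31x per year, 1.5x per 2 years ≈ 1.22x per year), not measured data.

```python
# Reproduce the "breakdown of time scales" from the per-factor yearly rates.
periods = {
    "2012 vs 2018": {"years": 5,  "algorithm": 4.00, "efficiency": 1.73, "spend": 1.00},
    "2018 vs 2025": {"years": 7,  "algorithm": 1.73, "efficiency": 1.31, "spend": 10.0},
    "2025 vs 2035": {"years": 10, "algorithm": 1.73, "efficiency": 1.22, "spend": 1.73},
}

for label, p in periods.items():
    yearly_total = p["algorithm"] * p["efficiency"] * p["spend"]
    growth = yearly_total ** p["years"]
    print(f"{label}: {yearly_total:.2f}x per year -> {growth:,.0f}x overall")
# -> roughly 16,000x, 3,000,000,000x, and 400,000x, matching the numbers above
```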
How much spending does this entail?
Checking my work
A ~19,000x increase in per-model spending over a ~$100m training run in 2023 gives 19,000 × $100m ≈ $1.9t. It seems reasonable for that much to be spent on AI in 2043.
AlexNet had 62.3m parameters and was trained on 14m images.
Assume compute roughly scales with (number of parameters) × (quantity of training data).
GPT-4: estimate 1t parameters and 5t tokens.
That is roughly 10,000x the parameters and 100,000x the data (to the nearest order of magnitude). Multiplying gives ≈1,000,000,000x, which is about 2 orders of magnitude off from, but reasonably close to, the 95,000,000,000x above.
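The same consistency check as a short Python sketch. The GPT-4 parameter and token counts are guesses (they are not public), and the ~19,000x spend growth comes from the tables below.

```python
# Sanity-check the 2012->2023 growth estimate against model scale,
# assuming compute ~ parameters * training examples (a rough proxy).
alexnet_params, alexnet_images = 62.3e6, 14e6
gpt4_params, gpt4_tokens = 1e12, 5e12        # rough guesses, not public figures

param_ratio = gpt4_params / alexnet_params   # ~16,000x
data_ratio = gpt4_tokens / alexnet_images    # ~360,000x
compute_ratio = param_ratio * data_ratio
print(f"params {param_ratio:,.0f}x * data {data_ratio:,.0f}x = {compute_ratio:.1e}x compute")
# -> ~6e9x, within about two orders of magnitude of the 95,000,000,000x estimate

# Spending check: ~19,000x growth over a ~$100m training run in 2023
print(f"2043 training spend: ${19_000 * 100e6:.2e}")   # ~$1.9 trillion
```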
Exponents (growth factor per year):

| Factor       | 2012-2017 | 2018-2025 | 2026-? |
|--------------|-----------|-----------|--------|
| Algorithm    | 4         | 1.73      | 1.73   |
| Efficiency   | 1.73      | 1.31      | 1.22   |
| Spend        | 1         | 10        | 1.73   |
| Yearly total | 6.92      | 22.663    | 3.651  |
Growth (multiplier over each period):

| Factor     | 2012 (base) | 2012→2018 | 2018→2025     | 2025→2035 |
|------------|-------------|-----------|---------------|-----------|
| Algorithm  | 1           | 1024      | 46.4          | 240.1     |
| Efficiency | 1           | 15.5      | 6.6           | 7.3       |
| Spend      | 1           | 1         | 10,000,000    | 240.1     |
| Total      | 1           | 15,868    | 3,070,589,719 | 421,231   |
| Factor     | 2012 (base) | 2012→2023 | 2023→2043 |
|------------|-------------|-----------|-----------|
| Algorithm  | 1           | 15,868    | 19,271    |
| Efficiency | 1           | 59.8      | 37.6      |
| Spend      | 1           | 100,000   | 19,368    |
| Total      | 1           | 9.487E+10 | 1.402E+10 |