Toby_Ord

Comments

That's an interesting way to connect these. I suppose one way to view your model is as making clear the point that you can't cost-effectively use models on tasks that are much longer than their 50% horizons (even if you are willing to try multiple times), and that the trend of dramatic price improvements over time isn't enough to help with this. Instead you need the continuation of the METR trend of exponentially growing horizons. Moreover, you give a nice intuitive explanation of why that is.

One thing to watch out for is Gus Hamilton's recent study suggesting that there isn't a constant hazard rate. I share my thoughts on it here, but my basic conclusion is that he is probably right. In particular, he has a functional form estimating how a model's success probability declines with task length. You could add this to your model (it is basically 1 minus the CDF of a Weibull distribution with K = 0.6). This survival function has a heavier tail than an exponential (a stretched exponential, since K < 1), making the 'just run it heaps of times' approach slightly more tenable. It may mean that it is the cost of human verification that gets you, rather than it being untenable even on AI costs alone.
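
A minimal sketch of how that functional form could slot into a cost model like yours. The parameterisation via the 50% horizon and the helper names are my own illustration; only the K = 0.6 shape is from Gus's fit as I understand it:

```python
import math

def success_prob(task_len: float, h50: float, k: float = 0.6) -> float:
    """Weibull survival function, parameterised so success is 50% at task_len = h50.

    S(t) = exp(-ln(2) * (t / h50)**k); k = 1 recovers the constant-hazard model.
    """
    return math.exp(-math.log(2) * (task_len / h50) ** k)

def expected_attempts(task_len: float, h50: float, k: float = 0.6) -> float:
    """Expected number of independent runs until one succeeds (1 / p)."""
    return 1.0 / success_prob(task_len, h50, k)

# Example: a task 16x longer than the model's 50% horizon.
for k in (1.0, 0.6):
    print(k, round(expected_attempts(16.0, 1.0, k), 1))
# Constant hazard (k=1) needs ~65,500 expected attempts; k=0.6 needs only ~39.
```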

And one of the authors of the METR timelines paper has his own helpful critique/clarifications of their results.

Good points. I'm basically taking METR's results at face value and showing that people are often implicitly treating costs (or cost per 'hour') as constant, especially when extrapolating them, when in fact these costs appear to be growing substantially.

Re the quality / generalisability of the METR timelines, there is quite a powerful critique of it by Nathan Witkin. I wouldn't go as far as he does, but he's got some solid points. 

Thanks Basil! That's an interesting idea. The constant hazard rate model is just comparing two uses of the same model over different task lengths, so if you use it to work out the 99% time horizon, a task at that horizon should cost about 1/70th as much ($1.43), since under a constant hazard rate the 99% horizon is only about 1/70th as long as the 50% horizon. Over time, I think these 99% tasks should rise in cost in roughly the same way as the 50%-horizon ones (as they are both increasing in length in proportion). But estimating how that will change in practice is especially dicey as there is too little data.
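
For concreteness, here is the arithmetic behind that figure, a minimal sketch assuming (as in the constant hazard rate model) that success probability on a task of length t is 2^(-t/h50) and that cost scales roughly linearly with task length; the $100 baseline is just an illustrative assumption consistent with the $1.43:

```python
import math

H50_TASK_COST = 100.0  # assumed (illustrative) cost of a task at the 50% horizon

# Constant hazard rate: P(success on a task of length t) = 2**(-t / h50).
# Solving 2**(-t99 / h50) = 0.99 gives the 99% horizon as a fraction of the 50% one.
length_ratio = math.log2(1 / 0.99)
print(round(length_ratio, 4), round(1 / length_ratio, 1))  # ~0.0145, i.e. ~1/69
print(round(H50_TASK_COST * length_ratio, 2))  # ~$1.45 (the $1.43 above uses exactly 1/70)
```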

Also, note that Gus Hamilton has written a great essay that takes the survival analysis angle I used in my constant hazard rates piece and extends it to show pretty convincingly that the hazard rates are actually decreasing. I explain it in more detail here. One upshot is that it gives a different function for estimating the 99% horizon lengths. He also shows that these are poorly constrained by the data: his model disagrees with METR's by a factor of 20 on how long they are, with even more disagreement for even shorter horizon lengths.

Some great new analysis by Gus Hamilton shows that AI agents probably don't obey a constant hazard rate / half-life after all. Instead their hazard rates systematically decline as the task goes on.

This means that their success rates on tasks beyond their 50% horizon are better than the simple model suggests, but those for tasks shorter than the 50% horizon are worse.

I had suggested a constant hazard rate was a good starting assumption for how their success rate at tasks decays with longer durations. It is the simplest model and fits the data OK. But Gus used the standard second-simplest model from survival analysis (the Weibull distribution rather than the exponential distribution). It has a second parameter, K, which represents how the hazard rate changes with time (if at all). If K=1, there is a constant hazard rate, so the exponential distribution is a special case of the Weibull. But if K<1, then hazard decreases over time (like the Lindy effect), and if it is greater, hazard increases (like aging). 
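
For reference, this is the survival model being described, in standard textbook notation (scale λ, shape K); I haven't checked Gus's exact parameterisation, so take this as the generic form rather than his:

```latex
% Weibull survival and hazard functions (scale \lambda, shape K)
S(t) = \exp\!\left[-\left(\tfrac{t}{\lambda}\right)^{K}\right], \qquad
h(t) = \frac{K}{\lambda}\left(\tfrac{t}{\lambda}\right)^{K-1}
% K = 1: constant hazard (the exponential special case)
% K < 1: hazard falls with time;  K > 1: hazard rises with time
```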

Gus found that the estimated values for K were below 1 for all the models, showing that *all* of them had decreasing hazard rates. 

A distribution that generalises another is always going to fit the data at least as well as the special case (here, my exponential), so improved fit alone wouldn't be decisive. But the fact that every single model has K statistically significantly below 1 convinces me he is right.

So what does this mean?

One thing is that it gives very different estimated success rates for tasks much shorter or longer than the 50% horizon (which METR focuses on because it is easier to reliably estimate), e.g. if you use the Weibull to estimate the 99% horizon (or the 10% horizon).
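
As a rough illustration of how much the tail behaviour matters, here is the horizon formula that falls out of such a Weibull (parameterised by the 50% horizon; K = 0.6 is used as a typical fitted value, and the specific numbers are illustrative rather than Gus's or METR's estimates):

```python
import math

def horizon(p: float, h50: float, k: float) -> float:
    """Task length at which success probability is p, for a Weibull with S(h50) = 0.5."""
    # Solve exp(-ln(2) * (t / h50)**k) = p  =>  t = h50 * (ln(1/p) / ln(2))**(1/k)
    return h50 * (math.log(1 / p) / math.log(2)) ** (1 / k)

for k in (1.0, 0.6):  # constant hazard vs an illustrative decreasing-hazard shape
    print(k, round(horizon(0.99, 1.0, k), 5), round(horizon(0.10, 1.0, k), 2))
# With the 50% horizon set to 1: the 99% horizon shrinks from ~0.0145 (K=1) to ~0.0009 (K=0.6),
# while the 10% horizon grows from ~3.3 to ~7.4.
```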

Another thing is that the AI agents mainly have a K of about 0.6, while the human value of K is significantly lower, at about 0.4. This means even if they have the same 50% horizon, humans can do better on really long tasks (and worse on really short ones).

As this comparison shows, for a fixed 50% horizon length, it isn't clearly better or worse to have a lower value of K. Lower values are better on really long tasks (where success rates are low anyway), but worse at high-reliability thresholds on shorter tasks.
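
A quick numerical illustration, holding the 50% horizon fixed at 1 and using the rough K values above (0.6 for agents, 0.4 for humans); the task lengths are arbitrary multiples of that horizon:

```python
import math

def success(t: float, k: float) -> float:
    """Success probability for a Weibull whose 50% horizon is 1 (t in units of that horizon)."""
    return math.exp(-math.log(2) * t ** k)

for t in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(t, round(success(t, 0.6), 3), round(success(t, 0.4), 3))
# At 1/100th of the horizon: ~0.96 (K=0.6) vs ~0.90 (K=0.4) -- lower K hurts high reliability.
# At 10x the horizon:        ~0.06 (K=0.6) vs ~0.18 (K=0.4) -- lower K helps on very long tasks.
```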

As a word of warning, I was quite sure before METR released its Opus 4.5 results that it was going to have a more human-like value of K, since it had a great 50% horizon but only an average showing at the 80% horizon. But the estimates are that its value of K is similar to the other models'. I'm not sure why that is, but it might be because there isn't much data to go on here and things are quite noisy for any individual model.

So, from Gus's results, it still looks like there is some important gap between how human success rates drop off at longer tasks versus how AI agents do.

Gus also compares his two-parameter Weibull model of the data to METR's two-parameter log-logistic model. He finds that they are similar, but with the log-logistic fitting slightly better. So it isn't clear which of these to use if you have the choice. They differ quite a lot in the tails of the distribution (i.e. in estimated success rates for very short or very long tasks). e.g. the Weibull says the 99% horizon is 1/20th as long as the log-logistic predicts. That's a big deal and the data doesn't tell us which to favour! I'd slightly favour the Weibull, on the grounds that it is more plausible ex ante. But maybe the bigger lesson is that it is unknown which is right, and thus the 99% horizons (necessary for much useful work) are deeply uncertain.
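
To give a feel for how far apart the tails can get, here is a comparison with purely illustrative shape parameters (K = 0.6 for the Weibull as above and a slope of 1 for the log-logistic; these are not the fitted values from either analysis), both pinned to the same 50% horizon:

```python
import math

def weibull_horizon(p: float, k: float) -> float:
    """p-success horizon as a multiple of the 50% horizon, Weibull shape k."""
    return (math.log(1 / p) / math.log(2)) ** (1 / k)

def loglogistic_horizon(p: float, beta: float) -> float:
    """p-success horizon as a multiple of the 50% horizon, log-logistic slope beta.

    S(t) = 1 / (1 + t**beta), with t in units of the 50% horizon.
    """
    return ((1 - p) / p) ** (1 / beta)

w, ll = weibull_horizon(0.99, 0.6), loglogistic_horizon(0.99, 1.0)
print(round(w, 5), round(ll, 5), round(ll / w, 1))
# With these illustrative parameters the log-logistic 99% horizon is ~12x the Weibull's;
# with the actual fitted shapes, the gap Gus reports is ~20x.
```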

I agree: a bunch of the arguments read like marketing that greatly simplifies the real picture, without much interest in digging deeper once a convenient story was found.

That's a good summary and pretty in-line with my own thoughts on the overall upshots. I'd say that absent new scaling approaches the strong tailwind to AI progress from compute increases will soon weaken substantially. But it wouldn't completely disappear, there may be new scaling approaches, and there remains progress via AI research. Overall, I'd say it lengthens timelines somewhat, makes raw compute/finances less of an overwhelming advantage, and may require different approaches to compute governance.

A few points to clarify my overarching view:

  1. All kinds of compute scaling are quite inefficient on most standard metrics. There are steady gains, but they are coming from exponentially increasing inputs. These can't continue forever, so all these kinds of gains from compute scaling are naturally time-limited. The exponential growth in inputs may also be masking fundamental deficiencies in the learning algorithms.
  2. By 'compute scaling' I'm generally referring to the strategy of adding more GPUs to get more practically useful capabilities. I think this is running out of steam for pretraining and will soon start running out of steam for RL and inference scaling. This is possible even if the official 'scaling laws' of pretraining continue to hold (I'm generally neutral on whether they will).
  3. It is possible that there will always be a new paradigm of compute scaling to take over when old ones run out of steam. If so, then like Moore's Law, the longterm upward trend might be made out of a series of stacked S-curves. I'm mainly pointing out the limits of the current scaling paradigms, not denying the possibility of future ones.
  4. I don't think that companies are likely to completely stop scaling any of these forms of compute scaling. The maths tends to recommend balancing the shares of compute that go to all of them in proportion to how much they improve capabilities per doubling of compute, e.g. perhaps a 3:1:2 ratio between pretraining, RL, and inference (though I expect the 3 to decline due to running out of high-quality text for pretraining). A toy version of this allocation maths is sketched just after this list.
  5. But given that we don't know if there will always be more paradigms delivered on time to save scaling, the limits on the current approaches should increase our credence that the practical process of scaling will provide less of a tailwind to AI progress going forward. Overall, my view is something like: the strength of this tailwind that has driven much of AI progress since 2020 will halve. (So it would still be important, but no longer the main determinant of who is in front, or of the pace of progress.)
  6. As well as implications for the pace of progress, changes in what determines progress have implications for strategy and governance of AI. For example, AI researchers will be comparatively more important than in recent years, and if inference scaling becomes the main form of scaling, that has big implications for compute governance and for the business model of AI companies.
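
Here is the toy allocation maths referred to in point 4: a minimal sketch assuming capabilities are (locally) a sum of logarithmic terms in each kind of compute, which is the assumption under which 'share in proportion to gains per doubling' comes out as exactly optimal. The 3:1:2 numbers are the illustrative ratio from that point, not estimates of mine.

```python
import math

# Assumed capability gains per doubling of compute for each kind of scaling.
gains_per_doubling = {"pretraining": 3.0, "rl": 1.0, "inference": 2.0}
budget = 1.0  # total compute, arbitrary units

# Optimal shares are simply proportional to the per-doubling gains.
total = sum(gains_per_doubling.values())
shares = {name: g / total for name, g in gains_per_doubling.items()}
print(shares)  # pretraining 1/2, rl 1/6, inference 1/3

# Check: with capability = sum_i a_i * log2(C_i), the marginal capability per extra unit
# of compute, a_i / (C_i * ln 2), is equal across all three uses at this split.
for name, a in gains_per_doubling.items():
    c = shares[name] * budget
    print(name, round(a / (c * math.log(2)), 3))  # the same value (~8.66) each time
```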

That is quite a surprising graph: the annual tripling and the correlation between compute and revenue are far tighter than I think anyone would have expected. Indeed, they are so tight that I'm a bit skeptical about what is going on.

One thing to note is that it isn't clear what the compute graph is of (e.g. is it inference + training compute, but not R&D?). Another is that it compares year-end figures with full-year totals on the right, but since both are exponentials with the same doubling time (just in different units), that isn't a big deal.

There are a number of things I disagree with in the post. The main one relevant to this graph is the implication that the graph on the left causes the graph on the right. That would be genuinely surprising. We've seen that the slope on the famous scaling-law graphs is about -0.05 for compute, so you need to double compute 20 times to get log-loss to halve (a quick check of this arithmetic is sketched just after the list below). Whereas this story of 3x compute leading to 3x the revenue implies that the exponent for a putative scaling law of compute vs revenue is extremely close to 1.0, and that it remains flukishly close to that magic number despite the transition from pretraining scaling to RL + inference scaling. I could believe a power-law exponent of 1.0 for some things that are quite mathematical or physical, but not for the extremely messy relationship of compute to total revenue, which depends on details of:

  • the changing relationship between compute and intelligence,
  • the utility of more intelligence to people,
  • the market dynamics between competitors,
  • running out of new customers and having to shift to more revenue per customer,
  • the change from a big upfront cost (training compute) to mostly per-customer charges (inference compute).
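
To spell out the arithmetic behind the 20-doublings check mentioned above (the -0.05 slope is the figure quoted here; the revenue exponent is just the implied relationship, not a fitted one):

```python
import math

# Pretraining scaling law: loss ~ C**(-0.05). How many doublings of compute halve the loss?
slope = -0.05
doublings_to_halve_loss = math.log(0.5) / (slope * math.log(2))
print(round(doublings_to_halve_loss, 1))  # 20.0

# By contrast, "3x compute -> 3x revenue" would mean revenue ~ C**1.0:
implied_revenue_exponent = math.log(3) / math.log(3)
print(implied_revenue_exponent)  # 1.0
```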

More likely is something like reverse causation: the growth in revenue is driving the amount of compute they can afford. Or it could be that the prices they need to charge scale with the investment they received to buy compute, so they are charging the minimum they can while keeping revenue growth in line with investment growth.

Overall, I'd say that I believe these are real numbers, but I don't believe the implied model. e.g. I don't believe this trend will continue in the long run, and I don't think that if they had been able to 10x compute in one of those years, the revenue would have also jumped by 10x (unless they are effectively choosing how much revenue to take, trading market growth for revenue, in order to make this graph work and convince investors).

Comparing AI scaling laws to Wright's law is an interesting idea. Wright's law is still a power law rather than logarithmic returns, but it is usefully comparable to both the pretraining and inference scaling behaviours.
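
For readers comparing the two, these are the textbook functional forms I have in mind (symbols chosen here for illustration):

```latex
% Wright's law: unit cost falls as a power of cumulative production N
C(N) = C_1 \, N^{-b}

% Pretraining scaling law: loss falls as a power of training compute C
L(C) = L_0 \, C^{-\alpha}

% Both are power laws; 'logarithmic returns' would instead look like
% \text{capability} \propto \log C
```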
