Toby_Ord

And I'll add that RL training (and, to a lesser degree, inference scaling) is limited to a subset of capabilities: those with verifiable rewards that the AI industry cares enough about to run lots of training on. So progress on benchmarks is less representative of how good models are at things that aren't being benchmarked than it was in the non-reasoning-model era, and I think the problems of the new era are somewhat bigger than the effects that show up in benchmarks.

That's a great question. I'd expect a bit of a slowdown this year, though not necessarily much. For example, I think there is roughly a 10x of scaling possible for RL before RL-training compute reaches the size of pre-training compute, and we know the labs have enough compute to 10x again beyond that (since GPT-4.5 already used 10x more), so there are still some gains in the pipe there. And I wouldn't be surprised if METR's measured time horizons keep going up, in part due to increased inference spend (my points about inference scaling not being that good are about costs exploding, so a cost-insensitive benchmark might not register the problem all that much). There is also room for more AI-research or engineering improvements to these things, and a lump of new compute coming online, making it a bit messy.

Overall, I'd say my predictions are more about appreciable slowing in 2027+ rather than 2026.

Good point about the METR curves not being Pareto frontiers.

Interesting ideas! A few quick responses:

  1. The data for the early 'linear' regime for these models actually appears to be even better than you suggest here. They show a roughly straight line (on a log-log plot), but at a slope that is better than 1. Eyeballing it, I think some have a slope of 5 or higher (i.e. increasing returns, with time horizon growing as roughly the 5th power of compute). See my 3rd chart here. If anything, this would strengthen your case for treating that regime separately from the poorly scaling high-compute regime later on.
  2. I'd also suspected that when you apply extra RL to a model (e.g. o3 compared to o1), the new model's curve would dominate the earlier model's. But that doesn't seem to be the case. See the curves in the final chart here, where o1-preview is dominated, but the other OpenAI models' curves all cross each other (each being cheaper for the same horizon at some horizons and more expensive at others).
  3. Even when they do dominate each other neatly like in your fake data, I noticed that the 'sweet spots' and the 'saturation points' can still be getting more expensive, both in terms of $ and in terms of $/hr. I'm not sure what to make of that though!
  4. I think you're on to something with the idea that there is a problematic kind of inference scaling and a fine kind, though I'm not sure if you've quite put your finger on how to distinguish them. I suppose we can definitely talk about the super-linear scaling regime and the sub-linear regime (which meet at what I call the sweet spot), but I'm not sure these are the two types you refer to in qualitative terms near the top.
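On the slope in point 1: on a log-log plot, a slope s means the time horizon grows as compute to the power s. A minimal sketch of estimating that slope by least squares on log-transformed data (the data points are made up to illustrate slope 5, not taken from METR's charts):

```python
import math

# Hypothetical (compute, time-horizon) points, constructed so horizon = compute**5.
points = [(1.0, 1.0), (2.0, 32.0), (4.0, 1024.0)]

# Ordinary least squares on (log compute, log horizon) recovers the slope.
logs = [(math.log(c), math.log(h)) for c, h in points]
n = len(logs)
mx = sum(x for x, _ in logs) / n
my = sum(y for _, y in logs) / n
slope = sum((x - mx) * (y - my) for x, y in logs) / sum((x - mx) ** 2 for x, _ in logs)
print(round(slope, 2))  # → 5.0
```

A slope above 1 in this regime is what "increasing returns" means here: each doubling of compute more than doubles the achievable horizon.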

Yeah, it isn't just like a constant factor slow-down, but is fairly hard to describe in detail. Pre-training, RL, and inference all have their own dynamics, and we don't know if there will be new good scaling ideas that breathe new life into them or create a new thing on which to scale. I'm not trying to say the speed at any future point is half what it would have been, but that you might have seen scaling as a big deal, and going forward it is a substantially smaller deal (maybe half as big a deal).

Thanks for catching that — a lot of symbols in the appendix were lost when converting the post for the forum, so I've edited it to add them back in.

That's an interesting way to connect these. I suppose one way to view your model is as making clear the point that you can't cost-effectively use models on tasks that are much longer than their 50% horizons (even if you are willing to try multiple times), and that the trend of dramatic price improvements over time isn't enough to help with this. Instead you need the continuation of the METR trend of exponentially growing horizons. Moreover, you give a nice intuitive explanation of why that is.
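The retry point can be made quantitative under the constant hazard rate model: success on a task of length T is 2^(−T/h), where h is the 50% horizon, so with independent attempts the expected number of attempts is 2^(T/h) and expected cost grows exponentially in T/h. A minimal sketch (units and the unit cost are illustrative, not from the post):

```python
# Expected cost of the 'just retry until it works' strategy under a
# constant hazard rate. Success probability on a task of length T is
# p = 2 ** (-T / h50); with independent attempts, expected attempts = 1 / p,
# and each attempt costs in proportion to T.
def expected_retry_cost(T, h50, cost_per_unit_time=1.0):
    p = 2 ** (-T / h50)
    return cost_per_unit_time * T / p

h50 = 1.0  # 50% horizon, in arbitrary time units
for mult in (1, 2, 5, 10):
    print(mult, expected_retry_cost(mult, h50))
# → 1 2.0, 2 8.0, 5 160.0, 10 10240.0
```

The exponential blow-up past the horizon is why falling per-token prices (a roughly constant-factor improvement per year) can't rescue tasks much longer than the horizon.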

One thing to watch out for is Gus Hamilton's recent study suggesting that there isn't a constant hazard rate. I share my thoughts on it here, but my basic conclusion is that he is probably right. In particular, he has a functional form estimating how success probability declines with task length. You could add this to your model (it is basically 1 minus the CDF of a Weibull distribution with k = 0.6). This survival function's tail is heavier than an exponential (a stretched exponential), making the 'just run it heaps of times' approach slightly more tenable. It may mean that it is the cost of human verification that gets you, rather than it being untenable even on AI costs alone.
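For concreteness, that survival function can be written out directly. A sketch (the scale parameter here is calibrated so the 50% horizon is 1 hour for illustration; Hamilton's fitted values would be needed for real numbers):

```python
import math

def weibull_survival(t, scale, k=0.6):
    """P(success) on a task of length t: 1 minus the CDF of a Weibull(scale, k).

    With shape k < 1 the hazard rate decreases with t, so the tail is a
    stretched exponential, heavier than the constant-hazard exponential.
    """
    return math.exp(-((t / scale) ** k))

# Calibrate scale so the 50% horizon is t = 1 (illustrative, not fitted):
# exp(-(1/scale)**0.6) = 0.5  =>  scale = 1 / ln(2)**(1/0.6)
scale = 1.0 / math.log(2) ** (1 / 0.6)

print(round(weibull_survival(1.0, scale), 2))  # → 0.5
print(weibull_survival(10.0, scale))           # far above the 2**-10 that a
                                               # constant hazard rate predicts
```

The contrast at t = 10 horizons is the whole point: the constant-hazard model gives about 0.001 success probability, while the decreasing-hazard fit leaves enough probability mass that retrying remains on the table.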

And one of the authors of the METR timelines paper has his own helpful critique/clarifications of their results.

Good points. I'm basically taking METR's results at face value and showing that people are often implicitly treating costs (or cost per 'hour') as constant (especially when extrapolating), even though these costs appear to be growing substantially.

Re the quality / generalisability of the METR timelines, there is quite a powerful critique of them by Nathan Witkin. I wouldn't go as far as he does, but he makes some solid points.

Thanks Basil! That's an interesting idea. The constant hazard rate model just compares two uses of the same model over different task lengths, so if you use it to work out the 99% time horizon, that task should cost about 1/70th as much ($1.43). Over time, I think these 99% tasks should rise in cost in roughly the same way as the 50%-horizon ones (as both are increasing in length in proportion). But estimating how that will change in practice is especially dicey, as there is too little data.
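The 1/70 factor follows directly from the constant-hazard model: success on a task of length t is 2^(−t/h) for 50% horizon h, so the 99% horizon is h · log2(1/0.99) ≈ h/69, and with cost proportional to task length the cost shrinks by the same factor. A quick sketch (the $100 baseline for a 50%-horizon task is the figure implied by the $1.43 above):

```python
import math

h50_cost = 100.0  # implied cost of a 50%-horizon task (assumed from the thread)

# Constant hazard: P(success on task of length t) = 2 ** (-t / h50).
# Solve 2 ** (-t / h50) = 0.99 for t as a fraction of h50:
frac = math.log2(1 / 0.99)  # 99%-horizon task length relative to the 50% horizon
print(round(1 / frac))            # → 69
print(round(h50_cost * frac, 2))  # → 1.45
```

(The code's $1.45 versus the $1.43 in the comment is just the difference between using 1/69 exactly and rounding it to 1/70.)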

Also, note that Gus Hamilton has written a great essay that takes the survival-analysis angle I used in my constant hazard rates piece and extends it to show pretty convincingly that the hazard rates are actually decreasing. I explain it in more detail here. One upshot is that it gives a different function for estimating 99% horizon lengths; he also shows that these are poorly constrained by the data, and his model disagrees with METR's by a factor of 20 on how long they are, with even more disagreement for shorter lengths.
