That's a great question. I'd expect a bit of slowdown this year, though not necessarily much. For example, I think there is a 10x or so still possible for RL before RL training compute reaches the size of pre-training compute, and we know they have enough to 10x again beyond that (since GPT-4.5 was already 10x more), so there are some gains still in the pipe there. And I wouldn't be surprised if METR timelines keep going up in part due to increased inference spend (my points about inference scaling not being that good are about costs exploding, so a cost-insensitive benchmark might not register the problem all that much). There is also room for more AI-research or engineering improvements to these things, and a lump of new compute coming in, which makes it a bit messy.
Overall, I'd say my predictions are more about appreciable slowing in 2027+ rather than 2026.
Interesting ideas! A few quick responses:
Yeah, it isn't just a constant-factor slowdown, and it's fairly hard to describe in detail. Pre-training, RL, and inference scaling each have their own dynamics, and we don't know whether there will be good new scaling ideas that breathe new life into them or create a new axis on which to scale. I'm not trying to say the speed at any future point is half what it would have been, but that if you saw scaling as a big deal, going forward it is a substantially smaller deal (maybe half as big).
That's an interesting way to connect these. I suppose one way to view your model is as making clear that you can't cost-effectively use models on tasks much longer than their 50% horizons, even if you are willing to try multiple times, and that the trend of dramatic price improvements over time isn't enough to help with this. Instead you need the continuation of the METR trend of exponentially growing horizons. Moreover, you give a nice intuitive explanation of why that is.
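The retry point can be sketched numerically under the constant hazard rate model: if success probability halves with each doubling of task length, the expected number of independent attempts explodes exponentially in the length multiple. A minimal sketch, with the 50% horizon normalised to 1 (all figures illustrative):

```python
def p_success(task_len, t50):
    # Constant-hazard model: S(t) = 0.5 ** (t / t50), so success
    # probability halves each time the task length doubles.
    return 0.5 ** (task_len / t50)

def expected_attempts(task_len, t50):
    # Expected number of independent runs until one succeeds
    # (geometric distribution with parameter p).
    return 1 / p_success(task_len, t50)

# A task 10x the 50% horizon needs ~1000 attempts on average, so
# retrying cannot cheaply substitute for longer horizons.
for m in [1, 2, 5, 10]:
    print(m, expected_attempts(m, 1))  # 2.0, 4.0, 32.0, 1024.0
```

Since expected cost scales with attempts times task length, the cost of brute-forcing a task 10x the horizon is ~10,000x that of a horizon-length task.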
One thing to watch out for is Gus Hamilton's recent study suggesting that there isn't a constant hazard rate. I share my thoughts on it here, but my basic conclusion is that he is probably right. In particular, he fits a functional form for how the models' success probability declines with task length, which you could add to your model (it is basically 1 minus the CDF of a Weibull distribution with shape k = 0.6). The tail of this survival function is heavier than an exponential's, making the 'just run it heaps of times' approach slightly more tenable. It may mean that it is the cost of human verification that gets you, rather than it being untenable even on AI costs alone.
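A minimal sketch of how the two survival curves compare (the shape k = 0.6 is from the study; the scale parameter is just calibrated so both curves hit 50% at the same horizon, and the horizon is normalised to 1):

```python
import math

K = 0.6  # Weibull shape from the decreasing-hazard fit; k = 1
         # recovers the constant-hazard (exponential) model.

def weibull_survival(t, t50, k=K):
    # S(t) = exp(-(t / lam)**k), with lam chosen so S(t50) = 0.5.
    lam = t50 / math.log(2) ** (1 / k)
    return math.exp(-(t / lam) ** k)

def exponential_survival(t, t50):
    # Constant-hazard model: S(t) = 0.5 ** (t / t50).
    return 0.5 ** (t / t50)

# At 10x the 50% horizon the Weibull tail is far heavier,
# which is what makes retrying somewhat more tenable:
print(weibull_survival(10, 1))      # ~0.063
print(exponential_survival(10, 1))  # ~0.001
```

So under the Weibull fit, a task 10x the horizon succeeds ~6% of the time rather than ~0.1%, cutting the expected number of retries by a factor of ~60.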
And one of the authors of the METR timelines paper has his own helpful critique/clarifications of their results.
Good points. I'm basically taking METR's results at face value and showing that while people often implicitly treat costs (or cost per 'hour') as constant, especially when extrapolating them, these costs appear to be growing substantially.
Re the quality / generalisability of the METR timelines, there is quite a powerful critique of it by Nathan Witkin. I wouldn't go as far as he does, but he's got some solid points.
Thanks Basil! That's an interesting idea. The constant hazard rate model just compares two uses of the same model over different task lengths, so if you use it to work out the 99% time horizon, a task of that length should cost about 1/70th as much ($1.43). Over time, I'd expect these 99% tasks to rise in cost in roughly the same way as the 50%-horizon ones (since both horizons are increasing in length in proportion). But estimating how that will play out in practice is especially dicey, as there is too little data.
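For what it's worth, the 1/70 figure falls straight out of the constant-hazard survival function, assuming cost is proportional to task length (a quick check, horizon normalised to 1):

```python
import math

def horizon(p, t50):
    # Constant-hazard model: S(t) = 0.5 ** (t / t50).
    # Solve 0.5 ** (t / t50) = p for the task length t at which
    # the model succeeds with probability p.
    return t50 * math.log(p) / math.log(0.5)

t99 = horizon(0.99, 1.0)
print(t99)  # ~0.0145, i.e. roughly 1/70th of the 50% horizon
# If cost scales with task length, a 99%-horizon task costs
# roughly 1/70th of a 50%-horizon task.
```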
Also, note that Gus Hamilton has written a great essay that takes the survival-analysis angle I used in my constant hazard rates piece and extends it to show, pretty convincingly, that the hazard rates are actually decreasing. I explain it in more detail here. One upshot is that it gives a different function for estimating the 99% horizon lengths. He also shows that these are poorly constrained by the data: his model disagrees with METR's by a factor of 20 on how long they are, with even more disagreement for shorter lengths.
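To illustrate how far the two functional forms can diverge on the 99% horizon (using the k = 0.6 Weibull shape; the exact factor depends on the fitted parameters, so this is only indicative):

```python
import math

K = 0.6  # assumed Weibull shape from the decreasing-hazard fit

def t99_exponential(t50):
    # Constant hazard: S(t) = 0.5 ** (t / t50); solve S(t) = 0.99.
    return t50 * math.log(0.99) / math.log(0.5)

def t99_weibull(t50, k=K):
    # S(t) = exp(-(t / lam)**k), with lam set so S(t50) = 0.5;
    # solve S(t) = 0.99.
    lam = t50 / math.log(2) ** (1 / k)
    return lam * (-math.log(0.99)) ** (1 / k)

print(t99_exponential(1.0))  # ~0.0145  (about t50 / 69)
print(t99_weibull(1.0))      # ~0.00086 (about t50 / 1160)
# The two forms disagree by a factor of ~17 on the 99% horizon,
# in the same ballpark as the factor-of-20 disagreement above.
```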
And I'll add that RL training (and to a lesser degree inference scaling) is limited to a subset of capabilities: those with verifiable rewards that the AI industry cares enough about to run lots of training on. So progress on benchmarks has become less representative of how good models are at things that aren't being benchmarked than it was in the non-reasoning-model era. I think the problems of the new era are somewhat bigger than the effects that show up in benchmarks.