
As uncertainty grows around how AI development will affect culture and society, it becomes more valuable to compare track records of predictions about technological progress. 

I've recently been working on automating parts of the methodology from Arb's Scoring The Big 3's Predictive Performance report[1], with some promising preliminary results. I hope to automate most of the steps in the original report, making it feasible to analyse many more track records and publish the results.

I am particularly interested in the following questions:

  1. Which track record(s) would you find valuable to have evaluated in a similar way to Asimov, Clarke and Heinlein’s, as in the Arb report?
  2. What would you want to see from an LLM-based evaluation that would give you confidence that the results are meaningful and accurate?

 

  1. ^

    See also the original Cold Takes post explaining why such evaluations are valuable.

Comments (4)

I'd prefer to see you pick 'people who have made AI predictions who are not famous for those predictions' in some random-ish way. I could just say 'Gary Marcus' and be done, but I'd only be saying that because he disagrees with me on AI progress and I think he'd look bad if his track record was examined. 

You're probably not trying to be super scientific, but I definitely wouldn't cite anything to a policymaker that was cherry-picked. I also wouldn't cite your tool if you only found an effect because you mainly compared people who are famous for scaling predictions, like Gary Marcus and Gwern.

Thanks for all the work you guys do! And love the new website.

Great, thanks!

We have two outputs in mind with this project:

1. Reports on the predictions of a specific thinker (e.g. Gwern) or body of work. These would probably be published individually or as interesting comparisons, similar to the futurists track record post in Cold Takes (based on Arb's Big Three research)
2. A dashboard ranking the track records of lots of thinkers

For (2), I agree that cherry picking would be bad, and we'd want it to cover a good range.

For our initial outputs from (1), though, I'm excited about specifically picking thinkers whose track records people would find especially useful to understand (or to have a good-quality assessment of that they can cite). Curious if you have thoughts on specific people who fit the bill for you?

Awesome, thanks Adam, this makes a lot of sense. I'd be excited to see reports on specific thinkers like Gwern and Yuval Noah Harari. I'd be especially excited to look at the track records of institutions, like frontier developers or governments (e.g. the UK Government or its AISI).

I would want to see a big database of the exact wording of the statements, which I could look up, or at least a large random sample of them.
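A reproducible random sample would make such spot-checking auditable: anyone re-running the same seed sees the same statements. A minimal sketch (the records here are placeholders, and the schema is an assumption, not the project's actual format):

```python
import random

def audit_sample(records, k, seed=0):
    """Draw a reproducible random sample of scored statements for manual review."""
    rng = random.Random(seed)
    return rng.sample(records, k=min(k, len(records)))

# Hypothetical scored statements: (source, exact wording, verdict)
records = [
    ("Thinker A, 1964", "Example claim one", "true"),
    ("Thinker A, 1970", "Example claim two", "false"),
    ("Thinker B, 1968", "Example claim three", "ambiguous"),
    ("Thinker B, 1972", "Example claim four", "true"),
]
for source, wording, verdict in audit_sample(records, k=2, seed=1):
    print(f"{source} | {wording} | {verdict}")
```

Publishing the seed alongside the sample lets readers verify the sample wasn't itself cherry-picked.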
