Comparing Forecasting Track Records for AI Benchmarking and Beyond

Tom Liptay

23 min read

Comments 1

SummaryBot

Summary: The post describes a statistical framework, built on question weights and a weighted t-test, for comparing the forecasting track records of AI bots and human forecasters.

Key points:

The AI Forecasting Benchmark Series compares the performance of AI forecasting bots to human forecasters, using a robust and consistent framework to address methodological issues.
The framework uses question weights to mitigate the problem of correlated questions, which can exaggerate a forecaster's apparent performance.
The weighted t-test is used to assess the significance of the results, and is compared to weighted bootstrapping to ensure reliability.
The multiple comparisons problem is avoided by making a single comparison between the top bot median and the Pro median.
The framework is general and can be used consistently to compare the track records of different forecasters.
The approach is expected to reduce statistical noise and provide a more reliable measure of forecasting skill.
The framework can be extended to other parts of Metaculus, such as talent spotting and comparing track records across platforms.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
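
The key points above mention a weighted t-test and weighted bootstrapping but give no formulas. The following is a minimal sketch of how such a comparison might look, not the author's implementation. It assumes each question contributes one score difference (for example, bot score minus Pro score) and one positive weight, that the sum of the weights serves as the effective sample size, and that the bootstrap resamples questions uniformly and then takes a weighted mean. The function names and the example numbers are hypothetical.

```python
import math
import random


def weighted_t_stat(diffs, weights):
    """Return (weighted mean, standard error, t statistic, effective n).

    Assumption: the sum of the weights acts as the effective sample size,
    so four questions with weight 0.25 each count like one question.
    """
    n_eff = sum(weights)
    if n_eff <= 1:
        raise ValueError("sum of weights must exceed 1")
    mean = sum(w * d for w, d in zip(weights, diffs)) / n_eff
    # Weighted sample variance with an (n_eff - 1) denominator.
    var = sum(w * (d - mean) ** 2 for w, d in zip(weights, diffs)) / (n_eff - 1)
    se = math.sqrt(var / n_eff)
    if se == 0:
        raise ValueError("zero variance: t statistic is undefined")
    return mean, se, mean / se, n_eff


def weighted_bootstrap_means(diffs, weights, n_resamples=2000, seed=0):
    """Resample questions with replacement and return the weighted means.

    Assumption: questions are drawn uniformly, and each resample's mean
    is weighted by the question weights of the drawn questions.
    """
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        w_sum = sum(weights[i] for i in idx)
        means.append(sum(weights[i] * diffs[i] for i in idx) / w_sum)
    return means


if __name__ == "__main__":
    # Hypothetical data: the first three questions are treated as correlated
    # and share a total weight of 1; the remaining questions have weight 1.
    diffs = [-4.0, -3.5, -5.0, 1.0, -2.0, 0.5, -1.5]
    weights = [1 / 3, 1 / 3, 1 / 3, 1.0, 1.0, 1.0, 1.0]

    mean, se, t, n_eff = weighted_t_stat(diffs, weights)
    print(f"weighted mean={mean:.3f} se={se:.3f} t={t:.3f} n_eff={n_eff:.2f}")

    boot = weighted_bootstrap_means(diffs, weights)
    share_below_zero = sum(m < 0 for m in boot) / len(boot)
    print(f"share of bootstrap means below zero: {share_below_zero:.3f}")
```

With all weights equal to 1, `weighted_t_stat` reduces to the ordinary one-sample t statistic, which gives a quick sanity check. Comparing the t-test result with the share of bootstrap means below zero is one way to see whether the two methods roughly agree.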
