This is a linkpost for https://forecastingresearch.substack.com/p/ai-llm-forecasting-model-forecastbench-benchmark
When will artificial intelligence (AI) match top human forecasters at predicting the future? In a recent podcast episode, Nate Silver predicted 10–15 years. Tyler Cowen disagreed, expecting a 1–2 year timeline. Who’s more likely to be right?
Today, the Forecasting Research Institute is excited to release an update to ForecastBench—our benchmark tracking how well large language models (LLMs) forecast real-world events—with evidence that bears directly on this debate. We’re also opening the benchmark for submissions.
The topline comparison between LLMs and superforecasters seems a bit unfair. You compare a single LLM's forecast against the median of a crowd of superforecasters, but we know a crowd median is typically more accurate than any individual member of the crowd. It would be fairer to compare a single LLM to a single superforecaster, or a crowd of LLMs to a crowd of superforecasters. Do we know whether the best LLM beats the best individual forecaster in your sample, or how the median LLM compares to the median forecaster?
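To make the nitpick concrete, here's a toy simulation (the crowd size, question count, and noise model are illustrative assumptions of mine, not ForecastBench's data or methodology) showing why a crowd median typically scores better on the Brier score than a typical individual forecaster:

```python
# Illustrative simulation (not ForecastBench data): why a crowd median
# usually beats a typical individual forecaster on Brier score.
import numpy as np

rng = np.random.default_rng(0)

n_questions = 500    # hypothetical binary questions
n_forecasters = 30   # hypothetical crowd size

# True probabilities for each question, drawn uniformly (an assumption).
true_p = rng.uniform(0.05, 0.95, size=n_questions)
outcomes = rng.binomial(1, true_p)

# Each forecaster reports the true probability plus independent noise,
# clipped to [0.01, 0.99] (again, a simplifying assumption).
noise = rng.normal(0, 0.15, size=(n_forecasters, n_questions))
forecasts = np.clip(true_p + noise, 0.01, 0.99)

def brier(p, y):
    """Mean Brier score across questions: lower is better."""
    return np.mean((p - y) ** 2)

individual_scores = np.array([brier(f, outcomes) for f in forecasts])
median_forecast = np.median(forecasts, axis=0)

print(f"median individual Brier: {np.median(individual_scores):.4f}")
print(f"best individual Brier:   {individual_scores.min():.4f}")
print(f"crowd-median Brier:      {brier(median_forecast, outcomes):.4f}")
```

In this symmetric-noise setup the median forecast averages out individual errors, so comparing a single LLM against the crowd median stacks the deck against the LLM.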
(Nitpick aside, this is very interesting research; thanks for doing it.)
Wow, with tool use, pretty much every SOTA model from 6 months ago outperforms the public median forecast! I'd be curious to see how GPT-5, Claude Sonnet 4.5, and Claude Opus 4.1 do on this.