AI Forecasting Resolution Council (Forecasting infrastructure, part 2)

This post introduces the AI Forecasting Resolution Council, a group of researchers with technical expertise in AI who will allow us to expand the space of effectively forecastable questions. It is the second part in a series of blog posts which motivate and introduce pieces of infrastructure intended to improve our ability to forecast novel and uncertain domains like AI.

The Council is currently in beta, and we're launching early to get feedback from the community and quickly figure out how useful it is.

Background and motivation

A key challenge in (AI) forecasting is to write good questions. This is tricky because we want questions which both capture important uncertainties, and are sufficiently concrete that we can resolve them and award points to forecasters in hindsight.

Here are some example questions within AI that make this especially difficult:

Counterfactual questions

Suppose in 2000 you use “superhuman Othello from self-play” as a benchmark of a certain kind of impressive AI progress, and forecast it to be possible by 2020. It seems you were correct -- very plausibly the AlphaZero architecture should work for this. However, in a strict sense your forecast was wrong -- because no one has actually bothered to build a powerful Othello agent.

So if a calibrated forecaster faces this question in 2000, considerations regarding who will bother to pursue what project “screen off” considerations regarding fundamental drivers of AI progress and their gradients. Yet the latter concern is arguably more interesting.

This problem could be solved if we instead forecasted the question “If someone were to run an experiment using the AI technology available in 2020, given certain resource constraints, would it seem with >95% confidence, that they’d be able to create a superhuman Othello agent that learnt only from self-play?”

Doing so requires a way of evaluating the truth value of that counterfactual, such as by asking a group of experts.
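One way such an expert evaluation could be turned into a resolution is sketched below. This is only an illustration under stated assumptions: the post does not specify an aggregation rule, so the use of the median, the function name `resolve_counterfactual`, and the format of the expert inputs (one probability per expert) are all hypothetical. Only the 95% threshold is taken from the example question above.

```python
from statistics import median


def resolve_counterfactual(expert_probabilities, threshold=0.95):
    """Resolve a counterfactual question from expert probability judgements.

    Hypothetical rule (an assumption, not the Council's procedure): the
    question resolves positively only if the median expert probability
    is strictly greater than `threshold`.
    Returns (resolved_positively, aggregate_probability).
    """
    if not expert_probabilities:
        raise ValueError("need at least one expert judgement")
    for p in expert_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    aggregate = median(expert_probabilities)
    return aggregate > threshold, aggregate


# Three hypothetical experts judge whether 2020 technology would suffice.
resolved, aggregate = resolve_counterfactual([0.97, 0.99, 0.96])
print(resolved, aggregate)
```

The median is used here because it is less sensitive to a single outlying judgement than the mean, but other aggregation rules would be equally defensible.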

Similarity questions

Suppose we try to capture performance by appealing to a particular benchmark. There's a risk that the community will change its focus to another benchmark. We don’t want forecasters to spend their effort thinking about whether this change will occur, as opposed to fundamental questions about the speed of progress (even if we would want to track such sociological facts about which benchmarks were prominent, that should be handled by a different question where it’s clear that this is the intent).

So to avoid this we need a sufficiently formal way of doing things like comparing performance of algorithms across multiple benchmarks (for example, if RL agents are trained on a new version of Dota, can we compare performance to OpenAI Five’s on Dota 2?).

Definition-of-terms questions

This is more straightforward and related to the AI Forecasting Dictionary. For example, how do we sufficiently clearly define what counts as “hard-coded domain knowledge”, and how much reward shaping you can add before the system no longer learns from “first principles”?

Valuation questions

Not every important uncertainty we care about can be turned into a concretely operationalised future event. For example, instead of trying to operationalise how plausible the IDA agenda will seem in 3 years by making a long, detailed specification of the outcome of various experiments, we might just ask “How plausible will IDA seem to this evaluator in 3 years?” and then try to forecast that claim.

Making this work will require carefully choosing the evaluators such that, for example, it is generally easier and less costly to forecast the underlying event than the opinions of the evaluator, and that we trust that the evaluation actually tracks some important, natural, hard-to-define measure.

Prediction-driven evaluation is a deep topic, yet if we could make it work it is potentially very powerful. See e.g. this post for more details.

AI Forecasting Resolution Council

As a step towards solving the above problems, we’re setting up the AI Forecasting Resolution Council, a group of researchers with technical expertise in AI, who are volunteering their judgement to resolve questions like the above.

The services of the council are available to any forecasting project, and all operations for the council will be managed by Parallel Forecast. In case there is more demand for resolutions than can be filled, Parallel will decide which requests to meet.

We think that this Council will create streamlined, standardised procedures for dealing with tricky cases like the above, thereby greatly expanding the space of effectively forecastable questions.

There are still many questions to be figured out regarding incentives, mechanism design, and question operationalisation. We think that by setting up the Resolution Council we are laying some groundwork to begin experimenting in this direction, and to discover best practices and ideas for new, exciting experiments.

The initial members of the council are:

Daniel Filan (CHAI)
Chris Cundy (Stanford)
Gavin Leech (Bristol)
William Saunders (Ought)

We expect to be adding several more members over the coming months.

The database of previous verdicts and upcoming resolution requests can be found here.

How to use the council if you run a forecasting project

If you’re attempting to forecast AI and have a problem that could be solved by querying the expert council at a future date, let us know by filling in this resolution request form.

How to join the council

If you have technical expertise in AI and would be interested in contributing to help expand the space of forecastable questions, let us know using this form.

There is no limit on the number of judges, since we can always randomise who will vote on each distinct verdict.
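As a minimal sketch of what randomising the voters for each verdict could look like, consider the following. The function name `select_panel`, the panel size, the seeding scheme, and the judge identifiers are all assumptions made for illustration; the post does not describe how the Council actually assigns judges.

```python
import random


def select_panel(judges, panel_size, verdict_id, seed=0):
    """Pick a random panel of judges for one verdict.

    Hypothetical scheme: the generator is seeded from `seed` and
    `verdict_id`, so the same verdict always yields the same panel and
    the draw can be reproduced and audited later.
    """
    if panel_size > len(judges):
        raise ValueError("panel_size exceeds the number of available judges")
    rng = random.Random(f"{seed}:{verdict_id}")
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(judges), panel_size)


# Placeholder identifiers, not real Council members.
pool = ["judge_a", "judge_b", "judge_c", "judge_d", "judge_e"]
print(select_panel(pool, panel_size=3, verdict_id="othello-self-play"))
```

Because each verdict draws its own panel, growing the pool of judges does not increase the number of people who must weigh in on any single question.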
