Project ideas: Backup plans & Cooperative AI

Lukas Finnveden

Comments 2

Lukas, thanks for pulling together all these notes. To me, "cooperative AI" stands out and might deserve its own page(s). The term covers remarkably broad and disparate pursuits. In the words of Dafoe et al. (mostly of the Cooperative AI Foundation):

"A first cluster consists of AI–AI cooperation, tackling ever more difficult, rich and realistic settings (see ‘Four elements of cooperative intelligence’)." - this is notably the focus of FOCAL@CMU, who are looking at "game theory appropriate for advanced, autonomous AI agents – with a focus on achieving cooperation".
"A second is AI–human cooperation, for which we will need to advance natural-language understanding, enable machines to learn about people’s preferences, and make machine reasoning more accessible to humans." - big problems but plenty happening here, of course, with RLHF and research on alignment (representation, etc.).
"A third cluster is work on tools for improving (and not harming) human–human cooperation, such as ways of making the algorithms that govern social media better at promoting healthy online communities."

This last one seems neglected, in my view, probably because it is an inherently less straightforward and more interdisciplinary problem to tackle. But it's also arguably the one with the single greatest upside potential. Will MacAskill, in describing “the best possible future”, imagines “technological advances… in the ability to reflect and reason with one another”. Already today, there's a wealth of social psychology research on what creates connection and cooperation; these ideas might be implemented at scale, with the help of AI, to help us understand one another, connect, and achieve things together. In a narrow sense, that might help scientists collaborate. In a bigger sense, it might ultimately reverse societal polarization and help unite humankind, in a way that reduces existential risk and increases upside potential more than anything else we could do.

SummaryBot

Executive summary: This post suggests backup plans if AI systems become misaligned, as well as ideas for making AI systems more cooperative.

Key points:

We could study AI generalization to influence properties like lack of spitefulness, even if not full alignment.
Some properties, like lack of spite, may lead misaligned AIs to cooperate more with humans or other AIs.
We could implement "surrogate goals" in AI systems as harmless placeholders that threats could target instead of original goals.
Negotiation-assist AI could help resolve complex situations with many parties and options.
Acausal decision theory suggests learning too much could be risky; we may want caution before expanding knowledge of distant civilizations.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Mike Albrecht

Possibly assisted by aligned AIs or tool AIs.

Maybe some mild desire for retribution (in a way that discourages bad behavior while still being de-escalatory) could be acceptable, or even good. But we would at least want to avoid extreme forms of spite.

Sufficiently strong versions of this could also drastically reduce motivations to overthrow humans. At least if we’ve done an ok job at promising and demonstrating that we’ll treat digital minds well.

This path also carries a higher risk of near-miss scenarios.

Which I mainly care about because it might let us influence misaligned models. But in principle, it’s also possible that we could get intent-alignment via other means, but that we were still happy to have done this research because it lets us influence other properties of the model. But the path-to-impact there is more complicated, because it requires an explanation for why the people who the AI is aligned to aren’t able or willing to elicit that behavior just by asking/training for it. (Yet are willing to implement the training methodology that indirectly favors that behavior.)

And if we’re specifically looking for ways to affect properties in worlds where alignment fails, then we’re conditioning on being in a world where the simplest “baseline” solutions (such as fine-tuning for good behavior) failed. Accordingly, we should be more pessimistic about simple solutions.

Possibly via modifying a model that is “playing the training game” to better recognise that it’s being evaluated and to notice what the desired behavior is.

Also: If there was some information that you wanted to be part of AI bargaining, but that you didn’t want to be communicated to the humans on the other side, you could potentially delete large parts of the record and only keep certain circumscribed conclusions.

Backup plans for misaligned AI

What properties would we prefer misaligned AIs to have? [Philosophical/conceptual] [Forecasting]

Making misaligned AI have better interactions with other actors

AIs that we may have moral or decision-theoretic reasons to empower

Making misaligned AI positively inclined toward us

Studying generalization & AI personalities to find easily-influenceable properties [ML]

Theoretical reasoning about generalization [ML] [Philosophical/conceptual]

Cooperative AI

Implementing surrogate goals / safe Pareto improvements [ML] [Philosophical/conceptual] [Governance]

AI-assisted negotiation [ML] [Philosophical/conceptual]

Implications of acausal decision theory [Philosophical/conceptual]
