Cooperative AI: Three things that confused me as a beginner (and my current understanding)

C Tilli

Cooperative AI: Three things that confused me as a beginner (and my current understanding)

C Tilli

7 min readApr 16, 2024

Comments 15

Sorted by

New & upvoted

Yoshinori

What institutions or theorems guarantee that ASI or AGI will converge toward behaving as the rational agent that game theory assumes? According to the orthogonality thesis, any level of intelligence can pursue any arbitrary goal, so isn't there a possibility that it could make choices that are not rational? This also depends on whether the game is one of complete information or incomplete information.

C Tilli

I don't think there are any such guarantees? I think the point is not that ASI will definitely be rational, it is that ASI being rational and individually aligned is not a guarantee for good outcomes?

Liseli

This is am amazing post, especially for newbies to Cooperative AI like me. Thanks for helping me grasp what this field entails as well as how to proceed with spreading awareness and applying it to safety research in my region.

Lukas_Gloor

No need to reply to my musings below, but this post prompted me to think about what different distinctions I see under “making things go well with powerful AI systems in a messy world.”

That said, my current favourite explanation of what cooperative AI is is that while AI alignment deals with the question of how to make one powerful AI system behave in a way that is aligned with (good) human values, cooperative AI is about making things go well with powerful AI systems in a messy world where there might be many different AI systems, lots of different humans and human groups and different sets of (sometimes contradictory) values.

First of all, I like this framing! Since quite a lot of factors feed into making things go well in such a messy world, I also like highlighting “cooperative intelligence” as a subset of factors you maybe want to zoom in on with the specific research direction of Cooperative AI.

Another recurring framing is that cooperative AI is about improving the cooperative intelligence of advanced AI, which leads to the question of what cooperative intelligence is. Here also there are many different versions in circulation, but the following one is the one I find most useful so far:

Cooperative intelligence is an agent's ability to achieve their goals in ways that also promote social welfare, in a wide range of environments and with a wide range of other agents.

As you point out, a lot of what goes under “cooperative intelligence” sounds dual-use. For differential development to have a positive impact, we of course want to select aspects of it that robustly reduce risks of conflict (and escalation thereof). CLR’s research agenda lists rational crisis bargaining and surrogate goals/safe pareto improvements. Those seem like promising candidates to me! I wonder at what level to best to intervene with a goal of installing these skills and highlighting these strategies. Would it make sense to put together a “peaceful bargaining curriculum” for deliberate practice/training? (If so, should we add assumptions like availability of safe commitment devices to any of the training episodes?) Is it enough to just describe the strategies in a “bargaining manual?” Do they also intersect with an AI's “values” and therefore have to be considered early on in training (e.g., when it comes to surrogate goals/safe pareto improvements)? (I feel very uncertain about these questions.)

I can think of more traits that can fit into, “What specific traits would I want to see in AIs, assuming they don’t all share the same values/goals?,” but many of the things I’m thinking of are “AI psychologies”/“AI character traits.” They arguably lie closer to “values” than (pure) “capabilities/intelligence,” so I’m not sure to what degree they aren’t already covered by alignment research. (But maybe Cooperative AI could be a call for alignment research to pay special attention to desiderata that matter in messy multi-agent scenarios.)

To elaborate on the connection to values, I think of “agent psychologies” as something that is in between (or “has components of both”) capabilities and values. On one side, there are “pure capabilities,” such as the ability to guess what other agents want, what they’re thinking, what their constraints are. Then, there are “pure values,” such as caring terminally about human well-being and/or the well-being (or goal achievement) of other AI agents. Somewhere in between, there are agent psychologies/character traits that arose because they were adaptive (in people it was during evolution, in AIs it would be during training) for a specific niche. These are “capabilities” in the sense that they allow the agent to excel at some skills beneficial in its niche. For instance, consider the cluster of skills around “being good at building trust” (in an environment composed of specific other agents). It’s a capability of sorts, but it’s also something that’s embodied, and it comes with tradeoffs. For comparison, in role-playing games, you often have only a limited number of character points to allocate to different character dimensions. Likewise, the AI that’s best-optimized for building trust probably cannot also be the one best at lying. (We can also speculate about training with interpretability tools and whether it has an effect on an agent's honesty or propensity to self-deceive, etc.)

To give some example character traits that would contribute towards peaceful outcomes in messy multi-agent settings:

(I’m mostly thinking about human examples, but for many of these, I don’t see why they wouldn’t also be helpful in AIs as well with “AI versions” of these traits.)

Traits that predispose agents to steer away from unnecessary conflicts/escalation:

Having an aversion to violence, suffering, other “typical costs of conflict.”
‘Liking’ to see others succeed alongside you (without necessarily caring directly about their goal achievement).
a general inclination to be friendly/welcoming/cosmopolitan. Lack of spiteful or (needlessly) belligerent instincts.

Agents with these traits will have a comparatively stronger interest in re-framing real-world situations with PD-characteristics into different, more positive-sum terms.

Traits around “being a good coalition partner” or “being good at building peaceful coalitions” (these have considerable overlap with the bullet points above):

Integrity, solid communication, honesty, charitable, not naive (i.e., is aware of deceptive or reckless agent phenotypes, is willing to dish out altruistic punishment if necessary), self-aware/low propensity to self-deceive, able to accurately see other’s perspective, etc.

“Good social intuitions” about other agents in one’s environment:

In humans, there are also intuition-based skills like “being good at noticing when someone is lying” or “being good at noticing when someone is trustworthy.” Maybe there could be AI equivalents of these skills. That said, presumably AIs would learn these skills if they’re being trained in multi-agent environments that also contain deceptive and reckless AIs, which opens up the question: Is it a good idea to introduce such potentially dangerous agents solely for training purposes? (The answer might well be yes, but it obviously depends on the ways this can backfire.)

Lastly, there might be trust/cooperation-relevant procedures or technological interventions that become possible with future AIs, but cannot be done with humans:

Inspecting source codes.
Putting AIs into sandbox settings to see/test what they would do in specific scenarios.
Interpretability, provided it makes sufficient advances. (In theory, neuroscience could make similar advances, but my guess is that mind-reading technology will arrive earlier in ML, if it arrives at all.)
…

To sum up, here are a couple of questions I'd focus on if I were working in this area:

To what degree (if any) does Cooperative AI want to focus on things that we can think of as “AI character traits?” If this should be a focus, how much conceptual overlap is there with alignment work in theory, and how much actual overlap is there with alignment work in practice as others are doing it at the moment?
For things that go under the heading of “learnable skills related to cooperative intelligence,” how much of it can we be confident is more likely good than bad? And what’s the best way to teach these skills to AI systems (or make them salient)?
How good or bad would it be if AI training regimes are the way they are with current LLMs (solo-competition, AI is scored by human evaluators) vs whether training is multi-agent or “league-based” (AIs competing with close copies, training more analogous to human evolution). If AI developers do go into multi-agent training despite its risks (such as the possibility for spiteful instincts to evolve), what are important things to get right?
Does it make sense to deliberately think about features of bargaining among AIs that will be different from bargaining among humans, and zoom in on studying those (or practicing with those)?

A lot of the things I pointed out are probably outside the scope of "Cooperative AI" the way you think about it, but I wasn't sure where to draw the boundary, and I thought it could be helpful to collect my thoughts about this entire cluster of things in once place/comment.

SebastianSchmidt

I don't have anything intelligent to add to this, but I just wanted to say that I found the notion of AI psychologies and character traits fascinating, and I hope to ponder this further.

SebastianSchmidt

Thanks so much for this blog post. As you know, I've been attempting to understand Cooperative AI a bit better over the past weeks.
More concretely, I found it helpful as a conceptual exploration of what Cooperative AI is. Including how it relates to other adjacent (sub)fields - including the diagram! I also appreciated you flagging the potential dual-use aspect of cooperative intelligence - especially given the fact that you're working in this field and therefore be prone to wishful thinking.

That said, I would've appreciated if you covered a bit more about:
- Why Cooperative AI is important. I personally think Cooperative AI (at least as I currently understand it) is undervalued on the margin and that we need more focus on potential multi-agent scenarios and complex human interactions.
- What people in the field of Cooperative AI are actually doing - including how they navigate the dual-use considerations.

Matrice Jacobine🔸🏳️‍⚧️

10mo

Do you think work on AI welfare can count as part of Cooperative AI (i.e. as fostering cooperation between biological minds and digital minds)?

C Tilli

10mo

I guess it depends? I wouldn't think "AI welfare" as a whole would fall under cooperative AI, I think there's probably a lot of work there which is not about cooperative AI (and I don't think that is what you are asking but just to be really clear).

But cooperation between biological and digital minds (more or less cooperation between humans and AI?) seems clearly to fall under cooperative AI, with the caveat that the single-human-to-single-AI is a bit of a special case which is less central to cooperative AI.

It's a while since I wrote this now, and today I would probably put more emphasis on that cooperative AI is typically focused on mixed-motive settings where the entities involved are not fully aligned in their objectives but neither fully opposed/adversarial. So if you have mixed-motive settings involving both biological and digital minds, and especially if there are many of them, figuring out how to get good outcomes (rather than cooperation failures) in these settings seems like it would be a question for cooperative AI research.

Will Aldred

Thanks, I found this post helpful, especially the diagram.

What (if any) is the overlap of cooperative AI […] and AI safety?

One thing I’ve thought about a little is the possiblility of there being a tension wherein making AIs more cooperative in certain ways might raise the chance that advanced collusion between AIs breaks an alignment scheme that would otherwise work.^[1]

^{^}
I’ve not written anything up on this and likely never will; I figure here is as good a place as any to leave a quick comment pointing to the potential problem, appreciating that it’s but a small piece in the overall landscape and probably not the problem of highest priority.

C Tilli

Thanks - yes I agree, and study of collusion is often included into the scope of cooperative AI (e.g. methods for detecting and preventing collusion between AI models is among the priority areas of our current grant call at Cooperative AI Foundation).

Will Aldred

Oh, interesting, thanks for the link—I didn’t realize this was already an area of research. (I brought up my collusion idea with a couple of CLR researchers before and it seemed new to them, which I guess made me think that the idea wasn’t already being discussed.)

SummaryBot

Executive summary: The author shares their understanding of cooperative AI as an emerging field focused on making things go well in a world with multiple powerful AI systems and diverse human values, distinct from but related to AI alignment.

Key points:

Cooperative AI lacks a single agreed-upon definition, but broadly aims to promote cooperation and social welfare among multiple AI systems and humans with diverse values.
Cooperative intelligence is an agent's ability to achieve goals in socially beneficial ways across varied environments and interactions, and is relevant for addressing social dilemmas between AI systems.
The capabilities involved in cooperative intelligence (understanding, communication, commitment, norms) are dual-use and require caution.
Cooperative AI overlaps with AI safety on multipolar scenarios and catastrophic risks, and with beneficial AI on using AI to foster large-scale human cooperation.
Key uncertainties include the boundaries and overlaps between cooperative AI, AI ethics, and AI safety.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Pivocajs

What (if any) is the overlap of cooperative AI, AI ethics, and AI safety? Perhaps preventing catastrophic harm that is somehow tied to failures of fairness or inclusion?

I imagine that failures as Moloch / runaway capitalism / you get what you can measure would qualify. (Or more precisely, harms caused by these would include things that AI Ethics is concerned about, in a way that Cooperative AI / AI Safety also tries to prevent.)

Shaun

Thank you for this post. The short description of AI Ethics is interesting. I spent time thinking about this issue when researching private law and AI - I ended up effectively stumbling into a definition centred around safe design on one hand and fair output on the other, but that did not feel quite right. I prefer your broader takeaway of fairness and inclusivity.

I really like the diagram! I found myself wondering where we would fit AI Law / AI Policy into that model. I think it is a very useful tool as an explainer.

C Tilli

Thank you Shaun!

I found myself wondering where we would fit AI Law / AI Policy into that model.

I would think policy work might be spread out over the landscape? As an example, if we think of policy work aiming to establishing the use of certain evaluations of systems, such evaluations could target different kinds of risk/qualities that would map to different parts of the diagram?

Comments

More from the author

122

Can my self-worth compare to my instrumental value?

C Tilli·5y ago·3m read

109

Why scientific research is less effective in producing value than it could be: a mapping

C Tilli·5y ago·22m read

Improving science: Influencing the direction of research and the choice of research questions

C Tilli·4y ago·18m read

Curated and popular this week

Hard-to-reverse decisions destroy option value

Stefan_Schubert·9y ago·Curated 12h ago·14m read

This post is co-authored with Ben Garfinkel. It is cross-posted from the CEA blog. A PDF version can be found here. Summary: Some strategic decisions available to the effective altruism m...

If you're agentic, work in biosecurity

sharmaayushmaan🔸·5d ago·7m read

Disclaimer: Although I work on the Groups Team at CEA, I’m writing this in a personal capacity, and this post does not constitute an endorsement by CEA. Agency - the realisation that you really can just do things. TL;DR Biosecurity needs people (of any background) who are agentic and have a high execution velocity and track record....

Introducing Impact List: a ranking of philanthropists by expected lives saved

Elliot Olds·1d ago·6m read

TL;DR: I'm releasing a website that ranks philanthropists according to EA principles and research, and allows users to re-rank the list using their own assumptions. I'd like feedback and help making it better. I'd especially like ideas for how to make the results more trustworthy. Funding may be available. I recently built Impact List (impactlist.xyz), a site which ranks people by their positive impact via donations. The goal is t...

Recent opportunities to take action

Marginal Victories: career advising and opportunities for U.S. democracy preservation & political work

Annika Burman 🔸·3d ago·2m read

I'm stepping down as Hive's Executive Director, and we're hiring my successor

SofiaBalderson, Hive·3d ago·3m read

Starting an EA group @ SUNY Binghamton

micahzarin·2d ago·1m read

Lukas_Gloor

No need to reply to my musings below, but this post prompted me to think about what different distinctions I see under “making things go well with powerful AI systems in a messy world.”

That said, my current favourite explanation of what cooperative AI is is that while AI alignment deals with the question of how to make one powerful AI system behave in a way that is aligned with (good) human values, cooperative AI is about making things go well with powerful AI systems in a messy world where there might be many different AI systems, lots of different humans and human groups and different sets of (sometimes contradictory) values.

Another recurring framing is that cooperative AI is about improving the cooperative intelligence of advanced AI, which leads to the question of what cooperative intelligence is. Here also there are many different versions in circulation, but the following one is the one I find most useful so far:

Cooperative intelligence is an agent's ability to achieve their goals in ways that also promote social welfare, in a wide range of environments and with a wide range of other agents.

To give some example character traits that would contribute towards peaceful outcomes in messy multi-agent settings:

(I’m mostly thinking about human examples, but for many of these, I don’t see why they wouldn’t also be helpful in AIs as well with “AI versions” of these traits.)

Traits that predispose agents to steer away from unnecessary conflicts/escalation:

Having an aversion to violence, suffering, other “typical costs of conflict.”
‘Liking’ to see others succeed alongside you (without necessarily caring directly about their goal achievement).
a general inclination to be friendly/welcoming/cosmopolitan. Lack of spiteful or (needlessly) belligerent instincts.

Agents with these traits will have a comparatively stronger interest in re-framing real-world situations with PD-characteristics into different, more positive-sum terms.

Traits around “being a good coalition partner” or “being good at building peaceful coalitions” (these have considerable overlap with the bullet points above):

Integrity, solid communication, honesty, charitable, not naive (i.e., is aware of deceptive or reckless agent phenotypes, is willing to dish out altruistic punishment if necessary), self-aware/low propensity to self-deceive, able to accurately see other’s perspective, etc.

“Good social intuitions” about other agents in one’s environment:

In humans, there are also intuition-based skills like “being good at noticing when someone is lying” or “being good at noticing when someone is trustworthy.” Maybe there could be AI equivalents of these skills. That said, presumably AIs would learn these skills if they’re being trained in multi-agent environments that also contain deceptive and reckless AIs, which opens up the question: Is it a good idea to introduce such potentially dangerous agents solely for training purposes? (The answer might well be yes, but it obviously depends on the ways this can backfire.)

Lastly, there might be trust/cooperation-relevant procedures or technological interventions that become possible with future AIs, but cannot be done with humans:

Inspecting source codes.
Putting AIs into sandbox settings to see/test what they would do in specific scenarios.
Interpretability, provided it makes sufficient advances. (In theory, neuroscience could make similar advances, but my guess is that mind-reading technology will arrive earlier in ML, if it arrives at all.)
…

To sum up, here are a couple of questions I'd focus on if I were working in this area:

To what degree (if any) does Cooperative AI want to focus on things that we can think of as “AI character traits?” If this should be a focus, how much conceptual overlap is there with alignment work in theory, and how much actual overlap is there with alignment work in practice as others are doing it at the moment?
For things that go under the heading of “learnable skills related to cooperative intelligence,” how much of it can we be confident is more likely good than bad? And what’s the best way to teach these skills to AI systems (or make them salient)?
How good or bad would it be if AI training regimes are the way they are with current LLMs (solo-competition, AI is scored by human evaluators) vs whether training is multi-agent or “league-based” (AIs competing with close copies, training more analogous to human evolution). If AI developers do go into multi-agent training despite its risks (such as the possibility for spiteful instincts to evolve), what are important things to get right?
Does it make sense to deliberately think about features of bargaining among AIs that will be different from bargaining among humans, and zoom in on studying those (or practicing with those)?

^{^}

I’ve not written anything up on this and likely never will; I figure here is as good a place as any to leave a quick comment pointing to the potential problem, appreciating that it’s but a small piece in the overall landscape and probably not the problem of highest priority.

Cooperative AI: Three things that confused me as a beginner (and my current understanding)

Cooperative AI: Three things that confused me as a beginner (and my current understanding)

Contradicting definitions

Is this really different from alignment?

The dual-use aspect of cooperative intelligence

My current model of how cooperative AI fits into the AI safety landscape