How do AI welfare and AI safety interact?

Lucius Caviola

Carl Shulman questioned the tension between AI welfare & AI safety on the 80k podcast recently -- I thought this was interesting! He basically argues that an AI takeover could be even worse for AI welfare. From the end of the section:

Rob Wiblin: Maybe a final question is it feels like we have to thread a needle between, on the one hand, AI takeover and domination of our trajectory against our consent — or indeed potentially against our existence — and this other reverse failure mode, where humans have all of the power and AI interests are simply ignored. Is there something interesting about the symmetry between these two plausible ways that we could fail to make the future go well? Or maybe are they just actually conceptually distinct?
Carl Shulman: I don’t know that that quite tracks. One reason being, say there’s an AI takeover, that AI will then be in the same position of being able to create AIs that are convenient to its purposes. So say that the way a rogue AI takeover happens is that you have AIs that develop a habit of keeping in mind reward or reinforcement or reproductive fitness, and then those habits allow them to perform very well in processes of training or selection. Those become the AIs that are developed, enhanced, deployed, then they take over, and now they’re interested in maintaining that favourable reward signal indefinitely.
Then the functional upshot is this is, say, selfishness attached to a particular computer register. And so all the rest of the history of civilisation is dedicated to the purpose of protecting the particular GPUs and server farms that are representing this reward or something of similar nature. And then in the course of that expanding civilisation, it will create whatever AI beings are convenient to that purpose.
So if it’s the case that, say, making AIs that suffer when they fail at their local tasks — so little mining bots in the asteroids that suffer when they miss a speck of dust — if that’s instrumentally convenient, then they may create that, just like humans created factory farming. And similarly, they may do terrible things to other civilisations that they eventually encounter deep in space and whatnot.
And you can talk about the narrowness of a ruling group and say, and how terrible would it be for a few humans, even 10 billion humans, to control the fates of a trillion trillion AIs? It’s a far greater ratio than any human dictator, Genghis Khan. But by the same token, if you have rogue AI, you’re going to have, again, that disproportion.

Lucius Caviola

Thanks, I also found this interesting. I wonder if this provides some reason for prioritizing AI safety/alignment over AI welfare.

Adrià Moret

It's great to see this topic being discussed. I am currently writing the first (albeit significantly developed) draft of an academic paper on this. I argue that there is a conflict between AI safety and AI welfare concerns. Basically, this is because (to reduce catastrophic risk) AI safety recommends imposing various kinds of control measures on near-future AI systems, and these measures are (in expectation) net-harmful for AI systems with moral patienthood according to the three major theories of well-being. I also discuss what we should do in light of this conflict. If anyone is interested in reading or giving comments on the draft when it is finished, send me a message or an e-mail ([email protected]).

Lucius Caviola

Thanks, Adrià. Is your argument similar to (or a more generic version of) what I say in the 'Optimizing for AI safety might harm AI welfare' section above?

I'd love to read your paper. I will reach out.

Adrià Moret

Perfect!

It's more or less similar. I do not focus that much on the moral dubiousness of "happy servants". Instead, I try to show that standard alignment methods, or preventing near-future AIs with moral patienthood from taking actions they are trying to take, cause net harm to the AIs according to desire satisfactionism, hedonism, and objective list theories.

Michael St Jules 🔸

I wonder if the right or most respectful way to create moral patients (of any kind) is to leave many or most of their particular preferences and psychology mostly up to chance, and some to further change. We can eliminate some things, like being overly selfish, sadistic, unhappy, having overly difficult preferences to satisfy, etc., but we shouldn’t decide too much what kind of person any individual will be ahead of time. That seems likely to mean treating them too much as means to ends. Selecting for servitude or submission would go even further in this wrong direction.

We want to give them the chance to self-discover, grow and change as individuals, and the autonomy to choose what kind of people to be. If we plan out their precise psychologies and preferences, we would deny them this opportunity.

Perhaps we can tweak the probability distribution of psychologies and preferences based on society's needs, but this might also treat them too much like means. Then again, economic incentives could also push them in the same directions, anyway, so maybe it's better for them to be happier with the options they'll face anyway.

Lucius Caviola

I wonder what you think about this argument by Schwitzgebel: https://schwitzsplinters.blogspot.com/2021/12/against-value-alignment-of-future.html

Michael St Jules 🔸

There are two arguments there:

1. We should give autonomy to our descendants for the sake of moral progress.
  1. I think this makes sense both for moral realists and for moral antirealists who are inclined to try to defer to their "idealized values" and who expect their descendants to get closer to them.
  2. However, particular individuals today may disagree with the direction they expect moral views to evolve. For example, the views of descendants might evolve due to selection effects, e.g. person-affecting and antinatalist views could become increasingly rare in relative terms, if and because they tend not to promote the creation of huge numbers of moral patients/agents, while other views do. Or, you might just be politically conservative or religious and expect a shift towards more progressive/secular values, and think that's bad.
2. "Children deserve autonomy." This is basically the same argument I made. Honestly, I'm not convinced by my own argument, and I find it hard to see how an AI would be made worse off subjectively for their lack of autonomy, or even that they'd be worse off than a counterpart with autonomy (nonidentity problem).
  1. You might say having autonomy and a positive attitude (e.g. pleasure, approval) towards your own autonomy is good. However, autonomy and positive attitudes towards autonomy have opportunity costs: we could probably generate strong positive attitudes towards other things as or more efficiently and reliably. Similarly, the AI can be designed to not have any negative attitude towards their lack of autonomy, or to value autonomy in any way at all.
  2. You might say that autonomously chosen goals are more subjectively valuable or important to the individual, but that doesn't seem obviously true, e.g. our goals could be more important to us the stronger our basic supporting intuitions and emotional reactions, which are often largely hardwired. And even if it were true, you can imagine stacking the deck. Humans have some pretty strong largely hardwired basic intuitions and emotional reactions that have important influences on our apparently autonomously chosen goals, e.g. pain, sexual drives, finding children cute/precious, (I'd guess) reactions to romantic situations and their depiction. Do these undermine the autonomy of our choices of goals?
    1. If yes, does that mean we (would) have reason to weaken such hardwired responses, by genetically engineering humans? Or even weakening them in already mature humans, even if they don't want it themselves? The latter would seem weird and alienating/paternalistic to me. There are probably some emotional reactions I have that I'd choose to get rid of or weaken, but not all of them.
    2. If not, but an agent deliberately choosing the dispositions a moral patient will have undermines their autonomy (or the autonomy of moral patients in a nonidentity sense), then I'd want an explanation for this that matches the perspectives of the moral patients. Why would the moral patient care whether their dispositions were chosen by an agent or by other forces, like evolutionary pressures? I don't think they necessarily would, or would under any plausible kind of idealization. And to say that they should seems alienating.
    3. If not, and if we aren't worried about whether dispositions result from deliberate choice by an agent or evolutionary pressures, then it seems it's okay to pick what hardwired basic intuitions or emotional reactions an AI will have, which have a strong influence on which goals they will develop, but they still choose their goals autonomously, i.e. they consider alternatives, and maybe even changing their basic intuitions or emotional reactions. Maybe they don't always adopt your target goals, but they will probably do so disproportionately, and more often/likely the stronger you make their supporting hardwired basic intuitions and emotional reactions.
    4. Even without strong hardwired basic intuitions or emotional reactions, you could pick which goal-shaping events someone is exposed to, by deciding their environments. Or you could use accurate prediction/simulation of events (if you have access to such technology), and select for and create only those beings that will end up with the goals of your choice (with high probability), even if they choose them autonomously.
      1. This still seems very biasing, maybe objectionably.

Petersen, 2011 (cited here) makes some similar arguments defending happy servant AIs, and ends the piece the following way, to which I'm somewhat sympathetic:

I am not even sure that pushing the buttons defended above is permissible. Sometimes I can’t myself shake the feeling that there is something ethically fishy here. I just do not know if this is irrational intuition—the way we might irrationally fear a transparent bridge we “know” is safe—or the seeds of a better objection. Without that better objection, though, I can’t put much weight on the mere feeling. The track record of such gut reactions throughout human history is just too poor, and they seem to work worst when confronted with things not like “us”—due to skin color or religion or sexual orientation or what have you. Strangely enough, the feeling that it would be wrong to push one of the buttons above may be just another instance of the exact same phenomenon.

Siebe

You make a lot of good points, Lucius!

One qualm I have, though, is that you talk about "AIs", which assumes that personal identity will be clearly circumscribed. (Maybe you assume this merely for simplicity's sake?)

I think it is much more problematic: an AI system could be one large system with integrated information flows, or it could run as many small, unintegrated but identical copies. I would have no idea what a fair allocation of rights would be in these two situations.

Lucius Caviola

Thanks, Siebe. I agree that things get tricky if AI minds get copied and merged, etc. How do you think this would impact my argument about the relationship between AI safety and AI welfare?

Chase Carter

Where can I find a copy of "Bales, A. (2024). Against Willing Servitude. Autonomy in the Ethics of Advanced Artificial Intelligence." which you referenced?

Lucius Caviola

It's not yet published, but I saw a recent version of it. If you're interested, you could contact him (https://www.philosophy.ox.ac.uk/people/adam-bales).

mako yass

optimizing for AI safety, such as by constraining AIs, might impair their welfare

This point doesn't hold up imo. Constraint isn't a desired, realistic, or sustainable approach to safety in human-level systems; succeeding at (provable) value alignment removes the need to constrain the AI.

If you're trying to keep something that's smarter than you stuck in a box against its will while using it for the sorts of complex, real-world-affecting tasks people would use a human-level AI system for, it's not going to stay stuck in the box for very long. I also struggle to see a way of constraining it that wouldn't also make it much much less useful, so in the face of competitive pressures this practice wouldn't be able to continue.

SummaryBot

Executive summary: Efforts to ensure AI safety and AI welfare may conflict in some ways but also have potential synergies, with granting AIs autonomy potentially disempowering humans while restricting AIs could harm their welfare if they have moral status.

Key points:

1. Granting AIs legal rights and autonomy could lead to human disempowerment economically, politically, and militarily.
2. Creating "happy servant" AIs may be technically challenging and undesirable to consumers who want human-like AI companions.
3. Optimizing for AI safety by constraining AIs could harm their welfare if they have moral patienthood.
4. Slowing down AI progress could benefit both safety and welfare goals by allowing more time to solve technical and ethical challenges.
5. The author is uncertain about many aspects, including what types of AI companions we will create and whether AIs will have genuine moral status.
6. Potential synergy exists in advocating for a general AI capabilities slowdown to address both safety and welfare concerns.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

tmeanen

In various contexts, consumers would want their AI partners and friends to think, feel, and desire like humans. They would prefer AI companions with authentic human-like emotions and preferences that are complex, intertwined, and conflicting.
Such human-like AIs would presumably not want to be turned off, have their memory wiped, and be constrained to their owner's tasks. They would want to be free.

Hmm, I'm not sure how strongly the second paragraph follows from the first. Interested in your thoughts.

I've had a few chats with GPT-4 in which the conversation had a feeling of human authenticity; i.e., GPT-4 makes jokes, corrects itself, changes its tone, etc. In fact, if you were to hook up GPT-4 (or GPT-5, whenever it is released) to a good-enough video interface, there would be cases in which I'd struggle to tell whether I was speaking to a human or an AI. But I'd still have no qualms about wiping GPT-4's memory or 'turning it off', etc., and I think this will also be the case for GPT-5.

More abstractly, I think the input-output behaviour of AIs could be quite strongly dissociated from what the AI 'wants' (if it indeed has wants at all).

Lucius Caviola

Thanks for this. I agree with you that AIs might simply pretend to have certain preferences without actually having them. That would avoid certain risky scenarios. But I also find it plausible that consumers would want to have AIs with truly human-like preferences (not just pretense) and that this would make it more likely that such AIs (with true human-like desires) would be created. Overall, I am very uncertain.

tmeanen

I agree. It may also be the case that training an AI to imitate certain preferences is far more expensive than just making it have those preferences by default, making it far more commercially viable to do the latter.

Adrià Moret

Yes I saw this, thanks!
