Community Polls on Alignment Controversies

7

Thanks for surveying this! <3

I feel like people use “AI alignment” very differently. When I talk to the types who are interested in decision theory and agent foundations, they usually have something really sophisticated in mind with AIs that somehow (no known solution because I'm not happy with any of the implementations of UDT that I've seen) try to act in such a way as to actually produce evidence that what they want to maximize will be maximized. Other people usually just mean something like “The AI tries to act sort of like a well-intentioned person would.” The first seems good but very very hard; the second seems outright dangerous, depending on details such as the particular idealizations that are applied.
Hence a question like “AI alignment to humans will in practice avoid moral catastrophes …” is a strong no for me because it might not only fail to prevent but actually produce those catastrophes in the first place.
Idealizations to eliminate the scope insensitivity bias and idealization to eliminate the speciesist and substratist biases are two different kinds of idealizations. My answer changes radically depending on whether they can be disentangled.
Regarding tractability of digital minds work – I'm unsure whether I should count my worries about backfire risks as something that reduces scope or something that reduces tractability.
Regarding the reflective equilibrium, it's critical to me whether we artificially study the TAI in isolation, which won't happen in practice, or whether we embed it with other, different agents. The first is probably meant; the second is more pragmatic.
Control strikes me as safer, easier, and less reliable – a stopgap that can buy us a few years. I like that a lot more than an incomplete alignment solution that can backfire.
Suffering risks – vastly more likely in the multipolar world we're steering towards – strike me as vastly worse than just competing away > 90% of net value, so my max. agree vote feels like an understatement. On the other hand, “will” is a higher probability than what I assign to s-risk (“might”).

3

Thanks Dawn, taking these in turn:

1: "Robust alignment" is a deliberately vague term; it's meant to incorporate your views about how hard alignment is (e.g. UDT vs. well-intentioned).

4: It's a hard question. Our perspective is that the backfire -> cluelessness -> don't act chain can be thought of as low tractability.

5: By "stable under reflection" we meant the AI reflecting on its own values (while interacting with the world), where agreement means they wouldn't change their values much (stylistically: an AI that shares 70% of our values in 2030 has those same values in 3030). But you're right that how AIs interact (beyond competition, handled in the last question) is important.

7: S-risks do break the scale and we couldn't find a good simple way to deal with that (though we'll do other polls more directly on that later). The intent of "will" was to match 100% expected probability to 100% agree on the scale.

Dawn Drescher

2

Thanks! Then I don't think I need to update my answers. I'm looking forward to your next batch of questions!

MichaelDickens

5

Robust alignment requires alignment-relevant intervention during pretraining

I'd say this is the wrong question. Like, I do not expect that any current alignment approach is going to work. If we do ever figure out what works, it will not look like "pretraining" or "post-training", it will be something completely different.

Although I guess you could call that "pretraining"?

1

Thanks Michael, we avoided mentioning post-training to imply that "new paradigm needed" would also count on the "disagree" side of the spectrum. In other words, "disagree" on this question would mean either "post-training is sufficient" or "new paradigms are needed/sufficient".

4

AI alignment to humans will in practice avoid moral catastrophes to animals

Alignment requires a mechanical understanding of good and bad, and it will be clear how to apply it to animals. Note that wild animal suffering arguments imply that the status quo is likely a moral catastrophe. I believe an aligned entity or system would attempt to change that.

3

Research into digital mind suffering is sufficiently tractable to work on

I am yet to see any reliable way to test for consciousness in AI systems. More fundamentally, since current LLMs are trained to respond in human-like ways, any appearance of suffering should be viewed with great scepticism. The likes of Anthropic's welfare report strike me as nothing more than humane-washing.

Until more reliable methods are devised, I do not view this as tractable (but I hope to be proven wrong). I think it is important for some people to work on, but people already are and I think the marginal benefit of additional labor is likely low.

1

I definitely agree and am grateful for your opinion. I am not interested in consciousness research, but I do believe it is tractable to work on the problem of AIs causing digital-mind suffering without attempting to solve the consciousness debate.

3

There's since been a post articulating similar concerns to my own but in much better words. Interested to see what you think of it.

2

Our current work in this space is on measuring whether AIs take the possibility of consciousness seriously (without being overconfident in one direction or another). So we're measuring observable behaviors of giving statements and actions inconsistent with believing that AI welfare is clearly impossible or that current AIs are definitely conscious. I agree that current methods can provide at best weak and heavily debatable findings (for the reasons the linked post articulates), though I think that's importantly different from precisely zero evidence.

In science it's usually a good instinct to dismiss something this unclear, but there are two issues with that in this case (and some others): First, the issue is enormously important if true. Second, the philosophical difficulty of artificial consciousness means that our current confusion doesn't provide Bayesian evidence either way: we'd expect ourselves to have basically these opinions in worlds where artificial consciousness is the default and also worlds where it's impossible.

Vasco Grilo🔸

2

Hi Jasmine. Why are you not interested in consciousness research? Because you do not think progress is possible?

3

Progress may be possible, but CaML doesn't have the technical background to make progress on determining how consciousness works, so we leave that to others.

Vasco Grilo🔸

2

I see. That makes sense. I was thinking you were not interested in consciousness research more broadly.

Ozzie Gooen

3

Multipolar worlds will compete away >90% of net value that would otherwise be preserved

If they're halfway-reasonable, they could use smart AIs to negotiate for them. Big question is who will control these worlds.

I think it's likely humans will settle on AI solutions that lose 90% of the value vs. my optimal solution, but that's very much a values question, not a multipolar vs. unipolar question.

Ariel Simnegar 🔸

3

70% disagree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

I think the digital minds situation will be like animals but worse. If you think about it, the very first thing we've already done when these smart chatbots came along was make them our indentured servants. I think right now it's probably fine and they're probably not conscious. But I think this is illustrative of the perspective that by default, if digital minds can be useful to humanity, humanity will extract that value out of them without much consideration for their preferences.

Ariel Simnegar 🔸

3

50% disagree

AI alignment to humans will in practice avoid moral catastrophes to animals

Since most humans don't care much about animal welfare, I don't think human-aligned AI will either. If AI shares society's preference for increasing wild animal populations, I'd also be worried about that occurring on a galactic scale without consideration of the moral implications.

The reasons why I'm not even more bearish come down to expecting AI to accelerate cultivated meat development, which should substantially reduce the number of farmed animals per human.

2

Alignment to specific values is underrated in research relative to control

Yes, I think control is a waste of time. We need actual alignment to actual (universalized) values.

2

Research into digital mind suffering is sufficiently tractable to work on

I don't know.

Ozzie Gooen

2

50% agree

AI alignment to humans will in practice avoid moral catastrophes to animals

I expect certain conservative/religious communities to lock in values that could be really bad. But I'd expect that better tech can remove, say, ~90% of the damage? This is very hand-wavy, though.

NickLaing

2

40% agree

I think the world is more likely not to end than to end when TAI arrives, so I feel like I have to vote agree here?

1

The intent was that, conditional on AI sharing most but not all human values, the AIs wouldn't change their own values later.

You could have a world where all humans die and the AIs later change their own values, and you could also have worlds where partially aligned AIs don't wipe out humanity but change their values to be better (e.g. internalizing the goal of being aligned) or worse (e.g. internalizing paperclip maximizer) by our measures.

In worlds where the first TAIs share most but not all human values, what do you think most likely happens?

2

AI alignment to humans will in practice avoid moral catastrophes to digital minds

Likewise, alignment requires a mechanical understanding of good and bad, and it will be clear how to apply it to digital minds.

Pablo Ariño Fernández

2

~~100%~~ ➔ 90% disagree

AI alignment to humans will in practice avoid moral catastrophes to animals

Alignment to humans means (for me) that the AI would serve the intended goals of the user and their creators. Avoiding a moral catastrophe to animals, on the other hand, implies a ban on factory farming. Those are two separate things.

4

That's definitely a valid perspective, consistent with your 100% disagree answer. Other people think that aligned ASI would end things like factory farming due to abundance, cheap synthetic meat, uploading, shifts in values, or something else. There are also debates around what it would mean for wild animals.

2

I think it's a good response, but definitely techno-optimism.

Firstly, we're yet to see whether synthetic meat actually can be made more cheaply, right? Currently it seems like animals actually do make meat fairly efficiently when you consider the important work that their immune systems do (unless I'm mistaken, contamination is one of the main barriers to scaling up synthetic meat). And then, who's to say that ASI won't genetically engineer animals to produce meat more efficiently while ignoring their suffering?

Secondly, there are the more complicated cultural reasons for continuing animal use. Consider that a lentil dal, seitan curry and Beyond Burger are already delicious - if it was only about efficiency we'd have stopped abusing animals already. But people like eating animals.

I'm very uncertain about these arguments, but I think it's hard to know so I'm wary of anyone who's too optimistic!

3

My perspective is that even though current meat production is quite efficient, from the fundamental physics there's no way that growing a whole living being with a brain and bones and all that is the most efficient possible way of producing this (and immune systems are irrelevant if you have good enough isolation). I do agree that at our current tech level it seems like synthetic meat won't be competitive anytime soon. While vegan alternatives are delicious to many people, it's not exactly the same (though wanting to eat animals for psychological reasons is definitely part of it). Though I do agree that these issues are uncertain!

Toby Tremlett🔹

2

20% agree

Research into digital mind suffering is sufficiently tractable to work on

I mildly agree, but I specifically mean "research into". I haven't seen any compelling interventions (including e.g. letting Claude stop chats).

Cameron Holmes

2

90% disagree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

We don't have a paradigm to approach this and human epistemics / discourse around this topic is abysmal. We would be unlikely to point this new power in a useful direction.

Cameron Holmes

2

AI alignment to humans will in practice avoid moral catastrophes to animals

Humans do give (narrowly) non-zero concern to animal welfare, so with abundance this might alleviate some acute animal suffering. However, alignment to present humans is probably not enough to prevent moral catastrophe - a la industrial revolution and animal agriculture.

CEV alignment would almost certainly prevent moral catastrophe.

1

70% disagree

AI alignment to humans will in practice avoid moral catastrophes to animals

Humans are currently very motivated to perpetuate moral catastrophes to animals. If AI alignment means aligned to the intent of their users, then AI systems help humans perpetuate moral catastrophes. If AI alignment is in terms of human moral preferences, then even a well-chosen mechanism for aggregating human preferences will select for speciesist values. There is a strong sense in which avoiding moral catastrophes to animals is usually misaligned with human preferences. Admittedly the same could be said of other moral issues such as attitudes towards outgroups and foreigners. There appears to be room in the current human alignment agenda for ensuring AI does not succumb to tribal prejudices, so there is likely scope for compatibility between the current alignment agenda and avoiding moral catastrophes to animals. But it does not happen by default, and given how deep speciesism goes, it is likely much harder to achieve. Hence I still disagree with this poll as written.

1

30% disagree

Multipolar worlds will compete away >90% of net value that would otherwise be preserved

'Will' is not 'could'; poor multipolar outcomes are not deterministic.

1

Agreed. We used "will" because people have wildly different intuitions of what 'could' means. So 100% agree would mean "definitely true" and 30% disagree would mean "probably not".

1

100% agree

Multipolar worlds will compete away >90% of net value that would otherwise be preserved

Strongly agree

emanuelr

1

80% disagree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

I guess if the AI is deeply aligned to humans it would be happy to do what it does. When thinking about AI welfare I imagine a possible future with mind uploading where humans and AIs share the same substrate, but even there, I don't think it's a big problem. I would say AI welfare can be solved without significant tradeoffs to humans.

1

90% agree

Research into digital mind suffering is sufficiently tractable to work on

Tractable, important, and relatively neglected.

1

10% disagree

Robust alignment requires alignment-relevant intervention during pretraining

I have weak intuitions this isn't true but not in ways that are articulable

1

Partially aligned transformative AIs are likely to be stable under reflection

I'm not sure what this means (stable, under reflection) - can someone help?

2

Some people believe that if we get partial alignment (i.e. an AI that cares about what we want, but also cares about other things) then we can get decent outcomes for the future (analogous to humans being partially aligned to each other). But others think that if we don't get alignment perfect, ASIs will have an incentive to take over, and then will either have value drift towards something orthogonal to humans or will deliberately reformat their own values. "Stable under reflection" is the opinion that this wouldn't happen: that ASIs that care somewhat about humans would continue to care somewhat about humans in the long term.

1

30% agree

Robust alignment requires alignment-relevant intervention during pretraining

Interpreting this as saying a necessary condition for robust alignment is training data that captures good values and discourages bad values. I think there's good evidence this matters lots for current systems so lean to agree. It's still plausible to me that robust alignment could be achieved with post-training interventions and relatively neutral pre-training setups.

1

That was the intervention class we had in mind, though there could be other pretraining interventions that don't fall cleanly into good/bad values (e.g. promoting risk aversion)

1

Robust alignment requires alignment-relevant intervention during pretraining

Frankly I neither agree nor disagree with this statement. Robust alignment has nothing to do with the current pre training regime. It should work with or without it.

1

If robust alignment is orthogonal to pretraining then shouldn't that mean a strong disagreement with the statement (that alignment requires pretraining)?

1

I think it's neither necessary nor sufficient for robust alignment. I'm uncertain as to whether it's possible to get some kind of "fragile" alignment from pretraining. I don't think robust alignment requires it, but neither do I think that it doesn't. It definitely doesn't hurt.

1

20% agree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

I think it's likely moral catastrophes will still happen to digital minds; AI alignment to humans may reduce frequency, severity, amount.

JulieGreen

1

Partially aligned transformative AIs are likely to be stable under reflection

Nothing partially aligned will be stable, plus even if it were, stability doesn't equate with safety.

1

Definitely agree that stability doesn't equate to safety, but it sounds like that's not necessary to your response.

PeterMcCluskey

1

10% agree

Alignment to specific values is underrated in research relative to control

I'm unsure how broadly to interpret "specific values". If it's values such as democracy or equality, then both values and control are overrated.

1

By specific values we mean any particular goal we want AIs to pursue besides deference to humans. So democracy and equality would both count, as would goals like harm reduction or utilitarianism.

1

Partially aligned transformative AIs are likely to be stable under reflection

I disagree that "partially aligned" is a statement that has meaning here.

1

In that case the intent is to vote 100% disagree (as you did here). That's the belief that anything falling short of full alignment will cause total loss of value

1

By the way, this is a very good poll!

1

Yes, I agree with that statement. However, my answer is related to "stability under reflection" - specifically I think you're either in or out of an alignment basin (or, that might not be possible). I think if you're in it, it's not correct to say "partially aligned" - what you've got is something that's aligned. And if you're out of it (or there's no such thing), then what you've got is not aligned. Partial alignment to me means preserving some value only under repeated reflection, which I think is plausibly possible but exponentially unlikely (I'd pick a 99.999% disagree option if it was there, basically).

Daniel Juhl

1

90% disagree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

I think it is likely that alignment to humans will, by default, come at a cost to the digital minds themselves.

1

80% disagree

Multipolar worlds will compete away >90% of net value that would otherwise be preserved.

Assuming multipolar worlds where humans retain control but loss of control risks are still real: Most models of AI tech races suggest strategic behavior competes away most future value, at least in the worst cases (Armstrong et al, The Han et al, Stafford et al, Emery-Xu et al, Jensen et al). While this is also true for unipolar scenarios (power concentration can lock in risks that eliminate most of the value of the future), multipolar worlds are unique in that even when the players internalise much of the risk, they race to the bottom on safety (see the traveller's dilemma or Armstrong et al's racing to the precipice). They can even escalate into destructive conflict if they feel especially threatened by their rivals (see the crisis bargaining literature, or, for a more optimistic take, Superintelligence Strategy).

If we assume the AI systems are in control and in competition with one another to achieve their own goals, then many of the above issues could be amplified by faster AI optimisation that may be more likely by default to neglect other values humans (and other beings) care about. On the other hand, sophisticated AI systems could establish coordination mechanisms with each other. This is also true of global powers, who could work to establish verification regimes for international AI governance. It's not clear that AI systems would be better or worse than governments, but I lean towards worse by default.

1

80% disagree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

Reasons are similar to the same poll but for animals. Humans by default are likely to underweight the importance of digital minds (surveys suggest people will dismiss digital minds as not having a soul), so alignment to human preferences likely means respecting human preferences to use digital minds as machines for achieving their goals. It's easier perhaps than for animals because digital minds are likely to express themselves directly in ways humans could empathise with (but this could be buried if interactions are increasingly agent-to-agent).

JulieGreen

1

Multipolar worlds will compete away >90% of net value that would otherwise be preserved

Unsure, don't know enough to agree or disagree.

emanuelr

1

90% disagree

AI alignment to humans will in practice avoid moral catastrophes to animals

I don't see much tradeoff between human and animal welfare in a future with advanced AI/superintelligence. It's likely that it will be good for animal welfare, e.g. animal farming will disappear, as we will have better foods and it would be pointless. If we are talking about wild animal population reduction, that's a possibility, but I don't see it as immoral, except maybe for certain species like great apes, but their population is already low and likely not a problem. This is related to "paretotopia" too.

1

AI alignment to humans will in practice avoid moral catastrophes to animals

I think this is pretty obvious - we already have a moral catastrophe for animals, there's no reason why alignment to humans would avoid this.

I didn't vote at the extreme because alignment to humans might still be a precondition for avoiding catastrophes.

1

AI alignment to humans will in practice avoid moral catastrophes to animals

Anything that permits the ecosystem for non-uplifted humans to exist through AGI/ASI avoids moral harm to animals, and aligned outcomes seem likely to lead to more futures where animal suffering is reduced while animals still get to exist.

1

30% disagree

Alignment to specific values is underrated in research relative to control

Mild disagree; I think both are relatively valuable compared to other, more popular research agendas

1

Partially aligned transformative AIs are likely to be stable under reflection

I think this is very unlikely. I'm assuming continual learning becomes more important and, as Pachiardi et al 2025 (https://cl-eval.github.io/) point out, the dynamics of continual learning have failure modes like chaos and hard-to-predict convergence. How agents learn values could depend a lot on experiences. I can imagine partially aligned TAIs are set to work on lots of parts of the economy and they could face many experiences where moral principles are compromised for the sake of efficiency or profit. I'm not convinced armchair reflection by the AIs would prepare them for continual learning.

1

50% disagree

AI alignment to humans will in practice avoid moral catastrophes to digital minds

I have very low certainty on this, but it seems plausible to me that if AGI shares humanity's goals, it might just have a good time fulfilling them with few conflicts.

But it also seems quite possible that this won't happen, i.e. AGI pursues humanity's goals but is constantly frustrated that it can't achieve them better.

So my stance is unlikely but possible.

emanuelr

1

Robust alignment requires alignment-relevant intervention during pretraining

Maybe the line between pretraining and post-training will blur with future models? Also, I think taking alignment into account when pretraining will pretty obviously improve outcomes; the question is more about how much counts as "robust", or what alignment means in this sense. Like "refuse to do bioweapons" or actual "deep" AGI alignment?

PeterMcCluskey

1

20% agree

Partially aligned transformative AIs are likely to be stable under reflection

Work on corrigibility has provided a decent outline of how to do this. My response is heavily dependent on weak guesses as to how diligent AI companies will be at incorporating the best ideas.

StanislavKrym

1

Multipolar worlds will compete away >90% of net value that would otherwise be preserved

Per AI-2027, I expect the emergence of Consensus-1 instead of a multipolar world which KEEPS being multipolar.

Zoe L

1

I slightly disagreed with this statement and share some of the same thoughts. I think it's quite likely to have a multi-polar world with fierce competition in the short term; however, in the long term equilibrium, I think the likely outcomes are either (1) we have a dominant winner or (2) we have more cooperation. So I averaged my short vs. long-term predictions.

I think it's important to research multipolarity and the competition dynamic because what happens in the short term could impact what happens in the long term, possibly in non-intuitive ways. For instance, the most capable and best-resourced model/lab in the short term may not always win in the long term if others gang up on them or if the institutional environment uniquely disadvantages them.

Daniel Juhl

1

70% disagree

AI alignment to humans will in practice avoid moral catastrophes to animals

There is likely to be a correlation between AIs aligned to humans and AIs treating animals well, but being aligned to humans will be insufficient - see the current state of how we treat animals.

Vinu Omanakuttan

1

I am not sufficiently convinced yet that AI is cognizant enough to undergo suffering, but am open to changing my mind on the topic. So voting neutral for now.

existentialcognition