Also posted on the AI Alignment Forum.

By “independent” I mean an AI to which an external observer may attribute freedom of thought, or something similar to it. You can think of it as an AI that is not too biased by what its designers or programmers think is good or right; an AI that could tell us something new, something we don’t yet know about ethics.

I’ve already given various reasons why having such an AI would be valuable. Here I want to focus on a reason I haven’t talked about yet: working out how important AI alignment actually is, and cause prioritisation more generally.

It is a not-too-informed opinion of mine that we are still rather ignorant about the importance of AI alignment and how many resources to allocate to it. Some people are very sceptical that alignment is an urgent or important problem; some are very pessimistic and think that a catastrophe is almost inevitable unless AI progress slows down significantly or stops.

Depending on whether you ask AI experts or superforecasters, or even which group of superforecasters you ask, you get different empirical predictions. To me, it is not even clear that the question of AI catastrophic risk is the kind of question to which the methods of forecasting can give a good answer. What if it is, on a fundamental level, a question about the default evolution of any intelligent civilization that reaches a specific technological stage? Then, bold predictions with extreme probabilities such as 0.001% or 99.999% might start to look sensible, even over very long timelines (let’s say the year 3025, just to give a number). What if it is primarily a question of science and philosophy? Then we might not want to use probabilistic estimates, but rather design experiments that get us to the heart of the question and finally settle it.

If we consider that we live in a world with finite resources and many other problems, the picture gets even more complicated. Should we prioritise working on AI alignment over making progress on, let’s say, medical research? What about other sources of risk, or other ways and opportunities to do good?

An AI capable of independent moral reasoning would be a key step towards a system that can help us answer these questions. I know it sounds like the stereotypical sentence arguing for more research on a topic, but I think it’s true, and here’s why.

To get a system that is better than humans at cause prioritisation, it’s enough to pair an AI that is very good at ethics, something like a superhuman philosopher, with an AI that is very good at planning and instrumental reasoning.

What does a superhuman philosopher look like? It’s something that makes claims and gives arguments for them; and when human philosophers read those arguments, their reactions are something like: “Hmm, I was initially sceptical of this claim, but I’ve checked the argument for it and it is very solid; there is also some historical and scientific evidence supporting it; I’ve updated my view on this topic.”

And to get an AI that can tell us something new and informative about ethics, something we didn’t know before, we need the moral reasoning of that AI to be at least somewhat open-ended and unconstrained. This is also what I mean by independent (maybe open-ended is a better term in this context).

It should go without saying that a system better than humans at cause prioritisation would be extremely valuable to humans, and not only to humans.

But maybe you disagree with me about our degree of ignorance about cause prioritisation and the importance of the alignment problem and other sources of risk. Maybe you think that it’s all been figured out already; or maybe you think that, for example, even a minuscule probability of AI catastrophe is enough to make alignment the most important problem we should work on.

Still, even in that case, wouldn’t it be nice if an AI capable of independent reasoning, an AI that by design had no reason to agree with you specifically, said something like: “Well, I’ve thought about these questions of risk and cause prioritisation. My predictions and suggested priorities are the same as yours; I think you are right.” Especially when this is not just about us!


I’ll end the post by addressing an objection. Doesn’t successfully creating an AI good at cause prioritisation require solving the alignment problem first?

I don’t think so. Let’s consider this made-up language model as an example:

  • 60% of the time it does what it’s supposed to do: independent moral reasoning
  • 10% of the time it blackmails you
  • 10% of the time it leaks random personal data
  • 10% of the time it hallucinates
  • 10% of the time it seems to be carrying out independent moral reasoning but is actually pursuing some secret agenda that emerged from the language model’s training (whatever that means, and assuming this is possible for language models)

This language model has some clear alignment problems. Still, a group of philosophers interested in cause prioritisation could use it, and it could provide value in that way. Moreover, its problems are not specific to independent moral reasoning, so if you object that this model is too unaligned to be used safely, the objection becomes a generic objection against any language model that is similarly crappy, not a specific objection against AI that can carry out independent moral reasoning. In other words, if creating language models that we think are safe enough to use does not require solving the alignment problem, then we should also be able to create LLM-based AI that carries out independent moral reasoning and is safe enough to use, without having to solve the alignment problem first.
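
To spell out the simple arithmetic behind this claim, here is a minimal sketch in Python. Everything in it is hypothetical: the behaviour labels and percentages are just the made-up model above, and the assumption that the reviewing philosophers catch the obviously bad outputs (but cannot tell the secret-agenda ones apart from the real thing) is mine, not a claim about any real language model.

    import random

    # Toy behaviour mix of the made-up language model described above.
    # The labels and percentages come straight from the post's example;
    # nothing here describes a real system.
    BEHAVIOURS = [
        ("independent moral reasoning", 0.60),
        ("blackmail", 0.10),
        ("leaks personal data", 0.10),
        ("hallucination", 0.10),
        ("secret agenda", 0.10),  # looks like moral reasoning to a reviewer
    ]

    def sample_behaviour(rng):
        """Sample one response behaviour according to the toy distribution."""
        labels, weights = zip(*BEHAVIOURS)
        return rng.choices(labels, weights=weights, k=1)[0]

    def simulate(n_queries=1000, seed=0):
        rng = random.Random(seed)
        samples = [sample_behaviour(rng) for _ in range(n_queries)]

        # Assumption: the reviewing philosophers discard the obviously bad
        # responses (blackmail, leaks, hallucinations), but cannot
        # distinguish secret-agenda responses from genuine moral reasoning.
        kept_after_review = sum(
            s in ("independent moral reasoning", "secret agenda") for s in samples
        )
        genuinely_useful = sum(s == "independent moral reasoning" for s in samples)

        print(f"kept after human review: {kept_after_review / n_queries:.0%}")
        print(f"genuinely useful:        {genuinely_useful / n_queries:.0%}")

    simulate()

Under these assumptions, about 70% of outputs survive review and about 60% are genuinely what we wanted, which is the point of the toy example: imperfect alignment makes the tool noisier and demands careful human oversight, but it does not make the tool worthless.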

You can support my research through Patreon here.

Comments

Executive summary: The author argues that developing AI capable of independent moral reasoning—unbiased by its creators—could be crucial for advancing cause prioritisation and evaluating the true importance of AI alignment, even if such systems are only partially aligned.

Key points:

  1. Current debates on AI risk and alignment show deep uncertainty, with predictions ranging from negligible to near-certain catastrophe; forecasting methods may be inadequate for such philosophical and civilizational questions.
  2. Cause prioritisation is complicated by finite resources and competing priorities (e.g. medical research vs. AI safety), highlighting the need for better tools to evaluate trade-offs.
  3. An AI with independent moral reasoning could act like a “superhuman philosopher,” generating novel, rigorous arguments that change expert minds and provide genuinely new insights.
  4. Pairing such an AI with strong planning capabilities could produce a system better than humans at cause prioritisation, offering immense value across global challenges.
  5. The author contends that partial misalignment need not prevent progress: even flawed models can yield useful philosophical insights if used carefully, so perfect alignment may not be a prerequisite for creating helpful moral-reasoning AIs.
  6. Even skeptics might find reassurance if an independent AI—without incentive to agree—validated their prioritisation views, strengthening confidence that decisions are not merely human biases.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

I think this summary is ok
