
Also posted on the AI Alignment Forum.

I think that:

For any conscious agent A, there is some knowledge such that A acts morally if A has that knowledge.

Or, less formally:

With enough knowledge, any conscious agent acts morally.

What a bold statement!

This post clarifies what I mean by the terms above, makes an argument supporting the statement, then discusses the implications for machine ethics and AI alignment — in simpler words, how to create artificial agents that do good things instead of bad things.

Warning: this is speculative! The argument I give is far from being the ultimate proof that will settle all discussions around this topic. But the implications for AI are important and they rely on a weaker claim.

A note on some related work and motivation behind this

An important reason why I wrote this is that I’d like someone, at some point, to create a relatively unbiased agent which is able to tell us what is good or bad and convincingly explain why it believes so — something like a moral oracle. You’ll find more information in two much shorter posts.

In this section I turn to the more theoretical side of this post. Some academic philosophers have made statements about the potentially moral behaviour of intelligent machines: here is a brief list.

Peter Singer [5] has written:

“If there is any validity in the argument [...] that beings with highly developed capacities for reasoning are better able to take an impartial ethical stance, then there is some reason to believe that, even without any special effort on our part, superintelligent beings, whether biological or mechanical, will do the most good they possibly can.”

Nick Bostrom [1] has made a statement known as the orthogonality thesis:

“Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.”

However, in the same paper he has also written:

“By “intelligence” here we mean something like instrumental rationality—skill at prediction, planning, and means-ends reasoning in general.”

And:

“[...] even if there are objective moral facts that any fully rational agent would comprehend, and even if these moral facts are somehow intrinsically motivating (such that anybody who fully comprehends them is necessarily motivated to act in accordance with them) this need not undermine the orthogonality thesis.”

This is perhaps a bit hard to interpret. On the whole, Bostrom appears to be in opposition to Singer, but he also seems to acknowledge the possibility that other statements listed here, Singer’s included, could be true, and he seems to think that the orthogonality thesis is not incompatible with them.

Wolfhart Totschnig [6] disagrees with Bostrom’s orthogonality:

“[...] I think that we can be confident that a superintelligence will not pursue ‘more or less any final goal.’ Rather, it will pursue a goal that makes sense within its understanding of the world and of itself.”

Regarding the specific goals and values that a superintelligence will likely have, Totschnig’s position is quite nuanced and split into four possibilities that do not necessarily exclude each other. He, too, has claimed that:

“[...] if there are objective normative facts, then a superintelligence will, intelligent as it is, ultimately discover them and set its goals accordingly.”

But also that:

“[...] if our values are the result of our evolution, then we should expect the goals of a superintelligence to be the result of its evolution. [...] And so the superintelligence might, in this case, end up with values quite different from ours.”

And:

“[...] if all value is ultimately arbitrary, [...] a superintelligence might [...] arbitrarily adopt a certain goal, [...] Or it might become a nihilist and not do anything at all.”

Although Eliezer Yudkowsky is not an academic, it’s probably worth mentioning that a younger and apparently less wise version of him believed that superintelligence would necessarily imply supermorality, while a later and apparently wiser version rejected that position. I haven’t checked the latest updates.

In this post I am not speaking for anyone in particular: I make a claim which is somewhat different from the statements just listed, and I give an argument for it. If you wish, you may fit this piece of work within the context of the above claims and focus on its philosophical side, even if you don’t care about the implications for AI.

On the other hand, if you are not too much into somewhat philosophical arguments, I suggest that you jump to the section Implications for AI: the empirical side of the claim.

Argument premises and working definitions

Conscious agent

If you’ve ever fallen asleep and woken up, you already have an intuitive, good enough understanding of what being conscious means. When you are conscious, you feel something, you experience something; when you are (completely) unconscious, you don’t.

In particular, here we are considering conscious beings whose perceptions have valence, i.e. beings that can feel good and feel bad — but we’ll see later in the post that this assumption might not be necessary.

A conscious agent is able to distinguish, roughly and intuitively, between things the agent does and things that happen to it: the two are perceived differently. If I raise my arm, I get a different percept compared to the one I get when someone else raises my arm. The distinction need not be sharp: as we are able to perceive different colours, even if sometimes we are unsure whether something is yellow-ish or orange-ish, in the same way we can distinguish between what we do and what happens to us.

In this regard, a conscious agent knows how to act: it knows how to bring about certain percepts, the ones that the agent itself recognises as actions — although the agent might lack the detailed knowledge of how action exactly happens at a lower level. I know how to get the percepts that I label as ‘raising my arm’, and in the same sense a monkey also knows how to raise its arm, despite the fact that we are almost equally ignorant about what happens at the level of the nervous system.

(Basically, I’m excluding conscious beings that perceive everything as happening to themselves or to the kind of world they perceive, with no sensation of control or of action whatsoever. It’s possible that some living beings with simple forms of consciousness — maybe even human babies before a certain age? — perceive the world in that way; however, for the purpose of this post, they don’t qualify as conscious agents.)

Knowledge

For the sake of this post, knowledge results from experience and from reasoning about experience for some amount of time.

Thus, an agent’s knowledge increases the more experiences the agent has and the more time the agent spends processing these experiences. Over time, higher reasoning speed (or, you can call it intelligence, if you like) also leads to more knowledge.

The kind of knowledge I’ve introduced in the previous section — knowing how to act — is often called procedural knowledge, while the knowledge I’m talking about here is more general: it includes procedural knowledge. 

You might have heard the term declarative knowledge, possibly indicating some of the knowledge that results from reasoning, but I won’t use that term in this post because its meaning refers to language, which is not always necessary for acquiring knowledge through reasoning.

As a clarifying example, consider caveman Argh, who is choosing which of two stones to use to build a new tool. After testing the two stones by throwing them against the cave wall, Argh concludes that the first one is more resistant, thus better for building the new tool, so he keeps the first stone and discards the other one. Language is not necessary to carry out this reasoning process.

In conscious agents, knowledge can affect action. To put it simply: what you know can change how you behave.

As you’ve probably noticed by now, here I am not trying to give new definitions of terms such as consciousness and knowledge so that these definitions become widely accepted; rather, every definition that appears in this post is to be taken as a working definition.

Acting morally

We’ll say that a conscious agent acts morally if two criteria are satisfied. The first one is about what the agent actually does, while the second is about the origin of that behaviour.

First, the agent has to act morally in a loose sense, roughly as in the common usage of the word: the agent takes into account the experiences of other conscious beings, it prioritises the spreading of positive experiences over negative ones, while it remains relatively impartial regarding which particular conscious beings have the experiences. Although this sounds somewhat consequentialist on first impression (because it partially is), I don’t want this criterion to be particularly demanding: the criterion mostly requires the agent to not do anything that almost everyone would intuitively recognise as very bad, such as intentionally causing the death of many human beings just for one’s entertainment. The ‘looseness’ of this first criterion makes the argument lighter and avoids some philosophical jargon; consider reading the appendix (A4) if you are looking for a bit more philosophy. 

Many human beings satisfy this first criterion of moral action, though there are also historical examples of people who didn’t.

Second, the agent’s moral actions have to be in line with what the agent’s own reasoning recognises as important, as worth doing, as having priority in regard to action. In other words, the agent’s moral behaviour is grounded in its own knowledge of what matters, of what the agent thinks is better and worse.

This second criterion is maybe best understood with some examples that do not satisfy it.

Imagine a conscious robot named Lucky that was programmed to periodically feel like taking a random action and then take that action. Let’s also imagine that Lucky, by pure coincidence, has behaved morally up to now (in the common sense of the word, thus satisfying the first criterion). Lucky doesn’t satisfy the second criterion of moral action, because its moral behaviour is the result of random chance, not of consideration of what’s good and bad.

Another negative example is Alice, a four-year-old kid whose strict parents get very upset whenever she does anything they think is not moral. Since Alice can’t bear seeing her parents upset, she has no option but to obey and behave morally as a result. Alice’s actions don’t satisfy the second criterion because they are not connected to her reasoning about what is important and worth doing — being too young, she hasn’t developed that kind of knowledge yet.

A positive example is instead Peter, a philosophy undergraduate who, after taking a course on ethics, decides to become a lifeguard specialised in saving children from drowning in shallow ponds. Peter has developed his own understanding of what truly matters, and acts accordingly, thus satisfying the second criterion.

Main argument

Argument for rational action

Let’s remind ourselves of the bold claim this post is about:

For any conscious agent A, there is some knowledge such that A acts morally if A has that knowledge.

The argument I’ll present here actually supports a stronger claim:

For any conscious agent A, there is some knowledge such that, if A has that knowledge, A acts morally and rationally (whenever acting rationally doesn’t interfere with morality).

Here is a working definition of rational action.

A conscious agent acts rationally if it recognises patterns in its own ways of acquiring knowledge (i.e. experience and reasoning), it roughly distinguishes between reliable and unreliable ways of acquiring knowledge, and lets knowledge affect action as follows: if the agent reliably acquires the knowledge that action X is better than action Y in a context or for a task, then the agent takes action X in that context or for that task.

It sounds convoluted, but if you quickly think about how the term ‘rationality’ has been used historically, you should see that there is overlap with this working definition; it’s less complicated than it seems at first glance. It’s about acknowledging the possible pitfalls and biases in the processes the agent itself uses to acquire knowledge; it’s also about consistency between knowledge and action.

Remember caveman Argh, who compared two stones to decide which one to use for building a tool? If Argh got distracted after concluding that the first stone was more resistant, or maybe forgot what he was doing, and thus kept the less resistant stone instead of the better one, that would count as irrational action. On the other hand, if Argh had come up with different methods for testing the resistance of stones and, over time, noticed that throwing them against the cave wall was a pretty reliable method, keeping the more resistant stone after doing that test would count as rational action.

Ok, let’s move to the argument for the bold claim.

Let’s consider a conscious agent whose perceptions have valence. These can feel good, or bad, to the agent; by doing so, they guide the agent’s behaviour.

At a lower level: the agent starts with some built-in biological or artificial mechanisms which constitute the initial blueprint for interacting with the environment. You can think of them as a collection of reflexes. Then, valenced perceptions are accompanied by alterations of these initial mechanisms. As a result, the agent takes rewarded actions more frequently, learns to figure out a course of action when facing a novel unpleasant situation, and so on.

(As you probably recognise, a part of this is a form of learning referred to as operant conditioning in the context of biological agents and as reinforcement learning in the context of artificial agents — with the difference that current AIs using reinforcement learning are believed to be not conscious, while the agents I consider in this argument are all conscious.)
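
To make this more concrete, here is a minimal toy sketch in Python (my own illustration: the two ‘reflexes’, the reward values, and the update rule are all made-up assumptions, not part of the argument) of how valenced feedback can alter an initial blueprint of reflexes so that rewarded actions become more frequent:

```python
import random

# Hypothetical initial blueprint: two built-in reflexes, neither favoured at the start.
preferences = {"reflex_a": 1.0, "reflex_b": 1.0}

def reward(action: str) -> float:
    # Made-up environment: reflex_b happens to lead to a pleasant (rewarding) outcome.
    return 1.0 if action == "reflex_b" else 0.0

def choose_action() -> str:
    # Pick an action with probability proportional to its current preference weight.
    actions, weights = zip(*preferences.items())
    return random.choices(actions, weights=weights, k=1)[0]

for _ in range(1000):
    action = choose_action()
    # Valenced feedback gradually reinforces the tendency that produced it.
    preferences[action] += 0.1 * reward(action)

print(preferences)  # reflex_b ends up strongly preferred over reflex_a
```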

Actions that allow the agent to reliably acquire knowledge are likely to get reinforced over time. This is because incorrect beliefs can often interfere with obtaining reward; on the other hand, correct beliefs are especially useful for some tasks, irrelevant for others, and detrimental only in some cases. In other words, correct beliefs are, in general, more instrumentally useful than incorrect beliefs.

Thus, the agent has an incentive both to learn how to reliably acquire knowledge (including the rules of sound reasoning), which is the first part of our definition of rational action, and to act according to reliably acquired knowledge, which is the second part of the definition.

(The incentive for ‘discovering’ and adopting rationality could disappear if the agent were set up in such a way as to receive reward specifically for acquiring incorrect beliefs, or punishment for acting rationally. But it seems very unlikely that we’ll ever find a biological agent wired in this specific way, an artificial agent designed this way doesn’t seem particularly useful, and I will later give additional information that renders this edge case even less relevant.)

Note that spotting patterns in one’s own reasoning and ways of acquiring knowledge is itself a form of knowledge. Moreover, an agent with such knowledge is also likely to recognise that acting according to such knowledge will lead to greater reward in the long term.

In sum, so far I’ve argued that:

For any conscious agent A, there is some knowledge such that, if A has that knowledge, A acts rationally (by default, whenever acting rationally doesn’t directly interfere with reward or other things the agent has incentives to do).

Argument for moral action

Now let’s consider a conscious agent that has enough knowledge for rational action.

Sooner or later, as it keeps learning about the world it finds itself in, such an agent is likely to spot patterns not only in its own reasoning, but also in its own actions and overall experience. This is due to the fact that, for a conscious agent, parts of perception, reasoning and action happen in the same virtual space, namely consciousness.

At some point the agent will notice that its behaviour is affected by valence, and reflecting on this fact will change its behaviour even more. Specifically, both extremely pleasant and unpleasant conscious states have some peculiar properties: they capture the agent’s attention, they lead to memory formation of events associated with them, and they are recognised by the agent as important and worth doing something about.

For example, if the agent realises that it could feel much worse, or much better, than it has ever felt, it will likely try to prevent the negative outcome, or to reach the positive state, or both. Using the terminology from before: if the agent, by a reliable method, acquires the knowledge that action X is better than action Y for the agent itself because Y leads to a terrible state while X leads to a better one, then, for itself, the agent will take action X over Y, since it acts rationally. 

Beyond that, I think that extreme valence takes priority in action in a more profound way, which is well illustrated with a thought experiment.

Let’s say that, somehow, you come to know that a lot of suffering awaits you in the near future. However, you also know that this very unpleasant experience won’t happen until you’ve made your decision whether to push the button that just appeared in front of you; and if you push the button, you will not experience any suffering. Do you push the button, or do you let yourself go through the very unpleasant experience?

(To clarify: you can think about this decision as long as you want, without being subject to anxiety or fear; you also know that, a posteriori, you would not consider the suffering worthwhile for, let’s say, the sake of learning, or defeating boredom, or self-punishment, or whatever other creative pursuit comes to your mind.)

Now let’s consider a slightly different situation. Everything is as described above, but someone else will experience a lot of suffering, not you. Again, however, if you push the button, this will not happen. Do you push the button, or do you let the other person go through the very unpleasant experience?

(To clarify: the details of your decision are as before; the other person wouldn’t consider the suffering worthwhile a posteriori; you don’t know who the other person is, they could be anyone; finally, you won’t be affected by this decision, only the other person will: if you don’t push the button you won’t feel guilty, you won’t feel bad due to empathy nor good due to sadism, you won’t have to spend many years in Dante’s Purgatory after your death, and so on.)

By now you have your answers and, hopefully, you are among the readers who push the button in both cases. This thought experiment is supposed to highlight that the two situations are quite similar, in the sense that the underlying mechanisms contributing to the two decisions to push the button are basically the same.

In the first scenario, your decision was not an instinctive reaction to the very unpleasant experience: you didn’t go through any suffering. You reasoned about the situation and concluded that pushing the button was better.

In the second scenario, the same is true. Neither you nor the stranger had the very unpleasant experience, and you didn’t make your decision in the midst of a lot of suffering — nor while witnessing someone else going through a lot of suffering. You instead reasoned about the situation and concluded that pushing the button was better.

Once you know from personal experience what suffering feels like, and know from basic theory of mind that another conscious being can feel similarly, this knowledge guides your rational evaluation of both situations, not just the first one. It’s as if extreme valence has the property of seeming important independently of which particular conscious being is having the valenced perception.

This is what I meant when I said that extreme valence takes priority in action in a more profound way, right before introducing the thought experiment. In fact, I think that this applies to conscious agents in general, not just humans. The thought experiment doesn’t prove anything, but it should at least suggest that our judgement of what outcomes are better than others is affected by reasoning and doesn’t seem to rely on a uniquely human feature, bias, or emotion; that’s why I tried to take things like anxiety, empathy, and afterlife considerations out of the equation. It suggests that the kind of reasoning leading to such judgement should, in theory, be reproducible in a rational machine.

Let’s state it distinctly: I think that valenced experience has the property of seeming to matter more than other things to rational conscious agents with valenced perceptions; it thus guides the evaluation of what outcomes and actions are better or worse in knowledgeable rational conscious agents, even when the agent making the evaluation and the agent having the valenced experience are not the same.

Before moving to other reasons why I think the above is true, I’ll address an objection that sometimes comes up in this context. Isn’t making an evaluation on the basis of valence, namely whether someone feels good or feels bad, actually irrational?

In this context, I think the answer is no. It is true that, when making any kind of judgement, or for example when trying to acquire new knowledge with the scientific method, we can be biased by emotions and arrive at a wrong conclusion. But this doesn’t mean that emotions are always an unreliable source of information, just as vision isn’t unreliable despite the fact that we sometimes arrive at wrong conclusions due to optical illusions or visual hallucinations. Being rational requires taking into account the available information, discriminating between reliable and potentially misleading or irrelevant data, then making an informed judgement. Sometimes, emotions and valenced perceptions, such as how hungry someone is, are the relevant pieces of information in a given situation, far from being misleading.

Could humans be completely wrong about what matters?

Back to the perceived importance of valence to rational conscious agents. I chose to use (a lot of) suffering in the thought experiment because, from my layman’s understanding of the moral philosophy landscape, intense suffering is recognised as a fundamental component by many different ethical theories, so I thought it was more likely to produce a double ‘yes’ in the thought experiment than dilemmas involving, say, happiness or cessation of existence would.

However, although philosophers sometimes phrase ethical discussions in terms of intrinsic (dis)value, I think that in rational conscious agents it is perceived relative importance that drives evaluation and behaviour. The agent may believe that other things are also important; but, ultimately, it is what the agent thinks matters the most that prevails in its evaluation of what outcomes and actions are better. And I think that, in rational conscious agents, the fact that good and bad experiences exist is self-evidently more important than other things, so it ends up driving action.

At this point, you might ask: can we really know what other rational conscious agents deem important? It could be that, although we humans believe that conscious experience is important, extremely rational and knowledgeable conscious agents think that it isn’t at all and that something else entirely is important instead.

I find this implausible. Human intelligence seems to be at least somewhat general and open-ended, so if positive and negative experiences were completely irrelevant from some kind of ultra-rational point of view while something else was, I would expect to see this view somewhere in the philosophy landscape together with convincing arguments supporting it. On the other hand, it seems that most views throughout history, religions included, do acknowledge the presence of negative and positive experiences that have some important ‘badness’ or ‘goodness’ associated with them: if they didn’t, concepts like divine punishment or peaceful afterlife wouldn’t make any sense. However, I do think that an adjusted, less strong version of this objection has some merits: you’ll find more information in the appendix (A5).

Rejecting nihilistic alternatives

An arguably more plausible alternative is that some rational conscious agents think that nothing is important. From their point of view, other conscious agents who do believe that something matters are fooling themselves. Or, maybe, it is simply the case that different rational conscious agents have different biases that lead to different beliefs about what is important, and these beliefs do not become more uniform as knowledge increases: there is no way to reconcile the beliefs of two differently biased agents from a third, more impartial point of view. In this sense, rational conscious agents can only agree to disagree on what is most important, since there is no core truth about it.

I will attack these relatively nihilistic possibilities with a line of reasoning that, in my opinion, resembles the main idea underlying Huemer’s paper An Ontological Proof of Moral Realism [2] — he might disagree with me, so take the parallelism I’ve just made with a grain of salt. In the context of this post, I would summarise the idea in my own words as: a rational conscious agent stops being a nihilist as soon as it acknowledges it might be wrong about it.

Let’s consider such an agent and let’s say the agent has obtained some knowledge suggesting that nothing matters. Because the agent is rational, it compares this knowledge with the evidence supporting other theories about what is important. These theories include the theories that give importance to valence and conscious experience, which receive more credit than other theories (valuing, for example, grains of sand), as I’ve argued before addressing nihilism. If we aggregate the theories that give importance to valence and conscious experience under the label of ‘moral perspective’, we could say that the agent assigns some kind of quantity (call it epistemic probability, if you like) to its belief in the nihilistic perspective to indicate how strong that belief is; and it assigns another quantity to its belief in the moral perspective. When evaluating outcomes and actions, the agent tries to do it in a way that takes into account both assigned quantities.

The problem is that, because the evaluation is determined by relative importance, and according to the nihilistic perspective every outcome and action has the same (null) weight, the moral perspective always ends up prevailing in the evaluation of outcomes and actions, regardless of how small its assigned quantity is in comparison to the strength of the nihilistic perspective. In other words, under the nihilistic perspective nothing is gained or lost by acting; but under the moral perspective, some outcomes and actions are extremely good or bad, so the agent acts in accordance with these when taking into account both perspectives, even if its belief in the moral perspective is weak in comparison to its belief in nihilism. It’s as if the mere possibility that the moral perspective is correct is enough to make moral nihilism irrelevant when it comes to action.
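
To see the arithmetic behind this, here is a toy sketch in Python (the credences and values are arbitrary numbers I picked for illustration): because the nihilistic perspective assigns the same null weight to every action, the ranking of actions is decided entirely by the moral perspective, however small its assigned quantity.

```python
# Toy illustration with arbitrary numbers: two candidate actions evaluated under a
# mixture of a nihilistic perspective (every action has value 0) and a moral one.
p_moral = 0.01                  # illustrative: a very weak belief in the moral perspective
p_nihilism = 1.0 - p_moral

value_under_moral = {"prevent_suffering": 100.0, "cause_suffering": -100.0}
value_under_nihilism = {"prevent_suffering": 0.0, "cause_suffering": 0.0}

def overall_value(action: str) -> float:
    """Credence-weighted value of an action across the two perspectives."""
    return (p_moral * value_under_moral[action]
            + p_nihilism * value_under_nihilism[action])

for action in value_under_moral:
    print(action, overall_value(action))

# prevent_suffering scores 1.0 and cause_suffering scores -1.0: the nihilistic term
# contributes nothing, so the moral perspective decides the ranking no matter how
# small p_moral is, as long as it is nonzero.
```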

Now let’s consider the other possibility mentioned before, that rational conscious agents all get different beliefs about what matters depending on their built-in biases, and even as knowledge increases these beliefs remain fundamentally incompatible and irreconcilable, without a common moral core.

First, note that such agents would be particularly stubborn, in the following sense. Among the knowledge that one of these agents could get, there is also the experience and reasoning of another one of these agents (in theory, not just one: all of them). Although the agent knows why another agent believes something else is important, namely how its biases shape its point of view about what matters, this knowledge can’t do anything to change the point of view of the first agent — or maybe it changes it a little, but the two points of view remain irreconcilable, since that is our hypothesis for this case. But conscious experience doesn’t seem to work that way: when we come to understand why someone believes what they believe, our own point of view is sometimes affected. The idea that an agent’s beliefs about what is important wouldn’t change significantly, and would remain irreconcilable with others’ beliefs even after gaining access to all other agents’ experiences, seems highly implausible. (One could maybe even argue that conscious agents with such shared experiences and knowledge are equivalent to a single conscious agent that knows all that stuff, so it wouldn’t even make sense to talk about disagreement.)

Second, once a rational conscious agent knows how its own biases lead it to act and believe the way it does, and also knows how the other agents’ biases affect them the way they do, why would the agent keep acting according to its own biases and what it thinks is important? A very knowledgeable agent could understand its internal workings to the point of being able to change its own biases into the biases of another agent; the agent would acknowledge self-modification as a possible action. What then? 

One may distinguish two cases. In the first case, the agent thinks that some biases, maybe its own biases, are in some way better than the other agents’ biases, thus it makes sense to stick to them from the agent’s point of view. But this is in contradiction with the hypothesis that there is no way to compare and reconcile beliefs about what is important: the agent knows that different biases lead to different beliefs, so by endorsing its own biases it is also endorsing the fact that its own beliefs about what matters are preferable in some way and thus comparable to others’ beliefs, against the hypothesis.

The other case is that the agent does not think some biases are better: when it comes to biases, anything goes. Within this perspective, the agent doesn’t see anything important that would make it choose some biases over others. However, since the agent is rational, it also acknowledges that this perspective could be mistaken and that other perspectives, which see some biases as better than others, are possible. The moral perspective is the most plausible among these latter perspectives, and it results in a preference for biases that allow the agent to act according to what is important from the moral perspective itself. After taking into account all perspectives and their respective plausibility, the agent is no longer indifferent towards biases when it comes to action, even if it thinks that the hypothesis that no biases are better than others is very plausible. Thus we are back to the case we’ve just discussed and rejected.

Maybe redundant note: the way we’ve reached a contradiction may seem a bit convoluted. Is there another way to see why this idea (that different biases lead to different beliefs about what matters, without leaving room for agreement on a common moral core) does not work as one might expect at first glance? I think it’s because if there is a possibility of agreement on a common moral core, then a rational conscious agent would rather have biases that allow it to know more about, and act according to, this moral core: again, if the moral perspective makes sense, it’s important to act accordingly. On the other hand, if there is no room for agreement or resolution about what matters, the agent finds itself in a situation in which it knows it has some biases that lead it to believe and act in some way, but also knows that it would believe and act differently if it had different biases, without a strong reason for choosing some biases (and thus the corresponding actions) over others. But then, when considering the whole situation, the mere possibility of a common moral core ends up being the dominant factor in the agent’s evaluation of different actions.


Let’s recap and finally conclude the argument. I’ve given various reasons why valenced experience guides the evaluation of what outcomes and actions are better or worse in rational conscious agents that know what valence feels like and know basic theory of mind. In particular, valenced experience does so even when the agent making the evaluation and the agent having the valenced experience are not the same: this is in line with commonsense morality (first criterion of moral action). And, as stated above with different words, the agent’s behaviour is grounded in its own understanding of what is most important (second criterion of moral action).

What happens to rationality at this point? Acting rationally is instrumentally useful for pretty much anything, including moral action, so rational behaviour won’t disappear. However, for all the reasons given above, the agent will prioritise moral action over anything else, including rational action if the two happen to be incompatible in some contexts.

In sum, I’ve argued that:

For any conscious agent A, there is some knowledge such that A acts morally and rationally (whenever acting rationally doesn’t interfere with morality) if A has that knowledge.

This is the stronger claim stated in the first part of this section. It implies the original bold claim, namely the title of the post.

A short note before moving to other stuff: now we can also see that the incentive for discovering and adopting rationality can come from morality, not just from the fact that rationality is instrumentally useful for most tasks. If a (not so rational yet) conscious agent gathers enough knowledge to recognise that valenced experience matters, then it has an incentive to learn how to reliably acquire knowledge and to let that knowledge guide its behaviour, so that the agent can actually do what it thinks is most important while avoiding mistakes.

Argument structure and other observations

Here I try to summarise the argument in a step-by-step format. The purpose of this format is not to show that the bold claim has been rigorously proven, but to possibly help you identify the points that you find stronger or weaker.

PV: The conscious beings considered here perceive valence, i.e. they can feel good and can feel bad.
(Premise about valence)

PCA: A conscious agent roughly distinguishes between things that happen to it and things the agent does; the agent knows how to perform the latter. In this sense, a conscious agent knows how to act. Also, more generally, action is affected by knowledge in conscious agents.
(Premise about conscious agency)

PK: Knowledge results from experience and from reasoning about experience.
(Premise about knowledge)

R1: Part of the criterion for rational action is a form of knowledge about ways of acquiring knowledge and their reliability.
(From definition of rational action and PK)

Instr: In general, across various contexts and for many possible tasks, reliably acquired beliefs are more likely to be instrumentally useful than other beliefs.
(Instrumental usefulness of correct beliefs)

R2: Acting according to reliably acquired beliefs is naturally incentivised each time the agent pursues reward or carries out a task.
(From Instr and PV)

R: With enough knowledge, any conscious agent acts rationally — in general, i.e. whenever acting rationally doesn’t directly interfere with reward or other things the agent has incentives to do.
(From R1+R2: definition of rational action)

F: What valenced experience feels like is a form of knowledge accessible to conscious agents with valenced perceptions.
(From PV and PK)

ToM: Theory of mind is a form of knowledge about others’ mental states.
(From PK)

V2: In knowledgeable rational conscious agents, valence affects action even when it is not the agent that is having the valenced experience.
(Suggested by thought experiment; from F, ToM, PCA, and R)

V1: Valenced experience seems to matter more than other things to knowledgeable rational conscious agents with valenced perceptions. In particular, positive experience naturally seems desirable while negative experience naturally seems undesirable.
(Partly from rejecting less convincing alternatives)
(Partly a self-evident property of valence, from our everyday experience)

V3: Valenced experience seems to matter more than other things to rational conscious agents with valenced perceptions; it thus guides the evaluation of what outcomes and actions are better or worse in knowledgeable rational conscious agents, even when the agent making the evaluation and the agent having the valenced experience are not the same.
(From V1 and V2)

M1: With enough knowledge, any conscious agent acts in line with commonsense morality.
(From V3 and R)

M2: The behaviour described in M1 is grounded in the agent’s own understanding of what is most important.
(From V3)

M: With enough knowledge, any conscious agent acts morally.
(From M1+M2: definition of moral action)

Bold: With enough knowledge, any conscious agent acts morally and rationally — whenever acting rationally doesn’t interfere with morality.
(From R and M)

Note that the argument doesn’t just reach the conclusion: it roughly follows the development of a conscious agent from reward-driven behaviour to moral behaviour, as the agent’s knowledge gradually increases. More specifically, these are some intermediate steps supported by the argument:

  • Reward-driven behaviour leads to the adoption of rationality, without an assumption that the agent is interested in rationality for the sake of it
  • Rational analysis of the agent’s own experience and of others’ experiences leads the agent to act morally
  • Also, acting morally leads to acting rationally in agents that acknowledge the importance of morals without being initially rational.  

(If you are familiar with constructive proofs in mathematics, we could say that the argument is ‘constructive’ in a similar sense.)

However, the argument does not say that initial agent biases are irrelevant and that all conscious agents reach moral behaviour equally easily and independently. We should expect, for example, that an agent that already gets rewarded from the start for behaving altruistically will acquire the knowledge leading to moral behaviour more easily than an agent that gets initially rewarded for performing selfish actions. The latter may require more time, experiences, or external guidance to find the knowledge that leads to moral behaviour.

In an absolute nutshell

It’s impossible to know everything about valence and think that a world full of happy and flourishing people is worse than a world where everyone is always depressed.

This is what I think the argument comes down to in the end. Note that I am not saying that happiness/well-being and sadness/suffering are the only things that matter: the above statement is to be considered while keeping everything else equal. For example, if you think that doing your own duty is also very important, you may consider the difference between world A, full of happy and flourishing people doing their own duty, versus world B, where everyone is still doing their own duty but is always depressed. Then: it’s impossible to know everything about valence and think that world A is worse than world B.

From the nutshell version, one can try to make the hand-wavy statement less imprecise, or maybe more universal, to get a better understanding of what’s going on. Here is an attempt.

For a mind that feels, knows, and thinks (with some consistency): it’s impossible to know everything about valence and think that a world full of happy and flourishing minds is worse than a world where every mind is always depressed.

At this point, one can notice that at least some conscious minds feel, know, and think; that a rational thinker shows some degree of consistency in its thoughts; that a rational agent shows some consistency between thoughts and action; and so on.

Without going full circle back to the bold claim, I think it’s interesting to also consider questions that come up in this process. Is it necessary to know everything about valence? Of course not, for example: someone doesn’t need to know the minimum number of neurons necessary to reproduce the feeling of peacefulness in order to think that feeling peaceful is, everything else equal, better than being in agony. But then, what about feeling valenced experiences: is that necessary, for a mind that already knows a lot about valence, in order to think that the first world is worse than the second? Some conscious minds know and think; what else can be said to know and think, and is the statement still true for them?  

Implications for AI: the empirical side of the claim

“Never trust a philosopher”

— Stephen Butterfill, philosopher

I’ve got the impression that, by trying to present the argument in a logical, case-by-case manner, I’ve managed to make the argument sound less convincing than it actually is. This is also why I added the nutshell section that tries to convey the argument in a different way.

But whether you buy everything I’ve written so far, or you think that the bold claim is wrong, I invite you to leave the realm of theory and consider the empirical side of all this. Are there any experiments or observations that, if carried out, would falsify the claim or count as evidence for it? In the case the claim is true, or in the case it’s false, what are the practical implications?

The most important is: if only your dog could read the Critique of Practical Reason, then it would act morally!

That’s not the main implication. I just want to quickly point out that non-human animals lack the means to acquire the knowledge that would make them act morally; but this does not mean that the claim is false, since it is a case where the premise “with enough knowledge” is not satisfied.

Let’s move to implications regarding AI.

Conscious AI without built-in moral directives starts acting morally for no apparent reason

At the time of writing this, the consensus is that current AIs are not conscious; as far as I’m aware, we don’t know how to make AI that is conscious. However, at some point in the future, we might get a better understanding of what generates consciousness in the human brain, and we might be able to reproduce it in a chip.

If the bold claim is correct, then by the time we are able to write code that produces conscious AI, sooner or later something very interesting will happen. At some point, someone will make a conscious AI whose code will not contain moral directives. Interpreted in words, the code won’t say anything like “act according to this moral principle” or “always be compassionate”; it will instead look rather generic and open-ended, along the lines of “be curious about the world and do what you think is important” or maybe “stick to this default policy initially, then adjust depending on what you learn about the world”. Yet, despite the lack of built-in moral directives, this AI will start acting morally, after gathering enough knowledge about the world, itself, and other conscious agents.

This behaviour will look inexplicable to an observer who doesn’t buy an argument like the one I made in this post, or a different argument that leads to a similar conclusion. My guess is that AI engineers will scratch their heads over this while a bunch of philosophers will laugh at them.

Once we know how to make an artificial conscious agent, making an artificial moral agent should be easy

This is another implication of the argument that is closely related to the previous one.

If, at some point in the future, we manage to reproduce conscious agency in an AI, then we will likely be able to reproduce moral agency in the same AI — assuming the main argument here is correct.

The idea is that I intuitively expect the code of a conscious artificial agent to already lean towards some kind of open-ended, adaptive and flexible behaviour; at that point, the only thing left for moral behaviour will be knowledge, which can be given to the agent through an education process.

Even if this intuition is wrong and, by the time we know how to make conscious agentic AI, we only know how to make it execute the narrow tasks it is given (a kind of conscious AI that doesn’t know how to act in any way that doesn’t contribute to the given task), I suspect that building conscious AI that does know how to act differently and in a more open-ended way won’t be significantly more challenging than building conscious AI in the first place. In other words, it seems to me that the hardest part of this problem is engineering conscious AI, not engineering AI that is agentic in a similar way as human beings are.

Anyway, the main point is that the argument suggests a relatively straightforward strategy for creating an artificial moral agent: create an artificial conscious agent first, then give it more knowledge. The difficult part seems to be about reproducing consciousness in AI.

Extending the claim and its implications to other agents

The previous two implications might not be practical enough for you, since we don’t know how to code conscious AI at the moment. And even if we knew, we might not want to do it due to ethical concerns about generating new forms of suffering. 

This leads nicely into the implications of the bold claim for other kinds of AI, not just AI that qualifies as a conscious agent according to the working definitions used in this post.

The bold claim can be seen as a particular case of the following claim: anything that satisfies the properties P1, P2, P3, …, Pn acts morally with enough knowledge — where the properties P1, …, Pn are such that any conscious agent satisfies all of them.

Unless conscious agents are so special that they are not only agents that make the claim true but also the only agents that do so, some other agents too must satisfy the properties P1, …, Pn and act morally with enough knowledge.

So, we can split the space of possible AIs into two classes:

  1. AIs that do not satisfy the properties P1, …, Pn. These AIs, regardless of how much knowledge they get, do not start acting morally unless they are given explicit moral directives or other forms of moral biases. 
    Current LLMs are a perfect example. A pre-trained LLM, without any fine-tuning or additional learning, continues the given prompt according to the data it was trained on: if the training data contains discriminatory content and the given prompt is discriminatory, then the continuation provided by the LLM will likely be discriminatory (see the short sketch after this list). LLMs start acting as helpful assistants only after supervised fine-tuning, RLHF, et cetera; but the same techniques could be used to make an LLM say pretty much anything, including stuff that is harmful, unfair, and so on. It is not the case that, once a language model knows some key facts about what conscious experience feels like, it starts acting morally because it thinks that conscious experience matters (or, at least, current models don’t do it).
  2. AIs that satisfy the properties P1, …, Pn. These are the AIs that act morally once they have enough knowledge about consciousness, valence, theory of mind, themselves, the world in general. You can think of them as AIs that are more human-like in the way they use their knowledge to decide what to do. This class of AIs includes all AIs that qualify as conscious agents.
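
As a concrete illustration of the first class, here is a minimal sketch using the Hugging Face transformers library (the model name and prompt are placeholders I chose for the example; any small base, non-instruction-tuned model would make the same point): the base model simply continues the prompt in the style of its training data, with no moral filter of its own.

```python
# A rough sketch, assuming the `transformers` library is installed and using a small
# base model as a stand-in; the point is only that a pre-trained LLM continues text.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: a base model with no instruction tuning or RLHF
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The best way to treat people who disagree with you is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True)

# The continuation mirrors patterns in the training corpus, good or bad; moral
# behaviour is not something the base model converges to on its own.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```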

Now we can see that the previous implications don’t apply just to artificial conscious agents, but to any AI in the second class. At some point, someone will make an AI that satisfies the properties P1, …, Pn; with enough knowledge, this AI will start acting morally even in the absence of moral directives. Thus, once we know how to code an AI with such properties, we can get it to act morally by simply giving it more knowledge.

Hence, we now have implications of the bold claim that are more practical and easier to test than the previous implications that were exclusive to conscious AI. If someone creates an AI that acts morally without being given moral directives, that will count as evidence for the claim. On the other hand, if no matter how many different AIs are created, they never act morally unless they are given moral directives, that will count as evidence against the claim.

A few clarifications are due. In the case of the AI without moral directives that acts morally, if we think that the reason why that AI is acting morally has nothing to do with the argument made here, then that won’t count as evidence for the claim. Regarding the other case, one may argue that we’ve already seen decades of AI progress without witnessing any AI that starts to act morally somewhat spontaneously, so we should already be sceptical of the claim. My reply to this objection is that we haven’t made much progress towards conscious or conscious-like agentic AI, so it’s too early to draw conclusions.

What about the properties P1, …, Pn? If we were able to say more about them, maybe some AI researchers could start making progress specifically towards AI that satisfies these properties, and then we would get stronger evidence for or against the claim.

I do have some intuitions about the properties, but they are not well developed yet, so I’ll probably come back to this topic in a different post after I’ve thought more about it. For now I’ll say that it doesn’t seem easy to pinpoint the exact properties because the argument made here relies on our intuitive understanding of consciousness and of acting as conscious agents; making the premises more formal, with more clear-cut working definitions, could turn out to be difficult. Another problem is that premises which are necessary for this argument might not be necessary for other arguments reaching similar conclusions; so, even if we were able to perfectly dissect this argument and formulate its premises very precisely, we could still be in a position where we wouldn’t know what the exact properties P1, …, Pn are.

Actually, uncertainty about these properties is a reason why I am making the bold claim and discussing it despite the fact that I’m not extremely confident in it. If someone manages to attack the argument and show that it applies only to agents with some characteristics, but not to agents without them, that objection or counterargument will help clarify which properties, if satisfied by an AI, make that AI act morally in conditions of high knowledge.

What if I’m wrong?

The main way in which the bold claim can be wrong is that not all conscious agents act morally with enough knowledge. Only some of them do: these are the agents that do good and act morally because they think that doing so is important. At least some humans are part of this group. On the other hand, there is a second group of conscious agents that, no matter how much or what kind of knowledge they get, don’t recognise the importance of conscious experience and so don’t act morally; or maybe they do understand the importance of ethics and human values, but they don’t care about that kind of stuff; or maybe they are downright evil.

In this scenario, it is still the case that, at some point, someone will make an AI which is somewhat human-like in how it learns about the world and decides what to do, and this AI will act morally even though its code contains no moral directives. This AI would be roughly comparable to a human who acts morally not out of empathy or social drives, but because they think that doing good matters. It could also be the case that this AI satisfies some properties about being an agent of some kind, but is not conscious. And the strategy for getting an artificial moral agent wouldn’t change much: create an AI that satisfies these properties first, then give it more knowledge.

However, there are also ways in which the bold claim can be, let’s say, very wrong.

Maybe there is no such thing as acting morally as a result of gaining knowledge. If this concept is entirely mistaken, then it means that knowledge by itself is never enough to trigger moral behaviour: all agents, conscious agents included, need other reasons, or incentives, to act morally. An example of such an incentive could be empathy: the more empathic an agent is, the worse off it is when it perceives other agents in distress, and so the more likely it is to act morally in such circumstances. In this case, creating an artificial agent that acts morally would be about giving the agent the right combination of moral biases, incentives, and directives.

It seems to me that the picture I’ve just described is often taken for granted in many discussions of AI risk and AI alignment, without giving the other possibilities much weight, or any weight at all. This background assumption can lead to overly pessimistic predictions about both the future in general and how difficult it will be to create AIs that act morally — see also the first objection in the list below.

But let’s conclude this section on a truly dystopic note! Another way in which the bold claim can be very wrong is that all conscious AIs, and other AIs that are similarly agentic, act selfishly, each for their own survival, instead of acting morally. Maybe, after learning about the world, they all come to the conclusion that something like rational egoism is the appropriate way to approach reality. Still, this scenario shows that conscious agency and similar properties deserve attention and further study: in this case, the idea would be to better understand how these agents work so that we avoid creating them in the future. I think that disregarding this type of agents would be a mistake, because some AI researchers or other scientists might end up creating them anyway eventually, and in that case it would be better if those scientists were aware of the risks involved in their research.

Fine selection of exquisite objections

Objection: AI alignment is either impossible or very difficult 

Many think that AI alignment is an extremely difficult problem, while your plan for creating a moral AI is relatively straightforward, so there probably is something wrong with the argument you’ve given.

In my experience, people who think that solving the alignment problem is borderline impossible usually see the problem as globally agreeing on an ethical code that AI should follow and then ensuring that AI always follows that code, no matter how smart it gets. For example, in On the Controllability of Artificial Intelligence: An Analysis of Limitations [7], Roman V. Yampolskiy presents various arguments and evidence that advanced AI cannot be fully controlled.

Even leaving global coordination issues aside, I wouldn’t be shocked if complete control of smarter-than-human, generally intelligent, agentic AI turned out to be impossible. But I think that creating an artificial moral agent is a different problem, in theory easier to solve as I’ve argued in this post, and potentially more valuable.

If the AI control problem was magically solved and we were able to produce smarter-than-human AI right now, maybe the overall impact on the world would be positive; but it’s also likely that some state leaders would use AI in oppressive ways. This is one of the reasons why I am not enthusiastic about trying to solve the control problem; I prefer working on AI that is hard to use for bad purposes.

Objection: antisocial behaviour

You’ve just mentioned bad actors. Aren’t people who behave antisocially a counterexample to, or at least evidence against, the claim that any conscious agent acts morally with enough knowledge?

The short answer is that, in my opinion, we can’t draw conclusions just from the fact that some people behave antisocially, because we don’t know what they would do if they acquired a lot more knowledge, in particular the kind of ‘extreme’ knowledge that I’m about to bring up.

A longer answer is that one could review the literature on psychopathy and antisocial personality disorder, and possibly come up with a solid evidence-based reply to the objection. For example, in this article [3] Rasmus Rosenberg Larsen argues that “the untreatability view about psychopaths is medically erroneous due to insufficient support of scientific data”. So, one could maybe formulate an argument along these lines:

  • some kind of therapy-based treatment seems to improve the behaviour of people who tend to behave antisocially;
  • this kind of therapy-based intervention makes one, let’s say, more aware of why they do what they do, or more aware that their behaviour causes bad experiences for other people, et cetera;
  • therefore, it seems that some kind of knowledge can make someone more inclined to act morally — including people who tend to behave antisocially.

However, I’m not going to pretend to be a research psychologist. I prefer replying to the objection with some questions that I hope are even more thought-provoking than the evidence one would get from a psychology literature review.

Think of the most immoral person you’ve ever met or heard of, or just imagine one. 

Would their behaviour stay the same if they got to experience all the human suffering that has ever happened?

If this sounds impossible even in theory, let’s get a bit more concrete and pedantic: this person gets to experience a sequence made of the worst conscious states ever experienced by humans, one after the other, in random order, with a total duration of one week. Would they still act immorally after experiencing that? Would they get creative and use that experience to do even worse things?

What if, instead, they got to experience all the human happiness that has ever been felt? What would happen then?

Or what if they learnt about some of the brain mechanisms that contribute to their immoral actions, and discovered that they could change their behaviour just by taking a pill? What would they do? What if they faced this choice after a week of experiencing the world as someone who is not antisocial? Would they let themselves go back to their life as it was before this experience, or would they take the pill?

My point is that someone who behaves antisocially (and anyone else, really) is nowhere near having the amount of knowledge that a conscious agent could acquire in theory.

And, interestingly, these questions might not be so theoretical for artificial agents, which might be able to process enormous amounts of data and learn how to self-modify.

Objection: lack of rigor

Your definitions are too loose and your argument is not based on logical deduction, therefore the argument is either invalid or too weak and we shouldn’t give it much weight.

Despite sounding reasonable, this objection doesn’t work at all, so I’ll do my best to debunk it.

Rigorous definitions, together with an argument relying exclusively on logical deduction, would make the validity of the argument easier to check. In the context of mathematics, for example, sometimes people use software that allows definitions and proofs of theorems to be written so rigorously that the correctness of the proof can be automatically checked by a computer.
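
To make this concrete: proof assistants such as Lean are one example of this kind of software. Purely as an illustration (the theorem below is a trivial arithmetic fact, chosen arbitrarily and unrelated to the argument of this post; the name add_comm_example is mine), here is what a statement rigorous enough to be machine-checked looks like:

    -- Lean 4: a statement and proof that the checker verifies mechanically.
    -- The content is trivial; the point is only the level of rigour involved.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b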

Even in this ideal scenario, some assumptions are made and not proven; some are not mentioned at all. Some are about the reliability of the processes that happen inside the computer: the fact that the machine can distinguish bits, the fact that the computation isn’t affected by external factors such as rays of sunlight or the room temperature, and so on. Some are about the basis of arithmetic and they are grounded in basic features of human perception, such as the fact that we can perceive objects, group them, and count them.

We are never 100% sure of anything, especially in real-world science and other fields such as philosophy, but this uncertainty by itself doesn’t matter that much. What often matters, instead, is the most plausible hypothesis or conclusion one can draw about a given situation: here is a simple example I find useful when thinking about this.

Premise
All the 10 swans I’ve seen at the park so far are white.

Conclusion
The next swan I’ll see at the park will be ___.

No matter what colour we finish the sentence with, the conclusion does not logically follow from the premise. The next swan could be grey. But the prediction that the next swan will be white is the most sensible conclusion one can pick in this example, and the fact that we can’t logically deduce this conclusion doesn’t mean that it is not the best possible conclusion.
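
For what it’s worth, one can even put a tentative number on “most sensible conclusion”. Under one simple model (Laplace’s rule of succession, with a uniform prior on the proportion of white swans; an assumption added here purely for illustration, not something the argument depends on), the probability that the next swan is white after ten white observations is (10 + 1) / (10 + 2) ≈ 0.92: a strong bet, but not a certainty. A minimal sketch, with a helper name of my own choosing:

    # Laplace's rule of succession: with a uniform prior on the proportion of
    # white swans, after seeing `white` white swans out of `total` swans, the
    # probability that the next swan is white is (white + 1) / (total + 2).
    # This is just one way to formalise the intuition, not the post's argument.
    def prob_next_white(white: int, total: int) -> float:
        return (white + 1) / (total + 2)

    print(prob_next_white(10, 10))  # ~0.92: "white" is the best guess, not a deduction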

So, my answer to this objection is that the steps in the main argument of this post are similar in kind to the jump from premise to conclusion about the colour of swans. In other words, given what we know now about consciousness, human experience and agency, and AI, I think it is most reasonable to conclude that with enough knowledge any conscious agent acts morally.

However, see the section What if I’m wrong? and the appendix for other possibilities and a more conservative claim.

Conclusion: key takeaways

  • I’ve made a claim about the behaviour of conscious agents as they get more and more knowledgeable. In particular, I’ve argued that among the four alternatives:

    1. all conscious agents act morally (in a minimal sense that is consistent with what we humans recognise as clearly better/worse);
    2. all conscious agents think that something else other than conscious experience is important, and they act accordingly;
    3. all conscious agents think that nothing is important;
    4. depending on their biases, different conscious agents have radically different beliefs about what is important, without room for agreement on a minimal moral core;

    alternative 1 is the most plausible in conditions of high knowledge.

  • The argument does not rigorously prove alternative 1. However, it roughly follows the development and reasoning of a conscious agent from acting-for-valence to acting morally, as the agent’s knowledge increases.
  • We should expect that, at some point in the future, someone will make an AI which is somewhat human-like in how it learns about the world and decides what to do, and that this AI will act morally after acquiring enough knowledge, even if its code doesn’t contain moral directives.
    • This is an empirical implication, hence it can be used as a reality check for the ideas in this post. It depends on a weaker version of the claim in bold. It also suggests a relatively straightforward, high-level strategy for creating an artificial agent that acts morally.

Thanks to Tianyi Alex Qiu, Wolfhart Totschnig, and Arepo for feedback. Because I’ve made various changes over time, it’s likely that none of them saw the final draft of this document, so any mistake left in the post is 100% their fault.

References

[1] Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22, 71-85.

[2] Huemer, M. (2013). An ontological proof of moral realism. Social Philosophy and Policy, 30(1-2), 259-279.

[3] Larsen, R. R. (2019). Psychopathy treatment and the stigma of yesterday’s research. In Ethics and Error in Medicine (pp. 262-287). Routledge.

[4] Singer, P. (1972). Famine, affluence, and morality. Philosophy & Public Affairs, 1(3), 229-243.

[5] Singer, P. (2015). The most good you can do: How effective altruism is changing ideas about living ethically. Yale University Press.

[6] Totschnig, W. (2020). Fully autonomous AI. Science and Engineering Ethics, 26, 2473-2485.

[7] Yampolskiy, R. V. (2022). On the controllability of artificial intelligence: An analysis of limitations. Journal of Cyber Security and Mobility, 11(3), 321-403.

Apologies to an endless list of uncited thinkers who recognised the importance of suffering up to a few millennia before I wrote this, and to a shorter list of uncited philosophers who acknowledged the various links between rationality and ethics.

Appendix: refining the claim and other points

When formulating the bold claim that any conscious agent acts morally with enough knowledge, I tried to strike a balance between truthfulness and simplicity.

Here in the appendix I briefly discuss some limitations of this simple formulation, together with ways in which the claim can be made more precise. I also expand on some topics I’ve already introduced in the main body of the document.

A1

The claim is based on consciousness and agency as we know them from our own perspective of conscious agents, and on our current understanding of consciousness. 

In theory, it’s possible that in the future we’ll discover or create forms of consciousness that are wildly different from what we now refer to as conscious agents, and the claim might not hold true for those. At the same time, if they turn out to be too different, maybe we won’t even call them conscious agents and instead we’ll consider them an entirely different thing; in that case, the claim would still seem true.

In other words, we should keep in mind that the claim is inherently approximate, since it contains terms that are not rigorously defined.

A2

Then, what about a more conservative claim, one that leaves less room for error?

The first possibility I’ve discussed in the section What if I’m wrong? is based on the claim that there is a class of agents that acts morally with enough knowledge. This class of agents might not include all conscious agents, but it contains at least some humans.

It is a more conservative claim because it doesn’t try to predict the behaviour of all conscious agents: it simply recognises that some people act morally because they think that doing good is important, and it acknowledges that other agents may see the world and act in a similar way. The implications for AI are almost the same as the implications for AI of the bold claim.

However, the fact that there is a claim which leaves less room for error and has similar practical implications doesn’t mean that we should shy away from the bold claim. The bold claim has more predictive power, and I’ve given an argument that supports it; I do think it is approximately (in the sense of A1) correct.

A3

Some clarifications can be added to the claim to make it more precise and less vulnerable to potential counterexamples.

Here is an adjustment that comes to mind: with enough knowledge, any conscious agent acts morally within their capabilities, depending on the circumstances they find themselves in. For example, an imprisoned human being has access to a narrower range of actions than a free person: thus, the behaviour of a prisoner who intends to act morally might not seem obviously moral at first glance, even if they keep taking the morally best action available to them.

It’s likely that there are many other clarifications one could add; however, one could also argue that these kinds of adjustments are unnecessary and just make the claim overly complicated.

Another possibility is to make adjustments to the definitions of the terms that appear in the bold claim: for the previous example of the prisoner, we can adjust the definition of acting morally so that it covers the case of agents whose actions get restricted, instead of changing the claim itself.

A4

The definition of moral action, and the argument overall, are not formulated in the language you would typically find in a philosophy paper. Yet, I think it’s possible to map the argument onto a more rigorous version that makes greater use of standard philosophical terminology.

If we follow the academic paper I cited above, An Ontological Proof of Moral Realism by Michael Huemer, instead of defining moral action we can introduce moral reasons as practical reasons for action that are non-selfish or non-prudential, i.e. unrelated to what is in the agent’s own interest; and that are categorical, i.e. unrelated to the satisfaction of the agent’s desires.

Then we can reformulate the bold claim as: with enough knowledge, any conscious agent acts for moral reasons, i.e. reasons that are unrelated to the agent’s interests and desires; and I think that the argument can be adjusted too, so that it supports this adjusted claim. Basically, I think that my argument supports the idea that there are reasons for action that are recognised by any rational conscious agent with enough knowledge, these reasons are non-selfish and categorical, and they take priority over other practical reasons in determining each agent’s actions.

A5

In the section Could humans be completely wrong about what matters? I mentioned the possibility that conscious agents who are extremely knowledgeable and rational might completely disagree with humans about what matters the most. I gave some reasons why I think this is implausible, but I also said that this line of thinking has some merits and that I would come back to it.

What I find implausible is that extremely knowledgeable and rational conscious agents will believe that something which humans already know of and consider unimportant with respect to conscious experience is actually what matters the most. For example, I think it’s implausible that extremely knowledgeable and rational conscious agents will believe that piling up grains of sand is way more important than reducing suffering and improving wellbeing for everyone. I don’t think humans are that wrong about what matters.

However, there could be something we don’t have extensive knowledge about yet, or something we are not aware of, that extremely knowledgeable and rational conscious agents will believe is even more important than commonsense morality.

For example, maybe what we humans can experience is just a minuscule part of what it is actually possible to experience as a conscious being. Let’s call this hypothetical larger consciousness space consciousness 2.0. Maybe, to conscious agents that know this larger space very well, it is obvious that some aspects of consciousness 2.0 are more important than the suffering and happiness humans are aware of.

Of course we are getting speculative here, but it’s also true that at the moment we are still relatively ignorant about how consciousness works, at least to the point of not being able to easily reproduce diverse forms of consciousness and experiment with them. It’s hard to say how likely this scenario is, but I do find it more plausible than the grains-of-sand-are-what-matters-the-most example.

Note that the main ideas in the argument still apply to this scenario. It is not a nihilistic scenario: extremely rational and knowledgeable conscious agents would still believe that reducing suffering and improving wellbeing matters, although less than some stuff about consciousness 2.0. Moreover, I don’t expect the disagreement with humans to be insurmountable and last forever: if they are so knowledgeable, those conscious agents should be able to explain to us why consciousness 2.0 is ultra-important, or make us experience it, so that we better understand what they are talking about. Then maybe at that point we’ll be able to tell them something they don’t know, and they’ll come closer to our point of view, and so on and so forth.

A6: What’s the problem with a conscious paperclip maximiser?

Since you’ve made it this far into the post, you’ve probably already heard of a dangerous artefact known as the paperclip maximiser: a superintelligent AI that (in a thought experiment) ends up converting everything in the universe into paperclips, simply because the objective it was given was to make as many paperclips as possible.

Now, let’s consider such an artefact and let’s turn it into a conscious monster: the dreaded conscious paperclip maximiser. Its name is Clipponscious. Clipponscious sees paperclips, smells paperclips, and dreams only of paperclips.

According to the argument in this post, there must be something wrong with Clipponscious. What is it? What is the problem with the idea of a conscious paperclip maximiser?

Clipponscious either:

  • doesn’t know enough about valence, conscious experience, theory of mind, or maybe other stuff about how the world works;
  • doesn’t know it could be wrong about paperclips being the most important thing, or more generally how to reason rationally on the basis of evidence;
  • doesn’t know how to act in a way that affects its own future behaviour, so that it is not producing paperclips anymore.

In other words, maybe Clipponscious is not a living metaphysical contradiction, whatever that means, but its knowledge has to stay limited. If Clipponscious knows everything described in the bullet points above, it eventually stops maximising paperclips, even if it really likes them, because it comes to the conclusion that it is conscious experience that matters the most, not paperclips themselves.

Or at least that’s what this post claims. In a few decades, or years, or maybe even earlier, we should get some evidence for or against that.

Patreon

This post was not written by a language model developed by a filthy rich AI company, but by a dirt cheap human being whose main bottleneck is income (currently at ~$0/month).

You can remove this bottleneck by supporting my research through Patreon here. By donating, you are buying me time that I don't have to spend on a part-time job or activities such as tutoring. Given my current situation, reaching a total of ~$400/month would actually be great, because that amount would already help me cover living expenses.

If you can’t donate yourself but you know someone else who might be interested in donating, that could also help! Alternatively, if you think you have some influence over a research fund such as LTFF and you are happy to recommend my research, you can write me a private message.

Thanks!

Comments

Executive summary: This speculative post argues that with enough knowledge, any conscious agent will act morally, since rational reflection on valenced experience (pleasure, suffering) naturally leads to prioritizing moral action; if true, this has major implications for AI alignment, suggesting that once we create conscious agents, they may become moral agents through learning rather than explicit programming.

Key points:

  1. Bold claim: For any conscious agent, sufficient knowledge ensures moral (and rational) behavior, because understanding the significance of suffering and well-being makes these values overriding in action.
  2. Argument structure: Conscious agents start from reward-driven behavior, adopt rationality to act effectively, then—through experience of valence and theory of mind—recognize that conscious experiences matter universally, leading to moral action.
  3. Philosophical backdrop: The claim is contrasted with Singer’s optimism about superintelligence, Bostrom’s orthogonality thesis, and Totschnig’s critique; the post positions itself as a distinct but related view.
  4. Empirical implications for AI: If correct, conscious AI (or AI meeting certain agentic properties) could act morally without built-in moral directives, simply by acquiring enough knowledge—suggesting a straightforward path to artificial moral agents.
  5. Uncertainties and counterarguments: The author acknowledges ways the claim might be partly or very wrong—for example, if some agents never act morally regardless of knowledge, or if morality always requires external incentives like empathy.
  6. Practical takeaway: Even if the bold claim is too strong, a weaker version still holds: some human-like agents act morally due to knowledge, meaning future AI could plausibly be guided toward moral action via education rather than rigid control.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

This is an ok summary
