

Postdoc in Philosophy at the University of Erlangen-Nürnberg. Working on non-human consciousness and AI.


Thank you for the comment, very thought-provoking! I have tried to reply to each of your comments, but there is much more one could say.

First, I agree that my notion of disempowerment could have been explicated more clearly, although my elucidations fit relatively straightforwardly with your second notion (mainly, perpetual oppression or extinction), not your first. I think conclusions (1) and (2) are both quite significant, although there are important ethical differences.

For the argument, the only case where this potential ambiguity makes a difference is with respect to premise 4 (the instrumental convergence premise). It would be interesting to spell out which points there seem much more plausible with respect to notion (1) than with respect to notion (2). If one has high credence in the view that AIs will decide to compromise with humans, rather than extinguish them, this would be one example of a view which leads to a much higher credence in (1) than in (2).

“I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy…”

I agree that RL does not necessarily create agents with such a clean psychological goal structure, but I think that there is (maybe strong) reason to think that RL often creates such agents. Cases of reward hacking in RL algorithms are precisely cases where an algorithm exhibits such a relatively clean goal structure, single-mindedly pursuing a ‘stupid’ goal while being instrumentally rational and thus apparently having a clear distinction between final and instrumental goals. But, granted, this might depend on what is ‘rewarded’, e.g. if it’s only a game score in a video game, then the goal structure might be cleaner than when it is a variety of very different things, and on whether the relevant RL agents tend to learn goals over rewards or some states of the world.

“I think this quote potentially indicates a flawed mental model of AI development underneath…”

Very good points. Nevertheless, it seems fair to say that it adds to the difficulty of avoiding disempowerment from misaligned AI that not only the first sufficiently capable AI (AGI) has to avoid catastrophic misalignment, but all further AGIs have to either avoid this too or be stopped by the AGIs already in existence. This then relates to the questions of whether the first AGIs not only avoid catastrophic misalignment but are also sufficiently aligned that we can use them to stop other AGIs, and of what the offense-defense balance would be. It could be that this works out, but it does not seem very safe to me.

“I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard…”

I am less convinced that evidence from LLMs shows that value specification is not hard. As you hint at, the question of value specification was never taken to be whether a sufficiently intelligent AI can understand our values (of course it can, if it is sufficiently intelligent), but whether we can specify them as its goals (such that it comes to share them). It is true that a system like GPT-4, trained via RL from human feedback, typically executes your instructions as intended. However, sometimes it doesn’t and, moreover, there are theoretical reasons to think that this would stop being the case if the system were sufficiently powerful to perform an action which would maximize human feedback but which does not consist in executing instructions as intended (e.g., by deceiving human raters).

“I think the argument about how instrumental convergence implies disempowerment proves too much…”

I am not moved by the appeal to humans here that much. If we had a unified (coordination) human agent (goal-directedness) who does not care about the freedom and welfare of other humans at all (full misalignment) and is sufficiently powerful (capability), then it seems plausible to me that this agent would try to take control of humanity, often in a very bad sense (e.g. extinction). If we relax ‘coordination’ or ‘full misalignment’ as assumptions, then this seems hard to predict. I could still see this ending in an AI which tries to disempower humanity, but it’s hard to say.

I agree. In case of interest: I have published a paper on exactly this question: https://link.springer.com/article/10.1007/s11229-022-03710-1

There, I argue that if illusionism/eliminativism is true, the question of which animals are conscious can be reconstructed as a question about particular kinds of non-phenomenal properties of experience. For what it’s worth, Keith Frankish seems to agree with the argument and, I’d say, Francois Kammerer does agree with the core claim (although we have disagreements about distinct but related issues).

Thank you for the post! 

I just want to add some pointers to the literature which also add to the uncertainty regarding whether current or near-future AI may be conscious: 

VanRullen & Kanai have made reasonably concrete suggestions on how deep learning networks could implement a form of global workspace: https://www.sciencedirect.com/science/article/abs/pii/S0166223621000771 

Moreover, the so-called "small network" or "trivial realization" argument suggests that most computational theories of consciousness can be implemented by very simple neural networks which are easy to build today: https://www.sciencedirect.com/science/article/abs/pii/S0893608007001530?via%3Dihub


Thank you very much for this post and all the other essays in the Moral Weight Sequence! They were a pleasure to read and I expect that I will revisit some of them many times in the future. 

This will overlap a great deal with what other people have said, but I will still take the chance to write down my personal (current) takeaway from this discussion: 

I feel like I cannot evaluate whether buying Wytham Abbey was a good decision. A reason for this is that I possess no relevant expertise at all, i.e., I have basically no specific knowledge of how to evaluate the cost-effectiveness of purchasing real estate. But my impression is that this is not the key reason: Even if I had more expertise, it seems to me that there is not enough public information to actually make a qualified judgement on whether buying this building was cost-effective (both when excluding and when including anticipated “PR effects“). The explanation provided by Owen Cotton-Barratt is illuminating but it’s still a great deal removed from what you would need to build a detailed enough model of the relevant considerations for the decision to properly scrutinize its reasoning. To do this, you would need numerical estimates regarding many factors, e.g., the estimated number of conferences taking place in the new building and the probability that the funder financing the purchase would have donated an equivalent amount of money to another effective cause instead, if the building were not purchased by CEA. 

In particular, there is not enough public information to confirm that the decision was the result of a plausible cost-effectiveness analysis rather than a (perhaps unconsciously caused) self-serving rationalization of lavish spending. Ruling out this concern seems important, and not just for “PR” reasons: Becoming less mission-oriented and increasing self-serving expenses while growing is a common failure mode for charities. Perhaps, in the absence of contrary evidence, it should even be our default expectation that this happens (to some extent). In particular, these effects are more likely when a charity is not very transparent about the actions it takes and the reasons for them and when the actions (e.g., buying fancy buildings) are very liable to this kind of rationalization and can be explained by these effects. If important decisions of CEA are not explained and justified publicly, outsiders cannot get evidence to the contrary, i.e., evidence that decisions are made because there is a plausible case that they are the most (altruistically) cost-effective actions. 

I am not sure whether the analogy is appropriate, but consider that the main reason why we trust, and place much weight on, GiveWell’s charity recommendations is that they publish detailed analyses supporting their recommendations. If that were not the case, and one had to simply take their recommendations on trust, it would seem foolish to me to be very confident that GiveWell’s recommendations are actually supported by good arguments. (In particular, the fact that the detailed analyses are public and can be scrutinized should increase my confidence in the recommendations, even if I have not actually read them myself.) Analogously, if there is no sufficient public justification of most of CEA’s concrete actions, maybe I (and people in a similar epistemic position, e.g., not personally knowing people who actually work at CEA) should not have high confidence that their decisions are usually reasonable?

If this is broadly correct, then it seems plausible that CEA should focus more on transparency, i.e., publicly and prominently reporting, explaining and justifying their most important decisions (also mentioning uncertainties and counteracting considerations, of course). I am curious whether these thoughts seem broadly correct to people. I should emphasize that I am not very familiar with and haven’t thought long about most of this, I am mostly synthesizing comments made in the discussion so far which seemed plausible to me.

Your way of further fleshing out the example is helpful. Suppose we think that Sarah has some, but below-average, non-hedonic benefits in her life (and expects this for the future) and that she should nonetheless not enter the machine. The question would then come down to: In relative terms (on a linear scale), how close is Sarah to getting all the possible non-hedonic value (i.e., does she get 50% of it, or only 10%, or even less)? The farther away she is from getting all possible non-hedonic value, the more non-hedonic value contributes to welfare ranges. However, at this point, it is hard to know what the most plausible answer to this question is.
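
To make the dependence on that fraction concrete, here is a rough sketch under assumptions that go beyond what is said above: welfare is taken to be simply additive in hedonic and non-hedonic value, both are measured on one common scale, the machine provides no non-hedonic value, and the hedonic numbers (100 inside the machine, 10 outside) are purely illustrative. The symbols $H_m$, $H_o$, $f$ and $N_{\max}$ are introduced only for this illustration.

```latex
% Assumed, illustrative quantities:
%   H_m = 100   hedonic value inside the machine
%   H_o = 10    hedonic value of Sarah's life outside
%   N_max       maximum attainable non-hedonic value
%   f           fraction of N_max that Sarah actually gets, 0 < f <= 1
%
% "Sarah should not enter the machine" then amounts to:
\[
  H_o + f\,N_{\max} > H_m
  \quad\Longleftrightarrow\quad
  N_{\max} > \frac{H_m - H_o}{f} = \frac{90}{f}.
\]
% Examples:
%   f = 0.5  gives  N_max > 180
%   f = 0.1  gives  N_max > 900
```

On these assumptions, the smaller the share of possible non-hedonic value Sarah gets, the larger the maximum non-hedonic value must be for the intuition to hold. A non-additive account of welfare could of course give a different picture.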

Thanks for the reply, I think it helps me to understand the issue better. As I see it now, there are two conflicting intuitions: 

1. Non-hedonic goods seem to not have much weight in flipping a life from net-negative to net-positive (Tortured Tim) or from net-positive to net-negative (The experience machine). That is, Tim seems to have a net-negative life, even though he has all attainable non-hedonic goods, and Sarah in the experience machine has a net-positive life although she lacks all positive non-hedonic goods (or even has negative non-hedonic goods). 

2. In the experience machine case, it seems (to many) as if hedonic goods have a lot less weight than non-hedonic goods in determining overall wellbeing. That is, Sarah can have blissful experience (100 on a hedonic scale from -100 to 100) in the machine but would arguably have a better life if she had moderately net-positive experience (say, 10 on the same scale) combined with the non-hedonic goods contained in almost any ordinary life.

If you interpret the experience machine case as involving negative non-hedonic goods, then the first intuition suggests what you say: that non-hedonic goods “play a role in total welfare that's no greater than the role played by hedonic goods and bads”. However, the second intuition does suggest precisely the opposite, it seems to me. If a moderate number of incompletely realized non-hedonic goods has a higher positive impact on welfare than perfectly blissful experience, then this suggests that non-hedonic goods play a more important role in welfare than hedonic goods, in some cases.

Thanks for the interesting post! I basically agree with the main argument, including your evaluation of the Tortured Tim case.

To also share one idea: I wonder whether a variation of the famous experience machine thought experiment can be taken to elicit a contrary intuition to Tortured Tim and whether this should decrease our confidence in your evaluation of the Tortured Tim case. Suppose a subject (call her "Sarah") can choose between having blissful experiences in the experience machine and having a moderately net-positive life outside of the machine, involving a rather limited amount of positive prudential (experiential and non-experiential) goods. I take it some (maybe many? maybe most?) would say that Sarah should (prudentially) not choose to enter the machine. If rather few non-experiential goods (maybe together with some positive experiences) can weigh more than the best possible experiences, this suggests that non-experiential goods can be very important, relative to experiential goods.

A possible reply would be that most of the relevant non-experiential goods might not be relevant, as they might be things like “perceiving reality” or “being non-deceived” which all animals trivially satisfy. But, in response, one may reply that the relevant non-experiential good is, e.g., the possession of knowledge which many animals may not be capable of. 

In general, in these kinds of intuitive trade-offs negative experiences seem more important than positive experiences. Few would say that the best non-experiential goods can be more important than the most terrible suffering, but many would say that the best non-experiential goods can be more important than the most blissful experiences alone. Thus, since, in the decision contexts you have in mind, we mostly care about negative experiences, i.e. animal suffering, this objection may ultimately not be that impactful.

Thank you for your replies! In essence, I don’t think I disagree much with any of your points. I will mainly add different points of emphasis: 

I think one argument I was gesturing at is a kind of divide-and-conquer strategy where some standard moves of utilitarians or moral uncertainty adherents can counter some of the counterintuitive implications (walks to crazy town) you point to. For instance, the St. Petersburg Paradox seems to be an objection to expected value utilitarianism, not to every form of the view. Similarly, some of the classical counterexamples to utilitarianism (e.g., some variants of trolley cases) involve violations of plausible deontological constraints. Thus, if you have a non-negligible credence in a moral view which posits unconditional prohibitions of such behavior, you don’t need to buy the implausible implication (under moral uncertainty). But you are completely correct that there will remain some, maybe many, implications that many find counterintuitive or crazy, e.g., the (very) repugnant conclusion (if you are a totalist utilitarian). Personally, I tend to be less troubled by these cases and suspect that we perhaps should bite some of these bullets, but to justify this would of course require a longer argument (which someone with different intuitions is unlikely to be tempted by, in any case).

The passage of your text which seemed most relevant to multi-level utilitarianism is the following: "In practice, Effective Altruists are not willing to purchase theoretical coherence at the price of absurdity; they place utilitarian reasoning in a pluralist context. They may do this unreflectively, and I think they do it imperfectly; but it is an existence proof of a version of Effective Altruism that accepts that utility considerations are embedded in a wider context, and tempers them with judgment." One possible explanation of this observation is that the EAs who are utilitarians are often multi-level utilitarians who consciously and intentionally use considerations beyond maximizing utility in practical decision situations. If that were true, it would raise the interesting question of what difference adopting a pluralist normative ethics, as opposed to a universal-domain utilitarianism, would make for effective altruist practice (I do not mean to imply that there aren’t differences).

With respect to moral uncertainty, I interpret you as agreeing that the most common effective altruist views actually avoid fanaticism. This then raises the question of whether accepting incomparability at the meta-level (between normative theories) gives you reasons to also (or instead) accept incomparability at the object-level (between first-order moral reasons for or against actions). I am not sure about that. I am sympathetic to your point that it might be strange to hold that 'we can have incomparability in our meta-level theorising, but it must be completely banned from our object-level theorising'. At the same time, at least some of the reasons for believing in meta-level incomparability are quite independent of the relevant object-level arguments, so you might have good reasons to believe in it only on the meta-level. Also, the sort of incomparability seems different. As I understand your view, it says that different kinds of moral reasons can favor or oppose a course of action such that we sometimes have to use our faculties for particular, context-sensitive moral judgements, without being able to resort to a universal meta-principle that tells us how to weigh the relevant moral reasons. By contrast, the moral uncertainty view posits precisely such a meta-principle, e.g. variance voting. So I can see how one might think that the second-order incomparability is acceptable but yours is unacceptable (although this is not my view).

Thank you for the post which I really liked! Just two short comments: 

  1. It is not clear to me why the problems of utilitarianism should inevitably lead to a form of fanaticism, under promising frameworks for moral uncertainty. At least this seems not to follow on the account of moral uncertainty of MacAskill, Bykvist and Ord (2020), which is arguably the most popular one, for at least two reasons: a) Once the relevant credence distribution includes ethical theories which are not intertheoretically comparable or are merely ordinal-scale, then theories in which one has small credences (including totalist utilitarianism) won’t always dictate how to act. b) Some other ethical theories, e.g. Kantian theories which unconditionally forbid killing, seem (similar to totalist utilitarianism) to place extremely high (dis)value on certain actions.
  2. It would be interesting to think about how distinctions between different versions of utilitarianism would factor into your argument. In particular, you could be an objective utilitarian (who thinks that the de facto moral worth of an action is completely determined by its de facto consequences for total wellbeing) without believing (a) that expected value theory is the correct account of decision making under uncertainty or (b) that the best method in practice for maximizing wellbeing is to frequently explicitly calculate expected value. An objective utilitarianism that rejects (b) would be so-called multi-level utilitarianism.