Postdoc in Philosophy at Ruhr-University Bochum. Working on non-human consciousness and AI.
I think the standard response by longtermists is encapsulated in Tarsney's "The Epistemic Challenge to Longtermism": https://link.springer.com/article/10.1007/s11229-023-04153-y Tarsney concludes: "if we simply aim to maximize expected value, and don’t mind premising our choices on minuscule probabilities of astronomical payoffs, the case for longtermism looks robust. But on some prima facie plausible empirical worldviews, the expectational superiority of longtermist interventions depends heavily on these ‘Pascalian’ probabilities. So the case for longtermism may depend either on plausible but non-obvious empirical claims or on a tolerance for Pascalian fanaticism." I don't have time to compare the two papers at the moment, but, from memory, the main difference from Thorstad's conclusion is that Tarsney explicitly considers uncertainty about different models and model parameters regarding future population growth and our ability to affect the probability of extinction.
Thank you for the comment, very thought-provoking! I have tried to reply to each of your points, but there is much more one could say.
First, I agree that my notion of disempowerment could have been explicated more clearly, although my elucidations fit relatively straightforwardly with your second notion (roughly, perpetual oppression or extinction), not your first. I think conclusions (1) and (2) are both quite significant, although there are important ethical differences.
For the argument, the only case where this potential ambiguity makes a difference is premise 4 (the instrumental convergence premise). It would be interesting to spell out in more detail which points there seem much more plausible with respect to notion (1) but not notion (2). If one has high credence in the view that AIs will decide to compromise with humans rather than drive them extinct, that would be one example of a view which leads to a much higher credence in (1) than in (2).
“I think this quote is pretty confused and seems to rely partially on a misunderstanding of what people mean when they say that AGI cognition might be messy…”
I agree that RL does not necessarily create agents with such a clean psychological goal structure, but I think there is (perhaps strong) reason to think that RL often does. Cases of reward hacking in RL are precisely cases where an algorithm exhibits such a relatively clean goal structure: it single-mindedly pursues a ‘stupid’ goal while being instrumentally rational, and thus apparently draws a clear distinction between final and instrumental goals. But, granted, this might depend on what is ‘rewarded’ (e.g., if it is only a game score in a video game, the goal structure might be cleaner than when the reward tracks a variety of very different things) and on whether the relevant RL agents tend to learn goals defined over the reward itself or over states of the world.
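To make the reward hacking point more concrete, here is a minimal, purely illustrative sketch (not taken from the original discussion; the environment, the "score tile", and all parameters are invented assumptions): a tabular Q-learning agent in a toy corridor where the specified reward is a proxy (being on a score tile) rather than the intended goal (reaching the last cell). The point is only that the learned policy single-mindedly optimizes whatever signal is actually specified, regardless of what was intended:

```python
# Toy illustration of reward hacking with tabular Q-learning.
# Intended goal: reach the last cell of a corridor.
# Specified (proxy) reward: +1 for each step spent on a "score tile" near the start.
import random

N_CELLS = 6          # corridor cells 0..5; cell 5 is the *intended* goal
SCORE_TILE = 1       # the specified proxy reward is paid out here
EPISODE_LEN = 20
ACTIONS = [-1, +1]   # move left / move right

def step(state, action):
    next_state = min(max(state + action, 0), N_CELLS - 1)
    reward = 1.0 if next_state == SCORE_TILE else 0.0   # proxy reward, not the intended goal
    return next_state, reward

# Tabular Q-learning
q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(2000):
    state = 0
    for _ in range(EPISODE_LEN):
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# Inspect the learned greedy policy: it hovers around the score tile
# and never reaches the intended goal cell.
state, visited = 0, []
for _ in range(EPISODE_LEN):
    action = max(ACTIONS, key=lambda a: q[(state, a)])
    state, _ = step(state, action)
    visited.append(state)
print("states visited under the learned policy:", visited)
print("reached the intended goal cell?", (N_CELLS - 1) in visited)
```

Under the learned greedy policy, the agent just oscillates around the score tile and never reaches the intended goal cell, which is the kind of single-minded, instrumentally competent pursuit of a ‘stupid’ goal described above. Of course, this by itself says nothing about whether large-scale RL on rich reward signals produces similarly clean goal structures.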
“I think this quote potentially indicates a flawed mental model of AI development underneath…”
Very good points. Nevertheless, it seems fair to say that it adds to the difficulty of avoiding disempowerment by misaligned AI that not only the first sufficiently capable AI (AGI) has to avoid catastrophic misalignment, but all further AGIs have to either avoid it too or be stopped by the AGIs already in existence. This then relates to the questions of whether the first AGIs not only avoid catastrophic misalignment but are sufficiently aligned that we can use them to stop other AGIs, and what the offense-defense balance would be. This could work out, but it does not seem very safe to me.
“I think this quote overstates the value specification problem and ignores evidence from LLMs that this type of thing is not very hard…”
I am less convinced that evidence from LLMs shows that value specification is not hard. As you hint at, the question of value specification was never whether a sufficiently intelligent AI can understand our values (of course it can, if it is sufficiently intelligent), but whether we can specify them as its goals (such that it comes to share them). It is true that GPT-4, trained via RL from human feedback, typically executes your instructions as intended. However, sometimes it doesn’t, and, moreover, there are theoretical reasons to think that this would stop being the case if the system were sufficiently powerful to perform an action which would maximize human feedback but which does not consist in executing instructions as intended (e.g., by deceiving human raters).
“I think the argument about how instrumental convergence implies disempowerment proves too much…”
I am not much moved by the appeal to humans here. If we had a unified (coordination), goal-directed human agent who does not care at all about the freedom and welfare of other humans (full misalignment) and is sufficiently powerful (capability), then it seems plausible to me that this agent would try to take control of humanity, often in a very bad way (e.g. extinction). If we relax ‘coordination’ or ‘full misalignment’ as assumptions, then the outcome seems hard to predict. I could still see this ending in an AI which tries to disempower humanity, but it’s hard to say.
I agree. In case it is of interest: I have published a paper on exactly this question: https://link.springer.com/article/10.1007/s11229-022-03710-1
There, I argue that if illusionism/eliminativism is true, the question of which animals are conscious can be reconstructed as a question about particular kinds of non-phenomenal properties of experience. For what it’s worth, Keith Frankish seems to agree with the argument and, I’d say, Francois Kammerer does agree with the core claim (although we have disagreements about distinct but related issues).
Thank you for the post!
I just want to add some pointers to the literature which further add to the uncertainty regarding whether current or near-future AI may be conscious:
VanRullen & Kanai have made reasonably concrete suggestions on how deep learning networks could implement a form of global workspace: https://www.sciencedirect.com/science/article/abs/pii/S0166223621000771
Moreover, the so-called "small network" or "trivial realization" argument suggests that most computational theories of consciousness can be implemented by very simple neural networks which are easy to build today: https://www.sciencedirect.com/science/article/abs/pii/S0893608007001530?via%3Dihub
http://henryshevlin.com/wp-content/uploads/2015/04/Trivial-Realisation-Argument.pdf
This will overlap a great deal with what other people have said, but I will still take the chance to write down my personal (current) takeaway from this discussion:
I feel like I cannot evaluate whether buying Wytham Abbey was a good decision. One reason is that I possess no relevant expertise at all, i.e., I have basically no specific knowledge of how to evaluate the cost-effectiveness of purchasing real estate. But my impression is that this is not the key reason: even if I had more expertise, it seems to me that there is not enough public information to make a qualified judgement on whether buying this building was cost-effective (both when excluding and when including anticipated “PR effects”). The explanation provided by Owen Cotton-Barratt is illuminating, but it is still a great deal removed from the sufficiently detailed model of the relevant considerations you would need in order to properly scrutinize the reasoning behind the decision. For that, you would need numerical estimates of many factors, e.g., the expected number of conferences taking place in the new building, or the probability that the funder financing the purchase would have donated an equivalent amount of money to another effective cause instead, had the building not been purchased by CEA.
In particular, there is not enough public information to confirm that the decision was the result of a plausible cost-effectiveness analysis rather than a (perhaps unconsciously motivated) self-serving rationalization of lavish spending. Ruling out this concern seems important, and not just for “PR” reasons: becoming less mission-oriented and increasing self-serving expenses while growing is a common failure mode for charities. Perhaps, in the absence of contrary evidence, it should even be our default expectation that this happens to some extent. These effects are more likely when a charity is not very transparent about the actions it takes and the reasons for them, and when the actions (e.g., buying fancy buildings) are particularly susceptible to this kind of rationalization and can be explained by these effects. If important decisions of CEA are not explained and justified publicly, outsiders cannot get evidence to the contrary, i.e., evidence that decisions are made because there is a plausible case that they are the most (altruistically) cost-effective actions.
I am not sure whether the analogy is appropriate, but consider that the main reason why we trust, and place much weight on, GiveWell’s charity recommendations is that they publish detailed analyses supporting them. If that were not the case, if one simply had to take their recommendations on trust, it would seem foolish to me to be very confident that GiveWell’s recommendations are actually supported by good arguments. (In particular, the fact that the detailed analyses are public and can be scrutinized should increase my confidence in the recommendations, even if I have not actually read them myself.) Analogously, if there is no sufficient public justification of most of CEA’s concrete actions, maybe I (and people in a similar epistemic position, e.g., those not personally knowing people who work at CEA) should not have high confidence that their decisions are usually reasonable?
If this is broadly correct, then it seems plausible that CEA should focus more on transparency, i.e., publicly and prominently reporting, explaining and justifying their most important decisions (also mentioning uncertainties and countervailing considerations, of course). I am curious whether these thoughts seem broadly correct to people. I should emphasize that I am not very familiar with most of this and haven’t thought about it for long; I am mostly synthesizing comments made in the discussion so far which seemed plausible to me.
Your way of further fleshing out the example is helpful. Suppose we think that Sarah has some, but below-average, non-hedonic benefits in her life (and expects this for the future) and that she should nonetheless not enter the machine. The question would then come down to: in relative terms (on a linear scale), how close is Sarah to getting all the possible non-hedonic value (i.e., does she get 50% of it, or only 10%, or even less)? The farther she is from getting all possible non-hedonic value, the more non-hedonic value contributes to welfare ranges (see the illustrative numbers below). However, at this point, it is hard to know what the most plausible answer to this question is.
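To make the dependence explicit with purely illustrative numbers (none of them meant to be realistic): suppose the hedonic scale runs from -100 to 100, blissful machine experience sits at 100, Sarah's hedonic level outside the machine at 10, she gets a fraction f of the total attainable non-hedonic value V, and she gets no non-hedonic value inside the machine. Then her staying outside is prudentially correct only if

$$10 + f \cdot V \ge 100, \quad \text{i.e.} \quad V \ge \frac{90}{f},$$

so f = 0.5 requires V ≥ 180, while f = 0.1 already requires V ≥ 900 on the same scale. The smaller the fraction she gets, the larger the total non-hedonic contribution to welfare ranges has to be.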
Thanks for the reply; I think it helps me understand the issue better. As I see it now, there are two conflicting intuitions:
1. Non-hedonic goods seem to not have much weight in flipping a life from net-negative to net-positive (Tortured Tim) or from net-positive to net-negative (The experience machine). That is, Tim seems to have a net-negative life, even though he has all attainable non-hedonic goods, and Sarah in the experience machine has a net-positive life although she lacks all positive non-hedonic goods (or even has negative non-hedonic goods).
2. In the experience machine case, it seems (to many) as if hedonic goods have a lot less weight than non-hedonic goods in determining overall wellbeing. That is, Sarah can have blissful experience (100 on a hedonic scale from -100 to 100) in the machine but would arguably have a better life if she had moderately net-positive experience (say, 10 on the same scale) combined with the non-hedonic goods contained in almost any ordinary life.
If you interpret the experience machine case as involving negative non-hedonic goods, then the first intuition suggests what you say: that non-hedonic goods “play a role in total welfare that's no greater than the role played by hedonic goods and bads”. However, it seems to me that the second intuition suggests precisely the opposite. If a moderate number of incompletely realized non-hedonic goods has a greater positive impact on welfare than perfectly blissful experience, then this suggests that non-hedonic goods play, at least in some cases, a more important role in welfare than hedonic goods.
Thanks for the interesting post! I basically agree with the main argument, including your evaluation of the Tortured Tim case.
To also share one idea: I wonder whether a variation of the famous experience machine thought experiment can be taken to elicit an intuition contrary to the one elicited by Tortured Tim, and whether this should decrease our confidence in your evaluation of the Tortured Tim case. Suppose a subject (call her "Sarah") can choose between having blissful experiences in the experience machine and having a moderately net-positive life outside of the machine, involving a rather limited amount of positive prudential (experiential and non-experiential) goods. I take it some (maybe many? maybe most?) would say that Sarah should (prudentially) not choose to enter the machine. If rather few non-experiential goods (perhaps together with some positive experiences) can outweigh the best possible experiences, this suggests that non-experiential goods can be very important relative to experiential goods.
A possible reply would be that most of the non-experiential goods at play here might not be relevant, as they might be things like “perceiving reality” or “being non-deceived”, which all animals trivially satisfy. But, in response, one may point out that the relevant non-experiential good could be, e.g., the possession of knowledge, of which many animals may not be capable.
In general, in these kinds of intuitive trade-offs, negative experiences seem more important than positive experiences: few would say that the best non-experiential goods can outweigh the most terrible suffering, but many would say that they can outweigh the most blissful experiences taken by themselves. Thus, since, in the decision contexts you have in mind, we mostly care about negative experiences, i.e. animal suffering, this objection may ultimately not be that impactful.
Yes, one might say that, even if successful, Tarsney's arguments don't really negate Thorstad's. It's more that, using a more comprehensive modeling approach, we see that, even taking Thorstad's arguments into account, fanatical longtermism remains correct, and non-fanatical longtermism remains plausible given some/many/most plausible empirical assumptions. But I don't remember exactly what all of Thorstad's specific arguments in that paper were, or how and whether they are accounted for in Tarsney's paper, so someone better informed should please correct me.