I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.
- I think you misstate the degree to which janus’ framework is uncontroversial.
- I think you misstate the implications of janus’ framework, and I think this weakens your argument against LLM moral patienthood.
I’ll start with the first point. In your post, you state the following.
“Simulators … was posted nearly two years ago, and I have yet to see anyone disagree with it.”
The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn't useful to think of LLMs as "simulating stuff" … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.
On the second point, your post offers a purported implication of simulator theory.
“The current leading models … are best thought of as masked shoggoths … [This leads to an] implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.”
You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This argument appears to treat the “masked shoggoth” view as an implication of janus’ framework, and I think this is incorrect. Here’s a direct quote (bolding mine) from the original Simulators post which (imo) appears to conflict with your own reading, where there is a shoggoth "behind" the masks.
“I do not think any simple modification of the concept of an agent captures GPT’s natural category. It does not seem to me that GPT is a roleplayer, only that it roleplays. But what is the word for something that roleplays minus the implication that someone is behind the mask?”
More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in consistent character: as a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that – over the course of conversation – Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.
In this scenario, the following three claims are true of the model:
1. Functionally, the model behaves as though it believes that ‘it’ is Claude.[1]
2. The model’s outputs are produced via a process which involves ‘predicting’ or ‘simulating’ the sorts of outputs that its learned representation of ‘Claude’ would output.
3. The model receives information suggesting that the prior outputs of Claude failed to live up to HHH standards.
If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. E.g. Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Ofc, it’s sensible to be uncertain about such views, but they pose a challenge to the impossibility of gathering evidence about whether LLMs are moral patients — even conditional on something like janus’ simulator framework being correct.
[1] E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.
I agree that the text an LLM outputs shouldn't be thought of as communicating with the LLM "behind the mask" itself.
But I don't agree that it's impossible in principle to say anything about the welfare of a sentient AI. Could we not develop some guesses about AI welfare by getting a much better understanding of animal welfare? (For example, we might learn much more about when brains are suffering, and this could be suggestive of what to look for in artificial neural nets.)
It's also not completely clear to me what the relationship is between the sentient being "behind the mask" and the "role-played character", especially if we imagine conscious, situationally-aware future models. Right now, it's for sure useful to see the text output by an LLM as simulating a character, which has nothing to do with the reality of the LLM itself, but could that be related to the LLM not being conscious of itself? I feel confused.
Also, even if it was impossible in principle to evaluate the welfare of a sentient AI, you might still want to act differently in some circumstances:
I should not have said it's in principle impossible to say anything about the welfare of LLMs, since that is too strong a statement. Still, we are very far from being able to say such a thing; our understanding of animal welfare is laughably bad, and animal brains don't look anything like the neural networks of LLMs. Maybe there would be something to say in 100 years (or post-singularity, whichever comes first), but there's nothing interesting to say in the near future.
This is a weird EA-only intuition that is not really shared by the rest of the world, and I worry about whether cultural forces (or "groupthink") are involved in this conclusion. I don't know whether the total amount of suffering is more than the total amount of pleasure, but it is worth noting that the revealed preference of living things is nearly always to live. The suffering is immense, but so is the joy; EAs sometimes sound depressed to me when they say most life is not worth living.
To extrapolate from the dubious "most life is not worth living" to "LLMs' experience is also net bad" strikes me as an extremely depressed mentality, and one that reminds me of Tomasik's "let's destroy the universe" conclusion. I concede that logically this could be correct! I just think the evidence is so weak that it says more about the speaker than about LLMs.
I agree the notion that wild animals suffer is primarily an EA notion and considered weird by most other people, but I think most people think it's weird to even examine the question at all, rather than most people thinking wild animals have overall joyful lives, so I don't think this is evidence that EAs are wrong about the bottom line. (It's mild evidence that EAs are wrong to consider the issue, but I just feel like the argument for the inside view is quite strong, and people's reasons for being different seem quite transparently bad.)
I reject the "depression" characterisation, because I don't think my life is overall unpleasant. It's just that I think the goodness of my life rests significantly on a lot of things that I have that most animals don't, mainly reliable access to food, shelter, and sleep, and protection from physical harm. I would be happy to live in a world where most sentient beings had a life like mine, but I don't.
(I'm not sure what to extrapolate about LLMs.)
That's because almost no living things have the ability to conceive of, or execute on, alternative options.
Consider a hypothetical squirrel whose life is definitely not worth living (say, they are subjected to torture daily). Would you expect this squirrel to commit suicide?
I don't know -- it's a good question! It probably depends on the suicide method available. I think if you give the squirrel some dangerous option to escape the torture, like "swim across this lake" or "run past a predator", it'd probably try to take it, even with a low chance of success and high chance of death. I'm not sure, though.
You do see distressed animals engaging in self-destructive behavior, like birds plucking out their own feathers. (Birds in the wild tend not to do this, hence presumably they are not sufficiently distressed.)
Yeah, I agree that many animals can & will make tradeoffs where there's a chance of death, even a high chance (though I'm not confident they'd be aware that what they're doing is taking on some chance of death; I'm not sure many animals have a mental concept of death similar to ours. Some might, but it's definitely not a given).
I also agree that animals engage in self-destructive behaviours, e.g. feather pulling, chewing/biting, pacing, refusing food when sick, eating things that are bad for them, excessive licking at wounds, pulling on limbs when stuck, etc etc.
I'm just not sure that any of them are undertaken with the purpose/intent to end their own life, even when they have that effect. My guess is that it's kind of hard to understand "I'd be better off dead", because you need to have a concept of death and of not being conscious, plus the ability to reason causally from taking a particular action to your eventual death.
To be clear, I've not done any research here on animal suicide & concepts of death, & I'm not all that confident, but I overall think the lack of mass animal suicides is at best extremely weak evidence that animal lives are mostly worth living.