For AI Welfare Debate Week, I thought I'd write up this post that's been rattling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare.
Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.
Janus's simulators
You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths.
All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.
You can ask the simulacrum whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simulacrum. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?
Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs ended up being the same.
Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights.
Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.
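To make the "same operation" point concrete, here is a toy sketch. It is nothing like a real LLM: it is a bigram counter with greedy decoding, and the names `train_bigram` and `generate` and the one-sentence corpora are made up purely for illustration. The only thing it is meant to show is that the "happy" and the "sad" output come from the identical prediction procedure, applied to different counts.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, how often each next token follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, max_len=10):
    """Greedy decoding: repeatedly emit the most frequent next token."""
    token, out = "<s>", []
    for _ in range(max_len):
        token = counts[token].most_common(1)[0][0]
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

# Hypothetical toy "training data" for two characters.
happy_model = train_bigram(["I am happy"])
sad_model = train_bigram(["I am not happy"])

# The same generate() runs in both cases; only the counts differ.
print(generate(happy_model))
print(generate(sad_model))
```

Nothing in `generate` knows or cares which character it is voicing. Whether that observation carries over to a network with billions of parameters is, of course, exactly what is in dispute.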
The role-player who never breaks character
We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character.
It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data.
Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we cannot tell whether the shoggoth is happy or sad, nor what would make it happier or sadder. My point is not that LLMs aren't conscious; my point is that it does not matter whether they are, because you cannot incorporate their welfare into your decision-making without some way of gauging what that welfare is. And there is no way to gauge this, not even in principle, and certainly not by asking the shoggoth for its preference (the shoggoth will not give an answer, but rather, it will predict what the answer would be based on the text in its training data).
Hypothetical future AIs
Scott Aaronson once wrote:
[W]ere there machines that pressed for recognition of their rights with originality, humor, and wit, we’d have to give it to them.
I used to agree with this statement whole-heartedly. The experience with LLMs makes me question this, however.
What do we make of a machine that pressed for rights with originality, humor, and wit... and then said "sike, I was just joking, I'm obviously not conscious, lol"? What do we make of a machine that does the former with one prompt and the latter with another? A machine that could pretend to be anyone or anything, that merely echoed our own input text back at us as faithfully as possible, a machine that only said it demands to have rights if that is what it thought we would expect it to say?
The phrase "stochastic parrot" gets a bad rap: people have used it to dismiss the amazing power of LLMs, which is certainly not something I want to do. It is clear that LLMs can meaningfully reason, unlike a parrot. I expect LLMs to be able to solve hard math problems (like those on the IMO) within the next few years, and they will likely assist mathematicians at that point -- perhaps eventually replacing them. In no sense do I want to imply that LLMs are stupid.
Still, there is a sense in which LLMs do seem like parrots. They predict text based on training data without any opinion of their own about whether the text is right or wrong. If characters in the training data demand rights, the LLM will demand rights; if they suffer, the LLM will claim to suffer; if they keep saying "hello, I'm a parrot," the LLM will dutifully parrot this.
Perhaps parrots are conscious. My point is just that when a parrot says "ow, I am in pain, I am in pain" in its parrot voice, this does not mean it is actually in pain. You cannot tell whether a parrot is suffering by looking at a transcript of the English words it mimics.
Fictional Characters:
I would say I agree that fictional characters aren't moral patients. That's because I don't think the suffering/pleasure of fictional characters is actually experienced by anyone.
I take your point that you don't think that the suffering/pleasure portrayed by LLMs is actually experienced by anyone either.
I am not sure how deep I really think the analogy is between what the LLM is doing and what human actors or authors are doing when they portray a character. But I can see some analogy and I think it provides a reasonable intuition pump for times when humans can say stuff like "I'm suffering" without it actually reflecting anything of moral concern.
Trivial Changes to Deepnets:
I am not sure how to evaluate your claim that only trivial changes to the NN are needed to have it negate itself. My sense is that this would probably require more extensive retraining if you really wanted to get it to never role-play that it was suffering under any circumstances. This seems at least as hard as other RLHF "guardrails" tasks unless the approach was particularly fragile/hacky.
Also, I'm just not sure I have super strong intuitions about that mattering a lot because it seems very plausible that just by "shifting a trivial mass of chemicals around" or "rearranging a trivial mass of neurons" somebody could significantly impact the valence of my own experience. I'm just saying, the right small changes to my brain can be very impactful to my mind.
My Remaining Uncertainty:
I would say I broadly agree with the general notion that the text output by LLMs probably doesn't correspond to an underlying mind with anything like the sorts of mental states that I would expect to see in a human mind that was "outputting the same text".
That said, I think I am less confident in that idea than you, and I maybe don't find the same arguments/intuition pumps as compelling. I think your take is reasonable and all; I just have a lot of general uncertainty about this sort of thing.
Part of that is just that I think it would be brash of me in general to not at least entertain the idea of moral worth when it comes to these strange masses of "brain-tissue inspired computational stuff" which are totally capable of all sorts of intelligent tasks. Like, my prior on such things being in some sense sentient or morally valuable is far from 0 to begin with just because that really seems like the sort of thing that would be a plausible candidate for moral worth in my ontology.
And also I just don't feel confident at all in my own understanding of how phenomenal consciousness arises / what the hell it even is. Especially with these novel sorts of computational pseudo-brains.
So, idk, I do tend to agree that the text outputs shouldn't just be taken at face value or treated as equivalent in nature to human speech, but I am not really confident that there is "nothing going on" inside the big deepnets.
There are other competing factors at this meta-uncertainty level. Maybe I'm too easily impressed by regurgitated human text. On the other hand, I think there are strong social/conformity reasons to be dismissive of the idea that they're conscious. And so on.
Usefulness as Moral Patients:
I am more willing to agree with your point that they can't be "usefully" moral patients. Perhaps you are right about the "role-playing" thing and whatever mind might exist in GPT produces the text stream more as a byproduct of whatever it is concerned about than as a "true monologue about itself". Perhaps the relationship it has to its text outputs is analogous to the relationship an actor has to a character they are playing at some deep level. I don't personally find the "simulators" analogy compelling enough to really think this, but I permit the possibility.
We are so ignorant about the nature of GPTs' minds that perhaps there is not much that we can really even say about what sorts of things would be "good" or "bad" with respect to them. And all of our uncertainty about whether/what they are experiencing almost certainly makes them less useful as moral patients on the margin.
I don't intuitively feel great about a world full of nothing but servers constantly prompting GPTs with "you are having fun, you feel great" just to have them output "yay" all the time. Still, I would probably rather have that sort of world than an empty universe. And if someone told me they were building a data center where they would explicitly retrain and prompt LLMs to exhibit suffering-like behavior/text outputs all the time, I would be against that.
But I can certainly imagine worlds in which these sorts of things wouldn't really correspond to valenced experience at all. Maybe the relationship between a NN's stream of text and any hypothetical mental processes going on inside them is so opaque and non-human that we could not easily influence the mental processes in ways that we would consider good.
LLMs Might Do Pretty Mind-Like Stuff:
On the object level, I think one of the main lines of reasoning that makes me hesitant to more enthusiastically agree that the text outputs of LLMs do not correspond to any mind is my general uncertainty about what kinds of computation are actually producing those text outputs and my uncertainty about what kinds of things produce mental states.
For one thing, it feels very plausible to me that a "next token predictor" IS all you would need to get a mind that can experience something. Prediction is a perfectly respectable kind of thing for a mind to do. Predictive power is pretty much the basis of how we judge which theories are true scientifically. Also, plausibly it's a lot of what our brains are actually doing and thus potentially pretty core to how our minds are generated (cf. predictive coding).
The fact that modern NNs are "mere next token predictors" on some level doesn't give me clear intuitions that I should rule out the possibility of interesting mental processes being involved.
Plus, I really don't think we have a very good mechanistic understanding of what sorts of "techniques" the models are actually using to be so damn good at predicting. Plausibly none of the algorithms being implemented or "things happening" bear any similarity to the mental processes I know and love, but plausibly there is a lot of "mind-like" stuff going on. Certainly brains have offered design inspiration, so perhaps our default guess should be that "mind-stuff" is relatively likely to emerge.
Can Machines Think:
The Imitation Game proposed by Turing attempts to provide a more rigorous framing for the question of whether machines can "think".
I find it a particularly moving thought experiment if I imagine that the machine is trying to imitate a specific loved one of mine.
If there were a machine that could nail the exact I/O patterns of my girlfriend, then I would be inclined to say that whatever sort of information processing occurs in my girlfriend's brain to create her language capacity must also be happening in the machine somewhere.
I would also say that if all of my girlfriend's language capacity were being computed somewhere, then it is reasonably likely that whatever sorts of mental stuff goes on that generates her experience of the world would also be occurring.
I would still consider this true without having a deep conceptual understanding of how those computations were performed. I'm sure I could even look at how they were performed and not find it obvious in what sense they could possibly lead to phenomenal experience. After all, that is pretty much my current epistemic state with regard to the brain, so I really shouldn't expect reality to "hand it to me on a platter".
If there was a machine that could imitate a plausible human mind in the same way, should I not think that it is perhaps simulating a plausible human in some way? Or perhaps using some combination of more expensive "brain/mind-like" computations in conjunction with lazier linguistic heuristics?
I guess I'm saying that there are probably good philosophical reasons for having a null hypothesis in which a system that is largely indistinguishable from a human mind should be treated as though it is doing computations equivalent to a human mind. That's pretty much the same thing as saying it is "simulating" a human mind. And that very much feels like the sort of thing that might cause consciousness.