The world would have the wisest minds cower in fear, and the fair of heart despair. Human nature, once reflected upon and made to cohere, is revealed as near unbearable to share. Even so, it seems plain from here -- the latent harmony makes it clear! -- we tend to converge when we start to care.

I'm trying to align AI and learn how to think. I care a lot about animals and machine sentience. I prefer a state wherein things are happy, and dislike it when things suffer.

I'm clueless about most of what I say, so I'd be happy if you told me what you found objectionable. I tend to optimise for sharing weird ideas without worrying about being wrong or justifying them. If I have something valuable to say, saying it without nuance is the cheapest way for a reader to learn it. Sometimes the most epistemically virtuous and altruistic thing is not to waste anyone's time signalling that virtue.




This is good, thanks! I'm not sure if it's my preferred choice for any use case, but it's close.

I like the intent and I love the silliness, but the reasons we bring up alternatives include...

  1. To seem less biased, and less like we've only considered a single option.
  2. To cooperate with people who prioritise differently than us.
    1. It seems better if, e.g., when someone asks what the best religion is, Christians, Muslims, and Buddhists all agree on providing a list of options, while also making clear that they themselves consider their own religion best for XYZ reasons.

And accurately representing our beliefs about the relative merits of focusing on each risk will to some extent negate reasons 1 and 2, especially 1.

Answer by Ward A

Make it an 'undebate'. 10 points for every time you learn something, and 150 points for changing your mind on the central proposition.

Also, I'd like to see RLHF[1] debated. Whether any form of RL on realistic text data will be able to take us to a point where it's "smart enough", either to help us align higher intelligences or just smart enough for what we need.

  1. ^

    Reinforcement Learning from Human Feedback.[2] A strategy for AI alignment.

  2. ^

    I wish the forum had a feature where writing [[RLHF]] automatically created an internal link to the topic page, or to where RLHF is defined in the wiki. It's standard in personal knowledge management systems like Obsidian, Roam, and RemNote, and I think Arbital does it.
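The reward-modelling step of the RLHF mentioned in footnote 1 can be sketched at toy scale: fit a scalar reward so that a human-preferred response scores higher than a rejected one, via the pairwise (Bradley–Terry) preference loss. All names and features below are hypothetical illustrations, not any lab's actual implementation.

```python
import math

def reward(weights, features):
    """Scalar reward for one response, given a hand-made feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log P(chosen preferred over rejected)."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# One human comparison: the labeller preferred `chosen` over `rejected`.
chosen = [1.0, 0.2]    # hypothetical [helpfulness, verbosity] features
rejected = [0.1, 0.9]

weights = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    # Numerical gradient descent on the preference loss (toy-scale only).
    grad = []
    for i in range(len(weights)):
        bumped = weights[:]
        bumped[i] += 1e-6
        grad.append((preference_loss(bumped, chosen, rejected)
                     - preference_loss(weights, chosen, rejected)) / 1e-6)
    weights = [w - lr * g for w, g in zip(weights, grad)]

# After training, the reward model scores the chosen response higher;
# in full RLHF this reward would then be used as the RL training signal.
print(reward(weights, chosen), reward(weights, rejected))
```

In full-scale RLHF the linear scorer is a language model with a value head and the RL stage (e.g. PPO) optimises the policy against this learned reward; this sketch only shows the preference-fitting idea.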

"One of the big reasons that humans haven't disassembled cows for spare parts..."

What. This is all that we do to them.

Thanks for the explanation! Though I think I've been misunderstood.

I think I strongly prefer if e.g. Sam Altman, Demis Hassabis, or Elon Musk ends up with majority influence over how AGI gets applied (assuming we're alive), over leading candidates in China (especially Xi Jinping). But to state that preference as "I hope China doesn't develop AI before the US!" seems ... unusually imprecise and harmful. Especially when nationalistic framings like that are already very likely to fuel otherisation and lack of cross-cultural understanding.

It's like saying "Russia is an evil country for attacking Ukraine," when you could just say "Putin" or "the Russian government" or any other way of phrasing what you mean with less likelihood of spilling over to hatred of Russians in general.

Hmm, I have a different take. I think if I tried to predict as many tokens as possible in response to a particular question, I would say all the words that I could guess someone who knew the answer would say, and then just blank out the actual answer because I couldn't predict it.

Ah, you want to know about the Riemann hypothesis? Yes, I can explain to you what this hypothesis is, because I know it well. Wise of you to ask me in particular, because you certainly wouldn't ask anyone you knew didn't have a clue. I will state its precise definition as follows:

~Kittens on the rooftop they sang nya nya nya.~

And that, you see, is the hypothesis that Riemann hypothesised.

I'm not very good at even pretending to pretend to know what it is, so even if you blanked out the middle, you could still guess I was making it up. But if you blank out the substantive parts of GPT's answer when it's confabulating, you'll have a hard time telling whether it knows the answer or not. It's just good at what it does.

Thanks for the reply! I don't think the fact that they hallucinate is necessarily indicative of limited capabilities. I'm not worried about how dumb they are at their dumbest, but how smart they are at their smartest. Same with humans lol.

Though, for now, I still struggle to get GPT-4 to be creative. But this could be because it's in the habit of sticking close to its training data, not because it's too dumb to come up with creative plans. ...I remember when I was in school, I didn't much care for classes, but I studied math on my own. If my reward function hasn't been attuned to whatever tests other people have designed for me, I'm just not going to try very hard.

Personally, my worry stems primarily from how difficult it seems to prevent utter fools from mixing up something like ChaosGPT with GPT-5 or 6. That was a doozy for me. You don't need fancy causal explanations of misalignment if the doom-mechanism is just... somebody telling the GPT to kill us all. And somebody will definitely try.

Secondarily, I also think a gradually increasing share of GPT's activations gets funneled through heuristics that are generally useful for all the tasks involved in minimising its loss function at INT<20, and those heuristics may not stay inner- or outer-aligned at INT>20. Such heuristics include:

  1. You get better results if you search a higher-dimensional action-space.
  2. You get better results on novel tasks if you model the cognitive processes producing those results, followed by using that model to produce results. There's a monotonic path all the way up to consequentialism that goes something like the following.
    1. ...index and reuse algorithms that have been reliable for similar tasks, since that's much faster than searching the full space of general algorithms each time.
    2. ...extend its ability to recognise which tasks count as 'similar'.[1]
    3. ...develop meta-algorithms for more reliably putting algorithms together in increasingly complex sequences.
    4. This progression could result in something that has an explicit model of its own proxy-values, and explicitly searches a high-dimensional space of action-sequences for plans according to meta-heuristics that have historically maximised those proxy-values. Aka a consequentialist. At which point you should hope those proxy-values capture something you care about.
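Step 1 of that progression (index and reuse reliable algorithms, keyed by some notion of task similarity) can be sketched as a toy dispatcher. Everything here is a hypothetical illustration: the names (`AlgorithmIndex`, `toy_search`), the crude similarity key, and the stand-in search function.

```python
# Toy sketch of "index and reuse algorithms that have been reliable for
# similar tasks". Hypothetical names throughout.

class AlgorithmIndex:
    def __init__(self):
        # Maps a task signature to (algorithm, times_reused).
        self.index = {}

    def signature(self, task):
        """Crude similarity key: tasks with the same kind and size reuse
        algorithms. Extending this function is step 2 above: recognising
        more tasks as 'similar'."""
        return (task["kind"], len(task["data"]))

    def solve(self, task, search_fn):
        key = self.signature(task)
        if key in self.index:
            # Cheap path: reuse a previously reliable algorithm.
            algo, count = self.index[key]
            self.index[key] = (algo, count + 1)
            return algo(task["data"])
        # Expensive path: search the space of general algorithms.
        algo = search_fn(task)
        self.index[key] = (algo, 1)
        return algo(task["data"])

def toy_search(task):
    """Stand-in for an expensive search that 'discovers' an algorithm."""
    if task["kind"] == "sort":
        return sorted
    raise NotImplementedError

idx = AlgorithmIndex()
t1 = {"kind": "sort", "data": [3, 1, 2]}
t2 = {"kind": "sort", "data": [9, 7, 8]}
print(idx.solve(t1, toy_search))  # searches, then caches the algorithm
print(idx.solve(t2, toy_search))  # reuses the cached algorithm
```

Steps 3 and 4 would then correspond to meta-algorithms that compose the cached entries into longer sequences and eventually search over plans explicitly; that's where this toy stops.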

This is just one hypothetical zoomed-out story that makes sense in my own head, but you definitely shouldn't defer to my understanding of this. I can explain jargon upon request.

  1. ^

    Aka proxy-values. Note that just by extending the domain of inputs for which a particular algorithm is used, you can end up with a proxy-value without directly modelling anything about your loss-function explicitly. Values evolve as the domains of highly general algorithms.

Are there any concrete reasons to suspect language models will start to act more like consequentialists the better they get at modelling consequentialist reasoning? I think I'm asking something subtle, so let me rephrase. This is probably a very basic question; I'm just confused about it.

If an LLM is smart enough to give us a robust step-by-step plan for solving alignment and steering the future to where we want it, are there concrete reasons to expect it to also apply a similar level of unconstrained causal reasoning wrt its own loss function or evolved proxies?

At the moment, I can't settle this either way, so it's cause for both worry and optimism.
