Quick comments:
A lot more concrete examples of what you think should be done differently would be helpful.
(I only read the text, didn't watch the video)
Naively I would trade a lot of clearly-safe stuff being delayed or temporarily prohibited for even a minor decrease in the chance of safe-seeming-but-actually-dangerous stuff going through, which pushes me towards favoring a more expansive scope of regulation.
(in my mind the potential loss of decades of life improvements currently pales next to the potential non-existence of all lives in the long-term future)
I don't know how to think about it when accounting for public opinion, though. I expect a larger scope will gather more opposition to regulation, which could be detrimental in various ways, the most obvious being a decreased likelihood of such regulation being passed/upheld/disseminated to other places.
But the difficulty of alignment doesn't seem to imply much about whether slowing is good or bad, or about its priority relative to other goals.
At the extremes, if alignment-to-"good"-values by default were 100% likely, I presume slowing down would be net-negative and racing ahead would look great. It's unclear to me where the tipping point is: what kind of distribution over alignment difficulty levels would one need to have to tip from wanting to speed up AI progress to wanting to slow it down?
Seems to me like the more longtermist one is, the more slowing down looks good, even when one is very optimistic about alignment. Then again, there are some considerations that push against this: risk of totalitarianism, risk of a pause that never ends, risk of value-agnostic alignment being solved and the first AGI being aligned to "worse" values than the default outcome.
(I realize I'm using two different definitions of alignment in this comment, would like to know if there's standardized terminology to differentiate between them)
How is the "secretly is planning to murder all humans" improving the model's scores on a benchmark?
(I personally don't find this likely, so this might accidentally be a strawman)
For example: planning and gaining knowledge are incentivized on many benchmarks -> instrumental convergence makes the model instrumentally value power, among other things -> a very advanced system that is great at long-term planning might conclude that "murdering all humans" is useful for power or other instrumentally convergent goals.
You could prove this. Make a psychopathic model designed to "betray" in a game-like world and then see how many rounds of training on a new dataset clear the ability for the model to kill when it improves score.
I think with our current interpretability techniques we wouldn't be able to robustly distinguish between a model that generalized to behave well in any reasonable environment and a model that learned to behave well in that specific environment but would go back to betraying in many other environments.
GPT-4 doesn't have the internal bits which make inner alignment a relevant concern.
Is this commonly agreed upon even after fine-tuning with RLHF? I assumed it's an open empirical question. The way I understand it is that there's a reward signal (human feedback) that's shaping different parts of the neural network that determines GPT-4's outputs, and we don't have good enough interpretability techniques to know whether some parts of the neural network are representations of "goals", and even less so what specific goals they are.
I would've thought it's an open question whether even base models have internal representations of "goals", either always active or only active in some specific context. For example if we buy the simulacra (predictors?) frame, a goal could be active only when a certain simulacrum is active.
(would love to be corrected :D)
I think the stated reasoning there by OP is that it's important to influence OpenAI's leadership's stance and OpenAI's work on AI existential safety. Do you think this is unreasonable?
To be fair I do think it makes a lot of sense to invoke nepotism here. I would be highly suspicious of the grant if I didn't happen to place a lot of trust in Holden Karnofsky and OP.
(feel free to not respond, I'm just curious)
Props for the initiative!
What names did you consider for the pledge? One con of the current name is that it could elicit some reactions like:
It might be largely down to whether someone interprets "better" as "better than I might otherwise do" or "better than others' careers". It likely depends on culture too. For example, I think here in Finland the above reactions could be more likely, since people tend to value humility quite a bit.
Anyway I'm not too worried since the name has positives too, and you can always adapt the name based on how outreach goes if you do end up experimenting with it. 👍