(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)
Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.
Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.
We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.
Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.[1]
Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state its crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.
It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.
We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.
We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us”.
Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up, at some point, their thoughts on OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous.
Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.[2][3]
[1] We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!
Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.
In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.
[2] Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”
[3] Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
I agree with a lot of what you say! I still want to move EA in the direction of "people just say what's on their mind on the EA Forum, without trying to dot every i and cross every t; and then others say what's on their mind in response; and we have an actual back-and-forth that isn't carefully choreographed or extremely polished, but is more like a real conversation between peers at an academic conference".
(Another way to achieve many of the same goals is to encourage more EAs who disagree with each other to regularly talk to each other in private, where candor is easier. But this scales a lot more poorly, so it would be nice if some real conversation were happening in public.)
A lot of my micro-decisions in making posts like this are connected to my model of "what kind of culture and norms are likely to result in EA solving the alignment problem (or making a lot of progress)?", since I think that's the likeliest way that EA could make a big positive difference for the future. In that context, I think building conversations around heavily polished, "final" (rather than in-process) cognition tends to be insufficient for fast and reliable intellectual progress:
In principle, it's not impossible to push EA in those directions while also passing drafts a lot more in private. But I hope it's clearer why that doesn't seem like the top priority to me (and why it could be at least somewhat counter-productive) given that I'm working with this picture of our situation.
I'm happy to heavily signal-boost replies from DM and Anthropic staff (including editing the OP), especially if they show that MIRI was just flatly wrong about how much those orgs already have a plan. I endorse people docking MIRI points insofar as we predicted wrongly here. And I'd prefer the world where people know our first-order impressions of where the field's at in this case, and are able to dock us some points if we turn out to be wrong, over the world where everything happens in private.
(I think I still haven't communicated fully why I disagree here, but hopefully the pieces I have been able to articulate are useful on their own.)