(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)
Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states its current plan as an organization.
Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.
We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.
Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.[1]
Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state its crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice when the plan needs updating (and to fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.
It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.
We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.
We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) "figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us".
Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up, at some point, their thoughts on OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our post on this topic superfluous.
Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.[2][3]
[1] We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!
Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.
In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.
[2] Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”
[3] Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
If folks at DM/Anthropic/OpenAI ask us to run this kind of thing by them in advance, I assume we'll be happy to do so; we've sent them many other drafts of things before, and I expect we'll send them many more in the future.
I do like the idea of MIRI staff regularly or semi-regularly sharing our thoughts about things without running them by a bunch of people -- e.g., to encourage more of the conversation, pushback, etc. to happen in public, so information doesn't end up all bottled up in a few brains on a private email thread.
I think there are many cases where it's actively better for EAs to screw up in public and be corrected in the comments, rather than working out all disagreements and info-asymmetries in private channels and then putting out an immaculate, smoothed-over final product. (Especially if the post is transparent about this, so we have more-polished and less-polished stuff and it's pretty clear which is which.)
Screwing up in public has real costs (relative to the original essay Just Being Correct about everything), but hiding all the cognitive work that goes into consensus-building and airing of disagreements has real costs too.
This is not me coming out against running drafts by people in general; it's great tech, and we should use it. I just think there are subtle advantages to "just say what's on your mind and have a back-and-forth with people who disagree" that are worth keeping in view too.
Part of it is a certain attitude that I want to encourage more in EA, that I'm not sure how to put into words, but is something like: tip-toeing less; blurting more; being bolder, and proactively doing things-that-seem-good-to-you-personally rather than waiting for elite permission/encouragement/management; trying less to look perfect, and more to do the epistemically cooperative thing "wear your exact strengths and weaknesses on your sleeve so others can model you well"; etc.
All of that is compatible with running drafts by folks, but I think it can be valuable for more EAs to visibly be more relaxed (on the current margin) about stuff like draft-sharing, to contribute to a social environment where people feel chiller about making public mistakes, stating their current impressions and updating them in real time, etc. I don't think we want maximum chillness, but I think we want EA's best and brightest to be more chill on the current margin.