(Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post’s main points.)
Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI’s recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization.
Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is.
We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan.
Currently, Eliezer’s impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all.[1]
Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state its crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in.
It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan.
We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it’s hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works.
We’d be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us”.
Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up, at some point, their thoughts on OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our eventual post on this topic superfluous.
Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we’re more superfluous / there’s more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice.[2][3]
- ^
We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we’re completely wrong!
Nate’s personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn’t agreed on a plan and there’s a lot of disagreement about what the best approach is”.
In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and document core strategic disagreements now rather than later.
- ^
Nate adds: “This is a chance to show that you totally would have seen the issues yourselves, and thereby deprive MIRI folk of the annoying ‘y'all'd be dead if not for MIRI folk constantly pointing out additional flaws in your plans’ card!”
- ^
Eliezer adds: "For this reason, please note explicitly if you're saying things that you heard from a MIRI person at a gathering, or the like."
What sort of substantial value would you expect to be added? It sounds like we either have a different belief about the value-add, or a different belief about the costs. Maybe if you sketched 2-3 scenarios that strike you as relatively likely ways for this particular post to have benefited from private conversations, I'd know better what the shape of our disagreement is.
If your objection is less "this particular post would benefit" and more "every post that discusses an AGI org should run a draft by that org (at least if you're doing EA work full-time)", then I'd respond that stuff like "EAs candidly arguing about things back and forth in the comments of a post", the 80K Podcast, and unredacted EA chat logs are extremely valuable contributions to EA discourse, and I think we should do far, far more things like that on the current margin.
Writing full blog posts that are likewise "real" and likewise "part of a genuine public dialogue" can be valuable in much the same way; and some candid thoughts are a better fit for this format than for other formats, since some candid thoughts are more complicated, etc.
It's also important that intellectual progress in formats like "long unedited chat logs" gets distilled and turned into relatively short, polished, and stable summaries; and it's also important that people feel free to talk in private. But having some big chunks of the intellectual process be out in public is excellent for a variety of reasons. Indeed, I'd say that there's more value overall in seeing EAs' actual cognitive processes than in seeing EAs' ultimate conclusions, when it comes to the domains that are most uncertain and disagreement-heavy (which include a lot of the most important domains for EAs to focus on today, in my view).
I don't think that sharing in-process snapshots of your views is "sloppy", in the sense of representing worse epistemic standards than a not-in-process Finished Product.
E.g., I wouldn't say that a conversation on the 80K Podcast is more epistemically sloppy than a summary of people's take-aways from the conversation. I think the opposite is often true, and people's in-process conversations often reflect higher epistemic standards than their attempts to summarize and distill everything after-the-fact.
In EA, being good at in-process, uncertain, changing, under-debate reasoning is more the thing I want to lead by example on. I think that hiding process is often setting a bad example for EAs, and making it harder for them to figure out what's true.
I agree that I'm not a paradigmatic example of the EAs who most need to hear this lesson; but I think non-established EAs heavily follow the example set by established EAs, so I want to set an example that's closer to what I actually want to see more of.
If my reasoning process is actually flawed, then I want other EAs to be aware of that, so they can have an accurate model of how much weight to put on my views.
If established EAs in general have such flawed reasoning processes (or such false beliefs) that rank-and-file EAs would be outraged and give up on the EA community if they knew this fact, then we should want to outrage rank-and-file EAs, in the hope that they'll start something else that's new and better. EA shouldn't pretend to be better than it is; this causes way too many dysfunctions, even given that we're unusually good in a lot of ways.
(But possibly we agree about all that, and the crux here is just that you think sharing rougher or more uncertain thoughts is an epistemically bad practice, and I think it's an epistemically good practice. So you see yourself as calling for higher standards, and I see you as calling for standards that are actually lower but happen to look more respectable.)
That seems like a great idea to me too! I'd advocate for doing this along with the things I proposed above.
Is that actually true? Seems maybe true, but I also wouldn't be surprised if >50% of regular EA Forum commenters could get substantive replies pretty regularly from knowledgeable DeepMind, OpenAI, and Anthropic staff, if they tried sending a few emails.