Redwood Research is a longtermist organization working on AI alignment based in Berkeley, California. We're going to do an AMA this week; we'll answer questions mostly on Wednesday and Thursday (6th and 7th of October). I expect to answer a bunch of questions myself; Nate Thomas, Bill Zito, and perhaps other people will also be answering questions.
Here's an edited excerpt from this doc that describes our basic setup, plan, and goals.
Redwood Research is a longtermist research lab focusing on applied AI alignment. We’re led by Nate Thomas (CEO), Buck Shlegeris (CTO), and Bill Zito (COO/software engineer); our board is Nate, Paul Christiano and Holden Karnofsky. We currently have ten people on staff.
Our goal is to grow into a lab that does lots of alignment work that we think is particularly valuable and wouldn’t have happened elsewhere.
Our current approach to alignment research:
- We’re generally focused on prosaic alignment approaches.
- We expect to mostly produce value by doing applied alignment research. I think of applied alignment research as research that takes ideas for how to align systems, such as amplification or transparency, and then tries to figure out how to make them work out in practice. I expect that this kind of practical research will be a big part of making alignment succeed. See this post for a bit more about how I think about the distinction between theoretical and applied alignment work.
- We are interested in thinking about our research from an explicit perspective of wanting to align superhuman systems.
- When choosing between projects, we’ll be thinking about questions like “to what extent is this class of techniques fundamentally limited? Is this class of techniques likely to be a useful tool to have in our toolkit when we’re trying to align highly capable systems, or is it a dead end?”
- I expect us to be quite interested in doing research of the form “fix alignment problems in current models” because it seems generally healthy to engage with concrete problems, but we’ll want to carefully think through exactly which problems along these lines are worth working on and which techniques we want to improve by solving them.
We're hiring for research and engineering roles, and for an office operations manager.
You can see our website here. Other things we've written that might be interesting:
- A description of our current project
- Some docs/posts that describe aspects of how I'm thinking about the alignment problem at the moment: "The theory-practice gap" and "The alignment problem in different capability regimes".
We're up for answering questions about anything people are interested in.
So one thing to note is that I think that there are varying degrees of solving the technical alignment problem. In particular, you’ve solved the alignment problem more if you’ve made it really convenient for labs to use the alignment techniques you know about. If next week some theory people told me “hey we think we’ve solved the alignment problem, you just need to use IDA, imitative generalization, and this new crazy thing we just invented”, then I’d think that the main focus of the applied alignment community should be trying to apply these alignment techniques to the most capable currently available ML systems, in the hope of working out all the kinks in these techniques, and then repeating this every year, so that whenever it comes time to actually build the AGI with these techniques, the relevant lab can just hire all the applied alignment people who are experts on these techniques and get them to apply them. (You might call this fire drills for AI safety, or having an “anytime alignment plan”; someone else invented this latter term, I don’t remember who.)
I normally focus my effort on the question “how do we solve the technical alignment problem and make it as convenient as possible to build aligned systems, and then ensure that the relevant capabilities labs put effort into using these alignment techniques”, rather than this question, because the former seems relatively tractable compared to causing things to go well in worlds like those you describe.
One way of thinking about your question is to ask how many years the deployment of existentially risky AI could be delayed (which might buy time to solve the alignment problem). I don’t have super strong takes on this question. I think that there are many reasonable-seeming interventions, such as all of those that you describe. I guess I’m more optimistic about regulation and voluntary coordination between AI labs (e.g., I’m happy about “Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.” from the OpenAI Charter) than about public pressure, but I’m not confident.
Again, I think that maybe 30% of AI accident risk comes from situations where we sort of solved the alignment problem in time but the relevant labs don’t use the known solutions. Excluding that, I think that misuse risk is serious and worth worrying about. I don’t know how much value I think is destroyed in expectation by AI misuse compared to AI accidents. I can also imagine various x-risks arising from narrow AI.