We recently announced Orthogonal, an agent foundations alignment research organization. In this post, I give a thorough explanation of the formal-goal alignment framework, the motivation behind it, and the theory of change it fits in.
The overall shape of what we're doing is:
- Building a formal goal which would lead to good worlds when pursued — our best candidate for this is QACI
- Designing an AI which takes as input a formal goal, and returns actions which pursue that goal in the distribution of worlds we likely inhabit (see the sketch just below this list for the rough shape of that interface)
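As an illustration only, here is a minimal sketch of that interface in Python. Every name in it is invented for this post (none of it is QACI or existing code); the point is just that a formal goal is a purely mathematical scoring function over formal world-descriptions, and the AI is a solver that takes such a goal plus a model of our world and returns an action that scores well on it.

```python
from typing import Callable, Iterable

# Illustrative stand-in types; none of these names come from QACI or any existing codebase.
World = tuple          # a fully formal description of a world-history
Action = str           # an action the AI can output

# A formal goal is pure math: a scoring function over formal world-descriptions,
# with no natural-language concepts the AI must interpret in its own ontology.
FormalGoal = Callable[[World], float]

def pick_action(goal: FormalGoal,
                candidates: Iterable[Action],
                world_model: Callable[[Action], World]) -> Action:
    """Toy solver: score each candidate action by the world it (is modeled to)
    lead to, and return the highest-scoring one. The second half of the agenda
    is building a solver that does this well across the distribution of worlds
    we likely inhabit; this stub only fixes the shape of the interface."""
    return max(candidates, key=lambda a: goal(world_model(a)))

# Toy usage: a "goal" that prefers worlds whose first coordinate is larger.
toy_goal: FormalGoal = lambda w: float(w[0])
print(pick_action(toy_goal, ["wait", "act"], lambda a: (len(a),)))  # prints "wait"
```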
Backchaining: aiming at solutions
One core aspect of our theory of change is backchaining: come up with an at least remotely plausible story for how the world is saved from AI doom, and try to think about how to get there. This avoids spending lots of time getting confused about concepts that are confusing because they were the wrong thing to think about all along, such as "what is the shape of human values?" or "what does GPT-4 want?" — our intent is to study things that fit together to form a full plan for saving the world.
Alignment engineering and agent foundations
Alignment is not just not the default; it's a very narrow target. As a result, there are many bits of non-obvious work which need to be done. Alignment isn't just finding the right weight to sign-flip to get the AI to switch from evil to good; it is the hard work of putting together something which coherently and robustly points in a direction we like.
The idea with agent foundations, which I guess hasn't successfully been communicated to this day, was to find a coherent target to try to get into the system by any means (potentially including deep-learning ones).
Agent foundations/formal-goal alignment is not fundamentally about doing math or being theoretical or thinking abstractly or proving things. Agent foundations/formal-goal alignment is about building a coherent target which is fully made of math — not of human words with unspecified meaning — and figuring out a way to make that target maximized by AI. Formal-goal alignment is about building a fully formalized goal, not about going about things in a "formal" manner.
Current AI technologies are not strong agents pursuing a coherent goal (SGCA). This is not because such technology is impossible or too confusing to build, but because in worlds in which SGCA was built (and wasn't aligned), we die. Alignment ultimately is about making sure that the first SGCA pursues a desirable goal; the default is that its goal will be undesirable.
This does not mean that I think someone needs to figure out how to build SGCA for the world to end from AI; what I expect is that there are ways in which SGCA can emerge out of the current AI paradigm, in ways that don't particularly let us choose what goal it pursues.
You do not align AI; you build aligned AI.
Because this emergence does not let us pick the SGCA's goal, we need to design an SGCA whose goal we do get to choose; and separately, we need to design such a goal. I expect that pursuing straightforward progress on current AI technology leads to an SGCA whose goal we do not get to choose and which leads to extinction.
I do not expect that current AI technology is of a kind that makes it easy to "align"; I believe that the whole idea of building a strange non-agentic AI, to which the notion of goal barely applies, and then trying to make it "be aligned", was fraught from the start. If current AI were powerful enough to save the world once "aligned", it would have already killed us before we "aligned" it. To save the world, we have to design something new which pursues a goal we get to choose; and that design needs to have this in mind from the start, rather than as an afterthought.
AI applies to alignment, not alignment to AI
At this point, many answer "but this novel technology won't be built in time to save the world from unaligned AGI!"
First, it is plausible that after we have designed an AI that would save the world, we'll end up reaching out to the large AI organizations and asking them to merge and assist with our alignment agenda. While "applying alignment to current AI" is fraught, using current AI technologies in the course of designing this world-saving SGCA is meaningful. Current AI technology can serve as a component of alignment, not the other way around.
But second: yes, we still mostly die. I do not expect that our plan saves most timelines. I merely believe it saves most of the worlds that are saved. We will not save >50% of worlds, or maybe even >10%; but we will have produced dignity; we will have significantly increased the ratio of worlds that survive. This is unfortunate, but I believe it is the best that can be done.
Pursuing formal goals vs ontology wrangling
Because of a lack of backchaining, I believe that most current methods for trying to wrangle what goes on inside current AI systems are not just the wrong way to go about things, but net harmful when published.
AI goals based on trying to point at things we care about inside the AI's model are the wrong way to go about things, because they are susceptible to ontology breaks and to failing to carry over through the steps of self-improvement that a world-saving AI should want to go through.
Instead, the aligned goal we should be putting together should be eventually aligned: aligned from a certain point onward (a point we would then have to ensure the system we launch is already past), rather than aligned only up to a certain point.
The aligned goal should be "formal". It should be made of fully formalized math, not of human concepts that an AI has to interpret in its ontology, because ontologies break and reshape as the AI learns and changes. The aligned goal should have the factual property that a computationally unbounded mathematical oracle given that goal would take desirable actions; and then, we should design a computationally bounded AI which is good enough to take satisfactory actions. I believe this is the only way to design an AI whose actions we can still be confident are desirable even once the AI is out of our hands and is augmenting itself to unfathomable capabilities; and I believe it needs to get out of our hands and augment itself to unfathomable capabilities in order for it to save the world.
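To restate that split slightly more formally (my notation, purely as an illustration; this is not notation from QACI): write the formal goal as a scoring function $G : \mathcal{A} \to \mathbb{R}$ over candidate actions or policies, defined entirely in mathematics. Then the two requirements are roughly:

- Goal design: any $a^* \in \arg\max_{a \in \mathcal{A}} G(a)$, as found by a computationally unbounded oracle, is a desirable action.
- AI design: the computationally bounded system we actually build outputs some $\hat{a}$ whose score $G(\hat{a})$ is high enough that $\hat{a}$ is still satisfactory, even if it is not the exact maximum.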
Conclusion
I, and now other researchers as well, believe this agenda is worthy of considerably more investigation, and is our best shot at making it out of the acute risk period by ensuring that superintelligent AI can lead to astronomical good instead of extinction.
Our viewpoint seems in many ways similar to that of MIRI, and we intend to continue our efforts to engage with MIRI researchers, because we believe they are the research organization most amenable to collaboration on this agenda.
While we greatly favor the idea of governance and coordination helping with alignment, timelines seem too short for this to make a significant difference beyond buying a few years at most. We are also greatly concerned that raising AI risk awareness causes more people, or even governments, to react by finding AI impressive and entering the race, making things worse overall.
We believe that the correct action to take is to continue working on the hard problem of alignment, and we believe that our research agenda is the most promising path to solving it. This is the foundational motivation for the creation of our research organization.