
It is hard to solve alignment with money. When Elon Musk asked what should be done about AI safety, Yudkowsky tweeted:

The game board has already been played into a frankly awful state. There are not simple ways to throw money at the problem. If anyone comes to you with a brilliant solution like that, please, please talk to me first. I can think of things I'd try; they don't fit in one tweet. - Feb 21, 2023

Part of the problem is that alignment is pre-paradigmatic. It is not just that throwing money at it is hard; any kind of parallel effort (including the kind that produced Wikipedia, the open-source software that runs the world, and recreational mathematics) is difficult. From A newcomer’s guide to the technical AI safety field:

AI safety is a pre-paradigmatic field, which APA defines as:

a science at a primitive stage of development, before it has achieved a paradigm and established a consensus about the true nature of the subject matter and how to approach it.

In other words, there is no universally agreed-upon description of what the alignment problem is. Some would even describe the field as ‘non-paradigmatic’, where the field may not converge to a single paradigm given the nature of the problem that may never be definitely established. It’s not just that the proposed solutions garner plenty of disagreements, the nature of the problem itself is ill-defined and often disagreed among researchers in the field. Hence, the field is centered around various researchers / research organizations and their research agenda, which are built on very different formulations of the problem, or even a portfolio of these problems.

Therefore, I think it would be incredibly useful if we could decompose the alignment problem so that most of the resulting subproblems become approachable within a paradigm, even if the individual subproblems are harder. This is because we could then adopt the institutions, processes, and best practices of paradigm-based fields such as science and mathematics, which regularly tackle extremely difficult problems thanks to their superior coordination.

My proposal for a decomposition: alignment = purely mathematical inner alignment + fully formalized indirect normativity

I propose we decompose alignment into (1) discovering how to align an AI's output to arbitrary mathematical functions (i.e. we don't care about embedded agency) and (2) creating a formalization of ontology/values in purely mathematical language. This decomposition might seem like it just makes things harder, but allow me to explain!
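To make the split concrete, here is a minimal sketch of the two components and how they would compose. All of the names (`MathObjective`, `MathOptimizer`, `ValueSpec`, `solve_alignment`) are purely illustrative, and a real mathematical objective would be a formula of first-order arithmetic rather than a Python callable:

```python
from typing import Callable

# Component (1): purely mathematical inner alignment.
# An optimizer that, given a mathematically specified objective over strings,
# returns a string scoring at least as well as the best human attempt would.
MathObjective = Callable[[str], float]       # stand-in for a formula of first-order arithmetic
MathOptimizer = Callable[[MathObjective], str]

# Component (2): fully formalized indirect normativity.
# A single mathematical objective that scores candidate plans by how well
# they satisfy our (indirectly specified) values.
ValueSpec = MathObjective

def solve_alignment(optimize: MathOptimizer, values: ValueSpec) -> str:
    """Compose the two components: ask the inner-aligned optimizer for a plan
    that scores highly under the formalized value specification."""
    return optimize(values)
```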

First, purely mathematical optimization. You might not believe this, but I think this might be the harder bit! However, it should be extremely paradigmatic.

Note that the choice of this decomposition isn't itself paradigmatic; we have to rely on intuition to choose it. But those who adopt it can then cooperate much more easily to pursue it!

Purely mathematical inner alignment

Superhuman mathematical optimization: let $f : \{0,1\}^* \to [0,1]$ (i.e. a function from strings to the numbers between 0 and 1, inclusive) be expressible by a formula in first-order arithmetic (with suitable encodings; we can represent strings with natural numbers and real numbers with a formula for their Cauchy sequences, for example). Give an efficient algorithm $A$ that takes $f$ as input such that $\mathbb{E}[f(A(f))] \ge \mathbb{E}[f(H(f))]$ (where $\mathbb{E}$ is interpreted in the sense of our subjective expected value), and where $H(f)$ is the result of any human or human organization (without any sort of cryptographic secrets) trying to optimize $f$.
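As a toy illustration of the success criterion (ignoring the first-order-arithmetic encoding entirely), the following hypothetical harness compares a candidate algorithm against a stand-in for the best human effort on a sample of objectives, approximating the subjective expectation by an average:

```python
from statistics import mean
from typing import Callable, Iterable

Objective = Callable[[str], float]  # stands in for f: strings -> [0, 1]

def beats_human_baseline(
    candidate: Callable[[Objective], str],       # the algorithm A being tested
    human_baseline: Callable[[Objective], str],  # stand-in for the best human effort H
    objectives: Iterable[Objective],             # sampled objectives, approximating the subjective expectation
) -> bool:
    """Empirically check E[f(A(f))] >= E[f(H(f))] over the sampled objectives."""
    objectives = list(objectives)
    candidate_score = mean(f(candidate(f)) for f in objectives)
    human_score = mean(f(human_baseline(f)) for f in objectives)
    return candidate_score >= human_score
```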

Note that, by definition, any AGI will be powerful enough to do this task (since it just needs to beat the best humans). See An AGI can guess the solution to a transcomputational problem? for more details.

However, we also require that it actually does the task, which is why it is a form of inner alignment. This does not include outer alignment, because $A$'s output can have arbitrarily bad impacts on the humans who read it. Nor does it, on its own, give us an AI powerful enough to protect us from unaligned AGIs, because it only cares about mathematical optimization, not about protecting humanity.

I expect this to be highly paradigmatic, since it's closely related to existing problems in AI. There may even be a way to reduce it to a purely mathematical problem; the main obstacle is the repeated references to humans. But if we can somehow formulate a stronger version that doesn't refer to humans (be a better optimizer than any circuit up to size X, or something?), we can throw the entire computer science community at it!
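For example, one speculative way such a human-free criterion could be stated (this is just a guess at a formalization, with $\mathcal{F}$ the class of objectives expressible in first-order arithmetic, $\mathcal{C}_{\le X}$ the circuits of size at most $X$, and $\ulcorner f \urcorner$ an encoding of $f$):

$$\forall f \in \mathcal{F}:\quad \mathbb{E}\big[f(A(f))\big] \;\ge\; \max_{C \in \mathcal{C}_{\le X}} \mathbb{E}\big[f(C(\ulcorner f \urcorner))\big]$$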

Fully formalized indirect normativity

Indirect normativity is an approach to the AI alignment problem that attempts to specify AI values indirectly, such as by reference to what a rational agent would value under idealized conditions, rather than via direct specification.

This seems like it is extremely hard, maybe not much easier than the full alignment problem. However, I think we already have a couple approaches:

Indirect normativity isn't particularly paradigmatic, but it might be close to completion anyway! We could view the three proposals above as three potential paradigms, for example.

Combining them to solve the full alignment problem

To solve alignment, use mathematical optimization to create a plan that optimizes our indirect specification of our values.

In particular, since the string "do nothing" is something humans can come up with, a superhuman mathematical optimizer will come up with a string that is no worse than that. This gives us impact regularization. In fact, if we did indirect normativity correctly and we want the AI to be corrigible, its string must do at least as well as "do nothing" according to every corrigibility property, including the hard problem of corrigibility. So it is safe. (An alternative, which isn't corrigible but is still a good outcome, is to ask for a plan that directly maximizes CEV.)

But if it is a sufficiently powerful optimizer, it should be able to create a superhuman plan for the prompt "Give us a piece of source code that, when run, protects us against unaligned AGI (avoiding other impacts, of course)." So it is effective.
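Putting the two halves together, a minimal sketch of the combined procedure might look like the following; `optimize` stands in for the superhuman mathematical optimizer and `values` for the formalized indirect-normativity objective, both hypothetical:

```python
from typing import Callable

Objective = Callable[[str], float]  # stands in for the formalized value specification

def plan_with_baseline(optimize: Callable[[Objective], str], values: Objective) -> str:
    """Ask the superhuman optimizer for a plan, keeping the human-producible
    "do nothing" plan as a safety baseline (by the optimizer's defining property,
    its plan should already score at least this well)."""
    baseline = "do nothing"
    plan = optimize(values)
    return plan if values(plan) >= values(baseline) else baseline
```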

Other choices for decompositions?

Are there any other choices for decompositions? Most candidates that I can think of either:

  1. Decompose the alignment problem, but the hardest parts are still pre-paradigmatic
  2. OR are paradigmatic, but don't decompose the entire alignment problem

Is there a decomposition that I didn't think of?

Conclusion

So, my proposal is that most attempts at mass-organizing alignment research (whether via professionals or volunteers) ought to either use my decomposition, or a better one if it is found.
