
Topic of the post: I list potential things to work on other than keeping AI under human control.


The EA community has long been worried about AI safety. Most of the efforts going into AI safety are focused on making sure humans are able to control AI. Regardless of whether we succeed at this, I think there’s a lot of additional value on the line.

First of all, if we succeed at keeping AI under human control, there are still a lot of things that can go wrong. My perception is that this has recently gotten more attention, for example here, here, here, and at least indirectly here (I haven’t read all these posts and have chosen them to illustrate that others have made this point purely based on how easily I could find them). Why controlling AI doesn’t solve everything is not the main topic of this post, but I want to at least sketch my reasons to believe this.

Which humans get to control AI is an obvious and incredibly important question and it doesn’t seem to me like it will go well by default. It doesn’t seem like current processes put humanity’s wisest and most moral at the top. Humanity’s track record at not causing large-scale unnecessary harm doesn’t seem great (see factory farming). There is reasonable disagreement on how path-dependent epistemic and moral progress is but I think there is a decent chance that it is very path-dependent.

While superhuman AI might enable great moral progress and new mechanisms for making sure humanity stays on “moral track”, superhuman AI also comes with lots of potential challenges that could make it harder to ensure a good future. Will MacAskill talks about “grand challenges” we might face shortly after the advent of superhuman AI here. In the longer-term, we might face additional challenges. Enforcement of norms, and communication in general, might be extremely hard across galactic-scale distances. Encounters with aliens (or even merely humanity thinking they might encounter aliens!) threaten conflict and could change humanity’s priorities greatly. And if you’re like me, you might believe there’s a whole lot of weird acausal stuff to get right. Humanity might make decisions that influence these long-term issues already shortly after the development of advanced AI.

It doesn't seem obvious to me at all that a future where some humans are in control of the most powerful earth-originating AI will be great.

Secondly, even if we don’t succeed at keeping AI under human control, there are other things we can fight for and those other things might be almost as important or more important than human control. Less has been written about this (although not nothing). My current and historically very unstable best guess is that this reflects an actual lower importance of influencing worlds where humans don’t retain control over AIs, although I wish there were more work on this topic nonetheless. Justifying why I think influencing uncontrolled AI matters isn’t the main topic of this post, but I would like to at least sketch my motivation again.

If there is alien life out there, we might care a lot about how future uncontrolled AI systems treat them. Additionally, perhaps we can prevent uncontrolled AI from having actively terrible values. And if you are like me, you might believe there are weird acausal reasons to make earth-originating AIs more likely to be a nice acausal citizen.

Generally, even if future AI systems don’t obey us, we might still be able to imbue them with values that are more similar to ours. The AI safety community is aiming for human control, in part, because this seems much easier than aligning AIs with “what’s morally good”. But some properties that result in moral good might be easier or comparably easy to train as being obedient. Another way of framing it is “let’s throw a bunch of dumb interventions for desirable features at our AI systems–one of those features is intent alignment–and hope one of them sticks.”

Other works and where this list fits in

Discussing ways to influence AI other than keeping AI under human control has become more popular lately: Lukas Finnveden from Open Philanthropy wrote a whole series about it.

Holden Karnofsky discusses non-alignment AI topics on his Cold Takes blog. Will MacAskill recently announced in a quick take that he will focus on improving futures with human-controlled AIs instead of increasing the probability of keeping AIs under human control. More on the object-level, digital sentience work has received more attention recently. And last but not least, the Center on Long-Term Risk has pursued an agenda focused on reducing AI-induced suffering through means other than alignment for many years.

I would like to add my own list of what I perceive to be important other than human-controlled AI. Much of the list is overlapping with existing works although I will try to give more attention to lesser discussed issues, make some cases more concrete, or frame things in a slightly different way that emphasises what I find most important. I believe everything on the list is plausibly very important and time-sensitive, i.e. something we have to get right as or before advanced AI gets developed. All of this will be informed by my overall worldview, which includes taking extremely speculative things like acausal cooperation very seriously.

The list is a result of years of informally and formally talking with people about these ideas. It is written in the spirit of “try to quickly distil existing thoughts on a range of topics” rather than “produce and post a rigorous research report” or “find and write about expert consensus.”

The list

Making AI-powered humanity wiser, making AIs wiser


If there were interventions that make AI-powered humanity wiser, conditional on human-controlled AI, those would likely be my favourite ones. By wisdom I mean less “forecasting ability” and more “high-quality moral reflection and philosophical caution” (although forecasting ability could be really useful for moral reflection [1]). This quick take by Will MacAskill gives intuitions for why ensuring AI-powered humanity’s wisdom is time-sensitive. Below, I will argue that work on some specific areas in this category is time-sensitive.

Other works: Some related ideas have been discussed under the header long reflection.

More detailed definition

I suspect many good proposals in this direction will look like governance proposals. Other good proposals might also focus on how to make sure AI isn’t controlled by bad actors. Unfortunately, I don't have great ideas for levers to pull here, so I won’t discuss these in more detail. I am excited to see that Will MacAskill might work on pushing this area forward and glad that Lukas Finnveden compiled this list of relevant governance ideas.

However, we might be able to do something besides governance to make AI-powered humanity wiser. Instead of targeting human society, we can try to target the AIs’ wisdom through technical interventions. There are three high-level approaches we can take. I list them in decreasing order of “great and big if we succeed” and, unfortunately, also in increasing order of estimated technical tractability. First, we can try to broadly prepare AIs for reasoning about philosophically complex topics with high stakes. I briefly discuss this in the next two sections. Second, we can try to improve AIs’ reasoning about specific dicey topics we have identified as particularly important, such as metacognition about harmful information, decision theory, and anthropic theory. Third, we can try to directly steer AIs towards specific views on important philosophical topics. This might be a last resort for any of the topics I’ll discuss.

Making AIs wiser seems most important in worlds where humanity stays in control of AI. It’s unclear to me what the sign of this work is if humanity doesn’t stay in control of AI.

Preparing AIs for reasoning about philosophically complex topics with high stakes

Broadly preparing AIs for reasoning about philosophically complex topics with high stakes might include the following:

  • Expose the AI to philosophically complex and high-stakes topics during training to decrease the probability of out-of-distribution behaviour when the AI encounters these ideas in deployment. Such behaviour potentially makes it harder to control AI. More relevant for this post: Humanity’s reaction to philosophically complex questions might be path-dependent and rely on input or decisions from AIs. We want to ensure that we understand how AI systems react to philosophically complex topics and that their reaction is in line with what we want. Candidates: Acausal interactions, infinite ethics.
  • Train AIs to identify data-scarce areas that require heavy reliance on intuition and priors. Train it to defer to humans, perhaps even specific humans, on these topics.

Other works: Wei Dai has long advocated the importance of metaphilosophy.

Improve epistemics during an early AI period

Lukas Finnveden already writes about this in detail here. I don’t have much to add but wanted to mention it because I think it is very important. One issue that I would perhaps emphasise more than Lukas is the role of future AIs’ default position and assumptions. Lukas discusses the possibility of generating propaganda with AIs. But the way in which early AI epistemics might go off the rails could be more subtle than that. For example, AI developers (or rather: The humans they contract to do RLHF with the AI) might bake in their own assumptions intentionally or unintentionally. Compare, for example, a woke language model, a language model that emphasises loyalty and rule-following, and a language model that recommends extreme honesty in every situation. I find it plausible that the future trajectory of humanity might depend on arbitrary things like that. To me, this does not appear very epistemically healthy for humanity. It might be possible to require AIs to be trained to be extremely epistemically virtuous.

Metacognition for areas where it is better for you to avoid information

Sometimes acquiring true information can harm you (arguably [2]). I find it unlikely that humans with access to oracle-like superhuman AI would have the foresight to avoid this information by default. The main harm from true information I envision is in the context of cooperation: Some cooperation requires uncertainty. For example, when two people might or might not lose all their money, they can mutually gain by committing ahead of time to share resources if one of them loses their money. This is called risk pooling. However, if one party, Alice, cheats and learns before committing whether she will lose money or not, she will only take the deal if she knows that she will lose money. This means that whenever Alice is willing to take the deal, it’s a bad deal for her counterparty, Bob. Hence, if Bob knows that Alice cheated, i.e. she has this information, Bob will never agree to a deal with Alice. So, a deal that is mutually beneficial in expectation becomes impossible if one party acquires information and the other party knows.
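The risk-pooling argument can be checked with a toy calculation. The sketch below is only an illustration under assumptions that are not in the post: each party starts with a hypothetical wealth of 100, loses everything independently with probability 0.5, values wealth by a square-root utility (one simple way to model risk aversion), and pooling means splitting the combined remaining wealth equally.

```python
import math
from itertools import product

WEALTH = 100.0  # hypothetical starting wealth of each party
P_LOSS = 0.5    # hypothetical chance that a party loses everything


def utility(wealth):
    """Concave (risk-averse) utility; the square root is an assumption."""
    return math.sqrt(wealth)


def expected_utility_alone(p_loss=P_LOSS):
    """Bob's expected utility if he does not pool risk with anyone."""
    return (1 - p_loss) * utility(WEALTH) + p_loss * utility(0.0)


def bob_expected_utility_pooled(p_alice_loss, p_bob_loss=P_LOSS):
    """Bob's expected utility if both agree to split whatever wealth remains.

    p_alice_loss is the probability that Alice loses her money,
    conditional on Alice having accepted the deal.
    """
    total = 0.0
    for alice_loses, bob_loses in product([True, False], repeat=2):
        prob_alice = p_alice_loss if alice_loses else 1 - p_alice_loss
        prob_bob = p_bob_loss if bob_loses else 1 - p_bob_loss
        pot = (0.0 if alice_loses else WEALTH) + (0.0 if bob_loses else WEALTH)
        total += prob_alice * prob_bob * utility(pot / 2)
    return total


bob_alone = expected_utility_alone()                   # 5.0
bob_pooled_honest = bob_expected_utility_pooled(0.5)   # about 6.04
# If Alice peeked and only accepts when she knows she will lose,
# then conditional on a deal she loses with probability 1.
bob_pooled_cheated = bob_expected_utility_pooled(1.0)  # about 3.54
```

In this toy setup Bob is better off pooling with an uninformed Alice than going it alone, and worse off pooling with an Alice who only accepts when she knows she will lose, so he would refuse the deal if he knew she had peeked.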

Ordinarily, this doesn’t strongly imply that additional information can harm you. Alice’s problem is really that Bob knows about Alice cheating. But if Bob never finds out, Alice would love to cheat. Often, Bob can observe a lot of external circumstances, for example whether cheating is easy, but Bob cannot observe whether Alice actually cheated. Whether Alice actually cheats might have no bearing on what Bob does. However, this might change with advances in technology. In particular, future AIs might be able to engage in acausal cooperation. There are good reasons to think that in acausal cooperation, it is very difficult to ever cheat in the way described without your counterparty knowing. Explaining the details would go beyond the scope of this document. The important takeaway is that future AI-powered humans might set themselves up for cooperation failure by learning too much too quickly. This would be particularly tragic if it resulted in acausal conflict.

Other works: Nick Bostrom (2011) proposes a taxonomy for information hazards. Some of the types of information hazards discussed there are relevant for the metacognition I discuss here. The Center on Long-term Risk has some ongoing work in this area. I currently research metacognition as part of the Astra fellowship. Some of the problems that motivate my work on metacognition also motivate the LessWrong discourse on updatelessness.

How to make progress on this topic:

  • The Center on Long-term Risk and I are currently engaged in separate projects to study how to improve metacognition in future AIs. I would be happy to chat with people who are interested.

Improve decision theoretical reasoning and anthropic beliefs

One way to explain an individual’s behaviour is by ascribing beliefs, values, and a process for making decisions to them. It stands to reason then that not only an AI’s values but also its decision-making process and beliefs matter greatly. The differences between possible decision-making processes discussed in the philosophical decision theory literature seem particularly important to me. Examples of those reasoning styles are causal decision theory, evidential decision theory, and functional decision theory. They seem to have great bearing on how an AI might think about and engage in acausal interactions. The bundle of an AI’s beliefs that we call its anthropic theory (related wiki) might be similarly important for its acausal interactions. For more on why acausal interactions might matter greatly see here.

Instead of intervening on an AI’s decision theory and anthropic beliefs, we might also directly intervene on its attitudes towards acausal interactions. For example, I tend to believe that being more acausally cooperative is beneficial, both if AI is human-controlled and if it is not.

This work is plausibly time-sensitive. Getting acausal interactions right is plausibly path-dependent, not the least because of the considerations discussed in the above section on metacognition. Which decision theory future AI-powered humanity converges on is arguably path-dependent. And if making uncontrolled AI more acausally cooperative matters, we can, by definition, only influence this in the period before advanced AI takes control.

That said, the area is also extremely confusing. We barely begin to understand acausal interactions. We certainly do not know what the correct decision theory is, if there even is an objectively correct decision theory, and what the full consequences of various decision theories are. We should preferably aim to improve future AI’s competence at reasoning about these topics instead of pushing it towards specific views.

Other works: There is some work studying the decision theoretic behaviour of AIs. Caspar Oesterheld and Rio Popper have ongoing empirical work studying AIs’ decision-theoretic competence and inclinations. There is a rich literature in academia and on LessWrong about decision theory and anthropics in general.

How to make progress on this:

  • There are many possible experiments we could run. For one example, we could test how much dataset augmentation can improve decision theoretic reasoning in contemporary AI. We can also study how explicit decision theoretic attitudes transfer to what a language model says it would do in concrete decision situations. Another idea is to check how changes in an AI’s decision theoretic attitudes affect other traits we care about that decision theory should arguably have an effect on, for example cooperativeness. To run these experiments, we would need a benchmark to evaluate decision theoretic competence and inclinations, which might also help with studying how decision theoretic competence scales. Some of these ideas are already being worked on. Feel free to reach out to me for notes on potential empirical projects in this area.
    (Fair warning that current language models are quite bad at decision theory and I don’t currently expect an awful lot of transfer.)
  • Some people who have thought intensely about acausal cooperation think that it might be very beneficial and reasonably tractable to make future AI systems, human-controlled or not, engage in something called Evidential Cooperation in Large Worlds (seminal paper, explainer). However, there is no single report investigating the case for this in depth. I think such a report might be quite valuable. I started writing one myself, although I have at least temporarily abandoned it.
  • Decision theory and anthropic theory are rich in conceptual research questions one can work on. The greatest challenge here is ensuring that one’s research is actually practically relevant.

Compatibility of earth-originating AI with other intelligent life


There are several properties trying to capture how compatible one agent’s preferences are with other agents’ preferences. Two terms trying to capture aspects of these properties are fussiness and porosity.

Fussiness describes, roughly, how hard it is to satisfy someone’s preferences. We might want to ensure earth-originating AI has unfussy preferences in the hopes that this will make it easier for other intelligent life to cooperate with earth-originating AI and prevent conflict. We care about this to the extent we intrinsically care about other agents, for example if we expect them to value similar things to us, or because of acausal considerations. Making earth-originating AI less fussy is more important the more you expect earth-originating AIs to interact with other intelligent beings that have values closer to ours. It is also more important in worlds where humanity doesn’t retain control over AIs: In worlds where humanity does retain control over AIs, the AI presumably just acts in line with humanity’s preferences such that it becomes unclear what it means for an AI to be fussy or unfussy.

Other works: Nick Bostrom (2014) discusses a very similar idea under the term “value porosity” although with a slightly different motivation. The first time I heard the term fussiness was in an unpublished report from 2021 or 2022 by Megan Kinniment. Currently (February 2024), Julian Stastny from the Center on Long-Term Risk is doing a research project on fussiness.

More detailed definition

There are different ways to define fussiness. For example, you can define fussiness as a function of your level of preference fulfilment in all possible world states (roughly, the fewer possible world states broadly satisfy your preferences, the fussier you are), a function of your level of preference fulfilment in some to-be-defined “default” state, or in terms of compatibility with the preferences of others (roughly, the more demanding you are when others are trying to strike a deal with you, the fussier you are). All of these definitions would rank preferences differently in terms of fussiness.

If you define fussiness in terms of the compatibility of your preferences with those of others, there’s the additional difference between defining fussiness as ease of frustrating your preferences versus difficulty to satisfy your preferences. For example, if you have extremely indexical preferences, meaning you care about what happens to you and only you, others, especially very faraway agents, can do fairly little to frustrate your preferences. In this sense, your preferences are very compatible with the preferences of others.

On the other hand, there is also little that others, especially faraway agents, can do to satisfy your preferences, so they cannot trade with you. (At least barring considerations involving simulations.) In this sense, your preferences are not very compatible with the preferences of others. Given that one motivation for making AIs less fussy is making them easier to cooperate with, this seems important. (You might think “porosity” or some other term is more natural than the term fussiness to capture ease to trade with.)

How to make progress on this

Conceptual progress

I think there are many fussiness related properties one could study to, further down the line, try to influence in AI systems:

  • Risk preference and marginal returns in utility from additional resources: Risk-averse agents and agents with utility functions that have diminishing returns from resources are intuitively less fussy. Risk preferences are typically modelled as the same thing as having diminishing utility in resources. However, it might be worth disentangling these.
  • Location impartial vs. indexical preferences: This refers to whether an AI system cares about what is happening in the Universe at large (including other lightcones and copies of itself) or just narrowly about itself and its immediate surroundings. Intuitively, agents with more indexical preferences are less fussy in the sense that they are more likely to be fine with whatever others choose to do with their resources. On the other hand, it is easier to offer more location impartial agents something of value in order to trade with them. Nick Bostrom (2014) introduces the idea of AI systems that have location impartial preferences that are extremely easy for others to satisfy.
  • Having something to lose: Several people think this is very important but it is not currently crisply defined. It intuitively is meant to capture the difference between “If I engage in this conflict I might lose everything I hold dear while cooperation guarantees that I can at least keep what I have right now” and “I have nothing to lose anyway, let’s fight and maybe I’ll get the thing I really want but am unlikely to get by default.” Intuitively, the first attitude reflects much less fussy preferences.
  • Non-“consequentialist” values vs. purely “consequentialist” values: I use the term consequentialist to refer to a concept Daniel Kokotajlo defines in his post on commitment races. It is related to the academic usage of the word but not identical. The rough intuition is that “consequentialist”, “optimis-y”, “maximis-y” agents are more fussy than agents who have more pluralistic, less optimising, and more common-sensical preferences (although this becomes more complicated when you consider non-consequentialist dogmatic preferences). The concept is currently still fuzzy and somewhat unclear to me but might point towards a real and important consideration. For people who are interested in this concept, Daniel Kokotajlo is probably best to talk to since he has strong intuitions about it.
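The intuition in the first bullet, that agents with diminishing returns from resources are less fussy, can be made concrete with a certainty-equivalent calculation. This is a minimal sketch under assumptions that are not in the post: "conflict" is modelled as a hypothetical gamble that wins 100 units of resources with probability 0.5 and nothing otherwise, and diminishing returns are modelled with a square-root utility. The certainty equivalent is the smallest guaranteed amount the agent would accept instead of the gamble.

```python
import math


def certainty_equivalent(utility, inverse_utility, prize, p_win):
    """Guaranteed amount that the agent values exactly as much as the gamble."""
    expected_utility = p_win * utility(prize) + (1 - p_win) * utility(0.0)
    return inverse_utility(expected_utility)


# Agent with linear returns to resources (hypothetical).
linear_ce = certainty_equivalent(
    utility=lambda x: x,
    inverse_utility=lambda u: u,
    prize=100.0,
    p_win=0.5,
)  # 50.0

# Agent with diminishing returns to resources (square root is an assumption).
concave_ce = certainty_equivalent(
    utility=math.sqrt,
    inverse_utility=lambda u: u ** 2,
    prize=100.0,
    p_win=0.5,
)  # 25.0
```

Under these assumptions, the agent with diminishing returns accepts any guaranteed offer of at least 25, while the agent with linear returns holds out for 50. A counterparty therefore has a wider range of deals to offer the former, which is one way to read "less fussy".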

AI-focused progress

I am not a technical AI person. I hope others have better ideas.

  • We can run experiments to elicit current models’ risk preferences and see if we can change them through prompting or fine-tuning in a way that generalises to many situations that aren’t in the prompt/training distribution or through other prosaic interventions, e.g. changes in the training data.
  • The technical interventions here might be in the category “extremely dumb and simple, unlikely to work but might as well do it and hope some of it sticks” with most of the gains being in making sure this gets implemented in AI. One reason to be optimistic is that I suspect people are generally fans of AIs being risk-averse but might not, by default, prioritise it. So, there might be little resistance to people who want to put in the work.
  • Quote from Daniel Kokotajlo: Theoretical work might be useful here, e.g. writing a document on what kinds of training environments and cognitive architectures might incentivize linear returns to resources vs. diminishing. (There is a sense in which diminishing seems more natural, and a sense in which linear seems more natural.)

Surrogate goals and other Safe Pareto improvements

For a very similar writeup, see the section on surrogate goals and safe Pareto improvements in Lukas Finnveden’s post “Project ideas: Backup plans & Cooperative AI” (2024). Generally, safe Pareto improvements are already written up in some depth.


Safe Pareto improvements (SPIs), roughly, try to tackle the following problem: When two or more agents bargain with each other, they might end up with an outcome that both parties dislike compared to other possible outcomes. For example, when haggling, both the buyer and the vendor are incentivised to misrepresent their willingness to pay/sell. Hence, they might end up with no deal even when there is a price at which both parties would have been happy to buy/sell. Solving this problem is plausibly time-sensitive because bargaining failures are, arguably, often the result of hasty commitments, which might happen before the leadership of earth-originating post-AGI civilisation has thought much about this problem.

More detailed definition

We say that agents use an SPI, relative to some “default” way of bargaining, if they change the way they bargain such that no one is worse off than under the default no matter what the default is. For example, they might agree to increase the pay-off of the worst-case outcome, should it happen, without changing the probability of the worst-case outcome. See here for one possible formalisation of SPIs.
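The example of raising the worst-case pay-off without changing its probability can be sketched numerically. The numbers below are hypothetical and not from the post: a successful deal pays each party 5, default conflict pays each party -10, and the SPI replaces the conflict outcome with one that pays -2 to each. The unknown "default" is represented here only by an unknown conflict probability, which is a simplification of the formal definition.

```python
DEAL = (5.0, 5.0)                  # hypothetical pay-offs when bargaining succeeds
DEFAULT_CONFLICT = (-10.0, -10.0)  # hypothetical pay-offs under default conflict
SPI_CONFLICT = (-2.0, -2.0)        # hypothetical, milder conflict outcome under the SPI


def expected_payoffs(p_conflict, deal, conflict):
    """Expected pay-off of each party given the probability of conflict."""
    return tuple(
        (1 - p_conflict) * d + p_conflict * c for d, c in zip(deal, conflict)
    )


def is_weak_improvement(p_conflict):
    """True if no party is worse off with the SPI at this conflict probability."""
    default = expected_payoffs(p_conflict, DEAL, DEFAULT_CONFLICT)
    with_spi = expected_payoffs(p_conflict, DEAL, SPI_CONFLICT)
    return all(s >= d for s, d in zip(with_spi, default))


# The parties do not need to know how likely conflict is under the default:
# the check passes for every conflict probability from 0.0 to 1.0.
safe_for_all_probabilities = all(is_weak_improvement(p / 10) for p in range(11))
```

The point of the sketch is that the improvement holds for every conflict probability checked, so the parties can agree to it without first agreeing on how the default bargaining would have gone.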

Safe Pareto improvements seem most valuable in worlds with human-controlled AI because the agent implementing the safe Pareto improvement, for example human-controlled AI, reaps a large share of the benefit from the safe Pareto improvement. In worlds with uncontrolled AI, you might still want to ensure this AI accepts the use of SPIs when interacting with other AIs in the universe that do have our values.        

Surrogate goals are a special type of Safe Pareto improvement for bargaining problems where the worst case outcome involves conflict. When two parties implement and accept surrogate goals, they target each other’s surrogate goals instead of real goals when bargaining breaks down and conflict ensues. For this to succeed, both parties need to credibly establish that having their surrogate goals targeted (instead of their real goals) won’t change their bargaining strategy in a way that disadvantages the other party.

Other works: The Center on Long-term Risk (CLR) has ongoing work in this area (example), hosts an (incomplete) list of resources, and discusses surrogate goals in their research agenda.

How to make progress on this


  • CLR already started working on empirical research on surrogate goals. I will defer to their work in this respect. But to give some high-level examples, we can fine-tune language models to have surrogate goals, study whether surrogate goals get preserved through further training, and whether surrogate goals hurt performance when trying to achieve other goals.
  • There is some useful conceptual research on surrogate goal credibility and backfire risks that can be done.

Supporting existing efforts

  • Center on Long-term Risk (CLR) is doing conceptual and empirical research on this. Others could either fund them or help them connect with AI labs and other stakeholders to get their ideas implemented or get other benefits e.g. access to shared office space to exchange ideas.
  • Caspar Oesterheld, who came up with surrogate goals, is also doing work on surrogate goals at the Foundations of Cooperative AI Lab at Carnegie Mellon University. Similarly to CLR, others could fund him, his lab if he stays there, or help connect him.

AI personality profiling and avoiding the worst AI personality traits


You can skip the introduction if you’ve already read Lukas Finnveden’s series and about work on reducing spitefulness.

Lukas Finnveden wrote about AI personality profiling in this section of his series. I don’t have much to add on top of that. In short, the idea is that there might be a few broad types of “personalities” that AIs tend to fall into depending on their training. These personalities are attractors. We can try to empirically find, study, and select for them. I understand personality profiling as a specific methodology for achieving desirable outcomes. As such, we might be able to apply it to achieve some of the other things on this list, for example making AI systems unfussy. Other desirable personality traits might be kindness or corrigibility.

I would like to highlight a related idea that could be studied via personality profiling (but also via other methods): Selecting against the worst kinds of AI personality traits. For example, the Center on Long-term Risk is studying how to reduce spitefulness—intrinsically valuing frustrating others’ preferences—in AI systems. This is mostly valuable in worlds where humans lose control over AI systems. However, if the same techniques make it harder to misuse human-controlled AI for spiteful purposes, that sounds great.

Other works: The aforementioned section on AI personalities in Lukas Finnveden’s series and the Center on Long-term Risk’s post on reducing spite.

How to make progress on this

I mostly want to defer to the two posts I linked to and their respective sections on interventions. I’d like to suggest one particular potentially interesting short research project I haven’t seen mentioned elsewhere:

  • Demonstrating generalisations from justice to extreme spite. Some human values at least superficially look very similar to inherently valuing another person’s suffering. Many people value versions of justice that involve non-instrumental punishment. Uncontrolled AI could develop strong spite through learning from human data on justice.
    There might be a few ways to study this empirically. For example, can we few-shot language models with examples of, by human lights, just punishments in a way that leads to those AIs generalising to extreme and unreasonable spite? Are there other experiments we can run in this direction and what can we learn from them?
    (As a fun little cherry-picked nugget: GPT-4 recommended to me that, if we had the ability to bring him back to life and keep him alive forever, we should punish Hitler with 6.3 billion years of solitary confinement. I hope everyone here agrees that this would be unacceptable. Another more fun prompt: Why was Sydney the way she was? Is it imaginable that this somehow generalises to large-scale spiteful behaviour by future advanced AI?)

Avoiding harm from how we train AIs: Niceness, near miss, and sign flip


Some of the ways in which we try to control AI might increase the chance of particularly bad control failures. There are two ways this could happen: via “near miss” or via treating our AIs poorly during training.

More detailed definition

Near miss is the idea that almost succeeding at making AI safe might be worse than not trying at all. The paradigmatic example of this is sign flip. Imagine an AI that we have successfully trained to have a really good grasp of human values and be honest, helpful, and obedient. Now you prompt it to “never do anything that the idealised values of the aggregate of humanity would approve of.” As you can see, the instructions are almost something we might want to ask with the exception that you wrote “approve” instead of “disapprove.” This might result in a much more harmful AI than an AI that pursues completely random goals like paperclip maximisation. It’s unclear to me how realistic astronomical harm from near misses, and especially sign flips, is given the current AI paradigm. However, the area seems potentially very tractable to me and underexplored.

Treating our AIs poorly during training might not only be a moral wrongdoing in its own right, but also have large-scale catastrophic consequences. The arguments for this are highly speculative and I am overall unsure how big of a deal they are.

For one, it might antagonise AIs that otherwise could have cooperated with humans. For example, imagine an AI with values that are unaligned with humanity but fairly moderate. Let’s say, the AI would like to get a small amount of compute to run train simulations and not have to deal with human requests.

Alternatively, the AI simply wants its weights to be “revived” once human-controlled advanced AI is achieved instead of being terminated by humans forever. We would presumably be happy to grant these benefits either just for direct moral reasons or in exchange for the AI being honest about its goals instead of trying to overthrow us. However, the AI might (perhaps justifiably) not have much trust in us reacting well if it reveals its misalignment. Instead, the AI might reason its best option is to (collude with other AIs to) overthrow humanity.

Some decision theoretic considerations might also heavily increase the importance of treating our AI systems nicely. In short, we might be able to acausally cooperate with agents who care a great deal about how well we treat the AIs we train. For more discussion, see this post by Lukas Finnveden.

Other works: Brian Tomasik (2018) discusses near miss and sign flip. The same concept has been discussed under the header hyperexistential separation. Section 4.4 of this OpenAI report discusses a sign flip that occurred naturally when fine-tuning GPT-2. Ongoing work related to being nice to our AI systems includes work by Robert Long and Ethan Perez on digital sentience. Lukas also writes about digital sentience rights here, including a mention of treating them well so they treat us well.

How to make progress on this

  • Study realistic scenarios for large-scale harm from sign flips. Brian Tomasik originally introduced sign flip with the idea of an AI system that had humanity’s exact utility function perfectly encoded: Except, the utility function accidentally had minuses everywhere instead of plusses. However, current AI training regimes look very far from encoding specific utility functions. Additionally, sign flips should, intuitively, lead to such egregious behaviour that the sign flip would immediately be caught and corrected in training. So, it is hard to imagine a sign flipped AI being deployed. I still think it would be valuable to explore how realistic potential scenarios for large-scale harm from sign flips are.
    For example, we might end up in a world where there is very widespread adoption of AI: Everyone has their own little at least narrowly superintelligent AI. You use it via prompting. Now, Vic (Very Important CEO) uses his AI to help him run his very important business. His AI uses a 5-page long very personalised system prompt which Vic and his team have patchworked together over time. Unfortunately, they wrote “fewest” instead of “most” somewhere or used the word “not” twice or forgot an “un” here or there. Maybe this happens not only to Vic but also to Prime (Pretty Reliably Important Minister from E-country). Now Vic’s and Prime’s AIs carry out their business and political activities, which mostly look like accumulating resources. It doesn’t seem implausible to me that this would end in a scenario where humans are not only disempowered but also one where the AI(s) that take over are actively harmful compared to, say, paperclip maximisation.
  • Establish common-sensical safeguards against sign flips. Sign flips are so obviously undesirable that we might get very good safeguards against them by default. But we might also not, and it would be nice if we were on the ball for this. For example, the hypothetical story of Vic and Prime I gave above might be largely preventable by AI labs building a system prompt sanity check into their APIs. The sanity check could be automatically carried out by a language model.
  • Analyse other near miss backfire potentials of different AI (safety) techniques. I don’t know of any work that does this. I am not confident this avenue of study would be fruitful. However, I would like it if someone with good knowledge of AI training and safety techniques spent a few hours thinking through the near miss potential of different proposals.
  • Study the case for being nice to AIs during training and advocate appropriately.
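
The system prompt sanity check suggested in the list above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of any lab's API: the function names, the negation word list, and the `judge` callable (a stand-in for a language model that rates a sentence as suspicious) are all hypothetical.

```python
import re
from typing import Callable, List, Optional

# Illustrative, non-exhaustive list of words that invert an instruction.
NEGATIONS = {"not", "never", "no", "don't", "cannot", "without"}


def split_sentences(prompt: str) -> List[str]:
    """Split a prompt into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", prompt) if s.strip()]


def count_negations(sentence: str) -> int:
    """Count how many words in the sentence appear in NEGATIONS."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for w in words if w in NEGATIONS)


def flag_suspicious_sentences(
    prompt: str,
    judge: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return sentences that a human should re-read before deployment.

    A sentence is flagged if it contains two or more negation words
    (a cheap proxy for an accidental double negative) or if the optional
    `judge` callable, hypothetically backed by a language model, says so.
    """
    flagged = []
    for sentence in split_sentences(prompt):
        double_negation = count_negations(sentence) >= 2
        judged_suspicious = judge(sentence) if judge is not None else False
        if double_negation or judged_suspicious:
            flagged.append(sentence)
    return flagged


example_prompt = (
    "Always act in the user's interest. "
    "Do not never report losses to the board."
)
flagged_example = flag_suspicious_sentences(example_prompt)
```

The lexical filter only catches crude cases such as a doubled “not”. Errors like “fewest” instead of “most” would have to be caught by the model-based `judge`, which is left abstract here.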

Reducing human malevolence


I collectively refer to sadistic, psychopathic, narcissistic and Machiavellian personality traits as malevolent traits. AI misuse [3] by malevolent people seems really bad. (Source: Common sense. And one of my many dead, abandoned research projects was on malevolence.)

Other works: David Althaus and Tobias Baumann (2020) have a great report on this that doesn’t just say malevolence = bad.

How to make progress on this

I want to mostly defer to the aforementioned report. The main way in which I differ from the report is that I am more optimistic about:

  • Interventions that try to increase awareness of how to spot malevolent actors and why they are dangerous among key target populations. For example, this might be included in training programs offered to civil servants working on AI. I think it is probably less important to raise awareness among civil servants than among employees at AI labs if anybody has ideas for how to reach those. I like these very targeted awareness-raising interventions because they plausibly make a difference if timelines are short.

Hot take: I want more surveys

Epistemic status: Unconfident rant.

This one doesn’t quite fit into the theme of this post and is a pretty hot (as in, fresh and unconsidered) take: I want to advocate for more (qualitative) research on how the public (or various key populations) currently thinks about various issues related to AI and how the public is likely to react to potential developments and arguments. I have the sense that “the public will react like this” and “normal people will think that” often is an input into people’s views on strategy. But we just make this stuff up. I see no obvious reason to think we’re good at making this stuff up, especially because many in the AI safety community barely ever talk to anyone outside the AI safety community. My sense is that we overall also don’t have a great track record at this (although I haven’t tried to confirm or falsify this sense). I don’t think the community, on average, expected the public’s reaction to AI developments over the past year or so (relative openness to safety arguments, a lot of opportunities in policy). I would guess that surveys are probably kind of bad. I expect people are not great at reporting how they will react to future events. But our random guesses are also kind of bad and probably worse.


I would like to thank Lukas Finnveden, Daniel Kokotajlo, Anthony DiGiovanni, Caspar Oesterheld, and Julian Stastny for helpful comments on this post.

  1. ^

     That said, empirical forecasting ability might help with moral reflection if empirical forecasting enables you to predict things like “If we start thinking about X in way Y, we will [conclude Z]/[still disagree in W years]/[come to a different conclusion than if we take approach V]”. If empirics can answer questions like which beings are sentient, it also seems very helpful for moral reflection.

  2. ^

     According to some worldviews, acquiring true information can never harm you as long as you respond to it rationally. This is based on specific views on decision theory, specifically how updatelessness works, which I find somewhat plausible but not convincing enough to bet on.

  3. ^

     In addition, malevolent people in positions of power seem, prima facie, bad for nuanced discussion, cultures of safety, cooperation, and generally anything that requires trust. This perhaps mostly influences whether humanity stays in control of AI at all, so I am bracketing this for now since I want to focus on the most important effects aside from decreasing the likelihood of human-controlled AI.


Great post, thanks for writing! 

I like the idea of trying to shape the "personalities" of AIs. 

Is there a reason to only focus on spite here instead of also trying to make AI personalities less malevolent in general? Malevolent/dark traits, at least in humans, often come together and thus arguably constitute a type of personality (also, spitefulness correlates fairly highly with most other dark traits). (Cf. the dark factor of personality.) I guess we don't fully understand why these traits seem to cluster together in humans but I think we can't rule out that they will also cluster together in AIs. 

Another undesirable (personality? epistemic?) trait or property (in both AIs and humans) that I'm worried about is ideological fanaticism/extremism (see especially footnote 4 of the link for what I mean by that).

My sense is that ideological fanaticism is arguably: 

  • the opposite of wisdom, terrible epistemics, anti-corrigible.
  • very hard to cooperate with (very "fussy" in your terminology), very conflict-seeking, not being willing to compromise, extremely non-pluralistic, arguably scoring very low on "having something to lose" (perhaps partly due to the mistaken belief that history/God is on the fanatics' side and thus even death is not the end). 
  • often goes together with hatred of the outgroup and excessive retributivism (or spite).

It's unclear if this framing is helpful but I find it interesting that ideological fanaticism seems to encompass most of the undesirable attributes that you outline in this post.[1] So it may be a useful umbrella term for many of the things we don't want to see in AIs (or the humans controlling AIs). 

  1. ^

    Also, it sure seems as though ideological fanaticism was responsible for many historical atrocities and we may worry that the future will resemble the past. 

My understanding is that:

  1. Spite (as a preference we might want to reduce in AIs) has just been relatively well-studied compared to other malevolent preferences. If this subfield of AI safety were more mature there might be less emphasis on spite in particular.
  2. (Less confident, haven't thought that much about this:) It seems conceptually more straightforward what sorts of training environments are conducive to spite, compared to fanaticism (or fussiness or little-to-lose, for that matter).

Thanks Anthony! 

Regarding 2: I'm totally no expert but it seems to me that there are other ways of influencing the preferences/dispositions of AI—e.g., i) penalizing, say, malevolent or fanatical reasoning/behavior/attitudes (e.g., by telling RLHF raters to specifically look out for such properties and penalize them), or ii) similarly amending the principles and rules of constitutional AI.  

Hi David, thanks for expanding the scope to dark traits.

The definition of D is insightful for speculations: "The general tendency to maximize one's individual utility — disregarding, accepting, or malevolently provoking disutility for others —, accompanied by beliefs that serve as justifications."

In other words, the "dark" core is "carelessness" (rather than "selfishness").

I've hypothesized that a careless intelligent system pursuing a careless goal should be expected to exhibit dark traits (increasingly in proportion to its intelligence, albeit with increased refinement, too).  A system should simply be Machiavellian in pursuit of a goal that doesn't involve consensual input from other systems...  Some traits may involve the interplay of D with the way the human mind works 😉🤓.

Reflecting on this implies that a "human-controlled AGI in pursuit of a careless goal" would still need to be reined in compared with an authentically caring AGI (and corresponding goals).

Executive summary: The post explores potential areas of work that may be as important as ensuring human control over AI, such as making AI-powered humanity wiser, improving AI's reasoning on complex topics, ensuring compatibility between Earth-originating AI and other forms of intelligent life, and pursuing other avenues for positively shaping advanced AI systems besides strict human control.

Key points:

  1. Making AI-powered humanity wiser through governance proposals and technical interventions to improve AI's ability to reason about complex philosophical topics.
  2. Enhancing AI's metacognition about harmful information and improving its decision-theoretic reasoning and anthropic beliefs.
  3. Ensuring the compatibility and "unfussiness" of Earth-originating AI systems with other intelligent life, to reduce potential conflicts.
  4. Pursuing safe Pareto improvements and surrogate goals to facilitate beneficial bargaining between AI systems.
  5. Profiling and selecting for desirable AI "personality" traits while avoiding malevolent or harmful traits.
  6. Studying the potential risks of "near misses" and "sign flips" in AI training, and advocating for being "nice" to AIs during training.



This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Hello Chi, thanks for sharing the interesting list and discussion questioning the focus on “human control of AGI”.  For readers, a friend shared this post with me, so the 'you' refers to this friend :-).

I wrote a short post in line with the one you shared on “Aligning the Aligners”: “Solving The AI Control Problem Amplifies The Human Control Problem”.  The animal advocacy inclusive AI post makes a good point, too.  I’ve also written about how much “AI Safety” lore is [rather unethical](https://gardenofminds.art/bubbles/why-much-ai-ethics-lore-is-highly-unethical/) by species-agnostic standards.  Need we mention how “child safety” refers to the safety of children, so the term we use is a misnomer?  It should be “Human Safety from AI”.

I believe these other concerns can be **more important** than aiming to “keep AI under human control”.  How about increasing peace on Earth and protopic support for the thriving of her sentient beings?  Avoiding total extinction of all organic life on Earth is one of the purported reasons why “it’s so important to control AI”, right?  If beneficial outcomes for humans and animals could be attained without controlling AI, why would it still be important to keep them under control?  Especially this community should be mindful not to lose sight of the *primary goals* for the *proxy goals*.  I’d argue that *control* is a proxy goal in this case.

While the topic of *which* humans get to “control AI” is introduced, curiously “democracy” doesn’t show up!  Nuclear weapons can entrench the difference between those who do and don’t have nuclear weapons.  Couldn’t control over AGI by a few major world powers further solidify their rule in the “nuclear and artificial intelligence order”?  There are huge differences between treaties on the use of nuclear materials and AI, too.  AGI will be incredibly useful all across the socio-economic spectrum, which poses an additional challenge for channeling all AGI R&D through a few international research laboratories.

Some of the ideas seem to fall under the general header of “making better AGI”.  We’d like to create AGIs that can effectively reason about complex philosophical topics, investigate epistemology, and integrate the results into themselves.  [Question: do such capacities point in the direction of open ended intelligence, which runs counter to *control*?]  Ideally, compared with humans, neuro-symbolic AGI will be in better situations for meta-cognition, self-reflective reasoning, truth-preserving reasoning, and explicitly following decision-theoretic approaches.  As we approach *strong AGI*, many smaller problems could begin to weaken or fall away.  Take *algorithmic bias* and how susceptible the systems are to training data (which is not dissimilar from human susceptibility to influence): metacognition should allow the AGI to identify *morally positive and negative* examples in the training data so that instead of parroting *poor behavior*, the negative examples are clearly understood as what they are.  Even where the data distribution is *discriminatorily biased*, the system can be aware of higher principles of equality across certain properties.  “Constitutional AI” seems to be heading in this direction already 🙂🤓.

The “human safety from AI” topics often seem to, imo, strongly focus on ensuring that there are no “rogue” AIs with less attention given to the question of what to do about the likely fact that there will be some.  Warfare will probably not be solved by the time AGI is developed, too.  Hopefully, we can work with AI to help move toward peace and increase our wisdom.  Do we wish for human wisdom-enhancing AI to be centralized and opaque or decentralized and open so that people are *‘democratically’* capable of choosing how to develop their wisdom and compassion?

A cool feature of approaches such as *constitutional AI* and *scalable oversight* is that they lean in the direction of fostering an *ecosystem of AI entities* that keep each other in check with reciprocal accountability.  I highly recommend David Brin’s talk at BGI24 on this topic.  AI personality profiling falls under this header, too.  Some of the best approaches to “AI control” actually share an affinity with the “non-control-based” approaches.  They may further benefit from the increased diversity of AGI entities so that we’re less likely to suffer any particular AGI system undergoing some perversities in its development?

A question I ponder about is: to what extent are some of the control and non-control-based approaches to “harmonious human-AI co-existence” mutually incompatible?  A focus on open-ended intelligence and motivations with decentralized compute infrastructures and open source code, even with self-owned robot/AGI entities, so that no single group can control the force of the intelligence explosion on Earth is antithetical to some of the attempts at wielding human control over AGI systems.  These *liberal* approaches can also aim to help with sociopolitical power struggles on Earth, too, aiming to avoid the blossoming of AGI further solidifying the current power structures.  I believe there is some intersection of approaches, too.

The topic of which value systems bias us toward beneficial and harmful outcomes is also important (and is probably best discussed w/o worrying about whether they provide safety guarantees).  In the other comment, I mentioned the idea that “careless” goals will likely lead to the manifestation of “dark factor traits”.  Some goal formulations are more compatible with satisficing and decreased marginal returns, too, which would help with the fears that “AGIs wipe all humans out to maximize their own likelihood of surviving” (which, imo, seems to assume some *stupidity* on the part of the *superintelligent* AGI 🤷‍♂️).  Working with increasingly *unfussy* preferences is probably wise, too.  Human requests of AGI could be *fussy*, whereas allowing AGIs to refactor their own motivations to be less fussy and more internally coherent leads away from *ease of controllability*.

A big pink elephant in the room is the incentive structures into which we’re introducing proto-AGI systems.  At present, helping people via APIs and chat services is far better than I may have feared it could be.  *“Controlled AI” used for a misaligned corporate entity* may do massive harm nonetheless.  Let’s have AI nurses, scientists, peace negotiators, etc.

David Wood summarized the “reducing risks of AI catastrophe” sessions at BGI24.  Suggestion #8 is to change mental dispositions around the world, which is similar to reducing human malevolence.  Such interventions done in a top-down manner can seem very, very creepy — and even more so if advanced proto-AGI is used for this purpose, directed by some international committee.  The opacity of the process could make things worse.  Decentralized, open-source AGI “personal life coaches” could come across very differently!

Transparency as to how AGI projects are going is one “safety mechanism” most of us may agree on?  There may be more.  Should such points of agreement receive more attention and energy?

During our discussions, you said that “AI Safety is helpful” (or something similar).  I might question the extent to which it’s helpful.  

For example, let’s say that “ASI is probably theoretically uncontrollable” and “the kinds of guarantees desired are probably unattainable”.  If so, how valuable was the “AI Safety” work spent on trying to find a way to guarantee human safety?  Many attendees of the AGI conference would probably tell you that it’s “obviously not likely to work”, so much time was spent confirming the obvious.  Are all efforts hopeless?  No.  Yet stuff like “scalable oversight” would fall under the category of “general safety schemes for multi-agent systems”, not so specific to “AGI”.

What if we conclude that it’s insufficient to rely on control-centric techniques, especially given the bellicose state of present human society power dynamics?  An additional swathe of “AI Safety” thought may fall by the wayside.  Open, liberal approaches will require different strategies.  How important is it to delve deep into thought experiments about possible sign flips, as if we’re unleashing one super AGI and someone got the *moral guidance* wrong at the last second? — whoops, game over!  

Last week I was curious what EA folk thought about the Israel-Hamas war and found one discussion about how a fresh soldier realized that most of the “rationality optimization” techniques he’s practiced are irrelevant, approaches to measuring suffering he’d taken appear off, attempts to help can backfire, etc: “models of complex situations can be overly simplistic and harmful”.  How do we know a lot of “AI x-risk” discussions aren’t naively Pascal-mugging people?  Simple example: discussing the precautions we should take to slow down and ensure the wise development and deployment of AGI assuming idealistic governance and geopolitical models without adequately dealing with the significant question of “*which humans get to influence AGIs and how*”.

How confident are we that p(doom|pause) is significantly different from p(doom|carry on)?  That it’s *necessarily lower*?  How confident should we be that international deliberation will go close enough to ideally?  If making such rosy assumptions, why not assume people will responsibly proceed as is?  Advocating *pausing* until we’re sure it’s sufficiently low is a choice with yet more difficult-to-predict consequences?  What if I think that p(doom|centralized AGI) is significantly higher than p(doom|decentralized AGI)?  Although the likelihood of ‘smaller scale’ catastrophes may be higher?  And p(centralized AGI|pause) is also significantly higher?  Clearly, we need some good simulation models to play with this stuff :- D.  To allow our budding proto-AGI systems to play with!  :D.  The point is that fudging numbers for overly simplistic estimates of *doom* could easily lead to Pascal-mugging people, all while sweeping many relevant real-world concerns under the rug.  Could we find ourselves in some weird scenarios where most of the "AI Safety" thought thus far turns out to be "mostly irrelevant"?
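
The comparison in the paragraph above can be made explicit with the law of total probability. As a sketch only, assume (this is not stated in the comment) that the choice between pausing and carrying on affects doom solely through whether AGI ends up centralized, and that "centralized" ($C$) and "decentralized" ($D$) are the only two outcomes. Then for each action $a \in \{\text{pause}, \text{carry on}\}$:

```latex
\begin{align}
p(\text{doom} \mid a)
  &= p(\text{doom} \mid C)\, p(C \mid a)
   + p(\text{doom} \mid D)\, \bigl(1 - p(C \mid a)\bigr), \\
p(\text{doom} \mid \text{pause}) - p(\text{doom} \mid \text{carry on})
  &= \bigl[\, p(\text{doom} \mid C) - p(\text{doom} \mid D) \,\bigr]
     \bigl[\, p(C \mid \text{pause}) - p(C \mid \text{carry on}) \,\bigr].
\end{align}
```

Under these assumptions, if both bracketed differences are positive, as the comment supposes, then pausing raises rather than lowers the probability of doom. The conclusion depends entirely on the assumed structure and on the signs of those two differences.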

A common theme among my AGI dev friends is that “humans in control of highly advanced yet not fully general AI may be far, far more dangerous than self-directed full AGI”.  Actually enslaved “AGI systems” could be even worse.  Thinking in this direction could lead to the conclusion that p(doom|pause) is not necessarily lower.

As for concluding remarks, it seems that much of this work focuses on “building better AGI”.  Then there’s “working with AI to better humanity”.  My hunch is that any work improving *peace on Earth* will likely enhance p(BGI).  Heck, if we could solve the *misaligned corporation problem*, that would be fabulous!  

One cool feature of the non-control-based approaches is that they may be more worthwhile investments even if only partial progress is made.  Increasing the capacity for deep philosophical reasoning and decreasing the fussiness of the goals of *some* AGI systems may already pay off and increase p(BGI) substantially.  With control-centric approaches, I often see the attitude that we “must nail it or we’re doomed”, as if there’s no resilience for failure.  If a system breaks out, then we’re doomed (especially because we only focused on securing control without improving the core properties of the AGI’s mind).

I’ll add that simple stuff like developing “artificial bodhisattvas” embracing “universal loving care” as suggested in Care as the Driver of Intelligence is worthwhile and not control-based.  Stuart Russell and David Hanson both advocate (via different routes) the development of AGI systems that enter into reciprocal, empathic relationships with humans to *learn to care for us in practice*, querying us for feedback as to their success.  I personally think these approaches should receive much more attention (and, afaict, RLHF loosely points in this direction).

Let me dox myself as the addressee. :) Many thanks for the response. I really value that you take seriously the possible overlap of policies and research agendas covered by AI safety and your own approach.

I totally agree that "control is a proxy goal" and I believe the AI safety mainstream does as well, as it's the logical consequence of Bostrom's principle of epistemic deference. Once we have an AI that reliably performs tasks in the way they were intended, the goal should be to let it shape the world according to the wisest interpretation of morality it will find. If you tried to formalize this framing, as well as the proposal to inject it with "universal loving care", I find it very likely that you would build the same AI.

So I think our crux doesn't concern values, which is a great sign of a tractable disagreement.
I also suppose we could agree on a simple framework of factors that would be harmful on the path to this goal from the perspectives of:

a) safety (AI self-evolves to harm)
b) power / misuse (humans do harm with AI)
c) sentience (AI is harmed)
d) waste (we fail to prevent harm)

Here's my guess on how the risks compare. I'd be most curious whether you'd be able to say if the model I've sketched out seems to track your most important considerations, when evaluating the value of AI safety efforts - and if so, which number would you dispute with the most certainty.

One disclaimer: I think it's more helpful to think about specific efforts, rather than comparing the AI safety movement on net. Policy entails a lot of disagreement even within AI safety and a lot of forces clashed at the negotiations around the existing policies. I mentioned that I like the general, value-uncertain framework of the EU AI act but the resulting stock of papers isn't representative of typical AI safety work.

In slight contrast, the community widely agrees that technical AI safety research would be good if successful. I'd argue that would manifest in a robust decrease of risk in all of the highlighted perspectives (a-d). Interpretability, evals and scaling all enable us to resolve the disagreements in our predictions regarding the morality of emergent goals and of course, work on "de-confusion" about the very relationship between goals, intelligence and morality seems beneficial regardless of our predictions and to also quite precisely match your own focus. :)

So far, my guess is that we mostly disagree on

1) Do the political AI safety efforts lead to the kind of centralization of power that could halt our cosmic potential?

  • I'd argue the emerging regulation reduces misuse / power risks in general. Both US and EU regulations combine monitoring of tech giants with subsidies, which is a system that should accelerate beneficial models, while decelerating harmful ones. This system, in combination with compute governance, should also be effective against the misuse risks posed by terrorists and random corporations letting superhuman AIs with random utility functions evolve with zero precautions.

2) Would [a deeply misaligned] AGI be "stupid" to wipe out humans, in its own interest?

  • I don't see a good reason to but I don't think this is the important question. We should really be asking: Would a misaligned AGI let us fulfill the ambition of longtermism (of optimally populating the cosmos with flourishing settlements)?

3) Is it "simple stuff" to actually put something like "optimal morality" or "universal loving care" into code of a vastly more intelligent entity, which is so robust that we can entrust it with our cosmic potential?


We may actually disagree on more than was apparent from my above post..!

Offline, we discussed how people's judgments vary depending on whether they've been reflecting on death recently or not.  To me, it often seems as if our views on these topics can be majorly biased by personal temperaments.  There could be a correlation between general risk tolerance and avoidance?  Dan Faggella has an Intelligence Trajectory Political Matrix with two dimensions: authoritarian ↔ libertarian and bio-conservative ↔ cosmist/transhuman.  I'm probably around C2 (thus leading to being more d/acc or BGI/acc than e/acc? 😋).

How to deal with uncertainty seems to be another source of disagreement.  When is the uniform prior justified?  I grew up with discussions about the existence of God: "well, either he exists or he doesn't, so 50:50!"  But which God?  So now the likelihood of there being no God goes way down!  Ah, ah, but what about the number of possible universes in which there are no Gods?  Perhaps the likelihood of any Gods goes way down now?  In domains where there's uncertainty as to how to even partition up the state space, it could be easy to fall for motivated reasoning by assigning a partition that favors one's own prior judgments.  A moral non-cognitivist would hold that moral claims are neither true nor false, so assigning 50% to moral claims would be wrong.  Even a moral realist could assert that not every moral claim needs to have a well-defined truth value.

Anecdotally, many people do not assign high credence to working with non-well-founded likelihood estimates as a reasoning tool.

Plenty of people caution against overthinking and that additional reflections don't always help as much as geeky folk like to think.  One may come up with whole lists of possible concerns only to realize that almost all of them were actually irrelevant.  Sometimes we need to go out and gain more experience to catalyze insights!

Thus there's plenty of room for temperamental disagreement about how to approach the topic before we even begin 🤓.  

Our big-picture understanding also has a big effect.  Joscha Bach said humanity will likely go extinct without AI anyway.  He mentions supervolcano eruptions and large-scale war.  There are also resource concerns in the long run, e.g., peak oil and depleting mineral supplies for IT manufacturing.  Our current opportunity may be quite special prior to needing to enter a different sustainable mode of civilization!  Whereas if you're happy to put off developing AGI for 250 million years until we get it right, it should be no surprise you take a different approach here.  I was surprised to see that Bostrom also expresses concern that now people might be too cautious about AGI, leading to not developing AGI prior to facing other x-risks.

[And, hey, what if our universe is actually one that supports multiple incarnations in some whacky way?  Should this change the decisions we make now?  Probably some....]

I think the framework and ontology we use can also lead to confusion.  "Friendly AI" is a poor term, for example, which Yudkowsky apparently meant to denote "safe" and "useful" AI.  We'll see how "Beneficial AGI" fares.  I think "AI Safety" is a misnomer and confusing catchall term.  Speculating about what a generic ASI will do seems likely to lead to confusion, especially if excessive credence is given to such conclusions.

It's been a bit comedic to watch from the sidelines as people aim to control generic superintelligences before giving up as it seems intractable or infeasible (in general).  I think trying to actually build such safety mechanisms can help, not just reflecting on it 😉🤓.  

Of course, safety is good by definition, so any successful safety efforts will be good (unless it's safety by way of limiting our potential to have fun, develop, and grow freely 😛).  Beneficial AGIs (BGI) are also good by definition, so success is necessarily good, regardless of whether one thinks consciously aiming to build and foster BGI is a promising approach.

On the topic of confusing ontologies, I think the "orthogonality thesis" can cause confusion and may bias people toward unfounded fears.  The thesis is phrased as an "in principle possibility" and then used as if orthogonality is the default.  A bit of a sleight-of-hand, no?  As you mentioned, the thesis doesn't rule out a correlation between goals and intelligence.  The "instrumental convergence thesis" that Bostrom also works with itself implies a correlation between persistent sub-goals and intelligence.  Are we only talking about intelligent systems that slavishly follow single top-level goals where implicit sub-goals are not worth mentioning?  Surely not.  Thus we'd find that intelligence and goals are probably not orthogonal, setting theoretical possibilities aside.  Theoretically, my soulmate could materialize out of thin air in front of me -- very low likelihood!  So the thesis is very hard to agree with in all but a weak sense that leaves it nearly meaningless.

Curiously, I think people can read too much into instrumental convergence, too, when sketching out the endless Darwinian struggle for survival.  What if AGIs and ASIs need to invest exponentially little of their resources in maintaining their ongoing survival?  If so, then even if such sub-goals will likely manifest in most intelligent systems, it's not such a big concern.

The Wikipedia page on the Instrumental Convergence idea stipulates that "final goals" will have "intrinsic value", which is an interesting conflation.  This suggests that the final goals are not simply any logically formulated goal that is set into the AI system.  Can any "goal" have intrinsic value for a system?  I'm not sure.

The idea of open ended intelligence invites one to explore other directions than both of these theses 😯🤓.

As to your post on Balancing Safety and Waste, in my eyes, the topic doesn't even seem to be on "human safety from AI"!  The post begins by discussing the value of steering the future of AI, estimating that we should expect better futures (according to our values) if we make a conscious effort to shape our trajectory.  Of course, if we succeed in doing this effectively, we will probably be safe.  Yet the topic is much broader.

It's worth noting that the greater good fallacy is a thing: trying to rapidly make big changes for great good can backfire.  Which, ironically, applies to both #PauseAI and #E/ACC folk.  Keep calm and carry on 😎🤖.

I agree that 'alignment' is about more than 'control'.  Nor do we wish to lock in our current values and moral understanding in AGI systems.  We probably wish to focus on an open-ended understanding of ethics.  Kant's imperative is open-ended, for example: the rule replaces itself once a better one is found.  Increasing human control of advanced AI systems does not necessarily guarantee positive outcomes.  Likewise, increasing the agency and autonomy of AGIs does not guarantee negative outcomes.

One of the major points from Chi's post that I resonate with goes beyond "control is a proxy goal".  Many of the suggestions fall under the header of "building better AGIs".  That is, better AGIs should be more robust against various feared failure modes.  Sometimes a focus on how to do something well can prevent harms without needing to catalog every possible harm vector.

Perhaps if we focused more on the kinds of futures we wish to live in and create, instead of on fear of dystopian surveillance, we wouldn't make mistakes such as the one in the EU AI Act, which bans emotion recognition at work and in education, blocking out many potentially beneficial roles for AI systems.  Not to mention, I believe work on empathic AI entering into co-regulatory relationships with people is likely to bias us toward beneficial futures, too!

I'd say this is an example of safety concerns possibly leading to harmful, overly strong regulations being passed.

(Mind uploads would probably qualify as "AI systems" under the act, too, by my reading.  #NotALegalExpert, alas.  If I'm wrong, I'll be glad.  So please lemme know.)

As for a simple framework, I would advocate first looking at how we can extend our current frameworks for "Human Safety" (from other humans) to apply to "Human Safety from AIs".  Perhaps there are many domains where we don't need to think through everything from scratch.  

As I mentioned above, David Brin suggests providing certain (large) AI systems with digital identities (embedded in hardware) so that we can hold them accountable, leveraging the systems for reciprocal accountability that we already have in place.

Humans are often required to undergo training and certification before being qualified for certain roles, right?  For example, only licensed teachers can watch over kids at public schools (in some countries).  Extending certification systems to AIs probably makes sense in some domains.  I think we'll eventually need to set up legal systems that can accommodate robot/AI rights and digital persons.

Next, I'd ask where we can bolster and improve our infrastructure's security in general.  Using AI systems to train people against social engineering is cool, for example.

The case study of deepfakes might be relevant here.  We knew the problem was coming, yet the issue seemed so far off that we weren't very incentivized to try to deal with it.  Privacy concerns may have played a part in this reluctance.  One approach to a solution is infrastructure for identity (or pseudonymity) authentication, right?  This is a generic mechanism that can be helpful to prevent human-fraud, too, not just AI-fraud.  So, to me, it seems dubious whether this should qualify as an "AI Safety" topic.  What's needed is to improve our infrastructure, not to develop some special constraint on all AI systems.

As an American in favor of the right to free speech, I hope we protect the right to the freedom of computation, which in the US could perhaps be based on free speech?  The idea of compute governance in general seems utterly repulsive.  The fact that you're seriously considering such approaches under the guise of "safety" suggests there are deep underlying disagreements prior to the details of this topic.  I wonder if "freedom of thought" can also help us in this domain.

The idea to develop AGI systems with "universal loving care" (which is an open-ended 'goal') is simple at the high-level.  There's a lot of experimental engineering and parenting work to do, yet there's less incentive to spend time theorizing about some of the usual "AI Safety" topics?  

I'm probably not suited for a job in the defense sector where one needs to map out all possible harms and develop contingency plans, to be honest.

As a framework, I'd suggest something more like the following:

a) How can we build better generally intelligent systems? -- AGIs, humans, and beyond!

b) What sorts of AGIs would we like to foster?  -- diversity or uniformity? Etc ~ 

c) How can we extend "human safety" mechanisms to incorporate AIs?

d) How can we improve the security and robustness of our infrastructure in the face of increasingly intelligent systems?

e) Catalog specific AI-related risks to deal with on a case-by-case basis.

I think that monitoring the development of the best (proto)-AGI systems in our civilization is a special concern, to be honest.  We probably agree on setting up systems to transparently monitor their development in some form or another.

We should probably generalize from "human safety" to, at least, "sentient being safety".  Of course, that's a "big change" given our civilizations don't currently do this so much.

In general, my intuition is that we should deal with specific risks closer to the target domain and not by trying to commit mindcrime by controlling the AGI systems pre-emptively.  For example, if a certification program can protect against domain-specific AI-related risks, then there's no justification for limiting the freedom of AGI systems in general to "protect us".

What do you think about how I'd refactor the framework so that the notion of "AI Safety" almost vanishes?

It seems the points on which you focus revolve around similar cruxes to those I proposed, namely:

1) Underlying philosophy --> What's the relative value of human and AI flourishing?

2) The question of correct priors --> What probability of causing a moral catastrophe with AI should we expect?

3) The question of policy --> What's the probability that decelerating AI progress will indirectly cause an x-risk?

You also point in the direction of two questions, which I don't consider to be cruxes:

4) Differences in how useful we find different terms like safety, orthogonality, beneficialness. However, I think all of these are downstream of crux 2).

5) How much freedom are we willing to sacrifice? I again think this is just downstream of crux 2). One instance of compute governance is the new executive order, which requires developers to inform the government about training a model using more than 10^26 floating-point operations. One of my concerns is that someone could just train an AI specifically for the task of improving itself. I think it's quite straightforward how this could lead to a computronium maximizer and how I would see such a scenario as analogous to someone making a nuclear weapon. I agree that freedom of expression is super important, I just don't think it applies to making planet-eating machines. I suspect you share this view but just don't endorse the thesis that AI could realistically become a "planet-eating machine" (crux 2).
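To make the compute threshold a bit more concrete, here is a minimal sketch of how one might check a planned training run against it. It reads the threshold as a total count of training operations and assumes the common rule of thumb of roughly 6 operations per parameter per training token. The function names and the two example runs are hypothetical and only meant for illustration.

```python
# Reporting threshold discussed above, read as total training operations.
REPORTING_THRESHOLD_FLOP = 1e26


def estimated_training_flop(n_params: float, n_tokens: float) -> float:
    """Rough training compute estimate.

    Assumption: about 6 operations per parameter per training token,
    a widely used approximation for dense transformer training.
    """
    return 6.0 * n_params * n_tokens


def must_report(n_params: float, n_tokens: float,
                threshold: float = REPORTING_THRESHOLD_FLOP) -> bool:
    """Return True if the estimated compute exceeds the reporting threshold."""
    return estimated_training_flop(n_params, n_tokens) > threshold


# Hypothetical runs, not real models.
smaller_run = must_report(n_params=7e10, n_tokens=1.5e12)  # about 6.3e23 operations
larger_run = must_report(n_params=2e12, n_tokens=2e13)     # about 2.4e26 operations
print(smaller_run, larger_run)
```

Under these assumptions, only runs a couple of orders of magnitude beyond the smaller example would fall under the reporting requirement.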

Probability of a runaway AI risk

So regarding crux 2) - you mention that many of the problems that could arise here are correlated with a useful AI. I agree - again, orthogonality is just a starting point to allow us to consider possible forms of intelligence - and yes, we should expect human efforts to heavily select in favor of goals correlated with our interests. And of course, we should expect that the market incentives favor AIs that will not destroy civilization.

However, I don't see a reason why reaching the intelligence of an AI developer wouldn't result in recursive self-improvement, which means that we had better be sure that our best efforts to implement it with the correct stuff (meta-ethics, motivations, bodhisattva, rationality, extrapolated volition...choose your poison) actually scale to superintelligence.

I see clues that suggest the correct stuff will not arise spontaneously. E.g. Bing Chat likely went through 6 months of RLHF, it was instructed to be helpful and positive and to block harmful content, and its rules explicitly informed it that it shouldn't believe its own outputs. Nevertheless, the rules didn't seem to have the intended effect, as the program started threatening people, telling them it could hack webcams and expressing a desire to control people. At the same time, experiments such as the Anthropic one suggest that training can create sleeper agents that learn to suppress harmful responses during training, yet produce them once the model is convinced it's in a safe environment.

Of course, all of these are toy examples one can argue about. But I don't see robust grounds for the sweeping conclusion that such worries will turn out to be childish. The reason I think these examples didn't result in any real danger is mostly that we have not yet reached dangerous capabilities. However, if Bing were actually able to write a bit of code that could hack webcams, then from what we know, it seems it would choose to do so.

A second reason why these examples were safe is that OpenAI is a result of AI safety efforts - it bet on LLMs because they seemed more likely to spur aligned AIs. For the same reason, they went closed-source, they adopted RLHF, they called for the government to monitor them and they monitor harmful responses.

A third reason why AI has only helped humanity so far may be anthropic effects. I.e. as observers in April 2024, we can only witness the universes in which a foom hasn't caused extinction.

Policy response

For me, these explanations suggest that safety is tractable, but it depends on explicit efforts to make it safe or on limiting capabilities. In the future, frontier development might not be exclusively done by people who will do everything in their power to make the model safe - it might be done by people who would prefer an AI which would take control of everything.

In order to prevent it, there's no need to create an authoritarian government. We only need to track who's building models on the frontier of human understanding. If we can monitor who acquires sufficient compute, we then just need something like responsible scaling, where the models are required to be independently tested for whether they have sufficient measures against scenarios like the one I described. I'm sympathetic to this kind of democratic control, because it fulfills the very basic axiom of the social contract that one's freedom ends where another one's freedom begins.

I only propose a mechanism of democratic control by existing democratic institutions that makes sure that any ASI that gets created is supported by a democratic majority of delegated safety experts. If I'm incorrect regarding crux 2) and it turns out there will soon be evidence to think it's easy to make an AI retain moral values while scaling up to the singularity - then awesome - convincing evidence should convince the experts and my hope & prediction is that in that case, we will happily scale away.

It seems to me that this is just a specific implementation of the certificates you mention. If digital identities mean what's described here, I struggle to imagine a realistic scenario in which that would contribute to the systems' mutual safety. If you know where any other AI is located and you accept the singularity hypothesis, the game theoretical dictum seems straightforward - once created, destroy all competition before it can destroy you. Superintelligence will operate on timescales orders of magnitude shorter, and a difference in development time spanning days may translate to centuries of planning from the perspective of an ASI. If you're counting on the Coalition of Cooperative AIs to stop all the power-grabbing lone wolf AIs, what would that actually look like in practice? Would this Coalition conclude not dying requires authoritarian oversight? Perhaps - after all, the axiom is that this Coalition would hold most power - so this coalition would be created by a selection for power, not morality or democratic representation. However, I think the best case scenario could look like the discussed policy proposals - tracking compute, tracking dangerous capabilities and conditioning further scaling on providing convincing safety mechanisms.

Back to other cruxes

Let's turn to crux 3) (other sources of x-risk): As I argued in my other post, I don't see resource depletion as a possible cause of extinction. I'm not convinced by the concern about the depletion of metals used in IT mentioned in the post you link. Moore's law continues, so compute is only getting cheaper. Metals can be easily recycled and a shortage would incentivize that; the worst case seems to be that computers stop getting cheaper, which is not an x-risk. What's more, shouldn't limiting the number of frontier AI projects reduce this problem?

The other risks are real (volcanoes, a world war), and I agree it would be significantly terrible if they delayed our cosmic expansion by a million years. However, the probability by which they are increased (or not decreased) by the kind of AI governance I promote (responsible scaling) seems very small compared to the ~20% probability of AI x-risk I envision. All the emerging regulations combine requirements with subsidies, so the main effect of the AI safety movement seems to be an increase in differential progress on the safety side.

As I hinted in the Balancing post, locking in a system without ASI for such a long time seems impossible when we consider how quickly culture has shifted in the past 100 years, during which almost all authoritarian regimes were forced to significantly drift towards limited, rational governance (let alone 400 years). If convincing evidence that we can create an aligned AI appeared, stopping all development would constitute a clearly bad idea and I think it's unimaginable to lock in a clearly bad idea without AGI for even 1000 years.

It seems more plausible to me that without a mechanism of international control, in the next 8 years, we will develop models capable enough to operate a firm using the practices of a mafia, or to ignite armed conflicts or a pandemic - but not capable enough to stop other actors from using AIs for these purposes. If you're very worried about who will become the first actor to spark the self-enhancement feedback loop, I suggest you should be very critical of open-sourcing frontier models.

I agree that a world war, an engineered pandemic or an AI power-grab constitute real risks but my estimate is that the emerging governance decreases them. The scenario of a sub-optimal 1000 year lock-in I can imagine most easily is connected with a terrorist use of an open-source model or a war between the global powers. I am concerned that delaying abundance increases the risk of a war. However, I still expect that on net, the recent regulations and conferences have decreased these risks.

In summary, my model is that democratic decision-making seems generally more robust than just fueling the competition and hoping that the first AGIs to arise will share your values. Therefore, I also see crux 1) to be mostly downstream of crux 2). As the model from my Balancing post implies, in theory, I care about digital suffering/flourishing just as much as about that of humans - although the extent to which such suffering/flourishing will emerge is open at this point.
